In most cases, the specialist terms used in technical and scientific documents cannot be enumerated in advance. Scientific naming in many domains, such as chemistry, biology, and astronomy, follows nomenclatures and conventions that are highly generative. Similarly, citations of bibliographical items (articles, patents, etc.) vary widely in form. All these expressions are essential to scientific communication and, as a consequence, to scientific information access tools.
Machine learning techniques are particularly well suited to recognizing these kinds of terms, and all our dedicated modules use them.
Recognition and normalization of quantities
Expressions of quantities and measurements are fundamental in STEM and can be viewed as a particular nomenclature. Identifying these expressions and normalizing them into SI base units makes a vast range of applications possible, in particular searching for quantities in document collections, applying numerical data mining techniques, and automatically extracting knowledge about experimental conditions.
grobid-quantities recognizes expressions of measurements (e.g. pressure, temperature) in textual documents (text, PDF, XML), then parses and normalizes them, and finally converts them into SI units. For English, the tool supports more than 120 base units and expressions of atomic values, intervals, and lists of values. In addition, grobid-quantities tries to identify the “quantified” substance or object and attach it to the measurement.
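The normalization step can be illustrated with a minimal Python sketch. It is not the grobid-quantities implementation or API: the conversion table, the `Quantity` class, and the functions `to_si`, `interval_to_si`, and `list_to_si` are hypothetical names introduced here, and the table covers only a handful of units. It only shows the general idea of mapping atomic values, intervals, and lists of values onto SI base units.

```python
from dataclasses import dataclass

# Hypothetical conversion table (not the tool's actual unit lexicon):
# unit symbol -> (SI base unit, scale factor, offset),
# applied as value_si = value * factor + offset.
SI_CONVERSIONS = {
    "m": ("m", 1.0, 0.0),
    "km": ("m", 1000.0, 0.0),
    "cm": ("m", 0.01, 0.0),
    "g": ("kg", 0.001, 0.0),
    "min": ("s", 60.0, 0.0),
    "K": ("K", 1.0, 0.0),
    "°C": ("K", 1.0, 273.15),
}


@dataclass(frozen=True)
class Quantity:
    """A parsed measurement: a numeric value and a unit symbol."""
    value: float
    unit: str


def to_si(quantity: Quantity) -> Quantity:
    """Convert one atomic value to its SI base unit."""
    if quantity.unit not in SI_CONVERSIONS:
        raise ValueError(f"unsupported unit: {quantity.unit!r}")
    si_unit, factor, offset = SI_CONVERSIONS[quantity.unit]
    return Quantity(quantity.value * factor + offset, si_unit)


def interval_to_si(low: Quantity, high: Quantity) -> tuple[Quantity, Quantity]:
    """Convert both bounds of an interval; the bounds may use different units."""
    return to_si(low), to_si(high)


def list_to_si(values: list[Quantity]) -> list[Quantity]:
    """Convert every element of a list of values."""
    return [to_si(v) for v in values]
```

For example, `to_si(Quantity(2.5, "km"))` yields a quantity expressed in metres, and `interval_to_si(Quantity(10, "cm"), Quantity(2, "m"))` yields two bounds that are directly comparable because both are in metres.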
These normalized measurements can then be exploited by search or data mining tools. For instance, our front-end search tool dedicated to scholarly articles allows users to express search queries that include quantity criteria (e.g. an interval). These queries are then processed with the range query capabilities of ElasticSearch to retrieve and rank matching documents annotated beforehand by grobid-quantities.
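As an illustration of how an interval criterion might be translated into a range query, the following Python sketch builds a request body using the standard `range` query with `gte` and `lte` bounds. The helper `quantity_range_query` and the field name `measurements.temperature_kelvin` are assumptions made for this example; the actual index mapping used by the search tool is not described here.

```python
import json


def quantity_range_query(field: str, low: float, high: float) -> dict:
    """Build a search body matching documents whose indexed value
    lies in the closed interval [low, high].

    Both bounds are assumed to be already normalized to the same SI unit
    as the indexed values.
    """
    if low > high:
        raise ValueError("lower bound must not exceed upper bound")
    return {"query": {"range": {field: {"gte": low, "lte": high}}}}


if __name__ == "__main__":
    # Hypothetical field name; temperatures between 20 °C and 30 °C in kelvin.
    body = quantity_range_query("measurements.temperature_kelvin", 293.15, 303.15)
    print(json.dumps(body, indent=2))
```

Because the annotated values are stored in normalized form, a single numeric comparison covers measurements that were originally written in different units.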