Towards the Automatic Extraction of Plant Traits from Textual Descriptions

Dagtekin, Onatkut; Ulate, William; Batista-Navarro, Riza

doi:10.22032/dbt.37984

Talk 2018 CC BY 4.0

Published

Many ecological restoration programmes are informed by evidence from empirical research. Specifically, such programmes analyse species traits in order to distinguish species that are suitable for restoration from those that are not. Indeed, understanding plant traits (and their relationships with each other) informs research into vegetation modelling and environmental change prediction, which in turn helps to answer many ecological questions. In 2006, the Center for Tropical Forest Science (CTFS) formulated recommendations in support of its research programme, the foremost of which is the creation of trait databases that build upon published information catalogued by existing herbaria. In this work, we aim to enrich World Flora Online (WFO), a web-based inventory of known plant species, by integrating trait information contained in data sets from botanical institutions all over the world. This poses a few challenges, as trait information tends to be buried within verbose textual descriptions that do not conform to the usual conventions of writing. Specifically, these descriptions typically do not come in the form of full sentences and instead read as long-winded enumerations of various types of plant attributes or characteristics. Such descriptions are difficult to search and understand unless they are decomposed into meaningful units. In order to decompose textual descriptions of plant species into spans pertaining to specific types of attributes, we have developed a machine learning-based approach to automatic text segmentation. Casting the problem as a sequence labelling task, we have investigated a number of probabilistic classifiers, including conditional random fields (CRFs), hidden Markov models (HMMs) and naïve Bayes (NB). To train our models, we used data contributed by the South African National Biodiversity Institute (SANBI), in which traits are labelled with one of the following categories: morphology, habitat and distribution. 
To help the models discriminate between these categories, we designed features capturing word characteristics (e.g., n-grams at the character and word level), context (i.e., surrounding words within a predefined window) and domain knowledge (i.e., words that match terms in plant-related ontologies). In this way, we can automatically identify exactly which parts of the original descriptions pertain to plant traits such as morphology, habitat or distribution. By applying the resulting models to textual descriptions from several botanical institutes, we can facilitate the automatic population of WFO with plant traits for a number of species.
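
The three feature families described above, and the step from per-token labels back to text segments, might be sketched roughly as follows. This is an illustrative sketch only and not the authors' implementation: the lexicon `ONTOLOGY_TERMS`, the function names, the trigram size and the window of two tokens are all assumptions made for the example, and the labels in the usage note are supplied by hand rather than predicted by a trained model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mini-lexicon standing in for terms from a plant-related ontology.
ONTOLOGY_TERMS: Set[str] = {"leaf", "leaves", "petiole", "corolla", "calyx", "stamen"}


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of a word (the whole word if it is too short)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i].lower()
    feats: Dict[str, object] = {
        "word": word,                           # word-level unigram
        "in_ontology": word in ONTOLOGY_TERMS,  # domain-knowledge feature
    }
    for gram in char_ngrams(word):              # character-level n-grams
        feats["char3=" + gram] = True
    for offset in range(-window, window + 1):   # surrounding words in a fixed window
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"ctx[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


def sequence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token of one description."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def labels_to_segments(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of identically labelled tokens into (label, text) segments."""
    runs: List[Tuple[str, List[str]]] = []
    for token, label in zip(tokens, labels):
        if runs and runs[-1][0] == label:
            runs[-1][1].append(token)
        else:
            runs.append((label, [token]))
    return [(label, " ".join(words)) for label, words in runs]
```

For instance, given the tokens `["Leaves", "ovate", ";", "rocky", "slopes"]` and the hand-assigned labels `morphology, morphology, morphology, habitat, habitat`, `labels_to_segments` groups the text into one morphology span and one habitat span. Lists of per-token feature dictionaries of this shape are a common input format for sequence labellers such as CRFs; which toolkit was actually used is not stated in the abstract.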

Classification

Conference: ICEI 2018 : 10th International Conference on Ecological Informatics- Translating Ecological Data into Knowledge and Decisions in a Rapidly Changing World. 24-28 September, 2018. Jena, Germany
Published in: ICEI 2018 : 10th International Conference on Ecological Informatics- Translating Ecological Data into Knowledge and Decisions… (2018)
Date of publication: 2018
DOI: 10.22032/dbt.37984
Language: English
Resource type: Text
Extent: 21 pages
Place of publication: Jena
Keywords: Text Segmentation, Plant Traits, Machine Learning, World Flora Online
DDC subject group of the DNB: 004 Computer science
DDC subject group of the DNB: 570 Life sciences, biology
DDC subject group of the DNB: 580 Plants (botany)
DDC subject group of the DNB: 590 Animals (zoology)
DDC subject group of the DNB: 600 Technology
DDC subject group of the DNB: 630 Agriculture, veterinary medicine
Institution: Friedrich-Schiller-Universität Jena
