Matches in SemOpenAlex for { <https://semopenalex.org/work/W4386932457> ?p ?o ?g. }
Showing items 1 to 78 of 78 with 100 items per page.
- W4386932457 abstract "Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for research and sustainable management. The number of publications generated is quite large: the corpus of biodiversity literature includes tens of millions of figures and taxonomic treatments. Unfortunately, most of the taxonomic descriptions are from scientific publications in text format. With more than 61 million digitized pages in the Biodiversity Heritage Library (BHL), only 467,265 taxonomic treatments are available in the Biodiversity Literature Repository. Obtaining highly structured text from digitized text has been shown to be complex and very expensive (Cui et al. 2021). The scientific community has described over 1.2 million species, but studies suggest that 86% of existing species on Earth and 91% of species in the ocean still await description (Mora et al. 2011). The published descriptions synthesize observations made by taxonomists over centuries of research and include detailed morphological aspects (i.e., shape and structure) of species that are useful for identifying specimens, improving information search mechanisms, analyzing data on species with particular characteristics, and comparing species descriptions. To take full advantage of this information and to work towards integrating it with repositories of biodiversity knowledge, the biodiversity informatics community first needs to convert plain text into a machine-processable format. More precisely, there is a need to identify structure and substructure names and the characters that describe them (Fig. 1). Open information extraction (OIE) is a research area of Natural Language Processing (NLP) that aims to automatically extract structured, machine-readable representations of data available in unstructured text; the result is usually handled as n-ary propositions, for instance, triples of the form <noun phrase, relation phrase, noun phrase> (Shen et al. 2022).
OIE is continuously evolving with advancements in NLP and machine learning techniques. The state of the art in OIE involves the use of neural approaches, pre-trained language models, and integration of dependency parsing and semantic role labeling. Neural solutions mainly formulate OIE as a sequence tagging problem or a sequence generation problem. Ongoing research focuses on improving extraction accuracy; handling complex linguistic phenomena, for instance, coreference resolution; and broadening open information extraction beyond English, because most existing neural solutions work on English texts (Zhou et al. 2022). The main objective of this project is to evaluate and compare the results of automatic data extraction from plant morphological descriptions using pre-trained language models (PLM) and a language model trained on data from plant morphological descriptions written in Spanish. The research data for this study were sourced from the species records database of the National Biodiversity Institute of Costa Rica (INBio). Specifically, the project focused on selecting records of morphological descriptions of plant species written in Spanish. The system processes the morphological descriptions using a workflow with the following phases: data selection and pre-processing, feature extraction, PLM testing, local language model training, and evaluation of results. Fig. 2 shows the general workflow used in this research. Pre-processing and Annotation: Descriptions were standardized by removing special characters like double and single quotes, replacing abbreviations, tokenizing text, and other transformations. Some records of the dataset were annotated with the ground-truth structured information in the form of triples that were extracted from each paragraph. Additionally, structured data from the project carried out by Mora and Araya (2018) were included in the dataset.
Feature extraction: Token vectorization was performed directly by the language models using word embeddings. Test PLM: The PLMs were evaluated with a zero-shot approach, which involved applying the models to the test dataset, extracting information, and comparing it to the annotated ground truth. Local Language Model Training: The annotated data were split into 80% training data and 20% test data. Using the training data, a language model based on the Transformers architecture was trained. Evaluate results: Evaluation metrics such as precision, recall, and F1 (a measure of the model's accuracy) were calculated by comparing the extracted information with the ground truth. The results were analyzed to understand the models' performance, identify strengths and weaknesses, and gain insights into their ability to extract accurate and relevant information. Based on the analysis, the evaluation process iteratively improved the models' results. The main contributions of this project are: A Transformers-based language model to extract information from morphological descriptions of plants written in Spanish, available on the project website.*1 A corpus of morphological descriptions of plants, written in Spanish, labeled for information extraction, and made available on the project website. The results of the project, the first of its kind applied to morphological descriptions of plants written in Spanish, published on the project website." @default.
- W4386932457 created "2023-09-22" @default.
- W4386932457 creator A5049056819 @default.
- W4386932457 creator A5064725734 @default.
- W4386932457 creator A5084852550 @default.
- W4386932457 creator A5088674170 @default.
- W4386932457 creator A5092916884 @default.
- W4386932457 creator A5092916885 @default.
- W4386932457 date "2023-09-21" @default.
- W4386932457 modified "2023-09-30" @default.
- W4386932457 title "Structuring Information from Plant Morphological Descriptions using Open Information Extraction" @default.
- W4386932457 cites W1996816368 @default.
- W4386932457 cites W2811447149 @default.
- W4386932457 cites W3201921748 @default.
- W4386932457 cites W4283367106 @default.
- W4386932457 cites W4285604902 @default.
- W4386932457 doi "https://doi.org/10.3897/biss.7.113055" @default.
- W4386932457 hasPublicationYear "2023" @default.
- W4386932457 type Work @default.
- W4386932457 citedByCount "0" @default.
- W4386932457 crossrefType "journal-article" @default.
- W4386932457 hasAuthorship W4386932457A5049056819 @default.
- W4386932457 hasAuthorship W4386932457A5064725734 @default.
- W4386932457 hasAuthorship W4386932457A5084852550 @default.
- W4386932457 hasAuthorship W4386932457A5088674170 @default.
- W4386932457 hasAuthorship W4386932457A5092916884 @default.
- W4386932457 hasAuthorship W4386932457A5092916885 @default.
- W4386932457 hasBestOaLocation W43869324571 @default.
- W4386932457 hasConcept C10138342 @default.
- W4386932457 hasConcept C120567893 @default.
- W4386932457 hasConcept C130217890 @default.
- W4386932457 hasConcept C136764020 @default.
- W4386932457 hasConcept C151730666 @default.
- W4386932457 hasConcept C154945302 @default.
- W4386932457 hasConcept C162324750 @default.
- W4386932457 hasConcept C18903297 @default.
- W4386932457 hasConcept C195807954 @default.
- W4386932457 hasConcept C205649164 @default.
- W4386932457 hasConcept C23123220 @default.
- W4386932457 hasConcept C2522767166 @default.
- W4386932457 hasConcept C2775945657 @default.
- W4386932457 hasConcept C2781083858 @default.
- W4386932457 hasConcept C41008148 @default.
- W4386932457 hasConcept C86803240 @default.
- W4386932457 hasConceptScore W4386932457C10138342 @default.
- W4386932457 hasConceptScore W4386932457C120567893 @default.
- W4386932457 hasConceptScore W4386932457C130217890 @default.
- W4386932457 hasConceptScore W4386932457C136764020 @default.
- W4386932457 hasConceptScore W4386932457C151730666 @default.
- W4386932457 hasConceptScore W4386932457C154945302 @default.
- W4386932457 hasConceptScore W4386932457C162324750 @default.
- W4386932457 hasConceptScore W4386932457C18903297 @default.
- W4386932457 hasConceptScore W4386932457C195807954 @default.
- W4386932457 hasConceptScore W4386932457C205649164 @default.
- W4386932457 hasConceptScore W4386932457C23123220 @default.
- W4386932457 hasConceptScore W4386932457C2522767166 @default.
- W4386932457 hasConceptScore W4386932457C2775945657 @default.
- W4386932457 hasConceptScore W4386932457C2781083858 @default.
- W4386932457 hasConceptScore W4386932457C41008148 @default.
- W4386932457 hasConceptScore W4386932457C86803240 @default.
- W4386932457 hasLocation W43869324571 @default.
- W4386932457 hasLocation W43869324572 @default.
- W4386932457 hasOpenAccess W4386932457 @default.
- W4386932457 hasPrimaryLocation W43869324571 @default.
- W4386932457 hasRelatedWork W104581431 @default.
- W4386932457 hasRelatedWork W1788528807 @default.
- W4386932457 hasRelatedWork W2074327869 @default.
- W4386932457 hasRelatedWork W2153799433 @default.
- W4386932457 hasRelatedWork W2352337653 @default.
- W4386932457 hasRelatedWork W2367301249 @default.
- W4386932457 hasRelatedWork W2379157006 @default.
- W4386932457 hasRelatedWork W2393978999 @default.
- W4386932457 hasRelatedWork W2725657302 @default.
- W4386932457 hasRelatedWork W2748952813 @default.
- W4386932457 hasVolume "7" @default.
- W4386932457 isParatext "false" @default.
- W4386932457 isRetracted "false" @default.
- W4386932457 workType "article" @default.
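
The abstract of this work says that precision, recall, and F1 were calculated by comparing extracted triples with annotated ground truth, but it does not say how a match between two triples is decided. The following is only a minimal sketch under the assumption of exact matching on whole triples; the function name `triple_prf` and the Spanish example triples are hypothetical and are not taken from the project.

```python
from typing import Iterable, Tuple

# A triple of the form <noun phrase, relation phrase, noun phrase>.
Triple = Tuple[str, str, str]


def triple_prf(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for extracted triples.

    Assumption: a predicted triple counts as correct only if it is
    identical to a ground-truth triple (exact match on all three parts).
    """
    pred_set = set(predicted)
    gold_set = set(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data, for illustration only.
gold = [
    ("hoja", "tiene forma", "elíptica"),
    ("flor", "tiene color", "blanca"),
    ("fruto", "tiene forma", "globoso"),
]
predicted = [
    ("hoja", "tiene forma", "elíptica"),
    ("flor", "tiene color", "roja"),
]

precision, recall, f1 = triple_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; a real evaluation might instead normalize text or allow partial overlap between phrases, which would raise the scores for near-miss extractions.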