Matches in SemOpenAlex for { <https://semopenalex.org/work/W2126607189> ?p ?o ?g. }
- W2126607189 abstract "This thesis presents a framework for the discovery, extraction and relevance-oriented ordering of conceptual knowledge based on its potential for reuse within a software project. The goal is to support software engineering experts in the first knowledge acquisition phase of a development project by extracting relevant concepts from the textual documents of the client’s organization. Such a time-consuming task is usually done manually, which makes it prone to fatigue, errors, and omissions. The business documents are considered unstructured and are less formal and straightforward than software requirements specifications created by an expert. In addition, our research is done on documents written in French, for which text analysis tools are less accessible or advanced than those for English. As a result, the presented system integrates accessible tools in a processing pipeline with the goal of increasing the quality of the extracted list of concepts. Our first contribution is the definition of a high-level process used to extract domain concepts, which can help the rapid discovery of knowledge by software experts. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters, which are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We introduce a new metric to assess the discovery speed of relevant concepts. We also present a method to help obtain a gold standard definition of software-engineering-oriented concepts for knowledge extraction tasks. Our second contribution is a statistical method to extract large and complex multiword expressions which are found in business documents. 
These concepts, which can sometimes be exemplified as named entities or standard expressions, are essential to the full comprehension of business corpora but are seldom extracted by existing methods because of their form, the sparseness of their occurrences and the fact that they are usually excluded by the candidate generation step. Current extraction methods usually do not target these types of expressions and perform poorly in their length range. This article describes a hybrid method based on the local maxima technique with added linguistic knowledge to help the frequency count and the filtering. It uses loose candidate generation rules aimed at long and complex expressions, which are then filtered using n-gram semilattices constructed from the root lemmas of multiword expressions. Relevant expressions are chosen using a statistical approach based on the global growth factor of n-gram frequency. A modified statistical approach was used as a baseline and applied to two annotated corpora to compare the performance of the proposed method. The results indicated an increase of the average F1 performance by 23.4% on the larger corpus and by 22.2% on the smaller one when compared to the baseline approach. Our final contribution helped to further develop the acronym extraction module, which provides an additional layer of filtering for the concept extraction. This work targets the extraction of implicit acronyms in business documents, a task that has been neglected in the literature in favor of acronym extraction for biomedical documents. Although there are overlapping challenges, the semi-structured and non-predictive nature of business documents hinders the effectiveness of the extraction methods used on biomedical documents, which fail to deliver the expected performance. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. 
Among the 7 features extracted from each candidate instance, we introduce “similarity” features, which compare a candidate’s characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating a candidate (matching first letters, ordered instances, etc.) are scored and aggregated into a single composite feature, which permits flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision level of 89.1% for a search space size of 3 sentences." @default.
- W2126607189 created "2016-06-24" @default.
- W2126607189 creator A5090884768 @default.
- W2126607189 date "2014-10-23" @default.
- W2126607189 modified "2023-09-27" @default.
- W2126607189 title "Concept exploration and discovery from business documents for software engineering projets using dual mode filtering" @default.
- W2126607189 cites W105439543 @default.
- W2126607189 cites W109926203 @default.
- W2126607189 cites W116705248 @default.
- W2126607189 cites W138738183 @default.
- W2126607189 cites W1484050045 @default.
- W2126607189 cites W1485658078 @default.
- W2126607189 cites W1490626979 @default.
- W2126607189 cites W1498763386 @default.
- W2126607189 cites W150095201 @default.
- W2126607189 cites W1503940717 @default.
- W2126607189 cites W1505448766 @default.
- W2126607189 cites W1511312643 @default.
- W2126607189 cites W1516346799 @default.
- W2126607189 cites W1525742988 @default.
- W2126607189 cites W1531499883 @default.
- W2126607189 cites W153318553 @default.
- W2126607189 cites W1538315778 @default.
- W2126607189 cites W1547957094 @default.
- W2126607189 cites W1549611544 @default.
- W2126607189 cites W1550436572 @default.
- W2126607189 cites W1564649749 @default.
- W2126607189 cites W1568297139 @default.
- W2126607189 cites W1570448133 @default.
- W2126607189 cites W1572615921 @default.
- W2126607189 cites W1572964175 @default.
- W2126607189 cites W1574901103 @default.
- W2126607189 cites W1576212969 @default.
- W2126607189 cites W1592324698 @default.
- W2126607189 cites W1593045043 @default.
- W2126607189 cites W1593249691 @default.
- W2126607189 cites W1595247926 @default.
- W2126607189 cites W162154816 @default.
- W2126607189 cites W1670263352 @default.
- W2126607189 cites W1694571275 @default.
- W2126607189 cites W1706796029 @default.
- W2126607189 cites W1746620543 @default.
- W2126607189 cites W178298440 @default.
- W2126607189 cites W1805473862 @default.
- W2126607189 cites W1816480286 @default.
- W2126607189 cites W1865928303 @default.
- W2126607189 cites W1881647329 @default.
- W2126607189 cites W1882802185 @default.
- W2126607189 cites W1901580349 @default.
- W2126607189 cites W1925498800 @default.
- W2126607189 cites W1933390880 @default.
- W2126607189 cites W1936022305 @default.
- W2126607189 cites W1965401060 @default.
- W2126607189 cites W1975928975 @default.
- W2126607189 cites W1977766834 @default.
- W2126607189 cites W1985108724 @default.
- W2126607189 cites W1987566315 @default.
- W2126607189 cites W1989534258 @default.
- W2126607189 cites W1990075054 @default.
- W2126607189 cites W1991154713 @default.
- W2126607189 cites W1992249082 @default.
- W2126607189 cites W1993243953 @default.
- W2126607189 cites W1997721713 @default.
- W2126607189 cites W1999700406 @default.
- W2126607189 cites W2002121299 @default.
- W2126607189 cites W2003281491 @default.
- W2126607189 cites W2003601922 @default.
- W2126607189 cites W2005051014 @default.
- W2126607189 cites W2017541337 @default.
- W2126607189 cites W2019862689 @default.
- W2126607189 cites W2024046085 @default.
- W2126607189 cites W2028705450 @default.
- W2126607189 cites W2031627786 @default.
- W2126607189 cites W2032219148 @default.
- W2126607189 cites W2033511853 @default.
- W2126607189 cites W2037768235 @default.
- W2126607189 cites W2042085493 @default.
- W2126607189 cites W2042879619 @default.
- W2126607189 cites W2044070623 @default.
- W2126607189 cites W2046025209 @default.
- W2126607189 cites W2049107599 @default.
- W2126607189 cites W2051885446 @default.
- W2126607189 cites W2057138859 @default.
- W2126607189 cites W2061071413 @default.
- W2126607189 cites W2061167720 @default.
- W2126607189 cites W2067743731 @default.
- W2126607189 cites W2068737686 @default.
- W2126607189 cites W2073438787 @default.
- W2126607189 cites W2074228526 @default.
- W2126607189 cites W2077656994 @default.
- W2126607189 cites W2078523032 @default.
- W2126607189 cites W2080029368 @default.
- W2126607189 cites W2080579867 @default.
- W2126607189 cites W2090601551 @default.
- W2126607189 cites W2094160481 @default.
- W2126607189 cites W2099515101 @default.
- W2126607189 cites W2101196694 @default.
- W2126607189 cites W2101350932 @default.
- W2126607189 cites W2102065370 @default.
- W2126607189 cites W2102740905 @default.