Matches in SemOpenAlex for { <https://semopenalex.org/work/W2281943135> ?p ?o ?g. }
- W2281943135 abstract "This master's thesis reports on an applied machine learning research internship carried out at the digital library of the European Organization for Nuclear Research (CERN). Variation in how an author’s name is represented across scientific publications creates ambiguity when it comes to uniquely identifying an author. In the database of any scientific digital library, the same full-name variant can be used by more than one author, even by authors from the same research affiliation. In this work, we built a machine-learning-based author name disambiguation solution. The approach consists of learning a distance function from ground-truth data, blocking publications with broadly similar author names, and clustering the publications within each block using a semi-supervised strategy. The main contributions of this work are twofold. The first is an improved distance model that takes into account the (estimated) ethnicity of the author’s full name, since names from different ethnicities, for example Asian versus Arabic names, should be processed differently. This added feature led to better clustering evaluation results and also received a high contribution percentage in the feature importance analysis. The second main contribution is the choice of a thresholding strategy to form a flat clustering from the agglomerative hierarchical clustering. Six different strategies were evaluated to estimate the number of clusters in each block. The strategy with the best evaluation results uses a blocking function that groups signatures sharing a last name and first-name initial, then applies the semi-supervised clustering to the blocks that contain samples from the ground truth. Blocks that have no labeled sample each form a single cluster. A smaller contribution was also made to the distance model, including feature engineering and pair sampling.
Overall, the model accuracy is 98%, compared to 94% if we disambiguate only on the common normalized last name and first-name initial. My work helped raise the accuracy from 97% to slightly more than 98%, which is equivalent to reducing the error rate by about 35%. During the project, I also contributed to an open source project that will eventually be deployed in the high-energy physics digital library of CERN (http://inspirehep.net). Several factors led to such an accurate disambiguation model. A key factor was having ground-truth data, which allowed us to design a very good semi-supervised clustering. Another was learning an accurate distance model with appropriate feature engineering, in which we managed to incorporate external knowledge of name ethnicity." @default.
- W2281943135 created "2016-06-24" @default.
- W2281943135 creator A5033225245 @default.
- W2281943135 date "2015-07-21" @default.
- W2281943135 modified "2023-09-26" @default.
- W2281943135 title "Bibliographic Entity Automatic Recognition and Disambiguation" @default.
- W2281943135 cites W106880317 @default.
- W2281943135 cites W107286722 @default.
- W2281943135 cites W1496319844 @default.
- W2281943135 cites W1536860849 @default.
- W2281943135 cites W1560593431 @default.
- W2281943135 cites W1678356000 @default.
- W2281943135 cites W1967178753 @default.
- W2281943135 cites W1973676655 @default.
- W2281943135 cites W1985697096 @default.
- W2281943135 cites W1987971958 @default.
- W2281943135 cites W1999254925 @default.
- W2281943135 cites W2007995029 @default.
- W2281943135 cites W2016381774 @default.
- W2281943135 cites W2043793719 @default.
- W2281943135 cites W2066965880 @default.
- W2281943135 cites W2070358287 @default.
- W2281943135 cites W2071949631 @default.
- W2281943135 cites W2072240081 @default.
- W2281943135 cites W2079205903 @default.
- W2281943135 cites W2090987348 @default.
- W2281943135 cites W2123402141 @default.
- W2281943135 cites W2129558264 @default.
- W2281943135 cites W2131193521 @default.
- W2281943135 cites W2148019918 @default.
- W2281943135 cites W2261544779 @default.
- W2281943135 cites W2613826676 @default.
- W2281943135 cites W2732236724 @default.
- W2281943135 cites W2962698288 @default.
- W2281943135 cites W3098845338 @default.
- W2281943135 cites W8870360 @default.
- W2281943135 cites W2995035536 @default.
- W2281943135 cites W3005347330 @default.
- W2281943135 hasPublicationYear "2015" @default.
- W2281943135 type Work @default.
- W2281943135 sameAs 2281943135 @default.
- W2281943135 citedByCount "0" @default.
- W2281943135 crossrefType "journal-article" @default.
- W2281943135 hasAuthorship W2281943135A5033225245 @default.
- W2281943135 hasConcept C119857082 @default.
- W2281943135 hasConcept C138885662 @default.
- W2281943135 hasConcept C14036430 @default.
- W2281943135 hasConcept C154945302 @default.
- W2281943135 hasConcept C17744445 @default.
- W2281943135 hasConcept C199539241 @default.
- W2281943135 hasConcept C204321447 @default.
- W2281943135 hasConcept C23123220 @default.
- W2281943135 hasConcept C2776359362 @default.
- W2281943135 hasConcept C2776401178 @default.
- W2281943135 hasConcept C41008148 @default.
- W2281943135 hasConcept C41895202 @default.
- W2281943135 hasConcept C73555534 @default.
- W2281943135 hasConcept C78458016 @default.
- W2281943135 hasConcept C86803240 @default.
- W2281943135 hasConcept C92835128 @default.
- W2281943135 hasConcept C94625758 @default.
- W2281943135 hasConceptScore W2281943135C119857082 @default.
- W2281943135 hasConceptScore W2281943135C138885662 @default.
- W2281943135 hasConceptScore W2281943135C14036430 @default.
- W2281943135 hasConceptScore W2281943135C154945302 @default.
- W2281943135 hasConceptScore W2281943135C17744445 @default.
- W2281943135 hasConceptScore W2281943135C199539241 @default.
- W2281943135 hasConceptScore W2281943135C204321447 @default.
- W2281943135 hasConceptScore W2281943135C23123220 @default.
- W2281943135 hasConceptScore W2281943135C2776359362 @default.
- W2281943135 hasConceptScore W2281943135C2776401178 @default.
- W2281943135 hasConceptScore W2281943135C41008148 @default.
- W2281943135 hasConceptScore W2281943135C41895202 @default.
- W2281943135 hasConceptScore W2281943135C73555534 @default.
- W2281943135 hasConceptScore W2281943135C78458016 @default.
- W2281943135 hasConceptScore W2281943135C86803240 @default.
- W2281943135 hasConceptScore W2281943135C92835128 @default.
- W2281943135 hasConceptScore W2281943135C94625758 @default.
- W2281943135 hasLocation W22819431351 @default.
- W2281943135 hasOpenAccess W2281943135 @default.
- W2281943135 hasPrimaryLocation W22819431351 @default.
- W2281943135 hasRelatedWork W104268446 @default.
- W2281943135 hasRelatedWork W131100047 @default.
- W2281943135 hasRelatedWork W2118424569 @default.
- W2281943135 isParatext "false" @default.
- W2281943135 isRetracted "false" @default.
- W2281943135 magId "2281943135" @default.
- W2281943135 workType "article" @default.
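
The abstract above describes a blocking function that groups signatures sharing a normalized last name and first-name initial. The following is a minimal sketch of that idea only, not the thesis's actual implementation: the `"Last, First"` input format, the normalization rules, and the names `block_key` and `build_blocks` are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def _normalize(text):
    """Lowercase, strip accents, and keep only letters and spaces (assumed rules)."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c for c in no_accents.lower() if c.isalpha() or c == " ")
    return kept.strip()


def block_key(full_name):
    """Hypothetical blocking key: normalized last name plus first-name initial.

    Assumes names are written as "Last, First"; a name without a comma is
    treated as a last name only.
    """
    last, _, first = full_name.partition(",")
    initial = _normalize(first)[:1]
    return f"{_normalize(last)} {initial}".strip()


def build_blocks(signatures):
    """Group (signature_id, full_name) pairs by their blocking key."""
    blocks = defaultdict(list)
    for signature_id, full_name in signatures:
        blocks[block_key(full_name)].append(signature_id)
    return dict(blocks)


signatures = [
    (1, "Müller, Anna"),
    (2, "Muller, A."),
    (3, "Muller, Bernd"),
    (4, "Doe, J"),
]
blocks = build_blocks(signatures)
```

In this sketch, signatures 1 and 2 fall into the same block while signature 3 does not. Under the strategy described in the abstract, clustering would then run separately inside each block, and a block with no labeled sample would be kept as a single cluster.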