Matches in SemOpenAlex for { <https://semopenalex.org/work/W2912708577> ?p ?o ?g. }
Showing items 1 to 70 of 70 with 100 items per page.
- W2912708577 abstract "From a digital historian’s point of view, Ancien Régime French texts suffer from obsolete grammar, unreliable spelling, and poor optical character recognition, which makes these texts ill-suited to digital analysis. This paper summarizes methodological experiments that have allowed the author to extract useful quantitative data from such unlikely source material. A discussion of the general characteristics of hand-keyed and OCR’ed historical corpora shows that they differ in scale of difficulty rather than in nature. Behavioural traits that make text mining certain eighteenth-century corpora particularly challenging, such as error clustering, a high cost of acquisition relative to salience, outlier hiding, and unpredictable patterns of error repetition, are then explained. The paper then outlines a method that circumvents these challenges. This method relies on heuristic formulation of research questions during an initial phase of open-ended data exploration; selective correction of spelling and OCR errors, through application of Levenshtein’s algorithm, that focuses on a small set of keywords derived from the heuristic project design; and careful exploitation of the keywords and the corrected corpus, either as raw data for algorithms, as entry points from which to construct valuable data manually, or as focal points directing the scholar’s attention to a small subset of texts to read. Each step of the method is illustrated by examples drawn from the author’s research on the hand-keyed Encyclopédie and Bibliothèque Bleue and on collections of periodicals obtained through optical character recognition. Keywords: text mining; data mining; textometry; production of space; digital history; error correction" @default.
- W2912708577 created "2019-02-21" @default.
- W2912708577 creator A5030996558 @default.
- W2912708577 date "2019-01-01" @default.
- W2912708577 modified "2023-10-03" @default.
- W2912708577 title "How to Extract Good Knowledge from Bad Data: An Experiment with Eighteenth Century French Texts" @default.
- W2912708577 cites W2069172670 @default.
- W2912708577 doi "https://doi.org/10.16995/dscn.299" @default.
- W2912708577 hasPublicationYear "2019" @default.
- W2912708577 type Work @default.
- W2912708577 sameAs 2912708577 @default.
- W2912708577 citedByCount "0" @default.
- W2912708577 crossrefType "journal-article" @default.
- W2912708577 hasAuthorship W2912708577A5030996558 @default.
- W2912708577 hasBestOaLocation W29127085771 @default.
- W2912708577 hasConcept C108154423 @default.
- W2912708577 hasConcept C115961682 @default.
- W2912708577 hasConcept C138885662 @default.
- W2912708577 hasConcept C154945302 @default.
- W2912708577 hasConcept C173801870 @default.
- W2912708577 hasConcept C177264268 @default.
- W2912708577 hasConcept C199360897 @default.
- W2912708577 hasConcept C204321447 @default.
- W2912708577 hasConcept C2524010 @default.
- W2912708577 hasConcept C26022165 @default.
- W2912708577 hasConcept C2777801307 @default.
- W2912708577 hasConcept C2778121359 @default.
- W2912708577 hasConcept C2780801425 @default.
- W2912708577 hasConcept C2780861071 @default.
- W2912708577 hasConcept C28719098 @default.
- W2912708577 hasConcept C33923547 @default.
- W2912708577 hasConcept C41008148 @default.
- W2912708577 hasConcept C41895202 @default.
- W2912708577 hasConcept C546480517 @default.
- W2912708577 hasConceptScore W2912708577C108154423 @default.
- W2912708577 hasConceptScore W2912708577C115961682 @default.
- W2912708577 hasConceptScore W2912708577C138885662 @default.
- W2912708577 hasConceptScore W2912708577C154945302 @default.
- W2912708577 hasConceptScore W2912708577C173801870 @default.
- W2912708577 hasConceptScore W2912708577C177264268 @default.
- W2912708577 hasConceptScore W2912708577C199360897 @default.
- W2912708577 hasConceptScore W2912708577C204321447 @default.
- W2912708577 hasConceptScore W2912708577C2524010 @default.
- W2912708577 hasConceptScore W2912708577C26022165 @default.
- W2912708577 hasConceptScore W2912708577C2777801307 @default.
- W2912708577 hasConceptScore W2912708577C2778121359 @default.
- W2912708577 hasConceptScore W2912708577C2780801425 @default.
- W2912708577 hasConceptScore W2912708577C2780861071 @default.
- W2912708577 hasConceptScore W2912708577C28719098 @default.
- W2912708577 hasConceptScore W2912708577C33923547 @default.
- W2912708577 hasConceptScore W2912708577C41008148 @default.
- W2912708577 hasConceptScore W2912708577C41895202 @default.
- W2912708577 hasConceptScore W2912708577C546480517 @default.
- W2912708577 hasLocation W29127085771 @default.
- W2912708577 hasOpenAccess W2912708577 @default.
- W2912708577 hasPrimaryLocation W29127085771 @default.
- W2912708577 hasRelatedWork W1488159990 @default.
- W2912708577 hasRelatedWork W1493753405 @default.
- W2912708577 hasRelatedWork W1540041489 @default.
- W2912708577 hasRelatedWork W2061574186 @default.
- W2912708577 hasRelatedWork W2128719260 @default.
- W2912708577 hasRelatedWork W2152428909 @default.
- W2912708577 hasRelatedWork W2208237615 @default.
- W2912708577 hasRelatedWork W2912708577 @default.
- W2912708577 hasRelatedWork W770098 @default.
- W2912708577 hasRelatedWork W9662544 @default.
- W2912708577 isParatext "false" @default.
- W2912708577 isRetracted "false" @default.
- W2912708577 magId "2912708577" @default.
- W2912708577 workType "article" @default.
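
The abstract of W2912708577 describes selective correction of spelling and OCR errors by applying Levenshtein's algorithm to a small set of keywords. The record does not include the author's implementation, so the following is only a minimal sketch of that general idea. The function names, the `max_distance` threshold, and the sample keywords and tokens are all hypothetical, chosen purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct_keywords(tokens, keywords, max_distance=1):
    """Replace a token with its nearest keyword when it is close enough.

    Assumption: only tokens within max_distance edits of a keyword are
    touched; everything else is left exactly as it was in the corpus.
    Matched tokens are rewritten to the keyword's own (lowercase) form.
    """
    corrected = []
    for token in tokens:
        lowered = token.lower()
        best = min(keywords, key=lambda k: levenshtein(lowered, k))
        if levenshtein(lowered, best) <= max_distance:
            corrected.append(best)
        else:
            corrected.append(token)
    return corrected


# Hypothetical keywords and a hypothetical OCR'ed token sequence.
keywords = ["russie", "paris"]
tokens = ["la", "Rusfie", "et", "Paris"]
result = correct_keywords(tokens, keywords)
```

Restricting correction to a short keyword list keeps the number of distance computations small and avoids rewriting words that are merely unusual rather than wrong. The threshold would need tuning per corpus.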