Matches in SemOpenAlex for { <https://semopenalex.org/work/W942652018> ?p ?o ?g. }
- W942652018 abstract "Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focussed on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation for this work is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded original documents. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata.
We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we effectively: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and present analyses that empirically show how existing algorithms break down in the presence of OCR errors." @default.
- W942652018 created "2016-06-24" @default.
- W942652018 creator A5013219521 @default.
- W942652018 creator A5090791372 @default.
- W942652018 date "2012-01-01" @default.
- W942652018 modified "2023-09-27" @default.
- W942652018 title "Bayesian text analytics for document collections" @default.
- W942652018 cites W10112829 @default.
- W942652018 cites W140312209 @default.
- W942652018 cites W1493526108 @default.
- W942652018 cites W1523141340 @default.
- W942652018 cites W1550206324 @default.
- W942652018 cites W1592735339 @default.
- W942652018 cites W1599022437 @default.
- W942652018 cites W1601566332 @default.
- W942652018 cites W1651093245 @default.
- W942652018 cites W1663973292 @default.
- W942652018 cites W1745685099 @default.
- W942652018 cites W1812398186 @default.
- W942652018 cites W1834232446 @default.
- W942652018 cites W1880262756 @default.
- W942652018 cites W1947594277 @default.
- W942652018 cites W1972016971 @default.
- W942652018 cites W1977053705 @default.
- W942652018 cites W1981081578 @default.
- W942652018 cites W1985615910 @default.
- W942652018 cites W2001082470 @default.
- W942652018 cites W2007463795 @default.
- W942652018 cites W2020111636 @default.
- W942652018 cites W202303397 @default.
- W942652018 cites W2024315245 @default.
- W942652018 cites W2024348326 @default.
- W942652018 cites W2039013656 @default.
- W942652018 cites W2045656233 @default.
- W942652018 cites W2047865878 @default.
- W942652018 cites W2059216481 @default.
- W942652018 cites W2060397030 @default.
- W942652018 cites W2063397738 @default.
- W942652018 cites W2064513326 @default.
- W942652018 cites W2065392216 @default.
- W942652018 cites W2067818150 @default.
- W942652018 cites W2069429561 @default.
- W942652018 cites W2072644219 @default.
- W942652018 cites W2079501320 @default.
- W942652018 cites W2080972498 @default.
- W942652018 cites W2081366726 @default.
- W942652018 cites W2085763602 @default.
- W942652018 cites W2087101057 @default.
- W942652018 cites W2089484716 @default.
- W942652018 cites W2090784712 @default.
- W942652018 cites W2096152098 @default.
- W942652018 cites W2096878708 @default.
- W942652018 cites W2099064293 @default.
- W942652018 cites W2099873701 @default.
- W942652018 cites W2101286420 @default.
- W942652018 cites W2103587173 @default.
- W942652018 cites W2106490775 @default.
- W942652018 cites W2107469355 @default.
- W942652018 cites W2107743791 @default.
- W942652018 cites W2109450366 @default.
- W942652018 cites W2112971401 @default.
- W942652018 cites W2116237179 @default.
- W942652018 cites W2118034653 @default.
- W942652018 cites W2121853761 @default.
- W942652018 cites W2123515711 @default.
- W942652018 cites W2123549998 @default.
- W942652018 cites W2127042504 @default.
- W942652018 cites W2130428211 @default.
- W942652018 cites W2132827946 @default.
- W942652018 cites W2133058591 @default.
- W942652018 cites W2136445629 @default.
- W942652018 cites W2137435333 @default.
- W942652018 cites W2138615112 @default.
- W942652018 cites W2140124448 @default.
- W942652018 cites W2141441158 @default.
- W942652018 cites W2144100511 @default.
- W942652018 cites W2145767445 @default.
- W942652018 cites W2146341620 @default.
- W942652018 cites W2147152072 @default.
- W942652018 cites W2147706904 @default.
- W942652018 cites W2157023969 @default.
- W942652018 cites W2157305458 @default.
- W942652018 cites W2159426623 @default.
- W942652018 cites W2161353674 @default.
- W942652018 cites W2163455955 @default.
- W942652018 cites W2165599843 @default.
- W942652018 cites W2169606435 @default.
- W942652018 cites W2171343266 @default.
- W942652018 cites W2259123132 @default.
- W942652018 cites W2294798173 @default.
- W942652018 cites W2402783873 @default.
- W942652018 cites W2435251607 @default.
- W942652018 cites W263845233 @default.
- W942652018 cites W2950477597 @default.
- W942652018 cites W2953092676 @default.
- W942652018 cites W2963673689 @default.
- W942652018 cites W3123857276 @default.
- W942652018 cites W316065036 @default.
- W942652018 hasPublicationYear "2012" @default.
- W942652018 type Work @default.