Matches in SemOpenAlex for { <https://semopenalex.org/work/W852237108> ?p ?o ?g. }
Showing items 1 to 72 of 72 with 100 items per page.
- W852237108 abstract "DETECTING SIMILAR HTML DOCUMENTS USING A SENTENCE-BASED COPY DETECTION APPROACH Rajiv Yerra Department of Computer Science Master of Science Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes it take longer to filter unique information and requires additional storage space, but they also degrade the efficiency of Web information retrieval. In this thesis, we present a new approach for detecting similar (HTML) Web documents and evaluate its performance. To detect similar documents, we first apply our sentence-based copy detection approach to determine whether sentences in any two documents should be treated as the same or different according to the degrees of similarity of the sentences, which are computed by using either the three least-frequent 4-gram approach or the fuzzy set information retrieval (IR) approach. These copy detection approaches, which achieve a high success rate in detecting similar (not necessarily the same) sentences, (i) handle a wide range of documents in different subject areas (such as sports, news, and science) and (ii) do not require static word lists, which means that there is no need to look up words in a predefined dictionary/thesaurus to determine the similarity among words. Not only can we detect similar sentences in two documents, we can also graphically display the relative locations of similar (not necessarily the same) sentences detected in the documents using dotplot views, a graphical tool. Experimental results show that the fuzzy set IR approach outperforms the three least-frequent 4-gram approach in copy detection. 
For this reason, we adopt the fuzzy set IR copy detection approach for detecting similar Web documents, especially HTML documents, by computing the degree of resemblance between any two HTML documents, which represents the extent to which the two documents under consideration are similar. Thereafter, we match the corresponding hierarchical content of the two documents using a simple tree matching algorithm. Our copy detection approach is unique since it is sentence-based, instead of word-based as most existing copy detection approaches are, and can specify the relative positions of the same (or similar) sentences in their corresponding HTML documents graphically, as well as hierarchically, according to the document structures. Our copy detection approach also differs from others, since it (i) performs copy detection on HTML documents, instead of any plain text documents, (ii) detects HTML documents with similar sentences apart from exact matches, and (iii) is simple, as it uses the fuzzy set IR model for determining related words in documents and filtering redundant Web documents, and is supported by well-known and yet simple mathematical models. Experiments on the detection of similar documents have been performed to check accuracy using false positives, false negatives, precision, recall, and F-measure values. With over 90% F-measure, which indicates that the percentage of error is relatively small, our approach to detecting similar documents performs reasonably well. The time complexity of our copy detection approach is O(n), where n is the total number of sentences in an HTML document, whereas the time complexity of detecting similar HTML documents using our copy detection approach is O(n log n). The overall time complexity of our copy detection and similar HTML documents detection approach is O(n log n + n) ≅ O(n)." @default.
- W852237108 created "2016-06-24" @default.
- W852237108 creator A5084070147 @default.
- W852237108 date "2005-01-01" @default.
- W852237108 modified "2023-09-27" @default.
- W852237108 title "Detecting Similar HTML Documents Using A Sentence-Based Copy Detection Approach" @default.
- W852237108 cites W1548134554 @default.
- W852237108 cites W1596691921 @default.
- W852237108 cites W1609518033 @default.
- W852237108 cites W1647641001 @default.
- W852237108 cites W1660390307 @default.
- W852237108 cites W174630521 @default.
- W852237108 cites W1918365723 @default.
- W852237108 cites W1956218947 @default.
- W852237108 cites W1968736254 @default.
- W852237108 cites W197093853 @default.
- W852237108 cites W1971393620 @default.
- W852237108 cites W1980776243 @default.
- W852237108 cites W2003677545 @default.
- W852237108 cites W2006803276 @default.
- W852237108 cites W2007842132 @default.
- W852237108 cites W2072208512 @default.
- W852237108 cites W2078396547 @default.
- W852237108 cites W2091340087 @default.
- W852237108 cites W2098162425 @default.
- W852237108 cites W2113641473 @default.
- W852237108 cites W2127363101 @default.
- W852237108 cites W2140689849 @default.
- W852237108 cites W2148044665 @default.
- W852237108 cites W2421250929 @default.
- W852237108 hasPublicationYear "2005" @default.
- W852237108 type Work @default.
- W852237108 sameAs 852237108 @default.
- W852237108 citedByCount "0" @default.
- W852237108 crossrefType "journal-article" @default.
- W852237108 hasAuthorship W852237108A5084070147 @default.
- W852237108 hasConcept C103278499 @default.
- W852237108 hasConcept C106131492 @default.
- W852237108 hasConcept C115961682 @default.
- W852237108 hasConcept C138885662 @default.
- W852237108 hasConcept C154945302 @default.
- W852237108 hasConcept C177264268 @default.
- W852237108 hasConcept C199360897 @default.
- W852237108 hasConcept C204321447 @default.
- W852237108 hasConcept C23123220 @default.
- W852237108 hasConcept C2777530160 @default.
- W852237108 hasConcept C31972630 @default.
- W852237108 hasConcept C41008148 @default.
- W852237108 hasConcept C41895202 @default.
- W852237108 hasConcept C90805587 @default.
- W852237108 hasConceptScore W852237108C103278499 @default.
- W852237108 hasConceptScore W852237108C106131492 @default.
- W852237108 hasConceptScore W852237108C115961682 @default.
- W852237108 hasConceptScore W852237108C138885662 @default.
- W852237108 hasConceptScore W852237108C154945302 @default.
- W852237108 hasConceptScore W852237108C177264268 @default.
- W852237108 hasConceptScore W852237108C199360897 @default.
- W852237108 hasConceptScore W852237108C204321447 @default.
- W852237108 hasConceptScore W852237108C23123220 @default.
- W852237108 hasConceptScore W852237108C2777530160 @default.
- W852237108 hasConceptScore W852237108C31972630 @default.
- W852237108 hasConceptScore W852237108C41008148 @default.
- W852237108 hasConceptScore W852237108C41895202 @default.
- W852237108 hasConceptScore W852237108C90805587 @default.
- W852237108 hasLocation W8522371081 @default.
- W852237108 hasOpenAccess W852237108 @default.
- W852237108 hasPrimaryLocation W8522371081 @default.
- W852237108 hasRelatedWork W2923603359 @default.
- W852237108 isParatext "false" @default.
- W852237108 isRetracted "false" @default.
- W852237108 magId "852237108" @default.
- W852237108 workType "article" @default.
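The abstract in this record reports accuracy in terms of false positives, false negatives, precision, recall, and F-measure. As a minimal sketch of how those standard measures relate, the Python below computes them from raw counts. The function name `precision_recall_f` and the counts in the usage lines are hypothetical illustrations, not figures or code from the thesis.

```python
def precision_recall_f(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F-measure) from detection counts.

    Uses the standard definitions; F-measure here is the balanced
    harmonic mean of precision and recall (often called F1).
    """
    # Precision: share of reported similar pairs that are truly similar.
    reported = true_pos + false_pos
    precision = true_pos / reported if reported else 0.0

    # Recall: share of truly similar pairs that were reported.
    actual = true_pos + false_neg
    recall = true_pos / actual if actual else 0.0

    # Harmonic mean; defined as 0 when both components are 0.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = precision_recall_f(true_pos=90, false_pos=10, false_neg=10)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

Because the F-measure is a harmonic mean, it stays high only when both false positives and false negatives are low, which is consistent with the abstract reading a high F-measure as a small percentage of error.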