Matches in SemOpenAlex for { <https://semopenalex.org/work/W4381328570> ?p ?o ?g. }
- W4381328570 endingPage "18" @default.
- W4381328570 startingPage "1" @default.
- W4381328570 abstract "Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., differing by only a couple of tokens out of 100) in the training corpus. A major challenge in conducting this evaluation is the huge computation cost incurred by near-duplicate sequence searches, because modern LLMs are trained on ever larger corpora with up to 1 trillion tokens. Worse, the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least t tokens (as very short near-duplicates are often irrelevant noise) in the corpus in time linear in the corpus size. We formally prove that only 2(n+1)/(t+1) - 1 min-hash values are generated in expectation for a text with n tokens. Thus the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on a few large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable." @default.
- W4381328570 created "2023-06-21" @default.
- W4381328570 creator A5045267283 @default.
- W4381328570 creator A5066066915 @default.
- W4381328570 creator A5084296041 @default.
- W4381328570 date "2023-06-13" @default.
- W4381328570 modified "2023-10-07" @default.
- W4381328570 title "Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation" @default.
- W4381328570 cites W1565650557 @default.
- W4381328570 cites W179875071 @default.
- W4381328570 cites W1963838724 @default.
- W4381328570 cites W1968625547 @default.
- W4381328570 cites W1973001156 @default.
- W4381328570 cites W1980229659 @default.
- W4381328570 cites W1991175610 @default.
- W4381328570 cites W2026968007 @default.
- W4381328570 cites W2065259291 @default.
- W4381328570 cites W2067432306 @default.
- W4381328570 cites W2097776316 @default.
- W4381328570 cites W2111549955 @default.
- W4381328570 cites W2119455368 @default.
- W4381328570 cites W2121269638 @default.
- W4381328570 cites W2129750215 @default.
- W4381328570 cites W2134212491 @default.
- W4381328570 cites W2139660688 @default.
- W4381328570 cites W2164634022 @default.
- W4381328570 cites W2294331997 @default.
- W4381328570 cites W2430378630 @default.
- W4381328570 cites W2489320908 @default.
- W4381328570 cites W2798412430 @default.
- W4381328570 cites W2912924812 @default.
- W4381328570 cites W3184324824 @default.
- W4381328570 cites W4282565958 @default.
- W4381328570 doi "https://doi.org/10.1145/3589324" @default.
- W4381328570 hasPublicationYear "2023" @default.
- W4381328570 type Work @default.
- W4381328570 citedByCount "0" @default.
- W4381328570 crossrefType "journal-article" @default.
- W4381328570 hasAuthorship W4381328570A5045267283 @default.
- W4381328570 hasAuthorship W4381328570A5066066915 @default.
- W4381328570 hasAuthorship W4381328570A5084296041 @default.
- W4381328570 hasConcept C137293760 @default.
- W4381328570 hasConcept C145420912 @default.
- W4381328570 hasConcept C154945302 @default.
- W4381328570 hasConcept C204321447 @default.
- W4381328570 hasConcept C23123220 @default.
- W4381328570 hasConcept C2524010 @default.
- W4381328570 hasConcept C2776036281 @default.
- W4381328570 hasConcept C2778112365 @default.
- W4381328570 hasConcept C30038468 @default.
- W4381328570 hasConcept C33923547 @default.
- W4381328570 hasConcept C38652104 @default.
- W4381328570 hasConcept C41008148 @default.
- W4381328570 hasConcept C48044578 @default.
- W4381328570 hasConcept C54355233 @default.
- W4381328570 hasConcept C67388219 @default.
- W4381328570 hasConcept C75165309 @default.
- W4381328570 hasConcept C77088390 @default.
- W4381328570 hasConcept C80444323 @default.
- W4381328570 hasConcept C86803240 @default.
- W4381328570 hasConcept C87431388 @default.
- W4381328570 hasConcept C99138194 @default.
- W4381328570 hasConceptScore W4381328570C137293760 @default.
- W4381328570 hasConceptScore W4381328570C145420912 @default.
- W4381328570 hasConceptScore W4381328570C154945302 @default.
- W4381328570 hasConceptScore W4381328570C204321447 @default.
- W4381328570 hasConceptScore W4381328570C23123220 @default.
- W4381328570 hasConceptScore W4381328570C2524010 @default.
- W4381328570 hasConceptScore W4381328570C2776036281 @default.
- W4381328570 hasConceptScore W4381328570C2778112365 @default.
- W4381328570 hasConceptScore W4381328570C30038468 @default.
- W4381328570 hasConceptScore W4381328570C33923547 @default.
- W4381328570 hasConceptScore W4381328570C38652104 @default.
- W4381328570 hasConceptScore W4381328570C41008148 @default.
- W4381328570 hasConceptScore W4381328570C48044578 @default.
- W4381328570 hasConceptScore W4381328570C54355233 @default.
- W4381328570 hasConceptScore W4381328570C67388219 @default.
- W4381328570 hasConceptScore W4381328570C75165309 @default.
- W4381328570 hasConceptScore W4381328570C77088390 @default.
- W4381328570 hasConceptScore W4381328570C80444323 @default.
- W4381328570 hasConceptScore W4381328570C86803240 @default.
- W4381328570 hasConceptScore W4381328570C87431388 @default.
- W4381328570 hasConceptScore W4381328570C99138194 @default.
- W4381328570 hasIssue "2" @default.
- W4381328570 hasLocation W43813285701 @default.
- W4381328570 hasOpenAccess W4381328570 @default.
- W4381328570 hasPrimaryLocation W43813285701 @default.
- W4381328570 hasRelatedWork W2034963017 @default.
- W4381328570 hasRelatedWork W2155123971 @default.
- W4381328570 hasRelatedWork W2352031993 @default.
- W4381328570 hasRelatedWork W2357865405 @default.
- W4381328570 hasRelatedWork W2388078788 @default.
- W4381328570 hasRelatedWork W2390485179 @default.
- W4381328570 hasRelatedWork W2765465462 @default.
- W4381328570 hasRelatedWork W3156632946 @default.
- W4381328570 hasRelatedWork W4381328570 @default.
- W4381328570 hasRelatedWork W162007055 @default.
- W4381328570 hasVolume "1" @default.