Matches in SemOpenAlex for { <https://semopenalex.org/work/W4387294543> ?p ?o ?g. }
Showing items 1 to 65 of 65 with 100 items per page.
- W4387294543 abstract "Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover deleted information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models." @default.
- W4387294543 created "2023-10-03" @default.
- W4387294543 creator A5001987532 @default.
- W4387294543 creator A5013096489 @default.
- W4387294543 creator A5063726130 @default.
- W4387294543 date "2023-09-29" @default.
- W4387294543 modified "2023-10-04" @default.
- W4387294543 title "Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks" @default.
- W4387294543 doi "https://doi.org/10.48550/arxiv.2309.17410" @default.
- W4387294543 hasPublicationYear "2023" @default.
- W4387294543 type Work @default.
- W4387294543 citedByCount "0" @default.
- W4387294543 crossrefType "posted-content" @default.
- W4387294543 hasAuthorship W4387294543A5001987532 @default.
- W4387294543 hasAuthorship W4387294543A5013096489 @default.
- W4387294543 hasAuthorship W4387294543A5063726130 @default.
- W4387294543 hasBestOaLocation W43872945431 @default.
- W4387294543 hasConcept C137822555 @default.
- W4387294543 hasConcept C153083717 @default.
- W4387294543 hasConcept C154945302 @default.
- W4387294543 hasConcept C162324750 @default.
- W4387294543 hasConcept C177264268 @default.
- W4387294543 hasConcept C17744445 @default.
- W4387294543 hasConcept C187736073 @default.
- W4387294543 hasConcept C195807954 @default.
- W4387294543 hasConcept C199360897 @default.
- W4387294543 hasConcept C199539241 @default.
- W4387294543 hasConcept C23123220 @default.
- W4387294543 hasConcept C26517878 @default.
- W4387294543 hasConcept C2777363581 @default.
- W4387294543 hasConcept C2780451532 @default.
- W4387294543 hasConcept C38652104 @default.
- W4387294543 hasConcept C41008148 @default.
- W4387294543 hasConceptScore W4387294543C137822555 @default.
- W4387294543 hasConceptScore W4387294543C153083717 @default.
- W4387294543 hasConceptScore W4387294543C154945302 @default.
- W4387294543 hasConceptScore W4387294543C162324750 @default.
- W4387294543 hasConceptScore W4387294543C177264268 @default.
- W4387294543 hasConceptScore W4387294543C17744445 @default.
- W4387294543 hasConceptScore W4387294543C187736073 @default.
- W4387294543 hasConceptScore W4387294543C195807954 @default.
- W4387294543 hasConceptScore W4387294543C199360897 @default.
- W4387294543 hasConceptScore W4387294543C199539241 @default.
- W4387294543 hasConceptScore W4387294543C23123220 @default.
- W4387294543 hasConceptScore W4387294543C26517878 @default.
- W4387294543 hasConceptScore W4387294543C2777363581 @default.
- W4387294543 hasConceptScore W4387294543C2780451532 @default.
- W4387294543 hasConceptScore W4387294543C38652104 @default.
- W4387294543 hasConceptScore W4387294543C41008148 @default.
- W4387294543 hasLocation W43872945431 @default.
- W4387294543 hasOpenAccess W4387294543 @default.
- W4387294543 hasPrimaryLocation W43872945431 @default.
- W4387294543 hasRelatedWork W1788528807 @default.
- W4387294543 hasRelatedWork W2153799433 @default.
- W4387294543 hasRelatedWork W2329452785 @default.
- W4387294543 hasRelatedWork W2352337653 @default.
- W4387294543 hasRelatedWork W2357241418 @default.
- W4387294543 hasRelatedWork W2367301249 @default.
- W4387294543 hasRelatedWork W2379157006 @default.
- W4387294543 hasRelatedWork W2393978999 @default.
- W4387294543 hasRelatedWork W2725657302 @default.
- W4387294543 hasRelatedWork W3036744775 @default.
- W4387294543 isParatext "false" @default.
- W4387294543 isRetracted "false" @default.
- W4387294543 workType "article" @default.