Matches in SemOpenAlex for { <https://semopenalex.org/work/W4387561041> ?p ?o ?g. }
Showing items 1 to 63 of 63 with 100 items per page.
- W4387561041 abstract "Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating malicious content have emerged. In this paper, we explore the power of In-Context Learning (ICL) in manipulating the alignment ability of LLMs. We find that by providing just few in-context demonstrations without fine-tuning, LLMs can be manipulated to increase or decrease the probability of jailbreaking, i.e. answering malicious prompts. Based on these observations, we propose In-Context Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding aligned language model purposes. ICA crafts malicious contexts to guide models in generating harmful outputs, while ICD enhances model robustness by demonstrations of rejecting to answer harmful prompts. Our experiments show the effectiveness of ICA and ICD in increasing or reducing the success rate of adversarial jailbreaking attacks. Overall, we shed light on the potential of ICL to influence LLM behavior and provide a new perspective for enhancing the safety and alignment of LLMs." @default.
- W4387561041 created "2023-10-12" @default.
- W4387561041 creator A5004999983 @default.
- W4387561041 creator A5027049671 @default.
- W4387561041 creator A5029941351 @default.
- W4387561041 date "2023-10-10" @default.
- W4387561041 modified "2023-10-13" @default.
- W4387561041 title "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations" @default.
- W4387561041 doi "https://doi.org/10.48550/arxiv.2310.06387" @default.
- W4387561041 hasPublicationYear "2023" @default.
- W4387561041 type Work @default.
- W4387561041 citedByCount "0" @default.
- W4387561041 crossrefType "posted-content" @default.
- W4387561041 hasAuthorship W4387561041A5004999983 @default.
- W4387561041 hasAuthorship W4387561041A5027049671 @default.
- W4387561041 hasAuthorship W4387561041A5029941351 @default.
- W4387561041 hasBestOaLocation W43875610411 @default.
- W4387561041 hasConcept C104317684 @default.
- W4387561041 hasConcept C137293760 @default.
- W4387561041 hasConcept C141141315 @default.
- W4387561041 hasConcept C154945302 @default.
- W4387561041 hasConcept C166957645 @default.
- W4387561041 hasConcept C185592680 @default.
- W4387561041 hasConcept C199360897 @default.
- W4387561041 hasConcept C204321447 @default.
- W4387561041 hasConcept C2779343474 @default.
- W4387561041 hasConcept C37736160 @default.
- W4387561041 hasConcept C38652104 @default.
- W4387561041 hasConcept C41008148 @default.
- W4387561041 hasConcept C55493867 @default.
- W4387561041 hasConcept C63479239 @default.
- W4387561041 hasConcept C95457728 @default.
- W4387561041 hasConceptScore W4387561041C104317684 @default.
- W4387561041 hasConceptScore W4387561041C137293760 @default.
- W4387561041 hasConceptScore W4387561041C141141315 @default.
- W4387561041 hasConceptScore W4387561041C154945302 @default.
- W4387561041 hasConceptScore W4387561041C166957645 @default.
- W4387561041 hasConceptScore W4387561041C185592680 @default.
- W4387561041 hasConceptScore W4387561041C199360897 @default.
- W4387561041 hasConceptScore W4387561041C204321447 @default.
- W4387561041 hasConceptScore W4387561041C2779343474 @default.
- W4387561041 hasConceptScore W4387561041C37736160 @default.
- W4387561041 hasConceptScore W4387561041C38652104 @default.
- W4387561041 hasConceptScore W4387561041C41008148 @default.
- W4387561041 hasConceptScore W4387561041C55493867 @default.
- W4387561041 hasConceptScore W4387561041C63479239 @default.
- W4387561041 hasConceptScore W4387561041C95457728 @default.
- W4387561041 hasLocation W43875610411 @default.
- W4387561041 hasOpenAccess W4387561041 @default.
- W4387561041 hasPrimaryLocation W43875610411 @default.
- W4387561041 hasRelatedWork W1561927205 @default.
- W4387561041 hasRelatedWork W2482350142 @default.
- W4387561041 hasRelatedWork W2502115930 @default.
- W4387561041 hasRelatedWork W3126451824 @default.
- W4387561041 hasRelatedWork W3176240006 @default.
- W4387561041 hasRelatedWork W3191453585 @default.
- W4387561041 hasRelatedWork W4246396837 @default.
- W4387561041 hasRelatedWork W4285226279 @default.
- W4387561041 hasRelatedWork W4297672492 @default.
- W4387561041 hasRelatedWork W4310988119 @default.
- W4387561041 isParatext "false" @default.
- W4387561041 isRetracted "false" @default.
- W4387561041 workType "article" @default.
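
Each row above follows the shape of the query pattern in the header: a subject, a predicate (?p), an object (?o), and a graph (?g, here always @default). A minimal sketch of reading such rows back into tuples follows, assuming the plain-text layout "- SUBJECT PREDICATE OBJECT @GRAPH." seen in this listing. The names ROW, parse_row, and group_by_predicate are hypothetical helpers introduced for illustration and are not part of any SemOpenAlex tooling.

```python
import re
from collections import defaultdict

# Assumed row layout, taken from this listing: "- SUBJECT PREDICATE OBJECT @GRAPH."
ROW = re.compile(r'^- (\S+) (\S+) (.+) @(\S+)\.$')


def parse_row(line):
    """Split one listing row into (subject, predicate, object, graph).

    Returns None for lines that do not match the assumed layout,
    such as the header or the pagination line.
    """
    match = ROW.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    # Quoted objects are literals; drop the surrounding double quotes.
    if len(obj) >= 2 and obj[0] == '"' and obj[-1] == '"':
        obj = obj[1:-1]
    return subject, predicate, obj, graph


def group_by_predicate(lines):
    """Collect object values per predicate, keeping listing order."""
    grouped = defaultdict(list)
    for line in lines:
        row = parse_row(line)
        if row is not None:
            grouped[row[1]].append(row[2])
    return dict(grouped)


# A few rows copied from the listing above.
sample = [
    '- W4387561041 created "2023-10-12" @default.',
    '- W4387561041 creator A5004999983 @default.',
    '- W4387561041 creator A5027049671 @default.',
    '- W4387561041 type Work @default.',
]
props = group_by_predicate(sample)
print(props["creator"])
```

Grouping by predicate keeps multi-valued properties such as creator, hasConcept, and hasRelatedWork together as lists, while single-valued ones such as created become one-element lists.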