Matches in SemOpenAlex for { <https://semopenalex.org/work/W4385889721> ?p ?o ?g. }
Showing items 1 to 71 of 71 with 100 items per page.
- W4385889721 abstract "Large language models (LLMs) have skyrocketed in popularity in recent years due to their ability to generate high-quality text in response to human prompting. However, these models have been shown to have the potential to generate harmful content in response to user prompting (e.g., giving users instructions on how to commit crimes). There has been a focus in the literature on mitigating these risks, through methods like aligning models with human values through reinforcement learning. However, it has been shown that even aligned language models are susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple approach to defending against these attacks by having a large language model filter its own responses. Our current results show that even if a model is not fine-tuned to be aligned with human values, it is possible to stop it from presenting harmful content to users by validating the content using a language model." @default.
- W4385889721 created "2023-08-17" @default.
- W4385889721 creator A5005412226 @default.
- W4385889721 creator A5020153026 @default.
- W4385889721 creator A5092652481 @default.
- W4385889721 creator A5092652482 @default.
- W4385889721 date "2023-08-14" @default.
- W4385889721 modified "2023-09-25" @default.
- W4385889721 title "LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked" @default.
- W4385889721 doi "https://doi.org/10.48550/arxiv.2308.07308" @default.
- W4385889721 hasPublicationYear "2023" @default.
- W4385889721 type Work @default.
- W4385889721 citedByCount "0" @default.
- W4385889721 crossrefType "posted-content" @default.
- W4385889721 hasAuthorship W4385889721A5005412226 @default.
- W4385889721 hasAuthorship W4385889721A5020153026 @default.
- W4385889721 hasAuthorship W4385889721A5092652481 @default.
- W4385889721 hasAuthorship W4385889721A5092652482 @default.
- W4385889721 hasBestOaLocation W43858897211 @default.
- W4385889721 hasConcept C108827166 @default.
- W4385889721 hasConcept C111472728 @default.
- W4385889721 hasConcept C120665830 @default.
- W4385889721 hasConcept C121332964 @default.
- W4385889721 hasConcept C138885662 @default.
- W4385889721 hasConcept C153180980 @default.
- W4385889721 hasConcept C154945302 @default.
- W4385889721 hasConcept C15744967 @default.
- W4385889721 hasConcept C180747234 @default.
- W4385889721 hasConcept C192209626 @default.
- W4385889721 hasConcept C2779530757 @default.
- W4385889721 hasConcept C2780586882 @default.
- W4385889721 hasConcept C2780586970 @default.
- W4385889721 hasConcept C37736160 @default.
- W4385889721 hasConcept C38652104 @default.
- W4385889721 hasConcept C41008148 @default.
- W4385889721 hasConcept C77088390 @default.
- W4385889721 hasConcept C77805123 @default.
- W4385889721 hasConceptScore W4385889721C108827166 @default.
- W4385889721 hasConceptScore W4385889721C111472728 @default.
- W4385889721 hasConceptScore W4385889721C120665830 @default.
- W4385889721 hasConceptScore W4385889721C121332964 @default.
- W4385889721 hasConceptScore W4385889721C138885662 @default.
- W4385889721 hasConceptScore W4385889721C153180980 @default.
- W4385889721 hasConceptScore W4385889721C154945302 @default.
- W4385889721 hasConceptScore W4385889721C15744967 @default.
- W4385889721 hasConceptScore W4385889721C180747234 @default.
- W4385889721 hasConceptScore W4385889721C192209626 @default.
- W4385889721 hasConceptScore W4385889721C2779530757 @default.
- W4385889721 hasConceptScore W4385889721C2780586882 @default.
- W4385889721 hasConceptScore W4385889721C2780586970 @default.
- W4385889721 hasConceptScore W4385889721C37736160 @default.
- W4385889721 hasConceptScore W4385889721C38652104 @default.
- W4385889721 hasConceptScore W4385889721C41008148 @default.
- W4385889721 hasConceptScore W4385889721C77088390 @default.
- W4385889721 hasConceptScore W4385889721C77805123 @default.
- W4385889721 hasLocation W43858897211 @default.
- W4385889721 hasOpenAccess W4385889721 @default.
- W4385889721 hasPrimaryLocation W43858897211 @default.
- W4385889721 hasRelatedWork W2018160926 @default.
- W4385889721 hasRelatedWork W2104700403 @default.
- W4385889721 hasRelatedWork W2133515697 @default.
- W4385889721 hasRelatedWork W2162108740 @default.
- W4385889721 hasRelatedWork W2377333748 @default.
- W4385889721 hasRelatedWork W2406532298 @default.
- W4385889721 hasRelatedWork W2748952813 @default.
- W4385889721 hasRelatedWork W2899084033 @default.
- W4385889721 hasRelatedWork W2940009958 @default.
- W4385889721 hasRelatedWork W2981216146 @default.
- W4385889721 isParatext "false" @default.
- W4385889721 isRetracted "false" @default.
- W4385889721 workType "article" @default.
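
The listing above is the result of matching every predicate and object attached to one work IRI. As a rough sketch only, the Python below builds an equivalent SPARQL query string. The function name `build_work_query` is hypothetical, and the sketch assumes the triples live in the default graph (every row above shows `@default`), so it drops the `?g` variable from the header pattern. It only constructs the query text and does not contact any endpoint.

```python
def build_work_query(work_iri: str, limit: int = 100) -> str:
    """Build a SPARQL SELECT listing every predicate/object of one subject.

    Hypothetical helper: assumes the data sits in the default graph,
    as suggested by the '@default' marker on each row of the listing.
    """
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  <{work_iri}> ?p ?o .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# The work IRI is taken from the header line of the listing;
# the default limit mirrors its page size of 100 items.
query = build_work_query("https://semopenalex.org/work/W4385889721")
print(query)
```

The resulting string could be submitted to any SPARQL 1.1 endpoint that serves this dataset; which endpoint to use, and whether named graphs are needed, is left open here.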