Matches in SemOpenAlex for { <https://semopenalex.org/work/W4312050653> ?p ?o ?g. }
- W4312050653 abstract "As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer (sycophancy) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors." @default.
- W4312050653 created "2023-01-04" @default.
- W4312050653 creator A5001194068 @default.
- W4312050653 creator A5001442211 @default.
- W4312050653 creator A5006175345 @default.
- W4312050653 creator A5006294201 @default.
- W4312050653 creator A5009112681 @default.
- W4312050653 creator A5010007563 @default.
- W4312050653 creator A5010253784 @default.
- W4312050653 creator A5010365586 @default.
- W4312050653 creator A5011636042 @default.
- W4312050653 creator A5011723751 @default.
- W4312050653 creator A5012713248 @default.
- W4312050653 creator A5015730722 @default.
- W4312050653 creator A5016181805 @default.
- W4312050653 creator A5017610025 @default.
- W4312050653 creator A5017968952 @default.
- W4312050653 creator A5020683620 @default.
- W4312050653 creator A5022793124 @default.
- W4312050653 creator A5025461840 @default.
- W4312050653 creator A5025573579 @default.
- W4312050653 creator A5027253674 @default.
- W4312050653 creator A5028970835 @default.
- W4312050653 creator A5030305998 @default.
- W4312050653 creator A5032088236 @default.
- W4312050653 creator A5032805177 @default.
- W4312050653 creator A5037671308 @default.
- W4312050653 creator A5038912770 @default.
- W4312050653 creator A5041888075 @default.
- W4312050653 creator A5044218586 @default.
- W4312050653 creator A5047846683 @default.
- W4312050653 creator A5049664153 @default.
- W4312050653 creator A5049786610 @default.
- W4312050653 creator A5050348824 @default.
- W4312050653 creator A5051064337 @default.
- W4312050653 creator A5051750194 @default.
- W4312050653 creator A5053213601 @default.
- W4312050653 creator A5053939329 @default.
- W4312050653 creator A5054887773 @default.
- W4312050653 creator A5056436767 @default.
- W4312050653 creator A5057116059 @default.
- W4312050653 creator A5057760823 @default.
- W4312050653 creator A5057999428 @default.
- W4312050653 creator A5059021264 @default.
- W4312050653 creator A5059933756 @default.
- W4312050653 creator A5061506488 @default.
- W4312050653 creator A5063966631 @default.
- W4312050653 creator A5064850171 @default.
- W4312050653 creator A5066197394 @default.
- W4312050653 creator A5067036768 @default.
- W4312050653 creator A5067390670 @default.
- W4312050653 creator A5070050208 @default.
- W4312050653 creator A5070402589 @default.
- W4312050653 creator A5070615024 @default.
- W4312050653 creator A5070762421 @default.
- W4312050653 creator A5074565235 @default.
- W4312050653 creator A5077427160 @default.
- W4312050653 creator A5078614102 @default.
- W4312050653 creator A5079539910 @default.
- W4312050653 creator A5081035645 @default.
- W4312050653 creator A5085932648 @default.
- W4312050653 creator A5086110112 @default.
- W4312050653 creator A5091112967 @default.
- W4312050653 creator A5091433552 @default.
- W4312050653 creator A5091860006 @default.
- W4312050653 date "2022-12-19" @default.
- W4312050653 modified "2023-10-11" @default.
- W4312050653 title "Discovering Language Model Behaviors with Model-Written Evaluations" @default.
- W4312050653 doi "https://doi.org/10.48550/arxiv.2212.09251" @default.
- W4312050653 hasPublicationYear "2022" @default.
- W4312050653 type Work @default.
- W4312050653 citedByCount "0" @default.
- W4312050653 crossrefType "posted-content" @default.
- W4312050653 hasAuthorship W4312050653A5001194068 @default.
- W4312050653 hasAuthorship W4312050653A5001442211 @default.
- W4312050653 hasAuthorship W4312050653A5006175345 @default.
- W4312050653 hasAuthorship W4312050653A5006294201 @default.
- W4312050653 hasAuthorship W4312050653A5009112681 @default.
- W4312050653 hasAuthorship W4312050653A5010007563 @default.
- W4312050653 hasAuthorship W4312050653A5010253784 @default.
- W4312050653 hasAuthorship W4312050653A5010365586 @default.
- W4312050653 hasAuthorship W4312050653A5011636042 @default.
- W4312050653 hasAuthorship W4312050653A5011723751 @default.
- W4312050653 hasAuthorship W4312050653A5012713248 @default.
- W4312050653 hasAuthorship W4312050653A5015730722 @default.
- W4312050653 hasAuthorship W4312050653A5016181805 @default.
- W4312050653 hasAuthorship W4312050653A5017610025 @default.
- W4312050653 hasAuthorship W4312050653A5017968952 @default.
- W4312050653 hasAuthorship W4312050653A5020683620 @default.
- W4312050653 hasAuthorship W4312050653A5022793124 @default.
- W4312050653 hasAuthorship W4312050653A5025461840 @default.
- W4312050653 hasAuthorship W4312050653A5025573579 @default.
- W4312050653 hasAuthorship W4312050653A5027253674 @default.
- W4312050653 hasAuthorship W4312050653A5028970835 @default.
- W4312050653 hasAuthorship W4312050653A5030305998 @default.
- W4312050653 hasAuthorship W4312050653A5032088236 @default.
- W4312050653 hasAuthorship W4312050653A5032805177 @default.
- W4312050653 hasAuthorship W4312050653A5037671308 @default.
- W4312050653 hasAuthorship W4312050653A5038912770 @default.
- W4312050653 hasAuthorship W4312050653A5041888075 @default.