Matches in SemOpenAlex for { <https://semopenalex.org/work/W4385569780> ?p ?o ?g. }
Showing items 1 to 67 of 67, with 100 items per page.
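
The listing that follows can be reproduced with a plain SPARQL query. The sketch below is a minimal version, assuming the public SemOpenAlex SPARQL endpoint at https://semopenalex.org/sparql; it drops the graph variable `?g` for simplicity (all triples here sit in the default graph), so add a `GRAPH` clause if you need named-graph information.

```sparql
# List all outgoing properties of the work W4385569780.
# Endpoint (assumed): https://semopenalex.org/sparql
SELECT ?p ?o
WHERE {
  <https://semopenalex.org/work/W4385569780> ?p ?o .
}
LIMIT 100
```

The triples returned by this pattern are listed below.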
- W4385569780 abstract "Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-open. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistent with human judgments, although still suffering from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle in detecting hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation." @default.
- W4385569780 created "2023-08-05" @default.
- W4385569780 creator A5037737168 @default.
- W4385569780 creator A5043026875 @default.
- W4385569780 creator A5049618494 @default.
- W4385569780 creator A5071837282 @default.
- W4385569780 date "2023-01-01" @default.
- W4385569780 modified "2023-09-24" @default.
- W4385569780 title "Evaluating Open-Domain Question Answering in the Era of Large Language Models" @default.
- W4385569780 doi "https://doi.org/10.18653/v1/2023.acl-long.307" @default.
- W4385569780 hasPublicationYear "2023" @default.
- W4385569780 type Work @default.
- W4385569780 citedByCount "0" @default.
- W4385569780 crossrefType "proceedings-article" @default.
- W4385569780 hasAuthorship W4385569780A5037737168 @default.
- W4385569780 hasAuthorship W4385569780A5043026875 @default.
- W4385569780 hasAuthorship W4385569780A5049618494 @default.
- W4385569780 hasAuthorship W4385569780A5071837282 @default.
- W4385569780 hasBestOaLocation W43855697801 @default.
- W4385569780 hasConcept C105795698 @default.
- W4385569780 hasConcept C119857082 @default.
- W4385569780 hasConcept C13280743 @default.
- W4385569780 hasConcept C134306372 @default.
- W4385569780 hasConcept C137293760 @default.
- W4385569780 hasConcept C154945302 @default.
- W4385569780 hasConcept C165064840 @default.
- W4385569780 hasConcept C185798385 @default.
- W4385569780 hasConcept C204321447 @default.
- W4385569780 hasConcept C205649164 @default.
- W4385569780 hasConcept C23123220 @default.
- W4385569780 hasConcept C2993776861 @default.
- W4385569780 hasConcept C33923547 @default.
- W4385569780 hasConcept C36503486 @default.
- W4385569780 hasConcept C41008148 @default.
- W4385569780 hasConcept C44291984 @default.
- W4385569780 hasConceptScore W4385569780C105795698 @default.
- W4385569780 hasConceptScore W4385569780C119857082 @default.
- W4385569780 hasConceptScore W4385569780C13280743 @default.
- W4385569780 hasConceptScore W4385569780C134306372 @default.
- W4385569780 hasConceptScore W4385569780C137293760 @default.
- W4385569780 hasConceptScore W4385569780C154945302 @default.
- W4385569780 hasConceptScore W4385569780C165064840 @default.
- W4385569780 hasConceptScore W4385569780C185798385 @default.
- W4385569780 hasConceptScore W4385569780C204321447 @default.
- W4385569780 hasConceptScore W4385569780C205649164 @default.
- W4385569780 hasConceptScore W4385569780C23123220 @default.
- W4385569780 hasConceptScore W4385569780C2993776861 @default.
- W4385569780 hasConceptScore W4385569780C33923547 @default.
- W4385569780 hasConceptScore W4385569780C36503486 @default.
- W4385569780 hasConceptScore W4385569780C41008148 @default.
- W4385569780 hasConceptScore W4385569780C44291984 @default.
- W4385569780 hasLocation W43855697801 @default.
- W4385569780 hasOpenAccess W4385569780 @default.
- W4385569780 hasPrimaryLocation W43855697801 @default.
- W4385569780 hasRelatedWork W207304934 @default.
- W4385569780 hasRelatedWork W2368388617 @default.
- W4385569780 hasRelatedWork W2539940768 @default.
- W4385569780 hasRelatedWork W2559338413 @default.
- W4385569780 hasRelatedWork W2757542827 @default.
- W4385569780 hasRelatedWork W2798526799 @default.
- W4385569780 hasRelatedWork W2947497897 @default.
- W4385569780 hasRelatedWork W3207693618 @default.
- W4385569780 hasRelatedWork W4224294617 @default.
- W4385569780 hasRelatedWork W4226302158 @default.
- W4385569780 isParatext "false" @default.
- W4385569780 isRetracted "false" @default.
- W4385569780 workType "article" @default.
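
The creator and concept identifiers above are opaque IRIs. A follow-up query can resolve them to human-readable labels; the sketch below assumes that the `creator` predicate in the listing maps to `dcterms:creator` and that author names are exposed via `foaf:name` (a similar pattern with `hasConcept` and `skos:prefLabel` would resolve the concept IDs). Verify these predicates against the SemOpenAlex ontology before relying on them.

```sparql
# Resolve the creator IRIs of W4385569780 to author names.
# dcterms:creator and foaf:name are assumptions about the underlying
# vocabulary; check the SemOpenAlex ontology if no results come back.
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>

SELECT ?author ?name
WHERE {
  <https://semopenalex.org/work/W4385569780> dcterms:creator ?author .
  OPTIONAL { ?author foaf:name ?name . }
}
```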