
Modern generative search engines provide cited evidence to enhance the reliability of large language model (LLM) responses. The challenge lies in evaluating the attribution of these responses: establishing whether each claim is fully supported by its citations. The recent paper ‘AttributionBench: How Hard is Automatic Attribution Evaluation?’ introduces AttributionBench, a benchmark designed to standardize the evaluation of such claims. The benchmark reveals how difficult the task remains even for state-of-the-art LLMs such as GPT-3.5, which achieve only about 80% macro-F1 on attribution accuracy. A detailed error analysis shows that most failures stem from the models’ difficulty in processing nuanced information and from the gap between the information accessible to models and that available to human annotators.
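To make the task concrete, here is a minimal sketch of attribution evaluation framed as binary classification and scored with macro-F1, the metric reported above. The example data, field names, and the word-overlap “judge” (a stand-in for an LLM verifier) are illustrative assumptions, not AttributionBench’s actual schema or method; only the binary-label-plus-macro-F1 framing comes from the summary.

```python
from sklearn.metrics import f1_score

# Toy examples in the spirit of the task: each pairs a claim with its cited
# evidence and a gold binary label (1 = fully supported, 0 = not supported).
# Field names and contents are illustrative, not the benchmark's schema.
examples = [
    {"claim": "The Eiffel Tower is in Paris.",
     "evidence": "The Eiffel Tower is a landmark on the Champ de Mars in Paris.",
     "label": 1},
    {"claim": "The Eiffel Tower was completed in 1900.",
     "evidence": "The Eiffel Tower is a landmark on the Champ de Mars in Paris.",
     "label": 0},
]

def judge_attribution(claim: str, evidence: str) -> int:
    """Stand-in judge: in practice this would prompt an LLM to decide whether
    the evidence fully supports the claim. A crude word-overlap heuristic
    keeps the sketch self-contained and runnable."""
    claim_words = set(claim.lower().split())
    evidence_words = set(evidence.lower().split())
    overlap = len(claim_words & evidence_words) / max(len(claim_words), 1)
    return 1 if overlap > 0.6 else 0

gold = [ex["label"] for ex in examples]
preds = [judge_attribution(ex["claim"], ex["evidence"]) for ex in examples]

# Macro-F1 averages the per-class F1 of "attributable" and "not attributable",
# so missed unsupported claims count as much as missed supported ones.
print(f"macro-F1: {f1_score(gold, preds, average='macro'):.2f}")
```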
**Highlights:**
- Introduces AttributionBench, a benchmark that standardizes automatic attribution evaluation for generative search engines.
- State-of-the-art LLMs such as GPT-3.5 reach only about 80% macro-F1 on the task.
- Most errors trace back to nuanced information processing and to the gap between the information available to models and to human annotators.
**Opinion:** This work carries significant implications for the future of generative search engines, underscoring the critical role of accurate evidence citation. It advances the conversation on building models capable of nuanced understanding and calls for further exploration of automatic attribution evaluation methods.