Long-form Factuality in LLMs

LLMs like GPT-4 are prone to factual errors when responding to open-ended, fact-seeking questions. The study introduces LongFact, a new benchmark of GPT-4-generated questions spanning diverse topics, to measure a model’s long-form factuality. It also proposes the Search-Augmented Factuality Evaluator (SAFE), which uses an LLM to fact-check the individual statements in a response against Google Search results, and an extended F1 score that quantifies factuality while accounting for response length.

  • Long-form factuality: Introduction of LongFact, a new benchmark with thousands of questions across 38 topics.
  • SAFE: An LLM-based method that automatically evaluates factuality by splitting a response into individual statements and checking each against Google Search results (see the sketch after this list).
  • F1 score extension: A metric that balances the fraction of supported facts in a response against a preferred response length, expressed as a target number of supported facts.
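
To make these two ideas concrete, below is a minimal Python sketch of a SAFE-style rating loop and a length-aware F1 computed from its counts. It is an illustration based on the description above, not the authors' code: `split_into_atomic_facts` and `is_supported_by_search` are hypothetical stand-ins for the LLM-prompting and Google Search steps, and `k` denotes the preferred number of supported facts.

```python
# Minimal sketch of the ideas described above, not the paper's reference
# implementation. The helpers `split_into_atomic_facts` and
# `is_supported_by_search` are hypothetical stand-ins for the LLM prompting
# and Google Search calls that SAFE performs.
from typing import Callable


def rate_response(
    response: str,
    split_into_atomic_facts: Callable[[str], list[str]],
    is_supported_by_search: Callable[[str], bool],
) -> tuple[int, int]:
    """Count supported vs. not-supported facts in a long-form response."""
    facts = split_into_atomic_facts(response)  # LLM splits the text into claims
    supported = sum(1 for fact in facts if is_supported_by_search(fact))
    return supported, len(facts) - supported


def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Length-aware F1: precision over checked facts, recall capped at k.

    Here k is the number of supported facts a reader would consider an
    ideal response length; it is a tunable hyperparameter.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)
```

A response that states many facts but gets a large share wrong is penalized through precision, while a very short response that is entirely correct is penalized through the capped recall term.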

The research shows that LLM agents can achieve superhuman rating performance, reinforcing the value of LLMs for automated fact-checking. The resulting system also holds a cost advantage over human annotation and marks a meaningful step toward more credible AI-generated content. Further work could see SAFE applied in settings such as newsrooms and education.
