LLMs like GPT-4 are prone to generating factual inaccuracies when answering open-ended, fact-seeking prompts. The study introduces LongFact, a benchmark of GPT-4-generated questions spanning diverse topics for measuring a model’s long-form factuality. It also proposes the Search-Augmented Factuality Evaluator (SAFE), which uses an LLM to split a long-form response into individual statements and fact-check each one against Google Search results. Finally, an extended F1 score is introduced to quantify long-form factuality, balancing the precision of supported facts against recall up to a preferred number of facts K.
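To make the metric concrete, here is a minimal sketch of how such an extended F1 score could be computed, assuming the fact-checking step has already produced counts of supported and not-supported statements; the function and variable names are illustrative, not the paper’s implementation.

```python
def long_form_f1(supported: int, not_supported: int, k: int) -> float:
    """Illustrative long-form factuality score (F1@K-style sketch).

    supported:      statements rated as supported by search evidence
    not_supported:  statements rated as not supported
    k:              preferred number of supported facts (recall target)
    """
    if supported == 0:
        return 0.0  # a response with no supported facts scores zero
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # recall saturates once K supported facts are reached
    return 2 * precision * recall / (precision + recall)


# Example: 40 supported and 10 unsupported statements, with K = 64
print(round(long_form_f1(40, 10, 64), 3))  # ~0.702
```

The choice of K encodes how long a "fully factual" response is expected to be: a small K rewards concise, accurate answers, while a large K demands many supported facts before recall is maximized.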
The research shows that LLM agents can match or exceed human annotators when rating factuality, reinforcing the value of LLMs for automated fact-checking. The resulting system is substantially cheaper than human annotation and marks a significant step toward more credible AI-generated content. Future deployments could see SAFE applied in newsrooms and educational settings.