The development of Large Language Models (LLMs) has been hampered by factual inaccuracies in their outputs. The breakthrough research uses GPT-4 to create LongFact, a prompt set for benchmarking a model's long-form factuality, and proposes SAFE (Search-Augmented Factuality Evaluator), a multi-step reasoning process that breaks a long-form response into individual facts and verifies each one against search results.
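As a rough illustration of that multi-step flow, the sketch below mimics a SAFE-style evaluation loop: decompose the response into facts, keep only the facts relevant to the prompt, and rate each fact against retrieved evidence. The helper names (`llm`, `web_search`, `split_into_facts`) and the sentence-level splitting are simplifying assumptions, not the paper's actual prompts or rater; in SAFE proper, the decomposition, relevance check, and rating are all done by an LLM backed by Google Search.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the pipeline's backends; SAFE itself uses an
# instruction-tuned LLM as the rater and Google Search for evidence.
def llm(prompt: str) -> str:
    return "SUPPORTED"  # placeholder model response

def web_search(query: str) -> str:
    return "search results ..."  # placeholder retrieved evidence

@dataclass
class FactVerdict:
    fact: str
    supported: bool

def split_into_facts(response: str) -> list[str]:
    # Step 1: decompose the long-form response into self-contained facts.
    # Simplified here to one fact per sentence; SAFE uses an LLM for this.
    return [s.strip() for s in response.split(".") if s.strip()]

def is_relevant(fact: str, prompt: str) -> bool:
    # Step 2: drop facts that do not address the original prompt.
    return llm(f"Is '{fact}' relevant to '{prompt}'?") != "IRRELEVANT"

def rate_fact(fact: str) -> bool:
    # Step 3: issue a search query and ask the rater whether the
    # retrieved evidence supports the fact.
    evidence = web_search(fact)
    verdict = llm(f"Fact: {fact}\nEvidence: {evidence}\nSupported?")
    return verdict == "SUPPORTED"

def safe_evaluate(prompt: str, response: str) -> list[FactVerdict]:
    facts = split_into_facts(response)
    relevant = [f for f in facts if is_relevant(f, prompt)]
    return [FactVerdict(f, rate_fact(f)) for f in relevant]

if __name__ == "__main__":
    verdicts = safe_evaluate(
        "Tell me about the Eiffel Tower.",
        "The Eiffel Tower is in Paris. It was completed in 1889.",
    )
    supported = sum(v.supported for v in verdicts)
    print(f"{supported}/{len(verdicts)} facts supported")
```

The per-fact verdicts can then be aggregated into a precision- or recall-style factuality score for the whole response, which is what makes the approach usable as a benchmark rather than a one-off check.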
The study demonstrates how SAFE improves the evaluation of LLM reliability, which is crucial for adoption in information-sensitive domains. In doing so, it brings LLMs closer to practical use in industries where fact-checking is paramount, such as journalism and education.