
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models introduces ERBench, a benchmark designed for extensive and verifiable evaluation of Large Language Models (LLMs) such as GPT-4. Moving beyond static benchmarks, it draws on relational databases and entity-relationship (ER) models to generate questions dynamically. The authors construct questions from database schemas, records, and functional dependencies, which enables automatic verification of answers and the creation of multi-hop questions that assess the reasoning abilities of LLMs.
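To make the idea concrete, here is a minimal sketch (not ERBench's actual code) of how a functional dependency in a database table can yield a question whose answer is verifiable automatically. The movie table, the question template, and the containment-based checker below are illustrative assumptions.

```python
# Illustrative records from a hypothetical movie table.
records = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
    {"title": "Parasite", "year": 2019, "director": "Bong Joon-ho"},
]

def make_question(record):
    """Functional dependency (title, year) -> director: the determinant
    uniquely fixes the dependent attribute, so the gold answer is known
    in advance and needs no human annotation."""
    question = (f"Who directed the movie '{record['title']}' "
                f"released in {record['year']}?")
    answer = record["director"]  # ground truth taken from the database
    return question, answer

def verify(llm_response, answer):
    # Simple automatic check: the gold entity must appear in the response.
    return answer.lower() in llm_response.lower()

q, a = make_question(records[0])
print(q)
print(verify("The film was directed by Christopher Nolan.", a))  # True
```

Chaining such dependencies across related tables (e.g. movie to director to birth year) is one way multi-hop questions can be composed while the final answer remains machine-checkable.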
The paper identifies crucial aspects of LLMs that require improvement and proposes a benchmark that goes beyond simply checking whether the final answer is correct. By scrutinizing reasoning pathways, ERBench paves the way for greater model transparency and accountability.