
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models introduces ERBench, a benchmark designed for extensive and verifiable evaluation of Large Language Models (LLMs) such as GPT-4. Moving beyond static benchmarks, it draws on relational databases and entity-relationship (ER) models to generate questions dynamically. The authors construct questions from database schemas, records, and functional dependencies, which enables automatic verification of answers and the creation of multi-hop questions that assess the reasoning abilities of LLMs.
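To make the idea concrete, here is a minimal sketch (not ERBench's actual code) of how a functional dependency in a database table can yield a question whose answer is verifiable automatically. The movie table, the question template, and the containment-based checker below are illustrative assumptions.

```python
# Illustrative records from a hypothetical movie table.
records = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
    {"title": "Parasite", "year": 2019, "director": "Bong Joon-ho"},
]

def make_question(record):
    """Functional dependency (title, year) -> director: the determinant
    uniquely fixes the dependent attribute, so the gold answer is known
    in advance and needs no human annotation."""
    question = (f"Who directed the movie '{record['title']}' "
                f"released in {record['year']}?")
    answer = record["director"]  # ground truth taken from the database
    return question, answer

def verify(llm_response, answer):
    # Simple automatic check: the gold entity must appear in the response.
    return answer.lower() in llm_response.lower()

q, a = make_question(records[0])
print(q)
print(verify("The film was directed by Christopher Nolan.", a))  # True
```

Chaining such dependencies across related tables (e.g. movie to director to birth year) is one way multi-hop questions can be composed while the final answer remains machine-checkable.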
The paper identifies crucial aspects of LLMs that require improvement and proposes a benchmark that goes beyond simply checking whether the final answer is correct. By scrutinizing reasoning pathways, ERBench paves the way for greater model transparency and accountability.