ERBench: Hallucination Benchmark for LLMs

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models introduces ERBench, a benchmark for extensive, automatically verifiable evaluation of Large Language Models (LLMs) such as GPT-4. Rather than relying on static question sets, it generates questions dynamically from relational databases and entity-relationship (ER) models. The authors curate questions from a database's schema, records, and functional dependencies, which allows answers to be verified automatically and supports multi-hop questions that probe the reasoning abilities of LLMs.
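To make the mechanism concrete, here is a minimal sketch of how a functional dependency can yield an automatically verifiable question. The movie schema, the question template, and the string-matching check are all illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch (hypothetical schema and template): a functional
# dependency title -> (director, year) fully determines the correct answer
# and the rationale an LLM's explanation should mention.
from dataclasses import dataclass

@dataclass
class Movie:
    title: str
    director: str
    year: int

def make_question(movie: Movie) -> tuple[str, bool, str]:
    """Return a question, its ground-truth answer, and the rationale
    (the FD-determined attribute value) used for verification."""
    question = (
        f"Is there a movie titled '{movie.title}' released in "
        f"{movie.year} and directed by {movie.director}?"
    )
    return question, True, movie.director

movie = Movie(title="The Matrix", director="Lana and Lilly Wachowski", year=1999)
question, gold_answer, rationale = make_question(movie)

llm_response = "Yes. The Matrix (1999) was directed by the Wachowskis."
# Verify both the binary answer and that the rationale surfaces
# in the model's explanation.
correct = (
    llm_response.lower().startswith("yes")
    and "wachowski" in llm_response.lower()
)
print(question, correct)
```

Because the record itself encodes the ground truth, no human grading is needed; checking the response against the FD-determined attributes is enough.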

  • Utilizes relational databases to construct thorough, extensible benchmarks
  • Supports automatic verification of answers based on the schema and functional dependencies
  • Enables creation of multi-hop questions of varying complexity (see the sketch after this list)
  • Suits continuous evaluation and the application of prompt-engineering techniques
  • Experiments reveal the limitations of current LLMs and underline the necessity of correct answer rationales
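The multi-hop case composes functional dependencies across tables. Below is a hedged sketch with hypothetical tables: chaining title → director and director → birth year keeps the gold answer mechanically checkable.

```python
# Hypothetical tables; the join chains two functional dependencies,
# so the multi-hop answer remains automatically verifiable.
movies = {"The Matrix": {"director": "Lilly Wachowski", "year": 1999}}
directors = {"Lilly Wachowski": {"birth_year": 1967}}

def make_multihop_question(title: str) -> tuple[str, int]:
    director = movies[title]["director"]            # hop 1: title -> director
    birth_year = directors[director]["birth_year"]  # hop 2: director -> birth year
    question = f"In what year was the director of '{title}' born?"
    return question, birth_year

question, gold = make_multihop_question("The Matrix")
print(question, gold)  # the gold answer verifies the LLM's response
```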

This paper identifies crucial aspects of LLMs that require improvement and proposes a benchmark that goes beyond simply checking for correct answers. ERBench’s ability to scrutinize reasoning pathways paves the way for greater model transparency and accountability.
