JDocQA: Japanese Document Question Answering Dataset for Generative LMs

LLM

Document QA

Multimodal Models

Hallucination Generation

Dataset

The JDocQA paper introduces a large-scale dataset for document question answering in Japanese. Questions draw on both the visual and the textual information in a document, and the dataset is intended for model training and evaluation. With 5,504 PDF documents and over 11,600 QA instances, it provides a foundation for testing and refining multimodal and text-only large language models in real-world scenarios.

Dedicated Japanese QA dataset including unanswerable questions.
Requires both visual and textual understanding for accurate answering.
Provides a platform to address ‘hallucination generation’ in LLMs.
Enables cross-model and cross-task analyses.
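
To make the role of unanswerable questions concrete, the following is a minimal Python sketch of one way to measure how often a model produces an answer when none exists. The `QAInstance` layout, the `REFUSAL` marker, and the `hallucination_rate` helper are all hypothetical names chosen for illustration; they are not the dataset's actual schema or the paper's evaluation protocol, and the sample questions are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAInstance:
    """Hypothetical record layout, not the actual JDocQA schema."""
    question: str
    answer: Optional[str]  # None marks an unanswerable question


# Assumed marker that a model emits when it declines to answer.
REFUSAL = "unanswerable"


def hallucination_rate(instances, predictions):
    """Fraction of unanswerable questions for which the model still answered."""
    unanswerable = [
        pred for inst, pred in zip(instances, predictions) if inst.answer is None
    ]
    if not unanswerable:
        return 0.0
    hallucinated = sum(
        1 for pred in unanswerable if pred.strip().lower() != REFUSAL
    )
    return hallucinated / len(unanswerable)


# Toy, made-up examples: one answerable and two unanswerable questions.
examples = [
    QAInstance("What year is shown on the cover?", "2020"),
    QAInstance("Who signed the report?", None),
    QAInstance("What is the total budget?", None),
]
preds = ["2020", "Taro Yamada", "unanswerable"]
rate = hallucination_rate(examples, preds)
```

In this toy run the model answers one of the two unanswerable questions, so half of them count as hallucinated. A real evaluation would need a more robust way to detect refusals than an exact string match.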

JDocQA is a significant contribution to document-based question answering, supporting progress in how AI models understand and interact with complex document structures in non-English languages.
