JDocQA: Japanese Document Question Answering Dataset for Generative Language Models

JDocQA is a dataset for document question answering with generative language models, built specifically for Japanese text. It comprises 11,600 QA instances that require both visual and textual comprehension.

Dataset Features:

5,504 documents in PDF format, covering a broad range of topics.
Annotated questions and answers with document references and bounding boxes.
Multiple question categories, including questions that cannot be answered from the document.
Effectiveness validated with text-based LLMs and multimodal models, as detailed in the JDocQA paper.
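
To make the features above concrete, here is a minimal sketch of how one QA instance might be represented and how unanswerable questions could be separated out. The field names (question, answer, document, page, bbox, category), the placeholder category labels, and the use of a missing answer to mark an unanswerable question are all assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical schema: field names are illustrative, not JDocQA's actual keys.
BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) page coordinates


@dataclass
class QAInstance:
    question: str
    answer: Optional[str]   # assumed: None when the document does not contain the answer
    document: str           # reference to the source PDF
    page: Optional[int]     # page holding the evidence, if any
    bbox: Optional[BBox]    # region of the page holding the evidence, if any
    category: str           # question category label (placeholder values below)

    @property
    def answerable(self) -> bool:
        return self.answer is not None


def split_by_answerability(
    instances: List[QAInstance],
) -> Tuple[List[QAInstance], List[QAInstance]]:
    """Return (answerable, unanswerable) instances, preserving input order."""
    answerable = [i for i in instances if i.answerable]
    unanswerable = [i for i in instances if not i.answerable]
    return answerable, unanswerable


# Made-up records, used only to exercise the sketch.
examples = [
    QAInstance("q1", "a1", "doc_a.pdf", 3, (10.0, 20.0, 200.0, 80.0), "category_a"),
    QAInstance("q2", None, "doc_b.pdf", None, None, "category_unanswerable"),
]
answerable, unanswerable = split_by_answerability(examples)
print(len(answerable), len(unanswerable))
```

Keeping unanswerable questions as ordinary records with an empty answer, rather than in a separate file, is one design option: it lets the same loading and evaluation code handle both cases.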

JDocQA supports research on understanding complex documents in non-English languages, particularly Japanese. It has applications in automating document-based inquiries and may help reduce language model hallucination.
