JDocQA: Japanese Document Question Answering Dataset for Generative Language Models

JDocQA is a dataset for building and evaluating language models on document question answering. It comprises 11,600 QA instances that require both visual and textual comprehension, and it is designed specifically for Japanese documents.

Dataset Features:

  • 5,504 documents in PDF format, covering a broad range of topics.
  • Annotated questions and answers, each with a reference to its source document and bounding boxes.
  • Multiple categories of questions, including those unanswerable from the document.
  • Effectiveness validated with text-based LLMs and multimodal models, as detailed in the JDocQA paper.
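
To make the annotation features above more concrete, here is a minimal sketch of how one QA instance might be represented in Python. Every field name, the bounding-box layout, and the sample values are assumptions made for illustration; this is not the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class QAInstance:
    """One question-answer pair. All field names here are hypothetical."""
    question: str
    answer: Optional[str]   # None marks a question unanswerable from the document
    document_id: str        # hypothetical reference to the source PDF
    bbox: Optional[Tuple[float, float, float, float]]  # assumed (x0, y0, x1, y1) layout
    category: str           # placeholder label for the question category

    @property
    def answerable(self) -> bool:
        return self.answer is not None


def split_by_answerability(
    instances: List[QAInstance],
) -> Tuple[List[QAInstance], List[QAInstance]]:
    """Partition instances into (answerable, unanswerable) lists."""
    answerable = [i for i in instances if i.answerable]
    unanswerable = [i for i in instances if not i.answerable]
    return answerable, unanswerable


# Made-up sample records, not taken from the dataset.
samples = [
    QAInstance("What is the report's title?", "Annual Report",
               "doc-001", (10.0, 20.0, 200.0, 40.0), "category-a"),
    QAInstance("Who signed the appendix?", None,
               "doc-001", None, "category-b"),
]
answerable, unanswerable = split_by_answerability(samples)
```

Keeping unanswerable questions as explicit records, rather than dropping them, lets an evaluation check whether a model declines to answer when the document gives no support.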

JDocQA supports progress on understanding complex documents in languages other than English, particularly Japanese. Its applications include automating document-based inquiries, and it may help reduce language model hallucination.
