MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Dialog Understanding

Vision-Language Models

Multi-Image Conversations

Data Tuning

Property	Value
Dataset Size	MMDU dialogues contain up to 18k image+text tokens, 20 images, and 27 turns
Performance	Fine-tuning open-source LVLMs on MMDU-45k produced longer, more accurate conversations and higher benchmark scores

Generating natural and meaningful responses to multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). This study introduces MMDU, a benchmark, and MMDU-45k, an instruction-tuning dataset, to evaluate and improve LVLMs in multi-turn, multi-image conversations. Fine-tuning open-source LVLMs on MMDU-45k yielded longer and more accurate conversations.
