Dialog Understanding
Vision-Language Models
Multi-Image Conversations
Data Tuning
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Property: Value
Dataset Size: MMDU conversations contain up to 18k image+text tokens, 20 images, and 27 turns
Performance: Fine-tuning open-source LVLMs on MMDU-45k improved conversation accuracy and benchmark scores

Generating natural and meaningful responses to multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). This study introduces a new benchmark, MMDU, and an instruction-tuning dataset, MMDU-45k, to evaluate and improve LVLMs in multi-turn, multi-image conversations. Fine-tuning open-source LVLMs on MMDU-45k led them to generate longer and more accurate conversations.
