Property | Value |
---|---|
Dataset Size | MMDU conversations reach a maximum of 18k image+text tokens, 20 images, and 27 turns |
Performance | Fine-tuning open-source LVLMs on MMDU-45k produced longer, more accurate conversations and higher benchmark scores |

Generating natural and meaningful responses to multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). This study introduces MMDU, a new benchmark, and MMDU-45k, a dataset, to assess and improve LVLMs in multi-turn, multi-image conversations. Fine-tuning open-source LVLMs on MMDU-45k led them to generate longer and more accurate conversations.
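
To make the scale figures concrete, the following is a minimal sketch of how a multi-turn, multi-image conversation might be represented and checked against the 20-image and 27-turn figures quoted for MMDU. The `Turn` schema, the field names, and both helper functions are hypothetical illustrations, not the actual MMDU data format or tooling.

```python
from dataclasses import dataclass, field

# Figures quoted for MMDU in this document (20 images, 27 turns),
# treated here as upper bounds. Token counts are omitted because they
# depend on a tokenizer that this document does not specify.
MAX_IMAGES = 20
MAX_TURNS = 27


@dataclass
class Turn:
    """One question/answer exchange (hypothetical schema, not MMDU's own)."""
    question: str
    answer: str
    image_ids: list[str] = field(default_factory=list)


def conversation_stats(turns: list[Turn]) -> dict[str, int]:
    """Count the turns and the distinct images referenced in a conversation."""
    images = {img for turn in turns for img in turn.image_ids}
    return {"turns": len(turns), "images": len(images)}


def within_mmdu_scale(turns: list[Turn]) -> bool:
    """Return True if the conversation stays within the quoted bounds."""
    stats = conversation_stats(turns)
    return stats["turns"] <= MAX_TURNS and stats["images"] <= MAX_IMAGES


example = [
    Turn("What differs between image 1 and image 2?", "...", ["img1", "img2"]),
    Turn("Which one was taken at night?", "...", ["img2"]),
]
print(conversation_stats(example))  # {'turns': 2, 'images': 2}
```

Counting distinct image identifiers rather than total references reflects that later turns in a multi-image dialogue often refer back to images introduced earlier.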