This research critiques current methods of evaluating Large Vision-Language Models (LVLMs) and presents MMStar, a curated benchmark for rigorously assessing multi-modal capabilities. The study highlights two principal concerns with existing benchmarks: visual content is unnecessary for answering many samples, and unintentional data leakage occurs during training.
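To make the two concerns concrete, here is a minimal sketch (not the paper's actual pipeline) of how one might flag such samples: query a model both with and without the image and compare the outcomes. The sample format and the `predict(question, image)` callable are assumptions for illustration only.

```python
from typing import Callable, Optional, Sequence


def flag_questionable_samples(
    samples: Sequence[dict],
    predict: Callable[[str, Optional[object]], str],
) -> dict:
    """Roughly sort benchmark samples by comparing answers produced
    with and without the image.

    Each sample is assumed to look like:
        {"question": str, "image": <whatever predict consumes>, "answer": str}
    `predict(question, image)` returns the model's answer string;
    passing image=None queries the model in text-only mode.
    """
    visual_needed, visual_unnecessary = [], []
    for sample in samples:
        answer_with_image = predict(sample["question"], sample["image"])
        answer_text_only = predict(sample["question"], None)

        correct_with = answer_with_image.strip() == sample["answer"].strip()
        correct_without = answer_text_only.strip() == sample["answer"].strip()

        # If the model answers correctly without ever seeing the image,
        # the sample either does not require visual content or may have
        # leaked into the model's training data.
        if correct_without:
            visual_unnecessary.append(sample)
        elif correct_with:
            visual_needed.append(sample)

    return {
        "visual_needed": visual_needed,
        "visual_unnecessary_or_leaked": visual_unnecessary,
    }
```

In this sketch, the "visual_unnecessary_or_leaked" bucket captures exactly the kinds of samples the paper argues should be filtered out of a strict multi-modal benchmark.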
The significance of this paper stems from its scrutiny of current LVLM evaluation practices. By building a purer and more stringent benchmark, it offers a clearer pathway for discerning genuine multi-modal proficiency, which is essential for applications such as interactive robotics and enhanced content analysis. Further work along these lines could drive advances in model training and evaluation, leading to better-tuned and more effective multi-modal AI systems. Read the full paper.