The recent study ‘Are We on the Right Way for Evaluating Large Vision-Language Models?’ by Lin Chen et al. challenges the efficacy of current evaluation benchmarks for LVLMs. It reveals that many benchmark questions can be answered without the visual content, leading to misleading judgments of LVLM capabilities. To counter this, the researchers introduced MMStar, a new benchmark of 1,500 human-curated samples that evaluates six core multimodal abilities across 18 detailed axes.
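To make the idea of visual indispensability concrete, the sketch below shows one way such a screen could work: a text-only model is asked each question without its image, and any sample it still answers correctly is flagged as not truly requiring vision. The sample format and the `answer_without_image` callable are illustrative assumptions, not the paper's actual curation pipeline.

```python
# Minimal sketch of a "visual indispensability" screen, assuming a simple
# multiple-choice sample format. The callable and field names are hypothetical.
from typing import Callable, Dict, List


def find_non_visual_samples(
    samples: List[Dict[str, object]],
    answer_without_image: Callable[[str, List[str]], str],
) -> List[Dict[str, object]]:
    """Return samples that a text-only model answers correctly,
    i.e. samples whose image is not actually needed."""
    flagged = []
    for sample in samples:
        prediction = answer_without_image(sample["question"], sample["options"])
        if prediction.strip().upper() == str(sample["answer"]).strip().upper():
            flagged.append(sample)
    return flagged


if __name__ == "__main__":
    # Toy stand-in for a text-only LLM that never sees the image.
    def dummy_llm(question: str, options: List[str]) -> str:
        return "A"  # always guesses the first option

    demo = [
        {"question": "What color is the car in the image?",
         "options": ["A. red", "B. blue"], "answer": "A"},
    ]
    hits = find_non_visual_samples(demo, dummy_llm)
    print(f"{len(hits)} sample(s) answerable without the image")
```

A screen like this would only flag candidates; human review, as used for MMStar's sample selection, remains the final arbiter.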
The paper highlights the necessity of refined benchmarks like MMStar to steer future LVLM development effectively. By focusing on visual indispensability and advanced multimodal capabilities, MMStar could reshape multimodal AI research and applications.