The paper addresses two fundamental issues in evaluating LVLMs: first, multi-modal gains are overestimated because models can often infer answers without any visual input; second, unintentional data leakage has allowed LVLMs to memorize certain questions that should require the image to answer. To combat these issues, the researchers propose MMStar, a new benchmark designed to better evaluate the genuine multi-modal capabilities of LVLMs.
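To make the first pitfall concrete, here is a minimal sketch (not the authors' code) of a text-only probe: re-run a multiple-choice benchmark with the image withheld, and flag items the model still answers correctly as candidates for being solvable without vision. The `Sample` fields and the `ask` adapter are illustrative assumptions, not an actual evaluation API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    question: str
    choices: list[str]
    answer: str                     # ground-truth option letter, e.g. "A"
    image_path: Optional[str] = None

def text_only_accuracy(samples: list[Sample],
                       ask: Callable[[str, Optional[str]], str]) -> float:
    """Fraction of questions answered correctly with the image withheld.

    `ask(prompt, image_path)` is a hypothetical adapter around whatever
    LVLM is being probed; passing image_path=None sends the question alone.
    A high score suggests the items do not actually require vision.
    """
    correct = 0
    for s in samples:
        prompt = s.question + "\nOptions: " + " ".join(s.choices)
        prediction = ask(prompt, None)          # image deliberately withheld
        correct += prediction.strip().upper().startswith(s.answer)
    return correct / len(samples) if samples else 0.0

# Example with a trivial stand-in "model" that always answers "A":
if __name__ == "__main__":
    dummy = [Sample("What colour is the car in the image?",
                    ["(A) red", "(B) blue"], "A")]
    print(text_only_accuracy(dummy, lambda prompt, img: "A"))  # -> 1.0
```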
**Key Takeaways:**
- Current benchmarks overestimate multi-modal gains because many questions can be answered without the image.
- Unintentional data leakage lets LVLMs memorize answers to questions that should require visual input.
- MMStar is proposed as a benchmark of vision-dependent questions for a more faithful evaluation of multi-modal capability.
**Why is this important?** This paper pinpoints pitfalls in current LVLM evaluations, prompting a re-think of the benchmarks we rely on. Understanding the true capabilities of multi-modal models is crucial for advancing AI in image and language processing, and the proposed MMStar benchmark, together with the study's insights, could pave the way for more reliable and effective LVLMs.
Read the full paper here: *Are We on the Right Way for Evaluating Large Vision-Language Models?*