The paper ‘How Far Are We from Intelligent Visual Deductive Reasoning?’ puts Vision-Language Models (VLMs), including GPT-4V, to the test. While major advances on vision-language tasks have been celebrated, this study probes visual deductive reasoning, using Raven’s Progressive Matrices (RPMs) to benchmark VLMs.
The paper highlights critical gaps on the path towards AI with proficient visual reasoning, a reminder that VLMs have yet to master the art of ‘seeing’. Addressing these challenges will not only refine VLMs but also benefit AI applications that require complex visual interpretation.