Even state-of-the-art Vision-Language Models (VLMs) such as GPT-4V struggle with visual deductive reasoning, a vital cognitive capability. This work examines their performance on complex tasks such as Raven's Progressive Matrices, revealing a significant gap between their text-based and image-based reasoning abilities.
These findings indicate that while LLMs show promise in text-based reasoning, visual reasoning intelligence remains in its infancy. Developing VLMs that can effectively interpret abstract visual cues is essential for the next wave of AI breakthroughs.