This paper introduces a method for enhancing zero-shot Visual Question Answering (VQA) by using image captions as prompts for large language models (LLMs). It presents a comprehensive comparison of state-of-the-art image captioning models, evaluating how each affects VQA performance across different question types.
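The caption-as-prompt idea can be sketched as follows. The helper name `build_vqa_prompt`, the prompt template, and the example caption are illustrative assumptions, not taken from the paper; the captioning model and LLM calls are left abstract:

```python
def build_vqa_prompt(caption: str, question: str) -> str:
    """Combine an image caption and a question into a zero-shot VQA prompt.

    The caption (produced by any image captioning model) serves as the
    textual context, so the LLM never sees the image directly.
    """
    return (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Illustrative example: the caption would come from a captioning model,
# and the resulting prompt would be sent to an LLM for the answer.
caption = "A brown dog is catching a red frisbee in a park."
prompt = build_vqa_prompt(caption, "What color is the frisbee?")
print(prompt)
```

The prompt is then passed to the LLM, whose completion is taken as the zero-shot VQA answer.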
The study underscores the synergy between image captioning and LLMs in strengthening visual reasoning, suggesting a promising direction for visual-language understanding. It also motivates further exploration of zero-shot learning and its implications for AI's perceptual and contextual awareness.