This paper introduces a method for enhancing zero-shot Visual Question Answering (VQA) by using image captions as prompts for large language models (LLMs). It presents a comprehensive comparison of state-of-the-art image captioning models, evaluating how each affects VQA performance across different question types.
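The caption-as-prompt idea can be sketched as follows. The helper name `build_vqa_prompt`, the prompt template, and the example caption are illustrative assumptions, not taken from the paper; the captioning model and LLM calls are left abstract:

```python
def build_vqa_prompt(caption: str, question: str) -> str:
    """Combine an image caption and a question into a zero-shot VQA prompt.

    The caption (produced by any image captioning model) serves as the
    textual context, so the LLM never sees the image directly.
    """
    return (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Illustrative example: the caption would come from a captioning model,
# and the resulting prompt would be sent to an LLM for the answer.
caption = "A brown dog is catching a red frisbee in a park."
prompt = build_vqa_prompt(caption, "What color is the frisbee?")
print(prompt)
```

The prompt is then passed to the LLM, whose completion is taken as the zero-shot VQA answer.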
The study underscores the synergy between image captioning and LLMs in strengthening visual reasoning, suggesting a promising direction for visual-language understanding. It also motivates further exploration of zero-shot learning and its implications for AI's perceptual and contextual awareness.