A recent study titled ‘Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts’ investigates image captioning as an intermediate step in the visual question answering (VQA) pipeline: a captioning model describes the image with the question in mind, and the resulting question-driven caption is passed to a large language model (LLM) as part of the prompt.
This approach elegantly bridges vision and language, letting a text-only LLM reason over descriptions of visual content. It’s impressive to see how question-tailored captions can prime models for better zero-shot reasoning, opening the door to richer human-machine dialogue.
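To make the idea concrete, here is a minimal sketch of what such a caption-then-answer pipeline could look like, assuming Hugging Face transformers with BLIP as the captioner and a placeholder text LLM. The model choices and the keyword-prefix trick for question-conditioning are illustrative assumptions on my part, not the paper’s exact method.

```python
# Sketch of a caption-then-answer VQA pipeline (illustrative assumptions:
# model names and the question-conditioning heuristic are not from the paper).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline


def question_driven_caption(image_path: str, question: str) -> str:
    """Generate a caption biased toward the question's subject matter."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    image = Image.open(image_path).convert("RGB")

    # One simple way to make the caption "question-driven": seed BLIP's
    # conditional captioning with content words pulled from the question.
    stop = {"what", "is", "are", "the", "a", "an", "of", "in", "on", "how", "many", "color"}
    keywords = [w for w in question.lower().rstrip("?").split() if w not in stop]
    prefix = "a photo of " + " ".join(keywords) if keywords else "a photo of"

    inputs = processor(image, prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)


def answer_with_llm(caption: str, question: str) -> str:
    """Prompt a text-only LLM with the caption standing in for the image."""
    # Placeholder model; an instruction-tuned LLM would be a stronger choice.
    generator = pipeline("text-generation", model="gpt2")
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_new_tokens=16, do_sample=False, return_full_text=False)
    return result[0]["generated_text"].strip()


if __name__ == "__main__":
    question = "What color is the dog's collar?"
    caption = question_driven_caption("photo.jpg", question)  # hypothetical local file
    print(answer_with_llm(caption, question))
```

The appeal of this decomposition is that the LLM never sees pixels: all visual grounding is squeezed into the caption, so conditioning the caption on the question determines how much answer-relevant detail survives the bottleneck.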