The advancement of vision-extended LLMs (VLLMs) has been remarkable, yet their capability in handling long-tail entities lags behind. SnapNTell is introduced as a benchmark with a tailored dataset for entity-centric Visual Question Answering (VQA), testing models on recognizing entities and providing detailed knowledge about them.
The research underscores the need to handle complex queries about rare, long-tail entities in VQA, a capability that can yield more factual and detailed responses from AI systems. It positions retrieval augmentation as a cornerstone for building more accurate and knowledgeable multimodal language models.
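To make the retrieval-augmentation idea concrete, here is a minimal, hypothetical sketch of an entity-centric VQA pipeline: a recognizer identifies the entity in an image, a knowledge store is queried for facts about it, and the answer is grounded in what was retrieved. Every name here (`recognize_entity`, `KNOWLEDGE_BASE`, `answer`) is illustrative and not taken from the SnapNTell paper or any real library.

```python
# Hypothetical sketch of retrieval-augmented, entity-centric VQA.
# All functions and data below are illustrative stand-ins, not a real API.

# A toy knowledge store; a real system would query a large external source.
KNOWLEDGE_BASE = {
    "Eiffel Tower": "A wrought-iron lattice tower in Paris, completed in 1889.",
    "Shiba Inu": "A small Japanese hunting-dog breed with a fox-like face.",
}

def recognize_entity(image_tag: str) -> str:
    """Stand-in for a visual entity recognizer; here it passes the tag through."""
    return image_tag

def retrieve_facts(entity: str) -> str:
    """Look up facts about the entity; an empty string means retrieval failed."""
    return KNOWLEDGE_BASE.get(entity, "")

def answer(image_tag: str, question: str) -> str:
    """Compose an answer grounded in retrieved facts rather than model memory."""
    entity = recognize_entity(image_tag)
    facts = retrieve_facts(entity)
    if not facts:
        # Without retrieved evidence, hedge instead of hallucinating details.
        return f"This may be {entity}, but no reliable facts were retrieved."
    return f"This is the {entity}. {facts}"

print(answer("Eiffel Tower", "What is this landmark?"))
```

The key design point the sketch illustrates is that detailed claims (dates, places) come from the retrieved facts, not from the generator, which is what makes responses about long-tail entities more factual.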