Research on vision-language models suggests shifting prompt learning from visual supervision to text-only supervision, cutting the cost of LLM prompt generation. By learning prompts from LLM-derived text data, the approach aims for zero-shot transfer to new classes (a minimal sketch of this idea follows the list below). Key insights include:
- The findings point to a synergistic combination of vision and language models as a promising avenue for future work in prompt learning.
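To make the idea concrete, here is a minimal sketch of text-only prompt learning, assuming a CLIP-style frozen text encoder. The toy encoder, the `encode_tokens` / `encode_from_embeddings` helpers, the token ids standing in for class names and LLM-generated descriptions, and all hyperparameters are illustrative placeholders, not the paper's actual implementation: only the learnable context vectors are trained, and the supervision comes entirely from text.

```python
# Hypothetical sketch: learn prompt (context) vectors from LLM-derived text only.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512
N_CTX = 4          # number of learnable context tokens
VOCAB = 1000       # toy vocabulary size
N_CLASSES = 8

class FrozenTextEncoder(nn.Module):
    """Placeholder for a pretrained, frozen text encoder (e.g. CLIP's)."""
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)  # backbone stays frozen

    def encode_tokens(self, token_ids):
        return self.token_embed(token_ids)

    def encode_from_embeddings(self, embeds):
        # Mean-pool the sequence into a single text feature.
        return self.encoder(embeds).mean(dim=1)

encoder = FrozenTextEncoder()
# Learnable prompt vectors shared across classes: the only trained parameters.
ctx = nn.Parameter(torch.randn(N_CTX, EMBED_DIM) * 0.02)
optimizer = torch.optim.Adam([ctx], lr=2e-3)

# Toy "LLM-derived" supervision: token ids for short class names and for richer
# LLM-generated descriptions of the same classes (both hypothetical here).
class_name_ids = torch.randint(0, VOCAB, (N_CLASSES, 5))
llm_description_ids = torch.randint(0, VOCAB, (N_CLASSES, 20))

for step in range(100):
    # Prompted class embedding: [learned context ; class-name token embeddings].
    name_embeds = encoder.encode_tokens(class_name_ids)
    prompted = torch.cat(
        [ctx.unsqueeze(0).expand(N_CLASSES, -1, -1), name_embeds], dim=1
    )
    prompt_feat = F.normalize(encoder.encode_from_embeddings(prompted), dim=-1)

    # Target: embedding of the LLM-generated description (no images involved).
    with torch.no_grad():
        desc_embeds = encoder.encode_tokens(llm_description_ids)
        desc_feat = F.normalize(encoder.encode_from_embeddings(desc_embeds), dim=-1)

    # Pull each prompted class feature toward its own description (contrastive).
    logits = prompt_feat @ desc_feat.t() / 0.07
    loss = F.cross_entropy(logits, torch.arange(N_CLASSES))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the learned context vectors are class-agnostic, they can be prepended to the names of unseen classes at test time, which is what enables the zero-shot transfer described above; the contrastive loss and pooling choices here are assumptions for illustration.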