Behold the alchemy of language and vision, where CLIP and its kin have set the stage ablaze with their stellar generalization. Yet the adaptation conundrum lingers: to fine-tune or not to fine-tune? The authors of this paper propose a middle ground: learning prompts from text supplied by large language models (LLMs) themselves. Prompts, you see, are the whispered incantations that coax a frozen model into brilliance; here they are tuned using only LLM-written class descriptions, without the crutch of any labeled images. Here’s what this arcana involves:
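To make the incantation concrete, here is a minimal, hypothetical sketch of text-only prompt learning: a small set of learnable prompt vectors is prepended to a class name and optimized so that the frozen text encoder's output matches the embedding of an LLM-written description of that class. The toy encoder, token ids, and all hyperparameters below are stand-ins of my own invention, not the paper's actual architecture; the point is only that the supervision signal is text-to-text, with no images involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a frozen CLIP-style text encoder (a hypothetical toy model,
# not the real CLIP): mean-pool token embeddings, project, L2-normalize.
class ToyTextEncoder(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, embeds):          # embeds: (seq_len, dim)
        return F.normalize(self.proj(embeds.mean(dim=0)), dim=-1)

enc = ToyTextEncoder()
for p in enc.parameters():
    p.requires_grad_(False)             # the encoder stays frozen throughout

# Learnable prompt vectors: the ONLY trainable parameters.
n_ctx, dim = 4, 32
prompt = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

# Token ids standing in for a class name and an LLM-written description of it.
class_ids = torch.tensor([5, 17])
llm_desc_ids = torch.tensor([5, 17, 42, 7, 23, 61])

with torch.no_grad():
    # Frozen embedding of the LLM text: the text-only supervision target.
    target = enc(enc.tok(llm_desc_ids))

opt = torch.optim.Adam([prompt], lr=0.1)
for step in range(200):
    opt.zero_grad()
    seq = torch.cat([prompt, enc.tok(class_ids)], dim=0)
    pred = enc(seq)
    loss = 1 - (pred * target).sum()    # cosine distance, text-to-text only
    loss.backward()
    opt.step()

print(f"final cosine similarity: {(pred * target).sum().item():.3f}")
```

After training, the prompt vectors can be reused with other class names at inference time; only the cosine-matching objective and the frozen encoder are essential to the idea sketched here.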
Interested in the secret scrolls? The code is available on GitHub.
In a world where images reign supreme, this text-only supervision is crucial: it yields prompts that transfer across datasets and classes rather than overfitting to one. Because the prompts are learned from language alone, a single set can serve many tasks within the visual kingdom, making for a rather compelling chapter in the vision-language saga.