The paper introduces FoFo, a groundbreaking benchmark for assessing large language models' format-following ability, a capability essential for AI agent applications that current benchmarks do not adequately measure. FoFo was built through AI-human collaboration and covers real-world document formats and instructions. The findings reveal a significant gap between open-source models (e.g., Llama 2, WizardLM) and closed-source models (e.g., GPT-4, PaLM 2, Gemini), show that format-following ability is largely independent of content-generation quality, and vary considerably across domains.
Key points from the study include the importance of specialized tuning for format-following capabilities and the suggestion that FoFo could guide the selection of domain-specific AI agents. The benchmark is publicly available.
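To make the distinction between format-following and content quality concrete, here is a minimal, hypothetical sketch of what a rule-based format check might look like for a JSON-style instruction. The function name, keys, and example responses are invented for illustration; FoFo itself spans many domains and free-form formats, so this is not the paper's evaluation protocol.

```python
# Toy illustration (not the FoFo method): check whether a model response
# follows a hypothetical instruction asking for JSON with specific keys.
import json


def follows_json_format(response: str, required_keys: list[str]) -> bool:
    """Return True if `response` parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(parsed, dict):
        return False  # expected a JSON object, got something else
    return all(key in parsed for key in required_keys)


# Hypothetical medical-report instruction requiring three fields.
required = ["patient_id", "diagnosis", "medications"]
good = '{"patient_id": "A123", "diagnosis": "flu", "medications": ["oseltamivir"]}'
bad = "Patient A123 has the flu and was prescribed oseltamivir."

print(follows_json_format(good, required))  # True  -> format followed
print(follows_json_format(bad, required))   # False -> content may be correct, format is not
```

The second example captures the paper's point: a response can contain the right content while still failing the requested format, which is exactly the gap FoFo is designed to measure.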
In my opinion, this paper is important because it sheds light on an often-overlooked aspect of LLMs and pushes for advances in domain-specific applications of AI. The findings could spur further research into improving AI autonomy and precision in professional environments.