Despite the strong benchmark performance of large language models (LLMs), seemingly simple tasks like flight booking expose their weaknesses. GroundCocoa is a benchmark designed to test LLM agents on compositional and conditional reasoning, cognitive skills that come naturally to humans. Each task requires matching a user’s preferences, often stated as complex, interdependent conditions, against the available flight options. Even the best-performing agent, GPT-4 Turbo, did not exceed 67% accuracy, revealing a clear capability gap and the need for further development.
The insights from this study could help drive future improvements in language models, underscoring the need for stronger reasoning abilities in real-world applications.