
OlympiadBench is an ambitious new benchmark aimed at pushing AI toward human expert-level sophistication. It uses Olympiad-level math and physics problems to assess AI capabilities. The benchmark reveals that even the strongest model tested, GPT-4V, scores only 17.23% on average, indicating that LLMs need substantially better reasoning and problem-solving abilities before they are ready for real-world application.