This paper offers a rigorous evaluation of Large Language Models (LLMs) on elementary mathematical reasoning, using a newly commissioned benchmark, GSM1k. By comparing model performance on GSM1k against the established GSM8k benchmark, it investigates whether reported gains reflect genuine reasoning ability or dataset contamination.
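To make the comparison concrete, the sketch below illustrates the kind of GSM8k-to-GSM1k accuracy gap that serves as a signal of contamination-driven overfitting. The model names and accuracy figures are hypothetical placeholders, not values reported by the authors.

```python
# Sketch (not from the paper): treating the accuracy drop from GSM8k to
# GSM1k as evidence of overfitting to potentially contaminated GSM8k data.
# All scores below are made-up placeholders for illustration only.

gsm8k_accuracy = {"model_a": 0.92, "model_b": 0.81}  # hypothetical GSM8k scores
gsm1k_accuracy = {"model_a": 0.78, "model_b": 0.80}  # hypothetical GSM1k scores

def overfit_gap(model: str) -> float:
    """Accuracy drop from GSM8k to GSM1k. A large positive gap suggests the
    model may have memorized GSM8k items rather than learned to reason."""
    return gsm8k_accuracy[model] - gsm1k_accuracy[model]

for model in gsm8k_accuracy:
    print(f"{model}: gap = {overfit_gap(model):+.2f}")
```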
This study provides a critical lens for assessing the actual capabilities and limitations of LLMs in specific cognitive domains such as mathematical reasoning. Future research could focus on refining benchmarking practices and developing methodologies that better distinguish true reasoning from rote memorization.