ReasonEval introduces a new lens through which to measure the quality of reasoning exhibited by LLMs in mathematical tasks. It evaluates this quality based on the \(\textit{validity}\) and \(\textit{redundancy}\) of the reasoning steps, revealing that an increase in final-answer accuracy does not always correlate with improved reasoning quality. Using LLMs designed for automatic assessment, ReasonEval has shown commendable performance in detecting logical errors and step redundancy in complex mathematical problem solving.
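To make the two dimensions concrete, the sketch below shows one plausible way to turn per-step validity and redundancy scores into solution-level scores. The function name, the example scores, and the min/max aggregation rule are all illustrative assumptions, not the paper's exact specification.

```python
from typing import List, Tuple


def aggregate_scores(step_scores: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Aggregate per-step (validity, redundancy) scores into solution-level scores.

    Assumption: a solution is only as valid as its weakest step, and is
    redundant if any step is redundant, so we take the minimum of the
    validity scores and the maximum of the redundancy scores.
    """
    validities = [v for v, _ in step_scores]
    redundancies = [r for _, r in step_scores]
    return min(validities), max(redundancies)


# Hypothetical scores for a three-step solution: step 2 contains a
# logical slip (low validity), step 3 merely restates step 1 (high redundancy).
steps = [(0.95, 0.05), (0.40, 0.10), (0.90, 0.80)]
validity, redundancy = aggregate_scores(steps)
print(validity, redundancy)  # → 0.4 0.8
```

Under this aggregation, a solution that reaches the correct final answer can still receive a low validity score or a high redundancy score, which is exactly the decoupling between accuracy and reasoning quality that ReasonEval highlights.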
The focus on individual reasoning steps provides insight into the nuanced aspects of LLMs’ problem-solving approaches and offers guidance for data selection during training. This step-level approach to evaluation reflects ReasonEval’s potential to improve the quality of educational and analytical tools that leverage LLMs.