The growing use of Large Language Models (LLMs) in daily operations necessitates rigorous evaluation methods. This paper presents preliminary results from a multi-agent AI model for assessing different LLMs. The agents execute code-retrieval tasks across a range of language models, and a verification agent validates the outputs, providing insight into the models' comparative performance.
This evaluation method has the potential to standardize LLM performance assessment, offering valuable metrics to guide further improvement in the development of these complex systems.

Evaluating LLMs