This study introduces a multi-agent AI framework for evaluating different LLMs. Eight AI agents work in concert to generate code from high-level descriptions, calling the APIs of several LLMs, including GPT-3.5, GPT-4, and Google Bard. A verification agent then checks each generated solution against the HumanEval benchmark. Preliminary results suggest that GPT-3.5 Turbo performs best, establishing a baseline for side-by-side comparison.
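The verification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model calls are replaced by hypothetical hard-coded candidate strings, and `verify` mimics HumanEval-style checking by executing a candidate solution and running unit tests against it.

```python
# Minimal sketch of a verification agent that scores candidate code
# against HumanEval-style unit tests. In the described system, the
# candidates would come from API calls to GPT-3.5, GPT-4, Google Bard, etc.

def verify(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Execute the candidate, then run the benchmark's check() against it."""
    namespace = {}
    try:
        exec(candidate_code, namespace)            # defines the solution
        exec(test_code, namespace)                 # defines check(...)
        namespace["check"](namespace[entry_point]) # raises on failure
        return True
    except Exception:
        return False

# Hypothetical stand-ins for code returned by two model agents.
candidates = {
    "model_a": "def add(a, b):\n    return a + b\n",
    "model_b": "def add(a, b):\n    return a - b\n",  # buggy on purpose
}
tests = "def check(f):\n    assert f(2, 3) == 5\n    assert f(-1, 1) == 0\n"

# Pass/fail per model, the raw material for a side-by-side comparison.
scores = {name: verify(code, tests, "add") for name, code in candidates.items()}
```

Aggregating such pass/fail results over the full benchmark yields the per-model scores used for comparison.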
Assessing LLMs through a specialized multi-agent AI framework is a significant step toward understanding their capabilities and refining their applications across fields. It helps researchers and practitioners select the LLM best suited to their specific needs.