Fangyun Wei, Xi Chen, and Lin Luo question the effectiveness of multiple-choice question answering (MCQA) for evaluating LLMs. They introduce the RWQ-Elo rating system, in which 24 LLMs, including GPT-4, GPT-3.5, and LLaMA-1/-2, compete in a two-player format with GPT-4 acting as the judge. The accompanying ‘Real-world questions’ (RWQ) benchmark comprises over 20,000 authentic user inquiries, providing a practical evaluation scenario. The RWQ-Elo system aims to mirror real-world usage more closely than previous benchmarks such as AlpacaEval and MT-Bench. Full details of the research and a comparative analysis of the system are available in the paper.
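The summary does not spell out the paper's exact update rule, but the core of any Elo-style pairwise rating is simple to illustrate. The sketch below shows a generic Elo update applied to judge verdicts; the K-factor, initial rating, and model names are illustrative assumptions, not details taken from the RWQ-Elo paper.

```python
# Minimal sketch of a pairwise Elo update, as one plausible core of a
# system like RWQ-Elo. K-factor, initial rating, and tie handling are
# generic Elo conventions (assumptions), not the paper's actual settings.
from collections import defaultdict

K = 32             # assumed K-factor
INIT_RATING = 1000 # assumed starting rating

ratings = defaultdict(lambda: INIT_RATING)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a: str, model_b: str, score_a: float) -> None:
    """score_a: 1.0 if the judge prefers A, 0.0 if B, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Example: the judge prefers one model's answer to a real-world question.
update("model-a", "model-b", score_a=1.0)
print(dict(ratings))
```

In such a scheme, each judged head-to-head comparison on an RWQ question shifts the two models' ratings toward the observed outcome, so rankings emerge from many pairwise matches rather than from fixed multiple-choice scores.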
By introducing the RWQ-Elo rating system, this work could significantly influence how LLMs are developed and assessed. The emphasis on real-world inquiries helps ensure that benchmarks for these models are not only robust but also relevant to practical applications, potentially steering future AI research and deployments in a more user-centric direction.