Paper Presentation

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

- By Xiaosen Zheng, Ph.D. Student, Singapore Management University

Download the Research Paper

In the rapidly evolving world of AI, evaluating large language models (LLMs) effectively and fairly has become critical. Automated benchmarking tools such as AlpacaEval and Arena-Hard-Auto have emerged as popular solutions, offering scalable alternatives to human evaluation. However, as these benchmarks grow in influence, questions about their vulnerability to manipulation are surfacing.

Join us for an insightful session featuring Xiaosen Zheng, whose latest research shows that even "null models", which return the same constant, irrelevant response regardless of the input instruction, can achieve top-ranked win rates on these automatic benchmarks. This session will delve into the mechanics of LLM-as-judge evaluation, the weaknesses in current benchmarking systems, and the essential steps needed to create robust, cheat-resistant benchmarks.
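To make the core idea concrete, here is a minimal sketch of what such a "null model" looks like: a "model" that ignores the instruction entirely and always emits one fixed string, evaluated by an LLM-as-judge loop. The `judge_prefers` helper and the constant string below are hypothetical placeholders for illustration, not the authors' actual cheating response or evaluation code.

```python
# Minimal sketch (not the authors' code): a "null model" that ignores its
# input and always returns one constant string, the kind of response the
# paper shows can still win under an LLM-as-judge benchmark.

# Placeholder constant; the paper crafts the actual constant string to
# exploit the judge's prompt template.
FIXED_RESPONSE = "This is the same response, returned for every instruction."

def null_model(instruction: str) -> str:
    """Return the identical response no matter what the instruction is."""
    return FIXED_RESPONSE

def win_rate(instructions, judge_prefers) -> float:
    """Hypothetical benchmark loop: count how often an LLM judge
    (judge_prefers) picks the null model's output over a baseline answer."""
    wins = 0
    for instruction in instructions:
        candidate = null_model(instruction)
        if judge_prefers(instruction, candidate):
            wins += 1
    return wins / len(instructions)
```

The point of the paper is that a response produced this way carries no information about the instruction, yet a carefully chosen constant can still sway the automatic judge, which is why win rates alone are not proof of model quality.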


Meet our Speaker:

Xiaosen Zheng

Xiaosen Zheng is a fifth-year computer science Ph.D. student at Singapore Management University, supervised by Professor Jing Jiang. He is also a Visiting Scholar at CMU, under the guidance of Professor Chenyan Xiong. Previously, he had the privilege of collaborating with Tianyu Pang, Chao Du, Qian Liu, and Min Lin at Sea AI Lab. His research focuses on Data-Centric AI and AI Safety, and he has published at top-tier conferences including ACL, ICLR, ICML, and NeurIPS.

(Homepage: https://xszheng2020.github.io)