Alignment Benchmarks#
IFEval#
Note
Instruction-Following Evaluation for Large Language Models [ZLM+23]
One core capability of Large Language Models (LLMs) is to follow natural language instructions. We introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of “verifiable instructions” such as “write in more than 400 words” and “mention the keyword of AI at least 3 times”. We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions.
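To make “verifiable” concrete, the sketch below shows how the two instructions quoted above could be checked programmatically. This is an illustrative Python sketch under assumed rules, not the official IFEval implementation; the function names and the strict all-instructions-must-pass criterion are our own simplifications.

```python
import re

def check_min_word_count(response: str, min_words: int = 400) -> bool:
    # "Write in more than 400 words": count whitespace-delimited tokens.
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    # "Mention the keyword AI at least 3 times": whole-word, case-insensitive match.
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, flags=re.IGNORECASE)
    return len(matches) >= min_count

def verify_prompt(response: str, checks) -> bool:
    # Strict scoring: the response passes only if every attached instruction is satisfied.
    return all(check(response) for check in checks)

response = "AI is everywhere. AI helps writing. AI can also verify instructions."
# False here: the keyword check passes, but the word-count check does not.
print(verify_prompt(response, [check_min_word_count, check_keyword_frequency]))
```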
Arena-Hard#
Arena-Hard-Auto-v0.1 is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as a judge to compare each model’s responses against a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest correlation with, and separability on, Chatbot Arena among popular open-ended LLM benchmarks. If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
Although both Arena-Hard-Auto and Chatbot Arena Category Hard employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses an automatic judge as a cheaper and faster approximation of human preference.
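The sketch below illustrates the core pairwise LLM-as-judge step: for one prompt, a judge model compares a candidate answer against the baseline answer and returns a verdict. It assumes the official `openai` Python client; the system prompt wording, verdict scale, and helper name are simplified placeholders, not the exact Arena-Hard-Auto judge prompt, which also swaps answer positions and aggregates verdicts into a win rate.

```python
from openai import OpenAI  # assumes the official openai Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Simplified judge instruction; the real Arena-Hard-Auto prompt is more detailed.
JUDGE_SYSTEM = (
    "You are an impartial judge. Compare two assistant answers to the same user "
    "question and output exactly one verdict: A>>B, A>B, A=B, B>A, or B>>A."
)

def judge_pair(question: str, baseline_answer: str, candidate_answer: str,
               judge_model: str = "gpt-4-turbo") -> str:
    """Ask the judge model for a pairwise verdict: A = baseline, B = candidate."""
    user_msg = (
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{baseline_answer}\n\n"
        f"[Answer B]\n{candidate_answer}"
    )
    reply = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_msg},
        ],
    )
    return reply.choices[0].message.content.strip()

# Usage: run judge_pair over all 500 prompts with the GPT-4-0314 baseline answers,
# then aggregate the verdicts into the candidate model's win rate.
```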