EvalPlus

Note

EvalPlus[LXWZ23] is a rigorous evaluation framework for LLM4Code, with:

✨ HumanEval+: 80x more tests than the original HumanEval!
✨ MBPP+: 35x more tests than the original MBPP!

Quick Start

Code Correctness Evaluation: HumanEval(+) or MBPP(+)

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy

Question: Which function does the evalplus.evaluate command actually call?

Answer: setup.cfg contains the following configuration:

[options.entry_points]
console_scripts =
    evalplus.evaluate = evalplus.evaluate:main
    evalplus.inputgen = evalplus.inputgen:main
    evalplus.sanitize = evalplus.sanitize:main
    evalplus.syncheck = evalplus.syncheck:main
    evalplus.codegen  = evalplus.codegen:main
    evalplus.evalperf = evalplus.evalperf:main

So the entry point is the main function in evalplus/evaluate.py.
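Each console_scripts entry has the form `command = module:function`; on installation, a small wrapper script is generated that imports the module and calls the function. The sketch below resolves such a spec by hand to illustrate the idea. The helper name `resolve_entry_point` is hypothetical, and `json:dumps` is used as a stand-in so the example runs without evalplus installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'module:attr' spec (as used in console_scripts) to a Python object."""
    module_name, _, attr = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes, e.g. "pkg.mod:Class.method".
    for part in attr.split(".") if attr else []:
        obj = getattr(obj, part)
    return obj


# Stand-in spec; with evalplus installed, "evalplus.evaluate:main" would resolve the same way.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```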

Question: evalplus/evaluate.py contains:

def main():
    from fire import Fire

    Fire(evaluate)


if __name__ == "__main__":
    main()

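What does Fire(evaluate) do here, and why not call evaluate directly?

Answer: Fire(evaluate) uses Google's fire library to turn the evaluate function into a command-line interface (CLI). Calling evaluate directly from main would require parsing the command-line arguments by hand. With fire, when the script is run as: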

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" --dataset humaneval --backend vllm --greedy

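fire parses the command-line arguments and matches them against the parameter list of evaluate.
It converts each argument value to an appropriate type (for example, string or boolean).
It calls evaluate(**parsed_args), passing in the parsed arguments.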
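The three steps above can be imitated with the standard library alone. The following is a deliberately simplified sketch, not fire's actual implementation: `mini_fire`, `parse_value`, and the toy `evaluate` signature are all hypothetical, and only `--flag value` and bare boolean `--flag` forms are handled.

```python
import ast
import inspect


def parse_value(text: str):
    """Interpret a token as a Python literal if possible, else keep it as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def mini_fire(func, argv):
    """Map --name value pairs onto func's keyword arguments and call it."""
    params = inspect.signature(func).parameters
    kwargs = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise SystemExit(f"unexpected positional argument: {token}")
        name = token[2:].replace("-", "_")
        if name not in params:
            raise SystemExit(f"unknown flag: {token}")
        # A flag with no following value is treated as a boolean True.
        if i + 1 == len(argv) or argv[i + 1].startswith("--"):
            kwargs[name] = True
            i += 1
        else:
            kwargs[name] = parse_value(argv[i + 1])
            i += 2
    return func(**kwargs)


def evaluate(model, dataset, backend="hf", greedy=False):
    """Toy stand-in for the real evaluate function (signature is an assumption)."""
    return {"model": model, "dataset": dataset, "backend": backend, "greedy": greedy}


result = mini_fire(
    evaluate,
    ["--model", "ise-uiuc/Magicoder-S-DS-6.7B", "--dataset", "humaneval",
     "--backend", "vllm", "--greedy"],
)
print(result)
```

The real library supports much more (positional arguments, `--flag=value`, nested commands, help output), which is the practical reason to wrap the function with Fire rather than write this parsing code yourself.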

Local Evaluation

```
git clone git@github.com:evalplus/evalplus.git
cd evalplus
pip install -e .
```
Download the HumanEvalPlus and MbppPlus files from evalplus/humanevalplus_release and evalplus/mbppplus_release, respectively.
Place HumanEvalPlus.jsonl in .cache/evalplus under your home directory and rename it to HumanEvalPlus-v0.1.10.jsonl; likewise, place MbppPlus.jsonl in the same directory and rename it to MbppPlus-v0.2.0.jsonl.
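The copy-and-rename step can be scripted. The sketch below assumes the two files have already been downloaded somewhere on disk; the helper name `install_dataset` is hypothetical, while the target file names and the cache directory are the ones given above.

```python
import shutil
from pathlib import Path
from typing import Optional

# Downloaded file name -> versioned name expected in the cache (from the text above).
TARGET_NAMES = {
    "HumanEvalPlus.jsonl": "HumanEvalPlus-v0.1.10.jsonl",
    "MbppPlus.jsonl": "MbppPlus-v0.2.0.jsonl",
}


def install_dataset(downloaded: Path, cache_dir: Optional[Path] = None) -> Path:
    """Copy a downloaded dataset file into the cache under its versioned name."""
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "evalplus"
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / TARGET_NAMES[downloaded.name]
    shutil.copyfile(downloaded, target)
    return target
```

For example, `install_dataset(Path("HumanEvalPlus.jsonl"))` would copy a file from the current directory to `~/.cache/evalplus/HumanEvalPlus-v0.1.10.jsonl`.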

To obtain the model's responses, prepend instruction_prefix to each task prompt:

instruction_prefix = "Please provide a self-contained Python script that solves the following problem in a markdown code block:"
# `prompt` initially holds the raw task prompt from the dataset
prompt = instruction_prefix + f"\n```python\n{prompt.strip()}\n```"
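Wrapped in a function, the prompt construction above might look like the following sketch. The function name `build_prompt` is hypothetical; the fence string is built programmatically only to keep literal backticks out of this example.

```python
INSTRUCTION_PREFIX = (
    "Please provide a self-contained Python script that solves "
    "the following problem in a markdown code block:"
)
FENCE = "`" * 3  # three backticks, i.e. a markdown code fence


def build_prompt(task_prompt: str, instruction_prefix: str = INSTRUCTION_PREFIX) -> str:
    """Prefix the instruction and wrap the task prompt in a python code fence."""
    return f"{instruction_prefix}\n{FENCE}python\n{task_prompt.strip()}\n{FENCE}"


print(build_prompt("def add(a, b):\n    \"\"\"Return a + b.\"\"\"\n"))
```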

Each line of the output file xx.jsonl needs two fields: task_id and solution (i.e. the model's response).
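A minimal sketch of writing such a file, assuming the responses have already been collected into a dict keyed by task ID. The helper name `write_samples` and the example response are illustrative only.

```python
import json


def write_samples(responses: dict, path: str) -> None:
    """Write one JSON object per line with the task_id and solution fields."""
    with open(path, "w", encoding="utf-8") as f:
        for task_id, solution in responses.items():
            record = {"task_id": task_id, "solution": solution}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Illustrative data: one task ID with a made-up model response.
example = {"HumanEval/0": "def solution():\n    return 42\n"}
# write_samples(example, "xx.jsonl")
```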

```
evalplus.sanitize --samples xx.jsonl
```

evalplus.evaluate --samples xx-sanitized.jsonl \
                  --dataset [humaneval|mbpp]