LiveCodeBench#
Note
LiveCodeBench[JHG+24] provides holistic and contamination-free evaluation of coding capabilities of LLMs. Particularly, LiveCodeBench continuously collects new problems over time from contests across three competition platforms – LeetCode, AtCoder, and CodeForces. Next, LiveCodeBench also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation.
Installation#
You can clone the repository using the following command:
git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench
Manage dependencies using conda and yaml (change to your prefix):
conda env create --file environment.yml -n lcb
name: lcb
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2024.7.2=h06a4308_0
- ld_impl_linux-64=2.38=h1181459_1
- libffi=3.4.4=h6a678d5_1
- libgcc-ng=11.2.0=h1234567_1
- libgomp=11.2.0=h1234567_1
- libstdcxx-ng=11.2.0=h1234567_1
- libuuid=1.41.5=h5eee18b_0
- ncurses=6.4=h6a678d5_0
- openssl=3.0.14=h5eee18b_0
- pip=24.0=py310h06a4308_0
- python=3.10.14=h955ad1f_1
- readline=8.2=h5eee18b_0
- setuptools=72.1.0=py310h06a4308_0
- sqlite=3.45.3=h5eee18b_0
- tk=8.6.14=h39e8969_0
- wheel=0.43.0=py310h06a4308_0
- xz=5.4.6=h5eee18b_1
- zlib=1.2.13=h5eee18b_1
- pip:
- aiohttp==3.9.4
- aiosignal==1.3.1
- annotated-types==0.7.0
- anthropic==0.25.1
- anyio==4.3.0
- async-timeout==4.0.3
- attrs==23.2.0
- boto3==1.34.117
- botocore==1.34.117
- build==1.2.1
- cachecontrol==0.14.0
- cachetools==5.3.3
- certifi==2024.2.2
- cffi==1.17.0
- charset-normalizer==3.3.2
- cleo==2.1.0
- cohere==5.5.4
- crashtest==0.4.1
- cryptography==43.0.0
- datasets==2.20.0
- dill==0.3.8
- distlib==0.3.8
- distro==1.9.0
- dulwich==0.21.7
- exceptiongroup==1.2.0
- fastavro==1.9.4
- fastjsonschema==2.20.0
- filelock==3.13.4
- frozenlist==1.4.1
- fsspec==2024.2.0
- google-ai-generativelanguage==0.6.1
- google-api-core==2.18.0
- google-api-python-client==2.125.0
- google-auth==2.29.0
- google-auth-httplib2==0.2.0
- google-generativeai==0.5.0
- googleapis-common-protos==1.63.0
- grpcio==1.62.1
- grpcio-status==1.62.1
- h11==0.14.0
- httpcore==1.0.5
- httplib2==0.22.0
- httpx==0.25.2
- httpx-sse==0.4.0
- huggingface-hub==0.22.2
- idna==3.7
- importlib-metadata==8.2.0
- installer==0.7.0
- jaraco-classes==3.4.0
- jeepney==0.8.0
- jinja2==3.1.4
- jmespath==1.0.1
- joblib==1.4.2
- jsonlines==4.0.0
- keyring==24.3.1
- markupsafe==2.1.5
- mistralai==0.1.8
- more-itertools==10.4.0
- mpmath==1.3.0
- msgpack==1.0.8
- multidict==6.0.5
- multiprocess==0.70.16
- networkx==3.3
- numpy==1.26.4
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-nccl-cu12==2.20.5
- nvidia-nvjitlink-cu12==12.6.20
- nvidia-nvtx-cu12==12.1.105
- openai==1.17.1
- orjson==3.10.0
- packaging==24.0
- pandas==2.2.2
- pebble==5.0.7
- pexpect==4.9.0
- pkginfo==1.11.1
- platformdirs==4.2.2
- poetry==1.8.3
- poetry-core==1.9.0
- poetry-plugin-export==1.8.0
- proto-plus==1.23.0
- protobuf==4.25.3
- ptyprocess==0.7.0
- pyarrow==15.0.2
- pyarrow-hotfix==0.6
- pyasn1==0.6.0
- pyasn1-modules==0.4.0
- pycparser==2.22
- pydantic==2.7.0
- pydantic-core==2.18.1
- pyext==0.7
- pyparsing==3.1.2
- pyproject-hooks==1.1.0
- python-dateutil==2.9.0.post0
- pytz==2024.1
- pyyaml==6.0.1
- rapidfuzz==3.9.6
- requests==2.32.3
- requests-toolbelt==1.0.0
- rsa==4.9
- s3transfer==0.10.1
- secretstorage==3.3.3
- shellingham==1.5.4
- six==1.16.0
- sniffio==1.3.1
- sympy==1.13.2
- tokenizers==0.15.2
- tomli==2.0.1
- tomlkit==0.13.0
- torch==2.4.0
- tqdm==4.66.5
- triton==3.0.0
- trove-classifiers==2024.7.2
- types-requests==2.32.0.20240602
- typing-extensions==4.11.0
- tzdata==2024.1
- uritemplate==4.1.1
- urllib3==2.2.1
- virtualenv==20.26.3
- xxhash==3.4.1
- yarl==1.9.4
- zipp==3.20.0
prefix: /home/xxx/anaconda3/envs/lcb
本地 Evaluate#
从 HuggingFace 下载评估文件,把它放在本地 livecodebench 目录下(e.g. LiveCodeBench/livecodebench/code_generation_lite)
生成待评估的文件,参数为 temperature=0.2 topp=0.95,Particularly, arrange the outputs in the following format
[ {"question_id": "id1", "code_list": ["code1", "code2"]}, {"question_id": "id2", "code_list": ["code1", "code2"]} ]
其中 code1 code2 是代码片段,需要从 response 中抽取出来
使用下述命令进行评估:
python -m lcb_runner.runner.custom_evaluator --custom_output_file {path_to_custom_outputs} --scenario codegeneration --start_date 2024-08-01