LiveCodeBench


LiveCodeBench [JHG+24] provides holistic, contamination-free evaluation of the coding capabilities of LLMs. It continuously collects new problems from contests on three competition platforms: LeetCode, AtCoder, and CodeForces. Beyond code generation, it also covers a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction.


Installation

You can clone the repository using the following command:

git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench

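Manage dependencies with conda. The command below creates an environment named lcb from environment.yml, whose contents follow; change the prefix entry on its last line to your own conda path:

conda env create --file environment.yml -n lcb

Contents of environment.yml:
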
name: lcb
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=72.1.0=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - aiohttp==3.9.4
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anthropic==0.25.1
      - anyio==4.3.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - boto3==1.34.117
      - botocore==1.34.117
      - build==1.2.1
      - cachecontrol==0.14.0
      - cachetools==5.3.3
      - certifi==2024.2.2
      - cffi==1.17.0
      - charset-normalizer==3.3.2
      - cleo==2.1.0
      - cohere==5.5.4
      - crashtest==0.4.1
      - cryptography==43.0.0
      - datasets==2.20.0
      - dill==0.3.8
      - distlib==0.3.8
      - distro==1.9.0
      - dulwich==0.21.7
      - exceptiongroup==1.2.0
      - fastavro==1.9.4
      - fastjsonschema==2.20.0
      - filelock==3.13.4
      - frozenlist==1.4.1
      - fsspec==2024.2.0
      - google-ai-generativelanguage==0.6.1
      - google-api-core==2.18.0
      - google-api-python-client==2.125.0
      - google-auth==2.29.0
      - google-auth-httplib2==0.2.0
      - google-generativeai==0.5.0
      - googleapis-common-protos==1.63.0
      - grpcio==1.62.1
      - grpcio-status==1.62.1
      - h11==0.14.0
      - httpcore==1.0.5
      - httplib2==0.22.0
      - httpx==0.25.2
      - httpx-sse==0.4.0
      - huggingface-hub==0.22.2
      - idna==3.7
      - importlib-metadata==8.2.0
      - installer==0.7.0
      - jaraco-classes==3.4.0
      - jeepney==0.8.0
      - jinja2==3.1.4
      - jmespath==1.0.1
      - joblib==1.4.2
      - jsonlines==4.0.0
      - keyring==24.3.1
      - markupsafe==2.1.5
      - mistralai==0.1.8
      - more-itertools==10.4.0
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - networkx==3.3
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==9.1.0.70
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.6.20
      - nvidia-nvtx-cu12==12.1.105
      - openai==1.17.1
      - orjson==3.10.0
      - packaging==24.0
      - pandas==2.2.2
      - pebble==5.0.7
      - pexpect==4.9.0
      - pkginfo==1.11.1
      - platformdirs==4.2.2
      - poetry==1.8.3
      - poetry-core==1.9.0
      - poetry-plugin-export==1.8.0
      - proto-plus==1.23.0
      - protobuf==4.25.3
      - ptyprocess==0.7.0
      - pyarrow==15.0.2
      - pyarrow-hotfix==0.6
      - pyasn1==0.6.0
      - pyasn1-modules==0.4.0
      - pycparser==2.22
      - pydantic==2.7.0
      - pydantic-core==2.18.1
      - pyext==0.7
      - pyparsing==3.1.2
      - pyproject-hooks==1.1.0
      - python-dateutil==2.9.0.post0
      - pytz==2024.1
      - pyyaml==6.0.1
      - rapidfuzz==3.9.6
      - requests==2.32.3
      - requests-toolbelt==1.0.0
      - rsa==4.9
      - s3transfer==0.10.1
      - secretstorage==3.3.3
      - shellingham==1.5.4
      - six==1.16.0
      - sniffio==1.3.1
      - sympy==1.13.2
      - tokenizers==0.15.2
      - tomli==2.0.1
      - tomlkit==0.13.0
      - torch==2.4.0
      - tqdm==4.66.5
      - triton==3.0.0
      - trove-classifiers==2024.7.2
      - types-requests==2.32.0.20240602
      - typing-extensions==4.11.0
      - tzdata==2024.1
      - uritemplate==4.1.1
      - urllib3==2.2.1
      - virtualenv==20.26.3
      - xxhash==3.4.1
      - yarl==1.9.4
      - zipp==3.20.0
prefix: /home/xxx/anaconda3/envs/lcb

Local Evaluation

  1. Download the evaluation files from HuggingFace and place them in the local livecodebench directory (e.g. LiveCodeBench/livecodebench/code_generation_lite).

  2. Generate the outputs to be evaluated, using temperature=0.2 and top_p=0.95, and arrange them in the following format:

    [
        {"question_id": "id1", "code_list": ["code1", "code2"]},
        {"question_id": "id2", "code_list": ["code1", "code2"]}
    ]
    

    Here code1 and code2 are code snippets, which must be extracted from the model responses.

  3. Run the evaluation with the following command:

    python -m lcb_runner.runner.custom_evaluator --custom_output_file {path_to_custom_outputs} --scenario codegeneration --start_date 2024-08-01
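
Step 2 above, extracting code from raw responses and arranging it in the expected format, could be sketched as follows. This is a minimal illustration, not part of LiveCodeBench: the helper names (extract_code, build_custom_outputs, write_custom_outputs) are hypothetical, and it assumes each response wraps its code in a Markdown-style fenced block, taking the last such block when there are several. Responses without a fence are used as-is.

```python
import json
import re

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Matches a fenced block with an optional language tag and captures its body.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the last fenced code block in a response (assumption: the final
    block holds the answer), or the stripped response if no fence is found."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else response.strip()


def build_custom_outputs(responses_by_id: dict) -> list:
    """Map {question_id: [raw responses]} to the list-of-dicts format from step 2."""
    return [
        {"question_id": qid, "code_list": [extract_code(r) for r in responses]}
        for qid, responses in responses_by_id.items()
    ]


def write_custom_outputs(responses_by_id: dict, path: str) -> None:
    """Write the formatted outputs to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_custom_outputs(responses_by_id), f, indent=4)
```

The file written by write_custom_outputs would then be the path passed as --custom_output_file in step 3.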