OpenCodeReasoning#

Since the advent of reasoning-based large language models, many have found great success in distilling reasoning capabilities into student models. However, much of the progress on distilling reasoning models remains locked behind proprietary datasets and undisclosed curation details. To address this, we construct a superior supervised fine-tuning (SFT) dataset and use it to achieve state-of-the-art coding results in models of various sizes.

Dataset Construction and Refinement#

Coding Questions Collection#

To construct OpenCodeReasoning, we gathered questions from TACO, APPS, CodeContests, and the CodeForces dataset released by the OpenR1 project. We performed exact-match deduplication, resulting in 28,904 distinct questions across a range of difficulties.
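
Below is a minimal sketch of the exact-match deduplication step; the `question` field name, the whitespace normalization, and the toy examples are assumptions used purely for illustration.

```python
# Minimal sketch of exact-match deduplication over the merged question pool.
# `questions` stands in for rows gathered from TACO, APPS, CodeContests, and the
# OpenR1 CodeForces dump; the field names are assumptions for illustration.

def normalize(text: str) -> str:
    # Collapse whitespace so trivially reformatted copies of a problem still match.
    # (A strict exact match would compare the raw strings instead.)
    return " ".join(text.split())

def dedupe(questions: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for row in questions:
        key = normalize(row["question"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

if __name__ == "__main__":
    pool = [
        {"source": "taco", "question": "Given an array, return its maximum element."},
        {"source": "apps", "question": "Given an array,  return its maximum element."},
        {"source": "codeforces", "question": "Print the sum of two integers a and b."},
    ]
    print(len(dedupe(pool)))  # -> 2 distinct questions
```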

../_images/open-code-reason1.png

Solution Code Generation#

In this step, we generate multiple solutions per question using DeepSeek-R1. We primarily generate solutions in Python. All solutions are sampled via nucleus sampling with temperature 0.6 and top-p 0.95, and we explicitly inject a <think> tag to force the model to generate reasoning traces. We use SGLang for R1 generation with a maximum output sequence length of 16k tokens.
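
The sketch below shows how a single R1 generation might be requested from an SGLang server through its OpenAI-compatible endpoint. The server URL, model path, and prompt template are assumptions; only the sampling settings (temperature 0.6, top-p 0.95, 16k max output tokens) come from the description above.

```python
# Hedged sketch of sampling one R1 solution via an SGLang OpenAI-compatible server.
# Assumptions: the server is running locally on port 30000 and serves DeepSeek-R1;
# the prompt template is illustrative, not the one used to build the dataset.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def generate_solution(question: str) -> str:
    prompt = (
        "Solve the following programming problem in Python.\n\n"
        f"{question}\n\n"
        "<think>\n"  # injected opening tag to force a reasoning trace
    )
    response = client.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        prompt=prompt,
        temperature=0.6,   # sampling settings from the text above
        top_p=0.95,
        max_tokens=16384,  # 16k-token output budget
    )
    return response.choices[0].text
```

Multiple solutions per question can then be collected by repeating this call for each question.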

Post-Processing for Refinement#

We apply the following checks to each generated solution; a minimal filtering sketch follows the list.

  1. Verify whether the solution includes reasoning traces enclosed within <think> and </think> tags.

  2. Verify the presence of a code block within the solution segments.

  3. Filter out responses where the generated reasoning traces contain code blocks.

  4. Verify the syntactic correctness of the solution code blocks by parsing them using Tree Sitter.
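
A minimal sketch of these four checks, assuming responses use Markdown-style code fences; Python's built-in ast module stands in here for the Tree Sitter parse used in step 4.

```python
# Sketch of the four post-processing checks. The regexes are assumptions about the
# response format, and ast.parse is a stand-in for the Tree Sitter syntax check.
import ast
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CODE_RE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def keep_response(response: str) -> bool:
    # 1. Reasoning trace must be enclosed in <think> ... </think>.
    think = THINK_RE.search(response)
    if think is None:
        return False
    reasoning, solution = think.group(1), response[think.end():]

    # 2. The solution segment must contain a code block.
    code_blocks = CODE_RE.findall(solution)
    if not code_blocks:
        return False

    # 3. Drop responses whose reasoning trace itself contains code blocks.
    if CODE_RE.search(reasoning):
        return False

    # 4. The solution code must be syntactically valid (last block, by assumption).
    try:
        ast.parse(code_blocks[-1])
    except SyntaxError:
        return False
    return True
```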

Scaling Up Data in Stages#

We find that while reasoning ability itself can be induced with a very small amount of data, large datasets are necessary to achieve state-of-the-art results on coding benchmarks. The most significant gains in benchmark scores came simply from increasing the number of unique, varied, and difficult questions.

../_images/open-code-reason2.png

Main Evaluation#

../_images/open-code-reason3.png

Ablation and Analyses#

Ablation: Filtering by Code Execution#

We fine-tuned Qwen2.5-14B-Instruct on three subsets (a construction sketch follows the list):

  1. the full set of 445k instances;

  2. all instances that pass their unit tests (151k instances);

  3. an equal number of instances, sampled from those that fail all unit tests.
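
A sketch of how the three subsets might be constructed, assuming each instance carries a hypothetical passes_unit_tests flag recorded during execution-based filtering; the field name and seed are illustrative.

```python
# Hedged sketch of carving the three fine-tuning subsets out of the full pool.
# The `passes_unit_tests` field and the random seed are assumptions.
import random

def build_subsets(instances: list[dict], seed: int = 0) -> dict[str, list[dict]]:
    passing = [x for x in instances if x["passes_unit_tests"]]      # (2) ~151k instances
    failing = [x for x in instances if not x["passes_unit_tests"]]

    rng = random.Random(seed)
    failing_matched = rng.sample(failing, k=len(passing))           # (3) equal-sized, all-fail subset

    return {
        "full": instances,                   # (1) the full 445k-instance pool
        "correct_only": passing,             # (2)
        "incorrect_only": failing_matched,   # (3)
    }
```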

Surprisingly, we observed that fine-tuning on the incorrect solutions yields higher accuracy than fine-tuning on the correct solutions.

../_images/open-code-reason4.png

Investigating further, we found that the incorrect solutions correspond to questions that are more challenging than those associated with the correct solutions.

../_images/open-code-reason5.png

Ablation: Inclusion of C++ Solutions#

The inclusion of C++ solutions has no positive impact on Python benchmark performance, but does significantly improve the accuracy on the C++ benchmark.