OpenThoughts#

Note: Data recipes for reasoning models.

OpenThoughts3 Data Pipeline#

Our experiments ablate each pipeline step independently, and at each step we select the best-performing strategy based on downstream performance.

Question Sourcing#

Our experiments cover 27 different sources of code questions. Among them, CodeGolf questions from StackExchange and competitive programming questions from OpenCodeReasoning perform well.
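To make the sourcing step concrete, the sketch below assembles a question pool from multiple sources and tags each question with its origin. The Hugging Face dataset IDs and column names are hypothetical placeholders, not the identifiers used in the pipeline:

```python
from datasets import load_dataset

# Hypothetical registry: the dataset IDs and question-column names below
# are placeholders, not the identifiers used in the actual pipeline.
QUESTION_SOURCES = {
    "codegolf_stackexchange": ("org/codegolf-questions", "question"),
    "opencodereasoning": ("org/competitive-coding", "problem"),
}

def load_questions(name: str) -> list[dict]:
    """Load one source and tag each question with its origin."""
    dataset_id, field = QUESTION_SOURCES[name]
    ds = load_dataset(dataset_id, split="train")
    return [{"source": name, "question": row[field]} for row in ds]

pool = [q for name in QUESTION_SOURCES for q in load_questions(name)]
```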

Mixing Questions#

After obtaining high-quality questions from various sources, the challenge becomes how to combine them effectively. For simplicity, we reuse the source rankings from the previous pipeline step. Our mixing strategy selects the top-ranked \(N\) datasets, randomly samples \(\frac{31600}{N}\) questions from each source, and concatenates them to form a dataset of size \(31600\). We sweep values of \(N \in \{1, 2, 4, 8, 16\}\): mixing at most two sources yields the best results across all data domains.
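A minimal sketch of this mixing step, assuming the sources are already ordered by the rankings from the previous step and each holds at least \(\frac{31600}{N}\) questions (the seed and helper name are illustrative):

```python
import random

TARGET_SIZE = 31_600

def mix_top_n(ranked_sources: list[list[dict]], n: int, seed: int = 0) -> list[dict]:
    """Sample TARGET_SIZE / n questions from each of the top-n ranked
    sources and concatenate them into one dataset of size ~TARGET_SIZE."""
    rng = random.Random(seed)
    per_source = TARGET_SIZE // n
    mixed = []
    for source in ranked_sources[:n]:
        # Assumes each source holds at least per_source questions.
        mixed.extend(rng.sample(source, per_source))
    rng.shuffle(mixed)
    return mixed

# Sweep over the ablated values of N.
# mixes = {n: mix_top_n(ranked_sources, n) for n in (1, 2, 4, 8, 16)}
```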

Question Filtering#

Difficulty-based filtering asks an LLM (GPT-4o-mini) to assess the difficulty of each question, then retains the most difficult questions. Difficulty-based filtering is the winning strategy for code.
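A sketch of difficulty-based filtering with the OpenAI Python SDK; the prompt wording, 1-10 scale, and retention fraction are assumptions rather than the exact configuration used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_difficulty(question: str) -> int:
    """Ask GPT-4o-mini for a 1-10 difficulty rating (prompt is illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rate the difficulty of this coding question from 1 "
                       "(trivial) to 10 (extremely hard). Reply with only "
                       "the number.\n\n" + question,
        }],
    )
    return int(response.choices[0].message.content.strip())

def keep_hardest(questions: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    """Score every question, then retain the most difficult fraction."""
    scored = sorted(questions, key=lambda q: rate_difficulty(q["question"]), reverse=True)
    return scored[: int(len(scored) * keep_fraction)]
```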

Deduplication and Sampling Multiple Answers Per Question#

We explore three degrees of deduplication strictness: no deduplication, exact match deduplication, and fuzzy deduplication using a threshold-based string similarity. We also explore sampling multiple answers (1×, 4×, 16×) for each domain.
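The two non-trivial deduplication levels could look like the sketch below, which uses difflib's SequenceMatcher for the fuzzy string-similarity variant; the 0.9 similarity threshold is an assumed value:

```python
from difflib import SequenceMatcher

def exact_dedup(questions: list[str]) -> list[str]:
    """Keep only the first occurrence of each exact string."""
    seen, kept = set(), []
    for q in questions:
        if q not in seen:
            seen.add(q)
            kept.append(q)
    return kept

def fuzzy_dedup(questions: list[str], threshold: float = 0.9) -> list[str]:
    """Drop a question whose similarity to any kept question exceeds the
    threshold. O(n^2) pairwise comparison; fine for a sketch."""
    kept = []
    for q in questions:
        if all(SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept
```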

For code, we adopt the second-best strategy: no deduplication with 16× answers per question.

Answer Filtering#

We do not perform answer filtering because no filtering strategy outperformed the baseline, which uses all the answers.

Teacher Model#

We use QwQ-32B as the teacher model.
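As an illustration, multiple answers per question can be sampled from the teacher with vLLM; only the checkpoint name Qwen/QwQ-32B is the public model ID, while the parallelism and sampling settings below are assumptions:

```python
from vllm import LLM, SamplingParams

# Checkpoint name is the public model ID; the parallelism and sampling
# settings are assumptions for this sketch.
llm = LLM(model="Qwen/QwQ-32B", tensor_parallel_size=8)
params = SamplingParams(n=16, temperature=0.7, top_p=0.95, max_tokens=16384)

def annotate(questions: list[str]) -> list[list[str]]:
    """Sample 16 candidate answers per question from the teacher."""
    outputs = llm.generate(questions, params)
    return [[completion.text for completion in out.outputs] for out in outputs]
```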