Skywork Open Reasoner 1#

Note

Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench relative to the corresponding base models.
Additionally, we thoroughly investigate the phenomenon of entropy collapse.

MAGIC in Skywork-OR1#

Tip

We made the following refinements to the training strategy of vanilla GRPO:

  1. Multi-Stage Training.

  2. No Advantage Mask for Truncated Responses.

  3. High-Temperature Sampling.

  4. On-Policy Training.

We introduce the following characteristics into the loss function:

  1. Adaptive Entropy Control.

  2. No KL Loss.

Multi-Stage Training#

We used a shorter context length \(T\) in the initial stages. Once the model’s performance converged, we increased \(T\) in the subsequent stage. Our findings demonstrate that multi-stage training not only improves token efficiency in the initial stage but also preserves scaling ability.
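Concretely, the stage schedule can be expressed as a short list of configurations that the training loop walks through, extending the context length whenever the current stage converges. The sketch below is illustrative only: the lengths and the `trainer` / `has_converged` interfaces are assumptions, not the released training code.

```python
# Illustrative multi-stage schedule: start with a short context length T and
# extend it once performance at the current length has converged.
# The lengths and the trainer interface are assumptions for illustration.
STAGES = [
    {"name": "stage_1", "max_context_length": 8_192},
    {"name": "stage_2", "max_context_length": 16_384},
    {"name": "stage_3", "max_context_length": 32_768},
]


def run_multi_stage_training(trainer, has_converged):
    for stage in STAGES:
        trainer.set_max_context_length(stage["max_context_length"])
        while not has_converged(trainer):
            trainer.train_step()
```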

No Advantage Mask for Truncated Responses#

We investigated several advantage mask strategies aimed at reducing the influence of truncated responses. However, our findings show that assigning negative advantages to truncated samples not only improves token efficiency but also preserves the model’s scaling ability in later stages. As a result, we did not apply any mask strategies in our final training pipeline.

Note

Ablation Experiments: Different Advantage Mask Strategies

  1. No-Adv-Mask: We do not employ any advantage mask strategy.

  2. Adv-Mask-Before: Truncated responses are excluded from the group advantage calculation for non-truncated responses, and the advantages of these truncated responses are set to 0.

  3. Adv-Mask-After: Truncated responses are still included in the group advantage calculation for non-truncated responses, but the advantages of these truncated responses are set to 0.

../_images/sky-or1-1.png
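The three strategies above differ only in whether truncated responses enter the group statistics and whether their advantages are zeroed afterwards. A minimal sketch, assuming a standard GRPO-style group normalization of rewards (not the released training code):

```python
import torch


def group_advantages(rewards: torch.Tensor,
                     truncated: torch.Tensor,
                     mode: str = "no-adv-mask",
                     eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group advantages for one prompt's rollout group.

    rewards:   (G,) scalar rewards for G responses to the same prompt.
    truncated: (G,) bool mask, True where the response hit the length limit.
    mode:      "no-adv-mask" | "adv-mask-before" | "adv-mask-after".
    Simplified sketch of the three ablated strategies; assumes at least two
    non-truncated responses per group when mode == "adv-mask-before".
    """
    if mode == "adv-mask-before":
        # Truncated responses are excluded from the group statistics ...
        kept = rewards[~truncated]
        mean, std = kept.mean(), kept.std()
    else:
        # ... otherwise all responses contribute to the statistics.
        mean, std = rewards.mean(), rewards.std()

    adv = (rewards - mean) / (std + eps)

    if mode in ("adv-mask-before", "adv-mask-after"):
        # Both mask variants zero out the advantages of truncated responses.
        adv = torch.where(truncated, torch.zeros_like(adv), adv)
    return adv
```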

High-Temperature Sampling#

Note

Ablation Experiments: Different Online Sampling Temperatures \(\tau\)

  1. High Temperature: \(\tau\) = 1.0.

  2. Low Temperature: \(\tau\) = 0.6.

../_images/sky-or1-2.png
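For reference, the temperature \(\tau\) only rescales the next-token logits before sampling; higher values flatten the distribution and keep rollout diversity (and entropy) higher. A minimal sketch in plain PyTorch, not the rollout engine actually used:

```python
import torch


def sample_next_token(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample a token id from temperature-scaled logits.

    logits: (vocab_size,) unnormalized next-token scores.
    tau:    sampling temperature; tau = 1.0 vs. 0.6 is the ablation above.
    """
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```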

Adaptive Entropy Control#

While preventing premature entropy collapse via entropy regularization is beneficial, selecting an appropriate entropy loss coefficient is challenging. We therefore introduce Adaptive Entropy Control, a method that adaptively adjusts the entropy loss coefficient based on the gap between the current entropy and a target entropy.

\[ \alpha_{k} = c_k\cdot\mathbb{I}\{e_k\le\mathbf{tgt\_ent}\} \]
\[\begin{split} c_{k+1} = \begin{cases} c_k + \Delta,\quad&\text{if }e_k<\mathbf{tgt\_ent}\\ c_k - \Delta,\quad&\text{if }e_k>\mathbf{tgt\_ent} \end{cases} \end{split}\]

where \(c_0=0\) denotes the initial adaptive coefficient, \(e_k\) is the policy entropy at training step \(k\), \(\mathbf{tgt\_ent}\) is the target entropy, and \(\Delta\) is a fixed adjustment step size.
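A minimal sketch of this controller, assuming the target entropy and step size \(\Delta\) are given as hyperparameters (their concrete values are not specified here):

```python
class AdaptiveEntropyController:
    """Adaptive entropy-loss coefficient, following the update rule above."""

    def __init__(self, target_entropy: float, delta: float):
        self.target_entropy = target_entropy
        self.delta = delta
        self.c = 0.0  # c_0 = 0

    def coefficient(self, entropy: float) -> float:
        """Return alpha_k = c_k * 1{e_k <= tgt_ent}, then update c_k -> c_{k+1}."""
        alpha = self.c if entropy <= self.target_entropy else 0.0
        if entropy < self.target_entropy:
            self.c += self.delta
        elif entropy > self.target_entropy:
            self.c -= self.delta
        return alpha


# Example usage (entropy bonus weighted by alpha), e.g.:
#   loss = policy_loss - controller.coefficient(mean_token_entropy) * mean_token_entropy
```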

No KL Loss#

../_images/sky-or1-3.png

We observe that, in Stage 2, the KL loss strongly pulls the actor model’s policy back toward the reference model. As a result, performance on AIME24 fails to improve significantly once the actor’s policy becomes too similar to the reference policy. Based on this observation, we set \(\beta = 0\) for all training stages of our released models.
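To make the role of \(\beta\) explicit, here is a minimal sketch of how a KL penalty typically enters a GRPO-style objective; the k3-style estimator below is a common choice in such implementations and is used only for illustration, not as the exact released loss:

```python
import torch


def kl_penalty(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate between actor and reference: exp(r) - r - 1, r = ref - actor."""
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0


def total_loss(policy_loss: torch.Tensor,
               logprobs: torch.Tensor,
               ref_logprobs: torch.Tensor,
               beta: float = 0.0) -> torch.Tensor:
    # beta = 0 (the released setting) removes the pull toward the reference
    # model entirely; the reference forward pass can then be skipped as well.
    if beta == 0.0:
        return policy_loss
    return policy_loss + beta * kl_penalty(logprobs, ref_logprobs).mean()
```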

Empirical Studies on Mitigating Policy Entropy Collapse#

We hypothesize that the following two sources may influence the model’s entropy and convergence behavior:

  • Rollout diversity. If the rollout data contain a greater diversity of correct responses, the model is less likely to overfit to a single correct trajectory.

  • Policy update. We also investigate how different components of the policy update influence entropy. We focus primarily on the number of stochastic gradient descent (SGD) steps per training step and the use of additional entropy control methods (e.g., entropy loss). (The entropy quantity tracked in these experiments is sketched after this list.)
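The entropy referred to above and throughout is the policy's mean token-level entropy over generated responses. A minimal sketch of one common way to compute it (the exact estimator used in training is an assumption here):

```python
import torch


def mean_token_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token policy entropy over response tokens.

    logits:        (batch, seq_len, vocab) next-token logits from the actor.
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding.
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    token_entropy = -(logprobs.exp() * logprobs).sum(dim=-1)  # (batch, seq_len)
    return (token_entropy * response_mask).sum() / response_mask.sum()
```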

Premature Entropy Collapse Generally Manifests as Worse Performance#

../_images/sky-or1-4.png

The Impact of Off-policy Update by Increasing \(N_{\text{SGD}}\)#

../_images/sky-or1-6.png

Let \(D_R\) denote the rollout batch size, \(D_T\) the mini-batch size used for each SGD step, and \(N_{\text{reuse}}\) the number of times the rollout data is reused. The number of SGD steps performed in one training step then satisfies:

\[ N_{\text{SGD}} = \frac{D_R}{D_T}\cdot N_{\text{reuse}} \]

When \(N_{\text{SGD}}=1\), the policy update is purely on-policy; when \(N_{\text{SGD}}\ge 2\), off-policy data is introduced into the policy update.
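A small sketch that makes this relation concrete; the `policy.update` call is a hypothetical interface, shown only to indicate where mini-batching and data reuse introduce off-policy updates:

```python
def num_sgd_steps(d_rollout: int, d_train: int, n_reuse: int) -> int:
    """N_SGD = (D_R / D_T) * N_reuse for a single training step."""
    assert d_rollout % d_train == 0
    return (d_rollout // d_train) * n_reuse


def training_step(policy, rollout_batch, d_train: int, n_reuse: int) -> None:
    # Every mini-batch after the first is optimized against data generated by
    # an already-updated (hence stale, off-policy) version of the policy.
    for _ in range(n_reuse):
        for start in range(0, len(rollout_batch), d_train):
            policy.update(rollout_batch[start:start + d_train])


assert num_sgd_steps(64, 64, 1) == 1   # purely on-policy
assert num_sgd_steps(64, 32, 1) == 2   # off-policy: smaller mini-batches
assert num_sgd_steps(64, 64, 4) == 4   # off-policy: data reused four times
```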

Note

Ablation Experiments: The Impact of Different Numbers of SGD Steps \(N_{\text{SGD}}\).

Consider the quadruple \((N_{\text{SGD}},D_R,D_T ,N_{\text{reuse}})\).

  1. \(N_{\text{SGD}}=1\): The baseline experiment with the quadruple (1,64,64,1).

  2. \(N_{\text{SGD}}=2\): We ran two experiments with the quadruples (2,64,32,1) and (2,64,64,2).

  3. \(N_{\text{SGD}}=4\): We ran two experiments with the quadruples (4,64,16,1) and (4,64,64,4).

../_images/sky-or1-5.png

Experiments with \(N_{\text{SGD}} \in \{2, 4\}\) exhibit faster policy convergence, with entropy decaying to very small values within a few training steps. As a result, test performance fails to improve consistently once the model enters a low-entropy state. In short, off-policy data harms test performance.

Preventing Premature Entropy Collapse#

  • Entropy Loss Is Sensitive to Training Data.

  • Adjusting the Coefficient of Entropy Loss Adaptively.

  • Using a properly chosen higher-clip ratio can prevent premature entropy collapse and lead to better test performance; however, the optimal higher-clip ratio is task-dependent. (A minimal sketch of asymmetric clipping follows this list.)
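As referenced above, the higher-clip ratio raises only the upper bound of the PPO-style clipping range, so low-probability tokens with positive advantage can gain probability mass more easily, which can help sustain entropy. A minimal sketch of such asymmetric clipping; the bounds shown are illustrative, not the tuned values:

```python
import torch


def clipped_policy_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        eps_low: float = 0.2,
                        eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style surrogate with an asymmetric ("clip-higher") range.

    Token-level masking is omitted for brevity; all tensors share one shape.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```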

Empirical Studies on Training Resource Allocation#

Tip

  1. Rollout Time Dominates the Total Training Time.

  2. Larger Batch Size, Better Test Performance.

  3. Larger Group Size, Better Test Performance.