# VAPO

```{note}
We present VAPO (**V**alue-model-based **A**ugmented Proximal **P**olicy **O**ptimization), a novel framework
tailored for reasoning models within the value-model-based paradigm.

We argue that value-model-based approaches possess a higher performance ceiling if the challenges in training
value models can be addressed:

1. Value models enable `more precise credit assignment` by accurately tracing the impact of each action
   on subsequent returns, facilitating finer-grained optimization.
2. In contrast to the advantage estimates derived from Monte Carlo methods in value-model-free approaches,
   value models can provide `lower-variance value estimates` for each token.
3. A well-trained value model exhibits inherent generalization capabilities.
```

```{figure} ../images/vapo1.png
```

## Mitigating Value Model Bias over Long Sequences

Initializing the value model with a reward model introduces significant
initialization bias. The reward
model is trained to score on the `<EOS>` token, incentivizing it to assign lower scores to earlier tokens due to
their incomplete context. In contrast, the value model estimates the expected cumulative reward for all tokens
preceding `<EOS>` under a given policy. During early training phases, given the backward computation of GAE,
there will be a positive bias at every timestep $t$ that accumulates along the trajectory.

**Value-Pretraining** is proposed to mitigate the value initialization bias:

1. Continuously generate responses by sampling from a fixed policy, for instance, $\pi_{\text{SFT}}$, and <span style="color: red">update the value
model with Monte-Carlo return.</span>

2. Train the value model until key training metrics indicate a good fit, i.e. the value loss is
sufficiently low and the explained variance is sufficiently high.
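The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the single terminal reward per response, and the default `gamma=1.0` are assumptions made for the example. It computes Monte-Carlo returns as regression targets and the two metrics mentioned above (for explained variance, values closer to 1 indicate a better fit).

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go G_t = sum_{k>=t} gamma^(k-t) * r_k for one response."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_loss(values, returns):
    """Mean squared error between value predictions and Monte-Carlo returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(returns)


def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns), pooled over all tokens."""
    n = len(returns)
    mean_g = sum(returns) / n
    var_g = sum((g - mean_g) ** 2 for g in returns) / n
    residuals = [g - v for g, v in zip(returns, values)]
    mean_r = sum(residuals) / n
    var_r = sum((r - mean_r) ** 2 for r in residuals) / n
    return float("nan") if var_g == 0 else 1.0 - var_r / var_g


# Hypothetical batch: one correct response (reward 1 at the last token)
# and one incorrect response (reward 0), pooled token by token.
targets = monte_carlo_returns([0.0, 0.0, 1.0]) + monte_carlo_returns([0.0, 0.0])
predictions = [0.5] * len(targets)  # an untrained, constant value model
print(value_loss(predictions, targets), explained_variance(predictions, targets))
```

The metrics are pooled over tokens from several responses because, with a single terminal reward and no discounting, the returns within one response are constant and its variance is zero.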

**Decoupled-GAE** decouples the advantage computation
for the value and the policy. For value updates, it is recommended to <span style="color: red">compute the value-update target
with $\lambda_{\text{critic}} = 1.0$.</span> This choice results in an unbiased gradient-descent optimization, effectively addressing the
reward-decay issues in long CoT tasks. For policy updates, using a smaller $\lambda_{\text{policy}}$ is advisable to accelerate policy convergence under computational
and time constraints.
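A minimal sketch of the decoupling, assuming a single response with per-token rewards and value predictions, a value of zero after the last token, and no discounting by default. The helper names and the example value `lam_policy=0.95` are illustrative assumptions, not taken from the source.

```python
def gae_advantages(rewards, values, lam, gamma=1.0):
    """GAE for one response; values[t] = V(s_t), value after the last token is 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_gae(rewards, values, lam_policy, lam_critic=1.0, gamma=1.0):
    """Use one lambda for the policy advantages and another for the value targets."""
    policy_adv = gae_advantages(rewards, values, lam_policy, gamma)
    critic_adv = gae_advantages(rewards, values, lam_critic, gamma)
    value_targets = [a + v for a, v in zip(critic_adv, values)]
    return policy_adv, value_targets


rewards = [0.0, 0.0, 0.0, 1.0]  # reward only at the final token
values = [0.2, 0.4, 0.1, 0.7]   # hypothetical value predictions
policy_adv, value_targets = decoupled_gae(rewards, values, lam_policy=0.95)
print(policy_adv)
print(value_targets)
```

With $\lambda_{\text{critic}} = 1.0$ and no discounting, the TD errors telescope and every value target equals the Monte-Carlo return, which is why the target does not decay with distance from the reward.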

## Managing Heterogeneous Sequence Lengths during Training

**Length-Adaptive GAE** aims to
ensure a more uniform distribution of TD-errors across both short and long sequences. We design the <span style="color: red">sum of
the coefficients $\lambda_{\text{policy}}$ to be proportional to the output length $l$:</span>

$$
\sum_{t=0}^{\infty}\lambda_{\text{policy}}^{t} = \frac{1}{1 - \lambda_{\text{policy}}} = \alpha l
$$

which results in:

$$\lambda_{\text{policy}} = 1 - \frac{1}{\alpha l}$$
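As a quick numerical check of this formula, the sketch below computes $\lambda_{\text{policy}}$ for a few output lengths. The function name, the example value `alpha=0.05`, and the clamp to zero for very short outputs (where $\alpha l < 1$ would make the formula negative) are assumptions added for illustration.

```python
def length_adaptive_lambda(length, alpha=0.05):
    """lambda_policy = 1 - 1 / (alpha * l), clamped to be non-negative."""
    if length <= 0:
        raise ValueError("length must be positive")
    # The clamp is an assumption of this sketch, not part of the formula above.
    return max(0.0, 1.0 - 1.0 / (alpha * length))


for l in (100, 1000, 10000):
    print(l, length_adaptive_lambda(l))
```

Longer outputs get a $\lambda_{\text{policy}}$ closer to 1, so the advantage estimate aggregates TD errors over a window that grows in proportion to the response length.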

**Token-Level Policy Gradient Loss** assigns uniform weights to all tokens within a single training batch, so that
tokens in long sequences contribute to the loss as much as tokens in short ones and the problems posed by long
sequences are addressed more efficiently.
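The difference from per-sequence averaging can be seen in a small sketch. It assumes the per-token losses are already computed and stored as one list per response; both function names are hypothetical.

```python
def sequence_level_loss(per_token_losses):
    """Average within each sequence first, then across sequences."""
    seq_means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(seq_means) / len(seq_means)


def token_level_loss(per_token_losses):
    """Average over every token in the batch, so each token has equal weight."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# Hypothetical batch: a short response with zero loss, a long one with loss 2.0 per token.
batch = [[0.0, 0.0], [2.0] * 8]
print(sequence_level_loss(batch))  # each response counts equally
print(token_level_loss(batch))     # each token counts equally
```

Under sequence-level averaging, each token of the 8-token response carries a quarter of the weight of a token in the 2-token response; token-level averaging removes that imbalance.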

## Dealing with Sparsity of Reward Signal in Verifier-based Tasks

**Clip-Higher** increases the value of $\epsilon_{\text{high}}$ to leave more room for increasing the probability of low-probability tokens.
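A minimal per-token sketch of the asymmetric clipping, assuming the standard PPO clipped surrogate with separate lower and upper bounds. The function name and the example values `eps_low=0.2` and `eps_high=0.28` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO clipped objective for one token with decoupled clip bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability grew by 25% and that has a positive advantage:
print(clipped_surrogate(1.25, 1.0, eps_high=0.2))   # clipped at the symmetric bound
print(clipped_surrogate(1.25, 1.0, eps_high=0.28))  # still inside the wider upper bound
```

With the symmetric bound, the objective is flat once the ratio passes $1 + \epsilon$, so the token stops receiving gradient; the larger $\epsilon_{\text{high}}$ keeps the gradient alive for longer on tokens whose probability is being raised.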

**Positive Example LM Loss** is designed to enhance the utilization efficiency of positive samples during the RL
training process. In the context of RL for complex reasoning tasks, some tasks demonstrate remarkably low
accuracy, with the majority of training samples yielding incorrect answers. To address
this challenge, we adopt an imitation learning approach by <span style="color: red">incorporating an additional negative log-likelihood
(NLL) loss for the correct outcomes</span> sampled during RL training:

$$
\mathcal{L}_{\text{NLL}}(\theta) = -\frac{1}{\sum_{o_i\in\tau}|o_i|}\sum_{o_i\in\tau}\sum_{t=1}^{|o_i|}\log\pi_{\theta}(a_t|s_t)
$$

where $\tau$ denotes the set of correct answers. The final NLL loss is combined with the policy gradient loss
through a weighting coefficient $\mu$:

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta) + \mu \cdot \mathcal{L}_{\text{NLL}}(\theta)
$$
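The two formulas can be sketched as follows, assuming per-token log-probabilities under the current policy are available for each sampled response together with a correctness flag from the verifier. The NLL is normalized by the total number of tokens in the correct responses, which is how the denominator is read in this sketch. The function names and the example weight `mu=0.1` are hypothetical.

```python
import math


def positive_nll_loss(token_logprobs, is_correct):
    """Token-averaged NLL over the correct responses only."""
    total, count = 0.0, 0
    for logps, ok in zip(token_logprobs, is_correct):
        if ok:
            total += sum(logps)
            count += len(logps)
    # Assumption of this sketch: contribute nothing if no response is correct.
    return 0.0 if count == 0 else -total / count


def combined_loss(ppo_loss, nll_loss, mu=0.1):
    """L = L_PPO + mu * L_NLL."""
    return ppo_loss + mu * nll_loss


# Hypothetical batch of two responses; only the first one is correct.
logprobs = [[math.log(0.5), math.log(0.25)], [math.log(0.9)]]
nll = positive_nll_loss(logprobs, is_correct=[True, False])
print(nll, combined_loss(ppo_loss=0.3, nll_loss=nll))
```

Incorrect responses are excluded from the NLL term entirely; they still influence training through the policy gradient loss.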

**Group-Sampling** reduces the number of distinct prompts per batch and `redirects computational resources toward
repeated generations`. We observed that it is marginally better than sampling each prompt only once.