# InstructGPT

```{note}
Making language models bigger does not inherently make them better at following
a user’s intent. For example, large language models can generate outputs that
are untruthful, toxic, or simply not helpful to the user. In other words, these
models are not aligned with their users. In this paper, we show an avenue for
<span style="color: red">aligning language models with user intent</span> on a wide range of tasks by fine-tuning
with human feedback.
```

## High-level methodology

```{figure} ../images/instructgpt-1.png
```

## Dataset

Our prompt dataset consists primarily of text prompts submitted to the OpenAI API. We heuristically deduplicate prompts by checking for prompts that share a long common
prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation,
and test splits based on user ID. To avoid the models learning potentially sensitive customer details, we
filter all prompts in the training split for personally identifiable information (PII).
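
The deduplication, per-user cap, and user-based split can be sketched as below. This is only an illustration under stated assumptions, not the paper's pipeline: the 50-character prefix threshold, deduplicating across all users, the hash-based 80/10/10 split, and the function names are all hypothetical, and the PII filter is omitted.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 50      # assumed threshold for a "long common prefix"
MAX_PER_USER = 200   # per-user cap stated in the text


def dedup_and_cap(records):
    """records: iterable of (user_id, prompt) pairs. Returns the kept pairs."""
    seen_prefixes = set()
    per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        prefix = prompt[:PREFIX_LEN]
        if prefix in seen_prefixes:
            continue  # heuristic duplicate: same prefix as an already kept prompt
        if per_user[user_id] >= MAX_PER_USER:
            continue  # this user has reached the cap
        seen_prefixes.add(prefix)
        per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept


def split_for_user(user_id):
    """Map a user ID to one split, so a user's prompts never straddle splits."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "valid" if bucket == 8 else "test"
```

Splitting on the user ID rather than on individual prompts keeps similar prompts from the same user out of both the training and evaluation sets.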

To train the very first InstructGPT models, we asked labelers to write prompts themselves.

From these prompts, we produce three different datasets used in our fine-tuning procedure. The SFT dataset contains about 13k training
prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API
and labeler-written), and the PPO dataset has 31k training prompts (only from the API).

## Models

**Supervised fine-tuning (SFT).** We fine-tune GPT-3 on our labeler demonstrations using supervised
learning.

**Reward modeling (RM).** Starting from the SFT model with the final unembedding layer removed,
we trained a model to take in a prompt and response, and output a scalar reward.

In order to speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. We train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element.
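
As a small illustration of how one ranking expands into $\binom{K}{2}$ comparisons, the sketch below enumerates every (preferred, rejected) pair from a list ordered best to worst. The function name is hypothetical and ties between responses are not handled.

```python
from itertools import combinations
from math import comb


def ranking_to_pairs(ranked):
    """ranked: responses ordered from best to worst (no ties).

    combinations() preserves input order, so each tuple is (preferred, rejected).
    """
    return list(combinations(ranked, 2))


pairs = ranking_to_pairs(["a", "b", "c", "d"])  # K = 4
assert len(pairs) == comb(4, 2)                 # 6 comparisons
```

With $K=9$ the same ranking yields 36 comparisons, which is why grouping them into one batch element saves reward-model forward passes: each of the $K$ responses is scored once rather than once per pair.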

The loss function for the reward model is:

$$
\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\mathbb{E}_{(x, y_{w}, y_{l})\sim D}\left[\log\left(\sigma\left(r_{\theta}(x, y_{w}) - r_{\theta}(x, y_{l})\right)\right)\right]
$$

where $r_{\theta}(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_{w}$ is the preferred completion out of the pair $y_{w}$ and $y_{l}$, and $D$ is the dataset of human
comparisons.
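
A minimal sketch of this loss for a single prompt, written in plain Python so that it runs without a deep learning framework. It takes the $K$ scalar rewards already ordered best to worst and returns the negative mean of $\log\sigma(r_w - r_l)$ over all pairs. The function names are hypothetical, and a real implementation would compute this on tensors with automatic differentiation.

```python
import math
from itertools import combinations


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rm_loss_for_prompt(rewards_ranked):
    """rewards_ranked: r_theta(x, y) for K responses, ordered best to worst."""
    pairs = list(combinations(rewards_ranked, 2))  # each pair is (r_w, r_l)
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)
```

The loss is small when the reward model scores preferred responses above rejected ones, and equals $\log 2$ when it cannot tell them apart.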

```{tip}
We employ the <span style="color: red">Bradley-Terry model for pairwise comparison of competitors</span>, where the strength parameter for $(x, y)$ is set to $\exp(r(x, y))$. Then:

$$p(y_{1}\succ y_{2}|x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))}=\sigma(r(x,y_1) - r(x,y_2))$$
```
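
The last equality in the Bradley-Terry expression follows from dividing the numerator and denominator by $\exp(r(x, y_1))$. Writing $r_i = r(x, y_i)$ as shorthand:

$$
\frac{e^{r_1}}{e^{r_1} + e^{r_2}} = \frac{1}{1 + e^{-(r_1 - r_2)}} = \sigma(r_1 - r_2)
$$

Maximizing the log-likelihood of the observed comparisons under this model is therefore the same as minimizing the negative mean of $\log\sigma(r_{\theta}(x, y_w) - r_{\theta}(x, y_l))$, which is the reward model loss.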

**Reinforcement learning (RL).** We fine-tuned the SFT model on our environment using PPO. The environment is a bandit environment that presents a random customer prompt and expects a response to the prompt. Given
the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, <span style="color: red">we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization
of the reward model. The value function is initialized from the RM.</span> We call these
models “PPO.”

We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the
performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the
following combined objective function in RL training:

$$
\begin{aligned}
\text{objective} = &\mathbb{E}_{(x, y)\sim D_{\pi_{\phi}^{\text{RL}}}}\left[r_{\theta}(x, y) - \beta\log\left(\pi_{\phi}^{\text{RL}}(y|x) / \pi^{\text{SFT}}(y|x)\right)\right] + \\
&\gamma\mathbb{E}_{x\sim D_{\text{pretrain}}}\left[\log(\pi_{\phi}^{\text{RL}}(x))\right]
\end{aligned}
$$

where $\pi_{\phi}^{\text{RL}}$ is the learned RL policy, $\pi^{\text{SFT}}$ is the supervised trained model, and $D_{\text{pretrain}}$ is the
pretraining distribution. The KL reward coefficient, $\beta$, and the pretraining loss coefficient, $\gamma$, control
the strength of the KL penalty and pretraining gradients respectively.
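
One common way to realize the first term of this objective is to hand PPO a per-token reward: a KL penalty at every token, plus the reward model score on the final token. The sketch below assumes that layout, which the text above does not spell out. The function name and the value of `beta` used in the example are hypothetical.

```python
def shaped_rewards(rm_score, logp_rl, logp_sft, beta):
    """Per-token rewards for one sampled response.

    rm_score: scalar r_theta(x, y) from the reward model.
    logp_rl, logp_sft: log-probabilities that the RL policy and the SFT model
        assign to each sampled token (same length, one entry per token).
    beta: KL reward coefficient.
    """
    if not logp_rl or len(logp_rl) != len(logp_sft):
        raise ValueError("need two non-empty sequences of equal length")
    # KL penalty at every token: -beta * log(pi_RL / pi_SFT)
    rewards = [-beta * (lr - ls) for lr, ls in zip(logp_rl, logp_sft)]
    # The scalar RM reward arrives once, when the episode ends.
    rewards[-1] += rm_score
    return rewards


r = shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -1.0], beta=0.1)
```

Because the sequence log-probability is the sum of the token log-probabilities, the per-token rewards sum to $r_{\theta}(x, y) - \beta\log\left(\pi_{\phi}^{\text{RL}}(y|x) / \pi^{\text{SFT}}(y|x)\right)$, the quantity inside the first expectation.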

```{tip}
For an outcome $x$ with probability $p(x)$, its self-information is

$$I(x) = -\log p(x)$$

The less probable an event is, the more surprising it is and the more information it yields. The term

$$\log\frac{p(x)}{q(x)} = -\log q(x) - (-\log p(x))$$

can be interpreted as our relative surprise at observing $x$ when we model it with $q$ instead of $p$. The KL divergence between $P$ and $Q$,

$$\mathbb{E}_{x\sim P}\left[\log\frac{p(x)}{q(x)}\right]$$

is then the expected relative surprise from using $Q$ instead of $P$ when the actual distribution is $P$. It measures how much the probability distribution $P$ differs from the reference distribution $Q$.
```
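
For discrete distributions the expectation becomes a sum, which makes the definition easy to check numerically. The sketch below is a plain-Python illustration with a hypothetical function name. It assumes `p` and `q` are probability vectors over the same outcomes and uses the natural logarithm.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue         # a term with p(x) = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
assert kl_divergence(p, p) == 0.0
assert kl_divergence(p, q) != kl_divergence(q, p)  # KL is not symmetric
```

The asymmetry matters for the RL objective: the penalty is an expectation over samples from the RL policy, so it measures the policy's divergence from the SFT model and not the reverse.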

```{tip}
The implementation details of PPO can be found in [this blog](https://newfacade.github.io/notes-on-reinforcement-learning/17-ppo-trl.html).
```

## Results

```{figure} ../images/instructgpt-2.png
```