ShorterBetter


Note

We define the Sample Optimal Length (SOL) as the length of the shortest correct response among multiple generations, which serves as a dynamic reward signal to guide the model toward efficient reasoning.

Sample Optimal Length (SOL)

Given a prompt \(x_i\) and a reference response \(y_{i}^{\ast}\), our method samples \(n\) candidate responses (rollouts) \(G(x_i) = \{y_1, y_2, \dots, y_n\}\) from the policy \(p_{\theta}(\cdot|x_i)\). Writing \(\mathcal{l}(y)\) for the length of a response \(y\), we define the SOL as:

\[\begin{split} \mathcal{l}^{SOL}(G(x_i)) = \begin{cases} \underset{y_j\in G(x_i),\mathbb{I}(y_j=y_{i}^{\ast})=1}{\min}\mathcal{l}(y_j),\quad &\text{if at least one response is correct,}\\ \frac{1}{n}\sum_{j=1}^{n}\mathcal{l}(y_j),&\text{otherwise.} \end{cases} \end{split}\]
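
The case distinction in the SOL definition can be illustrated with a minimal Python sketch. It assumes that response lengths and correctness flags have already been computed for each rollout; the function name `sample_optimal_length` and its arguments are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def sample_optimal_length(lengths: Sequence[int], correct: Sequence[bool]) -> float:
    """Return the SOL for one group of rollouts (illustrative sketch).

    lengths[j] is the length of response y_j; correct[j] is True when
    y_j matches the reference response. Both inputs are assumed to be
    precomputed by the caller.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    if len(lengths) == 0:
        raise ValueError("at least one rollout is required")

    # Case 1: at least one response is correct -> shortest correct length.
    correct_lengths = [length for length, ok in zip(lengths, correct) if ok]
    if correct_lengths:
        return float(min(correct_lengths))

    # Case 2: no response is correct -> mean length over all n rollouts.
    return sum(lengths) / len(lengths)


# Example: two of three rollouts are correct, so the SOL is the shorter
# of the two correct lengths.
sol_some_correct = sample_optimal_length([120, 80, 200], [True, False, True])

# Example: no rollout is correct, so the SOL falls back to the mean length.
sol_none_correct = sample_optimal_length([100, 200, 300], [False, False, False])
```

Note that in the first example the incorrect response of length 80 is ignored even though it is the shortest overall, since only correct responses are candidates for the minimum.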

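Given the SOL, we define the following reward for each response \(y_j \in G(x_i)\), where the hyperparameters \(\alpha\) and \(\beta\) weight the correctness term and the length-deviation penalty, respectively: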

\[ r(y_j) = \alpha\cdot\mathbb{I}(y_j=y_{i}^{\ast}) - \beta\cdot|\mathcal{l}(y_j) - \mathcal{l}^{SOL}(G(x_i))| \]
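
As a rough illustration of how this reward behaves, the following Python sketch evaluates it for one group of rollouts. It assumes the SOL has already been computed and is passed in as `sol`; the function name `shorter_better_rewards` and the values of `alpha` and `beta` in the example are hypothetical and are not taken from the source.

```python
from typing import List, Sequence


def shorter_better_rewards(
    lengths: Sequence[int],
    correct: Sequence[bool],
    sol: float,
    alpha: float,
    beta: float,
) -> List[float]:
    """Compute r(y_j) = alpha * 1[y_j correct] - beta * |len(y_j) - SOL|.

    Illustrative sketch: lengths, correctness flags, and the SOL are
    assumed to be precomputed by the caller.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    return [
        alpha * float(ok) - beta * abs(length - sol)
        for length, ok in zip(lengths, correct)
    ]


# Example with hypothetical hyperparameters (alpha=1.0, beta=0.01) and a
# SOL of 120, i.e. the shortest correct response in this group.
rewards = shorter_better_rewards(
    lengths=[120, 80, 200],
    correct=[True, False, True],
    sol=120.0,
    alpha=1.0,
    beta=0.01,
)
```

In this example the correct response whose length equals the SOL receives the full correctness reward with no penalty, the longer correct response is penalized for its 80 extra units of length, and the incorrect response receives only a penalty. Because the penalty uses an absolute value, responses shorter than the SOL are penalized as well as longer ones.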