ShorterBetter
Note
We define the Sample Optimal Length (SOL) as the length of the shortest correct response among multiple generations, which serves as a dynamic reward signal to guide the model toward efficient reasoning.
Sample Optimal Length (SOL)
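Given a prompt \(x_i\) and a reference response \(y_{i}^{\ast}\), our method samples \(n\) candidate responses (rollouts) \(G(x_i) = \{y_1, y_2, \dots, y_n\}\) from the policy \(p_{\theta}(\cdot|x_i)\). Let \(\mathcal{l}(y_j)\) denote the length of response \(y_j\), and let the indicator \(\mathbb{I}(y_j=y_{i}^{\ast})\) equal 1 when \(y_j\) is correct, i.e. agrees with the reference, and 0 otherwise. We then define the SOL as: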
\[\begin{split}
\mathcal{l}^{SOL}(G(x_i)) =
\begin{cases}
\underset{y_j\in G(x_i),\mathbb{I}(y_j=y_{i}^{\ast})=1}{\min}\mathcal{l}(y_j),\quad &\text{if at least one response is correct,}\\
\frac{1}{n}\sum_{j=1}^{n}\mathcal{l}(y_j),&\text{otherwise.}
\end{cases}
\end{split}\]
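The case distinction above can be made concrete with a small sketch. The function name `sample_optimal_length` and its inputs (a list of response lengths and a parallel list of correctness flags) are illustrative assumptions, not part of the original method's code; how lengths are measured and how correctness is checked is left to the caller.

```python
from typing import Sequence


def sample_optimal_length(lengths: Sequence[int], correct: Sequence[bool]) -> float:
    """Return the SOL for one group of rollouts (hypothetical helper).

    lengths[j] is the length of response y_j; correct[j] is True when
    y_j is judged correct against the reference response.
    """
    if not lengths or len(lengths) != len(correct):
        raise ValueError("lengths and correct must be non-empty and of equal size")

    # Lengths of the correct responses only.
    correct_lengths = [length for length, ok in zip(lengths, correct) if ok]

    if correct_lengths:
        # At least one response is correct: take the shortest correct one.
        return float(min(correct_lengths))

    # No response is correct: fall back to the mean length of the group.
    return sum(lengths) / len(lengths)


# Two of three rollouts are correct; the shorter correct one has length 120.
print(sample_optimal_length([120, 80, 200], [True, False, True]))  # 120.0
# No rollout is correct; the mean length is used instead.
print(sample_optimal_length([100, 200, 300], [False, False, False]))  # 200.0
```

Note that in the first call the shortest response overall (length 80) is ignored because it is incorrect; only correct responses compete for the minimum.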
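Given the SOL, we define the following reward function for each response \(y_j\), where the coefficient \(\alpha\) weights correctness and the coefficient \(\beta\) weights the penalty on the deviation of the response length from the SOL: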
\[
r(y_j) = \alpha\cdot\mathbb{I}(y_j=y_{i}^{\ast}) - \beta\cdot|\mathcal{l}(y_j) - \mathcal{l}^{SOL}(G(x_i))|
\]
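As a worked illustration of the reward, the following sketch computes \(r(y_j)\) for every response in one group. The function name `shorter_better_rewards` and the values chosen for `alpha` and `beta` are assumptions made for the example; they are not taken from the source.

```python
from typing import List, Sequence


def shorter_better_rewards(
    lengths: Sequence[int],
    correct: Sequence[bool],
    alpha: float,
    beta: float,
) -> List[float]:
    """Compute r(y_j) = alpha * I(correct) - beta * |len(y_j) - SOL| per response."""
    if not lengths or len(lengths) != len(correct):
        raise ValueError("lengths and correct must be non-empty and of equal size")

    # SOL: shortest correct length if any response is correct, else the mean length.
    correct_lengths = [length for length, ok in zip(lengths, correct) if ok]
    if correct_lengths:
        sol = float(min(correct_lengths))
    else:
        sol = sum(lengths) / len(lengths)

    return [
        alpha * float(ok) - beta * abs(length - sol)
        for length, ok in zip(lengths, correct)
    ]


# Illustrative values only: alpha = 1.0 and beta = 0.5 are not from the source.
rewards = shorter_better_rewards([12, 8, 20], [True, False, True], alpha=1.0, beta=0.5)
print(rewards)  # [1.0, -2.0, -3.0]
```

In this example the SOL is 12. The first response is correct and exactly at the SOL, so it receives the full correctness term. The second is incorrect and is additionally penalized for deviating from the SOL, even though it is shorter. The third is correct but much longer than the SOL, so with this choice of `beta` the length penalty outweighs the correctness term.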