Gemini 2.5
Tip
The smaller models in the Gemini 2.5 series (Flash size and below) use distillation, as was done in the Gemini 1.5 series. To reduce the cost of storing the teacher's next-token prediction distribution, we approximate it with a k-sparse distribution over the vocabulary, storing only the top-k tokens.
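As a rough illustration of the storage trick (not the actual Gemini implementation), the sketch below keeps only the top-k teacher probabilities and their token ids per position, then trains the student with a soft cross-entropy against that sparse target. The function names, the value of k, and the renormalization of the retained mass are assumptions.

```python
import torch
import torch.nn.functional as F

def sparsify_teacher(teacher_logits: torch.Tensor, k: int = 64):
    """Keep only the top-k teacher probabilities per position (renormalized to sum to 1)."""
    probs = F.softmax(teacher_logits, dim=-1)                    # [batch, seq, vocab]
    topk_probs, topk_ids = probs.topk(k, dim=-1)                 # [batch, seq, k]
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)   # renormalize retained mass
    return topk_probs, topk_ids                                  # far cheaper to store than the full vocab

def distill_loss(student_logits: torch.Tensor,
                 topk_probs: torch.Tensor,
                 topk_ids: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy of the student against the k-sparse teacher target."""
    log_q = F.log_softmax(student_logits, dim=-1)                # student log-probs over full vocab
    log_q_topk = log_q.gather(-1, topk_ids)                      # select only the stored teacher tokens
    return -(topk_probs * log_q_topk).sum(-1).mean()
```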
Tip
Post-training improvements have been driven by a consistent focus on data quality across the Supervised
Fine-Tuning (SFT), Reward Modeling (RM), and Reinforcement Learning (RL) stages. A key element
has been leveraging the model itself to assist in these processes, enabling more efficient and nuanced
quality control.
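A minimal sketch of what model-assisted quality control could look like in practice: an autorater model scores candidate (prompt, response) pairs and only high-scoring examples are kept for SFT or RM training. The `autorater` interface, field names, and threshold below are illustrative assumptions, not the recipe described in the report.

```python
from typing import Callable, Iterable

def filter_sft_data(examples: Iterable[dict],
                    autorater: Callable[[str, str], float],
                    threshold: float = 0.8) -> list[dict]:
    """Keep only (prompt, response) pairs the autorater judges to be above a quality threshold."""
    kept = []
    for ex in examples:
        score = autorater(ex["prompt"], ex["response"])  # model-based quality judgment in [0, 1]
        if score >= threshold:
            kept.append({**ex, "quality": score})        # retain the score for later weighting
    return kept
```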
Furthermore, we have increased the training compute allocated to RL, allowing deeper exploration
and refinement of model behaviors. This has been coupled with a focus on verifiable rewards
and model-based generative rewards to provide more sophisticated and scalable feedback signals. Algorithmic changes to the RL process have also improved stability during longer training.
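One way to picture combining the two reward sources is a weighted blend of a binary verifiable check (e.g., an exact-match or test-pass verifier) with a scalar score from a generative reward model. The weighting, the verifier and reward-model interfaces, and the [0, 1] score range are assumptions for illustration only.

```python
from typing import Callable

def combined_reward(prompt: str,
                    response: str,
                    reference: str,
                    verifier: Callable[[str, str], bool],
                    generative_rm: Callable[[str, str], float],
                    weight: float = 0.5) -> float:
    """Blend a binary verifiable reward with a model-based reward (illustrative weighting)."""
    verifiable = 1.0 if verifier(response, reference) else 0.0   # e.g. exact match or unit-test pass
    model_based = generative_rm(prompt, response)                # critique model scores in [0, 1]
    return weight * verifiable + (1.0 - weight) * model_based
```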