Qwen 2.5#
Note
Qwen 2.5[QY+25] implements elaborate supervised fine-tuning with over 1 million samples, followed by multi-stage reinforcement learning, including offline DPO and online GRPO.
Architecture & Tokenizer#
The Qwen2.5 series includes open-weight dense models as well as MoE models offered via API services.
For dense models, we maintain the same Transformer-based decoder architecture[VSP+23] as Qwen2. The architecture incorporates several key components: Grouped Query Attention (GQA[ALTdJ+23]) for efficient KV-cache utilization, the SwiGLU activation function[DFAG17] for non-linear activation, Rotary Positional Embeddings (RoPE[SLP+23]) for encoding position information, QKV bias in the attention mechanism, and RMSNorm[JGZP23] with pre-normalization to ensure stable training.
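The sketch below puts these pieces together in one minimal, self-contained decoder block in PyTorch. The dimensions (hidden size, head counts, FFN width), the RoPE base, and the interleaved RoPE formulation are illustrative assumptions, not the released Qwen2.5 configurations.

```python
# Minimal sketch of a Qwen2.5-style dense decoder block (illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Normalize by the root-mean-square, then apply a learned scale.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def rope(x, base=10000.0):
    # Rotary positional embedding on q/k; x has shape (batch, heads, seq, head_dim).
    b, h, t, d = x.shape
    pos = torch.arange(t, device=x.device, dtype=x.dtype)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=x.dtype) / d))
    angles = torch.einsum("t,f->tf", pos, inv_freq)            # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

class GQAttention(nn.Module):
    # Grouped Query Attention: fewer KV heads than query heads shrinks the KV cache.
    def __init__(self, dim=1024, n_heads=16, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # QKV projections carry a bias term; the output projection does not.
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=True)
        self.k = nn.Linear(dim, n_kv_heads * self.head_dim, bias=True)
        self.v = nn.Linear(dim, n_kv_heads * self.head_dim, bias=True)
        self.o = nn.Linear(n_heads * self.head_dim, dim, bias=False)
    def forward(self, x):
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Each KV head serves a group of query heads.
        group = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, t, -1))

class SwiGLU(nn.Module):
    # Gated FFN: down(silu(gate(x)) * up(x)).
    def __init__(self, dim=1024, hidden=2816):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DecoderBlock(nn.Module):
    # Pre-normalization: RMSNorm before attention and before the FFN.
    def __init__(self, dim=1024):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.mlp = GQAttention(dim), SwiGLU(dim)
    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

y = DecoderBlock()(torch.randn(2, 16, 1024))   # output shape: (2, 16, 1024)
```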
Building upon the dense model architecture, we extend it to MoE variants by replacing the standard feed-forward network (FFN) layers with specialized MoE layers, each comprising multiple FFN experts and a routing mechanism that dispatches tokens to its top-ranked experts.
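Below is a hedged sketch of such a replacement: a top-k token router in front of a set of SwiGLU FFN experts. The expert count, `top_k`, hidden size, and the simple loop-over-experts dispatch are illustrative assumptions, not the actual Qwen2.5 MoE configuration.

```python
# Sketch: replace a dense FFN with an MoE layer (top-k token routing), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # One SwiGLU FFN expert, the same form as the dense FFN it replaces.
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoELayer(nn.Module):
    def __init__(self, dim=1024, hidden=1408, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)      # token -> expert logits
        self.experts = nn.ModuleList([Expert(dim, hidden) for _ in range(n_experts)])
        self.top_k = top_k
    def forward(self, x):
        b, t, d = x.shape
        flat = x.reshape(-1, d)                                  # route each token independently
        gates = F.softmax(self.router(flat), dim=-1)
        topw, topi = gates.topk(self.top_k, dim=-1)              # keep the k best experts per token
        topw = topw / topw.sum(dim=-1, keepdim=True)             # renormalize kept gate weights
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topi == e).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += topw[token_idx, slot, None] * expert(flat[token_idx])
        return out.reshape(b, t, d)

y = MoELayer()(torch.randn(2, 16, 1024))                         # same shape in and out
```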
For tokenization, we utilize Qwen’s tokenizer[BBC+23], which implements byte-level byte-pair encoding (BBPE[WCG19]) with a vocabulary of 151,643 regular tokens.
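As a quick usage sketch, assuming the Hugging Face `transformers` library and the publicly released `Qwen/Qwen2.5-7B` checkpoint (model ID chosen for illustration), the tokenizer can be loaded and inspected as follows; note that the total vocabulary size also counts any control tokens added on top of the regular tokens.

```python
# Load the Qwen2.5 byte-level BPE tokenizer and inspect it (requires `transformers`).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
ids = tok.encode("Byte-level BPE handles any Unicode text, e.g. 你好!")
print(len(tok), ids[:8])   # total vocabulary size and the first few token ids
```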
Pre-training#
Qwen2.5 demonstrates significant enhancements in pre-training data quality compared to its predecessor Qwen2. These improvements stem from several key aspects:
- Better data filtering.
- Better math and code data.
- Better synthetic data.
- Better data mixture.
We develop hyper-parameter scaling laws based on the pre-training data of Qwen2.5. While previous studies primarily used scaling laws to determine optimal model sizes given compute budgets, we leverage them to identify optimal hyper-parameters, such as the learning rate and batch size, across model architectures.
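As an illustration of how such a hyper-parameter scaling law could be fit, the sketch below assumes, purely for illustration (this is neither the paper's functional form nor its data), that the optimal learning rate follows a power law in model size N and data size D, and fits the exponents by least squares in log space.

```python
# Hypothetical power-law fit: lr_opt = a * N^b * D^c, fitted in log space with NumPy.
import numpy as np

# Hypothetical (N params, D tokens, best lr found by sweep) triples.
runs = np.array([
    (1e8, 2e10, 1.5e-3),
    (5e8, 1e11, 9.0e-4),
    (3e9, 6e11, 4.0e-4),
    (1e10, 2e12, 2.5e-4),
])
logN, logD, logLR = np.log(runs[:, 0]), np.log(runs[:, 1]), np.log(runs[:, 2])
X = np.stack([np.ones_like(logN), logN, logD], axis=1)
(loga, b, c), *_ = np.linalg.lstsq(X, logLR, rcond=None)   # least-squares fit in log space

def predict_lr(N, D):
    # Extrapolate the fitted law to a new (model size, data size) budget.
    return float(np.exp(loga) * N**b * D**c)

print(predict_lr(7e9, 18e12))
```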
Post-training#
Supervised Fine-tuning#
Coding: To enhance coding capabilities, we incorporate the instruction tuning data of Qwen2.5-Coder. We expand our instruction dataset by synthesizing new examples from code-related Q&A websites and gathering algorithmic code snippets from GitHub. A comprehensive multilingual sandbox is used to perform static code checking and validate code snippets through automated unit testing, ensuring code quality and correctness.
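A minimal sketch of this kind of validation for a Python snippet is shown below; the function name, the use of `py_compile` for static checking, and `pytest` for running unit tests are assumptions for illustration, not the actual multilingual sandbox described in the report.

```python
# Sketch: validate a synthesized code sample by static checking plus sandboxed unit tests.
import pathlib, py_compile, subprocess, sys, tempfile

def validate_python_snippet(solution_src: str, test_src: str, timeout_s: int = 10) -> bool:
    """Return True only if the snippet compiles and its unit tests pass (requires pytest)."""
    with tempfile.TemporaryDirectory() as tmp:
        sol = pathlib.Path(tmp) / "solution.py"
        tst = pathlib.Path(tmp) / "test_solution.py"
        sol.write_text(solution_src)
        tst.write_text(test_src)
        try:
            py_compile.compile(str(sol), doraise=True)       # static check: does it even parse?
        except py_compile.PyCompileError:
            return False
        try:
            proc = subprocess.run(                            # dynamic check: do the tests pass?
                [sys.executable, "-m", "pytest", "-q", tst.name],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0

# Samples that pass are kept as instruction-tuning data; failures are discarded.
```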
Offline Reinforcement Learning#
In this study, we focus on objective query domains such as mathematics, coding, instruction following, and logical reasoning.

In the previous phase, we extensively employ strategies like execution feedback and answer matching to ensure the quality of responses. For the current phase, we reuse that pipeline, employing the SFT model to resample responses for a new set of queries. Responses that pass our quality checks are used as positive examples, while those that fail are treated as negative examples for Direct Preference Optimization (DPO[RSM+24]). To further enhance the reliability and accuracy of the training signals, we make use of both human and automated review processes.
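For reference, here is a compact sketch of the DPO objective on (chosen, rejected) pairs, assuming PyTorch and that per-sequence log-probabilities under the policy and the frozen reference model are already available; `beta` is the usual temperature hyper-parameter, with an illustrative value.

```python
# Sketch of the DPO loss from precomputed sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are the log-ratios between the policy and the reference model.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-15.1]))
```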
Online Reinforcement Learning#
The queries utilized to train the reward model are drawn from two distinct datasets: publicly available open-source data and a proprietary query set characterized by higher complexity. Responses are generated from checkpoints of the Qwen models, which have been fine-tuned using different methods (SFT, DPO, and RL) at various stages of training. To introduce diversity, those responses are sampled at different temperature settings. Preference pairs are created through both human and automated labeling processes, and the training data for DPO is also integrated into this dataset.
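A hedged sketch of how such preference pairs are typically turned into a reward-model training signal is given below; the pairwise logistic (Bradley-Terry style) loss and the scalar-head scoring interface are common choices assumed here, not details stated in this section.

```python
# Sketch: pairwise reward-model loss on scalar scores for (chosen, rejected) responses.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the chosen response's score above the rejected one's.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

loss = pairwise_rm_loss(torch.tensor([1.8, 0.4]), torch.tensor([0.9, -0.2]))
```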
In our online reinforcement learning (RL) framework, we employ GRPO. The query set used to train the reward model is identical to the one used in the RL training phase. The order in which queries are processed during training is determined by the variance of their response scores, as evaluated by the reward model: queries with higher variance are prioritized to ensure more effective learning. We sample 8 responses for each query.
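The sketch below, assuming PyTorch, illustrates the group statistics involved: the rewards of the 8 sampled responses per query are normalized within the group to form advantages (the core of GRPO's advantage estimate), and queries are ordered by the variance of their response scores. Tensor shapes and the random rewards are illustrative.

```python
# Sketch: GRPO-style group-normalized advantages and variance-based query prioritization.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_queries, group_size=8) reward-model scores.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)          # advantage of each response within its group

def prioritize_queries(rewards: torch.Tensor) -> torch.Tensor:
    # Higher score variance first: these queries carry a stronger learning signal.
    return torch.argsort(rewards.var(dim=-1), descending=True)

rewards = torch.randn(4, 8)                        # 4 queries x 8 sampled responses
adv = group_advantages(rewards)
order = prioritize_queries(rewards)
```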