DeepSeek-V2#
Note
DeepSeek-V2[DALF+24] is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Architecture#
Number of Parameters#
# DeepSeek-V2/config.json
{
  "vocab_size": 102400,
  "hidden_size": 5120,
  "num_attention_heads": 128,
  "qk_nope_head_dim": 128,
  "v_head_dim": 128,
  "kv_lora_rank": 512,
  "q_lora_rank": 1536,
  "qk_rope_head_dim": 64,
  "first_k_dense_replace": 1,
  "intermediate_size": 12288,
  "n_shared_experts": 2,
  "n_routed_experts": 160,
  "moe_intermediate_size": 1536,
  "num_experts_per_tok": 6,
  ...
}
Now let's calculate the number of parameters of DeepSeek-V2 step by step:
Embedding and UnEmbedding: 1048576000
class DeepseekV2Model(DeepseekV2PreTrainedModel):
    def __init__(self, config: DeepseekV2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        # 102400 * 5120
        self.embed_tokens = nn.Embedding(
            config.vocab_size, config.hidden_size, self.padding_idx
        )


class DeepseekV2ForCausalLM(DeepseekV2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # 5120 * 102400
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
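A quick sanity check of this count, as a minimal sketch that just multiplies the config values above:

vocab_size, hidden_size = 102400, 5120
embed_params = vocab_size * hidden_size    # embed_tokens: 524288000
unembed_params = hidden_size * vocab_size  # lm_head:      524288000
print(embed_params + unembed_params)       # 1048576000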
MLA: 149225472 per layer (omitting RMSNorm weights, etc.)
Fig. 1 Multi-head Latent Attention.#
class DeepseekV2Attention(nn.Module):
    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.q_lora_rank = config.q_lora_rank
        self.qk_rope_head_dim = config.qk_rope_head_dim
        self.kv_lora_rank = config.kv_lora_rank
        self.v_head_dim = config.v_head_dim
        self.qk_nope_head_dim = config.qk_nope_head_dim
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
        # 5120 * 1536
        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        # 1536 * 128 * (128 + 64)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
        )
        # 5120 * (512 + 64)
        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,
            config.kv_lora_rank + config.qk_rope_head_dim,
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        # 512 * 128 * (128 + 128)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )
        # 128 * 128 * 5120
        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
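Summing the five projection matrices above (attention_bias is False, and RMSNorm weights are omitted as noted) reproduces the per-layer MLA count:

hidden_size, num_heads = 5120, 128
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
q_head_dim = qk_nope_head_dim + qk_rope_head_dim  # 192

mla_params = (
    hidden_size * q_lora_rank                                      # q_a_proj:            7864320
    + q_lora_rank * num_heads * q_head_dim                         # q_b_proj:           37748736
    + hidden_size * (kv_lora_rank + qk_rope_head_dim)              # kv_a_proj_with_mqa:  2949120
    + kv_lora_rank * num_heads * (qk_nope_head_dim + v_head_dim)   # kv_b_proj:          16777216
    + num_heads * v_head_dim * hidden_size                         # o_proj:             83886080
)
print(mla_params)  # 149225472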
MoE:
first layer (dense MLP): 188743680
other layers, total parameters: 3822059520
other layers, activated parameters: 188743680
class DeepseekV2MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )
        # 5120 * intermediate_size * 3
        # intermediate_size = 12288 if layer_idx=0 else 1536
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
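Because first_k_dense_replace is 1, only the first layer keeps the dense 12288-wide MLP; every other layer uses 160 routed plus 2 shared experts of width 1536 (the small router gate, 5120 * 160 per MoE layer, is omitted here, matching the counts above). A sketch of the per-layer FFN numbers:

hidden_size = 5120
intermediate_size, moe_intermediate_size = 12288, 1536
n_shared_experts, n_routed_experts, num_experts_per_tok = 2, 160, 6

dense_mlp = 3 * hidden_size * intermediate_size        # first layer: 188743680
expert_mlp = 3 * hidden_size * moe_intermediate_size   # one expert:   23592960

moe_total = (n_shared_experts + n_routed_experts) * expert_mlp          # 3822059520
moe_activated = (n_shared_experts + num_experts_per_tok) * expert_mlp   # 188743680
print(dense_mlp, moe_total, moe_activated)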
Concluding our calculation over all 60 transformer layers (the first layer uses the dense MLP; the remaining 59 use the MoE FFN):
total parameters: 235692359680
activated parameters: 21326725120
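A final sketch that adds everything up, using num_hidden_layers = 60 from the (elided) config:

num_layers = 60                 # num_hidden_layers in config.json
embed = 2 * 102400 * 5120       # embedding + lm_head: 1048576000
mla = 149225472                 # per-layer attention (MLA)

total = embed + num_layers * mla + 188743680 + 59 * 3822059520
activated = embed + num_layers * mla + 188743680 + 59 * 188743680
print(total)      # 235692359680
print(activated)  # 21326725120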
Reinforcement Learning#
In order to further unlock the potential of DeepSeek-V2 and align it with human preferences, we conduct Reinforcement Learning (RL).
Reinforcement Learning Algorithm. DeepSeek-V2 adopts GRPO (Group Relative Policy Optimization), which forgoes the critic model and instead estimates the baseline from the scores of a group of outputs sampled for the same prompt.
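A minimal sketch of the group-relative advantage estimation at the heart of GRPO; the group size and reward values below are illustrative, not taken from the paper:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) reward-model scores for G sampled outputs of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 sampled responses to a single prompt (illustrative values).
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
advantages = group_relative_advantages(rewards)
# Each token of response i is credited with advantages[i] in the clipped
# PPO-style objective, together with a KL penalty against the reference policy.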
Training Strategy. In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment.
Tip
We obtain code preference data based on compiler feedback, and mathematical preference data based on the ground-truth labels (a reward model is still trained in both cases). In our preference alignment experiments, we find that the online approach significantly outperforms the offline approach. Therefore, we invest tremendous efforts in implementing an online RL framework for aligning DeepSeek-V2.