DeepSeek-V2#

Note

DeepSeek-V2 [DALF+24] is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts two architectural innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE.

Architecture#

Multi-Head Latent Attention

DeepSeekMoE

Number of Parameters#

# DeepSeek-V2/config.json
{
    "vocab_size": 102400,
    "hidden_size": 5120,
    
    "num_attention_heads": 128,
    "qk_nope_head_dim": 128,
    "v_head_dim": 128,
    "kv_lora_rank": 512,
    "q_lora_rank": 1536,
    "qk_rope_head_dim": 64,

    "first_k_dense_replace": 1,
    "intermediate_size": 12288,
    "n_shared_experts": 2,
    "n_routed_experts": 160,
    "moe_intermediate_size": 1536,
    "num_experts_per_tok": 6,
    ...
}

Now let's calculate the number of parameters of DeepSeek-V2 step by step:

Embedding and unembedding: 2 * 102400 * 5120 = 1048576000

class DeepseekV2Model(DeepseekV2PreTrainedModel):
    def __init__(self, config: DeepseekV2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        # input embedding: vocab_size * hidden_size = 102400 * 5120
        self.embed_tokens = nn.Embedding(
            config.vocab_size, config.hidden_size, self.padding_idx
        )


class DeepseekV2ForCausalLM(DeepseekV2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # output (unembedding) projection: hidden_size * vocab_size = 5120 * 102400
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
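The embedding figure can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, using the values from the config excerpt and assuming, as the count above does, that the input embedding and the output projection are separate (untied) weight matrices.

```python
# Values copied from the config.json excerpt above.
vocab_size = 102400
hidden_size = 5120

# nn.Embedding(vocab_size, hidden_size) stores one hidden_size vector per token.
embed_params = vocab_size * hidden_size

# nn.Linear(hidden_size, vocab_size, bias=False) stores a weight of the same size.
lm_head_params = hidden_size * vocab_size

embedding_total = embed_params + lm_head_params
print(embedding_total)  # 1048576000
```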

MLA: 149225472 per layer (omitting RMSNorm weights etc.)

Fig. 1 Multi-head Latent Attention.

class DeepseekV2Attention(nn.Module):

    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        
        self.q_lora_rank = config.q_lora_rank
        self.qk_rope_head_dim = config.qk_rope_head_dim
        self.kv_lora_rank = config.kv_lora_rank
        self.v_head_dim = config.v_head_dim
        self.qk_nope_head_dim = config.qk_nope_head_dim
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim

        # 5120 * 1536
        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        # 1536 * 128 * (128 + 64)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False
        )

        # 5120 * (512 + 64)
        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,
            config.kv_lora_rank + config.qk_rope_head_dim,
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        # 512 * 128 * (128 + 128)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )

        # 128 * 128 * 5120
        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
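The per-layer MLA figure is the sum of the five projection matrices commented in the excerpt above. The sketch below reproduces it in plain Python. It assumes `config.attention_bias` is false (the excerpt does not show the value, but the stated total only matches without bias terms) and ignores the RMSNorm weights. The `mla_shapes` dictionary is just an illustrative container, not part of the model code.

```python
# Values copied from the config.json excerpt above.
hidden_size = 5120
num_heads = 128
q_lora_rank = 1536
kv_lora_rank = 512
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128
q_head_dim = qk_nope_head_dim + qk_rope_head_dim  # 192

# (in_features, out_features) for each linear layer, biases assumed absent.
mla_shapes = {
    "q_a_proj": (hidden_size, q_lora_rank),
    "q_b_proj": (q_lora_rank, num_heads * q_head_dim),
    "kv_a_proj_with_mqa": (hidden_size, kv_lora_rank + qk_rope_head_dim),
    "kv_b_proj": (kv_lora_rank, num_heads * (qk_nope_head_dim + v_head_dim)),
    "o_proj": (num_heads * v_head_dim, hidden_size),
}

mla_params = {name: n_in * n_out for name, (n_in, n_out) in mla_shapes.items()}
mla_total = sum(mla_params.values())

for name, count in mla_params.items():
    print(f"{name}: {count}")
print(f"total: {mla_total}")  # total: 149225472
```

Note that `o_proj` alone accounts for more than half of the total, while the two low-rank down-projections (`q_a_proj` and `kv_a_proj_with_mqa`) are comparatively small.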

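MoE (omitting the router weights):
- first layer (a dense FFN, since first_k_dense_replace = 1): 3 * 5120 * 12288 = 188743680
- each other layer, total parameters (160 routed + 2 shared experts): 162 * 3 * 5120 * 1536 = 3822059520
- each other layer, activated parameters (6 routed + 2 shared experts): 8 * 3 * 5120 * 1536 = 188743680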

class DeepseekV2MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        # 5120 * intermediate_size * 3
        # intermediate_size = 12288 if layer_idx=0 else 1536
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
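The three feed-forward figures all follow from the same `3 * hidden_size * intermediate_size` count for one gated MLP. The sketch below works them out from the config excerpt. The helper `mlp_params` is a hypothetical name introduced here, and the router (gating) weights of each MoE layer are left out, which is what the figures above appear to do as well.

```python
# Values copied from the config.json excerpt above.
hidden_size = 5120
intermediate_size = 12288       # dense FFN used in the first layer
moe_intermediate_size = 1536    # size of one expert
n_shared_experts = 2
n_routed_experts = 160
num_experts_per_tok = 6


def mlp_params(hidden: int, intermediate: int) -> int:
    """Parameters of one gated MLP: gate_proj, up_proj and down_proj, no biases."""
    return 3 * hidden * intermediate


dense_layer = mlp_params(hidden_size, intermediate_size)
one_expert = mlp_params(hidden_size, moe_intermediate_size)

# Every expert is stored, but only the shared ones plus the top-k routed ones
# are used for a given token.
moe_layer_total = (n_routed_experts + n_shared_experts) * one_expert
moe_layer_activated = (num_experts_per_tok + n_shared_experts) * one_expert

print(dense_layer)          # 188743680
print(moe_layer_total)      # 3822059520
print(moe_layer_activated)  # 188743680
```

The activated count of an MoE layer equals the dense first layer because 8 experts of width 1536 add up to the same width as one dense MLP of 12288.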

To conclude our calculation:

total parameters: 235692359680
activated parameters: 21326725120
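These totals can be reproduced by combining the per-component counts. The number of layers is not part of the config excerpt above, so the sketch below assumes `num_hidden_layers = 60`. With that value both totals come out exactly as stated, which supports the assumption but does not replace checking the full config.

```python
# Assumption: not shown in the config excerpt above.
num_hidden_layers = 60
# From the config excerpt: the first k layers use a dense FFN instead of MoE.
first_k_dense_replace = 1

# Per-component counts derived earlier in this section.
embedding = 1048576000           # embedding + unembedding
mla_per_layer = 149225472        # attention, every layer
dense_ffn = 188743680            # FFN of the first (dense) layer
moe_layer_total = 3822059520     # all experts of one MoE layer
moe_layer_activated = 188743680  # experts used per token in one MoE layer

num_moe_layers = num_hidden_layers - first_k_dense_replace

# Parameters that every token passes through regardless of routing.
always_used = (
    embedding
    + num_hidden_layers * mla_per_layer
    + first_k_dense_replace * dense_ffn
)

total_params = always_used + num_moe_layers * moe_layer_total
activated_params = always_used + num_moe_layers * moe_layer_activated

print(total_params)      # 235692359680
print(activated_params)  # 21326725120
```

Rounded, this gives the 236B total and 21B activated parameters quoted at the top of the page.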

Reinforcement Learning#

In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.

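Reinforcement Learning Algorithm. We adopt Group Relative Policy Optimization (GRPO), which forgoes the critic model and instead estimates the baseline from the scores of a group of sampled outputs.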
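To make the idea behind GRPO concrete: instead of a learned critic, the baseline for each response comes from the other responses sampled for the same prompt. The sketch below shows only that group-relative advantage step, assuming one scalar reward per response and normalization by the group mean and population standard deviation. The function name, the `eps` term, and the example rewards are illustrative assumptions, not DeepSeek-V2's actual training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all responses get the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide the baseline.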

Training Strategy. In our preliminary experiments, we find that the RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from the training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment.

Tip

We obtain code preference data based on compiler feedback, and mathematical preference data based on ground-truth labels (a reward model is still trained on this data).
In our preference alignment experiments, we find that the online approach significantly outperforms the offline approach. Therefore, we invest tremendous efforts in implementing an online RL framework for aligning DeepSeek-V2.
