DeepSeek-Coder-V2#
Note
DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 [DALF+24]
with an additional 6 trillion tokens.
The pre-training data for DeepSeek-Coder-V2 consists of 60% source code, 10% math corpus, and 30% natural language corpus.
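As a rough illustration of how such a mixture can be sampled at training time, the sketch below draws each training sequence from one of three corpora in proportion to the reported ratios. Only the 60/10/30 split comes from the paper; the sampler itself and the corpus names are placeholders for illustration.

```python
import random

# Pre-training mixture ratios reported for DeepSeek-Coder-V2.
# The corpus names are placeholders, not the actual dataset identifiers.
MIXTURE = [
    ("source_code", 0.60),
    ("math_corpus", 0.10),
    ("natural_language", 0.30),
]

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training sequence is drawn from,
    proportionally to the mixture weights."""
    names, weights = zip(*MIXTURE)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 6000 / 1000 / 3000
```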
Supervised Fine-Tuning#
To build DeepSeek-Coder-V2 Chat, we construct an instruction training dataset that mixes code and math data. We first collect 20k code-related and 30k math-related instruction samples from DeepSeek-Coder and DeepSeek-Math. To maintain general ability, we also sample instruction data from DeepSeek-V2. Finally, we use an instruction dataset of 300M tokens.
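A minimal sketch of how such an instruction mixture could be assembled is shown below. Only the approximate sample counts come from the text; the data structures and the simple concatenate-and-shuffle step are assumptions (real pipelines would also deduplicate, filter by quality, and cap the total at roughly 300M tokens).

```python
import random

def build_sft_mixture(code_data, math_data, general_data, rng=None):
    """Combine code, math, and general instruction samples into one
    shuffled SFT dataset (a simplified sketch)."""
    rng = rng or random.Random(0)
    mixture = list(code_data) + list(math_data) + list(general_data)
    rng.shuffle(mixture)
    return mixture

# Hypothetical inputs standing in for the collections described above:
code_data = [{"prompt": f"code-{i}", "response": "..."} for i in range(20_000)]
math_data = [{"prompt": f"math-{i}", "response": "..."} for i in range(30_000)]
general_data = [{"prompt": f"gen-{i}", "response": "..."} for i in range(5_000)]

sft_dataset = build_sft_mixture(code_data, math_data, general_data)
print(len(sft_dataset))
```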
Reinforcement Learning#
Prompts Considerable effort was spent collecting prompts related to code and math from various sources; each code prompt comes with corresponding test cases. After filtering, approximately 40k prompts remain in total.
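The filtering rules are not spelled out in the text; the sketch below only illustrates the kind of check implied, namely keeping code prompts only when they ship with executable test cases. The field names and the rule itself are assumptions.

```python
def filter_prompts(prompts):
    """Keep code prompts only if they come with at least one test case;
    other prompts are kept as-is (a simplified, assumed rule)."""
    kept = []
    for p in prompts:
        if p["domain"] == "code" and not p.get("test_cases"):
            continue  # drop code prompts without executable checks
        kept.append(p)
    return kept

prompts = [
    {"domain": "code", "prompt": "Reverse a linked list.", "test_cases": ["assert ..."]},
    {"domain": "code", "prompt": "Write a web server.", "test_cases": []},
    {"domain": "math", "prompt": "Solve x^2 - 5x + 6 = 0."},
]
print(len(filter_prompts(prompts)))  # 2
```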
Reward Modeling Reward models play a crucial role in RL training. For mathematical
preference data, we obtain labels from the ground-truth answers. For code preference
data, although the compiler itself can already provide 0-1 feedback (pass or fail), some code prompts have a limited number of test cases and do not
provide full coverage. Therefore, we still train a reward model on the data provided by the compiler, and use the reward model to provide the signal during RL training.
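One plausible way to turn the compiler's 0-1 feedback into training data for a reward model is to build preference pairs in which a passing completion is preferred over a failing one. The exact construction is not specified in the text; the sketch below is an assumption.

```python
from itertools import product

def preference_pairs_from_compiler(samples):
    """Given several sampled completions for one prompt, each labeled with
    0-1 compiler/test feedback, build (chosen, rejected) pairs where a
    passing completion is preferred over a failing one."""
    passed = [s for s in samples if s["passed"]]
    failed = [s for s in samples if not s["passed"]]
    return [(p["code"], f["code"]) for p, f in product(passed, failed)]

samples = [
    {"code": "def add(a, b): return a + b", "passed": True},
    {"code": "def add(a, b): return a - b", "passed": False},
    {"code": "def add(a, b): return a * b", "passed": False},
]
print(preference_pairs_from_compiler(samples))  # two (chosen, rejected) pairs
```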
Reinforcement Learning Algorithm We employ GRPO as our RL algorithm. Notably, GRPO has proven to be quite effective and costs less than PPO, since there is no need to maintain an additional critic model.
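GRPO replaces the critic with a group baseline: for each prompt, several responses are sampled, and each response's advantage is its reward normalized by the group's mean and standard deviation. The sketch below shows only that advantage computation, not the full training loop; details such as the epsilon term are assumptions.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean(r)) / (std(r) + eps).
    The group mean acts as the baseline, so no critic model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for a group of sampled responses to one prompt:
rewards = [1.0, 0.0, 0.0, 1.0, 0.5]
print([round(a, 3) for a in grpo_advantages(rewards)])  # [1.118, -1.118, -1.118, 1.118, 0.0]
```

Because the baseline is computed from the sampled group itself, no separate value network has to be trained or stored, which is the cost saving relative to PPO mentioned above.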