685B? What are the extra parameters compared to 671B?

#32
by hankhw - opened

Does anybody know what the extra params are?

Commenting to follow

If you read the technical paper, you will find the answer: https://arxiv.org/html/2412.19437v1

Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts D additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.
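To make the "sequential prediction with a complete causal chain" idea concrete, here is a toy PyTorch sketch. It is not DeepSeek's implementation; the module name, dimensions, and the wrap-around token shift are made up purely to illustrate how one MTP depth reuses the previous depth's hidden states under the same causal mask instead of using an independent parallel head.

```python
# Toy sketch of one sequential MTP depth (illustrative only, not DeepSeek-V3 code).
import torch
import torch.nn as nn

class MTPDepth(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm_h = nn.LayerNorm(d_model)   # normalize previous-depth hidden states
        self.norm_e = nn.LayerNorm(d_model)   # normalize embeddings of the shifted tokens
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, prev_hidden, shifted_emb, causal_mask):
        # Combine depth k-1 hidden states with embeddings of tokens one step ahead,
        # then run one extra transformer block under the same causal mask,
        # so the causal chain is preserved at this prediction depth.
        x = torch.cat([self.norm_h(prev_hidden), self.norm_e(shifted_emb)], dim=-1)
        return self.block(self.proj(x), src_mask=causal_mask)

d_model, n_heads, seq_len, vocab = 64, 4, 16, 1000
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)   # output head shared with the main model
mtp = MTPDepth(d_model, n_heads)

tokens = torch.randint(0, vocab, (2, seq_len))
main_hidden = torch.randn(2, seq_len, d_model)    # stand-in for the main model's hidden states
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Depth 1: at position i the main model predicts t_{i+1}; this depth predicts t_{i+2}.
shifted = embed(torch.roll(tokens, shifts=-1, dims=1))  # embeddings of t_{i+1} (wrap-around ignored)
hidden_1 = mtp(main_hidden, shifted, mask)
logits_depth1 = lm_head(hidden_1)                 # trained against targets t_{i+2}
```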

DeepSeek-V3-0324 has an extra MTP output head (roughly 14B parameters) that was used during training for the multi-token prediction objective; at inference time it can act as a draft model for speculative decoding, if the serving code supports it, to further speed up generation.
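A rough way to check where the extra ~14B parameters live is to count tensor sizes in the checkpoint's safetensors shards and split out the MTP module. The sketch below assumes the MTP weights are stored under a `model.layers.61` key prefix and that the checkpoint sits in a local `DeepSeek-V3-0324/` directory; both are assumptions to verify by inspecting the actual tensor names.

```python
# Sketch: count parameters per tensor in the safetensors shards, separating the
# (assumed) MTP module from the main model. Prefix and path are assumptions.
import glob
import math
from safetensors import safe_open

main_params, mtp_params = 0, 0
for shard in glob.glob("DeepSeek-V3-0324/*.safetensors"):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for key in f.keys():
            n = math.prod(f.get_slice(key).get_shape())
            if key.startswith("model.layers.61"):   # assumed MTP module prefix
                mtp_params += n
            else:
                main_params += n

print(f"main model: {main_params/1e9:.1f}B, MTP module: {mtp_params/1e9:.1f}B")
```

If the split is right, the two totals should add up to about 685B, with the main model alone at about 671B.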
