Question on converting this to DeepSeekV2ForCausalLM

#2
by michaelfeil - opened

I want to use this model to run tests for the larger deepseek-v3 architectures, primarily doing research on faster kernels for FlashMLA et al. This seems like a good model for that, since it takes much less time to load and still produces coherent output.

Is there any way this can be loaded into a non-LlamaForCausalLM architecture, e.g. DeepSeekV2ForCausalLM?

@michaelfeil feel free to check out this updated version of the model: llama3_2-1B-deepseek.
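For anyone wondering how the dispatch works: transformers picks the model class from the `architectures` and `model_type` fields in the checkpoint's `config.json`, so converting a checkpoint to load under DeepseekV2ForCausalLM involves (among other things, like renaming weights to match the MLA layout) rewriting those fields. A minimal sketch of that config edit, assuming the DeepSeek-V2 remote-code class name `DeepseekV2ForCausalLM` and model type `deepseek_v2`:

```python
import json

# Sketch: rewrite the architecture fields a Llama checkpoint declares
# so transformers dispatches to the DeepSeek-V2 modeling code instead.
# The remaining config keys (MLA dims, rope settings, etc.) would also
# need to match what DeepseekV2Config expects; this only shows the dispatch part.
config = {"architectures": ["LlamaForCausalLM"], "model_type": "llama"}

config["architectures"] = ["DeepseekV2ForCausalLM"]
config["model_type"] = "deepseek_v2"

print(json.dumps(config, indent=2))
```

Loading a repo converted this way would then typically require `trust_remote_code=True`, since the DeepSeek-V2 modeling code ships with the checkpoint rather than in the transformers library itself.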
