Damn.
Okay, your previous efforts were impressive, but this is just ridiculous.
Obviously your "dynamic task vector machine" (the notion that we need to clean the noise out of reasoning traces and shape them extremely deliberately) is no longer a hypothesis. Performance gains this radical are a bizarre outcome, in my opinion, and I'm starting to wonder whether there isn't A LOT more here in terms of test-time compute, especially in the domain of agentic language models.
So I have a few things I feel are important to say:
I want to offer a friendly suggestion. The work you are doing is brilliant, and I'd like more people to be able to grasp how important it is. Reading through your papers, I'm wondering whether there's a way to demonstrate it to less technically apt folks, or even to folks with lower literacy and reading comprehension. In other words, I invite you to think (or have a few big models or agents think?) about how to present what you are doing as a novel post-training step with a new name. I think it's time, the work is significant enough, your lab has earned it, and it could help humanity more than we can see right now. I'm not sure exactly what this would look like.
Building on that, I believe what you are doing needs to be standardized. You're already doing a good job identifying different formatting templates for reasoning structures and logging and comparing their performance. These templates (I'm theorizing at this point) could be critical. Naming them, testing them, and showing others how to implement them in their own models is key. But it isn't just that. As a "novel post-training step," this should ENCOMPASS aligning reasoning traces to an optimized format, plus whatever else you've identified is needed to fully convert a model's reasoning into what you consider a "dynamic task vector machine" (sorry for my poor grasp of the concept). The process needs to be identifiable, succinctly and clearly documented, and repeatable, so it can be talked about alongside inside-baseball terms like DPO. At this point it's fair to say that if the brilliant Qwen lab is leaving this much performance on the table, even IF it is domain specific, this is something that NEEDS to be implemented well beyond the Qwen line.
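To make "naming and testing templates" concrete, here's the sort of harness I'm picturing. This is purely my own sketch, not anything from your pipeline; the template names, the `score_template` helper, and the `eval_set`/`run_model` hooks are all hypothetical:

```python
# Hypothetical harness for naming reasoning-trace formatting templates and
# comparing their downstream accuracy. Nothing here is from the Jan/Qwen
# codebase; it just illustrates "name it, test it, log it, repeat it".
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TraceTemplate:
    name: str                           # stable, citable template name
    render: Callable[[list[str]], str]  # reasoning steps -> formatted trace

# Two made-up formats for the same underlying reasoning steps.
XML_STEPS = TraceTemplate(
    "xml-steps-v1",
    lambda steps: "\n".join(f"<step>{s}</step>" for s in steps),
)
NUMBERED = TraceTemplate(
    "numbered-v1",
    lambda steps: "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps)),
)

def score_template(
    template: TraceTemplate,
    eval_set: Iterable[tuple[list[str], str, str]],  # (steps, question, answer)
    run_model: Callable[[str], str],                 # prompt -> model output
) -> float:
    """Re-render each example's reasoning with the template, query the
    model, and return exact-match accuracy for that template."""
    examples = list(eval_set)
    hits = sum(
        run_model(f"{q}\n{template.render(steps)}\nAnswer:").strip() == a
        for steps, q, a in examples
    )
    return hits / len(examples)

# Usage: a named, repeatable comparison instead of an ad-hoc one.
# for t in (XML_STEPS, NUMBERED):
#     print(t.name, score_template(t, my_eval_set, my_run_model))
```

The point isn't the code, it's the discipline: a named template plus a logged score is something another lab can reproduce and cite.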
Next... you folks need money. Compute. You need to scale this, NOW. 8, 14, and 32 billion parameters would be good, but obviously the 2507 thinking models are the target. You need to hit the 30B-A3B and the 235B-A22B, folks, in that order. Elevate your system of models into what I think would be a massively impressive lineup of agentic reasoning models: Nano (1.7B) / Small (4B) / Medium (30B-A3B) / Large (235B-A22B). And not just that: Magistral, OSS, SmolLM3, GLM, Ling Lite & Plus, Hunyuan, etc. are all targets.
Fire right away, and hard. Moving forward, I believe the use cases and realistic tasks a scaled version of this dynamic task vector machine could handle go FAR beyond web search: embodied AI, and zero-shot tool use precise enough to open up a world of downstream applications by pushing tool complexity further without compromising accuracy. Even, potentially, LMs DESIGNED to be lean on memorized data, intended instead to be paired with bodies of verified, accurate information for the end use case/system they're implemented in. I think you need a dataset of the most complex MCP and tool calls you can get your hands on (see the P.S. below for the kind of record I mean), but whatever you're cooking right now SEEMS to be working, so you don't need me to tell you that. I don't know how far this team will take this, but... I see something glittering here. I want to see your platform thrive for all the knowledge you've contributed, and I wish you all the best. Cheers.
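P.S. For what it's worth, here's roughly the shape of record I'm imagining for that dataset. Every field name is my own invention, not an existing spec:

```python
# One record of a hypothetical "hard tool calls" dataset. The idea is just
# that complex, deeply nested call schemas come paired with a verified
# ground-truth result, so precision can actually be measured.
import json

record = {
    "tool_name": "filesystem.read_range",  # made-up MCP tool for illustration
    "arg_schema_depth": 3,                 # how deeply nested the arguments are
    "arguments": {
        "path": "/data/logs/app.log",
        "range": {"start_line": 120, "end_line": 180},
    },
    "expected_result": "lines 120-180 of app.log",  # verified ground truth
    "verified_by": "human",                # or an automated checker
}

print(json.dumps(record, indent=2))
```

Ranking records by that nesting depth (or however you measure complexity) would show exactly where accuracy starts to fall off.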
Thank you! We will consider your suggestion!
Hope you have a good time with Jan-v1!
@CyborgPaloma Could this even be used with MoE, though? Admittedly I'm not down in the weeds like you folks are, but I was looking into something similar, and MoE training is just an entirely different beast.