I only just skimmed it, but will try to dive deeper in a bit.
I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, a set of tools, a pydantic schema for each, etc. Outlines seems to be similar with their task definition.
I think where we differ is what happens when that doesn't work and the model still doesn't get the contract right. Take a pydantic-valid string path for glob that points to a non-existent thing: glob will error, Forge catches it and nudges the model. Forge does very little manipulation of model output (just a basic regex parse to try to find JSON/XML); the core of it is in the retry mechanisms.
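To make the catch-and-nudge idea above concrete, here is a minimal sketch using only the standard library. It is not Forge's actual API: `extract_json`, `glob_tool`, `run_with_retries`, `MAX_RETRIES`, and the shape of the `model` callable are all hypothetical names chosen for illustration.

```python
import glob
import json
import re
from typing import Callable

MAX_RETRIES = 3  # hypothetical retry budget, not a Forge setting


def extract_json(text: str) -> dict:
    """Basic regex parse: grab the outermost {...} span and decode it."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def glob_tool(args: dict) -> list[str]:
    """Hypothetical tool: well-formed arguments can still fail at run time."""
    pattern = args["pattern"]
    if not isinstance(pattern, str):
        raise TypeError("pattern must be a string")
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"nothing matches {pattern!r}")
    return matches


def run_with_retries(model: Callable[[list[str]], str], prompt: str) -> list[str]:
    """Call the model, run the tool, and feed any error back as a nudge."""
    messages = [prompt]
    for _ in range(MAX_RETRIES):
        output = model(messages)
        try:
            return glob_tool(extract_json(output))
        except (ValueError, KeyError, TypeError, OSError) as err:
            # The nudge: tell the model what went wrong and let it try again.
            messages.append(f"Tool call failed: {err}. Please fix the call and retry.")
    raise RuntimeError("model never produced a working tool call")
```

The point of the sketch is that the output itself is barely touched; all the recovery happens by looping the tool's own error message back into the conversation.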
Once I dig into it more I'll try to highlight other deltas.
This is exactly my gripe, unfortunately: it feels like needless fragility. IIRC the author has said they believe it wouldn't be too difficult to patch QBE to work as a library, but from what I've seen the code is somewhat terse and eccentric.
Because this concept only works for offline compilers, but not for dynamic languages. It's about 100x slower.
I'm just converting my compiler rcc from calling an external assembler to assembling the bytes directly. No need for strings and external files. The cost of the external call is outrageous.
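As a rough illustration of what "assemble the bytes directly" means, here is a tiny sketch that emits x86-64 machine code for a function returning a constant, with no assembly text and no external process. It is not taken from rcc; the function names are hypothetical, and only two instruction encodings are covered (`mov eax, imm32` is opcode 0xB8 followed by a little-endian 32-bit immediate, and near `ret` is 0xC3).

```python
import struct


def mov_eax_imm32(value: int) -> bytes:
    """Encode `mov eax, imm32`: opcode 0xB8 plus a little-endian 32-bit immediate."""
    return b"\xb8" + struct.pack("<I", value & 0xFFFFFFFF)


def ret() -> bytes:
    """Encode a near `ret` (0xC3)."""
    return b"\xc3"


def assemble_return_constant(value: int) -> bytes:
    """Machine code for a function that returns `value` in eax."""
    return mov_eax_imm32(value) + ret()


code = assemble_return_constant(42)
print(code.hex())  # b82a000000c3
```

A real backend needs far more encodings plus relocation handling, but the shape is the same: the code generator appends bytes to a buffer instead of formatting strings for another program to parse.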
Some people are speculating that Opus 4.7 is distilled from Mythos because of the new tokenizer (it means Opus 4.7 is a new base model, not just an improved Opus 4.6).
The new tokenizer is interesting, but it is definitely possible to adapt a base model to a new tokenizer without too much additional training, especially if you're distilling from a model that uses the new tokenizer (see, e.g., https://openreview.net/pdf?id=DxKP2E0xK2).
Yes, I was thinking that. But it could just as well be the other way around: using the pretrained 4.7 (1T?) to speed up ~70% Mythos (10T?) pretraining.
It's just speculative decoding but for training. If they did it at this scale, it's quite an achievement, because training is very fragile when doing these kinds of tricks.
Reverse distillation. Using small models to bootstrap large models. Get richer signal early in the run when gradients are hectic, get the large model past the early training instability hell. Mad but it does work somewhat.
Not really similar to speculative decoding?
I don't think that's what they've done here though. It's still black magic, I'm not sure if any lab does it for frontier runs, let alone 10T scale runs.