>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.
There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.
But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.
I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.
I think this is exactly right. Basically when I am coding, having an agent that roughly matches my intelligence is a feature, not a bug. Having one that is 10x as smart would actively slow me down because I would have to spend the time understanding what it is doing or hand over all architecture to it and just vibe code everything, hoping that it doesn’t do the PhD version of fizzbuzz instead of the maintainable one.
But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
Maybe. I can’t imagine what kind of solutions a software engineer who is 10x smarter than any human who has ever lived would be like by definition. All I know is that there is a possibility it says that the most optimal way to solve a problem is too clever for me to understand and as long as I must verify its work I must be able to understand fully the code it writes.
Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.
4.8 is demonstrating simplicity, hence it's smarter?? It just refactored my 4.6-generated code (4.8 is very slow on difficult tasks - urgh! - without burning tokens - yey!) but the output was wow! Simple, elegant and exactly what I wanted to see.
It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].
Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.
So first - these are terrific papers and I'd not seen some of them before.
Having said that, I don't think these are classic student-teacher distillation from random weights (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".
A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.
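To make the "richer medium" point concrete, here is a minimal sketch contrasting the two training signals on a toy 4-token vocabulary. The function names, the logits and the temperature of 2.0 are all made up for illustration; this is not any lab's actual recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one sampled token (text-only synthetic data)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary at a shared temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single position.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [0.5, 0.2, 0.1, 0.0]

hard = hard_target_loss(student_logits, target_index=0)  # sees only "token 0 was picked"
soft = soft_target_loss(student_logits, teacher_logits)  # sees the teacher's full ranking
```

The hard loss only tells the student which token was sampled, while the soft loss also tells it that token 1 was nearly as good and tokens 2 and 3 were not.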
Could you share some recent articles or papers comparing both methods, especially for the language modelling case?
I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.
I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.
Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights
To the previous poster's point, soft distributions are useful, even saving the top 10 logits is significantly more training signal than just the final token.
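A rough sketch of what "saving the top 10 logits" could look like: keep only the k largest teacher logits per position and renormalize them into a sparse target distribution. The helper names and the toy logits are hypothetical, and k=2 is used only to keep the example small.

```python
import math

def top_k_targets(teacher_logits, k):
    """Keep the k largest logits and renormalize them into a distribution."""
    ranked = sorted(range(len(teacher_logits)),
                    key=lambda i: teacher_logits[i], reverse=True)[:k]
    m = max(teacher_logits[i] for i in ranked)
    exps = {i: math.exp(teacher_logits[i] - m) for i in ranked}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def entropy_bits(dist):
    """Shannon entropy of a sparse distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical teacher logits over a 5-token vocabulary.
logits = [3.0, 2.5, 0.1, -1.0, -2.0]

final_token_only = top_k_targets(logits, 1)  # equivalent to a hard label
top_two = top_k_targets(logits, 2)           # also keeps the runner-up and its weight
```

With k=1 the target collapses to a one-hot label with zero entropy; any larger k stores the relative weights of the alternatives as well, at the cost of k index/value pairs per token.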
I've preferred synthetic datasets since the first day I heard about distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language area).
Here the teacher is effectively a corpus of text: the "next token after the context" is found by looking up the context in the corpus, and for each occurrence the label is whatever followed it there, scaled down by the number of occurrences of the context. The teacher is moot on contexts outside of the corpus though, unlike the usual teacher model in distillation.
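The corpus-as-teacher view can be sketched as a lookup table of empirical next-token frequencies. Everything here (the function names, the fixed context length, the toy sentence) is an assumption for illustration only.

```python
from collections import Counter, defaultdict

def build_corpus_teacher(tokens, context_len):
    """Count which token follows each context of length context_len."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts

def next_token_distribution(counts, context):
    """Empirical distribution after a context, or None if it never occurs."""
    followers = counts.get(tuple(context))
    if not followers:
        return None  # the corpus "teacher" has nothing to say here
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

# Toy corpus; the context ("the", "cat") occurs twice with different followers.
tokens = "the cat sat on the mat the cat ran".split()
teacher = build_corpus_teacher(tokens, context_len=2)

seen = next_token_distribution(teacher, ["the", "cat"])
unseen = next_token_distribution(teacher, ["a", "dog"])
```

A model teacher would still return a distribution for the unseen context, which is the difference the comment is pointing at.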
Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)
> I don't disagree, but how much of this ends up being distillation?
A lot, so you can bet tens of millions are flowing to Congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.
I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.
> I don't disagree, but how much of this ends up being distillation?
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data
Raw training data is raw. A really big model trained on it has already done a first-pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for marginal quality improvement over distilling from the large model.
Distilling from a larger model is not only probably cheaper than from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they are training from.
Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.
On the architectures side, I'm a lot more interested in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?
Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).
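The Amdahl's law point can be worked through with a small sketch. The 20% prefill share below is a made-up number, not a measurement from that patchset; only the 3x decode figure comes from the comment above.

```python
def overall_speedup(prefill_fraction, decode_speedup):
    """Amdahl's law: prefill is the 'serial' part that does not get faster."""
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)

# Assume (hypothetically) that prefill was 20% of wall-clock time before batching.
speedup = overall_speedup(prefill_fraction=0.2, decode_speedup=3.0)

# Share of the new, shorter wall-clock time that is now spent in prefill.
new_prefill_share = 0.2 * speedup
```

Under that assumption a 3x decode gain becomes roughly a 2.14x end-to-end gain, and prefill grows from 20% to about 43% of total time.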
MTP will still be highly valuable for interactive use of course.
It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.
Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.
haven't verified, but attributed to Askell:
"I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.
The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me; it's more helpful to have a colleague role who will sort of jump out and say "nah that's dumb what if we throw out that whole thing and do this completely different angle instead".
4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.
Just speculating but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes for one thing, its "personality", is less human and more fatiguing-AI-slop like.
You don't need to fry with RLAF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.
That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.
I must admit that I am going to find it fascinating when we hit the point where it becomes nearly impossible to deny the efficacy of these tools. I have straight up had people, even in real life, suggest that I'm lying about my productivity gains or what I'm able to accomplish with them.
Like, I understand the reasonable arguments against (I even agree with a few), but it's clear that some people have fully inserted their head into the sand and just don't want to believe any of this could be true. Which will be harsh, since I think getting hit with this train all at once in the future is going to be a rougher ride than a slower coming-to-terms-with, even if the result is one we're unhappy with.
What is the motivation for us users to lie about our experiences? It's to the degree now that people simply refuse to believe that I'm honestly describing my experiences with these tools?
I understand the motivations for the labs to lie, but what do you think mine is?
Oh, basic counting is now arithmetic? But I was told they were superintelligent and were going to cause an apocalypse because they can do pretty much everything? Somehow, because they can excrete a lot of text, we were told they can do everything else too?
I work in big tech and probably 90% of code over the last month has been written by AI. And I suspect it's probably higher within Anthropic, which is probably what he's basing his opinion on.
So, he's closer to correct than not.
That said, your recollection is also flawed. It was in mid-March, and here are the relevant quotes:
>I think we’ll be there in three to six months—where AI is writing 90 percent of the code. And then in twelve months, we may be in a world where AI is writing essentially all of the code.
[...]
>But the programmer still needs to specify, you know, what are—what are the conditions of what you’re doing, what—you know, what is the overall app you’re trying to make, what’s the overall design decision? How do we collaborate with other code that’s been written? You know, how do we have some common sense on whether this is a secure design or an insecure design?
[...]
>So as long as there are these small pieces that a programmer, a human programmer, needs to do, the AI isn’t good at, I think human productivity will actually be enhanced. But on the other hand, I think that eventually all those little islands will get picked off by AI systems.
With another 3-4 months left on the clock, his prediction seems remarkably on point for at least certain organizations and domains.
I welcome you to also hold yourself accountable in the coming months if this trend continues. ;)
Yep! We have a review process where we have a few agents, each tuned to a particular domain of expertise (security, code quality, etc) which iterate until the feedback meets a certain threshold, at which point it goes over to humans for (hopefully) final review.
That said, I generally agree that you're correct: writing code in many ways has not been the biggest bottleneck. However, by removing much of that writing, it frees up engineers to work on the uniquely human things that are larger bottlenecks.
I had a few comments in a thread here touching on where I think most of the value has come from for us (which is largely search/understanding of our dependencies and making away team work far more viable, which aids with cutting through bureaucracy and the tendency for teams to push back on work): https://news.ycombinator.com/item?id=48298731
Haven't you heard - these days they just throw slop generated by LLM agents over to other LLM agents which cosplay as internal QA. They know it works because they write really strict .MD files where they instruct agents in English language to 'never do this' and 'always do that'.
This is really what you think happens at large tech companies? You don't think it's possible this is maybe even slightly overly simplifying what the relevant processes are?
Comment does indicate you don’t really seek to know how things work with respect to this and seem to not be able to imagine that the Occam’s razor is: agents are more useful than you think they are.
> I welcome you to also hold yourself accountable in the coming months if this trend continues. ;)
My company did not swallow hundreds of billions in shady investment deals and is not publicly traded. We work with real money, and the revenue on our books is the revenue that is actually booked, not fake revenue we plan in 2 years time to maybe happen. So no, I am not going to hold myself accountable. But people who work with other people's money should be absolutely held accountable when their wild imaginations don't come true, repeatedly, quarter after quarter, year after year!
Mate, for 5 years I've been hearing that crap. I am not predicting anything; on the contrary, the AI-boosting bunch is. When are your predictions coming true?
AFAIK, most predictions from several years ago were for...approximately now to within the next few years. Can you be more specific?
You criticized a very specific (and fake/misquoted) prediction, ignored the correction, and are now criticizing vague hand-wavey "predictions" that you have left unspecified.
Can you please stop with the angry/ranty replies and actually have a real conversation grounded in actual facts?
Now, having said all of the above...I'll also point out that these are predictions, not promises/guarantees. These people are being asked to forecast and are doing so. I hardly think they should be held responsible for not being literal oracles, but even so--please, at least quote them correctly/at all.
In short: be better than the hallucinations you're keen to call out from the models.
I will note that you have essentially not responded to anything specific in my comment, nor at least acknowledged that you misstated Dario Amodei's actual prediction.
So, unsourced vibes from a shady guy whose entire empire is built on being against AI?
I genuinely don't know how folks can continuously buy into anything he has to say after that Wired piece. The credibility there is seriously lacking.
Please, continue to be skeptical of the labs. But people need to stop talking about this dude as if he's the Holy Grail of the anti-AI movement. It's going to blow up in y'alls faces.
Ed actually provides sources and goes into an incredible amount of detail as to how he came to his conclusions. The average AI booster just goes "I totally built ten businesses off vibe coding but I can't tell you anything because it's a SECRET!". And the mainstream tech media is so in the pocket of big tech and AI corporations that they might as well just publish their PR emails at this point. Yeah, I'll listen to Ed thank you very much.
I think it's telling that most critics don't address his actual points, but instead his credibility because he's a "hater".
That said, I really mean it when I say that I don't actually think Ed is a good choice for the anti-AI movement. I think an actual opposition is useful, but he ain't it.
It's an interesting profile, but I don't see why it would change my opinion of him. I already knew he works in PR, it's not like a thing he hides. I don't think one error in a spreadsheet really proves anything (plus he's pretty honest about being an amateur at financial analysis -- but most of what he's looking at is pretty basic math and it's baffling that nobody has an answer to his pretty straightforward questions of how-will-this-ever-make-money)
I guess like, I don't know about an anti-ai "movement", personally I like AI-the-product but I think AI-the-industry is extremely sketchy and has motivations that I think are awful. As with all technology revolutions, my issue is more with the people than the technology itself.
I don't really like how this whole thing has become "pro ai" vs "anti ai" though. For me, I'm just really irritated when I use AI every day, I'm a professional software developer, and all my experiences with it do not match the (very annoying) hype. I kind of wish we could just go back to talking about software engineering and if people like vibe coding, great, go do that and stop all the annoying think pieces that just give CEOs even worse AI psychosis.
I read the profile and didn't see anything really wrong. Why would PR companies have to believe in their clients? Why does he have to be held to higher moral standards than Sam Altman who’s a total lying snake?
The error you call out is hardly “serious”, as the whole argument is uninteresting. It is a stupid indefensible error but the argument about revenue being 20% or 30% lower than reported isn’t that central to his overall thesis. Stuff that matters is inference cost, profitability, actual training costs.
> So, unsourced vibes from a shady guy whose entire empire is built on being against AI?
Actually he provides sources when he analyses stuff and imho much better than the usual corporate "Sam Altman says we should ask ChatGPT how to raise babies" crap. Also, I don't know many 'shady' guys who have built entire "empires", nor does he seem to actually have an empire. Usually being shady means you are kind of unknown and all. I am not glorifying Ed, don't even know him personally. I am not even impressed with his writing style much to be honest. But he brings important facts and information to light, which otherwise would have been lost in the cacophony of corporate media light treatment of these con-men. Holy Grail? Blowing up in our faces? WTF are you talking about?
The source was the article in the WSJ itself, which then referred to their source at Anthropic. Which kind of is a textbook definition of "leak". Because otherwise Anthropic would have their lawyers hunting both the employee breaking their stringent NDA and the WSJ as well...
Why puzzled? I literally said "According to Ed Zitron", implying that's where I stumbled upon the article. I've no time to read corporate media, at least not regularly.
>If you have AI systems that can simply build out POCs in days, backtest on real data, show reliable results and numbers, you get a suite of product options you were never able to get before. If you have coding agents that can speed up implementation, you can build more stuff and choose the things that stick.
I'll also add this: within a large organization, you often need to interact with many different codebases owned by many different teams. Agents have made it much easier to wrangle by having the ability to deploy one to scope out your web of dependencies to learn about what would be needed for feature X, and how that integration can happen.
We've been doing far more away team work simply because it makes things move faster. It's easier to convince a team to sign off/review something than it is to get them to commit to the planning and eventual work.
It genuinely is helping things move faster inside large organizations. Or at least, it is for us, particularly since we're getting organizational prioritization to actually build the scaffolding to make those agents more effective at search.
> It's easier to convince a team to sign off/review something than it is to get them to commit to the planning and eventual work.
1000x yes: you have touched on what I think is the single biggest factor here, which is the humongous value of POCs. They are gnarly to build without agents, so we used to have to get everyone on board so we didn't get screwed in performance reviews. That was a monumental task, because it meant convincing very busy PMs who have a lot on their plate and don't want to take risks on things they don't understand. Now it's like "can we scale this out" and you have a very nicely formatted proposal and POC. It de-risks things very quickly.
It has a style that allows people to _pretend_ they're having substantive conversation, but it's mostly just people blathering in a distinct style without ever listening.
Yes, although unfortunately the one problem with it is that there's no way to contribute to older topics in a meaningful way. Due to the nature of this format not even the original author checks old comments and absolutely no chance any new conversation sparks out of it.
>Due to the nature of this format not even the original author checks old comments and absolutely no chance any new conversation sparks out of it.
Sometimes I wonder if the format actually helps. I suspect that when you know you can pretend you didn't see a reply to your comment, you feel less likely to need to defend yourself when you realize you might be wrong. You can just close the tab and, since there was no notification, just move on without the ego hit.
> It's frankly depressing how few places there are to have quality conversations
Yeah I used to learn so much across quite a few forums. Most of those communities are dead, dying, filled with bots or filled with people making shit up/just posting lousy jokes now. A lot of folks have jumped to Discord, which frankly, isn't for me, so feeling a bit lost on where to surf these days
I think these sort of efforts are mostly self-soothing at this point. It is almost certainly the case that the labs are at a minimum running inference over the information they're pulling and ensuring that it's useful/suitable for pre-training. The models are at least good enough to know whether they're looking at utter nonsense.
Ya I feel like these AI companies have the ability to be somewhat selective about their training sets. They don't have to add everything. I guess the idea is the filters wouldn't catch it, but if the junk is indistinguishable from the real stuff, then won't the platforms just be ruined by a bunch of junk?
Actually it was shown a couple of times already, some of it also by Anthropic's own research, that the LLMs are extremely easy to poison with small datasets.
That's correct, and their recent work on natural language autoencoders has given extremely compelling evidence of that...which is why their data collection practices for pre-training have almost certainly evolved, particularly since they've already scraped most of the internet.
>No, the complaint with Adobe is that if you cancel, they terminate access immediately rather than at the end of the billing period. There is no explanation for this other than a predatory one
This is exactly what Shutterstock does. What's maddening is that you can be getting a monthly charge, but are locked into a year contract. If you cancel, they'll continue to charge monthly but without being able to use the service. It's absurd.