Do you think the work will still apply to speculative/alternative decoding methods like MTP and block diffusion, which are making batch=1 decoding less memory bound? Kernel launch overhead and memory transfer become less and less significant as a % of time when computing multiple tokens at once.
Why not, it's one way to look at it!
Although I have yet to see other work with speculative decoding go higher than ~1,000 tokens/s, because the other bottlenecks start to matter at that point, and they need to be solved to go further.
Our view is that MTP / speculative decoding could help us get an X multiplier (X = 2 to 6) on the tokens-per-second speed we currently achieve.
We are a bit greedy, we want to stack optimizations on top of each other to get the maximum speed possible.
It involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch), which should be totally doable for dense models, and will be more tricky for MoEs because it could mean activating more experts and thus more active parameters.
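To make the "it's like a small batch" point concrete, here is a minimal sketch of one greedy verification step. It is not our kernel code: `verify_draft` and `target_next` are hypothetical names, and the per-position loop stands in for what would really be a single batched forward pass over all draft positions.

```python
from typing import Callable, List


def verify_draft(
    context: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens accepted in one greedy verification step.

    `target_next` is a hypothetical stand-in for the big model's greedy
    next-token choice. A real implementation would score every draft
    position in one batched forward pass; the loop is only for clarity.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy target model (assumption for the demo): predicts last token + 1.
def toy_target(tokens: List[int]) -> int:
    return tokens[-1] + 1


full_match = verify_draft([1, 2], [3, 4, 5], toy_target)
partial_match = verify_draft([1, 2], [3, 9, 5], toy_target)
```

With k draft tokens, one step yields between 1 and k + 1 tokens, which is where a 2 to 6x multiplier would come from if the draft is accepted often enough.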
You'd be surprised, people are somehow buying Tesla P40s and M40s on eBay for almost $300 and $180 respectively (M40 being the same gen as GTX 950). Google Colab still offers T4s and it's taken them years to add modern GPUs. Hope they're powering them with renewables at least.
And people in general are holding on to their old machines for very long periods of time now, especially CPUs. I've had to support first gen Intel i7s at work! That's pre AVX.
Just a note, P40 came out at $5700 in 2016 dollars. In 2026 dollars that is $8000 (wow!). If you bought 100k today, assuming a 1% failure rate per year your $800M investment can be traded in for about $30M.
I think it is reasonable to assume a similar depreciation in GPUs.
Meaning you'd need to have made more than (800M - 30M) * (1 + income tax rate) + (power + maintenance).
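A rough sketch of that arithmetic, if anyone wants to plug in their own numbers. The $300 resale price is the eBay figure mentioned upthread, the 10-year horizon is 2016 to 2026, and the tax rate and power/maintenance inputs are left as parameters because I don't have figures for them.

```python
units = 100_000
price_new = 8_000            # 2026 dollars per card, from the comment above
resale_each = 300            # assumed: rough used P40 price quoted upthread
annual_failure_rate = 0.01
years = 10                   # 2016 -> 2026

# Cards still working after compounding the yearly failure rate.
surviving = units * (1 - annual_failure_rate) ** years
investment = units * price_new
trade_in = surviving * resale_each


def break_even_revenue(tax_rate: float, power_and_maintenance: float) -> float:
    """Revenue needed over the period, following the formula above."""
    return (investment - trade_in) * (1 + tax_rate) + power_and_maintenance


print(f"investment: ${investment / 1e6:.0f}M, trade-in: ${trade_in / 1e6:.1f}M")
```

With these inputs the trade-in value comes out a little under the rounded $30M figure, so the conclusion doesn't change.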
Some say the margins on inference are already there for new GPUs, but they are tight margins.
Outside of training the biggest LLMs at big labs, GPU lifespan isn't as short as the OP made it out to sound. A100s are 6 years old and still a reliable work-horse, and the 80GB version hasn't depreciated that much on the used market. On the consumer side, 3090s are actually still selling for very close to 2020 MSRP.
Even the ancient V100 (soon to be 10 years old!) had somewhat of a resurgence on the second-hand market, with a healthy market for interconnects in China.
If I had a datacenter and power consumption was not a concern, I'd be holding on to my A100s for years at least for inference.
Oh yeah, not meant to be all doom and gloom. Lighter workloads greatly increase hardware lifespan. And the GPUs are like at most 50% of the data-center cost I think. You get to keep the building, the cooling, the power interconnects, the networking and everything else.
Additionally the demand drives new power infrastructure, and new fabs that will definitely outlive the bubble.
> And why is V100 even used? V100 is four generations old and not even supported anymore.
It wouldn’t surprise me that due to bureaucratic processes, it’s still somehow the most readily available GPU for Apple researchers despite being almost 10 years old now. I recall even last year seeing V100s used by Microsoft researchers who weren’t working on LLMs.
Why didn't you take into account batching, input tokens, different costs of electricity, and the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
> Why didn't you take into account [...] the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
Because that wasn't what they claimed to research?
>> for inference it's definitely not worth it.
It's entirely fine if you enjoy local LLMs on your computer, there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
Who is going to buy a $4299 M5 Max MBP with 64GB of RAM just to run Gemma 4 31b? Firstly you don't need 64GB for that model. Secondly if you want a machine that sits in the corner and does nothing but LLM inference, you don't buy a MacBook Pro, you buy some GPUs which are going to cost you a fraction of that (~$1k for ~64GB of VRAM is possible). The people buying Apple Silicon for inference generally aim for the Mac Studios with enormous amounts of RAM (128-512GB), to run very large models.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemma, is just $400 extra.
Not a single new 64GB GPU, but multiple used GPUs.
They’ve significantly increased in price (so much for hardware depreciation…) but you can still get a modded 22GB 2080 ti for $320, or a Mi50 32GB for ~$450 each (used to be $150 a few months ago, alas), or a Mi50 16GB for <$200 but you’d need to stack 4 of them.
There’s also some more exotic configurations but those are probably the simplest options. You won’t get the performance of an RTX Pro 6000 Blackwell of course, and the power consumption will be pretty high so it’s only worth it if you have cheap electricity. But it is possible.
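For the curious, a quick sketch of how those options add up to roughly 64GB. Prices are the ones quoted above; the Mi50 16GB is entered at $200 as an upper bound since it was quoted as "<$200", and this ignores the motherboard, PSU and risers you'd also need.

```python
import math

# (name, VRAM in GB, approx. price per card in USD) as quoted above
options = [
    ("2080 Ti 22GB (modded)", 22, 320),
    ("Mi50 32GB", 32, 450),
    ("Mi50 16GB", 16, 200),  # quoted as "<$200", so treat as an upper bound
]

target_gb = 64


def plan(vram_gb: int, price: int) -> tuple:
    """Cards needed to reach target_gb, total VRAM, and total card cost."""
    cards = math.ceil(target_gb / vram_gb)
    return cards, cards * vram_gb, cards * price


plans = {name: plan(vram, price) for name, vram, price in options}
for name, (cards, total_gb, cost) in plans.items():
    print(f"{name}: {cards} cards, {total_gb} GB, ~${cost}")
```

All three land at or under about $1k for the cards alone.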
What quant? You should have no problem running it at Q4 with 256K context, or even Q5 or Q6, although maybe not at full context. I can run Q4 on a 4090 with just 24GB VRAM.
> I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!
Say what? 5 years ago OpenAI had received around $139 million in funding, and they’d just come out with GPT3 with 175B parameters, a 2048 context window, trained on 300B tokens on a 10,000 V100 cluster which would have cost maybe $4-13 million at the time for their training run.
Meanwhile Deepseek V3’s famously frugal training was $5M, and Chinese AI companies are raising billions in funding. Sure American AI companies are raising tens (and maybe hundreds in the case of OpenAI, if you count their circular funding rounds) of billions but they’re grossly inefficient, and we’ve already hit the limits of the scaling laws where there’s little point in increasing the number of parameters of a model.
Oh, it was written in a paper, must be correct then, no further investigation required, just believe it at face value! No track record of academic dishonesty, and definitely no incentives to fudge the numbers.
It's incredibly common all over Europe, not just Switzerland. Not only the metros but the trams and even buses often rely on this system where there's no turnstile or barrier, you just walk in.
Not sure it's about being a high trust society or not, there's frequent inspections where they block the doors, and you get a hefty fine if you're caught without a valid ticket. I certainly wouldn't call Prague or Rome or Dublin high trust societies on par with a Swiss city.
Buy a ticket and get on was the standard everywhere for trams and trolleys because you didn’t have enough enforcers and you didn’t have controlled access.
Spot checking kept people honest but it only really works when most people are honest.
And it is common to cheat, as it is cheaper to pay the fine than to buy a pass annually. Naturally, this is done more by the foreigners I know than the natives. But the foreigners are not Japanese...
Personally I feel like it would be less undignified and infantilising to have a machine take care of my basic bodily functions than a human being. There's no feeling of judgement or being shamed in front of someone else, and the machine could even restore a feeling of autonomy since it would feel like you're using a tool instead of being helplessly reliant on another person's help.
> Cloud computing was an absolutely mind blowing revolution - suddenly your startup could run its own computer systems in minutes without need to install and run your own systems in a data center. This was an absolute game changer, and I really drank the AWS Kool Aid down to every last drop then I licked out the cup. I was all in on AWS in a big way.
Am I the only one who remembers that VPSes and dedicated hosting services were a thing before AWS came around? Yes you had to pay for a month at a time and scaling wasn’t as instant, but it wasn’t like the only option before cloud computing was having to drive to the datacentre and install your own server.
> suddenly your startup could run its own computer systems in minutes without need to install and run your own systems in a data center.
The “in minutes” is doing a lot of the work in that sentence above.
I also used dedicated servers in the late ’90s (and they still offer great value today). But before AWS, provisioning new hardware typically took days, not minutes.
AWS changed that, and the rest of the industry eventually followed.
No you could rent virtualised servers way before AWS.
AWS simply had good marketing.
The virtualised server thing was not an AWS thing; what was new were their other services. For example, instead of renting a virtual server and installing a database on it, you could rent the database itself; that was sort of a new thing that AWS turned into a thing.
It was never cheaper; what you paid for was a promise of fire and forget. You would no longer need to worry about updating the server or the database, because the AWS crew took care of that.
> I also used dedicated servers in the late ’90s (and they still offer great value today). But before AWS, provisioning new hardware typically took days, not minutes.
VPSes and non-custom configs for dedicated servers were pretty instant as far as I know, I think the advantage of AWS was more that you could scale up and down much more easily since you weren’t locked down in a monthly contract, and that you could automate server provisioning through an API.
If you recall AWS didn't scale instantly originally either.
We had super bursty traffic, and had to go with Google Cloud (very early days! [0]) because you'd need to communicate with AWS and pre-warm the ELB capacity of your expected bursts.
We did a dead launch to 60 million customers (0 to 60 million, no organic growth phase) this way. I wouldn't want to do that on a VPS.
Am I the only one who remembers how shady a lot of those VPS/hosting companies were? Seemed to be a race to the bottom, so a 'good' outfit might suck or completely disappear a couple years later. (Also, pricing was all over the map, I had a client who was paying $150/mo for a VPS.) Hetzner survived, but for a long time they had a reputation as a spam farm. So I get the initial appeal of AWS, used tactically. But for larger companies, it's something like IBM or Oracle: if you are price-sensitive, it's not for you.
Boys and girls being different does not mean one sex deserves corporal punishment and one does not. Girls are equally capable of cyberbullying (which is covered by this law), why should they only get detention while a 9 year old boy has to suffer physical violence? What does this teach girls - that they can get away with more? That they're more fragile than even a prepubescent boy?
If the law punishes one demographic less severely for the same actions, that's injustice. No different in principle from pre-modern practices where if a noble maimed a commoner, they'd just need to pay a fine, while if a commoner did the same, they'd be put to death.
> Boys and girls being different does not mean one sex deserves corporal punishment and one does not. Girls are equally capable of cyberbullying (which is covered by this law), why should they only get detention while a 9 year old boy has to suffer physical violence?
In many systems of law, the punishment should mirror the crime. You gouge out an eye -> the government gouges out one of your eyes.
In every country, men commit almost all violent crimes. In school, boys physically bully other boys. Hence the physical punishment for them.
> What does this teach girls - that they can get away with more? That they're more fragile than even a prepubescent boy?
Yes, for homo sapiens, the female is more fragile than the male. This is basic biology. I'm sure that in praying mantis society, females get harsher punishments.
> In every country, men commit almost all violent crimes. In school, boys physically bully other boys. Hence the physical punishment for them.
As I've said, and @echoangle repeated, caning is used for cyberbullying, which girls do too (at a rate relatively close to boys actually). If the law was caning in response to physical bullying, and it just so happened that the vast majority of offenders were boys, I would not object on the basis of sexism (I still would not approve of schools being allowed to physically punish students).
> Yes, for homo sapiens, the female is more fragile than the male. This is basic biology. I'm sure that in praying mantis society, females get harsher punishments.
There's no way the typical 16 year old girl is more fragile than the typical 9 year old boy, yet only the latter is subject to this punishment. Until children reach the age of 12 or so the strength difference is quite minor (and there's even a brief period where girls are taller and heavier).
Also it's absurd to punish demographics differently based on their statistical averages. Redheads are less sensitive to pain, should your hair colour determine how many strokes of the cane you get?
Girls are not meaningfully more fragile than boys, especially before puberty. Before puberty they're practically indistinguishable. If it weren't for long hair and the color pink none of us would know.
That's just something people tell themselves. Yes, we socialize boys not to cry. That doesn't mean boys are "stronger", it means that they have a pathological fear of being perceived as weak which will cause them sexual and relationship problems until the day they die.