jychang · 2026-03-10T08:39:54 1773131994

No, Opus cannot be 10x larger than the Chinese models.

If Opus were 10x larger than the Chinese models, then Google Vertex/Amazon Bedrock would serve it 10x slower than Deepseek/Kimi/etc.

That's not the case. They're in the same order of magnitude of speed.

Filligree · 2026-03-10T11:27:02 1773142022

They serve it about 2x slower. So it must have about 2x the active parameters.

It could still be 10x larger overall, though that would not make it 10x more expensive.

jychang · 2026-03-11T22:28:04 1773268084

Yes, but I highly doubt they would increase sparsity much vs the Chinese models.

That's how you get Llama 4.

Pretty much every major lab settled on ~3-5% sparsity for a reason.
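
For reference, the "sparsity" figure used in this thread is the ratio of active to total parameters. Here is a minimal sketch of that arithmetic, using Qwen3.5-397B-A17B (mentioned elsewhere in the thread) as the worked example. The function names are made up for illustration, and the 100B active count is a hypothetical input, not a known figure for any model.

```python
def activation_ratio(active_b: float, total_b: float) -> float:
    """Fraction of parameters used per token (what the thread calls 'sparsity')."""
    return active_b / total_b


def total_from_active(active_b: float, ratio: float) -> float:
    """Total parameters (in billions) implied by an active count and a ratio."""
    return active_b / ratio


# Qwen3.5-397B-A17B: 17B active out of 397B total.
print(f"Qwen3.5-397B-A17B ratio: {activation_ratio(17, 397):.1%}")

# Hypothetical model with 100B active params at a 3% or 5% ratio.
for ratio in (0.03, 0.05):
    print(f"{ratio:.0%} -> {total_from_active(100, ratio):,.0f}B total")
```

Under the 3-5% assumption, a 100B-active model would land somewhere between roughly 2T and 3.3T total parameters. The estimate is only as good as the assumed ratio.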

bakugo · 2026-03-10T08:58:51 1773133131

I agree that Opus almost definitely isn't anywhere near that big, but AWS throughput might not be a great way to measure model size.

According to OpenRouter, AWS serves the latest Opus and Sonnet at roughly the same speed. It's likely that they simply allocate hardware differently per model.

jychang · 2026-03-11T23:40:28 1773272428

The numbers look about right. Opus 4.5 is about 1.5x the size of Sonnet 4.6, and Opus 4/4.1 is about 5x the size of Sonnet 4.5/4.6.

Note that Opus 4.5 is about 1/3 the size of Opus 4/4.1 (and 1/3 the price in the API)

torginus · 2026-03-10T15:53:50 1773158030

My understanding is that for MoE with a top-K architecture, model size doesn't really matter: you can have 10 32GB experts or a thousand, but if only 2-3 of them are active at the same time, your inference workload will be identical. Only your hard drive traffic will increase.

Which seems to be the case, seeing how hungry the industry has been for hard drives lately.
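
A toy sketch of the top-K point, assuming a plain softmax-free router that just picks the K highest scores. The helper names and the 1M-parameter expert size are hypothetical. It only counts expert parameters touched per token and ignores the router itself, attention, and where the inactive experts are stored.

```python
import random


def top_k_experts(router_scores: list[float], k: int) -> list[int]:
    """Return indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def active_expert_params(num_experts: int, params_per_expert: int, k: int) -> int:
    """Expert parameters actually read for one token under top-k routing."""
    return min(k, num_experts) * params_per_expert


random.seed(0)
for num_experts in (10, 1000):
    scores = [random.random() for _ in range(num_experts)]
    chosen = top_k_experts(scores, k=2)
    touched = active_expert_params(num_experts, params_per_expert=1_000_000, k=2)
    total = num_experts * 1_000_000
    print(f"{num_experts} experts: chose {chosen}, "
          f"touched {touched:,} of {total:,} expert params")
```

The per-token count stays the same whether there are 10 experts or 1000. What grows is the total that has to live somewhere.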

jychang · 2026-03-10T07:37:44 1773128264

Nobody is running 10s of trillion param models in 2026. That's ridiculous.

Opus is 2T-3T in size at most.

Chamix · 2026-03-10T16:59:49 1773161989

What do you think labs are doing with the minimum 10TB memory in NvLink 72 systems that were publicly reported to all start coming online in November/December of last year? And why would this 1 TB -> 10 TB jump matter so much for Anthropic previously being wholly dependent on running Opus 4x on TPUs, if the models were 2-3T at 4bit and could fit in 8x B200 (1.5 TB = 3T param) widely deployed during the Opus 4 era?

You have presented a vibe-based rebuttal with no evidence or logic to outline why you think labs are still stuck in the single trillions of parameters (GPT 4 was ~1 trillion params!). Though, you have successfully cunninghammed me into saying that while anything I publicly state is derived from public info, working in the industry itself is a helpful guide to point at the right public info to reference.
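
The memory arithmetic behind "1.5 TB = 3T param" can be sketched as follows. This assumes decimal terabytes and counts weights only. KV cache, activations, and runtime overhead are ignored, so real capacity would be lower. `params_that_fit` is an illustrative helper, not anything from a library.

```python
def params_that_fit(memory_bytes: float, bits_per_param: float) -> float:
    """Number of parameters whose weights alone fit in the given memory."""
    return memory_bytes * 8 / bits_per_param


TB = 1e12  # decimal terabyte, an assumption

for label, mem in (("8x B200 (1.5 TB)", 1.5 * TB), ("NVL72 (10 TB)", 10 * TB)):
    trillions = params_that_fit(mem, bits_per_param=4) / 1e12
    print(f"{label}: {trillions:.0f}T params at 4-bit")
```

At 4 bits per parameter this gives 3T for 1.5 TB and 20T for 10 TB, matching the 2-params-per-byte rule of thumb used in the comment.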

johndough · 2026-03-10T17:18:33 1773163113

Could you point at some more public info about active parameter count? You said:

> and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)

I can see ~100B, but that would be near the same order of magnitude. I find ~1000B active parameters hard to believe.

Chamix · 2026-03-10T18:41:41 1773168101

Sorry if that was unclear, I did mean 100Bs as in the next order of magnitude. Even GPT-4 had ~220B active params, though the trend has been towards increased sparsification (lower activation:total ratio). GPT 4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see if there was any value in the extreme inference cost of quadratically increasing compute cost with naïve-like attention). Nowadays you optimize your head size to your attention kernel arch and obtain performance principally through inference time scaling (generate more tokens) and parallel consensus (gpt pro, gemini deep think etc), both of which favor faster, cheaper active heads.

4o and other H100 era models did indeed drop their activated heads far smaller than gpt-4 to the 10s just like current Hopper-Era Chinese open-source, but it went right back up again post-Blackwell with the 10x L2 bump (for kv cache) in congruence with nlogn attention mechanisms being refined. Similar story for Claude.

The fun speculation is wondering about the true size of Gemini 3's internals, given the petabyte+ world size of their homefield IronwoodV7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models constructed by individual teams within DeepMind.

jychang · 2026-03-11T22:52:52 1773269572

Well, for one, Anthropic mostly uses Google TPUs and Amazon Inferentia2 chips, not Nvidia NVL72s. That's because... Google and Amazon are major investors in Anthropic.

Secondly, you missed the entire AI industry trend in 2024-2025: the failure of the GPT-4.5 pretrain run, and the pullback from GPT-4 to GPT-4 Turbo to GPT-4o (each of which is smaller in parameter count than the last). GPT-4 is 1.6T, GPT-4 Turbo is generally considered 1/2 to 1/4 that, and GPT-4o is even smaller (details below).

Thirdly, we KNOW that GPT-4o runs on Microsoft Maia 100 hardware with 64GB each chip, which gives a hard limit on the size of GPT-4o and tells us that it's a much smaller distilled version of GPT-4. Microsoft says each server has 4 Maia 100 chips and 256GB total. We know Microsoft uses Maia 100s to serve GPT-4o for Azure! So we know that quantized GPT-4o fits in 256GB, and GPT-4 does not fit. It's not possible to have GPT-4o be some much larger model that requires a large cluster to serve- that would drop performance below what we see in Azure.

Fourthly, it is not publicly KNOWN, but leaks say that GPT-4o is 200b-300b in size, which also tells us that running GPT-4 sized models is nonsense. This matches the information from Microsoft Maia servers above.

Fifthly, OpenAI Head of Research has since confirmed that o1, o3, GPT-5 use the same pretrain run as 4o, so they would be the same size.[1] That means GPT-5 is not some 1T+ model! Semianalysis confirms that the only pretrain run since 4o is 4.5, which is a ~10T model but everyone knows is a failed run.

Sixthly, Amazon Bedrock and Google Vertex serve models at approximately similar memory bandwidths when calculating from tokens/sec, giving 4900GB/sec for Google Vertex. Opus 4.5 aligns very well with 100b of active params.

    42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
    143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
    70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct

For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B. There are calculations for Amazon Bedrock on the Opus 4.5 launch thread that compare it to gpt-oss-120b, with similar conclusions.

Seventhly, Anthropic distilled Opus 4/4.1 to 4.5, which is why it runs ~3x faster than Opus 4 while costing 1/3 the price in terms of API fees.

Eighthly, no respectable model has a sparsity below 3% these days- ridiculously low sparsity gives you Llama 4. Every single cutting edge model is around 3-5% sparsity. Knowing the active param count for Opus 4.5 gives you a very good estimate of total param count.

The entire AI industry is moving AWAY from multi-trillion-parameter models. Everything is about increasing efficiency with the amount of parameters you have, not hyperscaling like GPT-4.5 which was shown to be a bad way forward.

Nobody thinks Opus 4.5 is bigger than around 2T in size (so not 10T). Opus 4/4.1 may have been ~6T, but that's it. Any guess of 10T or above is patently ridiculous for both Opus 4/4.1 and Opus 4.5.

[1] https://x.com/petergostev/status/1995744289079656834
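
The tokens/sec argument in the sixth point can be reproduced as a back-of-the-envelope calculation. This is a sketch, and it rests on the comment's own assumptions: decode is memory-bandwidth-bound, the provider and hardware are the same, and quantization and batching are comparable. Speculative decoding and other optimizations are ignored. The function names are illustrative.

```python
def params_per_second(tokens_per_sec: float, active_params_b: float) -> float:
    """Billions of parameters read per second for a known model."""
    return tokens_per_sec * active_params_b


def implied_active_params(budget_b_per_sec: float, tokens_per_sec: float) -> float:
    """Active params (billions) implied by a throughput budget and observed tps."""
    return budget_b_per_sec / tokens_per_sec


glm_budget = params_per_second(143, 32)    # GLM 4.7, 32B active -> 4576
llama_budget = params_per_second(70, 70)   # Llama 3.3 70B dense -> 4900

for name, budget in (("GLM 4.7", glm_budget), ("Llama 3.3", llama_budget)):
    est = implied_active_params(budget, tokens_per_sec=42)  # Opus at 42 tps
    print(f"Using {name} as reference: ~{est:.0f}B active params")
```

With the 42 tps figure quoted above, both reference points put the estimate a little over 100B active parameters, which is where the "aligns very well with 100b" claim comes from.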

Chamix · 2026-03-13T21:59:17 1773439157

I appreciate the detailed comment! I took the day off and am bored so have a brain dump of a reply - basically I think we are talking past each other on two major points:

1. All the discussion about model size is CRITICALLY bisected into talking about TOTAL model size vs ACTIVE parameter size (of a "head" in a "Mixture of Experts"). Everything you've said trend-wise is mostly accurate for ACTIVE parameter count, which is what determines inference cost and speed.

But I am primarily talking about TOTAL parameter count (which has to just fit inside cluster HBM). The total parameter count only affects training cost and has nothing to do with inference cost or speed. So there is no downside to making total parameter count as big as your inference cluster will fit.

2. You touch on distillation, and this heavily relates to the post-gpt-4 base model (call it 5th gen, if gpt-4 was 4th gen), which indeed was used for all models through gpt5.1.

The actual base 5th gen model was as large as OAI could fit on training clusters, and only then distilled down to whatever total size a release model targeted, and the little secret with sparse MOE is the entire model weights don't have to fit (again, plenty of public papers detailing techniques) on a single HBM pool when training. This leads to the 2nd little secret, that GPT-4.5 is ALSO using that same base model; as I said in another comment, 4.5 was all an experiment in testing a huge ACTIVE parameter model (which again is all that determines cost and speed), not so much total (which is capped by inference cluster hardware anyways!). How do you think OAI would be able to serve 4.5 at scale if the model itself was 10x total bigger than everything else? But it's easy to serve a model with active parameters 10x bigger!

So this same huge 5th gen base model was distilled down and RLed over and over again in different permutations and sizes to feed the whole OAI model lineup, from o4-mini to advanced voice to gpt4.5 all the way until finally 5.2 starts using a new, "6th gen" base model (with various failed base model trainings between 5th and 6th) (shallotpeat!).

Picking up misc pieces, yes 4o was tiny when served at Q4, which is what Maia 100 did (with some Q6). We are still talking about a ~1T total model. Quantization both static and dynamic was the whole drive behind gpt4-turbo variants which led straight into 4o targeting an extremely economical deployment of 5th gen base. Economical was sorely needed (arrakis!) since this all was at the critical junction when 8xH100s had not been deployed quite at scale yet, but AI use was rocketing off to mainstream, so we had silly situations like Azure being forced to serve on 256gb clusters. (We could go into a whole separate spiel about quantization + history, but suffice it to say everything in deployment is just Q4 these days, and training is mostly Q8)

But this DOES NOT mean o1 was tiny, which conveniently was deployed right when 8xH100s WERE available at scale. We split into the instant tree, where 4.1 was bigger than 4o and 5-instant was bigger than 4.1 etc. And the thinking tree, where o1 = o3 < 5-thinking < 5.2-thinking. Again, the ACTIVE counts were very small comparatively, especially as it let you cheaply experiment and then train with substantial inference compute required for RL training/unrolling! But there was no reason not to fit increasingly large distilled versions of the 5th-gen/6th-gen base models as the inference fleet buildouts (particularly in 2H 2025) came online! The same 5th and now 6th gen base models were refined and twisted (foundry!) into totally different end models and sizes.

I just think this really all comes down to total vs active, not understanding a huge base model can be distilled into arbitrarily sized release models, and then bizarrely giving weight to Meta's completely incompetent Llama 4 training run (I was there, Gandalf!) as giving any sort of insight on what sort of sparsity ratio cutting edge labs are using. You cannot learn anything about total parameter size from active parameter count+ derivatives (token speed, cost, etc)! But on this topic we could again diverge into an entire debate; I'll just say Google is likely doing like 0.1%-OOM in some production configs (Jim Keller is basically shouting extreme sparsity from the rooftops!).

Brief rebuttal summary:

1. Incorrect as of late 2025. Whole public reporting about Anthropic dissatisfaction with "Project Rainier". Dario talking about Nvidia compute candidly on Dwarkesh interview!

2. Active vs Total

3. 4o is small, 4-bit 4o on Azure even smaller. 4o is 5th gen base distilled not gpt-4 distilled.

4. 256gb at Q4 fits 1T parameters! Active vs total

5. 5th gen pretrain / base model is huge! 4.5 uses the same base as 4o and 5.1! Can be shrunk to arbitrary size before RL/post training create finished model! Active vs total

6. Active vs total

7. Active vs total, also Ironwood/TPUv7 and Blackwell give much cheaper Q4 inference

8. Don't trust the Zuck

Anyways it's all a mess and I don't think it's possible to avoid talking past each other or misunderstanding in semi-casual conversation - even just today Dylan Patel (who is extremely well informed!) was on Dwarkesh podcast talking about 5.4-instant having a smaller active parameter count than GPT-4 (220B active), which is completely true, but instantly gets misinterpreted on twitter et al as 5.4 being a smaller model than gpt-4, ignoring that 5.4-instant and 5.4-thinking are totally different models, etc etc, just too much nuance to easily convey.

johndough · 2026-03-10T09:08:31 1773133711

Do you have any clues to guess the total model size? I do not see any limitations to making models ridiculously large (besides training), and the Scaling Law paper showed that more parameters = more better, so it would be a safe bet for companies that have more money than innovative spirit.

magicalhippo · 2026-03-10T10:32:05 1773138725

> I do not see any limitations to making models ridiculously large (besides training)

From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. The Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base, again a large change was more and better training data.

[1]: https://news.ycombinator.com/item?id=47089780

jychang · 2026-03-10T07:27:04 1773127624

> but not sure how to figure out what it would cost and I'm sure as hell not going to try.

Ask Opus to figure out how much it would cost. Lol.

jychang · 2026-03-10T06:59:45 1773125985

Yep, you can also get similar analysis from Amazon Bedrock, which serves Opus as well.

I'd say Opus is roughly 2x to 3x the price of the top Chinese models to serve, in reality.

jychang · 2026-03-10T06:55:20 1773125720

> I find it likely Opus is larger.

Unlikely. Amazon Bedrock serves Opus at 120tokens/sec.

If you want to estimate "the actual price to serve Opus", a good rough estimate is to find the price max(Deepseek, Qwen, Kimi, GLM) and multiply it by 2-3. That would be a pretty close guess to actual inference cost for Opus.

It's impossible for Opus to have something like 10x the active params of the Chinese models. My guess is something around 50-100b active params, 800-1600b total params. I can be off by a factor of ~2, but I know I am not off by a factor of 10.

simianwords · 2026-03-10T07:00:19 1773126019

Are you sure you can use tps as a proxy?

jychang · 2026-03-10T07:33:09 1773127989

In practice, tps is a reflection of vram memory bandwidth during inference. So the tps tells you a lot about the hardware you're running on.

Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.

I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the chinese models at most.

throwdbaaway · 2026-03-11T14:49:57 1773240597

What about the VRAM requirement for KV cache? That may matter more than memory bandwidth. With these GPUs, there is more compute capacity than memory bandwidth, and more memory bandwidth than VRAM.

DeepSeek got MLA, and then DSA. Qwen got gated delta-net. These inventions allow efficient inference both at home and at scale. If Anthropic got nothing here, then their inference cost can be much higher.

DeepSeek also got https://github.com/deepseek-ai/3FS that makes cached reads a lot cheaper with way longer TTL. If Anthropic didn't invent something similar and uses some expensive solution like Redis, as indicated by the crappy TTL, then that also contributes to higher inference cost.

fc417fc802 · 2026-03-10T08:16:05 1773130565

> In practice, tps is a reflection of vram memory bandwidth during inference.

> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.

You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training obviously). You just end up with an increasingly deep pipeline. So time to first token increases but aggregate tps also increases as you add additional hardware.

jychang · 2026-03-10T08:28:23 1773131303

That doesn't work. Think about it a bit more.

Hint: what's in the kv cache when you start processing the 2nd token?

And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling vram across gpus) but does not allow you to run models faster.

Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited by how fast you can synchronize the all-reduce. And in general, models would have the same boost on the same hardware- so the Chinese models would have the same perf multiplier as Opus.

Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.

In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
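
A toy latency model of the layer-parallel vs tensor-parallel distinction, for a single request. It assumes decode time is purely the time to read the active weights from memory. Every number below (100 GB of active weights, 3 TB/s per GPU, 8 GPUs, 2 ms of sync per token) is a hypothetical placeholder, and the function names are illustrative.

```python
def decode_latency_single(weight_bytes: float, bandwidth: float) -> float:
    """Seconds per token on one device: read all active weights once."""
    return weight_bytes / bandwidth


def decode_latency_pipeline(weight_bytes: float, bandwidth: float,
                            n_gpus: int, hop_s: float = 0.0) -> float:
    """Layer parallelism: each GPU holds 1/n of the layers, but one token
    must pass through the stages one after another."""
    per_stage = (weight_bytes / n_gpus) / bandwidth
    return n_gpus * per_stage + (n_gpus - 1) * hop_s


def decode_latency_tensor(weight_bytes: float, bandwidth: float,
                          n_gpus: int, sync_s: float = 0.0) -> float:
    """Tensor parallelism: all GPUs read their shard of every layer at once,
    then pay a synchronization (all-reduce) cost."""
    return weight_bytes / (n_gpus * bandwidth) + sync_s


W, BW, N = 100e9, 3e12, 8  # hypothetical values
print(f"single:   {1 / decode_latency_single(W, BW):.0f} tps")
print(f"pipeline: {1 / decode_latency_pipeline(W, BW, N):.0f} tps")
print(f"tensor:   {1 / decode_latency_tensor(W, BW, N, sync_s=0.002):.0f} tps")
```

In this model the pipeline case is never faster than a single device for one request, while the tensor-parallel case is faster until the sync cost dominates.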

fc417fc802 · 2026-03-10T09:53:58 1773136438

Oh I see. I went and confused total aggregate throughput with per-query throughput there, didn't I?

jychang · 2026-03-10T06:51:38 1773125498

That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper, and then you use that to claim that they're 10x more efficient.

Opus isn't that expensive to host. Look at Amazon Bedrock's t/s numbers for Opus 4.5 vs the Chinese models. They're around the same order of magnitude- which means that Opus has roughly the same amount of active params as the Chinese models.

Also, you can select BF16 or Q8 providers on openrouter.

irthomasthomas · 2026-03-10T10:50:05 1773139805

Opus doubled in speed with version 4.5, leading me to speculate that they had promoted a sonnet size model. The new faster opus was the same speed as Gemini 3 flash running on the same TPUs. I think anthropics margins are probably the highest in the industry, but they have to chop that up with google by renting their TPUs.

F7F7F7 · 2026-03-10T15:08:50 1773155330

The conspiracy theorist side of me whispers "instead of the rumored Sonnet 5.0 you got Opus 4.6...suspicious"

aerhardt · 2026-03-10T13:38:45 1773149925

I guess more than a tautology it is an inversion of observed causes and effects?

grayxu · 2026-03-10T12:00:30 1773144030

This is not a valid argument. TPS is essentially QoS and can be adjusted; more GPUs allocated will result in higher speed.

yorwba · 2026-03-10T12:44:28 1773146668

There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.

erichocean · 2026-03-10T13:01:00 1773147660

Partially true, you can predict multiple tokens and confirm, which typically gives a 2-3x speedup in practice.

(Confirmation is faster than prediction.)

Many model architectures are specifically designed to make this efficient.

---

Separately, your statement is only true for the same gen hardware, interconnects, and quantization.
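
A rough sketch of where a 2-3x figure can come from, using the standard expected-tokens formula for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification. The cost of running the draft model is not included, so this is an upper bound on speedup rather than a prediction. The 0.8 acceptance rate is a hypothetical input.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass of the large model,
    assuming i.i.d. acceptance of each drafted token."""
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


for draft_len in (2, 4, 8):
    tokens = expected_tokens_per_step(accept_prob=0.8, draft_len=draft_len)
    print(f"draft_len={draft_len}: ~{tokens:.2f} tokens per large-model pass")
```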

grumpoholic · 2026-03-10T13:01:48 1773147708

With speculative decoding you can use more models to speed up the generation however.

salawat · 2026-03-12T23:26:18 1773357978

Yes, because speculation has NEVER bitten us in the ass before, right? Coughs in Spectre

Speculative decoding is just running more hardware to get a faster prediction. Essentially, setting more money on fire if you're being billed per token.

re-thc · 2026-03-10T08:24:48 1773131088

> That's a tautology. People think chinese models are 10x more efficient because they're 10x cheaper

They do have different infrastructure / electricity costs and they might not run on nvidia hardware.

It's not just the models.

jychang · 2026-03-10T08:34:48 1773131688

Except there are providers that serve both the Chinese models AND Opus. On the same hardware.

Namely, Amazon Bedrock and Google Vertex.

That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Normalized inference software stack, even (most likely). It's about as close to a 1-to-1 comparison as you can get.

Both Amazon and Google serve Opus at roughly 1/2 the speed of the Chinese models. Note that they are not incentivized to slow down the serving of Opus or the Chinese models! So that tells you the ratio of active params for Opus and for the Chinese models.

Shakahs · 2026-03-10T10:45:29 1773139529

AWS and GCP both have their own custom inference chips, so a better example for hosting Opus on commodity hardware would be Digital Ocean.

giancarlostoro · 2026-03-10T11:29:47 1773142187

And Microsoft's Azure. It's on all 3 major cloud providers. Which tells me, they can make profit from these cloud providers without having to pay for any hardware. They just take a small enough cut.

https://code.claude.com/docs/en/microsoft-foundry

https://www.anthropic.com/news/claude-in-microsoft-foundry

re-thc · 2026-03-10T11:37:14 1773142634

> Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models

We were talking about 10x, not 0.5x.

x86 vs arm64 could have different performance. The Chinese models could be optimized for different hardware so it could show massive differences.

atq2119 · 2026-03-10T12:55:08 1773147308

These providers do not run models on CPUs, x86 vs. Arm is irrelevant.

re-thc · 2026-03-10T18:17:12 1773166632

They run Nvidia and Huawei for example. And mine was just an example.

raggi · 2026-03-10T12:49:51 1773146991

Deployments like Bedrock are nowhere near SOTA operational efficiency, 1-2 OOM behind. The hardware is much closer, but pipeline, schedule, cache, recomposition, routing etc optimizations blow naive end to end architectures out of the water.

Analemma_ · 2026-03-10T17:12:50 1773162770

Do you have evidence for any of this, or are you repeating a bunch of buzzwords you’ve heard breathlessly repeated on Twitter?

raggi · 2026-03-11T14:02:20 1773237740

Many techniques are documented in papers, particularly those coming out of the Asian teams. I know of work going on in western providers that is similarly advanced. In short, read the papers.

nullstyle · 2026-03-10T14:51:45 1773154305

Evidence?

fennecfoxy · 2026-03-10T09:55:28 1773136528

I mean GN has covered the Nvidia black market in China enough that we pretty much know that they run on Nvidia hardware still.

dryarzeg · 2026-03-10T10:06:36 1773137196

How is this related to the inference, may I ask? Except for some very hardware-specific optimizations of model architecture, there's nothing to prevent anyone from hosting these models on their own infrastructure. And that's actually what many OpenRouter providers, at least some of which are based in the US, are doing. Because most of the Chinese models mentioned here are open-weight (except for Qwen, which has one proprietary "Max" model), literally anyone can host them, not just someone from China. So it just doesn't really matter.

fennecfoxy · 2026-03-10T10:26:36 1773138396

I mean sure, but in terms of inference per dollar/per watt Nvidia's GPUs are pretty up there - unless China is pumping out domestic chips cheaply enough.

Also with Nvidia you get the efficiency of everything (including inference) built on/for CUDA, even efforts to catch AMD up are still ongoing afaik.

I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware.

re-thc · 2026-03-10T10:51:01 1773139861

> unless China is pumping out domestic chips cheaply enough

They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.

> I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware

DS is "old". I wouldn't study them. The new 1s have a mandate to at least run on local hardware. There are data center requirements.

I agree it could still be trained on Nvidia GPUs (black market etc), but not running.

yorwba · 2026-03-10T11:05:18 1773140718

> The new 1s have a mandate to at least run on local hardware.

They do? Source?

But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)

re-thc · 2026-03-10T11:33:22 1773142402

> with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China

They just have a China only endpoint and likely a company under a different name.

Nothing to do with AI. TikTok is similar (global vs China operations).

jychang · 2026-02-28T12:38:59 1772282339

32GB vram is more than enough for Qwen 3.5 35b

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.

If you need a massive context for some reason where model+kv cache won't fit in 32gb, then use -ot to move the ffn moe experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM) but it'll work.
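
A quick way to sanity-check whether model plus KV cache fits, sketched in Python. The 19.17GB weight size is the Unsloth Q4_K_XL figure quoted elsewhere in this thread. The attention layout numbers (10 attention layers, 2 KV heads, head dim 256, fp16 cache) and the 1 GB overhead are hypothetical placeholders, not the real Qwen 3.5 config, so substitute values from the actual model card.

```python
def kv_bytes_per_token(n_attn_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: one key and one value vector
    per KV head per attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_budget_gb(vram_gb: float, weights_gb: float,
                       overhead_gb: float = 1.0) -> float:
    """VRAM left for KV cache after weights and a fixed overhead guess."""
    return vram_gb - weights_gb - overhead_gb


def max_context_tokens(budget_gb: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the remaining budget."""
    return int(budget_gb * 1e9 // bytes_per_token)


budget = kv_cache_budget_gb(vram_gb=32, weights_gb=19.17)
per_token = kv_bytes_per_token(n_attn_layers=10, n_kv_heads=2, head_dim=256)
print(f"KV budget: {budget:.2f} GB, {per_token} bytes/token, "
      f"~{max_context_tokens(budget, per_token):,} tokens")
```

If the context you need exceeds the result, that is the case where offloading a layer or two of experts to RAM becomes worth the speed hit.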

roxolotl · 2026-02-28T13:11:41 1772284301

Nice ok I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customizations but it’s interesting to learn what the options are.

jychang · 2026-02-28T10:15:51 1772273751

What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?

tosh · 2026-02-28T10:18:58 1772273938

Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

jychang · 2026-02-28T10:33:58 1772274838

I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.

Also, the benchmarks are because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.

danielhanchen · 2026-02-28T12:02:28 1772280148

Didn't expect this to be on HN haha - but HN does sometimes have older posts come up.

No your conclusion is false - only the old Q4_K_XL had slightly higher perplexity, all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.

If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example, for MiniMax some of our quants do worse on PPL and KLD vs AesSedai's, but AesSedai's does worse on LiveCodeBench by a lot, see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...

This is because see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-... - although bitwidths are in general monotonic ie q2_k < q3_k < q4_k < q5_k etc, we find KLD and PPL are actually not monotonic ie q3_k can actually have BETTER PPL than q4_k.

So the main point is bad luck on quantization - sometimes lower bits might get lower PPL and KLD, but actually this is a ruse and wrong, since on actual real world tasks, it's worse.
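
For readers unfamiliar with the two metrics, here is a minimal sketch of the textbook definitions. PPL is computed from the log-probabilities a model assigns to the actual tokens of a text. KLD compares the quantized model's next-token distribution against the full-precision one. The toy distributions are made up, and real tooling averages these over many tokens of a test corpus.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-probability of the observed tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q): how far q (quantized) drifts from p (reference)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: a model that gives every observed token probability 0.5.
print(f"PPL: {perplexity([math.log(0.5)] * 4):.2f}")

# Toy next-token distributions over a 3-token vocabulary.
reference = [0.7, 0.2, 0.1]
quantized = [0.6, 0.3, 0.1]
print(f"KLD: {kl_divergence(reference, quantized):.4f}")
```

Both numbers only measure agreement on token probabilities, which is why they can disagree with task benchmarks such as LiveCodeBench.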

jychang · 2026-02-28T12:28:45 1772281725

The Q4_K_XL is easily the most popular quant for the model, though.

So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't reflect in real world usage? If yes, why not just say that? "The Q4_K_XL had lower PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine". If it was a real quality issue, then what caused the issue?

The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption that most people would make is "oh, you quanted attention or ssm or something to mxfp4 and that turned out to be bad, so you retire mxfp4" but if you say that it's not that, then what's the actual issue?

segmondy · 2026-02-28T20:34:54 1772310894

each layer is made up of various weights, the weights are adjusted to quant it. a pure q8 will have all the weights as q8, or a q4 the same. but some are kept as f32, etc. here's an example of q3_k_xl - https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/tree/ma... we can see certain weights are f32, q8, q5, q3, etc. They used mxfp4 in some weights and mxfp4 doesn't seem to play nicely in quants so that's why they are retiring it. read their publication again and it should make more sense.

jychang · 2026-02-28T23:25:50 1772321150

I am aware of all that.

They literally never say “they used mxfp4 in some weights”. What you’re claiming they said doesn’t exist.

This isn’t a postmortem, it’s PR fluff without actually addressing the issue.

segmondy · 2026-03-01T06:59:45 1772348385

It's right there https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks I looked at the weights before. It's not PR fluff, they made it clear by showing how it really affected various tensors terribly.

"MXFP4 is much worse on many tensors - attn_gate, attn_q, ssm_beta, ssm_alpha using MXFP4 is not a good idea, and rather Q4_K is better - also MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight. It's better to use Q4_K than MXFP4 when choosing between them."

The Q4 quants had a mixture of mxfp4 leading to worse outcomes.
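
The 4.25 vs 4.5 bits-per-weight figures in the quoted passage fall out of the block layouts. A small sketch of the arithmetic: MXFP4 stores 32 four-bit values with one shared 8-bit scale, and Q4_K stores a 256-weight super-block made of 8 sub-blocks, each with a 6-bit scale and a 6-bit min, plus two fp16 super-block scales. `bits_per_weight` is an illustrative helper.

```python
def bits_per_weight(block_size: int, bits_per_value: int, metadata_bits: int) -> float:
    """Average storage cost per weight, including per-block metadata."""
    return (block_size * bits_per_value + metadata_bits) / block_size


# MXFP4: 32 values x 4 bits + one 8-bit shared scale.
mxfp4 = bits_per_weight(block_size=32, bits_per_value=4, metadata_bits=8)

# Q4_K: 256 values x 4 bits + 8 sub-blocks x (6-bit scale + 6-bit min)
# + two fp16 super-block scales.
q4_k = bits_per_weight(block_size=256, bits_per_value=4,
                       metadata_bits=8 * (6 + 6) + 2 * 16)

print(f"MXFP4: {mxfp4} bpw")  # 4.25
print(f"Q4_K:  {q4_k} bpw")   # 4.5
```

The extra quarter bit in Q4_K buys per-sub-block scales and mins, which is one plausible reason it holds up better on sensitive tensors.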

jychang · 2026-03-03T22:58:15 1772578695

Nope. Where do they say something along the lines of "we had MXFP4 tensors in our previous upload" or "that's why we re-uploaded new versions"?

This is a famous non-apology non-explanation of what actually happened. "They made it clear by showing how it really affected various tensors terribly"? Where do they even say they had ever previously uploaded any quant with MXFP4?

lostmsu · 2026-02-28T10:40:08 1772275208

Looking at their benchmarks there doesn't appear to be a meaningful difference between their quants and bartowski quants.

danielhanchen · 2026-02-28T11:56:13 1772279773

No, our new Qwen3.5 ones show the opposite, see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

lostmsu · 2026-02-28T17:46:17 1772300777

Am I misreading the table?

  Unsloth Q4_K_M

  PPL:       6.6053     KLD 99.9%: 0.5478     KLD mean: 0.0192

  bartowski Qwen_Q4_K_M

  PPL:       6.6097     KLD 99.9%: 0.5771     KLD mean: 0.0182

Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).
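
The percentages above can be checked directly from the quoted table values. A small sketch, taking the bartowski quant as the baseline, where negative means the Unsloth number is lower (lower is better for all three metrics):

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# (Unsloth Q4_K_M, bartowski Q4_K_M), values as quoted in the comment above.
metrics = {
    "PPL": (6.6053, 6.6097),
    "KLD 99.9%": (0.5478, 0.5771),
    "KLD mean": (0.0192, 0.0182),
}

for name, (unsloth, bartowski) in metrics.items():
    print(f"{name}: {relative_change(unsloth, bartowski):+.2%}")
```

This gives about -0.07% for PPL, about -5.1% for KLD 99.9%, and about +5.5% for KLD mean, consistent with the "5% better, 5% worse" reading.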

danielhanchen · 2026-03-01T04:10:04 1772338204

You forgot to check the disk space - _M and _XL are not the same across quants:

    Quant               Size      KLD 99.9%   KLD mean
    Unsloth Q4_K_M      18.49GB   0.5478      0.0192
    Unsloth Q4_K_XL     19.17GB   0.4097      0.0137
    bartowski Q4_K_M    19.77GB   0.5771      0.0182

lostmsu · 2026-03-01T11:25:25 1772364325

The table doesn't have bartowski Q4_K_XL to compare, but given the metrics of _Ms aren't universally better it's unclear if smaller size doesn't come with a cost.

az226 · 2026-03-01T21:13:26 1772399606

I’m curious how NVFP4 compares to their Q4.

danielhanchen · 2026-02-28T11:55:46 1772279746

Didn't expect this as well haha on HN again - probably related to Qwen3.5

jychang · 2026-02-28T09:51:10 1772272270

Not really breakthroughs, more like bugfixes for their broken first batch.

danielhanchen · 2026-02-28T12:09:38 1772280578

No this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks which shows SOTA on nearly all bits, and we shared all our research as well

jychang · 2026-02-28T12:20:07 1772281207

Yeah, I saw that yesterday. The blog post does not explain why/how the Qwen 3.5 quants uploaded on 2/27 are different from the files uploaded on 2/24.

Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...

Questions for a postmortem that the blog post left unanswered:

- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks. If yes, then why change the quantization anyways? Or was the old 2/24 quant actually much worse performing in the real world? I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual issue due to low quality tensors, then why not just say that?

- What were the main tensors that had the quantizations changed from 2/24 to 2/27? Did you now quantize attention tensors differently? Or perhaps ssm?

- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?

A quick sentence in the blog post saying "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc is a bad idea, we had that in our old models on 2/24 and our new models today are better" would make it clear. As it's written, it's trying to both say "PPL/KLD don't actually reflect real world quality" and "we changed our quant to improve PPL/KLD" at the same time, which seems contradictory.

zargon · 2026-02-28T18:05:37 1772301937

Explain what about that statement is false. Your original Q4_K_XL quant was broken. People noticing that it was a total outlier among other quants is what prompted this "research". Your own data proves that your new release fixes the bugs of your original, in order to match AesSedai's PPL. Fixing bugs is great. Searching for the best quant mix is helpful. I use your quants and appreciate your work. But whitewashing this situation dilutes trust and good will.

jychang · 2026-02-28T08:32:07 1772267527

It is that cheap. Look at Deepseek or GLM pricing.

lelanthran · 2026-02-28T09:35:59 1772271359

> It is that cheap. Look at Deepseek or GLM pricing.

Then it's a race to the bottom.

dangus · 2026-02-28T10:07:23 1772273243

Yep.

And unlike competitors, OpenAI has no ecosystem. Just a website and a domain name. Even a VSCode fork like Cursor is an improvement over that state.

Google pays over 15% of search revenue to be the default search engine on various browsers.