
A large model (100B+, the more the better) may be acceptable at 2-bit quantization, depending on the task. But not a small model, and especially not for technical tasks. On top of that, you still need room for the OS, other software, and the KV cache. 8GB just isn't very useful for local LLMs. That said, it can still be entertaining to try a 4-bit 8B model for the fun of it.
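
Back-of-the-envelope, to make the budget concrete (illustrative numbers, not a benchmark):

  # Rough weight-memory estimate: params * bits_per_weight / 8.
  def weight_gb(params_billion, bits):
      return params_billion * 1e9 * bits / 8 / 1e9

  print(weight_gb(8, 4))     # 4.0  -> an 8B model at 4-bit: ~4 GB
  print(weight_gb(8, 16))    # 16.0 -> the same model at fp16: ~16 GB
  print(weight_gb(100, 2))   # 25.0 -> a 100B model even at 2-bit: ~25 GB

So on an 8GB machine a 4-bit 8B model leaves roughly 4GB for the OS, the runtime, and the KV cache.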

100B+ is the total parameter count, whereas what matters here is the active parameter count - very different for sparse MoE models. You're right that there's some overhead for the OS/software stack, but it's not that much. The KV cache is a good candidate for being swapped out, since it only takes a limited number of writes per emitted token.
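
For scale, a rough per-token estimate of KV-cache growth (a sketch assuming Llama-8B-ish shapes: 32 layers, 8 KV heads, head dim 128, fp16 cache):

  # Bytes appended to the KV cache per emitted token (K and V per layer).
  layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
  kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
  print(kv_per_token)                 # 131072 -> 128 KiB per token
  print(kv_per_token * 8192 / 2**30)  # 1.0    -> ~1 GiB at 8K context

128 KiB of writes per token is trivial for any SSD to absorb.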

Total parameters, not active parameters, is the property that matters for model robustness under extreme quantization.

Once you're swapping from disk, performance will be unusable for most people. And for local inference, the KV cache is the worst possible thing to put on disk: it may be written only once per token, but the entire cache is read back on every emitted token during attention.
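
A quick bound on why (hypothetical drive speed, ignoring compute entirely):

  # If the whole KV cache lives on disk, each token must read it all back.
  kv_cache_gib = 1.0    # e.g. 8K context at ~128 KiB/token, as above
  ssd_gib_per_s = 3.0   # an assumed NVMe sequential read rate
  print(ssd_gib_per_s / kv_cache_gib)  # 3.0 -> at most ~3 tokens/s

And that's the optimistic sequential-read case; random access patterns would be worse.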