
I have a 4090 and an M1 Max with 64GB. The 4090 is far superior on Llama 2.


But are you using the newly released Apple MLX optimizations?


It's been approximately two months since I last tested it, so probably not.


But those optimizations are the subject of the article you are commenting on.
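For reference, trying them is only a few lines with the mlx-lm package on Apple Silicon; a minimal sketch, assuming an MLX-converted Llama 2 checkpoint (the repo name below is just an illustrative placeholder):

    # pip install mlx-lm   (Apple Silicon only)
    from mlx_lm import load, generate

    # Model repo is an assumption; any MLX-converted Llama 2 checkpoint works.
    model, tokenizer = load("mlx-community/Llama-2-7b-chat-mlx")

    # Generate a short completion and print tokens/sec stats with verbose=True.
    text = generate(model, tokenizer,
                    prompt="Explain quantization in one paragraph.",
                    max_tokens=128, verbose=True)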


On models < 24GB presumably. "Faster" depends on the model size.


In this case, the 4090 is far more memory efficient thanks to ExLlamav2.

70B in particular is indeed a significant compromise on the 4090, but not as much as you'd think. 34B and down though, I think Nvidia is unquestionably king.


Doesn't running 70B in 24GB need 2 bit quantisation?

I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?


2.65bpw, on a totally empty 3090 (and I mean totally empty).

I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was Llama v1, but now we have Yi and CodeLlama v2 (among others).
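For anyone wondering how 2.65bpw squeezes 70B into 24GB, the back-of-envelope math (weights only; it ignores the KV cache and context overhead, which is exactly why the card has to be empty) works out roughly like this:

    # Rough weight-memory estimate for a 70B-parameter model at a given bits-per-weight.
    # Ignores KV cache, activations, and context; that overhead is why "totally empty" matters.
    params = 70e9
    for bpw in (2.65, 4.0, 16.0):
        gib = params * bpw / 8 / 2**30
        print(f"{bpw:>5} bpw -> {gib:.1f} GiB")
    # ~21.6 GiB at 2.65bpw, ~32.6 GiB at 4bpw, ~130 GiB at fp16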



