MoE doesn’t really help with the memory requirements, for the reason mentioned in the other comment, but it does reduce the compute needed per token of inference. That matters because the M3 Max and M2 Ultra don’t have the strongest GPUs: a dense 70B model is already pretty slow on my M3 Max, and this model activates ~86B parameters per forward pass.
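Rough back-of-envelope for that memory/compute split. Only the ~86B active-parameter figure comes from the comment above; the total parameter count and fp16 weights are assumptions for illustration:

```python
# Back-of-envelope: MoE memory vs compute per token.
# ASSUMPTIONS: total parameter count and fp16 storage are illustrative;
# only the ~86B active-parameter figure is from the comment.
TOTAL_PARAMS = 314e9    # every expert must be resident in memory
ACTIVE_PARAMS = 86e9    # parameters actually touched per token
BYTES_PER_PARAM = 2     # fp16/bf16 weights

memory_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
flops_per_token = 2 * ACTIVE_PARAMS        # ~2 FLOPs per active param
dense_70b_flops = 2 * 70e9                 # dense model for comparison

print(f"weights in memory: ~{memory_gb:.0f} GB")
print(f"compute per token: ~{flops_per_token / 1e9:.0f} GFLOPs "
      f"(dense 70B: ~{dense_70b_flops / 1e9:.0f} GFLOPs)")
```

So the whole model still has to fit in (unified) memory, but each token only pays for the active slice of it, which is why MoE helps tokens/sec even when it doesn’t help RAM.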