Note that this is not the only way to run Qwen 3.5 397B on consumer devices, there are excellent ~2.5 BPW quants available that make it viable for 128G devices.
I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:
The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4, which is another layer of quality degradation.
In my experience the 2-bit quants can produce sensible output for short prompts, but they aren't useful for real work over longer sessions.
This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:
> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
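To see why that failure mode kills tool calling, here's a quick check (the good/bad strings are illustrative) showing that output with backslashes in place of quotes simply doesn't parse:

```python
import json

# the quoted failure mode: backslashes where the quotes should be
good = '{"name": "get_weather"}'
bad = r'{\name\: \get_weather\}'   # illustrative 2-bit-quant output

print(json.loads(good)["name"])    # parses fine

try:
    json.loads(bad)
except json.JSONDecodeError:
    print("unparseable: tool call dropped")
```

Any harness parsing tool calls has to drop or retry that second output, which is why the reliability hit is so severe.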
I can't speak to the OP's method, but I did test the smol-IQ2_XS quant (2.46 BPW) with the pi harness. I didn't run a very long session because token generation and prompt processing get very slow, but I worked with up to ~70k context and it maintained a lot of coherence throughout. IIRC GPQA diamond is supposed to exercise long chains of thought, and it scored exceptionally well at 82% (the official BF16 number is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).
Note that not all quants are the same at a given BPW. The smol-IQ2_XS quant I linked is quite dynamic: some tensors are q8_0, some q6_k, and some q4_k, while the majority are iq2_xs. In my testing, this smol-IQ2_XS quant is the best available in this BPW range.
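The effective BPW of a dynamic quant is just the parameter-weighted average over the tensor types. The fractions below are made up for illustration, not the actual smol-IQ2_XS mix, but they show how a mostly-iq2_xs model can land around 2.5 BPW:

```python
# rough effective bits-per-weight of a mixed ("dynamic") quant;
# fractions are illustrative, not the real smol-IQ2_XS tensor breakdown
mix = {  # quant type -> (bits per weight, fraction of parameters)
    "q8_0":   (8.50, 0.01),
    "q6_k":   (6.56, 0.02),
    "q4_k":   (4.50, 0.04),
    "iq2_xs": (2.31, 0.93),
}
assert abs(sum(frac for _, frac in mix.values()) - 1.0) < 1e-9

bpw = sum(bits * frac for bits, frac in mix.values())
print(f"effective BPW: {bpw:.2f}")
```

The point is that the handful of sensitive tensors kept at q8_0/q6_k barely move the average, but (per the quant authors) matter a lot for quality.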
Eventually I might try a more practical eval such as terminal bench.
This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.
Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.
> This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.
It would be nice to see a scientific assessment of that statement.
Generally the perplexity charts indicate that quality drops significantly below 4-bit, so in that sense 4-bit is the sweet spot if you're resource constrained.
4-bit is as low as I like to go. There are KLD and perplexity tests comparing quantizations where you can see the degradation curve, but perplexity and KLD numbers can be misleading compared to real-world use, where small errors compound over long sessions.
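For the curious, the per-token KLD metric these comparisons use is just the KL divergence between the full-precision and quantized models' next-token distributions, averaged over a corpus. A toy sketch (the distributions are made up):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# toy next-token distributions over a 4-token vocab:
# full-precision model vs. a hypothetical quantized one
p_full  = [0.70, 0.20, 0.05, 0.05]
q_quant = [0.55, 0.30, 0.10, 0.05]

print(f"KLD: {kl_divergence(p_full, q_quant):.4f} nats")
```

A single small number per token is exactly why it can look benign: it says nothing about how often the quant picks a *different* argmax token, which is what compounds over a long session.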
In my anecdotal experience I’ve been happier with Q6 and dealing with the tradeoffs that come with it over Q4 for Qwen3.5 27B.
What's the tok/s you get these days? Does it actually work well when you use more of that context?
By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy, what a success. Definitely the Kickstarter/Bountysource campaign I've been a tiny part of with the best outcome. I use it every day.
So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.
I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp, wouldn't be surprised to reach 25 tps in a few months.
> You're the guy who launched Neovim!
That's me ;D
> I use it every day.
So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/
Apologies to others for the offtopic comment, but thank you so much for neovim. I started using Vim 25 years ago and I almost don't know how to type without a proper Vi-based editor. I don't write as much code these days, but I write other stuff (which definitely needs to be mostly hand written) in neovim and I feel so grateful that this tool is still receiving love and getting new updates.
I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.
However, I did try 4-bit MLX with other Qwen 3.5 models and yes, it is significantly faster. I still prefer llama.cpp because it's an all-in-one package:
- SOTA dynamic quants (especially ik_llama.cpp)
- amazing web ui with MCP support
- Anthropic/OpenAI-compatible endpoints (meaning it can be used with virtually any harness)
- JSON-constrained output, which basically ensures tool-call correctness
- routing mode
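As a sketch of the JSON-constrained output point: llama-server's OpenAI-compatible endpoint accepts a JSON schema as a response format and compiles it to a grammar, so the sampler literally cannot emit a token that would break validity. The model name and schema below are placeholders, and the exact `response_format` shape can vary by llama.cpp version:

```python
import json

# hypothetical tool-call schema; with constrained decoding the server turns
# this into a grammar, so a weak quant can't emit \name\ where "name" must go
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

# OpenAI-style request body (field names follow the chat-completions format;
# some llama.cpp builds also accept a top-level "json_schema" field instead)
payload = {
    "model": "local",  # llama-server typically ignores the model name
    "messages": [{"role": "user", "content": "Get the weather in Paris"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "tool_call", "schema": schema},
    },
}

print(json.dumps(payload, indent=2)[:80])
```

This is a sketch of building the request, not sending it; POST it to `/v1/chat/completions` on a running llama-server to try it.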
Yes. Note that the only reason I acquired this device was to run LLMs, so I can dedicate its whole RAM to that. Probably not viable for a 128G device that you are actively using for other things.
Go to [Settings] » [Apps] » [Special app access] » [Display over other apps] and check if any preinstalled carrier apps or anything suspicious has this permission granted.
Apparently this is handled by the privileged STK[1] (SIM Toolkit) service. It can launch the browser, which I think is what's happening here.
GrapheneOS presently doesn't do anything different in this case; they pull it from AOSP without modifications. However, you can disable it via the frontend app (SIM Toolkit), as someone pointed out, though as far as I can tell this requires the applet on the SIM card to cooperate (i.e. offer the opt-out).
Otherwise you can disable STK altogether with ADB, but that will also lock you out of other interactive SIM card functions, which might not be a big deal.
Edit: "We plan to add the ability to restrict the capabilities of SIM Toolkit as an attack surface reduction measure. (2022)"[2] and open issue[3].
Like a popup how? What kind of dialog is it? It's more likely to be an app that's bundled by your carrier than your carrier MitM'ing ads into your stuff which is kinda what it sounded like
Caveat: if they're doing that, then they're almost certainly data mining your data streams (e.g. dns lookups etc.)
I wouldn't feel secure on such a carrier unless I also VPN'd traffic to a reputable provider (Nord, Express, or Proton) and forced DNS over TLS to known servers.
SIM cards can come with apps preloaded. There was a carrier in Mexico that would load a SIM app for Dominos Pizza and you could order a pizza from your phone if you were on that carrier. I learned this because of some carrier certification feedback I had to disposition at one job.
This is probably one of the most underrated LLM releases of the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I've been able to run locally, including Minimax 2.5 and GLM-4.7 (though I could only run GLM with a 2-bit quant). Some highlights:
- Very context efficient: SWA by default, on a 128G mac I can run the full 256k context or two 128k context streams.
- Good speeds on macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. Also, these speeds degrade very slowly as context increases: At 100k prefill, it has 20 t/s tg and 129 t/s pp.
- Trained for agentic coding. I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses, except Codex (its patch edit tool can confuse the model).
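On the context-efficiency point, a back-of-the-envelope KV-cache estimate shows why sliding-window attention (SWA) helps so much. All the model dimensions below are illustrative, not Step-3.5-Flash's actual config, and real SWA models usually keep some full-attention layers:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, cached_tokens, bytes_per=2):
    # factor of 2 for keys + values; fp16 elements (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per / 2**30

# full attention caches all 256k tokens in every layer...
full = kv_cache_gib(64, 8, 128, 256_000)
# ...while an SWA layer only caches its window (say 8k tokens)
swa = kv_cache_gib(64, 8, 128, 8_192)

print(f"full: {full:.1f} GiB, sliding-window: {swa:.1f} GiB")
```

With these toy numbers the full-attention cache is ~62 GiB versus ~2 GiB for pure SWA, which is the difference between "doesn't fit next to the weights" and "run two 128k streams at once" on a 128G machine.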
This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.
Have you tried Qwen3 Coder Next? I've been testing it with OpenCode and it seems to work fairly well with the harness. It occasionally calls tools improperly but with Qwen's suggested temperature=1 it doesn't seem to get stuck. It also spends a reasonable amount of time trying to do work.
I had tried Nemotron 3 Nano with OpenCode and while it kinda worked its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a tool to edit a file it would just use the shell tool and run sed on it.
That's the primary issue I've noticed with the agentic open weight models in my limited testing. They just seem hesitant to call tools that they don't recognize unless explicitly instructed to do so.
I think there are multiple ways these infinite loops can occur. It can be an inference engine bug because the engine doesn't recognize the specific format of tags/tokens the model generates to delineate the different types of tokens (thinking, tool calling, regular text). So the model might generate a "I'm done thinking" indicator but the engine ignores it and just keeps generating more "thinking" tokens.
It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.
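The marker-mismatch case can be sketched in a few lines. The `</think>` tag here is an assumption, since the actual delimiter varies by model family, which is exactly why engines get this wrong:

```python
MAX_TOKENS = 8  # safety cap standing in for the context limit

def decode(stream, end_marker="</think>"):
    """Toy decode loop: collect 'thinking' tokens until the end marker."""
    out = []
    for tok in stream:
        if tok == end_marker:
            return out          # engine recognized the marker: thinking ends
        out.append(tok)
        if len(out) >= MAX_TOKENS:
            break               # runaway generation: the "infinite loop"
    return out

tokens = ["Let's", "check", "</think>", "Answer:", "42"]
print(decode(tokens))                       # marker recognized, thinking stops
print(decode(tokens, end_marker="<done>"))  # mismatch: it all reads as thinking
```

In the second call the engine is looking for the wrong marker, so the model's "I'm done thinking" token is treated as just another thinking token and generation never transitions to the answer phase.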
Is getting something like an M3 Ultra with 512GB of RAM and running OSS models going to be cheaper over the next year or two compared to paying for Claude/Codex?
No, it is not cheaper. An M3 ultra with 512GB costs $10k which would give you 50 months of Claude or Codex pro plans.
However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to get to $10k.
And of course this is not even considering the energy cost of running inference on your own hardware (though Macs should be quite efficient there).
Is there a reliable way to run MLX models? On my M1 Max, LM Studio seems to output garbage through the API server sometimes even when the LM Studio chat with the same model is perfectly fine. llama.cpp variants generally always just work.
Both gpt-oss models are great for single-turn coding, but I feel they forget context too easily.
For example, when I tried gpt-oss 120b with codex, it very easily forgets something present in the system prompt: "use `rg` command to search and list files".
I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.
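One cheap trick such a harness could use, sketched below: re-inject the instruction every few turns instead of relying on the model to remember the original system prompt. The message shapes follow the common chat-completions format; the interval is a made-up tuning knob:

```python
REMINDER = {"role": "system",
            "content": "Reminder: use the `rg` command to search and list files."}

def with_reminders(messages, every=4):
    """Insert the reminder after every `every` messages in the history."""
    out = []
    for i, msg in enumerate(messages, start=1):
        out.append(msg)
        if i % every == 0:
            out.append(REMINDER)
    return out

history = [{"role": "user", "content": f"turn {i}"} for i in range(8)]
padded = with_reminders(history)
print(sum(m is REMINDER for m in padded))  # number of reminders injected
```

It costs a few tokens per turn, but keeps the instruction near the end of the context where recency helps.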
I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.
It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible sounding answer.
---
Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta
Opus 4.6: "Will a pelican fit inside a Honda Civic?"
GPT 5.2: "Write a limerick (or haiku) about a pelican."
Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"
Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"
GLM 5: "A pelican has four legs. How many legs does a pelican have?"
Kimi K2.5: "A photograph of a pelican standing on the..."
---
I agree with Qwen, this seems like a very cool benchmark for hallucinations.
I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it's mimicking.
Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives"
So if there is a single good "pelican on a bike" image on the internet or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican bike svg.
The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.
Would love to see a Qwen 3.5 release in the range of 80-110B which would be perfect for 128GB devices. While Qwen3-Next is 80b, it unfortunately doesn't have a vision encoder.
Considered getting a 512G Mac Studio, but I don't like Apple devices due to the closed software stack. I would never have gotten this Mac Studio if Strix Halo had existed in mid-2024.
For now I will just wait for AMD or Intel to release a x86 platform with 256G of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.
Given the shortage of wafers, the wait might be long. I am however working on a bridging solution. Someone has already shown Strix Halo clustering; I am working on something similar but with some pp boost.
Unfortunately, AMD shipped a great device with an unfinished software stack, and the community is rolling with it. The DGX Spark, by comparison, I think is more cluster-friendly.
You don't have to statically allocate the VRAM in the BIOS. It can be dynamically allocated. Jeff Geerling found you can reliably use up to 108 GB [1].
Care to go into a bit more on machine specs? I am interested in picking up a rig to do some LLM stuff and not sure where to get started. I also just need a new machine, mine is 8y-o (with some gaming gpu upgrades) at this point and It's That Time Again. No biggie tho, just curious what a good modern machine might look like.
Those Ryzen AI Max+ 395 systems are all more or less the same. For inference you want the one with 128GB of soldered RAM. There are ones from Framework, Gmktec, Minisforum, etc. Gmktec used to be the cheapest, but with the rising RAM prices it's Framework now, I think. You can't really upgrade or configure them. For benchmarks, look into r/LocalLLaMA; there are plenty.
Minisforum and Gmktec also have Ryzen AI HX 370 mini PCs with up to 128GB (2x64GB) LPDDR5. They're dirt cheap; you can get one barebone for ~€750 on Amazon (the 395 similarly retails for ~€1k). It should be fully supported in Ubuntu 25.04 or 25.10 with ROCm for iGPU inference (the NPU isn't available ATM, AFAIK), which is what I'd use it for. But I just don't know how the HX 370 compares to, e.g., the 395, iGPU-wise. I was thinking of getting one to run Lemonade and Qwen3-coder-next FP8, BTW, but I don't know how much RAM I should equip it with. Shouldn't 96GB be enough? Suggestions welcome!
I benchmarked unsloth/Qwen3-Coder-Next-GGUF using the MXFP4_MOE (43.7 GB) quantization on my Ryzen AI Max+ 395 and I got ~30 tps. According to [1] and [2], the AI Max+ 395 is 2.4x faster than the AI 9 HX 370 (laptop edition). Taking all that into account, the AI 9 HX 370 should get ~13 tps on this model. Make of that what you will.
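The estimate is just the measured throughput divided by the reported iGPU speed ratio:

```python
tps_395 = 30.0   # measured: MXFP4_MOE quant on the Ryzen AI Max+ 395
speedup = 2.4    # reported AI Max+ 395 vs AI 9 HX 370 iGPU ratio
tps_370_estimate = tps_395 / speedup
print(f"~{tps_370_estimate:.1f} tps")  # roughly the ~13 tps figure above
```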
Most Ryzen 395 machines don't have a PCI-e slot for that so you're looking at an extension from an m.2 slot or Thunderbolt (not sure how well that will work, possibly ok at 10Gb). Minisforum has a couple newly announced products, and I think the Framework desktop's motherboard can do it if you put it in a different case, that's about it. Hopefully the next generation has Gen5 PCIe and a few more lanes.
DGX Spark and any A10 devices, Strix Halo with the max memory config, several Mac mini/Mac Studio configs, HP ZBook Ultra G1a, most servers
If you're targeting end user devices then a more reasonable target is 20GB VRAM since there are quite a lot of gpu/ram/APU combinations in that range. (orders of magnitude more than 128GB).
I mostly use LM Studio for browsing and downloading models, testing them out quickly, but then actually integrating them is always with either llama.cpp or vLLM. Curious to try out their new cli though and see if it adds any extra benefits on top of llama.cpp.
Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.
Also, they've simply spent more time optimizing vLLM than the llama.cpp people have, even for a single inference call at a time. The best features are obviously concurrency and the shared cache, though. On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.
Both have their places and are complementary, rather than competitors :)
More details of my experience:
- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
Overall an excellent model to have for offline inference.