
ycui7 · 2026-05-24T00:46:32 1779583592

I think what this actually means is that you can apply for permanent residency from within the US, but you can only receive the physical green card outside the US once the case is approved. So the last step, getting the card, has to happen outside the country.

robotresearcher · 2026-05-24T01:04:27 1779584667

What is the point of that?

They might as well mail it to your home. That’s what happens today.

ycui7 · 2026-05-18T00:44:47 1779065087

This is not surprising at all. The biggest energy-efficiency benefit of cloud models is that when a GPU serves more than one request, its power consumption stays roughly the same. The more concurrent requests the server can handle, the less energy each request consumes. The server GPU is likely already more energy efficient than a local GPU, and concurrency makes the cost structure unbeatable by local hardware. It is generally assumed that local hardware only runs one request at a time, but if the local engine is meant to serve a small business with meaningful concurrency, the economics might still work out.
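A back-of-the-envelope sketch of the concurrency argument, under the comment's simplifying assumption that board power stays roughly flat as concurrency rises and each request still takes about the same time. The 600 W and 10 s figures are hypothetical placeholders, not measurements from the thread.

```python
def energy_per_request_j(gpu_watts: float, seconds_per_request: float,
                         concurrency: int) -> float:
    """Joules attributed to one request when `concurrency` requests share a GPU.

    Assumption: power draw stays roughly constant and each request finishes in
    about `seconds_per_request`, so the energy is split evenly among requests.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    return gpu_watts * seconds_per_request / concurrency

# Hypothetical inputs: a 600 W server GPU, 10 s per request.
for c in (1, 4, 16):
    print(c, energy_per_request_j(600, 10, c))
```

In practice per-request latency usually grows somewhat with batch size, so the real savings are smaller than a clean 1/c, but the direction is the same.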

ycui7 · 2026-05-10T04:33:04 1778387584

Every vendor defines the pinout of its 3.5mm audio-jack serial port differently. It is very dangerous to rely on a 3.5mm jack: there is no pinout standard for serial over 3.5mm.

Even as a pure audio jack, the 4-pole 3.5mm connector has two standards (CTIA and OMTP), which differ in where the ground and mic contacts go.
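Serial-over-3.5mm pinouts are vendor-specific and cannot be tabulated, but the two audio headset standards can. A minimal sketch of the 4-pole (TRRS) contact assignments shows that the two standards agree on left/right audio and swap exactly the ground and mic contacts.

```python
# Contact assignments for a 4-pole (TRRS) 3.5 mm headset plug.
# Tip and Ring 1 carry left/right audio in both standards.
CTIA = {"tip": "left", "ring1": "right", "ring2": "ground", "sleeve": "mic"}
OMTP = {"tip": "left", "ring1": "right", "ring2": "mic", "sleeve": "ground"}

def mismatched_contacts(a: dict, b: dict) -> list:
    """Return the contacts whose function differs between two pinouts."""
    return [contact for contact in a if a[contact] != b[contact]]

print(mismatched_contacts(CTIA, OMTP))  # ['ring2', 'sleeve']
```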

foresto · 2026-05-10T08:35:14 1778402114

Well, not every vendor does it differently, because there are far more vendors than possible wiring permutations on a 3-contact connector... but I understand what you mean.

The same issue applies to PCB header pins.

And the same goes for 25-pin D-sub connectors, which have been widely used as RS-232, parallel printer, and SCSI ports.

Voltages vary, too.

This is why we check before wiring them to other things.

ycui7 · 2026-05-10T03:33:23 1778384003

The people who need SPICE are dead serious about accuracy; sometimes even a 1ppm error is not tolerable. So an optimization borrowed from a game engine is definitely not suitable for engineering simulation.
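The thread does not say what the game-engine optimization is. Assuming, hypothetically, that it involves single-precision (float32) arithmetic, this sketch shows how rounding alone can blow past 1 ppm: it adds 0.1 ten thousand times with a float32 accumulator (emulated with the standard `struct` module) and with an ordinary float64 accumulator, then reports the relative error in ppm.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (IEEE binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step: float, n: int, single_precision: bool) -> float:
    """Add `step` to a running total n times, optionally rounding to float32."""
    total = 0.0
    if single_precision:
        step = to_f32(step)
    for _ in range(n):
        total += step
        if single_precision:
            total = to_f32(total)  # emulate a float32 accumulator
    return total

def error_ppm(value: float, reference: float) -> float:
    """Relative error in parts per million."""
    return abs(value - reference) / abs(reference) * 1e6

n = 10_000
reference = n / 10  # exact sum of 0.1 taken n times: 1000.0
ppm32 = error_ppm(accumulate(0.1, n, True), reference)
ppm64 = error_ppm(accumulate(0.1, n, False), reference)
print(f"float32: {ppm32:.1f} ppm, float64: {ppm64:.2e} ppm")
```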

NavinF · 2026-05-10T15:07:41 1778425661

Dude, these are incredibly oversimplified models of real components. How are you getting 1ppm when basic shit like tempco and self-heating are missing from pretty much every vendor-provided SPICE model?

ycui7 · 2026-05-05T14:20:02 1777990802

If the goal is only to get the median, you should not sort. Sorting is O(n log n). There are algorithms that find the median in O(n); check out quickselect, which runs in O(n) expected time.
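A minimal quickselect sketch with a random pivot. For clarity it partitions into new lists rather than in place, which keeps the O(n) expected running time but uses extra memory; an in-place partition avoids that. The worst case is still O(n^2), as with any randomized quickselect.

```python
import random
from typing import List

def quickselect(items: List[float], k: int) -> float:
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    data = list(items)
    while True:
        pivot = random.choice(data)
        lows = [x for x in data if x < pivot]
        highs = [x for x in data if x > pivot]
        n_pivots = len(data) - len(lows) - len(highs)  # copies of the pivot
        if k < len(lows):
            data = lows                       # answer is among the smaller values
        elif k < len(lows) + n_pivots:
            return pivot                      # answer equals the pivot
        else:
            k -= len(lows) + n_pivots         # skip everything <= pivot
            data = highs

def median(items: List[float]) -> float:
    """Median without sorting: one or two quickselect calls."""
    n = len(items)
    if n == 0:
        raise ValueError("median of empty list")
    if n % 2 == 1:
        return quickselect(items, n // 2)
    return (quickselect(items, n // 2 - 1) + quickselect(items, n // 2)) / 2

print(median([4, 1, 3, 2]))  # 2.5
```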

ycui7 · 2026-04-29T07:45:21 1777448721

Exiting dGPU for gaming, but staying in the LLM world.

ycui7 · 2026-04-29T07:42:37 1777448557

B70 idles at 30W, while RTX PRO 4500 idles at 9W (measured to be 5W at wall).

The B70 generates tokens at 1/3 the rate of the RTX PRO 4500 and draws over 3x the idle power while doing nothing.

ycui7 · 2026-04-29T07:31:12 1777447872

At this speed, people end up paying more for electricity than for API calls (at California electricity prices).
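A rough sketch for checking this yourself: the electricity cost of generating one million tokens locally, to compare against an API's per-million-token price. All inputs below (300 W under load, 10 tokens/s, $0.35/kWh) are hypothetical placeholders, not quoted California rates or measured B70 figures, and the estimate ignores idle time and hardware cost.

```python
def electricity_cost_per_mtok(watts: float, tokens_per_s: float,
                              usd_per_kwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens locally."""
    hours = 1_000_000 / tokens_per_s / 3600   # wall-clock hours of generation
    kwh = watts / 1000 * hours                 # energy drawn during that time
    return kwh * usd_per_kwh

# Hypothetical inputs: 300 W under load, 10 tokens/s, $0.35/kWh.
cost = electricity_cost_per_mtok(300, 10, 0.35)
print(f"${cost:.2f} per million tokens")
```

Because cost scales with 1/tokens_per_s, a card that is three times slower at similar power pays three times as much electricity per token.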

ycui7 · 2026-04-29T07:28:59 1777447739

You can get 120TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 using AutoRound quantization with MTP enabled. It runs faster than Sonnet API calls.

An RTX 5090 gets maybe 100TPS with MTP.
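MTP (multi-token prediction) heads are typically used as a built-in draft model for speculative decoding, which is where the speedup comes from. This sketch uses the idealized estimate from the speculative decoding literature, assuming each of `k` drafted tokens is accepted independently with probability `alpha` (both hypothetical parameters). It is not how the figures above were measured, and real speedups are lower because drafting and verification also cost time.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Idealized model: k drafted tokens, each accepted independently with
    probability alpha; the verifier always contributes one token of its own,
    giving (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("need 0 <= alpha <= 1 and k >= 0")
    if alpha == 1.0:
        return k + 1.0  # every draft accepted, plus the verifier's token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Hypothetical: 2 drafted tokens, 80% acceptance -> about 2.44 tokens per pass.
print(expected_tokens_per_step(0.8, 2))
```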

ycui7 · 2026-04-29T03:32:04 1777433524

When it was released, the Intel Arc B70 could only produce 1/3 the token rate of the RTX PRO 4500. Then again, it also cost 1/3 as much as the RTX PRO 4500.

It lacked software support for its primary target application: running LLMs. The officially supported vLLM fork is 6 versions behind mainline, and it did not run the latest hot new open models on Hugging Face. Running two B70s in parallel reduced the token rate instead of improving it. In short, the software behind the B70 is far behind.

adrian_b · 2026-04-29T08:02:14 1777449734

What you say is not consistent with TFA.

The parent article shows that B70 is faster than RTX 4000.

The RTX 4500 is faster than the RTX 4000, but it cannot be more than 3 times faster, not even 2 times faster.

The parent article is consistent with RTX 4500 being faster than B70 for ML inference, but by a much smaller ratio, e.g. less than 50% faster.

If you know otherwise, please point to the source.

If you have run a benchmark yourself, please describe the exact conditions.

In the benchmarks shown at Phoronix for llama.cpp, the relative performance was extremely variable for different LLMs, i.e. for some LLMs a B70 was faster than RTX 4000, but for others it was significantly slower.

Your 3x performance ratio may be true for a particular LLM with a certain quantization, but false for other LLMs or other quantizations.

This performance variability may be caused by immature software for the B70. For instance, instead of using matrix operations (XMX engines), non-optimized software might use traditional vector operations, which are slower.

It is also possible that for optimum performance with a certain LLM one may need to choose a different quantization for B70 than for NVIDIA, because for sub-16-bit number formats Intel supports only integer numbers.

jubilanti · 2026-04-29T13:32:21 1777469541

TFA's benchmark was MLPerf, which doesn't require CUDA as Intel has their own Arc plugin. But actually try to run llama.cpp on Arc and it is a roll of the dice.

muyuu · 2026-04-29T04:26:28 1777436788

There are nonlinearities to exploit in that calculus. Given enough VRAM to host a larger model that you're targeting, just the size can push you past the usability threshold at a much better price.
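The "enough VRAM to host a larger model" threshold can be estimated from parameter count and quantization width. A minimal sketch: it counts the weights only, and the 4 GB headroom for KV cache and runtime buffers is a hypothetical allowance, as are the 16 GB and 32 GB card sizes in the tests.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(vram_gb: float, params_billion: float, bits_per_weight: float,
         headroom_gb: float = 4.0) -> bool:
    """True if the weights plus a hypothetical KV-cache/runtime headroom fit."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb

# A 27B model at 4 bits per weight needs about 13.5 GB for weights alone.
print(weight_gb(27, 4))
```

Crossing such a threshold is a step change: a model either fits on the card or spills to system memory, which is the nonlinearity that can make a cheaper card with more VRAM worthwhile.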

ycui7 · 2026-04-29T07:37:05 1777448225

When you get 4 of these, the idle power alone is 120W. That is a lot of electricity if left on 24/7.

At that power consumption, local inference also ends up more expensive than API calls, and many times slower. It starts to feel very stupid to run local inference.

If the client is very keen on privacy, then they can pay for NVIDIA.

I ended up returning my B70s and bought an RTX PRO 6000.
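The idle figures quoted in this thread (30W per B70, 120W for four, 9W for the RTX PRO 4500) convert to yearly energy with simple arithmetic, assuming the machine is left on 24/7 as described. Multiply by your own electricity rate to get a cost.

```python
HOURS_PER_YEAR = 24 * 365

def idle_kwh_per_year(idle_watts: float) -> float:
    """Energy (kWh) drawn in one year by a device idling at `idle_watts`."""
    return idle_watts * HOURS_PER_YEAR / 1000

four_b70 = idle_kwh_per_year(4 * 30)      # 120 W total, per the comment
rtx_pro_4500 = idle_kwh_per_year(9)        # reported 9 W idle
print(four_b70, rtx_pro_4500)
```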

ycui7 · 2026-04-29T07:25:13 1777447513

The problem is that the more B70s you have, the slower inference gets (due to terrible software at the moment). A single B70 is barely faster than CPU inference. With 4 B70s, you might as well run inference on the CPU and be faster, using cheaper DDR5 instead of GDDR6.

adrian_b · 2026-04-29T08:10:53 1777450253

For what you say to be useful, please specify what software you used with the B70, including its version.

Hardware-wise, a B70 should be significantly faster than any available CPU at ML inference. If that was not the case in your tests, it must really be a software problem, so you must identify the software so that others know what does not work.
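One way to make the "hardware-wise" argument concrete: for a dense model at batch size 1, each generated token requires reading roughly all the weights from memory, so decode speed is capped near memory bandwidth divided by weight size. The bandwidth numbers below are hypothetical placeholders, not spec-sheet values for the B70 or any CPU. If a measured rate sits far below this ceiling, or near the CPU's ceiling, software is the likely bottleneck.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough batch-1 decode ceiling (tokens/s) for a dense model.

    Each token needs at least one full read of the weights, so tokens/s cannot
    exceed bandwidth / weight size. Real systems land below this because of
    KV-cache reads, kernel efficiency, and other overheads.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb

# Hypothetical: 600 GB/s GPU memory vs 90 GB/s dual-channel DDR5,
# both running a model with 15 GB of weights.
print(decode_tps_upper_bound(600, 15), decode_tps_upper_bound(90, 15))
```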