And that in turn affects tool adoption. I have dabbled in Lua for interacting with other software such as mpv, but never got much into the weeds with it because it lacks native JSON support, and I need to interact with JSON all the time.
Yeah, LuaJIT is one of the use cases I had in mind working on this. JSON is pretty fast in modern JS engines, but in Lua land JSON kinda sucks and doesn't really match the language without using metatables.
JSON has `null` values with string keys, but Lua doesn't have `null`. It has `nil`, but a table key can't hold a nil value: setting a key to nil deletes it.
Lua tables are unordered. But JS and JSON are often ordered and order often matters.
RX, however, matches Lua/LuaJIT extremely well and should outperform the JS Proxy-based decoder. Since the Lua side uses metatables anyway due to the lazy parsing, it's trivial to do things like preserve order when calling `pairs` and `ipairs`, and even include keys with associated null values.
You can round trip safely in Lua, which is not easy with most JSON implementations.
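For what it's worth, the standard workaround in any language whose tables can't hold a null is a dedicated sentinel value (Lua's cjson exposes this as `cjson.null`). A minimal Python sketch of the idea, since it's easy to test there:

```python
import json

# Sentinel standing in for JSON null, so a table/dict can tell
# "key absent" apart from "key present with a null value" -- the
# same trick Lua JSON libraries use, since assigning nil to a
# Lua table key deletes the key entirely.
NULL = object()

def decode(text):
    # object_pairs_hook sees each object's key/value pairs in order
    return json.loads(text, object_pairs_hook=lambda pairs: {
        k: (NULL if v is None else v) for k, v in pairs})

def restore(v):
    # Map the sentinel back to None before serializing
    if v is NULL:
        return None
    if isinstance(v, dict):
        return {k: restore(x) for k, x in v.items()}
    if isinstance(v, list):
        return [restore(x) for x in v]
    return v

def encode(value):
    return json.dumps(restore(value))

doc = decode('{"a": null, "b": 1}')
assert "a" in doc and doc["a"] is NULL      # the null key survives
assert encode(doc) == '{"a": null, "b": 1}'  # round trip
```

(Nulls inside arrays don't need the sentinel, since a list can hold None/nil-equivalents just fine; the problem is specific to map keys.)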
Once you run more than one agent in a loop, you inevitably recreate distributed systems problems: message ordering, retries, partial failure, etc.
Most agent frameworks pretend these don't exist. Some address them partially; none of the frameworks I've seen address all of them.
Would be interesting to see alternative scoring besides “tests pass”, e.g. diff size, abstraction depth added/removed, or whether the solution introduces new modules/dependencies. That would make it possible to see whether “unmergeable” PRs correlate with simple structural signals.
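A rough sketch of what such structural signals could look like, computed straight off a unified diff (Python; the signal names are purely illustrative, not from any framework):

```python
# Hypothetical scoring signals beyond "tests pass": diff size and
# whether a patch introduces new imports (a crude proxy for new
# modules/dependencies).
def diff_signals(unified_diff: str) -> dict:
    added = removed = new_imports = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file headers, not content lines
        if line.startswith("+"):
            added += 1
            if line[1:].lstrip().startswith(("import ", "from ")):
                new_imports += 1
        elif line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed,
            "churn": added + removed, "new_imports": new_imports}

patch = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,3 @@
+import requests
 def f():
-    return 1
+    return requests.get(URL).status_code
"""
print(diff_signals(patch))
# {'added': 2, 'removed': 1, 'churn': 3, 'new_imports': 1}
```

Correlating those numbers with merge outcomes would be straightforward once you have a batch of PRs labeled mergeable/unmergeable.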
If both sides refactor the same function into multiple smaller ones (extract method) or rename it, can Weave detect that as a structural refactor, or does it become “delete + add”? Any heuristics beyond name matching?
Yes, Weave detects renames via structural_hash (an AST-normalized hash that ignores identifier names). If both sides rename the same function, it matches by structure and merges cleanly.
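For anyone curious what an AST-normalized hash can look like, here's a minimal Python sketch of the general idea (not Weave's actual implementation): anonymize every identifier, then hash the dumped tree.

```python
import ast, hashlib

# Replace every identifier with a placeholder so that two functions
# differing only in names produce the same hash.
class Anonymize(ast.NodeTransformer):
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "_"
        return node
    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

def structural_hash(src: str) -> str:
    tree = Anonymize().visit(ast.parse(src))
    # ast.dump omits line/column info by default, so only structure counts
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

a = structural_hash("def total(xs):\n    return sum(xs)")
b = structural_hash("def aggregate(values):\n    return sum(values)")
assert a == b  # rename-only change keeps the hash stable
```

An extract-method refactor on both sides is harder, since the tree shape actually changes; a structural hash alone would see that as delete + add unless the tool hashes sub-units too.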
Thanks a lot, I will test it out as you said. In the meantime, could you also open an issue on the repo, so that it helps me and others track it?
One thing that seems under-discussed in this space is the shift from verifying programs to verifying generation processes.
If a piece of code is produced by an agent loop (prompt -> tool calls -> edits -> tests), the real artifact isn’t just the final code but the trace/pipeline that produced it.
In that sense verification might look closer to: checking constraints on the generator (tests/specs/contracts), verifying the toolchain used by the agent, and replaying generation under controlled inputs.
That feels closer to build reproducibility or supply-chain verification than traditional program proofs.
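A toy illustration of "the trace is the artifact": commit each step of the generation process into a hash chain, so a verifier can replay the steps and check the final digest. The step shapes here are made up for the example; no real agent framework is assumed.

```python
import hashlib, json

# Each step (prompt, tool call, edit, test run) commits to everything
# before it, so the final digest is tamper-evident: change, drop, or
# reorder any step and the digest changes.
def chain(steps):
    digest = b""
    for step in steps:
        record = json.dumps(step, sort_keys=True).encode()
        digest = hashlib.sha256(digest + record).digest()
    return digest.hex()[:16]

trace = [
    {"kind": "prompt", "text": "fix the failing test"},
    {"kind": "tool", "name": "pytest", "exit": 1},
    {"kind": "edit", "file": "app.py", "patch_sha": "ab12"},
    {"kind": "tool", "name": "pytest", "exit": 0},
]
assert chain(trace) == chain(list(trace))  # deterministic
assert chain(trace) != chain(trace[:-1])   # dropping a step is detectable
```

That's exactly the shape of a supply-chain attestation: the verifier doesn't re-prove the code, it checks that the recorded process was followed.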
With packages like this (lots of cores, multi-chip packaging, lots of memory channels), the architecture is increasingly a small cluster on a package rather than a monolithic CPU.
I wonder whether the next bottleneck becomes software scheduling rather than silicon - OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA zone topology. The default design of our software would otherwise effectively route all memory accesses to the same NUMA zone, and performance went down the drain.
Modern AMD processors are basically a bunch of smaller processors (chiplets) glued together with an interconnect. So yes, single-chip nodes can have many NUMA zones.
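For anyone hitting the same thing, a sketch of that kind of fix in Python (Linux-specific sysfs paths, error handling elided): discover the NUMA nodes and round-robin workers across them by CPU affinity, so their allocations don't all land on node 0.

```python
import os

def parse_cpulist(text):
    # sysfs cpulist format, e.g. "0-3,8-11" -> [0,1,2,3,8,9,10,11]
    cpus = []
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def node_cpus():
    # Map NUMA node id -> list of CPU ids, straight from sysfs
    base = "/sys/devices/system/node"
    nodes = {}
    for entry in sorted(os.listdir(base)):
        if entry.startswith("node") and entry[4:].isdigit():
            with open(f"{base}/{entry}/cpulist") as f:
                nodes[int(entry[4:])] = parse_cpulist(f.read())
    return nodes

def pin_worker(worker_idx, nodes):
    # Round-robin workers over nodes; pin the calling process
    node = sorted(nodes)[worker_idx % len(nodes)]
    os.sched_setaffinity(0, nodes[node])
    return node

assert parse_cpulist("0-3,8-11") == [0, 1, 2, 3, 8, 9, 10, 11]
```

Note affinity alone only steers execution; memory then stays local via the kernel's first-touch allocation policy. A fuller fix binds memory explicitly too, e.g. `numactl --membind` or libnuma.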
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.
The kernel tries to guess as well as it can, though - many years ago I hit a fun bug in the kernel scheduler triggered by NUMA process migration, i.e. the kernel moving processes to the core closest to their RAM. It happened that in some cases the migrated processes never got scheduled again and were stuck forever.
Disabling NUMA migration removed the problem. I figured out the issue thanks to the excellent ‘A Decade of Wasted Cores’ paper, which essentially said that on ‘big’ machines like ours funky things could happen scheduling-wise, so I started looking at scheduling settings.
The main NUMA-pinning performance issue I was describing was different though, and like you said came from us needing to change the way the code was written to account for the distance to the RAM. Modern servers will usually let you choose anything from fully managed (hope and pray, single zone) to many zones, and then, depending on what you've chosen to expose, use it in your code. As always: benchmark, benchmark, benchmark.
Guessing this is especially hard to automate with peripherals involved. I once had a workload slow severely because it was running on the NUMA node that didn't share memory with the NIC.
Isn't high-grade SSD storage pretty much a memory layer as well these days, since the difference in access time and throughput is no longer several orders of magnitude but only one or two (compared to the last layer of memory)?
Optane was supposed to fill the gap but Intel never found a market for this.
Flash is still extremely slow compared to RAM, including modern flash - especially in a world where RAM is already very slow and your CPU already spends its time waiting for it.
That being said, you should consider RAM/flash/spinning disks all part of one storage hierarchy with different constants and tradeoffs (volatile or not, big or small, fast or slow, etc.), and knowing these tradeoffs will help you design simpler and better systems.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.
No they don't. They are horribly wasteful and inefficient compared to kernel TCP. Also, they are useless because they sit on top of a kernel network interface anyway.
Unless you're doing specific tricks to minimize latency (HFT, I guess?) then there is no point.
Do the partitioned stacks of network namespaces share a single underlying global stack or are they fully independent instances? (And if not, could they be made so?)
I think you could get much of the way there by isolating a single NIC's receive queues, so the kernel doesn't decide to run off and service softirqs for random foreign tasks just because your task called tcp_sendmsg.
I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related - thermal, power, memory, and other I/O. Because of this, you presumably never get true "288 core" performance out of this - as in, it's not going to mine Bitcoin 288× as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.
Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.
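The quadratic trap is easy to make concrete: any all-to-all exchange between workers grows as n² in the core count, while a tree-shaped reduce/broadcast stays linear. This is just the combinatorics, nothing machine-specific:

```python
# A design that does an all-to-all step (every worker exchanging
# state with every other worker) felt fine at 8 cores but explodes
# at 288, whereas a reduce-then-broadcast tree stays linear.
def all_to_all_msgs(cores):
    return cores * (cores - 1)   # directed pairs: n * (n - 1)

def tree_msgs(cores):
    return 2 * (cores - 1)       # one reduce pass + one broadcast pass

print(all_to_all_msgs(8), tree_msgs(8))      # 56 14
print(all_to_all_msgs(288), tree_msgs(288))  # 82656 574
```

Same story for a single hot lock: the serialized critical section is a constant, so by Amdahl's law its share of total time grows as everything else spreads across more cores.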
You're responding out of context. The parent was asking if there are bottlenecks specifically related to scheduling. I explicitly made the point that if there are bottlenecks, they're more likely related to memory.
afaik the mainline limit is 4096 threads. HP sells servers with 32 sockets × 60 cores/socket × 2 threads/core = 3840 threads, so we are pretty close to that limit.
How the heck does the OS see it as a single system, is there some pcie or rdma black magic that allows the kernel to just address memory in a different chassis? Maybe CXL?
No it's actual hardware coherent memory across the system. At a high level it is the same way two cores/caches are connected within one chip, or the same way two sockets are connected on the same board. Just using cables instead of wires in the chip or on a board.
This system has SMP ASICs on the motherboards that talk to a couple of Intel processor sockets using their coherency protocol over QPI, and they basically present themselves as a coherency agent and memory provider (similarly to the way that processors themselves have caches and DDR controllers). The Intel CPUs basically talk to them the same way they would another processor. But out the other side these ASICs connect to a bunch of others all doing the same thing, and they use their own coherency protocol among themselves.
So it's not CXL, instead it's proprietary ASICs masquerading as NUMA nodes but actually forwarding to their counterparts in the other chassis? Are they proprietary to HP or is this some new standard?
It's not cheating or a cluster based system. All the biggest high end servers use multiple externally cabled systems (chassis, sled, drawer). The biggest ones even span multiple racks (aka frames). These days it is HP and IBM remaining in the game.
These all have real hardware coherency going over the external cables, same protocol. Here is a Power10 server picture, https://www.engineering.com/ibm-introduces-power-e1080-serve... the cables attach right to headers brought out of the chip package right off the phy, there's no ->PCI->ethernet-> or anything like that.
These HP systems are similar. These are actually descendants of SGI Altix / SGI Origin systems which HP acquired, and they still use some of the same terminology (NUMAlink for the interconnect fabric). HP did make their own distinct line of big iron systems when they had PA-RISC and later Itanium but ended up acquiring and going with SGI's technology for whatever reasons.
These HP/SGI systems are slightly different from IBM mini/mainframes because they use "commodity" CPUs from Intel that don't support glueless multi socket that large or have signaling that can get across boards, so these have their own chipset that has some special coherency directories and a bunch of NUMAlink PHYs.
SGI systems came from HPC so they were actually much bigger before that, the biggest ones were something around 1024 sockets, back when you only had 1 CPU per socket. The interconnect topology used to be some tree thing that had like 10 hops between the farthest nodes. It did run Linux and wasn't technically cheating, but you really had to program it like a cluster because resource contention would quickly kill you if there was much cacheline transfer between nodes. Quite amazing machines, but not suitable for "enterprise" so IIRC they have cut it down and gone with all-to-all interconnect. It would be interesting to know what they did with coherency protocol, the SGI systems used a full directory scheme which is simple and great at scaling to huge sizes but not the best for performance. IBM systems use extremely complex broadcast source snooping designs (highly scoped and filtered) to avoid full directory overhead. Would be interesting to know if HPE finally went that way with NUMAlink too.
Cheating IMO would be an actual cluster of systems using software (firmware/hypervisor) to present a single system image, using the MMU and IB/Ethernet adapters to provide coherency.
Sounds like a HPE Compute Scale-up Server 3200, but again keep in mind that's something where there's probably a fabric between nodes one way or another.
> OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
I mean....
IMO Erlang/Elixir is a not-terrible benchmark for how things should work at that scale... Hell, while not a runtime, I'd argue Akka/Pekko on the JVM and Akka.NET on the .NET side would be able to do some good with it... [0] Similarly for Go and channels (at least hypothetically...)
[0] - Of course, you can write good scaling code on JVM or CLR without these, but they at least give some decent guardrails for getting a good bit of the Erlang 'progress guaranteed' sauce.
> I wonder whether the next bottleneck becomes software scheduling rather than silicon
Yep, scheduling has been a problem for a while. There was an amazing article a few years ago about how the Linux kernel was accidentally hardcoded to 8 cores; you can probably google and find it.
IMO the most interesting problem right now is the cache: you get cache misses every time a task moves cores. The problem is that with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent thrashing and reloading the CPU cache.
That's the one. Funny thing, it's not actually clickbait.
The bug made it to the kernel mailing list, where some Intel people looked into it and confirmed it. The kernel's allocation logic was capped at 8 cores, which leaves a few percent of performance on the table as the number of cores increases and the allocation becomes less and less optimal.
It's a classic tragedy of the commons. CPUs have gotten so complicated that there may only be a handful of people in the world who could comprehend and work on a bug like this.
I noticed in the README that each commit message includes the agent and model, which is a nice start toward reproducibility.
I’m wondering how deep you plan to go on environment pinning beyond that. Is the system prompt / agent configuration versioned? Do you record tool versions or surrounding runtime context?
My mental model is that reproducible intent requires capturing the full "execution envelope", not just the human prompt + model & agent names. Otherwise it becomes more of an audit trail (which is also a good feature) than something you can deterministically re-run.
That’s fair - strict determinism isn’t possible in the traditional sense. I was thinking more along the lines of bounded reproducibility.
If the model, parameters, system prompt, and toolchain are pinned, you might not get identical output, but you can constrain the space of possible diffs.
It reminds me a bit of how StrongDM talks about reproducibility in their “Digital Twin” concept - not bit-for-bit replay, but reproducing the same observable behavior.
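Concretely, a pinned "execution envelope" could be as simple as a canonicalized, hashed manifest. All the field names below are hypothetical, not from any particular tool:

```python
import hashlib, json

# Hash a canonical JSON encoding of everything that shaped the run,
# so two runs can be compared (or diffed) even when outputs differ.
def envelope_digest(env: dict) -> str:
    canonical = json.dumps(env, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:16]

env = {
    "model": "example-model-v1",                      # hypothetical
    "params": {"temperature": 0.2, "seed": 7},
    "system_prompt_sha256": hashlib.sha256(b"example prompt").hexdigest(),
    "agent": {"name": "example-agent", "version": "0.3.1"},  # hypothetical
    "tools": {"pytest": "8.0.0", "git": "2.43.0"},    # pinned versions
}

assert envelope_digest(env) == envelope_digest(dict(env))  # stable
changed = {**env, "params": {"temperature": 0.7, "seed": 7}}
assert envelope_digest(changed) != envelope_digest(env)    # param drift shows up
```

Stamping that digest into each commit message alongside the agent/model name would turn the audit trail into something you can at least re-run under the same envelope.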
Even a technically superior format struggles without that ecosystem.