And that in turn affects tool adoption. I have dabbled in Lua for interacting with other software such as mpv, but never got much into the weeds with it because it lacks native JSON support, and I need to interact with JSON all the time.
Yeah, LuaJIT is one of the use cases I had in mind working on this. JSON is pretty fast in modern JS engines, but in Lua land JSON kinda sucks and doesn't really match the language without using metatables.
JSON has `null` values with string keys, but Lua doesn't have `null`. It has `nil`, but a table key can't hold a nil value: setting a key to nil deletes it.
Lua tables are unordered. But JS and JSON are often ordered and order often matters.
RX, however, matches Lua/LuaJIT extremely well and should outperform the JS Proxy-based decoder. Since the Lua side uses metatables anyway due to the lazy parsing, it's trivial to do things like preserve order when calling `pairs` and `ipairs`, and even include keys with associated null values.
You can round trip safely in Lua, which is not easy with most JSON implementations.
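For what it's worth, the standard workaround in any language whose tables can't hold a null is a dedicated sentinel value (Lua's cjson exposes this as `cjson.null`). A minimal Python sketch of the idea, since it's easy to test there:

```python
import json

# Sentinel standing in for JSON null, so a table/dict can tell
# "key absent" apart from "key present with a null value" -- the
# same trick Lua JSON libraries use, since assigning nil to a
# Lua table key deletes the key entirely.
NULL = object()

def decode(text):
    # object_pairs_hook sees each object's key/value pairs in order
    return json.loads(text, object_pairs_hook=lambda pairs: {
        k: (NULL if v is None else v) for k, v in pairs})

def restore(v):
    # Map the sentinel back to None before serializing
    if v is NULL:
        return None
    if isinstance(v, dict):
        return {k: restore(x) for k, x in v.items()}
    if isinstance(v, list):
        return [restore(x) for x in v]
    return v

def encode(value):
    return json.dumps(restore(value))

doc = decode('{"a": null, "b": 1}')
assert "a" in doc and doc["a"] is NULL      # the null key survives
assert encode(doc) == '{"a": null, "b": 1}'  # round trip
```

(Nulls inside arrays don't need the sentinel, since a list can hold None/nil-equivalents just fine; the problem is specific to map keys.)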
Once you run more than one agent in a loop, you inevitably recreate distributed systems problems: message ordering, retries, partial failure, etc.
Most agent frameworks pretend these don't exist. Some address them partially; none of the frameworks I've seen address all of them.
Would be interesting to see alternative scoring besides “tests pass”, e.g. diff size, abstraction depth added/removed, or whether the solution introduces new modules/dependencies. That would make it possible to see whether “unmergeable” PRs correlate with simple structural signals.
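A rough sketch of what such structural signals could look like, computed straight off a unified diff (Python; the signal names are purely illustrative, not from any framework):

```python
# Hypothetical scoring signals beyond "tests pass": diff size and
# whether a patch introduces new imports (a crude proxy for new
# modules/dependencies).
def diff_signals(unified_diff: str) -> dict:
    added = removed = new_imports = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file headers, not content lines
        if line.startswith("+"):
            added += 1
            if line[1:].lstrip().startswith(("import ", "from ")):
                new_imports += 1
        elif line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed,
            "churn": added + removed, "new_imports": new_imports}

patch = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,3 @@
+import requests
 def f():
-    return 1
+    return requests.get(URL).status_code
"""
print(diff_signals(patch))
# {'added': 2, 'removed': 1, 'churn': 3, 'new_imports': 1}
```

Correlating those numbers with merge outcomes would be straightforward once you have a batch of PRs labeled mergeable/unmergeable.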
If both sides refactor the same function into multiple smaller ones (extract method) or rename it, can Weave detect that as a structural refactor, or does it become “delete + add”? Any heuristics beyond name matching?
Yes, Weave detects renames via structural_hash (an AST-normalized hash that ignores identifier names). If both sides rename the same function, it matches by structure and merges cleanly.
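For anyone curious what an AST-normalized hash can look like, here's a minimal Python sketch of the general idea (not Weave's actual implementation): anonymize every identifier, then hash the dumped tree.

```python
import ast, hashlib

# Replace every identifier with a placeholder so that two functions
# differing only in names produce the same hash.
class Anonymize(ast.NodeTransformer):
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "_"
        return node
    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

def structural_hash(src: str) -> str:
    tree = Anonymize().visit(ast.parse(src))
    # ast.dump omits line/column info by default, so only structure counts
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

a = structural_hash("def total(xs):\n    return sum(xs)")
b = structural_hash("def aggregate(values):\n    return sum(values)")
assert a == b  # rename-only change keeps the hash stable
```

An extract-method refactor on both sides is harder, since the tree shape actually changes; a structural hash alone would see that as delete + add unless the tool hashes sub-units too.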
Thanks a lot, I will test it out as you said. In the meantime, could you also open an issue on the repo, so that it helps me and others track it?
One thing that seems under-discussed in this space is the shift from verifying programs to verifying generation processes.
If a piece of code is produced by an agent loop (prompt -> tool calls -> edits -> tests), the real artifact isn’t just the final code but the trace/pipeline that produced it.
In that sense verification might look closer to: checking constraints on the generator (tests/specs/contracts), verifying the toolchain used by the agent, and replaying generation under controlled inputs.
That feels closer to build reproducibility or supply-chain verification than traditional program proofs.
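A toy illustration of "the trace is the artifact": commit each step of the generation process into a hash chain, so a verifier can replay the steps and check the final digest. The step shapes here are made up for the example; no real agent framework is assumed.

```python
import hashlib, json

# Each step (prompt, tool call, edit, test run) commits to everything
# before it, so the final digest is tamper-evident: change, drop, or
# reorder any step and the digest changes.
def chain(steps):
    digest = b""
    for step in steps:
        record = json.dumps(step, sort_keys=True).encode()
        digest = hashlib.sha256(digest + record).digest()
    return digest.hex()[:16]

trace = [
    {"kind": "prompt", "text": "fix the failing test"},
    {"kind": "tool", "name": "pytest", "exit": 1},
    {"kind": "edit", "file": "app.py", "patch_sha": "ab12"},
    {"kind": "tool", "name": "pytest", "exit": 0},
]
assert chain(trace) == chain(list(trace))  # deterministic
assert chain(trace) != chain(trace[:-1])   # dropping a step is detectable
```

That's exactly the shape of a supply-chain attestation: the verifier doesn't re-prove the code, it checks that the recorded process was followed.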
With packages like this (lots of cores, multi-chip packaging, lots of memory channels), the architecture is increasingly a small cluster on a package rather than a monolithic CPU.
I wonder whether the next bottleneck becomes software scheduling rather than silicon - OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA zone topology. The default design of our software would otherwise effectively route all memory accesses to the same NUMA zone, and performance went down the drain.
Modern AMD processors are basically a bunch of smaller processors (chiplets) glued together with an interconnect. So yes, single-chip nodes can have many NUMA zones.
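For anyone hitting the same thing, a sketch of that kind of fix in Python (Linux-specific sysfs paths, error handling elided): discover the NUMA nodes and round-robin workers across them by CPU affinity, so their allocations don't all land on node 0.

```python
import os

def parse_cpulist(text):
    # sysfs cpulist format, e.g. "0-3,8-11" -> [0,1,2,3,8,9,10,11]
    cpus = []
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def node_cpus():
    # Map NUMA node id -> list of CPU ids, straight from sysfs
    base = "/sys/devices/system/node"
    nodes = {}
    for entry in sorted(os.listdir(base)):
        if entry.startswith("node") and entry[4:].isdigit():
            with open(f"{base}/{entry}/cpulist") as f:
                nodes[int(entry[4:])] = parse_cpulist(f.read())
    return nodes

def pin_worker(worker_idx, nodes):
    # Round-robin workers over nodes; pin the calling process
    node = sorted(nodes)[worker_idx % len(nodes)]
    os.sched_setaffinity(0, nodes[node])
    return node

assert parse_cpulist("0-3,8-11") == [0, 1, 2, 3, 8, 9, 10, 11]
```

Note affinity alone only steers execution; memory then stays local via the kernel's first-touch allocation policy. A fuller fix binds memory explicitly too, e.g. `numactl --membind` or libnuma.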
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.
The kernel tries to guess as well as it can, though - many years ago I hit a fun bug in the kernel scheduler triggered by NUMA process migration, i.e. the kernel moving processes to the core closest to their RAM. It happened that in some cases the migrated processes never got scheduled again and were stuck forever.
Disabling NUMA migration removed the problem. I figured out the issue thanks to the excellent ‘A Decade of Wasted Cores’ paper, which essentially said that on ‘big’ machines like ours funky things could happen scheduling-wise, so I started looking at scheduling settings.
The main NUMA-pinning performance issue I was describing was different though, and like you said came from us needing to change the way the code was written to account for the distance to the RAM. Modern servers will usually let you choose anything from fully managed (hope and pray, single zone) to many zones, and then, depending on what you've chosen to expose, use it in your code. As always: benchmark, benchmark, benchmark.
Guessing this is especially hard to automate with peripherals involved. I once had a workload slow severely because it was running on the NUMA node that didn't share memory with the NIC.
Isn't high-grade SSD storage pretty much a memory layer as well these days, since the difference in access time and throughput is no longer several orders of magnitude but only one or two (compared to the last layer of memory)?
Optane was supposed to fill the gap but Intel never found a market for this.
Flash is still extremely slow compared to RAM, including modern flash - especially in a world where RAM is already very slow and your CPU already spends its time waiting for it.
That being said, you should consider RAM/flash/spinning disks all part of one storage hierarchy with different constants and tradeoffs (volatile or not, big or small, fast or slow, etc.), and knowing these tradeoffs will help you design simpler and better systems.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.
No they don't. They are horribly wasteful and inefficient compared to kernel TCP. Also, they are useless because they sit on top of a kernel network interface anyway.
Unless you're doing specific tricks to minimize latency (HFT, I guess?) then there is no point.
Do the partitioned stacks of network namespaces share a single underlying global stack or are they fully independent instances? (And if not, could they be made so?)
I think you could get much of the way there by isolating a single NIC's receive queues, so the kernel doesn't decide to run off and service softirqs for random foreign tasks just because your task called tcp_sendmsg.
I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related - thermal, power, memory, and other I/O. Because of this, you presumably never get true "288 core" performance out of this - as in, it's not going to mine Bitcoin 288× as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.
Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.
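The quadratic trap is easy to make concrete: any all-to-all exchange between workers grows as n² in the core count, while a tree-shaped reduce/broadcast stays linear. This is just the combinatorics, nothing machine-specific:

```python
# A design that does an all-to-all step (every worker exchanging
# state with every other worker) felt fine at 8 cores but explodes
# at 288, whereas a reduce-then-broadcast tree stays linear.
def all_to_all_msgs(cores):
    return cores * (cores - 1)   # directed pairs: n * (n - 1)

def tree_msgs(cores):
    return 2 * (cores - 1)       # one reduce pass + one broadcast pass

print(all_to_all_msgs(8), tree_msgs(8))      # 56 14
print(all_to_all_msgs(288), tree_msgs(288))  # 82656 574
```

Same story for a single hot lock: the serialized critical section is a constant, so by Amdahl's law its share of total time grows as everything else spreads across more cores.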
You're responding out of context. The parent was asking if there are bottlenecks specifically related to scheduling. I explicitly made the point that if there are bottlenecks, they're more likely related to memory.
afaik the mainline limit is 4096 threads. HP sells servers with 32 sockets × 60 cores/socket × 2 threads/core = 3840 threads, so we are pretty close to that limit.
How the heck does the OS see it as a single system, is there some pcie or rdma black magic that allows the kernel to just address memory in a different chassis? Maybe CXL?
No it's actual hardware coherent memory across the system. At a high level it is the same way two cores/caches are connected within one chip, or the same way two sockets are connected on the same board. Just using cables instead of wires in the chip or on a board.
This system has SMP ASICs on the motherboards that talk to a couple of Intel processor sockets using their coherency protocol over QPI, and they basically present themselves as a coherency agent and memory provider (similarly to the way that processors themselves have caches and DDR controllers). The Intel CPUs basically talk to them the same way they would another processor. But out the other side these ASICs connect to a bunch of others all doing the same thing, and they use their own coherency protocol among themselves.
So it's not CXL, instead it's proprietary ASICs masquerading as NUMA nodes but actually forwarding to their counterparts in the other chassis? Are they proprietary to HP or is this some new standard?
It's not cheating or a cluster based system. All the biggest high end servers use multiple externally cabled systems (chassis, sled, drawer). The biggest ones even span multiple racks (aka frames). These days it is HP and IBM remaining in the game.
These all have real hardware coherency going over the external cables, same protocol. Here is a Power10 server picture, https://www.engineering.com/ibm-introduces-power-e1080-serve... the cables attach right to headers brought out of the chip package right off the phy, there's no ->PCI->ethernet-> or anything like that.
These HP systems are similar. These are actually descendants of SGI Altix / SGI Origin systems which HP acquired, and they still use some of the same terminology (NUMAlink for the interconnect fabric). HP did make their own distinct line of big iron systems when they had PA-RISC and later Itanium but ended up acquiring and going with SGI's technology for whatever reasons.
These HP/SGI systems are slightly different from IBM mini/mainframes because they use "commodity" CPUs from Intel that don't support glueless multi socket that large or have signaling that can get across boards, so these have their own chipset that has some special coherency directories and a bunch of NUMAlink PHYs.
SGI systems came from HPC so they were actually much bigger before that, the biggest ones were something around 1024 sockets, back when you only had 1 CPU per socket. The interconnect topology used to be some tree thing that had like 10 hops between the farthest nodes. It did run Linux and wasn't technically cheating, but you really had to program it like a cluster because resource contention would quickly kill you if there was much cacheline transfer between nodes. Quite amazing machines, but not suitable for "enterprise" so IIRC they have cut it down and gone with all-to-all interconnect. It would be interesting to know what they did with coherency protocol, the SGI systems used a full directory scheme which is simple and great at scaling to huge sizes but not the best for performance. IBM systems use extremely complex broadcast source snooping designs (highly scoped and filtered) to avoid full directory overhead. Would be interesting to know if HPE finally went that way with NUMAlink too.
Cheating IMO would be an actual cluster of systems using software (firmware/hypervisor) to present a single system image, using the MMU and IB/Ethernet adapters to provide coherency.
Sounds like a HPE Compute Scale-up Server 3200, but again keep in mind that's something where there's probably a fabric between nodes one way or another.
> OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
I mean....
IMO Erlang/Elixir is a not-terrible benchmark for how things should work at that scale... Hell, while not a runtime, I'd argue Akka/Pekko on the JVM and Akka.NET on the .NET side would be able to do some good with it... [0] Similarly for Go and channels (at least hypothetically...)
[0] - Of course, you can write good scaling code on JVM or CLR without these, but they at least give some decent guardrails for getting a good bit of the Erlang 'progress guaranteed' sauce.
> I wonder whether the next bottleneck becomes software scheduling rather than silicon
Yep, scheduling has been a problem for a while. There was an amazing article a few years ago about how the Linux kernel was accidentally hardcoded to 8 cores; you can probably google and find it.
IMO the most interesting problem right now is the cache: you get cache misses every time a task moves cores. The problem is that with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent thrashing and reloading the CPU cache.
That's the one. Funny thing, it's not actually clickbait.
The bug made it to the kernel mailing list, where some Intel people looked into it and confirmed it. The kernel's allocation logic was capped at 8 cores, which leaves a few percent of performance on the table as the number of cores increases and the allocation becomes less and less optimal.
It's a classic tragedy of the commons. CPUs have gotten so complicated that there may only be a handful of people in the world who could comprehend and work on a bug like this.
I noticed in the README that each commit message includes the agent and model, which is a nice start toward reproducibility.
I’m wondering how deep you plan to go on environment pinning beyond that. Is the system prompt / agent configuration versioned? Do you record tool versions or surrounding runtime context?
My mental model is that reproducible intent requires capturing the full "execution envelope", not just the human prompt + model & agent names. Otherwise it becomes more of an audit trail (which is also a good feature) than something you can deterministically re-run.
That’s fair - strict determinism isn’t possible in the traditional sense. I was thinking more along the lines of bounded reproducibility.
If the model, parameters, system prompt, and toolchain are pinned, you might not get identical output, but you can constrain the space of possible diffs.
It reminds me a bit of how StrongDM talks about reproducibility in their “Digital Twin” concept - not bit-for-bit replay, but reproducing the same observable behavior.
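Concretely, a pinned "execution envelope" could be as simple as a canonicalized, hashed manifest. All the field names below are hypothetical, not from any particular tool:

```python
import hashlib, json

# Hash a canonical JSON encoding of everything that shaped the run,
# so two runs can be compared (or diffed) even when outputs differ.
def envelope_digest(env: dict) -> str:
    canonical = json.dumps(env, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:16]

env = {
    "model": "example-model-v1",                      # hypothetical
    "params": {"temperature": 0.2, "seed": 7},
    "system_prompt_sha256": hashlib.sha256(b"example prompt").hexdigest(),
    "agent": {"name": "example-agent", "version": "0.3.1"},  # hypothetical
    "tools": {"pytest": "8.0.0", "git": "2.43.0"},    # pinned versions
}

assert envelope_digest(env) == envelope_digest(dict(env))  # stable
changed = {**env, "params": {"temperature": 0.7, "seed": 7}}
assert envelope_digest(changed) != envelope_digest(env)    # param drift shows up
```

Stamping that digest into each commit message alongside the agent/model name would turn the audit trail into something you can at least re-run under the same envelope.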
Even a technically superior format struggles without that ecosystem.