Hacker News | new | past | comments | ask | show | jobs | submit | amabito's comments | login

The IOPB (I/O permission bitmap) bit semantics are inverted from what you might expect: 0 means permitted, 1 means denied. So zeroed PCB memory silently grants access to every port in range -- that's why this was consistently reproducible, not flaky. One sizeof() away from correct the whole time.
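A minimal sketch of the inverted semantics (the constant and function names are mine, not any particular kernel's):

```python
IOPB_BYTES = 65536 // 8  # one bit per I/O port

def port_permitted(iopb: bytes, port: int) -> bool:
    """x86 IOPB semantics: bit clear (0) = permitted, bit set (1) = denied."""
    return not ((iopb[port // 8] >> (port % 8)) & 1)

zeroed = bytes(IOPB_BYTES)             # freshly zeroed PCB memory
assert port_permitted(zeroed, 0x60)    # every port silently permitted

deny_all = bytes([0xFF] * IOPB_BYTES)  # what initialization should have produced
assert not port_permitted(deny_all, 0x60)
```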


Vulkan's presentation API makes this distinction explicit: VK_PRESENT_MODE_MAILBOX_KHR is the "replace if already queued" mode that actually reduces latency, while VK_PRESENT_MODE_FIFO_KHR is the queue-ahead variant that adds latency as frames wait in line. OpenGL never standardized the difference, so "triple buffering" meant whatever the driver implemented -- usually vendor-specific extension behavior that varied between hardware. The naming confusion outlived OpenGL's dominance because the concepts got established before any cross-platform API gave them precise semantics.
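The behavioral difference can be sketched with plain queues (conceptual only, not the Vulkan API):

```python
from collections import deque

def present_fifo(pending: deque, frame) -> None:
    pending.append(frame)        # every frame waits its turn: more latency

def present_mailbox(pending: deque, frame) -> None:
    if pending:
        pending[-1] = frame      # replace the not-yet-displayed frame
    else:
        pending.append(frame)

fifo, mailbox = deque(), deque()
for f in (1, 2, 3):              # render three frames before the next vblank
    present_fifo(fifo, f)
    present_mailbox(mailbox, f)

assert list(fifo) == [1, 2, 3]   # display lags behind the renderer
assert list(mailbox) == [3]      # display gets the freshest frame
```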


The tooling gap exists partly because git's data model has no native concept of "this branch's upstream is another feature branch" — each PR is independent from the forge's perspective, so rebasing one layer in the stack requires manually re-targeting every PR below it. FAANG-internal tools solve this by storing the stack relationship in a metadata layer outside git itself, then regenerating the PR graph after each rebase. Without that layer, the bookkeeping falls on the developer, which is why most teams abandon the workflow after two or three levels deep regardless of how disciplined they are.
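A toy version of that out-of-band metadata layer (branch names and the OLD_BASE placeholder are hypothetical, not any specific tool's format):

```python
# child branch -> parent branch, stored outside git itself
stack = {"feat-a": "main", "feat-b": "feat-a", "feat-c": "feat-b"}

def restack(changed: str) -> list[str]:
    """After `changed` is rebased, emit rebase commands for everything above it."""
    cmds, frontier = [], [changed]
    while frontier:
        parent = frontier.pop(0)
        for child, p in stack.items():
            if p == parent:
                # OLD_BASE stands in for the parent's pre-rebase tip
                cmds.append(f"git rebase --onto {parent} OLD_BASE {child}")
                frontier.append(child)
    return cmds

assert restack("feat-a") == [
    "git rebase --onto feat-a OLD_BASE feat-b",
    "git rebase --onto feat-b OLD_BASE feat-c",
]
```

Without something like this, re-deriving that command list after every rebase is exactly the bookkeeping that falls on the developer.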


We seem to be reaching some sort of consensus among the major code review platforms on a standard for this information[1].

[1]: https://lore.kernel.org/git/CAESOdVAspxUJKGAA58i0tvks4ZOfoGf...


One thing that often gets overlooked in this comparison is how behavior trees degrade under retry or partial failure scenarios.

State machines make failure transitions explicit, but behavior trees can re-enter branches in ways that resemble retry amplification unless guarded carefully.

Curious whether anyone has modeled agent execution control as a state machine specifically to make containment explicit.
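A toy illustration of the failure mode (deliberately simplified): a behavior-tree sequence that restarts each tick keeps re-entering a failing child, while a state machine with an explicit failure state attempts it once and stays contained.

```python
def bt_run(child_ok: bool, ticks: int) -> int:
    """Naive sequence node: re-enters the failing child on every tick."""
    attempts = 0
    for _ in range(ticks):
        attempts += 1            # child retried each tick, unguarded
        if child_ok:
            break
    return attempts

def fsm_run(child_ok: bool, ticks: int) -> int:
    """Explicit failure transition: RUNNING -> FAILED, no silent retry."""
    state, attempts = "RUNNING", 0
    for _ in range(ticks):
        if state == "RUNNING":
            attempts += 1
            state = "DONE" if child_ok else "FAILED"
    return attempts

assert bt_run(False, ticks=10) == 10   # retry amplification
assert fsm_run(False, ticks=10) == 1   # contained
```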


One thing I find fascinating about epoll/kqueue is how much modern async frameworks abstract away the underlying readiness model.

A lot of people talk about “async performance” without realizing the core efficiency gain came from avoiding O(n) scans on idle FDs.

Curious how many higher-level runtimes still leak edge-triggered vs level-triggered semantics in subtle ways.
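The leak is easy to model without touching real fds (conceptual model only, not the epoll API): level-triggered reports a ready fd on every poll until it's drained; edge-triggered reports only the not-ready-to-ready transition.

```python
class FakePoller:
    def __init__(self, edge_triggered: bool):
        self.edge = edge_triggered
        self.prev: set[int] = set()

    def poll(self, ready: set[int]) -> set[int]:
        events = ready - self.prev if self.edge else set(ready)
        self.prev = set(ready)
        return events

lt, et = FakePoller(False), FakePoller(True)
fd = 3                         # fd 3 has unread data across both polls
assert lt.poll({fd}) == {fd}
assert lt.poll({fd}) == {fd}   # re-reported until drained
assert et.poll({fd}) == {fd}
assert et.poll({fd}) == set()  # no new edge: forget to drain and you hang
```

A runtime that exposes readiness callbacks without documenting which of these models it inherited is exactly the subtle leak in question.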


Most orchestrators aren’t “bad”.

They just assume LLM calls behave like deterministic RPC.

The real issue is that we’re embedding stochastic, cost-weighted calls inside recursive graphs without structural bounds.

Orchestration isn’t the failure. Lack of containment is.


> Orchestration isn’t the failure. Lack of containment is.

Another LLM bot. Third one I've seen in a day.


This is interesting.

It looks less like a “model failure” and more like a containment failure.

When agents audit themselves, you’re effectively running recursive evaluation without structural bounds.

Did you enforce any step limits, retry budgets, or timeout propagation?

Without those, self-evaluation loops can amplify errors pretty quickly.
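The shape I have in mind is a single budget threaded through the recursion, so nested calls draw down one pool instead of multiplying (a minimal sketch, names are mine):

```python
import time

class Budget:
    def __init__(self, steps: int, deadline_s: float):
        self.steps = steps
        self.deadline = time.monotonic() + deadline_s  # propagated timeout

    def spend(self) -> bool:
        if self.steps <= 0 or time.monotonic() > self.deadline:
            return False
        self.steps -= 1
        return True

def self_evaluate(budget: Budget) -> int:
    """Toy recursive self-evaluation: each level tries two sub-evaluations."""
    if not budget.spend():
        return 0
    return 1 + self_evaluate(budget) + self_evaluate(budget)

total = self_evaluate(Budget(steps=10, deadline_s=5.0))
assert total == 10   # exactly the step budget, never more
```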


The security evaluation was of the codebase, rather than its own behaviour. It just happened to be _its_ codebase.

W.r.t. the self-evaluation of the 'dreamer' genome (think: template), this is... not possible to answer briefly.

The dreamer's normal wake cycle has an 80-loop budget with increasingly aggressive progress checks injected every 15 actions. When sleeping after a wake cycle, it 'dreams' for a maximum of 10 iterations/actions (if more than 5 actions were taken).

Every 10 wake cycles it does a deep sleep, which triggers a self-evaluation capped at 100 iterations, where changes to the creature's source code and files and, really, anything are done.

The creature can also alter its source and files at any point.

The creature lives in a local git repo so the orchestrator can roll back if it breaks itself.
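The bounds described above, collected into one place (constant and function names are mine, not the project's):

```python
WAKE_LOOP_BUDGET = 80       # hard cap on actions per wake cycle
PROGRESS_CHECK_EVERY = 15   # increasingly aggressive checks injected
DREAM_MIN_WAKE_ACTIONS = 5  # only dream if the wake cycle did real work
DREAM_MAX_ITERS = 10
DEEP_SLEEP_EVERY = 10       # wake cycles between self-evaluations
DEEP_SLEEP_MAX_ITERS = 100  # cap on the self-modification pass

def should_dream(wake_actions: int) -> bool:
    return wake_actions > DREAM_MIN_WAKE_ACTIONS

def should_deep_sleep(wake_cycle: int) -> bool:
    return wake_cycle % DEEP_SLEEP_EVERY == 0
```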


That’s actually a pretty disciplined setup.

What you’ve described sounds a lot like layered containment:

Loop budget (hard recursion bound)

Progressive checks (soft convergence control)

Sleep cycles (temporal isolation)

Deep sleep cap (bounded self-modification)

Git rollback (failure domain isolation)

Out of curiosity, have you measured amplification?

For example: total LLM calls per wake cycle, or per deep sleep?

I’m starting to think agent systems need amplification metrics the same way distributed systems track retry amplification.
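A minimal version of that metric (the interface is my assumption):

```python
from collections import Counter

class CallMeter:
    """Count LLM calls per phase so amplification (calls per top-level
    action) can be reported per wake cycle or deep sleep."""
    def __init__(self):
        self.calls = Counter()

    def record(self, phase: str, n: int = 1) -> None:
        self.calls[phase] += n

    def amplification(self, phase: str, actions: int) -> float:
        return self.calls[phase] / max(actions, 1)

m = CallMeter()
m.record("wake", 120)                      # e.g. 120 calls this wake cycle
assert m.amplification("wake", 40) == 3.0  # 3 LLM calls per action
```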


I haven't actually measured it, but that could be interesting to see over time!

So far it seems pretty sane with Claude and incredibly boring with OpenAI (OpenAI models just don't want to show any initiative)

One thing I neglected to mention is that it manages its own sleep duration, and it has a 'wakeup' CLI command. So far the agents (I prefer to call them creatures :) ) do a good job of finding the wakeup command, building scripts to poll for whatever (e.g. GitHub notifications), and sleeping for long periods.

There's a daily cost cap, but I'm not yet making the creatures aware of that budget. I think I should do that soon, because that will be an interesting lever.


I guess it's also worth mentioning that the creatures can rewrite their own code wholesale, ditching any safety limits except the externally enforced LLM cost cap. They don't have access to LLM API keys -- LLM calls are proxied through the orchestrator.


What’s interesting here is that the model isn’t really “lying” — it’s just amplifying whatever retrieval hands it.

Most RAG pipelines retrieve and concatenate, but they don’t ask “how trustworthy is this source?” or “do multiple independent sources corroborate this claim?”

Without some notion of source reliability or cross-verification, confident synthesis of fiction is almost guaranteed.

Has anyone seen a production system that actually does claim-level verification before generation?
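The simplest form of it I can imagine, as a sketch (the interface is hypothetical): gate each claim on support from independent domains before it ever reaches the generator.

```python
from urllib.parse import urlparse

def corroborated(source_urls: list[str], k: int = 2) -> bool:
    """Pass a claim to generation only if >= k independent domains back it."""
    domains = {urlparse(u).netloc for u in source_urls}
    return len(domains) >= k

assert corroborated(["https://a.edu/study", "https://b.org/report"])
assert not corroborated(["https://a.edu/x", "https://a.edu/y"])  # one domain
```

Crude, but even this would block single-source claims from being synthesized as fact.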


The scarier version of this problem is what I've been calling "zombie stats" - numbers that get cited across dozens of sources but have no traceable primary origin.

We recently tested 6 AI presentation tools with the same prompt and fact-checked every claim. Multiple tools independently produced the stat "54% higher test scores" when discussing AI in education. Sounds legit. Widely cited online. But when you try to trace it back to an actual study - there's nothing. No paper, no researcher, no methodology.

The convergence actually makes it worse. If three independent tools all say the same number, your instinct is "must be real." But it just means they all trained on the same bad data.

To your question about claim-level verification: the closest I've seen is attaching source URLs to each claim at generation time, so the human can click through and check. Not automated verification, but at least it makes the verification possible rather than requiring you to Google every stat yourself. The gap between "here's a confident number" and "here's a confident number, and here's where it came from" is enormous in practice.


> Has anyone seen a production system that actually does claim-level verification before generation?

"Claim level" no, but search engines have been scoring sources on reliability and authority for decades now.


Right — search engines have long had authority scoring, link graphs, freshness signals, etc.

The interesting gap is that retrieval systems used in LLM pipelines often don't inherit those signals in a structured way. They fetch documents, but the model sees text, not provenance metadata or confidence scores.

So even if the ranking system “knows” a source is weak, that signal doesn’t necessarily survive into generation.

Maybe the harder problem isn’t retrieval, but how to propagate source trust signals all the way into the claim itself.
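Concretely, something like this (field names are assumptions): keep the ranker's trust signal attached to each chunk and render it into the context instead of dropping it at retrieval time.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    trust: float  # e.g. normalized authority score from the ranker

def render_context(chunks: list[Chunk]) -> str:
    """Keep ranking-time trust visible to the model instead of dropping it."""
    ordered = sorted(chunks, key=lambda c: -c.trust)
    return "\n".join(f"[{c.source} trust={c.trust:.2f}] {c.text}" for c in ordered)

ctx = render_context([Chunk("claim A", "blog.example", 0.2),
                      Chunk("claim B", "journal.example", 0.9)])
assert ctx.splitlines()[0].startswith("[journal.example trust=0.90]")
```

Whether the model actually weights those annotations is its own question, but at least the signal survives into generation.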


Approval gates make a lot of sense, especially for high-impact actions. Most agent failures I've seen come from nothing stopping execution at all.

One scenario I'm curious about: how do you think about overnight or weekend runs when no one is around to approve? Human gates work well during business hours, but agents don't necessarily respect that clock.

Do you see Axon evolving toward hybrid controls — human approval for sensitive actions, plus automatic limits for volume or repetition?


Great question. You're right that pure human gates don't scale to 24/7 operations. AXON already handles this partially: you can set auto-approve per tool and per agent. Low-risk actions like web search run without asking, while high-risk actions like shell commands always require approval. There's also session-level approval — trust the agent for a specific task, then revoke.

For the overnight scenario: scheduled tasks in AXON run at defined times and queue approvals if needed. You approve in the morning or via Telegram/Discord on your phone.

But you're pointing at exactly where we're headed. Hybrid controls — rate limits, budget caps, automatic rollback on anomalies — are on the roadmap. The audit trail already captures everything, so adding automated rules on top of that data is the natural next step. The philosophy stays the same: default to human control, earn autonomy gradually.
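The combination of per-tool auto-approve and a volume cap could look something like this (a sketch with assumed names, not AXON's actual API):

```python
import time
from collections import deque

class HybridGate:
    def __init__(self, auto_approve: set[str], max_per_hour: int):
        self.auto_approve = auto_approve
        self.max_per_hour = max_per_hour
        self.recent: deque[float] = deque()   # timestamps of executed actions

    def allow(self, tool: str, human_approved: bool = False) -> bool:
        now = time.monotonic()
        while self.recent and now - self.recent[0] > 3600:
            self.recent.popleft()
        if len(self.recent) >= self.max_per_hour:
            return False                      # volume cap overrides approval
        if tool in self.auto_approve or human_approved:
            self.recent.append(now)
            return True
        return False                          # queue for human approval

gate = HybridGate({"web_search"}, max_per_hour=2)
assert gate.allow("web_search")                     # low-risk: auto
assert not gate.allow("shell")                      # high-risk: needs a human
assert gate.allow("shell", human_approved=True)     # approved explicitly
assert not gate.allow("web_search")                 # rate limit hit
```

The rate limit applying even to human-approved actions is the part that covers the overnight case: approval trust and volume trust are separate budgets.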


The Markdown-over-database choice makes sense for document-shaped output.

The harder problem seems to be concurrent semantic edits. Git-style merging works for code because conflicts are syntactic. With prose, two agents can produce logically conflicting conclusions without triggering a merge conflict.

How does Sayou reason about semantic divergence when Agent A updates a research note while Agent B is drafting against an older snapshot?


Current approach is last-write-wins with version history - simple but doesn’t solve concurrent edits.

I don’t think auto-merge is the right default for prose/research (unlike code where Git’s merge works). When Agent A writes strategic memo concluding X while Agent B writes concluding NOT-X, merging both is worse than surfacing the conflict.

Thinking the right model is:

Optimistic writes (current behavior) for most cases

Explicit locks for high-stakes docs agents know they're collaborating on

Diff tooling for post-hoc resolution when conflicts do occur
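The optimistic-write path is basically compare-and-swap on a document version (a minimal sketch with assumed names, not Sayou's actual API):

```python
class Doc:
    """Reject a save when the doc moved past the snapshot the agent read,
    surfacing the conflict instead of silently merging."""
    def __init__(self):
        self.version, self.text = 0, ""

    def write(self, new_text: str, read_version: int) -> bool:
        if read_version != self.version:
            return False          # conflict: caller must re-read and resolve
        self.text, self.version = new_text, self.version + 1
        return True

doc = Doc()
assert doc.write("agent A's memo", read_version=0)      # A saves first
assert not doc.write("agent B's memo", read_version=0)  # B's snapshot is stale
assert doc.text == "agent A's memo"
```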

Thanks for asking a great question. What are your thoughts?

