> are they fairly uniformly similar with gains in one or a few areas, or is it noisier with a lower overall loss?
It seems like you want to know what median, 5-95 or 1-99 differences might be? I also wonder how the "residual" plot looks like... If there are too many residual data points for a scatter plot then a histogram might be useful to visualize the modes. I suspect that as loss decreases multiple modes should condense or altogether collapse into one.
Many times there is really no way of getting around some of the expert-human judgement complexity of the larger question of "How to get agents to build reliably".
One example I have been experimenting is using Learning Tests[1]. The idea is that when something new is introduced in the system the Agent must execute a high value test to teach itself how to use this piece of code. Because these should be high leverage i.e. they can really help any one understand the code base better, they should be exceptionally well chosen for AIs to use to iterate. But again this is just the expert-human judgement complexity shifted to identifying these for AI to learn from. In code bases that code Millions of LoC in new features in days, this would require careful work by the human.
I tried looking and couldn't find a proper price per token for the chat model. It claims to be free in some places. I did find these prices for the other services:
Text to Speech (Bulbul v3): ₹30 per 10K characters
Text to Speech (Bulbul v2): ₹15 per 10K characters
Sarvam Vision: Free per page
Speech to Text: ₹30 per hour
Speech to Text with Diarization: ₹45 per hour
Speech to Text & Translate: ₹30 per hour
Speech to Text, Translate & Diarization: ₹45 per hour
Sarvam Translate V1: ₹20 per 10K characters
Translate Mayura V1: ₹20 per 10K characters
Transliterate: ₹20 per 10K characters
Language Identification: ₹3.5 per 10K characters
One set of applications to build with subscription is to use the claude-go binary directly. Humanlayer/Codelayer projects on GitHub do this. Granted those are not ideal for building a subscription based business to use oathu tokens from Claude and OpenaAI. But you can build a business by building a development env and gating other features behind paywall or just offering enterprise service for certain features like vertical AI(redpanada) offerings knowledge workers, voice based interaction(there was a YC startup here the other day doing this I think), structured outputs and workflows. There is lots to build on.
I have my homenas set up with Node Proxy Manager container forwarding requests to different docker machines:ports e.g. I have some TTS/STT/LLM services locally hosted. To increase bandwidth to internet facing nodes, would you use this or some other simpler solution?
I assume so; I use the same thing with my Unraid box and then create the DNS entries in the unifi panel so I get jellyfin.lan, minecraft.lan, etc inside the house.
There could be another model in the future, one where many more independent people might support self maintained software by non saas companies
e.g. If the supply of labor learning to build software increases and it becomes very close to what are now vocation training, then you can just hire a guy — like you would a consultant — who can quickly get spun up and make fixes. I would think one of the few things preventing this kind of socio economic set up are saas jobs that are siloed off by interview "walls" to most people from entering. Make it like a vocation, like plumbing or electrician, with lots of non saas companies supporting the market and suddenly it will be the death of saas.
The incentives for this future are closer than they were in 2022-23.
I think a few things explain these kinds of projects
1. There are a lot of Agentic Data Plane startups for knowledge workers(not really for coders[1] but for CFOs, Analysts etc) going up. e.g https://www.redpanda.com/ For people to ask "Hey give me a breakdown of last year's sales target by region, type and compare 2026 to 2025 for Q1".
Now this can be done entirely on intranet and only on certain permissioned data servers — by agents or humans — but as someone pointed out the intranet can also be a dangerous place. So I guess this is about protecting DB tables and Jiras and documentation you are not allowed to see.??
2. People who have skills — like the one OP has with wasm (I guess?) — are building random infra projects for enabling this.
3. All the coding people are getting weirded out by its security model because it is ofc not built for them.
[1] As I have commented elsewhere on this thread the moment a coder does webfetch + codeexec its game over from security perspective. Prove me wrong on that please.
Wait. I don't understand the threat vector modelled here. Any agent or two isolated ones that the do Webfetch and code exec, even in separate sandboxes, is pretty much game over as far as defending against threat vectors goes. What am I missing here?
Well, if wasm process is limited on the syscalls it can make, the blast radius is limited. For example you can block network access, and disk access for tools that don't need those capabilities.
That being said, this doesn't sound like they're really thinking through the risks.
> Dynamic Tool Building - Describe what you need, and IronClaw builds it as a WASM tool
If the agent can write it's own insecure plugins, and the wasm processes isn't properly isolated, you've really gained nothing.
even if it is isolated, like no network or host access. Like say the malicious prompt created a wasm tool that patched your project code to leak information like adding a logger.warning. but LOG_LEVEL was set to error or whatever that prevented this from surfacing during testing or dev/beta.
Again running on that was container that code does not reveal anything. But then another isolated wasm tool was responsible to build the binary and ship it to prod.
Shotgunned all over prod logs are spotted by a log watcher within minutes of deploy. Whew... right?
Congrats on the launch. I've been fooling around with using my pipecat MCP(https://github.com/pipecat-ai/pipecat-mcp-server) with WebRTC. The WebRTC is hooked into a Webapp interface and this allows me to "talk" to different containers(projects) on my truenas.
I have just a list of chat sessions on the web app on all my projects. The webapp is modified to launch claude code daemons (borrowed from humanlayer/codelayer) and exposes the outbound STT from the WebRTC into a chat session.
- MCP Auth is via auth0
- Webapp itself is gated by a Bearer token.
This itself gets me pretty far. I am not sure what more this is offering?
My TTS/STT models are local by Kyutai and the voice agent's LLM between STT and TTS is used to determine some basic context: e.g. what project directories, mcp servers to select and what skills to use for launching the daemons.
This sounds solid, similar stuff to what we do! Sounds like this setup gets you most of the way there. We also have a mobile app + notifications. And I haven't tried using a coding voice agent via MCP, I'll try that out soon!
Good to know its similar. Oh I actually do have a text box as well, but using it to type from the phone is not very convenient. Too much typing, I generally STT into the text box. I don't use it to code much, unless I have specced it out and I know the spec is good. But then to code it up is just a few mins, no?
I spend my time trying to tuning the voice+webapp experience: i.e. how it can explain things, can it surface thinking tokens from claude tools properly etc. The sweat, blood, voice go into `/create_research -> /create_plan` loop before the `/implement_plan`. Sometimes I copy the research and paste it into chatGPT for review or comments as well.
I generally use the MCP to get it to follow commands and explain things to me to make progress in this cycle, and I often pause it and ask for drawing me a mermaid a sequence diagram for events or a block diagram showing how pieces go together.
It seems like you want to know what median, 5-95 or 1-99 differences might be? I also wonder how the "residual" plot looks like... If there are too many residual data points for a scatter plot then a histogram might be useful to visualize the modes. I suspect that as loss decreases multiple modes should condense or altogether collapse into one.
reply