You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
Right, this works with any model. To me, the most interesting part is that you can use a smaller model that you could run locally to get results comparable to SoTA models. Ultimately, I'd far prefer running locally, even if slower, for the simple reason of having sovereignty over my data.
Being reliant on a service means you have to share whatever you're working on with it, and the service provider decides what you can do and can change their terms of service on a whim.
If locally running models can get to the point where they can be used as a daily driver, that solves the problem.
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote, which acts as a fingerprint. Two similar pieces of code produce similar fingerprints, and a well-written, confident solution produces a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
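The selection step could be sketched like this. Everything here is illustrative: the embedding is faked from the text and the `CostField` weights are random (the real one is a small trained network), but the pipeline has the same shape:

```python
import math
import random

def embed(code: str) -> list[float]:
    """Stand-in for the model's embedding of a candidate solution.
    A real setup would ask the generator model for an embedding;
    here we fake a vector that is deterministic within a process."""
    rng = random.Random(hash(code) % (2**32))
    return [rng.uniform(-1, 1) for _ in range(8)]

class CostField:
    """Minimal linear scorer standing in for the small trained network.
    Lower score means more likely correct. Weights here are random,
    not trained, so it is only a shape demo."""
    def __init__(self, dim: int = 8):
        rng = random.Random(0)
        self.w = [rng.uniform(-1, 1) for _ in range(dim)]

    def score(self, vec: list[float]) -> float:
        # sigmoid over a dot product: near 0 = likely correct,
        # near 1 = likely wrong
        z = sum(wi * vi for wi, vi in zip(self.w, vec))
        return 1.0 / (1.0 + math.exp(-z))

def pick_candidate(candidates: list[str], field: CostField) -> str:
    # score every fingerprint; only the lowest-cost one gets tested
    return min(candidates, key=lambda c: field.score(embed(c)))
```

The point is that scoring a fingerprint is a few thousand multiply-adds, while actually running a candidate through a sandboxed test suite takes seconds.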
I tried to read the project documentation, but I got overwhelmed: it's aimless, AI-generated, and has the nebulous goal of documenting absolutely everything while never explaining anything.
If the author actually wanted to explain his project, he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference-time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference-time learning to become viable, and that's exactly what ATLAS does to achieve a 74.6% pass rate on LiveCodeBench, thereby outperforming Claude Sonnet with a small 14B open-weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.
Another interesting approach could be to use this setup with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.
I'm super confused. The small model "cost field" `rag-api/geometric_lens/cost_field.py` was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA.".
So it seems like it's a difficulty classifier for task descriptions written in English.
This is then used to score embeddings of Python code, which is a completely different distribution.
Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.
But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
I think the goal is to have a light heuristic that helps find plausibly useful solutions. They're still going to go through a testing phase as a next step, so this is just a very simple filter to decide what's even worth testing.
> But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
It does, because hallucinations and low confidence share characteristics in the embedding vector which the small neural network learns to recognize. And the fact that it continuously learns based on the feedback loop is pretty slick.
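That feedback loop can be sketched as a single online update on the scorer after each real test run. This is a generic logistic-regression SGD step under my own assumptions, not ATLAS's actual update rule:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(weights: list[float], fingerprint: list[float]) -> float:
    # lower = more likely correct, mirroring the cost-field convention
    return sigmoid(sum(w * x for w, x in zip(weights, fingerprint)))

def online_update(weights: list[float], fingerprint: list[float],
                  passed: bool, lr: float = 0.1) -> list[float]:
    """One SGD step after observing a real test result.

    The cross-entropy gradient for a sigmoid output is simply
    (prediction - target), so a passing candidate (target 0) pushes
    similar fingerprints' scores down, and a failing one (target 1)
    pushes them up.
    """
    target = 0.0 if passed else 1.0
    err = score(weights, fingerprint) - target
    return [w - lr * err * x for w, x in zip(weights, fingerprint)]
```

Because the scorer is tiny, this update costs almost nothing per request, which is what makes learning at inference time viable in the first place.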
> it's not clear to me based on the description how this could all be done efficiently.
Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.
Right, clearly you can always find people to ship oil through the strait. So the whole notion that nobody will use it because it's dangerous is nonsense.
If you read what I said, it was that "most people won't do it", not that nobody will do it. From the point of view of worldwide oil supply, what most tanker captains do matters more than what a few exceptions do.
Also don't forget that Iran is far more technologically advanced than Iraq was. Iraqis had 70s tech, while Iran has stuff like hypersonic missiles that even the US can't produce right now.
I agree with everything you're saying here, and I'm not arguing the approach eliminates the need for the human to be in the loop. That's not the goal.
What I'm trying to do is to create hard context boundaries which help both the human and the agent understand what the code is doing. With Mycelium, you have a graph expressing the top level logic, so you have a declarative workflow that you can review. With traditional code this is mixed in together with the implementation details, and you have to tease out the business logic by reading through the code.
Mycelium creates a hard boundary between the implementation details, which live in the cells, and the high-level business logic of the application. You can set up a workflow manifest which declares what you're doing, then go and implement each step in the form of a cell. Each cell can be reviewed and tested in isolation without having to consider the entire application.
This is the part that reduces the cognitive load and makes it easier to ensure that the code is doing what's intended.
As a note, I'm not arguing that I invented anything fundamentally new here. Workflow engines have been around for a while. I'm simply applying the idea directly in the context of coding agents.
It's a similar problem in the human context, but I think the reason stuff like workflow engines hasn't caught on is that humans don't really like to work this way. Writing a conditional and calling a function keeps you in your flow, but having to jump between an orchestration layer and your code with implementation details breaks that. LLMs don't have this problem. In fact, they benefit from the additional information that's expressed in the graph layer.
That's mostly correct; one small correction is that cells don't have to be pure. They just have to focus on doing a single task with some hard boundaries.
And what I meant by the law of the graph was simply that the graph defines the actual business logic, and each cell is a context-free component that can be plugged into it. I guess I was just trying to be clever there.
The key benefit I'm finding is that cells can be reasoned about in isolation because they know nothing about one another. You don't get the implicit coupling, embedded in the call graph, that happens in normal programs.
My approach is to use inversion of control where the cell gets some context and resources like a db connection, does some work, and produces the result. That gets passed on to the graph layer which inspects the result, and decides what cell to call next.
With this approach you can develop and test these cells as if they were completely independent programs. The context stays bounded, and the agent doesn't need to know anything about the rest of the application when working on a cell.
The cells also become reusable: you can arrange them like Lego pieces and snap them together into different configurations as needed.
The whole point of the framework is to see what an LLM-oriented framework would look like. My argument is that the way code is normally structured is not conducive to LLMs, because context grows in an unbounded way and they end up getting lost.
The whole point of the 'slop' report is to have the LLM try implementing the features using both the traditional approach and the framework and then reflect on how it fared with each approach.
I'm talking about expressing the application as a state machine and then implementing each step in the state graph as an independent subprogram. The cells accept a state, do some work, and produce a new state. Then the graph orchestrator inspects the state and dispatches to the next appropriate cell.
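A minimal sketch of that shape in Python (the cell names and state dict are made up for illustration, not Mycelium's actual API):

```python
from typing import Callable

# Each cell is an independent subprogram: state in, new state out.
# Cells know nothing about one another; all routing lives in the graph.
def fetch(state: dict) -> dict:
    return {**state, "data": [1, 2, 3], "step": "validate"}

def validate(state: dict) -> dict:
    ok = all(isinstance(x, int) for x in state["data"])
    return {**state, "step": "store" if ok else "error"}

def store(state: dict) -> dict:
    return {**state, "saved": True, "step": "done"}

CELLS: dict[str, Callable[[dict], dict]] = {
    "fetch": fetch,
    "validate": validate,
    "store": store,
}

def run_graph(state: dict) -> dict:
    """Graph layer: inspect the state, dispatch to the next cell.
    This top-level loop is the declarative workflow you can review
    without reading any implementation details."""
    while state["step"] in CELLS:
        state = CELLS[state["step"]](state)
    return state
```

Each cell here can be unit-tested on its own by handing it a state dict, which is the bounded-context property being described.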