
Where do you see that? I only skimmed the prompts but don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.

The dataset miscomparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset. Especially considering they've had only one day in which to overfit. Could be quite subtle leakage, though.

...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".

Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).

But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):

> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration

but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.


I’m not saying it invalidates the result. I am saying that they knew the headline and comparison were not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly VC dollars.

> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.

Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.


It seems they don't test for that, since they use the second-best human solution as a baseline.

And that's the right way to go. When computers were about to become superhuman at chess, few people cared that it could beat random people for many years prior to that. They cared when Kasparov was dethroned.

Remember, the point here is marketing as well as science. And the results speak for themselves. After all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.


> The only reason you remember is because it beat Kasparov

There is an additional fascinating aspect to these matches, in that Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book.

It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent, in the way he did various human grandmasters at the time.


This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for humans, not top humans.

> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.

Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.


> Aren't they losing money on the retail API pricing, too?

No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else's.

For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" (meaning roughly 84% of revenue is profit), see my comment https://news.ycombinator.com/item?id=46663852
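For what it's worth, converting a cost-based margin into a share of revenue is one line of arithmetic; a quick sketch (the 545% figure is DeepSeek's, the conversion is my own):

```python
# A "cost profit margin" expresses profit as a multiple of cost:
#   profit = margin * cost, so revenue = cost + profit = (1 + margin) * cost.
cost_profit_margin = 5.45  # DeepSeek's reported 545%
profit_share_of_revenue = cost_profit_margin / (1 + cost_profit_margin)
print(f"{profit_share_of_revenue:.1%}")  # prints 84.5%
```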


Weird that they're all looking for outside money then


They're all looking for outside money because their competitors are all looking for outside money, and so they need to keep up with their competitors' investments in training. It's a game of chicken. Once their ability to raise more abates, they'll slow down new training runs and fund them out of inference margins instead, but the first one forced to do so risks losing market share.


Inference is profitable. No one is selling at a loss. It’s training to keep up with competitors that is causing losses.


> Inference is profitable

Eh. We don't really know that, and the people saying that have an interest in the rest of the world believing it's true.


How are we so sure that deep inside the moon isn't made out of cheese?


I remember Enron. Hell, I remember the S&Ls. I've seen this movie too many times to not know how it ends.


I remember Google, Meta, Apple, Eli Lilly, and other companies with meteoric rises.


Ed Zitron made that claim (in particular here: [1]). In the same article he admits he's not a programmer, and had to ask someone else to try out Claude Code and ccusage for him. He doesn't have any understanding of how LLMs or caching work. But he's prominent because he's received leaked financial details for Anthropic and OpenAI, e.g. [2]

[1] https://www.wheresyoured.at/anthropic-is-bleeding-out/ [2] https://www.wheresyoured.at/costs/


Maybe I'm misreading it, but I don't see him saying it's just the cost of *inference* alone (which is the strawman that the article in the OP is arguing against). He says:

> this company is wilfully burning 200% to 3000% of each Pro or Max customer that interacts with Claude Code

There is of course this meme that "Anthropic would be profitable today if they stopped training new models and only focused on inference", but people on HN are smart enough to understand that this is not realistic due to model drift, and also due to competition from other models. So training is forever a part of the cost of doing business, until we have some fundamental change in the underlying technology.

I can only interpret Ed Zitron as saying "the cost of doing business is 200% to 3000% of the price users are paying for their subscriptions", which sounds extremely plausible to me.


I'm surprised "faulty PSU" is not on GP's list of common problems. Almost every unstable computer I've ever experienced has been due to either a dying PSU (not an under-specced one) or dying power conversion capacitors on the motherboard.


There's a Polish electronics forum that's infamous for being actively hostile to noobs. "Blacklisted power supply, closing thread." is a micro-meme there at this point.


Yeah, some of the weirdest issues I've fixed have been PSU-related.

I had a PC come to me that would boot fine, but if you opened the CD drive it'd shut off instantly.


Used to repair PCs in the mid 90s. Had guy come in with right mouse button not working suddenly. Replaced mouse. No go. Replaced motherboard, CPU, RAM, reinstalled Windows. No go. Changed the PSU. Right mouse button worked.


I concur. A lot of “flakey” issues can be traced to poor quality power supplies. That’s a component that doesn’t get any attention in spec sheets other than a max power rating and I think a lot of manufacturers skimp there. As long as the system boots up and runs for a few minutes, they ship it.


Heck, even dirty power from the wall can contribute. I've seen improvements in stability from putting things behind power conditioners.


Definitely that too, particularly in 2nd-world countries. I remember having a difficult time with dirty power for some hardware products I was responsible for at one time, where the customers were in the Middle East and Africa in the 1990s. We ended up having to have the PS manufacturer do a redesign to help compensate for dirty power. It can be done, but it costs a bit more.


I could see that:

- Firefox may be more prevalent on those using Linux, since FF is less “corporate” than Chrome or Edge.

- People using Linux are probably putting Linux on old machines that had versions of Windows that are no longer supported.

However, what I can’t say next is “PSUs would get old and stop putting out as much” because that doesn’t tend to happen. They just die.

Those running Linux on some old tower may hook up too many devices to an underpowered PSU which could cause problems, but I doubt this is the norm.

If it’s not PSUs, what is it? It’s not electromagnetic radiation doing the bitflipping because that’s too rare.

Maybe bitflips could be caused by low-quality peripherals.

People also don’t vacuum out laptops like they used to vacuum out towers and desktops, so maybe it’s dust.

Or maybe it’s all a ruse and FF is buggy, but they don’t have time to figure it out.


>> People using Linux are probably putting Linux on old machines

Maybe for Linux noobs. But I would suggest that most Linux users are not noobs booting a disused Pentium from a live CD. They are running Linux on the same hardware as Windows users. I would further suggest that, as anyone installing a not-Windows OS is more tech savvy than average, Linux users actually take better care of their machines. Linux users take pride in their machines, whereas the average Windows user barely knows that computers have fans.

Ask any Linux user for their specifications and they will quote system reports and memory figures like Marisa Tomei discussing engine timings. Ask a random Windows user and they will probably start with the name of the store that sold it.


Unix user for 35 years, Linux for 30+ years ... my case fan died during the summer of last year ... just took the side panel off and kept things running.

So much for taking pride in my machine :)


An exception that proves the rule: you fixed it yourself, and here you are, proud of your machine.

I did basically the same thing recently when I built an AI rig. I tried to put it in a server rack case but the fan noise was too much. So I ditched the rack and put it in an open mining frame.


It's the powerhouse of the dell :p


yeah dell consumer pc psus were so awful


Which is kinda crazy to me, in light of how durable their business laptops have been in my experience. I’ve owned maybe 6 PC laptops in my career, and the only 2 that’ve survived that nearly-20-year span are both Dells.


Does Dell design and/or build their own laptops? Depending on the year it is likely just their brand and specs, designed and built by an ODM.


AFAICT, Claude was not asked to prove its algorithm works for all odd n, but was instead told to move on to even n.


> Gemini 2.5 Pro's reasoning traces (before they nerfed them) were a good example. The deep technical analysis, and then the human-friendly version in the final output. But I found their reasoning more readable than the final output!

They were also sometimes more useful: you could see whether it reasoned its way to an answer, or used faulty reasoning, or if it was just contextual recall. Huge shame they replaced them with garbage (though a bit better now).

> the language is surprisingly offputting. I don't know if it got worse

I'm pretty sure it did.


Yes, .wow also.

