Hacker News | conradkay's comments

Stock up 9% today, very pleasant for Zuck if you do the math on his net worth :)

I mean, kinda? It's not like Zuck is selling his stock tomorrow, so daily fluctuations in stock price don't really affect him.

It doesn't seem benchmaxxed; its ARC-AGI-2 score is quite bad (42.5%, vs. 76.1% for GPT 5.4) and its coding is only okay. But maybe this is the best Meta can do even while benchmaxxing.

The impressive part is multimodality, very plausible since there's less focus there by other labs (especially Anthropic)


I don't think any human would make that table in the middle

Well, in a slightly indirect manner. Claude is writing a ton of code, and therefore creating a lot of security vulnerabilities.

That's not what's happening here. This announcement is about the velocity with which Claude finds vulnerabilities in already-existing software.

Software already exists that was written by Claude. They absolutely are selling both the means to write software and the means of securing the insecure software. At least for the time being. In the future Mythos will probably just make it possible to prompt good software from the start.

Ok. But it's mostly the old software, not the new software, that the bugs are being found in.

Maybe because there's no critical and widely used software written by LLMs so far? Which says a lot about how LLMs are failing to even approach the level of capability you'd expect from all the hype. The goal has always been, even before LLMs, to find something smarter than our smartest humans. So far the success at that is really minuscule. Humans are still the benchmark, all things considered. Now they're saying LLMs are going to be better than our best vulnerability researchers in a few months (literally what an Anthropic researcher said at a conference). Ok, that might happen. But the funny part is that LLMs will definitely be the ones writing most of these vulnerabilities. So, to hedge against LLMs you must use LLMs. And that is gonna cost you more.

So today, most of the vulnerabilities being found by these tools are in code written by humans. Your hypothesis is that down the road, most of the vulnerabilities will be in code written by LLMs.

What seems more probable is that the same advances that let LLMs find vulnerabilities will end up baked into developer tooling. So you'll be writing code with an LLM that knows how to write secure code.


I don't think Claude wrote OpenBSD, but to be honest that was before my time so I'm not sure

If it’s very good at finding security vulnerabilities, I would assume that the code it generates is much more hardened than anything your average developer can put out.

For what it's worth, Anthropic explicitly denies that: "To state it plainly: We never reduce model quality due to demand, time of day, or server load."

See also https://marginlab.ai/trackers/claude-code/

It's very interesting to me how widespread this perception is. Maybe it's as simple as LLM productivity degrading over time within a project, as slop compounds.

Or, more recently, since they added a 1M context window, maybe people are more reckless with context usage


It has nothing to do with the context window. Reasoning used to bring measured approaches grounded in actual tool calls. All of that short-circuits into a quick-fix approach that is unlike Opus 4.5 or 4.6. Sonnet 4.5 used to do that. My context window is always < 200K.

That still leaves open the possibility that they reduce model quality due to profit. ;p

Posted this a while ago:

>Models are not "degrading". They're not being "secretly quantized". And no one is swapping out your 1.2T frontier behemoth for a cheap 120B toy and hoping you wouldn't notice!

>It's just that humans are completely full of shit, and can't be trusted to measure LLM performance objectively!

>Every time you use an LLM, you learn its capability profile better. You start using it more aggressively at what it's "good" at, until you find the limits and expose the flaws. You start paying attention to the more subtle issues you overlooked at first. Your honeymoon period wears off and you see that "the model got dumber". It didn't. You got better at pushing it to its limits, exposing the ways in which it was always dumb.

>Now, will the likes of Anthropic just "API error: overloaded" you on any day of the week that ends in Y? Will they reduce your usage quotas and hope that you don't notice because they never gave you a number anyway? Oh, definitely. But that "they're making the models WORSE" bullshit lives in people's heads way more than in any reality.


That generally makes sense to me, but I wonder if it's different when the attacker and defender are using the same tool (Mythos in this case)

Maybe you just spend some factor more on tokens than all the attackers do combined, and end up mostly okay. Put another way, if there are 20 vulnerabilities that Mythos is capable of finding, maybe it's reasonable to find all of them?


From the red team post https://red.anthropic.com/2026/mythos-preview/

"Most security tooling has historically benefitted defenders more than attackers. When the first software fuzzers were deployed at large scale, there were concerns they might enable attackers to identify vulnerabilities at an increased rate. And they did. But modern fuzzers like AFL are now a critical component of the security ecosystem: projects like OSS-Fuzz dedicate significant resources to help secure key open source software.

We believe the same will hold true here too—eventually. Once the security landscape has reached a new equilibrium, we believe that powerful language models will benefit defenders more than attackers, increasing the overall security of the software ecosystem. The advantage will belong to the side that can get the most out of these tools. In the short term, this could be attackers, if frontier labs aren’t careful about how they release these models. In the long term, we expect it will be defenders who will more efficiently direct resources and use these models to fix bugs before new code ever ships. "


Going off the recent biography of Demis Hassabis (CEO/co-founder of DeepMind, joint winner of the Nobel Prize in Chemistry), it seems like he's very concerned about it as well

I would've basically agreed with you until I'd seen this talk: https://www.youtube.com/watch?v=1sd26pWhfmg

Maybe a bad example since Nicholas works at Anthropic, but they're very accomplished and I doubt they're being misleading or even overly grandiose here

See the slide 13 minutes in, which makes it look to be quite a sudden change


Very interesting, thanks for sharing.

> I doubt they're being misleading or even overly grandiose here

I think I agree.

We could definitely do much worse than Anthropic in terms of companies who can influence how these things develop.


I watched the talk as well and it's very interesting. But isn't this just a buffer overflow in the NFS client code? The way the LLM diagnosed the flaw, demonstrated the bug, and wrote an exploit is cool and all, but doesn't this still come down to the fact that the NFS client wasn't checking bounds before copying a bunch of data into a fixed length buffer? I'm not sure why this couldn't have been detected with static analysis.

I guess so, but there's a ton of buffer overflow vulnerabilities in the wild, and apparently this one wasn't detected by static analysis

The red team post goes over some more impressive finds, and says that there's hundreds more they can't disclose yet: https://red.anthropic.com/2026/mythos-preview/


permanent underclass has arrived :(

For comparison, that's 5x the cost of Opus 4.6, and 1.67x the cost of Opus 4.1

I think this would be very heavily used if they released it, completely unlike GPT 4.5


Opus 4 & 4.1 are still on Vertex+Bedrock at $75/1M output tokens. They were used very heavily, and in my subjective opinion are better than 4.5 and 4.6.

Interesting, what makes them better to you?

Opus 4, with enough context, could do most of what I wanted in a single shot. More often than not, when I had a bad outcome and was frustrated, I would realize that I was the problem (giving improper direction or missing key context).

I also was in a pretty sweet position having a boat load of credits and premo vertex rate limits so I could 'afford' to dump hundreds of thousands of tokens in context all day.

With Opus 4.5 and 4.6, I find I have to steer very actively.

This is comparing using Opus 4 directly rather than comparing the performance of the models in Claude Code for example, or any 'agentic' setup.

Kinda reminds me of 4o vs 4-turbo.

I would imagine they are smaller models.

