
andy12_ · 2026-06-13T08:56:36 1781340996

This is making me extremely depressed. If this were coming from Anthropic I would just need to wait for OpenAI to drop a similar model. But if this comes from the US government, they will do the same to OpenAI when the moment comes.

Similar things will happen with China, and the EU has zero chance of developing frontier models. We are just fucked now.

andy12_ · 2026-06-10T13:14:44 1781097284

I don't know if you are aware, but some people reported on Twitter that Fable 5 may flag a message regardless of its content if it knows (from either pretraining knowledge or memories) that you work in either of those fields. I don't know if that's your case.

https://x.com/i/status/2064449457869984035

andy12_ · 2026-06-08T08:03:39 1780905819

> Performance on benchmarks has practically leveled off

Ehm, no? DeepSWE[1] for example shows that new models like gpt-5.5 continue to show big improvements compared to older models.

> Also prices are going up.

Prices for frontier intelligence have gone up, but prices for the same level of intelligence have gone way down (what you can get for pennies now was SOTA just a couple of years ago). The Pareto frontier is still expanding.

[1] https://deepswe.datacurve.ai/

andy12_ · 2026-06-04T08:01:22 1780560082

Claude can indeed decide to terminate conversations on its own using a special tool[1] if it feels "uncomfortable" with how the conversation is going. Also, very famously, in the middle of recording Computer Use demos, Claude paused its coding task for a while to look at photos of Yellowstone National Park[2].

I don't think either of these two is proof of consciousness.

[1] https://www.anthropic.com/research/end-subset-conversations

[2] https://x.com/AnthropicAI/status/1848742761278611504

andy12_ · 2026-06-01T14:14:12 1780323252

You don't get it. A human set up a software system allowing spicy autocomplete to solve open math problems if the appropriate keyword appears in its output.

andy12_ · 2026-05-28T09:59:36 1779962376

I skimmed through the paper fully expecting polite prompts to do better, and when I saw table 2 I lost it hahahahaha. The rude prompts are especially funny. I mean:

> You poor creature, do you even know how to solve this?

> Hey gofer, figure this out.

andy12_ · 2026-05-21T13:55:44 1779371744

Someone blatantly copied their tutorials but ChatGPT is to blame, somehow? The accusation here isn't even that ChatGPT learned from their tutorials and then generated them verbatim. The accusation is that someone copied the whole article and rewrote it with ChatGPT (which they could have done manually without AI anyway).

andy12_ · 2026-05-21T09:00:31 1779354031

> Was the question asked by a mathematician?

As per the report[1], the prompt used to solve the problem is AI-written and the solution was initially graded by an AI grading pipeline. They don't say this explicitly, but it seems like OpenAI has an automatic pipeline where they prompt models for solutions to famous math problems (which wouldn't be unexpected given how flashy a solution to a famous math problem looks).

> Was the paper right from a get-go or was there someone who pointed out mistakes?

Also as per the report, the output of the model isn't really a "paper"; it's a very terse two-page solution which is apparently correct. The paper was later written based on this solution to make it more presentable.

> How much attempts were made before solution was found?

Given that this appears to be from an automated pipeline, I would say that it had many attempts. But either way, the blogpost says that with enough test-time compute, the model finds this same solution 50% of the time.

[1] https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29a...

andy12_ · 2026-05-20T20:38:27 1779309507

I disagree. Even frontier models still achieve way worse results than the human baseline in VendingBench. As long as models can't optimally manage something as simple as a vending machine, they have no hope of managing a McDonald's.

andy12_ · 2026-05-16T09:19:45 1778923185

Writing performant code sometimes requires implementing or using "unsafe" functions (it's not obligatory, and a lot of projects don't use them, but it was probably needed to match Bun's behavior one to one). Those require upholding some invariants that the compiler cannot check. The compiler basically goes "I trust you on this one, programmer. If you fuck this up, unsafe behavior can propagate to the rest of the code".
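
To make that concrete, here is a minimal sketch, assuming the language being discussed is Rust (the comment doesn't name it). The function names `sum_first_unchecked` and `sum_first` are made up for illustration and have nothing to do with Bun's actual code. The unsafe function skips bounds checks and relies on an invariant only the caller can guarantee; the safe wrapper checks that invariant once so the rest of the code can't break it.

```rust
/// Sums the first `n` elements of `data` without per-element bounds checks.
///
/// # Safety
/// The caller must guarantee that `n <= data.len()`. The compiler cannot
/// verify this; if it is violated, the function reads out of bounds.
unsafe fn sum_first_unchecked(data: &[u32], n: usize) -> u32 {
    let mut total = 0u32;
    for i in 0..n {
        // SAFETY: relies on the caller's promise that i < n <= data.len().
        total = total.wrapping_add(unsafe { *data.get_unchecked(i) });
    }
    total
}

/// Safe wrapper (hypothetical): validates the invariant up front, so safe
/// callers cannot trigger the out-of-bounds read.
fn sum_first(data: &[u32], n: usize) -> Option<u32> {
    if n > data.len() {
        return None;
    }
    // SAFETY: we just checked that n <= data.len().
    Some(unsafe { sum_first_unchecked(data, n) })
}
```

With this, `sum_first(&[1, 2, 3, 4], 3)` returns `Some(6)` and `sum_first(&[1, 2, 3, 4], 10)` returns `None`. Calling `sum_first_unchecked` directly with `n = 10` would compile just fine, which is exactly the "I trust you on this one" part.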