Hacker News | b--l's comments

I found it tripped in the most laughable situations by mere words that could be related in some way to hacking but are in common use in programming. I would have to go back, examine my prompt for words that could be used in another context, and replace them with synonyms.

I got downgraded from Opus to Fable for asking why MDMA was not addictive in the same way Cocaine is, so yeah, the "guardrails" are clearly vibe-coded.

Just speculating, but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes, for one thing, its "personality", is less human and more fatiguing-AI-slop like.


You don't need to fry with RLAIF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.

That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.


Thank you for the gold, kind stranger.


If you assume he is being blackmailed then it makes sense. On any other level it does not.


No, he could just be a flaming narcissist, doing whatever he thinks makes him look the best for the next five minutes.

Or he could be running scared of the Epstein files, and desperate to do something - anything - to distract the public.

Or he could just think that his superficial level of understanding is deeper than everyone else's, so anyone who disagrees is wrong or stupid or both.

Or he could know that he doesn't know, but think that he has to look like he knows, because if the image cracks, it's all over.

So, no, blackmail is not the only option that makes sense.


You’ve forgotten the scenario in which senility remixes all of the above daily.


grok-4.1-fast is the number 2 model on this benchmark.

~~If you've used this model in real life to do any sort of programming, and have seen its output, you would know that there is something VERY wrong with your benchmark.~~

Edit: Oh sorry, I looked at the questions, I see this is also for SQL specifically. Interesting. Maybe they tuned that grok model for SQL. Cool site. I bookmarked it.


Yeah, multi-step SQL generation and debugging.

Some models surprised me and Grok Fast was one of them. It is consistently good at this task though!


If we learned anything from the code leak, it is that they essentially do not know what is in the black box of that 500k-line mass of code. So that's plausible.


"but most of our samples are from the last 2 months."

There's your major issue. That's well within the brutal quantization window.

