Hacker News

Replying in a split thread to clearly separate where I was wrong.

If Gemini is so good at chess because of a non-LLM feature of the model, then it is disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 Elo. But the fact that it still plays illegal moves sometimes, is biased toward popular moves, etc. makes me think that chess is still handled by an LLM, and makes me suspect benchmaxxing.

But even if there is no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath it, then all we can conclude is that Gemini can play chess well; we cannot generalize to other LLMs, which play at about the level of a random bot. My fourth point above was my strongest. There are only four anchor engines: one beats all LLMs, the second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?), and the fourth is the random bot.
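For context on why so few anchors is a problem: Elo ratings only pin down relative strength, so a ladder anchored by four engines constrains the absolute numbers very loosely. The expected score between two rated players follows the standard logistic formula; a minimal sketch in Python (the rating values below are made-up examples, not numbers from the ladder in question):

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Equal ratings give an expected score of exactly 0.5.
print(elo_expected_score(1500, 1500))  # 0.5

# A 400-point gap corresponds to an expected score of ~0.91
# for the stronger side.
print(round(elo_expected_score(2000, 1600), 3))  # 0.909
```

With so few anchors, a single anomalous matchup (like Survival bot) shifts every estimated rating, which is why the "2000 Elo" headline number deserves skepticism.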




Gemini is an LLM. Its chess play does not rely on a non-LLM module of some sort. I'm just saying that, as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular, Gemini is very capable but also quite misaligned (it will more often actively sabotage users).

> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot

That's overly reductive. It would be true if we didn't see improvement over time from the other LLMs, but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks, but I can't think of a single benchmark that was benchmarkmaxxed and not eventually saturated by every single LLM provider: being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of the task, and as more training gets done, the other LLMs get there too.


The problem with benchmaxxing is that it lies about the capabilities of the technology. If all we wanted was a machine that plays chess, we would just use a chess engine, which we have known how to build for decades. If Google wanted Gemini to be able to play chess, it would be much easier (and better, and a helluva lot cheaper) to stick a traditional chess engine into their product and defer all chess queries to that engine.
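To make the "defer to an engine" point concrete, here is a hypothetical dispatch sketch; the routing function and the FEN-detection heuristic are my own illustration, not anything Google actually ships. The idea is simply to recognize chess positions and hand them to a dedicated engine instead of the LLM:

```python
import re

# Rough heuristic for a FEN string: six space-separated fields, the first
# being eight '/'-separated rank descriptions of piece letters and digits.
FEN_RE = re.compile(
    r"^([pnbrqkPNBRQK1-8]+/){7}[pnbrqkPNBRQK1-8]+ [wb] [KQkq-]+ [a-h1-8-]+ \d+ \d+$"
)

def route(prompt: str) -> str:
    """Send chess positions to a traditional engine, everything else to the LLM."""
    if FEN_RE.match(prompt.strip()):
        return "chess_engine"  # e.g. hand off to Stockfish over UCI
    return "llm"

print(route("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"))
# -> chess_engine
print(route("Summarize this article for me"))
# -> llm
```

A production router would of course need to handle positions described in prose or PGN, but even this crude version shows how little machinery "being good at chess" requires once you stop insisting the LLM itself do it.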

The claim here (way up thread) was: "we have the technology to train models to do anything that you can do on a computer, only thing that's missing is the data", with the implication that logic and reasoning are emergent properties of these models given enough data and enough parameters. However, the evidence seems to suggest otherwise. Logic and reasoning have to be specifically programmed into these models, and even with a dataset as vast as online chess games (Lichess alone has 7.1 billion games), chess is obviously not easy for LLMs, though if the claim above were true it should be. And that tells us something about the limitations of the technology.




