Something weird is happening with LLMs and chess (dynomight.substack.com)
143 points by crescit_eundo 8 hours ago | 72 comments





I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.

OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.

Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.

So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?

For the record, I don't actually believe this. But given the data it's a logical possibility.


I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.


Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play at a decent amateur level.

The problem here is the specific model architecture, training data, vocabulary/tokenization method, loss function, and probably decoding strategy... basically everything is wrong here.


The question here is why gpt-3.5-turbo-instruct can then beat Stockfish.

PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close. "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...

Maybe there's some difference in the setup because the OP reports that the model beats stockfish (how they had it configured) every single game.

OP had stockfish at its weakest preset.

The article appears to have only run Stockfish at low levels. You don't have to be very good to beat it.
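For reference, a weak Stockfish is easy to reproduce; a minimal sketch with python-chess follows (the article's exact skill level and time limit are assumptions):

```python
# Minimal sketch, assuming python-chess and a local Stockfish binary on PATH.
# Stockfish exposes a UCI "Skill Level" option from 0 (weakest) to 20.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})  # weakest preset; the article's exact setting is assumed

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=0.1))  # a short think time weakens it further
print(result.move)
engine.quit()
```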

This is a puzzle, given enough training information. An LLM can successfully print out the state of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. Decent is subjective, but that should beat at least beginners. And the lowest level of Stockfish used in the blog post is only low intermediate.

I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
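For what it's worth, the "print the board after the given moves" claim is easy to check mechanically; a sketch with python-chess (the move list is just an example, not from the article):

```python
# Replay a move list and print the resulting position, which is what you'd
# compare an LLM's claimed board state against. Assumes python-chess.
import chess

moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]  # example opening, not from the article
board = chess.Board()
for san in moves:
    board.push_san(san)

print(board)        # ASCII diagram of the position
print(board.fen())  # same position as FEN, handy for exact comparison
```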


LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.

Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.


Right, at least as of the ~GPT-3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made a bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")

But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that, it would need to use the context to make it so white doesn't make critical mistakes.

Yeah, that is the "something weird" of the article.

My money is on a fluke inclusion of more chess data in that model's training.

All the other models do vaguely similarly well on other tasks and are in many cases architecturally similar, so training data is the most likely explanation.


Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify the hypothesis that many huge issues are due to tokenization problems, but... yeah.

Surprised I don't see more research into radically different tokenization.


FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.

E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
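As a concrete illustration of the step-by-step point, here is a sketch of the two prompt styles (the wording, model name, and example sentence are assumptions, not something anyone in the thread ran):

```python
# Sketch comparing a direct counting prompt to a chain-of-thought one.
# Assumes the openai Python SDK (>=1.0) and an API key in the environment;
# the model name is an assumption.
from openai import OpenAI

client = OpenAI()
text = "the quick brown fox jumps over the lazy dog near the old oak tree"

direct = f"How many words are in this sentence? Answer with a number only.\n\n{text}"
cot = (
    "Count the words in this sentence by listing them one per line with a "
    f"running index, then state the total.\n\n{text}"
)

for name, prompt in [("direct", direct), ("chain-of-thought", cot)]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, "->", resp.choices[0].message.content)
```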


The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.

I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923

I'm the one who will fight you, including with peer-reviewed papers, indicating that it is in fact due to tokenization. I'm too tired now but will edit this later, so take this as my bookmark to remind me to respond.

I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better tokenizing right-to-left rather than left-to-right). But I am talking about counting, and specifically counting words, not characters. I don't think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923

Words vs characters is a similar problem, since tokens can be less than one word, multiple words, multiple words plus a partial word, or words with non-word punctuation like a sentence-ending period.

How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?

Couldn't we just make every human readable character a token?

OpenAI's tokenizer turns "chess" into "ch" and "ess". We could just make it "c" "h" "e" "s" "s".
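You can see the split directly with tiktoken; the exact token boundaries shown in the comments below are an assumption and may differ by model:

```python
# Sketch: compare the BPE token split to a character-level split (assumes tiktoken).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
ids = enc.encode("chess")
print([enc.decode([i]) for i in ids])  # e.g. ['ch', 'ess'] -- exact split depends on the vocab
print(list("chess"))                   # ['c', 'h', 'e', 's', 's'], the character-level alternative
```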


This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.

That is, the groups are encoding something the model doesn't have to learn.

This is not much astray from "sight words" we teach kids.


We can, tokenization is literally just to maximize resources and provide as much "space" as possible in the context window.

There is no advantage to tokenization, it just helps solve limitations in context windows and training.


aka Character Language Models which have existed for a while now.

That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.

Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"

Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.

Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.

One could try using DSPy for automatic prompt optimization.
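A rough sketch of what such a prompt-building step could look like (the wording and function name are illustrative, not taken from the article):

```python
# Illustrative chain-of-thought prompt for a single move; wording is an assumption.
def build_cot_prompt(pgn_so_far: str) -> str:
    return (
        "You are playing white in a chess game. Moves so far (PGN):\n"
        f"{pgn_so_far}\n\n"
        "1. Describe the current board position in your own words.\n"
        "2. List three candidate legal moves and briefly evaluate each by "
        "thinking one or two moves ahead.\n"
        "3. Finish with your chosen move on its own line in standard algebraic "
        "notation, prefixed with 'MOVE:'.\n"
    )

print(build_cot_prompt("1. e4 e5 2. Nf3 Nc6"))
```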


> 1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3.

Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?


Yeah, expecting an immediate answer definitely limits the results, especially for the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.

i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number

that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.

while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.


> i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number

Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.


What about the number of possible positions where an idiotic move hasn't been played? Perhaps the search space could be reduced quite a bit.

Unless there is an apparently idiotic move that can lead to an 'island of intelligence'.

Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.

you should try it and post a rebuttal :)

> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.

Yeah, once you've deviated from a sequence you're lost.

Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.


Maybe the one that plays chess well is calling out to a real chess engine.

It's not:

1. That would just be plain bizarre

2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of Stockfish etc. does. Also the specific chess notation being prompted actually matters

3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame

4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval

5. You can, or at least you used to be able to, inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting them for Turbo instruct.


> Also the specific chess notation being prompted actually matters

Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.

Likewise:

- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)

- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.

I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?" hole.


I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.

OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?

[1] https://github.com/thomasahle/sunfish

[2] https://lichess.org/@/sunfish-engine


The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:

- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.

- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.

- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.

- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!

- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.

- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.

I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."


Not that convoluted really

It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*

If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself, people find it off-putting), it probably wouldn't seem at all convoluted.

* Layer chess into gpt-3.5-turbo-instruct only, but not ChatGPT, not GPT-4, to defeat the naysayers when GPT-4 comes out? *shrugs* If the issues with that are unclear, I can lay it out more.

** fwiw, at the time, pre-chatgpt, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. it would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *


This is likely. From the example games, it's clear it not only knows the rules (which would be impressive by itself; just making legal moves is not trivial), it also has some planning capability (it plays combinations of several moves).

this possibility is discussed in the article and deemed unlikely

Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.

The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks). The author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.

[1] https://dynomight.substack.com/p/chess/comment/77190852


What do you mean LLMs can't count to 10,000 for known reasons?

Separately, if you are able to show OpenAI is serving pre-canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.

I'm not saying this in an aggro tone; it's a genuinely interesting subject to me, because I wrote off LLMs at first thinking this was going on.* Then I spent the last couple years laughing at myself for thinking that they would do that. Would be some mix of fascinated and horrified to see it come full circle.

* I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way, way before ChatGPT.


I don't see that discussed, could you quote it?

I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens, and limiting the number of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt, in fact.

Wow, I actually did something similar recently and no LLM could win; the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess

I am very surprised by the perf of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.


PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.

"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"

https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
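For anyone who wants to rerun this kind of head-to-head, a minimal harness looks roughly like the sketch below. The prompt format, model name, and engine settings are all assumptions (the article feeds a PGN-style move list and takes the completion as the next move), and there is no retry logic for illegal completions here:

```python
# Minimal LLM-vs-Stockfish harness sketch. Assumes python-chess, a local
# Stockfish binary, and the openai SDK; prompt format, model name, and engine
# settings are assumptions, not the article's exact setup.
import chess
import chess.engine
from openai import OpenAI

client = OpenAI()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0})

board = chess.Board()
pgn_moves = []  # SAN moves played so far, used to build the prompt

def pgn_text() -> str:
    # Render "1. e4 e5 2. Nf3 ..." from the SAN move list.
    out = []
    for i, san in enumerate(pgn_moves):
        if i % 2 == 0:
            out.append(f"{i // 2 + 1}.")
        out.append(san)
    return " ".join(out)

while not board.is_game_over():
    if board.turn == chess.WHITE:  # the LLM plays white
        move_no = len(pgn_moves) // 2 + 1
        prompt = (pgn_text() + " " if pgn_moves else "") + f"{move_no}."
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct", prompt=prompt, max_tokens=8, temperature=0.7
        )
        san = resp.choices[0].text.strip().split()[0]
        move = board.parse_san(san)  # raises if the completion is not a legal move
    else:
        move = engine.play(board, chess.engine.Limit(time=0.1)).move
    pgn_moves.append(board.san(move))
    board.push(move)

print(board.result())
engine.quit()
```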


If it was trained on moves and hundreds of thousands of entire games at various levels, I do see it generating good moves and beating most players except the high-Elo players.

Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:

1. The author mentioned that tokenization causes something minuscule like a " " at the end of the input to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?

2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?

Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...


The author mentions in the comment section that changing temperature did not help.

My friend pointed out that the Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1-mini's poor performance is puzzling, though.

I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.

I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).


An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.

I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!) In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next token prediction.

I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.

I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!


The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce a grammar during inference), or sample the model for a valid move up to ten times and, failing that, pick a random valid move.
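Roughly, that sampling fallback looks like the sketch below (assuming python-chess; query_model is a placeholder for whatever API call is being made, not the article's actual code):

```python
# Sketch of "sample up to ten times for a legal move, else play a random legal move".
import random
import chess

def next_move(board: chess.Board, query_model, attempts: int = 10) -> chess.Move:
    for _ in range(attempts):
        candidate = query_model(board)  # placeholder: returns a SAN string such as "Nf3"
        try:
            return board.parse_san(candidate)  # only succeeds for a legal move
        except ValueError:
            continue  # illegal or unparseable move, sample again
    return random.choice(list(board.legal_moves))  # fall back to a random legal move
```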

I think it only needs to have read sufficient PGNs.

I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry": they're reflecting the training set and not using any underlying logic? Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the reward models used were able to distinguish high quality moves vs subpar moves...

I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?

There are other transformers that have been trained on chess text that play chess fine (just not as well as 3.5 Turbo instruct, with the exception of the "grandmaster level without search" paper).


So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?

LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.

Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.


> That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.

This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.


They are almost certainly tokenized in most LLM multi-modal models. https://en.wikipedia.org/wiki/Large_language_model#Multimoda...

Ah, an overloaded meaning of "tokenizer": "split into tokens" vs "turned into a single embedding matching a token". I've never heard it used that way before, but it kinda makes sense.

What about contemporary frontier models?


