1. Chess.com uses Glicko ratings, not Elo. Even the title makes an obvious mistake.

2. You left out which illegal moves it tried, which is valuable information.

3. Your qualitative analysis is almost non-existent. Even rudimentary analysis shows that it is simply predicting the most likely sequence of moves given the previous moves. That's why it makes massive blunders and illegal moves when a very common tactic is almost possible, but not quite: it tries anyway because that is the most likely continuation. This is also why it keeps giving multiple moves at a time.

5. It plays way more theory than its opponents can possibly know and therefore gets a strong position out of the opening, where these tactics are more likely to work. This skews its rating upwards.
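
Points 2 and 3 could be checked mechanically by logging every raw reply and separating the first move from any extra moves the model tacks on. Here is a minimal Python sketch, assuming the replies contain moves in standard algebraic notation. `split_reply` and the regex are hypothetical helpers, not anything from the original experiment, and they only check the shape of a move, not its legality (that would need a real board library such as python-chess).

```python
import re

# Rough shape of a SAN move: castling, or an optional piece letter,
# optional disambiguation, optional capture, target square, optional
# promotion, optional check/mate suffix. This does NOT test legality.
SAN_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])"
    r"((?:O-O-O|O-O|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?)"
)

def split_reply(reply):
    """Return (first_move, extra_moves) found in a model reply.

    first_move is None when nothing move-shaped is present.
    extra_moves lists any further moves the model volunteered.
    """
    moves = SAN_PATTERN.findall(reply)
    if not moves:
        return None, []
    return moves[0], moves[1:]

if __name__ == "__main__":
    first, extras = split_reply("Nf3 Nc6 Bb5")
    print("played:", first, "| ignored:", extras)
```

Keeping the ignored moves in the log, along with any move the board rejects, would show exactly which tactic the model was reaching for.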

Re: Going insane

Did you try prefixing each of the sequence prompts with the original instructions?
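
Something like the following is what I mean. It is a rough Python sketch under my own assumptions: the instruction wording, `format_moves`, and `build_prompt` are all made up for illustration, and the actual call to the model is left out.

```python
# Hypothetical instruction text; the original post's wording may differ.
INSTRUCTIONS = (
    "You are playing chess as Black. Reply with exactly one legal move "
    "in standard algebraic notation and nothing else."
)

def format_moves(moves):
    """Turn ['e4', 'e5', 'Nf3'] into '1. e4 e5 2. Nf3'."""
    parts = []
    for i in range(0, len(moves), 2):
        pair = " ".join(moves[i:i + 2])
        parts.append(f"{i // 2 + 1}. {pair}")
    return " ".join(parts)

def build_prompt(instructions, moves):
    """Repeat the full instructions in front of every move request."""
    record = format_moves(moves) or "(no moves yet)"
    return f"{instructions}\n\nMoves so far: {record}\nYour move:"

if __name__ == "__main__":
    print(build_prompt(INSTRUCTIONS, ["e4", "e5", "Nf3"]))
```

Because every prompt restates the rules and the full move record, the model never has to rely on instructions that have scrolled far back in the conversation.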

Excellent! As for why GPT-4 behaved that way, it probably hadn't encountered chess yet, whereas GPT-3.5 may have had more training on the game. A couple of days ago, I tried to play a game of Go using an ASCII board. It was actually able to draw the board and keep its state for maybe a couple of moves, and then it completely misplaced previous moves, so I ended the experiment early. I did notice that it was aware of the rules, but there probably weren't enough tokens to keep the board state. Your record of moves in your prompt is probably the way to go forward with this, and possibly train it forward. Thanks for sharing your experiment.
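
For the Go case, the same idea would mean keeping the move record myself and redrawing the ASCII board from it each turn, rather than asking the model to remember it. Here is a small Python sketch under my own assumptions: `render_board` is a hypothetical helper, stones are given as (color, row, col) tuples, and captures are deliberately not handled.

```python
def render_board(moves, size=9):
    """Draw an ASCII Go board from a list of (color, row, col) moves.

    color is 'X' (black) or 'O' (white); row and col are 0-based.
    Simplification: captures and ko are NOT handled, stones are only placed.
    """
    grid = [["." for _ in range(size)] for _ in range(size)]
    for color, row, col in moves:
        if grid[row][col] != ".":
            raise ValueError(f"point ({row}, {col}) is already occupied")
        grid[row][col] = color
    return "\n".join(" ".join(line) for line in grid)

if __name__ == "__main__":
    record = [("X", 2, 2), ("O", 6, 6), ("X", 2, 6)]
    print(render_board(record))
```

The board pasted into each prompt is then always rebuilt from the record, so a misplaced stone can only come from the record itself.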

