I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).