Something seems quite off with the metric. Why would 4o recently improve on itself at a rate ~17x faster than 4o improved on 4 in that graph? Elo is a competitive (relative) metric, not an absolute one, so someone could post the same graph and claim the cause was "many new LLMs being added to the system are not performing better than previous large models like they used to" (not saying it is or isn't, just saying the graph by itself doesn't tell you whether LLMs are actually advancing at different rates).
Chatbot Arena also publishes the head-to-head win rate for each pair of models, excluding ties[1], which makes it possible to detect this kind of global drift. E.g. the gpt-4o released on 2024/09/03 wins 69% of the time against the gpt-4o released on 2024/05/13 in blind tests.
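For a rough sense of scale, a head-to-head win rate can be turned into an implied rating gap. The sketch below assumes the textbook logistic Elo model (base 10, scale 400); the leaderboard's actual rating fit may differ, and the function names are made up for illustration.

```python
import math


def elo_gap_from_win_rate(p: float, scale: float = 400.0) -> float:
    """Rating gap implied by a win probability p (ties excluded).

    Assumes the standard Elo expectation: p = 1 / (1 + 10 ** (-gap / scale)).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return scale * math.log10(p / (1.0 - p))


def win_rate_from_elo_gap(gap: float, scale: float = 400.0) -> float:
    """Inverse of the above: expected win probability for a given rating gap."""
    return 1.0 / (1.0 + 10 ** (-gap / scale))


# The 69% figure quoted above for gpt-4o (2024/09/03) vs gpt-4o (2024/05/13).
gap = elo_gap_from_win_rate(0.69)
print(round(gap))  # 139
```

So under that assumption a 69% win rate is roughly a 139-point gap between the two versions, and because it comes from direct pairings it doesn't depend on which other models were added to the pool in between.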