Yes, that's what I'm suggesting. Cursor is spamming the models with too much context, which harms reasoning models more than non-reasoning models (a hypothesis, but one that aligns with my experience). That's why I recommended testing reasoning models outside of Cursor with a hand-curated context.
A longer advertised context length doesn't necessarily map 1:1 to a model's actual ability to perform difficult tasks over that full context. See, for example, the plots of performance on ARC vs. context length for the o-series models.
Context length alone shouldn't be the reason that Sonnet succeeds consistently while others (some of which have even bigger context windows) fail.