Neither lost-in-the-middle nor long-context performance has seen much improvement over the past year. It is hard to generate long training examples that remain meaningful, and all existing models still get significantly dumber past 20-30k tokens, especially on hard tasks.
Reasoning models probably need an optimization constraint on CoT length, and also some priority constraint (reason only about the things that actually need it). A sketch of one possible form of this follows.
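One plausible way to impose a length constraint is reward shaping during RL: subtract a penalty proportional to the CoT tokens spent beyond some budget. This is a minimal sketch, not any lab's actual training objective; the function name and the `budget` and `penalty` values are illustrative assumptions.

```python
def shaped_reward(
    base_reward: float,    # task reward, e.g. 1.0 if the final answer is correct
    cot_tokens: int,       # number of tokens in the chain of thought
    budget: int = 1024,    # hypothetical token budget for "free" reasoning
    penalty: float = 5e-4, # hypothetical per-token cost beyond the budget
) -> float:
    """Reward shaping that taxes chain-of-thought length beyond a budget.

    Tokens under the budget are free, so the policy isn't pushed toward
    degenerate one-token answers on easy problems; tokens past the budget
    cost linearly, so long reasoning has to buy enough accuracy to pay
    for itself.
    """
    excess = max(0, cot_tokens - budget)
    return base_reward - penalty * excess

# Example: a correct answer using 3,000 CoT tokens nets
# 1.0 - 5e-4 * (3000 - 1024) ≈ 0.012, while the same correct
# answer in 1,000 tokens keeps the full 1.0.
```

The free budget is one crude way to encode the priority idea too: terse answers to easy questions carry no penalty, while extended reasoning is only rewarded where it actually changes the outcome.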