I've been using cursor since it launched, sticking almost exclusively to claude-3.5-sonnet because it is incredibly consistent, and rarely loses the plot.
As subsequent models have been released, most of which claim to be better at coding, I've switched cursor to it to give them a try.
o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models suffer from the exact same "adhd." As an example, in a NextJS app, if I do a composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy."
sonnet nails it almost perfectly every time, but suffers from some date cutoff issues like thinking that shadcn-ui@latest is the repo name.
Every single other model, doesn't matter which, does the following: it starts writing (from scratch), radix-ui components. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" then begin writing its own components from scratch, again not using shadcn.
This is still problematic with o3-mini.
I can't believe it's the models. It must be the instruction-set that cursor is giving it behind the scenes, right? No amount of .cursorrules, or other instruction, seems to get cursor "locked in" the way sonnet just seems to be naturally.
It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.
My experience with cursor and sonnet is that it is relatively good at first tries, but completely misses the plot during corrections.
"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"
It keeps doing that kind of shenanigans, applying modifications that solve the newly appearing problem while screwing the original attempt's goal.
I usually get much better results from regular chatgpt copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding what I think is being forgotten.
Claude makes a lot of crappy change suggestions, but when you ask "is that a good suggestion?" it's pretty good at judging when it isn't. So that's become standard operating procedure for me.
It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.
I'm always asking Claude to propose a variety of suggestions for the problem at hand and their trade-offs, then evaluating them for the top three proposals and why. Then I'll pick one of them and further vet the idea
>It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.
Absolutely, I find this a challenge as well. Every thought that crosses my mind is a great idea according to it. That's the opposite attitude to what I want from an engineer's copilot! Particularly from one who also advices junior devs.
More than once I've found myself going down this 'little maze of twisty passages, all alike'. At some point I stop, collect up the chain of prompts in the conversation, and curate them into a net new prompt that should be a bit better. Usually I make better progress - at least for a while.
This becomes second nature after a while. I've developed an intuition about when a model loses the plot and when to start a new thread. I have a base prompt I keep for the current project I'm working on, and then I ask the model to summarize what we've done in the thread and combine them to start anew.
I can't wait until this is a solved problem because it does slow me down.
What do you find difficult about distilling your own prompts?
After any back and forth session I have reasonably good results asking something like "Given this workflow, how could I have prompted this better from the start to get the same results?"
For my advanced use case involving Python and knowledge of finance, Sonnet fared poorly. Contrary to what I am reading here, my favorite approach has been to use o1 in agent mode. It’s an absolute delight to work with. It is like I’m working with a capable peer, someone at my level.
Sadly there are some hard limits on o1 with Cursor and I cannot use it anymore. I do pay for their $20/month subscription.
How? It specifically tells me this is unsupported: "Agent composer is currently only supported using Anthropic models or GPT-4o, please reselect the model and try again."
I think you’re right - I must have used it in regular mode, then got GPT-4o to fill in the gaps. It can fully automate a lot of menial work, such as refactors and writing tests. Though I’ll add, I had a roughly 50% success with GPT-4o bug fixing in agent mode, which is pretty great in my experience. When it did work, it felt glorious - 100% hands-free operation!
It seems like you could use aider in architecture mode. Basically, it will suggest the solution to your problem fist, and prompt you to start editing, you can say no to refine the solution and only start editing when you are satisfied with it.
Hah, I was trying it the other day in a Go project and it did exactly the same thing. I couldn’t believe my eyes, it basically rewrote all the functions back out in the test file but modified slightly so the thing that was failing wouldn’t even run.
Yes, but for some reason it seems to perform worse there.
Perhaps whatever algorithms Cursor uses to prepare the context it feeds the model are a good fit for Claude but not so much for the others (?). It's a random guess, but whatever the reason, there's a weird worsening of performance vs pure chat.
Yes but every model besides claude-3.5-sonnet sucks in Cursor, for whatever reason. They might as well not even offer the other models. The other models, even "smarter" models, perform vastly poorer or don't support agent capability or both.
What works nice also is the text to speech. I find it easier and faster to give more context by talking rather than typing, and the extra content helps the AI to do its job.
And even though the speech recognition fails a lot on some of the technical terms or weirdly named packages, software, etc, it still does a good job overall (if I don’t feel like correcting the wrong stuff).
It’s great and has become somewhat of a party trick at work. Some people don’t even use AI to code that often, and when I show them “hey have you tried this?” And just tell the computer what I want? Most folks are blown away.
Not for me. I first ask Advanced Voice to read me some code and have Siri listen and email it to an API I wrote which uses Claude to estimate the best cloud provider to run that code based on its requirements and then a n8n script deploys it and send me the results via twilio.
That sounds exhausting. Wouldn't it be faster to include you package.json in the context?
I sometimes do this (using Cline), plus create a .cline file at project root which I refine over time and which describes both the high level project overview, details of the stack I'm using, and technical details I want each prompt to follow.
Then each actual prompt can be quite short: read files x, y, and z, and make the following changes... where I keep the changes concise and logically connected - basically what I might do for a single pull request.
My point was that a prompt that simple could be held and executed very well by sonnet, but all other models (especially reasoning models) crash and burn.
It's a 15 line tsx file so context shouldn't be an issue.
Makes me wonder if reasoning models are really proper models for coding in existing codebases
Your last point matches what I’ve seen some people (simonw?) say they’re doing currently: using aider to work with two models—one reasoning model as an architect, and one standard LLM as the actual coder. Surprisingly, the results seem pretty good vs. putting everything on
one model.
This is probably the right way to think about it. O1-pro is an absolute monster when it comes to architecture. It is staggering the breadth and depth that it sees. Ask it to actually implement though, and it trips over its shoelaces almost immediately.
The biggest delta over regular o1 that I've seen is asking it to make a PRD of an app that I define as a stream-of-consciousness with bullet points.
It's fantastic at finding needles in the haystack, so the contradictions are nonexistent. In other words, it seems to identify which objects would interrelate and builds around those nodes, where o1 seems to think more in "columns."
To sum it up, where o1 feels like "5 human minute thinking," o1-pro feels like "1 human hour thinking"
I’ve coded in many languages over the years but reasonably new to the TS/JS/Next world.
I’ve found if you give your prompts a kind long form “stream of consciousness”, where you outline snippets of code in markdown along with contextual notes and then summarise/outline at the end what you actually wish to achieve, you can get great results.
Think a long form, single page “documentation” type prompts that alternate between written copy/contextual intent/description and code blocks. Annotating code blocks with file names above the blocks I’m sure helps too. Don’t waste your context window on redundant/irrelevant information or code, stating a code sample is abridged or adding commented ellipses seems to do the job.
By the time I've fully documented and explained what I want to be done, and then review the result, usually finding that it's worse than what I would have written myself, I end up questioning my instinct to even reach for this tool.
I like it for general refactoring and day to day small tasks, but anything that's relatively domain-specific, I just can't seem to get anything that's worth using.
,> and frequently a waste of time in domains where you're an expert.
I'm a domain expert and I disagree.
There's many scenarios where using LLMs pays off.
E.g. a long file or very long function are just that, and an LLM is faster at understanding it whole not being limited in how many things you can track in your mind at once (between 4 and 6). It's still gonna be faster at refactoring it and testing it than you will.
I agree that it's amazing as a learning tool. I think the "time to ramp" on a new technology or programming language has probably been cut in half or more.
We've been working on solving a lot of these issues with v0.dev (disclaimer: shadcn and I work on it). We do a lot of pre and post-processing to ensure LLMs output valid shadcn code.
We're also talking to the cursor/windsurf/zed folks on how we can improve Next.js and shadcn in the editors (maybe something like llms.txt?)
So I think I finally understood recently why we have these divergent groups with one thinking Claude 3.5 Sonnet is the best model for coding and another that follow the OpenAI SOTA at that moment.
I have been a heavy user of ChatGPT, jumping on to pro without even thinking for more than a second once released.
Recently though I took a pause from my usual work on statistical modelling, heuristics work and other things in certain deep domains to focus on building client APIs and frontends and decided to again give Claude a try and it is just so great to work with for this usecase.
My hypothesis is its a difference of what you are doing. OpenAI O models are much better than others at mathematical modelling and such tasks and Claude for more general purpose programming.
Context length possibly. Prompt adherence drops off with context, and anything above 20k tokens is pushing it. I get the best results by presenting the smallest amount of context possible, including removing comments and main methods and functions that it doesn't need to see. It's a bit more work (not that much if you have a script that does it for you), but the results are worth it. You could test in the chatgpt app (or lmarena direct chat) where you ask the same question but with minimal hand curated context, and see if it makes the same mistake.
Yes, that's what I'm suggesting. Cursor is spamming the models with too much context, which harms reasoning models more than it harms non-reasoning models (hypothesis, but one that aligns with my experience). That's why I recommended testing reasoning models outside of Cursor with a hand curated context.
The advertised context length being longer doesn't necessarily map 1:1 with the actual ability the models have to perform difficult tasks over that full context. See for example the plots of performance on ARC vs context length for o-series models.
Aider, with o1 or R1 as the architect and Claude 3.5 as the implementer, is so much better than anything you can accomplish with a single model. It's pretty amazing. Aider is at least one order of magnitude more effective for me than using the chat interface in Cursor. (I still use Cursor for quick edits and tab completions, to be clear).
Aider now has experimental support for using two models to complete each coding task:
- An Architect model is asked to describe how to solve the coding problem.
- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).
Probably gonna show a lot of ignorance here, but isn’t that a big part of the difference between our brains and AI? That instead of one system, we are many systems that are kind of sewn together? I secretly think AGI will just be a bunch of different specialized AIs working together.
Efficient and effective organizations work this way, too: a CEO to plan in broad strokes, employees to implement that vision in specific ways, and managers to make sure their results match expectations.
Then you'll be in "architect" mode, which first prompts o1 to design the solution, then you can accept it and allow sonnet to actually create the diffs.
Most of the time your way works well—I use sonnet alone 90% of the time, but the architect mode is really great at getting it unstuck when it can't seem to implement what I want correctly, or keeps fixing its mistakes by making things worse.
I really want to see how apps created this way scale to large codebases. I’m very skeptical they don’t turn into spaghetti messes.
Coding is basically just about the most precise way to encapsulate a problem as a solution possible. Taking a loose English description and expanding it into piles of code is always going to be pretty leaky no matter how much these models spit out working code.
In my experience you have to pay a lot of attention to every single line these things write because they’ll often change stuff or more often make wrong assumptions that you didn’t articulate. And in my experience they never ask you questions unless you specifically prompt them to (and keep reminding them to), which means they are doing a hell of a lot of design and implementation that unless carefully looked over will ultimately be wrong.
It really reminds me a bit of when Ruby on Rails came out and the blogosphere was full of gushing “I’ve never been more productive in my life” posts. And then you find out they were basically writing a TODO app and their previous development experience was doing enterprise Java for some massive non-tech company. Of course RoR will be a breath of fresh air for those people.
Don’t get me wrong I use cursor as my daily driver but I am starting to find the limits for what these things can do. And the idea of having two of these LLM’s taking some paragraph long feature description and somehow chatting with each other to create a scalable bit of code that fits into a large or growing codebase… well I find that kind of impossible. Sure the code compiles and conforms to whatever best practices are out there but there will be absolutely no constancy across the app—especially at the UX level. These things simply cannot hold that kind of complexity in their head and even if they could part of a developers job is to translate loose English into code. And there is much, much, much, much more to that than simply writing code.
I see what you’re saying and I think that terming this “architect” mode has an implication that it’s more capable than it really is, but ultimately this two model pairing is mostly about combining disparate abilities to separate the “thinking” from the diff generation. It’s very effective in producing better results for a single prompt, but it’s not especially helpful for “architecting” a large scale app.
That said, in the hands of someone who is competent at assembling a large app, I think these tools can be incredibly powerful. I have a business helping companies figure out how/if to leverage AI and have built a bunch of different production LLM-backed applications using LLMs to write the code over the past year, and my impression is that there is very much something there. Taking it step by step, file by file, like you might if you wrote the code yourself, describing your concept of the abstractions, having a few files describing the overall architecture that you can add to the chat as needed—little details make a big difference in the results.
I use Cursor and Composer in agent mode on a daily basis, and this is basically exactly what happened to me.
After about 3 weeks, things were looking great - but lots of spagetti code was put together, and it never told me what I didn't know. The data & state management architecture I had written was simply just not maintainable (tons of prop drilling, etc). Over time, I basically learned common practices/etc and I'm finding that I have to deal with these problems myself. (how it used to be!)
We're getting close - the best thing I've done is create documentation files with lots of descriptions about the architecture/file structure/state management/packages/etc, but it only goes so far.
We're getting closer, but for right now - we're not there and you have to be really careful with looking over all the changes.
The worst thing you can do with aider is let it autocommit to git. As long as you review each set of changes you can stop it going nuts.
I have a codebase maybe 3-500k lines which is in good shape because of this.
I also normally just add the specific files I need to the chat and give it 1-2 sentences for what to do. It normally does the right thing (sonnet obviously).
The reality is I suspect one will use different models for different things.
Think of it like having different modes of transportation.
You might use your scooter, bike, car, jet - depending on the circumstances.
A bike was invented 100 years ago? But it may be the best in the right use case. Would still be using DaVinci for some things because we haven't bothered swapping it and it works fine.
For me - the value of R1/o3 is visible logic that provides an analysis that can be critiqued by Sonnet 3.5
I have an even more topical analogy! Using different languages for different tasks. When I need some one off script do automate some drudgery (take all files with certain pattern in their name, for each do some search and replace in the text inside, zip them, upload zip to URL, etc) I use python. When Im working on a multi-platform game I use c# (and unity). When I need to make something very lean that works in mobile browsers I use JS with some light-weight libraries.
Claude uses Shadcn-ui extensively in the web interface, to the point where I think it's been trained to use it over other UI components.
So I think you got lucky and you're asking it to write using a very specific code library that it's good at, because it happens to use it for it's main userbase on the web chat interface.
I wonder if you were using a different component library, or using Svelte instead of React, would you still find Claude the best?
I'm going to give you a video to watch. It's not mine, and I don't know much about this particular youtuber, but it really transformed how I think about writing and structuring the prompts I use, which solved problems similar to what you're describing here.
Cursor is also very user-unfriendly in providing alternative models to use in composer (agent). There's a heavy reliance on Anthrophic for cursor.
Try using Gemini thinking with Cursor. It barely works. Cmd-k outputs the thinking into the code. Its unusable in chat because the formatting sucks.
Is there some relationship between Cursor and Anthropic, i wonder. Plenty of other platforms seem very eager to give users model flexibility, but Cursor seems to be lacking.
Originally, actually there was a relationship between Cursor & OpenAI. Something like Cursor was supported by the OpenAI startup fund. So Cursor seems to have branched out. I think they are just emphasizing the models they find most effective. I'm surprised they haven't (apparently) incorporated Claude prompt caching yet for Sonnet.
My general workflow with ai so far has been this:
- I use copilot mostly for writing unit tests. It mostly works well since the unit tests follow a standard template.
- I use the chat one for alternating between different approaches and (in)validating certain approaches
My day job is a big monorepo, I have not investigated that yet but I believe the models context sizes fall short there and as such the above use cases only works for me.
o3 mini’s date cut-off is 2023, so it’s unfortunately not gonna be useful for anything that requires knowledge of recent framework updates, which includes probably all big frontend stuff.
I also have been less impressed by o1 in cursor compared to sonnet 3.5. Usually what I will do for a very complicated change is ask o1 to architect it, specifically asking it to give me a detailed plan for how it would be implemented, but not to actually implement anything. I then change the model to Sonnet 3.5 to have it actually do the implementation.
And on the side of not being able to get models to understand something specific, there’s a place in a current project where I use a special Unicode apostrophe during some string parsing because a third-party API needs it. But any code modifications by the AI to that file always replace it with a standard ascii apostrophe. I even added a comment on that line to the effect of “never replaced this apostrophe, it’s important to leave it exactly as it is!” And also put that in my cursor rules, and sometimes directly in the prompt as well, but it always replaces it even for completely unrelated changes. I’ve had to manually fix it like 10 times in the last day, it’s infuriating.
As subsequent models have been released, most of which claim to be better at coding, I've switched cursor to it to give them a try.
o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models suffer from the exact same "adhd." As an example, in a NextJS app, if I do a composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy."
sonnet nails it almost perfectly every time, but suffers from some date cutoff issues like thinking that shadcn-ui@latest is the repo name.
Every single other model, doesn't matter which, does the following: it starts writing (from scratch), radix-ui components. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" then begin writing its own components from scratch, again not using shadcn.
This is still problematic with o3-mini.
I can't believe it's the models. It must be the instruction-set that cursor is giving it behind the scenes, right? No amount of .cursorrules, or other instruction, seems to get cursor "locked in" the way sonnet just seems to be naturally.
It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.
Has anyone found a workaround?