My experience with cursor and sonnet is that it is relatively good at first tries, but completely misses the plot during corrections.
"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"
It keeps doing that kind of shenanigans, applying modifications that solve the newly appearing problem while screwing the original attempt's goal.
I usually get much better results from regular chatgpt copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding what I think is being forgotten.
Claude makes a lot of crappy change suggestions, but when you ask "is that a good suggestion?" it's pretty good at judging when it isn't. So that's become standard operating procedure for me.
It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.
I'm always asking Claude to propose a variety of suggestions for the problem at hand and their trade-offs, then evaluating them for the top three proposals and why. Then I'll pick one of them and further vet the idea
>It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.
Absolutely, I find this a challenge as well. Every thought that crosses my mind is a great idea according to it. That's the opposite attitude to what I want from an engineer's copilot! Particularly from one who also advices junior devs.
More than once I've found myself going down this 'little maze of twisty passages, all alike'. At some point I stop, collect up the chain of prompts in the conversation, and curate them into a net new prompt that should be a bit better. Usually I make better progress - at least for a while.
This becomes second nature after a while. I've developed an intuition about when a model loses the plot and when to start a new thread. I have a base prompt I keep for the current project I'm working on, and then I ask the model to summarize what we've done in the thread and combine them to start anew.
I can't wait until this is a solved problem because it does slow me down.
What do you find difficult about distilling your own prompts?
After any back and forth session I have reasonably good results asking something like "Given this workflow, how could I have prompted this better from the start to get the same results?"
For my advanced use case involving Python and knowledge of finance, Sonnet fared poorly. Contrary to what I am reading here, my favorite approach has been to use o1 in agent mode. It’s an absolute delight to work with. It is like I’m working with a capable peer, someone at my level.
Sadly there are some hard limits on o1 with Cursor and I cannot use it anymore. I do pay for their $20/month subscription.
How? It specifically tells me this is unsupported: "Agent composer is currently only supported using Anthropic models or GPT-4o, please reselect the model and try again."
I think you’re right - I must have used it in regular mode, then got GPT-4o to fill in the gaps. It can fully automate a lot of menial work, such as refactors and writing tests. Though I’ll add, I had a roughly 50% success with GPT-4o bug fixing in agent mode, which is pretty great in my experience. When it did work, it felt glorious - 100% hands-free operation!
It seems like you could use aider in architecture mode. Basically, it will suggest the solution to your problem fist, and prompt you to start editing, you can say no to refine the solution and only start editing when you are satisfied with it.
Hah, I was trying it the other day in a Go project and it did exactly the same thing. I couldn’t believe my eyes, it basically rewrote all the functions back out in the test file but modified slightly so the thing that was failing wouldn’t even run.
Yes, but for some reason it seems to perform worse there.
Perhaps whatever algorithms Cursor uses to prepare the context it feeds the model are a good fit for Claude but not so much for the others (?). It's a random guess, but whatever the reason, there's a weird worsening of performance vs pure chat.
Yes but every model besides claude-3.5-sonnet sucks in Cursor, for whatever reason. They might as well not even offer the other models. The other models, even "smarter" models, perform vastly poorer or don't support agent capability or both.
"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"
It keeps doing that kind of shenanigans, applying modifications that solve the newly appearing problem while screwing the original attempt's goal.
I usually get much better results from regular chatgpt copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding what I think is being forgotten.