I tried doing some vibe coding on a greenfield project (using gemini 2.5 pro + cline). On one hand - super impressive, a major productivity booster (even compared to using a non-integrated LLM chat interface).
I noticed that LLMs need a very heavy hand in guiding the architecture, otherwise they'll add architectural tech debt. One easy example is that I noticed them breaking abstractions (putting things where they don't belong). Unfortunately, there's not that much self-retrospection on these aspects if you ask about the quality of the code or if there are any better ways of doing it. Of course, if you pick up that something is in the wrong spot and prompt better, they'll pick up on it immediately.
I also ended up blowing through $15 of LLM tokens in a single evening. (Previously, as a heavy LLM user including coding tasks, I was averaging maybe $20 a month.)
Cline very visibly displays the ongoing cost of the task. Light edits are about 10 cents, and heavy stuff can run a couple of bucks. It's just that the tab accumulates faster than I expect.
> You can "Save" 1,000 hours every night, but you don't actuall get those 1,000 hours back.
What do you mean?
If I have some task that requires 1000 hours, and I'm able to shave it down to one hour, then I did just "save" 999 hours -- just in the same way that if something costs $5 and I pay $4, I saved $1.
My point is that saving 1,000 hours each day doesn't actually give you 1,000 hours a day to do things with.
You still get your 24 hours, no matter how much time you save.
What actually matters is the value of what is delivered, not how much time it saves you. Justifying costs by "time saved" is a good way to eat up your money on time-saving devices.
If I "save 1000 hours" then that could be distributed over 41.666 days, so no task would need to be performed during that period because "I saved 1000 hours".
You could also say you saved 41.666 people an entire 24 hour day, by "saving 1000 hours", or some other fractional way.
How you're trying to explain it as "saving 1000 hours each day" is really not making any sense without further context.
And I'm sure if I hadn't written this comment I would be saving 1000 hours on a stupid comment thread.
It's like those coupon booklets they used to sell. "Over $10,000 of savings!"
Yes but how much money do I have to spend in order to save $10,000?
There was this funny commercial in the 90s for some muffler repair chain that was having a promotion: "Save Fifty Dollars..."
The theme was "What will you do with the fifty dollars you saved?" And it was people going to Disneyland or a fancy dinner date.
The people (actors) believed they were receiving $50. They acted as if it was free money. Meanwhile there was zero talk about whether their cars needed muffler repair at all.
I think one issue is that you won't always be able to invoice those extra 999 hours to your customer. Sometimes you'll still only be able to get paid for 1 hour, depending on the task and contract.
But the LLM bill will always invoice you for all the saved work regardless.
(From a company's perspective, this is true.) As a developer, you may not be paid by the task -- if I finish something early, I start work on the next thing.
If you earn more than me, then if you value "time saved" then you should pay me to take my washing off me. Because then you can save even more of your valuable time!
The more of my washing you can take off me, the more of your time you can save by then using a washing machine or laundry service!
Saving an hour of my time is a waste, when saving an hour of your time is worth so much more. So it makes economic sense for you to pay me, to take my washing off me!
> LLMs are now being positioned as "let them work autonomously in the background"
The only people who believe this level of AI marketing are the people who haven't yet used the tools.
> which means no one will be watching the cost in real time.
Maybe some day there's an agentic coding tool that goes off into the weeds and runs for days doing meaningless tasks until someone catches it and does a Ctrl-C, but the tools I've used are more likely to stop short of the goal than to continue crunching indefinitely.
Regardless, it seems like a common experience for first-timers to try a light task and then realize they've spent $3, instantly setting expectations for how easy it is to run up a large bill if you're not careful.
I think that models are gonna commoditize, if they haven't already. The cost of switching over is rather small, especially when you have good evals on what you want done.
Also there's no way you can build a business without providing value in this space. Buyers are not that dumb.
I, too, recommend aider whenever these discussions crop up; it converted me from the "AI tools suck" side of this discussion to the "you're using the wrong tool" side.
I'd also recommend creating little `README`'s in your codebase that are mainly written with aider as the intended audience. In them, I'll explain the architecture, what code makes (non-)sense to write in this directory, and so on. Has the side-effect of being helpful for humans, too.
Nowadays when I'm editing with aider, I'll include the project README (which contains a project overview + pointers to other README's), and whatever README is most relevant to the scope of my session. It's super productive.
I've yet to find a model that beats the cost-effectiveness of Sonnet 3.7. I've tried the latest deepseek models, and while I love the price (nearly 50x cheaper?), it's just far too error-prone compared to Sonnet 3.7. It generates solid plans / architecture discussions, but, unlike Sonnet, the code it generates is often confidently off the mark.
I’d generally prefer comments in code. The README’s are relatively sparse and contain information that would be a bit too high-level for module or class-level comments. If commentary is specific to a module or class or method, the documentation belongs there. My rule of thumb is if the commentary helps you navigate and understand rules that apply to entire sets of modules rooted at `foo/`, it generally belongs in `foo/README`.
For example “this module contains logic defining routes for serving an HTTP API. We don’t write any logic that interacts directly with db models in these modules. Rather, these modules make calls to services in `/services`, which make such calls.”
It wouldn’t make sense to duplicate this comment across every router sub-module. And it’s not obvious from looking at any one module that this rule is applied across all modules, without such guidance.
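For illustration, here's a minimal sketch of what such a `routes/README` might contain for the routing example above; the wording is hypothetical, not lifted from a real project:

```
# routes/README

Modules in this directory define routes for serving the HTTP API.

- Don't write logic here that interacts directly with db models.
- Instead, call the services in `/services`; they make those calls.
- Keep request parsing and response shaping here; business rules belong in services.
```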
These little bits of scaffolding really help narrow down the scope of the code that LLMs eventually try to write.
My tool Plandex[1] allows you to switch between automatic and manual context management. It can be useful to begin a task with automatic context while scoping it out and making the high level plan, then switch to the more 'aider-style' manual context management once the relevant files are clearly established.
I loathe using AI in a greenfield project. There are simply too many possible paths, so it seems to randomly switch between approaches.
In a brownfield code base, I can often provide it reference files to pattern match against. So much easier to get great results when it can anchor itself in the rest of your code base.
The trick for greenfield projects is to use it to help you design detailed specs and a tentative implementation plan. Just bounce some ideas off of it, as with a somewhat smarter rubber duck, and hone the design until you arrive at something you're happy with. Then feed the detailed implementation plan step by step to another model or session.
This is a popular workflow I first read about here[1].
This has been the most useful use case for LLMs for me. Actually getting them to implement the spec correctly is the hard part, and you'll have to take the reins and course correct often.
This seems like a good flow! I end up adding a "spec" and "todo" file for each feature[1]. This allows me to flesh out some of the architectural/technical decisions in advance and keep the LLM on the rails when the context gets very long.
Yeah, I limit context by regularly trimming the TODOs. I like having 5-6 in one file because it sometimes informs the LLM as to how to complete the first in a way that makes sense for the follow-ups.
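For illustration, a hypothetical per-feature TODO file in that style; the items and file path are made up:

```
# features/search/TODO

1. Add a /search endpoint backed by the documents table   <- current task
2. Add limit/offset pagination to /search
3. Add an index on documents.title to keep the query fast
4. Wire search results into the web UI
5. Allow exporting search results as CSV
```

The first item is the one actually being worked on; the rest exist mostly so the model can see where the feature is heading.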
READMEs per module also help, but it really depends a lot on the model. Gemini will happily traipse all over your codebase at random, gpt-4.1 will do inline imports inside functions because it seems to lack any sort of situational awareness, Claude so far gets things mostly right.
While it's being touted for greenfield projects, I've noticed a lot of failures when it comes to bootstrapping a stack.
For example, it (Gemini 2.5) really struggles with newer ecosystems like FastAPI when wiring libraries like SQLAlchemy, pytest, Playwright for Python, etc., together.
I find more value in bootstrapping myself, and then using it to help with boilerplate once an effective safety harness is in place.
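For context, this is roughly the kind of wiring involved -- a minimal sketch assuming SQLAlchemy 2.x and FastAPI's dependency-injection pattern, with a pytest harness that overrides the DB. Names like `Visitor` and `get_db` are illustrative, not from any comment above:

```python
# app.py -- FastAPI + SQLAlchemy 2.x, session-per-request via a dependency.
from fastapi import Depends, FastAPI
from pydantic import BaseModel
from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column, sessionmaker

engine = create_engine("sqlite:///app.db", connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(bind=engine, autoflush=False)

class Base(DeclarativeBase):
    pass

class Visitor(Base):
    __tablename__ = "visitors"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))

Base.metadata.create_all(engine)
app = FastAPI()

class VisitorIn(BaseModel):
    name: str

def get_db():
    db = SessionLocal()  # one session per request
    try:
        yield db
    finally:
        db.close()

@app.post("/visitors")
def create_visitor(payload: VisitorIn, db: Session = Depends(get_db)):
    visitor = Visitor(name=payload.name)
    db.add(visitor)
    db.commit()
    db.refresh(visitor)
    return {"id": visitor.id, "name": visitor.name}

# test_app.py -- pytest "safety harness": swap the real DB for in-memory SQLite.
from fastapi.testclient import TestClient
from sqlalchemy.pool import StaticPool

test_engine = create_engine(
    "sqlite://", connect_args={"check_same_thread": False}, poolclass=StaticPool
)
Base.metadata.create_all(test_engine)
TestSession = sessionmaker(bind=test_engine)

def override_get_db():
    db = TestSession()
    try:
        yield db
    finally:
        db.close()

app.dependency_overrides[get_db] = override_get_db
client = TestClient(app)

def test_create_visitor():
    resp = client.post("/visitors", json={"name": "Ada"})
    assert resp.status_code == 200
    assert resp.json()["name"] == "Ada"
```

The dependency override is the piece that lets the tests run without touching the real database, which is the kind of "safety harness" that makes letting a model loose on the code less risky.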
I've vibe coded a small project as well using Claude Code. It's for visitor registration at the company. Simple project: one form, a couple of checkboxes, everything stored in SQLite, plus an endpoint for getting an .xlsx export.
The initial cost was around $20, which later grew to $40 (mostly polishing) with some manual work.
I intentionally picked a simple stack: html+js+php.
A couple of things:
* I'd say I'm happy with the result from a product perspective
* The codebase could be better, but I could not care less in this case
* By default, AI does not care about security unless I specifically tell it to
* Claude insisted on using old libs. When I specifically told it to use the latest and greatest, it upgraded them but left code that only works with an old version. It also mixed the latest DaisyUI with some old version of tailwindcss :)
On one hand it was super easy and fun to do; on the other hand, if I were a junior engineer, I bet it would have cost more.
If you want to use Cline and are at all price sensitive (in these ranges), you have to do manual context management just for that reason. I find that too cumbersome and use Windsurf (currently with Gemini 2.5 pro) for that reason.
I think it's just that it's not end-to-end trained on architecture because the horizon is too short. It doesn't have the context length to learn the lessons that we do about good design.
> I noticed that LLMs need a very heavy hand in guiding the architecture, otherwise they'll add architectural tech debt. One easy example is that I noticed them breaking abstractions
That doesn’t matter anymore when you’re vibe coding it. No human is going to look at it anyway.
It can all be if/else on one line in one file. If it works, and if the LLMs can work with it, iterate, and implement new business requirements while keeping performance and security, then code structure, quality, and readability don't matter one bit.
Customers don’t care about code quality and the only reason businesses used to care is to make it less money consuming to build and ship new things, so they can make more money.
This is a common view, and I think it will be the norm in the near-to-mid term, especially for basic CRUD apps and websites. Context windows are still too small for anything even slightly complex (I think we need to be at about 20M tokens before we start matching human levels), but we'll be there before you know it.
Engineers will essentially become people who just guide the AIs and verify tests.
Have you ever tried to get those little bits of styrofoam completely off of a cardboard box? Have you ever seen something off in the distance and misjudged either what it was or how long it would take to get there?
Hook up something like Taskmaster or Shrimp, so that they can document as they go along and retrieve relevant context when they overflow their context window, to avoid this issue.
Then, as the context window increases, it's less and less of an issue.
Nope - I use a-la-carte pricing (through openrouter). I much prefer it over a subscription, as there are zero limits, I pay only for what I use, and there is much less of a walled garden (I can easily switch between Anthropic, Google, etc).
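To make the "easily switch" part concrete, here's a rough sketch using OpenRouter's OpenAI-compatible endpoint; the model slugs are illustrative and should be checked against OpenRouter's current catalog:

```python
# Sketch: same client, different providers, just by changing the model string.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # OpenRouter key, pay-as-you-go
)

for model in ("anthropic/claude-3.7-sonnet", "google/gemini-2.5-pro-preview"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Review this function for bugs: ..."}],
    )
    print(model, resp.choices[0].message.content[:100])
```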