I feel like bash-only SWE-bench Verified (i.e., model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.
There's also swe-rebench, which collects bugs/issues by date; you can drag a slider on their leaderboard to show only issues filed after a model was released (which obviously only truly works for open models).
AI code review will very likely have signal-to-noise problems.
It is good to see practical solutions aimed at addressing this.
I wonder whether fine-tuning models would help; that isn't addressed in the blog post.
While Sonnet-3.5 excels in accuracy per token, o1 excels at self-reflection on small, isolated tasks. With AlphaCodium, the problem is broken into small isolated tasks for o1, while the flow introduced in AlphaCodium guides the steps and the overall decision-making framework.
We will see more of these frameworks for different use cases.
Hey, co-creator here. I agree with the sentiment that code coverage can be a proxy, and sometimes even a vanity metric, but at the same time, IMO unit regression tests are necessary for a maintainable production codebase. I personally don’t feel confident making changes to production code that isn’t tested.
Specifically for generating unit regression tests, the Cover-Agent tool already works quite well in the wild for some projects, especially isolated ones (as opposed to complex enterprise-level code). You can see in the few (somewhat cherry-picked) examples we posted [0] that it generates working tests that increase coverage (cherry-picked in the sense that these are projects we often work with internally at CodiumAI).
I believe it’s possible to generate additional meaningful tests, including end-to-end tests, with a more sophisticated flow: use prompting techniques like reflection on the code and the existing tests, generate tests iteratively, and feed errors and failures back to the LLM so it can fix them. This is similar to the approach we used with AlphaCodium [1], which hit 54% on the CodeContests benchmark (DeepMind’s AlphaCode 2 hit 43% [2] with an equivalent number of LLM calls).
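As a rough illustration of that generate-run-repair loop (this is a minimal sketch, not Cover-Agent's actual implementation; `call_llm` and `run_tests` are hypothetical hooks you'd supply), the flow might look like:

```python
import subprocess
from typing import Callable, Tuple


def run_pytest(test_file: str) -> Tuple[bool, str]:
    """Run the generated tests with pytest; return pass/fail plus output."""
    result = subprocess.run(["pytest", test_file], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def generate_tests(source_code: str,
                   existing_tests: str,
                   call_llm: Callable[[str], str],
                   run_tests: Callable[[str], Tuple[bool, str]],
                   test_file: str = "test_generated.py",
                   max_iterations: int = 3) -> bool:
    """Iteratively ask the LLM for tests, run them, and feed failures back."""
    prompt = (
        "Reflect on this code and its existing tests, then write new unit "
        f"tests that increase coverage:\n{source_code}\n{existing_tests}"
    )
    for _ in range(max_iterations):
        candidate = call_llm(prompt)
        with open(test_file, "w") as f:
            f.write(candidate)
        passed, output = run_tests(test_file)
        if passed:
            return True  # keep only tests that actually run and pass
        # feed the errors back so the model can repair its own tests
        prompt = (f"These tests failed:\n{candidate}\n"
                  f"Errors:\n{output}\nPlease fix them.")
    return False
```

Only tests that execute and pass survive the loop, which is what makes the generated tests trustworthy rather than coverage-padding.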
If, like me, you think tests are important but hate writing them, please consider contributing to the open-source project to help make it work better for more use cases.
https://github.com/Codium-ai/cover-agent
Hey, one of the creators here.
As mentioned in the post, TestGen-LLM (by Meta) focused on Kotlin, and the prompts were very Kotlin-oriented.
In Cover-Agent (by CodiumAI) we tried to reimplement Meta's work and stay mostly faithful to the original implementation, although we improved the prompts a bit. But it still isn't generic enough.
We believe we know how to improve generality, as we did with our PR-Agent, and here is a rough plan:
https://github.com/Codium-ai/cover-agent/issues/13
Thanks a lot! I am part of the Amplication team; I'll try to answer your question. We do not currently READ any existing codebase. Amplication can be used for scenarios where you want to modernize a legacy codebase into a modern service architecture. One quick way to do so is with a DB-first approach, using our schema-import feature (introspect an existing DB and quickly build new services with the same data models and APIs for them).
You can push new code into existing codebases, but we will not "know" about the existing code. If you want to create new services that integrate with your existing services, Jovu can support that if you have a spec (like an OpenAPI spec) for your existing services.
I think the next step is getting an official "checked" mark from the SWE-bench team.