I feel like bash-only SWE-bench Verified (i.e., model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.
There's also swe-rebench, which collects bugs/issues by date; you can drag a slider on their leaderboard to show only issues filed after a model was released (which obviously only truly works for open models).
AI code review will very likely have signal-to-noise problems.
It is good to see practical solutions aimed at addressing this.
I wonder whether fine-tuning models would help; that isn't addressed in the blog post.
While Sonnet-3.5 excels in accuracy per token, o1 excels at self-reflection on small, isolated tasks. With AlphaCodium, the problem is broken into small isolated tasks for o1, while the flow introduced in AlphaCodium guides the steps and the overall decision-making framework.
We will see more of these frameworks for different use cases.
Hey, co-creator here. I agree with the sentiment that code coverage can be a proxy, and sometimes even a vanity metric, but at the same time, IMO unit regression tests are necessary for a maintainable production codebase. I personally don’t feel confident making changes to production code that isn’t tested.
Specifically for generating unit regression tests, the Cover-Agent tool already works quite well in the wild for some projects, especially isolated ones (as opposed to complex enterprise-level code). You can see in the few (somewhat cherry-picked) examples we posted [0] that it generates working tests that increase coverage (cherry-picked in the sense that these are projects we often work with internally at CodiumAI).
I believe it’s possible to generate additional meaningful tests, including end-to-end tests, with a more sophisticated flow: use prompting techniques like reflection on the code and the existing tests, generate tests iteratively, and feed errors and failures back to the LLM so it can fix them. This is similar to the approach we used with AlphaCodium [1], which hit 54% on the CodeContests benchmark (DeepMind’s AlphaCode 2 hit 43% [2] with an equivalent number of LLM calls).
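As a rough illustration of that generate-run-repair loop (this is a minimal sketch, not Cover-Agent's actual implementation; `call_llm` and `run_tests` are hypothetical hooks you'd supply), the flow might look like:

```python
import subprocess
from typing import Callable, Tuple


def run_pytest(test_file: str) -> Tuple[bool, str]:
    """Run the generated tests with pytest; return pass/fail plus output."""
    result = subprocess.run(["pytest", test_file], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def generate_tests(source_code: str,
                   existing_tests: str,
                   call_llm: Callable[[str], str],
                   run_tests: Callable[[str], Tuple[bool, str]],
                   test_file: str = "test_generated.py",
                   max_iterations: int = 3) -> bool:
    """Iteratively ask the LLM for tests, run them, and feed failures back."""
    prompt = (
        "Reflect on this code and its existing tests, then write new unit "
        f"tests that increase coverage:\n{source_code}\n{existing_tests}"
    )
    for _ in range(max_iterations):
        candidate = call_llm(prompt)
        with open(test_file, "w") as f:
            f.write(candidate)
        passed, output = run_tests(test_file)
        if passed:
            return True  # keep only tests that actually run and pass
        # feed the errors back so the model can repair its own tests
        prompt = (f"These tests failed:\n{candidate}\n"
                  f"Errors:\n{output}\nPlease fix them.")
    return False
```

Only tests that execute and pass survive the loop, which is what makes the generated tests trustworthy rather than coverage-padding.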
If, like me, you think tests are important but hate writing them, please consider contributing to the open-source project to help make it work better for more use cases.
https://github.com/Codium-ai/cover-agent
Hey, one of the creators here.
As mentioned in the post, TestGen-LLM (by Meta) focused on Kotlin, and the prompts were very Kotlin-oriented.
In Cover-Agent (by CodiumAI) we tried to reimplement Meta's work and stay mostly faithful to the original implementation, although we improved the prompts a bit. But it still isn't generic enough.
We believe we know how to improve generality, as we did with our PR-Agent, and here is a rough plan:
https://github.com/Codium-ai/cover-agent/issues/13
Thanks a lot! I am part of the Amplication team; I'll try to answer your question. We do not currently READ any existing codebase. Amplication can be used for scenarios where you want to modernize a legacy codebase into a modern service architecture. One quick way to do so is with a DB-first approach, using our schema-import feature (introspect an existing DB and quickly build new services with the same data models and APIs for them).
You can push new code into existing codebases, but we will not "know" about the existing code. If you want to create new services that integrate with your existing services, Jovu can support that if you have a spec (like an OpenAPI spec) for your existing services.
I think the next step is getting an official "checked" mark from the SWE-bench team.