Hacker News | itamarcode's comments

Unlike most SWE-bench submissions, the Qodo Command one uses the product directly.

I think the next step is getting an official "checked" mark from the SWE-bench team.


I feel like the bash-only SWE-bench Verified setup (i.e., model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.

https://github.com/SWE-agent/mini-swe-agent


There's swe-rebench, where they collect bugs/issues by date, and you can drag a slider on their leaderboard to see only issues filed after the model was released (which obviously only truly works for open models).


AI code review will very probably have signal-to-noise problems, so it's good to see practical solutions aimed at addressing this. I wonder whether fine-tuning models would help; that isn't addressed in the blog post.


So protecting models behind an API isn't working, huh?


While Sonnet-3.5 excels in accuracy per token, o1 excels at self-reflection on small, isolated tasks. With AlphaCodium, the work is broken into small isolated tasks for o1, while the flow introduced in AlphaCodium guides the steps and the overall decision-making framework.

We will see more of these frameworks for different use cases.


Did you get any response from users detecting this automation?


Hey, co-creator here. I agree with the sentiment that code coverage can be a proxy, and sometimes even a vanity metric, but at the same time, IMO unit regression tests are necessary for a maintainable production codebase. I personally don't feel confident making changes to production code that isn't tested.

Specifically for generating unit regression tests, the Cover-Agent tool already works quite well in the wild for some projects, especially isolated ones (as opposed to complex enterprise-level code). You can see in the few (somewhat cherry-picked) examples we posted [0] that it generates working tests that increase coverage (cherry-picked in the sense that these are projects we often work with internally at CodiumAI).

I believe it's possible to generate additional meaningful tests, including end-to-end tests, by creating a more sophisticated flow that uses prompting techniques like reflection on the code and existing tests, and that generates the tests iteratively, feeding errors and failures back to the LLM so it can fix them. This is somewhat similar, for example, to the approach we used with AlphaCodium [1], which hit 54% on the CodeContests benchmark (DeepMind's AlphaCode 2 hit 43% [2] with an equivalent number of LLM calls).

If, like me, you think tests are important but hate writing them, please consider contributing to the open-source project to help make it work better for more use cases: https://github.com/Codium-ai/cover-agent

[0] https://www.youtube.com/@Codium-AI/videos

[1] https://github.com/Codium-ai/AlphaCodium

[2] https://storage.googleapis.com/deepmind-media/AlphaCode2/Alp...


This is interesting. Are there specific use cases for which it works really well?


We are seeing a bunch of interest in scraping-related use cases and document processing (e.g., automatically processing invoices that are emailed in).


Hey, one of the creators here. As mentioned in the post, TestGen-LLM (by Meta) focused on Kotlin, and its prompts were very Kotlin-oriented. In Cover-Agent (by CodiumAI) we tried to reimplement Meta's work and stay mostly faithful to the original implementation, although we did improve the prompts a bit. But it isn't generic enough yet. We believe we know how to improve generality, as we did with our PR-Agent, and here is a rough plan: https://github.com/Codium-ai/cover-agent/issues/13


Can it consider an existing codebase, i.e., be inspired by it or even integrate with it?


Thanks a lot! I am part of the Amplication team; I'll try to answer your question. We do not currently READ any existing codebase. We can be used in scenarios where you want to modernize a legacy codebase into a modern service architecture. One quick way to do so is with a DB-first approach, using our schema import feature: introspect an existing DB and quickly build new services with the same data models, plus APIs for them. You can push new code into existing codebases, but we will not "know" about the existing code. If you want to create new services that integrate with your existing services, that is something Jovu can support if you have a spec (like an OpenAPI spec) for your existing services.


The user can choose the relevant context as follows: go to the relevant code part/snippet/file, select it, then press Ctrl+Shift+E, one by one.

In the near future, the agent will suggest these for you, after first indexing your codebase.

