Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases. (think of it as a code-based version of the text needle-in-the-haystack test)
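For a concrete picture, here's a minimal sketch of how a BICS-style test case can be assembled. The filler functions, the missing-colon bug, and the prompt wording below are illustrative examples, not the exact generators in our repo:

    # Assemble a "haystack" of valid filler functions and plant one
    # syntactic bug (the "needle") at a chosen depth in the source.
    FILLER = "def helper_{i}(x):\n    return x + {i}\n"  # syntactically valid filler
    BUGGY = "def target_{i}(x)\n    return x * {i}\n"    # missing colon = the needle

    def build_prompt(num_functions: int, bug_depth: float) -> str:
        """bug_depth runs from 0.0 (top of the source) to 1.0 (bottom)."""
        blocks = [FILLER.format(i=i) for i in range(num_functions)]
        insert_at = int(bug_depth * num_functions)
        blocks.insert(insert_at, BUGGY.format(i=insert_at))
        source = "\n".join(blocks)
        return (
            "The following Python source contains exactly one syntax error. "
            "Reply with the name of the function that contains it.\n\n" + source
        )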
GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than with text-based tasks at long context lengths.
GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models. Gemini-1.0-Pro performed the worst, surprisingly worse than Llama3-70B.
Generally, longer context lengths resulted in lower accuracy, though there were some exceptions.
Models also reacted differently to where the bug was placed within the source code: GPT-3.5-Turbo and Claude 3 Opus were the most sensitive to placement, while the GPT-4 series was the least sensitive. Generally, lower sensitivity means a more robust model.
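For reference, the accuracy numbers come from sweeping context length and bug placement. A rough sketch of that loop, reusing build_prompt from above; ask_model is a placeholder for a real LLM API call, and the sweep values here are made up:

    def ask_model(prompt: str) -> str:
        raise NotImplementedError  # swap in a real LLM call here

    results = {}
    for num_functions in (50, 200, 800):           # rough proxy for context length
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # where the bug is planted
            answer = ask_model(build_prompt(num_functions, depth))
            needle = f"target_{int(depth * num_functions)}"
            results[(num_functions, depth)] = needle in answer  # True = bug found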
This benchmark has lots of limitations. I would love your feedback & suggestions on how we can make it more useful!
I'm a bit disappointed by the Llama3 results; I was hoping it would be much better than GPT-3.5.
Are you planning to do other LLM coding benchmarks in the future?
Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack