Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases. (think of it as a code-based version of the text needle-in-the-haystack test)
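For a concrete picture, here's a minimal sketch of how a BICS-style test case can be assembled. The filler functions, the missing-colon bug, and the prompt wording below are illustrative examples, not the exact generators in our repo:

    # Assemble a "haystack" of valid filler functions and plant one
    # syntactic bug (the "needle") at a chosen depth in the source.
    FILLER = "def helper_{i}(x):\n    return x + {i}\n"  # syntactically valid filler
    BUGGY = "def target_{i}(x)\n    return x * {i}\n"    # missing colon = the needle

    def build_prompt(num_functions: int, bug_depth: float) -> str:
        """bug_depth runs from 0.0 (top of the source) to 1.0 (bottom)."""
        blocks = [FILLER.format(i=i) for i in range(num_functions)]
        insert_at = int(bug_depth * num_functions)
        blocks.insert(insert_at, BUGGY.format(i=insert_at))
        source = "\n".join(blocks)
        return (
            "The following Python source contains exactly one syntax error. "
            "Reply with the name of the function that contains it.\n\n" + source
        )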
GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than with text-based tasks at long context lengths.
GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models. Gemini-1.0-Pro performed the worst, surprisingly worse than Llama3-70B.
Generally, longer context lengths resulted in lower accuracy, though there were some exceptions.
Models also reacted differently to where the bug was placed within the source code: GPT-3.5-Turbo and Claude 3 Opus were the most sensitive to placement, while the GPT-4 series was the least sensitive. Generally, lower sensitivity means a more robust model.
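For reference, the accuracy numbers come from sweeping context length and bug placement. A rough sketch of that loop, reusing build_prompt from above; ask_model is a placeholder for a real LLM API call, and the sweep values here are made up:

    def ask_model(prompt: str) -> str:
        raise NotImplementedError  # swap in a real LLM call here

    results = {}
    for num_functions in (50, 200, 800):           # rough proxy for context length
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # where the bug is planted
            answer = ask_model(build_prompt(num_functions, depth))
            needle = f"target_{int(depth * num_functions)}"
            results[(num_functions, depth)] = needle in answer  # True = bug found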
This benchmark has lots of limitations. I would love your feedback & suggestions on how we can make it more useful!
I'm a bit disappointed by the Llama3 results; I was hoping it would be much better than GPT-3.5.
Are you planning to do other LLM coding benchmarks in the future?
Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack