I've seen a few studies (sorry, no citations handy, if I have time I'll revisit this comment) which concluded that above some fairly low floor, bugginess is strongly positively correlated with file length.
Which is to say, always flagging the top 5% longest files as being among the buggiest has a good chance of being the right thing to do.
That could be true, but my point was that even if a 1000 line file has a 1% chance of a bug per line, it will be marked "buggy", while a 50 line file with a 10% chance of a bug per line won't.
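(A quick sketch of that arithmetic, assuming the score is just the expected bug count per file; the per-line percentages are the made-up ones above:)

    // Hypothetical: expected bugs under a raw per-file count.
    const expectedBugs = (lines: number, bugChancePerLine: number): number =>
      lines * bugChancePerLine;

    expectedBugs(1000, 0.01); // 10 expected bugs -> flagged
    expectedBugs(50, 0.10);   //  5 expected bugs -> not flagged,
                              //    despite a 10x worse per-line rate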
It makes no sense that if you took that 1000 line file and refactored it into 10 files, and those 10 files had exactly the same bugs as the original, the 1000 line file would be flagged as buggy but none of the 10 split files would be.
Not necessarily, because maybe refactoring your code to be clean enough to fit into ten files also makes it clean enough to remove bugs. And/or hairy code doesn't end up in short files.
Cheating the algorithm is indeed a possibility and is actually very easy (see the sketch after this list):
1. Just don't file bug tickets.
2. If you do, don't attach the tickets to changes.
3. If you attach the tickets to changes, don't put the "bug" type on the ticket.
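To make that concrete, here's a rough sketch of why all three dodges work, assuming the score simply counts changes linked to bug-typed tickets per file (the real pipeline's details aren't in the post, and all names here are hypothetical):

    interface Change {
      file: string;
      ticketId?: string; // dodge 2: no ticket attached to the change
    }

    interface Ticket {
      id: string;
      type: "bug" | "feature" | "task"; // dodge 3: ticket not typed "bug"
    }

    // Count, per file, the changes linked to a bug-typed ticket.
    // A bug that was never filed (dodge 1) never shows up at all.
    function bugScore(changes: Change[], tickets: Map<string, Ticket>): Map<string, number> {
      const scores = new Map<string, number>();
      for (const change of changes) {
        if (!change.ticketId) continue;                 // dodge 2 skips this
        const ticket = tickets.get(change.ticketId);
        if (!ticket || ticket.type !== "bug") continue; // dodges 1 and 3 skip this
        scores.set(change.file, (scores.get(change.file) ?? 0) + 1);
      }
      return scores;
    }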
Doing things to deliberately cheat the algorithm is likely not going to pass code review.
One common misconception is that the algorithm is a stick to beat developers with. It really, really isn't, and I go to great lengths to try and make this clear in the internal docs (although the blog post doesn't do this as much). The idea is just to provide another insight into the code, so that devs don't accidentally change something that's a real problem without help.
The algorithm is just a tool to help developers understand their code. The score is admittedly subjective: different teams use the bug tracker differently, so a score from a file in one project can't be compared to a score from a file in another project. We mitigate this by only flagging the top 10% of code across the entire codebase, so hopefully we by and large only flag the worst offenders. Part of what I wanted to do was to set up a web app so teams could claim various parts of the codebase, and then run the algorithm on a team-by-team basis instead, but I ran out of time on that.
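(For what it's worth, the top-10% cutoff is just a percentile over all scored files; something like this hypothetical sketch:)

    // Hypothetical: flag only the top 10% of files by score.
    function flagTopDecile(scores: Map<string, number>): string[] {
      const ranked = [...scores.entries()].sort((a, b) => b[1] - a[1]);
      const cutoff = Math.ceil(ranked.length * 0.10);
      return ranked.slice(0, cutoff).map(([file]) => file);
    }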
I think this is actually one of the nice things about the algorithm, because I don't like the idea of implicitly pitting people or teams against one another. I just like that people can go "Oh hey, this has been a real problem for us, let's tread carefully here."
I don't think the issue with the algorithm is people deliberately trying to cheat it. The problem is that files that are, by any empirical measure, more bug-free end up with higher scores.
You could have two very similar pieces of code, one of which lives in a single 1000 line file. Imagine this code gets 1 bug per day.
Now imagine the other code, functionally identical, happens to be broken up into 10 files, and that 9 bugs per day land on this unit of code. The broken-up code is essentially the same and is 9x buggier, but it will be scored lower, because each file only gets 0.9 bugs per day on average.
Do you disagree that the code in the second example would actually be way "trickier" and deserves a flag much more than the first one? I understand that one requirement is that the algorithm stays clear to developers, but it seems like you could easily take the ratio of bug fixes to total commits on a file, or normalize by file length, if you wanted to actually get a reasonable metric of how "tricky" or dangerous it is to edit a file.
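(Here's a hypothetical sketch of both normalizations next to the raw count; the numbers mirror the 1-vs-9-bugs-per-day example above, and FileStats and its fields are made up:)

    interface FileStats {
      path: string;
      lines: number;
      totalCommits: number;
      bugFixCommits: number;
    }

    // Raw count: what this thread says the algorithm effectively ranks on.
    const rawScore = (f: FileStats) => f.bugFixCommits;

    // Normalization 1: bug fixes as a fraction of all commits to the file.
    const bugFixRatio = (f: FileStats) =>
      f.totalCommits === 0 ? 0 : f.bugFixCommits / f.totalCommits;

    // Normalization 2: bug fixes per line of code.
    const bugsPerLine = (f: FileStats) => f.bugFixCommits / f.lines;

    // One 1000-line file with 30 bug fixes in 100 commits...
    const big: FileStats = { path: "big.js", lines: 1000, totalCommits: 100, bugFixCommits: 30 };
    // ...vs one of ten 100-line files, each with 27 bug fixes in 30 commits
    // (10 x 27 = 270 total: 9x the bugs for functionally identical code).
    const small: FileStats = { path: "split-1.js", lines: 100, totalCommits: 30, bugFixCommits: 27 };

    rawScore(big);      // 30   -> flagged
    rawScore(small);    // 27   -> each split file slips under the big one
    bugFixRatio(big);   // 0.30
    bugFixRatio(small); // 0.90 -> the split code correctly looks trickier
    bugsPerLine(big);   // 0.03
    bugsPerLine(small); // 0.27 -> same conclusion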
What you haven't thought about is that when you do refactor the 1000 line file into 10 files, that refactor would necessarily have made each piece slightly simpler (by virtue of it being smaller). Hence, it is likely that bugs don't get introduced as easily, simply because there is less to 'worry' about when writing code that carries a smaller cognitive load. So maybe the metric isn't as wrong as you initially thought.
No, you can literally just take 10 JavaScript functions and put them in 10 different files, or in one file, with absolutely no modification at all. Why would one get flagged and the other not?
The point is simply that a unit of code with X bugs per day can score lower than code with far fewer bugs, as long as the buggier code is spread across multiple shorter files.