Some take "line of code" literally. Others see it as a stand-in for complexity. Because one can certainly make a short program that's far more of a liability than a long one.
I'm wondering if there's a quality term for a "unit of complexity" in code? Like when a single line has 4 or 5 "ideas" in it.
There are many metrics for measuring the logical complexity of code. One of the earliest and best-known is Cyclomatic Complexity [0]: the more possible execution paths, the higher the score. You can find more at Software Metric [1].
If there is a useful metric, then I haven't found it yet. The data on cyclomatic complexity[1] does not convince me, and in practice, the linters I've seen have complained about boring code like this:
switch (foo) {
case KIND_1: return parseObjectOfKind1();
case KIND_2: return parseObjectOfKind2();
case KIND_3: return parseObjectOfKind3();
...
case KIND_15: return parseObjectOfKind5();
case KIND_16: return parseObjectOfKind16();
}
There are 16 paths and yet this code is easy to follow. There is no substitute for human judgement (code reviews).
Their point seems to be in the mismatch of names. Imagine seeing this sequence, with names instead of numbers:
switch (HTTP_METHOD) {
case PUT: processPut(...); break;
case POST: processPost(...); break;
case GET: processDelete(...); break; // wtf
case DELETE: processGet(...); break; // mate?
}
To the reader that appears to be an error even if it is precisely the thing you want to happen.
This is a basic DRY violation. The number is duplicated, and the copies can get out of sync (as in the example).
I think everyone has coded this way out of expedience and some of us have eventually messed it up too. But you could, for example, use a macro in C to only list the number once. It might not be a trade-off worth making though.
Personally, I've used reflection to do this kind of mapping of enums to functions in a language that supports reflection.
The code was inspired by a piece of Java code that rendered geometric primitives. It was basically case "box": return parseBox(); as you suggested in your other post.
Turning "box" into "parseBox" using string operations and then using reflection to call the right method is an approach I'd consider in Ruby, but not in Java. It breaks IDE features like "Find Usages" and static analysis, and the code to dynamically invoke a Java method is more annoying to review than the boring repetition in my posted snippet.
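To make the trade-off concrete, here is roughly what that name-derived reflective dispatch looks like in Java (class and method names are invented for illustration, not from the original code):

```java
import java.lang.reflect.Method;

public class ReflectiveParser {
    // Hypothetical per-kind parsers; the real code had one per primitive.
    public String parseBox() { return "box parsed"; }
    public String parseSphere() { return "sphere parsed"; }

    // Derive "box" -> "parseBox" with string operations, then invoke it
    // reflectively. Note that nothing here names parseBox at the call site,
    // which is exactly why "Find Usages" and static analysis go blind.
    public String parse(String kind) {
        String name = "parse"
                + Character.toUpperCase(kind.charAt(0))
                + kind.substring(1);
        try {
            Method m = getClass().getMethod(name);
            return (String) m.invoke(this);
        } catch (ReflectiveOperationException e) {
            // A typo in a kind string only surfaces here, at runtime.
            throw new IllegalArgumentException("no parser for " + kind, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new ReflectiveParser().parse("box"));
    }
}
```

The try/catch and cast are the "more annoying to review" part: the boring switch fails at compile time, this fails at runtime.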
> This is a basic DRY violation. The number is duplicated, and the copies can get out of sync (as in the example).
I think it's pretty clear that the number is a placeholder for something reasonable, e.g. making an association between two distinct sets of concepts. You'll still be vulnerable to copy-paste or typo issues.
> Personally, I've used reflection
Now you have two problems (and still have to maintain an association between two sets of concepts).
I don't think most enum-to-function mappings are distinct sets of concepts.
In my own code, I have used the enum name to map to a set of functions related to that enum value. The association is implicit in the name of the enum value and the name of the functions. No way to mess that up like this.
Depends on your language, but you could use the type system instead; e.g. the function is chosen by the compiler based on the type of the value.
In more dynamic languages, you could probably use introspection.
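To sketch the type-system option in Java (the Shape/Box/Sphere names are made up): each variant carries its own behavior, so virtual dispatch picks the right method and there is no enum-to-function table to get out of sync.

```java
// Each variant implements its own parse(), so the dispatch decision is
// made by the type system, not by a hand-maintained switch or map.
interface Shape {
    String parse();
}

class Box implements Shape {
    public String parse() { return "parsed box"; }
}

class Sphere implements Shape {
    public String parse() { return "parsed sphere"; }
}

public class TypeDispatch {
    public static void main(String[] args) {
        Shape s = new Box();
        System.out.println(s.parse()); // virtual dispatch, no switch
    }
}
```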
Lastly, this does not alleviate the association issue, but I prefer the alternative of declaring the associations in an array/map somewhere, and using map[enum_value]() instead of the switch.
In a lot of ways, I think the map solution is actually worse. There isn't an easy way to figure out what code is used for a given enum value with the map alone. The searchability depends on how the map is initialized. Not to mention a map will always introduce null or option types that you have to handle.
A map can be a good solution though, particularly if it's something like mapping enum values to strings that is constructed via EnumClass.values().map... or something so that you know the map is a total function.
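One way to get that totality guarantee in Java (names hypothetical) is an EnumMap plus a startup check against values(), so an unmapped enum value fails fast at class load instead of producing a null at lookup time:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Supplier;

public class TotalMap {
    enum Kind { BOX, SPHERE }

    // Hypothetical per-kind parsers.
    static String parseBox() { return "parsed box"; }
    static String parseSphere() { return "parsed sphere"; }

    static final Map<Kind, Supplier<String>> PARSERS = new EnumMap<>(Kind.class);
    static {
        PARSERS.put(Kind.BOX, TotalMap::parseBox);
        PARSERS.put(Kind.SPHERE, TotalMap::parseSphere);
        // Verify the map is total: adding a Kind without a parser fails
        // immediately at startup, not on some rare code path later.
        for (Kind k : Kind.values()) {
            if (!PARSERS.containsKey(k)) {
                throw new IllegalStateException("unmapped: " + k);
            }
        }
    }

    static String parse(Kind k) {
        // No null handling needed here: totality was checked above.
        return PARSERS.get(k).get();
    }

    public static void main(String[] args) {
        System.out.println(parse(Kind.BOX));
    }
}
```

It still isn't a compile-time check the way an exhaustive switch can be, but it narrows the window for the null/option problem considerably.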
Using a map also completely sidesteps the point of cyclomatic complexity because there are no code paths to count anymore; now they're data. And even though the function does the same as before, the linter will congratulate you on the reduced complexity.
And you'd have essentially replaced statically understood control flow with the equivalent of a computed goto. It's not like it's a bad approach, but from a code path standpoint, it's unequivocally worse, not better. (Imagine if that map was mutable!)
> Not to mention a map will always introduce null or option types that you have to handle.
It is the same with the switch, at least in C/C++. You either have a default, or must list all the possible cases and still return something "at the bottom".
I mean, it's a funny point, but I'm not sure it's valid. Clearly he's numbering things here for illustration; in the real world you're far likelier to see a switch over types or some non-numeric value than literally numbered enums.
I mean, this can happen with any code by that logic if you're cut and pasting.
I think the real issue is that types aren't included in the example. I work in much higher-level languages, but if you are passing strongly typed objects through and your switch dispatches on those types, it's probably going to be illegal in your type system to return certain results.
If your type system doesn't validate your results, then you'll be prone to the class of error you are discussing. Maybe that's common in C.
It kind of encourages you to use a lookup table or polymorphism, which isn't far off from what the compiler is likely to do with that switch statement.
As for correlation to number of defects, another important factor is maintainability (velocity of new features in growing programs, keeping up with infrastructure churn in mature programs). If reducing perceived complexity improves maintainability, and if measuring cyclomatic complexity despite its flaws helps reduce perceived complexity, then it's still a useful metric.
Feels like one of those classic, "no one tool gives you the one correct answer. They all give you data and you come to an informed conclusion having utilized one or many of them."
> It kind of encourages you to use a lookup table or polymorphism
And I think that is a problem. switch()ing over an enum will give you a warning about unhandled cases in some languages. But if you start building your own lookup tables, you will be just as prone to typos like mine, except with more code, and less static analysis.
Yet the cyclomatic complexity drops from 16 to 1. I don't think that's a healthy incentive.
I think that's a fair criticism of a specific failing of a specific complexity metric. A reasonable response, especially if switch statements or pattern matches in your language are better at detecting completeness than other approaches, might be changing the metric to reward switch statements that completely cover the input range. It also makes it clear why any one metric should not be a target with no exceptions. I don't think it invalidates the concept of measuring complexity.
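As a concrete example of "better at detecting completeness" (hypothetical Kind enum): in Java 14+, a switch *expression* over an enum with no default must cover every constant or it won't compile, so a metric could plausibly treat this whole form as a single increment.

```java
public class ExhaustiveSwitch {
    enum Kind { BOX, SPHERE, CONE }

    // No default branch: the compiler verifies every Kind is handled.
    // Adding a new Kind turns a silent runtime gap into a compile error.
    static String parse(Kind k) {
        return switch (k) {
            case BOX -> "parsed box";
            case SPHERE -> "parsed sphere";
            case CONE -> "parsed cone";
        };
    }

    public static void main(String[] args) {
        System.out.println(parse(Kind.CONE));
    }
}
```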
There are many things that the complexity calculator could take into account: the specifics of the language, the libraries that are used (which could start threads or contain hidden conditionals), the kind of project (following a spec vs. explorative programming), and so on.
I think it's easy to come up with an algorithm that's better than a coin flip or counting LOC. But to be useful, the algorithm has to be on par with the judgement of the developers who check in the code and review it. Otherwise the false positives will take time away from other bug-reducing activities like code reviews and writing tests.
Of course I don't have data to prove it, but I'm convinced that every complexity checker I've encountered in the last 15 years has been a (small) net negative for the project. Maybe machine learning will improve the situation.
That's what I was thinking about when I used "ideas". To me there's one main idea in all that code: "we are returning a parsed object of a kind"
Not that I'm trying to sell "ideas". I don't even know. But it's this very loose concept that floats around my mind when writing code. How many ideas are there in a stanza? Ahh too many. I should break out one of the larger ideas to happen before.
Speaking of measuring "ideas", the holy grail of complexity metrics is Kolmogorov complexity: theoretically, the inherent complexity of a piece of data (including code) is the length of the shortest program that generates it. For example, 20 million digits of pi, or a routine that uses 20 branches to return the same object, has low Kolmogorov complexity because it's easy to write a generator for it, while a parser is genuinely more complex.
But it's only a mathematical construction and is uncomputable, just like the halting problem. In real life, for some applications a good compression algorithm like LZMA is sufficient to approximate it. But I'm not sure if it's suitable for measuring computer programs - it would still have a strong correlation to the number of lines of code.
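As a toy illustration of the compression-as-approximation idea (a rough proxy only, not a real metric), java.util.zip can show how cheaply a repetitive "20-branch" switch compresses relative to its raw length:

```java
import java.util.zip.Deflater;

public class CompressedSize {
    // DEFLATE-compressed size as a crude stand-in for Kolmogorov
    // complexity: highly regular code compresses to very little.
    static int compressedSize(String s) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(s.getBytes());
        d.finish();
        byte[] buf = new byte[s.length() + 64];
        int n = 0;
        while (!d.finished()) {
            n += d.deflate(buf); // count output bytes until done
        }
        d.end();
        return n;
    }

    public static void main(String[] args) {
        // Generate a switch body much like the one discussed above.
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 20; i++) {
            sb.append("case KIND_").append(i)
              .append(": return parseObjectOfKind").append(i).append("();\n");
        }
        String code = sb.toString();
        System.out.println("raw: " + code.length()
                + " compressed: " + compressedSize(code));
    }
}
```

The 20 near-identical cases shrink to a small fraction of their raw size, which matches the intuition that they carry one idea, not twenty.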
>A `switch` and all its cases combined incurs a single structural increment.
>Under Cyclomatic Complexity, a switch is treated as an analog to an `if-else if` chain. That is, each `case` in the `switch` causes an increment because it causes a branch in the mathematical model of the control flow.
>But from a maintainer’s point of view, a switch - which compares a single variable to an explicitly named set of literal values - is much easier to understand than an `if-else if` chain because the latter may make any number of comparisons, using any number of variables and values.
>In short, an `if-else if` chain must be read carefully, while a `switch` can often be taken in at a glance.
Great paper! I particularly liked how they treat boolean expressions. It kind of has a mathematical justification as well since a series of only && or || is trivial for a SAT solver.
I disagree with them on assuming method calls being free in terms of complexity. Too much abstraction makes it difficult to follow. I've heard of this being called lasagna code since it has tons of layers (unnecessary layers).
Maybe the complexity introduced by overabstraction requires other tools to analyze? It's tricky to capture with cyclomatic complexity or cognitive complexity since it is non-local by nature.
You jest, but I had a manager who loved Cyclomatic complexity as a measure of programmer ability. I tried in vain to convince him that CC was measuring the complexity of the underlying business logic, not my programming ability.
I wasted a lot of brainpower cutting down CC in my code by doing terrible things like storing function names as strings in hashmaps.
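Something along these lines (reconstructed from memory, all names invented):

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class GamedComplexity {
    // Hypothetical handlers the if-chain used to pick between.
    static String handleCreate() { return "created"; }
    static String handleDelete() { return "deleted"; }

    // Action name -> method name. The dispatch below is one reflective
    // call, so the complexity checker sees zero branches -- even though
    // every original path through the code is still there.
    static final Map<String, String> HANDLERS = new HashMap<>();
    static {
        HANDLERS.put("create", "handleCreate");
        HANDLERS.put("delete", "handleDelete");
    }

    // "throws Exception" instead of try/catch, to save branches.
    static String dispatch(String action) throws Exception {
        Method m = GamedComplexity.class.getDeclaredMethod(HANDLERS.get(action));
        return (String) m.invoke(null);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dispatch("create"));
    }
}
```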
That could replace several 'if' checks, since function calls weren't counted as branches. And of course no exception handling, because you could save branches by declaring throws Exception.
To this day, I wonder if Cyclomatic complexity obsession helped make "enterprise Java" what it is today.
The worst part is that this doesn't even reduce the cyclomatic complexity of the code. You replaced conditional branches with table lookups, indirect function calls, and exceptions, but you still have the same number of paths through the code—or perhaps more with the extra code for the table lookups. In the end you're just hiding the real complexity from the tools.
Minimizing cyclomatic complexity might actually be a reasonable approach, if the complexity were accurately measured without these blind spots. For example, any indirect calls should be attributed a complexity based on the number of possible targets, and any function that can throw an exception should be counted as having at least two possible return paths (continue normally or branch to handler / re-throw exception) at each location where it's called.
The real measure you want is the cyclomatic complexity divided by the complexity required by the problem. The problem with line counting (or cyclo counting) is that it assumes a constant ratio.
It is not trivial to understand the necessary complexity of a problem. In SICP, there's an early practice problem that asks you to turn a recursive number pattern generator into an iterative generator. What the problem doesn't tell you is that the generator, obscured by complex recursion, is actually creating a Tribonacci number sequence. This hidden insight turns a difficult problem into a trivial one.
I certainly didn't mean to imply it was trivial (or even possible). I'm just pointing out that assuming increasing LoC implies an increase in necessary complexity is just wrong.
Developers have tried to build metrics for "complexity" over the years. Cyclomatic Complexity is often pointed to as one of the more successfully deployed: https://en.wikipedia.org/wiki/Cyclomatic_complexity
As with any metric, it's something to take as useful "advice" and becomes a terrible burden if you are optimizing for/against it directly. (Such as if you are incentivized by a manager to play a game of hitting certain metric targets.)
It's also interesting to note that a complete and accurate complexity metric for software is likely impossible due to it being an extension/corollary of the Halting Problem.
I'm definitely not encouraging writing "clever" short code. Always strive to write clear code. You write it once, but it will be read (and potentially changed) many, many times.
As for complexity metrics, this is a contentious topic. It's hard to quantify "ideas" or "concepts". I think this is the part where technical design reviews and good code reviews help keep in check.
My IDE spits out a warning when I reach a certain level of complexity in a method, using McCabe's Cyclomatic Complexity as a measure. It roughly maps to "too many if statements" in my case.