
This is for test coverage? I get it technically, but it's not clear to me that this adds value. Covering all the internal states/code paths in tests seems very hard and potentially of diminishing returns? Isn't time better invested in other things?

I'm also guessing this ties into Google's tooling that can tell which lines of code are hit by which tests, and runs only those tests as part of the mutation testing?

EDIT: The blog has some discussion about the correlation between some of these mutation failures and real bugs. I remain a little suspicious to be honest. Also:

"I also looked into the developer behavior changes after using mutation testing on a project for longer periods of time, and discovered that projects that use mutation testing get more tests over time, as developers get exposed to more and more mutants. Not only do developers write more test cases, but those test cases are more effective in killing mutants: less and less mutants get reported over time. I noticed this from personal experience too: when writing unit tests, I would see where I cut some corners in the tests, and anticipated the mutant. Now I just add the missing test cases, rather than facing a mutant in my Code review, and I rarely see mutants these days, as I’ve learned to anticipate and preempt them."

Now this one is expected: if you build automated tooling that comments on your code reviews, you expect people to try to avoid triggering it. However, there's still the question of what the quality improvement is and what the productivity impact is.




> Covering all the internal states/code paths in tests seems very hard and potentially of diminishing returns?

100% test coverage is one of the easiest goals to achieve (in Python); it's a pretty mindless activity. Writing proper tests that check the business logic is a much harder task.

From my experience it gives the biggest value for dynamic languages where you must be sure that every line is touched before releasing code.


Production impact: if you can replace:

  if (a == 10)

With

  if (true)
Then you're not testing the behavior of your application. If `a != 10` in production and you've never tested that case, it might be a problem.

Mutation testing can also do stuff like removing entire statements, etc.
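To make that concrete, here's a minimal sketch (a hypothetical `price` method and JUnit test of my own, not from the article): the first assertion alone gives full line coverage yet still passes under the `if (true)` mutant; the second one kills it.

  import static org.junit.Assert.assertEquals;
  import org.junit.Test;

  public class PricingTest {
      // Bulk discount applies only when exactly 10 items are bought.
      static int price(int quantity, int unitPrice) {
          if (quantity == 10) {            // mutant: `if (true)`
              return quantity * unitPrice * 90 / 100;
          }
          return quantity * unitPrice;
      }

      @Test
      public void discountAppliesOnlyAtTen() {
          assertEquals(90, price(10, 10)); // covers the branch, but also passes under the mutant
          assertEquals(50, price(5, 10));  // fails under `if (true)` (would return 45), killing it
      }
  }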


The question (well, one question) is why would we suspect the lack of a test indicates a bug? For most non-trivial software the possible state-space is enormous and we generally don't/can't test all of it. So "not testing the (full) behaviour of your application" is the default for any test strategy; if we could test everything, we wouldn't have bugs... Last I checked most software (including Google's) has plenty of bugs.

The next question would be: let's say I spend my time writing the tests to resolve this (could be a lot of work); is that time better spent than other things I could be doing? (i.e. what's the ROI?)

Even ignoring that, is there data to support that the quality of software where mutation testing was added improved measurably (e.g. fewer bugs filed against the deployed product, better uptime, etc.)?

Is this method better than just looking at code coverage? Possibly none of the tests enter the if statement at all?

EDIT: where I'm coming from is that it's not a given this is an improvement to the software development process. It seems like there was some experimentation around the validity of this method, which is good, but, like a lot of software studies, somewhat limited. It also seems there are a lot of heuristics based on user feedback, which is also good I guess, but presumably also somewhat biased.

The related paper has a lot of details including: "Since the ultimate goal is not just to write tests for mutants, but to prevent real bugs, we investigated a dataset of high-priority bugs and analyzed mutants before and after the fix with an experimental rig of our mutation testing system. In 70% of cases, a bug is coupled with a mutant that, had it been reported during code review, could have prevented the introduction of that bug."

Which should imply(???) that 70% of "high priority" bugs can be eliminated during the review process by using this sort of mutation testing. Seeing data to that effect would be cool (i.e. after the fact) and if it's real that'd be pretty incredible and we should all be doing that.


> The question (well, one question) is why would we suspect the lack of a test indicates a bug?

It's not the lack of a test, but the fact that the existing test doesn't cover a branch.

Rarely taken paths are exactly where bugs often hide, because they likely weren't well exercised in human-driven testing either, and because they're hit rarely in general, so most users don't run into them often. If the condition is not easily reproducible it can be very hard to figure out what to even put in a bug report.

Say you're working with images and have a branch for a 16 bit color (565) image. Such images are rarely used today, but they still exist. Among many users of your code it's possible only 1% ever hit that branch -- and those end up experiencing weird issues nobody else sees.

Another example is error handling: an application that works perfectly fine on a LAN can be hell to use on a flaky connection.
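To sketch the 565 case (a hypothetical `toRgb888` conversion and JUnit test, purely illustrative): a mutant that turns the rare branch into `if (false)` only gets killed if some test actually feeds in a 565 pixel.

  import static org.junit.Assert.assertEquals;
  import org.junit.Test;

  public class Rgb565Test {
      // Expand a 16-bit 565 pixel to 24-bit RGB; pass other pixels through.
      static int toRgb888(int pixel, boolean isRgb565) {
          if (isRgb565) {                  // mutant: `if (false)` guts the rarely taken branch
              int r = (pixel >> 11) & 0x1F;
              int g = (pixel >> 5) & 0x3F;
              int b = pixel & 0x1F;
              return ((r << 3) << 16) | ((g << 2) << 8) | (b << 3);
          }
          return pixel;
      }

      @Test
      public void convertsRgb565White() {
          // 0xFFFF in 565 is white; this fails under the `if (false)` mutant.
          assertEquals(0xF8FCF8, toRgb888(0xFFFF, true));
      }
  }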


(Opinions are my own)

> is why would we suspect the lack of a test indicates a bug?

I can only speak for my experience but the code is not better because it is mutation tested. It is better because we have thought about all of the edge cases that could happen when inputting data into the system.

Mutation testing, as a tool, helps you find statements that are not being exercised when parsing certain data. For example, if I write an HTML parser and I only ever provide test data that looks like `<a href=....` as an input string, and a mutation testing tool replaces:

   if (attrs.has("href"))
      return LINK;
with:

   if (true)
      return LINK;
It is clear to a human reader that this conditional is important, but the test system doesn't have visibility into this. This means in the following situations you can be screwed:

1. Someone (on the team, off the team) makes a code change and doesn't fully understand the implications of their change. They see that the tests pass if they always `return LINK;`.

2. If you are writing a state machine (parser, etc) it helps you think of cases which are not being tested (no assertion that you can arrive at a state).

3. It helps you find out if your tests are Volkswagening. For example if you replace:

   for (int i = 0; i < LENGTH; i++)
with:

   for (int i = 0; i < LENGTH; i += 10)
Then it is clear that the behavior of this for loop is either not important or not being tested. This could mean that the tests that you do have are not useful and can be deleted.
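For instance, with a hypothetical `sum` method (my own illustration, not real code from any project), a single-element test passes under both the original loop and the `i += 10` mutant, so only a multi-element assertion kills it:

  import static org.junit.Assert.assertEquals;
  import org.junit.Test;

  public class SumTest {
      static int sum(int[] values) {
          int total = 0;
          for (int i = 0; i < values.length; i++) {    // mutant: `i += 10`
              total += values[i];
          }
          return total;
      }

      @Test
      public void sumsEveryElement() {
          assertEquals(5, sum(new int[]{5}));              // passes under the mutant too: one pass either way
          assertEquals(15, sum(new int[]{1, 2, 3, 4, 5})); // kills it: the mutant would return 1
      }
  }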

> For most non-trivial software the possible state-space is enormous and we generally don't/can't test all of it. So "not testing the (full) behaviour of your application is the default for any test strategy", if we could we wouldn't have bugs... Last I checked most software (including Google's) has plenty of bugs.

I have also used (set up, and fixed findings from) https://google.github.io/clusterfuzz/, which uses coverage + properties to find bugs in the way C++ code handles pointers and other things.

> The next question would be let's say I spend my time writing the tests to resolve this (could be a lot of work) is that time better spent vs. other things I could be doing? (i.e. what's the ROI)

That will depend largely on the team and the code you are working on. If you are working on experimental code that isn't in production, is there value in this? Likely not. If you are writing code where failing to parse some data correctly means a huge headache trying to fix it? Likely yes.

The SRE workbook goes over making these calculations.

> Even ignoring that is there data to support that the quality of software where mutation testing was added improved measurably (e.g. less bugs files against the deployed product, better uptime, etc?)

I know that there are studies that show that tests reduce bugs but I do not know of studies that say that higher test coverage reduces bugs.

The goal of mutation testing isn't to drive up coverage, though. It is to find out which cases are not being exercised and to evaluate whether they will cause a problem. For example, mutation testing tools have picked up cases like this:

   if (debug) print("Got here!");
Alerting on this if statement is basically useless and it can be ignored.

> Is this method better than just looking at code coverage? Possibly none of the tests enter the if statement at all?

Coverage does not tell you the same thing as mutation tests. Coverage tells you if a line was hit. Mutation tests tell you whether the conditions that got you there were appropriately exercised.

For example:

   if (a.length > 10 && b.length < 2)
If your tests enter this if statement and also pass when it is replaced with:

   if (a.length > 10 && true)
Or:

   if (true || b.length < 2)
You would still have the same line coverage. You would still have the same branch coverage. But, if these tests pass, it is clear that you are not exercising cases where a.length <= 10 or b.length >= 2.
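Something like the following hypothetical `needsTruncation` predicate and JUnit-style test (an illustrative sketch, not anyone's actual code) is what it takes to kill both mutants:

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class ConjunctionTest {
      // Mirrors the compound condition above.
      static boolean needsTruncation(String a, String b) {
          return a.length() > 10 && b.length() < 2;
      }

      @Test
      public void exercisesBothSidesOfTheConjunction() {
          assertTrue(needsTruncation("abcdefghijk", "x"));    // both clauses true
          assertFalse(needsTruncation("short", "x"));         // kills `true || b.length < 2`
          assertFalse(needsTruncation("abcdefghijk", "xy"));  // kills `a.length > 10 && true`
      }
  }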

> where I'm coming from is that it's not a given this is an improvement to the software development process

In my experience if I didn't write a test covering it, it was likely because I didn't think of that edge case while writing the code. If I didn't think of that edge case while writing the code then I am leaning heavily on defensive programming practices I have developed but which are not bulletproof. Instead of hoping that I am a good programmer 100% of the time and never make mistakes I can instead write tests to validate assumptions.

> Seeing data to that effect would be cool (i.e. after the fact) and if it's real that'd be pretty incredible and we should all be doing that.

Getting this kind of data out of various companies might be challenging.



