Fantastic read. My only concern is that there wasn't any discussion of the cost of false positives (running a test where it's unnecessary) versus false negatives (incorrectly dismissing a relevant test), since those costs are not symmetrical in their effects.
The cost of a bug slipping through because a test was skipped will be higher than the cost of running a test irrelevant to a commit.
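That asymmetry can be made concrete with an expected-cost comparison. This is a hypothetical sketch, not the post's actual model; the cost weights and the 50:1 ratio are made up for illustration:

```python
# Illustrative (made-up) costs: skipping a relevant test is far more
# expensive than running an irrelevant one.
COST_FALSE_POSITIVE = 1    # run an irrelevant test: some wasted compute
COST_FALSE_NEGATIVE = 50   # skip a relevant test: sheriff time, backout churn

def expected_cost(p_relevant: float, run_test: bool) -> float:
    """Expected cost of scheduling (or skipping) one test, given the
    model's estimated probability that the test is relevant."""
    if run_test:
        return (1 - p_relevant) * COST_FALSE_POSITIVE
    return p_relevant * COST_FALSE_NEGATIVE

# Even at only 5% estimated relevance, the lopsided costs still favor
# running the test:
p = 0.05
print(expected_cost(p, run_test=True))   # 0.95
print(expected_cost(p, run_test=False))  # 2.5
```

With costs this lopsided, skipping only pays off when the model is quite confident the test is irrelevant (here, below p ≈ 0.02).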
Yes, a regression slipping through would far outweigh the benefits of reduced tests. The thing the post didn't make very clear is that, thanks to our integration branch, the chance of a missed regression is still nearly zero. If the scheduling algorithm misses something, the failure will show up on a "backstop" push. These are pushes where we run everything, and then a human code sheriff will inspect any failures; if something was missed, they figure out what caused it and back it out.
So the costs of missed regressions are:
1) More strain on the sheriffs (too much strain means we need to hire more)
2) More backouts, which are annoying to developers and can mess up annotation (though we have ideas to fix the latter).
For the record, the algorithm with the 70% reduction in tests has a regression detection rate almost on par with the baseline (it's only ~3-4% lower). This hasn't seemed to result in much additional strain on the pipeline.
There isn't any discussion of the cost at all. It just says the test run rate is down by 70%; it doesn't say anything about the defect detection rate, even though they say that's their cost function.
10 core-years per day sounds like a lot, but it's only about a 10 kW load, and they've saved 70% of that, or about $20 of opex per day.
One of the authors here. I can't exactly deny that line was added to sound impressive, so guilty as charged. However, the savings are much higher than $20/day for a few reasons:
* Many tasks run on expensive instances (hardware acceleration, Windows)
* We have OSX/Android pools that run on physical devices in a data centre (these are an order of magnitude more expensive than Linux)
* There are ancillary costs. For example, each task generates artifacts, which incur storage costs. Those artifacts are then downloaded, which incurs transfer costs.
* There are also overhead costs (idle time, rebooting, etc.) that aren't counted in the 10 years/day stat.
All these things see a corresponding decrease in costs with fewer tasks.
Is that really all? That would be 3,650 cores running full time. 3 W per core sounds too little for power consumption. And do power costs really dominate the price of running CPUs? I'm guessing the savings here are at least an order of magnitude more than your $20/day.
I get about $1000/day based on some EC2 prices for typical machines I've used, though I'm sure Mozilla's requirements are different and they can negotiate better prices than I can.
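For reference, the arithmetic behind that ballpark looks like this. The per-core-hour rate below is an assumption (roughly spot-instance territory), not Mozilla's actual negotiated pricing:

```python
# Back-of-the-envelope: convert "10 core-years per day" into a dollar
# figure under an assumed price. All rates here are guesses.
CORE_YEARS_PER_DAY = 10
core_hours_per_day = CORE_YEARS_PER_DAY * 365 * 24  # 87,600 core-hours/day

# Assumed blended rate of ~$0.015 per core-hour (spot-ish Linux pricing;
# Windows/macOS/hardware pools would push this up considerably).
price_per_core_hour = 0.015

daily_cost = core_hours_per_day * price_per_core_hour
savings = 0.70 * daily_cost  # the post's 70% reduction

print(f"~${daily_cost:,.0f}/day compute, ~${savings:,.0f}/day saved")
```

At that assumed rate the savings come out around $900-1000/day, consistent with the EC2-based estimate above, and that's before the more expensive Windows/OSX/Android pools or storage and transfer costs.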
> "The cost of a bug slipping through because a test being skipped will be higher than running an irrelevant test to a commit."
It really depends on the type of bug, and perhaps this could be factored into the model by also correlating changesets with outage severity or the complexity of a fix.
"A bug slipping through" in this case just means slipping through to where it's detected on a later push to the integration branch, or failing that, when a more complete set of tests runs when the change is merged into the main branch. In no case will poor scheduling here result in a bug making it into the final product. It's just that it's more costly in human time to detect it later, so currently the entire goal is set at detecting the problem on the first round of testing after a push.
They have a try server that developers can push to in order to run a swath of tests before merging into the integration branch. Outsiders can access it by being vouched for by a Mozilla developer; insiders obviously already have access. Having used it as an outsider, it's kind of a pain, with a lot of setup and options. So having something like `mach try auto` would be awesome for outside devs, in addition to the reduced server costs.