
You're building a bit of a straw man. If you have 400 tests that fail with a single behavioural change, why are you testing the same thing 400 times? And you don't need an in-depth investigation unless you didn't expect the test to break. And if you did expect the test to break, then you ensure that the test broke in the correct place. If a cursory glance is all you need in order to confirm that, then that's all you need. Tests are there to tell you exactly what actually changed in behaviour. The only time this should be a surprise is if you don't have a functional mental model of your code, in which case it's doubly important that you be made aware of what your changes are actually doing.

In your Google example, would their tests fail if their algorithm regressed in behaviour? If they don't fail on minor improvements, I don't see how they would fail on minor regressions either.



400 is an arbitrary number, but it's what sometimes (often?) happens with exact-match tests; take the second example with the PDF-to-HTML converter: an exact-match test would test too much, and thus your SVG tests will fail when nothing SVG-specific changed (maybe the way you rendered the HTML header changed). Or maybe you changed the order your HTML renderer uses to render child nodes, and it's still a valid order in 99% of your cases, but it breaks 100% of your tests. How do you identify the 1% that are actually broken? It's very hard if your tests just do exact textual comparison, instead of verifying isolated, relevant properties of interest.
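To make the brittleness concrete, here's a minimal sketch (pytest-style; the convert_to_html function and fixture paths are hypothetical, not anyone's actual code) of the kind of test that breaks on every unrelated rendering change:

    # Hypothetical brittle test: it compares the converter's entire output
    # against a stored reference, so any unrelated change (header markup,
    # child-node ordering) fails it even though SVG handling is untouched.
    from pdf2html import convert_to_html  # assumed module under test

    def test_svg_page_exact_match():
        html = convert_to_html("fixtures/svg_sample.pdf")
        with open("fixtures/svg_sample.expected.html") as f:
            expected = f.read()
        assert html == expected  # fails for reasons that have nothing to do with SVG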

In my Google example, the problem is that functional tests were testing something that should've been a performance aspect. The way you identify minor regressions is by having a suite of performance/accuracy tests, where you track that accuracy is trending upwards across various classes of input. Those are not functional tests - any individual sample may fail and it's not a big deal if it does. Sometimes a minor regression is actually acceptable (e.g. if the runtime performance/resource consumption improved a lot).
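As a rough sketch of what such a suite could look like (the recognize callable, the scoring function, and the sample layout are all assumptions, not a description of any real setup), the point is to record an aggregate score per class of input rather than pass/fail on individual samples:

    # Sketch of an accuracy-tracking suite: no single sample can turn the
    # build red; what gets tracked is the aggregate score per input class.
    from difflib import SequenceMatcher

    def word_accuracy(expected: str, actual: str) -> float:
        # Crude stand-in for a proper word-error-rate metric.
        return SequenceMatcher(None, expected.split(), actual.split()).ratio()

    def run_accuracy_suite(recognize, samples_by_class):
        # samples_by_class: {"clean": [(audio, transcript), ...], "noisy": [...]}
        report = {}
        for input_class, samples in samples_by_class.items():
            scores = [word_accuracy(ref, recognize(audio)) for audio, ref in samples]
            report[input_class] = sum(scores) / len(scores)
        return report  # stored per run so the trend across releases is visible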


> It's very hard if your tests just do exact textual comparison, instead of verifying isolated, relevant properties of interest.

I think you have this assumption, which you never actually specified, that exact-match testing means testing for an exact match on the entire payload. That's a strawman, and yes, you will have issues exactly like you describe.

If your test is only meant to cover the SVG translation, then you should be isolating the SVG-specific portion of the payload, and then execute an exact match on that isolated translation. Now that test only breaks in two ways: it fails to isolate the SVG, or the SVG translation behaviour changes.
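A sketch of that isolation (again pytest-style, reusing the hypothetical convert_to_html from above and assuming the converter emits well-formed XHTML with a namespaced inline <svg>):

    # Isolate the SVG fragment first, then exact-match only on that fragment.
    from xml.etree import ElementTree as ET
    from pdf2html import convert_to_html  # assumed module under test

    SVG_NS = "{http://www.w3.org/2000/svg}"

    def extract_svg(html: str) -> str:
        root = ET.fromstring(html)
        svg = root.find(f".//{SVG_NS}svg")
        assert svg is not None, "failed to isolate the SVG"  # failure mode 1
        return ET.tostring(svg, encoding="unicode")

    def test_svg_translation():
        html = convert_to_html("fixtures/svg_sample.pdf")
        with open("fixtures/svg_fragment.expected.svg") as f:
            expected = f.read()
        assert extract_svg(html) == expected  # failure mode 2: behaviour changed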

> In my Google example, the problem is that functional tests were testing something that should've been a performance aspect. The way you identify minor regressions is by having a suite of performance/accuracy tests, where you track that accuracy is trending upwards across various classes of input. Those are not functional tests - any individual sample may fail and it's not a big deal if it does. Sometimes a minor regression is actually acceptable (e.g. if the runtime performance/resource consumption improved a lot).

... "Accuracy", aka the output of your functionality is a non-functional test? What?

And I never said regressions aren't acceptable. I said that you should know via your test suite that the regression happened! You are phrasing it as a trade-off, but also apparently advocating an approach where you don't even know about the regression! It's not a trade-off if you are just straight-up unaware that there are downsides.


> That's a strawman

It wasn't intended to be; yes, that's what I meant: don't check the full output, check the relevant sub-section. Plus, don't check for order in the output when order doesn't matter, accept slight variation when it is acceptable (e.g. values resulting from floating-point computations), etc. Don't just blindly compare against a textual reference unless you actually expect that exact textual reference, and nothing else will do.
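For instance (a minimal sketch; the coordinate data here is made up), order-insensitive comparison plus floating-point tolerance instead of exact textual equality:

    import math

    def test_polygon_points():
        # Hypothetical parsed output: (x, y) coordinates extracted from the SVG.
        got = [(0.1 + 0.2, 1.0), (2.0, 3.0)]
        expected = [(2.0, 3.0), (0.3, 1.0)]
        # Order doesn't matter here, so compare sorted copies...
        got_sorted, expected_sorted = sorted(got), sorted(expected)
        assert len(got_sorted) == len(expected_sorted)
        # ...and tolerate floating-point noise rather than demanding exact text.
        for (gx, gy), (ex, ey) in zip(got_sorted, expected_sorted):
            assert math.isclose(gx, ex, rel_tol=1e-9)
            assert math.isclose(gy, ey, rel_tol=1e-9)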

> "Accuracy", aka the output of your functionality is a non-functional test? What?

Don't act so surprised. Plenty of products have non-100% accuracy; speech recognition is one of them. If the output of your product is not expected to have perfect accuracy, I claim it's not reasonable to test that full output and expect perfect accuracy (as functional tests do). Either test something else that does have perfect accuracy, or make the test a "performance test", where you monitor the accuracy but don't enforce perfection.

> And I never said regressions aren't acceptable.

Maybe, but I do. I'm not advocating that you don't know about the regression at all. Take my example with speech: you made the algorithm run 10x faster, and now 3 results out of 500 are failing. You deem this to be acceptable and want to release to production. What do you do?

A. Go on with a red build?

B. "Fix" the tests so that the build becomes green, even though the sound clip that said "Testing is good" is now producing the textual output "Texting is good"?

I claim both A & B are wrong approaches. "Accuracy" is a performance aspect of your product, and as such, shouldn't be tested as part of the functional tests. Doesn't mean you don't test for accuracy - just like it shouldn't mean that you don't test for other performance regressions. Especially so if they are critical aspects of your product/ part of your marketing strategy!


OK, I'm caught up with you now. Yes, I agree with this approach in such a scenario. I would just caution against throwing out stuff like that as a casual note regarding testing without any context, like you did. Examples like this should be limited to non-functional testing, aka metrics, which was not called out at all originally. And it's a cool idea to run a bunch of canned data through a system to collect metrics as part of an automated test suite!



