Hacker News
Best Practices for Testing in Java (2019) (phauer.com)
103 points by moks on Sept 1, 2020 | 95 comments


>If this test fails, it’s hard to see what exactly is broken.

Breaking existing tests is rare. I don't mind some detective work when it happens. You start with a pretty strong lead, anyway: whatever you touched since the last time the test passed. In my workflow this is unlikely to be more than five minutes ago. Half the time I know what I broke without even looking.

I can see how specificity in test results would be important if you're running tests very infrequently and accumulating a lot of change between runs. But in that case I would probably invest in faster feedback before I'd invest in specificity. Huge, wordy test suites are expensive and can ossify bad design much more strongly than they preserve correct behavior.


> Breaking existing tests is rare. I don't mind some detective work when it happens. You start with a pretty strong lead, anyway: whatever you touched since the last time the test passed.

You describe the exact situation where the advice is not helpful at all.

Now imagine a large, very tightly interwoven codebase where in fact, making changes to a section of code breaks tests on the complete other side of the application.

If you don't follow that advice, you are immediately blocked, and pretty much at the mercy of someone else fixing their test (probably to be less needlessly specific or less assertive after changing requirements).

If they structured their test in a way that makes it clear why it should fail, you have a chance to investigate for yourself how it conflicts with your changes.


That's an organizational problem (silos, I think they're called), not a coding problem. I wish people would qualify their coding advice with "advice for big projects with rigid process", "advice for dysfunctional planning", "advice for small teams", etc.


> very tightly interwoven

Perhaps this is your problem.


Of course. But you can't fix that by testing, only improve how to cope :)


It's a problem tightly coupled unit testing makes worse.


If breaking existing tests is rare, either you don't have enough tests or you have a superhuman ability to organize your code.


Hardly "best" practices. "Best" would suggest there are no better practices.

I am fed up with unit testing. There is no feedback to tell if unit testing setup is complete and if unit tests are comprehensive.

I am more and more turning to functional regression testing, where the entire service is tested as a black box against the spec.

Since we are testing an API, this tends to be specified much better than individual components. There is usually already some kind of documentation. The API changes much less than individual components, which means that if I start refactoring I may change a lot of internals without touching the API at all. If I do touch the API, then I need to update documentation and inform users anyway, so writing tests for it is not a huge additional cost.


> There is no feedback to tell if unit testing setup is complete and if unit tests are comprehensive.

Consider using mutation testing [1] tools like https://pitest.org/ to test the quality of your tests.

Having said that, unit tests obviously shouldn't be your only quality assurance practice.

[1]: https://en.wikipedia.org/wiki/Mutation_testing
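
To make the idea concrete, here's a rough Java sketch of the kind of gap a mutation tester surfaces (illustrative only, not actual pitest output; one of pitest's standard mutators does change `>=` into `>`):

    static boolean canWithdraw(int balance, int amount) {
        return balance >= amount;          // a mutation tool might mutate ">=" into ">"
    }

    // This test passes for both the original and the ">" mutant, so the mutant
    // "survives" and the tool reports the gap even though line coverage is 100%.
    @org.junit.jupiter.api.Test
    void canWithdrawWhenBalanceIsLarger() {
        org.junit.jupiter.api.Assertions.assertTrue(canWithdraw(100, 10));
    }

    // Adding the boundary case (balance == amount) kills that mutant.
    @org.junit.jupiter.api.Test
    void canWithdrawWhenBalanceEqualsAmount() {
        org.junit.jupiter.api.Assertions.assertTrue(canWithdraw(10, 10));
    }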


> There is no feedback to tell if unit testing setup is complete and if unit tests are comprehensive.

Unit tests are neither required nor expected to be "complete". At no point do unit tests tell you "this code is perfect and correct". Rather, they identify where code is incorrect. This is valuable information.

> I am more and more turning to functional regression testing when entire service is tested as a black box against spec.

Both types of tests (and others) are useful.


What I mean by feedback is that, when a developer works on functionality, they will get feedback from somewhere. It has to do what it is supposed to do whether the developer understood the requirements or didn't and the users are going to point it out with bug reports or incidents.

On the other hand, if the developer misunderstood the requirements or completely bungled the unit tests (for example, they always pass), then there is no feedback to tell you. Everything looks fine as long as the tests pass. Of course, one could say that the test should first be written to fail and then made to pass by implementing the feature, but who's going to check that? There is no easy way to tell whether developers are doing a good job with tests other than to delve into each feature, understand it, and look at the tests.

Also, unit tests test components that are typically specified by the developer (so the developer writes the spec, the component, and the unit test). With functional testing, at least one of those is verified by the user and you get some feedback -- if the user reports that the service is not working properly, you write a functional test to replicate the issue and then fix the service so that it behaves correctly according to that test.

If the user is another development team it is entirely possible, in a mature org, that the other team can write the test.


> It has to do what it is supposed to do whether the developer understood the requirements or didn't and the users are going to point it out with bug reports or incidents.

This sounds like you're saying that one should leave testing of the code to the end user, which obviously makes no sense. I'm going to assume it really means just "things the developer missed".

So, to that point, we have the developer testing for things the developer thought of. We're in agreement there.

When I write code, I tend to make sure the individual pieces work (sort actually sorts, it doesn't blow up on null values, maybe it has to maintain order for equivalent items, maybe it has an option to sort null values first or last). As time goes on, I may find a bug or a change (or a misunderstanding of requirements). When that happens, I need to go modify the code to account for it. When I do that, I want to make sure that everything that worked before still works.

If I wrote automated tests originally, all I need to do is rerun my tests to get back to "everything I knew to work before still works". If I did not, then I need to manually test everything. Both are ways of regression testing; one of them is a lot faster.


> unit testing setup is complete

Coverage reports. They tell you which branches weren't executed by your tests, which can tell you what may be missing in your unit testing.

> more turning to functional regression testing when entire service is tested as a black box against spec.

Functional/regression testing is important, but it can be difficult to use if your app is huge. You cannot easily isolate an area to test, so you always end up testing common components repeatedly, which makes the suite slower to run.

Not to mention some functional testing mechanisms (such as using a web browser) can be flaky.

Both have their place in the automated testing paradigm.


Coverage reports are misleading, though: 100% coverage means that you passed through every line of code, that's it.

You don't know if you covered the edge cases, it cannot cover concurrency well, and it will not tell you whether you exercised particular sequences of values or longer paths through your code.


> every line of code, that's it

Yes. And it's worth noting what's left out of that. Passing through every line of code is not the same thing as following all paths through the code.

Here's a sampling of code features where that distinction is particularly important:

  - successive if-statements
  - switch statements with fallthrough
  - loops
  - mutable variables with a large influence
The first two are because {{a}, {b}} ≠ {{a, b}}.

By that last one, I mean both mutable variables with a large scope, and mutable variables within objects that have a large scope. That's probably the trickiest thing, because, in the presence of mutability like that, it's not just which lines of code were executed that matters, it's also the order in which they were executed.

That doesn't strike me as a pretty picture. Overall, it implies that automated code coverage metrics are the least useful in the kinds of codebases where they are most desirable.
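
A small Java sketch of that point (my example, not from the article): one pair of tests can execute every line of these successive if-statements while never exercising the path where both conditions hold.

    static int price(int base, boolean member, boolean coupon) {
        int p = base;
        if (member) p -= 10;
        if (coupon) p -= 10;   // the member-and-coupon path is never tested below
        return p;
    }

    @org.junit.jupiter.api.Test
    void coversEveryLineButNotEveryPath() {
        org.junit.jupiter.api.Assertions.assertEquals(90, price(100, true, false));
        org.junit.jupiter.api.Assertions.assertEquals(90, price(100, false, true));
        // 100% line coverage, yet price(100, true, true) was never checked
    }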


You're right about edge cases in terms of input values, and right about issues with concurrency, but many of the issues you mention are solved by changing your metric.

Path coverage tells you what percentage of code paths are covered (line coverage, and even plain branch coverage, can sit at 100% while whole paths go untested).

In other words, it would report 50% path coverage for:

    if a:
        print("a yes")
    else:
        print("a no")
    if b:
        print("b yes")
    else:
        print("b no")

    test 1: a = false, b = false
    test 2: a = true, b = true


Unfortunately coverage measures are shit.

A real coverage measure would test coverage OF THE SPECIFICATION. For a list of statements describing how your service should behave in various situations, you would want to know which of them are verified by the tests.

Checking lines of code is pointless, because a single line of code is usually involved in multiple specification statements (requirements).

There is barely any connection between having 100% line coverage and covering the entire component specification.

I think coverage metrics are one of those measures where somebody thought they needed to measure something, but either could not define what to measure, or decided that measuring the right thing was too difficult, so they measure something else instead and managers are happy anyway.


I agree. You can get rid of lots of unit tests if you do functional, integration, and performance testing of your APIs.


What is your feedback to tell whether your functional tests and your spec are comprehensive?


The feedback loop is this:

1. When the user says the service doesn't work (which means it doesn't work according to the spec), we look at what is at fault (did the user misunderstand the spec, is the spec faulty, or is the implementation faulty).

2. If the spec or implementation is faulty, we ensure there is a test that checks the correct behavior (and fix the spec if necessary). For example, the user might submit a piece of logic that in their view should work correctly but does not. This might get distilled into a test.

3. We ensure the test fails (which means we were able to reproduce the error).

4. We fix the code and ensure the new version passes the test.

5. We get back to the user to confirm.

Now, if you read this you will notice the following:

- After completing the process we are left with a test, so as users report problems and incidents, the spec and the library of tests checking the spec get updated. This is the feedback loop that adds relevant tests to the library over time, ensuring a more complete suite.

- This is no different from the idea of TDD, just implemented for functional testing. It is not that there is something fundamentally wrong with TDD; it's that TDD cannot be effectively forced on developers. On the other hand, it can be enforced for functional testing, because it is just a simple change to the incident/bug handling process.


> "Use Fixed Data Instead of Randomized Data Avoid randomized data as it can lead to toggling tests which can be hard to debug and omit error messages that make tracing the error back to the code harder. They will create highly reproducible tests, which are easy to debug and create error messages that can be easily traced back to the relevant line of code."

Good article, but I don't get this one at all; it almost seems like an anti-pattern. Choosing fixed data instead of random because the results are more "reproducible" seems to miss the point. If random data eventually helps uncover more bugs, then it's worth using!


I have seen that people's response to randomized testing hinges on the answer to one question: does the test log the input that led to a failure?

If the answer is no, then it's an immediate slam on the brakes. Tests aren't reproducible, they're flaky, this is impossible, etc.

If the answer is yes, then it's a revelation. The tests are finding me corner and edge cases I never would have thought of on my own, and I can enshrine and document them by creating a fixed-data unit test to cover them even before I start to work the bug (I am still doing TDD, after all), and hey, maybe I don't even have to think that much about the specific data, just its shape, and that will keep me honest and protect me from accidentally writing tautological tests. A short while later, you discover property testing, start achieving ever increasing test coverage with ever decreasing test code, have some sort of satori moment, tell your friends what you've discovered, and they tell you you're insane, randomized tests are flaky by definition, and flaky tests are Taboo. You slink away and change your legal name to Jonathan Livingston Tester.


I think the general idea is that you should be picking your input data to be as evil as possible, so that you're explicitly testing the edge cases; if you use random data you're not thinking about this aspect.

Fuzz testing is a separate thing, and should be done too. But it's (in my experience) supplemental to well-chosen hand-crafted input. It's hard to do TDD or ensure you have all your edge-cases covered if you're using randomized inputs.

For example, for datetimes, you can pick something nasty and unlikely like 1999-12-31T23:00 Pacific Time, which will read as a different day, month, and year if you have your timezone logic broken somewhere (e.g. you're parsing it as a naive UTC datetime somewhere). If you test randomly, that "dates at the end of the year parse incorrectly for 8 hours" bug is unlikely to be found.
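
For what it's worth, a minimal JUnit 5 sketch of that kind of hand-picked nasty input (using java.time; the exact types the parent has in mind may differ):

    import java.time.ZoneId;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;

    @org.junit.jupiter.api.Test
    void endOfYearPacificTimeIsNextYearInUtc() {
        ZonedDateTime pacific = ZonedDateTime.of(1999, 12, 31, 23, 0, 0, 0,
                ZoneId.of("America/Los_Angeles"));
        ZonedDateTime utc = pacific.withZoneSameInstant(ZoneOffset.UTC);
        // 2000-01-01T07:00Z: a different day, month and year than the local value,
        // so any code that silently drops the zone will disagree about all three.
        org.junit.jupiter.api.Assertions.assertEquals(2000, utc.getYear());
    }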


If you're using generative testing, you'll most likely hit all those edge cases, and then some. I'm not sure what the situation is like in Java for doing these types of tests, but something QuickCheck-like is a godsend in general.


Actually, you can get reproducible tests even with randomized tests, by setting the random seed (I always do that in e.g. Quickcheck). So this is really a moot point, not a good excuse not to do random testing (or not to use QuickCheck ;-P).


I don't believe any test frameworks in Java have built-in support for randomisation using a seed, so this is a foreign concept to most Java programmers. Which is a shame, because it's useful.

It would actually be really easy to package up seeded randomisation as a JUnit rule / extension. As far as I can tell, nobody has done that.
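
Something like the following base class would do (a hypothetical sketch, not an existing JUnit extension): pick a seed, log it, and allow it to be pinned via a system property so a failing run can be reproduced exactly.

    import java.util.Random;
    import org.junit.jupiter.api.BeforeEach;

    public abstract class SeededRandomTest {
        protected Random random;

        @BeforeEach
        void initSeed() {
            String pinned = System.getProperty("test.seed");
            long seed = pinned != null ? Long.parseLong(pinned) : new Random().nextLong();
            System.out.println("test.seed=" + seed); // rerun with -Dtest.seed=<value> to reproduce
            random = new Random(seed);
        }
    }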


Reproducibility is a subtle but important part of all parts of a build, including tests.

What happens, in my own experience, is that randomized data in tests has a few unplanned downsides.

First, it makes you think you're testing all possible values (since they all might be used, right?). But you aren't - you'll never run the test enough times to see all values. What you really should do is think for a few minutes about the edge cases and ranges and write multiple tests covering each edge.

The second downside I saw was "oh the tests failed so I just reran it and now it passes". This is especially difficult as your team brings in new people, less experienced people. Everyone is in a rush and yeah, that test fails like 1 in 30 of its runs but we don't have time to look into that right now. So you end up with this frustrating build that lets you continue when you have bugs, but sometimes fails.

Your unit tests are a spec. They specify your expectations of the system under test. If you use random data, then the only way you can make your spec accurate is to reimplement the program in your unit tests, also matching the spec. But this will incredibly tightly couple your code and your test, as changing one implementation will require you to change the other. Tests shouldn't be like that, imho.


Tests that use random input data are much more difficult to write correctly. Your test needs to know the expected result not just for one case, but for every possible case. That will vastly increase the number of bugs in your test code, leading to a seemingly endless stream of false-positives.

The worst part is that the feedback is late. The test will randomly fail long after it was written. There's a lot of needless overhead in relearning the context of the failing code just so you can fix a test that is not general enough for the data it was provided.

There are ways to effectively use randomly-generated test data, but it's harder than you'd think to do it right.


Tests with random inputs can be much easier to write. For example here's a test for a single hard-coded example:

    public boolean canSetEmail() {
      User u = new User(
        "John",
        "Smith",
        LocalDate.of(2000, Month.JANUARY, 1),
        new Email("john@example.com"),
        Password.hash("password123", new Password.Salt("abc"))
      );
      Email newEmail = new Email("smith@example.com");
      u.setEmail(newEmail);
      return u.getEmail().equals(newEmail);
    }
Phew! Here's a randomised alternative:

    public boolean canSetEmail(User u, Email newEmail) {
      u.setEmail(newEmail);
      return u.getEmail().equals(newEmail);
    }
Not only is the test logic simpler and clearer, it's much more general. As an added bonus, when we write the data generators for User, Email, etc. we can include a whole load of nasty edge cases and they'll be used by all of our tests. I've not used Java for a while, but in Scalacheck I'd do something like this:

    def genEmail = {
      val genStr = arbitrary[String]
        .map(_.replace("@", ""))
        .retryUntil(_.nonEmpty)

      for {
        user   <- genStr
        domain <- genStr
        tlds   <- Gen.listOf(genStr)
      } yield (user + "@" + (domain :: tlds).mkString("."))
    }
Even using the default String generators, this still gives pretty good examples, e.g.

    scala> genEmail.sample
    res1: Option[String] = Some(뀩貲㳱誷Ⓣ壟獝鈂ᗘ䮕鹍尛斿績線孞왁궽偔ቫ﯃쥑瞴䒨䏵艥꿊狿냩쩈簋㷡泪伺鑬鮘䦚벛妹乘饡㙧ꐤ큫ᯕ肳铬呢ூ靪୒틯鏱滄㧄P莶寕상䐀鹭ᚿ㌤ᇹ䶂攔ꅥ㶓⽌ꅂ뀏뛶盏⍸㇗ɧ⇽줓嵛@퇐䎁鱎馉琿䏍㔰蘿⠶㘮큵휕炠᭑㯠ꇷ氕瑤镦碋䓯鄓헛㥝籆찷舊⸦䁜⯞௵籋㟨㨴鯅鿸刡㽴ﮈ耾刏碓辁亁ᦨ氦ব꧓ꌝ飖䒱찮刍ฤ﯉⯛ஃ颰溯쨚ই媐䄇延醴熜䢐필㉕徫澬폁횱坠줊埬າ⟂.὆蹵箅䀈떥喹뭡㘺ꐟ⯉쉂㯻牏䢺梘쭸칹总⻎恵꣥ﴈ宎嶲ꎮ쌊䯧挤ꓯ⻯ꢠ鏿㰔操駚樅졇੩풻殜ᶝ팱.瘽៣뿔뱡䞿愝滠蛊妏뷩먰⦹挸긖᩿㣽ﮀ嫶ꚸ㗃鈴䇡⋥梏ڝ晡烡.줼欺ࠞ娫)
The other advantage is that automated shrinking does a pretty good job of homing in on bugs. For example, if our test breaks when there's no top-level domain (e.g. 'root@localhost') then (a) that will be found pretty quickly, since 'tlds' will begin with a high probability of being empty and (b) the other components will be shrunk as far as possible, e.g. 'user' and 'domain' will shrink down to a single null byte (the "smallest" String which satisfies our 'nonEmpty' test); hence we'll be told that this test fails for \0@\0 (the simplest counterexample). We'll also be given the random seeds used by the generators.


That generator function illustrates exactly the problem I'm talking about. The maximum length of a string in Java is 2^31-1 chars. If user is an 'arbitrary string', then it could be 2^31-1 chars long. If domain is also an arbitrary string, then it can also be 2^31-1 chars long. When you concatenate them and exceed the maximum string length, you will cause a failure in the test code.

There are almost always constraints within the test data, but they're complex to properly express, so they aren't specified. Then one day, the generator violates those unstated constraints, causing the test to fail.


> one day, the generator violates those unstated constraints, causing the test to fail

Good, that's exactly the sort of assumption I'd like to have exposed. As a bonus, we only need to fix this in the generators, and all the tests will benefit. I've hit exactly this sort of issue with overflow before, where I made the mistaken assumption that 'n.abs' would be non-negative.

In this case Scalacheck will actually start off generating small/empty strings, and try longer and longer strings up to length 100.

This is because 'arbitrary[String]' uses 'Gen.stringOf':

https://github.com/typelevel/scalacheck/blob/master/src/main...

'Gen.stringOf' uses the generator's current "size" parameter for the length:

https://github.com/typelevel/scalacheck/blob/master/src/main...

https://github.com/typelevel/scalacheck/blob/master/src/main...

The "size" of a generator starts at 'minSize' and grows to 'maxSize' as tests are performed (this ensures we check "small" values first, although generators are free to ignore the size if they like):

https://github.com/typelevel/scalacheck/blob/master/src/main...

We can set these manually, e.g. via a config file or commandline args, but the default is minSize = 0

https://github.com/typelevel/scalacheck/blob/master/src/main...

and maxSize = 100

https://github.com/typelevel/scalacheck/blob/master/src/main...

https://github.com/typelevel/scalacheck/blob/master/src/main...


> Tests that use random input data are much more difficult to write correctly.

Interestingly, I personally find them easier to write. I actually find classic unit tests hard to write, probably because I am painfully aware of the lack of coverage.

With property-based testing, on the other hand, I start from the assumption I have about what the code should do. The test then basically verifies this assumption on random inputs.

Doing a unit test with a given input seems backwards to me - it's like a downgrade, because I always start from the assumption I have and based on that I choose the input. So why not encode the assumption, when you already have it in your mind anyway?


Your implementation is necessarily complex. That's why it may have bugs, and why it needs tests.

You have many more tests than implementations. In my experience, ~20x more. If your tests had bugs at the same rate as your implementation, you'd spend 95% of your time fixing test bugs and 5% fixing implementation bugs. That's why tests should be simple.

If you're going to be spending that much time on validating assumptions, I think you're better off trying to express them formally.


I think I disagree, but it really depends what you mean by "test" or "test case". I assume that test case is for a given input, expect certain output, and test verifies certain assumption, such as for a certain class of inputs you get a certain class of outputs.

I believe that you always test two implementations. For example, if I have a test case for a function sin(x), then I compare with the calculator implementation from which I got the expected result. So if the tests are to be comprehensive (and automatically executed), then they have to be another implementation of the same program; you can't avoid it, and you can't avoid (potentially) having bugs in it.

Now, the advantage is that the test implementation can be simpler (in certain cases), or it can be less complete, which means fewer bugs but also (in the latter case) less comprehensive testing.

In any case, you're validating the assumptions. The assumptions come from how the test implementation works (sometimes it is just in your head). And to express them formally, of course, that's the whole point.

For example, if you're given an implementation of sin(x) to test with, you can express formally the assumption that your function should give a similar result.

By formalizing this assumption, you can then let the computer create the individual test cases; it is a superior technique to writing test cases by hand.
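
For example, a rough JUnit sketch of that idea for sin(x), encoding an identity that must hold for every input rather than a handful of hand-picked values (plain java.util.Random standing in for a real property-testing library):

    @org.junit.jupiter.api.Test
    void sinSatisfiesPythagoreanIdentity() {
        java.util.Random rng = new java.util.Random(42L); // fixed seed keeps the run reproducible
        for (int i = 0; i < 1000; i++) {
            double x = (rng.nextDouble() - 0.5) * 2000;   // values in [-1000, 1000)
            double s = Math.sin(x), c = Math.cos(x);
            org.junit.jupiter.api.Assertions.assertEquals(1.0, s * s + c * c, 1e-9, "failed for x=" + x);
        }
    }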


You should test the edge cases explicitly instead of hoping the randomization will save you. If there's some bounded set of values, like an enum, then test every value instead of randomly picking values and hoping for the worst. I don't want to know eventually. I want to know now.


> I don't want to know eventually. I want to know now.

That would be nice, but it's not the choice we're facing. The choice is between knowing eventually (by randomising) or knowing never (with determinism); or alternatively, between definitely finding out at 3AM when there's a production outage, or possibly finding out during testing.

Test suites get run a lot, so even a small chance of spotting cases we hadn't thought of can be worthwhile.

Also, it's much easier to run a randomised test with specific examples than it is to run a hard-coded test with randomised inputs (this is because "randomised tests" are actually parameterised tests, which take test data as arguments). Hence we might as well write the randomised version, then also call it with a bunch of known edge-cases (in QuickCheck-style frameworks this is just a function call, in Hypothesis we can also use the '@example' decorator).

If we go down this route there are also automated approaches to make things easier, e.g. Hypothesis maintains a database of counterexamples that it's found in the past, which it mixes into its random data generators. We can ask for these values and use them as explicit examples if we like.

In my experience writing "randomised tests" (i.e. property checking) is much easier and far more powerful than writing lots of hard-coded examples. I've done this in Haskell with QuickCheck, Scala with Scalacheck, Python with Hypothesis, Javascript with JSVerify and I hand-rolled a simple framework when I wrote PHP many years ago. Occasionally I find the urge to sprinkle a few hard-coded tests into the suite, but it rarely seems worth it.


I'm not sure who told you that non-random tests mean non-parameterized tests, but I don't think you'll get a lot of pushback on parameterized tests from any of the commenters here.

The point is you should write exhaustive tests such that you couldn't imagine the randomized test finding anything new, especially on bounded sets of inputs. If you're not writing exhaustive tests because of a random strategy then yes, the choice is exactly between knowing immediately or later.

I've seen novices assume random tests are exhaustive even when they could think of several edge cases.


Lucene has randomized test cases, but the framework also prints out the (random) seed used at the beginning, so if you find an interesting scenario, you can reproduce it by using the same seed again.


It also adds in the assumption that your randomized setup is correct. That is not always the case, and in a lot of situations it's not trivial.


I think better advice would be to use random data as a supplement, rather than as a replacement for fixed-data tests.


+1. I personally like using random data if the test is concerned with only one input, but I avoid random data when checking e.g. pairs of numbers because of the slight chance that the random numbers would be unexpectedly equal.


Great article. The only thing I would add is that anything other than `assertEquals(expectedValue, actualValue)` is not really permissible. Don't make assertions about 'portion' or 'subset' of your results, make assertions about the entire value! Otherwise, you're throwing away an opportunity to encode a constraint; sooner or later the undefined behavior will make itself known. We are writing tests to avoid that scenario in the first place, no?


Partial disagreement on the phrasing of this - there are many classes of assertion, like `assertThat(expected).containsInAnyOrder(foo,bar,baz);`, that would not meet it. Order often does not matter and cannot be constrained. There are other things similar to this.


Value should probably be a `Set` then!


What if it's a repeated field of a protocol buffer? Parse it into a set and then check it? That's exactly what `assertThat(expected).containsInAnyOrder(foo,bar,baz);` is doing anyway.


I'm not very familiar with protobuf, but in searching, repeated field seems to have list- or set-like semantics, so yeah, I'd parse the value into a list or a set and compare it to a list or a set.


If you were going to work with protocol buffers frequently you'd probably quickly find yourself doing this over and over, and would then create the aforementioned `assertThat(expected).containsInAnyOrder(foo,bar,baz);`.


There are ordered sets, immutable sets... infinite sequences, lazy lists, sometimes collections are hairy to test.


I find that if I am trying to test values that are not 'fully realizable' (infinite, lazy, etc.), there's typically a temporal aspect involved, and I have a 'mock clock and timer' implementation so that I can call `time.advance(t)` from within my tests and time advances in my system under test. This works because (as OP also noted) I only use my own mockable abstractions for answering the question 'what time is it?' and for instructing the system 'call me back at time t'. The net effect is that I can fully define the behavior of the system within an arbitrary window of time, collect the resulting values or behavior into some data structure which encodes what happened and when, and then compare it against expected results.

Re immutable: the values you're getting from your system under test and the expected values you're constructing in tests should always be immutable!


You might have separate assertions or separate tests for the different aspects. Insisting on only asserting equality forces you to lump them together, which can make the failure less clear.


Honest question, what kind of data type can't be compared to another instance of itself for equality? I have never encountered a situation where I have a value of a type which has some 'aspect' that cannot all be incorporated into the definition of `equals()` for its type. Seems to contravene the very notion of values and types. What am I missing?


Java testing at google:

    * Don't mock if you have access to the remote server code, create fakes using calls to class methods being tested or use an in memory version of the entire thing if possible.
I know it's not easy and it takes time, but it's worth it in the long run. If something changes you'll detect errors quickly.
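
For illustration, a minimal fake along those lines might look like this (hypothetical names, not Google internals): a tiny in-memory implementation of the same interface, so tests exercise real call patterns instead of stubbed expectations.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    interface UserStore {
        void save(String id, String name);
        Optional<String> find(String id);
    }

    // In-memory fake: behaves like the real store, minus the network.
    class InMemoryUserStore implements UserStore {
        private final Map<String, String> users = new HashMap<>();
        @Override public void save(String id, String name) { users.put(id, name); }
        @Override public Optional<String> find(String id) { return Optional.ofNullable(users.get(id)); }
    }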

    * When using guice injections, override explicitly what you need directly in the test setup. Don't create huge test modules with everything and the kitchen sink; eventually these become unmaintainable and make it difficult for future maintainers to understand where overrides and injections are coming from.
Integration testing is a whole other subject, but I would highly recommend it for mission-critical production code.


Good stuff. I especially appreciate the entreaty to limit assertions and corner cases to one test concern.

The only thing I didn't like was the Given When Then recommendation. BDD style story language is really meant for key specification examples, which can be automated in functional tests, but not really appropriate for unit testing.


Even for unit testing, I find that a standard arrange-act-assert structure is helpful.


While Given When Then rolls off the tongue easier, I think Arrange Act Assert better describes the structure


I find adding BDD-language comments in my unit tests to organize the code makes them WAY more readable, and I've gotten very positive feedback from coworkers about it.
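
Something like this (a made-up example of the comment style, with a hypothetical Account class):

    @org.junit.jupiter.api.Test
    void withdrawalReducesBalance() {
        // given
        Account account = new Account(100);

        // when
        account.withdraw(30);

        // then
        org.junit.jupiter.api.Assertions.assertEquals(70, account.balance());
    }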


I fucking love them, with a caveat. If they are written in such a way as to optimize readability, that’s when they really shine. The audience of these tests should be other humans, and they should be written in a way that makes it easy to read.

However, if they’re written in poor English they can muddle the intention and make it harder to read. They might actually make the tests even more confusing as you now have to figure out what the statements are trying to say. This is especially clear if you’re working with teams that don’t use English as their first language.

Despite that, I do think overall it’s awesome to use such tests in most cases.


This article says GWT, but all the examples are showing AAA (Arrange Act Assert). Subtle semantic difference, likely not the core of the article though.


Why not BDD style in unit tests though? Our team adopted it and I really like it over triple A.


I've found it just obscures what the test is doing. If I'm looking at a test that failed, I want to see the code that caused the failure, as plainly as possible; I'm quite capable of reading code (indeed I find code is better at communicating clearly than English is), so I'd rather see a minimum of ceremony than some misguided effort at making tests more readable for imaginary business users.


I don't really understand his argument against it.


I'm pretty lazy, but I like testing..

The habit that I've repeated on my last few project is this:

Work out a way to gracefully serialize/deserialize data structures in your code into a human readable format, like EDN or pretty printed JSON. (this is easier in more civilized languages... -winks at Clojure-)

Pick high level functions that exercise a lot of code for testing.

Generate input data to exercise that function either by hand if it's small and tractable, or by instrumenting real runs of the program and outputting serialized copies of that data (to file or console).

Write test helpers that call the function with the input data, and depending on an environment variable, read the expected 'golden' data from disk, or if UPDATE_TEST_RESULTS=true, write the function output back to disk with a filename unique to the test, and mark the test as passed. If the test fails, print a nice readable, structural diff of the expected/actual data - e.g. python datadiff, or clojure.data/diff so the exact change is visible.

On the first run you set UPDATE_TEST_RESULTS=true and generate the expected test results. Check your 'golden' result data into git.

Now you can easily keep expected results up to date even if logic changes, and get a git diff of exactly how the data has changed.

Acknowledged that this doesn't speak to side effect riddled code that needs mocks.. and doesn't buy into the build-the-test-first style of development, but I've found it's a good return on investment for regression testing, and provides visibility into subtle logic changes, expected or otherwise, introduced by code changes.
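
A bare-bones Java version of the helper described above might look like this (names and layout are made up; a structural-diff library would replace the plain string comparison for nicer failure output):

    import java.nio.file.Files;
    import java.nio.file.Path;

    final class GoldenFile {
        // Compare the serialized result against the stored "golden" copy, or
        // regenerate the golden copy when UPDATE_TEST_RESULTS=true.
        static void check(String testName, String actualSerialized) throws Exception {
            Path golden = Path.of("src/test/resources/golden", testName + ".json");
            if ("true".equals(System.getenv("UPDATE_TEST_RESULTS"))) {
                Files.createDirectories(golden.getParent());
                Files.writeString(golden, actualSerialized); // commit the result to git
                return;
            }
            org.junit.jupiter.api.Assertions.assertEquals(Files.readString(golden), actualSerialized);
        }
    }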


Are you referring to snapshot testing[0][1]? i.e. you first "snapshot" the output of the function and commit it to VC, and each test run will run the same input and compare it against the "snapshot", failing and giving a diff if it differs.

I'm about to try it soon, seems like a good ROI as you said.

[0] https://jestjs.io/docs/en/22.x/snapshot-testing [1] https://github.com/mitsuhiko/insta


I think I am! Through convergent evolution at least - I hadn't found essays explicitly advocating it at the time.

My use case was comparing results from chains of Spark RDD & Dataframe transformations, so having fairly large realistic input/output datasets was part of the game, and the main reason that manually writing all expected results wasn't feasible.


Mandatory link to https://www.youtube.com/watch?v=EZ05e7EMOLM

It clarifies a few of the points made in the article.

Personally I would add: Don't overuse mocks, only use dependency injection on the module level (Ludicrous use of singleton DI will hinder your design, not help it)


"Do programmers have any specific superstitions?"

"Yeah, but we call them best practices."

(from https://twitter.com/dbgrandi/status/508329463990734848 but that's probably not the original source)


Good luck coming up with your own way to write every test, making every decision as if it were the first time!


I'm happy to take advice, but given how debatable everything is in programming, I'm skeptical of anything labelled 'best practice'... Over the decades, fads come and go; today's best practice is often tomorrow's antipattern...


I'd be happy if we could just get to the point where programmers wrote code that could be unit tested. i.e. don't write static initializers that connect to live databases for a start.


Dependency Injection + Mockito allows one to do that. There's a whole discipline to writing code that's easily tested. Our rule is no more than four layers: initiator, business logic, services, anemic domain model. Initiators abstract away the inbound protocol and serialize to the common model. Business logic controllers handle all of the branching. Services effectuate single actions. Services can't call other services. And the domain model is how everything talks. We all build our apps this way and it's really easy to move people between projects. Not perfect, but it works for about 85% of the stuff one has to write.
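
As a sketch of what that buys you (hypothetical service and repository names): once the dependency is handed in through the constructor instead of a static initializer, the business logic can be tested with a Mockito mock in a few lines.

    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.when;

    interface UserRepository { String findName(long id); }

    class UserService {
        private final UserRepository repo;
        UserService(UserRepository repo) { this.repo = repo; }
        String greeting(long id) { return "Hello, " + repo.findName(id); }
    }

    @org.junit.jupiter.api.Test
    void greetingUsesRepository() {
        UserRepository repo = mock(UserRepository.class);
        when(repo.findName(7L)).thenReturn("Ada");
        org.junit.jupiter.api.Assertions.assertEquals("Hello, Ada", new UserService(repo).greeting(7L));
    }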


I'm not sure calling the domain layer "anemic" by default is correct, as that's typically a (negative) description of models which are too data-driven instead of behavior-driven. I would suggest an alternative layer structure: Initiators/Controllers -> Application/Domain/Infrastructure Services -> Domain Models/Business Rules/Invariant Checking


We've thought about that; the style I've described is very non-OO. It is easy to teach though and it makes unit testing a breeze.

I like the architecture you've described. I think we arrived where we did because it's a natural fit when using a DI container, which manages state and transaction groups for you.


Is database access only in services?


Correct. A business logic controller cannot directly contact the outside world; it must talk through a service.

This is nice because it prevents leaky abstractions into the controller layer. Actions across services that need to be atomic can be grouped in an XA transaction:

    // XA start
    databaseService.resetPassword(user);
    emailService.notifyPasswordChange(user);
    // XA commit


Haven't seen anything that bad in ages, but yeah... I'd lose my cool if I saw that.

The last bad thing I saw was - quite recently, actually - some business code in a request handler mutating state on a controller field (controllers being singletons). Lots of fun once we started load testing with concurrent sessions...


I'm kinda confused about that one. Because if the value isn't coming from some external source like a db, api, file, whatever, then why do you even need a static initializer? If you mean they hardcoded the prod database, then I'm so sorry.


They hardcode the code that reads a config file and gets the host name of the DB and then connects to the DB. This initializes a singleton object that’s accessed by nearly every file in the rest of the system. They do this for every external source, not just databases. In my experience, this is the most common way Java “developers” design systems, and they usually get angry with me if I try to fix it because their way is “faster”.


I've run into this type of thing too, it's soul-sucking because it's so easy to avoid. I wouldn't say I do TDD per se, but I definitely write tests for many things to prove to myself that it's working. Many devs just build + run the code and poke at it, which enables all types of heinous patterns like this (and in my opinion is a super inefficient way to code).


So why is this design bad? It seems like when you actually start running into scaling issues that singleton translates naturally into borrowing from a shared connection pool.


It's much more ergonomic, flexible, testable, and configurable to inject that connection pool instead of bodging it together in a static initializer. Classes shouldn't be managing resources of which they are properly only consumers...


... because nothing can be mocked out.


Do you not just mock the singleton? I mean don't get me wrong, it has all the usual global variable downsides but I don't see too much a meaningful difference between every method passing around the same connection pool handle explicitly vs ambiently. And either you can mock the connection objects the singleton returns or you can't.

In most apps I've seen "services" like Redis/Rabbit/Memcache/Postgres/External APIs/Storage are cross-cutting concerns and you make your life a nightmare by having to pass

    myfunc(actual, params, db, redis, memcache, rabbit,...)
because if you realize deep in your call stack you actually need data from Postgres, now you have to change all the callers recursively to pass the handle down. If you have this global "service catalog" it's almost always to eliminate the need for passing the same effectively global connection pools to hundreds of call sites.

It is annoying that it makes every test require a bunch of ambient service mocks but you only really have to set it up once.


Yeah, sure, if you declare every variable global, it makes adding new features feel quick and easy. There's a reason that global variables are considered harmful.


I completely agree with you! But we're talking about one of the few exceptions -- cross-cutting concerns.

I think just about everyone would scoff at the idea that every single method in your codebase should take an explicit `logger` parameter rather than just having a global logger object.

So now we're looking as a case where you have a bunch of services: dbs, caches, apis, storage that are used all over the app ("in every file") and you have to make a judgement call.

* Have hundreds (honestly thousands at current $dayjob) of methods that do a lot of work just to pass around the same connection handle.

* Declare a single object that manages all the connection pools.

I think it's really hard to escape the fact that connection pools are actually global and you either have to admit this or have your runtime hide it from you.

Another example of a cross-cutting concern that runtimes usually hide is event loops. Can you imagine if every function that wanted to use async had to be passed an event_loop variable?


Static methods in general that create side-effects suck. PowerMock allows you to deal with these troublesome classes though so unit testing is possible.


Yes, surely it's possible. But the resulting test is very tightly coupled to the implementation of the SUT, which gives the test more than one reason to change.

We should strive generally to have tests that only change if the business requirements change. But if I want to refactor my unit (whatever that might be) then the test should not change, or at least should not change __much__


Where the hell are you working where people think that's ok? I haven't seen anyone defending that for 10+ years.


> KISS > DRY

I'd agree elsewhere, but not for tests. Non-DRY tests often mean that there's an ad-hoc similarity between a series of tests.

Without DRYness, as time passes, domain knowledge fades away, and more maintainers touch the same test code, it becomes quite likely that this similarity will be broken. Which is a perfect recipe for false negatives: one fixes one test but not all other tests, which might start exercising a useless code path due to drift.


The more canonical form is DAMP rather than KISS. Descriptive and Meaningful Phrases. DRY is fine in moderation in tests. E.g. instead of `Person.builder().withName("Alice").withAge(100).build();` using `People.grandmother()` is DRY-ish, but certainly something you'd see less of in production code.


To easily see in test failures what went wrong we follow the naming convention:

UnitOfWorkUnderTest_Scenario_ExpectedBehavior

Which mimics the internal GWT structure.
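
For example (made-up names, just to show the convention):

    @org.junit.jupiter.api.Test
    void withdraw_amountExceedsBalance_throwsInsufficientFundsException() {
        Account account = new Account(50);
        org.junit.jupiter.api.Assertions.assertThrows(
                InsufficientFundsException.class, () -> account.withdraw(100));
    }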


I have suggested the same previously, but the display-name approach is better. Then the test method names can be short:

    @Test
    @DisplayName("throws IllegalArgumentException when given bad data")
    void badData() { /* ... */ }


I love this. I started out trying to have simple, short, descriptive names for the test methods, but that doesn't work quite as well with more complicated test cases. So I grew to like something along these lines:

given_someInput_someOtherInput_when_doX_then_thisThingHappen_thatThingHappen

which has the advantage of being optimised for reading. When I want to skim over the test class quickly I can simply look at the method names to get an idea of what each test is doing, and if I want more details I can look at its implementation


Recommending AssertJ and not mentioning Hamcrest? Hamcrest is more flexible, more composable, and more extensible.


Having used both extensively, AssertJ's API is more obvious and generally useful out of the box. It's replaced all the places where I'd previously used custom Hamcrest matchers with generally smaller amounts of code that spits out better errors when tests fail. It's not an either-or though; sometimes it's worth looking at whether Hamcrest is the right approach.


The table of content is very helpful.

However, I'm curious why this hugo generated static site is consuming over 130MB memory? For reference, HN is consuming 5MB memory only.

(the numbers are observed from about:performance in Firefox)



