I'm experimenting with introducing formal methods on my team. We're working with concurrency in an event-sourced system and want to be reasonably certain our design is correct. We're also working on embedding our specifications deeply into our test suites to address some of the trade-offs mentioned near the end. We like to have our cake and eat it too.
I'm down with advocating more use of formal methods in industry and open source. The cost of using these tools has fallen rapidly over the last decade or so, while the complexity our system designs take on keeps rising. That's starting to make the trade-off attractive, imho.
Agreed - with better and lower-cost tooling the tradeoffs get steadily more appealing.
Back in 2005ish I put together a panel with Kent Beck of extreme programming fame and Martyn Thomas of Praxis, which was doing commercial development using formal methods and SPARK/Ada. We were hoping for fireworks, but it turned out they agreed on a lot: formal methods need to be used in the context of a pragmatic development process; sometimes iterative exploration is the right approach; and for high reliability you need to complement testing with other techniques like formal methods. Concurrency was one of the examples where there was general agreement that testing and code inspection just aren't enough.
I'm also supporting a customer who uses a stratified approach to verification and validation on mission-critical code: layers of static analysis, unit tests, property-based tests, and formal methods.
Thanks for writing this article. I've been reading about TLA+ for a while, but the examples in Lamport's book all seemed so irrelevant to my day-to-day work that I eventually gave up. I think starting with a motivating example like this is a better approach. I look forward to the book.
You might find SPIN more approachable, the syntax is similar to a cross between C and bash :)
Here's an example from the book "The SPIN model checker", by Holzmann, the creator of SPIN:
mtype = { P, C, N };
mtype turn = P;
pid who;

inline request(x, y, z) {
    atomic { x == y -> x = z; who = _pid }
}

inline release(x, y) {
    atomic { x = y; who = 0 }
}

active [2] proctype producer()
{
    do
    :: request(turn, P, N) ->
        printf("P%d\n", _pid);
        assert(who == _pid);
        release(turn, C)
    od
}

active [2] proctype consumer()
{
    do
    :: request(turn, C, N) ->
        printf("C%d\n", _pid);
        assert(who == _pid);
        release(turn, P)
    od
}
And this is Dekker's mutual exclusion algorithm from the same book:
bit turn;
bool flag[2];
byte cnt;

active [2] proctype mutex()
{
    pid i, j;

    i = _pid;
    j = 1 - _pid;
again:
    flag[i] = true;
    do
    :: flag[j] ->
        if
        :: turn == j ->
            flag[i] = false;
            (turn != j) -> /* wait until true */
            flag[i] = true
        :: else ->
            skip /* do nothing */
        fi
    :: else ->
        break /* break from loop */
    od;

    cnt++;
    assert(cnt == 1); /* critical section */
    cnt--;

    turn = j;
    flag[i] = false;
    goto again
}
I don't like the TLA+ syntax, which is quite different from a typical programming language's. In this blog's deadlock example it looked much nicer than usual, for some reason.
TLA+ is intentionally made to look different from a typical programming language because it is very different. Even a simple programming language is far more complex than the simple mathematics underlying TLA+. If all you want to do is to verify some simple assertion about a relatively simple program, then SPIN might indeed suit you better (assuming you can express what you want in SPIN). But TLA+'s main contribution is to allow you to design and deeply understand a complex system, and how and why it works, by reducing it to its bare essentials, and describing it at whatever level of detail you're interested in (often in multiple levels of detail, which are then formally related to one another).
I used TLA+ to design a locking strategy around memcached (well, it's a lot messier than that, I used TLA+ to debug a race condition we encountered in production which led eventually to a locking strategy). It's a fantastic tool for designing concurrent routines.
Extreme programmers were dogmatic about specific development practices like unit testing, and for that and other reasons XP has fallen out of favor.
Scrum and Kanban seem to be the big winners.
Unit testing is really bad at catching concurrency errors, interactions between components, domain errors and undefined behavior.
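To illustrate the concurrency point, here's a minimal sketch (hypothetical `Counter` class, not from the article): a data race that a typical single-threaded unit test will pass every time, while the same class silently loses updates under two threads.

```java
public class RacyCounterDemo {
    // Unsynchronized counter: value++ is a read/modify/write sequence,
    // so concurrent increments can be lost.
    static class Counter {
        int value;
        void increment() { value++; }
    }

    public static void main(String[] args) throws InterruptedException {
        // A typical single-threaded unit test: passes every run,
        // so the race never shows up in CI.
        Counter c = new Counter();
        for (int i = 0; i < 1000; i++) c.increment();
        System.out.println("single-threaded count: " + c.value); // always 1000

        // The same class under two threads: the final count is often less
        // than 200000, but only sometimes -- that schedule-dependence is
        // exactly what makes the bug hard to catch with a test.
        Counter shared = new Counter();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) shared.increment(); };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("two-threaded count: " + shared.value);
    }
}
```

The failure is probabilistic, so even a multi-threaded test only catches it when the scheduler cooperates — which is why model checkers, which explore all interleavings, earn their keep here.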
I admit that verification based on temporal logic (as used by TLA+ or SPIN) is a solid helper when designing algorithms involving concurrency, but at the same time I must also admit that I've never seen it used. Typical concurrency-taming strategies I've seen in the past include using higher-level primitives (like a thread pool where jobs can be enqueued) or taking and adapting code written by an expert (e.g. Anthony Williams has some C++ examples in his book).
Race conditions are much more frequent in my experience. In fact I'm so biased towards finding those that I was frustratedly wondering how a race condition could happen in the example if everything is synchronized :)
Proving that a multi-threaded work queue is correct and using that might be a good tactic, assuming there isn't such a primitive available. But if a program is designed around locks and sharing, that becomes much harder to model. Sooner or later such a program will be in a broken state.
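In Java such a primitive is already available: `java.util.concurrent`'s `ArrayBlockingQueue` is a library-provided bounded buffer. A minimal sketch of leaning on it instead of hand-rolled locks and wait/notify:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    // Producer/consumer over a library-provided bounded buffer:
    // put() blocks when the queue is full, take() blocks when it is
    // empty -- no manual locking or wait/notify in sight.
    static long run(int n) throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(4);
        long[] sum = new long[1];
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) buffer.put(i);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) sum[0] += buffer.take();
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        // join() establishes happens-before, so reading sum here is safe.
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(1000)); // 0 + 1 + ... + 999 = 499500
    }
}
```

All the tricky interleaving lives inside the (heavily reviewed) library class, which is the point: design the program so the shared state goes through a primitive someone has already gotten right.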
The second. In the given example, if skilled programmers cannot find the bug in a couple dozen lines of code within an hour, it is not an Agile vs Better Agile vs non-Agile question. It is a question of whether there is a member on your team who has the "toolkit" to test concurrency scenarios sufficiently well to find this kind of bug.
Of course, the hard part is recognizing when to use formal methods and when not to. I do wonder if someone who had used TLA+ (frex) in the past could look at this code and know that now is the time. Or not.
At the bottom of JavaUnitTestChallengeSolved[0], William Underwood posted the test below and claims that it always catches the `notify()` problem. He did not mention using TLA+.
public static void testConcurrency() throws InterruptedException {
    final int[] threadsFinished = new int[]{0};
    final BoundedBuffer buffer = new BoundedBuffer();
    Thread thread1 = new Thread() {
        public void run() {
            synchronized (buffer) {
                synchronized (this) {
                    this.notifyAll();
                }
                try {
                    buffer.wait();
                } catch (InterruptedException e) {}
                threadsFinished[0]++;
            }
        }
    };
    Thread thread2 = new Thread() {
        public void run() {
            synchronized (buffer) {
                synchronized (this) {
                    this.notifyAll();
                }
                try {
                    buffer.wait();
                } catch (InterruptedException e) {}
                threadsFinished[0]++;
            }
        }
    };
    synchronized (thread1) {
        thread1.start();
        thread1.wait();
    }
    synchronized (thread2) {
        thread2.start();
        thread2.wait();
    }
    synchronized (buffer) {
        buffer.put("");
        thread1.join(1000);
        thread2.join(1000);
    }
    assertEquals("Not all test threads were awoken!", 2, threadsFinished[0]);
}
That's just testing that all of the test threads are awake after they run once. It performs only one buffer operation (the `put` inside `synchronized (buffer)`), and we trivially know the buffer is "broken" if there are only puts and no gets. It's also using two threads against a buffer of length 4. This isn't actually surfacing the bug, just something that looks somewhat like the bug if you already know what the bug is.
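For contrast, a stress test that actually exercises a bounded buffer needs more threads than slots and a mix of puts and takes. Here's a minimal sketch; the `BoundedBuffer` below is a hypothetical stand-in (the article's real class may differ), written with `notifyAll()` — the fixed version — so the test completes. With `notify()` in its place, this same workload can hang.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BufferStressDemo {
    // Hypothetical minimal bounded buffer with the notifyAll() fix.
    static class BoundedBuffer {
        private final Deque<Object> items = new ArrayDeque<>();
        private final int capacity = 4;

        synchronized void put(Object o) throws InterruptedException {
            while (items.size() == capacity) wait();
            items.addLast(o);
            notifyAll(); // with notify() this can wake the wrong waiter
        }

        synchronized Object take() throws InterruptedException {
            while (items.isEmpty()) wait();
            Object o = items.removeFirst();
            notifyAll();
            return o;
        }
    }

    // Runs producers and consumers concurrently; returns how many items
    // the consumers drained. Assumes total divides evenly by consumers.
    static int stress(int producers, int consumers, int perThread) throws InterruptedException {
        BoundedBuffer buffer = new BoundedBuffer();
        int total = producers * perThread;
        int perConsumer = total / consumers;
        int[] taken = new int[]{0};
        Thread[] threads = new Thread[producers + consumers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                try {
                    for (int i = 0; i < perThread; i++) buffer.put("x");
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        for (int c = 0; c < consumers; c++) {
            threads[producers + c] = new Thread(() -> {
                try {
                    for (int i = 0; i < perConsumer; i++) {
                        buffer.take();
                        synchronized (taken) { taken[0]++; }
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        return taken[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // 6 threads against 4 slots, 1000 operations per producer.
        System.out.println(stress(3, 3, 1000)); // 3000
    }
}
```

Even this only makes the buggy interleaving *likely*, not certain — which is the commenters' larger point about needing model checking for concurrency.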
I like your "toolkit" metaphor for test-ability in any case.
If you had asked me a year ago if you could Test-Drive UX development, I would have said no, UX tests are too brittle, and that one should just visually inspect them and focus on testing the data flowing to the view and commands flowing from the view.
But then I got some exposure to the Jest framework for testing and was blown away by how they've automated the acceptance testing, and even automated accepting changes to expected outputs, to mitigate the brittleness problem.
And before that I learned about the Quixote framework for CSS testing and learned I could TDD my CSS code (to some extent), making that development more deterministic.
These tools completely transformed my beliefs about what the UX development process could be.