Using TLA+ in the Real World to Understand a Glibc Bug

pron · on Nov 1, 2020

A clarification:

> Instead of writing proofs, it works using the simple method of running all possible executions of a program.

TLA+ is a formal mathematical language designed to describe digital (and hybrid digital-analog) systems. By itself it "does" nothing other than allow writing a formal description of a system, various propositions about it, and their proofs.

There are tools, however, for mechanically writing and checking TLA+ proofs with a proof assistant (TLAPS), as well as model-checking them (TLC). The author refers to the latter, but it should be noted that model checkers, including an explicit-state model checker like TLC, don't only not run all possible executions of a program, they don't even run it once. They are "static." Indeed, TLC could, in some situations, check whether an uncountably infinite number of program executions, each of which of infinite length, satisfy some proposition within seconds.

Also, the author chose to specify their algorithm in PlusCal, rather than in TLA+, which is pseudocode-like a language that compiles to TLA+ in a straightforward manner and is sometimes more convenient than TLA+ to specify low-level concurrent algorithms (TLA+ is more convenient to specify a higher level view of an entire system).

TLA+ is, however, exceptionally pragmatic due to an availability of a model checker for it (in fact, there are now two model checkers for TLA+, TLC and Apalache, using very different algorithm), but mostly because it is exceptionally easy to learn compared to other formal mathematical languages.

elteto · on Nov 1, 2020

This is something I never quite understood with TLA+ (just from reading articles here and there, never actually used it).

Let’s say I have a piece of C++ code and a TLA+ model for what it should be doing. How do I validate the C++ code with it?

k__ · on Nov 1, 2020

You don't validate the C++ code, you validate the algorithm before you implement it with C++.

You have an idea for an algorithm, write it down with TLA+.

TLA+ tells you if you screwed something up.

You refine the algorithm until TLA+ says it's okay now.

You implement the refined algorithm with the programming language of your choice.

Everything that happens from that point is outside of TLA+'s power.

pron · on Nov 1, 2020

You're asking the wrong question. Unless your program is very small or the property you wish to check is very simple, nothing can fully "validate" it. The question should be, is TLA+ more beneficial than just testing in improving program correctness?

Humanity currently lacks the means to verify "deep" properties of code at any reasonable scale. Any tool that works at the code level either focuses on very simple properties (memory safety), complex but local properties of a small code-unit (Ada SPARK, Java's OpenJML), or deep properties of tiny programs (like seL4 and CompCert). These can be very important, but none truly offer the ability to "verify code" with some repeatable process at a generally-interesting scale. TLA+ could be used to do the same -- e.g. there is a C to TLA+ compiler -- but then it suffers from the same scale/depth/cost limitation of all code-level tools.

So given that we can't just choose to verify a program, the question we want to answer is a different one: what is a good way, in a cost/benefit sense to achieve the correctness we want. Then you can ask whether TLA+ can offer such a way that is "better" in that sense than what you do now, which is likely testing. Those who have tried it, including myself, have come to the conclusion that in cases where the algorithm/design is sufficiently interesting or subtle, the answer is absolutely yes. The way TLA+ achieves it better than code level tools is that it allows you to describe your system at the level appropriate for answering your most important questions about it, and that level can be a much higher one than the code.

So, to answer your question, the mere act of precisely expressing your system and correctness criteria in some chosen level of detail using very simple mathematics (high-school or first-year-college level), and then model-checking some interesting instances, is a good way (in the cost/benefit sense) to find bugs and make your program more correct. You compare it to the code by inspection. No, it does not give you 100% guarantees, but nothing else can, either. It is still worthwhile to do in addition to testing, as overall it can easier and cheaper in many cases to detect certain rare-yet-catstrophic bugs.

Having said that, there are various techniques being explored that can check whether your program matches the TLA+ specification more mechanically than just by inspection, that might not be sound (i.e. give a 100% guarantee) but increase confidence while being cheap.

doonesbury · on Nov 1, 2020

While tla+ supports pluscal so the the spec is closer to imperative programming one ultimately would do it's true there is an insuperable gap between the spec and implementation. It should be noted dafny and Verdi are better here. Both systems allow refinement by constantly reproving safety properties over code thats closer and closer to the real implementation including making runnable code. (Verdi does coq to ocaml and dafny does dafny to .net executable generation). Frama C and spark are similar. However there appears to be a philosophical difference: tla+ finds counter examples to the spec. Absence up to the number of states checked shows the code is bug free. Dafny, Verdi seem to proceed by showing the spec is proved through various proof checking (semi) automation. As such they don't explore state.

While unit testing and what not has it's place there clear advantage of FM is the developer comes up with the spec first then show/prove it rather than coming up with the spec bug by bug by bug which can be slower and more labor intensive and customer facing.

Look if you tell a Dev team we'll need to do four quarters of work over 15 jira epics implemented in two week sprints nobody complains that the plan is dumb because all the requirements aren't really nailed down, the scope could change, or the team might not come up with the algos .. So why even try?

But if you tell the same team we'll be using formal methods there's a lot of complaining because FM can't check everything ... Not down to the real final code ... So why bother? I find this dichotomy disingenuous by programmers.

pron · on Nov 1, 2020

> It should be noted dafny and Verdi are better here.

The problem is that the scale and complexity of systems that are verifiable by Dafny and Verdi, or dozens of not hundreds of other code-level tools, is drastically smaller than those specifiable by TLA+. You are very limited in the scale of the systems you write and verify, and so those tools are much more limited in applicability. It is true that if your program is very small and/or the properties you wish to prove are simple or very specific, then there are tools that can handle them at the code level, and that might be more convenient than TLA+ in those very specific circumstances. Also, Verdi and Dafny are research tools; TLA+ is used in the field by ordinary engineers at Amazon, Microsoft, Oracle, MongoDB and a growing number of other companies to help write software you use quite a lot (most AWS services, for example).

> However there appears to be a philosophical difference: tla+ finds counter examples to the spec. Absence up to the number of states checked shows the code is bug free. Dafny, Verdi seem to proceed by showing the spec is proved through various proof checking (semi) automation. As such they don't explore state.

That's not the philosophical difference, just a difference in available tooling. Both TLA+ and Dafny are just languages. Both have proof tools that employ SMT solvers for deductive proofs. TLA+ also has a model checker, and perhaps so could Dafny. The philosophical difference is that Dafny is a code-level tool, and so gives you "full cover" but only in limited circumstances, while TLA+ gives you a more partial cover but in a much wider range of use-cases.

> But if you tell the same team we'll be using formal methods there's a lot of complaining because FM can't check everything ... Not down to the real final code ... So why bother? I find this dichotomy disingenuous by programmers.

I don't know if I'd use the word disingenuous. I think it's just unfamiliarity. The goal of FM tools aimed at "ordinary" software is not to produce perfect software but to eliminate more bugs more cheaply than with tests alone. They shouldn't be compared to a hypothetical perfection, but to what developers use now.

doonesbury · on Nov 2, 2020

>I don't know if I'd use the word disingenuous.

Point taken. Programmers may see project uncertainly as less pressing/difficult while formal methods are onerous. How much one or the other depends on details, and how the team frames risk and methods to reduce it. Even on the project side, for example, the Chaos report from the 1990s shows some 40% (60% at small companies) of software projects are cancelled before delivering anything mainly due to scope change and lack of executive support. Both require attention. Both are non-trivial.

While we're on formal methods I'd like to solicit your opinion a bit more. I am working on a distributed KV store which I'd like to formally certify. I started with TLA+ except that compared to Dafny, Verdi the gap between spec and implementation is higher. I don't like that. Verdi, in particular, comes with formally verified, shimmable network layers plus the ability to produce runnable code from the Coq spec. My practical assessment is,

+ Dafny: pros: Refine spec to implementation; provides safety and liveness checks. Can produce runnable code towards testing, Q/A experimentation.

- Dafny: cons: The Dafny GIT repo is a mess --- extremely disorganized and therefore hard to simulate examples in papers with the real thing. It's very hard to see what libraries to build on.

+ Verdi: pros: Like Dafy which came after Verdi, it's focused on refinement: go from spec to more and more implementation detail. It's based on Coq proofs and Coq has tons of books, examples, exercises. See https://softwarefoundations.cis.upenn.edu/ for five solid write-ups concluding with formally certified C code. So I do not see Verdi as overly academic. It has formally verified libraries that can be re-used esp. in the area of message passing, marshalling which models reliable and non-reliable packet delivery. Moreover, and this is important for distributed programming, proofs can be done in terms of traces. Not sure if TLA+ and it's model checker TLC can do that.

- Verdi: Cons: I have to learn Coq, not the end of the world but it's another chunk of work. Also Verdi can't help with liveness checks while TLA+ and Dafny or Spin can. I initially worked with SPIN and TLA+: those specs are checked in a model checker which looks for counter examples. I have a certain feel for that. On the other hand as I understand it out the box, Dafny or Verdi or FRAMA-C use proofs. Now a proof is formal and is help, however, I am unclear if Dafny, or Verdi will give me a counter example if a refinement goes wrong. More specifically, if a refinement is wrong which happens: one is unable to prove a claim - and that's your head's up ... or at some point a run of the executables made will assert a violation of a safety property. I'm not sure.

+ TLA is of course much used in industry -- so I won't plough that field again. But I wish I knew if it can prove system properties through traces like Verdi because it seems such a natural way to reason about distributed systems. TLA+ of course does liveness and safety checks.

All of these are free tools ... including SPIN ... so if one can put the time in you can get a lot of power. That's good to know. Navigating some of the finer details as I suggest here is not so easy when choosing one.

For my little project I was thinking Verdi for safety and TLA+ for liveness checks when needed.

Please weight in ... I'd love to hear more perspectives.

pron · on Nov 2, 2020

You might be overthinking this. What you want is a cost-effective way of reaching the correctness you need; i.e. a good bug-found-per-hour rate. Unless you have a lot (years) of experience writing formal proofs, I suggest you try them, first. They generally do not have a good correctness ROI compared to other methods. But nothing and no one can give you an answer without knowing the particulars of your project, the people working on it, and the correctness goals. You'll have to try a few things for a bit and then make up your mind. Just remember that 1. the smaller your project is and the fewer people working on it the less the particular formal method chosen matters, and 2. in formal methods, going from 99.999% correct to 99.9999999% correct can be a 10x factor in cost, so the assumption that more correct = better is generally wrong as there are diminishing returns. Of course, if you just want to do this to learn about formal methods rather than find one that you can frequently use in large projects, then the choice matters even less. In terms of basic principles, the tools you mentioned are much closer to one another than they are different.

doonesbury · on Nov 2, 2020

I want to better motivate my question on traces by quoting from a paper I enjoyed reading:

https://homes.cs.washington.edu/~ztatlock/pubs/verdi-wilcox-...

which gives this example for a distributed lock:

mutex(τ ) := τ = τ1 ++ (n1, Grant) ++ τ2 ++ (n2, Grant) ++ τ3 → (n1, Unlock) ∈ τ2

What this safety requirement means is that if the lock is granted in some pending message to host n2 (say delivered or in queue to be delivered) then it must have been proceeded by an unlock message and not on the same host getting the grant --- in fact on the last node who was granted the lock. Here n1, n2 designate hostIds.

What I like about this approach, is that Verdi seems to make it quite easy to model and specify in terms of nodes, processes on nodes, and messages between nodes whether on a reliable or non-reliable network. And safety conditions seem to naturally flow, in part, from how messages are ordered when sent. Of course, in an unreliable network which Verdi can model too, they may come out of order requiring more work to de-dup and/or ensure exactly-once-delivery.

zozbot234 · on Nov 1, 2020

> Humanity currently lacks the means to verify "deep" properties of code at any reasonable scale. Any tool that works at the code level either focuses on very simple properties (memory safety), complex but local properties of a small code-unit (Ada SPARK, Java's OpenJML), or deep properties of tiny programs (like seL4 and CompCert).

This may be true in a sense, but then a TLA+ specification is also a "tiny program" of sorts. There's no reason why one couldn't use a "code-level" (end-to-end) interactive proof tool to verify TLA+-like specs. Even model checking can be subsumed elegantly within such a tool; a model checking run is simply a search for a refutation (as opposed to a positive proof) of some claimed properties wrt. the proposed spec.

pron · on Nov 1, 2020

> There's no reason why one couldn't use a "code-level" (end-to-end) interactive proof tool to verify TLA+-like specs.

Sure, and that is what TLA+, together with its tools, aim to do. The problem is the relation between the spec, that you can precisely verify, and the program it aims to abstract.

The advantage TLA+ has over code-level tools in this respect is that those tools use code as their specification language -- i.e. a formal language that is deterministic enough to efficiently compile/interpret and execute. But TLA+ allows you to specify at an arbitrary level, even levels with a lot of nondeterminism that precludes compilation, hence the difference between it and code. For example, in TLA+ you can say, suppose I have a subroutine that sorts, and then build and check your specification on top of that without actually writing a sorting routine that can be executed. Similarly, you can say something like, "some hashing function is used" without actually writing any specific hashing function.

This way, you can write a specification that's small enough to be verified, yet captures the interesting aspects of a system that's made of too much code to verify at the code level.

zozbot234 · on Nov 1, 2020

> But TLA+ allows you to specify at an arbitrary level, even levels with a lot of nondeterminism that precludes compilation, hence the difference between it and code. ... Similarly, you can say something like, "some hashing function is used" without actually writing any specific hashing function.

This can also be done in many end-to-end tools. Typically, these tools allow for end-to-end proof based on the actual code; they don't require it.

pron · on Nov 1, 2020

> Typically, these tools allow for end-to-end proof based on the actual code; they don't require it.

TLA+ also allows for end-to-end proof, but it's not primarily designed for it. Because those other tools are primarily designed for it but not for arbitrary-level specification, they are drastically more complicated. This is necessarily the case. To allow execution, you must separate your computation level from your specification level (e.g. there is a distinct, syntactic spec/program or type/object divide), but that's not the case in TLA+. If you're familiar with, say, type-theory-based proof assistants, TLA+ is analogous to being at the type level only (except with an infinite hierarchy in both directions).

In the time in took me to grasp Lean's type-theoretic universal quantification I was writing specifications of large, real-world systems in TLA+; I still find it hard to explain the meaning of universe-polymorphic operators like `id`. Plus, few if any of those languages have model-checkers, which also have a generally and drastically better cost/benefit than writing formal proofs. TLA+ has a very good proof assistant, but we rarely use them because writing a formal deductive proof after a successful model-checking is usually a terrible return on investment, unless your goal is publishing an algorithm rather than delivering a production system.

Due to the scale limitation of sound verification, a system primarily designed for arbitrary-level specification and verification is more widely applicable than a system primarily designed for end-to-end verification, although the latter is certainly valuable in niche circumstances. It is, therefore, unsurprising that TLA+ is more commonly used in the field on systems that are too big/complex for feasible end-to-end verification (which is most "ordinary" software) than any of those primarily end-to-end tools. It is much easier to learn, especially for non-specialists, and thanks to the available model-checkers, it's also much easier to get results with.

zozbot234 · on Nov 1, 2020

> Due to the scale limitation of sound verification, a system primarily designed for arbitrary-level specification and verification is more widely applicable than a system primarily designed for end-to-end verification

I'm not sure what "designed for arbitrary-level specification and verification" boils down to here, or how it would differ in a useful way from an end-to-end-focused design. It seems like it would be relatively straightforward to embed something like, e.g. TLA in a proof assistant. The main features of TLA, viz. temporal logic and non-determinism, are well understood as special cases of modal logic, which can in turn be modeled via quantification on "possible worlds". So I'm not sure how a systen that only focuses on e.g. TLA is "more applicable" and "less niche" - But I would be happy to be proven wrong here.

pron · on Nov 1, 2020

First, UX is not (mostly) about the chosen theory, and the fact that all sufficiently rich formalisms can all be embedded in one another (indeed, TLA has been embedded in Isabelle/HOL, and Coq) doesn't mean that they all have the same UX, just as the fact that Assembly, C++ and Python can all interpret one another doesn't mean that their UX is the same. (And that TLA+ focuses "only" on TLA means just as much as saying Agda focuses "just" on lambda calculus; you can embed the lambda calculus in TLA just as you can vice-versa). TLA+ is specifically designed for engineers -- not logicians -- to specify digital and hybrid systems rather than explore formal mathematics in general. Plus, the tooling makes a world of difference.

Second, as I tried explaining before, all formalisms that focus on end-to-end verification must be more complex, because they must be programming languages; this means that they need a well-defined portion of their syntax to be executable. For example type-theoretic formalisms have distinct categories for the object level and the type-level, the type level not being executable (although in some very specific circumstances, it's possible to automatically synthesise an inhabitant from its type). TLA+ is not a programming language. It has no specific syntactic category that can be executed, and there is no distinction between a type level and an object level. There is just a type-level, you like. Moreover, there's an infinite lattice of formulas, and the simple logical implication, A ⇒ B, means that A implements or refines B, or that B is an abstraction of A (if you're familiar with types, this could either correspond to A being an inhabitant of B or A being a subtype of B, except there is no distinction between the two). In fact, there's a theorem that says that if B is an abstraction of A, then you can prove A^ ⇒ B, where A^ is A with some substitutions. If you're interested, I have a rather long text, written as a series of blog posts, about TLA+'s design of the theory and its ergonomics: https://pron.github.io/tlaplus. Parts 3 and 4 focus on the computational aspect, while part 2 talks of the data aspect.

ImprobableTruth · on Nov 1, 2020

>In the time in took me to grasp Lean's type-theoretic universal quantification

I'm not sure what you mean? They work in the exact same manner as in normal predicate (higher order) logic.

>I still find it hard to explain the meaning of universe-polymorphic operators like `id`.

Uh, are you talking about the normal identity function? If one knows that they types have a level (universe) instead of a "type of all types" to avoid the 'set of all sets' paradox, I'm not sure what the issue would be.

pron · on Nov 1, 2020

> They work in the exact same manner as in normal predicate (higher order) logic.

It is based on dependent products which you need to understand sooner-or-later (sooner, really), and are rather more intricate than FOL.

> Uh, are you talking about the normal identity function?

Ah, but that's the thing. It's not a function and it can't be (for the same reason you don't have an identity function in set theory). In TLA+ the meaning of terms is simpler, as there is a clear distinction between pure syntactic operators and domain objects (sets, but also functions). So in TLA+, you can have either:

    Id(x) ≜ x

which is just parameterised syntax that does not denote any domain object, or perhaps you'd want:

    Id(S) ≜ [x ∈ S ↦ x]

which is an operator defining the identity function on a given set.

In Lean, however,

    universe u
    def id {α : Sort u} (a : α) : α := a

which is similar in spirit to the second TLA+ definition, describes a rather complex object in the meta-domain.

ImprobableTruth · on Nov 1, 2020

>It is based on dependent products which you need to understand sooner-or-later (sooner, really),

Type-theoretic universal quantification _is_ the dependent product (/dependent function, a term I prefer). I assume you're talking about the Cartesian representation, but that's just one way to visualize it, the definition is not dependent on it. If you understand predicate logical quantification it's not necessary at all, since the introduction/elimination rules for type theoretic quantification are the same as for normal predicate logic quantifiers.

>are rather more intricate than FOL

Maybe in application and theoretical implications, but to grasp the concept you have just have to understand that it's relaxing the restriction on "what you can quantify over", which really isn't a huge jump I'd say.

> Ah, but that's the thing. It's not a function and it can't be (for the same reason you don't have an identity function in set theory).

Why would it not be a function? With set theory you have the issue that you can't take a set as an argument and have the codomain and it's elements be dependent on it, but that's exactly what dependent function types allow you to do. The universe U is a sort and A is a valid sort too, so

forall A:U. forall a : A. A

is a perfectly valid function signature and the term would just be (lambda A : U. lambda a : A. a). That doesn't strike me as any more complicated than your second TLA+ example. Sure, you have to deal with universe polymorphism, but I'd say that's compensated by the increased complexity from having to use syntactic operator.

Lean's syntax to specify the universe U is admittedly pretty ugly, but the actual meta-theory isn't that complex I'd say.

pron · on Nov 1, 2020

> Maybe in application and theoretical implications, but to grasp the concept you have just have to understand that it's relaxing the restriction on "what you can quantify over", which really isn't a huge jump I'd say.

It's not that. It's hard for me to try and quantify some essential complexity of a type theory and a set theory, but programmers don't usually learn type theory in high-school or college, but FOL comes "for free" (not to mention the unnecessary complication of constructive mathematics with special handling of noncomputable functions for specifying digital systems) and engineers don't even need to learn the inference rules, because they will rarely bother writing proofs; just pressing the button on the model checker has a much better cost/benefit, plus it allows learning just semantics first, even you do want to learn deduction later. With TLA+ there isn't much to grasp or learn; you know most of the basics. It could just be a situational thing.

> Why would it not be a function?

What is the type of its first argument? IIRC, in Lean, it is only a function when a specific universe is given (implicitly).

ImprobableTruth · on Nov 2, 2020

I'd agree that people are generally going to be less familiar with type theory (though I think it's not uncommon to touch on the simply typed lambda calculus in undergrad courses), but I really think anybody who has all those prerequisites could also learn enough type theory to write specifications in a couple hours because you don't actually need a deep understanding of type theory to use it (though maybe you meant that you immediately started writing specifications of large, real-world systems in TLA+, but I'd presume you'd then be the outlier on that). In my experience people pick up Isabelle (/HOL) really quickly and writing specifications in a type theory based proof assistant isn't a huge jump from that.

Now, writing actual proofs is a completely different matter, but I didn't disagree with you on how economical/ergonomic the over all approach actually is, just the basic complexity of the theory/specifications ;)

>What is the type of its first argument?

Some universe U, which is an implicit argument. You could alternatively write your Lean example as

def id {u} {α : Sort u} (a : α) : α := a

Not being able to specify the type of u in Lean might seem kinda cheaty, but I'd say that's similar to how one declares generic type parameters in e.g. Java. You wouldn't say that those languages' generic functions aren't actually functions, would you?

Similar to how one could make those generics explicit, it is actually possible to define a special type of all universes (not including itself). Agda does this by having a special primitive type of universe levels. As an example

id : forall {u : Level} -> forall {A : Set u} -> A

is the signature of the universe polymorphic identity function. The Level type doesn't you any issues because it is essentially a 'top' type that doesn't have a type itself.

pron · on Nov 2, 2020

> but I really think anybody who has all those prerequisites could also learn enough type theory to write specifications in a couple hours because you don't actually need a deep understanding of type theory to use it

That has not been my experience. Typed formalisms have further complications with "partial" functions -- either they're actually partial, which is a rather complex concept, or you need refinement types to define them, as well as with constructive mathematics, which isn't necessary for specification. Plus, what are you going to do with your specifications after a few hours if you can't write proofs after few hours (or a few days)? And because of that the tutorials don't and won't just teach specification first and proof later -- as the TLA+ tutorials do -- so even if this were hypothetically true, you have nowhere to learn things like that. Lean has very good tutorials, and you can compare how far they get how quickly with the TLA+ tutorials. I don't know if it's even possible to teach such formalisms the same way as TLA+, but even if it were possible, I haven't seen such tutorials. Those languages are so infused with their formalism's proof theory that it's hard to separate the two [1]. For example, most typed formalisms define integer division by zero to equal zero. Suppose they didn't, and they wanted to teach how to define division "properly". How much would a student need to know in order to define and use the div function? OTOH, TLA+ handles partiality in what I think is a simpler way, with a careful definition of Hilbert's epsilon operator (see https://pron.github.io/posts/tlaplus_part2#the-choose-operat...)

Finally, most specifications people in industry are interested in are not of sequential algorithms but concurrent ones, which you also can't do after a few hours (or a few days).

> In my experience people pick up Isabelle (/HOL) really quickly

Depends what you mean by "really quickly." Isabelle has some accidental complications such as the nested languages. But I think something like Why3 can be picked up quickly, although it doesn't handle concurrency easily.

> Not being able to specify the type of u in Lean might seem kinda cheaty, but I'd say that's similar to how one declares generic type parameters in e.g. Java. You wouldn't say that those languages' generic functions aren't actually functions, would you?

The type system in Java or Haskell cannot be used as a sound mathematical foundation, so whatever I call those things it would mean something quite different from maths functions. I call them subroutines, but those that call them functions know (I hope) that they're overloading the term. In any event, students that do learn simply typed lambda calculus learn it in the context of programming languages whose type systems are inconsistent and they certainly don't learn how to do maths with them.

Anyway, `id` is not a function in Lean because the universe argument doesn't have a type. But I wonder, in Agda, is `id id` some object even when a specific universe cannot be inferred? If so, how does Agda avoid Russel's paradox?

[1]: Which I think is a big problem. The clean separation of model and proof is an important property of a formal logic designed for ease of learning, expression and comprehension, and we must not forget that ease of learning, expression, and comprehension -- so that practitioners rather than logicians will be able to make everyday use of it -- has been a central goal of formal logic since at least Frege, if not Boole or even Leibniz, equal in importance to, if not superseding, that of mechanical proof. I might be getting carried away, but I think that TLA+ is a pretty special accomplishment in the history of formal logic in that it is perhaps the first truly rich formal logic that is commonly used by non-specialist practitioners in their day-to-day work.

ImprobableTruth · on Nov 3, 2020

If you're fine with 'stupid' refinement types in the form of subset types without coercion, not only do you get them 'for free' from the core type theory without introducing another language element, I personally also think they're quite intuitive. Using a subset type of non-zero nats or providing a proof that the argument isn't zero to define a 'proper divsion' just 'makes the most sense' in my opinion, but that may just be ultimately down to personal taste.

I really do think though that no single part of dependent type theory is actually all that complex. It's not as simple as TLA+, but there's nothing really hard about it till you start proving.

And I'm not saying that all this is feasible right now, but I'm fairly certain that this is at least fundamentally possible. You can't 'truly' separate specifications from proofs in dependent type theory, but you also definitely don't need to really understand how to write actual proofs to write even complex specifications in it. All of this of course depends on stronger automation for a good subset (possibly by also integrating strongly with other provers), but I don't think that's a total pipe dream (though I am definitely personally biased on this).

>The type system in Java or Haskell cannot be used as a sound mathematical foundation, so whatever I call those things it would mean something quite different from maths functions

I'm just talking about the 'well-behaved' subset, but let me go at this from another angle.

If Lean tomorrow introduced a Sort Omega to type universes, the set of programs that you can write wouldn't change. It's essentially just a syntactic change. Would that really be enough to make you say that these are functions now while they previously weren't?

>But I wonder, in Agda, is `id id` some object even when a specific universe cannot be inferred? If so, how does Agda avoid Russel's paradox?

If you quantify over a universe, a constraint gets introduced as part of the type checker. These constraints can be viewed as a graph that is checked for cycles, so not having a specific universe isn't an issue.

pron · on Nov 3, 2020

> but I don't think that's a total pipe dream

You may well be right. Personally, though, I find this essential inseparability of the logic as specifying a model and its proof theory aesthetically unappealing (not that I find everything in formal FO set theories appealing). Of course, it may have other advantages.

> Would that really be enough to make you say that these are functions now while they previously weren't?

Yes, provided that `id` wouldn't be a function that could operate on that type. But any operator with a truly universal domain cannot be, itself, a quantifiable object without introducing a Russell paradox (perhaps constructiveness can save it, but assuming the system allows non-computable objects, too). As it is, `id` is universal, but it isn't a function. This isn't just a philosophical or even a purely syntactic difference. Lean doesn't allow you to refer to `id` as an object, just an application of it to a known universe.

> so not having a specific universe isn't an issue.

Does that require forbidding full recursion, or can you still introduce non-computable objects?

ImprobableTruth · on Nov 3, 2020

>Lean doesn't allow you to refer to `id` as an object, just an application of it to a known universe.

That's true, but I don't think you necessarily need a type for all universes (excluding itself), just some other way to denote that type is supposed to be universe polymorphic.

>Does that require forbidding full recursion, or can you still introduce non-computable objects?

I'm not sure what you mean exactly. The core type theory assumes that all functions are terminating. You can 'cheat in' full recursion by providing a non-computational termination proof, but the type checker would still be assuming that its actually terminating.

I think if that wasn't the case, it'd need to be restricted (unless you did something really weird like introducing infinite negative universe levels).

lakecresva · on Nov 2, 2020

I think TLA+ is great, but the way in which you try (here and elsewhere) to downplay its inability to verify implementations/executable code by dismissing an OS kernel and an optimizing C compiler as "tiny programs" make you sound like a zealot at best. At worst you sound like a snake-oil salesman when you acknowledge seL4 and CompCert, and then immediately assert that "we can't just choose to verify a program".

indolering · on Nov 11, 2020

I'm a big fan of high-assurance code and feel that we should write software as if people's lives depend on it, because they do! But your response is a bit hyperbolic and calling someone a zealot or a snake-oil salesman is in poor taste.

CompCert and seL4 represent the upper limit of what can be formally verified, and they are relatively small programs. CompCert targets a subset of C99, which is fairly small. Certainly much smaller than C++ and most other production programming languages. seL4 weighs in at 10K LoC and only provides memory protection and process scheduling. The implementation cost ~2.2 py but the proof cost was ~20 py (although they estimate new tooling would reduce this to ~6 py)[1].

Even Gernot (the head of the seL4 project) stipulates that formal verification is suitable only for software that doesn't change much over time, because it's difficult to add new features without breaking the proof. seL4 benefits from decades of iterations on the L3/L4 design. Now imagine trying to sell formal verification of a product like Gmail, which IIRC has a code base churn of ~20%/year. Even using formal methods to create a markup sanitizer would be a huge research project, probably larger than that of Everest's formally verified implementation of the HTTPS stack[2].

Economies of scale have made commodity programming languages designed for low-assurance code ubiquitous. If we ever want to see investment in high-assurance tooling, we have to prove it's value within the existing marketplace. And we can only do that if we are honest about what is possible and what is profitable.

So no, we can't just choose to verify any program. We are getting to the point where it is profitable to formally verify small components, like DNS or the WASM bytecode. That's exciting! But it's not as feasible as you or I want it to be.

[1]: https://www.sigops.org/s/conferences/sosp/2009/papers/klein-...

[2]: https://project-everest.github.io/

uluyol · on Nov 1, 2020

> Humanity currently lacks the means to verify "deep" properties of code at any reasonable scale.

This is an active area of research. Papers are regularly being published (e.g. in the OSDI/SOSP community) that increase the depth or breadth of formal verification.

Personally, I don't think we'll ever reach a point where all (or even most) code is formally verified, but it does seem to be making some traction.

jen20 · on Nov 1, 2020

For a fantastic intro to the basics and purpose of TLA+, I'd strongly recommend Scott Wlaschin's [1] talk from NDC Oslo this year. Like other academic-seeming subjects, Scott does a great job of applying it to real-world programming.

That said, the answer to your exact question is that it doesn't validate the C++, it validates the algorithm. It's your job to ensure the C++ implements the algorithm as validated!

[1]: https://www.youtube.com/watch?v=tqwcz-Yt9gQ

pron · on Nov 1, 2020

I wish people use "real" (pretty-printed) TLA+ syntax -- the syntax used when reading a spec -- when giving presentations rather than SANY's LaTeX-like ASCII syntax used to type them. The read/write ratio for maths, including TLA^+ specifications, is higher than for programs, and so there are two concrete syntaxes optimised for those two different modes. I think most people know that on the computer, maths is usually typed differently from how it's read, and they expect to read the maths, not the LaTeX source used to typeset it. It isn't any different for TLA^(+).

vector_spaces · on Nov 1, 2020

My (probably wrong) understanding is that you don't -- you spec out the system design in TLA+, and use that to check if your design is sound (free of race conditions or certain other kinds of logic bugs). Checking that the TLA+ spec is implemented properly by the (say) C++ code is out of scope

Twirrim · on Nov 1, 2020

An anecdotal example of the use of TLA+

The service I worked for in AWS had to built a key service, about 18 months-ish after launch. If the service got anything wrong, it would wipe out the customer's data.

Evaluating the conditions and making the right decisions wasn't that complicated, but it was complicated enough as it involved multiple service components to meet certain scaling properties.

The need to not do the wrong thing was crucial, and so the decision was made to use TLA+ to make sure things were correct.

Two senior engineers had designed the overall process and it had been through architectural review (so almost all the senior engineers in the team had seen and reviewed it, and in this case I know at least one well respected principal engineer had been involved. Potential problems had been identified and resolved). Everyone felt confident, but we wanted to be sure.

The engineers spent about a week getting up to speed with TLA+, and then started writing up the formal proof of the process, defining the valid states etc. before letting the model checker at it. Problems were found. A number of them, in fact, mostly things that are hard to just mentally reason about that come with the increased complexity of the service being made from multiple parts that would need to communicate. A number of the bugs were of the type that would be unlikely to get exposed by any kind of unit/integration/functional testing. This is where the strong value proposition is for the use of TLA+, catching logical issues that are hard to reason about or test for, and Amazon has been using it to increasing amounts across services.

The engineers involved, settled in to a refine/re-run cycle until they had eliminated the bugs. A final review was done of the TLA+ model and everyone was happy.

After that they settled in to implementation phase, and they found that the way the model had ended up, it was relatively quick and simple to implement in Java. The model pretty much outlined everything that was needed in code. It became almost "fill in the blanks" coding, and there was a lot less mental effort involved for all in this stage.

The initial optimistic estimate for the project was around 6 months, with extensive allowance for "shadow mode" running and verifying. The reality was that even with the time taken to learn, it was around 4 months from start to end, even including the time for engineers to learn TLA+ from scratch, and the service was launched with a high degree of confidence.

The trick is that from here, you really need to think about keeping the model up to date. It should be seen as being as important as any other tests you've written for your actual code.

User23 · on Nov 1, 2020

A better question to ask is how do I formally derive a program that satisfies a given specification? Then validation is just mathematical “checking your work” and is easy or even trivial.

pron · on Nov 1, 2020

Except that's another question to which the answer is currently, "you don't", at least not at any interesting scale, not with TLA+ and not with anything else. But TLA+ is a tool aimed to help achieve correctness effectively, i.e. get a good cost/benefit for it. Between the highly scalable, low-confidence, testing, and the high-confidence, low-scale end-to-end verification, there's a huge swath of software correctness techniques, and the flexible TLA+ tries to hit a rather large swathe of that spectrum.

In other words, we ask, is this better than what we do now in a sufficiently large portion of cases, rather than is this the silver bullet? The answer to the second question is no. I think the answer to the first one is yes.

User23 · on Nov 1, 2020

Formal program derivation satisfying a given specification has been around for well over half a century now. Hoare logic and the predicate calculus are entirely adequate for formal derivation in any language with well-documented semantics. Lamport himself has contributed significantly to the field. Unfortunately the American dominated programming industry has a strange phobia about mathematical reasoning[1]. The irrationality is exposed by patently ridiculous statements like that formal derivation doesn't scale. If you believe that, then you can't believe any nontrivial results from the hard physical sciences or pure mathematics. The actuality is that formal derivation makes building nontrivial constructs considerably easier. It is a tool, and like any tool when used properly it acts as a capability multiplier.

[1] This is doubly confusing to me because it's not a matter of difficulty. Anyone with the mental horsepower to learn a modern programming language is easily capable of gaining proficiency in the predicate calculus.

pron · on Nov 1, 2020

That's like saying that we've been able to turn lead into gold for a while now. It depends on what you mean by "we" and by "can." Some specialists have turned some lead atoms into gold in a billion-dollar particle accelerator.

Common, "ordinary" software can neither be mechanically derived nor verified using any feasibly affordable process at common scales. Let me remind people that the largest programs ever verified end-to-end were one-fifth the size of jQuery, and they required world-experts years of effort as well as careful simplification.

I've written a fair bit on program correctness and formal verification/synthesis [1], and I, myself, am a practitioner and advocate of formal methods, but I am not aware of any in-the-field, industrial scale, general (non-niche), non-specialist, repeatable, affordable formal synthesis/verification techniques that can handle systems and specifications of the scale TLA+ is commonly used for by ordinary engineers. If you know of such examples, I'd be interested in seeing them.

[1]: https://pron.github.io/

fanf2 · on Nov 1, 2020

I recently found Microsoft’s P programming language https://github.com/p-org/P and I’m wondering how it fits into this space. The P README says:

“P is a language for asynchronous event-driven programming. P allows the programmer to specify the system as a collection of interacting state machines, which communicate with each other using events. P unifies modeling and programming into one activity for the programmer. Not only can a P program be compiled into executable code, but it can also be systematically tested using Model Checking. P has been used to implement and validate the USB device driver stack that ships with Microsoft Windows 8 and Windows Phone. P is currently being used extensively inside Amazon (AWS) for model checking complex distributed systems.”

inaseer · on Nov 2, 2020

This is a good callout. P language evolved into the P# framework (as opposed to a language) which then evolved into Coyote (https://microsoft.github.io/coyote/)

Coyote allows developers to systematically test concurrency expressed through .NET async/await tasks. I encouraged my team to embrace Coyote based concurrency testing right from the start as we set out to build a new Azure service from the ground up. The results have been encouraging. The unification of the programming and modeling language is actually _quite_ useful in practice. For one, developers who aren't experts in formal methods are able to write concurrency tests fairly easily with minimal guidance. Secondly, the model and code never go out of sync as they are one and the same. The trick through which we are able to scale out these techniques to large scale software-in-the-wild is to have numerous tests, each test exercising concurrency in one set of APIs. This approach has scaled out in a decent manner. We run our entire distributed system in one process with various tests exercising concurrency and faults triggered through a focused set of APIs. We'll blog more about our usage of Coyote in the coming months.

While TLA+ is very powerful and useful, we've also found language integrated tools like Coyote to be quite effective in testing and catching concurrency bugs in real world code in Azure.

pron · on Nov 1, 2020

The size of P programs verifiable by a model checker is similar to that of TLA+ specifications checkable by TLA+, but because TLA+ can specify systems at a non-compilable level of detail, it also allows specifying and (perhaps non-soundly) verifying much larger systems than those you can soundly verify in P. There are many programming languages with model-checkers, but they are commonly used in niche areas -- like hardware or embedded software -- where programs are relatively small and simple compared to most "ordinary" software.

inaseer · on Nov 2, 2020

This is a fair call-out. I'll like to share a counter-example where we 'model check' our real-world service written in C# with pretty decent effectiveness. As you rightly call out, the number of execution points around which the model checker explores interleavings in a concurrent situation cannot be too large. If you look at _most_ micro-services, the really interesting points around which interleavings have to be explored are to external storage systems, say, Cosmos DB or Azure Storage etc. This is because most micro-services maintain their state in these external systems and are otherwise stateless. This insight allows us to only explore interleavings around calls to these external services as that's where a ton of concurrency bugs reside and ignore the interleavings at arbitrary points _within_ the services. Leveraging this insight allows us to utilize concurrency testing tools like Coyote on real world code and services with the testing scaling decently with the size of the program.

User23 · on Nov 1, 2020

First, thanks for the link to your github.io. At a glance it looks quite interesting and I look forward to reading more.

Let me state a couple assumptions that I'm making and please let me know if they're incorrect. First, by mechanically derived you mean automatically generating a program text from a specification with no additional human input. Second by mechanically verified you mean something like calling a function that takes a specification and a program text and returns true if the program text satisfies the program with no additiona human input.

I do not believe either mechanical derivation or verification is possible in the general case using these definitions. Both appear obviously undecidable in the general case. Nevertheless I believe it's a fruitful field of research since there are already specific cases where one or both are already possible and it would be valuable to grow that set. My somewhat jaded view of the focus on mechanical verification is that it's what the industry that funds much of the research wants, since they probably dream of it as a way to save money on programmers and testing.

Here and elsewhere when I talk about formal derivation of a program I mean as a method by which a human programmer constructs the program text. The simplest useful example of this that I can think of is deriving a totally correct loop from a specification. The loop invariant method is so easy and so powerful there is absolutely no valid reason why every programmer shouldn't use it when writing any loop. Dijkstra wrote a textbook[1] with a number of other non-trivial examples that you may enjoy reading if you haven't already. In short, Dijkstra shows how formally deriving a program is equivalent to deriving a mathematical proof. It's actually a relatively informal book and not at all mathematically demanding. He and Scholten fully developed the concepts with a much more mathematically rigorous textbook too[2].

So with that in mind, I'd personally like to see less of a focus on how to remove the programmer from the process and more on making it easier for him to formally derive his programs. One approach I imagine is a kind of side by side editor with good visual feedback where on the one side you build up your specification and on the other side you write code to ensure the various conditions the specification adds hold everywhere in the state space. If I were to undertake such a thing myself I'd probably start with a very simple purpose built language to avoid getting in over my head, but in principle any language with formally definable semantics could be used. A new project would start with the empty specification and the empty program, and obviously total correctness holds. All current major editors can be thought of as implementations of this model that only support the degenerate case of the empty specification that admits all programs. The initial verification strategy would rely on the programmer as its oracle, but additional strategies could be used where feasible. But even simply providing a nice way for a programmer to track that yes he's quite sure he's satisfied this part of the spec would be helpful.

[1] https://www.goodreads.com/book/show/2276288.A_Discipline_of_...

[2] https://www.goodreads.com/book/show/3144463-predicate-calcul...

pron · on Nov 1, 2020

There are various avenues of research in this area (you may want to look at David Harel's Behavioural Programming [1], and for more deductive approaches look at Back's refinement calculus, which is very similar to what you have in mind), but to date, we do not have any methodology that allows us to either synthesise or verify programs that correspond to specifications that are routinely written and verified in TLA+; we're nowhere even remotely close. It's comparing a battle-tested, industrial-grade tool with hints of more or less promising directions in the lab. The gap between the lab and the field is so large that we must be careful to treat them as almost two different worlds.

[1]: http://www.wisdom.weizmann.ac.il/~bprogram/more.html

User23 · on Nov 1, 2020

I agree with everything you state, but I think we're talking past each other a bit now. My claim is that even without any additional tooling beyond TLA+ to produce a specification and your development environment of choice, just using mathematical reasoning as you derive your program text makes it considerably easier to build systems that satisfy the specification. That mathematical method using predicate transformers is well-developed and has been for decades. This formal approach to deriving the program text results in an object that is a mathematical proof of its own correctness. The human act of writing the program is exactly the same as deriving a proof. I'm simply incapable of accepting the proposition that avoiding mathematical rigor scales better than using it.

pron · on Nov 1, 2020

> just using mathematical reasoning as you derive your program text makes it considerably easier to build systems that satisfy the specification.

And I am saying that since this has never been achieved in practice for systems of considerable scale, this is a rather hypothetical use of the word "easier." Is it possible that one day this will actually be easier? Sure. Is this the case now? I don't think so.

> The human act of writing the program is exactly the same as deriving a proof.

The human act of painting your ceiling is exactly the same as painting the Sistine Chapel and the human act done by a first-year student taking a calculus exam is exactly the same as a Field medalist writing a proof in a paper. Look at how, say, a verified quicksort is written in Idris [1] and compare that to how it's written in Java/Python (one could write Quicksort in Idris as in Java, but then the program will not serve as proof of its correctness). I think that the claim that with what we know how to do today producing the former is as easy as the latter, let alone easier, requires some empirical evidence.

[1]: https://github.com/bmsherman/blog/wiki/Quicksort-in-Idris

User23 · on Nov 1, 2020

> And I am saying that since this has never been achieved in practice for systems of considerable scale, this is a rather hypothetical use of the word "easier." Is it possible that one day this will actually be easier? Sure. Is this the case now? I don't think so.

Respectfully, I recommend that you read Dijkstra's A Discipline of Programming that I linked above. I think you'll find it interesting at the very least and possibly even enlightening. I assume you are a product of the American educational system, where these concepts are rarely taught. There are exceptions though, for example the great Leslie Lamport has also done notable work[1] in this area. You'll find that out of necessity on some level we all have to reason about state spaces and the possible coordinates in those spaces to write software. Using formal mathematical reasoning is strictly better than the sloppy approach generally used. I'm not claiming that it's a magic bullet though. Even mathematicians make errors after all.

There are various middle grounds too. Done properly, TDD and its cousins amount to an implied specification and "ensure" that the program written will satisfy the tested subset of said implied specification. The primary practical limitation of this approach is that testing can never show the absence of errors, only their presence.

> The human act of painting your ceiling is exactly the same as painting the Sistine Chapel and the human act done by a first-year student taking a calculus exam is exactly the same as a Field medalist writing a proof in a paper.

You've lost me here, I don't at all understand how these analogies have anything to do with the kind of reasoning a human can do about programs vs the kind of "reasoning" programs can do about programs.

The point is that it doesn't matter what language a program is written in. The program text is always a mathematical object describing a state space (usually reasoned about as a set of state spaces like structs or objects) and transformations in those state spaces. Granted, some languages, like Perl, have ill-defined semantics that make reasoning about correctness impracticable, but often it's possible to restrict oneself to a subset of the language that can be reasoned about. For what it's worth I've applied these techniques quite successfully in my professional life. I hate getting paged and I hate debugging, so I've found it useful to learn to write correct software. I seriously doubt that I'm the only one, but as I've said before there is some kind of bizarre phobia American programmers have about being told that what they are doing is math that makes it very hard to talk about with most of them.

[1] https://lamport.azurewebsites.net/pubs/lamport-win.pdf

pron · on Nov 2, 2020

I think you are not sufficiently familiar with this field, Dijkstra wrote "Discipline" years before computational complexity results on program analysis were known, and I think you misunderstand the tremendous gap between what works on small examples in the lab and what is scalable enough to be applied widely in the field and the difference between something that might be possible in principle vs. something that's affordably feasible in practice. We know more than Dijkstra did, the techniques have improved, and still no "ordinary" (non-niche) real-world software of any significant size has been written using formal refinement to date. If you claim this is possible today -- go ahead and do it for a medium-sized program of say, at least 100KLOC. A formalism like TLA+ is well suited for this task, so go ahead and try.

I am not American, I don't like making generalisations about Americans or any other nationality, and, as you could see from my blog, I am very much in favour of employing maths in system design -- that's what TLA+ is all about. But confusing "mathematical object" with "can be feasibly reasoned about" is just ignoring computational complexity. I can give you an example of two subroutines, each only a few lines long and each very simple to understand, and yet you won't be able to answer a most basic question about their composition. Complexity theory tells us that mathematical problems have an inherent hardness, and the fact you can write them does not mean you can feasibly solve them. There are no simple answers here. In some carefully constructed ways formal methods can help in some specific circumstances, different methods helping in different use-cases, and you cannot just say that it's "better" overall. For example, while programs of millions of lines of code are routinely tested, no one has ever managed to formally reason about deep properties of programs of this size.

User23 · on Nov 2, 2020

I take it that means you've never read Discipline or Predicate Calculus and Program Semantics.

I don't doubt your expertise in TLA+, as I hope you can see from my studious agreement with you on the items that are clearly within your expertise, but on this point you're not even wrong. You are attempting to refute the claim that a human programmer can derive a program that has desired properties more effectively with mathematical reasoning than without it by offering the objection that mechanical verification has high computational complexity[1]. That is sheer nonsense. It's like saying that because factoring products of primes is hard that it's in no way feasible to multiply them rigorously so just use an estimate instead.

As an aside, the avionics for the Space Shuttle was over 250KLOC and was proved correct from the bottom up. And it functioned flawlessly. This was done before computational complexity results on program analysis were known, but that lack of knowledge was observably irrelevant to building correct software. Sadly, the hardware left a bit to be desired.

[1] It's undecidable in the general case.

pron · on Nov 2, 2020

No ordinary, real-world software has ever been fully and affordably formally proven. There have been specialised 1MLOC programs checked with a model checker, but the techniques don't generalise. I would be interested in a source to your claim about the Space Shuttle software, because even though that's both high-cost and niche software, I think you might have the details wrong. Anyway, I encourage you to try to affordably write ordinary software in this way. This is certain to win you some high-profile award, as no one has been able to do it so far. Either you succeed or you fail, but there's really no point in insisting it can be done without pointing to anyone who has done it.

User23 · on Nov 2, 2020

This discussion would be more fruitful if you carefully read my responses. I'm not actually disagreeing with you, but you're missing my point. I don't mean to be rude, but you really appear to be suffering from man-with-a-hammer syndrome here. As I've already shown, what can be mechanically verified with a model checker has no bearing on a programmer's ability to formally derive totally correct software. It would be extremely surprising if it were otherwise, since that would mean that one of the most fundamental theorems in computing science is incorrect.

> Anyway, I encourage you to try to affordably write ordinary software in this way

I already stated in an earlier response that I've been doing this professionally. And I have been for over a decade.

Regarding the Shuttle, you'll have to hit a library, the only online source I can find is Feynman's minority report and it doesn't go into great detail. It does however show how properly decomposing a complex system into component parts is necessary to keep it intellectually manageable. And bear in mind that while Feynman is best known as a physicist, he knew enough about computing science to write a book on it.

pron · on Nov 2, 2020

And my point is that discussion of hypothetical ability is not what I am talking about. I can show you why there are good reasons to doubt the programmer's ability to reason about a large program in light of computational complexity results accumulated in the last twenty years (and you are simply wrong about the bearings of model checking to the situation), but instead of trying to argue over hypotheticals, I'm merely pointing out that the ability you believe exists has never, to the best of my knowledge, actually been realised in a scalable, repeatable, affordable way in the field. If you claim you could achieve it, please try.

> And I have been for over a decade.

What exactly have you done for over a decade?

I am quite familiar with predicate transformers, but the distance between using Hoare logic and the like on code units -- which is relatively easy and pretty widely done (e.g. with SPARK) -- and formally (which means mechanically checkable, even if it is human directed) deriving a large program from a specification is gargantuan. Proving/deriving code units from contracts and the like is not the problem I'm talking about. Neither Dijkstra nor anyone else has come remotely close to achieving it, and achieving it is a guaranteed Turing award.

User23 · on Nov 2, 2020

> I can show you why there are good reasons to doubt the programmer's ability to reason about a large program in light of computational complexity results accumulated in the last twenty years (and you are simply wrong about the bearings of model checking to the situation)

OK. Show me.

This is why it's important to write the specification with an eye toward keeping the implementation manageable. This is probably a drawback of using a tool like TLA+ to check a specification. The tool makes it relatively easy to write a terribly structured specification, check it with TLA+ and declare victory, and leave the programmer in a hopeless situation. A well designed specification will unsurprisingly mirror well designed code[1]. What I have empirically tested is that keeping my functions/methods simple enough to be easily specified also keeps them easy to formally derive correct program text for. The problem of composition is treated entirely separately. You don't need to understand the fully concatenated predicate transformers for each function/method, you just need to know what wp (or another appropriate predicate transformer) is for each and be sure to satisfy it.

> What exactly have you done for over a decade?

Please understand this is a greatly abridged response. My general habit is to look at the requirements and infer the implied specification since a formal specification is sadly almost always missing. Thankfully most business code is theoretically trivial so postulating a reasonable specification from the requirements is easy. So, I then document my assumptions about what I believe the customer is actually asking for formally. Before I write code I think about the postcondition that I want that code to establish and determine what preconditions, if any, must hold. I put this as comments in the test modules. Then I do a more or less standard TDD approach, with the tests trivially derived from the specification[2], because management likes tests and they're paying the bills. Then I derive the code to make the tests pass from the specification using predicate transformers from the precondition to the desired postcondition. And then I run the tests and move on to the next properly isolated concern.

Obviously no human discipline is infallible and I still make mistakes, but I make far fewer mistakes than I would using the typical sloppy approach most programmers use. One time I got chewed out by my manager for allowing one bug to make it into production in nine months, because I had raised expectations of the quality of my code so high. And by the way I've always just presented my methodology as TDD because of the mathematical phobia that's so prevalent in American industry. If I tell management that this code is easy to maintain, you just need programmers proficient in the predicate calculus they will dismiss me. If on the other hand I say look I have well over 90% branch coverage I'll get an attaboy and management will be happy.

By this time I hope it's clear that my thoughts on the subject were not formed in the lab, but at a Big Tech Company You've Definitely Heard Of. I hope we can agree that the general tendency in industry to ignore or even deride what computing science can teach us is unfortunate.

> formally (which means mechanically checkable, even if it is human directed)

Dubious[3]. A mathematical formalism does not require algorithmic automated checking to exist. And even if said checking is decidable, which it mostly isn't, it may not run in a practical time on any possible mechanical computer we can conceivably build. However if you mean like any other formal mathematics where "the symbols do the work" so long as you follow the appropriate rules of derivation then I am in complete agreement.

[1] Proper separation of concerns and such.

[2] Obviously no tests are going to exhaustively check a non-trivial specification, but my approach is still strictly better than the typical practice of primarily optimizing for satisfying coverage tools.

[3] https://en.wikipedia.org/wiki/Formalism_(philosophy_of_mathe...

pron · on Nov 2, 2020

> OK. Show me.

Here is a simple specification: write a program that, given a positive even integer, returns two prime integers, whose some is the input, and never fails to do so. Start with a formal statement of the requirement, and derive a program that provably satisfies it. Decompose your program however you like.

> Then I derive the code to make the tests pass from the specification

Do you use predicate transformers all the way from a high-level spec to each and every line in a 1MLOC distributed and interactive system, with management consoles, reports and ten-thousand users per server?

> A mathematical formalism does not require algorithmic automated checking to exist.

It most certainly does. That is the literal meaning of "formal" in formal logic/method/synthesis/verification. A formal logic has a proof theory made of inference rules. It means that conclusions are reached by manipulation of the syntactic form -- hence formal -- without use of any mathematical intuition. Alan Turing then showed that a proof system can always be described as a Turing machine, and that formal = mechanical. There is a huge gap between informal mathematical reasoning and formal reasoning. I believe I've seen someone mention an exponential blowup in proof size of formal proofs compared to mathematical proofs.

> Obviously no human discipline is infallible and I still make mistakes, but I make far fewer mistakes than I would using the typical sloppy approach most programmers use.

There's a very, very wide gap between "disciplined" and formal. I am all for disciplined and for using formal methods in various places. End-to-end formal derivation/verification, i.e. from high-level spec to code, in a scalable, feasible manner, is yet to be achieved.

User23 · on Nov 2, 2020

> Here is a simple specification: write a program that, given a positive even integer, returns two prime integers, whose some is the input, and never fails to do so.

I'm not going to solve Goldbach's conjecture on hackernews or anywhere else. I wish I could, I really do, but my incapability there doesn't invalidate my arguments. I'm not even annoyed though, that was a cute trick and I appreciate your cleverness. Fortunately Fields Medal tier mathematical problems have never come up in my professional programming career.

> It most certainly does. That is the literal meaning of "formal" in formal logic/method/synthesis/verification. A formal logic has a proof theory made of inference rules. It means that conclusions are reached by manipulation of the syntactic form -- hence formal -- without use of any mathematical intuition.

Once again, I completely agree with this. Let the symbols do the work. But we keep coming back to the same misunderstanding. A program can feasibly be derived that can't feasibly be verified without access to the discrete documented steps of the derivation[1]. Showing the work of the derivation is sufficient to convince any honest interlocutor of the correctness of the proof. It may well be enough to allow an algorithm to perform the verification too, I'll defer to your expertise on that. In any event, as I said before using the analogy of primes, verification is rather a lot harder than construction.

Edit: Also I want you to know I'm finding this a pleasant discussion and am thankful that you're taking the time to help develop my thoughts and hopefully yours too.

[1] By analogy, factoring the product of two primes is easy when in addition to being given the product you are also given the multiplier and multiplicand.

pron · on Nov 3, 2020

> Fortunately Fields Medal tier mathematical problems have never come up in my professional programming career.

Ah, but you see, I've used this problem only because you could be easily convinced of its difficulty because it's a well-known one and others have tried solving it unsuccessfully. But hard problems come up all the time, especially when concurrency is involved. You just don't know they're hard at first glance. For example, the specification that a database be kept on multiple computers to offer redundancy without loss of data is a staple of distributed computing, but finding an algorithm that does it (perhaps together with some other latency requirements) is extremely hard and it took many years to find the first one that could do it.

That easy specifications are easy to derive doesn't help us, because they might also be the same programs that are easy to test or be convinced of their correctness in some other ways. The whole point of formal methods is to help us with the hard cases, which, sadly, are very frequent.

> A program can feasibly be derived that can't feasibly be verified without access to the discrete documented steps of the derivation.

Not if it's done formally. A formal derivation guarantees proof, and, in fact, the problem of verification is easily reducible to derivation, i.e., derivation is no easier. Here is the proof: X is a program, and so, also a specification, and P(X) is some property that you want to verify for X. A formal derivation of a program from the specification X ∧ P(X) will, therefore, prove P(X). This is the "trick" I used with the "Goldbach program", except it's not a trick, but a basic result in the study of the hardness of program analysis, a subject that's been extensively studied for a few decades now. The claim that it's easier to derive than to prove is just mathematically false. With regards to claims about informal "proofs" or "derivations" it's hard to say anything, because the problem is not well-stated.

User23 · on Nov 3, 2020

> finding an algorithm that does it (perhaps together with some other latency requirements) is extremely hard and it took many years to find the first one that could do it.

I agree. And correctly implementing those algorithms is much easier using a formal derivation with a mathematical tool such as the predicate calculus than it is by just eyeballing it and hoping your tests catch all errors.

> But hard problems come up all the time, especially when concurrency is involved

Predicate transformers are well suited for reasoning about nondeterminism and concurrency.

> Not if it's done formally

Program derivation with predicate transformers is done formally. It's a chain of equivalences, implications, and consequences as appropriate to the problem. The program text is literally developed as a proof, and each transformation of the code must be justified by axiom, theorem, or lemma and at the end of that derivation all properties of the specification hold. You can't possibly be claiming that no formal mathematics was done before the invention of automated checkers, because that would be both ridiculous and observably false.

> Here is the proof: X is a program, and so, also a specification,

I disagree. A specification maps to either an empty or countably infinite set of programs that satisfy it. Also all programs satisfy countably infinitely many specifications. They are not isomorphic or even dual. In this case that conflation doesn't seem to impair our understanding too much though.

> and P(X) is some property that you want to verify for X.

Right, and the task of the programmer is to write a program text that provably establishes that P(X) holds everywhere in the program state space.

> A formal derivation of a program from the specification X ∧ P(X) will, therefore, prove P(X)

Yes this is what I meant when I said you should use the same techniques to derive the program text that you would to derive any other mathematical proof.

> it's not a trick, but a basic result in the study of the hardness of program analysis

It's a trick because it's a clever way of saying write a program for which there is no known algorithm. As of now nobody is going to be able to do that regardless of their methodology. I get the impression that you mean this to be a compelling point, but I'm not grasping how I'm better off not using mathematical reasoning to derive the programs that I do know how to write.

> The claim that it's easier to derive than to prove is just mathematically false

It's not even false it's incoherent. But I didn't make that claim. Both derivation and verification are ways to prove a program satisfies a specification.

As you say, a derivation of a program that satisfies a specification is a proof. Specifically, a proof by construction. The claim is that it's considerably easier to derive a program that satisfies some property than it is to verify if an arbitrary program satisfies that property. For example say we have a specification S: "returns a product of two primes" and two programs. P1: "Return 2^82589933-1 * 2^77232917-1" and P2: "Return 2^8258993365541237147-1". The proof of the correctness of P1 is trivial because I constructed it to be so[1]. As for P2 I have no idea if it satisfies S, but I am certain that it'll be considerably harder to convince myself and others that it does than it was for P1.

> With regards to claims about informal "proofs" or "derivations" it's hard to say anything, because the problem is not well-stated.

I've made no claims of the sort.

[1] https://primes.utm.edu/largest.html

zozbot234 · on Nov 1, 2020

> Hoare logic and the predicate calculus are entirely adequate for formal derivation in any language with well-documented semantics.

Formal verification, yes. Formal derivation of a correct program from a spec, I really don't think so. In particular, I don't know of an easy way of deriving the required induction hypotheses/loop invariants to prove correctness of a program which includes recursion or loops.

User23 · on Nov 1, 2020

I use formal to mean using mathematical methods like those a mathematician uses when deriving a proof, which is to say a human directed process. I would say mechanical derivation if I meant having a program do it automatically.

jlokier · on Nov 1, 2020

> TLC could, in some situations, check whether an uncountably infinite number of program executions

I'd love to see an example of an uncountably infinite number of program executions checked.

The set of all computable values as countable, after all.

pron · on Nov 1, 2020

Not all computations produce a value; most don't. Consider this TLA+ specification:

    VARIABLE x
    □(x ∈ 0..9)

or, written in a form that can be checked by TLC:

    VARIABLE x
    x ∈ 0..9 ∧ □[x' ∈ 0..9]_x

It says "x is a variable that is always an integer between 0 and 9, and it describes a system that nondeterministically assigns a number 0-9 to the variable x at each step, forever. It has an uncountable number of executions, each of infinite length, yet because it has a small number of states -- ten to be precise -- TLC can easily and quickly check many propositions about it. It is true that most of the behaviours don't have a corresponding determistic Turing machine and so most of those sequences are not computable, but it is not only a perfectly valid specification, but also a reasonable abstraction of real systems (e.g. a program that forever asks the user to input an integer between 0 and 9).

On the other hand, this specification describes a system that picks an arbitrary natural number and then counts down to zero. It has only countably many executions (one for each initial number chosen), and each of them is finite in length -- i.e. they all terminate (I've omitted a liveness condition necessary to state that each computation will actually terminate, as it would mean little to those not familiar with TLA+ [1]):

    VARIABLE x
    x ∈ Nat ∧ □[x > 0 ∧ x' = x - 1]_x

And yet TLC cannot check it because it has infinitely many states and TLC is an explicit-state model checker. In practice, we let TLC check a derived specification that overrides the value of Nat, the set of all natural numbers, with some finite set.

[1]: To those familiar with TLA+, the real specification is:

    VARIABLE x
    LET Next ≜ x > 0 ∧ x' = x - 1
    IN x ∈ Nat ∧ □[Next]_x ∧ WF_x(Next)

jlokier · on Nov 1, 2020

Thanks!

jlokier · on Nov 1, 2020

Ah, I think I see it.

The set of infinite, non-deterministic execution traces is uncountable (same reason real numbers composed of infinite sequences of digits from a finite set are uncountable).

On the other hand, the set of infinite execution traces which can result from any deterministic, finite-size program is countable (because the set of programs is).

But programs take inputs. And the set of possible inputs to an infinite execution trace is uncountable (same reason as reals, above).

The prover needs to show that a program behaves correctly when given any sequence of inputs, even if the program never terminates and keeps taking new inputs forever. That requires proofs over an uncountable set of infinite execution traces.

doonesbury · on Nov 1, 2020

>don't only not run all possible executions of a program, they don't even run it once.

Can you please clarify? A tla+ spec (or pluscal to tla+) is supposed to verify against all possible interleavings of reachable states like spin. You seem to be saying something different

pron · on Nov 1, 2020

The model checker TLC, like SPIN, does verify "against all interleavings" but it doesn't "run" all of them; in fact, it runs none of them. Conceptually, it gives you the answer about all of them, but the algorithm doesn't actually run them.

doonesbury · on Nov 1, 2020

OK, that's clearer. Thank you. Certainly not nit-picking, but it's worth pointing out in passing that computing the state to check qll the interleavings --- albeit not actually running each path through the reachable states but looking for safety or livness violations --- requires real work being super exponential in the worst case.

salawat · on Nov 1, 2020

Define "against all interleavings", or is that something I'd understand after reading TLA+'s documentation?

I think I get the drift, I'm just wanting to confirm "read the docs" is the only missing bit.

edwintorok · on Nov 1, 2020

Worth pointing out that the bug is still present in latest glibcs (I can reproduce it on Fedora 33), and affects programs that use pthread_cond_signal, in particular language runtimes like OCaml, Python and .NET that have a global runtime lock.

Bugreports:

* Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1889892

* Ubuntu: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800

Also see https://github.com/mratsim/weave/pull/58/files

theamk · on Nov 1, 2020

Wow, this is great! I have read TLA blogposts before and I was always under impression that it was a whole bunch of complexity for little benefit. This one shows a great performance on a real case which would be complex to solve in real life.

inaseer · on Nov 2, 2020

Concurrency is hard - this was a great investigation into the bug using TLA+ by the author which suggested simplifications to the code in addition to the bug fix.

This reminded me of another reader/writer concurrency bug posted by Tom Cargill in the c2 wiki caused by surprising behavior of Java monitors, which Hillel Wayne showed how to discover and then fix using TLA+ at https://www.hillelwayne.com/post/augmenting-agile/

We were able to reproduce and fix the same bug using Coyote as well, and documented our experience at https://cloudblogs.microsoft.com/opensource/2020/07/14/extre... I was pleasantly surprised at the finding and was definitely heartened by our experiment where we were able to reproduce the bug in actual working C# code without the need of an additional modeling step. While TLA+ is undoubtedly very powerful when it comes to modeling and catching concurrency bugs, I do hope tools like Coyote get wider adoption and are implemented for more languages so bugs in critical infrastructure code can be caught in a scalable and repeatable way while staying within just one language for implementation _and_ testing.

sesuximo · on Nov 1, 2020

I still have no idea what the actual bug is. Can anyone explain a bit more specifically?

secondcoming · on Nov 1, 2020

"Where is the bug? Worker 2 wasn’t supposed to go to sleep, because g_signals[0] was supposed to be 2. So the “compare” part of the “compare and wait” of the futex should have returned false. But Worker 1 came back from the past and did a decrement of g_signals that should have belonged to a previous generation"

strictfp · on Nov 1, 2020

i.e an ABA problem

taeric · on Nov 1, 2020

These always strike me as requiring a ton of understanding beforehand. Is unfortunately like a lot of testing folks do. If you knew to write some tests, the test probably want necessary.