Having been down this road [1], I'd say a big problem is that the people doing this tend to be too much into the theory and not enough into wanting programs not to crash. That's some of the Coq world.
It doesn't have to be this way. Microsoft's work in this area led to the Static Driver Verifier [2], which was a huge success. Windows crashes due to kernel drivers breaking the rest of the kernel have pretty much disappeared. They were once (Windows XP and earlier) responsible for over half of Windows crashes.
Most of the things you want to prove about programs to make sure they don't crash are mathematically trivial. Detecting buffer overflows and off-by-one errors seldom requires elaborate proofs. If they do, your program design probably has a problem. Simple proofs can be handled by a prover which uses a Nelson-Oppen decision procedure. This is a complete decision procedure for formulas which contain only addition, subtraction, multiplication by constants, Boolean operators, structures, and arrays. This covers most data access operations. It's fully automatic and complete - the algorithm always succeeds within its problem space. You need one of those for the easy parts.
For the hard parts, we used the Boyer-Moore prover. (Sources are available and I brought it back to life on Github.[3] It runs 3000x faster than it did in 1982.)
The interface with the language is crucial to usability. Most efforts get this wrong. They put the verification stuff in separate files, or in comments. It should be part of the normal code. We built the verification language into the compiler, so you used the same source and the same syntax for compiling and verification. There were additional statements - ASSERT, ENTER, EXIT, etc. The same compiler front end was used for both compiling and verification. The verifier itself ran off the byte code from the compiler front end. All type and size issues had been decided at that point, so the verifier was looking at unambiguous operations such as "pop two operands from stack, add in 16-bit twos complement arithmetic, push on stack". This captured the semantics of the program the same way the compiler did.
The user never sees the easy stuff. Every subscript use generates a bounds check assertion which has to be proved. If the prover is able to prove the assertion true, the user doesn't see it. Only the errors are visible. Another trick is that you don't write "assert(a and b and c and d)", you write "assert(a); assert(b);". Those generate separate proof goals, and if only one fails, the user gets error messages about only that one. Also, when you write "assert(a); assert(b);", the second one is proved under the assumption that the first one is true. This prevents cascading errors.
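As a rough illustration of what those generated goals look like (a toy sketch using the Z3 Python bindings, not the actual Pascal-F verifier; the names and bounds here are invented):

from z3 import Int, Solver, And, Not, unsat

# Facts the verifier has collected along the current path.
i, n = Int("i"), Int("n")
path = And(n > 0, i >= 0, i < n - 1)

# Goal 1: the bounds check generated for a subscript like a[i + 1].
a = And(i + 1 >= 0, i + 1 < n)
# Goal 2: the next assertion, proved with goal 1 available as a hypothesis.
b = n - i >= 2

def proved(hypotheses, goal):
    s = Solver()
    s.add(hypotheses, Not(goal))   # look for a counterexample
    return s.check() == unsat      # none exists, so the goal holds

print(proved(path, a))             # True; the user never sees this one
print(proved(And(path, a), b))     # True; assuming a prevents cascading errors

Only a failed goal would surface as an error message.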
A key point here is that verification is meaningless unless you prove all the theorems. If you write "assert(false)" other errors go away, and you're left with one simple error, which you cannot eliminate.
You don't get to dismiss it. It's possible to write an assertion that's equivalent to "assert(false);", such as "assert(0 > 1)". This is why I'm so negative on the arguments from some of the Rust crowd that a little undefined behavior is OK. It can blow at any seam.
When you needed to prove something hard, you wrote
assert(a);
assert(b);
where a could be proved by the Nelson-Oppen prover from the code before it. b was chosen such that if you assumed b, the Nelson-Oppen prover could prove anything required past that point. Now you had one error message, that the system couldn't prove "a implies b". So then someone had to try to prove that with the Boyer-Moore prover. The Boyer-Moore prover is mostly automatic, but often someone had to provide intermediate theorems to get to a hard one. Those were all machine-checked. If this succeeded, the Boyer-Moore prover generated a theorem that a implies b for use by the verifier. Rerunning the program verifier would then succeed. Such theorems could be reused with different variable bindings, so if the same problem came up again, it didn't require more work with the Boyer-Moore system.
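A toy rendering of the bridging lemma itself (again just an illustration with Z3's Python bindings; in practice this was exactly the part that was beyond the decision procedure and needed Boyer-Moore, so a trivially provable stand-in is used here):

from z3 import Int, Implies, Not, Solver, unsat

n = Int("n")
a = n >= 1      # provable from the preceding code by the decision procedure
b = 2 * n > n   # what the rest of the verification assumes from here on

s = Solver()
s.add(Not(Implies(a, b)))    # the implication is valid iff its negation is unsatisfiable
print(s.check() == unsat)    # True: "a implies b" is a theorem, reusable under rebinding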
If you're generating vast amounts of user-visible proof work, something is wrong.
The whole system I'm describing is now on Github.[4] It's decades-old code, but I've ported about 80% of it to modern compilers. The Franz LISP to Common LISP port of the prover isn't working well, and the author of that code died years ago. If anyone wants to work on that, go ahead.
I appreciate your detailed post. Very interesting, especially with Nelson-Oppen being new to me. I remember Boyer-Moore from studying the CLI stack. Your comment brings three things to mind:
1. It implies, especially with the links, that one should both look into and build on the BM prover. Whereas my foray showed that the high-assurance sector moved on from BM to ACL2, with the latter used in all sorts of stuff (especially hardware). The Hardin et al link below is my favorite at the moment. Your Github on nqthm even says to avoid BM in favor of ACL2 for new stuff. Anything you want to add on that, such as things ACL2 could use from BM, or in general? I know from other works that new and improved doesn't mean better in every way. Also, doesn't ACL2 lack a tiny proof kernel checking its steps like the LCF provers, or am I misremembering? I have seen work embedding it in HOL, though.
2. It reminded me about the Gypsy Verification Environment that got used on lots of B3/A1-style projects. I didn't realize that was on the FTP site. It might be interesting to get it into runnable shape for reliving history and seeing how things were done, similar to what you appear to be doing with those other works. I know it dates to the mid-70's, but the mid-to-late 1980's paper on applying the 2.0+ version doesn't mention your work at all in its references, even though it was done in a similar time frame. Why do you think that is? Since it doesn't, what's better or worse about the approaches if we compare your Pascal-F method to GVE? Side note: the report in the tarball describing the Gypsy, Middle Gypsy, and Micro Gypsy languages was really good on readability vs detail included. I always love that.
3. The current contenders in verified, low-level programming are the mature SPARK Ada tool and Microsoft's Dafny. Would you recommend skipping the Pascal tech to just enhance one of them directly, or extracting new languages to their form for use with their tooling? The latter was my plan if I build anything, with SPARK as the target, as many have done. The Jasmin language similarly used Dafny. Anything you think especially SPARK Ada could learn about automating practical analyses from Pascal-F, Gypsy, or BM/ACL2? Interestingly, SPADE-Pascal, SPARK's predecessor, was developed in the mid-80's as well.
As usual, just trying to see how these points in history might connect in old lessons or new opportunities.
I'm not suggesting one should use the original Boyer-Moore theorem prover today. It's of historical interest. I made it runnable again because I used it when it was state of the art, and I'd previously ported it once. But it's 1980s technology.
I like their constructive approach to mathematics. They build number theory up from something close to the Peano axioms, one proof at a time. This is good for soundness. I built on that and didn't add axioms. Proving McCarthy's "axioms" for arrays in Boyer-Moore theory was possible, and was done. Instead of explicit quantifiers, you have recursive functions which are proved to terminate. That gets you out of many annoying logic knots associated with quantifiers. Like undecidability.
I've been out of this for a long time. There's been considerable progress. But the tooling still seems not to be all that programmer-friendly. It could be. We had that partly figured out. What we didn't have was enough CPU power to make it go fast enough to be useful.
I'd hoped Rust would go in for more verification. C/C++ has so much undefined behavior that huge efforts have to be made just to nail down the language. Rust is much better in that respect.
People get carried away with abstraction in this area. I wouldn't try to verify macros or generics before they're instantiated.
That requires elaborate and iffy type theory. Instantiate them, look at the generated code (say at the LLVM input level) with hard types for everything, and work on that. That's unambiguous. Then build a system with enough automation that the user doesn't have to do work for each instantiation. That's what computers are for. Don't try to be too clever.
Automated proof is what you get with SparkAda. With the 2014 version you get executable contracts (thanks to the DbC features added in Ada 2012) and lots of help from the gnatprove tool, which acts as a frontend (meaning it implements all the runtime checks the compiler implements) to Why3 or other provers and static analysis tools (see the recent inclusion of CodePeer). Then once you've generated all the verification conditions you run Z3, CVC4, Alt-Ergo... You define the amount of effort you're allowing, and if they fail to prove your VCs you get immediate feedback (in your dev IDE) and you drill down to 'why' and 'how to help the prover' quickly once you've spent some time doing it (most of the time it boils down to 'state your unstated assumptions'). And if you're not lucky but still motivated, you can quickly jump into Coq or Isabelle with your VCs.
They're very pragmatic. They went for a subset of Ada that is deemed 'provable' but still largely usable. You get lots of nice typing features and RT checks in Ada.
There are still some difficult areas: aliasing, floating point, and user-friendliness, but the tech is working.
As for generics, that's the tactic SparkAda adopted. It's disturbing at first ('they went the lazy way!') but it's pragmatic and it works great. In fact, sometimes you'd be inclined to transform a procedure, function or package into a generic, because with the correct restricted, specific types and parameters the proof effort is easier.
We're at a point where proof of absence of runtime errors is almost achievable by mere programmers. We'll soon be asking 'do we go for test+fuzzing+quickcheck for robustness, or proof?'.
As for automation, they're always working on small but useful stuff. The latest one I can think of is loop unrolling. It might look funny, but writing loop invariants and variants is tedious. They're slowly (lead bullets) working up small automation steps.
I mean, isn't it all varying degrees of provable things? At the end of the day you're still at the mercy of the hardware uncertainty when crossing clock domains and the like.
To me it's all about the risk factors you're willing to accept and how you decide to mitigate against them.
> still at the mercy of the hardware uncertainty when crossing clock domains
That was solved years ago. See "Arbiter (electronics)".[1] It's possible to resolve that uncertainty, but sometimes you have to take extra time to resolve the race condition. Think of this as being like resolving a coin flip. Sometimes, the coin lands on edge. You deal with that by adding some noise in the form of vibration and waiting. Repeat until the coin ends up on one face or the other.
(I had the painful experience of working on the OS for a multiprocessor mainframe which predated that solution, and did have race conditions on memory access.)
> Having been down this road [1], a big problem is that the people doing this tend to be too much into the theory and not enough into wanting programs not to crash.
This problem is already solved. Instead of crashing - do something else! Like make a so-called “educated guess” of what the program should actually do. Or throw you into a debugger. Or, heck, do as JavaScript does: if you can't figure out what the right thing to do is, do something wrong instead!
The actual unsolved problem is how to write correct programs. I'm skeptical that mechanizing formal verification is the solution.
I almost completely disagree with the author of this article.
Yes, formally verified systems still have bugs due to the TCB (the trusted computing base). But that's because the TCB is precisely the only part of the system that is not formally verified! So of course it's going to have bugs.
If so many crashing and correctness bugs were found in the (small) trusted part of formally-verified systems (well, in reality only 16 bugs were found in 3 entire formally verified systems, but I digress), imagine how many similar bugs there are in systems that are not formally verified at all!
If anything, what this article should be arguing is that operating systems, proof finding and verification tools and compilers should also be formally verified (and some already are), so that even bugs in these parts of the system are also eliminated.
The only part of the article that I agree with is that it's not feasible to write 10x more proof code than real code.
But IMHO, this is why proof automation is really important to have - tools like Isabelle/HOL, Why3 and SMT solvers like Z3 provide this (the latter of which the author mentions).
Coq, on the other hand, has much less automation and the result is that it's much harder to use and results in larger proof code. Yet, it's seemingly inexplicably much more popular - perhaps it's the catchy name?
Coq has a much stronger logic (albeit constructive) than HOL. That is sometimes convenient. Most users of proof assistants don't get far enough that powerful automation becomes indispensable.
In addition, Coq benefits from a network effect: Coq started earlier targeting university teaching, primarily at UPenn with Pierce's Software Foundations [1].
That Coq has a stronger logic is an argument I hear often, but is that really so important when the result is that it's so hard to use that no real-world software engineer is ever going to bother?
Maybe with HOL I will not be able to prove some deep theoretical mathematical result that I could with Coq's Calculus of Constructions, but if I can otherwise much more easily prove >99% of my code to be bug-free, then I would argue that perhaps it's the better "tool" for the job...
The issue is not the edge of logical independence, but rather how easily can you express certain theorems. Most verification is about the verification of programming languages (e.g. the POPLmark Challenge [1]), and in this setting sometimes what you want to prove is more convenient to express in the CoC. You can also express it in HOL, but not quite as easily.
All that said, automation of provers based on the LCF approach with a non-constructive logic is much superior as of December 2017. With this in mind I was surprised to see that L. de Moura based the new Lean prover [2] on a variant of the CoC. He thinks that all automation that we currently have for HOL (i.e. Isabelle/HOL) can be brought to Lean.
> sometimes what you want to prove is more convenient to express in the CoC. You can also express it in HOL, but not quite as easily.
> All that said, automation of provers based on the LCF approach with a non-constructive logic is much superior as of December 2017.
This is kind of the point I am making. Is the fact that some things are more convenient to express worth the substantial loss of automation?
> With this in mind I was surprised to see that L. de Moura based the new Lean prover [2] on a variant of the CoC. He thinks that all automation that we currently have for HOL (i.e. Isabelle/HOL) can be brought to Lean.
> J. Blanchette [3] seems to think the same.
Interesting, thanks! I hope they are right and in the future this kind of automation does come to Coq and Lean.
> Maybe with HOL I will not be able to prove some deep theoretical mathematical result that I could with Coq's Calculus of Constructions, but if I can otherwise much more easily prove >99% of my code to be bug-free, then I would argue that perhaps it's the better "tool" for the job...
Is this true though? What program properties can Isabelle easily automate the proofs of that Coq cannot?
> Is this true though? What program properties can Isabelle easily automate the proofs of that Coq cannot?
Almost all?
Granted, I haven't used Coq very much, so correct me if I'm wrong, but isn't it true that in Coq, with the exception of basic proofs like simple arithmetic expressions, in order to prove a theorem you have to at least explicitly know and write the name of the lemmas that are needed to prove it (apart from how they fit together)?
In Isabelle/HOL, many times you literally only need to write "try" and Isabelle will find the proof of the theorem for you, including the lemmas that were needed to prove it and how they fit together (assuming these lemmas were already proven before). If it turns out that your theorem is false, it can show you an example of inputs where it fails.
This auto-proof functionality I'm describing is mostly thanks to Sledgehammer [0], the auto-counterexample finding is due to Nitpick (and another tool which I can't remember), although there are a few other automation tools that are also very useful.
Granted, you still have to have a high-level understanding of how you want to prove your code, but I would argue this eliminates a lot of the work and (wasted?) time you need to accomplish the final goal...
"Granted, I haven't used Coq very much, so correct me if I'm wrong, but isn't it true that in Coq, with the exception of basic proofs like simple arithmetic expressions, in order to prove a theorem you have to at least explicitly know and write the name of the lemmas that are needed to prove it (apart from how they fit together)?"
Coq has automated tactics like "auto" that search for proofs based on a configurable database of "hints". On the other hand, if it doesn't succeed, it leaves the proof state unchanged and gives no hints as to what to do next. But on the other, other hand, if it succeeds, it gives no hints as to how it succeeds. So it's less useful than it might be.
Edit: Also, Coq has a dandy baked-in search system for finding lemmas and so-forth that would be useful in a given situation. On the other hand, for a similar flavor, look on Hoogle for functions transforming one list into another list.
See [1] for an example (edit: although this may not be the best example nowadays, see below):
Quoting:
> Consider the conjecture "length(tl xs) <= length xs", which states that the length of a list’s tail (its "cdr") is less than or equal to the length of the entire list.
> Thanks to Vampire, Sledgehammer finds the following proof: by (metis append_Nil2 append_eq_conv_conj drop_eq_Nil drop_tl tl.simps(1))
Those are all the lemmas that were needed to prove the theorem. They were selected completely automatically.
AFAIU the "metis" procedure tries to fit the lemmas together so that they prove the theorem, and I believe the HOL kernel gives high assurance that if "metis" succeeds, then the theorem was proven correctly (but my understanding may be wrong here).
Note that Sledgehammer may even automatically find and return a human-readable proof, which for the example above, it returns something like this:
proof -
  have "tl [] = []" by (metis tl.simps(1))
  hence "∃u. xs @ u = xs ∧ tl u = []" by (metis append_Nil2)
  hence "tl (drop (length xs) xs) = []" by (metis append_eq_conv_conj)
  hence "drop (length xs) (tl xs) = []" by (metis drop_tl)
  thus "length (tl xs) <= length xs" by (metis drop_eq_Nil)
qed
Edit: That said, note that at least nowadays Isabelle/HOL seems to be able prove the above theorem by simplification alone ("by simp"), so this may not be the best example.
Metis is a first-order automated theorem prover. The metis tactic will handle encoding the goal as a first-order problem, using ~just the rules you provide it, invoking the actual metis prover, and reconstructing a proof (in the kernel logic) from the output proof. So it is impossible to prove things that are false using the metis tactic (i.e., metis is entirely out of the TCB).
Re: the simplifier, I feel like it's a bad example because coming up with good simp rules, and knowing when to add them and when not to when writing new code in Isabelle/HOL, is almost an art.
One example are nominal techniques [1]. While they are constructive, they have so far only found substantial implementation in Isabelle/HOL [2]. The reason is that the natural implementation is non-constructive, and leads to only a small blowup in proof complexity, which can easily be handled by Isabelle/HOL's proof automation. In contrast, constructive implementations of nominal techniques seem to require a considerable blowup in logical complexity, which the existing proof automation in Curry-Howard based provers doesn't seem to handle gracefully.
Another issue is that most of the automation that is known as "hammer" is based on non-constructive automatic provers.
Do you know why that is the case? Nominal logic is usually presented as a sheaf model (i.e., as the internal language of the Schanuel topos), which models a constructive dependent type theory. Is there a problem when constructing a universe?
I'm probably the wrong person to discuss this question with. I don't know more than I've written above. All I know about this is from conversations with Christian Urban, who implemented Nominal Isabelle. It was a while back, maybe the problem of implementing nominal techniques in constructive type theory in an effective way (i.e. doesn't put additional burden on automation) has been solved by now.
The problem as I understand it is this.
Let's say you have a theory T in the formal language given by signature S.

- In HOL you extend T to T' and S to S', but the deltas between T and T', and between S and S', are small and largely generic. This is possible because almost everything handling nominal can be done using double negation.

- In CoC you extend T to T' and S to S', but the deltas between T and T', and between S and S', are huge and largely non-generic.
Not that I'm aware of. It would be nice to have one. The number of people who understand nominal, HOL, CoC and implementations of provers is vanishingly small.
Coq itself is not a monolithic system, but is built around a small TCB (the kernel). Proof automation and tactics always produce proof objects (terms of the type theory) that are checked by the kernel.
This system design allows proof automation to be built outside the kernel, i.e. it is safe to use proof automation that hasn't been proven correct, because any proof found by the automation will be checked by the kernel before it is accepted.
Another feature of Coq, which unfortunately is underused in practice is the inclusion of a notion of reduction in the type theory.
Terms of the type theory are programs, and equality in the type theory is defined up to execution of these programs.
This means that in principle, one can construct a function SMT(s) and a proof that SMT(s) returns true if the smt formula s is satisfiable, and false otherwise.
The cool thing is that a proof that a certain formula s is unsatisfiable can now be represented as an equality proof SMT(s) = false. This proof is very small, because it relies on the function SMT which is essentially a verified SMT solver.
You might wonder how quickly SMT can be executed. The answer is that you would have to write SMT with purely functional data structures, but that Coq has an integrated JIT compiler that will speed up execution.
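For a miniature of the reflection idea in a related system (a Lean 4 sketch of mine, not Coq, but the mechanism of a small kernel plus definitional reduction is the same family):

-- Proof by reflection in miniature: `decide` runs the decision procedure for a
-- decidable proposition, and the kernel accepts the proof because both sides compute.
example : 2 + 2 = 4 := by decide
example : (List.range 10).length = 10 := by decide

Once the checker itself (here, the SMT function) is proved sound, each instance reduces to this kind of equality-by-computation.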
So Coq is a pretty amazing system, but it is, 30 years after its creation, still in its infancy.
We have 90,000 lines of code in a commercial application with no integration or unit tests, or any test automation at all, and that’s used by approximately 15,000 users a day.
Obviously we have bugs. We measure both our crash rate and usage/behavior with public analytics packages. About one out of every thousand uses of the application ends in a crash. More often, a user is unable to complete a task in the app because of a code failure, or poor/misleading UI.
I don’t know how hard it would be to formally verify our application, but I suspect at least 10x the cost of building it. We don’t talk about formal verification, but we do talk about the need for unit tests. My answer to both is: stop, think about what you are saying.
Our users are rarely inconvenienced by crashes, our crash rate is lower than 90% of our peer group already. Spending even another 1x the time and effort spent building this app trying to formally prove it and eliminate those few last crashes would be a massive expenditure for a tiny gain in user satisfaction, and only in one area. It would literally set us back years in functionality.
Writing/debugging/maintaining 3,000 unit tests doesn’t offer much better benefits because our application is almost entirely front end GUI. Writing automated UI tests today will have to be entirely rewritten next year when we update the entire UI, and then likely within another year as we do it again.
In our case the most productive way our team can improve the application for its users is to improve functionality and usability. We will create bugs along the way but find most in manual testing, and our usage will increase as customer benefits increase.
My point isn’t that we are doing everything right and that formal verification is wrong, or that unit tests are wrong, etc. My point is every application is unique. They are all developed with limited resources and those resources should be focused on producing the best possible outcome.
If I was leading a team at NASA developing embedded software for a space probe, pair programming and unit testing would be required, along with integration testing, automation and we should consider formal verification. We would only have one shot at getting it right or a billion dollar probe will fail.
But my current team doesn’t face such an immense cost from its bugs. We can push an update to fix critical bugs to every customer whenever we deem necessary. Our biggest challenge is getting customers to use our application, and to use it more often. Customer impediments are far more often poor design decisions than crashes or mis-implementations. So that’s what we need to focus on.
> Our biggest challenge is getting customers to use our application, and to use it more often.
That may be your biggest challenge now, but at some point, you may find that in order to overcome that challenge, you need to add major new functionality that requires changing broad swaths of your application. When doing that, you may find that your crash rate becomes unacceptably high as your changes begin violating assumptions you didn't know you were making. That's what unit and integration tests are for, they help expose those assumptions before a user does. This is a common misunderstanding, tests are not about catching bugs in the present, they are about decreasing the risk of adding new functionality in the future. I would go so far as to say that if you're working on a project with a defined end state, where there is zero chance it will ever be expanded past that end state, then automated tests are pointless; spend a lot of time on QA of the end product, shrink wrap it when you're happy with it, and don't look back. But I would also go so far as to say that nobody makes software like that. I certainly haven't ever worked on a project that is unlikely to have major, risky, expansions in functionality during its lifetime.
I agree with your general point, but currently don’t think the math adds up on unit testing for us.
Its not that it wouldn’t be better if we had unit tests. But their benefits are mostly in the future, meaning they need to be discounted for the value of time. And even without discounting, they won’t be as valuable to customers and our product as actually creating new and fixing existing features.
If our product fails in the next year, any efforts yielding most of their benefits after that are wasted. Any efforts that yield more customers and usage now, gives us more resources to do more things in the future, including unit tests.
That way if we get to that major, risky expansion in the future we will have more resources to write unit tests then, when they are most useful.
And lastly, I've reviewed the bugs that the business prioritizes for us. Virtually none, including crashes and feature fails, would be caught by traditional unit tests. Lots of them were force-unwrapping code written by the original contractors that would pass a unit test, but crash in a multithreaded environment when garbage collection frees the object.
To do unit tests we would also have to write a ton of mocks, making our production code less maintainable and understandable, and potentially introduce new bugs.
I think that Rosette is a good tool which abstracts away a bunch of the difficulty of SMT solvers and also allows you to synthesize programs: https://emina.github.io/rosette/
How many real world production systems have benefited from formal methods, significantly more than if the same resources used to do so had been applied in other ways?
Let's assume you can name a few: how general is the class of systems this can apply to? Or is this just an obscure corner of computer science that applies to NASA and various low-level system components?
I don’t count out formal methods, but can you really say none of this reminds you of the AI winter?
Is there evidence that it's more reliable compared to equally complex systems? Or that its reliability would have been impractical or more expensive to achieve using traditional methods?
I didn’t hold out much hope of a surgical response (because it’s not the most common thing being done in projects) but that looks like a great link exactly on point.
It may sound cynical, but as I mentioned anecdotally, it does really feel like the AI winter, where simultaneously people felt there was incredible promise while hearing a steady stream of discouraging progress reports.
I look forward to reading their paper, maybe there are a lot more bright spots than can be easily seen from the outside.
I think that formal verification is going to be size limited. seL4 is a microkernel, which means small, about 9,000 LOC. I don't think Xen at 150,000 is within the abilities of verification. So it's promising for small stuff.
All of my formal verification experience comes from using it in hardware, so the following comment may not apply so well to software where I have no experience of formal methods (beyond learning about it at university).
I think the mistake the author is making here is seeing formal verification as an all or nothing affair. Either you formally prove your program to be correct or it's useless. I would agree with them that doing this is a huge amount of effort and potentially just not feasible for real systems.
However there's a huge amount of value in formal verification as an extra tool to assist you in your verification efforts. By combining traditional verification techniques (normal testing in software terms) with formal methods you have a whole new way to find bugs, one which works rather differently and hence tends to find different bugs. You may not get anywhere near confidence you've formally proved everything (for one I find the more interesting properties just never get proven, but leaving the tool to run for a while often throws up bugs that are tricky to find via other means) but it does accelerate your verification efforts.
It's also a useful tool for exploring certain scenarios you as the designer (or programmer) may be worried about. You can overconstrain it to just look at certain behaviour and you can ask it to show you some specific execution that hits a particular property. Maybe there's some code path that's very hard to stimulate for whatever reason and you want to check out its functionality. You can get the formal tool to hit it and then see what it does.
Hi, author here. I definitely agree the valuable part is thinking through the verification process, which devotes a lot of time to understanding the system and adds resilience in edge cases. Someone on Twitter brought that up: https://twitter.com/jetapiador/status/943909582279233536.
Perhaps that's what I should have concluded my argument with.
This is similar to the conclusion I came to after doing my PhD in a CS department that had lots of people enamored by formal methods: that the usefulness of it is inversely proportional to the complexity of the project to be verified.
Sad that apparently hasn't changed much in the 20 years since.
I say "enamored", because formal methods seems to me like a field that gets a lot of attention because the possible outcome of it is so damn attractive: if we indeed could easily prove software correct, that would be the biggest advance in software, EVER. But is a bit of a fallacy that just because the imagined end result is so elegant and desirable, that there must be a path to it.
Also, any gains that can be claimed by formal methods are hard to evaluate in a vacuum, as using any other techniques that involve spending 10x more time thinking about / working with some code can possibly have equally beneficial reductions in bug counts. Most software that has undesirable bugs is such because no time/money investment is deemed worth it to make it better, not because formal methods is the only way to improve it.
I find it funny that meanwhile, mainstream software is going the opposite direction, with ever larger systems written in dynamically typed languages like JavaScript and Python, or overly simplistic static type systems like those of Java and Go. You'd think if formal methods was our far away desirable future we'd be incrementally moving in that direction by using more expressive type systems (Haskell, Rust..) and maybe even dependent typing. The invariants that can be encoded in such systems can rule out usefully large classes of bugs with way lower effort than formal methods, but somehow movement on that front is slow. If so few people care to even grab the low hanging fruit, why do formal methods people think we can jump from chaos to order in one step?
For me, the problem with formal methods is that a prerequisite is a (formal) specification, and I've never seen one of those in 20 years of software work.
That depends on what you want to use formal methods for. For full functional correctness, yes, you need a spec. But much can be gained by just proving the absence of runtime errors, i.e., for C, things like "all array accesses are in bounds", "every dereferenced pointer is valid", or "integer arithmetic never overflows". Formal methods tools for C have those baked in, so you can get going quickly without specifying these things yourself. It's still hard, though.
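As a tiny sketch of what one of those baked-in obligations looks like in SMT form (my own illustration with the Z3 Python bindings, not any particular C tool; the preconditions are invented):

from z3 import BitVec, Solver, Not, BVAddNoOverflow

# "x + y never overflows a signed 32-bit int" under preconditions the tool has
# gathered from the code, phrased as a search for a counterexample.
x, y = BitVec("x", 32), BitVec("y", 32)
s = Solver()
s.add(x >= 0, x < 1000, y >= 0, y < 1000)
s.add(Not(BVAddNoOverflow(x, y, True)))
print(s.check())   # unsat: no overflow is possible, so the check passes silently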
> And the ones I did encounter had zero "testing" done to see if they corresponded to the real world.
You can excuse human beings for being unable to write the programs they want, but if they can't even specify the programs they want, they don't deserve the epithet Homo sapiens.
If you believe a proof of correctness produced by a formal verification tool, I have a bridge to sell you.
For a proof of correctness, you need to model the environment, and write a specification or a golden model. These are just as prone to bugs as the system you are trying to verify (even assuming the tools are correct).
That is not to say formal verification is useless. Developing a proof will force you to formalize your intuitive grasp of the system and its interfaces. Counterexamples produced by the tools will often help to find deep bugs that would be almost impossible to find any other way. And after many iterations of debugging both the system and the proof, you might gain a lot of confidence in the proof and the system.
But even after all of that, trusting a formal proof as a guarantee of correctness is insane.
"For a proof of correctness, you need to model the environment, and write a specification or a golden model. These are just as prone to bugs as the system you are trying to verify (even assuming the tools are correct)."
I disagree with the idea that a spec is as buggy as the implementation. Some examples.
In hardware verification, a spec for a floating-point operation is typically a page or two long. The corresponding RTL implementations can be tens of thousands of lines, with many additional concerns (pipelining, clock gating, ...) that don't show up in the spec, and where the operation itself is fused together with dozens of other operations. The difference in complexity is striking.
In simple software cases, say a sorting function, the spec can be as straightforward as "the answer is ordered and has the same elements as the input array." This same spec can be used with any number of implementations, including things like Timsort where there are multiple modes and optimizations that are interacting with one another. The spec says nothing about runs, galloping, etc., which the implementation uses.
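To make the comparison concrete, that two-line spec is short enough to write out directly; here is one rendering of it as an executable predicate (my own Python sketch, not a formal artifact):

from collections import Counter

def meets_sort_spec(inp, out):
    ordered = all(out[k] <= out[k + 1] for k in range(len(out) - 1))
    permutation = Counter(out) == Counter(inp)   # same elements as the input
    return ordered and permutation

print(meets_sort_spec([3, 1, 2, 1], sorted([3, 1, 2, 1])))   # True for any correct sort

The same predicate judges insertion sort and Timsort alike, which is the point: the spec stays small while implementations grow.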
For more complex software, the spec can of course become complex. Even so, there are examples, e.g., the Jitawa Lisp runtime (https://www.cl.cam.ac.uk/~mom22/jitawa/), where the spec is a relatively straightforward description of a programming language that is much, much simpler than the implementation. For instance, the spec here does not mention things like garbage collection and just-in-time compilation, which are fairly complex features of the implementation.
Anyway, trusting a formal proof seems, to me, certainly no less sane than trusting a system without such a proof.
Here's a spec which was buggy and yet widely accepted and implemented: TLS 1.0
You could formally prove an implementation of it and still have a dangerously buggy end product. The same could be said for many of the early encryption specs which have proven to be vulnerable to timing attacks, side channel attacks, and so forth.
I've helped QA quite a few financial specs (namely those created to transform credit reports into decisions) which were inherently buggy, and these were written as almost pure mathematical formulas that just didn't take corner cases into account.
A formal proof has value, but it's certainly not an absolute - relying on it blindly to indicate the correctness of a program is putting yourself at risk.
EDIT: As a completely separate point, let's look at the potential problems with this sorting spec: "the answer is ordered and has the same elements as the input array."
What does ordered mean? How do I compare a number and a string? A binary blob? A structure? Is there a restriction on the word "same" - do they have to be the existing object, or can it be a copy? Should the answer be in the same memory space as the original, or should it be a distinct memory object? Can the original be modified?
> A formal proof has value, but it's certainly not an absolute - relying on it blindly to indicate the correctness of a program is putting yourself at risk.
Indeed, I agree, but lots of people never ask themselves how much more at risk they are putting themselves when their implementation has no formal spec or proof at all!
Like another commenter has argued, it seems that people seem to look at formal verification like it's an all-or-nothing approach (either it achieves zero bugs or it's completely worthless, or alternatively -- if it's formally verified then it MUST be bug free!), but it's really not, it simply gives higher (often, much much higher) assurances.
Verification is showing that the spec and the impl agree. This is usually done with comprehensive tests if it's done at all, and very rarely with formal methods (since formal specs are so rare). TLS 1.0 is not a formal spec.
Concurrently, you have validation. This is arguably far far more important, since it's cheaper to fix this before you go spend money on verification. Validation is showing that the spec, IS THE RIGHT SPEC! This is where you would have put almost all your cryptographers when developing TLS 1.0, and there's work on doing this with TLS 1.3 (eg, https://www.ethz.ch/content/dam/ethz/special-interest/infk/i...).
Once you have a formal spec, written in an automated, machine-checked system, you can start analyzing it with formal methods. This is super super powerful. The seL4 specification, even at its most abstract, is rather subtle and quite long. It's hard to reason with it. But since they already had a formal spec, they could analyze that spec at its highest level of abstraction to prove valuable security properties: confidentiality and integrity. Now, you don't have to look at the whole spec to validate it for your security requirements. You just have to look at the statements of confidentiality and integrity and validate those.
This is standard high-assurance software development, and it's not taught or practiced very often.
>What does ordered mean? How do I compare a number and a string? A binary blob? A structure? Is there a restriction on the word "same" - do they have to be the existing object, or can it be a copy? Should the answer be in the same memory space as the original, or should it be a distinct memory object? Can the original be modified?
This is completely uninteresting in any formal spec system, because it usually fixes those details implicitly. You're usually comparing things of the same type, or using an explicit comparison operator that has stated properties (which become proof obligations).
To that, I would argue that security is (or should be) part of the spec; if it’s not secure, it’s not correct. Having it be a separate and lesser concern is how we end up tacking on security features as an afterthought.
To borrow from the kernel devs: Security requirements are just like any other requirements.
>I disagree with the idea that a spec is as buggy as the implementation. Some examples.
It isn't necessarily as buggy as implementation, but I've worked on plenty of (non-formally verified) projects where the majority of bugs were spec bugs. If you're diligent about following the UNIX philosophy of assembling small self-contained, tightly specified (and test driven) components, this is pretty normal I think. Building the wrong thing is often the biggest problem.
The one time I did do formal verification (at uni) that was certainly the case too, though that may partly have been because the language was an overcomplicated pile of crap (Z).
One of my co-workers also told me a story about a formally verified program being translated into Python and having a couple of bugs picked up on the way. I find this pretty believable.
Formal verification techniques can be very useful, especially for finding deep, often insanely complex bugs.
The process of developing the proof can produce a lot of insight about the system and expose hidden assumptions that were never considered while designing the system.
It is the final proof itself which is the problem. It does not contribute anything except for terminating the debugging process.
Because the specification can be just as complex as the system, the proof can only provide a false sense of security. A correctness theatre.
> It is the final proof itself which is the problem. It does not contribute anything except for terminating the debugging process.
That's true in a certain restricted sense, if you see the verified program as a thing set in stone and never touched again. But verified software is maintained, changed and extended just like other software. Whenever you change a program and your proofs still work, that tells you something. If the proofs break, that also tells you something. If all you need to change are some small lemmas but the overall end-to-end proof goes through, you know exactly what interactions with other parts of the system were affected by your change.
> Because the specification can be just as complex as the system
That's true in principle, but in practice I don't think it's going to hold in most cases. A good specification necessarily captures the essential complexity of the system, but it's a lucky developer whose implementation doesn't introduce considerable accidental complexity as well. Ensuring that this accidental complexity doesn't break anything could have great value, depending on the nature of the system and how damaging a failure could be.
This is what many people are missing in their understanding of formal verification. Things that reduce bugs are not proofs, but a different thought process that sort of double checks everything from a different perspective.
> Because the specification can be just as complex as the system
Can you give an example of this? I don't find this true in general at all e.g. writing the specification of a sorting algorithm is much easier than implementing most practical sorting algorithms.
> the proof can only provide a false sense of security. A correctness theatre.
You could use this argument to say that unit tests, type systems, code review etc. are correctness theatre. All these things prevent mistakes.
It's pretty clear that a fully formal specification of a program must be (roughly) as complex as the program itself, for otherwise you could construct an infinitely descending chain of ever simpler specifications which leads to a contradiction.
The reason that you found the specification of a sorting algorithm simpler than the sorting algorithm is that you probably have not fully specified the algorithm. Consider the sorting algorithm in C's standard library, which allows the user to specify a comparison function. Now imagine you'd pass something like the following comparison function:
def compare(x, y):
    # A comparison function with a side effect: it leaks the order of comparisons.
    print("I'm comparing %i with %i" % (x, y))
    if x < y:
        return 1
    if x == y:
        return 0
    return -1
It almost completely exposes the inner workings of sorting. Hence a full spec needs to account for this.
> It's pretty clear that a fully formal specification of a program must be (roughly) as complex as the program itself, for otherwise you could construct an infinitely descending chain of ever simpler specifications which leads to a contradiction.
That's like saying a theorem statement must be roughly as complex as the proof.
Also consider that it's much harder to reason about (and prove) the correctness of heavily optimised versions of algorithms compared to naive implementations. For example, a proof of correctness for heapsort is more involved than one for insertion sort.
Trusting any sufficiently complex system to be correct is insane. It's virtually impossible to be confident at all levels that your proof is correct once you move much beyond trivial systems.
This is why modern software testing takes a more... I guess you could call it Bayesian... approach.
QA, unit testing, integration testing, etc - these are all ways of reducing probabilities of incorrectness. Never formally prove it of course, but make your confidence higher with each.
> QA, unit testing, integration testing, etc - these are all ways of reducing probabilities of incorrectness. Never formally prove it of course, but make your confidence higher with each.
And formal proofs don't make your confidence higher? IME they're the most efficient tool in the toolbox if you need to get to really low defect rates. (Of course, the sad economic reality is that for most consumer software even a 5% defect rate is perfectly acceptable).
It's not that simple. Defects don't necessarily have to cause problems, depending on how you design your software (think OTP-style architecture). And avoiding problems is in fact what you always need, not low defect rates. It just so happens that sometimes you can only achieve decent reliability by pursuing very low defect rates. But for a lot of software it cannot economically beat other approaches.
Fred Brooks said it years ago: "Even perfect program verification can only establish that a program meets its specification. … Much of the essence of building a program is in fact the debugging of the specification."
As far as I can tell, the formal proofs themselves were always correct in his example. The problem was that the proved statements were not the ones intended.
Whenever I read a post like this I have to wonder: was there as much resistance to writing tests for your code before that became common practice? Many of the arguments seem to apply to test driven development in the same way as they do to formal verification:
- You have to go over every single line of code (to show full functional correctness|to achieve full test coverage).
- The amount of code that goes into (verifying|testing) an application is on the same order of magnitude as the application itself!
- There are still bugs in (unproven|untested) parts of the application!
- It's not perfect, so why should we use it at all?
At the end of the day, formal verification is slowly but surely becoming mainstream. Whether through more advanced type systems, static analysis, domain specific model checking, or through interactive theorem proving. We're finally at the point where whole program verification is starting to become realistic for some domains and it will only become easier from here.
---
As for the article itself, others have already pointed out some specific problems. Let me just add one more: Simulation proofs do not scale, but are far from the only option. I would encourage you to look into more high level verification techniques using axiomatic semantics, such as Iris in Coq.
Formal Verification is the state of the art verification method for hardware these days, with most of the ASICs in the market going through it, and recently many SoC's and complex chips like GPUs and non-RISC CPUs. It does not mean that other methods of verification are dying or are ever going to. FV is just as good as the combination of the tools and the people working on it, and it's an extremely difficult discipline to work on.
FV for software is a whole other beast, we simply lack the compute power to thoroughly verify even simple software properly, and the tools are still in their infancy.
My personal opinion, having worked for the leading FV company in the market for years, is that FV for software is probably sufficiently far-fetched that machine learning methods for verification might get there before FV for software becomes a reality. The reason is: while for hardware it is imperative to have 100% correctness, since the cost of bugs is absurdly high, for software having 90-something percent is mostly acceptable, so if it's cheaper to let a neural network (or something else, I'm no ML expert) estimate the correctness of the software, that's what's going to happen.
"The guarantees of a formally verified system rest entirely on the assumptions of the system, including its trusted computing base (TCB). The TCB of a verified system includes the specs, which are manually written and hoped (fingers crossed!) to be correct, the verification tools (e.g., Coq engine, Z3, compiler), and the runtime infrastructure (OS, hardware). This should be a huge red flag. For instance, if you’ve used Z3, you’ll agree that it’s black magic to everyone except the developers, and my professors confess that they’ve found correctness bugs, especially in decidability theories that are not commonly used."
I don't quite understand why Z3 is part of the TCB. Isn't Z3 generally used to generate proofs which are then verified by the respective proof assistant?
Yes, if you use Z3 as part of a proof assistant like Isabelle/HOL or F-Star.
No, if you use Z3 as part of some other system that does not have its own primitive notion of proofs and proof checking. Many tools are like this: Dafny is one, I think, as is Why3, and the Alive system for checking the correctness of LLVM peephole optimizations. They simply trust Z3.
It's also used in lots of other things that are not "verifiers" as such. For example, to find new compiler optimizations by superoptimization: https://arxiv.org/abs/1711.04422 (And sure enough, if you search the paper for "Z3", the first hit is a paragraph where the authors describe that they once had a miscompilation due to a bug in Z3.)
It seems to me that the arguments here are fair, but maybe the conclusion isn't.
But the guarantees that we expect are much stronger than a correct protocol, handshake, or compiler optimization. We need the entire system to be reliable, and we need proof effort to be minimal for widespread adoption to be feasible.
Ideally, yes, of course that is true. But even if it is not currently true, having some aspects of a system formally proven still means we can focus our limited resources on the other aspects and the integration, and that's still beneficial.
Although the CompCert paper explicitly says that the parsing step, as well as the assembling/linking steps, are not verified, this kind of bug is a huge blow to the credibility of formal methods. It really makes you wonder what sort of bugs formal verification is supposed to catch.
That's strange, I reach the exact opposite conclusion from this statement.
Author here. I guess what I was trying to say is that CompCert as a system has been trusted for over 10 years, yet only now do we find such a silly correctness bug. People won't read the paper to find out that parsing is unverified, they'll just trust CompCert because they hear it's been rubber-stamped as "formally verified". And that's dangerous.
A better way to do it is to pull up C compilers, get their bug rates, and compare them to CompCert's for both the unverified and verified parts. The former tests formal specs + a safe language, while the latter tests verification. You can use smaller C compilers if you want to be fair. Two or three spec errors versus what you find in other compilers would corroborate the CompCert team's claims. Likewise for verified TLS versus ad hoc protocols or implementations.
The gap is IDEs becoming automated theorem provers. You put in the type signature of what you want and keep adding unit tests until the IDE finds the correct code for you.
Their philosophies are completely different. Formal proofs aim at proving the nonexistence of bugs, while tests (any kind of them) aim at revealing bugs in case you introduce them.
You can sometimes generate code from a proof, but you can't do that from tests.
I have begun to think of formal methods as a sort of tar pit. The idea is attractive to people who are concerned about correctness and are aware that you can reason about program behavior, but they do not yet provide a solution to the practical problems of writing correct software. On the other hand, this is not an excuse for abandoning any attempt at reasoning about programs, and settling for trial-and-error as the only way to write software.
Formal methods have been verifying real-world systems using ACL2 for a long time. Also, SQLite has 730 times the amount of test code: https://sqlite.org/testing.html .
Formal verification is not a magic bullet, though; it doesn't make your systems bulletproof by itself.
Isn’t formal verification subject to the problem of “when to stop proving correctness”? Even if you can prove one part you still have to prove that it’s used correctly? Like in the WPA2 case?
You don't need to go all the way and specify/prove everything in order to achieve significant assurances of absence of (at least certain classes of) bugs.
See the section "Use SPARK at the right level for you" [0] for an example of how you can approach formal verification.
You can start proving certain things, like that your code is free from run-time errors, which already significantly eliminates a lot of bugs.
Then you can start proving certain properties, like it was done for WPA2, to make sure that at least those properties hold and are bug/error free.
Finally, you can achieve full functional correctness, which gives you a significantly higher assurance that your implementation really does satisfy your functional specification. You can start doing this in parts of the code, and progressively increase all the way to the entire code base if you so wish.
Every added formal verification step gives you higher assurance that your code works as intended.
Sure, you can never prove that your code is 100% bug free (this would require the functional specification itself to be bug free, and AFAIK there is no way to verify this), but I would argue that 99% verification is much better than 0% verification.
The other problem is that you have to understand the proof itself. It's not always trivial to tell whether a complex proof is stating what was intended correctly. Effectively, the proof itself is a program that needs to be verified by a human.
What's really needed is a mathematical analysis of size of the bug space compared to the size of proving code required to cover it. I.e. to get to 99.5% confidence that a system of size N is bug-free it would take a proving system of size f(N). Is f polynomial, a small exponent, a large exponent? If it's a large exponent, then the more you take on, the farther you get from your goal, and the narrowly targeted efforts are the best we can do.
Some people work on monads over linear logic that supposedly could simulate any imperative program and make them available for formal verification, so we will see. OTOH I definitely wouldn't want to spend time proving correctness of every single line of code I write; it would make programming "no fun" zone and require rocket-science level of understanding.
> I definitely wouldn't want to spend time proving correctness of every single line of code I write; it would turn programming into a "no fun" zone and require a rocket-science level of understanding.
Fortunately, you don't have to. Formal verification can be done as little or as much as you want (and personally, I find it's quite fun, but that's subjective).
You can look at it like it's testing on steroids - you can not only verify that some part of your code works for certain inputs, but you can verify that it works for all inputs! (for the definition of "work" that you decide to give it).
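A toy illustration of that difference, with made-up names: the first test checks one point in the input space, the second checks the claim for every possible input. The "all inputs" check only works by brute force here because u8 is tiny; a prover establishes the same universal claim symbolically, for unbounded domains.

    fn wrapping_double(x: u8) -> u8 {
        x.wrapping_mul(2)
    }

    #[test]
    fn doubles_three() {
        // Testing: one input, one expected output.
        assert_eq!(wrapping_double(3), 6);
    }

    #[test]
    fn even_for_all_inputs() {
        // "For all inputs": the result is even for every possible u8.
        for x in 0..=u8::MAX {
            assert_eq!(wrapping_double(x) % 2, 0);
        }
    }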
Yeah, it's fun to some extent; I still remember proving all of Peano arithmetic and then using that to prove functional programs. But doing it at a real-world level of detail would probably drive me crazy fast; just proving that some floating-point computation does what it's supposed to would be extremely tricky, because of bit-level details and the way representable numbers get sparser across the range that fits in a given number of bits.
I'm hoping the formal verification crowd will start to influence Ethereum developers. You have small amounts of relatively simple code that absolutely must be correct, or at least predictable - it seems like a perfect match.
The question that needs to be answered is not, "But the guarantees that we expect are much stronger than a correct protocol, handshake, or compiler optimization. We need the entire system to be reliable, and we need proof effort to be minimal for widespread adoption to be feasible." (Pretend that's a question, ok?) The question that needs answering is, "what is the easiest way to provide (strong) guarantees that the code is reliable?"
If you wanted to write a filesystem, how much work would you have to put into it before you could consider it "reliable" using traditional techniques? If it is 10x more than it would take to simply write the code, then "MIT’s FSCQ file system, developed over about 1.5 years using Coq, the complete system was 10x more code than a similar unverified filesystem. Imagine that — 2000 lines of implementation become 20,000 lines of proof!" doesn't sound so bad, does it? And 10x is not necessarily a bad estimate---unit testing alone would account for 2x, at a rough guess, and not only does it not cover all possible executions, it also has a reputation in this neck of the woods of focusing on the happy path.
"An Empirical Study on the Correctness of Formally Verified Distributed Systems" has been discussed here before (https://news.ycombinator.com/item?id=14441364) but I'd like to mention one quote from it:
"Finding 9: No protocol bugs were found in verified systems, but 12 bugs were reported in the corresponding components of unverified systems. This result suggests that recent verification methods developed for distributed systems improve the reliability of components that, despite decades of study, are still not implemented correctly in real-world deployed systems. In fact, all unverified systems analyzed had protocol bugs in the one-year span of bug reports we analyzed."
(And speaking as a guy who has some network protocol experience, some of the bugs that were found in the "shim" layers were...likely caused by inexperience: no, TCP does not preserve write boundaries on the recv end, and yes, UDP will duplicate packets under some error conditions.)
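On the TCP point, the usual fix in a shim layer is explicit framing: since TCP is a byte stream, one send on the far end may arrive split or merged, so the reader has to reassemble messages itself. A minimal Rust sketch, assuming a hypothetical 4-byte big-endian length prefix (the framing scheme and function names are mine, not from the paper):

    use std::io::{self, Read};

    /// Read one length-prefixed message from a byte stream.
    /// Assuming "one write == one read" is the classic bug; read_exact keeps
    /// reading until the whole frame has arrived.
    fn read_frame<R: Read>(stream: &mut R) -> io::Result<Vec<u8>> {
        let mut len_buf = [0u8; 4];
        stream.read_exact(&mut len_buf)?; // may take several reads internally
        let len = u32::from_be_bytes(len_buf) as usize;

        let mut payload = vec![0u8; len];
        stream.read_exact(&mut payload)?; // keep reading until the frame is complete
        Ok(payload)
    }

    fn main() -> io::Result<()> {
        // Simulate a stream that delivers two frames back to back.
        let mut wire: Vec<u8> = Vec::new();
        wire.extend_from_slice(&3u32.to_be_bytes());
        wire.extend_from_slice(b"abc");
        wire.extend_from_slice(&2u32.to_be_bytes());
        wire.extend_from_slice(b"hi");

        let mut cursor = io::Cursor::new(wire);
        assert_eq!(read_frame(&mut cursor)?, b"abc".to_vec());
        assert_eq!(read_frame(&mut cursor)?, b"hi".to_vec());
        Ok(())
    }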
I will admit that many of the author's specific complaints are completely true. This quote should be in big letters in the introduction to any formal verification work: "On top of all that, if any spec or implementation changes even slightly, all the specs, impls, and proofs that depend on it will change as well. This makes even small changes a huge pain." And Coq is something of a pain in the rump. (I encourage anyone with a few spare hours to work through Software Foundations (https://softwarefoundations.cis.upenn.edu/); it's a lot of fun but at the same time very frustrating. [That should be "many, many spare hours". --ed.]) The least damning thing you can say about Coq is that it's not very well engineered. On the other hand, SMT solvers like the kind of thing behind SPARK and Frama-C are brilliant when they succeed in finding proofs, but when they fail... well, they light up a "?" on the dashboard and expect the experienced driver to know what the problem is. Further, they're rather limited in the programming features you can use with them---Dafny and the Java verifiers have garbage-collected languages, alleviating some of the pain.
But still, to address the author's fundamental issue,
"I went into the class believing that formal verification is the future — the only solution to a world of software ridden with bugs and security issues. But after recent events and a semester of trying to apply formal methods, I’m a serious skeptic. In this post, I’ll discuss why I think formal verification has a long way to go — and why it just doesn’t work right now."
I don't think any clear-cut answer exists. For some applications (like your FS example) formal proofs are a complete no-brainer. In other cases, you are better served by many tests. In yet other cases, even tests are too much, and you will do better matching your code's structure to the specification as closely as possible. Not to mention the cases where validating the spec is the real issue, and all your attention will be there.
It's not even an either-or issue. You can have all of those situations appearing in different places in the same code base.
His post and yours both oversell the problems while underselling the successes. It’s best to keep the big picture in mind. Yes, it’s extremely hard. Yes, we shouldn’t rely exclusively on it or expect too much from it. On Guttman’s end, the Orange Book of TCSEC (later, EAL7 of Common Criteria) required many activities like the ones I included in my own security framework:
Each was supposed to check the others. Each time, you want experienced people doing it, since it’s the people that make these processes work. What’s mostly missing in this post, and especially in Guttman’s, is all the evidence of defect reduction. Most of the work he cites tracked the problems found throughout the lifecycle. They all said the formal specifications caught bugs due to ambiguity. Some found formal proofs beneficial, often mentioning “deep” errors they likely wouldn’t have found with testing or code review. TLA+ use at Amazon was a great example, where errors showed up 30+ steps into protocols. Most of the old verification efforts found they caught errors just by simplifying their specs for the prover, even if they never ran the proofs. And for high-assurance security, the A1-class systems would go through 2-5 years of analysis and pentesting, where evaluators said they did way better than systems built with mainstream practices. Most of those can’t even tell you how many covert channels they have.
If you’re interested in methods that helped in practice, I dropped a start on a survey paper on Lobste.rs, trying to condense several decades’ worth of material into an outline post:
Now, all that said, full proof is hard, with a lot of problems. You hit the nail on the head about the infrastructure problem. I’ve always strongly encouraged people to concentrate on building up one or two projects to provide infrastructure for the most common ways of solving problems in programming, with as much automation as possible. They have at least been building most things on Coq, Isabelle/HOL, HOL4, and ACL2, and having prior work to build on has helped them. As you’ve seen, though, they’ve not made this a priority: the tooling is still horrible. So, horribly-difficult work with horribly-difficult tooling. That leads to the next proposal I made, which might help you or some of your readers who want the 80/20 rule of most benefits without most problems:
Here’s my recommendation in general. Learn lightweight methods such as Meyer’s Design-by-Contract for most stuff, TLA+ for concurrency/ordering, Alloy for structure, SPARK for low-level code, and so on. TLA+ on distributed stuff seems to sell people the most. With a formal method in hand, then apply this methodology: memory-safe language + lightweight method to model-check important stuff + Design-by-Contract + tooling to generate tests from specs/contracts + fuzzer running with runtime checks enabled. This combination will bring the most quality improvement for the least effort. In parallel, we have the experts formally verifying key components such as kernels, filesystems, network stacks, type systems, compilers, security protocols, and so on. They do so modelling real-world usage, with each component coming with contracts and guidance on exactly how it should be used. The baseline goes up dramatically, with specific components becoming nearly bullet-proof.
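Here is a rough sketch of what that combination can look like at the code level, using only the Rust standard library. The isqrt function, its contract, and the hand-rolled input generator are stand-ins I made up; in practice you would use a contracts crate, a property-testing library, and a fuzzer such as cargo-fuzz, all running with debug assertions enabled.

    // Design-by-Contract + spec-derived tests + runtime checks, in miniature.

    /// Integer square root: the largest r such that r*r <= n.
    fn isqrt(n: u64) -> u64 {
        // Body: a simple, obviously terminating linear search; not the point.
        let mut r = 0u64;
        while (r + 1).checked_mul(r + 1).map_or(false, |sq| sq <= n) {
            r += 1;
        }
        // Postcondition (the "contract"): checked on every call when debug
        // assertions are enabled, which is exactly how a fuzzer should run.
        debug_assert!(r * r <= n);
        debug_assert!((r + 1).checked_mul(r + 1).map_or(true, |sq| sq > n));
        r
    }

    #[test]
    fn contract_holds_on_generated_inputs() {
        // "Tests generated from the spec": a crude deterministic xorshift
        // generator standing in for a property-based testing library.
        let mut x: u64 = 0x2545F4914F6CDD1D;
        for _ in 0..1_000 {
            x ^= x << 13;
            x ^= x >> 7;
            x ^= x << 17;
            let n = x % 1_000_000; // keep the linear search cheap
            let r = isqrt(n);      // contracts inside isqrt do the checking
            assert!(r * r <= n && (r + 1) * (r + 1) > n);
        }
    }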
Also, you said you wanted the entire system to be reliable. There was one effort that set the gold standard on that, by noting how everything could fail and building in systematic mitigations for prevention and/or recovery:
The OpenVMS clustering software alone let systems run for years at a time; one supposedly ran for 17 years at a rail yard or something. The NonStop systems took that down into the hardware itself, with them just running and running and running. Applying a combination of balanced assurance like I describe, NonStop techniques for reliability, and Cambridge’s CHERI for security should make one hell of a system. You can baseline it all at medium assurance using tools that already exist, with formal verification applied to one piece of the design at a time. The work by Rockwell-Collins on their commercially-deployed AAMP7G shows that’s already reachable.
If you're generating vast amounts of user-visible proof work, something is wrong.
The whole system I'm describing is now on Github.[4] It's decades-old code, but I've ported about 80% of it to modern compilers. The Franz LISP to Common LISP port of the prover isn't working well, and the author of that code died years ago. If anyone wants to work on that, go ahead.
[1] http://www.animats.com/papers/verifier/verifiermanual.pdf
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/de...
[3] https://github.com/John-Nagle/nqthm
[4] https://github.com/John-Nagle/pasv