I almost completely disagree with the author of this article.
Yes, formally verified systems still have bugs due to the TCB (the trusted computing base). But that's because the TCB is precisely the only part of the system that is not formally verified! So of course it's going to have bugs.
If so many crashing and correctness bugs were found in the (small) trusted part of formally-verified systems (well, in reality only 16 bugs were found in 3 entire formally verified systems, but I digress), imagine how many similar bugs there are in systems that are not formally verified at all!
If anything, what this article should be arguing is that operating systems, proof-finding and verification tools, and compilers should also be formally verified (and some already are), so that bugs in these parts of the system are eliminated as well.
The only part of the article that I agree with is that it's not feasible to write 10x more proof code than real code.
But IMHO, this is why proof automation is really important to have - tools like Isabelle/HOL, Why3 and SMT solvers like Z3 provide this (the latter of which the author mentions).
Coq, on the other hand, has much less automation and the result is that it's much harder to use and results in larger proof code. Yet, it's seemingly inexplicably much more popular - perhaps it's the catchy name?
Coq has a much stronger (albeit constructive) logic than HOL. That is sometimes convenient. Most users of proof assistants don't get far enough for powerful automation to be unavoidable.
In addition, Coq benefits from a network effect: Coq started targeting university teaching earlier, primarily at UPenn with Pierce's Software Foundations [1].
That Coq has a stronger logic is an argument I hear often, but is that really so important when the result is that it's so hard to use that no real-world software engineer is ever going to bother?
Maybe with HOL I will not be able to prove some deep theoretical mathematical result that I could with Coq's Calculus of Constructions, but if I can otherwise much more easily prove >99% of my code to be bug-free, then I would argue that perhaps it's the better "tool" for the job...
The issue is not the edge of logical independence, but rather how easily you can express certain theorems. Most verification is about the verification of programming languages (e.g. the POPLmark Challenge [1]), and in this setting sometimes what you want to prove is more convenient to express in the CoC. You can also express it in HOL, but not quite as easily.
All that said, automation of provers based on the LCF approach with a non-constructive logic is much superior as of December 2017. With this in mind I was surprised to see that L. de Moura based the new Lean prover [2] on a variant of the CoC. He thinks that all automation that we currently have for HOL (i.e. Isabelle/HOL) can be brought to Lean.
> sometimes what you want to prove is more convenient to express in the CoC. You can also express it in HOL, but not quite as easily.
> All that said, automation of provers based on the LCF approach with a non-constructive logic is much superior as of December 2017.
This is kind of the point I am making. Is the fact that some things are more convenient to express worth the substantial loss of automation?
> With this in mind I was surprised to see that L. de Moura based the new Lean prover [2] on a variant of the CoC. He thinks that all automation that we currently have for HOL (i.e. Isabelle/HOL) can be brought to Lean.
> J. Blanchette [3] seems to think the same.
Interesting, thanks! I hope they are right and in the future this kind of automation does come to Coq and Lean.
> Maybe with HOL I will not be able to prove some deep theoretical mathematical result that I could with Coq's Calculus of Constructions, but if I can otherwise much more easily prove >99% of my code to be bug-free, then I would argue that perhaps it's the better "tool" for the job...
Is this true though? What program properties can Isabelle easily automate the proofs of that Coq cannot?
> Is this true though? What program properties can Isabelle easily automate the proofs of that Coq cannot?
Almost all?
Granted, I haven't used Coq very much, so correct me if I'm wrong, but isn't it true that in Coq, with the exception of basic proofs like simple arithmetic expressions, in order to prove a theorem you have to at least explicitly know and write the names of the lemmas that are needed to prove it (apart from how they fit together)?
In Isabelle/HOL, many times you literally only need to write "try" and Isabelle will find the proof of the theorem for you, including the lemmas that were needed to prove it and how they fit together (assuming these lemmas were already proven before). If it turns out that your theorem is false, it can show you an example of inputs where it fails.
This auto-proof functionality I'm describing is mostly thanks to Sledgehammer [0]; the automatic counterexample finding is due to Nitpick (and another tool whose name I can't remember), although there are a few other automation tools that are also very useful.
Granted, you still have to have a high-level understanding of how you want to prove your code, but I would argue this eliminates a lot of the work and (wasted?) time you need to accomplish the final goal...
"Granted, I haven't used Coq very much, so correct me if I'm wrong, but isn't it true that in Coq, with the exception of basic proofs like simple arithmetic expressions, in order to prove a theorem you have to at least explicitly know and write the name of the lemmas that are needed to prove it (apart from how they fit together)?"
Coq has automated tactics like "auto" that search for proofs based on a configurable database of "hints". On the other hand, if it doesn't succeed, it leaves the proof state unchanged and gives no hints as to what to do next. But on the other, other hand, if it succeeds, it gives no hints as to how it succeeds. So it's less useful than it might be.
Edit: Also, Coq has a dandy baked-in search system for finding lemmas and so forth that would be useful in a given situation. On the other hand, for a similar flavor, look on Hoogle for functions transforming one list into another list.
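For concreteness, a minimal Coq sketch of both points (the lemma name tail_le is made up for this example; auto, the arith hint database, le_S and Search are standard):

(* "auto" does a depth-limited search using hint databases; "arith" is one of
   the standard ones. If it fails, the goal is left untouched; if it succeeds,
   it doesn't tell you which hints it used. *)
Require Import Arith.

Lemma tail_le : forall n m : nat, n <= m -> n <= S m.
Proof. auto with arith. Qed.

(* The built-in search: find lemmas whose statements match a pattern. *)
Search (_ <= S _).
(* prints, among others: le_S : forall n m : nat, n <= m -> n <= S m *)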
See [1] for an example (edit: although this may not be the best example nowadays, see below):
Quoting:
> Consider the conjecture "length(tl xs) <= length xs", which states that the length of a list’s tail (its "cdr") is less than or equal to the length of the entire list.
> Thanks to Vampire, Sledgehammer finds the following proof: by (metis append_Nil2 append_eq_conv_conj drop_eq_Nil drop_tl tl.simps(1))
Those are all the lemmas that were needed to prove the theorem. They were selected completely automatically.
AFAIU the "metis" procedure tries to fit the lemmas together so that they prove the theorem, and I believe the HOL kernel gives high assurance that if "metis" succeeds, then the theorem was proven correctly (but my understanding may be wrong here).
Note that Sledgehammer may even automatically find and return a human-readable proof, which for the example above looks something like this:
proof -
  have "tl [] = []" by (metis tl.simps(1))
  hence "∃u. xs @ u = xs ∧ tl u = []" by (metis append_Nil2)
  hence "tl (drop (length xs) xs) = []" by (metis append_eq_conv_conj)
  hence "drop (length xs) (tl xs) = []" by (metis drop_tl)
  thus "length (tl xs) <= length xs" by (metis drop_eq_Nil)
qed
Edit: That said, note that at least nowadays Isabelle/HOL seems to be able to prove the above theorem by simplification alone ("by simp"), so this may not be the best example.
Metis is a first-order automated theorem prover. The metis tactic handles encoding the goal as a first-order problem (using more or less just the rules you provide it), invoking the actual Metis prover, and reconstructing a proof in the kernel logic from the output proof. So it is impossible to prove things that are false using the metis tactic (i.e., metis is entirely outside the TCB).
Re: the simplifier, I feel it's a bad example, because coming up with good simp rules, and knowing when to add them and when not to while writing new code in Isabelle/HOL, is almost an art.
One example are nominal techniques [1]. While they are constructive, they have so far only found substantial implementation in Isabelle/HOL [2]. The reason is that the natural implementation is non-constructive, and leads to only a small blowup in proof complexity, which can easily be handled by Isabelle/HOL's proof automation. In contrast, constructive implementations of nominal techniques seem to require a considerable blowup in logical complexity, which the existing proof automation in Curry-Howard based provers doesn't seem to handle gracefully.
Another issue is that most of the automation that is known as "hammer" is based on non-constructive automatic provers.
Do you know why that is the case? Nominal logic is usually presented as a sheaf model (i.e., as the internal language of the Schanuel topos), which models a constructive dependent type theory. Is there a problem when constructing a universe?
I'm probably the wrong person to discuss this question with. I don't know more than I've written above. All I know about this is from conversations with Christian Urban, who implemented Nominal Isabelle. It was a while back, maybe the problem of implementing nominal techniques in constructive type theory in an effective way (i.e. doesn't put additional burden on automation) has been solved by now.
The problem as I understand it is this.
Let's say you have a theory T in the formal language given by a signature S.
- In HOL you extend T to T' and S to S', but the deltas between T and T', and between S and S', are small and largely generic. This is possible because almost everything handling nominal can be done using double negation.
- In CoC you extend T to T' and S to S', but the deltas between T and T', and between S and S', are huge and largely non-generic.
Not that I'm aware of. It would be nice to have one. The number of people who understand nominal, HOL, CoC and implementations of provers is vanishingly small.
Coq is not a monolithic system, but is itself built around a small TCB (the kernel). Proof automation and tactics always produce proof objects (terms of the type theory) that are checked by the kernel.
This system design allows proof automation to be built outside the kernel, i.e. it is safe to use proof automation that hasn't been proven correct, because any proof found by the automation is checked by the kernel before it is accepted.
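A rough illustration of what that means in practice (the lemma name small_fact is made up; lia and Print are standard Coq): the lia tactic is untrusted automation, but whatever term it constructs is re-checked by the kernel at Qed, and you can inspect that term afterwards.

Require Import Lia.

(* "lia" is an untrusted decision procedure for linear arithmetic;
   the proof term it builds is re-checked by the kernel at Qed. *)
Lemma small_fact : forall n : nat, n <= n + 1.
Proof. lia. Qed.

(* Inspect the proof object that the kernel actually accepted. *)
Print small_fact.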
Another feature of Coq, which unfortunately is underused in practice, is the inclusion of a notion of reduction in the type theory.
Terms of the type theory are programs, and equality in the type theory is defined up to execution of these programs.
This means that in principle, one can construct a function SMT(s) and a proof that SMT(s) returns true if the SMT formula s is satisfiable, and false otherwise.
The cool thing is that a proof that a certain formula s is unsatisfiable can now be represented as an equality proof SMT(s) = false. This proof is very small, because it relies on the function SMT which is essentially a verified SMT solver.
You might wonder how quickly SMT can be executed. The answer is that you would have to write SMT with purely functional data structures, but that Coq has an integrated JIT compiler that will speed up execution.
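Here is a toy version of that reflection idea, with a trivial evenness checker standing in for the (vastly harder to build) verified SMT function; evenb and the lemma name are just made up for the sketch:

(* A purely functional checker standing in for the verified decision procedure.
   In the real setting you would also prove, once and for all, a soundness
   theorem linking the checker's answer to the property you care about. *)
Fixpoint evenb (n : nat) : bool :=
  match n with
  | 0 => true
  | 1 => false
  | S (S k) => evenb k
  end.

(* The interesting fact is established by computation: the kernel evaluates
   "evenb 1000", so the certificate is just reflexivity rather than a
   1000-step proof. *)
Lemma evenb_1000 : evenb 1000 = true.
Proof. vm_compute. reflexivity. Qed.

The same shape scales to the SMT(s) = false certificates described above once SMT itself has been written and verified; vm_compute (Coq's bytecode-compiled evaluation) is what keeps the computation step fast.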
So Coq is a pretty amazing system, but it is, 30 years after its creation, still in its infancy.
We have 90,000 lines of code in a commercial application with no integration or unit tests, or any test automation at all, and that’s used by approximately 15,000 users a day.
Obviously we have bugs. We measure both our crash rate and usage/behavior with public analytics packages. About one out of every thousand uses of the application ends in a crash. More often, a user is unable to complete a task in the app because of a code failure, or poor/misleading UI.
I don’t know how hard it would be to formally verify our application, but I suspect at least 10x the cost of building it. We don’t talk about formal verification, but we do talk about the need for unit tests. My answer to both is: stop, think about what you are saying.
Our users are rarely inconvenienced by crashes; our crash rate is already lower than 90% of our peer group. Spending even another 1x of the time and effort we spent building this app on trying to formally prove it and eliminate those last few crashes would be a massive expenditure for a tiny gain in user satisfaction, and only in one area. It would literally set us back years in functionality.
Writing/debugging/maintaining 3,000 unit tests doesn't offer a much better return, because our application is almost entirely front-end GUI. Automated UI tests written today would have to be entirely rewritten next year when we update the entire UI, and then likely again within another year as we do it once more.
In our case the most productive way our team can improve the application for its users is to improve functionality and usability. We will create bugs along the way but find most in manual testing, and our usage will increase as customer benefits increase.
My point isn’t that we are doing everything right and that formal verification is wrong, or that unit tests are wrong, etc. My point is every application is unique. They are all developed with limited resources and those resources should be focused on producing the best possible outcome.
If I was leading a team at NASA developing embedded software for a space probe, pair programming and unit testing would be required, along with integration testing and test automation, and we would have to consider formal verification. We would only have one shot at getting it right, or a billion-dollar probe would fail.
But my current team doesn’t face such an immense cost from its bugs. We can push an update to fix critical bugs to every customer whenever we deem necessary. Our biggest challenge is getting customers to use our application, and to use it more often. Customer impediments are far more often poor design decisions than crashes or mis-implementations. So that’s what we need to focus on.
> Our biggest challenge is getting customers to use our application, and to use it more often.
That may be your biggest challenge now, but at some point, you may find that in order to overcome that challenge, you need to add major new functionality that requires changing broad swaths of your application. When doing that, you may find that your crash rate becomes unacceptably high as your changes begin violating assumptions you didn't know you were making. That's what unit and integration tests are for, they help expose those assumptions before a user does. This is a common misunderstanding, tests are not about catching bugs in the present, they are about decreasing the risk of adding new functionality in the future. I would go so far as to say that if you're working on a project with a defined end state, where there is zero chance it will ever be expanded past that end state, then automated tests are pointless; spend a lot of time on QA of the end product, shrink wrap it when you're happy with it, and don't look back. But I would also go so far as to say that nobody makes software like that. I certainly haven't ever worked on a project that is unlikely to have major, risky, expansions in functionality during its lifetime.
I agree with your general point, but currently don’t think the math adds up on unit testing for us.
It's not that it wouldn't be better if we had unit tests. But their benefits are mostly in the future, meaning they need to be discounted for the time value of money. And even without discounting, they won't be as valuable to customers and our product as actually creating new features and fixing existing ones.
If our product fails in the next year, any efforts yielding most of their benefits after that are wasted. Any efforts that yield more customers and usage now give us more resources to do more things in the future, including unit tests.
That way, if we get to that major, risky expansion in the future, we will have more resources to write unit tests then, when they are most useful.
And lastly, I've reviewed the bugs that the business prioritizes for us. Virtually none, including crashes and feature failures, would be caught by traditional unit tests. Lots of them were force-unwrapping code written by the original contractors that would pass a unit test, but crash in a multithreaded environment when garbage collection frees the right object at the wrong time.
To do unit tests we would also have to write a ton of mocks, making our production code less maintainable and understandable, and potentially introducing new bugs.
I think that Rosette is a good tool which abstracts away a bunch of the difficulty of SMT solvers and also allows you to synthesize programs: https://emina.github.io/rosette/
How many real-world production systems have benefited from formal methods significantly more than if the same resources had been applied in other ways?
Let's assume you can name a few: how general is the class of systems this can apply to? Or is this just an obscure corner of computer science that applies to NASA and various low-level system components?
I don’t count out formal methods, but can you really say none of this reminds you of the AI winter?
Is there evidence that it's more reliable compared to equally complex systems? Or that its reliability would have been impractical or more expensive to achieve using traditional methods?
I didn’t hold out much hope of a surgical response (because it’s not the most common thing being done in projects) but that looks like a great link exactly on point.
It may sound cynical, but as I mentioned, anecdotally it does really feel like the AI winter, where people simultaneously felt there was incredible promise while hearing a steady stream of discouraging progress reports.
I look forward to reading their paper, maybe there are a lot more bright spots than can be easily seen from the outside.
I think formal verification is going to be size-limited. seL4 is a microkernel, which means it's small: about 9,000 LOC. I don't think Xen, at 150,000, is within the abilities of verification. So it's promising for small stuff.