Software gets more complicated, so bugs become more common. Hardware gets faster, which relieves the pressure to keep software fast, so software gets more layered (and hence slower) and more complicated still. More layers of abstraction mean more things to get wrong, and the world is increasingly reliant on software, so correctness becomes a serious concern.
That sums up the "software crisis" as it relates to correctness. Most theorem provers have been swept up in it: they are written in high-level languages and frameworks made by people who paid no thought to verification and only slightly more than most to correctness. What's more, even some languages in the verification space have dubious correctness or semantics: Rust and Dafny both make a big deal about formal correctness, yet neither has a formal spec (Rust is working on one). Coq is not known to be consistent; the literature is full of "here is a proof that some subsystem of Coq has this and that property," but no one can handle the whole thing. Not to mention that Coq has had a few soundness bugs in its history, one of which (the reals violating EM) wasn't even clearly a bug, because no one knew what the model was supposed to be.
Agda is a free-for-all as far as I can tell, and Isabelle has some strange thing with axiomatic typeclasses that takes it out of the HOL world. HOL Light is pretty decent metatheory-wise, and the logic at least has been formalized inside itself, but the implementation has not been, and OCaml's Obj.magic is still there, ready to cause inconsistency if you feel like using it.
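For context: Obj.magic is an unchecked cast of type 'a -> 'b, so it can forge values of any abstract type, including an LCF-style kernel's theorem type. Here's a minimal sketch of the hole it leaves open (my own toy example, not HOL Light's actual kernel):

    module Kernel : sig
      type thm                            (* abstract: only the kernel can build these *)
      val truth : thm
      val to_string : thm -> string
    end = struct
      type thm = string
      let truth = "|- T"
      let to_string t = t
    end

    (* Forge a "theorem" from an arbitrary string, bypassing the kernel entirely: *)
    let fake : Kernel.thm = Obj.magic "|- F"
    let () = print_endline (Kernel.to_string fake)    (* prints: |- F *)

The cast type-checks, so nothing in OCaml's type system stops a "theorem" like this from flowing into the rest of the system.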
Lean has not had any soundness bugs to date, and the theory is well understood, but it is a massive C++ application, and there are doubtless dozens of implementation bugs in there waiting to be found.
None of the big theorem provers is really a candidate for being proved correct, because either it's not known what that would even mean, or it's trivially false and we all try to pretend otherwise, or it's an intractable open question.
These are the paragons of correctness in our world.
Can we do better? What does it even mean to do better? This paper aims to find out.
Lean has independent proof checkers, and Coq could have, too. The fact that these are big systems doesn't harm the de Bruijn principle: you only need to trust the checkers…
(btw I couldn't find the source code for the proof checker in the paper, is it available?)
The fast proof checker I talk about is mm0-c: https://github.com/digama0/mm0/tree/master/mm0-c . It is (deliberately) bare-bones, and I'm planning to formalize, approximately, the assembly you get from running it through gcc.
There is a second proof checker for MM0 in mm0-hs: https://github.com/digama0/mm0/tree/master/mm0-hs , which has a few more features, like the MM1 server mode for interactive feedback, as well as a verifier that reads a text version of the proof file (the .mmu format, not described in the paper).
The footnotes are supposed to be links to the github repo but I got the versions mixed up and submitted a partially anonymized version to arXiv.
Do we even know if it's possible to prove the correctness of a theorem prover, in a very general sense? At some point, that would imply that some theorem prover has to prove its own correctness (either by proving itself directly or through a loop of theorem provers). Intuitively, this seems to start running up against Gödel-style incompleteness problems.
It does, but only if you state the theorem in a particular way. Let's say that you have a verifier A, that checks theorems in Peano Arithmetic (PA). Inside that logic, you can define what a computer is, how it executes, what PA is, what a verifier is, and what it means for a program to verify theorems of PA. Then you can have a sequence of bytes (verifier B), and prove that they represent a program that verifies theorems of PA.
What have you proven? Well, verifier A (using PA) proves that "verifier B proves theorems of PA", which means: if B says phi is true, then PA |- phi. We would have a hard time actually running verifier B inside the logic of verifier A (that would mean replaying every step of B's computation as theorem applications inside verifier A), but even if we did, we would obtain a proof of "PA |- phi". If we assume A correctly implements PA, that means PA |- "PA |- phi". In general, this is the best we can do. If we further assume that PA is a sound axiom system, i.e. when it proves facts about numbers we won't find concrete counterexamples, then this strengthens to PA |- phi and then to phi itself, so we've learned that phi is true.
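To spell out the chain in one place, here is a compilable Lean-style sketch; all the names are my own hypothetical paraphrase, not the paper's actual encoding:

    -- Hypothetical predicates: `Accepts ψ`: running B's bytes on ψ succeeds;
    -- `Prv ψ`: PA proves ψ (arithmetized); `Tr ψ`: ψ is true.
    example {Formula : Type} (Accepts Prv Tr : Formula → Prop)
        -- what verifier A proves about B's bytes:
        (hB : ∀ ψ, Accepts ψ → Prv ψ)
        -- the extra assumption that PA is sound:
        (hSound : ∀ ψ, Prv ψ → Tr ψ)
        -- then a successful run of B on φ tells us φ is true:
        (φ : Formula) (h : Accepts φ) : Tr φ :=
      hSound φ (hB φ h)

Without the soundness assumption hSound, the chain stops at Prv φ, i.e. at PA |- "PA |- phi", which is the Gödelian limit mentioned above.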
The plan is to prove, inside verifier A (that is, MM0), the statement: "Here is a sequence of bytes. It is an executable which, when executed, has the property that if it succeeds on input "T |- phi", then T |- phi." The bootstrapping part is that the sequence of bytes in question is in fact the code of verifier A itself. To support statements of the form "here is a sequence of bytes", the verifier also has the ability to output bytes as part of its operation (it's not just a yes/no oracle) and to assert facts about the encoding of those bytes in the logic.
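Schematically, the statement to be proved has this shape (again a hedged Lean-style sketch with hypothetical names; MM0's actual encoding of executables and formulas is of course much more concrete):

    -- "This executable only accepts provable statements."
    def Bootstrap {Bytes Formula : Type}
        (verifier : Bytes)                     -- the bytes of verifier A itself
        (Succeeds : Bytes → Formula → Prop)    -- executing the bytes on "T |- phi" accepts
        (Provable : Formula → Prop) : Prop :=  -- "T |- phi" holds
      ∀ φ, Succeeds verifier φ → Provable φ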
Seems like the ideal would be a chain of theorem provers, each capable of verifying the next, with the first one simple enough to be verified with pen and paper.
Provers in the LCF tradition have tiny kernels, so only a small core of code needs to be trusted. There are also reflective provers that can admit self-verified extensions. One project, called Milawa, built up several layers of this.
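To illustrate the LCF idea: the kernel exports an abstract theorem type whose only constructors are the inference rules, so everything of that type is a theorem by construction, and the rest of the prover can be arbitrarily large untrusted code. A toy OCaml sketch (my own, not any real prover's kernel):

    module type KERNEL = sig
      type term
      type thm
      val var : string -> term
      val imp : term -> term -> term
      val assume : term -> thm            (* {A} |- A *)
      val mp : thm -> thm -> thm          (* from |- A -> B and |- A, get |- B *)
      val concl : thm -> term
    end

    module Kernel : KERNEL = struct
      type term = Var of string | Imp of term * term
      type thm = term list * term         (* hypotheses, conclusion *)
      let var s = Var s
      let imp a b = Imp (a, b)
      let assume a = ([ a ], a)
      let mp (h1, ab) (h2, a) =
        match ab with
        | Imp (x, b) when x = a -> (h1 @ h2, b)
        | _ -> failwith "mp: antecedent mismatch"
      let concl (_, c) = c
    end

    (* Untrusted code, however large, can only obtain a `thm` via the rules above: *)
    let () =
      let a = Kernel.var "A" in
      let th = Kernel.mp (Kernel.assume (Kernel.imp a a)) (Kernel.assume a) in
      ignore (Kernel.concl th)

(Modulo escape hatches like the Obj.magic issue mentioned upthread, which is exactly why the implementation language's soundness matters.)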
(I'm the author of the paper BTW.) Regarding TLA+ and CakeML: TLA+ seems like a good idea, Leslie Lamport talks a lot of sense in most of his papers. I especially recommend "Composition: A Way to Make Proofs Harder" (https://lamport.azurewebsites.net/pubs/lamport-composition.p...). It sums up a lot of my feelings about "simplifying" a problem by adding more layers and generalizing the layers so that they have more work to do individually. I'm not sure I'm totally on board the idea of using temporal logic; my plan is to just use plain first order logic to talk about the states of a computer.
Regarding CakeML, I'm really impressed by the work they've done, and I think they, more than anyone else, are leading the charge on making verification as pervasive as possible at every level (especially the low levels, where historically it has been difficult to get past the closed-source barrier). That said, I don't think they are doing an optimal job as a bootstrap. Their system does that and so much more, so if all you want is a bootstrap, it's way overkill. And you pay for that overkill: 14-hour compilations per target (see https://cakeml.org/regression.cgi).
(I edited my comment to remove my questions about TLA+ and CakeML when I got to the "Related work" part of the paper. I was also wondering about Milawa but couldn't remember the name! Kudos, sir!)