This is the opposite direction to what I expected to see for new theorem proving efforts: "Good old fashioned AI" vs leveraging advances of deep learning for natural language.
I think the direction that is ripe for advance is a two-part process: (1) translating from human-language mathematical proofs to a computer-verifiable formal proof language, combined with (2) GPT3-style automated generation of plausible next steps in human-language proofs.
Gowers emphasizes the success of humans at "pruning" to only consider potentially useful next steps in a proof. I think GPT3 demonstrates such pruning in generating plausible natural language text continuations.
And the success of natural language translation between natural languages suggests natural->formal language translation may also be ripe.
Combined with the success of formal language automated theorem provers, it seems plausible to build on current technologies a stack that produces formally verified natural language proofs.
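To make that concrete, here is a minimal sketch (in Python) of the stack I'm imagining, with propose_next_steps, formalize, and verifier_accepts as hypothetical placeholders for the GPT3-style step generator (2), the natural->formal translator (1), and a Lean-style checker; none of these are existing APIs:

    import heapq

    def search_proof(goal_nl, max_expansions=10000):
        # Frontier of partial natural-language proofs, ranked by the
        # language model's score for the steps taken so far.
        frontier = [(0.0, [])]                      # (negated score, steps)
        for _ in range(max_expansions):
            if not frontier:
                return None
            neg_score, steps = heapq.heappop(frontier)
            # (2) GPT3-style generator proposes plausible next steps.
            for step, step_score in propose_next_steps(goal_nl, steps):
                candidate = steps + [step]
                # (1) Translate the human-language proof-so-far into a
                # formal language the checker understands.
                formal = formalize(goal_nl, candidate)
                if formal is None:
                    continue                        # untranslatable: prune
                status = verifier_accepts(formal)   # Lean-style checker
                if status == "complete":
                    return candidate                # formally verified NL proof
                if status == "valid_so_far":
                    heapq.heappush(frontier, (neg_score - step_score, candidate))
        return None

The point of the sketch is just the division of labor: the language model supplies the pruning/plausibility, the translator bridges to a formal language, and the checker supplies the guarantee.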
I don't think Gowers is trying to build a better theorem prover.
An AI researcher would be trying to build a better prover.
Gowers is a mathematician looking for mathematical insight.
This project's success would be judged by how well it answers the fundamental questions he posed (i.e. How does a human generate proofs, given no algorithm exists to solve the halting problem?).
The output might be more like metamathematics, with insights provided by software, rather than ML software optimised for a goal.
> How does a human generate proofs, given no algorithm exists to solve the halting problem?
To be fair, this has some trivial answers -- despite being a fascinating question.
For example, given a formal proof verification system (e.g. Lean), you can search for proofs (by brute force or, hopefully, using heuristics), and if a proof exists you will find it after some finite time. The problem is that sometimes no proof exists. You may think of getting "smart" and also trying, in parallel, to prove that no proof exists. But that is itself just another proof, so the recursion tower (a proof that (a proof that (...) doesn't exist) doesn't exist) never resolves. I think this mostly means you have to accept that proofs may not exist, or may take arbitrarily long to find (maybe also a sociological problem for mathematicians!). That's of course how human mathematics works: no one knows whether any given famous conjecture (e.g. the Riemann hypothesis) will be proven within some time frame, or ever.
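To spell out the trivial answer: the sketch below is the whole semi-decision procedure, assuming hypothetical candidate_proofs() and verifies() wrappers around a Lean-style enumerator and checker:

    def prove(statement, budget=None):
        # Enumerate every finite candidate proof in some fixed order.
        # If a proof exists this halts and returns it; if none exists it
        # runs forever, unless the optional budget cuts it off -- and
        # hitting the budget only tells you "don't know", never
        # "no proof exists".
        for i, candidate in enumerate(candidate_proofs()):
            if budget is not None and i >= budget:
                return None
            if verifies(statement, candidate):
                return candidate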
But isn't that how humans do it too? They have a hunch, pursue it, and if it doesn't work out, they backtrack. Of course, some people have better hunches than others, but that doesn't change the global idea.
...well, quite. But that's not really the "move along, nothing to see here" it first sounds like.
"Hunch" is a label we use for "I don't know explicitly how I came up with this".
Generally, when we have a label for a concept the description of which starts with "I don't know...", some people try and follow up with questions: is that a thing it is possible, in principle, to know? How could we find out for certain? If it's knowable, how do we get to know what it is?
Sometimes they succeed, and generations of such successes are how we ended up with the modern world.
GPT3-style automated generation of plausible next steps in human-language proofs is a disaster waiting to happen. GPT3 generates vaguely plausible text sequences without understanding the material. It relies heavily on the imprecision of language and the fact that there are many plausible next words in a sentence. It doesn't even perfectly capture the rules of grammar, as it still sometimes makes grammatical mistakes.
Consider that mathematics requires greater precision: the language has more exacting meanings and fewer plausible alternatives. Also consider that the bar to doing something useful in mathematics is extremely high. We're not trying to GPT3 a plausible sentence here; we're trying to guide GPT3 into producing the complete works of Shakespeare.
GPT3 demonstrates a kind of pruning for generating viable continuations in natural language, but I'd argue it is nothing like pruning the useful next steps of a proof. The pruning in GPT3 works as a probability model and is derived from a good dataset of human utterances. Generating a good dataset of plausible and implausible next steps in mathematical proofs is a much harder problem.

The cost per instance is extremely high, as all of the proofs have to be translated into a specific, precise formal language (otherwise you explode the search space to any plausible utterance in some form of English+Math symbols, making the problem much harder). Even worse, different theorem provers use different formal languages, making the dataset less reusable than is typical in ML problems. The dataset is also far smaller: how many interesting proofs in mathematics are at a suitable depth from the initialization of a theorem prover with just some basic axioms?

Even if you solve the dataset problem, though, there are further problems. GPT3 isn't designed to evaluate the interestingness of a sentence, only its plausibility, with the hope that the context in which the sentence is generated provides enough relevance.
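To make the pruning point concrete: the sketch below is roughly all the "pruning" a GPT-style model gives you, where candidates is a hypothetical list of (next_step, model_probability) pairs:

    def prune_continuations(candidates, top_k=5):
        # Keep the continuations the language model rates most probable.
        # Nothing here knows whether a step is mathematically useful or
        # interesting, only whether it looks like plausible text in context.
        return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]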
In short, I'm highly skeptical that benefits in natural language translation will translate to formal languages. I'd also argue the problems you classify into "formal language translation" aren't even translation problems.
I also think very few people see the technologies you've mentioned as related (for good reason) and I think a program that attempts to build on them is likely to fail.
Fair enough to be skeptical. Some responses to your points:
> the bar to doing something useful in mathematics is extremely high
Ah but the bar to do something interesting in automated theorem proving is much lower. Solving exercises from an advanced undergraduate class involving proofs would already be of interest.
> Generating a good dataset of plausible and implausible next steps in mathematical proofs is a much harder problem.
There are thousands of textbooks, monographs, and research mathematical journals. There really is a gigantic corpus of natural language mathematical proofs to study.
In graduate school there were a bunch of homework proofs which the professors would describe as "follow your nose": after you make an initial step in the right direction, the remaining steps follow a kind of pattern that quickly becomes familiar. I think it is very plausible that a GPT3-style system trained on mathematical writing could learn these "follow your nose" patterns.
> problems you classify into "formal language translation" aren't even translation problems
Fair. Going from natural-language proofs, like those in a textbook, to a formal language of the kind automated theorem provers use has similarities to a natural-language translation problem, but it would be fair to say it is its own category of problem.
I agree there might be some sort of translation process that would partially automate converting all those examples in textbooks, monographs, and research journals from English+Mathematics pseudocode into correct formal logic statements. I think this is an interesting and complex problem that could make the cost of building such a dataset manageable. It still comes with a problem: most of these sources start very far past the axioms, so in order to use them you need formal-language proofs of everything they assert without proof.
I question whether you'd get high enough accuracy out of a pattern-matching model like GPT3, which occasionally chooses an unusual or unexpected word. Given how frequently translating A->B->A yields A* instead of A with GPT3, I wonder whether we'd actually capture the precise mathematical statements.
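One rough way to quantify that worry is to run the round trip on a batch of statements and measure the drift. This is only a sketch; nl_to_formal, formal_to_nl, and semantically_equal are hypothetical models/checkers, not existing tools:

    def round_trip_drift(statements):
        # Translate each natural-language statement A to a formal statement B
        # and back to A*, then count how often the round trip changes the
        # mathematical content.
        drifted = 0
        for a in statements:
            b = nl_to_formal(a)          # A -> B
            a_star = formal_to_nl(b)     # B -> A*
            if not semantically_equal(a, a_star):
                drifted += 1
        return drifted / len(statements)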
I'm also skeptical of the approach here, but for the opposite reason. ML will conquer proofs by an AlphaZero approach of ignoring all human training data. You can generate arbitrary problems easily enough, and doing some GAN-like system of both generating problems that can't yet be solved and a solver to solve them seems to obviate the need for human training data. I'd be really hesitant to be a PhD student in the GOFAI or natural-language-datamining approaches, since I think this could easily be solved by ML in the next five or ten years (specifically, at least one well-known unsolved research problem being solved by AI). I hope that I'm wrong... I like math as a human endeavor.
The problem with generating arbitrary math problems is that you can generate an infinite number of boring, pointless problems. For example: prove that there is a solution to the system of inequalities x + 2y + 135798230 > 2x + y + 123532 and x - y < 234. Training an AI system to solve these problems doesn't accomplish anything.
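That particular system reduces to x < y + 135674698 and x < y + 234, so x = y = 0 already works, as a two-line check confirms -- which is exactly why such generated problems teach a solver nothing:

    x, y = 0, 0
    assert x + 2*y + 135798230 > 2*x + y + 123532   # 135798230 > 123532
    assert x - y < 234                               # 0 < 234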
I think we are in a stage for mathematics similar to where computer Go was in the '80s. Back then we didn't even have the right logical structure for the algorithm; Monte Carlo tree search hadn't been invented yet. Once a good underlying structure was found and AI heuristics were added on top, Go became winnable by AIs. For mathematics I think we also need multiple innovations to get there from here.
Such an "AlphaZero approach" will only knock Gowers's GOFAI approach out of business if the AI can also "justify" its proofs, in the sense that Gowers explains in his blogpost.
Do you think that will happen in 5 years?
Google's recently introduced 540-billion-parameter language model PaLM can already explain complex jokes and reason logically by explicitly describing its reasoning steps. The results are quite impressive.
https://ai.googleblog.com/2022/04/pathways-language-model-pa...
To be sure, this happened just two years after GPT-3. So 5 years for something like what Gowers wants, but based on a large language model, seems in fact quite realistic.
I find myself doubting that this goal makes any sense.
One mathematician often cannot "justify" their reasoning process to another if their thinking is even slightly different. For example what is intuitive for one side of the algebra/analysis divide is often not to people on the other side.
It is therefore unlikely that a human-level computer program will think enough like humans that its reasoning can be justified to any given human. And vice versa. For it to do so, it has to be better than the human, also have a model of how the human thinks, and then be able to break down thinking it arrived at one way into something that makes sense to the human.
If a computer succeeded in this endeavor, it would almost certainly come up with new principles of doing mathematics that would then have to be taught to humans for them to appreciate what it is doing. Something like this has already happened with Go, see https://escholarship.org/uc/item/6q05n7pz for example. Even so, humans are still unable to learn to play the game at the same level as the superhuman program. And the same would be true in mathematics.
I've found that for most substantial proofs I've been exposed to, I understood the author's justification of how they found the proof. That doesn't mean I would have been able to come up with that step on my own, but once it was done I understood why they took it. If you get sufficiently far into the esoteric weeds of either analysis or algebra, the justifications won't be understood outside that field. But they don't have to be.
Theorem provers starting from basic axioms are often looking for things that aren't in the esoteric weeds of a specific subfield, since it takes too long for them to generate a suitable set of theorems that establish those fields.
> And vice versa. For it to do so, it has to be better than the human, also have a model of how the human thinks, and then be able to break down thinking it arrived at one way into something that makes sense to the human.
This is more or less what Gowers is after. And that's exactly why he wants GOFAI instead of ML: he wants to understand how "doing mathematics" works.
I think this approach hinges on how you build and define the network that flags the "most promising search state". I'd argue this is a far harder problem in theorem proving than it is in various games. Before AlphaZero all kinds of position evaluation techniques existed in each of these games and were in widespread use for over 25 years.
Comparatively, it's hard even to define anything resembling a board whose structure is fixed enough to apply something similar to positional evaluation to. What you want is a network that flags next steps in theorems as promising and allows you to effectively ignore most of the search space. If you have a good way to do this, I think AlphaZero would be a promising approach.
But even if you get AlphaZero to work, it doesn't solve the problem of understanding. AlphaZero made Go players better by giving them more content to study, but it didn't explicitly teach knowledge of the game. Strategically, in Go it's actually rather unsatisfying: it seems to win most of its games by entirely abandoning influence for territory and then relying on its superior reading skills to pick fights inside the influence, fights it often seems to win from what pro players would previously have described as a disadvantageous position. This means the lessons are limited for players whose reading skills are weaker than AlphaZero's: they can't fight nearly as well, which suggests abandoning the entire dominant strategy it uses.
I think for short proofs that fit in a GPT-style NLP predictor's input, it could produce an initial list of likely "next steps" which could be further scored by the adversarially-trained scoring model which has some intuition for which proof search strategies are likely to work, the combination allowing each model to have different areas of expertise (NLP being technically competent locally and the scorer providing global judgement). The entire proof search would correspond to a single play in AlphaGo, with proof search happening entirely within the Monte Carlo exploration of likely next steps as opposed to a sequence of plays as in Go, and play only happening when an automatic proof-checker validates the final proof. Automatic proof-checking could also prune syntactically invalid continuations from the Monte Carlo search tree, and probably provide some similar optimizations.
My guess is that such a model would learn to compose a collection of lemmas that score as useful, produce sub-proofs for those, and then combine them into the final proof. This still mimics a sequence of plays closely enough that the scoring model could recognize "progress" toward the goal and predict complementary lemmas until a final solution is visible.
It may even work for very long proofs by letting it "play" a portion of its early lemmas, which remain visible to the scorer, while the proof sequence is chunked into as many pieces as the NLP needs to see it all, and backtracking behind what was already output is no longer possible. Once a lemma and its proof are complete, they can be used or ignored in the future proof, but modifying them sounds less useful.
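A rough sketch of the search loop I have in mind, with lm_propose, score_state, and check as hypothetical stand-ins for the NLP predictor, the adversarially-trained scorer, and the automatic proof-checker (none of these are real APIs):

    import math, random

    def mcts_proof_search(goal, iterations=1000, k=8):
        # Each tree node is a partial proof (a list of steps).
        root = {"steps": [], "children": [], "visits": 0, "value": 0.0}
        for _ in range(iterations):
            # Selection: descend by a UCB-style rule, remembering the path.
            node, path = root, [root]
            while node["children"]:
                node = max(node["children"], key=lambda c: ucb(c, node["visits"]))
                path.append(node)
            # Expansion: the LM proposes k continuations; the proof-checker
            # rejects syntactically invalid ones so they never enter the tree.
            for step in lm_propose(goal, node["steps"], k=k):
                status = check(goal, node["steps"] + [step])
                if status == "complete":
                    return node["steps"] + [step]   # checker-validated proof
                if status == "valid":
                    node["children"].append({"steps": node["steps"] + [step],
                                             "children": [], "visits": 0,
                                             "value": 0.0})
            # Evaluation + backup: score one new leaf (or penalize a dead end)
            # and propagate the value along the selection path.
            if node["children"]:
                leaf = random.choice(node["children"])
                path.append(leaf)
                value = score_state(goal, leaf["steps"])
            else:
                value = -1.0                        # dead end: discourage revisits
            for n in path:
                n["visits"] += 1
                n["value"] += value
        return None

    def ucb(child, parent_visits, c=1.4):
        # Standard UCT: exploit high average value, explore rarely-visited children.
        if child["visits"] == 0:
            return float("inf")
        return (child["value"] / child["visits"]
                + c * math.sqrt(math.log(parent_visits) / child["visits"]))

The checker does the hard pruning (invalid steps never enter the tree) while the scorer only biases which branches get explored, which mirrors the local-competence/global-judgement split above.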
I think the probability of a GPT-style approach learning to generate mathematically interesting next steps is close to zero. The structure GPT captures is a sense of plausibility of a next step and while you could argue that it's possible given enough data to train the network to only rate mathematically interesting steps highly, the rate of that training would be extremely slow. GPT-style training is most effective for language where there is a certain degree of flexibility. It tends to learn plausibility and provides relevance mostly from context. This approach seems akin to searching in the language space of English+Mathematics over all proofs and hoping to learn mathematically relevant continuations. I think the search space is way too large and positive feedback is far too rare for the model to converge to anything useful.
I highly doubt you could get such a model to produce a useful collection of lemmas. Even if you could, I don't think a useful collection of lemmas progressing towards a goal has anywhere near the same structure as scoring a position in a game.
Ultimately, for any AlphaZero technique to work you need a powerful position-evaluation network. Almost all the magic happens in that network's pruning. It's quite remarkable that you can get a search of 80,000 nodes to do better than a search of 35,000,000 nodes; all of that is due to efficient pruning from the network. You haven't convinced me of any plausible approach to getting such a network here.
There's a very big difference between solving one problem (if you throw enough compute at it, you obviously will succeed) and general proving being solved. The latter essentially means being able to replace all programmers; that would effectively mean AGI in 5 to 10 years.
> doing some GAN-like system of both generating problems that can't yet be solved and a solver to solve them seems to obviate the need for human training data.
I don't see how this is supposed to work: what if the generator just outputs unsolvable problems?
Human-language proofs for anything non-trivial are really just proof sketches; that's why you can't really meaningfully separate this into a two-step process.