As noted in another comment, that is legit program synthesis from a complete
specification in natural language. The workflow described in the paper and
abstracted in your comment can be very useful, as long as the user has good confidence in the correctness of the results (or the ability to eyeball and correct them as needed).
The problem is that this approach relies on a large language model trained on a copy of the entire web (GPT-3, trained on Common Crawl plus other datasets), fine-tuned on GitHub (Codex) and then fine-tuned again on the kinds of problems that it is supposed to solve. That is an insane amount of resources to spend on simple programming problems with known solutions that can be implemented by hand at much lower cost and effort. In that sense it's a little bit disappointing: so much data, so much compute, and all you can do is tell a computer how to write Python?
> and then fine-tuned again on the kinds of problems that it is supposed to solve
Could you point me to where they claim to have fine-tuned Codex?
From what I can see they claim:
> We use OpenAI’s davinci-codex engine for all of our generations. We fix all of Codex’s hyperparameters to be the same for all experiments: top-p, which is the portion p of the token probability mass a language model samples from at each step, is set to 1, sampling temperature is set to 0 (i.e. argmax), and response length is set to 200 tokens.
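In other words, they appear to be calling the public Codex engine with greedy decoding, not a fine-tuned model. With the old (pre-1.0) openai Python client, that would look roughly like the sketch below; the prompt is a placeholder of my own, not from the paper:

    import openai  # pre-1.0 openai-python client, assumed here

    openai.api_key = "sk-..."  # placeholder API key

    # Greedy decoding: temperature=0 (argmax), top_p=1, 200-token responses,
    # matching the settings quoted above. The prompt is a placeholder.
    completion = openai.Completion.create(
        engine="davinci-codex",
        prompt="# Write a Python program that ...",
        temperature=0,
        top_p=1,
        max_tokens=200,
    )
    print(completion["choices"][0]["text"])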
BTW - OpenAI Davinci costs $0.06 per 1000 tokens. Codex is currently free in closed beta, but I'd guess the cost will be similar. I would happily pay even 100x that ($6) for correct solutions to advanced mathematical problems. The issue is that this paper's evaluation is ridiculous and Codex does not work anywhere near as well as they claim. It is pure hype by people who do not appear to be affiliated with OpenAI.
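For scale, here is the back-of-the-envelope arithmetic behind that $6 figure; the prompt size is my guess, the 200-token response length is from the paper's settings:

    # Rough per-call cost at davinci pricing (USD per 1000 tokens).
    price_per_1k = 0.06
    prompt_tokens = 800      # assumed size of problem statement + scaffolding
    response_tokens = 200    # fixed by the paper's settings
    cost_per_call = (prompt_tokens + response_tokens) / 1000 * price_per_1k
    print(round(cost_per_call, 3))        # ~0.06 USD per attempt
    print(round(100 * cost_per_call, 2))  # ~6 USD for 100 attempts per problem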
This is legit program synthesis for an iterative brute-force solution to the problem. But it is in no sense "solving university-level mathematical problems". It's not even figuring out that it can brute-force the original problem on its own - a human looked at the original problem and told the model how to brute-force it, and it did. It's cool that it did, but this achievement has nothing to do with the title and abstract of the paper.
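To make that concrete with a toy example that is not from the paper: if the original problem were, say, "find the smallest positive integer n for which n^2 + n + 41 is composite", the model isn't doing the mathematical work of choosing the search; a human reduces it to "write a loop that checks each n", and the resulting program is just:

    # Toy illustration, not from the paper: the human has already chosen the
    # brute-force strategy; the model only has to write the loop.
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))

    n = 1
    while is_prime(n * n + n + 41):
        n += 1
    print(n)  # 40, since 40^2 + 40 + 41 = 41^2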