Of course program synthesis has been a thing for years; I remember some excellent papers out of MSR 10 years ago. But which of those could read a prompt and build the program from the prompt? Setting up a whole bunch of constraints and having your optimizer spit out a program that fulfills them is program synthesis, and is super interesting, but it's not at all what I think of when I'm told we can make the computer program for us.
For instance, RobustFill takes its optimization criteria from a bundle of pre-completed input/output examples of how people want the program to behave, instead of having the problem described in natural language and producing a program that solves it.
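To make the contrast concrete, here's a minimal, hypothetical sketch of synthesis from that kind of specification: the entire "spec" is a few input/output pairs, and the synthesizer enumerates a toy DSL looking for a program consistent with all of them. The DSL and all names are invented for illustration; RobustFill itself uses a neural model rather than plain enumeration.

    from itertools import product

    # A toy DSL: each primitive is a named string -> string function.
    PRIMITIVES = {
        "upper":  str.upper,
        "lower":  str.lower,
        "strip":  str.strip,
        "first3": lambda s: s[:3],
    }

    def compose(names):
        """Build a program as a left-to-right composition of primitives."""
        def program(s):
            for name in names:
                s = PRIMITIVES[name](s)
            return s
        return program

    def synthesize(examples, max_len=3):
        """Enumerate DSL programs until one matches every I/O example."""
        for length in range(1, max_len + 1):
            for names in product(PRIMITIVES, repeat=length):
                if all(compose(names)(i) == o for i, o in examples):
                    return names  # first program consistent with the spec
        return None

    # The whole specification: pre-completed examples of desired behavior.
    examples = [("  hello world", "HEL"), (" foo", "FOO")]
    print(synthesize(examples))  # -> ('upper', 'strip', 'first3')

There is no natural-language description anywhere in that loop; the examples are the problem statement.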
Program synthesis from natural-language specifications has also existed for many years. It's not my specialty (nor am I particularly interested in it), but here's a paper from 2017 that I found with a quick search:
AlphaCode is not particularly good at it, either. In the arXiv preprint, besides the subjective and pretty meaningless "evaluation" against human coders, it's also tested on a formal program synthesis benchmark, the APPS dataset. The best-performing AlphaCode variant reported in the preprint solves 25% of the "introductory" APPS tasks (the least challenging ones). All AlphaCode variants tested solve less than 10% of the "interview" and "competition" (intermediate and advanced) tasks. These more objective results are not reported in the article above, I think for obvious reasons (because they are extremely poor).
So it's not doing anything radically new, and it's not doing it particularly well either. Please be better informed before propagating hype.
Edit: really, from a technical point of view, AlphaCode is a brute-force, generate-and-test approach to program synthesis that was state of the art 40 years ago. It's just a big generator that spams programs, hoping it will hit a good one. I have no idea who came up with this. Oriol Vinyals is the last author, and I've seen enough of that guy's work to know he knows better than to bet on such a primitive, even backwards, approach. I'm really shocked that this is DeepMind work.
I've also worked in the area and published research in it a couple of years ago. I almost worked for a company focused on neural program synthesis, but they pivoted hard a couple of years ago: they gave up on problems like this, having decided that current research wasn't good enough to do well on them, and moved to much simpler problems. I had a paper accepted about 3 years ago that translated between toy programming languages. Toy here meaning roughly the complexity of the simply typed lambda calculus; I wrote each language in a couple hundred lines.
This and Copilot are tackling a much harder level of problem than what was being attempted a couple of years ago.
I don't agree. I don't know your work, but the approach in AlphaCode and Copilot (or the Codex model behind it) is a step backwards for neural program synthesis, and for program synthesis in general. The idea is to train a large language model to generate code. The trained language model has no way to direct its generation towards code that satisfies a specification; it can only complete source code from some initial prompt. The code generated is remarkably grammatical (in the context of a programming language grammar), which is certainly an advance for language representation, but some kind of additional mechanism is required to ensure that the generated code is relevant to the specification. In Copilot, that mechanism is the user's eyeballs. In AlphaCode, the mechanism is testing against the few I/O examples given with each programming problem.

In either case, the whole thing is just hit-and-miss. The language model generates mostly garbage: DeepMind brags that AlphaCode generates "orders of magnitude" more code than previous work, but that is just to say that it's even more random, and its generation misses the target even more than previous work! Even filtering on I/O examples is not enough to control the excessive over-generation, so additional measures are needed (clustering and ranking of programs, etc.).
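Just to spell out what that pipeline amounts to, here's a minimal, hypothetical sketch of the generate-filter-cluster loop; sample_program() is a stand-in for the large language model, and every name here is invented for illustration rather than taken from DeepMind's code:

    import random
    from collections import defaultdict

    def sample_program(prompt, rng):
        """Stand-in for LM sampling conditioned on the problem statement.
        A real system would decode from a neural model; we draw from a
        canned pool just to keep the sketch runnable."""
        pool = [
            "def f(x): return x + 1",
            "def f(x): return x * 2",
            "def f(x): return x - 1",
        ]
        return rng.choice(pool)

    def passes_examples(src, examples):
        """The only 'direction' in the whole pipeline: run the candidate
        on the few public I/O examples and keep it only if all match."""
        scope = {}
        try:
            exec(src, scope)
            return all(scope["f"](i) == o for i, o in examples)
        except Exception:
            return False

    def generate_filter_cluster(prompt, examples, n_samples=10_000, seed=0):
        rng = random.Random(seed)
        candidates = (sample_program(prompt, rng) for _ in range(n_samples))
        survivors = [s for s in candidates if passes_examples(s, examples)]
        # Cluster survivors by behavior on extra probe inputs, so
        # near-duplicates don't eat the limited submission budget.
        clusters = defaultdict(list)
        for src in survivors:
            scope = {}
            exec(src, scope)
            signature = tuple(scope["f"](i) for i in (0, 1, 5))
            clusters[signature].append(src)
        # Rank clusters by size and submit one representative from each.
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [cluster[0] for cluster in ranked]

    print(generate_filter_cluster("double the input", [(2, 4), (3, 6)]))

Note that nothing upstream of passes_examples() knows anything about the specification; all the work is in spamming candidates and throwing most of them away.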
All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.
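Concretely, the old recipe is something like this minimal sketch, where a hand-written PCFG spams candidate expressions and a bolted-on verifier tests each one against the I/O spec. The grammar, weights, and spec are all made up for illustration:

    import random

    # A tiny PCFG over integer expressions in one variable x.
    GRAMMAR = {
        "E": [(0.4, ["T"]), (0.3, ["E", "+", "T"]), (0.3, ["E", "*", "T"])],
        "T": [(0.5, ["x"]), (0.5, ["1"])],
    }

    def sample(symbol, rng, depth=0):
        """Expand a nonterminal into a token list by weighted choice."""
        if symbol not in GRAMMAR:
            return [symbol]  # terminal token
        alternatives = GRAMMAR[symbol]
        if depth > 4:  # force the first, non-recursive rule to terminate
            alternatives = alternatives[:1]
        weights = [w for w, _ in alternatives]
        rhs = rng.choices([r for _, r in alternatives], weights=weights)[0]
        tokens = []
        for s in rhs:
            tokens += sample(s, rng, depth + 1)
        return tokens

    def verify(expr, spec):
        """The bolted-on verifier: run the candidate on each I/O pair."""
        return all(eval(expr, {"x": i}) == o for i, o in spec)

    def generate_and_test(spec, budget=100_000, seed=0):
        rng = random.Random(seed)
        for _ in range(budget):
            expr = " ".join(sample("E", rng))
            if verify(expr, spec):
                return expr
        return None

    # Spec for f(x) = 2x + 1, given as examples.
    print(generate_and_test([(0, 1), (1, 3), (5, 11)]))

Swap the PCFG for a Transformer and crank the budget up by a few orders of magnitude, and you have the shape of the AlphaCode pipeline.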
Honestly, this is just an archaic, regressive approach that can only work because of very big computers and very big datasets.
___________
[1] Which, btw, is straightforward to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary, because a hand-crafted solution is available and easier to control.
I agree the approach itself is quite brute-force heavy. There is a lot of information that could be used, and that I'd hope would be helpful, like the grammar of the language, traces/simulations of behavior, etc.
But when I say AlphaCode/Copilot is good, I'm referring solely to the difficulty of the problems they are attempting. There are many papers, including mine, that worked on simpler problems and used more structure to tackle them.
I expect follow-up work will incorporate other knowledge more heavily into the model. My work was mainly on restricting tree-like models to only make predictions that follow the grammar of the language. Does that parallelize/fit well with a Transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.
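For what it's worth, the kind of constraint I mean looks roughly like this hypothetical sketch: at each decoding step, the vocabulary is masked down to the tokens the grammar can accept next, so only syntactically valid programs can come out. The dummy scoring function stands in for a real neural decoder, and the token set and successor function are invented for illustration:

    import random

    TOKENS = ["x", "1", "+", "*", "<eos>"]

    def legal_next(prefix):
        """Hand-written successor function for flat infix expressions:
        operands and operators must alternate, and decoding may only
        stop after an operand."""
        if not prefix or prefix[-1] in ("+", "*"):
            return {"x", "1"}
        return {"+", "*", "<eos>"}

    def dummy_logits(prefix):
        """Stand-in for a neural decoder's scores over the vocabulary."""
        rng = random.Random(" ".join(prefix))
        return {t: rng.uniform(-1, 1) for t in TOKENS}

    def constrained_decode(max_len=7):
        prefix = []
        while len(prefix) < max_len:
            logits = dummy_logits(prefix)
            # The grammar constraint: argmax only over legal tokens.
            tok = max(legal_next(prefix), key=lambda t: logits[t])
            if tok == "<eos>":
                break
            prefix.append(tok)
        return " ".join(prefix)

    print(constrained_decode())  # always a well-formed expression

An unconstrained sampler can waste most of its budget on strings that don't even parse; masking makes every sample at least syntactically valid, which is exactly the kind of structure pure brute force throws away.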
Honestly, I am pretty surprised how far pure brute force with a large model is going. I would not have expected GPT-3-level language modeling from more scale on a Transformer and little else.
Well, I'm not surprised, because I know that large language models can learn smooth approximations of natural language. They can generate very grammatical natural English, so why not grammatical source code, which is easier? Of course, once you have a generator for code, finding a program that satisfies a specification is just a matter of searching (and of the generated code actually including such a program). But it seems that isn't really the case with AlphaCode, because its performance is very poor.
I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks, and against expecting a new approach to beat the state of the art before it can be taken seriously. But this is not a new approach, and that's the problem I have here: it's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).
I agree that future work must get smarter and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently: given that they can generate millions of programs, I'd have thought they'd be able to find more that at least approximate a solution.