
At the risk of sounding relentlessly skeptical - surely by training the model on GitHub data you're not actually creating an AI that solves problems, but an extremely obfuscated database of coding puzzle solutions?


We validated our performance using competitions hosted on Codeforces, a popular platform which hosts regular competitions that attract tens of thousands of participants from around the world who come to test their coding skills. We selected for evaluation 10 recent contests, each newer than our training data. AlphaCode placed at about the level of the median competitor, marking the first time an AI code generation system has reached a competitive level of performance in programming competitions.

[edit] Is "10 recent contests" a large enough sample size to prove whatever point is being made?


The test against human contestants doesn't tell us anything because we have no objective measure of the ability of those human coders (they're just the median in some unknown distribution of skill).

There are more objective measures of performance, like a good, old-fashioned benchmark dataset. For such an evaluation, see Table 10 in the arXiv preprint (page 21 of the PDF), which lists the results against the APPS dataset of programming tasks. The best-performing variant of AlphaCode solves 25% of the simplest ("introductory") APPS tasks and less than 10% of the intermediate ("interview") and more advanced ("competition") ones.

So it's not very good.

Note also that the article above doesn't report the results on APPS. Because they're not that good.


Does it need to solve original problems? Most of the code we write is dealing with the same problems in a slightly different context each time.

As others have said in the comments, it might be a case of meeting in the middle: we write some form of tests for the AI-produced code to pass (a rough sketch of that workflow is below).
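
To make that concrete, here is a minimal sketch of what such a workflow could look like in Python. The function name and the test cases are hypothetical examples chosen for illustration, not anything from the article or paper: the human writes the tests up front, and an AI-generated implementation is only accepted once it passes them.

    # Sketch of a "human writes tests, AI writes code" workflow.
    # `dedupe_keep_order` and its tests are hypothetical examples,
    # not taken from the AlphaCode paper.

    def dedupe_keep_order(items):
        # Stand-in for an AI-generated implementation; in practice the
        # generated code would be dropped in here for review.
        seen = set()
        result = []
        for item in items:
            if item not in seen:
                seen.add(item)
                result.append(item)
        return result

    def test_dedupe_keep_order():
        # Human-written acceptance tests the generated code must satisfy.
        assert dedupe_keep_order([]) == []
        assert dedupe_keep_order([1, 1, 2, 3, 2]) == [1, 2, 3]
        assert dedupe_keep_order(["a", "b", "a"]) == ["a", "b"]

    if __name__ == "__main__":
        test_dedupe_keep_order()
        print("all tests passed")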


That’s been a common objection to Copilot and other recent program synthesis papers.

The models regurgitate solutions to problems already encountered in the training set. This is very common with LeetCode problems and seems to still happen with harder competitive programming problems.

I think someone else in this thread even pointed out an example of AlphaCode doing the same thing.



