Artificial intelligence software programs are becoming shockingly adept at carrying on conversations, winning board games and generating artwork — but what about creating software programs? In a newly published paper, researchers at Google DeepMind say their AlphaCode program can keep up with the average human coder in standardized programming contests.
“This result marks the first time an artificial intelligence system has performed competitively in programming contests,” the researchers report in this week’s issue of the journal Science.
There’s no need to sound the alarm about Skynet just yet: DeepMind’s code-generating system earned an average ranking in the top 54.3% in simulated evaluations on recent programming competitions on the Codeforces platform — which is a very “average” average.
“Competitive programming is an extremely difficult challenge, and there’s a massive gap between where we are now (solving around 30% of problems in 10 submissions) and top programmers (solving >90% of problems in a single submission),” DeepMind research scientist Yujia Li, one of the Science paper’s principal authors, told GeekWire in an email. “The remaining problems are also significantly harder than the problems we’re currently solving.”
Nevertheless, the experiment points to a new frontier in AI applications. Microsoft is exploring similar territory with Copilot, a code-suggesting tool offered through GitHub, and Amazon offers a comparable tool called CodeWhisperer.
Oren Etzioni, the founding CEO of Seattle’s Allen Institute for Artificial Intelligence and technical director of the AI2 Incubator, told GeekWire that the newly published research highlights DeepMind’s status as a major player in the application of AI tools known as large language models, or LLMs.
“This is an impressive reminder that OpenAI and Microsoft don’t have a monopoly on the impressive feats of LLMs,” Etzioni said in an email. “Far from it, AlphaCode outperforms both GPT-3 and Microsoft’s Github Copilot.”
AlphaCode is arguably as notable for how it programs as it is for how well it programs. “What is perhaps most surprising about the system is what AlphaCode does not do: AlphaCode contains no explicit built-in knowledge about the structure of computer code. Instead, AlphaCode relies on a purely ‘data-driven’ approach to writing code, learning the structure of computer programs by simply observing lots of existing code,” J. Zico Kolter, a computer scientist at Carnegie Mellon University, wrote in a Science commentary on the study.
AlphaCode uses a large language model to build code in response to a natural-language description of a problem. The system is trained on a massive dataset of programming problems and solutions, plus a large corpus of unstructured code from GitHub. For a given problem, AlphaCode generates thousands of candidate solutions, filters out the candidates that fail the example test cases included in the problem statement, clusters the surviving programs by their behavior, and then selects a single example from each cluster to submit.
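To make that pipeline concrete, here’s a minimal, illustrative Python sketch of the filter-cluster-select stages. It is not DeepMind’s code: the helper names (run_candidate, filter_candidates, cluster_and_select) are invented for this example, the language-model sampling step is replaced by a handful of hand-written toy candidates, and the probe inputs used for clustering are fixed by hand, whereas AlphaCode generates test inputs with a separately trained model.

```python
import subprocess
import sys
from collections import defaultdict
from typing import List, Optional, Tuple


def run_candidate(source: str, stdin_text: str, timeout: float = 2.0) -> Optional[str]:
    """Execute a candidate Python program on the given input; return its stdout,
    or None if it crashes or times out."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None


def filter_candidates(candidates: List[str], example_tests: List[Tuple[str, str]]) -> List[str]:
    """Keep only candidates whose output matches the expected output on every
    example test case from the problem statement."""
    return [
        source
        for source in candidates
        if all(run_candidate(source, tin) == tout for tin, tout in example_tests)
    ]


def cluster_and_select(candidates: List[str], probe_inputs: List[str], k: int = 10) -> List[str]:
    """Group behaviorally identical candidates by their outputs on probe inputs,
    then pick one representative from each of the k largest clusters."""
    clusters = defaultdict(list)
    for source in candidates:
        signature = tuple(run_candidate(source, tin) for tin in probe_inputs)
        clusters[signature].append(source)
    largest_first = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in largest_first[:k]]


if __name__ == "__main__":
    # Toy problem: read an integer n from stdin and print n * 2.
    # Hand-written strings stand in for the thousands of model-sampled programs.
    candidates = [
        "n = int(input())\nprint(n * 2)",   # correct
        "n = int(input())\nprint(n + n)",   # correct, behaviorally identical
        "n = int(input())\nprint(n ** 2)",  # wrong answer
        "print(1 // 0)",                    # crashes
    ]
    example_tests = [("3\n", "6\n"), ("10\n", "20\n")]

    survivors = filter_candidates(candidates, example_tests)
    submissions = cluster_and_select(survivors, probe_inputs=["7\n", "0\n"])
    print(f"{len(survivors)} of {len(candidates)} candidates pass the examples;")
    print(f"submitting {len(submissions)} representative solution(s).")
```

The simplification preserves the key idea: cheap execution against the example tests weeds out broken programs, and behavioral clustering keeps a limited submission budget (such as the 10 submissions mentioned above) from being spent on near-duplicate programs.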
“It may seem surprising that this procedure has any chance of creating correct code,” Kolter said.
Kolter said AlphaCode’s approach could conceivably be integrated with more structured machine learning methods to improve the system’s performance.
“If ‘hybrid’ ML methods that combine data-driven learning with engineered knowledge can perform better on these tasks, let them try,” he wrote. “AlphaCode cast the die.”
Li told GeekWire that DeepMind is continuing to refine AlphaCode. “While AlphaCode is a significant step from ~0% to 30%, there’s still a lot of work to do,” he wrote in his email.
Etzioni agreed that “there is plenty of headroom” in the quest to create code-generating software. “I expect rapid iteration and improvements,” he said.
“We are merely 10 seconds from the generative AI ‘big bang.’ Many more impressive products on a wider variety of data, both textual and structured, are coming soon,” Etzioni said. “We are feverishly trying to figure out how far this technology goes.”
As the work proceeds, AlphaCode could stir up the long-running debate over the promise and potential perils of AI, just as DeepMind’s AlphaGo program did when it demonstrated machine-based mastery over the ancient game of Go. And programming isn’t the only field where AI’s rapid advance is causing controversy.
When we asked Li whether DeepMind had any qualms about what it was creating, he provided a thoughtful answer:
“AI has the potential to help with humanity’s greatest challenges, but it must be built responsibly and safely, and be used for the benefit of everyone. Whether or not it’s beneficial or harmful to us and society depends on how we deploy it, how we use it, and what sorts of things we decide to use it for.
“At DeepMind, we take a thoughtful approach to the development of AI — inviting scrutiny of our work and not releasing technology before considering consequences and mitigating risks. Guided by our values, our culture of pioneering responsibly is centered around responsible governance, responsible research, and responsible impact (you can see our Operating Principles here).”
Update for 1 p.m. PT Dec. 8: Sam Skjonsberg — a principal engineer at the Allen Institute for Artificial Intelligence who leads the team that builds Beaker, AI2’s internal AI experimentation platform — weighed in with his observations about AlphaCode:
“The application of LLMs to code synthesis is not surprising. The generalizability of these large-scale models is becoming widely apparent, with efforts like DALL-E, OpenAI Codex, Unified-IO and, of course, ChatGPT.
“One thing that’s interesting about AlphaCode is the post-processing step to filter the solution space, so as to rule out those that are obviously incorrect or crash. This helps emphasize an important point – that these models are most effective when they augment our abilities, rather than try to replace them.
“I’d love to see how AlphaCode compares to ChatGPT as a source of coding suggestions. The competitive coding exercise that AlphaCode was evaluated against is an objective measure of performance, but it says nothing about the intelligibility of the resulting code. I’ve been impressed with the solutions produced by ChatGPT. They often contain small errors and bugs, but the code is readable and easy to modify. That’s not an easy thing to evaluate, but a really important aspect of these models that we’ll need to find a way to measure.
“On a separate note, I applaud Google and the research team behind AlphaCode for releasing the paper’s dataset and energy requirements publicly. ChatGPT should follow suit. These LLMs already tilt the scales towards large organizations, due to the significant cost of training and operating them. Open publishing helps offset that, encouraging scientific collaboration and further evaluation – which is important for both progress and equity.”
In addition to Li, the principal authors of the research paper in Science, “Competition-level Code Generation With AlphaCode,” include DeepMind’s David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy and Cyprien de Masson d’Autume. Thirteen other researchers are listed as co-authors. A preprint version of the paper and supplemental materials is available via arXiv.