AI Reaches Silver-Medal Level at This Year’s Math Olympiad

During the 2024 International Mathematical Olympiad, Google DeepMind debuted an AI program that can generate complex mathematical proofs

Gold, silver and bronze medals. (estherpoon/Getty Images)

While Paris was preparing to host the 33rd Olympic Games, more than 600 students from nearly 110 countries came together in the idyllic English town of Bath in July for the International Mathematical Olympiad (IMO). They had two sessions of four and a half hours each to answer six problems from various mathematical disciplines. Chinese student Haojia Shi took first place in the individual rankings with a perfect score. In the rankings by country, the team from the U.S. came out on top. The most noteworthy results at the event, however, were those achieved by two machines from Google DeepMind that entered the competition. DeepMind’s artificial intelligence programs were able to solve a total of four out of six problems, which would correspond to the level of a silver medalist. The two programs scored 28 out of a possible 42 points. Only around 60 students scored better, wrote mathematician and Fields Medalist Timothy Gowers, a previous gold medalist in the competition, in a thread on X (formerly Twitter).

To achieve this impressive result, the DeepMind team used two different AI programs: AlphaProof and AlphaGeometry 2. The former works in a similar way to the algorithms that mastered chess, shogi and Go. Using what is called reinforcement learning, AlphaProof repeatedly competes against itself and improves step-by-step. This method can be implemented quite easily for board games. The AI executes several moves; if these do not lead to a win, it is penalized and learns to pursue other strategies.
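The reward-and-penalty loop described above can be caricatured in a few lines of code. The sketch below is purely illustrative and is not DeepMind's method: a tabular agent plays the toy game of Nim against itself, reinforcing the moves of whichever side wins and penalizing the loser's.

```python
import random

# Toy self-play reinforcement learning on Nim: 5 stones, take 1 or 2 per
# turn, whoever takes the last stone wins. Moves made by the winner are
# rewarded; moves made by the loser are penalized.

random.seed(0)

# Q maps (stones remaining, stones taken) to a learned value.
Q = {(s, a): 0.0 for s in range(1, 6) for a in (1, 2) if a <= s}

def choose(stones, eps):
    moves = [a for a in (1, 2) if a <= stones]
    if random.random() < eps:
        return random.choice(moves)      # occasionally explore
    return max(moves, key=lambda a: Q[(stones, a)])

def train(episodes=20000, eps=0.2, lr=0.1):
    for _ in range(episodes):
        stones, history, player = 5, [], 0
        while stones > 0:
            a = choose(stones, eps)
            history.append((player, stones, a))
            stones -= a
            player = 1 - player
        winner = 1 - player              # the side that took the last stone
        for p, s, a in history:
            reward = 1.0 if p == winner else -1.0
            Q[(s, a)] += lr * (reward - Q[(s, a)])

train()
# After training, the agent prefers taking 2 from 5, leaving the
# opponent in the losing position of 3 stones.
```

The same step-by-step improvement through self-play is what made the board-game systems work; the hard part, as the next paragraphs explain, is producing an equally reliable win/loss signal for mathematical reasoning.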

To do the same for mathematical problems, however, a program must be able not only to check that it has solved the problem but also to verify that the reasoning steps it took to arrive at the solution were correct. To accomplish this, AlphaProof uses so-called proof assistants—algorithms that go through a logical argument step-by-step to check whether answers to the problems posed are correct. Although proof assistants have been around for several decades, their use in machine learning has been constrained by the very limited amount of mathematical data available in a formal language, such as Lean, that the computer can understand.
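For a sense of what such formal statements look like, here is a trivially small example in Lean (not one of the Olympiad problems). Lean accepts the proof only if every step is justified; an unjustified step is rejected outright.

```lean
-- A toy theorem and its machine-checked proof in Lean 4.
-- The proof appeals to a library lemma; Lean verifies that it applies.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

It is this mechanical, all-or-nothing checking that makes a proof assistant usable as the "referee" in a reinforcement-learning loop.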


Solutions to math problems that are written in natural language, on the other hand, are available in abundance. There are numerous problems on the Internet that humans have solved step-by-step. The DeepMind team therefore trained a large language model called Gemini to translate a million such problems into the Lean programming language so that the proof assistant could use them to train. “When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean,” the developers wrote on DeepMind’s website. By doing so, AlphaProof gradually learns which proof steps are useful and which are not, enhancing its ability to solve more complex problems.
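The "generate candidates, then verify" loop the developers describe can be sketched schematically, with a trusted checker standing in for Lean. The problem and function names below are illustrative assumptions, not DeepMind's code: candidates are proposed freely, but only those the checker accepts survive.

```python
# Schematic generate-and-verify search: propose candidate answers,
# keep only those a trusted checker accepts. Here the "problem" is
# finding integer roots of x^3 - 6x^2 + 11x - 6, and verify() stands
# in for the Lean proof checker.

def verify(x):
    # Trusted check: is x actually a root?
    return x**3 - 6 * x**2 + 11 * x - 6 == 0

def search(candidates):
    proved = []
    for x in candidates:
        if verify(x):            # only machine-checked answers survive
            proved.append(x)
    return proved

roots = search(range(-10, 11))
# roots == [1, 2, 3]
```

In AlphaProof the candidates are proof steps rather than numbers, and the search is guided by what the model has learned, but the division of labor is the same: the neural network proposes, the formal checker disposes.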

Geometry problems, which also appear in the IMO, usually require a completely different approach. Back in January DeepMind presented an AI called AlphaGeometry that can successfully solve such problems. To do this, the experts first generated a large set of geometric “premises,” or starting points: for example, a triangle with its altitudes drawn in and points marked along the sides. The researchers then used what is called a “deduction engine” to infer further properties of the triangle, such as which angles coincide and which lines are perpendicular to each other. By combining these diagrams with the derived properties, the experts created a training dataset consisting of theorems and corresponding proofs. This procedure was coupled with a large language model that sometimes also uses what are known as auxiliary constructions; the model might add another point to a triangle to make it a quadrilateral, which can assist in solving a problem. The DeepMind team has now come out with an improved version, called AlphaGeometry 2, by training the model with even more data and speeding up the algorithm.
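A deduction engine of this kind repeatedly applies inference rules to known facts until nothing new follows. The minimal forward-chaining sketch below uses toy rules about parallel lines, not AlphaGeometry's actual rule set, to show the idea.

```python
# Toy forward-chaining deduction: derive new geometric facts from known
# ones until a fixed point is reached. Facts are tuples; rules are
# functions mapping a fact set to newly derivable facts.

def transitivity(facts):
    # If a is parallel to b and b is parallel to c, then a is parallel to c.
    new = set()
    for (r1, a, b) in facts:
        for (r2, b2, c) in facts:
            if r1 == r2 == "parallel" and b == b2 and a != c:
                new.add(("parallel", a, c))
    return new

def symmetry(facts):
    # Parallelism is symmetric: if a ∥ b then b ∥ a.
    return {(r, b, a) for (r, a, b) in facts if r == "parallel"}

def deduce(facts, rules):
    facts = set(facts)
    while True:
        derived = set().union(*(rule(facts) for rule in rules)) - facts
        if not derived:          # fixed point: nothing new follows
            return facts
        facts |= derived

known = {("parallel", "AB", "CD"), ("parallel", "CD", "EF")}
all_facts = deduce(known, [transitivity, symmetry])
# ("parallel", "AB", "EF") is now among the derived facts
```

Run at scale over randomly generated diagrams, this kind of closure produces the theorem-and-proof pairs that served as training data.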

To test their programs, the DeepMind researchers had the two AI systems compete in this year’s Math Olympiad. The team first had to manually translate the problems into Lean. AlphaGeometry 2 managed to solve the geometry problem correctly in just 19 seconds. AlphaProof, meanwhile, was able to solve one number theory and two algebra problems, including one that only five of the human contestants were able to crack. The AI failed to solve the combinatorial problems, however, which might be because these problems are very difficult to translate into programming languages such as Lean.

AlphaProof’s performance was slow, requiring more than 60 hours to complete some of the problems—significantly longer than the total nine hours the students were allotted. “If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher,” Gowers wrote on X. “Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.”

Gowers and mathematician Joseph K. Myers, another previous gold medalist, evaluated the solutions of the two AI systems using the same criteria that were applied to the human participants. According to these standards, the programs scored an impressive 28 points, which corresponds to a silver medal. This meant that the AI only narrowly missed out on reaching a gold-medal level of performance, which was awarded for a score of 29 points or more.

On X, Gowers emphasized that the AI programs have been trained with a fairly wide range of problems and that these methods are not limited to Mathematical Olympiads. “We might be close to having a program that would enable mathematicians to get answers to a wide range of questions,” he explained. “Are we close to the point where mathematicians are redundant? It’s hard to say.”

This article originally appeared in Spektrum der Wissenschaft and was reproduced with permission.
