If artificial intelligence is going to revolutionize the way science is done, as many of the frontier AI laboratories hope, it needs to master board games first. That’s the lesson from a recent study of AI models’ decision-making skills, tested with the game Battleship. The goal was to find ways for models to be more careful with limited resources: “cheap interventions” for information seeking, as research scientist Valerio Pepe puts it.
Science requires lots of decisions—researchers must choose which hypotheses to pursue and which simulations to run. These choices determine which path to follow when experimental resources are limited. “You can get only so much data because getting data is either expensive or time-consuming,” says Pepe, who led work on the project before joining OpenAI. In April, Pepe and his colleagues presented their findings at the International Conference on Learning Representations, an annual meeting dedicated to deep learning research.
The researchers designed a collaborative version of Battleship that could be played by humans or AI. In the game, one team member generated questions about the map of ships’ locations while another answered them, in a combined effort to pinpoint where the vessels were hidden and sink them. By counting how many rounds it took to sink all the ships, the researchers could test how large language models (LLMs) performed compared with other LLMs and with the 42 human players the group had enlisted. Initially, humans consistently won in fewer moves than Llama-4-Scout, Meta’s efficiency-focused AI model. OpenAI’s premier reasoning model, GPT-5, performed better than both.
The scientists were inspired by Bayesian experimental design, in which researchers guide decision-making by estimating the likelihoods of events given prior assumptions. They optimized their models to ask questions that maximized both the accuracy of their guesses and the expected information gained with each question, and to look ahead one turn when deciding which move to make. The scientists also found that accuracy increased when the players communicated with snippets of code rather than natural language. Through this process, the group led Llama-4-Scout to win in fewer moves than GPT-5 two thirds of the time, at about one hundredth of the cost. On average, it also won in seven fewer moves than the human players.
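The study's actual systems couple LLM question generation with these Bayesian estimates, but the core idea of picking the question with the highest expected information gain can be illustrated in a toy setting. The sketch below (all names and the tiny one-dimensional board are hypothetical, not from the paper) maintains a uniform prior over possible ship placements and scores each yes/no question by the binary entropy of its answer distribution:

```python
import math

# Toy 1-D Battleship: one ship of length 2 hidden on a 1x5 board.
# Hypotheses are the possible ship placements (start cells 0..3).
hypotheses = [{s, s + 1} for s in range(4)]

def answer(cell, ship):
    """Yes/no answer: does the ship occupy this cell?"""
    return cell in ship

def expected_info_gain(cell, hyps):
    """Expected information gained by asking about one cell.

    Under a uniform prior over hypotheses, this is the entropy of
    the yes/no answer distribution: questions whose answers split
    the hypothesis set evenly are worth a full bit."""
    p_yes = sum(answer(cell, h) for h in hyps) / len(hyps)
    gain = 0.0
    for p in (p_yes, 1 - p_yes):
        if p > 0:
            gain -= p * math.log2(p)
    return gain

# Greedy one-step lookahead: ask about the most informative cell first.
best_cell = max(range(5), key=lambda c: expected_info_gain(c, hypotheses))
```

In this toy board, asking about an interior cell splits the four placements in half (a full bit of information), while an edge cell yields a lopsided split worth less; a greedy player therefore probes the interior first. The same scoring extends to richer natural-language or code-based questions, with answers partitioning the hypothesis set into more than two groups.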
Battleship is much simpler than many problems in science—chemical and biological samples, for instance, can’t be interpreted as clearly as Battleship boards. But Pepe says the methods AI used in the game will probably also be applicable to scientific decision-making.
“The framework will be very useful to measure whether language models are really making progress” in deciding which hypotheses to pursue among all possibilities, says Yuanqi Du, a researcher focused on AI for chemistry who recently completed his Ph.D. at Cornell University and was not involved in the study. “Understanding the whole hypothesis space you’re searching, that’s the hardest part.”