In 2019 Harish Natarajan took part in a debate with a five-and-a-half-foot tall rectangular computer screen in front of a live audience of about 800 people. The computer was Project Debater, an artificial intelligence system designed by IBM. Natarajan is a globally recognized debate champion. And the topic at hand was whether or not preschool should be subsidized.

Based on an audience vote, Project Debater lost the contest. But the machine held its own, forming logical opening statements. And in 2018 Project Debater won one debate and almost tied in another. Still, the system can sound awkward during an argument-and-rebuttal exchange with an opponent.

While computers will not be ambling to a political podium any time soon, a study published today in Nature suggests that the system is inching closer to engaging in the type of complex human interaction represented by formal argumentation.

The researchers observe that the requirements of a debate are outside the “comfort zone” for AIs, which have triumphed in a range of board and video games—not to mention a famous quiz show. In recent decades, startling advances have been registered in AI. In 1997 IBM’s Deep Blue became the first computer to defeat a reigning chess champion, besting titan Garry Kasparov in a six-game match. Fourteen years later IBM’s Watson defeated Jeopardy! all-stars Brad Rutter and Ken Jennings at their own game.

But much of competitive computer intelligence has been tested on tasks or games with a clear winner and loser. Such contests are amenable to coding that leads to a defined, binary algorithmic path to victory. What has eluded computer scientists is a system that can interact with the nuance that complex discourse with human beings requires. Project Debater is getting close to this goal.

In the new Nature paper, IBM researchers (a collaborative team at the company’s AI research centers in Haifa, Israel, and Dublin, Ireland) report on their system’s progress. Following the 2019 debate, speeches by Project Debater and by three expert human debaters on nearly 80 different topics were evaluated by 15 members of a virtual audience.

In these human-against-machine contests, neither side is allowed access to the Internet. Instead each is given 15 minutes to “collect their thoughts,” as Christopher P. Sciacca, manager of communications for IBM Research’s global labs, puts it. This means the human debater can take a moment to jot down ideas about the topic at hand, such as subsidized preschool, while Project Debater combs through millions of previously stored newspaper articles and Wikipedia entries, analyzing specific sentences as well as commonalities and disagreements on particular topics. Following the prep time, both sides alternately deliver four-minute speeches, and then each gives a two-minute closing statement.

Based on audience and reader scoring, Project Debater managed to “win” in 2018 against one of the three experts, and it scored impressively high in making opening statements. But on average, it was still slightly inferior to the humans overall. The hurdle is maintaining a meaningful exchange that can take any number of directions, similar to a real human conversation. Still, the study results move the needle in developing an AI system that can understand and produce meaningful linguistic interaction.

“In recent years there’s been a tremendous amount of work in developing algorithms that can understand and generate human language,” says Noam Slonim, a distinguished engineer at IBM Research and principal investigator of Project Debater since its inception. “The tasks being pursued span from predicting the sentiment of a single sentence to more complex tasks such as machine translation and dialogue systems.” He adds that IBM’s results reflect a system that, while still coming in second place to a Homo sapiens “rival,” can engage with an opponent in a way that, until now, was out of reach with other AI systems. Plenty of such systems can generate what seems to be meaningful language with actual syntax. But a big question for the field is whether or not machines will ever be able to emulate actual human reasoning or become conscious.

“On stage, Project Debater is far from perfect, and its missteps reveal just how difficult—and how definingly human—argumentation and debate are,” says computer scientist Chris Reed of the University of Dundee in Scotland, who was not involved with the research but was present in the audience at the 2019 debate. “[Yet] the Project Debater research is a tour de force of innovative engineering.... The scale of the achievement of the IBM team is also clear from the live performance of the system: not only using knowledge extracted from very large data sets but also responding on the fly to human discourse.”

Natarajan and other debaters are not yet ready to concede defeat to “machine overlords.” But for better or worse—one hopes for the better—machine learning is starting to enter a realm beyond the defined rules of chess and Go.