God moves the player,
he in turn, the piece.
But what god beyond
God begins the round
of dust and time and
sleep and agonies?
—Jorge Luis Borges, from “Chess,” 1960
The victory in March of the computer program AlphaGo over one of the world's top handful of go players marks the highest accomplishment to date for the burgeoning field of machine learning and intelligence. The computer beat Lee Se-dol at go, a very old and traditional board game, in a highly publicized match in Seoul, winning 4–1. With this defeat, computers have bettered people in the last of the classical board games, this one known for its depth and simplicity. An era is over, and a new one has begun. The methods underlying AlphaGo, and its recent victory, have startling implications for the future of machine intelligence.
Coming Out of Nowhere
The ascent of AlphaGo to the top of the go world has been stunning and quite distinct from the trajectory of machines playing chess. Over a period of more than a decade a dedicated team of hardware and software engineers hired by IBM built and programmed a special-purpose supercomputer named Deep Blue that did one thing and one thing only: play chess by evaluating 200 million board positions per second. In a widely expected development, the IBM team challenged the then reigning world chess champion, Garry Kasparov. In a six-game match played in 1996, Kasparov prevailed against Deep Blue by three wins, two draws and one loss but lost a year later in a historic rematch 3.5 to 2.5. (Scoring rules permit half points in the case of a draw.)
Chess is a classic game of strategy, similar to tic-tac-toe (noughts and crosses), checkers (draughts), Reversi (Othello), backgammon and go, in which players take turns placing or moving pieces. Unlike card games such as poker, in which each player sees only his or her own hand, these games give both players full access to all relevant information about the state of play; in chess and go, moreover, chance plays no role.
The rules of go are considerably simpler than those of chess. The Black and White sides each draw from a bowl of stones of their own color, and each places one stone in turn on a 19-by-19 grid. Once placed, stones do not move. The intent of the game, originating in China more than 2,500 years ago, is to completely surround the opponent's stones. Such encircled stones are considered captured and are removed from the board. Out of this sheer simplicity, great beauty arises—complex battles between Black and White armies that span from the corners to the center of the board.
Strictly logical games, such as chess and go, can be characterized by how many possible positions can arise—a measure that defines their complexity. Depending on the phase of the game, players must pick one out of a small number of possible moves. A typical chess game may have 10^120 possible moves, a huge number, considering there are only about 10^80 atoms in the entire observable universe of galaxies, stars, planets, dogs, trees, people. But go's complexity is much bigger—at 10^360 possible moves. This is a number beyond imagination and renders any thought of exhaustively evaluating all possible moves utterly unrealistic.
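The arithmetic behind such estimates is easy to reproduce. The sketch below uses commonly cited rough branching factors and game lengths (assumed figures, not taken from this article) and lands near the same orders of magnitude:

```python
# Back-of-the-envelope game-tree sizes, computed as branching_factor raised
# to a typical game length. The branching factors and lengths below are
# commonly cited rough figures, assumptions rather than exact values.
chess_tree = 35 ** 80     # ~35 legal moves per position, ~80 half-moves per game
go_tree = 250 ** 150      # ~250 legal moves per position, ~150 moves per game

def exponent(n):
    """Order of magnitude: the power of 10 just below n."""
    return len(str(n)) - 1

print(f"chess game tree ~ 10^{exponent(chess_tree)}")
print(f"go game tree    ~ 10^{exponent(go_tree)}")
```

Small changes to the assumed branching factor swing the exponent by tens of orders of magnitude, which is why published figures for these numbers vary; the conclusion that exhaustive search is hopeless does not.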
Given this virtually illimitable complexity, go is, much more than chess, about recognizing patterns that arise when clutches of stones surround empty spaces. Players perceive, consciously or not, relationships among groups of stones and talk about such seemingly fuzzy concepts as “light” and “heavy” shapes of stones and aji, meaning latent possibilities. Such concepts, however, are much harder to capture algorithmically than the formal rules of the game. Accordingly, computer go programs struggled compared with their chess counterparts, and none had ever beaten a professional human under regular tournament conditions. Such an event was prognosticated to be at least a decade away.
And then AlphaGo burst into public consciousness via an article in one of the world's most respected science magazines, Nature, on January 28 of this year. Its software was developed by a 20-person team under erstwhile chess child prodigy and neuroscientist turned AI pioneer Demis Hassabis. (His London-based DeepMind Technologies was acquired in 2014 by Google.) Most intriguingly, the Nature article revealed that AlphaGo had played against the winner of the European go championship, Fan Hui, in October 2015 and won 5 to 0 without handicapping the human player, an unheard-of event. What is noteworthy is that AlphaGo's algorithms do not contain any genuinely novel insights or breakthroughs. The software combines good old-fashioned neural network algorithms and machine-learning techniques with superb software engineering running on powerful but fairly standard hardware—48 central processing units (CPUs) augmented by eight graphics processing units (GPUs) developed to render 3-D graphics for the gaming community and exquisitely suited to running certain mathematical operations.
At the heart of the computations are neural networks, distant descendants of neuronal circuits operating in biological brains. Multiple layers of artificial neurons process the input—the positions of stones on the 19-by-19 go board—and derive increasingly abstract representations of various aspects of the game using something called convolutional networks. This same technology has made possible recent breakout performances in automatic image recognition—labeling, for example, all images posted to Facebook.
For any particular board position, two neural networks operate in tandem to optimize performance. A “policy network” reduces the breadth of the game by limiting the number of moves for a particular board position. It does so by learning to choose a small range of good moves for that position. A “value network” then estimates how likely a given board position is to lead to a win without chasing down every node of the search tree. The policy network generates possible moves that the value network then judges on their likelihood to vanquish the opponent. These are processed using a technique called a Monte Carlo tree search, which can lead to optimal behavior even if only a tiny fraction of the complete game tree is explored.
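How a policy prior and a value estimate plug into Monte Carlo tree search can be sketched in miniature. Here `policy` and `value` are hypothetical stand-ins (a uniform prior over three moves, a random win estimate); in AlphaGo both are deep networks, moves actually change the board state, and the search machinery is far more elaborate:

```python
import math
import random

random.seed(0)  # fixed seed so repeated runs behave identically

def policy(state):
    """Stand-in policy network: prior probabilities over candidate moves."""
    return {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

def value(state):
    """Stand-in value network: estimated probability this position wins."""
    return random.random()

class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visits = 0
        self.total_value = 0.0
        self.children = {}

def select(node, c_puct=1.0):
    """Pick the child maximizing value estimate plus an exploration bonus."""
    def score(child):
        q = child.total_value / child.visits if child.visits else 0.0
        u = c_puct * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def simulate(root, state, depth=3):
    """One playout: expand leaves with the policy, back up the value estimate."""
    path, node = [root], root
    for _ in range(depth):
        if not node.children:   # expansion: let the policy propose moves
            for move, p in policy(state).items():
                node.children[move] = Node(p)
        move, node = select(node)
        path.append(node)
    v = value(state)            # the value net replaces a rollout to game's end
    for n in path:
        n.visits += 1
        n.total_value += v

root = Node(prior=1.0)
for _ in range(100):
    simulate(root, state=None)

best_move = max(root.children, key=lambda m: root.children[m].visits)
```

After the playouts, the move with the most visits at the root is played: visit counts, not raw value estimates, are the search's final verdict, because they fold in both promise and how thoroughly each option was examined.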
A Monte Carlo tree search by itself was not good enough for these programs to compete at the world-class level. That required giving AlphaGo the ability to learn, initially by exposing it to previously played games of professional go players and subsequently by enabling the program to play millions of games against itself, continuously improving its performance in the process.
In the first stage, a 13-layer policy neural network started as a blank slate—with no prior exposure to go. It was then trained on 30 million board positions from 160,000 real-life games taken from a go database. That number represents far more games than any professional player would encounter in a lifetime. Each board position was paired with the actual move chosen by the player (which is why this technique is called supervised learning), and the connections among the simulated neurons in the network were adjusted using so-called deep-machine-learning techniques to make the network more likely to pick the better move the next time. The network was then tested by giving it a board position from a game it had never previously seen. It accurately, though far from perfectly, predicted the move that the professional player had picked.
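The supervised step can be illustrated with a toy. Below, a linear softmax "policy" learns to imitate (position, expert move) pairs via cross-entropy gradient updates; the four-feature "board" and the invented expert are illustrative assumptions only, standing in for AlphaGo's 13-layer convolutional network and its database of real games:

```python
import math
import random

random.seed(0)
N_MOVES = 4
# One weight per (feature, move) pair; a deep network replaces this table.
weights = [[0.0] * N_MOVES for _ in range(N_MOVES)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probs(position):
    """Move probabilities for a position given as a feature vector."""
    logits = [sum(position[i] * weights[i][a] for i in range(N_MOVES))
              for a in range(N_MOVES)]
    return softmax(logits)

def supervised_step(position, expert_move, lr=0.5):
    """One cross-entropy gradient step raising p(expert's move | position)."""
    p = probs(position)
    for a in range(N_MOVES):
        grad = (1.0 if a == expert_move else 0.0) - p[a]
        for i in range(N_MOVES):
            weights[i][a] += lr * position[i] * grad

# The invented "expert" always plays the move matching the one feature
# that is switched on in the position.
dataset = [([1 if i == k else 0 for i in range(N_MOVES)], k)
           for k in range(N_MOVES)]
for _ in range(200):
    position, move = random.choice(dataset)
    supervised_step(position, move)
```

After 200 imitation steps the toy policy assigns most of its probability mass to the expert's move in every position it was trained on, which is exactly the behavior the real supervised stage instills at vastly greater scale.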
In a second stage, the policy network trained itself using reinforcement learning. This technique is a lasting legacy of behaviorism—a school of thought dominant in psychology and biology in the first half of the 20th century. It professes the idea that organisms—from worms, flies and sea slugs to rats and people—learn by relating a particular action to the stimuli that preceded it, repeating actions that bring rewards and abandoning those that do not. Done over and over again, this builds up an association between stimulus and response, one that can form by rote, without conscious thought.
Reinforcement learning was implemented years ago in neural networks to mimic animal behavior and to train robots. DeepMind demonstrated this last year with a vengeance when networks were taught how to play 49 different Atari 2600 video games, including Video Pinball, Stargunner, Robot Tank, Road Runner, Pong, Space Invaders, Ms. Pac-Man, Alien and Montezuma's Revenge. (It was a sign of things to come: atari is a Japanese go term, signifying the imminent capture of one or more stones.)
Each time it played, the DeepMind network “saw” the same video-game screen, including the current score, that any human player would see. The network's output was a command to the joystick to move the cursor on the screen. Following the programmers' single diktat to maximize the game score, the algorithm figured out the rules of the game over thousands and thousands of trials. It learned to move, to hit alien ships and to avoid being destroyed by them. And for some games, it achieved superhuman performance. The same powerful reinforcement-learning algorithm was deployed by AlphaGo, starting from the configuration of the policy networks after the supervised-learning step.
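The trial-and-error principle behind all of this can be shown with a far simpler learner. The sketch below uses epsilon-greedy action-value estimates on an invented three-action "game" whose payoff probabilities are made up for illustration; the Atari and AlphaGo systems pursue the same maximize-the-score idea through deep networks rather than a lookup table:

```python
import random

random.seed(0)  # fixed seed so repeated runs behave identically

# Hypothetical probability that each "joystick" action scores a point.
REWARD_PROB = {"left": 0.2, "stay": 0.5, "right": 0.8}

q = {a: 0.0 for a in REWARD_PROB}       # running estimate of each action's value
counts = {a: 0 for a in REWARD_PROB}

for trial in range(2000):
    if random.random() < 0.1:                    # explore 10% of the time
        action = random.choice(list(REWARD_PROB))
    else:                                        # otherwise exploit best estimate
        action = max(q, key=q.get)
    reward = 1.0 if random.random() < REWARD_PROB[action] else 0.0
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]   # incremental average

best = max(q, key=q.get)
```

Nothing tells the learner which action is good; it discovers the highest-scoring one purely by acting, observing the reward and updating its estimates, which is the essence of reinforcement learning.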
In a third and final stage of training, the value network that estimates how likely a given board position is to lead to a win is trained using 30 million self-generated positions that the policy network chose. It is this feature of self-play, impossible for humans to replicate (because it would require the player's mind to split itself into two independent “minds”), that enables the algorithm to relentlessly improve.
A peculiarity of AlphaGo is that it will pick a strategy that maximizes the probability of winning regardless of the margin of victory. For example, AlphaGo would prefer to win with 90 percent probability by two stones rather than with 85 percent probability by 50 stones. Few human players would give up a slightly riskier chance to crush their opponent in favor of eking out a narrow but surer victory.
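In code, this criterion is a one-liner: candidates are ranked purely by estimated win probability, never by margin. The two candidate moves and their numbers here are invented for illustration:

```python
# AlphaGo's choice criterion in miniature: rank candidate moves by estimated
# win probability alone, ignoring the expected margin of victory.
# Both candidates and all numbers below are made up for illustration.
candidates = [
    {"move": "solid", "win_prob": 0.90, "margin": 2},
    {"move": "aggressive", "win_prob": 0.85, "margin": 50},
]
choice = max(candidates, key=lambda c: c["win_prob"])
print(choice["move"])   # prints "solid": the surer two-stone win is preferred
```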
The end result is a program that performed better than any competitor and beat the go master Fan. Fan, however, is not among the top 300 world players, and among the upper echelons of players, differences in ability are so pronounced that even a lifetime of training would not enable Fan to beat somebody like Lee. Thus, based on the five publicly available games between AlphaGo and Fan, Lee confidently predicted that he would dominate AlphaGo, winning five games to nothing or, perhaps on a bad day, four games to one. What he did not reckon with was that the program he was facing in Seoul was a vastly improved version of the one Fan had encountered six months earlier, optimized by relentless self-play.
Deep Blue beating Kasparov represented a triumph of machine brawn over a single human brain. Its success was predicated on very fast processors, built for this purpose. Although its victory over Kasparov was unprecedented, the triumph did not lead to any practical application or to any spin-off. Indeed, IBM retired the machine soon thereafter.
The same situation is not likely to occur for AlphaGo. The program runs on off-the-shelf processors. Giving it access to more computational power (by distributing it over a network of 1,200 CPUs and GPUs) only improved its performance marginally. The feature that makes the difference is AlphaGo's ability to split itself into two, playing against itself and continuously improving its overall performance. At this point it is not clear whether there is any limitation to how much AlphaGo can improve. (If only the same could be said of our old-fashioned brains.) It may be that this constitutes the beating heart of any intelligent system, the holy grail that researchers are pursuing—general artificial intelligence, rivaling human intelligence in its power and flexibility.
Most likely Hassabis's DeepMind team will contemplate designing more powerful programs, such as versions that can teach themselves go from scratch, without having to rely on the corpus of human games as examples, versions that learn chess, programs that simultaneously play checkers, chess and go at the world-class level, or ones that can tackle no-limit Texas hold'em poker or similar games of chance.
In a very commendable move, Hassabis and his colleagues described in exhaustive detail in their Nature article the algorithms and parameter settings used to generate AlphaGo. Their explanation of what was accomplished further accelerates the frenetic pace of AI research in academic and industrial laboratories around the globe. These types of reinforcement algorithms based on trial-and-error learning can be applied to myriad problems for which sufficient data exist, be they in financial markets, medical diagnostics, robotics or warfare. A new era has begun with unknown but potentially monumental medium- and long-term consequences for employment patterns, population-wide surveillance, and growing political and economic inequity.
What of the effects of AlphaGo on the ancient game of go itself? Despite doomsayers, the rise of ubiquitous chess programs has revitalized chess, helping to train a generation of ever more powerful players. The same may well happen in the go community. After all, the fact that any car or motorcycle can speed faster than any runner did not eliminate running for fun. More people run marathons than ever.
Indeed, it could be argued that by removing the need to continually prove oneself to be the best, more humans may now enjoy the nature of this supremely aesthetic and intellectual game in its austere splendor for its own sake. In ancient China one of the four arts any cultivated scholar and gentleman was expected to master was the game of go. Just as a meaningful life must be lived and justified for its own intrinsic reasons, so should go be played for its intrinsic value—for the joy it gives.
Editor's Note: This article was adapted from the article "How the Computer Beat the Go Master."