AlphaGo Zero needed three days to train up in Go; AlphaZero needed just eight hours.
We last heard from DeepMind’s dominant gaming AI in October. As opposed to earlier sessions of AlphaGo besting the world’s best Go players after the DeepMind team trained it on observations of said humans, the company’s Go-playing AI (version AlphaGo Zero) started beating pros after three days of playing against itself with no prior knowledge of the game.
On the sentience front, this was still a long way off. To achieve self-training success, the AI had to be limited to a problem in which clear rules limited its actions and clear rules determined the outcome of a game. (Not every problem is so neatly defined, and fortunately, the outcomes of an AI uprising probably fall into the “poorly defined” category.)
This week, a new paper (PDF, not yet peer reviewed) details how quickly DeepMind’s AI has improved at its self-training in such scenarios. Evolved now to AlphaZero, this latest iteration started from scratch and bested the program that beat the human Go champions after just eight hours of self-training. And when AlphaZero instead decided to teach itself chess, the AI defeated the current world-champion chess program, Stockfish, after a mere four hours of self-training. (For fun, AlphaZero also took two hours to learn shogi, “a Japanese version of chess that’s played on a bigger board,” according to The Verge, and then defeated one of the best bots around.)
So for those keeping track, DeepMind’s latest AI became a world-class competitor at three separate complex games in less than a day. The team set out to build a “more generic version” of its previous software this time, and it would appear they succeeded.
Back in October 2015, when the original AlphaGo beat three-time European champion Fan Hui 5-0, it relied on a novel mix of deep neural-network machine learning and tree search techniques. Without getting into all the complexities, the system observed humans and then honed its strategy by pitting instances of AlphaGo against each other in a process known as reinforcement learning. Thousands (millions?) of iterations later, AlphaGo could dominate.
This time, AlphaZero relied more heavily on reinforcement training similar to the October 2017 success with AlphaGo Zero. As Ars Science Editor John Timmer described the process at that time:
The algorithm would learn by playing against a second instance of itself. Both Zeroes would start off with knowledge of the rules, but they would only be capable of playing random moves. Once a move was played, however, the algorithm tracked if it was associated with better game outcomes. Over time, that knowledge led to more sophisticated play.
Over time, the AI built up a tree of possible moves, along with values associated with the game outcomes in which they were played. It also kept track of how often a given move had been played in the past, so it could quickly identify moves that were consistently associated with success. Since both instances of the neural network were improving at the same time, the procedure ensured that AlphaGo Zero was always playing against an opponent that was challenging at its current skill level.
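For readers who want to see the shape of that loop, here is a minimal Python sketch of the idea on a toy game. To be clear, this is not DeepMind's code: there is no neural network here, and the game (a pile of stones, each player removes one to three, whoever takes the last stone wins), the function names, and the UCB-style scoring rule are all assumptions chosen for illustration. What it does share with the description above is the bookkeeping: both sides draw on the same table, play starts out effectively random, and each move's play count and accumulated outcomes are tracked so that consistently successful moves get picked more often.

```python
import math
import random
from collections import defaultdict

MAX_TAKE = 3  # assumed toy rule: remove 1 to 3 stones per turn


def legal_moves(stones):
    return range(1, min(MAX_TAKE, stones) + 1)


def choose_move(stats, stones, rng, explore=1.4):
    """Pick a move using its average outcome plus an exploration bonus."""
    moves = list(legal_moves(stones))
    untried = [m for m in moves if stats[(stones, m)][0] == 0]
    if untried:
        return rng.choice(untried)  # with no data yet, play is random
    total = sum(stats[(stones, m)][0] for m in moves)

    def score(m):
        plays, outcome_sum = stats[(stones, m)]
        return outcome_sum / plays + explore * math.sqrt(math.log(total) / plays)

    return max(moves, key=score)


def self_play(stats, start, rng):
    """Play one game in which both sides share the same statistics table."""
    stones, player, history = start, 0, []
    while stones > 0:
        move = choose_move(stats, stones, rng)
        history.append((player, stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player  # whoever took the last stone
    for who, pile, move in history:
        entry = stats[(pile, move)]
        entry[0] += 1                                # how often the move was played
        entry[1] += 1.0 if who == winner else -1.0   # outcome for the mover


def train(games=20000, start=10, seed=0):
    rng = random.Random(seed)
    stats = defaultdict(lambda: [0, 0.0])  # (pile, move) -> [plays, outcome sum]
    for _ in range(games):
        self_play(stats, start, rng)
    return stats


def best_move(stats, stones):
    """Report the move that self-play ended up choosing most often."""
    return max(legal_moves(stones), key=lambda m: stats[(stones, m)][0])


if __name__ == "__main__":
    table = train()
    for pile in (5, 6, 7, 10):
        print(pile, "stones -> take", best_move(table, pile))
```

In this toy game the winning strategy is to leave the opponent a multiple of four stones, so the play counts should drift toward those moves as the games pile up. Go, chess, and shogi are far too large for a lookup table like this, which is where the neural network comes in.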
Both Go and chess can be incredibly complex, with possible position totals that comfortably exceed 10^100.
This feat is merely DeepMind’s latest in a Go résumé that now includes beating the best humans, an online streak of 51 wins (before losing connectivity in match 52), and training itself to become world-class. As we’ve noted before, there’s almost no chance that a human will ever beat AlphaGo again, but we meatsacks can still learn a lot about the game itself by watching this AI play.