MuZero

MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing anything about their rules. Its first release included benchmarks of its performance in go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to AlphaZero.

On November 19, 2019, the DeepMind team released a preprint introducing MuZero, which matched AlphaZero's performance in chess and shogi, improved on its performance in go (setting a new world record), and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain. MuZero was trained via "self-play" and via play against AlphaZero, with no access to rules, opening books, or endgame tables. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20% fewer computation steps per node in the search tree: 16 residual blocks used in evaluation, rather than 20.[1]
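
The residual blocks mentioned above follow the standard pattern of adding a skip connection around a small stack of layers. The toy sketch below only illustrates that pattern and is not DeepMind's code: it substitutes small dense layers for the convolutions used in the actual networks, and the names, sizes, and shared weights are invented purely to keep the example short and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W1 = rng.normal(size=(DIM, DIM))
W2 = rng.normal(size=(DIM, DIM))

def residual_block(x):
    """Two transformations plus a skip connection: output = relu(x + F(x))."""
    y = np.maximum(x @ W1, 0.0)    # first layer followed by ReLU
    y = y @ W2                     # second layer
    return np.maximum(x + y, 0.0)  # add the skip connection, then ReLU

# Stacking blocks: the evaluation tower reportedly uses 16 such blocks in
# MuZero versus 20 in AlphaZero (weights are shared here only for brevity).
x = rng.normal(size=DIM)
for _ in range(16):
    x = residual_block(x)
```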

Relation to previous work

Comparison with AlphaZero

MuZero (MZ) is a combination of the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visually complex video games.

MZ was derived directly from AZ code, and shares its rules for setting search hyperparameters.

Differences between MZ and AZ include:[2]

  • AZ's planning process uses a simulator (which knows the rules of the game) and a neural network (which predicts the policy and value of a future position). Perfect knowledge of the game rules is used to model state transitions in the search tree, the actions available at each node, and the termination of a branch of the tree. MZ does not have access to a perfect ruleset, and replaces these two components with a single neural network (its learned model), which is updated continually.
  • AZ has a single model of the state of the game; MZ has separate models for representation of the current state (as a hidden state), for the dynamics of that hidden state (the next state and the immediate reward associated with each potential action), and for prediction of the policy and value of a position (see the sketch after this list).
  • MZ's hidden model may be complex, and it may turn out to cache computation within it; exploring the details of the hidden model in a successfully trained instance of MZ is an avenue for future research.
  • MZ does not expect a two-player game where winners take all. It works with standard reinforcement-learning scenarios, including single-agent environments with continuous intermediate rewards, possibly of arbitrary magnitude and with discounting over time. AZ was designed exclusively for two-player games that could be won, drawn, or lost.
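
As a rough illustration of the split described in this list, the sketch below shows the three learned functions (written h, g, and f in the MuZero paper) in a deliberately simplified form. It is not DeepMind's implementation: the real functions are deep residual networks, and the plain linear layers, dimensions, and variable names here are assumptions made only to keep the example self-contained and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN_DIM, NUM_ACTIONS = 32, 16, 4

# Randomly initialised weights stand in for trained networks.
W_h = rng.normal(size=(OBS_DIM, HIDDEN_DIM))                   # representation
W_g = rng.normal(size=(HIDDEN_DIM + NUM_ACTIONS, HIDDEN_DIM))  # dynamics
w_r = rng.normal(size=(HIDDEN_DIM + NUM_ACTIONS,))             # reward head
W_p = rng.normal(size=(HIDDEN_DIM, NUM_ACTIONS))               # policy head
w_v = rng.normal(size=(HIDDEN_DIM,))                           # value head

def representation(observation):
    """h: encode a raw observation into a hidden state."""
    return np.tanh(observation @ W_h)

def dynamics(hidden_state, action):
    """g: predict the next hidden state and the immediate reward for an
    action, replacing the rule-based simulator that AZ relies on."""
    a = np.zeros(NUM_ACTIONS)
    a[action] = 1.0
    x = np.concatenate([hidden_state, a])
    return np.tanh(x @ W_g), float(x @ w_r)

def prediction(hidden_state):
    """f: predict a policy over actions and a value for the hidden state."""
    logits = hidden_state @ W_p
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    return policy, float(hidden_state @ w_v)

# Planning unrolls the learned model instead of a simulator:
state = representation(rng.normal(size=OBS_DIM))
policy, value = prediction(state)
next_state, reward = dynamics(state, int(policy.argmax()))
```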

Comparison with R2D2

The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN.[3]

MZ surpassed R2D2's mean and median performance across the suite after training for 12 hours (1 million training steps), in contrast with the 5 days (2 million training steps) needed to train R2D2.

Training

MuZero used 16 third-generation tensor processing units (TPUs) for training and 1,000 TPUs for self-play for board games (with 800 simulations per step), and 8 TPUs for training and 32 TPUs for self-play for Atari games (with 50 simulations per step).

AlphaZero used 64 first-generation TPUs for training and 5,000 second-generation TPUs for self-play. As TPU design has improved (third-generation chips are twice as powerful individually as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are fairly comparable training setups.
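
For a side-by-side view, the snippet below restates the figures from the two paragraphs above as a plain data structure; the dictionary keys and layout are invented for this summary and carry no information beyond the text.

```python
# Reported hardware for training and self-play, as stated in the text above.
# Chip generations differ, so raw counts are not directly comparable.
training_setups = {
    "MuZero (board games)": {"training_tpus": 16, "selfplay_tpus": 1000,
                             "tpu_generation": 3, "simulations_per_step": 800},
    "MuZero (Atari)":       {"training_tpus": 8, "selfplay_tpus": 32,
                             "tpu_generation": 3, "simulations_per_step": 50},
    "AlphaZero":            {"training_tpus": 64, "selfplay_tpus": 5000,
                             "tpu_generation": "1 (training), 2 (self-play)"},
}

for name, cfg in training_setups.items():
    print(name, cfg)
```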

Preliminary results

MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in go after 500 thousand training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500 thousand training steps and surpassed it by 1 million steps, though it never performed well on 6 games in the suite.

Reactions and commentary

MZ was viewed as a significant advance over AZ,[4] and a generalizable step forward in unsupervised learning techniques.[5]

References

    1. Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; Lillicrap, Timothy (2019-11-19). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". arXiv:1911.08265 [cs.LG].
    2. Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (5 December 2017). "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm". arXiv:1712.01815 [cs.AI].
    3. Kapturowski, Steven; et al. "Recurrent Experience Replay in Distributed Reinforcement Learning". OpenReview.
    4. Shorten, Connor (2020-01-18). "The Evolution of AlphaGo to MuZero". Medium. Retrieved 2020-06-07.
    5. "[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee - LessWrong 2.0". www.lesswrong.com. Retrieved 2020-06-07.