
Graduate Student: Huang, De-Wei (黃得為)
Thesis Title: Comparing the Performance Enhancement of AlphaZero in Outer-Open Gomoku Using Gumbel and KataGo Methods (比較Gumbel和KataGo方法提升AlphaZero在外圍開局五子棋的訓練效能)
Advisor: Lin, Shun-Shii (林順喜)
Committee Members: Wu, I-Chen (吳毅成); Yen, Shi-Jim (顏士淨); Chen, Jr-Chang (陳志昌); Chou, Hsin-Hung (周信宏); Lin, Shun-Shii (林順喜)
Oral Defense Date: 2023/07/20
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: Chinese
Number of Pages: 56
Keywords: Neural Network, Outer-Open Gomoku, AlphaZero, KataGo, Gumbel
Research Methods: Experimental design; comparative study
DOI URL: http://doi.org/10.6345/NTNU202301459
Document Type: Academic thesis
Chinese Abstract (translated):
    The purpose of this study is to compare the KataGo and Gumbel methods in order to minimize the amount of computing resources required while maintaining or improving training efficiency. KataGo is an improved version of the AlphaZero algorithm; its author used more efficient training techniques and a redesigned neural network architecture, and claimed that its training is about 50 times faster than AlphaZero's. Gumbel is a method proposed by DeepMind in 2022 that, by expanding only a very small number of nodes during Monte Carlo Tree Search, can train models that far surpass other known algorithms under the same conditions.
    This study applies both methods to strengthening AlphaZero's play in Outer-Open Gomoku and compares their advantages, disadvantages, and effects. The experimental results show that both Gumbel and KataGo can effectively improve AlphaZero's training performance on Outer-Open Gomoku. The experiments also show that, for the same number of training generations, KataGo produces a stronger player than Gumbel, whereas within the same short training time Gumbel produces a stronger player than KataGo.
    Besides examining the improvements introduced by the AlphaZero, KataGo, and Gumbel algorithms, this study also discusses two methods for accelerating self-play and two general methods for improving training performance.
    First, we implemented two methods to accelerate self-play and tested them on all three algorithms. The experiments show that applying these two methods speeds up self-play by a factor of 13.16 on average, a significant improvement that effectively saves training time.
    In addition, we propose two general methods for improving the training performance of AlphaZero, KataGo, and Gumbel. Applying them yields good results: they not only improve training efficiency but also improve the models' learning ability and accuracy.
    These results show that KataGo and the Gumbel method, as improvements to AlphaZero, can significantly improve the training effectiveness and speed of an Outer-Open Gomoku AI while reducing the required training resources. Such technical innovations allow more researchers to take part in reinforcement learning research and advance artificial intelligence in games and other fields.

English Abstract:
    The purpose of this research is to explore two methods, KataGo and Gumbel, and to compare how well each reduces the amount of computing resources required while maintaining or improving training efficiency. KataGo is an improved version of the AlphaZero algorithm whose author used more efficient training techniques and a redesigned neural network architecture, and claimed roughly a 50-fold reduction in computation over comparable methods. Gumbel, on the other hand, is a method proposed by DeepMind in 2022 that achieves significantly better results than other algorithms under the same conditions while expanding only a few nodes during Monte Carlo Tree Search.
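    As background for the Gumbel approach mentioned above, the sketch below illustrates the two ingredients named in the table of contents, the Gumbel-Top-k trick and Sequential Halving, in the spirit of Danihelka et al. [5]. It is a simplified Python illustration, not the implementation used in this thesis; the simulate callback and all parameter names are placeholders, and the monotone value transform used in [5] is omitted.

import math
import numpy as np

def gumbel_top_k(policy_logits, k, rng):
    """Draw k distinct candidate actions without replacement by adding
    Gumbel(0, 1) noise to the policy logits and keeping the top-k scores."""
    gumbel = rng.gumbel(size=policy_logits.shape)
    candidates = np.argsort(-(policy_logits + gumbel))[:k]
    return list(candidates), gumbel

def sequential_halving(policy_logits, k, total_simulations, simulate, rng):
    """Spread a small simulation budget over k candidate root actions,
    halving the candidate set after each phase and returning the action
    with the best combined score."""
    candidates, gumbel = gumbel_top_k(policy_logits, k, rng)
    q = {a: 0.0 for a in candidates}    # running mean of simulation returns
    visits = {a: 0 for a in candidates}
    num_phases = max(1, math.ceil(math.log2(k)))
    for _ in range(num_phases):
        sims_per_action = max(1, total_simulations // (num_phases * len(candidates)))
        for a in candidates:
            for _ in range(sims_per_action):
                value = simulate(a)     # placeholder: one MCTS simulation starting with action a
                visits[a] += 1
                q[a] += (value - q[a]) / visits[a]
        # Rank by Gumbel noise + prior logit + estimated value, then keep the better half.
        candidates.sort(key=lambda a: gumbel[a] + policy_logits[a] + q[a], reverse=True)
        candidates = candidates[: max(1, len(candidates) // 2)]
    return candidates[0]

    A call such as sequential_halving(logits, k=16, total_simulations=32, simulate=one_simulation, rng=np.random.default_rng(0)) would pick a root move with only 32 simulations, which is the regime the abstract refers to when it says only a few nodes need to be expanded.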
    In this research, we applied these two methods to enhance the performance of AlphaZero in Outer-Open Gomoku and compared their advantages, disadvantages, and effects. The experimental results show that both Gumbel and KataGo effectively improve the performance of AlphaZero in training an Outer-Open Gomoku program. In addition, we found that KataGo trains a stronger model than Gumbel for the same number of training epochs, whereas within the same short-term training duration Gumbel trains a stronger model than KataGo.
    Furthermore, this research also investigates two methods for improving self-play speed and two general methods for enhancing the training performance of AlphaZero, KataGo, and Gumbel. With the two self-play enhancements in place, the experiments show an average speedup of 13.16 times over the original three algorithms. The two general methods for improving training performance also yielded promising results.
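    The abstract does not spell out the two self-play speedups, but the table of contents names them as multi-processing and an "NN Table". The sketch below shows one plausible reading of such a table, assuming it is a bounded cache of network evaluations keyed by a position hash so that repeated positions skip the forward pass; the class and method names here are hypothetical and are not taken from the thesis.

from collections import OrderedDict

class NNTable:
    """A bounded cache of neural-network evaluations keyed by a position
    hash (for example, a bitboard-derived or Zobrist hash). Positions that
    recur during self-play reuse the stored (policy, value) pair instead of
    triggering another forward pass."""

    def __init__(self, capacity=200_000):
        self.capacity = capacity
        self.table = OrderedDict()          # hash -> (policy, value)

    def lookup(self, position_hash):
        entry = self.table.get(position_hash)
        if entry is not None:
            self.table.move_to_end(position_hash)   # mark as recently used
        return entry

    def store(self, position_hash, policy, value):
        self.table[position_hash] = (policy, value)
        self.table.move_to_end(position_hash)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)          # evict the least recently used entry

def evaluate(position, net, nn_table):
    """Evaluate a position, consulting the cache before calling the network."""
    key = position.hash()                   # placeholder: e.g. a bitboard hash of the position
    cached = nn_table.lookup(key)
    if cached is not None:
        return cached
    policy, value = net.predict(position)   # placeholder network call
    nn_table.store(key, policy, value)
    return policy, value

    How much such a cache helps depends on how often positions repeat across self-play games; in Gomoku-like games the early moves recur frequently, which is presumably where a table like this saves most of the network calls, and it combines naturally with running several self-play workers in parallel.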
    These results demonstrate that the KataGo and Gumbel methods can significantly enhance the training effectiveness and speed of an Outer-Open Gomoku program while reducing the required computational resources. Such technological innovations enable more researchers to participate in reinforcement learning research and advance the development of artificial intelligence in games and other domains.

Table of Contents:
Chapter 1  Introduction
  1.1  Research Background
  1.2  Research Objectives
  1.3  Research Significance
Chapter 2  Literature Review
  2.1  AlphaZero
    2.1.1  Training Process
    2.1.2  Neural Network Architecture
    2.1.3  Monte Carlo Tree Search
  2.2  KataGo
    2.2.1  Forced Playouts and Policy Target Pruning
    2.2.2  Global Pooling Structure
    2.2.3  Auxiliary Policy Target
  2.3  Gumbel
    2.3.1  Gumbel vs. AlphaZero
    2.3.2  Gumbel-Top-k Trick
    2.3.3  Sequential Halving Algorithm
  2.4  Outer-Open Gomoku
  2.5  Outer-Open Gomoku Bitboard Design
  2.6  Yixin
Chapter 3  Methods and Procedures
  3.1  AlphaZero Implementation
  3.2  KataGo Implementation
  3.3  Gumbel Implementation
  3.4  Outer-Open Gomoku Rules Implementation
  3.5  Outer-Open Gomoku Bitboard Implementation
  3.6  Multi-Processing Acceleration and the NN Table
  3.7  Average Value Target
  3.8  Slow Window
Chapter 4  Experiments and Results
  4.1  Accelerating the Self-Play Phase
    4.1.1  Multi-Processing Acceleration Experiments
    4.1.2  NN Table Experiments
    4.1.3  Summary of Multi-Processing + NN Table Experiments
  4.2  Slow Window and Average Value Target Experiments
    4.2.1  KataGo Experiments with Slow Window and Average Value Target
    4.2.2  Gumbel Experiments with Slow Window and Average Value Target
  4.3  Gumbel Experiments
    4.3.1  Experiments with Different Simulation Counts for Each Algorithm
    4.3.2  Gumbel Experiments with Different k Values
  4.4  Playing-Strength Comparison of the Algorithms in Different Scenarios
  4.5  Game Results of KataGo and Gumbel against Yixin
Chapter 5  Conclusions and Future Work
References

    [1] D. Silver et al., “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016, doi: 10.1038/nature16961.
    [2] D. Silver et al., “Mastering the Game of Go without Human Knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017, doi: 10.1038/nature24270.
    [3] D. Silver et al., “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” Science, vol. 362, Issue 6419, pp. 1140-1144, 2018, doi: 10.1126/science.aar6404.
    [4] D. J. Wu, “Accelerating Self-Play Learning in Go,” Feb. 2019, doi: 10.48550/arxiv.1902.10565.
    [5] I. Danihelka, A. Guez, J. Schrittwieser, and D. Silver, “Policy Improvement by Planning with Gumbel,” International Conference on Learning Representations, 2022.
    [6] 楊子頤, "Applying AlphaZero to Connect6," Master's thesis, National Chiao Tung University, 2020 (in Chinese).
    [7] 劉浩萱, "AlphaZero Algorithm Combined with Quick-Win Strategies or Threat-Space Search for Gomoku," Master's thesis, National Taiwan Normal University, 2020 (in Chinese).
    [8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770-778, 2016.
    [9] K. He, X. Zhang, S. Ren, and J. Sun, “Identity Mappings in Deep Residual Networks,” 2016 European Conference on Computer Vision (ECCV), Amsterdam, pp. 630-645, 2016.
    [10] E. Jang, S. Gu, and B. Poole, "Categorical Reparameterization with Gumbel-Softmax," 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.
    [11] L. V. Allis, H. J. Herik, and M. P. H. Huntjens, “Go-Moku Solved by New Search Techniques,” Computational Intelligence, vol. 12, no. 1, pp. 7-23, 1996.
    [12] S.-S. Lin, and C.-Y. Chen, “How to Rescue Gomoku? The Introduction of Lin's New Rule,” (in Chinese) The 2012 Conference on Technologies and Applications of Artificial Intelligence (TAAI 2012), Tainan, Taiwan, November 2012.
    [13] C.-H. Chen, S.-S. Lin, and Y.-C. Chen, "An Algorithmic Design and Implementation of Outer-Open Gomoku," 2nd International Conference on Computer and Communication Systems (ICCCS), 2017, doi: 10.1109/CCOMS.2017.8075180.
    [14] K. Sun, the Yixin program, https://gomocup.org/static/download-ai/YIXIN18.zip.
    [15] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model," pseudocode, https://arxiv.org/src/1911.08265v2/anc/pseudocode.py.
    [16] suragnair/alpha-zero-general: A clean implementation based on AlphaZero for any game in any framework, https://github.com/suragnair/alpha-zero-general (accessed Jan. 04, 2023).
    [17] D. Willemsen, H. Baier, and M. Kaisers, “Value Targets in Off-Policy AlphaZero: a New Greedy Backup,” Neural Computing and Applications, pp. 1–14, 2021.
    [18] Lessons From AlphaZero (part 4) - Improving the Training Target, https://medium.com/oracledevs/lessons-from-alphazero-part-4-improving-the-training-target-6efba2e71628 (accessed Jun. 28, 2018).
    [19] Lessons From AlphaZero (part 6) - Hyperparameter Tuning, https://medium.com/oracledevs/lessons-from-alpha-zero-part-6-hyperparameter-tuning-b1cfcbe4ca9a (accessed Jul. 12, 2018).
