The document summarizes recent research related to "theory of mind" in multi-agent reinforcement learning. It discusses three papers that propose methods for agents to infer the intentions of other agents by applying concepts from theory of mind:
1. The papers propose that in multi-agent reinforcement learning, being able to understand the intentions of other agents could help with cooperation and increase success rates.
2. The methods aim to estimate the intentions of other agents by modeling their beliefs and private information, using ideas from theory of mind in cognitive science. This involves inferring information about other agents that is not directly observable.
3. Bayesian inference is often used to reason about the beliefs, goals and private information of other agents based on their observed behavior.
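As a rough illustration of this kind of inference (not taken from any of the summarized papers), the toy sketch below applies Bayes' rule to update a belief over another agent's hidden goal from its observed actions. The goal set, the softmax-rational action model, and the Q-values are invented for illustration.

```python
import numpy as np

# Toy sketch: Bayesian inference over another agent's hidden goal from its
# observed actions. Goals, actions, and Q-values are illustrative assumptions.
GOALS = ["goal_A", "goal_B"]
ACTIONS = ["left", "right", "stay"]

# Assumed Q(goal, action): how valuable each action is if the agent pursues a goal.
Q = {
    "goal_A": np.array([2.0, 0.1, 0.5]),
    "goal_B": np.array([0.1, 2.0, 0.5]),
}

def likelihood(action_idx, goal, beta=2.0):
    """P(action | goal) under a softmax (Boltzmann-rational) action model."""
    prefs = np.exp(beta * Q[goal])
    return prefs[action_idx] / prefs.sum()

def update_belief(belief, action_idx):
    """One Bayesian update of the belief over the other agent's hidden goal."""
    posterior = {g: belief[g] * likelihood(action_idx, g) for g in GOALS}
    z = sum(posterior.values())
    return {g: p / z for g, p in posterior.items()}

belief = {g: 1.0 / len(GOALS) for g in GOALS}   # uniform prior over goals
for observed in ["right", "right", "stay"]:      # observed opponent actions
    belief = update_belief(belief, ACTIONS.index(observed))
    print(observed, belief)
```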
The detailed results are described on GitHub (in English):
https://github.com/jkatsuta/exp-18-1q
(exp1 – exp6 under maddpg/experiments/my_notes/)
These are slides for a seminar at Rikkyo University (part 1).
Part 2 of the slides:
/JunichiroKatsuta/ss-108099542
Blog post (with video):
https://recruit.gmo.jp/engineer/jisedai/blog/multi-agent-reinforcement-learning/
Deep Reinforcement Learning from Scratch (NLP2018 tutorial slides) / Introduction of Deep Reinforcement Learning, Preferred Networks
Introduction of Deep Reinforcement Learning, presented at a domestic NLP conference.
These are the slides from a talk at the 24th Annual Meeting of the Association for Natural Language Processing (NLP2018).
http://www.anlp.jp/nlp2018/#tutorial
- The document introduces Deep Counterfactual Regret Minimization (Deep CFR), a new algorithm proposed by Noam Brown et al. in ICML 2019 that incorporates deep neural networks into Counterfactual Regret Minimization (CFR) for solving large imperfect-information games.
- CFR is an algorithm for computing Nash equilibria in two-player zero-sum games by minimizing cumulative counterfactual regret. Tabular CFR scales poorly to very large games, which traditionally have to be abstracted into much smaller games first.
- Deep CFR removes the need for hand-crafted abstraction by using neural networks to approximate the regrets and generalize across the game tree, allowing it to handle games that were previously intractable without abstraction, such as large poker games.
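For intuition about the regret-minimization loop that CFR builds on, here is a minimal regret-matching sketch run in self-play on rock-paper-scissors. It is not Deep CFR itself (which replaces tabular regrets with neural-network approximations); it only shows the core regret-to-strategy update.

```python
import numpy as np

# Minimal regret-matching sketch (the per-information-set update CFR is built on),
# run in self-play on rock-paper-scissors. Not Deep CFR itself.
PAYOFF = np.array([[0, -1, 1],   # row player's payoff for R, P, S vs R, P, S
                   [1, 0, -1],
                   [-1, 1, 0]])

def strategy_from_regret(regret):
    """Play actions in proportion to their positive cumulative regret."""
    pos = np.maximum(regret, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regret), 1.0 / len(regret))

regret = [np.zeros(3), np.zeros(3)]
strategy_sum = [np.zeros(3), np.zeros(3)]

for _ in range(10_000):
    strat = [strategy_from_regret(r) for r in regret]
    for p in range(2):
        strategy_sum[p] += strat[p]
    u0 = PAYOFF @ strat[1]          # player 0's expected payoff per action
    u1 = -PAYOFF.T @ strat[0]       # player 1's expected payoff per action
    regret[0] += u0 - strat[0] @ u0  # regret of each action vs. current strategy
    regret[1] += u1 - strat[1] @ u1

avg = [s / s.sum() for s in strategy_sum]
print("average strategies:", avg)    # both converge toward the (1/3, 1/3, 1/3) equilibrium
```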
This document discusses methods for automated machine learning (AutoML) and optimization of hyperparameters. It focuses on accelerating the Nelder-Mead method for hyperparameter optimization using predictive parallel evaluation. Specifically, it proposes using a Gaussian process to model the objective function and perform predictive evaluations in parallel to reduce the number of actual function evaluations needed by the Nelder-Mead method. The results show this approach reduces evaluations by 49-63% compared to baseline methods.
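The sketch below illustrates only the general idea of surrogate-assisted evaluation, under the assumption that a Gaussian-process model screens candidate points and replaces the expensive objective call whenever its prediction is confident. The objective, kernel, and threshold are illustrative; this is not the paper's exact algorithm.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Illustrative surrogate-assisted evaluation: a GP stands in for the expensive
# objective when its prediction is confident. Not the paper's exact method.
def objective(x):
    """Expensive black-box objective (stand-in for a real training run)."""
    return np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(8, 2))            # initial design
y = np.array([objective(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

def evaluate(candidates, uncertainty_threshold=0.05):
    """Use the GP prediction when confident; otherwise pay for a real evaluation."""
    global X, y
    gp.fit(X, y)
    mean, std = gp.predict(np.asarray(candidates), return_std=True)
    results = []
    for x, m, s in zip(candidates, mean, std):
        if s < uncertainty_threshold:          # confident -> predictive (cheap) evaluation
            results.append(m)
        else:                                  # uncertain -> true (expensive) evaluation
            fx = objective(x)
            X = np.vstack([X, x])
            y = np.append(y, fx)
            results.append(fx)
    return results

# e.g. screening the reflection/expansion/contraction candidates of one Nelder-Mead step:
print(evaluate([np.array([0.4, 0.4]), np.array([0.9, 0.1]), np.array([0.25, 0.35])]))
```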
26. Libratus results
■ Pros vs. Libratus
– A high win rate against the professional players
Fig. 3. Libratus performance against top humans: cumulative mbb/game over games played in the 2017 Brains vs. Artificial Intelligence competition.
33. Pluribus
■ Surpassed professional players at six-player No-limit Texas Hold'em
■ Basic idea
– Compute a reasonably good strategy with CFR ahead of time, then improve it during actual play
→ The goal is essentially the same as Libratus
■ The approach is also similar to Libratus:
1. Pre-compute a reasonably strong strategy with CFR in advance
2. During actual play, apply CFR to the current subgame and compute the strategy on the fly
■ Main differences from Libratus
– With more players the number of states becomes enormous, so computing strategies during actual play is hard
– Depth-limited search reduces the number of nodes that have to be searched
Superhuman AI for multiplayer poker (2019)
https://science.sciencemag.org/content/early/2019/07/10/science.aay2400
34. Depth Limited Search
1. Search as usual until reaching nodes at a certain depth (leaf nodes)
2. When a leaf node is reached, estimate its value
– Estimated by Monte Carlo rollouts
– Each player selects one of k = 4 pre-computed strategies and acts according to it from the leaf node onward:
■ The plain blueprint strategy
■ A blueprint strategy biased toward folding
■ A blueprint strategy biased toward calling
■ A blueprint strategy biased toward raising
– This approximately captures the fact that a node's value depends on the strategies played below it (see the sketch after the figure caption below)
Fig. 4. Real-time search in Pluribus. The subgame shows just two players for simplicity. A dashed line between nodes indicates that the player to act does not know which of the two nodes she is in. Left: the original imperfect-information subgame. Right: the transformed subgame that is searched in real time to determine a player's strategy.
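A toy sketch of the leaf-evaluation step from slides 33–34: at a depth-limited leaf, each remaining player is assumed to pick one of the k = 4 continuation strategies, and the leaf value is estimated by Monte Carlo rollouts under those choices. The rollout here is a random stand-in, not Pluribus's actual simulator, and the node representation is hypothetical.

```python
import random

# Toy leaf-value sketch for depth-limited search: each player picks one of k
# pre-computed continuation strategies; the leaf value is an average of rollouts.
CONTINUATIONS = ["blueprint", "fold_biased", "call_biased", "raise_biased"]  # k = 4

def rollout_value(leaf_state, continuation_by_player, rng):
    """Play the rest of the hand under the chosen continuation strategies.
    Here it is just a random stand-in returning a payoff for player 0."""
    bias = {"blueprint": 0.0, "fold_biased": -0.2, "call_biased": 0.1, "raise_biased": 0.3}
    return rng.gauss(bias[continuation_by_player[0]] - bias[continuation_by_player[1]], 1.0)

def estimate_leaf_value(leaf_state, n_rollouts=200, seed=0):
    """Estimate the leaf's value as a function of which continuation each player picks."""
    rng = random.Random(seed)
    values = {}
    for ours in CONTINUATIONS:
        for theirs in CONTINUATIONS:
            samples = [rollout_value(leaf_state, {0: ours, 1: theirs}, rng)
                       for _ in range(n_rollouts)]
            values[(ours, theirs)] = sum(samples) / n_rollouts
    return values

print(estimate_leaf_value(leaf_state=None))
```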
38. Supervised learning × mahjong AI
■ The mainstream approach is to train discard selection and similar decisions by supervised learning on game-record data
■ Recently, using a CNN as the model has raised the agreement rate with expert players' discards to around 68%
– What should be discarded in this position?
– Call pon or not?
– Call chii or not?
Building a computer mahjong player based on monte carlo simulation and opponent models (2015)
https://ieeexplore.ieee.org/document/7317929
Supervised Learning of Imperfect Information Data in the Game of Mahjong via Deep Convolutional Neural Networks (2018)
https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=192052&item_no=1
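As a hedged sketch of the supervised setup described above, the snippet below trains a small CNN to map an encoded game state to one of the 34 tile types to discard. The input encoding (feature planes × 34 tile types), the architecture, and the data are illustrative assumptions, not the cited papers' exact models.

```python
import torch
import torch.nn as nn

# Illustrative CNN for supervised discard prediction; encoding and architecture
# are assumptions, not the cited papers' exact models.
N_TILE_TYPES = 34        # mahjong tile types
N_CHANNELS = 16          # hypothetical feature planes (hand, discards, melds, ...)

class DiscardCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_CHANNELS, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * N_TILE_TYPES, N_TILE_TYPES),  # logits over which tile to discard
        )

    def forward(self, x):
        return self.net(x)

model = DiscardCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for encoded positions and the expert's chosen discards.
states = torch.randn(32, N_CHANNELS, N_TILE_TYPES)
expert_discards = torch.randint(0, N_TILE_TYPES, (32,))

logits = model(states)
loss = loss_fn(logits, expert_discards)
loss.backward()
optimizer.step()
print("batch agreement with expert:",
      (logits.argmax(dim=1) == expert_discards).float().mean().item())
```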
40. Discard selection via simulation during play
■ Proposed in "Building a computer mahjong player based on monte carlo simulation and opponent models (2015)"
■ The score of each candidate action in discard selection is estimated by Monte Carlo simulation
– The action of discarding each tile
– The action of folding (assumed never to deal into an opponent's winning hand)
TABLE VI. EVALUATION OF SCORE PREDICTION
Player                 | Mean square error
Prediction model       | 0.37
-Revealed melds        | 0.38
-Revealed fan value    | 0.38
Expert player          | 0.40
Fig. 1. Overview of Monte Carlo moves
Building a computer mahjong player based on monte carlo simulation and opponent models (2015)
https://ieeexplore.ieee.org/document/7317929
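The following toy sketch shows the overall Monte Carlo move evaluation described on this slide: each candidate action's expected score is estimated by averaging many simulated continuations, and the highest-scoring action is chosen. The simulator is a random placeholder rather than the paper's opponent-model-based simulation.

```python
import random

# Toy Monte Carlo move evaluation: average many simulated continuations per action.
def simulate_rest_of_hand(action, rng):
    """Hypothetical playout returning the final score change for one simulation."""
    if action == "fold":
        return 0.0                      # folding is assumed never to deal in (as on the slide)
    r = rng.random()                    # discarding a tile: toy win / deal-in / draw outcomes
    if r < 0.08:
        return -8000.0
    if r < 0.30:
        return 8000.0
    return 0.0

def choose_action(candidate_actions, n_playouts=1000, seed=0):
    """Estimate each action's expected score by Monte Carlo and pick the best one."""
    rng = random.Random(seed)
    expected = {}
    for action in candidate_actions:
        scores = [simulate_rest_of_hand(action, rng) for _ in range(n_playouts)]
        expected[action] = sum(scores) / n_playouts
    return max(expected, key=expected.get), expected

best, values = choose_action(["discard_1m", "discard_9p", "fold"])
print(best, values)
```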
41. Discard selection via simulation during play
■ Behavior during a simulation
– On the agent's own turn
■ Simulate from the position reached after discarding each candidate tile
■ For the folding action, tiles are only removed from the hand; nothing is actually discarded
– On an opponent's turn
■ Opponents hold no concrete tiles; transitions are decided by a trained model (sketched below)
– Probability distribution of moving into tenpai (one tile from winning)
– Probability distribution of moving into a folding state
– Probability distribution of winning by tsumo or ron
■ Opponents simply discard the tile they just drew from the wall
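Below is a small sketch of how the opponent model described above could drive a simulation: each opponent turn samples a transition among abstract states (normal, tenpai, folding, win) from probability distributions that stand in for the learned model. All numbers are illustrative placeholders.

```python
import random

# Toy opponent model: abstract-state transitions instead of concrete tiles.
def opponent_transition_probs(state):
    """Stand-in for the learned model (it would condition on discards, melds, turn, ...)."""
    if state == "normal":
        return {"normal": 0.90, "tenpai": 0.06, "folding": 0.03, "win": 0.01}
    if state == "tenpai":
        return {"tenpai": 0.85, "folding": 0.03, "win": 0.12}
    return {"folding": 1.0}             # a player who folded keeps folding

def step_opponent(state, rng):
    """Sample the opponent's next abstract state instead of simulating concrete tiles."""
    probs = opponent_transition_probs(state)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

rng = random.Random(0)
state = "normal"
for turn in range(1, 19):               # roughly one hand's worth of turns for one opponent
    state = step_opponent(state, rng)
    if state == "win":
        print("opponent wins on turn", turn)
        break
else:
    print("opponent did not win; final state:", state)
```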
43. Formulation as an MDP
■ Proposed in "Method for constructing artificial intelligence player with abstraction to markov decision processes in multiplayer game of mahjong (2019)"
■ An abstracted version of mahjong is formulated as an MDP
– The dynamics when the player follows a folding strategy or a strategy aiming for tenpai are each formulated separately
– State transition probabilities in the MDP are modeled by logistic regression fitted to real game data
■ The action to take is decided from the value function derived from the MDP (see the sketch after the reference below)
Method for constructing artificial intelligence player with abstraction to markov decision processes in multiplayer game of mahjong (2019)
https://arxiv.org/abs/1904.07491
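To make the idea concrete, here is a hedged sketch of an abstracted-mahjong MDP with a handful of states, per-strategy transition probabilities (which the paper estimates with logistic regression from real data; the constants here are invented), and a finite-horizon value computation used to choose between pushing for tenpai and folding.

```python
# Toy abstracted-mahjong MDP: illustrative states, rewards, and transition
# probabilities (the paper fits these with logistic regression on real data).
STATES = ["far_from_tenpai", "tenpai", "win", "deal_in", "hand_over"]
REWARD = {"win": 8000.0, "deal_in": -8000.0, "hand_over": 0.0}

# P[strategy][state] -> {next_state: probability}; terminal states are absorbing.
P = {
    "push": {   # keep playing toward a winning hand
        "far_from_tenpai": {"far_from_tenpai": 0.80, "tenpai": 0.14, "deal_in": 0.05, "hand_over": 0.01},
        "tenpai":          {"tenpai": 0.80, "win": 0.13, "deal_in": 0.05, "hand_over": 0.02},
    },
    "fold": {   # defend: never deal in, but give up on winning
        "far_from_tenpai": {"far_from_tenpai": 0.98, "hand_over": 0.02},
        "tenpai":          {"tenpai": 0.98, "hand_over": 0.02},
    },
}

def values_for(strategy, horizon=18):
    """Expected final reward of each state when following `strategy` for `horizon` more turns."""
    v = {s: REWARD.get(s, 0.0) for s in STATES}
    for _ in range(horizon):
        v = {s: REWARD[s] if s in REWARD else
                sum(p * v[s2] for s2, p in P[strategy][s].items())
             for s in STATES}
    return v

for state in ["far_from_tenpai", "tenpai"]:
    values = {strategy: values_for(strategy)[state] for strategy in P}
    best = max(values, key=values.get)
    print(state, "-> best strategy:", best, {k: round(v, 1) for k, v in values.items()})
```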