MADRL
https://github.com/NeuroCSUT/DeepMind-Atari-Deep-Q-Learner-2Player
Multiagent Cooperation
and Competition with
Deep Reinforcement Learning
Paul Kim
Index
1. Introduction
2. Method
- 2.1 The Deep Q-Learning Algorithm
- 2.2 Adaptation of the Code for the Multiplayer Paradigm
- 2.3 Game Selection
- 2.4 Reward Schemes
- 2.4.1 Score More than the Opponent (Fully Competitive)
- 2.4.2 Losing the Ball Penalizes Both Players (Fully Cooperative)
- 2.4.3 Transition Between Cooperation and Competition
- 2.5 Training Procedure
- 2.6 Collecting the Game Statistics
3. Results
- 3.1 Emergence of Competitive Agents
- 3.2 Emergence of Collaborative Agents
- 3.3 Progression from Competition to Collaboration
Abstract
DeepMind's Deep Q-Network (DQN) is extended from the single-agent setting to a multiagent system.
In Pong, the two agents are controlled by independent Q-networks (IQN), one network per agent.
By changing Pong's rewarding scheme, the agents can be pushed toward competitive or cooperative behavior.
Competitive agents learn to score against each other: each tries to put the ball past its opponent.
Cooperative agents learn to keep the ball in the game as long as possible.
Competition and cooperation form a continuum: varying the reward values moves the agents between the two extremes.
As a result, the paper shows that decentralized learning with agents as complex as Deep Q-Networks is feasible, and it offers a practical approach for studying such decentralized multiagent learning.
Introduction
The philosophy of reinforcement learning
Biological and engineered agents alike face situations that cannot be fully anticipated; to cope with changing environments they must learn through trial-and-error.
A reinforcement learning agent interacts with its environment and adjusts its behavior according to the rewards it receives; to maximize the accumulated reward, the agent ends up constructing complex long-term strategies.
DQN and Multi-Agent
Because the space of possible scenarios is effectively unbounded, classical reinforcement learning was confined to restricted environments and typically required extra, hand-supplied knowledge of the environment's dynamics.
DeepMind, however, showed that RL handles environments as complex as video games and reaches super-human performance. Remarkably, their Deep Q-Network (a convolutional neural network serving as the representation for Q-learning) uses only the raw sensory input (screen images) and the reward signal (the game score).
Their 2015 work established a model-free, state-of-the-art approach.
The same algorithm works on many different games without per-game tuning, which demonstrates its potential for general application.
Introduction
The approach of this paper
The DQN receives only raw screen images and the reward signal as input; Atari Pong is used as the environment. Two agents are then trained under different rewarding schemes, and the question is how complex behaviors such as competition and cooperation emerge, and how to interpret the strategies the agents arrive at.
Competitive agents learn to defeat their opponent, whereas in the cooperative setting the agents find a way to keep the ball in play for as long as possible.
To study the intermediate behaviors between the competitive mode and the cooperative mode, and to observe the transition from competition to cooperation, the rewarding schemes are tuned between the two extremes.
Method : DQN
Q-Learning and DQN
The goal of RL is to find a policy that maximizes the accumulated long-term reward while interacting with a dynamic environment.
The difficulty is that the agent has no explicit knowledge of the environment's dynamics or reward function. For this reason model-free methods, such as the Q-learning used in the 2015 paper, are widely adopted.
Q-learning estimates the value of taking each possible action in each state of the environment.
DeepMind approximated these Q-values with a convolutional neural network on top of Q-learning, which was sufficient to produce super-human performance.
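Though the update rule is not written out on the slide, the tabular Q-learning update and the DQN loss that approximates it are standard:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

$$L(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$$

Here $D$ is the experience replay memory and $\theta^-$ are the parameters of the periodically frozen target network, both introduced in the 2015 paper to stabilize training.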
Method : DQN
DQN and Multi-Agent
When two or more agents share an environment, each agent's actions change the environment the others experience. Such joint learning is often analyzed with game-theoretic tools, but simply applying a good single-agent algorithm to each agent independently is also a widely considered approach.
Ex) In a multi-agent setting, the environment's state transitions and rewards are influenced by the joint actions of all agents. The value of agent 1's actions depends on agent 2's policy, and that policy keeps changing as agent 2 learns, so each agent is effectively chasing a moving target. In general, when the other agents learn simultaneously, neither the stability nor the adaptive behavior of an individual agent can be guaranteed.
The Q-learning algorithm can still be applied to each agent independently, but the conditions behind its convergence guarantees no longer hold in this setting, so those guarantees are lost.
(Despite the lost guarantees) its simplicity, decentralized nature, computational speed, and the good results it has repeatedly produced make independent DQN an attractive method nonetheless.
The training pipeline follows the paper 'Human-level control through deep reinforcement learning' closely.
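The slide does not spell this out, but the source of the non-stationarity can be stated precisely. With two agents, transitions and rewards depend on the joint action:

$$s_{t+1} \sim P(\,\cdot \mid s_t, a_t^1, a_t^2\,), \qquad r_t^i = R^i(s_t, a_t^1, a_t^2)$$

so the effective single-agent environment that agent 1 faces,

$$\tilde{P}(s' \mid s, a^1) = \sum_{a^2} \pi^2(a^2 \mid s)\, P(s' \mid s, a^1, a^2),$$

changes every time agent 2 updates its policy $\pi^2$, which is exactly why the convergence guarantees of single-agent Q-learning no longer apply.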
Method : Adaptation of the Code for the Multiplayer Paradigm
OpenAI Gym and the Multi-Agent Setting
The code published with the 'Human-level control through deep reinforcement learning' paper supports only single-agent environments out of the box.
The agents are therefore implemented as independent DQNs: both receive the Atari game screen as input, but there is no communication protocol between them; each agent simply treats the other as part of the environment.
Because the game screen is fully observable and shared by both players, no additional channel is needed for an agent to know the state of the game.
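A minimal sketch of the adapted control loop under these assumptions (the DQNAgent-style objects and the two-player environment interface are hypothetical stand-ins for illustration, not the authors' actual Lua/Torch code):

    def train_two_player(env, agent_left, agent_right, num_steps):
        # Both agents receive the same fully observable screen; there is
        # no communication channel between them. Each learns only from
        # its own reward and treats the other as part of the environment.
        screen = env.reset()
        for _ in range(num_steps):
            a_left = agent_left.act(screen)
            a_right = agent_right.act(screen)
            screen2, r_left, r_right, done = env.step(a_left, a_right)
            agent_left.remember(screen, a_left, r_left, screen2, done)
            agent_right.remember(screen, a_right, r_right, screen2, done)
            agent_left.learn()    # unmodified single-agent DQN update
            agent_right.learn()
            screen = env.reset() if done else screen2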
Method : Game Selection
Criteria for choosing the game
ALE (the Arcade Learning Environment) supports 61 games, but only some of them offer a two-player mode. Three criteria guided the choice of the experimental environment:
1. A real-time two-player mode is required
: Breakout, for example, is ruled out because the two players would have to take turns rather than play simultaneously.
2. The game must be learnable by the Deep Q-Learning algorithm
(which can be checked against the published single-agent results)
: Wizard of Wor, for example, offers a two-player mode but could not be mastered by the existing DQN.
3. The game's rules and environment must support both modes
: it must be possible to obtain a cooperative as well as a competitive setting merely by changing the reward function.
As a result Pong was chosen: it satisfies all of the criteria, and its rules are simple and widely known.
Method : Game Selection
Criteria for choosing the game
In Pong each player controls a paddle that moves vertically along its own side of the screen (left or right).
Each agent has 4 possible actions: move up, move down, stand still, and fire (which puts the ball back into play after a point).
Each of the two agents is controlled by its own separate DQN.
The authors also note two other games that meet the criteria and are left as material for future work, since they allow settings beyond two-player Pong (Outlaw, Warlords):
1. Outlaw has a game mode in which two players duel, shooting at each other.
2. Warlords supports up to 4 players, which would make it possible to test whether collaboration emerges among more than two agents.
Method : Rewarding Schemes
Rewarding Schemes
How the agents behave in the experiments depends on how they are rewarded, so their behavior can be steered through the rewarding scheme; designing these schemes is part of the study.
Accordingly, three rewarding schemes are used:
1. Score More than the Opponent (Fully Competitive)
2. Losing the Ball Penalizes Both Players (Fully Cooperative)
3. Transition Between Cooperation and Competition
Score More than the Opponent (Fully Competitive)
Competitive Zero-Sum Scheme : scoring gives +1, conceding gives -1
In Pong, the player who puts the ball past the opponent earns a positive reward, and the player who misses the ball receives a negative one.
This is the classic zero-sum structure: if the left agent receives a positive reward, the right player receives the matching negative reward, so the two always cancel out.
This setting is referred to as the competitive mode.
Losing the Ball Penalizes Both Players (Fully Cooperative)
Cooperative Mode : dropping the ball penalizes both players
Here the agents' shared goal is to keep the ball in the game without dropping it.
Whenever the ball goes out of play, both players receive a negative reward.
Note that in this scheme there is no way for either player to earn a positive reward: nothing is gained by getting the ball past the opponent, and the only signal is the shared punishment for losing the ball.
tip) Another conceivable cooperative scheme would reward the players for each hit that keeps the ball in play, but only the penalty-based variant is tested here.
Transition Between Cooperation and Competition
In both the cooperative mode and the competitive mode, letting the ball out of play is penalized.
The two modes differ only in one entry of the reward matrix: the reward a player receives when the ball gets past the other player. Varying this value between -1 and 1 reveals the intermediate scenarios between competition and cooperation.
Concretely, a variable rho is introduced: the player who misses the ball always receives -1, while the other player receives rho, which is varied from -1 (fully cooperative) to 1 (fully competitive).
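For reference, the three schemes collapse into a single reward matrix parameterized by rho (rho = 1 recovers the fully competitive scheme, rho = -1 the fully cooperative one):

    event                          left player    right player
    ball passes the left player        -1             rho
    ball passes the right player       rho             -1

A minimal sketch of the corresponding reward function (the event encoding is an illustrative assumption):

    def point_rewards(event, rho):
        """Return (r_left, r_right) when a point ends."""
        if event == "ball_passed_left":    # left player missed the ball
            return -1.0, rho
        if event == "ball_passed_right":   # right player missed the ball
            return rho, -1.0
        return 0.0, 0.0                    # mid-rally: no reward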
Training Procedure
All experiments run for 50 epochs, each consisting of 250,000 time steps.
Because of frame skipping, the agent acts only on every 4th frame; visible frames, frames, and time steps are therefore distinct notions.
Exploration uses epsilon-greedy with epsilon decay: epsilon is annealed from 1 down to a final value of 0.05.
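A sketch of the epsilon-decay schedule described above (the slide gives only the endpoints 1 and 0.05; the annealing horizon here is an assumed placeholder):

    def epsilon(step, eps_start=1.0, eps_end=0.05, anneal_steps=1_000_000):
        # Linearly anneal the exploration rate from eps_start down to
        # eps_end, then hold it constant.
        frac = min(step / anneal_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)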
Collecting the Game Statistics
To measure how the agents behave in Pong, the relevant game events (the ball bouncing off a paddle or off a wall) have to be detected and counted.
- Stella, 'A Multi-Platform Atari 2600 VCS Emulator', was used (in modified form) to extract these events.
To quantify cooperative and competitive behavior, statistics were collected after every training epoch for each rewarding scheme. After each epoch the players' Q-networks were fixed and 10 games were run with different random seeds; the statistics below are averages over those games. During testing the epsilon (exploration rate) value was set to 0.01.
1. Average paddle-bounces per point
The average number of paddle-bounces per point counts how many times the ball travels between the two players before either of them loses the point, i.e. how long the ball stays in play. Randomly playing agents barely manage to hit the ball at all. In the experiments this is labeled 'paddle-bounces'.
2. Average wall-bounces per paddle-bounce
The average number of wall-bounces per paddle-bounce indicates how often the ball touches the top and bottom walls between two paddle hits. A player can send the ball straight across or at a sharp angle that makes it bounce off the walls, so this measures the angle of play. In the experiments this is labeled 'wall-bounces'.
3. Average serving time per point
The average serving time per point measures how long the players take to restart the game after a point ends. After the ball goes out of play, a player must execute a specific action ('fire') to serve the new ball. In the experiments this is labeled 'serving time'.
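A sketch of how the three statistics could be computed from one test game's event log (the event names and log format are illustrative assumptions, not Stella's actual output):

    def game_statistics(events):
        # events: ordered list such as ("paddle_bounce",), ("wall_bounce",),
        # ("serve", seconds) and ("point_end",).
        points = paddles = walls = 0
        serving_time = 0.0
        for ev in events:
            kind = ev[0]
            if kind == "paddle_bounce":
                paddles += 1
            elif kind == "wall_bounce":
                walls += 1
            elif kind == "serve":
                serving_time += ev[1]
            elif kind == "point_end":
                points += 1
        return {
            "paddle_bounces_per_point": paddles / max(points, 1),
            "wall_bounces_per_paddle_bounce": walls / max(paddles, 1),
            "serving_time_per_point": serving_time / max(points, 1),
        }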
Results : Emergence of Competitive Agents
Results : Emergence of Competitive Agents
Summary of the competitive-agent results
Results : Emergence of Cooperative Agents
Results : Emergence of Cooperative Agents
Summary of the cooperative-agent results
Results : Progression from Competition to Collaboration
