Learning Montezuma's Revenge from a Single Demonstration (18.07)
Ryan Lee
Exploration and Learning
• Exploration: find an action sequence with positive reward
• Learning: remember and generalize the action sequence
• Both are needed for a successful agent
Montezuma's Revenge
• One of the hardest games on the Atari 2600
• Sparse rewards → exploration is difficult
https://www.retrogames.cz/play_124-Atari2600.php?language=EN
Simplifying Exploration with Demonstrations
• Solution: shorten the episode
• Start the agent near the end of the demonstration
• Train the agent until it ties or beats the demonstrator's score
• Gradually move the starting point back in time (see the sketch after the action sequence below)
Demonstration action sequence (repeated on successive slides):
Go down Ladder 1 → Go down Rope → Go down Ladder 2 → Jump over Skull → Go up Ladder 3
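As referenced above, here is a minimal sketch of the backward curriculum described on this slide. It is written in Python against hypothetical helpers: `demo_states` holds emulator snapshots saved while recording the demo, `demo_score(t)` is the score the demonstrator obtains from step t onward, and `train_rl` / `mean_score` stand in for an RL trainer (PPO in the actual work) and an evaluator, both starting episodes from a given saved state. It illustrates the idea only; it is not the authors' implementation.

```python
def learn_from_demo(policy, demo_states, demo_score, train_rl, mean_score,
                    step_back=50):
    """Backward curriculum from a single demonstration (illustrative sketch).

    demo_states  -- emulator snapshots saved at each step of the demo
    demo_score   -- demo_score(t): demonstrator's score from step t to the end
    train_rl     -- train_rl(policy, start_state): run some RL updates (e.g. PPO)
                    on episodes that begin from the given saved state
    mean_score   -- mean_score(policy, start_state): average score the current
                    policy achieves when episodes begin from that state
    """
    t = len(demo_states) - step_back              # start near the end of the demo
    while t > 0:
        target = demo_score(t)                    # demonstrator's score from here
        # Train until the agent ties or beats the demonstrator from this point
        while mean_score(policy, demo_states[t]) < target:
            train_rl(policy, demo_states[t])
        t -= step_back                            # move the starting point back in time
    return policy
```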
Result
• 74,500 points on Montezuma's Revenge (state of the art)
• Surpasses the demo score of 71,500
• Exploits an emulator flaw
Comparison with DeepMind's approach
• DeepMind's approach
  • Needs less control over the environment
  • Agents imitate the demo
• This approach
  • Needs full game states in the demo
  • Directly optimizes the game score → less overfitting to a sub-optimal demo
  • Better in multiplayer games, where performance should be optimized against various opponents
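Because this approach needs full game states in the demo, the environment must support saving and restoring emulator snapshots. Below is a small, hedged sketch of what that looks like with gym's Atari wrapper and the Arcade Learning Environment's cloneState / restoreState calls; the env ID, the exact method names (which vary across gym / ale-py versions), and the placeholder action list are assumptions for illustration only.

```python
import gym

# Placeholder action IDs standing in for a recorded human demonstration.
recorded_demo_actions = [0, 2, 3, 3, 2, 0]

env = gym.make("MontezumaRevengeNoFrameskip-v4")
env.reset()

# While "recording" the demo, snapshot the full emulator state at every step.
demo_states = []
for action in recorded_demo_actions:
    demo_states.append(env.unwrapped.ale.cloneState())  # full game state snapshot
    env.step(action)

# Later, start a training episode from step t of the demonstration.
t = len(demo_states) // 2
env.reset()
env.unwrapped.ale.restoreState(demo_states[t])
```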
Remaining Challenges
• Agent cannot reach the exact state in the demo
• Agent needs to generalize between similar states
• Problematic in Gravitar or Pitfall
• Careful hyperparameter tuning needed
• High variance in each run
• NN does not generalize as well as a human
https://blog.openai.com/openai-baselines-ppo/
Thank you!
Original content by OpenAI
• Learning Montezuma's Revenge from a Single Demonstration
You can find more content at
• github.com/seungjaeryanlee
• www.endtoend.ai