
Curious and Useful Things to Know About
Reinforcement Learning
Taehoon Kim
carpedm20

About me
Graduated
Machine learning engineer
+
20+

Reinforcement Learning (RL)

Curious and Useful Things to Know About Reinforcement Learning, NAVER 2017

Environment
Agent
State s_t, Action a_t = 2

Environment
Agent
State s_t, Action a_t = 2, Reward r_t = 1

Environment
Agent
State s_t, Action a_t = 0, Reward r_t = −1

Learning by acting and going through trial and error
Reinforcement Learning
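
The loop on the previous slides can be sketched in a few lines. This is only an illustration: `ToyEnv` is a hypothetical environment made up for this note, in which action 2 earns +1 and every other action earns -1, mirroring the numbers on the slides. The agent simply tries random actions and keeps a running average of the reward each one produced.

```python
import random

class ToyEnv:
    """Hypothetical environment: action 2 earns +1, any other action earns -1."""
    def reset(self):
        return 0  # a single dummy state s_t

    def step(self, action):
        reward = 1 if action == 2 else -1
        next_state = 0
        return next_state, reward

def trial_and_error(num_steps=200, num_actions=3, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    value = [0.0] * num_actions   # running average reward per action
    count = [0] * num_actions
    state = env.reset()
    for _ in range(num_steps):
        action = rng.randrange(num_actions)        # try something
        state, reward = env.step(action)           # see what happens
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return value

values = trial_and_error()
best_action = max(range(len(values)), key=lambda a: values[a])
```

After enough trials the average reward singles out the action that the environment rewards.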

Recent reinforcement learning research

https://deepmind.com/blog/agents-imagine-and-plan/
https://blog.openai.com/learning-to-cooperate-compete-and-communicate/

https://sites.google.com/view/nips17assembly/home
/carpedm20/ai-67616630

2014
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Vinyals, Oriol, et al. "StarCraft II: A New Challenge for Reinforcement Learning."
2016
2017

2014
2016
Reinforcement learning before this point is well known, but..

2014
2016
What about reinforcement learning after it?

Recent reinforcement learning (17.08.16)

1. Multi Agent
2. Planning
3. Meta Learning
4. Guided RL
5. ETC: Exploration, Continuous action, Imitation learning …

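1. Training multiple robots
2. Building strategies
3. Leveraging background knowledge
4. Acting differently depending on the command
5. Others: trying various things, continuous actions, imitation, …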

WARNING
This may be somewhat difficult if you are new to reinforcement learning,
so please focus only on following the overall flow

1. Training multiple robots
Multi Agent RL

Single Agent
https://deepmind.com/research/alphago/alphago-vs-alphago-self-play-games/

Multi Agent settings that require cooperation or competition
Self-driving cars, conversational AI, large-scale factory robots …

Starcraft
Peng, Peng, et al. "Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games." arXiv preprint arXiv:1703.10069 (2017).

Multi-Agent RL
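Multi-agent reinforcement learning

Single Agent learning methods
are hard to use as they are
Deep Q-learning, Policy Gradient …

There are various difficulties, but..
Multi-Agent RL

Non-stationary environment
The uncertainty introduced by other Agents makes learning hard, and past experience cannot be reused directly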

+1 reward when moving closer to B
B
A

+1
+1-1
-1
+1 reward when moving closer to B
B
A

+1+1-1+1+1+1
Q( ) = +2
Q(a_t): the future value that each action a_t will bring
B
A

+1+1-1+1+1+1
Q( ) = +2
+1+1-1+1+1+1
Q( ) = +4
B
A

+1+1-1+1+1+1
Q( ) = +2
+1+1-1+1+1+1
Q( ) = +4
-1-1-1+1-1+1
Q( ) = -2
-1+1-1-1-1-1
Q( ) = -4
B
A
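
As a rough illustration of how a Q value summarizes future rewards, the sketch below sums a reward sequence. The `gamma` discount parameter is an assumption added here (the slides use a plain sum, which corresponds to `gamma=1.0`), and the two sequences are the ones listed above with Q = -2 and Q = -4.

```python
def rollout_return(rewards, gamma=1.0):
    """Discounted sum of a reward sequence; gamma=1.0 gives the plain sum."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Two of the reward sequences listed on the slide.
q_a = rollout_return([-1, -1, -1, +1, -1, +1])
q_b = rollout_return([-1, +1, -1, -1, -1, -1])
```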

What if B suddenly starts moving?
B
B
A

Q( ) = ?
Q( ) = ?  Q( ) = ?
Q( ) = ?
The Q(a_t) that A learned earlier becomes useless
B
A
B
For example, suppose B suddenly teleports
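
A small sketch of why the old Q values stop being useful. Everything here is a made-up toy: a grid with hypothetical coordinates, a one-step Q estimate, and the "+1 when moving closer to B" rule from the slides. A learns while B stands still, then B teleports and the action A believed to be best is punished.

```python
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def distance(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def reward(a_pos, b_pos, action):
    """+1 when the move brings A closer to B, -1 otherwise."""
    dx, dy = ACTIONS[action]
    new_pos = (a_pos[0] + dx, a_pos[1] + dy)
    return 1 if distance(new_pos, b_pos) < distance(a_pos, b_pos) else -1

def learn_q(a_pos, b_pos):
    """One-step Q estimate for each action while B stays where it is."""
    return {action: reward(a_pos, b_pos, action) for action in ACTIONS}

a_pos = (2, 2)
old_q = learn_q(a_pos, b_pos=(4, 2))      # B sits to the right of A
greedy = max(old_q, key=old_q.get)        # the action A believes is best

b_teleported = (0, 2)                     # B jumps to the left of A
reward_after_move = reward(a_pos, b_teleported, greedy)
```

The environment A experiences changed even though A did nothing differently, which is what "non-stationary" refers to.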

What if B is an Agent that receives a different reward?
What if it changes its behavior as it learns?
B
B
Q( ) = ?
Q( ) = ?  Q( ) = ?
Q( ) = ?
A

Q-value learning will become very unstable
B
B
Q( ) = ?
Q( ) = ?  Q( ) = ?
Q( ) = ?
A

Various approaches
Multi-Agent RL

Communication
Mordatch, Igor, and Pieter Abbeel. "Emergence of Grounded Compositional Language in Multi-Agent Populations." arXiv preprint arXiv:1703.04908 (2017)
https://blog.openai.com/learning-to-communicate/
Send messages to all other Agents

Actor-Critic + Centralized Q-value
Share the other Agents' internal information
Lowe, Ryan, et al. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments." arXiv preprint arXiv:1706.02275 (2017)
https://blog.openai.com/learning-to-cooperate-compete-and-communicate/
Centralized Q-value
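
A minimal sketch of the idea, not the architecture from Lowe et al.: each actor decides from its own observation only, while one critic scores the joint observations and actions of all Agents. The linear functions, dimensions, and random weights are placeholders assumed for illustration.

```python
import random

def actor(weights, observation):
    """Decentralized actor: sees only its own observation."""
    return sum(w * o for w, o in zip(weights, observation))

def centralized_q(weights, observations, actions):
    """Centralized critic: scores the joint observations and joint actions."""
    joint = [x for obs in observations for x in obs] + list(actions)
    return sum(w * x for w, x in zip(weights, joint))

rng = random.Random(0)
num_agents, obs_dim = 3, 2
actor_weights = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(num_agents)]
critic_weights = [rng.uniform(-1, 1) for _ in range(num_agents * obs_dim + num_agents)]

observations = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(num_agents)]
actions = [actor(w, o) for w, o in zip(actor_weights, observations)]
q_value = centralized_q(critic_weights, observations, actions)

# Changing another Agent's action changes the value the critic reports.
shifted = [actions[0], actions[1] + 1.0, actions[2]]
q_shifted = centralized_q(critic_weights, observations, shifted)
```

Because the critic sees what the other Agents do, their changing behavior is an input rather than unexplained noise.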

2. Building strategies
Hierarchical RL + Model-based RL

Rewards occur often, so learning is easy

Rewards are too rare, so learning is hard

Sparse Reward
Kulkarni, Tejas D., et al. "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation." Advances in Neural Information Processing Systems. 2016.
Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." arXiv preprint arXiv:1703.01161 (2017).
A non-zero Reward arrives only after about 30 correct actions
Feedback
You must go down the rope, avoid the skull, and climb the ladder to get the key before you earn 100 points

Hierarchical RL
Hierarchical reinforcement learning

A
Action a_t
Non-hierarchical RL

A
Action a_t, Reward r_t
Non-hierarchical RL

Kulkarni, Tejas D., et al. "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation." Advances in Neural Information Processing Systems. 2016
Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." arXiv preprint arXiv:1703.01161 (2017)
Bacon, Pierre-Luc, Jean Harb, and Doina Precup. "The Option-Critic Architecture." AAAI. 2017
A A
Non-hierarchical RL Hierarchical RL

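A A
Goal 1, Goal 2, Goal 3

A A
Grab the rope

A A
Grab the rope, Go down the ladder

A A
Grab the rope, Go down the ladder, Jump

A A
a_{1,t}, a_{2,t}, a_{3,t}
Grab the rope, Go down the ladder, Jump

- - ON
A A
Goal Ω
a_{1,t}, a_{2,t}, a_{3,t}

- - ON
A A
Goal Ω
Action a_{3,t}, Action a_t, Reward r_t
a_{1,t}, a_{2,t}

- - ON
A A
Goal Ω
Action a_{3,t}, Action a_t, Reward r_t, Reward r_t
a_{1,t}, a_{2,t}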
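
The two-level structure can be sketched as follows. This is a toy under loud assumptions: a one-dimensional corridor, hypothetical subgoal positions, and hand-coded policies at both levels (the cited papers learn them). It only shows how a high level picks a goal, a low level picks primitive actions toward it, and how the dense goal-reaching reward differs from the sparse 100-point reward.

```python
SUBGOALS = [3, 6, 9]          # hypothetical positions: rope, ladder, key
FINAL_POSITION = 9

def high_level_policy(position):
    """Pick the next goal: the first subgoal that is still ahead."""
    for goal in SUBGOALS:
        if position < goal:
            return goal
    return SUBGOALS[-1]

def low_level_policy(position, goal):
    """Pick a primitive action a_t (+1 or -1) that moves toward the goal."""
    return 1 if goal > position else -1

def run_episode(max_steps=50):
    position, extrinsic, intrinsic = 0, 0, 0
    goal = high_level_policy(position)
    goals_used = [goal]
    for _ in range(max_steps):
        position += low_level_policy(position, goal)
        if position == goal:
            intrinsic += 1                      # reward for reaching the goal
            if position == FINAL_POSITION:
                extrinsic += 100                # sparse environment reward
                break
            goal = high_level_policy(position)  # switch to the next goal
            goals_used.append(goal)
    return extrinsic, intrinsic, goals_used

extrinsic, intrinsic, goals_used = run_episode()
```

The low level receives feedback three times in this episode, while the environment itself pays out only once at the end.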

Montezuma was solved well

However, it can be solved by memorization

A problem that cannot be solved by memorization
Weber, Théophane, et al. "Imagination-Augmented Agents for Deep Reinforcement Learning." arXiv preprint arXiv:1707.06203 (2017).

Imagine what will actually happen through an internal simulation, then act

Model-free RL + Model-based RL
Deep Q-learning
Policy Gradient
…

Model-free RL + Model-based RL
Imagination
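
A deliberately simplified sketch of acting after internal simulation. It is plain exhaustive lookahead, not the Imagination-Augmented Agent architecture: the model is assumed to be perfect and hand-written, and the corridor with a bait, a trap, and a goal is invented for this example.

```python
from itertools import product

ACTIONS = (-1, +1)

def model(state, action):
    """Internal model of the world (assumed perfect here): (next_state, reward)."""
    next_state = max(0, min(6, state + action))
    if next_state == 0:
        return next_state, -10   # trap at the left end
    if next_state == 6:
        return next_state, +10   # goal at the right end
    if next_state == 2:
        return next_state, +1    # small bait on the way to the trap
    return next_state, 0

def imagine(state, plan):
    """Roll a sequence of actions through the model and sum the imagined rewards."""
    total = 0
    for action in plan:
        state, reward = model(state, action)
        total += reward
    return total

def act_with_imagination(state, depth):
    """Pick the first action of the best imagined plan."""
    best_plan = max(product(ACTIONS, repeat=depth), key=lambda plan: imagine(state, plan))
    return best_plan[0]

short_sighted_action = act_with_imagination(state=3, depth=1)   # sees only the bait
planned_action = act_with_imagination(state=3, depth=3)         # sees the goal
```

With one step of imagination the agent walks toward the bait; with three steps it heads for the goal instead.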

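3. Leveraging background knowledge
Meta Learning

How can an agent use prior experience, as people do,
to adapt well to a new environment?
Meta Learning

Several approaches
Meta Learning

How can weight updates be made faster?
http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/

How can the optimal network be found?

How can a model classify well from only a little data?

How can an agent clear a game it has never seen?
Meta Learning + RL

Meta Learning

Meta Reinforcement Learning
How can an agent clear a game it has never seen?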

Duan, Yan, et al. "RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning." arXiv preprint arXiv:1611.02779 (2016).
https://www.youtube.com/playlist?list=PLp24ODExrsVeA-ZnOQhdhX6X7ed5H_W4q

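One round = one Episode

Information is not reset when an Episode ends; it keeps being used

N Episodes are defined as one Trial
Over N Episodes, the agent learns how to find the optimal play

Each new Trial plays a new game (here, a new map)

A more realistic example: clear Mario to the end within N plays

Train on a variety of Mario games, then play a Mario game not seen in training

Train on a variety of racing games, then play a racing game not seen in training
e.g. GTA, a real self-driving car

RL^2: Recurrent Network
https://www.youtube.com/playlist?list=PLp24ODExrsVeA-ZnOQhdhX6X7ed5H_W4q
Optimize the Return of the Trial, not the Return of the Episode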
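
The Episode/Trial bookkeeping can be sketched like this. It is only an illustration: a hand-coded dictionary stands in for the hidden state of the recurrent network, and the task is a hypothetical two-armed bandit where one Episode is a single pull. What matters is that memory survives across Episodes and is cleared only when a new Trial begins, and that the quantity to optimize is the Trial total.

```python
import random

def play_episode(best_arm, memory, rng):
    """One Episode = one pull. Memory (the recurrent state) carries over."""
    untried = [arm for arm in (0, 1) if arm not in memory]
    if untried:
        arm = rng.choice(untried)            # explore while memory has gaps
    else:
        arm = max(memory, key=memory.get)    # exploit what earlier Episodes found
    reward = 1.0 if arm == best_arm else 0.0
    memory[arm] = reward
    return reward

def play_trial(num_episodes, rng):
    """One Trial = N Episodes on a freshly sampled task; memory resets only here."""
    best_arm = rng.randrange(2)              # a new task for every Trial
    memory = {}
    episode_returns = [play_episode(best_arm, memory, rng) for _ in range(num_episodes)]
    return sum(episode_returns), episode_returns

rng = random.Random(0)
trial_return, episode_returns = play_trial(num_episodes=5, rng=rng)
```

Early Episodes may score poorly because they are spent exploring, which is acceptable when the objective is the Return of the whole Trial.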

Model-Agnostic Meta-Learning
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." arXiv preprint arXiv:1703.03400 (2017).
Train on several Tasks at once to find a central point of the weights
Then adapt to a new Task with a single gradient update
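
A one-parameter sketch of the idea, under strong simplifying assumptions: each hypothetical Task i has the loss 0.5 * (theta - c_i)^2, so the gradient through the inner update can be written by hand. The step sizes `alpha` and `beta` and the task centers are invented for illustration.

```python
def inner_update(theta, center, alpha):
    """One gradient step on the task loss 0.5 * (theta - center)**2."""
    return theta - alpha * (theta - center)

def meta_gradient(theta, center, alpha):
    """Gradient of the post-update loss with respect to the initial theta."""
    adapted = inner_update(theta, center, alpha)
    # d/dtheta of 0.5 * (adapted - center)**2, where d(adapted)/d(theta) = 1 - alpha
    return (adapted - center) * (1.0 - alpha)

def maml(task_centers, alpha=0.5, beta=0.1, steps=500, theta=0.0):
    for _ in range(steps):
        grad = sum(meta_gradient(theta, c, alpha) for c in task_centers) / len(task_centers)
        theta -= beta * grad
    return theta

task_centers = [-2.0, 1.0, 4.0]          # hypothetical 1-D Tasks
theta_star = maml(task_centers)          # settles near the middle of the Tasks
adapted_to_new_task = inner_update(theta_star, center=3.0, alpha=0.5)
```

In this toy the meta-learned starting point lands at the mean of the task centers, from which a single gradient step moves halfway to any new Task.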

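4. Acting differently depending on the command

Only one goal
Self-driving = infinitely many goals
Drive to school
Follow the car ahead
Park in the parking lot
...

Guided RL
Train the Agent to act differently depending on the command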

Teaching Machines to Understand Visual Manuals
via Attention Supervision for Object Assembly
Taehoon Kim^1, Youngwoon Lee^2, Joseph Lim^2

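How can an agent adapt well to a new environment, as people do?
Generalization in Reinforcement Learning

http://www.ikea.com/ms/en_US/customer_service/assembly_instructions.html
A person who has learned to assemble a chair

Can they assemble a desk without a manual?

But what if there is a manual?

Even people must read the manual
to solve a new problem

Tangram puzzle, Furniture assembly
Problems that require Hierarchical Planning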

State s_t, Manual m

How?
Approached in two ways

Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
…
…
Pointer Network

[Figure: architecture diagram with segment states s_{1,t}, s_{2,t}, …, s_{N,t}, manual M_t, an encoder (enc), policy π, and value V]
Image segmentation + Pointer Network

However, training the Pointer Network
requires additional Supervision
Drawback
Which segment contains the manual piece
…
…

Attention
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.

Focus (Attention) on the part that corresponds to the manual
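
Soft attention in its simplest form can be sketched as a softmax over match scores. The feature vectors below are invented: three hypothetical regions of the state image and one query derived from the manual. This is generic dot-product attention, not the Guided Attention model of the talk.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical features: three regions of the state image and one manual query.
region_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
region_values = [[10.0], [20.0], [30.0]]
manual_query = [0.0, 5.0]
weights, context = attend(manual_query, region_keys, region_values)
```

The weights form an attention map over regions, and the region whose key best matches the manual query receives most of the weight.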

Query
Attention maps
Guided
Attention
π
V
Manual
State
Context Fusion
Map Fusion
…
…
…
…
…
Guided Attention + A3C

And then, after a complicated training process..
Curriculum Learning
Semi-supervised Learning
Self-supervision
…

https://sites.google.com/view/nips17assembly/home
: Input

Other Guided RL research
Text as Manual

Gated-Attention + A3C
Chaplot, Devendra Singh, et al. "Gated-Attention Architectures for Task-Oriented Language Grounding." arXiv preprint arXiv:1706.07230 (2017)
https://sites.google.com/view/gated-attention/home

Self-Supervision + A3C
Hermann, Karl Moritz, et al. "Grounded language learning in a simulated 3D world." arXiv preprint arXiv:1706.06551 (2017)
https://www.youtube.com/watch?v=wJjdu1bPJ04
An Agent that must understand even the relations between objects

5. ETC
Exploration, Continuous action, Imitation learning

Exploration
Taking a risk (a random action) instead of the action believed to be best so far

Exploration
Taking random risks (actions)

Exploration
Taking random risks (actions)
Exploitation
Taking the best action learned so far
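
One common way to mix the two is epsilon-greedy action selection, sketched below. The Q values and the `epsilon` settings are made up for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # Exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])   # Exploitation

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
always_exploit = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
always_explore = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]
```

With `epsilon=0.0` the agent always repeats the action it currently rates highest; with `epsilon=1.0` it ignores what it has learned and acts at random.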

Exploration
Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." arXiv preprint arXiv:1705.05363 (2017)
https://pathak22.github.io/noreward-rl/
Curiosity reward
+
Inverse Dynamics Model
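
The curiosity reward can be sketched as the prediction error of a forward model. This toy uses a lookup table on raw states in place of a neural network and leaves out the inverse dynamics model, which the paper uses to learn the features the prediction is made in. The states, actions, and learning rate are assumptions for illustration.

```python
class ForwardModel:
    """Tabular stand-in for a learned forward model: predicts the next state."""
    def __init__(self, learning_rate=0.5):
        self.prediction = {}
        self.learning_rate = learning_rate

    def curiosity_reward(self, state, action, next_state):
        predicted = self.prediction.get((state, action), 0.0)
        error = (next_state - predicted) ** 2       # surprise = prediction error
        # Move the prediction toward what actually happened.
        self.prediction[(state, action)] = predicted + self.learning_rate * (next_state - predicted)
        return error

model = ForwardModel()
# Repeating the same transition becomes less and less surprising.
rewards = [model.curiosity_reward(state=0, action=1, next_state=4.0) for _ in range(5)]
# A transition never seen before is fully surprising again.
novel = model.curiosity_reward(state=7, action=0, next_state=4.0)
```

The reward fades for familiar transitions and stays high for unfamiliar ones, which pushes the agent toward parts of the environment it has not yet understood.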

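Curriculum Learning
Train by raising the difficulty step by step, from easy problems to hard ones

Training time
Difficulty: Low, Medium, High
Non-curriculum learning
Probability of sampling a problem of a given difficulty

Training time
Sample every difficulty with equal probability from the start of training to the end
Non-curriculum learning

Training time
Curriculum learning
At first, train mostly on the easiest problems

Training time
Difficulty: Low, Medium, High | Low, Medium, High
Curriculum learning
After a certain condition is met, start solving harder problems
80% success rate reached on Low problems

Training time
Difficulty: Low, Medium, High | Low, Medium, High | Low, Medium, High
80% success rate reached on Low problems, 80% success rate reached on Medium problems
Curriculum learning
After a certain condition is met, start solving harder problems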
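
The schedule above could be sketched as follows. The details are assumptions made for this note: probability is spread evenly over the unlocked levels, success is measured over a sliding window of 50 attempts on the hardest unlocked level, and fixed success probabilities stand in for an agent that is actually learning.

```python
import random

LEVELS = ["low", "medium", "high"]

def sampling_probabilities(unlocked):
    """Spread probability evenly over the difficulty levels unlocked so far."""
    return [1.0 / unlocked if i < unlocked else 0.0 for i in range(len(LEVELS))]

def run_curriculum(success_prob, threshold=0.8, window=50, max_steps=5000, seed=0):
    rng = random.Random(seed)
    unlocked = 1                      # start with only the easiest level
    recent = []                       # recent outcomes on the hardest unlocked level
    history = [unlocked]
    for _ in range(max_steps):
        level = rng.choices(range(len(LEVELS)), weights=sampling_probabilities(unlocked))[0]
        solved = rng.random() < success_prob[level]
        if level == unlocked - 1:
            recent = (recent + [solved])[-window:]
        if len(recent) == window and sum(recent) / window >= threshold and unlocked < len(LEVELS):
            unlocked += 1             # 80% success reached: add a harder level
            recent = []
            history.append(unlocked)
    return history

history = run_curriculum(success_prob=[1.0, 1.0, 1.0])
```

`history` records how many levels were unlocked over time; a non-curriculum baseline would correspond to calling `sampling_probabilities(3)` from the first step.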

Curriculum Learning + GAN
Held, David, et al. "Automatic Goal Generation for Reinforcement Learning Agents." arXiv preprint arXiv:1705.06366 (2017)
https://sites.google.com/view/goalgeneration4rl

Continuous Action
Training an Agent with continuous actions (e.g. robots)

Discrete Action: a_t^i ∈ {0,1}
Up, Down
ON -

Continuous Action: −1 ≤ a_t^i ≤ 1
Shoulder, Knee, Waist
0.1 -0.2 0.5
Discrete Action: a_t^i ∈ {0,1}
Up, Down
ON -
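
The difference between the two action types shows up in how a policy samples them. The sketch below is illustrative only: the probabilities, means, and standard deviation are invented, and a clipped Gaussian is just one simple way to stay inside the [-1, 1] range.

```python
import random

def sample_discrete(probabilities, rng):
    """One on/off choice per button: a_t^i in {0, 1}."""
    return [1 if rng.random() < p else 0 for p in probabilities]

def sample_continuous(means, std, rng):
    """One real value per joint, clipped to -1 <= a_t^i <= 1."""
    return [max(-1.0, min(1.0, rng.gauss(m, std))) for m in means]

rng = random.Random(0)
buttons = sample_discrete([0.9, 0.1], rng)                 # up, down
joints = sample_continuous([0.1, -0.2, 0.5], 0.3, rng)     # shoulder, knee, waist
```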

Continuous Action
Schulman, John, et al. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017)
https://blog.openai.com/openai-baselines-ppo/
PPO
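
The core of PPO is its clipped surrogate objective, sketched here for a single sample. `ratio` is the new policy's probability of the action divided by the old policy's, `advantage` says how much better the action was than expected, and `clip_eps=0.2` is an assumed setting.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

gain_capped = ppo_clip_objective(ratio=1.5, advantage=2.0)    # improvement is capped
loss_kept = ppo_clip_objective(ratio=1.5, advantage=-2.0)     # penalty is not softened
```

Taking the minimum removes the incentive to move the policy far from the old one in a single update, while still penalizing changes that make things worse.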

Continuous Action
Heess, Nicolas, et al. "Emergence of Locomotion Behaviours in Rich Environments." arXiv preprint arXiv:1707.02286 (2017)
https://www.youtube.com/watch?v=hx_bgoTF7bs
Distributed PPO

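Reinforcement learning Reinforcement learning Reinforcement learning Reinforcement learning
Reinforcement learning Reinforcement learning Reinforcement learning Reinforcement learning
Reinforcement learning Reinforcement learning Reinforcement learning Reinforcement learning
Reinforcement learning Reinforcement learning Reinforcement learning Reinforcement learning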

Neural Turing Machine
Differentiable Neural Computer
Neural Module Network
Neural Programmer-Interpreter
Programmable Agent
…
Areas of interest beyond reinforcement learning
Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
Andreas, Jacob, et al. "Neural module networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Reed, Scott, and Nando De Freitas. "Neural programmer-interpreters." arXiv preprint arXiv:1511.06279 (2015).
Denil, Misha, et al. "Programmable agents." arXiv preprint arXiv:1706.06383 (2017).

I would like to talk about all of them, but for today..

Generative Model
GANs, for example..

Berthelot, David, Tom Schumm, and Luke Metz. "Began: Boundary equilibrium generative adversarial networks." arXiv preprint arXiv:1703.10717 (2017).
https://github.com/carpedm20/BEGAN-tensorflow

Kim, Taeksoo, et al. "Learning to discover cross-domain relations with generative adversarial networks." arXiv preprint arXiv:1703.05192 (2017).
https://github.com/carpedm20/DiscoGAN-pytorch

Shrivastava, Ashish, et al. "Learning from simulated and unsupervised images through adversarial training." arXiv preprint arXiv:1612.07828 (2016).
https://github.com/carpedm20/simulated-unsupervised-tensorflow

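KakaoBank is causing a stir, passing one million accounts just five days after launch.
This is the scene at CVPR 2017. Many computer vision researchers visited the NAVER LABS booth.
Today's weather is 3 degrees warmer than yesterday. A total of 3 events are registered.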

.voice
Voice Synthesis Technologies for Developers

http://www.devsisters.com/jobs/

END
http://carpedm20.github.io/

