Reinforcement	Learning:	An	Introduction	
Richard	S.	Sutton	and	Andrew	G.	Barto
Presented by carpedm20
Chapter 1, 2, 3
Reinforcement Learning:
learning how to map situations to actions so as to maximize a numerical reward signal
Agent: the learner and decision maker
Environment: everything outside the agent
Policy π : S × A → [0, 1], the probability of taking each action in each state
Episodic and Continuing tasks
• Agent-environment interaction breaks down into
• a sequence of separate episodes (episodic tasks)
  s_0, s_1, …, s_T, with r_t = 0 for all t > T
• or just one long sequence (continuing tasks)
Value functions, V(s)
• Estimate how good it is for the agent to be in a given state
• V : S → ℝ
• "how good" is defined in terms of the future rewards that can be expected
• future rewards depend on which actions the agent takes, hence on the policy π
• V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
• where R_t is the total return and r_{t+k+1} is an immediate reward
V^π(s) = E_π[ R_t | s_t = s ]
      = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
      = E_π[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s ]
      = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ E_π[ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_{t+1} = s' ] ]
      = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
Bellman	equation
<backup diagram for Q^π>  <backup diagram for V^π>
• state-action pair (s, a); policy π(a|s); transition probability P(s'|s,a)
• V^π(s) = E_π[ R_t | s_t = s ] is a recursive expression in terms of V^π(s') of the successor states
• Q^π(s, a) is likewise recursive in terms of Q^π(s', a')
• in a stochastic MDP, which successor state s' follows s depends on the action taken
Action-value functions, Q(s, a)
• The value of taking action a in state s under a policy π
• Q : S × A → ℝ
• how good it is for the agent to take action a in state s under a policy π
• Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
• Optimal action-value function : Q*(s, a) = max_π Q^π(s, a)
Optimal Value Functions
• Solving RL = finding an optimal policy
• A policy π is better than or equal to π' if its expected return is greater than or equal to that of π' for every state:
  π ≥ π' iff V^π(s) ≥ V^{π'}(s) for all s
• Optimal state-value function : V*(s) = max_π V^π(s)
• Q*(s, a) = max_π Q^π(s, a) : the expected return for taking action a in state s and thereafter following an optimal policy
• Express Q* in terms of V*:
Q*(s, a) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
        = E[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a ]
        = E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
        = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ]
Bellman optimality equation
• V*(s) should equal the expected return for the best action from that state
Bellman optimality equation
V*(s) = max_a E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
     = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Q*(s, a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ]
        = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ max_{a'} Q*(s', a') ]
• For finite MDPs, the Bellman optimality equation has a unique solution, independent of the policy
• DP algorithms are obtained by turning Bellman equations into assignments,
  i.e. into update rules for improving approximations of the desired value functions
Chapter 4
Dynamic Programming
ONLY for known MDP!
Policy Evaluation
• How do we compute V^π(s) for an arbitrary policy π, i.e. the value of every state under π?
  = policy evaluation
V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
      = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
i.e. the expectation of future reward over all possible actions and successor states
Policy Evaluation
• Assume complete knowledge of the environment (known MDP)
• Then V^π gives |S| unknown variables (V^π(s), s ∈ S)
  and |S| simultaneous linear equations
• Instead of solving them directly, start from an arbitrary approximate value function V_0 and iterate: V_0, V_1, V_2, …
V_{k+1}(s) = E_π[ r_{t+1} + γ V_k(s_{t+1}) | s_t = s ]
          = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_k(s') ]
<Iterative	policy	evaluation>
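As a concrete illustration of the update above, here is a minimal sketch of tabular iterative policy evaluation. The data layout is an assumption made for the example: `P[s][a]` is a list of `(prob, next_state, reward)` tuples and `pi[s][a]` gives the probability of taking action `a` in state `s`.

```python
# Minimal sketch of iterative policy evaluation for a known, tabular MDP.
# Assumed (hypothetical) layout: P[s][a] = [(prob, next_state, reward), ...],
# pi[s][a] = probability of action a in state s, states = iterable of states.
def policy_evaluation(states, P, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}           # arbitrary initial V_0
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in pi[s].items():
                for prob, s_next, r in P[s][a]:
                    v_new += pi_sa * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                    # sweep updates applied in place
        if delta < theta:                   # stop once a sweep barely changes V
            return V
```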
Iterative policy evaluation
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/DP.pdf
• Each sweep applies the update to every state, moving V_k closer to V^π
• As k → ∞, V_k converges to V^π
Iterative policy evaluation
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/DP.pdf
full	backup
Gridworld example
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/DP.pdf
Gridworld example
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/DP.pdf
-1 × 0.25 + (-2) × 0.75 = -1.75
Gridworld example
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/DP.pdf
Given the value function,
a better policy can be obtained simply by acting greedily with respect to it
Policy improvement
• Policy improvement: after policy evaluation, find a better policy
• Policy evaluation gives V^π(s) for an arbitrary policy π
• For some state s, would choosing an action other than π(s), and following π afterwards, give more than V^π(s)?
• If so, we can build a better policy π'
• Such a π' can be obtained as π' = greedy(V^π)
Policy improvement theorem
• How do we know the new policy is better than (or as good as) the old one?
Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ]
         = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
         = E_π[ r_{t+1} + Σ_{k=1}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ] = E_π[ r_{t+1} + γ V^π(s') | s_t = s, a_t = a ]
• If Q^π(s, π'(s)) ≥ V^π(s), then π' is at least as good as π:
  Q^π(s, π'(s)) ≥ V^π(s) for all s ∈ S implies
  V^{π'}(s) ≥ V^π(s) for all s ∈ S. WHY?
  The value obtained by following π' is at least the current value
When improvement stops (the greedy policy is as good as the old one), the values satisfy the Bellman optimality equation:
Q(s, a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ max_{a'} Q(s_{t+1}, a') ]
Policy improvement theorem
• Even starting from an arbitrary policy, we can keep finding a better one:
• DP computes V_{k+1}(s) from V_k(s) until the evaluation converges
• Once V^π(s) is available, check whether it can be improved
• If for some s an action gives a higher value than V^π(s), switching to it yields a better policy π'
• Evaluate the new policy π', improve it again, and repeat
Greedy policy
π'(s) = argmax_a Q^π(s, a)
     = argmax_a E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
     = argmax_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
• The greedy policy selects the action whose value looks best after one step (t+1) of lookahead, according to V^π
<Policy	improvement	theorem>
Policy iteration
• In the gridworld example, the improved policy is already an optimal policy, π' = π*
• For finite MDPs, policy iteration always
• converges to an optimal policy
<Policy iteration>
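Below is a minimal sketch of policy iteration, reusing the `policy_evaluation` sketch above and the same hypothetical `P[s][a]` layout; it alternates full evaluation with greedy improvement until the policy is stable.

```python
# Sketch of policy iteration: evaluate the current policy, then improve it greedily.
def q_from_v(P, V, s, gamma):
    # action values for one state, computed from the current V
    return {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for a, outcomes in P[s].items()}

def policy_iteration(states, P, gamma=0.9):
    policy = {s: next(iter(P[s])) for s in states}   # arbitrary deterministic start
    while True:
        pi = {s: {policy[s]: 1.0} for s in states}   # deterministic policy as probabilities
        V = policy_evaluation(states, P, pi, gamma)  # policy evaluation
        stable = True
        for s in states:                             # greedy policy improvement
            best_a = max(q_from_v(P, V, s, gamma).items(), key=lambda kv: kv[1])[0]
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:                                   # pi' == pi  =>  optimal
            return policy, V
```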
Value iteration
• Drawback of policy iteration:
• the policy evaluation step must run until V^π converges
• but as the gridworld example showed, the greedy policy often becomes optimal
  long before the values converge
• value iteration removes this wasted effort
Value iteration
• Combines policy improvement with a truncated policy evaluation
Value iteration
• Policy improvement and a truncated policy evaluation are merged into a single backup
<Policy iteration><Value iteration>
Each sweep of value iteration effectively combines
one sweep of policy evaluation and one sweep of policy improvement
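Here is a minimal sketch of value iteration on the same hypothetical `P[s][a]` layout: each sweep applies the max-backup directly, and a greedy policy is extracted at the end.

```python
# Sketch of value iteration: one max-backup per state per sweep.
def value_iteration(states, P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])               # Bellman optimality backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extract a greedy policy from the (approximately) optimal V
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in states}
    return policy, V
```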
Policy iteration & Value iteration
• Both methods always converge to an optimal policy,
only for discounted finite MDPs
Asynchronous Dynamic Programming
• Drawbacks of DP: every sweep updates all states of the MDP,
• and the sweeps must be iterated until convergence
• with a large state set, even a single sweep is expensive
• Asynchronous DP algorithms
• back up the values of states in any order
• in-place dynamic programming
• prioritized sweeping
• real-time dynamic programming
In-Place Dynamic Programming
• Synchronous value iteration : stores two copies of the value function
V_new(s) ← max_{a∈A} Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_old(s') ]
V_old ← V_new
• Asynchronous (in-place) value iteration : stores only one copy of the value function
V(s) ← max_{a∈A} Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V(s') ]
Real-Time Dynamic Programming
• Back up only the states the agent actually visits
• Use the agent's experience to select which states to update
• At each time-step, after S_t, A_t, R_{t+1}:
V(S_t) ← max_{a∈A} Σ_{s'} P(s'|S_t,a) [ R(S_t,a,s') + γ V(s') ]
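The asynchronous variants above all reuse the same single-state backup; the following sketch (same hypothetical `P[s][a]` layout as before) shows it as a stand-alone helper that in-place or real-time DP can apply to states in any order.

```python
# One Bellman optimality backup for a single state. Asynchronous / real-time DP
# applies this to states in any order, e.g. only to the states the agent visits.
def backup_state(V, P, s, gamma=0.9):
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])

# real-time DP style usage: after observing the current state s_t,
#   V[s_t] = backup_state(V, P, s_t)
```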
Dynamic Programming
ONLY for known MDP!
Chapter 5
Monte-Carlo Learning
For unknown MDP!
We don't know the MDP's transitions in the real world!
Model-free prediction
Model-free control
Policy Evaluation: estimate V^π(s)  ⇄  Policy Improvement: obtain π'
(the two alternate)
Definition
• prediction problem (policy evaluation) :
  given π, estimate the value function V^π(s)
• control problem (policy improvement) :
  find the optimal policy π*
Monte-Carlo Reinforcement Learning
• Return : total discounted reward
  G_t = R_{t+1} + γ R_{t+2} + … + γ^{T-t-1} R_T
• Value function : expected return
  V^π(s) = E_π[ G_t | S_t = s ]
• MC policy evaluation replaces the expected return V^π(s)
  with the empirical mean of the observed returns G_t
  → it learns directly from episodes of experience,
  estimating V^π(s) without a model
First-Visit MC Policy Evaluation
Generate episodes by following π; for each episode, update V(s) using only the first time-step at which s is visited
→ the counter N(s) is incremented at most once per episode
Maintain and update V(s), N(s), S(s) (S(s) = sum of returns observed from s)
Every-Visit MC Policy Evaluation
Generate episodes by following π; update V(s) at every time-step at which s is visited
→ the counter N(s) may be incremented several times within one episode
Maintain and update V(s), N(s), S(s)
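A minimal sketch of both variants follows. `sample_episode(pi)` is an assumed helper that returns one episode as a list `[(s_0, r_1), (s_1, r_2), …, (s_{T-1}, r_T)]` generated by following π; the running-mean form below makes a separate S(s) table unnecessary.

```python
from collections import defaultdict

# Sketch of Monte-Carlo policy evaluation (first-visit or every-visit).
def mc_evaluation(sample_episode, pi, n_episodes=1000, gamma=1.0, first_visit=True):
    N = defaultdict(int)      # visit counts N(s)
    V = defaultdict(float)    # value estimates V(s), kept as running means
    for _ in range(n_episodes):
        episode = sample_episode(pi)
        returns, G = [], 0.0
        for s, r in reversed(episode):        # compute G_t backwards from the end
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()                     # returns[t] = (S_t, G_t)
        seen = set()
        for s, G in returns:
            if first_visit and s in seen:
                continue                      # first-visit: count s at most once per episode
            seen.add(s)
            N[s] += 1
            V[s] += (G - V[s]) / N[s]         # incremental mean, so no S(s) table needed
    return V
```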
Back to basics : Incremental Mean
μ_k = (1/k) Σ_{j=1}^{k} x_j
   = (1/k) ( x_k + Σ_{j=1}^{k-1} x_j )
   = (1/k) ( x_k + (k-1) μ_{k-1} )
   = μ_{k-1} + (1/k) (x_k - μ_{k-1})
• Only need memory for k and μ_{k-1}
• General form:
NewEstimate ← OldEstimate + StepSize (Target - OldEstimate)
Error = Target - OldEstimate
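A tiny check of the recursion above; the values are made up for the example, and the incremental estimate ends up equal to the batch mean.

```python
# Verify that the incremental-mean recursion matches the batch mean.
xs = [2.0, 5.0, 1.0, 4.0]
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / k        # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
print(mu, sum(xs) / len(xs))      # both print 3.0
```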
Incremental Monte-Carlo Updates
• Update V(s) incrementally after each episode S_1, A_1, R_2, …, S_T
• For each state S_t with return G_t:
  N(S_t) ← N(S_t) + 1
  V(S_t) ← V(S_t) + (1 / N(S_t)) (G_t - V(S_t))
• In non-stationary problems it can help to use a fixed constant step size,
  which gradually forgets old episodes
  V(S_t) ← V(S_t) + α (G_t - V(S_t))
Why?
When the environment changes over time, old experience matters less than recent experience,
so older episodes should be weighted down rather than kept forever.
Only V(s) and N(s) need to be stored; S(s) is no longer needed
Advantages of MC
• The estimate V^π(s) for each state is independent of the other states
• DP : V_{k+1}(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_k(s') ]
• MC : V(S_t) ← V(S_t) + (1 / N(S_t)) (G_t - V(S_t))
• MC does not bootstrap
• Computation independent of the size of the state space
• We can estimate V^π(s) only for a subset of states of interest, by generating episodes that start from those states
• This can be much cheaper than DP
DP
1.	all	actions
2.	only	one	transition
MC
1.	sampled
2.	to	the	end
Limitations of MC
• If some state s is never visited, an accurate V(s) cannot be estimated for it,
  i.e. sufficient exploration is not guaranteed
• On the other hand, since episodes can be started from the states we actually care about,
  we do not have to cover all of S; estimating values on a subset of S
  can be enough
Summary
• MC methods
• learn from sampled episodes
• Model-free: no knowledge of MDP transitions or rewards required
• learn from complete episodes: no bootstrapping, always use the complete return
• simplest possible idea: V(s) = mean return
• caveat:
• only applicable to episodic MDPs; the episode must terminate before MC can update!
Chapter 6
Temporal-Difference Learning
For unknown MDP!
We don't know the MDP's transitions in the real world!
Temporal-Difference (TD) Learning
• TD methods
• learn directly from experience, without storing complete episodes
• Model-free: no knowledge of MDP transitions / rewards required
• like DP, learn from current estimates, without having to reach a terminal state
• unlike MC, they use bootstrapping
• they update a guess towards a guess (from their own estimates)
Temporal-Difference (TD) Learning
• Goal : learn V^π(s) online from experience under policy π
• Incremental every-visit MC
• update V(S_t) towards the actual return G_t
• V(S_t) ← V(S_t) + α (G_t - V(S_t))
• Simplest temporal-difference learning algorithm: TD(0)
• update V(S_t) towards the estimated return R_{t+1} + γ V(S_{t+1}) = immediate reward +
discounted value of next step
• V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) - V(S_t))
TD	target
TD	error
Tabular TD(0)
(Backup diagram: from state s, sample one action a ~ π and one transition to the successor s'.)
TD(0) uses a sample backup: instead of a full backup over every possible successor s', it updates from a single sampled transition.
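A minimal sketch of tabular TD(0) follows. `env_reset()`, `env_step(s, a)` and `pi(s)` are assumed helpers: `pi(s)` samples an action, `env_step` returns `(reward, next_state, done)`.

```python
from collections import defaultdict

# Sketch of tabular TD(0) policy evaluation.
def td0_evaluation(env_reset, env_step, pi, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            a = pi(s)
            r, s_next, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])   # TD target
            V[s] += alpha * (target - V[s])                     # alpha * TD error
            s = s_next
    return V
```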
MC vs TD
• Example: a near-crash while driving
• With MC
• the outcome is only known once the episode ends; outcome = G_t
• even if the trajectory S_t, S_{t-1}, … came very close to a crash, no crash actually happened,
  so no negative reward is observed and nothing is learned
• only after a crash actually occurs can the negative value be updated
• With TD
• the estimate of the next state can be used before the episode ends, i.e. V(S_{t+1}) = an estimate
• the near-crash trajectory lets us anticipate the negative reward right away
• so the dangerous states can be given a negative value even though no crash actually happened
MC vs TD
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
MC vs TD
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
MC vs TD (1)
• TD can learn before knowing the final return
• it learns online, after every step
• MC must wait until the end of the episode, when the return is known
• TD can learn without ever seeing the final return
• TD can learn from incomplete episodes (not yet terminated, or never terminating)
• MC can only learn from complete episodes
• MC only works in episodic (terminating) environments
• TD also works in continuing (non-terminating) environments
Bias/Variance Trade-off
• Return : G_t = R_{t+1} + γ R_{t+2} + … + γ^{T-1} R_T is an unbiased estimate of V^π(S_t)
• True TD target : R_{t+1} + γ V^π(S_{t+1}) is an unbiased estimate of V^π(S_t)
• TD target : R_{t+1} + γ V_k(S_{t+1}) is a biased estimate of V^π(S_t)
• because V_k is itself only an estimate, not the true value (hence the bias)
• But the TD target has much lower variance than the return:
• the return depends on many random actions, transitions and rewards over the whole episode,
  so its variance is large
• the TD target depends on only one random action, transition and reward,
  so although it is biased, its variance is small
MC vs TD (2)
• MC : high variance, zero bias
• Good convergence property (even with function approximation)
• not very sensitive to the initial values, since it uses actual returns and does not bootstrap
• TD : low variance, some bias
• usually more efficient than MC
• TD(0) converges to V^π(s)
• but with function approximation it does not always converge (only in specific cases)
• more sensitive to the initial values
Random Walk example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
Random Walk example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
bootstrapping (TD) reaches the true values faster in this example
Batch MC & TD
• Given infinite experience, both MC and TD converge: V_k(s) → V^π(s)
• But what if we only have a finite batch of K episodes (batch solution)?
  s_1^1, a_1^1, r_2^1, …, s_{T_1}^1
  ⋮
  s_1^K, a_1^K, r_2^K, …, s_{T_K}^K
• What do MC and TD(0) converge to if we apply them repeatedly to this batch?
AB Example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
AB Example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
MC : V(A) = 0
TD(0) : V(A) = 0.75
Certainty Equivalence
Build an MDP model from all of the observed data:
• TD(0) implicitly fits the maximum-likelihood MDP
  transition probabilities from transition counts
  mean rewards, divided by the number of visits
• MC, in contrast, fits only the observed returns
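To make the certainty-equivalence idea concrete, here is a minimal sketch of the maximum-likelihood model built from a batch of episodes; the episode format (lists of `(s, a, r, s_next)` tuples) is an assumption for the example.

```python
from collections import defaultdict

# Sketch: fit the maximum-likelihood MDP from K observed episodes.
def fit_ml_mdp(episodes):
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    reward_sum = defaultdict(float)                  # total reward per (s, a)
    visits = defaultdict(int)                        # N(s, a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
    # transition counts divided by N(s, a), mean reward per (s, a)
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat
```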
MC & TD (3)
• TD exploits the Markov property
• TD implicitly uses the MDP structure (it assumes the state fully describes the environment),
  which makes it efficient
• usually more efficient in Markov environments
• MC does not exploit the Markov property
• usually more effective in non-Markov environments
• e.g. partially observable MDPs
MC backup
sample	one	complete	trajectory
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
TD backup
sample an	action,	look	ahead one	step
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
DP backup
look	one	step	ahead,	but	no	sample
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
Bootstrapping & Sampling
• Bootstrapping : don't use the real return; update towards an estimate
• MC does not bootstrap
• DP bootstraps : uses an estimate as the target
• TD bootstraps
• Sampling
• MC samples
• DP does not sample
• TD samples
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
n-step Prediction
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
Large Random Walk Example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
MC:	high	variance
TD:	low	variance
On-line : updates to V(S_t) are applied at every step
Off-line : updates to V(S_t) are accumulated and applied only at the end of the episode
Large Random Walk Example
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
MC:	high	variance
TD: low variance
• A larger version of the earlier random walk (e.g. 10 states vs 1000 states)
• the best n for the 10-state version differs from the best n for the 1000-state version
• so n-step TD is not robust to the environment
Averaging n-step Returns
• n-step return: G_t^(n) = R_{t+1} + γ R_{t+2} + … + γ^(n-1) R_{t+n} + γ^n V(S_{t+n})
• Instead of looking only 1 step or only 2 steps ahead, we can average several n-step returns
• e.g. average the 2-step and 4-step returns:
  G_t^(2,4) = (1/2) G_t^(2) + (1/2) G_t^(4)
• Can we combine the information from all n, i.e. all look-ahead depths, efficiently?
• Yes: the λ-return
• each n-step return G_t^(n) is weighted by a factor λ^(n-1)
• (1 - λ) is the normalizing factor that makes the weights sum to 1
• G_t^λ = (1 - λ) Σ_{n=1}^∞ λ^(n-1) G_t^(n)
• Forward-view TD(λ)
• V(S_t) ← V(S_t) + α (G_t^λ - V(S_t))
weighted sum of n-step returns
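The following sketch computes the λ-return for one time-step of a finished episode; `rewards[t]` is assumed to hold R_{t+1} and `values[t]` to hold V(S_t). The final n-step return (which reaches the terminal state and equals the MC return) receives the remaining tail weight λ^(T-t-1).

```python
# Sketch of the lambda-return for a single time-step t of one episode.
def lambda_return(rewards, values, t, lam=0.9, gamma=1.0):
    T = len(rewards)                       # terminal time
    G_n = 0.0                              # running n-step return without bootstrap
    g_lambda, discount = 0.0, 1.0
    for n in range(1, T - t + 1):
        G_n += discount * rewards[t + n - 1]          # add gamma^(n-1) * R_{t+n}
        discount *= gamma
        if t + n < T:
            g_n = G_n + discount * values[t + n]      # bootstrap with V(S_{t+n})
            g_lambda += (1 - lam) * lam ** (n - 1) * g_n
        else:
            g_lambda += lam ** (n - 1) * G_n          # terminal (MC) return gets the tail weight
    return g_lambda
```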
TD(λ) weighting
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf
The weight on each n-step return decays geometrically
Forward-view of TD(λ)
• Update the value function towards the λ-return G_t^λ
• V(S_t) ← V(S_t) + α (G_t^λ - V(S_t))
• The forward view looks into the future to compute G_t^λ
• Like MC, it can only be computed once the episode is complete
Forward-view TD(λ)
λ = 1 : MC,  λ = 0 : TD(0)
• Same large random walk as before (10 states vs 1000 states)
• Unlike with n, the best λ changes little between the two versions
• Robust to the environment
Backward-view TD(λ)
• Update online, every step, from incomplete sequences: this view makes that possible
• Which states should be credited (blamed) for the current TD error?
• Frequency heuristic : assign importance to frequently visited states
• Recency heuristic : assign importance to recently visited states
• Eligibility trace : combines both heuristics
Eligibility Traces
Backward-view of TD(λ)
• Keep an eligibility trace E_t(s) for every state s: E_t(s) = γλ E_{t-1}(s) + 1(S_t = s)
• Update V(s) for every state s:
δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
V(s) ← V(s) + α δ_t E_t(s)
• (compare the Forward-view TD(λ) update: V(S_t) ← V(S_t) + α (G_t^λ - V(S_t)))
δ_t : 1-step TD error
E_t(s) : multiply by the eligibility trace
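A minimal sketch of the backward view with accumulating traces; `env_reset`, `env_step` and `pi` are the same assumed helpers as in the TD(0) sketch above.

```python
from collections import defaultdict

# Sketch of backward-view TD(lambda) with accumulating eligibility traces.
def td_lambda(env_reset, env_step, pi, n_episodes=1000,
              alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        E = defaultdict(float)                 # eligibility traces, reset each episode
        s = env_reset()
        done = False
        while not done:
            a = pi(s)
            r, s_next, done = env_step(s, a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # 1-step TD error
            E[s] += 1.0                        # frequency + recency: bump current state
            for state in list(E):              # broadcast delta to all traced states
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam        # decay the traces
            s = s_next
    return V
```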
TD(λ) and TD(0)
• When λ = 0:
δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
E_t(s) = 1(S_t = s)
V(s) ← V(s) + α δ_t E_t(s)
• only the currently visited state is updated: with λ = 0 the trace is 1 at S_t and 0 everywhere else,
• so the update reduces exactly to the TD(0) update
V(s) ← V(s) + α δ_t
TD(λ) and MC
• When λ = 1:
E_t(s) = γ E_{t-1}(s) + 1(S_t = s)
• credit is deferred until the end of the episode; with offline updating, the total TD(1) update over an episode equals the every-visit MC update
Chapter 7
in progress
Reference
? Reinforcement Learning: An Introduction
? RL Course by David Silver
http://carpedm20.github.io/
