Reinforcement Learning, Lecture 8
Wentworth Institute of Technology COMP4050 – Machine Learning | Fall 2015 | Derbinsky
November 24, 2015


SLIDE 1
Reinforcement Learning

Lecture 8

November 24, 2015

SLIDE 2
Outline

  1. Context
  2. TD Learning
  3. Issues

SLIDE 3

Machine Learning Tasks

  • Supervised
    – Given a training set and a target variable, generalize; measured over a testing set
  • Unsupervised
    – Given a dataset, find “interesting” patterns; potentially no “right” answer
  • Reinforcement
    – Learn an optimal action policy over time: given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward
    – Never given I/O pairs
    – Focus: online (balancing exploration/exploitation)

SLIDE 4

Success Stories

SLIDE 5

The Agent-Environment Interface

[Diagram: the agent observes state s_t and takes action a_t; the (stochastic) environment returns reward r_{t+1} and next state s_{t+1}]

SLIDE 6

Pole Balancing

SLIDE 7

Multi-Armed Bandit

SLIDE 8

Types of Tasks

  • Some tasks are continuous, meaning they are an ongoing sequence of decisions
  • Some tasks are episodic, meaning there exist terminal states that reset the problem

SLIDE 9

Policies

A policy π(s, a) is a function that associates a probability with taking a particular action in a particular state. The goal of RL is to learn an “effective” policy for a particular task.

SLIDE 10

Objective

Select actions so that the sum of the discounted rewards received over the future is maximized:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}

– Discount rate: 0 ≤ γ ≤ 1
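The definition above can be sketched directly in code for a finite episode (the helper name `discounted_return` is an illustrative assumption, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Sum of discounted rewards: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...

    `rewards` holds r_{t+1}, r_{t+2}, ... for a finite episode.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
print(discounted_return([1, 2, 4], 0.5))  # → 3.0
```

With γ closer to 0 the agent is myopic (only near-term rewards matter); with γ closer to 1 distant rewards count nearly as much as immediate ones.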

SLIDE 11

Environmental Modeling

  • An important issue in RL is state representation
    – Current sensors (observability!)
    – Past history?
  • A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state
    – Given the present, the future does not depend on the past
    – Memoryless, pathless

SLIDE 12

Implications of the Markov Property

Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state. Thus we can approximate via:

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0) ≈ P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)

SLIDE 13

Markov Decision Processes

If a process is Markovian, we can model it as a 5-tuple MDP (S, A, P(·, ·), R(·, ·), γ):

  – S: set of states
  – A: set of actions
  – P_a(s, s′): transition function
  – R_a(s, s′): immediate reward function
  – γ: discount rate
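One minimal way to encode such a tuple in code, as a sketch with a hypothetical two-state environment (this is not the recycling-robot model on the next slide; all states, actions, and numbers here are invented for illustration):

```python
# A tiny two-state MDP encoded as plain dictionaries (hypothetical example).
states = ["low", "high"]
actions = ["wait", "recharge"]

# P[(s, a)] -> list of (next_state, probability): the transition function P_a(s, s')
P = {
    ("low", "wait"):      [("low", 0.9), ("high", 0.1)],
    ("low", "recharge"):  [("high", 1.0)],
    ("high", "wait"):     [("high", 0.8), ("low", 0.2)],
    ("high", "recharge"): [("high", 1.0)],
}

# R[(s, a)] -> immediate expected reward: the reward function R_a(s, s') collapsed over s'
R = {
    ("low", "wait"): 1.0, ("low", "recharge"): 0.0,
    ("high", "wait"): 2.0, ("high", "recharge"): 0.0,
}

gamma = 0.9  # discount rate

# Sanity check: each (state, action)'s transition probabilities must sum to 1
for key, transitions in P.items():
    assert abs(sum(p for _, p in transitions) - 1.0) < 1e-9
```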

SLIDE 14

Recycling Robot MDP

SLIDE 15

Value Functions

Almost all RL algorithms are based on estimating value functions: functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Value functions are defined with respect to particular policies.

SLIDE 16

State-Value Function

V^π(s) = E_π[R_t | s_t = s] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

SLIDE 17

Action-Value Function

Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]

SLIDE 18

Example: Golf

SLIDE 19

Temporal Difference (TD) Learning

  • Combines ideas from Monte Carlo sampling and dynamic programming
  • Learns directly from raw experience without a model of environment dynamics
  • Updates estimates based in part on other learned estimates, without waiting for a final outcome
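The core idea can be sketched as the TD(0) update for state values (a minimal illustration; the step size α, the value table, and the function name are assumptions, not from the slides):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma*V[s'].

    The estimate is updated immediately, without waiting for the episode to end.
    """
    target = r + gamma * V[s_next]     # target uses another learned estimate
    V[s] += alpha * (target - V[s])    # move partway toward the target
    return V

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.5, s_next="B")
# V["A"] moves one alpha-sized step toward the target 0.5 + 0.9*1.0 = 1.4
```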

SLIDE 20

Visual TD Learning

SLIDE 21

Q-Learning: Off-Policy TD Control

  1. Initialize Q(s, a)
     – Random, optimistic, realistic, knowledge
  2. Repeat (for each episode):
     a. Initialize s
     b. Repeat (for each step of episode):
        i. Choose action a via Q
        ii. Take action a; observe r, s′
        iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
        iv. s ← s′
     until s is terminal
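The loop above can be sketched for a toy chain environment (the environment, hyperparameters, and function name here are illustrative assumptions: states 0..n−1, actions left/right, reward only on reaching the last state):

```python
import random

def q_learning(n_states=5, n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain MDP (hypothetical example)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # 1. initialize Q(s, a)

    for _ in range(n_episodes):                # 2. repeat for each episode
        s = 0                                  # a. initialize s
        while s != n_states - 1:               # b. until s is terminal
            # i. choose action via Q (epsilon-greedy)
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            # ii. take action; observe r, s'
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # iii. Q(s,a) <- Q(s,a) + alpha[r + gamma*max_a' Q(s',a') - Q(s,a)]
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next                         # iv. s <- s'
    return Q

Q = q_learning()
# After training, the greedy policy should choose "right" (action 1)
# in every non-terminal state of the chain.
print([0 if q[0] > q[1] else 1 for q in Q[:-1]])
```

Note the off-policy character: the update's target uses max over a′, regardless of which action the ε-greedy behavior policy actually takes next.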

SLIDE 22

Choosing Actions

  • Given a Q function, a common approach to selecting actions is ε-greedy:
    1. Select a random value in [0, 1]
       – If > ε, take the action with the highest estimated value
       – Else, select an action at random
  • In the limit, every action will be sampled an infinite number of times
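A minimal sketch of that selection rule (the function name and argmax tie-breaking are assumptions; with probability ε it explores, which is equivalent to the rule stated above):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon; otherwise pick
    the action with the highest estimated value."""
    if rng.random() < epsilon:                  # explore
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With epsilon = 0, always exploits: returns the index of the largest estimate
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # → 1
```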

SLIDE 23

Function Representation

  • Given large state–action spaces, there is a practical problem of how to sample the space, and how to represent it
  • Modern approaches include hierarchical methods and neural networks

SLIDE 24

Application: Michigan Liar’s Dice

  • Multi-agent
  • Opponents with varying degrees of background knowledge
    – Opponent modeling
    – Probabilistic calculation
    – Symbolic heuristics

SLIDE 25

Evaluation: Learning vs. Static

[Plot: Games Won vs. Blocks of Training (250 Games/Block)]

SLIDE 26

Evaluation: Learning vs. Learned

[Plot: Games Won vs. Blocks of Training (250 Games/Block)]

SLIDE 27

Evaluation: Value-Function Initialization

[Plot: Games Won vs. Blocks of Training (250 Games/Block); series: PMH, PM, PH, P, PMH-0, PM-0, PH-0, P-0]

SLIDE 28

Summary

  • Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward
  • Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states/state–action pairs in a value function
  • Temporal-Difference (TD) Learning is one effective method that is online and model-free