Reinforcement Learning, Lecture 8
Wentworth Institute of Technology COMP4050 – Machine Learning | Fall 2015 | Derbinsky
November 24, 2015


SLIDE 1
Reinforcement Learning

Lecture 8

November 24, 2015

SLIDE 2
Outline

  1. Context
  2. TD Learning
  3. Issues

SLIDE 3

Machine Learning Tasks

  • Supervised
    – Given a training set and a target variable, generalize; measured over a testing set
  • Unsupervised
    – Given a dataset, find “interesting” patterns; potentially no “right” answer
  • Reinforcement
    – Learn an optimal action policy over time: given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward
    – Never given I/O pairs
    – Focus: online (balancing exploration/exploitation)

SLIDE 4

Success Stories

SLIDE 5

The Agent-Environment Interface

[Diagram: the agent observes state s_t and takes action a_t; the (stochastic) environment returns reward r_{t+1} and next state s_{t+1}]

SLIDE 6

Pole Balancing

SLIDE 7

Multi-Armed Bandit

SLIDE 8

Types of Tasks

  • Some tasks are continuous, meaning they are an ongoing sequence of decisions
  • Some tasks are episodic, meaning there exist terminal states that reset the problem

SLIDE 9

Policies

A policy π(s, a) is a function that associates a probability with taking a particular action in a particular state. The goal of RL is to learn an “effective” policy for a particular task.

SLIDE 10

Objective

Select actions so that the sum of the discounted rewards received over the future is maximized:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}

– Discount rate: 0 ≤ γ ≤ 1
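The definition above can be sketched directly in code for a finite episode (the helper name `discounted_return` is an illustrative assumption, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Sum of discounted rewards: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...

    `rewards` holds r_{t+1}, r_{t+2}, ... for a finite episode.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
print(discounted_return([1, 2, 4], 0.5))  # → 3.0
```

With γ closer to 0 the agent is myopic (only near-term rewards matter); with γ closer to 1 distant rewards count nearly as much as immediate ones.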

SLIDE 11

Environmental Modeling

  • An important issue in RL is state representation
    – Current sensors (observability!)
    – Past history?
  • A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state
    – Given the present, the future does not depend on the past
    – Memoryless, pathless

SLIDE 12

Implications of the Markov Property

Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state. Thus we can approximate via:

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0) ≈ P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)

SLIDE 13

Markov Decision Processes

If a process is Markovian, we can model it as a 5-tuple MDP (S, A, P(·, ·), R(·, ·), γ):

  – S: set of states
  – A: set of actions
  – P_a(s, s′): transition function
  – R_a(s, s′): immediate reward function
  – γ: discount rate
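One minimal way to encode such a tuple in code, as a sketch with a hypothetical two-state environment (this is not the recycling-robot model on the next slide; all states, actions, and numbers here are invented for illustration):

```python
# A tiny two-state MDP encoded as plain dictionaries (hypothetical example).
states = ["low", "high"]
actions = ["wait", "recharge"]

# P[(s, a)] -> list of (next_state, probability): the transition function P_a(s, s')
P = {
    ("low", "wait"):      [("low", 0.9), ("high", 0.1)],
    ("low", "recharge"):  [("high", 1.0)],
    ("high", "wait"):     [("high", 0.8), ("low", 0.2)],
    ("high", "recharge"): [("high", 1.0)],
}

# R[(s, a)] -> immediate expected reward: the reward function R_a(s, s') collapsed over s'
R = {
    ("low", "wait"): 1.0, ("low", "recharge"): 0.0,
    ("high", "wait"): 2.0, ("high", "recharge"): 0.0,
}

gamma = 0.9  # discount rate

# Sanity check: each (state, action)'s transition probabilities must sum to 1
for key, transitions in P.items():
    assert abs(sum(p for _, p in transitions) - 1.0) < 1e-9
```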

SLIDE 14

Recycling Robot MDP

SLIDE 15

Value Functions

Almost all RL algorithms are based on estimating value functions: functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Value functions are defined with respect to particular policies.

SLIDE 16

State-Value Function

V^π(s) = E_π[R_t | s_t = s] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

SLIDE 17

Action-Value Function

Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]

SLIDE 18

Example: Golf

SLIDE 19

Temporal Difference (TD) Learning

  • Combines ideas from Monte Carlo sampling and dynamic programming
  • Learns directly from raw experience without a model of environment dynamics
  • Updates estimates based in part on other learned estimates, without waiting for a final outcome
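The core idea can be sketched as the TD(0) update for state values (a minimal illustration; the step size α, the value table, and the function name are assumptions, not from the slides):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma*V[s'].

    The estimate is updated immediately, without waiting for the episode to end.
    """
    target = r + gamma * V[s_next]     # target uses another learned estimate
    V[s] += alpha * (target - V[s])    # move partway toward the target
    return V

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.5, s_next="B")
# V["A"] moves one alpha-sized step toward the target 0.5 + 0.9*1.0 = 1.4
```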

SLIDE 20

Visual TD Learning

SLIDE 21

Q-Learning: Off-Policy TD Control

  1. Initialize Q(s, a)
     – Random, optimistic, realistic, knowledge
  2. Repeat (for each episode):
     a. Initialize s
     b. Repeat (for each step of episode):
        i. Choose action a via Q
        ii. Take action a; observe r, s′
        iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
        iv. s ← s′
     until s is terminal
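The loop above can be sketched for a toy chain environment (the environment, hyperparameters, and function name here are illustrative assumptions: states 0..n−1, actions left/right, reward only on reaching the last state):

```python
import random

def q_learning(n_states=5, n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain MDP (hypothetical example)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # 1. initialize Q(s, a)

    for _ in range(n_episodes):                # 2. repeat for each episode
        s = 0                                  # a. initialize s
        while s != n_states - 1:               # b. until s is terminal
            # i. choose action via Q (epsilon-greedy)
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            # ii. take action; observe r, s'
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # iii. Q(s,a) <- Q(s,a) + alpha[r + gamma*max_a' Q(s',a') - Q(s,a)]
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next                         # iv. s <- s'
    return Q

Q = q_learning()
# After training, the greedy policy should choose "right" (action 1)
# in every non-terminal state of the chain.
print([0 if q[0] > q[1] else 1 for q in Q[:-1]])
```

Note the off-policy character: the update's target uses max over a′, regardless of which action the ε-greedy behavior policy actually takes next.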

SLIDE 22

Choosing Actions

  • Given a Q function, a common approach to selecting actions is ε-greedy:
    1. Select a random value in [0, 1]
       – If > ε, take the action with the highest estimated value
       – Else, select an action at random
  • In the limit, every action will be sampled an infinite number of times
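A minimal sketch of that selection rule (the function name and argmax tie-breaking are assumptions; with probability ε it explores, which is equivalent to the rule stated above):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon; otherwise pick
    the action with the highest estimated value."""
    if rng.random() < epsilon:                  # explore
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With epsilon = 0, always exploits: returns the index of the largest estimate
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # → 1
```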

SLIDE 23

Function Representation

  • Given large state–action spaces, there is a practical problem of how to sample the space, and how to represent it
  • Modern approaches include hierarchical methods and neural networks

SLIDE 24

Application: Michigan Liar’s Dice

  • Multi-agent
  • Opponents with varying degrees of background knowledge
    – Opponent modeling
    – Probabilistic calculation
    – Symbolic heuristics

SLIDE 25

Evaluation: Learning vs. Static

[Plot: Games Won vs. Blocks of Training (250 Games/Block)]

SLIDE 26

Evaluation: Learning vs. Learned

[Plot: Games Won vs. Blocks of Training (250 Games/Block)]

SLIDE 27

Evaluation: Value-Function Initialization

[Plot: Games Won vs. Blocks of Training (250 Games/Block); series: PMH, PM, PH, P, PMH-0, PM-0, PH-0, P-0]

SLIDE 28

Summary

  • Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward
  • Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states/state–action pairs in a value function
  • Temporal-Difference (TD) Learning is one effective method that is online and model-free