SLIDE 1

Simultaneous Acquisition of Task and Feedback Models
Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer
INRIA, Bordeaux Sud-Ouest

manuel.lopes@inria.fr
flowers.inria.fr/mlopes

SLIDE 2

Outline

  • Interactive Learning
    – Ambiguous Protocols
    – Ambiguous Signals
    – Active Learning

SLIDE 3

Learning from Demonstration

Pros

  • Natural/intuitive (is it?)
  • Facilitates social acceptance

Cons

  • Requires an expert with knowledge about the task and the learning system
  • Long and costly demonstrations
  • No feedback on the learning process (in most methods)

SLIDE 4

What is the best strategy to learn/teach?

Consider teaching someone how to play tennis. Information provided:

  • Rules of the game → R(x)
  • Strategies or verbal instructions on how to behave → V(x) > V(y)
  • Demonstrations (of a particular hit) → π(x) = a

SLIDE 5

How to improve learning from demonstration?

  • Combine:
    – demonstrations to initialize
    – self-experimentation to correct modeling errors
  • Feedback corrections
  • Instructions
  • More data
SLIDE 6

How to improve learning/teaching?

Learner
  – Active Learning
  – Combine with Self-Experimentation

Teacher
  – Better Strategies
  – Extra Cues

SLIDE 7

How are demonstrations provided?

  • Remote control (direct control)
    – Exoskeleton, joystick, Wiimote, …
  • Unobtrusive
    – Acquired with vision or 3D cameras from someone's execution
  • Remote instruction (indirect control)
    – Verbal commands, gestures, …

SLIDE 8

Behavior of Humans

  • People want to direct the agent's attention to guide exploration.
  • People have a positive bias in their rewarding behavior, suggesting both instrumental and motivational intents with their communication channel.
  • People adapt their teaching strategy as they develop a mental model of how the agent learns.
  • People are not optimal, even when they try to be.

(Cakmak, Thomaz)

SLIDE 9

Interactive Learning Approaches

Active Learner
  • Decide what to ask (Lopes, Cohn, Judah)
  • Ask when Uncertain/Risk (Chernova, Roy, …)
  • Decide when to ask (Cakmak)

Improved Teacher
  • Dogged Learning (Grollman)
  • User Preferences (Mason)
  • Extra Cues (Thomaz, Knox, Judah)
  • User Queries the Learner (Cakmak)
  • Tactile Guidance (Billard)
SLIDE 10

Learning under a weakly specified protocol

  • People do not follow protocols rigidly.
  • Some of the provided cues depart from their mathematical meaning, e.g. extra utterances, gestures, guidance, motivation.
  • Can we exploit those extra cues?
  • If robots adapt to the user, will training be easier?

SLIDE 11

Different Feedback Structures

User can provide direct feedback:

  • Reward
    – Quantitative evaluation
  • Corrections
    – Yes/No classifications of behavior
  • Actions

User can provide extra signals:

  • Reward of exploratory actions
  • Reward of getting closer to the target
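These structures can be made concrete as likelihood models. A minimal sketch, assuming binary "+"/"-" reward signals, a deterministic target policy, and an illustrative error rate eps (none of these names come from the slides):

```python
def reward_feedback(signal, state, action, target_policy, eps=0.1):
    """Binary-reward profile: the user says '+' when the executed action
    matches the target policy, with error rate eps."""
    correct = action == target_policy[state]
    p_plus = 1 - eps if correct else eps
    return p_plus if signal == "+" else 1 - p_plus

def action_feedback(signal, state, action, target_policy, n_actions=4, eps=0.1):
    """Action profile: the signal names the action the user wants;
    it is the target action with probability 1 - eps, otherwise uniform."""
    if signal == target_policy[state]:
        return 1 - eps
    return eps / (n_actions - 1)
```

Under the first profile the signal evaluates the action the robot just took; under the second it prescribes one. That difference is exactly the ambiguity the following slides address.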
SLIDE 12

Unknown/Ambiguous Feedback

Unknown feedback signals:

  • Gestures
  • Prosody
  • Word synonyms
SLIDE 13

Goal / Contribution

Learn simultaneously:

  – Task: the reward function
  – Interaction protocol: what information the user is providing
  – Meaning of extra signals: what novel signals mean, e.g. prosody, unknown words, …

Simultaneous Acquisition of Task and Feedback Models, Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer, ICDL, 2011.

SLIDE 14

Markov decision process

Set of possible states of the world and actions: X = {1, ..., |X|}, A = {1, ..., |A|}

  • The state evolves according to P[Xt+1 = y | Xt = x, At = a] = Pa(x, y)
  • A reward r defines the task of the agent
  • A policy defines how to choose actions: P[At = a | Xt = x] = π(x, a)
  • Determine the policy that maximizes the total (expected) reward: V(x) = Eπ[∑t γ^t rt | X0 = x]
  • The optimal policy can be computed using dynamic programming:
    V*(x) = r(x) + γ maxa Ea[V*(y)]
    Q*(x, a) = r(x) + γ Ea[V*(y)]
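A minimal sketch of these Bellman equations as tabular value iteration, assuming a transition tensor P[a, x, y] and a state-reward vector r (the array shapes and defaults are my assumptions, not the paper's):

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Tabular value iteration.
    P: (|A|, |X|, |X|) array, P[a, x, y] = P[X_{t+1} = y | X_t = x, A_t = a]
    r: (|X|,) array of state rewards.
    Returns V*(x) and Q*(x, a) (shape (|X|, |A|))."""
    V = np.zeros(P.shape[1])
    while True:
        Q = r[None, :] + gamma * (P @ V)   # Q*(x, a) = r(x) + γ E_a[V*(y)]
        V_new = Q.max(axis=0)              # V*(x) = max_a Q*(x, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.T
        V = V_new
```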

SLIDE 15

Inverse Reinforcement Learning

The goal of the task is unknown.

  • RL: from the world model T and reward r, find the optimal policy π*.
  • IRL: from samples of the policy π̂ and the world model T, estimate the reward r̂.

Ng et al., ICML 2000; Abbeel et al., ICML 2004; Neu et al., UAI 2007; Ramachandran et al., IJCAI 2007; Lopes et al., IROS 2007

SLIDE 16

Probabilistic View of IRL

  • Suppose now that the agent is given a demonstration: D = {(x1, a1), ..., (xn, an)}
  • The teacher is not perfect (sometimes makes mistakes), modeled by a Boltzmann policy:
    π'(x, a) = e^{η Q*(x, a)} / ∑b e^{η Q*(x, b)}
  • Likelihood of the observed demonstration: L(D) = ∏i π'(xi, ai)
  • Prior distribution P[r]
  • Posterior over rewards: P[r | D] ∝ P[r] P[D | r]
  • MC-based methods to sample P[r | D]

(Ramachandran)
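A sketch of this posterior, reusing the value_iteration sketch above: the Boltzmann likelihood plus a random-walk Metropolis sampler for P[r | D]. The unit-Gaussian prior, step size, and η value are illustrative assumptions; Ramachandran's algorithm differs in its proposal and reward discretization.

```python
import numpy as np

def demo_log_likelihood(r, D, P, gamma=0.95, eta=5.0):
    """log L(D) = Σ_i log π'(x_i, a_i), with the Boltzmann policy
    π'(x, a) = exp(η Q*(x, a)) / Σ_b exp(η Q*(x, b))."""
    _, Q = value_iteration(P, r, gamma)                  # Q: (|X|, |A|)
    log_pi = eta * Q - np.logaddexp.reduce(eta * Q, axis=1, keepdims=True)
    return sum(log_pi[x, a] for x, a in D)

def sample_posterior(D, P, num_states, n_samples=500, step=0.1, seed=0):
    """Random-walk Metropolis over r, targeting P[r | D] ∝ P[r] L(D)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(num_states)
    log_p = demo_log_likelihood(r, D, P) - 0.5 * r @ r   # unit-Gaussian prior
    samples = []
    for _ in range(n_samples):
        r_new = r + step * rng.normal(size=num_states)
        log_p_new = demo_log_likelihood(r_new, D, P) - 0.5 * r_new @ r_new
        if np.log(rng.random()) < log_p_new - log_p:     # accept/reject
            r, log_p = r_new, log_p_new
        samples.append(r.copy())
    return np.array(samples)
```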

SLIDE 17

Bayesian inverse reinforcement learning

SLIDE 18

Gradient-based IRL

  • Idea: compute the maximum-likelihood estimate for r given the demonstration D
  • We use a gradient ascent algorithm: rt+1 = rt + ∇r L(D)
  • Upon convergence, the obtained reward maximizes the likelihood of the demonstration

Policy Loss (Neu et al.), Maximum Likelihood (Lopes et al.)
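A sketch of the ascent step rt+1 = rt + α ∇r log L(D). For brevity it uses a finite-difference gradient rather than the analytic gradients derived by Neu et al. and Lopes et al.; the learning rate and iteration count are arbitrary.

```python
import numpy as np

def irl_gradient_ascent(D, P, num_states, lr=0.05, iters=100, eps=1e-4):
    """Maximum-likelihood IRL by (finite-difference) gradient ascent."""
    r = np.zeros(num_states)
    for _ in range(iters):
        base = demo_log_likelihood(r, D, P)
        grad = np.zeros(num_states)
        for i in range(num_states):       # one MDP solve per coordinate:
            r_pert = r.copy()             # fine for a sketch, slow in practice
            r_pert[i] += eps
            grad[i] = (demo_log_likelihood(r_pert, D, P) - base) / eps
        r += lr * grad                    # r_{t+1} = r_t + α ∇_r log L(D)
    return r
```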

SLIDE 19

The Selection Criterion

  • The distribution P[r | D] induces a distribution on Π
  • Use MC to approximate P[r | D]
  • For each (x, a), P[r | D] induces a distribution on π(x, a): µxa(p) = P[π(x, a) = p | D]
  • Compute the entropy H(µxa) and the per-state average entropy: H(x) = 1/|A| ∑a H(µxa)
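One way to estimate this criterion from MC samples: collect one policy per sampled reward, histogram the values of π(x, a) to approximate µxa, and average the entropies over actions. The histogram binning is my assumption; the paper may use a different estimator.

```python
import numpy as np

def per_state_entropy(policy_samples, n_bins=10):
    """policy_samples: (n, |X|, |A|) array, one policy per posterior sample.
    Returns H(x) = 1/|A| Σ_a H(µ_xa), with µ_xa estimated by a histogram."""
    n, num_states, num_actions = policy_samples.shape
    H = np.zeros(num_states)
    for x in range(num_states):
        for a in range(num_actions):
            counts, _ = np.histogram(policy_samples[:, x, a],
                                     bins=n_bins, range=(0.0, 1.0))
            p = counts[counts > 0] / n
            H[x] -= (p * np.log(p)).sum()
    return H / num_actions
```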

SLIDE 20

Active IRL

Require: initial demonstration D
  1. Estimate P[π | D] using MC (maybe only around the ML estimate)
  2. For all x ∈ X, compute H(x)
  3. Query the action for x* = argmaxx H(x)
  4. Add the new sample to D

Active Learning for Reward Estimation in Inverse Reinforcement Learning, Manuel Lopes, Francisco Melo and Luis Montesano. ECML/PKDD, 2009.
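Putting the pieces together, a sketch of this loop using the earlier helpers; `teacher` is an assumed callback that returns the demonstrated action for a queried state.

```python
import numpy as np

def boltzmann_policy(P, r, gamma=0.95, eta=5.0):
    """π'(x, a) ∝ exp(η Q*(x, a)) for a sampled reward r."""
    _, Q = value_iteration(P, r, gamma)
    return np.exp(eta * Q - np.logaddexp.reduce(eta * Q, axis=1, keepdims=True))

def active_irl(P, teacher, D_init, num_states, n_rounds=10):
    D = list(D_init)
    for _ in range(n_rounds):
        rewards = sample_posterior(D, P, num_states)   # 1. MC estimate of P[π | D]
        policies = np.stack([boltzmann_policy(P, r) for r in rewards])
        H = per_state_entropy(policies)                # 2. H(x) for all x
        x_star = int(np.argmax(H))                     # 3. most ambiguous state
        D.append((x_star, teacher(x_star)))            # 4. query and extend D
    return D
```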

SLIDE 21

Results

  • General grid world (M × M grid), >200 states
  • Four actions available (N, S, E, W)
  • Parameterized reward (goal state)
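A sketch of such a grid world as a transition tensor compatible with the snippets above; M = 15 gives 225 > 200 states, and the slip probability `noise` is an illustrative extra, not something the slide specifies.

```python
import numpy as np

def grid_world(M=15, noise=0.0):
    """(4, M*M, M*M) transition tensor for actions N, S, E, W;
    moves off the grid leave the agent in place."""
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]          # N, S, E, W as (row, col)
    P = np.zeros((4, M * M, M * M))
    def clip(v):
        return min(max(v, 0), M - 1)
    for row in range(M):
        for col in range(M):
            x = row * M + col
            for a, (dr, dc) in enumerate(moves):
                P[a, x, clip(row + dr) * M + clip(col + dc)] += 1 - noise
                for dr2, dc2 in moves:                  # slip: random direction
                    P[a, x, clip(row + dr2) * M + clip(col + dc2)] += noise / 4
    return P

def goal_reward(M, goal):
    """Parameterized reward: 1 at the goal state, 0 elsewhere."""
    r = np.zeros(M * M)
    r[goal] = 1.0
    return r
```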
SLIDE 22

Active IRL, sample trajectories

Require: initial demonstration D
  1. Estimate P[π | D] using MC
  2. For all x ∈ X, compute H(x)
  3. Solve the MDP with R = H(x)
  4. Query a trajectory following the resulting optimal policy
  5. Add the new trajectory to D

[Figure: grid-world entropy map with learning curves; the numeric ticks and legend labels (apparently "rnd", "act", "traj1") are not cleanly recoverable from the transcript.]
SLIDE 23

Unknown/Ambiguous Feedback

Unknown feedback protocol: the information provided by the demonstration has no predefined semantics.

Meanings of the user signals:

  • Binary reward
  • Action
SLIDE 24

Feedback Profiles

[Figure: feedback profiles – Demonstration, Binary Reward, Ambiguous.]

SLIDE 25

Combination of Profiles

SLIDE 26

Acquisition of Task and Feedback Model

SLIDE 27

Unknown/Ambiguous Feedback

Unknown feedback signals:

  • Gestures
  • Prosody
  • Word synonyms
SLIDE 28

Feedback meaning of user signals

The user might use different words to provide feedback:

  • Ok, correct, good, nice, …
  • Wrong, error, no, …
  • Up, Go, Forward

An intuitive interface should allow the interaction to be as free as possible. Even if the user does not follow a strict vocabulary, can the robot still make use of such extra signals?

Learn the meaning of new vocabulary.

SLIDE 29
SLIDE 30

[Table: an interaction history used to infer the feedback model. Columns: Init State, Action, Next State, Feedback, and the interpretations under two candidate profiles F1 (_/A) and F2 (A/_). States range over {TO, TR, TT, RT, OT}; actions include Grasp1, Grasp2 and RelOnObj; the individual cell values are not cleanly recoverable from the transcript.]

Assuming the profile hypothesis (F1, OT), the unknown utterance "AgarraVer" means Grasp1.
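A toy sketch of the inference this table illustrates: maintain a joint posterior over (feedback profile, task, meaning of the unknown utterance) and let each interaction reweight the hypotheses. All names and the likelihood interface are illustrative, not the paper's notation.

```python
import itertools
import numpy as np

def joint_posterior(history, profiles, tasks, meanings):
    """history: list of (state, action, utterance) tuples.
    profiles: dict name -> likelihood(utterance, meaning, state, action, task).
    Returns the posterior over (profile, task, meaning) triples."""
    hyps = list(itertools.product(profiles, tasks, meanings))
    log_post = dict.fromkeys(hyps, 0.0)                  # uniform prior
    for state, action, utt in history:
        for prof, task, meaning in hyps:
            lik = profiles[prof](utt, meaning, state, action, task)
            log_post[(prof, task, meaning)] += np.log(max(lik, 1e-12))
    m = max(log_post.values())                           # normalize stably
    w = {h: np.exp(v - m) for h, v in log_post.items()}
    Z = sum(w.values())
    return {h: v / Z for h, v in w.items()}
```

With enough consistent evidence the mass concentrates on one triple, which is the sense in which "AgarraVer means Grasp1" can be read off the table.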

SLIDE 31
SLIDE 32

Scenario

Actions: Up, Down, Left, Right, Pick, Release

The task consists in finding what object to pick and where to take it. The robot tries an action (possibly none) and the user provides feedback.

There are 8 known symbols and 8 unknown ones. The robot must learn the task goal, how the user provides feedback, and some unknown signs.
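A toy encoding of this scenario's hypothesis spaces; the symbol lists are illustrative stand-ins, since the slides do not name the 8 known or 8 unknown symbols.

```python
ACTIONS = ["Up", "Down", "Left", "Right", "Pick", "Release", "None"]
KNOWN_SYMBOLS = ["ok", "good", "yes", "nice", "no", "wrong", "bad", "stop"]
UNKNOWN_SYMBOLS = [f"sym{i}" for i in range(8)]   # meanings to be learned

def candidate_tasks(objects, locations):
    """Task hypothesis space: every (object to pick, place to take it)."""
    return [(o, l) for o in objects for l in locations]

tasks = candidate_tasks(["red", "blue"], ["T1", "T2"])
print(len(tasks), "candidate tasks:", tasks)
```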

SLIDE 33

Protocol Uncertainty

SLIDE 34

Unknown Task and Feedback

[Figure: results comparing passive and active learning.]

SLIDE 35

Query Strategies

SLIDE 36

Unknown Task/Feedback/Utterances

SLIDE 37
SLIDE 38

Conclusions/Future

  • Experimental results show that active sampling in IRL can help decrease the number of demonstrated samples.
  • Prior knowledge (about the reward parameterization) impacts the usefulness of active IRL; experimental results indicate that active is not worse than random.
  • In the presence of noisy feedback, active IRL is still robust. If we learn the feedback structure, we can learn even faster.
  • We can learn the task, the feedback, and (some) guidance symbols simultaneously.

Future

  • More general feedback/guidance models
  • Include more sources of information, e.g. speech prosody