[PPT] - Regret-Based Reward Elicitation for Markov Decision Processes Kevin PowerPoint Presentation

SLIDE 1

Regret-Based Reward Elicitation for Markov Decision Processes

Kevin Regan University of Toronto

Department of Computer Science

Research Proposal

SLIDE 2

Introduction

Motivation

Setting: Computational approaches to sequential decision making under uncertainty, specifically MDPs These approaches require A model of dynamics A model of rewards

2

SLIDE 3

Introduction

Motivation

Except in some simple cases, the specification of rewards is problematic Preferences about which states/actions are “good” and “bad” need to be translated into precise numerical reward Time consuming to specify reward for all states/actions Rewards can vary user-to-user

3

SLIDE 4

Introduction

Motivation

My research goal is develop minimax regret-based framework for the incremental elicitation of reward functions for MDPs that is cognitively and computationally effective.

4

“ ’’

SLIDE 5

Introduction

Reward Elicitation

5

MDP Reward Compute Robust Policy Satisfied? Select Query User

response query no yes policy measure

Done

SLIDE 6

Introduction

Considerations

6

1. The efficient computation of robust policies using minimax regret

MDP Reward Compute Robust Policy

policy measure

SLIDE 7

Introduction

Considerations

7

2. The effective selection of queries in terms of query type and parameters

MDP Reward Select Query User

response query

SLIDE 8

Outline

1. MMR Computation
2. Reward Elicitation
3. IRMDP Applications
4. Research Plan

SLIDE 9

MMR Computation - Exact (MIP)

9 MMR Comptuation

We use constraint generation, solving max regret using a mixed integer program (MIP) [UAI09] Explicitly encodes the adversary’s policy choices using binary indicator variables Requires O(|S||A|) constraints and variables Effective for small MDPs (less than 10 states)

SLIDE 10

MMR Computation - Approximations

10 MMR Comptuation

Compute regret maximizing adversary policy for fixed reward, and regret maximizing reward for fixed adversary policy Alternating optimization Linear Relaxation [UAI09] We remove integrality constraints on binary variables encoding policy in the MIP . The resulting adversarial reward is used to construct approximation to max regret. These approximations exhibit low empirical error, however, we have no bound on this error.

SLIDE 11

Nondominated Policies

11 MMR Comptuation

Given the set of policies that are nondominated w.r.t. reward we gain significant computational leverage for computing MMR In constraint generation framework, max regret is computed by finding regret maximizing reward for each nondominated policy [AAAI10] We leverage similarities to POMDP solution techniques to develop two generation algorithms: πWitness [AAAI10] and Linear Support [TechRep] Generating the set of nondominated policies Γ Γ

SLIDE 12

MMR Computation - Nondominated Policies

12 MMR Comptuation

Exact Approximation A partial set of nondominated policies produces an lower bound on minimax regret

1. We show that given polynomial number
f nondominated policies, minimax regret

is polynomial [AAAI10]

2. Empirically outperforms related

approach [XM09]

3. Bottleneck: the size of the set of

nondominated policies

SLIDE 13

Generating - πWitness [AAAI10]

13 MMR Comptuation

Idea Properties Given a parital set of nondominated policies For each , construct its “local adjustments” Look for reward “witnesses” that testify to a new nondominated policy Find optimal policy at the witness and add it to the set of nondominated policies Runtime is polynomial in the number of nondominated policies generated No anytime error guarantee  Γ f ∈  Γ

Γ

SLIDE 14

Generating - Linear Support [TechRep10]

14 MMR Computation

Idea Properties Each generated nondominated policy induces a convex “nondominated region” (w.r.t. ) The maximal error will occur at vertices of these regions Find the optimal policy at each vertex Add the policy with maximal error to Relies on vertex enumeration (exponential in worst case) Provides anytime error bound Empirically, error drops quickly  Γ

Γ

 Γ

SLIDE 15

Online Adjustment of Γ [TechRep10]

15 MMR Computation

Idea Small leads to efficient minimax regret computation. During elicitation the volume of the reward polytope is reduced Policies in can become dominated. We can eliminate these dominated policies and add new nondominated policies, reducing approximation error. Γ Γ Empirically, elicitation using that begins with high error quickly sees error reduced to zero Γ

SLIDE 16

MMR Computation Summary

16 MMR Computation

Generating πWitness Linear Support Γ Computing MMR Constraint Generation with Max Regret using: MIP Linear Relaxation Alternating Optimization The set of nondominated policies Γ The approximate set Adjusting online  Γ  Γ

SLIDE 17

Characterizing the Set of Nondom. Policies

17 MMR Computation

Is there a simple characterization of an MDP that allows us to quantify:

1. The number of nondominated policies
2. Whether the set of nondominated

policies is easy to approximate Question

?

SLIDE 18

Fully Factored MDPs

18 MMR Computation

It is desirable to adapt our computational approaches to incorporate structure in the transition model. Important because:

1. Many existing MDP models encode this structure
2. Leverage transition structure could speed up our

algorithms However, to realize these benefits requires reimplementing most of our algorithms.

?

SLIDE 19

Prior on Reward - Constructing Γ

19 MMR Computation

Given a prior over feasible reward functions: Using adaptations of point based POMDP algorithms, we may be able to generate very small approximate sets with low expected error. However, such sets will offer no guarantee

n worst case error.

 Γ

?

SLIDE 20

Outline

1. MMR Computation
2. Reward Elicitation
3. IRMDP Applications
4. Research Plan

SLIDE 21

Simple Elicitation Strategies [UAI09]

21 Reward Elicitation

We used bound queries of the form “Is r(s,a) ≥ b?” with the following parameter selection strategies:

1. Halve the Largest Gap: Select the state-action pair

with the largest “gap” between upper and lower bound:

2. Current Solution: Weight each gap with the occupancy

frequencies found in solution to minimax regret. Choose state-action pair with largest weighted gap: Δ(s,a) = max

r'∈R r'(s,a) − min r∈R r(s,a)

choose argmax

a*∈A, s*∈S Δ(s*,a*)

argmax

a*∈A, s*∈S max

f (s*,a*)Δ(s*,a*), g(s*,a*)Δ(s*,a*)

{ }

SLIDE 22

Simple Elicitation Strategies - Assessment

22 Reward Elicitation

We analyzed the effectiveness of elicitation on randomly generated IRMDPs and the autonomic computing domain [UAI09] Results

1. Both strategies performed well, but current solution

reduced regret to zero in half the queries

2. Less than 2 queries per parameter were required to find

a provable optimal policy

3. Current solution effectively reduces regret while leaving

a large amount of uncertainty

SLIDE 23

Sequential Queries

23 Reward Elicitation

1. Policies - π: S→A
2. Trajectories - s1,a1,s2,a2,...,an-1,sn
3. Occupancy Frequencies - f(s,a)

The sequential nature of the MDP motivates novel modes of interacting with the user, including queries about: It is unreasonable to expect a user to specify a numerical value for a policy, trajectory or occupancy frequency. Comparisons may be more manageable.

?

SLIDE 24

Sequential Queries

24 Reward Elicitation

?

Policy Comparison Trajectory Comparison Is policy π preferred to policy π’ ? Is the potential trajectory preferred to trajectory ?” s1′, a1′,…, an−1′, sn′ s1, a1,…, an−1, sn Responses to both types of query imply linear constraints on reward However, trajectory comparisons may be easier to reason about than policy comparisons

SLIDE 25

Occupancy Frequency Comparisons

25 Reward Elicitation

A response imposes a linear constraint on reward. “Yes” implies: “Are occupancy frequencies f preferred to g?” f (si,ai)r(si,ai)

i=0 n

∑

≥ g(si,ai)r(si,ai)

i=0 n

∑

Given a factored model, it allows a user to directly specify trade-offs pertaining to portions of the policy.

?

x = x1,x2,…,xn domain(x1) = ⎧ ⎨ ⎪ ⎩ ⎪ r(x) = r

1(x1) + r 2(x2)…

SLIDE 26

Reward Elicitation

Occupancy Frequency Comparisons

26

?

payoff r(x1 = ) with frequency f (x1 = ), along with payoff r(x1 = ) with frequency f (x1 = )and payoff r(x1 = ) with frequency f (x1 = ) or f (x1) = … f (x1,…,xn)

xn

∑

x2

∑

= payoff r(x1 = ) with frequency g(x1 = ), along with payoff r(x1 = ) with frequency g(x1 = )and payoff r(x1 = ) with frequency g(x1 = )? Do you prefer g(x1) = … g(x1,…,xn)

xn

∑

x2

∑

= Yes response implies : f (x1)r(x1) ≥ g(x1)r(x1)

x1

∑

x1

∑

SLIDE 27

Optimal Query Selection

27 Reward Elicitation

EVOI(φ,R) = MMR(R) − Pr(ρ |φ)MMR(R

φ ρ) ρ

∑

(with prior) = min

ρ

MMR(R) − MMR(R

φ ρ)

⎡ ⎣ ⎤ ⎦ (w/o prior) The “value” of a query can be measured by the Expected Value of Information (EVOI): For query parameters ϕ and response ρ: We can attempt to select queries that

1. Myopically maximize EVOI for a single query
2. Maximize EVOI for an entire sequence of queries

?

SLIDE 28

Heuristic Query Selection

28 Reward Elicitation

Attempt to focus query on “relevant” part of reward polytope using the current minimax regret optimal policy and regret maximizing adversarial policy. Select queries whose responses induce constraints that equally partition the reward polytope. Volumetric Approaches Current Solution

?

SLIDE 29

Indifference

29 Reward Elicitation

Allow user to answer “I don’t know”. Indifference imposes linear constraints. For example: For query “Is r(s,a) > b?”, it implies the

constraint

For the query comparing occupancy frequencies f and g, it implies the constraint: b −δ ≤ r(s,a) ≤ b +δ f·r − g·r ≤ δ The implementation required is minor, and results in query

ptions which are more natural, potentially increasing the

effectiveness of elicitation.

?

SLIDE 30

Observing a full policy implies the constraints: Observing the trajectory implies the constraints:

Passive Information

30 Reward Elicitation

In some settings we may be able to observe behaviour before

r during elicitation.

Open question: If we can request a demonstration of behaviour, how do we select what to observe? (P

π − P a)(I − γ P π )−1r  0

f (si,ai) ≥ f (si,a′

i)

∀ i, a′

i ≠ ai

s1, a1,…, an−1, sn

?

SLIDE 31

Eliciting Reward Structure

31 Reward Elicitation

?

Idea Open Question For factored IRMDP , allow user to specify structure of reward function during elicitation We can use a concept c, to represent a setting of features that a user cares about: r

c(x) =

r

i(xi) i

∑

+ c(x)·bonus Computing minimax regret with respect to reward uncertainty and concept uncertainty: MMR(R,C) :min

f

max

g

max

r∈R max c∈ C

g·rc − f·rc

SLIDE 32

Outline

1. MMR Computation
2. Reward Elicitation
3. IRMDP Applications
4. Research Plan

SLIDE 33

Application Desirata

33 Applications

Reward encodes competing objectives, requiring trade-offs to be made Multi-dimensional Reward Repeatability Domain necessitates reward function being elicited for different users in different settings Natural MDP Structure Domain comfortably fits the MDP modeling assumptions.

SLIDE 34

Applications

Autonomic Computing

34

Demand Resource

Host 1

Demand Resource

Host k



Total Resource

Allocate computing and storage resources to application servers with changing demand Pro: Generally the server utility function has no closed form Con: The absence of a human user obviates our work towards cognitively tractable queries We have empirically seen reward elicitation to be effective in this domain

SLIDE 35

Applications

Cognitive Assistance - (COACH)

35

Guide a person through a small task by providing verbal or visual cues. Reward elicitation involves a human domain expert Practitioners found the manual specification of reward functions difficult [BPH05] We investigated reward elicitation on a simplified version of the COACH domain with good empirical results

turn off tap turn on tap wet hands soap hands rinse hands

SLIDE 36

Applications

Consumer Decision Support

36

?

Flight Price

Recommend purchase decision strategies that take into account dynamics of pricing and availability over time MDP state would capture current price, availability and features of potential purchases Reward function captures trade-offs between product features, prince and risk attitudes towards future availability and price Clear necessity for specifying individual reward for each user

SLIDE 37

Applications

Organizational Planning

37

?

Address planning and logistics problems faced by large

rganizations that involve trade-offs between competing
bjectives

Scheduling hospital CT scans to minimize used capacity subject to overtime constraints Rather than specifying hard constraints, elicitation can directly assess the preferences over the competing

bjectives

Reward elicitation could be repeated to express the trade-offs specific to each hospital Example

SLIDE 38

Outline

1. MMR Computation
2. Reward Elicitation
3. IRMDP Applications
4. Research Plan

SLIDE 39

Completed Research

39 Summary

Computing Minimax Regret Reward Elicitation Generating with πWitness and Linear Support Computing exact MMR with MIP and Computing approximate MMR with linear relaxation (of MIP) and Improving approximation online by adjusting Γ Γ  Γ  Γ Investigated efficacy of reward bound queries Assessed parameter selection strategies using halve the largest gap and current solution

SLIDE 40

Research Plan

40 Summary

Further assessment of developed techniques, specifically methods for online adjustment of the approximate nondominated policy set Computing Minimax Regret Reward Elicitation

1. Occupancy frequency queries
2. Query selection heuristics
3. Fleshing out potential applications
4. Optimal selection of query parameters

?

 Γ

SLIDE 41

Regret-Based Reward Elicitation for Markov Decision Processes

Kevin Regan University of Toronto

Department of Computer Science

Research Proposal

SLIDE 42

MMR Comptuation

Minimax Regret - Constraint Generation

1. We limit adversary Player finds policy f, which minimizes regret w.r.t. a limited set of adversary choices (constraints) 2. Untie adversary’s hands, given f: Adversary finds policy g and rewards r which maximizes regret Add adversary’s choice of g and r to the set of adversary choices

42

Constraint Generation We wish to compute: min

f∈F max g∈F max r∈R

g ·r − f ·r