Regret-Based Reward Elicitation for Markov Decision Processes
Kevin Regan University of Toronto
Department of Computer Science
Research Proposal
Introduction
Setting: computational approaches to sequential decision making under uncertainty, specifically MDPs. These approaches require a model of dynamics and a model of rewards.
Introduction
Except in some simple cases, the specification of rewards is problematic: preferences about which states/actions are "good" and "bad" need to be translated into precise numerical rewards; it is time consuming to specify a reward for every state/action; and rewards can vary from user to user.
Introduction

(Elicitation loop diagram: from an MDP with imprecise reward, compute a robust policy and its regret measure; if the user is satisfied, we are done; otherwise select a query, pose it to the user, and use the response to refine the reward.)
Introduction

(Diagram detail, computing a robust policy: the MDP and current reward knowledge yield a policy and its regret measure.)
Introduction

(Diagram detail, selecting a query: the MDP and current reward knowledge determine the query posed to the user, whose response refines the reward.)
MMR Computation
We use constraint generation, solving max regret with a mixed integer program (MIP) [UAI09]. The MIP explicitly encodes the adversary's policy choices using binary indicator variables, and requires O(|S||A|) constraints and variables. This is effective only for small MDPs (fewer than 10 states).
MMR Computation
Alternating optimization: compute the regret-maximizing adversary policy for a fixed reward, and the regret-maximizing reward for a fixed adversary policy. Linear relaxation [UAI09]: we remove the integrality constraints on the binary variables encoding the policy in the MIP; the resulting adversarial reward is used to construct an approximation to max regret. These approximations exhibit low empirical error; however, we have no bound on this error. (A sketch of the reward subproblem appears below.)
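To make the reward subproblem concrete, here is a minimal sketch, not the proposal's actual implementation: with both policies fixed and represented as occupancy-frequency vectors, the regret-maximizing reward is a linear program over the reward polytope. All numbers, and the use of simple interval bounds as the polytope, are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

f = np.array([0.6, 0.4, 0.0])   # occupancy frequencies of the current policy
g = np.array([0.1, 0.2, 0.7])   # occupancy frequencies of the adversary policy

# Hypothetical reward polytope: componentwise bounds lo <= r <= hi.
lo = np.array([0.0, 0.0, 0.0])
hi = np.array([1.0, 2.0, 1.0])

# Maximize (g - f).r, i.e. minimize (f - g).r, over the polytope.
res = linprog(c=f - g, bounds=list(zip(lo, hi)))
print(-res.fun, res.x)          # regret of f against g, and the witness reward
```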
MMR Computation
Generating the set of nondominated policies Γ: given the set of policies that are nondominated w.r.t. reward, we gain significant computational leverage for computing MMR. In the constraint generation framework, max regret is computed by finding the regret-maximizing reward for each nondominated policy [AAAI10]. We leverage similarities to POMDP solution techniques to develop two generation algorithms: πWitness [AAAI10] and Linear Support [TechRep].
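As a hedged illustration of why Γ gives leverage (toy data, not the paper's algorithm): once Γ is a finite set of occupancy-frequency vectors and the reward polytope is a simple box, max regret for a policy reduces to a loop over Γ with a closed-form inner maximization, and minimax regret over Γ to a second loop.

```python
import numpy as np

Gamma = [np.array([0.6, 0.4, 0.0]),     # occupancy-frequency vectors of
         np.array([0.1, 0.2, 0.7]),     # hypothetical nondominated policies
         np.array([0.3, 0.3, 0.4])]
lo = np.array([0.0, 0.0, 0.0])          # lower reward bounds
hi = np.array([1.0, 2.0, 1.0])          # upper reward bounds

def max_regret(f):
    """max over g in Gamma, r in [lo, hi] of (g - f).r (closed-form inner max)."""
    return max(np.sum(np.where(g - f > 0, (g - f) * hi, (g - f) * lo))
               for g in Gamma)

# Minimax regret restricted to the policies in Gamma:
mmr, idx = min((max_regret(f), i) for i, f in enumerate(Gamma))
print(mmr, "achieved by policy", idx)
```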
MMR Computation
Exact vs. approximation: with the full set of nondominated policies, max regret computation is polynomial [AAAI10]; a partial set of nondominated policies produces a lower bound on minimax regret (cf. the approach of [XM09]).
MMR Computation
πWitness
Idea: given a partial set Γ of nondominated policies, for each f ∈ Γ construct its "local adjustments" and look for reward "witnesses" that testify to a new nondominated policy; find the optimal policy at the witness and add it to the set of nondominated policies.
Properties: runtime is polynomial in the number of nondominated policies generated; no anytime error guarantee.
MMR Computation
Linear Support
Idea: each generated nondominated policy induces a convex "nondominated region" (w.r.t. Γ); the maximal error will occur at vertices of these regions, so we find the optimal policy at each vertex and add the policy with maximal error to Γ.
Properties: relies on vertex enumeration (exponential in the worst case); provides an anytime error bound; empirically, error drops quickly.
MMR Computation
Adjusting Γ online
Idea: a small Γ leads to efficient minimax regret computation. During elicitation the volume of the reward polytope is reduced, so policies in Γ can become dominated; we can eliminate these dominated policies and add new nondominated policies, reducing approximation error. Empirically, elicitation that begins with a Γ with high error quickly sees the error reduced to zero. (A sketch of the pruning test follows.)
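A minimal sketch of the pruning step, under the assumption that the shrinking polytope is tracked as componentwise bounds: a policy can be dropped once some single other member of Γ beats it on every feasible reward (a sufficient pairwise test; a full domination check would need an LP).

```python
import numpy as np

Gamma = [np.array([0.6, 0.4, 0.0]),
         np.array([0.1, 0.2, 0.7]),
         np.array([0.35, 0.35, 0.3])]
lo = np.array([0.4, 0.4, 0.0])          # bounds tightened by elicitation
hi = np.array([0.6, 0.6, 0.1])

def dominated(f):
    for g in Gamma:
        if g is f:
            continue
        d = g - f
        # min over the box of d.r; if >= 0, g beats f for every feasible r
        if np.sum(np.where(d > 0, d * lo, d * hi)) >= 0:
            return True
    return False

Gamma = [f for f in Gamma if not dominated(f)]
print(len(Gamma), "policies remain")
```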
MMR Computation
Summary so far. Generating Γ: πWitness and Linear Support. Computing MMR: constraint generation, with max regret computed using the MIP, its linear relaxation, alternating optimization, or the set of nondominated policies Γ; the approximate set of nondominated policies; and adjusting Γ online.
MMR Computation
Open question: is there a simple characterization of an MDP that allows us to quantify when the set of nondominated policies is easy to approximate?
MMR Computation
Open question: it is desirable to adapt our computational approaches to incorporate structure in the transition model, which would admit compact representations and more scalable algorithms. However, realizing these benefits requires reimplementing most of our algorithms.
MMR Computation
Open question: given a prior over feasible reward functions, adaptations of point-based POMDP algorithms may let us generate very small approximate sets Γ with low expected error. However, such sets will offer no worst-case guarantee.
Reward Elicitation
We used bound queries of the form "Is r(s,a) ≥ b?" with the following parameter selection strategies.

Halve the largest gap: let Δ(s,a) = max_{r′∈R} r′(s,a) − min_{r∈R} r(s,a) denote the gap between the upper and lower bound on r(s,a); choose argmax_{s*∈S, a*∈A} Δ(s*,a*).

Current solution: weight each gap by the occupancy frequencies f and g found in the solution to minimax regret; choose argmax_{s*∈S, a*∈A} max{ f(s*,a*)Δ(s*,a*), g(s*,a*)Δ(s*,a*) }.
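A small sketch of both strategies, assuming the reward polytope is maintained as componentwise upper/lower bounds (all arrays are toy data):

```python
import numpy as np

r_lo = np.array([[0.0, 0.2], [0.1, 0.0]])   # lower bounds on r(s,a)
r_hi = np.array([[1.0, 3.0], [0.9, 2.0]])   # upper bounds on r(s,a)
f = np.array([[0.7, 0.0], [0.3, 0.0]])      # minimax-optimal occupancies
g = np.array([[0.0, 0.1], [0.0, 0.9]])      # adversary occupancies

gap = r_hi - r_lo                            # Delta(s,a)

# Halve the largest gap: query the pair with the widest bounds.
s, a = np.unravel_index(np.argmax(gap), gap.shape)
print("HLG queries", (s, a), "at midpoint", (r_lo[s, a] + r_hi[s, a]) / 2)

# Current solution: weight each gap by how much either policy visits it.
score = np.maximum(f * gap, g * gap)
s, a = np.unravel_index(np.argmax(score), score.shape)
print("CS queries", (s, a), "at midpoint", (r_lo[s, a] + r_hi[s, a]) / 2)
```

On this toy data the two strategies pick different pairs: the widest gap sits at a state-action pair neither policy visits, which halve the largest gap queries anyway and current solution ignores.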
Reward Elicitation
We analyzed the effectiveness of elicitation on randomly generated IRMDPs and the autonomic computing domain [UAI09]. Results: the current solution strategy reduced regret to zero in half the queries required by halve the largest gap, and a provably optimal policy was typically found while a large amount of reward uncertainty remained.
Reward Elicitation
The sequential nature of the MDP motivates novel modes of interacting with the user, including queries about policies, trajectories and occupancy frequencies (detailed below). It is unreasonable to expect a user to specify a numerical value for a policy, trajectory or occupancy frequency; comparisons may be more manageable.
Reward Elicitation
Policy comparison: "Is policy π preferred to policy π′?" Trajectory comparison: "Is the potential trajectory s1, a1, …, an−1, sn preferred to trajectory s1′, a1′, …, an−1′, sn′?" Responses to both types of query imply linear constraints on reward; however, trajectory comparisons may be easier to reason about than policy comparisons.
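As an illustration of the linear constraint a trajectory comparison yields, assuming for concreteness that trajectories are valued by discounted reward (the discounting choice is ours, not fixed by the slide):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95

def visit_vector(trajectory):
    """Discounted state-action counts for [(s0, a0), (s1, a1), ...]."""
    c = np.zeros(n_states * n_actions)
    for i, (s, a) in enumerate(trajectory):
        c[s * n_actions + a] += gamma ** i
    return c

t1 = [(0, 1), (2, 0), (1, 1)]                # preferred trajectory (toy data)
t2 = [(0, 0), (1, 0), (1, 1)]
# "Yes" adds the half-space (c2 - c1).r <= 0 to the polytope {r : A r <= b}.
new_row = visit_vector(t2) - visit_vector(t1)
print(new_row)
```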
Reward Elicitation
Occupancy frequency comparison: "Are occupancy frequencies f preferred to g?" A response imposes a linear constraint on reward; "yes" implies Σ_i f(s_i,a_i) r(s_i,a_i) ≥ Σ_i g(s_i,a_i) r(s_i,a_i), i.e. f·r ≥ g·r.
Given a factored model with state variables x = (x1, x2, …, xn) and additive reward r(x) = r1(x1) + r2(x2) + …, comparison queries can be posed over marginal occupancy frequencies, allowing a user to directly specify trade-offs pertaining to portions of the policy.

Reward Elicitation

The marginal frequency over x1 sums out the remaining variables: f(x1) = Σ_{x2,…,xn} f(x1,…,xn). A query then reads: "Do you prefer payoff r(x1 = v) with frequency f(x1 = v), for each value v of x1, to payoff r(x1 = v) with frequency g(x1 = v)?" A "yes" response implies Σ_{x1} f(x1) r(x1) ≥ Σ_{x1} g(x1) r(x1).
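A minimal sketch of the marginal computation and the implied constraint, with invented variable sizes and numbers:

```python
import numpy as np

# Joint occupancy frequencies over x = (x1, x2, x3); toy data.
f_joint = np.random.default_rng(0).dirichlet(np.ones(2 * 3 * 2)).reshape(2, 3, 2)

f_x1 = f_joint.sum(axis=(1, 2))          # f(x1) = sum over x2, x3
g_x1 = np.array([0.3, 0.7])              # marginal of a comparison policy
r_x1 = np.array([1.0, 0.2])              # hypothetical local reward r1(x1)

# A "yes" to the comparison implies f(x1).r1(x1) >= g(x1).r1(x1):
print(f_x1 @ r_x1, ">=", g_x1 @ r_x1, "?", f_x1 @ r_x1 >= g_x1 @ r_x1)
```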
Reward Elicitation
The "value" of a query can be measured by the Expected Value of Information (EVOI). For query parameters φ and possible responses ρ, with R_{φ,ρ} the reward polytope refined by response ρ:

EVOI(φ, R) = MMR(R) − Σ_ρ Pr(ρ | φ) · MMR(R_{φ,ρ})   (with a prior)

EVOI(φ, R) = min_ρ [ MMR(R) − MMR(R_{φ,ρ}) ]   (without a prior)

We can attempt to select queries that maximize this value.
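A toy sketch of myopic query selection by value of information; the candidate queries, the refined minimax-regret values, and the response prior are all invented:

```python
mmr_now = 1.0                        # current minimax regret MMR(R)
mmr_after = {                        # MMR(R refined by each response)
    "is r(s0,a1) >= 0.5?": {"yes": 0.8, "no": 0.3},
    "is f preferred to g?": {"yes": 0.6, "no": 0.55},
}
prior = {"yes": 0.5, "no": 0.5}      # response model, when one is available

def evoi(q):                         # with a prior over responses
    return mmr_now - sum(prior[resp] * v for resp, v in mmr_after[q].items())

def worst_case_voi(q):               # without a prior: min over responses
    return min(mmr_now - v for v in mmr_after[q].values())

print(max(mmr_after, key=evoi))             # best query in expectation
print(max(mmr_after, key=worst_case_voi))   # best query in the worst case
```

On these numbers the two criteria disagree: the bound query wins in expectation, while the comparison query guarantees more regret reduction in the worst case.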
Reward Elicitation
Volumetric approaches: select queries whose responses induce constraints that equally partition the reward polytope. Current solution: attempt to focus the query on the "relevant" part of the reward polytope using the current minimax regret optimal policy and the regret-maximizing adversarial policy.
Reward Elicitation
Allow the user to answer "I don't know". Indifference imposes linear constraints. For the query "Is r(s,a) ≥ b?", it implies the constraint b − δ ≤ r(s,a) ≤ b + δ; for the query comparing occupancy frequencies f and g, it implies the constraint |f·r − g·r| ≤ δ. The implementation required is minor, and the richer responses may improve the effectiveness of elicitation.
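A small sketch of folding an indifference response into a polytope kept as A r ≤ b; the width δ is a modelling assumption:

```python
import numpy as np

n, delta = 4, 0.05                   # reward dimension; indifference width

A_rows, b_vals = [], []              # growing polytope {r : A r <= b}

def add_indifference_to_bound(idx, bound):
    """ "I don't know" to "Is r[idx] >= bound?" gives bound - delta <= r[idx] <= bound + delta."""
    e = np.zeros(n)
    e[idx] = 1.0
    A_rows.extend([e, -e])           # r[idx] <= bound + delta, -r[idx] <= -(bound - delta)
    b_vals.extend([bound + delta, -(bound - delta)])

add_indifference_to_bound(2, 0.5)
print(np.array(A_rows))
print(b_vals)
```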
Reward Elicitation

In some settings we may be able to observe behaviour before elicitation begins. Observing a full policy π implies the constraints (P_π − P_a)(I − γP_π)^{−1} r ≥ 0 for every action a; observing the trajectory s1, a1, …, an−1, sn implies the constraints f(s_i, a_i) ≥ f(s_i, a′_i) for all i and all a′_i ≠ a_i. Open question: if we can request a demonstration of behaviour, how do we select what to observe?
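A sketch of building the policy-observation constraint rows for a toy MDP, assuming state-based rewards as in the inverse-RL formulation this constraint comes from; all transition numbers are invented:

```python
import numpy as np

gamma = 0.9
# P[a] is the |S| x |S| transition matrix of action a (toy 3-state MDP).
P = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.1, 0.0, 0.9],
               [0.0, 0.1, 0.9]]])
pi = np.array([0, 1, 0])                       # observed action in each state

P_pi = P[pi, np.arange(3), :]                  # row s taken from P[pi[s], s]
M = np.linalg.inv(np.eye(3) - gamma * P_pi)    # (I - gamma P_pi)^{-1}

# One block of constraint rows per action: rows @ r >= 0 for feasible r.
for a in range(P.shape[0]):
    rows = (P_pi - P[a]) @ M
    print(f"action {a}:\n{rows}")
```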
Reward Elicitation
Idea: for a factored IRMDP, allow the user to specify the structure of the reward function during elicitation. We can use a concept c to represent a setting of features that a user cares about: r_c(x) = Σ_i r_i(x_i) + c(x)·bonus. Open question: computing minimax regret with respect to both reward uncertainty and concept uncertainty, MMR(R, C) = min_f max_g max_{r∈R} max_{c∈C} (g·r_c − f·r_c).
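A tiny sketch of the concept-augmented reward r_c; the features, the concept, and the bonus value are invented for illustration:

```python
import numpy as np

r_local = [np.array([0.0, 1.0]),           # r1(x1)
           np.array([0.2, 0.0, 0.5])]      # r2(x2)
concept = {0: 1}                           # hypothetical concept: "x1 == 1"
bonus = 0.3

def r_c(x):
    """Additive local rewards, plus a bonus when the state matches the concept."""
    base = sum(r[xi] for r, xi in zip(r_local, x))
    matches = all(x[i] == v for i, v in concept.items())
    return base + (bonus if matches else 0.0)

print(r_c((1, 2)), r_c((0, 2)))            # with and without the bonus
```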
Applications
What makes a good application? Multi-dimensional reward: reward encodes competing objectives, requiring trade-offs to be made. Repeatability: the domain necessitates a reward function being elicited for different users in different settings. Natural MDP structure: the domain comfortably fits the MDP modeling assumptions.
Applications
Autonomic computing: allocate computing and storage resources to application servers with changing demand. (Diagram: hosts 1 through k, each with its own demand and allocated resource, drawing on a total resource pool.) Pro: generally the server utility function has no closed form. Con: the absence of a human user obviates our work towards cognitively tractable queries. We have empirically seen reward elicitation to be effective in this domain.
Applications
COACH: guide a person through a small task by providing verbal or visual cues. (Task diagram: turn on tap, wet hands, soap hands, rinse hands, turn off tap.) Reward elicitation involves a human domain expert, and practitioners found the manual specification of reward functions difficult [BPH05]. We investigated reward elicitation on a simplified version of the COACH domain with good empirical results.
Applications
Purchase decision support (e.g., flight prices): recommend purchase decision strategies that take into account the dynamics of pricing and availability over time. The MDP state would capture current price, availability and features of potential purchases; the reward function captures trade-offs between product features, price and risk attitudes towards future availability and price. There is a clear necessity for specifying an individual reward for each user.
Applications
Address planning and logistics problems faced by large hospitals. Example: scheduling hospital CT scans to minimize used capacity subject to overtime constraints. Rather than specifying hard constraints, elicitation can directly assess preferences over the competing objectives, and reward elicitation could be repeated to express the trade-offs specific to each hospital.
Summary
Computing minimax regret: generating Γ with πWitness and Linear Support; computing exact MMR with the MIP and with Γ; computing approximate MMR with the linear relaxation (of the MIP) and with an approximate Γ; improving the approximation online by adjusting Γ. Reward elicitation: investigated the efficacy of reward bound queries; assessed parameter selection strategies using halve the largest gap and the current solution.
Summary
Future work: further assessment of the developed techniques for computing minimax regret and for reward elicitation, specifically methods for online adjustment of the approximate nondominated policy set Γ.
MMR Computation
Constraint generation for MMR = min_{f∈F} max_{g∈F} max_{r∈R} (g·r − f·r):
1. Limit the adversary: the player finds the policy f which minimizes regret w.r.t. a limited set of adversary choices (constraints).
2. Untie the adversary's hands: given f, the adversary finds the policy g and reward r which maximize regret; add the adversary's choice of (g, r) to the set of adversary choices, and repeat.
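A toy-scale sketch of this loop, with the policy space collapsed to a finite set of occupancy vectors and the reward polytope to interval bounds, so that the master step and the adversary's step are plain enumerations (the real versions are an LP and a MIP, respectively):

```python
import numpy as np

F = [np.array([0.6, 0.4, 0.0]),          # candidate policies as occupancy
     np.array([0.1, 0.2, 0.7]),          # vectors (toy data)
     np.array([0.3, 0.3, 0.4])]
lo, hi = np.zeros(3), np.array([1.0, 2.0, 1.0])

def adversary(f):
    """Untied adversary: regret-maximizing (g, r) against f."""
    best = (-np.inf, None, None)
    for g in F:
        d = g - f
        r = np.where(d > 0, hi, lo)      # r maximizing d.r over the box
        if d @ r > best[0]:
            best = (d @ r, g, r)
    return best

gen = [(F[0], hi.copy())]                # seed set of adversary choices
while True:
    # 1. Limited player: best f against the generated choices only.
    f = min(F, key=lambda fc: max((g - fc) @ r for g, r in gen))
    master_val = max((g - f) @ r for g, r in gen)
    # 2. Untie the adversary; stop when nothing more violated is found.
    true_mr, g, r = adversary(f)
    if true_mr <= master_val + 1e-9:
        break
    gen.append((g, r))

print("minimax regret:", true_mr, "policy:", f)
```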