SLIDE 1

Control, inference and learning

Bert Kappen, SNN, Donders Institute, Radboud University, Nijmegen, and Gatsby Unit, UCL London. July 21, 2015

SLIDE 2

Why control theory?

A theory for intelligent behaviour:

  • neuroscience

SLIDE 3

Why control theory?

A theory for intelligent behaviour:

  • neuroscience
  • robotics

SLIDE 4

Control theory

Given a current state and a future desired state, what is the best/cheapest/fastest way to get there?

SLIDE 5

Why stochastic control?

SLIDE 6

How to control?

Hard problems:

  • a learning and exploration problem
  • a stochastic optimal control computation
  • a representation problem for the control function u(x, t)

SLIDE 7

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.

SLIDE 8

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.
    – Compute optimal control using MC sampling.

SLIDE 9

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.
    – Compute optimal control using MC sampling.
  • Importance sampling
    – Accelerate with importance sampling, a state-feedback controller.

SLIDE 10

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.
    – Compute optimal control using MC sampling.
  • Importance sampling
    – Accelerate with importance sampling, a state-feedback controller.
    – Learn controller from self-generated data.

SLIDE 11

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.
    – Compute optimal control using MC sampling.
  • Importance sampling
    – Accelerate with importance sampling, a state-feedback controller.
    – Learn controller from self-generated data.
    – Optimal importance sampler is optimal control.

SLIDE 12

The idea: Control, Inference and Learning

  • Linear Bellman equation and path integral solution
    – Express a control computation as an inference computation.
    – Compute optimal control using MC sampling.
  • Importance sampling
    – Accelerate with importance sampling, a state-feedback controller.
    – Learn controller from self-generated data.
    – Optimal importance sampler is optimal control.
    – Learn a good importance sampler using PICE.

SLIDE 13

Outline

  • Introduction to control theory
  • Link between control theory, inference and statistical physics
    – Schrödinger, Fleming & Mitter ’82, Kappen ’05, Todorov ’06
  • Importance sampling
    – Relation between optimal sampling and optimal control
  • Cross entropy method for adaptive importance sampling (PICE)
    – A criterion for parametrized control optimization
    – Learning by gradient descent
  • Some examples

SLIDE 14

Discrete time optimal control

Consider the control of a discrete time deterministic dynamical system:

xt+1 = xt + f(xt, ut), t = 0, 1, . . . , T − 1 xt describes the state and ut specifies the control or action at time t.

Given x0 and u0:T−1, we can compute x1:T. Define a cost for each sequence of controls:

C(x0, u0:T−1) =

T−1

  • t=0

V(xt, ut)

Find the sequence u0:T−1 that minimizes C(x0, u0:T−1).

SLIDE 15

Dynamic programming

Find the minimal cost path from A to J.

C(F) = min(6 + C(H), 3 + C(I)) = 7

The minimal cost at time t is easily expressed in terms of the minimal cost at time t + 1.

SLIDE 16

Discrete time optimal control

Dynamic programming uses the concept of the optimal cost-to-go J(t, x). One can recursively compute J(t, x) from J(t + 1, x) for all x in the following way:

$$J(t, x_t) = \min_{u_{t:T-1}} \sum_{s=t}^{T-1} V(x_s, u_s) = \min_{u_t} \Big( V(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \Big)$$

$$J(T, x) = 0, \qquad J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})$$

This is called the Bellman equation. It computes the optimal control $u_t(x)$ for all intermediate $t, x$.
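A minimal numerical sketch of this backward recursion, for an assumed discretized one-dimensional problem (the grids, the dynamics f and the cost V below are illustrative choices, not the example of these slides):

```python
import numpy as np

# Assumed discretized problem: 1-d state grid, control grid, simple integrator dynamics.
xs = np.linspace(-2.0, 2.0, 81)            # state grid
us = np.linspace(-1.0, 1.0, 21)            # control grid
T, dt = 20, 0.1

def f(x, u):
    return u * dt                          # x_{t+1} = x_t + f(x_t, u_t)

def V(x, u):
    return (x ** 2 + 0.1 * u ** 2) * dt    # running cost V(x_t, u_t)

J = np.zeros((T + 1, len(xs)))             # J[T, :] = 0: no end cost in this setup
policy = np.zeros((T, len(xs)), dtype=int)

for t in range(T - 1, -1, -1):             # backward Bellman recursion
    for i, x in enumerate(xs):
        x_next = x + f(x, us)                              # one candidate next state per control
        q = V(x, us) + np.interp(x_next, xs, J[t + 1])     # V + J(t+1, x + f(x, u))
        policy[t, i] = np.argmin(q)
        J[t, i] = q[policy[t, i]]

print("optimal cost-to-go J(0, x=0) =", J[0, np.argmin(np.abs(xs))])
```

The nested loop over the state grid makes the curse of dimensionality explicit: the cost of each backward sweep grows exponentially with the state dimension.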

SLIDE 17

Stochastic optimal control

Consider a stochastic dynamical system

$$dX^i = f^i(X_t, u)\,dt + dW^i, \qquad E(dW^i dW^j) = \nu^{ij}\,dt$$

Given x(0), find the control function u(x, t) that minimizes the expected future cost

$$C = E\left[ \phi(X_T) + \int_0^T dt\, V(X_t, u(X_t, t)) \right]$$

The expectation is over all trajectories given the control function.

$$J(t, x) = \min_u \Big( V(x, u)\,dt + E\, J(t + dt, x + dX) \Big)$$

$$-\partial_t J(t, x) = \min_u \left( V(x, u) + f(x, u)^T \nabla_x J(x, t) + \tfrac{1}{2}\mathrm{Tr}\big( \nu\, \nabla_x^2 J(x, t) \big) \right)$$

with u = u(x, t) and boundary condition J(x, T) = φ(x). This is the Hamilton-Jacobi-Bellman (HJB) equation.
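The step from the dynamic programming recursion to the PDE follows from a Taylor expansion of $E\,J(t+dt, x+dX)$ together with the Itô rule $E[dX\,dX^T] = \nu\,dt$; a sketch of the intermediate step:

$$E\,J(t+dt, x+dX) = J(t, x) + \partial_t J\,dt + E[dX]^T \nabla_x J + \tfrac{1}{2}\mathrm{Tr}\big( E[dX\,dX^T]\, \nabla_x^2 J \big) + o(dt)$$

$$= J(t, x) + \partial_t J\,dt + f(x, u)^T \nabla_x J\,dt + \tfrac{1}{2}\mathrm{Tr}\big( \nu\, \nabla_x^2 J \big)\,dt + o(dt)$$

Substituting this into the recursion, subtracting J(t, x), dividing by dt and letting dt → 0 gives the HJB equation above.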

SLIDE 18

Computing the optimal control solution is hard

  • solve a Bellman Equation, a PDE
  • scales badly with dimension

Efficient solutions exist for

  • linear dynamical systems with quadratic costs (Gaussians)
  • deterministic systems (no noise)

SLIDE 19

Path integral control theory

$$dX_t = f(X_t, t)\,dt + g(X_t, t)\,(u\,dt + dW_t)$$

$$C = E\left[ \phi(X_T) + \int_t^T ds\, \Big( V(X_s, s) + \tfrac{1}{2} u(X_s, s)^T R\, u(X_s, s) \Big) \right]$$

with $E(dW^a dW^b) = \nu^{ab}\,dt$ and $R = \lambda\nu^{-1}$, $\lambda > 0$; $f \in R^n$, $g \in R^{n \times m}$, $u \in R^m$.

The HJB equation becomes

$$-\partial_t J = \min_u \left( \tfrac{1}{2} u^T R u + V + (f + gu)^T \nabla J + \tfrac{1}{2}\mathrm{Tr}\big( g\nu g^T \nabla^2 J \big) \right)$$

with boundary condition J(x, T) = φ(x).

SLIDE 20

Path integral control theory

Minimization with respect to u yields:

$$u(x, t) = -R^{-1} g^T(x, t)\,\nabla J(x, t)$$

$$-\partial_t J = -\tfrac{1}{2}(\nabla J)^T g R^{-1} g^T (\nabla J) + V + f^T \nabla J + \tfrac{1}{2}\mathrm{Tr}\big( g\nu g^T \nabla^2 J \big)$$

Define ψ(x, t) through J(x, t) = −λ log ψ(x, t). We obtain a linear equation:

$$\partial_t \psi = \left( \frac{V}{\lambda} - f^T \nabla - \tfrac{1}{2}\mathrm{Tr}\big( g\nu g^T \nabla^2 \big) \right) \psi$$
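A sketch of why the choice $R = \lambda\nu^{-1}$ makes the log transform work: with $J = -\lambda\log\psi$ one has $\nabla J = -\lambda\nabla\psi/\psi$ and $\nabla^2 J = -\lambda\big(\nabla^2\psi/\psi - \nabla\psi\nabla\psi^T/\psi^2\big)$, so

$$-\tfrac{1}{2}(\nabla J)^T g R^{-1} g^T \nabla J = -\frac{\lambda}{2}\, \frac{(\nabla\psi)^T g\nu g^T \nabla\psi}{\psi^2}, \qquad \tfrac{1}{2}\mathrm{Tr}\big( g\nu g^T \nabla^2 J \big) = -\frac{\lambda}{2}\, \frac{\mathrm{Tr}\big( g\nu g^T \nabla^2\psi \big)}{\psi} + \frac{\lambda}{2}\, \frac{(\nabla\psi)^T g\nu g^T \nabla\psi}{\psi^2}.$$

The terms quadratic in ∇ψ cancel exactly because $R^{-1} = \nu/\lambda$, and the remaining equation is linear in ψ.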

SLIDE 21

Feynman-Kac formula

Denote by q(τ|x, t) the distribution over uncontrolled trajectories that start at x, t:

$$dX_t = f(X_t, t)\,dt + g(X_t, t)\,dW_t$$

with τ a trajectory x(t → T). Then

$$\psi(x, t) = \int dq(\tau|x, t)\, \exp\left( -\frac{S(\tau)}{\lambda} \right) = E_q\left[ e^{-S/\lambda} \right]$$

$$S(\tau) = \phi(x(T)) + \int_t^T ds\, V(x(s), s)$$

SLIDE 22

Posterior distribution over optimal trajectories

ψ(x, t) is the partition sum of the distribution over paths under optimal control:

$$p^*(\tau|x, t) = \frac{1}{\psi(x, t)}\, q(\tau|x, t)\, \exp\left( -\frac{S(\tau)}{\lambda} \right)$$

The optimal cost-to-go is a free energy:

$$J(x, t) = -\lambda \log E_q\left[ e^{-S/\lambda} \right]$$

The optimal control is an expectation with respect to p*:

$$u^*(x, t)\,dt = E_{p^*}\left[ dW_t \right] = \frac{E_q\left[ dW_t\, e^{-S/\lambda} \right]}{E_q\left[ e^{-S/\lambda} \right]}$$

Both J and u* can be computed by forward sampling from q.
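A minimal sketch of this naive forward-sampling estimator, for an assumed one-dimensional problem with zero uncontrolled drift and an end cost only (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1-d problem: uncontrolled dynamics dX = dW, end cost phi, no running cost V.
phi = lambda x: (x - 1.0) ** 2
lam, nu = 1.0, 1.0                    # lambda and noise variance (so R = lam/nu = 1)
x0, T, dt, N = 0.0, 1.0, 0.01, 10000
steps = int(T / dt)

X = np.full(N, x0)
S = np.zeros(N)
first_dW = None                       # noise of the first step, needed for u*(x0, 0)

for k in range(steps):
    dW = rng.normal(0.0, np.sqrt(nu * dt), size=N)
    if k == 0:
        first_dW = dW
    X = X + dW                        # forward sampling from the uncontrolled process q
S += phi(X)                           # S(tau) = phi(X_T) here

w = np.exp(-S / lam)                                  # weights e^{-S/lambda}
J = -lam * np.log(np.mean(w))                         # J(x0, 0) = -lambda log E_q[e^{-S/lambda}]
u_star = np.sum(first_dW * w) / np.sum(w) / dt        # u*(x0, 0) dt = E_{p*}[dW_0]
print(f"J(x0,0) ~ {J:.3f}   u*(x0,0) ~ {u_star:.3f}")
```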

SLIDE 23

Delayed choice

$$dX_t = u(X_t, t)\,dt + dW_t, \qquad E\big[ dW_t^2 \big] = \nu\,dt$$

$$C = E_p\left[ \phi(X_T) + \int_0^2 dt\, \tfrac{1}{2} u(t)^2 \right]$$

The cost encodes two targets at t = 2.

(Figure: sample trajectories for 0 ≤ t ≤ 2.)

SLIDE 24

Delayed choice

Time-to-go T = 2 − t.

(Figure: left, sample trajectories; right, the cost-to-go J(x, t) as a function of x for T = 2, 1, 0.5.)

$$J(x, t) = -\nu \log E_q\left[ \exp(-\phi(X_2)/\nu) \right]$$

The decision is made around T = 1.

SLIDE 25

Delayed choice

Time-to-go T = 2 − t.

(Figure: left, sample trajectories; right, the cost-to-go J(x, t) as a function of x for T = 2, 1, 0.5.)

$$J(x, t) = -\nu \log E_q\left[ \exp(-\phi(X_2)/\nu) \right]$$

"When the future is uncertain, delay your decisions."
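A small sketch of evaluating J(x, t) by sampling the uncontrolled dynamics for a two-target end cost; the specific φ, noise level and target positions are illustrative assumptions, not the values behind the figures:

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 0.5                                   # noise variance (assumed)
targets = np.array([-1.0, 1.0])            # two targets at t = 2 (assumed)

def phi(x):
    # End cost: sharp quadratic distance to the nearest of the two targets.
    return np.min((x[:, None] - targets[None, :]) ** 2, axis=1) / 0.1

def J(x, t, n_samples=20000):
    """J(x, t) = -nu log E_q[exp(-phi(X_2)/nu)] for the uncontrolled dynamics dX = dW."""
    X2 = x + rng.normal(0.0, np.sqrt(nu * (2.0 - t)), size=n_samples)
    return -nu * np.log(np.mean(np.exp(-phi(X2) / nu)))

# How J varies over x changes with the time-to-go, which is the delayed-choice effect in the figure.
for t in (0.0, 1.0, 1.5):
    print(f"t = {t}:", [round(J(x, t), 2) for x in (-1.0, 0.0, 1.0)])
```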

SLIDE 26

KL control

The uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ from t to T. The cost of a trajectory τ is

$$S(\tau) = \phi(x_T) + \int_t^T ds\, V(x_s, s).$$

Find the optimal distribution p(τ|x, t) that minimizes $E_p[S]$ and is 'close' to q(τ|x, t).

SLIDE 27

KL control

Find p* that minimizes

$$C(p) = KL(p|q) + E_p[S], \qquad KL(p|q) = \int d\tau\, p(\tau|x, t) \log \frac{p(\tau|x, t)}{q(\tau|x, t)}$$

The optimal solution is given by

$$p^*(\tau|x, t) = \frac{1}{\psi(x, t)}\, q(\tau|x, t)\, \exp(-S(\tau|x, t)), \qquad \psi(x, t) = \int d\tau\, q(\tau|x, t)\, \exp(-S(\tau|x, t))$$

The optimal cost is:

$$C(p^*) = -\log \psi(x, t)$$

SLIDE 28

Controlled diffusions are special case

In the case of controlled diffusions, p is parametrized by the control function u(x, t):

$$dX_t = f(X_t, t)\,dt + g(X_t, t)\,(u(X_t, t)\,dt + dW_t), \qquad E(dW^i dW^j) = \nu^{ij}\,dt$$

$$C(p) = E_p\left[ \phi(X_T) + \int_t^T ds\, \Big( \tfrac{1}{2} u(X_s, s)^T \nu^{-1} u(X_s, s) + V(X_s, s) \Big) \right]$$

ψ(x, t) is the solution of the linear Bellman equation and J(x, t) = −log ψ(x, t) is the optimal cost-to-go.

SLIDE 29

Sampling efficiency

(Figure: trajectories sampled from the uncontrolled dynamics for 0 ≤ t ≤ 2.)

Sampling with the uncontrolled dynamics is theoretically correct, but inefficient in practice.

SLIDE 30

Importance sampling

(Figure: the density q(x).)

Consider a simple 1-d sampling problem. Given q(x), compute

$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, q(x)\, dx$$

with I(x) = 1 if x < 0 and I(x) = 0 if x > 0. Naive method: generate N samples $X_i \sim q$ and compute

$$\hat{a} = \frac{1}{N} \sum_{i=1}^{N} I(X_i)$$

SLIDE 31

Importance sampling

(Figure: the densities q(x) and p(x).)

Consider another distribution p(x). Then

$$a = \mathrm{Prob}(x < 0) = \int_{-\infty}^{\infty} I(x)\, \frac{q(x)}{p(x)}\, p(x)\, dx$$

Importance sampling: generate N samples $X_i \sim p$ and compute

$$\hat{a} = \frac{1}{N} \sum_{i=1}^{N} I(X_i)\, \frac{q(X_i)}{p(X_i)}$$

Unbiased (= correct) for any p!
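A minimal numpy comparison of the naive and the importance-sampled estimator; the particular q, p and sample size are illustrative assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
N = 10000

# Assumed densities: q = N(1, 1) (the given distribution), p = N(-1, 1) (a proposal with mass at x < 0).
q_pdf = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
p_pdf = lambda x: np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)
I = lambda x: (x < 0).astype(float)

Xq = rng.normal(1.0, 1.0, N)                        # naive: sample from q directly
a_naive = np.mean(I(Xq))

Xp = rng.normal(-1.0, 1.0, N)                       # importance sampling: sample from p, reweight by q/p
a_is = np.mean(I(Xp) * q_pdf(Xp) / p_pdf(Xp))

exact = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))   # Prob(x < 0) under N(1, 1)
print(f"naive: {a_naive:.4f}   importance sampling: {a_is:.4f}   exact: {exact:.4f}")
```

Both estimators are unbiased; the importance sampler simply spends its samples where I(x) is non-zero and therefore has lower variance for this event.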

SLIDE 32

Optimal importance sampling

(Figure: the optimal importance sampling density p*(x).)

The distribution

$$p^*(x) = \frac{q(x)\, I(x)}{a}$$

is the optimal importance sampler. One sample $X_i \sim p^*$ is sufficient to estimate a:

$$\hat{a} = \frac{1}{N} \sum_{i=1}^{N} I(X_i)\, \frac{q(X_i)}{p^*(X_i)} = a$$

"The optimal importance sampler has zero variance."

SLIDE 33

Importance sampling and control

Theorem 1. The solution of the control problem is given by (the factor 1/λ is absorbed into S):

$$J(x, t) = -\log E_q\left[ e^{-S} \right] = -\log E_p\left[ e^{-S}\, \frac{dq}{dp} \right] = -\log E_u\left[ e^{-S_u} \right]$$

$$u^*(x, t)\,dt = \frac{E_q\left[ dW_t\, e^{-S} \right]}{E_q\left[ e^{-S} \right]} = u(t, x)\,dt + \frac{E_u\left[ dW_t\, e^{-S_u} \right]}{E_u\left[ e^{-S_u} \right]}$$

$$\frac{dq}{dp} = \exp\left( \int_t^T dt\, \tfrac{1}{2} u(x, t)^T \nu^{-1} u(x, t) - \int_t^T u(x, t)^T \nu^{-1}\, dW_t \right)$$

with $E_p = E_u$ the expectation over trajectories generated with the sampling control u, and $e^{-S_u} = e^{-S}\, \frac{dq}{dp}$.

We can choose any p, i.e. any sampling control u.
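A sketch of the importance-sampled estimators of Theorem 1 for an assumed one-dimensional problem with g = 1 and ν = 1; the end cost, the sampling control and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed 1-d problem: dX = u dt + dW, end cost phi, no running state cost, nu = 1.
phi = lambda x: 5.0 * (x - 1.0) ** 2
u_sample = lambda x, t: 1.0 - x                 # some sampling control (not the optimal one)
nu, x0, T, dt, N = 1.0, 0.0, 1.0, 0.01, 5000
steps = int(T / dt)

X = np.full(N, x0)
S_u = np.zeros(N)          # S_u = S + int 1/2 u^2/nu dt + int u/nu dW, with dW the sampling noise
first_dW = None

for k in range(steps):
    u = u_sample(X, k * dt)
    dW = rng.normal(0.0, np.sqrt(nu * dt), size=N)
    if k == 0:
        first_dW = dW
    S_u += 0.5 * u ** 2 / nu * dt + u / nu * dW    # importance-sampling (Girsanov) correction
    X = X + u * dt + dW                            # controlled dynamics used for sampling
S_u += phi(X)

w = np.exp(-S_u)
J = -np.log(np.mean(w))                                              # estimate of J(x0, 0)
u_star0 = u_sample(x0, 0.0) + np.sum(w * first_dW) / np.sum(w) / dt  # estimate of u*(x0, 0)
print(f"J(x0,0) ~ {J:.3f}   u*(x0,0) ~ {u_star0:.3f}")
```

Any sampling control gives an unbiased estimate; a good one concentrates the weights and reduces the variance, which is the subject of the next slides.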

SLIDE 34

Importance sampling and control

(Figure: two panels of sampled trajectories for 0 ≤ t ≤ 2.)

SLIDE 35

Relation between optimal sampling and optimal control

Definition 2.

  • 1. The weight of a path is defined as $\alpha_u = \dfrac{e^{-S_u(t_0, x_0)}}{E\left[ e^{-S_u(t_0, x_0)} \right]}$.
  • 2. The fraction of effective samples is $FES = \dfrac{1}{E\left[ (\alpha_u)^2 \right]} = \dfrac{1}{\mathrm{Var}(\alpha_u) + 1}$.

Theorem 3. Let 0 < ε < 1. Then:

  • 1. $(u^* - u)^T(u^* - u) \le \dfrac{\epsilon}{t_1 - t_0}$ point-wise implies $\mathrm{Var}(\alpha_u) \le \dfrac{\epsilon}{1 - \epsilon}$.
  • 2. $\mathrm{Var}(\alpha_u) \le \epsilon$ implies $\displaystyle\int_{t_0}^{t_1} (u^* - u)^T(u^* - u)\, dt \le \epsilon$.

In words:

  • 1. A better u (in the sense of optimal control) provides a better sampler (in the sense of effective sample size).
  • 2. The optimal u = u* (in the sense of optimal control) requires only one sample.
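A small helper for estimating the fraction of effective samples from sampled path costs $S_u$ (the function name and the synthetic example are my own):

```python
import numpy as np

def fraction_effective_samples(S_u):
    """FES = 1 / E[alpha^2] with alpha = e^{-S_u} / E[e^{-S_u}], estimated from sampled path costs."""
    w = np.exp(S_u.min() - S_u)        # shift for numerical stability; it cancels in alpha
    alpha = w / np.mean(w)
    return 1.0 / np.mean(alpha ** 2)

# Synthetic illustration: widely spread path costs give a poor sampler, tightly spread ones a good one.
rng = np.random.default_rng(4)
print(fraction_effective_samples(rng.normal(0.0, 3.0, 10000)))   # small FES
print(fraction_effective_samples(rng.normal(0.0, 0.1, 10000)))   # FES close to 1
```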

SLIDE 36

The Cross-entropy method

Let X be a random variable taking values in a space $\mathcal{X}$. Let $f_v(x)$ be a family of probability density functions on $\mathcal{X}$ parametrized by v, and let h(x) be a positive function. Suppose that we are interested in the expectation value

$$a = E_u[h] = \int dx\, f_u(x)\, h(x)$$

where $E_u$ denotes expectation with respect to the pdf $f_v$ for the particular value v = u.

The optimal importance sampling distribution is $g^*(x) = h(x) f_u(x)/a$. The cross-entropy method suggests finding the distribution $f_v$, within the parametrized family, that minimizes the KL divergence

$$KL(g^*|f_v) = \int dx\, g^*(x) \log \frac{g^*(x)}{f_v(x)} \propto -E_{g^*}\left[ \log f_v(X) \right] \propto -E_u\left[ h(X) \log f_v(X) \right] = -D(v)$$

SLIDE 37

The Cross-entropy method

We can again use importance sampling to compute D(v):

$$D(v) = E_u\left[ h(X) \log f_v(X) \right] = E_w\left[ h(X)\, \frac{f_u(X)}{f_w(X)}\, \log f_v(X) \right]$$

We estimate the expectation value by drawing N samples from $f_w$. If D is concave and differentiable with respect to v, the optimal v is given by

$$\frac{1}{N} \sum_{i=1}^{N} h(X_i)\, \frac{f_u(X_i)}{f_w(X_i)}\, \frac{d}{dv} \log f_v(X_i) = 0, \qquad X_i \sim f_w$$

SLIDE 38

The CE algorithm

Initialize $w_0 = u$.
for k = 0, ..., K do
    generate N samples $X_{1:N}$ from $f_{w_k}$
    compute v by solving
    $$\frac{1}{N} \sum_{i=1}^{N} h(X_i)\, \frac{f_u(X_i)}{f_{w_k}(X_i)}\, \frac{d}{dv} \log f_v(X_i) = 0$$
    set $w_{k+1} = v$
end for
return $w_K$
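A runnable sketch of this loop for a one-dimensional Gaussian family $f_v = N(v, 1)$ and a rare-event function h; the family, the event and the closed-form update are illustrative assumptions (for this exponential-family case the stationarity condition reduces to a weighted mean):

```python
import math
import numpy as np

rng = np.random.default_rng(5)

# Assumed setting: f_v = N(v, 1), u = 0, and the rare event h(x) = 1{x > 3}.
u, sigma = 0.0, 1.0
h = lambda x: (x > 3.0).astype(float)
f = lambda x, v: np.exp(-0.5 * ((x - v) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

N, K = 5000, 10
w_k = u                                          # start by sampling from f_u itself
for k in range(K):
    X = rng.normal(w_k, sigma, N)
    weights = h(X) * f(X, u) / f(X, w_k)         # h(X) f_u(X) / f_{w_k}(X)
    if weights.sum() == 0.0:                     # crude fallback if no sample hits the event yet
        w_k += 1.0
        continue
    w_k = np.sum(weights * X) / np.sum(weights)  # stationarity condition for the Gaussian mean

# Importance-sampling estimate of a = Prob(X > 3) under N(0, 1), using the adapted proposal f_{w_K}.
X = rng.normal(w_k, sigma, N)
a_hat = np.mean(h(X) * f(X, u) / f(X, w_k))
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
print(f"adapted mean: {w_k:.2f}   estimate: {a_hat:.2e}   exact: {exact:.2e}")
```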

SLIDE 39

The CE method for PI control: Preliminaries

Let $\mathcal{X}$ denote the space of continuous trajectories on the interval [t, T]: τ = {X(s), t ≤ s ≤ T}, with fixed initial value X(t) = x, satisfying the dynamics

$$dX_t = f(X_t, t)\,dt + g(X_t, t)\,(u(X_t, t)\,dt + dW_t)$$

Denote by $p_u(\tau)$ the distribution over trajectories τ with control u. The distributions $p_u$ and $p_0$ are related by the Girsanov theorem:

$$p(X_{s+ds}|X_s) = N(X_{s+ds}|\mu_s, \Xi_s\,ds), \qquad \mu_s = X_s + E[dX_s], \qquad \Xi_s\,ds = E[dX_s^2]$$

$$p_u(\tau) = \lim_{ds \to 0} \prod_{s=t}^{T-ds} N(X_{s+ds}|\mu_s, \Xi_s\,ds) = p_0(\tau)\, \exp\left( -\int_t^T ds\, \tfrac{1}{2} u^2(s, X_s) + \int_t^T u(s, X_s)\, g(s, X_s)^{-1}\,\big( dX_s - f(s, X_s)\,ds \big) \right)$$

SLIDE 40

The Radon-Nikodym derivative can be used to rewrite the optimal distribution:

$$\frac{dp_0(\tau)}{dp_u(\tau)} = \exp\left( \int_t^T ds\, \tfrac{1}{2} u^2(s, X(s)) - \int_t^T u(s, X(s))\, dW(s) \right)$$

$$p^*(\tau) = \frac{1}{\psi(t, x)}\, p_0(\tau)\, \exp(-V(\tau)) = \frac{1}{\psi(t, x)}\, p_u(\tau)\, \frac{dp_0(\tau)}{dp_u(\tau)}\, \exp(-V(\tau)) = \frac{1}{\psi(t, x)}\, p_u(\tau)\, \exp(-S(t, x, u))$$

SLIDE 41

The CE method for PI control

We have a family of distributions $p_u$. We wish to compute a near-optimal control $\hat{u}$ such that $p_{\hat{u}}$ is close to p*. Following the CE argument, we minimize

$$KL(p^*|p_{\hat{u}}) = E_{p^*}\left[ \log p^* \right] - E_{p^*}\left[ \log p_{\hat{u}} \right] \propto -E_{p^*}\left[ \log p_{\hat{u}} \right]$$

$$\propto E_{p^*}\left[ \int_t^T \tfrac{1}{2}\hat{u}^2(s, X_s)\,ds - \hat{u}(s, X_s)\, g(s, X_s)^{-1}\,\big( dX_s - f(s, X_s)\,ds \big) \right]$$

$$= \frac{1}{\psi(t, x)}\, E_p\left[ e^{-S(t, x, u)} \int_t^T ds\, \left( \tfrac{1}{2}\hat{u}(s, X(s))^2 - \hat{u}(s, X(s)) \left( u(s, X(s)) + \frac{dW_s}{ds} \right) \right) \right]$$

The expression must be optimized with respect to the functions $\hat{u}_{t:T} = \{\hat{u}(s, X_s), t \le s \le T\}$. It is independent of the sampling control $u_{t:T} = \{u(s, X_s), t \le s \le T\}$.

SLIDE 42

The CE method for PI control: Time-dependent solution

We now assume that $\hat{u}$ is a parametrized function with parameters θ. In the time-dependent case, we consider a different $\theta_s$ for each of the functions $\hat{u}(s, x|\theta_s)$ separately. The gradient is given by:

$$\frac{\partial KL(p^*|\hat{p})}{\partial \theta_s} = \frac{1}{\psi(t, x)}\, E_p\left[ e^{-S(t, x, u)} \left( \hat{u}(s, X(s)) - u(s, X(s)) - \frac{dW_s}{ds} \right) \frac{\partial \hat{u}(s, X(s))}{\partial \theta_s} \right]$$

Choosing $u = \hat{u}$ yields the gradient procedure

$$\theta_{s, n+1} = \theta_{s, n} - \eta\, \left. \frac{\partial KL(p^*|\hat{p})}{\partial \theta_{s, n}} \right|_{u = \hat{u}_n} = \theta_{s, n} + \eta\, \left\langle \frac{dW_s}{ds}\, \frac{\partial \hat{u}(s, X(s))}{\partial \theta_{s, n}} \right\rangle$$

with $\langle F \rangle = \frac{1}{\psi(t, x)}\, E_p\left[ e^{-S(t, x, u)}\, F \right]$ and η > 0 a small parameter.

Convergence is guaranteed. We refer to this gradient method as PICE.
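A compact sketch of the PICE iteration for an assumed one-dimensional problem, with a time-dependent affine parametrization $\hat{u}(s, x) = a_s + b_s x$; the problem, the learning rate and the sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed 1-d problem: dX = u dt + dW, end cost phi, nu = lambda = 1.
phi = lambda x: 5.0 * (x - 1.0) ** 2
x0, T, dt, N, eta, n_iter = 0.0, 1.0, 0.02, 2000, 0.5, 50
steps = int(T / dt)

a = np.zeros(steps)                  # u_hat(s, x) = a_s + b_s * x, one parameter pair per time step
b = np.zeros(steps)

for it in range(n_iter):
    X = np.full(N, x0)
    S_u = np.zeros(N)
    noise = np.empty((steps, N))
    states = np.empty((steps, N))
    for k in range(steps):
        u = a[k] + b[k] * X                      # sampling control = current u_hat (u = u_hat)
        dW = rng.normal(0.0, np.sqrt(dt), size=N)
        noise[k], states[k] = dW, X
        S_u += 0.5 * u ** 2 * dt + u * dW        # importance-sampling correction to the path cost
        X = X + u * dt + dW
    S_u += phi(X)
    w = np.exp(-(S_u - S_u.min()))               # weights proportional to e^{-S_u}, stabilized
    w /= np.sum(w)
    for k in range(steps):                       # theta <- theta + eta * < dW/ds * d u_hat / d theta >
        a[k] += eta * np.sum(w * noise[k]) / dt              # d u_hat / d a_s = 1
        b[k] += eta * np.sum(w * noise[k] * states[k]) / dt  # d u_hat / d b_s = x

print("learned a[0], b[0]:", round(a[0], 3), round(b[0], 3))
```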

SLIDE 43

The CE method for PI control: Time-dependent solution

Linear basis functions:

$$\hat{u}(s, x) = \sum_{k=1}^{K} \theta_{sk}\, h_{sk}(x), \qquad u(s, x) = \sum_{k=1}^{K} \theta^0_{sk}\, h_{sk}(x)$$

We obtain a regression problem:

$$\sum_{l=1}^{K} \left( \theta_{sl} - \theta^0_{sl} \right) \left\langle h_{sl}\, h_{sk} \right\rangle = \left\langle \frac{dW_s}{ds}\, h_{sk} \right\rangle$$

For each s this is a system of K linear equations with K unknowns $\theta_{sk}$, k = 1, ..., K. The statistics $\langle h_{sl} h_{sk} \rangle$ and $\langle \frac{dW_s}{ds} h_{sk} \rangle$ can be estimated for all times t ≤ s ≤ T simultaneously from a single Monte Carlo sampling run using the control u parametrized by $\theta^0$.
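A sketch of this per-time-step regression, given states, noise increments and normalized path weights from a single sampling run such as the one in the PICE sketch above (the function and the basis are my own illustrations):

```python
import numpy as np

def regression_update(theta0, states, noise, weights, dt, basis):
    """Solve <h_l h_k> (theta_s - theta0_s) = <dW_s/ds h_k> for every time step s.

    states, noise: arrays of shape (steps, N); weights: normalized path weights of shape (N,);
    basis: function mapping a state array of shape (N,) to basis values of shape (K, N)."""
    theta = np.empty_like(theta0)
    for s in range(theta0.shape[0]):
        h = basis(states[s])                          # (K, N)
        A = (h * weights) @ h.T                       # weighted statistics <h_l h_k>, shape (K, K)
        rhs = (h * weights) @ (noise[s] / dt)         # weighted statistics <dW/ds h_k>, shape (K,)
        theta[s] = theta0[s] + np.linalg.solve(A, rhs)
    return theta

# Example basis with K = 2: h_1(x) = 1, h_2(x) = x, i.e. u_hat(s, x) = theta_s1 + theta_s2 * x.
basis = lambda x: np.vstack([np.ones_like(x), x])
```

All statistics for all time steps come from the same set of sampled trajectories, so one Monte Carlo run updates the whole time-dependent controller.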

SLIDE 44

The CE method for PI control: Time-independent solution

We consider $\hat{u}(X_s)$ independent of time, parametrized by θ. The gradient of the KL divergence involves an integral:

$$\frac{\partial KL(p^*|\hat{p})}{\partial \theta} = \frac{1}{\psi(t, x)}\, E_p\left[ e^{-S(t, x, u)} \left( \int_t^T ds\, \big( \hat{u}(X(s)) - u(X(s)) \big)\, \frac{\partial \hat{u}(X(s))}{\partial \theta} - \int_t^T dW(s)\, \frac{\partial \hat{u}(X(s))}{\partial \theta} \right) \right]$$

Choosing $u = \hat{u}$ yields the gradient procedure

$$\theta_{n+1} = \theta_n - \eta\, \left. \frac{\partial KL(p^*|\hat{p})}{\partial \theta_n} \right|_{u = \hat{u}_n} = \theta_n + \eta\, \left\langle \int_t^T dW_s\, \frac{\partial \hat{u}(X(s))}{\partial \theta_n} \right\rangle$$

SLIDE 45

Example: Linear time-dependent feedback control

For $t_0 \le t \le t_1$, the 1-dimensional problem

$$dX_t = X_t\left( \tfrac{1}{2}\,dt + u(X_t, t)\,dt + dW_t \right), \qquad C = E\left[ \tfrac{Q}{2} \log(X_T)^2 \right]$$

(plus the standard quadratic control cost of the path integral setting) has solution

$$u^*(t, x) = -\frac{Q \log(x)}{Q(t_1 - t) + 1}.$$

For the experiments we take $x_0 = 1/2$, $t_0 = 0$, $t_1 = 1$, $Q = 10$.
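A sketch of how performance estimates like those in the table on the next slide can be computed, here for the exact controller u*; the simulation is done in y = log x, where the dynamics reduce to dY = u dt + dW (the estimator details are my own, while the discretization dt = 0.001 and the 10000 sample paths follow the numbers quoted on these slides):

```python
import numpy as np

rng = np.random.default_rng(7)

x0, t0, t1, Q = 0.5, 0.0, 1.0, 10.0
dt, N = 0.001, 10000
steps = int((t1 - t0) / dt)

def u_star(t, x):
    return -Q * np.log(x) / (Q * (t1 - t) + 1.0)

Y = np.full(N, np.log(x0))            # y = log x; the dt/2 drift cancels the Ito correction
S = np.zeros(N)                       # path cost, including control cost and sampling correction
for k in range(steps):
    t = t0 + k * dt
    u = u_star(t, np.exp(Y))
    dW = rng.normal(0.0, np.sqrt(dt), size=N)
    S += 0.5 * u ** 2 * dt + u * dW   # quadratic control cost plus importance-sampling term
    Y = Y + u * dt + dW
S += 0.5 * Q * Y ** 2                 # end cost (Q/2) log(X_T)^2

alpha = np.exp(-(S - S.min()))
alpha /= np.mean(alpha)               # normalized weights alpha_u
print(f"E[S] ~ {np.mean(S):.3f}   Var(alpha) ~ {np.var(alpha):.4f}   FES ~ {100.0 / np.mean(alpha ** 2):.1f}%")
```

Replacing u_star by one of the approximate parametrizations gives the corresponding columns of the table.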

SLIDE 46

Example: Linear time-dependent feedback control

Consider different state-dependent parametrizations:

  • one basis function, log(x), which can represent the exact controller
  • three polynomial parametrizations: a constant, affine and quadratic function of the state, denoted by $u^{(0)}, u^{(1)}, u^{(2)}$, e.g. $u^{(2)}(t, x) = a(t) + b(t)x + c(t)x^2$.

                 u = 0    u(0)     u(1)     u(2)     a(t) log(x)   u*
  E[S]           7.526    5.139    1.507    1.461    1.422         1.420
  Var(α_u)       1.981    1.376    0.143    0.0506   0.0085        0.0071
  FES (%)        34.3     42.08    87.5     95.2     99.1          99.3

Performance estimates of the various controllers, based on 10000 sample paths.

SLIDE 47

Example: Linear time-dependent feedback control

(Figure: u(t, x) versus x for $u^{(0)}, u^{(1)}, u^{(2)}, u^*$, together with a histogram of the particles x(t) at t = 1/2.)

State dependence of the feedback controllers at the intermediate time t = 1/2. The approximate controls were calculated with 10000 sample paths, using a time discretization of dt = 0.001 for numerical integration. The histogram was created with 10000 draws from $X_{u^*}(t)$ at t = 1/2.

SLIDE 48

Example: Latent state estimation

The path integral control computation is mathematically equivalent to a Bayesian inference problem in a time series model, with $p_0(\tau)$ the forward model and $e^{-V(\tau)} = \prod_t p(y_t|x_t)$ the likelihood of the trajectory $\tau = x_{t:T}|x$. The Bayesian posterior is then given by $p^*(\tau)$. PICE provides an efficient alternative to particle smoothing methods.

(Figure: three panels; MSE of the mean estimate versus time for BPS and RPIIS, the open-loop controls $b_i$ versus time, and the diagonal feedback terms $A_{ii}$ versus observation time.)

Left: MSE of the posterior mean versus time for a chaotic 3-d Lorenz attractor with 7 one-dimensional noisy observations. The path integral method (red) computed $\hat{u}_i(t, x) = \sum_{j=1}^{3} A_{ij}(t)\,x_j + b_i(t)$ using 80 importance sampling iterations with 6000 particles per iteration. Particle smoothing method (green) using N = 6000 forward and M = 2100 backward particles. Middle: open-loop control $b_i$ versus time. Right: diagonal feedback control terms $A_{ii}$ versus time.

SLIDE 49

Example: Linear time-independent feedback control

Consider a simple inverted pendulum that satisfies the dynamics

$$\ddot{\alpha} = -\cos\alpha + u$$

where α is the angle that the pendulum makes with the horizontal, α = 3π/2 is the initial 'down' position, α = π/2 is the target 'up' position, and −cos α is the force acting on the pendulum due to gravity. Introducing $x_1 = \alpha$, $x_2 = \dot{\alpha}$ and adding noise, we write this system as

$$dX_i(s) = f_i(X(s))\,ds + g_i\,\big( u(s, X(s))\,ds + dW(s) \big), \qquad 0 \le s \le T, \quad i = 1, 2$$

$$f_1(x) = x_2, \qquad f_2(x) = -\cos x_1, \qquad g = (0, 1)^T$$

$$C = E\left[ \int_0^T ds\, \left( \tfrac{R}{2}\, u(s, X(s))^2 + \tfrac{Q_1}{2} \big( \sin X_1(s) - 1 \big)^2 + \tfrac{Q_2}{2}\, X_2(s)^2 \right) \right]$$

with $E[dW_s^2] = \nu\,ds$ and ν the noise variance.
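A minimal simulation of these dynamics under an arbitrary hand-written feedback controller; the controller and the values of R, Q1, Q2 and ν below are illustrative assumptions (in the slides the controller is learned with PICE):

```python
import numpy as np

rng = np.random.default_rng(8)

# x1 = angle, x2 = angular velocity; g = (0, 1): control and noise act on x2 only.
R, Q1, Q2, nu = 1.0, 5.0, 0.1, 0.3         # cost weights and noise variance (assumed)
T, dt = 3.0, 0.01
steps = int(T / dt)

def u_hat(x1, x2):
    # Placeholder feedback controller, not the learned one from the slides.
    return 2.0 * (np.pi / 2.0 - x1) - 0.5 * x2

x1, x2 = 3.0 * np.pi / 2.0, 0.0            # start in the 'down' position
cost = 0.0
for k in range(steps):
    u = u_hat(x1, x2)
    cost += (0.5 * R * u ** 2 + 0.5 * Q1 * (np.sin(x1) - 1.0) ** 2 + 0.5 * Q2 * x2 ** 2) * dt
    dW = rng.normal(0.0, np.sqrt(nu * dt))
    x1 += x2 * dt                          # f1 = x2
    x2 += (-np.cos(x1) + u) * dt + dW      # f2 = -cos(x1), plus control and noise
print(f"final angle: {x1:.2f} rad   accumulated cost: {cost:.2f}")
```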

SLIDE 50

Example: Linear time-independent feedback control

We estimate a time-independent feedback controller on a grid,

$$\hat{u}(x_1, x_2) = \theta_{k_1, k_2} \quad \text{if } (x_1, x_2) \text{ is in cell } (k_1, k_2)$$

with $k_i$, i = 1, 2, integers that label the grid cells. The figure shows the results of the path integral learning rule.

(Figure: learning diagnostics over roughly 1000 iterations, effective sample size and cost-to-go J, and the resulting feedback control as a function of the state $(x_1, x_2)$.)

SLIDE 51

Acrobot

SLIDE 52

Acrobot

(movie92.mp4) Result after 100 iterations, 50 samples per iteration.

SLIDE 53

Quadrotors

  • Circular holding/hovering pattern
    – penalizes large deviations from the centers, collisions, and too large/small velocities
    – 15 quadrotor units, rollouts N = 7000, horizon H = 4
  • Cat and mouse
    – penalizes large deviations from the mouse, collisions, and too large/small velocities
    – the mouse is not controlled and tries to escape the cats

Compute (feedback) control for the current state. Use adaptive importance sampling.

  • ≈ 100,000 trajectories/second for 1 second of simulation of one quadrotor.

SLIDE 54

UAVs

(AAMAS 2015.mp4) Kappen et al. 2015

SLIDE 55

Discussion

PICE presents challenging learning problems, as is evident from the large fluctuations despite the large number of samples for these relatively small problems.

  • The weights of the trajectories are proportional to $e^{-S}$, with S ∝ 1/λ and λ = Rν.
    – Small λ yields a small effective sample size and difficult learning.
    – Large ν requires large controls, which requires small R. This problem is due to the log transform that is used to linearize the Bellman equation.
  • Small deviations from optimality may yield a large decrease in effective sample size.
    – The optimal model is infinitely large.
    – An infinite model requires infinitely many samples to avoid overfitting.
    – For finite samples there is an optimal finite model.

SLIDE 56

Conclusion

Importance sampling improves sampling efficiency:

  • optimal control = optimal sampling

SLIDE 57

Conclusion

Importance sampling improves sampling efficiency:

  • optimal control = optimal sampling

Learning state dependent/feedback control with PICE

  • CE provides a criterion for parametrized controllers
  • learn from self-generated data
  • use ∞ data to learn ∞ models
  • Connecting Control, Inference and Learning
  • application in robotics

SLIDE 58

Conclusion

Importance sampling improves sampling efficiency:

  • optimal control = optimal sampling

Learning state dependent/feedback control with PICE

  • CE provides a criterion for parametrized controllers
  • learn from self-generated data
  • use ∞ data to learn ∞ models
  • Connecting Control, Inference and Learning
  • application in robotics

Inference:

  • reformulate as control problem
  • improve estimates through importance sampling controls

SLIDE 59
  • S. Thijssen and H. J. Kappen, "Path Integral Control and State Dependent Feedback," Phys. Rev. E 91, 032104, published 2 March 2015.
  • V. Gómez, S. Thijssen, H. J. Kappen, S. Hailes, "Real-Time Stochastic Optimal Control for Multi-agent Quadrotor Swarms," arXiv preprint arXiv:1502.04548, 2015.
  • J. Bierkens, H. J. Kappen, "Explicit solution of relative entropy weighted control," Systems & Control Letters, 36-43, 2014.
