

SLIDE 1

Integrating control, inference and learning. Is it what robots should be doing?

Bert Kappen SNN Donders Institute, Radboud University, Nijmegen Gatsby Unit, UCL London July 18, 2016

Bert Kappen

SLIDE 2

Optimal control theory

Given a current state and a future desired state, what is the best/cheapest/fastest way to get there?

SLIDE 3

Why stochastic optimal control?

SLIDE 4

Why stochastic optimal control?

  • Exploration
  • Learning

SLIDE 5

Optimal control theory

Hard problems:

  • a learning and exploration problem
  • a stochastic optimal control computation
  • a representation problem for u(x, t)

SLIDE 6

The idea: Control, Inference and Learning

Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.

SLIDE 7

The idea: Control, Inference and Learning

Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.

Importance sampling: accelerate with importance sampling (= a state-feedback controller); the optimal importance sampler is the optimal control.

SLIDE 8

The idea: Control, Inference and Learning

Linear Bellman equation and path integral solution: express a control computation as an inference computation; compute the optimal control using MC sampling.

Importance sampling: accelerate with importance sampling (= a state-feedback controller); the optimal importance sampler is the optimal control.

Learning: learn the controller from self-generated data; use the Cross-Entropy method for a parametrized controller.

SLIDE 9

Outline

  • Review of path integral control theory

– Some results

  • Importance sampling

– Relation between optimal sampling and optimal control

  • Cross entropy method for adaptive importance sampling (PICE)

– A criterion for parametrized control optimization
– Learning by gradient descent

  • Some examples

SLIDE 10

Discrete time optimal control

Consider the control of a discrete time deterministic dynamical system:

x_{t+1} = x_t + f(x_t, u_t),   t = 0, 1, …, T−1

x_t describes the state and u_t specifies the control or action at time t. Given x_0 and u_{0:T−1}, we can compute x_{1:T}. Define a cost for each sequence of controls:

C(x_0, u_{0:T−1}) = Σ_{t=0}^{T−1} V(x_t, u_t)

Find the sequence u_{0:T−1} that minimizes C(x_0, u_{0:T−1}).

SLIDE 11

Dynamic programming

Find the minimal cost path from A to J.

J(J) = 0
J(H) = 3,   J(I) = 4
J(F) = min(6 + J(H), 3 + J(I)) = 7
J(B) = min(7 + J(E), 4 + J(F), 2 + J(G)) = …

The minimal cost at time t is easily expressible in terms of the minimal cost at time t + 1.
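The backward recursion above can be sketched in a few lines of Python. The graph below is hypothetical: only the edge costs that appear on this slide (H→J = 3, I→J = 4, F→H = 6, F→I = 3, B→E = 7, B→F = 4, B→G = 2) are taken from it; the remaining edges are made up for illustration.

```python
# Backward dynamic programming on a small directed graph.
# Edge costs partly match the slide (H->J=3, I->J=4, F->H=6, F->I=3,
# B->E=7, B->F=4, B->G=2); the remaining edges are hypothetical.
edges = {
    "A": {"B": 2, "C": 4, "D": 3},
    "B": {"E": 7, "F": 4, "G": 2},
    "C": {"E": 3, "F": 2, "G": 4},
    "D": {"E": 4, "F": 1, "G": 5},
    "E": {"H": 1, "I": 4},
    "F": {"H": 6, "I": 3},
    "G": {"H": 3, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
    "J": {},
}

def cost_to_go(edges, goal="J"):
    """Optimal cost-to-go J(n) for every node, by backward recursion."""
    J = {goal: 0.0}
    # Process nodes in reverse topological order (stage by stage).
    for n in ["H", "I", "E", "F", "G", "B", "C", "D", "A"]:
        J[n] = min(c + J[m] for m, c in edges[n].items())
    return J

J = cost_to_go(edges)
print(J["F"])  # min(6 + J(H), 3 + J(I)) = min(6+3, 3+4) = 7, as on the slide
```

Because J(F) depends only on J(H) and J(I), it is fixed by the slide's numbers regardless of the invented edges.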

SLIDE 12

Discrete time optimal control

Dynamic programming uses the concept of the optimal cost-to-go J(t, x). One can recursively compute J(t, x) from J(t + 1, x) for all x in the following way:

J(t, x_t) = min_{u_t} [ V(x_t, u_t) + J(t + 1, x_t + f(x_t, u_t)) ]
J(T, x) = 0
J(0, x) = min_{u_{0:T−1}} C(x, u_{0:T−1})

This is called the Bellman equation. It computes u_t(x) for all intermediate t and x.
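The recursion can be sketched for a discretized toy problem. Everything below is a hypothetical choice for illustration: dynamics x_{t+1} = x + u, cost V(x, u) = x² + 0.1u², terminal condition J(T, x) = 0, and a small state/control grid.

```python
import numpy as np

# Backward Bellman recursion J(t,x) = min_u [ V(x,u) + J(t+1, x + f(x,u)) ]
# on a discretized state space; dynamics and cost are hypothetical:
# f(x,u) = u (so x_{t+1} = x + u), V(x,u) = x^2 + 0.1 u^2, J(T,x) = 0.
T = 10
xs = np.arange(-5, 6)            # discrete states -5..5
us = np.arange(-2, 3)            # discrete controls -2..2

J = np.zeros((T + 1, len(xs)))   # J[T, :] = 0 is the terminal condition
policy = np.zeros((T, len(xs)), dtype=int)

for t in range(T - 1, -1, -1):   # sweep backward in time
    for i, x in enumerate(xs):
        best = None
        for k, u in enumerate(us):
            x_next = np.clip(x + u, xs[0], xs[-1])
            j = x**2 + 0.1 * u**2 + J[t + 1, int(x_next) - xs[0]]
            if best is None or j < best:
                best, policy[t, i] = j, k
        J[t, i] = best

print(J[0, 5 - xs[0]])           # cost-to-go from x = 5 at t = 0
```

The resulting `policy` table is exactly the u_t(x) "for all intermediate t, x" that the slide mentions.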

SLIDE 13

Stochastic optimal control

Consider a stochastic dynamical system

dX_t = f(X_t, u) dt + dW_t,   E(dW_{t,i} dW_{t,j}) = ν_{ij} dt

Given X_0, find the control function u(x, t) that minimizes the expected future cost

C = E[ φ(X_T) + ∫_0^T dt V(X_t, u(X_t, t)) ]

where the expectation is over all trajectories given the control function u(x, t). The optimal cost-to-go satisfies

−∂_t J(t, x) = min_u [ V(x, u) + f(x, u)^T ∇_x J(x, t) + ½ ν ∇_x² J(x, t) ]

with u = u(x, t) and boundary condition J(x, T) = φ(x). This is the HJB equation.

SLIDE 14

Computing the optimal control solution is hard

  • solve a Bellman Equation, a PDE
  • scales badly with dimension

Efficient solutions exist for

  • linear dynamical systems with quadratic costs (Gaussians)
  • deterministic systems (no noise)

SLIDE 15

The idea

The uncontrolled dynamics specifies a distribution q(τ|x, t) over trajectories τ starting from (x, t). The cost of a trajectory τ is

S(τ|x, t) = φ(x_T) + ∫_t^T ds V(x_s, s)

Find the optimal distribution p(τ|x, t) that minimizes E_p S and is 'close' to q(τ|x, t).

SLIDE 16

KL control

Find p* that minimizes

C(p) = KL(p|q) + E_p S,   KL(p|q) = ∫ dτ p(τ|x, t) log [ p(τ|x, t) / q(τ|x, t) ]

The optimal solution is given by

p*(τ|x, t) = (1/ψ(x, t)) q(τ|x, t) exp(−S(τ|x, t)),   ψ(x, t) = ∫ dτ q(τ|x, t) exp(−S(τ|x, t)) = E_q e^{−S}

The optimal cost is

C(p*) = −log ψ(x, t)
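The optimal cost −log ψ can be estimated by plain Monte Carlo over the uncontrolled dynamics. The sketch below uses a hypothetical toy problem (dX = dW, no path cost V, end cost φ(x) = x²/2) for which ψ has a closed form, ψ(x=0, T=1) = 1/√2 ≈ 0.707, so the estimate can be checked.

```python
import numpy as np

# Monte Carlo estimate of psi(x, t) = E_q[exp(-S)] and C(p*) = -log psi
# for a hypothetical toy problem: uncontrolled dynamics dX = dW
# (Brownian motion), no path cost V, end cost phi(x) = x^2 / 2.
rng = np.random.default_rng(0)

def psi_estimate(x0=0.0, T=1.0, n_steps=100, n_samples=200_000):
    dt = T / n_steps
    X = np.full(n_samples, x0)
    for _ in range(n_steps):                    # forward sampling from q
        X += np.sqrt(dt) * rng.standard_normal(n_samples)
    S = 0.5 * X**2                              # S(tau) = phi(X_T), V = 0
    return np.exp(-S).mean()

psi = psi_estimate()
C = -np.log(psi)
# For X_T ~ N(0, T) the Gaussian integral gives psi = 1/sqrt(1 + T) exactly,
# so with T = 1 the estimate should land near 0.707 (and C near 0.347).
print(psi, C)
```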

SLIDE 17

Controlled diffusions

p(τ|x, t) is parametrised by control functions u(x, t):

dX_t = f(X_t, t) dt + g(X_t, t)(u(X_t, t) dt + dW_t),   E(dW_t²) = dt

C(u|x, t) = E_u[ S(τ|x, t) + ∫_t^T ds ½ u(X_s, s)² ]

q(τ|x, t) corresponds to u = 0. The Bellman equation becomes a 'Schrödinger' equation with J(x, t) = −log ψ(x, t):

∂_t ψ = ( V − f^T ∂_x − ½ ∂_x² ) ψ,   ψ(x, T) = e^{−φ(x)}

SLIDE 18

Controlled diffusions

The 'Schrödinger' equation can be solved formally as a Feynman–Kac path integral:

ψ(x, t) = ∫ dτ q(τ|x, t) e^{−S(τ|x,t)} = E_q[e^{−S}]

Optimal control:

u*(x, t) dt = E_{p*}(dW_t) = E_q[dW_t e^{−S}] / E_q[e^{−S}]

Both ψ and u* can be computed by forward sampling from q.
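Forward sampling of u* can be sketched for a hypothetical LQ toy problem (dX = u dt + dW, V = 0, φ(x) = x²/2): sample uncontrolled paths and weight the first noise increment by e^{−S}. For this case the exact answer u*(x, t) = −x/(1 + T − t) is available as a check.

```python
import numpy as np

# Estimate the optimal control u*(x, t) = E_q[dW e^{-S}] / (dt E_q[e^{-S}])
# by forward sampling from the uncontrolled dynamics q, for a hypothetical
# toy problem: dX = u dt + dW, V = 0, phi(x) = x^2 / 2.
rng = np.random.default_rng(1)

def u_star(x0, t=0.0, T=1.0, n_steps=50, n_samples=400_000):
    dt = (T - t) / n_steps
    X = np.full(n_samples, x0)
    dW0 = np.sqrt(dt) * rng.standard_normal(n_samples)  # first-step noise
    X += dW0
    for _ in range(n_steps - 1):                        # rest of the path
        X += np.sqrt(dt) * rng.standard_normal(n_samples)
    w = np.exp(-0.5 * X**2)                             # e^{-S}, S = phi(X_T)
    return (dW0 * w).mean() / (dt * w.mean())

# For this LQ problem the exact answer is u*(x, t) = -x / (1 + T - t),
# so u*(1, 0) should come out near -0.5 (up to O(dt) discretization bias).
u = u_star(1.0)
print(u)
```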

SLIDE 19

Delayed choice

Time-to-go T = 2 − t.

[Figure: sample trajectories, and J(x, t) versus x for time-to-go T = 2, 1, 0.5.]

J(x, t) = −ν log E_q exp(−φ(X_2)/ν)

The decision is made at T = 1.

SLIDE 20

Acrobot

[Figure: Acrobot learning statistics (cost increments, mean and std), sample trajectories (x_1, x_2), and controls u versus t.]

100 iterations. At each iteration 50 trajectories were generated. Noise was lowered at each iteration. Top left: final height for each trajectory.

SLIDE 21

Acrobot

[Video: movie92.mp4] Result after 100 iterations, 50 samples per iteration.

SLIDE 22

Robotics

≈ 100,000 trajectories per iteration, 3 iterations per second.

Video: http://www.snn.ru.nl/~bertk/control_theory/PI_quadrotors.mp4
Theodorou et al. 2011 ICRA; Gomez et al. 2016 ICAPS

SLIDE 23

Importance sampling and control

[Figure: sample trajectories under the uncontrolled dynamics.]

ψ(x, t) = E_q e^{−S},   S(τ|x, t) = φ(x_T) + ∫_t^T ds V(x_s, s)

Sampling is ’correct’ but inefficient.

SLIDE 24

Importance sampling

Consider a simple 1-d sampling problem. Given q(x), compute

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) q(x) dx

with I(x) = 1 if x < 0 and I(x) = 0 if x > 0. Naive method: generate N samples X_i ∼ q and estimate

â = (1/N) Σ_{i=1}^{N} I(X_i)

SLIDE 25

Importance sampling

Consider another distribution p(x). Then

a = Prob(x < 0) = ∫_{−∞}^{∞} I(x) (q(x)/p(x)) p(x) dx

Importance sampling: generate N samples X_i ∼ p and estimate

â = (1/N) Σ_{i=1}^{N} I(X_i) q(X_i)/p(X_i)

Unbiased (= correct) for any p!
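A quick numerical illustration of this unbiasedness claim, with hypothetical choices q = N(2, 1) and proposal p = N(−1, 1) so that the event x < 0 is rare under q but common under p:

```python
import numpy as np

# Importance sampling estimate of a = Prob(X < 0) for X ~ q = N(2, 1),
# a rare-ish event under q. The proposal p = N(-1, 1) puts its mass where
# the indicator is nonzero, so the weighted estimate has lower variance.
rng = np.random.default_rng(2)
N = 100_000

def pdf(x, mu):                      # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Naive: sample from q, average the indicator.
Xq = rng.normal(2.0, 1.0, N)
a_naive = (Xq < 0).mean()

# Importance sampling: sample from p, reweight by q(x)/p(x).
Xp = rng.normal(-1.0, 1.0, N)
a_is = ((Xp < 0) * pdf(Xp, 2.0) / pdf(Xp, -1.0)).mean()

# Both are unbiased; the exact value is Phi(-2), about 0.0228.
print(a_naive, a_is)
```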

SLIDE 26

Optimal importance sampling

The distribution

p*(x) = q(x) I(x) / a

is the optimal importance sampler. One sample X ∼ p* is sufficient to estimate a:

â = I(X) q(X)/p*(X) = a

SLIDE 27

Importance sampling and control

In the case of control we must compute

J(x, t) = −log E_q e^{−S},   u*(x, t) = E_q[dW e^{−S}] / E_q[e^{−S}]

Instead of sampling from the uncontrolled dynamics q (u = 0), we sample with p (u ≠ 0):

E_q e^{−S} = E_p e^{−S_u},   e^{−S_u} = e^{−S} dq/dp = exp( −S − ∫_t^T ½ u(x, s)² ds − ∫_t^T u(x, s) dW_s )

We can choose any p, i.e. any sampling control u.

SLIDE 28

Relation between optimal sampling and optimal control

Draw N trajectories τi, i = 1, . . . , N from p(τ|x, t) using control function u and define

α_i = e^{−S_u(τ_i|x,t)} / Σ_{j=1}^{N} e^{−S_u(τ_j|x,t)},   ESS = 1 / Σ_{j=1}^{N} α_j²,   1 ≤ ESS ≤ N

Thm:
1. A better u (in the sense of optimal control) gives a better sampler (in the sense of effective sample size).
2. The optimal u = u* (in the sense of optimal control) requires only one sample: α_i = 1/N and S_u(τ|x, t) is deterministic, where

S_u(τ|x, t) = S(τ|x, t) + ∫_t^T ds ½ u(x_s, s)^T ν^{−1} u(x_s, s) + ∫_t^T u(x_s, s)^T ν^{−1} dW_s
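The effective sample size is easy to compute from per-trajectory costs. A small sketch, using the standard definition ESS = 1/Σα² with normalized weights α_i (plus the usual max-subtraction for numerical stability):

```python
import numpy as np

# Effective sample size of an importance sampler: with normalized weights
# alpha_i = exp(-S_i) / sum_j exp(-S_j), ESS = 1 / sum_i alpha_i^2.
# Uniform weights give ESS = N; one dominant weight gives ESS close to 1.
def ess(S):
    """S: array of per-trajectory costs S_u(tau_i). Returns ESS in [1, N]."""
    S = np.asarray(S, dtype=float)
    logw = -S - np.max(-S)            # subtract max for numerical stability
    alpha = np.exp(logw)
    alpha /= alpha.sum()
    return 1.0 / np.sum(alpha**2)

print(ess([1.0, 1.0, 1.0, 1.0]))      # uniform weights -> ESS = 4
print(ess([0.0, 10.0, 10.0, 10.0]))   # one dominant weight -> ESS near 1
```

Monitoring ESS over iterations (as in the examples that follow) directly measures how close the current sampling control is to u*.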

SLIDE 29

Example

Geometric Brownian motion on the interval t = 0 to T.

dX_t = X_t u(X_t, t) dt + X_t dW_t
C = E[½ log(X_T)²]
u(x, t) = a(t) x + b(t)

[Figure: sampling controls u(t, x) and particles x(t) at t = 1/2 for u = 0, u(0), u(1), u(2) and u*.]

controller:  u = 0   constant  linear  quadratic  optimal
C:           7.526   5.139     1.507   1.461      1.420
ESS (%):     34.3    42.1      87.5    95.2       99.3

SLIDE 30

The Cross-entropy method

Let f_u(x) be a family of probability density functions parametrized by u, and let h(x) be a positive function. Consider the expectation value

a = E_0 h = ∫ dx f_0(x) h(x)

for the particular value u = 0. The optimal importance sampling distribution is g*(x) = h(x) f_0(x)/a. The cross-entropy method minimises the KL divergence

KL(g*|f_u) = ∫ dx g*(x) log [g*(x)/f_u(x)] ∝ −E_{g*} log f_u(X) ∝ −E_0[h(X) log f_u(X)] = −E_v[h(X) (f_0(X)/f_v(X)) log f_u(X)]

iterating f_0 → f_1 → f_2 → …
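The CE iteration f_0 → f_1 → f_2 … can be sketched for the earlier 1-d example, with the hypothetical Gaussian family f_u = N(u, 1), h(x) = I(x < 0) and f_0 = N(2, 1). For a Gaussian family, minimising KL(g*|f_u) over the mean u has a closed-form update: the weighted sample mean.

```python
import numpy as np

# Cross-entropy iteration f_0 -> f_1 -> f_2 ... for the family f_u = N(u, 1),
# with h(x) = indicator(x < 0) and starting distribution f_0 = N(2, 1).
# Minimising KL(g*|f_u) over the Gaussian mean u gives the update
#   u = weighted mean of samples X ~ f_v, weights h(X) f_0(X) / f_v(X).
rng = np.random.default_rng(3)

def pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

v = 2.0                                   # start at f_0
for k in range(10):
    X = rng.normal(v, 1.0, 50_000)
    w = (X < 0) * pdf(X, 2.0) / pdf(X, v)
    v = (w * X).sum() / w.sum()           # CE update of the mean

# v approaches E[X | X < 0] under N(2, 1), the mean of g* (about -0.37).
print(v)
```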

SLIDE 31

The CE method for PI control

Sample pu using

dXt = f(Xt, t)dt + g(Xt, t) (u(Xt, t)dt + dWt)

We wish to compute a close-to-optimal control u such that p_u is close to p*. Following the CE argument, we minimise

KL(p*|p_u) = (1/ψ(t, x)) E_v[ e^{−S(t,x,v)} ∫_t^T ds ½ ( u(X_s, s) − v(X_s, s) − dW_s/ds )² ]

where v is the importance sampling control. The expected value is independent of v, but the variance/accuracy depends on v.

SLIDE 32

The CE method for PI control

We parametrize the control functions u(x, t|θ) and v(x, t|θ′). The gradient is given by

∂KL(p*|p_u)/∂θ = ⟨ ∫_t^T ( u(X_s, s) ds − v(X_s, s) ds − dW_s ) ∂u(X_s, s)/∂θ ⟩_v
             = −⟨ ∫_t^T dW_s ∂u(X_s, s)/∂θ ⟩_u

and the parameters are updated by gradient descent:

θ := θ − ε ∂KL(p*|p_u)/∂θ

We refer to the method as PICE (Path Integral Cross Entropy).
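A PICE sketch for a hypothetical 1-d problem (dX = u dt + dW, end cost φ(x) = x²/2) with a linear feedback controller u(x|θ) = θx, sampled with v = u. All problem sizes and learning parameters below are illustrative choices, not from the talk.

```python
import numpy as np

# PICE sketch: linear feedback controller u(x|theta) = theta * x, so
# du/dtheta = x. Rollouts are generated with the current controller
# (v = u), and the gradient is the importance-weighted estimate
#   dKL/dtheta ~ -< sum_s dW_s * x_s >_u,  weights ~ exp(-S_u).
rng = np.random.default_rng(4)
T, n_steps, n_samples = 1.0, 50, 5_000
dt = T / n_steps

theta = 0.0                                  # initial gain (illustrative)
for it in range(200):
    X = np.ones(n_samples)                   # all rollouts start at x0 = 1
    S = np.zeros(n_samples)                  # accumulates S_u per trajectory
    grad_traj = np.zeros(n_samples)          # accumulates sum_s dW_s * x_s
    for _ in range(n_steps):
        u = theta * X
        dW = np.sqrt(dt) * rng.standard_normal(n_samples)
        S += 0.5 * u**2 * dt + u * dW        # control terms of S_u
        grad_traj += dW * X                  # dW_s * du/dtheta
        X += u * dt + dW
    S += 0.5 * X**2                          # end cost phi(X_T)
    w = np.exp(-(S - S.min()))               # importance weights ~ e^{-S_u}
    w /= w.sum()
    grad = -(w * grad_traj).sum()            # PICE gradient estimate
    theta -= 0.2 * grad                      # gradient descent step

print(theta)   # ends up negative: the feedback pushes the state toward 0
```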

SLIDE 33

Inverted pendulum

[Figure: PICE for the inverted pendulum with K = 25 basis functions, 10 sample trajectories per gradient computation. Left: ESS vs. importance sampling iteration. Middle: final control solution u(α, α̇) versus α, α̇. Right: 10 sample trajectories sin(α_t) versus t under the learned control.]

Kappen & Ruiz 2016 J Stat. Phys

SLIDE 34

Controlled noisy Lorenz attractor

[Figure: open-loop controller components b_1, b_2, b_3 and feedback-controller gains A_{1,1}, A_{2,2}, A_{3,3} versus observation time; N = 6000.]

u(t, x) = A(t) x + b(t),   N = 6000

SLIDE 35

Model based motor learning

Initialize control u_0
for t = 0, 1, … do
    data_t = act_in_the_world(u_t)
    model_t = learn_model(u_t, data_t)
    u_{t+1} = compute_control(model_t)
end for

[Diagram: Model, Plant/Environment, Controller and Cost, coupled by control, importance sampling, learning and evaluation.]

compute_control:
for k = 0, 1, … do
    data_k = generate_data(model, u_k)    % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)  % deep or recurrent learning
end for

SLIDE 36

Model based motor learning

Initialize control u_0
for t = 0, 1, … do
    data_t = act_in_the_world(u_t)
    model_t = learn_model(u_t, data_t)
    u_{t+1} = compute_control(model_t)
end for

[Diagram: Model, Plant/Environment, Controller and Cost, coupled by control, importance sampling, learning and evaluation.]

compute_control:
for k = 0, 1, … do
    data_k = generate_data(model, u_k)    % Monte Carlo importance sampler
    u_{k+1} = learn_control(data_k, u_k)  % deep or recurrent learning
end for

  • generate infinite data to learn infinitely complex controllers

SLIDE 37

Parallel implementation

Goal: provide a generic solver for any PI control problem to arbitrary precision.

  • Massively parallel sampling on CPUs
  • Massively parallel gradient computation on CPU/GPU

SLIDE 38

Integrating control, inference and learning. Is it what robots should be doing?

  • Monte Carlo sampling
  • Learning through gradient descent

SLIDE 39

Integrating control, inference and learning. Is it what robots should be doing?

  • Monte Carlo sampling
  • Learning through gradient descent

Both are trivial to parallelize.

  • massive parallel data
  • massive parallel gradient descent learning

SLIDE 40

Integrating control, inference and learning. Is it what robots should be doing?

  • Monte Carlo sampling
  • Learning through gradient descent

Both are trivial to parallelize.

  • massive parallel data
  • massive parallel gradient descent learning

SLIDE 41
  • S. Thijssen and H. J. Kappen, "Path Integral Control and State Dependent Feedback", Phys. Rev. E 91, 032104, 2015.
  • V. Gómez, S. Thijssen, H. J. Kappen and S. Hailes, "Real-Time Stochastic Optimal Control for Multi-agent Quadrotor Swarms", arXiv:1502.04548, 2015; ICAPS 2016.
  • H. J. Kappen and H. C. Ruiz, "Adaptive importance sampling for control and inference", arXiv:1505.01874; J. Stat. Phys., 2016 (in press).

www.snn.ru.nl/~bertk

SLIDE 42

Neural activity from BOLD

The dependence of the blood oxygen level (BOLD signal) on neural activity is modelled by four differential equations.

SLIDE 43

Neural activity from BOLD

u(z, t) = a(t)z + b(t), N = 5000, K = 200 iterations.

SLIDE 44

Pendulum details

φ̈_t + c ω_0 φ̇_t + ω_0² sin φ_t = u_t + dW_t,   c ω_0 = 0.1 s⁻¹,   ω_0² = 10 s⁻²

The network has one input, u_t + dW_t, and three outputs, sin φ_t, cos φ_t and φ̇_t, with N = 500 hidden neurons and J = 50 non-linear dendrites.

Learning phase: u = 0, FORCE learning; the feedback is 0.9 × output + 0.1 × target.
Control phase: simulate trajectories from the current state to compute u using PI control.
