

SLIDE 1

State space methods for temporal GPs

Arno Solin Assistant Professor in Machine Learning Department of Computer Science Aalto University

GAUSSIAN PROCESS SUMMER SCHOOL September 11, 2019

@arnosolin

arno.solin.fi

SLIDE 2

State space methods for temporal GPs Arno Solin 2/44

Outline

◮ Motivation: Temporal models
◮ Three views into GPs
◮ State space models
◮ General likelihoods
◮ Spatio-temporal GPs
◮ Further extensions
◮ Recap

SLIDE 3

Motivation: Temporal models

◮ One-dimensional problems (the data has a natural ordering)
◮ Spatio-temporal models (something developing over time)
◮ Long / unbounded data (sensor data streams, daily observations, etc.)

SLIDE 4

Three views into GPs

(Diagram: the same GP can equivalently be represented through the kernel (moment), spectral (Fourier), and state space (path) representations.)

SLIDE 5

Kernel (moment) representation

f(t) ∼ GP(µ(t), κ(t, t′))   (GP prior)
y | f ∼ ∏i p(yi | f(ti))   (likelihood)

◮ Let's focus on the GP prior only.
◮ A temporal Gaussian process (GP) is a random function f(t), such that the joint distribution of f(t1), …, f(tn) is always Gaussian.
◮ The mean and covariance functions have the form
  µ(t) = E[f(t)],
  κ(t, t′) = E[(f(t) − µ(t)) (f(t′) − µ(t′))ᵀ].
◮ Convenient for model specification, but expanding the kernel into a covariance matrix can be problematic (the notorious O(n³) scaling).
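A minimal sketch of this kernel view in code (assuming NumPy; the exponential kernel and its unit hyperparameter `lam` are illustrative choices, not from the slides), showing the dense n × n covariance matrix behind the O(n³) scaling:

```python
import numpy as np

# Exponential (Ornstein-Uhlenbeck) kernel as an example prior; lam is an assumed hyperparameter.
def kernel(t1, t2, lam=1.0):
    return np.exp(-lam * np.abs(t1[:, None] - t2[None, :]))

t = np.linspace(0.0, 10.0, 200)   # n = 200 time points
K = kernel(t, t)                  # dense n x n covariance matrix

# Draw three sample paths from the GP prior f ~ N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(t)), K, size=3)
```

Every downstream batch-GP computation (sampling, regression) touches this full matrix, which is what the state space view later avoids.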

SLIDE 6

Spectral (Fourier) representation

◮ The Fourier transform of a function f(t) : R → R is
  F[f](iω) = ∫_R f(t) exp(−iωt) dt
◮ For a stationary GP, the covariance function can be written in terms of the difference between the two inputs: κ(t, t′) = κ(t − t′)
◮ Wiener–Khinchin: if f(t) is a stationary Gaussian process with covariance function κ(τ), then its spectral density is S(ω) = F[κ].
◮ Spectral representation of a GP in terms of the spectral density function
  S(ω) = E[f̃(iω) f̃ᵀ(−iω)]

SLIDE 7

State space (path) representation [1/3]

◮ Path or state space representation as the solution to a linear time-invariant (LTI) stochastic differential equation (SDE):
  df = F f dt + L dβ,
  where f = (f, df/dt, …) and β(t) is a vector of Wiener processes.
◮ Equivalently, but more informally,
  df(t)/dt = F f(t) + L w(t),
  where w(t) is white noise.
◮ The model now consists of a drift matrix F ∈ R^{m×m}, a diffusion matrix L ∈ R^{m×s}, and the spectral density matrix Qc ∈ R^{s×s} of the white noise process.
◮ The scalar-valued GP can be recovered by f(t) = hᵀf(t).

SLIDE 8

State space (path) representation [2/3]

◮ The initial state is given by the stationary state f(0) ∼ N(0, P∞), which fulfils
  F P∞ + P∞ Fᵀ + L Qc Lᵀ = 0.
◮ The covariance function at the stationary state can be recovered by
  κ(t, t′) = hᵀ P∞ exp((t′ − t) F)ᵀ h,  for t′ ≥ t,
  κ(t, t′) = hᵀ exp((t − t′) F) P∞ h,  for t′ < t,
  where exp(·) denotes the matrix exponential function.
◮ The spectral density function at the stationary state can be recovered by
  S(ω) = hᵀ (F + iω I)⁻¹ L Qc Lᵀ (F − iω I)⁻ᵀ h.
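As a concrete sketch (assuming NumPy/SciPy, and using the standard Matérn ν = 3/2 state space model with lengthscale ℓ and unit variance as the example), P∞ can be obtained numerically from the stationary Lyapunov equation above:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Matern nu=3/2 state space model (a standard example): lengthscale ell, variance sigma2.
ell, sigma2 = 1.0, 1.0
lam = np.sqrt(3.0) / ell
F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
L = np.array([[0.0], [1.0]])
Qc = np.array([[4.0 * lam**3 * sigma2]])

# Solve F P + P F^T = -L Qc L^T, i.e. F P + P F^T + L Qc L^T = 0.
Pinf = solve_continuous_lyapunov(F, -L @ Qc @ L.T)
# For this model the closed form is Pinf = diag(sigma2, lam^2 * sigma2).
```

The same three lines work for any LTI model (F, L, Qc), which is what makes the state space view composable.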

SLIDE 9

State space (path) representation [3/3]

◮ Similarly as the kernel has to be evaluated into a covariance matrix for computations, the SDE can be solved at the discrete time points {ti}, i = 1, …, n.
◮ The resulting model is a discrete state space model:
  fi = Ai−1 fi−1 + qi−1,  qi ∼ N(0, Qi),
  where fi = f(ti).
◮ The discrete-time model matrices are given by:
  Ai = exp(F ∆ti),
  Qi = ∫₀^∆ti exp(F (∆ti − τ)) L Qc Lᵀ exp(F (∆ti − τ))ᵀ dτ,
  where ∆ti = ti+1 − ti.
◮ If the model is stationary, Qi is given by
  Qi = P∞ − Ai P∞ Aiᵀ.
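A small sketch of this discretization step (assuming NumPy/SciPy, and reusing the slides' scalar Ornstein-Uhlenbeck model, where P∞ = 1 so the stationary shortcut for Qi applies):

```python
import numpy as np
from scipy.linalg import expm

# OU model from the slides: F = -lam, L = 1, Qc = 2*lam, Pinf = 1.
lam = 2.0
F = np.array([[-lam]])
Pinf = np.array([[1.0]])

t = np.array([0.0, 0.3, 1.0, 1.1])          # a (possibly irregular) time grid
dts = np.diff(t)

As = [expm(F * dt) for dt in dts]            # A_i = exp(F * dt_i)
Qs = [Pinf - A @ Pinf @ A.T for A in As]     # stationary shortcut: Q_i = Pinf - A_i Pinf A_i^T
```

Note that irregular sampling costs nothing extra: each step just uses its own ∆ti.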

SLIDE 10

Three views into GPs

(Figure: the covariance function κ(τ), the spectral density function S(ω), and sample functions f(t); three equivalent views of the same GP.)

SLIDE 11

Example: Exponential covariance function

◮ Exponential covariance function (Ornstein–Uhlenbeck process):
  κ(t, t′) = exp(−λ |t − t′|)
◮ Spectral density function:
  S(ω) = 2λ / (λ² + ω²)
◮ Path representation: stochastic differential equation (SDE)
  df(t)/dt = −λ f(t) + w(t),
or, using the notation from before:
  F = −λ, L = 1, Qc = 2λ, h = 1, and P∞ = 1.
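The views really do agree; a quick numerical check (assuming NumPy; the value of λ is arbitrary) that the state space route reproduces the closed-form spectral density of the exponential kernel:

```python
import numpy as np

lam = 1.5
w = np.linspace(-4.0, 4.0, 101)

# Closed-form spectral density of the exponential kernel.
S_closed = 2.0 * lam / (lam**2 + w**2)

# State space route: S(w) = h (F + iw)^-1 L Qc L^T (F - iw)^-T h,
# with scalar F = -lam, L = h = 1, Qc = 2*lam.
S_state = (2.0 * lam) / ((-lam + 1j * w) * (-lam - 1j * w))
# The denominator is (lam^2 + w^2), so S_state is real and equals S_closed.
```

The same check can be repeated for any of the Markovian kernels discussed later.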

SLIDE 12

Examples of applicable GP priors

SLIDE 13

Applicable GP priors

◮ The covariance function needs to be Markovian (or approximated as such).
◮ Covers many common stationary and non-stationary models.
◮ Sum of kernels: κ(t, t′) = κ₁(t, t′) + κ₂(t, t′)
  • Stacking of the state spaces
  • State dimension: m = m₁ + m₂
◮ Product of kernels: κ(t, t′) = κ₁(t, t′) κ₂(t, t′)
  • Kronecker sum of the models
  • State dimension: m = m₁ m₂
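A sketch of the sum-of-kernels construction (assuming NumPy/SciPy; the two OU components are illustrative, not from the slides): stacking the state spaces is just a block-diagonal drift and a concatenated measurement vector.

```python
import numpy as np
from scipy.linalg import block_diag

# Two OU (exponential-kernel) components with different decay rates.
lam1, lam2 = 1.0, 5.0
F1, h1 = np.array([[-lam1]]), np.array([1.0])
F2, h2 = np.array([[-lam2]]), np.array([1.0])

# Sum of kernels -> stacked state space with m = m1 + m2 = 2.
F = block_diag(F1, F2)
h = np.concatenate([h1, h2])   # f(t) = h^T f(t) picks up the sum f1(t) + f2(t)
```

The diffusion matrix L and noise density Qc stack the same way, so composite kernels stay Markovian.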
SLIDE 14

Example: GP regression, O(n3)

SLIDE 15

Example: GP regression, O(n3)

◮ Consider the GP regression problem with input–output training pairs {(ti, yi)}, i = 1, …, n:
  f(t) ∼ GP(0, κ(t, t′)),
  yi = f(ti) + εi,  εi ∼ N(0, σn²).
◮ The posterior mean and variance for an unseen test input t∗ are given by (see previous lectures):
  E[f∗] = k∗ (K + σn² I)⁻¹ y,
  V[f∗] = K∗∗ − k∗ (K + σn² I)⁻¹ k∗ᵀ.
◮ Note the inversion of the n × n matrix.
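A direct implementation of these formulas (a sketch assuming NumPy; the exponential kernel and the synthetic sinusoid data are illustrative), with the O(n³) solve made explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 10.0, 50))          # training inputs
y = np.sin(t) + 0.1 * rng.standard_normal(50)    # noisy observations
sigma2 = 0.1**2

def kernel(a, b, lam=1.0):
    return np.exp(-lam * np.abs(a[:, None] - b[None, :]))

K = kernel(t, t) + sigma2 * np.eye(len(t))       # K + sigma_n^2 I
ts = np.linspace(0.0, 10.0, 200)                 # test inputs t_*
Ks = kernel(ts, t)                               # cross-covariance k_*

mean = Ks @ np.linalg.solve(K, y)                # E[f_*], the O(n^3) step
# V[f_*]: diagonal of K_** - k_* (K + sigma_n^2 I)^-1 k_*^T (here K_** has unit diagonal).
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
```

For n in the thousands this dense solve becomes the bottleneck, which motivates the sequential O(n) solution on the next slides.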

SLIDE 16

Example: GP regression, O(n3)

SLIDE 17

Example: GP regression, O(n)

◮ The sequential solution (which goes under the name 'Kalman filter') considers one data point at a time, hence the linear time-scaling.
◮ Start from m0 = 0 and P0 = P∞, and for each data point iterate the following steps.
◮ Kalman prediction:
  mi|i−1 = Ai−1 mi−1|i−1,
  Pi|i−1 = Ai−1 Pi−1|i−1 Ai−1ᵀ + Qi−1.
◮ Kalman update:
  vi = yi − hᵀmi|i−1,
  Si = hᵀPi|i−1 h + σn²,
  Ki = Pi|i−1 h Si⁻¹,
  mi|i = mi|i−1 + Ki vi,
  Pi|i = Pi|i−1 − Ki Si Kiᵀ.
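The recursion above can be sketched in a few lines (assuming NumPy; the scalar OU model keeps all matrices 1 × 1, and the synthetic data mirrors the earlier batch example):

```python
import numpy as np

# O(n) GP regression with the exponential kernel via Kalman filtering.
lam, sigma2 = 1.0, 0.1**2
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(t) + 0.1 * rng.standard_normal(200)

m, P = 0.0, 1.0                       # start from the stationary state: m0 = 0, P0 = Pinf = 1
filt_means = []
for i in range(len(t)):
    if i > 0:                         # Kalman prediction
        A = np.exp(-lam * (t[i] - t[i - 1]))
        Q = 1.0 - A * A               # Q_i = Pinf - A_i Pinf A_i^T
        m, P = A * m, A * P * A + Q
    v = y[i] - m                      # Kalman update (h = 1)
    S = P + sigma2
    K = P / S
    m, P = m + K * v, P - K * S * K
    filt_means.append(m)
```

Each iteration touches only the previous state, so memory is O(m²) and time is O(n m³) regardless of how long the stream gets.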

SLIDE 18

Example: GP regression, O(n)

◮ To condition all time-marginals on all data, run a backward sweep (Rauch–Tung–Striebel smoother):
  mi+1|i = Ai mi|i,
  Pi+1|i = Ai Pi|i Aiᵀ + Qi,
  Gi = Pi|i Aiᵀ Pi+1|i⁻¹,
  mi|n = mi|i + Gi (mi+1|n − mi+1|i),
  Pi|n = Pi|i + Gi (Pi+1|n − Pi+1|i) Giᵀ.
◮ The marginal mean and variance can be recovered by:
  E[fi] = hᵀmi|n,  V[fi] = hᵀPi|n h.
◮ The log marginal likelihood can be evaluated as a by-product of the Kalman update:
  log p(y) = −(1/2) ∑i=1..n [ log |2π Si| + viᵀ Si⁻¹ vi ].
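A sketch of the full filter-smoother sweep (assuming NumPy/SciPy; scalar OU model as before), where the O(n) log marginal likelihood can be checked against the batch Gaussian expression log N(y | 0, K + σn² I):

```python
import numpy as np
from scipy.stats import multivariate_normal

lam, sigma2 = 1.0, 0.2**2
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 5.0, 40))
y = np.sin(t) + 0.2 * rng.standard_normal(40)
n = len(t)

# Forward (Kalman) filter, storing what the backward sweep needs.
mf, Pf, mp, Pp, As = np.zeros(n), np.zeros(n), np.zeros(n), np.zeros(n), np.zeros(n)
m, P, loglik = 0.0, 1.0, 0.0
for i in range(n):
    A = np.exp(-lam * (t[i] - t[i - 1])) if i > 0 else 0.0
    if i > 0:
        m, P = A * m, A * P * A + (1.0 - A * A)
    As[i], mp[i], Pp[i] = A, m, P            # predicted quantities m_{i|i-1}, P_{i|i-1}
    v, S = y[i] - m, P + sigma2
    loglik += -0.5 * (np.log(2.0 * np.pi * S) + v * v / S)
    K = P / S
    m, P = m + K * v, P - K * S * K
    mf[i], Pf[i] = m, P                      # filtered quantities m_{i|i}, P_{i|i}

# Backward (Rauch-Tung-Striebel) smoother.
ms, Ps = mf.copy(), Pf.copy()
for i in range(n - 2, -1, -1):
    G = Pf[i] * As[i + 1] / Pp[i + 1]
    ms[i] = mf[i] + G * (ms[i + 1] - mp[i + 1])
    Ps[i] = Pf[i] + G * G * (Ps[i + 1] - Pp[i + 1])

# Batch reference: log N(y | 0, K + sigma_n^2 I) with the exponential kernel.
K_full = np.exp(-lam * np.abs(t[:, None] - t[None, :]))
batch_loglik = multivariate_normal.logpdf(y, np.zeros(n), K_full + sigma2 * np.eye(n))
```

The two log-likelihoods agree to numerical precision, since the Kalman sweep is exact (not approximate) for Markovian kernels with Gaussian noise.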

SLIDE 19

Example: GP regression, O(n)

SLIDE 20

Basic regression example

◮ Number of births in the US (from BDA3 by Gelman et al.)
◮ Daily data between 1969–1988 (n = 7305)
◮ GP regression with a prior covariance function:
  κ(t, t′) = κ^{ν=5/2}_Mat.(t, t′) + κ^{ν=3/2}_Mat.(t, t′) + κ^{year}_Per.(t, t′) κ^{ν=3/2}_Mat.(t, t′) + κ^{week}_Per.(t, t′) κ^{ν=3/2}_Mat.(t, t′)
◮ Learn the hyperparameters by optimizing the marginal likelihood

SLIDE 21

Explaining changes in number of births in the US

SLIDE 22

Connection to banded precision matrices

SLIDE 23

Precision matrices

Covariance (Gram) matrix: K = κ(X, X)
Precision matrix: K⁻¹

(Figure: heatmaps of the dense covariance matrix K = κ(X, X) and its sparse precision matrix K⁻¹.)

For Markovian models the precision is sparse! (block tri-diagonal)

see Durrande et al. (2019)

SLIDE 24

Constructing the precision matrix

◮ The full precision matrix can be constructed from the state space model matrices:
  K̂⁻¹ = Bᵀ blkdiag(P0, Q1, Q2, …, Qn)⁻¹ B,
  where B is the block lower-bidiagonal matrix with identity blocks I on the diagonal and −A1, −A2, …, −An on the first sub-diagonal.

◮ Discarding the other model states by passing through the measurement model: K⁻¹ = (In ⊗ h)ᵀ K̂⁻¹ (In ⊗ h)
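A numerical sketch of this construction (assuming NumPy; a scalar OU model so that m = 1, h = 1, and the state covariance K̂ coincides with the kernel matrix K): inverting the banded precision recovers the kernel matrix exactly.

```python
import numpy as np

# Scalar OU model: A_i = exp(-lam * dt_i), Q_i = 1 - A_i^2, P0 = Pinf = 1.
lam = 1.0
t = np.array([0.0, 0.5, 1.2, 2.0])
n = len(t)

a = np.exp(-lam * np.diff(t))                  # A_1, ..., A_{n-1}
q = 1.0 - a**2                                 # Q_i = Pinf - A_i Pinf A_i^T

B = np.eye(n)                                  # lower-bidiagonal factor
for i in range(1, n):
    B[i, i - 1] = -a[i - 1]
Qblk = np.diag(np.concatenate([[1.0], q]))     # blkdiag(P0, Q1, ..., Q_{n-1})

Prec = B.T @ np.linalg.inv(Qblk) @ B           # tridiagonal precision matrix
K = np.exp(-lam * np.abs(t[:, None] - t[None, :]))   # kernel matrix for comparison
```

This is the factorization that Durrande et al. (2019) exploit with banded matrix operators: all downstream algebra stays within the band.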

SLIDE 25

General likelihoods

SLIDE 26

Non-Gaussian likelihoods

◮ The observation model might not be Gaussian:
  f(t) ∼ GP(0, κ(t, t′)),
  y | f ∼ ∏i p(yi | f(ti))
◮ There exists a multitude of great methods for tackling general likelihoods with approximations of the form
  Q(f | D) = N(f | m + Kα, (K⁻¹ + W)⁻¹)
◮ Use those methods, but deal with the latent GP using state space models

SLIDE 27

Inference

◮ Laplace approximation
◮ Variational Bayes
◮ Direct KL minimization
◮ EP or assumed density filtering (single-sweep EP)
◮ All of these can be evaluated in terms of a (Kalman) filter forward and backward pass, or by iterating them

SLIDE 28

Example

◮ Commercial aircraft accidents 1919–2017
◮ Log-Gaussian Cox process (Poisson likelihood) by ADF/EP
◮ Daily binning, n = 35,959
◮ GP prior with a covariance function:
  κ(t, t′) = κ^{ν=3/2}_Mat.(t, t′) + κ^{year}_Per.(t, t′) κ^{ν=3/2}_Mat.(t, t′) + κ^{week}_Per.(t, t′) κ^{ν=3/2}_Mat.(t, t′)
◮ Learn the hyperparameters by optimizing the marginal likelihood

SLIDE 31

Spatio-temporal Gaussian processes

SLIDE 32

Spatio-temporal GPs

Temporal:
  f(x) ∼ GP(0, κ(x, x′)),
  y | f ∼ ∏i p(yi | f(xi))

Spatio-temporal:
  f(r, t) ∼ GP(0, κ(r, t; r′, t′)),
  y | f ∼ ∏i p(yi | f(ri, ti))

SLIDE 33

Spatio-temporal Gaussian processes

GPs under the kernel formalism:
  f(x, t) ∼ GP(0, κ(x, t; x′, t′)),
  yi = f(xi, ti) + εi

Stochastic partial differential equations:
  ∂f(x, t)/∂t = F f(x, t) + L w(x, t),
  yi = Hi f(x, t) + εi

(Figure: the covariance view κ(x, t; x′, t′) over location x and time t versus the state-space view of "the state at time t".)

SLIDE 34

Spatio-temporal GP regression

(Figure: estimated mean E[f(t, x)] over the temporal dimension t and the spatial dimension x.)

SLIDE 36

Spatio-temporal GP priors

SLIDE 37

Application examples

SLIDE 38

What if the data really is infinite?

SLIDE 39

Adapting the hyperparameters online

https://youtu.be/myCvUT3XGPc

SLIDE 40

Online inference as a part of a larger system

◮ Single-camera depth estimation
◮ An infinite stream of camera frames
◮ An unholy alliance between deep learning and GPs

SLIDE 41

Online inference as a part of a larger system

https://youtu.be/iellGrlNW7k

SLIDE 42

Recap

SLIDE 43

Gaussian processes ♥ SDEs

GPs under the kernel formalism:
  f(t) ∼ GP(0, κ(t, t′)),
  y | f ∼ ∏i p(yi | f(ti))

Stochastic differential equations:
  df(t) = F f(t) dt + L dβ(t),
  yi ∼ p(yi | hᵀf(ti))

Flexible model specification ↔ inference from first principles

SLIDE 44

Recap

◮ Gaussian processes have different representations:
  • Covariance function
  • Spectral density
  • State space
◮ Temporal (single-input) Gaussian processes ⟺ stochastic differential equations (SDEs)
◮ Conversions between the representations can make model building easier
◮ (Exact) inference of the latent functions can be done in O(n) time and memory complexity by Kalman filtering

SLIDE 45

Bibliography

The examples and methods presented in this lecture are covered in greater detail in the following works:

Hartikainen, J. and Särkkä, S. (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

Särkkä, S., Solin, A., and Hartikainen, J. (2013). Spatio-temporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine, 30(4):51–61.

Särkkä, S. (2013). Bayesian Filtering and Smoothing. Cambridge University Press, Cambridge, UK.

Särkkä, S. and Solin, A. (2019). Applied Stochastic Differential Equations. Cambridge University Press, Cambridge, UK.

Solin, A. (2016). Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Doctoral dissertation, Aalto University.

SLIDE 46

Bibliography

The examples and methods presented in this lecture are covered in greater detail in the following works:

Durrande, N., Adam, V., Bordeaux, L., Eleftheriadis, S., and Hensman, J. (2019). Banded matrix operators for Gaussian Markov models in the automatic differentiation era. International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR 89:2780–2789.

Nickisch, H., Solin, A., and Grigorievskiy, A. (2018). State space Gaussian processes with non-Gaussian likelihood. International Conference on Machine Learning (ICML). PMLR 80:3789–3798.

Solin, A., Hensman, J., and Turner, R. E. (2018). Infinite-horizon Gaussian processes. Advances in Neural Information Processing Systems (NeurIPS), pages 3490–3499.

Hou, Y., Kannala, J., and Solin, A. (2019). Multi-view stereo by temporal nonparametric fusion. International Conference on Computer Vision (ICCV).

SLIDE 47

◮ Homepage: http://arno.solin.fi
◮ Twitter: @arnosolin

Särkkä, S. and Solin, A. (2019). Applied Stochastic Differential Equations. Cambridge University Press, Cambridge, UK. Book PDF and codes for replicating examples available online.