SLIDE 1

Gaussian Process

Lei Tang

Arizona State University

July 31, 2007

SLIDE 2

Gaussian processes are known as kriging in geostatistics. Autoregressive moving average (ARMA) models, Kalman filters, and radial basis function networks can all be viewed as forms of Gaussian process models.

SLIDE 3

Linear regression revisited

y(x) = w^T \phi(x), \qquad p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I), \qquad y = \Phi w

\mathbb{E}[y] = \Phi \, \mathbb{E}[w] = 0

\mathrm{cov}[y] = \mathbb{E}[y y^T] = \Phi \, \mathbb{E}[w w^T] \, \Phi^T = \frac{1}{\alpha} \Phi \Phi^T = K

where K is the Gram matrix with elements K_{nm} = k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^T \phi(x_m).
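As a quick numerical check of the covariance identity above, the sketch below builds a design matrix Φ from a handful of Gaussian basis functions (the basis choice, centres, and width are illustrative assumptions, not part of the slides) and verifies that cov[y] = (1/α) Φ Φ^T equals the Gram matrix built from k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m).

```python
import numpy as np

# Illustrative setup (assumed): Gaussian basis functions on a grid of centres.
alpha = 2.0                              # prior precision of the weights w
centres = np.linspace(0.0, 1.0, 5)       # basis function centres (assumed)
s = 0.2                                  # basis function width (assumed)

def phi(x):
    """Feature vector phi(x) of Gaussian basis functions."""
    return np.exp(-0.5 * ((x - centres) / s) ** 2)

X = np.array([0.1, 0.4, 0.55, 0.9])      # toy training inputs
Phi = np.stack([phi(x) for x in X])      # design matrix, shape (N, M)

# Covariance of y = Phi w under p(w) = N(0, alpha^{-1} I)
cov_y = Phi @ Phi.T / alpha

# Gram matrix from the induced kernel k(x_n, x_m) = phi(x_n)^T phi(x_m) / alpha
K = np.array([[phi(a) @ phi(b) / alpha for b in X] for a in X])

assert np.allclose(cov_y, K)             # the two constructions agree
```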

SLIDE 4

Gaussian Process

A Gaussian process is defined as a probability distribution over functions y(x) such that the set of values of y(x) evaluated at an arbitrary set of points x_1, ..., x_N jointly have a Gaussian distribution.

Gaussian random field: the special case in which the input vector x is two-dimensional.

Stochastic process: y(x) is specified by giving the joint probability distribution for any finite set of values y(x_1), ..., y(x_N) in a consistent manner.

SLIDE 5

GP Connection to Kernel

For a Gaussian stochastic process, the joint distribution over N variables y_1, ..., y_N is specified completely by the second-order statistics (mean and covariance). For most applications we have no prior knowledge about the mean of y(x), so by symmetry we take it to be zero. The Gaussian process is then determined by the covariance of y(x), which is specified by the kernel function:

\mathbb{E}[y(x_n)\, y(x_m)] = k(x_n, x_m)

SLIDE 6

Two Examples of GP

Specify the covariance (kernel) directly.

1. Gaussian kernel: k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)

2. Exponential kernel: k(x, x') = \exp(-\theta |x - x'|) (corresponds to the Ornstein-Uhlenbeck process, originally introduced to describe Brownian motion)

[Figure: sample functions drawn from the GP prior under the Gaussian kernel (left) and the exponential kernel (right).]
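A figure like this can be reproduced with a short script: draw sample functions from the zero-mean GP prior under each kernel. This is a minimal sketch; the evaluation grid, length-scale values, and the small jitter added for numerical stability are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)              # evaluation grid (assumed)

def gaussian_kernel(x, sigma=0.3):
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def exponential_kernel(x, theta=4.0):        # Ornstein-Uhlenbeck covariance
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-theta * d)

for name, K in [("Gaussian", gaussian_kernel(x)),
                ("Exponential", exponential_kernel(x))]:
    K_jitter = K + 1e-6 * np.eye(len(x))     # jitter keeps the Cholesky stable
    L = np.linalg.cholesky(K_jitter)
    samples = L @ rng.standard_normal((len(x), 3))   # three draws y ~ N(0, K)
    print(name, "sample shape:", samples.shape)
```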

SLIDE 7

GP for Regression with Random Noise

If noise on the observed target values is taken into account:

p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})

p(t \mid y) = \mathcal{N}(t \mid y, \beta^{-1} I_N)

p(y) = \mathcal{N}(y \mid 0, K)

p(t) = \int p(t \mid y)\, p(y)\, dy = \mathcal{N}(t \mid 0, C)

where C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}: because the noise and y(x) are independent Gaussian sources, their covariances simply add.

Hint: matrix inversion lemma

(B^{-1} + C D^{-1} C^T)^{-1} = B - B C (D + C^T B C)^{-1} C^T B
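A small sketch of the marginal p(t) = N(t | 0, C): build C = K + β^{-1}I from a kernel and evaluate the log density of the observed targets. The kernel choice, noise precision, and toy data are assumptions for illustration.

```python
import numpy as np

def kernel(a, b, sigma=0.3):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2 sigma^2) (assumed choice)."""
    return np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

beta = 25.0                                   # noise precision (assumed)
X = np.array([0.1, 0.3, 0.6, 0.9])            # toy inputs
t = np.sin(2 * np.pi * X)                     # toy targets

K = kernel(X[:, None], X[None, :])
C = K + np.eye(len(X)) / beta                 # C_nm = k(x_n, x_m) + beta^{-1} delta_nm

# log N(t | 0, C) = -1/2 [ t^T C^{-1} t + log|C| + N log(2 pi) ]
sign, logdet = np.linalg.slogdet(C)
log_marginal = -0.5 * (t @ np.linalg.solve(C, t) + logdet + len(X) * np.log(2 * np.pi))
print("log p(t) =", log_marginal)
```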



SLIDE 11

Commonly Used Kernel for Regression

k(x_n, x_m) = \theta_0 \exp\left( -\frac{\theta_1}{2} \|x_n - x_m\|^2 \right) + \theta_2 + \theta_3\, x_n^T x_m

SLIDE 12

GP for Prediction

p(t_{N+1}) = \mathcal{N}(t_{N+1} \mid 0, C_{N+1}), \qquad
C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}

where k has elements k(x_n, x_{N+1}) for n = 1, ..., N and c = k(x_{N+1}, x_{N+1}) + \beta^{-1}. Conditioning on the observed targets t gives the predictive mean and variance:

m(x_{N+1}) = k^T C_N^{-1} t

\sigma^2(x_{N+1}) = c - k^T C_N^{-1} k

If we rewrite m(x_{N+1}) = \sum_{n=1}^{N} a_n k(x_n, x_{N+1}) and define a kernel function depending only on the distance \|x_n - x_m\|, we obtain an expansion in radial basis functions.
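Putting the pieces together, here is a minimal GP regression predictor that computes m(x_{N+1}) = k^T C_N^{-1} t and σ²(x_{N+1}) = c − k^T C_N^{-1} k, using the kernel of the previous slide. The hyperparameter values θ, the noise precision β, and the toy data are assumptions.

```python
import numpy as np

theta = np.array([1.0, 16.0, 0.0, 0.0])       # (theta0..theta3), assumed values

def kernel(a, b):
    """k(x_n, x_m) = th0 exp(-th1/2 ||x_n - x_m||^2) + th2 + th3 x_n^T x_m."""
    sq = (a - b) ** 2
    return theta[0] * np.exp(-0.5 * theta[1] * sq) + theta[2] + theta[3] * a * b

beta = 25.0                                   # noise precision (assumed)
X = np.array([0.1, 0.3, 0.45, 0.7, 0.9])      # toy training inputs
t = np.sin(2 * np.pi * X)                     # toy training targets

C_N = kernel(X[:, None], X[None, :]) + np.eye(len(X)) / beta

def predict(x_new):
    k = kernel(X, x_new)                      # vector with elements k(x_n, x_{N+1})
    c = kernel(x_new, x_new) + 1.0 / beta
    mean = k @ np.linalg.solve(C_N, t)        # m(x_{N+1})     = k^T C_N^{-1} t
    var = c - k @ np.linalg.solve(C_N, k)     # sigma^2(x_{N+1}) = c - k^T C_N^{-1} k
    return mean, var

print(predict(0.5))
```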



SLIDE 15

Computation time for GP regression

1. Training:
   - GP: inversion of an N × N matrix, O(N^3) + O(N^2).
   - Linear basis function model: inversion of an M × M matrix, O(M^3) + O(M^2).

2. Prediction:
   - GP: O(N) per test point.
   - Linear basis function model: O(M) per test point.

Advantages of GP

- If the number of basis functions M is larger than the number of data points N, the GP treatment is computationally more efficient.
- No need to construct the basis functions explicitly; only the kernel is required.
- Can learn the hyperparameters from the data (maximum likelihood estimation), as sketched below.
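The last point, learning the hyperparameters, amounts to maximizing the log marginal likelihood ln p(t | θ) = −½[ln|C_N| + t^T C_N^{-1} t + N ln 2π] with respect to θ. A hedged sketch using numerical optimization; the kernel form, the fixed noise precision, the initial values, and the use of scipy's default numerical gradients are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([0.1, 0.3, 0.45, 0.7, 0.9])      # toy inputs (assumed)
t = np.sin(2 * np.pi * X)                     # toy targets (assumed)
beta = 25.0                                   # noise precision, kept fixed here

def neg_log_marginal(log_theta):
    theta0, theta1 = np.exp(log_theta)        # optimize in log space for positivity
    sq = (X[:, None] - X[None, :]) ** 2
    C = theta0 * np.exp(-0.5 * theta1 * sq) + np.eye(len(X)) / beta
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + t @ np.linalg.solve(C, t) + len(X) * np.log(2 * np.pi))

res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), method="L-BFGS-B")
print("learned theta:", np.exp(res.x))
```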

SLIDE 17

Automatic relevance determination

The previous kernel doesn't account for the relative importance of each input dimension. Define a kernel with a separate precision parameter per dimension (shown here for a two-dimensional input):

k(x, x') = \theta_0 \exp\left( -\frac{1}{2} \sum_{i=1}^{2} \gamma_i (x_i - x'_i)^2 \right)

Automatically learning these hyperparameters results in automatic relevance determination (ARD), which determines the relative importance of each input dimension: a small \gamma_i makes the kernel, and hence the predictions, insensitive to dimension i. A short sketch of this kernel follows.
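A minimal sketch of the ARD kernel for D-dimensional inputs; the γ values and toy inputs are assumptions chosen so that the second dimension barely matters.

```python
import numpy as np

def ard_kernel(A, B, theta0=1.0, gamma=np.array([10.0, 0.01])):
    """k(x, x') = theta0 * exp(-1/2 * sum_i gamma_i (x_i - x'_i)^2).

    A: (N, D) and B: (M, D) arrays of input vectors; gamma values are assumed.
    """
    diff = A[:, None, :] - B[None, :, :]                 # shape (N, M, D)
    return theta0 * np.exp(-0.5 * np.sum(gamma * diff ** 2, axis=-1))

X = np.array([[0.0, 0.0], [0.1, 5.0], [1.0, 0.2]])       # toy 2-D inputs
K = ard_kernel(X, X)
print(K.round(3))   # rows 0 and 1 stay similar: the second dimension barely matters
```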

SLIDE 18

GP for Classification

Similar to logistic/probit regression, use a nonlinear activation function to transform the real line (-\infty, +\infty) into the probability interval (0, 1).

Introduce a latent variable a; given the latent variable, the target is determined by

p(t \mid a) = \sigma(a)^t (1 - \sigma(a))^{1-t}

The latent function a(x) follows a Gaussian process.

For prediction,

p(t_{N+1} = 1 \mid t_N) = \int p(t_{N+1} = 1 \mid a_{N+1})\, p(a_{N+1} \mid t_N)\, da_{N+1}

Unfortunately, this integral is analytically intractable and must be approximated using sampling methods or an analytical approximation.

SLIDE 21

GP for classification prediction

Use a Gaussian approximation to the posterior distribution over a_{N+1}:

p(a_{N+1} \mid t_N) = \int p(a_{N+1} \mid a_N)\, p(a_N \mid t_N)\, da_N

p(a_{N+1} \mid a_N) = \mathcal{N}(a_{N+1} \mid k^T C_N^{-1} a_N,\; c - k^T C_N^{-1} k)

We need to estimate p(a_N \mid t_N); use a Gaussian approximation:
- The shape of a single-mode distribution is often close to Gaussian.
- As the number of data points falling in a fixed region of x space increases, the corresponding uncertainty in the function a(x) decreases, asymptotically leading to a Gaussian.

SLIDE 23

Different approaches to obtain a Gaussian approximation

1. variational inference
2. expectation propagation
3. Laplace approximation

SLIDE 24

Laplace Approximation

p(a_N) is given by a zero-mean Gaussian process with covariance matrix C_N:

p(a_N) = \mathcal{N}(a_N \mid 0, C_N)

p(t_N \mid a_N) = \prod_{n=1}^{N} \sigma(a_n)^{t_n} (1 - \sigma(a_n))^{1-t_n} = \prod_{n=1}^{N} e^{a_n t_n} \sigma(-a_n)

Let \Psi(a_N) = \ln p(a_N) + \ln p(t_N \mid a_N) (the unnormalized log posterior), and let W_N be the diagonal matrix with elements \sigma(a_n)(1 - \sigma(a_n)). The Hessian matrix A = -\nabla\nabla\Psi(a_N) = W_N + C_N^{-1} is positive definite, so the posterior is log-concave and has a single mode, which is the global maximum.

SLIDE 25

Laplace Approximation (2)

How to find the mode

Use Newton's method:

a_N^{new} = a_N^{old} - (\nabla\nabla\Psi(a_N))^{-1} \nabla\Psi(a_N)
          = a_N^{old} + (W_N + C_N^{-1})^{-1} (t_N - \sigma_N - C_N^{-1} a_N^{old})
          = C_N (I + W_N C_N)^{-1} \{ t_N - \sigma_N + W_N a_N^{old} \}

At the mode,

a_N^{*} = C_N (t_N - \sigma_N)

How to get the Hessian

H = -\nabla\nabla\Psi(a_N) = W_N + C_N^{-1}

q(a_N) = \mathcal{N}(a_N \mid a_N^{*}, H^{-1})
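A sketch of the mode-finding iteration a_N^{new} = C_N (I + W_N C_N)^{-1} {t_N − σ_N + W_N a_N}, run to convergence and then checked against the fixed-point condition a*_N = C_N(t_N − σ_N). The toy data, kernel, jitter, and tolerance are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy classification data and GP prior covariance (all assumed for illustration).
X = np.array([-1.0, -0.5, 0.3, 0.8, 1.2])
t = np.array([0, 0, 1, 1, 1])
C_N = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / 0.5 ** 2) + 1e-6 * np.eye(len(X))

def laplace_mode(C_N, t, max_iter=100, tol=1e-8):
    """Newton iteration for the mode a*_N of p(a_N | t_N)."""
    a = np.zeros(len(t))
    I = np.eye(len(t))
    for _ in range(max_iter):
        sigma = sigmoid(a)
        W = np.diag(sigma * (1.0 - sigma))
        a_new = C_N @ np.linalg.solve(I + W @ C_N, t - sigma + W @ a)
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a

a_star = laplace_mode(C_N, t)
# At the mode, a*_N = C_N (t_N - sigma_N) holds:
print(np.allclose(a_star, C_N @ (t - sigmoid(a_star)), atol=1e-6))
```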

SLIDE 27

Laplace Approximation for Prediction (1)

p(a_{N+1} \mid t_N) \approx \int p(a_{N+1} \mid a_N)\, q(a_N \mid t_N)\, da_N

\mathbb{E}[a_{N+1} \mid t_N] = k^T (t_N - \sigma_N)

\mathrm{var}[a_{N+1} \mid t_N]
 = c - k^T C_N^{-1} k + k^T C_N^{-1} (W_N + C_N^{-1})^{-1} C_N^{-1} k
 = c - k^T C_N^{-1} k + k^T (C_N W_N C_N + C_N)^{-1} k
 = c - k^T C_N^{-1} k + k^T \left( C_N^{-1} - C_N^{-1} C_N (W_N^{-1} + C_N)^{-1} C_N C_N^{-1} \right) k \quad \text{(matrix inversion lemma)}
 = c - k^T C_N^{-1} k + k^T \left( C_N^{-1} - (W_N^{-1} + C_N)^{-1} \right) k
 = c - k^T (W_N^{-1} + C_N)^{-1} k

SLIDE 29

Laplace Approximation for Prediction (2)

Recall that

p(a_{N+1} \mid t_N) = \int p(a_{N+1} \mid a_N)\, p(a_N \mid t_N)\, da_N

and the predictive probability is

p(t_{N+1} = 1 \mid t_N) = \int \sigma(a_{N+1})\, p(a_{N+1} \mid t_N)\, da_{N+1}

Use a probit function to approximate the sigmoid, \sigma(a) \approx \Phi(\lambda a) with \lambda^2 = \pi / 8, so that the integral has a closed form:

\int \Phi(\lambda a)\, \mathcal{N}(a \mid \mu, \sigma^2)\, da = \Phi\!\left( \frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}} \right)
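Combining the Laplace results with the probit approximation gives the predictive class probability. The self-contained sketch below reruns the Newton iteration from the earlier slide and then applies the predictive mean/variance formulas and the closed-form probit integral; the toy data and kernel are assumptions.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kernel(a, b, ell=0.5):
    """Gaussian kernel (assumed choice) for the latent-function covariance."""
    return np.exp(-0.5 * (a - b) ** 2 / ell ** 2)

# Toy 1-D classification data (assumed).
X = np.array([-1.0, -0.5, 0.3, 0.8, 1.2])
t = np.array([0, 0, 1, 1, 1])
C_N = kernel(X[:, None], X[None, :]) + 1e-6 * np.eye(len(X))

# Laplace mode a*_N via the Newton iteration of the earlier slide.
a = np.zeros(len(t))
for _ in range(100):
    s = sigmoid(a)
    W = np.diag(s * (1.0 - s))
    a = C_N @ np.linalg.solve(np.eye(len(t)) + W @ C_N, t - s + W @ a)
sigma_N = sigmoid(a)
W_N = np.diag(sigma_N * (1.0 - sigma_N))

def predict_prob(x_new):
    k = kernel(X, x_new)
    c = kernel(x_new, x_new)
    mu = k @ (t - sigma_N)                                      # E[a_{N+1} | t_N]
    var = c - k @ np.linalg.solve(np.linalg.inv(W_N) + C_N, k)  # var[a_{N+1} | t_N]
    lam_sq = np.pi / 8.0                                        # lambda^2 = pi / 8
    return norm.cdf(mu / np.sqrt(1.0 / lam_sq + var))           # probit approximation

print(predict_prob(-2.0), predict_prob(0.0), predict_prob(2.0))
```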

SLIDE 30

Connection to Neural Networks

The class of functions represented by a neural network is governed by the number of hidden units M; hence the number of hidden units is usually limited based on the size of the training data to avoid over-fitting. From a Bayesian perspective, however, it makes little sense to limit the number of parameters according to the size of the training set. For a broad class of prior distributions over the weights w, the distribution of functions generated by a neural network tends to a Gaussian process in the limit M → ∞. In this limit, though, the output variables of the network become independent, whereas in a finite neural network the outputs share hidden units and can still borrow strength from each other.
