Probabilistic & Unsupervised Learning
Factored Variational Approximations and Variational Bayes
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London
Expectations in Statistical Modelling

◮ Parameter estimation
    θ̂ = argmax_θ ∫dZ P(Z|θ) P(X|Z, θ)
  (or, using EM)
    θ_new = argmax_θ ∫dZ P(Z|X, θ_old) log P(X, Z|θ)
◮ Prediction
    p(x|D, m) = ∫dθ p(θ|D, m) p(x|θ, D, m)
◮ Model selection or weighting (by marginal likelihood)
    p(D|m) = ∫dθ p(θ|m) p(D|θ, m)

These integrals are often intractable:
◮ Analytic intractability: the integrals may have no closed form in non-linear, non-Gaussian models ⇒ numerical integration.
◮ Computational intractability: the numerical integral (or sum, if Z or θ is discrete) may be exponential in the data or model size.
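As a concrete illustration of the first integral's difficulty, the marginal likelihood p(D|m) = ∫dθ p(θ|m) p(D|θ, m) can be estimated by naive Monte Carlo in a toy conjugate model where the exact answer is also available. This is a minimal sketch, not from the slides; the Beta–Bernoulli model and sample sizes are my own choices:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

# Toy model: theta ~ Uniform(0,1) (= Beta(1,1)); data = one ordered sequence
# of n coin flips containing k heads.
n, k = 10, 7

# Exact evidence for that sequence: p(D) = B(k+1, n-k+1) = 1 / ((n+1) C(n,k)).
exact = 1.0 / ((n + 1) * comb(n, k))

# Naive Monte Carlo: p(D) ~= (1/S) sum_s p(D | theta_s), theta_s drawn from the prior.
theta = rng.uniform(size=500_000)
mc = np.mean(theta**k * (1 - theta)**(n - k))

print(exact, mc)  # the two values agree to within a fraction of a percent
```

With one parameter this works; in high dimensions the prior rarely overlaps the likelihood, such naive estimates break down, and that is exactly the computational intractability the slide refers to.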
Intractabilities and approximations

◮ Inference – computational intractability
    ◮ Factored variational approximations
    ◮ Loopy BP / EP / Power EP
    ◮ LP relaxations / convexified BP
    ◮ Gibbs sampling and other MCMC
◮ Inference – analytic intractability
    ◮ Laplace approximation (global)
    ◮ Parametric variational approximations
    ◮ Message approximations (linearised, sigma-point, Laplace)
    ◮ Assumed-density methods and Expectation Propagation
    ◮ (Sequential) Monte Carlo methods
◮ Learning – intractable partition function
    ◮ Sampling parameters
    ◮ Contrastive divergence
    ◮ Score matching
◮ Model selection
    ◮ Laplace approximation / BIC
    ◮ Variational Bayes
    ◮ (Annealed) importance sampling
    ◮ Reversible-jump MCMC

Not a complete list!
Distributed models

[Figure: factorial HMM with M = 3 Markov chains s(1), s(2), s(3) over t = 1 … T, jointly generating the observations x_1 … x_T]

Consider an FHMM with M state variables taking on K values each.

◮ Moralisation puts the simultaneous states (s(1)_t, s(2)_t, …, s(M)_t) into a single clique.
◮ Triangulation extends the cliques to size M + 1.
◮ Each state takes K values ⇒ sums over K^(M+1) terms.
◮ Factorial prior ⇏ factorial posterior (explaining away).

Variational methods approximate the posterior, often in a factored form. To see how they work, we need to review the free-energy interpretation of EM.
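The blow-up is easy to quantify. The following is illustrative arithmetic only (not from the slides), comparing the K^(M+1)-term sums required by exact junction-tree inference with the M·K numbers a fully factored approximation stores per time step:

```python
# Cost of exact vs fully factored inference in an FHMM with M chains of K states.
def exact_clique_terms(M: int, K: int) -> int:
    # Triangulated cliques contain M+1 state variables -> K**(M+1) joint settings.
    return K ** (M + 1)

def factored_numbers(M: int, K: int) -> int:
    # A fully factored q stores one K-vector of marginals per chain per time step.
    return M * K

for M in (3, 5, 10):
    print(M, exact_clique_terms(M, K=3), factored_numbers(M, K=3))
# exact cost grows exponentially in M; the factored representation grows linearly
```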
The Free Energy for a Latent Variable Model

Observed data X = {x_i}; latent variables Z = {z_i}; parameters θ.
Goal: maximise the log likelihood wrt θ (i.e. ML learning):

    ℓ(θ) = log P(X|θ) = log ∫ P(Z, X|θ) dZ

Any distribution q(Z) over the hidden variables can be used to obtain a lower bound on the log likelihood, using Jensen's inequality:

    ℓ(θ) = log ∫ q(Z) [P(Z, X|θ) / q(Z)] dZ ≥ ∫ q(Z) log [P(Z, X|θ) / q(Z)] dZ =: F(q, θ)

Expanding the bound:

    ∫ q(Z) log [P(Z, X|θ) / q(Z)] dZ = ∫ q(Z) log P(Z, X|θ) dZ − ∫ q(Z) log q(Z) dZ
                                     = ∫ q(Z) log P(Z, X|θ) dZ + H[q],

where H[q] is the entropy of q(Z). So:

    F(q, θ) = ⟨log P(Z, X|θ)⟩_{q(Z)} + H[q]
The E and M steps of EM

The log likelihood is bounded below by:

    F(q, θ) = ⟨log P(Z, X|θ)⟩_{q(Z)} + H[q] = ℓ(θ) − KL[q(Z) ‖ P(Z|X, θ)]

EM alternates between:

E step: optimise F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed:

    q^(k)(Z) := argmax_{q(Z)} F(q(Z), θ^(k−1)) = P(Z|X, θ^(k−1))

M step: maximise F(q, θ) wrt the parameters, holding the hidden distribution fixed:

    θ^(k) := argmax_θ F(q^(k)(Z), θ) = argmax_θ ⟨log P(Z, X|θ)⟩_{q^(k)(Z)}
EM as Coordinate Ascent in F
EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

    ℓ(θ^(k−1))  =  F(q^(k), θ^(k−1))  ≤  F(q^(k), θ^(k))  ≤  ℓ(θ^(k)),
             E step               M step              Jensen

◮ The E step brings the free energy up to the likelihood.
◮ The M step maximises the free energy wrt θ.
◮ F ≤ ℓ by Jensen – or, equivalently, by the non-negativity of KL.

If the M step is executed so that θ^(k) ≠ θ^(k−1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.
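The monotonicity argument is easy to verify numerically. Below is a minimal EM for a two-component 1-D Gaussian mixture with fixed unit variances (a toy setup of my own, not from the slides), checking that ℓ(θ) never decreases across iterations:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data from a two-component mixture (unit-variance components).
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])

pi, mu = 0.5, np.array([-0.5, 0.5])   # crude initialisation
ll_trace = []
for _ in range(50):
    # E step: responsibilities r[i, k] = q(z_i = k), computed stably in log space.
    logp = np.stack([np.log(pi) - 0.5 * (x - mu[0])**2,
                     np.log(1 - pi) - 0.5 * (x - mu[1])**2], axis=1)
    m = logp.max(axis=1, keepdims=True)
    ll_trace.append(np.sum(m.squeeze() + np.log(np.exp(logp - m).sum(axis=1))))
    r = np.exp(logp - m)
    r /= r.sum(axis=1, keepdims=True)
    # M step: closed-form updates of the mixing weight and means.
    pi = r[:, 0].mean()
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

# The recorded log likelihoods (up to an additive constant) never decrease.
assert np.all(np.diff(ll_trace) >= -1e-9)
```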
Free-energy-based variational approximation

What if finding the expected sufficient statistics under P(Z|X, θ) is computationally intractable?

For the generalised EM algorithm, we argued that intractable maximisations could be replaced by gradient M steps.
◮ Each step increases the likelihood.
◮ A fixed point of the gradient M step must be at a mode of the expected log-joint.

For the E step we could:
◮ Parameterise q = q_ρ(Z) and take a gradient step in ρ.
◮ Assume some simplified form for q, usually factored: q = ∏_i q_i(Z_i), where the Z_i partition Z, and maximise within this form.

In either case, we choose q from within a limited set Q:

VE step: maximise F(q, θ) wrt the constrained latent distribution, given the parameters:

    q^(k)(Z) := argmax_{q(Z) ∈ Q} F(q(Z), θ^(k−1)).   ← constraint

M step: unchanged:

    θ^(k) := argmax_θ F(q^(k)(Z), θ) = argmax_θ ∫ q^(k)(Z) log P(Z, X|θ) dZ.

Unlike in GEM, the fixed point may not be at an unconstrained optimum of F.
What do we lose?

What does restricting q to Q cost us?

◮ Recall that the free energy is bounded above, by Jensen:

      F(q, θ) ≤ ℓ(θ_ML)

  Thus, as long as every step increases F, convergence is still guaranteed.
◮ But, since P(Z|X, θ^(k)) may not lie in Q, we no longer saturate the bound after the E step:

      ℓ(θ^(k−1))  ≥  F(q^(k), θ^(k−1))  ≤  F(q^(k), θ^(k))  ≤  ℓ(θ^(k)),
               E step               M step              Jensen

  Thus, the likelihood may not increase on each full EM step.
◮ This means we may not (and usually won't) converge to a maximum of ℓ.

The hope is that by increasing a lower bound on ℓ we will find a decent solution.

[Note that if P(Z|X, θ_ML) ∈ Q, then θ_ML is a fixed point of the variational algorithm.]
KL divergence

Recall that

    F(q, θ) = ⟨log P(X, Z|θ)⟩_{q(Z)} + H[q]
            = log P(X|θ) + ⟨log P(Z|X, θ)⟩_{q(Z)} − ⟨log q(Z)⟩_{q(Z)}
            = log P(X|θ) − KL[q ‖ P(Z|X, θ)].

Thus the E step

    q^(k)(Z) := argmax_{q(Z) ∈ Q} F(q(Z), θ^(k−1))

is equivalent to:

E step: minimise KL[q ‖ P(Z|X, θ)] wrt the distribution over latents, given the parameters:

    q^(k)(Z) := argmin_{q(Z) ∈ Q} ∫ q(Z) log [q(Z) / P(Z|X, θ^(k−1))] dZ

So, in each E step, the algorithm is trying to find the best approximation to P(Z|X) in Q in the KL sense. This is related to ideas in information geometry. It also suggests generalisations to other distance measures.
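The identity F(q, θ) = log P(X|θ) − KL[q ‖ P(Z|X, θ)] is easy to verify numerically for a discrete latent variable. A small sketch with a toy distribution of my own (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
# Discrete toy: unnormalised joint P(z, X) over 6 latent states, for fixed data X.
p_joint = rng.uniform(0.1, 1.0, size=6)
log_px = np.log(p_joint.sum())          # log P(X)
posterior = p_joint / p_joint.sum()     # P(z | X)

for _ in range(5):
    q = rng.dirichlet(np.ones(6))       # an arbitrary distribution q(z)
    F = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))  # <log P(z,X)>_q + H[q]
    kl = np.sum(q * np.log(q / posterior))
    assert np.isclose(F, log_px - kl)   # F = log P(X) - KL[q || posterior]
    assert F <= log_px + 1e-12          # Jensen: F lower-bounds log P(X)
```

Maximising F over q is therefore exactly minimising KL[q ‖ P(Z|X)]; restricting q to Q just restricts the search.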
Factored Variational E-step

The most common form of variational approximation partitions Z into disjoint sets Z_i, with

    Q = { q : q(Z) = ∏_i q_i(Z_i) }.

In this case the E step is itself iterative:

(Factored VE step)_i: maximise F(q, θ) wrt q_i(Z_i), given the other q_j and the parameters:

    q_i^(k)(Z_i) := argmax_{q_i(Z_i)} F( q_i(Z_i) ∏_{j≠i} q_j(Z_j), θ^(k−1) ).

◮ The q_i updates are iterated to convergence to "complete" the VE step.
◮ In fact, every (VE)_i step separately increases F, so any schedule of (VE)_i and M steps will converge. The choice can be dictated by practical issues (it is rarely efficient to fully converge the E step before updating the parameters).
Factored Variational E-step

The Factored Variational E-step has a general form. The free energy is:

    F(∏_j q_j(Z_j), θ^(k−1)) = ⟨log P(X, Z|θ^(k−1))⟩_{∏_j q_j(Z_j)} + H[∏_j q_j(Z_j)]
        = ∫ dZ_i q_i(Z_i) ⟨log P(X, Z|θ^(k−1))⟩_{∏_{j≠i} q_j(Z_j)} + H[q_i] + ∑_{j≠i} H[q_j]

Now, taking the variational derivative of the Lagrangian (enforcing normalisation of q_i):

    δ/δq_i ( F + λ(∫ q_i − 1) ) = ⟨log P(X, Z|θ^(k−1))⟩_{∏_{j≠i} q_j(Z_j)} − log q_i(Z_i) − 1 + λ

    (= 0)  ⇒  q_i(Z_i) ∝ exp ⟨log P(X, Z|θ^(k−1))⟩_{∏_{j≠i} q_j(Z_j)}

In general, this depends only on the expected sufficient statistics under the q_j. Thus, again, we don't actually need the entire distributions, just the relevant expectations (now for approximate inference as well as learning).
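The fixed-point update q_i ∝ exp⟨log P⟩ can be exercised on a tiny discrete model, checking that every coordinate update increases F. A sketch with a toy joint of my own (two discrete latents, fully factored q; not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
logP = np.log(rng.uniform(0.1, 1.0, size=(4, 5)))   # log P(z1, z2, X), unnormalised
q1 = rng.dirichlet(np.ones(4))
q2 = rng.dirichlet(np.ones(5))

def free_energy(q1, q2):
    # F = <log P>_{q1 q2} + H[q1] + H[q2]
    return (q1 @ logP @ q2) - q1 @ np.log(q1) - q2 @ np.log(q2)

F_trace = [free_energy(q1, q2)]
for _ in range(20):
    # (VE)_1: q1(z1) ∝ exp <log P>_{q2}
    q1 = np.exp(logP @ q2); q1 /= q1.sum()
    F_trace.append(free_energy(q1, q2))
    # (VE)_2: q2(z2) ∝ exp <log P>_{q1}
    q2 = np.exp(logP.T @ q1); q2 /= q2.sum()
    F_trace.append(free_energy(q1, q2))

assert np.all(np.diff(F_trace) >= -1e-12)           # every (VE)_i step increases F
assert F_trace[-1] <= np.log(np.exp(logP).sum())    # and F stays below log P(X)
```

Note that each update needs only the other factor's distribution through an expectation, exactly as the slide observes.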
Mean-field approximations

If Z_i = z_i (i.e., q is factored over all variables) then the variational technique is often called a "mean-field" approximation.

◮ Suppose P(X, Z) has sufficient statistics that are separable in the latent variables: e.g. the Boltzmann machine

      P(X, Z) = (1/𝒵) exp( ∑_{ij} W_ij s_i s_j + ∑_i b_i s_i )

  with some s_i ∈ Z and the others observed.
◮ Expectations wrt a fully factored q distribute over all s_i ∈ Z:

      ⟨log P(X, Z)⟩_{∏ q_i} = ∑_{ij} W_ij ⟨s_i⟩_{q_i} ⟨s_j⟩_{q_j} + ∑_i b_i ⟨s_i⟩_{q_i}

  (where q_i for s_i ∈ X is a delta function on the observed value).
◮ Thus, we can update each q_i in turn given the means (or, in general, the mean sufficient statistics) of the others.
◮ Each variable sees the mean field imposed by its neighbours, and we update these fields until they all agree.
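For the Boltzmann machine example with binary s_i ∈ {0, 1}, the mean-field update for each unit reduces to a logistic function of the field from the neighbours' means. A minimal sketch, with all units latent and random symmetric weights of my own choosing (not from the slides), taking the energy as ∑_{i<j} W_ij s_i s_j + ∑_i b_i s_i so the field on unit i is b_i + ∑_j W_ij μ_j:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
W = rng.normal(0, 0.5, (n, n))
W = np.triu(W, 1) + np.triu(W, 1).T      # symmetric couplings, zero diagonal
b = rng.normal(0, 0.5, n)
mu = np.full(n, 0.5)                     # mu_i = <s_i>_{q_i}; all units latent here

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for _ in range(200):
    for i in range(n):                   # update each q_i given the others' means
        mu[i] = sigmoid(b[i] + W[i] @ mu)

# At convergence, each mean agrees with the field imposed by its neighbours.
resid = np.abs(mu - sigmoid(b + W @ mu)).max()
assert resid < 1e-6 and np.all((mu > 0) & (mu < 1))
```

Sequential updates are safe here because each one is an exact coordinate maximisation of F, so F increases monotonically and the fields settle to mutual agreement.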
Mean-field FHMM

[Figure: factorial HMM with chains s(1), s(2), s(3) over t = 1 … T, jointly generating x_1 … x_T]

With a fully factored approximation

    q(s^{1:M}_{1:T}) = ∏_{m,t} q^m_t(s^m_t)

the mean-field update for one factor is

    q^m_t(s^m_t) ∝ exp ⟨log P(s^{1:M}_{1:T}, x_{1:T})⟩_{∏_{¬(m,t)} q^{m′}_{t′}(s^{m′}_{t′})}
        = exp ⟨∑_μ ∑_τ log P(s^μ_τ | s^μ_{τ−1}) + ∑_τ log P(x_τ | s^{1:M}_τ)⟩_{∏_{¬(m,t)} q^{m′}_{t′}}
        ∝ exp( ⟨log P(s^m_t | s^m_{t−1})⟩_{q^m_{t−1}} + ⟨log P(x_t | s^{1:M}_t)⟩_{∏_{¬m} q^{m′}_t} + ⟨log P(s^m_{t+1} | s^m_t)⟩_{q^m_{t+1}} )

The first two factors play the role of a forward message, and the last of a backward message:

    α^m_t(i) ∝ e^{∑_j q^m_{t−1}(j) log Φ^m_{ji}} · e^{⟨log A_i(x_t)⟩_{q^{¬m}_t}}     cf. forward–backward: α_t(i) ∝ ∑_j α_{t−1}(j) Φ_ji · A_i(x_t)
    β^m_t(i) ∝ e^{∑_j q^m_{t+1}(j) log Φ^m_{ij}}                                     cf. β_t(i) ∝ ∑_j Φ_ij A_j(x_{t+1}) β_{t+1}(j)

◮ Yields a message-passing algorithm like forward–backward.
◮ Updates depend only on immediate neighbours in the chain.
◮ Chains couple only through the joint output.
◮ Multiple passes; messages depend on (approximate) marginals.
◮ The evidence does not appear explicitly in the backward message (cf. Kalman smoothing).
Structured variational approximation

◮ q(Z) need not be completely factorised.
◮ For example, suppose Z can be partitioned into sets Z_1 and Z_2 such that computing the expected sufficient statistics under P(Z_1|Z_2, X) and P(Z_2|Z_1, X) would be tractable.
  ⇒ Then the factored approximation q(Z) = q(Z_1)q(Z_2) is tractable.
◮ In particular, any factorisation of q(Z) into a product of distributions on trees yields a tractable approximation.

[Figure: a lattice of coupled variables A_t, B_t, C_t, D_t over time steps t, t+1, t+2, …]
Structured FHMM

[Figure: factorial HMM with chains s(1), s(2), s(3) over t = 1 … T, jointly generating x_1 … x_T]

For the FHMM we can factor over the chains:

    q(s^{1:M}_{1:T}) = ∏_m q^m(s^m_{1:T})

    q^m(s^m_{1:T}) ∝ exp ⟨log P(s^{1:M}_{1:T}, x_{1:T})⟩_{∏_{¬m} q^{m′}(s^{m′}_{1:T})}
        = exp ⟨∑_μ ∑_t log P(s^μ_t | s^μ_{t−1}) + ∑_t log P(x_t | s^{1:M}_t)⟩_{∏_{¬m} q^{m′}}
        ∝ exp( ∑_t log P(s^m_t | s^m_{t−1}) + ∑_t ⟨log P(x_t | s^{1:M}_t)⟩_{∏_{¬m} q^{m′}(s^{m′}_t)} )
        = ∏_t P(s^m_t | s^m_{t−1}) ∏_t e^{⟨log P(x_t | s^{1:M}_t)⟩_{∏_{¬m} q^{m′}(s^{m′}_t)}}

This looks like a standard HMM joint, with a modified likelihood term ⇒ cycle through multiple forward–backward passes, updating the likelihood terms each time.
Messages on an arbitrary graph

Consider a DAG (the slide shows a five-node example A, B, C, D, E):

    P(X, Z) = ∏_k P(V_k | pa(V_k)),   and let q(Z) = ∏_i q_i(Z_i) for disjoint sets {Z_i}.

The VE update for q_i is given by

    q_i*(Z_i) ∝ exp ⟨log P(Z, X)⟩_{q_{¬i}(Z)}

where ⟨·⟩_{q_{¬i}(Z)} denotes averaging with respect to q_j(Z_j) for all j ≠ i. Then:

    log q_i*(Z_i) = ∑_k ⟨log P(V_k | pa(V_k))⟩_{q_{¬i}(Z)} + const
        = ∑_{j∈Z_i} ⟨log P(Z_j | pa(Z_j))⟩_{q_{¬i}(Z)} + ∑_{j∈ch(Z_i)} ⟨log P(V_j | pa(V_j))⟩_{q_{¬i}(Z)} + const

This defines messages that are passed between nodes in the graph. Each node receives messages from its Markov boundary: its parents, children, and parents of children (all neighbours in the corresponding factor graph).
Non-factored variational methods

The term variational approximation is used whenever a bound on the likelihood (or on another estimation cost function) is optimised, but does not necessarily become tight. Many further variational approximations have been developed, including:

◮ parametric forms (e.g. Gaussian) for non-linear models
◮ closed-form updates in special cases
◮ numerical or sampling-based computation of expectations
◮ "recognition networks" or amortisation to estimate variational parameters
◮ non-free-energy-based bounds (both upper and lower) on the likelihood.

We can also see MAP or zero-temperature EM and recognition models as parametric forms of variational inference.

Variational methods can also be used to find an approximate posterior on the parameters.
Variational Bayes

So far, we have applied Jensen's bound and factorisations to help with integrals over latent variables. We can do the same for integrals over parameters, in order to bound the log marginal likelihood or evidence:

    log P(X|M) = log ∫ dZ dθ P(X, Z|θ, M) P(θ|M)
        = max_Q ∫ dZ dθ Q(Z, θ) log [P(X, Z, θ|M) / Q(Z, θ)]
        ≥ max_{Q_Z, Q_θ} ∫ dZ dθ Q_Z(Z) Q_θ(θ) log [P(X, Z, θ|M) / (Q_Z(Z) Q_θ(θ))]

The constraint that the distribution Q must factor into the product Q_Z(Z) Q_θ(θ) leads to the variational Bayesian EM algorithm, or just "Variational Bayes".

Some call this the "Evidence Lower Bound" (ELBO). I'm not fond of that term.
Variational Bayesian EM . . .

Coordinate maximisation of the VB free-energy lower bound

    F(Q_Z, Q_θ) = ∫ dZ dθ Q_Z(Z) Q_θ(θ) log [P(X, Z, θ|M) / (Q_Z(Z) Q_θ(θ))]

leads to EM-like updates:

    Q_Z*(Z) ∝ exp ⟨log P(Z, X|θ)⟩_{Q_θ(θ)}          E-like step
    Q_θ*(θ) ∝ P(θ) exp ⟨log P(Z, X|θ)⟩_{Q_Z(Z)}     M-like step

Maximising F is equivalent to minimising the KL divergence between the approximate posterior, Q_θ(θ) Q_Z(Z), and the true posterior, P(θ, Z|X):

    log P(X) − F(Q_Z, Q_θ) = log P(X) − ∫ dZ dθ Q_Z(Z) Q_θ(θ) log [P(X, Z, θ) / (Q_Z(Z) Q_θ(θ))]
        = ∫ dZ dθ Q_Z(Z) Q_θ(θ) log [Q_Z(Z) Q_θ(θ) / P(Z, θ|X)]
        = KL(Q ‖ P)
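The two updates above can be run in closed form for a simple conjugate model. Below is a sketch of VB for a 1-D Gaussian with unknown mean and precision, factored posterior Q(μ)Q(λ); the model, priors and data are my own choices (not from the slides), following the standard Normal–Gamma construction:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(3.0, 2.0, size=200)          # true mean 3, true precision 1/4
n, xbar = len(x), x.mean()

# Model: x_i ~ N(mu, 1/lam); prior mu ~ N(mu0, 1/(kappa0*lam)), lam ~ Gamma(a0, b0).
mu0, kappa0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_lam = 1.0
for _ in range(100):
    # Update Q(mu) = N(m, s2), given <lam> under Q(lam)   (E-like step)
    m = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    s2 = 1.0 / ((kappa0 + n) * E_lam)
    # Update Q(lam) = Gamma(a, b), given moments of Q(mu)  (M-like step)
    a = a0 + 0.5 * (n + 1)
    b = b0 + 0.5 * (np.sum((x - m) ** 2) + n * s2
                    + kappa0 * ((m - mu0) ** 2 + s2))
    E_lam = a / b

# m is the usual precision-weighted mean, and <lam> approximates the true precision.
assert abs(m - (kappa0 * mu0 + n * xbar) / (kappa0 + n)) < 1e-12
assert 0.1 < E_lam < 1.0
```

Each coordinate update holds the other factor fixed and uses only its expectations, mirroring the E-like and M-like steps on the slide.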
Conjugate-Exponential models

Let's focus on conjugate-exponential (CE) latent-variable models:

◮ Condition (1). The joint probability over variables is in the exponential family:

      P(Z, X|θ) = f(Z, X) g(θ) exp{ φ(θ)ᵀ T(Z, X) }

  where φ(θ) is the vector of natural parameters and T the sufficient statistics.
◮ Condition (2). The prior over parameters is conjugate to this joint probability:

      P(θ|ν, τ) = h(ν, τ) g(θ)^ν exp{ φ(θ)ᵀ τ }

  where ν and τ are hyperparameters of the prior.

Conjugate priors are computationally convenient and have an intuitive interpretation:
◮ ν: number of pseudo-observations
◮ τ: values of the pseudo-observations
Conjugate-Exponential examples
In the CE family:

◮ Gaussian mixtures
◮ factor analysis, probabilistic PCA
◮ hidden Markov models and factorial HMMs
◮ linear dynamical systems and switching models
◮ discrete-variable belief networks
◮ other, as yet undreamt-of, combinations of Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and other distributions

Not in the CE family:

◮ Boltzmann machines, MRFs (no simple conjugacy)
◮ logistic regression (no simple conjugacy)
◮ sigmoid belief networks (not exponential)
◮ independent components analysis (not exponential)

Note: one can often approximate such models with a suitable choice from the CE family.
Conjugate-exponential VB

Given an iid data set D = (x1, . . . , xn), if the model is CE then:

◮ Qθ(θ) is also conjugate, i.e.

  Qθ(θ) ∝ P(θ) exp ⟨Σi log P(zi, xi|θ)⟩_{QZ}

        = h(ν, τ) g(θ)^ν e^{φ(θ)ᵀτ} · g(θ)^n e^{⟨log f(Z,X)⟩_{QZ}} e^{φ(θ)ᵀ Σi ⟨T(zi,xi)⟩_{QZ}}

        ∝ h(ν̃, τ̃) g(θ)^ν̃ e^{φ(θ)ᵀτ̃}

  with ν̃ = ν + n and τ̃ = τ + Σi ⟨T(zi, xi)⟩_{QZ}  ⇒ we only need to track ν̃ and τ̃.

◮ QZ(Z) = Πi Qzi(zi) takes the same form as in the E-step of regular EM:

  Qzi(zi) ∝ exp ⟨log P(zi, xi|θ)⟩_{Qθ} ∝ f(zi, xi) e^{φ̄(θ)ᵀ T(zi,xi)} = P(zi|xi, φ̄(θ))

  with expected natural parameters φ̄(θ) = ⟨φ(θ)⟩_{Qθ}  ⇒ inference is unchanged from regular EM.
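As a concrete CE-family instance (a hypothetical example, not from the slides), here is VB-EM for a mixture of product-of-Bernoullis, with a Dirichlet prior on the mixing weights and Beta priors on the Bernoulli parameters. The VB-M step simply adds expected sufficient statistics to the prior pseudo-counts (the ν̃, τ̃ updates), and the VB-E step is the ordinary EM E-step run with expected natural parameters (digamma differences).

```python
import numpy as np
from scipy.special import digamma

def vb_bernoulli_mixture(X, K, iters=100, alpha0=1.0, a0=1.0, b0=1.0, seed=0):
    """VB-EM for a mixture of product-of-Bernoullis (a CE model)."""
    N, D = X.shape
    R = np.random.default_rng(seed).dirichlet(np.ones(K), size=N)  # Q_{z_i}
    for _ in range(iters):
        # VB-M step: conjugate updates add expected sufficient statistics
        alpha = alpha0 + R.sum(0)        # Dirichlet pseudo-counts per component
        a = a0 + R.T @ X                 # Beta "success" pseudo-counts, (K, D)
        b = b0 + R.T @ (1 - X)           # Beta "failure" pseudo-counts
        # VB-E step: EM E-step run with *expected* natural parameters
        Elog_pi = digamma(alpha) - digamma(alpha.sum())
        Elog_mu = digamma(a) - digamma(a + b)
        Elog_1mu = digamma(b) - digamma(a + b)
        log_rho = Elog_pi + X @ Elog_mu.T + (1 - X) @ Elog_1mu.T
        R = np.exp(log_rho - log_rho.max(1, keepdims=True))
        R /= R.sum(1, keepdims=True)     # normalised responsibilities Q_{z_i}
    return R, alpha, a, b

# toy data: two well-separated Bernoulli clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.random((100, 5)) < 0.9,
               rng.random((100, 5)) < 0.1]).astype(float)
R, alpha, a, b = vb_bernoulli_mixture(X, K=2)
```

Note that only the pseudo-counts (alpha, a, b) need to be stored, and each sweep has the same cost as a regular EM iteration.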
The Variational Bayesian EM algorithm

EM for MAP estimation
  Goal: maximize P(θ|X, m) wrt θ
  E step: compute QZ(Z) ← P(Z|X, θ)
  M step: θ ← argmax_θ ∫dZ QZ(Z) log P(Z, X, θ)

Variational Bayesian EM
  Goal: maximise a lower bound on P(X|m) wrt Qθ
  VB-E step: compute QZ(Z) ← P(Z|X, φ̄)
  VB-M step: Qθ(θ) ∝ exp ∫dZ QZ(Z) log P(Z, X, θ)

Properties:

◮ Reduces to the EM algorithm if Qθ(θ) = δ(θ − θ*).
◮ Fm increases monotonically, and incorporates the model complexity penalty.
◮ Analytical parameter distributions (but not constrained to be Gaussian).
◮ The VB-E step has the same complexity as the corresponding E step.
◮ We can use the junction tree, belief propagation, Kalman filtering, etc. algorithms in the VB-E step of VB-EM, but using the expected natural parameters φ̄.
VB and model selection

◮ Variational Bayesian EM yields an approximate posterior Qθ over model parameters.
◮ It also yields an optimised lower bound on the log model evidence:

  max FM(QZ, Qθ) ≤ log P(D|M)

◮ These lower bounds can be compared amongst models to learn the right model (its structure, connectivity, . . . ).
◮ If a continuous domain of models is specified by a hyperparameter η, then the VB free energy depends on that parameter:

  F(QZ, Qθ, η) = ∫dZ dθ QZ(Z) Qθ(θ) log [P(X, Z, θ|η) / (QZ(Z) Qθ(θ))] ≤ log P(X|η)

  A hyper-M step maximises the current bound wrt η:

  η ← argmax_η ∫dZ dθ QZ(Z) Qθ(θ) log P(X, Z, θ|η)
ARD for unsupervised learning

Recall that ARD (automatic relevance determination) was a hyperparameter method to select relevant or useful inputs in regression.

◮ A similar idea used with variational Bayesian methods can learn a latent dimensionality.
◮ Consider factor analysis:

  x ∼ N(Λz, Ψ),   z ∼ N(0, I)

  with a column-wise prior Λ:i ∼ N(0, αi⁻¹ I).

◮ The VB free energy is

  F(QZ(Z), QΛ(Λ), Ψ, α) = ⟨log P(X, Z|Λ, Ψ) + log P(Λ|α) + log P(Ψ)⟩_{QZ QΛ} + . . .

  and so hyperparameter optimisation requires

  α ← argmax_α ⟨log P(Λ|α)⟩_{QΛ}

◮ Now QΛ is Gaussian, with the same form as in linear regression, but with expected moments of z appearing in place of the inputs.
◮ Optimisation wrt the distributions, Ψ and α in turn causes some αi to diverge, as in regression ARD.
◮ In this case, these parameters select “relevant” latent dimensions, effectively learning the dimensionality of z.
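The regression version of these updates (recalled at the top of the slide) is compact. This illustrative sketch (data and noise level assumed) re-estimates each αi from the posterior moment ⟨wi²⟩ and drives the irrelevant ones to infinity; in the VB factor-analysis version, QΛ has exactly this Gaussian form, with expected moments of z standing in for the inputs.

```python
import numpy as np

def ard_regression(X, y, sigma2=0.01, iters=200):
    """Evidence-style ARD for Bayesian linear regression, w_i ~ N(0, 1/alpha_i).
    Each alpha_i is re-estimated from the posterior moment <w_i^2>; for
    irrelevant inputs alpha_i diverges and the weight is pruned."""
    N, D = X.shape
    alpha = np.ones(D)
    for _ in range(iters):
        S = np.linalg.inv(X.T @ X / sigma2 + np.diag(alpha))   # posterior cov of w
        mu = S @ X.T @ y / sigma2                              # posterior mean of w
        alpha = 1.0 / (mu**2 + np.diag(S))                     # alpha_i = 1/<w_i^2>
        alpha = np.minimum(alpha, 1e10)                        # cap for stability
    return mu, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(100)  # 2 relevant inputs
mu, alpha = ard_regression(X, y)
```

The six irrelevant weights shrink towards zero as their αi diverge, while the two relevant ones are recovered essentially unshrunk.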
Augmented Variational Methods
In our examples so far, the approximate variational distribution has been over the “natural” latent variables (and parameters) of the generative model. Sometimes it may be useful to introduce additional latent variables, solely to achieve computational tractability. Two examples are GP regression and the GPLVM.
Sparse GP approximations

GP predictions:

  y′|X, Y, x′ ∼ N( Kx′X (KXX + σ²I)⁻¹ Y,  Kx′x′ − Kx′X (KXX + σ²I)⁻¹ KXx′ + σ² )

Evidence (for learning kernel hyperparameters):

  log P(Y|X) = −½ log |2π(KXX + σ²I)| − ½ Yᵀ (KXX + σ²I)⁻¹ Y

Computing either form requires inverting the N × N matrix KXX + σ²I, in O(N³) time.

One proposal to make this more efficient is to find (or select) a smaller set of possibly fictitious measurements U at inputs Z such that

  P(y′|Z, U, x′) ≈ P(y′|X, Y, x′).

What values should U and Z take?
Variational sparse GP approximations

Write F for the (smooth) GP function values that underlie Y (so Y ∼ N(F, σ²I)).

Introduce latent measurements U at inputs Z (and integrate over U). The likelihood can be written

  P(Y|X) = ∫dF dU P(Y, F, U|X, Z) = ∫dF dU P(Y|F) P(F|U, X, Z) P(U|Z)

Now, both U and F are latent, so we introduce a variational distribution q(F, U) to form a free energy:

  F(q(F, U), θ) = ⟨log [P(Y|F) P(F|U, X, Z) P(U|Z) / q(F, U)]⟩_{q(F,U)}

Now, choose the variational form q(F, U) = P(F|U, X, Z) q(U). That is, fix F|U without reference to Y, so information about Y will need to be “compressed” into q(U). Then

  F(q(U), θ, Z) = ⟨log [P(Y|F) P(F|U, X, Z) P(U|Z) / (P(F|U, X, Z) q(U))]⟩_{P(F|U)q(U)}

                = ⟨ ⟨log P(Y|F)⟩_{P(F|U)} + log P(U|Z) − log q(U) ⟩_{q(U)}
Variational sparse GP approximations

  F(q(U), θ, Z) = ⟨ ⟨log P(Y|F)⟩_{P(F|U)} + log P(U|Z) − log q(U) ⟩_{q(U)}

Now P(F|U) is fixed by the generative model (rather than being subject to free optimisation), so we can evaluate the inner expectation:

  ⟨log P(Y|F)⟩_{P(F|U)} = −½ log |2πσ²I| − (1/2σ²) Tr⟨(Y − F)(Y − F)ᵀ⟩_{P(F|U)}

    = −½ log |2πσ²I| − (1/2σ²) Tr[(Y − ⟨F⟩_{P(F|U)})(Y − ⟨F⟩_{P(F|U)})ᵀ] − (1/2σ²) Tr[Σ_{F|U}]

    = log N(Y | KXZ KZZ⁻¹ U, σ²I) − (1/2σ²) Tr[KXX − KXZ KZZ⁻¹ KZX]

using the GP conditional moments ⟨F⟩_{P(F|U)} = KXZ KZZ⁻¹ U and Σ_{F|U} = KXX − KXZ KZZ⁻¹ KZX. So,

  F(q(U), θ, Z) = ⟨log N(Y | KXZ KZZ⁻¹ U, σ²I) + log P(U|Z) − log q(U)⟩_{q(U)} − (1/2σ²) Tr[KXX − KXZ KZZ⁻¹ KZX].
Variational sparse GP approximations

  F(q(U), θ, Z) = ⟨log [N(Y | KXZ KZZ⁻¹ U, σ²I) P(U|Z) / q(U)]⟩_{q(U)} − (1/2σ²) Tr[KXX − KXZ KZZ⁻¹ KZX].

The expectation is the free energy of a PPCA-like model with normal prior U ∼ N(0, KZZ) and loading matrix KXZ KZZ⁻¹. The maximum of this free energy is the log likelihood of that model (achieved with q equal to the posterior under the PPCA-like model). This gives

  F(q*(U), θ, Z) = log N(Y | 0, KXZ KZZ⁻¹ KZZ KZZ⁻¹ KZX + σ²I) − (1/2σ²) Tr[KXX − KXZ KZZ⁻¹ KZX]

                 = log N(Y | 0, KXZ KZZ⁻¹ KZX + σ²I) − (1/2σ²) Tr[KXX − KXZ KZZ⁻¹ KZX].

Note that we have eliminated all terms in KXX⁻¹: the cost is now dominated by operations with the smaller matrix KZZ.

We can optimise the free energy numerically with respect to Z and θ, to adjust both the GP prior and the quality of the variational approximation.

A similar approach can be used to learn the inputs X if they are unobserved (i.e. in the GPLVM). Assume q(X, F, U) = q(X) P(F|X, U) q(U). Then

  F = ⟨log [P(Y, F, U|X) P(X) / q(X, F, U)]⟩_{q(X,F,U)}

which simplifies into tractable components in much the same way as above.
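The collapsed bound is simple to evaluate numerically. Here is an illustrative sketch (RBF kernel, toy data and noise level all assumed) that computes F(q*(U), θ, Z) and checks it against the exact log evidence: the bound can never exceed it, and becomes tight as Z → X.

```python
import numpy as np

def rbf(A, B, ell=0.7, sf=1.0):
    """Squared-exponential kernel matrix between input sets A (n,d) and B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def log_gauss(y, C):
    """log N(y | 0, C) via a Cholesky factorisation."""
    L = np.linalg.cholesky(C)
    u = np.linalg.solve(L, y)
    return -0.5 * (u @ u) - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def exact_evidence(X, y, sigma2):
    return log_gauss(y, rbf(X, X) + sigma2 * np.eye(len(y)))

def sparse_bound(X, y, Z, sigma2, jitter=1e-8):
    """Collapsed bound: log N(Y|0, Kxz Kzz^-1 Kzx + s2 I) - Tr(Kxx - Kxz Kzz^-1 Kzx)/(2 s2).
    Only Kzz (M x M) is factored; no N x N inverse appears."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf(X, Z)
    A = np.linalg.solve(np.linalg.cholesky(Kzz), Kxz.T)   # A^T A = Kxz Kzz^-1 Kzx
    Qxx = A.T @ A
    bound = log_gauss(y, Qxx + sigma2 * np.eye(len(y)))
    bound -= (np.trace(rbf(X, X)) - np.trace(Qxx)) / (2 * sigma2)
    return bound

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
sigma2 = 0.01
full = exact_evidence(X, y, sigma2)
sub = sparse_bound(X, y, X[:6], sigma2)     # 6 inducing inputs: a true lower bound
tight = sparse_bound(X, y, X, sigma2)       # Z = X: bound matches the evidence
```

In practice one would hand this bound to a gradient-based optimiser over Z and the kernel hyperparameters θ.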
A few references

◮ Jordan, Ghahramani, Jaakkola and Saul, 1999. An introduction to variational methods for graphical models. Machine Learning 37:183–233.
◮ Attias, 2000. A variational Bayesian framework for graphical models. NIPS 12. http://www.gatsby.ucl.ac.uk/publications/papers/03-2000.ps
◮ Beal, 2003. Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Unit, UCL. http://www.cse.buffalo.edu/faculty/mbeal/thesis/