Bayesian Neural Network: Foundation and Practice

Tianyu Cui, Yi Zhao

Department of Computer Science Aalto University

May 2, 2019


Outline

▶ Introduction to Bayesian Neural Networks
▶ Dropout as Bayesian Approximation
▶ Concrete Dropout


Introduction to Bayesian Neural Network


What’s a Neural Network?

Figure: A simple NN (left) and a BNN (right) [Blundell, 2015].

Probabilistic interpretation of an NN:

▶ Model: y = f(x; w) + ϵ, ϵ ∼ N(0, σ²)
▶ Likelihood: P(y|x, w) = N(y; f(x; w), σ²)
▶ Prior: P(w) = N(w; 0, σ²_w I)
▶ Posterior: P(w|y, x) ∝ P(y|x, w) P(w)
▶ MAP: w⋆ = argmax_w P(w|y, x)
▶ Prediction: y′ = f(x′; w⋆)

What’s a Bayesian Neural Network?

Figure: A simple NN (left) and a BNN (right) [Blundell, 2015].

What do I mean by being Bayesian?

▶ Model: y = f(x; w) + ϵ, ϵ ∼ N(0, σ²)
▶ Likelihood: P(y|x, w) = N(y; f(x; w), σ²)
▶ Prior: P(w) = N(w; 0, σ²_w I)
▶ Posterior: P(w|y, x) ∝ P(y|x, w) P(w)
▶ MAP: w⋆ = argmax_w P(w|y, x)
▶ Prediction: y′ = f(x′; w), w ∼ P(w|y, x)
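A hedged note on the last bullet: drawing w from the posterior is a one-sample Monte Carlo estimate of the posterior predictive distribution, which averages the model over all plausible weights:

```latex
P(y' \mid x', y, x)
  = \int P(y' \mid x', w)\, P(w \mid y, x)\, dw
  \approx \frac{1}{T} \sum_{t=1}^{T} P(y' \mid x', w^{(t)}),
  \qquad w^{(t)} \sim P(w \mid y, x).
```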

Why Should We Care?

Calibrated prediction uncertainty: models should know what they don’t know. One example [Gal, 2017]:

▶ We train a model to recognise dog breeds.
▶ What would you want your model to do when it is given a cat?
▶ A prediction with high uncertainty.

Successful applications:

▶ Identifying adversarial examples [Smith, 2018].
▶ Adapting the exploration rate in RL [Gal, 2016].
▶ Self-driving cars [McAllister, 2017; Michelmore, 2018] and medical analysis [Gal, 2017].

One simple algorithm: dropout as Bayesian approximation.

How To Learn a Bayesian Neural Network?

What’s the difficult part?

▶ The posterior P(w|y, x) is generally intractable.
▶ Standard approximate inference (difficult):
  ▶ Laplace approximation [MacKay, 1992];
  ▶ Hamiltonian Monte Carlo [Neal, 1995];
  ▶ (Stochastic) variational inference [Blundell, 2015].
▶ Most of the algorithms above are complicated both in theory and in practice.
▶ A simple and practical Bayesian neural network: dropout [Gal, 2016].


Dropout as Bayesian Approximation

Dropout as Bayesian Approximation

Dropout works by randomly setting network units to zero.

▶ In a classical neural network (without prediction uncertainty):
  ▶ During training: turn on dropout;
  ▶ During prediction: turn off dropout.
▶ In a Bayesian neural network (with prediction uncertainty):
  ▶ During training: turn on dropout;
  ▶ During prediction: turn on dropout.

We can obtain the distribution of the prediction by repeating the forward pass several times. That’s it!
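As a tiny pure-Python sketch of the contrast (the three-input linear "network" and its weights are made up for illustration): turning dropout off gives one deterministic answer, while keeping it on at prediction time makes every forward pass a sample.

```python
import random

def dropout_forward(x, w, p, rng, stochastic=True):
    # One forward pass of a single linear unit with dropout on its inputs:
    # each input is dropped (zeroed) with probability p; survivors are
    # rescaled by 1/(1-p) so the expected output is unchanged.
    if stochastic:
        mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in x]
    else:
        mask = [1.0] * len(x)
    return sum(xi * mi * wi for xi, mi, wi in zip(x, mask, w))

rng = random.Random(0)
x = [1.0, 2.0, 3.0]   # hypothetical input
w = [0.5, -0.2, 0.1]  # hypothetical trained weights

# Classical prediction: dropout off, one deterministic output.
deterministic = dropout_forward(x, w, p=0.5, rng=rng, stochastic=False)

# "Bayesian" prediction: dropout on, repeated passes give a distribution
# whose spread can be read as predictive uncertainty.
samples = [dropout_forward(x, w, p=0.5, rng=rng) for _ in range(1000)]
mc_mean = sum(samples) / len(samples)
```

The Monte Carlo mean hovers around the deterministic output; the sample-to-sample variation is what the deterministic pass throws away.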

Why Is That?

▶ High-level idea: performing variational inference with a specific class of distributions qM(ω) is equivalent to performing dropout training.
▶ Optimizing the ELBO in variational inference is the same as optimizing the cost function in dropout training.
▶ The optimal variational parameters from variational inference are the same as the optimal parameters from dropout training.

Variational Inference

▶ We use a simple distribution qM(ω) to approximate the true posterior p(ω|y, X): qM(ω) ≈ p(ω|y, X).
▶ Minimizing KL(qM(ω) || p(ω|y, X)) is equivalent to minimizing the negative ELBO.
▶ Negative ELBO:
  L(M) = −∫ qM(ω) log p(y|X, ω) dω + KL(qM(ω) || p(ω)).
▶ After optimization, predictions can be estimated by:
  y′ = f(x′; w), w ∼ qM(ω).

Compare Two Objective Functions

▶ Negative ELBO:
  L(M) = −∫ qM(ω) log p(y|X, ω) dω + KL(qM(ω) || p(ω)).
▶ Variational family: qM(ω) = ∏_{i,j} q_{mi,j}(ωi,j), where ωi,j = mi,j · zi and zi ∼ Bernoulli(1 − pi).
▶ Prior: p(ω) = N(ω; 0, I).
▶ Dropout cost function:
  L(W) = (1/N) ∑_{n=1}^{N} (yn − f(xn; W, zn))² + λ ∑_{i,j} ||wi,j||².
▶ The data terms become the same if we use Monte Carlo to estimate the integral (the reparameterization trick).
▶ The regularization terms become the same under a further approximation of the KL term.
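To make the dropout cost function L(W) concrete, here is a hedged NumPy sketch that evaluates it once for a made-up one-layer model (the data, weights, p, and λ are all illustrative stand-ins, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and weights for a one-layer model f(x; W, z) = (x * z) @ W.
N, D = 8, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
W = rng.normal(size=D)
p, lam = 0.5, 1e-2  # dropout probability and weight-decay strength

def dropout_cost(W, X, y, p, lam, rng):
    # Data term: mean squared error with a fresh dropout mask z_n per point,
    # i.e. (1/N) * sum_n (y_n - f(x_n; W, z_n))^2.
    z = rng.binomial(1, 1 - p, size=X.shape) / (1 - p)
    preds = (X * z) @ W
    data_term = np.mean((y - preds) ** 2)
    # Regularization term: lambda * sum of squared weights.
    reg_term = lam * np.sum(W ** 2)
    return data_term + reg_term

cost = dropout_cost(W, X, y, p, lam, rng)
```

The two pieces of this scalar are exactly the two pieces being matched against the negative ELBO: a Monte Carlo data term and a weight penalty.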

I Know You Want Some Code

▶ Train one neural network (network) with dropout;
▶ Keep dropout active at prediction time;
▶ Repeat the forward pass several (e.g. 10) times;
▶ Look at the mean and sample variance of the predictions.
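The recipe above can be sketched in a few lines of NumPy. This is a minimal, hedged illustration: the tiny two-layer regression net and its random weights are stand-ins for a network trained with dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights for a small trained two-layer regression network.
W1, b1 = rng.normal(size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
p = 0.5  # dropout probability on the hidden layer

def predict_once(x):
    h = np.maximum(x @ W1 + b1, 0.0)                       # hidden ReLU layer
    mask = rng.binomial(1, 1 - p, size=h.shape) / (1 - p)  # dropout kept ON
    return (h * mask) @ W2 + b2

x = np.array([[0.3]])
T = 10  # number of stochastic forward passes
samples = np.stack([predict_once(x) for _ in range(T)])

pred_mean = samples.mean(axis=0)  # predictive mean
pred_var = samples.var(axis=0)    # sample variance: the uncertainty estimate
```

`pred_mean` is the point prediction and `pred_var` the dropout-based uncertainty; increasing T trades compute for a smoother estimate.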


Results


Concrete Dropout

How To Choose Dropout Probability?

The simplest way is grid search (used in the original paper).

▶ Problems:
  ▶ An immense waste of computational resources;
  ▶ The number of possible per-layer dropout configurations grows exponentially with the number of model layers.
▶ One workaround: restrict the grid search to a small number of possible dropout values.
  ▶ This might hurt uncertainty calibration.

A More Elegant Method

Concrete Dropout

▶ Tune the dropout probabilities pi using gradient methods.

What Is The Optimisation Objective?

▶ The negative ELBO we used before:
  L(M) = −∫ qM(ω) log p(y|X, ω) dω + KL(qM(ω) || p(ω)).
▶ Now we use almost the same objective:
  L(θ) = −(1/M) ∑_{i∈S} log p(yi|Xi, ω) + KL(qθ(ω) || p(ω)).
▶ qθ(ω) = ∏_{i,j} q_{mi,j}(ωi,j), where ωi,j = mi,j · zi and zi ∼ Bernoulli(1 − pi).
▶ S is a random set of M data points.
▶ −(1/M) ∑_{i∈S} log p(yi|Xi, ω) is the model’s (negative average) log-likelihood.
▶ KL(qθ(ω) || p(ω)) is a "regularisation" term which ensures that the approximate posterior qθ(ω) does not deviate too far from the prior p(ω).
▶ Except: θ = {mi,j, pi}.
▶ This time, we optimize both the weights mi,j and the dropout probabilities pi.

How To Find The Optimal Parameters?

▶ Two methods are often adopted:
  ▶ Score function estimator:
    ▶ the variance of the gradient can be very high.
  ▶ Pathwise derivative estimator (also referred to as the reparameterization trick).
▶ Recall the "reparameterization trick" used in VAEs.
▶ Similarly, in order to train pi, instead of sampling from Bernoulli(1 − pi), we sample from another distribution.
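As a reminder of what the pathwise estimator buys us, here is a minimal pure-Python sketch of the Gaussian reparameterization the slide recalls (µ and σ are arbitrary illustrative values): instead of sampling z ∼ N(µ, σ²) directly, write z = µ + σϵ with parameter-free noise ϵ ∼ N(0, 1), so gradients could flow through µ and σ.

```python
import random

rng = random.Random(0)
mu, sigma = 1.5, 0.5  # illustrative distribution parameters

def sample_z():
    eps = rng.gauss(0.0, 1.0)  # noise that does NOT depend on (mu, sigma)
    return mu + sigma * eps    # deterministic transform g(theta, eps)

zs = [sample_z() for _ in range(50000)]
mean = sum(zs) / len(zs)                          # ≈ mu
var = sum((z - mean) ** 2 for z in zs) / len(zs)  # ≈ sigma**2
```

The samples have the right distribution, yet the randomness lives entirely in ϵ; the same trick cannot be applied directly to a discrete Bernoulli, which is the problem the next slide addresses.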

Reparameterization For The Bernoulli Distribution

▶ When using the reparameterization trick, we assume that the distribution at hand can be re-parameterised in the form g(θ, ϵ):
  ▶ θ is the distribution’s parameters;
  ▶ ϵ is a random variable which does not depend on θ.
▶ But this cannot simply be done for the discrete Bernoulli distribution.
▶ Concrete distribution:
  ▶ a continuous distribution used to approximate discrete random variables.
▶ Replace dropout’s discrete Bernoulli distribution with its continuous relaxation.

Concrete Dropout

▶ Using the following function (with temperature t), we approximate the Bernoulli distribution with a Concrete distribution:
  z = sigmoid((1/t) · (log(1 − p) − log p + log u − log(1 − u)))
▶ Sample u ∼ Unif(0, 1); then z ∼ Bern(1 − p) (approximately, as t → 0).
▶ Compare with the Gaussian case: sample ϵ ∼ N(0, 1) to get z = µ + σϵ ∼ N(µ, σ²).
▶ Now, we have everything needed to train the model.
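A small pure-Python sketch of the relaxation (the values of p and t are illustrative, and the logits are arranged so that, as t → 0, the mask z equals 1 with the keep probability 1 − p, matching z ∼ Bern(1 − p) above):

```python
import math
import random

def concrete_mask(p, t, rng):
    # Relaxed dropout mask: a continuous stand-in for z ~ Bernoulli(1 - p).
    # As the temperature t -> 0 this snaps to {0, 1} with P(z = 1) = 1 - p,
    # while staying differentiable in p for any t > 0.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # u ~ Unif(0, 1), clamped
    logit = (math.log(1.0 - p) - math.log(p)
             + math.log(u) - math.log(1.0 - u)) / t
    return 1.0 / (1.0 + math.exp(-logit))           # sigmoid

rng = random.Random(0)
p, t = 0.3, 0.1
zs = [concrete_mask(p, t, rng) for _ in range(20000)]
mean_z = sum(zs) / len(zs)  # ≈ 1 - p = 0.7
```

Because z is a smooth function of p, the dropout probability can be trained by ordinary gradient descent alongside the weights.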


Result

Using Concrete Dropout, we can choose the dropout probabilities effectively and also obtain better performance. The figure compares Concrete Dropout against baseline models using a DenseNet on the CamVid road-scene semantic segmentation dataset.


Thanks for listening


References

MacKay, David J. C. "A practical Bayesian framework for backpropagation networks." Neural Computation 4.3 (1992): 448-472.

Neal, Radford M. Bayesian Learning for Neural Networks. Vol. 118. Springer Science & Business Media, 2012.

Blundell, Charles, et al. "Weight uncertainty in neural networks." arXiv preprint arXiv:1505.05424 (2015).

Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning, 2016.

McAllister, Rowan, et al. "Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning." International Joint Conferences on Artificial Intelligence, 2017.

Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. "Deep Bayesian active learning with image data." Proceedings of the 34th International Conference on Machine Learning, 2017.

Michelmore, Rhiannon, Marta Kwiatkowska, and Yarin Gal. "Evaluating uncertainty quantification in end-to-end autonomous driving control." arXiv preprint arXiv:1811.06817 (2018).