Raising the Reliability of Estimates of Generative Performance of MRFs
Yuri Burda, Fields Institute
Joint work with Roger Grosse and Ruslan Salakhutdinov
Workshop on Big Data and Statistical Machine Learning, Fields Institute, January 28, 2015
Empirical observations
• Annealing from the data base rates model typically gives better AIS estimates than annealing from the uniform distribution
• The RAISE model approximates the original MRF reasonably well with 1,000–100,000 intermediate distributions
• For models that don't fit the data distribution well (overfit, undertrained, etc.), the RAISE model can be substantially better than the original MRF
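The first bullet can be illustrated with a minimal AIS sketch on a toy fully visible binary MRF, small enough that the exact partition function is available by enumeration. Everything here (weights, sizes, the base-rate marginals) is made up for illustration; the real experiments use RBMs/DBNs on image data.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # small enough to get the exact log Z by enumerating all 2^D states

# Hypothetical fully visible binary MRF: log f(x) = 0.5 x'Wx + b'x (unnormalized)
W = rng.normal(scale=0.3, size=(D, D))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.3, size=D)

def log_f(X):
    return 0.5 * np.einsum('ni,ij,nj->n', X, W, X) + X @ b

# "Data base rates" initial distribution: independent Bernoulli marginals
p0 = rng.uniform(0.3, 0.7, size=D)
logit0 = np.log(p0 / (1 - p0))

def log_p0(X):
    return X @ np.log(p0) + (1 - X) @ np.log(1 - p0)

def ais_log_z(n_chains=200, n_steps=500):
    """AIS estimate of log Z, annealing from the base-rate model to f."""
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    X = (rng.random((n_chains, D)) < p0).astype(float)  # exact sample from p0
    logw = np.zeros(n_chains)
    for bk, bk1 in zip(betas[:-1], betas[1:]):
        # accumulate importance weights: f_{k+1}(x) / f_k(x)
        logw += (bk1 - bk) * (log_f(X) - log_p0(X))
        # one Gibbs sweep under the intermediate distribution f^b * p0^(1-b)
        for j in range(D):
            logits = bk1 * (X @ W[:, j] + b[j]) + (1 - bk1) * logit0[j]
            X[:, j] = (rng.random(n_chains) < 1 / (1 + np.exp(-logits))).astype(float)
    # p0 is normalized, so log Z = log mean importance weight
    m = logw.max()
    return m + np.log(np.mean(np.exp(logw - m)))

# Exact log Z by brute-force enumeration, for comparison
states = ((np.arange(2 ** D)[:, None] >> np.arange(D)) & 1).astype(float)
lf = log_f(states)
m = lf.max()
exact = m + np.log(np.exp(lf - m).sum())

print(f"AIS estimate: {ais_log_z():.3f}   exact: {exact:.3f}")
```

Swapping `p0` for the uniform distribution (all marginals 0.5) in the same sketch is how one would compare the two annealing choices.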
Empirical observations
• It is genuinely hard to know when AIS is or isn't working, and RAISE can give a clue about that
• Most, though likely not all, published results based on AIS estimates with enough intermediate distributions are reliable
Computational Tricks
• RAISE requires estimating a large sum for each test sample, which is computationally expensive
• The method of control variates gives a way to get reasonably reliable estimates while running RAISE on only a few test samples
• RAISE estimates and MRF unnormalized log-probabilities tend to be tightly correlated
• Hence
  (1/k) Σ_{i=1}^{k} [ log p_RAISE(x_i) − log f(x_i) ] + (1/N) Σ_{i=1}^{N} log f(x_i)
  is a low-variance estimator of (1/N) Σ_{i=1}^{N} log p_RAISE(x_i)
• Here x_1, …, x_k are random test set samples (out of N total), and k is small
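The control-variate trick can be sketched as follows. Running RAISE itself is out of scope here, so the expensive RAISE estimates are simulated as synthetic values tightly correlated with the cheap unnormalized MRF log-probabilities, which is the empirical situation the slide describes; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10_000, 100  # test set size, number of expensive RAISE evaluations

# Synthetic stand-ins: cheap_f = unnormalized MRF log-probs (computable for
# every test case), raise_ll = expensive RAISE estimates, tightly correlated
# with cheap_f as observed empirically.
cheap_f = rng.normal(loc=-100.0, scale=10.0, size=N)
raise_ll = cheap_f - 5.0 + rng.normal(scale=0.5, size=N)

target = raise_ll.mean()  # quantity of interest: mean RAISE log-prob over all N

def naive(idx):
    # plain mean over the k cases where RAISE was actually run
    return raise_ll[idx].mean()

def control_variate(idx):
    # mean of the (low-variance) difference on k cases, plus the cheap
    # term's mean over the entire test set
    return (raise_ll[idx] - cheap_f[idx]).mean() + cheap_f.mean()

# Compare the spread of the two estimators over repeated random subsets
trials = [rng.choice(N, size=k, replace=False) for _ in range(200)]
err_naive = np.std([naive(i) - target for i in trials])
err_cv = np.std([control_variate(i) - target for i in trials])
print(f"std of naive estimator:           {err_naive:.3f}")
print(f"std of control-variate estimator: {err_cv:.3f}")
```

The variance reduction comes entirely from the correlation: the difference term has much smaller spread than either quantity alone, so few expensive evaluations suffice.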
Pretraining Very Very Deep Models
• Train an RBM or a DBN
• Unroll the model using RAISE to create a sigmoid belief network with 100 or 1,000 layers
• Use p and q to fine-tune the model with an appropriate algorithm:
  – wake-sleep (Hinton et al., 1995)
  – reweighted wake-sleep (Bornschein & Bengio, 2014)
  – neural variational inference (Mnih & Gregor, 2013)
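One way to picture the unrolling step is the following simplified sketch: each "layer" of the resulting belief net performs one Gibbs step of the RBM with parameters scaled by an annealing coefficient β_k, starting from the base-rate distribution. This is a hypothetical simplification of the RAISE construction (which uses the annealed intermediate distributions of the AIS chain), and all parameter values and sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

D, H, L = 784, 500, 100  # visibles, hiddens, number of unrolled Gibbs layers

# Hypothetical "trained" RBM parameters and data base rates
W = rng.normal(scale=0.01, size=(D, H))
b = np.zeros(D)          # visible biases
c = np.zeros(H)          # hidden biases
p_base = np.full(D, 0.5)

def sample_unrolled(n):
    """Generative pass through the RBM unrolled into a deep belief net:
    layer k is one Gibbs step under the RBM with parameters scaled by
    beta_k, starting from the base-rate distribution (so L Gibbs steps
    yield a 2L-layer sigmoid belief network)."""
    betas = np.linspace(0.0, 1.0, L)
    v = (rng.random((n, D)) < p_base).astype(float)
    for beta in betas:
        h = (rng.random((n, H)) < sigmoid(beta * (v @ W + c))).astype(float)
        v = (rng.random((n, D)) < sigmoid(beta * (h @ W.T + b))).astype(float)
    return v

samples = sample_unrolled(16)
print(samples.shape)
```

The per-layer sigmoid probabilities in this forward pass define the generative distribution p (and, run in the other direction, a recognition distribution q) that the fine-tuning algorithms on the slide would then operate on.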