Deep Learning Basics Lecture 8: Autoencoder & DBM
Princeton University COS 495 Instructor: Yingyu Liang
Autoencoder: a neural network trained to attempt to copy its input to its output. It contains two parts: an encoder that maps the input to a hidden code, and a decoder that maps the code back to a reconstruction.
Input x → Encoder f(·) → hidden representation (the code) h = f(x) → Decoder g(·) → reconstruction r = g(h) = g(f(x))
Hopefully the autoencoder can learn useful properties of the data
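The encoder/decoder pipeline above can be sketched in a few lines of numpy. This is a minimal illustrative example, not the lecture's implementation; the layer sizes, the sigmoid nonlinearity, and the squared-error loss are all assumptions chosen for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input dimension 6, code dimension 2: the bottleneck forces a compressed code.
W_enc = rng.normal(scale=0.1, size=(2, 6))
b_enc = np.zeros(2)
W_dec = rng.normal(scale=0.1, size=(6, 2))
b_dec = np.zeros(6)

def f(x):   # encoder: h = f(x)
    return sigmoid(W_enc @ x + b_enc)

def g(h):   # decoder: r = g(h)
    return sigmoid(W_dec @ h + b_dec)

x = rng.random(6)
h = f(x)                        # the code
r = g(h)                        # the reconstruction
loss = np.sum((x - r) ** 2)     # squared-error reconstruction loss ℓ(x, g(f(x)))
```

Training would adjust the four parameter arrays by gradient descent on this loss, exactly as for any feedforward network.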
(Hinton and Zemel, 1994)
Training: minimize the reconstruction loss
L(x, θ) = ℓ(x, g(f(x)))
Regularized autoencoder: add a regularization term that encourages the model to have other properties
L_R = ℓ(x, g(f(x))) + Ω(h)
Probabilistic view: maximize the log-likelihood
log p(x) = log Σ_{h'} p(h', x)
max log p(x) = max log Σ_{h'} p(h', x)
Since h is the induced representation, Σ_{h'} p(h', x) can be approximated by p(h, x), giving
max log p(h, x) = max [log p(x|h) + log p(h)]
where log p(x|h) corresponds to the reconstruction loss and log p(h) corresponds to the regularization term
Sparse autoencoder: e.g., a Laplace prior p(h_i) = (λ/2) exp(−λ|h_i|) turns log p(h) into an ℓ1 penalty (up to a constant), giving
L_R = ℓ(x, g(f(x))) + λ‖h‖₁
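A quick numerical sketch of the sparse objective. The tiny tied-weight encoder/decoder and the value of λ (`lam`) are illustrative assumptions, not from the slides; the point is only how the ℓ1 term enters the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = rng.normal(scale=0.1, size=(4, 8))
x = rng.random(8)

h = sigmoid(W @ x)                 # code h = f(x)
r = sigmoid(W.T @ h)               # tied-weight decoder g(h) (an assumption)
lam = 0.1                          # sparsity weight λ (illustrative value)
recon = np.sum((x - r) ** 2)       # ℓ(x, g(f(x)))
penalty = lam * np.sum(np.abs(h))  # λ‖h‖₁, pushes code units toward 0
loss = recon + penalty
```

Because the penalty grows with every nonzero code unit, gradient descent on `loss` trades reconstruction accuracy for a sparser h.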
Denoising autoencoder: corrupt the input and train the network to reconstruct the clean input, which prevents the learned mapping from simply being the identity
L(x, θ) = ℓ(x, g(f(x̃))), where x̃ is x + noise
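The corruption step can be made concrete as below. The Gaussian noise scale is an illustrative assumption; note that the loss compares the reconstruction against the clean x, so even a perfect identity map (used here as a stand-in for g∘f) cannot reach zero loss.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(8)                                    # clean input
x_tilde = x + rng.normal(scale=0.3, size=x.shape)    # corrupted input x̃ = x + noise

# Identity stand-in for the autoencoder: copying x̃ straight through
# still incurs loss against the clean x, so identity is no longer optimal.
r = x_tilde
loss = np.sum((x - r) ** 2)   # ℓ(x, g(f(x̃))) measured against the clean input
```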
Boltzmann machine: a probability distribution over binary vectors
p(x) = exp(−E(x)) / Z
with energy function E(x) = −xᵀUx − bᵀx, where U is the weight matrix, b is the bias parameter, and Z is the partition function
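On a tiny model the definition can be verified by brute force: enumerate all binary vectors, compute the energies, and normalize by Z. The dimension and parameter values below are made up for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d = 3
U = rng.normal(scale=0.5, size=(d, d))
U = (U + U.T) / 2          # symmetric weight matrix
b = rng.normal(size=d)

def energy(x):
    # E(x) = -x^T U x - b^T x
    return -x @ U @ x - b @ x

states = [np.array(s) for s in product([0, 1], repeat=d)]
Z = sum(np.exp(-energy(s)) for s in states)        # partition function
p = {tuple(s): np.exp(-energy(s)) / Z for s in states}
```

Low-energy configurations receive high probability; the exponential cost of summing over all 2^d states is exactly why Z is intractable for realistic d.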
With hidden units, split x = (v, h): v visible, h hidden
E(x) = −vᵀRv − vᵀWh − hᵀSh − bᵀv − cᵀh
Given training data v^(1), v^(2), …, v^(n), maximum likelihood training:
max log L(θ) = Σ_i log p(v^(i))
where p(v) = Σ_h p(v, h) = Σ_h (1/Z) exp(−E(v, h))
Restricted Boltzmann machine
p(v, h) = exp(−E(v, h)) / Z, where the energy function is E(v, h) = −vᵀWh − bᵀv − cᵀh, with the weight matrix W and the biases b, c
Z = Σ_v Σ_h exp(−E(v, h))
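The RBM joint and the marginal p(v) = Σ_h p(v, h) can again be checked by brute force on a toy model (2 visible, 2 hidden units; parameter values are illustrative).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 2, 2
W = rng.normal(scale=0.5, size=(nv, nh))   # visible-hidden weights
b = rng.normal(size=nv)                    # visible bias
c = rng.normal(size=nh)                    # hidden bias

def energy(v, h):
    # E(v, h) = -v^T W h - b^T v - c^T h (no visible-visible or hidden-hidden terms)
    return -v @ W @ h - b @ v - c @ h

vs = [np.array(s) for s in product([0, 1], repeat=nv)]
hs = [np.array(s) for s in product([0, 1], repeat=nh)]
Z = sum(np.exp(-energy(v, h)) for v in vs for h in hs)

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_v(v):    # marginal p(v) = sum over h of p(v, h)
    return sum(p_joint(v, h) for h in hs)

total = sum(p_v(v) for v in vs)
```

The "restricted" structure (no within-layer connections) is what makes the conditionals on the next slides factorize.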
Figure from Deep Learning, Goodfellow, Bengio and Courville
p(h | v) = p(v, h) / p(v) = Π_j p(h_j | v), with p(h_j = 1 | v) = σ(c_j + vᵀW_{:,j}), where σ is the logistic function
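The closed-form conditional can be verified against brute-force enumeration of the joint. The tiny RBM below is an illustrative assumption; the check confirms that p(h_j = 1 | v) from the logistic formula matches the ratio of summed Boltzmann weights.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 3, 2
W = rng.normal(scale=0.5, size=(nv, nh))
b = rng.normal(size=nv)
c = rng.normal(size=nh)

def energy(v, h):
    return -v @ W @ h - b @ v - c @ h

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([1.0, 0.0, 1.0])   # an arbitrary visible configuration
hs = [np.array(s, dtype=float) for s in product([0, 1], repeat=nh)]
weights = np.array([np.exp(-energy(v, h)) for h in hs])

for j in range(nh):
    # Brute force: sum weights of hidden states with h_j = 1, normalized.
    brute = sum(w for w, h in zip(weights, hs) if h[j] == 1) / weights.sum()
    # Closed form: sigma(c_j + v^T W[:, j])
    closed = sigmoid(c[j] + v @ W[:, j])
    assert abs(brute - closed) < 1e-9
```

This factorization is what makes block Gibbs sampling in an RBM cheap: all hidden units can be sampled in parallel given v, and vice versa.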
p(v | h) = p(v, h) / p(h) = Π_i p(v_i | h), with p(v_i = 1 | h) = σ(b_i + W_{i,:}h), where σ is the logistic function
Deep Boltzmann machine
p(v, h^1, h^2, h^3) = exp(−E(v, h^1, h^2, h^3)) / Z
E(v, h^1, h^2, h^3) = −vᵀW^1h^1 − (h^1)ᵀW^2h^2 − (h^2)ᵀW^3h^3, with the weight matrices W^1, W^2, W^3
Z = Σ_{v, h^1, h^2, h^3} exp(−E(v, h^1, h^2, h^3))
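The DBM energy is a direct sum of adjacent-layer interaction terms, which the sketch below computes for one configuration. Layer sizes and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# One visible layer and three hidden layers (sizes chosen for illustration).
v  = rng.integers(0, 2, size=4).astype(float)
h1 = rng.integers(0, 2, size=3).astype(float)
h2 = rng.integers(0, 2, size=3).astype(float)
h3 = rng.integers(0, 2, size=2).astype(float)
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(3, 3))
W3 = rng.normal(scale=0.5, size=(3, 2))

def energy(v, h1, h2, h3):
    # Only adjacent layers interact; there are no within-layer terms,
    # so each layer's conditionals factorize given its neighbors.
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)

E = energy(v, h1, h2, h3)
```

As with the RBM, normalizing requires summing exp(−E) over every joint configuration of (v, h^1, h^2, h^3), which is why Z is intractable for realistic layer sizes.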
Figure from Deep Learning, Goodfellow, Bengio and Courville