Machine Learning for Computational Linguistics

Autoencoders + deep learning summary

Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft

July 12, 2016

(Deep) neural networks so far

[Diagram: a network with input vector x = (x1, …, xm), hidden layers h1 … hm, and output y]

  • x is the input vector
  • y is the output vector
  • h1 … hm are the hidden layers (learned/useful representations)
  • The network can be fully connected, or may use sparse connectivity
  • The connections can be feed-forward, or may include recurrent links

So far, we have only studied supervised models.

Unsupervised learning in ANNs

  • Restricted Boltzmann machines (RBMs): similar to latent variable models (e.g., Gaussian mixtures), they treat the representations learned by the hidden layers as latent variables (h), and learn a joint distribution p(x, h) that maximizes the probability of the (unlabeled) data
  • Autoencoders: train a constrained feed-forward network to predict its own input at the output

Restricted Boltzmann machines (RBMs)

[Diagram: a bipartite network in which visible units x1 … x4 are connected to hidden units h1 … h4 through the weight matrix W]

  • RBMs are unsupervised latent variable models; they learn only from unlabeled data
  • They are generative models of the joint probability p(h, x)
  • They correspond to undirected graphical models
  • There are no links within layers
  • The aim is to learn useful features (h)

* As usual, biases are omitted from the diagrams and the formulas.

The distribution defined by RBMs

p(h, x) = e^(hᵀWx) / Z

which is intractable (the partition function Z is difficult to calculate). But the conditional distributions are easy to calculate:

p(h|x) = ∏_j p(h_j|x),  where p(h_j = 1|x) = 1 / (1 + e^(−W_j x))
p(x|h) = ∏_k p(x_k|h),  where p(x_k = 1|h) = 1 / (1 + e^(−Wᵀ_k h))
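These conditionals are cheap to compute. A minimal numpy sketch (biases omitted, as in the formulas above; the function names and the Bernoulli sampling helper are illustrative additions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(W, x):
    # p(h_j = 1 | x) = sigmoid(W_j x), for all hidden units j at once
    return sigmoid(W @ x)

def p_x_given_h(W, h):
    # p(x_k = 1 | h) = sigmoid(W_k^T h), for all visible units k at once
    return sigmoid(W.T @ h)

def sample_bernoulli(p, rng):
    # Draw binary unit states from their conditional probabilities
    return (rng.random(p.shape) < p).astype(float)
```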

Learning in RBMs: the contrastive divergence algorithm

  • We want to maximize the probability that the model assigns to the input, p(x), or equivalently minimize −log p(x)
  • In general, this is not tractable, but efficient approximate algorithms exist

The contrastive divergence algorithm (a sketch in code follows):

  1. Given a training example x, calculate the probabilities of the hidden units, and sample a hidden activation h from this distribution
  2. Sample a reconstruction x′ from p(x|h), and re-sample h′ using x′
  3. Set the update rule to ∆w_ij = ε(x_i h_j − x′_i h′_j)
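In numpy, one such CD-1 update might look like this (not from the lecture; it reuses the helpers from the RBM sketch above, with `eps` playing the role of ε, and using sampled states throughout, which is one of several common variants):

```python
def cd1_update(W, x, eps, rng):
    # Step 1: compute p(h | x) and sample a hidden activation h
    h = sample_bernoulli(p_h_given_x(W, x), rng)
    # Step 2: sample a reconstruction x' from p(x | h), then re-sample h' using x'
    x_rec = sample_bernoulli(p_x_given_h(W, h), rng)
    h_rec = sample_bernoulli(p_h_given_x(W, x_rec), rng)
    # Step 3: Delta w_ij = eps * (x_i h_j - x'_i h'_j), computed as outer products
    return W + eps * (np.outer(h, x) - np.outer(h_rec, x_rec))
```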

Autoencoders

[Diagram: an encoder maps inputs x1 … x5 through weights W to hidden units h1 … h3; a decoder maps these back through W∗ to reconstructions x̂1 … x̂5]

  • Autoencoders are standard feed-forward networks
  • The main difference is that they are trained to predict their own input (they try to learn the identity function)
  • The aim is to learn useful representations of the input at the hidden layer
  • Typically the weights are tied (W∗ = Wᵀ)
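A minimal numpy sketch of such a tied-weight autoencoder (the sigmoid nonlinearity and the squared-error loss are illustrative assumptions; the slides do not fix these choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(W, x):
    # Encoder: hidden representation h = f(W x)
    h = sigmoid(W @ x)
    # Decoder with tied weights: x_hat = f(W* h), where W* = W^T
    x_hat = sigmoid(W.T @ h)
    return h, x_hat

def reconstruction_error(x, x_hat):
    # Training minimizes how badly the network reproduces its own input
    return np.sum((x - x_hat) ** 2)
```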

Under-complete autoencoders

[Diagram: an autoencoder with five inputs and only three hidden units]

  • An autoencoder is said to be under-complete if there are fewer hidden units than inputs
  • The network is forced to learn a more compact (compressed) representation of the input
  • An under-complete autoencoder with a single hidden layer (and linear activations) is equivalent to PCA
  • We need multiple layers for learning non-linear features

Over-complete autoencoders

[Diagram: an autoencoder with three inputs and five hidden units]

  • An autoencoder is said to be over-complete if there are more hidden units than inputs
  • Such a network can normally memorize the input perfectly
  • This type of network is useful if trained with a regularization term that results in sparse hidden units (e.g., L1 regularization), as in the sketch below
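For instance, the training objective can combine the reconstruction error with an L1 penalty on the hidden activations; a minimal sketch (the weight `lam` is an illustrative hyperparameter, not from the slides):

```python
import numpy as np

def sparse_loss(x, x_hat, h, lam=1e-3):
    # Reconstruction error plus an L1 term that drives most hidden
    # activations to zero, so only a few of the over-complete features fire
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
```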

Denoising autoencoders

[Diagram: a corrupted input x̃ is encoded to h and decoded to a reconstruction x̂ of the clean input x]

  • Instead of providing the exact input, we introduce noise by
    – randomly setting some inputs to 0 (as in dropout)
    – adding random (Gaussian) noise
  • The network is still expected to reconstruct the original input (without the noise); a sketch of the corruption step follows
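Both corruption schemes are easy to sketch in numpy (the noise levels `p_drop` and `sigma` are illustrative choices); the loss is still computed against the clean input x:

```python
import numpy as np

def corrupt(x, rng, p_drop=0.3, sigma=0.1):
    # Randomly set some inputs to 0 (dropout-style masking noise) ...
    x_noisy = x * (rng.random(x.shape) >= p_drop)
    # ... and/or add Gaussian noise
    return x_noisy + sigma * rng.standard_normal(x.shape)
```

The network is then trained so that the reconstruction of corrupt(x, rng) matches the clean x.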

Learning manifolds

[Figure illustrating manifold learning, from Goodfellow et al. (2016)]

Unsupervised pre-training

Deep belief networks or stacked autoencoders

  • A common use case for RBMs and autoencoders is as pre-training methods for supervised networks
  • Autoencoders or RBMs are trained using unlabeled data
  • The weights learned during unsupervised training are used to initialize the weights of a supervised network
  • This approach has been one of the reasons for the success of deep networks

Deep unsupervised learning

  • Both autoencoders and RBMs can be ‘stacked’
  • Learn the weights of the first hidden layer from the data
  • Freeze those weights and, using the hidden layer activations as input, train another hidden layer, …
  • This approach is called greedy layer-wise training (sketched below)
  • In the case of RBMs, the resulting networks are called deep belief networks
  • Deep autoencoders built this way are called stacked autoencoders
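Schematically, greedy layer-wise training can be written as the following numpy sketch, assuming a hypothetical train_layer(data, n_hidden) routine (e.g., one of the RBM or autoencoder sketches above) that returns the learned weight matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(X, layer_sizes, train_layer):
    # X: (n_examples, n_features) matrix of unlabeled data
    weights, data = [], X
    for n_hidden in layer_sizes:
        # Train one RBM/autoencoder on the current representation
        W = train_layer(data, n_hidden)   # W has shape (n_hidden, n_inputs)
        weights.append(W)
        # Freeze W; the hidden activations become the next layer's input
        data = sigmoid(data @ W.T)
    return weights  # used to initialize a supervised deep network
```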

Why use pre-training?

  • Pre-training does not require labeled data
  • It can be considered a form of regularization
  • Unsupervised methods may reduce the dimensionality, allowing more efficient computation in the supervised phase
  • Unsupervised learning on large-scale data may find the manifold that contains the input data, counteracting the curse of dimensionality