Machine Learning for Computational Linguistics

Autoencoders + deep learning summary

Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft

July 12, 2016

(Deep) neural networks so far

[Diagram: a network with input vector x = (x1, …, xm), hidden layers h1 … hm, and output y]

  • x is the input vector
  • y is the output vector
  • h1 … hm are the hidden layers (learned/useful representations)
  • The network can be fully connected, or may use sparse connectivity
  • The connections can be feed-forward, or may include recurrent links

So far, we have only studied supervised models.

Unsupervised learning in ANNs

  • Restricted Boltzmann machines (RBMs): similar to latent variable models (e.g., Gaussian mixtures), they treat the representations learned by the hidden layers as latent variables (h), and learn a joint distribution p(x, h) that maximizes the probability of the (unlabeled) data
  • Autoencoders: train a constrained feed-forward network to predict its own input at the output

Restricted Boltzmann machines (RBMs)

[Diagram: a bipartite network in which visible units x1 … x4 are connected to hidden units h1 … h4 through the weight matrix W]

  • RBMs are unsupervised latent variable models; they learn only from unlabeled data
  • They are generative models of the joint probability p(h, x)
  • They correspond to undirected graphical models
  • There are no links within layers
  • The aim is to learn useful features (h)

* As usual, biases are omitted from the diagrams and the formulas.

The distribution defined by RBMs

p(h, x) = e^(hᵀWx) / Z

which is intractable (the partition function Z is difficult to calculate). But the conditional distributions are easy to calculate:

p(h|x) = ∏_j p(h_j|x),  where p(h_j = 1|x) = 1 / (1 + e^(−W_j x))
p(x|h) = ∏_k p(x_k|h),  where p(x_k = 1|h) = 1 / (1 + e^(−Wᵀ_k h))
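These conditionals are cheap to compute. A minimal numpy sketch (biases omitted, as in the formulas above; the function names and the Bernoulli sampling helper are illustrative additions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(W, x):
    # p(h_j = 1 | x) = sigmoid(W_j x), for all hidden units j at once
    return sigmoid(W @ x)

def p_x_given_h(W, h):
    # p(x_k = 1 | h) = sigmoid(W_k^T h), for all visible units k at once
    return sigmoid(W.T @ h)

def sample_bernoulli(p, rng):
    # Draw binary unit states from their conditional probabilities
    return (rng.random(p.shape) < p).astype(float)
```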

Learning in RBMs: the contrastive divergence algorithm

  • We want to maximize the probability that the model assigns to the input, p(x), or equivalently minimize −log p(x)
  • In general, this is not tractable, but efficient approximate algorithms exist

The contrastive divergence algorithm (a sketch in code follows):

  1. Given a training example x, calculate the probabilities of the hidden units, and sample a hidden activation h from this distribution
  2. Sample a reconstruction x′ from p(x|h), and re-sample h′ using x′
  3. Set the update rule to ∆w_ij = ε(x_i h_j − x′_i h′_j)
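In numpy, one such CD-1 update might look like this (not from the lecture; it reuses the helpers from the RBM sketch above, with `eps` playing the role of ε, and using sampled states throughout, which is one of several common variants):

```python
def cd1_update(W, x, eps, rng):
    # Step 1: compute p(h | x) and sample a hidden activation h
    h = sample_bernoulli(p_h_given_x(W, x), rng)
    # Step 2: sample a reconstruction x' from p(x | h), then re-sample h' using x'
    x_rec = sample_bernoulli(p_x_given_h(W, h), rng)
    h_rec = sample_bernoulli(p_h_given_x(W, x_rec), rng)
    # Step 3: Delta w_ij = eps * (x_i h_j - x'_i h'_j), computed as outer products
    return W + eps * (np.outer(h, x) - np.outer(h_rec, x_rec))
```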

Autoencoders

[Diagram: an encoder maps inputs x1 … x5 through weights W to hidden units h1 … h3; a decoder maps these back through W∗ to reconstructions x̂1 … x̂5]

  • Autoencoders are standard feed-forward networks
  • The main difference is that they are trained to predict their own input (they try to learn the identity function)
  • The aim is to learn useful representations of the input at the hidden layer
  • Typically the weights are tied (W∗ = Wᵀ)
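A minimal numpy sketch of such a tied-weight autoencoder (the sigmoid nonlinearity and the squared-error loss are illustrative assumptions; the slides do not fix these choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(W, x):
    # Encoder: hidden representation h = f(W x)
    h = sigmoid(W @ x)
    # Decoder with tied weights: x_hat = f(W* h), where W* = W^T
    x_hat = sigmoid(W.T @ h)
    return h, x_hat

def reconstruction_error(x, x_hat):
    # Training minimizes how badly the network reproduces its own input
    return np.sum((x - x_hat) ** 2)
```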

Under-complete autoencoders

[Diagram: an autoencoder with five inputs and only three hidden units]

  • An autoencoder is said to be under-complete if there are fewer hidden units than inputs
  • The network is forced to learn a more compact (compressed) representation of the input
  • An under-complete autoencoder with a single hidden layer (and linear activations) is equivalent to PCA
  • We need multiple layers for learning non-linear features

Over-complete autoencoders

[Diagram: an autoencoder with three inputs and five hidden units]

  • An autoencoder is said to be over-complete if there are more hidden units than inputs
  • Such a network can normally memorize the input perfectly
  • This type of network is useful if trained with a regularization term that results in sparse hidden units (e.g., L1 regularization), as in the sketch below
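For instance, the training objective can combine the reconstruction error with an L1 penalty on the hidden activations; a minimal sketch (the weight `lam` is an illustrative hyperparameter, not from the slides):

```python
import numpy as np

def sparse_loss(x, x_hat, h, lam=1e-3):
    # Reconstruction error plus an L1 term that drives most hidden
    # activations to zero, so only a few of the over-complete features fire
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
```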

Denoising autoencoders

[Diagram: a corrupted input x̃ is encoded to h and decoded to a reconstruction x̂ of the clean input x]

  • Instead of providing the exact input, we introduce noise by
    – randomly setting some inputs to 0 (as in dropout)
    – adding random (Gaussian) noise
  • The network is still expected to reconstruct the original input (without the noise); a sketch of the corruption step follows
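Both corruption schemes are easy to sketch in numpy (the noise levels `p_drop` and `sigma` are illustrative choices); the loss is still computed against the clean input x:

```python
import numpy as np

def corrupt(x, rng, p_drop=0.3, sigma=0.1):
    # Randomly set some inputs to 0 (dropout-style masking noise) ...
    x_noisy = x * (rng.random(x.shape) >= p_drop)
    # ... and/or add Gaussian noise
    return x_noisy + sigma * rng.standard_normal(x.shape)
```

The network is then trained so that the reconstruction of corrupt(x, rng) matches the clean x.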

Learning manifolds

[Figure illustrating manifold learning, from Goodfellow et al. (2016)]

Unsupervised pre-training

Deep belief networks or stacked autoencoders

  • A common use case for RBMs and autoencoders is as pre-training methods for supervised networks
  • Autoencoders or RBMs are trained using unlabeled data
  • The weights learned during unsupervised training are used to initialize the weights of a supervised network
  • This approach has been one of the reasons for the success of deep networks

Deep unsupervised learning

  • Both autoencoders and RBMs can be ‘stacked’
  • Learn the weights of the first hidden layer from the data
  • Freeze those weights and, using the hidden layer activations as input, train another hidden layer, …
  • This approach is called greedy layer-wise training (sketched below)
  • In the case of RBMs, the resulting networks are called deep belief networks
  • Deep autoencoders built this way are called stacked autoencoders
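Schematically, greedy layer-wise training can be written as the following numpy sketch, assuming a hypothetical train_layer(data, n_hidden) routine (e.g., one of the RBM or autoencoder sketches above) that returns the learned weight matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(X, layer_sizes, train_layer):
    # X: (n_examples, n_features) matrix of unlabeled data
    weights, data = [], X
    for n_hidden in layer_sizes:
        # Train one RBM/autoencoder on the current representation
        W = train_layer(data, n_hidden)   # W has shape (n_hidden, n_inputs)
        weights.append(W)
        # Freeze W; the hidden activations become the next layer's input
        data = sigmoid(data @ W.T)
    return weights  # used to initialize a supervised deep network
```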

Why use pre-training?

  • Pre-training does not require labeled data
  • It can be considered a form of regularization
  • Unsupervised methods may reduce the dimensionality, allowing more efficient computation in the supervised phase
  • Unsupervised learning on large-scale data may find the manifold that contains the input data, counteracting the curse of dimensionality