SLIDE 1

9.1 Overview

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 2

A brief history of computers

         1970s    1980s    1990s    2000s    2010s
Data     10^2     10^3     10^5     10^8     10^11
RAM      ?        1MB      100MB    10GB     1TB
CPU      ?        10MF     1GF      100GF    1PF (GPU)

Methods over time: deep nets → kernel methods → deep nets

  • Data grows at a higher exponent
  • Moore’s law (silicon) vs. Kryder’s law (disks)
  • Early algorithms were data bound; now they are CPU/RAM bound
SLIDE 3

Perceptron

[Figure: inputs x1 … xn with synaptic weights w1 … wn feeding a single output]

y(x) = σ(⟨w, x⟩)
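
A minimal Python sketch of this computation (the hard-threshold choice of σ and the example weights are illustrative assumptions, not from the deck):

```python
import numpy as np

def perceptron_output(w, x):
    """y(x) = sigma(<w, x>) with sigma chosen as a hard threshold."""
    return 1.0 if np.dot(w, x) >= 0 else 0.0

w = np.array([0.5, -0.2, 0.1])   # hypothetical synaptic weights
x = np.array([1.0, 3.0, 2.0])    # hypothetical input
print(perceptron_output(w, x))   # -> 1.0
```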

SLIDE 4

Nonlinearities via Layers

y1i(x) = σ(⟨w1i, x⟩)   (deep nets)
y1i(x) = k(xi, x)      (kernels)
y2(x) = σ(⟨w2, y1⟩)

  • Deep nets: optimize all weights

SLIDE 5

Nonlinearities via Layers

y1i(x) = σ(⟨w1i, x⟩)
y2i(x) = σ(⟨w2i, y1⟩)
y3(x) = σ(⟨w3, y2⟩)

SLIDE 6

Multilayer Perceptron

  • Layer representation: (typically) iterate between a linear mapping Wx and a nonlinear function σ
  • Loss function l(y, yi) to measure the quality of the estimate so far

yi = Wi xi
xi+1 = σ(yi)

[Figure: chain x1 → x2 → x3 → x4 → y through weights W1, W2, W3, W4]
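
As a rough numpy sketch of this forward pass (layer sizes, random weights, and tanh as σ are all illustrative assumptions):

```python
import numpy as np

def sigma(y):
    return np.tanh(y)  # one common choice of nonlinearity

def mlp_forward(x, weights):
    """Iterate y_i = W_i x_i, x_{i+1} = sigma(y_i); keep intermediates for backprop."""
    xs, ys = [x], []
    for W in weights:
        ys.append(W @ xs[-1])
        xs.append(sigma(ys[-1]))
    return xs, ys

rng = np.random.default_rng(0)
weights = [0.1 * rng.normal(size=(8, 4)), 0.1 * rng.normal(size=(3, 8))]  # 4 -> 8 -> 3
xs, ys = mlp_forward(rng.normal(size=4), weights)
```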

SLIDE 7

Backpropagation

  • Layer representation: yi = Wi xi, xi+1 = σ(yi)
  • Compute the change in the objective: gj = ∂Wj l(y, yi)
  • Chain rule: ∂x [f2 ∘ f1](x) = [∂f1 f2](f1(x)) · [∂x f1](x)

SLIDE 8

Backpropagation

  • Layer representation: yi = Wi xi, xi+1 = σ(yi)
  • Gradients: ∂xi yi = Wi,  ∂Wi yi = xi,  ∂yi xi+1 = σ′(yi)  ⇒  ∂xi xi+1 = σ′(yi) Wi⊤
  • Backprop:
      gn = ∂xn l(y, yn)
      gi = ∂xi l(y, yn) = gi+1 ∂xi xi+1
      ∂Wi l(y, yn) = gi+1 σ′(yi) xi⊤
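
Continuing the numpy sketch from SLIDE 6 (reusing its `sigma` and `mlp_forward`), the recursion above might look as follows; `mlp_backward` is a hypothetical helper, not code from the lecture:

```python
def dsigma(y):
    return 1.0 - np.tanh(y) ** 2            # sigma'(y) for sigma = tanh

def mlp_backward(xs, ys, weights, g):
    """g = dl/dx_n at the output; returns dl/dW_i for every layer."""
    grads = []
    for W, x, y in zip(reversed(weights), reversed(xs[:-1]), reversed(ys)):
        gy = g * dsigma(y)                  # chain through sigma'(y_i)
        grads.append(np.outer(gy, x))       # dl/dW_i = g_{i+1} sigma'(y_i) x_i^T
        g = W.T @ gy                        # g_i = g_{i+1} dx_{i+1}/dx_i
    return grads[::-1]
```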

SLIDE 9

Optimization

  • Layer representation: yi = Wi xi, xi+1 = σ(yi)
  • Gradient descent: Wi ← Wi − η ∂Wi l(y, yn)
  • Second order methods (use higher derivatives)
  • Stochastic gradient descent (use only one sample per update)
  • Minibatch (use a small subset per update)
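
Putting the two sketches together, one stochastic gradient step on a single sample under squared loss (again reusing the hypothetical `mlp_forward` and `mlp_backward` helpers):

```python
def sgd_step(weights, x, target, eta=0.01):
    """W_i <- W_i - eta * dW_i l(y, y_n), computed on one (x, target) pair."""
    xs, ys = mlp_forward(x, weights)
    grads = mlp_backward(xs, ys, weights, xs[-1] - target)  # squared-loss gradient
    for W, G in zip(weights, grads):
        W -= eta * G
    return 0.5 * np.sum((xs[-1] - target) ** 2)             # loss before the update
```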

SLIDE 10

Things we could learn

  • Binary classification

  • Multiclass classification (softmax)


  • Regression
  • Ranking (top-k)
  • Preferences
  • Sequences (see CRFs)

log(1 + exp(−y·yn))             (binary classification)
log Σy′ exp(yn[y′]) − yn[y]     (softmax)
½ ‖y − yn‖²                     (regression)

SLIDE 11

9.2 Layers

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 12

Fully Connected

  • Forward mapping, with subsequent nonlinearity
      yi = Wi xi
      xi+1 = σ(yi)
  • Backprop gradients
      ∂xi xi+1 = σ′(yi) Wi⊤
      ∂Wi xi+1 = σ′(yi) xi⊤
  • General purpose layer

SLIDE 13

Rectified Linear Unit (ReLU)

  • Forward mapping yi = Wi xi, with subsequent nonlinearity xi+1 = σ(yi)
  • Gradients of saturating nonlinearities vanish at the tails
  • Solution: replace σ by max(0, x)
  • Derivative is in {0, 1}
  • Sparsity of signal

(Nair & Hinton, machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf)
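
The unit and its derivative take two lines of numpy:

```python
import numpy as np

def relu(y):
    return np.maximum(0.0, y)       # max(0, x), elementwise

def drelu(y):
    return (y > 0).astype(float)    # derivative in {0, 1}; no vanishing tails
```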

SLIDE 14

Where is Wally?

SLIDE 15

LeNet for OCR (1990s)

SLIDE 16

Convolutional Layers

  • Images have translation invariance (to some extent)
  • Low-level features are mostly edge and feature detectors
  • Usually via convolution (plus nonlinearity)

SLIDE 17

Convolutional Layers

  • Images have translation invariance
  • Forward (usually implemented brute force)
      yi = xi ∗ Wi
      xi+1 = σ(yi)
  • Backward gradients (need to convolve appropriately)
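
A brute-force sketch of the forward pass in numpy (single channel, 'valid' boundaries; note that deep-learning 'convolution' is usually implemented as cross-correlation, as here):

```python
import numpy as np

def conv2d(x, w):
    """Slide the filter w over x and take inner products at every position."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

x = np.arange(25.0).reshape(5, 5)
w = np.ones((3, 3)) / 9.0           # box filter as a stand-in for a learned W_i
print(conv2d(x, w).shape)           # (3, 3)
```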

SLIDE 18

Subsampling & MaxPooling

  • Multiple convolutions blow up dimensionality
  • Subsampling: average over patches (this works decently)
  • MaxPooling: pick the maximum over patches (often non-overlapping ones)
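
Both operations in a few lines of numpy (single channel, non-overlapping k × k patches, sizes assumed divisible by k):

```python
import numpy as np

def avgpool2d(x, k=2):
    """Subsampling: average over non-overlapping k x k patches."""
    H, W = x.shape
    return x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))

def maxpool2d(x, k=2):
    """MaxPooling: maximum over non-overlapping k x k patches."""
    H, W = x.shape
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))
```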

SLIDE 19

Depth vs. Width

  • Longer-range effects: many narrow convolutions vs. few wide convolutions
  • More nonlinearities work better (for the same number of parameters)

Simonyan and Zisserman, arxiv.org/pdf/1409.1556v6.pdf

SLIDE 20

Fancy structures

  • Compute different filters
  • Compose one big vector from all of them
  • Layer this iteratively

Szegedy et al. arxiv.org/pdf/1409.4842v1.pdf

SLIDE 21

Whole system training

Le Cun, Bottou, Bengio, Haffner, 2001 yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

SLIDE 22

Whole system training

  • Layers need not be ‘neural networks’
  • Rankers
  • Segmenters
  • Finite state automata
  • Jointly train a full OCR system

Le Cun, Bottou, Bengio, Haffner, 2001 yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

SLIDE 23

9.3 Objectives

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 24

Classification

  • Binary classification: binary exponential model
      log(1 + exp(−y·yn))
  • Multiclass classification (softmax): multinomial exponential model
      −log p(y|yn) = −log( e^yn[y] / Σy′ e^yn[y′] ) = log Σy′ e^yn[y′] − yn[y]
  • Pretty much anything else we did so far in 10-701
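
Both objectives in numpy; the max-subtraction in the softmax term is a standard numerical-stability trick, not something on the slide:

```python
import numpy as np

def binary_loss(y, score):
    """log(1 + exp(-y * score)) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * score))

def softmax_loss(scores, y):
    """log sum_y' exp(scores[y']) - scores[y] for true class index y."""
    m = scores.max()
    return m + np.log(np.sum(np.exp(scores - m))) - scores[y]
```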

SLIDE 25

Regression

  • Least mean squares ½ ‖y − yn‖₂²  (this works for vectors, too)
  • Applications
  • Stock market prediction (more on this later)
  • Image superresolution (regress from a lower-dimensional to a higher-dimensional image)
  • Recommendation and rating (Netflix)

SLIDE 26

Autoencoder

  • Regress from the observation to itself (yn = x1)
  • A lower-dimensional layer acts as the bottleneck
  • Often trained iteratively

[Figure: encoder/decoder pairs (W1, V1), (W2, V2) mapping x1 → x2 → x3 and back to x1]

SLIDE 27

Autoencoder

  • Regress from the observation to itself (yn = x1)
  • A lower-dimensional layer acts as the bottleneck
  • Often trained iteratively
  • Extracts an approximate sufficient statistic of the data
  • Special case: PCA (linear mapping, only a single layer)

[Figure: same encoder/decoder stack as on the previous slide]
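
A minimal numpy sketch of the linear, single-layer special case (toy data; sizes and learning rate are illustrative). Run to convergence, the bottleneck spans the same subspace PCA finds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # toy data, 8 dimensions
W = 0.1 * rng.normal(size=(3, 8))      # encoder: 3-dimensional bottleneck
V = 0.1 * rng.normal(size=(8, 3))      # decoder

for _ in range(5000):
    Z = X @ W.T                        # encode
    R = Z @ V.T - X                    # reconstruction residual
    gV = R.T @ Z / len(X)              # gradient of 1/(2n) ||Z V^T - X||^2 in V
    gW = (R @ V).T @ X / len(X)        # ... and in W
    V -= 0.05 * gV
    W -= 0.05 * gW
```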

SLIDE 28

‘Synesthesia’

  • Different data sources
  • Images and captions
  • Natural language queries and SQL queries
  • Movies and actions
  • Generative embedding for both entities
  • Minimize the distance between pairs
  • Need to prevent everything from clumping together

SLIDE 29

‘Synesthesia’

  • Different data sources
  • Images and captions
  • Natural language queries and SQL queries
  • Movies and actions
  • Large margin of similarity:
      max(0, margin + d(a, b) − d(a, n))

Grefenstette et al., 2014, arxiv.org/abs/1404.7296
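
The loss in code: a is the anchor, b its matching partner from the other modality, n a mismatched ("negative") item:

```python
import numpy as np

def large_margin_loss(a, b, n, margin=1.0):
    """max(0, margin + d(a, b) - d(a, n)): the pair (a, b) should be at least
    `margin` closer than the mismatch (a, n), which prevents clumping."""
    d = lambda u, v: np.linalg.norm(u - v)
    return max(0.0, margin + d(a, b) - d(a, n))
```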

SLIDE 30

Synthetic Data Generation

  • Datasets often have useful invariances
  • Images can be shifted, scaled, RGB transformed, blurred, sharpened, etc.
  • Speech can have echo, background noise, environmental noise
  • Text can have typos, omissions, etc.
  • Generate data and train on the extended noisy set
  • Record-breaking speech recognition (Baidu)
  • Record-breaking image recognition (Baidu, LeCun)
  • Can be very computationally expensive
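
A toy sketch of sampling one synthetic variant of an image (the particular transforms and magnitudes are illustrative, not any of the cited systems' pipelines):

```python
import numpy as np

def augment(img, rng):
    """Apply a random shift, mirror, and mild noise to a 2D image array."""
    out = np.roll(img, shift=int(rng.integers(-2, 3)), axis=1)  # small shift
    if rng.random() < 0.5:
        out = out[:, ::-1]                                      # horizontal mirror
    return out + rng.normal(scale=0.01, size=out.shape)         # pixel noise
```
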
SLIDE 31

Synthetic Data Generation

  • Sample according to the relevance of the transform
  • Similar to Virtual Support Vectors (Schölkopf, 1998)
  • Training with input noise as regularization (Bishop, 1995)
SLIDE 32

9.4 Optimization

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 33

Stochastic Gradient Descent

  • Update parameters according to Wij ← Wij − ηij(t) gij
  • Rate of decay
  • Adjust each layer
  • Adjust each parameter individually
  • Minibatch size
  • Momentum terms
  • Lots of things that can (should) be adjusted (via Bayesian optimization, e.g. Spearmint, MOE)

Senior, Heigold, Ranzato and Yang, 2013, http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf

SLIDE 34

Minibatch

  • Update parameters according to Wij ← Wij − ηij(t) gij
  • Aggregate gradients before applying them
  • Reduces variance in the gradients
  • Better for vectorization (GPUs): vector-vector < vector-matrix < matrix-matrix operations
  • Large minibatches may need large memory (and give slow updates)
  • Magic numbers are 64 to 256 on GPUs

Senior, Heigold, Ranzato and Yang, 2013, http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf
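
A sketch of the batching itself; averaging the per-batch gradients before each single update is what reduces the variance:

```python
import numpy as np

def minibatches(n, batch_size, rng):
    """Yield shuffled index batches over n examples."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]
```

Inside a batch, the per-example computations collapse into matrix-matrix products, which is where the GPU advantage mentioned above comes from.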

SLIDE 35

Learning rate decay

  • Constant (requires a schedule for piecewise-constant rates; tricky)
  • Polynomial decay η(t) = α / (β + t)^γ
    Recall the exponent of 0.5 for conventional SGD and 1 under strong convexity; Bottou picks 0.75
  • Exponential decay η(t) = α e^(−βt)
    Risky, since the decay could be too aggressive
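
Both schedules as one-liners:

```python
import math

def eta_poly(t, alpha=0.1, beta=1.0, gamma=0.75):
    """eta(t) = alpha / (beta + t)^gamma; gamma = 0.75 is Bottou's pick."""
    return alpha / (beta + t) ** gamma

def eta_exp(t, alpha=0.1, beta=0.001):
    """eta(t) = alpha * exp(-beta * t); risky if beta is too aggressive."""
    return alpha * math.exp(-beta * t)
```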

SLIDE 36

AdaGrad

  • Adaptive learning rate (preconditioner)
      ηij(t) = η0 / √(K + Σt′≤t g²ij(t′))
  • For directions with large gradients, decrease the learning rate aggressively to avoid instability
  • If gradients start vanishing, the learning-rate decrease slows down, too
  • Local variant (sum only over a recent window)
      ηij(t) = ηt / √(K + Σt′=t−τ,…,t g²ij(t′))

Duchi, Hazan, Singer, 2010, http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
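
The global variant as a numpy update; K guards against division by zero:

```python
import numpy as np

def adagrad_step(w, g, G, eta0=0.1, K=1e-8):
    """Accumulate squared gradients in G and scale the rate per parameter."""
    G += g ** 2                        # running sum of g_ij^2
    w -= eta0 / np.sqrt(K + G) * g     # large past gradients -> small rate
    return w, G
```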

SLIDE 37

Momentum

  • Average over recent gradients
      mt = (1 − λ) mt−1 + λ gt
      wt ← wt − ηt gt − η̃t mt
  • Helps with local minima and flat (noisy) gradients
  • Can lead to oscillations for large momentum
  • Nesterov’s accelerated gradient: evaluate the gradient at the look-ahead point
      mt+1 = μ mt + ε g(wt − μ mt)
      wt+1 = wt − mt+1
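
A sketch of the Nesterov update, with a toy quadratic to show usage (`grad` is any callable returning the gradient):

```python
import numpy as np

def nesterov_step(w, m, grad, mu=0.9, eps=0.01):
    """Evaluate the gradient at the look-ahead point w - mu*m, then update."""
    m = mu * m + eps * grad(w - mu * m)
    return w - m, m

w, m = np.ones(3), np.zeros(3)
for _ in range(100):
    w, m = nesterov_step(w, m, lambda v: v)   # grad of 0.5 ||w||^2 is w
```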

SLIDE 38

Capacity control

  • Minimizing the loss alone can lead to overfitting
  • Weight decay: wt ← (1 − λ) wt − ηt gt instead of plain wt ← wt − ηt gt
    (prevents parameters from diverging)
  • Parameter clipping
  • Overheated GPUs
  • Numerical instabilities

SLIDE 39

Dropout

  • Avoid parameter sensitivity (small changes in value shouldn’t change the result)
  • Distributed representation (information carried by more than one dimension)
  • Randomized sparsification
      yti ← ξti yti  where  Pr(ξti = π⁻¹) = π  and  Pr(ξti = 0) = 1 − π
  • The same trick works for the matrix W, too: DropConnect (slightly better performance …)

Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov, http://jmlr.org/papers/v15/srivastava14a.html
http://cs.nyu.edu/~wanli/dropc/
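
The sparsification as a train-time function; the 1/π rescaling keeps the expected activation unchanged, so test time uses the activations as-is:

```python
import numpy as np

def dropout(y, pi, rng):
    """Keep each activation with probability pi, scaled by 1/pi; zero otherwise."""
    xi = (rng.random(y.shape) < pi) / pi
    return xi * y
```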

SLIDE 40

Dropout & DropConnect

[Figure: three panels comparing a regular network, Dropout, and DropConnect]

SLIDE 41

Training with Dropout

SLIDE 42

9.5 Memory

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 43

State and models

  • IID data
  • Classification
  • Regression
  • Feature representation …
  • Most of the data isn’t IID
  • Sequence annotation (tagging, parsing)
  • Sequence generation (translation)
  • Summarization
  • Image annotation (content extraction)
  • Alternatives: dynamic programming/stepwise prediction
SLIDE 44

Autoregressive Models / RNN

  • Time series of observations … xt−2, xt−1, xt, xt+1, xt+2 …
  • Estimate xt+1 = f(xt, …, xt−τ), e.g. via a deep net
  • Problem
  • Hard to encode latent state (e.g. parity)
  • Hard to encode long-range context/knowledge
  • Solution: latent state
      xt+1 = f(xt, …, xt−τ, zt, …, zt−τ)
      zt+1 = g(xt+1, …, xt−τ, zt, …, zt−τ)
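
A sketch of one latent-state step with τ = 0, i.e. a plain RNN cell (tanh and the weight shapes are illustrative assumptions, not the slide's exact f and g):

```python
import numpy as np

def rnn_step(x, z, Wx, Wz, Wo):
    """Update the latent state z from (x, z), then predict the next observation."""
    z_new = np.tanh(Wx @ x + Wz @ z)   # latent-state update g
    x_pred = Wo @ z_new                # next-observation estimate f
    return x_pred, z_new
```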

SLIDE 45

Autoregressive Models / RNN

xt+1 = f(xt, …, xt−τ)
xt+1 = f(xt, …, xt−τ, zt, …, zt−τ)
zt+1 = g(xt+1, …, xt−τ, zt, …, zt−τ)

[Figure: unrolled chains of observations x, without and with latent state z]

SLIDE 46

Autoregressive Models / RNN

xt+1 = f(xt, …, xt−τ, zt, …, zt−τ)
zt+1 = g(xt+1, …, xt−τ, zt, …, zt−τ)

  • Sequence of observations
  • Gradients need to propagate back through time
  • Gradients may vanish, due to the stability condition on the gradients; this makes the model difficult to train

SLIDE 47

LSTM (Long Short Term Memory)

  • Sequence of observations
  • Latent state has custom update semantics (like a memory cell); Hochreiter & Schmidhuber

SLIDE 48

LSTM (Long Short Term Memory)

  • Sequence of observations; the latent state has custom update semantics

it = σ(Wi(xt, ht−1) + bi)                        (input gate)
ft = σ(Wf(xt, ht−1) + bf)                        (forget gate)
zt = ft ∗ zt−1 + it ∗ tanh(Wz(xt, ht−1) + bz)    (cell state)
ot = σ(Wo(xt, ht−1, zt) + bo)                    (output gate)
ht = ot ∗ tanh(zt)                               (output)
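
The same equations as a numpy step function (a sketch; the weights live in a dict P, and each W acts on the concatenated arguments shown on the slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, z, P):
    """One LSTM step; the output gate additionally sees the new cell state z_t."""
    xh = np.concatenate([x, h])
    i = sigmoid(P["Wi"] @ xh + P["bi"])                        # input gate
    f = sigmoid(P["Wf"] @ xh + P["bf"])                        # forget gate
    z = f * z + i * np.tanh(P["Wz"] @ xh + P["bz"])            # cell state
    o = sigmoid(P["Wo"] @ np.concatenate([xh, z]) + P["bo"])   # output gate
    h = o * np.tanh(z)                                         # output h_t
    return h, z

rng = np.random.default_rng(0)
nx, nh = 4, 8                         # hypothetical input/state sizes
P = {"Wi": 0.1 * rng.normal(size=(nh, nx + nh)), "bi": np.zeros(nh),
     "Wf": 0.1 * rng.normal(size=(nh, nx + nh)), "bf": np.zeros(nh),
     "Wz": 0.1 * rng.normal(size=(nh, nx + nh)), "bz": np.zeros(nh),
     "Wo": 0.1 * rng.normal(size=(nh, nx + 2 * nh)), "bo": np.zeros(nh)}
h, z = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), P)
```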

SLIDE 49

LSTM (Long Short Term Memory)

(Same equations as on SLIDE 48, with the input gate, forget gate, cell state, output gate, and output highlighted.)
SLIDE 50

LSTM (Long Short Term Memory)

(Same equations as on SLIDE 48.)
SLIDE 51

LSTM (Long Short Term Memory)

  • Sequence of observations; the latent state has custom update semantics

[Figure: unrolled LSTMs for sequence generation and for sequence classification]

SLIDE 52

LSTM (Long Short Term Memory)

  • Group LSTM cells into a layer
  • Stack multiple layers
  • Can model dynamics at different scales

SLIDE 53

Example: Natural Language Translation (Sutskever, Vinyals, Le, NIPS 2014)

SLIDE 54

‘Synesthesia’

  • Sequence embedding via LSTM
  • Enforce closeness between LSTM states to obtain similarity between sequences

SLIDE 55

Much more

  • Memory is an area of active research
  • Neural Turing Machine: http://arxiv.org/abs/1410.5401
  • Memory Networks: http://arxiv.org/abs/1410.3916
  • Attention models (Kyunghyun Cho)
SLIDE 56

9.6 Toolkits

9 Deep Learning

Alexander Smola, Introduction to Machine Learning 10-701, http://alex.smola.org/teaching/10-701-15

SLIDE 57

Quick overview

  • Caffe http://caffe.berkeleyvision.org/
    Efficient for convolutional models / images
  • Torch http://torch.ch/
    Very efficient, but you must LIKE Lua … Google and Facebook love it
  • Theano http://deeplearning.net/software/theano/
    Compiled from Python. Not as efficient as Torch
  • Minerva https://github.com/dmlc/minerva
    Compiler layout of execution on machines
  • CXXNet https://github.com/dmlc/cxxnet
    Simpler than Caffe. More efficient
  • Parameter Server bindings to Minerva, Caffe, CXXNet, … (https://github.com/dmlc/)