9.1 Overview
9 Deep Learning
Alexander Smola Introduction to Machine Learning 10-701 http://alex.smola.org/teaching/10-701-15
A brief history of computers

        1970s    1980s    1990s    2000s    2010s
Data    10^2     10^3     10^5     10^8     10^11
RAM     ?        1MB      100MB    10GB     1TB
CPU     ?        10MF     1GF      100GF    1PF (GPU)

Deep nets (1980s-90s), kernel methods (1990s-2000s), deep nets again (2010s).
Data grows at a higher exponent than memory and compute.
Inputs x1, x2, x3, …, xn enter with synaptic weights w1, …, wn:
  y(x) = σ(⟨w, x⟩)
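A minimal sketch of the single-neuron computation above; numpy and the choice of a logistic σ are assumptions, since the slides leave σ unspecified.

    import numpy as np

    def sigma(z):
        # logistic nonlinearity (one possible choice of sigma)
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(w, x):
        # y(x) = sigma(<w, x>)
        return sigma(np.dot(w, x))

    w = np.array([0.5, -1.0, 2.0])   # synaptic weights w1, ..., wn
    x = np.array([1.0, 0.0, 0.5])    # inputs x1, ..., xn
    print(neuron(w, x))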
Kernels: the first layer is fixed by the kernel, only the output weights are learned
  y1i(x) = σ(⟨w1i, x⟩), e.g. y1i = k(xi, x)
  y2(x) = σ(⟨w2, y1⟩)
Deep nets: learn all weights
  y1i(x) = σ(⟨w1i, x⟩)
  y2i(x) = σ(⟨w2i, y1⟩)
  y3(x) = σ(⟨w3, y2⟩)
Multiple layers: each layer applies a linear mapping Wx and a nonlinear function σ
  yi = Wi xi,  xi+1 = σ(yi)
Network: x1 → W1 → x2 → W2 → x3 → W3 → x4 → W4 → y
A loss l(y, yi) to measure the quality of the estimate so far.
Gradients gj = ∂Wj l(y, yi) follow from the chain rule:
  ∂x [f2 ∘ f1](x) = [∂f1 f2](f1(x)) · [∂x f1](x)
Layer-wise derivatives:
  ∂xi yi = Wi,  ∂Wi yi = xi,  ∂yi xi+1 = σ′(yi)  ⇒  ∂xi xi+1 = σ′(yi) Wi^T
Backpropagation:
  gn = ∂xn l(y, yn)
  gi = ∂xi l(y, yn) = gi+1 ∂xi xi+1
  ∂Wi l(y, yn) = gi+1 σ′(yi) xi^T
Optimization: one can use higher derivatives (Newton-type methods) or use only one sample at a time (stochastic gradient descent).
Stochastic gradient descent update:
  Wi ← Wi − η ∂Wi l(y, yn)
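A hedged numpy sketch of the forward pass, backpropagation, and one stochastic gradient step for the layered model above; tanh as σ, the squared loss, and the layer sizes are illustrative assumptions.

    import numpy as np

    def sigma(z):  return np.tanh(z)             # assumed nonlinearity
    def dsigma(z): return 1.0 - np.tanh(z) ** 2  # sigma'(y)

    def forward(Ws, x):
        # x_{i+1} = sigma(W_i x_i); keep intermediates for backpropagation
        xs, ys = [x], []
        for W in Ws:
            ys.append(W @ xs[-1])
            xs.append(sigma(ys[-1]))
        return xs, ys

    def backward(Ws, xs, ys, target):
        # squared loss 0.5 * ||x_last - target||^2, hence g_n = x_last - target
        g = xs[-1] - target
        grads = [None] * len(Ws)
        for i in reversed(range(len(Ws))):
            delta = g * dsigma(ys[i])          # g_{i+1} * sigma'(y_i)
            grads[i] = np.outer(delta, xs[i])  # dl/dW_i = g_{i+1} sigma'(y_i) x_i^T
            g = Ws[i].T @ delta                # g_i, propagated to the layer below
        return grads

    eta = 0.1
    Ws = [np.random.randn(4, 3) * 0.1, np.random.randn(2, 4) * 0.1]
    x, target = np.random.randn(3), np.random.randn(2)
    xs, ys = forward(Ws, x)
    for W, g in zip(Ws, backward(Ws, xs, ys, target)):
        W -= eta * g                           # W_i <- W_i - eta * dl/dW_i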
Loss functions:
  Binary exponential model: log(1 + exp(−y yn))
  Multinomial exponential model: log Σ_{y′} exp(yn[y′]) − yn[y]
  Squared loss: ½ ‖y − yn‖²
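A small numpy sketch of the three losses, assuming y ∈ {−1, +1} for the binary case, an integer class label with a score vector yn for the multinomial case, and vectors for the squared loss.

    import numpy as np

    def binary_loss(y, yn):
        # log(1 + exp(-y * yn)) with label y in {-1, +1} and scalar score yn
        return np.log1p(np.exp(-y * yn))

    def multinomial_loss(y, yn):
        # log sum_{y'} exp(yn[y']) - yn[y] with class label y and score vector yn
        return np.log(np.sum(np.exp(yn))) - yn[y]

    def squared_loss(y, yn):
        # 0.5 * ||y - yn||^2 for vector-valued targets
        return 0.5 * np.sum((y - yn) ** 2)

    print(binary_loss(+1, 2.0))
    print(multinomial_loss(1, np.array([0.2, 1.5, -0.3])))
    print(squared_loss(np.array([1.0, 0.0]), np.array([0.8, 0.1])))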
Layer (x2 → W2 → x3): linear mapping with subsequent nonlinearity
  yi = Wi xi,  xi+1 = σ(yi)
  ∂xi xi+1 = σ′(yi) Wi^T
  ∂Wi xi+1 = σ′(yi) xi^T
The same layer with the rectified linear unit σ(y) = max(0, y) as subsequent nonlinearity
(Nair & Hinton, machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf)
  yi = Wi xi,  xi+1 = σ(yi)
LeNet for OCR (1990s)
Convolutional layers: images are translation invariant (to some extent), so weights and feature detectors are shared across locations; the linear map becomes a convolution (plus nonlinearity, and gradients need to convolve appropriately):
  yi = xi ⋆ Wi,  xi+1 = σ(yi)
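A numpy sketch of the convolutional layer equation, assuming a single channel, 'valid' boundaries, and a ReLU σ; real implementations handle multiple channels and use optimized kernels.

    import numpy as np

    def conv2d(x, W):
        # valid 2-D correlation: the same filter W is applied at every location of x
        H = x.shape[0] - W.shape[0] + 1
        K = x.shape[1] - W.shape[1] + 1
        y = np.zeros((H, K))
        for r in range(H):
            for c in range(K):
                y[r, c] = np.sum(x[r:r + W.shape[0], c:c + W.shape[1]] * W)
        return y

    x = np.random.randn(8, 8)        # toy single-channel image x_i
    W = np.random.randn(3, 3) * 0.1  # shared feature detector W_i
    x_next = np.maximum(0.0, conv2d(x, W))   # x_{i+1} = sigma(x_i * W_i) with ReLU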
Convolutions work decently in practice. Pooling aggregates responses over patches (often non-overlapping ones). Several stacked small convolutions work better than one large convolution (same number of parameters):
Simonyan and Zisserman arxiv.org/pdf/1409.1556v6.pdf
Szegedy et al. arxiv.org/pdf/1409.4842v1.pdf
Le Cun, Bottou, Bengio, Haffner, 2001 yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
Whole system training: the complete OCR system combines 'neural networks' with automata and is trained as a whole.
Binary exponential model: −log p(y|yn) = log(1 + exp(−y yn))
Multinomial exponential model: −log p(y|yn) = −log [ exp(yn[y]) / Σ_{y′} exp(yn[y′]) ] = log Σ_{y′} exp(yn[y′]) − yn[y]
Squared loss ½ ‖y − yn‖² works for vectors, too (e.g. regress from a lower dimensional to a higher dimensional image).
Autoencoder: encode x1 → x2 with W1 and decode back to x1 with V1; the hidden layer x2 is a bottleneck, so it acts as a sufficient statistic of the data. Layers can be stacked: encode x2 → x3 with W2, decode with V2.
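A toy numpy sketch of a single-bottleneck autoencoder trained with the squared reconstruction loss; the sizes (5 → 2 → 5), tanh encoder, and linear decoder are assumptions for illustration.

    import numpy as np

    def sigma(z):  return np.tanh(z)
    def dsigma(z): return 1.0 - np.tanh(z) ** 2

    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (2, 5))   # encoder x1 -> x2 (bottleneck of size 2)
    V1 = rng.normal(0, 0.1, (5, 2))   # decoder x2 -> reconstruction of x1
    x1 = rng.normal(size=5)

    for _ in range(200):               # minimize 0.5 * ||V1 sigma(W1 x1) - x1||^2
        y1 = W1 @ x1
        x2 = sigma(y1)                 # bottleneck code
        g = V1 @ x2 - x1               # gradient of the reconstruction loss
        gV1 = np.outer(g, x2)
        gW1 = np.outer((V1.T @ g) * dsigma(y1), x1)
        V1 -= 0.1 * gV1
        W1 -= 0.1 * gW1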
Embedding pairs of objects, e.g. queries and SQL queries: learn an embedding for both entities and measure the distance between pairs. To keep the embedding from clumping everything together, use a large margin loss:
  max(0, margin + d(a, b) − d(a, n))
where b matches the anchor a and n does not.
Grefenstette et al, 2014, arxiv.org/abs/1404.7296
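A minimal numpy version of the large margin objective above; the Euclidean distance d and the margin value are assumptions.

    import numpy as np

    def margin_loss(a, b, n, margin=1.0):
        # max(0, margin + d(a, b) - d(a, n)) with Euclidean d
        return max(0.0, margin + np.linalg.norm(a - b) - np.linalg.norm(a - n))

    a = np.array([0.1, 0.9])   # embedding of the query
    b = np.array([0.2, 0.8])   # matching entity: should stay close
    n = np.array([0.9, 0.1])   # non-matching entity: should be at least `margin` farther
    print(margin_loss(a, b, n))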
Train with distorted versions of the data (blurred, sharpened, etc.; environmental noise for audio) to build in invariances.
Hyperparameters such as learning rates need tuning (via Bayesian optimization, e.g. Spearmint, MOE).
Senior, Heigold, Ranzato and Yang, 2013 http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf
Wij ← Wij − ηij(t)gij
Basic linear algebra efficiency increases from (vector, vector) to (vector, matrix) to (matrix, matrix) operations, which favors minibatching (at the cost of slower updates).
Learning rate schedules:
  Piecewise constant (requires a schedule, tricky).
  Polynomial decay η(t) = α / (β + t)^γ: recall an exponent of 0.5 for conventional SGD and 1 under strong convexity; Bottou picks 0.75.
  Exponential decay η(t) = α e^(−βt): risky, since the decay could be too aggressive.
Decrease the learning rate aggressively enough to avoid instability, but note that decreasing it reduces progress, too.
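A short sketch of the two schedules, with α, β, γ chosen arbitrarily for illustration (γ = 0.75 following the Bottou choice mentioned above).

    import numpy as np

    def eta_poly(t, alpha=0.1, beta=1.0, gamma=0.75):
        # eta(t) = alpha / (beta + t)^gamma
        return alpha / (beta + t) ** gamma

    def eta_exp(t, alpha=0.1, beta=0.01):
        # eta(t) = alpha * exp(-beta * t)
        return alpha * np.exp(-beta * t)

    for t in (0, 10, 100, 1000):
        print(t, eta_poly(t), eta_exp(t))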
Per-coordinate learning rates (AdaGrad):
  ηij(t) = η0 / √(K + Σ_t gij(t)^2)
Duchi, Hazan, Singer, 2010 http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
Windowed variant:
  ηij(t) = ηt / √(K + Σ_{t′ = t−τ … t} gij(t′)^2)
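A per-coordinate (AdaGrad-style) update in numpy, following the unwindowed formula above; η0 and K are illustrative values.

    import numpy as np

    def adagrad_update(W, g, accum, eta0=0.1, K=1e-8):
        # eta_ij(t) = eta0 / sqrt(K + sum_t g_ij(t)^2), applied coordinate-wise
        accum += g ** 2
        W -= eta0 / np.sqrt(K + accum) * g
        return W, accum

    W = np.zeros((2, 3))
    accum = np.zeros_like(W)
    for _ in range(10):
        g = np.random.randn(2, 3)        # stand-in for a backprop gradient
        W, accum = adagrad_update(W, g, accum)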
Momentum: smooth the gradient, mt = (1 − λ) mt−1 + λ gt, and update wt ← wt − ηt gt − η̃t mt.
Nesterov momentum: mt+1 = µ mt + ε g(wt − µ mt),  wt+1 = wt − mt+1.
Weight decay: wt ← (1 − λ) wt − ηt gt instead of wt ← wt − ηt gt; this prevents parameters from diverging.
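A sketch combining Nesterov momentum with weight decay on a toy objective; the quadratic 0.5‖w‖² stands in for the real loss, and the constants are assumptions.

    import numpy as np

    def grad(w):
        # gradient of the toy objective 0.5 * ||w||^2
        return w

    w, m = np.ones(3), np.zeros(3)
    eta, mu, lam = 0.1, 0.9, 1e-4

    for _ in range(100):
        # Nesterov momentum: gradient at the look-ahead point w - mu * m
        m = mu * m + eta * grad(w - mu * m)
        w = w - m
        # weight decay: w <- (1 - lambda) * w keeps parameters from diverging
        w = (1.0 - lam) * w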
Dropout: make the network robust, so that small changes in a unit's value shouldn't change the result and information is carried by more than one dimension; this gives slightly better performance …
Randomly drop units during training:
  replace yti by ξti yti, where Pr(ξti = π^(−1)) = π and Pr(ξti = 0) = 1 − π
Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov http://jmlr.org/papers/v15/srivastava14a.html http://cs.nyu.edu/~wanli/dropc/
(figure: regular network vs. Dropout vs. DropConnect)
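A numpy sketch of the dropout rule above (keep a unit with probability π and rescale by 1/π, drop it otherwise); the keep probability is an arbitrary choice.

    import numpy as np

    def dropout(y, pi=0.5, train=True):
        # replace y by xi * y with Pr(xi = 1/pi) = pi and Pr(xi = 0) = 1 - pi
        if not train:
            return y                  # at test time the units are left unchanged
        xi = (np.random.rand(*y.shape) < pi) / pi
        return xi * y

    y = np.random.randn(5)
    print(dropout(y))                 # ~half the activations zeroed, rest scaled by 1/pi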
Sequences … xt−2, xt−1, xt, xt+1, xt+2 …
Autoregressive model: xt+1 = f(xt, …, xt−τ)
Latent variable model:
  xt+1 = f(xt, …, xt−τ, zt, …, zt−τ)
  zt+1 = g(xt+1, …, xt−τ, zt, …, zt−τ)
Due to the stability condition on gradients, plain recurrent models have trouble retaining information over long time spans.
Latent state has custom update semantics (like a memory cell), Hochreiter & Schmidhuber:
  it = σ(Wi(xt, ht−1) + bi)    input gate
  ft = σ(Wf(xt, ht−1) + bf)    forgetting gate
  ot = σ(Wo(xt, ht−1) + bo)    output gate
  zt = ft ∗ zt−1 + it ∗ tanh(Wz(xt, ht−1) + bz)    memory cell state
  ht = ot ∗ tanh zt    emitted hidden state
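A numpy sketch of one step of the LSTM updates above; stacking (xt, ht−1) into one vector, the layer sizes, and the random initialization are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, z_prev, params):
        # one application of the update equations; each W acts on (x_t, h_{t-1})
        Wi, bi, Wf, bf, Wo, bo, Wz, bz = params
        xh = np.concatenate([x_t, h_prev])
        i_t = sigmoid(Wi @ xh + bi)                        # input gate
        f_t = sigmoid(Wf @ xh + bf)                        # forgetting gate
        o_t = sigmoid(Wo @ xh + bo)                        # output gate
        z_t = f_t * z_prev + i_t * np.tanh(Wz @ xh + bz)   # memory cell state
        h_t = o_t * np.tanh(z_t)                           # emitted hidden state
        return h_t, z_t

    n_in, n_hid = 3, 4
    rng = np.random.default_rng(0)
    params = []
    for _ in range(4):                                     # Wi, Wf, Wo, Wz with biases
        params += [rng.normal(0, 0.1, (n_hid, n_in + n_hid)), np.zeros(n_hid)]
    h, z = np.zeros(n_hid), np.zeros(n_hid)
    for x_t in rng.normal(size=(5, n_in)):                 # run over a toy sequence
        h, z = lstm_step(x_t, h, z, params)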
LSTMs can be used for sequence generation and sequence classification.
Example (Sutskever, Vinyals, Le, NIPS 2014): natural language translation via LSTM, mapping between sequences by encoding the input sequence into the LSTM state and decoding the output sequence from it.
Neural Turing Machines: http://arxiv.org/abs/1410.5401
Memory Networks: http://arxiv.org/abs/1410.3916
Software:
Caffe: efficient for convolutional models / images.
Torch: very efficient, but you must LIKE Lua … Google and Facebook love it.
Theano: compiled from Python; not as efficient as Torch.
Compiler-based layout of execution on machines.
Simpler than Caffe, more efficient.
Minerva, Caffe, CXXNet, …