Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04
Roadmap:
Example: Learning XOR
Gradient-Based Learning
Hidden Units
Architecture Design
(Goodfellow 2017)
Figure 6.1 (left): the original x space, with inputs x1 and x2.
Figure 6.3: the rectified linear activation function, g(z) = \max\{0, z\}.
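The rectifier in Figure 6.3 is a one-liner; a minimal NumPy sketch:

```python
import numpy as np

def relu(z):
    # g(z) = max{0, z}, applied elementwise
    return np.maximum(0, z)

# Negative inputs are clamped to zero; positive inputs pass through
print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
```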
Figure 6.2: the XOR network drawn in two styles: one node per unit (x1, x2, h1, h2, y), and one node per layer vector (x, h, y, with weight matrices W and w).
f(x; W, c, w, b) = w^\top \max\{0,\, W^\top x + c\} + b.   (6.3)

W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},   (6.4)

c = \begin{bmatrix} 0 \\ -1 \end{bmatrix},   (6.5)

w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}.   (6.6)
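The solution in equations (6.3)-(6.6) can be verified directly on the four XOR inputs; a minimal NumPy sketch (with b = 0, as in the text):

```python
import numpy as np

# Parameters from equations (6.4)-(6.6), with b = 0
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    # f(x) = w^T max{0, W^T x + c} + b   (equation 6.3)
    h = np.maximum(0, W.T @ x + c)
    return w @ h + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float)))  # reproduces XOR: 0, 1, 1, 0
```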
Figure 6.1: left, the original x space (x1, x2); right, the learned h space (h1, h2), in which the XOR points become linearly separable.
J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x).   (6.12)
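In practice the expectation in (6.12) is estimated by averaging the negative log-likelihood over a minibatch. A sketch, assuming a softmax output layer for illustration (function names are mine):

```python
import numpy as np

def nll(logits, y):
    # J(theta) = -E_{x,y ~ p_data} log p_model(y | x), estimated by a
    # minibatch average; a softmax output distribution is assumed here.
    z = logits - logits.max(axis=1, keepdims=True)  # shift for stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 0.0],
                   [0.0, 3.0]])
y = np.array([0, 1])
print(nll(logits, y))  # small, since both examples are classified correctly
```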
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Discrete | Multinoulli | Softmax | Discrete cross-entropy
Continuous | Gaussian | Linear | Gaussian cross-entropy (MSE)
Continuous | Mixture of Gaussian | Mixture density | Cross-entropy
Continuous | Arbitrary | See part III: GAN, VAE, FVBN | Various
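The first row of the table (Bernoulli output, sigmoid layer, binary cross-entropy) can be sketched in a numerically stable form via the softplus identity (a sketch; function names are mine):

```python
import numpy as np

def binary_cross_entropy(z, y):
    # -log P(y | x) for a Bernoulli output P(y=1|x) = sigmoid(z),
    # written with softplus for numerical stability:
    #   -log sigmoid(z)     = softplus(-z)
    #   -log(1 - sigmoid(z)) = softplus(z)
    return np.logaddexp(0.0, -z) if y == 1 else np.logaddexp(0.0, z)

print(binary_cross_entropy(2.0, 1))  # low loss: z agrees with y = 1
print(binary_cross_entropy(2.0, 0))  # high loss: z disagrees with y = 0
```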
Figure 6.4: samples (x, y) drawn from a neural network with a mixture density output layer.
Plot: sigmoid output σ(z) with a target of 1, comparing cross-entropy loss against MSE loss as a function of z (z from −3 to 3, σ(z) from 0 to 1).
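The point of the plot, why cross-entropy is preferred over MSE for sigmoid outputs, shows up in the gradients: cross-entropy keeps a strong signal when the unit is confidently wrong, while MSE saturates. A sketch (z is the pre-activation, target y = 1; function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(z):
    # d/dz [-log sigmoid(z)] = sigmoid(z) - 1: near -1 whenever z is very wrong
    return sigmoid(z) - 1.0

def grad_mse(z):
    # d/dz [(sigmoid(z) - 1)^2] = 2 (sigmoid(z) - 1) sigmoid'(z):
    # vanishes when the sigmoid saturates, even for very wrong z
    s = sigmoid(z)
    return 2.0 * (s - 1.0) * s * (1.0 - s)

z = -10.0  # confidently wrong prediction
print(grad_cross_entropy(z))  # close to -1: strong learning signal
print(grad_mse(z))            # close to 0: gradient has saturated
```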
New hidden units that perform comparably are rarely interesting.
Diagram: a feedforward network annotated with its width (units per layer) and its depth (number of layers from x to y).
Universal approximation: a feedforward network with a single hidden layer, given enough hidden units, can represent an approximation of any function to an arbitrary degree of accuracy.
Figure 6.5
Figure 6.6: test accuracy (percent) vs. number of layers (3 to 11); accuracy rises from roughly 92% to about 96.5% as depth increases.
Figure 6.7: test accuracy (percent) vs. number of parameters (up to about 10^8) for three architectures: 3 layers convolutional, 3 layers fully connected, and 11 layers convolutional. Greater depth helps more than simply adding parameters.
\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}.   (6.44)

\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^\top \nabla_y z.   (6.46)
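The vector chain rule (6.46) can be checked numerically on a small composition z = g(y), y = f(x); the particular f and g below are my own illustrative choices:

```python
import numpy as np

# y = f(x) = A x, so the Jacobian dy/dx is simply A
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([0.5, -1.0])
y = A @ x

# z = g(y) = sum(y^2), so grad_y z = 2 y
grad_y_z = 2.0 * y

# Equation (6.46): grad_x z = (dy/dx)^T grad_y z
grad_x_z = A.T @ grad_y_z

# Check against the direct derivative of z = sum((A x)^2), which is 2 A^T A x
print(np.allclose(grad_x_z, 2.0 * A.T @ A @ x))
```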
Diagram: forward propagation flows from x1, x2 through h1, h2 to y, computing activations and then the loss; back-propagation flows in the reverse direction, computing derivatives.
Figure 6.8: example computational graphs.
(a) Multiplication: z = x × y.
(b) Logistic regression: u(1) = dot(x, w); u(2) = u(1) + b; ŷ = σ(u(2)).
(c) ReLU layer: U(1) = matmul(X, W); U(2) = U(1) + b; H = relu(U(2)).
(d) Linear regression and weight decay: ŷ = dot(x, w); u(1) = sqr(w); u(2) = sum(u(1)); u(3) = λ × u(2).
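Graph (b), logistic regression, composes just three operations, dot, +, and σ; a minimal sketch (variable names mirror the graph's nodes):

```python
import numpy as np

def logistic_regression_forward(x, w, b):
    # Graph (b) of Figure 6.8: u1 = dot(x, w); u2 = u1 + b; yhat = sigmoid(u2)
    u1 = np.dot(x, w)
    u2 = u1 + b
    yhat = 1.0 / (1.0 + np.exp(-u2))
    return yhat

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(logistic_regression_forward(x, w, b=0.0))  # sigmoid(0.0) = 0.5
```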
Figure 6.9: a chain of functions x = f(w), y = f(x), z = f(y).

\frac{\partial z}{\partial w}   (6.50)
= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w}   (6.51)
= f'(y) \, f'(x) \, f'(w)   (6.52)
= f'(f(f(w))) \, f'(f(w)) \, f'(w)   (6.53)

Back-prop stores the forward values, so it avoids computing f(w) twice.
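The saving can be made concrete: the stored form (6.52) reuses the forward values x and y, while the naive expansion (6.53) re-evaluates f. A sketch using f = exp for illustration (so f' = exp as well):

```python
import numpy as np

f = np.exp       # example choice of f; its derivative is also exp
fprime = np.exp

w = 0.1
# Forward pass: each value is computed and stored once
x = f(w)   # x = f(w)
y = f(x)   # y = f(f(w))
z = f(y)   # z = f(f(f(w)))

# Backward pass reuses the stored x and y (equation 6.52):
dz_dw = fprime(y) * fprime(x) * fprime(w)

# The naive expansion (6.53) recomputes f(w) and f(f(w)):
naive = fprime(f(f(w))) * fprime(f(w)) * fprime(w)
print(np.isclose(dz_dw, naive))  # same value; back-prop just skips the recompute
```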
Figure 6.10: the chain graph from Figure 6.9 extended with a backward subgraph. Each local derivative is evaluated once, dz/dy = f'(y), dy/dx = f'(x), dx/dw = f'(w), and the results are multiplied along the path: dz/dx = (dz/dy) × (dy/dx), then dz/dw = (dz/dx) × (dx/dw).
Figure 6.11: the computational graph for an MLP training cost with cross-entropy and weight decay: U(1) = matmul(X, W(1)); H = relu(U(1)); U(2) = matmul(H, W(2)); J_MLE = cross_entropy(U(2), y); U(3) = sqr(W(1)), u(4) = sum(U(3)); U(5) = sqr(W(2)), u(6) = sum(U(5)); u(7) = u(4) + u(6); u(8) = λ × u(7); J = J_MLE + u(8).
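A forward-pass sketch of that graph, computing J = J_MLE + λ·(sum of squared weights); shapes and data below are my own illustrative choices:

```python
import numpy as np

def total_cost(X, y, W1, W2, lam):
    # Forward pass of Figure 6.11: H = relu(X W1), logits = H W2
    H = np.maximum(0, X @ W1)
    logits = H @ W2
    # J_MLE: softmax cross-entropy, averaged over examples
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    j_mle = -logp[np.arange(len(y)), y].mean()
    # Weight decay term: lambda * (sum W1^2 + sum W2^2)
    return j_mle + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0, 1, 0, 1])
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
print(total_cost(X, y, W1, W2, lam=0.01))  # a single positive scalar cost
```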
H v = \nabla_x \left[ \left( \nabla_x f(x) \right)^\top v \right].   (6.59)
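Equation (6.59) expresses the Hessian-vector product Hv as the gradient of the scalar (∇f)ᵀv. For a quadratic f(x) = ½ xᵀAx with symmetric A, H = A everywhere, so the identity is easy to check with a finite difference of the gradient (a sketch; the quadratic is my example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric, so the Hessian is A everywhere

def grad_f(x):
    # f(x) = 0.5 x^T A x  =>  grad f(x) = A x
    return A @ x

x = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])

# Equation (6.59): Hv = grad_x [ (grad_x f(x))^T v ], approximated here
# by a finite difference of the gradient along the direction v
eps = 1e-6
Hv = (grad_f(x + eps * v) - grad_f(x)) / eps

print(np.allclose(Hv, A @ v, atol=1e-4))  # H = A for this quadratic
```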