SLIDE 1

Deep Feedforward Networks

Lecture slides for Chapter 6 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow
Last updated 2016-10-04

SLIDE 2

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 3

XOR is not linearly separable

[Figure 6.1, left: the original x space, with axes x1 and x2. No single line separates the two classes of the four XOR points.]

SLIDE 4

Rectified Linear Activation

g(z) = max{0, z}

[Figure 6.3: plot of the rectified linear activation g(z) against z.]

SLIDE 5

Network Diagrams

[Figure 6.2: the same network drawn in two styles — one compact, with whole layers x, h (weights W), and y (weights w) as single nodes; one expanded, showing the individual units x1, x2, h1, h2, and y.]

SLIDE 6

Solving XOR

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b    (6.3)

W = [1 1; 1 1]    (6.4)
c = [0, −1]ᵀ    (6.5)
w = [1, −2]ᵀ    (6.6)

with b = 0.
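A minimal NumPy sketch, not from the slides, that plugs these parameters into equation (6.3) and evaluates it on all four XOR inputs:

```python
import numpy as np

# Parameters from equations (6.4)-(6.6); the book's solution uses b = 0.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    """Equation (6.3): f(x) = w^T max{0, W^T x + c} + b."""
    h = np.maximum(0.0, W.T @ x + c)  # hidden layer: affine map, then ReLU
    return w @ h + b

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, f(np.array(x, dtype=float)))  # prints 0, 1, 1, 0 -- exactly XOR
```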

SLIDE 7

Solving XOR

[Figure 6.1: left, the original x space (axes x1, x2), where the XOR points are not linearly separable; right, the learned h space (axes h1, h2), where the hidden layer maps them onto points a line can separate.]

SLIDE 8

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 9

Gradient-Based Learning

  • Specify a model and a cost function
  • Design the model and cost so that the cost is smooth
  • Minimize the cost using gradient descent or related techniques (a toy sketch follows this list)
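A toy sketch, not from the slides: a linear model with a smooth mean-squared-error cost, minimized by plain gradient descent. All data, names, and constants are invented for illustration.

```python
import numpy as np

# Made-up regression problem with a smooth (MSE) cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the MSE cost
    theta -= lr * grad                            # gradient descent step
print(theta)  # close to theta_true
```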

SLIDE 10

Conditional Distributions and Cross-Entropy

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)    (6.12)
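A sketch of estimating this expectation on a minibatch. The binary model p_model(y = 1 | x) = σ(xᵀθ) is our own choice for concreteness, not something the slide specifies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y, eps=1e-12):
    """Minibatch estimate of J(theta): average negative log-likelihood."""
    p = sigmoid(X @ theta)
    # eps guards against log(0) at saturated predictions
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 0.0])
print(nll(np.array([1.0, 1.0]), X, y))
```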

SLIDE 11

Output Types

Output Type   Output Distribution    Output Layer                    Cost Function
Binary        Bernoulli              Sigmoid                         Binary cross-entropy
Discrete      Multinoulli            Softmax                         Discrete cross-entropy
Continuous    Gaussian               Linear                          Gaussian cross-entropy (MSE)
Continuous    Mixture of Gaussians   Mixture density                 Cross-entropy
Continuous    Arbitrary              See part III: GAN, VAE, FVBN    Various
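For the discrete row, a minimal sketch of a softmax output layer paired with the cross-entropy cost, using the standard max-shift for numerical stability. The function name is ours:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax
    return -log_probs[label]                  # cross-entropy for this label

print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), 0))
```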

SLIDE 12

Mixture Density Outputs

[Figure 6.4: samples of y plotted against x for a network with mixture density outputs; the conditional distribution of y given x is multimodal.]
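A sketch of the cost for a one-dimensional mixture density output: the network emits component logits, means, and log standard deviations, and the cost is the negative log-likelihood. All function and argument names are ours:

```python
import numpy as np

def mdn_nll(logits, means, log_stds, y):
    """Negative log-likelihood of scalar y under a Gaussian mixture
    parameterized by the network's outputs (one entry per component)."""
    z = logits - logits.max()
    log_w = z - np.log(np.exp(z).sum())     # log mixture weights (stable)
    log_comp = (-0.5 * np.log(2 * np.pi) - log_stds
                - 0.5 * ((y - means) / np.exp(log_stds)) ** 2)
    s = log_w + log_comp                    # log (w_i * N_i(y))
    m = s.max()
    return -(m + np.log(np.exp(s - m).sum()))  # stable -log-sum-exp

print(mdn_nll(np.zeros(2), np.array([-1.0, 1.0]), np.zeros(2), 0.9))
```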

SLIDE 13

Don’t mix and match

[Plot: σ(z) over z ∈ [−3, 3] for a sigmoid output unit with target 1, comparing the cross-entropy loss and the MSE loss. With MSE the gradient vanishes wherever the sigmoid saturates, even when the prediction is confidently wrong; cross-entropy keeps a usable gradient there.]
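A small numerical check of the same point, our own illustration: the gradient of each loss with respect to the pre-activation z when the target is y = 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target y = 1. Gradients of each loss w.r.t. z:
#   cross-entropy: d/dz [-log sigma(z)]     = sigma(z) - 1
#   MSE:           d/dz [(sigma(z) - 1)**2] = 2*(sigma(z)-1)*sigma(z)*(1-sigma(z))
for z in [-3.0, 0.0, 3.0]:
    s = sigmoid(z)
    print(z, s - 1.0, 2.0 * (s - 1.0) * s * (1.0 - s))
# At z = -3 (confidently wrong) the cross-entropy gradient is ~ -0.95 while
# the MSE gradient is ~ -0.09: MSE has saturated, cross-entropy has not.
```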

SLIDE 14

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 15

Hidden units

  • Use ReLUs, 90% of the time
  • For RNNs, see Chapter 10
  • For some research projects, get creative
  • Many hidden unit types perform comparably to ReLUs; new ones that merely match them are rarely interesting.

SLIDE 16

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation
SLIDE 17

Architecture Basics

[Diagram: the expanded network from Figure 6.2, with inputs x1, x2, hidden units h1, h2, and output y. Width is the number of units per layer; depth is the number of layers.]

SLIDE 18

Universal Approximator Theorem

  • One hidden layer is enough to represent (not necessarily learn) an approximation of any function to an arbitrary degree of accuracy
  • So why go deeper?
  • A shallow net may need (exponentially) more width
  • A shallow net may overfit more

SLIDE 19

Exponential Representation Advantage of Depth

Figure 6.5

SLIDE 20

Better Generalization with Greater Depth

[Figure 6.6: test accuracy (percent) versus number of layers (3 to 11); accuracy rises from about 92% to above 96% as depth grows.]

SLIDE 21

Large, Shallow Models Overfit More

[Figure 6.7: test accuracy (percent) versus number of parameters (up to about 1×10⁸) for three model families — 3 layers convolutional, 3 layers fully connected, and 11 layers convolutional. At matched parameter counts the deep model generalizes best; the shallow models overfit as parameters grow.]

SLIDE 22

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 23

Back-Propagation

  • Back-propagation is “just the chain rule” of calculus
  • But it’s a particular implementation of the chain rule
  • Uses dynamic programming (table filling)
  • Avoids recomputing repeated subexpressions
  • Speed vs memory tradeoff

dz/dx = (dz/dy)(dy/dx)    (6.44)

∇_x z = (∂y/∂x)ᵀ ∇_y z    (6.46)

(where ∂y/∂x is the Jacobian of y = g(x))
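A sketch of equation (6.46) with explicit Jacobians. The choices y = g(x) = A x and z = f(y) = Σ yᵢ² are ours, picked so the Jacobian is easy to write down:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [1.0, 1.0]])

x = np.array([0.5, -1.0])
y = A @ x                 # y = g(x); the Jacobian dy/dx is just A
grad_y = 2.0 * y          # gradient of z = sum(y**2) w.r.t. y
grad_x = A.T @ grad_y     # (6.46): grad_x z = (dy/dx)^T grad_y z

# Finite-difference check of grad_x
eps = 1e-6
z = lambda x: np.sum((A @ x) ** 2)
num = np.array([(z(x + eps * e) - z(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(grad_x, num)        # the two should agree closely
```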

SLIDE 24

Simple Back-Prop Example

[Diagram: the expanded network again (x1, x2 → h1, h2 → y). Forward prop computes the activations and then the loss; back-prop then computes the derivatives, reusing the stored activations.]
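The same two passes written out by hand for a tiny 2-2-1 ReLU network with a squared-error loss. The architecture and numbers are our own; the slide only names the steps:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
w2, b2 = rng.normal(size=2), 0.0
x, y = np.array([1.0, -1.0]), 0.5

# Forward prop: compute activations, then the loss
a = W1.T @ x + b1            # pre-activations of h1, h2
h = np.maximum(0.0, a)       # ReLU activations
y_hat = w2 @ h + b2
loss = (y_hat - y) ** 2

# Back-prop: compute derivatives, reusing a, h, y_hat from the forward pass
d_yhat = 2.0 * (y_hat - y)
d_w2, d_b2 = d_yhat * h, d_yhat
d_h = d_yhat * w2
d_a = d_h * (a > 0)          # derivative through the ReLU
d_W1, d_b1 = np.outer(x, d_a), d_a
print(loss, d_W1, d_w2)
```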

SLIDE 25

Computation Graphs

[Figure 6.8: examples of computation graphs. (a) Multiplication: z = x × y. (b) Logistic regression: ŷ = σ(xᵀw + b), built from dot, +, and σ nodes. (c) A ReLU layer: H = relu(XW + b), built from matmul, +, and relu nodes. (d) Linear regression with weight decay: ŷ = xᵀw together with the penalty λ Σᵢ wᵢ².]

SLIDE 26

Repeated Subexpressions

Let x = f(w), y = f(x), z = f(y). Then

∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.50–6.51)
      = f′(y) f′(x) f′(w)    (6.52)
      = f′(f(f(w))) f′(f(w)) f′(w)    (6.53)

[Figure 6.9: the graph w → x → y → z, with f applied at each step.] Back-prop stores f(w) and f(f(w)) during the forward pass instead of recomputing them as (6.53) would.
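A sketch of the stored-value strategy. Here f = tanh is an arbitrary stand-in for the generic f on the slide:

```python
import math

f = math.tanh
f_prime = lambda u: 1.0 - math.tanh(u) ** 2

w = 0.3
x = f(w)   # stored during forward prop
y = f(x)   # stored during forward prop
z = f(y)

# (6.52), reading x and y from the table rather than re-evaluating f(w)
# and f(f(w)) as the expanded form (6.53) would
dz_dw = f_prime(y) * f_prime(x) * f_prime(w)
print(z, dz_dw)
```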

SLIDE 27

Symbol-to-Symbol Differentiation

[Figure 6.10: symbol-to-symbol differentiation. The original graph computes z = f(f(f(w))); back-prop extends it with new nodes for dz/dy = f′(y), dy/dx = f′(x), dx/dw = f′(w), and the products giving dz/dx and dz/dw. The derivatives are themselves graph nodes, so they can be evaluated later or differentiated again.]
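A sketch of the same idea with SymPy, where the derivative comes back as an expression rather than a number. As above, tanh stands in for the generic f:

```python
import sympy as sp

w = sp.Symbol('w')
z = sp.tanh(sp.tanh(sp.tanh(w)))  # z = f(f(f(w)))
dz_dw = sp.diff(z, w)             # an expression in w, i.e. new graph nodes
print(dz_dw)
print(sp.diff(dz_dw, w))          # differentiate the derivative graph again
print(dz_dw.subs(w, 0.3))         # or evaluate it at a point
```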

SLIDE 28

Neural Network Loss Function

[Figure 6.11: the computation graph for a typical cost. The hidden layer is H = relu(matmul(X, W⁽¹⁾)); logits matmul(H, W⁽²⁾) are compared to y by cross_entropy to give J_MLE; sqr and sum nodes on W⁽¹⁾ and W⁽²⁾, scaled by λ, add a weight-decay term; the total cost J is the sum of the two.]
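A sketch assembling that cost in NumPy. Shapes, names, and λ are illustrative assumptions:

```python
import numpy as np

def total_cost(X, y, W1, W2, lam):
    H = np.maximum(0.0, X @ W1)                   # H = relu(matmul(X, W1))
    logits = H @ W2                               # matmul(H, W2)
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    j_mle = -log_p[np.arange(len(y)), y].mean()   # cross_entropy node
    decay = (W1 ** 2).sum() + (W2 ** 2).sum()     # sqr + sum nodes
    return j_mle + lam * decay                    # J = J_MLE + lambda * decay

rng = np.random.default_rng(0)
print(total_cost(rng.normal(size=(4, 3)), np.array([0, 1, 2, 0]),
                 rng.normal(size=(3, 5)), rng.normal(size=(5, 3)), 0.01))
```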

SLIDE 29

Hessian-vector Products

Hv = ∇_x [ (∇_x f(x))ᵀ v ]    (6.59)
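Equation (6.59) says Hv is itself a gradient, so it never requires forming H. As a sketch, with only a gradient routine available, a central difference of gradients along v approximates the same quantity; the example f is our own choice:

```python
import numpy as np

def grad_f(x):                     # example: f(x) = sum(x**4)/4, so grad = x**3
    return x ** 3

def hvp(grad_f, x, v, eps=1e-5):
    """Approximate Hv by differencing gradients along v."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
print(hvp(grad_f, x, v))           # exact H = diag(3 x**2), so Hv = [3, 0]
```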

SLIDE 30

Questions