SLIDE 1

Deep Feedforward Networks

Lecture slides for Chapter 6 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow
Last updated 2016-10-04

SLIDE 2

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 3

XOR is not linearly separable

[Figure 6.1, left: the original x space, with axes x1 and x2. No single line separates the two classes of the four XOR points.]

SLIDE 4

Rectified Linear Activation

g(z) = max{0, z}

[Figure 6.3: plot of the rectified linear activation g(z) against z.]

SLIDE 5

Network Diagrams

[Figure 6.2: the same network drawn in two styles — one compact, with whole layers x, h (weights W), and y (weights w) as single nodes; one expanded, showing the individual units x1, x2, h1, h2, and y.]

SLIDE 6

Solving XOR

f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b    (6.3)

W = [1 1; 1 1]    (6.4)
c = [0, −1]ᵀ    (6.5)
w = [1, −2]ᵀ    (6.6)

with b = 0.
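A minimal NumPy sketch, not from the slides, that plugs these parameters into equation (6.3) and evaluates it on all four XOR inputs:

```python
import numpy as np

# Parameters from equations (6.4)-(6.6); the book's solution uses b = 0.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    """Equation (6.3): f(x) = w^T max{0, W^T x + c} + b."""
    h = np.maximum(0.0, W.T @ x + c)  # hidden layer: affine map, then ReLU
    return w @ h + b

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, f(np.array(x, dtype=float)))  # prints 0, 1, 1, 0 -- exactly XOR
```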

SLIDE 7

Solving XOR

[Figure 6.1: left, the original x space (axes x1, x2), where the XOR points are not linearly separable; right, the learned h space (axes h1, h2), where the hidden layer maps them onto points a line can separate.]

SLIDE 8

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 9

Gradient-Based Learning

  • Specify a model and a cost function
  • Design the model and cost so that the cost is smooth
  • Minimize the cost using gradient descent or related techniques (a toy sketch follows this list)
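A toy sketch, not from the slides: a linear model with a smooth mean-squared-error cost, minimized by plain gradient descent. All data, names, and constants are invented for illustration.

```python
import numpy as np

# Made-up regression problem with a smooth (MSE) cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the MSE cost
    theta -= lr * grad                            # gradient descent step
print(theta)  # close to theta_true
```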

SLIDE 10

Conditional Distributions and Cross-Entropy

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)    (6.12)
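A sketch of estimating this expectation on a minibatch. The binary model p_model(y = 1 | x) = σ(xᵀθ) is our own choice for concreteness, not something the slide specifies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y, eps=1e-12):
    """Minibatch estimate of J(theta): average negative log-likelihood."""
    p = sigmoid(X @ theta)
    # eps guards against log(0) at saturated predictions
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 0.0])
print(nll(np.array([1.0, 1.0]), X, y))
```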

SLIDE 11

Output Types

Output Type   Output Distribution    Output Layer                    Cost Function
Binary        Bernoulli              Sigmoid                         Binary cross-entropy
Discrete      Multinoulli            Softmax                         Discrete cross-entropy
Continuous    Gaussian               Linear                          Gaussian cross-entropy (MSE)
Continuous    Mixture of Gaussians   Mixture density                 Cross-entropy
Continuous    Arbitrary              See part III: GAN, VAE, FVBN    Various
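For the discrete row, a minimal sketch of a softmax output layer paired with the cross-entropy cost, using the standard max-shift for numerical stability. The function name is ours:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax
    return -log_probs[label]                  # cross-entropy for this label

print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), 0))
```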

SLIDE 12

Mixture Density Outputs

[Figure 6.4: samples of y plotted against x for a network with mixture density outputs; the conditional distribution of y given x is multimodal.]
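A sketch of the cost for a one-dimensional mixture density output: the network emits component logits, means, and log standard deviations, and the cost is the negative log-likelihood. All function and argument names are ours:

```python
import numpy as np

def mdn_nll(logits, means, log_stds, y):
    """Negative log-likelihood of scalar y under a Gaussian mixture
    parameterized by the network's outputs (one entry per component)."""
    z = logits - logits.max()
    log_w = z - np.log(np.exp(z).sum())     # log mixture weights (stable)
    log_comp = (-0.5 * np.log(2 * np.pi) - log_stds
                - 0.5 * ((y - means) / np.exp(log_stds)) ** 2)
    s = log_w + log_comp                    # log (w_i * N_i(y))
    m = s.max()
    return -(m + np.log(np.exp(s - m).sum()))  # stable -log-sum-exp

print(mdn_nll(np.zeros(2), np.array([-1.0, 1.0]), np.zeros(2), 0.9))
```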

SLIDE 13

Don’t mix and match

[Plot: σ(z) over z ∈ [−3, 3] for a sigmoid output unit with target 1, comparing the cross-entropy loss and the MSE loss. With MSE the gradient vanishes wherever the sigmoid saturates, even when the prediction is confidently wrong; cross-entropy keeps a usable gradient there.]
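A small numerical check of the same point, our own illustration: the gradient of each loss with respect to the pre-activation z when the target is y = 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target y = 1. Gradients of each loss w.r.t. z:
#   cross-entropy: d/dz [-log sigma(z)]     = sigma(z) - 1
#   MSE:           d/dz [(sigma(z) - 1)**2] = 2*(sigma(z)-1)*sigma(z)*(1-sigma(z))
for z in [-3.0, 0.0, 3.0]:
    s = sigmoid(z)
    print(z, s - 1.0, 2.0 * (s - 1.0) * s * (1.0 - s))
# At z = -3 (confidently wrong) the cross-entropy gradient is ~ -0.95 while
# the MSE gradient is ~ -0.09: MSE has saturated, cross-entropy has not.
```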

SLIDE 14

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 15

Hidden units

  • Use ReLUs, 90% of the time
  • For RNNs, see Chapter 10
  • For some research projects, get creative
  • Many hidden unit types perform comparably to ReLUs; new ones that merely match them are rarely interesting.

SLIDE 16

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation
SLIDE 17

Architecture Basics

[Diagram: the expanded network from Figure 6.2, with inputs x1, x2, hidden units h1, h2, and output y. Width is the number of units per layer; depth is the number of layers.]

SLIDE 18

Universal Approximator Theorem

  • One hidden layer is enough to represent (not necessarily learn) an approximation of any function to an arbitrary degree of accuracy
  • So why go deeper?
  • A shallow net may need (exponentially) more width
  • A shallow net may overfit more

SLIDE 19

Exponential Representation Advantage of Depth

Figure 6.5

SLIDE 20

Better Generalization with Greater Depth

[Figure 6.6: test accuracy (percent) versus number of layers (3 to 11); accuracy rises from about 92% to above 96% as depth grows.]

SLIDE 21

Large, Shallow Models Overfit More

[Figure 6.7: test accuracy (percent) versus number of parameters (up to about 1×10⁸) for three model families — 3 layers convolutional, 3 layers fully connected, and 11 layers convolutional. At matched parameter counts the deep model generalizes best; the shallow models overfit as parameters grow.]

SLIDE 22

Roadmap

  • Example: Learning XOR
  • Gradient-Based Learning
  • Hidden Units
  • Architecture Design
  • Back-Propagation

SLIDE 23

Back-Propagation

  • Back-propagation is “just the chain rule” of calculus
  • But it’s a particular implementation of the chain rule
  • Uses dynamic programming (table filling)
  • Avoids recomputing repeated subexpressions
  • Speed vs memory tradeoff

dz/dx = (dz/dy)(dy/dx)    (6.44)

∇_x z = (∂y/∂x)ᵀ ∇_y z    (6.46)

(where ∂y/∂x is the Jacobian of y = g(x))
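A sketch of equation (6.46) with explicit Jacobians. The choices y = g(x) = A x and z = f(y) = Σ yᵢ² are ours, picked so the Jacobian is easy to write down:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [1.0, 1.0]])

x = np.array([0.5, -1.0])
y = A @ x                 # y = g(x); the Jacobian dy/dx is just A
grad_y = 2.0 * y          # gradient of z = sum(y**2) w.r.t. y
grad_x = A.T @ grad_y     # (6.46): grad_x z = (dy/dx)^T grad_y z

# Finite-difference check of grad_x
eps = 1e-6
z = lambda x: np.sum((A @ x) ** 2)
num = np.array([(z(x + eps * e) - z(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])
print(grad_x, num)        # the two should agree closely
```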

SLIDE 24

Simple Back-Prop Example

[Diagram: the expanded network again (x1, x2 → h1, h2 → y). Forward prop computes the activations and then the loss; back-prop then computes the derivatives, reusing the stored activations.]
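The same two passes written out by hand for a tiny 2-2-1 ReLU network with a squared-error loss. The architecture and numbers are our own; the slide only names the steps:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
w2, b2 = rng.normal(size=2), 0.0
x, y = np.array([1.0, -1.0]), 0.5

# Forward prop: compute activations, then the loss
a = W1.T @ x + b1            # pre-activations of h1, h2
h = np.maximum(0.0, a)       # ReLU activations
y_hat = w2 @ h + b2
loss = (y_hat - y) ** 2

# Back-prop: compute derivatives, reusing a, h, y_hat from the forward pass
d_yhat = 2.0 * (y_hat - y)
d_w2, d_b2 = d_yhat * h, d_yhat
d_h = d_yhat * w2
d_a = d_h * (a > 0)          # derivative through the ReLU
d_W1, d_b1 = np.outer(x, d_a), d_a
print(loss, d_W1, d_w2)
```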

SLIDE 25

Computation Graphs

[Figure 6.8: examples of computation graphs. (a) Multiplication: z = x × y. (b) Logistic regression: ŷ = σ(xᵀw + b), built from dot, +, and σ nodes. (c) A ReLU layer: H = relu(XW + b), built from matmul, +, and relu nodes. (d) Linear regression with weight decay: ŷ = xᵀw together with the penalty λ Σᵢ wᵢ².]

SLIDE 26

Repeated Subexpressions

Let x = f(w), y = f(x), z = f(y). Then

∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.50–6.51)
      = f′(y) f′(x) f′(w)    (6.52)
      = f′(f(f(w))) f′(f(w)) f′(w)    (6.53)

[Figure 6.9: the graph w → x → y → z, with f applied at each step.] Back-prop stores f(w) and f(f(w)) during the forward pass instead of recomputing them as (6.53) would.
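A sketch of the stored-value strategy. Here f = tanh is an arbitrary stand-in for the generic f on the slide:

```python
import math

f = math.tanh
f_prime = lambda u: 1.0 - math.tanh(u) ** 2

w = 0.3
x = f(w)   # stored during forward prop
y = f(x)   # stored during forward prop
z = f(y)

# (6.52), reading x and y from the table rather than re-evaluating f(w)
# and f(f(w)) as the expanded form (6.53) would
dz_dw = f_prime(y) * f_prime(x) * f_prime(w)
print(z, dz_dw)
```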

SLIDE 27

Symbol-to-Symbol Differentiation

[Figure 6.10: symbol-to-symbol differentiation. The original graph computes z = f(f(f(w))); back-prop extends it with new nodes for dz/dy = f′(y), dy/dx = f′(x), dx/dw = f′(w), and the products giving dz/dx and dz/dw. The derivatives are themselves graph nodes, so they can be evaluated later or differentiated again.]
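A sketch of the same idea with SymPy, where the derivative comes back as an expression rather than a number. As above, tanh stands in for the generic f:

```python
import sympy as sp

w = sp.Symbol('w')
z = sp.tanh(sp.tanh(sp.tanh(w)))  # z = f(f(f(w)))
dz_dw = sp.diff(z, w)             # an expression in w, i.e. new graph nodes
print(dz_dw)
print(sp.diff(dz_dw, w))          # differentiate the derivative graph again
print(dz_dw.subs(w, 0.3))         # or evaluate it at a point
```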

SLIDE 28

Neural Network Loss Function

[Figure 6.11: the computation graph for a typical cost. The hidden layer is H = relu(matmul(X, W⁽¹⁾)); logits matmul(H, W⁽²⁾) are compared to y by cross_entropy to give J_MLE; sqr and sum nodes on W⁽¹⁾ and W⁽²⁾, scaled by λ, add a weight-decay term; the total cost J is the sum of the two.]
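A sketch assembling that cost in NumPy. Shapes, names, and λ are illustrative assumptions:

```python
import numpy as np

def total_cost(X, y, W1, W2, lam):
    H = np.maximum(0.0, X @ W1)                   # H = relu(matmul(X, W1))
    logits = H @ W2                               # matmul(H, W2)
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    j_mle = -log_p[np.arange(len(y)), y].mean()   # cross_entropy node
    decay = (W1 ** 2).sum() + (W2 ** 2).sum()     # sqr + sum nodes
    return j_mle + lam * decay                    # J = J_MLE + lambda * decay

rng = np.random.default_rng(0)
print(total_cost(rng.normal(size=(4, 3)), np.array([0, 1, 2, 0]),
                 rng.normal(size=(3, 5)), rng.normal(size=(5, 3)), 0.01))
```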

SLIDE 29

Hessian-vector Products

Hv = ∇_x [ (∇_x f(x))ᵀ v ]    (6.59)
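Equation (6.59) says Hv is itself a gradient, so it never requires forming H. As a sketch, with only a gradient routine available, a central difference of gradients along v approximates the same quantity; the example f is our own choice:

```python
import numpy as np

def grad_f(x):                     # example: f(x) = sum(x**4)/4, so grad = x**3
    return x ** 3

def hvp(grad_f, x, v, eps=1e-5):
    """Approximate Hv by differencing gradients along v."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
print(hvp(grad_f, x, v))           # exact H = diag(3 x**2), so Hv = [3, 0]
```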

SLIDE 30

Questions