SLIDE 1

Neural Network Optimization 1

CS 519: Deep Learning, Winter 2018 Fuxin Li

With materials from Zsolt Kira

SLIDE 2

Backpropagation learning of a network

  • The algorithm
  • 1. Compute a forward pass on the compute graph (DAG) from the input to all

the outputs

  • 2. Backpropagate all the outputs back all the way to the input and collect all

gradients

  • 3.

for all the weights in all layers
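A minimal NumPy sketch of these three steps for a toy two-layer network with a squared-error loss; the network, the loss, and names such as W1, W2, and train_step are illustrative assumptions, not the course code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, lr=0.1):
    # 1. Forward pass from the input to the output
    h = sigmoid(x @ W1)                   # hidden activations
    y_hat = h @ W2                        # network output
    # 2. Backpropagate the output error and collect all gradients
    d_out = 2.0 * (y_hat - y)             # d(squared error)/d(y_hat)
    dW2 = np.outer(h, d_out)              # gradient for the output layer
    d_h = (W2 @ d_out) * h * (1.0 - h)    # backprop through the sigmoid
    dW1 = np.outer(x, d_h)                # gradient for the first layer
    # 3. Gradient step on all the weights in all layers
    W1 -= lr * dW1
    W2 -= lr * dW2
    return W1, W2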

SLIDE 3

Modules (Layers)

  • Each layer can be seen as a module
  • Given input, return
  • Output
  • Network gradient
  • Gradient of module parameters
  • During backprop, propagate/update
  • Backpropagated gradient
  • 𝑿=
  • πœ–πΉ

πœ–π‘”

  • where
SLIDE 4

The abundance of online layers

SLIDE 5

Learning Rates

  • Gradient descent is only guaranteed to converge with small enough

learning rates

  • So that’s a sign you should decrease your learning rate if it explodes
  • Example:
  • Learning rate of
SLIDE 6

Weight decay regularization

  • Instead of using a normal step, add a
  • This corresponds to:
  • Early stopping as well!
  • Help generalization

min

𝐗

1 𝑂 π‘š(𝑔 𝑦; 𝐗 , 𝑧)

  • + 1

2 πœ‡ 𝐗

SLIDE 7

Momentum

  • Basic updating equation (with momentum):
  • , a lot of β€œinertia” in optimization
  • Check the previous example with a momentum of 0.5
SLIDE 8

Normalization

  • Each component to 0 mean, 1

standard deviation

  • For ease of L2 regularization +
  • ptimization convergence rates

[Figure: two panels over weights w1 and w2; color indicates training case. One panel uses training inputs (101, 101) and (101, 99) versus (1, 1) and (1, -1); the other uses (1, 1) and (1, -1) versus (0.1, 10) and (0.1, -10).]
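A minimal sketch of this standardization, assuming NumPy arrays with one example per row; the training-set statistics are reused for the test split, and the names are illustrative.

import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    mean = X_train.mean(axis=0)             # per-component mean
    std = X_train.std(axis=0)               # per-component standard deviation
    X_train = (X_train - mean) / (std + eps)
    X_test = (X_test - mean) / (std + eps)  # reuse the training statistics
    return X_train, X_test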

SLIDE 9

Computing the energy function and gradient

  • Usual ERM energy function
  • One problem:
  • Very slow to compute when

is large

  • One gradient step takes long time!
  • Approximate?

min

𝐹 𝑔 = 𝑀(𝑔 𝑦; 𝑋 , 𝑧)

  • 𝛼𝐹 = πœ–π‘€(𝑔 𝑦; 𝑋 , 𝑧)

πœ–π‘‹

SLIDE 10

Stochastic Mini-batch Approximation

  • Ensure the expectation is the same
  • Uniformly sample every time
  • Sample how many? 1 (SGD) – 256 (Mini-batch SGD)
  • Common mini-batch size is 32-256
  • In practice: dependent on GPU memory size

$$\nabla F = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial L\big(f(x_i; \mathbf{W}),\, y_i\big)}{\partial \mathbf{W}} \;\approx\; \widetilde{\nabla F} = \frac{1}{|B|}\sum_{i \in B} \frac{\partial L\big(f(x_i; \mathbf{W}),\, y_i\big)}{\partial \mathbf{W}}, \qquad \mathbb{E}\big[\widetilde{\nabla F}\big] = \nabla F$$

where the mini-batch $B$ is sampled uniformly from the training set.
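A minimal sketch of the mini-batch estimate, assuming NumPy and the same hypothetical per_example_grad helper as above; because the indices are drawn uniformly, the estimate's expectation equals the full gradient.

import numpy as np

def minibatch_gradient(W, X, Y, per_example_grad, batch_size=128):
    # sample the mini-batch uniformly, then average only over it
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    grads = [per_example_grad(W, X[i], Y[i]) for i in idx]
    return sum(grads) / batch_size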

SLIDE 11

In Practice

  • Randomly re-arrange the input examples
  • Use a fixed order on the input examples
  • Define an iteration to be every time the gradient is computed
  • An epoch to be every time that all the input examples is looped

through once

[Diagram: the data divided into successive mini-batch iterations; one pass over all of them is an epoch.]
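A hedged sketch of this epoch/iteration structure, assuming NumPy: the examples are re-arranged once per epoch and then visited in that fixed order, and step_fn is an assumed function performing one gradient computation and update.

import numpy as np

def train(W, X, Y, step_fn, batch_size=128, num_epochs=10):
    N = len(X)
    for epoch in range(num_epochs):             # an epoch: one full pass over the data
        order = np.random.permutation(N)        # re-arrange the input examples
        for start in range(0, N, batch_size):   # each mini-batch: one iteration
            idx = order[start:start + batch_size]
            W = step_fn(W, X[idx], Y[idx])      # one gradient computation and update
    return W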

SLIDE 12

A practical run of training a neural network

  • Check:
  • Energy
  • Training error
  • Validation error
SLIDE 13

Data Augmentation

  • Create artificial data to increase the size of the dataset
  • Example: Elastic deformations
  • n MNIST
SLIDE 14

Data Augmentation

[Figure: a 256x256 training image, several 224x224 crops, and a horizontal flip.]

SLIDE 15
  • One of the easiest ways to prevent overfitting is to augment the dataset

256x256 224x224 224x224 224x224 224x224 224x224 224x224

Horizontal Flip Training Image

Data Augmentation
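A minimal sketch of the crop-and-flip augmentation pictured above, assuming NumPy image arrays: take a random 224x224 crop from a 256x256 image and flip it horizontally half of the time (the function name is illustrative).

import numpy as np

def random_crop_and_flip(image, crop=224):
    h, w = image.shape[:2]                       # e.g. a 256x256 training image
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]                   # horizontal flip
    return patch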

SLIDE 16

CIFAR-10 dataset

  • 60,000 images in 10 classes
  • 50,000 training
  • 10,000 test
  • Designed to mimic MNIST
  • 32x32
  • Assignment (will post on

Canvas with more explicity):

  • Write your own

backpropagation NN to test

  • n CIFAR-10