

slide-1
SLIDE 1

About me :

  • ENS -> MVA -> FAIR (engineer) -> FAIR (PhD 3rd year)

About my PhD :

  • Interested in sign matrices and tensors (graphs / multi-graphs)
  • Observe a few entries, predict the remaining edges
  • Factorization methods are simple and efficient for these problems
  • When do we need to go beyond factorization methods ?
  • How ?

Presentation

slide-2
SLIDE 2
  • Artificial Neuron
  • Why go deep ?
  • Neural Networks

Introduction to Neural Networks

slide-3
SLIDE 3

Neural Net

Poodle

Examples of Inputs / Outputs

Image In, Class out

slide-4
SLIDE 4

Neural Net

Examples of Inputs / Outputs

Image In, Text out

a person on skis jumping down part of a hill

slide-5
SLIDE 5

Neural Net

Examples of Inputs / Outputs

Text In, Text out

a person on skis jumping down part of a hill → Un skieur sautant au dessus d’un talus (“A skier jumping over a bank”)

slide-6
SLIDE 6

Neural Net

From NVIDIA’s paper at ICLR2018

Examples of Inputs / Outputs

Noise In, Face out ?!

slide-7
SLIDE 7

Neural Net

IN OUT Neural Networks

One model to rule them all

Hidden Layers Input Layer Output Layer

slide-8
SLIDE 8

Neural Net

IN OUT Neural Networks

One model to rule them all

slide-9
SLIDE 9

Neural Net

IN OUT Neural Networks

One model to rule them all

slide-10
SLIDE 10

Neural Net

Artificial Neuron

The good ol’ Perceptron

Inputs x1 . . . xd, weights a1 . . . ad, bias b

y = σ(a1x1 + ... + adxd + b) = σ(⟨a, x⟩ + b)

σ: Activation function (Sigmoid, ReLU)
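In code, the neuron above is a dot product, a bias, and an activation; a minimal numpy sketch (the particular weights and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, a, b):
    # y = sigma(<a, x> + b) : weighted sum of the inputs, plus bias, then activation
    return sigmoid(np.dot(a, x) + b)

x = np.array([1.0, 2.0])   # inputs x1, x2
a = np.array([0.5, -0.5])  # weights a1, a2
b = 0.5                    # bias

y = neuron(x, a, b)  # here <a, x> + b = 0, so y = sigmoid(0) = 0.5
```

With these numbers the pre-activation is exactly 0, so the sigmoid sits right on its decision threshold of 0.5.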

slide-11
SLIDE 11

Neural Net

Artificial Neuron

a x1

The good ol’ Perceptron

slide-12
SLIDE 12

Neural Net

Artificial Neuron

⟨a, x1⟩ a x1

The good ol’ Perceptron

slide-13
SLIDE 13

Neural Net

Artificial Neuron

a x2 ⟨a, x2⟩

The good ol’ Perceptron

slide-14
SLIDE 14

Neural Net

Artificial Neuron

a x2 x1 b = 0

y = σ(⟨a, x⟩ + b)

Active Inactive

The good ol’ Perceptron

slide-15
SLIDE 15

Neural Net

Artificial Neuron

a x2 x1

b

y = σ(⟨a, x⟩ + b)

The good ol’ Perceptron

slide-16
SLIDE 16

Neural Net

Why deep learning ?

We need to go deeper

(figure : data in the (x1, x2) plane)

slide-17
SLIDE 17

Neural Net

(figure : data in the (x1, x2) plane, with a first hidden unit y1¹)

Why deep learning ?

We need to go deeper

slide-18
SLIDE 18

Neural Net

(figure : data in the (x1, x2) plane, with two hidden units y1¹ and y1²)

Why deep learning ?

We need to go deeper

slide-19
SLIDE 19

Neural Net

(figure : the same data in the (x1, x2) plane and replotted in the (y1¹, y1²) plane)

Why deep learning ?

We need to go deeper

slide-20
SLIDE 20

Neural Net

(figure : the data in the (y1¹, y1²) plane, now separable by an output neuron y2)

Why deep learning ?

We need to go deeper

slide-21
SLIDE 21

Neural Net

IN OUT Hidden Layer

And in the darkness bind them

slide-22
SLIDE 22

Neural Net

x1 x2 x3 x4 y2 y3 y4

y1

Hidden Layer

Set of neurons with shared inputs

slide-23
SLIDE 23

Neural Net

y1 = σ(⟨a1, x⟩ + b1)
y2 = σ(⟨a2, x⟩ + b2)
y3 = σ(⟨a3, x⟩ + b3)
y4 = σ(⟨a4, x⟩ + b4)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-24
SLIDE 24

Neural Net

y1 = σ(⟨a1, x⟩ + b1)
y2 = σ(⟨a2, x⟩ + b2)
y3 = σ(⟨a3, x⟩ + b3)
y4 = σ(⟨a4, x⟩ + b4)

A b

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-25
SLIDE 25

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-26
SLIDE 26

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-27
SLIDE 27

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

> 100x speed-up
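The speed-up comes from computing the whole layer at once: all the neurons share the input x, so the layer is a single matrix product σ(Ax + b) instead of a loop over neurons. A numpy sketch checking that both give the same answer (the layer sizes here are made up):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # one row of weights per neuron
b = rng.standard_normal(4)       # one bias per neuron
x = rng.standard_normal(3)       # shared input

# Neuron by neuron : y_i = sigma(<a_i, x> + b_i)
y_loop = np.array([sigma(A[i] @ x + b[i]) for i in range(4)])

# Whole layer at once : y = sigma(A x + b)
y_vec = sigma(A @ x + b)

assert np.allclose(y_loop, y_vec)
```

On real hardware the vectorized form runs in optimized BLAS (or on a GPU), which is where speed-ups of this magnitude come from.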

slide-28
SLIDE 28

Neural Net

IN OUT Forward Pass

Hmm what’s this picture …

slide-29
SLIDE 29

Neural Net

IN OUT Forward Pass

Wait for it …

slide-30
SLIDE 30

Neural Net

IN OUT Forward Pass

slide-31
SLIDE 31

Neural Net

IN OUT Forward Pass

slide-32
SLIDE 32

Neural Net

IN OUT Forward Pass

slide-33
SLIDE 33

Neural Net

IN OUT Forward Pass

slide-34
SLIDE 34

Neural Net

IN OUT Forward Pass

slide-35
SLIDE 35

Neural Net

IN OUT Forward Pass

slide-36
SLIDE 36

Neural Net

IN Poodle Forward Pass

It’s a Poodle !

slide-37
SLIDE 37

Neural Net

Recap

The story so far

  • Artificial Neuron
  • Neural Network
slide-38
SLIDE 38

Neural Net

Recap

The story so far

  • Artificial Neuron -> Thresholds its input, one is not enough
  • Neural Network
slide-39
SLIDE 39

Neural Net

Recap

The story so far

  • Artificial Neuron -> Thresholds its input, one is not enough
  • Neural Network -> Computes its output with the Forward pass
slide-40
SLIDE 40
  • Learning from data
  • Stochastic Gradient Descent
  • Back Propagation

Learning

“It’s all downhill from here”

slide-41
SLIDE 41

Learning

Learning From Data

Step 1 – Get Data

A dataset is a set of (input, output) pairs

Poodle Shiba Inu Not Dog Shiba Inu

slide-42
SLIDE 42

Learning

Learning From Data

A dataset is a set of (input, output) pairs, lots of them

Step 2 – Get More

slide-43
SLIDE 43

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function

slide-44
SLIDE 44

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction

slide-45
SLIDE 45

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label)

slide-46
SLIDE 46

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label) Poodle

slide-47
SLIDE 47

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label) Poodle

The loss defines how much you pay for answering “Shiba” instead of “Poodle”

slide-48
SLIDE 48

Loss Function

Some examples

Classification

  • Goal : assign class +1 / -1
  • Loss : based on sign agreement yf(x)

(plot : the loss is 0 when yf(x) > 0 and 1 when yf(x) < 0)

Learning

slide-49
SLIDE 49

Loss Function

Some examples

Loss : Application-dependent measure of error

Learning

Regression

  • Goal : learn continuous value y
  • Loss : based on distance (f(x) − y)²

(plot : the squared loss (f(x) − y)² against the error f(x) − y)
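Both example losses fit in one line each; a sketch (zero_one_loss is the sign-agreement loss from the classification slide, squared_loss the regression one):

```python
def zero_one_loss(fx, y):
    # Classification : pay 1 whenever the signs of f(x) and y disagree
    return 0.0 if y * fx > 0 else 1.0

def squared_loss(fx, y):
    # Regression : pay the squared distance to the target
    return (fx - y) ** 2

assert zero_one_loss(2.3, +1) == 0.0   # correct sign : no cost
assert zero_one_loss(-0.1, +1) == 1.0  # wrong sign : unit cost
assert squared_loss(3.0, 1.0) == 4.0
```

In practice the 0–1 loss is usually replaced by a smooth surrogate (hinge, logistic) so that gradients exist everywhere.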

slide-50
SLIDE 50

Training Loss

Let’s get to minimizing

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Learning

slide-51
SLIDE 51

Training Loss

Let’s get to minimizing

Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Learning

slide-52
SLIDE 52

Training Loss

Let’s get to minimizing

Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-53
SLIDE 53

Training Loss

Let’s get to minimizing

Training example Loss function Example label Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-54
SLIDE 54

Training Loss

Let’s get to minimizing

Training example Loss function Example label Model Prediction Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-55
SLIDE 55

Training Loss

Let’s get to minimizing

Loss function poodle Model Parameters Average over the training set

Learning

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

slide-56
SLIDE 56

Training Loss

Let’s get to minimizing

Find the parameters that minimize the average loss on the training set

Learning

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)
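The objective above is just the per-example loss averaged over the training set; a sketch for a toy linear model f(x; θ) = θx with the squared loss (the three training pairs are made up):

```python
def training_loss(theta, data):
    # (1/n) * sum_i loss(f(xi; theta), yi), with f(x; theta) = theta * x
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x exactly
assert training_loss(2.0, data) == 0.0       # the true parameter pays nothing
assert training_loss(0.0, data) > 0.0
```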

slide-57
SLIDE 57

Learning

An iterative process

Step 1 – be random

Your neural net starts with random parameters θ0

(figure : data points and the random decision boundary given by θ0)
slide-58
SLIDE 58

Learning

Your neural net starts with random parameters θ0

(figure : data points and the decision boundary given by θ0)

(1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

An iterative process

Step 1 – be random

slide-59
SLIDE 59

Learning

Your neural net starts with random parameters θ0

(figure : the decision boundary for θ0; plot of training loss vs. iteration)

An iterative process

Step 1 – be random

slide-60
SLIDE 60

Learning

Then you’ll find θ1 that performs a bit better

(figure : the decision boundary for θ1; training loss vs. iteration, decreasing)

An iterative process

Step 2 – be less random

slide-61
SLIDE 61

Learning

Then you’ll find θ2 that performs even better

(figure : the decision boundary for θ2; training loss vs. iteration)

An iterative process

Step 3 – be even less random

slide-62
SLIDE 62

Learning

And finally θ3

(figure : the decision boundary for θ3; training loss vs. iteration)

An iterative process

Step 4 – profit

slide-63
SLIDE 63

Learning From Data

Avoiding Overfitting

This classifier has zero error

Learning

slide-64
SLIDE 64

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ?

Learning

slide-65
SLIDE 65

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ?

Learning

?

slide-66
SLIDE 66

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ? Solution : Training and Validation Set

(figure : the data points split into a training set and a validation set)

Learning
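The split itself boils down to shuffling the dataset once and holding out a fraction; a sketch (the 20% validation fraction is a common but arbitrary choice):

```python
import random

def split(dataset, val_fraction=0.2, seed=0):
    # Shuffle once, then carve off the head as the validation set
    examples = list(dataset)
    random.Random(seed).shuffle(examples)
    n_val = int(len(examples) * val_fraction)
    return examples[n_val:], examples[:n_val]  # (train, validation)

train, val = split(range(100))
assert len(train) == 80 and len(val) == 20
assert set(train) | set(val) == set(range(100))  # nothing lost or duplicated
```

Models are fit on the training set; hyper-parameters and model choice are judged on the validation set, which the training procedure never sees.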

slide-67
SLIDE 67

Gradient Descent

Back to minimizing

Training example Loss function Example label Model Prediction Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-68
SLIDE 68

Gradient Descent

Back to minimizing

f(x)

Skier’s approach to minimization : Steepest Descent

Learning

slide-69
SLIDE 69

Gradient Descent

Back to minimizing

Skier’s approach to minimization : Steepest Descent

∇x f(x) f(x)

“Gradient in x of f(x)”

Learning

slide-70
SLIDE 70

Gradient Descent

Back to minimizing

Learning

slide-71
SLIDE 71

Gradient Descent

Back to minimizing

Learning

slide-72
SLIDE 72

Gradient Descent

Back to minimizing

Learning

slide-73
SLIDE 73

Gradient Descent

Back to minimizing

Learning

slide-74
SLIDE 74

Gradient Descent

Back to minimizing

Learning

slide-75
SLIDE 75

Gradient Descent

Back to minimizing

Learning

slide-76
SLIDE 76

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large
  • Complicated function (neural net)

Learning

slide-77
SLIDE 77

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large -> Stochastic Gradient Descent
  • Complicated function (neural net) -> BackProp

Learning

slide-78
SLIDE 78

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large -> Stochastic Gradient Descent

Learning

slide-79
SLIDE 79

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

One function

Learning

slide-80
SLIDE 80

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average

Learning

slide-81
SLIDE 81

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average = The Average of the Gradients

Learning

slide-82
SLIDE 82

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average = The Average of the Gradients

Learning

In expectation, for uniform j

slide-83
SLIDE 83

Stochastic Gradient Descent

Killing n

For some number of iterations :

  • Pick some random example (xj, yj)
  • Gradient Step : θn+1 ← θn − η ∇θ ℓ(f(xj; θn), yj)    (η : Learning-rate)

Learning
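Those two steps can be sketched end-to-end for the toy model f(x; θ) = θx with the squared loss, whose gradient ∇θ ℓ = 2(θx − y)x is easy to derive by hand (the data and learning rate here are made up):

```python
import random

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # ground truth : theta = 2
theta = 0.0                                      # step 1 - be random
eta = 0.05                                       # learning rate
rng = random.Random(0)

for _ in range(1000):
    x, y = rng.choice(data)             # pick some random example (xj, yj)
    grad = 2.0 * (theta * x - y) * x    # gradient of (theta*x - y)^2 in theta
    theta = theta - eta * grad          # gradient step
```

After enough iterations theta has converged to the true value 2; each step only ever looks at a single example.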

slide-84
SLIDE 84

Back Propagation

Computing the gradient

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

Learning

Complicated function (neural net) -> BackProp

slide-85
SLIDE 85

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

fi(x) = σ(Aix + bi)

Hidden Layer i

slide-86
SLIDE 86

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

f(x) = fh(fh−1(fh−2(. . .))) = (fh ∘ fh−1 ∘ . . . ∘ f1)(x), with fi(x) = σ(Aix + bi)

Hidden Layer i Complete Neural Network

slide-87
SLIDE 87

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

f(x) = fh(fh−1(fh−2(. . .))) = (fh ∘ fh−1 ∘ . . . ∘ f1)(x), with fi(x) = σ(Aix + bi)

∇θ ℓ(f(xi; θ), yi) = ??

Hidden Layer i Complete Neural Network

slide-88
SLIDE 88

BackProp

Computing the gradient

Learning

Chain-rule :

∂f/∂x = (∂f/∂y) · (∂y/∂x)
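The chain rule is easy to sanity-check numerically; a sketch with f(y) = y² and y(x) = 3x, so that (∂f/∂y)(∂y/∂x) = 2y · 3 = 18x:

```python
def y(x):
    return 3.0 * x

def f(y_val):
    return y_val ** 2

def numerical_grad(g, x, h=1e-6):
    # Central finite difference : a cheap check of a hand-derived gradient
    return (g(x + h) - g(x - h)) / (2 * h)

x0 = 2.0
chain_rule = (2 * y(x0)) * 3.0                   # (df/dy) * (dy/dx) = 2y * 3 = 36
numeric = numerical_grad(lambda x: f(y(x)), x0)  # d/dx of f(y(x)) directly
assert abs(chain_rule - numeric) < 1e-4
```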

slide-89
SLIDE 89

BackProp

Computing the gradient

x → y1(x) → y2(y1) → . . . → yh−2(yh−3) → yh−1(yh−2) → yh(yh−1) → ℓ(yh)

(each layer yi has its own parameters : θ1, θ2, . . . , θh−2, θh−1, θh)

Learning

slide-90
SLIDE 90

BackProp

Computing the gradient

ℓ(yh)

θh

∇θh ℓ(yh)

Learning

slide-91
SLIDE 91

BackProp

Computing the gradient

θh

∇θh ℓ(yh)

Learning

slide-92
SLIDE 92

BackProp

Computing the gradient

∇θh ℓ(yh)

θh

Learning

Chain-Rule

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-93
SLIDE 93

BackProp

Computing the gradient

∇θh ℓ(yh)

Doesn’t depend on the current layer. Only depends on ℓ θh

Learning

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-94
SLIDE 94

BackProp

Computing the gradient

∇θh ℓ(yh)

Only depends on current layer θh

Learning

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-95
SLIDE 95

BackProp

Computing the gradient

∇θh−1 ℓ(yh)

θh−1

Learning

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-96
SLIDE 96

BackProp

Computing the gradient

∇θh−1 ℓ(yh)

θh−1

Learning

Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-97
SLIDE 97

BackProp

Computing the gradient

Known

∇θh−1 ℓ(yh)

θh−1

Learning

Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-98
SLIDE 98

BackProp

Computing the gradient

Already computed

∇θh−1 ℓ(yh)

θh−1

Learning

Known Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-99
SLIDE 99

BackProp

It’s Backwards !

∇yh ℓ(yh)

Learning

slide-100
SLIDE 100

BackProp

It’s Backwards !

∇yh ℓ(yh) ∇θh ℓ(yh)

Learning

slide-101
SLIDE 101

BackProp

It’s Backwards !

∇θh−1 ℓ(yh) ∇θh ℓ(yh)

Learning

slide-102
SLIDE 102

∇θh−2 ℓ(yh)

BackProp

It’s Backwards !

∇θh−1 ℓ(yh)

Learning

slide-103
SLIDE 103

BackProp

It’s Backwards !

Learning

More precisely, let’s look at o = f(x; θ). The next layer gives us ∇o L(y). We use the chain rule to compute :

  • The gradient of our parameters : ∇θ L(y) = Jθ(f) ∇o L(y)
  • The gradient of the inputs : ∇x L(y) = Jx(f) ∇o L(y)

Return the gradient of the inputs

∂L/∂θj = Σ_{i=1}^m (∂oi/∂θj) (∂L/∂oi)
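For a concrete Φ, take a linear layer o = Ax (activation and bias dropped to keep it short): the two Jacobian products above reduce to ∇x L = Aᵀ ∇o L and ∇A L = (∇o L) xᵀ. A numpy sketch, checked against finite differences (the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # layer weights
x = rng.standard_normal(2)        # layer input
grad_o = rng.standard_normal(3)   # gradient handed down by the next layer

grad_x = A.T @ grad_o             # gradient of the inputs, returned to the layer below
grad_A = np.outer(grad_o, x)      # gradient of our parameters

# Sanity check grad_x by finite differences of L(x) = <grad_o, A x>
def L(x_):
    return grad_o @ (A @ x_)

eps = 1e-6
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    numeric = (L(x + e) - L(x - e)) / (2 * eps)
    assert abs(numeric - grad_x[j]) < 1e-4
```

Chaining these per-layer rules from the loss back to the input is exactly the backward pass.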

slide-104
SLIDE 104

Recap

The Story so far

Learning

  • Training / Testing set
  • Loss
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-105
SLIDE 105

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-106
SLIDE 106

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-107
SLIDE 107

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent -> Minimize by taking successive steps
  • Back-Propagation
slide-108
SLIDE 108

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent -> Minimize by taking successive steps
  • Back-Propagation -> A trick to compute the gradient
slide-109
SLIDE 109

Recap

The Story so far

Learning

Pick a random example (xj, yj)

slide-110
SLIDE 110

Recap

The Story so far

Learning

Pick a random example : poodle

slide-111
SLIDE 111

Neural Net

IN OUT Forward Pass

Hmm what’s this picture …

slide-112
SLIDE 112

Neural Net

IN OUT Forward Pass

Wait for it …

slide-113
SLIDE 113

Neural Net

IN OUT Forward Pass

slide-114
SLIDE 114

Neural Net

IN OUT Forward Pass

slide-115
SLIDE 115

Neural Net

IN OUT Forward Pass

slide-116
SLIDE 116

Neural Net

IN OUT Forward Pass

slide-117
SLIDE 117

Neural Net

IN OUT Forward Pass

slide-118
SLIDE 118

Neural Net

IN OUT Forward Pass

slide-119
SLIDE 119

Neural Net

IN Shiba Forward Pass

It’s a Shiba !

slide-120
SLIDE 120

Neural Net

IN Forward Pass

Wait, was it a shiba ?

ℓ(shiba, poodle)

slide-121
SLIDE 121

BackProp

It was a poodle !

∇yh ℓ(yh)

Learning

ℓ(shiba, poodle)

slide-122
SLIDE 122

BackProp

∇yh ℓ(yh) ∇θh ℓ(yh)

Learning

It was a poodle !

slide-123
SLIDE 123

BackProp

∇θh−1 ℓ(yh) ∇θh ℓ(yh)

Learning

It was a poodle !

slide-124
SLIDE 124

∇θh−2 ℓ(yh)

BackProp

∇θh−1 ℓ(yh)

Learning

It was a poodle !

slide-125
SLIDE 125

Gradient Descent

Take a step !

Learning

slide-126
SLIDE 126
  • PyTorch
  • Picking Hyper-Parameters

In Practice

import torch

“How I learned to stop caring and love the Backprop”

slide-127
SLIDE 127

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-128
SLIDE 128

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-129
SLIDE 129

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-130
SLIDE 130

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-131
SLIDE 131

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Stochastic Gradient Descent

Learning

slide-132
SLIDE 132

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Set all stored gradients to zero

Learning

slide-133
SLIDE 133

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Forward pass (output from input)

Learning

slide-134
SLIDE 134

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Compute the loss

Learning

slide-135
SLIDE 135

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

BackProp

Learning

slide-136
SLIDE 136

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Gradient Step

Learning

slide-137
SLIDE 137

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Wait what ? Is that it ?

Learning

slide-138
SLIDE 138

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

Define Model Parameters

slide-139
SLIDE 139

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

Define Forward Pass

slide-140
SLIDE 140

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

For simple forwards, the backward pass will be added automatically !

slide-141
SLIDE 141

Hyper-Parameters

AKA Those that shall not be learned

Learning

  • Number of Layers
  • Number of Hidden Units
  • Activation Function
  • Loss
  • Learning Rate

How do we pick :

slide-142
SLIDE 142

Hyper-Parameters

AKA Those that shall not be learned

Learning

  • Number of Layers
  • Number of Hidden Units
  • Activation Function
  • Loss
  • Learning Rate

How do we pick :

We don’t know !

slide-143
SLIDE 143

Grid Search

AKA Science !

Learning

We use Grid-Searches :

(grid : Learning Rate ∈ {1e-2, 1e-1, 1} × Number of Layers ∈ {10, 20, 30})

slide-144
SLIDE 144

Grid Search

AKA Science !

Learning

We use Grid-Searches :

  • Evaluate for set of hyper-parameters
  • Select Based on Validation Error

(grid : Learning Rate ∈ {1e-2, 1e-1, 1} × Number of Layers ∈ {10, 20, 30})
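A grid search is just nested loops over the grid above, keeping the point with the lowest validation error; a sketch where train_and_validate is a hypothetical stand-in for a full training run:

```python
from itertools import product

def train_and_validate(lr, n_layers):
    # Hypothetical stand-in : in practice, train a network with these
    # hyper-parameters and return its error on the validation set.
    return (lr - 0.1) ** 2 + (n_layers - 20) ** 2 / 1000.0

grid = list(product([1e-2, 1e-1, 1.0], [10, 20, 30]))  # the 3 x 3 grid from the slide
best = min(grid, key=lambda hp: train_and_validate(*hp))
assert best == (0.1, 20)  # the grid point with the lowest validation error
```

The cost grows multiplicatively with each hyper-parameter added, which is why grids are usually kept small and coarse.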

slide-145
SLIDE 145

Recap

The Story so far

Learning

  • Pytorch
  • Grid-Search
slide-146
SLIDE 146

Recap

The Story so far

Learning

  • Pytorch -> Allows us not to care too much about Backprop
  • Grid-Search
slide-147
SLIDE 147

Recap

The Story so far

Learning

  • Pytorch -> Allows us not to care too much about Backprop
  • Grid-Search -> How we set Hyper-Parameters we cannot learn