

slide-1
SLIDE 1

About me :

  • ENS -> MVA -> FAIR (engineer) -> FAIR (PhD 3rd year)

About my PhD :

  • Interested in sign matrices and tensors (graphs / multi-graphs)
  • Observe a few entries, predict the remaining edges
  • Factorization methods are simple and efficient for these problems
  • When do we need to go beyond factorization methods ?
  • How ?

Presentation

slide-2
SLIDE 2
  • Artificial Neuron
  • Why go deep ?
  • Neural Networks

Introduction to Neural Networks

slide-3
SLIDE 3

Neural Net

Poodle

Examples of Inputs / Outputs

Image In, Class out

slide-4
SLIDE 4

Neural Net

Examples of Inputs / Outputs

Image In, Text out

a person on skis jumping down part of a hill

slide-5
SLIDE 5

Neural Net

Examples of Inputs / Outputs

Text In, Text out

a person on skis jumping down part of a hill → Un skieur sautant au dessus d’un talus (“A skier jumping over a bank”)

slide-6
SLIDE 6

Neural Net

From NVIDIA’s paper at ICLR2018

Examples of Inputs / Outputs

Noise In, Face out ?!

slide-7
SLIDE 7

Neural Net

IN OUT Neural Networks

One model to rule them all

Hidden Layers Input Layer Output Layer

slide-8
SLIDE 8

Neural Net

IN OUT Neural Networks

One model to rule them all

slide-9
SLIDE 9

Neural Net

IN OUT Neural Networks

One model to rule them all

slide-10
SLIDE 10

Neural Net

Artificial Neuron

The good ol’ Perceptron

Inputs x1 . . . xd, weights a1 . . . ad, bias b

y = σ(a1x1 + ... + adxd + b) = σ(⟨a, x⟩ + b)

σ: Activation function (Sigmoid, ReLU)
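In code, the neuron above is a dot product, a bias, and an activation; a minimal numpy sketch (the particular weights and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, a, b):
    # y = sigma(<a, x> + b) : weighted sum of the inputs, plus bias, then activation
    return sigmoid(np.dot(a, x) + b)

x = np.array([1.0, 2.0])   # inputs x1, x2
a = np.array([0.5, -0.5])  # weights a1, a2
b = 0.5                    # bias

y = neuron(x, a, b)  # here <a, x> + b = 0, so y = sigmoid(0) = 0.5
```

With these numbers the pre-activation is exactly 0, so the sigmoid sits right on its decision threshold of 0.5.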

slide-11
SLIDE 11

Neural Net

Artificial Neuron

a x1

The good ol’ Perceptron

slide-12
SLIDE 12

Neural Net

Artificial Neuron

⟨a, x1⟩ a x1

The good ol’ Perceptron

slide-13
SLIDE 13

Neural Net

Artificial Neuron

a x2 ⟨a, x2⟩

The good ol’ Perceptron

slide-14
SLIDE 14

Neural Net

Artificial Neuron

a x2 x1 b = 0

y = σ(⟨a, x⟩ + b)

Active Inactive

The good ol’ Perceptron

slide-15
SLIDE 15

Neural Net

Artificial Neuron

a x2 x1

b

y = σ(⟨a, x⟩ + b)

The good ol’ Perceptron

slide-16
SLIDE 16

Neural Net

Why deep learning ?

We need to go deeper

(figure : data in the (x1, x2) plane)

slide-17
SLIDE 17

Neural Net

(figure : data in the (x1, x2) plane, with a first hidden unit y1¹)

Why deep learning ?

We need to go deeper

slide-18
SLIDE 18

Neural Net

(figure : data in the (x1, x2) plane, with two hidden units y1¹ and y1²)

Why deep learning ?

We need to go deeper

slide-19
SLIDE 19

Neural Net

(figure : the same data in the (x1, x2) plane and replotted in the (y1¹, y1²) plane)

Why deep learning ?

We need to go deeper

slide-20
SLIDE 20

Neural Net

(figure : the data in the (y1¹, y1²) plane, now separable by an output neuron y2)

Why deep learning ?

We need to go deeper

slide-21
SLIDE 21

Neural Net

IN OUT Hidden Layer

And in the darkness bind them

slide-22
SLIDE 22

Neural Net

x1 x2 x3 x4 y2 y3 y4

y1

Hidden Layer

Set of neurons with shared inputs

slide-23
SLIDE 23

Neural Net

y1 = σ(⟨a1, x⟩ + b1)
y2 = σ(⟨a2, x⟩ + b2)
y3 = σ(⟨a3, x⟩ + b3)
y4 = σ(⟨a4, x⟩ + b4)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-24
SLIDE 24

Neural Net

y1 = σ(⟨a1, x⟩ + b1)
y2 = σ(⟨a2, x⟩ + b2)
y3 = σ(⟨a3, x⟩ + b3)
y4 = σ(⟨a4, x⟩ + b4)

A b

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-25
SLIDE 25

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-26
SLIDE 26

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

slide-27
SLIDE 27

Neural Net

y = σ(Ax + b)

x1 x2 x3 x4

Hidden Layer

Set of neurons with shared inputs

> 100x speed-up
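The speed-up comes from computing the whole layer at once: all the neurons share the input x, so the layer is a single matrix product σ(Ax + b) instead of a loop over neurons. A numpy sketch checking that both give the same answer (the layer sizes here are made up):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # one row of weights per neuron
b = rng.standard_normal(4)       # one bias per neuron
x = rng.standard_normal(3)       # shared input

# Neuron by neuron : y_i = sigma(<a_i, x> + b_i)
y_loop = np.array([sigma(A[i] @ x + b[i]) for i in range(4)])

# Whole layer at once : y = sigma(A x + b)
y_vec = sigma(A @ x + b)

assert np.allclose(y_loop, y_vec)
```

On real hardware the vectorized form runs in optimized BLAS (or on a GPU), which is where speed-ups of this magnitude come from.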

slide-28
SLIDE 28

Neural Net

IN OUT Forward Pass

Hmm what’s this picture …

slide-29
SLIDE 29

Neural Net

IN OUT Forward Pass

Wait for it …

slide-30
SLIDE 30

Neural Net

IN OUT Forward Pass

slide-31
SLIDE 31

Neural Net

IN OUT Forward Pass

slide-32
SLIDE 32

Neural Net

IN OUT Forward Pass

slide-33
SLIDE 33

Neural Net

IN OUT Forward Pass

slide-34
SLIDE 34

Neural Net

IN OUT Forward Pass

slide-35
SLIDE 35

Neural Net

IN OUT Forward Pass

slide-36
SLIDE 36

Neural Net

IN Poodle Forward Pass

It’s a Poodle !

slide-37
SLIDE 37

Neural Net

Recap

The story so far

  • Artificial Neuron
  • Neural Network
slide-38
SLIDE 38

Neural Net

Recap

The story so far

  • Artificial Neuron -> Thresholds its input, one is not enough
  • Neural Network
slide-39
SLIDE 39

Neural Net

Recap

The story so far

  • Artificial Neuron -> Thresholds its input, one is not enough
  • Neural Network -> Computes its output with the Forward pass
slide-40
SLIDE 40
  • Learning from data
  • Stochastic Gradient Descent
  • Back Propagation

Learning

“It’s all downhill from here”

slide-41
SLIDE 41

Learning

Learning From Data

Step 1 – Get Data

A dataset is a set of (input, output) pairs

Poodle Shiba Inu Not Dog Shiba Inu

slide-42
SLIDE 42

Learning

Learning From Data

A dataset is a set of (input, output) pairs, lots of them

Step 2 – Get More

slide-43
SLIDE 43

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function

slide-44
SLIDE 44

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction

slide-45
SLIDE 45

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label)

slide-46
SLIDE 46

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label) Poodle

slide-47
SLIDE 47

Loss Function

Define your goal

Learning

ℓ(f(x), y)

Loss function Model prediction Ground Truth (or label) Poodle

The loss defines how much you pay for answering “Shiba” instead of “Poodle”

slide-48
SLIDE 48

Loss Function

Some examples

Classification

  • Goal : assign class +1 / -1
  • Loss : based on sign agreement yf(x)

(plot : the loss is 0 when yf(x) > 0 and 1 when yf(x) < 0)

Learning

slide-49
SLIDE 49

Loss Function

Some examples

Loss : Application-dependent measure of error

Learning

Regression

  • Goal : learn continuous value y
  • Loss : based on distance (f(x) − y)²

(plot : the squared loss (f(x) − y)² against the error f(x) − y)
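Both example losses fit in one line each; a sketch (zero_one_loss is the sign-agreement loss from the classification slide, squared_loss the regression one):

```python
def zero_one_loss(fx, y):
    # Classification : pay 1 whenever the signs of f(x) and y disagree
    return 0.0 if y * fx > 0 else 1.0

def squared_loss(fx, y):
    # Regression : pay the squared distance to the target
    return (fx - y) ** 2

assert zero_one_loss(2.3, +1) == 0.0   # correct sign : no cost
assert zero_one_loss(-0.1, +1) == 1.0  # wrong sign : unit cost
assert squared_loss(3.0, 1.0) == 4.0
```

In practice the 0–1 loss is usually replaced by a smooth surrogate (hinge, logistic) so that gradients exist everywhere.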

slide-50
SLIDE 50

Training Loss

Let’s get to minimizing

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Learning

slide-51
SLIDE 51

Training Loss

Let’s get to minimizing

Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Learning

slide-52
SLIDE 52

Training Loss

Let’s get to minimizing

Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-53
SLIDE 53

Training Loss

Let’s get to minimizing

Training example Loss function Example label Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-54
SLIDE 54

Training Loss

Let’s get to minimizing

Training example Loss function Example label Model Prediction Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-55
SLIDE 55

Training Loss

Let’s get to minimizing

Loss function poodle Model Parameters Average over the training set

Learning

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

slide-56
SLIDE 56

Training Loss

Let’s get to minimizing

Find the parameters that minimize the average loss on the training set

Learning

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)
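The objective above is just the per-example loss averaged over the training set; a sketch for a toy linear model f(x; θ) = θx with the squared loss (the three training pairs are made up):

```python
def training_loss(theta, data):
    # (1/n) * sum_i loss(f(xi; theta), yi), with f(x; theta) = theta * x
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x exactly
assert training_loss(2.0, data) == 0.0       # the true parameter pays nothing
assert training_loss(0.0, data) > 0.0
```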

slide-57
SLIDE 57

Learning

An iterative process

Step 1 – be random

Your neural net starts with random parameters θ0

(figure : data points and the random decision boundary given by θ0)
slide-58
SLIDE 58

Learning

Your neural net starts with random parameters θ0

(figure : data points and the decision boundary given by θ0)

(1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

An iterative process

Step 1 – be random

slide-59
SLIDE 59

Learning

Your neural net starts with random parameters θ0

(figure : the decision boundary for θ0; plot of training loss vs. iteration)

An iterative process

Step 1 – be random

slide-60
SLIDE 60

Learning

Then you’ll find θ1 that performs a bit better

(figure : the decision boundary for θ1; training loss vs. iteration, decreasing)

An iterative process

Step 2 – be less random

slide-61
SLIDE 61

Learning

Then you’ll find θ2 that performs even better

(figure : the decision boundary for θ2; training loss vs. iteration)

An iterative process

Step 3 – be even less random

slide-62
SLIDE 62

Learning

And finally θ3

(figure : the decision boundary for θ3; training loss vs. iteration)

An iterative process

Step 4 – profit

slide-63
SLIDE 63

Learning From Data

Avoiding Overfitting

This classifier has zero error

Learning

slide-64
SLIDE 64

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ?

Learning

slide-65
SLIDE 65

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ?

Learning

?

slide-66
SLIDE 66

Learning From Data

Avoiding Overfitting

This one too. How do we pick one ? Solution : Training and Validation Set

(figure : the data points split into a training set and a validation set)

Learning
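The split itself boils down to shuffling the dataset once and holding out a fraction; a sketch (the 20% validation fraction is a common but arbitrary choice):

```python
import random

def split(dataset, val_fraction=0.2, seed=0):
    # Shuffle once, then carve off the head as the validation set
    examples = list(dataset)
    random.Random(seed).shuffle(examples)
    n_val = int(len(examples) * val_fraction)
    return examples[n_val:], examples[:n_val]  # (train, validation)

train, val = split(range(100))
assert len(train) == 80 and len(val) == 20
assert set(train) | set(val) == set(range(100))  # nothing lost or duplicated
```

Models are fit on the training set; hyper-parameters and model choice are judged on the validation set, which the training procedure never sees.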

slide-67
SLIDE 67

Gradient Descent

Back to minimizing

Training example Loss function Example label Model Prediction Model Parameters

min_θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Average over the training set

Learning

slide-68
SLIDE 68

Gradient Descent

Back to minimizing

f(x)

Skier’s approach to minimization : Steepest Descent

Learning

slide-69
SLIDE 69

Gradient Descent

Back to minimizing

Skier’s approach to minimization : Steepest Descent

∇x f(x) f(x)

“Gradient in x of f(x)”

Learning

slide-70
SLIDE 70

Gradient Descent

Back to minimizing

Learning

slide-71
SLIDE 71

Gradient Descent

Back to minimizing

Learning

slide-72
SLIDE 72

Gradient Descent

Back to minimizing

Learning

slide-73
SLIDE 73

Gradient Descent

Back to minimizing

Learning

slide-74
SLIDE 74

Gradient Descent

Back to minimizing

Learning

slide-75
SLIDE 75

Gradient Descent

Back to minimizing

Learning

slide-76
SLIDE 76

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large
  • Complicated function (neural net)

Learning

slide-77
SLIDE 77

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large -> Stochastic Gradient Descent
  • Complicated function (neural net) -> BackProp

Learning

slide-78
SLIDE 78

Gradient Descent

Back to minimizing

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

  • Large -> Stochastic Gradient Descent

Learning

slide-79
SLIDE 79

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

One function

Learning

slide-80
SLIDE 80

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average

Learning

slide-81
SLIDE 81

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average = The Average of the Gradients

Learning

slide-82
SLIDE 82

Stochastic Gradient Descent

Killing n

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi) = (1/n) Σ_{i=1}^n ∇θ ℓ(f(xi; θ), yi) ≈ ∇θ ℓ(f(xj; θ), yj)

The Gradient of the Average = The Average of the Gradients

Learning

In expectation, for uniform j

slide-83
SLIDE 83

Stochastic Gradient Descent

Killing n

For some number of iterations :

  • Pick some random example (xj, yj)
  • Gradient Step : θn+1 ← θn − η ∇θ ℓ(f(xj; θn), yj)    (η : Learning-rate)

Learning
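Those two steps can be sketched end-to-end for the toy model f(x; θ) = θx with the squared loss, whose gradient ∇θ ℓ = 2(θx − y)x is easy to derive by hand (the data and learning rate here are made up):

```python
import random

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # ground truth : theta = 2
theta = 0.0                                      # step 1 - be random
eta = 0.05                                       # learning rate
rng = random.Random(0)

for _ in range(1000):
    x, y = rng.choice(data)             # pick some random example (xj, yj)
    grad = 2.0 * (theta * x - y) * x    # gradient of (theta*x - y)^2 in theta
    theta = theta - eta * grad          # gradient step
```

After enough iterations theta has converged to the true value 2; each step only ever looks at a single example.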

slide-84
SLIDE 84

Back Propagation

Computing the gradient

∇θ (1/n) Σ_{i=1}^n ℓ(f(xi; θ), yi)

Problem : How do we compute the gradient ?

Learning

Complicated function (neural net) -> BackProp

slide-85
SLIDE 85

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

fi(x) = σ(Aix + bi)

Hidden Layer i

slide-86
SLIDE 86

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

f(x) = fh(fh−1(fh−2(. . .))) = (fh ∘ fh−1 ∘ . . . ∘ f1)(x), with fi(x) = σ(Aix + bi)

Hidden Layer i Complete Neural Network

slide-87
SLIDE 87

BackProp

Computing the gradient

Problem : How do we compute the gradient ?

Learning

f(x) = fh(fh−1(fh−2(. . .))) = (fh ∘ fh−1 ∘ . . . ∘ f1)(x), with fi(x) = σ(Aix + bi)

∇θ ℓ(f(xi; θ), yi) = ??

Hidden Layer i Complete Neural Network

slide-88
SLIDE 88

BackProp

Computing the gradient

Learning

Chain-rule :

∂f/∂x = (∂f/∂y) · (∂y/∂x)
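The chain rule is easy to sanity-check numerically; a sketch with f(y) = y² and y(x) = 3x, so that (∂f/∂y)(∂y/∂x) = 2y · 3 = 18x:

```python
def y(x):
    return 3.0 * x

def f(y_val):
    return y_val ** 2

def numerical_grad(g, x, h=1e-6):
    # Central finite difference : a cheap check of a hand-derived gradient
    return (g(x + h) - g(x - h)) / (2 * h)

x0 = 2.0
chain_rule = (2 * y(x0)) * 3.0                   # (df/dy) * (dy/dx) = 2y * 3 = 36
numeric = numerical_grad(lambda x: f(y(x)), x0)  # d/dx of f(y(x)) directly
assert abs(chain_rule - numeric) < 1e-4
```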

slide-89
SLIDE 89

BackProp

Computing the gradient

x → y1(x) → y2(y1) → . . . → yh−2(yh−3) → yh−1(yh−2) → yh(yh−1) → ℓ(yh)

(each layer yi has its own parameters : θ1, θ2, . . . , θh−2, θh−1, θh)

Learning

slide-90
SLIDE 90

BackProp

Computing the gradient

ℓ(yh)

θh

∇θh ℓ(yh)

Learning

slide-91
SLIDE 91

BackProp

Computing the gradient

θh

∇θh ℓ(yh)

Learning

slide-92
SLIDE 92

BackProp

Computing the gradient

∇θh ℓ(yh)

θh

Learning

Chain-Rule

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-93
SLIDE 93

BackProp

Computing the gradient

∇θh ℓ(yh)

Doesn’t depend on the current layer. Only depends on ℓ θh

Learning

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-94
SLIDE 94

BackProp

Computing the gradient

∇θh ℓ(yh)

Only depends on current layer θh

Learning

∂ℓ(yh)/∂θh,i = (∂ℓ(yh)/∂yh) · (∂yh/∂θh,i)

slide-95
SLIDE 95

BackProp

Computing the gradient

∇θh−1 ℓ(yh)

θh−1

Learning

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-96
SLIDE 96

BackProp

Computing the gradient

∇θh−1 ℓ(yh)

θh−1

Learning

Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-97
SLIDE 97

BackProp

Computing the gradient

Known

∇θh−1 ℓ(yh)

θh−1

Learning

Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-98
SLIDE 98

BackProp

Computing the gradient

Already computed

∇θh−1 ℓ(yh)

θh−1

Learning

Known Depends on current layer’s structure

= Φh−1(θh−1, ∇yh−1 ℓ(yh))

slide-99
SLIDE 99

BackProp

It’s Backwards !

∇yh ℓ(yh)

Learning

slide-100
SLIDE 100

BackProp

It’s Backwards !

∇yh ℓ(yh) ∇θh ℓ(yh)

Learning

slide-101
SLIDE 101

BackProp

It’s Backwards !

∇θh−1 ℓ(yh) ∇θh ℓ(yh)

Learning

slide-102
SLIDE 102

∇θh−2 ℓ(yh)

BackProp

It’s Backwards !

∇θh−1 ℓ(yh)

Learning

slide-103
SLIDE 103

BackProp

It’s Backwards !

Learning

More precisely, let’s look at o = f(x; θ). The next layer gives us ∇o L(y). We use the chain rule to compute :

  • The gradient of our parameters : ∇θ L(y) = Jθ(f) ∇o L(y)
  • The gradient of the inputs : ∇x L(y) = Jx(f) ∇o L(y)

Return the gradient of the inputs

∂L/∂θj = Σ_{i=1}^m (∂oi/∂θj) (∂L/∂oi)
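For a concrete Φ, take a linear layer o = Ax (activation and bias dropped to keep it short): the two Jacobian products above reduce to ∇x L = Aᵀ ∇o L and ∇A L = (∇o L) xᵀ. A numpy sketch, checked against finite differences (the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # layer weights
x = rng.standard_normal(2)        # layer input
grad_o = rng.standard_normal(3)   # gradient handed down by the next layer

grad_x = A.T @ grad_o             # gradient of the inputs, returned to the layer below
grad_A = np.outer(grad_o, x)      # gradient of our parameters

# Sanity check grad_x by finite differences of L(x) = <grad_o, A x>
def L(x_):
    return grad_o @ (A @ x_)

eps = 1e-6
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    numeric = (L(x + e) - L(x - e)) / (2 * eps)
    assert abs(numeric - grad_x[j]) < 1e-4
```

Chaining these per-layer rules from the loss back to the input is exactly the backward pass.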

slide-104
SLIDE 104

Recap

The Story so far

Learning

  • Training / Testing set
  • Loss
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-105
SLIDE 105

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-106
SLIDE 106

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent
  • Back-Propagation
slide-107
SLIDE 107

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent -> Minimize by taking successive steps
  • Back-Propagation
slide-108
SLIDE 108

Recap

The Story so far

Learning

  • Training / Testing set -> Required to avoid overfitting
  • Loss -> Defines the problem you’re solving
  • (Stochastic) Gradient Descent -> Minimize by taking successive steps
  • Back-Propagation -> A trick to compute the gradient
slide-109
SLIDE 109

Recap

The Story so far

Learning

Pick a random example (xj, yj)

slide-110
SLIDE 110

Recap

The Story so far

Learning

Pick a random example : poodle

slide-111
SLIDE 111

Neural Net

IN OUT Forward Pass

Hmm what’s this picture …

slide-112
SLIDE 112

Neural Net

IN OUT Forward Pass

Wait for it …

slide-113
SLIDE 113

Neural Net

IN OUT Forward Pass

slide-114
SLIDE 114

Neural Net

IN OUT Forward Pass

slide-115
SLIDE 115

Neural Net

IN OUT Forward Pass

slide-116
SLIDE 116

Neural Net

IN OUT Forward Pass

slide-117
SLIDE 117

Neural Net

IN OUT Forward Pass

slide-118
SLIDE 118

Neural Net

IN OUT Forward Pass

slide-119
SLIDE 119

Neural Net

IN Shiba Forward Pass

It’s a Shiba !

slide-120
SLIDE 120

Neural Net

IN Forward Pass

Wait, was it a shiba ?

ℓ(shiba, poodle)

slide-121
SLIDE 121

BackProp

It was a poodle !

∇yh ℓ(yh)

Learning

ℓ(shiba, poodle)

slide-122
SLIDE 122

BackProp

∇yh ℓ(yh) ∇θh ℓ(yh)

Learning

It was a poodle !

slide-123
SLIDE 123

BackProp

∇θh−1 ℓ(yh) ∇θh ℓ(yh)

Learning

It was a poodle !

slide-124
SLIDE 124

∇θh−2 ℓ(yh)

BackProp

∇θh−1 ℓ(yh)

Learning

It was a poodle !

slide-125
SLIDE 125

Gradient Descent

Take a step !

Learning

slide-126
SLIDE 126
  • PyTorch
  • Picking Hyper-Parameters

In Practice

import torch

“How I learned to stop caring and love the Backprop”

slide-127
SLIDE 127

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-128
SLIDE 128

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-129
SLIDE 129

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-130
SLIDE 130

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Learning

slide-131
SLIDE 131

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Stochastic Gradient Descent

Learning

slide-132
SLIDE 132

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Set all stored gradients to zero

Learning

slide-133
SLIDE 133

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Forward pass (output from input)

Learning

slide-134
SLIDE 134

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Compute the loss

Learning

slide-135
SLIDE 135

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

BackProp

Learning

slide-136
SLIDE 136

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Gradient Step

Learning

slide-137
SLIDE 137

In PyTorch

It’s so easy …

import torch
import torch.nn as nn
import torch.optim as optim

h = 50
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(net.parameters(), lr=1)

for i in range(100):
    optimizer.zero_grad()
    output = net(X[i])
    loss = nn.BCELoss()(output, Y[i])
    loss.backward()
    optimizer.step()

Wait what ? Is that it ?

Learning

slide-138
SLIDE 138

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

Define Model Parameters

slide-139
SLIDE 139

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

Define Forward Pass

slide-140
SLIDE 140

In PyTorch

It’s so easy …

Learning

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self):
        super(Linear, self).__init__()
        self.A = nn.Parameter(torch.randn(hidden_size, input_size))
        self.b = nn.Parameter(torch.randn(hidden_size, 1))

    def forward(self, x):
        x = torch.mm(self.A, torch.t(x)) + self.b
        return x

For simple forwards, the backward pass will be added automatically !

slide-141
SLIDE 141

Hyper-Parameters

AKA Those that shall not be learned

Learning

  • Number of Layers
  • Number of Hidden Units
  • Activation Function
  • Loss
  • Learning Rate

How do we pick :

slide-142
SLIDE 142

Hyper-Parameters

AKA Those that shall not be learned

Learning

  • Number of Layers
  • Number of Hidden Units
  • Activation Function
  • Loss
  • Learning Rate

How do we pick :

We don’t know !

slide-143
SLIDE 143

Grid Search

AKA Science !

Learning

We use Grid-Searches :

(grid : Learning Rate ∈ {1e-2, 1e-1, 1} × Number of Layers ∈ {10, 20, 30})

slide-144
SLIDE 144

Grid Search

AKA Science !

Learning

We use Grid-Searches :

  • Evaluate for set of hyper-parameters
  • Select Based on Validation Error

(grid : Learning Rate ∈ {1e-2, 1e-1, 1} × Number of Layers ∈ {10, 20, 30})
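A grid search is just nested loops over the grid above, keeping the point with the lowest validation error; a sketch where train_and_validate is a hypothetical stand-in for a full training run:

```python
from itertools import product

def train_and_validate(lr, n_layers):
    # Hypothetical stand-in : in practice, train a network with these
    # hyper-parameters and return its error on the validation set.
    return (lr - 0.1) ** 2 + (n_layers - 20) ** 2 / 1000.0

grid = list(product([1e-2, 1e-1, 1.0], [10, 20, 30]))  # the 3 x 3 grid from the slide
best = min(grid, key=lambda hp: train_and_validate(*hp))
assert best == (0.1, 20)  # the grid point with the lowest validation error
```

The cost grows multiplicatively with each hyper-parameter added, which is why grids are usually kept small and coarse.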

slide-145
SLIDE 145

Recap

The Story so far

Learning

  • Pytorch
  • Grid-Search
slide-146
SLIDE 146

Recap

The Story so far

Learning

  • Pytorch -> Allows us not to care too much about Backprop
  • Grid-Search
slide-147
SLIDE 147

Recap

The Story so far

Learning

  • Pytorch -> Allows us not to care too much about Backprop
  • Grid-Search -> How we set Hyper-Parameters we cannot learn