Neural Networks CE417: Introduction to Artificial Intelligence



slide-1
SLIDE 1

Neural Networks

CE417: Introduction to Artificial Intelligence Sharif University of Technology Fall 2019

Soleymani

Some slides are based on Anca Dragan’s slides, CS188, UC Berkeley and some adapted from Bhiksha Raj, 11-785, CMU 2019.

slide-2
SLIDE 2

2

slide-3
SLIDE 3

McCulloch-Pitts neuron: binary threshold

3

Inputs x_1, …, x_d with weights w_1, …, w_d feed a summed input to a unit z:

z = 1 if βˆ‘_i w_i x_i β‰₯ T, and z = 0 if βˆ‘_i w_i x_i < T   (T: activation threshold)

Equivalently, append a constant input 1 with bias weight b = βˆ’T, so the unit fires when βˆ‘_i w_i x_i + b β‰₯ 0.
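The two equivalent formulations above can be sketched directly in code (a minimal illustration; the function names are mine, not from the slides):

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (output 1) iff the weighted sum reaches T."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0

def mcp_neuron_bias(inputs, weights, bias):
    """Equivalent form: fold the threshold into a bias b = -T."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0
```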

slide-4
SLIDE 4

Neural nets and the brain

4

  • Neural nets are composed of networks of computational models of neurons called perceptrons

(figure: a perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

slide-5
SLIDE 5

The perceptron

5

  • A threshold unit

– β€œFires” if the weighted sum of inputs exceeds a threshold
– Electrical engineers call this a threshold gate

  • A basic unit of Boolean circuits

y = 1 if 7 π‘₯8𝑦8 β‰₯ πœ„

8

0 else

+

.... .

𝑦" 𝑦# 𝑦3 𝑦$

π‘₯" π‘₯# π‘₯3

π‘₯$

𝑐 = βˆ’πœ„

slide-6
SLIDE 6

Perceptron

6

z = βˆ‘_i w_i x_i + b

(figure: perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

} Learn this function

} A step function across a hyperplane

slide-7
SLIDE 7

Learning the perceptron

7

  • Given a number of input-output pairs, learn the weights and bias

– Learn W = [w_1, …, w_d] and b, given several (𝒙, z) pairs

z = 1 if βˆ‘_i w_i x_i + b β‰₯ 0, and z = 0 otherwise

(figure: perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

slide-8
SLIDE 8

Restating the perceptron

8

x_1 x_2 x_3 … x_d, plus a constant input x_{d+1} = 1 with weight w_{d+1} = b

} Restating the perceptron equation by adding another dimension to 𝒙:

z = 1 if βˆ‘_{i=1}^{d+1} w_i x_i β‰₯ 0, and z = 0 otherwise, where x_{d+1} = 1

} Note that the boundary βˆ‘_{i=1}^{d+1} w_i x_i = 0 is now a hyperplane through the origin

slide-9
SLIDE 9

The Perceptron Problem

9

  • Find the hyperplane

βˆ‘_{i=1}^{d+1} w_i x_i = 0

that perfectly separates the two groups of points

slide-10
SLIDE 10

Perceptron Algorithm: Summary

10

  • Cycle through the training instances
  • Only update the weight vector 𝒘 on misclassified instances
  • If an instance is misclassified:

– If the instance is positive class: 𝒘 = 𝒘 + 𝒙

– If the instance is negative class: 𝒘 = 𝒘 βˆ’ 𝒙

slide-11
SLIDE 11

Training of Single Layer

11

} Weight update for a training pair (𝒙^(n), z^(n)):

𝒘^{t+1} = 𝒘^t βˆ’ Ξ· βˆ‡E_n(𝒘^t)

} Perceptron: If sign(𝒘ဠ𝒙^(n)) β‰  z^(n) then

βˆ‡E_n(𝒘) = βˆ’π’™^(n) z^(n)   (E_n(𝒘) = βˆ’π’˜α€π’™^(n) z^(n) if misclassified)

} ADALINE:

βˆ‡E_n(𝒘) = βˆ’(z^(n) βˆ’ 𝒘ဠ𝒙^(n)) 𝒙^(n)   (E_n(𝒘) = Β½ (z^(n) βˆ’ 𝒘ဠ𝒙^(n))Β²)

} Widrow-Hoff, LMS, or delta rule
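The ADALINE / delta-rule update can be sketched as follows (a toy illustration with made-up data fitting a line; the learning rate and epoch count are arbitrary choices):

```python
def adaline_epoch(w, samples, lr=0.05):
    """One pass of the Widrow-Hoff (LMS / delta) rule:
    w <- w + lr * (z - w.x) * x, applied to every sample."""
    for x, z in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear output, no threshold
        w = [wi + lr * (z - y) * xi for wi, xi in zip(w, x)]
    return w

# hypothetical data from z = 2*x + 1; the trailing 1 plays the role of the bias
samples = [([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]
w = [0.0, 0.0]
for _ in range(2000):
    w = adaline_epoch(w, samples)
```

Unlike the perceptron rule, the update fires on every sample, not just misclassified ones, which is what lets the same rule handle regression.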

slide-12
SLIDE 12

Perceptron vs. Delta Rule

12

} Perceptron learning rule:

} guaranteed to succeed if the training examples are linearly separable

} Delta rule:

} guaranteed to converge to the hypothesis with the minimum squared error

} can also be used for regression problems

slide-13
SLIDE 13

Reminder: Linear Classifiers

Β§ Inputs are feature values
Β§ Each feature has a weight
Β§ Sum is the activation
Β§ If the activation is:

Β§ Positive, output +1
Β§ Negative, output -1

(figure: features f1 f2 f3 with weights w1 w2 w3 into a sum Ξ£, then a >0? test)

13

slide-14
SLIDE 14

The β€œsoft” perceptron (logistic)

14

(figure: sigmoid unit with inputs x_1 … x_d, weights w_1 … w_d)

z = βˆ‘_i w_i x_i βˆ’ T

y = 1 / (1 + exp(βˆ’z))

  • A β€œsquashing” function instead of a threshold at the output

– The sigmoid β€œactivation” replaces the threshold

  • Activation: the function that acts on the weighted combination of inputs (and threshold); bias c = βˆ’T
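A sketch of the β€œsoft” perceptron, assuming the logistic squashing described above (the function name is mine):

```python
import math

def soft_perceptron(x, w, threshold):
    """Logistic ("soft") perceptron: sigmoid of the weighted sum minus T."""
    z = sum(wi * xi for wi, xi in zip(w, x)) - threshold
    return 1.0 / (1.0 + math.exp(-z))
```

The output varies smoothly between 0 and 1 instead of jumping at the threshold.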

slide-15
SLIDE 15

Sigmoid neurons

} These give a real-valued output that is a smooth and bounded function of their total input.
} Typically they use the logistic function.
} They have nice derivatives.

15

slide-16
SLIDE 16

Other β€œactivations”

16

(figure: unit with inputs x_1 … x_d, weights w_1 … w_d, and bias c)

  • Does not always have to be a squashing function

– We will hear more about activations later

  • We will continue to assume a β€œthreshold” activation in this lecture

tanh: tanh(z)    softplus: log(1 + e^z)    sigmoid: 1 / (1 + exp(βˆ’z))

slide-17
SLIDE 17

How to get probabilistic decisions?

} Activation:
} If z = 𝒘ဠ𝒙 is very positive → want probability going to 1
} If z = 𝒘ဠ𝒙 is very negative → want probability going to 0
} Sigmoid function:

Ο†(z) = 1 / (1 + e^{βˆ’z})

17

slide-18
SLIDE 18

Best w?

} Maximum likelihood estimation:

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

with:

P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{βˆ’π’˜α€π’™^(i)})

P(y^(i) = βˆ’1 | x^(i); w) = 1 βˆ’ 1 / (1 + e^{βˆ’π’˜α€π’™^(i)})

= Logistic Regression

18

slide-19
SLIDE 19

Multiclass Logistic Regression

} Multi-class linear classification

} A weight vector for each class: 𝒘_y

} Score (activation) of a class y: 𝒘_yဠ𝒙

} Prediction: the class with the highest score wins

} How to make the scores into probabilities?

z1, z2, z3 → e^{z1} / (e^{z1} + e^{z2} + e^{z3}), e^{z2} / (e^{z1} + e^{z2} + e^{z3}), e^{z3} / (e^{z1} + e^{z2} + e^{z3})

original activations → softmax activations

19

slide-20
SLIDE 20

Best w?

} Maximum likelihood estimation:

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

with:

P(y^(i) | x^(i); w) = e^{𝒘_{y^(i)} Β· f(x^(i))} / βˆ‘_y e^{𝒘_y Β· f(x^(i))}

= Multi-Class Logistic Regression

20

slide-21
SLIDE 21

Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)   ≑ g(w)

} init w
} for iter = 1, 2, …

w ← w + Ξ± βˆ‘_i βˆ‡ log P(y^(i) | x^(i); w)

21
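The batch update above, specialized to binary logistic regression with labels in {+1, βˆ’1}, might look like this (a sketch; the data, step size, and iteration count are illustrative):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def batch_gradient_ascent(data, alpha=0.5, iters=500):
    """Full-batch ascent on ll(w) = sum_i log P(y_i | x_i; w) with
    P(y | x; w) = sigmoid(y * w.x) and labels y in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, y in data:
            # d/dw log sigmoid(y * w.x) = (1 - sigmoid(y * w.x)) * y * x
            coeff = (1.0 - sigmoid(y * sum(wi * xi for wi, xi in zip(w, x)))) * y
            grad = [g + coeff * xi for g, xi in zip(grad, x)]
        w = [wi + alpha * g for wi, g in zip(w, grad)]
    return w

# hypothetical 1-D data; the trailing 1.0 is a constant input acting as the bias
data = [([-2.0, 1.0], -1), ([-1.0, 1.0], -1), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w = batch_gradient_ascent(data)
```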

slide-22
SLIDE 22

Logistic regression: multi-class

22

Gradient-descent update of each class weight vector:

𝒘_k^{t+1} = 𝒘_k^t βˆ’ Ξ· βˆ‡_{𝒘_k} E(W^t)

βˆ‡_{𝒘_k} E(W) = βˆ‘_{n=1}^{N} ( P(z_k | 𝒙^(n); W) βˆ’ z_k^(n) ) 𝒙^(n)

slide-23
SLIDE 23

Stochastic Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

} init w
} for iter = 1, 2, …

} pick random j

w ← w + Ξ± βˆ‡ log P(y^(j) | x^(j); w)

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one

23

slide-24
SLIDE 24

Mini-Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

} init w
} for iter = 1, 2, …

} pick random subset of training examples J

w ← w + Ξ± βˆ‘_{j∈J} βˆ‡ log P(y^(j) | x^(j); w)

Observation: the gradient over a small set of training examples (= mini-batch) can be computed cheaply, so might as well use that instead of a single example

24
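A sketch of the mini-batch variant (batch_size = 1 recovers the stochastic version from the previous slide; the data and hyperparameters are illustrative):

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def minibatch_gradient_ascent(data, batch_size=2, alpha=0.5, iters=2000, seed=0):
    """Each step ascends the log-likelihood gradient summed over a random
    mini-batch J; batch_size=1 recovers plain stochastic gradient ascent."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, y in rng.sample(data, batch_size):
            coeff = (1.0 - sigmoid(y * sum(wi * xi for wi, xi in zip(w, x)))) * y
            grad = [g + coeff * xi for g, xi in zip(grad, x)]
        w = [wi + alpha * g for wi, g in zip(w, grad)]
    return w

data = [([-2.0, 1.0], -1), ([-1.0, 1.0], -1), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w = minibatch_gradient_ascent(data)
```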

slide-25
SLIDE 25

Networks with hidden units

} Networks without hidden units are very limited in the input-output mappings they can learn to model.

} A simple function such as XOR cannot be modeled with a single-layer network.

} More layers of linear units do not help. It’s still linear.
} Fixed output non-linearities are not enough.

} We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

25

slide-26
SLIDE 26

Neural Networks

26

slide-27
SLIDE 27

The multi-layer perceptron

27

slide-28
SLIDE 28

Neural Networks Properties

} Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

} Practical considerations

} Can be seen as learning the features
} Large number of neurons

} Danger of overfitting
} (hence early stopping!)

28

slide-29
SLIDE 29

Universal Function Approximation Theorem*

} In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).

Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”
Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

29

slide-30
SLIDE 30

Universal Function Approximation Theorem*

Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”
Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

30

slide-31
SLIDE 31

Expressiveness of neural networks

31

} All Boolean functions can be represented by a network with a single hidden layer

} But it might require exponential (in the number of inputs) hidden units

} Continuous functions:

} Any continuous function on a compact domain can be approximated to an arbitrary accuracy by a network with one hidden layer [Cybenko 1989]

} Any function can be approximated to an arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

slide-32
SLIDE 32

MLP Universal Approximator

32

} A feed-forward network with a single hidden layer and linear outputs can approximate any continuous function on a compact domain to an arbitrary accuracy

} under mild assumptions on the activation function

} e.g., sigmoid activation functions (Cybenko, 1989)

} when a sufficiently large (but finite) number of hidden units is used

} It is of greater theoretical interest than practical

} the construction of such a network requires the nonlinear activation functions and weight values, which are unknown

G(x) = βˆ‘_{j=1}^{N} w_j^[2] ρ( βˆ‘_{i=0}^{d} w_ij^[1] x_i )
slide-33
SLIDE 33

MLP with single hidden layer

33

} Two-layer MLP (the number of layers of adaptive weights is counted)

p_k(𝒙) = Ο‰( βˆ‘_{j=0}^{N} w_jk^[2] z_j ) β‡’ p_k(𝒙) = Ο‰( βˆ‘_{j=0}^{N} w_jk^[2] ρ( βˆ‘_{i=0}^{d} w_ij^[1] x_i ) )

(figure: input layer x_0 = 1, x_1 … x_d; hidden units with activation ρ and weights w_ij^[1]; output units p_1 … p_L with activation Ο‰ and weights w_jk^[2]; i = 0, …, d; j = 0 … N; k = 1, …, L)

slide-34
SLIDE 34

MLPs approximate functions

34

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-35
SLIDE 35

AND & OR networks

35

(figure: AND and OR threshold networks, each with bias βˆ’0.5)

slide-36
SLIDE 36

The perceptron is not enough

36

  • Cannot compute an XOR

(figure: inputs X and Y with unknown weights β€œ? ? ?” into a single unit)

slide-37
SLIDE 37

XOR example

37

𝑔 = π‘Œπ‘ƒπ‘† 𝑦", 𝑦# = 𝑃𝑆 𝐡𝑂𝐸 𝑦", 𝑦̅# , 𝐡𝑂𝐸 𝑦̅", 𝑦#

βˆ’0.5

𝑦" 𝑦#
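The decomposition above translates directly into a two-layer threshold network (a sketch using the βˆ’0.5 biases from the figure):

```python
def unit(inputs, weights, bias):
    """Threshold unit: fire iff the weighted sum plus bias is >= 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

def xor(x1, x2):
    """XOR(x1, x2) = OR(AND(x1, not x2), AND(not x1, x2)) as a 2-layer network."""
    h1 = unit([x1, x2], [1, -1], -0.5)   # fires iff x1 = 1 and x2 = 0
    h2 = unit([x1, x2], [-1, 1], -0.5)   # fires iff x1 = 0 and x2 = 1
    return unit([h1, h2], [1, 1], -0.5)  # OR of the two hidden units
```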

slide-38
SLIDE 38

General Boolean functions

38

} Every Boolean function can be represented by a network with a single hidden layer

1. Consider the truth table of the Boolean function
2. Write the Boolean function as an OR of ANDs, with one AND for each positive entry in the truth table
3. Construct a 2-layer network composed of an OR of ANDs (the first layer contains the ANDs, the second layer the OR)

} It might need an exponential number of hidden units

slide-39
SLIDE 39

How many layers for a Boolean MLP?

39

X1 X2 X3 X4 X5 → Y   (truth table listing the input combinations for which the output is 1)

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form:

y = XΜ„1 XΜ„2 X3 X4 XΜ„5 + XΜ„1 X2 XΜ„3 X4 X5 + XΜ„1 X2 X3 XΜ„4 XΜ„5 + X1 XΜ„2 XΜ„3 XΜ„4 X5 + X1 XΜ„2 X3 X4 X5 + X1 X2 XΜ„3 XΜ„4 X5
slide-40
SLIDES 40–46

(Slides 40 through 46 repeat the truth table and the DNF expression of slide 39, building up the hidden-layer network over X1 … X5 one AND term at a time.)

slide-47
SLIDE 47

How many layers for a Boolean MLP?

47

X1 X2 X3 X4 X5 → Y   (the same truth table and DNF expression as the previous slides)

  • Any truth table can be expressed in this manner!
  • A one-hidden-layer MLP is a Universal Boolean Function
  • But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?

slide-48
SLIDE 48

MLPs approximate functions

48

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-49
SLIDE 49

The MLP as a classifier

49

(figure: MNIST input, 784 dimensions, mapped to 2 outputs)

  • MLP as a function over real inputs
  • MLP as a function that finds a complex β€œdecision boundary” over a space of reals

slide-50
SLIDE 50

Boolean functions with a real perceptron

50

  • Boolean perceptrons are also linear classifiers

– Purple regions are 1

(figure: the four points (0,0), (0,1), (1,0), (1,1) under three different linear separations)

slide-51
SLIDE 51

Composing complicated β€œdecision” boundaries

51

  • Build a network of units with a single output that fires if the input is in the coloured area

(figure: region in the (x1, x2) plane)

Can now be composed into β€œnetworks” to compute arbitrary classification β€œboundaries”

slide-52
SLIDE 52

Booleans over the reals

52

  • The network must fire if the input is in the coloured area

(figure: coloured region in the (x1, x2) plane and the corresponding network)

slide-53
SLIDES 53–56

(Slides 53 through 56 repeat slide 52, shading one bounding half-plane of the region at a time.)

slide-57
SLIDE 57

Booleans over the reals

57

  • The network must fire if the input is in the coloured area

AND unit over five half-plane units y1 … y5: fire iff βˆ‘_{i=1}^{5} y_i β‰₯ 4.5

(figure: pentagonal region in the (x1, x2) plane)
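The AND-of-half-planes construction can be sketched for any convex region (the unit-square example is my own; the slide uses a pentagon, hence the 4.5 threshold for five units):

```python
def network_fires(x1, x2, halfplanes):
    """One hidden threshold unit per edge: y_i = 1 iff a*x1 + b*x2 + c >= 0.
    The output unit ANDs the n hidden units: fire iff sum(y_i) >= n - 0.5."""
    ys = [1 if a * x1 + b * x2 + c >= 0 else 0 for a, b, c in halfplanes]
    return 1 if sum(ys) >= len(halfplanes) - 0.5 else 0

# hypothetical region: the unit square as the intersection of four half-planes
square = [(1, 0, 0), (-1, 0, 1), (0, 1, 0), (0, -1, 1)]
```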

slide-58
SLIDE 58

More complex decision boundaries

58

  • Network to fire if the input is in the yellow area

– β€œOR” two polygons
– A third layer is required

(figure: two AND subnetworks over (x1, x2) combined by an OR unit)

slide-59
SLIDE 59

Complex decision boundaries

59

  • Can compose arbitrarily complex decision boundaries

– With only one hidden layer!
– How?

(figure: AND and OR network over x1, x2)

slide-60
SLIDE 60

60

MLP with Different Number of Layers

Structure | Type of decision region | Interpretation
Single layer (no hidden layer) | Half space | Region found by a hyperplane
Two layer (one hidden layer) | Polyhedral (open or closed) region | Intersection of half spaces
Three layer (two hidden layers) | Arbitrary regions | Union of polyhedra

(MLP with unit-step activation function; decision region found by an output unit.)

slide-61
SLIDE 61

Exercise: compose this with one hidden layer

61

  • How would you compose the decision boundary to the left with only one hidden layer?

(figure: decision region in the (x1, x2) plane)

slide-62
SLIDE 62

MLPs approximate functions

62

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-63
SLIDE 63

MLP as a continuous-valued regression

63

  • A simple 3-unit MLP with a β€œsumming” output unit can generate a β€œsquare pulse” over an input

– Output is 1 only if the input lies between T1 and T2
– T1 and T2 can be arbitrarily specified

(figure: two threshold units at T1 and T2 with output weights +1 and βˆ’1, summed into f(x))
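The square-pulse construction, and the sum-of-pulses approximation described on the next slide, can be sketched as follows (helper names are mine):

```python
def step(t):
    return 1 if t >= 0 else 0

def square_pulse(x, t1, t2):
    """Two threshold units whose outputs are summed with weights +1 and -1:
    the result is 1 exactly when t1 <= x < t2."""
    return step(x - t1) - step(x - t2)

def pulse_approx(f, x, grid):
    """Sum of narrow pulses, each weighted by f at the pulse centre."""
    return sum(f((a + b) / 2) * square_pulse(x, a, b)
               for a, b in zip(grid, grid[1:]))
```

Narrowing the grid spacing tightens the approximation of f.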

slide-64
SLIDE 64

MLP as a continuous-valued regression

64

(figure: many shifted square pulses summed over the input x)

  • A simple 3-unit MLP can generate a β€œsquare pulse” over an input
  • An MLP with many units can model an arbitrary function over an input

– To arbitrary precision

  • Simply make the individual pulses narrower
  • A one-hidden-layer MLP can model an arbitrary function of a single input
slide-65
SLIDE 65

Summary

} MLPs are universal Boolean functions
} MLPs are universal classifiers
} MLPs are universal function approximators
} An MLP with two (or even one) hidden layers can approximate anything to arbitrary precision

} But it could be exponentially or even infinitely wide in its input size

65

slide-66
SLIDE 66

Multi-class Logistic Regression

} = special case of neural network

(figure: features f1(x) f2(x) f3(x) … fK(x) feed linear scores z1 z2 z3, then a softmax layer)

P(y1|x; w) = e^{z1} / (e^{z1} + e^{z2} + e^{z3})
P(y2|x; w) = e^{z2} / (e^{z1} + e^{z2} + e^{z3})
P(y3|x; w) = e^{z3} / (e^{z1} + e^{z2} + e^{z3})

66

slide-67
SLIDE 67

Deep Neural Network = Also learn the features!

(same figure as the previous slide: f1(x) … fK(x) → z1 z2 z3 → softmax → P(yk|x; w))
…

67

slide-68
SLIDE 68

Deep Neural Network = Also learn the features!

(figure: inputs x1 … xL feed hidden layers z^(1), z^(2), …, z^(nβˆ’1); the learned features f1(x) … fK(x) feed output scores and a softmax layer)

P(y1|x; w) = e^{z1} / (e^{z1} + e^{z2} + e^{z3}), and similarly for P(y2|x; w), P(y3|x; w)

z_i^(k) = g( βˆ‘_j W_{i,j}^(kβˆ’1,k) z_j^(kβˆ’1) )

g = nonlinear activation function

68

slide-69
SLIDE 69

Deep Neural Network = Also learn the features!

(same figure as the previous slide, now with the final hidden layer z^(n) drawn before the softmax output)

z_i^(k) = g( βˆ‘_j W_{i,j}^(kβˆ’1,k) z_j^(kβˆ’1) ),   g = nonlinear activation function

69
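The layer recursion z_i^(k) = g(βˆ‘_j W_ij z_j^(kβˆ’1)) is a short loop (a sketch assuming a sigmoid activation; biases are omitted, as in the formula above):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def deep_forward(layers, x, g=sigmoid):
    """Forward pass of z_i^(k) = g(sum_j W_ij^(k-1,k) * z_j^(k-1)).
    `layers` is a list of weight matrices, each a list of rows."""
    z = x
    for W in layers:
        z = [g(sum(w * v for w, v in zip(row, z))) for row in W]
    return z
```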

slide-70
SLIDE 70

Common Activation Functions

[source: MIT 6.S191 introtodeeplearning.com]

70

slide-71
SLIDE 71

How about computing all the derivatives?

} We’ll talk about that once we have covered neural networks, which are a generalization of logistic regression

71

slide-72
SLIDE 72

Learning problem

} Given: the architecture of the network
} Training data: a set of input-output pairs

(𝒙^(1), 𝒛^(1)), (𝒙^(2), 𝒛^(2)), …, (𝒙^(N), 𝒛^(N))

} We want to find the function f on the input space that produces the output

} We consider a neural network as a parametric function f(𝒙; 𝑾)

72

slide-73
SLIDE 73

Problem setup

} Given: the architecture of the network
} Training data: a set of input-output pairs

(𝒙^(1), 𝒛^(1)), (𝒙^(2), 𝒛^(2)), …, (𝒙^(N), 𝒛^(N))

} We want to find the function f

} We consider a neural network as a parametric function f(𝒙; 𝑾)

} We need a loss function that penalizes the obtained output f(𝒙; 𝑾) when the desired output is 𝒛:

(1/N) βˆ‘_{n=1}^{N} loss( f(𝒙^(n); 𝑾), 𝒛^(n) )

73

slide-74
SLIDE 74

Choosing cost function: Examples

74

} Regression problem

} SSE

} Classification problem

} Cross-entropy

π‘šπ‘π‘‘π‘‘Z = 7 βˆ’π‘§v

Z log 𝑝v (Z) u vS"

= βˆ’log 𝑝q ˜

(Z)

Output is found by a softmax layer 𝑝v =

β„’Ε‘t βˆ‘ β„’Ε‘t

β€Ί tΕ“β€’

𝐹 = 7 𝐹Z

” ZS"

𝐹Z = 7 𝑝v

Z βˆ’ 𝑧v Z # u vS"

𝑧" 𝑧u 𝑦" 𝑦$
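Both cost functions can be sketched in a few lines (a minimal illustration; the max-subtraction in the softmax is a standard stability trick):

```python
import math

def softmax(scores):
    m = max(scores)                     # subtract max: standard overflow guard
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, z):
    """loss_n = -sum_k z_k log p_k; for a one-hot target z this is -log p_true."""
    return -sum(zk * math.log(pk) for pk, zk in zip(p, z))

def sse(p, z):
    """Sum-of-squared-errors cost for regression-style targets."""
    return sum((pk - zk) ** 2 for pk, zk in zip(p, z))
```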

slide-75
SLIDE 75

How to adjust weights for multi layer networks?

} We need multiple layers of adaptive, non-linear hidden

  • units. But how can we train such nets?

} We need an efficient way of adapting all the weights, not just

the last layer.

} Learning the weights going into hidden units is equivalent to

learning features.

} This is difficult because nobody is telling us directly what the

hidden units should do.

75

slide-76
SLIDE 76

Find the weights by optimizing the cost

76

} Start from random weights and then adjust them iteratively to get lower cost.
} Update the weights according to the gradient of the cost function.

Source: http://3b1b.co

slide-77
SLIDE 77

How does the network learn?

77

} Which changes to the weights improve the most?
} The magnitude of each element of the gradient βˆ‡E shows how sensitive the cost is to that weight or bias.

Source: http://3b1b.co

slide-78
SLIDE 78

Training multi-layer networks

78

} Back-propagation

} Training algorithm that is used to adjust weights in multi-layer networks (based on the training data)

} The back-propagation algorithm is based on gradient descent
} Uses the chain rule and dynamic programming to efficiently compute gradients

slide-79
SLIDE 79

Training Neural Nets through Gradient Descent

79

Total training error:

E = βˆ‘_{n=1}^{N} loss( 𝒑^(n), 𝒛^(n) )

} Gradient descent algorithm
} Initialize all weights and biases w_ij^[l]

} Using the extended notation: the bias is also a weight

} Do:

} For every layer l, for all i, j, update:

} w_ij^[l] = w_ij^[l] βˆ’ Ξ· βˆ‚E/βˆ‚w_ij^[l]

} Until E has converged

slide-80
SLIDE 80

The derivative

80

  • Computing the derivative

Total training error:

E = βˆ‘_{n=1}^{N} loss( 𝒑^(n), 𝒛^(n) )

Total derivative:

βˆ‚E/βˆ‚w_ij^[l] = βˆ‘_{n=1}^{N} βˆ‚loss( 𝒑^(n), 𝒛^(n) )/βˆ‚w_ij^[l]

slide-81
SLIDE 81

Training by gradient descent

} Initialize all weights π‘₯8w

[v]

} Do :

} For all 𝑗 , π‘˜ , 𝑙, initialize

$ΕΎ $ΕΈs,z

[t] = 0

} For all π‘œ = 1: 𝑂

} For every layer 𝑙 for all 𝑗, π‘˜:

Β¨ Compute

$ £€Β₯Β₯ Β€ ˜ ,q ˜ $ΕΈs,z

[t]

Β¨

$ΕΎ $ΕΈs,z

[t] +=

$ £€Β₯Β₯ Β€ ˜ ,q ˜ $ΕΈs,z

[t]

} For every layer 𝑙 for all 𝑗, π‘˜: π‘₯8,w

[v] = π‘₯8,w [v] βˆ’ Β¦ _ $ΕΎ $ΕΈs,z

[t]

81

slide-82
SLIDE 82

How about computing all the derivatives?

n But a neural net f is never one of those elementary functions?

n No problem: CHAIN RULE:

If f(x) = g(h(x))
Then f'(x) = g'(h(x)) h'(x)

→ Derivatives can be computed by following well-defined procedures

82

slide-83
SLIDE 83

Simple chain rule

} z = g(h(x))
} y = h(x), so dz/dx = (dz/dy)(dy/dx)

83

slide-84
SLIDE 84

Multiple paths chain rule

84

slide-85
SLIDE 85

Returning to our problem

85

  • How to compute

βˆ‚loss(𝒑, 𝒛)/βˆ‚w_ij^[l]

slide-86
SLIDE 86

Backpropagation: Notation

86

} 𝒃[~] ← π½π‘œπ‘žπ‘£π‘’ } π‘π‘£π‘’π‘žπ‘£π‘’ ← 𝒃[-]

𝑔(. ) 𝑔(. ) 𝑔(. ) 𝑔(. ) 𝑔(. ) 𝒃[Β£o"]

𝒃[Β£] π’œ[Β£]

slide-87
SLIDE 87

Output as a composite function

π‘ƒπ‘£π‘’π‘žπ‘£π‘’ = 𝑏[-] = 𝑔 𝑨[-] = 𝑔 𝑋[-]𝑏[-o"] = 𝑔 𝑋[-]𝑔(𝑋[-o"]𝑏[-o#] = 𝑔 𝑋[-]𝑔 𝑋[-o"] … 𝑔 𝑋[#]𝑔 𝑋["]𝑦

For convenience, we use the same activation functions for all layers. However, output layer neurons most commonly do not need activation function (they show class scores or real-valued targets.)

𝑋["] 𝑦 Γ— 𝑔 𝑋[#] Γ— 𝑔 𝑋[-] Γ— 𝑔 𝑨["] 𝑏["] 𝑨[#] 𝑏[-o"] 𝑨[-] 𝑏[-] 𝑏[-] = π‘π‘£π‘’π‘žπ‘£π‘’

87

slide-88
SLIDE 88

Backpropagation: Last layer gradient

Forward relations for the last layer L:

a_j^[L] = g(s_j^[L]),   s_j^[L] = βˆ‘_{i=0} w_ij^[L] a_i^[Lβˆ’1],   p_j = a_j^[L]   (output j)

For squared error loss:

loss = Β½ βˆ‘_j ( p_j βˆ’ z_j )²   ⇒   βˆ‚loss/βˆ‚p_j = p_j βˆ’ z_j

By the chain rule:

βˆ‚loss/βˆ‚w_ij^[L] = (βˆ‚loss/βˆ‚p_j) (βˆ‚p_j/βˆ‚w_ij^[L])

βˆ‚p_j/βˆ‚s_j^[L] = g'(s_j^[L]),   βˆ‚s_j^[L]/βˆ‚w_ij^[L] = a_i^[Lβˆ’1]

βˆ‚loss/βˆ‚w_ij^[L] = (βˆ‚loss/βˆ‚p_j) g'(s_j^[L]) a_i^[Lβˆ’1]

88

slide-89
SLIDE 89

Activations and their derivatives

89

  • Some popular activation functions and their derivatives (figure)

slide-90
SLIDE 90

Previous layers gradients

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£] = πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

Β£

πœ–π‘w

Β£

πœ–π‘₯8w

[Β£]

πœ–π‘[Β£] πœ–π‘₯8w

[Β£] =

πœ–π‘w

[Β£]

πœ–π‘¨

w [Β£] Γ—

πœ–π‘¨

w [Β£]

πœ–π‘₯8w

[Β£] = 𝑔³ 𝑨 w [Β£] 𝑏8 [Β£o"]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘8

[Β£o"] = 7 πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

[Β£] Γ—

πœ–π‘w

[Β£]

πœ–π‘¨

w [Β£] Γ—

πœ–π‘¨

w [Β£]

πœ–π‘8

[Β£o"] $[Β΄] wS"

= 7 πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘w

[Β£] ×𝑔³ 𝑨 w [Β£] Γ—π‘₯8w [Β£] $[Β΄] wS"

90

𝑏8

[Β£o"]

𝑨

w [Β£]

𝑏w

[Β£]

𝑔 π‘₯8w

[Β£]

𝑏8

[Β£o"]

𝑏w

[Β£]

π‘₯8w

[Β£]

𝑨

w [Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘w

Β£

𝑏8

[Β£] = 𝑔 𝑨8 [Β£]

𝑨

w [Β£] = 7 π‘₯8w [Β£]𝑏8 [Β£o"]

  • 8S~

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘$[Β΄]

[Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘8

[Β£o

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘"

[Β£]

slide-91
SLIDE 91

Backpropagation:

91

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£] = πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

[Β£] Γ—

πœ–π‘w

[Β£]

πœ–π‘₯8w

[Β£]

= πœ€

w [Β£]×𝑏8 [Β£o"]×𝑔³ 𝑨 w [Β£]

} πœ€

w [Β£] = ΒΆ £€Β₯Β₯ ΒΆΒ·z

[Β΄] is the sensitivity of the output to 𝑏w

[Β£]

} Sensitivity vectors can be obtained by running a backward process in the

network architecture (hence the name backpropagation.) 𝑏8

[Β£o"]

𝑨

w [Β£]

𝑏w

[Β£]

𝑔 𝑏8

[Β£] = 𝑔 𝑨8 [Β£]

𝑨

w [Β£] = 7 π‘₯8w [Β£]𝑏8 [Β£o"]

  • 8S~

π‘₯8w

[Β£]

We will compute 𝜺[£o"] from 𝜺[£]:

πœ€8

[Β£o"] = 7 πœ€ w [Β£]×𝑔³ 𝑨 w [Β£] Γ—π‘₯8w [Β£] $[Β΄] wS"

slide-92
SLIDE 92

Backpropagation Algorithm

92

} Initialize all weights to small random numbers.
} While not satisfied
} For each training example do:

1. Feed forward the training example to the network and compute the outputs of all units in the forward step (s and a) and the loss

2. For each unit find its Ρ in the backward step

3. Update each network weight w_ij^[l] as

w_ij^[l] ← w_ij^[l] βˆ’ Ξ· βˆ‚loss/βˆ‚w_ij^[l],   where βˆ‚loss/βˆ‚w_ij^[l] = Ξ΅_j^[l] Γ— a_i^[lβˆ’1] Γ— g'(s_j^[l])
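The whole procedure can be sketched for a tiny network with one sigmoid hidden layer and a linear squared-error output (a minimal illustration with my own variable names, not the lecture's exact notation; biases are folded in as trailing weights):

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W1, W2, x):
    """Forward step: sigmoid hidden activations a, linear output p.
    Each weight row has a trailing bias; a constant 1 is appended to its input."""
    xb = x + [1.0]
    a = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    p = sum(w * v for w, v in zip(W2, a + [1.0]))
    return a, p

def backprop(W1, W2, x, z):
    """Gradients of loss = 0.5*(p - z)^2 for all weights, via the chain rule."""
    xb = x + [1.0]
    a, p = forward(W1, W2, x)
    eps = p - z                                  # dloss/dp (output sensitivity)
    gW2 = [eps * v for v in a + [1.0]]
    gW1 = [[eps * W2[j] * a[j] * (1 - a[j]) * v for v in xb]
           for j in range(len(W1))]
    return gW1, gW2

# hypothetical tiny network: 2 inputs, 2 hidden units, 1 linear output
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [rng.uniform(-1, 1) for _ in range(3)]
x, z = [0.3, -0.7], 0.5
gW1, gW2 = backprop(W1, W2, x, z)
```

A useful sanity check on any backprop implementation is to compare the analytic gradients against finite differences of the loss.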

slide-93
SLIDE 93

Automatic Differentiation

} Automatic differentiation software

} e.g. Theano, TensorFlow, PyTorch, Chainer

} Only need to program the function g(x, y, w)
} Can automatically compute all derivatives w.r.t. all entries in w
} This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = β€œbackpropagation”

} Autodiff / Backpropagation can often be done at a computational cost comparable to the forward pass

93