Neural Networks CE417: Introduction to Artificial Intelligence



slide-1
SLIDE 1

Neural Networks

CE417: Introduction to Artificial Intelligence Sharif University of Technology Fall 2019

Soleymani

Some slides are based on Anca Dragan’s slides, CS188, UC Berkeley and some adapted from Bhiksha Raj, 11-785, CMU 2019.

slide-2
SLIDE 2

2

slide-3
SLIDE 3

McCulloch-Pitts neuron: binary threshold

3

Inputs x_1, …, x_d with weights w_1, …, w_d feed a summed input to a unit z:

z = 1 if βˆ‘_i w_i x_i β‰₯ T, and z = 0 if βˆ‘_i w_i x_i < T   (T: activation threshold)

Equivalently, append a constant input 1 with bias weight b = βˆ’T, so the unit fires when βˆ‘_i w_i x_i + b β‰₯ 0.
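The two equivalent formulations above can be sketched directly in code (a minimal illustration; the function names are mine, not from the slides):

```python
def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (output 1) iff the weighted sum reaches T."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0

def mcp_neuron_bias(inputs, weights, bias):
    """Equivalent form: fold the threshold into a bias b = -T."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0
```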

slide-4
SLIDE 4

Neural nets and the brain

4

  • Neural nets are composed of networks of computational models of neurons called perceptrons

(figure: a perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

slide-5
SLIDE 5

The perceptron

5

  • A threshold unit

– β€œFires” if the weighted sum of inputs exceeds a threshold
– Electrical engineers call this a threshold gate

  • A basic unit of Boolean circuits

y = 1 if 7 π‘₯8𝑦8 β‰₯ πœ„

8

0 else

+

.... .

𝑦" 𝑦# 𝑦3 𝑦$

π‘₯" π‘₯# π‘₯3

π‘₯$

𝑐 = βˆ’πœ„

slide-6
SLIDE 6

Perceptron

6

z = βˆ‘_i w_i x_i + b

(figure: perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

} Learn this function

} A step function across a hyperplane

slide-7
SLIDE 7

Learning the perceptron

7

  • Given a number of input-output pairs, learn the weights and bias

– Learn W = [w_1, …, w_d] and b, given several (𝒙, z) pairs

z = 1 if βˆ‘_i w_i x_i + b β‰₯ 0, and z = 0 otherwise

(figure: perceptron with inputs x_1 … x_d, weights w_1 … w_d, and bias b)

slide-8
SLIDE 8

Restating the perceptron

8

x_1 x_2 x_3 … x_d, plus a constant input x_{d+1} = 1 with weight w_{d+1} = b

} Restating the perceptron equation by adding another dimension to 𝒙:

z = 1 if βˆ‘_{i=1}^{d+1} w_i x_i β‰₯ 0, and z = 0 otherwise, where x_{d+1} = 1

} Note that the boundary βˆ‘_{i=1}^{d+1} w_i x_i = 0 is now a hyperplane through the origin

slide-9
SLIDE 9

The Perceptron Problem

9

  • Find the hyperplane

βˆ‘_{i=1}^{d+1} w_i x_i = 0

that perfectly separates the two groups of points

slide-10
SLIDE 10

Perceptron Algorithm: Summary

10

  • Cycle through the training instances
  • Only update the weight vector 𝒘 on misclassified instances
  • If an instance is misclassified:

– If the instance is positive class: 𝒘 = 𝒘 + 𝒙

– If the instance is negative class: 𝒘 = 𝒘 βˆ’ 𝒙

slide-11
SLIDE 11

Training of Single Layer

11

} Weight update for a training pair (𝒙^(n), z^(n)):

𝒘^{t+1} = 𝒘^t βˆ’ Ξ· βˆ‡E_n(𝒘^t)

} Perceptron: If sign(𝒘ဠ𝒙^(n)) β‰  z^(n) then

βˆ‡E_n(𝒘) = βˆ’π’™^(n) z^(n)   (E_n(𝒘) = βˆ’π’˜α€π’™^(n) z^(n) if misclassified)

} ADALINE:

βˆ‡E_n(𝒘) = βˆ’(z^(n) βˆ’ 𝒘ဠ𝒙^(n)) 𝒙^(n)   (E_n(𝒘) = Β½ (z^(n) βˆ’ 𝒘ဠ𝒙^(n))Β²)

} Widrow-Hoff, LMS, or delta rule
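The ADALINE / delta-rule update can be sketched as follows (a toy illustration with made-up data fitting a line; the learning rate and epoch count are arbitrary choices):

```python
def adaline_epoch(w, samples, lr=0.05):
    """One pass of the Widrow-Hoff (LMS / delta) rule:
    w <- w + lr * (z - w.x) * x, applied to every sample."""
    for x, z in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear output, no threshold
        w = [wi + lr * (z - y) * xi for wi, xi in zip(w, x)]
    return w

# hypothetical data from z = 2*x + 1; the trailing 1 plays the role of the bias
samples = [([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]
w = [0.0, 0.0]
for _ in range(2000):
    w = adaline_epoch(w, samples)
```

Unlike the perceptron rule, the update fires on every sample, not just misclassified ones, which is what lets the same rule handle regression.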

slide-12
SLIDE 12

Perceptron vs. Delta Rule

12

} Perceptron learning rule:

} guaranteed to succeed if the training examples are linearly separable

} Delta rule:

} guaranteed to converge to the hypothesis with the minimum squared error

} can also be used for regression problems

slide-13
SLIDE 13

Reminder: Linear Classifiers

Β§ Inputs are feature values
Β§ Each feature has a weight
Β§ Sum is the activation
Β§ If the activation is:

Β§ Positive, output +1
Β§ Negative, output -1

(figure: features f1 f2 f3 with weights w1 w2 w3 into a sum Ξ£, then a >0? test)

13

slide-14
SLIDE 14

The β€œsoft” perceptron (logistic)

14

(figure: sigmoid unit with inputs x_1 … x_d, weights w_1 … w_d)

z = βˆ‘_i w_i x_i βˆ’ T

y = 1 / (1 + exp(βˆ’z))

  • A β€œsquashing” function instead of a threshold at the output

– The sigmoid β€œactivation” replaces the threshold

  • Activation: the function that acts on the weighted combination of inputs (and threshold); bias c = βˆ’T
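A sketch of the β€œsoft” perceptron, assuming the logistic squashing described above (the function name is mine):

```python
import math

def soft_perceptron(x, w, threshold):
    """Logistic ("soft") perceptron: sigmoid of the weighted sum minus T."""
    z = sum(wi * xi for wi, xi in zip(w, x)) - threshold
    return 1.0 / (1.0 + math.exp(-z))
```

The output varies smoothly between 0 and 1 instead of jumping at the threshold.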

slide-15
SLIDE 15

Sigmoid neurons

} These give a real-valued output that is a smooth and bounded function of their total input.
} Typically they use the logistic function.
} They have nice derivatives.

15

slide-16
SLIDE 16

Other β€œactivations”

16

(figure: unit with inputs x_1 … x_d, weights w_1 … w_d, and bias c)

  • Does not always have to be a squashing function

– We will hear more about activations later

  • We will continue to assume a β€œthreshold” activation in this lecture

tanh: tanh(z)    softplus: log(1 + e^z)    sigmoid: 1 / (1 + exp(βˆ’z))

slide-17
SLIDE 17

How to get probabilistic decisions?

} Activation:
} If z = 𝒘ဠ𝒙 is very positive → want probability going to 1
} If z = 𝒘ဠ𝒙 is very negative → want probability going to 0
} Sigmoid function:

Ο†(z) = 1 / (1 + e^{βˆ’z})

17

slide-18
SLIDE 18

Best w?

} Maximum likelihood estimation:

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

with:

P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{βˆ’π’˜α€π’™^(i)})

P(y^(i) = βˆ’1 | x^(i); w) = 1 βˆ’ 1 / (1 + e^{βˆ’π’˜α€π’™^(i)})

= Logistic Regression

18

slide-19
SLIDE 19

Multiclass Logistic Regression

} Multi-class linear classification

} A weight vector for each class: 𝒘_y

} Score (activation) of a class y: 𝒘_yဠ𝒙

} Prediction: the class with the highest score wins

} How to make the scores into probabilities?

z1, z2, z3 → e^{z1} / (e^{z1} + e^{z2} + e^{z3}), e^{z2} / (e^{z1} + e^{z2} + e^{z3}), e^{z3} / (e^{z1} + e^{z2} + e^{z3})

original activations → softmax activations

19

slide-20
SLIDE 20

Best w?

} Maximum likelihood estimation:

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

with:

P(y^(i) | x^(i); w) = e^{𝒘_{y^(i)} Β· f(x^(i))} / βˆ‘_y e^{𝒘_y Β· f(x^(i))}

= Multi-Class Logistic Regression

20

slide-21
SLIDE 21

Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)   ≑ g(w)

} init w
} for iter = 1, 2, …

w ← w + Ξ± βˆ‘_i βˆ‡ log P(y^(i) | x^(i); w)

21
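The batch update above, specialized to binary logistic regression with labels in {+1, βˆ’1}, might look like this (a sketch; the data, step size, and iteration count are illustrative):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def batch_gradient_ascent(data, alpha=0.5, iters=500):
    """Full-batch ascent on ll(w) = sum_i log P(y_i | x_i; w) with
    P(y | x; w) = sigmoid(y * w.x) and labels y in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, y in data:
            # d/dw log sigmoid(y * w.x) = (1 - sigmoid(y * w.x)) * y * x
            coeff = (1.0 - sigmoid(y * sum(wi * xi for wi, xi in zip(w, x)))) * y
            grad = [g + coeff * xi for g, xi in zip(grad, x)]
        w = [wi + alpha * g for wi, g in zip(w, grad)]
    return w

# hypothetical 1-D data; the trailing 1.0 is a constant input acting as the bias
data = [([-2.0, 1.0], -1), ([-1.0, 1.0], -1), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w = batch_gradient_ascent(data)
```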

slide-22
SLIDE 22

Logistic regression: multi-class

22

Gradient-descent update of each class weight vector:

𝒘_k^{t+1} = 𝒘_k^t βˆ’ Ξ· βˆ‡_{𝒘_k} E(W^t)

βˆ‡_{𝒘_k} E(W) = βˆ‘_{n=1}^{N} ( P(z_k | 𝒙^(n); W) βˆ’ z_k^(n) ) 𝒙^(n)

slide-23
SLIDE 23

Stochastic Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

} init w
} for iter = 1, 2, …

} pick random j

w ← w + Ξ± βˆ‡ log P(y^(j) | x^(j); w)

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one

23

slide-24
SLIDE 24

Mini-Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w βˆ‘_i log P(y^(i) | x^(i); w)

} init w
} for iter = 1, 2, …

} pick random subset of training examples J

w ← w + Ξ± βˆ‘_{j∈J} βˆ‡ log P(y^(j) | x^(j); w)

Observation: the gradient over a small set of training examples (= mini-batch) can be computed cheaply, so might as well use that instead of a single example

24
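A sketch of the mini-batch variant (batch_size = 1 recovers the stochastic version from the previous slide; the data and hyperparameters are illustrative):

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def minibatch_gradient_ascent(data, batch_size=2, alpha=0.5, iters=2000, seed=0):
    """Each step ascends the log-likelihood gradient summed over a random
    mini-batch J; batch_size=1 recovers plain stochastic gradient ascent."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, y in rng.sample(data, batch_size):
            coeff = (1.0 - sigmoid(y * sum(wi * xi for wi, xi in zip(w, x)))) * y
            grad = [g + coeff * xi for g, xi in zip(grad, x)]
        w = [wi + alpha * g for wi, g in zip(w, grad)]
    return w

data = [([-2.0, 1.0], -1), ([-1.0, 1.0], -1), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
w = minibatch_gradient_ascent(data)
```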

slide-25
SLIDE 25

Networks with hidden units

} Networks without hidden units are very limited in the input-output mappings they can learn to model.

} A simple function such as XOR cannot be modeled with a single-layer network.

} More layers of linear units do not help. It’s still linear.
} Fixed output non-linearities are not enough.

} We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

25

slide-26
SLIDE 26

Neural Networks

26

slide-27
SLIDE 27

The multi-layer perceptron

27

slide-28
SLIDE 28

Neural Networks Properties

} Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

} Practical considerations

} Can be seen as learning the features
} Large number of neurons

} Danger of overfitting
} (hence early stopping!)

28

slide-29
SLIDE 29

Universal Function Approximation Theorem*

} In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).

Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”
Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

29

slide-30
SLIDE 30

Universal Function Approximation Theorem*

Cybenko (1989) β€œApproximations by superpositions of sigmoidal functions”
Hornik (1991) β€œApproximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991) β€œMultilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

30

slide-31
SLIDE 31

Expressiveness of neural networks

31

} All Boolean functions can be represented by a network with a single hidden layer

} But it might require exponential (in the number of inputs) hidden units

} Continuous functions:

} Any continuous function on a compact domain can be approximated to an arbitrary accuracy by a network with one hidden layer [Cybenko 1989]

} Any function can be approximated to an arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

slide-32
SLIDE 32

MLP Universal Approximator

32

} A feed-forward network with a single hidden layer and linear outputs can approximate any continuous function on a compact domain to an arbitrary accuracy

} under mild assumptions on the activation function

} e.g., sigmoid activation functions (Cybenko, 1989)

} when a sufficiently large (but finite) number of hidden units is used

} It is of greater theoretical interest than practical

} the construction of such a network requires the nonlinear activation functions and weight values, which are unknown

G(x) = βˆ‘_{j=1}^{N} w_j^[2] ρ( βˆ‘_{i=0}^{d} w_ij^[1] x_i )
slide-33
SLIDE 33

MLP with single hidden layer

33

} Two-layer MLP (the number of layers of adaptive weights is counted)

p_k(𝒙) = Ο‰( βˆ‘_{j=0}^{N} w_jk^[2] z_j ) β‡’ p_k(𝒙) = Ο‰( βˆ‘_{j=0}^{N} w_jk^[2] ρ( βˆ‘_{i=0}^{d} w_ij^[1] x_i ) )

(figure: input layer x_0 = 1, x_1 … x_d; hidden units with activation ρ and weights w_ij^[1]; output units p_1 … p_L with activation Ο‰ and weights w_jk^[2]; i = 0, …, d; j = 0 … N; k = 1, …, L)

slide-34
SLIDE 34

MLPs approximate functions

34

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-35
SLIDE 35

AND & OR networks

35

(figure: AND and OR threshold networks, each with bias βˆ’0.5)

slide-36
SLIDE 36

The perceptron is not enough

36

  • Cannot compute an XOR

(figure: inputs X and Y with unknown weights β€œ? ? ?” into a single unit)

slide-37
SLIDE 37

XOR example

37

𝑔 = π‘Œπ‘ƒπ‘† 𝑦", 𝑦# = 𝑃𝑆 𝐡𝑂𝐸 𝑦", 𝑦̅# , 𝐡𝑂𝐸 𝑦̅", 𝑦#

βˆ’0.5

𝑦" 𝑦#
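The decomposition above translates directly into a two-layer threshold network (a sketch using the βˆ’0.5 biases from the figure):

```python
def unit(inputs, weights, bias):
    """Threshold unit: fire iff the weighted sum plus bias is >= 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

def xor(x1, x2):
    """XOR(x1, x2) = OR(AND(x1, not x2), AND(not x1, x2)) as a 2-layer network."""
    h1 = unit([x1, x2], [1, -1], -0.5)   # fires iff x1 = 1 and x2 = 0
    h2 = unit([x1, x2], [-1, 1], -0.5)   # fires iff x1 = 0 and x2 = 1
    return unit([h1, h2], [1, 1], -0.5)  # OR of the two hidden units
```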

slide-38
SLIDE 38

General Boolean functions

38

} Every Boolean function can be represented by a network with a single hidden layer

1. Consider the truth table of the Boolean function
2. Write the Boolean function as an OR of ANDs, with one AND for each positive entry in the truth table
3. Construct a 2-layer network composed of an OR of ANDs (the first layer contains the ANDs, the second layer the OR)

} It might need an exponential number of hidden units

slide-39
SLIDE 39

How many layers for a Boolean MLP?

39

X1 X2 X3 X4 X5 → Y   (truth table listing the input combinations for which the output is 1)

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form:

y = XΜ„1 XΜ„2 X3 X4 XΜ„5 + XΜ„1 X2 XΜ„3 X4 X5 + XΜ„1 X2 X3 XΜ„4 XΜ„5 + X1 XΜ„2 XΜ„3 XΜ„4 X5 + X1 XΜ„2 X3 X4 X5 + X1 X2 XΜ„3 XΜ„4 X5
slide-40
SLIDES 40–46

(Slides 40 through 46 repeat the truth table and the DNF expression of slide 39, building up the hidden-layer network over X1 … X5 one AND term at a time.)

slide-47
SLIDE 47

How many layers for a Boolean MLP?

47

X1 X2 X3 X4 X5 → Y   (the same truth table and DNF expression as the previous slides)

  • Any truth table can be expressed in this manner!
  • A one-hidden-layer MLP is a Universal Boolean Function
  • But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?

slide-48
SLIDE 48

MLPs approximate functions

48

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-49
SLIDE 49

The MLP as a classifier

49

(figure: MNIST input, 784 dimensions, mapped to 2 outputs)

  • MLP as a function over real inputs
  • MLP as a function that finds a complex β€œdecision boundary” over a space of reals

slide-50
SLIDE 50

Boolean functions with a real perceptron

50

  • Boolean perceptrons are also linear classifiers

– Purple regions are 1

(figure: the four points (0,0), (0,1), (1,0), (1,1) under three different linear separations)

slide-51
SLIDE 51

Composing complicated β€œdecision” boundaries

51

  • Build a network of units with a single output that fires if the input is in the coloured area

(figure: region in the (x1, x2) plane)

Can now be composed into β€œnetworks” to compute arbitrary classification β€œboundaries”

slide-52
SLIDE 52

Booleans over the reals

52

  • The network must fire if the input is in the coloured area

(figure: coloured region in the (x1, x2) plane and the corresponding network)

slide-53
SLIDES 53–56

(Slides 53 through 56 repeat slide 52, shading one bounding half-plane of the region at a time.)

slide-57
SLIDE 57

Booleans over the reals

57

  • The network must fire if the input is in the coloured area

AND unit over five half-plane units y1 … y5: fire iff βˆ‘_{i=1}^{5} y_i β‰₯ 4.5

(figure: pentagonal region in the (x1, x2) plane)
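The AND-of-half-planes construction can be sketched for any convex region (the unit-square example is my own; the slide uses a pentagon, hence the 4.5 threshold for five units):

```python
def network_fires(x1, x2, halfplanes):
    """One hidden threshold unit per edge: y_i = 1 iff a*x1 + b*x2 + c >= 0.
    The output unit ANDs the n hidden units: fire iff sum(y_i) >= n - 0.5."""
    ys = [1 if a * x1 + b * x2 + c >= 0 else 0 for a, b, c in halfplanes]
    return 1 if sum(ys) >= len(halfplanes) - 0.5 else 0

# hypothetical region: the unit square as the intersection of four half-planes
square = [(1, 0, 0), (-1, 0, 1), (0, 1, 0), (0, -1, 1)]
```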

slide-58
SLIDE 58

More complex decision boundaries

58

  • Network to fire if the input is in the yellow area

– β€œOR” two polygons
– A third layer is required

(figure: two AND subnetworks over (x1, x2) combined by an OR unit)

slide-59
SLIDE 59

Complex decision boundaries

59

  • Can compose arbitrarily complex decision boundaries

– With only one hidden layer!
– How?

(figure: AND and OR network over x1, x2)

slide-60
SLIDE 60

60

MLP with Different Number of Layers

Structure | Type of decision region | Interpretation
Single layer (no hidden layer) | Half space | Region found by a hyperplane
Two layer (one hidden layer) | Polyhedral (open or closed) region | Intersection of half spaces
Three layer (two hidden layers) | Arbitrary regions | Union of polyhedra

(MLP with unit-step activation function; decision region found by an output unit.)

slide-61
SLIDE 61

Exercise: compose this with one hidden layer

61

  • How would you compose the decision boundary to the left with only one hidden layer?

(figure: decision region in the (x1, x2) plane)

slide-62
SLIDE 62

MLPs approximate functions

62

  • MLPs can compose Boolean functions
  • MLPs as universal classifiers
  • MLPs can compose real-valued functions
slide-63
SLIDE 63

MLP as a continuous-valued regression

63

  • A simple 3-unit MLP with a β€œsumming” output unit can generate a β€œsquare pulse” over an input

– Output is 1 only if the input lies between T1 and T2
– T1 and T2 can be arbitrarily specified

(figure: two threshold units at T1 and T2 with output weights +1 and βˆ’1, summed into f(x))
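The square-pulse construction, and the sum-of-pulses approximation described on the next slide, can be sketched as follows (helper names are mine):

```python
def step(t):
    return 1 if t >= 0 else 0

def square_pulse(x, t1, t2):
    """Two threshold units whose outputs are summed with weights +1 and -1:
    the result is 1 exactly when t1 <= x < t2."""
    return step(x - t1) - step(x - t2)

def pulse_approx(f, x, grid):
    """Sum of narrow pulses, each weighted by f at the pulse centre."""
    return sum(f((a + b) / 2) * square_pulse(x, a, b)
               for a, b in zip(grid, grid[1:]))
```

Narrowing the grid spacing tightens the approximation of f.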

slide-64
SLIDE 64

MLP as a continuous-valued regression

64

(figure: many shifted square pulses summed over the input x)

  • A simple 3-unit MLP can generate a β€œsquare pulse” over an input
  • An MLP with many units can model an arbitrary function over an input

– To arbitrary precision

  • Simply make the individual pulses narrower
  • A one-hidden-layer MLP can model an arbitrary function of a single input
slide-65
SLIDE 65

Summary

} MLPs are universal Boolean functions
} MLPs are universal classifiers
} MLPs are universal function approximators
} An MLP with two (or even one) hidden layers can approximate anything to arbitrary precision

} But it could be exponentially or even infinitely wide in its input size

65

slide-66
SLIDE 66

Multi-class Logistic Regression

} = special case of neural network

(figure: features f1(x) f2(x) f3(x) … fK(x) feed linear scores z1 z2 z3, then a softmax layer)

P(y1|x; w) = e^{z1} / (e^{z1} + e^{z2} + e^{z3})
P(y2|x; w) = e^{z2} / (e^{z1} + e^{z2} + e^{z3})
P(y3|x; w) = e^{z3} / (e^{z1} + e^{z2} + e^{z3})

66

slide-67
SLIDE 67

Deep Neural Network = Also learn the features!

(same figure as the previous slide: f1(x) … fK(x) → z1 z2 z3 → softmax → P(yk|x; w))
…

67

slide-68
SLIDE 68

Deep Neural Network = Also learn the features!

(figure: inputs x1 … xL feed hidden layers z^(1), z^(2), …, z^(nβˆ’1); the learned features f1(x) … fK(x) feed output scores and a softmax layer)

P(y1|x; w) = e^{z1} / (e^{z1} + e^{z2} + e^{z3}), and similarly for P(y2|x; w), P(y3|x; w)

z_i^(k) = g( βˆ‘_j W_{i,j}^(kβˆ’1,k) z_j^(kβˆ’1) )

g = nonlinear activation function

68

slide-69
SLIDE 69

Deep Neural Network = Also learn the features!

(same figure as the previous slide, now with the final hidden layer z^(n) drawn before the softmax output)

z_i^(k) = g( βˆ‘_j W_{i,j}^(kβˆ’1,k) z_j^(kβˆ’1) ),   g = nonlinear activation function

69
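The layer recursion z_i^(k) = g(βˆ‘_j W_ij z_j^(kβˆ’1)) is a short loop (a sketch assuming a sigmoid activation; biases are omitted, as in the formula above):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def deep_forward(layers, x, g=sigmoid):
    """Forward pass of z_i^(k) = g(sum_j W_ij^(k-1,k) * z_j^(k-1)).
    `layers` is a list of weight matrices, each a list of rows."""
    z = x
    for W in layers:
        z = [g(sum(w * v for w, v in zip(row, z))) for row in W]
    return z
```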

slide-70
SLIDE 70

Common Activation Functions

[source: MIT 6.S191 introtodeeplearning.com]

70

slide-71
SLIDE 71

How about computing all the derivatives?

} We’ll talk about that once we have covered neural networks, which are a generalization of logistic regression

71

slide-72
SLIDE 72

Learning problem

} Given: the architecture of the network
} Training data: a set of input-output pairs

(𝒙^(1), 𝒛^(1)), (𝒙^(2), 𝒛^(2)), …, (𝒙^(N), 𝒛^(N))

} We want to find the function f on the input space that produces the output

} We consider a neural network as a parametric function f(𝒙; 𝑾)

72

slide-73
SLIDE 73

Problem setup

} Given: the architecture of the network
} Training data: a set of input-output pairs

(𝒙^(1), 𝒛^(1)), (𝒙^(2), 𝒛^(2)), …, (𝒙^(N), 𝒛^(N))

} We want to find the function f

} We consider a neural network as a parametric function f(𝒙; 𝑾)

} We need a loss function that penalizes the obtained output f(𝒙; 𝑾) when the desired output is 𝒛:

(1/N) βˆ‘_{n=1}^{N} loss( f(𝒙^(n); 𝑾), 𝒛^(n) )

73

slide-74
SLIDE 74

Choosing cost function: Examples

74

} Regression problem

} SSE

} Classification problem

} Cross-entropy

π‘šπ‘π‘‘π‘‘Z = 7 βˆ’π‘§v

Z log 𝑝v (Z) u vS"

= βˆ’log 𝑝q ˜

(Z)

Output is found by a softmax layer 𝑝v =

β„’Ε‘t βˆ‘ β„’Ε‘t

β€Ί tΕ“β€’

𝐹 = 7 𝐹Z

” ZS"

𝐹Z = 7 𝑝v

Z βˆ’ 𝑧v Z # u vS"

𝑧" 𝑧u 𝑦" 𝑦$
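Both cost functions can be sketched in a few lines (a minimal illustration; the max-subtraction in the softmax is a standard stability trick):

```python
import math

def softmax(scores):
    m = max(scores)                     # subtract max: standard overflow guard
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, z):
    """loss_n = -sum_k z_k log p_k; for a one-hot target z this is -log p_true."""
    return -sum(zk * math.log(pk) for pk, zk in zip(p, z))

def sse(p, z):
    """Sum-of-squared-errors cost for regression-style targets."""
    return sum((pk - zk) ** 2 for pk, zk in zip(p, z))
```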

slide-75
SLIDE 75

How to adjust weights for multi layer networks?

} We need multiple layers of adaptive, non-linear hidden

  • units. But how can we train such nets?

} We need an efficient way of adapting all the weights, not just

the last layer.

} Learning the weights going into hidden units is equivalent to

learning features.

} This is difficult because nobody is telling us directly what the

hidden units should do.

75

slide-76
SLIDE 76

Find the weights by optimizing the cost

76

} Start from random weights and then adjust them iteratively to get lower cost.
} Update the weights according to the gradient of the cost function.

Source: http://3b1b.co

slide-77
SLIDE 77

How does the network learn?

77

} Which changes to the weights improve the most?
} The magnitude of each element of the gradient βˆ‡E shows how sensitive the cost is to that weight or bias.

Source: http://3b1b.co

slide-78
SLIDE 78

Training multi-layer networks

78

} Back-propagation

} Training algorithm that is used to adjust weights in multi-layer networks (based on the training data)

} The back-propagation algorithm is based on gradient descent
} Uses the chain rule and dynamic programming to efficiently compute gradients

slide-79
SLIDE 79

Training Neural Nets through Gradient Descent

79

Total training error:

E = βˆ‘_{n=1}^{N} loss( 𝒑^(n), 𝒛^(n) )

} Gradient descent algorithm
} Initialize all weights and biases w_ij^[l]

} Using the extended notation: the bias is also a weight

} Do:

} For every layer l, for all i, j, update:

} w_ij^[l] = w_ij^[l] βˆ’ Ξ· βˆ‚E/βˆ‚w_ij^[l]

} Until E has converged

slide-80
SLIDE 80

The derivative

80

  • Computing the derivative

Total training error:

E = βˆ‘_{n=1}^{N} loss( 𝒑^(n), 𝒛^(n) )

Total derivative:

βˆ‚E/βˆ‚w_ij^[l] = βˆ‘_{n=1}^{N} βˆ‚loss( 𝒑^(n), 𝒛^(n) )/βˆ‚w_ij^[l]

slide-81
SLIDE 81

Training by gradient descent

} Initialize all weights π‘₯8w

[v]

} Do :

} For all 𝑗 , π‘˜ , 𝑙, initialize

$ΕΎ $ΕΈs,z

[t] = 0

} For all π‘œ = 1: 𝑂

} For every layer 𝑙 for all 𝑗, π‘˜:

Β¨ Compute

$ £€Β₯Β₯ Β€ ˜ ,q ˜ $ΕΈs,z

[t]

Β¨

$ΕΎ $ΕΈs,z

[t] +=

$ £€Β₯Β₯ Β€ ˜ ,q ˜ $ΕΈs,z

[t]

} For every layer 𝑙 for all 𝑗, π‘˜: π‘₯8,w

[v] = π‘₯8,w [v] βˆ’ Β¦ _ $ΕΎ $ΕΈs,z

[t]

81

slide-82
SLIDE 82

How about computing all the derivatives?

n But a neural net f is never one of those elementary functions?

n No problem: CHAIN RULE:

If f(x) = g(h(x))
Then f'(x) = g'(h(x)) h'(x)

→ Derivatives can be computed by following well-defined procedures

82

slide-83
SLIDE 83

Simple chain rule

} z = g(h(x))
} y = h(x), so dz/dx = (dz/dy)(dy/dx)

83

slide-84
SLIDE 84

Multiple paths chain rule

84

slide-85
SLIDE 85

Returning to our problem

85

  • How to compute

βˆ‚loss(𝒑, 𝒛)/βˆ‚w_ij^[l]

slide-86
SLIDE 86

Backpropagation: Notation

86

} 𝒃[~] ← π½π‘œπ‘žπ‘£π‘’ } π‘π‘£π‘’π‘žπ‘£π‘’ ← 𝒃[-]

𝑔(. ) 𝑔(. ) 𝑔(. ) 𝑔(. ) 𝑔(. ) 𝒃[Β£o"]

𝒃[Β£] π’œ[Β£]

slide-87
SLIDE 87

Output as a composite function

π‘ƒπ‘£π‘’π‘žπ‘£π‘’ = 𝑏[-] = 𝑔 𝑨[-] = 𝑔 𝑋[-]𝑏[-o"] = 𝑔 𝑋[-]𝑔(𝑋[-o"]𝑏[-o#] = 𝑔 𝑋[-]𝑔 𝑋[-o"] … 𝑔 𝑋[#]𝑔 𝑋["]𝑦

For convenience, we use the same activation functions for all layers. However, output layer neurons most commonly do not need activation function (they show class scores or real-valued targets.)

𝑋["] 𝑦 Γ— 𝑔 𝑋[#] Γ— 𝑔 𝑋[-] Γ— 𝑔 𝑨["] 𝑏["] 𝑨[#] 𝑏[-o"] 𝑨[-] 𝑏[-] 𝑏[-] = π‘π‘£π‘’π‘žπ‘£π‘’

87

slide-88
SLIDE 88

Backpropagation: Last layer gradient

Forward relations for the last layer L:

a_j^[L] = g(s_j^[L]),   s_j^[L] = βˆ‘_{i=0} w_ij^[L] a_i^[Lβˆ’1],   p_j = a_j^[L]   (output j)

For squared error loss:

loss = Β½ βˆ‘_j ( p_j βˆ’ z_j )²   ⇒   βˆ‚loss/βˆ‚p_j = p_j βˆ’ z_j

By the chain rule:

βˆ‚loss/βˆ‚w_ij^[L] = (βˆ‚loss/βˆ‚p_j) (βˆ‚p_j/βˆ‚w_ij^[L])

βˆ‚p_j/βˆ‚s_j^[L] = g'(s_j^[L]),   βˆ‚s_j^[L]/βˆ‚w_ij^[L] = a_i^[Lβˆ’1]

βˆ‚loss/βˆ‚w_ij^[L] = (βˆ‚loss/βˆ‚p_j) g'(s_j^[L]) a_i^[Lβˆ’1]

88

slide-89
SLIDE 89

Activations and their derivatives

89

  • Some popular activation functions and their derivatives (figure)

slide-90
SLIDE 90

Previous layers gradients

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£] = πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

Β£

πœ–π‘w

Β£

πœ–π‘₯8w

[Β£]

πœ–π‘[Β£] πœ–π‘₯8w

[Β£] =

πœ–π‘w

[Β£]

πœ–π‘¨

w [Β£] Γ—

πœ–π‘¨

w [Β£]

πœ–π‘₯8w

[Β£] = 𝑔³ 𝑨 w [Β£] 𝑏8 [Β£o"]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘8

[Β£o"] = 7 πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

[Β£] Γ—

πœ–π‘w

[Β£]

πœ–π‘¨

w [Β£] Γ—

πœ–π‘¨

w [Β£]

πœ–π‘8

[Β£o"] $[Β΄] wS"

= 7 πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘w

[Β£] ×𝑔³ 𝑨 w [Β£] Γ—π‘₯8w [Β£] $[Β΄] wS"

90

𝑏8

[Β£o"]

𝑨

w [Β£]

𝑏w

[Β£]

𝑔 π‘₯8w

[Β£]

𝑏8

[Β£o"]

𝑏w

[Β£]

π‘₯8w

[Β£]

𝑨

w [Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘w

Β£

𝑏8

[Β£] = 𝑔 𝑨8 [Β£]

𝑨

w [Β£] = 7 π‘₯8w [Β£]𝑏8 [Β£o"]

  • 8S~

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘$[Β΄]

[Β£]

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘8

[Β£o

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘"

[Β£]

slide-91
SLIDE 91

Backpropagation:

91

πœ– π‘šπ‘π‘‘π‘‘ πœ–π‘₯8w

[Β£] = πœ– π‘šπ‘π‘‘π‘‘

πœ–π‘w

[Β£] Γ—

πœ–π‘w

[Β£]

πœ–π‘₯8w

[Β£]

= πœ€

w [Β£]×𝑏8 [Β£o"]×𝑔³ 𝑨 w [Β£]

} πœ€

w [Β£] = ΒΆ £€Β₯Β₯ ΒΆΒ·z

[Β΄] is the sensitivity of the output to 𝑏w

[Β£]

} Sensitivity vectors can be obtained by running a backward process in the

network architecture (hence the name backpropagation.) 𝑏8

[Β£o"]

𝑨

w [Β£]

𝑏w

[Β£]

𝑔 𝑏8

[Β£] = 𝑔 𝑨8 [Β£]

𝑨

w [Β£] = 7 π‘₯8w [Β£]𝑏8 [Β£o"]

  • 8S~

π‘₯8w

[Β£]

We will compute 𝜺[£o"] from 𝜺[£]:

πœ€8

[Β£o"] = 7 πœ€ w [Β£]×𝑔³ 𝑨 w [Β£] Γ—π‘₯8w [Β£] $[Β΄] wS"

slide-92
SLIDE 92

Backpropagation Algorithm

92

} Initialize all weights to small random numbers.
} While not satisfied
} For each training example do:

1. Feed forward the training example to the network and compute the outputs of all units in the forward step (s and a) and the loss

2. For each unit find its Ρ in the backward step

3. Update each network weight w_ij^[l] as

w_ij^[l] ← w_ij^[l] βˆ’ Ξ· βˆ‚loss/βˆ‚w_ij^[l],   where βˆ‚loss/βˆ‚w_ij^[l] = Ξ΅_j^[l] Γ— a_i^[lβˆ’1] Γ— g'(s_j^[l])
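The whole procedure can be sketched for a tiny network with one sigmoid hidden layer and a linear squared-error output (a minimal illustration with my own variable names, not the lecture's exact notation; biases are folded in as trailing weights):

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(W1, W2, x):
    """Forward step: sigmoid hidden activations a, linear output p.
    Each weight row has a trailing bias; a constant 1 is appended to its input."""
    xb = x + [1.0]
    a = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    p = sum(w * v for w, v in zip(W2, a + [1.0]))
    return a, p

def backprop(W1, W2, x, z):
    """Gradients of loss = 0.5*(p - z)^2 for all weights, via the chain rule."""
    xb = x + [1.0]
    a, p = forward(W1, W2, x)
    eps = p - z                                  # dloss/dp (output sensitivity)
    gW2 = [eps * v for v in a + [1.0]]
    gW1 = [[eps * W2[j] * a[j] * (1 - a[j]) * v for v in xb]
           for j in range(len(W1))]
    return gW1, gW2

# hypothetical tiny network: 2 inputs, 2 hidden units, 1 linear output
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [rng.uniform(-1, 1) for _ in range(3)]
x, z = [0.3, -0.7], 0.5
gW1, gW2 = backprop(W1, W2, x, z)
```

A useful sanity check on any backprop implementation is to compare the analytic gradients against finite differences of the loss.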

slide-93
SLIDE 93

Automatic Differentiation

} Automatic differentiation software

} e.g. Theano, TensorFlow, PyTorch, Chainer

} Only need to program the function g(x, y, w)
} Can automatically compute all derivatives w.r.t. all entries in w
} This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = β€œbackpropagation”

} Autodiff / Backpropagation can often be done at a computational cost comparable to the forward pass

93