Multi-Layer Networks
M. Soleymani, Deep Learning, Sharif University of Technology, Spring 2019
Most slides have been adapted from: Bhiksha Raj, 11-785, CMU 2019; Fei-Fei Li's lectures, cs231n, Stanford 2017; and some from Hinton, NN for Machine Learning
Perceptron Algorithm
[Figure: linearly separable data with margin δ and bounding radius R]
• δ is the best-case margin; R is the length of the longest input vector
• The perceptron converges after at most (R/δ)² mistakes 37
Adjusting weights
• Gradient step: w_{t+1} = w_t − η ∇E^(i)(w_t)
• Weight update for a training pair (x^(i), y^(i)):
– Perceptron: if sign(w^T x^(i)) ≠ y^(i) then Δw = y^(i) x^(i), else Δw = 0
– ADALINE: Δw = η (y^(i) − w^T x^(i)) x^(i), i.e. E^(i)(w) = (y^(i) − w^T x^(i))²
• Known as the Widrow-Hoff, LMS, or delta rule 38
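The two update rules above can be sketched in a few lines of numpy (a minimal sketch, assuming labels y ∈ {−1, +1} and that the bias is folded into w via a constant input feature; the function names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    """Perceptron rule: update only when (x, y) is misclassified, y in {-1, +1}."""
    if np.sign(w @ x) != y:
        return w + lr * y * x
    return w  # correctly classified: no change

def adaline_update(w, x, y, lr=0.1):
    """ADALINE (Widrow-Hoff / LMS / delta) rule: gradient step on (y - w^T x)^2."""
    return w + lr * (y - w @ x) * x
```

Note the difference: the perceptron only reacts to the sign of the prediction, while ADALINE always nudges w in proportion to the real-valued residual.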
How to learn the weights: multi class example 40
How to learn the weights: multi class example • If correct: no change • If wrong: – lower score of the wrong answer (by removing the input from the weight vector of the wrong answer) – raise score of the target (by adding the input to the weight vector of the target class) 41
Single layer networks as template matching
• The weight vector for each class acts as a template (sometimes also called a prototype) for that class.
– The winner is the most similar template.
• The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.
• To capture all the allowable variations of a digit we need to learn the features that it is composed of. 47
The history of perceptrons
• They were popularised by Frank Rosenblatt in the early 1960s.
– They appeared to have a very powerful learning algorithm.
– Lots of grand claims were made for what they could learn to do.
• In 1969, Minsky and Papert published a book called “Perceptrons” that analyzed what they could do and showed their limitations.
– Many people thought these limitations applied to all neural network models. 48
What binary threshold neurons cannot do • A binary threshold output unit cannot even tell if two single bit features are the same! • A geometric view of what binary threshold neurons cannot do • The positive and negative cases cannot be separated by a plane 49
What binary threshold neurons cannot do
• Positive cases (same): (1,1) → 1; (0,0) → 1
• Negative cases (different): (1,0) → 0; (0,1) → 0
• The four input-output pairs give four inequalities that are impossible to satisfy:
– w₁ + w₂ ≥ θ
– 0 ≥ θ
– w₁ < θ
– w₂ < θ 50
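The contradiction can also be checked mechanically. The sketch below (a hypothetical helper, not from the slides) searches a grid of weights and thresholds: no setting realizes the “same/different” table, while AND is found immediately. A finite grid cannot prove impossibility by itself — the four inequalities above do that — but it illustrates the point.

```python
import itertools

def separable_by_threshold(cases):
    """Brute-force search for (w1, w2, theta) realizing a 2-input truth table
    with a single binary threshold unit."""
    grid = [v / 2 for v in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
    for w1, w2, theta in itertools.product(grid, repeat=3):
        if all((w1 * a + w2 * b >= theta) == out for (a, b), out in cases):
            return True
    return False

# XNOR ("same"): provably not linearly separable
same = [((1, 1), True), ((0, 0), True), ((1, 0), False), ((0, 1), False)]
```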
Discriminating simple patterns under translation with wrap-around
• Suppose we just use pixels as the features.
• A binary decision unit cannot discriminate patterns with the same number of on-pixels
– if the patterns can translate with wrap-around! 51
Sketch of a proof • For pattern A, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern A. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • For pattern B, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern B. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B. • This is impossible if the sums over cases are the same. 52
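The proof sketch is easy to verify numerically: summing a threshold unit's total input over all cyclic translations of a pattern depends only on the number of on-pixels, never on their arrangement (a minimal numpy sketch, assuming 1-D binary patterns with wrap-around):

```python
import numpy as np

def total_drive(pattern, w):
    """Sum of w . x over all cyclic shifts of `pattern`.
    Equals (number of on-pixels) * sum(w), whatever the pattern's shape."""
    n = len(pattern)
    return sum(np.roll(pattern, s) @ w for s in range(n))
```

Two patterns with four on-pixels each therefore produce identical totals, so no weight vector can rank every shift of one above every shift of the other.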
Networks with hidden units
• Networks without hidden units are very limited in the input-output mappings they can learn to model.
– More layers of linear units do not help: it's still linear.
– Fixed output non-linearities are not enough.
• We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets? 53
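The first limitation — that stacking linear layers buys nothing — is a one-liner to check: composing two weight matrices is exactly one linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer"
W2 = rng.standard_normal((2, 4))   # second "layer"
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)       # two linear layers, no nonlinearity
shallow = (W2 @ W1) @ x    # one equivalent linear layer
```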
The multi-layer perceptron
• A network of perceptrons
– Generally “layered” 54
Feed-forward neural networks • Also called Multi-Layer Perceptron (MLP) 55
MLP with single hidden layer
• Two-layer MLP (the number of layers of adaptive weights is counted):
  z_j = φ( Σ_{i=0..d} w_ji^[1] x_i ),  p_k(x) = ψ( Σ_{j=0..M} w_kj^[2] z_j )
  ⇒ p_k(x) = ψ( Σ_{j=0..M} w_kj^[2] φ( Σ_{i=0..d} w_ji^[1] x_i ) )
• x_0 = 1 and z_0 = 1 absorb the biases; i = 0, …, d indexes inputs, j = 1, …, M hidden units, k = 1, …, K outputs 56
Beyond linear models
f = Wx
f = W₂ φ(W₁ x)
f = W₃ φ(W₂ φ(W₁ x)) 58
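The three model families above differ only in how many (weights, nonlinearity) stages are composed. A minimal forward pass, using ReLU as an example φ (the slides leave φ generic):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights):
    """Compute f = W_L phi(... phi(W_1 x)): nonlinearity between layers,
    none after the last, as in f = W3 phi(W2 phi(W1 x)) above."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h
```

With a single matrix in `weights` this reduces to the linear model f = Wx; each extra matrix adds one φ-stage.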
Defining “depth”
• What is a “deep” network? 60
Deep Structures
• In any directed network of computational elements with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink
• Left: Depth = 2. Right: Depth = 3
• “Deep” ⇒ Depth > 2 61
The multi-layer perceptron
• Inputs are real or Boolean stimuli
• Outputs are real or Boolean values
– Can have multiple outputs for a single input
• What can this network compute?
– What kinds of input/output relationships can it model? 63
MLPs approximate functions
[Figure: an MLP over inputs X, Y, Z, A with example weights and thresholds]
• MLPs can compose Boolean functions
• MLPs can compose real-valued functions
• What are the limitations? 64
Multi-layer Perceptrons as universal Boolean functions 65
The perceptron as a Boolean gate
[Figure: single perceptrons wired as simple gates over X and Y]
• A perceptron can model any simple binary Boolean gate 67
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L]
• Will fire only if X₁…X_L are all 1 and X_{L+1}…X_N are all 0
• The universal AND gate
– ANDs any number of inputs, any subset of which may be negated 68
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L − N + 1]
• Will fire if any of X₁…X_L are 1 or any of X_{L+1}…X_N are 0
• The universal OR gate
– ORs any number of inputs, any subset of which may be negated 69
Perceptron as a Boolean gate
[Figure: unit with weight 1 on every input, threshold K]
• Will fire only if at least K inputs are 1
• Generalized majority gate
– Fires if at least K inputs are of the desired polarity 70
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L − N + K]
• Will fire only if the number of X₁…X_L that are 1 plus the number of X_{L+1}…X_N that are 0 is at least K
• Generalized majority gate
– Fires if at least K inputs are of the desired polarity 71
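All four gates on the preceding slides are instances of one family: fire iff at least K inputs have their desired polarity. A sketch (counting agreements instead of writing out the ±1 weights and the L − N + K threshold — the two formulations are equivalent; the function name is illustrative):

```python
def threshold_gate(inputs, polarities, k):
    """Fires iff at least k inputs match their desired polarity.
    polarities[i] = 1 means input i should be on, 0 means it should be off."""
    agree = sum(1 for x, p in zip(inputs, polarities) if x == p)
    return int(agree >= k)
```

AND (with optional negations) is the special case k = N; OR is k = 1; the majority gate is any k in between.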
The perceptron is not enough
[Figure: a single unit over X and Y with unknown weights]
• Cannot compute an XOR 72
Multi-layer perceptron XOR
[Figure: two hidden perceptrons feeding one output perceptron]
• An XOR takes three perceptrons 73
Multi-layer perceptron XOR
[Figure: one hidden unit (weights 1, 1; threshold 1.5) and one output unit (weights 1, 1, −2; threshold 0.5)]
• With 2 neurons
– 5 weights and two thresholds 74
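The two-neuron construction can be checked on all four inputs (a sketch using the weights above: the hidden unit is AND(x, y) via threshold 1.5, and the output fires on x + y − 2h ≥ 0.5):

```python
def xor_two_units(x, y):
    """XOR with one hidden threshold unit and one output threshold unit."""
    h = int(x + y >= 1.5)            # hidden unit: AND(x, y)
    return int(x + y - 2 * h >= 0.5)  # output: OR(x, y) minus the AND case
```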
Multi-layer perceptron
[Figure: an MLP over X, Y, Z, A with example weights and thresholds]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions 75
MLP as Boolean Functions
[Figure: the same MLP over X, Y, Z, A]
• MLPs are universal Boolean functions
– Any function over any number of inputs and any number of outputs
• But how many “layers” will they need? 76
How many layers for a Boolean MLP?
• A Boolean function is just a truth table; the table lists all input combinations for which the output is 1:

  X1 X2 X3 X4 X5 | Y
   0  0  1  1  0 | 1
   0  1  0  1  1 | 1
   0  1  1  0  0 | 1
   1  0  0  0  1 | 1
   1  0  1  1  1 | 1
   1  1  0  0  1 | 1
77
How many layers for a Boolean MLP?
• Expressed in disjunctive normal form (one AND term per row of the truth table with output 1):
  y = ¬X1¬X2X3X4¬X5 + ¬X1X2¬X3X4X5 + ¬X1X2X3¬X4¬X5 + X1¬X2¬X3¬X4X5 + X1¬X2X3X4X5 + X1X2¬X3¬X4X5
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
• But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function? 86
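The DNF-to-network construction is mechanical: one hidden AND-unit per true row of the truth table, and an OR output unit. A sketch for the slide's 5-input table (`dnf_mlp` is a hypothetical helper, not from the slides):

```python
def dnf_mlp(truth_rows, n):
    """One-hidden-layer threshold network for an arbitrary truth table.
    truth_rows: set of 0/1 tuples on which the function outputs 1."""
    def f(x):
        # Each hidden unit is an AND gate (with negations) that fires
        # only on its own row: all n inputs must match the row's polarity.
        hidden = [int(sum(1 for xi, ri in zip(x, row) if xi == ri) >= n)
                  for row in truth_rows]
        return int(sum(hidden) >= 1)  # output unit: OR of the hidden units
    return f
```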
Worst case
• Which truth tables cannot be reduced further?
• Largest width needed for a one-hidden-layer Boolean network on N inputs
– Worst case: 2^(N−1) hidden units
• Example: the parity function X ⊕ Y ⊕ Z ⊕ W
[Karnaugh map: a 4×4 checkerboard of alternating 1s and 0s — no two 1-cells are adjacent, so no terms can be merged] 87
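The 2^(N−1) worst-case width is exactly the number of 1-rows in the parity truth table; since parity's Karnaugh map is a checkerboard, no two of those rows merge, so each needs its own hidden unit. Counting them:

```python
import itertools

def parity_minterms(n):
    """Number of rows in the N-input parity truth table with output 1.
    For parity no adjacent Karnaugh-map cells are both 1, so each row
    stays a separate DNF term: 2**(n-1) hidden units in one layer."""
    return sum(1 for bits in itertools.product([0, 1], repeat=n)
               if sum(bits) % 2 == 1)
```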
Boolean functions
• Input: N Boolean variables
• How many neurons does a one-hidden-layer MLP require?
• A more compact representation of a Boolean function: the “Karnaugh map”
– represents the truth table as a grid
– grouping adjacent boxes reduces the complexity of the Disjunctive Normal Form (DNF) formula
[Figure: a 4-variable Karnaugh map with groups of adjacent 1-cells circled] 88
How many neurons in the hidden layer?
[Figure: a 4-variable Karnaugh map; the eight-term DNF read directly off the map reduces, after grouping adjacent 1-cells, to a three-term formula] 89
Width of a deep MLP
[Figure: Karnaugh maps over (Y, Z) × (W, X) and (Y, Z) × (U, V), suggesting how the required width grows with the number of variables] 92
Using a deep network: parity function on N inputs
• Simple MLP with one hidden layer:
– 2^(N−1) hidden units
– (N + 2)·2^(N−1) + 1 weights and biases
• Deep alternative: a chain of pairwise XORs computing f = X1 ⊕ X2 ⊕ ⋯ ⊕ XN
– 3(N − 1) nodes
– 9(N − 1) weights and biases
• The actual number of parameters in a network is the number that really matters in software or hardware implementations 94
A better architecture
• Only requires 2 log₂(N) layers
• f = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ X6 ⊕ X7 ⊕ X8, computed as a balanced binary tree of pairwise XORs 95
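The tree-structured computation can be sketched directly; each pairwise ⊕ below stands for the small two-layer XOR MLP built earlier, so the depth grows as log₂(N) XOR stages rather than the N − 1 stages of the chain:

```python
def parity_tree(bits):
    """Parity via a balanced tree of pairwise XORs (log2(N) rounds)."""
    while len(bits) > 1:
        nxt = [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]
        if len(bits) % 2:          # odd element is carried up unchanged
            nxt.append(bits[-1])
        bits = nxt
    return bits[0]
```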
The challenge of depth
[Figure: a deep network over inputs x₁ … x_N with hidden layers z]
• Using only K hidden layers will require O(2^(CN)) neurons in the K-th layer, where C = 2^(−K/2)
– Because the output can be shown to be the XOR of all the outputs of the (K−1)-th hidden layer
– i.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function 96
Caveat 1: Not all Boolean functions…
• Not all Boolean circuits have such a clear depth-vs-size tradeoff
• Shannon's theorem: for N > 2, there is a Boolean function of N variables that requires at least 2^N / N gates
– More precisely, for large N, almost all N-input Boolean functions need more than 2^N / N gates
• Regardless of depth
• Note: if all Boolean functions over N inputs could be computed using a circuit of size polynomial in N, then P = NP! 99
Caveat 2
• We used a simple “Boolean circuit” analogy for explanation
• We actually have threshold circuits (TC), not just Boolean circuits (AC)
– Specifically, composed of threshold gates
• More versatile than Boolean gates (e.g. they can compute the majority function)
• E.g. “at least K inputs are 1” is a single TC gate, but an exponential-size AC
• For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with O(n²) weights
• But a network of depth log(n) requires only O(n) weights
• Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over a field
• So let's consider functions over the field of reals 100
Summary: Wide vs. deep network • MLP with a single hidden layer is a universal Boolean function • However, a single-layer network might need an exponential number of hidden units w.r.t. the number of inputs • Deeper networks may require far fewer neurons than shallower networks to express the same function – Could be exponentially smaller • Optimal width and depth depend on the number of variables and the complexity of the Boolean function – Complexity: minimal number of terms in DNF formula to represent it 101
MLPs as universal classifiers 102
The MLP as a classifier
[Figure: a 784-dimensional MNIST input mapped through an MLP to a class decision]
• MLP as a function over real inputs
• MLP as a function that finds a complex “decision boundary” over a space of reals 103
A Perceptron on Reals
• A perceptron operates on real-valued vectors: it fires when Σᵢ wᵢxᵢ ≥ T
– The boundary w₁x₁ + w₂x₂ = T is a line (a hyperplane in general)
– This is a linear classifier
[Figure: the line w₁x₁ + w₂x₂ = T splitting the (x₁, x₂) plane] 104
Boolean functions with a real perceptron
[Figure: three unit squares over corners (0,0), (0,1), (1,0), (1,1); shaded half-planes realize different Boolean functions of X and Y]
• Boolean perceptrons are also linear classifiers
– Purple regions are 1 105
Composing complicated “decision” boundaries
• Perceptrons can now be composed into “networks” to compute arbitrary classification “boundaries”
• Build a network of units with a single output that fires if the input is in the coloured area 106
Booleans over the reals
[Figure: a coloured region in the (x₁, x₂) plane]
• The network must fire if the input is in the coloured area 107