Multi-Layer Networks
M. Soleymani, Deep Learning, Sharif University of Technology, Spring 2019
Most slides have been adapted from: Bhiksha Raj, 11-785, CMU 2019; Fei-Fei Li's lectures, cs231n, Stanford 2017; and some from Hinton, NN for Machine Learning
Perceptron Algorithm
[Figure: linearly separable data with margin δ and bounding radius R]
• δ is the best-case margin; R is the length of the longest input vector
• The perceptron converges after at most (R/δ)² mistakes 37
Adjusting weights
• Gradient step: w_{t+1} = w_t − η ∇E^(i)(w_t)
• Weight update for a training pair (x^(i), y^(i)):
– Perceptron: if sign(w^T x^(i)) ≠ y^(i) then Δw = y^(i) x^(i), else Δw = 0
– ADALINE: Δw = η (y^(i) − w^T x^(i)) x^(i), i.e. E^(i)(w) = (y^(i) − w^T x^(i))²
• Known as the Widrow-Hoff, LMS, or delta rule 38
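The two update rules above can be sketched in a few lines of numpy (a minimal sketch, assuming labels y ∈ {−1, +1} and that the bias is folded into w via a constant input feature; the function names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    """Perceptron rule: update only when (x, y) is misclassified, y in {-1, +1}."""
    if np.sign(w @ x) != y:
        return w + lr * y * x
    return w  # correctly classified: no change

def adaline_update(w, x, y, lr=0.1):
    """ADALINE (Widrow-Hoff / LMS / delta) rule: gradient step on (y - w^T x)^2."""
    return w + lr * (y - w @ x) * x
```

Note the difference: the perceptron only reacts to the sign of the prediction, while ADALINE always nudges w in proportion to the real-valued residual.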
How to learn the weights: multi class example 40
How to learn the weights: multi class example • If correct: no change • If wrong: – lower score of the wrong answer (by removing the input from the weight vector of the wrong answer) – raise score of the target (by adding the input to the weight vector of the target class) 41
Single layer networks as template matching
• The weight vector for each class acts as a template (sometimes also called a prototype) for that class.
– The winner is the most similar template.
• The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.
• To capture all the allowable variations of a digit we need to learn the features that it is composed of. 47
The history of perceptrons
• They were popularised by Frank Rosenblatt in the early 1960s.
– They appeared to have a very powerful learning algorithm.
– Lots of grand claims were made for what they could learn to do.
• In 1969, Minsky and Papert published a book called “Perceptrons” that analyzed what they could do and showed their limitations.
– Many people thought these limitations applied to all neural network models. 48
What binary threshold neurons cannot do • A binary threshold output unit cannot even tell if two single bit features are the same! • A geometric view of what binary threshold neurons cannot do • The positive and negative cases cannot be separated by a plane 49
What binary threshold neurons cannot do
• Positive cases (same): (1,1) → 1; (0,0) → 1
• Negative cases (different): (1,0) → 0; (0,1) → 0
• The four input-output pairs give four inequalities that are impossible to satisfy:
– w₁ + w₂ ≥ θ
– 0 ≥ θ
– w₁ < θ
– w₂ < θ 50
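The contradiction can also be checked mechanically. The sketch below (a hypothetical helper, not from the slides) searches a grid of weights and thresholds: no setting realizes the “same/different” table, while AND is found immediately. A finite grid cannot prove impossibility by itself — the four inequalities above do that — but it illustrates the point.

```python
import itertools

def separable_by_threshold(cases):
    """Brute-force search for (w1, w2, theta) realizing a 2-input truth table
    with a single binary threshold unit."""
    grid = [v / 2 for v in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
    for w1, w2, theta in itertools.product(grid, repeat=3):
        if all((w1 * a + w2 * b >= theta) == out for (a, b), out in cases):
            return True
    return False

# XNOR ("same"): provably not linearly separable
same = [((1, 1), True), ((0, 0), True), ((1, 0), False), ((0, 1), False)]
```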
Discriminating simple patterns under translation with wrap-around
• Suppose we just use pixels as the features.
• A binary decision unit cannot discriminate patterns with the same number of on-pixels
– if the patterns can translate with wrap-around! 51
Sketch of a proof • For pattern A, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern A. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • For pattern B, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern B. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B. • This is impossible if the sums over cases are the same. 52
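The proof sketch is easy to verify numerically: summing a threshold unit's total input over all cyclic translations of a pattern depends only on the number of on-pixels, never on their arrangement (a minimal numpy sketch, assuming 1-D binary patterns with wrap-around):

```python
import numpy as np

def total_drive(pattern, w):
    """Sum of w . x over all cyclic shifts of `pattern`.
    Equals (number of on-pixels) * sum(w), whatever the pattern's shape."""
    n = len(pattern)
    return sum(np.roll(pattern, s) @ w for s in range(n))
```

Two patterns with four on-pixels each therefore produce identical totals, so no weight vector can rank every shift of one above every shift of the other.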
Networks with hidden units
• Networks without hidden units are very limited in the input-output mappings they can learn to model.
– More layers of linear units do not help: it's still linear.
– Fixed output non-linearities are not enough.
• We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets? 53
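The first limitation — that stacking linear layers buys nothing — is a one-liner to check: composing two weight matrices is exactly one linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer"
W2 = rng.standard_normal((2, 4))   # second "layer"
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)       # two linear layers, no nonlinearity
shallow = (W2 @ W1) @ x    # one equivalent linear layer
```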
The multi-layer perceptron
• A network of perceptrons
– Generally “layered” 54
Feed-forward neural networks • Also called Multi-Layer Perceptron (MLP) 55
MLP with single hidden layer
• Two-layer MLP (the number of layers of adaptive weights is counted):
  z_j = φ( Σ_{i=0..d} w_ji^[1] x_i ),  p_k(x) = ψ( Σ_{j=0..M} w_kj^[2] z_j )
  ⇒ p_k(x) = ψ( Σ_{j=0..M} w_kj^[2] φ( Σ_{i=0..d} w_ji^[1] x_i ) )
• x_0 = 1 and z_0 = 1 absorb the biases; i = 0, …, d indexes inputs, j = 1, …, M hidden units, k = 1, …, K outputs 56
Beyond linear models
f = Wx
f = W₂ φ(W₁ x)
f = W₃ φ(W₂ φ(W₁ x)) 58
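The three model families above differ only in how many (weights, nonlinearity) stages are composed. A minimal forward pass, using ReLU as an example φ (the slides leave φ generic):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, weights):
    """Compute f = W_L phi(... phi(W_1 x)): nonlinearity between layers,
    none after the last, as in f = W3 phi(W2 phi(W1 x)) above."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h
```

With a single matrix in `weights` this reduces to the linear model f = Wx; each extra matrix adds one φ-stage.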
Defining “depth”
• What is a “deep” network? 60
Deep Structures
• In any directed network of computational elements with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink
• Left: Depth = 2. Right: Depth = 3
• “Deep” ⇒ Depth > 2 61
The multi-layer perceptron
• Inputs are real or Boolean stimuli
• Outputs are real or Boolean values
– Can have multiple outputs for a single input
• What can this network compute?
– What kinds of input/output relationships can it model? 63
MLPs approximate functions
[Figure: an MLP over inputs X, Y, Z, A with example weights and thresholds]
• MLPs can compose Boolean functions
• MLPs can compose real-valued functions
• What are the limitations? 64
Multi-layer Perceptrons as universal Boolean functions 65
The perceptron as a Boolean gate
[Figure: single perceptrons wired as simple gates over X and Y]
• A perceptron can model any simple binary Boolean gate 67
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L]
• Will fire only if X₁…X_L are all 1 and X_{L+1}…X_N are all 0
• The universal AND gate
– ANDs any number of inputs, any subset of which may be negated 68
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L − N + 1]
• Will fire if any of X₁…X_L are 1 or any of X_{L+1}…X_N are 0
• The universal OR gate
– ORs any number of inputs, any subset of which may be negated 69
Perceptron as a Boolean gate
[Figure: unit with weight 1 on every input, threshold K]
• Will fire only if at least K inputs are 1
• Generalized majority gate
– Fires if at least K inputs are of the desired polarity 70
Perceptron as a Boolean gate
[Figure: unit with weights +1 on X₁…X_L, −1 on X_{L+1}…X_N, threshold L − N + K]
• Will fire only if the number of X₁…X_L that are 1 plus the number of X_{L+1}…X_N that are 0 is at least K
• Generalized majority gate
– Fires if at least K inputs are of the desired polarity 71
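All four gates on the preceding slides are instances of one family: fire iff at least K inputs have their desired polarity. A sketch (counting agreements instead of writing out the ±1 weights and the L − N + K threshold — the two formulations are equivalent; the function name is illustrative):

```python
def threshold_gate(inputs, polarities, k):
    """Fires iff at least k inputs match their desired polarity.
    polarities[i] = 1 means input i should be on, 0 means it should be off."""
    agree = sum(1 for x, p in zip(inputs, polarities) if x == p)
    return int(agree >= k)
```

AND (with optional negations) is the special case k = N; OR is k = 1; the majority gate is any k in between.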
The perceptron is not enough
[Figure: a single unit over X and Y with unknown weights]
• Cannot compute an XOR 72
Multi-layer perceptron XOR
[Figure: two hidden perceptrons feeding one output perceptron]
• An XOR takes three perceptrons 73
Multi-layer perceptron XOR
[Figure: one hidden unit (weights 1, 1; threshold 1.5) and one output unit (weights 1, 1, −2; threshold 0.5)]
• With 2 neurons
– 5 weights and two thresholds 74
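The two-neuron construction can be checked on all four inputs (a sketch using the weights above: the hidden unit is AND(x, y) via threshold 1.5, and the output fires on x + y − 2h ≥ 0.5):

```python
def xor_two_units(x, y):
    """XOR with one hidden threshold unit and one output threshold unit."""
    h = int(x + y >= 1.5)            # hidden unit: AND(x, y)
    return int(x + y - 2 * h >= 0.5)  # output: OR(x, y) minus the AND case
```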
Multi-layer perceptron
[Figure: an MLP over X, Y, Z, A with example weights and thresholds]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions 75
MLP as Boolean Functions
[Figure: the same MLP over X, Y, Z, A]
• MLPs are universal Boolean functions
– Any function over any number of inputs and any number of outputs
• But how many “layers” will they need? 76
How many layers for a Boolean MLP?
• A Boolean function is just a truth table; the table lists all input combinations for which the output is 1:

  X1 X2 X3 X4 X5 | Y
   0  0  1  1  0 | 1
   0  1  0  1  1 | 1
   0  1  1  0  0 | 1
   1  0  0  0  1 | 1
   1  0  1  1  1 | 1
   1  1  0  0  1 | 1
77
How many layers for a Boolean MLP?
• Expressed in disjunctive normal form (one AND term per row of the truth table with output 1):
  y = ¬X1¬X2X3X4¬X5 + ¬X1X2¬X3X4X5 + ¬X1X2X3¬X4¬X5 + X1¬X2¬X3¬X4X5 + X1¬X2X3X4X5 + X1X2¬X3¬X4X5
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
• But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function? 86
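The DNF-to-network construction is mechanical: one hidden AND-unit per true row of the truth table, and an OR output unit. A sketch for the slide's 5-input table (`dnf_mlp` is a hypothetical helper, not from the slides):

```python
def dnf_mlp(truth_rows, n):
    """One-hidden-layer threshold network for an arbitrary truth table.
    truth_rows: set of 0/1 tuples on which the function outputs 1."""
    def f(x):
        # Each hidden unit is an AND gate (with negations) that fires
        # only on its own row: all n inputs must match the row's polarity.
        hidden = [int(sum(1 for xi, ri in zip(x, row) if xi == ri) >= n)
                  for row in truth_rows]
        return int(sum(hidden) >= 1)  # output unit: OR of the hidden units
    return f
```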
Worst case
• Which truth tables cannot be reduced further?
• Largest width needed for a one-hidden-layer Boolean network on N inputs
– Worst case: 2^(N−1) hidden units
• Example: the parity function X ⊕ Y ⊕ Z ⊕ W
[Karnaugh map: a 4×4 checkerboard of alternating 1s and 0s — no two 1-cells are adjacent, so no terms can be merged] 87
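The 2^(N−1) worst-case width is exactly the number of 1-rows in the parity truth table; since parity's Karnaugh map is a checkerboard, no two of those rows merge, so each needs its own hidden unit. Counting them:

```python
import itertools

def parity_minterms(n):
    """Number of rows in the N-input parity truth table with output 1.
    For parity no adjacent Karnaugh-map cells are both 1, so each row
    stays a separate DNF term: 2**(n-1) hidden units in one layer."""
    return sum(1 for bits in itertools.product([0, 1], repeat=n)
               if sum(bits) % 2 == 1)
```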
Boolean functions
• Input: N Boolean variables
• How many neurons does a one-hidden-layer MLP require?
• A more compact representation of a Boolean function: the “Karnaugh map”
– represents the truth table as a grid
– grouping adjacent boxes reduces the complexity of the Disjunctive Normal Form (DNF) formula
[Figure: a 4-variable Karnaugh map with groups of adjacent 1-cells circled] 88
How many neurons in the hidden layer?
[Figure: a 4-variable Karnaugh map; the eight-term DNF read directly off the map reduces, after grouping adjacent 1-cells, to a three-term formula] 89
Width of a deep MLP
[Figure: Karnaugh maps over (Y, Z) × (W, X) and (Y, Z) × (U, V), suggesting how the required width grows with the number of variables] 92
Using a deep network: parity function on N inputs
• Simple MLP with one hidden layer:
– 2^(N−1) hidden units
– (N + 2)·2^(N−1) + 1 weights and biases
• Deep alternative: a chain of pairwise XORs computing f = X1 ⊕ X2 ⊕ ⋯ ⊕ XN
– 3(N − 1) nodes
– 9(N − 1) weights and biases
• The actual number of parameters in a network is the number that really matters in software or hardware implementations 94
A better architecture
• Only requires 2 log₂(N) layers
• f = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ X6 ⊕ X7 ⊕ X8, computed as a balanced binary tree of pairwise XORs 95
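The tree-structured computation can be sketched directly; each pairwise ⊕ below stands for the small two-layer XOR MLP built earlier, so the depth grows as log₂(N) XOR stages rather than the N − 1 stages of the chain:

```python
def parity_tree(bits):
    """Parity via a balanced tree of pairwise XORs (log2(N) rounds)."""
    while len(bits) > 1:
        nxt = [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]
        if len(bits) % 2:          # odd element is carried up unchanged
            nxt.append(bits[-1])
        bits = nxt
    return bits[0]
```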
The challenge of depth
[Figure: a deep network over inputs x₁ … x_N with hidden layers z]
• Using only K hidden layers will require O(2^(CN)) neurons in the K-th layer, where C = 2^(−K/2)
– Because the output can be shown to be the XOR of all the outputs of the (K−1)-th hidden layer
– i.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function 96
Caveat 1: Not all Boolean functions…
• Not all Boolean circuits have such a clear depth-vs-size tradeoff
• Shannon's theorem: for N > 2, there is a Boolean function of N variables that requires at least 2^N / N gates
– More precisely, for large N, almost all N-input Boolean functions need more than 2^N / N gates
• Regardless of depth
• Note: if all Boolean functions over N inputs could be computed using a circuit of size polynomial in N, then P = NP! 99
Caveat 2
• We used a simple “Boolean circuit” analogy for explanation
• We actually have threshold circuits (TC), not just Boolean circuits (AC)
– Specifically, composed of threshold gates
• More versatile than Boolean gates (e.g. they can compute the majority function)
• E.g. “at least K inputs are 1” is a single TC gate, but an exponential-size AC
• For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with O(n²) weights
• But a network of depth log(n) requires only O(n) weights
• Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over a field
• So let's consider functions over the field of reals 100
Summary: Wide vs. deep network • MLP with a single hidden layer is a universal Boolean function • However, a single-layer network might need an exponential number of hidden units w.r.t. the number of inputs • Deeper networks may require far fewer neurons than shallower networks to express the same function – Could be exponentially smaller • Optimal width and depth depend on the number of variables and the complexity of the Boolean function – Complexity: minimal number of terms in DNF formula to represent it 101
MLPs as universal classifiers 102
The MLP as a classifier
[Figure: a 784-dimensional MNIST input mapped through an MLP to a class decision]
• MLP as a function over real inputs
• MLP as a function that finds a complex “decision boundary” over a space of reals 103
A Perceptron on Reals
• A perceptron operates on real-valued vectors: it fires when Σᵢ wᵢxᵢ ≥ T
– The boundary w₁x₁ + w₂x₂ = T is a line (a hyperplane in general)
– This is a linear classifier
[Figure: the line w₁x₁ + w₂x₂ = T splitting the (x₁, x₂) plane] 104
Boolean functions with a real perceptron
[Figure: three unit squares over corners (0,0), (0,1), (1,0), (1,1); shaded half-planes realize different Boolean functions of X and Y]
• Boolean perceptrons are also linear classifiers
– Purple regions are 1 105
Composing complicated “decision” boundaries
• Perceptrons can now be composed into “networks” to compute arbitrary classification “boundaries”
• Build a network of units with a single output that fires if the input is in the coloured area 106
Booleans over the reals
[Figure: a coloured region in the (x₁, x₂) plane]
• The network must fire if the input is in the coloured area 107