Multi-Layer Networks


Multi-Layer Networks. M. Soleymani, Deep Learning, Sharif University of Technology, Spring 2019. Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and Fei-Fei Li's lectures, cs231n, Stanford 2017, and some from Hinton, NN for …


  1. Perceptron algorithm: convergence. The number of mistakes the perceptron makes on a linearly separable training set is bounded by (R/δ)², where δ is the best-case margin and R is the length of the longest input vector.

  2. Adjusting weights. Gradient step: w^(t+1) = w^(t) − η ∇E^(i)(w^(t)) • Weight update for a training pair (x^(i), y^(i)): – Perceptron: if sign(wᵀx^(i)) ≠ y^(i) then Δw = y^(i) x^(i), else Δw = 0 – ADALINE: Δw = η (y^(i) − wᵀx^(i)) x^(i), minimizing E^(i)(w) = (y^(i) − wᵀx^(i))² • Also known as the Widrow-Hoff, LMS, or delta rule
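The two update rules above can be sketched as follows. This is an illustrative sketch, not code from the slides; the function names and the learning rate `eta` are assumptions.

```python
# Sketch of the two update rules on a training pair (x, y), with y in {-1, +1}.

def sign(z):
    return 1 if z >= 0 else -1

def perceptron_update(w, x, y):
    """Perceptron: update only on a misclassification, step = y * x."""
    if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w  # correct prediction: no change

def adaline_update(w, x, y, eta=0.1):
    """ADALINE (Widrow-Hoff / LMS / delta rule): always move along the
    gradient of the squared error (y - w.x)^2, even when the sign is right."""
    err = y - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * err * xi for wi, xi in zip(w, x)]
```

Note the key difference: the perceptron stops updating once the sign is correct, while ADALINE keeps shrinking the residual error.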

  3. How to learn the weights: multi-class example

  4. How to learn the weights: multi-class example • If correct: no change • If wrong: – lower the score of the wrong answer (by subtracting the input from the weight vector of the wrong class) – raise the score of the target (by adding the input to the weight vector of the target class)

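The multi-class rule above can be sketched in a few lines. This is an illustrative sketch under assumed names (`multiclass_update`, a dict of per-class weight vectors), not the slides' own code.

```python
# Multi-class perceptron rule: on a mistake, subtract the input from the
# wrong class's weight vector and add it to the target class's weight vector.

def multiclass_update(W, x, target):
    """W: dict class -> weight vector; x: input vector; target: correct class."""
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in W.items()}
    pred = max(scores, key=scores.get)
    if pred != target:  # if wrong:
        W[pred] = [wi - xi for wi, xi in zip(W[pred], x)]      # lower wrong score
        W[target] = [wi + xi for wi, xi in zip(W[target], x)]  # raise target score
    return W  # if correct: no change
```

After the update, the input scores strictly higher under the target class and strictly lower under the previously winning wrong class.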

  10. Single-layer networks as template matching • The weight vector for each class acts as a template (sometimes also called a prototype) for that class. – The winner is the most similar template. • The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes. • To capture all the allowable variations of a digit we need to learn the features that it is composed of.

  11. The history of perceptrons • They were popularised by Frank Rosenblatt in the early 1960s. – They appeared to have a very powerful learning algorithm. – Lots of grand claims were made for what they could learn to do. • In 1969, Minsky and Papert published a book called “Perceptrons” that analyzed what they could do and showed their limitations. – Many people thought these limitations applied to all neural network models.

  12. What binary threshold neurons cannot do • A binary threshold output unit cannot even tell if two single-bit features are the same! • A geometric view of what binary threshold neurons cannot do: the positive and negative cases cannot be separated by a plane.

  13. What binary threshold neurons cannot do • Positive cases (same): (1,1) → 1; (0,0) → 1 • Negative cases (different): (1,0) → 0; (0,1) → 0 • The four input-output pairs give four inequalities that are impossible to satisfy: – w1 + w2 ≥ θ – 0 ≥ θ – w1 < θ – w2 < θ
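The contradiction above can also be seen by brute force. The sketch below (illustrative; the weight grid is an assumption, while the inequalities rule out *all* real weights) searches a grid of weights and thresholds and finds no single threshold unit that computes the "same" (XNOR) function.

```python
# No single threshold unit step(w1*x1 + w2*x2 - theta) outputs 1 on
# (0,0) and (1,1) while outputting 0 on (0,1) and (1,0).

def unit(w1, w2, theta, x1, x2):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

cases = {(0, 0): 1, (1, 1): 1, (0, 1): 0, (1, 0): 0}  # "same" detector (XNOR)

grid = [i / 2 for i in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
found = any(
    all(unit(w1, w2, t, x1, x2) == y for (x1, x2), y in cases.items())
    for w1 in grid for w2 in grid for t in grid
)
print(found)  # no setting on the grid satisfies all four inequalities
```

The algebra mirrors the slide: the cases force θ ≤ 0, w1 < θ, and w2 < θ, hence w1 + w2 < 2θ ≤ θ, contradicting w1 + w2 ≥ θ.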

  14. Discriminating simple patterns under translation with wrap-around • Suppose we just use pixels as the features. • A binary decision unit cannot discriminate patterns with the same number of on pixels – if the patterns can translate with wrap-around!

  15. Sketch of a proof • For pattern A, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern A. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • For pattern B, use training cases in all possible translations. – Each pixel will be activated by 4 different translations of pattern B. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights. • But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B. • This is impossible if the sums over cases are the same.

  16. Networks with hidden units • Networks without hidden units are very limited in the input-output mappings they can learn to model. – More layers of linear units do not help: it's still linear. – Fixed output non-linearities are not enough. • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

  17. The multi-layer perceptron • A network of perceptrons – Generally “layered”

  18. Feed-forward neural networks • Also called a Multi-Layer Perceptron (MLP)

  19. MLP with single hidden layer • Two-layer MLP (the number of layers of adaptive weights is counted) • Hidden units: z_j = φ(Σ_{i=0}^{d} w_{ji}^[1] x_i), with x_0 = 1 • Output units: y_k = ψ(Σ_{j=0}^{M} w_{kj}^[2] z_j), with z_0 = 1 • Together: y_k = ψ(Σ_{j=0}^{M} w_{kj}^[2] φ(Σ_{i=0}^{d} w_{ji}^[1] x_i)), for i = 0, …, d; j = 1, …, M; k = 1, …, K

  20. Beyond linear models • f = W x • f = W₂ φ(W₁ x)

  21. Beyond linear models • f = W x • f = W₂ φ(W₁ x) • f = W₃ φ(W₂ φ(W₁ x))
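The three model families above can be sketched directly. This is a minimal illustration with assumed shapes and a ReLU standing in for the nonlinearity φ (the slides do not fix a particular φ).

```python
# f = W1 x,  f = W2 phi(W1 x),  f = W3 phi(W2 phi(W1 x)).
# Matrices are lists of rows; phi is applied elementwise between layers.

def phi(v):  # a fixed elementwise nonlinearity, here ReLU
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def linear(W1, x):               # f = W1 x
    return matvec(W1, x)

def two_layer(W2, W1, x):        # f = W2 phi(W1 x)
    return matvec(W2, phi(matvec(W1, x)))

def three_layer(W3, W2, W1, x):  # f = W3 phi(W2 phi(W1 x))
    return matvec(W3, phi(matvec(W2, phi(matvec(W1, x)))))
```

Without φ between the matrices, the stack would collapse into a single matrix product, which is the "more layers of linear units do not help" point from slide 16.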

  22. Defining “depth” • What is a “deep” network?

  23. Deep structures • In any directed network of computational elements with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink • Left: depth = 2. Right: depth = 3 • “Deep” ⇔ depth > 2

  24. The multi-layer perceptron • Inputs are real or Boolean stimuli • Outputs are real or Boolean values – Can have multiple outputs for a single input • What can this network compute? – What kinds of input/output relationships can it model?

  25. MLPs approximate functions • MLPs can compose Boolean functions • MLPs can compose real-valued functions • What are the limitations?

  26. Multi-layer perceptrons as universal Boolean functions

  27. The perceptron as a Boolean gate • A perceptron can model any simple binary Boolean gate

  28. Perceptron as a Boolean gate • The universal AND gate: with weights +1 on X1..XL, −1 on XL+1..XN, and threshold L, the unit fires only if X1..XL are all 1 and XL+1..XN are all 0 – ANDs any number of inputs, any subset of which may be negated

  29. Perceptron as a Boolean gate • The universal OR gate: with weights +1 on X1..XL, −1 on XL+1..XN, and threshold L − N + 1, the unit fires if any of X1..XL is 1 or any of XL+1..XN is 0 – ORs any number of inputs, any subset of which may be negated

  30. Perceptron as a Boolean gate • Generalized majority gate: with all weights +1 and threshold K, the unit fires only if at least K inputs are 1

  31. Perceptron as a Boolean gate • Generalized majority gate with negations: with weights +1 on X1..XL, −1 on XL+1..XN, and threshold L − N + K, the unit fires only if the total number of X1..XL that are 1 plus the number of XL+1..XN that are 0 is at least K – Fires if at least K inputs are of the desired polarity
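The three gate constructions (slides 28–30) are all the same threshold unit with different thresholds. The sketch below is illustrative; the function names are assumptions, but the thresholds L, L − N + 1, and K are the ones on the slides.

```python
# One parametric threshold unit covering the AND, OR, and majority gates.
# Inputs are 0/1; the first group enters with weight +1, the second with -1.

def threshold_gate(xs_pos, xs_neg, theta):
    """Fires iff (# of 1s in xs_pos) - (# of 1s in xs_neg) >= theta."""
    return 1 if sum(xs_pos) - sum(xs_neg) >= theta else 0

def and_gate(xs_pos, xs_neg):
    # AND: all positive inputs 1 and all negated inputs 0 -> threshold L
    return threshold_gate(xs_pos, xs_neg, len(xs_pos))

def or_gate(xs_pos, xs_neg):
    # OR: any positive input 1 or any negated input 0 -> threshold L - N + 1
    L, N = len(xs_pos), len(xs_pos) + len(xs_neg)
    return threshold_gate(xs_pos, xs_neg, L - N + 1)

def majority_gate(xs, K):
    # fires iff at least K of the inputs are 1
    return threshold_gate(xs, [], K)
```

The AND threshold L is only reachable when every positive input is 1 and no negated input is 1; lowering the threshold to L − N + 1 makes a single "correct" input enough, which is exactly OR.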

  32. The perceptron is not enough • Cannot compute an XOR

  33. Multi-layer perceptron XOR • An XOR takes three perceptrons: two hidden units and one output unit

  34. Multi-layer perceptron XOR • With 2 neurons: a hidden unit with weights 1, 1 and threshold 1.5 (an AND of X and Y), and an output unit with weights 1, 1, −2 and threshold 0.5 – 5 weights and two thresholds
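The two-neuron XOR above can be run directly with the slide's weights: the hidden unit computes AND(X, Y) with threshold 1.5, and the output unit computes step(X + Y − 2·AND(X, Y) − 0.5).

```python
# Two-neuron XOR with the weights from the slide: 1, 1, -2 and thresholds 1.5, 0.5.

def step(z):
    return 1 if z >= 0 else 0

def xor(x, y):
    h = step(x + y - 1.5)             # hidden unit: AND(x, y)
    return step(x + y - 2 * h - 0.5)  # output unit: "OR minus twice the AND"

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor(x, y))  # prints the XOR truth table
```

Only the (1,1) input trips the hidden AND unit, whose −2 weight then pulls the output sum below the 0.5 threshold.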

  35. Multi-layer perceptron • MLPs can compute more complex Boolean functions • MLPs can compute any Boolean function – Since they can emulate individual gates • MLPs are universal Boolean functions

  36. MLP as Boolean functions • MLPs are universal Boolean functions – Any function over any number of inputs and any number of outputs • But how many “layers” will they need?

  37. How many layers for a Boolean MLP? • A Boolean function is just a truth table. The table below shows all input combinations for which the output is 1:
  X1 X2 X3 X4 X5 | Y
   0  0  1  1  0 | 1
   0  1  0  1  1 | 1
   0  1  1  0  0 | 1
   1  0  0  0  1 | 1
   1  0  1  1  1 | 1
   1  1  0  0  1 | 1

  38. How many layers for a Boolean MLP? • Expressed in disjunctive normal form, one AND term per 1-row of the truth table, ORed together: y = ¬X1¬X2X3X4¬X5 + ¬X1X2¬X3X4X5 + ¬X1X2X3¬X4¬X5 + X1¬X2¬X3¬X4X5 + X1¬X2X3X4X5 + X1X2¬X3¬X4X5


  46. How many layers for a Boolean MLP? • Any truth table can be expressed in this manner! • A one-hidden-layer MLP is a universal Boolean function: one hidden AND unit per 1-row of the truth table, ORed by the output unit • But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?
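The DNF-to-network construction above can be sketched mechanically: each 1-row of the truth table becomes a hidden "minterm detector" (weights +1 where the row has a 1, −1 where it has a 0), and the output unit ORs them. Function names are illustrative; the six rows are the ones from slide 37.

```python
# One-hidden-layer MLP for an arbitrary truth table, built from its 1-rows.

def step(z):
    return 1 if z >= 0 else 0

def minterm_unit(pattern, x):
    # weight +1 where pattern has 1, -1 where 0; fires only on an exact match
    s = sum((1 if p else -1) * xi for p, xi in zip(pattern, x))
    return step(s - sum(pattern))

def dnf_network(one_rows, x):
    hidden = [minterm_unit(row, x) for row in one_rows]
    return step(sum(hidden) - 1)  # output unit: OR of the hidden units

# the six 1-rows of the truth table on the slide
ONE_ROWS = [(0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
            (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1)]
```

Any single-bit mismatch drops the minterm unit's weighted sum below its threshold, so each hidden unit fires on exactly one input pattern, and the width of the hidden layer equals the number of 1-rows.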

  47. Worst case • Which truth tables cannot be reduced further? • Largest width needed for a single-hidden-layer Boolean network on N inputs – Worst case: 2^(N−1) • Example: the parity function W ⊕ X ⊕ Y ⊕ Z, whose Karnaugh map is a checkerboard of alternating 0s and 1s with no adjacent cells to group

  48. Boolean functions • Input: N Boolean variables • How many neurons does a one-hidden-layer MLP require? • More compact representation of a Boolean function: the “Karnaugh map” – represents the truth table as a grid in which adjacent cells differ in one variable – grouping adjacent boxes reduces the complexity of the Disjunctive Normal Form (DNF) formula

  49. How many neurons in the hidden layer? • [Karnaugh map: grouping the eight adjacent 1-cells into larger blocks reduces the eight-minterm DNF to a three-term formula, so three hidden units suffice instead of eight]

  50. Width of a deep MLP • [Karnaugh maps over (W,X) × (Y,Z) and over (U,V) × (Y,Z), illustrating how the required width of a shallow network grows with the number of variables]

  51. Using a deep network: parity function on N inputs • Simple MLP with one hidden layer: – 2^(N−1) hidden units – (N+2)·2^(N−1) + 1 weights and biases

  52. Using a deep network: parity function on N inputs • Simple MLP with one hidden layer: – 2^(N−1) hidden units – (N+2)·2^(N−1) + 1 weights and biases • Deep network chaining XORs, f = X1 ⊕ X2 ⊕ ⋯ ⊕ XN: – 3(N−1) nodes – 9(N−1) weights and biases • The actual number of parameters in a network is the number that really matters in software or hardware implementations

  53. A better architecture • Only requires 2·log2(N) layers • f = (X1 ⊕ X2) ⊕ (X3 ⊕ X4) ⊕ (X5 ⊕ X6) ⊕ (X7 ⊕ X8) : pair up the inputs and XOR them in a balanced tree
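The balanced-tree architecture above can be sketched by reusing the two-neuron XOR block from slide 34: each round halves the number of values, so a power-of-two input needs log2(N) rounds of blocks (2·log2(N) perceptron layers). The function names are illustrative.

```python
# Parity via a balanced tree of 2-neuron XOR blocks.

def step(z):
    return 1 if z >= 0 else 0

def xor2(x, y):  # one XOR block (weights 1, 1, -2; thresholds 1.5, 0.5)
    h = step(x + y - 1.5)
    return step(x + y - 2 * h - 0.5)

def parity_tree(bits):
    """Assumes len(bits) is a power of two; reduces pairwise until one value remains."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [xor2(a, b) for a, b in zip(layer[::2], layer[1::2])]
    return layer[0]
```

Compare with slide 47: a single hidden layer needs 2^(N−1) units for the same function, while this tree uses 3(N−1) perceptrons arranged in logarithmic depth.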

  54. The challenge of depth • Using only K hidden layers will require O(2^(CN)) neurons in the K-th layer, where C = 2^(−K/2) – Because the output can be shown to be the XOR of all the outputs of the (K−1)-th hidden layer – i.e., reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully – A network with fewer than the minimum required number of neurons cannot model the function

  55. Caveat 1: not all Boolean functions… • Not all Boolean circuits have such a clear depth-vs-size tradeoff • Shannon's theorem: for N > 2, there is a Boolean function of N variables that requires at least 2^N / N gates – More precisely, for large N, almost all N-input Boolean functions need more than 2^N / N gates • Regardless of depth • Note: if all Boolean functions over N inputs could be computed using a circuit of size polynomial in N, then P = NP!

  56. Caveat 2 • We used a simple “Boolean circuit” analogy for explanation • We actually have a threshold circuit (TC), not just a Boolean circuit (AC) – Specifically, composed of threshold gates • More versatile than Boolean gates (they can compute the majority function) • E.g., “at least K inputs are 1” is a single TC gate, but requires an exponential-size AC • For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset) – A depth-2 TC parity circuit can be composed with O(n²) weights • But a network of depth log(n) requires only O(n) weights • Other formal analyses typically view neural networks as arithmetic circuits – Circuits which compute polynomials over any field • So let's consider functions over the field of reals

  57. Summary: wide vs. deep networks • An MLP with a single hidden layer is a universal Boolean function • However, a single-hidden-layer network might need a number of hidden units that is exponential in the number of inputs • Deeper networks may require far fewer neurons than shallower networks to express the same function – Could be exponentially smaller • Optimal width and depth depend on the number of variables and the complexity of the Boolean function – Complexity: the minimal number of terms in a DNF formula representing it

  58. MLPs as universal classifiers

  59. The MLP as a classifier • MLP as a function over real inputs (e.g., the 784-dimensional MNIST images) • MLP as a function that finds a complex “decision boundary” over a space of reals

  60. A perceptron on reals • A perceptron operates on real-valued vectors: it fires when Σᵢ wᵢxᵢ ≥ T – The decision boundary is the hyperplane w1x1 + w2x2 = T – This is a linear classifier
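The real-valued perceptron above is a one-liner. This is a minimal illustrative sketch; the particular weights and threshold below are assumptions chosen to make the boundary the line x1 + x2 = 1.

```python
# A real-valued perceptron: fires when the weighted sum reaches the threshold T,
# so the decision boundary is the hyperplane w1*x1 + ... + wN*xN = T.

def perceptron(w, T, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0

w, T = [1.0, 1.0], 1.0                # boundary: x1 + x2 = 1
print(perceptron(w, T, [0.9, 0.5]))  # above the line -> 1
print(perceptron(w, T, [0.2, 0.3]))  # below the line -> 0
```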

  61. Boolean functions with a real perceptron • Boolean perceptrons are also linear classifiers, over the corners (0,0), (0,1), (1,0), (1,1) of the unit square – Purple regions are 1

  62. Composing complicated “decision” boundaries • Perceptrons can now be composed into “networks” to compute arbitrary classification “boundaries” • Build a network of units with a single output that fires if the input is in the coloured area

  63. Booleans over the reals • The network must fire if the input is in the coloured area

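The "fire if the input is in the coloured area" construction can be sketched for a convex region: one perceptron per edge, plus an AND unit that fires only when the input is on the inside of every edge. This is an illustrative sketch; the unit square below stands in for the coloured regions in the figures.

```python
# A convex region as the AND of half-plane perceptrons.

def step(z):
    return 1 if z >= 0 else 0

def half_plane(w1, w2, b, x1, x2):
    return step(w1 * x1 + w2 * x2 + b)

def inside_region(edges, x1, x2):
    votes = [half_plane(w1, w2, b, x1, x2) for (w1, w2, b) in edges]
    return step(sum(votes) - len(edges))  # AND: every edge unit must fire

# edges of the unit square as half-planes: x1 >= 0, x1 <= 1, x2 >= 0, x2 <= 1
SQUARE = [(1, 0, 0), (-1, 0, 1), (0, 1, 0), (0, -1, 1)]
print(inside_region(SQUARE, 0.5, 0.5))  # inside  -> 1
print(inside_region(SQUARE, 1.5, 0.5))  # outside -> 0
```

Non-convex or disconnected coloured areas follow by ORing several such convex pieces with one more output unit, which is the multi-layer composition the slides build toward.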
