
Neural Networks: Hopfield Nets and Boltzmann Machines (Fall 2017)

Recap: a Hopfield network is a symmetric, loopy network of binary (±1) units; each unit updates as y_i = +1 if Σ_j w_ji y_j > 0, and y_i = -1 otherwise.


  1. Four non-orthogonal 6-bit patterns
  • Patterns are stationary and stable even though K > 0.14N
  • Fewer spurious minima than for the orthogonal 2-pattern case
  – Most of the spurious memories are in fact ghosts (negations of stored patterns)

  2. Six non-orthogonal 6-bit patterns
  • Breakdown largely due to interference from "ghosts"
  • But patterns are stationary, and often stable
  – Even for K >> 0.14N

  3. More visualization
  • Let's inspect a few 8-bit patterns
  – Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract

  4. One 8-bit pattern
  • It's actually cleanly stored, but there are a few spurious minima

  5. Two orthogonal 8-bit patterns
  • Both have regions of attraction
  • Some spurious minima

  6. Two non-orthogonal 8-bit patterns
  • Actually have fewer spurious minima
  – Not obvious from the visualization

  7. Four orthogonal 8-bit patterns
  • Successfully stored

  8. Four non-orthogonal 8-bit patterns
  • Stored, with interference from ghosts

  9. Eight orthogonal 8-bit patterns
  • Wipeout

  10. Eight non-orthogonal 8-bit patterns
  • Nothing stored
  – Neither stationary nor stable

  11. Making sense of the behavior
  • It seems possible to store K > 0.14N patterns
  – I.e. obtain a weight matrix W such that K > 0.14N patterns are stationary
  – Possible to make more than 0.14N patterns at least 1-bit stable
  • So what was Hopfield talking about?
  • Non-orthogonal patterns are easier to remember
  – I.e. patterns that are closer are easier to remember than patterns that are farther!
  • Can we attempt to get greater control over the process than Hebbian learning gives us?
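The Hebbian rule and the stationarity test behind these experiments can be sketched in a few lines of numpy. This is a minimal illustration, not code from the lecture; `hebbian_weights` and `is_stationary` are made-up helper names:

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian rule: W = sum_p y_p y_pᵀ, with self-connections zeroed."""
    W = sum(np.outer(y, y) for y in patterns).astype(float)
    np.fill_diagonal(W, 0)
    return W

def is_stationary(W, y):
    """A pattern is stationary if one update step reproduces it: sign(W y) = y."""
    return np.array_equal(np.sign(W @ y), y)

# Two orthogonal 4-bit patterns (they differ in N/2 = 2 bits).
y1 = np.array([1,  1, -1, -1])
y2 = np.array([1, -1,  1, -1])
W = hebbian_weights([y1, y2])
print(is_stationary(W, y1), is_stationary(W, y2))   # → True True
```

Larger pattern sets can be passed to the same helpers to reproduce the stationary/stable breakdowns described in the slides above.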

  12. Bold Claim
  • I can always store (up to) N orthogonal patterns such that they are stationary!
  – Although not necessarily stable
  • Why?

  13. "Training" the network
  • How do we make the network store a specific pattern or set of patterns?
  – Hebbian learning
  – Geometric approach
  – Optimization
  • Secondary question
  – How many patterns can we store?

  14. A minor adjustment
  • The behavior of E(y) = yᵀW y with W = YYᵀ - N_p I is identical to the behavior with W = YYᵀ
  – Since yᵀ(YYᵀ - N_p I)y = yᵀYYᵀy - N·N_p, the energy landscape only differs by an additive constant
  – Gradients and the locations of minima remain the same
  – Both matrices have the same eigenvectors
  – Note: YYᵀ is a positive semidefinite matrix
  • But W = YYᵀ is easier to analyze, hence in the following slides we will use W = YYᵀ
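The additive-constant claim is easy to verify numerically over every corner of the hypercube (a small sketch under the slide's definitions, with an arbitrary random pattern matrix):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, Np = 6, 3
Y = rng.choice([-1, 1], size=(N, Np)).astype(float)   # N_p stored patterns as columns
W_full = Y @ Y.T                                      # W = Y Yᵀ
W_zero = Y @ Y.T - Np * np.eye(N)                     # W = Y Yᵀ - N_p I

def energy(W, y):
    return -0.5 * y @ W @ y

# Over every corner of the hypercube the two energy surfaces differ by the
# same additive constant (1/2 · N · N_p), so minima and gradients coincide.
diffs = {round(float(energy(W_zero, np.array(y)) - energy(W_full, np.array(y))), 9)
         for y in product([-1, 1], repeat=N)}
print(diffs)   # → {9.0}
```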

  17. Consider the energy function
  E = -(1/2) yᵀW y - bᵀy
  • Reinstating the bias term for completeness' sake
  – Remember that we don't actually use it in a Hopfield net
  • This is a quadratic! For Hebbian learning, W = YYᵀ is positive semidefinite, so E is a concave quadratic

  19. The energy function
  E = -(1/2) yᵀW y - bᵀy
  • E is a concave quadratic (assuming 0 bias)
  • But the components of y can only take values ±1
  – I.e. y lies on the corners of the unit hypercube
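Restricted to the hypercube corners, a stored pattern should sit below all of its one-bit neighbors. A quick numpy check of this, for a single Hebbian-stored pattern (illustrative only):

```python
import numpy as np

# Hebbian net storing one 4-bit pattern: every corner at Hamming distance 1
# from the stored pattern should have strictly higher energy.
y_stored = np.array([1, -1, 1, 1])
W = np.outer(y_stored, y_stored).astype(float)
np.fill_diagonal(W, 0)

def energy(y):
    return -0.5 * y @ W @ y

def flip(y, i):
    """Return a copy of y with bit i flipped."""
    z = y.copy()
    z[i] = -z[i]
    return z

neighbors_higher = all(energy(flip(y_stored, i)) > energy(y_stored) for i in range(4))
print(neighbors_higher)   # → True
```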

  22. The energy function
  E = -(1/2) yᵀW y - bᵀy
  • The stored values of y (the stored patterns) are the ones where all adjacent corners are higher on the quadratic
  – Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns

  23. Patterns you can store
  • Candidate minima: the stored patterns and their ghosts (negations)
  • Ideally the patterns must be maximally separated on the hypercube
  – The number of patterns we can store depends on the actual distance between the patterns

  24. Storing patterns
  • A pattern y_p is stored if:
  – sign(W y_p) = y_p for all target patterns
  • Note: for binary vectors, sign(·) is a projection
  – It projects y onto the nearest corner of the hypercube
  – It "quantizes" the space into orthants

  25. Storing patterns
  • A pattern y_p is stored if:
  – sign(W y_p) = y_p for all target patterns
  • Training: design W such that this holds
  • Simple solution: y_p is an eigenvector of W, and the corresponding eigenvalue is positive
  – W y_p = λ y_p
  – More generally, orthant(W y_p) = orthant(y_p)
  • How many such y_p can we have?
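The eigenvector construction can be sketched directly as W = Y Λ Yᵀ (a toy example with two orthogonal patterns, following the slide's definitions):

```python
import numpy as np

# Make the target patterns eigenvectors of W with positive eigenvalues:
# W = Y Λ Yᵀ with the targets as (orthogonal) columns of Y.
y1 = np.array([1,  1, -1, -1], dtype=float)
y2 = np.array([1, -1,  1, -1], dtype=float)
Y = np.stack([y1, y2], axis=1)
lam = np.array([1.0, 1.0])          # λ_1 = λ_2 = 1 recovers the Hebbian rule
W = Y @ np.diag(lam) @ Y.T

# W y_p = λ_p ||y_p||² y_p, so sign(W y_p) = y_p: the patterns are stationary.
print(np.array_equal(np.sign(W @ y1), y1),
      np.array_equal(np.sign(W @ y2), y2))   # → True True
```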

  26. Only N patterns?
  • E.g. in two dimensions, (1, 1) and (1, -1) are orthogonal
  • Patterns that differ in N/2 bits are orthogonal
  • You can have no more than N orthogonal vectors in an N-dimensional space

  27. Another random fact that should interest you
  • The eigenvectors of any symmetric matrix W are orthogonal
  • The eigenvalues may be positive or negative

  28. Storing more than one pattern
  • Requirement: given y_1, y_2, …, y_P
  – Design W such that
  • sign(W y_p) = y_p for all target patterns
  • There are no other binary vectors for which this holds
  • What is the largest number of patterns that can be stored?

  29. Storing K orthogonal patterns
  • Simple solution: design W such that y_1, y_2, …, y_K are the eigenvectors of W
  – Let Y = [y_1 y_2 … y_K], W = Y Λ Yᵀ
  – λ_1, …, λ_K are positive
  – For λ_1 = λ_2 = … = λ_K = 1 this is exactly the Hebbian rule
  • The patterns are provably stationary

  30. Hebbian rule
  • In reality
  – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = Y Λ Yᵀ
  – r_{K+1} … r_N are orthogonal to y_1 … y_K
  – λ_1 = λ_2 = … = λ_K = 1
  – λ_{K+1}, …, λ_N = 0
  • All patterns orthogonal to y_1 … y_K are also stationary
  – Although not stable

  31. Storing N orthogonal patterns
  • When we have N orthogonal (or near-orthogonal) patterns y_1, y_2, …, y_N
  – Y = [y_1 y_2 … y_N], W = Y Λ Yᵀ
  – λ_1 = λ_2 = … = λ_N = 1
  • The eigenvectors of W span the space
  • Also, for any y_k: W y_k = y_k

  32. Storing N orthogonal patterns
  • The N orthogonal patterns y_1, y_2, …, y_N span the space
  • Any pattern y can be written as
  y = a_1 y_1 + a_2 y_2 + ⋯ + a_N y_N
  W y = a_1 W y_1 + a_2 W y_2 + ⋯ + a_N W y_N = a_1 y_1 + a_2 y_2 + ⋯ + a_N y_N = y
  • All patterns are stable
  – Remembers everything
  – Completely useless network
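With N orthogonal ±1 patterns, the Hebbian weight matrix collapses to a multiple of the identity, which is why every state becomes stationary (a small numeric check, not lecture code):

```python
import numpy as np

# Four mutually orthogonal 4-bit patterns (columns of Y).
Y = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]], dtype=float).T
W = Y @ Y.T

# With N orthogonal ±1 patterns, W = Y Yᵀ = N·I, so W y = N y for *every* y:
# every state is stationary, and the network "remembers" everything.
print(np.array_equal(W, 4 * np.eye(4)))   # → True
```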

  33. Storing K orthogonal patterns
  • Even if we store fewer than N patterns
  – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = Y Λ Yᵀ
  – r_{K+1} … r_N are orthogonal to y_1 … y_K
  – λ_1 = λ_2 = … = λ_K = 1
  – λ_{K+1}, …, λ_N = 0
  • All patterns orthogonal to y_1 … y_K are stationary
  • Any pattern that is entirely in the subspace spanned by y_1 … y_K is also stable (same logic as earlier)
  • Only patterns that are partially in the subspace spanned by y_1 … y_K are unstable
  – They get projected onto the subspace spanned by y_1 … y_K

  34. Problem with the Hebbian rule
  • Even if we store fewer than N patterns
  – Let Y = [y_1 y_2 … y_K r_{K+1} r_{K+2} … r_N], W = Y Λ Yᵀ
  – r_{K+1} … r_N are orthogonal to y_1 … y_K
  – λ_1 = λ_2 = … = λ_K = 1
  • Problems arise because the eigenvalues are all 1.0
  – This ensures stationarity of vectors in the subspace
  – What if we get rid of this requirement?

  35. Hebbian rule and general (non-orthogonal) vectors
  w_ji = Σ_{p} y_j^(p) y_i^(p)
  • What happens when the patterns are not orthogonal?
  • What happens when the patterns are presented more than once?
  – Different patterns presented different numbers of times
  – Equivalent to having unequal eigenvalues
  • Can we predict the evolution of any vector y?
  – Hint: Lanczos iterations
  • Can write Y_P = Y_ortho C, so W = Y_ortho C Λ Cᵀ Y_orthoᵀ

  36. The bottom line
  • With a network of N units (i.e. N-bit patterns)
  • The maximum number of stable patterns is actually exponential in N
  – McEliece and Posner, '84
  – E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable
  • For a specific set of K patterns, we can always build a network for which all K patterns are stable, provided K ≤ N
  – Abu-Mostafa and St. Jacques, '85
  • For large N, the upper bound on K is actually N/(4 log N)
  – McEliece et al., '87
  – But this may come with many "parasitic" memories
  • How do we find this network? And can we do something about the parasitic memories?

  39. A different tack
  • How do we make the network store a specific pattern or set of patterns?
  – Hebbian learning
  – Geometric approach
  – Optimization
  • Secondary question
  – How many patterns can we store?

  40. Consider the energy function
  E = -(1/2) yᵀW y - bᵀy
  • This must be maximally low for target patterns
  • It must be maximally high for all other patterns
  – So that they are unstable and evolve into one of the target patterns

  41. Alternate approach to estimating the network
  E(y) = -(1/2) yᵀW y - bᵀy
  • Estimate W (and b) such that
  – E is minimized for y_1, y_2, …, y_P
  – E is maximized for all other y
  • Caveat: it is unrealistic to expect to store more than N patterns, but can we make those N patterns memorable?

  42. Optimizing W (and b)
  E(y) = -(1/2) yᵀW y
  Ŵ = argmin_W Σ_{y∈Y_P} E(y)
  – The bias can be captured by another fixed-value component
  • Minimize the total energy of the target patterns
  – Problem with this?

  43. Optimizing W
  E(y) = -(1/2) yᵀW y
  Ŵ = argmin_W Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y)
  • Minimize the total energy of the target patterns
  • Maximize the total energy of all non-target patterns

  44. Optimizing W
  E(y) = -(1/2) yᵀW y
  Ŵ = argmin_W Σ_{y∈Y_P} E(y) - Σ_{y∉Y_P} E(y)
  • Simple gradient descent:
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
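The gradient step can be sketched as follows. This is a toy example, not lecture code; η, the iteration count, and the choice of a single "other" pattern are arbitrary placeholders:

```python
import numpy as np

def update_W(W, targets, others, eta=0.01):
    """One gradient step: W ← W + η(Σ_targets y yᵀ − Σ_others y yᵀ)."""
    step = sum(np.outer(y, y) for y in targets) - sum(np.outer(y, y) for y in others)
    return W + eta * step

# Toy run: repeatedly lower the energy of one target pattern while raising
# one competing non-target pattern.
target = np.array([1, -1, 1, -1])
other = np.array([1, 1, 1, 1])
W = np.zeros((4, 4))
for _ in range(100):
    W = update_W(W, [target], [other])

energy = lambda y: -0.5 * y @ W @ y
print(energy(target) < energy(other))   # → True
```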

  45. Optimizing W
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
  • Can "emphasize" the importance of a pattern by repeating it
  – More repetitions → greater emphasis

  46. Optimizing W
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
  • Can "emphasize" the importance of a pattern by repeating it
  – More repetitions → greater emphasis
  • How many of these non-target patterns?
  – Do we need to include all of them?
  – Are they all equally important?

  47. The training again
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
  • Note the energy contour of a Hopfield network for any weight W
  – The bowls will all actually be quadratic
  (figure: energy landscape over states)

  48. The training again
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
  • The first term tries to minimize the energy at target patterns
  – Make them local minima
  – Emphasize more "important" memories by repeating them more frequently

  49. The negative class
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P} y yᵀ )
  • The second term tries to "raise" all non-target patterns
  – Do we need to raise everything?

  50. Option 1: Focus on the valleys
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y=valley} y yᵀ )
  • Focus on raising the valleys
  – If you raise every valley, eventually they'll all move up above the target patterns, and many will even vanish

  51. Identifying the valleys
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y=valley} y yᵀ )
  • Problem: how do you identify the valleys for the current W?

  52. Identifying the valleys
  • Initialize the network randomly and let it evolve
  – It will settle in a valley

  53. Training the Hopfield network
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y=valley} y yᵀ )
  • Initialize W
  • Compute the total outer product of all target patterns
  – More important patterns are presented more frequently
  • Randomly initialize the network several times and let it evolve
  – And settle at a valley
  • Compute the total outer product of the valley patterns
  • Update the weights

  54. Training the Hopfield network: SGD version
  • Initialize W
  • Do until convergence, satisfaction, or death from boredom:
  – Sample a target pattern y_p
  • The sampling frequency of a pattern must reflect its importance
  – Randomly initialize the network and let it evolve
  • And settle at a valley y_v
  – Update the weights: W = W + η( y_p y_pᵀ - y_v y_vᵀ )
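The SGD loop above can be implemented directly. This is a minimal sketch, not the course code: `evolve` runs asynchronous ±1 updates to a fixed point, and `train` applies the sampled update rule; the network size, learning rate, epoch count, and target patterns are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def evolve(W, y, max_sweeps=100):
    """Asynchronous updates until no unit flips: the state has settled in a valley."""
    y = y.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            s = 1 if W[i] @ y >= 0 else -1
            if s != y[i]:
                y[i] = s
                changed = True
        if not changed:
            break
    return y

def train(targets, N, epochs=200, eta=0.01):
    """SGD: sample a target, find a valley from a random start, update W."""
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = targets[rng.integers(len(targets))]       # importance ~ sampling frequency
        y_v = evolve(W, rng.choice([-1, 1], size=N))    # settle at a valley
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
    return W

targets = [np.array([1, 1, -1, -1, 1, 1]), np.array([1, -1, 1, -1, 1, -1])]
W = train(targets, N=6)
# W stays symmetric with zero diagonal; whether both targets are recalled
# depends on the patterns and the run, so we just report it.
print(np.allclose(W, W.T), all(np.array_equal(evolve(W, t), t) for t in targets))
```

Note that the update leaves the diagonal of W at zero (each outer product contributes +1 and -1 there), which keeps the asynchronous dynamics convergent.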


  56. Which valleys?
  • Should we randomly sample valleys?
  – Are all valleys equally important?
  • Major requirement: memories must be stable
  – They must be broad valleys
  • Spurious valleys in the neighborhood of memories are the most important to eliminate

  58. Identifying the valleys
  • Initialize the network at valid memories and let it evolve
  – It will settle in a valley. If this is not the target pattern, raise it

  59. Training the Hopfield network
  W = W + η( Σ_{y∈Y_P} y yᵀ - Σ_{y∉Y_P & y=valley} y yᵀ )
  • Initialize W
  • Compute the total outer product of all target patterns
  – More important patterns are presented more frequently
  • Initialize the network with each target pattern and let it evolve
  – And settle at a valley
  • Compute the total outer product of the valley patterns
  • Update the weights

  60. Training the Hopfield network: SGD version
  • Initialize W
  • Do until convergence, satisfaction, or death from boredom:
  – Sample a target pattern y_p
  • The sampling frequency of a pattern must reflect its importance
  – Initialize the network at y_p and let it evolve
  • And settle at a valley y_v
  – Update the weights: W = W + η( y_p y_pᵀ - y_v y_vᵀ )

  61. A possible problem
  • What if there's another target pattern down-valley?
  – Raising it will destroy a better-represented or stored pattern!

  62. A related issue
  • There is really no need to raise the entire surface, or even every valley
  • Raise the neighborhood of each target memory
  – Sufficient to make the memory a valley
  – The broader the neighborhood considered, the broader the valley

  64. Raising the neighborhood
  • Starting from a target pattern, let the network evolve only a few steps
  – Try to raise the resultant location
  • This will raise the neighborhood of targets
  • And will avoid the problem of down-valley targets

  65. Training the Hopfield network: SGD version
  • Initialize W
  • Do until convergence, satisfaction, or death from boredom:
  – Sample a target pattern y_p
  • The sampling frequency of a pattern must reflect its importance
  – Initialize the network at y_p and let it evolve a few steps (2–4)
  • And arrive at a down-valley position y_d
  – Update the weights: W = W + η( y_p y_pᵀ - y_d y_dᵀ )
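Compared to the earlier loop, only the evolution step changes: run just a few update steps from the target so the raised point stays in the target's neighborhood. A sketch of one such SGD step, with arbitrary toy values for W, η, and the pattern:

```python
import numpy as np

rng = np.random.default_rng(2)

def evolve_few(W, y, steps=3):
    """Evolve only a few asynchronous steps, staying near the starting pattern."""
    y = y.copy()
    for _ in range(steps):
        i = rng.integers(len(y))
        y[i] = 1 if W[i] @ y >= 0 else -1
    return y

# One SGD step of the neighborhood-raising rule on an arbitrary symmetric W.
N, eta = 6, 0.01
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
y_p = np.array([1, -1, 1, -1, 1, -1])
y_d = evolve_few(W, y_p)                 # down-valley point near y_p
W = W + eta * (np.outer(y_p, y_p) - np.outer(y_d, y_d))
print(np.allclose(W, W.T))   # → True (the update preserves symmetry)
```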

  66. A probabilistic interpretation
  E(y) = -(1/2) yᵀW y
  P(y) = C exp(-E(y)) = C exp((1/2) yᵀW y)
  • For continuous y, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
  • For binary y, it is the analog of the negative log likelihood of a Boltzmann distribution
  – Minimizing energy maximizes log likelihood

  67. The Boltzmann Distribution
  E(y) = -(1/2) yᵀW y - bᵀy
  P(y) = C exp(-E(y) / kT),  C = 1 / Σ_y exp(-E(y) / kT)
  • k is the Boltzmann constant
  • T is the temperature of the system
  • The energy terms are like the negative log likelihood of a Boltzmann distribution at T = 1
  – The derivation of this probability is in fact quite trivial

  68. Continuing the Boltzmann analogy
  E(y) = -(1/2) yᵀW y - bᵀy
  P(y) = C exp(-E(y) / kT),  C = 1 / Σ_y exp(-E(y) / kT)
  • The system probabilistically selects states with lower energy
  – With infinitesimally slow cooling, at T = 0 it arrives at the global minimal state
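For a small N, the Boltzmann distribution can be tabulated exactly (a sketch with kT = 1 and random weights; the normalizer C is computed by brute force over all 2^N states):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N, kT = 5, 1.0
W = rng.standard_normal((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
b = rng.standard_normal(N)

def energy(y):
    return -0.5 * y @ W @ y - b @ y

# P(y) = C·exp(−E(y)/kT) over all 2^N binary states, C = 1/Σ_y exp(−E(y)/kT).
states = [np.array(s) for s in product([-1, 1], repeat=N)]
logits = np.array([-energy(y) / kT for y in states])
P = np.exp(logits - logits.max())    # subtract max for numerical stability
P /= P.sum()

# The probabilities sum to 1, and the most probable state is exactly the
# minimum-energy state: the system favors low-energy configurations.
print(np.isclose(P.sum(), 1.0),
      np.argmax(P) == np.argmin([energy(y) for y in states]))   # → True True
```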
