Neural Networks: Hopfield Nets and Boltzmann Machines (Fall 2017)
Recap: Hopfield network
• Each unit computes the sign of its net input: y_i = +1 if Σ_j w_ji y_j > 0, and y_i = -1 otherwise
• A symmetric, loopy network: w_ij = w_ji, with every unit feeding back into every other
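(Not from the slides: a minimal sketch of this update rule in numpy, with illustrative names. One asynchronous sweep sets each unit, in random order, to the sign of its net input.)

```python
import numpy as np

def hopfield_sweep(W, y, b=None):
    """One asynchronous sweep: each unit takes the sign of its net input.
    W is symmetric with zero diagonal; y is a vector of +/-1 values."""
    b = np.zeros(len(y)) if b is None else b
    y = y.copy()
    for i in np.random.permutation(len(y)):
        y[i] = 1 if W[i] @ y + b[i] > 0 else -1
    return y
```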
Four non-orthogonal 6-bit patterns
• Patterns are perfectly stationary and stable for K > 0.14N
• Fewer spurious minima than for the orthogonal 2-pattern case
– Most fake-looking memories are in fact ghosts..
Six non-orthogonal 6-bit patterns
• Breakdown largely due to interference from "ghosts"
• But patterns are stationary, and often stable
– Even for K >> 0.14N
More visualization..
• Let's inspect a few 8-bit patterns
– Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract
One 8-bit pattern
• It's actually cleanly stored, but there are a few spurious minima
Two orthogonal 8-bit patterns
• Both have regions of attraction
• Some spurious minima
Two non-orthogonal 8-bit patterns
• Actually have fewer spurious minima
– Not obvious from the visualization..
Four orthogonal 8-bit patterns
• Successfully stored
Four non-orthogonal 8-bit patterns
• Stored, but with interference from ghosts..
Eight orthogonal 8-bit patterns
• Wipeout
Eight non-orthogonal 8-bit patterns
• Nothing stored
– Neither stationary nor stable
Making sense of the behavior
• It seems possible to store K > 0.14N patterns
– I.e. obtain a weight matrix W such that K > 0.14N patterns are stationary
– Possible to make more than 0.14N patterns at least 1-bit stable
• So what was Hopfield talking about?
• Patterns that are non-orthogonal are easier to remember
– I.e. patterns that are closer are easier to remember than patterns that are farther!
• Can we attempt to get greater control over the process than Hebbian learning gives us?
Bold Claim
• I can always store (up to) N orthogonal patterns such that they are stationary!
– Although not necessarily stable
• Why?
"Training" the network
• How do we make the network store a specific pattern or set of patterns?
– Hebbian learning
– Geometric approach
– Optimization
• Secondary question
– How many patterns can we store?
A minor adjustment
• Note: the behavior of E(y) = y^T W y is the same whether we take W = YY^T - N_p I or W = YY^T
– The energy landscape only differs by an additive constant: y^T (YY^T - N_p I) y = y^T YY^T y - N_p N, since y^T y = N for any ±1 vector
– Gradients and the locations of minima remain the same
– Both matrices have the same eigenvectors
– NOTE: YY^T is a positive semidefinite matrix
• But W = YY^T is easier to analyze. Hence in the following slides we will use W = YY^T
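(A quick numeric check of this claim, as a sketch with illustrative names: over every corner of the hypercube, the energies under YY^T and YY^T - N_p I differ by exactly the same constant N_p·N.)

```python
import numpy as np
from itertools import product

N, K = 6, 3                             # N-bit patterns, K = N_p of them
Y = np.sign(np.random.randn(N, K))      # K random +/-1 patterns as columns
W_full = Y @ Y.T                        # W = YY^T
W_adj = Y @ Y.T - K * np.eye(N)         # W = YY^T - N_p I

for bits in product([-1, 1], repeat=N):
    y = np.array(bits)
    # y^T (K I) y = K * N for every +/-1 vector: a constant energy offset
    assert np.isclose(y @ W_full @ y - y @ W_adj @ y, K * N)
```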
Consider the energy function
E = -(1/2) y^T W y - b^T y
• This is a quadratic! For Hebbian learning, W = YY^T is positive semidefinite, so with the leading minus sign E is a concave quadratic
• Reinstating the bias term for completeness' sake
– Remember that we don't actually use it in a Hopfield net
The energy function
E = -(1/2) y^T W y - b^T y
• E is a concave quadratic
– Shown from above (assuming 0 bias)
• But the components of y can only take values ±1
– I.e. y lies on the corners of the unit hypercube
The energy function
E = -(1/2) y^T W y - b^T y
(figure: stored patterns marked as low corners of the quadratic surface)
• The stored values of y are the ones where all adjacent corners are higher on the quadratic
– Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns
Patterns you can store
(figure: stored patterns and their ghosts (negations) on the hypercube)
• Ideally they must be maximally separated on the hypercube
– The number of patterns we can store depends on the actual distance between the patterns
Storing patterns
• A pattern y_p is stored if:
– sign(W y_p) = y_p for all target patterns
• Note: for binary vectors, sign(·) is a projection
– It projects y onto the nearest corner of the hypercube
– It "quantizes" the space into orthants
Storing patterns
• A pattern y_p is stored if:
– sign(W y_p) = y_p for all target patterns
• Training: design W such that this holds
• Simple solution: make y_p an eigenvector of W with a positive eigenvalue
– W y_p = λ y_p, λ > 0
– More generally, orthant(W y_p) = orthant(y_p)
• How many such y_p can we have?
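(A sketch of this stationarity test, with a hypothetical helper name; ties at exactly zero are broken toward -1 to keep it simple.)

```python
import numpy as np

def is_stationary(W, y):
    """True iff one synchronous update maps y back to itself,
    i.e. W y lies in the same orthant as y."""
    return np.array_equal(np.where(W @ y > 0, 1, -1), y)
```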
Only N patterns?
(figure: the 2-bit example, (1,1) vs. (1,-1))
• Patterns that differ in N/2 bits are orthogonal
• You can have no more than N orthogonal vectors in an N-dimensional space
Another random fact that should interest you
• The eigenvectors of any symmetric matrix W are orthogonal
• The eigenvalues may be positive or negative
Storing more than one pattern
• Requirement: given y_1, y_2, ..., y_K
– Design W such that
• sign(W y_p) = y_p for all target patterns
• There are no other binary vectors for which this holds
• What is the largest number of patterns that can be stored?
Storing K orthogonal patterns
• Simple solution: design W such that y_1, y_2, ..., y_K are the eigenvectors of W
– Let Y = [y_1 y_2 ... y_K], and W = Y Λ Y^T
– λ_1, ..., λ_K are positive
– For λ_1 = λ_2 = ... = λ_K = 1 this is exactly the Hebbian rule
• The patterns are provably stationary
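(A sketch of this construction, using Walsh-Hadamard rows as a convenient source of mutually orthogonal ±1 patterns; assumes scipy is available, and all names are illustrative.)

```python
import numpy as np
from scipy.linalg import hadamard

N, K = 8, 4
patterns = hadamard(N)[:K]       # K orthogonal +/-1 patterns as rows
Y = patterns.T / np.sqrt(N)      # the same patterns as orthonormal columns
W = Y @ Y.T                      # lambda_1 = ... = lambda_K = 1: Hebbian rule

for y in patterns:               # every stored pattern is stationary
    assert np.array_equal(np.where(W @ y > 0, 1, -1), y)
```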
Hebbian rule
• In reality
– Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N], and W = Y Λ Y^T
– r_{K+1}, r_{K+2}, ..., r_N are orthogonal to y_1, y_2, ..., y_K
– λ_1 = λ_2 = ... = λ_K = 1
– λ_{K+1}, ..., λ_N = 0
• All patterns orthogonal to y_1, y_2, ..., y_K are also stationary
– Although not stable
Storing N orthogonal patterns
• When we have N orthogonal (or near-orthogonal) patterns y_1, y_2, ..., y_N
– Y = [y_1 y_2 ... y_N], W = Y Λ Y^T
– λ_1 = λ_2 = ... = λ_N = 1
• The eigenvectors of W span the space
• Also, for any y: W y = y
Storing N orthogonal patterns
• The N orthogonal patterns y_1, y_2, ..., y_N span the space
• Any pattern y can be written as
y = a_1 y_1 + a_2 y_2 + ... + a_N y_N
W y = a_1 W y_1 + a_2 W y_2 + ... + a_N W y_N
    = a_1 y_1 + a_2 y_2 + ... + a_N y_N = y
• All patterns are stable
– Remembers everything
– Completely useless network
Storing K orthogonal patterns
• Even if we store fewer than N patterns
– Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N], W = Y Λ Y^T
– r_{K+1}, ..., r_N are orthogonal to y_1, ..., y_K
– λ_1 = λ_2 = ... = λ_K = 1; λ_{K+1}, ..., λ_N = 0
• All patterns orthogonal to y_1, ..., y_K are stationary
• Any pattern that is entirely in the subspace spanned by y_1, ..., y_K is also stable (same logic as earlier)
• Only patterns that are partially in the subspace spanned by y_1, ..., y_K are unstable
– They get projected onto the subspace spanned by y_1, ..., y_K
Problem with the Hebbian rule
• Even if we store fewer than N patterns
– Let Y = [y_1 y_2 ... y_K r_{K+1} r_{K+2} ... r_N], W = Y Λ Y^T
– r_{K+1}, ..., r_N are orthogonal to y_1, ..., y_K
– λ_1 = λ_2 = ... = λ_K = 1
• Problems arise because the eigenvalues are all 1.0
– This ensures stationarity of all vectors in the subspace
– What if we get rid of this requirement?
Hebbian rule and general (non-orthogonal) vectors
w_ji = Σ_{p ∈ {p}} y_j^p y_i^p
• What happens when the patterns are not orthogonal?
• What happens when the patterns are presented more than once?
– Different patterns presented different numbers of times
– Equivalent to having unequal eigenvalues..
• Can we predict the evolution of any vector y?
– Hint: Lanczos iterations
– Can write Y_P = U_ortho B, so W = U_ortho B Λ B^T U_ortho^T
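(The Hebbian rule above, written directly as a sum of outer products; a minimal sketch in which `patterns` is a hypothetical array of ±1 rows, and repeating a row models presenting a pattern more than once.)

```python
import numpy as np

def hebbian_weights(patterns):
    """w_ji = sum over presentations p of y_j^p * y_i^p, diagonal zeroed.
    Duplicated rows weight their pattern more heavily, which is equivalent
    to giving it a larger effective eigenvalue."""
    P = np.asarray(patterns)      # shape (num_presentations, N)
    W = P.T @ P                   # sum_p y^p (y^p)^T
    np.fill_diagonal(W, 0)        # no self-connections
    return W
```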
The bottom line
• With a network of N units (i.e. N-bit patterns)
• The maximum number of stable patterns is actually exponential in N
– McEliece and Posner, '84
– E.g. when we had the Hebbian net with N orthogonal base patterns, all patterns were stable
• For a specific set of K patterns, we can always build a network for which all K patterns are stable, provided K ≤ N
– Abu-Mostafa and St. Jacques, '85
– (How do we find this network?)
• For large N, the upper bound on K is actually N/(4 log N)
– McEliece et al., '87
– But this may come with many "parasitic" memories
– (Can we do something about this?)
A different tack
• How do we make the network store a specific pattern or set of patterns?
– Hebbian learning
– Geometric approach
– Optimization
• Secondary question
– How many patterns can we store?
Consider the energy function
E = -(1/2) y^T W y - b^T y
• This must be maximally low for target patterns
• It must be maximally high for all other patterns
– So that they are unstable and evolve into one of the target patterns
Alternate approach to estimating the network
E(y) = -(1/2) y^T W y - b^T y
• Estimate W (and b) such that
– E is minimized for y_1, y_2, ..., y_P
– E is maximized for all other y
• Caveat: it is unrealistic to expect to store more than N patterns, but can we make those N patterns memorable?
Optimizing W (and b)
E(y) = -(1/2) y^T W y
Ŵ = argmin_W Σ_{y ∈ Y_P} E(y)
(The bias can be captured by another fixed-value component)
• Minimize the total energy of target patterns
– Problem with this?
Optimizing W
E(y) = -(1/2) y^T W y
Ŵ = argmin_W Σ_{y ∈ Y_P} E(y) - Σ_{y ∉ Y_P} E(y)
• Minimize the total energy of target patterns
• Maximize the total energy of all non-target patterns
Optimizing W
E(y) = -(1/2) y^T W y
Ŵ = argmin_W Σ_{y ∈ Y_P} E(y) - Σ_{y ∉ Y_P} E(y)
• Simple gradient descent:
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P} y y^T)
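(One step of this update, sketched in numpy; `Y_pos` and `Y_neg` are hypothetical arrays holding target and non-target patterns as rows, and `eta` is a learning rate.)

```python
import numpy as np

def gradient_step(W, Y_pos, Y_neg, eta=0.01):
    # W <- W + eta * (sum of target outer products - sum of non-target ones):
    # lowers the energy at targets and raises it everywhere else.
    return W + eta * (Y_pos.T @ Y_pos - Y_neg.T @ Y_neg)
```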
Optimizing W
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P} y y^T)
• Can "emphasize" the importance of a pattern by repeating it
– More repetitions ⇒ greater emphasis
• How many of these non-target patterns?
– Do we need to include all of them?
– Are all equally important?
The training again..
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P} y y^T)
• Note the energy contour of a Hopfield network for any weight W
– The bowls will all actually be quadratic
(figure: energy vs. state)
The training again
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P} y y^T)
• The first term tries to minimize the energy at target patterns
– Make them local minima
– Emphasize more "important" memories by repeating them more frequently
(figure: target patterns at low points of energy vs. state)
The negative class
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P} y y^T)
• The second term tries to "raise" all non-target patterns
– Do we need to raise everything?
(figure: energy vs. state)
Option 1: Focus on the valleys
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P & y = valley} y y^T)
• Focus on raising the valleys
– If you raise every valley, eventually they'll all move up above the target patterns, and many will even vanish
(figure: energy vs. state)
Identifying the valleys..
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P & y = valley} y y^T)
• Problem: how do you identify the valleys for the current W?
(figure: energy vs. state)
Identifying the valleys..
• Initialize the network randomly and let it evolve
– It will settle in a valley
(figure: energy vs. state)
Training the Hopfield network
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P & y = valley} y y^T)
• Initialize W
• Compute the total outer product of all target patterns
– More important patterns are presented more frequently
• Randomly initialize the network several times and let it evolve
– And settle at a valley
• Compute the total outer product of the valley patterns
• Update the weights
Training the Hopfield network: SGD version
• Initialize W
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern y_p
• Sampling frequency of a pattern must reflect its importance
– Randomly initialize the network and let it evolve
• And settle at a valley y_v
– Update the weights
• W = W + η (y_p y_p^T - y_v y_v^T)
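(Put together, the whole loop might look like the following sketch; all names are illustrative.)

```python
import numpy as np

def evolve(W, y, max_sweeps=100):
    """Let the network run until no bit changes: a settled valley."""
    for _ in range(max_sweeps):
        y_next = np.where(W @ y > 0, 1, -1)
        if np.array_equal(y_next, y):
            break
        y = y_next
    return y

def train_hopfield_sgd(targets, N, steps=1000, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for _ in range(steps):
        y_p = targets[rng.integers(len(targets))]      # sample a target
        y_v = evolve(W, rng.choice([-1, 1], size=N))   # valley from random start
        W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
    return W
```

Synchronous sweeps are used in `evolve` only to keep the sketch short; the lecture's network updates one unit at a time, which is what guarantees the energy never increases.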
Which valleys?
• Should we randomly sample valleys?
– Are all valleys equally important?
• Major requirement: memories must be stable
– They must be broad valleys
• Spurious valleys in the neighborhood of memories are more important to eliminate
(figure: energy vs. state)
Identifying the valleys..
• Initialize the network at valid memories and let it evolve
– It will settle in a valley. If this is not the target pattern, raise it
(figure: energy vs. state)
Training the Hopfield network
W = W + η (Σ_{y ∈ Y_P} y y^T - Σ_{y ∉ Y_P & y = valley} y y^T)
• Initialize W
• Compute the total outer product of all target patterns
– More important patterns are presented more frequently
• Initialize the network with each target pattern and let it evolve
– And settle at a valley
• Compute the total outer product of the valley patterns
• Update the weights
Training the Hopfield network: SGD version
• Initialize W
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern y_p
• Sampling frequency of a pattern must reflect its importance
– Initialize the network at y_p and let it evolve
• And settle at a valley y_v
– Update the weights
• W = W + η (y_p y_p^T - y_v y_v^T)
A possible problem
• What if there's another target pattern down-valley?
– Raising it will destroy a better-represented or better-stored pattern!
(figure: energy vs. state)
A related issue
• There's really no need to raise the entire surface, or even every valley
• Raise the neighborhood of each target memory instead
– Sufficient to make the memory a valley
– The broader the neighborhood considered, the broader the valley
(figure: energy vs. state)
Raising the neighborhood
• Starting from a target pattern, let the network evolve only a few steps
– Then try to raise the resultant location
• This will raise the neighborhood of the targets
• And will avoid the problem of down-valley targets
(figure: energy vs. state)
Training the Hopfield network: SGD version
• Initialize W
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern y_p
• Sampling frequency of a pattern must reflect its importance
– Initialize the network at y_p and let it evolve a few steps (2-4)
• And arrive at a down-valley position y_d
– Update the weights
• W = W + η (y_p y_p^T - y_d y_d^T)
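(Relative to the earlier SGD sketch, only the negative sample changes: evolve a few sweeps from the target itself instead of settling from a random start. A sketch, with illustrative names.)

```python
import numpy as np

def down_valley_sample(W, y_p, n_steps=3):
    # A few sweeps starting from the target itself: the result is a nearby
    # down-valley point, so raising it raises the target's own neighborhood
    # without touching far-away valleys.
    y = y_p.copy()
    for _ in range(n_steps):
        y = np.where(W @ y > 0, 1, -1)
    return y
```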
A probabilistic interpretation
E(y) = -(1/2) y^T W y        P(y) = C exp(-E(y)) = C exp((1/2) y^T W y)
• For continuous y, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
• For binary y, it is the analog of the negative log likelihood of a Boltzmann distribution
– Minimizing energy maximizes log likelihood
The Boltzmann Distribution
E(y) = -(1/2) y^T W y - b^T y
P(y) = C exp(-E(y) / kT),   C = 1 / Σ_y exp(-E(y) / kT)
• k is the Boltzmann constant
• T is the temperature of the system
• The energy terms are like the log likelihood of a Boltzmann distribution at T = 1
– The derivation of this probability is in fact quite trivial..
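(For small N the distribution can be evaluated exactly by enumerating all 2^N states; a sketch that folds kT into a single temperature parameter.)

```python
import numpy as np
from itertools import product

def boltzmann_distribution(W, b, T=1.0):
    """Exact P(y) = C exp(-E(y)/T) over all 2^N binary states."""
    N = len(b)
    states = np.array(list(product([-1, 1], repeat=N)))
    E = np.array([-0.5 * y @ W @ y - b @ y for y in states])
    p = np.exp(-E / T)
    return states, p / p.sum()    # C = 1 / sum of un-normalized masses
```

Lower-energy states receive exponentially more probability mass, and shrinking T concentrates the distribution on the global minima, which is the cooling behavior described on the next slide.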
Continuing the Boltzmann analogy
E(y) = -(1/2) y^T W y - b^T y
P(y) = C exp(-E(y) / kT)
• The system probabilistically selects states with lower energy
– With infinitesimally slow cooling, at T = 0 it arrives at the global minimal state