Entropic Affinities: Properties and Efficient Numerical Computation
Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
June 18, 2013
Summary
- The entropic affinities define affinities so that each point has an effective number of neighbors equal to K.
- First introduced in G. E. Hinton & S. Roweis: "Stochastic Neighbor Embedding", NIPS 2002.
- They are not in widespread use, even though they work well in a range of problems.
- We study some properties of entropic affinities and give fast algorithms to compute them.
Affinity matrix
Defines a measure of similarity between points in the dataset. Used in:
- Dimensionality reduction: Stochastic Neighbor Embedding, t-SNE, Elastic Embedding, Laplacian Eigenmaps.
- Clustering: Mean-Shift, spectral clustering.
- Semi-supervised learning.
- and others.
The performance of these algorithms depends crucially on the affinity construction, governed by the bandwidth σ. Common practice is to set σ:
- constant,
- by a rule of thumb (e.g. the distance to the 7th nearest neighbor, Zelnik-Manor & Perona, 2005).
[Figure: a dataset and its affinity matrix.]
Motivation: choice of σ
COIL-20: rotations of objects every 5°; the inputs are 128 × 128 greyscale images.
[Figure: affinity matrices computed with entropic affinities, the rule of thumb (distance to the 7th nearest neighbor, Zelnik-Manor & Perona, 2005), and a constant σ.]
Motivation: choice of σ (continued)
[Figure: dimensionality reduction of COIL-20 with the Elastic Embedding algorithm, using entropic affinities, the rule of thumb (distance to the 7th nearest neighbor), and a constant σ.]
Search for a good σ
A good σ should:
- be set separately for every data point, i.e. a σ_n for each x_n;
- take into account the whole distribution of distances.
Entropic affinities
In the entropic affinities, σ is set individually for each point such that the point has a distribution over neighbors with fixed perplexity K (Hinton & Roweis, 2002).
- Given data points x_1, …, x_N ∈ R^D, consider the distribution of the neighbors of a point x ∈ R^D:

    p_n(x; σ) = K(‖(x − x_n)/σ‖²) / Σ_{k=1}^N K(‖(x − x_k)/σ‖²).

- The entropy of this distribution is

    H(x, σ) = −Σ_{n=1}^N p_n(x, σ) log p_n(x, σ).

- The bandwidth σ (or the precision β = 1/(2σ²)) is chosen given the perplexity K so that

    H(x, β) = log K.

- A perplexity of K in a distribution p over N neighbors provides the same surprise as if we were to choose among K equiprobable neighbors.
- We define the entropic affinities as the probabilities p = (p_1, …, p_N) for x with respect to β. These affinities define a random-walk matrix, and are the posterior distribution of a kernel density estimate.
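As a quick numerical illustration of these definitions, the sketch below (assuming a Gaussian kernel; the function name `entropy_of_beta` is chosen here, not from the talk) computes p_n(x; β) and the entropy H from the squared distances:

```python
import numpy as np

def entropy_of_beta(d2, beta):
    """Neighbor distribution p_n(x; beta) and its entropy H(x, beta)
    for a Gaussian kernel, where d2[n] = ||x - x_n||^2 and
    beta = 1/(2 sigma^2) is the precision."""
    logits = -beta * (d2 - d2.min())   # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()                       # p_n = exp(-beta d_n^2) / sum_k exp(-beta d_k^2)
    nz = p > 0                         # avoid 0 * log 0
    H = -(p[nz] * np.log(p[nz])).sum()
    return H, p
```

As β → 0 the distribution tends to uniform and H → log N; as β → ∞ all mass concentrates on the nearest neighbor and H → 0, which is why a root of H(β) = log K exists for any K ∈ (0, N).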
Entropic affinities: example
[Figure: example of entropic affinities on a sample dataset.]
Entropic affinities: properties
- Finding β_n such that

    H(x_n, β_n) ≡ −Σ_{k=1}^N p_k(x_n, β_n) log p_k(x_n, β_n) = log K

  is a 1D root-finding problem, or equivalently an inversion problem β_n = H_{x_n}^{-1}(log K).
- It should be solved for every x_n ∈ {x_1, …, x_N}.
- We can prove that:
  - the root-finding problem is well defined for a Gaussian kernel: for any x_n and any K ∈ (0, N) it has a unique root β* > 0;
  - the inverse β_n = H_{x_n}^{-1}(log K) is a uniquely defined, continuously differentiable function for all x_n and K ∈ (0, N).
[Figure: H(x, β) as a function of log β, with the root at H = log K for K = 30.]
Entropic affinities: bounds
For every x_n and K ∈ (0, N), the root β* lies in the interval [β_L, β_U], where

    β_U = (1/Δ₂²) log( (p₁/(1 − p₁)) (N − 1) ),

    β_L = max( (N/(N − 1)) log(N/K) / Δ_N² , sqrt( log(N/K) / (d_N⁴ − d₁⁴) ) ),

with Δ₂² = d₂² − d₁², Δ_N² = d_N² − d₁², d₁ ≤ d₂ ≤ … ≤ d_N the sorted distances from x_n to the other points, and p₁ the unique solution of the equation

    2(1 − p₁) log( N / (2(1 − p₁)) ) = log( min(√(2N), K) ).

Given the sorted distances, the bounds are computed in O(1) for each point.
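Assuming the expressions above and a Gaussian kernel, the bounds can be sketched as follows (the helper name `beta_bounds` and the use of `scipy.optimize.brentq` to solve for p₁ are choices made here, not part of the talk):

```python
import numpy as np
from scipy.optimize import brentq

def beta_bounds(d2_sorted, K):
    """Bounds [beta_L, beta_U] on the root of H(x, beta) = log K,
    given the sorted squared distances d_1^2 <= ... <= d_N^2 from x."""
    N = len(d2_sorted)
    d1_2, d2_2, dN_2 = d2_sorted[0], d2_sorted[1], d2_sorted[-1]
    delta2_2 = d2_2 - d1_2               # Delta_2^2 = d_2^2 - d_1^2
    delta2_N = dN_2 - d1_2               # Delta_N^2 = d_N^2 - d_1^2
    logNK = np.log(N / K)
    # p1 is the unique solution of 2(1-p) log(N/(2(1-p))) = log(min(sqrt(2N), K))
    rhs = np.log(min(np.sqrt(2.0 * N), K))
    f = lambda p: 2.0 * (1.0 - p) * np.log(N / (2.0 * (1.0 - p))) - rhs
    p1 = brentq(f, 1e-12, 1.0 - 1e-12)
    beta_U = np.log(p1 / (1.0 - p1) * (N - 1)) / delta2_2
    beta_L = max(N / (N - 1.0) * logNK / delta2_N,
                 np.sqrt(logNK / (dN_2**2 - d1_2**2)))
    return beta_L, beta_U
```

Because H(β) is strictly decreasing, checking the bounds is easy: the entropy at β_L should be at least log K, and at β_U at most log K.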
Entropic affinities: computation
For every x_n ∈ {x_1, …, x_N}:
- 1. Initialize β_n as close to the root as possible.
- 2. Compute the root β_n of H(x_n, β_n) = log K.
Methods

Method    | Type             | Convergence | Derivative order | Evaluations
Bisection | derivative-free  | linear      | —                | 1
Brent     | derivative-free  | linear      | —                | 1
Ridder    | derivative-free  | quadratic   | —                | 2
Newton    | derivative-based | quadratic   | 1                | 2
Halley    | derivative-based | cubic       | 2                | 3
Euler     | derivative-based | cubic       | 2                | 3

- Derivative-free methods generally converge globally. They work by iteratively shrinking an interval bracketing the root.
- Derivative-based methods have a higher convergence order, but may diverge.
- The cost of evaluating the objective function and each of its derivatives is O(N).
Computation of β_n: the root-finding.
Robustified root-finding algorithm
- We embed the derivative-based algorithm in a bisection loop for global convergence.
- We run the following algorithm for each x_n ∈ {x_1, …, x_N}:

    Input: initial β, perplexity K, distances d₁², …, d_N², bounds B.
    while true do
        for k = 1 to maxit do
            compute β using a derivative-based method
            if tolerance achieved, return β
            if β ∉ B, exit for loop
            update B
        end for
        compute β using bisection
        update B
    end while

[Figure: Newton iterations on H(β) = log K, with a bisection step whenever the Newton step falls outside the brackets.]
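A minimal Python sketch of this hybrid scheme for a Gaussian kernel (the names and bracket-update details are choices made here, not the authors' code; for the Gaussian kernel H′(β) = −β·Var_p[d²], which gives the Newton step):

```python
import numpy as np

def robust_root(d2, K, beta_L, beta_U, tol=1e-10, maxit=20):
    """Hybrid Newton/bisection solver for H(beta) = log K (Gaussian
    kernel). d2: squared distances to the neighbors of one point;
    [beta_L, beta_U]: an interval bracketing the root."""
    logK = np.log(K)

    def H_dH(beta):
        # p_n ∝ exp(-beta d_n^2); H = beta E[d^2] + log Z (shifted for stability)
        w = np.exp(-beta * (d2 - d2.min()))
        p = w / w.sum()
        m = (p * d2).sum()
        H = beta * (m - d2.min()) + np.log(w.sum())
        dH = -beta * ((p * d2**2).sum() - m**2)   # H'(beta) = -beta Var[d^2]
        return H, dH

    lo, hi = beta_L, beta_U
    beta = 0.5 * (lo + hi)
    while True:
        for _ in range(maxit):
            H, dH = H_dH(beta)
            if abs(H - logK) < tol or hi - lo < 1e-15 * hi:
                return beta
            # H is decreasing in beta: shrink the bracket around the root
            if H > logK:
                lo = beta
            else:
                hi = beta
            beta_new = beta - (H - logK) / dH     # Newton step
            if not (lo < beta_new < hi):
                break                             # step left the bracket
            beta = beta_new
        beta = 0.5 * (lo + hi)                    # fall back to bisection
```

The inner loop takes fast derivative-based steps; whenever a step leaves the bracket, one bisection step restores global convergence, mirroring the pseudocode above.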
Initialization of β_n
- 1. Simple initialization:
  - the midpoint of the bounds,
  - the distance to the kth nearest neighbor.
  These are typically far from the root and require more iterations.
- 2. Each new β_n is initialized from the solution of its predecessor:
  - sequential order;
  - tree order.
We need to find orders that are correlated with the behavior of β_n.
Sequential or tree order
- D_K, density strategy: for a fixed entropy value, β is larger in dense regions and smaller in sparser ones.
  - Use a nearest-neighbor density estimate: β_n is initialized proportionally to the distance to the Kth nearest neighbor of x_n.
- MST, local strategy: nearby points have similar β values.
  - Build a minimum spanning tree (MST) of the data.
  - Process the points in level order, so parents are solved for before their children.
[Figure: true β values and the D_K and MST estimates along the processing order.]
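The MST strategy can be sketched with SciPy's graph routines (a sketch of the idea on this slide; the function name `mst_order` is chosen here):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order
from scipy.spatial.distance import pdist, squareform

def mst_order(X):
    """Level-order (BFS) traversal of a minimum spanning tree of the
    data: returns the processing order and each point's parent, so that
    beta_n can be warm-started from the solution of its parent."""
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    T = minimum_spanning_tree(D)              # sparse MST (arcs in one direction)
    order, parents = breadth_first_order(T, i_start=0, directed=False)
    return order, parents
```

Traversing in BFS (level) order guarantees that every point's parent is processed first, so its β is available as a warm start.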
Experimental evaluation: setup
Initializations:
- “oracle”: processes the points in the order of their true β values,
- MST: local-based order,
- D_K: density-based order,
- bounds: initialize from the midpoint of the bounds,
- random: initialize from the solution of one of x_1, …, x_N chosen at random.
We set the perplexity to K = 30 and the tolerance to 10⁻¹⁰.
Root-finding methods:
- derivative-free: Bisection, Brent, Ridder;
- derivative-based: Newton, Euler, Halley.
Experimental evaluation: Lena image
The 512 × 512 Lena image. Each data point is a pixel represented by spatial and range features (i, j, L, u, v) ∈ R⁵:
- (i, j) is the pixel location;
- (L, u, v) is the pixel value.
N = 262 144 points, D = 5 dimensions.
Bisection: > 10 min. Our method: 1 min. Computing just the affinities given the βs: 20 s.
[Figure: number of iterations, runtime, and number of points converged after i iterations, for the oracle, MST, D_K, raster, bounds and random initializations and the Euler, Newton, Halley, Ridder, Brent and Bisection methods.]
Experimental evaluation: digits
60 000 handwritten digits from the MNIST dataset. Each data point is a 28 × 28 grayscale image.
N = 60 000 points, D = 784 dimensions.
[Figure: number of iterations, runtime, and number of points converged after i iterations, for the oracle, MST, D_K, bounds and random initializations and the Euler, Newton, Halley, Ridder, Brent and Bisection methods.]
Experimental evaluation: text
Articles from Grolier’s encyclopedia. Each point is a word count over the 15 275 most popular words.
N = 30 991 points, D = 15 275 dimensions.
[Figure: number of iterations, runtime, and number of points converged after i iterations, for the oracle, MST, D_K, bounds and random initializations and the Euler, Newton, Halley, Ridder, Brent and Bisection methods.]
Conclusions
- We studied the behavior of entropic affinities and their properties.
- The search for the affinities involves finding the root of a nonlinear equation.
- We can find the root almost to machine precision in just over one iteration per point on average, using:
  - bounds for the root,
  - root-finding methods with a high convergence order,
  - warm-start initialization based on local or density orders.
- In applications such as spectral clustering, embeddings and semi-supervised learning, entropic affinities should give better results than fixing the bandwidth to a single value or using a rule of thumb.
- The only user parameter is the global perplexity value K.
- MATLAB code is online at http://eecs.ucmerced.edu. Run it simply as [W,s] = ea(X,K).
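For readers without MATLAB, the [W,s] = ea(X,K) interface can be mimicked in Python with a plain bisection per point (a toy reimplementation, not the authors' fast code: it uses none of the bounds or warm starts, so it is slow but simple):

```python
import numpy as np

def ea(X, K):
    """Toy version of [W, s] = ea(X, K): row-stochastic entropic
    affinity matrix W and per-point bandwidths s, via plain bisection
    on H(beta_n) = log K for each point."""
    N = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((N, N))
    s = np.zeros(N)
    logK = np.log(K)
    idx = np.arange(N)
    for n in range(N):
        d2 = D2[n, idx != n]          # exclude the point itself
        lo, hi = 1e-10, 1e6
        for _ in range(200):          # bisection: H is decreasing in beta
            beta = 0.5 * (lo + hi)
            p = np.exp(-beta * (d2 - d2.min()))
            p /= p.sum()
            nz = p > 0
            H = -(p[nz] * np.log(p[nz])).sum()
            if H > logK:
                lo = beta
            else:
                hi = beta
        W[n, idx != n] = p
        s[n] = np.sqrt(1.0 / (2.0 * beta))   # sigma_n from beta_n = 1/(2 sigma_n^2)
    return W, s
```

Each row of W then has perplexity (effective number of neighbors) approximately K, which is the defining property of the entropic affinities.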