Entropic Affinities: Properties and Efficient Numerical Computation


SLIDE 1

Entropic Affinities: Properties and Efficient Numerical Computation

Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
June 18, 2013

SLIDE 2

Summary

  • The entropic affinities define affinities so that each point has an effective number of neighbors equal to K.
  • First introduced in G. E. Hinton & S. Roweis: "Stochastic Neighbor Embedding", NIPS 2002.
  • They are not in widespread use, even though they work well in a range of problems.
  • We study some properties of entropic affinities and give fast algorithms to compute them.
SLIDE 3

Affinity matrix

Defines a measure of similarity between points in the dataset.

Used in:

  • Dimensionality reduction: Stochastic Neighbor Embedding, t-SNE, Elastic Embedding, Laplacian Eigenmaps.
  • Clustering: Mean-Shift, Spectral clustering.
  • Semi-supervised learning.
  • and others.

The performance of these algorithms depends crucially on the affinity construction, governed by the bandwidth σ. Common practice is to set σ:

  • constant,
  • by a rule of thumb (e.g. distance to the 7th nearest neighbor, Zelnik-Manor & Perona, 2005).

[Figure: a data set and the corresponding affinity matrix.]
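As a concrete sketch of this construction (ours, not code from the talk; it assumes the data points are the rows of a NumPy array), the fixed-σ affinity matrix and the rule-of-thumb per-point bandwidths might be computed as:

```python
import numpy as np

def gaussian_affinities(X, sigma):
    """Affinity matrix with a single global bandwidth sigma:
    W[n, m] = exp(-||x_n - x_m||^2 / (2 sigma^2)), with zero diagonal."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.maximum(d2, 0.0, out=d2)                      # guard tiny negative round-off
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)                         # no self-affinity
    return W

def rule_of_thumb_sigmas(X, k=7):
    """Per-point sigma_n = distance to the kth nearest neighbor (rule of thumb)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d = np.sqrt(np.sort(d2, axis=1))                 # sorted distances per row
    return d[:, k]                                   # column 0 is the self-distance
```

A per-point σ_n from `rule_of_thumb_sigmas` would then replace the global `sigma` row by row; this is exactly the kind of heuristic that the entropic affinities refine.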
SLIDE 4

Motivation: choice of σ

COIL-20 dataset: rotations of objects every 5°; inputs are 128 × 128 greyscale images.

Affinity matrices for three choices of σ:

  • entropic affinities,
  • rule of thumb: distance to the 7th nearest neighbor (Zelnik-Manor & Perona, 2005),
  • constant σ.

[Figure: the three affinity matrices.]

SLIDE 5

Motivation: choice of σ

Dimensionality reduction with the Elastic Embedding algorithm on COIL-20 (rotations of objects every 5°; inputs are 128 × 128 greyscale images), for the same three choices of σ: entropic affinities, rule of thumb (distance to the 7th nearest neighbor; Zelnik-Manor & Perona, 2005), and constant σ.

[Figure: the resulting embeddings.]

SLIDE 6

Search for a good σ

A good σ should:

  • be set separately for every data point x_n (giving σ_n),
  • take into account the whole distribution of distances.

[Figure: two points x₁, x₂ in regions of different density, with different bandwidths.]

SLIDE 7

Entropic affinities

In the entropic affinities, the bandwidth σ is set individually for each point so that the point has a distribution over neighbors with a fixed perplexity K (Hinton & Roweis, 2002).

  • Consider a distribution of the neighbors x₁, …, x_N ∈ R^D for a point x ∈ R^D:

      p_n(x; σ) = K(‖(x − x_n)/σ‖²) / Σ_{k=1}^N K(‖(x − x_k)/σ‖²),

    i.e., the posterior distribution of a kernel density estimate with kernel K(·).

  • The entropy of this distribution is

      H(x, σ) = − Σ_{n=1}^N p_n(x, σ) log p_n(x, σ).

  • Find the bandwidth σ (or the precision β = 1/(2σ²)) given the perplexity K:

      H(x, β) = log K.

  • Perplexity K of a distribution p over N neighbors: choosing among the neighbors provides the same surprise as choosing among K equiprobable neighbors.

  • We define the entropic affinities as the probabilities p = (p₁, …, p_N) for x with respect to x₁, …, x_N. These affinities define a random-walk matrix.
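For a Gaussian kernel, the quantities above can be sketched numerically as follows (our illustration; `beta_for_perplexity` here is a plain bisection used only to make the definition concrete, not the fast method of this talk):

```python
import numpy as np

def neighbor_probs(d2, beta):
    """p_n proportional to exp(-beta * d2_n): Gaussian kernel with precision beta,
    given the squared distances d2 from a point to its neighbors."""
    a = -beta * d2
    a -= a.max()                      # shift exponents for numerical stability
    p = np.exp(a)
    return p / p.sum()

def neighbor_entropy(d2, beta):
    """H(beta) = -sum_n p_n log p_n for the distribution above."""
    p = neighbor_probs(d2, beta)
    nz = p > 0                        # avoid log(0) on underflowed entries
    return -np.sum(p[nz] * np.log(p[nz]))

def beta_for_perplexity(d2, K, tol=1e-10, maxit=200):
    """Solve H(beta) = log K by bisection; H is decreasing in beta, from log N
    (uniform, beta -> 0) down to 0 (beta -> inf, distinct nearest neighbor)."""
    target = np.log(K)
    lo, hi = 1e-20, 1.0
    while neighbor_entropy(d2, hi) > target:
        hi *= 2.0                     # grow the bracket until H(hi) <= log K
    for _ in range(maxit):
        mid = 0.5 * (lo + hi)
        if neighbor_entropy(d2, mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * hi:
            break
    return 0.5 * (lo + hi)
```

The resulting distribution has perplexity exp(H) ≈ K, i.e. an effective number of neighbors equal to K.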

SLIDE 8

Entropic affinities: example

[Figure.]

SLIDE 9

Entropic affinities: properties

  • Finding β_n amounts to solving

      H(x_n, β_n) = log K

    for each x_n: a root-finding problem, or equivalently an inversion problem β_n = H⁻¹_{x_n}(log K).

  • We can prove that:
    • The 1D root-finding problem is well defined for a Gaussian kernel: for any x_n ∈ R^D and any K ∈ (0, N), it has a unique root β* > 0.
    • The inverse β_n = H⁻¹_{x_n}(log K) is a uniquely defined, continuously differentiable function for all K ∈ (0, N) and x_n ∈ R^D.

[Figure: H(x, β) as a function of log β, with the level log K for K = 30.]

SLIDE 10

Entropic affinities: bounds

The root β* lies within bounds [β_L, β_U] for every x_n ∈ R^D and K ∈ (0, N):

    β_U = (1/Δ²₂) log( (p₁/(1 − p₁)) (N − 1) ),

    β_L = max( N log(N/K) / ((N − 1) Δ²_N) , √( log(N/K) / (d⁴_N − d⁴₁) ) ),

where Δ²_N = d²_N − d²₁, Δ²₂ = d²₂ − d²₁, d₁ ≤ d₂ ≤ … ≤ d_N are the sorted distances, and p₁ is the unique solution of the equation

    2(1 − p₁) log( N / (2(1 − p₁)) ) = log( min(√(2N), K) ).

The bounds are computed in O(1) for each point.

[Figure: H as a function of log β, with the bounds around the root.]
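Under our reading of the reconstructed formulas above (a sketch, not the authors' code; it assumes 1 < K < N so all the logarithms are positive, and sorted squared distances as input), the bounds could be computed as:

```python
import numpy as np

def solve_p1(N, K):
    """Solve 2(1-p1) log(N/(2(1-p1))) = log(min(sqrt(2N), K)) for p1, by
    bisection in t = 1 - p1 on (0, N/(2e)), where the left side is increasing."""
    rhs = np.log(min(np.sqrt(2.0 * N), K))
    g = lambda t: 2.0 * t * np.log(N / (2.0 * t))
    lo, hi = 1e-12, N / (2.0 * np.e)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < rhs:
            lo = mid
        else:
            hi = mid
    return 1.0 - 0.5 * (lo + hi)

def beta_bounds(d2_sorted, K):
    """[beta_L, beta_U] from the sorted squared distances d2_1 <= ... <= d2_N
    of a point to the others (note d^4 = (d^2)^2)."""
    N = len(d2_sorted)
    d2_1, d2_2, d2_N = d2_sorted[0], d2_sorted[1], d2_sorted[-1]
    delta2_N = d2_N - d2_1
    delta2_2 = d2_2 - d2_1
    p1 = solve_p1(N, K)
    beta_U = np.log(p1 / (1.0 - p1) * (N - 1)) / delta2_2
    beta_L = max(N * np.log(N / K) / ((N - 1) * delta2_N),
                 np.sqrt(np.log(N / K) / (d2_N**2 - d2_1**2)))
    return beta_L, beta_U
```

Since the interval only requires d₁, d₂ and d_N, it is O(1) per point once the distances are available, matching the cost claimed on the slide.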

SLIDE 11

Entropic affinities: computation

For every x_n ∈ {x₁, …, x_N}:

  1. Initialize β_n as close to the root as possible.
  2. Compute the root of H(x_n, β_n) = log K.
SLIDE 12

1. Computation of β_n: the root-finding methods

Method      Type               Convergence   Derivatives   Evaluations
Bisection   derivative-free    linear        0             1
Brent       derivative-free    linear        0             1
Ridder      derivative-free    quadratic     0             2
Newton      derivative-based   quadratic     1             2
Halley      derivative-based   cubic         2             3
Euler       derivative-based   cubic         2             3

  • Derivative-free methods generally converge globally. They work by iteratively shrinking an interval bracketing the root.
  • Derivative-based methods have a higher convergence order, but may diverge.
  • Each evaluation of the objective function and of each derivative costs O(N).

SLIDE 13

Robustified root-finding algorithm

  • We embed the derivative-based algorithm in a bisection loop for global convergence.
  • We run the following algorithm for each x_n ∈ {x₁, …, x_N}:

    Input: initial β, perplexity K, distances d²₁, …, d²_N, bounds B.
    while true do
        for k = 1 to maxit do
            compute β using a derivative-based method
            if tolerance achieved, return β
            if β ∉ B, exit for loop
            update B
        end for
        compute β using bisection
        update B
    end while

[Figure: H(β) against log β with the level log K, showing the iterates.]
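The loop above can be sketched in Python (our transcription; a plain Newton step stands in for the generic derivative-based method, using the identity dH/dβ = −β·Var_p(d²), which holds for the Gaussian kernel and shows H is decreasing in β):

```python
import numpy as np

def entropy_and_grad(d2, beta):
    """H(beta) and dH/dbeta for p_n proportional to exp(-beta * d2_n).
    For this family, dH/dbeta = -beta * Var_p(d2) <= 0."""
    a = -beta * d2
    a -= a.max()                              # stabilize the exponentials
    p = np.exp(a)
    p /= p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log(p[nz]))
    m = p @ d2                                # mean of d2 under p
    grad = -beta * (p @ (d2 - m) ** 2)        # -beta * variance of d2 under p
    return H, grad

def robust_root(d2, K, beta0=1.0, lo=1e-12, hi=1e6, tol=1e-10, maxit=20):
    """Newton iterations for H(beta) = log K inside a bisection loop: whenever
    a Newton step leaves the bracket [lo, hi], fall back to bisection."""
    target = np.log(K)
    beta = beta0
    while True:
        for _ in range(maxit):
            H, g = entropy_and_grad(d2, beta)
            if abs(H - target) < tol:
                return beta
            if H > target:                    # H decreasing: root lies above beta
                lo = beta
            else:
                hi = beta
            if g == 0.0:
                break
            step = beta - (H - target) / g    # Newton step
            if not (lo < step < hi):
                break                         # step left the bracket
            beta = step
        if hi - lo <= 1e-15 * hi:             # bracket collapsed: accept beta
            return beta
        beta = 0.5 * (lo + hi)                # bisection fallback
```

In the talk's method the bracket is initialized from the bounds of slide 10, which makes the first guesses far tighter than the generic `[1e-12, 1e6]` used here.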

SLIDE 14

Robustified root-finding algorithm (repeat of slide 13; figure animation)

[Figure: the first derivative-based step lands outside the brackets, so a bisection step is taken.]

SLIDE 15

[Figure: normal Newton step, iteration 2.]

SLIDE 16

[Figure: normal Newton step, iteration 3.]

SLIDE 17

[Figure: normal Newton step, iteration 4.]

SLIDE 18

2. Initialization of β_n

  • Simple initialization:
    • midpoint of the bounds,
    • distance to the kth nearest neighbor.
    These are typically far from the root and require more iterations.
  • Each new β_n is initialized from the solution for its predecessor:
    • sequential order,
    • tree order.
    We need to find orders that are correlated with the behavior of β_n.

SLIDE 19

(Repeat of slide 18; figure animation.)

SLIDE 20

Sequential or tree order

  • D_K, density strategy: for a fixed entropy value, β is larger in dense regions and smaller in sparser ones.
    • Use a nearest-neighbor density estimate: order the points by the distance to the kth nearest neighbor of each x_n.
  • MST, local strategy: nearby points have similar β values.
    • Build a minimum spanning tree (MST) of the data.
    • Process the points in level order, so parents are solved before children.

[Figure: true β compared with the D_K and MST orders.]
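The density strategy can be sketched as a warm-start sweep (our illustration; `solve_one` stands for any per-point root finder, e.g. the robustified Newton/bisection method, and the function names are ours):

```python
import numpy as np

def density_order(X, K):
    """Indices of the points sorted by distance to the Kth nearest neighbor:
    a nearest-neighbor density estimate (small d_K = dense region = large beta)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    dK = np.sqrt(np.sort(d2, axis=1)[:, K])   # column 0 is the self-distance
    return np.argsort(dK)

def warm_start_sweep(X, K, solve_one):
    """Solve beta_n for every point in density order, warm-starting each solve
    from the previous solution. solve_one(d2_row, K, beta0) is any per-point
    root finder for H(beta) = log K."""
    order = density_order(X, K)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    betas = np.empty(len(X))
    beta0 = 1.0                               # initial guess for the first point
    for n in order:
        row = np.delete(d2[n], n)             # squared distances to the others
        beta0 = solve_one(row, K, beta0)      # warm start from the predecessor
        betas[n] = beta0
    return betas
```

Consecutive points in this order sit in regions of similar density, so their β values are close and each solve starts near its root, which is what makes the warm start effective.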

SLIDE 21

Experimental evaluation: setup

Initializations:

  • "oracle": processes the points in the order of their true β values,
  • MST: local-based order,
  • D_K: density-based order,
  • bounds: initialize from the midpoint of the bounds,
  • random: initialize from one of x₁, …, x_N chosen at random.

Root-finding methods:

  • Derivative-free: Bisection, Brent, Ridder.
  • Derivative-based: Newton, Euler, Halley.

We set the perplexity to K = 30 and the tolerance to 10⁻¹⁰.

SLIDE 22

Experimental evaluation: Lena

Bisection: > 10 min. Our method: 1 min. Computing just the affinities given the βs: 20 s.

[Figure.]
SLIDE 23

Experimental evaluation: image

Lena image, 512 × 512. Each data point is a pixel represented by spatial and range features (i, j, L, u, v) ∈ R⁵:

  • (i, j) is the pixel location,
  • (L, u, v) is the pixel value.

N = 262 144 points, D = 5 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, raster, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 24

Experimental evaluation: digits

60 000 handwritten digits from the MNIST dataset. Each data point is a 28 × 28 grayscale image.

N = 60 000 points, D = 784 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 25

Experimental evaluation: text

30 991 articles from Grolier's encyclopedia. Each point is a word count of the 15 275 most popular words.

N = 30 991 points, D = 15 275 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 26

Conclusions

  • We studied the behavior of entropic affinities and their properties.
  • The search for the affinities involves finding the root of a nonlinear equation.
  • We can find the root almost to machine precision in just over one iteration per point on average, using:
    • bounds for the root,
    • root-finding methods with high-order convergence,
    • warm-start initialization based on local or density orders.
  • In applications such as spectral clustering, embeddings and semi-supervised learning, using entropic affinities should give better results than fixing the bandwidth to a single value or using a rule of thumb.
  • The only user parameter is the global perplexity value K.
  • MATLAB code is online at http://eecs.ucmerced.edu; run it simply as [W,s] = ea(X,K).