Entropic Affinities: Properties and Efficient Numerical Computation


SLIDE 1

Entropic Affinities: Properties and Efficient Numerical Computation

Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
June 18, 2013

SLIDE 2

Summary

  • The entropic affinities define affinities so that each point has an effective number of neighbors equal to K.
  • First introduced in G. E. Hinton & S. Roweis: "Stochastic Neighbor Embedding", NIPS 2002.
  • They are not in widespread use, even though they work well in a range of problems.
  • We study some properties of entropic affinities and give fast algorithms to compute them.
SLIDE 3

Affinity matrix

Defines a measure of similarity between points in the dataset.

Used in:

  • Dimensionality reduction: Stochastic Neighbor Embedding, t-SNE, Elastic Embedding, Laplacian Eigenmaps.
  • Clustering: Mean-Shift, Spectral clustering.
  • Semi-supervised learning.
  • and others.

The performance of these algorithms depends crucially on the affinity construction, governed by the bandwidth σ. Common practice is to set σ:

  • constant,
  • by a rule of thumb (e.g. distance to the 7th nearest neighbor, Zelnik-Manor & Perona, 2005).

[Figure: a data set and the corresponding affinity matrix.]
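As a concrete sketch of this construction (ours, not code from the talk; it assumes the data points are the rows of a NumPy array), the fixed-σ affinity matrix and the rule-of-thumb per-point bandwidths might be computed as:

```python
import numpy as np

def gaussian_affinities(X, sigma):
    """Affinity matrix with a single global bandwidth sigma:
    W[n, m] = exp(-||x_n - x_m||^2 / (2 sigma^2)), with zero diagonal."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.maximum(d2, 0.0, out=d2)                      # guard tiny negative round-off
    W = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)                         # no self-affinity
    return W

def rule_of_thumb_sigmas(X, k=7):
    """Per-point sigma_n = distance to the kth nearest neighbor (rule of thumb)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d = np.sqrt(np.sort(d2, axis=1))                 # sorted distances per row
    return d[:, k]                                   # column 0 is the self-distance
```

A per-point σ_n from `rule_of_thumb_sigmas` would then replace the global `sigma` row by row; this is exactly the kind of heuristic that the entropic affinities refine.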
SLIDE 4

Motivation: choice of σ

COIL-20 dataset: rotations of objects every 5°; inputs are 128 × 128 greyscale images.

Affinity matrices for three choices of σ:

  • entropic affinities,
  • rule of thumb: distance to the 7th nearest neighbor (Zelnik-Manor & Perona, 2005),
  • constant σ.

[Figure: the three affinity matrices.]

SLIDE 5

Motivation: choice of σ

Dimensionality reduction with the Elastic Embedding algorithm on COIL-20 (rotations of objects every 5°; inputs are 128 × 128 greyscale images), for the same three choices of σ: entropic affinities, rule of thumb (distance to the 7th nearest neighbor; Zelnik-Manor & Perona, 2005), and constant σ.

[Figure: the resulting embeddings.]

SLIDE 6

Search for a good σ

A good σ should:

  • be set separately for every data point x_n (giving σ_n),
  • take into account the whole distribution of distances.

[Figure: two points x₁, x₂ in regions of different density, with different bandwidths.]

SLIDE 7

Entropic affinities

In the entropic affinities, the bandwidth σ is set individually for each point so that the point has a distribution over neighbors with a fixed perplexity K (Hinton & Roweis, 2002).

  • Consider a distribution of the neighbors x₁, …, x_N ∈ R^D for a point x ∈ R^D:

      p_n(x; σ) = K(‖(x − x_n)/σ‖²) / Σ_{k=1}^N K(‖(x − x_k)/σ‖²),

    i.e., the posterior distribution of a kernel density estimate with kernel K(·).

  • The entropy of this distribution is

      H(x, σ) = − Σ_{n=1}^N p_n(x, σ) log p_n(x, σ).

  • Find the bandwidth σ (or the precision β = 1/(2σ²)) given the perplexity K:

      H(x, β) = log K.

  • Perplexity K of a distribution p over N neighbors: choosing among the neighbors provides the same surprise as choosing among K equiprobable neighbors.

  • We define the entropic affinities as the probabilities p = (p₁, …, p_N) for x with respect to x₁, …, x_N. These affinities define a random-walk matrix.
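For a Gaussian kernel, the quantities above can be sketched numerically as follows (our illustration; `beta_for_perplexity` here is a plain bisection used only to make the definition concrete, not the fast method of this talk):

```python
import numpy as np

def neighbor_probs(d2, beta):
    """p_n proportional to exp(-beta * d2_n): Gaussian kernel with precision beta,
    given the squared distances d2 from a point to its neighbors."""
    a = -beta * d2
    a -= a.max()                      # shift exponents for numerical stability
    p = np.exp(a)
    return p / p.sum()

def neighbor_entropy(d2, beta):
    """H(beta) = -sum_n p_n log p_n for the distribution above."""
    p = neighbor_probs(d2, beta)
    nz = p > 0                        # avoid log(0) on underflowed entries
    return -np.sum(p[nz] * np.log(p[nz]))

def beta_for_perplexity(d2, K, tol=1e-10, maxit=200):
    """Solve H(beta) = log K by bisection; H is decreasing in beta, from log N
    (uniform, beta -> 0) down to 0 (beta -> inf, distinct nearest neighbor)."""
    target = np.log(K)
    lo, hi = 1e-20, 1.0
    while neighbor_entropy(d2, hi) > target:
        hi *= 2.0                     # grow the bracket until H(hi) <= log K
    for _ in range(maxit):
        mid = 0.5 * (lo + hi)
        if neighbor_entropy(d2, mid) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * hi:
            break
    return 0.5 * (lo + hi)
```

The resulting distribution has perplexity exp(H) ≈ K, i.e. an effective number of neighbors equal to K.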

SLIDE 8

Entropic affinities: example

[Figure.]

SLIDE 9

Entropic affinities: properties

  • Finding β_n amounts to solving

      H(x_n, β_n) = log K

    for each x_n: a root-finding problem, or equivalently an inversion problem β_n = H⁻¹_{x_n}(log K).

  • We can prove that:
    • The 1D root-finding problem is well defined for a Gaussian kernel: for any x_n ∈ R^D and any K ∈ (0, N), it has a unique root β* > 0.
    • The inverse β_n = H⁻¹_{x_n}(log K) is a uniquely defined, continuously differentiable function for all K ∈ (0, N) and x_n ∈ R^D.

[Figure: H(x, β) as a function of log β, with the level log K for K = 30.]

SLIDE 10

Entropic affinities: bounds

The root β* lies within bounds [β_L, β_U] for every x_n ∈ R^D and K ∈ (0, N):

    β_U = (1/Δ²₂) log( (p₁/(1 − p₁)) (N − 1) ),

    β_L = max( N log(N/K) / ((N − 1) Δ²_N) , √( log(N/K) / (d⁴_N − d⁴₁) ) ),

where Δ²_N = d²_N − d²₁, Δ²₂ = d²₂ − d²₁, d₁ ≤ d₂ ≤ … ≤ d_N are the sorted distances, and p₁ is the unique solution of the equation

    2(1 − p₁) log( N / (2(1 − p₁)) ) = log( min(√(2N), K) ).

The bounds are computed in O(1) for each point.

[Figure: H as a function of log β, with the bounds around the root.]
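Under our reading of the reconstructed formulas above (a sketch, not the authors' code; it assumes 1 < K < N so all the logarithms are positive, and sorted squared distances as input), the bounds could be computed as:

```python
import numpy as np

def solve_p1(N, K):
    """Solve 2(1-p1) log(N/(2(1-p1))) = log(min(sqrt(2N), K)) for p1, by
    bisection in t = 1 - p1 on (0, N/(2e)), where the left side is increasing."""
    rhs = np.log(min(np.sqrt(2.0 * N), K))
    g = lambda t: 2.0 * t * np.log(N / (2.0 * t))
    lo, hi = 1e-12, N / (2.0 * np.e)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < rhs:
            lo = mid
        else:
            hi = mid
    return 1.0 - 0.5 * (lo + hi)

def beta_bounds(d2_sorted, K):
    """[beta_L, beta_U] from the sorted squared distances d2_1 <= ... <= d2_N
    of a point to the others (note d^4 = (d^2)^2)."""
    N = len(d2_sorted)
    d2_1, d2_2, d2_N = d2_sorted[0], d2_sorted[1], d2_sorted[-1]
    delta2_N = d2_N - d2_1
    delta2_2 = d2_2 - d2_1
    p1 = solve_p1(N, K)
    beta_U = np.log(p1 / (1.0 - p1) * (N - 1)) / delta2_2
    beta_L = max(N * np.log(N / K) / ((N - 1) * delta2_N),
                 np.sqrt(np.log(N / K) / (d2_N**2 - d2_1**2)))
    return beta_L, beta_U
```

Since the interval only requires d₁, d₂ and d_N, it is O(1) per point once the distances are available, matching the cost claimed on the slide.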

SLIDE 11

Entropic affinities: computation

For every x_n ∈ {x₁, …, x_N}:

  1. Initialize β_n as close to the root as possible.
  2. Compute the root of H(x_n, β_n) = log K.
SLIDE 12

1. Computation of β_n: the root-finding methods

Method      Type               Convergence   Derivatives   Evaluations
Bisection   derivative-free    linear        0             1
Brent       derivative-free    linear        0             1
Ridder      derivative-free    quadratic     0             2
Newton      derivative-based   quadratic     1             2
Halley      derivative-based   cubic         2             3
Euler       derivative-based   cubic         2             3

  • Derivative-free methods generally converge globally. They work by iteratively shrinking an interval bracketing the root.
  • Derivative-based methods have a higher convergence order, but may diverge.
  • Each evaluation of the objective function and of each derivative costs O(N).

SLIDE 13

Robustified root-finding algorithm

  • We embed the derivative-based algorithm in a bisection loop for global convergence.
  • We run the following algorithm for each x_n ∈ {x₁, …, x_N}:

    Input: initial β, perplexity K, distances d²₁, …, d²_N, bounds B.
    while true do
        for k = 1 to maxit do
            compute β using a derivative-based method
            if tolerance achieved, return β
            if β ∉ B, exit for loop
            update B
        end for
        compute β using bisection
        update B
    end while

[Figure: H(β) against log β with the level log K, showing the iterates.]
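The loop above can be sketched in Python (our transcription; a plain Newton step stands in for the generic derivative-based method, using the identity dH/dβ = −β·Var_p(d²), which holds for the Gaussian kernel and shows H is decreasing in β):

```python
import numpy as np

def entropy_and_grad(d2, beta):
    """H(beta) and dH/dbeta for p_n proportional to exp(-beta * d2_n).
    For this family, dH/dbeta = -beta * Var_p(d2) <= 0."""
    a = -beta * d2
    a -= a.max()                              # stabilize the exponentials
    p = np.exp(a)
    p /= p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log(p[nz]))
    m = p @ d2                                # mean of d2 under p
    grad = -beta * (p @ (d2 - m) ** 2)        # -beta * variance of d2 under p
    return H, grad

def robust_root(d2, K, beta0=1.0, lo=1e-12, hi=1e6, tol=1e-10, maxit=20):
    """Newton iterations for H(beta) = log K inside a bisection loop: whenever
    a Newton step leaves the bracket [lo, hi], fall back to bisection."""
    target = np.log(K)
    beta = beta0
    while True:
        for _ in range(maxit):
            H, g = entropy_and_grad(d2, beta)
            if abs(H - target) < tol:
                return beta
            if H > target:                    # H decreasing: root lies above beta
                lo = beta
            else:
                hi = beta
            if g == 0.0:
                break
            step = beta - (H - target) / g    # Newton step
            if not (lo < step < hi):
                break                         # step left the bracket
            beta = step
        if hi - lo <= 1e-15 * hi:             # bracket collapsed: accept beta
            return beta
        beta = 0.5 * (lo + hi)                # bisection fallback
```

In the talk's method the bracket is initialized from the bounds of slide 10, which makes the first guesses far tighter than the generic `[1e-12, 1e6]` used here.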

SLIDE 14

Robustified root-finding algorithm (repeat of slide 13; figure animation)

[Figure: the first derivative-based step lands outside the brackets, so a bisection step is taken.]

SLIDE 15

[Figure: normal Newton step, iteration 2.]

SLIDE 16

[Figure: normal Newton step, iteration 3.]

SLIDE 17

[Figure: normal Newton step, iteration 4.]

SLIDE 18

2. Initialization of β_n

  • Simple initialization:
    • midpoint of the bounds,
    • distance to the kth nearest neighbor.
    These are typically far from the root and require more iterations.
  • Each new β_n is initialized from the solution for its predecessor:
    • sequential order,
    • tree order.
    We need to find orders that are correlated with the behavior of β_n.

SLIDE 19

(Repeat of slide 18; figure animation.)

SLIDE 20

Sequential or tree order

  • D_K, density strategy: for a fixed entropy value, β is larger in dense regions and smaller in sparser ones.
    • Use a nearest-neighbor density estimate: order the points by the distance to the kth nearest neighbor of each x_n.
  • MST, local strategy: nearby points have similar β values.
    • Build a minimum spanning tree (MST) of the data.
    • Process the points in level order, so parents are solved before children.

[Figure: true β compared with the D_K and MST orders.]
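The density strategy can be sketched as a warm-start sweep (our illustration; `solve_one` stands for any per-point root finder, e.g. the robustified Newton/bisection method, and the function names are ours):

```python
import numpy as np

def density_order(X, K):
    """Indices of the points sorted by distance to the Kth nearest neighbor:
    a nearest-neighbor density estimate (small d_K = dense region = large beta)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    dK = np.sqrt(np.sort(d2, axis=1)[:, K])   # column 0 is the self-distance
    return np.argsort(dK)

def warm_start_sweep(X, K, solve_one):
    """Solve beta_n for every point in density order, warm-starting each solve
    from the previous solution. solve_one(d2_row, K, beta0) is any per-point
    root finder for H(beta) = log K."""
    order = density_order(X, K)
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    betas = np.empty(len(X))
    beta0 = 1.0                               # initial guess for the first point
    for n in order:
        row = np.delete(d2[n], n)             # squared distances to the others
        beta0 = solve_one(row, K, beta0)      # warm start from the predecessor
        betas[n] = beta0
    return betas
```

Consecutive points in this order sit in regions of similar density, so their β values are close and each solve starts near its root, which is what makes the warm start effective.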

SLIDE 21

Experimental evaluation: setup

Initializations:

  • "oracle": processes the points in the order of their true β values,
  • MST: local-based order,
  • D_K: density-based order,
  • bounds: initialize from the midpoint of the bounds,
  • random: initialize from one of x₁, …, x_N chosen at random.

Root-finding methods:

  • Derivative-free: Bisection, Brent, Ridder.
  • Derivative-based: Newton, Euler, Halley.

We set the perplexity to K = 30 and the tolerance to 10⁻¹⁰.

SLIDE 22

Experimental evaluation: Lena

Bisection: > 10 min. Our method: 1 min. Computing just the affinities given the βs: 20 s.

[Figure.]
SLIDE 23

Experimental evaluation: image

Lena image, 512 × 512. Each data point is a pixel represented by spatial and range features (i, j, L, u, v) ∈ R⁵:

  • (i, j) is the pixel location,
  • (L, u, v) is the pixel value.

N = 262 144 points, D = 5 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, raster, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 24

Experimental evaluation: digits

60 000 handwritten digits from the MNIST dataset. Each data point is a 28 × 28 grayscale image.

N = 60 000 points, D = 784 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 25

Experimental evaluation: text

30 991 articles from Grolier's encyclopedia. Each point is a word count of the 15 275 most popular words.

N = 30 991 points, D = 15 275 dimensions.

[Figures: number of iterations and runtime for each initialization order (oracle, MST, D_K, bounds, random); number of points converged after i iterations for each root-finding method (Euler, Newton, Halley, Ridder, Brent, Bisection).]

SLIDE 26

Conclusions

  • We studied the behavior of entropic affinities and their properties.
  • The search for the affinities involves finding the root of a nonlinear equation.
  • We can find the root almost to machine precision in just over one iteration per point on average, using:
    • bounds for the root,
    • root-finding methods with high-order convergence,
    • warm-start initialization based on local or density orders.
  • In applications such as spectral clustering, embeddings and semi-supervised learning, using entropic affinities should give better results than fixing the bandwidth to a single value or using a rule of thumb.
  • The only user parameter is the global perplexity value K.
  • MATLAB code is online at http://eecs.ucmerced.edu; run it simply as [W,s] = ea(X,K).