SLIDE 1

When do birds of a feather flock together? k-means, proximity, and conic programming

Shuyang Ling

Courant Institute of Mathematical Sciences, NYU

May 14, 2018

ICCHA7 2018, Nashville, TN

SLIDE 2

Acknowledgement

Research in collaboration with:

  • Prof. Xiaodong Li (Statistics, UC Davis)
  • Prof. Thomas Strohmer, Yang Li (Mathematics, UC Davis)
  • Prof. Ke Wei (School of Data Sciences, Fudan University, Shanghai)


SLIDE 3

k-means

Question: Given a set of N data points in ℝ^m, how do we partition them into k clusters?

Criterion: minimize the k-means objective function (the within-cluster sum of squares):

$$\min_{\{\Gamma_l\}_{l=1}^{k}} \sum_{l=1}^{k} \sum_{i\in\Gamma_l} \|x_i - c_l\|^2,$$

where $\{\Gamma_l\}$ is a partition of $\{1, \dots, N\}$ and $c_l$ is the sample mean of the data points in $\Gamma_l$.

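To make the objective concrete, here is a minimal NumPy sketch (an editor's illustration, not from the talk) that evaluates the within-cluster sum of squares for a partition encoded as an integer label vector:

```python
import numpy as np

def kmeans_objective(pts, labels, k):
    """Within-cluster sum of squares: sum_l sum_{i in Gamma_l} ||x_i - c_l||^2."""
    total = 0.0
    for l in range(k):
        cluster = pts[labels == l]       # the points indexed by Gamma_l
        c_l = cluster.mean(axis=0)       # sample mean of Gamma_l
        total += ((cluster - c_l) ** 2).sum()
    return total
```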

SLIDE 5

Difficulty of k-means

Importance and Difficulties

  • Widely used in vector quantization, unsupervised learning, Voronoi tessellation, etc.
  • NP-hard, even when m = 2 [Mahajan et al., 09].
  • Heuristic: Lloyd's algorithm [Lloyd, 82] works well in practice, but there are no good general guarantees: it may take exponentially many steps (in N) to converge to a stationary point, which need not even be a local minimum.

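Lloyd's algorithm alternates between assigning points to their nearest center and recomputing each center as a cluster mean. A minimal sketch (illustrative; the initialization and stopping rule are our choices, not from the talk):

```python
import numpy as np

def lloyd(pts, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its cluster (keep empty ones)
        new = np.array([pts[labels == l].mean(axis=0) if (labels == l).any()
                        else centers[l] for l in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```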

SLIDE 6

Convex relaxation of k-means

Focus of talk

We are interested in the convex relaxation for k-means [Peng, Wei 07].

k-means

To minimize the k-means objective, it suffices to optimize over all possible choices of the partition $\{\Gamma_l\}$:

$$f(\{\Gamma_l\}) := \sum_{l=1}^{k} \sum_{i\in\Gamma_l} \|x_i - c_l\|^2.$$


SLIDE 8

Convex relaxation of k-means

Focus of talk

We are interested in the convex relaxation for k-means [Peng, Wei 07].

An equivalent form:

It suffices to minimize it over all choices of the partition $\{\Gamma_l\}_{l=1}^{k}$:

$$f(\{\Gamma_l\}_{l=1}^{k}) := \sum_{l=1}^{k} \sum_{i\in\Gamma_l} \|x_i - c_l\|^2 = \sum_{l=1}^{k} \frac{1}{2|\Gamma_l|} \sum_{i\in\Gamma_l,\, j\in\Gamma_l} \|x_i - x_j\|^2,$$

which is the (size-normalized) sum of the squared pairwise distances between points in the same cluster.

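The identity behind this reformulation, $\sum_{i\in\Gamma}\|x_i - c\|^2 = \frac{1}{2|\Gamma|}\sum_{i,j\in\Gamma}\|x_i - x_j\|^2$, is easy to check numerically; a small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(8, 3))            # one cluster with 8 points in R^3
c = pts.mean(axis=0)                     # sample mean

wcss = ((pts - c) ** 2).sum()            # sum_i ||x_i - c||^2
pairwise = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum() / (2 * len(pts))
assert np.isclose(wcss, pairwise)        # the two forms agree
```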

SLIDE 9

A bit more calculation

Up to an immaterial factor of 2, $f(\{\Gamma_l\}_{l=1}^{k})$ is the inner product of two matrices:

$$2\,f(\{\Gamma_l\}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \underbrace{\|x_i - x_j\|^2}_{D_{ij}} \cdot \underbrace{\sum_{l=1}^{k}\frac{1}{|\Gamma_l|}\,1_{\{i\in\Gamma_l,\, j\in\Gamma_l\}}}_{X_{ij}} = \langle D, X\rangle,$$

where $D = (\|x_i - x_j\|^2)_{1\le i,j\le N}$ is the squared distance matrix and $X = \big(\sum_{l=1}^{k}\frac{1}{|\Gamma_l|}\,1_{\{i\in\Gamma_l,\, j\in\Gamma_l\}}\big)_{1\le i,j\le N}$.

We simply call X the partition matrix; minimizing ⟨D, X⟩ over partition matrices is equivalent to minimizing f. What properties does X have for any given partition $\{\Gamma_l\}_{l=1}^{k}$?

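An illustrative sketch (names and toy data are ours) that builds D and the partition matrix X and checks ⟨D, X⟩ = 2f({Γ_l}):

```python
import numpy as np

def partition_matrix(labels, k):
    """X_ij = 1/|Gamma_l| if i and j both lie in cluster l, else 0."""
    N = len(labels)
    X = np.zeros((N, N))
    for l in range(k):
        idx = np.flatnonzero(labels == l)
        X[np.ix_(idx, idx)] = 1.0 / len(idx)
    return X

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
labels = np.array([0] * 5 + [1] * 5)
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)  # D_ij = ||x_i - x_j||^2
X = partition_matrix(labels, 2)
f = sum(((pts[labels == l] - pts[labels == l].mean(axis=0)) ** 2).sum()
        for l in range(2))
assert np.isclose((D * X).sum(), 2 * f)   # <D, X> = 2 f({Gamma_l})
```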

SLIDE 11

Relaxation

Up to a permutation, the matrix X is block-diagonal:

$$X = \begin{pmatrix} \frac{1}{|\Gamma_1|}\mathbf{1}_{|\Gamma_1|}\mathbf{1}_{|\Gamma_1|}^{\top} & & \\ & \ddots & \\ & & \frac{1}{|\Gamma_k|}\mathbf{1}_{|\Gamma_k|}\mathbf{1}_{|\Gamma_k|}^{\top} \end{pmatrix}.$$

We want to find a larger, convex search space that contains all such X as a proper subset. What constraints does X satisfy?

Four constraints

  • Nonnegativity: X ≥ 0 (entrywise).
  • Positive semidefiniteness: X ⪰ 0.
  • Tr(X) = k (note that rank(X) = k is nonconvex).
  • The leading eigenvalue is 1 with multiplicity k; in particular, $X\mathbf{1}_N = \mathbf{1}_N$.

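The four constraints are easy to verify numerically on a concrete partition matrix; an illustrative sketch:

```python
import numpy as np

# the partition matrix from the previous sketch, for two clusters of size 5
N, k = 10, 2
labels = np.array([0] * 5 + [1] * 5)
X = np.zeros((N, N))
for l in range(k):
    idx = np.flatnonzero(labels == l)
    X[np.ix_(idx, idx)] = 1.0 / len(idx)

assert (X >= 0).all()                            # nonnegativity (entrywise)
assert np.linalg.eigvalsh(X).min() >= -1e-12     # positive semidefinite
assert np.isclose(np.trace(X), k)                # Tr(X) = k
assert np.allclose(X @ np.ones(N), np.ones(N))   # X 1_N = 1_N
```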

SLIDE 16

Convex relaxation

Semidefinite programming relaxation [Peng, Wei, 07]

The convex relaxation of k-means is

$$\min_{Z}\ \langle D, Z\rangle \quad \text{s.t.}\quad Z \ge 0,\; Z \succeq 0,\; \mathrm{Tr}(Z) = k,\; Z\mathbf{1}_N = \mathbf{1}_N.$$

Key question

Suppose $\{\Gamma_l\}_{l=1}^{k}$ is the ground-truth partition. When does the SDP relaxation recover

$$X = \sum_{l=1}^{k} \frac{1}{|\Gamma_l|}\mathbf{1}_{\Gamma_l}\mathbf{1}_{\Gamma_l}^{\top}\,?$$

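A sketch of the Peng-Wei SDP using the off-the-shelf modeling package cvxpy (solver selection and tolerances omitted; practical only for small N, since Z has N² entries):

```python
import cvxpy as cp
import numpy as np

def peng_wei_sdp(pts, k):
    """Peng-Wei SDP relaxation: min <D, Z> over the convex set above."""
    N = len(pts)
    D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    Z = cp.Variable((N, N), PSD=True)              # Z symmetric, Z >= 0 in PSD order
    constraints = [Z >= 0,                         # entrywise nonnegativity
                   cp.trace(Z) == k,
                   Z @ np.ones(N) == np.ones(N)]   # Z 1_N = 1_N
    cp.Problem(cp.Minimize(cp.sum(cp.multiply(D, Z))), constraints).solve()
    return Z.value          # ideally the ground-truth partition matrix X
```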

SLIDE 18

A short literature review

Many excellent works on learning mixtures of distributions and on the SDP relaxation of k-means:

  • SDP relaxation of k-means: [Peng, Wei, 07], [Bandeira, Villar, Ward, et al., 17], [Mixon, Villar, et al., 15], etc.
  • Spectral-projection-based approaches: [Dasgupta, 99], [Vempala, Wang, 04], [Achlioptas, McSherry, 05], etc.

Almost all of these works have one thing in common: the data are assumed to be sampled from a generative model, e.g., the stochastic ball model or a Gaussian mixture model.


SLIDE 20

A model-free framework?

Question: Can we establish a model-free framework for learning mixtures of distributions?

  • Model-free: no assumptions on the data-generating model.
  • One model-free idea: different clusters should be mutually well separated. How large must the separation be, and in what sense?
  • This is made precise by the proximity condition [Kumar, Kannan, 10], [Awasthi, Sheffet, 12].


SLIDE 22

What is proximity condition?

[Figure: two clusters projected onto the line joining their centers $c_a$ and $c_b$, with the quantities $h_{a,b}$, $\tau_{a,b}$, and $h_{a,b}/2 - \tau_{a,b}$ marked.]

  • $h_{a,b}$: the distance between the two centers.
  • $\tau_{a,b}$: the largest distance between the data points and their corresponding centers, after projection onto the line joining $c_a$ and $c_b$.
  • $d_{a,b} := h_{a,b}/2 - \tau_{a,b}$: the smallest distance between the midpoint of the segment and the projected data, which is a measure of separability.


SLIDE 23

Proximity condition

Proximity condition

The partition $\Gamma = \sqcup_{l=1}^{k}\Gamma_l$ satisfies the proximity condition if

$$d_{a,b} = \frac{h_{a,b}}{2} - \tau_{a,b} > \frac{1}{\sqrt{2}}\cdot\sqrt{k}\cdot \underbrace{\max_{1\le l\le k}\|\Sigma_l\|^{1/2}}_{\text{largest standard deviation}}$$

holds for all $a \neq b$, where $\Sigma_l$ is the sample covariance matrix of the data in $\Gamma_l$. The proximity condition quantifies how far each data point is from the other clusters.

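An illustrative sketch of the proximity check, assuming the operator-norm reading of $\max_l \|\Sigma_l\|^{1/2}$ reconstructed above (the paper's exact normalization may differ):

```python
import numpy as np

def proximity_margin(A, B):
    """d_{a,b} = h_{a,b}/2 - tau_{a,b} for two clusters (rows are points)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    h = np.linalg.norm(cb - ca)        # distance between the centers
    u = (cb - ca) / h                  # unit vector along the joining line
    # largest projected distance from a point to its own center
    tau = max(np.abs((A - ca) @ u).max(), np.abs((B - cb) @ u).max())
    return h / 2 - tau

def satisfies_proximity(clusters):
    """Check d_{a,b} > sqrt(k/2) * max_l ||Sigma_l||^{1/2} for every pair."""
    k = len(clusters)
    sd = max(np.linalg.norm(np.cov(C.T), 2) ** 0.5 for C in clusters)
    return all(proximity_margin(clusters[a], clusters[b]) > np.sqrt(k / 2) * sd
               for a in range(k) for b in range(a + 1, k))
```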

SLIDE 24

Main theorem

Theorem

Suppose the partition $\{\Gamma_l\}_{l=1}^{k}$ obeys the proximity condition, i.e.,

$$d_{a,b} \ge \underbrace{\frac{1}{\sqrt{2}}\cdot\sqrt{k}}_{\text{tight?}}\cdot \max_{1\le l\le k}\|\Sigma_l\|^{1/2}.$$

Then the minimizer of the SDP relaxation is unique and given by the ground-truth partition matrix X.

  • A purely deterministic, model-free condition.
  • It applies conveniently to data-generative models (shown on the next few slides).
  • If all $\Gamma_l$ are of the same size, the right-hand side can be replaced by $\sqrt{k}\cdot\max\{\|\Sigma_a\|^{1/2}, \|\Sigma_b\|^{1/2}\}$, which depends only on the covariance matrices of the groups $\Gamma_a$ and $\Gamma_b$.
  • The dependence of ∆ on $\sqrt{k}$ is not tight.

SLIDE 26

Data generative model - Stochastic ball model

Stochastic ball model

The data are generated as

$$x_{a,i} = \mu_a + r_{a,i}, \qquad 1 \le i \le n,\; 1 \le a \le k,$$

where $\mu_a \in \mathbb{R}^m$ is the population center and $r_{a,i}$ is drawn uniformly from the unit ball in $\mathbb{R}^m$. Obviously, $\Delta = \min_{a\neq b}\|\mu_a - \mu_b\| > 2$ guarantees that no two balls overlap, and it is necessary for exact recovery.

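Sampling from this model is straightforward: draw a uniform random direction and a radius whose CDF is $r^m$. An illustrative sketch:

```python
import numpy as np

def stochastic_ball_model(mu, n, rng=None):
    """x_{a,i} = mu_a + r_{a,i}, with r_{a,i} uniform in the unit ball."""
    if rng is None:
        rng = np.random.default_rng()
    k, m = mu.shape
    g = rng.normal(size=(k, n, m))
    g /= np.linalg.norm(g, axis=2, keepdims=True)       # uniform directions
    radius = rng.uniform(size=(k, n, 1)) ** (1.0 / m)   # P(R <= r) = r^m
    return mu[:, None, :] + radius * g                  # shape (k, n, m)

# e.g. two balls in the plane at separation Delta = 2.5
pts = stochastic_ball_model(np.array([[0.0, 0.0], [2.5, 0.0]]), n=200)
```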

SLIDE 27

Data generative model - Stochastic ball model

Our bound is slightly larger than 2; the gap depends on the number of clusters k and the dimension m.

Corollary

The proximity condition holds with high probability if

$$\Delta \ge 2 + \sqrt{2k}\,\max_{l}\|\Sigma_l\|^{1/2} = 2 + \sqrt{\frac{2k}{m+2}} + o(1),$$

where $\Delta = \min_{a\neq b}\|\mu_a - \mu_b\|$ is the minimal separation and m is the dimension.

State of the art [Awasthi, Bandeira, Villar, Ward, Mixon, et al., 2015, 2017]:

$$\Delta > \min\left\{ 2\sqrt{2}\left(1 + \frac{1}{\sqrt{m}}\right),\; 2 + \frac{k^2}{m} \right\}.$$

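The two bounds are easy to tabulate; an illustrative sketch comparing them (here in dimension m = 2, matching the numerics shown later):

```python
import numpy as np

def our_bound(k, m):
    """Delta >= 2 + sqrt(2k/(m+2)) (this talk)."""
    return 2 + np.sqrt(2 * k / (m + 2))

def state_of_the_art(k, m):
    """Delta > min{2*sqrt(2)*(1 + 1/sqrt(m)), 2 + k^2/m}."""
    return min(2 * np.sqrt(2) * (1 + 1 / np.sqrt(m)), 2 + k**2 / m)

for k in range(2, 7):
    print(k, round(our_bound(k, 2), 3), round(state_of_the_art(k, 2), 3))
```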

SLIDE 29

Data generative model - Gaussian mixture model

Gaussian mixture model

Consider

$$x_{a,i} \sim \mathcal{N}(\mu_a, \Sigma_a), \qquad 1 \le i \le n,\; 1 \le a \le k,$$

where $\Sigma_a$ is the covariance matrix of the a-th component.


SLIDE 30

Data generative model - Gaussian mixture model

Corollary

Assume $\Sigma_a = I_m$ for all $1 \le a \le k$. The proximity condition holds with high probability if

$$\Delta \ge 2\sqrt{k} + 4\sqrt{2}\,\log^{1/2}(kN^2) + o(1),$$

provided $N \gg m^2 k^3 \log k$. For the Gaussian mixture model we thus match the state-of-the-art minimal separation $\Delta \ge O(\sqrt{k} + \log^{1/2}(kN))$ of, e.g., [Awasthi, Sheffet, 12] and [Mixon, Villar, Ward, 17].


SLIDE 31

An impossibility theorem

Question: How tight is our bound? The minimal separation ∆ cannot be arbitrarily small; there is a lower bound on the separation below which the SDP cannot succeed. Here is one specific example:

Theorem

For the stochastic ball model, the Peng-Wei relaxation fails to achieve exact recovery if N is large enough and

$$\Delta < 1 + \sqrt{1 + \frac{2}{m+2}} \approx 2 + \|\Sigma\|, \qquad \|\Sigma\| = \frac{1}{m+2}.$$


SLIDE 32

Numerics: How does ∆ depend on k?

Our bound: $\Delta \ge 2 + \sqrt{\frac{2k}{m+2}}$.

State-of-the-art bound: $\Delta \ge \min\left\{2\sqrt{2}\left(1 + \frac{1}{\sqrt{m}}\right),\; 2 + \frac{k^2}{m}\right\}$.

Empirically, the minimal separation needed does not depend much on k.

Figure: Numerical experiment on the stochastic ball model in dimension 2 with the number of clusters k varying from 2 to 6, for circle-, line-, and hive-shaped configurations of the centers, compared with our bound and the state-of-the-art bound.


SLIDE 33

Numerics: How does ∆ depend on m?

Here k = 2 and m varies from 2 to 7.

  • Conjectured bound: $\Delta \ge 2 + \frac{2}{m+2}$
  • Necessary lower bound: $\Delta > 1 + \sqrt{1 + \frac{2}{m+2}}$
  • Sufficient lower bound: $\Delta > 2 + \frac{2}{\sqrt{m+2}}$
  • State of the art: $\Delta > \min\left\{2\sqrt{2}\left(1 + \frac{1}{\sqrt{m}}\right),\; 2 + \frac{k^2}{m}\right\}$

Figure: Stochastic ball model with two clusters, dimension m from 2 to 7, comparing the necessary lower bound, the conjectured bound, the sufficient lower bound, the state-of-the-art bound, and the empirical lower bound.


SLIDE 34

Is k-means always a good choice? - toy example 1

Example 1: the data lie on two concentric circles with different radii. k-means does not work at all here, since it is suited to convex (blob-like) clusters.


SLIDE 35

Is k-means always a good choice? - toy example 2

Example 2: the data lie uniformly on two parallel unit intervals with separation about ∆ ≈ 0.65. Can you guess where the k-means centers are?

Figure: Two parallel unit intervals at separation ∆ = 0.65.


SLIDE 36

Is k-means always a good choice? - toy example 2

Figure: The 2-means solution for separation ∆ = 0.67 (left) and ∆ = 0.63 (right); the optimal partition changes between the two cases.

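One can reproduce this experiment along the following lines, using scikit-learn's KMeans as the solver (an illustrative sketch; the slides indicate that the optimal partition flips between ∆ = 0.63 and ∆ = 0.67):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_intervals(delta, n=500, seed=0):
    """n points uniform on each of two parallel vertical unit segments."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(0, 1, (2, n))
    x = np.repeat([[0.0], [delta]], n, axis=1)   # segments at x = 0 and x = delta
    return np.column_stack([x.ravel(), y.ravel()])

# inspect how the centers move as delta crosses the critical separation
for delta in (0.63, 0.67):
    km = KMeans(n_clusters=2, n_init=20, random_state=0).fit(two_intervals(delta))
    print(delta, np.round(km.cluster_centers_, 2))
```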

SLIDE 37

Advertisement for an upcoming paper: kernel k-means?

  • Observation: k-means does not work when the geometry of the data is complicated.
  • Solution: spectral clustering, which combines a Laplacian eigenmap with k-means. However, many theoretical questions about it are not well understood.
  • Question: Can we extend this convex relaxation framework to spectral clustering or kernel k-means?
  • Yes: we will propose a convex relaxation of spectral clustering. It is also model-free and provably solves the two previous examples where ordinary k-means fails. The paper will be released soon!


SLIDE 41

Open problems and conclusions

Conclusions

A model-free framework to certify the exactness of the SDP relaxation applied to k-means. More details can be found in arXiv:1710.06008.

Open problems

  • For a mixture generated by the generalized stochastic ball model, is it possible to show that $\Delta \ge 2 + O\!\left(\frac{1}{m}\right)$ suffices, provided the total number of points N is large enough?
  • How can the misclassification rate be analyzed via the convex optimization approach?
  • Can the convergence of Lloyd's algorithm be understood?
