SLIDE 1

K-means++: The Advantages of Careful Seeding

David Arthur, Sergei Vassilvitskii (Stanford University)

SLIDE 2

Clustering

Given n points in $\mathbb{R}^d$, split them into k similar groups.

SLIDE 3

Clustering

Given n points in $\mathbb{R}^d$, split them into k similar groups.

This talk, k-means clustering: find k centers $C$ that minimize

$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$

SLIDE 4

Why Means?

Objective: find k centers $C$ that minimize

$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$

For one cluster: find the $y$ that minimizes

$$\sum_{x \in X} \|x - y\|_2^2$$

Easy! $y = \frac{1}{|X|} \sum_{x \in X} x$, the mean of the cluster.
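This fact is easy to sanity-check numerically (a minimal numpy sketch, my own; the data is arbitrary): any candidate other than the mean has larger one-cluster cost.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 arbitrary points in R^3

def one_cluster_cost(X, y):
    """Sum of squared Euclidean distances from each point to y."""
    return np.sum(np.linalg.norm(X - y, axis=1) ** 2)

mean = X.mean(axis=0)
# Perturbing the mean can only increase the cost.
perturbed = mean + 0.1 * rng.normal(size=3)
assert one_cluster_cost(X, mean) <= one_cluster_cost(X, perturbed)
```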

SLIDE 5

Lloyd's Method: k-means

Initialize with random clusters.

SLIDE 6

Lloyd's Method: k-means

Assign each point to the nearest center.

SLIDE 7

Lloyd's Method: k-means

Recompute optimum centers (means).

SLIDE 8

Lloyd's Method: k-means

Repeat: assign points to the nearest center.

SLIDE 9

Lloyd's Method: k-means

Repeat: recompute centers.

SLIDE 10

Lloyd's Method: k-means

Repeat...

SLIDE 11

Lloyd's Method: k-means

Repeat... until the clustering does not change.
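Slides 5-11 describe one loop. As code (a minimal numpy sketch of Lloyd's method with the classical uniform-random seeding; not the authors' implementation):

```python
import numpy as np

def lloyd(X, k, rng=np.random.default_rng(0)):
    """Lloyd's method: alternate assignment and mean steps until stable."""
    X = np.asarray(X, dtype=float)
    # Classical k-means: pick k data points uniformly at random as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    while True:
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(assign, new_assign):
            return centers, assign  # clustering did not change: done
        assign = new_assign
        # Recompute each center as the mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
```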

SLIDE 12

Analysis

How good is this algorithm? It finds a local optimum, which can be arbitrarily worse than the optimal solution.
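A standard way to see this (my example, not from the deck): four points at the corners of a long, thin rectangle with k = 2. The optimal centers split the points left/right, but the top/bottom split is also a fixed point of Lloyd's method, and its cost grows without bound as the rectangle widens.

```python
import numpy as np

w, h = 100.0, 1.0  # long, thin rectangle: widening w makes the gap worse
X = np.array([[0, 0], [0, h], [w, 0], [w, h]], dtype=float)

# Optimal k=2 clustering: split left/right; total cost = 4 * (h/2)^2.
left_right = 4 * (h / 2) ** 2

# Bad local optimum: split top/bottom; total cost = 4 * (w/2)^2.
# Each center is the mean of its row, so no point switches cluster.
top_bottom = 4 * (w / 2) ** 2

print(top_bottom / left_right)  # ratio (w/h)^2 -- arbitrarily large
```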

SLIDE 13

Approximating k-means

• Mount et al.: $(9+\epsilon)$-approximation in time $O(n^3/\epsilon^d)$
• Har-Peled et al.: $(1+\epsilon)$-approximation in time $O(n + k^{k+2}\epsilon^{-2dk}\log^k(n/\epsilon))$
• Kumar et al.: $(1+\epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

SLIDE 14

Approximating k-means

• Mount et al.: $(9+\epsilon)$-approximation in time $O(n^3/\epsilon^d)$
• Har-Peled et al.: $(1+\epsilon)$-approximation in time $O(n + k^{k+2}\epsilon^{-2dk}\log^k(n/\epsilon))$
• Kumar et al.: $(1+\epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

Lloyd's method:

• Worst-case time complexity: $2^{\Omega(\sqrt{n})}$
• Smoothed complexity: $n^{O(k)}$

SLIDE 15

Approximating k-means

• Mount et al.: $(9+\epsilon)$-approximation in time $O(n^3/\epsilon^d)$
• Har-Peled et al.: $(1+\epsilon)$-approximation in time $O(n + k^{k+2}\epsilon^{-2dk}\log^k(n/\epsilon))$
• Kumar et al.: $(1+\epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$

Lloyd's method in practice: on the Digit Recognition dataset (UCI), with $n = 60{,}000$ and $d = 600$, it converges to a local optimum in 60 iterations.

SLIDE 16

Challenge

Develop an approximation algorithm for k-means clustering that is competitive with the k-means method in speed and solution quality. Easiest line of attack: focus on the initial center positions. Classical k-means: pick k points at random.

SLIDE 17

k-means on Gaussians

SLIDE 18

k-means on Gaussians

SLIDES 19-23

Easy Fix

Select centers using a furthest-point algorithm (a 2-approximation to k-Center clustering).
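A minimal sketch of this furthest-point seeding (my code, assuming `X` is an (n, d) numpy array; not from the deck):

```python
import numpy as np

def furthest_point_seeding(X, k, rng=np.random.default_rng(0)):
    """Pick k centers: start randomly, then repeatedly take the point
    furthest from all centers chosen so far (the k-Center heuristic)."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    # d2[i] = squared distance from X[i] to its nearest chosen center
    d2 = np.linalg.norm(X - centers[0], axis=1) ** 2
    for _ in range(k - 1):
        centers.append(X[d2.argmax()])  # furthest point wins
        d2 = np.minimum(d2, np.linalg.norm(X - centers[-1], axis=1) ** 2)
    return np.array(centers)
```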

SLIDES 24-26

Sensitive to Outliers

SLIDE 27

k-means++

Interpolate between the two methods: let $D(x)$ be the distance between $x$ and the nearest cluster center. Sample $x$ proportionally to $(D(x))^\alpha = D^\alpha(x)$.

• $\alpha = 0$: original Lloyd's (uniform sampling)
• $\alpha = \infty$: furthest point
• $\alpha = 2$: k-means++ ($D^2(x)$ is the contribution of $x$ to the overall error)
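The $\alpha = 2$ rule in code (a minimal sketch; only the selection line differs from the furthest-point sketch above):

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding: sample each new center with probability
    proportional to D^2(x), the point's contribution to the error."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]  # first center: uniform
    d2 = np.linalg.norm(X - centers[0], axis=1) ** 2
    for _ in range(k - 1):
        probs = d2 / d2.sum()  # P(x) proportional to D^2(x)
        centers.append(X[rng.choice(len(X), p=probs)])
        d2 = np.minimum(d2, np.linalg.norm(X - centers[-1], axis=1) ** 2)
    return np.array(centers)
```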

SLIDE 28

k-Means++

SLIDE 29

k-Means++

Theorem: k-means++ is $\Theta(\log k)$-approximate in expectation.

Ostrovsky et al. [06]: a similar method is $O(1)$-approximate under some data-distribution assumptions.

SLIDE 30

Proof - 1st Cluster

Fix an optimal clustering $C^*$. Pick the first center uniformly at random and bound the total error of that cluster.

SLIDE 31

Proof - 1st Cluster

Let $A$ be the cluster. Each point $a_0 \in A$ is equally likely to be the chosen center. Expected error:

$$E[\phi(A)] = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2 = 2\sum_{a \in A} \|a - \bar{A}\|^2 = 2\phi^*(A)$$
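The middle equality uses the standard identity around the mean (a step the slide skips; my expansion): for any fixed $a_0$,

$$\sum_{a \in A} \|a - a_0\|^2 = \sum_{a \in A} \|a - \bar{A}\|^2 + |A| \cdot \|\bar{A} - a_0\|^2.$$

Averaging over the uniformly random $a_0 \in A$, the second term becomes $\sum_{a_0 \in A} \|\bar{A} - a_0\|^2 = \phi^*(A)$, giving $E[\phi(A)] = \phi^*(A) + \phi^*(A) = 2\phi^*(A)$.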

SLIDE 32

Proof - Other Clusters

Suppose the next center came from a new cluster in OPT. Bound the total error of that cluster.

SLIDE 33

Proof - Other Clusters

Let $B$ be this cluster and $b_0$ the point selected. Then:

$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2$$

Key step: $D(b_0) \le D(b) + \|b - b_0\|$

SLIDE 34

Cont.

For any $b$: $D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$ (square the key step). Averaging over all $b \in B$:

$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$

(The first term is the same for all $b_0$; the second is the cost under uniform sampling.)

SLIDE 35

Cont.

For any $b$: $D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$. Averaging over all $b$:

$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$

Recall:

$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2 \le \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 = 8\phi^*(B)$$
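The final inequality is compressed on the slide; expanded (my own step): substitute the averaged bound for $D^2(b_0)$, then bound $\min(D(b), \|b - b_0\|)^2$ by $\|b - b_0\|^2$ against the first term and by $D^2(b)$ against the second. Each term then contributes at most $\frac{2}{|B|}\sum_{b_0 \in B}\sum_{b \in B}\|b - b_0\|^2$, and by the SLIDE 31 identity $\frac{1}{|B|}\sum_{b_0 \in B}\sum_{b \in B}\|b - b_0\|^2 = 2\phi^*(B)$, so the total is at most $4 \cdot 2\phi^*(B) = 8\phi^*(B)$.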

SLIDE 36

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.

SLIDE 37

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive. Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error.

SLIDE 38

Wrap Up

If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive. Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error. Formally, an inductive proof shows this method is $\Theta(\log k)$-competitive.

SLIDE 39

Experiments

Tested on several datasets:

Synthetic
• 10k points, 3 dimensions

Cloud Cover (UCI Repository)
• 10k points, 54 dimensions

Color Quantization
• 16k points, 16 dimensions

Intrusion Detection (KDD Cup)
• 500k points, 35 dimensions

SLIDE 40

Typical Run

KM++ vs. KM vs. KM-Hybrid

[Plot: Error (600-1300) vs. Stage (50-500) for LLOYD, HYBRID, and KM++.]

SLIDE 41

Experiments

Total Error:

| Dataset     | k-means     | km-Hybrid   | k-means++   |
|-------------|-------------|-------------|-------------|
| Synthetic   | 6.02 × 10^5 | 5.95 × 10^5 | 6.06 × 10^5 |
| Cloud Cover | 670         | 741         | 712         |
| Color       | 0.016       | 0.015       | 0.014       |
| Intrusion   | 32.9 × 10^3 | 3.4 × 10^3  | −           |

Time: k-means++ is 1% slower due to initialization.
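For reference, scikit-learn's `KMeans` ships k-means++ as its default initializer, so a comparison like the one above is a few lines (a minimal sketch; the synthetic data here is arbitrary, not the deck's datasets):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic data standing in for the datasets above.
X, _ = make_blobs(n_samples=10_000, centers=25, n_features=3, random_state=0)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=25, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init:>10}: final error (inertia) = {km.inertia_:.1f}")
```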

SLIDE 42

Final Message

Friends don’t let friends use k-means.

SLIDE 43

Thank You

Any Questions?