K-means++: The Advantages of Careful Seeding
Sergei Vassilvitskii, David Arthur (Stanford University)

Clustering
Given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.
This talk: k-means clustering.
Objective: Find $k$ centers $C$ that minimize
$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$
Why Means?
Objective: Find $k$ centers $C$ that minimize
$$\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$
For one cluster: find the $y$ that minimizes
$$\sum_{x \in X} \|x - y\|_2^2$$
Easy!
$$y = \frac{1}{|X|} \sum_{x \in X} x$$
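As a quick sanity check of this fact, here is a minimal sketch (assuming numpy; the helper name `one_cluster_cost` is illustrative, not from the talk) showing that the centroid beats any perturbed candidate center:

```python
import numpy as np

def one_cluster_cost(X, y):
    """Sum of squared Euclidean distances from every point in X to y."""
    return np.sum(np.linalg.norm(X - y, axis=1) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # 100 points in R^3

centroid = X.mean(axis=0)          # y = (1/|X|) * sum over x in X
other = centroid + 0.1             # any other candidate center

assert one_cluster_cost(X, centroid) < one_cluster_cost(X, other)
```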
Lloyd's Method: k-means
- Initialize with $k$ random centers.
- Assign each point to its nearest center.
- Recompute the optimum centers (means).
- Repeat the assignment and recomputation steps until the clustering does not change (see the sketch below).
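A minimal sketch of this loop, assuming numpy (the function name `lloyd` and the `max_iter` cap are illustrative additions, not from the talk):

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Lloyd's method: alternate assignment and mean steps until stable."""
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        # (keeping the old center if its cluster became empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # clustering did not change
            break
        centers = new_centers
    return centers, labels
```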
Analysis
How good is this algorithm? It finds a local optimum, which can be arbitrarily worse than the optimal solution.
Approximating k-means
- Mount et al.: $(9 + \epsilon)$-approximation in time $O(n^3 / \epsilon^d)$
- Har-Peled et al.: $(1 + \epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$
- Kumar et al.: $(1 + \epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} n d$
Lloyd's method:
- Worst-case time complexity: $2^{\Omega(\sqrt{n})}$
- Smoothed complexity: $n^{O(k)}$
For example, on the Digit Recognition dataset (UCI), with $n = 60{,}000$ and $d = 600$, Lloyd's method converges to a local optimum in 60 iterations.
Challenge
Develop an approximation algorithm for k-means clustering that is competitive with the k-means method in speed and solution quality.
Easiest line of attack: focus on the initial center positions. Classical k-means: pick the $k$ initial centers uniformly at random.
k-means on Gaussians
Easy Fix
Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
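A sketch of this seeding rule in the same numpy style (the name `furthest_point_init` is illustrative):

```python
import numpy as np

def furthest_point_init(X, k, rng):
    """Greedy k-Center seeding: each new center is the point
    farthest from all centers chosen so far."""
    centers = [X[rng.integers(len(X))]]   # first center: an arbitrary point
    while len(centers) < k:
        # Distance from each point to its nearest chosen center.
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2),
                   axis=1)
        centers.append(X[d.argmax()])     # the farthest point becomes a center
    return np.array(centers)
```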
Sensitive to Outliers
k-means++
Interpolate between the two methods: let $D(x)$ be the distance between $x$ and the nearest existing cluster center, and sample $x$ with probability proportional to $(D(x))^\alpha = D^\alpha(x)$.
- $\alpha = 0$: original Lloyd's (uniform sampling)
- $\alpha = \infty$: furthest point
- $\alpha = 2$: k-means++ ($D^2(x)$ is the contribution of $x$ to the overall error)
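Putting the $\alpha = 2$ rule into code gives the k-means++ seeding step, sketched below under the same numpy assumptions (the name `kmeanspp_init` is illustrative); its output is then handed to Lloyd's iterations:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: sample each new center with probability
    proportional to D^2(x), the squared distance to the nearest center."""
    centers = [X[rng.integers(len(X))]]   # first center: uniformly at random
    while len(centers) < k:
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2),
                    axis=1) ** 2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2 sampling
    return np.array(centers)
```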
Theorem: k-means++ is $\Theta(\log k)$-approximate in expectation.
Ostrovsky et al. [06]: a similar method is $O(1)$-approximate under some assumptions on the data distribution.
Proof - 1st cluster
Fix an optimal clustering $C^*$. Pick the first center uniformly at random and bound the total error of its cluster.
Let $A$ be that cluster. Each point $a_0 \in A$ is equally likely to be the chosen center, so the expected error is
$$E[\phi(A)] = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2 = 2 \sum_{a \in A} \|a - \bar{A}\|^2 = 2\phi^*(A).$$
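A small simulation of this lemma (a sketch on synthetic data, assuming numpy): averaging the cluster cost over every possible uniformly chosen first center recovers the factor of exactly 2.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))   # stand-in for one optimal cluster

# Optimal cost phi*(A): squared distances to the cluster mean.
phi_star = np.sum(np.linalg.norm(A - A.mean(axis=0), axis=1) ** 2)

# E[phi(A)]: average cost over every choice of first center a0 in A.
expected_phi = np.mean([
    np.sum(np.linalg.norm(A - a0, axis=1) ** 2) for a0 in A
])

print(expected_phi / phi_star)   # prints 2.0 (up to float error)
```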
Proof - Other Clusters
Suppose the next center comes from a new cluster of OPT; bound the total error of that cluster.
Let $B$ be this cluster and $b_0$ the point selected. Then:
$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2$$
Key step (the triangle inequality): $D(b_0) \le D(b) + \|b - b_0\|$.
Cont.
For any $b$: $D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$.
Averaging over all $b$:
$$D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$$
The first term is the same for all $b_0$; the second is the cost of $b_0$ under uniform sampling.
Cont.
For any $b$: $D^2(b_0) \le 2D^2(b) + 2\|b - b_0\|^2$. Recall:
$$E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2 \le \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 = 8\phi^*(B)$$
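The same bound can be checked numerically; the sketch below (assuming numpy, with a synthetic cluster $B$ and a single far-away existing center) computes $E[\phi(B)]$ under $D^2$ sampling and confirms it stays below $8\phi^*(B)$:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(40, 2))              # stand-in for a new optimal cluster
c = np.array([10.0, 0.0])                 # an existing center far from B

D = np.linalg.norm(B - c, axis=1)         # D(b): distance to nearest center
p = D**2 / np.sum(D**2)                   # D^2-sampling probabilities

phi_star = np.sum(np.linalg.norm(B - B.mean(axis=0), axis=1) ** 2)

# E[phi(B)]: b0 sampled with prob p[i]; each b then pays min(D(b), |b - b0|)^2.
expected_phi = sum(
    p[i] * np.sum(np.minimum(D, np.linalg.norm(B - B[i], axis=1)) ** 2)
    for i in range(len(B))
)

print(expected_phi <= 8 * phi_star)       # True, as the bound guarantees
```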
Wrap Up
If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.
Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error.
Formally, an inductive proof shows this method is $\Theta(\log k)$-competitive.
Experiments
Tested on several datasets: Synthetic
- 10k points, 3 dimensions
Cloud Cover (UCI Repository)
- 10k points, 54 dimensions
Color Quantization
- 16k points, 16 dimensions
Intrusion Detection (KDD Cup)
- 500k points, 35 dimensions
Typical Run
KM++ v. KM v. KM-Hybrid
[Plot: clustering error vs. stage (iterations 50-500) for LLOYD, HYBRID, and KM++; error axis from 600 to 1300.]
Experiments
Total Error:

Dataset        k-means       km-Hybrid     k-means++
Synthetic      6.02 × 10^5   5.95 × 10^5   6.06 × 10^5
Cloud Cover    670           741           712
Color          0.016         0.015         0.014
Intrusion      32.9 × 10^3   3.4 × 10^3    −

Time: k-means++ is about 1% slower due to initialization.