Coresets for k-Means and k-Median Clustering and their Applications
SLIDE 1

March 8, 2006

Coresets for k-Means and k-Median Clustering and their Applications

Sariel Har-Peled and Soham Mazumdar

SLIDE 2

Problem Introduction

  • We are given a point set P in R^d of size n
  • Find a set of k points C such that the cost function is minimized
  • Cost functions

– Median: sum of distances from each point to its nearest center
– Discrete median: the same cost, with centers chosen from P
– Mean: sum of squared distances from each point to its nearest center
  • Streaming
SLIDE 3

Costs

  • k-medians: ν_C(P) = Σ_{p∈P} d(p, C)
  • Discrete k-medians: the same, with C ⊆ P
  • k-means: ν_C(P) = Σ_{p∈P} d(p, C)^2
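The three cost functions can be written down directly; a minimal sketch (the function names are mine, not from the slides):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cost_kmedian(P, C):
    """k-median cost: sum of distances to the nearest center."""
    return sum(min(dist(p, c) for c in C) for p in P)

def cost_kmeans(P, C):
    """k-means cost: sum of squared distances to the nearest center."""
    return sum(min(dist(p, c) for c in C) ** 2 for p in P)

# Discrete k-median uses the same cost as k-median,
# but C must be chosen from the input points P.
```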

SLIDE 4

Results

  • Builds on the algorithms we saw last week

– Kolliopoulos and Rao [KR99]
– Matoušek [Mat00]

  • Results

– k-median
– Discrete k-median
– k-means

SLIDE 5

Overview

  • Similar for k-medians and k-means
  • Construct a series of sets
  • Algorithm Components

– P: point set
– S: coreset
– A: constant factor approximation
– D: centroid set
– C: k centers

SLIDE 6

Coresets for k-median

  • Definition: S is a (k,ε)-coreset if, for every set C of k centers, |ν_C(S) − ν_C(P)| ≤ ε·ν_C(P)
  • Construction
  • Begin with P and A, where A is a constant factor approximation
  • Estimate the average radius R = ν_A(P)/(cn)
  • Build an exponential grid around each x ∈ A, with M levels
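The defining property can be checked directly: a weighted set S is a (k,ε)-coreset if its weighted cost agrees with the cost of P up to a factor of (1 ± ε) for every candidate center set. A sketch of that check (names are mine):

```python
import math

def cost(P, C):
    """Unweighted k-median cost of P with centers C."""
    return sum(min(math.dist(p, c) for c in C) for p in P)

def weighted_cost(S, C):
    """Cost of a weighted set S, given as (point, weight) pairs."""
    return sum(w * min(math.dist(p, c) for c in C) for p, w in S)

def is_coreset_for(P, S, C, eps):
    """True if S approximates the cost of P within (1 +/- eps) for this C.

    A true (k, eps)-coreset satisfies this for *every* set C of k centers;
    here we only test one candidate C.
    """
    return abs(weighted_cost(S, C) - cost(P, C)) <= eps * cost(P, C)
```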
SLIDE 7

Exponential Grid

  • For each point in A
  • Level j has side length εR·2^j/(10cd)
  • Pick a point in each non-empty cell
  • Assign a weight equal to the number of points in the cell
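The grid-snapping step might look like the sketch below (illustrative only: `c` and the level count are assumed constants, and the cell side length follows the εR·2^j/(10cd) rule from the slide):

```python
import math

def exponential_grid_coreset(P, x, R, eps, c=2, levels=10):
    """Snap points to cells of an exponential grid around a center x.

    Cell side at level j is eps*R*2^j/(10*c*d); each non-empty cell
    contributes one representative point, weighted by its point count.
    """
    d = len(x)
    cells = {}  # cell key -> [representative point, weight]
    for p in P:
        r = math.dist(p, x)
        # level 0 covers points within distance R of x; farther points
        # land in geometrically growing levels
        j = 0 if r < R else min(levels - 1, int(math.log2(r / R)) + 1)
        side = eps * R * (2 ** j) / (10 * c * d)
        key = (j,) + tuple(math.floor(coord / side) for coord in p)
        if key in cells:
            cells[key][1] += 1      # weight = number of points in the cell
        else:
            cells[key] = [p, 1]     # first point seen is the representative
    return [(p, w) for p, w in cells.values()]
```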

SLIDE 8

Cost of Constructing S

  • Size

– In each level, a constant number of cells
– log n levels

  • Cost of construction

– Constant factor approximation to the cost ν_A(P)
– Nearest neighbor (NN) queries

  • NN queries (n points, m centers)

– Naïve: O(mn)
– [AMN+98]: O(log m) per query after O(m log m) preprocessing
– Here: O(n + m·n^{1/4} log n)

  • Total cost: if m = …, else …

SLIDE 9

Fuzzy Nearest Neighbor Search in O(1)

  • ε-approximate nearest neighbors to a set X
  • If the distance from q is less than δ

– Any point in X which is closer than δ is valid

  • If the distance from q is greater than ∆

– Any point in X is valid
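A toy version of this fuzzy-search contract (not the paper's data structure; a linear scan that only illustrates which answers count as correct):

```python
import math

def fuzzy_nn(q, X, eps, delta, Delta):
    """Return some point of X satisfying the fuzzy nearest-neighbor contract.

    - if the true NN distance is below delta, any point within delta is valid;
    - if it is above Delta, any point of X is valid;
    - otherwise a (1+eps)-approximate nearest neighbor is required.
    A real implementation answers in O(1); this scan is only a specification.
    """
    best = min(X, key=lambda x: math.dist(q, x))
    if math.dist(q, best) > Delta:
        return X[0]                     # far regime: anything is acceptable
    for x in X:
        if math.dist(q, x) < delta:     # close regime: cheap early exit
            return x
    return best                         # exact NN also satisfies (1+eps)
```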

SLIDE 10

Proof of Correctness

  • p ∈ P and its image p’ in S
  • For any k points Y, the error is bounded via the triangle inequality: |d(p, Y) − d(p’, Y)| ≤ d(p, p’)
SLIDE 11

Coresets for k-means

  • Similar to k-medians
  • Lower bound estimate for average mean radius
  • A is a constant factor approximation
  • Using R and A, we construct S with the exponential grid

  • Size:
  • Running time:
SLIDE 12

Proof of Correctness

  • Idea: Partition P into 3 sets

– Points that are close to both A and B: small error
– Points closer to B than to A: an ε fraction of error
– Points closer to A than to B: “better” than optimal

  • Bound each error
  • Result:
SLIDE 13

Errors

SLIDE 14

Fast Constant Factor Approximation

  • In both cases we need a constant factor approximation, i.e. the set A
  • Use more than k centers: O(k log^3 n)
  • Good for both k-means and k-medians
  • 2-approximate clustering (min-max clustering)

– k = O(n^{1/4}) → O(n) [Har01a]
– k = Ω(n^{1/4}) → O(n log k) [FG88]
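The 2-approximate min-max (k-center) clustering referenced here is the classic farthest-point greedy of [Gon85]; a compact sketch:

```python
import math

def gonzalez_k_center(P, k):
    """Greedy farthest-point clustering: a 2-approximation for k-center."""
    centers = [P[0]]                    # arbitrary first center
    d = [math.dist(p, centers[0]) for p in P]   # distance to nearest center
    while len(centers) < k:
        # next center = the point farthest from all current centers
        i = max(range(len(P)), key=lambda t: d[t])
        centers.append(P[i])
        d = [min(d[t], math.dist(P[t], P[i])) for t in range(len(P))]
    return centers
```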

SLIDE 15

Picking Sets

  • Distance between points in V is at least L
  • L is an estimate of the cost
  • Y is a random sample of P

– of size ρ = γ·k·log^2 n

  • Desired set of centers: X = Y ∪ V
  • We want a large “good” subset of X
  • “Good” is defined in terms of bad points
SLIDE 16

Bad Points

  • Definition

A point is bad with respect to a set X if the cost is much larger than the optimal center would pay. More precisely

  • There are few bad points in X
  • Their contribution to the clustering cost is small
SLIDE 17

Few Bad Points

  • C_opt is the set of optimal k-means centers
  • Place a ball b_i around each center c_i
  • Each ball contains η = n/(20k log n) points
  • Choose γ so that at least one x_i lies in each b_i
  • Any p outside b_i is not a bad point
  • Number of bad points:
SLIDE 18

Clustering Cost of Bad Points

  • It is hard to determine the set of bad points
  • For every point in P, compute an approximate nearest neighbor in X

– The cost is the same as in the construction of S

  • Partition P
  • Good set P’

– P_α is the last class with more than 2β points
– P’ = ∪ P_i for i = 1…α
– |P’| ≥ n/2 and …

SLIDE 19

Proof

  • Size of P’:
  • Cost is roughly the same for all p’
  • Constant factor k-median clustering

– Run O(log n) iterations
– In each iteration we get |X| = O(k log^2 n)
– So the total number of centers is O(k log^3 n)
– Approximation bounded by …

SLIDE 20

(1+ε) k-Median Approximation

  • Make A of size O(k log^3 n)
  • Get a coreset S of size O(k log^4 n)
  • Compute an O(n)-approximation using the k-center (min-max) algorithm [Gon85]

– The result is C_0

  • Use local search [AGK+01] to get down to exactly k centers

– Swap a point in the set of centers with one outside
– Keep the swap if it shows considerable improvement

  • Use these with the exponential grid once more to get the final coreset S
  • Time: O(|S|^2 k^3 log^9 n)
  • Size: O((k/ε^d) log n)
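The single-swap local search of [AGK+01] can be sketched as follows (a simplified, unweighted version; the real algorithm runs on the weighted coreset, and the "considerable improvement" factor is fixed here as `threshold`):

```python
import math

def kmedian_cost(P, C):
    return sum(min(math.dist(p, c) for c in C) for p in P)

def local_search_swap(P, C, threshold=0.99, max_rounds=100):
    """Repeatedly swap one center for one non-center while the cost
    drops by a considerable factor (below `threshold`)."""
    C = list(C)
    best = kmedian_cost(P, C)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(C)):
            for p in P:
                if p in C:
                    continue
                trial = C[:i] + [p] + C[i + 1:]   # swap center i for p
                cost = kmedian_cost(P, trial)
                if cost < threshold * best:        # keep only big improvements
                    C, best, improved = trial, cost, True
                    break
            if improved:
                break
        if not improved:
            return C        # local optimum under single swaps
    return C
```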
SLIDE 21

Centroid Sets

  • We would like to apply [KR99] directly, but it only works in the discrete case
  • Create a centroid set

– Make a (k, ε/12)-coreset S
– Compute an exponential grid around each point in S, with R = ν_B(P)/n
– The centroid set D has size O(k^2 ε^{-2d} log^2 n)

  • Proof
  • Now run [KR99], using only centers from D
SLIDE 22

Summary of Construction

  • Compute 2-approximate k-center clustering of P
  • Compute set of good points P’ and X
  • Repeat log n times to get A
  • Compute S from A and P using exp. grid
  • Compute O(n) approximation of S
  • Apply local search alg. to find k centers
  • Compute coreset from k centers and P using exp. grid
  • Compute D from coreset and k centers using exp. grid
  • Apply [KR99] using only centers from D
SLIDE 23

Discrete k-medians

  • Compute an ε/4 centroid set
  • Find a representative set

– Points of P snapped to D
– Discrete centroid set

  • Result
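Snapping P to the centroid set D is just a nearest-point assignment; a sketch (`snap` is my name, not the paper's):

```python
import math

def snap(P, D):
    """Map each point of P to its nearest point in the centroid set D,
    yielding the discrete representative set."""
    return [min(D, key=lambda d: math.dist(p, d)) for p in P]
```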
SLIDE 24

k-Means

  • Everything is the same up to local search algorithm
  • Algorithm due to Kanungo et al. [KMN+02]
  • Use Matoušek [Mat00] to compute k-means on the coreset

  • Result
SLIDE 25

Streaming

  • Partition P into sets

– P_i is either empty or |P_i| = 2^i·M, where M = O(k/ε^d)

  • Store a coreset Q_j for each P_j
  • Q_j is a (k, δ_j)-coreset for P_j
  • ∪ Q_j is a (k, ε/2)-coreset for P
  • When a new point enters

– Add the new p to P_0
– If Q_1 exists, merge the two, compute a new coreset, and continue until some Q_r does not exist
– Coresets can be merged efficiently
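The bucket scheme is merge-and-reduce, carrying upward like binary addition. A sketch in which `reduce` is a stand-in for the coreset construction (real code would rebuild an exponential-grid coreset of size M; the plumbing below only shows the bucket logic):

```python
def stream_insert(buckets, p, M, reduce):
    """Insert one point; buckets[i] holds Q_i (None if P_i is empty).

    Level 0 is a buffer of raw points; when it fills, it is summarized
    and carried upward, merging equal-level coresets along the way.
    """
    if buckets[0] is None:
        buckets[0] = []
    buckets[0].append(p)
    if len(buckets[0]) < M:
        return buckets
    carry = reduce(buckets[0], M)       # buffer full: summarize it
    buckets[0] = []
    i = 1
    while True:                         # carry upward, like binary addition
        if i == len(buckets):
            buckets.append(None)
        if buckets[i] is None:
            buckets[i] = carry
            return buckets
        carry = reduce(buckets[i] + carry, M)   # merge two level-i coresets
        buckets[i] = None
        i += 1
```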

SLIDE 26

End