INTRODUCTION TO MACHINE LEARNING
Introduction to Machine Learning
Clustering, what?
- Cluster: collection of objects
  - Similar within cluster
  - Dissimilar between clusters
- Clustering: grouping objects in clusters
- No labels: unsupervised classification
- Plenty of possible clusterings
Clustering, why?
- Pattern Analysis
- Visualise Data
- Pre-processing Step
- Outlier Detection
- …
- Targeted Marketing Programs
- Student Segmentations
- Data Mining
- …
Clustering, how?
- Measure of Similarity: d( …, …)
  - Numerical variables: metrics (Euclidean, Manhattan, …)
  - Categorical variables: construct your own distance
- Clustering Methods
- k-means
- Hierarchical
- …
Many variations
Compactness and Separation
- Within Cluster Sums of Squares (WSS): measure of compactness
  WSS = Σ_i Σ_{x in C_i} d(x, c_i)²  (C_i: cluster i of k, c_i: its centroid, x: object)
- Between Cluster Sums of Squares (BSS): measure of separation
  BSS = Σ_i n_i · d(c_i, x̄)²  (n_i: #objects in cluster i, x̄: sample mean)
Minimise WSS ↔ Maximise BSS
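Minimising WSS and maximising BSS are two sides of the same coin, because WSS + BSS equals the total sum of squares (TSS). A plain-Python sketch on a made-up toy dataset (helper names are mine, not the course's R code):

```python
# Hypothetical toy data: 2-D points with a given cluster assignment.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = [0, 0, 0, 1, 1, 1]
k = 2

def mean_point(pts):
    """Component-wise mean of a list of points (the centroid)."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

clusters = [[p for p, l in zip(points, labels) if l == i] for i in range(k)]
centroids = [mean_point(c) for c in clusters]
grand_mean = mean_point(points)

# WSS: squared distances of objects to their own centroid (compactness)
wss = sum(sq_dist(p, centroids[i]) for i in range(k) for p in clusters[i])
# BSS: cluster sizes times squared distance of centroid to the sample mean (separation)
bss = sum(len(clusters[i]) * sq_dist(centroids[i], grand_mean) for i in range(k))
# TSS: squared distances of all objects to the sample mean
tss = sum(sq_dist(p, grand_mean) for p in points)

print(wss, bss, tss)  # WSS + BSS = TSS
```

Since TSS is fixed by the data, any partition that lowers WSS necessarily raises BSS.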
k-Means Algorithm
Goal: Partition the data in k disjoint subsets
Let's take k = 3
1. Randomly assign k centroids
2. Assign each observation to its closest centroid
3. Move each centroid to the average location of its assigned observations
4. Repeat steps 2 and 3 until the assignments no longer change
The algorithm has converged!
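The four steps above fit in a few lines. The course uses R's kmeans(); this is a plain-Python sketch (Lloyd's algorithm) with deterministic initial centroids, the first k points, and hypothetical data, so the run is reproducible:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]        # 1. assign k centroids
    labels = []
    for _ in range(max_iter):
        # 2. assign each observation to its closest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # 3. move each centroid to the average location of its members
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(v) / len(members) for v in zip(*members))
                       if members else centroids[c])
        if new == centroids:                         # 4. repeat until converged
            break
        centroids = new
    return labels, centroids

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centroids = kmeans(data, k=2)
print(labels)  # the two lower-left points form one cluster, the rest the other
```

With random initial centroids (as in the real algorithm) the result can depend on the start, which is why R's kmeans() offers nstart.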
Choosing k
- Goal: Find the k that minimizes WSS
- Problem: WSS keeps decreasing as k increases!
- Solution: Fix k where WSS starts decreasing slowly, e.g. once WSS / TSS < 0.2
Choosing k: Scree Plot
Scree Plot: Visualizing the ratio WSS / TSS as a function of k
[Scree plot: WSS / TSS (0 to 1) against k = 1, …, 7, with an elbow at k = 3]
Look for the elbow in the plot: here, choose k = 3
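The elbow rule can be automated: run k-means for increasing k and keep the first k whose WSS / TSS ratio drops below 0.2. A self-contained sketch on a hypothetical 1-D dataset with three well-separated groups (the helper kmeans_1d and its deterministic initialisation are mine, not the course's R workflow):

```python
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

def kmeans_1d(xs, k, max_iter=100):
    """Minimal 1-D k-means with evenly spaced observations as initial centroids."""
    xs = sorted(xs)
    centroids = [xs[round(i * (len(xs) - 1) / max(k - 1, 1))] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: (x - centroids[c]) ** 2) for x in xs]
        new = []
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            new.append(sum(members) / len(members) if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return xs, labels, centroids

grand_mean = sum(data) / len(data)
tss = sum((x - grand_mean) ** 2 for x in data)

chosen, ratios = None, {}
for k in range(1, 6):
    xs, labels, centroids = kmeans_1d(data, k)
    wss = sum((x - centroids[l]) ** 2 for x, l in zip(xs, labels))
    ratios[k] = wss / tss
    if chosen is None and ratios[k] < 0.2:
        chosen = k

print(chosen)  # the elbow: k = 3
```

At k = 1 the ratio is exactly 1 (WSS equals TSS), and it falls off sharply once k matches the true number of groups.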
k-Means in R
> my_km <- kmeans(data, centers, nstart)
- centers: starting centroids or #clusters
- nstart: #times R restarts with different centroids
- Distance: Euclidean metric
> my_km$tot.withinss (WSS)
> my_km$betweenss (BSS)
Let’s practice!
Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
- No true labels
- No true response
Evaluation methods? Depends on the goal
Goal: compact and separated clusters → Measurable!
Cluster Measures
- WSS and BSS, underlying ideas:
  - Variance within clusters
  - Separation between clusters
  → Compare clusterings
- Alternative measures:
  - Diameter
  - Intercluster Distance
  → Good indication of cluster quality
Diameter
Diameter: the largest distance between two objects within the same cluster
Measure of Compactness
Intercluster Distance
Intercluster distance: the distance between objects from two different clusters
Measure of Separation
Dunn's Index
Dunn's index = minimal intercluster distance / maximal cluster diameter
Notes:
- High computational cost
- Worst-case indicator
Higher Dunn's index → better separated / more compact clusters
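The ratio behind Dunn's index, the smallest gap between clusters over the largest diameter, can be sketched directly. Cluster assignments and helper names below are hypothetical:

```python
import math

# Two hypothetical, well-separated clusters of 2-D points.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]

def diameter(c):
    """Largest distance between two objects within one cluster (compactness)."""
    return max((math.dist(p, q) for p in c for q in c), default=0.0)

def min_intercluster(ca, cb):
    """Smallest distance between objects of two different clusters (separation)."""
    return min(math.dist(p, q) for p in ca for q in cb)

max_diam = max(diameter(c) for c in clusters)
min_sep = min(min_intercluster(clusters[i], clusters[j])
              for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
dunn = min_sep / max_diam
print(dunn)  # well above 1: clusters are further apart than they are wide
```

Because it takes the worst diameter and the worst gap, a single stretched cluster or one close pair of clusters drags the whole index down, which is exactly the "worst-case indicator" caveat above.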
Alternative measures
- Internal Validation: based on intrinsic knowledge
  - BIC Index
  - Silhouette Index
- External Validation: based on previous knowledge
  - Hubert's Correlation
  - Jaccard's Coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn's Index:
> dunn(clusters = my_km, Data = ...)
- clusters: cluster partitioning vector
- Data: original dataset
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? ( Age, Income, IQ )
- X1 = (28, 72000, 120)
- X2 = (56, 73000, 80)
- X3 = (29, 74500, 118)
- Intuition: (X1, X3)
- Euclidean: (X1, X2)
Solution: Rescale income to units of $1000 (divide by 1000)
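The slide's example can be checked directly: on the raw data the income column dominates the Euclidean distance, and rescaling it flips which pair is closest.

```python
import math

# The three profiles from the slide, as (Age, Income, IQ) tuples.
people = {
    "X1": (28, 72000, 120),
    "X2": (56, 73000, 80),
    "X3": (29, 74500, 118),
}
pairs = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]

# Raw data: income differences (thousands of dollars) swamp age and IQ.
closest_raw = min(pairs, key=lambda p: math.dist(people[p[0]], people[p[1]]))

# Income rescaled to units of $1000: all columns are of comparable size.
rescaled = {k: (age, income / 1000, iq) for k, (age, income, iq) in people.items()}
closest_scaled = min(pairs, key=lambda p: math.dist(rescaled[p[0]], rescaled[p[1]]))

print(closest_raw)     # ('X1', 'X2'): income dominates
print(closest_scaled)  # ('X1', 'X3'): matches intuition
```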
Standardizing
Problem: Multiple variables on different scales
Solution: Standardize your data
- 1. Subtract the mean
- 2. Divide by the standard deviation
> scale(data)
Note: Standardizing changes the interpretation of your variables
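What R's scale() does to one column, sketched in plain Python (using the sample standard deviation, as R does by default):

```python
from statistics import mean, stdev

def standardize(column):
    """1. subtract the column mean, 2. divide by the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Hypothetical income column, in dollars.
income = [72000.0, 73000.0, 74500.0, 69000.0, 81000.0]
z = standardize(income)
# The standardized column has mean 0 and standard deviation 1; values are now
# "standard deviations from the mean" rather than dollars.
```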
Let’s practice!
Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
- Which objects cluster first?
- Which cluster pairs merge? When?
Bottom-up:
- Starts from the objects
- Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: Calculate distances between objects
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until only one cluster remains
Linkage-Methods
- Single-Linkage: minimal distance between clusters
- Complete-Linkage: maximal distance between clusters
- Average-Linkage: average distance between clusters
Different Clusterings
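A minimal bottom-up sketch with a pluggable linkage function shows how the three methods slot into the same algorithm. The course uses R's hclust(); names and the 1-D data below are hypothetical, and the naive pairwise search illustrates the high computational cost mentioned later:

```python
def linkage_dist(a, b, method):
    """Distance between two clusters of 1-D points under a linkage method."""
    ds = [abs(x - y) for x in a for y in b]
    if method == "single":
        return min(ds)        # minimal distance between the clusters
    if method == "complete":
        return max(ds)        # maximal distance between the clusters
    return sum(ds) / len(ds)  # average linkage

def agglomerate(points, n_clusters, method="single"):
    clusters = [[p] for p in points]  # 1. every object starts as its own cluster
    while len(clusters) > n_clusters:
        # 2. find the closest pair of clusters under the chosen linkage ...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], method),
        )
        # 3. ... and merge them; distances to the merged cluster are
        # recomputed by linkage_dist on the next pass
        clusters[i] += clusters.pop(j)
    return clusters

groups = agglomerate([0.0, 1.0, 2.0, 10.0, 11.0], n_clusters=2)
print(groups)  # the three small values end up in one cluster, the two large in the other
```

Stopping at n_clusters corresponds to cutting the dendrogram at a chosen height; a merge can never be undone once made.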
Single-Linkage
Minimal distance between objects in each cluster
Complete-Linkage
Maximal distance between objects in each cluster
Single-Linkage: Chaining
- Often undesired
- Can be a great outlier detector
Dendrogram
[Dendrogram: objects are the leaves; each merge is drawn at its height; cutting the tree at a chosen height yields the clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
- x: dataset
- method: distance
> hclust(d, method)
- d: distance matrix
- method: linkage
Hierarchical: Pro and Cons
- Pros
  - In-depth analysis
  - Linkage-methods → different patterns
- Cons
  - High computational cost
  - Can never undo merges
k-Means: Pro and Cons
- Pros
- Can undo merges
- Fast computations
- Cons
- Fixed #Clusters
- Dependent on starting centroids