Clustering with k-means: Introduction to Machine Learning (PowerPoint presentation)



SLIDE 1

INTRODUCTION TO MACHINE LEARNING

Clustering with k-means

SLIDE 2

Clustering, what?

  • Cluster: a collection of objects
  • Similar within a cluster
  • Dissimilar between clusters
  • Clustering: grouping objects into clusters
  • No labels: unsupervised classification
  • Many possible clusterings
SLIDE 3

Clustering, why?

  • Pattern Analysis
  • Visualise Data
  • Pre-processing Step
  • Outlier Detection
  • Targeted Marketing Programs
  • Student Segmentations
  • Data Mining
SLIDE 4

Clustering, how?

  • Measure of similarity: a distance d(…, …)
  • Numerical variables: metrics such as Euclidean, Manhattan, …
  • Categorical variables: construct your own distance
  • Clustering methods:
  • k-means
  • Hierarchical

Many variations

SLIDE 5

Compactness and Separation

  • Within-cluster sum of squares (WSS), a measure of compactness:

    WSS = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)^2

  • Between-cluster sum of squares (BSS), a measure of separation:

    BSS = \sum_{i=1}^{k} |C_i| \, d(m_i, \bar{x})^2

  (C_i: cluster i, m_i: centroid of cluster i, k: #clusters,
   |C_i|: #objects in cluster i, \bar{x}: sample mean of all objects)

Minimise WSS, maximise BSS
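With Euclidean distance, WSS and BSS can be computed by hand, and they always add up to the total sum of squares (TSS). A minimal R sketch; the toy data and variable names are illustrative, not from the slides:

```r
set.seed(1)
# Two illustrative groups in 2-D and a cluster label per object
data    <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                 matrix(rnorm(20, mean = 5), ncol = 2))
cluster <- rep(1:2, each = 10)

centroids  <- apply(data, 2, function(col) tapply(col, cluster, mean))
grand_mean <- colMeans(data)
sizes      <- as.vector(table(cluster))

wss <- sum((data - centroids[cluster, ])^2)                    # compactness
bss <- sum(sizes * rowSums(sweep(centroids, 2, grand_mean)^2)) # separation
tss <- sum(sweep(data, 2, grand_mean)^2)

all.equal(wss + bss, tss)   # TRUE: TSS = WSS + BSS
```

The identity TSS = WSS + BSS is what makes the WSS / TSS ratio on the later slides a number between 0 and 1.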

SLIDE 6

Goal: Partition data in k disjoint subsets

k-Means Algorithm

[Scatter plot of the data in the (x, y) plane]

Let’s take k = 3

SLIDE 7

1. Randomly assign k centroids


k-Means Algorithm

k = 3

Goal: Partition data in k disjoint subsets

SLIDE 8

1. Randomly assign k centroids

2. Assign each object to the closest centroid


k-Means Algorithm

k = 3

Goal: Partition data in k disjoint subsets

SLIDE 9

1. Randomly assign k centroids

2. Assign each object to the closest centroid
3. Move centroids to the average location of their assigned objects


k-Means Algorithm

k = 3

Goal: Partition data in k disjoint subsets

SLIDE 10

1. Randomly assign k centroids

2. Assign each object to the closest centroid
3. Move centroids to the average location of their assigned objects
4. Repeat steps 2 and 3


k-Means Algorithm

k = 3

Goal: Partition data in k disjoint subsets

SLIDE 11

1. Randomly assign k centroids

2. Assign each object to the closest centroid
3. Move centroids to the average location of their assigned objects
4. Repeat steps 2 and 3


k-Means Algorithm

The algorithm has converged!

k = 3

Goal: Partition data in k disjoint subsets
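The four steps above can be sketched directly in R, assuming Euclidean distance. `my_kmeans` and the toy data are illustrative, not the built-in `kmeans()` shown later:

```r
my_kmeans <- function(data, k, iters = 100) {
  # 1. Randomly assign k centroids (here: k randomly chosen objects)
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  cl <- rep(1L, nrow(data))
  for (i in seq_len(iters)) {
    # 2. Assign each object to the closest centroid
    d  <- as.matrix(dist(rbind(centroids, data)))[-(1:k), 1:k, drop = FALSE]
    cl <- apply(d, 1, which.min)
    # 3. Move each centroid to the average location of its objects
    new_c <- apply(data, 2, function(col) tapply(col, factor(cl, 1:k), mean))
    # 4. Repeat steps 2 and 3 until the centroids stop moving
    if (isTRUE(all.equal(unname(new_c), unname(centroids)))) break
    centroids <- new_c
  }
  list(cluster = as.integer(cl), centers = centroids)
}

set.seed(4)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
             matrix(rnorm(40, mean = 8, sd = 0.2), ncol = 2))
res <- my_kmeans(pts, k = 2)
table(res$cluster)
```

With two well-separated groups this converges in a few iterations; production code should also guard against clusters that end up empty.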

SLIDE 12

Choosing k

  • Goal: Find a k with low WSS
  • Problem: WSS keeps decreasing as k increases!
  • Solution: Stop increasing k once WSS starts decreasing slowly
  • Rule of thumb: fix k when WSS / TSS drops below 0.2

SLIDE 13

Choosing k: Scree Plot

Scree Plot: Visualizing the ratio WSS / TSS as function of k

[Scree plot: WSS / TSS on the vertical axis (0.2 to 1.0) against k = 1, …, 7]

Look for the elbow in the plot: here, choose k = 3
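The ratio comes straight out of the `kmeans()` output; a sketch that builds a scree plot on illustrative toy data (three assumed groups, not from the slides):

```r
set.seed(2)
# Toy data: three well-separated groups in 2-D
data <- rbind(cbind(rnorm(30, 0), rnorm(30, 0)),
              cbind(rnorm(30, 6), rnorm(30, 0)),
              cbind(rnorm(30, 3), rnorm(30, 6)))

# WSS / TSS for k = 1, ..., 7
ratios <- sapply(1:7, function(k) {
  km <- kmeans(data, centers = k, nstart = 20)
  km$tot.withinss / km$totss
})

plot(1:7, ratios, type = "b", xlab = "k", ylab = "WSS / TSS")
# The curve drops steeply until the elbow (k = 3 here), then flattens
```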

SLIDE 14

k-Means in R

  • centers: the starting centroids, or the number of clusters
  • nstart: how many times R restarts with different random centroids

> my_km <- kmeans(data, centers, nstart)

Distance: Euclidean metric

> my_km$tot.withinss   # WSS
> my_km$betweenss      # BSS
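A hypothetical worked call using the built-in iris measurements (an illustrative dataset choice, not from the slides):

```r
# Cluster the four numeric iris columns into 3 groups, 20 restarts
my_km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

my_km$tot.withinss                  # WSS
my_km$betweenss                     # BSS
my_km$tot.withinss / my_km$totss    # WSS / TSS ratio
table(my_km$cluster)                # cluster sizes
```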

SLIDE 15

Let’s practice!

SLIDE 16

Performance and Scaling

SLIDE 17

Cluster Evaluation

Not trivial! There is no ground truth

  • No true labels
  • No true response

Evaluation methods? Depends on the goal

Goal: Compact and Separated

Measurable!

SLIDE 18

Cluster Measures

WSS and BSS. Underlying idea:

  • Variance within clusters
  • Separation between clusters

Alternatives:

  • Diameter
  • Intercluster Distance

Comparing these measures gives a good indication of cluster quality

SLIDE 19

Diameter

[Scatter plot: the largest distance between two objects in the same cluster]

Measure of Compactness

SLIDE 20

Intercluster Distance

[Scatter plot: the smallest distance between objects in two different clusters]

Measure of Separation

SLIDE 21

Dunn’s Index

D = (minimal intercluster distance) / (maximal diameter)

SLIDE 22

Dunn’s Index

Notes:

  • High computational cost
  • Worst-case indicator

Higher Dunn: better separated / more compact clusters
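Using the definitions from the previous slides (diameter: largest object distance within a cluster; intercluster distance: smallest object distance between clusters), the index can be sketched from scratch. `dunn_index` is an illustrative name; clValid's `dunn()`, shown a few slides on, is the ready-made version:

```r
dunn_index <- function(data, cluster) {
  d    <- as.matrix(dist(data))          # Euclidean object distances
  same <- outer(cluster, cluster, "==")  # TRUE for within-cluster pairs
  diag(same) <- NA                       # ignore self-distances
  max_diameter   <- max(d[which(same)])  # worst-case compactness
  min_interclust <- min(d[which(!same)]) # worst-case separation
  min_interclust / max_diameter
}

# Tiny check: two tight pairs far apart give a large Dunn index
pts <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
dunn_index(pts, c(1, 1, 2, 2))   # 10: separation 10 / diameter 1
```

Because both ingredients are extremes over all object pairs, this is the "worst case" indicator the notes mention, and it costs O(n²) distances.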

SLIDE 23

Alternative measures

  • Internal Validation: based on intrinsic knowledge
  • BIC Index
  • Silhouette Index
  • External Validation: based on previous knowledge
  • Hubert’s Correlation
  • Jaccard’s Coefficient
SLIDE 24

Evaluating in R

Libraries: cluster and clValid

> dunn(clusters = my_km$cluster, Data = ...)

Dunn’s Index:

  • clusters: cluster partitioning vector
  • Data: original dataset
SLIDE 25

Scale Issues

Metrics are often scale dependent!

Which pair is most similar? ( Age, Income, IQ )

  • X1 = (28, 72000, 120)
  • X2 = (56, 73000, 80)
  • X3 = (29, 74500, 118)
  • Intuition: (X1, X3)
  • Euclidean: (X1, X2)

Solution: Rescale, e.g. income in thousands of dollars
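The intuition can be checked with `dist()`, using the three observations from the slide:

```r
X <- rbind(X1 = c(28, 72000, 120),
           X2 = c(56, 73000, 80),
           X3 = c(29, 74500, 118))
round(as.matrix(dist(X)))      # raw scale: income dominates, X1-X2 closest

X[, 2] <- X[, 2] / 1000        # rescale income to thousands of dollars
round(as.matrix(dist(X)), 1)   # now X1 and X3 are by far the closest pair
```

On the raw scale the income column (in the thousands) swamps the age and IQ differences; after rescaling, all three variables contribute comparably.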

SLIDE 26

Standardizing

Problem: Multiple variables on different scales
Solution: Standardize your data

  • 1. Subtract the mean
  • 2. Divide by the standard deviation

> scale(data)

Note: Standardizing changes the interpretation of the variables
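A quick check of both steps with `scale()`, using the observations from the previous slide:

```r
m <- rbind(c(28, 72000, 120),
           c(56, 73000, 80),
           c(29, 74500, 118))
ms <- scale(m)        # 1. subtract the column mean, 2. divide by the sd

colMeans(ms)          # ~0 for every column
apply(ms, 2, sd)      # 1 for every column
```

Each standardized column is now in "standard deviations from the mean" rather than years, dollars, or IQ points, which is the change of interpretation the note warns about.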

SLIDE 27

Let’s practice!

SLIDE 28

Hierarchical Clustering

SLIDE 29

Hierarchical Clustering

Hierarchy:

  • Which objects cluster first?
  • Which cluster pairs merge? When?

Bottom-up:

  • Starts from the objects
  • Builds a hierarchy of clusters
SLIDE 30

Bottom-Up: Algorithm

Pre: Calculate distances between objects

[Figure: the objects to be clustered]

SLIDE 31

Bottom-Up: Algorithm

Pre: Calculate distances between objects

[Figure: the objects and their pairwise distance matrix]

SLIDE 32

Bottom-Up: Algorithm

1. Put every object in its own cluster

SLIDE 33

Bottom-Up: Algorithm

2. Find the closest pair of clusters and merge them
SLIDE 34

Bottom-Up: Algorithm

3. Compute the distances between the new cluster and the old ones
SLIDE 35

Bottom-Up: Algorithm

4. Repeat steps 2 and 3
SLIDE 36

Bottom-Up: Algorithm

4. Repeat steps 2 and 3
SLIDE 37

Bottom-Up: Algorithm

4. Repeat steps 2 and 3
SLIDE 38

Bottom-Up: Algorithm

4. Repeat steps 2 and 3, until only one cluster remains
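The loop above can be sketched from scratch for single linkage (illustrative code and names; `hclust()` on the upcoming slides is the real implementation):

```r
single_linkage <- function(data) {
  d  <- as.matrix(dist(data))      # Pre: distances between objects
  cl <- seq_len(nrow(data))        # 1. every object in its own cluster
  merges <- list()
  while (length(unique(cl)) > 1) {
    # 2. find the closest pair of clusters
    #    (single linkage: smallest object distance between them)
    best <- c(Inf, NA, NA)
    for (a in unique(cl)) for (b in unique(cl)) {
      if (a < b) {
        dab <- min(d[cl == a, cl == b])
        if (dab < best[1]) best <- c(dab, a, b)
      }
    }
    # 3./4. merge them, record the merge height, and repeat
    merges[[length(merges) + 1]] <- best
    cl[cl == best[3]] <- best[2]
  }
  merges                           # list of (height, cluster, cluster)
}

# Tiny check: two tight pairs far apart merge pair-wise first
pts <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
single_linkage(pts)
```

The recorded merge heights are exactly what a dendrogram plots on its vertical axis.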
SLIDE 39

Linkage-Methods

  • Single-Linkage: minimal distance between clusters
  • Complete-Linkage: maximal distance between clusters
  • Average-Linkage: average distance between clusters

Different linkage methods can give different clusterings

SLIDE 40

Single-Linkage

Minimal distance between objects in the two clusters

SLIDE 41

Complete-Linkage

Maximal distance between objects in the two clusters

SLIDE 42

Single-Linkage: Chaining

  • Often undesired
SLIDE 43

Single-Linkage: Chaining

  • Often undesired
SLIDE 44

Single-Linkage: Chaining

  • Often undesired
SLIDE 45

Single-Linkage: Chaining

  • Can be a great outlier detector
  • Often undesired
SLIDE 46

Dendrogram

[Dendrogram: merge height on the vertical axis, leaves/objects along the bottom; each join is a merge, and a horizontal cut yields a clustering]

SLIDE 47

Hierarchical Clustering in R

> dist(x, method)

Library: stats

> hclust(d, method)

  • dist(): x is the dataset, method the distance metric
  • hclust(): d is the distance matrix, method the linkage method
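A hypothetical worked example of the two calls on illustrative toy data (two assumed groups, not from the slides):

```r
set.seed(3)
# Toy data: two well-separated groups of 10 objects each
data <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
              matrix(rnorm(20, mean = 8), ncol = 2))

d  <- dist(data, method = "euclidean")   # object distances
hc <- hclust(d, method = "average")      # average linkage

plot(hc)                                 # dendrogram
clusters <- cutree(hc, k = 2)            # cut the tree into two clusters
table(clusters)                          # cluster sizes
```

`cutree()` (also from stats) is the programmatic version of the horizontal dendrogram cut on the previous slide.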
SLIDE 48

Hierarchical: Pros and Cons

  • Pros
  • In-depth analysis
  • Linkage-methods
  • Cons
  • High computational cost
  • Can never undo merges

Different linkage methods can reveal different patterns

SLIDE 49

k-Means: Pros and Cons

  • Pros
  • Can undo merges
  • Fast computations
  • Cons
  • Fixed #Clusters
  • Dependent on starting centroids
SLIDE 50

Let’s practice!