INTRODUCTION TO MACHINE LEARNING
Introduction to Machine Learning
Clustering, what?
- Cluster: collection of objects
  - Similar within cluster
  - Dissimilar between clusters
- Clustering: grouping objects in clusters
- No labels: unsupervised classification
- Plenty of possible clusterings
Clustering, why?
- Pattern Analysis
- Visualise Data
- Pre-processing Step
- Outlier Detection
- …
- Targeted Marketing Programs
- Student Segmentations
- Data Mining
- …
Clustering, how?
- Measure of Similarity: d( …, …)
  - Numerical variables: metrics (Euclidean, Manhattan, …)
  - Categorical variables: construct your own distance
- Clustering Methods
- k-means
- Hierarchical
- …
Many variations
Compactness and Separation
- Within Cluster Sums of Squares (WSS): measure of compactness
  WSS = Σ_i Σ_{x in C_i} d(x, c_i)²  (C_i: cluster i of k, c_i: its centroid, x: object)
- Between Cluster Sums of Squares (BSS): measure of separation
  BSS = Σ_i n_i · d(c_i, x̄)²  (n_i: #objects in cluster i, x̄: sample mean)
Minimise WSS ↔ Maximise BSS
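Minimising WSS and maximising BSS are two sides of the same coin, because WSS + BSS equals the total sum of squares (TSS). A plain-Python sketch on a made-up toy dataset (helper names are mine, not the course's R code):

```python
# Hypothetical toy data: 2-D points with a given cluster assignment.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = [0, 0, 0, 1, 1, 1]
k = 2

def mean_point(pts):
    """Component-wise mean of a list of points (the centroid)."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

clusters = [[p for p, l in zip(points, labels) if l == i] for i in range(k)]
centroids = [mean_point(c) for c in clusters]
grand_mean = mean_point(points)

# WSS: squared distances of objects to their own centroid (compactness)
wss = sum(sq_dist(p, centroids[i]) for i in range(k) for p in clusters[i])
# BSS: cluster sizes times squared distance of centroid to the sample mean (separation)
bss = sum(len(clusters[i]) * sq_dist(centroids[i], grand_mean) for i in range(k))
# TSS: squared distances of all objects to the sample mean
tss = sum(sq_dist(p, grand_mean) for p in points)

print(wss, bss, tss)  # WSS + BSS = TSS
```

Since TSS is fixed by the data, any partition that lowers WSS necessarily raises BSS.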
k-Means Algorithm
Goal: Partition the data in k disjoint subsets
Let's take k = 3
1. Randomly assign k centroids
2. Assign each observation to its closest centroid
3. Move each centroid to the average location of its assigned observations
4. Repeat steps 2 and 3 until the assignments no longer change
The algorithm has converged!
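The four steps above fit in a few lines. The course uses R's kmeans(); this is a plain-Python sketch (Lloyd's algorithm) with deterministic initial centroids, the first k points, and hypothetical data, so the run is reproducible:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]        # 1. assign k centroids
    labels = []
    for _ in range(max_iter):
        # 2. assign each observation to its closest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # 3. move each centroid to the average location of its members
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(v) / len(members) for v in zip(*members))
                       if members else centroids[c])
        if new == centroids:                         # 4. repeat until converged
            break
        centroids = new
    return labels, centroids

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centroids = kmeans(data, k=2)
print(labels)  # the two lower-left points form one cluster, the rest the other
```

With random initial centroids (as in the real algorithm) the result can depend on the start, which is why R's kmeans() offers nstart.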
Choosing k
- Goal: Find the k that minimizes WSS
- Problem: WSS keeps decreasing as k increases!
- Solution: Fix k where WSS starts decreasing slowly, e.g. once WSS / TSS < 0.2
Choosing k: Scree Plot
Scree Plot: Visualizing the ratio WSS / TSS as a function of k
[Scree plot: WSS / TSS (0 to 1) against k = 1, …, 7, with an elbow at k = 3]
Look for the elbow in the plot: here, choose k = 3
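The elbow rule can be automated: run k-means for increasing k and keep the first k whose WSS / TSS ratio drops below 0.2. A self-contained sketch on a hypothetical 1-D dataset with three well-separated groups (the helper kmeans_1d and its deterministic initialisation are mine, not the course's R workflow):

```python
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

def kmeans_1d(xs, k, max_iter=100):
    """Minimal 1-D k-means with evenly spaced observations as initial centroids."""
    xs = sorted(xs)
    centroids = [xs[round(i * (len(xs) - 1) / max(k - 1, 1))] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: (x - centroids[c]) ** 2) for x in xs]
        new = []
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            new.append(sum(members) / len(members) if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return xs, labels, centroids

grand_mean = sum(data) / len(data)
tss = sum((x - grand_mean) ** 2 for x in data)

chosen, ratios = None, {}
for k in range(1, 6):
    xs, labels, centroids = kmeans_1d(data, k)
    wss = sum((x - centroids[l]) ** 2 for x, l in zip(xs, labels))
    ratios[k] = wss / tss
    if chosen is None and ratios[k] < 0.2:
        chosen = k

print(chosen)  # the elbow: k = 3
```

At k = 1 the ratio is exactly 1 (WSS equals TSS), and it falls off sharply once k matches the true number of groups.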
k-Means in R
> my_km <- kmeans(data, centers, nstart)
- centers: starting centroids or #clusters
- nstart: #times R restarts with different centroids
- Distance: Euclidean metric
> my_km$tot.withinss (WSS)
> my_km$betweenss (BSS)
Let’s practice!
Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
- No true labels
- No true response
Evaluation methods? Depends on the goal
Goal: compact and separated clusters → Measurable!
Cluster Measures
- WSS and BSS, underlying ideas:
  - Variance within clusters
  - Separation between clusters
  → Compare clusterings
- Alternative measures:
  - Diameter
  - Intercluster Distance
  → Good indication of cluster quality
Diameter
Diameter: the largest distance between two objects within the same cluster
Measure of Compactness
Intercluster Distance
Intercluster distance: the distance between objects from two different clusters
Measure of Separation
Dunn's Index
Dunn's index = minimal intercluster distance / maximal cluster diameter
Notes:
- High computational cost
- Worst-case indicator
Higher Dunn's index → better separated / more compact clusters
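The ratio behind Dunn's index, the smallest gap between clusters over the largest diameter, can be sketched directly. Cluster assignments and helper names below are hypothetical:

```python
import math

# Two hypothetical, well-separated clusters of 2-D points.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]

def diameter(c):
    """Largest distance between two objects within one cluster (compactness)."""
    return max((math.dist(p, q) for p in c for q in c), default=0.0)

def min_intercluster(ca, cb):
    """Smallest distance between objects of two different clusters (separation)."""
    return min(math.dist(p, q) for p in ca for q in cb)

max_diam = max(diameter(c) for c in clusters)
min_sep = min(min_intercluster(clusters[i], clusters[j])
              for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
dunn = min_sep / max_diam
print(dunn)  # well above 1: clusters are further apart than they are wide
```

Because it takes the worst diameter and the worst gap, a single stretched cluster or one close pair of clusters drags the whole index down, which is exactly the "worst-case indicator" caveat above.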
Alternative measures
- Internal Validation: based on intrinsic knowledge
  - BIC Index
  - Silhouette Index
- External Validation: based on previous knowledge
  - Hubert's Correlation
  - Jaccard's Coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn's Index:
> dunn(clusters = my_km, Data = ...)
- clusters: cluster partitioning vector
- Data: original dataset
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? ( Age, Income, IQ )
- X1 = (28, 72000, 120)
- X2 = (56, 73000, 80)
- X3 = (29, 74500, 118)
- Intuition: (X1, X3)
- Euclidean: (X1, X2)
Solution: Rescale income to units of $1000 (divide by 1000)
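The slide's example can be checked directly: on the raw data the income column dominates the Euclidean distance, and rescaling it flips which pair is closest.

```python
import math

# The three profiles from the slide, as (Age, Income, IQ) tuples.
people = {
    "X1": (28, 72000, 120),
    "X2": (56, 73000, 80),
    "X3": (29, 74500, 118),
}
pairs = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]

# Raw data: income differences (thousands of dollars) swamp age and IQ.
closest_raw = min(pairs, key=lambda p: math.dist(people[p[0]], people[p[1]]))

# Income rescaled to units of $1000: all columns are of comparable size.
rescaled = {k: (age, income / 1000, iq) for k, (age, income, iq) in people.items()}
closest_scaled = min(pairs, key=lambda p: math.dist(rescaled[p[0]], rescaled[p[1]]))

print(closest_raw)     # ('X1', 'X2'): income dominates
print(closest_scaled)  # ('X1', 'X3'): matches intuition
```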
Standardizing
Problem: Multiple variables on different scales
Solution: Standardize your data
- 1. Subtract the mean
- 2. Divide by the standard deviation
> scale(data)
Note: Standardizing changes the interpretation of your variables
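What R's scale() does to one column, sketched in plain Python (using the sample standard deviation, as R does by default):

```python
from statistics import mean, stdev

def standardize(column):
    """1. subtract the column mean, 2. divide by the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Hypothetical income column, in dollars.
income = [72000.0, 73000.0, 74500.0, 69000.0, 81000.0]
z = standardize(income)
# The standardized column has mean 0 and standard deviation 1; values are now
# "standard deviations from the mean" rather than dollars.
```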
Let’s practice!
Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
- Which objects cluster first?
- Which cluster pairs merge? When?
Bottom-up:
- Starts from the objects
- Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: Calculate distances between objects
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until only one cluster remains
Linkage-Methods
- Single-Linkage: minimal distance between clusters
- Complete-Linkage: maximal distance between clusters
- Average-Linkage: average distance between clusters
Different Clusterings
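A minimal bottom-up sketch with a pluggable linkage function shows how the three methods slot into the same algorithm. The course uses R's hclust(); names and the 1-D data below are hypothetical, and the naive pairwise search illustrates the high computational cost mentioned later:

```python
def linkage_dist(a, b, method):
    """Distance between two clusters of 1-D points under a linkage method."""
    ds = [abs(x - y) for x in a for y in b]
    if method == "single":
        return min(ds)        # minimal distance between the clusters
    if method == "complete":
        return max(ds)        # maximal distance between the clusters
    return sum(ds) / len(ds)  # average linkage

def agglomerate(points, n_clusters, method="single"):
    clusters = [[p] for p in points]  # 1. every object starts as its own cluster
    while len(clusters) > n_clusters:
        # 2. find the closest pair of clusters under the chosen linkage ...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]], method),
        )
        # 3. ... and merge them; distances to the merged cluster are
        # recomputed by linkage_dist on the next pass
        clusters[i] += clusters.pop(j)
    return clusters

groups = agglomerate([0.0, 1.0, 2.0, 10.0, 11.0], n_clusters=2)
print(groups)  # the three small values end up in one cluster, the two large in the other
```

Stopping at n_clusters corresponds to cutting the dendrogram at a chosen height; a merge can never be undone once made.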
Single-Linkage
Minimal distance between objects in each cluster
Complete-Linkage
Maximal distance between objects in each cluster
Single-Linkage: Chaining
- Often undesired
- Can be a great outlier detector
Dendrogram
[Dendrogram: objects are the leaves; each merge is drawn at its height; cutting the tree at a chosen height yields the clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
- x: dataset
- method: distance
> hclust(d, method)
- d: distance matrix
- method: linkage
Hierarchical: Pro and Cons
- Pros
  - In-depth analysis
  - Linkage-methods → different patterns
- Cons
  - High computational cost
  - Can never undo merges
k-Means: Pro and Cons
- Pros
- Can undo merges
- Fast computations
- Cons
- Fixed #Clusters
- Dependent on starting centroids