
SLIDE 1

Evaluation of Hierarchical Clustering Algorithms for Document Datasets

Paper by Ying Zhao and George Karypis, University of Minnesota (2002)
CS 6501 Paper Presentation - April 6, 2016
Matthew Hawthorn, Nikhil Mascarenhas, Shannon Mitchell

SLIDE 2

Motivation

Hierarchical clustering of documents

○ Intuitive; produces clusterings at different levels of granularity.

Two major approaches

○ Partitional
○ Agglomerative

General view was that partitional algorithms are inferior to agglomerative ones.
Authors ran an experiment to compare these approaches.
They also defined a new hybrid algorithm, the “constrained agglomerative algorithm”.

SLIDE 4

Hierarchical Clustering: Partitional Algorithms

Top-down

Start with one cluster with all documents
Start at root, divide down to leaves
Split the cluster which most improves the criterion function

Complexity: O(n log n)
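The top-down procedure on this slide can be sketched as repeated bisection. This is a minimal illustration, not the authors' implementation: it assumes documents are unit-length vectors stored as lists of floats, that the criterion is the norm of a cluster's summed vector, and that each 2-way split is found by a simple k-means-style refinement. The names `bisect` and `repeated_bisection` and the seeding rule are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def vec_sum(vectors):
    return [sum(col) for col in zip(*vectors)]

def quality(cluster):
    # Assumed criterion: norm of the summed vector (larger = more cohesive).
    return norm(vec_sum(cluster)) if cluster else 0.0

def bisect(cluster, iters=10):
    # Seed with the first document and the document least similar to it.
    ca = cluster[0]
    cb = min(cluster[1:], key=lambda d: dot(ca, d))
    # Fallback split, used only if refinement never yields two non-empty halves.
    left, right = [cluster[0]], list(cluster[1:])
    for _ in range(iters):
        na, nb = norm(ca) or 1.0, norm(cb) or 1.0
        new_left = [d for d in cluster if dot(d, ca) / na >= dot(d, cb) / nb]
        new_right = [d for d in cluster if dot(d, ca) / na < dot(d, cb) / nb]
        if not new_left or not new_right:
            break
        left, right = new_left, new_right
        ca, cb = vec_sum(left), vec_sum(right)
    return left, right

def repeated_bisection(docs, k):
    # Start with one cluster holding all documents, then split until k clusters.
    clusters = [list(docs)]
    while len(clusters) < k:
        best = None
        for i, c in enumerate(clusters):
            if len(c) < 2:
                continue
            l, r = bisect(c)
            gain = quality(l) + quality(r) - quality(c)
            if best is None or gain > best[0]:
                best = (gain, i, l, r)
        if best is None:
            break
        _, i, l, r = best
        clusters[i:i + 1] = [l, r]  # replace the chosen cluster by its two halves
    return clusters
```

For clarity the sketch re-bisects every cluster in each round, so it does not reach the O(n log n) cost quoted on the slide.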

SLIDE 13

Hierarchical Clustering: Agglomerative Algorithms

Bottom-up

Each document starts as its own cluster
Start at leaves, merge to root
Complexity:

○ O(n^2 log n) when caching of intermediate values of the objective is possible
○ O(n^3) otherwise
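A minimal sketch of the bottom-up procedure, assuming unit-length document vectors (so a dot product is a cosine similarity) and group-average similarity between clusters. The function names and the merge-record format are illustrative choices, not taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def group_average(c1, c2):
    # Average pairwise similarity between the documents of two clusters.
    return sum(dot(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(docs, similarity=group_average):
    # Each document starts as its own cluster, keyed by an integer id.
    clusters = {i: [d] for i, d in enumerate(docs)}
    merges = []
    next_id = len(docs)
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Naive search over all pairs with no caching, so at least cubic in n.
        i, j = max(
            ((a, b) for x, a in enumerate(ids) for b in ids[x + 1:]),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, next_id))  # (left child, right child, new cluster id)
        next_id += 1
    return merges
```

The cheaper O(n^2 log n) bound on the slide would require caching pairwise similarities, for example in a priority queue, which this sketch deliberately omits.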

SLIDE 23

Criterion Functions

Global criterion functions drive the clustering process.

Internal Functions: Consider only documents within a cluster.
External Functions: Consider how various clusters are different from each other.
Graph-Based Functions: Construct a graph which represents the relationships between documents.
Hybrid Functions: Simultaneously consider internal and external criterion functions.

SLIDE 24

Internal Criterion Functions

m: Number of terms
n: Number of documents
k: Number of clusters
S1, S2, ..., Sk: Each one of the k clusters
n1, n2, ..., nk: Size of each cluster
d1, d2, ..., dn: TF-IDF vector for each document
DA: Sum of all vectors in cluster A
CA: Centroid vector of cluster A
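The formulas themselves did not survive in this transcript. In the Zhao and Karypis papers the two internal criteria are usually written, for unit-length document vectors, as I1 = sum over clusters of ||D_r||^2 / n_r and I2 = sum over clusters of ||D_r||, both to be maximized. The sketch below is a reconstruction under that assumption; the function names are illustrative.

```python
import math

def composite(cluster):
    # D_A: sum of all document vectors in cluster A.
    return [sum(col) for col in zip(*cluster)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def i1(clusters):
    # Sum over clusters of ||D_r||^2 / n_r, i.e. size-weighted
    # average pairwise similarity inside each cluster.
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)

def i2(clusters):
    # Sum over clusters of ||D_r||, i.e. total similarity of each
    # document to the centroid of its own cluster.
    return sum(norm(composite(c)) for c in clusters)
```

Both values grow as clusters become more cohesive, so a clustering that groups identical documents together scores higher than one that mixes them.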

SLIDE 25

External Criterion Functions

m: Number of terms
n: Number of documents
k: Number of clusters
S1, S2, ..., Sk: Each one of the k clusters
n1, n2, ..., nk: Size of each cluster
d1, d2, ..., dn: TF-IDF vector for each document
DA: Sum of all vectors in cluster A
CA: Centroid vector of cluster A
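The external formula is also missing from this transcript. In the Zhao and Karypis papers the external criterion is usually given as E1 = sum over clusters of n_r * cos(C_r, C), where C is the centroid of the whole collection, and it is minimized. The sketch below assumes that form; names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def composite(vectors):
    # Sum of a set of document vectors.
    return [sum(col) for col in zip(*vectors)]

def e1(clusters):
    # D: composite vector of the entire collection. The cosine between
    # composites equals the cosine between the corresponding centroids.
    D = composite([d for c in clusters for d in c])
    total = 0.0
    for c in clusters:
        Dr = composite(c)
        total += len(c) * dot(Dr, D) / (norm(Dr) * norm(D))
    return total
```

Smaller values mean the cluster centroids point away from the collection centroid, that is, the clusters are better separated from each other.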

SLIDE 26

Traditional Agglomerative Clustering Criteria

Single-linkage: minimum distance (authors’ abbreviation: ‘slink’)
Complete-linkage: maximum distance (authors’ abbreviation: ‘clink’)
Group average: average of distances (authors’ abbreviation: ‘UPGMA’)
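The three traditional criteria can be sketched as cluster-to-cluster similarity functions. This assumes unit-length document vectors and works with cosine similarity rather than distance, so "minimum distance" becomes "maximum similarity"; the function names simply reuse the authors' abbreviations.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slink(c1, c2):
    # Single-linkage: closest pair (minimum distance = maximum similarity).
    return max(dot(a, b) for a in c1 for b in c2)

def clink(c1, c2):
    # Complete-linkage: farthest pair (maximum distance = minimum similarity).
    return min(dot(a, b) for a in c1 for b in c2)

def upgma(c1, c2):
    # Group average: mean similarity over all cross-cluster pairs.
    return sum(dot(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

At each agglomerative step, the pair of clusters with the highest value of the chosen function is merged.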

SLIDE 27

Hierarchical Clustering: Constrained Agglomerative

Hybrid technique
Constrains agglomerative clustering by initializing with intermediate hierarchical partitional clustering
More likely to avoid early merge mistakes of agglomerative techniques
But takes advantage of the ease with which agglomerative techniques find small and cohesive clusters

SLIDE 28

Hierarchical Clustering: Constrained Agglomerative

1. Find k clusters using partitional clustering

SLIDE 33

Hierarchical Clustering: Constrained Agglomerative

2. Cluster the documents in these clusters using agglomerative clustering

SLIDE 39

Hierarchical Clustering: Constrained Agglomerative

3. Cluster the k clusters using agglomerative clustering
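The three steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' code: step 1 is taken as given (the `partition` argument is a hypothetical list of k lists of document indices produced by any partitional method), documents are unit-length vectors, and group-average similarity drives every merge. The tree is returned as nested tuples of document indices.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def upgma(c1, c2, docs):
    # Group-average similarity between two sets of document indices.
    return sum(dot(docs[a], docs[b]) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(nodes, docs):
    # nodes: list of (subtree, member_indices). Merge the most similar
    # pair repeatedly until a single node remains.
    nodes = list(nodes)
    while len(nodes) > 1:
        i, j = max(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda p: upgma(nodes[p[0]][1], nodes[p[1]][1], docs),
        )
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for x, n in enumerate(nodes) if x not in (i, j)] + [merged]
    return nodes[0]

def constrained_agglomerative(docs, partition):
    # Step 1 (assumed done): `partition` holds the k partitional clusters.
    # Step 2: build an agglomerative subtree inside each partitional cluster.
    subtrees = [agglomerate([(i, [i]) for i in part], docs) for part in partition]
    # Step 3: merge the k subtrees agglomeratively into one tree.
    return agglomerate(subtrees, docs)[0]
```

Because merges in step 2 never cross a partition boundary, an early mistaken merge between documents of different partitional clusters cannot occur.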

SLIDE 46

Computational Complexity

Partitional clustering of data into k clusters:

< O(n log(n)) (the cost of an entire partitional clustering)

log(n) levels
O(n) comparison and reassignment operations at each level
Full tree: log(n) levels × O(n) per level = O(n log(n))
Truncate at k clusters

SLIDE 48

Computational Complexity

Agglomerative clustering of docs in the k clusters:

O(k (n/k)^2 log(n/k))

k clusters, size ≈ n/k
Cost to cluster one cluster: O(size^2 log(size)) ≈ (n/k)^2 log(n/k)
Total over k clusters: k (n/k)^2 log(n/k)

SLIDE 49

Computational Complexity

Agglomerative clustering of the k clusters:

O(k^2 log(k))

k clusters; cost to cluster them agglomeratively = O(k^2 log(k))

SLIDE 52

Computational Complexity

Putting it all together:

O(n log(n)) + O(k (n/k)^2 log(n/k)) + O(k^2 log(k))

O(n log(n)): Initial partitional clustering
O(k (n/k)^2 log(n/k)): Agglomerative clustering within initial clusters
O(k^2 log(k)): Agglomerative clustering between initial clusters

SLIDE 53

Computational Complexity

Putting it all together:

O(n log(n)) + O(k (n/k)^2 log(n/k)) + O(k^2 log(k))

Dominant term for reasonable choices of k: O(k (n/k)^2 log(n/k))

SLIDE 55

Computational Complexity

Putting it all together:

O(k (n/k)^2 log(n/k))

If we let k ≈ √n, this reduces to:

O(n^(3/2) log(n))

Or in general, if k ≈ n^α with 0 < α < 1, complexity is:

O(n^(α + 2(1-α)) log(n)) = O(n^(2-α) log(n))

Better than Agglomerative: O(n^2 log(n))

Worse than Partitional: O(n log(n))

But on average it gives slightly better clustering quality than either
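The reduction can be written out step by step. The symbol T(n, k) is introduced here only as shorthand for the dominant term; it does not appear on the slides.

```latex
\begin{align*}
T(n,k) &= O\!\left(k \left(\frac{n}{k}\right)^{2} \log\frac{n}{k}\right)
        = O\!\left(\frac{n^{2}}{k} \log\frac{n}{k}\right) \\
k = n^{\alpha}:\quad
T &= O\!\left(n^{2-\alpha}\,(1-\alpha)\log n\right)
   = O\!\left(n^{\alpha + 2(1-\alpha)} \log n\right) \\
\alpha = \tfrac{1}{2}:\quad
T &= O\!\left(n^{3/2} \log n\right)
\end{align*}
```

Since 1 < 2 - α < 2 for every 0 < α < 1, the cost always falls strictly between the partitional and agglomerative bounds.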

SLIDE 56

Evaluation: Experimental Design

12 document collections were analyzed with each of the hierarchical methods

Partitional

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based

Agglomerative

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based, Single Link (slink), Complete Link (clink), Group Average (UPGMA)

Constrained Agglomerative

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based
○ Number of initial clusters: 10, 20, n/40, n/20

SLIDE 57

Document Collections (12)

SLIDE 58

Vector Space Model

Model design:

TF-IDF term weighting
Normalized by document length
Cosine similarity
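A minimal sketch of this model, assuming the common tf × log(n/df) weighting and normalization of each vector to unit length; the slide does not show the exact weighting formula, so both are assumptions. Documents are given as lists of tokens and vectors are sparse dictionaries.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        length = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize to unit length so cosine similarity is a plain dot product.
        vectors.append({t: w / length for t, w in vec.items()} if length else vec)
    return vectors

def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

With unit-length vectors, documents that share no terms have similarity 0 and a document has similarity 1 with itself.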

SLIDE 61

FScore Metric

FScore for a class Lr and a cluster Si: how well does the cluster align with the class?
Define F for class Lr as the maximum over all clusters Si in the clustering tree.
FScore for entire clustering: F(Lr) summed across classes, weighted by class size.
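The formulas are missing from this transcript. The sketch below assumes the usual definition: for class Lr and cluster Si, recall is n_ri / n_r, precision is n_ri / n_i, and F is their harmonic mean, where n_ri counts the documents of class Lr inside Si. The function name and input layout are illustrative.

```python
from collections import Counter

def fscore(labels, tree_clusters):
    # labels: class label for each document index.
    # tree_clusters: every cluster (set of document indices) in the tree.
    n = len(labels)
    total = 0.0
    for cls, n_r in Counter(labels).items():
        best = 0.0
        for cluster in tree_clusters:
            n_ri = sum(1 for i in cluster if labels[i] == cls)
            if n_ri == 0:
                continue
            recall = n_ri / n_r
            precision = n_ri / len(cluster)
            # F for this class is the maximum over all clusters in the tree.
            best = max(best, 2 * recall * precision / (recall + precision))
        # Weight each class by its share of the collection.
        total += (n_r / n) * best
    return total
```

A tree that contains one pure cluster per class reaches the maximum score of 1; a tree consisting only of the root cluster scores lower.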

SLIDE 62

Results: Agglomerative vs. Partitional

Hierarchical Method

SLIDE 63

Results: Constrained Agglomerative

F-score: Constrained vs. Partitional vs. Agglomerative

[Table: F-score by criterion function for Best Partitional, Best Agglomerative, and Constrained Agglomerative (by number of initial partitions)]

SLIDE 64

Conclusion

Zhao and Karypis did a thorough comparison of hierarchical clustering methods on large document collections.
Partitional algorithms consistently outperformed agglomerative methods.
Constrained agglomerative methods outperformed the partitional methods in many cases.

SLIDE 65

Thank You

SLIDE 66

Clustering Method Comparisons

Partitional

○ Complexity: O(n log n)
○ Suited for large datasets due to low computational requirements
○ Common belief (2001) that k-means methods are inferior to agglomerative

Agglomerative

○ Complexity: O(n^2 log n) with caching in a binary heap; O(n^3) if the similarity function is not cacheable
○ Easy to group documents in small, cohesive clusters
○ Initial merging may contain errors, which can multiply during agglomeration
○ Limited studies show agglomerative methods outperform k-means with small datasets

Constrained Agglomerative

○ Complexity: O(k (n/k)^2 log(n/k) + k^2 log k) = O(n^(3/2) log n) when number of partitional clusters ≈ √n
○ Partitional cluster constraint prevents initial merging errors (merging across cluster boundaries)