SLIDE 1

Data Mining: Clustering

Hamid Beigy

Sharif University of Technology

Fall 1396

SLIDE 2

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 3

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 4

Introduction

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.

Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures.

Clustering as a data mining tool has its roots in many application areas such as biology, security, business intelligence, and Web search.

SLIDE 5

Requirements for cluster analysis

Clustering is a challenging research field and the following are its typical requirements.

  - Scalability
  - Ability to deal with different types of attributes
  - Discovery of clusters with arbitrary shape
  - Requirements for domain knowledge to determine input parameters
  - Ability to deal with noisy data
  - Incremental clustering and insensitivity to input order
  - Capability of clustering high-dimensionality data
  - Constraint-based clustering
  - Interpretability and usability

SLIDE 6

Comparing clustering methods

The clustering methods can be compared using the following aspects:

  - The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters.
  - Separation of clusters: In some methods, data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive; that is, a data object may belong to more than one cluster.
  - Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.
  - Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensional data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the full space are often meaningless. It is often better to instead search for clusters within different subspaces of the same data set.

SLIDE 7

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 8

Data matrix and dissimilarity matrix

Suppose that we have n objects described by p attributes. The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p), and so on, where xij is the value for object xi of the jth attribute. For brevity, we hereafter refer to object xi as object i.

The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.

Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:

Data matrix: This structure stores the n objects in the form of a table or n × p matrix:
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an n × n matrix or table:
\[
\begin{bmatrix}
0      & d(1,2) & d(1,3) & \cdots & d(1,n) \\
d(2,1) & 0      & d(2,3) & \cdots & d(2,n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d(n,1) & d(n,2) & d(n,3) & \cdots & 0
\end{bmatrix}
\]
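
As a small illustration (not part of the original slides), the sketch below builds both structures for a hypothetical numeric data set, using Euclidean distance for d(i, j); the data values are made up for the example.

```python
import numpy as np

# Data matrix: n = 4 objects described by p = 2 numeric attributes.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [8.0, 9.0],
              [9.0, 9.0]])

n = X.shape[0]
# Dissimilarity matrix: n x n table of pairwise distances d(i, j), here Euclidean.
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = np.linalg.norm(X[i] - X[j])

print(np.round(D, 2))   # symmetric, with zeros on the diagonal
```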

SLIDE 9

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 10

Proximity Measures

Proximity measures for nominal attributes: Let the number of states of a nominal attribute be M. The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
\[
d(i, j) = \frac{p - m}{p},
\]
where m is the number of matches and p is the total number of attributes describing the objects.

Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric. The contingency table for two objects i and j is

                       Object j
                         1        0       sum
    Object i    1        q        r       q + r
                0        s        t       s + t
              sum      q + s    r + t       p

For symmetric binary attributes, dissimilarity is calculated as
\[
d(i, j) = \frac{r + s}{q + r + s + t}.
\]
For asymmetric binary attributes, when the number of negative matches, t, is unimportant and the number of positive matches, q, is important, dissimilarity is calculated as
\[
d(i, j) = \frac{r + s}{q + r + s}.
\]
The coefficient 1 − d(i, j) is called the Jaccard coefficient.
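
The following sketch (an addition, with made-up attribute vectors) computes the nominal mismatch ratio and the asymmetric binary dissimilarity whose complement is the Jaccard coefficient.

```python
import numpy as np

def nominal_dissimilarity(x, y):
    """d(i, j) = (p - m) / p, where m is the number of matching nominal attributes."""
    x, y = np.asarray(x), np.asarray(y)
    p = len(x)
    m = np.sum(x == y)
    return (p - m) / p

def asymmetric_binary_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s); 1 - d(i, j) is the Jaccard coefficient."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    q = np.sum(x & y)        # 1-1 matches
    r = np.sum(x & ~y)       # 1 in x, 0 in y
    s = np.sum(~x & y)       # 0 in x, 1 in y
    return (r + s) / (q + r + s)

print(nominal_dissimilarity(["red", "A", "low"], ["red", "B", "low"]))   # 1/3
print(asymmetric_binary_dissimilarity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])) # 2/4 = 0.5
```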

SLIDE 11

Proximity Measures (cont.)

Dissimilarity of numeric attributes:

The most popular distance measure is the Euclidean distance
\[
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}.
\]
Another well-known measure is the Manhattan distance
\[
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|.
\]
The Minkowski distance is a generalization of the Euclidean and Manhattan distances:
\[
d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}.
\]

Dissimilarity of ordinal attributes: We first replace each xif by its corresponding rank rif ∈ {1, . . . , Mf} and then normalize it using
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}.
\]
Dissimilarity can then be computed by applying the distance measures for numeric attributes to the zif.
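
A brief sketch of these measures; the data values and the choices of h are assumptions made for illustration.

```python
import numpy as np

def minkowski(x, y, h):
    """Minkowski distance; h = 2 gives Euclidean, h = 1 gives Manhattan."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

def ordinal_to_numeric(rank, M):
    """Map a rank r in {1, ..., M} to z = (r - 1) / (M - 1) in [0, 1]."""
    return (rank - 1) / (M - 1)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 2))   # Euclidean: 5.0
print(minkowski(x, y, 1))   # Manhattan: 7.0

# Ordinal attribute with states {fair, good, excellent}: M = 3, ranks 1..3.
print(ordinal_to_numeric(2, 3))  # 0.5
```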

SLIDE 12

Proximity Measures (cont.)

Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis:
\[
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}},
\]
where the indicator δij(f) = 0 if either xif or xjf is missing, or if xif = xjf = 0 and attribute f is asymmetric binary; otherwise δij(f) = 1. The distance dij(f) is computed based on the type of attribute f.
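
A possible implementation sketch of this mixed-type formula; the helper name, the attribute types, and the example objects are assumptions, and numeric contributions are scaled by the attribute range so that each per-attribute distance lies in [0, 1] (one common convention).

```python
def mixed_dissimilarity(x, y, types, ranges):
    """d(i, j) = sum_f delta_f * d_f / sum_f delta_f over attributes of mixed types.

    types[f] is one of "numeric", "nominal", "asym_binary"; ranges[f] is the
    value range (max - min) of a numeric attribute, used to scale d_f into [0, 1].
    """
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:                    # missing value: delta = 0
            continue
        if t == "asym_binary" and xf == 0 and yf == 0:  # negative match: delta = 0
            continue
        if t == "numeric":
            d_f = abs(xf - yf) / ranges[f]
        else:                                           # nominal or binary attribute
            d_f = 0.0 if xf == yf else 1.0
        num += d_f
        den += 1.0
    return num / den if den > 0 else 0.0

# Hypothetical objects: (weight in kg, colour, smoker flag), weight range 50 kg.
print(mixed_dissimilarity((70.0, "red", 1), (80.0, "blue", 0),
                          types=("numeric", "nominal", "asym_binary"),
                          ranges=(50.0, None, None)))
```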

SLIDE 13

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 14

Clustering methods

There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.

Partitioning methods:
  - Find mutually exclusive clusters of spherical shape
  - Distance-based
  - May use mean or medoid (etc.) to represent cluster center
  - Effective for small- to medium-size data sets

Hierarchical methods:
  - Clustering is a hierarchical decomposition (i.e., multiple levels)
  - Cannot correct erroneous merges or splits
  - May incorporate other techniques like microclustering or consider object "linkages"

Density-based methods:
  - Can find arbitrarily shaped clusters
  - Clusters are dense regions of objects in space that are separated by low-density regions
  - Cluster density: each point must have a minimum number of points within its "neighborhood"
  - May filter out outliers

Grid-based methods:
  - Use a multiresolution grid data structure
  - Fast processing time (typically independent of the number of data objects, yet dependent on grid size)

SLIDE 15

Partitioning methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.

The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes.


SLIDE 16

k-Means clustering algorithm

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, . . . , Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intracluster similarity and low intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. The difference between an object p ∈ Ci and µi, the representative of the cluster, is measured by ||p − µi||.

The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid µi, defined as
\[
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert^2.
\]
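
A minimal k-means (Lloyd's algorithm) sketch that minimizes the within-cluster variation E above; the random initialization scheme and the use of the one-dimensional data set from the next slide are choices made for this example, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) minimizing the within-cluster SSE E."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # Assignment step: each object goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = sum(np.sum((X[labels == i] - centers[i]) ** 2) for i in range(k))
    return labels, centers, sse

# The 1-D example from the next slide: {2, 3, 4, 10, 11, 12, 20, 25, 30}, k = 2.
X = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
labels, centers, sse = kmeans(X, k=2)
print(labels, centers.ravel(), sse)
```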

SLIDE 17

k-Means clustering algorithm (cont.)

One-dimensional data set {2, 3, 4, 10, 11, 12, 20, 25, 30} clustered with k = 2:
  (a) Initial data set
  (b) t = 1: µ1 = 2 (cluster {2, 3}), µ2 = 4 (cluster {4, 10, 11, 12, 20, 25, 30})
  (c) t = 2: µ1 = 2.5 (cluster {2, 3, 4}), µ2 = 16 (cluster {10, 11, 12, 20, 25, 30})
  (d) t = 3: µ1 = 3 (cluster {2, 3, 4, 10}), µ2 = 18 (cluster {11, 12, 20, 25, 30})
  (e) t = 4: µ1 = 4.75 (cluster {2, 3, 4, 10, 11, 12}), µ2 = 19.60 (cluster {20, 25, 30})
  (f) t = 5: µ1 = 7, µ2 = 25 (converged)

SLIDE 18

k-Means clustering algorithm (cont.)

The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results may depend on the initial random selection of cluster centers. To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers.

The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. Therefore, the method is relatively scalable and efficient in processing large data sets.

There are several variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means.

The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes. Partitioning Around Medoids (PAM) is a realization of the k-medoids method (to reduce sensitivity to outliers).

SLIDE 19

Hierarchical methods

A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters.

Figure: Agglomerative (AGNES) versus divisive (DIANA) clustering on objects {a, b, c, d, e}: AGNES successively merges a and b into ab, d and e into de, c and de into cde, and ab and cde into abcde over steps 0-4, while DIANA performs the splits in the reverse order.

Hierarchical clustering methods:
  - Agglomerative hierarchical clustering
  - Divisive hierarchical clustering

SLIDE 20

Distance measures in hierarchical methods

Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.

Four widely used measures for the distance between clusters are given below, where |p − q| is the distance between two objects or points p and q, µi is the mean of cluster Ci, and ni is the number of objects in Ci. They are also known as linkage measures.

Minimum distance:
\[
d_{\min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} |p - q|
\]
Maximum distance:
\[
d_{\max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} |p - q|
\]
Mean distance:
\[
d_{\mathrm{mean}}(C_i, C_j) = |\mu_i - \mu_j|
\]
Average distance:
\[
d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i, q \in C_j} |p - q|
\]
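
A short sketch computing all four linkage measures for two small, made-up clusters.

```python
import numpy as np

def linkage_distances(Ci, Cj):
    """Minimum, maximum, mean, and average inter-cluster distances (linkage measures)."""
    Ci, Cj = np.asarray(Ci, dtype=float), np.asarray(Cj, dtype=float)
    # All pairwise distances |p - q| for p in Ci, q in Cj.
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    d_min = pair.min()                                           # single linkage
    d_max = pair.max()                                           # complete linkage
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # distance of the means
    d_avg = pair.mean()                                          # average linkage
    return d_min, d_max, d_mean, d_avg

# Two hypothetical 2-D clusters.
Ci = [[0, 0], [1, 0], [0, 1]]
Cj = [[4, 4], [5, 5]]
print(linkage_distances(Ci, Cj))
```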

SLIDE 21

Hierarchical methods

Figure: Dendrogram for the hierarchical clustering of objects a, b, c, d, e, with levels l = 0, . . . , 4 shown against a similarity scale running from 1.0 down to 0.0.

SLIDE 22

Model-based clustering

k-means is closely related to a probabilistic model known as the Gaussian mixture model,
\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\]
where πk, µk, and Σk are parameters. The πk are called mixing proportions and each Gaussian is called a mixture component.

The model is simply a weighted sum of Gaussians, but it is much more powerful than a single Gaussian because it can model multi-modal distributions.

Note that for p(x) to be a probability distribution, we require that Σk πk = 1 and that πk > 0 for all k. Thus, we may interpret the πk as probabilities themselves.

The set of parameters is θ = {{πk}, {µk}, {Σk}}.
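
A small sketch (not from the slides) that evaluates the mixture density p(x) for assumed values of πk, µk, and Σk, using a hand-written Gaussian density.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def mixture_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Hypothetical 2-D mixture with K = 2 components; mixing proportions sum to 1.
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```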

SLIDE 23

Model-based clustering (cont.)

Let us use a K-dimensional binary random variable z in which a particular element zk equals 1 and all other elements are 0. The values of zk therefore satisfy zk ∈ {0, 1} and Σk zk = 1.

We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z).

The marginal distribution over z is specified in terms of the πk, such that p(zk = 1) = πk. We can write this distribution in the form
\[
p(z) = \prod_{k=1}^{K} \pi_k^{z_k}.
\]

The conditional distribution of x given a particular value for z is a Gaussian,
\[
p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k),
\]
which can also be written in the form
\[
p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.
\]

SLIDE 24

Model-based clustering (cont.)

The marginal distribution of x equals
\[
p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k).
\]
We can write p(zk = 1|x) as
\[
\gamma(z_k) = p(z_k = 1 \mid x)
= \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)}
= \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}.
\]
We shall view πk as the prior probability of zk = 1, and the quantity γ(zk) as the corresponding posterior probability once we have observed x.

SLIDE 25

Gaussian mixture model (example)

Figure: Example of a Gaussian mixture with three components and mixing proportions 0.5, 0.3, and 0.2.

SLIDE 26

Model-based clustering (cont.)

Let X = {x1, . . . , xN} be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals
\[
\ln p(X \mid \mu, \pi, \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right].
\]
Setting the derivative of ln p(X|µ, π, Σ) with respect to µk equal to zero, we obtain
\[
0 = - \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \, \Sigma_k (x_n - \mu_k).
\]
Multiplying by Σk^{-1} and then simplifying, we obtain
\[
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}).
\]

SLIDE 27

Model-based clustering (cont.)

Setting the derivative of ln p(X|µ, π, Σ) with respect to Σk equal to zero, we obtain
\[
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T.
\]
We maximize ln p(X|µ, π, Σ) with respect to πk under the constraint Σk πk = 1. This can be achieved using a Lagrange multiplier and maximizing the quantity
\[
\ln p(X \mid \mu, \pi, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right),
\]
which gives
\[
0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda.
\]
If we now multiply both sides by πk and sum over k, making use of the constraint Σk πk = 1, we find λ = −N. Using this to eliminate λ and rearranging, we obtain
\[
\pi_k = \frac{N_k}{N}.
\]

SLIDE 28

EM for Gaussian mixture models

1. Initialize µk, Σk, and πk, and evaluate the initial value of the log-likelihood.

2. E step: Evaluate γ(znk) using the current parameter values:
\[
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.
\]

3. M step: Re-estimate the parameters using the current values of γ(znk):
\[
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T, \qquad
\pi_k = \frac{N_k}{N},
\]
where Nk = Σn γ(znk).

4. Evaluate the log-likelihood
\[
\ln p(X \mid \mu, \pi, \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
\]
and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2.
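
A compact numpy sketch of these E and M updates; the initialization strategy, the small regularization term added to Σk, and the synthetic data are assumptions made so the example runs end to end.

```python
import numpy as np

def gauss(X, mu, Sigma):
    """Row-wise density N(x_n | mu, Sigma) for all rows of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gmm(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, K, replace=False)]             # initialize means from the data
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step: responsibilities gamma(z_nk).
        dens = np.stack([pis[k] * gauss(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate mu_k, Sigma_k, pi_k.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N
    loglik = np.sum(np.log(dens.sum(axis=1)))  # log-likelihood at the last E step
    return pis, mus, Sigmas, loglik

# Synthetic 2-D data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(em_gmm(X, K=2)[0])   # estimated mixing proportions
```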

SLIDE 29

Model-based clustering (example)

Figure: EM applied to a two-dimensional data set; the panels show the initial configuration and the fitted mixture after L = 1, 2, 5, and 20 complete EM cycles.

SLIDE 30

Density based clustering

The general idea of these methods is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold.

How can we find dense regions in density-based clustering? The density of an object x can be measured by the number of objects close to x.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters.

SLIDE 31

Density based clustering (cont.)

How does DBSCAN quantify the neighborhood of an object? A user-specified parameter ϵ > 0 specifies the radius of the neighborhood we consider for every object.

Definition (ϵ-neighborhood): The ϵ-neighborhood of an object x is the space within a radius ϵ centered at x.

Due to the fixed neighborhood size parameterized by ϵ, the density of a neighborhood can be measured simply by the number of objects in the neighborhood.

Definition (Core object): An object is a core object if the ϵ-neighborhood of the object contains at least MinPts objects.


SLIDE 32

Density based clustering (cont.)

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ϵ and MinPts. The clustering task is reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.

Definition (Directly density-reachable): For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ϵ and MinPts) if p is within the ϵ-neighborhood of q.

An object p is directly density-reachable from another object q if and only if q is a core object and p is in the ϵ-neighborhood of q.

Figure: sample objects m, p, q, r, and s with MinPts = 3.

SLIDE 33

Density based clustering (cont.)

How can we assemble a large dense region using small dense regions centered by core objects?

Definition (Density-reachable): An object p is density-reachable from q (with respect to ϵ and MinPts in D) if there is a chain of objects p1, . . . , pn, such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi with respect to ϵ and MinPts, for 1 ≤ i < n, where each pi ∈ D.

SLIDE 34

Density based clustering (cont.)

To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness.

Definition (Density-connected): Two objects p1, p2 ∈ D are density-connected with respect to ϵ and MinPts if there is an object q ∈ D such that both p1 and p2 are density-reachable from q with respect to ϵ and MinPts.

SLIDE 35

Density based clustering (cont.)

How does DBSCAN find clusters?

1. Initially, all objects in data set D are marked as unvisited.
2. DBSCAN randomly selects an unvisited object p, marks p as visited, and checks whether p is a core object or not.
3. If p is not a core object, then p is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ϵ-neighborhood of p are added to a candidate set N.
4. DBSCAN iteratively adds to C those objects in N that do not belong to any cluster.
5. In this process, for an object p′ ∈ N that carries the label unvisited, DBSCAN marks it as visited and checks its ϵ-neighborhood.
6. If p′ is a core object, then the objects in its ϵ-neighborhood are added to N.
7. DBSCAN continues adding objects to C until C can no longer be expanded, that is, N is empty. At this time, cluster C is completed, and thus is output.
8. To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones.
9. The clustering process continues until all objects are visited.
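
A minimal sketch of this procedure; the toy data set and the values of ϵ and MinPts are assumptions. It follows the steps above, labelling noise points with -1.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN; returns labels with -1 = noise, 0..C-1 = cluster id."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)                 # -1 means noise (or not yet assigned)
    visited = np.zeros(n, dtype=bool)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:     # not a core object: tentatively noise
            continue
        # p is a core object: start a new cluster and grow it from the candidate set.
        labels[p] = cluster_id
        candidates = list(neighbors[p])
        while candidates:
            q = candidates.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:   # q is also a core object
                    candidates.extend(neighbors[q])
            if labels[q] == -1:             # add q to the cluster if it has none yet
                labels[q] = cluster_id
        cluster_id += 1
    return labels

# Two dense blobs plus one isolated point (expected to be labelled noise).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [8, 8], [8, 9], [9, 8], [9, 9], [20, 20]]
print(dbscan(X, eps=1.5, min_pts=3))
```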

SLIDE 36

Density based clustering (example)

Figure: A two-dimensional data set with attributes X1 and X2 used as a DBSCAN example.

SLIDE 37

Grid-based clustering

The grid-based clustering approach uses a multiresolution grid data structure.

Figure: A hierarchical grid structure, from the first (top) layer down to the (i-1)st and ith layers.

SLIDE 38

Table of contents

1. Introduction
2. Data matrix and dissimilarity matrix
3. Proximity Measures
4. Clustering methods
   - Partitioning methods
   - Hierarchical methods
   - Model-based clustering
   - Density-based clustering
   - Grid-based clustering
5. Cluster validation and assessment

SLIDE 39

Cluster validation and assessment

Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The major tasks of clustering evaluation include the following:

1. Assessing clustering tendency: In this task, for a given data set, we assess whether a nonrandom structure exists in the data. Clustering analysis on a data set is meaningful only when there is a nonrandom structure in the data.

2. Determining the number of clusters in a data set: Algorithms such as k-means require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before a clustering algorithm is used to derive detailed clusters. A simple method is to set the number of clusters to about √(n/2) for a data set of n points.

3. Measuring clustering quality: After applying a clustering method on a data set, we want to assess how good the resulting clusters are. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

SLIDE 40

Assessing clustering tendency

SLIDE 41

Cluster validation and assessment

How good is the clustering generated by a method? How can we compare the clusterings generated by different methods?

Clustering is an unsupervised learning technique and it is hard to evaluate the quality of the output of any given method. If we use probabilistic models, we can always evaluate the likelihood of a test set, but this has two drawbacks:
1. It does not directly assess any clustering that is discovered by the model.
2. It does not apply to non-probabilistic methods.

We therefore discuss some performance measures not based on likelihood. The goal of clustering is to assign points that are similar to the same cluster, and to ensure that points that are dissimilar are in different clusters. There are several ways of measuring these quantities:
1. Internal criterion: Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest.
2. External criterion: Suppose we have labels for each object. Then we can compare the clustering with the labels using various metrics. We will use some of these metrics later, when we compare clustering methods.

SLIDE 42

Purity

Purity is a simple and transparent evaluation measure. Consider a clustering of N = 17 objects into three clusters containing 6, 6, and 5 objects, whose majority classes account for 5, 4, and 3 of those objects, respectively.

Let Nij be the number of objects in cluster i that belong to class j, and let Ni = Σj Nij be the total number of objects in cluster i. We define the purity of cluster i as
\[
p_i \triangleq \max_j \left( \frac{N_{ij}}{N_i} \right),
\]
and the overall purity of a clustering as
\[
\text{purity} \triangleq \sum_i \frac{N_i}{N} \, p_i.
\]
For the clustering above, the purity is
\[
\frac{6}{17}\cdot\frac{5}{6} + \frac{6}{17}\cdot\frac{4}{6} + \frac{5}{17}\cdot\frac{3}{5} = \frac{5 + 4 + 3}{17} \approx 0.71.
\]
Bad clusterings have purity values close to 0, while a perfect clustering has a purity of 1. High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each point gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
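
A short sketch computing purity; the per-cluster class compositions below are assumptions chosen to be consistent with the slide's numbers (clusters of sizes 6, 6, and 5 with majority counts 5, 4, and 3).

```python
import numpy as np

def purity(clusters, classes):
    """Overall purity: sum_i (N_i / N) * max_j (N_ij / N_i)."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    N = len(clusters)
    total = 0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        # N_i * p_i = count of the majority class within cluster i.
        total += np.max(np.bincount(members))
    return total / N

# 17 objects in three clusters of sizes 6, 6, 5 (class labels are assumed).
clusters = [0]*6 + [1]*6 + [2]*5
classes  = [0, 0, 0, 0, 0, 1,   0, 1, 1, 1, 1, 2,   0, 0, 2, 2, 2]
print(round(purity(clusters, classes), 2))   # 0.71
```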

SLIDE 43

Rand index

Let U = {u1, . . . , uR} and V = {v1, . . . , vC} be two different clusterings of N data points. For example, U might be the estimated clustering and V a reference clustering derived from the class labels. Define a 2 × 2 contingency table containing the following numbers:
1. TP is the number of pairs that are in the same cluster in both U and V (true positives);
2. TN is the number of pairs that are in different clusters in both U and V (true negatives);
3. FN is the number of pairs that are in different clusters in U but the same cluster in V (false negatives);
4. FP is the number of pairs that are in the same cluster in U but different clusters in V (false positives).

The Rand index is defined as
\[
RI \triangleq \frac{TP + TN}{TP + FP + FN + TN}.
\]
The Rand index can be interpreted as the fraction of clustering decisions that are correct. Clearly RI ∈ [0, 1].

SLIDE 44

Rand index (example)

Consider the clustering used in the purity example above, where the three clusters contain 6, 6, and 5 points. We have
\[
TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40.
\]
The number of true positives is
\[
TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20,
\]
so FP = 40 − 20 = 20. Similarly, FN = 24 and TN = 72. Hence the Rand index is
\[
RI = \frac{20 + 72}{20 + 20 + 24 + 72} = 0.68.
\]
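
A brute-force sketch that counts TP, FP, FN, and TN over all object pairs; the reference labels are the same assumed composition used in the purity example and reproduce the counts above.

```python
from itertools import combinations

def rand_index(U, V):
    """Rand index: fraction of object pairs on which clusterings U and V agree."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(U)), 2):
        same_u = U[i] == U[j]
        same_v = V[i] == V[j]
        if same_u and same_v:
            tp += 1
        elif same_u and not same_v:
            fp += 1
        elif not same_u and same_v:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn, (tp + tn) / (tp + fp + fn + tn)

# Clustering U (sizes 6, 6, 5) and assumed reference labels V, as in the purity example.
U = [0]*6 + [1]*6 + [2]*5
V = [0, 0, 0, 0, 0, 1,   0, 1, 1, 1, 1, 2,   0, 0, 2, 2, 2]
print(rand_index(U, V))   # TP=20, FP=20, FN=24, TN=72, RI ≈ 0.68
```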

SLIDE 45

Rand index (example)

For the clustering of the previous slide we found RI = 0.68. The Rand index only achieves its lower bound of 0 if TP = TN = 0, which is a rare event. We can instead define an adjusted Rand index
\[
ARI \triangleq \frac{\text{index} - \mathbb{E}[\text{index}]}{\max \text{index} - \mathbb{E}[\text{index}]}.
\]

SLIDE 46

Mutual information

We can also measure cluster quality by computing the mutual information between U and V.

Let PUV(i, j) = |ui ∩ vj| / N be the probability that a randomly chosen object belongs to cluster ui in U and to cluster vj in V. Let PU(i) = |ui| / N be the probability that a randomly chosen object belongs to cluster ui in U, and let PV(j) = |vj| / N be the probability that a randomly chosen object belongs to cluster vj in V. Then the mutual information is defined as
\[
I(U, V) \triangleq \sum_{i=1}^{R} \sum_{j=1}^{C} P_{UV}(i, j) \log \frac{P_{UV}(i, j)}{P_U(i) P_V(j)}.
\]
This lies between 0 and min{H(U), H(V)}. The maximum value can be achieved by using lots of small clusters, which have low entropy. To compensate for this, we can use the normalized mutual information (NMI)
\[
NMI(U, V) \triangleq \frac{I(U, V)}{\frac{1}{2}\left[ H(U) + H(V) \right]}.
\]
This lies between 0 and 1. Please read Section 25.1 of Murphy.
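
A small sketch computing I(U, V), the entropies, and NMI directly from two label vectors; the example labels are the same assumed ones used above.

```python
import numpy as np

def mutual_info(U, V):
    """I(U, V) and the entropies H(U), H(V) from two label vectors."""
    U, V = np.asarray(U), np.asarray(V)
    N = len(U)
    I = 0.0
    for u in np.unique(U):
        for v in np.unique(V):
            p_uv = np.sum((U == u) & (V == v)) / N
            p_u = np.sum(U == u) / N
            p_v = np.sum(V == v) / N
            if p_uv > 0:
                I += p_uv * np.log(p_uv / (p_u * p_v))
    H = lambda L: -sum(p * np.log(p) for p in np.bincount(L) / len(L) if p > 0)
    return I, H(U), H(V)

def nmi(U, V):
    """Normalized mutual information: I(U, V) / (0.5 * (H(U) + H(V)))."""
    I, HU, HV = mutual_info(U, V)
    return I / (0.5 * (HU + HV))

# Same example as before: estimated clustering U and assumed reference labels V.
U = [0]*6 + [1]*6 + [2]*5
V = [0, 0, 0, 0, 0, 1,   0, 1, 1, 1, 1, 2,   0, 0, 2, 2, 2]
print(round(nmi(U, V), 3))
```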
