
SLIDE 1

COMPSTAT 2010 1

Clustering with Mixed Type Variables and Determination of Cluster Numbers

Hana Řezanková, Tomáš Löster (University of Economics, Prague)

Dušan Húsek (ICS, Academy of Sciences of the Czech Republic)

SLIDE 2

Outline

- Motivation
- Methods for clustering with mixed type variables
- Implementation in software packages
- Proposal of new criteria for cluster evaluation
- Application
- Conclusion

SLIDE 3

Motivation

- Task: we are looking for groups of similar objects (e.g. respondents), i.e. we concentrate on the problem of object clustering.
- The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions).
- The number of clusters is unknown in advance, i.e. we also have to determine an appropriate number of clusters.

SLIDE 4

Methods for clustering with mixed type variables

- Using a specialized dissimilarity measure (Gower's coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA)
- Clustering objects separately with quantitative and qualitative variables and combining the results by the cluster-based similarity partitioning algorithm (CSPA)
- Latent class models
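As an illustration of the first approach, here is a minimal sketch of a Gower-type dissimilarity for mixed variables: range-normalised absolute differences for quantitative variables, simple matching for nominal ones, averaged with equal weights. The function name, the `kinds`/`ranges` arguments and the example respondents are hypothetical; the slides do not specify which variant of Gower's coefficient is meant.

```python
def gower_dissimilarity(a, b, kinds, ranges):
    """Gower-type dissimilarity between objects a and b (equal-length sequences).

    kinds[l] is "quant" or "nominal"; ranges[l] is max - min of variable l
    over the whole data set (used only for quantitative variables).
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "quant":
            # range-normalised absolute difference, lies in [0, 1]
            total += abs(x - y) / rng if rng else 0.0
        else:
            # simple matching: 0 if the categories agree, 1 otherwise
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical respondents: (number of actions, opinion, region)
kinds = ("quant", "nominal", "nominal")
ranges = (10.0, None, None)
d = gower_dissimilarity((3, "yes", "A"), (7, "yes", "B"), kinds, ranges)
```

The resulting matrix of pairwise dissimilarities could then be passed to any agglomerative hierarchical procedure that accepts a precomputed dissimilarity matrix.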

SLIDE 5

Implementation in software packages

- Specialized dissimilarity measures are not implemented for AHCA
- Clustering objects with qualitative variables is implemented only rarely (disagreement coef.)
- Cluster-based similarity partitioning algorithm is not implemented, but it could be realized
- LC Cluster models (Latent GOLD)
- Log-likelihood distance measure between clusters is implemented in two-step cluster analysis (SPSS)
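The cluster-based similarity partitioning algorithm is described above as not implemented but realizable. A minimal sketch of its first step is given below: the similarity of two objects is taken as the fraction of input partitions that put them in the same cluster. The function name and the two example partitions are hypothetical, and the final step (partitioning the objects on the basis of this similarity matrix) is left out.

```python
def co_association(partitions):
    """Similarity matrix: share of partitions in which objects i and j co-cluster.

    partitions is a list of label sequences, one per clustering, all of length n.
    """
    n = len(partitions[0])
    r = len(partitions)
    sim = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / r
    return sim


# Hypothetical results for four objects:
by_quantitative = [0, 0, 1, 1]   # clustering on the quantitative variables
by_qualitative = [0, 1, 1, 1]    # clustering on the qualitative variables
sim = co_association([by_quantitative, by_qualitative])
```

Objects that both clusterings keep together get similarity 1, objects separated by both get 0, and disagreements fall in between.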

SLIDE 6

Implementation in software packages

- Log-likelihood distance measure between clusters, implemented in two-step cluster analysis (SPSS)

$$D_{h,h'} = \xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}$$

$$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \quad \text{(entropy)}$$
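A minimal sketch of this distance, under stated assumptions: the cluster variability is read as n_g times the sum of ½·ln(s_l² + s_gl²) over the quantitative variables plus the entropies H_gl over the qualitative ones, and the distance between two clusters as the increase in variability when they are merged. Here n_g is taken as the cluster size, s_l² as the variance of variable l in the whole data set, s_gl² as its variance within cluster g, and n_glu as the number of objects of cluster g in category u of variable l. The use of the population variance (divisor n), all function names and the data are assumptions made for illustration.

```python
import math
from collections import Counter


def variance(values):
    """Population variance (divisor n), so a single object has variance 0."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def entropy(values):
    """H = -sum_u p_u ln p_u over the observed categories."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def xi(cluster, quant_cols, qual_cols, total_var):
    """Cluster variability n_g * (sum 1/2 ln(s_l^2 + s_gl^2) + sum H_gl)."""
    quant = sum(
        0.5 * math.log(total_var[l] + variance([row[l] for row in cluster]))
        for l in quant_cols
    )
    qual = sum(entropy([row[l] for row in cluster]) for l in qual_cols)
    return len(cluster) * (quant + qual)


def loglik_distance(c1, c2, quant_cols, qual_cols, total_var):
    """Increase in variability caused by merging clusters c1 and c2."""
    merged = c1 + c2
    return (xi(merged, quant_cols, qual_cols, total_var)
            - xi(c1, quant_cols, qual_cols, total_var)
            - xi(c2, quant_cols, qual_cols, total_var))


# Hypothetical data: column 0 is quantitative, column 1 is nominal.
data = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
total_var = {0: variance([row[0] for row in data])}
d = loglik_distance(data[:2], data[2:], [0], [1], total_var)
```

Adding the overall variance s_l² inside the logarithm keeps the term finite even for a cluster whose within-cluster variance is zero.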

SLIDE 7

Implementation in software packages

- Log-likelihood distance measure between objects, implemented in two-step cluster analysis (SPSS)

$$D_{h,h'} = \xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}$$

$$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

$$D(\mathbf{x}_i, \mathbf{x}_j) = \ldots$$ (right-hand side not recoverable from the slide)

SLIDE 8

Evaluation criteria implemented in software packages

- BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion), implemented in two-step cluster analysis (SPSS)

$$I_{\mathrm{BIC}} = 2 \sum_{g=1}^{k} \xi_g + w_k \ln(n)$$

$$w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right)$$

$$I_{\mathrm{AIC}} = 2 \sum_{g=1}^{k} \xi_g + 2 w_k$$

Optimal number of clusters: minimum (only for initial estimation of the number of clusters)
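A minimal sketch of the two criteria, assuming the reading "twice the summed cluster variabilities plus a penalty", where the penalty counts w_k = k·(2·m⁽¹⁾ + Σ(K_l − 1)) parameters for m⁽¹⁾ quantitative variables and qualitative variables with K_l categories. The function names and all numbers in the example are hypothetical.

```python
import math


def parameter_count(k, m_quant, category_counts):
    """w_k = k * (2 * m_quant + sum over qualitative variables of (K_l - 1))."""
    return k * (2 * m_quant + sum(K - 1 for K in category_counts))


def bic(xi_values, m_quant, category_counts, n):
    """2 * sum of cluster variabilities + w_k * ln(n); k = len(xi_values)."""
    w = parameter_count(len(xi_values), m_quant, category_counts)
    return 2 * sum(xi_values) + w * math.log(n)


def aic(xi_values, m_quant, category_counts):
    """2 * sum of cluster variabilities + 2 * w_k."""
    w = parameter_count(len(xi_values), m_quant, category_counts)
    return 2 * sum(xi_values) + 2 * w


# Hypothetical solution with k = 2 clusters, one quantitative variable and
# two qualitative variables with 2 and 3 categories, n = 50 objects.
xi_values = [60.0, 40.0]
bic_value = bic(xi_values, 1, [2, 3], 50)
aic_value = aic(xi_values, 1, [2, 3])
```

Both criteria are computed for each candidate number of clusters and the smallest value is preferred; the penalty grows linearly in k, so it counteracts the decrease of the variability term.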

SLIDE 9

Proposed evaluation criteria

- Within-cluster variability for k clusters:

$$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

- Variability of the whole data set:

$$\xi(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(2 s_l^2\right) + \sum_{l=1}^{m^{(2)}} H_l \right)$$

SLIDE 10

Proposed evaluation criteria

- Within-cluster variability for k clusters:

$$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

- Difference:

$$\xi_{\mathrm{diff}}(k) = \xi(k-1) - \xi(k)$$

It should be maximal for the suitable number of clusters.
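The difference criterion can be checked on the entropy-based within-cluster variabilities reported in the application part of this deck (273.92, 241.17, 206.39, 186.51 for 1 to 4 clusters). The sketch below computes the drop in variability for each k and picks the k with the largest drop; the function and variable names are illustrative.

```python
def variability_differences(variability):
    """Map k -> variability[k - 1] - variability[k] for every k with a predecessor."""
    return {
        k: variability[k - 1] - variability[k]
        for k in sorted(variability)
        if k - 1 in variability
    }


# Entropy-based within-cluster variability from the application slides.
entropy_variability = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
diffs = variability_differences(entropy_variability)
best_k = max(diffs, key=diffs.get)  # k with the largest drop in variability
```

Rounded to two decimals, the differences agree with the "Variability difference" row of the application table, and the largest one occurs at three clusters.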

SLIDE 11

Evaluation criteria modified for qualitative variables

1. Uncertainty index (R-square (RSQ) index)

$$I_U(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{\xi(1) - \xi(k)}{\xi(1)}$$

2. Semipartial uncertainty index (optimal number of clusters - minimum)

$$I_{SPU}(k) = I_U(k) - I_U(k-1)$$

SLIDE 12

Evaluation criteria modified for qualitative variables

3. Pseudo (Calinski and Harabasz) F index: PSF (SAS), CHF (SYSTAT)

$$I_{CHFU}(k) = \frac{V_B / (k-1)}{V_W / (n-k)} = \frac{(n-k)\,\bigl(\xi(1) - \xi(k)\bigr)}{(k-1)\,\xi(k)}$$

4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)

$$I_{PTSU}(k) = \frac{\xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}}{(\xi_h + \xi_{h'}) / (n_h + n_{h'} - 2)}$$
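The uncertainty index, its semipartial version and the pseudo F index depend only on the within-cluster variabilities and the number of objects n, so they can be recomputed from the entropy-based values reported in the application part of this deck. In the sketch below the indices are read as (ξ(1) − ξ(k))/ξ(1), the increment of that ratio from k − 1 to k, and ((n − k)(ξ(1) − ξ(k)))/((k − 1)ξ(k)). The sample size n = 50 is an assumption: it is not stated on the slides, but it is the value that reproduces the reported pseudo F values.

```python
def uncertainty_index(v, k):
    """Share of the total variability v[1] explained by k clusters."""
    return (v[1] - v[k]) / v[1]


def semipartial_index(v, k):
    """Increase of the uncertainty index when going from k - 1 to k clusters."""
    return uncertainty_index(v, k) - uncertainty_index(v, k - 1)


def pseudo_f(v, k, n):
    """Between-cluster variability per (k - 1) over within-cluster per (n - k)."""
    return ((n - k) * (v[1] - v[k])) / ((k - 1) * v[k])


# Entropy-based within-cluster variability from the application slides.
v = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
n = 50  # assumed sample size, inferred from the reported values

table = {
    k: (
        round(uncertainty_index(v, k), 2),
        round(semipartial_index(v, k), 2),
        round(pseudo_f(v, k, n), 2),
    )
    for k in (2, 3, 4)
}
```

With these inputs the rounded triples coincide with the I_U, I_SPU and I_CHFU rows of the entropy-based application table.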

SLIDE 13


Evaluation criteria modified for qualitative variables

SYSTAT

SLIDE 14

Evaluation criteria modified for qualitative variables

5. Modified Davies and Bouldin (DB) index

$$I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h',\, h' \neq h} \frac{s_{D,h} + s_{D,h'}}{D_{h,h'}}$$

$$I_{DBU}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h',\, h' \neq h} \frac{\xi_h + \xi_{h'}}{\xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}}$$
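A minimal sketch of the general Davies and Bouldin scheme: for every cluster take the worst (largest) ratio of summed dispersions to between-cluster distance, then average over clusters. The dispersions and distances are passed in precomputed, so the same function could be fed with classical dispersions or with entropy-based cluster variabilities and log-likelihood distances. The function name and all numbers are hypothetical.

```python
def davies_bouldin(dispersion, distance):
    """Davies-Bouldin index.

    dispersion[h]               within-cluster dispersion of cluster h
    distance[frozenset((h, g))] dissimilarity between clusters h and g
    """
    clusters = list(dispersion)
    total = 0.0
    for h in clusters:
        # worst-case similarity of cluster h to any other cluster
        total += max(
            (dispersion[h] + dispersion[g]) / distance[frozenset((h, g))]
            for g in clusters
            if g != h
        )
    return total / len(clusters)


# Hypothetical dispersions and between-cluster distances for k = 3.
dispersion = {"A": 1.0, "B": 2.0, "C": 1.0}
distance = {
    frozenset(("A", "B")): 3.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 6.0,
}
db = davies_bouldin(dispersion, distance)
```

Smaller values indicate compact clusters that lie far apart.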

SLIDE 15

Evaluation criteria modified for qualitative variables

6. Dunn's index

$$I_D(k) = \min_{1 \le h \le k} \ \min_{1 \le h' \le k,\ h' \neq h} \left\{ \frac{D_{h,h'}}{\max_{1 \le g \le k} \mathrm{diam}_g} \right\}$$

$$D_{h,h'} = \min_{\mathbf{x}_i \in C_h,\ \mathbf{x}_j \in C_{h'}} D(\mathbf{x}_i, \mathbf{x}_j)$$

$$\mathrm{diam}_g = \max_{\mathbf{x}_i, \mathbf{x}_j \in C_g} D(\mathbf{x}_i, \mathbf{x}_j)$$
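A minimal sketch of Dunn's index for an arbitrary distance between objects: the smallest distance between objects of different clusters divided by the largest cluster diameter. The absolute difference on one-dimensional points stands in for the distance D between objects; with mixed-type data a suitable mixed-type distance would be plugged in instead. Names and data are hypothetical.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Smallest between-cluster distance over largest within-cluster diameter."""
    max_diameter = max(dist(a, b) for c in clusters for a in c for b in c)
    min_separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


# Hypothetical one-dimensional example with two clusters.
clusters = [[0.0, 1.0], [5.0, 7.0]]
value = dunn_index(clusters, lambda a, b: abs(a - b))
```

Larger values are better. The sketch assumes that at least one cluster contains two distinct objects; otherwise the maximal diameter is zero and the ratio is undefined.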



SLIDE 16

Modified evaluation criteria

Cluster variability based on the variance and Gini's coefficient of mutability:

$$G(k) = \sum_{g=1}^{k} G_g$$

$$G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} G_{gl} \right)$$

Gini's coefficient of mutability:

$$G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2$$

$$I_{\mathrm{BGC}} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n)$$
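A minimal sketch comparing the two measures of qualitative variability used in this deck: Gini's coefficient of mutability, read as one minus the sum of squared relative frequencies, and the entropy. Function names and the example answers are hypothetical.

```python
import math
from collections import Counter


def gini_mutability(values):
    """1 - sum_u p_u^2 over the observed categories."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def entropy(values):
    """-sum_u p_u ln p_u over the observed categories."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


# Hypothetical answers of four respondents to one nominal question.
answers = ["yes", "yes", "no", "no"]
g = gini_mutability(answers)
h = entropy(answers)
```

Both measures are zero for a cluster in which all objects share one category and grow as the categories become more evenly mixed; the Gini coefficient stays below 1, while the entropy can reach ln(K_l).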

SLIDE 17

Evaluation criteria modified for qualitative variables

1. Tau index (RSQ index)

$$I_\tau(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)}$$

2. Semipartial tau index (optimal number of clusters - minimum)

$$I_{SP\tau}(k) = I_\tau(k) - I_\tau(k-1)$$



SLIDE 18

Application to a real data file

- Data from a questionnaire survey (for the participants of the chemistry seminar)
- 7 qualitative and 1 quantitative (count) variables
- Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)
- LC Cluster model (experiments for the numbers of clusters from 2 to 6); the quantitative variable was recoded to 5 categories

SLIDE 19

Application to a real data file

Criteria based on the entropy (TSCA in SPSS)

Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    273.92    241.17    206.39    186.51
Variability difference                  32.75     34.78     19.88
I_U                                     0.12      0.25      0.32
I_SPU                                   0.12      0.13      0.07
I_CHFU                                  6.52      7.69      7.19
I_BIC                         590.85    568.41    541.88    545.15

SLIDE 20

Application to a real data file

Criteria based on the Gini's coefficient (TSCA in SPSS)

Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    185.41    162.57    137.83    127.86
Variability difference                  22.84     24.74     9.97
I_τ                                     0.12      0.26      0.31
I_SPτ                                   0.12      0.13      0.05
I_CHFτ                                  6.74      8.11      6.90
I_BGC                         413.85    411.20    404.75    427.84
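The reported values allow a quick consistency check of the BIC-type criteria. If each criterion equals twice the within-cluster variability plus a penalty proportional to the number of clusters, then (criterion − 2 · variability) / k should be the same constant in every column of both tables. The sketch below computes that quantity from the entropy-based and Gini-based values reported in this deck; the reading of the criteria and the helper name are assumptions.

```python
# k -> (within-cluster variability, information criterion), from the two tables
entropy_table = {
    1: (273.92, 590.85), 2: (241.17, 568.41),
    3: (206.39, 541.88), 4: (186.51, 545.15),
}
gini_table = {
    1: (185.41, 413.85), 2: (162.57, 411.20),
    3: (137.83, 404.75), 4: (127.86, 427.84),
}


def penalty_per_cluster(table):
    """(criterion - 2 * variability) / k for every number of clusters k."""
    return {k: (crit - 2 * var) / k for k, (var, crit) in table.items()}


entropy_penalty = penalty_per_cluster(entropy_table)
gini_penalty = penalty_per_cluster(gini_table)
```

All eight values come out close to 43.0, which supports the reading above. The value is also what 11 · ln(50) would give, although neither the sample size nor the category counts are stated on the slides.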

SLIDE 21

Application to a real data file

Comparison of BIC

Method              Number of clusters
                    1          2          3          4
Two-step CA         590.85     568.41     541.88     545.15
LC Cluster Model    1397.01    1059.24    1019.18    1036.90
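Since BIC is a "smaller is better" criterion, the comparison reduces to locating the minimum in each row. A short sketch over the values from this table (the dictionary layout is illustrative):

```python
# BIC values from the comparison table, indexed by number of clusters.
bic = {
    "Two-step CA": {1: 590.85, 2: 568.41, 3: 541.88, 4: 545.15},
    "LC Cluster Model": {1: 1397.01, 2: 1059.24, 3: 1019.18, 4: 1036.90},
}

# Number of clusters with the smallest BIC for each method.
best = {method: min(values, key=values.get) for method, values in bic.items()}
```

Both methods reach their minimum at three clusters, even though the absolute BIC levels differ considerably between them.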

SLIDE 22

Conclusion

- If the distance between objects, the distance between clusters, the within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.
- One possibility is to apply the log-likelihood distance measure based on the entropy.
- Another possibility is to use the analogous measure based on Gini's coefficient.

SLIDE 23
