COMPSTAT 2010 1
Clustering with Mixed Type Variables and Determination of Cluster Numbers
Hana Řezanková, Dušan Húsek, Tomáš Löster
University of Economics, Prague
ICS, Academy of Sciences of the Czech Republic
Outline
Motivation
Methods for clustering with mixed type variables
Implementation in software packages
Proposal of new criteria for cluster evaluation
Application
Conclusion
Motivation
Task: We are looking for groups of similar objects (e.g. respondents), i.e. we will concentrate on the problem of object clustering
The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions)
The number of clusters is unknown in advance, i.e. an appropriate number of clusters must also be determined
Methods for clustering with mixed type variables
Using a specialized dissimilarity measure (Gower's coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA)
Clustering objects separately with quantitative and qualitative variables and combining the results by the cluster-based similarity partitioning algorithm (CSPA)
Latent class models
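The first approach can be illustrated with a minimal Python sketch. It assumes Gower's coefficient with equal weights (range-scaled absolute difference for quantitative variables, simple mismatch for nominal ones) and uses average linkage as one possible AHCA variant; the slides do not say which linkage was used. The function names and the four respondents are hypothetical.

```python
from itertools import combinations


def gower_matrix(rows, numeric_cols, nominal_cols):
    """Return the n x n matrix of Gower dissimilarities (all variable weights equal)."""
    n = len(rows)
    # Range of each quantitative variable, used to scale differences to [0, 1].
    ranges = {}
    for c in numeric_cols:
        values = [r[c] for r in rows]
        ranges[c] = max(values) - min(values)
    n_vars = len(numeric_cols) + len(nominal_cols)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        total = 0.0
        for c in numeric_cols:
            if ranges[c] > 0:
                total += abs(rows[i][c] - rows[j][c]) / ranges[c]
        for c in nominal_cols:
            # Nominal variables contribute 0 on a match and 1 on a mismatch.
            total += 0.0 if rows[i][c] == rows[j][c] else 1.0
        d[i][j] = d[j][i] = total / n_vars
    return d


def average_linkage(d, k):
    """Agglomerative clustering with average linkage; stops when k clusters remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dist = sum(d[i][j] for i in clusters[a] for j in clusters[b])
            dist /= len(clusters[a]) * len(clusters[b])
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical respondents: (number of actions, opinion, sex)
respondents = [
    (1, "agree", "F"),
    (2, "agree", "F"),
    (9, "disagree", "M"),
    (10, "disagree", "M"),
]
D = gower_matrix(respondents, numeric_cols=[0], nominal_cols=[1, 2])
clusters = average_linkage(D, k=2)
print(clusters)
```

The naive linkage loop is only meant for small illustrative data sets; it recomputes all pairwise cluster distances at every merge.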
Implementation in software packages
Specialized dissimilarity measures
- are not implemented for AHCA
Clustering objects with qualitative variables
- is implemented only rarely (disagreement coef.)
Cluster-based similarity partitioning algorithm
- is not implemented, but it could be realized
LC Cluster models (Latent GOLD)
Log-likelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)
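As a hint of how the cluster-based similarity partitioning algorithm could be realized, the Python sketch below builds the similarity matrix on which CSPA operates: each partition yields a binary co-association matrix, and the matrices are averaged. The original CSPA then partitions the resulting similarity graph (with a graph partitioner such as METIS); that final step is not shown here. The function names and the two label vectors are hypothetical.

```python
def coassociation(labels):
    """n x n binary matrix: 1.0 where two objects share a cluster label, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def cspa_similarity(partitions):
    """Average the co-association matrices of several partitions of the same objects."""
    n = len(partitions[0])
    mats = [coassociation(p) for p in partitions]
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
            for i in range(n)]


# Hypothetical results of two separate clusterings of the same six objects:
labels_quantitative = [0, 0, 0, 1, 1, 1]               # from the quantitative variables
labels_qualitative = ["a", "a", "b", "b", "c", "c"]   # from the qualitative variables

S = cspa_similarity([labels_quantitative, labels_qualitative])
for row in S:
    print(row)
```

An entry of 1.0 means both clusterings put the two objects together, 0.5 means only one of them did. As a stand-in for graph partitioning, `1 - S` could be passed as a dissimilarity matrix to any hierarchical clustering routine.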
Implementation in software packages
Log-likelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)

$$D_{hh'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}$$

$$\xi_g = -n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \qquad \text{... entropy}$$
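The log-likelihood distance can be sketched in Python as follows. The sketch assumes that s_l^2 and s_gl^2 are population variances (divisor n), which the slides do not specify, and that every quantitative variable has non-zero variance in the whole data set. The function names and the four-object data set are hypothetical.

```python
import math
from collections import Counter


def variance(values):
    """Population variance (divisor n, an assumption); 0.0 for a single value."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def entropy(values):
    """H_gl = -sum_u (n_glu / n_g) * ln(n_glu / n_g) for one nominal variable."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def xi(cluster, total_var, numeric_cols, nominal_cols):
    """xi_g = -n_g * (sum_l 0.5 * ln(s_l^2 + s_gl^2) + sum_l H_gl)."""
    n_g = len(cluster)
    quantitative = sum(
        0.5 * math.log(total_var[c] + variance([r[c] for r in cluster]))
        for c in numeric_cols
    )
    qualitative = sum(entropy([r[c] for r in cluster]) for c in nominal_cols)
    return -n_g * (quantitative + qualitative)


def loglik_distance(h1, h2, total_var, numeric_cols, nominal_cols):
    """D_hh' = xi_h + xi_h' - xi_<h,h'>, where <h,h'> is the union of both clusters."""
    args = (total_var, numeric_cols, nominal_cols)
    return xi(h1, *args) + xi(h2, *args) - xi(h1 + h2, *args)


# Hypothetical data: one quantitative variable (column 0), one nominal (column 1)
data = [(1.0, "a"), (2.0, "a"), (9.0, "b"), (10.0, "b")]
total_var = {0: variance([r[0] for r in data])}  # s_l^2 for the whole data set

d = loglik_distance(data[:2], data[2:], total_var, [0], [1])
print(d)
```

The same function gives a distance between two objects when each is passed as a one-object cluster, e.g. `loglik_distance([data[0]], [data[3]], total_var, [0], [1])`.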
Implementation in software packages
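Log-likelihood distance measure between objects
- implemented in two-step cluster analysis (SPSS)

$$D_{hh'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}$$

$$\xi_g = -n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$

For two objects, each object is treated as a one-object cluster:

$$D(\mathbf{x}_i, \mathbf{x}_j) = \xi_{\mathbf{x}_i} + \xi_{\mathbf{x}_j} - \xi_{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}$$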
Evaluation criteria implemented in software packages
BIC (Bayesian Information Criterion), AIC (Akaike Information Criterion)
- implemented in two-step cluster analysis (SPSS)

$$I_{BIC} = -2 \sum_{g=1}^{k} \xi_g + w_k \ln(n)$$

$$I_{AIC} = -2 \sum_{g=1}^{k} \xi_g + 2 w_k$$

$$w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right)$$

... minimum
- only for initial estimation of number of clusters
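The two criteria can be sketched in Python directly from the formulas, given the per-cluster values xi_g. The function names and all input values below are hypothetical and serve only to show the arithmetic.

```python
import math


def n_parameters(k, m_quantitative, category_counts):
    """w_k = k * (2 * m1 + sum_l (K_l - 1))."""
    return k * (2 * m_quantitative + sum(K - 1 for K in category_counts))


def bic(xi_values, n, m_quantitative, category_counts):
    """I_BIC = -2 * sum_g xi_g + w_k * ln(n); k is the number of xi values."""
    w_k = n_parameters(len(xi_values), m_quantitative, category_counts)
    return -2.0 * sum(xi_values) + w_k * math.log(n)


def aic(xi_values, m_quantitative, category_counts):
    """I_AIC = -2 * sum_g xi_g + 2 * w_k."""
    w_k = n_parameters(len(xi_values), m_quantitative, category_counts)
    return -2.0 * sum(xi_values) + 2.0 * w_k


# Hypothetical solution with k = 2 clusters, n = 30 objects,
# one quantitative variable and two nominal variables with 2 and 3 categories.
xi_values = [-40.0, -55.5]
print(n_parameters(2, 1, [2, 3]))       # w_2 = 2 * (2 + 1 + 2) = 10
print(aic(xi_values, 1, [2, 3]))        # 191 + 2 * 10 = 211.0
print(bic(xi_values, 30, 1, [2, 3]))    # 191 + 10 * ln(30)
```

In both criteria the solution with the minimum value is preferred.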
Proposed evaluation criteria
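Within-cluster variability for k clusters:

$$V_W(k) = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) = -\sum_{g=1}^{k} \xi_g$$

Variability of the whole data set:

$$V_T = V_W(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(2 s_l^2\right) + \sum_{l=1}^{m^{(2)}} H_l \right)$$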
Proposed evaluation criteria
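Within-cluster variability for k clusters:

$$V_W(k) = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) = -\sum_{g=1}^{k} \xi_g$$

Variability difference:

$$V_{\text{diff}}(k) = V_W(k-1) - V_W(k)$$

The difference should be maximal for the suitable number of clusters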
Evaluation criteria modified for qualitative variables
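1. Uncertainty index (R-square (RSQ) index)

$$I_U(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{V_W(1) - V_W(k)}{V_W(1)}$$

where V_W = V_W(k) is the within-cluster variability for k clusters and V_T = V_W(1) is the variability of the whole data set

2. Semipartial uncertainty index
(optimal number of clusters - minimum)

$$I_{SPU}(k) = I_U(k) - I_U(k-1)$$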
Evaluation criteria modified for qualitative variables
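3. Pseudo (Caliński and Harabasz) F index: PSF (SAS), CHF (SYSTAT)

$$I_{CHFU}(k) = \frac{V_B / (k-1)}{V_W / (n-k)} = \frac{(n-k)\,\left(V_W(1) - V_W(k)\right)}{(k-1)\,V_W(k)}$$

4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)

$$I_{PTSU}(k) = \frac{\xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}}{\left(\xi_h + \xi_{h'}\right) / \left(n_h + n_{h'} - 2\right)}$$

where h and h' denote the two clusters being joined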
Evaluation criteria modified for qualitative variables
SYSTAT
Evaluation criteria modified for qualitative variables
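5. Modified Davies and Bouldin (DB) index

$$I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h',\, h' \neq h} \frac{s_{D,h} + s_{D,h'}}{D_{hh'}}$$

$$I_{DBU}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h',\, h' \neq h} \frac{\xi_h + \xi_{h'}}{\xi_{\langle h,h' \rangle} - \left(\xi_h + \xi_{h'}\right)}$$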
Evaluation criteria modified for qualitative variables
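6. Dunn's index

$$I_D(k) = \min_{1 \le h \le k} \left\{ \min_{1 \le h' \le k,\; h' \neq h} \left\{ \frac{D_{hh'}}{\max_{1 \le g \le k} \mathrm{diam}_g} \right\} \right\}$$

$$D_{hh'} = \min_{\mathbf{x}_i \in C_h,\; \mathbf{x}_j \in C_{h'}} D(\mathbf{x}_i, \mathbf{x}_j)$$

$$\mathrm{diam}_g = \max_{\mathbf{x}_i,\, \mathbf{x}_j \in C_g} D(\mathbf{x}_i, \mathbf{x}_j)$$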
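Dunn's index only needs a distance between objects, so it can be sketched in Python with the distance passed in as a function. In the mixed-type setting that function would be the log-likelihood distance between objects; the absolute difference used below is only a stand-in for illustration, and the function name and data are hypothetical. The sketch assumes at least two clusters and at least one cluster with two distinct objects, otherwise the largest diameter is undefined or zero.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    # D_hh' = min over x_i in C_h, x_j in C_h' of D(x_i, x_j), minimised over all pairs h != h'
    min_separation = min(
        dist(x, y)
        for a, b in combinations(clusters, 2)
        for x in a
        for y in b
    )
    # diam_g = max over x_i, x_j in C_g of D(x_i, x_j), maximised over all clusters g
    max_diameter = max(
        dist(x, y) for c in clusters for x, y in combinations(c, 2)
    )
    return min_separation / max_diameter


# Hypothetical one-dimensional objects in two clusters
clusters = [[0.0, 1.0], [5.0, 7.0]]
index = dunn_index(clusters, dist=lambda x, y: abs(x - y))
print(index)  # (5 - 1) / (7 - 5) = 2.0
```

Larger values indicate compact, well-separated clusters.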
Modified evaluation criteria
Cluster variability based on the variance and Gini's coefficient of mutability

$$G(k) = \sum_{g=1}^{k} G_g$$

$$G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} G_{gl} \right)$$

$$G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2 \qquad \text{... Gini's coefficient of mutability}$$

$$I_{BGC} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n)$$
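A minimal Python sketch of the Gini-based variability follows. As an assumption not stated in the slides, the variances are population variances (divisor n), and every quantitative variable is assumed to have non-zero variance in the whole data set. The function names and the small data set are hypothetical.

```python
import math
from collections import Counter


def variance(values):
    """Population variance (divisor n, an assumption); 0.0 for a single value."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gini_mutability(values):
    """G_gl = 1 - sum_u (n_glu / n_g)^2 for one nominal variable."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def gini_cluster_variability(cluster, total_var, numeric_cols, nominal_cols):
    """G_g = n_g * (sum_l 0.5 * ln(s_l^2 + s_gl^2) + sum_l G_gl)."""
    n_g = len(cluster)
    quantitative = sum(
        0.5 * math.log(total_var[c] + variance([r[c] for r in cluster]))
        for c in numeric_cols
    )
    qualitative = sum(gini_mutability([r[c] for r in cluster]) for c in nominal_cols)
    return n_g * (quantitative + qualitative)


def gini_within_variability(clusters, total_var, numeric_cols, nominal_cols):
    """G(k) = sum of G_g over the k clusters."""
    return sum(
        gini_cluster_variability(c, total_var, numeric_cols, nominal_cols)
        for c in clusters
    )


# Hypothetical data: one quantitative variable (column 0), one nominal (column 1)
data = [(1.0, "a"), (2.0, "a"), (9.0, "b"), (10.0, "b")]
total_var = {0: variance([r[0] for r in data])}

G1 = gini_within_variability([data], total_var, [0], [1])                 # one cluster
G2 = gini_within_variability([data[:2], data[2:]], total_var, [0], [1])   # two clusters
print(G1, G2)
```

For this data set the two-cluster solution has the lower variability, as expected when the split separates both the numeric values and the categories.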
Evaluation criteria modified for qualitative variables
1. Tau index (RSQ index)

$$I_\tau(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)}$$

2. Semipartial tau index
(optimal number of clusters - minimum)

$$I_{SP\tau}(k) = I_\tau(k) - I_\tau(k-1)$$
Application to a real data file
Data from a questionnaire survey (for the participants of the chemistry seminar)
7 qualitative and 1 quantitative (count) variables
Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)
LC Cluster model (experiments for the numbers of clusters from 2 to 6); the quantitative variable was recoded to 5 categories
Application to a real data file
Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    273.92    241.17    206.39    186.51
Variability difference        -         32.75     34.78     19.88
I_U                           -         0.12      0.25      0.32
I_SPU                         -         0.12      0.13      0.07
I_CHFU                        -         6.52      7.69      7.19
I_BIC                         590.85    568.41    541.88    545.15
Criteria based on the entropy (TSCA in SPSS)
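The derived rows of this table can be recomputed from the within-cluster variabilities alone. The Python sketch below uses the four tabulated entropy-based values and the definitions of the variability difference, the uncertainty index and the semipartial uncertainty index; the variable and function names are hypothetical.

```python
# Within-cluster variability (entropy-based) for k = 1..4, copied from the table.
within = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
total = within[1]  # variability of the whole data set (one cluster)


def uncertainty_index(k):
    """I_U(k) = (V_W(1) - V_W(k)) / V_W(1)."""
    return (total - within[k]) / total


rows = []
for k in (2, 3, 4):
    diff = within[k - 1] - within[k]                                # variability difference
    semipartial = uncertainty_index(k) - uncertainty_index(k - 1)   # I_SPU(k)
    rows.append((k, round(diff, 2), round(uncertainty_index(k), 2), round(semipartial, 2)))

# Number of clusters with the largest variability difference.
best_k = max(rows, key=lambda row: row[1])[0]

for row in rows:
    print(row)
print("largest variability difference at k =", best_k)
```

The recomputed values agree with the table, and the largest variability difference occurs at three clusters, the same solution for which I_BIC is minimal.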
Application to a real data file
Criteria based on the Gini’s coefficient (TSCA in SPSS)
Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    185.41    162.57    137.83    127.86
Variability difference        -         22.84     24.74     9.97
I_τ                           -         0.12      0.26      0.31
I_SPτ                         -         0.12      0.13      0.05
I_CHFτ                        -         6.74      8.11      6.90
I_BGC                         413.85    411.20    404.75    427.84
Application to a real data file
Comparison of BIC
Method              Number of clusters
                    1          2          3          4
Two-step CA         590.85     568.41     541.88     545.15
LC Cluster Model    1397.01    1059.24    1019.18    1036.90
Conclusion
If the distance between objects, the distance between clusters, the within-cluster variability and the total variability are defined for objects characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.
One possibility is to apply the log-likelihood distance measure based on the entropy.
Another possibility is to use the analogous measure based on Gini's coefficient.