[PPT] - Agglomerative 2-3 Hierarchical Clustering: PowerPoint Presentation

SLIDE 1

GfKl 2003 13 March 2003

Agglomerative 2-3 Hierarchical Clustering: theoretical improvements and tests

Sergiu Chelcea (1), Patrice Bertrand (1,2), Brigitte Trousse (1)

1. Action AxIS, INRIA Sophia-Antipolis, France
2. ENST Bretagne, France

LastName.FirstName@inria.fr

SLIDE 2


Outline

- The classical case of AHC
- 2-3 Hierarchies
  - Definitions
  - Properties
- Algorithm of 2-3AHC
- Analysis of complexity
- Application on simulated data
- Experimental Validation of Complexity
- Ongoing and Future Work

SLIDE 3


Context

[Diagram: families of cluster structures and their references]
- Hierarchies
- 2-3 Hierarchies [Bertrand 2002]
- Pyramids [Diday 1984-86, Fichet 1987]
- Weak Hierarchies [Bandelt, Dress 1989; Diatta, Fichet 1994]

SLIDE 4


Hierarchies (1/3)

We recall some definitions related to the hierarchical case that will be extended to the 2-3 hierarchies:

Hierarchy:
- each cluster is nonempty
- each pair of clusters (A, B) is hierarchical: A∩B ∈ {∅, A, B}

[Figure: clusters A1, A2, B]

Remark:
- a hierarchy admits at most n-1 non-trivial clusters
- E and the singletons are clusters

Indexed hierarchy:
- each cluster is associated to a positive real number f, where
  $\forall A, B \in S,\quad A \subset B \Rightarrow f(A) < f(B)$

SLIDE 5


Agglomerative Hierarchical Classification (2/3)

Vocabulary:
- data input: dissimilarity $\delta : E \times E \to [0, \infty)$, with $\delta(a, b) = \delta(b, a) \geq \delta(a, a) = 0,\ \forall a, b \in E$
- candidate clusters (unmarked) = maximal clusters
- aggregation index (link between clusters), µ:
  - single linkage
  - complete linkage
  - average linkage
- set inclusion order on the set of clusters:
  - predecessor/successor
  - comparable clusters
- usually f(X∪Y) = µ(X,Y)
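The three aggregation indices named above can be illustrated with a minimal Python sketch. The function names, the representation of clusters as frozensets, and the toy dissimilarity (absolute difference of numbers) are assumptions made for this example; they are not taken from the slides.

```python
from itertools import product


def single_linkage(X, Y, delta):
    """Smallest dissimilarity between an element of X and an element of Y."""
    return min(delta(a, b) for a, b in product(X, Y))


def complete_linkage(X, Y, delta):
    """Largest dissimilarity between an element of X and an element of Y."""
    return max(delta(a, b) for a, b in product(X, Y))


def average_linkage(X, Y, delta):
    """Mean dissimilarity over all pairs (a, b) with a in X and b in Y."""
    return sum(delta(a, b) for a, b in product(X, Y)) / (len(X) * len(Y))


def delta(a, b):
    # Toy dissimilarity for illustration only: distance on the real line.
    return abs(a - b)


X = frozenset({0, 1})
Y = frozenset({4})
linkage_values = (
    single_linkage(X, Y, delta),
    complete_linkage(X, Y, delta),
    average_linkage(X, Y, delta),
)
```

With the convention f(X∪Y) = µ(X,Y), the chosen index directly gives the level of the merged cluster.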

SLIDE 6


Algorithm AHC (3/3)

1. Initialisation: iter ← 0; f ← 0; Clusters are the singletons of set E.
2. iter ← iter + 1; Merge X and Y which are, in the sense of µ, the two nearest clusters; compute f(X∪Y)
3. Reduction: Eliminate the successors found on the same level f with their predecessor, if there are any
4. Update µ, predecessor links, successor links
5. Stopping rule: Repeat steps 2-4 until the set E becomes a cluster
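The steps of classical AHC can be sketched in a few lines of Python. This is a naive illustration, not the authors' implementation: it recomputes µ at every iteration instead of maintaining an updated matrix, it omits the reduction step, and the names `ahc` and `single_linkage` are assumptions chosen for the example.

```python
from itertools import combinations, product


def single_linkage(X, Y, delta):
    """Aggregation index mu: smallest dissimilarity between X and Y."""
    return min(delta(a, b) for a, b in product(X, Y))


def ahc(elements, delta, mu=single_linkage):
    """Naive AHC; returns the list of (merged cluster, level f) in merge order."""
    # Step 1: the clusters are the singletons of E (their level is 0).
    clusters = [frozenset([e]) for e in elements]
    merges = []
    # Step 5: repeat until E itself becomes a cluster.
    while len(clusters) > 1:
        # Step 2: pick the two nearest clusters in the sense of mu.
        X, Y = min(combinations(clusters, 2),
                   key=lambda pair: mu(pair[0], pair[1], delta))
        level = mu(X, Y, delta)  # f(X ∪ Y) = mu(X, Y)
        # Step 4 (simplified): X and Y stop being candidates, X ∪ Y becomes one.
        clusters = [C for C in clusters if C != X and C != Y]
        clusters.append(X | Y)
        merges.append((X | Y, level))
    return merges


merges = ahc([0, 1, 4, 9], lambda a, b: abs(a - b))
levels = [level for _, level in merges]
```

On the four points 0, 1, 4, 9 the merges happen at levels 1, 3 and 5, and the last merged cluster is the whole set.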

SLIDE 7


2-3 Hierarchies: Definitions

Proper intersection:
- A properly intersects B if A∩B ∉ {∅, A, B}

[Figure: clusters A and B]

2-3 Hierarchy [Bertrand 2002]:
- each cluster is nonempty
- the proper intersection of two clusters is also a cluster
- each cluster properly intersects no more than one other cluster

Concept:
- in a 2-3 hierarchy, for any three clusters, at least two pairs of them are hierarchical
- E and singletons are clusters
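The conditions listed above can be checked mechanically on a small set system. The following Python sketch is only an illustration under the assumption that clusters are given as sets of elements; the function names are hypothetical.

```python
from itertools import combinations


def properly_intersects(A, B):
    """True if A ∩ B is neither empty, nor A, nor B."""
    common = A & B
    return bool(common) and common != A and common != B


def is_2_3_hierarchy(clusters, E):
    """Check the conditions stated on the slide for a set system on E."""
    clusters = {frozenset(C) for C in clusters}
    E = frozenset(E)
    # E and the singletons are clusters.
    if E not in clusters or any(frozenset([x]) not in clusters for x in E):
        return False
    # Each cluster is nonempty.
    if any(len(C) == 0 for C in clusters):
        return False
    # The proper intersection of two clusters is also a cluster.
    for A, B in combinations(clusters, 2):
        if properly_intersects(A, B) and (A & B) not in clusters:
            return False
    # Each cluster properly intersects no more than one other cluster.
    for C in clusters:
        if sum(properly_intersects(C, D) for D in clusters) > 1:
            return False
    return True


E = {1, 2, 3}
singletons = [{1}, {2}, {3}]
overlapping = singletons + [{1, 2}, {2, 3}, E]    # {1,2} and {2,3} overlap in {2}
too_many = overlapping + [{1, 3}]                 # each pair now overlaps two others
```

Here `overlapping` satisfies the conditions although it is not a classical hierarchy, while `too_many` fails because {1, 2} properly intersects both {2, 3} and {1, 3}.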

SLIDE 8


2-3 Hierarchies: Properties [Bertrand 2002]

- The number of elements of a 2-3 hierarchy that are not reduced to singletons is at most
  $\left\lfloor \frac{3}{2}(n-1) \right\rfloor$
- Each 2-3 hierarchical set system on E is a collection of intervals of some linear order defined on E.

[Figure: 2-3 Hierarchy, Pyramid]

SLIDE 9


Algorithm of 2-3AHC

1. Initialisation: iter ← 0; f ← 0; Clusters are the singletons of set E.
2. iter ← iter + 1; Merge X and Y which are, in the sense of µ, the two nearest non-comparable clusters, such that at least one of them is maximal; compute f(X∪Y)
3. Merge X∪Y and the other predecessor of X or Y, if it exists; compute f(X∪Y)
4. Reduction: Eliminate the successors found on the same level f with their predecessor, if there are any
5. Update µ, predecessor links, successor links
6. Stopping rule: Repeat steps 2-5 until the set E becomes a cluster
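The selection rule of step 2 is the main difference from classical AHC, so it may help to see it in isolation. The Python sketch below covers only that step and rests on assumptions that the slides do not spell out: clusters are frozensets, "comparable" means comparable for set inclusion, "maximal" means not strictly contained in another current cluster, and µ is single linkage. All function names are hypothetical.

```python
from itertools import combinations, product


def single_linkage(X, Y, delta):
    return min(delta(a, b) for a, b in product(X, Y))


def comparable(A, B):
    """Comparable for set inclusion: one cluster contains the other."""
    return A <= B or B <= A


def is_maximal(C, clusters):
    """Assumed meaning of 'maximal': no current cluster strictly contains C."""
    return not any(C < D for D in clusters)


def select_merge_pair(clusters, delta, mu=single_linkage):
    """Nearest non-comparable pair with at least one maximal cluster (step 2)."""
    admissible = [
        (X, Y) for X, Y in combinations(clusters, 2)
        if not comparable(X, Y)
        and (is_maximal(X, clusters) or is_maximal(Y, clusters))
    ]
    return min(admissible, key=lambda p: mu(p[0], p[1], delta), default=None)


def delta(a, b):
    return abs(a - b)


clusters = [frozenset(s) for s in ({0}, {1}, {3}, {7}, {0, 1})]
X, Y = select_merge_pair(clusters, delta)
chosen_level = single_linkage(X, Y, delta)
```

In this example {0} and {1} are the closest clusters, but neither is maximal because {0, 1} already exists, so the pair is not admissible and the selected pair involves {3} at level 2. Ties are broken here by enumeration order, which is an arbitrary choice of the sketch.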

SLIDE 10


Algorithm of 2-3AHC

Generalizes the AHC:
- a cluster can be merged with two different clusters

Double single linkage [Jullien, Bertrand 2002]:
$f(X \cup Y) = \min\{\mu(X, Y),\ \mu(X \cup Y, Z) : Z \text{ candidate cluster}\}$

Complexity: O(n² log n)

SLIDE 11


Analysis of Complexity (1/3)

We use an ordered dissimilarity matrix on three levels:
- dissimilarity values
- cardinality of the two clusters
- lexicographical order

Step 1. Initialisation: Compute and order the dissimilarity matrix, O(n² log n)
Step 2. Merge X and Y …: Retrieve (X,Y) from the data structure, and create X∪Y, O(1)
Step 3. Merge X∪Y and …: Intermediate merging with O(n) complexity
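The three-level ordering of step 1 can be pictured as a sort with a composite key. The Python sketch below is an assumption-laden illustration: the slides do not say whether smaller or larger cardinalities come first, nor how the lexicographic order is defined, so the key used here (ascending dissimilarity, then ascending total cardinality, then sorted element lists) is a hypothetical choice.

```python
from itertools import combinations, product


def single_linkage(X, Y, delta):
    return min(delta(a, b) for a, b in product(X, Y))


def ordered_pairs(clusters, delta, mu=single_linkage):
    """Sort all cluster pairs on three levels (assumed directions, see lead-in)."""
    def key(pair):
        X, Y = pair
        return (
            mu(X, Y, delta),          # level 1: dissimilarity value
            len(X) + len(Y),          # level 2: cardinality of the two clusters
            (sorted(X), sorted(Y)),   # level 3: lexicographic order
        )
    # There are n(n-1)/2 pairs, so the sort costs O(n^2 log n) comparisons.
    return sorted(combinations(clusters, 2), key=key)


def delta(a, b):
    return abs(a - b)


singletons = [frozenset([e]) for e in (0, 1, 2, 4)]
ordering = ordered_pairs(singletons, delta)
```

Once the pairs are ordered, the nearest pair is simply the first entry, which is consistent with the O(1) retrieval stated for step 2.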

SLIDE 12


Analysis of Complexity (2/3)

Step 4. Reduction: We have five possible cases of reduction when merging a cluster:
- eliminate the successors found on the same level with their predecessor
- complexity O(n)

[Figure: reduction cases, with labels α, β1, β2, X', Y', Z]

SLIDE 13


Analysis of Complexity (3/3)

Step 5. Update µ:
- compute new dissimilarities and store them in the matrix, O(n log n)
- eliminate dissimilarities containing non-candidate clusters, O(n log n)

Total complexity of the algorithm:
O(n² log n) [step 1] + n × O(n log n) [steps 2-5] → O(n² log n)

SLIDE 14


Application

Simulated data:
- complexity ⇒ O(n² log n)

[Figures: "Execution times" and "Complexity"; table columns: n, t(ms), t/(n² log n)]
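The last column, t/(n² log n), is the usual way to check a complexity claim empirically: if the running time grows like n² log n, the ratio should stay roughly constant as n increases. The Python sketch below shows that computation. The input values are synthetic, generated from the formula itself purely to exercise the function; they are not the measurements shown on the slide, and the function name is hypothetical.

```python
import math


def normalized_times(measurements):
    """Map (n, t_ms) pairs to (n, t_ms / (n^2 log n)).

    The natural logarithm is used; another base only rescales the ratio
    by a constant factor.
    """
    return [(n, t / (n * n * math.log(n))) for n, t in measurements]


# Synthetic placeholder inputs, NOT real timings: t = c * n^2 * log n.
c = 0.002
synthetic = [(n, c * n * n * math.log(n)) for n in (100, 200, 400, 800)]
ratios = [ratio for _, ratio in normalized_times(synthetic)]
```

With real timings the ratios would fluctuate; a ratio that stays bounded rather than growing with n is what supports the O(n² log n) estimate.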

SLIDE 15


Ongoing and Future work:
- study of the quality of the 2-3 AHC compared with AHC and other classification methods
- study of the applicability of 2-3 AHC in the context of Web Usage Mining

Contributions:
- a new formulation of the 2-3 AHC algorithm
- a reduction of complexity
- a first implementation of the 2-3 AHC algorithm (Java) and its integration in CBR* Tools, a Case Based Reasoning framework
- an experimental validation of the complexity on simulated data