

SLIDE 1

GfKl 2003, 13 March 2003

Agglomerative 2-3 Hierarchical Clustering: theoretical improvements and tests

Sergiu Chelcea (1), Patrice Bertrand (1,2), Brigitte Trousse (1)

  • 1. Action AxIS, INRIA Sophia-Antipolis, France
  • 2. ENST Bretagne, France

LastName.FirstName@inria.fr

SLIDE 2

Outline

  • The classical case of AHC
  • 2-3 Hierarchies
      • Definitions
      • Properties
  • Algorithm of 2-3AHC
  • Analysis of complexity
  • Application on simulated data
  • Experimental validation of complexity
  • Ongoing and future work

SLIDE 3

Context

  • Hierarchies
  • 2-3 Hierarchies [Bertrand 2002]
  • Pyramids [Diday 1984-86, Fichet 1987]
  • Weak Hierarchies [Bandelt, Dress 1989; Diatta, Fichet 1994]

SLIDE 4

Hierarchies (1/3)

We recall some definitions related to the hierarchical case that will be extended to 2-3 hierarchies:

Hierarchy:

  • each cluster is nonempty
  • each pair of clusters (A, B) is hierarchical: A∩B ∈ {∅, A, B}
  • E and the singletons are clusters

Remark:

  • a hierarchy admits at most n-1 non-trivial clusters

Indexed hierarchy:

  • each cluster is associated with a positive real number f, where $\forall A, B \in S,\ A \subset B \Rightarrow f(A) < f(B)$

[Figure labels: A1, A2, B]
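
The hierarchy conditions can be checked mechanically on a small family of sets. The following is a minimal sketch, not taken from the presentation: the function names `is_hierarchical_pair` and `is_hierarchy` and the example families are illustrative assumptions.

```python
from itertools import combinations


def is_hierarchical_pair(a, b):
    """Return True when A ∩ B is one of ∅, A, B."""
    common = a & b
    return not common or common == a or common == b


def is_hierarchy(ground_set, clusters):
    """Check the hierarchy conditions on a family of clusters over E."""
    ground = frozenset(ground_set)
    family = {frozenset(c) for c in clusters}
    if any(len(c) == 0 for c in family):
        return False  # every cluster must be nonempty
    if ground not in family:
        return False  # E must be a cluster
    if any(frozenset([x]) not in family for x in ground):
        return False  # every singleton must be a cluster
    # every pair of clusters must be nested or disjoint
    return all(is_hierarchical_pair(a, b) for a, b in combinations(family, 2))


E = {1, 2, 3, 4}
singletons = [{x} for x in E]
nested = singletons + [{1, 2}, {3, 4}, E]       # pairs are nested or disjoint
overlapping = singletons + [{1, 2}, {2, 3}, E]  # {1, 2} and {2, 3} overlap
```

Here `is_hierarchy(E, nested)` holds, while `is_hierarchy(E, overlapping)` fails because {1, 2} and {2, 3} intersect in {2}, which is neither empty nor one of the two sets.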

SLIDE 5

Agglomerative Hierarchical Classification (2/3)

Vocabulary:

  • data input: dissimilarity $\delta : E \times E \to [0, \infty)$, with $\delta(a,b) = \delta(b,a) \ge \delta(a,a) = 0,\ \forall a, b \in E$
  • candidate clusters (unmarked) = maximal clusters
  • aggregation index (link between clusters), µ:
      • single linkage
      • complete linkage
      • average linkage
  • set inclusion order on the set of clusters:
      • predecessor/successor
      • comparable clusters
  • usually f(X∪Y) = µ(X,Y)
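
The three aggregation indices named in this vocabulary have standard definitions: the minimum, maximum, and mean of the pairwise dissimilarities between two clusters. A minimal sketch follows, assuming clusters are Python sets and the dissimilarity is passed in as a function `delta` (names chosen here for illustration).

```python
from itertools import product


def single_linkage(x, y, delta):
    """Smallest dissimilarity between a member of X and a member of Y."""
    return min(delta(a, b) for a, b in product(x, y))


def complete_linkage(x, y, delta):
    """Largest dissimilarity between a member of X and a member of Y."""
    return max(delta(a, b) for a, b in product(x, y))


def average_linkage(x, y, delta):
    """Mean dissimilarity over all pairs (a, b) with a in X and b in Y."""
    return sum(delta(a, b) for a, b in product(x, y)) / (len(x) * len(y))


def delta(a, b):
    """Illustrative dissimilarity: absolute difference on the real line."""
    return abs(a - b)


X, Y = {0, 1}, {4, 6}
```

For these two clusters the pairwise dissimilarities are 4, 6, 3 and 5, so the single, complete and average linkages are 3, 6 and 4.5 respectively.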

SLIDE 6

Algorithm AHC (3/3)

  • 1. Initialisation: iter ← 0; f ← 0; clusters are the singletons of set E.
  • 2. iter ← iter + 1; Merge X and Y which are, in the sense of µ, the two nearest clusters; compute f(X∪Y).
  • 3. Reduction: eliminate the successors found on the same level f as their predecessor, if there are any.
  • 4. Update µ, predecessor links, successor links.
  • 5. Stopping rule: repeat steps 2-4 until the set E becomes a cluster.
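
The five steps can be sketched as a deliberately naive loop. This is an illustration under stated assumptions, not the authors' implementation: points are assumed distinct, µ is recomputed on demand rather than stored, and the names `ahc`, `mu` and `candidates` are hypothetical.

```python
from itertools import combinations, product


def ahc(points, delta, linkage=min):
    """Naive AHC sketch; returns a dict mapping each kept cluster to its level f."""
    def mu(x, y):
        return linkage(delta(a, b) for a, b in product(x, y))

    # Step 1: the clusters are the singletons, with f = 0.
    candidates = [frozenset([p]) for p in points]
    f = {c: 0 for c in candidates}
    # Step 5: repeat until E itself becomes a cluster.
    while len(candidates) > 1:
        # Step 2: merge the two nearest candidate clusters.
        x, y = min(combinations(candidates, 2), key=lambda pair: mu(*pair))
        merged, level = x | y, mu(x, y)
        # Step 3: drop a non-singleton successor lying on the same level.
        for successor in (x, y):
            if len(successor) > 1 and f[successor] == level:
                del f[successor]
        f[merged] = level
        # Step 4: update the candidates (mu is recomputed on demand here).
        candidates = [c for c in candidates if c not in (x, y)] + [merged]
    return f


def delta(a, b):
    """Illustrative dissimilarity: absolute difference on the real line."""
    return abs(a - b)


levels = ahc([0, 1, 3, 7], delta)
```

With `linkage=min` this is single linkage; passing `max` gives complete linkage. On the points 0, 1, 3, 7 the sketch creates {0, 1} at level 1, {0, 1, 3} at level 2 and E at level 4. On the points 0, 1, 2 the cluster {0, 1} is created at level 1 and then eliminated by the reduction step, because E is also formed at level 1.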

SLIDE 7

2-3 Hierarchies: Definitions

  • Proper intersection:
      • A properly intersects B if A∩B ∉ {∅, A, B}
  • 2-3 Hierarchy [Bertrand 2002]:
      • each cluster is nonempty
      • the proper intersection of two clusters is also a cluster
      • each cluster properly intersects no more than one other cluster
      • E and singletons are clusters

Concept:

  • in a 2-3 hierarchy, for any three clusters, at least two pairs of them are hierarchical

[Figure labels: A, B]
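
The 2-3 hierarchy conditions can be tested on small set families in the same way as the hierarchy conditions. The sketch below is illustrative only; the names `properly_intersect` and `is_23_hierarchy` are assumptions, not part of the presentation.

```python
from itertools import combinations


def properly_intersect(a, b):
    """A properly intersects B when A ∩ B is none of ∅, A, B."""
    common = a & b
    return bool(common) and common != a and common != b


def is_23_hierarchy(ground_set, clusters):
    """Check the 2-3 hierarchy conditions on a family of clusters over E."""
    ground = frozenset(ground_set)
    family = {frozenset(c) for c in clusters}
    if any(len(c) == 0 for c in family):
        return False  # every cluster must be nonempty
    if ground not in family:
        return False  # E must be a cluster
    if any(frozenset([x]) not in family for x in ground):
        return False  # every singleton must be a cluster
    partners = {c: 0 for c in family}
    for a, b in combinations(family, 2):
        if properly_intersect(a, b):
            if (a & b) not in family:
                return False  # a proper intersection must itself be a cluster
            partners[a] += 1
            partners[b] += 1
    # each cluster properly intersects at most one other cluster
    return all(count <= 1 for count in partners.values())


E3 = {1, 2, 3}
overlap_once = [{x} for x in E3] + [{1, 2}, {2, 3}, E3]
E4 = {1, 2, 3, 4}
chain = [{x} for x in E4] + [{1, 2}, {2, 3}, {3, 4}, E4]
```

The family `overlap_once` is a 2-3 hierarchy but not a hierarchy: {1, 2} and {2, 3} properly intersect in the cluster {2}. The family `chain` is rejected because {2, 3} properly intersects two clusters, {1, 2} and {3, 4}.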

SLIDE 8

2-3 Hierarchies: Properties [Bertrand 2002]

  • The number of elements of a 2-3 hierarchy that are not reduced to singletons is at most $\left\lfloor \frac{3}{2}(n-1) \right\rfloor$.
  • Each 2-3 hierarchical set system on E is a collection of intervals of some linear order defined on E.

[Figure labels: 2-3 Hierarchy, Pyramid]
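
To get a feel for the first property, the bound on non-singleton clusters, read here as ⌊3(n−1)/2⌋, can be compared with the n−1 bound recalled for classical hierarchies. This is a small sketch with hypothetical function names; it assumes the "non-trivial clusters" of the earlier remark are the non-singleton ones.

```python
def max_clusters_hierarchy(n):
    """Upper bound n - 1 recalled for classical hierarchies."""
    return n - 1


def max_clusters_23_hierarchy(n):
    """Upper bound floor(3 (n - 1) / 2) for 2-3 hierarchies."""
    return (3 * (n - 1)) // 2


for n in (2, 3, 4, 5, 10):
    print(n, max_clusters_hierarchy(n), max_clusters_23_hierarchy(n))
```

For n = 10 this gives 9 against 13, so a 2-3 hierarchy can hold up to roughly 50% more non-singleton clusters than a hierarchy on the same set.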

SLIDE 9

Algorithm of 2-3AHC

  • 1. Initialisation: iter ← 0; f ← 0; clusters are the singletons of set E.
  • 2. iter ← iter + 1; Merge X and Y which are, in the sense of µ, the two nearest non-comparable clusters such that at least one of them is maximal; compute f(X∪Y).
  • 3. Merge X∪Y and the other predecessor of X or Y, if it exists; compute f(X∪Y).
  • 4. Reduction: eliminate the successors found on the same level f as their predecessor, if there are any.
  • 5. Update µ, predecessor links, successor links.
  • 6. Stopping rule: repeat steps 2-5 until the set E becomes a cluster.
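
The selection rule of step 2 differs from classical AHC in that a non-maximal cluster may be merged again. The sketch below encodes only the two conditions stated on the slide (non-comparable, at least one maximal) and may omit other conditions of the full algorithm. The use of average linkage and all function names are assumptions made for illustration.

```python
from itertools import combinations, product


def comparable(x, y):
    """Two clusters are comparable when one contains the other."""
    return x <= y or y <= x


def is_maximal(cluster, clusters):
    """A cluster is maximal when no other cluster strictly contains it."""
    return not any(cluster < other for other in clusters)


def average_linkage(x, y, delta):
    return sum(delta(a, b) for a, b in product(x, y)) / (len(x) * len(y))


def select_merge_pair(clusters, delta, mu=average_linkage):
    """Return the nearest admissible pair for step 2, or None if there is none."""
    admissible = [
        (x, y)
        for x, y in combinations(clusters, 2)
        if not comparable(x, y)
        and (is_maximal(x, clusters) or is_maximal(y, clusters))
    ]
    if not admissible:
        return None
    return min(admissible, key=lambda pair: mu(pair[0], pair[1], delta))


def delta(a, b):
    """Illustrative dissimilarity: absolute difference on the real line."""
    return abs(a - b)


clusters = [frozenset(c) for c in ({0}, {1}, {3}, {0, 1})]
x, y = select_merge_pair(clusters, delta)
```

In this example the pair ({0}, {1}) is excluded because neither cluster is maximal once {0, 1} exists. The selected pair is ({1}, {3}), so the new cluster {1, 3} properly intersects {0, 1} in the singleton {1}, which a classical hierarchy would not allow.
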

SLIDE 10

Algorithm of 2-3AHC

  • Generalizes the AHC:
      • a cluster can be merged with two different clusters
  • Double single linkage [Jullien, Bertrand 2002]: $f(X \cup Y) = \min\{\mu(X, Y),\ \mu(X \cup Y, Z) : Z \text{ candidate cluster}\}$
  • Complexity: O(n² log n)
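
The double single linkage index, read here as f(X∪Y) = min{µ(X,Y), µ(X∪Y,Z) : Z candidate cluster}, translates directly into code. This is a sketch of that formula only. It assumes that Z ranges over the candidate clusters other than X and Y, which the slide does not state, and the names `merged_level` and `others` are hypothetical.

```python
from itertools import product


def single_linkage(x, y, delta):
    """Smallest dissimilarity between a member of X and a member of Y."""
    return min(delta(a, b) for a, b in product(x, y))


def merged_level(x, y, candidates, delta, mu=single_linkage):
    """f(X ∪ Y) = min{ mu(X, Y), mu(X ∪ Y, Z) : Z candidate cluster }."""
    merged = x | y
    values = [mu(x, y, delta)]
    # Assumption: candidates holds the candidate clusters other than X and Y.
    values += [mu(merged, z, delta) for z in candidates]
    return min(values)


def delta(a, b):
    """Illustrative dissimilarity: absolute difference on the real line."""
    return abs(a - b)


X, Y = frozenset({0}), frozenset({5})
others = [frozenset({6}), frozenset({9})]
```

With these values µ(X,Y) = 5, µ(X∪Y,{6}) = 1 and µ(X∪Y,{9}) = 4, so `merged_level(X, Y, others, delta)` is 1. With no other candidate the level falls back to µ(X,Y) = 5.
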

SLIDE 11

Analysis of Complexity (1/3)

We use an ordered dissimilarity matrix on three levels:

  • dissimilarity values
  • cardinality of the two clusters
  • lexicographical order

Step 1. Initialisation: compute and order the dissimilarity matrix, O(n² log n)

Step 2. Merge X and Y …: retrieve (X,Y) from the data structure and create X∪Y, O(1)

Step 3. Merge X∪Y and …: intermediate merging with O(n) complexity
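
A three-level ordering of this kind can be expressed as a composite sort key. The sketch below is one possible reading, not the authors' data structure: it assumes that smaller total cardinality comes first and that the lexicographic level compares the sorted element tuples of the two clusters. The name `pair_key` is hypothetical.

```python
def pair_key(x, y, dissimilarity):
    """Sort key: dissimilarity, then total cardinality, then lexicographic order."""
    first, second = sorted((tuple(sorted(x)), tuple(sorted(y))))
    return (dissimilarity, len(x) + len(y), first, second)


# Each entry is (cluster X, cluster Y, dissimilarity between them).
entries = [
    ({2}, {3}, 1.0),
    ({0, 1}, {2}, 1.0),
    ({0}, {1}, 1.0),
    ({4}, {5}, 0.5),
]
ordered = sorted(entries, key=lambda e: pair_key(*e))
```

Sorting n² entries this way costs O(n² log n), matching step 1. After sorting, the pair ({4}, {5}) comes first, the tie at dissimilarity 1.0 is broken by cardinality, and the remaining tie between ({0}, {1}) and ({2}, {3}) is broken lexicographically.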

SLIDE 12

Analysis of Complexity (2/3)

Step 4. Reduction: we have five possible cases of reduction when merging a cluster:

  • eliminate the successors found on the same level with their predecessor
  • complexity O(n)

[Figure labels: α, β1, β2, X’, Y’, Z]

SLIDE 13

Analysis of Complexity (3/3)

Step 5. Update µ:

  • compute new dissimilarities and store them in the matrix, O(n log n)
  • eliminate dissimilarities containing non-candidate clusters, O(n log n)

Total complexity of the algorithm:

$\underbrace{O(n^2 \log n)}_{\text{step 1}} + \underbrace{n \times O(n \log n)}_{\text{steps 2-5}} \to O(n^2 \log n)$
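
The total can be spelled out by first summing the per-step costs quoted in the analysis for one iteration, then multiplying by the n iterations used on the slide. The symbols $T_{\text{iteration}}$ and $T_{\text{total}}$ are introduced here only for this derivation.

```latex
\begin{align*}
T_{\text{iteration}}
  &= \underbrace{O(1)}_{\text{step 2}}
   + \underbrace{O(n)}_{\text{step 3}}
   + \underbrace{O(n)}_{\text{step 4}}
   + \underbrace{O(n \log n)}_{\text{step 5}}
   = O(n \log n) \\
T_{\text{total}}
  &= \underbrace{O(n^2 \log n)}_{\text{step 1}}
   + n \cdot T_{\text{iteration}}
   = O(n^2 \log n) + O(n^2 \log n)
   = O(n^2 \log n)
\end{align*}
```

Both terms have the same order, so the initial sort and the main loop contribute equally to the asymptotic cost.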

SLIDE 14

Application

Simulated data:

  • complexity ⇒ O(n² log n)

[Plots: Execution times and Complexity; axes: n, t(ms), t/(n² log n)]
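
The normalisation on the plot axis, t/(n² log n), is a simple way to check a complexity claim empirically: if the running time grows as n² log n, the ratio should level off to a constant as n grows. Below is a minimal sketch with a hypothetical helper `normalized_times`; the usage values are synthetic, generated from the model itself, and are not measurements from the presentation.

```python
import math


def normalized_times(sizes, times_ms):
    """Return t / (n^2 log n) for each (n, t) pair."""
    return [t / (n * n * math.log(n)) for n, t in zip(sizes, times_ms)]


# Synthetic illustration only: times generated from t = c * n^2 * log(n).
c = 2e-3
sizes = [100, 200, 400, 800]
synthetic_times = [c * n * n * math.log(n) for n in sizes]
ratios = normalized_times(sizes, synthetic_times)
```

On real timings the ratios would fluctuate. A ratio that keeps growing with n suggests a higher-order cost, and one that keeps shrinking suggests the bound is pessimistic for the tested inputs.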

SLIDE 15

Conclusions

Contributions:

  • a new formulation of the 2-3 AHC algorithm
  • a reduction of complexity
  • a first implementation of the 2-3 AHC algorithm (Java) and its integration in CBR* Tools, a Case Based Reasoning framework
  • an experimental validation of the complexity on simulated data

Ongoing and future work:

  • study of the quality of the 2-3 AHC compared with AHC and other classification methods
  • study of the applicability of 2-3 AHC in the context of Web Usage Mining