

SLIDE 1

Scalable Clustering of Categorical Data and Applications

Periklis Andritsos periklis@dit.unitn.it

University of Trento

SLIDE 2

January 9, 2006 Periklis Andritsos

Problem Definition

  • Clustering is a procedure that groups members of a population into similar categories, or clusters
  • Why is clustering important?
    - Get insight into the way data is distributed
    - Preprocess an initial data set

SLIDE 3

Exercise from the real world

SLIDE 4

  • In September 2004, The New York Times reported the launch of Clusty (http://www.clusty.com/)
    - Given a query, the meta-search engine places relevant web documents into groups
    - Clusty applies clustering technology to ten different types of web content, including material from the web, image, news, and shopping databases
  • Example:
SLIDE 5

Exercise from the real world … of Digital Cameras

  • IDC (http://www.idc.com/) employs clustering for customer segmentation purposes
  • Example: digital camera companies investigated ways to increase sales during the Christmas holidays
  • IDC surveyed 1,000 U.S. consumers at 50 malls
    - "Likely buyers will be more motivated to buy a digital camera knowing that digital images can be displayed on a TV, printed using a PC-less photo-quality printer, or printed at traditional film developer outlets"
    - "Likely buyers capture more images per month on their film cameras than unlikely buyers"
  • Companies were able to target the proper market segment

Source: http://www.imaging-resource.com/NEWS/1037573998.html

SLIDE 6

Understanding software systems

[Figure: module dependency graphs of a software system, with components such as boxParserClass, boxScannerClass, Lexer, Fonts, Globals, MathLib, EdgeClass, ColorTable, stackOfAny, stackOfAnyWithTop, NodeOfAny, hashedBoxes, hashGlobals, GenerateProlog, Event, error, and main] [MMBRCG'98]

SLIDE 7

What if we have ….

[Andritsos, Miller: IEEE Int’l Workshop on Program Comprehension, 2001]

Some information cannot be depicted

SLIDE 8

Integrated Information

  • We deal with data that:
    - are stored in heterogeneous sources
    - exist under different formats
    - are available online (with schemas)
      Schema: a type specification of a collection of data
  • We often need to integrate data, which introduces errors

[Figure: integrated information assembled from an XML repository (<title>, <author>, <year>), an information system (Customer, Order, Scheduled Delivery, Product, Salesperson), a relational database (cust, emp, dept, dno, dna), and an O-O database]

SLIDE 9

Cluster Analysis Stages

Data Collection -> Initial Screening -> Representation -> Clustering Strategy -> Validation -> Interpretation
(Focus of my work)

  • The intention was not to build yet another clustering algorithm, but one that adheres to real-world constraints

SLIDE 10

Requirements

  • Perform good-quality clustering on different data types
    - The majority of existing commercial algorithms perform clustering of objects expressed over numerical values
  • Scalability
    - The optimal solution to clustering is hard to find, and existing heuristic techniques do not necessarily perform well on large inputs
  • Parameter setting
    - Many algorithms expect the user to give a set of (sometimes) unintuitive parameters
  • Inclusion of descriptive information in software clustering
    - Software clustering techniques use structural information exclusively

SLIDE 11

Calculating Distance

  • Categorical data: no single ordering of values

    movie           director    actor       genre
    Godfather II    Scorsese    De Niro     Crime
    Good Fellas     Coppola     De Niro     Crime
    Vertigo         Hitchcock   J. Stewart  Thriller
    N by NW         Hitchcock   C. Grant    Thriller
    Bishop's Wife   Koster      C. Grant    Comedy
    Harvey          Koster      J. Stewart  Comedy

  • Numerical data: Lp metrics defined (Euclidean, Manhattan)

    employee   age   salary
    John       25    $5,000
    Mary       26    $6,000
    Peter      30    $2,500
    Jenny      32    $60,000
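To make the contrast concrete, here is a small illustrative sketch (my own, not part of the talk): Lp metrics such as Manhattan and Euclidean apply directly to the numerical rows, while the best a naive distance-based view can do for categorical rows is count attribute mismatches.

```python
# Illustrative only: Lp metrics apply to numerical rows, but categorical
# values admit no single ordering, so a naive fallback is attribute mismatch.

def minkowski(x, y, p):
    """Lp distance between two numerical vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def mismatch(x, y):
    """Naive categorical distance: fraction of attributes that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

john, mary = (25, 5000), (26, 6000)
print(minkowski(john, mary, 1))   # Manhattan (L1)
print(minkowski(john, mary, 2))   # Euclidean (L2)

godfather = ("Scorsese", "De Niro", "Crime")
vertigo = ("Hitchcock", "J. Stewart", "Thriller")
print(mismatch(godfather, vertigo))
```

The mismatch measure treats all disagreements equally, which is exactly the weakness that motivates the information-theoretic distances used later in the talk.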

SLIDE 12

Agglomerative Clustering

Agglomerative, or hierarchical, clustering in Euclidean space on 6 points (A, B, C, D, E, F). We need to compute the distance between objects as well as between objects and sub-clusters.
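The merge loop behind such a dendrogram can be sketched in a few lines. The coordinates for the six points below are invented for illustration, and centroid linkage is just one of several possible linkage choices.

```python
# A minimal sketch of agglomerative clustering in Euclidean space, assuming
# six points A-F as on the slide (the coordinates here are made up).
import math

def centroid(cluster, pts):
    """Mean point of a cluster of labels."""
    xs = [pts[p][0] for p in cluster]
    ys = [pts[p][1] for p in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def agglomerate(pts, k):
    clusters = [{p} for p in pts]            # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the closest centroids ...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(centroid(clusters[ij[0]], pts),
                                     centroid(clusters[ij[1]], pts)))
        clusters[i] |= clusters[j]           # ... and merge them
        del clusters[j]
    return clusters

points = {"A": (0, 0), "B": (1, 0), "C": (5, 5), "D": (6, 5), "E": (0, 9), "F": (1, 9)}
print(agglomerate(points, 3))
```

Each iteration merges one pair, so stopping the loop at different values of k yields the nested levels of the hierarchy.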

SLIDE 13

Contributions

  • Developed LIMBO, an algorithm that
    - is hierarchical
    - clusters categorical data using a small number of parameters
    - is scalable as the size of the input increases
    International Conference on Extending Data Base Technology (EDBT'04)
  • Studied software systems using both structural and non-structural information
    - The algorithm incorporates information such as the Developer, Lines of Code, or Directory structure
    International Working Conference on Reverse Engineering (WCRE'03); IEEE Transactions on Software Engineering (TSE'05)
  • Proposed a set of Information-Theoretic tools to discover structure in large data sets
    ACM International Conference on the Management of Data (SIGMOD'04); International Workshop on Information Integration on the Web (WEB'02); IEEE Data Engineering Bulletin 2002, 2003

SLIDE 14

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 15

Clustering Categorical Data

  • Cluster rows (objects) in order to preserve as much information as possible about the attribute values

[Table: the movie data set shown earlier, with candidate groupings of its rows]

  • One grouping preserves the information for actor and genre but leaves two choices for director; another leaves two choices for director, actor, and genre

SLIDE 16

Clustering Categorical Data

  • Cluster rows (objects) in order to preserve as much information as possible about the attribute values

[Table: the movie data set shown earlier, with a different grouping of its rows]

  • This grouping preserves the information for director and genre but leaves two choices for actor; the alternative leaves three choices for director and actor, and two for genre

SLIDE 17

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 18

Information Theory Basics

  • Entropy: H(X) = - Σx p(x) log p(x)
    - Measures the uncertainty in a random variable
  • Conditional Entropy: H(X|Y)
    - Measures the uncertainty of one variable knowing the values of another
  • Mutual Information: I(X;Y) = H(X) - H(X|Y)
    - Measures the dependence of two random variables
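These three quantities are easy to compute for finite distributions. The sketch below (dictionary-based, log base 2, so units are bits) is illustrative and not tied to any particular implementation from the talk.

```python
# Sketch of the three quantities on the slide, assuming finite distributions
# given as dictionaries mapping outcomes to probabilities.
import math

def H(p):
    """Entropy H(X) = -sum p(x) log p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), computed from a joint distribution p(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    # H(X|Y) = -sum p(x,y) log p(x|y)
    h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)
    return H(px) - h_x_given_y

# A fair coin copied exactly: knowing Y removes all uncertainty about X.
joint = {("h", "h"): 0.5, ("t", "t"): 0.5}
print(H({"h": 0.5, "t": 0.5}))       # 1.0 bit
print(mutual_information(joint))     # 1.0 bit
```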

SLIDE 19

Information Theoretic Clustering

  • T: a random variable that ranges over the rows
  • V: a random variable that ranges over the attribute values
  • I(T;V): mutual information of T and V
  • Information Bottleneck Method [TPB'99]
    - Compress T into a clustering Ck so that the information preserved about V is maximal (k = number of clusters)
    - Optimization criterion: minimize I(T;V) - I(Ck;V), i.e., minimize the Information Loss
SLIDE 20

Computing Information Loss

  • Representation: every cluster ci is represented by
    - its probability p(ci) = n(ci)/n
    - the conditional probability of the values in V given the cluster, p(V|ci)
  • This information is sufficient to compute the Information Loss
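As a hedged sketch of how p(ci) and p(V|ci) suffice: in the agglomerative IB setting, the information lost by merging two clusters works out to their combined mass times the weighted Jensen-Shannon divergence of their conditional distributions. The function names below are mine, not the talk's.

```python
# Sketch: information loss of merging clusters c1, c2 with masses p1, p2 and
# conditional rows p(V|c1), p(V|c2), via the weighted Jensen-Shannon divergence.
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_loss(p1, p2, cond1, cond2):
    """(p1 + p2) * JS divergence of cond1, cond2 with weights p1, p2."""
    w1, w2 = p1 / (p1 + p2), p2 / (p1 + p2)
    mix = [w1 * a + w2 * b for a, b in zip(cond1, cond2)]   # p(V|c*) of the merge
    js = w1 * kl(cond1, mix) + w2 * kl(cond2, mix)          # weighted JS divergence
    return (p1 + p2) * js

# Identical distributions merge for free; disjoint ones cost the most.
print(merge_loss(0.5, 0.5, [0.5, 0.5], [0.5, 0.5]))  # 0.0
print(merge_loss(0.5, 0.5, [1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Note that only the pair (p(ci), p(V|ci)) enters the computation, which is exactly the sufficiency claim on the slide.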
SLIDE 21

Agglomerative IB [ST99]

  • Computes an (n x n) distance matrix using Information Loss as the distance
  • Merges the pair of sub-clusters with the minimum Information Loss

[Table: the movie data set shown earlier]

SLIDE 22

  • The agglomerative approach (AIB) has quadratic complexity, since we need to compute an (n x n) distance matrix
  • LIMBO algorithm (scaLable InforMation BOttleneck)
    - produce a summary of the data
    - apply agglomerative clustering on the summary
  • Summary = Distributional Cluster Features: DCF(c) = ( n(c), p(V|c) )
  • DCFs can be computed incrementally:

    DCF(c*) = ( n(c1) + n(c2) , [n(c1)/(n(c1)+n(c2))] p(V|c1) + [n(c2)/(n(c1)+n(c2))] p(V|c2) )
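The incremental DCF computation can be written down directly. Representing p(V|c) as a list of probabilities is an assumption made here for brevity.

```python
# A sketch of the incremental DCF merge from the slide: DCF(c) = (n(c), p(V|c)),
# where the merged conditional is the count-weighted average of the two rows.

def merge_dcf(dcf1, dcf2):
    """Merge two Distributional Cluster Features (n(c), p(V|c))."""
    n1, cond1 = dcf1
    n2, cond2 = dcf2
    n = n1 + n2
    cond = [(n1 * a + n2 * b) / n for a, b in zip(cond1, cond2)]
    return (n, cond)

# Two singleton rows over three attribute values:
t1 = (1, [1.0, 0.0, 0.0])
t2 = (1, [0.0, 1.0, 0.0])
print(merge_dcf(t1, t2))   # (2, [0.5, 0.5, 0.0])
```

Because the merge needs only the two DCFs themselves, summaries can be built in a single pass without revisiting the original rows.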

SLIDE 23

LIMBO Algorithm

[Table: the movie data set shown earlier]

  • Read one row at a time and convert it to DCF(t)
  • Find the closest summary DCF(c)
  • Check whether DCF(t) can be merged into the closest DCF(c), using the threshold τ = φ · I(T;V) / n
  • Apply AIB on the resulting DCFs

[Figure: rows t1, ..., t4 are summarized into DCF(c1) and DCF(c2), which AIB then merges into DCF(c3)]
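The steps above can be sketched as a simplified, self-contained summarization pass. A flat list of summaries stands in for LIMBO's DCF tree here, and the merge-cost and threshold handling are illustrative rather than a faithful implementation.

```python
# Simplified sketch of LIMBO's summarization pass: each row becomes a DCF and
# is absorbed into the closest existing summary when the merge cost is at most
# the threshold tau (a flat list stands in for the DCF tree).
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(dcf1, dcf2, n_total):
    """Information lost by merging two DCFs, normalized by the data set size."""
    (n1, c1), (n2, c2) = dcf1, dcf2
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mix = [w1 * a + w2 * b for a, b in zip(c1, c2)]
    return ((n1 + n2) / n_total) * (w1 * kl(c1, mix) + w2 * kl(c2, mix))

def summarize(rows, tau):
    """One pass over the rows; each row is a p(V|t) vector for a singleton tuple."""
    n_total = len(rows)
    summaries = []
    for cond in rows:
        t = (1, cond)
        best = min(summaries, key=lambda s: merge_cost(t, s, n_total), default=None)
        if best is not None and merge_cost(t, best, n_total) <= tau:
            n1, c1 = best
            n = n1 + 1
            summaries[summaries.index(best)] = (n, [(n1 * a + b) / n for a, b in zip(c1, cond)])
        else:
            summaries.append(t)              # start a new summary
    return summaries

rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(len(summarize(rows, tau=0.01)))   # 2 summaries: identical rows merge for free
```

A larger tau compresses more aggressively (fewer, coarser summaries); tau = 0 keeps every distinct row separate.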

SLIDE 24

LIMBO vs. Requirements (1)

  • Scalability
    - Proposes a new (incremental) computation of summaries
    - Requires a single pass over the data to compute cluster summaries
  • Parameter setting
    - Number of clusters: the hierarchical nature of LIMBO permits the production of a range of cluster cardinalities
    - Parameter φ: it should be set, but…
    - A version of LIMBO with parameter S (in place of φ) bounds the amount of memory allocated to the summaries

SLIDE 25

Experimental Evaluation

  • Data Sets
    - Standard testing data (UCI Machine Learning Repository)
    - Web data
    - Legacy data (DBLP)
    - Software engineering (VIM source code)
    - Synthetic: up to 10 million rows, 20 columns with 10-20 values each
  • Quality Criteria
    - Information Loss
    - Precision, Recall, Classification Error
    - Category Utility

SLIDE 26

Qualitative Results

  • φ controls the information lost

DS10 (n = 5,000, 10 attributes, 5 clusters)

    Algorithm                      Size   Info Loss (%)   Precision   Recall   Class. Error   Category Utility
    LIMBO (φ=0.0)                  5000   77.56           0.998       0.998    0.002          2.56
    ROCK [Guha et al. '00]                85.00           0.839       0.724    0.28           0.4
    COOLCAT [Barbara et al. '00]   125    78.02           0.995       0.995    0.05           2.54
    LIMBO (φ=1.0)                  66     77.56           0.998       0.998    0.002          2.56

  • With a 98% reduction in the size of the model, LIMBO remains superior

SLIDE 27

Scalability Results

[Figure: execution time as the input grows; x-axis: Number of Rows (x 1M)]

SLIDE 28

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 29

Clustering Software Data

  • Input:
    - a set of software artifacts (files, classes) [nodes]
    - structural information, i.e., interdependencies between the artifacts (invocations, inheritance) [edges]
    - non-structural information (timestamps, ownership) [node properties]
  • Goal:
    - Partition the artifacts into "meaningful" groups in order to understand and maintain the software system
    - Data analysis tools produce valuable information about source code

SLIDE 30

Example snapshot

[Figure: dependency snapshot with program files f1, f2, f3 and utility files u1, u2; the utility files are used by the same program files and have almost the same dependencies]

  • What if: "f1" and "f2" are developed by Bob, "f3" and "u1" by Alice, and "u2" by Mary?

SLIDE 31

Solution

  • Structural information
    - Artifacts are expressed over the artifacts on which they depend
    - LIMBO can be applied to perform clustering
  • Structural and non-structural information
    - Artifacts are expressed over the artifacts on which they depend AND over their properties
    - LIMBO can then be applied

SLIDE 32

Experimental Evaluation

  • A variety of data sets for which we had file usage information
    - TOBEY: ~1,000 files (250,000 lines of code in total)
    - LINUX: ~1,000 files (750,000 lines of code in total)
    - Mozilla: ~2,500 files (4,000,000 lines of code in total)
  • Non-structural information considered (combinations of)
    - Developers (dev)
    - Directory (dir)
    - Lines of Code (loc)
    - Time of Last Update (time)
  • Comparison against
    - standard cluster analysis algorithms
    - cluster analysis algorithms that identify patterns in the software graph
    - cluster analysis algorithms that adhere to software engineering principles

SLIDE 33

Results

  • Structural
    - LIMBO outperformed all other clustering algorithms
    - Utility subsystems were discovered
  • Non-structural
    - "dir" information produced better decompositions
    - "dev" information has a positive effect
    - "time" leads to worse clusterings

SLIDE 34

LIMBO vs. Requirements (2)

  • Inclusion of non-structural information
    - The representation used within LIMBO allows for the inclusion of descriptive information available about the objects to be clustered

SLIDE 35

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 36

Finding Horizontal Partitions

  • When data sets evolve or get integrated:
    - older constraints are relaxed
    - heterogeneous groups of rows may be added
  • Using LIMBO, our goal is to find groups of homogeneous information content
  • Horizontal partitioning can be used independently from the rest of the tools
    - Hierarchical in nature: we can exploit any number of partitions

SLIDE 37

Identifying Structure

  • Our work complements previous results in two areas:
    - Data browsing
      - Discover how data joins together
      - Perform data cleaning
    - Information-theoretic characterization of database design
      - Characterize constraints in a data set [Dalkilic and Robertson '00]
      - Provide metrics that assess design quality [Arenas and Libkin '03]
  • Our intention is to bring some of the theory of the information-theoretic characterization of databases into a tool by making use of clustering

SLIDE 38

Identifying Structure

  • We can use LIMBO in order to
    - Horizontally partition a data set
      - Integrated information contains heterogeneous groups
    - Find naturally co-occurring values in the data
      - By reversing the roles of rows and values, we can cluster the values so that they preserve information about the rows in which they appear
    - Find groups of attributes that share data with high duplication
      - Groups of naturally co-occurring values provide useful hints about the duplication that exists in the attributes

SLIDE 39

Finding Duplicated Values

  • In horizontal partitioning we group tuples in order to preserve information about their values
  • What if we exchange the roles of tuples and values in the clustering process?
  • Attribute Value Clustering: group values such that the information about the tuples in which they appear is preserved
  • This clustering groups naturally co-occurring values
SLIDE 40

Example

Tuples (t1-t5):

    tuple   Director    Actor    Genre
    t1      Scorsese    DeNiro   Crime
    t2      Scorsese    DeNiro   Comedy
    t3      Hitchcock   Grant    Thriller
    t4      Mehta       Grant    Thriller
    t5      Luhrman     Grant    Thriller

Value occurrences:

    Value       Count   List of Tuples
    Scorsese    2       {t1,t2}
    DeNiro      2       {t1,t2}
    Hitchcock   1       {t3}
    Mehta       1       {t4}
    Luhrman     1       {t5}
    Grant       3       {t3,t4,t5}
    Thriller    3       {t3,t4,t5}
    Comedy      1       {t2}
    Crime       1       {t1}

If we allow no information loss, we get the perfectly correlated values: Cluster 1: {Scorsese, DeNiro}; Cluster 2: {Grant, Thriller}
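The zero-information-loss case can be reproduced mechanically: values whose sets of supporting tuples coincide can be merged without losing any information about the tuples. A toy sketch, assuming the five-tuple movie table on this slide:

```python
# Toy illustration of attribute value clustering by role reversal, assuming
# the five-tuple movie table from the slide.
from collections import defaultdict

tuples = {
    "t1": ("Scorsese", "DeNiro", "Crime"),
    "t2": ("Scorsese", "DeNiro", "Comedy"),
    "t3": ("Hitchcock", "Grant", "Thriller"),
    "t4": ("Mehta", "Grant", "Thriller"),
    "t5": ("Luhrman", "Grant", "Thriller"),
}

# Invert the table: each value -> the set of tuples it occurs in.
occurs = defaultdict(set)
for tid, row in tuples.items():
    for value in row:
        occurs[value].add(tid)

# Values with identical tuple-sets are indistinguishable to the tuples,
# so merging them loses no information about T.
by_tuple_set = defaultdict(set)
for value, tids in occurs.items():
    by_tuple_set[frozenset(tids)].add(value)

correlated = [vals for vals in by_tuple_set.values() if len(vals) > 1]
print(correlated)   # two clusters: {Scorsese, DeNiro} and {Grant, Thriller}
```

Allowing a small, positive information loss (as LIMBO does) would additionally merge values whose tuple-sets are merely similar rather than identical.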

SLIDE 41

Example

Matrix with counts of values in attributes:

    Value       Director   Actor   Genre
    Scorsese    2
    Hitchcock   1
    Luhrman     1
    Mehta       1
    DeNiro                 2
    Grant                  3
    Thriller                       3
    Comedy                         1
    Crime                          1

SLIDE 42

Example (Cont.)

After clustering:

    Value              Director   Actor   Genre
    Scorsese, DeNiro   2          2
    Hitchcock          1
    Luhrman            1
    Mehta              1
    Grant, Thriller               3       3
    Comedy                                1
    Crime                                 1

Collapsed onto the attribute groups:

    Value      Cluster 1   Cluster 2
    Director   2
    Actor      2           3
    Genre                  3

SLIDE 43

Characterizing Attribute Redundancy

  • Attributes are clustered "earlier" when they contain more duplicate values
  • We can show the following: given sets of attributes C1, C2, and C3, if IL(C1,C2) < IL(C1,C3), then Duplication(C1,C2) > Duplication(C1,C3) (IL = Information Loss)

SLIDE 44

Experience

  • Data set: DBLP
    - Conference, journal, thesis, etc. publication records
    - 50K tuples, 13 attributes, and ~60K values
    - Very large number of NULL values
  • Used FDEP to discover FDs from the current instance
SLIDE 45

DBLP

  • First performed attribute grouping
  • The first step was to horizontally partition the data set into three clusters using an information-theoretic heuristic

[Figure: the three horizontal partitions; one region is dominated by NULLs]

SLIDE 46

DBLP (Cont.)

  • The partitions correspond to Conferences (11 FDs) and Journals (9 FDs)
  • The third cluster contains random publications and no FDs
  • The use of ranked FDs should be interactive
SLIDE 47

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 48

Conclusions

  • Motivated the problem of categorical clustering and the need to use measures different from the traditional distance measures
  • Presented LIMBO, an algorithm that
    - is hierarchical
    - is scalable
    - requires minimal setting of parameters
    - performs well on integrated information
    - incorporates any type of information about the objects to be clustered
    - although necessarily heuristic, increases efficiency without loss in quality w.r.t. non-scalable information-theoretic methods
    - is insensitive to the order of input (experimentally observed)

SLIDE 49

Future Directions

  • Collect and process information in an online fashion
    - on-the-fly
    - within a bounded amount of memory
  • Example: U.S. airports are starting to incorporate RFID tags on passenger suitcases for routing and quality-control purposes (Las Vegas McCarran Airport)
  • Collection, summarization, and error identification
    - at different regional centers (airports)
    - at different aggregation points (provincial authorities)

SLIDE 50

Future Directions

  • Envision a large clustering suite, where users can
    - Cluster data from several application domains
      - Need extensions to cluster numerical and categorical data at the same time
    - Produce even better qualitative results
      - Apply different methods in conjunction and aggregate their results
    - Provide ranking capabilities for the clusters
      - Perform clustering with emphasis on the semantics of the produced clusters
    - Identify and correct errors
      - Provide automatic ways to replace erroneous values with other values from the data
    - Incorporate value importance
      - Allow the user to associate his/her preference with particular values
      - Incorporate automatic weighting schemes that reflect the fact that, e.g., some files in a software system are more important than others

SLIDE 51

Grazie !

Periklis Andritsos
URL: http://www.cs.toronto.edu/~periklis
e-mail: periklis@dit.unitn.it

You may contact me anytime to learn more about Trento and our projects.