

SLIDE 1

Scalable Clustering of Categorical Data and Applications

Periklis Andritsos periklis@dit.unitn.it

University of Trento

SLIDE 2

January 9, 2006 Periklis Andritsos

Problem Definition

  • Clustering is a procedure that groups members of a population into similar categories, or clusters
  • Why is clustering important?
    - Get insight into the way data is distributed
    - Preprocess an initial data set

SLIDE 3

Exercise from the real world

SLIDE 4

  • In September 2004, The New York Times reported the launch of Clusty (http://www.clusty.com/)
    - Given a query, the meta-search engine places relevant web documents into groups
    - Clusty applies clustering technology to ten different types of web content, including material from the web, image, news, and shopping databases
  • Example:
SLIDE 5

Exercise from the real world … of Digital Cameras

  • IDC (http://www.idc.com/) employs clustering for customer segmentation purposes
  • Example: digital camera companies investigated ways to increase sales during the Christmas holidays
  • IDC surveyed 1,000 U.S. consumers at 50 malls
    - "Likely buyers will be more motivated to buy a digital camera knowing that digital images can be displayed on a TV, printed using a PC-less photo-quality printer, or printed at traditional film developer outlets"
    - "Likely buyers capture more images per month on their film cameras than unlikely buyers"
  • Companies were able to target the proper market segment

Source: http://www.imaging-resource.com/NEWS/1037573998.html

SLIDE 6

Understanding software systems

[Figure: module dependency graphs of a software system, with components such as boxParserClass, boxScannerClass, Lexer, Fonts, Globals, MathLib, EdgeClass, ColorTable, stackOfAny, stackOfAnyWithTop, NodeOfAny, hashedBoxes, hashGlobals, GenerateProlog, Event, error, and main] [MMBRCG'98]

SLIDE 7

What if we have ….

[Andritsos, Miller: IEEE Int’l Workshop on Program Comprehension, 2001]

Some information cannot be depicted

SLIDE 8

Integrated Information

  • We deal with data that:
    - are stored in heterogeneous sources
    - exist under different formats
    - are available online (with schemas)
      Schema: a type specification of a collection of data
  • We often need to integrate data, which introduces errors

[Figure: integrated information assembled from an XML repository (<title>, <author>, <year>), an information system (Customer, Order, Scheduled Delivery, Product, Salesperson), a relational database (cust, emp, dept, dno, dna), and an O-O database]

SLIDE 9

Cluster Analysis Stages

Data Collection -> Initial Screening -> Representation -> Clustering Strategy -> Validation -> Interpretation
(Focus of my work)

  • The intention was not to build yet another clustering algorithm, but one that adheres to real-world constraints

SLIDE 10

Requirements

  • Perform good-quality clustering on different data types
    - The majority of existing commercial algorithms perform clustering of objects expressed over numerical values
  • Scalability
    - The optimal solution to clustering is hard to find, and existing heuristic techniques do not necessarily perform well on large inputs
  • Parameter setting
    - Many algorithms expect the user to give a set of (sometimes) unintuitive parameters
  • Inclusion of descriptive information in software clustering
    - Software clustering techniques use structural information exclusively

SLIDE 11

Calculating Distance

  • Categorical data: no single ordering of values

    movie           director    actor       genre
    Godfather II    Scorsese    De Niro     Crime
    Good Fellas     Coppola     De Niro     Crime
    Vertigo         Hitchcock   J. Stewart  Thriller
    N by NW         Hitchcock   C. Grant    Thriller
    Bishop's Wife   Koster      C. Grant    Comedy
    Harvey          Koster      J. Stewart  Comedy

  • Numerical data: Lp metrics defined (Euclidean, Manhattan)

    employee   age   salary
    John       25    $5,000
    Mary       26    $6,000
    Peter      30    $2,500
    Jenny      32    $60,000
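To make the contrast concrete, here is a small illustrative sketch (my own, not part of the talk): Lp metrics such as Manhattan and Euclidean apply directly to the numerical rows, while the best a naive distance-based view can do for categorical rows is count attribute mismatches.

```python
# Illustrative only: Lp metrics apply to numerical rows, but categorical
# values admit no single ordering, so a naive fallback is attribute mismatch.

def minkowski(x, y, p):
    """Lp distance between two numerical vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def mismatch(x, y):
    """Naive categorical distance: fraction of attributes that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

john, mary = (25, 5000), (26, 6000)
print(minkowski(john, mary, 1))   # Manhattan (L1)
print(minkowski(john, mary, 2))   # Euclidean (L2)

godfather = ("Scorsese", "De Niro", "Crime")
vertigo = ("Hitchcock", "J. Stewart", "Thriller")
print(mismatch(godfather, vertigo))
```

The mismatch measure treats all disagreements equally, which is exactly the weakness that motivates the information-theoretic distances used later in the talk.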

SLIDE 12

Agglomerative Clustering

Agglomerative, or hierarchical, clustering in Euclidean space on 6 points (A, B, C, D, E, F). We need to compute the distance between objects as well as between objects and sub-clusters.
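The merge loop behind such a dendrogram can be sketched in a few lines. The coordinates for the six points below are invented for illustration, and centroid linkage is just one of several possible linkage choices.

```python
# A minimal sketch of agglomerative clustering in Euclidean space, assuming
# six points A-F as on the slide (the coordinates here are made up).
import math

def centroid(cluster, pts):
    """Mean point of a cluster of labels."""
    xs = [pts[p][0] for p in cluster]
    ys = [pts[p][1] for p in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def agglomerate(pts, k):
    clusters = [{p} for p in pts]            # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the closest centroids ...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(centroid(clusters[ij[0]], pts),
                                     centroid(clusters[ij[1]], pts)))
        clusters[i] |= clusters[j]           # ... and merge them
        del clusters[j]
    return clusters

points = {"A": (0, 0), "B": (1, 0), "C": (5, 5), "D": (6, 5), "E": (0, 9), "F": (1, 9)}
print(agglomerate(points, 3))
```

Each iteration merges one pair, so stopping the loop at different values of k yields the nested levels of the hierarchy.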

SLIDE 13

Contributions

  • Developed LIMBO, an algorithm that
    - is hierarchical
    - clusters categorical data using a small number of parameters
    - is scalable as the size of the input increases
    International Conference on Extending Data Base Technology (EDBT'04)
  • Studied software systems using both structural and non-structural information
    - The algorithm incorporates information such as the Developer, Lines of Code, or Directory structure
    International Working Conference on Reverse Engineering (WCRE'03); IEEE Transactions on Software Engineering (TSE'05)
  • Proposed a set of Information-Theoretic tools to discover structure in large data sets
    ACM International Conference on the Management of Data (SIGMOD'04); International Workshop on Information Integration on the Web (WEB'02); IEEE Data Engineering Bulletin 2002, 2003

SLIDE 14

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 15

Clustering Categorical Data

  • Cluster rows (objects) in order to preserve as much information as possible about the attribute values

[Table: the movie data set shown earlier, with candidate groupings of its rows]

  • One grouping preserves the information for actor and genre but leaves two choices for director; another leaves two choices for director, actor, and genre

SLIDE 16

Clustering Categorical Data

  • Cluster rows (objects) in order to preserve as much information as possible about the attribute values

[Table: the movie data set shown earlier, with a different grouping of its rows]

  • This grouping preserves the information for director and genre but leaves two choices for actor; the alternative leaves three choices for director and actor, and two for genre

SLIDE 17

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 18

Information Theory Basics

  • Entropy: H(X) = - Σx p(x) log p(x)
    - Measures the uncertainty in a random variable
  • Conditional Entropy: H(X|Y)
    - Measures the uncertainty of one variable knowing the values of another
  • Mutual Information: I(X;Y) = H(X) - H(X|Y)
    - Measures the dependence of two random variables
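These three quantities are easy to compute for finite distributions. The sketch below (dictionary-based, log base 2, so units are bits) is illustrative and not tied to any particular implementation from the talk.

```python
# Sketch of the three quantities on the slide, assuming finite distributions
# given as dictionaries mapping outcomes to probabilities.
import math

def H(p):
    """Entropy H(X) = -sum p(x) log p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), computed from a joint distribution p(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    # H(X|Y) = -sum p(x,y) log p(x|y)
    h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)
    return H(px) - h_x_given_y

# A fair coin copied exactly: knowing Y removes all uncertainty about X.
joint = {("h", "h"): 0.5, ("t", "t"): 0.5}
print(H({"h": 0.5, "t": 0.5}))       # 1.0 bit
print(mutual_information(joint))     # 1.0 bit
```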

SLIDE 19

Information Theoretic Clustering

  • T: a random variable that ranges over the rows
  • V: a random variable that ranges over the attribute values
  • I(T;V): mutual information of T and V
  • Information Bottleneck Method [TPB'99]
    - Compress T into a clustering Ck so that the information preserved about V is maximal (k = number of clusters)
    - Optimization criterion: minimize I(T;V) - I(Ck;V), i.e., minimize the Information Loss
SLIDE 20

Computing Information Loss

  • Representation: every cluster ci is represented by
    - its probability p(ci) = n(ci)/n
    - the conditional probability of the values in V given the cluster, p(V|ci)
  • This information is sufficient to compute the Information Loss
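As a hedged sketch of how p(ci) and p(V|ci) suffice: in the agglomerative IB setting, the information lost by merging two clusters works out to their combined mass times the weighted Jensen-Shannon divergence of their conditional distributions. The function names below are mine, not the talk's.

```python
# Sketch: information loss of merging clusters c1, c2 with masses p1, p2 and
# conditional rows p(V|c1), p(V|c2), via the weighted Jensen-Shannon divergence.
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_loss(p1, p2, cond1, cond2):
    """(p1 + p2) * JS divergence of cond1, cond2 with weights p1, p2."""
    w1, w2 = p1 / (p1 + p2), p2 / (p1 + p2)
    mix = [w1 * a + w2 * b for a, b in zip(cond1, cond2)]   # p(V|c*) of the merge
    js = w1 * kl(cond1, mix) + w2 * kl(cond2, mix)          # weighted JS divergence
    return (p1 + p2) * js

# Identical distributions merge for free; disjoint ones cost the most.
print(merge_loss(0.5, 0.5, [0.5, 0.5], [0.5, 0.5]))  # 0.0
print(merge_loss(0.5, 0.5, [1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Note that only the pair (p(ci), p(V|ci)) enters the computation, which is exactly the sufficiency claim on the slide.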
SLIDE 21

Agglomerative IB [ST99]

  • Computes an (n x n) distance matrix using Information Loss as the distance
  • Merges the pair of sub-clusters with the minimum Information Loss

[Table: the movie data set shown earlier]

SLIDE 22

  • The agglomerative approach (AIB) has quadratic complexity, since we need to compute an (n x n) distance matrix
  • LIMBO algorithm (scaLable InforMation BOttleneck)
    - produce a summary of the data
    - apply agglomerative clustering on the summary
  • Summary = Distributional Cluster Features: DCF(c) = ( n(c), p(V|c) )
  • DCFs can be computed incrementally:

    DCF(c*) = ( n(c1) + n(c2) , [n(c1)/(n(c1)+n(c2))] p(V|c1) + [n(c2)/(n(c1)+n(c2))] p(V|c2) )
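The incremental DCF computation can be written down directly. Representing p(V|c) as a list of probabilities is an assumption made here for brevity.

```python
# A sketch of the incremental DCF merge from the slide: DCF(c) = (n(c), p(V|c)),
# where the merged conditional is the count-weighted average of the two rows.

def merge_dcf(dcf1, dcf2):
    """Merge two Distributional Cluster Features (n(c), p(V|c))."""
    n1, cond1 = dcf1
    n2, cond2 = dcf2
    n = n1 + n2
    cond = [(n1 * a + n2 * b) / n for a, b in zip(cond1, cond2)]
    return (n, cond)

# Two singleton rows over three attribute values:
t1 = (1, [1.0, 0.0, 0.0])
t2 = (1, [0.0, 1.0, 0.0])
print(merge_dcf(t1, t2))   # (2, [0.5, 0.5, 0.0])
```

Because the merge needs only the two DCFs themselves, summaries can be built in a single pass without revisiting the original rows.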

SLIDE 23

LIMBO Algorithm

[Table: the movie data set shown earlier]

  • Read one row at a time and convert it to DCF(t)
  • Find the closest summary DCF(c)
  • Check whether DCF(t) can be merged into the closest DCF(c), using the threshold τ = φ · I(T;V) / n
  • Apply AIB on the resulting DCFs

[Figure: rows t1, ..., t4 are summarized into DCF(c1) and DCF(c2), which AIB then merges into DCF(c3)]
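The steps above can be sketched as a simplified, self-contained summarization pass. A flat list of summaries stands in for LIMBO's DCF tree here, and the merge-cost and threshold handling are illustrative rather than a faithful implementation.

```python
# Simplified sketch of LIMBO's summarization pass: each row becomes a DCF and
# is absorbed into the closest existing summary when the merge cost is at most
# the threshold tau (a flat list stands in for the DCF tree).
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(dcf1, dcf2, n_total):
    """Information lost by merging two DCFs, normalized by the data set size."""
    (n1, c1), (n2, c2) = dcf1, dcf2
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mix = [w1 * a + w2 * b for a, b in zip(c1, c2)]
    return ((n1 + n2) / n_total) * (w1 * kl(c1, mix) + w2 * kl(c2, mix))

def summarize(rows, tau):
    """One pass over the rows; each row is a p(V|t) vector for a singleton tuple."""
    n_total = len(rows)
    summaries = []
    for cond in rows:
        t = (1, cond)
        best = min(summaries, key=lambda s: merge_cost(t, s, n_total), default=None)
        if best is not None and merge_cost(t, best, n_total) <= tau:
            n1, c1 = best
            n = n1 + 1
            summaries[summaries.index(best)] = (n, [(n1 * a + b) / n for a, b in zip(c1, cond)])
        else:
            summaries.append(t)              # start a new summary
    return summaries

rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(len(summarize(rows, tau=0.01)))   # 2 summaries: identical rows merge for free
```

A larger tau compresses more aggressively (fewer, coarser summaries); tau = 0 keeps every distinct row separate.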

SLIDE 24

LIMBO vs. Requirements (1)

  • Scalability
    - Proposes a new (incremental) computation of summaries
    - Requires a single pass over the data to compute cluster summaries
  • Parameter setting
    - Number of clusters: the hierarchical nature of LIMBO permits the production of a range of cluster cardinalities
    - Parameter φ: it should be set, but…
    - A version of LIMBO with parameter S (in place of φ) bounds the amount of memory allocated to the summaries

SLIDE 25

Experimental Evaluation

  • Data Sets
    - Standard testing data (UCI Machine Learning Repository)
    - Web data
    - Legacy data (DBLP)
    - Software engineering (VIM source code)
    - Synthetic: up to 10 million rows, 20 columns with 10-20 values each
  • Quality Criteria
    - Information Loss
    - Precision, Recall, Classification Error
    - Category Utility

SLIDE 26

Qualitative Results

  • φ controls the information lost

DS10 (n = 5,000, 10 attributes, 5 clusters)

    Algorithm                      Size   Info Loss (%)   Precision   Recall   Class. Error   Category Utility
    LIMBO (φ=0.0)                  5000   77.56           0.998       0.998    0.002          2.56
    ROCK [Guha et al. '00]                85.00           0.839       0.724    0.28           0.4
    COOLCAT [Barbara et al. '00]   125    78.02           0.995       0.995    0.05           2.54
    LIMBO (φ=1.0)                  66     77.56           0.998       0.998    0.002          2.56

  • With a 98% reduction in the size of the model, LIMBO remains superior

SLIDE 27

Scalability Results

[Figure: execution time as the input grows; x-axis: Number of Rows (x 1M)]

SLIDE 28

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 29

Clustering Software Data

  • Input:
    - a set of software artifacts (files, classes) [nodes]
    - structural information, i.e., interdependencies between the artifacts (invocations, inheritance) [edges]
    - non-structural information (timestamps, ownership) [node properties]
  • Goal:
    - Partition the artifacts into "meaningful" groups in order to understand and maintain the software system
    - Data analysis tools produce valuable information about source code

SLIDE 30

Example snapshot

[Figure: dependency snapshot with program files f1, f2, f3 and utility files u1, u2; the utility files are used by the same program files and have almost the same dependencies]

  • What if: "f1" and "f2" are developed by Bob, "f3" and "u1" by Alice, and "u2" by Mary?

SLIDE 31

Solution

  • Structural information
    - Artifacts are expressed over the artifacts on which they depend
    - LIMBO can be applied to perform clustering
  • Structural and non-structural information
    - Artifacts are expressed over the artifacts on which they depend AND over their properties
    - LIMBO can then be applied

SLIDE 32

Experimental Evaluation

  • A variety of data sets for which we had file usage information
    - TOBEY: ~1,000 files (250,000 lines of code in total)
    - LINUX: ~1,000 files (750,000 lines of code in total)
    - Mozilla: ~2,500 files (4,000,000 lines of code in total)
  • Non-structural information considered (combinations of)
    - Developers (dev)
    - Directory (dir)
    - Lines of Code (loc)
    - Time of Last Update (time)
  • Comparison against
    - standard cluster analysis algorithms
    - cluster analysis algorithms that identify patterns in the software graph
    - cluster analysis algorithms that adhere to software engineering principles

SLIDE 33

Results

  • Structural
    - LIMBO outperformed all other clustering algorithms
    - Utility subsystems were discovered
  • Non-structural
    - "dir" information produced better decompositions
    - "dev" information has a positive effect
    - "time" leads to worse clusterings

SLIDE 34

LIMBO vs. Requirements (2)

  • Inclusion of non-structural information
    - The representation used within LIMBO allows for the inclusion of descriptive information available about the objects to be clustered

SLIDE 35

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 36

Finding Horizontal Partitions

  • When data sets evolve or get integrated:
    - older constraints are relaxed
    - heterogeneous groups of rows may be added
  • Using LIMBO, our goal is to find groups of homogeneous information content
  • Horizontal partitioning can be used independently from the rest of the tools
    - Hierarchical in nature: we can exploit any number of partitions

SLIDE 37

Identifying Structure

  • Our work complements previous results in two areas:
    - Data browsing
      - Discover how data joins together
      - Perform data cleaning
    - Information-theoretic characterization of database design
      - Characterize constraints in a data set [Dalkilic and Robertson '00]
      - Provide metrics that assess design quality [Arenas and Libkin '03]
  • Our intention is to bring some of the theory of the information-theoretic characterization of databases into a tool by making use of clustering

SLIDE 38

Identifying Structure

  • We can use LIMBO in order to
    - Horizontally partition a data set
      - Integrated information contains heterogeneous groups
    - Find naturally co-occurring values in the data
      - By reversing the roles of rows and values, we can cluster the values so that they preserve information about the rows in which they appear
    - Find groups of attributes that share data with high duplication
      - Groups of naturally co-occurring values provide useful hints about the duplication that exists in the attributes

SLIDE 39

Finding Duplicated Values

  • In horizontal partitioning we group tuples in order to preserve information about their values
  • What if we exchange the roles of tuples and values in the clustering process?
  • Attribute Value Clustering: group values such that the information about the tuples in which they appear is preserved
  • This clustering groups naturally co-occurring values
SLIDE 40

Example

Tuples (t1-t5):

    tuple   Director    Actor    Genre
    t1      Scorsese    DeNiro   Crime
    t2      Scorsese    DeNiro   Comedy
    t3      Hitchcock   Grant    Thriller
    t4      Mehta       Grant    Thriller
    t5      Luhrman     Grant    Thriller

Value occurrences:

    Value       Count   List of Tuples
    Scorsese    2       {t1,t2}
    DeNiro      2       {t1,t2}
    Hitchcock   1       {t3}
    Mehta       1       {t4}
    Luhrman     1       {t5}
    Grant       3       {t3,t4,t5}
    Thriller    3       {t3,t4,t5}
    Comedy      1       {t2}
    Crime       1       {t1}

If we allow no information loss, we get the perfectly correlated values: Cluster 1: {Scorsese, DeNiro}; Cluster 2: {Grant, Thriller}
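The zero-information-loss case can be reproduced mechanically: values whose sets of supporting tuples coincide can be merged without losing any information about the tuples. A toy sketch, assuming the five-tuple movie table on this slide:

```python
# Toy illustration of attribute value clustering by role reversal, assuming
# the five-tuple movie table from the slide.
from collections import defaultdict

tuples = {
    "t1": ("Scorsese", "DeNiro", "Crime"),
    "t2": ("Scorsese", "DeNiro", "Comedy"),
    "t3": ("Hitchcock", "Grant", "Thriller"),
    "t4": ("Mehta", "Grant", "Thriller"),
    "t5": ("Luhrman", "Grant", "Thriller"),
}

# Invert the table: each value -> the set of tuples it occurs in.
occurs = defaultdict(set)
for tid, row in tuples.items():
    for value in row:
        occurs[value].add(tid)

# Values with identical tuple-sets are indistinguishable to the tuples,
# so merging them loses no information about T.
by_tuple_set = defaultdict(set)
for value, tids in occurs.items():
    by_tuple_set[frozenset(tids)].add(value)

correlated = [vals for vals in by_tuple_set.values() if len(vals) > 1]
print(correlated)   # two clusters: {Scorsese, DeNiro} and {Grant, Thriller}
```

Allowing a small, positive information loss (as LIMBO does) would additionally merge values whose tuple-sets are merely similar rather than identical.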

SLIDE 41

Example

Matrix with counts of values in attributes:

    Value       Director   Actor   Genre
    Scorsese    2
    Hitchcock   1
    Luhrman     1
    Mehta       1
    DeNiro                 2
    Grant                  3
    Thriller                       3
    Comedy                         1
    Crime                          1

SLIDE 42

Example (Cont.)

After clustering:

    Value              Director   Actor   Genre
    Scorsese, DeNiro   2          2
    Hitchcock          1
    Luhrman            1
    Mehta              1
    Grant, Thriller               3       3
    Comedy                                1
    Crime                                 1

Collapsed onto the attribute groups:

    Value      Cluster 1   Cluster 2
    Director   2
    Actor      2           3
    Genre                  3

SLIDE 43

Characterizing Attribute Redundancy

  • Attributes are clustered "earlier" when they contain more duplicate values
  • We can show the following: given sets of attributes C1, C2, and C3, if IL(C1,C2) < IL(C1,C3), then Duplication(C1,C2) > Duplication(C1,C3) (IL = Information Loss)

SLIDE 44

Experience

  • Data set: DBLP
    - Conference, journal, thesis, etc. publication records
    - 50K tuples, 13 attributes, and ~60K values
    - Very large number of NULL values
  • Used FDEP to discover FDs from the current instance
SLIDE 45

DBLP

  • First performed attribute grouping
  • The first step was to horizontally partition the data set into three clusters using an information-theoretic heuristic

[Figure: the three horizontal partitions; one region is dominated by NULLs]

SLIDE 46

DBLP (Cont.)

  • The partitions correspond to Conferences (11 FDs) and Journals (9 FDs)
  • The third cluster contains random publications and no FDs
  • The use of ranked FDs should be interactive
SLIDE 47

Roadmap

Introduction | Contributions | Motivating example | LIMBO Algorithm | Studying Software Systems | Identifying Structure | Conclusions & Future Work

SLIDE 48

Conclusions

  • Motivated the problem of categorical clustering and the need to use measures different from the traditional distance measures
  • Presented LIMBO, an algorithm that
    - is hierarchical
    - is scalable
    - requires minimal setting of parameters
    - performs well on integrated information
    - incorporates any type of information about the objects to be clustered
    - although necessarily heuristic, increases efficiency without loss in quality w.r.t. non-scalable information-theoretic methods
    - is insensitive to the order of input (experimentally observed)

SLIDE 49

Future Directions

  • Collect and process information in an online fashion
    - on-the-fly
    - within a bounded amount of memory
  • Example: U.S. airports are starting to incorporate RFID tags on passenger suitcases for routing and quality-control purposes (Las Vegas McCarran Airport)
  • Collection, summarization, and error identification
    - at different regional centers (airports)
    - at different aggregation points (provincial authorities)

SLIDE 50

Future Directions

  • Envision a large clustering suite, where users can
    - Cluster data from several application domains
      - Need extensions to cluster numerical and categorical data at the same time
    - Produce even better qualitative results
      - Apply different methods in conjunction and aggregate their results
    - Provide ranking capabilities for the clusters
      - Perform clustering with emphasis on the semantics of the produced clusters
    - Identify and correct errors
      - Provide automatic ways to replace erroneous values with other values from the data
    - Incorporate value importance
      - Allow the user to associate his/her preference with particular values
      - Incorporate automatic weighting schemes that reflect the fact that, e.g., some files in a software system are more important than others

SLIDE 51

Grazie !

Periklis Andritsos
URL: http://www.cs.toronto.edu/~periklis
e-mail: periklis@dit.unitn.it

You may contact me anytime to learn more about Trento and our projects.