Scalable Clustering of Categorical Data and Applications
Periklis Andritsos periklis@dit.unitn.it
University of Trento
Problem Definition
Clustering is a procedure that groups members of a population into similar categories, or clusters
January 9, 2006 Periklis Andritsos
Get insight into the way data is distributed
Preprocess an initial data set
Given a query, the meta-search engine places relevant web documents into groups
Clusty uses clustering technology on ten different types of web content including material from the web, image, news and shopping databases
“Likely buyers will be more motivated to buy a digital camera knowing that digital images can be displayed on a TV, printed using a PC-less photo quality printer, or printed at traditional film developer outlets”
“Likely buyers capture more images per month on their film cameras than unlikely buyers”
Source: http://www.imaging-resource.com/NEWS/1037573998.html
[Figure: software system modules and their grouping: main, Event, error, boxParserClass, Globals, GenerateProlog, stackOfAnyWithTop, boxClass, ColorTable, Fonts, EdgeClass, MathLib, stackOfAny, NodeOfAny, hashedBoxes, hashGlobals, boxScannerClass, Lexer]
[MMBRCG’98]
[Andritsos, Miller: IEEE Int’l Workshop on Program Comprehension, 2001]
Some information cannot be depicted
are stored in heterogeneous sources
exist under different formats
are available online (with schemas)
Schema: A type specification of a collection of data
[Figure: heterogeneous sources (XML repository, information system, relational database, O-O database) combined into integrated information]
The majority of existing commercial algorithms perform clustering
The optimal solution to clustering is hard to find, and existing heuristic techniques do not necessarily perform well with large inputs.
Many algorithms expect the user to give a set of (sometimes) unintuitive parameters
Software clustering techniques use structural information exclusively
no single ordering of values
movie: Godfather II, Good Fellas, Vertigo, N by NW, Bishop’s Wife, Harvey
director: Scorsese, Coppola, Hitchcock, Hitchcock, Koster, Koster
actor: De Niro, De Niro
genre: Crime, Crime, Thriller, Thriller, Comedy, Comedy
Lp metrics defined
Euclidean, Manhattan
employee  age  salary
John      25   $5,000
Mary      26   $6,000
Peter     30   $2,500
Jenny     32   $60,000
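The Lp metrics named above apply directly to numerical rows such as the employee data. A minimal sketch in Python follows; the pairing of names, ages and salaries is read off the slide's table and is an assumption, and `lp_distance` is a hypothetical helper name.

```python
def lp_distance(x, y, p):
    """L_p (Minkowski) distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# (age, salary) per employee; the row alignment is assumed from the slide.
employees = {
    "John": (25, 5000),
    "Mary": (26, 6000),
    "Peter": (30, 2500),
    "Jenny": (32, 60000),
}

# p=1 gives the Manhattan distance, p=2 the Euclidean distance.
manhattan = lp_distance(employees["John"], employees["Mary"], p=1)
euclidean = lp_distance(employees["John"], employees["Mary"], p=2)
print(manhattan, euclidean)
```

No comparable subtraction exists for categorical values such as director names, which have no single ordering.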
Agglomerative, or hierarchical, clustering in Euclidean space on 6 points (A, B, E, F, C, D).
Need to compute distance between objects as well as between objects and sub-clusters
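The need for object-to-sub-cluster distances can be seen in a small sketch of agglomerative clustering. The slides give neither coordinates nor a linkage rule, so the coordinates for A to F below are hypothetical and the sketch assumes centroid linkage (a sub-cluster is represented by the mean of its points).

```python
from itertools import combinations
from math import dist

def centroid(cluster, points):
    """Mean position of the named points in a cluster."""
    coords = [points[name] for name in cluster]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

def agglomerate(points):
    """Repeatedly merge the two clusters whose centroids are closest.
    Returns the merges in order as (cluster_a, cluster_b) pairs."""
    clusters = [frozenset([name]) for name in sorted(points)]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: dist(centroid(pair[0], points),
                                  centroid(pair[1], points)),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

# Hypothetical coordinates for the six points (not given in the slides).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0),
          "D": (6.0, 5.0), "E": (10.0, 0.0), "F": (10.0, 1.5)}
merges = agglomerate(points)
for a, b in merges:
    print(sorted(a), "+", sorted(b))
```

Six points always produce five merges; the order of the merges defines the hierarchy.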
is hierarchical
clusters categorical data using a small number of parameters
is scalable as the size of the input increases
International Conference on Extending Data Base Technology (EDBT’04)
information
The algorithm incorporates information such as the Developer, Lines Of Code or Directory structure
International Working Conference on Reverse Engineering (WCRE’03)
IEEE Transactions on Software Engineering (TSE’05)
data sets
ACM International Conference on the Management of Data (SIGMOD’04)
International Workshop on Information Integration on the Web (WEB’02)
IEEE Data Engineering Bulletin 2002, 2003
[Table: the six movies with their director, actor and genre values]
Preserves information for actor, genre
Two choices for director
Two choices for director, actor, and genre
Three choices for director, actor, and two for genre
Preserves information for director, genre
Two choices for actor
[Table: the six movies with their director, actor and genre values]
Entropy: measures the uncertainty in a random variable
Conditional entropy: measures the uncertainty of one variable knowing the other
Mutual information: measures the dependence of two random variables
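The three measures can be estimated from empirical frequencies. The sketch below uses the director and genre columns of the movie example; aligning the two columns row by row is an assumption, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) in bits, estimated from the empirical frequencies in xs."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

# Columns of the movie example, assumed to be aligned row by row.
directors = ["Scorsese", "Coppola", "Hitchcock", "Hitchcock", "Koster", "Koster"]
genres = ["Crime", "Crime", "Thriller", "Thriller", "Comedy", "Comedy"]

h_genre = entropy(genres)
h_genre_given_director = conditional_entropy(genres, directors)
i_genre_director = mutual_information(genres, directors)
print(h_genre, h_genre_given_director, i_genre_director)
```

Under this alignment each director maps to a single genre, so knowing the director removes all uncertainty about the genre and the mutual information equals the entropy of the genre column.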
Its probability p(ci)=n(ci)/n
Conditional probability of the values in V given the cluster, p(V|ci)
[Table: the six movies with their director, actor and genre values]
produce a summary of the data
apply agglomerative clustering on the summary
\[
DCF(c^*) = \Big( n(c_1) + n(c_2),\; \frac{n(c_1)}{n(c_1)+n(c_2)}\, p(V|c_1) + \frac{n(c_2)}{n(c_1)+n(c_2)}\, p(V|c_2) \Big)
\]
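The merge rule for two DCFs can be sketched in a few lines. The representation below (a pair of a tuple count and a dictionary for p(V|c)) is an assumption made for illustration, as is the choice to give a single tuple a uniform distribution over its three values.

```python
def merge_dcf(dcf1, dcf2):
    """Merge two DCFs, each a pair (n, p) where n is the number of tuples
    in the cluster and p maps value -> p(value | cluster)."""
    n1, p1 = dcf1
    n2, p2 = dcf2
    n = n1 + n2
    # Weighted average of the two conditional distributions.
    merged = {
        v: (n1 / n) * p1.get(v, 0.0) + (n2 / n) * p2.get(v, 0.0)
        for v in set(p1) | set(p2)
    }
    return n, merged

# Hypothetical DCFs for two single-tuple clusters from the movie example.
t1 = (1, {"Scorsese": 1 / 3, "De Niro": 1 / 3, "Crime": 1 / 3})
t2 = (1, {"Coppola": 1 / 3, "De Niro": 1 / 3, "Crime": 1 / 3})
n, p = merge_dcf(t1, t2)
print(n, p)
```

Because the merged DCF depends only on the two input DCFs, summaries can be combined without revisiting the underlying tuples.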
[Figure: the movie table summarized as DCFs: DCF(t1), DCF(t2), DCF(c1); DCF(t3), DCF(t4), DCF(c2); DCF(c3)]
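The slides do not show how two summaries are chosen for merging. In the information bottleneck framework the usual cost is the information lost by the merge, the combined weight of the two clusters times the Jensen-Shannon divergence of their value distributions; the sketch below assumes that cost, and the weights and distributions are illustrative.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must cover p's support."""
    return sum(pv * log2(pv / q[v]) for v, pv in p.items() if pv > 0)

def merge_cost(w1, p1, w2, p2):
    """Assumed merge cost: (w1 + w2) * JS(p1, p2), where the Jensen-Shannon
    divergence uses weights w1/(w1+w2) and w2/(w1+w2)."""
    w = w1 + w2
    mix = {v: (w1 / w) * p1.get(v, 0.0) + (w2 / w) * p2.get(v, 0.0)
           for v in set(p1) | set(p2)}
    js = (w1 / w) * kl(p1, mix) + (w2 / w) * kl(p2, mix)
    return w * js

# Two of six tuples, each with weight 1/6 and a uniform distribution
# over its own three values (an assumption for illustration).
p_t1 = {"Scorsese": 1 / 3, "De Niro": 1 / 3, "Crime": 1 / 3}
p_t2 = {"Coppola": 1 / 3, "De Niro": 1 / 3, "Crime": 1 / 3}
cost = merge_cost(1 / 6, p_t1, 1 / 6, p_t2)
print(cost)
```

Merging two clusters with identical value distributions costs nothing under this measure, while clusters that share no values are the most expensive to merge.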
Proposes a new (incremental) computation of summaries
Requires a single pass over the data to compute cluster summaries
Number of clusters : the hierarchical nature of LIMBO permits the production of a range of cluster cardinalities
Parameter φ : it should be set, but…
Version of LIMBO with parameter S (in place of φ) : bounds the amount of memory allocated to the summaries
Standard testing data (UCI Machine Learning Repository)
Web Data
Legacy Data (DBLP)
Software Engineering (VIM source code)
Synthetic
Up to 10 Million rows, 20 columns with 10-20 values each
Information Loss
Precision, Recall, Classification Error
Category Utility
DS10 (n=5,000, 10 attributes, 5 clusters)
Algorithm                     Size   Information Loss (%)   Precision   Recall   Classification Error   Category Utility
LIMBO (φ=0.0)                 5000   77.56                  0.998       0.998    0.002                  2.56
LIMBO (φ=1.0)                 66     77.56                  0.998       0.998    0.002                  2.56
COOLCAT [Barbara et al ’00]   125    78.02                  0.995       0.995    0.05                   2.54
[Guha et al ’00]              -      85.00                  0.839       0.724    0.28                   0.4
superior
set of software artifacts (files, classes) [nodes]
structural information, i.e. interdependencies between the artifacts (invocations, inheritance) [edges]
non-structural information (timestamps, ownership) [node properties]
Partition the artifacts into “meaningful” groups in order to understand and maintain the software system
Data analysis tools produce valuable information about source code
Artifacts are expressed over the artifacts on which they depend
LIMBO can be applied to perform clustering
Artifacts are expressed over the artifacts on which they depend AND their properties
LIMBO can then be applied
TOBEY : ~ 1,000 files (total 250,000 Lines Of Code)
LINUX : ~ 1,000 files (total 750,000 Lines Of Code)
Mozilla : ~ 2,500 files (total 4,000,000 Lines Of Code)
Developers (dev)
Directory (dir)
Lines of Code (loc)
Time of Last Update (time)
Standard Cluster Analysis Algorithms
Cluster Analysis algorithms that identify patterns in the software graph
Cluster Analysis algorithms that adhere to software engineering principles
LIMBO outperformed all other clustering algorithms
Utility subsystems were discovered
“Dir” information produced better decompositions.
“Dev” information has a positive effect.
“Time” leads to worse clusterings.
The representation used within LIMBO allows for the inclusion of descriptive information available about the objects to be clustered
Older constraints are relaxed
Heterogeneous groups of rows may be added
information content.
the tools
Hierarchical in nature exploit any number of partitions
Data Browsing
Discover how data joins together
Perform Data Cleaning
Information-Theoretic characterization of database design
Characterize constraints in a data set [Dalkilic and Robertson ‘00]
Provide metrics that assess design quality [Arenas and Libkin ‘03]
Theoretic characterization of databases into a tool by making use of clustering
Horizontally partition a data set
Integrated information contains heterogeneous groups
Find naturally co-occurring values in the data
By reversing the roles of rows and values, we can cluster the values so that they preserve information about the rows in which they appear
Find groups of attributes that share data with high duplication
Groups of naturally co-occurring values provide useful hints about the
duplication that exists in the attributes
Tuple  Director   Actor   Genre
t1     Scorsese   DeNiro  Crime
t2     Scorsese   DeNiro  Comedy
t3     Hitchcock  Grant   Thriller
t4     Mehta      Grant   Thriller
t5     Luhrman    Grant   Thriller

Value      List of Tuples  Count
Scorsese   {t1,t2}         2
Hitchcock  {t3}            1
Mehta      {t4}            1
Luhrman    {t5}            1
DeNiro     {t1,t2}         2
Grant      {t3,t4,t5}      3
Crime      {t1}            1
Comedy     {t2}            1
Thriller   {t3,t4,t5}      3
If we allow no information loss, we get perfectly correlated values:
Cluster 1: {Scorsese, DeNiro}
Cluster 2: {Grant, Thriller}
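The zero-loss case can be reproduced by grouping values that occur in exactly the same tuples. The sketch below is only that special case, not the clustering algorithm itself; the tuple identifiers t1 to t5 are inferred from the list-of-tuples table and are an assumption.

```python
from collections import defaultdict

# Tuples of the example; the identifiers t1..t5 are inferred, not given.
tuples = {
    "t1": ("Scorsese", "DeNiro", "Crime"),
    "t2": ("Scorsese", "DeNiro", "Comedy"),
    "t3": ("Hitchcock", "Grant", "Thriller"),
    "t4": ("Mehta", "Grant", "Thriller"),
    "t5": ("Luhrman", "Grant", "Thriller"),
}

# Reverse the roles of rows and values: value -> set of tuples containing it.
occurrences = defaultdict(set)
for tid, row in tuples.items():
    for value in row:
        occurrences[value].add(tid)

# Values with identical tuple sets are perfectly correlated.
groups = defaultdict(set)
for value, tids in occurrences.items():
    groups[frozenset(tids)].add(value)

clusters = [values for values in groups.values() if len(values) > 1]
print(clusters)
```

Allowing some information loss would additionally merge values whose tuple sets overlap without being identical.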
Initial matrix
Value      List of Tuples  Count
Scorsese   {t1,t2}         2
Hitchcock  {t3}            1
Mehta      {t4}            1
Luhrman    {t5}            1
DeNiro     {t1,t2}         2
Grant      {t3,t4,t5}      3
Crime      {t1}            1
Comedy     {t2}            1
Thriller   {t3,t4,t5}      3

Matrix with counts of values in attributes
Value      Director  Actor  Genre
Scorsese   2         -      -
Hitchcock  1         -      -
Mehta      1         -      -
Luhrman    1         -      -
DeNiro     -         2      -
Grant      -         3      -
Crime      -         -      1
Comedy     -         -      1
Thriller   -         -      3
After Clustering
Value                Director  Actor  Genre
{Scorsese, DeNiro}   2         2      -
Hitchcock            1         -      -
Mehta                1         -      -
Luhrman              1         -      -
{Grant, Thriller}    -         3      3
Crime                -         -      1
Comedy               -         -      1

Value      Director  Actor  Genre
Cluster 1  2         2      -
Cluster 2  -         3      3
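The per-cluster counts can be derived by tallying, for every attribute, how often a value belonging to each cluster appears in that attribute. A minimal sketch, with the two value clusters taken from the example and all variable names chosen for illustration:

```python
from collections import Counter

attributes = ("Director", "Actor", "Genre")
rows = [
    ("Scorsese", "DeNiro", "Crime"),
    ("Scorsese", "DeNiro", "Comedy"),
    ("Hitchcock", "Grant", "Thriller"),
    ("Mehta", "Grant", "Thriller"),
    ("Luhrman", "Grant", "Thriller"),
]
value_clusters = {
    "Cluster 1": {"Scorsese", "DeNiro"},
    "Cluster 2": {"Grant", "Thriller"},
}

# For each cluster, count occurrences of its values per attribute.
counts = {name: Counter() for name in value_clusters}
for row in rows:
    for attribute, value in zip(attributes, row):
        for name, members in value_clusters.items():
            if value in members:
                counts[name][attribute] += 1

for name, per_attribute in counts.items():
    print(name, dict(per_attribute))
```

Attributes that receive counts from the same cluster (Director and Actor for Cluster 1, Actor and Genre for Cluster 2) are candidates for sharing duplicated data.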
DBLP:
Conference, Journal, Thesis etc. publications data set
50K tuples, 13 attributes and ~60K values
Very large number of NULL values
clusters using an information-theoretic heuristic.
NULLs
different than the traditional distance measures
Hierarchical
Scalable
Requires minimal setting of parameters
Performs well on integrated information
Incorporates any type of information about the objects to be clustered
Although it is necessarily heuristic, it increases efficiency without loss in quality w.r.t. non-scalable information-theoretic methods
Is insensitive to the order of input (experimentally found)
within a bounded amount of memory
at different regional centers (airports)
at different aggregation points (provincial authorities)
Cluster data from several application domains
Need extensions to cluster numerical and categorical data at the same time
Produce even better qualitative results
Apply different methods in conjunction and aggregate their results
Provide ranking capabilities for the clusters
Perform clustering with emphasis on semantics of produced clusters
Identify and correct errors
Provide automatic ways to replace erroneous values with other values from the data
Incorporate value importance
Allow the user to associate his/her preference with particular values
Incorporate automatic weighting schemes that reflect the fact that, e.g., some files in a software system are more important than others