Saurav Kumar Singh, Department of Computer Science & Engineering


SLIDE 1

Saurav Kumar Singh
Department of Computer Science & Engineering
Dual Degree, 4th year

SLIDE 2

Outline

• Motivation
• Basics
• Hierarchical Structure
• Parameter Generation
• Query Types
• Algorithm

SLIDE 3

Motivation

• All previous clustering algorithms are query dependent.
• They are built for one query and are generally of no use for other queries.
• Each query needs a separate scan of the data, so the computation costs at least O(n).
• We therefore need a structure built from the database so that various queries can be answered without rescanning.

SLIDE 4

Basics

• Grid-based method: quantizes the object space into a finite number of cells that form a grid structure, on which all clustering operations are performed.
• Develop a hierarchical structure out of the given data and answer various queries efficiently.
• Every level of the hierarchy consists of cells.
• Answering a query is not O(n), where n is the number of elements in the database.
SLIDE 5

A hierarchical structure for STING clustering

SLIDE 6

continue …..

• The root of the hierarchy is at level 1.
• A cell at level i corresponds to the union of the areas of its children at level i + 1.
• A cell at a higher level is partitioned to form a number of cells at the next lower level.
• Statistical information for each cell is calculated and stored beforehand and is used to answer queries.

SLIDE 7

Cell parameter

• Attribute-independent parameter:
n - number of objects (points) in this cell
• Attribute-dependent parameters:
m - mean of all attribute values in this cell
s - standard deviation of all attribute values in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute values in this cell follow

SLIDE 8

Parameter Generation

• n, m, s, min, and max of bottom-level cells are calculated directly from the data.
• The distribution can either be assigned by the user or obtained by hypothesis tests such as the χ2 test.
• Parameters of higher-level cells are calculated from the parameters of their lower-level cells.

SLIDE 9

continue…..

• Let n, m, s, min, max, and dist be the parameters of the current cell.
• Let ni, mi, si, mini, maxi, and disti be the parameters of the corresponding lower-level cells.
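Using this notation, the parent's n, m, s, min, and max follow from the children by weighted aggregation; a minimal sketch (the s formula recovers the parent variance from each child's E[x²] = si² + mi²):

```python
import math

def merge_cells(children):
    """Roll up (n, m, s, min, max) of child cells into the parent cell's
    parameters using weighted-aggregation formulas."""
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    # parent E[x^2] is the n_i-weighted average of child E[x^2] = s_i^2 + m_i^2
    ex2 = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n
    return {
        "n": n,
        "m": m,
        "s": math.sqrt(max(ex2 - m ** 2, 0.0)),  # guard float round-off
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }
```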

SLIDE 10

dist for Parent Cell

• Set dist to the distribution type followed by most points in this cell.
• Then count the points in the child cells that conflict with this choice; call this count confl:
1. If disti ≠ dist, mi ≈ m, and si ≈ s, then confl is increased by ni.
2. If disti ≠ dist, and either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.
3. If disti = dist, mi ≈ m, and si ≈ s, then confl is increased by 0.
4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.

SLIDE 11

continue…..

• If confl/n is greater than a threshold t, set dist to NONE.
• Otherwise, keep the original type.

Example :

SLIDE 12

continue…..

• The parameters for the parent cell would be:
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
• 210 points have distribution type NORMAL, so set dist of the parent to NORMAL.
• confl = 10, and confl/n = 10/220 ≈ 0.045 < 0.05, so keep the original type.
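The dist roll-up, confl count, and threshold test can be sketched as follows. The rel_tol-based closeness test for mi ≈ m and si ≈ s is an assumption here, since the slides leave the ≈ comparison unspecified:

```python
from collections import Counter

def roll_up_dist(children, m, s, n, t=0.05, rel_tol=0.1):
    """Choose the parent cell's dist from its children. The rel_tol-based
    closeness test for m_i ≈ m and s_i ≈ s is an assumption; the slides
    leave the ≈ comparison unspecified."""
    counts = Counter()
    for c in children:                     # dist followed by most points
        counts[c["dist"]] += c["n"]
    dist = counts.most_common(1)[0][0]

    def close(a, b):
        return abs(a - b) <= rel_tol * max(abs(b), 1e-12)

    confl = 0
    for c in children:
        stats_ok = close(c["m"], m) and close(c["s"], s)
        if c["dist"] == dist and stats_ok:
            continue                       # rule 3: adds nothing to confl
        if stats_ok:                       # rule 1: dist differs, stats agree
            confl += c["n"]
        else:                              # rules 2 and 4: stats disagree
            confl = n
            break
    return "NONE" if confl / n > t else dist
```

On the slide's example (210 NORMAL points, 10 conflicting, n = 220), confl/n ≈ 0.045 stays under t = 0.05 and NORMAL is kept.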

SLIDE 13

Query types

• The STING structure is capable of answering various queries.
• If it cannot, we always have the underlying database to fall back on.
• Even if the statistical information is not sufficient to answer a query exactly, we can still generate a possible set of answers.

SLIDE 14

Common queries

 Select regions that satisfy certain conditions

Select the maximal regions that have at least 100 houses per unit area, where at least 70% of the house prices are above $400K, with total area at least 100 units, with 90% confidence:

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9

SLIDE 15

continue….

• Select regions and return some function of the region.

Select the range of ages of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have a price between $150K and $300K, with area at least 100 units, in California:

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California

SLIDE 16

Algorithm

• With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries.
• For any query, we begin by examining cells in a high-level layer.
• For each cell, we calculate the likelihood that the cell is relevant to the query at some confidence level, using the parameters of the cell.
• If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.

SLIDE 17

continue….

• After we obtain the confidence interval, we label the cell as relevant or not relevant at the specified confidence level.
• We then proceed to the next layer, but consider only the children of the relevant cells of the upper layer.
• We repeat this until we reach the final layer.
• The relevant cells of the final layer have enough statistical information to give a satisfactory answer to the query.
• However, for accurate mining we may retrieve the data corresponding to the relevant cells and process it further.

SLIDE 18

Finding regions

• After we have found all the relevant cells at the final level, we need to output the regions that satisfy the query.
• We can do this using breadth-first search.

SLIDE 19

Breadth First Search

• We examine cells within a certain distance of the center of the current cell.
• If the average density within this small area is greater than the specified density, we mark the area.
• We put the relevant cells just examined into a queue.
• We take an element from the queue and repeat the same procedure, except that only relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region.
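The region-growing step above can be sketched as a breadth-first search over the relevant bottom-level cells; representing cells as (row, col) grid coordinates with 4-neighbour adjacency is a simplifying assumption:

```python
from collections import deque

def find_regions(relevant):
    """Group relevant bottom-level cells, given as (row, col) grid
    coordinates, into connected regions by breadth-first search.
    4-neighbour adjacency is a simplifying assumption."""
    relevant = set(relevant)
    seen, regions = set(), []
    for start in relevant:
        if start in seen:
            continue
        region, queue = [], deque([start])
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb in relevant and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(region)
    return regions
```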

SLIDE 20

Statistical Information Grid-based Algorithm

1. Determine a layer to begin with.
2. For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that the cell is relevant to the query.
3. From the interval calculated above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. Go down the hierarchy by one level. Go to Step 2 for those cells that are children of the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the result that meets the requirements of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirements of the query. Go to Step 9.
9. Stop.
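Steps 1 through 5 can be sketched as a top-down driver. The Cell class and the is_relevant callback (standing in for the confidence-interval test of Steps 2 and 3) are assumptions for illustration:

```python
class Cell:
    """Minimal stand-in for a STING cell: stored parameters plus children."""
    def __init__(self, params, children=()):
        self.params = params
        self.children = list(children)

def sting_query(root, is_relevant, bottom_level):
    """Top-down driver for Steps 1-5 (a sketch). `is_relevant` stands in
    for the confidence-interval test of Steps 2-3; the root is level 1."""
    layer, level = [root], 1
    while level < bottom_level:
        relevant = [c for c in layer if is_relevant(c)]       # Steps 2-3
        layer = [ch for c in relevant for ch in c.children]   # Step 5
        level += 1
    return [c for c in layer if is_relevant(c)]               # bottom layer
```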
SLIDE 21

Time Analysis:

• Step 1 takes constant time. Steps 2 and 3 require constant time per cell.
• The total time is therefore at most proportional to the total number of cells in the hierarchical structure.
• Notice that the total number of cells is about 1.33K, where K is the number of cells at the bottom layer: each level has roughly one quarter as many cells as the level below, so the total is K(1 + 1/4 + 1/16 + ...) ≈ 1.33K.
• So the overall computational complexity on the grid hierarchy is O(K).

SLIDE 22

Time Analysis:

• STING goes through the database once to compute the statistical parameters of the cells.
• The time complexity of generating clusters is therefore O(n), where n is the total number of objects.
• After the hierarchical structure has been generated, the query-processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.

SLIDE 23

Comparison

SLIDE 24

CLIQUE: A Dimension-Growth Subspace Clustering Method

• CLIQUE was the first dimension-growth subspace clustering algorithm.
• Clustering starts in single-dimensional subspaces and moves upward toward higher-dimensional subspaces.
• The algorithm can be viewed as an integration of density-based and grid-based methods.

SLIDE 25

Informal problem statement

• Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
• CLIQUE's clustering identifies the sparse and the "crowded" areas of the space (units), thereby discovering the overall distribution patterns of the data set.
• A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
• In CLIQUE, a cluster is defined as a maximal set of connected dense units.

SLIDE 26

Formal Problem Statement

• Let A = {A1, A2, ..., Ad} be a set of bounded, totally ordered domains, and S = A1 × A2 × ... × Ad a d-dimensional numerical space.
• We refer to A1, ..., Ad as the dimensions (attributes) of S.
• The input consists of a set of d-dimensional points V = {v1, v2, ..., vm}, where vi = (vi1, vi2, ..., vid). The jth component of vi is drawn from domain Aj.

SLIDE 27

Clique Working

• CLIQUE is a two-step process.
• 1st step: partition the d-dimensional data space.
• 2nd step: generate a minimal description of each cluster.

SLIDE 28

1st step- Partitioning

 Partitioning is done for each dimension.
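A minimal sketch of this first pass along one dimension: partition the attribute's range into equal-length intervals, count the points per interval, and keep the dense intervals. Here xi (number of intervals) and tau (density threshold) stand for CLIQUE's input parameters:

```python
from collections import Counter

def dense_units_1d(points, dim, xi, lo, hi, tau):
    """First CLIQUE pass along one dimension (a sketch): partition [lo, hi]
    into xi equal intervals and keep those whose fraction of all points
    exceeds the density threshold tau."""
    width = (hi - lo) / xi
    counts = Counter()
    for p in points:
        idx = min(int((p[dim] - lo) / width), xi - 1)  # clamp p[dim] == hi
        counts[idx] += 1
    return {i for i, c in counts.items() if c / len(points) > tau}
```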

SLIDE 29

Example continue….

SLIDE 30

continue….

• The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist.
• This approach to selecting candidates is quite similar to the apriori-gen process of generating candidates in the Apriori algorithm.
• The expectation here is that if something is dense in a higher-dimensional space, it cannot be sparse in its lower-dimensional projections.

SLIDE 31

More formally

• If a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
• Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.
• So we can generate candidate dense units in k-dimensional space from the dense units found in (k-1)-dimensional space.
• The resulting space searched is much smaller than the original space.
• The dense units are then examined to determine the clusters.
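The apriori-style candidate generation with pruning can be sketched as follows; representing a unit as a frozenset of (dimension, interval_index) pairs is an assumption for illustration:

```python
from itertools import combinations

def candidate_units(dense_km1, k):
    """Apriori-style candidate generation (a sketch): join (k-1)-dimensional
    dense units and prune any candidate with a non-dense (k-1)-dimensional
    projection. A unit is a frozenset of (dimension, interval_index) pairs."""
    candidates = set()
    for a in dense_km1:
        for b in dense_km1:
            u = a | b
            # the union must span exactly k distinct dimensions
            if len(u) == k and len({d for d, _ in u}) == k:
                # prune: every (k-1)-dimensional projection must be dense
                if all(frozenset(sub) in dense_km1
                       for sub in combinations(u, k - 1)):
                    candidates.add(u)
    return candidates
```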

SLIDE 32

Intersection

Dense units found with respect to age for the dimensions salary and vacation are intersected in order to provide a candidate search space for dense units of higher dimensionality.

SLIDE 33

2nd stage- Minimal Description

• For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.
• It then determines a minimal cover (a logic description) for each cluster.

SLIDE 34

Effectiveness of Clique-

• CLIQUE automatically finds the subspaces of highest dimensionality such that high-density clusters exist in those subspaces.
• It is insensitive to the order of the input objects.
• It scales linearly with the size of the input.
• It scales easily with the number of dimensions in the data.

SLIDE 35

Thank You

SLIDE 36

References:

• Wei Wang, Jiong Yang, and Richard Muntz. "STING: A Statistical Information Grid Approach to Spatial Data Mining." Department of Computer Science, University of California, Los Angeles, February 20, 1997.
• Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques," Second Edition. University of Illinois at Urbana-Champaign.