Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data - - PowerPoint PPT Presentation





SLIDE 1

Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Albert Bifet and Ricard Gavaldà

Universitat Politècnica de Catalunya

14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08) 2008 Las Vegas, USA

SLIDE 2

Data Streams

  • Sequence is potentially infinite
  • High amount of data: sublinear space
  • High speed of arrival: sublinear time per example

Tree Mining

Mining frequent trees is becoming an important task. Applications:

  • chemical informatics
  • computer vision
  • text retrieval
  • bioinformatics
  • Web analysis

Many link-based structures may be studied formally by means of unordered trees

SLIDE 3

Introduction: Trees

Our trees are:

  • Rooted
  • Unlabeled
  • Ordered and Unordered

Our subtrees are:

  • Induced

Figure: Two different ordered trees but the same unordered tree
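The ordered/unordered distinction can be illustrated with a small sketch. It assumes a nested-tuple encoding (a node is the tuple of its children, a leaf is `()`) and a sorting-based canonical form; both are illustrative choices, not the representation used in the talk.

```python
def canonical(tree):
    """Return a canonical form of an unlabeled rooted tree, ignoring child order.

    A tree is a tuple of child subtrees; a leaf is the empty tuple ().
    Children are canonicalized recursively and then sorted, so any two
    orderings of the same unordered tree give the same result.
    """
    return tuple(sorted((canonical(child) for child in tree), reverse=True))


# Root with two children: a leaf and a node that has one child of its own.
t1 = ((), ((),))   # leaf first
t2 = (((),), ())   # leaf second

different_as_ordered = t1 != t2
same_as_unordered = canonical(t1) == canonical(t2)
```

Here `t1` and `t2` differ as ordered trees but share one canonical form, so they count as the same unordered tree.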

SLIDE 4

Introduction

What Is Tree Pattern Mining?

Given a dataset of trees, find the complete set of frequent subtrees.

Frequent Tree Pattern (FS):

  • Includes all the trees whose support is no less than min_sup

Closed Frequent Tree Pattern (CS):

  • Includes no tree which has a super-tree with the same support

CS ⊆ FS

Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information.
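The two definitions can be sketched on a toy family of patterns. The example below assumes the only patterns are paths (chains) with k nodes, where a shorter path is a subtree of a longer one. The support counts and the function names are made up for illustration.

```python
def frequent_patterns(supports, min_sup):
    """FS: patterns whose support is no less than min_sup."""
    return sorted(p for p, s in supports.items() if s >= min_sup)


def closed_frequent_patterns(supports, min_sup, is_proper_subpattern):
    """CS: frequent patterns with no proper super-pattern of equal support."""
    closed = []
    for p in frequent_patterns(supports, min_sup):
        has_equal_super = any(
            is_proper_subpattern(p, q) and supports[q] == supports[p]
            for q in supports
        )
        if not has_equal_super:
            closed.append(p)
    return closed


# Hypothetical supports: key k stands for the path with k nodes.
path_supports = {1: 5, 2: 5, 3: 3, 4: 1}

def shorter_path(p, q):
    # A path with p nodes is a proper subtree of a path with q nodes iff p < q.
    return p < q

FS = frequent_patterns(path_supports, min_sup=2)
CS = closed_frequent_patterns(path_supports, 2, shorter_path)
```

With these numbers the one-node path is frequent but not closed, because the two-node path has the same support. CS is therefore smaller than FS, yet the support of every frequent pattern can still be read off its smallest closed super-pattern.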

SLIDE 5

Introduction

Unordered Subtree Mining

[Figure: dataset trees A and B, and subtrees X and Y]

D = {A, B}, min_sup = 2

  • # Closed Subtrees: 2
  • # Frequent Subtrees: 9

Closed Subtrees: X, Y

Frequent Subtrees: [drawings not preserved]

SLIDE 6

Introduction

Problem

Given a data stream D of rooted, unlabeled and unordered trees, find frequent closed trees.

We provide three algorithms, of increasing power:

  • Incremental
  • Sliding Window
  • Adaptive

SLIDE 7

Relaxed Support

Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data.

Linear Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i)\,\epsilon_r \ge 0$, $u_i = (n - i + 1)\,\epsilon_r \le 1$ and $i \le n$.

Linear Relaxed closed subpattern t: t is linear relaxed closed if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.
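As a rough sketch, the linear interval index follows directly from the bounds: a relative support s lies in the interval with (n − i)·ε_r ≤ s < (n − i + 1)·ε_r, so i = n − ⌊s/ε_r⌋. The function names are hypothetical, exact fractions are used to avoid floating-point rounding at interval borders, and clamping the index into 1..n (so that s = 1 falls in the top interval) is an assumption.

```python
import math
from fractions import Fraction


def linear_interval_index(support, eps_r):
    """Index i of the linear relaxed interval holding a relative support in [0, 1]."""
    support, eps_r = Fraction(support), Fraction(eps_r)
    n = math.ceil(1 / eps_r)
    i = n - math.floor(support / eps_r)
    # Assumption: clamp into 1..n so the boundary value s = 1 lands in I_1.
    return max(1, min(n, i))


def is_linear_relaxed_closed(support, super_supports, eps_r):
    """True if no proper superpattern's support shares the pattern's interval."""
    i = linear_interval_index(support, eps_r)
    return all(linear_interval_index(s, eps_r) != i for s in super_supports)


# eps_r = 1/4 gives n = 4 intervals: I_2 = [0.5, 0.75), I_3 = [0.25, 0.5).
idx_a = linear_interval_index("0.6", "0.25")
idx_b = linear_interval_index("0.3", "0.25")
absorbed = is_linear_relaxed_closed("0.6", ["0.5"], "0.25")
kept = is_linear_relaxed_closed("0.6", ["0.3"], "0.25")
```

A pattern with support 0.6 is not relaxed closed when a superpattern has support 0.5, since both fall in I_2. It stays relaxed closed when its superpatterns all fall in other intervals.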

SLIDE 8

Relaxed Support

As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support:

Logarithmic Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = \lceil c^{i} \rceil$, $u_i = \lceil c^{i+1} - 1 \rceil$ and $i \le n$.

Logarithmic Relaxed closed subpattern t: t is logarithmic relaxed closed if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.
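A minimal sketch of the logarithmic intervals, under stated assumptions: supports are absolute integer counts, the base c is a constant greater than 1 whose value the slide does not give (c = 2 below is only an example), and the upper bound is treated as inclusive so that consecutive intervals cover every integer support. The function names are hypothetical.

```python
import math


def log_interval_bounds(i, c):
    """Bounds (l_i, u_i) with l_i = ceil(c**i) and u_i = ceil(c**(i+1) - 1)."""
    return math.ceil(c ** i), math.ceil(c ** (i + 1) - 1)


def log_interval_index(support, c):
    """Largest i with ceil(c**i) <= support, for an integer support >= 1."""
    if support < 1 or c <= 1:
        raise ValueError("need support >= 1 and c > 1")
    i = 0
    while math.ceil(c ** (i + 1)) <= support:
        i += 1
    return i


def is_log_relaxed_closed(support, super_supports, c):
    """True if no proper superpattern's support shares the pattern's interval."""
    i = log_interval_index(support, c)
    # Superpatterns with support 0 lie outside every interval.
    return all(s < 1 or log_interval_index(s, c) != i for s in super_supports)


bounds = log_interval_bounds(2, 2)
indices = [log_interval_index(s, 2) for s in range(1, 9)]
absorbed = is_log_relaxed_closed(6, [5, 3], 2)
kept = is_log_relaxed_closed(6, [3, 1], 2)
```

With c = 2 the intervals widen geometrically (1, 2..3, 4..7, 8..15, ...), so high supports are grouped more coarsely than low ones.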

SLIDE 9

Galois Lattice of closed set of trees

We need:

  • a Galois connection pair
  • a closure operator

[Figure: dataset D and a lattice with nodes 1, 2, 3, 12, 13, 23, 123]
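How a Galois connection pair yields a closure operator can be sketched with itemsets standing in for trees. This is a deliberate simplification: the intersection of trees is more involved and is not shown. The function names and the three-transaction dataset are illustrative only.

```python
def support_set(pattern, dataset):
    """First map: a pattern to the ids of the transactions that contain it."""
    return frozenset(i for i, t in enumerate(dataset) if pattern <= t)


def common_part(ids, dataset):
    """Second map: a set of transaction ids to what those transactions share."""
    ids = list(ids)
    if not ids:
        # By convention, the empty id set maps to the largest possible pattern.
        return frozenset().union(*dataset)
    return frozenset.intersection(*(dataset[i] for i in ids))


def closure(pattern, dataset):
    """Closure operator: compose the two maps of the Galois connection."""
    return common_part(support_set(pattern, dataset), dataset)


# Transactions 0, 1, 2 play the role of the lattice nodes 1, 2, 3.
D = [frozenset("abc"), frozenset("abd"), frozenset("ae")]

closed_b = closure(frozenset("b"), D)
closed_a = closure(frozenset("a"), D)
```

The closure of {b} is {a, b}, because every transaction containing b also contains a. A pattern is closed exactly when it equals its own closure, as {a} does here.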

SLIDE 10

Algorithms

Algorithms

  • Incremental: INCTREENAT
  • Sliding Window: WINTREENAT
  • Adaptive: ADATREENAT (uses ADWIN to monitor change)

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)

  • On ratio of false positives and negatives
  • On the relation of the size of the current window and change rates
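The adaptive-window idea can be sketched naively as follows. The cut threshold is the Hoeffding-style bound from the published description of ADWIN and does not appear on the slides. The class name is hypothetical, and this version rescans the whole window on every insertion rather than using the compressed bucket structure that makes the real algorithm efficient.

```python
import math
from collections import deque


class NaiveAdwin:
    """Naive adaptive window over values in [0, 1] (illustration only)."""

    def __init__(self, delta=0.01):
        self.delta = delta      # confidence parameter
        self.window = deque()   # oldest value on the left

    def _has_change(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        # Try every split of the window into an older head and a newer tail.
        for k, x in enumerate(self.window, start=1):
            if k == n:
                break
            head_sum += x
            n0, n1 = k, n - k
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) >= eps_cut:
                return True
        return False

    def add(self, x):
        """Insert x; drop old values while two sub-windows differ too much."""
        self.window.append(x)
        shrunk = False
        while self._has_change():
            self.window.popleft()
            shrunk = True
        return shrunk

    def mean(self):
        return sum(self.window) / len(self.window)


detector = NaiveAdwin(delta=0.01)
for _ in range(200):
    detector.add(0.0)
width_before = len(detector.window)
for _ in range(200):
    detector.add(1.0)
width_after = len(detector.window)
```

While the stream is stationary the window keeps growing. After the jump from 0 to 1 it sheds most of the old values, so its mean tracks the new distribution.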

SLIDE 11

Experimental Validation: TN1

[Plot: Time (sec.) versus Size (Millions) for INCTREENAT and CMTreeMiner]

Figure: Time on experiments on ordered trees on TN1 dataset

SLIDE 12

Experimental Validation

[Plot: Number of Closed Trees versus Number of Samples for AdaTreeInc 1 and AdaTreeInc 2]

Figure: Number of closed trees maintaining the same number of closed datasets on input data

SLIDE 13

Summary

Conclusions

  • New logarithmic relaxed closed support
  • Using Galois Lattice Theory, we present methods for mining closed trees:

  • Incremental: INCTREENAT
  • Sliding Window: WINTREENAT
  • Adaptive: ADATREENAT, using ADWIN to monitor change

Future Work

Labeled Trees and XML data.