
SLIDE 1

March 27, 2008 Data Mining: Concepts and Techniques

1

Data Mining:

Concepts and Techniques

Chap 8. Data Streams, Time Series Data, and

Sequential Patterns Li Xiong

Slides credits: Jiawei Han and Micheline Kamber and others

SLIDE 2


Mining Stream, Time-Series, and Sequence Data

Mining data streams
Mining time-series data
Mining sequence data

SLIDE 3


Mining Data Streams

Stream data and stream data processing
Basic methodologies for stream data processing and mining
Stream frequent pattern analysis
Stream classification
Stream cluster analysis

SLIDE 4


Data Streams

A sequence of data in transmission
An ordered pair (s, ∆) where:
s is a sequence of tuples,
∆ is the sequence of time intervals

Characteristics

Continuous
Huge volumes, possibly infinite
Fast changing; requires fast, real-time response
Random access is expensive, so single-scan algorithms are needed
Low-level or multi-dimensional in nature

SLIDE 5


Stream Data Applications

Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply & manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access is too expensive)

SLIDE 6


Architecture: Stream Query Processing and Mining

[Figure: multiple streams enter the stream query processor of an SDMS (Stream Data Management System); users/applications issue continuous queries and receive results; the processor uses scratch space (main memory and/or disk)]

SLIDE 7


DBMS versus DSMS

DBMS | DSMS
Persistent relations | Transient streams
One-time queries | Continuous queries
Random access | Sequential access
“Unbounded” disk store | Bounded main memory
Only current state matters | Historical data is important
No real-time services | Real-time requirements
Relatively low update rate | Possibly multi-GB arrival rate
Data at any granularity | Data at fine granularity
Assume precise data | Data stale/imprecise
Access plan determined by query processor, physical DB design | Unpredictable/variable data arrival and characteristics

Ack. From Motwani’s PODS tutorial slides

SLIDE 8


Mining Data Streams

Stream data and stream data processing
Foundations for stream data mining
Stream frequent pattern analysis
Stream classification
Stream cluster analysis

SLIDE 9


Methodologies for Stream Data Processing

Major challenges

Keep track of a large universe

Methodology

Choosing a subset of data

Sampling
Sliding windows
Load shedding

Summarizing the data

Synopses (trade-off between accuracy and storage)

SLIDE 10

Slides: R. Gemulla, W. Lehner, P. J. Haas

Random Sampling: Uniform Sampling

Uniform sampling

Data stream of size N
All samples of the same size are equally likely

Example

A data stream of size 4 (also called the population)
Possible samples of size 2: there are 6, each chosen with probability 1/6

SLIDE 11

Random Sampling: Reservoir Sampling

Reservoir sampling

Single-scan algorithm
Compute a uniform sample of M elements without knowing the stream size N in advance

Idea

Maintain a reservoir that forms a uniform random sample of the elements seen so far in the stream

Algorithm

Add the first M elements to the reservoir
Afterwards, at item i (i > M), flip a biased coin:

a) ignore the element (reject)
b) replace a random element in the sample (accept)

$$P(t_i \text{ is accepted}) = \frac{\text{sample size}}{\text{current population size}} = \frac{M}{i}$$

Slides: R. Gemulla, W. Lehner, P. J. Haas
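
The reservoir algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the slide authors' code; the function name `reservoir_sample` and the optional `rng` argument are assumptions made for the example.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of m items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # Add the first M elements unconditionally.
            reservoir.append(item)
        elif rng.random() < m / i:
            # Accept item i with probability M / i and
            # overwrite a uniformly chosen slot of the reservoir.
            reservoir[rng.randrange(m)] = item
        # Otherwise the item is rejected (ignored).
    return reservoir

# Hypothetical usage: sample 5 values from a stream of 1000 integers.
sample = reservoir_sample(range(1000), 5, random.Random(0))
```

Only the reservoir of M items is kept in memory, and the stream is scanned once, which is what makes the method suitable for streams whose length is not known in advance.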

SLIDE 12

Random Sampling: Reservoir Sampling (Example)

Example

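Data stream; sample size M = 2

[Figure: tree of possible reservoir states as items arrive, labeled with transition probabilities: 1/3 per branch for the third item, and 2/4 (reject) or 1/4 (each replacement) for the fourth item]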

SLIDE 13

PODS 2002

13

Sliding Windows

Make decisions based only on recent data within a sliding window of size w

An element arriving at time t expires at time t + w

Why?

Approximation technique for bounded memory
Natural in applications (emphasizes recent data)
Well-specified and deterministic semantics

[Figure: a bit stream (0 1 1 0 0 0 0 1 1 1 0 ...) with a window covering the most recent w elements]
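
As a small illustration of the expiry rule, the sketch below counts the 1s among the elements that have not yet expired. It stores the window exactly, so it uses O(w) memory; it is not one of the approximate bounded-memory synopses. The class name `SlidingWindowCounter` is an assumption made for the example.

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count of 1-bits among elements that have not yet expired."""

    def __init__(self, w):
        self.w = w
        self.window = deque()  # (arrival_time, bit) pairs, oldest first
        self.ones = 0

    def add(self, t, bit):
        """Record a bit arriving at time t and return the current count of 1s."""
        self.window.append((t, bit))
        self.ones += bit
        # An element that arrived at time t0 expires at time t0 + w.
        while self.window and self.window[0][0] + self.w <= t:
            _, old_bit = self.window.popleft()
            self.ones -= old_bit
        return self.ones

counter = SlidingWindowCounter(w=3)
counts = [counter.add(t, bit) for t, bit in enumerate([1, 1, 1, 0, 0, 0])]
```

With one element per time step, the window always holds the w most recent elements, so the count falls as the early 1s expire.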

SLIDE 14

Load Shedding

Load shedding

Discards some data so the system can keep up with the input rate

Techniques

Filters (semantic drop)

Chooses what to shed based on QoS, selectivity

Drops (random drop)

Eliminates a random fraction of input

Hospital example

Load shedding based on condition

[Figure: two query plans. In the first, a join of the Doctors and Patients streams yields the doctors who can work on a patient. In the second, a Condition Filter is added to the same plan to shed load.]

SLIDE 15


Synopsis

Summaries of the data
Can be used to return approximate answers
Trade-off between space and accuracy

Techniques

Histograms
Wavelets
Sketching

May require multiple passes

[Figure: synopses/data structures]

SLIDE 16


Mining Data Streams

Stream data and stream data processing
Foundations for stream data mining
Stream frequent pattern analysis
Stream classification
Stream cluster analysis
Research issues

SLIDE 17


Frequent Pattern Mining for Data Streams

Issues

Multiple scans for training not feasible
Memory/space management
Concept drift

Methods

Approximate frequent patterns (Manku & Motwani, VLDB’02)
Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)

SLIDE 18


Mining Approximate Frequent Patterns

Lossy Counting Algorithm (Manku & Motwani, VLDB’02)
Motivation

Mining precise frequent patterns in stream data is unrealistic
Approximate answers are often sufficient (e.g., trend/pattern analysis)
Example: a router is interested in all flows whose frequency is at least 1% (σ) of the entire traffic stream seen so far; an error of 1/10 of σ (ε = 0.1%) is acceptable

Major idea: approximation by tracking only “frequent” items

Adv: guaranteed error bound
Disadv: must keep a large set of traces

SLIDE 19


Lossy Counting for Frequent Items

[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3, ...]

Input variables
σ: min_support, ε: error bound
Fixed variables
w = ⌈1/ε⌉: bucket width
Running variables
N: current stream length
b_current = ⌈N/w⌉ ≈ εN: the id of the current bucket
f_e: the true frequency count of element e
Set of entries (e, f, ∆): (element, approximate frequency, maximum error in f)

SLIDE 20


Lossy Counting for Frequent Items

[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3, ...]

For each new element e
If an entry for e exists, increment its frequency f by 1
Otherwise, create a new entry (e, 1, b_current − 1)
At bucket boundaries
Delete entries with f + ∆ ≤ b_current
(In the simplified variant that stores no ∆, every count is instead decremented by 1 and entries that reach 0 are dropped)
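
A minimal Python sketch of these update rules follows. It assumes the (e, f, ∆) formulation in which entries with f + ∆ ≤ b_current are deleted at bucket boundaries and no blanket decrement is applied; the function names `lossy_count` and `frequent_items` are illustrative, not taken from the paper.

```python
import math

def lossy_count(stream, epsilon):
    """One pass of lossy counting; returns ({element: (f, delta)}, N)."""
    w = math.ceil(1 / epsilon)           # bucket width
    entries = {}                         # element -> [f, delta]
    n = 0
    for e in stream:
        n += 1
        b_current = math.ceil(n / w)     # id of the current bucket
        if e in entries:
            entries[e][0] += 1
        else:
            entries[e] = [1, b_current - 1]
        if n % w == 0:                   # bucket boundary: prune rare entries
            stale = [k for k, (f, d) in entries.items() if f + d <= b_current]
            for k in stale:
                del entries[k]
    return {k: (f, d) for k, (f, d) in entries.items()}, n

def frequent_items(entries, n, sigma, epsilon):
    """Report elements whose stored count is at least (sigma - epsilon) * N."""
    return {e for e, (f, _) in entries.items() if f >= (sigma - epsilon) * n}

# Hypothetical stream: 'a' is every other element, the rest occur once.
stream = list("abacadaeafag")
entries, n = lossy_count(stream, epsilon=0.25)
result = frequent_items(entries, n, sigma=0.5, epsilon=0.25)
```

In this toy run every singleton is pruned at the boundary of the bucket in which it arrived, so only the entry for 'a' survives.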

SLIDE 21


Illustration

[Figure: the summary of (e, f, ∆) entries starts empty; the elements of each incoming bucket are added to it, shown here for b_current = 1]

SLIDE 22


Approximation Guarantee

Output: items with frequency counts exceeding (σ – ε)N
Error analysis: how much do we undercount?

If the stream length seen so far is N and the bucket size is 1/ε, then the frequency count error ≤ number of buckets = εN

Approximation guarantee

No false negatives
False positives have true frequency count at least (σ – ε)N
Frequency count underestimated by at most εN

SLIDE 23


Lossy Counting for Frequent Itemsets

Divide the stream into “buckets”, as for single items

[Figure: the stream divided into Bucket 1, Bucket 2, Bucket 3, ...]

Set of entries (set, f, ∆): (itemset, approximate frequency, maximum error)

SLIDE 24


Update of Summary Data Structure

[Figure: itemset counts from 3 buckets processed in memory are added to the counts in the summary data, producing the updated summary data]

SLIDE 25


Summary of Lossy Counting

Strength

A simple idea
Can be extended to frequent itemsets

Weakness:

The space bound is not good
For frequent itemsets, each record is scanned many times
The output is based on all previous data, but sometimes we are only interested in recent data

SLIDE 26


Mining Evolution of Frequent Patterns for Stream Data

Mining evolution and dramatic changes of frequent patterns (Giannella, Han, Yan, Yu, 2003)

Use a tilted time window frame
Use a compressed form to store significant (approximate) frequent patterns and their time-dependent traces

SLIDE 27


A Tilted Time Model

Natural tilted time frame:

Example: minimal unit of a quarter (15 minutes), then 4 quarters → 1 hour, 24 hours → 1 day, …

Logarithmic tilted time frame:

Example: minimal unit of 1 minute, then 1, 2, 4, 8, 16, 32, … minutes

[Figure: two time axes. Natural frame: 12 months, 31 days, 24 hours, 4 qtrs. Logarithmic frame: 64t, 32t, 16t, 8t, 4t, 2t, t, t]

SLIDE 28


Two Structures for Mining Frequent Patterns with Tilted-Time Window (1)

FP-trees store frequent patterns
Tilted-time major: an FP-tree for each tilted time frame

SLIDE 29


Frequent Pattern & Tilted-Time Window (2)

The second data structure:

Observation: FP-trees of different time units are similar
Pattern-tree major: each node is associated with a tilted-time window

SLIDE 30


Mining Data Streams

Stream data and stream data processing
Foundations for stream data mining
Stream frequent pattern analysis
Stream classification
Stream cluster analysis

SLIDE 31


Classification for Dynamic Data Streams

Issues

Multiple scans for training not feasible
Concept drift

Methods

VFDT (Very Fast Decision Tree) and CVFDT (Concept-adapting Very Fast Decision Tree) (Domingos, Hulten, Spencer, KDD00/KDD01)
Ensemble (Wang, Fan, Yu, Han. KDD’03)
K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD’04)

SLIDE 32


VFDT

Basic idea

Consider only a small subset of training examples to find the best split attribute at a node, given a split evaluation measure G

How many examples are necessary at each node?

Statistical foundation: Hoeffding bound (additive Chernoff bound)

r: random variable
R: range of r
n: number of independent observations
The true mean of r is at least r_avg – ε with probability 1 – δ, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

Given the observed best attribute Xa and second-best attribute Xb:
if the observed ∆G = G(Xa) – G(Xb) > ε, then the true ∆G ≥ observed ∆G – ε > 0 with probability 1 – δ, so Xa is the better attribute
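
To make the bound concrete, the sketch below evaluates ε for given R, δ and n and applies the split test on the observed gap in G. The function names `hoeffding_bound` and `should_split` and the numeric inputs are assumptions chosen for illustration.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """Split once the observed gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

# Hypothetical numbers: G has range R = 1, delta = 1e-7, 1000 examples at the node.
eps = hoeffding_bound(1.0, 1e-7, 1000)          # roughly 0.09
clear_winner = should_split(0.30, 0.15, 1.0, 1e-7, 1000)
near_tie = should_split(0.30, 0.25, 1.0, 1e-7, 1000)
```

Because ε shrinks as 1/√n, a node facing a near-tie simply waits for more examples before it commits to a split.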

SLIDE 33


Hoeffding Tree Algorithm

Hoeffding Tree Input

S: sequence of examples
X: attributes
G: split evaluation function (info gain, Gini index)
δ: 1 – desired probability of choosing the correct attribute

Hoeffding Tree Algorithm

for each example in S
    retrieve G(Xa) and G(Xb)   // two highest G(Xi)
    compute ε
    if ( G(Xa) – G(Xb) > ε )
        split on Xa
        recurse to next node
        break

SLIDE 34

Decision-Tree Induction with Data Streams

Slide: Gehrke

[Figure: two decision trees induced from a data stream at different times, with yes/no tests such as "Packets > 10", "Protocol = http", "Protocol = ftp", and "Bytes > 60K"]

SLIDE 35


Hoeffding Tree: Strengths and Weaknesses

Strengths

Scales better than traditional methods

Sublinear with sampling
Very small memory utilization

Incremental

Make class predictions in parallel
New examples are added as they come

Weakness

Could spend a lot of time with ties
Memory utilization issues with tree expansion and a large number of candidate attributes

SLIDE 36


VFDT (Very Fast Decision Tree)

Modifications to Hoeffding Tree

Near-ties broken more aggressively
G recomputed only every n_min examples
Deactivates certain leaves to save memory
Poor attributes dropped
Initialize with a traditional learner (helps the learning curve)

Compared with a traditional decision tree

Similar accuracy
Better runtime with 1.61 million examples

21 minutes for VFDT
24 hours for C4.5

SLIDE 37


CVFDT (Concept-adapting VFDT)

Concept Drift

Time-changing data streams
Incorporate new examples and eliminate old ones

CVFDT

Sliding window approach

Increment counts with each new example
Decrement counts for the oldest example

Grows alternate subtrees
When an alternate subtree becomes more accurate, it replaces the old one

SLIDE 38


Mining Data Streams

Stream data and stream data processing
Foundations for stream data mining
Stream frequent pattern analysis
Stream classification
Stream cluster analysis

SLIDE 39


Stream Cluster Analysis

Issues

Multiple scans not feasible
Memory and time constraints
Concept drift

Methods

STREAM, based on k-medians [GMMO01]
CluStream, based on micro-clustering and macro-clustering (Aggarwal, Han, Wang, Yu, VLDB’03)

SLIDE 40


STREAM [GMMO01]

Problem: find k clusters in the stream such that the sum of distances from data points to their closest center is minimized (k-median method)

Basic idea: divide-and-conquer
Approximation algorithm

1. For each set of M records, Si, perform k-median clustering and find O(k) centers
   Retain only the center information (each center weighted by the number of points assigned to its cluster)
2. When there are enough centers, cluster the weighted centers

SLIDE 41


Hierarchical Clustering Tree

[Figure: a tree with data points at the bottom, level-i medians above them, and level-(i+1) medians at the top]

SLIDE 42


Hierarchical Tree

Method:

Maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1) medians, each with weight equal to the sum of the weights of the intermediate medians assigned to it

Drawbacks:

Low quality for evolving data streams (registers only k centers)
Limited functionality in discovering and exploring clusters over different portions of the stream over time

SLIDE 43


CluStream: A Framework for Clustering Evolving Data Streams

Basic idea
Tilted time framework
Two stages: micro-clustering and macro-clustering
Algorithm
Online/micro-clustering: periodically computes micro-clusters

Given: multi-dimensional points $X_1, \ldots, X_k, \ldots$ arriving at time stamps $T_1, \ldots, T_k, \ldots$
Cluster-feature vector (temporal extension of BIRCH): $(CF2^x, CF1^x, CF2^t, CF1^t, n)$

Offline/macro-clustering: computes macro-clusters using the k-means algorithm, based on a user-specified time horizon
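
The cluster-feature vector is additive, which is what lets micro-clusters be updated online and merged later. The sketch below keeps the five components for one micro-cluster; the class name `MicroCluster` and its method names are assumptions for illustration and cover only the bookkeeping, not the full CluStream algorithm.

```python
class MicroCluster:
    """Additive summary (CF2x, CF1x, CF2t, CF1t, n) of the points in one micro-cluster."""

    def __init__(self, dim):
        self.cf2x = [0.0] * dim   # per-dimension sum of squared values
        self.cf1x = [0.0] * dim   # per-dimension sum of values
        self.cf2t = 0.0           # sum of squared time stamps
        self.cf1t = 0.0           # sum of time stamps
        self.n = 0                # number of points absorbed

    def absorb(self, point, timestamp):
        """Add one point observed at the given time stamp."""
        for j, x in enumerate(point):
            self.cf1x[j] += x
            self.cf2x[j] += x * x
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1

    def merge(self, other):
        """Fold another micro-cluster into this one by component-wise addition."""
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n

    def centroid(self):
        return [s / self.n for s in self.cf1x]

mc = MicroCluster(dim=2)
mc.absorb((1.0, 2.0), timestamp=1)
mc.absorb((3.0, 4.0), timestamp=2)
```

Centroids, radii and the mean arrival time can all be derived from these sums, so the raw points never need to be stored.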

SLIDE 44


Summary: Stream Data Mining

Stream data mining: a rich and ongoing research field

Current research focus in the database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining

Powerful tools for finding general and unusual patterns
Effectiveness, efficiency and scalability: lots of open problems

SLIDE 45


References on Stream Data Mining (1)

C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Data Streams. VLDB'03.
C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On-Demand Classification of Evolving Data Streams. KDD'04.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams. VLDB'04.
S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, Sept. 2001.
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. PODS'02 (conference tutorial).
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
P. Domingos and G. Hulten. Mining High-Speed Data Streams. KDD'00.
A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. SIGMOD’02.
J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continuous Data Streams. SIGMOD'01.
C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In Kargupta et al. (eds.), Next Generation Data Mining, 2004.

SLIDE 46


References on Stream Data Mining (2)

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams. FOCS'00.
G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. KDD'01.
S. Madden, M. Shah, J. Hellerstein, and V. Raman. Continuously Adaptive Continuous Queries over Streams. SIGMOD'02.
G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB’02.
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. ICDT'05.
S. Muthukrishnan. Data Streams: Algorithms and Applications. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge Univ. Press, 1995.
S. Viglas and J. Naughton. Rate-Based Query Optimization for Streaming Information Sources. SIGMOD’02.
Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. VLDB’02.
H. Wang, W. Fan, P. S. Yu, and J. Han. Mining Concept-Drifting Data Streams using Ensemble Classifiers. KDD'03.

SLIDE 47


Mining Stream, Time-Series, and Sequence Data

Mining data streams
Mining time-series data
Mining sequence data

SLIDE 48


Time-Series Data and Time-Series Analysis

Time-series data
A sequence of data points measured at successive (often regular) time intervals

Time-series data vs. data streams
Can be a snapshot of data streams
Persistent, various granularity
Time-series analysis
Understand the characteristics and generating mechanism of the data: trend, cycle, seasonal, irregular

Make forecasts
Time-series analysis vs. ordinary analysis and spatial analysis
Applications
Economics and finance: stock price, exchange rate
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation

SLIDE 49

Time-Series Data Illustration

A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time

SLIDE 50


Identifying Patterns in Time-Series

Components

Long-term or trend movements (T)
Long-term cyclic oscillations (C), e.g. business cycles
Short-term oscillations (S), e.g. seasonal and calendar-related
Irregular or random movements (I)

Decomposition models

Additive models
Multiplicative models

[Figure: Quarterly Gross Domestic Product]

SLIDE 51


Additive Models

Additive model: TS = T + C + S + I

[Figure: General Government and Other Current Transfers to Other Sectors]

SLIDE 52


Multiplicative Models

Multiplicative model: TS = T * C * S * I

[Figure: Monthly Job Advertisements]

SLIDE 53


Trend Analysis

Trend analysis: identify the long term trend in the time series
Method
The freehand method
Function fitting

Linear vs. non-linear

Preprocessing
Smoothing: moving-average method

Alternatives: moving mean

Seasonal adjustment (deseasonalize)
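
As a small illustration of the smoothing step, the sketch below computes a simple (unweighted) moving average of order k. The function name `moving_average` and the sample series are assumptions chosen for the example.

```python
def moving_average(series, k):
    """Mean of every run of k consecutive values (simple moving average of order k)."""
    if k <= 0 or k > len(series):
        raise ValueError("k must be between 1 and len(series)")
    window_sum = sum(series[:k])
    smoothed = [window_sum / k]
    for i in range(k, len(series)):
        # Slide the window: add the new value, drop the oldest one.
        window_sum += series[i] - series[i - k]
        smoothed.append(window_sum / k)
    return smoothed

smoothed = moving_average([3, 7, 2, 0, 4, 5, 9, 7, 2], 3)
```

The output has k − 1 fewer points than the input, and larger k gives a smoother curve at the cost of blurring short-lived changes.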

SLIDE 54

Seasonality Analysis

Seasonality analysis: identify seasonal patterns

Correlational dependency of order k between each i-th element of the series and the (i−k)-th element

Method

Visual identification
Autocorrelation
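
The order-k dependency can be measured with the sample autocorrelation at lag k. The sketch below uses one common estimator (lagged covariance divided by the overall sum of squared deviations) and assumes the series is not constant; the function name `autocorrelation` and the toy series are illustrative.

```python
def autocorrelation(series, k):
    """Sample autocorrelation of the series at lag k (assumes a non-constant series)."""
    n = len(series)
    mean = sum(series) / n
    total = sum((x - mean) * (x - mean) for x in series)
    lagged = sum((series[i] - mean) * (series[i - k] - mean) for i in range(k, n))
    return lagged / total

# Hypothetical series with a period of 4 time steps.
seasonal = [1, 2, 3, 4] * 6
r4 = autocorrelation(seasonal, 4)   # lag equal to the period
r2 = autocorrelation(seasonal, 2)   # lag of half a period
```

A peak at lag k, as at lag 4 here, suggests a seasonal pattern of period k, while a lag of half a period gives a negative value for this series.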


SLIDE 55


Other Components

Estimation of cyclic variations

Long-term cyclic variations can be identified in a similar manner to seasonality

Estimation of irregular variations

By adjusting the data for trend, seasonal and cyclic variations

With systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality

SLIDE 56

Time-Series Forecasting

Technical analysis (time-series analysis) vs. fundamental analysis

Models and patterns

Head-and-shoulders pattern
Random walk model

Methods

ARIMA model
Neural networks

SLIDE 57

Time-Series Forecasting: ARIMA

ARIMA (Auto-Regressive Integrated Moving Average) model by Box and Jenkins (1976)

ARIMA(p,d,q) model

Auto-regressive process AR(p): each element is made up of a random component and a linear combination of the p prior elements
Moving average process MA(q): each element is made up of a random error component and a linear combination of the q prior random errors
Integrated/differenced I(d): the series is differenced d times to make it stationary

Special cases

ARIMA(0,1,0) – random walk model

Identification, estimation and forecasting
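
The I(d) component and the ARIMA(0,1,0) special case can be illustrated without any fitting library. In the sketch below, `difference` applies d rounds of first differencing and `random_walk` simulates a series whose first difference is pure noise; both names and the Gaussian noise are assumptions made for the example.

```python
import random

def difference(series, d=1):
    """Apply first differencing d times: y[t] = x[t] - x[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def random_walk(n, rng, start=0.0):
    """ARIMA(0,1,0): each value is the previous value plus a random shock."""
    values = [start]
    for _ in range(n - 1):
        values.append(values[-1] + rng.gauss(0.0, 1.0))
    return values

walk = random_walk(100, random.Random(42))
shocks = difference(walk)          # the d = 1 step recovers the random shocks
naive_forecast = walk[-1]          # best one-step forecast of a random walk
```

Because the shocks have zero mean, the one-step forecast of a random walk is simply its last observed value.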


SLIDE 58


Similarity Search in Time-Series Analysis

Two categories of similarity search

Whole matching: find sequences that are similar, as a whole, to the query sequence
Subsequence matching: find sequences that contain a subsequence similar to the query sequence

Typical Applications

Financial market
Market basket data analysis
Scientific databases
Medical diagnosis

SLIDE 59


Similarity Search

Whole matching

Construct a multidimensional index based on Fourier or wavelet coefficients
Retrieve similar sequences

Subsequence matching

Break each sequence into a set of pieces using a window of length w
Use a multi-piece assembly algorithm to search for longer sequence matches

SLIDE 60

References

R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. FODO’93 (Foundations of Data Organization and Algorithms).
R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB'95.
R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95.
C. Chatfield. The Analysis of Time Series: An Introduction, 3rd ed. Chapman & Hall, 1984.
C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD'94.
D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97.
Y. Moon, K. Whang, and W. Loh. Duality-based subsequence matching in time-series databases. ICDE’02.
B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. ICDE'98.
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE'00.
D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies. Springer, 2004.


SLIDE 61


Mining Stream, Time-Series, and Sequence Data

Mining data streams
Mining time-series data
Mining sequence data

SLIDE 62


Sequence Data & Sequential Patterns

Sequence data

A sequence of ordered data items, with or without a notion of time

Sequence data vs. time-series data vs. transaction data
Frequent sequential pattern mining (symbolic)
Applications of sequential pattern mining

Customer shopping sequences
Telephone calling patterns
Weblog click streams

SLIDE 63


Sequential Pattern Mining

Agrawal and Srikant, 1995
Given a set of sequences, find the complete set of frequent subsequences

An l-sequence is a sequence of length l (l items)
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database

SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
An element contains a set of unordered items, e.g. (ef)
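
The subsequence and support claims on this slide can be checked mechanically. The sketch below assumes single-letter items and the angle-bracket notation used here; the helper names `parse`, `is_subsequence` and `support` are illustrative.

```python
def parse(text):
    """Turn '<a(abc)d>' into a list of item sets, one per element."""
    body = text.strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """Greedy check: each pattern element must be a subset of a later sequence element."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def support(pattern_text, database):
    """Number of sequences in the database that contain the pattern."""
    pattern = parse(pattern_text)
    return sum(is_subsequence(pattern, parse(s)) for s in database.values())

database = {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}
sup = support("<(ab)c>", database)
```

Only sequences 10 and 30 have an element containing both a and b followed by an element containing c, which is why <(ab)c> meets min_sup = 2.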

SLIDE 64


Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases

A mining algorithm should

find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints

SLIDE 65


Sequential Pattern Mining Algorithms

Concept introduction and an initial Apriori-like algorithm

Agrawal & Srikant. Mining sequential patterns, ICDE’95

Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning’01)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)

SLIDE 66


GSP—Generalized Sequential Pattern Mining

GSP (Generalized Sequential Pattern) mining algorithm, proposed by Srikant and Agrawal, EDBT’96

Outline of the method

Initially, every item in the DB is a candidate of length 1
for each level (i.e., sequences of length k) do
    scan the database to collect the support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found

Major strength: candidate pruning by Apriori

SLIDE 67


The Apriori Property of Sequential Patterns

A basic property: Apriori (Agrawal & Srikant’94)

If a sequence S is not frequent, then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent too

Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>

Given support threshold min_sup = 2

SLIDE 68


GSP Example: Finding Length-1 Patterns

Initial candidates: all singleton sequences

Scan the database once and count the support of each candidate

Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>

min_sup = 2

Cand | Sup
<a> | 3
<b> | 5
<c> | 4
<d> | 3
<e> | 3
<f> | 2
<g> | 1
<h> | 1
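
The first scan can be reproduced with a short script. The sketch below assumes single-letter items, so the distinct letters of each sequence string are its items; the function name `item_supports` is illustrative.

```python
from collections import Counter

def item_supports(database):
    """Count, for every item, the number of sequences that contain it at least once."""
    counts = Counter()
    for sequence in database.values():
        # Each sequence contributes at most 1 to the support of an item.
        counts.update({ch for ch in sequence if ch.isalpha()})
    return counts

database = {
    10: "<(bd)cb(ac)>",
    20: "<(bf)(ce)b(fg)>",
    30: "<(ah)(bf)abf>",
    40: "<(be)(ce)d>",
    50: "<a(bd)bcb(ade)>",
}
supports = item_supports(database)
min_sup = 2
frequent = sorted(item for item, count in supports.items() if count >= min_sup)
```

Items g and h each occur in only one sequence, so they fall below min_sup and are not used when length-2 candidates are generated.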

SLIDE 69


GSP Example: Generating Length-2 Candidates

With Apriori: 6×6 + (6×5)/2 = 51 candidates
Without Apriori: 8×8 + (8×7)/2 = 92 candidates
Apriori prunes 44.57% of the candidates

[Figure: candidate tables of 2-element sequences <xy> (6×6) and 1-element sequences <(xy)> (upper triangle only)]
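
The candidate counts follow from simple combinatorics: every ordered pair of items gives a two-element sequence <xy>, and every unordered pair of distinct items gives a one-element sequence <(xy)>. The helper name `length2_candidates` below is an assumption for illustration.

```python
def length2_candidates(n_items):
    """Number of length-2 candidate sequences over n_items items."""
    two_elements = n_items * n_items             # <xy>: ordered, x may equal y
    one_element = n_items * (n_items - 1) // 2   # <(xy)>: unordered, x != y
    return two_elements + one_element

with_apriori = length2_candidates(6)     # only the 6 frequent items
without_apriori = length2_candidates(8)  # all 8 items
pruned_fraction = (without_apriori - with_apriori) / without_apriori
```

The saving grows quickly with the number of infrequent items, because the candidate count is quadratic in the number of items kept.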

SLIDE 70


GSP Example

[Figure: candidate lattice, one level per scan]

1st scan: 8 cand., 6 length-1 pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 pat., 10 cand. not in DB at all: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 47 cand., 19 length-3 pat., 20 cand. not in DB at all: <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 pat.: <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 pat.: <(bd)cba>

Legend: candidates that cannot pass the support threshold; candidates not in the DB at all

Seq. ID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>

min_sup = 2

SLIDE 71


Candidate Generate-and-test: Drawbacks

A huge set of candidate sequences is generated, especially 2-item candidate sequences.

Multiple scans of the database are needed: the length of each candidate grows by one at each database scan.

Inefficient for mining long sequential patterns:

A long pattern grows from short patterns
The number of short patterns is exponential in the length of the mined patterns.

SLIDE 72


PrefixSpan

PrefixSpan (Pei et al. @ ICDE’01)

Divide and conquer
Grow frequent patterns in projected databases

No candidate sequence needs to be generated
Major cost: constructing projected databases

SLIDE 73


The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalent Class), developed by Zaki, 2001

A vertical-format sequential pattern mining method
A sequence database is mapped to a large set of Item: <SID, EID> entries
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
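
The mapping to the vertical format can be sketched as follows. The example assumes that the EID of an element is its position within the sequence, counted from 1; the function names `to_vertical` and `support` and the tiny database are illustrative, and the temporal joins that SPADE performs on these lists are not shown.

```python
from collections import defaultdict

def to_vertical(database):
    """Map {sid: [set_of_items, ...]} to {item: [(sid, eid), ...]}."""
    id_lists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, element in enumerate(sequence, start=1):
            for item in sorted(element):
                id_lists[item].append((sid, eid))
    return dict(id_lists)

def support(id_list):
    """Support of an item = number of distinct sequences in its id-list."""
    return len({sid for sid, _ in id_list})

# Hypothetical two-sequence database: <a(abc)> and <ba>.
database = {1: [{"a"}, {"a", "b", "c"}], 2: [{"b"}, {"a"}]}
vertical = to_vertical(database)
```

Once the id-lists are built, the support of a longer pattern can be obtained by joining the lists of its parts instead of rescanning the original sequences.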

SLIDE 74


Ref: Mining Sequential Patterns

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE’04).
J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.