March 27, 2008 Data Mining: Concepts and Techniques
Data Mining:
Concepts and Techniques
Chap 8. Data Streams, Time Series Data, and Sequential Patterns
Li Xiong
Slides credits: Jiawei Han and Micheline Kamber and others
Stream data and stream data processing
Basic methodologies for stream data processing and
Stream frequent pattern analysis
Stream classification
Stream cluster analysis
Data Streams
A sequence of data in transmission
An ordered pair (s, ∆) where:
s is a sequence of tuples,
∆ is the sequence of time intervals
Characteristics
Continuous
Huge volumes, possibly infinite
Fast changing and requires fast, real-time response
Random access is expensive (single-scan algorithms)
Low-level or multi-dimensional in nature
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply &
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved but random access is too
[Figure: multiple streams feeding an SDMS (Stream Data Management System)]
query processor, physical DB design
arrival and characteristics
Stream data and stream data processing Foundations for stream data mining Stream frequent pattern analysis Stream classification Stream cluster analysis
Major challenges
Keep track of a large universe
Methodology
Choosing a subset of data
Sampling
Sliding windows
Load shedding
Summarizing the data
Synopses (trade-off between accuracy and storage)
Slides: R. Gemulla, W. Lehner, P. J. Haas
Uniform sampling
Data stream of size N
Assume all samples are equally likely
Example
a data stream of size 4 (also called population)
possible samples of size 2
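The example above can be worked through by enumeration. The following is a minimal sketch, assuming hypothetical element names t1..t4 and a helper name `uniform_samples` that is not part of the slides; it lists every sample of size 2 from a population of size 4 and the probability a uniform scheme assigns to each.

```python
from fractions import Fraction
from itertools import combinations

def uniform_samples(population, m):
    """Return all size-m samples and the probability of each under uniform sampling."""
    samples = list(combinations(population, m))
    return samples, Fraction(1, len(samples))

# Hypothetical population of size 4 (element names are illustrative only).
population = ["t1", "t2", "t3", "t4"]
samples, prob = uniform_samples(population, 2)

# Fraction of samples that contain one particular element.
inclusion = Fraction(sum("t1" in s for s in samples), len(samples))

for s in samples:
    print(s, prob)
```

Under these assumptions every element ends up in the sample with probability M/N, here 2/4.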
Reservoir sampling
Idea
Algorithm
a) ignore the element (reject)
b) replace a random element in the sample (accept)
$P(t_i \text{ is accepted}) = \frac{M}{i} = \frac{\text{sample size}}{\text{current population size}}$
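The accept/reject rule above can be sketched as follows. This is an illustrative implementation, assuming the first M elements fill the sample directly; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this sketch.

```python
import random

def reservoir_sample(stream, m, rng=None):
    """Keep a uniform sample of size m from a stream seen one element at a time."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # The first m elements fill the reservoir.
            sample.append(item)
        elif rng.random() < m / i:
            # Accept with probability M / i and replace a random sample element.
            sample[rng.randrange(m)] = item
        # Otherwise the element is ignored (rejected).
    return sample

print(reservoir_sample(range(1, 11), 2, random.Random(42)))
```

Only the current sample and the element counter are kept in memory, so the stream is scanned once.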
Example
Data stream, sample size M = 2
[Figure: possible samples over time, with transition probabilities 1/3, 2/4 and 1/4]
PODS 2002
Sliding Windows
Make decisions based only on recent data of sliding
An element arriving at time t expires at time t + w
Why?
Approximation technique for bounded memory Natural in applications (emphasizes recent data) Well-specified and deterministic semantics
0 1 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0
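A minimal sketch of the expiry rule, assuming a time-based window in which an element that arrived at time t is dropped once the clock reaches t + w. The class name `SlidingWindow` and its methods are hypothetical.

```python
from collections import deque

class SlidingWindow:
    """Time-based sliding window over (arrival_time, value) pairs."""

    def __init__(self, w):
        self.w = w
        self.items = deque()

    def add(self, t, value):
        self.items.append((t, value))
        self.expire(t)

    def expire(self, now):
        # An element arriving at time t expires at time t + w.
        while self.items and self.items[0][0] + self.w <= now:
            self.items.popleft()

    def count_ones(self):
        # Example query: number of 1-bits among the unexpired elements.
        return sum(value for _, value in self.items)

window = SlidingWindow(w=3)
for t, bit in enumerate([1, 0, 1, 1, 0, 1]):
    window.add(t, bit)
print(len(window.items), window.count_ones())
```

This exact version stores every unexpired element; approximate window synopses trade accuracy for less memory.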
Load shedding
Discards some data so the system can keep up with the input
Techniques
Filters (semantic drop)
Chooses what to shed based on QoS, selectivity
Drops (random drop)
Eliminates a random fraction of input
Hospital example
Load shedding based on condition
[Figure: doctors, patients, and doctors who can work on a patient, shown without and with a condition filter]
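The two shedding techniques can be contrasted in a short sketch. The function names, the patient records and the "critical" condition below are all hypothetical and only illustrate the hospital example.

```python
import random

def random_drop(stream, keep_fraction, rng=None):
    """Drop: eliminate a random fraction of the input."""
    rng = rng or random.Random()
    for item in stream:
        if rng.random() < keep_fraction:
            yield item

def semantic_drop(stream, keep):
    """Filter: shed tuples chosen by a predicate on their content."""
    for item in stream:
        if keep(item):
            yield item

# Hypothetical patient tuples: (name, condition).
patients = [("p1", "critical"), ("p2", "stable"), ("p3", "critical"), ("p4", "stable")]

kept_by_condition = list(semantic_drop(patients, lambda p: p[1] == "critical"))
kept_at_random = list(random_drop(patients, 0.5, random.Random(1)))
print(kept_by_condition)
print(kept_at_random)
```

The semantic filter needs knowledge of the query, while the random drop needs only a target rate.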
Synopsis
Summaries for data
Can be used to return approximate answers
Trade-off between space and accuracy
Techniques
Histograms
Wavelets
Sketching
May require multiple passes
Synopses/Data Structures
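As one illustration of a synopsis, the sketch below keeps an equi-width histogram and answers range-count queries approximately. The class name, the fixed value range and the uniformity assumption inside each bucket are choices made for this example, not details from the slides.

```python
class EquiWidthHistogram:
    """Fixed-size summary of values assumed to lie in [low, high]."""

    def __init__(self, low, high, buckets):
        self.low = low
        self.width = (high - low) / buckets
        self.counts = [0] * buckets

    def add(self, x):
        # Clamp so that x == high falls into the last bucket.
        idx = min(int((x - self.low) / self.width), len(self.counts) - 1)
        self.counts[idx] += 1

    def estimate_range_count(self, a, b):
        """Approximate how many values fall in [a, b), assuming uniform buckets."""
        total = 0.0
        for i, count in enumerate(self.counts):
            lo = self.low + i * self.width
            hi = lo + self.width
            overlap = max(0.0, min(b, hi) - max(a, lo))
            total += count * overlap / self.width
        return total

hist = EquiWidthHistogram(0, 100, 10)
for value in range(100):
    hist.add(value)
print(hist.estimate_range_count(25, 35))
```

More buckets give better accuracy at the cost of more space, which is the trade-off named above.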
Stream data and stream data processing Foundations for stream data mining Stream frequent pattern analysis Stream classification Stream cluster analysis Research issues
Issues
Multiple scans for training not feasible
Memory/space management
Concept drift
Methods
Approximate frequent patterns (Manku & Motwani, VLDB'02)
Mining evolution of freq. patterns (C. Giannella, J. Han, X. Yan, P.S. Yu, 2003)
Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
Mining precise freq. patterns in stream data: unrealistic
Approximate answers are often sufficient (e.g., trend/pattern analysis)
Example: a router is interested in all flows whose frequency is at least 1% (σ) of the entire traffic stream seen so far; an error of 1/10 of σ (ε = 0.1%) is acceptable
Adv: guaranteed error bound
Disadv: keeps a large set of traces
Summary entries of the form (e, f, ∆)
Output: items with frequency counts exceeding (σ – ε)N
Error analysis: how much do we undercount?
Approximation guarantee
No false negatives
False positives have true frequency count at least (σ – ε)N
Frequency count underestimated by at most εN
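The guarantees above can be seen in a small sketch of lossy counting for single items, following the bucket scheme of Manku & Motwani cited earlier. The bucket width ⌈1/ε⌉ and the pruning rule f + ∆ ≤ current bucket come from that algorithm; the function names and the toy stream are assumptions of this example.

```python
import math

def lossy_count(stream, epsilon):
    """Return ({item: [f, delta]}, N) after one pass over the stream."""
    width = math.ceil(1 / epsilon)          # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)       # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # delta bounds how often the item may have been missed so far.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            for key in [k for k, (f, d) in entries.items() if f + d <= bucket]:
                del entries[key]
    return entries, n

def frequent_items(entries, n, sigma, epsilon):
    """Report items whose counted frequency is at least (sigma - epsilon) * N."""
    return {k: f for k, (f, d) in entries.items() if f >= (sigma - epsilon) * n}

# Toy stream: two heavy items followed by ten items that occur once each.
stream = ["a"] * 60 + ["b"] * 30 + ["x%d" % i for i in range(10)]
entries, n = lossy_count(stream, epsilon=0.1)
print(frequent_items(entries, n, sigma=0.2, epsilon=0.1))
```

On this stream the one-off items are pruned at the last bucket boundary, and the stored counts for the heavy items undercount by no more than εN.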
Divide Stream into ‘Buckets’ as for itemsets
[Figure: processing 3 buckets in memory; their itemset counts are added to the summary data]
Strength
A simple idea
Can be extended to frequent itemsets
Weakness:
Space bound is not good
For frequent itemsets, they do scan each record many
The output is based on all previous data. But
(Giannella, Han, Yan, Yu, 2003)
Use tilted time window frame
Use compressed form to store significant (approximate) frequent patterns and their time-dependent traces
Natural tilted time frame:
Example: Minimal: quarter, then 4 quarters → 1 hour, 24 hours → day, …
Logarithmic tilted time frame:
Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …
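A short sketch of how few units a logarithmic tilted time frame needs. It assumes the widths follow the sequence in the example (1, then 1, 2, 4, 8, ...) until the requested history is covered; the function name `log_tilted_frames` is hypothetical.

```python
def log_tilted_frames(total_units):
    """Window widths, in minimal time units, needed to cover total_units."""
    widths = [1]
    covered = 1
    next_width = 1
    while covered < total_units:
        widths.append(next_width)
        covered += next_width
        next_width *= 2
    return widths

print(log_tilted_frames(16))
# One year of history at 1-minute granularity (365 * 24 * 60 minutes).
print(len(log_tilted_frames(365 * 24 * 60)))
```

The number of windows grows with the logarithm of the history length, so recent data keeps fine granularity while old data is stored coarsely.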
FP-Trees store Frequent Patterns
Tilted-time major: An FP-tree for each tilted time frame
The second data structure:
Observation: FP-Trees of different time units are similar
Pattern-tree major: each node is associated with a tilted-time window
Stream data and stream data processing Foundations for stream data mining Stream frequent pattern analysis Stream classification Stream cluster analysis
Issues
Multiple scans for training not feasible
Concept drift
Methods
VFDT (Very Fast Decision Tree) and CVFDT (Concept-adapting Very Fast Decision Tree) (Domingos, Hulten, Spencer, KDD'00/KDD'01)
Ensemble (Wang, Fan, Yu, Han, KDD'03)
K-nearest neighbors (Aggarwal, Han, Wang, Yu, KDD'04)
Basic idea
Consider only a small subset of training examples to find best split attribute at a node given a split evaluation measure G
How many examples are necessary at each node?
r: random variable
R: range of r
n: # independent observations
True mean of r is at least r_avg – ε, with probability 1 – δ
$\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
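The bound above is easy to evaluate numerically. The sketch below computes ε for given R, δ and n, inverts the formula to get the number of examples needed for a target ε, and applies the usual Hoeffding-tree split test (split when the gap in G between the two best attributes exceeds ε). The function names and the sample numbers are assumptions of this example.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

def should_split(g_best, g_second, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (g_best - g_second) > hoeffding_epsilon(value_range, delta, n)

n_required = examples_needed(1.0, 1e-6, 0.05)
print(n_required, hoeffding_epsilon(1.0, 1e-6, n_required))
print(should_split(0.30, 0.20, 1.0, 1e-6, n_required))
```

Note that n depends only on R, δ and ε, not on the size of the stream.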
Hoeffding Tree Input
Hoeffding Tree Algorithm
Slide: Gehrke
[Figure: a decision tree growing as the data stream arrives; split tests shown: Packets > 10, Protocol = http, Protocol = ftp, Bytes > 60K]
Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Make class predictions in parallel
New examples are added as they come
Weakness
Could spend a lot of time with ties
Memory utilization issues with tree expansion and large
Modifications to Hoeffding Tree
Near-ties broken more aggressively
G recomputed only every n_min examples
Deactivates certain leaves to save memory
Poor attributes dropped
Initialize with traditional learner (helps learning curve)
Compare to traditional decision tree
Similar accuracy
Better runtime with 1.61 million examples
21 minutes for VFDT
24 hours for C4.5
Concept Drift
Time-changing data streams
Incorporate new and eliminate old
CVFDT
Sliding window approach
Increments counts with each new example
Decrements counts for each old example
Grows alternate subtrees
When an alternate subtree is more accurate => replace the old one
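The increment/decrement bookkeeping can be sketched with a windowed counter. This is only the counting part, not CVFDT itself; the class name `WindowedCounts` and the (attribute value, class label) keys are assumptions of this example.

```python
from collections import Counter, deque

class WindowedCounts:
    """Counts of (attribute value, class label) pairs over the last window_size examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, attribute_value, label):
        key = (attribute_value, label)
        self.window.append(key)
        self.counts[key] += 1            # increment for the new example
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1        # decrement for the expired example
            if self.counts[old] == 0:
                del self.counts[old]

stats = WindowedCounts(window_size=3)
for example in [("http", "ok"), ("http", "ok"), ("ftp", "bad"), ("ftp", "bad")]:
    stats.add(*example)
print(dict(stats.counts))
```

Because the statistics always reflect the current window, split decisions made from them can follow a drifting concept.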
Stream data and stream data processing Foundations for stream data mining Stream frequent pattern analysis Stream classification Stream cluster analysis
Issues
Multiple scans not feasible
Memory and time constraints
Concept drift
Methods
STREAM based on k-medians [GMMO01]
CluStream based on microclustering and macroclustering (Aggarwal, Han, Wang, Yu, VLDB'03)
[Figure: data points, level-i medians, level-(i+1) medians]
Method:
maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1)
Drawbacks:
Low quality for evolving data streams (register only k
Limited functionality in discovering and exploring
Given multi-dimensional points $\overline{X}_1, \ldots, \overline{X}_k, \ldots$ at time stamps $T_1, \ldots, T_k, \ldots$
Cluster-feature vector (temporal extension of BIRCH): $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
means algorithm
based on user-specified time-horizon
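The cluster-feature vector can be maintained incrementally because every component is a sum. The sketch below is an illustration under that reading (CF1 holds linear sums, CF2 holds sums of squares, for both the data values and the time stamps); the class and method names are hypothetical.

```python
class MicroCluster:
    """Additive summary (CF2x, CF1x, CF2t, CF1t, n) of a group of points."""

    def __init__(self, dims):
        self.cf2x = [0.0] * dims   # per-dimension sum of squares
        self.cf1x = [0.0] * dims   # per-dimension linear sum
        self.cf2t = 0.0            # sum of squared time stamps
        self.cf1t = 0.0            # sum of time stamps
        self.n = 0                 # number of points

    def add(self, point, t):
        for j, x in enumerate(point):
            self.cf1x[j] += x
            self.cf2x[j] += x * x
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def merge(self, other):
        # Additivity: the summary of a union is the component-wise sum.
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n

    def centroid(self):
        return [s / self.n for s in self.cf1x]

mc = MicroCluster(dims=2)
mc.add((1.0, 2.0), t=1)
mc.add((3.0, 4.0), t=2)
print(mc.centroid(), mc.n)
```

The same additivity lets stored snapshots be subtracted from each other to recover the clusters of a user-chosen time horizon.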
Stream data mining: A rich and on-going research
Current research focus in database community:
DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining
Powerful tools for finding general and unusual patterns
Effectiveness, efficiency and scalability: lots of open problems
Streams, VLDB'03
Data Streams, KDD'04
High Dimensional Data Streams, VLDB'04
2001
Stream Systems”, PODS'02. (Conference tutorial)
Analysis of Time-Series Data Streams, VLDB'02
Queries over Data Streams, SIGMOD’02
data streams. SIGMOD'01
at multiple time granularities, Kargupta, et al. (eds.), Next Generation Data Mining’04
Queries over Streams, SIGMOD02
Elements in Data Streams. ICDT'05
fourteenth annual ACM-SIAM symposium on Discrete algorithms, 2003
Sources, SIGMOD’02
in Real Time, VLDB’02
Ensemble Classifiers, KDD'03
intervals
trend, cycle, seasonal, irregular
A time series can be illustrated as a time-series graph
Components
Long-term or trend movements (T)
Long-term cyclic oscillations (C), e.g. business cycles
Short-term oscillations (S), e.g. seasonal and calendar-related
Irregular or random movements (I)
Decomposition models
Additive models
Multiplicative models
Quarterly Gross Domestic Product
Additive Model: TS = T + C + S + I
General Government and Other Current Transfers to Other Sectors
Multiplicative Model: TS = T * C * S * I
Monthly Job Advertisements
Linear vs. non-linear
Alternatives: moving mean
Seasonality analysis: identify seasonal patterns
Correlational dependency of order k between each i'th
Method
Visual identification
Autocorrelation
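Both tools named above fit in a few lines. The sketch below computes a simple moving mean for trend smoothing and the lag-k autocorrelation for spotting seasonality; the function names and the synthetic period-4 series are assumptions of this example.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def autocorrelation(series, k):
    """Correlation between each element and the element k steps later."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[i] - mean) * (series[i + k] - mean)
                     for i in range(n - k))
    return covariance / variance

# Synthetic series with a seasonal period of 4 and no trend.
series = [0, 1, 0, -1] * 5
print(moving_average(series, 4)[:3])
print(autocorrelation(series, 4), autocorrelation(series, 2))
```

A strong positive value at lag 4 and a strong negative value at lag 2 point to a period of 4, and a moving mean over one full period removes the seasonal component.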
Estimation of cyclic variations
Long term cyclic variations can be identified in similar
Estimation of irregular variations
By adjusting the data for trend, seasonal and cyclic
With the systematic analysis of the trend, cyclic, seasonal,
Technical analysis (time series analysis) vs.
Models and patterns
Head and shoulder pattern Random walk model
Methods
ARIMA model Neural networks
ARIMA (Auto-Regressive Integrated Moving Average)
ARIMA(p,d,q) model
Auto-regressive process AR(p): each element is made up of a random component and a linear combination of prior elements
Moving average process MA(q): each element is made up of a random error component and a linear combination of prior random errors
Integrated/Differenced I(d)
Special cases
ARIMA(0,1,0) – random walk model
Identification, estimation and forecasting
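The differencing step I(d) and the random walk special case can be illustrated directly. This sketch assumes Gaussian noise for the random component; the function names are hypothetical, and no ARIMA fitting is attempted.

```python
import random

def difference(series, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def random_walk(n, rng, start=0.0):
    """ARIMA(0,1,0): each value is the previous value plus a random shock."""
    values = [start]
    for _ in range(n - 1):
        values.append(values[-1] + rng.gauss(0.0, 1.0))
    return values

walk = random_walk(50, random.Random(7))
shocks = difference(walk)

print(difference([1, 3, 6, 10]))
print(difference([1, 3, 6, 10], d=2))
```

Differencing a random walk once recovers the random shocks, which is why its d is 1 while p and q are 0.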
Two categories of similarity search
Whole matching: find a sequence that is similar to the
Subsequence matching: find all pairs of similar
Typical Applications
Financial market
Market basket data analysis
Scientific databases
Medical diagnosis
Whole matching
Construct a multidimensional index based on Fourier or Wavelet coefficients
Retrieve similar sequences
Subsequence matching
Break each sequence into a set of pieces of window with length w
Use a multi-piece assembly algorithm to search for longer sequence matches
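The reason a few Fourier coefficients work as index keys is that, with an orthonormal transform, the distance between truncated coefficient vectors never exceeds the true Euclidean distance, so filtering on them causes no false dismissals. The sketch below checks this on two made-up series; the function names and data are assumptions of this example, and the transform is computed naively rather than with an FFT.

```python
import cmath
import math

def dft_features(series, k):
    """First k coefficients of the orthonormal discrete Fourier transform."""
    n = len(series)
    features = []
    for f in range(k):
        coef = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                   for t, x in enumerate(series))
        features.append(coef / math.sqrt(n))
    return features

def euclidean(a, b):
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

s1 = [1, 2, 3, 4, 5, 6, 7, 8]
s2 = [2, 2, 4, 4, 6, 5, 8, 9]

true_distance = euclidean(s1, s2)
feature_distance = euclidean(dft_features(s1, 3), dft_features(s2, 3))
print(feature_distance, true_distance)
```

Candidates retrieved through the index are then verified against the full sequences to remove false alarms.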
(Foundations of Data Organization and Algorithms).
scaling, and translation in time-series databases. VLDB'95.
ICDE’02
for co-evolving time sequences. ICDE'00.
and Case Studies, SPRINGER, 2004
Sequence data
Sequence data vs. time-series data vs. transaction data
Frequent sequential pattern mining (symbolic)
Applications of sequential pattern mining
Customer shopping sequences
Telephone calling patterns
Weblog click streams
Agrawal and Srikant 1995
Given a set of sequences, find the complete set of
An l-sequence is a sequence of length l (l items)
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
A sequence: <(ef)(ab)(df)cb>
An element contains a set of unordered items
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
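The subsequence and support definitions can be checked against the four sequences above. This is a minimal sketch, assuming single-character items and the angle-bracket notation used on the slide; `parse`, `is_subsequence` and `support` are hypothetical helper names.

```python
def parse(text):
    """Turn '<a(abc)d>' into a list of elements, each a frozenset of items."""
    body = text.strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """Greedy check: each pattern element must be contained in a later sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    return sum(is_subsequence(pattern, sequence) for sequence in database)

database = [parse(s) for s in
            ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]]

print(is_subsequence(parse("<a(bc)dc>"), database[0]))
print(support(parse("<(ab)c>"), database))
```

Here <(ab)c> is contained in sequences 10 and 30, so it meets min_sup = 2.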
A huge number of possible sequential patterns are
A mining algorithm should
find the complete set of patterns, when possible,
be highly efficient, scalable, involving only a small
be able to incorporate various kinds of user-specific
Agrawal & Srikant. Mining sequential patterns, ICDE’95
& Agrawal @ EDBT’96)
al.@KDD’00; Pei, et al.@ICDE’01)
Rastogi, Shim@VLDB’99; Pei, Han, Wang @ CIKM’02)
@SDM’03)
GSP (Generalized Sequential Pattern) mining algorithm
proposed by Agrawal and Srikant, EDBT’96
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each
generate candidate length-(k+1) sequences from
repeat until no frequent sequence or no candidate can
Major strength: Candidate pruning by Apriori
A basic property: Apriori (Agrawal & Srikant'94)
If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so are <hab> and <(ah)b>
SID  Sequence
10   <(bd)cb(ac)>
20   <(bf)(ce)b(fg)>
30   <(ah)(bf)abf>
40   <(be)(ce)d>
50   <a(bd)bcb(ade)>
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1
2-element sequences
      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>   <da>  <db>  <dc>  <dd>  <de>  <df>
<e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>
1-element sequences
      <a>  <b>     <c>     <d>     <e>     <f>
<a>        <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                        <(cd)>  <(ce)>  <(cf)>
<d>                                <(de)>  <(df)>
<e>                                        <(ef)>
<f>
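The size of the length-2 candidate set follows from simple counting, sketched below: every ordered pair of frequent items gives a two-element sequence, and every unordered pair of distinct items gives a one-element sequence. The function name `length2_candidates` is hypothetical.

```python
from itertools import combinations, product

def length2_candidates(items):
    """Length-2 candidates built from a collection of length-1 patterns."""
    # <xy>: two elements, x followed by y (x may equal y).
    sequential = ["<%s%s>" % (x, y) for x, y in product(items, repeat=2)]
    # <(xy)>: one element holding two distinct items.
    itemsets = ["<(%s%s)>" % (x, y) for x, y in combinations(items, 2)]
    return sequential + itemsets

with_pruning = length2_candidates("abcdef")       # 6 frequent items
without_pruning = length2_candidates("abcdefgh")  # all 8 items
print(len(with_pruning), len(without_pruning))
```

With the 6 frequent items this gives 6 * 6 + 6 * 5 / 2 = 51 candidates; starting from all 8 items would give 8 * 8 + 8 * 7 / 2 = 92, so Apriori pruning removes 41 of them.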
1st scan: 8 cand., 6 length-1 pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 pat., 10 cand. not in DB at all: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 47 cand., 19 length-3 pat., 20 cand. not in DB at all: <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 pat.: <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 pat.: <(bd)cba>
min_sup =2
A huge set of candidate sequences generated.
Especially 2-item candidate sequences.
Multiple Scans of database needed.
The length of each candidate grows by one at each
Inefficient for mining long sequential patterns.
A long pattern grows from short patterns
The number of short patterns is exponential to the
PrefixSpan (Han et al.@KDD’00)
Divide and conquer
Grow frequent patterns in projected database
No candidate sequence needs to be generated
Major cost: constructing projected databases
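The pattern-growth idea can be sketched for a simplified case. The code below assumes every element holds exactly one item, so it is not the full PrefixSpan algorithm for itemset elements; projected databases are kept as (sequence index, start position) pairs instead of copied suffixes. The function name and the toy database are assumptions of this example.

```python
from collections import defaultdict

def prefixspan(sequences, min_sup):
    """Frequent sequential patterns over sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Count, per item, how many projected suffixes contain it.
        counts = defaultdict(int)
        for sid, start in projected:
            for item in set(sequences[sid][start:]):
                counts[item] += 1
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            pattern = prefix + [item]
            results.append((pattern, sup))
            # Project: keep the suffix after the first occurrence of the item.
            next_projected = []
            for sid, start in projected:
                try:
                    pos = sequences[sid].index(item, start)
                except ValueError:
                    continue
                next_projected.append((sid, pos + 1))
            mine(pattern, next_projected)

    mine([], [(sid, 0) for sid in range(len(sequences))])
    return results

toy_db = [["a", "b", "c"], ["a", "c", "b"], ["a", "b"]]
for pattern, sup in prefixspan(toy_db, min_sup=2):
    print(pattern, sup)
```

Each recursive call only looks at suffixes that already contain the current prefix, so no candidate is ever generated and then found missing from the database.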
SPADE (Sequential PAttern Discovery using Equivalent
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of
Item: <SID, EID>
Sequential pattern mining is performed by
growing the subsequences (patterns) one item at a
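The vertical format can be sketched as a mapping from each item to its list of (SID, EID) pairs, where EID is the position of the element inside the sequence. The join below handles only the simplest case, "first item, then second item in a later element"; the function names and the two-sequence database are assumptions of this example, and the equivalence-class machinery of SPADE is not shown.

```python
from collections import defaultdict

def to_vertical(database):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, element in enumerate(sequence, start=1):
            for item in sorted(element):
                idlists[item].append((sid, eid))
    return dict(idlists)

def temporal_join(first, second):
    """Id-list of the 2-sequence 'first item, later second item'."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and eid > earliest[sid]]

def support(idlist):
    return len({sid for sid, _ in idlist})

# Hypothetical horizontal database: SID -> list of elements (sets of items).
horizontal = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
}
vertical = to_vertical(horizontal)
a_then_b = temporal_join(vertical["a"], vertical["b"])
print(vertical["a"])
print(a_then_b, support(a_then_b))
```

Support is read off the joined id-list directly, so longer patterns are grown without rescanning the original sequences.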
2001.
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE’04).
Databases, CIKM'02.
Large Database, KDD'04.
Database, ICDE'99.
data, KDD'00.