Processing Forecasting Queries
Songyun Duan, Shivnath Babu
Duke University
2
Motivation
Real-time forecasting of future events based on
historical data is useful in many domains
- Proactive system management
If a performance problem is forecast, take corrective actions in advance to avoid it
- Adaptive query processing
- Inventory planning
- Environmental monitoring
- And many others
Need a framework to process forecasting queries
automatically and efficiently
3
Forecasting Queries
Select X_i From D Forecast L
- D: historical time-series data up to timestamp τ
Denoted as D(T; X_1, X_2, ..., X_n)
- L: lead time; the query asks for the value of X_i at time τ + L
- Written as Forecast(D, X_i, L)
[Figure: the table D with attribute columns X_1, ..., X_i, ..., X_n; the value of X_i at τ + L is the unknown "?"]
4
An Example Forecasting Query
Select C From Usage Forecast 1 day
[Table: Usage, with attributes Day, A, B, and C recorded for days 9 to 17; the query asks for the value of C on day 18, one day (the lead time) after the last recorded day]
5
Example Query Processing - a Naïve Approach
[Figure: the Usage table extended with the class attribute C_1, the value of C one day ahead; C_1 is the unknown "?" for the most recent day]
C_1 = 0.47*A + 1.18*B - 0.53*C
Forecast for C_1: 26.6
6
Example Query Processing
[Figure: the Usage table transformed into the shifted attributes A_{-2} and B_{-1} with the class attribute C_1; the values 54 and 68 are shown beside C_1]
- Bayesian network
Previous prediction = 26.6
C_1 = 1.24*A_{-2} + 0.3*B_{-1}
New prediction = 54
7
Challenges
To process real-time forecasting queries,
Challenge 1: generate a good processing strategy
automatically and efficiently
- Apply appropriate transformations to the data
E.g., shift, discretization, normalization, aggregation
- Pick the right type of statistical model
E.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)
Challenge 2: for continuous forecasting over streaming
data, adapt processing strategies when necessary
8
Outline
- Space of execution plans
- Plan search algorithm
- Processing continuous forecasting queries
- Experimental evaluation
- Related work
- Summary
9
Execution Plan
Query: Forecast(D, X_i, L)
[Figure: time-series data D flows through transformers T_1, ..., T_k, producing D'_1, ..., D'_k; a synopsis builder learns a synopsis from the transformed data; a predictor applies the synopsis to a tuple u and outputs the forecasting result]
Logical operators
- Transformer: D ⇒ D'
E.g., Shift(X, δ)
- Synopsis builder: B(D, Z) ⇒ Syn({Y_1, ..., Y_n} → Z)
Synopsis: e.g., linear regression, regression tree
- Predictor: P(Syn, u) ⇒ u.Z
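The three logical operators can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attribute, a one-variable least-squares line as the synopsis, and hypothetical function names (shift, build_synopsis, predict, forecast).

```python
from statistics import mean


def shift(column, delta):
    """Transformer Shift(X, delta): row t holds X at time t + delta, or None if out of range."""
    n = len(column)
    return [column[t + delta] if 0 <= t + delta < n else None for t in range(n)]


def build_synopsis(ys, zs):
    """Synopsis builder: least-squares fit of Z = slope * Y + intercept on complete rows."""
    pairs = [(y, z) for y, z in zip(ys, zs) if y is not None and z is not None]
    mean_y = mean(y for y, _ in pairs)
    mean_z = mean(z for _, z in pairs)
    sxx = sum((y - mean_y) ** 2 for y, _ in pairs)
    sxy = sum((y - mean_y) * (z - mean_z) for y, z in pairs)
    slope = sxy / sxx
    return slope, mean_z - slope * mean_y


def predict(synopsis, y):
    """Predictor: apply the synopsis to the known attribute of tuple u."""
    slope, intercept = synopsis
    return slope * y + intercept


def forecast(series, lag, lead):
    """Toy plan: Shift(X, -lag) as input, Shift(X, lead) as class attribute."""
    y = shift(series, -lag)   # X_{-lag}
    z = shift(series, lead)   # class attribute X_{lead}
    synopsis = build_synopsis(y, z)
    return predict(synopsis, y[-1])  # last row is the tuple u with unknown Z


print(forecast([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lag=1, lead=1))
```

On this toy series the class attribute is always the lagged value plus 2, so the fitted line has slope 1 and intercept 2.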
10
Sample Execution Plan
Select C From Usage Forecast 1 day
[Figure: plan over Usage: Shift(A, -2), Shift(B, -1), and Shift(C, 1) produce D' with the shifted attributes A_{-2}, B_{-1} and the class attribute C_1; the projection π_{A_{-2}, B_{-1}} feeds an MLR learner, and the MLR predictor applies the learned synopsis to u]
Synopsis = multivariate linear regression (MLR): C_1 = 1.24*A_{-2} + 0.3*B_{-1}
u = (35, 47, ?)
Forecast for u: 54
11
Estimating Accuracy of a Plan
- Accuracy
How close are forecasting results to "real values"
- Given a dataset D and a plan P
K-fold cross-validation to get an unbiased estimate
[Figure: D is split into training and test data; the training data passes through the transformers T_1, ..., T_k and the synopsis builder for the plan's synopsis type; the predictor is run on the test data, and predicted values are compared with actual values (a_1, ..., a_m and b_1, ..., b_m)]
Example accuracy metric:
RMSE = \sqrt{\frac{1}{m} \sum_{i} (a_i - b_i)^2}
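The accuracy estimate above can be sketched as follows. This is a minimal illustration, not the authors' code: the fold layout (contiguous folds) and the names rmse, k_fold_indices, and cross_validated_rmse are assumptions, and build and predict stand in for any synopsis builder and predictor.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over m paired values."""
    m = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / m)


def k_fold_indices(m, k):
    """Yield (train, test) index lists for k contiguous folds over m rows."""
    sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(m) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_rmse(rows, targets, build, predict, k=5):
    """Train on k-1 folds, predict the held-out fold, and pool all errors."""
    actual, predicted = [], []
    for train, test in k_fold_indices(len(rows), k):
        synopsis = build([rows[i] for i in train], [targets[i] for i in train])
        for i in test:
            actual.append(targets[i])
            predicted.append(predict(synopsis, rows[i]))
    return rmse(actual, predicted)
```

Any builder and predictor pair with the signatures build(rows, targets) and predict(synopsis, row) can be plugged in.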
12
Find a Good Plan Quickly
Optimization challenge: minimize the number of plans
executed before finding a plan with high accuracy
Efficient plan search to balance accuracy Vs. running time
Simplified plan space to describe our algorithms
- Two types of transformers: Shift and Project
- One synopsis: Bayesian Network (BN)
[Figure: accuracy of the best execution plan found so far vs. elapsed processing time for Algorithm 1 and Algorithm 2; the plot marks the point where a reasonably good execution plan is available and the lead time]
13
Fa's Plan Search (FPS) Algorithm
Query: Forecast(Usage, C, 1)
Dataset (n = 3): attributes A, B, C
Shift(X_i, δ), with -Δ ≤ δ < 0 (Δ = 2)
[Figure: a toy dataset with rows (1, 2, 3), (4, 5, 6), (7, 8, 9) is extended with the shifted attributes A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2} and the class attribute C_1, whose latest value is the unknown "?"; the attributes are then arranged in a ranked list]
Attribute Ranking
- Linear-correlation-based
- Entropy-based, e.g., information gain
Learn a synopsis and generate the plan?
- Imagine n = 100, Δ = 90
- Extended data has 9000+ attributes
- Takes too much time to get a plan
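A minimal sketch of the attribute blow-up and of linear-correlation-based ranking follows. The function names and the attribute naming scheme are hypothetical, and ranking by absolute Pearson correlation is one assumed reading of "linear-correlation-based"; the slides do not give the exact ranking formula.

```python
import math
from statistics import mean


def shifted_attribute_names(attributes, max_shift):
    """One shifted copy per attribute and per shift 1..max_shift, e.g. 'A_-2'."""
    return [f"{a}_-{d}" for d in range(1, max_shift + 1) for a in attributes]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_attributes(columns, target):
    """Attribute names sorted by |correlation| with the class attribute, strongest first."""
    return sorted(columns, key=lambda name: abs(pearson(columns[name], target)), reverse=True)


# n = 100 attributes with shifts 1..90 give 9000 shifted attributes,
# before counting the 100 unshifted ones.
print(len(shifted_attribute_names([f"X{i}" for i in range(100)], 90)))
```

With n = 3 and a maximum shift of 2, the same function yields the six shifted attributes shown on the slide.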
14
Fa's Plan Search (FPS) Algorithm
[Figure: shifted data (n = 3, Δ = 2) with attributes A, B, C, A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2} and class attribute C_1, arranged in a ranked list; attribute selection picks attribute sets from the ranked list, and each set yields a plan]
Attribute selection
- Fast Correlation-Based Filter (FCBF)
- Correlation-based Feature Selection (CFS)
- Wrapper
Plan P1 (accuracy Acc1), from (A_{-2}, B_{-1}): Shift(A, -2), Shift(B, -1), π_{A_{-2}, B_{-1}}, BN, Predictor
Plan P2 (accuracy Acc2), from (A_{-2}, C_{-1}, C_{-2}): Shift(A, -2), Shift(C, -1), Shift(C, -2), π_{A_{-2}, C_{-1}, C_{-2}}, BN, Predictor
Stop, otherwise increase Δ
Do forecasting using the plan with highest accuracy
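The search loop suggested by this slide can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the FPS algorithm itself: it walks the ranked list in fixed-size chunks, does not carry attributes over between chunks, and takes the attribute-selection and accuracy-estimation steps as hypothetical callables (select_attributes, estimate_accuracy) with an assumed stopping threshold target_accuracy.

```python
def plan_search(ranked, chunk_size, select_attributes, estimate_accuracy, target_accuracy):
    """Try one candidate plan per chunk of the ranked list; keep the most accurate one."""
    best_plan, best_accuracy = None, float("-inf")
    for start in range(0, len(ranked), chunk_size):
        chunk = ranked[start:start + chunk_size]
        plan = select_attributes(chunk)      # e.g. a filter such as FCBF or CFS
        accuracy = estimate_accuracy(plan)   # e.g. a cross-validated score
        if accuracy > best_accuracy:
            best_plan, best_accuracy = plan, accuracy
        if best_accuracy >= target_accuracy:
            break                            # good enough: stop searching
    return best_plan, best_accuracy


# Toy usage with made-up accuracies for two candidate attribute sets.
scores = {("A_-2", "B_-1"): 0.7, ("C_-1", "C_-2"): 0.9}
ranked_list = ["A_-2", "B_-1", "C_-1", "C_-2"]
print(plan_search(ranked_list, 2, tuple, scores.get, 0.85))
```

Returning the best plan seen so far, rather than the last one tried, matches the slide's instruction to forecast with the plan of highest accuracy.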
15
Adaptive Fa's Plan Search (FPS-A)
Continuous query: Forecast(S[W], X_i, L)
[Figure: stream S with window W and lead time L]
The ranked list and plans for S[W]
- Plan P1: (A_{-2}, B_{-1})
- Plan P2: (A_{-2}, C_{-1}, C_{-2})
[Figure: ranked list A_{-1}, B_{-1}, C_{-1}, A_{-2}, C_{-2}, C_1, A, B with B_{-2} and C highlighted; updated plans P'_1: (A_{-2}, C) and P'_2: (A_{-2}, C, C_{-1})]
16
Outline
- Space of execution plans
- Plan search algorithm
- Processing continuous forecasting queries
- Experimental evaluation
- Related work
- Summary
17
Experimental Setting
Target domain: system and database monitoring
Datasets (#attributes 3~250, #instances 700~15000)
- Aging dataset from a departmental cluster
Aging behavior: progressive degradation in performance
- A real dataset parsed from logs of the '98 World Cup web-site
Periodic segments: characteristic of most popular web-sites
- 5 testbed datasets
Our testbed runs OLTP applications using MySQL
Simulated periodic workloads, aging behavior, and multiple resource contentions
- 2 synthetic datasets: simulated complicated patterns to study the robustness of our algorithms
18
Multiple Chunks Vs. One Chunk
Accuracy metric = balanced accuracy
BA = 1 - 0.5 \times \left( \frac{\#\text{false positives}}{\#\text{negatives}} + \frac{\#\text{false negatives}}{\#\text{positives}} \right)
[Figure: BA of current best plan vs. elapsed processing time (sec), comparing multiple chunks with one chunk; testbed dataset, lead time = 25, n = 50+, Δ = 30]
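The balanced accuracy metric on this slide can be computed with a short sketch like the one below. The function name and the encoding of labels as truthy (positive) and falsy (negative) values are assumptions made for illustration.

```python
def balanced_accuracy(actual, predicted):
    """BA = 1 - 0.5 * (#false positives / #negatives + #false negatives / #positives)."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return 1 - 0.5 * (false_positives / negatives + false_negatives / positives)


# 2 positives, 4 negatives; one false negative and one false positive.
print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))  # 0.625
```

Unlike plain accuracy, this weights the error rate on each class equally, which matters when positives (for example, performance problems) are rare. The sketch assumes both classes occur at least once in the actual labels.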
19
Synopsis Comparison
Dataset            FPS(BN)      FPS(CART)    FPS(MLR)
                   BA    Time   BA    Time   BA    Time
Aging-real         .71   62     .71   135    .64   36
FIFA-real          .87   29     .85   37     .84   201
Periodic-small-tb  .84   45     .85   249    .80   130
Multi-small-tb     .91   53     .91   50     .85   19
Aging-variant-tb   .82   14     .81   109    .80   24
FPS(SVM), two datasets: BA .86, .51; Time 482, 1948
FPS(RF), three datasets: BA .85, .91, .86; Time 3200, 933, 22339
- FPS using BN or CART can achieve accuracy
comparable to more sophisticated synopses
- Lesson: More important to find right transformations
- FPS using BN or CART has lower running time
- Thus: we use BN as the default synopsis
20
FPS Vs. State-of-the-Art Synopsis
[Figure: BA of current best plan vs. elapsed processing time (sec, log scale), comparing FPS, RF-base, and RF-shifts; synthetic dataset, lead time = 25, n = 3, Δ = 90]
21
FPS-adaptive Vs. FPS-nonadaptive
Runtime overhead < 2%
Adaptability and Convergence
[Figure: BA of current best plan vs. number of tuples processed so far, comparing FPS-Adaptive with FPS-Nonadaptive; lead time = 25, Δ = 90]
22
Sharing Computation in FPS and FPS-A
[Figure: BA of current best plan vs. elapsed processing time (sec), comparing FPS-no sharing with FPS-sharing; lead time = 25, n = 50+, Δ = 90]
FPS and FPS-A aggressively share computation
as multiple plans are explored during plan search
23
Related Work
Time-series forecasting, performance problem
forecasting, and self-tuning
- Difference: our automated framework balances accuracy
against running time
Machine learning techniques on data transformation
and modeling
- Can be incorporated as operators in our framework
Integration of synopses and data mining algorithms
with DBMS and DSMS
- Used for processing conventional SQL/XML queries
- Difference: our framework automatically chooses good
combinations of transformations and synopses
24
Summary
Defined declarative one-time and continuous
forecasting queries
Proposed an automatic plan search algorithm for
processing one-time forecasting queries
Proposed an adaptive algorithm for continuous
forecasting over streaming data
Extensive experimental evaluation of both
algorithms
Thanks!
25
Execution Plans for Example Query
Query: Forecast(Usage, C, 1)
- P1: Usage, Shift(C, 1), MLR, Predictor; forecast 26.6
- P2: Usage, Shift(A, -2), Shift(B, -1), Shift(C, 1), π_{A_{-2}, B_{-1}}, MLR, Predictor; forecast 54
- CART Predictor: Shift(A, -2), Shift(B, -1)