
SLIDE 1

Processing Forecasting Queries

Songyun Duan, Shivnath Babu

Duke University

SLIDE 2

Motivation

Real-time forecasting of future events based on historical data is useful in many domains

Proactive system management

If a performance problem is forecast, take corrective actions in advance to avoid it

Adaptive query processing
Inventory planning
Environmental monitoring
And many others

Need a framework to process forecasting queries automatically and efficiently

SLIDE 3

Forecasting Queries

Select Xi From D Forecast L

D(T, X1, X2, ..., Xn): historical time-series data up to timestamp τ

Denoted as Forecast(D, Xi, L)

[Figure: table D with columns T, X1, ..., Xi, ..., Xn and rows up to timestamp τ; the value of Xi at timestamp τ + L is unknown ("?"), where L is the lead time]

SLIDE 4

An Example Forecasting Query

Select C From Usage Forecast 1 day

Table: Usage

Day  18  17  16  15  14  13  12  11  10   9
A        12  13  35  36  35  13  13  35  35
B        25  47  25  46  46  46  46  46  25
C     ?  16  68  16  16  68  68  16  68  17

Lead time: 1 day (day 17 to day 18)

SLIDE 5

Example Query Processing - a Naïve Approach

[Figure: the Usage table with an added class attribute C1, the value of C one day ahead; C1 is unknown ("?") for day 17]

C1 = 0.47*A + 1.18*B - 0.53*C

Prediction: 26.6
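As a quick check of the arithmetic, the sketch below plugs the day-17 values (A = 12, B = 25, C = 16, assuming the most recent row of the Usage table is day 17) into the regression shown on the slide. The function name `naive_forecast` is illustrative and does not come from the slides.

```python
def naive_forecast(a: float, b: float, c: float) -> float:
    """Apply the slide's fitted regression C1 = 0.47*A + 1.18*B - 0.53*C."""
    return 0.47 * a + 1.18 * b - 0.53 * c


# Day-17 values of A, B, C (assumed reading of the Usage table).
prediction = naive_forecast(12, 25, 16)
print(round(prediction, 2))  # 26.66, which the slide shows as 26.6
```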

SLIDE 6

Example Query Processing

[Figure: the Usage table transformed into attributes A₋₂ and B₋₁ with class attribute C1; C1 is unknown ("?") for the latest row]

C1 = 1.24*A₋₂ + 0.3*B₋₁

Previous prediction = 26.6

Bayesian network

Values shown on the slide: 54, 68

SLIDE 7

Challenges

To process real-time forecasting queries,

Challenge 1: generate a good processing strategy automatically and efficiently

Apply appropriate transformations to the data

E.g., shift, discretization, normalization, aggregation

Pick the right type of statistical model

E.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)

Challenge 2: for continuous forecasting over streaming data, adapt processing strategies when necessary

SLIDE 8

Outline

Space of execution plans
Plan search algorithm
Processing continuous forecasting queries
Experimental evaluation
Related work
Summary

SLIDE 9

Execution Plan

Query: Forecast(D, Xi, L)

[Figure: time-series data D passes through transformers T1, ..., Tk, producing D'1, ..., D'k; the synopsis builder learns a synopsis from the transformed data; the predictor uses the synopsis to produce the forecasting result for a tuple u]

Logical operators

Transformer: D ⇒ D'
E.g., Shift(X, δ)
Synopsis builder: B(D, Z) ⇒ Syn({Y1, ..., Yn} → Z)
Synopsis: Syn({Y1, ..., Yn} → Z)
E.g., linear regression, regression tree
Predictor: P(Syn, u) ⇒ u.Z
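The three logical operators can be sketched as plain functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the table representation, the column-naming scheme (`C+1`, `A-2`), and the constant-mean stand-in synopsis are all hypothetical choices made to keep the example short.

```python
from typing import Callable, Dict, List, Optional

Table = Dict[str, List[Optional[float]]]  # column name -> one value per timestamp
Synopsis = Callable[[Dict[str, float]], float]


def shift(data: Table, attr: str, delta: int) -> Table:
    """Transformer Shift(X, delta): add a column whose row t holds X at row t + delta."""
    col = data[attr]
    n = len(col)
    shifted = [col[t + delta] if 0 <= t + delta < n else None for t in range(n)]
    out = dict(data)
    out[f"{attr}{delta:+d}"] = shifted  # hypothetical naming, e.g. "C+1" or "A-2"
    return out


def build_mean_synopsis(data: Table, z: str) -> Synopsis:
    """Synopsis builder B(D, Z): a stand-in synopsis that always predicts the mean of Z."""
    known = [v for v in data[z] if v is not None]
    mean = sum(known) / len(known)
    return lambda u: mean


def predict(syn: Synopsis, u: Dict[str, float], z: str) -> Dict[str, float]:
    """Predictor P(Syn, u): return a copy of u with u.Z filled in by the synopsis."""
    return {**u, z: syn(u)}


d = {"C": [17.0, 68.0, 16.0]}
d1 = shift(d, "C", 1)  # row t of "C+1" is C at row t + 1
syn = build_mean_synopsis(d1, "C+1")
print(d1["C+1"])  # [68.0, 16.0, None]
print(predict(syn, {"C": 16.0}, "C+1"))  # {'C': 16.0, 'C+1': 42.0}
```

A real synopsis builder would learn a regression, tree, or Bayesian network instead of a mean; the point here is only the shape of the three interfaces.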

SLIDE 10

Sample Execution Plan

Select C From Usage Forecast 1 day

[Figure: plan tree. Usage is transformed by Shift(A, -2), Shift(B, -1), and Shift(C, 1), then projected by π(A₋₂, B₋₁) into D'; an MLR learner builds the synopsis and the predictor applies it to u]

Synopsis = multivariate linear regression (MLR): C1 = 1.24*A₋₂ + 0.3*B₋₁

u = (35, 47, ?)

Forecast shown for u: 54
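To make the transformations concrete, the sketch below rebuilds the shifted table D' from the Usage data and recovers the unlabeled tuple u. It assumes the Usage rows run from day 9 to day 17 in the order read off the earlier table; the helper `lag` and the column labels are illustrative names only.

```python
# Usage table in day order 9..17 (assumed reading of the slide's table).
days = list(range(9, 18))
A = [35, 35, 13, 13, 35, 36, 35, 13, 12]
B = [25, 46, 46, 46, 46, 46, 25, 47, 25]
C = [17, 68, 16, 68, 68, 16, 16, 68, 16]


def lag(col, row, delta):
    """Value of the column delta rows away from row, or None outside the table."""
    j = row + delta
    return col[j] if 0 <= j < len(col) else None


# Shift(A, -2), Shift(B, -1), Shift(C, 1), followed by a projection.
rows = [
    {"Day": d, "A-2": lag(A, i, -2), "B-1": lag(B, i, -1), "C1": lag(C, i, 1)}
    for i, d in enumerate(days)
]

train = [r for r in rows if None not in r.values()]  # fully known rows
u = rows[-1]                                         # latest row, C1 unknown

print(u)           # {'Day': 17, 'A-2': 35, 'B-1': 47, 'C1': None}
print(len(train))  # 6
```

The tuple u matches the slide's u = (35, 47, ?). Rows near either end of the table lose a value to the shifts, which is why only six rows remain for training.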

SLIDE 11

Estimating Accuracy of a Plan

Accuracy
How close are forecasting results to “real values”
Given a dataset D and a plan P
K-fold cross validation to get unbiased estimation

[Figure: D is split into training and test parts; the plan P (transformers T1, ..., Tk, a synopsis builder for the chosen synopsis type, and the predictor) is built on the training part and applied to the test part, giving m pairs of predicted and actual values (ai, bi), i = 1, ..., m]

Example accuracy metric:

$RMSE = \sqrt{\frac{\sum_{i=1}^{m} (a_i - b_i)^2}{m}}$
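A minimal sketch of the accuracy estimate follows, assuming a plan can be wrapped as a `build` function that takes training rows and returns a prediction function. The interleaved fold assignment and the names `kfold_rmse` and `build_mean` are assumptions for illustration; the slides do not specify how folds are formed.

```python
import math
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (feature values, actual class value)
Model = Callable[[Sequence[float]], float]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error over m (actual, predicted) pairs."""
    m = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / m)


def kfold_rmse(rows: List[Row], build: Callable[[List[Row]], Model], k: int) -> float:
    """Average test RMSE over k folds; fold i holds out rows i, i + k, i + 2k, ..."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        model = build(train)
        scores.append(rmse([y for _, y in test], [model(x) for x, _ in test]))
    return sum(scores) / k


def build_mean(train: List[Row]) -> Model:
    """Stand-in plan: always predict the mean class value of the training rows."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


toy = [((0.0,), 1.0), ((0.0,), 3.0), ((0.0,), 1.0), ((0.0,), 3.0)]
print(kfold_rmse(toy, build_mean, k=2))  # 2.0
```

For time-series data, contiguous or forward-chaining folds may be preferable to interleaved ones, since neighbouring rows are correlated.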

SLIDE 12

Find a Good Plan Quickly

Optimization challenge: minimize the number of plans executed before finding a plan with high accuracy

Efficient plan search to balance accuracy vs. running time

Simplified plan space to describe our algorithms

Two types of transformers: Shift and Project
One synopsis: Bayesian Network (BN)

[Figure: accuracy of the best execution plan found so far versus elapsed processing time, up to the lead time, for Algorithm 1 and Algorithm 2; annotation: reasonably-good execution plan available]

SLIDE 13

Fa's Plan Search (FPS) Algorithm

Query: Forecast(Usage, C, 1)

Dataset (n = 3)

A  B  C
1  2  3
4  5  6
7  8  9

Shift(Xi, δ) (-Δ <= δ < 0) (Δ = 2)

[Figure: the dataset extended with shifted attributes A₋₁, B₋₁, C₋₁, A₋₂, B₋₂, C₋₂ and the class attribute C1, followed by a ranked list of the attributes]

Learn a synopsis and generate the plan?
Imagine n = 100, Δ = 90
Extended data has 9000+ attributes
Takes too much time to get a plan

Attribute Ranking

Linear-correlation-based
Entropy-based, e.g., information gain
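A linear-correlation-based ranking over shifted attributes could look like the sketch below. It is an illustration only: the function names, the column labels such as `A-1`, the use of absolute Pearson correlation as the score, and the inclusion of the unshifted attributes (δ = 0) are assumptions, and `max_shift` stands for the bound on δ.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_shifted(data: Dict[str, List[float]], target: str,
                 lead: int, max_shift: int) -> List[str]:
    """Rank every attribute shifted by delta in [-max_shift, 0] by its absolute
    correlation with the target shifted forward by the lead time."""
    n = len(data[target])
    scores = {}
    for name, col in data.items():
        for delta in range(-max_shift, 1):
            rows = range(-delta, n - lead)  # rows where both values exist
            x = [col[t + delta] for t in rows]
            y = [data[target][t + lead] for t in rows]
            label = name if delta == 0 else f"{name}{delta}"
            scores[label] = abs(pearson(x, y))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data in which C at time t equals A at time t - 2.
data = {"A": [3, 1, 4, 1, 5, 9, 2, 6], "C": [0, 0, 3, 1, 4, 1, 5, 9]}
ranked = rank_shifted(data, "C", lead=1, max_shift=2)
print(ranked[0])  # A-1
```

With n attributes and a maximum shift of Δ this scores n × (Δ + 1) columns, which is cheap compared with learning a synopsis over all of them.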

SLIDE 14

Fa's Plan Search (FPS) Algorithm

[Figure: shifted data (n = 3, Δ = 2) and the ranked list of its attributes]

Attribute selection

Fast Correlation-Based Filter (FCBF)
Correlation-based Feature Selection (CFS)
Wrapper

Plan P1 (Acc1): selected attributes (A₋₂, B₋₁); Shift(A, -2), Shift(B, -1), π(A₋₂, B₋₁), BN Predictor
Plan P2 (Acc2): selected attributes (A₋₂, C₋₁, C₋₂); Shift(A, -2), Shift(C, -1), Shift(C, -2), π(A₋₂, C₋₁, C₋₂), BN Predictor

Stop …, otherwise increase Δ
Do forecasting using the plan with highest accuracy
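The overall search loop can be sketched as follows. This is a loose illustration rather than the FPS algorithm itself: the slide does not preserve the stopping condition, so the `good_enough` threshold is a hypothetical rule, and `candidates` and `score` are placeholder callbacks standing in for attribute ranking plus selection and for accuracy estimation. The toy accuracies are made up.

```python
from typing import Callable, List, Sequence, Tuple


def plan_search(
    candidates: Callable[[int], List[List[str]]],
    score: Callable[[Sequence[str]], float],
    max_shifts: Sequence[int],
    good_enough: float,
) -> Tuple[List[str], float]:
    """Score candidate attribute sets for increasing maximum shifts and keep the best."""
    best_attrs: List[str] = []
    best_acc = float("-inf")
    for max_shift in max_shifts:
        for attrs in candidates(max_shift):  # attribute sets chosen from the ranked list
            acc = score(attrs)               # estimated accuracy of the plan on attrs
            if acc > best_acc:
                best_attrs, best_acc = attrs, acc
        if best_acc >= good_enough:          # hypothetical stopping rule
            break
    return best_attrs, best_acc


# Toy accuracies for the two plans on the slide (values are invented).
toy_acc = {("A-2", "B-1"): 0.80, ("A-2", "C-1", "C-2"): 0.90}
best = plan_search(
    candidates=lambda s: [["A-2", "B-1"], ["A-2", "C-1", "C-2"]] if s == 2 else [],
    score=lambda attrs: toy_acc[tuple(attrs)],
    max_shifts=[2, 4],
    good_enough=0.85,
)
print(best)  # (['A-2', 'C-1', 'C-2'], 0.9)
```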

SLIDE 15

Adaptive Fa's Plan Search (FPS-A)

Continuous query: Forecast(S[W], Xi, L)

[Figure: stream S with sliding window W and lead time L]

The ranked list and plans for S[W]

[Figure: ranked list over the original and shifted attributes for class attribute C1, with plans P1 (A₋₂, B₋₁) and P2 (A₋₂, C₋₁, C₋₂); a later ranked list (labels B₋₂, C) with plans P'1 (A₋₂, C) and P'2 (A₋₂, C, C₋₁)]

SLIDE 16

Outline

Space of execution plans
Plan search algorithm
Processing continuous forecasting queries
Experimental evaluation
Related work
Summary

SLIDE 17

Experimental Setting

Target domain: system and database monitoring
Datasets (#attributes 3~250, #instances 700~15000)

Aging dataset from a departmental cluster

Aging behavior: progressive degradation in performance

A real dataset parsed from logs of the '98 World Cup web-site

Periodic segments: characteristic of most popular web-sites

5 testbed datasets

Our testbed runs OLTP applications using MySQL
Simulated periodic workloads, aging behavior, and multiple resource contentions

2 synthetic datasets: simulated complicated patterns to study the robustness of our algorithms

SLIDE 18

Multiple Chunks Vs. One Chunk

Accuracy metric = balanced accuracy

$BA = 1 - 0.5 \times \left( \frac{\#\,\text{false positives}}{\#\,\text{negatives}} + \frac{\#\,\text{false negatives}}{\#\,\text{positives}} \right)$

[Figure: BA of current best plan (0.5 to 1) versus elapsed processing time (sec, 50 to 500) for "Multiple chunks" and "One chunk"]

Testbed dataset, Lead time = 25, n = 50+, Δ = 30
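The balanced accuracy metric translates directly into code. The sketch below assumes boolean class labels (True for a positive) and that both classes occur in the actual labels; the function name and the toy labels are illustrative.

```python
from typing import Sequence


def balanced_accuracy(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """BA = 1 - 0.5 * (#false positives / #negatives + #false negatives / #positives)."""
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    # Assumes both classes are present; otherwise a division by zero occurs.
    return 1 - 0.5 * (false_pos / negatives + false_neg / positives)


actual = [True, True, False, False, False, False]
predicted = [True, False, False, False, False, True]
print(balanced_accuracy(actual, predicted))  # 0.625
```

Unlike plain accuracy, this metric weights the error rate on each class equally, so always predicting the majority class scores 0.5.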

SLIDE 19

Synopsis Comparison

Synopsis   Metric  Aging-variant-tb  Multi-small-tb  Periodic-small-tb  FIFA-real  Aging-real
FPS(BN)    BA      .82               .91             .84                .87        .71
FPS(BN)    Time    14                53              45                 29         62
FPS(CART)  BA      .81               .91             .85                .85        .71
FPS(CART)  Time    109               50              249                37         135
FPS(MLR)   BA      .80               .85             .80                .84        .64
FPS(MLR)   Time    24                19              130                201        36

FPS(SVM), two datasets: BA .86, .51; Time 482, 1948
FPS(RF), three datasets: BA .85, .91, .86; Time 3200, 933, 22339

FPS using BN or CART can achieve accuracy comparable to more sophisticated synopses

Lesson: More important to find right transformations
FPS using BN or CART has lower running time
Thus: we use BN as the default synopsis

SLIDE 20

FPS Vs. State-of-the-Art Synopsis

[Figure: BA of current best plan (0.5 to 1) versus elapsed processing time (sec, log scale) for FPS, RF-base, and RF-shifts]

Synthetic dataset, Lead time = 25, n = 3, Δ = 90

SLIDE 21

FPS-adaptive Vs. FPS-nonadaptive

Runtime overhead < 2%
Adaptability and Convergence

[Figure: BA of current best plan (0.45 to 0.95) versus number of tuples processed so far (500 to 5000) for FPS-Adaptive and FPS-Nonadaptive]

(Lead time = 25, Δ = 90)

SLIDE 22

Sharing Computation in FPS and FPS-A

FPS and FPS-A aggressively share computation as multiple plans are explored during plan search

[Figure: BA of current best plan (0.5 to 1) versus elapsed processing time (sec, 500 to 1500) for FPS-no sharing and FPS-sharing]

Lead time = 25, n = 50+, Δ = 90

SLIDE 23

Related Work

Time-series forecasting, performance problem forecasting, and self-tuning

Difference: our automated framework balances accuracy against running time

Machine learning techniques on data transformation and modeling

Can be incorporated as operators in our framework

Integration of synopses and data mining algorithms with DBMS and DSMS

Used for processing conventional SQL/XML queries
Difference: our framework automatically chooses good combinations of transformations and synopses

SLIDE 24

Summary

Defined declarative one-time and continuous forecasting queries

Proposed an automatic plan search algorithm for processing one-time forecasting queries

Proposed an adaptive algorithm for continuous forecasting over streaming data

Extensive experimental evaluation of both algorithms

Thanks!
