Processing Forecasting Queries - PowerPoint PPT Presentation




SLIDE 1

Processing Forecasting Queries

Songyun Duan, Shivnath Babu

Duke University

SLIDE 2

Motivation

Real-time forecasting of future events based on historical data is useful in many domains

  • Proactive system management: if a performance problem is forecast, take corrective actions in advance to avoid it
  • Adaptive query processing
  • Inventory planning
  • Environmental monitoring
  • And many others

Need a framework to process forecasting queries automatically and efficiently

SLIDE 3

Forecasting Queries

Select $X_i$ From D Forecast L

  • $D(T, X_1, X_2, \ldots, X_n)$: historical time-series data up to timestamp $\tau$
  • L: lead time

Denoted as Forecast(D, $X_i$, L)

[Figure: table D with attribute columns $X_1, \ldots, X_i, \ldots, X_n$ and rows up to timestamp $\tau$; the value of $X_i$ at timestamp $\tau + L$ is the unknown "?" to forecast.]

SLIDE 4

An Example Forecasting Query

Select C From Usage Forecast 1 day

Table: Usage

Day   9  10  11  12  13  14  15  16  17  18
A    35  35  13  13  35  36  35  13  12
B    25  46  46  46  46  46  25  47  25
C    17  68  16  68  68  16  16  68  16   ?

Lead time: 1 day (Day 17 to Day 18)

SLIDE 5

Example Query Processing - a Naïve Approach

Class attribute $C_1$: the value of C one day later, unknown ("?") for Day 17

Day   9  10  11  12  13  14  15  16  17
A    35  35  13  13  35  36  35  13  12
B    25  46  46  46  46  46  25  47  25
C    17  68  16  68  68  16  16  68  16
C1   68  16  68  68  16  16  68  16   ?

$C_1 = 0.47 A + 1.18 B - 0.53 C$

Forecast for Day 18: 26.6
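The forecast of 26.6 can be checked by hand. The sketch below assumes the Day 17 row of Usage is A = 12, B = 25, C = 16 (read off the flattened table on the slide) and that the model has no intercept, as the formula is printed. The names `day17`, `coeffs`, and `predict` are illustrative only.

```python
# Day 17 row of the Usage table, as read off the slide (an assumption).
day17 = {"A": 12, "B": 25, "C": 16}

# Coefficients of the naive model on the slide: C1 = 0.47*A + 1.18*B - 0.53*C
coeffs = {"A": 0.47, "B": 1.18, "C": -0.53}


def predict(row, coeffs):
    """Evaluate a linear model without intercept on a single row."""
    return sum(coeffs[name] * row[name] for name in coeffs)


# 0.47*12 + 1.18*25 - 0.53*16 = 5.64 + 29.5 - 8.48
forecast_day18 = predict(day17, coeffs)
print(round(forecast_day18, 2))
```

The result is about 26.66, which agrees with the slide's 26.6 up to rounding of the printed coefficients.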

SLIDE 6

Example Query Processing

[Figure: the Usage table alongside the transformed data with attributes $A_{-2}$ and $B_{-1}$ and class attribute $C_1$; the values 54 and 68 are marked for the unknown $C_1$.]

$C_1 = 1.24 A_{-2} + 0.3 B_{-1}$

Forecast: 54 (previous prediction = 26.6)

  • Bayesian network

SLIDE 7

Challenges

To process real-time forecasting queries:

Challenge 1: generate a good processing strategy automatically and efficiently

  • Apply appropriate transformations to the data, e.g., shift, discretization, normalization, aggregation
  • Pick the right type of statistical model, e.g., multivariate linear regression (MLR), classification and regression tree (CART), Bayesian network (BN)

Challenge 2: for continuous forecasting over streaming data, adapt processing strategies when necessary

SLIDE 8

Outline

  • Space of execution plans
  • Plan Search Algorithm
  • Processing continuous forecasting query
  • Experimental evaluation
  • Related work
  • Summary

SLIDE 9

Execution Plan

Query: Forecast(D, $X_i$, L)

[Figure: time-series data D passes through transformers $T_1, \ldots, T_k$ (producing $D'_1, \ldots, D'_k$) into a synopsis builder; the resulting synopsis and a tuple u are given to a predictor, which outputs the forecasting result.]

Logical operators

  • Transformer: $D \Rightarrow D'$, e.g., Shift(X, δ)
  • Synopsis builder: $B(D, Z) \Rightarrow Syn(\{Y_1, \cdots, Y_n\} \rightarrow Z)$
  • Predictor: $P(Syn, u) \Rightarrow u.Z$
  • Synopsis: e.g., linear regression, regression tree
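The three logical operators compose into a pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the transformer is a simple projection, the synopsis is a one-predictor least-squares line, and all names (`project`, `build_linear_synopsis`, `predict`, `run_plan`) and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Sequence

Row = Dict[str, float]
Dataset = List[Row]


class LinearSynopsis(NamedTuple):
    predictor: str
    slope: float
    intercept: float


def project(attrs: Sequence[str]) -> Callable[[Dataset], Dataset]:
    """Transformer D => D': keep only the listed attributes."""
    def apply(data: Dataset) -> Dataset:
        return [{a: row[a] for a in attrs} for row in data]
    return apply


def build_linear_synopsis(data: Dataset, y: str, z: str) -> LinearSynopsis:
    """Synopsis builder B(D, Z) => Syn({Y} -> Z) via one-variable least squares."""
    n = len(data)
    mean_y = sum(r[y] for r in data) / n
    mean_z = sum(r[z] for r in data) / n
    sxx = sum((r[y] - mean_y) ** 2 for r in data)
    sxy = sum((r[y] - mean_y) * (r[z] - mean_z) for r in data)
    slope = sxy / sxx
    return LinearSynopsis(y, slope, mean_z - slope * mean_y)


def predict(syn: LinearSynopsis, u: Row) -> float:
    """Predictor P(Syn, u) => u.Z: fill in the unknown class value of u."""
    return syn.intercept + syn.slope * u[syn.predictor]


def run_plan(data: Dataset, transformers, y: str, z: str, u: Row) -> float:
    """Apply the transformers in order, build the synopsis, then predict."""
    for transform in transformers:
        data = transform(data)
    return predict(build_linear_synopsis(data, y, z), u)


# Toy data where Z = 2*Y + 1; the "noise" attribute is dropped by the projection.
toy = [{"Y": float(v), "Z": 2.0 * v + 1.0, "noise": 7.0} for v in (1, 2, 3, 4)]
result = run_plan(toy, [project(["Y", "Z"])], "Y", "Z", {"Y": 5.0})
print(result)
```

Keeping transformers as plain functions from dataset to dataset means any number of them can be chained before the synopsis builder, which mirrors the $T_1, \ldots, T_k$ chain in the figure.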

SLIDE 10

Sample Execution Plan

Select C From Usage Forecast 1 day

Plan: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → MLR learner → MLR → Predictor

[Figure: the Usage table D and the transformed table D' with attributes $A_{-2}$, $B_{-1}$ and class attribute $C_1$.]

Synopsis = multivariate linear regression (MLR): $C_1 = 1.24 A_{-2} + 0.3 B_{-1}$

u = (35, 47, ?)

Forecast: 54
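To make the Shift transformers concrete, the sketch below rebuilds the tuple u from the Usage table. It assumes that Shift(X, δ) at day t takes the value of X at day t + δ, and that the table values are as read off the slide; `usage`, `shifted`, and `transform` are hypothetical names.

```python
# Usage table for Days 9..17, as read off the slide (an assumption).
usage = {
    9:  {"A": 35, "B": 25, "C": 17},
    10: {"A": 35, "B": 46, "C": 68},
    11: {"A": 13, "B": 46, "C": 16},
    12: {"A": 13, "B": 46, "C": 68},
    13: {"A": 35, "B": 46, "C": 68},
    14: {"A": 36, "B": 46, "C": 16},
    15: {"A": 35, "B": 25, "C": 16},
    16: {"A": 13, "B": 47, "C": 68},
    17: {"A": 12, "B": 25, "C": 16},
}


def shifted(table, attr, delta, day):
    """Value of Shift(attr, delta) at `day`: attr at day + delta, or None if absent."""
    row = table.get(day + delta)
    return None if row is None else row[attr]


def transform(table, day):
    """Row of the transformed data at `day`: (A_-2, B_-1, C1)."""
    return (
        shifted(table, "A", -2, day),
        shifted(table, "B", -1, day),
        shifted(table, "C", 1, day),  # class attribute: C one day ahead
    )


u = transform(usage, 17)            # C at Day 18 is unknown, so the last field is None
training_row = transform(usage, 16)
print(u, training_row)
```

Under these assumptions `u` comes out as (35, 47, None), matching the slide's u = (35, 47, ?), while earlier days such as Day 16 yield complete rows that can be used to train the synopsis.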

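SLIDE 11

Estimating Accuracy of a Plan

  • Accuracy: how close are forecasting results to the “real values”?
  • Given a dataset D and a plan P, use K-fold cross-validation to get an unbiased estimate

[Figure: D is split into training and test parts; the training part passes through transformers $T_1, \ldots, T_k$ and the synopsis builder for the plan's synopsis type, and the predictor is run on the test part to compare predicted values with actual values.]

Example accuracy metric, with actual values $a_1, \ldots, a_m$ and predicted values $b_1, \ldots, b_m$:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} (a_i - b_i)^2}{m}}$$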
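A minimal sketch of the accuracy estimate: RMSE as defined on the slide, averaged over K folds. The fold assignment (every k-th row) and the stand-in "mean" synopsis are assumptions made for illustration; the slide does not say how folds are formed, and `k_fold_rmse`, `build_mean`, and `predict_mean` are hypothetical names.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over m paired values."""
    m = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / m)


def k_fold_rmse(rows, k, build, predict, target):
    """Average test RMSE over k folds.

    `build(train_rows)` returns a synopsis and `predict(synopsis, row)` returns
    a forecast; together they stand in for a plan's builder and predictor.
    """
    folds = [rows[i::k] for i in range(k)]  # assumption: interleaved folds
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        synopsis = build(train)
        predictions = [predict(synopsis, r) for r in test]
        scores.append(rmse([r[target] for r in test], predictions))
    return sum(scores) / k


# Stand-in synopsis: always predict the training mean of the target.
def build_mean(train):
    return sum(r["z"] for r in train) / len(train)


def predict_mean(synopsis, row):
    return synopsis


rows = [{"z": 1.0}, {"z": 2.0}, {"z": 3.0}, {"z": 4.0}]
score = k_fold_rmse(rows, 2, build_mean, predict_mean, "z")
print(score)
```

For time-series data a contiguous or forward-chaining split may be more appropriate than interleaved folds; the interleaving here only keeps the sketch short.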

SLIDE 12

Find a Good Plan Quickly

Optimization challenge: minimize the number of plans executed before finding a plan with high accuracy

Efficient plan search to balance accuracy vs. running time

[Figure: accuracy of the best execution plan found so far vs. elapsed processing time, up to the lead time, with curves for Algorithm 1 and Algorithm 2 and a marker where a reasonably good execution plan becomes available.]

Simplified plan space to describe our algorithms

  • Two types of transformers: Shift and Project
  • One synopsis: Bayesian Network (BN)

SLIDE 13

Fa's Plan Search (FPS) Algorithm

Query: Forecast(Usage, C, 1)

Dataset (n = 3): attributes A, B, C

Shift($X_i$, δ) for $-\Delta \le \delta < 0$ ($\Delta = 2$) adds the shifted attributes $A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2}$; the class attribute is $C_1$

Learn a synopsis and generate the plan?

  • Imagine n = 100, $\Delta$ = 90
  • Extended data has 9000+ attributes
  • Takes too much time to get a plan

Attribute ranking produces a ranked list of the attributes

  • Linear-correlation-based
  • Entropy-based, e.g., information gain

[Figure: the dataset extended with the shifted attributes and the class attribute $C_1$, and the ranked list of attributes.]
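The blow-up in attributes and the linear-correlation ranking can be sketched as follows. This is an illustration under assumptions, not the FPS code: the Usage values are as read off the slides, ranking uses the absolute Pearson correlation with the class attribute, and `extend`, `pearson`, `rank`, and `max_shift` (the largest shift considered) are hypothetical names. With n attributes and shifts 1..max_shift plus the unshifted columns, the extended data has n * (max_shift + 1) attributes, i.e. 9100 for n = 100 and a maximum shift of 90, in line with the slide's "9000+".

```python
import math

# Usage columns for Days 9..17 in time order (values as read off the slides).
usage_series = {
    "A": [35, 35, 13, 13, 35, 36, 35, 13, 12],
    "B": [25, 46, 46, 46, 46, 46, 25, 47, 25],
    "C": [17, 68, 16, 68, 68, 16, 16, 68, 16],
}


def extend(series, max_shift, class_attr, lead):
    """Return (columns, target): every attribute shifted by 0..max_shift steps
    back, plus the class column `lead` steps ahead, on rows where all exist."""
    length = len(next(iter(series.values())))
    rows = range(max_shift, length - lead)
    columns = {}
    for name, values in series.items():
        for d in range(max_shift + 1):
            label = name if d == 0 else f"{name}_-{d}"
            columns[label] = [values[t - d] for t in rows]
    target = [series[class_attr][t + lead] for t in rows]
    return columns, target


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank(columns, target):
    """Attribute names ordered by decreasing |correlation| with the target."""
    return sorted(columns, key=lambda c: abs(pearson(columns[c], target)), reverse=True)


columns, target = extend(usage_series, max_shift=2, class_attr="C", lead=1)
ranked = rank(columns, target)
print(len(columns), ranked)
```

Ranking is cheap compared with learning a synopsis over all extended attributes, which is why it is a natural first step before any plan is built.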

SLIDE 14

Fa's Plan Search (FPS) Algorithm

Shifted data (n = 3, $\Delta$ = 2) and its ranked list of attributes

Attribute selection

  • Fast Correlation-Based Filter (FCBF)
  • Correlation-based Feature Selection (CFS)
  • Wrapper

Plans generated from the selected attributes

  • $P_1$, selected attributes ($A_{-2}$, $B_{-1}$): Shift(A, -2), Shift(B, -1) → $\pi_{A_{-2}, B_{-1}}$ → BN → Predictor; accuracy $Acc_1$
  • $P_2$, selected attributes ($A_{-2}$, $C_{-1}$, $C_{-2}$): Shift(A, -2), Shift(C, -1), Shift(C, -2) → $\pi_{A_{-2}, C_{-1}, C_{-2}}$ → BN → Predictor; accuracy $Acc_2$

Stop …, otherwise increase …

Do forecasting using the plan with highest accuracy
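The search loop can be pictured as evaluating candidate attribute sets one after another and keeping the most accurate plan. The sketch below is a simplification under loud assumptions: the slide's actual stopping condition did not survive extraction, so the `good_enough` threshold is a stand-in, and the accuracy values in `toy_accuracy` are hypothetical placeholders rather than results from the talk.

```python
def search_plans(candidates, estimate_accuracy, good_enough):
    """Evaluate candidate attribute sets in order; return (best_plan, best_accuracy).

    Stops early once the best accuracy reaches `good_enough`
    (an assumed stopping rule, not the one on the slide).
    """
    best_plan, best_acc = None, float("-inf")
    for attrs in candidates:
        acc = estimate_accuracy(attrs)
        if acc > best_acc:
            best_plan, best_acc = attrs, acc
        if best_acc >= good_enough:
            break
    return best_plan, best_acc


# Hypothetical accuracies for the two plans named on the slide.
toy_accuracy = {
    ("A_-2", "B_-1"): 0.70,
    ("A_-2", "C_-1", "C_-2"): 0.90,
}
candidates = list(toy_accuracy)

best, acc = search_plans(candidates, toy_accuracy.__getitem__, good_enough=0.95)
early, early_acc = search_plans(candidates, toy_accuracy.__getitem__, good_enough=0.60)
print(best, acc, early, early_acc)
```

In practice `estimate_accuracy` would run a cross-validated evaluation of the plan built from the given attributes, so each call is expensive and the order of the candidates matters.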

SLIDE 15

Adaptive Fa's Plan Search (FPS-A)

Continuous query: Forecast(S[W], $X_i$, L)

[Figure: stream S with window W and lead time L.]

The ranked list and plans for S[W]

  • Plan $P_1$: ($A_{-2}$, $B_{-1}$)
  • Plan $P_2$: ($A_{-2}$, $C_{-1}$, $C_{-2}$)

After the positions of $B_{-2}$ and C in the ranked list change

  • Plan $P'_1$: ($A_{-2}$, C)
  • Plan $P'_2$: ($A_{-2}$, C, $C_{-1}$)

SLIDE 16

Outline

  • Space of execution plans
  • Plan Search Algorithm
  • Processing continuous forecasting query
  • Experimental evaluation
  • Related work
  • Summary

SLIDE 17

Experimental Setting

Target domain: system and database monitoring

Datasets (#attributes 3~250, #instances 700~15000)

  • Aging dataset from a departmental cluster. Aging behavior: progressive degradation in performance
  • A real dataset parsed from logs of the '98 World Cup website. Periodic segments, characteristic of most popular websites
  • 5 testbed datasets. Our testbed runs OLTP applications using MySQL; simulated periodic workloads, aging behavior, and multiple resource contentions
  • 2 synthetic datasets: simulated complicated patterns to study the robustness of our algorithms

SLIDE 18

Multiple Chunks Vs. One Chunk

Accuracy metric = balanced accuracy

$$\mathrm{BA} = 1 - 0.5 \times \left( \frac{\#\,\text{false positives}}{\#\,\text{negatives}} + \frac{\#\,\text{false negatives}}{\#\,\text{positives}} \right)$$

[Figure: BA of current best plan (0.5 to 1) vs. elapsed processing time in seconds (50 to 500), for multiple chunks and one chunk. Testbed dataset, lead time = 25, n = 50+, $\Delta$ = 30.]
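The balanced accuracy formula translates directly into code. This is a small sketch assuming binary labels given as booleans, with at least one positive and one negative among the actual values; the function name and example labels are illustrative.

```python
def balanced_accuracy(actual, predicted):
    """BA = 1 - 0.5 * (false positives / negatives + false negatives / positives)."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    false_pos = sum(1 for a, p in zip(actual, predicted) if not a and p)
    return 1 - 0.5 * (false_pos / negatives + false_neg / positives)


# 4 positives and 10 negatives; the predictor misses 2 positives and raises 1 false alarm.
actual = [True] * 4 + [False] * 10
predicted = [True, True, False, False] + [True] + [False] * 9
ba = balanced_accuracy(actual, predicted)

# A predictor that always answers "positive" gets no credit for the skewed classes.
always_positive = balanced_accuracy(actual, [True] * 14)
print(ba, always_positive)
```

Here the first predictor scores 1 - 0.5 * (1/10 + 2/4) = 0.7, while the constant predictor scores 0.5, which is why balanced accuracy is a reasonable choice when positives (e.g., performance problems) are rare.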

SLIDE 19

Synopsis Comparison

Dataset             FPS(BN)      FPS(CART)    FPS(MLR)
                    BA    Time   BA    Time   BA    Time
Aging-real          .71   62     .71   135    .64   36
FIFA-real           .87   29     .85   37     .84   201
Periodic-small-tb   .84   45     .85   249    .80   130
Multi-small-tb      .91   53     .91   50     .85   19
Aging-variant-tb    .82   14     .81   109    .80   24

FPS(SVM), two datasets only: BA .86, .51; Time 482, 1948
FPS(RF), three datasets only: BA .85, .91, .86; Time 3200, 933, 22339

  • FPS using BN or CART can achieve accuracy comparable to more sophisticated synopses
  • Lesson: More important to find right transformations
  • FPS using BN or CART has lower running time
  • Thus: we use BN as the default synopsis
SLIDE 20

FPS Vs. State-of-the-Art Synopsis

[Figure: BA of current best plan (0.5 to 1) vs. elapsed processing time (sec, log scale, up to $10^3$), for FPS, RF-base, and RF-shifts. Synthetic dataset, lead time = 25, n = 3, $\Delta$ = 90.]

SLIDE 21

FPS-adaptive Vs. FPS-nonadaptive

Adaptability and Convergence

[Figure: BA of current best plan (0.45 to 0.95) vs. number of tuples processed so far (500 to 5000), for FPS-Adaptive and FPS-Nonadaptive. Lead time = 25, $\Delta$ = 90.]

Runtime overhead < 2%

SLIDE 22

Sharing Computation in FPS and FPS-A

FPS and FPS-A aggressively share computation as multiple plans are explored during plan search

[Figure: BA of current best plan (0.5 to 1) vs. elapsed processing time in seconds (500 to 1500), for FPS-no sharing and FPS-sharing. Lead time = 25, n = 50+, $\Delta$ = 90.]

SLIDE 23

Related Work

Time-series forecasting, performance problem forecasting, and self-tuning

  • Difference: our automated framework balances accuracy against running time

Machine learning techniques on data transformation and modeling

  • Can be incorporated as operators in our framework

Integration of synopses and data mining algorithms with DBMS and DSMS

  • Used for processing conventional SQL/XML queries
  • Difference: our framework automatically chooses good combinations of transformations and synopses

SLIDE 24

Summary

  • Defined declarative one-time and continuous forecasting queries
  • Proposed an automatic plan search algorithm for processing one-time forecasting queries
  • Proposed an adaptive algorithm for continuous forecasting over streaming data
  • Extensive experimental evaluation of both algorithms

Thanks!

SLIDE 25

Execution Plans for Example Query

Query: Forecast(Usage, C, 1)

  • $P_1$: Usage → Shift(C, 1) → MLR → Predictor; forecast 26.6
  • $P_2$: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → MLR → Predictor; forecast 54
  • $P_3$: Usage → Shift(A, -2), Shift(B, -1), Shift(C, 1) → $\pi_{A_{-2}, B_{-1}}$ → CART → Predictor; forecast 68