Database Learning: Toward a Database that Becomes Smarter Over Time
Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari
University of Michigan, Ann Arbor
Today’s databases
[Diagram: users send a query to the database and receive an answer]
After answering queries, THE WORK is GONE.
Our Goal: reuse the work
1
Our high-level approach
[Diagram: users send query Q to the AQP engine, which returns A (10% err, 1 sec); database learning builds a query synopsis from past (Q, A) pairs and refines later answers from A (10% err) to Â (2% err)]
2
Our high-level approach
[Diagram: with the accumulated query synopsis, database learning refines the AQP engine's answer A (10% err) into Â (2% err)]
[Plot: Error (%) vs. Time (sec); database learning's error drops below the AQP engine's as more queries are answered]
3
Technical challenges
[Diagram: a table with many columns and rows, partially covered by past queries]
Queries use the data in different columns/rows. How can we leverage those queries for future queries?
4
Our idea
[Diagram: each answered query contributes a (query, answer) pair — (Q1, A1), (Q2, A2), and so on — collected over the table as more queries arrive]
5
Concrete example
[Plots: SUM(count) vs. Week Number (weeks 1-100, 20M-40M), in three panels: the true data; the ranges observed by past queries; the model with 95% confidence interval]
6
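The modeling step on this slide can be sketched with a deliberately simplified stand-in for the paper's model: assume a constant per-week rate, pool the sums revealed by past range queries, and extrapolate to an unseen range. The numbers and the constant-rate assumption are illustrative only, not taken from the paper.

```python
# Past queries revealed SUM(count) over a few week ranges (made-up numbers).
observed = [
    (1, 20, 560_000_000),   # weeks 1..20
    (41, 60, 640_000_000),  # weeks 41..60
]

# Constant-rate model: pool all observed totals into one per-week rate.
total = sum(s for _, _, s in observed)
weeks = sum(hi - lo + 1 for lo, hi, _ in observed)
rate = total / weeks  # estimated SUM(count) per week (here 30M)

def predict(lo, hi):
    """Predicted SUM(count) over weeks lo..hi under the constant-rate model."""
    return rate * (hi - lo + 1)

print(predict(61, 80))  # 600000000.0: 20 unseen weeks at ~30M/week
```

The paper's model is far richer (it carries per-range uncertainty, which is where the 95% confidence interval in the figure comes from), but the mechanism is the same: past answers constrain a model that then answers ranges no query has touched.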
Design goals
select X3, avg(Y1) from t where X1 > 5 and X1 < 8;
select sum(Y2) from t where X2 between 'Apr' and 'May' group by X3;
[Chart: query latency, BlinkDB vs. DBL]
7
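The slide's two example queries can be run end to end on a toy instance. The schema and data below are assumptions made only for this sketch; X2 is stored as a month number, so the BETWEEN 'Apr' AND 'May' predicate becomes BETWEEN 4 AND 5:

```python
import sqlite3

# Toy instance of the example table t with assumed columns (X1, X2, X3, Y1, Y2).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (X1 REAL, X2 INTEGER, X3 TEXT, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (6.0, 4, "a", 10.0, 1.0),
        (7.0, 5, "a", 20.0, 2.0),
        (6.5, 4, "b", 30.0, 3.0),
        (2.0, 1, "b", 40.0, 4.0),
    ],
)

# First query from the slide: average of Y1 over a numeric range on X1.
q1 = conn.execute(
    "SELECT X3, AVG(Y1) FROM t WHERE X1 > 5 AND X1 < 8 GROUP BY X3 ORDER BY X3"
).fetchall()

# Second query: sum of Y2 over a month range, grouped by X3.
q2 = conn.execute(
    "SELECT X3, SUM(Y2) FROM t WHERE X2 BETWEEN 4 AND 5 GROUP BY X3 ORDER BY X3"
).fetchall()

print(q1)  # [('a', 15.0), ('b', 30.0)]
print(q2)  # [('a', 3.0), ('b', 3.0)]
```

These are exactly the kinds of queries whose (query, answer) pairs the synopsis accumulates: different aggregates, different columns, overlapping row ranges.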
Our Approach
Problem statement
Problem: Given past queries (q1, . . . , qn), their approximate answers, and a new query (qn+1), find the most likely answer to the new query (qn+1) and its estimated error.
Our result: Under a certain model assumption, the refined answer is never less accurate than the original approximate answer (in practice, much more accurate), provided the error bounds offer the same probabilistic guarantees.
8
Overview of our technique
select count(Y2) from t where X1 > 1 and X1 < 2;
select avg(Y2) from t where X1 > 6 and X1 < 8;
select sum(Y2) from t where X1 > 5 and X1 < 8;
1. Random variables θ1, θ2, θ3 (our uncertainty about the answers)
2. Probability distribution Pr(θ1, θ2, θ3)
3. Estimated answer Pr(θ3 | θ1, θ2)
Two aggregations that involve common values → correlation between their answers
9
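The correlation claim on this slide can be checked empirically with a small simulation. Everything below is a made-up setup for illustration: a synthetic population, a sampling-based engine that answers the second and third queries from 10% samples, and the empirical correlation between the two approximate answers, which is positive precisely because the rows with 6 < X1 < 8 satisfy both predicates.

```python
import random

random.seed(0)

# Hypothetical population of (X1, Y2) rows; rows with 6 < X1 < 8 satisfy both
# predicates below, and those shared rows are what correlate the two answers.
population = [(random.uniform(0, 10), random.uniform(0, 100))
              for _ in range(1000)]

def approx_answers(sample):
    """Approximate avg(Y2) for 6 < X1 < 8 and estimated sum(Y2) for 5 < X1 < 8."""
    a = [y for x, y in sample if 6 < x < 8]
    b = [y for x, y in sample if 5 < x < 8]
    avg_a = sum(a) / len(a)
    sum_b = sum(b) * len(population) / len(sample)  # scale up by sampling ratio
    return avg_a, sum_b

# Re-run the sampling-based engine many times and record both approximate answers.
pairs = [approx_answers(random.sample(population, 100)) for _ in range(500)]
xs, ys = zip(*pairs)
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)
sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
corr = cov / (sx * sy)
print(round(corr, 2))  # clearly positive: the two answers move together
```

This correlation is the leverage: once θ1 and θ2 are observed, a correlated θ3 is no longer a blank slate.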
How to define random variables
We define a random variable θ for every combination of aggregate function and selection predicates, e.g.:
select sum(Y2) from t where X1 > 5 and X1 < 8;
What if your query is complex?
select X3, avg(Y1), sum(Y2) from t where X1 > 5 and X1 < 8 and X2 between 'Apr' and 'May' group by X3;
10
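A hypothetical decomposition of the complex query above, following the slide's rule: each GROUP BY group combined with each aggregate is a separate scalar answer, so it gets its own random variable. The group values g1..g3 are made up for this sketch.

```python
from itertools import product

# The two aggregates and the (assumed) distinct X3 groups of the complex query.
aggregates = ["avg(Y1)", "sum(Y2)"]
groups = ["X3 = 'g1'", "X3 = 'g2'", "X3 = 'g3'"]
base_predicate = "X1 > 5 AND X1 < 8 AND X2 BETWEEN 'Apr' AND 'May'"

# One random variable per (aggregate function, selection predicates) combination.
variables = [
    (agg, f"{base_predicate} AND {grp}")
    for agg, grp in product(aggregates, groups)
]

for i, (agg, pred) in enumerate(variables, start=1):
    print(f"theta_{i}: {agg} over rows where {pred}")

print(len(variables))  # 6: one scalar random variable per (aggregate, group)
```

So a single complex query contributes several θ's to the synopsis, one per scalar it returns.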
How to determine the probability distribution
The Principle of Maximum Entropy (ME): pick the most likely Pr(θ1, θ2, θ3) consistent with the statistical information we hold about (θ1, θ2, θ3).
[Spectrum: a low amount of info yields a simple Pr with fast inference but low fidelity; a high amount of info yields a complex Pr with slow inference but high fidelity]
Our choice: (co)variances between pairs of answers.
11
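For reference, the standard fact this choice leans on (stated here in full, not on the slide): among all joint densities matching a given mean vector and covariance matrix, the entropy maximizer is the multivariate normal.

```latex
\max_{p}\; -\int p(\theta)\,\ln p(\theta)\,d\theta
\quad\text{s.t.}\quad
\mathbb{E}_p[\theta] = \mu,\;\;
\operatorname{Cov}_p(\theta) = \Sigma
\quad\Longrightarrow\quad
p^{*}(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma)
```

This is why constraining only means and (co)variances is such a convenient point on the spectrum: the resulting distribution is fully parametric and supports closed-form conditioning.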
Most-likely probability distribution
[Diagram: random variables θ1, θ2, θ3]
Statistical information: means, variances, covariances
MaxEnt → multivariate normal distribution, which allows fast inference using a closed form
12
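A minimal sketch of the closed-form inference this slide refers to: if (θ1, θ2, θ3) is multivariate normal, then θ3 conditioned on observed answers for θ1 and θ2 is again normal, with mean and variance given by the usual Schur-complement formulas. All numbers below are illustrative, not from the paper.

```python
# Prior means of the three answers and the covariances induced by overlapping
# predicates (illustrative values).
mu = [100.0, 200.0, 300.0]
Sigma = [
    [25.0, 10.0, 15.0],
    [10.0, 36.0, 12.0],
    [15.0, 12.0, 49.0],
]
observed = [104.0, 206.0]  # approximate answers to past queries q1, q2

# Partition Sigma around theta3: A = Cov(theta1, theta2),
# b = Cov(theta3, (theta1, theta2)).
A = [[Sigma[0][0], Sigma[0][1]], [Sigma[1][0], Sigma[1][1]]]
b = [Sigma[2][0], Sigma[2][1]]

# Invert the 2x2 block A directly.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]

# Weights w = b @ A_inv, applied to the residual (observed - prior mean).
w = [b[0] * A_inv[0][0] + b[1] * A_inv[1][0],
     b[0] * A_inv[0][1] + b[1] * A_inv[1][1]]
resid = [observed[0] - mu[0], observed[1] - mu[1]]

cond_mean = mu[2] + w[0] * resid[0] + w[1] * resid[1]
cond_var = Sigma[2][2] - (w[0] * b[0] + w[1] * b[1])

print(round(cond_mean, 3), round(cond_var, 3))  # 303.225 38.875
```

The shrinking variance (49 down to about 39) is the slide's 10% vs. 2% error story in miniature: conditioning on correlated past answers can only tighten the Gaussian's spread.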
Benefits of database learning
Database learning vs. indexing
[Chart: storage footprint as the database grows, indexing vs. DBL]
Database learning vs. materialized view
[Chart: performance over system uptime, view selection vs. DBL]
13
Experiment
Experiment setup
14
Our experimental claims
15
Generality of Verdict
Dataset     # Analyzed    # Supported    Percentage
Customer1   3,342         2,463          73.7%
TPC-H       21            14             66.7%
Unsupported queries (example): city like '%arbor%'
16
Runtime-error trade-off
Results on the TPC-H dataset (the paper also has the Customer1 results); number of past queries fixed at 50.
[Plots: error bound (%) vs. runtime, NoLearn vs. Verdict: (a) data in memory, runtime in seconds; (b) data on SSD, runtime in minutes]
17
Speedup
Results on the Customer1 dataset (the paper also has the TPC-H results).
[Charts: speedup over NoLearn at target error bounds of 4% and 2%: (a) data in memory, 7.7x and 2.5x; (b) data on SSD, 23x and 5.7x]
18
Memory and computational overhead
            Latency (memory)     Latency (SSD)
NoLearn     2.083 sec            52.50 sec
Verdict     2.093 sec            52.51 sec
Overhead    0.010 sec (0.48%)    0.010 sec (0.02%)
19