Database Learning: Toward a Database that Becomes Smarter Over Time
Yongjoo Park Ahmad Shahab Tajik Michael Cafarella Barzan Mozafari
University of Michigan, Ann Arbor
Today’s databases
[Diagram: users send a query to the database and receive an answer]
After answering queries, THE WORK is GONE.
Our Goal: reuse the work
1
Our high-level approach
[Diagram: users send query Q to the AQP engine, which returns A (10% err, 1 sec); database learning builds a query synopsis from past (Q, A) pairs and refines later answers from A (10% err) to Â (2% err)]
2
Our high-level approach
[Diagram: with the accumulated query synopsis, database learning refines the AQP engine's answer A (10% err) into Â (2% err)]
[Plot: Error (%) vs. Time (sec); database learning's error drops below the AQP engine's as more queries are answered]
3
Technical challenges
[Diagram: a table with many columns and rows, partially covered by past queries]
Queries use the data in different columns/rows. How can we leverage those queries for future queries?
4
Our idea
[Diagram: each answered query contributes a (query, answer) pair — (Q1, A1), (Q2, A2), and so on — collected over the table as more queries arrive]
5
Concrete example
[Plots: SUM(count) vs. Week Number (weeks 1-100, 20M-40M), in three panels: the true data; the ranges observed by past queries; the model with 95% confidence interval]
6
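The modeling step on this slide can be sketched with a deliberately simplified stand-in for the paper's model: assume a constant per-week rate, pool the sums revealed by past range queries, and extrapolate to an unseen range. The numbers and the constant-rate assumption are illustrative only, not taken from the paper.

```python
# Past queries revealed SUM(count) over a few week ranges (made-up numbers).
observed = [
    (1, 20, 560_000_000),   # weeks 1..20
    (41, 60, 640_000_000),  # weeks 41..60
]

# Constant-rate model: pool all observed totals into one per-week rate.
total = sum(s for _, _, s in observed)
weeks = sum(hi - lo + 1 for lo, hi, _ in observed)
rate = total / weeks  # estimated SUM(count) per week (here 30M)

def predict(lo, hi):
    """Predicted SUM(count) over weeks lo..hi under the constant-rate model."""
    return rate * (hi - lo + 1)

print(predict(61, 80))  # 600000000.0: 20 unseen weeks at ~30M/week
```

The paper's model is far richer (it carries per-range uncertainty, which is where the 95% confidence interval in the figure comes from), but the mechanism is the same: past answers constrain a model that then answers ranges no query has touched.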
Design goals
select X3, avg(Y1) from t where X1 > 5 and X1 < 8;
select sum(Y2) from t where X2 between 'Apr' and 'May' group by X3;
[Chart: query latency, BlinkDB vs. DBL]
7
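The slide's two example queries can be run end to end on a toy instance. The schema and data below are assumptions made only for this sketch; X2 is stored as a month number, so the BETWEEN 'Apr' AND 'May' predicate becomes BETWEEN 4 AND 5:

```python
import sqlite3

# Toy instance of the example table t with assumed columns (X1, X2, X3, Y1, Y2).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (X1 REAL, X2 INTEGER, X3 TEXT, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (6.0, 4, "a", 10.0, 1.0),
        (7.0, 5, "a", 20.0, 2.0),
        (6.5, 4, "b", 30.0, 3.0),
        (2.0, 1, "b", 40.0, 4.0),
    ],
)

# First query from the slide: average of Y1 over a numeric range on X1.
q1 = conn.execute(
    "SELECT X3, AVG(Y1) FROM t WHERE X1 > 5 AND X1 < 8 GROUP BY X3 ORDER BY X3"
).fetchall()

# Second query: sum of Y2 over a month range, grouped by X3.
q2 = conn.execute(
    "SELECT X3, SUM(Y2) FROM t WHERE X2 BETWEEN 4 AND 5 GROUP BY X3 ORDER BY X3"
).fetchall()

print(q1)  # [('a', 15.0), ('b', 30.0)]
print(q2)  # [('a', 3.0), ('b', 3.0)]
```

These are exactly the kinds of queries whose (query, answer) pairs the synopsis accumulates: different aggregates, different columns, overlapping row ranges.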
Our Approach
Problem statement
Problem: Given past queries (q1, . . . , qn), their approximate answers, and a new query (qn+1), find the most likely answer to the new query (qn+1) and its estimated error.
Our result: Under a certain model assumption, the refined answer is never less accurate than the original approximate answer (in practice, much more accurate), provided the error bounds offer the same probabilistic guarantees.
8
Overview of our technique
select count(Y2) from t where X1 > 1 and X1 < 2;
select avg(Y2) from t where X1 > 6 and X1 < 8;
select sum(Y2) from t where X1 > 5 and X1 < 8;
1. Random variables θ1, θ2, θ3 (our uncertainty about the answers)
2. Probability distribution Pr(θ1, θ2, θ3)
3. Estimated answer Pr(θ3 | θ1, θ2)
Two aggregations that involve common values → correlation between their answers
9
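The correlation claim on this slide can be checked empirically with a small simulation. Everything below is a made-up setup for illustration: a synthetic population, a sampling-based engine that answers the second and third queries from 10% samples, and the empirical correlation between the two approximate answers, which is positive precisely because the rows with 6 < X1 < 8 satisfy both predicates.

```python
import random

random.seed(0)

# Hypothetical population of (X1, Y2) rows; rows with 6 < X1 < 8 satisfy both
# predicates below, and those shared rows are what correlate the two answers.
population = [(random.uniform(0, 10), random.uniform(0, 100))
              for _ in range(1000)]

def approx_answers(sample):
    """Approximate avg(Y2) for 6 < X1 < 8 and estimated sum(Y2) for 5 < X1 < 8."""
    a = [y for x, y in sample if 6 < x < 8]
    b = [y for x, y in sample if 5 < x < 8]
    avg_a = sum(a) / len(a)
    sum_b = sum(b) * len(population) / len(sample)  # scale up by sampling ratio
    return avg_a, sum_b

# Re-run the sampling-based engine many times and record both approximate answers.
pairs = [approx_answers(random.sample(population, 100)) for _ in range(500)]
xs, ys = zip(*pairs)
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)
sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
corr = cov / (sx * sy)
print(round(corr, 2))  # clearly positive: the two answers move together
```

This correlation is the leverage: once θ1 and θ2 are observed, a correlated θ3 is no longer a blank slate.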
How to define random variables
We define a random variable θ for every combination of aggregate function and selection predicates, e.g.:
select sum(Y2) from t where X1 > 5 and X1 < 8;
What if your query is complex?
select X3, avg(Y1), sum(Y2) from t where X1 > 5 and X1 < 8 and X2 between 'Apr' and 'May' group by X3;
10
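A hypothetical decomposition of the complex query above, following the slide's rule: each GROUP BY group combined with each aggregate is a separate scalar answer, so it gets its own random variable. The group values g1..g3 are made up for this sketch.

```python
from itertools import product

# The two aggregates and the (assumed) distinct X3 groups of the complex query.
aggregates = ["avg(Y1)", "sum(Y2)"]
groups = ["X3 = 'g1'", "X3 = 'g2'", "X3 = 'g3'"]
base_predicate = "X1 > 5 AND X1 < 8 AND X2 BETWEEN 'Apr' AND 'May'"

# One random variable per (aggregate function, selection predicates) combination.
variables = [
    (agg, f"{base_predicate} AND {grp}")
    for agg, grp in product(aggregates, groups)
]

for i, (agg, pred) in enumerate(variables, start=1):
    print(f"theta_{i}: {agg} over rows where {pred}")

print(len(variables))  # 6: one scalar random variable per (aggregate, group)
```

So a single complex query contributes several θ's to the synopsis, one per scalar it returns.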
How to determine the probability distribution
The Principle of Maximum Entropy (ME): pick the most likely Pr(θ1, θ2, θ3) consistent with the statistical information we hold about (θ1, θ2, θ3).
[Spectrum: a low amount of info yields a simple Pr with fast inference but low fidelity; a high amount of info yields a complex Pr with slow inference but high fidelity]
Our choice: (co)variances between pairs of answers.
11
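For reference, the standard fact this choice leans on (stated here in full, not on the slide): among all joint densities matching a given mean vector and covariance matrix, the entropy maximizer is the multivariate normal.

```latex
\max_{p}\; -\int p(\theta)\,\ln p(\theta)\,d\theta
\quad\text{s.t.}\quad
\mathbb{E}_p[\theta] = \mu,\;\;
\operatorname{Cov}_p(\theta) = \Sigma
\quad\Longrightarrow\quad
p^{*}(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma)
```

This is why constraining only means and (co)variances is such a convenient point on the spectrum: the resulting distribution is fully parametric and supports closed-form conditioning.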
Most-likely probability distribution
[Diagram: random variables θ1, θ2, θ3]
Statistical information: means, variances, covariances
MaxEnt → multivariate normal distribution, which allows fast inference using a closed form
12
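A minimal sketch of the closed-form inference this slide refers to: if (θ1, θ2, θ3) is multivariate normal, then θ3 conditioned on observed answers for θ1 and θ2 is again normal, with mean and variance given by the usual Schur-complement formulas. All numbers below are illustrative, not from the paper.

```python
# Prior means of the three answers and the covariances induced by overlapping
# predicates (illustrative values).
mu = [100.0, 200.0, 300.0]
Sigma = [
    [25.0, 10.0, 15.0],
    [10.0, 36.0, 12.0],
    [15.0, 12.0, 49.0],
]
observed = [104.0, 206.0]  # approximate answers to past queries q1, q2

# Partition Sigma around theta3: A = Cov(theta1, theta2),
# b = Cov(theta3, (theta1, theta2)).
A = [[Sigma[0][0], Sigma[0][1]], [Sigma[1][0], Sigma[1][1]]]
b = [Sigma[2][0], Sigma[2][1]]

# Invert the 2x2 block A directly.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]

# Weights w = b @ A_inv, applied to the residual (observed - prior mean).
w = [b[0] * A_inv[0][0] + b[1] * A_inv[1][0],
     b[0] * A_inv[0][1] + b[1] * A_inv[1][1]]
resid = [observed[0] - mu[0], observed[1] - mu[1]]

cond_mean = mu[2] + w[0] * resid[0] + w[1] * resid[1]
cond_var = Sigma[2][2] - (w[0] * b[0] + w[1] * b[1])

print(round(cond_mean, 3), round(cond_var, 3))  # 303.225 38.875
```

The shrinking variance (49 down to about 39) is the slide's 10% vs. 2% error story in miniature: conditioning on correlated past answers can only tighten the Gaussian's spread.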
Benefits of database learning
Database learning vs. indexing
[Chart: storage footprint as the database grows, indexing vs. DBL]
Database learning vs. materialized view
[Chart: performance over system uptime, view selection vs. DBL]
13
Experiment
Experiment setup
14
Our experimental claims
15
Generality of Verdict
Dataset     # Analyzed    # Supported    Percentage
Customer1   3,342         2,463          73.7%
TPC-H       21            14             66.7%
Unsupported queries (example): city like '%arbor%'
16
Runtime-error trade-off
Results on the TPC-H dataset (the paper also has the Customer1 results); number of past queries fixed at 50.
[Plots: error bound (%) vs. runtime, NoLearn vs. Verdict: (a) data in memory, runtime in seconds; (b) data on SSD, runtime in minutes]
17
Speedup
Results on the Customer1 dataset (the paper also has the TPC-H results).
[Charts: speedup over NoLearn at target error bounds of 4% and 2%: (a) data in memory, 7.7x and 2.5x; (b) data on SSD, 23x and 5.7x]
18
Memory and computational overhead
            Latency (memory)     Latency (SSD)
NoLearn     2.083 sec            52.50 sec
Verdict     2.093 sec            52.51 sec
Overhead    0.010 sec (0.48%)    0.010 sec (0.02%)
19