Wentao Wu1, Yun Chi2, Shenghuo Zhu2, Junichi Tatemura2, Hakan Hacigumus2, Jeffrey Naughton1
1Dept of Computer Sciences, University of Wisconsin-Madison 2NEC Laboratories America
1
Motivation Database as a service (DaaS) User Service Provider - - PowerPoint PPT Presentation
Wentao Wu 1 , Yun Chi 2 , Shenghuo Zhu 2 , Junichi Tatemura 2 , Hakan Hacigumus 2 , Jeffrey Naughton 1 1 Dept of Computer Sciences, University of Wisconsin-Madison 2 NEC Laboratories America 1 Motivation Database as a service (DaaS) User
1Dept of Computer Sciences, University of Wisconsin-Madison 2NEC Laboratories America
1
Database
Service Level Agreement (SLA)
2
Run this query or not?
If we decide to run it, when?
How long should we wait if something is wrong?
How much hardware does it require to run in the given time?
3
Can we just use them?
Query optimizers’ cost estimates are unusable.
avg err: 120%
4
The ratios between the c’s are correct. The n’s are correct.
𝐷 = 𝑜𝑡𝑑𝑡 + 𝑜𝑠𝑑𝑠 + 𝑜𝑢𝑑𝑢 + 𝑜𝑗𝑑𝑗 + 𝑜𝑝𝑑𝑝
Cost Unit Value cs: seq_page_cost 1.0 cr: rand_page_cost 4.0 ct: cpu_tuple_cost 0.01 ci: cpu_index_tuple_cost 0.005 co: cpu_operator_cost 0.0025
𝑈 = 𝑏 ⋅ 𝐷 = 𝑑𝑡
′ ⋅ (𝑜𝑡 + 𝑜𝑠
𝑑𝑠 𝑑𝑡 + 𝑜𝑢 𝑑𝑢 𝑑𝑡 + 𝑜𝑗 𝑑𝑗 𝑑𝑡 + 𝑜𝑝 𝑑𝑝 𝑑𝑡 )
5
𝑑𝑡
′ = 𝑏 ⋅ 𝑑𝑡 = 𝑏 ⋅ 1.0 = 𝑏
Should be correct!
Use machine learning ([Ganapathi ICDE’09], [Akdere
Calibrate the c’s and the n’s! (our work)
6
𝐷 = 𝑜𝑡𝑑𝑡 + 𝑜𝑠𝑑𝑠 + 𝑜𝑢𝑑𝑢 + 𝑜𝑗𝑑𝑗 + 𝑜𝑝𝑑𝑝
7
Calibrate the c’s: use profiling queries. Calibrate the n’s: refine cardinality estimates.
8
9
Want to know the true 𝑑𝑢 and 𝑑𝑝
k cost units (i.e., k unknowns) => k queries (i.e., k equations) k = 5 in the case of PostgreSQL
Cost Unit cs: seq_page_cost cr: rand_page_cost ct: cpu_tuple_cost ci: cpu_index_tuple_cost co: cpu_operator_cost
10
Each c should be covered by at least one query.
The set of queries is incomplete if any query is removed.
Each query should be as simple as possible.
11
q1: select * from R R in memory 𝑢1 = 𝑑𝑢 ∙ 𝑜𝑢1 q2: select count(*) from R R in memory 𝑢2 = 𝑑𝑢 ∙ 𝑜𝑢2 + 𝑑𝑝 ∙ 𝑜𝑝2 q3: select * from R where R.A < a (R.A with an Index) R in memory 𝑢3 = 𝑑𝑢 ∙ 𝑜𝑢3 + 𝑑𝑗 ∙ 𝑜𝑗3 + 𝑑𝑝 ∙ 𝑜𝑝3 q4: select * from R R on disk 𝑢4 = 𝑑𝑡 ∙ 𝑜𝑡4 + 𝑑𝑢 ∙ 𝑜𝑢4 q5: select * from R where R.B < b (R.B unclustered Index) R on disk 𝑢5 = 𝑑𝑡 ∙ 𝑜𝑡5 + 𝑑𝑠 ∙ 𝑜𝑠5 + 𝑑𝑢 ∙ 𝑜𝑢5 + 𝑑𝑗 ∙ 𝑜𝑗5 + 𝑑𝑝 ∙ 𝑜𝑝5
12
Calibrating the n’s => Calibrating the N’s
𝑝 ⋅ 𝑂𝑢 𝑗 + 𝑂𝑢 𝑝 ⋅ 𝑠𝑑 𝑝𝑔 𝑗𝑜𝑜𝑓𝑠 𝑑ℎ𝑗𝑚𝑒
13
14
[Haas et al., J. Comput. Syst. Sci. 1996]
15
𝐶11 𝐶12 … 𝐶1𝑂1 𝐶21 𝐶22 … 𝐶2𝑂2
𝐶11
𝐶22
𝐶1𝑂1
𝐶21
ො 𝜍𝑟 = 1 𝑜
𝑗=1 𝑜
𝜍𝑗
𝜍1 𝜍𝑜
𝜍1 = |𝐶11 ⋈ 𝐶22| 𝐶11 × |𝐶22|
Problem Our Solution
16
I/Os at runtime to take samples.
them as tables in the database.
more than one operators.
single run, by reusing partial results.
select/join operators.
Future work: Add estimators for aggregates ([Charikar PODS’00]).
𝑡 ⋈ 𝑆2 𝑡|
𝑡| × |𝑆2 𝑡|
𝑡 ⋈ 𝑆2 𝑡 ⋈ 𝑆3 𝑡|
𝑡| × |𝑆2 𝑡| × |𝑆3 𝑡|
17
𝑡
𝑡
𝑡
𝑡, 𝑆2 𝑡, 𝑆3 𝑡 are samples (as tables) of 𝑆1, 𝑆2, 𝑆3
Both uniform and skewed data distribution
PC1: 1-core 2.27 GHz Intel CPU, 2GB memory PC2: 8-core 2.40 GHz Intel CPU, 16GB memory
18
Cost Unit Calibrated (ms) Calibrated (normalized to cs) Default cs: seq_page_cost 5.03e-2 1.0 1.0 cr: rand_page_cost 4.89e-1 9.7 4.0 ct: cpu_tuple_cost 1.41e-4 0.0028 0.01 ci: cpu_index_tuple_cost 3.34e-5 0.00066 0.005 co: cpu_operator_cost 7.10e-5 0.0014 0.0025 Cost Unit Calibrated (ms) Calibrated (normalized to cs) Default cs: seq_page_cost 5.53e-2 1.0 1.0 cr: rand_page_cost 6.50e-2 1.2 4.0 ct: cpu_tuple_cost 1.67e-4 0.003 0.01 ci: cpu_index_tuple_cost 3.41e-5 0.0006 0.005 co: cpu_operator_cost 1.12e-4 0.002 0.0025
19
Mean Relative Error (MRE)
Unseen queries frequently occur.
Naive scaling More complex machine learning approaches
20
Represent a query as a feature vector Train a regression model
Did not compare since [Akdere ICDE’12] is better.
21
22
Naïve Scaling
23
24
30% samples (0.3GB) => 20% additional overhead
5% samples (0.5GB) => 4% additional overhead
25
26