DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // - - PowerPoint PPT Presentation

database system implementation
SMART_READER_LITE
LIVE PREVIEW

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // - - PowerPoint PPT Presentation

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #15: OPTIMIZER IMPLEMENTATION (PART II) 2 LOGISTICS Reminder: Reading Review 3 due today. Reminder: Proposal presentations in class on Thursday (Mar 7).


slide-1
SLIDE 1

DATABASE SYSTEM IMPLEMENTATION

GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #15: OPTIMIZER IMPLEMENTATION (PART II)

slide-2
SLIDE 2

LOGISTICS

Reminder: Reading Review 3 due today. Reminder: Proposal presentations in class on Thursday (Mar 7). Reminder: HW3 submission postponed to next Tuesday (Mar 12).

2

slide-3
SLIDE 3

TODAY’S AGENDA

Cascades / Columbia Orca Optimizer MemSQL Optimizer Working in a large code base

2

slide-4
SLIDE 4

QUERY OPTIMIZATION STRATEGIES

Choice #1: Heuristics

→ INGRES, Oracle (until mid 1990s)

Choice #2: Heuristics + Cost-based Join Search

→ System R, early IBM DB2, most open-source DBMSs

Choice #3: Randomized Search

→ Academics in the 1980s, current Postgres

Choice #4: Stratified Search

→ IBM’s STARBURST (late 1980s), now IBM DB2 + Oracle

Choice #5: Unified Search

→ Volcano/Cascades in 1990s, now MSSQL + Greenplum

4

slide-5
SLIDE 5

POSTGRES OPTIMIZER

Imposes a rigid workflow for query optimization:

→ First stage performs initial rewriting with heuristics → It then executes a cost-based search to find optimal join

  • rdering.

→ Everything else is treated as an “add-on”. → Then recursively descends into sub-queries.

Difficult to modify or extend because the ordering has to be preserved.

5

slide-6
SLIDE 6

OPTIMIZER GENERATORS

Framework to allow a DBMS implementer to write the declarative rules for optimizing queries.

→ Separate the search strategy from the data model. → Separate the transformation rules and logical operators from physical rules and physical operators.

Implementation can be independent of the

  • ptimizer's search strategy.

Examples: Starburst, Exodus, Volcano, Cascades, OPT++

6

slide-7
SLIDE 7

STRATIFIED SEARCH

First rewrite the logical query plan using transformation rules.

→ The engine checks whether the transformation is allowed before it can be applied. → Cost is never considered in this step.

Then perform a cost-based search to map the logical plan to a physical plan.

7

slide-8
SLIDE 8

UNIFIED SEARCH

Unify the notion of both logical→logical and logical→physical transformations.

→ No need for separate stages because everything is transformations.

This approach generates a lot more transformations so it makes heavy use of memoization to reduce redundant work.

8

slide-9
SLIDE 9

TOP-DOWN VS. BOTTOM-UP

Top-down Optimization

→ Start with the final outcome that you want, and then work down the tree to find the optimal plan that gets you to that goal. → Example: Volcano, Cascades

Bottom-up Optimization

→ Start with nothing and then build up the plan to get to the final outcome that you want. → Examples: System R, Starburst

9

slide-10
SLIDE 10

CASCADES OPTIMIZER

Object-oriented implementation of the Volcano query optimizer. Simplistic expression re-writing can be through a direct mapping function rather than an exhaustive search.

10

THE CASCADES FRAMEWORK FOR QUERY OPTIMIZATION IEEE Data Engineering Bulletin 1995

Graefe

slide-11
SLIDE 11

CASCADES OPTIMIZER

Optimization tasks as data structures. Rules to place property enforcers. Ordering of moves by promise. Predicates as logical/physical operators.

11

EFFICIENCY IN THE COLUMBIA DATABASE QUERY OPTIMIZER Portland State University MS Thesis 1998

slide-12
SLIDE 12

CASCADES – EXPRESSIONS

A expression is an operator with zero or more input expressions. Logical Expression: (A ⨝ B) ⨝ C Physical Expression: (AF ⨝HJ BF) ⨝NLJ CF

12

slide-13
SLIDE 13

CASCADES – GROUPS

A group is a set of logically equivalent logical and physical expressions that produce the same output.

→ All logical forms of an expression → All physical expressions that can be derived from selecting the allowable physical operators for the corresponding logical forms.

13

Output: [ABC]

Logical Exps 1. (A⨝B)⨝C 2. (B⨝C)⨝A 3. (A⨝C)⨝B 4. A⨝(B⨝C) ⋮ Physical Exps 1. (AF⨝LBF)⨝LCF 2. (BF⨝LCF)⨝LAF 3. (AF⨝LCF)⨝LBF 4. AF⨝L(CF⨝LBF) ⋮

slide-14
SLIDE 14

CASCADES – GROUPS

A group is a set of logically equivalent logical and physical expressions that produce the same output.

→ All logical forms of an expression → All physical expressions that can be derived from selecting the allowable physical operators for the corresponding logical forms.

14

Output: [ABC]

Logical Exps 1. (A⨝B)⨝C 2. (B⨝C)⨝A 3. (A⨝C)⨝B 4. A⨝(B⨝C) ⋮ Physical Exps 1. (AF⨝LBF)⨝LCF 2. (BF⨝LCF)⨝LAF 3. (AF⨝LCF)⨝LBF 4. AF⨝L(CF⨝LBF) ⋮

Group

slide-15
SLIDE 15

CASCADES – GROUPS

A group is a set of logically equivalent logical and physical expressions that produce the same output.

→ All logical forms of an expression → All physical expressions that can be derived from selecting the allowable physical operators for the corresponding logical forms.

15

Output: [ABC]

Logical Exps 1. (A⨝B)⨝C 2. (B⨝C)⨝A 3. (A⨝C)⨝B 4. A⨝(B⨝C) ⋮ Physical Exps 1. (AF⨝LBF)⨝LCF 2. (BF⨝LCF)⨝LAF 3. (AF⨝LCF)⨝LBF 4. AF⨝L(CF⨝LBF) ⋮

Equivalent Expressions

Group

slide-16
SLIDE 16

CASCADES – MULTI-EXPRESSION

Instead of explicitly instantiating all possible expressions in a group, the optimizer implicitly represents redundant expressions in a group as a multi-expression.

→ This reduces the number of transformations, storage

  • verhead, and repeated cost estimations.

16

Output: [ABC]

Logical Multi-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [A]⨝[BC] ⋮ Physical Multi-Exps 1. [AB]⨝L[C] 2. [BC]⨝L[A] 3. [AC]⨝L[B] 4. [A]⨝L[CB] ⋮

slide-17
SLIDE 17

CASCADES – RULES

A rule is a transformation of an expression to a logically equivalent expression.

→ Transformation Rule: Logical to Logical → Implementation Rule: Logical to Physical

Each rule is represented as a pair of attributes:

→ Pattern: Defines the structure of the logical expression that can be applied to the rule. → Substitute: Defines the structure of the result after applying the rule.

17

slide-18
SLIDE 18

Pattern

CASCADES – RULES

18 EQJOIN EQJOIN GROUP 1 GROUP 2 GROUP 3

slide-19
SLIDE 19

Pattern

CASCADES – RULES

19 EQJOIN EQJOIN GROUP 1 GROUP 2 GROUP 3 [AB]⨝C A⨝B GET(A) GET(B) GET(C)

Matching Plan

slide-20
SLIDE 20

Pattern

CASCADES – RULES

20 EQJOIN EQJOIN GROUP 1 GROUP 2 GROUP 3

Transformation Rule Rotate Left-to-Right

A⨝[BC] GET(A) GET(B) GET(C) B⨝C [AB]⨝C A⨝B GET(A) GET(B) GET(C)

Matching Plan

slide-21
SLIDE 21

Pattern

CASCADES – RULES

21 EQJOIN EQJOIN GROUP 1 GROUP 2 GROUP 3

Transformation Rule Rotate Left-to-Right Implementation Rule EQJOIN→SORTMERGE

A⨝[BC] GET(A) GET(B) GET(C) B⨝C [AB]⨝SMC A⨝SMB GET(A) GET(B) GET(C) [AB]⨝C A⨝B GET(A) GET(B) GET(C)

Matching Plan

slide-22
SLIDE 22

CASCADES – MEMO TABLE

Stores all previously explored alternatives in a compact graph structure. Equivalent operator trees and their corresponding plans are stored together in groups. Provides memoization, duplicate detection, and property + cost management.

22

slide-23
SLIDE 23

PRINCIPLE OF OPTIMALITY

Every sub-plan of an optimal plan is itself optimal. This allows the optimizer to restrict the search space to a smaller set of expressions.

→ The optimizer never has to consider a plan containing sub-plan P1 that has a greater cost than equivalent plan P2 with the same physical properties.

23

EXPLOITING UPPER AND LOWER BOUNDS IN TOP-DOWN QUERY OPTIMIZATION IDEAS 2001

slide-24
SLIDE 24

CASCADES – MEMO TABLE

24 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-25
SLIDE 25

CASCADES – MEMO TABLE

25 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-26
SLIDE 26

CASCADES – MEMO TABLE

26 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-27
SLIDE 27

CASCADES – MEMO TABLE

27 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-28
SLIDE 28

CASCADES – MEMO TABLE

28 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-29
SLIDE 29

CASCADES – MEMO TABLE

29 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Winner [ABC] [AB] [A] [C] [B]

slide-30
SLIDE 30

CASCADES – MEMO TABLE

30 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A)

slide-31
SLIDE 31

CASCADES – MEMO TABLE

31 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A)

slide-32
SLIDE 32

CASCADES – MEMO TABLE

32 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A)

slide-33
SLIDE 33

CASCADES – MEMO TABLE

33 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-34
SLIDE 34

CASCADES – MEMO TABLE

34 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-35
SLIDE 35

CASCADES – MEMO TABLE

35 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-36
SLIDE 36

CASCADES – MEMO TABLE

36 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-37
SLIDE 37

CASCADES – MEMO TABLE

37 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-38
SLIDE 38

CASCADES – MEMO TABLE

38 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B)

slide-39
SLIDE 39

CASCADES – MEMO TABLE

39 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20 Cost: 50+(10+20)

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B) [A]⨝SM[B]

slide-40
SLIDE 40

CASCADES – MEMO TABLE

40 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20 Cost: 50+(10+20) Cost: 5

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B) I-SCAN(C) [A]⨝SM[B]

slide-41
SLIDE 41

CASCADES – MEMO TABLE

41 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20 Cost: 50+(10+20) Cost: 5

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B) I-SCAN(C) [A]⨝SM[B]

slide-42
SLIDE 42

CASCADES – MEMO TABLE

42 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20 Cost: 50+(10+20) Cost: 5

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B) I-SCAN(C) [A]⨝SM[B]

slide-43
SLIDE 43

CASCADES – MEMO TABLE

43 Output: [ABC]

Logical M-Exps 1. [AB]⨝[C] 2. [BC]⨝[A] 3. [AC]⨝[B] 4. [B]⨝[AC] Physical M-Exps 1. [AB]⨝LC 2. [BC]⨝LA 3. [AC]⨝LB ⋮

Output: [AB]

Logical M-Exps 1. [A]⨝[B] 2. [B]⨝[A] Physical M-Exps 1. [A]⨝L[B] 2. [A]⨝SM[B] 3. [B]⨝L[A]

Output: [A]

Logical M-Exps 1. GET(A) Physical M-Exps 1. F-SCAN(A) 2. I-SCAN(A)

Output: [B]

Logical M-Exps 1. GET(B) Physical M-Exps 1. F-SCAN(B) 2. I-SCAN(B)

Output: [C]

Logical M-Exps 1. GET(C) Physical M-Exps 1. F-SCAN(C) 2. I-SCAN(C)

Cost: 10 Cost: 20 Cost: 50+(10+20) Cost: 5

Winner [ABC] [AB] [A] [C] [B] F-SCAN(A) F-SCAN(B) I-SCAN(C) [A]⨝SM[B] Output: [BC]

Logical M-Exps 1. [B]⨝[C] 2. [C]⨝[B] Physical M-Exps

Output: [AC]

Logical M-Exps 1. [A]⨝[C] 2. [C]⨝[A] Physical M-Exps

slide-44
SLIDE 44

SEARCH TERMINATION

Approach #1: Wall-clock Time

→ Stop after the optimizer runs for some length of time.

Approach #2: Cost Threshold

→ Stop when the optimizer finds a plan that has a lower cost than some threshold.

Approach #3: Transformation Exhaustion

→ Stop when there are no more ways to transform the target plan. Usually done per group.

44

slide-45
SLIDE 45

CASCADES IMPLEMENTATIONS

Standalone:

→ Wisconsin OPT++ (1990s) → Portland State Columbia (1990s) → Pivotal Orca (2010s) → Apache Calcite (2010s)

Integrated:

→ Microsoft SQL Server (1990s) → Tandem NonStop SQL (1990s) → Clustrix (2000s) → CMU Peloton (2010s)

45

slide-46
SLIDE 46

PREDICATE EXPRESSIONS

Predicates are defined as part of each operator.

→ These are typically represented as an AST. → Postgres implements them as flatten lists.

The same logical operator can be represented in multiple physical operators using variations of the same expression.

46

slide-47
SLIDE 47

PREDICATE PUSHDOWN

Approach #1: Logical Transformation

→ Like any other transformation rule in Cascades. → Can use cost-model to determine benefit.

Approach #2: Rewrite Phase

→ Perform pushdown before starting search using an initial rewrite phase. Tricky to support complex predicates.

Approach #3: Late Binding

→ Perform pushdown after generating optimal plan in

  • Cascades. Will likely produce a bad plan.

47

slide-48
SLIDE 48

PREDICATE MIGRATION

Observation: Not all predicates cost the same to evaluate on tuples. The optimizer should consider selectivity and computation cost when determining the evaluation order of predicates.

48

SELECT * FROM foo WHERE foo.id = 1234 AND SHA_512(foo.val) = '...'

PREDICATE MIGRATION: OPTIMIZING QUERIES WITH EXPENSIVE PREDICATES SIGMOD 1993

slide-49
SLIDE 49

PIVOTAL ORCA

Standalone Cascades implementation.

→ Originally written for Greenplum. → Extended to support HAWQ.

A DBMS can use Orca by implementing API to send catalog + stats + logical plans and then retrieve physical plans. Supports multi-threaded search.

49

ORCA: A MODULAR QUERY OPTIMIZER ARCHITECTURE FOR BIG DATA SIGMOD 2014

slide-50
SLIDE 50

ORCA – ENGINEERING

Issue #1: Remote Debugging

→ Automatically dump the state of the optimizer (with inputs) whenever an error occurs. → The dump is enough to put the optimizer back in the exact same state later on for further debugging.

Issue #2: Optimizer Accuracy

→ Automatically check whether the ordering of the estimate cost of two plans matches their actual execution cost.

50

slide-51
SLIDE 51

MEMSQL OPTIMIZER

Rewriter

→ Logical-to-logical transformations with access to the cost-model.

Enumerator

→ Logical-to-physical transformations. → Mostly join ordering.

Planner

→ Convert physical plans back to SQL. → Contains MemSQL-specific commands for moving data.

51

THE MEMSQL QUERY OPTIMIZER VLDB 2017

slide-52
SLIDE 52

MEMSQL OPTIMIZER OVERVIEW

52

Parser

Abstract Syntax Tree Logical Plan Physical Plan Cost Estimates

SQL Query

Binder Rewriter Enumerator Planner

Physical Plan SQL

slide-53
SLIDE 53

53

LIFE LESSONS FOR WORKING ON CODE

slide-54
SLIDE 54

DISCLAIMER

I have worked on a few large projects in my lifetime (2 DBMSs – Peloton, PostgreSQL). But I’m not claiming to be all knowledgeable in modern software engineering practices.

54

slide-55
SLIDE 55

OBSERVATION

Most software development is never from scratch. You will be expected to be able to work with a large amount of code that you did not write. Being able to independently work on a large code base is the #1 skill that companies tell me they are looking for in students they hire.

55

slide-56
SLIDE 56

PASSIVE READING

Reading the code for the sake of reading code is (usually) a waste of time.

→ It’s hard to internalize functionality if you don’t have direction.

It’s important to start working with the code right away to understand how it works.

56

slide-57
SLIDE 57

TEST CASES

Adding or improving tests allows you to improve the reliability of the code base without the risk of breaking production code.

→ It forces you to understand code in a way that is not possible when just reading it.

Nobody will complain (hopefully) about adding new tests to the system.

57

slide-58
SLIDE 58

REFACTORING

Find the general location of code that you want to work on and start cleaning it up.

→ Add/edit comments → Clean up messy code → Break out repeated logic into separate functions.

Tread lightly though because you are changing code that you are not familiar with yet.

58

slide-59
SLIDE 59

TOOLCHAINS & PROCESSES

Beyond working on the code, there will also be an established protocol for software development. More established projects will have either training

  • r comprehensive documentation.

→ If the documentation isn’t available, then you can take the initiative and try to write it.

59

slide-60
SLIDE 60

TIPS FOR NEW PROGRAMMERS

  • Keep your pull request small and targeted
  • Learn the development process, coding style
  • Maintain a design document: classes, interfaces
  • Keep your eye on the end goal (actions per item)
  • Avoid writing bad code: large functions, complex logic,

premature optimizations, duplicated code

  • Lose your ego, don't be afraid to ask questions, and

don't try to write "clever" unintelligible code

60 Source: Top signs of an inexperienced programmer

slide-61
SLIDE 61

NEXT CLASS

Cost Models Tips for technical writing

61