Hacking the Query Planner, Again Richard Guo / VMware PGCon 2020 - - PowerPoint PPT Presentation

hacking the query planner again
SMART_READER_LITE
LIVE PREVIEW

Hacking the Query Planner, Again Richard Guo / VMware PGCon 2020 - - PowerPoint PPT Presentation

Hacking the Query Planner, Again Richard Guo / VMware PGCon 2020 Agenda What does planner do? Phases of planning Overall backend structure Parser Determines the semantic meaning of a query string Rewriter


slide-1
SLIDE 1

Hacking the Query Planner, Again

Richard Guo / VMware PGCon 2020

slide-2
SLIDE 2

Agenda

  • What does planner do?
  • Phases of planning
slide-3
SLIDE 3

Overall backend structure

  • Parser

○ Determines the semantic meaning of a query string

  • Rewriter

○ Performs view and rule expansion

  • Planner

○ Designs an execution plan for the query

  • Executor

○ Runs the plan

slide-4
SLIDE 4

What does planner do?

  • For a given query, find a correct execution plan that has the lowest "cost"

○ A given query can be actually executed in a wide variety of different ways ○ If it is computationally feasible, examine each of these possible ways, represented by data structures called Path ○ Select the cheapest Path and convert it to a full-fledged Plan

slide-5
SLIDE 5

Agenda

  • What does planner do?
  • Phases of planning
slide-6
SLIDE 6

Phases of planning

  • Preprocessing

○ simplify the query if possible; collect information

  • Scan/join planning

○ decide how to implement FROM/WHERE

  • Post scan/join planning

○ deal with plan steps that aren’t scans or joins

  • Postprocessing

○ convert results into form the executor wants

slide-7
SLIDE 7

Early preprocessing

  • Simplify scalar expressions
  • Expand simple SQL functions in-line
  • Simplify join tree
slide-8
SLIDE 8

Simplify scalar expressions

  • Simplify function calls

○ The function is strict and has any constant-null inputs ○ The function is immutable and has all constant inputs int4eq(1,NULL) => NULL 2 + 2 => 4

slide-9
SLIDE 9

Simplify scalar expressions

  • Simplify boolean expressions

"x OR true" => "true" "x AND false" => "false"

slide-10
SLIDE 10

Simplify scalar expressions

  • Simplify CASE expressions

CASE WHEN 2+2 = 4 THEN x+1 ELSE 1/0 END ⇒ x+1 ... not “ERROR: division by zero”

slide-11
SLIDE 11

Why bother simplifying?

  • Do computations only once, not once per row
  • Exploit constant-folding opportunities exposed by view expansion and SQL

function inlining

slide-12
SLIDE 12

Expand simple SQL functions in-line

CREATE FUNCTION incr4(int) RETURNS int AS 'SELECT $1 + (2 + 2)' LANGUAGE SQL; SELECT incr4(a) FROM foo; => SELECT a + 4 FROM foo;

slide-13
SLIDE 13

Why bother inlining SQL functions?

  • Avoid the rather high per-call overhead of SQL functions
  • Expose opportunities for constant-folding within the function expression
slide-14
SLIDE 14

Simplify join tree

  • Convert IN, EXISTS sub-selects to semi-joins
  • Flatten (“pull up”) sub-selects if possible
  • Flatten UNION ALL, expand inheritance trees
  • Reduce outer joins to inner joins
  • Reduce outer joins to anti joins
slide-15
SLIDE 15

Convert IN, EXISTS sub-selects to semi-joins

SELECT * FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE foo.a = bar.c); => SELECT * FROM foo *SEMI JOIN* bar ON foo.a = bar.c;

slide-16
SLIDE 16

rtable jointree RTE_RELATION (foo) fromlist quals rtindex:1 EXISTS_SUBLINK subselect rtable jointree RTE_RELATION (bar) fromlist quals rtindex:1 = args varno:1 varattno:1 varlevelsup:1 varno:1 varattno:1 varlevelsup:0 Query RangeTblEntry FromExpr RangeTblRef SubLink Query RangeTblEntry FromExpr RangeTblRef OpExpr Var Var

SELECT * FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE foo.a = bar.c);

EXISTS SubLink

slide-17
SLIDE 17

rtable jointree RTE_RELATION (foo) fromlist quals:NULL rtindex:2 = args varno:1 varattno:1 varlevelsup:0 varno:2 varattno:1 varlevelsup:0 Query RangeTblEntry FromExpr RangeTblRef OpExpr Var Var

SELECT * FROM foo *SEMI JOIN* bar ON foo.a = bar.c;

RTE_RELATION (bar) RangeTblEntry JOIN_SEMI quals larg | rarg fromlist quals:NULL rtindex:1 FromExpr RangeTblRef JoinExpr

slide-18
SLIDE 18

Flatten (“pull up”) sub-selects if possible

SELECT * FROM foo JOIN (SELECT bar.c FROM bar JOIN baz ON TRUE) AS sub ON foo.a = sub.c; => SELECT * FROM foo JOIN (bar JOIN baz ON TRUE) ON foo.a = bar.c;

slide-19
SLIDE 19

rtable jointree RTE_RELATION (foo) fromlist quals:NULL = args varno:1 varattno:1 varlevelsup:0 varno:2 varattno:1 varlevelsup:0 Query RangeTblEntry FromExpr OpExpr Var Var RTE_SUBQUERY (sub) RangeTblEntry rtable jointree RTE_RELATION (bar) fromlist quals:NULL Query RangeTblEntry FromExpr RTE_RELATION (baz) RangeTblEntry RTE_JOIN RangeTblEntry JOIN_INNER quals:TRUE larg | rarg JoinExpr rtindex:1 RangeTblRef rtindex:2 RangeTblRef RTE_JOIN RangeTblEntry JOIN_INNER quals larg | rarg JoinExpr rtindex:1 RangeTblRef rtindex:2 RangeTblRef

SubSelect

SELECT * FROM foo JOIN (SELECT bar.c FROM bar JOIN baz ON TRUE) AS sub ON foo.a = sub.c;

slide-20
SLIDE 20

rtable jointree RTE_RELATION (foo) fromlist quals:NULL = args varno:1 varattno:1 varlevelsup:0 varno:4 varattno:1 varlevelsup:0 Query RangeTblEntry FromExpr OpExpr Var Var RTE_SUBQUERY (sub) RangeTblEntry RTE_RELATION (bar) RangeTblEntry RTE_RELATION (baz) RangeTblEntry RTE_JOIN RangeTblEntry RTE_JOIN RangeTblEntry JOIN_INNER quals larg | rarg JoinExpr rtindex:1 RangeTblRef JOIN_INNER quals:TRUE larg | rarg JoinExpr rtindex:4 RangeTblRef rtindex:5 RangeTblRef

SELECT * FROM foo JOIN (bar JOIN baz ON TRUE) ON foo.a = bar.c;

slide-21
SLIDE 21

Why bother flattening sub-selects?

  • It may help produce a better plan to pull up a subquery into the parent query

and consider it as part of the entire plan search space

  • Otherwise the subquery would be planned independently and treated as a

"black box" during planning of the outer query

slide-22
SLIDE 22

Reduce outer joins to inner joins

  • If there is a strict qual above the outer join that constrains a Var from the

nullable side of the join to be non-null

SELECT ... FROM foo LEFT JOIN bar ON (...) WHERE bar.d = 42; => SELECT ... FROM foo INNER JOIN bar ON (...) WHERE bar.d = 42;

slide-23
SLIDE 23

Reduce outer joins to anti joins

  • If the outer join's own quals are strict for any nullable Var that was forced null

by higher qual levels

SELECT * FROM foo LEFT JOIN bar ON foo.a = bar.c WHERE bar.c IS NULL; => SELECT * FROM foo *ANTI JOIN* bar on foo.a = bar.c;

slide-24
SLIDE 24

Later preprocessing

  • Distribute WHERE and JOIN/ON qual clauses
  • Build equivalence classes for provably-equal expressions
  • Gather information about join ordering restrictions
  • Remove useless joins
  • ...
slide-25
SLIDE 25

Distribute WHERE and JOIN/ON qual clauses

  • In general, we want to use each qual at the lowest possible join level
  • When dealing with inner joins, we can push a qual down to its "natural"

semantic level

  • When dealing with outer joins, a qual may be delayed and cannot be pushed

down to its "natural" semantic level

  • We mark the outerjoin-delayed qual with a "required_relids" including all the

required rels in the outer join

slide-26
SLIDE 26

Quals that are outerjoin-delayed

  • An outer join's own JOIN/ON quals mentioning nonnullable side rels cannot

be pushed down below the outer join

# EXPLAIN (COSTS OFF) SELECT * FROM foo LEFT JOIN bar ON foo.a = 42; QUERY PLAN

  • Nested Loop Left Join

Join Filter: (foo.a = 42)

  • > Seq Scan on foo
  • > Materialize
  • > Seq Scan on bar

(5 rows)

slide-27
SLIDE 27

Quals that are outerjoin-delayed

  • Quals appearing in WHERE or in a JOIN above the outer join cannot be

pushed down below the outer join, if they reference any nullable Vars

# EXPLAIN (COSTS OFF) SELECT * FROM foo LEFT JOIN bar ON foo.a = bar.c WHERE COALESCE(bar.c, 1) = 42; QUERY PLAN

  • Hash Left Join

Hash Cond: (foo.a = bar.c) Filter: (COALESCE(bar.c, 1) = 42)

  • > Seq Scan on foo
  • > Hash
  • > Seq Scan on bar

(6 rows)

slide-28
SLIDE 28

EquivalenceClasses

  • For mergejoinable equality clauses A = B that are not outerjoin-delayed, we

use EquivalenceClasses to record this knowledge

  • An EquivalenceClass represents a set of values that are known all transitively

equal to each other

  • Equivalence clauses are removed from the standard qual distribution process.

Instead, eclass-based qual clauses are generated dynamically when needed

  • EquivalenceClasses also represent the value that a PathKey orders by (since

if x = y, then ORDER BY x must be the same as ORDER BY y)

slide-29
SLIDE 29

Gather information about join ordering restrictions

  • One-sided outer joins constrain the order of joining partially but not completely

○ non-FULL joins can be freely associated into the lefthand side of an OJ, but in some cases they can't be associated into the righthand side

(A leftjoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) leftjoin B on (Pab) (A leftjoin B on (Pab)) innerjoin C on (Pbc) != A leftjoin (B innerjoin C on (Pbc)) on (Pab)

slide-30
SLIDE 30

Gather information about join ordering restrictions

  • One-sided outer joins constrain the order of joining partially but not completely
  • We flatten non-FULL joins to top-level "joinlist" so that they participate fully in

the join order search

  • We record information about each outer join, in order to avoid generating

illegal join orders

slide-31
SLIDE 31

Remove useless joins

  • A left join can be removed if:

○ innerrel is a single baserel ○ innerrel attributes are not used above the join ○ the join condition cannot match more than one inner-side row

SELECT foo.a FROM foo LEFT JOIN (SELECT DISTINCT c AS c FROM bar) sub ON foo.a = sub.c; => SELECT foo.a FROM foo;

slide-32
SLIDE 32

Scan/join planning

  • Basically deals with the FROM and WHERE parts of the query
  • Knows about ORDER BY too

○ mainly so that it can design merge-join plans ○ but also to avoid final sort if possible

# EXPLAIN (COSTS OFF) SELECT * FROM foo JOIN bar ON foo.a = bar.c AND foo.b = bar.d ORDER BY b, a; QUERY PLAN

  • Merge Join

Merge Cond: ((foo.b = bar.d) AND (foo.a = bar.c))

  • > Sort

Sort Key: foo.b, foo.a

  • > Seq Scan on foo
  • > Sort

Sort Key: bar.d, bar.c

  • > Seq Scan on bar

(8 rows)

slide-33
SLIDE 33

Scan/join planning

  • Basically deals with the FROM and WHERE parts of the query
  • Knows about ORDER BY too

○ mainly so that it can design merge-join plans ○ but also to avoid final sort if possible

  • Cost estimate driven
slide-34
SLIDE 34

Scan/join planning

  • Identify feasible scan methods for base relations, estimate their costs and

result sizes

  • Search the join-order space, using dynamic programming or heuristic “GEQO”

method, to identify feasible plans for join relations

  • Honor outer-join ordering constraints
  • Produce one or more “Path” data structures
slide-35
SLIDE 35

Join searching

  • Multi-way joins have to be built up from pairwise joins, because that’s all the

executor knows how to do

  • For any given pairwise join step, we can identify the best input Paths and join

methods via straightforward cost comparisons, resulting in a list of Paths much as for a base relation

  • Finding the best ordering of pairwise joins is the hard part
slide-36
SLIDE 36

Join searching

  • We usually have many choices of join order for a multi-way join query, and

some orders will be cheaper than others

  • If the query contains only plain inner joins, we can join the base relations in

any order

  • Outer joins can be re-ordered in some but not all cases; we handle that by

checking whether each proposed join step is legal

slide-37
SLIDE 37

Standard join search method

  • Generate paths for each base relation
  • Generate paths for each possible two-way join
  • Generate paths for each possible three-way join
  • Generate paths for each possible four-way join
  • Continue until all base relations are joined into a single join relation; then use

that relation’s best path

slide-38
SLIDE 38

A B C SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

slide-39
SLIDE 39

A B C A⋈B

HASH_JOIN(A,B) MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

slide-40
SLIDE 40

A B C A⋈B

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

slide-41
SLIDE 41

A B C A⋈B A⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

slide-42
SLIDE 42

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

HASH_JOIN(B,C) MERGE_JOIN(B,C)

slide-43
SLIDE 43

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

HASH_JOIN(B,C)

slide-44
SLIDE 44

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

HASH_JOIN(B,C)

A⋈B⋈C

HASH_JOIN(A⋈B,C) MERGE_JOIN(A⋈B,C)

slide-45
SLIDE 45

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

HASH_JOIN(B,C)

A⋈B⋈C

MERGE_JOIN(A⋈B,C)

slide-46
SLIDE 46

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i;

HASH_JOIN(B,C)

A⋈B⋈C

MERGE_JOIN(A⋈B,C) HASH_JOIN(B⋈C,A) MERGE_JOIN(B⋈C,A)

slide-47
SLIDE 47

A B C A⋈B A⋈C B⋈C

MERGE_JOIN(A,B)

SELECT * FROM A JOIN (B JOIN C ON B.j = C.j) ON A.i = B.i; A⋈B⋈C

MERGE_JOIN(A⋈B,C)

slide-48
SLIDE 48

Join searching is expensive

  • An n-way join problem can potentially be implemented in n! (n factorial)

different join orders

  • It is not feasible to consider all possibilities
  • We use a few heuristics, like not considering clause-less joins
  • With too many relations (12 by default), fall back to “GEQO” (genetic query
  • ptimizer) search
slide-49
SLIDE 49

Heuristics used in join search

  • Don't join relations that are not connected by any join clause, unless forced to

by join-order restrictions

  • Break down large join problems into sub-problems by not flattening JOIN

clauses according to collapse limit

SELECT * FROM (SELECT * FROM T1, T2, T3, T4) SUB1 JOIN (SELECT * FROM T5, T6, T7, T8) SUB2 ON TRUE JOIN (SELECT * FROM T9, T10) SUB3 ON TRUE; SET join_collapse_limit TO 4;

SUB1 SUB2 T9 T10

T1 T2 T3 T4 T5 T6 T7 T8

slide-50
SLIDE 50

Post scan/join planning

  • Deal with GROUP BY, aggregation, window functions, DISTINCT
  • Deal with UNION/INTERSECT/EXCEPT
  • Apply final sort if needed for ORDER BY
  • Produce one or more “Path” data structures for each step
  • Add LockRows, Limit, ModifyTable steps to each surviving Path
slide-51
SLIDE 51

Postprocessing

  • Expand best Path to Plan
  • Adjust some representational details of Plan

○ Flatten subquery rangetables into a single list ○ Label Vars in upper plan nodes as OUTER_VAR or INNER_VAR, to refer to the outputs of their subplans ○ Remove unnecessary SubqueryScan, Append, and MergeAppend plan nodes ○ etc.

slide-52
SLIDE 52

Thank You

Output: Thank You Index Scan using common_phrases_idx on common_phrases Index Cond: (value = 'Thank You'::text) Filter: (language = 'English'::text)