Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A - - PowerPoint PPT Presentation

evaluation of join operations
SMART_READER_LITE
LIVE PREVIEW

Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A - - PowerPoint PPT Presentation

Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A (Joins) 340151 Big Data & Cloud Computing (P. Baumann) 1 Relational Operations: Join Definition Natural join: R S := L ( C ( R S ) ) Where C:


slide-1
SLIDE 1

1 340151 Big Data & Cloud Computing (P. Baumann)

Evaluation of Join Operations

Ramakrishnan/Gehrke Chapter 14, Part A (Joins)

slide-2
SLIDE 2

2 340151 Big Data & Cloud Computing (P. Baumann)

Relational Operations: Join Definition

  • Natural join:
  • R  S := L (

C( R S ) )

  • Where
  • C: condition that equates all pairs of attributes of R and S that have the same name
  • Ex: R has x, S has x

"R.x=S.x and…" in C

  • L: list of all attributes of R and S, except equate duplicates
  • Ex: C contains "R.x=S.x"
  • nly one x chosen for L
  • Example: R(a,b) and S(b,c) relations
  • Then, R  S = …
  • Hence: join is shorthand (but more efficient to compute in 1 step)
slide-3
SLIDE 3

3 340151 Big Data & Cloud Computing (P. Baumann)

Relational Operations: More Joins

  • Theta join
  • Let R(a,b) and S(b,c) be relations
  • R  S :=

C( R S )

  • Why "theta"?
  • Historically: R  S where

{ =, , >, , <, }

  • Today: can be any condition
  • Special case: C = "R.x = S.y"

equijoin

  • No projection!

C

x y

slide-4
SLIDE 4

4 340151 Big Data & Cloud Computing (P. Baumann)

Schema for Examples

  • Similar to old schema; rname added for variations
  • Reserves:

tuple 40 bytes, 100 tuples per page, 1000 pages

  • Sailors:

tuple 50 bytes, 80 tuples per page, 500 pages Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)

slide-5
SLIDE 5

5 340151 Big Data & Cloud Computing (P. Baumann)

Equality Joins With One Join Column

  • R S large

R S followed by selection inefficient

  • Assume:

M #pages of R, pR tuples per page, N #pages of S, pS tuples per page

  • Cost metric: # of I/Os
  • will ignore output costs, disk access patterns

SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid

slide-6
SLIDE 6

6 340151 Big Data & Cloud Computing (P. Baumann)

Simple Nested Loops Join

  • For each tuple in outer relation R, scan entire inner relation S
  • Cost: M + pR * M * N = 1,000 + 100*1000*500 I/Os = 50,001,000 I/Os
  • Page-oriented Nested Loops join (Block-Nested Loop Join) :

For each page of R: get each page of S, write out matching pairs of tuples <r,s>, where r in R-page and s in S-page

  • Cost: M + M*N = 1000 + 1000*500 = 501,000 
  • If smaller relation (S) is outer, cost = 500 + 500*1000 = 500,500

foreach tuple r in R do foreach tuple s in S do if ri == sj then add <r,s> to result

  • Reserves:

M = 1000; pR = 100

  • Sailors:

N = 500; pS = 80

slide-7
SLIDE 7

7 340151 Big Data & Cloud Computing (P. Baumann)

Index Nested Loops Join

  • Assume index on join column of one relation (say S)

can make it inner and exploit index

  • Cost: M + ( M*pR * cost of finding matching S tuples )
  • For each R tuple, cost of probing S index ~1.2 for hash index
  • For each R tuple, cost of probing S index 2…4 for B+ tree
  • Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on

clustering:

  • Clustered index:

1 I/O for all tuples (typical), unclustered: up to 1 I/O per matching S tuple

slide-8
SLIDE 8

8 340151 Big Data & Cloud Computing (P. Baumann)

  • Hash-index on sid of Sailors (as inner):
  • Scan Reserves: 1000 page I/Os, 100*1000 tuples
  • For each Reserves tuple:

1.2 I/Os to get data entry in index + 1 I/O to get (the exactly one) matching Sailors tuple

  • Total: 220,000 I/Os
  • Hash-index on sid of Reserves (as inner):
  • Scan Sailors: 500 page I/Os, 80*500 tuples
  • For each Sailors tuple:

1.2 I/Os to find index page with data entries + cost of retrieving matching Reserves tuples (*)

  • (*) Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000)

Cost is 1 or 2.5 I/Os, depending on whether index is clustered

  • Total: …4,000 * (1.2 + 2.5) = 148,000 I/Os

Examples of Index Nested Loops

SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid

slide-9
SLIDE 9

9 340151 Big Data & Cloud Computing (P. Baumann)

Block Nested Loops Join

  • 1 page as input buffer for scanning inner S
  • 1 page as output buffer
  • all remaining pages hold "block" of outer R

. . .

. . .

R & S

Hash table for block of R (k < B-1 pages) Input buffer for S Output buffer

. . .

Join Result

Earlier: block = page Now: block = sequence of pages

slide-10
SLIDE 10

10 340151 Big Data & Cloud Computing (P. Baumann)

  • Cost: Scan of outer + #outer blocks * scan of inner
  • #outer blocks =

Examples of Block Nested Loops

# /

  • f pages of outer

blocksize

  • With Reserves (R) as outer, and 100 pages of R per block:
  • Cost of scanning R is 1000 I/Os; total of 10 blocks
  • Per block of R, scan Sailors (S); 10*500 I/Os
  • With 100-page block of Sailors as outer:
  • Cost of scanning S is 500 I/Os; total of 5 blocks
  • Per block of S, scan Reserves; 5*1000 I/Os
slide-11
SLIDE 11

11 340151 Big Data & Cloud Computing (P. Baumann)

Sort-Merge Join

  • Approach:
  • Sort R & S on join column
  • scan them to do a ``merge’’ (on join column)
  • output result tuples
  • Efficiency:
  • R scanned once; each S group scanned once per matching R tuple
  • Multiple scans of an S group likely to find needed pages in buffer
  • Cost: M log M + N log N + (M+N)
  • Ex: with 35, 100 or 300 buffer pages, Reserves & Sailors sorted in 2 passes
  • total join cost: 7500
  • In practice, cost of sort-merge join linear

(BNL cost: 2,500 to 15,000 I/Os)

slide-12
SLIDE 12

12 340151 Big Data & Cloud Computing (P. Baumann)

Refinement of Sort-Merge Join

  • combine merging phases in sorting of R & S with merging required for join
  • With B > , where L is size of larger relation,

using sorting refinement that produces runs of length 2B in Pass 0: #runs of each relation is < B/2

  • Allocate 1 page per run of each relation, `merge’ while checking join condition
  • Cost:

read+write each relation in Pass 0 + read each relation in (only) merging pass (+ writing of result tuples)

  • In example: cost goes down from 7,500 to 4,500 I/Os
  • In practice, cost of sort-merge join is linear
  • like cost of external sorting

L