Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A - - PowerPoint PPT Presentation

▶

Aug 21, 2023 221 likes •365 views

Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A (Joins) 340151 Big Data & Cloud Computing (P. Baumann) 1 Relational Operations: Join Definition Natural join: R S := L ( C ( R S ) ) Where C:

SLIDE 1

1 340151 Big Data & Cloud Computing (P. Baumann)

Evaluation of Join Operations

Ramakrishnan/Gehrke Chapter 14, Part A (Joins)

SLIDE 2

2 340151 Big Data & Cloud Computing (P. Baumann)

Relational Operations: Join Definition

Natural join:
R  S := L (

C( R S ) )

Where
C: condition that equates all pairs of attributes of R and S that have the same name
Ex: R has x, S has x

"R.x=S.x and…" in C

L: list of all attributes of R and S, except equate duplicates
Ex: C contains "R.x=S.x"
nly one x chosen for L
Example: R(a,b) and S(b,c) relations
Then, R  S = …
Hence: join is shorthand (but more efficient to compute in 1 step)

SLIDE 3

3 340151 Big Data & Cloud Computing (P. Baumann)

Relational Operations: More Joins

Theta join
Let R(a,b) and S(b,c) be relations
R  S :=

C( R S )

Why "theta"?
Historically: R  S where

{ =, , >, , <, }

Today: can be any condition
Special case: C = "R.x = S.y"

equijoin

No projection!

x y

SLIDE 4

4 340151 Big Data & Cloud Computing (P. Baumann)

Schema for Examples

Similar to old schema; rname added for variations
Reserves:

tuple 40 bytes, 100 tuples per page, 1000 pages

Sailors:

tuple 50 bytes, 80 tuples per page, 500 pages Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)

SLIDE 5

5 340151 Big Data & Cloud Computing (P. Baumann)

Equality Joins With One Join Column

R S large

R S followed by selection inefficient

Assume:

M #pages of R, pR tuples per page, N #pages of S, pS tuples per page

Cost metric: # of I/Os
will ignore output costs, disk access patterns

SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid

SLIDE 6

6 340151 Big Data & Cloud Computing (P. Baumann)

Simple Nested Loops Join

For each tuple in outer relation R, scan entire inner relation S
Cost: M + pR * M * N = 1,000 + 100*1000*500 I/Os = 50,001,000 I/Os
Page-oriented Nested Loops join (Block-Nested Loop Join) :

For each page of R: get each page of S, write out matching pairs of tuples <r,s>, where r in R-page and s in S-page

Cost: M + M*N = 1000 + 1000*500 = 501,000 
If smaller relation (S) is outer, cost = 500 + 500*1000 = 500,500

foreach tuple r in R do foreach tuple s in S do if ri == sj then add <r,s> to result

Reserves:

M = 1000; pR = 100

Sailors:

N = 500; pS = 80

SLIDE 7

7 340151 Big Data & Cloud Computing (P. Baumann)

Index Nested Loops Join

Assume index on join column of one relation (say S)

can make it inner and exploit index

Cost: M + ( M*pR * cost of finding matching S tuples )
For each R tuple, cost of probing S index ~1.2 for hash index
For each R tuple, cost of probing S index 2…4 for B+ tree
Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on

clustering:

Clustered index:

1 I/O for all tuples (typical), unclustered: up to 1 I/O per matching S tuple

SLIDE 8

8 340151 Big Data & Cloud Computing (P. Baumann)

Hash-index on sid of Sailors (as inner):
Scan Reserves: 1000 page I/Os, 100*1000 tuples
For each Reserves tuple:

1.2 I/Os to get data entry in index + 1 I/O to get (the exactly one) matching Sailors tuple

Total: 220,000 I/Os
Hash-index on sid of Reserves (as inner):
Scan Sailors: 500 page I/Os, 80*500 tuples
For each Sailors tuple:

1.2 I/Os to find index page with data entries + cost of retrieving matching Reserves tuples (*)

(*) Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000)

Cost is 1 or 2.5 I/Os, depending on whether index is clustered

Total: …4,000 * (1.2 + 2.5) = 148,000 I/Os

Examples of Index Nested Loops

SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid

SLIDE 9

9 340151 Big Data & Cloud Computing (P. Baumann)

Block Nested Loops Join

1 page as input buffer for scanning inner S
1 page as output buffer
all remaining pages hold "block" of outer R

. . .

R & S

Hash table for block of R (k < B-1 pages) Input buffer for S Output buffer

. . .

Join Result

Earlier: block = page Now: block = sequence of pages

SLIDE 10

10 340151 Big Data & Cloud Computing (P. Baumann)

Cost: Scan of outer + #outer blocks * scan of inner
#outer blocks =

Examples of Block Nested Loops

# /

f pages of outer

blocksize

With Reserves (R) as outer, and 100 pages of R per block:
Cost of scanning R is 1000 I/Os; total of 10 blocks
Per block of R, scan Sailors (S); 10*500 I/Os
With 100-page block of Sailors as outer:
Cost of scanning S is 500 I/Os; total of 5 blocks
Per block of S, scan Reserves; 5*1000 I/Os

SLIDE 11

11 340151 Big Data & Cloud Computing (P. Baumann)

Sort-Merge Join

Approach:
Sort R & S on join column
scan them to do a ``merge’’ (on join column)
output result tuples
Efficiency:
R scanned once; each S group scanned once per matching R tuple
Multiple scans of an S group likely to find needed pages in buffer
Cost: M log M + N log N + (M+N)
Ex: with 35, 100 or 300 buffer pages, Reserves & Sailors sorted in 2 passes
total join cost: 7500
In practice, cost of sort-merge join linear

(BNL cost: 2,500 to 15,000 I/Os)

SLIDE 12

12 340151 Big Data & Cloud Computing (P. Baumann)

Refinement of Sort-Merge Join

combine merging phases in sorting of R & S with merging required for join
With B > , where L is size of larger relation,

using sorting refinement that produces runs of length 2B in Pass 0: #runs of each relation is < B/2

Allocate 1 page per run of each relation, `merge’ while checking join condition
Cost:

read+write each relation in Pass 0 + read each relation in (only) merging pass (+ writing of result tuples)

In example: cost goes down from 7,500 to 4,500 I/Os
In practice, cost of sort-merge join is linear
like cost of external sorting