Coping with the Memory Hierarchy the Cache-Oblivious Way

SLIDE 1

Coping with the Memory Hierarchy the Cache-Oblivious Way

Rolf Fagerberg University of Aarhus

Imada, SDU, February 18, 2004

SLIDE 2

Overview

  • The memory hierarchy
  • The I/O-model
  • The cache-oblivious model
  • Examples of cache-oblivious algorithms
      • Double for-loop (with applications)
      • Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

Fagerberg: The Cache-Oblivious Way

SLIDE 5

The Memory Hierarchy

Modern computers: CPU registers, Cache 1, Cache 2, Cache 3, RAM, disk, and tertiary storage, in order of increasing access time.

            Access time         Volume
Registers   1 cycle             1 KB
Cache       10 cycles           512 KB
RAM         100 cycles          512 MB
Disk        20,000,000 cycles   80 GB

The gap increases over time.

Real problems of gigabyte, terabyte, and even petabyte size: databases (finance, phone companies, banks, weather, geology, geography, astronomy), WWW, GIS systems, computer graphics.

SLIDE 7

Classic RAM Model

The RAM model: a CPU attached to a single flat memory (RAM), with unit cost for every operation.

  • Add: O(1)
  • Branch: O(1)
  • Mem access: O(1)

Increasingly inadequate.

SLIDE 8

Overview

  √ The memory hierarchy
  • The I/O-model
  • The cache-oblivious model
  • Examples of cache-oblivious algorithms
      • Double for-loop (with applications)
      • Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 9

I/O Model

Model two layers: a CPU with an internal memory of size M, connected to an external memory by I/Os that each move one block of size B.

N = problem size
M = memory size
B = I/O block size

Aggarwal and Vitter 1988

Cost: number of I/Os.
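The cost measure above can be tried out with a small simulator. The following is a minimal sketch, not part of the original slides: the class name `IOModel`, the function `scan_cost`, and the LRU replacement policy are all assumptions made for illustration (the I/O model itself leaves block replacement to the algorithm).

```python
from collections import OrderedDict


class IOModel:
    """Two-level memory: room for M elements, transfers in blocks of B elements."""

    def __init__(self, M, B):
        self.B = B
        self.capacity = M // B          # number of blocks that fit in memory
        self.blocks = OrderedDict()     # resident block ids, least recent first
        self.ios = 0                    # cost measure: number of block transfers

    def access(self, address):
        block = address // self.B
        if block in self.blocks:
            self.blocks.move_to_end(block)      # hit: refresh the LRU position
            return
        self.ios += 1                           # miss: one I/O
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        self.blocks[block] = True


def scan_cost(N, M, B):
    """I/Os needed to read addresses 0..N-1 in order."""
    mem = IOModel(M, B)
    for address in range(N):
        mem.access(address)
    return mem.ios
```

A sequential scan of N elements touches each block once, so `scan_cost(N, M, B)` comes out as N/B rounded up, independent of M.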

SLIDE 11

Example

            CPU time   In-place   Worst-case   I/O
Heapsort    N log N    √          √            N log N
Quicksort   N log N    √                       (N log N)/B
Mergesort   N log N               √            (N log N)/B

Random memory access ⇒ page fault at every access.
Sequential memory access ⇒ page fault every B accesses.
Typically, B ∼ 10^3.

SLIDE 12

I/O-Optimal Sorting

Binary Mergesort: $\frac{N}{B} \log_2 N$ I/Os

Multi-Way Merging: maximal merge degree ≈ M/B

Multi-Way Mergesort: $\frac{N}{B} \log_{M/B} \frac{N}{M}$ I/Os
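To get a feeling for the gap between the two bounds on this slide, they can be evaluated for concrete parameters. This is only an illustrative sketch: the function names are hypothetical, constant factors are ignored, and the sample values of N, M and B are assumptions.

```python
import math


def binary_mergesort_ios(N, B):
    # (N/B) * log2(N): the bound stated for binary mergesort.
    return (N / B) * math.log2(N)


def multiway_mergesort_ios(N, M, B):
    # (N/B) * log_{M/B}(N/M): the bound stated for M/B-way mergesort.
    return (N / B) * math.log(N / M, M / B)


# Assumed sample machine: N = 2^30 elements, M = 2^20, B = 2^10.
N, M, B = 2**30, 2**20, 2**10
ratio = binary_mergesort_ios(N, B) / multiway_mergesort_ios(N, M, B)
print(f"binary / multi-way I/O ratio: {ratio:.1f}")
```

With these sample values the multi-way bound needs a single merge pass over the N/B blocks, while the binary bound pays for log2 N passes.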

SLIDE 13

I/O Model Facts

  • Scanning: $\Theta(N/B)$ I/Os.
  • Searching: $\Theta(\log_B N)$ I/Os by B-trees.
  • Sorting: $\Theta\left(\frac{N}{B} \log_{M/B} \frac{N}{M}\right)$ I/Os by $\frac{M}{B}$-way merge-sort.
  • Permuting: $\Theta\left(\min\left\{N, \frac{N}{B} \log_{M/B} \frac{N}{M}\right\}\right)$ by direct move or sorting.

1988-2004: Many algorithms and data structures for problems from computational geometry, graphs, strings, . . .

SLIDE 14

Overview

  √ The memory hierarchy
  √ The I/O-model
  • The cache-oblivious model
  • Examples of cache-oblivious algorithms
      • Double for-loop (with applications)
      • Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 15

Computer Models

Reality: CPU, L1 cache, L2 cache, RAM, disk (increasing access time).

Models:

  • RAM model: CPU and RAM
  • I/O model: CPU, memory of size M, I/Os in blocks of size B
  • Multi-level models
  • New model: cache-obliviousness

SLIDE 17

Cache-Oblivious Model

  • Program in the RAM model
  • Analyze in the I/O model for arbitrary B and M
  • Optimal off-line cache replacement strategy

Frigo, Leiserson, Prokop, Ramachandran, FOCS’99

Advantages:

  • Optimal on arbitrary level ⇒ optimal on all levels
  • Portability
  • Simplicity of model.

SLIDE 19

Cache-Oblivious Results

Scanning ⇒ stack, queue, selection, . . .

  • Matrix multiplication, FFT: FOCS’99
  • Sorting: FOCS’99, ICALP’02, ALENEX’04
  • Search trees: Prokop 99, FOCS’00, WAE’01, SODA’02 × 2, ESA’02, FOCS’03
  • Priority queues: STOC’02, ISAAC’02
  • Graph algorithms: STOC’02, BRICS-04-2
  • Computational geometry: 2 × ICALP’02, SCG’03
  • Scanning dynamic sets: ESA’02
  • Power of cache-obliviousness: STOC’03

SLIDE 21

Overview

  √ The memory hierarchy
  √ The I/O-model
  √ The cache-oblivious model
  • Examples of cache-oblivious algorithms
      • Double for-loop (with applications)
      • Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 22

Double for-loop

X, Y arrays of length n, indexed by i and j:

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      /* operation on X[i] and Y[j] */

I/O complexity: $n \times \frac{n}{B} = \frac{n^2}{B}$

SLIDE 23

Double for-loop

More efficient version in the I/O-model (X and Y handled in pieces of size M):

I/O complexity: $\frac{n}{M} \times \frac{n}{M} \times \frac{M}{B} = \frac{n^2}{MB}$

  for (i = 0; i < n; i = i + M)
    for (j = 0; j < n; j++)
      for (k = i; k < i + M; k++)
        /* operation on X[k] and Y[j] */
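The effect of blocking can be checked by counting block transfers for the two loop orders. The sketch below is an illustration under assumptions that are not in the slides: LRU replacement, X stored at addresses 0..n-1 and Y at n..2n-1, a chunk of M/2 elements of X (so that the chunk and the current block of Y fit in memory together), and the hypothetical names `count_ios`, `plain_trace` and `blocked_trace`.

```python
from collections import OrderedDict


def count_ios(accesses, M, B):
    """Count block transfers for an address trace under LRU with M // B block slots."""
    cache, ios = OrderedDict(), 0
    for address in accesses:
        block = address // B
        if block in cache:
            cache.move_to_end(block)        # hit
        else:
            ios += 1                        # miss: one I/O
            if len(cache) >= M // B:
                cache.popitem(last=False)   # evict least recently used block
            cache[block] = True
    return ios


def plain_trace(n):
    # Plain double loop: every pair (i, j) touches X[i] and Y[j].
    for i in range(n):
        for j in range(n):
            yield i          # address of X[i]
            yield n + j      # address of Y[j]


def blocked_trace(n, chunk):
    # Blocked loop: one scan of Y per chunk of X.
    for i in range(0, n, chunk):
        for j in range(n):
            for k in range(i, min(i + chunk, n)):
                yield k
                yield n + j


n, M, B = 256, 64, 8
print("plain:  ", count_ios(plain_trace(n), M, B))
print("blocked:", count_ios(blocked_trace(n, M // 2), M, B))
```

Both traces visit the same n × n pairs; only the order differs. In the plain order Y is rescanned for every element of X, in the blocked order once per chunk.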

SLIDE 24

Double for-loop

Cache-oblivious version: split the n × n square of index pairs over X and Y into four quadrants of size n/2 × n/2, and recurse on each.

I/O complexity: again $\frac{n^2}{MB}$

SLIDE 25

Double for-loop

Cache-oblivious version: a recursive procedure taking a position (i, j) and a block size s. If s = 1, it applies the operation to X[i] and Y[j]. Otherwise it makes four recursive calls with block size s/2, at positions (i, j), (i, j + s/2), (i + s/2, j), and (i + s/2, j + s/2).
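The recursive double loop on this slide can be sketched in Python as follows. The function name `all_pairs`, the callback `visit`, and the restriction to arrays whose common length is a power of two are assumptions made for this illustration.

```python
def all_pairs(X, Y, visit, i=0, j=0, size=None):
    """Apply visit(X[a], Y[b]) to every index pair by recursing on quadrants."""
    if size is None:
        size = len(X)          # assumes len(X) == len(Y) and a power of two
    if size == 1:
        visit(X[i], Y[j])
        return
    half = size // 2
    all_pairs(X, Y, visit, i, j, half)
    all_pairs(X, Y, visit, i, j + half, half)
    all_pairs(X, Y, visit, i + half, j, half)
    all_pairs(X, Y, visit, i + half, j + half, half)


pairs = []
all_pairs([1, 2], [10, 20], lambda x, y: pairs.append((x, y)))
```

No block size or memory size appears in the code: at some depth of the recursion a quadrant fits in memory, whatever M happens to be, which is the idea behind the cache-oblivious analysis.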

SLIDE 26

Experiments

Plot: time in seconds (1 to 10000) against log2 of array size in bytes (15 to 21). Legend: plain, cache-aware (L1), cache-aware (L2), cache-oblivious.

Sizes within RAM (element size 4 bytes)

366 MHz Pentium II, 128 MB RAM, 256 KB Cache, gcc -O3, Linux

SLIDE 27

Experiments

Plot: time in seconds (0.1 to 1000) against log2 of array size in bytes (19 to 27). Legend: plain, cache-aware (L2), cache-aware (RAM), cache-oblivious.

Sizes exceeding RAM (element size 1 KB)

366 MHz Pentium II, 128 MB RAM, 256 KB Cache, gcc -O3, Linux

SLIDE 28

For-loop Applications

  • Join in databases
  • Dynamic programming (bioinformatics)
  • Matrix multiplication (scientific computing)

SLIDE 29

Overview

  √ The memory hierarchy
  √ The I/O-model
  √ The cache-oblivious model
  • Examples of cache-oblivious algorithms
      √ Double for-loop (with applications)
      • Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 32

Static Cache-Oblivious Trees

Recursive memory layout (van Emde Boas layout)

Prokop 1999

Figure: a binary tree of height h is cut at half height into a top tree A and bottom trees B1, . . . , Bk, of heights ⌈h/2⌉ and ⌊h/2⌋. In memory, A is stored first, followed by B1, . . . , Bk, each laid out recursively in the same way.

Binary tree. Searches use $O(\log_B N)$ I/Os.

Dynamization?

  • Bender, Demaine, Farach-Colton, FOCS’00
  • Rahman, Cole, Raman, WAE’01
  • Bender, Duan, Iacono, Wu, SODA’02
  • Brodal, Fagerberg, Jacob, SODA’02
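The recursive layout can be made concrete for a complete binary tree whose nodes are named by BFS index (root 1, children 2i and 2i+1). The sketch below is an illustration only: the function names are hypothetical, and giving the top tree the rounded-down half of the height is an assumed convention. It also counts how many blocks of B consecutive array slots a root-to-leaf search touches.

```python
def veb_order(height, root=1):
    """BFS indices of a complete binary tree, listed in van Emde Boas order."""
    if height == 1:
        return [root]
    top = height // 2                  # assumed convention for the split
    bottom = height - top
    order = veb_order(top, root)       # top tree first
    first = root << top                # leftmost root of a bottom tree
    for r in range(first, first + (1 << top)):
        order += veb_order(bottom, r)  # then each bottom tree, left to right
    return order


def blocks_on_path(position, leaf, B):
    """Number of distinct size-B blocks holding the nodes from leaf up to the root."""
    blocks = set()
    node = leaf
    while node >= 1:
        blocks.add(position[node] // B)
        node //= 2
    return len(blocks)


height, B = 16, 16
order = veb_order(height)
veb_pos = {node: p for p, node in enumerate(order)}
bfs_pos = {node: node - 1 for node in order}
leaf = 2 ** (height - 1)               # leftmost leaf
print("vEB blocks:", blocks_on_path(veb_pos, leaf, B))
print("BFS blocks:", blocks_on_path(bfs_pos, leaf, B))
```

In the BFS layout every level below the first few lands in its own block, whereas the van Emde Boas layout keeps each small subtree in consecutive slots, so a search path crosses far fewer blocks.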

SLIDE 33

Binary Trees of Height log2(n) + O(1)

Figure: a tree with root 6 holding the keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13, shown before and after inserting the new key 2 and rebuilding a subtree.

  • If an insertion makes the tree too high, rebuild the subtree at the nearest ancestor with sufficiently few descendants
  • Insertions require amortized time $O(\log^2 N)$

Itai, Konheim, Rodeh, 1981
Andersson, Lai, 1990

SLIDE 34

Simple Dynamic Cache-Oblivious Trees

  • Embed a dynamic tree of height $\log_2 n + O(1)$ into a complete tree
  • Static van Emde Boas layout of the complete tree in an array.

Figure: example tree with root 6 holding the keys 1, 3, 4, 5, 6, 7, 8, 10, 11, 13.

SLIDE 35

Example

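Tree: 6 4 1 3 5 8 7 11 10 13

Array (van Emde Boas layout of the enclosing complete tree, − marks an empty slot):

6 4 8 1 − 3 5 − − 7 − − 11 10 13

  • Search: $O(\log_B N)$
  • Range reporting: $O\left(\log_B N + \frac{k}{B}\right)$
  • Updates: $O\left(\log_B N + \frac{\log^2 N}{B}\right)$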
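The array on this slide can be reproduced by placing the example tree into a complete tree of height 4 and listing that tree in van Emde Boas order. This is a sketch under assumptions: nodes are addressed by BFS index (root 1, children 2i and 2i+1), the shape of the example tree is read off the array above, empty slots are represented by `None`, and the names `veb_order`, `build` and `search` are hypothetical.

```python
def veb_order(height, root=1):
    """BFS indices of a complete binary tree, listed in van Emde Boas order."""
    if height == 1:
        return [root]
    top = height // 2
    bottom = height - top
    order = veb_order(top, root)
    first = root << top
    for r in range(first, first + (1 << top)):
        order += veb_order(bottom, r)
    return order


def build(keys_by_node, height):
    """Return the laid-out array and a map from BFS index to array slot."""
    order = veb_order(height)
    slot = {node: pos for pos, node in enumerate(order)}
    array = [keys_by_node.get(node) for node in order]   # None marks a gap
    return array, slot


def search(array, slot, key):
    """Ordinary binary-tree search, navigating by BFS index."""
    node = 1
    while node in slot and array[slot[node]] is not None:
        here = array[slot[node]]
        if key == here:
            return True
        node = 2 * node + (1 if key > here else 0)
    return False


# Example tree: root 6; 4 has children 1 and 5; 3 is the right child of 1;
# 8 has children 7 and 11; 11 has children 10 and 13.
example = {1: 6, 2: 4, 3: 8, 4: 1, 5: 5, 6: 7, 7: 11, 9: 3, 14: 10, 15: 13}
array, slot = build(example, 4)
```

The search never looks at the layout directly; only the index-to-slot map depends on it, so the same search code works for any other static layout.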

SLIDE 36

Experiments

Brodal, Fagerberg, Jacob, SODA’02

  1. Study search time in static tree layouts.
      • Classic layouts: BFS, DFS, inorder, randomly built trees
      • Cache-aware multi-way trees
      • Cache-oblivious vEB layout
  2. Study pointer-based vs. implicit representation.

SLIDE 37

Different Memory Layouts

Figure: four memory layouts of a tree: vEB, BFS, DFS, inorder.

SLIDE 38

Search in Pointer Based Tree

Plot: average search time in seconds (2e-07 to 6e-06) against log2 of the number of elements stored (12 to 26). Legend: cache, veb:pointer, bfs:pointer, dfs:pointer, rin:pointer.

Pointer based vEB, BFS, DFS, and randomly built.

1 GHz Pentium III, 1GB RAM, 256 KB Cache, gcc -O3 / linux

SLIDE 39

Search in Implicit Tree

Plot: average search time in seconds (2e-07 to 6e-06) against log2 of the number of elements stored (12 to 26). Legend: cache, veb:implicit, bfs:implicit, high008:implicit, high016:implicit, inorder:implicit.

Implicit vEB, BFS, inorder, multi-way.

1 GHz Pentium III, 1GB RAM, 256 KB Cache, gcc -O3 / linux

SLIDE 40

Pointer versus Implicit

Plot: average search time in seconds (2e-07 to 6e-06) against log2 of the number of elements stored (12 to 26). Legend: veb:implicit, veb:pointer, bfs:implicit, bfs:pointer.

static van Emde Boas layout and bfs-layout

1 GHz Pentium III, 1GB RAM, 256 KB Cache, gcc -O3 / linux

SLIDE 41

Beyond Main Memory

Plot: average search time in seconds (1e-06 to 0.1) against log2 of the number of elements stored (20 to 29). Legend: bfs, veb, high1024.

Multiway-tree, BFS, vEB implicit layout

1 GHz Pentium III, 32MB RAM, 256 KB Cache, gcc -O3 / linux

SLIDE 42

Overview

  √ The memory hierarchy
  √ The I/O-model
  √ The cache-oblivious model
  • Examples of cache-oblivious algorithms
      √ Double for-loop (with applications)
      √ Searching
      • Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 43

Funnelsort

  • Divide input into $N^{1/3}$ segments of size $N^{2/3}$
  • Recursively Funnelsort each segment
  • Merge sorted segments by an $N^{1/3}$-merger

Merger degrees k down the recursion: $N^{1/3}, N^{2/9}, N^{4/27}, \ldots, 2$

Frigo, Leiserson, Prokop, Ramachandran, 1999
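The recursion pattern of Funnelsort can be sketched as below. This is not the cache-oblivious algorithm itself: the standard library's `heapq.merge` stands in for the k-merger, and the function name and the base-case cutoff of 8 elements are assumptions made for illustration.

```python
import heapq


def funnelsort_sketch(items):
    """Recursion pattern of Funnelsort; heapq.merge replaces the k-merger."""
    n = len(items)
    if n <= 8:                                     # assumed small base case
        return sorted(items)
    segment = max(1, round(n ** (2 / 3)))          # segment size about N^(2/3)
    runs = [funnelsort_sketch(items[s:s + segment])
            for s in range(0, n, segment)]         # about N^(1/3) sorted runs
    return list(heapq.merge(*runs))                # stand-in for the N^(1/3)-merger
```

The sketch only shows how the problem sizes shrink from N to N^(2/3) per level; the I/O bound of the real algorithm comes from how the k-merger is built and laid out in memory.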

SLIDE 45

k-merger

Figure: a k-merger composed of a top merger Mtop, bottom mergers M1, . . . , M√k, and buffers B1, . . . , B√k between them.

Fill(v):
  while out-buffer not full
    if left in-buffer empty
      Fill(left child)
    if right in-buffer empty
      Fill(right child)
    perform one merge step

Buffer size: $\alpha \cdot (\sqrt{k})^{d}$

Brodal, Fagerberg 2002
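The Fill procedure can be tried on a small binary merge tree. The sketch below makes several simplifying assumptions that are not in the slides: every buffer has the same fixed capacity (instead of the size given above), the tree is built by pairing inputs level by level, an explicit `exhausted` flag handles inputs that have run dry, and all class and function names are hypothetical.

```python
from collections import deque


class Leaf:
    """An input stream: a sorted list exposed as a buffer that is never refilled."""
    def __init__(self, sorted_items):
        self.buffer = deque(sorted_items)
        self.exhausted = True

    def fill(self):
        pass


class Merger:
    """Binary merge node with an output buffer of bounded capacity."""
    def __init__(self, left, right, capacity):
        self.left, self.right = left, right
        self.capacity = capacity
        self.buffer = deque()
        self.exhausted = False

    def fill(self):
        while len(self.buffer) < self.capacity:       # out-buffer not full
            for child in (self.left, self.right):
                if not child.buffer and not child.exhausted:
                    child.fill()                      # refill an empty in-buffer
            l, r = self.left.buffer, self.right.buffer
            if not l and not r:
                self.exhausted = True                 # both inputs drained for good
                return
            # One merge step: move the smaller head element to the out-buffer.
            source = l if l and (not r or l[0] <= r[0]) else r
            self.buffer.append(source.popleft())


def merge_all(sorted_lists, capacity=4):
    if not sorted_lists:
        return []
    nodes = [Leaf(s) for s in sorted_lists]
    while len(nodes) > 1:                             # pair up nodes level by level
        paired = [Merger(a, b, capacity) for a, b in zip(nodes[0::2], nodes[1::2])]
        if len(nodes) % 2:
            paired.append(nodes[-1])
        nodes = paired
    root, out = nodes[0], []
    while root.buffer or not root.exhausted:
        if not root.buffer:
            root.fill()
        out.extend(root.buffer)
        root.buffer.clear()
    return out
```

For example, `merge_all([[1, 4, 9], [2, 3, 10], [5], [0, 6, 7, 8]])` returns the eleven elements in sorted order. Work is done lazily: a node is only filled when the buffer it feeds has run empty.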

SLIDE 46

Experiments

Engineer: try a large number of possibilities for parameter choices, design choices, code optimizations (human vs. compiler+CPU), memory layouts, . . .

Choose a strong field of competitors:

  • 2-way and 4-way Funnelsort
  • Best library Quicksort we can find
  • Recent cache-aware proposals (tuned for RAM or for disk)

Run on a number of different machines: Pentium 4, Pentium III, MIPS 10000, AMD Athlon, Itanium 2.

Brodal, Fagerberg, Vinther, ALENEX’04

SLIDE 47

Results for Inputs in RAM

SLIDE 48

Plot: Uniform pairs, Pentium III. Walltime/(n log n) (2e-08 to 6.5e-08) against log n (12 to 24). Legend: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.

SLIDE 49

Plot: Uniform pairs, AMD Athlon. Walltime/(n log n) (1e-08 to 3e-08) against log n (12 to 24). Legend: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.

SLIDE 50

Plot: Uniform pairs, Pentium 4. Walltime/(n log n) (6e-09 to 1.8e-08) against log n (12 to 26). Legend: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.

SLIDE 51

Results for Inputs on Disk

SLIDE 52

Plot: Uniform pairs, Pentium III. Walltime/(n log n) (1e-07 to 6e-07) against log n (21 to 28). Legend: Funnelsort2, msort-c, msort-m, Rmerge, GCC, TPIE.

SLIDE 53

Plot: Uniform pairs, Pentium 4. Walltime/(n log n) (5e-08 to 4e-07) against log n (21 to 28). Legend: Funnelsort2, msort-c, msort-m, Rmerge, GCC, TPIE.

SLIDE 55

Practical Conclusion

For a number of basic algorithmic problems there exist solutions which

  • Are theoretically I/O-efficient
  • Are simple
  • Are robust: they adapt automatically to the specifics of the memory hierarchy
  • Compete well with explicit (cache-aware) I/O algorithms.

Consider putting cache-obliviousness in your toolbox of algorithmic techniques.

SLIDE 56

Overview

  √ The memory hierarchy
  √ The I/O-model
  √ The cache-oblivious model
  • Examples of cache-oblivious algorithms
      √ Double for-loop (with applications)
      √ Searching
      √ Sorting
  • Theoretical limits of cache-obliviousness

SLIDE 58

Basic Question

In terms of power: Cache-oblivious algorithms = I/O-algorithms?

Or is there a separation: a problem for which no cache-oblivious algorithm can match the best I/O-algorithm?

Frigo, Leiserson, Prokop, Ramachandran, FOCS’99

SLIDE 61

Recent Answers

Brodal, Fagerberg, STOC’03
Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz, FOCS’03

  • Sorting in $\mathrm{Sort}_{B,M}(N)$ I/Os? Not possible without tall cache assumption.
  • Permuting in $\mathrm{Perm}_{B,M}(N)$ I/Os? Not possible even with tall cache assumption.
  • Searching in $c \cdot \log_B \frac{N}{M}$ I/Os? $c \ge \log_2 e \approx 1.443$.

Matching upper bounds for sorting and searching.

SLIDE 63

Cache-Oblivious Sorting

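Previous cache-oblivious sorting results use a Tall Cache Assumption: $B \le M^{1/2}$.

Algorithm                 Assumption               I/O bound
Funnelsort [FLPR99]       $B \le M^{1/2}$          $\mathrm{Sort}_{B,M}(N)$
Lazy Funnelsort [BF02]    $B \le M^{1-\epsilon}$   $\frac{N}{\epsilon B} \log_M \frac{N}{M}$
Binary Mergesort          ($B \le M$)              $\frac{N}{B} \log_2 \frac{N}{M}$

Figure: axis of block sizes B from 1 to M, with marks at $M^{1/2}$ and $M^{1-\epsilon}$.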

SLIDE 65

Result for Sorting

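Lazy Funnelsort (assumption $B \le M^{1-\epsilon}$), I/Os:

  (a) $B = M^{1-\epsilon}$: $\mathrm{Sort}_{B,M}(N)$
  (b) $B \le M^{1/2}$: $\mathrm{Sort}_{B,M}(N) \cdot \frac{1}{\epsilon}$

Binary Mergesort (assumption $B \le M$), I/Os:

  (a) $B = M$: $\mathrm{Sort}_{B,M}(N)$
  (b) $B \le M^{1/2}$: $\mathrm{Sort}_{B,M}(N) \cdot \log M$

Figure: axis of block sizes B from 1 to M, with marks at $M^{1/2}$ and $M^{1-\epsilon}$, and a region labeled "Penalty".

Theorem: This is tight. For any cache-oblivious comparison-based sorting algorithm: (a) ⇒ (b)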

SLIDE 66

The end

SLIDE 67

Proof

One algorithm, two machines ($B_1 \le B_2$):

            Block size   Memory   I/Os
Machine 1   $B_1$        $M$      $t_1$
Machine 2   $B_2$        $M$      $t_2$

Main result:

$$8 t_1 B_1 + 3 t_1 B_1 \log \frac{8 M t_2}{t_1 B_1} \ge N \log \frac{N}{M} - 1.45 N \qquad (*)$$

Theorem 1: Inserting (a) in (∗) leads to (b)

SLIDE 69

Fake Proof

Goal:

$$8 t_1 B_1 + 3 t_1 B_1 \log \frac{8 M t_2}{t_1 B_1} \ge N \log \frac{N}{M} - 1.45 N \qquad (*)$$

Merging sorted lists X and Y takes $|X| \log \frac{|Y|}{|X|}$ comparisons.

In total $t_1 B_1$ elements touched
⇒ $t_1 B_1 / t_2$ elements touched on average per $B_2$-I/O
⇒ effective $B_2$ is $t_1 B_1 / t_2$.

Comparisons gained per $B_2$-I/O:

$$\frac{t_1 B_1}{t_2} \cdot \log \frac{M}{t_1 B_1 / t_2}$$

Hence:

$$t_1 B_1 \cdot \log \frac{M t_2}{t_1 B_1} \ge N \log N - 1.45 N \qquad \Box$$

One problem: Online choice

SLIDE 71

Ideas from Real Proof

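Figure: a tree T over an array A, with nodes for moves A[i] ← A[j], for I/Os of the two machines (I/O1[s, t], . . . and I/O2[s, t], . . .), and for comparisons A[i] ≤ A[j] with their answers.

$$8 t_1 B_1 + 3 t_1 B_1 \log \frac{8 M t_2}{t_1 B_1} \ge \text{height} \ge N \log \frac{N}{M} - 1.45 N \qquad (*)$$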

SLIDE 72

The end
