Cache Oblivious Sorting, Gerth Stølting Brodal, University of Aarhus



slide-1
SLIDE 1

Cache Oblivious Sorting

Gerth Stølting Brodal

University of Aarhus

Algorithms and Data Structures, Bertinoro, Forlì, Italy, June 22-28, 2003


slide-2
SLIDE 2

– Foundation


slide-3
SLIDE 3

Outline of Talk

  • Cache oblivious model
  • Sorting problem
  • Binary and multiway merge-sort
  • Funnel-sort
  • Lower bound — tall cache assumption
  • Experimental results
  • Conclusions

Gerth S. Brodal: Cache Oblivious Sorting



slide-5
SLIDE 5

Cache Oblivious Model

Frigo, Leiserson, Prokop, Ramachandran, FOCS’99

  • Program in the RAM model
  • Analyze in the I/O model for arbitrary B and M

[Figure: the I/O model. A CPU works on a cache of size M; data moves between cache and memory in blocks of size B, and each block transfer is one I/O.]

Advantages:

  • Optimal on arbitrary level ⇒ optimal on all levels
  • Portability

[Figure: memory hierarchy CPU, L1, L2, RAM, Disk, with increasing access time and space.]
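To make "analyze in the I/O model for arbitrary B and M" concrete, the sketch below counts block transfers for an access sequence. It is only an illustration: the function name `count_ios` is hypothetical, and LRU replacement is used as a stand-in assumption for the replacement policy of the model. The same code (a plain scan) is analyzed for several (M, B) pairs without knowing them.

```python
from collections import OrderedDict

def count_ios(addresses, M, B):
    """Count block transfers (I/Os) for an access sequence.

    Assumptions of this sketch: the cache holds M elements in blocks of
    B elements, and uses LRU replacement.
    """
    cache = OrderedDict()          # block id -> None, kept in LRU order
    capacity = M // B              # number of blocks that fit in cache
    ios = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one block transfer
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return ios

if __name__ == "__main__":
    N = 1024
    # The scan itself never mentions M or B; only the analysis does.
    for M, B in [(64, 8), (256, 16), (1024, 64)]:
        print(M, B, count_ios(range(N), M, B))   # N/B I/Os: 128, 64, 16
```

A scan of N consecutive elements costs N/B I/Os for every choice of B, which is the sense in which an algorithm can be good on all levels of the hierarchy at once.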

slide-6
SLIDE 6

Sorting Problem

  • Input: array containing $x_1, \ldots, x_N$
  • Output: array with $x_1, \ldots, x_N$ in sorted order
  • Elements can be compared and copied

Example input: 3 4 8 2 8 4 4 4 6

Example output: 2 3 4 4 4 4 6 8 8


slide-8
SLIDE 8

Binary Merge-Sort

[Figure: binary merge-sort of the example array; sorted runs are merged pairwise, level by level, from Input to Output.]

  • Recursive; two arrays; size O(M) internally in cache
  • $O(N \log N)$ comparisons
  • $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ I/Os
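The recursive, two-array scheme can be sketched as follows. This is a minimal illustration, not the code used in the talk: the names `merge_sort` and `_sort` are hypothetical, and the second array `tmp` plays the role of the merge target.

```python
def merge_sort(a):
    """Recursive binary merge-sort using two arrays (a working copy and tmp)."""
    work = list(a)
    tmp = list(a)          # second array, used as the merge target
    _sort(work, tmp, 0, len(work))
    return work

def _sort(a, tmp, lo, hi):
    """Sort a[lo:hi] in place, using tmp[lo:hi] as scratch space."""
    if hi - lo <= 1:
        return
    mid = (lo + hi) // 2
    _sort(a, tmp, lo, mid)
    _sort(a, tmp, mid, hi)
    # Merge the sorted runs a[lo:mid] and a[mid:hi] into tmp[lo:hi].
    i, j = lo, mid
    for k in range(lo, hi):
        if j >= hi or (i < mid and a[i] <= a[j]):
            tmp[k] = a[i]
            i += 1
        else:
            tmp[k] = a[j]
            j += 1
    a[lo:hi] = tmp[lo:hi]  # copy the merged run back

if __name__ == "__main__":
    print(merge_sort([3, 4, 8, 2, 8, 4, 4, 4, 6]))
```

Each merge is a pair of sequential scans, so once a subproblem fits in cache (size O(M)) it causes no further I/Os; the remaining $\log_2\frac{N}{M}$ levels each cost about N/B I/Os.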

slide-9
SLIDE 9

Merge-Sort

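| Algorithm | Degree | I/Os | Reference |
|---|---|---|---|
| Merge-Sort | 2 | $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ | |
| Merge-Sort | $d$ (with $d \le \frac{M}{B} - 1$) | $O\left(\frac{N}{B}\log_d\frac{N}{M}\right)$ | |
| Merge-Sort | $\Theta\left(\frac{M}{B}\right)$ | $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right) = O(\mathrm{Sort}_{M,B}(N))$ | Aggarwal and Vitter 1988 |
| Funnel-Sort | 2 | $O\left(\frac{1}{\varepsilon}\mathrm{Sort}_{M,B}(N)\right)$, assuming $M \ge B^{1+\varepsilon}$ | Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002 |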
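To get a feel for how the merge degree affects the bound, the sketch below evaluates $\frac{N}{B}\log_d\frac{N}{M}$ for a binary and a $\Theta(M/B)$-way merge-sort. The function name `merge_sort_ios` and the parameter values are assumptions chosen for illustration, and the constant hidden in the O-notation is ignored.

```python
import math

def merge_sort_ios(N, M, B, d):
    """Evaluate (N/B) * log_d(N/M), the d-way merge-sort bound without its constant."""
    return (N / B) * math.log(N / M, d)

if __name__ == "__main__":
    # Hypothetical parameters: 2^30 elements, cache of 2^20 elements, blocks of 2^10.
    N, M, B = 2**30, 2**20, 2**10
    binary = merge_sort_ios(N, M, B, 2)
    multiway = merge_sort_ios(N, M, B, M // B)
    print(f"binary:   {binary:.0f}")
    print(f"multiway: {multiway:.0f}")
    print(f"ratio:    {binary / multiway:.1f}")   # equals log2(M/B) for these values
```

With degree M/B the logarithm's base grows from 2 to M/B, so the bound shrinks by a factor of $\log_2\frac{M}{B}$ compared with binary merging.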

slide-10
SLIDE 10

Outline of Talk

  • Cache oblivious model
  • Sorting problem
  • Binary and multiway merge-sort
  • Funnel-sort
  • Lower bound — tall cache assumption
  • Experimental results
  • Conclusions


slide-11
SLIDE 11

Funnel-Sort



slide-14
SLIDE 14

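k-merger

Frigo et al., FOCS’99

[Figure: a k-merger M takes k sorted input streams and produces one sorted output stream.]

Recursive definition: a k-merger is built from a top merger $M_0$, buffers $B_1, \ldots, B_{\sqrt{k}}$ of size $k^{3/2}$, and bottom mergers $M_1, \ldots, M_{\sqrt{k}}$; the top and bottom mergers are $k^{1/2}$-mergers.

Recursive layout in memory: $M_0, B_1, M_1, B_2, M_2, \ldots, B_{\sqrt{k}}, M_{\sqrt{k}}$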
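The recursive definition and layout can be explored with a small sketch. Everything here is illustrative: the function names are hypothetical, k is assumed to be of the form 2, 4, 16, 256, ... so that the square root is an integer, a 2-merger is taken as the base case with no internal buffers, and the layout order (top merger first, then each buffer followed by its bottom merger) is an assumption about the figure.

```python
import math

def merger_space(k):
    """Total internal buffer space of a k-merger under the recursive definition."""
    if k <= 2:
        return 0                        # base case: a binary merger, no internal buffers
    r = math.isqrt(k)                   # sqrt(k), assumed to be an exact integer
    buffers = r * r**3                  # sqrt(k) buffers, each of size k^(3/2) = r^3
    return buffers + (r + 1) * merger_space(r)   # plus M_0 and M_1..M_sqrt(k)

def layout(k, name="M"):
    """Labels of the components in recursive-layout order."""
    if k <= 2:
        return [name]
    r = math.isqrt(k)
    order = layout(r, name + ".0")      # top merger M_0 first
    for i in range(1, r + 1):
        order.append(f"{name}.B{i}")    # buffer B_i ...
        order += layout(r, f"{name}.{i}")   # ... followed by bottom merger M_i
    return order

if __name__ == "__main__":
    print(merger_space(4), merger_space(16))
    print(layout(4))
```

Because each sub-merger is laid out contiguously, a sub-merger that is small enough to fit in cache can be loaded with a few block transfers, whatever M and B turn out to be.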


slide-17
SLIDE 17

Lazy k-merger

Brodal and Fagerberg 2002

[Figure: the recursive k-merger structure, with top merger $M_0$, buffers $B_1, \ldots, B_{\sqrt{k}}$ and bottom mergers $M_1, \ldots, M_{\sqrt{k}}$.]

Procedure Fill(v)
    while out-buffer not full
        if left in-buffer empty
            Fill(left child)
        if right in-buffer empty
            Fill(right child)
        perform one merge step

Lemma. If $M \ge B^2$ and the output buffer has size $k^3$, then $O\left(\frac{k^3}{B}\log_M(k^3) + k\right)$ I/Os are done during an invocation of Fill(root).
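The Fill procedure can be sketched as runnable code on a binary tree of mergers. This is a simplified illustration under stated assumptions: the names `Node`, `fill`, `build` and `merge_all` are hypothetical, all buffers share one capacity (the real structure uses sizes that depend on the recursion level), at least one input stream is given, and an `exhausted` flag is added to handle input streams that run out, a case the pseudocode leaves implicit.

```python
from collections import deque

class Node:
    """A binary merger; a leaf wraps one sorted input stream."""
    def __init__(self, capacity, left=None, right=None, stream=None):
        self.capacity = capacity
        self.left, self.right = left, right
        self.stream = iter(stream) if stream is not None else None
        self.buf = deque()          # this node's out-buffer
        self.exhausted = False      # True once no more elements will ever arrive

def fill(v):
    """Fill v's out-buffer, lazily refilling empty input buffers first."""
    if v.stream is not None:                    # leaf: read from the input stream
        while len(v.buf) < v.capacity:
            try:
                v.buf.append(next(v.stream))
            except StopIteration:
                v.exhausted = True
                return
        return
    while len(v.buf) < v.capacity:
        for child in (v.left, v.right):
            if not child.buf and not child.exhausted:
                fill(child)                     # in-buffer empty: Fill(child)
        l, r = v.left.buf, v.right.buf
        if not l and not r:                     # both inputs are used up
            v.exhausted = True
            return
        # Perform one merge step: move the smaller head to the out-buffer.
        if l and (not r or l[0] <= r[0]):
            v.buf.append(l.popleft())
        else:
            v.buf.append(r.popleft())

def build(streams, capacity):
    """Pair up nodes level by level into a binary merge tree."""
    nodes = [Node(capacity, stream=s) for s in streams]
    while len(nodes) > 1:
        paired = [Node(capacity, nodes[i], nodes[i + 1])
                  for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            paired.append(nodes[-1])            # odd node moves up unchanged
        nodes = paired
    return nodes[0]

def merge_all(streams, capacity=4):
    """Repeatedly invoke fill(root) and drain the root's out-buffer."""
    root = build(streams, capacity)
    out = []
    while True:
        fill(root)
        out.extend(root.buf)
        root.buf.clear()
        if root.exhausted:
            return out

if __name__ == "__main__":
    print(merge_all([[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]], capacity=2))
```

The point of the lazy scheme is that a merger only does work when its parent needs elements, so no explicit scheduling of sub-mergers is required.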


slide-19
SLIDE 19

Funnel-Sort

Brodal and Fagerberg 2002; Frigo, Leiserson, Prokop and Ramachandran 1999

  • Divide the input into $N^{1/3}$ segments of size $N^{2/3}$
  • Recursively sort each segment
  • Merge the sorted segments by an $N^{1/3}$-merger

Merger sizes k over the levels of recursion: $N^{1/3}, N^{2/9}, N^{4/27}, \ldots, 2$

Theorem. Funnel-Sort performs $O(\mathrm{Sort}_{M,B}(N))$ I/Os for $M \ge B^2$.
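The three steps can be sketched as follows. This shows only the recursion pattern: the name `funnel_sort` and the base-case threshold of 8 are assumptions, and Python's `heapq.merge` is swapped in for the $N^{1/3}$-merger, so the sketch does not have the I/O behavior of a real funnel.

```python
import heapq

def funnel_sort(a):
    """Recursion pattern of Funnel-Sort; heapq.merge stands in for the k-merger."""
    n = len(a)
    if n <= 8:                               # assumed base case for tiny inputs
        return sorted(a)
    seg = max(1, round(n ** (2 / 3)))        # segment size about N^(2/3)
    # About N^(1/3) segments, each sorted recursively.
    segments = [funnel_sort(a[i:i + seg]) for i in range(0, n, seg)]
    # Merge the sorted segments (stand-in for the N^(1/3)-merger).
    return list(heapq.merge(*segments))

if __name__ == "__main__":
    print(funnel_sort([3, 4, 8, 2, 8, 4, 4, 4, 6]))
```

Replacing the `heapq.merge` call with a lazy k-merger laid out recursively in memory is what gives the algorithm its cache-oblivious I/O bound.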

slide-20
SLIDE 20

Outline of Talk

  • Cache oblivious model
  • Sorting problem
  • Binary and multiway merge-sort
  • Funnel-sort
  • Lower bound — tall cache assumption
  • Experimental results
  • Conclusions


slide-21
SLIDE 21

Lower Bound

Brodal and Fagerberg 2003

| | Block size | Memory | I/Os |
|---|---|---|---|
| Machine 1 | $B_1$ | $M$ | $t_1$ |
| Machine 2 | $B_2$ | $M$ | $t_2$ |

One algorithm, two machines, $B_1 \le B_2$

Trade-off:

$8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge N \log\frac{N}{M} - 1.45 N$
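The trade-off can be checked numerically for given I/O counts. The sketch below is only an illustration: the function name `tradeoff_holds` and the sample values are hypothetical, and base-2 logarithms are an assumption, since the slide does not state the base.

```python
import math

def tradeoff_holds(N, M, B1, t1, t2):
    """Check 8*t1*B1 + 3*t1*B1*log(8*M*t2/(t1*B1)) >= N*log(N/M) - 1.45*N.

    Assumption of this sketch: logarithms are base 2.
    """
    lhs = 8 * t1 * B1 + 3 * t1 * B1 * math.log2(8 * M * t2 / (t1 * B1))
    rhs = N * math.log2(N / M) - 1.45 * N
    return lhs >= rhs

if __name__ == "__main__":
    N, M, B1 = 2**20, 2**10, 1
    # Far too few I/Os on both machines: the inequality rules this out.
    print(tradeoff_holds(N, M, B1, t1=1, t2=1))
    # Many I/Os on machine 1 leave room for few I/Os on machine 2.
    print(tradeoff_holds(N, M, B1, t1=10 * N, t2=2**11))
```

Any pair $(t_1, t_2)$ for which the function returns False cannot be achieved by a single comparison-based sorting algorithm on both machines.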

slide-22
SLIDE 22

Lower Bound

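| Algorithm | Assumption | I/Os (a) | I/Os (b) |
|---|---|---|---|
| Lazy Funnel-sort | $B \le M^{1-\varepsilon}$ | $B_2 = M^{1-\varepsilon}$: $\mathrm{Sort}_{B_2,M}(N)$ | $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \frac{1}{\varepsilon}$ |
| Binary Merge-sort | $B \le M/2$ | $B_2 = M/2$: $\mathrm{Sort}_{B_2,M}(N)$ | $B_1 = 1$: $\mathrm{Sort}_{B_1,M}(N) \cdot \log M$ |

Corollary: (a) ⇒ (b)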


slide-24
SLIDE 24

Fake Proof

Goal: $8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{t_1 B_1} \ge N \log\frac{N}{M} - 1.45 N$

Merging sorted lists X and Y takes $\approx |X| \log\frac{|Y|}{|X|}$ comparisons

In total $t_1 B_1$ elements touched
⇒ $t_1 B_1 / t_2$ elements touched on average per $B_2$-I/O
⇒ effective $B_2$ is $t_1 B_1 / t_2$

Comparisons gained per $B_2$-I/O: $\frac{t_1 B_1}{t_2} \cdot \log\frac{M}{t_1 B_1 / t_2}$

[Figure: the memory contents (M) next to the block read by one $B_2$-I/O ($B_2$).]

Hence: $t_1 B_1 \cdot \log\frac{M t_2}{t_1 B_1} \ge N \log N - 1.45 N$

One problem: Online choice

slide-25
SLIDE 25

Ideas from Real Proof

[Figure: decision tree T over an array A, with node types A[i] ← A[j], $\mathrm{I/O}_1[s, t]$ and $\mathrm{I/O}_2[s, t]$, and branches for the answers to comparisons A[i] ≤ A[j].]

$8 t_1 B_1 + 3 t_1 B_1 \log\frac{8 M t_2}{B_1 t_1} \ge \text{height} \ge N \log\frac{N}{M} - 1.45 N$

slide-26
SLIDE 26

Outline of Talk

  • Cache oblivious model
  • Sorting problem
  • Binary and multiway merge-sort
  • Funnel-sort
  • Lower bound — tall cache assumption
  • Experimental results
  • Conclusions


slide-27
SLIDE 27

Hardware

| Processor type | Pentium 4 | Pentium 3 | MIPS 10000 |
|---|---|---|---|
| Workstation | Dell PC | Delta PC | SGI Octane |
| Operating system | GNU/Linux Kernel version 2.4.18 | GNU/Linux Kernel version 2.4.18 | IRIX version 6.5 |
| Clock rate | 2400 MHz | 800 MHz | 175 MHz |
| Address space | 32 bit | 32 bit | 64 bit |
| Integer pipeline stages | 20 | 12 | 6 |
| L1 data cache size | 8 KB | 16 KB | 32 KB |
| L1 line size | 128 Bytes | 32 Bytes | 32 Bytes |
| L1 associativity | 4 way | 4 way | 2 way |
| L2 cache size | 512 KB | 256 KB | 1024 KB |
| L2 line size | 128 Bytes | 32 Bytes | 32 Bytes |
| L2 associativity | 8 way | 4 way | 2 way |
| TLB entries | 128 | 64 | 64 |
| TLB associativity | Full | 4 way | 64 way |
| TLB miss handler | Hardware | Hardware | Software |
| Main memory | 512 MB | 256 MB | 128 MB |

slide-28
SLIDE 28

Wall Clock

Pentium 4, 512/512

[Figure: wall clock time per element (0.1µs to 100.0µs) versus number of elements (1,000,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, ami_sort, msort-c and msort-m.]

Kristoffer Vinther 2003

slide-29
SLIDE 29

Page Faults

Pentium 4, 512/512

[Figure: page faults per block of elements (0.0 to 30.0) versus number of elements (1,000,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.]

Kristoffer Vinther 2003

slide-30
SLIDE 30

Cache Misses

MIPS 10000, 1024/128

[Figure: L2 cache misses per line of elements (0.0 to 30.0) versus number of elements (100,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.]

Kristoffer Vinther 2003

slide-31
SLIDE 31

TLB Misses

MIPS 10000, 1024/128

[Figure: TLB misses per block of elements (1.0 to 10.0) versus number of elements (100,000 to 1,000,000,000) for ffunnelsort, funnelsort, lowscosa, stdsort, msort-c and msort-m.]

Kristoffer Vinther 2003

slide-32
SLIDE 32

Outline of Talk

  • Cache oblivious model
  • Sorting problem
  • Binary and multiway merge-sort
  • Funnel-sort
  • Lower bound — tall cache assumption
  • Experimental results
  • Conclusions


slide-33
SLIDE 33

Conclusions

Cache oblivious sorting

  • is possible
  • requires a tall cache assumption $M \ge B^{1+\varepsilon}$
  • achieves performance comparable to cache-aware algorithms

Future work

  • more experimental justification for the cache oblivious model
  • limitations of the model: time-space trade-offs?
  • tool-box for cache oblivious algorithms
