High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters



SLIDE 1

03/04/2014

High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters

Wei Song

Advanced Processor Technologies Group The School of Computer Science

SLIDE 2

Motivation

  • Hardware sorters are important.
  • Parallel sorters have a size limit.

– Sorting N numbers needs a network of size $N \log^2(N)$.

  • Sequential sorters have a throughput limit.

– Sorting throughput is limited to 1 number per cycle.

  • Is there a way to sort N (N > 1M) numbers with a throughput larger than 1 number per cycle?

SLIDE 3

Content

  • Review of existing sorters

– Parallel sorters
– Sequential sorters

  • Parallel merge-tree sorter

– Key ideas
– Hardware structure
– Performance

SLIDE 4

Parallel Sorters (Bitonic Sorting Network)

[Figure: an 8-input Bitonic sorting network BN(8), built from BN(4), BN(2), BM(4) and BM(8) blocks over stages S0 to S5, with inputs I0 to I7 and outputs O0 to O7. Each comparator takes A and B and outputs min{A,B} and max{A,B}. Example: the input 12 89 53 9 30 79 62 17 leaves the network as 9 12 17 30 53 62 79 89.]
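The compare-and-exchange structure in the figure can be sketched in software. The following is a minimal Python model, not the hardware design from the talk: `compare_exchange`, `bitonic_merge` and `bitonic_sort` are illustrative names, and the input length is assumed to be a power of two.

```python
def compare_exchange(a, b):
    """One comparator: returns (min{A,B}, max{A,B})."""
    return (a, b) if a <= b else (b, a)


def bitonic_merge(values, ascending=True):
    """Sort a bitonic sequence (rises then falls, or falls then rises)."""
    n = len(values)
    if n == 1:
        return list(values)
    half = n // 2
    lo, hi = [], []
    for i in range(half):
        small, large = compare_exchange(values[i], values[i + half])
        if ascending:
            lo.append(small)
            hi.append(large)
        else:
            lo.append(large)
            hi.append(small)
    # Both halves are bitonic again, and every element of lo precedes
    # every element of hi in the requested order.
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(values, ascending=True):
    """Sort a power-of-two-length list with the bitonic network recursion."""
    n = len(values)
    if n == 1:
        return list(values)
    half = n // 2
    first = bitonic_sort(values[:half], True)    # ascending half
    second = bitonic_sort(values[half:], False)  # descending half
    return bitonic_merge(first + second, ascending)


print(bitonic_sort([12, 89, 53, 9, 30, 79, 62, 17]))
```

In hardware, all comparators of one stage work in the same cycle; the recursion here only reproduces which pairs are compared.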

SLIDE 5

Parallel Sorters (Bitonic Sorting Network)

[Figure: the BN(8) network of the previous slide (inputs I0 to I7, outputs O0 to O7, stages S0 to S5), with its stages grouped into Bitonic mergers BM(8), BM(4) and BM(2).]

Bitonic Network (BN), Bitonic Merger (BM)

– Data Set Size: $P$
– Throughput: $P$
– Size(Compare): $P \log^2(P)$
– Delay: $\log^2(P)$
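The size and delay on this slide are orders of magnitude. For the standard Bitonic construction the exact figures are log2(P) * (log2(P) + 1) / 2 stages with P / 2 comparators each. The sketch below evaluates them; the function name `bitonic_network_cost` is illustrative and P is assumed to be a power of two.

```python
def bitonic_network_cost(p):
    """Return (comparators, stages) of a P-input bitonic sorting network."""
    k = p.bit_length() - 1          # log2(P) for a power of two
    if p < 2 or (1 << k) != p:
        raise ValueError("P must be a power of two >= 2")
    stages = k * (k + 1) // 2       # delay in comparator stages
    comparators = (p // 2) * stages # P/2 comparators per stage
    return comparators, stages


for p in (8, 1024, 1 << 20):
    comparators, stages = bitonic_network_cost(p)
    print(f"P={p}: {comparators} comparators, {stages} stages")
```

For P = 8 this gives 6 stages, which matches S0 to S5 in the figure. For P = 2^20 it gives more than 10^8 comparators, which is the size limit the Motivation slide refers to.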

SLIDE 6

Sequential Sorters (Insertion Sorter)

[Figure: a chain of compare cells Cell0, Cell1, ..., CellN-1 between data_in and data_out. Example: the numbers 3, 12, 7, 1, 9, 20 enter one per cycle and the stored sequence grows as 3; 3 12; 3 7 12; 1 3 7 12; 1 3 7 9 12; 1 3 7 9 12 20.]

– Data Set Size: $N$
– Throughput: 1
– Size(cells): $N$
– Delay: $N$
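A behavioural sketch of the insertion sorter, assuming each cell keeps the smaller of its stored value and the incoming value and passes the larger one on. The name `insertion_sorter` is illustrative. In hardware all cells update in parallel within one cycle; the inner loop below only models the net effect of that cycle.

```python
def insertion_sorter(stream, n_cells):
    """Insert one number per cycle into a chain of n_cells compare cells.

    Returns the cell contents after every cycle.
    """
    cells = [None] * n_cells
    history = []
    for value in stream:
        carry = value
        for i in range(n_cells):
            if cells[i] is None:          # first empty cell takes the carry
                cells[i] = carry
                carry = None
                break
            if carry < cells[i]:          # keep the smaller, pass the larger
                cells[i], carry = carry, cells[i]
        if carry is not None:
            raise OverflowError("more numbers than cells")
        history.append([c for c in cells if c is not None])
    return history


for state in insertion_sorter([3, 12, 7, 1, 9, 20], n_cells=6):
    print(state)
```

The sorter accepts exactly one number per cycle, which is the throughput limit of 1 stated on the slide.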

SLIDE 7

Sequential Sorter (FIFO-merge)

[Figure: a 2-input merger (inputs I0 and I1, output O) merging the pre-sorted FIFOs 5 12 16 19 and 4 9 10 22 one number per cycle, emitting 22, 19, 16, 12, ... A cascade of merge stages S0, S1, S2 uses FIFOs of size N/8, N/4 and N/2.]

– Data Set Size: $N$
– Throughput: 1
– Size(Memory): $2N$
– Delay: $N$

  • D. Koch and J. Torresen, “FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. of FPGA, February 2011, pp. 45–54.
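A software sketch of the FIFO-merge idea, assuming each merge stage pops one number per step from whichever input FIFO has the preferred head. The names `merge_two_fifos` and `fifo_merge_sort` are illustrative, and the sketch emits the smallest number first, whereas the slide example emits the largest first.

```python
from collections import deque


def merge_two_fifos(fifo0, fifo1):
    """Merge two ascending FIFOs, producing one number per step."""
    out = []
    while fifo0 or fifo1:
        if not fifo1 or (fifo0 and fifo0[0] <= fifo1[0]):
            out.append(fifo0.popleft())
        else:
            out.append(fifo1.popleft())
    return out


def fifo_merge_sort(values):
    """Cascade of merge stages: run length doubles at every stage."""
    runs = [[v] for v in values]
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs) - 1, 2):
            merged.append(merge_two_fifos(deque(runs[i]), deque(runs[i + 1])))
        if len(runs) % 2:                 # odd run is passed through unchanged
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []


print(merge_two_fifos(deque([5, 12, 16, 19]), deque([4, 9, 10, 22])))
print(fifo_merge_sort([16, 5, 19, 12, 22, 4, 10, 9]))
```

Each stage still outputs a single number per step, so the throughput of the whole cascade stays at 1 number per cycle.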
SLIDE 8

Summarise Existing Sorters

  • Parallel Sorters

– High throughput
– Area increases significantly with the quantity of data
– Only practical for sorting a small quantity of numbers

  • Sequential Sorters

– Linear area overhead
– Feasible for large data sets
– Low throughput

SLIDE 9

Can we dynamically merge multiple sequential sorters?

SLIDE 10

Parallel Merging

Merge multiple sequential sorters using a Bitonic network?

[Figure: four sequential sorters feed a Bitonic network with one number each per cycle. Heads per cycle: 3 1 4 5; 5 9 7 6; 15 10 15 13; 22 20 17 24; 28 24 26 28; 34 30 29 37. Network output: 1 3 4 5 5 6 7 9 10 13 15 15 17 20 22 24 24 26 28 28 29 30 34 37, which is fully sorted.]

YES

SLIDE 11

Parallel Merging

Merge multiple sequential sorters using a Bitonic network?

[Figure: the evenly distributed example of the previous slide (YES) next to the same numbers distributed unevenly over the four sorters. Heads per cycle in the uneven case: 5 1 5 4; 9 3 6 15; 10 15 7 17; 22 20 13 24; 30 24 26 29; 34 28 28 37. Network output: 1 4 5 5 3 6 9 15 7 10 15 17 13 20 22 24 24 26 29 30 28 28 34 37, which is not sorted.]

NO!

Numbers may not be distributed evenly among sequences.
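The two cases can be replayed in a few lines. This is a sketch under the assumption that the network simply sorts the four heads it receives in each cycle; the grouping of the slide's numbers into cycles is a reading of the figure, and `naive_parallel_merge` is an illustrative name.

```python
# Heads delivered by the four sequential sorters in each cycle.
EVEN = [[3, 1, 4, 5], [5, 9, 7, 6], [15, 10, 15, 13],
        [22, 20, 17, 24], [28, 24, 26, 28], [34, 30, 29, 37]]
UNEVEN = [[5, 1, 5, 4], [9, 3, 6, 15], [10, 15, 7, 17],
          [22, 20, 13, 24], [30, 24, 26, 29], [34, 28, 28, 37]]


def naive_parallel_merge(heads_per_cycle):
    """Sort only the heads of each cycle and emit them immediately."""
    out = []
    for heads in heads_per_cycle:
        out.extend(sorted(heads))   # stand-in for the 4-input network
    return out


def is_sorted(values):
    return all(a <= b for a, b in zip(values, values[1:]))


print(is_sorted(naive_parallel_merge(EVEN)))    # True
print(is_sorted(naive_parallel_merge(UNEVEN)))  # False
```

In the uneven case the second cycle emits 3 after the first cycle has already emitted 5, because the network never sees numbers that are still queued behind the heads.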

SLIDE 12

Parallel Merging

[Figure: two sequential sorters; their output columns read 5 1, 9 3, 10 15, 22 20, 30 24, 34 28.]

Increase the comparing window.

SLIDE 13

Parallel Merging

Increase the comparing window.

[Figure: the two sequential sorters of the previous slide with a comparing window holding 28 24 34 30.]

Return unselected numbers.

[Figure: animation frames of the two-sorter example; the window starts at 28 24 34 30 and moves through the remaining numbers as unselected numbers are returned.]

YES!

SLIDE 14

Parallel Merging

Increase the comparing window.

[Figure: two sequential sorters with the comparing window 28 24 34 30.]

Return unselected numbers.

Requirement:

To merge S pre-sorted sequences at a speed of S numbers per cycle:

  1. Increase the comparing window to S x S.
  2. Use an S x S-input Bitonic sorting network. [Area overhead]
  3. Return the S x (S - 1) unselected numbers. [Control overhead]
  4. Unselected numbers should be returned in one cycle. [Slow clock]
  5. Each sequence needs a maximal shifting rate of S numbers per cycle. [Speed mismatch]
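The requirements above can be sketched as a behavioural model. It assumes the window takes up to S numbers from the head of each of the S sequences, that a plain `sort` stands in for the S x S-input Bitonic sorting network, and that the smallest number is emitted first. `window_merge` and `SEQUENCES` are illustrative names; the data is the uneven four-sorter example.

```python
from collections import deque

SEQUENCES = [
    [5, 9, 10, 22, 30, 34],
    [1, 3, 15, 20, 24, 28],
    [5, 6, 7, 13, 26, 28],
    [4, 15, 17, 24, 29, 37],
]


def window_merge(sequences, s):
    """Merge pre-sorted sequences, emitting up to s numbers per cycle."""
    fifos = [deque(seq) for seq in sequences]
    out = []
    while any(fifos):
        # 1. Fill the S x S comparing window.
        window = []
        for idx, fifo in enumerate(fifos):
            for _ in range(min(s, len(fifo))):
                window.append((fifo.popleft(), idx))
        # 2. Sort the window (stand-in for the sorting network).
        window.sort()
        selected, unselected = window[:s], window[s:]
        out.extend(value for value, _ in selected)
        # 3. Return the unselected numbers to the front of their source,
        #    largest first so each sequence stays in ascending order.
        for value, idx in reversed(unselected):
            fifos[idx].appendleft(value)
    return out


print(window_merge(SEQUENCES, s=4))
```

The S smallest remaining numbers are always inside the window, since no sequence can contribute more than S of them. That is why this version stays sorted where the head-only merge did not.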
SLIDE 15

Parallel Merging

[Figure: four sequential sorters feeding one 4 x 4 comparing window holding 28 24 34 30, 17 22 15 10, 37 29 28 26 and 24 20 15 13.]

SLIDE 16

Optimising the Parallel Merging

[Figure: the four sequential sorters with the single 4 x 4 comparing window of the previous slide.]

Using a tree structure reduces the number of comparators by more than 50%.

[Figure: the four sequential sorters merged pairwise, with a smaller comparing window at each node of the tree.]

SLIDE 17

Optimising the Parallel Merging

Replace the Bitonic sorting networks with Bitonic mergers because the sequences are pre-sorted.

[Figure: the four-sorter merge tree before and after the replacement.]

  1. Reduce the comparing window.
  2. Reduce the size of the sorting networks.
  3. Reduce the quantity of numbers being returned.

SLIDE 18

Bitonic Partial Merger

[Figure: an 8-input Bitonic merger with outputs O0 to O7 next to an 8-input Bitonic partial merger that produces only outputs O0 to O3.]

  • A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, “Modular design of high-throughput, low-latency sorting units,” IEEE Transactions on Computers, vol. 62, no. 7, pp. 1389–1402, July 2013.
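A sketch of a partial merger for two pre-sorted inputs. It assumes both inputs are ascending, have the same power-of-two length n, and that the smaller half is the one kept; the slides do not state which half is produced. `bitonic_merge` and `bitonic_partial_merge` are illustrative names.

```python
def bitonic_merge(values):
    """Sort a bitonic sequence into ascending order."""
    n = len(values)
    if n == 1:
        return list(values)
    half = n // 2
    lo = [min(values[i], values[i + half]) for i in range(half)]
    hi = [max(values[i], values[i + half]) for i in range(half)]
    return bitonic_merge(lo) + bitonic_merge(hi)


def bitonic_partial_merge(a, b):
    """Return the n smallest of two ascending length-n inputs, sorted.

    The first comparator stage pairs a[i] with b[n-1-i] and keeps only the
    minimum, so the upper half of a full merger is never built.
    """
    n = len(a)
    if n != len(b):
        raise ValueError("inputs must have equal length")
    lower = [min(a[i], b[n - 1 - i]) for i in range(n)]  # bitonic, n smallest
    return bitonic_merge(lower)


print(bitonic_partial_merge([1, 4, 6, 7], [2, 3, 5, 8]))
```

Dropping the discarded half removes the comparators that would only have ordered numbers that are returned to their sequences anyway.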

SLIDE 19

Optimising the Parallel Merging

[Figure: the four sequential sorters merged by a tree of three nodes, each with a control block.]

Single clock data return.

SLIDE 20

Optimising the Parallel Merging

[Figure: the merge tree of the previous slide annotated with data rates: 1 number per cycle out of each sequential sorter, 2 numbers per cycle between the tree levels and 4 numbers per cycle at the output.]

Single clock data return.

The last issue: Speed mismatch between inputs and outputs.

SLIDE 21

Speed Mismatch: Use FIFOs and Allow Stalls

[Figure: four sequential sorters merged by a tree of three nodes, each with its own control block.]

SLIDE 22

How Stalls Occur

[Figure: two sequential sorters feeding one merger with a control block.]

Original sequences:

– First sorter: 16 4 20 6 10 12 18 2 14 8
– Second sorter: 11 3 9 5 13 1 19 7 17 15

Pre-sorted:

– First sorter: 2 4 6 8 10 12 14 16 18 20
– Second sorter: 1 3 5 7 9 11 13 15 17 19

0 stalls, R = 0%, α = 0

Even distribution has 0 stalls.

SLIDE 23

How Stalls Occur

[Figure: two sequential sorters feeding one merger with a control block.]

Uneven distribution may reduce the speed to the minimum of 1 number per cycle.

Original sequences:

– First sorter: 18 12 20 13 15 16 19 11 17 14
– Second sorter: 6 3 9 2 7 1 10 4 5 8

Pre-sorted:

– First sorter: 11 12 13 14 15 16 17 18 19 20
– Second sorter: 1 2 3 4 5 6 7 8 9 10

10 stalls, R = 50%, α = 1.0

SLIDE 24

How Stalls Occur

[Figure: two sequential sorters feeding one merger with a control block.]

A random distribution gives an average stall rate, which can be reduced by using longer FIFOs.

Original sequences:

– First sorter: 6 12 9 13 7 1 19 4 17 14
– Second sorter: 18 3 20 2 15 16 10 11 5 8

Pre-sorted:

– First sorter: 1 4 6 7 9 12 13 14 17 19
– Second sorter: 2 3 5 8 10 11 15 16 18 20

2 stalls, R = 17%, α = 0.2

0 stalls when using a FIFO with depth > 2.
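The slides do not define R and α. The three two-sorter examples (0, 10 and 2 stalls for 20 numbers at 2 numbers per cycle) are consistent with reading R as stalls divided by total cycles and α as stalls divided by the ideal cycle count. The sketch below checks that reading; it is an inference from the quoted figures, and `stall_metrics` is an illustrative name.

```python
def stall_metrics(n_numbers, s, stalls):
    """Return (R, alpha) under the assumed definitions.

    ideal  = n_numbers / s cycles when S numbers leave every cycle
    R      = stalls / (ideal + stalls)   # share of cycles lost to stalls
    alpha  = stalls / ideal              # extra cycles relative to ideal
    """
    ideal = n_numbers // s
    total = ideal + stalls
    return stalls / total, stalls / ideal


for stalls in (0, 10, 2):
    r, alpha = stall_metrics(20, 2, stalls)
    print(f"{stalls:2d} stalls: R = {r:.0%}, alpha = {alpha:.1f}")
```

Under these definitions the worst case of 10 stalls doubles the run time, which corresponds to the minimum speed of 1 number per cycle.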

SLIDE 25

Reducing Stalls using Long FIFOs

[Figure: four sequential sorters merged by a tree of three nodes, each with a control block.]

[Plot: stall rate (axis from 0.00 to 0.45) against FIFO depth (2 to 8), with one curve each for S=2, S=4, S=8 and S=16.]

SLIDE 26

Handle Uneven Distribution

[Figure: the merge tree of sequential sorters and control blocks, extended with a shuffler that is driven by a random sequence generator.]

SLIDE 27

Handle Uneven Distribution

[Figure: a random sequence (five 1 bits shown) decides which input pairs are swapped before they reach the two sorters.]

Original sequences:

– First sorter: 18 12 20 13 15 16 19 11 17 14
– Second sorter: 6 3 9 2 7 1 10 4 5 8

Randomly shuffled:

– First sorter: 6 12 9 13 7 1 19 4 17 14
– Second sorter: 18 3 20 2 15 16 10 11 5 8

Pre-sorted:

– First sorter: 1 4 6 7 9 12 13 14 17 19
– Second sorter: 2 3 5 8 10 11 15 16 18 20

2 stalls, R = 17%, α = 0.2
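A sketch of the shuffler for two sorters, assuming one random bit per cycle decides whether the incoming pair is swapped. The bit pattern in `SLIDE_BITS` is inferred by comparing the original and shuffled sequences on the slide; `shuffle_pairs`, `random_bits` and the seed are illustrative.

```python
import random

# Pairs arriving per cycle: (number for first sorter, number for second sorter).
ORIGINAL = [(18, 6), (12, 3), (20, 9), (13, 2), (15, 7),
            (16, 1), (19, 10), (11, 4), (17, 5), (14, 8)]
SLIDE_BITS = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]  # inferred, five swaps


def random_bits(n, seed=0):
    """Software stand-in for the random sequence generator."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def shuffle_pairs(pairs, bits):
    """Swap a pair when its bit is 1, otherwise pass it through."""
    return [(b, a) if bit else (a, b) for (a, b), bit in zip(pairs, bits)]


shuffled = shuffle_pairs(ORIGINAL, SLIDE_BITS)
first = sorted(a for a, _ in shuffled)
second = sorted(b for _, b in shuffled)
print(first)
print(second)
```

Without shuffling, the first sorter receives only 11 to 20 and the second only 1 to 10, which is the worst case with 10 stalls. After the swaps both sorters hold a mix of small and large numbers. Calling `shuffle_pairs(ORIGINAL, random_bits(len(ORIGINAL)))` gives a different, seed-dependent mix.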

SLIDE 28

Some results

Sorters                        Frequency  Slices  RAM   No. of Records  Throughput
PMT(4) [Virtex 7]              226 MHz    70%     100%  459K            51.5 Gb/s
PMT(8) [Virtex 7]              206 MHz    96%     26%   115K            91.4 Gb/s
PMT(16) No seq. [Virtex 7]     202 MHz    96%     0%    -               193.6 Gb/s
Seq. (Dirk Koch*) [Virtex 5]   252 MHz    74%     98%   43K             16 Gb/s
PCIe 2.1 x8 [Virtex 7]         250 MHz    -       -     -               32 Gb/s
PCIe 3.0 x8 [Virtex 7]         250 MHz    -       -     -               64 Gb/s

* D. Koch and J. Torresen, “FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. of International Symposium on Field Programmable Gate Arrays, February 2011, pp. 45–54.

SLIDE 29

Summary

  • The results of two sequential sorters can be dynamically merged using a Bitonic partial merger.
  • Multiple sequential sorters can be merged using a tree of Bitonic partial mergers.
  • When stalls are allowed, the speed mismatch issue can be alleviated using long FIFOs and random shufflers.
  • The throughput limit of 1 number per cycle is removed.

SLIDE 30

THANKS!
