03/04/2014
High-Throughput Sorting by Dynamically Merging Multiple Hardware Sequential Sorters
Wei Song
Advanced Processor Technologies Group The School of Computer Science
Motivation
Hardware sorters are important. Parallel sorters have …
[Figure: 8-input bitonic sorting network BN(8) with inputs I0 to I7, outputs O0 to O7 and stages S0 to S5, built from BN(4) and BN(2) sub-networks and the bitonic mergers BM(4) and BM(8).]
Comparator: inputs A and B, outputs min{A,B} and max{A,B}.
Example, one row per stage:
Input:    12 89 53  9 30 79 62 17
After S0: 12 89  9 53 30 79 17 62
After S1: 12  9 89 53 30 17 79 62
After S2:  9 12 53 89 17 30 62 79
After S3:  9 12 30 17 89 53 62 79
After S4:  9 12 30 17 62 53 89 79
After S5:  9 12 17 30 53 62 79 89
Output:    9 12 17 30 53 62 79 89
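The stage-by-stage example can be replayed in software. The following is a minimal sketch, assuming the network uses mirrored comparisons in the first stage of each merger and half-distance comparisons afterwards, which is the wiring that matches the numbers in the example. The function names `bitonic_network_stages` and `run_network` are illustrative and do not come from the slides.

```python
def bitonic_network_stages(n):
    """Return the comparator pairs of every stage for n = 2**k inputs."""
    stages = []
    block = 2
    while block <= n:
        # First stage of a merger: compare mirrored positions inside each block.
        stages.append([(base + i, base + block - 1 - i)
                       for base in range(0, n, block)
                       for i in range(block // 2)])
        # Remaining stages: compare positions half a sub-block apart.
        half = block // 2
        while half >= 2:
            d = half // 2
            stages.append([(base + i, base + i + d)
                           for base in range(0, n, half)
                           for i in range(d)])
            half //= 2
        block *= 2
    return stages


def run_network(values):
    """Apply the network and return the data after every stage."""
    data = list(values)
    trace = [list(data)]
    for stage in bitonic_network_stages(len(data)):
        for lo, hi in stage:
            # Each comparator puts min{A,B} on the lower line.
            if data[lo] > data[hi]:
                data[lo], data[hi] = data[hi], data[lo]
        trace.append(list(data))
    return trace


if __name__ == "__main__":
    for row in run_network([12, 89, 53, 9, 30, 79, 62, 17]):
        print(row)
```

For eight inputs the sketch produces six stages of four comparators each, which agrees with the stages S0 to S5 in the figure.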
[Figure: the same 8-input network, with BM(8) decomposed into two BM(4) and four BM(2).]
Bitonic Network (BN) and Bitonic Merger (BM): data set size, throughput, size (comparators), delay.
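[Figure: linear sorter built from a chain of compare cells Cell0, Cell1, ..., CellN-1 between data_in and data_out.]
Example, one number inserted per cycle (3, 12, 7, 1, 9, 20):
3
3 12
3 7 12
1 3 7 12
1 3 7 9 12
1 3 7 9 12 20
Data set size, throughput (1), size (cells), delay.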
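The cell chain can be modelled in a few lines. This is a rough software sketch, assuming each cell keeps the smaller of its stored number and the incoming number and passes the larger one to the next cell. Real hardware would update all cells in the same clock cycle, whereas the loop below walks the chain sequentially. The names `insert` and `linear_sort_trace` are hypothetical.

```python
def insert(cells, value):
    """Push one value through the chain; every cell keeps the smaller number."""
    carried = value
    for i in range(len(cells)):
        if cells[i] is None:
            # First empty cell stores whatever is still being carried.
            cells[i] = carried
            return
        if carried < cells[i]:
            cells[i], carried = carried, cells[i]
    raise OverflowError("more values than cells")


def linear_sort_trace(values, n_cells):
    """Insert one value per cycle and record the occupied cells after each cycle."""
    cells = [None] * n_cells
    trace = []
    for v in values:
        insert(cells, v)
        trace.append([c for c in cells if c is not None])
    return trace


if __name__ == "__main__":
    for row in linear_sort_trace([3, 12, 7, 1, 9, 20], n_cells=6):
        print(row)
```

One value enters per cycle, which corresponds to the throughput of 1 listed for this sorter, and the number of cells bounds the data set size.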
[Figure: FIFO merge sorter. A merge unit takes two pre-sorted inputs I0 and I1 and produces one output O; stages S0, S1, S2 use FIFOs of depth N/8, N/4, N/2.]
Example: merging 5 12 16 19 with 4 9 10 22, one number per cycle, largest first:
O = 22
O = 19 22
O = 16 19 22
Data set size, throughput (1), size (memory), delay.
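D. Koch and J. Torresen, “FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. of International Symposium on Field Programmable Gate Arrays, February 2011, pp. 45–54.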
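A two-input merge that emits one number per step, largest first, can be sketched as follows. This is an illustrative software model only: the deques stand in for the hardware FIFOs, and `fifo_merge_sort` is a hypothetical helper that chains merge passes in the way the stages S0, S1, S2 suggest. It is not taken from the cited design.

```python
from collections import deque


def merge_largest_first(i0, i1):
    """Merge two ascending runs, emitting one number per step, largest first."""
    a, b = deque(i0), deque(i1)
    out = []
    while a or b:
        # Compare the current largest number of each run.
        if not b or (a and a[-1] >= b[-1]):
            out.append(a.pop())
        else:
            out.append(b.pop())
    return out


def fifo_merge_sort(values):
    """Sort by repeated pairwise merging; each pass doubles the run length."""
    runs = [[v] for v in values]
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs) - 1, 2):
            # Reverse so that every run is ascending again for the next pass.
            merged.append(merge_largest_first(runs[k], runs[k + 1])[::-1])
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []


if __name__ == "__main__":
    print(merge_largest_first([5, 12, 16, 19], [4, 9, 10, 22]))
    print(fifo_merge_sort([12, 89, 53, 9, 30, 79, 62, 17]))
```

Because only one comparison is made per step, the throughput stays at one number per cycle regardless of how much memory is used, which is the limitation the later slides address.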
Merge multiple sequential sorters using a Bitonic network?
[Figure: four sequential sorters each deliver one number per cycle to a 4-input bitonic network.]
Evenly distributed case, four numbers per cycle:
Sorter outputs: 3 1 4 5 | 5 9 7 6 | 15 10 15 13 | 22 20 17 24 | 28 24 26 28 | 34 30 29 37
Network output: 1 3 4 5 | 5 6 7 9 | 10 13 15 15 | 17 20 22 24 | 24 26 28 28 | 29 30 34 37
Unevenly distributed case:
Sorter outputs: 5 1 5 4 | 9 3 6 15 | 10 15 7 17 | 22 20 13 24 | 30 24 26 29 | 34 28 28 37
Network output: 1 4 5 5 | 3 6 9 15 | 7 10 15 17 | 13 20 22 24 | 24 26 29 30 | 28 28 34 37
Numbers may not be distributed evenly among sequences, in which case the network output is not globally sorted.
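The failure of the per-cycle approach is easy to reproduce. In the sketch below, Python's `sorted` stands in for the 4-input bitonic network, and the two data sets are the per-sorter sequences read off the slide's examples. The function name `per_cycle_sort` is illustrative.

```python
def per_cycle_sort(sequences):
    """Sort only the numbers that arrive together in each cycle."""
    out = []
    for group in zip(*sequences):
        # sorted() stands in for the S-input bitonic network.
        out.extend(sorted(group))
    return out


# Per-sorter sequences, one list per sequential sorter.
EVEN = [
    [3, 5, 15, 22, 28, 34],
    [1, 9, 10, 20, 24, 30],
    [4, 7, 15, 17, 26, 29],
    [5, 6, 13, 24, 28, 37],
]
UNEVEN = [
    [5, 9, 10, 22, 30, 34],
    [1, 3, 15, 20, 24, 28],
    [5, 6, 7, 13, 26, 28],
    [4, 15, 17, 24, 29, 37],
]

if __name__ == "__main__":
    for name, data in (("even", EVEN), ("uneven", UNEVEN)):
        result = per_cycle_sort(data)
        print(name, result == sorted(result))
```

Both data sets contain the same 24 numbers; only their distribution among the sorters differs.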
Increase the comparing window.
[Figure: two sequential sorters hold 5 9 10 22 30 34 and 1 3 15 20 24 28; the comparing window takes the top two numbers of each, 34 30 and 28 24.]
Return unselected numbers.
[Figure: cycle-by-cycle trace of the window selecting two numbers and returning the unselected ones to the sorters.]
To merge S pre-sorted sequences at a speed of S numbers per cycle, …
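The idea of a larger comparing window with returned numbers can be sketched as below. The sketch assumes the window takes up to S numbers from the top of each of the S sequences, which is enough to guarantee that the S largest remaining numbers are inside the window, and it uses a software sort in place of the comparator network. The name `window_merge` and the list-based "sorters" are illustrative, not the hardware design.

```python
def window_merge(sequences):
    """Merge S ascending sequences, emitting the S largest remaining numbers per cycle."""
    s = len(sequences)
    stacks = [list(seq) for seq in sequences]  # largest number sits at the end
    out = []
    while any(stacks):
        # Comparing window: up to S numbers from the top of every sequence.
        window = []
        for idx, stack in enumerate(stacks):
            for _ in range(min(s, len(stack))):
                window.append((stack.pop(), idx))
        window.sort(reverse=True)
        selected, unselected = window[:s], window[s:]
        out.append([value for value, _ in selected])
        # Return unselected numbers to their own sequence, smallest first,
        # so that every sequence stays in ascending order.
        for value, idx in sorted(unselected):
            stacks[idx].append(value)
    return out


if __name__ == "__main__":
    for cycle in window_merge([[5, 9, 10, 22, 30, 34], [1, 3, 15, 20, 24, 28]]):
        print(cycle)
```

With the two sequences from the figure, the first window holds 34 30 28 24, the numbers 34 and 30 are selected, and 28 and 24 are returned.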
[Figure: the comparing window applied to four sequential sorters.]
Using a tree structure reduces the number of comparators by > 50%.
[Figure: four sequential sorters merged pairwise by a tree of comparing windows.]
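A back-of-the-envelope count illustrates where a saving of this size could come from. The sketch rests on assumptions that are not stated in the slides: the flat design is taken to be one bitonic sorting network over S numbers from each of S sorters, each tree node that outputs m numbers per cycle is taken to use a bitonic sorting network over 2m inputs, and an n-input bitonic sorting network is counted with the standard formula n·log2(n)·(log2(n)+1)/4. All function names are hypothetical.

```python
def bn_comparators(n):
    """Comparators in an n-input bitonic sorting network (n a power of two)."""
    k = n.bit_length() - 1  # log2(n) for a power of two
    return n * k * (k + 1) // 4


def flat_comparators(s):
    """Assumed flat design: one network over S numbers from each of S sorters."""
    return bn_comparators(s * s)


def tree_comparators(s):
    """Assumed tree design: binary tree of 2-way merge nodes."""
    total, nodes, out_rate = 0, s // 2, 2
    while nodes >= 1:
        # A node emitting out_rate numbers per cycle compares 2 * out_rate numbers.
        total += nodes * bn_comparators(2 * out_rate)
        nodes //= 2
        out_rate *= 2
    return total


if __name__ == "__main__":
    for s in (4, 8, 16):
        flat, tree = flat_comparators(s), tree_comparators(s)
        print(s, flat, tree, round(1 - tree / flat, 2))
```

Under these assumptions, S = 4 gives 80 comparators for the flat window and 36 for the tree, a reduction of 55%, which is in line with the figure quoted on the slide.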
Replace the Bitonic sorting networks with Bitonic mergers because the sequences are pre-sorted.
[Figure: the merge tree of four sequential sorters with Bitonic mergers in place of the sorting networks.]
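The saving from using a merger instead of a full sorting network can be seen in a small model. The sketch below merges two ascending lists of equal power-of-two length using only the merger stages (one mirrored stage followed by half-distance stages) and counts the comparators it uses. It is a software illustration under that assumed wiring, and `bitonic_merge` is not a name from the slides.

```python
def bitonic_merge(a, b):
    """Merge two ascending lists of equal power-of-two length.

    Returns the merged list and the number of comparators used.
    """
    data = list(a) + list(b)
    n = len(data)
    comparators = 0
    # Stage 1: compare mirrored positions across the two sorted halves.
    for i in range(n // 2):
        j = n - 1 - i
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
        comparators += 1
    # Remaining stages: compare positions half a sub-block apart.
    half = n // 2
    while half >= 2:
        d = half // 2
        for base in range(0, n, half):
            for i in range(base, base + d):
                if data[i] > data[i + d]:
                    data[i], data[i + d] = data[i + d], data[i]
                comparators += 1
        half //= 2
    return data, comparators


if __name__ == "__main__":
    print(bitonic_merge([9, 12, 53, 89], [17, 30, 62, 79]))
```

For eight inputs the merger needs three stages of four comparators, compared with six stages for the full 8-input sorting network shown earlier.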
… window.
… sorting networks.
… being returned.
“Modular design of high-throughput, low-latency sorting units,” IEEE Transactions on Computers, vol. 62, no. 7, pp. 1389–1402, July 2013.
[Figure: two 8-input networks with inputs I0 to I7, one with eight outputs O0 to O7 and one with four outputs O0 to O3.]
[Figure: four sequential sorters merged by a tree of three merge nodes, each with its own control block.]
Single clock data return.
Throughput along the tree: 1 N/cyc from each sequential sorter, 2 N/cyc after the first-level nodes, 4 N/cyc at the root.
[Figure: merge tree of sequential sorters and control blocks; the examples use one merge node fed by two sequential sorters.]
Original sequences:
16 4 20 6 10 12 18 2 14 8
11 3 9 5 13 1 19 7 17 15
Pre-sorted:
2 4 6 8 10 12 14 16 18 20
1 3 5 7 9 11 13 15 17 19
0 stalls, R = 0%, α = 0
Original sequences:
18 12 20 13 15 16 19 11 17 14
6 3 9 2 7 1 10 4 5 8
Pre-sorted:
11 12 13 14 15 16 17 18 19 20
1 2 3 4 5 6 7 8 9 10
10 stalls, R = 50%, α = 1.0
Original sequences:
6 12 9 13 7 1 19 4 17 14
18 3 20 2 15 16 10 11 5 8
Pre-sorted:
1 4 6 7 9 12 13 14 17 19
2 3 5 8 10 11 15 16 18 20
2 stalls, R = 17%, α = 0.2
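The slides do not spell out how R and α are defined. The three examples (0, 10 and 2 stalls for 20 numbers merged at 2 numbers per cycle) are consistent with R being the fraction of cycles lost to stalls and α being the number of stalls per stall-free cycle, so that R = α / (1 + α). The sketch below encodes that inferred reading; the function name and parameters are hypothetical.

```python
def stall_metrics(total_numbers, numbers_per_cycle, stalls):
    """Return (R, alpha) under the inferred definitions.

    alpha = stalls / ideal cycles
    R     = stalls / (ideal cycles + stalls)
    """
    ideal_cycles = total_numbers // numbers_per_cycle
    alpha = stalls / ideal_cycles
    rate = stalls / (ideal_cycles + stalls)
    return rate, alpha


if __name__ == "__main__":
    for stalls in (0, 10, 2):
        rate, alpha = stall_metrics(20, 2, stalls)
        print(stalls, f"R = {rate:.0%}", f"alpha = {alpha:.1f}")
```

If the definitions in the original talk differ, only the two formulas inside `stall_metrics` would need to change.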
[Figure: stall rate (0.00 to 0.45) versus FIFO depth (2 to 8) for the merge tree, with one curve each for S=2, S=4, S=8 and S=16.]
[Figure: merge tree of sequential sorters with a shuffler driven by a random sequence generator.]
Original sequences:
18 12 20 13 15 16 19 11 17 14
6 3 9 2 7 1 10 4 5 8
Randomly shuffled:
6 12 9 13 7 1 19 4 17 14
18 3 20 2 15 16 10 11 5 8
Pre-sorted:
1 4 6 7 9 12 13 14 17 19
2 3 5 8 10 11 15 16 18 20
2 stalls, R = 17%, α = 0.2
Random seq.: 1 1 1 1 1
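The effect of the shuffler can be reproduced with a pairwise swap controlled by one bit per cycle. This is a minimal sketch: the swap-bit pattern below is inferred by comparing the original and shuffled sequences on the slide (only the five ones are visible in the text), and the function name `shuffle_pairs` is illustrative. How the random sequence is actually generated in hardware is not described here.

```python
def shuffle_pairs(seq_a, seq_b, swap_bits):
    """Swap the two numbers of a cycle whenever the matching bit is 1."""
    out_a, out_b = [], []
    for a, b, bit in zip(seq_a, seq_b, swap_bits):
        if bit:
            a, b = b, a
        out_a.append(a)
        out_b.append(b)
    return out_a, out_b


ORIGINAL_A = [18, 12, 20, 13, 15, 16, 19, 11, 17, 14]
ORIGINAL_B = [6, 3, 9, 2, 7, 1, 10, 4, 5, 8]
SWAP_BITS = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]  # inferred from the slide's example

if __name__ == "__main__":
    shuffled_a, shuffled_b = shuffle_pairs(ORIGINAL_A, ORIGINAL_B, SWAP_BITS)
    print(shuffled_a, sorted(shuffled_a))
    print(shuffled_b, sorted(shuffled_b))
```

Without shuffling, one sorter holds 11 to 20 and the other 1 to 10, the case with 10 stalls. After shuffling, both sorters hold a mix of small and large numbers, the case with 2 stalls.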
Sorters                      Frequency  Slices  RAM   Records  Throughput
PMT(4) [Virtex 7]            226 MHz    70%     100%  459K     51.5 Gb/s
PMT(8) [Virtex 7]            206 MHz    96%     26%   115K     91.4 Gb/s
PMT(16) No seq. [Virtex 7]   202 MHz    96%     0%    -        193.6 Gb/s
* [Virtex 5]                 252 MHz    74%     98%   43K      16 Gb/s
PCIe 2.1 x8 [Virtex 7]       250 MHz    -       -     -        32 Gb/s
PCIe 3.0 x8 [Virtex 7]       250 MHz    -       -     -        64 Gb/s
* D. Koch and J. Torresen, “FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting,” in Proc. of International Symposium on Field Programmable Gate Arrays, February 2011, pp. 45–54.