OpenCL-Based Erasure Coding on Heterogeneous Architectures - PowerPoint PPT Presentation

SLIDE 1

OpenCL-Based Erasure Coding on Heterogeneous Architectures

Guoyang Chen, Huiyang Zhou, Xipeng Shen (North Carolina State University); Josh Gahm, Narayan Venkat, Skip Booth, John Marshall (Cisco Systems, Inc)
Email: gchen11@ncsu.edu

SLIDE 2

Introduction

  • A key challenge in storage systems:
  • Failures (disk sector, entire disk, storage site)
  • A solution:
  • Erasure coding
  • Intel's Intelligent Storage Acceleration Library (ISA-L)

SLIDE 3

Motivation

  • Erasure coding approaches:
  • Replication (simple; high storage cost; low fault tolerance)
  • Reed-Solomon coding (lower storage cost; high fault tolerance; computationally complex)
  • ......
  • Motivation: explore using various heterogeneous architectures to accelerate Reed-Solomon coding.

SLIDE 4

Reed-Solomon Coding

  • Block-based parity encoding
  • Inputs are partitioned into 'srcs' blocks, with a block size of 'length' bytes.
  • Encode matrix V: dests > srcs

Dest = V × Src

SLIDE 5

Reed-Solomon Coding

  • Block-based parity encoding
  • Inputs are partitioned into 'srcs' blocks, with a block size of 'length' bytes.
  • Encode matrix V: dests > srcs

Dest[m][j] = Σ_{k=0}^{srcs−1} V[m][k] × Src[k][j]

Dest = V × Src

SLIDE 6

Reed-Solomon Coding

  • Block-based parity encoding
  • Inputs are partitioned into 'srcs' blocks, with a block size of 'length' bytes.
  • Encode matrix V: dests > srcs
  • sum: 8-bit XOR operation; mul: GF(2^8) multiplication

Dest[m][j] = Σ_{k=0}^{srcs−1} V[m][k] × Src[k][j]

Dest = V × Src
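The encoding loop above can be sketched in plain C. This is a simplified, hypothetical host-side sketch (not the paper's code); `gf_mul` here uses the Russian Peasant method with the common primitive polynomial 0x11D, which is an assumption, since the slides do not fix a polynomial:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) multiply via the Russian Peasant method, reduction poly 0x11D
   (an assumed choice; any primitive degree-8 polynomial works the same way). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;                      /* conditionally add (XOR) */
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));  /* double, then reduce */
    }
    return p;
}

/* Dest[m][j] = XOR-sum over k of V[m][k] * Src[k][j],
   for m in [0, dests) and j in [0, length). */
void rs_encode(int dests, int srcs, size_t length,
               const uint8_t *V,             /* dests x srcs encode matrix, row-major */
               const uint8_t *const *src,    /* srcs input blocks of 'length' bytes */
               uint8_t *const *dest)         /* dests output blocks of 'length' bytes */
{
    for (int m = 0; m < dests; m++) {
        for (size_t j = 0; j < length; j++) {
            uint8_t acc = 0;
            for (int k = 0; k < srcs; k++)
                acc ^= gf_mul(V[m * srcs + k], src[k][j]);
            dest[m][j] = acc;
        }
    }
}
```

With srcs = 2, dests = 3, and the encode rows (1,0), (0,1), (1,1), the first two destination blocks reproduce the sources and the third is their XOR, illustrating why dests > srcs gives redundancy.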

SLIDE 7

GF(2^8) Multiplication

  • 3 ways to perform Galois Field multiplication:
  • Russian Peasant algorithm: pure logic operations.
  • 2 small tables: 256 bytes per table, 3 table lookups, 3 logic operations.
  • 1 large table: 256×256 bytes, no logic operations, one lookup.

Refer to the paper for details.

SLIDE 8

Reed-Solomon Coding on CPUs

  • Intel ISA-L.
  • Single thread.
  • Baseline.
  • Adding multithreading support.
  • Partition the input matrix in a column-wise manner.
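The column-wise partition can be sketched as follows: each thread is assigned a contiguous range of byte columns, so every thread reads the whole (shared, read-only) encode matrix but writes disjoint bytes of every destination block, needing no synchronization. The helper below is hypothetical, not from the paper:

```c
#include <assert.h>
#include <stddef.h>

/* Assign thread t (of nthreads) the column range [*start, *end) of a
   'length'-byte row, spreading the remainder over the first threads. */
void partition_columns(size_t length, int nthreads, int t,
                       size_t *start, size_t *end)
{
    size_t base = length / (size_t)nthreads;
    size_t rem  = length % (size_t)nthreads;
    *start = (size_t)t * base + ((size_t)t < rem ? (size_t)t : rem);
    *end   = *start + base + ((size_t)t < rem ? 1 : 0);
}
```

Each thread then runs the single-threaded ISA-L encode over its own column slice; the slices tile the row exactly, with no gaps or overlaps.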

SLIDE 9

Reed-Solomon Coding on GPUs

  • The computation for one element of the output matrix is independent of the others.
  • Fine-grained parallelization
  • Each workitem computes one byte of the output matrix. (Baseline)
  • Optimizations?

SLIDE 10

Reed-Solomon Coding on GPUs - Opt (A)

  • A. Optimize GPU memory bandwidth.
  • Memory coalescing (workitems in one group access data in the same row).
  • Vectorization (read one uint4 at a time) ==> higher bandwidth.
  • Each workitem handles 16 bytes of data.
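The vectorized kernel body can be emulated in scalar C to show the access pattern: each "workitem" (indexed by a hypothetical global id `gid`) loads one 16-byte (uint4-wide) chunk per source block, so consecutive workitems touch consecutive chunks of the same row and their loads coalesce. This is a sketch under those assumptions, not the actual OpenCL kernel; `gf_mul` again assumes polynomial 0x11D:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) multiply (Russian Peasant, assumed poly 0x11D). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return p;
}

/* One workitem's work: 16 bytes of destination row m, chunk index gid.
   In the real kernel the inner 16-byte loop is a uint4 vector operation. */
void encode_workitem_16B(int m, size_t gid, int srcs,
                         const uint8_t *V,            /* encode matrix, row-major */
                         const uint8_t *const *src,   /* source blocks */
                         uint8_t *dest_row)           /* destination block m */
{
    uint8_t acc[16] = {0};
    for (int k = 0; k < srcs; k++) {
        const uint8_t *s = src[k] + gid * 16;  /* one coalesced 16-byte load */
        uint8_t c = V[m * srcs + k];
        for (int b = 0; b < 16; b++)
            acc[b] ^= gf_mul(c, s[b]);
    }
    for (int b = 0; b < 16; b++)
        dest_row[gid * 16 + b] = acc[b];       /* one coalesced 16-byte store */
}
```

Processing 16 bytes per workitem cuts the number of workitems (and per-item index arithmetic) by 16x while turning each memory transaction into a full-width vector access.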

SLIDE 11

Reed-Solomon Coding on GPUs - Opt (B)

  • B. Overcoming the memory bandwidth limit using texture caches and tiling.
  • Workitems in the same row share the same value in V.

==> Put the encode matrix and the large lookup table (64KB, for GF(2^8) multiplication) in the texture cache.

Dest = V × Src

SLIDE 12

Reed-Solomon Coding on GPUs - Opt (B)

  • B. Overcoming the memory bandwidth limit using texture caches and tiling.
  • Workitems in the same row share the same value in V.

==> Put the encode matrix and the large lookup table (64KB, for GF(2^8) multiplication) in the texture cache.

  • Src in the texture cache by using tiling (as in matrix multiplication).
  • Not helpful. Bottleneck: computation-bound.

Dest = V × Src

SLIDE 13

Reed-Solomon Coding on GPUs - Opt (C)

  • C. Hiding data transmission latency over PCIe.
  • Partition the input into multiple groups.
  • One stream per group.
  • Hide data copy time with computation time.

[Diagram: H2D → Compute → D2H stages of Streams 1..N overlapped in time]

SLIDE 14

Reed-Solomon Coding on GPUs - Opt (D)

  • D. Shared virtual memory to eliminate memory copying.
  • Shared virtual memory (SVM) is supported in OpenCL 2.0.
  • AMD APUs.
  • No need for data copies.

SLIDE 15

Reed-Solomon Coding on FPGAs

  • FPGAs
  • Abundant on-chip logic for computation.
  • Pipelined parallelism instead of the data parallelism used on GPUs.
  • Relatively low memory access bandwidth.
  • Reed-Solomon coding
  • Computation-bound.
  • A good candidate for FPGAs.
  • Same baseline code as used on GPUs. (1 workitem per byte)

SLIDE 16

Reed-Solomon Coding on FPGAs - Opt (A)

  • A. Vectorization to optimize FPGA memory bandwidth.
  • One workitem reads 64 bytes from the input.

SLIDE 17

Reed-Solomon Coding on FPGAs - Opt (B)

  • B. Overcoming the memory bandwidth limit using tiling.
  • Load a tile from the input matrix into local memory shared by a workgroup.
  • A large tile size results in high data reuse and reduces off-chip memory bandwidth.

SLIDE 18

Reed-Solomon Coding on FPGAs - Opt (C)

  • C. Loop unrolling and kernel replication to fully utilize FPGA logic resources.
  • __attribute__((num_compute_units(n))): n pipelines.
  • Loop unrolling: deeper pipeline.

SLIDE 19

Experiments

  • Input: 836.9MB file.
  • CPU: Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores)
  • GPU: NVIDIA K40m, CUDA 7.0; AMD Carrizo.
  • FPGA: Altera Stratix V A7.

SLIDE 20

On CPU

  • srcs = 30, dests = 33

[Chart: encode bandwidth (GB/s) vs. number of threads; peak of 2.84 GB/s at 56 threads]

SLIDE 21

On NVIDIA K40m

  • One stream:
  • Best: large table (2.15 GB/s)
  • 8 streams: 3.9 GB/s

[Chart: encode bandwidth]

SLIDE 22

On AMD Carrizo SVM

  • Not as good as streaming.
  • The texture cache doesn't work well.
  • Overhead of the blocking functions to map and unmap SVM buffers.

[Chart: encode bandwidth (GB/s) for char/int/int4 accesses, SVM vs. streaming]

SLIDE 23

On FPGA

  • DMA read/write: about 3 GB/s.
  • Only focus on kernel throughput.
  • Assume the DMA engine bandwidth can be easily increased.

[Chart: encode bandwidth (GB/s, log scale) for char/int/int16 variants with tiling and unrolling, under large-table, small-table, and Russian Peasant GF multiplication]

SLIDE 24

Overall

  • Considering the price, the FPGA platform is the most promising, but its current PCIe DMA interface needs to be improved.

[Chart: encode bandwidth (GB/s) vs. srcs (dests = srcs + 3) for GPU, FPGA, multithreaded CPU (MC-CPU), and single-threaded CPU (ST-CPU)]

SLIDE 25

NEW Update: Kernel + Memory Copy Between Host and Device

[Chart: encode bandwidth (GB/s) for file1 and file2 on BDW+SVM, BDW, Arria 10, and Stratix V]

File 1 has a size of 29MB; file 2 has a size of 438MB.
BDW: integrated FPGA (Arria 10) on a Xeon core.
SVM (shared virtual memory): the map/unmap overhead is included.
Arria 10: discrete FPGA board through PCIe.
Stratix V: discrete FPGA board through PCIe.

SLIDE 26

Conclusions

  • Explored different computing devices for erasure codes.
  • Different optimizations for different devices.
  • The FPGA is the most promising device for erasure codes.