Improving Spark Performance with Zero-copy Buffer Management and RDMA - PowerPoint PPT Presentation



SLIDE 1

Improving Spark Performance with Zero-copy Buffer Management and RDMA

Hu Li, Charley Chen and Wei Xu Institute for Interdisciplinary Information Sciences Tsinghua University, China

SLIDE 2

Latency matters in big data

[Kay@SOSP13] Big data: not only capable, but also interactive

Figure (job latencies): from MapReduce batch jobs [2004] at roughly 10 minutes, through Hive [2009], Dremel [2010], Impala [2012], and in-memory Spark [2010] queries at roughly 100 ms to 10 seconds, down to Spark Streaming [2013] at roughly 1 ms to 100 ms.

SLIDE 3

Overview of our work

  • NetSpark: a reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
  • A combination of memory-management optimizations that let JVM-based applications take advantage of RDMA more efficiently
  • Improves latency-sensitive task performance while staying fully compatible with off-the-shelf Spark

SLIDE 4

Background: Remote Direct Memory Access (RDMA)

Lower CPU utilization and lower latency

SLIDE 5

An overview of the NetSpark transfer model

Figure: on Machine A, the executor serializes an object into a byte array in JVM off-heap memory; the local RNIC DMA-reads that array and performs the network transfer; on Machine B, the RNIC DMA-writes the bytes into JVM off-heap memory, where the executor deserializes them back into an object.

SLIDE 6

Zero-copy network transfer

Figure: the traditional way serializes an object into a byte array on the JVM heap, copies it across the network API, then copies it again into kernel space via a system call. Our way serializes the object directly into an off-heap byte array that the RNIC DMA-reads, eliminating both copies.
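The zero-copy idea above can be sketched in plain Java: instead of serializing into a heap byte array and copying it, serialization writes straight into an off-heap (direct) `ByteBuffer`, whose stable address an RNIC could DMA-read. This is a minimal illustration, not NetSpark's actual code; the class names are ours.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class DirectSerialize {
    /** OutputStream that writes straight into an off-heap (direct) ByteBuffer. */
    public static class DirectBufferOutputStream extends OutputStream {
        final ByteBuffer buf;
        public DirectBufferOutputStream(ByteBuffer buf) { this.buf = buf; }
        @Override public void write(int b) { buf.put((byte) b); }
        @Override public void write(byte[] b, int off, int len) { buf.put(b, off, len); }
    }

    public static void main(String[] args) throws IOException {
        // Off-heap buffer: its memory is not moved by the GC heap,
        // so an RNIC could DMA-read it after registration.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 16);
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new DirectBufferOutputStream(offHeap))) {
            oos.writeObject("hello, zero-copy");   // serialize directly off-heap
        }
        offHeap.flip();
        System.out.println("serialized bytes: " + offHeap.remaining());
    }
}
```

The heap-to-kernel copy of the traditional path never happens: the only bytes ever produced already live in the buffer the NIC reads.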

SLIDE 7

Implementation: Spark executors

Figure: in a stock Spark executor, threads 1..N share a BlockManager and a TCP-based BlockTransferService with sending and receiving connections. A NetSpark executor keeps the same structure but swaps in an RDMA-based BlockTransferService backed by a shared BufferManager.
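The swap above works because the executor only depends on one transfer-service abstraction. A hypothetical sketch of that seam (the names below are illustrative, not Spark's real API):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical seam (not Spark's actual API): the executor calls only this
 *  interface, so a TCP-backed service can be replaced by an RDMA-backed one. */
interface BlockTransfer {
    byte[] fetchBlock(String blockId);
}

/** Stand-in implementation backed by an in-memory map, marking the swap point
 *  where NetSpark would plug in its RDMA service and BufferManager. */
public class InMemoryTransfer implements BlockTransfer {
    private final Map<String, byte[]> blocks = new HashMap<>();

    public void putBlock(String blockId, byte[] data) { blocks.put(blockId, data); }

    @Override public byte[] fetchBlock(String blockId) { return blocks.get(blockId); }
}
```

Because the interface is unchanged, the rest of the executor (and user jobs) run unmodified, which is how full compatibility with off-the-shelf Spark is preserved.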

SLIDE 8

RDMA buffer management

  • RDMA require a fixed physical memory address
  • for Java: off-heap
  • Significant allocate/de-allocate cost
  • Need to register to RDMA
  • High overhead

Simple solution: Pre-allocate RDMA buffer space to avoid allocation / register overhead
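A minimal sketch of such pre-allocation, assuming a fixed pool of direct buffers handed out and returned instead of being allocated per transfer. Class and method names are ours; a real implementation would also register each buffer with the RNIC once at startup.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Illustrative pre-allocated off-heap buffer pool (sizes/counts are examples).
 *  Allocation and (in a real system) RDMA registration happen once, up front. */
public class RdmaBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    public RdmaBufferPool(int count, int size) {
        free = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            free.add(ByteBuffer.allocateDirect(size));
        }
    }

    /** Borrow a buffer; blocks if all buffers are currently in use. */
    public ByteBuffer acquire() throws InterruptedException {
        return free.take();
    }

    /** Return a buffer for reuse instead of freeing and re-registering it. */
    public void release(ByteBuffer buf) {
        buf.clear();
        free.add(buf);
    }
}
```

Every transfer then pays only a queue operation instead of an off-heap allocation plus an RDMA memory registration.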

SLIDE 9

RDMA Buffer Management (cont’d)

  • A small number of large-enough, fixed-size off-heap buffers
  • Like the Linux kernel buffers, but in user space
  • But … we still need to copy from heap to off-heap
SLIDE 10

Serializing directly into the off-heap RDMA buffer

  • Rewrite Java InputStream and OutputStream to take advantage of the new buffer manager
  • Details in the paper
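The receiver side mirrors this: once the serialized bytes land in the off-heap buffer (in NetSpark via the RNIC's DMA write; simulated below), a rewritten InputStream can deserialize straight out of it with no intermediate heap array. A hedged sketch with illustrative class names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class DirectDeserialize {
    /** InputStream that reads straight from an off-heap (direct) ByteBuffer. */
    public static class DirectBufferInputStream extends InputStream {
        private final ByteBuffer buf;
        public DirectBufferInputStream(ByteBuffer buf) { this.buf = buf; }
        @Override public int read() { return buf.hasRemaining() ? (buf.get() & 0xFF) : -1; }
        @Override public int read(byte[] b, int off, int len) {
            if (!buf.hasRemaining()) return -1;
            int n = Math.min(len, buf.remaining());
            buf.get(b, off, n);
            return n;
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Simulate the RNIC's DMA write: serialized bytes appear in the off-heap buffer.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject("shuffle block");
        }
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 16);
        offHeap.put(bos.toByteArray());
        offHeap.flip();

        // Deserialize directly from off-heap memory.
        try (ObjectInputStream ois =
                 new ObjectInputStream(new DirectBufferInputStream(offHeap))) {
            String obj = (String) ois.readObject();
            System.out.println(obj);   // prints "shuffle block"
        }
    }
}
```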
SLIDE 11

Evaluation: Testbed

Figure: network topology of our testbed. Three switches interconnected by 3 × 40 Gb Ethernet links, with servers attached over 10 Gb Ethernet.

  • 1. 3 switches, 34 servers
  • 2. RoCE over 10 GbE
  • 3. Priority flow control enabled for RDMA to avoid packet loss

SLIDE 12

Evaluation: Experiment Setup

Compared four different executor implementations

  • 1. Java NIO
  • 2. Netty
  • 3. Naive RDMA
  • 4. NetSpark



 
(Spark version: 1.5.0. Latency plots show the min, 25th, 50th, and 75th percentiles, and the max.)

SLIDE 13

Group-by performance on small dataset

  • Spark's group-by example
  • 2.5 GB of data shuffled

About 17% improvement over the naive RDMA implementation
SLIDE 14

Why do we have an improvement?

  • CPU block time
  • Measurements taken from the Spark log
  • Following [Kay@NSDI15]
SLIDE 15

Group-by on larger data: the entire reduce stage

A larger dataset, about 107.3 GB shuffled; ~40% faster than Netty


SLIDE 16

PageRank on a large graph

Twitter graph dataset [Kwak@WWW2010]

  • 41 million nodes, 1.5 billion edges
  • 20% faster than Netty
  • 10% faster than naive RDMA

SLIDE 17

Conclusion

  • NetSpark: a reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
  • A combination of memory-management optimizations that let JVM-based applications take advantage of RDMA more efficiently
  • Improves latency-sensitive task performance while staying fully compatible with off-the-shelf Spark

Contact: Wei Xu, weixu@tsinghua.edu.cn