Improving Spark Performance with Zero-copy Buffer Management and RDMA - PowerPoint PPT Presentation



SLIDE 1

Improving Spark Performance with Zero-copy Buffer Management and RDMA

Hu Li, Charley Chen and Wei Xu Institute for Interdisciplinary Information Sciences Tsinghua University, China

SLIDE 2

Latency matters in big data

[Kay@SOSP13] Big data: not only capable, but also interactive

Figure (job latencies): from MapReduce batch jobs [2004] at roughly 10 minutes, through Hive [2009], Dremel [2010], Impala [2012], and in-memory Spark [2010] queries at roughly 100 ms to 10 seconds, down to Spark Streaming [2013] at roughly 1 ms to 100 ms.

SLIDE 3

Overview of our work

  • NetSpark: a reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
  • A combination of memory-management optimizations that let JVM-based applications take advantage of RDMA more efficiently
  • Improves latency-sensitive task performance while staying fully compatible with off-the-shelf Spark

SLIDE 4

Background: Remote Direct Memory Access (RDMA)

Lower CPU utilization and lower latency

SLIDE 5

An overview of the NetSpark transfer model

Figure: on Machine A, the executor serializes an object into a byte array in JVM off-heap memory; the local RNIC DMA-reads that array and performs the network transfer; on Machine B, the RNIC DMA-writes the bytes into JVM off-heap memory, where the executor deserializes them back into an object.

SLIDE 6

Zero-copy network transfer

Figure: the traditional way serializes an object into a byte array on the JVM heap, copies it across the network API, then copies it again into kernel space via a system call. Our way serializes the object directly into an off-heap byte array that the RNIC DMA-reads, eliminating both copies.
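The zero-copy idea above can be sketched in plain Java: instead of serializing into a heap byte array and copying it, serialization writes straight into an off-heap (direct) `ByteBuffer`, whose stable address an RNIC could DMA-read. This is a minimal illustration, not NetSpark's actual code; the class names are ours.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class DirectSerialize {
    /** OutputStream that writes straight into an off-heap (direct) ByteBuffer. */
    public static class DirectBufferOutputStream extends OutputStream {
        final ByteBuffer buf;
        public DirectBufferOutputStream(ByteBuffer buf) { this.buf = buf; }
        @Override public void write(int b) { buf.put((byte) b); }
        @Override public void write(byte[] b, int off, int len) { buf.put(b, off, len); }
    }

    public static void main(String[] args) throws IOException {
        // Off-heap buffer: its memory is not moved by the GC heap,
        // so an RNIC could DMA-read it after registration.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 16);
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new DirectBufferOutputStream(offHeap))) {
            oos.writeObject("hello, zero-copy");   // serialize directly off-heap
        }
        offHeap.flip();
        System.out.println("serialized bytes: " + offHeap.remaining());
    }
}
```

The heap-to-kernel copy of the traditional path never happens: the only bytes ever produced already live in the buffer the NIC reads.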

SLIDE 7

Implementation: Spark executors

Figure: in a stock Spark executor, threads 1..N share a BlockManager and a TCP-based BlockTransferService with sending and receiving connections. A NetSpark executor keeps the same structure but swaps in an RDMA-based BlockTransferService backed by a shared BufferManager.
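The swap above works because the executor only depends on one transfer-service abstraction. A hypothetical sketch of that seam (the names below are illustrative, not Spark's real API):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical seam (not Spark's actual API): the executor calls only this
 *  interface, so a TCP-backed service can be replaced by an RDMA-backed one. */
interface BlockTransfer {
    byte[] fetchBlock(String blockId);
}

/** Stand-in implementation backed by an in-memory map, marking the swap point
 *  where NetSpark would plug in its RDMA service and BufferManager. */
public class InMemoryTransfer implements BlockTransfer {
    private final Map<String, byte[]> blocks = new HashMap<>();

    public void putBlock(String blockId, byte[] data) { blocks.put(blockId, data); }

    @Override public byte[] fetchBlock(String blockId) { return blocks.get(blockId); }
}
```

Because the interface is unchanged, the rest of the executor (and user jobs) run unmodified, which is how full compatibility with off-the-shelf Spark is preserved.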

SLIDE 8

RDMA buffer management

  • RDMA require a fixed physical memory address
  • for Java: off-heap
  • Significant allocate/de-allocate cost
  • Need to register to RDMA
  • High overhead

Simple solution: Pre-allocate RDMA buffer space to avoid allocation / register overhead
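A minimal sketch of such pre-allocation, assuming a fixed pool of direct buffers handed out and returned instead of being allocated per transfer. Class and method names are ours; a real implementation would also register each buffer with the RNIC once at startup.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Illustrative pre-allocated off-heap buffer pool (sizes/counts are examples).
 *  Allocation and (in a real system) RDMA registration happen once, up front. */
public class RdmaBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    public RdmaBufferPool(int count, int size) {
        free = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            free.add(ByteBuffer.allocateDirect(size));
        }
    }

    /** Borrow a buffer; blocks if all buffers are currently in use. */
    public ByteBuffer acquire() throws InterruptedException {
        return free.take();
    }

    /** Return a buffer for reuse instead of freeing and re-registering it. */
    public void release(ByteBuffer buf) {
        buf.clear();
        free.add(buf);
    }
}
```

Every transfer then pays only a queue operation instead of an off-heap allocation plus an RDMA memory registration.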

SLIDE 9

RDMA Buffer Management (cont’d)

  • A small number of large-enough, fixed-size off-heap buffers
  • Like the Linux kernel buffers, but in user space
  • But … we still need to copy from heap to off-heap
SLIDE 10

Serializing directly into the off-heap RDMA buffer

  • Rewrite Java InputStream and OutputStream to take advantage of the new buffer manager
  • Details in the paper
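The receiver side mirrors this: once the serialized bytes land in the off-heap buffer (in NetSpark via the RNIC's DMA write; simulated below), a rewritten InputStream can deserialize straight out of it with no intermediate heap array. A hedged sketch with illustrative class names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class DirectDeserialize {
    /** InputStream that reads straight from an off-heap (direct) ByteBuffer. */
    public static class DirectBufferInputStream extends InputStream {
        private final ByteBuffer buf;
        public DirectBufferInputStream(ByteBuffer buf) { this.buf = buf; }
        @Override public int read() { return buf.hasRemaining() ? (buf.get() & 0xFF) : -1; }
        @Override public int read(byte[] b, int off, int len) {
            if (!buf.hasRemaining()) return -1;
            int n = Math.min(len, buf.remaining());
            buf.get(b, off, n);
            return n;
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Simulate the RNIC's DMA write: serialized bytes appear in the off-heap buffer.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject("shuffle block");
        }
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 16);
        offHeap.put(bos.toByteArray());
        offHeap.flip();

        // Deserialize directly from off-heap memory.
        try (ObjectInputStream ois =
                 new ObjectInputStream(new DirectBufferInputStream(offHeap))) {
            String obj = (String) ois.readObject();
            System.out.println(obj);   // prints "shuffle block"
        }
    }
}
```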
SLIDE 11

Evaluation: Testbed

Figure: network topology of our testbed. Three switches interconnected by 3 × 40 Gb Ethernet links, with servers attached over 10 Gb Ethernet.

  • 1. 3 switches, 34 servers
  • 2. RoCE over 10 GbE
  • 3. Priority flow control enabled for RDMA to avoid packet loss

SLIDE 12

Evaluation: Experiment Setup

Compared four different executor implementations

  • 1. Java NIO
  • 2. Netty
  • 3. Naive RDMA
  • 4. NetSpark



 
(Spark version: 1.5.0. Latency plots show the min, 25th, 50th, and 75th percentiles, and the max.)

SLIDE 13

Group-by performance on small dataset

  • Spark's group-by example
  • 2.5 GB of data shuffled

About 17% improvement over the naive RDMA implementation
SLIDE 14

Why do we have an improvement?

  • CPU block time
  • Measurements taken from the Spark log
  • Following [Kay@NSDI15]
SLIDE 15

Group-by on larger data: the entire reduce stage

A larger dataset, about 107.3 GB shuffled; ~40% faster than Netty


SLIDE 16

PageRank on a large graph

Twitter graph dataset [Kwak@WWW2010]

  • 41 million nodes, 1.5 billion edges
  • 20% faster than Netty
  • 10% faster than naive RDMA

SLIDE 17

Conclusion

  • NetSpark: a reliable Spark package that takes advantage of the RDMA over Converged Ethernet (RoCE) fabric
  • A combination of memory-management optimizations that let JVM-based applications take advantage of RDMA more efficiently
  • Improves latency-sensitive task performance while staying fully compatible with off-the-shelf Spark

Contact: Wei Xu, weixu@tsinghua.edu.cn