

SLIDE 1

Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level?

Pierre Matri*, Alexandru Costan✝, Gabriel Antoniu◇, Jesús Montes*, María S. Pérez*, Luc Bougé◇

* Universidad Politécnica de Madrid, Madrid, Spain — ✝ INSA Rennes / IRISA, Rennes, France — ◇ Inria Rennes Bretagne-Atlantique, Rennes, France

SLIDE 2

1

A Catalyst for Convergence: Data Science

SLIDE 3

Can we build a converged storage system for HPC and Big Data?

2

An Approach: The BigStorage H2020 Project

SLIDE 4

3

The BigStorage Consortium

SLIDE 5

One concern

4

SLIDE 6

HPC App

5

SLIDE 7

HPC App → (POSIX) → File System

6

SLIDE 8

HPC App → (POSIX) → File System

SLIDE 9

  • Folder / file hierarchies
  • Permissions
  • Atomic file renaming
  • Multi-user protection
  • Supports random reads and writes to files

7

SLIDE 10

Supports random reads and writes to files

8

SLIDE 11

Supports random reads and writes to files

8

Objects

SLIDE 12

HPC App → Object Storage System

9

SLIDE 13

HPC App → Object Storage System ← Big Data App

10

SLIDE 14

HPC App → Object Storage System | Big Data App → FS, DB, K/V Store

10

SLIDE 15

HPC App | Big Data App → DB, K/V Store, FS → Object Storage System

10

SLIDE 16

HPC App → Object Storage System | Big Data App → DB, K/V Store → Object Storage System

10

SLIDE 17

HPC App → Converged Object Storage System ← Big Data App (DB, K/V Store)

11

SLIDE 18

MonALISA: the monitoring platform of the CERN LHC ALICE experiment

A Big Data use-case

12

SLIDE 19

One problem… A scientific monitoring service for the CERN LHC ALICE experiment:

  • Ingests events at a rate of up to 16 GB/s
  • Produces more than 10^9 data files per year
  • Computes 35,000+ aggregates in real time

The current lock-based platform does not scale.

…multiple requirements:

  • Multi-object write synchronization support
  • Atomic, lock-free writes
  • High-performance reads
  • Horizontal scalability

13

SLIDE 20

Why is write synchronization needed?

Aggregate computation is a three-step operation:

  • 1. Read current value remotely from storage
  • 2. Update it with the new data
  • 3. Write the updated value remotely to storage

The aggregate update needs to be atomic (transactions). Also, adding new data to persistent storage and updating the related aggregates need to be performed atomically.

14

[Sequence diagram: the client issues read(count) to the object storage system and gets 5, then issues write(count, 6) and receives an ack; the whole read-update-write span must be synchronized (sync)]
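To make the race concrete, here is a minimal, self-contained C sketch (not Týr or MonALISA code) contrasting an unsynchronized read-update-write aggregate update with one protected by a lock, which stands in for an atomic, storage-side update; all names are illustrative.

```c
/* Minimal sketch (not Týr code): two threads update a shared aggregate.
 * The unsynchronized read-update-write path loses updates; the mutex-protected
 * path stands in for an atomic, storage-side update.
 * Build: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static long unsafe_count = 0;              /* read-update-write, no sync */
static long safe_count = 0;                /* updated under a lock       */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        long v = unsafe_count;             /* 1. read current value      */
        v = v + 1;                         /* 2. update it               */
        unsafe_count = v;                  /* 3. write it back (racy)    */

        pthread_mutex_lock(&lock);         /* "atomic" aggregate update  */
        safe_count++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("read-update-write: %ld (expected 200000)\n", unsafe_count);
    printf("atomic update:     %ld\n", safe_count);
    return 0;
}
```

Run twice and the unsynchronized counter typically falls short of 200000, which is exactly the lost-update problem the slide describes when the aggregate lives in remote storage.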

SLIDE 21

At which level to handle concurrency management?

15

SLIDE 22

[Diagram: application threads (Thread 1, 2, 3) accessing an Object Storage System through a synchronization layer, vs. directly using a Transactional Object Storage System]

At the application level? Enables fine-grained synchronization (application knowledge)… but significantly complicates application design, and typically only guarantees isolation.

At a middleware level? Eases application design… but has a performance cost (zero knowledge), and usually also only guarantees isolation.

At a storage level? Also eases application design, offers better performance than middleware (storage knowledge), and may offer additional consistency guarantees.

16

SLIDE 23

Aren’t existing transactional object stores enough?

17

SLIDE 24

Not quite. Existing transactional systems typically only ensure consistency of writes. In most current systems, reads are performed atomically only because objects are small enough to be located on a single server, i.e.:

  • Records for database systems
  • Values for key-value stores

Yet, for large objects, reads spanning multiple chunks should always return a consistent view.

18

SLIDE 25

Týr transactional design

Týr internally maps all writes to transactions:

  • Multi-chunk and even multi-object operations are processed in a serializable order
  • Ensures that all chunk replicas are consistent

Týr uses a high-performance, sequentially-consistent transaction chain algorithm: WARP [1].

19

[1] R. Escriva et al. – Warp: Lightweight Multi-Key Transactions for Key-Value Stores
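As an illustration of what this buys the application, here is a minimal, hypothetical sketch of grouping the data append and the aggregate update into one atomic commit. The txn_* handle and functions are illustrative stubs for the presentation's write path, not Týr's actual API.

```c
/* Hypothetical client-side sketch (NOT Týr's real API): group the event
 * append and the aggregate update into one atomic, multi-object commit.
 * The txn_* functions below are illustrative stubs. */
#include <stdio.h>
#include <string.h>

typedef struct { int pending; } txn_t;     /* stand-in transaction handle */

static void txn_begin(txn_t *t) { t->pending = 1; }

static void txn_write(txn_t *t, const char *obj, const void *buf, size_t len) {
    (void)t; (void)buf;
    printf("stage write: %s (%zu bytes)\n", obj, len);
}

static int txn_commit(txn_t *t) {
    t->pending = 0;
    puts("commit: both writes become visible atomically, or neither does");
    return 0;
}

int main(void) {
    const char *event = "event-payload";
    long new_aggregate = 42;

    txn_t t;
    txn_begin(&t);
    txn_write(&t, "events/2017-04", event, strlen(event));              /* append new data  */
    txn_write(&t, "aggregates/daily", &new_aggregate, sizeof new_aggregate); /* update aggregate */
    return txn_commit(&t);
}
```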

SLIDE 26

Týr is alive!

Fully implemented as a prototype with ~22,000 lines of C. Lock-free, queue-free, asynchronous design. Leveraging well-known technologies:

  • Google LevelDB [1] for node-local persistent storage,
  • Google FlatBuffers [2] for message serialization,
  • UDT [3] as network transfer protocol.

20

[1] http://leveldb.org/
[2] https://google.github.io/flatbuffers
[3] http://udt.sourceforge.net/
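For illustration, a minimal sketch of node-local chunk persistence using the LevelDB C API. This is not Týr source code, and the composite "object:chunk" key layout is an assumption.

```c
/* Minimal sketch of node-local chunk persistence with the LevelDB C API.
 * Illustrative only: the "object:chunk" key layout is an assumption, not
 * Týr's actual on-disk format. Build with -lleveldb. */
#include <leveldb/c.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/tyr-chunks", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* Persist one chunk of an object under a composite key. */
    const char *key = "monalisa-aggregates:chunk-0000";
    const char *chunk = "serialized chunk payload";
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, key, strlen(key), chunk, strlen(chunk), &err);

    /* Read it back. */
    size_t len = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, key, strlen(key), &len, &err);
    printf("read %zu bytes: %.*s\n", len, (int)len, val);

    leveldb_free(val);
    leveldb_writeoptions_destroy(wopts);
    leveldb_readoptions_destroy(ropts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```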

SLIDE 27

Týr evaluation with MonALISA

MonALISA data collection was re-implemented atop Týr and evaluated using real data. Týr was compared to other state-of-the-art, object-based storage systems:

  • RADOS / librados (Ceph)
  • Azure Storage Blobs (Microsoft)
  • BlobSeer (Inria)

Experiments run on the Microsoft Azure cloud, on up to 256 nodes. Replication factor of 3 for all systems.

21

SLIDE 28

Synchronized write performance: evaluating transactional write performance

[Plot: average throughput (millions of ops/sec) vs. concurrent writers, from 25 to 500; series: Týr (atomic operations), Týr (read-update-write), RADOS (synchronized), BlobSeer (synchronized), Azure Blobs (synchronized)]

  • We add fine-grained, application-level, lock-based synchronization to Týr's competitors
  • The performance of Týr's competitors decreases due to the synchronization cost
  • Clear advantage of atomic operations over read-update-write aggregate updates

22

SLIDE 29

Read performance

[Plot: average throughput (millions of ops/sec) vs. concurrent readers, from 25 to 500; series: Týr, RADOS, BlobSeer, Azure Blobs]

  • We simulate MonALISA reads, varying the number of concurrent readers
  • Slightly lower performance than RADOS, but Týr offers read consistency guarantees
  • Týr's lightweight read protocol allows it to outperform BlobSeer and Azure Storage

23

SLIDE 30

The next step: Týr for HPC applications? Týr as a base layer for higher-level storage abstractions?

[Diagram: HPC App and Big Data App (K/V Store, RDB) on top of a Converged Object Storage System]

24

SLIDE 31

Before that: A study of feasibility

25

SLIDE 32

HPC App → HPC PFS | Big Data App

26

Current storage stack

[Diagram: HPC apps → I/O Library → HPC PFS; Big Data apps → Big Data Framework → Big Data DFS; applications issue I/O library / BD framework calls, which are translated into POSIX-like calls to the underlying storage]

SLIDE 33

HPC App Big Data App

27

“Converged” storage stack

[Diagram: HPC apps → I/O Library → HPC Adapter; Big Data apps → Big Data Framework → Big Data Adapter; the adapters translate I/O library / BD framework calls and POSIX-like calls into object-based storage calls to the Converged Object Storage System]

SLIDE 34

Object-oriented primitives

  • Object Access: random object read, object size
  • Object Manipulation: random object write, truncate
  • Object Administration: create object, delete object
  • Namespace Access: scan all objects
  • These operations are similar to those permitted by the POSIX-IO API on a single file
  • Directory-level operations have no object-based storage counterpart, due to the flat nature of these systems
  • They are few in number, and can be emulated using the scan operation (far from optimal, but compensated by the gains permitted by a flat namespace and simpler semantics); see the interface sketch below

28
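The primitives listed above could be captured by a narrow interface along these lines. This is a hypothetical sketch in C; all names and signatures are illustrative assumptions rather than an actual API.

```c
/* Hypothetical sketch of the narrow object-store interface implied by the
 * primitives above. Names and signatures are illustrative assumptions,
 * not Týr's or any specific system's API. */
#include <stdint.h>
#include <stddef.h>

typedef struct object_store object_store_t;   /* opaque backend handle */

typedef struct {
    /* Object access */
    int64_t (*read)    (object_store_t *s, const char *obj, uint64_t off,
                        void *buf, size_t len);
    int     (*size)    (object_store_t *s, const char *obj, uint64_t *out);
    /* Object manipulation */
    int64_t (*write)   (object_store_t *s, const char *obj, uint64_t off,
                        const void *buf, size_t len);
    int     (*truncate)(object_store_t *s, const char *obj, uint64_t new_size);
    /* Object administration */
    int     (*create)  (object_store_t *s, const char *obj);
    int     (*remove)  (object_store_t *s, const char *obj);
    /* Namespace access: visit every object in the flat namespace */
    int     (*scan)    (object_store_t *s,
                        int (*visit)(const char *obj, void *arg), void *arg);
} object_store_ops_t;
```

Note how closely this mirrors POSIX I/O on a single file; only the scan call is needed to stand in for the missing directory hierarchy.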

SLIDE 35

29

Representative set of HPC/BD applications:

| Platform    | Application         | Usage                | Total reads | Total writes | R/W ratio | Profile         |
| HPC/MPI     | mpiBLAST            | Protein docking      | 27.7 GB     | 12.8 MB      | 2.1*10^3  | Read-intensive  |
| HPC/MPI     | MOM                 | Oceanic model        | 19.5 GB     | 3.2 GB       | 6.01      | Read-intensive  |
| HPC/MPI     | ECOHAM              | Sediment propagation | 0.4 GB      | 9.7 GB       | 4.2*10^-2 | Write-intensive |
| HPC/MPI     | Ray Tracing         | Video processing     | 67.4 GB     | 71.2 GB      | 0.94      | Balanced        |
| Cloud/Spark | Sort                | Text processing      | 5.8 GB      | 5.8 GB       | 1.00      | Balanced        |
| Cloud/Spark | Connected Component | Graph processing     | 13.1 GB     | 71.2 MB      | 1.8*10^2  | Read-intensive  |
| Cloud/Spark | Grep                | Text processing      | 55.8 GB     | 863.8 MB     | 64.52     | Read-intensive  |
| Cloud/Spark | Decision Tree       | Machine learning     | 59.1 GB     | 4.7 GB       | 12.58     | Read-intensive  |
| Cloud/Spark | Tokenizer           | Text processing      | 55.8 GB     | 235.7 GB     | 0.24      | Write-intensive |

SLIDE 36

30

SLIDE 37

31

Directory operations observed in the applications:

| Operation                      | Action              | Operation count |
| mkdir                          | Create directory    | 43              |
| rmdir                          | Remove directory    | 43              |
| opendir (input data directory) | Open/List directory | 5               |
| opendir (other directories)    | Open/List directory |                 |

POSIX calls rewritten to object-based calls over a flat namespace:

| Original operation | Rewritten operation                        |
| create(/foo/bar)   | create(/foo__bar)                          |
| open(/foo/bar)     | open(/foo__bar)                            |
| read(fd)           | read(bd)                                   |
| write(fd)          | write(bd)                                  |
| mkdir(/foo)        | Dropped operation                          |
| opendir(/foo)      | scan(/), return all files matching /foo__* |
| rmdir(/foo)        | scan(/), remove all files matching /foo__* |
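A minimal sketch of the path rewriting illustrated above, mapping hierarchical POSIX paths onto a flat object namespace. flatten_path is a hypothetical helper, not code from the actual adapter.

```c
/* Sketch of the path rewriting used to map a hierarchical POSIX namespace
 * onto a flat object namespace: '/' separators inside the path become "__".
 * flatten_path() is a hypothetical helper, not code from the adapter. */
#include <stdio.h>
#include <string.h>

/* "/foo/bar" -> "foo__bar" (leading '/' dropped, inner '/' -> "__") */
static void flatten_path(const char *path, char *out, size_t out_len) {
    size_t j = 0;
    if (*path == '/') path++;                 /* drop leading slash */
    for (; *path && j + 3 < out_len; path++) {
        if (*path == '/') { out[j++] = '_'; out[j++] = '_'; }
        else              { out[j++] = *path; }
    }
    out[j] = '\0';
}

int main(void) {
    char obj[256];
    flatten_path("/foo/bar", obj, sizeof obj);
    printf("create(%s)\n", obj);              /* create(foo__bar) */
    /* opendir("/foo") is emulated by scanning the flat namespace for the
     * prefix "foo__"; mkdir/rmdir have no object-level counterpart. */
    printf("scan(/), match prefix %s__*\n", "foo");
    return 0;
}
```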

SLIDE 38

32

SLIDE 39

Týr and RADOS vs. Lustre (HPC) and HDFS/CephFS (Big Data)

  • Grid’5000 experimental testbed, distributed over 11 sites in France and Luxembourg (parapluie cluster, Rennes)
  • Nodes: 2 x 12-core 1.7 GHz 6164 HE, 48 GB of RAM, and a 250 GB HDD
  • HPC applications: Lustre 2.9.0 and MPICH 3.2, on a 32-node cluster
  • Big Data applications: Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken, on a 32-node cluster

33

SLIDE 40

34

HPC applications

SLIDE 41

35

BD applications

SLIDE 42

36

HPC/BD applications

SLIDE 43

Conclusions

  • Týr is a novel high-performance object-based storage system providing built-in multi-object transactions
  • Object-based storage convergence is possible, leading to a significant performance improvement on both platforms (HPC and Cloud)
  • Completion time improves by up to 25% for Big Data applications and by 15% for HPC applications when using object-based storage

37

SLIDE 44

Thank you!