

SLIDE 1

Are objects the right level of abstraction to enable the convergence between HPC and Big Data at storage level?

Pierre Matri*, Alexandru Costan✝, Gabriel Antoniu◇, Jesús Montes*, María S. Pérez*, Luc Bougé◇

* Universidad Politécnica de Madrid, Madrid, Spain — ✝ INSA Rennes / IRISA, Rennes, France — ◇ Inria Rennes Bretagne-Atlantique, Rennes, France

SLIDE 2

1

A Catalyst for Convergence: Data Science

SLIDE 3

Can we build a converged storage system for HPC and Big Data?

2

An Approach: The BigStorage H2020 Project

SLIDE 4

3

The BigStorage Consortium

SLIDE 5

One concern

4

SLIDE 6

HPC App

5

SLIDE 7

HPC App → (POSIX) → File System

6

SLIDE 8

HPC App → (POSIX) → File System

SLIDE 9

  • Folder / file hierarchies
  • Permissions
  • Atomic file renaming
  • Multi-user protection
  • Supports random reads and writes to files

7

SLIDE 10

Supports random reads and writes to files

8

SLIDE 11

Supports random reads and writes to files

8

Objects

SLIDE 12

HPC App → Object Storage System

9

SLIDE 13

HPC App → Object Storage System ← Big Data App

10

SLIDE 14

HPC App → Object Storage System | Big Data App → FS, DB, K/V Store

10

SLIDE 15

HPC App | Big Data App → DB, K/V Store, FS → Object Storage System

10

SLIDE 16

HPC App → Object Storage System | Big Data App → DB, K/V Store → Object Storage System

10

SLIDE 17

HPC App → Converged Object Storage System ← Big Data App (DB, K/V Store)

11

SLIDE 18

MonALISA: the monitoring platform of the CERN LHC ALICE experiment

A Big Data use-case

12

SLIDE 19

One problem… A scientific monitoring service for the CERN LHC ALICE experiment:

  • Ingests events at a rate of up to 16 GB/s
  • Produces more than 10^9 data files per year
  • Computes 35,000+ aggregates in real time

The current lock-based platform does not scale.

…multiple requirements:

  • Multi-object write synchronization support
  • Atomic, lock-free writes
  • High-performance reads
  • Horizontal scalability

13

SLIDE 20

Why is write synchronization needed?

Aggregate computation is a three-step operation:

  • 1. Read current value remotely from storage
  • 2. Update it with the new data
  • 3. Write the updated value remotely to storage

The aggregate update needs to be atomic (transactions). Also, adding new data to persistent storage and updating the related aggregates need to be performed atomically.

14

[Sequence diagram: the client issues read(count) to the object storage system and gets 5, then issues write(count, 6) and receives an ack; the whole read-update-write span must be synchronized (sync)]
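To make the race concrete, here is a minimal, self-contained C sketch (not Týr or MonALISA code) contrasting an unsynchronized read-update-write aggregate update with one protected by a lock, which stands in for an atomic, storage-side update; all names are illustrative.

```c
/* Minimal sketch (not Týr code): two threads update a shared aggregate.
 * The unsynchronized read-update-write path loses updates; the mutex-protected
 * path stands in for an atomic, storage-side update.
 * Build: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static long unsafe_count = 0;              /* read-update-write, no sync */
static long safe_count = 0;                /* updated under a lock       */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        long v = unsafe_count;             /* 1. read current value      */
        v = v + 1;                         /* 2. update it               */
        unsafe_count = v;                  /* 3. write it back (racy)    */

        pthread_mutex_lock(&lock);         /* "atomic" aggregate update  */
        safe_count++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("read-update-write: %ld (expected 200000)\n", unsafe_count);
    printf("atomic update:     %ld\n", safe_count);
    return 0;
}
```

Run twice and the unsynchronized counter typically falls short of 200000, which is exactly the lost-update problem the slide describes when the aggregate lives in remote storage.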

SLIDE 21

At which level to handle concurrency management?

15

SLIDE 22

[Diagram: application threads (Thread 1, 2, 3) accessing an Object Storage System through a synchronization layer, vs. directly using a Transactional Object Storage System]

At the application level? Enables fine-grained synchronization (application knowledge)… but significantly complicates application design, and typically only guarantees isolation.

At a middleware level? Eases application design… but has a performance cost (zero knowledge), and usually also only guarantees isolation.

At a storage level? Also eases application design, offers better performance than middleware (storage knowledge), and may offer additional consistency guarantees.

16

SLIDE 23

Aren’t existing transactional object stores enough?

17

SLIDE 24

Not quite. Existing transactional systems typically only ensure consistency of writes. In most current systems, reads are performed atomically only because objects are small enough to be located on a single server, i.e.:

  • Records for database systems
  • Values for key-value stores

Yet, for large objects, reads spanning multiple chunks should always return a consistent view.

18

SLIDE 25

Týr transactional design

Týr internally maps all writes to transactions:

  • Multi-chunk and even multi-object operations are processed in a serializable order
  • Ensures that all chunk replicas are consistent

Týr uses a high-performance, sequentially-consistent transaction chain algorithm: WARP [1].

19

[1] R. Escriva et al. – Warp: Lightweight Multi-Key Transactions for Key-Value Stores
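As an illustration of what this buys the application, here is a minimal, hypothetical sketch of grouping the data append and the aggregate update into one atomic commit. The txn_* handle and functions are illustrative stubs for the presentation's write path, not Týr's actual API.

```c
/* Hypothetical client-side sketch (NOT Týr's real API): group the event
 * append and the aggregate update into one atomic, multi-object commit.
 * The txn_* functions below are illustrative stubs. */
#include <stdio.h>
#include <string.h>

typedef struct { int pending; } txn_t;     /* stand-in transaction handle */

static void txn_begin(txn_t *t) { t->pending = 1; }

static void txn_write(txn_t *t, const char *obj, const void *buf, size_t len) {
    (void)t; (void)buf;
    printf("stage write: %s (%zu bytes)\n", obj, len);
}

static int txn_commit(txn_t *t) {
    t->pending = 0;
    puts("commit: both writes become visible atomically, or neither does");
    return 0;
}

int main(void) {
    const char *event = "event-payload";
    long new_aggregate = 42;

    txn_t t;
    txn_begin(&t);
    txn_write(&t, "events/2017-04", event, strlen(event));              /* append new data  */
    txn_write(&t, "aggregates/daily", &new_aggregate, sizeof new_aggregate); /* update aggregate */
    return txn_commit(&t);
}
```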

SLIDE 26

Týr is alive!

Fully implemented as a prototype with ~22,000 lines of C. Lock-free, queue-free, asynchronous design. Leveraging well-known technologies:

  • Google LevelDB [1] for node-local persistent storage,
  • Google FlatBuffers [2] for message serialization,
  • UDT [3] as network transfer protocol.

20

[1] http://leveldb.org/
[2] https://google.github.io/flatbuffers
[3] http://udt.sourceforge.net/
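For illustration, a minimal sketch of node-local chunk persistence using the LevelDB C API. This is not Týr source code, and the composite "object:chunk" key layout is an assumption.

```c
/* Minimal sketch of node-local chunk persistence with the LevelDB C API.
 * Illustrative only: the "object:chunk" key layout is an assumption, not
 * Týr's actual on-disk format. Build with -lleveldb. */
#include <leveldb/c.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char *err = NULL;
    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/tyr-chunks", &err);
    if (err) { fprintf(stderr, "open: %s\n", err); return 1; }

    /* Persist one chunk of an object under a composite key. */
    const char *key = "monalisa-aggregates:chunk-0000";
    const char *chunk = "serialized chunk payload";
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    leveldb_put(db, wopts, key, strlen(key), chunk, strlen(chunk), &err);

    /* Read it back. */
    size_t len = 0;
    leveldb_readoptions_t *ropts = leveldb_readoptions_create();
    char *val = leveldb_get(db, ropts, key, strlen(key), &len, &err);
    printf("read %zu bytes: %.*s\n", len, (int)len, val);

    leveldb_free(val);
    leveldb_writeoptions_destroy(wopts);
    leveldb_readoptions_destroy(ropts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```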

SLIDE 27

Týr evaluation with MonALISA

MonALISA data collection was re-implemented atop Týr and evaluated using real data. Týr was compared to other state-of-the-art, object-based storage systems:

  • RADOS / librados (Ceph)
  • Azure Storage Blobs (Microsoft)
  • BlobSeer (Inria)

Experiments run on the Microsoft Azure cloud, on up to 256 nodes. Replication factor of 3 for all systems.

21

SLIDE 28

Synchronized write performance: evaluating transactional write performance

[Plot: average throughput (millions of ops/sec) vs. concurrent writers, from 25 to 500; series: Týr (atomic operations), Týr (read-update-write), RADOS (synchronized), BlobSeer (synchronized), Azure Blobs (synchronized)]

  • We add fine-grained, application-level, lock-based synchronization to Týr's competitors
  • The performance of Týr's competitors decreases due to the synchronization cost
  • Clear advantage of atomic operations over read-update-write aggregate updates

22

SLIDE 29

Read performance

[Plot: average throughput (millions of ops/sec) vs. concurrent readers, from 25 to 500; series: Týr, RADOS, BlobSeer, Azure Blobs]

  • We simulate MonALISA reads, varying the number of concurrent readers
  • Slightly lower performance than RADOS, but Týr offers read consistency guarantees
  • Týr's lightweight read protocol allows it to outperform BlobSeer and Azure Storage

23

SLIDE 30

The next step: Týr for HPC applications? Týr as a base layer for higher-level storage abstractions?

[Diagram: HPC App and Big Data App (K/V Store, RDB) on top of a Converged Object Storage System]

24

SLIDE 31

Before that: A study of feasibility

25

SLIDE 32

HPC App → HPC PFS | Big Data App

26

Current storage stack

[Diagram: HPC apps → I/O Library → HPC PFS; Big Data apps → Big Data Framework → Big Data DFS; applications issue I/O library / BD framework calls, which are translated into POSIX-like calls to the underlying storage]

SLIDE 33

HPC App Big Data App

27

“Converged” storage stack

[Diagram: HPC apps → I/O Library → HPC Adapter; Big Data apps → Big Data Framework → Big Data Adapter; the adapters translate I/O library / BD framework calls and POSIX-like calls into object-based storage calls to the Converged Object Storage System]

SLIDE 34

Object-oriented primitives

  • Object Access: random object read, object size
  • Object Manipulation: random object write, truncate
  • Object Administration: create object, delete object
  • Namespace Access: scan all objects
  • These operations are similar to those permitted by the POSIX-IO API on a single file
  • Directory-level operations have no object-based storage counterpart, due to the flat nature of these systems
  • They are few in number, and can be emulated using the scan operation (far from optimal, but compensated by the gains permitted by a flat namespace and simpler semantics); see the interface sketch below

28
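The primitives listed above could be captured by a narrow interface along these lines. This is a hypothetical sketch in C; all names and signatures are illustrative assumptions rather than an actual API.

```c
/* Hypothetical sketch of the narrow object-store interface implied by the
 * primitives above. Names and signatures are illustrative assumptions,
 * not Týr's or any specific system's API. */
#include <stdint.h>
#include <stddef.h>

typedef struct object_store object_store_t;   /* opaque backend handle */

typedef struct {
    /* Object access */
    int64_t (*read)    (object_store_t *s, const char *obj, uint64_t off,
                        void *buf, size_t len);
    int     (*size)    (object_store_t *s, const char *obj, uint64_t *out);
    /* Object manipulation */
    int64_t (*write)   (object_store_t *s, const char *obj, uint64_t off,
                        const void *buf, size_t len);
    int     (*truncate)(object_store_t *s, const char *obj, uint64_t new_size);
    /* Object administration */
    int     (*create)  (object_store_t *s, const char *obj);
    int     (*remove)  (object_store_t *s, const char *obj);
    /* Namespace access: visit every object in the flat namespace */
    int     (*scan)    (object_store_t *s,
                        int (*visit)(const char *obj, void *arg), void *arg);
} object_store_ops_t;
```

Note how closely this mirrors POSIX I/O on a single file; only the scan call is needed to stand in for the missing directory hierarchy.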

SLIDE 35

29

Representative set of HPC/BD applications:

| Platform    | Application         | Usage                | Total reads | Total writes | R/W ratio | Profile         |
| HPC/MPI     | mpiBLAST            | Protein docking      | 27.7 GB     | 12.8 MB      | 2.1*10^3  | Read-intensive  |
| HPC/MPI     | MOM                 | Oceanic model        | 19.5 GB     | 3.2 GB       | 6.01      | Read-intensive  |
| HPC/MPI     | ECOHAM              | Sediment propagation | 0.4 GB      | 9.7 GB       | 4.2*10^-2 | Write-intensive |
| HPC/MPI     | Ray Tracing         | Video processing     | 67.4 GB     | 71.2 GB      | 0.94      | Balanced        |
| Cloud/Spark | Sort                | Text processing      | 5.8 GB      | 5.8 GB       | 1.00      | Balanced        |
| Cloud/Spark | Connected Component | Graph processing     | 13.1 GB     | 71.2 MB      | 1.8*10^2  | Read-intensive  |
| Cloud/Spark | Grep                | Text processing      | 55.8 GB     | 863.8 MB     | 64.52     | Read-intensive  |
| Cloud/Spark | Decision Tree       | Machine learning     | 59.1 GB     | 4.7 GB       | 12.58     | Read-intensive  |
| Cloud/Spark | Tokenizer           | Text processing      | 55.8 GB     | 235.7 GB     | 0.24      | Write-intensive |

SLIDE 36

30

SLIDE 37

31

Directory operations observed in the applications:

| Operation                      | Action              | Operation count |
| mkdir                          | Create directory    | 43              |
| rmdir                          | Remove directory    | 43              |
| opendir (input data directory) | Open/List directory | 5               |
| opendir (other directories)    | Open/List directory |                 |

POSIX calls rewritten to object-based calls over a flat namespace:

| Original operation | Rewritten operation                        |
| create(/foo/bar)   | create(/foo__bar)                          |
| open(/foo/bar)     | open(/foo__bar)                            |
| read(fd)           | read(bd)                                   |
| write(fd)          | write(bd)                                  |
| mkdir(/foo)        | Dropped operation                          |
| opendir(/foo)      | scan(/), return all files matching /foo__* |
| rmdir(/foo)        | scan(/), remove all files matching /foo__* |
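A minimal sketch of the path rewriting illustrated above, mapping hierarchical POSIX paths onto a flat object namespace. flatten_path is a hypothetical helper, not code from the actual adapter.

```c
/* Sketch of the path rewriting used to map a hierarchical POSIX namespace
 * onto a flat object namespace: '/' separators inside the path become "__".
 * flatten_path() is a hypothetical helper, not code from the adapter. */
#include <stdio.h>
#include <string.h>

/* "/foo/bar" -> "foo__bar" (leading '/' dropped, inner '/' -> "__") */
static void flatten_path(const char *path, char *out, size_t out_len) {
    size_t j = 0;
    if (*path == '/') path++;                 /* drop leading slash */
    for (; *path && j + 3 < out_len; path++) {
        if (*path == '/') { out[j++] = '_'; out[j++] = '_'; }
        else              { out[j++] = *path; }
    }
    out[j] = '\0';
}

int main(void) {
    char obj[256];
    flatten_path("/foo/bar", obj, sizeof obj);
    printf("create(%s)\n", obj);              /* create(foo__bar) */
    /* opendir("/foo") is emulated by scanning the flat namespace for the
     * prefix "foo__"; mkdir/rmdir have no object-level counterpart. */
    printf("scan(/), match prefix %s__*\n", "foo");
    return 0;
}
```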

SLIDE 38

32

SLIDE 39

Týr and RADOS vs. Lustre (HPC) and HDFS/CephFS (Big Data)

  • Grid’5000 experimental testbed, distributed over 11 sites in France and Luxembourg (parapluie cluster, Rennes)
  • Nodes: 2 x 12-core 1.7 GHz 6164 HE, 48 GB of RAM, and a 250 GB HDD
  • HPC applications: Lustre 2.9.0 and MPICH 3.2, on a 32-node cluster
  • Big Data applications: Spark 2.1.0, Hadoop / HDFS 2.7.3 and Ceph Kraken, on a 32-node cluster

33

SLIDE 40

34

HPC applications

SLIDE 41

35

BD applications

SLIDE 42

36

HPC/BD applications

SLIDE 43

Conclusions

  • Týr is a novel high-performance object-based storage system providing built-in multi-object transactions
  • Object-based storage convergence is possible, leading to a significant performance improvement on both platforms (HPC and Cloud)
  • Completion time improves by up to 25% for Big Data applications and by 15% for HPC applications when using object-based storage

37

SLIDE 44

Thank you!