
Slide 1

Exploring the Future of Out-Of-Core Computing with Compute-Local Non-Volatile Memory

Myoungsoo Jung¹, Ellis H. Wilson III², Wonil Choi¹,², John Shalf³,⁴, Hasan Metin Aktulga³, Chao Yang³, Erik Saule⁵, Umit V. Catalyurek⁵,⁶, Mahmut Kandemir²

¹ Department of Electrical Engineering, The University of Texas at Dallas
² Department of Computer Science and Engineering, The Pennsylvania State University
³ Computational Research Division, Lawrence Berkeley National Laboratory
⁴ National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
⁵ Biomedical Informatics, The Ohio State University
⁶ Electrical and Computer Engineering, The Ohio State University

November 20th, 2013

Slide 2

Before We Begin: Get the Slides and Paper

Slides and Paper are Available At:

www.ellisv3.com

Slide 3

1. Overview of OoC Computing and Motivations
   - OoC Computing in Today's HPC Environment
   - Current Approaches to Acceleration in HPC
   - Motivating a Move to Compute-Local NVM
2. Advancing OoC Computing via Holistic System Analysis
   - System Organization and a Software Management Framework
   - File System Analysis: Traditional versus a Unified File System
   - NVM Device Architecture: Uncovering Hidden Bottlenecks
3. Evaluation and Analysis of Our Proposed Solutions
   - Experimental Configuration and Tracing Methodology
   - Results of Holistic System Improvement for OoC Computing
   - Major Take-Aways and Conclusion

Slide 4

What's an OoC?

Definition of Out-Of-Core (OoC) Computation: computation requiring constant or near-constant use of datasets that cannot fit entirely in memory on a single host.
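A minimal sketch of the pattern this definition implies (illustrative code, not from the talk): the dataset lives on disk and is streamed through DRAM one chunk at a time, so memory only ever holds a single chunk.

```python
import array
import os
import tempfile

def ooc_sum(path, chunk_items=1 << 16):
    """Sum a file of float64 values without ever loading it whole."""
    itemsize = array.array("d").itemsize
    total = 0.0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_items * itemsize)  # only this chunk in DRAM
            if not buf:
                break
            chunk = array.array("d")
            chunk.frombytes(buf)
            total += sum(chunk)
    return total

# Build a small on-disk dataset, then reduce it out-of-core.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    array.array("d", [1.0] * 200_000).tofile(f)
print(ooc_sum(path))  # 200000.0
```

Real OoC workloads differ mainly in scale: the chunk fits, the dataset does not.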

Slide 5

Exemplary OoC Application

Predicting Properties of Light Atomic Nuclei:
- Performs high-accuracy calculations of nuclear structures via the Configuration Interaction (CI) method
- The CI method uses the nuclear many-body Hamiltonian, Ĥ, which is sparse, so a parallel iterative eigensolver is used
- Ĥ can be absolutely massive and requires much more time to compute than any single eigensolver iteration
- The result: Ĥ is preprocessed and stored for repeated use
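A toy sketch of that preprocess-then-iterate structure (hypothetical code, not the authors' CI eigensolver): the expensive-to-build sparse matrix is written to disk once, and every iteration of a simple power method streams it back row by row.

```python
import json
import os
import tempfile

def store_sparse(path, rows):
    # rows: list of {col: value} dicts — persisted once, reused many times,
    # mirroring how H-hat is preprocessed and stored in the slides.
    with open(path, "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")

def ooc_matvec(path, x):
    """Matrix-vector product, streaming one stored row at a time."""
    y = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            y.append(sum(v * x[int(c)] for c, v in row.items()))
    return y

def power_iteration(path, n, iters=50):
    """Crude dominant-eigenvalue estimate; each iteration re-reads the disk."""
    x = [1.0] * n
    for _ in range(iters):
        y = ooc_matvec(path, x)
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return norm

path = os.path.join(tempfile.mkdtemp(), "H.jsonl")
store_sparse(path, [{"0": 2.0, "1": 1.0}, {"0": 1.0, "1": 2.0}])
print(power_iteration(path, 2))  # dominant eigenvalue of [[2,1],[1,2]] is 3
```

The disk reads per iteration are exactly why NVM bandwidth, not compute, dominates such workloads.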

Slide 6

Current OoC Solution: Shared Memory

Current Solution: the dataset is split across numerous nodes' memories and the memory space is shared.

Pitfalls:
- DRAM is extremely costly and power-inefficient
- Capacity-constrained DRAM limits the scale of experiments
- Application dataset sizes are growing faster than DRAM capacity is scaling
- Expensive networking (e.g., top-tier InfiniBand) is required to facilitate such demanding data movement

Slide 7

Acceleration: From Compute to Storage

HPC is currently witnessing a sea change in computation:
- No longer simply general-purpose CPUs
- GPGPUs and co-processors are seeing increasingly serious use in numerous Top500 machines

Storage in HPC is beginning to follow suit:
- Traditional magnetic disk is often too slow, even at scale
- Flash-cache-accelerated NAS/SAN was the first to assist
- Natural extension: recent works have explored flash on the I/O node (ION) for OoC acceleration

Slide 8

ION-Local Acceleration for OoC Computation

Architecture For ION-Local NVM Acceleration:

(Diagram: a compute node (CN) — cores, L1/LLC caches, memory controller, DRAM — connects over the network to an I/O node (ION) equipped with a PCIe SSD, HBA/SATA controllers, and RAID disk arrays over Fibre Channel; the NVM packages sit behind PCIe and SATA interfaces on the ION.)
Caveat: Data movement from ION to compute still required

Slide 9

Problem: NVM Bandwidth is Out-Pacing the Network

Bandwidth Trend: High-Performance Network vs. SSDs

(Chart: log-scale bandwidth per channel, GB/s, 1998–2016, comparing InfiniBand, Fibre Channel, and Flash-SSD trend lines; SSD data points include Winchester, A25FB, ST-Zeus, Intel X25, SF-1000, ioDrive, Onyx PCM prototype, Z-Drive R4, ioDrive2, ioDrive Octal, Silicon Disk II (RAM-SSD), and non-flash NVM SSDs, with future PCIe SSDs and multi-channel PCM SSDs projected to outpace the network.)

Slide 10

Retrain Your Brain: Flash is Memory, not Storage

"We must begin to envision and find ways to implement NVM as a form of compute-local, large but slow memory, rather than client-remote, small but fast disk."

(Diagram: the PCIe SSD moves from the ION to a local-node-SSD-equipped compute node (CNL) with a native PCIe controller; the ION retains its RAID disks and Fibre Channel storage, while the compute node gains NVM packages attached via native PCIe alongside its DRAM.)

Slide 11

Our Contributions

1. Design an OoC HPC architecture with co-located NVM storage and compute
2. Demonstrate that traditional file systems are not well-tuned for the massively parallel architecture within modern SSDs
3. Propose a new Unified File System (UFS)
4. Expose overheads implicit in modern SSD architecture
5. Present necessary protocol/interface fixes for near-optimal performance
6. Provide comparative evaluations for all suggested improvements using real OoC workloads

Slide 12

Future System Design Requires a Holistic Approach

Full exploration of potential future OoC systems requires a holistic approach to system analysis and redesign:
- Hardware organization
- Software framework and applications
- File systems
- Device protocol
- Device architecture and interfaces

Slide 13

Co-locating Compute and NVM: Considerations

Another look at our architecture:

(Diagram: the same CNL architecture — a compute node with a locally attached, native-PCIe SSD, and an ION with RAID disks over Fibre Channel.)

Considerations:
- Cost: SSDs aren't cheap, but prices are dropping and bandwidth/capacity is consistently rising
- As SSDs out-pace the network, it becomes increasingly expensive to keep them off the compute node
- Tradition: compute and storage are typically separated for management reasons
- Administration of coupled architectures has recently been proven quite doable (e.g., Hadoop, Mesos)

Slide 14

Our Data Management Framework

We enable application-managed data staging via:
- DOoC: distributed data storage and a scheduler with OoC capabilities via an out-of-core linear algebra framework (LAF)
- DataCutter: middleware that abstracts dataflows via the concepts of filters and streams

Altogether, this works much the way OpenMP does: directives and routines in the application code enable automated data storage management.
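The filter/stream abstraction can be sketched with plain generators (a hypothetical illustration, not DataCutter's actual API): filters are stages, and streams are the iterators that connect them.

```python
def reader(blocks):
    """Source filter: emits data blocks onto a stream."""
    for b in blocks:
        yield b

def scale(stream, factor):
    """Transform filter: consumes one stream, produces another."""
    for b in stream:
        yield [x * factor for x in b]

def reducer(stream):
    """Sink filter: collapses the stream to a single result."""
    return sum(sum(b) for b in stream)

# Wire filters together with streams, much as directives would in DOoC.
pipeline = reducer(scale(reader([[1, 2], [3, 4]]), 10))
print(pipeline)  # 100
```

Because each block flows through lazily, only one block per stage needs to be resident — the same property that makes the staging framework suitable for out-of-core data.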

Slide 15

Traditional File Systems

The Good Ol' (Magnetic) Bits Club: most file systems, even modern ones, are built on a foundation of assumptions for spinning magnetic disk.

This prevents full utilization of the massively parallel architectures in modern SSDs due to:

1. Small block sizes (512 B to 4 KB)
2. Low coalescing limits
3. Metadata/journaling contention

Slide 16

The Unified File System

Enabling Full Parallelism in SSDs:
- Fixes the woes of existing file systems and an untuned block device layer
- Provides near-direct access to the SSD by punching straight through the file system, block layer, and FTL
- FTL and file system duties become more tightly integrated in the host
- Dubious? Fusion-io already employs a lesser variation on this

Slide 17

Exemplary Request Comparison

Consider the path of a request under each of the following.

Traditional File System:

(Diagram: in the traditional stack, a POSIX API-level read/write from the OoC scientific application passes through the native file system (e.g., EXT4, JFS, XFS) as logical block-level I/O, then through the flash translation layer as NVM transaction-level read/write/erase, before reaching the NVM controllers and NAND flash packages.)

Our Unified File System (UFS):

(Diagram: under UFS, the application's POSIX API-level read/write is translated directly into NVM transaction-level read/write/erase commands issued across the PCIe endpoints to the NVM controllers, with the physical separation between host and NAND flash packages the only remaining boundary.)

Slide 18

A Silent (Performance) Killer: Bridged PCIe Flash

First device hurdle discovered: bridged SATAe-based PCIe.
- Many "PCIe" SSDs are simply flash chips behind SATAe interfaces
- An internal transcode from SATA to PCIe (and back) occurs
- Biggest issue: SATA uses 8b/10b encoding (25% overhead), whereas PCIe 3.0 uses 128b/130b encoding (1.5% overhead)
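A back-of-the-envelope check of those encoding overheads (the line rates below are nominal spec values used as assumptions, not measurements from the talk):

```python
def payload_rate(line_rate_gbps, data_bits, total_bits):
    """Usable payload rate in GB/s after line encoding (bits -> bytes)."""
    return line_rate_gbps * data_bits / total_bits / 8

sata3 = payload_rate(6.0, 8, 10)          # SATA 6G with 8b/10b encoding
pcie3_lane = payload_rate(8.0, 128, 130)  # PCIe 3.0 per lane, 128b/130b
print(f"SATA 6G payload:   {sata3:.3f} GB/s")    # 0.600 GB/s
print(f"PCIe 3.0 per lane: {pcie3_lane:.3f} GB/s")
```

The 8b/10b scheme discards 20% of the raw line rate outright, which is why a SATA hop hidden inside a "PCIe" SSD quietly caps throughput.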

(Diagram: a bridged device — NAND flash packages behind ONFi 3.x NAND controllers attach to a SATA host/device pair, which is then bridged to a PCIe endpoint and the root complex over SATA 6G/SAS and PCIe links.)

Slide 19

Correcting Performance with Native PCIe 3.0

Native PCIe 3.0:
- Native PCIe links to the controller
- Achieves the low 1.5% bit overhead needed to assure DC balance and bounded disparity
- In evaluation, we compare native PCIe 3.0 against PCIe 2.0, which uses 8b/10b encoding

(Diagram: a native device — NAND flash packages behind DDR3-interfaced NAND controllers attach to PCIe endpoints through a PCIe switch to the root complex, with no SATA bridge in the path.)

Slide 20

Lane Width and Interface Frequency Bottlenecks

Second and third device hurdles discovered:

PCIe lane widths:
- Post-conversion overheads: current PCIe 2.0 SSDs only provide four lanes at 2 GB/s
- Well under the maximum possible throughput of the flash chips
- We explore future expanded-lane architectures with 8 and 16 lanes

NVM interface frequencies:
- Even cutting-edge protocols such as ONFi 3 leave NVM bandwidth behind
- Only reaches the equivalent of DDR2 @ 200 MHz
- Experimenting with next-generation speeds such as DDR3 @ 1600 will unthrottle the NVM
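Rough aggregate-bandwidth arithmetic for the lane configurations discussed above (per-lane payload rates are nominal spec-derived assumptions):

```python
# Approximate usable payload per lane, GB/s: PCIe 2.0 after 8b/10b encoding,
# PCIe 3.0 after 128b/130b encoding.
PER_LANE_GBPS = {"pcie2": 0.5, "pcie3": 0.985}

def link_bw(gen, lanes):
    """Aggregate link payload bandwidth in GB/s."""
    return PER_LANE_GBPS[gen] * lanes

print(link_bw("pcie2", 4))   # 2.0 GB/s: the 4-lane PCIe 2.0 SSDs above
print(link_bw("pcie3", 8))   # a hypothetical expanded 8-lane architecture
print(link_bw("pcie3", 16))  # a hypothetical expanded 16-lane architecture
```

Widening from 4 lanes of PCIe 2.0 to 16 lanes of PCIe 3.0 raises the link ceiling by roughly 8x, which is the headroom the later evaluation slides exploit.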

Slide 21

Experimental Setup

High-fidelity trace-based NVM SSD simulation:
- NANDFlashSim: already supported SLC and MLC flash; extended for TLC and PCM
- Enabled queueing optimizations as described in prior work

NVM architectures considered:
- SSDs filled with four NVM types: SLC, MLC, TLC, and PCM
- 8 internal channels
- 64 NVM packages
- 128 NVM dies (2/package)

Slide 22

Real OoC Application Tracing Methodology

OoC Physics Tracing for Simulation:
- Traced the OoC physics application mentioned earlier at scale on the LBNL Carver cluster
- Trace points: at the ION-local SSDs (under GPFS) and at each compute node (at the POSIX level)
- Rerun and retraced with a variety of file systems (ext2, ext3, ext4, tuned ext4, JFS, BTRFS, and XFS) at the block level

Slide 23

Access Pattern Considerations: GPFS vs POSIX

(Plot: access sequence (600–4800) versus address offset (5.2e8–6.8e8) for the GPFS and POSIX address spaces; GPFS striping scatters the otherwise regular POSIX access pattern.)

Take-Away: GPFS striping creates access patterns that fail to leverage the full bandwidth of flash; the ability to issue POSIX access patterns directly to flash would be ideal.

Slide 24

Architecture and File System Results: Bandwidth Achieved

(Chart: bandwidth in MB/s, up to ~3500, for ION-GPFS and the CN-local configurations — JFS, BTRFS, XFS, ReiserFS, ext2, ext3, ext4, tuned ext4, and UFS — each with TLC, MLC, SLC, and PCM.)

Take-Aways: 1) ION-local is harshly limited by the network. 2) CN-local varies extensively with the behavior of the underlying file system. 3) UFS reaches architectural bottlenecks.

Slide 25

Architecture and File System Results: What Remains?

(Chart: bandwidth remaining in MB/s, up to ~3500, for ION-GPFS, the CN-local file systems, and CNL-UFS, each with TLC, MLC, SLC, and PCM.)

Take-Aways: 1) ION-local and UFS leave considerable bandwidth untapped. 2) Traditional file systems are flash-limited due to the workloads they generate.

Slide 26

Device Improvement: Bandwidth Achieved

(Chart: bandwidth in MB/s, up to ~16000, for CNL-UFS, CNL-BRIDGE-16, CNL-NATIVE-8, and CNL-NATIVE-16, each with TLC, MLC, SLC, and PCM.)

Take-Aways:
1. Even with 4X more interface bandwidth, the bridged architecture sees only marginal improvements
2. Native PCIe with improved frequencies delivers superior performance
3. In the last architecture, the NVM is finally the real bottleneck

Slide 27

Device Improvement: What Remains?

(Chart: bandwidth remaining in MB/s, up to ~8000, for CNL-UFS, CNL-BRIDGE-16, CNL-NATIVE-8, and CNL-NATIVE-16, each with TLC, MLC, SLC, and PCM.)

Take-Aways:
1. Despite low performance, the bridged architecture incurs so much overhead that nothing is left behind
2. The move to native opens up the throttle, but gets bottlenecked on only 8 channels

Slide 28

Main Evaluation Take-Aways

- Move it Local: keeping NVM remote (or using shared memory) is an increasingly costly decision for future systems
- Holistic Eye: achieving full performance from NVM SSDs requires a holistic approach to system analysis
- File Systems Matter: the file system employed plays a huge role in fully leveraging SSDs
- Unthrottled SSDs: future SSD architectures must expand lanes and increase frequencies to fully unthrottle NVM storage

Slide 29

Conclusions

Conclusions:
- We have to think of NVM SSDs more as nearby, slow memory than as distant, fast storage
- We demonstrate 108% improvement just by moving the NVM nearby
- Another 52% and 250% in improvements can be realized by properly tuning file systems and SSD architecture
- Overall, comparing the original ION-local, untuned SSD architecture to our final, CN-local, fully-unthrottled SSD architecture, we unlock 16X bandwidth for our OoC application without ever changing the underlying NVM chips

Slide 30

Questions?

Slide 31

– Begin Backup Slides –

Slide 32

Digging Deeper: Channel Utilization

Channel Utilization = Average Percent of Channels Kept Busy
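The metric defined above can be computed directly (an illustrative sketch with made-up numbers, not the simulator's actual output):

```python
def channel_utilization(busy_time, total_time):
    """Average percent of channels kept busy.

    busy_time: per-channel busy durations; total_time: simulated duration.
    """
    return 100.0 * sum(t / total_time for t in busy_time) / len(busy_time)

# 8 internal channels over a 1000-cycle window (hypothetical values).
busy = [900, 800, 1000, 700, 900, 800, 1000, 900]
print(channel_utilization(busy, 1000))  # 87.5
```

A channel can be busy without moving useful data (e.g., contended or striped awkwardly), which is how GPFS achieves high utilization yet low performance in the chart below.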

(Chart: channel utilization, 60–100%, for all configurations from ION-GPFS through the CN-local file systems, CNL-UFS, CNL-BRIDGE-16, CNL-NATIVE-8, and CNL-NATIVE-16, each with TLC, MLC, SLC, and PCM.)

Take-Away: GPFS striping results in high utilization, but low performance

Slide 33

Digging Deeper: Package Utilization

Package Utilization = Average Percent of Packages Serving Requests

(Chart: package utilization, 20–100%, for all configurations from ION-GPFS through CNL-NATIVE-16, each with TLC, MLC, SLC, and PCM.)

Take-Away: Even small percentage increases in package utilization can mean large increases in bandwidth

Slide 34

Digging Deeper: Operation Breakdown Definitions

Six Major Categories of Operations Possible in an SSD:
- Non-Overlapped DMA: data movement between the SSD and the host, including the thin interface (SAS), PCIe bus, and network
- Flash-Bus Activation: data movement between registers (or SRAM) in NVM packages and the main channel
- Channel-Bus Activation: data movement on the data bus shared by NVM packages
- Cell Contention: waiting on an NVM package already busy serving another request
- Channel Contention: waiting on a channel already busy serving another request
- Cell Activation: performing a read, write, or erase operation on an NVM cell, including time spent moving data between internal registers (or SRAM) and the cell array
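An execution breakdown like the ones on the next slide is just per-category time normalized to percentages. A minimal sketch with made-up numbers (not figures from the evaluation):

```python
def breakdown(times):
    """Normalize per-category durations into percentages of total time."""
    total = sum(times.values())
    return {k: 100.0 * v / total for k, v in times.items()}

# Hypothetical per-category durations for one configuration (arbitrary units).
times = {
    "non-overlapped DMA": 120, "flash-bus activation": 40,
    "channel-bus activation": 30, "cell contention": 60,
    "channel contention": 50, "cell activation": 200,
}
for op, pct in breakdown(times).items():
    print(f"{op}: {pct:.1f}%")
```

When the "cell activation" share dominates, the NVM itself is the bottleneck — the desired end state the later architectures reach.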

Slide 35

Digging Deeper: Operation Breakdown

TLC Operation Breakdown:

(Chart: TLC execution breakdown, 0–100%, across all configurations, decomposed into non-overlapped DMA, flash-bus activation, channel activation, cell contention, channel contention, and cell activation.)

PCM Operation Breakdown:

(Chart: PCM execution breakdown, 0–100%, across the same configurations and operation categories.)

Take-Aways:
1. ION-local spends significant time in non-overlapped DMA due to the network
2. UFS relieves the internal bus activity obvious under traditional file systems
3. Cell activation increases dramatically in the later architectures

Slide 36

Digging Deeper: Parallelism Classification

Four Stages of Parallelism:
- PAL1: system-level parallelism via channel striping and channel pipelining alone
- PAL2: die (bank) interleaving on top of PAL1
- PAL3: multi-plane mode operation on top of PAL1
- PAL4: all previous levels combined
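The staging above can be read as a simple classification rule. A hypothetical sketch (my labeling of the stages as two boolean capabilities, not the simulator's logic): every request gets channel striping/pipelining as a baseline, and its PAL stage depends on which extra mechanisms it exploited.

```python
def pal_stage(die_interleaved, multi_plane):
    """Assign a request's parallelism stage from the mechanisms it used.

    Channel striping/pipelining (PAL1) is assumed as the baseline for all.
    """
    if die_interleaved and multi_plane:
        return "PAL4"   # all mechanisms combined
    if multi_plane:
        return "PAL3"   # multi-plane on top of PAL1
    if die_interleaved:
        return "PAL2"   # die interleaving on top of PAL1
    return "PAL1"       # channel-level parallelism only

print(pal_stage(False, False))  # PAL1
print(pal_stage(True, True))    # PAL4
```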

Slide 37

Digging Deeper: Parallelism Breakdown

Difference: this is the perspective of a request, not the perspective of the hardware.

TLC Parallelism Breakdown:

(Chart: TLC parallelism decomposition, 0–100%, across all configurations from ION-GPFS through CNL-NATIVE-16, split into PAL1–PAL4.)

PCM Parallelism Breakdown:

(Chart: PCM parallelism decomposition, 95–100%, across the same configurations, split into PAL1–PAL4.)

Take-Aways:
1. ION-local has difficulty reaching full parallelism
2. UFS-based workloads reach high PAL4 levels
