Towards 1000x with Heterogeneous, Programmable Hardware Datacenter

Apr 17, 2023

SLIDE 1

Towards 1000x with Heterogeneous, Programmable Hardware Datacenter

Name: Anton Burtsev, UC Irvine
Summary:
Related work:

SLIDE 2

What will hardware look like in 10-20 years?

Massively heterogeneous

○ Not just many-cores
  ■ GPUs, Xeon Phi, Tilera TILE, PowerEN
○ But also
  ■ Fine-grained ASIC hardware accelerators
  ■ Programmable hardware (FPGA)

SLIDE 3

Ubiquitous, fine-grained, heterogeneous hardware-acceleration

Execution will no longer stay on 1 CPU

SLIDE 4

Ubiquitous, fine-grained, heterogeneous hardware-acceleration

A chain of hardware accelerators (ASIC/FPGA)
  ■ On-chip, and over PCIe
○ Co-located with storage and network devices

A single machine is a distributed system

○ Yet you have to use it efficiently

SLIDE 5

Even your memory is distributed

Your memory is not local either
We will see large memories

○ 6TB are possible today (Dell R930, 96x64GB DIMMs)
○ 10x higher density in the near future [Meena et al.]
  ■ ~100TB of NVM on the memory bus
  ■ 20-80 ns latency of access
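
The capacity figures above can be checked with a few lines of arithmetic. This is only a sketch: the DIMM count and size are taken from the slide, binary units (1 TB = 1024 GB) are an assumption, and the 10x factor is applied to today's configuration as-is. The result lands in the same order of magnitude as the ~100TB quoted on the slide; it does not try to reproduce that estimate.

```python
# DIMM configuration quoted on the slide (Dell R930).
DIMM_COUNT = 96
DIMM_SIZE_GB = 64
DENSITY_GAIN = 10          # "10x higher density" from the slide

# Assumption: binary units, 1 TB = 1024 GB.
total_gb = DIMM_COUNT * DIMM_SIZE_GB       # 6144 GB
total_tb = total_gb / 1024                 # 6.0 TB
projected_tb = total_tb * DENSITY_GAIN     # 60.0 TB with the same slot count

print(f"today: {total_tb:.0f} TB, at 10x density: {projected_tb:.0f} TB")
```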

SLIDE 6

Big/New Ideas of 1000x

Your biggest problem is ...

○ Latency and parallelism
  ■ Send a request to another core/accelerator
    355ns on a cache-coherent Intel HARP [Choi, DAC’16]
  ■ Have to find something to do…
○ Parallelism
  ■ Expressing and running the graph of the computation on a set of execution units
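
To put the 355ns figure in perspective, here is a rough back-of-the-envelope sketch. The core clock (3 GHz) and the amount of independent work per task (50ns) are illustrative assumptions, not numbers from the talk; only the 355ns latency comes from the slide. The function names are hypothetical.

```python
import math

REQUEST_LATENCY_NS = 355    # from the slide (Intel HARP, [Choi, DAC'16])
CLOCK_GHZ = 3.0             # assumption: a 3 GHz core
WORK_PER_TASK_NS = 50       # assumption: useful work in one independent task


def stalled_cycles(latency_ns, clock_ghz):
    """Cycles a core would sit idle if it simply waited for the reply."""
    return latency_ns * clock_ghz


def tasks_to_hide_latency(latency_ns, work_ns):
    """Independent tasks the core must run to cover one request round trip."""
    return math.ceil(latency_ns / work_ns)


print(stalled_cycles(REQUEST_LATENCY_NS, CLOCK_GHZ))                # 1065.0
print(tasks_to_hide_latency(REQUEST_LATENCY_NS, WORK_PER_TASK_NS))  # 8
```

Under these assumptions a single blocking request costs on the order of a thousand cycles, which is why the runtime has to find other work to overlap with it.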

SLIDE 7

Big/New Ideas of 1000x

You have more problems...

○ Reliability
  ■ A single bug can destroy your in-memory dataset
    100TB of non-volatile memory is cache-coherent
    Any FPGA unit or core can wipe it

SLIDE 8

Indicated R&D for 1000x

OS/VMM support for heterogeneous hardware

○ Novel execution runtime
  ■ Spatial scheduling, preemption, load-balancing
    Sharing across multiple users
    On one host and in a virtual datacenter
  ■ Unified OS platform for GPU, multi-cores, FPGA
    Proprietary stacks and device drivers should go…
    Direct (low-latency) access to hardware

SLIDE 9

Indicated R&D for 1000x

Language support

○ Programmable hardware
  ■ C/C++/Rust to FPGA
○ Parallelism
  ■ Async & delegate [Grappa, USENIX’16]
    Works well for analytical workloads
  ■ Streaming languages
  ■ Your favorite model here
    Well, MPI will work too
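
The "async & delegate" idea can be illustrated with a minimal sketch in Python's asyncio. This is not Grappa's API; the `Owner` class, `delegate` method, and `increment` operation are hypothetical names chosen for illustration. The point is the pattern: operations on a piece of data are shipped to the single task that owns it, and the caller suspends so other work can proceed while it waits.

```python
import asyncio


class Owner:
    """Hypothetical owner of one data partition; all accesses go through it."""

    def __init__(self):
        self.counter = 0
        self.inbox = asyncio.Queue()

    async def run(self):
        # Apply delegated operations one at a time, so no locks are needed.
        while True:
            op, reply = await self.inbox.get()
            if op is None:          # shutdown sentinel
                break
            reply.set_result(op(self))

    async def delegate(self, op):
        # Ship the operation to the owner and suspend until it replies.
        reply = asyncio.get_running_loop().create_future()
        await self.inbox.put((op, reply))
        return await reply


def increment(owner):
    owner.counter += 1
    return owner.counter


async def main():
    owner = Owner()
    server = asyncio.create_task(owner.run())
    # 100 concurrent callers, each delegating one increment.
    results = await asyncio.gather(
        *(owner.delegate(increment) for _ in range(100))
    )
    await owner.inbox.put((None, None))
    await server
    return owner.counter, sorted(results)


counter, results = asyncio.run(main())
print(counter)
```

Because only the owner touches `counter`, every increment is applied exactly once, without the callers coordinating among themselves.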

SLIDE 10

Questions for the Software Institute

Analyze potential performance gains for HEP workloads

  ■ Assume a clean-slate, ideal software stack
  ■ Only hardware limitations
  ■ Can we get to 1000x?
  ■ What are the bottlenecks?
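
One simple way to frame the "can we get to 1000x?" question is Amdahl's law. The talk does not name it; it is brought in here only as a first-order model. The sketch assumes a workload splits cleanly into an accelerated fraction and a remainder that runs at the original speed; the function names and the example fractions are illustrative.

```python
def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / accelerator_speedup)


def max_speedup(accelerated_fraction):
    """Upper bound on speedup, even with an infinitely fast accelerator."""
    return 1.0 / (1.0 - accelerated_fraction)


# Illustrative fractions, not measurements of any HEP workload.
for fraction in (0.9, 0.99, 0.999):
    print(f"{fraction:.1%} accelerated -> at most {max_speedup(fraction):.0f}x")
```

Under this model, 1000x is reachable only if at least 99.9% of the original runtime can be offloaded, so the unaccelerated remainder is the first bottleneck to look for.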

SLIDE 11

Questions for the Software Institute

Encouraging example:

○ D.E. Shaw Anton/Anton 2 molecular dynamics simulation machine
  ■ Custom ASIC
  ■ 1000x speedup

Same acceleration is possible for HEP