Towards 1000x with Heterogeneous, Programmable Hardware Datacenter




SLIDE 1

Towards 1000x with Heterogeneous, Programmable Hardware Datacenter

  • Name: Anton Burtsev, UC Irvine
  • Summary:
  • Related work:

SLIDE 2

What will hardware look like in 10-20 years?

  • Massively heterogeneous
    ○ Not just many-cores
      ■ GPUs, Xeon Phi, Tilera TILE, PowerEN
    ○ But also
      ■ Fine-grained hardware accelerators (ASICs)
      ■ Programmable hardware (FPGA)

SLIDE 3

Ubiquitous, fine-grained, heterogeneous hardware-acceleration

  • Execution will no longer stay on 1 CPU

SLIDE 4

Ubiquitous, fine-grained, heterogeneous hardware-acceleration

  • A chain of hardware accelerators (ASIC/FPGA)
    ○ On-chip, and over PCIe
    ○ Co-located with storage and network devices
  • A single machine is a distributed system
    ○ Yet you have to use it efficiently

SLIDE 5

Even your memory is distributed

  • Your memory is not local either
  • We will see large memories
    ○ 6TB are possible today (Dell R930, 96x64GB DIMMs)
    ○ 10x higher density in the near future [Meena et al.]
      ■ ~100TB of NVM on the memory bus
      ■ 20-80 ns latency of access
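The capacity figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `capacity_tb` is hypothetical, and applying the 10x density gain directly to today's 6TB is an assumption. It gives roughly 60TB, the same order of magnitude as the ~100TB quoted on the slide.

```python
def capacity_tb(dimm_count: int, dimm_gb: int) -> float:
    """Total memory in TB (1 TB = 1024 GB) for a set of identical DIMMs."""
    return dimm_count * dimm_gb / 1024


# Dell R930 configuration quoted on the slide: 96 DIMMs of 64GB each.
today_tb = capacity_tb(96, 64)

# Assumption: the projected 10x density gain applies to the whole machine.
future_tb = today_tb * 10

print(f"today: {today_tb} TB, with 10x density: {future_tb} TB")
```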

SLIDE 6

Big/New Ideas of 1000x

  • Your biggest problem is ...
    ○ Latency and parallelism
      ■ Send a request to another core/accelerator
        • 355ns on a cache-coherent Intel HARP [Choi, DAC’16]
      ■ Have to find something to do in the meantime…
    ○ Parallelism
      ■ Expressing, and running the graph of the computation on a set of execution units
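One way to picture "expressing, and running the graph of the computation on a set of execution units" is a small dependency-driven scheduler. The sketch below is illustrative only: `run_graph`, the task names, and the use of a thread pool as a stand-in for cores and accelerators are all assumptions, not something the slides prescribe. A task is submitted as soon as its inputs are available, so independent branches overlap instead of waiting on each other.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_graph(tasks, deps, workers=4):
    """Run a task graph on a pool of workers.

    tasks: name -> callable taking a dict of {dependency name: result}
    deps:  name -> list of task names that must finish first
    Returns a dict of name -> result.
    """
    results = {}
    pending = {}            # future -> task name
    remaining = set(tasks)  # tasks not yet submitted
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining or pending:
            # Submit every task whose dependencies have all completed.
            ready = [n for n in remaining
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                remaining.discard(name)
                inputs = {d: results[d] for d in deps.get(name, [])}
                pending[pool.submit(tasks[name], inputs)] = name
            if not pending:
                raise ValueError("dependency cycle or unknown dependency")
            # Block until at least one running task finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results


# Hypothetical four-node graph: "square" and "double" can run in parallel.
tasks = {
    "load": lambda _: [1, 2, 3, 4],
    "square": lambda r: [x * x for x in r["load"]],
    "double": lambda r: [2 * x for x in r["load"]],
    "sum": lambda r: sum(r["square"]) + sum(r["double"]),
}
deps = {"square": ["load"], "double": ["load"], "sum": ["square", "double"]}

results = run_graph(tasks, deps)
print(results["sum"])
```

On real heterogeneous hardware the pool would be replaced by per-device queues, and the scheduler would also have to account for the cost of moving data between units.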

SLIDE 7

Big/New Ideas of 1000x

  • You have more problems...
    ○ Reliability
      ■ A single bug can destroy your in-memory dataset
        • 100TB of non-volatile memory are cache-coherent
        • Any FPGA unit or core can wipe it

SLIDE 8

Indicated R&D for 1000x

  • OS/VMM support for heterogeneous hardware
    ○ Novel execution runtime
      ■ Spatial scheduling, preemption, load-balancing
        • Sharing across multiple users
        • On one host and in a virtual datacenter
      ■ Unified OS platform for GPU, multi-cores, FPGA
        • Proprietary stacks and device drivers should go…
        • Direct (low-latency) access to hardware

SLIDE 9

Indicated R&D for 1000x

  • Language support
    ○ Programmable hardware
      ■ C/C++/Rust to FPGA
    ○ Parallelism
      ■ Async & delegate [Grappa, USENIX’16]
        • Works well for analytical workloads
      ■ Streaming languages
      ■ Your favorite model here
        • Well, MPI will work too
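The "async & delegate" style can be sketched as follows. This is not Grappa's API: the `Owner` class, its `delegate` method, and the `increment` helper are hypothetical names, and Python threads stand in for cores or accelerators. The idea shown is that each partition of the data has exactly one owner, other units ship an operation to that owner instead of touching the data themselves, and the caller gets a future back immediately so it can keep working while the request is in flight.

```python
import queue
import threading
from concurrent.futures import Future


class Owner:
    """One execution unit that exclusively owns a partition of the data."""

    def __init__(self):
        self.data = {}
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Only this thread touches self.data, so no locks are needed.
        while True:
            item = self._inbox.get()
            if item is None:  # shutdown signal
                break
            op, fut = item
            try:
                fut.set_result(op(self.data))
            except Exception as exc:
                fut.set_exception(exc)

    def delegate(self, op):
        """Ship op to the owner and return a Future right away (the async part)."""
        fut = Future()
        self._inbox.put((op, fut))
        return fut

    def stop(self):
        self._inbox.put(None)
        self._thread.join()


owners = [Owner() for _ in range(4)]


def increment(key):
    """Delegate a read-modify-write to whichever owner holds this key."""
    owner = owners[hash(key) % len(owners)]

    def op(data):
        data[key] = data.get(key, 0) + 1
        return data[key]

    return owner.delegate(op)


# Issue many requests without waiting; the caller could do other work here.
futures = [increment("muon") for _ in range(1000)]
final_count = futures[-1].result()
print(final_count)

for o in owners:
    o.stop()
```

Because every update to a key runs on its single owner, the counter needs no locking, which is one reason this model suits analytical workloads with many small, independent updates.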

SLIDE 10

Questions for the Software Institute

  • Analyze potential performance gains for HEP workloads
    ○ Assume a clean-slate, ideal software stack
    ○ Only hardware limitations
    ○ Can we get to 1000x?
    ○ What are the bottlenecks?
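A first-order way to approach "Can we get to 1000x?" is Amdahl's law, which the slides do not mention and which is brought in here only as an assumed model. If a fraction $p$ of the runtime is accelerated by a factor $s$, the overall speedup is $1 / ((1 - p) + p / s)$. The function name and the example fractions below are hypothetical.

```python
def amdahl_speedup(accelerated_fraction: float, unit_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / unit_speedup)


# Accelerating 99% of the runtime by 1000x leaves the remaining 1% dominant.
partial = amdahl_speedup(0.99, 1000)

# Even with infinitely fast accelerators, 0.1% unaccelerated work caps the
# overall gain at about 1000x.
ceiling = amdahl_speedup(0.999, float("inf"))

print(f"99% accelerated at 1000x: {partial:.1f}x overall")
print(f"99.9% accelerated, ideal hardware: {ceiling:.1f}x overall")
```

Under this model, reaching 1000x requires that no more than about 0.1% of the original runtime stays on the unaccelerated path, which is why locating the bottlenecks comes before choosing the hardware.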

SLIDE 11

Questions for the Software Institute

  • Encouraging example:
    ○ D.E. Shaw Anton/Anton 2 molecular dynamics simulation machine
      ■ Custom ASIC
      ■ 1000x speedup
  • Same acceleration is possible for HEP
