SLIDE 1

The IEEE Rebooting Computing Initiative and the International Roadmap for Devices and Systems

Tom Conte

Co-Chair, IEEE Rebooting Computing Initiative
Vice Chair, International Roadmap for Devices and Systems
Schools of CS & ECE, Georgia Institute of Technology
tom@conte.us

SLIDE 2

Why does computing need a “reboot”?

Moore's Law for 2D predicted to end in 2021

– But few architects care because…

Transistors have been getting smaller but cannot be clocked faster

– From an architect’s perspective: 10nm isn’t any better than 14nm, which was only marginally better than 22nm

The Power Wall: Single thread exponential performance scaling already ended in 2005

SLIDE 3

A history of modern computing: How we got here

1945: Von Neumann’s report describing computer architecture
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360), Moore #1
1975: Moore’s Law update; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC, general-purpose hitch a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins

SLIDE 4

In 1995, wire delays impact pipelining: Superscalar begins

Figure legend: Moore’s law; Processor performance

Source: Sanjay Patel, UIUC (used with permission)

SLIDE 5

We hid parallelism extraction with Superscalar Processor Microarchitectures

Figure: superscalar processor block diagram. Labels: Instruction Cache, Branch predictor, Instruction Fetch, Decode & Dispatch, Schedule (Reorder instructions), Issue N independent instructions, register file, ALUs (Execute in parallel), Data Cache

…Very few of these “tricks” are energy efficient

SLIDE 6

How we got here, part 2

1945: Von Neumann’s report describing computer architecture
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360)
1975: Moore’s Law; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC, general-purpose hitch a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins
2005: The Power Wall: Single thread exponential scaling ends (Intel Prescott)
…

SLIDE 7

Intel P4 Prescott: Q1 2004

200 W/cm²

SLIDE 8

Multicore era begins

Dilemma: Could not clock a single core aggressively AND continued to get more transistors per chip
Solution: Clock multiple cores conservatively
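
The trade-off behind this solution can be sketched with a common back-of-the-envelope model. The sketch assumes dynamic power scales as C·V²·f and that supply voltage can be lowered in proportion to clock frequency, so per-core power scales with the cube of frequency; it also assumes a perfectly parallel workload. None of these assumptions appear on the slide, and the function names are illustrative.

```python
def relative_power(cores: int, freq_scale: float) -> float:
    """Total dynamic power relative to one core at full clock.

    Assumption: voltage tracks frequency, so each core's power ~ freq_scale**3.
    """
    return cores * freq_scale ** 3


def relative_throughput(cores: int, freq_scale: float) -> float:
    """Aggregate throughput relative to one core at full clock.

    Assumption: the workload parallelizes perfectly across cores.
    """
    return cores * freq_scale


# Two cores at half clock versus one core at full clock.
print(relative_throughput(2, 0.5))  # same aggregate throughput
print(relative_power(2, 0.5))       # a quarter of the power under this model
```

Real workloads rarely parallelize perfectly, which is why the software burden reappears later in the deck.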

SLIDE 9

How we got here, part 3

1945: Von Neumann’s report describing computer architecture
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360)
1975: Moore’s Law; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC, general-purpose hitch a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins
2005: The Power Wall: Single thread exponential scaling ends (Intel Prescott)
2012: Realizing the problem: IEEE Rebooting Computing Initiative founded

SLIDE 10

Why IEEE? Encompasses the whole computing stack

IEEE Rebooting Computing

Council on Electronic Design Automation
Circuits & Systems Society

Goal: Rethink Everything: Turing & Von Neumann to now

SLIDE 11

IEEE Rebooting Computing

Summit 1: 2013 Dec. 12-13 (summary online)

– Invitation only
– Three Pillars:
  – Energy Efficiency
  – Security
  – Applications/HCI

Figure: Rebooting Computing pillars (Applications/HCI, Security, Energy Efficiency)

SLIDE 12

IEEE Rebooting Computing

Summit 2: 2014 May 14-16

– Engines of Computation
  – Adiabatic/Reversible Computing
  – Approximate Computing
  – Neuromorphic Computing
  – Augmentation of CMOS

Figure: Rebooting Computing pillars (Applications/HCI, Security, Energy Efficiency) plus the Engine Room

SLIDE 13

RCI Summit 2: Ways to compute

Many alternatives

– New switch
– 3D Integration
– Adiabatic/Reversible logic
– Unreliable switch
– Approximate, Stochastic
– Cryogenic
– Neuromorphic accelerators
– Analog neuromorphic
– Quantum
– …

Not all are general-purpose drop-ins

– (nor do they need to be)

SLIDE 14

There was a common phenomenon we discovered…

You talking to me?!?

The phenomenal success of von Neumann caused all other approaches to be labeled as “lunatic fringe”
Biases against taking risks remain today

SLIDE 15

IEEE Rebooting Computing

Summit 3: 2014 Oct. 23-24

– Algorithms and Architectures
  – Random algorithms
  – HCI and Applications
  – Also: Security, Approximate Computing

ITRS joins forces with RCI

Figure: Rebooting Computing pillars (Applications/HCI, Security, Energy Efficiency) plus the Engine Room and Algorithms & Architectures

SLIDE 16

IEEE Rebooting Computing

Summit 4: 2015 Dec. 10-11

Goal: coordinating efforts between:
– Industry (HP, Intel, NVIDIA)
– US: DOE, DARPA, IARPA, NSF
Goal 2: How to roadmap the future

Figure: Rebooting Computing pillars (Applications/HCI, Security, Energy Efficiency) plus the Engine Room and Algorithms & Architectures

SLIDE 17

RCI: “Software drives the computer industry”

Questions for software industry:

– How valuable is legacy software?
– What computing resources do the emerging applications need?
– How long and how much investment will it take to train a new generation of programmers?

Degrees of Pain vs. Gain…

SLIDE 18

Potential Approaches vs. Disruption in Computing Stack

Computing stack layers: logic, device, FU, Microarchitecture, ISA, Architecture, API, Language, Algorithm

Level 1: “More Moore”
Level 2: Hidden changes
Level 3: Architectural changes
Level 4: Non von Neumann computing

LEGEND: No Disruption to Total Disruption

SLIDE 19

Level 1: More Moore

Software: Legacy code works without issue
New switch candidates:

– Logic examples: Tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM, …

SLIDE 20

More Moore: A better switch?

Courtesy Dimitri Nikonov and Ian Young

SLIDE 21

3D Architecture example:

SLIDE 22

3D vs. 2D Cost Reduction

Figure labels: Gate; Bulk Si; Lithography and etch; Deposition and etch

SLIDE 23

Level 1: More Moore

Software: Legacy code works without issue
New switch candidates:

– Logic examples: Tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM

Moore’s law will go to 3D

SLIDE 24

Potential Approaches vs. Disruption in Computing Stack

Computing stack layers: logic, device, FU, Microarchitecture, ISA, Architecture, API, Language, Algorithm

Level 1: “More Moore”
Level 2: Hidden changes
Level 3: Architectural changes
Level 4: Non von Neumann computing

LEGEND: No Disruption to Total Disruption

SLIDE 25

Level 2: Not CMOS, but hidden

Software: Legacy code works, but may require performance tuning
– Superscalar in 1995 was an example
Microarchitectural changes to

– Use unreliable switch logic, and/or
– Use cryogenic superconducting
– Reversible computing

SLIDE 26

Lowering voltage gives quadratic improvement in power, but devices become unreliable below 1 V

Probability of signal error grows as the energy of the signal is reduced below 20 kT
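
A small numeric sketch of both statements follows. The quadratic term uses the standard dynamic-power relation P ≈ C·V²·f at fixed capacitance and frequency. The error estimate p ≈ exp(−E/kT) is an assumed Boltzmann-style model chosen for illustration; the slide does not state which error model it uses, and the function names are hypothetical.

```python
import math


def dynamic_power_scale(v_new: float, v_old: float) -> float:
    """Relative dynamic power at fixed capacitance and frequency (P ~ C * V^2 * f)."""
    return (v_new / v_old) ** 2


def thermal_error_probability(signal_energy_in_kt: float) -> float:
    """Assumed Boltzmann-style estimate: p ~ exp(-E / kT), with E given in units of kT."""
    return math.exp(-signal_energy_in_kt)


# Halving the supply voltage cuts dynamic power to a quarter.
print(dynamic_power_scale(0.5, 1.0))

# Under the assumed model, the error probability climbs steeply below 20 kT.
for energy in (40, 20, 10, 5):
    print(energy, thermal_error_probability(energy))
```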

SLIDE 27

Traditional Fault Tolerant Computing Reliability: “Triple Modular Redundancy” (TMR)

– ~200% overhead in area and energy to correct an error due to a single bit flip.
– Lose all power benefit of lower voltage
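
The voting step of TMR can be sketched as a bitwise 2-of-3 majority. This is a minimal software illustration, not a hardware design; the function names and the sample value are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority across three redundant copies of a result."""
    return (a & b) | (a & c) | (b & c)


def redundancy_overhead(copies: int) -> float:
    """Extra area/energy relative to a single copy, as a fraction."""
    return (copies - 1) / 1


good = 0b10110010
faulty = good ^ (1 << 3)          # one copy suffers a single bit flip

print(majority_vote(good, faulty, good) == good)   # the flip is outvoted
print(f"{redundancy_overhead(3):.0%}")             # two extra copies
```

Triplicating the logic is what produces the roughly 200% overhead quoted above.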

SLIDE 28

Redundant Residue Numbers can also correct errors

Range = 3*5*2*7 = 210

                                                  Redundant
decimal          mod 3   mod 5   mod 2   mod 7   mod 11   mod 13
13               1       3       1       6       2        0
14               2       4       0       0       3        1
add case 1: 27   0       2       1       6       7->5     1
add case 2: 27   0       1->2    1       6       5        1
add case 3: 27   0       1->2    1       6       7->5     1

Chinese Remainder Theorem

Check: compute |X’|11 & |X’|13; is |X’|mc == |X|mc ? If not, how to correct?

Case 1: (0,2,1,6) ⇔ 27; X’ = 27; |27|11 = 5; |27|13 = 1
  |X’|m5 = 5, |X|m5 = 7; |X’|m6 = 1, |X|m6 = 1
  Replace |X|m5 with |X’|m5
Case 2: (0,1,1,6) ⇔ 111; X’ = 111; |111|11 = 1; |111|13 = 7
  |X’|m5 = 1, |X|m5 = 5; |X’|m6 = 7, |X|m6 = 1
  Check error correction table
Case 3: Two errors. A double-error detection algorithm could be used to detect the errors, but they cannot be corrected.

This has been around for a long time (1968); time to look again
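
The three cases on this slide can be reproduced with a short sketch. It uses the slide's moduli (3, 5, 2, 7) and redundant moduli (11, 13), but the correction step is a swapped-in brute-force search: it drops one residue at a time and accepts the first reconstruction that falls inside the legitimate range, instead of the error correction table the slide refers to. All function names are hypothetical. For the slide's double-error case the search finds no fix and reports an error; for other double errors this simple search could return a wrong value.

```python
from math import prod

NON_REDUNDANT = (3, 5, 2, 7)   # moduli from the slide; legitimate range = 210
REDUNDANT = (11, 13)           # redundant moduli from the slide
MODULI = NON_REDUNDANT + REDUNDANT
LEGIT_RANGE = prod(NON_REDUNDANT)


def encode(x):
    """Residues of x for every modulus, redundant ones last."""
    return tuple(x % m for m in MODULI)


def add(a, b):
    """Residue-wise addition; no carries cross between residues."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, MODULI))


def crt(residues, moduli):
    """Chinese Remainder Theorem for pairwise-coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)   # modular inverse, Python 3.8+
    return x % total


def correct(residues):
    """Return (value, index of the faulty residue or None).

    Raises ValueError when no single-residue fix gives a legitimate value.
    """
    value = crt(residues, MODULI)
    if value < LEGIT_RANGE:
        return value, None
    for skip in range(len(MODULI)):
        kept = [i for i in range(len(MODULI)) if i != skip]
        candidate = crt([residues[i] for i in kept], [MODULI[i] for i in kept])
        if candidate < LEGIT_RANGE:
            return candidate, skip
    raise ValueError("more than one residue is wrong: detected, not corrected")


total = add(encode(13), encode(14))
print(total)                            # residues of 27
print(correct((0, 2, 1, 6, 7, 1)))      # case 1: mod-11 residue corrupted
print(correct((0, 1, 1, 6, 5, 1)))      # case 2: mod-5 residue corrupted
```

Calling `correct((0, 1, 1, 6, 7, 1))` (case 3) raises `ValueError`, matching the slide's "detect but unable to correct" outcome.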

SLIDE 29

RRNS Core Microarchitecture

50% overhead (<< 200%!)
100x more power efficient

B. Deng, et al., “Computationally-redundant energy-efficient processing for y'all (CREEPY),” Proceedings of the 2016 IEEE International Conference on Rebooting Computing (ICRC), (San Diego, CA), Oct. 17-19, 2016.

SLIDE 30

Metric | Supercomputer Titan at ORNL (#2 of Top500) | Superconducting Supercomputer | Ratio
Performance | 17.6 PFLOP/s (#2 in world*) | 20 PFLOP/s | ~1x
Memory | 710 TB (0.04 B/FLOPS) | 5 PB (0.25 B/FLOPS) | 7x
Power | 8,200 kW avg. (not included: cooling, storage memory) | 80 kW total power (includes cooling) | 0.01x
Space | 4,350 ft² (404 m², not including cooling) | ~200 ft² (includes cooling) | 0.05x
Cooling | additional power, space and infrastructure required | All cooling shown |

Figure: 2’ x 2’ same scale comparison
Courtesy of M. Manheimer, IARPA Cryogenic Computing Complexity (C3) Program

Superconducting: smaller, lower power, same performance
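
The ratio column of the comparison can be re-derived from the quoted figures, as in the sketch below. It assumes 1 PB = 1000 TB; the dictionary keys and the GFLOP/s-per-watt helper are illustrative additions, not part of the slide.

```python
# Figures copied from the comparison on this slide.
titan = {"pflops": 17.6, "memory_tb": 710, "power_kw": 8200, "space_ft2": 4350}
superconducting = {"pflops": 20.0, "memory_tb": 5000, "power_kw": 80, "space_ft2": 200}

# Ratio of the superconducting design to Titan for each metric.
ratios = {key: superconducting[key] / titan[key] for key in titan}


def gflops_per_watt(system):
    """PFLOP/s -> GFLOP/s is a factor of 1e6; kW -> W is a factor of 1e3."""
    return (system["pflops"] * 1e6) / (system["power_kw"] * 1e3)


efficiency_gain = gflops_per_watt(superconducting) / gflops_per_watt(titan)

for key, value in ratios.items():
    print(f"{key}: {value:.3f}x")
print(f"energy efficiency gain: {efficiency_gain:.0f}x")
```

Because the Titan power figure excludes cooling while the superconducting figure includes it, the computed efficiency gain of roughly two orders of magnitude is, if anything, understated.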

SLIDE 31

MIT-LL Fully-Planarized Nb Josephson Junction Process

Nb/AlOx/Nb JJ technology
10 kA/cm² (100 µA/µm²) baseline
200-mm Si substrates
4-, 8- & 10-Nb-layer nodes
Feature sizes to 500 nm
Full planarization for uniformity
Transition to stacked/stud vias

Figure labels: SFQ-4ee (8-Nb-layer); Process Features; Target 10-Nb-layer Process; 2 µm

SLIDE 32

Level 2: Not CMOS, but hidden

Software: Legacy code works, but may require performance tuning
Microarchitecture changes to

– Use unreliable switch logic, and/or
– Use cryogenic superconducting
– Reversible computing

Potential to make exascale orders of magnitude lower power

Significant R&D needed

SLIDE 33

Potential Approaches vs. Disruption in Computing Stack

Computing stack layers: logic, device, FU, Microarchitecture, ISA, Architecture, API, Language, Algorithm

Level 1: “More Moore”
Level 2: Hidden changes
Level 3: Architectural changes
Level 4: Non von Neumann computing

LEGEND: No Disruption to Total Disruption

SLIDE 34

Level 3: Architectural changes

Software: new programming required

– GPU was an example of this

Inexpensive parallelism available, but need to reprogram to use it

Use special purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing

And/or use memory-centric (e.g., Emu, The Machine) to move the computation to the data

SLIDE 35

Accelerators (and reconfigurable)

Also been around for a long time

– IBM 7030 Project STRETCH attached stream processor in 1961, various FP accelerators

Speedup through gate-level parallelism
Energy savings via elimination of instruction fetch & decode
Programming options:

– Compiler extraction, APIs, DSLs

11/22/2016

SLIDE 36

Approximate computing

Building acceptable systems out of unreliable/inaccurate hardware and software components
Many uses:

– Most start and/or end with human perception (images, video, control, etc.) or near-optimal search

Figure axes: Output accuracy; Efficiency and performance
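
One way to see the accuracy-versus-efficiency trade in software is loop perforation, sketched below. Loop perforation is a technique chosen here for illustration only; the slides do not name it, and the function names and input data are hypothetical.

```python
def exact_mean(values):
    """Reference result: visit every element."""
    return sum(values) / len(values)


def perforated_mean(values, stride):
    """Approximate the mean by visiting only every `stride`-th element."""
    sampled = values[::stride]
    return sum(sampled) / len(sampled)


values = list(range(1000))          # hypothetical input data
exact = exact_mean(values)

for stride in (1, 2, 4, 8):
    approx = perforated_mean(values, stride)
    work = len(values[::stride]) / len(values)      # fraction of elements visited
    error = abs(approx - exact) / exact             # relative output error
    print(f"stride={stride} work={work:.3f} relative_error={error:.4f}")
```

Skipping work lowers cost at the price of a bounded error, which is acceptable only when the consumer of the result tolerates it.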

SLIDE 37

Approximate computing challenges

Algorithms & programming languages

– Not there yet

Ensuring quality of output

– Step function: great… good… good-ish… ok… unacceptable

SLIDE 38

Level 3: Architectural changes

Software: new programming required

– GPU was an example of this

Inexpensive parallelism available, but need to reprogram to use it

Use special purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing

And/or use memory-centric (e.g., Emu, The Machine) to move the computation to the data

Software and programmers are the challenge

SLIDE 39

Potential Approaches vs. Disruption in Computing Stack

Computing stack layers: logic, device, FU, Microarchitecture, ISA, Architecture, API, Language, Algorithm

Level 1: “More Moore”
Level 2: Hidden changes
Level 3: Architectural changes
Level 4: Non von Neumann computing

LEGEND: No Disruption to Total Disruption

SLIDE 40

Level 4: Non-von Neumann

1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.)

SLIDE 41

Native Neuromorphic

Direct analog (memristor, etc.) neuromorphic has orders of magnitude better energy efficiency
Virtuous cycle of neuroscience informing neuromorphic, and neuromorphic serving as modeling platform to advance neuroscience

Figure labels: Neuromorphic computing; Neuroscience research

SLIDE 42

Quantum

Two varieties: gate-based (e.g., Shor’s algorithm), quantum annealing
For the former, devices are the challenge

– Quantum dots, Transmon, Ion, etc.
– Current coherence time: 100 µs
– Need to be several orders of magnitude longer
– Solution: Redundancy: 1 virtual qubit = 1000 physical qubits

Power needs per virtual qubit ~10 kW

– Most of the power for waveform generators, interfacing
– Cooling is a small percentage of the power
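
The two per-qubit figures on this slide imply a simple scaling rule, sketched below. The constants come from the slide; the machine size of 100 virtual qubits and the function name are hypothetical, and the sketch assumes both figures scale linearly with machine size.

```python
PHYSICAL_PER_VIRTUAL = 1000   # from the slide: 1 virtual qubit = 1000 physical qubits
KW_PER_VIRTUAL = 10           # from the slide: ~10 kW per virtual qubit


def machine_budget(virtual_qubits: int) -> dict:
    """Physical qubit count and power draw, assuming linear scaling."""
    return {
        "physical_qubits": virtual_qubits * PHYSICAL_PER_VIRTUAL,
        "power_kw": virtual_qubits * KW_PER_VIRTUAL,
    }


# 100 virtual qubits is a hypothetical machine size chosen for illustration.
print(machine_budget(100))
```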

SLIDE 43

Level 4: Non-von Neumann

1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.)

System software nonexistent
Very immature technology
MASSIVE investment still needed

SLIDE 44

How to Roadmap post 2021
The IEEE International Roadmap for Devices and Systems

SLIDE 45

Roadmap history

Gordon Moore published two forecasts, one in 1965 and one in 1975.
National Technology Roadmap for Semiconductors in 1998, changed to International Technology Roadmap for Semiconductors (ITRS), and then ITRS 2.0
Produced 20 total roadmaps 1998-2015

SLIDE 46

RCI Summit 1 identified that Roadmapping is Essential For Success

Moore defined the law; roadmapping with buy-in from industry kept it going because roadmapping:

1. Tracks progress
2. Finds roadblocks
3. Identifies and compares potential solutions
4. Is pre-competitive/standards-like

SLIDE 47

Recent Roadmap history

In 2013, ITRS 2.0 signed an agreement with IEEE Rebooting Computing Initiative
In 2015, ITRS 2.0 declared ‘last roadmap’ by US SIA leadership
International industry disagreed. Rather, it should be evolved
IEEE RCI moves roadmapping into IEEE as the International Roadmap for Devices and Systems (IRDS)

SLIDE 48

IEEE International Roadmap for Devices and Systems

Formed under IEEE Standards Association, first meeting Leuven, Belgium (May 2016)
Roadmapping processes from ITRS 2.0
New “front end” processes to identify:

– Important application drivers
– Appropriate computer architectures

SLIDE 49

International Roadmap for Devices and Systems

Figure labels:
– Capabilities: Design, ERM, Interconnects, ESH, Metrology, FEP, Modeling, Litho, Test, Yield
– MM, FI, BC, HC, HI, OSC
– Devices, Components, Integration
– Systems & Architectures, Applications Benchmarking
– Environmental Analysis

SLIDE 50

ITRS vs. IRDS: A new “front end”

Figure labels: AB (Applications Benchmarking); SA (Systems & Architecture); Roadmaps: App vs. performance; MM, FI, BC, HC, HI, OSC

SLIDE 51

Application Domains

Application area Description

Big Data Analytics

Data mining to identify nodes in a large graph that satisfy a given feature/features

Feature Recognition

Graphical dynamic moving image (movie) recognition of a class of targets (e.g., face, car). This can include neuromorphic approaches such as CNNs.

Discrete Event Simulation

Large discrete event simulation of a discretized-time system (e.g., large computer system simulation). Generally used to model engineered systems and is based on integer computation.

Physical system simulation

Simulation of physical real-world phenomena. Typically finite-element based and uses floating-point. Examples include fluid flow, weather prediction, thermo-evolution

Optimization

Integer NP-hard optimization problems

Graphics rendering / VR / AR

Large scale, real-time photorealistic rendering driven by physical world models. Examples include RPGs, VR.

Media processing

Discrete processing, including filtering, compressing, decompressing of streaming media, where the media is unknown (i.e., camera stream based). Includes integrating multiple cameras to feed graphics rendering

Cryptographic codec

Encrypting and decrypting of data at the edge of cryptographic science. Includes asymmetric-key encryption, excludes symmetric-key encryption.

SLIDE 52

ITRS vs. IRDS: A new “front end”

Figure labels: AB (Applications Benchmarking); SA (Systems & Architecture); Roadmaps: App vs. performance; Cross matrix: App => Market Driver; Market Drivers; MM, FI, BC, HC, HI, OSC

SLIDE 53

Cross matrix (AB interface to SA)

Market drivers (columns): Medical diagnosis, Bioinformatics, Medical device, IoT (edge), Cloud / HPC, Big Data, Robotics, CPS, Phone, Automotive

Entries per application area, in column order (empty cells omitted):
Big Data Analytics: G; G; X; G
Feature Recognition: G; X; P; X; X; X; G, P; P; P; P
Discrete Event Simulation: X
Physical system simulation: X; X
Optimization: X; X; P; X; G, P; P
Graphics rendering: X; X; P; P
Media processing: X; G; P; X; X; X; P; G; P
Cryptographic codec: X; X; G, P; G, P; X; X; X; P; G, P; P

Legend: X = important; G = Gating (critical); P = Power sensitive and important

SLIDE 54

ITRS vs. IRDS: A new “front end”

Figure labels: AB (Applications Benchmarking); SA (Systems & Architecture); Roadmaps: App vs. performance; Cross matrix: App => Market Driver; Market Drivers; Roadmaps: Market Driver+Arch vs. metrics (speed, bandwidth); Technology-driven roadmapping focus teams

SLIDE 55

IEEE Rebooting Computing: Summary

Levels of RC:

1. More Moore (New switch/ 3D)
2. Microarchitecture changes
3. Architecture changes
4. Non-von Neumann

Direct pain/gain tradeoffs
New device R&D required
IRDS: Applications-driven Roadmapping is starting to identify needed devices

SLIDE 56

Web portal http://rebootingcomputing.ieee.org
