SLIDE 1

Christos Kozyrakis and Kunle Olukotun

http://ppl.stanford.edu

Hot Chips 21 – Stanford – August 2009

SLIDE 2

Applications

Ron Fedkiw, Vladlen Koltun, Sebastian Thrun

Programming & software systems

Alex Aiken, Pat Hanrahan, John Ousterhout, Mendel Rosenblum

Architecture

Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (director)

SLIDE 3

Goal: the parallel computing platform for 2015

Parallel application development practical for the masses

Joe the programmer…

Parallel applications without parallel programming

PPL is a collaboration of

Leading Stanford researchers across multiple domains

Applications, languages, software systems, architecture

Leading companies in computer systems and software

Sun, AMD, Nvidia, IBM, Intel, NEC, HP

PPL is open

Any company can join; all results in the public domain

SLIDE 4

1. Finding independent tasks

2. Mapping tasks to execution units

3. Implementing synchronization (races, livelocks, deadlocks, …; see the sketch below)

4. Composing parallel tasks

5. Recovering from HW & SW errors

6. Optimizing locality and communication

7. Predictable performance & scalability

8. … and all the sequential programming issues

Even with new tools, can Joe handle these issues?
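
To make challenge 3 concrete, here is a minimal Scala sketch (illustrative only, not PPL code; all names such as RaceSketch are made up) of a data race: two threads increment a shared counter without synchronization, so updates are silently lost.

object RaceSketch {
  // Shared, unsynchronized state
  var counter = 0

  def main(args: Array[String]): Unit = {
    val threads = Seq.fill(2)(new Thread(() => {
      var i = 0
      while (i < 1000000) { counter += 1; i += 1 }  // unprotected read-modify-write
    }))
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(s"expected 2000000, got $counter")      // usually prints a smaller number
  }
}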

SLIDE 5

Guiding observations

Must hide low-level issues from the programmer

No single discipline can solve all problems

Top-down research driven by applications

Core techniques

Domain specific languages (DSLs)

Simple & portable programs

Heterogeneous hardware

Energy and area efficient computing

SLIDE 6

[Figure: PPL research stack. Applications (Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering) sit on Domain Specific Languages (Physics, Scripting, Probabilistic, Machine Learning, Rendering), which target a DSL Infrastructure (Parallel Object Language and Common Parallel Runtime with explicit/static and implicit/dynamic layers), running on Heterogeneous Hardware (OOO, SIMD, and threaded cores) and a Hardware Architecture providing programmable hierarchies, scalable coherence, isolation & atomicity, and pervasive monitoring.]

SLIDE 7

[Figure: the PPL research stack diagram from slide 6, repeated.]

SLIDE 8

Leverage domain expertise at Stanford

CS research groups, national centers for scientific computing

[Figure: PPL at the center of existing Stanford efforts: research centers (Media-X, NIH NCBC, DOE ASC, Environmental Science/Geophysics, seismic modeling) and CS research groups (Web/Mining, Streaming DB, Graphics/Games, Mobile HCI, AI/ML, Robotics).]

SLIDE 9

Next-generation web platform

Millions of players in vast landscapes

Immersive collaboration

Social gaming

Computing challenges

Client-side game engine

Graphics rendering

Server-side world simulation

Object scripting, geometric queries, AI, physics computation

Dynamic content, huge datasets

More at http://vw.stanford.edu/

SLIDE 10

[Figure: the PPL research stack diagram from slide 6, repeated.]

SLIDE 11

High-level languages targeted at specific domains

E.g.: SQL, Matlab, OpenGL, Ruby/Rails, …

Usually declarative and simpler than general-purpose languages

DSLs → higher productivity for developers

High-level data types & ops (e.g. relations, triangles, …)

Express high-level intent w/o implementation artifacts

DSLs → scalable parallelism for the system

Declarative description of parallelism & locality patterns

Can be ported or scaled to available machine

Allows for domain specific optimization

Automatically adjust structures, mapping, and scheduling

SLIDE 12

Goal: simplify code of mesh-based PDE solvers

Write once, run on any type of parallel machine

From multi-cores and GPUs to clusters

Language features

Built-in mesh data types

Vertex, edge, face, cell

Collections of mesh elements

cell.faces(), face.edgesCCW()

Mesh-based data storage

Fields, sparse matrices

Parallelizable iterations

Map, reduce, forall statements

SLIDE 13

val position = vertexProperty[double3]("pos")
val A = new SparseMatrix[Vertex,Vertex]
for (c <- mesh.cells) {
  val center = average position of c.vertices
  for (f <- c.faces) {
    val face_dx = average position of f.vertices - center
    for (e <- f.edges With c CounterClockwise) {
      val v0 = e.tail
      val v1 = e.head
      val v0_dx = position(v0) - center
      val v1_dx = position(v1) - center
      val face_normal = v0_dx cross v1_dx
      // calculate flux for face
      …
      A(v0,v1) += …
      A(v1,v0) -= …
    }
  }
}

SLIDE 14

(The same Liszt code as on slide 13, annotated with the points below.)

High-level data types & operations

Explicit parallelism using map/reduce/forall

Implicit parallelism with help from DSL & HW

No low-level code to manage parallelism

SLIDE 15

Liszt compiler & runtime manage parallel execution

Data layout & access, domain decomposition, communication, …

Domain specific optimizations

Select mesh layout (grid, tetrahedral, unstructured, custom, …)

Select decomposition that improves locality of access

Optimize communication strategy across iterations

Optimizations are possible because

Mesh semantics are visible to compiler & runtime

Iterative programs with data accesses based on mesh topology

Mesh topology is known to the runtime

SLIDE 16

[Figure: the PPL research stack diagram from slide 6, repeated.]

SLIDE 17

Provide a shared framework for DSL development

Features

Common parallel language that retains DSL semantics

Mechanism to express domain specific optimizations

Static compilation + dynamic management environment

For regular and unpredictable patterns respectively

Synthesize HW features into high-level solutions

E.g. from HW messaging to fast runtime for fine-grain tasks

Exploit heterogeneous hardware to improve efficiency

SLIDE 18

Required features

Support for functional programming (FP)

Declarative programming style for portable parallelism

Higher-order functions allow parallel control structures

Support for object-oriented programming (OOP)

Familiar model for complex programs

Allows mutable data structures and domain-specific attributes

Managed execution environment

For runtime optimizations & automated memory management

Our approach: embed DSLs in the Scala language

Supports both FP and OOP features

Supports embedding of higher-level abstractions

Compiles to Java bytecode
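
As a concrete illustration of the embedding idea, here is a minimal sketch of a toy DSL embedded in Scala. It is not PPL or Delite code and all names (TinyVectorDSL, Vec, zipWith) are hypothetical; it only shows how OOP gives domain types with infix operators while higher-order functions give control structures whose parallel implementation the DSL is free to choose.

object TinyVectorDSL {
  // OOP side of the embedding: a domain type with infix operators
  final class Vec(val data: Array[Double]) {
    def +(that: Vec): Vec = zipWith(this, that)(_ + _)
    def *(s: Double): Vec = new Vec(data.map(_ * s))
  }

  // FP side: a higher-order control structure the DSL may parallelize as it likes
  def zipWith(a: Vec, b: Vec)(f: (Double, Double) => Double): Vec =
    new Vec(Array.tabulate(a.data.length)(i => f(a.data(i), b.data(i))))

  def main(args: Array[String]): Unit = {
    val x = new Vec(Array(1.0, 2.0, 3.0))
    val y = new Vec(Array(4.0, 5.0, 6.0))
    val z = (x + y) * 2.0                  // reads like math, compiles to bytecode
    println(z.data.mkString(", "))         // 10.0, 14.0, 18.0
  }
}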

SLIDE 19

Application code calls Matrix DSL methods

The DSL defers op execution to Delite

Delite applies generic & domain transformations to generate the mapping
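
The sketch below (hypothetical names, not the actual Delite API) illustrates the deferral idea: DSL operations build IR nodes instead of computing immediately, and a stand-in for the runtime later walks the graph, where a real system would apply generic and domain transformations and map ops to heterogeneous hardware.

object DeferredMatrixDSL {
  // IR nodes recorded by the DSL instead of eager computation
  sealed trait Exp
  case class Const(rows: Int, cols: Int, data: Array[Double]) extends Exp
  case class MatMul(a: Exp, b: Exp) extends Exp
  case class MatAdd(a: Exp, b: Exp) extends Exp

  // "DSL methods": operators that only record work
  implicit class MatrixOps(a: Exp) {
    def *(b: Exp): Exp = MatMul(a, b)
    def +(b: Exp): Exp = MatAdd(a, b)
  }

  // Stand-in for the runtime: here it just prints a schedule
  def execute(e: Exp): Unit = e match {
    case Const(r, c, _) => println(s"load ${r}x${c} matrix")
    case MatMul(a, b)   => execute(a); execute(b); println("matmul kernel")
    case MatAdd(a, b)   => execute(a); execute(b); println("matadd kernel")
  }

  def main(args: Array[String]): Unit = {
    val a = Const(2, 2, Array(1.0, 2.0, 3.0, 4.0))
    val b = Const(2, 2, Array(5.0, 6.0, 7.0, 8.0))
    execute(a * b + a)   // nothing ran until the runtime walks the IR
  }
}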

SLIDE 20

[Plot: speedup vs. execution cores (1 to 128) for Gaussian Discriminant Analysis, comparing Original, + Domain Optimizations, and + Data Parallelism.]

Low speedup due to loop dependencies

SLIDE 21

[Plot: the Gaussian Discriminant Analysis speedup chart from slide 20.]

Domain info used to refactor dependencies

SLIDE 22

[Plot: the Gaussian Discriminant Analysis speedup chart from slide 20.]

Exploiting data parallelism within tasks

SLIDE 23

[Figure: the PPL research stack diagram from slide 6, repeated.]

SLIDE 24

Heterogeneous HW for energy & area efficiency

ILP, threads, data-parallel engines, custom engines

Q: what is the right balance of features in a chip?

Q: what tools can generate the best chip for an app/domain?

Study: HW options for H.264 encoding

[Charts for H.264 encoding implementations (4 cores, + ILP, + SIMD, + custom instructions, ASIC): area (normalized, roughly 1.0 to 3.0) and performance / energy savings (log scale, 1 to 1000).]

SLIDE 25

Revisit architectural support for parallelism

Which are the basic HW primitives needed?

Challenges: semantics, implementation, scalability, virtualization, interactions, granularity (fine-grain & bulk), …

HW primitives

Coherence & consistency, atomicity & isolation, memory partitioning, data and control messaging, event monitoring

Runtime synthesizes primitives into SW solutions

Streaming system: mem partitioning + bulk data messaging

TLS: isolation + fine-grain control communication

Transactional memory: atomicity + isolation + consistency

Security: mem partitioning + isolation

Fault tolerance: isolation + checkpoint + bulk data messaging

SLIDE 26

Parallel tasks with a few thousand instructions

Critical to exploit in large-scale chips

Tradeoff: load balance vs. overheads vs. locality

Software-only scheduling

Per-thread task queues + task stealing

Flexible algorithms but high stealing overheads
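
Below is a minimal Scala sketch of this software-only scheme (illustrative only; names such as WorkStealingSketch are made up): each worker thread pops tasks from its own deque and, when it runs dry, steals from a randomly chosen victim, which is where the stealing overhead comes from.

import java.util.concurrent.ConcurrentLinkedDeque
import java.util.concurrent.atomic.AtomicInteger
import scala.util.Random

object WorkStealingSketch {
  type Task = () => Unit

  def run(numWorkers: Int, tasks: Seq[Task]): Unit = {
    val queues = Array.fill(numWorkers)(new ConcurrentLinkedDeque[Task]())
    tasks.zipWithIndex.foreach { case (t, i) => queues(i % numWorkers).addLast(t) }
    val remaining = new AtomicInteger(tasks.length)

    val workers = (0 until numWorkers).map { id =>
      new Thread(() => {
        val rnd = new Random(id)
        while (remaining.get() > 0) {
          // Pop locally (LIFO, good locality); otherwise steal (FIFO) from a victim
          var task = queues(id).pollLast()
          if (task == null) task = queues(rnd.nextInt(numWorkers)).pollFirst()
          if (task != null) { task(); remaining.decrementAndGet() }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
  }

  def main(args: Array[String]): Unit =
    run(4, (1 to 16).map(i => () => println(s"task $i")))
}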

[Bar chart: execution time broken into idle, overhead, and running for SW-only, HW-only, and ADM scheduling at 32, 64, and 128 cores on the cg and gtfold benchmarks.]

Overheads and scheduling get worse at larger scales

SLIDE 27

Hardware-only scheduling

HW task queues + HW stealing protocol

Minimal overheads (bypass coherence protocol)

But fixes the scheduling algorithm

Optimal approach varies across applications

Impractical to support all options in HW

[Bar chart: the scheduling comparison from slide 26 (SW-only vs. HW-only vs. ADM on cg and gtfold).]

Wrong scheduling algorithm makes HW slower than SW

SLIDE 28

Simple HW feature: asynchronous direct messages

Register-to-register, received with a user-level interrupt

Fast messaging for SW schedulers with flexible algorithms

E.g., the gtfold scheduler tracks domain-specific dependencies

Also useful for fast barriers, reductions, IPC, …

Better performance, simpler HW, more flexibility & uses
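
The sketch below is only a software analogy (all names are hypothetical; real ADM is a hardware register-to-register mechanism): it models a message arriving at a core as a user-level handler invocation, and shows how a scheduler could serve steal requests through such messages rather than through shared task queues.

import scala.collection.mutable

object AdmSchedulerSketch {
  type Task = () => Unit
  case class Msg(from: Int, kind: String, task: Option[Task] = None)

  // Stand-in for the on-chip messaging fabric: delivering a message simply
  // runs the receiver's handler, mimicking a user-level interrupt
  class Fabric {
    val cores = mutable.ArrayBuffer[Core]()
    def deliver(to: Int, m: Msg): Unit = cores(to).receive(m)
  }

  class Core(val id: Int, fabric: Fabric) {
    private val localTasks = mutable.Queue[Task]()
    def push(t: Task): Unit = localTasks.enqueue(t)
    def askFor(victim: Int): Unit = fabric.deliver(victim, Msg(id, "STEAL_REQ"))

    // The handler encodes the scheduling policy entirely in software
    def receive(m: Msg): Unit = m.kind match {
      case "STEAL_REQ" =>
        val reply =
          if (localTasks.nonEmpty) Msg(id, "TASK", Some(localTasks.dequeue()))
          else Msg(id, "NACK")
        fabric.deliver(m.from, reply)
      case "TASK" => m.task.foreach(_.apply())   // run the stolen task
      case "NACK" => ()                          // requester would try another victim
    }
  }

  def main(args: Array[String]): Unit = {
    val fabric = new Fabric
    fabric.cores ++= Seq(new Core(0, fabric), new Core(1, fabric))
    fabric.cores(0).push(() => println("stolen task runs on the requesting core"))
    fabric.cores(1).askFor(victim = 0)
  }
}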

[Bar chart: the scheduling comparison from slide 26 (SW-only vs. HW-only vs. ADM on cg and gtfold).]

Scalable scheduling with simple HW

SLIDE 29

Goal: make parallel computing practical for the masses

Technical approach

Domain specific languages (DSLs)

Simple & portable programs

Heterogeneous hardware

Energy and area efficient computing

Working on the SW & HW techniques that bridge them

More info at: http://ppl.stanford.edu