ACMP: An Architecture to Handle Amdahl’s Law (M. Aater Suleman)

SLIDE 1

ACMP: An Architecture to Handle Amdahl’s Law

M. Aater Suleman

Advisor: Yale Patt

HPS Research Group

SLIDE 2

Acknowledgements

  • Eric Sprangle, Intel
  • Anwar Rohillah, Intel
  • Anwar Ghuloum, Intel
  • Doug Carmean, Intel

SLIDE 3

Background

  • Single-thread performance is power constrained
  • To leverage CMPs for a single application, it must be parallelized
  • Many kernels cannot be parallelized completely
  • Applications likely include both serial and parallel portions
  • Amdahl’s law is more applicable now than ever
SLIDE 4

Serial Bottlenecks

  • Inherently serial kernels:

    For I = 1 to N
        A[I] = (A[I-1] + A[I]) / 2

  • Parallelization requires effort

[Figure: degree of parallelism achieved vs. programmer effort, for data-parallel loops, loops with early termination, and irregular code]
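The recurrence in the first bullet can be run directly; a short Python sketch (with arbitrary input values) makes the loop-carried dependence concrete:

```python
# The slide's serial kernel: A[I] = (A[I-1] + A[I]) / 2.
# Iteration i reads the value that iteration i-1 just wrote, so the
# iterations form a dependence chain and cannot execute concurrently.
a = [8.0, 4.0, 2.0, 6.0, 10.0, 0.0, 4.0, 2.0]  # arbitrary example data
for i in range(1, len(a)):
    a[i] = (a[i - 1] + a[i]) / 2
print(a)  # [8.0, 6.0, 4.0, 5.0, 7.5, 3.75, 3.875, 2.9375]
```

No reordering of the iterations produces the same result, which is what makes such kernels inherently serial.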

SLIDE 5

CMP Architectures

  • Tile small cores, e.g. Sun Niagara, Intel Larrabee
    – High throughput on the parallel part
    – Low serial thread performance
    – Highest performance for completely parallelized applications
  • Tile large cores, e.g. Intel Core2Duo, AMD Barcelona, and IBM Power 5
    – High serial thread performance
    – Lower throughput than Niagara

SLIDE 6

ACMP

  • Run serial thread on the large core to extract ILP
  • Run parallel threads on small cores
SLIDE 9

Performance vs. Parallelism

[Figure: speedup vs. one P6-type core as a function of degree of parallelism, for ACMP, Niagara, and P6-Tile]

SLIDE 10

Performance vs. Parallelism


At low parallelism, ACMP and P6-Tile outperform Niagara
SLIDE 11

Performance vs. Parallelism


At high parallelism, Niagara outperforms ACMP
SLIDE 12

Performance vs. Parallelism


At medium parallelism, ACMP wins

SLIDE 13

Performance vs. Parallelism


The cut-off point moves to the right in the future
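The crossover behavior on these performance-vs-parallelism slides falls out of a simple Amdahl's-law model. The Python sketch below uses an assumed relative performance (one small in-order core at 0.5x a P6-type core), chosen purely for illustration; it is not a number measured in this work.

```python
# Amdahl's-law sketch of the three configurations compared here.
# ASSUMPTION (not from the slides): a small in-order core delivers
# 0.5x the performance of one P6-type large core.
SMALL = 0.5

def speedup(p, serial_perf, parallel_throughput):
    """Speedup vs. one P6-type core for parallel fraction p in [0, 1]."""
    serial_time = (1.0 - p) / serial_perf
    parallel_time = p / parallel_throughput
    return 1.0 / (serial_time + parallel_time)

def niagara(p):   # 16 small cores; serial code also runs on a small core
    return speedup(p, SMALL, 16 * SMALL)

def p6_tile(p):   # 4 large cores
    return speedup(p, 1.0, 4.0)

def acmp(p):      # 1 large + 12 small; serial code runs on the large core
    return speedup(p, 1.0, 1.0 + 12 * SMALL)
```

With these assumed numbers the model reproduces the shape of the curves: at p = 0.2 the ACMP and P6-Tile curves sit above Niagara, at p = 0.5 ACMP is highest, and as p approaches 1 Niagara's 16-core throughput wins.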

SLIDE 14

Experimental Methodology

  • Large core: out-of-order (similar to P6)
  • Small core: 2-wide, in-order
  • Configurations:
    – Niagara: 16 small cores
    – P6-Tile: 4 large cores
    – ACMP: 1 large core, 12 small cores
  • Single ISA, shared memory, private L1 and L2 caches, bi-directional ring interconnect
  • Simulated existing multi-threaded applications without modification
  • ACMP thread scheduling:
    – Master thread runs on the large core
    – All additional threads run on small cores
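The scheduling policy in the last bullet reduces to a one-line mapping. This is an illustrative sketch, not the authors' code; the core identifiers are invented for the example.

```python
# Illustrative sketch of the ACMP scheduling policy described above:
# master thread -> the single large core, all other threads -> the
# 12 small cores. Core names are made up for this example.
LARGE_CORE = "L0"                              # the single large core
SMALL_CORES = [f"S{i}" for i in range(12)]     # the 12 small cores

def acmp_schedule(thread_id: int) -> str:
    """Master thread (id 0) -> large core; others -> small cores, round-robin."""
    if thread_id == 0:
        return LARGE_CORE
    return SMALL_CORES[(thread_id - 1) % len(SMALL_CORES)]

print(acmp_schedule(0), acmp_schedule(1), acmp_schedule(13))  # L0 S0 S0
```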

SLIDE 15

Performance Results

[Figure: speedup of P6-Tile and ACMP vs. Niagara on mcf, is_nasp, fft_splash, cg_nasp, ep_nasp, art_omp, mg_nasp, fmm_splash, cholesky, page convert, and h.264 ed, grouped into low, medium, and high parallelism]

SLIDE 16

Summary

  • ACMP trades peak parallel performance for serial performance
  • Improves performance for a wide range of applications
  • Performance is less dependent on the length of the serial portion
  • Improves programmer efficiency
    – Programmers need to parallelize only the easier-to-parallelize kernels

SLIDE 17

Future Work

  • Enhanced ACMP scheduling
    – Accelerate execution of finer-grain serial portions (critical sections) using the large core
    – Requires compiler support and minimal hardware
  • Improved threading decision based on run-time feedback
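A hypothetical sketch of the first item, using ordinary threads as stand-ins for cores: "small core" threads ship each critical section to a single "large core" server thread, which runs the sections one at a time. The queue-based mechanism and all names here are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch (not the authors' design) of accelerating
# critical sections on the large core: worker threads ship each
# critical section to one "large core" server thread, which
# executes the shipped sections serially.
import queue
import threading

class LargeCoreServer:
    """Stand-in for the large core: executes shipped critical sections serially."""
    def __init__(self):
        self._requests = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, done = self._requests.get()
            fn()        # the critical section executes here, serially
            done.set()

    def execute(self, critical_section):
        """Called from a 'small core' thread; blocks until the section completes."""
        done = threading.Event()
        self._requests.put((critical_section, done))
        done.wait()
```

Because a single server thread drains the queue, shipped sections are naturally serialized; the future-work bullet above additionally assumes compiler support and minimal hardware to make the shipping cheap.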

SLIDE 18

Thank you