ACMP: An Architecture to Handle Amdahl’s Law (M. Aater Suleman)

SLIDE 1

ACMP: An Architecture to Handle Amdahl’s Law

M. Aater Suleman

Advisor: Yale Patt

HPS Research Group

SLIDE 2

Acknowledgements

  • Eric Sprangle, Intel
  • Anwar Rohillah, Intel
  • Anwar Ghuloum, Intel
  • Doug Carmean, Intel

SLIDE 3

Background

  • Single-thread performance is power constrained
  • To leverage CMPs for a single application, it must be parallelized
  • Many kernels cannot be parallelized completely
  • Applications likely include both serial and parallel portions
  • Amdahl’s law is more applicable now than ever
SLIDE 4

Serial Bottlenecks

  • Inherently serial kernels:

    For I = 1 to N
        A[I] = (A[I-1] + A[I]) / 2

  • Parallelization requires effort

[Figure: degree of parallelism achieved vs. programmer effort, for data-parallel loops, loops with early termination, and irregular code]
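The recurrence in the first bullet can be run directly; a short Python sketch (with arbitrary input values) makes the loop-carried dependence concrete:

```python
# The slide's serial kernel: A[I] = (A[I-1] + A[I]) / 2.
# Iteration i reads the value that iteration i-1 just wrote, so the
# iterations form a dependence chain and cannot execute concurrently.
a = [8.0, 4.0, 2.0, 6.0, 10.0, 0.0, 4.0, 2.0]  # arbitrary example data
for i in range(1, len(a)):
    a[i] = (a[i - 1] + a[i]) / 2
print(a)  # [8.0, 6.0, 4.0, 5.0, 7.5, 3.75, 3.875, 2.9375]
```

No reordering of the iterations produces the same result, which is what makes such kernels inherently serial.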

SLIDE 5

CMP Architectures

  • Tile small cores, e.g. Sun Niagara, Intel Larrabee
    – High throughput on the parallel part
    – Low serial thread performance
    – Highest performance for completely parallelized applications
  • Tile large cores, e.g. Intel Core2Duo, AMD Barcelona, and IBM Power 5
    – High serial thread performance
    – Lower throughput than Niagara

SLIDE 6

ACMP

  • Run serial thread on the large core to extract ILP
  • Run parallel threads on small cores
SLIDE 9

Performance vs. Parallelism

[Figure: speedup vs. one P6-type core as a function of degree of parallelism, for ACMP, Niagara, and P6-Tile]

SLIDE 10

Performance vs. Parallelism


At low parallelism, ACMP and P6-Tile outperform Niagara
SLIDE 11

Performance vs. Parallelism


At high parallelism, Niagara outperforms ACMP
SLIDE 12

Performance vs. Parallelism


At medium parallelism, ACMP wins

SLIDE 13

Performance vs. Parallelism


The cut-off point moves to the right in the future
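The crossover behavior on these performance-vs-parallelism slides falls out of a simple Amdahl's-law model. The Python sketch below uses an assumed relative performance (one small in-order core at 0.5x a P6-type core), chosen purely for illustration; it is not a number measured in this work.

```python
# Amdahl's-law sketch of the three configurations compared here.
# ASSUMPTION (not from the slides): a small in-order core delivers
# 0.5x the performance of one P6-type large core.
SMALL = 0.5

def speedup(p, serial_perf, parallel_throughput):
    """Speedup vs. one P6-type core for parallel fraction p in [0, 1]."""
    serial_time = (1.0 - p) / serial_perf
    parallel_time = p / parallel_throughput
    return 1.0 / (serial_time + parallel_time)

def niagara(p):   # 16 small cores; serial code also runs on a small core
    return speedup(p, SMALL, 16 * SMALL)

def p6_tile(p):   # 4 large cores
    return speedup(p, 1.0, 4.0)

def acmp(p):      # 1 large + 12 small; serial code runs on the large core
    return speedup(p, 1.0, 1.0 + 12 * SMALL)
```

With these assumed numbers the model reproduces the shape of the curves: at p = 0.2 the ACMP and P6-Tile curves sit above Niagara, at p = 0.5 ACMP is highest, and as p approaches 1 Niagara's 16-core throughput wins.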

SLIDE 14

Experimental Methodology

  • Large core: out-of-order (similar to P6)
  • Small core: 2-wide, in-order
  • Configurations:
    – Niagara: 16 small cores
    – P6-Tile: 4 large cores
    – ACMP: 1 large core, 12 small cores
  • Single ISA, shared memory, private L1 and L2 caches, bi-directional ring interconnect
  • Simulated existing multi-threaded applications without modification
  • ACMP thread scheduling:
    – Master thread runs on the large core
    – All additional threads run on small cores
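The scheduling policy in the last bullet reduces to a one-line mapping. This is an illustrative sketch, not the authors' code; the core identifiers are invented for the example.

```python
# Illustrative sketch of the ACMP scheduling policy described above:
# master thread -> the single large core, all other threads -> the
# 12 small cores. Core names are made up for this example.
LARGE_CORE = "L0"                              # the single large core
SMALL_CORES = [f"S{i}" for i in range(12)]     # the 12 small cores

def acmp_schedule(thread_id: int) -> str:
    """Master thread (id 0) -> large core; others -> small cores, round-robin."""
    if thread_id == 0:
        return LARGE_CORE
    return SMALL_CORES[(thread_id - 1) % len(SMALL_CORES)]

print(acmp_schedule(0), acmp_schedule(1), acmp_schedule(13))  # L0 S0 S0
```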

SLIDE 15

Performance Results

[Figure: speedup of P6-Tile and ACMP vs. Niagara on mcf, is_nasp, fft_splash, cg_nasp, ep_nasp, art_omp, mg_nasp, fmm_splash, cholesky, page convert, and h.264 ed, grouped into low, medium, and high parallelism]

SLIDE 16

Summary

  • ACMP trades peak parallel performance for serial performance
  • Improves performance for a wide range of applications
  • Performance is less dependent on the length of the serial portion
  • Improves programmer efficiency
    – Programmers need to parallelize only the easier-to-parallelize kernels

SLIDE 17

Future Work

  • Enhanced ACMP scheduling
    – Accelerate execution of finer-grain serial portions (critical sections) using the large core
    – Requires compiler support and minimal hardware
  • Improved threading decision based on run-time feedback
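A hypothetical sketch of the first item, using ordinary threads as stand-ins for cores: "small core" threads ship each critical section to a single "large core" server thread, which runs the sections one at a time. The queue-based mechanism and all names here are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch (not the authors' design) of accelerating
# critical sections on the large core: worker threads ship each
# critical section to one "large core" server thread, which
# executes the shipped sections serially.
import queue
import threading

class LargeCoreServer:
    """Stand-in for the large core: executes shipped critical sections serially."""
    def __init__(self):
        self._requests = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, done = self._requests.get()
            fn()        # the critical section executes here, serially
            done.set()

    def execute(self, critical_section):
        """Called from a 'small core' thread; blocks until the section completes."""
        done = threading.Event()
        self._requests.put((critical_section, done))
        done.wait()
```

Because a single server thread drains the queue, shipped sections are naturally serialized; the future-work bullet above additionally assumes compiler support and minimal hardware to make the shipping cheap.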

SLIDE 18

Thank you