

SLIDE 1

Topology-aware OpenMP Process Scheduling

Peter Thoman, Hans Moritsch, and Thomas Fahringer University of Innsbruck (Austria)

SLIDE 2

Motivation

2010-06-15 IWOMP 2010, Topology-aware OpenMP Process Scheduling

SLIDE 3

Motivation – Hardware Trends

 Multi-core, multi-socket NUMA machines are in wide use in HPC
 Complex memory hierarchy and topology
 Large number of cores in a single shared-memory system
 Are existing OpenMP applications and implementations ready?


SLIDE 4

Motivation – Hardware Trends

(Bullets repeated from previous slide)

[Diagram: one socket containing pairs of cores with shared caches, attached to local memory]

SLIDE 5

Motivation – Hardware Trends

(Bullets repeated from previous slide)

[Diagram: several sockets, each containing pairs of cores with shared caches and local memory]

SLIDE 6

Scalability


 We profiled individual OpenMP parallel regions in a variety of programs and problem sizes
 On an 8-socket quad-core NUMA system (32 cores)
 We determine two metrics:
 Maximum threadcount: the largest number of threads that can be used with some speedup
 Optimal threadcount: the largest number of threads that allows a speedup within 20% of ideal

SLIDE 7

Scalability Results

[Chart: maximum threadcount vs. optimal threadcount (4–32 threads) for each profiled parallel region of the NPB codes (bt, cg, ep, ft, is, lu, mg), gauss, and mmul at various problem sizes]


SLIDE 8

Motivation – Multi-Process

 First idea: run more than one OpenMP program (job) in parallel

[Chart: total execution time (100–800 seconds) vs. number of parallel jobs (1–8)]

SLIDE 9

Motivation – Multi-Process

 Of course it is not always that simple – a different workload:

[Chart: total execution time (500–2500 seconds) vs. number of parallel jobs (1–8)]

SLIDE 10

Algorithm & Implementation


SLIDE 11

Multi-Process Scheduling Architecture

 Goal:
 Facilitate system-wide scheduling of OpenMP programs

 Basic design:
 One central control process (server); message exchange between the server and the OpenMP runtime of each program

 Message protocol:
 Upon encountering an OpenMP parallel region:
 The OpenMP process sends a request to the server for resources
 The request includes scalability information for the region
 The process then uses the cores indicated by the reply
 When leaving the region, it sends a signal to free the cores
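The request/release protocol above can be sketched roughly as follows. All class and method names (`RegionRequest`, `SchedulingServer`, and so on) are illustrative assumptions, not the actual runtime API, and the real implementation exchanges these messages over UNIX message queues rather than calling the server directly:

```python
# Hypothetical sketch of the request/reply protocol between an OpenMP
# process and the central scheduling server. Names are illustrative.
from dataclasses import dataclass

@dataclass
class RegionRequest:      # sent when a parallel region is entered
    pid: int              # requesting process
    origin_core: int      # core the request came from
    opt_count: int        # optimal threadcount for the region
    max_count: int        # maximum threadcount for the region

@dataclass
class RegionReply:        # server's answer
    cores: list           # cores the region may use

class SchedulingServer:
    def __init__(self, num_cores):
        self.free_cores = set(range(num_cores))

    def handle_request(self, req: RegionRequest) -> RegionReply:
        # Grant at most max_count cores, never more than are free.
        n = min(req.max_count, len(self.free_cores))
        granted = sorted(self.free_cores)[:n]
        self.free_cores -= set(granted)
        return RegionReply(cores=granted)

    def handle_release(self, cores):
        # Sent by the process when it leaves the parallel region.
        self.free_cores |= set(cores)

server = SchedulingServer(num_cores=8)
reply = server.handle_request(
    RegionRequest(pid=1, origin_core=0, opt_count=2, max_count=4))
print(reply.cores)        # the granted cores
server.handle_release(reply.cores)
```

This sketch leaves out the topology-aware core selection, which the later slides refine.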


SLIDE 12

Implementation & Flow

 Based on UNIX message queues
 Semantically well suited and fast enough (less than 4 microseconds roundtrip on our systems)


SLIDE 13

Topology-aware Scheduling Algorithm

 Multi-process scheduling ameliorates many-core scalability problems
 What about the complex memory hierarchy?
 Make the server topology-aware
 Base scheduling decisions on:
 Region scalability
 Current system-wide load
 System topology
 Topology-aware OpenMP scheduler


SLIDE 14

Topology Representation


 Distance matrix for all cores in a system
 Higher distance amplification factors for higher levels in the memory hierarchy
 Example: [distance matrix figure]
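One way such a distance matrix might be built, assuming an illustrative hierarchy of shared caches inside sockets. The concrete amplification factors (1 for a shared cache, 2 within a socket, 4 across sockets) are placeholder assumptions, not values from the talk:

```python
# Sketch of a per-core distance matrix with higher amplification factors
# for higher levels of the memory hierarchy. Factor values are illustrative.
CACHE_DIST, SOCKET_DIST, REMOTE_DIST = 1, 2, 4

def build_distance_matrix(sockets, cores_per_socket, cores_per_cache):
    n = sockets * cores_per_socket
    def dist(a, b):
        if a == b:
            return 0
        if a // cores_per_socket != b // cores_per_socket:
            return REMOTE_DIST      # different sockets
        if a // cores_per_cache != b // cores_per_cache:
            return SOCKET_DIST      # same socket, different shared cache
        return CACHE_DIST           # same shared cache
    return [[dist(a, b) for b in range(n)] for a in range(n)]

# 2 sockets x 4 cores, 2 cores per shared cache:
D = build_distance_matrix(2, 4, 2)
print(D[0])   # distances from core 0: [0, 1, 2, 2, 4, 4, 4, 4]
```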

SLIDE 15

Simple Scheduling


 Request from a region with given maxcount and optcount:

1. N = optcount + loadfactor * (maxcount - optcount)
 loadfactor depends on the number of free cores

2. Select N-1 cores close to the core from which the request originated

 Slightly more complicated in practice:
 must deal with the case where fewer than N cores are available (decide whether to queue the request or return a smaller number of cores)
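A minimal sketch of the two-step selection, assuming the server already holds a distance matrix and a free-core set; `schedule` and its parameters are hypothetical names, and the linear loadfactor is one plausible choice:

```python
# Sketch of simple scheduling: compute the team size N from the region's
# optcount/maxcount and the current load, then pick the N-1 free cores
# closest to the origin core. Assumes the origin core is itself free.
def schedule(origin, optcount, maxcount, free_cores, dist, total_cores):
    # loadfactor in [0, 1], here simply the fraction of free cores
    loadfactor = len(free_cores) / total_cores
    n = int(optcount + loadfactor * (maxcount - optcount))
    n = min(n, len(free_cores))   # fewer than N free: return a smaller team
    others = sorted(c for c in free_cores if c != origin)
    others.sort(key=lambda c: dist[origin][c])   # nearest cores first
    return [origin] + others[:n - 1]

# 4 cores: 0-1 share a cache (distance 1), 2-3 sit on another socket (4)
dist = [[0, 1, 4, 4],
        [1, 0, 4, 4],
        [4, 4, 0, 1],
        [4, 4, 1, 0]]
team = schedule(origin=0, optcount=2, maxcount=4,
                free_cores={0, 1, 2, 3}, dist=dist, total_cores=4)
print(team)   # all cores free, so loadfactor = 1 and N = maxcount
```

With only cores 0 and 1 free, loadfactor drops to 0.5, N becomes 3, and the team is capped at the two available cores.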

SLIDE 16

Fragmentation


 Using simple scheduling leads to fragmentation:
 Sum of local distances in all 4 processes: 44

[Diagram: four sockets; the threads of the 4 processes are scattered across sockets]

SLIDE 17

Improvement: Clustering

2010-06-15 IWOMP 2010, Topology-aware OpenMP Process Scheduling

 Same processes without fragmentation:
 Sum of local distances in all 4 processes: 13

[Diagram: four sockets; each process's threads are clustered together]

SLIDE 18

Clustering Algorithm

2010-06-15 IWOMP 2010, Topology-aware OpenMP Process Scheduling

 Moving threads once they have started has a significant performance impact (caches, pages, etc.)  instead, change the algorithm to discourage fragmentation
 Define cores as part of a hierarchy of core sets
 When selecting a core from a new set, prefer (in order):

1. A core set containing exactly as many free cores as required
2. A core set containing more free cores than required
3. An empty core set

 Further improvement is possible by adjusting the number of selected cores (enhanced clustering)
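The preference order could be sketched as below. Treating steps 1 and 2 as applying to partially used sets, with untouched sets kept in reserve as a last resort, is my reading of the slide, not something it states explicitly, and the function name is hypothetical:

```python
# Sketch of the clustering preference when a team must open a new core set.
# Each core set is a (free_cores, used_cores) count pair for one
# shared-cache or socket group.
def pick_core_set(core_sets, required):
    partial = [s for s in core_sets if s[1] > 0 and s[0] > 0]
    # 1. a partially used set with exactly as many free cores as required
    for s in partial:
        if s[0] == required:
            return s
    # 2. a partially used set with more free cores than required
    for s in partial:
        if s[0] > required:
            return s
    # 3. a completely empty (untouched) set
    for s in core_sets:
        if s[1] == 0 and s[0] > 0:
            return s
    return None   # nothing suitable: caller must queue or split the request

sets = [(4, 0), (1, 3), (2, 2)]   # (free, used) per quad-core socket
print(pick_core_set(sets, 2))     # exact fit in a partially used socket
```

Keeping empty sets untouched until no partial set fits is exactly what discourages the fragmentation shown two slides earlier.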

SLIDE 19

Evaluation


SLIDE 20

Simulation


 Evaluate the impact of the scheduling enhancements over 10000 semi-random requests
 Calculate or measure 5 properties:
 Scheduling time required per request
 Target miss rate: |#returned_threads - #ideal_threads|
 3 distance metrics:
 Total distance: from each thread in a team to each other thread
 Weighted distance: distances between threads with close IDs weighted higher
 Local distance: only count the distance from each core to the next in sequence
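The three distance metrics might be computed along these lines for a single team. The 1/(id gap) weighting in `weighted_distance` is an illustrative assumption, since the slide does not give the exact weights:

```python
# Sketch of the three distance metrics for one team of threads mapped
# onto cores; dist is a per-core distance matrix.
def total_distance(cores, dist):
    # from each thread in a team to each other thread
    return sum(dist[a][b] for i, a in enumerate(cores)
                          for b in cores[i + 1:])

def weighted_distance(cores, dist):
    # distances between threads with close ids weighted higher;
    # the 1/(id gap) weight is an assumption for illustration
    return sum(dist[a][b] / (j - i)
               for i, a in enumerate(cores)
               for j, b in enumerate(cores) if j > i)

def local_distance(cores, dist):
    # only count the distance from each core to the next in sequence
    return sum(dist[a][b] for a, b in zip(cores, cores[1:]))

dist = [[0, 1, 4, 4],
        [1, 0, 4, 4],
        [4, 4, 0, 1],
        [4, 4, 1, 0]]
team = [0, 1, 2, 3]
print(total_distance(team, dist))   # 1+4+4+4+4+1 = 18
print(local_distance(team, dist))   # 1+4+1 = 6
```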

SLIDE 21

Simulation Results


 Absolute overhead always below 1.4 microseconds
 Enhanced clustering reduces local distance by 70%

[Chart: overhead (µs), target miss rate, and total/weighted/local distance per scheduling variant, normalized 0–120%]

SLIDE 22

Experiments


 Hardware:
 Sun Fire X4600 M2
 8 quad-core CPUs (AMD Barcelona; partially connected, 1-3 hops)

 Software:
 Backend: GCC 4.4.2
 “Default” OpenMP: GOMP
 Insieme compiler/runtime r278

SLIDE 23

Small-scale Experiment


 Random set of 13 programs tested

[Chart: total time (100–1000 seconds) for GOMP sequential; optimal threadcount with standard OS mapping; our server without locality information; our server with locality; our server with locality + enhanced clustering]

SLIDE 24

Large-scale Experiment


 Random programs chosen from NPB & 2 kernels
 Random problem sizes

[Chart: total time (5000–30000 seconds) for GOMP sequential; our server without locality; our server with locality; our server with locality + clustering]

SLIDE 25

Power Consumption


 Power consumption measured during the large-scale experiment:
 Topology-aware scheduling (with appropriate thread counts) reduces average power consumption

SLIDE 26

Hybrid MPI/OpenMP


 One program consists of more than one process
 Our topology-aware thread mapping is meaningful even for a single program in this case
 Test of an ADI solver, 8 MPI processes with 4 threads each
 Improvement is around 11%
 OpenMPI used in both cases

[Chart: execution time (0.1–0.7 seconds), default vs. topology-aware mapping]

SLIDE 27

Summary and Conclusion


SLIDE 28

Summary


 Central OpenMP server process

1. Selects the number of threads for parallel regions depending on
 Scalability information
 System load
 Clustering considerations

2. Performs topology-aware mapping of threads to cores

 Evaluation
 Up to 33% performance improvement compared to standard scheduling
 Additional reduction in power consumption

SLIDE 29

Future Work


 How to determine/estimate region scalability without exhaustive profiling
 Make external non-OpenMP load impact scheduling decisions

SLIDE 30

Thank you!
