SLIDE 1

Barriers in OpenMP

Paolo Burgio paolo.burgio@unimore.it

SLIDE 2

Outline

› Expressing parallelism

– Understanding parallel threads

› Memory Data management

– Data clauses

› Synchronization

– Barriers, locks, critical sections

› Work partitioning

– Loops, sections, single work, tasks…

› Execution devices

– Target

SLIDE 3

OpenMP synchronization

› OpenMP provides the following synchronization constructs:

– barrier
– flush
– master
– critical
– atomic
– taskwait
– taskgroup
– ordered
– ..and OpenMP locks

SLIDE 4

Creating a parreg

› Master-slave, fork-join execution model

– Master thread spawns a team of Slave threads
– They all perform computation in parallel
– At the end of the parallel region, implicit barrier

int main()
{
    /* Sequential code */

    #pragma omp parallel num_threads(4)
    {
        /* Parallel code */
    } // Parreg end: (implicit) barrier

    /* (More) sequential code */
}
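
The implicit barrier at the end of the parallel region can be made visible with a minimal sketch. The helper name count_team_members and the shared counter are illustrative assumptions, not part of the slides. Each thread increments the counter once; because of the implicit barrier, the master thread only continues after every thread has done so.

```c
#include <stdio.h>

/* Hypothetical helper: returns how many threads executed the parallel region. */
int count_team_members(void)
{
    int members = 0;                 /* shared by all threads of the team */

    /* Sequential code: only the master thread runs here */

    #pragma omp parallel num_threads(4)
    {
        /* Parallel code: every thread of the team runs this block */
        #pragma omp atomic
        members++;
    } /* Parreg end: (implicit) barrier, all increments are complete here */

    /* (More) sequential code: the master sees the final count */
    printf("%d thread(s) ran the parallel region\n", members);
    return members;
}
```

Built with -fopenmp, the function reports at most 4 threads (the runtime may grant fewer). Built without OpenMP support, the pragmas are ignored and it reports 1.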

SLIDE 8

OpenMP explicit barriers

› All threads in a team must wait for all the other threads before going on

– "Each barrier region must be encountered by all threads in a team or by none at all"
– "The sequence of barrier regions encountered must be the same for every thread in a team"
– Why?

› Binding set is the team of threads from the innermost enclosing parreg

– "It applies to"

› Also, it enforces a consistent view of the shared memory

– We'll see this..

#pragma omp barrier new-line (a standalone directive)

SLIDE 9

Exercise

› Spawn a team of (many) parallel Threads

– Printing "Hello World"
– Put a #pragma omp barrier
– Reprint "Hello World" after

› What do you see?

– Now, remove the barrier construct

› Now, put the barrier inside an if

– E.g., if(omp_get_thread_num() == 0) { ... }
– What do you see?
– Error!!!! The barrier is no longer encountered by all threads in the team

Let's code!
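
One possible solution to the first part of the exercise, as a sketch rather than the reference answer. The helper name hello_with_barrier and the two bookkeeping variables are assumptions added so the effect of the barrier can be checked, not only observed on screen. Every thread prints, counts itself, waits at the barrier, and then prints again.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if every thread, once past the barrier,
   could see that the whole team had printed the first greeting. */
int hello_with_barrier(void)
{
    int arrived = 0;             /* shared: threads done with the first print */
    int lowest_seen = INT_MAX;   /* smallest 'arrived' observed after the barrier */

    #pragma omp parallel num_threads(4) reduction(min : lowest_seen)
    {
        printf("Hello World (before the barrier)\n");
        #pragma omp atomic
        arrived++;

        /* Every thread of the team must reach this line. Wrapping it in
           if (omp_get_thread_num() == 0) { ... } would be non-conforming. */
        #pragma omp barrier

        printf("Hello World (after the barrier)\n");
        if (arrived < lowest_seen)
            lowest_seen = arrived;
    }

    /* After the region, 'arrived' equals the team size. */
    return lowest_seen == arrived;
}
```

With the barrier in place, all "before" lines are typically printed ahead of any "after" line. Removing the barrier lets the two groups interleave, and the check may then fail.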

SLIDE 10

Effects on memory

› Besides synchronization, a barrier has the effect of making threads' temporary view of the shared memory consistent

– You cannot trust any (potentially modified) shared vars before a barrier
– Of course, there are no problems with private vars
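
The rule above can be sketched in a few lines. The helper name publish_then_read and the initial value 11 are illustrative assumptions. One thread writes a shared variable, and the other threads read it only after a barrier, which both synchronizes the team and makes their views of the shared memory consistent.

```c
#include <stdio.h>

/* Hypothetical helper: returns how many threads read a stale value (expected 0). */
int publish_then_read(void)
{
    int value = 11;       /* shared variable, written by one thread only */
    int stale_reads = 0;  /* shared counter of threads that did not see 5 */

    #pragma omp parallel num_threads(4)
    {
        #pragma omp master
        value = 5;        /* the master writes; 'master' implies no barrier */

        /* Before this point the other threads cannot trust 'value'. */
        #pragma omp barrier

        /* After the barrier every thread's view is consistent again. */
        if (value != 5) {
            #pragma omp atomic
            stale_reads++;
        }
    }

    printf("stale reads: %d\n", stale_reads);
    return stale_reads;
}
```

Reading value before the barrier instead would be a data race: a thread might see 11 or 5.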

› ..what???

SLIDE 11

The OpenMP memory model

› Shared memory with relaxed consistency

– Threads have access to "a place to store and to retrieve variables, called the memory"
– Threads can have a temporary view of the memory
  › Caches, registers, scratchpads…
  › Can still be accessed by other threads

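[Figure: a process whose threads (T) all access a shared memory holding VAR. Each thread also has its own temporary view (Temp) and private memory (Priv.). shared(a) refers to the copy in shared memory; private(a) and firstprivate(a) refer to per-thread copies.]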

SLIDE 12

A bit of architecture…

SLIDE 13

Caches in a nutshell

› A small, fast memory connected to the processor core

– ..and to the main memory
– A few KB of data

› (If any,) caches are a pure hardware mechanism

– Used to store a copy of the most frequently accessed data
– To speed up execution, even by 10-20 times
– Instruction caches/Data caches

› They perform their work automatically

– And transparently
– Poor or no control at all at application level
– Extremely dangerous in multi- and many-cores

SLIDE 14

Caches

A cache is a hardware or software component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere.

eng.wikipedia.org

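[Figure: four cores (CPU, CPU 1, CPU 2, CPU 3) running threads (T), each core with its own instruction cache (I$) and data cache (D$). The labels also show a shared Level-2 $, main memory (or L3 cache), and off-chip memory.]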

SLIDE 15

The catch(es)

› Caches are power hungry

– Some embedded architectures do not have D$

› They are not suitable for critical systems

– E.g., BOSCH removed I$s

› Hardware mechanism, poor control on them

– Flush command (typically, all cache)
– Color cache (assign to threads)
– Prefetch (move data before it's actually needed)

Coherency problem in multi/many-cores!!

SLIDE 16

An example: read stale data

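[Figure: four cores (CPU, CPU 1, CPU 2, CPU 3), each with its own data cache (D$), connected to main memory. Two threads (T) share variable a, initially 11. The animation steps through the statements below: once a = 5 has executed, both the old value 11 and the new value 5 are present in the system, so the reads of a can return stale data.]

a = 5;
b = a;
// ...
c = a;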

SLIDE 20

An example: read stale data

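[Figure: the same four cores with private data caches (D$) and main memory; two threads (T) share variable a, initially 11. With dcache_flush() executed before c = a, the animation ends with c = a reading the up-to-date value 5.]

a = 5;
b = a;
// ...
dcache_flush();
c = a;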

SLIDE 25

An(other) example: $ writing policies

Write-through

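[Figure: four cores (CPU, CPU 1, CPU 2, CPU 3), each with its own data cache (D$), connected to main memory; two threads (T) share variable a, initially 11. With a write-through cache, a = 5 updates both the writer's D$ and main memory, and the animation ends with b = a reading 5.]

a = 5;
b = a;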

SLIDE 29

An(other) example: $ writing policies

Write-back

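[Figure: four cores (CPU, CPU 1, CPU 2, CPU 3), each with its own data cache (D$), connected to main memory; two threads (T) share variable a, initially 11. With a write-back cache, a = 5 updates only the writer's D$; main memory keeps the old value 11 until the cache line is written back, so b = a on another core can read stale data.]

a = 5;
b = a;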

SLIDE 33

An(other) example: $ writing policies

Write-back w/cache flush

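[Figure: four cores (CPU, CPU 1, CPU 2, CPU 3), each with its own data cache (D$), connected to main memory; two threads (T) share variable a, initially 11. Calling dcache_flush() after a = 5 writes the new value back to main memory, and the animation ends with b = a reading 5.]

a = 5;
dcache_flush();
b = a;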

SLIDE 37

The flush directive

› Binding thread set is the encountering thread

– More "relaxed"

› "It executes the OpenMP flush operation"

– Makes its temporary view of the shared memory consistent with other threads
– "Calls to dcache_flush()"

› Enforces an order on the memory operations on the variables specified in list

#pragma omp flush [(list)] new-line
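
A hedged sketch of how flush can be used for a hand-made producer/consumer handshake. The helper name produce_consume, the data and flag variables, and the serial fallbacks under #ifdef _OPENMP are assumptions for illustration. The atomic read and write of flag are also an addition beyond what the slide shows: they keep the accesses to the flag itself free of data races.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: thread 0 produces a value, the last thread consumes it. */
int produce_consume(void)
{
    int data = 0;       /* shared payload */
    int flag = 0;       /* shared "data is ready" signal */
    int received = -1;  /* what the consumer read */

    #pragma omp parallel num_threads(2)
    {
        int tid  = omp_get_thread_num();
        int last = omp_get_num_threads() - 1;

        if (tid == 0) {                 /* producer */
            data = 42;
            #pragma omp flush           /* publish data before raising the flag */
            #pragma omp atomic write
            flag = 1;
            #pragma omp flush           /* make the flag visible */
        }
        if (tid == last) {              /* consumer (same thread if the team has 1) */
            int ready = 0;
            while (!ready) {
                #pragma omp flush       /* refresh this thread's temporary view */
                #pragma omp atomic read
                ready = flag;
            }
            #pragma omp flush           /* order the read of data after the flag */
            received = data;
        }
    }
    return received;
}
```

Unlike a barrier, each flush here involves only the encountering thread: the producer never waits for the consumer.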

SLIDE 38

Semantics: barrier vs flush

#pragma omp barrier

› Joins the threads of a team
› Applies to all threads of a team
› Forces consistency of threads' temporary view of the shared memory

#pragma omp flush

› Applies to one thread
› Forces consistency of its temporary view of the shared memory
› Much lighter!

SLIDE 39

OpenMP software stack

› Multi-layer stack

– Engineered for portability

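[Figure: the layers of the stack: User code, OpenMP runtime, Operating System, Hardware. A thread (T) runs the user code; the flush request travels down the stack to the data cache (D$).]

User code:
a = 5;
#pragma omp flush

OpenMP runtime:
void GOMP_flush() { dcache_flush(); }

Lower layer (down to the hardware D$):
void dcache_flush() { asm("mov r15, #1"); }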

SLIDE 41

How to run the examples

› Download the Code/ folder from the course website
› Compile
  $ gcc -fopenmp code.c -o code
› Run (Unix/Linux)
  $ ./code
› Run (Win/Cygwin)
  $ ./code.exe

Let's code!

SLIDE 42

References

› "Calcolo parallelo" website

– http://hipert.unimore.it/people/paolob/pub/PhD/index.html

› My contacts

– paolo.burgio@unimore.it
– http://hipert.mat.unimore.it/people/paolob/

› Useful links

– http://www.google.com
– http://www.openmp.org
– https://gcc.gnu.org/

› A "small blog"

– http://www.google.com
