

SLIDE 1

Barriers in OpenMP

Paolo Burgio paolo.burgio@unimore.it

SLIDE 2

Outline

› Expressing parallelism

– Understanding parallel threads

› Memory / data management

– Data clauses

› Synchronization

– Barriers, locks, critical sections

› Work partitioning

– Loops, sections, single work, tasks…

› Execution devices

– Target

SLIDE 3

OpenMP synchronization

› OpenMP provides the following synchronization constructs:

– barrier
– flush
– master
– critical
– atomic
– taskwait
– taskgroup
– ordered
– ..and OpenMP locks

SLIDE 4

Creating a parreg

› Master-slave, fork-join execution model

– Master thread spawns a team of Slave threads
– They all perform computation in parallel
– At the end of the parallel region, implicit barrier

int main()
{
  /* Sequential code */

  #pragma omp parallel num_threads(4)
  {
    /* Parallel code */
  } // Parreg end: (implicit) barrier

  /* (More) sequential code */
}

SLIDE 8

OpenMP explicit barriers

› All threads in a team must wait for all the other threads before going on

– "Each barrier region must be encountered by all threads in a team or by none at all"
– "The sequence of barrier regions encountered must be the same for every thread in a team"
– Why?

› Binding set is the team of threads from the innermost enclosing parreg

– I.e., the set of threads the construct "applies to"

› Also, it enforces a consistent view of the shared memory

– We'll see this..


#pragma omp barrier new-line (a standalone directive)

SLIDE 9

Exercise

› Spawn a team of (many) parallel Threads

– Printing "Hello World"
– Put a #pragma omp barrier
– Reprint "Hello World" after

› What do you see?

– Now, remove the barrier construct

› Now, put the barrier inside an if

– E.g., if(omp_get_thread_num() == 0) { ... }
– What do you see?
– Error!!!!


Let's code!

SLIDE 10

Effects on memory

› Besides synchronization, a barrier has the effect of making threads' temporary view of the shared memory consistent

– You cannot trust any (potentially modified) shared vars before a barrier
– Of course, there are no problems with private vars

› ..what???

SLIDE 11

The OpenMP memory model

› Shared memory with relaxed consistency

– Threads have access to "a place to store and to retrieve variables, called the memory"
– Threads can have a temporary view of the memory
› Caches, registers, scratchpads…
› Can still be accessed by other threads

[Figure: a process with three threads sharing "the memory": a shared VAR (shared(a)) plus each thread's temporary/private view of it (first/private(a))]

SLIDE 12

A bit of architecture…

SLIDE 13

Caches in a nutshell

› A small, fast memory connected to the processor core

– ..and to the main memory
– A few KB of data

› (If any,) caches are a pure hardware mechanism

– Used to store a copy of the most frequently accessed data
– To speed up execution, even by 10-20 times
– Instruction caches / data caches

› They perform their work automatically

– And transparently
– Poor or no control at all at application level
– Extremely dangerous in multi- and many-cores

SLIDE 14

Caches

"A cache is a hardware or software component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or the duplicate of data stored elsewhere." (en.wikipedia.org)

[Figure: a single CPU with a data cache (D$) in front of main memory; and a multi-core system where each CPU has private I$ and D$ caches, a shared Level-2 cache, and off-chip main memory / L3 cache]

SLIDE 15

The catch(es)

› Caches are power hungry

– Some embedded architectures do not have D$

› They are not suitable for critical systems

– E.g., BOSCH removed I$s

› Hardware mechanism, poor control on them

– Flush command (typically flushes the whole cache)
– Cache coloring (assign cache portions to threads)
– Prefetch (move data in before it's actually needed)

Coherency problem in multi/many-cores!!

SLIDE 16

An example: read stale data

[Figure: two threads on different CPUs, each with a private D$; variable a holds 11 in main memory. The writing thread's a = 5 lands only in its own D$, so the other thread's c = a still reads the stale value 11.]

a = 5;
b = a;
// ...
c = a;


SLIDE 20

An example: read stale data

[Figure: the same two-CPU example, now with an explicit flush: after the write, dcache_flush() pushes a = 5 back to main memory, so the other thread's c = a reads 5.]

a = 5;
b = a;
// ...
dcache_flush();
c = a;


SLIDE 25

An(other) example: $ writing policies

Write-through

[Figure: write-through policy: a = 5 updates the writer's D$ and is immediately written through to main memory, so the other CPU's b = a reads the up-to-date value 5.]

a = 5;
b = a;


SLIDE 29

An(other) example: $ writing policies

Write-back

[Figure: write-back policy: a = 5 stays in the writer's D$, and main memory keeps the old value 11 until the dirty line is written back, so the other CPU's b = a can read stale data.]

a = 5;
b = a;


SLIDE 33

An(other) example: $ writing policies

Write-back w/cache flush

[Figure: write-back with an explicit cache flush: after a = 5, dcache_flush() forces the dirty line back to main memory, so the other CPU's b = a reads 5.]

a = 5;
dcache_flush();
b = a;


SLIDE 37

The flush directive

› Binding thread set is the encountering thread

– More "relaxed"

› "It executes the OpenMP flush operation"

– Makes its temporary view of the shared memory consistent with other threads
– "Calls to dcache_flush()"

› Enforces an order on the memory operations on the variables specified in list


#pragma omp flush [(list)] new-line

SLIDE 38

Semantics: barrier vs flush

#pragma omp barrier
› Joins the threads of a team
› Applies to all threads of a team
› Forces consistency of threads' temporary view of the shared memory

#pragma omp flush
› Applies to one thread
› Forces consistency of its temporary view of the shared memory
› Much lighter!

SLIDE 39

OpenMP software stack

› Multi-layer stack

– Engineered for portability

[Figure: layered stack: User code / OpenMP runtime / Operating System / Hardware]

/* User code */
a = 5;
#pragma omp flush

/* OpenMP runtime */
void GOMP_flush() { dcache_flush(); }

/* Operating system / hardware */
void dcache_flush() { asm("mov r15, #1"); }


SLIDE 41

How to run the examples

› Download the Code/ folder from the course website
› Compile
  $ gcc -fopenmp code.c -o code
› Run (Unix/Linux)
  $ ./code
› Run (Win/Cygwin)
  $ ./code.exe


Let's code!

SLIDE 42

References

› "Calcolo parallelo" website

– http://hipert.unimore.it/people/paolob/pub/PhD/index.html

› My contacts

– paolo.burgio@unimore.it – http://hipert.mat.unimore.it/people/paolob/

› Useful links

– http://www.google.com – http://www.openmp.org – https://gcc.gnu.org/

› A "small blog"

– http://www.google.com
