SLIDE 1

Building a C++CSP Channel Using C++ Atomics

A Busy Channel Performance Analysis

Dr Kevin Chalmers

School of Computing, Edinburgh Napier University, Edinburgh
k.chalmers@napier.ac.uk

SLIDE 6

Outline

1 Introduction and Background
2 Current Channel Operations
3 An Atomic Channel Implementation
4 Benchmarking
5 Conclusions

SLIDE 8

Motivation

Most CSP inspired libraries build channels using a mutex.
Channel communication in most libraries is slow due to context switching.
A typical context switch on an i7 may take 1000ns+.
A channel built on a mutex may have up to three context switches:
  1 Writer to reader after writer has stored the value to be written.
  2 Reader to writer after reader has retrieved the value.
  3 Writer to reader to complete the operation.
A channel effectively stops parallelism and forces two processes to become sequential during communication.
The aim of this work is to explore atomic operations as a method to improve channel performance.

SLIDE 9

Atomic Values

Any trivially copyable type can be made atomic in C++, but how it is interacted with depends on its base type.
Five types:
  Atomic Flag
  Atomic Boolean
  Atomic Integral
  Atomic Pointer
  Atomic Trivially Copyable User Type
Flag and Boolean are different.
Flag is guaranteed lock free.
Flag has only two operations (before C++20).
  Test and Set (sets the flag and returns its previous value, so false means this caller set it).
  Clear.
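
The two flag operations are enough to build a simple spin lock. The sketch below is illustrative only: spin_lock and count_with_spin_lock are hypothetical names, not part of C++CSP, and the yield inside the loop is an assumption made so the example behaves on machines with few cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper for illustration: a spin lock built on std::atomic_flag.
class spin_lock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set returns the previous value: true means someone else holds the lock.
        while (flag.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Each worker increments a plain (non-atomic) counter while holding the lock.
int count_with_spin_lock(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    spin_lock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&lock, &counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling count_with_spin_lock(4, 1000) should return 4000, since every increment happens under the lock.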

SLIDE 10

Atomic Operations

Atomic Flag already mentioned.
Other types support the following, dependent on type:
  store: atomically stores a value.
  load: atomically retrieves a value.
  exchange: atomically retrieves the current value and stores a new value.
  compare and exchange: tests the current value and, if it matches the expected value, exchanges it with the new value. Provides the current value if not.
    strong: guarantees the comparison is correct.
    weak: may return false even if the values match (spurious failure), so it is normally used in a loop.
  fetch and op: atomically applies the given operation and returns the previous value.
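
A short sketch of these operations on std::atomic<int> follows. The function name demo_atomic_ops is hypothetical, and default (sequentially consistent) ordering is used throughout to keep the example small.

```cpp
#include <atomic>
#include <cassert>

// Walks through store, load, exchange, compare-exchange and fetch-and-op.
int demo_atomic_ops() {
    std::atomic<int> a{0};

    a.store(5);
    assert(a.load() == 5);

    int old = a.exchange(7);               // returns the previous value
    assert(old == 5 && a.load() == 7);

    int expected = 7;
    bool ok = a.compare_exchange_strong(expected, 9);   // matches, so 9 is stored
    assert(ok && a.load() == 9);

    expected = 1;
    ok = a.compare_exchange_strong(expected, 100);      // no match: nothing stored
    assert(!ok && expected == 9 && a.load() == 9);      // expected now holds the current value

    // The weak form may fail spuriously, so it is retried in a loop.
    int current = a.load();
    while (!a.compare_exchange_weak(current, current * 2)) {
    }

    old = a.fetch_add(3);                  // returns the value before the addition
    assert(old == 18);
    return a.load();
}
```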

SLIDE 11

Memory Ordering

Atomic operations in C++ work on the principle of what has been observed by the different threads.
We want to cause memory synchronisation to ensure a certain state has been reached.
Five types of memory ordering of interest:
  Sequentially consistent: everything behaves as if everyone is watching. Slow, but easy to think about.
  Relaxed: no synchronisation of memory. Fast. Can achieve the memory synchronisation from subsequent operations.
  Acquire: for load operations, etc. Matches a release.
  Release: for store operations, etc. Matches an acquire.
  Acquire-release: for fetch and op, etc. Matches a release and acquire.
A naive explanation is that operations chain together to allow a memory history.
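
The release/acquire pairing can be sketched with a flag that publishes a plain value. This is a minimal illustration under assumed names (pass_message, payload, ready); the yield in the wait loop is an assumption to keep the example friendly to machines with few cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The producer writes plain data, then releases; the consumer acquires, then reads.
int pass_message(int message) {
    int payload = 0;                      // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    int received = -1;

    std::thread producer([&] {
        payload = message;                                // happens before the release store
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // matches the release store
            std::this_thread::yield();
        }
        received = payload;                               // guaranteed to see the producer's write
    });

    producer.join();
    consumer.join();
    assert(received == message);
    return received;
}
```

With relaxed ordering on both sides the consumer would have no guarantee of seeing payload; the release store paired with the acquire load is what chains the two threads' histories together.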

SLIDE 12

C++CSP Channel Model

procedure read
  lock
  if strength > 0 then
    throw
  if empty then
    empty ← false
    wait
  else
    empty ← true
  to_return ← hold
  notify
  if strength > 0 then
    throw
  return to_return

procedure write(value)
  lock
  if strength > 0 then
    throw
  hold ← value
  if empty then
    empty ← false
    if alting then
      schedule
  else
    empty ← true
    notify
  wait
  if strength > 0 then
    throw
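
For comparison, a mutex-based rendezvous can be sketched in standard C++ as below. This is not the C++CSP implementation: mutex_channel and sum_over_mutex_channel are hypothetical names, poison and alting are omitted, and two flags with predicate waits replace the single empty flag so that spurious wake-ups are tolerated. It assumes exactly one reader and one writer.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical one-to-one rendezvous channel built on a mutex and condition variable.
template <typename T>
class mutex_channel {
    std::mutex mtx;
    std::condition_variable cv;
    T hold{};
    bool full = false;   // writer has stored a value not yet handed over
    bool taken = false;  // reader has retrieved the stored value
public:
    void write(T value) {
        std::unique_lock<std::mutex> lock(mtx);
        hold = std::move(value);
        full = true;
        cv.notify_all();                          // wake a waiting reader
        cv.wait(lock, [this] { return taken; });  // block until the reader has the value
        full = false;
        taken = false;
    }
    T read() {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return full && !taken; });  // block until a value arrives
        T to_return = std::move(hold);
        taken = true;
        cv.notify_all();                          // release the writer
        return to_return;
    }
};

// Sends 1..count from a writer thread and sums the values on the calling thread.
int sum_over_mutex_channel(int count) {
    assert(count >= 0);
    mutex_channel<int> chan;
    std::thread writer([&chan, count] {
        for (int i = 1; i <= count; ++i) chan.write(i);
    });
    int sum = 0;
    for (int i = 0; i < count; ++i) sum += chan.read();
    writer.join();
    return sum;
}
```

Each blocking wait here is a point where the operating system may context switch, which is the cost the atomic approach tries to avoid.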

SLIDE 13

C++CSP Channel Model

procedure enable
  lock
  if strength > 0 then
    return true
  if empty then
    alting ← true
    return false
  else
    return true

procedure disable
  lock
  alting ← false
  return !empty or strength > 0

SLIDE 14

Objectives

The aim is to discover if atomics give us a better performing channel.
Objectives

  1 Build an atomic-based channel implementation; judging by other CSP libraries, this is believed to be the first such implementation.
  2 Undertake performance analysis of the atomic-based channel implementation; this analysis can be used to understand when to use an atomic-based channel.

SLIDE 16

Operation Pairings

Six interactions possible when working with channels.

  1 Write and read.
  2 Write and select.
  3 Write and poison.
  4 Read and poison.
  5 Select and poison.

The lock is removed and an analysis of the required ordering of instructions made.
Only write-read and write-select are covered here. The others are in the paper.

SLIDE 17

Write and Read - Read-first

procedure read            procedure write(value)
  empty ← false             hold ← value
  wait                      empty ← true
  wait                      notify
  to_return ← hold          wait
  notify                    wait

SLIDE 18

Write and Read - Write-first

procedure read            procedure write(value)
                            hold ← value
                            empty ← false
  empty ← true              wait
  to_return ← hold          wait
  notify                    wait
  return to_return

SLIDE 19

Write and Select

If write goes first we have no concerns.

procedure enable          procedure write(value)
  alting ← true             hold ← value
  return false              empty ← false
                            schedule
                            wait

SLIDE 20

Initial Analysis

Channel operations are fairly simple from an instruction point of view.
There is complexity in the ordering of the operations.
Mutex-based channel avoids this complexity by enforcing sequential behaviour during the interaction.
An atomic-based channel will need to synchronise on state, and then progress through the operations only when the correct next state is observed.

This requires some busy spinning (equivalent to waiting).

SLIDE 22

Atomic-based Channel Members

The atomic-based channel uses six values:

  hold: an atomic to store the value communicated via the channel.
  reading: an atomic boolean to indicate if the reader is reading or not.
  writing: an atomic boolean to indicate if the writer is writing or not.
  alt: a reference to any alt construct currently engaged with the channel.
  alting: an atomic bool to indicate if the channel is in a selection process.
  strength: an atomic integral storing the current poison applied to the channel.
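
One way these six members might be laid out in C++ is sketched below. The names busy_channel_state and alt_stub are hypothetical, the alt reference is modelled as a nullable pointer (an assumption, since a C++ reference cannot be reseated), and the width of strength is also assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct alt_stub;  // hypothetical placeholder for the alt (selection) construct

// Sketch of the channel members; T must be trivially copyable to be held in std::atomic.
template <typename T>
struct busy_channel_state {
    std::atomic<T> hold{T{}};                 // value being communicated
    std::atomic<bool> reading{false};         // reader is engaged in a read
    std::atomic<bool> writing{false};         // writer is engaged in a write
    alt_stub* alt = nullptr;                  // alt currently engaged, if any
    std::atomic<bool> alting{false};          // channel is part of a selection
    std::atomic<std::uint32_t> strength{0};   // current poison strength (0 = none)
};

// A freshly constructed channel is idle and unpoisoned.
void check_initial_state() {
    busy_channel_state<int> state;
    assert(!state.reading.load() && !state.writing.load());
    assert(!state.alting.load() && state.alt == nullptr);
    assert(state.strength.load() == 0);
}
```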

SLIDE 23

Atomic Read

procedure atomic read
  while !load(writing, acquire) do
    skip
  if load(strength, relaxed) > 0 then
    throw
  to_return ← load(hold, relaxed)
  store(reading, true, release)
  while load(writing, acquire) do
    skip
  store(reading, false, release)
  return to_return

SLIDE 24

Atomic Write

procedure atomic write(value)
  store(hold, value, relaxed)
  store(writing, true, release)
  if load(alting, acquire) then
    schedule
  while !load(reading, acquire) do
    skip
  if load(strength, relaxed) > 0 then
    throw
  store(writing, false, release)
  while load(reading, acquire) do
    skip

SLIDE 25

Atomic Enable

procedure atomic enable
  store(alting, true, release)
  temp ← load(writing, acquire)
  if load(strength, relaxed) > 0 then
    return true
  return temp

SLIDE 26

Atomic Disable

procedure atomic disable
  store(alting, false, release)
  return load(writing, acquire)
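
The enable and disable procedures translate almost line for line into std::atomic calls. The sketch below uses a hypothetical select_state struct holding only the three members these procedures touch; it is an illustration, not the C++CSP source.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical subset of the channel state used during selection.
struct select_state {
    std::atomic<bool> writing{false};
    std::atomic<bool> alting{false};
    std::atomic<std::uint32_t> strength{0};

    // Returns true if the channel is already ready (writer committed or poisoned).
    bool enable() {
        alting.store(true, std::memory_order_release);
        bool temp = writing.load(std::memory_order_acquire);
        if (strength.load(std::memory_order_relaxed) > 0) return true;
        return temp;
    }

    // Returns true if a writer committed while the selection was in progress.
    bool disable() {
        alting.store(false, std::memory_order_release);
        return writing.load(std::memory_order_acquire);
    }
};

// Single-threaded walk through the possible answers.
void check_enable_disable() {
    select_state idle;
    assert(!idle.enable() && idle.alting.load());
    assert(!idle.disable() && !idle.alting.load());

    select_state pending;
    pending.writing.store(true);           // simulate a writer that has committed
    assert(pending.enable() && pending.disable());

    select_state poisoned;
    poisoned.strength.store(1);
    assert(poisoned.enable());
}
```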

SLIDE 27

Write and Read

Algorithm 1 Atomic Read-Write Interaction

procedure atomic read                     procedure atomic write(value)
  while !load(writing, acquire) do          store(hold, value, relaxed)
    skip                                    store(writing, true, release)
  to_return ← load(hold, relaxed)           while !load(reading, acquire) do
  store(reading, true, release)               skip
  while load(writing, acquire) do           store(writing, false, release)
    skip                                    while load(reading, acquire) do
  store(reading, false, release)              skip
  return to_return                          return
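
The read-write interaction above can be sketched in C++ as follows. This is a minimal illustration assuming one reader and one writer, with poison and selection left out; busy_channel and transfer_in_order are hypothetical names, and the yield in the spin loops is an addition (the pseudocode uses a plain skip) so the example still makes progress when threads outnumber cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical one-to-one busy channel following the read-write interaction.
template <typename T>
class busy_channel {
    std::atomic<T> hold{T{}};
    std::atomic<bool> reading{false};
    std::atomic<bool> writing{false};

    static void pause() { std::this_thread::yield(); }  // assumption: yield instead of skip
public:
    T read() {
        while (!writing.load(std::memory_order_acquire)) pause();  // wait for a committed write
        T to_return = hold.load(std::memory_order_relaxed);
        reading.store(true, std::memory_order_release);            // tell the writer we have it
        while (writing.load(std::memory_order_acquire)) pause();   // wait for the writer to finish
        reading.store(false, std::memory_order_release);
        return to_return;
    }
    void write(T value) {
        hold.store(value, std::memory_order_relaxed);
        writing.store(true, std::memory_order_release);            // publishes hold
        while (!reading.load(std::memory_order_acquire)) pause();  // wait for the reader
        writing.store(false, std::memory_order_release);
        while (reading.load(std::memory_order_acquire)) pause();   // wait for the reader to leave
    }
};

// Sends 0..count-1 and checks that the reader sees them in order.
bool transfer_in_order(int count) {
    assert(count >= 0);
    busy_channel<int> chan;
    std::thread writer([&chan, count] {
        for (int i = 0; i < count; ++i) chan.write(i);
    });
    bool in_order = true;
    for (int i = 0; i < count; ++i) {
        if (chan.read() != i) in_order = false;
    }
    writer.join();
    return in_order;
}
```

The relaxed accesses to hold are safe only because they are bracketed by the release store and acquire load on writing.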

SLIDE 28

Write and Select

Three possible outcomes (see paper). Concurrent version.

procedure atomic enable                   procedure atomic write(value)
  store(alting, true, release)              store(hold, value, relaxed)
                                            store(writing, true, release)
  temp ← load(writing, acquire)             if load(alting, acquire) then
  return temp                                 schedule
                                            ...

SLIDE 30

Test Bed

Intel Core i7-4770K CPU at 3.50GHz
Hyper-threading - 4 cores, 8 hardware threads.
Any benchmark > 8 threads will hit performance problems.
Linux 4.4
GCC 7.1, -O3 flag.
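
Because busy channels depend on every spinning thread having a hardware thread to run on, a program might check its thread count before choosing them. The helper below is a hypothetical sketch, not part of C++CSP; it treats an unknown hardware count (reported as 0 by the standard library) conservatively.

```cpp
#include <cassert>
#include <thread>

// True when the requested number of threads can all run at once on the hardware.
bool fits_hardware(unsigned threads,
                   unsigned hardware_threads = std::thread::hardware_concurrency()) {
    if (hardware_threads == 0) return false;  // count unknown: assume it does not fit
    return threads <= hardware_threads;
}

// On the 8-hardware-thread test bed, 8 threads fit and 9 do not.
void check_fits_hardware() {
    assert(fits_hardware(8, 8));
    assert(!fits_hardware(9, 8));
    assert(!fits_hardware(1, 0));
}
```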

SLIDE 31

Communication Time

SLIDE 32

Multiple Communication Times

SLIDE 33

Stressed Select

Channels  Writers/Channel  Busy Select  Busy stddev  Mutex Select  Mutex Select stddev
2         1                380.34       11.50        2646.69       430.68
2         2                526.06       34.80        3074.77       510.18
2         4                93526.69     110770.80    2892.89       374.15
2         8                31519.77     15630.47     2909.16       407.85
4         1                308.96       7.85         884.27        169.14
4         2                516.41       173.68       1228.36       378.08
4         4                25776.55     33808.00     1295.66       314.01
4         8                57581.95     48035.26     1354.72       142.64
8         1                6434.99      19581.05     994.48        249.51
8         2                40251.15     50379.31     1233.26       234.37
8         4                187678.50    213596.23    1099.47       312.47
8         8                563502.86    721881.75    1065.47       196.07

SLIDE 34

Observations

Atomic-based channels are useful when the hardware supports the concurrency.
Atomic-based channels are fast (15.5x) in comparison to mutex-based ones.
Particularly effective for real-time control systems or parallel systems where the thread count matches the hardware thread count.
Performance drops significantly when concurrency is not supported.
The operating system scheduler takes over, and progress relies on the right combination of threads being active.

SLIDE 36

Conclusions

1 That an atomic-channel can be developed using correct memory primitives and memory ordering.
2 That an atomic-channel reduces channel communication time by greater than an order of magnitude in comparison to a mutex-based channel. Selection time is also reduced.
3 When the available hardware supports the number of threads in the atomic-based channel system, performance is good. Once the number of threads increases beyond the available hardware, performance drops and becomes unpredictable as the operating system scheduler takes over determination of progress.

SLIDE 37

Future Work

Verification is required to ensure correctness.
Some real-world benchmarking would be a useful exploration.
Atomic multi-party synchronisation (alting barrier) could be very useful.
Fork it on GitHub - https://github.com/kevin-chalmers/cpp-csp

SLIDE 38

Questions