SLIDE 1
Building a C++CSP Channel Using C++ Atomics: A Busy Channel Performance Analysis
Dr Kevin Chalmers
School of Computing, Edinburgh Napier University, Edinburgh
k.chalmers@napier.ac.uk
SLIDE 7
Outline
1 Introduction and Background
2 Current Channel Operations
3 An Atomic Channel Implementation
4 Benchmarking
5 Conclusions
SLIDE 8
Motivation
- Most CSP-inspired libraries build channels using a mutex.
- Channel communication in most libraries is slow due to context switching.
- A typical context switch on an i7 may take 1000 ns or more.
- A channel built on a mutex may have up to three context switches:
1 Writer to reader after the writer has stored the value to be written.
2 Reader to writer after the reader has retrieved the value.
3 Writer to reader to complete the operation.
- A channel effectively stops parallelism and forces two processes to become
sequential during communication.
- The aim of this work is to explore atomic operations as a method to improve
channel performance.
SLIDE 9
Atomic Values
- Any trivially copyable type can be an atomic in C++, but how it is interacted
with depends on its base type.
- Five types:
  - Atomic Flag
  - Atomic Boolean
  - Atomic Integral
  - Atomic Pointer
  - Atomic Trivially Copyable User Type
- Flag and Boolean are different.
  - Flag is guaranteed lock free.
  - Flag has only two operations (before C++20).
    - Test and Set (sets the flag and returns its previous value, so false means this call set it).
    - Clear.
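The two flag operations can be sketched as follows. This is a minimal illustration, assuming C++11 or later; `busy`, `try_claim` and `release` are hypothetical names and are not taken from C++CSP.

```cpp
#include <atomic>

// Hypothetical helpers for illustration; not part of C++CSP.
std::atomic_flag busy = ATOMIC_FLAG_INIT;  // starts in the clear state

// Returns true if this call changed the flag from clear to set.
bool try_claim() {
    // test_and_set returns the PREVIOUS value, so false means we set it.
    return !busy.test_and_set(std::memory_order_acquire);
}

// Puts the flag back into the clear state.
void release() {
    busy.clear(std::memory_order_release);
}
```

A second `try_claim()` fails until `release()` is called, which is the basis of a simple spin lock.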
SLIDE 10
Atomic Operations
- Atomic Flag already mentioned.
- Other types support the following, depending on type:
  store: atomically stores a value.
  load: atomically retrieves a value.
  exchange: atomically retrieves the current value and stores a new value.
  compare and exchange: tests the current value and, if it matches the expected value, exchanges it with the new value. Provides the current value if not.
    strong: guarantees the comparison is correct.
    weak: may return false even if the values match.
  fetch and op: gets the current value and performs the given operation.
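The operations listed above map onto `std::atomic` member functions. The sketch below exercises each once on an `std::atomic<int>`; the function name `demo_operations` is an assumption made for this example.

```cpp
#include <atomic>

// Exercises each operation once and returns the final counter value,
// or -1 if any intermediate result was not what the comments state.
int demo_operations() {
    std::atomic<int> counter{0};

    counter.store(5);                 // store: counter is now 5
    int seen = counter.load();        // load: seen is 5
    int old = counter.exchange(7);    // exchange: old is 5, counter is 7

    int expected = 5;                 // deliberately stale
    // Fails because counter is 7; expected is overwritten with 7.
    bool swapped = counter.compare_exchange_strong(expected, 9);
    // Succeeds now that expected matches; counter becomes 9.
    swapped = counter.compare_exchange_strong(expected, 9);

    // fetch_add returns the value before the addition (9); counter is 10.
    int before = counter.fetch_add(1);

    bool ok = (seen == 5 && old == 5 && swapped && before == 9);
    return ok ? counter.load() : -1;
}
```

`compare_exchange_weak` has the same signature but may fail spuriously, so it is normally called in a retry loop.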
SLIDE 11
Memory Ordering
- Atomic operations in C++ work on the principle of what has been observed by
the different threads.
- We want to cause memory synchronisation to ensure a certain state has been
reached.
- Five types of memory ordering of interest:
  Sequentially consistent: everything behaves as if everyone is watching. Slow, but easy to think about.
  Relaxed: no synchronisation of memory. Fast. Memory synchronisation can be achieved by subsequent operations.
  Acquire: for load operations, etc. Matches a release.
  Release: for store operations, etc. Matches an acquire.
  Acquire-release: for fetch and op, etc. Matches a release and acquire.
- A naive explanation is that operations chain together to allow a memory history.
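A release store matched by an acquire load can be sketched with two threads. This is an illustrative example, not code from the talk; `publish_and_consume`, `payload` and `ready` are assumed names.

```cpp
#include <atomic>
#include <thread>

// Publishes a plain int through a release store and reads it back after
// the matching acquire load. Returns the value the consumer observed.
int publish_and_consume() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&] {
        // Spin until the release store becomes visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // The acquire load matched the release store, so the write to
        // payload is visible here.
        observed = payload;
    });

    producer.join();
    consumer.join();
    return observed;
}
```

With relaxed ordering on both sides the flag would still arrive, but nothing would order the write to `payload` before the read.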
SLIDE 12
C++CSP Channel Model
procedure read
    lock
    if strength > 0 then
        throw
    if empty then
        empty ← false
        wait
    else
        empty ← true
    to_return ← hold
    notify
    if strength > 0 then
        throw
    return to_return

procedure write(value)
    lock
    if strength > 0 then
        throw
    hold ← value
    if empty then
        empty ← false
        if alting then
            schedule
    else
        empty ← true
        notify
    wait
    if strength > 0 then
        throw
SLIDE 13
C++CSP Channel Model
procedure enable
    lock
    if strength > 0 then
        return true
    if empty then
        alting ← true
        return false
    else
        return true

procedure disable
    lock
    alting ← false
    return !empty or strength > 0
SLIDE 14
Objectives
- The aim is to discover whether atomics give us a better performing channel.
- Objectives
1 Build an atomic-based channel implementation. Judging by other CSP libraries,
this is believed to be the first such implementation.
2 Undertake a performance analysis of the atomic-based channel implementation. This
analysis can be used to understand when to use an atomic-based channel.
SLIDE 16
Operation Pairings
- Six interactions possible when working with channels.
1 Write and read.
2 Write and select.
3 Write and poison.
4 Read and poison.
5 Select and poison.
- With the lock removed, the required ordering of instructions is analysed for each pairing.
- Only write-read and write-select are covered here. The others are in the paper.
SLIDE 17
Write and Read - Read-first
procedure read                procedure write(value)
    empty ← false                 hold ← value
    wait                          empty ← true
    wait                          notify
    to_return ← hold              wait
    notify                        wait
SLIDE 18
Write and Read - Write-first
procedure read                procedure write(value)
                                  hold ← value
                                  empty ← false
    empty ← true                  wait
    to_return ← hold              wait
    notify                        wait
    return to_return
SLIDE 19
Write and Select
- If write goes first we have no concerns.
procedure enable              procedure write(value)
    alting ← true                 hold ← value
    return false                  empty ← false
                                  schedule
                                  wait
SLIDE 20
Initial Analysis
- Channel operations are fairly simple from an instruction point of view.
- There is complexity in the ordering of the operations.
- A mutex-based channel avoids this complexity by enforcing sequential behaviour
during the interaction.
- An atomic-based channel will need to synchronise on state, and then progress
through the operations only when the correct next state is observed.
- This requires some busy spinning (equivalent to waiting).
SLIDE 22
Atomic-based Channel Members
- The atomic-based channel uses six values:
  hold: an atomic to store the value communicated via the channel.
  reading: an atomic boolean to indicate if the reader is reading or not.
  writing: an atomic boolean to indicate if the writer is writing or not.
  alt: a reference to any alt construct currently engaged with the channel.
  alting: an atomic boolean to indicate if the channel is in a selection process.
  strength: an atomic integral storing the current poison applied to the channel.
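One possible C++ layout for these six members is sketched below. The member names follow the slide; the concrete types, the initial values, and the names `BusyChannelState` and `Alt` are assumptions for illustration and may differ from the C++CSP source.

```cpp
#include <atomic>
#include <cstdint>

// Stand-in for the selection construct; deliberately left undefined here.
struct Alt;

// Hypothetical layout of the six channel members (types are assumed).
template <typename T>
struct BusyChannelState {
    std::atomic<T> hold{T{}};                 // value in transit
    std::atomic<bool> reading{false};         // reader is mid-communication
    std::atomic<bool> writing{false};         // writer is mid-communication
    Alt* alt = nullptr;                       // alt currently engaged, if any
    std::atomic<bool> alting{false};          // channel is in a selection
    std::atomic<std::uint32_t> strength{0};   // poison strength, 0 = none
};
```

`T` must be trivially copyable for `std::atomic<T>` to be valid; larger types may not be lock free on a given platform.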
SLIDE 23
Atomic Read
procedure atomic_read
    while !load(writing, acquire) do
        skip
    if load(strength, relaxed) > 0 then
        throw
    to_return ← load(hold, relaxed)
    store(reading, true, release)
    while load(writing, acquire) do
        skip
    store(reading, false, release)
    return to_return
SLIDE 24
Atomic Write
procedure atomic_write(value)
    store(hold, value, relaxed)
    store(writing, true, release)
    if load(alting, acquire) then
        schedule
    while !load(reading, acquire) do
        skip
    if load(strength, relaxed) > 0 then
        throw
    store(writing, false, release)
    while load(reading, acquire) do
        skip
SLIDE 25
Atomic Enable
procedure atomic_enable
    store(alting, true, release)
    temp ← load(writing, acquire)
    if load(strength, relaxed) > 0 then
        return true
    return temp
SLIDE 26
Atomic Disable
procedure atomic_disable
    store(alting, false, release)
    return load(writing, acquire)
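The enable and disable procedures translate almost line for line into C++. The sketch below copies the memory orderings exactly as given on the slides and does not argue that they are sufficient; `SelectableEnd` is a hypothetical name, and scheduling of the alt itself is left out.

```cpp
#include <atomic>

// Selection-side state only; the alt and its scheduling are out of scope.
struct SelectableEnd {
    std::atomic<bool> writing{false};
    std::atomic<bool> alting{false};
    std::atomic<unsigned> strength{0};

    // Returns true if the channel is already ready or poisoned.
    bool enable() {
        alting.store(true, std::memory_order_release);
        bool temp = writing.load(std::memory_order_acquire);
        if (strength.load(std::memory_order_relaxed) > 0) {
            return true;
        }
        return temp;
    }

    // Leaves the selection and reports whether a writer is committed.
    bool disable() {
        alting.store(false, std::memory_order_release);
        return writing.load(std::memory_order_acquire);
    }
};
```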
SLIDE 27
Write and Read
Algorithm 1 Atomic Read-Write Interaction
procedure atomic_read                     procedure atomic_write(value)
    while !load(writing, acquire) do          store(hold, value, relaxed)
        skip                                  store(writing, true, release)
    to_return ← load(hold, relaxed)           while !load(reading, acquire) do
    store(reading, true, release)                 skip
    while load(writing, acquire) do           store(writing, false, release)
        skip                                  while load(reading, acquire) do
    store(reading, false, release)                skip
    return to_return                          return
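The read-write interaction can be sketched in C++ for one reader and one writer. This is a minimal illustration under stated assumptions, not the C++CSP implementation: poison and selection are omitted, and `BusyChannel` and `transfer` are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal busy channel for exactly one reader and one writer.
template <typename T>
class BusyChannel {
    std::atomic<T> hold{T{}};
    std::atomic<bool> reading{false};
    std::atomic<bool> writing{false};

public:
    T read() {
        while (!writing.load(std::memory_order_acquire)) {}  // wait for writer
        T to_return = hold.load(std::memory_order_relaxed);
        reading.store(true, std::memory_order_release);      // value taken
        while (writing.load(std::memory_order_acquire)) {}   // writer has seen it
        reading.store(false, std::memory_order_release);     // reset for next round
        return to_return;
    }

    void write(T value) {
        hold.store(value, std::memory_order_relaxed);
        writing.store(true, std::memory_order_release);      // publishes hold
        while (!reading.load(std::memory_order_acquire)) {}  // wait for reader
        writing.store(false, std::memory_order_release);
        while (reading.load(std::memory_order_acquire)) {}   // wait for reset
    }
};

// Sends 1..count through a channel and returns what the reader received.
std::vector<int> transfer(int count) {
    BusyChannel<int> chan;
    std::vector<int> received;
    std::thread writer([&] {
        for (int i = 1; i <= count; ++i) chan.write(i);
    });
    std::thread reader([&] {
        for (int i = 0; i < count; ++i) received.push_back(chan.read());
    });
    writer.join();
    reader.join();
    return received;
}
```

Both sides spin without yielding, so each thread needs its own hardware thread for the exchange to be fast.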
SLIDE 28
Write and Select
- Three possible outcomes (see paper). Concurrent version.
procedure atomic_enable                   procedure atomic_write(value)
    store(alting, true, release)              store(hold, value, relaxed)
                                              store(writing, true, release)
    temp ← load(writing, acquire)             if load(alting, acquire) then
    return temp                                   schedule
                                              ...
SLIDE 30
Test Bed
- Intel Core i7-4770K CPU at 3.50GHz
- Hyper-threading - 4 cores, 8 hardware threads.
- Any benchmark > 8 threads will hit performance problems.
- Linux 4.4
- GCC 7.1, -O3 flag.
SLIDE 31
Communication Time
SLIDE 32
Multiple Communication Times
SLIDE 33
Stressed Select
Channels  Writers/Channel  Busy Select  Busy stddev  Mutex Select  Mutex Select stddev
2         1                380.34       11.50        2646.69       430.68
2         2                526.06       34.80        3074.77       510.18
2         4                93526.69     110770.80    2892.89       374.15
2         8                31519.77     15630.47     2909.16       407.85
4         1                308.96       7.85         884.27        169.14
4         2                516.41       173.68       1228.36       378.08
4         4                25776.55     33808.00     1295.66       314.01
4         8                57581.95     48035.26     1354.72       142.64
8         1                6434.99      19581.05     994.48        249.51
8         2                40251.15     50379.31     1233.26       234.37
8         4                187678.50    213596.23    1099.47       312.47
8         8                563502.86    721881.75    1065.47       196.07
SLIDE 34
Observations
- Atomic-based channels are useful when the hardware supports the concurrency.
- Atomic-based channels are fast: 15.5x faster than mutex-based ones.
- Particularly effective for real-time control systems or parallel systems where the
thread count matches the hardware count.
- Performance drops significantly when concurrency is not supported.
- The operating system scheduler takes over, and progress relies on the right
combination of threads being active.
SLIDE 36
Conclusions
1 That an atomic-channel can be developed using correct memory primitives and
memory ordering.
2 That an atomic-channel reduces channel communication time by more than an
order of magnitude in comparison to a mutex-based channel. Selection time is
also reduced.
3 When the available hardware supports the number of threads in the atomic-based
channel system, performance is good. Once the number of threads increases beyond the available hardware, performance drops and becomes unpredictable as the operating system scheduler takes over determination of progress.
SLIDE 37
Future Work
- Verification is required to ensure correctness.
- Some real-world benchmarking would be a useful exploration.
- Atomic multi-party synchronisation (alting barrier) could be very useful.
- Fork it on GitHub - https://github.com/kevin-chalmers/cpp-csp
SLIDE 38