Optimization of Scalable Concurrent Pool Based on Diffraction Trees



SLIDE 1

Optimization of Scalable Concurrent Pool Based on Diffraction Trees

Anenkov Alexandr
Siberian State University of Telecommunications and Information Sciences, Novosibirsk
E-mail: alex.anenkov@outlook.com

Paznikov Alexey
Saint Petersburg Electrotechnical University "LETI"; Siberian State University of Telecommunications and Information Sciences, Novosibirsk; Rzhanov Institute of Semiconductor Physics, Siberian Branch of RAS, Novosibirsk
E-mail: apaznikov@gmail.com

The First Summer School on Practice and Theory of Concurrent Computing (SPTCC), ITMO University, Saint Petersburg, July 3–7, 2017
SLIDE 2

Concurrent pool

Let there be a multicore (NUMA or SMP) computer system. A set of threads executes push() and pop() operations at random moments. The goal is to maximize the efficiency of pool access for the threads.

[Figure: threads 1–6 on two NUMA nodes (cores 1–8 with L1/L2/L3 caches, RAM, QuickPath/HyperTransport interconnect) executing push and pop operations on a shared pool in memory]

SLIDE 3

Concurrent pool

As the efficiency indicator we use the pool's throughput

c = O / u,

where O is the total number of executed push and pop operations and u is the modelling time. The throughput shows how many operations have been performed per second.

[Figure: the same NUMA system diagram as on slide 2]

SLIDE 4

Approaches to the implementation of a concurrent pool

  • 1. Concurrent queues based on thread locking (PThread mutex, MCS, CLH, CAS spinlocks, Flat Combining, Oyama locks, RCL).
  • 2. Lock-free concurrent linear lists.
  • 3. Lists based on elimination backoff, combining trees and other methods.
  • 4. Multiple lists combined with a diffracting tree.
  • 5. Others.
SLIDE 5

Approaches to the implementation of a concurrent pool

  • 1. Concurrent queues based on thread locking (PThread mutex, MCS, CLH, CAS spinlocks, Flat Combining, Oyama locks, RCL).
    Drawbacks: lack of scalability for a large number of threads and a high intensity of pool operations.
  • 2. Lock-free concurrent linear lists.
    Drawbacks: the heads of the lists become bottlenecks, which increases access contention and decreases cache efficiency.
  • 3. Lists based on elimination backoff, combining trees and other methods.
    Drawbacks: threads must wait for complementary operations (active waiting or use of a system timer), and the algorithms have to be tuned by parameters: waiting interval, acceptable number of collisions, etc.
  • 4. Multiple lists combined with a diffracting tree.

SLIDE 6

Diffracting tree

A diffracting tree is a binary tree of height h, each node of which contains a bit determining the direction in which threads pass. The tree nodes (balancers) redirect requests arriving from threads for pushing and popping elements to one of the child nodes in turn:

  • if c = 0, the thread goes to the node of the right subtree,
  • if c = 1, the thread goes to the node of the left subtree.

After passing through a node, the thread inverts its bit.
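The balancer behaviour described above maps directly onto an atomic XOR: the old value of the bit decides the direction, and the XOR inverts it in the same step. A minimal C++ sketch (names are illustrative, not the authors' code):

```cpp
#include <atomic>

// One balancer of a diffracting tree: consecutive threads are sent
// to alternating subtrees (0 = right, 1 = left, as on the slide).
struct Balancer {
    std::atomic<int> bit{0};

    // Returns the direction for the calling thread and inverts the bit.
    int traverse() {
        return bit.fetch_xor(1);  // previous value = chosen direction
    }
};
```

Because fetch_xor is a single atomic read-modify-write, two threads arriving at once still get distinct directions, which is exactly the "diffraction" the tree relies on.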

[Figure: a diffracting tree of height h; objects arriving from threads 1…p via push pass through balancers b toward the leaves]

Shavit N., Zemach A. Diffracting Trees // ACM Transactions on Computer Systems (TOCS). 1996. Vol. 14. No. 4. P. 385–428.

SLIDE 7

Concurrent pool based on a diffracting tree

The pool consists of several concurrent queues r = {1, 2, …, 2^h}, which are accessed by means of a diffracting tree.

[Figure: a diffracting tree of height h whose balancers contain a push-thread's bit, a pop-thread's bit and an array for the elimination of complementary operations; the leaves are the concurrent queues Queue 1 … Queue 2^h, accessed by threads 1…p with push/pop requests]

Afek Y., Korland G., Natanzon M., Shavit N. Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees // European Conference on Parallel Processing, 2010. P. 151–162.

SLIDE 8

Concurrent pool LocOptDTPool based on a diffracting tree

In the developed pool LocOptDTPool, each node on level k of the diffracting tree contains two arrays of atomic bits (one for producer threads, one for consumer threads) of size n_k ≤ q instead of two single atomic bits.

[Figure: the diffracting tree with, in each node, an array of push-thread bits b11…b1m and an array of pop-thread bits b21…b2m; the leaves are the concurrent queues Queue 1 … Queue 2^h]

n_k is the number of atomic bits in a node: n_k = n_1 / 2^k, where k is the tree level of the node.

SLIDE 9

Concurrent pool LocOptDTPool based on a diffracting tree

On each visit to a tree node, a thread chooses a cell in the array of atomic bits by means of the hash function

h(j) = j mod n,

where j ∈ {1, 2, …, q} is the serial number of the current thread and n is the size of the array of atomic bits.

[Figure: the same tree as on slide 8, with the push-thread and pop-thread bit arrays in each node]
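The cell selection can be sketched as a one-line helper (hypothetical name; the presentation only gives the formula h(j) = j mod n). Threads with different serial numbers land on different cells as long as n is large enough, which spreads contention across the array:

```cpp
#include <cstddef>

// Pick the cell of a node's atomic-bit array for a given thread,
// following h(j) = j mod n from the slide.
std::size_t pick_cell(std::size_t thread_id, std::size_t array_size) {
    return thread_id % array_size;
}
```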

SLIDE 10

Concurrent pool LocOptDTPool based on a diffracting tree

Each processor core k ∈ {1, 2, …, o} corresponds to the queues r_k = {k·2^h/o, k·2^h/o + 1, …, (k + 1)·2^h/o − 1}. All objects pushed into (or popped from) the pool by a thread j affined to core k (b(j) = k) are distributed among the queues r_k only.

[Figure: the leaves Queue 1 … Queue 2^h partitioned among processor cores 1 … o]

SLIDE 11

Concurrent pool LocOptDTPool based on a diffracting tree

During the push operation, the queue into which the object is inserted is

r_k = q·m mod (2^h / q) + b(k),

where q is the total number of processor cores, m is the tree leaf visited by the thread, b(k) is the number of the core to which thread k is affined, and 2^h is the total number of queues in the pool.

During the pop operation, the queue from which the object is removed is

r_k = q·β + b(k),

where β is the shift coefficient.

[Figure: the leaves Queue 1 … Queue 2^h partitioned among processor cores 1 … o]
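The two selection rules can be written out as small helpers (hypothetical names; only the formulas come from the slide). With q = 8 cores and 2^h = 16 queues, a thread on core 3 that reached leaf m = 1 would push into queue 8·1 mod 2 + 3 = 3:

```cpp
// Queue chosen for push: r = q*m mod (2^h / q) + b(k).
int push_queue(int num_cores, int leaf, int core, int num_queues) {
    return (num_cores * leaf) % (num_queues / num_cores) + core;
}

// Queue chosen for pop: r = q*beta + b(k).
int pop_queue(int num_cores, int beta, int core) {
    return num_cores * beta + core;
}
```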

SLIDE 12

Concurrent pool LocOptDTPool: description of the data structure

LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

SLIDE 13

Concurrent pool LocOptDTPool: description of the data structure

Node {
    int index, level
    Node children[2]
    int traverse(BitArray bits)
}

SLIDE 14

Concurrent pool LocOptDTPool: description of the data structure

BitArray {
    Bit bits_array[n][m]
    int flip(tree_level, node_index) {
        return bits_array[tree_level][node_index].flip()
    }
}

Bit {
    AtomicInt bit
    int flip() {
        return bit.atomic_xor(1)
    }
}

SLIDE 15

Concurrent pool LocOptDTPool: description of the data structure

AffinityManager {
    thread_local int core, queue_offset
    AtomicInt next_core, next_offset
    set_core()
    get_core()
    get_queue_offset()
}
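The AffinityManager's counter logic can be sketched in C++ as a round-robin hand-out of cores to threads. This is an illustrative reading of the pseudocode, not the authors' implementation: a real set_core() would also pin the calling thread to the chosen core (e.g. via pthread_setaffinity_np), which is omitted here.

```cpp
#include <atomic>

// Round-robin assignment of threads to o processor cores.
struct AffinityManager {
    std::atomic<int> next_core{0};
    int num_cores;

    explicit AffinityManager(int o) : num_cores(o) {}

    // Each call hands out the next core, wrapping around after o cores.
    int assign_core() {
        return next_core.fetch_add(1) % num_cores;
    }
};
```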

SLIDE 16

Pool LocOptDTPool: algorithms for pushing and popping elements

Algorithm push (inserting an element into the pool)

  • 1. Increase the counter prod_num.
  • 2. Set the thread's affinity to a processor core, if this has not been done yet.
  • 3. Choose the queue into which the object will be inserted:
    r_k = q·m mod (2^h / q) + b(k),
    where q is the number of processor cores, m is the tree leaf visited by the thread, b(k) is the number of the core to which thread k is affined, and 2^h is the total number of queues in the pool.

SLIDE 17

Pool LocOptDTPool: algorithms for pushing and popping elements

Algorithm pop (removing an element from the pool)

  • 1. Set the thread's affinity by means of the affinity manager af_mgr and increase the counter cons_num (if this has not been done yet).
  • 2. Choose the queue from which to remove the element:
    r_k = q·β + b(k),
    where q is the number of processor cores, b(k) is the number of the core to which thread k is affined, and β is the shift coefficient.
  • 3. If the queue selected in step 2 is empty, the element is removed from the first following non-empty queue.
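Step 3 of pop (falling through to the first non-empty queue) can be sketched with a wrapping scan. Plain std::deque stands in for the pool's thread-safe queues, and the function name is illustrative; a real implementation would use the lock-free queues and retry logic from the pool:

```cpp
#include <deque>
#include <optional>
#include <vector>

// Try the preferred queue first; if it is empty, scan forward
// (wrapping around) until a non-empty queue is found.
std::optional<int> pop_from(std::vector<std::deque<int>>& queues, int preferred) {
    const int n = static_cast<int>(queues.size());
    for (int i = 0; i < n; ++i) {
        auto& q = queues[(preferred + i) % n];
        if (!q.empty()) {
            int value = q.front();
            q.pop_front();
            return value;
        }
    }
    return std::nullopt;  // the whole pool is empty
}
```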

SLIDE 18

Concurrent pool TLSDTPool based on Thread Local Storage

The concurrent pool TLSDTPool is similar to LocOptDTPool, but the structure BitArray is allocated in each thread's TLS. This makes it possible to avoid expensive atomic operations when accessing the bits of the structure BitArray; ordinary boolean variables are used as the bits.
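Since each thread owns its copy of the bits, flipping them needs no atomics at all. A minimal sketch of the idea (sizes and names are illustrative assumptions, not the pool's actual layout):

```cpp
// Per-thread copy of the tree's node bits: no atomics needed,
// because no other thread ever touches this storage.
constexpr int kLevels = 4;
constexpr int kNodesPerLevel = 16;
thread_local bool tls_bits[kLevels][kNodesPerLevel] = {};

// Same contract as the atomic Bit::flip(): return the old value (0/1)
// and invert the stored bit.
int flip_local(int level, int node) {
    const bool old = tls_bits[level][node];
    tls_bits[level][node] = !old;
    return old ? 1 : 0;
}
```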

[Figure: each of the threads 1…p keeps its own copies of the node bits (b11…b1p, b21…b2p) in its TLS; the leaves are the concurrent queues Queue 1 … Queue 2^h]

SLIDE 19

Concurrent pool TLSDTPool based on Thread Local Storage

Consider the possible case when, at some moment, several threads address the same queue at the same time. To solve this problem, the pool TLSDTPool implements a heuristic algorithm for initializing the elements of the bit arrays in the tree nodes.

The initial values for the tree nodes are assigned on the basis of the binary representation of the thread identifiers. In the case of constantly active threads, the algorithm distributes the threads uniformly over the tree nodes, preventing an imbalance of the queue load in the leaves of the tree.

[Figure: tree levels 1…h with the initial bit states in the nodes derived from the thread identifiers]
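One natural reading of this heuristic, shown here purely as an interpretation of the slide (the exact mapping is not given): take the bit at tree level k from bit k of the thread's identifier, so threads with distinct identifiers start out routed toward distinct leaves.

```cpp
// Initial bit for a thread's private copy of a node on the given tree
// level, taken from the binary representation of the thread identifier.
int initial_bit(unsigned thread_id, int level) {
    return (thread_id >> level) & 1u;
}
```

For example, thread 5 (binary 101) starts with bits 1, 0, 1 on levels 0, 1, 2, while thread 2 (binary 10) starts with 0, 1, so their initial paths through the tree diverge.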

SLIDE 20

Experiments

System configuration
  • A node of the SibSUTIS computer cluster: two 4-core Intel Xeon E5420 processors (2.5 GHz, Intel 64), 8 cores in total.

Benchmark
  • Synthetic benchmark: q threads randomly execute operations of pushing and popping elements (integer variables) into and from the pool during the time u.
  • Number of threads:
    ▪ q = 1, 2, …, 8: the number of threads does not exceed the number of processor cores;
    ▪ q = 10, 20, …, 200: the number of threads exceeds the number of processor cores.
  • u = 10 s: duration of an experiment.
  • Tree height h = 4 (number of queues |r| = 16).
  • Language C++, compiler GCC 4.8.2.

Efficiency indicator
  • Throughput of the pool c = O / u, where O is the number of performed operations (pushing or popping of elements) on the pool.

SLIDE 21

Experimental results, LocOptDTPool

[Plots: throughput b (10^6 op/s) versus the number of threads p, for p not exceeding and p exceeding the number of processor cores. 1 – LocOptDTPool with lock-free queues from the Boost.Lockfree library; 2 – LocOptDTPool with lock-based queues using PThreads mutexes; 3 – a single lock-free Boost.Lockfree queue used as the pool]

SLIDE 22

Experimental results, TLSDTPool

[Plots: throughput b (10^6 op/s) versus the number of threads p, for p not exceeding and p exceeding the number of processor cores. 1 – TLSDTPool with lock-free queues from the Boost.Lockfree library; 2 – TLSDTPool with lock-based queues using PThreads mutexes; 3 – a single lock-free Boost.Lockfree queue used as the pool]

SLIDE 23

Conclusion

  • Algorithms for a scalable concurrent pool based on diffracting trees have been developed.
  • We compared the efficiency of the pool for different types of queues in the leaves of the trees.
  • The pool provides greater scalability during multithreaded program execution compared with analogous pool implementations based on diffracting trees.
  • The highest efficiency of the data structures is achieved when the number of active threads is equal to the number of processor cores in the system.
  • Increasing the tree size in the pool does not decrease the throughput of the pools.
  • As the data structures in the tree leaves for storing pool objects, we recommend lock-free concurrent queues, which ensure sufficient throughput.

SLIDE 24

Thank you for your attention!

SLIDE 25

Related works

Della-Libera G., Shavit N. Reactive Diffracting Trees // J. Parallel Distrib. Comput. 2000. Vol. 60. P. 853–890.
Ha P. H., Papatriantafilou M., Tsigas P. Self-Tuning Reactive Distributed Trees for Counting and Balancing.
Afek Y., Korland G., Natanzon M., Shavit N. Scalable Producer-Consumer Pools Based on Elimination-Diffraction Trees // European Conference on Parallel Processing, 2010. P. 151–162.
Moir M. et al. Using Elimination to Implement Scalable and Lock-Free FIFO Queues // Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2005. P. 253–262.

SLIDE 26

Pool based on a diffracting tree

The drawbacks of this pool based on a diffracting tree are similar to the drawbacks of a pool based on a queue with an elimination array:
1) In the worst case, a thread waits for a complementary operation on every level of the tree.
2) Various parameters have to be taken into account: waiting timeout, acceptable number of collisions, etc.
3) The efficiency of the pool drops as the height of the diffracting tree grows.

SLIDE 27

Localization of pool data

Task: optimize the existing thread-safe non-blocking pool based on a diffracting tree for a constant number of threads by localizing the pool's data. This approach reduces the number of cache misses and increases the throughput.

SLIDE 28

Other steps

The pool implementation also takes the following optimizations into account:
  • Aligning data structures (alignas in C++) to the size of the processor cache line in order to avoid false sharing.
  • Taking into account the number of threads in the pool: there is no need to distribute and extract objects over a large number of queues if the current load on the pool is small.
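The alignment optimization mentioned above can be sketched as follows: each concurrently updated counter gets its own cache line, so two threads incrementing different counters never invalidate each other's line. The 64-byte line size is a typical x86 value assumed here, not something stated on the slide:

```cpp
#include <atomic>

// Pad each per-thread counter out to a full cache line (64 bytes
// assumed) so that neighbouring counters do not falsely share a line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};
```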

SLIDE 29

Formulas

Choosing the queue into which an object is placed:
q_j = p·l mod (2^h / p) + a(j)
Choosing the queue from which an object is taken:
q_i = p·α + a(j)
where p is the total number of processor cores, l is the tree leaf visited by the thread, a(j) is the number of the core to which thread j is affined, 2^h is the total number of queues in the pool, and α is the shift coefficient.

SLIDE 30

Structure of the thread-safe pool LocOptDTPool:

class LocOptDTPool {
    Node tree
    BitArray prod_bits[m], cons_bits[m]
    ThreadSafeQueue queues[n]
    AffinityManager af_mgr
    AtomicInt prod_num, cons_num
    push(data)
    pop()
}

Class implementing the affinity of threads to processor cores:

class AffinityManager {
    thread_local int core
    thread_local int queue_offset
    AtomicInt next_core
    AtomicInt next_offset
    set_core()
    get_core_num()
    get_queue_offset()
}

SLIDE 31

Node of the diffracting tree:

class Node {
    int index, level
    Node children[2]
    int traverse(BitArray bits)
}

Array of atomic bits:

class BitArray {
    Bit bits_array[n][m]
    int flip(tree_level, node_index) {
        return bits_array[tree_level][node_index].flip()
    }
}

Togglable atomic bit:

class Bit {
    AtomicInt bit
    int flip() {
        return bit.atomic_xor(1)
    }
}