Parallel high-performance graph processing
Mikhail Chernoskutov, IMM UB RAS, IMCS UrFU, Yekaterinburg. E-mail: mach@imm.uran.ru
Graph algorithms are used in bioinformatics, social network analysis, business analytics, data mining, city planning, and other areas.
Breadth-first search
Reference: Cray User's Group (CUG), 2010.
[Figure: vertex degree distribution]
Level synchronous algorithms
[Figure: example graph with vertices 1–6 traversed level by level]
Problem description
On every iteration, large volumes of vertex data are sent through the interconnect network.
Suggested solution
Top-down traversal
Bottom-up traversal
Direction-optimizing breadth-first search // in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
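The combination of the two directions can be sketched with the switching heuristic used in direction-optimizing BFS. The function name and the concrete alpha value below are illustrative, not taken from the slides:

```python
def choose_direction(frontier_edges, unexplored_edges, alpha=14):
    """Pick the traversal direction for the next BFS level.

    frontier_edges:   total out-edges of vertices on the current frontier
    unexplored_edges: total out-edges of still-unvisited vertices
    alpha:            tuning constant (14 is only an illustrative value)
    """
    # Bottom-up pays off once the frontier's edges dominate the
    # remaining work; otherwise top-down touches fewer edges.
    if frontier_edges > unexplored_edges / alpha:
        return "bottom-up"
    return "top-down"
```

In practice the traversal starts top-down (small frontier), switches to bottom-up near the middle levels of a small-world graph, and may switch back as the frontier shrinks.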
7
for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = level
            for each neighb in vert.neighbors
                if neighb in V.this_node
                    if dist[neighb] = -1
                        dist[neighb] ← level + 1
                        pred[neighb] ← vert
                else
                    vert_batch_to_send.push(neighb)
    send(vert_batch_to_send)
    receive(vert_batch_to_receive)
    parallel for each vert in vert_batch_to_receive
        if dist[vert] = -1
            dist[vert] ← level + 1
            pred[vert] ← vert.pred
    level++
while(!check_end())
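Stripped of the MPI batching (the send/receive of remotely owned vertices), the same level-synchronous top-down traversal can be sketched on a single node as follows; the adjacency-dict representation is only illustrative:

```python
def bfs_top_down(adj, source):
    """Level-synchronous top-down BFS, mirroring the pseudocode:
    at each level, expand every vertex whose dist equals the level."""
    dist = {u: -1 for u in adj}
    pred = {}
    dist[source] = 0
    level = 0
    while True:
        updated = False            # plays the role of check_end()
        for vert in adj:           # "parallel for" in the slides
            if dist[vert] == level:
                for neighb in adj[vert]:
                    if dist[neighb] == -1:
                        dist[neighb] = level + 1
                        pred[neighb] = vert
                        updated = True
        if not updated:
            break
        level += 1
    return dist, pred
```

In the distributed version, a neighbor owned by another node would instead be pushed into vert_batch_to_send and settled after the exchange.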
Data transfers according to data transfer matrix
10 20  8
 1  2  4
 1  2  4
Data transfer time for every iteration of the algorithm
[Chart: data transfer time (sec.) per iteration, 1–7]
for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = -1
            for each neighb in vert.neighbors
                if bitmap_current.neighb = 1
                    dist[vert] ← level + 1
                    pred[vert] ← neighb
                    bitmap_next.vert ← 1
                    break
    all_gather(bitmap_next)
    swap(bitmap_current, bitmap_next)
    level++
while(!check_end())
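A single-node sketch of the bottom-up step, with Python lists standing in for the frontier bitmaps (the representation is illustrative):

```python
def bfs_bottom_up(adj, source):
    """Bottom-up BFS: every still-unvisited vertex scans its neighbors
    for one that is set in the current frontier bitmap."""
    n = len(adj)
    dist = [-1] * n
    pred = [-1] * n
    dist[source] = 0
    bitmap_current = [False] * n
    bitmap_current[source] = True
    level = 0
    while any(bitmap_current):         # plays the role of check_end()
        bitmap_next = [False] * n
        for vert in range(n):          # "parallel for" in the slides
            if dist[vert] == -1:
                for neighb in adj[vert]:
                    if bitmap_current[neighb]:
                        dist[vert] = level + 1
                        pred[vert] = neighb
                        bitmap_next[vert] = True
                        break          # one frontier parent is enough
        # distributed version: all_gather(bitmap_next) across nodes here
        bitmap_current = bitmap_next
        level += 1
    return dist, pred
```

Note the early break: unlike top-down, an unvisited vertex stops scanning as soon as it finds any frontier parent, which is what makes this direction cheap on large frontiers.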
Data synchronization through collective communications
[Figure: frontier bitmaps exchanged among nodes 0–3 via all_gather]
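The collective can be modeled as every node contributing the bitmap slice for its own vertices and receiving the concatenation of all slices. A toy model with no real MPI involved (names are illustrative):

```python
def all_gather(local_slices):
    """Toy model of an all_gather collective: each node owns one slice
    of the frontier bitmap; afterwards every node holds the full bitmap."""
    full_bitmap = [bit for node_slice in local_slices for bit in node_slice]
    return [list(full_bitmap) for _ in local_slices]   # one full copy per node

# 4 nodes, 2 vertices each; node 2 discovered its first vertex this level
slices = [[0, 0], [0, 0], [1, 0], [0, 0]]
gathered = all_gather(slices)
```

After the call, every node sees the same full frontier bitmap, so the next bottom-up level can test any neighbor locally.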
Data transfer time for every iteration of the algorithm
[Chart: data transfer time (sec.) per iteration, 1–7]
Data transfer time for every iteration of the algorithm
[Chart: data transfer time (sec.) per iteration, 1–7]
Skewed degree distribution
[Figure: example graph with vertices 1–4]
The difference between adjacent row-pointer elements gives the degree of some vertex (i.e. how many elements need to be traversed in the Column ids array).
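In CSR terms this looks as follows; the 4-vertex graph and the variable names are illustrative:

```python
# CSR storage of a 4-vertex example graph (values illustrative)
row_ptr    = [0, 2, 4, 6, 8]              # per-vertex offsets into column_ids
column_ids = [1, 2, 0, 3, 0, 3, 1, 2]     # concatenated adjacency lists

def degree(v):
    """Number of Column ids elements to traverse for vertex v."""
    return row_ptr[v + 1] - row_ptr[v]

def neighbors(v):
    """Adjacency list of vertex v as a slice of column_ids."""
    return column_ids[row_ptr[v]:row_ptr[v + 1]]
```

A vertex with a huge degree makes its slice of column_ids correspondingly long, which is exactly the workload imbalance the next slides address.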
The edge array is split into chunks of at most max_edges elements using an additional array, Part column. Each thread then processes only the edges determined by the corresponding elements of the Part column array.
Graph processing without workload balancing
Graph processing with workload balancing
Filling of Part column array
parallel for i in V.this_node
    first ← V.this_node[i]
    last ← V.this_node[i+1]
    index ← round_up(first/max_edges)
    current ← index*max_edges
    while(current < last)
        part_column[index] ← i
        current += max_edges
        index++
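A direct Python transcription of the filling step, serial in place of the parallel for; the CSR row-pointer array plays the role of V.this_node:

```python
import math

def fill_part_column(row_ptr, max_edges):
    """part_column[j] = the vertex whose adjacency list covers the first
    edge of chunk j, where chunk j spans edges [j*max_edges, (j+1)*max_edges)."""
    num_edges = row_ptr[-1]
    num_chunks = (num_edges + max_edges - 1) // max_edges
    part_column = [0] * num_chunks
    for i in range(len(row_ptr) - 1):         # "parallel for" in the slides
        first, last = row_ptr[i], row_ptr[i + 1]
        index = math.ceil(first / max_edges)  # first chunk boundary inside [first, last)
        current = index * max_edges
        while current < last:
            part_column[index] = i            # vertex i owns this chunk's first edge
            current += max_edges
            index += 1
    return part_column
```

For example, with row_ptr = [0, 5, 6, 12] and max_edges = 4 the result is [0, 0, 2]: chunks 0 and 1 start inside vertex 0's long adjacency list, chunk 2 starts inside vertex 2's.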
Main loop of level synchronous breadth-first search modified for workload balancing
// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = level
                for each k ∈ neighbors of curr_vert
                    if dist[k] = -1
                        dist[k] ← level + 1
                        pred[k] ← curr_vert
        curr_vert++
// data synchronization...
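A single-node sketch of the balanced expansion for one level: each (logical) thread takes a fixed chunk of the edge array, uses part_column to find the vertex owning the chunk's first edge, and advances curr_vert as it crosses vertex boundaries. The sketch is written per-edge rather than per-neighbor-list, and the names are illustrative:

```python
def expand_level_balanced(row_ptr, column_ids, part_column, max_edges,
                          dist, pred, level):
    """One top-down BFS level processed over fixed-size edge chunks."""
    num_edges = row_ptr[-1]
    for i in range(len(part_column)):            # "parallel for" over chunks
        first_edge = i * max_edges
        last_edge = min((i + 1) * max_edges, num_edges)
        curr_vert = part_column[i]               # owner of the chunk's first edge
        for edge in range(first_edge, last_edge):
            while row_ptr[curr_vert + 1] <= edge:
                curr_vert += 1                   # edge belongs to the next vertex
            if dist[curr_vert] == level:
                k = column_ids[edge]
                if dist[k] == -1:
                    dist[k] = level + 1
                    pred[k] = curr_vert
```

Every chunk costs at most max_edges edge inspections, so a single high-degree vertex is spread over several chunks instead of stalling one thread.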
Time spent on graph traversal with and without workload balancing
[Chart: time (sec.) per iteration 1–7, without vs. with balancing]
Main loop of level synchronous breadth-first search modified for workload balancing
// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = -1
                for each k ∈ neighbors of curr_vert
                    if bitmap_current.k = 1
                        dist[curr_vert] ← level + 1
                        pred[curr_vert] ← k
                        bitmap_next.curr_vert ← 1
                        break
        curr_vert++
// data synchronization...
Time spent on graph traversal with and without workload balancing
[Chart: time (sec.) per iteration 1–7, without vs. with balancing]
Data synchronization step of every iteration.
All methods are integrated into a custom implementation of the Graph500 benchmark.
The performance of the custom implementation is measured for various numbers of nodes.
The custom implementation is compared with the reference Graph500 implementations.
Performance metric: speed of graph traversal (MTEPS).
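The metric and the benchmark's problem sizes can be computed as follows. A scale-S Graph500 instance has 2**S vertices and (by the benchmark's standard edgefactor of 16) 16 * 2**S edges; the helper names are illustrative:

```python
def graph500_size(scale, edgefactor=16):
    """Graph500 problem size: 2**scale vertices, edgefactor * 2**scale edges."""
    num_vertices = 2 ** scale
    return num_vertices, edgefactor * num_vertices

def mteps(edges_traversed, seconds):
    """Traversal speed in millions of traversed edges per second (MTEPS)."""
    return edges_traversed / seconds / 1e6
```

So the scales 20 to 25 shown in the charts correspond to roughly one million to 33 million vertices.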
[Chart: traversal speed (MTEPS, up to ~900) vs. scale 20–25: custom, replicated, simple]
[Chart: traversal speed (MTEPS, up to ~1800) vs. scale 20–25: custom, replicated, simple]
[Chart: traversal speed (MTEPS, up to ~3000) vs. scale 20–25: custom, replicated, simple]
[Chart: traversal speed (MTEPS, up to ~4500) vs. scale 20–25: custom, replicated, simple]
The workload balancing method helps to reduce overheads connected with graph processing.
The traversal hybridization method helps to reduce the volume of data sent through the interconnect network.
Future work includes adapting the methods for coprocessors.