SLIDE 1

Parallel high-performance graph processing

Mikhail Chernoskutov, IMM UB RAS, IMCS UrFU, Yekaterinburg. E-mail: mach@imm.uran.ru

SLIDE 2

Graph algorithms

  • Bioinformatics
  • Social network analysis
  • Business analytics
  • Data mining
  • City planning
  • and others…

SLIDE 3

Graph algorithms

Breadth-first search

  • Easy to understand
  • Widespread
  • Many ways to parallelize

Graph500

  • www.graph500.org
  • R. Murphy, K. Wheeler, B. Barrett, and J. Ang. Introducing the Graph 500. In Cray User’s Group (CUG), 2010
  • Parallel breadth-first search
  • MPI and OpenMP implementations
  • Designed for graphs with relatively small diameter and skewed degree distribution

  • “Scale-Free” graphs

SLIDE 4

Parallel breadth-first search

Level synchronous algorithms

  • Processing of level N+1 begins only after processing of level N is complete (see the sketch below)

[Figure: example graph with vertices 1–6 processed level by level]
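
As an illustration, here is a minimal serial level-synchronous BFS sketch in C++ (a simplified single-node view, not the author’s MPI/OpenMP code): the frontier of level N is expanded completely before any vertex of level N+1 is processed.

#include <vector>

// dist[v] holds the BFS level of v, or -1 if v has not been reached yet.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{source};          // vertices of the current level
    dist[source] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;                  // vertices of level + 1
        for (int v : frontier)                  // expand the whole current level first
            for (int u : adj[v])
                if (dist[u] == -1) {
                    dist[u] = level + 1;
                    next.push_back(u);
                }
        frontier.swap(next);                    // only now move on to the next level
        ++level;
    }
    return dist;
}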

SLIDE 5

Obstacles for efficient parallel implementation

  • Data transfer problem
  • Graph marking problem

SLIDE 6

Data transfer problem

Problem description

  • Real-world graphs have irregular memory access patterns
  • Graphs with small diameter incur large overheads from data transfer over the interconnect network

Suggested solution

  • Combine different types of level synchronous algorithms

SLIDE 7

Level synchronous algorithms

Top-down traversal

  • The traditional way to implement breadth-first search
  • Each active (frontier) vertex checks all of its neighbors

Bottom-up traversal

  • Each inactive vertex looks for an active vertex in its neighbor list
  • S. Beamer, K. Asanovic, and D. A. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012

SLIDE 8

Top-down breadth-first search

for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = level
            for each neighb in vert.neighbors
                if neighb in V.this_node
                    if dist[neighb] = -1
                        dist[neighb] ← level + 1
                        pred[neighb] ← vert
                else
                    vert_batch_to_send.push(neighb)
    send(vert_batch_to_send)
    receive(vert_batch_to_receive)
    parallel for each vert in vert_batch_to_receive
        if dist[vert] = -1
            dist[vert] ← level + 1
            pred[vert] ← vert.pred
    level++
while(!check_end())

SLIDE 9

Top-down breadth-first search

Data transfers are described by a data transfer matrix

  • b_jk – amount of data (bytes) to transfer from process j to process k

[Figure: example data transfer matrix between processes]

SLIDE 10

Top-down breadth-first search

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.83 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 11

Bottom-up breadth-first search

for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = -1
            for each neighb in vert.neighbors
                if bitmap_current.neighb = 1
                    dist[vert] ← level + 1
                    pred[vert] ← neighb
                    bitmap_next.vert ← 1
                    break
    all_gather(bitmap_next)
    swap(bitmap_current, bitmap_next)
    level++
while(!check_end())
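
For comparison with the top-down step, here is a minimal serial C++ sketch of a single bottom-up level (a simplified single-node view, not the author’s MPI code); the function name and the bitmap representation are illustrative assumptions.

#include <vector>

// One bottom-up level: every unvisited vertex scans its neighbor list and
// attaches itself to the first neighbor found on the current frontier.
void bottom_up_step(const std::vector<std::vector<int>>& adj,
                    std::vector<int>& dist, std::vector<int>& pred,
                    const std::vector<bool>& bitmap_current,
                    std::vector<bool>& bitmap_next, int level) {
    for (std::size_t v = 0; v < adj.size(); ++v) {
        if (dist[v] != -1) continue;            // already visited, skip
        for (int u : adj[v])
            if (bitmap_current[u]) {            // neighbor is on the current frontier
                dist[v] = level + 1;
                pred[v] = u;
                bitmap_next[v] = true;          // v joins the next frontier
                break;                          // one parent is enough
            }
    }
}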

SLIDE 12

Bottom-up breadth-first search

Data synchronization through collective communications

[Figure: bitmap exchange among nodes 0–3 via all_gather]

SLIDE 13

Bottom-up breadth-first search

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.001 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 14

Data transfer problem

Suggested solution – hybrid graph traversal

  • First two iterations – “top-down”
  • Next three iterations – “bottom-up”
  • All remaining iterations – “top-down” (a schedule sketched below)
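
A minimal C++ sketch of this schedule (the enum and function names are illustrative assumptions, not names from the author’s implementation):

// Pick the traversal direction purely by iteration number,
// following the schedule listed above (iterations counted from 0).
enum class Direction { TopDown, BottomUp };

Direction choose_direction(int iteration) {
    if (iteration < 2) return Direction::TopDown;   // first two iterations
    if (iteration < 5) return Direction::BottomUp;  // next three iterations
    return Direction::TopDown;                      // all remaining iterations
}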

SLIDE 15

Hybrid graph traversal

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.0005 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 16

Graph marking problem

Skewed degree distribution

  • Few vertices with a large number of in-/outgoing edges
  • Many vertices with a small number of in-/outgoing edges

SLIDE 17

Graph marking problem

CSR is one of the most popular formats for storing graph data

Row pointers: 0, 3, 6, 7, 10, 12
Column ids: 1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3

[Figure: example graph corresponding to the CSR arrays above]
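
A minimal C++ sketch of this layout (assuming the arrays above describe a 5-vertex graph with vertices numbered from 0): the neighbors of vertex v occupy positions Row pointers[v] up to, but not including, Row pointers[v+1] of the Column ids array.

#include <cstdio>
#include <vector>

int main() {
    // CSR arrays taken from this slide.
    std::vector<int> row_ptr    = {0, 3, 6, 7, 10, 12};
    std::vector<int> column_ids = {1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3};

    // Print the adjacency list of every vertex.
    for (std::size_t v = 0; v + 1 < row_ptr.size(); ++v) {
        std::printf("vertex %zu:", v);
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            std::printf(" %d", column_ids[e]);
        std::printf("\n");
    }
}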

SLIDE 18

Graph marking problem

Using CSR implies traversing the Row pointers array

  • It is not known in advance how many edges are incident to a given vertex (i.e. how many elements of the Column ids array need to be traversed)

The performance of every iteration of a level-synchronous algorithm is dominated by the processing time of the most “heavy-weight” vertex

  • Workload imbalance

SLIDE 19

Graph marking problem

Suggested solution – method of workload balancing

  • Divide the Column ids array into equal parts, each consisting of Max edges elements
  • Map every part onto the Rowstarts array and the Column ids array using an additional array, Part column
  • Every thread processes a number of edges known in advance, determined by the corresponding elements of the Part column array

SLIDE 20

Graph marking problem

Graph processing without workload balancing

SLIDE 21

Graph marking problem

Graph processing with workload balancing

  • max_edges = 4

SLIDE 22

Graph marking problem

Filling the Part column array

parallel for i in V.this_node
    first ← V.this_node[i]
    last ← V.this_node[i+1]
    index ← round_up(first/max_edges)
    current ← index*max_edges
    while(current < last)
        part_column[index] ← i
        current += max_edges
        index++
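
A C++ sketch of the same fill (written sequentially here, while the version above is a parallel loop), using the Row pointers from slide 17 and max_edges = 4 as on slide 21; with 12 edges this yields three parts and part_column = {0, 1, 3}.

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> row_ptr = {0, 3, 6, 7, 10, 12};         // Row pointers (slide 17)
    const int max_edges = 4;                                 // part size (slide 21)
    const int num_edges = row_ptr.back();
    std::vector<int> part_column((num_edges + max_edges - 1) / max_edges);

    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        const int first = row_ptr[i];                        // first edge of vertex i
        const int last  = row_ptr[i + 1];                    // one past its last edge
        int index = (first + max_edges - 1) / max_edges;     // round_up(first / max_edges)
        for (int current = index * max_edges; current < last; current += max_edges)
            part_column[index++] = static_cast<int>(i);      // part 'index' starts inside vertex i
    }

    for (std::size_t j = 0; j < part_column.size(); ++j)
        std::printf("part %zu starts at vertex %d\n", j, part_column[j]);
}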

SLIDE 23

Top-down breadth-first search

Main loop of level synchronous breadth-first search modified for workload balancing

// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = level
                for each k ∈ neighbors of curr_vert
                    if dist[k] = -1
                        dist[k] ← level + 1
                        pred[k] ← curr_vert
            curr_vert++
// data synchronization...

SLIDE 24

Top-down breadth-first search

Time spent on graph traversal with and without workload balancing

  • Graph with ~1 million vertices, parallelized across 4 nodes

[Chart: graph traversal time (sec.) per iteration, with and without workload balancing]

SLIDE 25

Bottom-up breadth-first search

Main loop of level synchronous breadth-first search modified for workload balancing

// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = -1
                for each k ∈ neighbors of curr_vert
                    if bitmap_current.k = 1
                        dist[curr_vert] ← level + 1
                        pred[curr_vert] ← k
                        bitmap_next.curr_vert ← 1
                        break
            curr_vert++
// data synchronization...

SLIDE 26

Bottom-up breadth-first search

Time spent on graph traversal with and without workload balancing

  • Graph with ~1 million vertices, parallelized across 4 nodes

[Chart: graph traversal time (sec.) per iteration, with and without workload balancing]

SLIDE 27

Combining methods

Both methods can be used together to maximize the performance of the breadth-first search algorithm

  • Method of workload balancing – to reduce the time spent on graph processing in each iteration
  • Hybrid traversal – to reduce data transfer overheads in the data synchronization step of every iteration

SLIDE 28

Benchmarking

All methods are integrated into a custom implementation of the Graph500 benchmark

Measure performance of the custom implementation for various numbers of nodes

  • 1, 2, 4, 8 nodes of “Uran” supercomputer
  • CPU Intel Xeon X5675, 192 GB DRAM
  • “Scale” varies from 20 to 25

Compare custom implementation with reference Graph500 implementations

  • Simple implementation
  • Replicated implementation

Performance metric – speed of graph traversal

  • Measured in Traversed Edges Per Second (TEPS)
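
As a rough sketch of the metric definition (not the full Graph500 reporting procedure, which aggregates many BFS runs):

    \mathrm{TEPS} = \frac{\text{number of edges traversed by one BFS}}{\text{BFS run time}},
    \qquad \mathrm{MTEPS} = \mathrm{TEPS} / 10^{6}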

SLIDE 29

Results (1 node)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 30

Results (2 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 31

Results (4 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 32

Results (8 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 33

Results

Combining the workload balancing and traversal hybridization methods improves the performance of parallel breadth-first search

The custom implementation has potential for further parallelization

SLIDE 34

Conclusion

The method of workload balancing helps to reduce overheads connected with graph processing

The method of traversal hybridization helps to reduce overheads connected with data transfer on every iteration

Future plans

  • Investigate the scalability of the developed algorithm
  • Modify the custom implementation to use performance accelerators and coprocessors

SLIDE 35

Questions?
