SLIDE 1

Parallel high-performance graph processing

Mikhail Chernoskutov, IMM UB RAS, IMCS UrFU, Yekaterinburg. E-mail: mach@imm.uran.ru

SLIDE 2

Graph algorithms

  • Bioinformatics
  • Social network analysis
  • Business analytics
  • Data mining
  • City planning
  • and others…

SLIDE 3

Graph algorithms

Breadth-first search

  • Easy to understand
  • Widespread
  • Many ways to parallelize

Graph500

  • www.graph500.org
  • R. Murphy, K. Wheeler, B. Barrett, and J. Ang. Introducing the Graph 500. In Cray User’s Group (CUG), 2010
  • Parallel breadth-first search
  • MPI and OpenMP implementations
  • Designed for graphs with relatively small diameter and skewed degree distribution

  • “Scale-Free” graphs

SLIDE 4

Parallel breadth-first search

Level synchronous algorithms

  • Processing of level N+1 begins only after processing of level N is complete (see the sketch below)

[Figure: example graph with vertices 1–6 processed level by level]
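
As an illustration, here is a minimal serial level-synchronous BFS sketch in C++ (a simplified single-node view, not the author’s MPI/OpenMP code): the frontier of level N is expanded completely before any vertex of level N+1 is processed.

#include <vector>

// dist[v] holds the BFS level of v, or -1 if v has not been reached yet.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{source};          // vertices of the current level
    dist[source] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;                  // vertices of level + 1
        for (int v : frontier)                  // expand the whole current level first
            for (int u : adj[v])
                if (dist[u] == -1) {
                    dist[u] = level + 1;
                    next.push_back(u);
                }
        frontier.swap(next);                    // only now move on to the next level
        ++level;
    }
    return dist;
}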

SLIDE 5

Obstacles for efficient parallel implementation

  • Data transfer problem
  • Graph marking problem

SLIDE 6

Data transfer problem

Problem description

  • Real-world graphs have irregular memory access patterns
  • Graphs with small diameter incur large overheads from data transfer over the interconnect network

Suggested solution

  • Combine different types of level synchronous algorithms

SLIDE 7

Level synchronous algorithms

Top-down traversal

  • The traditional way to implement breadth-first search
  • Each active (frontier) vertex checks all of its neighbors

Bottom-up traversal

  • Each inactive vertex looks for an active vertex in its neighbor list
  • S. Beamer, K. Asanovic, and D. A. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012

SLIDE 8

Top-down breadth-first search

for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = level
            for each neighb in vert.neighbors
                if neighb in V.this_node
                    if dist[neighb] = -1
                        dist[neighb] ← level + 1
                        pred[neighb] ← vert
                else
                    vert_batch_to_send.push(neighb)
    send(vert_batch_to_send)
    receive(vert_batch_to_receive)
    parallel for each vert in vert_batch_to_receive
        if dist[vert] = -1
            dist[vert] ← level + 1
            pred[vert] ← vert.pred
    level++
while(!check_end())

SLIDE 9

Top-down breadth-first search

Data transfers are described by a data transfer matrix

  • b_jk – amount of data (bytes) to transfer from process j to process k

[Figure: example data transfer matrix between processes]

SLIDE 10

Top-down breadth-first search

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.83 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 11

Bottom-up breadth-first search

for all u in dist
    dist[u] ← -1
dist[s] ← 0
level ← 0
do
    parallel for each vert in V.this_node
        if dist[vert] = -1
            for each neighb in vert.neighbors
                if bitmap_current.neighb = 1
                    dist[vert] ← level + 1
                    pred[vert] ← neighb
                    bitmap_next.vert ← 1
                    break
    all_gather(bitmap_next)
    swap(bitmap_current, bitmap_next)
    level++
while(!check_end())
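
For comparison with the top-down step, here is a minimal serial C++ sketch of a single bottom-up level (a simplified single-node view, not the author’s MPI code); the function name and the bitmap representation are illustrative assumptions.

#include <vector>

// One bottom-up level: every unvisited vertex scans its neighbor list and
// attaches itself to the first neighbor found on the current frontier.
void bottom_up_step(const std::vector<std::vector<int>>& adj,
                    std::vector<int>& dist, std::vector<int>& pred,
                    const std::vector<bool>& bitmap_current,
                    std::vector<bool>& bitmap_next, int level) {
    for (std::size_t v = 0; v < adj.size(); ++v) {
        if (dist[v] != -1) continue;            // already visited, skip
        for (int u : adj[v])
            if (bitmap_current[u]) {            // neighbor is on the current frontier
                dist[v] = level + 1;
                pred[v] = u;
                bitmap_next[v] = true;          // v joins the next frontier
                break;                          // one parent is enough
            }
    }
}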

SLIDE 12

Bottom-up breadth-first search

Data synchronization through collective communications

[Figure: bitmap exchange among nodes 0–3 via all_gather]

SLIDE 13

Bottom-up breadth-first search

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.001 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 14

Data transfer problem

Suggested solution – hybrid graph traversal

  • First two iterations – “top-down”
  • Next three iterations – “bottom-up”
  • All remaining iterations – “top-down” (a schedule sketched below)
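
A minimal C++ sketch of this schedule (the enum and function names are illustrative assumptions, not names from the author’s implementation):

// Pick the traversal direction purely by iteration number,
// following the schedule listed above (iterations counted from 0).
enum class Direction { TopDown, BottomUp };

Direction choose_direction(int iteration) {
    if (iteration < 2) return Direction::TopDown;   // first two iterations
    if (iteration < 5) return Direction::BottomUp;  // next three iterations
    return Direction::TopDown;                      // all remaining iterations
}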

SLIDE 15

Hybrid graph traversal

Data transfer time for every iteration of the algorithm

  • Graph with ~1 million vertices, parallelized across 4 nodes
  • Total time spent on data transfer – 0.0005 sec.

[Chart: data transfer time (sec.) per iteration, iterations 1–7]

SLIDE 16

Graph marking problem

Skewed degree distribution

  • Few vertices with a large number of in-/outgoing edges
  • Many vertices with a small number of in-/outgoing edges

SLIDE 17

Graph marking problem

CSR is one of the most popular formats for storing graph data

Row pointers: 0, 3, 6, 7, 10, 12
Column ids: 1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3

[Figure: example graph corresponding to the CSR arrays above]
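
A minimal C++ sketch of this layout (assuming the arrays above describe a 5-vertex graph with vertices numbered from 0): the neighbors of vertex v occupy positions Row pointers[v] up to, but not including, Row pointers[v+1] of the Column ids array.

#include <cstdio>
#include <vector>

int main() {
    // CSR arrays taken from this slide.
    std::vector<int> row_ptr    = {0, 3, 6, 7, 10, 12};
    std::vector<int> column_ids = {1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3};

    // Print the adjacency list of every vertex.
    for (std::size_t v = 0; v + 1 < row_ptr.size(); ++v) {
        std::printf("vertex %zu:", v);
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            std::printf(" %d", column_ids[e]);
        std::printf("\n");
    }
}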

SLIDE 18

Graph marking problem

Using CSR implies traversing the Row pointers array

  • It is not known in advance how many edges are incident to a given vertex (i.e. how many elements of the Column ids array need to be traversed)

The performance of every iteration of a level-synchronous algorithm is dominated by the processing time of the most “heavy-weight” vertex

  • Workload imbalance

SLIDE 19

Graph marking problem

Suggested solution – method of workload balancing

  • Divide the Column ids array into equal parts, each consisting of Max edges elements
  • Map every part onto the Rowstarts array and the Column ids array using an additional array, Part column
  • Every thread processes a number of edges known in advance, determined by the corresponding elements of the Part column array

SLIDE 20

Graph marking problem

Graph processing without workload balancing

SLIDE 21

Graph marking problem

Graph processing with workload balancing

  • max_edges = 4

SLIDE 22

Graph marking problem

Filling the Part column array

parallel for i in V.this_node
    first ← V.this_node[i]
    last ← V.this_node[i+1]
    index ← round_up(first/max_edges)
    current ← index*max_edges
    while(current < last)
        part_column[index] ← i
        current += max_edges
        index++
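
A C++ sketch of the same fill (written sequentially here, while the version above is a parallel loop), using the Row pointers from slide 17 and max_edges = 4 as on slide 21; with 12 edges this yields three parts and part_column = {0, 1, 3}.

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> row_ptr = {0, 3, 6, 7, 10, 12};         // Row pointers (slide 17)
    const int max_edges = 4;                                 // part size (slide 21)
    const int num_edges = row_ptr.back();
    std::vector<int> part_column((num_edges + max_edges - 1) / max_edges);

    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        const int first = row_ptr[i];                        // first edge of vertex i
        const int last  = row_ptr[i + 1];                    // one past its last edge
        int index = (first + max_edges - 1) / max_edges;     // round_up(first / max_edges)
        for (int current = index * max_edges; current < last; current += max_edges)
            part_column[index++] = static_cast<int>(i);      // part 'index' starts inside vertex i
    }

    for (std::size_t j = 0; j < part_column.size(); ++j)
        std::printf("part %zu starts at vertex %d\n", j, part_column[j]);
}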

SLIDE 23

Top-down breadth-first search

Main loop of level synchronous breadth-first search modified for workload balancing

// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = level
                for each k ∈ neighbors of curr_vert
                    if dist[k] = -1
                        dist[k] ← level + 1
                        pred[k] ← curr_vert
            curr_vert++
// data synchronization...

SLIDE 24

Top-down breadth-first search

Time spent on graph traversal with and without workload balancing

  • Graph with ~1 million vertices, parallelized across 4 nodes

[Chart: graph traversal time (sec.) per iteration, with and without workload balancing]

SLIDE 25

Bottom-up breadth-first search

Main loop of level synchronous breadth-first search modified for workload balancing

// preparation...
parallel for i in part_column
    first_edge ← i*max_edges
    last_edge ← (i+1)*max_edges
    curr_vert ← part_column[i]
    for each edge ∈ [first_edge; last_edge)
        if neighbors of curr_vert ∈ [first_edge; last_edge)
            if dist[curr_vert] = -1
                for each k ∈ neighbors of curr_vert
                    if bitmap_current.k = 1
                        dist[curr_vert] ← level + 1
                        pred[curr_vert] ← k
                        bitmap_next.curr_vert ← 1
                        break
            curr_vert++
// data synchronization...

SLIDE 26

Bottom-up breadth-first search

Time spent on graph traversal with and without workload balancing

  • Graph with ~1 million vertices, parallelized across 4 nodes

[Chart: graph traversal time (sec.) per iteration, with and without workload balancing]

SLIDE 27

Combining methods

Both methods can be used together to maximize the performance of the breadth-first search algorithm

  • Method of workload balancing – to reduce the time spent on graph processing in each iteration
  • Hybrid traversal – to reduce data transfer overheads in the data synchronization step of every iteration

SLIDE 28

Benchmarking

All methods are integrated into a custom implementation of the Graph500 benchmark

Measure performance of the custom implementation for various numbers of nodes

  • 1, 2, 4, 8 nodes of “Uran” supercomputer
  • CPU Intel Xeon X5675, 192 GB DRAM
  • “Scale” varies from 20 to 25

Compare custom implementation with reference Graph500 implementations

  • Simple implementation
  • Replicated implementation

Performance metric – speed of graph traversal

  • Measured in Traversed Edges Per Second (TEPS)
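
As a rough sketch of the metric definition (not the full Graph500 reporting procedure, which aggregates many BFS runs):

    \mathrm{TEPS} = \frac{\text{number of edges traversed by one BFS}}{\text{BFS run time}},
    \qquad \mathrm{MTEPS} = \mathrm{TEPS} / 10^{6}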

SLIDE 29

Results (1 node)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 30

Results (2 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 31

Results (4 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 32

Results (8 nodes)

[Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

SLIDE 33

Results

Combining the workload balancing and traversal hybridization methods improves the performance of parallel breadth-first search

The custom implementation has potential for further parallelization

SLIDE 34

Conclusion

The method of workload balancing helps to reduce overheads connected with graph processing

The method of traversal hybridization helps to reduce overheads connected with data transfer on every iteration

Future plans

  • Investigate the scalability of the developed algorithm
  • Modify the custom implementation to use performance accelerators and coprocessors

SLIDE 35

Questions?
