Efficient Scaling Up of Parallel Graph Algorithms for Genome-Scale Biological Problems on Cray XT
Kevin Thomas, Cray Inc.
Outline
Biological networks
Graph algorithms and terminology
Implementation of a parallel graph algorithm
Optimization of single-thread performance
Lessons learned
May 2008 Cray Inc. Proprietary Slide 2
Analysis of Biological Networks
Analysis of biological networks is an increasingly used tool in biology
Numerous types of biological networks
  Gene Expression
  Protein Interaction
  Metabolic
  Phylogenetic
  Signal Transduction
Biological network analysis requires the solution of combinatorial problems
  Maximal and maximum clique
  Vertex cover
  Dominating set
  Shortest path
Biological Applications of Maximal Clique Enumeration
[Diagram: MCE at the center, linked to its application areas]
  Structural Alignment
  Gene Expression
  Functional Protein Relationships
  Tertiary Structure
  Genome Mapping
Graphs and Cliques
Graphs are composed of vertices connected by edges
A clique is a set of vertices which are pairwise connected
A maximal clique cannot include any additional vertex and still remain a clique
  (a,c,d,e) is a maximal clique
[Figure: example graph on vertices a, b, c, d, e]
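The clique definitions can be checked mechanically. The following is a minimal Python sketch, not code from the presentation. The edge set is an assumption read off the example graph on vertices a to e (it agrees with the adjacency list shown later in the slides), and the function names are illustrative.

```python
from itertools import combinations

# Example graph; the edge set is inferred from the slides' adjacency list.
VERTICES = {"a", "b", "c", "d", "e"}
EDGES = {("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "d"), ("c", "d"), ("c", "e"), ("d", "e")}

def adjacent(u, v):
    """True if u and v are joined by an edge (the graph is undirected)."""
    return (u, v) in EDGES or (v, u) in EDGES

def is_clique(vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(adjacent(u, v) for u, v in combinations(vertices, 2))

def is_maximal_clique(vertices):
    """True if the set is a clique and no other vertex can be added to it."""
    vs = set(vertices)
    if not is_clique(vs):
        return False
    return not any(is_clique(vs | {w}) for w in VERTICES - vs)
```

With this graph, `is_maximal_clique("acde")` is true, while `is_maximal_clique("acd")` is false because e can still be added.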
Maximal Clique Enumeration
Finding all of the maximal cliques of a graph
  (a,b,d)
  (a,c,d,e)
[Figure: example graph on vertices a, b, c, d, e]
Maximal Clique Enumeration
Brute Force Search
Maximal Clique Enumeration
Applying a backtracking algorithm results in a search tree
[Figure: backtracking search tree over vertices a to e, shown next to the example graph]
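The slides do not name the backtracking algorithm that was used. As an illustration only, here is a minimal Python sketch of the well-known Bron-Kerbosch scheme without pivoting, which is one backtracking approach to maximal clique enumeration. Each recursive call corresponds to a node of a search tree. The adjacency sets are inferred from the example graph, and all names are hypothetical.

```python
# Adjacency sets for the example graph (inferred from the slides).
ADJ = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"a", "b", "c", "e"},
    "e": {"a", "c", "d"},
}

def enumerate_maximal_cliques(adj):
    """Return all maximal cliques as sorted tuples (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(path, candidates, excluded):
        # path: the clique built so far
        # candidates: vertices that can still extend the path
        # excluded: vertices already tried that are adjacent to the whole path
        if not candidates and not excluded:
            cliques.append(tuple(sorted(path)))  # nothing can extend it: maximal
            return
        for v in sorted(candidates):
            expand(path | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}  # backtrack: v is done
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

print(enumerate_maximal_cliques(ADJ))
# prints [('a', 'b', 'd'), ('a', 'c', 'd', 'e')]
```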
Parallel Maximal Clique Enumeration
The search tree is divided into independent sub-trees
Unexplored sub-trees are represented as candidate paths
The candidate paths are placed into per-thread work pools
[Figure: four per-thread work pools, Thread 1 to Thread 4, each holding three candidate paths]
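One way to picture a candidate path is as a small record that carries everything needed to explore a sub-tree on its own. The Python sketch below is an assumption, not the presentation's data layout: it encodes each top-level sub-tree as a `(path, candidates, excluded)` triple and deals the triples out round-robin to per-thread pools. The graph is the example graph from the slides, and the function names are hypothetical.

```python
# Adjacency sets for the example graph (inferred from the slides).
ADJ = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"a", "b", "c", "e"},
    "e": {"a", "c", "d"},
}

def initial_candidate_paths(adj):
    """Build one candidate path per top-level branch of the search tree."""
    paths = []
    candidates, excluded = set(adj), set()
    for v in sorted(adj):
        # The triple is self-contained, so any thread can expand it later.
        paths.append(((v,),
                      frozenset(candidates & adj[v]),
                      frozenset(excluded & adj[v])))
        candidates.discard(v)
        excluded.add(v)
    return paths

def distribute(paths, num_threads):
    """Place candidate paths into per-thread work pools, round-robin."""
    pools = [[] for _ in range(num_threads)]
    for i, path in enumerate(paths):
        pools[i % num_threads].append(path)
    return pools

pools = distribute(initial_candidate_paths(ADJ), num_threads=4)
```

Because each triple is independent of the others, the pools can be processed without coordination until one of them runs dry.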
Load Balancing
The work pools can become unbalanced over time
[Figure: work pools of Threads 1 to 4 holding unequal numbers of candidate paths]
Dynamic load balancing through work stealing
Two levels of load balancing
Thread level
  Used when one thread of a process becomes idle
  Balances work within a single process
  Each thread acts on its own to steal work from other threads
  Locks are used to prevent race conditions
Process level
  Used when all threads of a process become idle
  Local master thread sends a request to another process
  Remote master thread responds to the request
  Master thread must poll for incoming requests while performing the main computation
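Thread-level stealing with locks can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the presentation's implementation: the `WorkPool` class, the steal-half policy, and the busiest-victim-first order are all hypothetical choices.

```python
import threading
from collections import deque

class WorkPool:
    """A per-thread pool of candidate paths, guarded by its own lock."""

    def __init__(self, items=()):
        self._items = deque(items)
        self._lock = threading.Lock()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self):
        """Owner takes work from the back; returns None when the pool is empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def steal_half(self):
        """A thief takes half of the pool (rounded down) from the front."""
        with self._lock:
            count = len(self._items) // 2
            return [self._items.popleft() for _ in range(count)]

    def __len__(self):
        with self._lock:
            return len(self._items)

def steal(pools, thief):
    """An idle thread moves work from the fullest other pool into its own.

    Returns the number of items stolen (0 if nobody had work to share).
    """
    victims = sorted((p for i, p in enumerate(pools) if i != thief),
                     key=len, reverse=True)
    for victim in victims:
        loot = victim.steal_half()
        if loot:
            for item in loot:
                pools[thief].push(item)
            return len(loot)
    return 0
```

The lock is held only while items are removed or added, so a thief and an owner never modify the same pool at the same time. A pool holding a single item gives nothing away under this policy, which leaves the owner's current work alone.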
Load balancing between processes
[Figure: Process 1 (two threads, three candidate paths each) and Process 2 (two threads, one candidate path each) exchanging a Request and a Response]
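The request and response exchange can be simulated in a single Python program. This sketch replaces real message passing with in-memory mailboxes, so it only illustrates the polling pattern. The `Process` class, the message tuples, and the share-half policy are assumptions, not details from the presentation.

```python
from collections import deque

class Process:
    """Stand-in for one process: a work pool plus a mailbox for messages."""

    def __init__(self, rank, work=()):
        self.rank = rank
        self.work = deque(work)
        self.inbox = deque()

    def request_work(self, procs, victim):
        """An idle process asks another process for work."""
        procs[victim].inbox.append(("request", self.rank, None))

    def poll(self, procs):
        """Handle pending messages; called between units of main computation."""
        while self.inbox:
            kind, sender, payload = self.inbox.popleft()
            if kind == "request":
                # Share half of the local work (possibly none) with the requester.
                count = len(self.work) // 2
                share = [self.work.popleft() for _ in range(count)]
                procs[sender].inbox.append(("response", self.rank, share))
            elif kind == "response":
                self.work.extend(payload)

procs = [Process(0, ["p1", "p2", "p3", "p4", "p5", "p6"]), Process(1)]
procs[1].request_work(procs, victim=0)  # process 1 is idle
procs[0].poll(procs)                    # process 0 answers while computing
procs[1].poll(procs)                    # process 1 receives the shared work
```

An empty response tells the requester that the other process had nothing to share, which is the case the termination logic has to detect.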
Termination
Process-level load balancing attempts are made until all processes have been checked
When no process has work to share, the idle state is entered
To synchronize globally, an idle notification is sent to each process
When all processes are idle, the job can terminate
2(N-1)^2 messages are required for termination
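To get a feel for how the termination cost grows, the message count can be evaluated for a few job sizes. This small Python sketch assumes the count is 2(N-1)^2 with N the number of processes; the slides do not define N, and the function name is hypothetical.

```python
def termination_messages(num_processes):
    """Messages needed for termination, assuming the count is 2(N-1)^2."""
    return 2 * (num_processes - 1) ** 2

for n in (2, 64, 2048):
    print(n, termination_messages(n))
# prints:
# 2 2
# 64 7938
# 2048 8380418
```

The quadratic growth means the termination phase becomes more expensive as the job is scaled out.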
Adjacency Test – Linear List
An important MCE operation is testing two vertices for adjacency
Graph representation uses a vertex adjacency list
  Each vertex has a list of adjacent vertices
  An adjacency test requires a list traversal
  A linked list is easy to build, but slow to search
  A linear list (array) is faster to search
[Figure: example graph with its adjacency list]
  a: b c d e
  b: a d
  c: a d e
  d: a b c e
  e: a c d
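A linear-list adjacency test amounts to scanning one vertex's array of neighbors. Here is a minimal Python sketch using the example graph; Python lists stand in for the arrays, and the function name is illustrative.

```python
# Adjacency list of the example graph: one array of neighbors per vertex.
ADJ_LIST = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "d"],
    "c": ["a", "d", "e"],
    "d": ["a", "b", "c", "e"],
    "e": ["a", "c", "d"],
}

def adjacent_linear(adj_list, u, v):
    """Scan u's neighbor array for v; cost grows with the degree of u."""
    for w in adj_list[u]:
        if w == v:
            return True
    return False
```

For example, `adjacent_linear(ADJ_LIST, "b", "d")` is true after two comparisons, while `adjacent_linear(ADJ_LIST, "b", "c")` has to traverse the whole list before returning false.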
Adjacency Test – Bit Matrix
Adjacency bit matrix has a fast, constant time lookup
Memory requirement is N^2
[Figure: example graph with its adjacency bit matrix]
     a  b  c  d  e
  a  -  1  1  1  1
  b  1  -  0  1  0
  c  1  0  -  1  1
  d  1  1  1  -  1
  e  1  0  1  1  -
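A bit matrix can be packed into a flat byte array so that each adjacency test is a single index computation and bit mask. The Python sketch below is an illustration under assumptions: vertices a to e are numbered 0 to 4, and the `BitMatrix` class and its row-major packing are hypothetical, not the presentation's layout.

```python
class BitMatrix:
    """N x N adjacency matrix stored one bit per entry, in row-major order."""

    def __init__(self, n):
        self.n = n
        self.bits = bytearray((n * n + 7) // 8)  # N^2 bits, rounded up to bytes

    def set_edge(self, u, v):
        """Mark the undirected edge u-v by setting both (u, v) and (v, u)."""
        for i, j in ((u, v), (v, u)):
            k = i * self.n + j
            self.bits[k >> 3] |= 1 << (k & 7)

    def adjacent(self, u, v):
        """Constant-time lookup: one index computation and one bit test."""
        k = u * self.n + v
        return bool(self.bits[k >> 3] & (1 << (k & 7)))

# Example graph with a..e numbered 0..4.
matrix = BitMatrix(5)
for u, v in [(0, 1), (0, 2), (0, 3), (0, 4), (1, 3), (2, 3), (2, 4), (3, 4)]:
    matrix.set_edge(u, v)
```

For five vertices this takes 25 bits, stored in 4 bytes. The quadratic memory growth is the price of the constant-time test.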
Adjacency Test – Hash Table
Adjacency hash table has a fast, constant time lookup
  But not as fast as bit matrix
Memory requirement is cN (2N in this example)
Data structure is a sparse linear list
  But access is direct, through key hashing
[Figure: example graph with its adjacency hash table]
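A sparse list accessed through key hashing can be sketched with open addressing. The details here are assumptions, since the slides give only the memory bound: each vertex gets its own table with `factor` times as many slots as it has neighbors (2 to mirror the 2N example), collisions are resolved by linear probing, and vertices a to e are numbered 0 to 4.

```python
def build_table(neighbors, factor=2):
    """Store one vertex's neighbors in a sparse array using linear probing."""
    size = max(1, factor * len(neighbors))
    slots = [None] * size
    for v in neighbors:
        i = hash(v) % size
        while slots[i] is not None:  # probe for the next free slot
            i = (i + 1) % size
        slots[i] = v
    return slots

def adjacent_hash(tables, u, v):
    """Hash v into u's table and probe until v or an empty slot is found."""
    slots = tables[u]
    size = len(slots)
    i = hash(v) % size
    for _ in range(size):
        if slots[i] is None:
            return False
        if slots[i] == v:
            return True
        i = (i + 1) % size
    return False

# Example graph with a..e numbered 0..4.
ADJ = {0: [1, 2, 3, 4], 1: [0, 3], 2: [0, 3, 4], 3: [0, 1, 2, 4], 4: [0, 2, 3]}
tables = {u: build_table(nbrs) for u, nbrs in ADJ.items()}
```

With half of the slots empty, most lookups finish after one or two probes, at the cost of the extra hash computation compared with a bit matrix.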
Adjacency Test Performance Comparison
[Figure: bar chart of cliques/second (axis from 50000 to 300000) for Linear List, Hash Table, and Bit Matrix]
SMP Versus DMP Programming
[Figure: run time in seconds (axis from 64.00 to 64.80) for four configurations: 1 process with 8 threads, 2 processes with 4 threads each, 4 processes with 2 threads each, and 8 processes with 1 thread each]
Parallel Scaling on Cray XT4 quad core
At 2048 processes, compute time is 2.1 seconds
  Overhead due to message passing is 0.43 seconds
Graph contains 3472 vertices, found 2.6 billion maximal cliques
[Figure: speedup versus number of processes, both axes from 1 to 2048, comparing Ideal and pDFS]
Conclusion
Explicit decomposition at the thread level enabled easier implementation of MPI
  Independent work already identified
  Compact representation of units of work
Additional work
  Improved load balancing by grouping processes
  Parallel I/O optimization
Conclusion
Research group members
  Nagiza F. Samatova, North Carolina State University and Oak Ridge National Laboratory
  Matthew Schmidt, North Carolina State University and Oak Ridge National Laboratory
  Byung-Hoon Park, Oak Ridge National Laboratory
  Kevin Thomas, Cray Inc.
Thank you! Questions?