Parallel Eigensolver for Graph Spectral Analysis on GPU - Yimin Liu (PowerPoint PPT Presentation)



SLIDE 1

15-618 Final Project

Parallel Eigensolver for Graph Spectral Analysis on GPU

Yimin Liu  yiminliu@andrew.cmu.edu
Heran Lin  lin1@andrew.cmu.edu

Carnegie Mellon University

May 11, 2015

SLIDE 2

Overview

◮ Undirected graph G = (V, E)
◮ Symmetric square matrix M associated with graph G (adjacency matrix A, graph Laplacian L, etc.)
◮ Eigenvalues of M encode interesting properties of the graph:

Mx = λx
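For a concrete instance of Mx = λx (a minimal pure-Python sketch, not part of the original slides): take the triangle graph K3 and its graph Laplacian L = D - A, which has eigenvalue 3 with eigenvector (1, -1, 0).

```python
# Laplacian of the triangle graph K3: L = D - A (illustrative sketch).
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
n = len(A)
deg = [sum(row) for row in A]
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

def matvec(M, x):
    """Dense matrix-vector product M x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# x = (1, -1, 0) is an eigenvector of L with eigenvalue 3: L x = 3 x.
x = [1.0, -1.0, 0.0]
print(matvec(L, x))  # [3.0, -3.0, 0.0]
```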

SLIDE 3

Eigendecomposition Overview

◮ Transform M to a symmetric tridiagonal matrix Tm ⇒ Lanczos
◮ Calculate eigenvalues of Tm ⇒ (easy)
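The slides leave the second step as "(easy)". One standard way to get eigenvalues of a symmetric tridiagonal matrix (an assumption here; the deck does not say which method the project used) is bisection driven by Sturm sequence counts:

```python
def count_below(a, b, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix
    (diagonal a, off-diagonal b) strictly less than x, counted via
    the signs of the pivots of T - x*I (Sylvester's law of inertia)."""
    count, d = 0, 1.0
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i > 0 else 0.0
        d = a[i] - x - off / d
        if d == 0.0:
            d = 1e-300  # tiny perturbation to avoid division by zero
        if d < 0:
            count += 1
    return count

def kth_eigenvalue(a, b, k, lo=-1e6, hi=1e6, tol=1e-10):
    """k-th smallest eigenvalue (0-based) by bisection on count_below."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(a, b, mid) <= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# T = [[2, 1], [1, 2]] has eigenvalues 1 and 3.
print(kth_eigenvalue([2.0, 2.0], [1.0], 0))  # ~1.0
print(kth_eigenvalue([2.0, 2.0], [1.0], 1))  # ~3.0
```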

SLIDE 4

The Lanczos Algorithm for Tridiagonalization

Tm =
    ⎡ α1  β2            ⎤
    ⎢ β2  α2   ⋱        ⎥
    ⎢      ⋱   ⋱    βm  ⎥
    ⎣          βm   αm  ⎦

  • 1. v0 ← 0, v1 ← random vector with ‖v1‖2 = 1, β1 ← 0
  • 2. for j = 1, . . . , m

◮ wj ← M vj
◮ αj ← wj⊤ vj
◮ wj ← wj − αj vj − βj vj−1
◮ βj+1 ← ‖wj‖2
◮ vj+1 ← wj / βj+1

Potential parallelism for CUDA: matrix-vector product, dot-product, SAXPY
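The loop above can be sketched serially in pure Python (dense M for clarity; on the GPU, the matrix-vector product, dot products, and SAXPY updates are exactly the parallel kernels named above):

```python
import math
import random

def lanczos(M, m, seed=0):
    """Lanczos tridiagonalization of a symmetric matrix M (dense,
    list-of-lists). Returns the diagonal alpha[1..m] and off-diagonal
    beta[2..m] of T_m. Serial sketch of the slide's pseudocode."""
    n = len(M)
    rng = random.Random(seed)
    v_prev = [0.0] * n                        # v_0
    v = [rng.gauss(0, 1) for _ in range(n)]   # v_1, normalized below
    nrm = math.sqrt(sum(x * x for x in v))
    v = [x / nrm for x in v]
    beta_j = 0.0                              # beta_1
    alphas, betas = [], []
    for j in range(m):
        w = [sum(M[i][k] * v[k] for k in range(n)) for i in range(n)]  # w = M v_j
        alpha = sum(wi * vi for wi, vi in zip(w, v))                   # alpha_j = w^T v_j
        w = [wi - alpha * vi - beta_j * pi
             for wi, vi, pi in zip(w, v, v_prev)]                      # orthogonalize
        beta_next = math.sqrt(sum(x * x for x in w))                   # beta_{j+1} = ||w||_2
        alphas.append(alpha)
        if j + 1 < m:
            betas.append(beta_next)
            v_prev, v = v, [x / beta_next for x in w]                  # v_{j+1}
            beta_j = beta_next
    return alphas, betas

# With m = n, T_m is orthogonally similar to M, so the traces agree.
M = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
a, b = lanczos(M, 3)
print(sum(a))  # ~trace(M) = 7.0
```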

SLIDE 5

Challenges

Characteristics of M:

◮ Very sparse
◮ Skewed distribution of non-zero elements
  ◮ Example: power-law node degree distribution in social networks

SLIDE 6

Compressed Sparse Row (CSR) Matrix-Vector Multiplication (SPMV)

[Figure: CSR layout of rows 0, 1, 2, ... with value and column-index arrays; result = M × x]
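In CSR, row i's nonzeros sit in vals[row_ptr[i]:row_ptr[i+1]], with the matching entries of col_idx giving their columns. A serial reference SPMV along those lines (a sketch, not the project's kernel):

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = M x for M in CSR format: row i's nonzeros are
    vals[row_ptr[i]:row_ptr[i+1]] at columns col_idx[...]."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [1, 0, 3]]  times x = (1, 1, 1)
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 1.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 4.0]
```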

SLIDE 7

Naive Work Assignment

[Figure: threads 0, 1, 2, ... each computing one row of the result]

◮ Each thread is responsible for one row
◮ Work imbalance issues

SLIDE 8

Warp-based Work Assignment

[Figure: warps 0, 1, 2, ... each computing one row of the result via partial sums]

◮ Each warp (32 threads) is responsible for one row
◮ Reduce partial sums in shared memory
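A pure-Python simulation of this access pattern (illustrative only; the lane stride and reduction shape are assumptions, not taken from the project's kernel): each of the 32 lanes sums a strided slice of the row's nonzeros, then the partial sums are tree-reduced as a shared-memory reduction would be.

```python
WARP_SIZE = 32  # threads per warp

def warp_row_dot(row_vals, row_cols, x):
    """Simulate one warp computing one CSR row: lane t handles
    nonzeros t, t+32, t+64, ..., then the 32 partial sums are
    tree-reduced (done in shared memory on the GPU)."""
    partial = [0.0] * WARP_SIZE
    for t in range(WARP_SIZE):                        # each lane
        for k in range(t, len(row_vals), WARP_SIZE):  # strided slice
            partial[t] += row_vals[k] * x[row_cols[k]]
    step = WARP_SIZE // 2
    while step > 0:                                   # tree reduction
        for t in range(step):
            partial[t] += partial[t + step]
        step //= 2
    return partial[0]

# Row with 40 nonzeros, all 1.0, against an all-ones x: dot product is 40.
print(warp_row_dot([1.0] * 40, list(range(40)), [1.0] * 40))  # 40.0
```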

SLIDE 9

Warp-based Work Assignment for Row Groups

[Figure: warps 0 and 1 each computing a group of rows (row 0 result, row 1 result, ...)]

◮ Each warp is responsible for a group of rows
◮ Group size depends on the average row sparsity of the matrix
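The slide does not give the exact group-size rule, so the heuristic below is an assumption: pack roughly warp-size / (average nonzeros per row) rows into each warp, so all 32 lanes stay busy even when rows are very sparse.

```python
WARP_SIZE = 32  # threads per warp

def rows_per_warp(row_ptr):
    """Pick how many rows each warp handles from the average row
    sparsity of the CSR matrix (illustrative heuristic; the exact
    rule used in the project is not given on the slide)."""
    n = len(row_ptr) - 1
    avg_nnz = row_ptr[-1] / max(n, 1)
    # Dense rows: one row per warp. Sparse rows: group several rows
    # per warp so the 32 lanes are not mostly idle.
    return max(1, int(WARP_SIZE // max(avg_nnz, 1.0)))

# ~4 nonzeros per row on average -> 8 rows per warp.
row_ptr = [0, 4, 8, 12, 16]
print(rows_per_warp(row_ptr))  # 8
```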

SLIDE 10

Evaluation Environment

Amazon Web Services EC2 g2.2xlarge

◮ NVIDIA GK104 GPU, 1,536 CUDA cores, with the CUDA 7.0 Toolkit installed
◮ Intel Xeon E5-2670 CPU, 8 cores, with gcc/g++ 4.8.2 installed, -O3 optimization switched on

Competitive reference: SPMV implementation in cuSparse (http://docs.nvidia.com/cuda/cusparse/)

Dataset: generated scale-free networks based on the Barabási-Albert model, using Python NetworkX

SLIDE 11

float SPMV Performance Similar to cuSparse

[Chart: speedup of GPU SPMV over CPU vs. graph node count (400 to 3,200 ×10³), roughly 3-9×; series: Group SPMV, cuSparse SPMV, Naive SPMV]

SLIDE 12

double SPMV Performance Better than cuSparse

[Chart: speedup of GPU SPMV over CPU vs. graph node count (400 to 3,200 ×10³), roughly 4-11×; series: Group SPMV, cuSparse SPMV, Naive SPMV]

SLIDE 13

Real-world Graphs

◮ as-Skitter: ∼1,700,000 nodes, ∼11,000,000 edges
◮ cit-Patents: ∼3,800,000 nodes, ∼17,000,000 edges

Converted to symmetric double adjacency matrices.
Data source: SNAP (http://snap.stanford.edu/data/index.html)

SLIDE 14

SPMV Better than cuSparse on Large Real-world Graphs

Speedup of GPU SPMV over CPU:

                Group SPMV   cuSparse SPMV   Naive SPMV
  as-Skitter        7.4           7.5            2.5
  cit-Patents      11.6          10.8            7.5

SLIDE 15

Faster Eigenvalue Solver on GPU

Running time of eigensolvers (sec):

                GPU Eigensolver   CPU Eigensolver
  as-Skitter          1.6                9
  cit-Patents         3.1               31.8

SLIDE 16

Discussion

SLEPc (http://slepc.upv.es)

◮ A state-of-the-art parallel CPU framework using MPI for solving sparse-matrix eigenvalue problems
◮ Took 84.9 sec to compute the 10 largest eigenvalues of the cit-Patents graph, while our CPU eigensolver took only 31.8 sec

◮ Unfair to compare?
◮ Many variants of the Lanczos algorithm
◮ Accuracy vs. performance tradeoff