SLIDE 1

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Haicheng Wu¹, Daniel Zinn², Molham Aref², Sudhakar Yalamanchili¹

  • 1. Georgia Institute of Technology
  • 2. LogicBlox Inc.
SLIDE 2

System Diversity Today

Hardware Diversity is Mainstream

  • Keeneland System (GPUs)
  • Amazon EC2 GPU Instances
  • Mobile Platforms (DSP, GPUs)
  • Cray Titan (GPUs)

SLIDE 3

GPU and CUDA

[Figure: CUDA execution model. A CUDA kernel consists of Cooperative Thread Arrays (CTAs); each CTA's threads are grouped into warps (Warp 1 … Warp N) and run on a Streaming Multiprocessor (SM) with registers, ALUs, and shared memory. Annotations: coalesced DRAM access; warp divergence at a branch and reconvergence at the end of the branch.]

[Figure: CPU-GPU system. Steps: ① input data, ② launch kernel, ③ execute, ④ result. CPU (multi-core, 2-16 cores) with ~128 GB main memory; GPU (~3000 cores) with ~6 GB GPU memory. Bandwidth labels: PCI-E 16 GB/s, ~300 GB/s, ~50 GB/s.]

  • GPU is a many-core co-processor
  • 1000s of cores
  • 1000s of concurrent threads
  • Higher memory bandwidth
  • Smaller memory capacity
  • CUDA and OpenCL are the dominant programming models
  • Well suited for data-parallel apps
  • Molecular Dynamics, Options Pricing, Ray Tracing, etc.
  • Commodity: led by NVIDIA, AMD, and Intel

SLIDE 4

Relational Queries and Data Analytics

The Opportunity

  • Significant potential data parallelism

The Problem

  • Need to process 1-50 TB of data [1]
  • Small memory capacity and low PCIe bandwidth
  • Irregularity
  • Fine-grained computation
  • Data dependent
  • Low locality

[1] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

[Figure labels: Large Graphs; Applications]

SLIDE 5

The Challenge

Relational Computations Over Massive Unstructured Data Sets: Sustain 10X-100X throughput over multicore

…… LargeQty(p) <- Qty(q), q > 1000. ……

[Figure labels: Candidate Application Domains; Large Graphs; New Applications and Software Stacks; New Accelerator Architectures]

SLIDE 6

Goal: Implementation of Leapfrog Triejoin (LFTJ) on GPU

  • A worst-case optimal multi-predicate join algorithm
  • Details (e.g., complexity analysis) in T. L. Veldhuizen, ICDT 2014

Benefits

  • Smaller memory footprint and less data movement
  • No data reorganization (e.g., sorting or rebuilding a hash table) after changing the join key

Approach

  • CPU version
  • CPU-friendly GPU version
  • Customized GPU version

[Figure label: Multipredicate Join]

SLIDE 7

An Important Example – Graph Problems

Finding cliques

  • triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z.
  • 4cl(x,y,z,w) <- E(x,y), E(x,z), E(x,w), E(y,z), E(y,w), E(z,w), x<y<z<w.

[Figure: example graph on nodes 1-5 with its Edge(From, To) table]

Multi-predicate Join
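
To make the triangle rule concrete, here is a minimal brute-force sketch in Python. It is only a reference for what the query returns, not the join algorithm discussed in this talk. The function name `triangles` and the edge list `EDGES` are hypothetical and are not taken from the slide's figure.

```python
from itertools import combinations

def triangles(edges):
    """Return all (x, y, z) with x < y < z such that E(x,y), E(y,z), E(x,z) hold."""
    edge_set = set(edges)
    nodes = sorted({v for edge in edges for v in edge})
    return [
        (x, y, z)
        # combinations() over sorted, distinct nodes yields x < y < z
        for x, y, z in combinations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    ]

# Hypothetical (From, To) edge list, chosen for illustration only.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
print(triangles(EDGES))
```

This checks every node triple, so it costs O(n³) in the number of nodes; it is useful mainly as a correctness oracle for faster joins.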

SLIDE 8

Leapfrog Join (LFJ)

  • LFJ is the basis of LFTJ
  • Essentially a multi-way intersection
  • Basic primitives: seek(), next()

[Figure: leapfrog join of three sorted arrays]

    A: 0 1 3 4 5 6 7 8 9 11
    B: 0 2 6 7 8 9
    C: 2 4 5 8 10

Calls shown in order: seek(2), seek(3), seek(6), seek(8), seek(8), next(), seek(10), seek(10)

Courtesy : T. L. Veldhuizen, ICDT 2014
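
The multi-way intersection built from seek() and next() can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the class `SortedIter` and function `leapfrog_join` are hypothetical names, and seek() is realized with a forward-only binary search from the standard `bisect` module.

```python
from bisect import bisect_left

class SortedIter:
    """Cursor over a sorted, duplicate-free list (hypothetical helper)."""
    def __init__(self, keys):
        self.keys, self.pos = keys, 0
    def at_end(self):
        return self.pos >= len(self.keys)
    def key(self):
        return self.keys[self.pos]
    def next(self):
        self.pos += 1
    def seek(self, target):
        # Move forward to the least key >= target; never searches behind pos.
        self.pos = bisect_left(self.keys, target, self.pos)

def leapfrog_join(arrays):
    """Intersect several sorted arrays by repeatedly seeking to the current maximum."""
    iters = [SortedIter(a) for a in arrays]
    if any(it.at_end() for it in iters):
        return []
    iters.sort(key=lambda it: it.key())
    out, p = [], 0
    max_key = iters[-1].key()
    while True:
        it = iters[p]                 # iterator currently holding the smallest key
        if it.key() == max_key:       # smallest == largest: every iterator agrees
            out.append(max_key)
            it.next()
        else:
            it.seek(max_key)
        if it.at_end():
            return out
        max_key = it.key()
        p = (p + 1) % len(iters)

A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
print(leapfrog_join([A, B, C]))
```

On the three arrays from the slide, the only common key is 8.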

SLIDE 9

Trie Data Structure

LFTJ works on the trie data structure

[Figure: the Edge(From, To) table and the equivalent trie, with a Root, a first level of From values, and a second level of To values]
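
A two-level trie over a binary Edge(From, To) relation can be sketched as a mapping from each From value to its sorted list of To values. The function `build_trie` and the edge list below are hypothetical, chosen for illustration rather than copied from the figure.

```python
def build_trie(edges):
    """Two-level trie: first level = From values, second level = sorted To values."""
    trie = {}
    # Sorting the deduplicated pairs makes both levels sorted and unique.
    for src, dst in sorted(set(edges)):
        trie.setdefault(src, []).append(dst)
    return trie

# Hypothetical (From, To) edge list.
EDGES = [(2, 4), (1, 2), (3, 5), (1, 3), (2, 3), (3, 4)]
print(build_trie(EDGES))
```

Keeping every level sorted is what lets seek() be implemented as a binary search.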

SLIDE 10

LFTJ Algorithm – join 3 tries

[Figure, repeated on slides 10-25: three copies of the edge trie, one each for E(x,y), E(x,z), and E(y,z), aligned with the variable levels x, y, z; each slide highlights the current step]

SLIDE 11

LFTJ Algorithm – open() level x

SLIDE 12

LFTJ Algorithm – seek(0) in E(x,z) level x

SLIDE 13

LFTJ Algorithm – open() level y

SLIDE 14

LFTJ Algorithm – seek(1) in E(y,z) level y

SLIDE 15

LFTJ Algorithm – open() level z

SLIDE 16

LFTJ Algorithm – seek(2) in E(x,z) level z and failed

SLIDE 17

LFTJ Algorithm – up() to level y

SLIDE 18

LFTJ Algorithm – up() to level x

SLIDE 19

LFTJ Algorithm – seek(1) in E(x,z) level x

SLIDE 20

LFTJ Algorithm – open() level y

SLIDE 21

LFTJ Algorithm – seek(2) in E(y,z) level y

SLIDE 22

LFTJ Algorithm – open() level z

SLIDE 23

LFTJ Algorithm – seek(3) in E(x,z) level z

SLIDE 24

LFTJ Algorithm – next()

SLIDE 25

LFTJ Algorithm – final result
SLIDE 26

LFTJ Algorithm – short conclusion

  • Very simple set of primitives to implement
  • A sequential algorithm
  • Traverses the trie in depth-first order
  • Two methods for applying this technique with GPUs
      • CPU algorithm per GPU thread
      • Customized data-parallel application
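
The depth-first traversal for the triangle query can be sketched with three nested levels, one per variable. This is a simplified Python illustration, not the iterator-based open()/up() formulation: the names `intersect` and `lftj_triangles` are hypothetical, the edge list is made up, and the sketch assumes only edges with From < To are kept so that x < y < z holds automatically.

```python
from bisect import bisect_left

def intersect(a, b):
    """Leapfrog-style intersection of two sorted, duplicate-free lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i)   # seek() forward in a
        else:
            j = bisect_left(b, a[i], j)   # seek() forward in b
    return out

def lftj_triangles(edges):
    """Depth-first triangle listing over the variable order x, y, z."""
    adj = {}
    for src, dst in sorted(set(edges)):
        if src < dst:                     # assumption: keep From < To only
            adj.setdefault(src, []).append(dst)
    xs = sorted(adj)                      # top trie level (From values)
    result = []
    for x in xs:                                   # level x
        for y in intersect(adj[x], xs):            # level y: E(x,y) and E(y,_)
            for z in intersect(adj[x], adj[y]):    # level z: E(x,z) and E(y,z)
                result.append((x, y, z))
    return result

# Hypothetical (From, To) edge list.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
print(lftj_triangles(EDGES))
```

Each loop level only ever holds one candidate binding at a time, which is why the depth-first form has a small memory footprint.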

SLIDE 27

LFTJ-GPU: First Algorithm

  • Evenly map the top level of the leftmost trie to GPU threads
  • Run sequential LFTJ in each GPU thread
  • seek() is implemented as binary search
      • Data-dependent control flow
      • No spatial or temporal locality

[Figure: example mapping to 2 GPU threads; the top level of the leftmost trie is split between threads t0 and t1]
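
The even mapping of top-level values to threads can be sketched as a contiguous split. This is a hedged Python illustration of the partitioning step only; `split_evenly` is a hypothetical helper, and a real GPU kernel would derive each thread's range from its thread index instead of building lists.

```python
def split_evenly(keys, num_workers):
    """Split keys into num_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(keys), num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)   # first `extra` workers get one more
        chunks.append(keys[start:start + size])
        start += size
    return chunks

# Hypothetical top-level values, split across two threads t0 and t1.
top_level = [1, 2, 3]
chunks = split_evenly(top_level, 2)
print(chunks)
```

An even split of top-level values does not imply an even split of work, because the subtrees under each value can differ in size.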

SLIDE 28

LFTJ-GPU: Optimizations and CPU Variant

  • Set the current level (e.g., x) as a template parameter to avoid branching
  • Reduce the search scope of binary searches
  • Put a search tree (similar to a B-tree) in shared memory
      • The first several lookups of each binary search run in shared memory
      • 26% improvement in triangle; 3.2x improvement in 4-clique
  • Amenable to a multi-threaded CPU implementation
      • Simply replace GPU threads with CPU threads
      • Referred to as LFTJ-CPU

[Figure: Value / UpperBound table illustrating the reduced search scope; a value whose neighbors have upper bounds 10 and 20 must itself have an upper bound between 10 and 20]
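
One possible reading of the reduced search scope is that, for sorted query values, the result of each binary search is bracketed by the results of its neighbors. The Python sketch below illustrates that idea under this assumption; `upper_bounds` is a hypothetical name and the arrays are made up.

```python
from bisect import bisect_right

def upper_bounds(haystack, queries):
    """For sorted queries, compute bisect_right(haystack, q) for each q,
    restricting every search to the range bracketed by already-solved neighbors."""
    res = [0] * len(queries)

    def solve(q_lo, q_hi, lo, hi):
        if q_lo > q_hi:
            return
        mid = (q_lo + q_hi) // 2
        # The answer for queries[mid] must lie in [lo, hi].
        res[mid] = bisect_right(haystack, queries[mid], lo, hi)
        solve(q_lo, mid - 1, lo, res[mid])    # smaller queries end no later
        solve(mid + 1, q_hi, res[mid], hi)    # larger queries start no earlier

    solve(0, len(queries) - 1, 0, len(haystack))
    return res

haystack = [1, 3, 3, 7, 9, 12, 15, 20]
queries = [2, 3, 4, 8, 20]
print(upper_bounds(haystack, queries))
```

The narrower the bracket, the fewer probes each search needs, which matters when every probe is an uncoalesced memory access.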

SLIDE 29

Optimized-GPU: Second Algorithm

  • Optimized for
      • Load balance
      • Better memory access pattern
  • Change from depth-first order to breadth-first order
  • Divide the algorithm into three smaller problems
      • Tree node expansion
      • Parallel array intersection
      • Filtering

[Figure annotations: "Hot GPGPU research topic"; "Well-known in GPGPU"]

SLIDE 30

Optimized-GPU: Two APIs from ModernGPU library

  • Vectorized Sorted Search & Load Balancing Search
  • ModernGPU is designed by S. Baxter
  • Based on the Merge-Path* framework to balance workload between CTAs and threads
  • Optimized for coalesced memory access

* O. Green, et al. GPU merge path: A GPU merging algorithm. ICS 2012.

SLIDE 31

Optimized-GPU: Data Structure

Similar to CSR

[Figure: the edge trie flattened into a val array and a ptr array of offsets, with positions numbered 1-9]

  • Children of the same parent are sorted and unique
  • Children of different parents may not be sorted or unique
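
A CSR-like flattening of the two-level trie can be sketched as three arrays. The exact layout in the figure is not recoverable from this transcript, so the layout below is an assumption: `keys` holds the parent values, `vals` holds all children back to back, and `ptr[i]:ptr[i+1]` delimits the children of `keys[i]`. The function `to_csr` and the edge list are hypothetical.

```python
def to_csr(edges):
    """Flatten a binary relation into (keys, ptr, vals) in a CSR-like layout."""
    keys, ptr, vals = [], [], []
    for src, dst in sorted(set(edges)):
        if not keys or keys[-1] != src:
            keys.append(src)
            ptr.append(len(vals))     # offset where this parent's children start
        vals.append(dst)
    ptr.append(len(vals))             # sentinel: end of the last segment
    return keys, ptr, vals

# Hypothetical (From, To) edge list.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
keys, ptr, vals = to_csr(EDGES)
print(keys, ptr, vals)

# Children of the i-th parent form one contiguous, sorted segment.
children_of_second = vals[ptr[1]:ptr[2]]
```

Here `vals` as a whole contains repeated values across segments, which matches the remark that children of different parents need not be unique.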

SLIDE 32

Optimized-GPU: Intersect level by level

[Figure: the tries for E(x,y), E(x,z), and E(y,z) intersected one level (x, y, z) at a time; annotations: "sorted intersects sorted", "sorted intersects not sorted", "not sorted intersects not sorted"]

SLIDE 33

Optimized-LFTJ: Algorithm

  • Process layer by layer from top to bottom
  • In each layer
      • Intersect all sorted arrays (simplest)
          • Simple Set Intersection
      • Intersect all unsorted arrays
          • Segmented Intersection
      • Intersect the above two results (heaviest)
          • Binary Search

[Figure annotation: "avoid using binary searches"]
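
The layer-by-layer order can be sketched by materializing every candidate of one level before moving to the next. This Python sketch only illustrates the breadth-first structure for the triangle query: it uses built-in set intersection in place of the sorted-search and segmented-intersection primitives named on the slide, and `bfs_triangles` and the adjacency map are hypothetical.

```python
def bfs_triangles(adj):
    """Breadth-first triangle listing; adj maps x to its sorted list of larger neighbors."""
    # Layer x: every top-level value is a candidate.
    level_x = sorted(adj)
    # Layer y: expand each x to its children, then filter to y values that have children.
    level_y = [(x, y) for x in level_x for y in adj[x] if y in adj]
    # Layer z: one intersection per (x, y) segment.
    level_z = [
        (x, y, z)
        for x, y in level_y
        for z in sorted(set(adj[x]) & set(adj[y]))
    ]
    return level_z

# Hypothetical adjacency map with From < To.
ADJ = {1: [2, 3], 2: [3, 4], 3: [4, 5]}
print(bfs_triangles(ADJ))
```

Each layer is a flat list, so all of its entries can be processed independently, at the cost of holding the whole intermediate layer in memory.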

SLIDE 34

Optimized-LFTJ: Compared with Other Algorithms

Compared with GPU-LFTJ

  • Depth-first order => breadth-first order
  • GPU threads collaborate in intersections
  • Fewer binary searches: at most 1 per layer
  • Larger memory footprint: a tradeoff between time and space

Compared with pair-wise joins

  • No sorting
  • No huge temporary results

SLIDE 35

Experimental Environment

    CPU       Intel i7-4771 @ 3.50GHz
    GPU       GeForce GTX Titan (2688 cores, $1000 USD)
    PCIe      3.0 x16
    OS        Ubuntu 12.04
    G++/GCC   4.6
    NVCC      6.0
    Thrust    1.7

SLIDE 36

Evaluated Graphs

  • 10K to 100M edges
      • Fits in GPU memory
  • Node Id: 64-bit random number; NumNode is set per query
      • Triangle: NumNode = NumEdge
      • 4-Clique: NumNode = NumEdge^(3/4)
  • Results are very sparse
      • Fewer than 4 cliques are found

[Figure annotation: "Larger graph has larger node degree"]

SLIDE 37

Evaluated Algorithms

  • GPU-LFTJ: First Algorithm
  • Optimized-GPU: Second Algorithm
  • CPU-LFTJ: CPU variant of GPU-LFTJ
  • Red Fox: runs regular pairwise sort-merge joins from the ModernGPU library

SLIDE 38

Overall Performance

[Figure: two panels, Triangle and 4-Clique. x-axis: Edge Number (10K to 100M); y-axis: Million Edges/sec. Series: GPU-Optimized, LFTJ-GPU, LFTJ-CPU, Redfox.]

  • Optimized-GPU is fastest
  • GPU-LFTJ can run much larger problems

SLIDE 39

Reason Behind the Performance

Optimized-GPU vs. LFTJ-GPU

  • Fewer binary searches
      • Optimized-GPU spends 40% (triangle) or 7% (4-clique) in binary searches
  • Better load balance
      • Over 94% warp execution efficiency
  • Better memory access pattern
      • ld/st replay of Optimized-GPU is only 4.6 (triangle) or 0.6 (4-clique)
      • ld/st replay of binary search is 19

Optimized-GPU vs. Red Fox

  • No sorting
      • Sorting time of Red Fox is more than the overall time of Optimized-GPU

LFTJ-GPU vs. LFTJ-CPU = GPU vs. CPU

SLIDE 40

Conclusion

GPU-LFTJ

  • Simple to implement
  • Easy to integrate into existing systems
  • Reasonable performance
  • Much smaller memory footprint

Optimized-GPU

  • Better performance
  • Larger memory footprint
  • Sophisticated traditional GPGPU program
  • Room to improve
      • Better expansion and intersection algorithms
      • Fusing small CUDA kernels
      • Completely removing binary searches

SLIDE 41

Out-of-Core Support

  • The in-core algorithm is the building block
  • Pipeline the execution with PCIe transfers
  • Currently, throughput is smaller than PCIe bandwidth
      • Out-of-core performance is determined by the in-core algorithm
  • Ideally, push the performance to be PCIe-bound
      • GPU computation can be completely hidden by PCIe transfers
      • ~10GB/s throughput

SLIDE 42

The Future is Acceleration

[Figure label: Large Graphs. Image credits: topnews.net.tz, Waterexchange.com]

Thank You

SLIDE 43

Results so far and Things to do

[Figure: tries for E(x,y), E(x,z), and E(y,z) over levels x, y, z, marking the "Results so far" and the remaining "Things to do: Segmented Intersection"]

SLIDE 44

Memory Footprint

[Figure: two panels, Triangle and 4-Clique. x-axis: Edge Number (10K to 100M); y-axis: GByte. Series: GPU-Optimized, LFTJ-GPU, Redfox.]

Redfox > GPU-Optimized > LFTJ-GPU