[PPT] - Parallel Algorithms for Solving Large Assignment Problems on GPU PowerPoint Presentation

SLIDE 1

Parallel Algorithms for Solving Large Assignment Problems on GPU Clusters

2018 Blue Waters Symposium Ketan Date Rakesh Nagi (PI)

Department of Industrial and Enterprise Systems Engineering University of Illinois at Urbana-Champaign

June 6, 2018

1 / 30

SLIDE 2

Outline

Assignment Problems: Introduction and Impact Research Tasks and Role of Blue Waters The Linear Assignment Problem The Quadratic Assignment Problem

2 / 30

SLIDE 3

Outline

Assignment Problems: Introduction and Impact Research Tasks and Role of Blue Waters The Linear Assignment Problem The Quadratic Assignment Problem

3 / 30

SLIDE 4

Introduction

Assignment problems: Fundamental optimization problems in Operations Research that have prominent applications in science and engineering. Our inability to efficiently solve large instances of these problems can greatly inhibit the discovery in these domains. Objectives: Designing faster, parallel algorithms for Linear Assignment Problem (LAP) and Quadratic Assignment Problem (QAP) using GPUs and large computational clusters like Blue Waters. Future work: Extending the proposed methodology to Generalized Assignment Problem (GAP), Traveling Salesman Problem (TSP), Vehicle Routing Problem (VRP), and Graph Association/Matching (GA/GM).

4 / 30

SLIDE 5

Impact of Assignment Problems

1. Data sciences: Data association in information fusion and

multi-target tracking.

2. Bioinformatics: Alignment of protein-protein interaction (PPI)

networks.

3. Engineering: Facility location, routing and scheduling problems, etc.
3. 3/18/2013 -Suspected bank

robber John Dillinger is wanted for questioning by Greencastle police. Dillinger is said to be 30 years old, with a height of 5'8", and weight

f 185 pounds. Dillinger was last

seen driving his black 2010 Ford Focus westward down Indianapolis Road.

b’ a’ c’

Name: John Dillinger Type: Person Sex: Male Height: 5’ 8” Age: 30 Weight: 185 lbs Shirt color: Red Type: Vehicle Make: Ford Model: Focus Year: 2010 Color: Black

d2 e2

Name: Anonymous Type: Person Sex: Male Height: 5’ 10” Shirt color: Black Name: Anonymous Type: Person Sex: Male Height: 6’ 2” Shirt color: Blue Owner of

b2 a2 c2

Name: Anonymous Type: Person Sex: Male Height: 5’ 8” Shirt color: Red Type: Vehicle Make: Ford Color: Black

d2 e2

Name: Anonymous Type: Person Sex: Male Height: 5’ 10” Shirt color: Black Name: Anonymous Type: Person Sex: Male Height: 6’ 2” Shirt color: Blue Located at Located at Located at Located at

b1 a1 c1

Name: John Dillinger Type: Person Sex: Male Height: 5’ 8” Weight: 185 lbs Age: 30 Type: Vehicle Make: Ford Model: Focus Year: 2010 Color: Black Address: Indianapolis Road Type: Location Owner of Located at Located at Name: Sunoco Gas Station Address: Indianapolis Road Type: Location Time: 1320 Date: 03172013 Name: Sunoco Gas Station Address: Indianapolis Road Type: Location Time: 1320 Date: 03172013

1 5 4 2 3 V 1 V1 V2 V2 V1 V2 1 5 4 2 3

L1 L2 L3 L4 F1 F3 F2 F4

Association of complex information from heterogeneous data sources using Graph Association Formulation TSP and VRP Facility Location

5 / 30

SLIDE 6

Outline

Assignment Problems: Introduction and Impact Research Tasks and Role of Blue Waters The Linear Assignment Problem The Quadratic Assignment Problem

6 / 30

SLIDE 7

Research Tasks and Role of Blue Waters

Research tasks

1. Develop GPU accelerated Hungarian algorithm for the LAP.
2. Develop GPU accelerated Dual Ascent procedure for QAP-RLT2.
3. Couple both the algorithms and deploy on Blue Waters to obtain

exact solutions to the QAP in a parallel branch-and-bound scheme. Role of Blue Waters

◮ Exponential number of tree nodes in branch-and-bound. ◮ Lower bounding procedure requires solving O(n3) LAPs and adjusting

O(n6) Lagrange multipliers.

◮ Solving the benchmark Nug30 QAP required over 1200 XK compute

nodes for over 110 hours (15 yrs worth of computation).

◮ We are grateful to Blue Waters and the NCSA staff for providing this

invaluable service to the scientific community.

7 / 30

SLIDE 8

Outline

Assignment Problems: Introduction and Impact Research Tasks and Role of Blue Waters The Linear Assignment Problem The Quadratic Assignment Problem

8 / 30

SLIDE 9

Linear Assignment Problem: Introduction

◮ Also known as weighted bipartite matching problem. ◮ Objective: To minimize total cost of assigning n resources to n tasks. ◮ Important subproblem of many NP-Hard optimization problems, e.g.,

◮ Quadratic Assignment Problem ◮ Traveling Salesperson Problem ◮ Graph Matching and Association Problems, etc.

min

n

i=1

n

j=1

cijxij; s.t.

n

j=1

xij = 1 ∀i = 1, . . . , n;

n

i=1

xij = 1 ∀j = 1, . . . , n; xij ∈ {0, 1} ∀i, j = 1, . . . , n.

9 / 30

SLIDE 10

Literature Review

Sequential algorithms

◮ Hungarian algorithm [Kuhn, 1955, Munkres, 1957]. ◮ Shortest path algorithms [Jonker and Volgenant, 1987]. ◮ Auction algorithm [Bertsekas, 1990].

Parallel implementations

◮ Parallel synchronous/asynchronous Hungarian algorithms

[Bertsekas and Casta˜ non, 1993].

◮ Parallel shortest path algorithms

[Balas et al., 1991, Storøy and Sørevik, 1997].

◮ Parallel synchronous/asynchronous Auction algorithm:

[Wein and Zenios, 1990, Bertsekas and Casta˜ non, 1991].

◮ Parallel Auction algorithm using GPUs

[Vasconcelos and Rosenhahn, 2009]

10 / 30

SLIDE 11

Sequential Hungarian Algorithm

With opportunities for acceleration

Yes No Yes No

All Assigned? End Augmenting Path Found?

Initialization Partial Assignment Start Optimality Check Augmenting Path Search Augmentation Dual Update

High granularity Scalable to multiple GPUs Low Granularity Executed on single GPU

11 / 30

SLIDE 12

Accelerated Hungarian: Augmenting Path Search (Forward Pass)

◮ Goal is to find vertex disjoint augmenting paths from unassigned rows

to unassigned columns.

◮ In forward pass, threads traverse the graph one hop at a time and

construct augmenting trees.

◮ More than one augmenting trees may be found per iteration. ◮ Due to race condition, the trees are guaranteed to be vertex disjoint

(our innovation).

Thread 11 Thread 12

R1 R2 R3 R4 C4 C3 C2 C1

BFS Iteration 1 Frontier: R2 and R3 New Frontier: R1 and R4

R1 R2 R3 R4 C4 C3 C2 C1 R1 R2 R3 R4 C4 C3 C2 C1

Thread 22 Thread 21

R1 R2 R3 R4 C4 C3 C2 C1

BFS Iteration 2 Frontier: R1 and R4 New Frontier: --

R1 R2 R3 R4 C4 C3 C2 C1 R1 R2 R3 R4 C4 C3 C2 C1 Augmenting path(s) found: R3-C2-R1-C1 and R3-C2-R1-C4

12 / 30

SLIDE 13

Accelerated Hungarian: Reverse Pass and Augmentation

◮ Reverse pass is performed to extract augmenting paths from the

augmenting trees (each leaf vertex is processed by one thread).

◮ Due to “race” condition only one path survives per tree (our

innovation).

◮ All such paths can be used to augment the current solution and

increase number of assignments.

R1 R2 R3 R4 C4 C3 C2 C1

Thread 22 Thread 21

Augmentation Number of assignments increased by 1

Reverse Pass Survivor Path: C1-R1-C2-R3

R1 R2 R3 R4 C4 C3 C2 C1

13 / 30

SLIDE 14

Computational Experiments

Experimental Setup

◮ Small Scale: n = 500 to n = 5000 in increments of 500. Cost matrix

f randomly generated integers between [0, n].

◮ Large Scale: n = 5000 to n = 20000 in increments of 5000. Cost

matrix of randomly generated integers between [0, n

10], [0, n], and

[0, 10n]. Hardware details

◮ Computational resources from Blue Waters Supercomputing Facility at

University of Illinois at Urbana-Champaign.

◮ CPU: AMD Interlagos 6376, 2.30GHz clock speed, and 32GB memory. ◮ GPU: NVIDIA GK110 “Kepler” K20X, with 2688 processor cores, and

6GB memory.

14 / 30

SLIDE 15

Execution Times and Speedup Profiles (Small Scale)

15 / 30

SLIDE 16

Execution Times and Speedup Profiles (Large Scale)

16 / 30

SLIDE 17

Contributions

1. Developed parallel versions of two variants of the Hungarian

algorithm, for solving the LAP on GPU(s).

2. Accelerated algorithms leverage “race condition” to find multiple

vertex-disjoint augmenting paths.

3. Single GPU variant can solve problems with up to 400 Million

variables in 13 seconds.

4. Multi-GPU variant can solve problems with up to 1.6 Billion variables

in 24 seconds.

17 / 30

SLIDE 18

Outline

Assignment Problems: Introduction and Impact Research Tasks and Role of Blue Waters The Linear Assignment Problem The Quadratic Assignment Problem

18 / 30

SLIDE 19

Quadratic Assignment Problem: Introduction

◮ Introduced by [Koopmans and Beckmann, 1957] as a mathematical

model for facility location.

◮ Objective: To place n facilities on n locations such that total cost

(distance times flow) is minimized.

◮ Strongly NP-hard problem. No polynomial time optimal or ǫ-optimal

algorithm.

min

n

i=1

n

p=1

bipxip +

n

i=1

n

j=1

n

p=1

n

q=1

fijdpqxipxjq s.t.

n

i=1

xip = 1 ∀p = 1, · · · , n;

n

p=1

xip = 1 ∀i = 1, · · · , n; xip ∈ {0, 1} ∀i, p = 1, · · · , n.

19 / 30

SLIDE 20

Linearization Models

◮ One of the ways to solve QAP is to convert it into MILP and use

Branch-and-bound.

◮ Linearization is accomplished by introducing large number of variables

and constraints.

◮ LP relaxation provides a valid lower bound on QAP. ◮ Can be used to fathom nodes in branch-and-bound tree. Linearization Model Binary Variables Continuous Variables Constraints [Lawler, 1963] O(n4) – O(n4) [Kaufman and Broeckx, 1978] O(n2) O(n2) O(n2) [Frieze and Yadegar, 1983] O(n2) O(n4) O(n4) [Adams and Johnson, 1994] RLT1 O(n2) O(n4) O(n4) [Adams et al., 2007] RLT2 O(n2) O(n6) O(n6) [Hahn et al., 2012] RLT3 O(n2) O(n8) O(n8)

20 / 30

SLIDE 21

RLT2, DLRLT2, and Staged LAP Solution

RLT2: min

i
p

bipxip +

i
j=i
p
q=p

Cijpqyijpq +

i
j=i
k=i,j
p
q=p
r=p,q

Dijkpqrzijkpqr; s.t.

p

xip = 1, ∀i;

i

xip = 1, ∀p;

q=p

yijpq = xip, ∀(i, j, p) : i = j;

j=i

yijpq = xip, ∀(i, p, q) : p = q;

r=p,q

zijkpqr = yijpq, ∀(i, j, k, p, q) : i = j = k, p = q;

k=i,j

zijkpqr = yijpq, ∀(i, j, p, q, r) : i = j, p = q = r; zijkpqr = zikjprq = zjikqpr = zjkiqrp = zkijrpq = zkjirqp, ∀(i, j, k, p, q, r) : i < j < k, p = q = r; xip ∈ {0, 1}, ∀i, p; yijpq ≥ 0, ∀(i, j, p, q) : i = j, p = q; zijkpqr ≥ 0, ∀(i, j, k, p, q, r) : i = j = k, p = q = r. DLRLT2(ˆ v): max

i

αi +

p

βp; s.t. αi + βp −

j=i

γijp −

q=p

δipq ≤ bip, ∀i, p; γijp + δipq −

k=i,j

ξijkpq −

r=p,q

ψijpqr ≤ Cijpq, ∀(i, j, p, q) : i = j, p = q; ξijkpq + ψijpqr ≤ Dijkpqr − ˆ vijkpqr, ∀(i, j, k, p, q, r) : i = j = k, p = q = r; αi, βp, γijp, δipq, ξijkpq, ψijpqr ∼ ur, ∀(i, j, k, p, q, r) : i = j = k, p = q = r.

Stage 1: n2(n-1)2 Z-LAPs Stage 2: n2 Y-LAPs Stage 3: 1 X-LAP 21 / 30

SLIDE 22

LDRLT2 and Lagrangian Dual Ascent

◮ For a fixed value of v, DLRLT2 is decomposable into a large number

f LAPs (which can be solved in parallel).

◮ Lagrangian Dual Problem, LDRLT2: maxv DLRTL2(v) ◮ ν(DLRLT2(ˆ

v)) ≤ ν(LDRLT2) ≤ ν(LPRLT2) ≤ ν(RLT2) = ν(QAP)

◮ After solving DLRLT2(vm) at mth iteration, vm could be adjusted

systematically to obtain an improved lower bound.

◮ Highly parallelizable on GPUs.

vm+1

ijkpqr = vm ijkpqr + κzπ(zijkpqr) + κyπ(yijpq)

(n − 2) + κxπ(xip) (n − 1)(n − 2) − Ωijkpqr, where, 0 ≤ κ ≤ 1 and Ωijkpqr is a function of dual slacks π(x), π(y), π(z).

22 / 30

SLIDE 23

RLT2-DA

With opportunities for parallelization

Stopping Criteria Satisfied

End Initialization Z Dual ascent Cost update Start Solve Z-LAP High granularity Low Granularity Z Solution Transfer Solve Y-LAP Y Solution Transfer Y Dual ascent Cost update X Cost update Solve X-LAP

YES NO 23 / 30

SLIDE 24

Accelerated RLT2-DA

CPU(0) MPI MPI CPU(1) CPU(K-1) GPU(0)

Initialization + Dual Ascent + Z-TLAP Solve + Y-TLAP Solve + X-LAP Solve + Feasibility Check

GPU(1)

Initialization + Dual Ascent + Z-TLAP Solve + Y-TLAP Solve + X-LAP Solve + Feasibility Check

GPU(K-1)

Initialization + Dual Ascent + Z-TLAP Solve + Y-TLAP Solve + X-LAP Solve + Feasibility Check

Table: Minimum number of GPUs for various problem sizes.

n ≤ 27 30 35 40 42 Minimum # of GPUs 1 2 4 7 15

24 / 30

SLIDE 25

Tiled Linear Assignment Problems (TLAP)

◮ LAPs are solved using our GPU-Accelerated Hungarian Algorithm

[Date and Nagi, 2016].

◮ LAPs owned by a particular GPU are combined and solved as a tiled

LAP (or TLAP).

◮ In practice, TLAP solves much faster than solving individual LAPs.

LAP(0) LAP(1) LAP(M-1) T L A P (0) LAP(M) LAP(M+1) LAP(2M-1) T L A P (1) LAP(M2-M)

LAP(M2-M+1) LAP(M2-1)

T L A P (M-1) 25 / 30

SLIDE 26

Parallel Branch-and-bound

d x11=1 PE Bank 0 Best First (Priority Queue) Depth First (Stack) x1(.)=1 x1(.)=1 PE Bank (N-1) Best First (Priority Queue) x1(.)=1 x1(.)=1 x22=1 x2(.)=1 x2(.)=1 xdd=1 xd(.)=1 xd(.)=1 x1(.)=1

◮ 1 Processing Element (PE) contains 1 CPU and 1 GPU. ◮ Initial upper bound established using local search heuristic. ◮ Each PE bank executes RLT2-DA on a single node and establishes a lower bound. ◮ Tree expanded using “Polytomic” branching rule. ◮ PEs synchronized after 300 seconds and work is redistributed if needed. ◮ Branching Rule: Branch on the facility that has the lowest total flow with the

previously placed facilities.

26 / 30

SLIDE 27

Experimental Results

Problem UB N K ℓinit Nodes PE Utilization Time Initial Explored Min Avg Max (d:hh:mm:ss) Nug20† 2570∗ 4 1 2 98 134 0.77 0.89 0.99 0:00:38:41 Nug22 3596∗ 10 1 2 462 622 0.82 0.91 0.99 0:01:34:03 Nug25† 3744∗ 50 2 3 1,755 3,868 0.81 0.90 0.97 0:02:44:24 Nug27 5234∗ 100 2 3 17,550 55,761 0.97 0.98 0.99 1:02:28:32 Nug30† 6124∗ 300 4 4 164,520 840,273 0.96 0.96 0.97 4:14:06:21 Tai20a 703, 482∗ 10 1 2 380 3,512 0.88 0.92 0.98 0:03:56:52 Tai20b‡ 122, 455, 319∗ 1 1 1 1 1.00 1.00 1.00 0:00:18:57 Tai25a 1, 167, 256∗ 100 2 3 13,800 523,005 0.97 0.98 0.98 3:13:53:33 Tai25b‡ 344, 355, 646∗ 1 4 1 1 1.00 1.00 1.00 0:01:52:43 Tai30b 637, 117, 113∗ 60 4 2 870 30,523 0.91 0.93 0.95 2:09:55:17

† Grid symmetry elimination rules were used for these problem instances. ‡ Gap closure was achieved for these problems in 110 and 684 iterations respectively.

27 / 30

SLIDE 28

Contributions

1. Developed/implemented GPU-accelerated RLT2-DA solver.
2. Developed/implemented parallel branch-and-bound scheme on a

GPU-cluster, to solve QAP to optimality.

3. Under initial review in INFORMS Journal on Computing.
4. Approximate Hungarian Solver can speed up DA iterations with

minimal impact on optimality gap.

5. Next steps: Use Approximate Hungarian in B&B and solve instances

with n ≥ 30.

28 / 30

SLIDE 29

Acknowledgments

The work on “accelerated Hungarian algorithm” has been supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for “Unified Research on Network-based Hard / Soft Information Fusion,” issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery. We gratefully appreciate this support. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. We gratefully appreciate this support.

29 / 30

SLIDE 30

Thank You

30 / 30

SLIDE 31

Adams, W. P., Guignard, M., Hahn, P. M., and Hightower, W. L. (2007). A level-2 reformulation–linearization technique bound for the quadratic assignment problem. European Journal of Operational Research, 180(3):983–996. Adams, W. P. and Johnson, T. A. (1994). Improved linear programming-based lower bounds for the quadratic assignment problem. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 16:43–77. Balas, E., Miller, D., Pekny, J., and Toth, P. (1991). A parallel shortest augmenting path algorithm for the assignment problem. Journal of the ACM (JACM), 38(4):985–1004. Bertsekas, D. P. (1990). The Auction algorithm for assignment and other network flow problems: A tutorial.

30 / 30

SLIDE 32

Interfaces, 20(4):133–149. Bertsekas, D. P. and Casta˜ non, D. A. (1991). Parallel synchronous and asynchronous implementations of the Auction algorithm. Parallel Computing, 17(6):707–732. Bertsekas, D. P. and Casta˜ non, D. A. (1993). Parallel asynchronous Hungarian methods for the assignment problem. ORSA Journal on Computing, 5(3):261–274. Date, K. and Nagi, R. (2016). GPU-accelerated Hungarian algorithms for the Linear Assignment Problem. Parallel Computing, 57:52–72. Frieze, A. and Yadegar, J. (1983). On the quadratic assignment problem. Discrete applied mathematics, 5(1):89–98.

30 / 30

SLIDE 33

Hahn, P. M., Zhu, Y.-R., Guignard, M., Hightower, W. L., and Saltzman, M. J. (2012). A level-3 reformulation-linearization technique-based bound for the quadratic assignment problem. INFORMS J. on Computing, 24(2):202–209. Jonker, R. and Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340. Kaufman, L. and Broeckx, F. (1978). An algorithm for the quadratic assignment problem using Bender’s decomposition. European Journal of Operational Research, 2(3):207–211. Koopmans, T. C. and Beckmann, M. (1957). Assignment problems and the location of economic activities. Econometrica: Journal of the Econometric Society, pages 53–76. Kuhn, H. W. (1955).

30 / 30

SLIDE 34

The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97. Lawler, E. L. (1963). The quadratic assignment problem. Management science, 9(4):586–599. Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial & Applied Mathematics, 5(1):32–38. Storøy, S. and Sørevik, T. (1997). Massively parallel augmenting path algorithms for the assignment problem. Computing, 59(1):1–16. Vasconcelos, C. N. and Rosenhahn, B. (2009). Bipartite graph matching computation on GPU. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 42–55. Springer.

30 / 30

SLIDE 35

Wein, J. M. and Zenios, S. (1990). Massively parallel Auction algorithms for the assignment problem. In Frontiers of Massively Parallel Computation, 1990. Proceedings., 3rd Symposium on the, pages 90–99. IEEE.

30 / 30