Scalable Asynchronous Contact Mechanics using Charm++


slide-1
SLIDE 1

Scalable Asynchronous Contact Mechanics using Charm++

Xiang Ni*, Laxmikant V. Kale* and Rasmus Tamstorf^

* University of Illinois at Urbana-Champaign

^ Walt Disney Animation Studios

slide-10
SLIDE 10

Asynchronous Contact Mechanics

  • Necessary Guarantees
      • Safety: no missed collisions
      • Correctness: follow the laws of physics
      • Progress: finish in a finite amount of time
  • Problems with other existing algorithms
      • An object can end up going through itself or another object
      • Physical properties can be violated

slide-14
SLIDE 14

What you want vs. what you get: incorrect handling of collisions

[Figure: side-by-side comparison; pictures from Yi Wang at VT]

slide-16
SLIDE 16

Parallelization Challenges

  • Highly irregular communication pattern
      • Message-driven execution in Charm++
  • Dynamic load imbalance
      • Adaptive runtime system
  • Very fine-grained computation
      • Overlapping computation and communication

[Figure: Number of Active Contacts (2K to 12K) by Core ID over Simulated Time (s)]

slide-24
SLIDE 24

Overall Flow

[Flow diagram: a timeline of Internal Force and Penalty Force events, with Collision Detection run once per Collision Window]

  • Collisions detected?
      • No: proceed to the next window
      • Yes: add penalty forces and roll back (collision response)
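The window loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_windows`, `integrate`, `detect`, the 1D particle, and the constant penalty value are all hypothetical stand-ins, and penalties are reset at every window for simplicity.

```python
def run_windows(state, n_windows, integrate, detect, max_retries=100):
    """Advance one collision window at a time.

    If collisions are detected at the end of a window, penalty forces are
    added and the window is re-run from its saved starting state (rollback).
    """
    retries_per_window = []
    for _ in range(n_windows):
        checkpoint = state        # state to roll back to
        penalties = []            # penalty forces active in this window
        for attempt in range(max_retries):
            candidate = integrate(checkpoint, penalties)
            hits = detect(checkpoint, candidate)
            if not hits:          # no collisions: accept and move on
                state = candidate
                retries_per_window.append(attempt)
                break
            penalties.extend(hits)  # add penalty forces, then retry the window
        else:
            raise RuntimeError("window did not resolve within max_retries")
    return state, retries_per_window


# Toy scene (hypothetical): a 1D particle above a floor at x = 0.
GRAVITY, DT = -10.0, 0.1

def integrate(state, penalties):
    x, v = state
    v = v + (GRAVITY + sum(penalties)) * DT
    return (x + v * DT, v)

def detect(before, after):
    # Report one penalty force if the particle ended below the floor.
    return [20.0] if after[0] < 0.0 else []

final_state, retries = run_windows((0.05, -1.0), 2, integrate, detect)
```

In this toy run each window needs one rollback before the particle stays above the floor. The `max_retries` guard is only a safety net for the sketch.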

slide-29
SLIDE 29

Collision Detection

Broad Phase

  • Locally, inside each partition, we use a 26-DOP hierarchy fitted to the swept volumes of the triangles to detect potential collisions.
  • Globally, among all the partitions, we fit the trajectory of each triangle to a 3D bounding box and then pass the boxes to the existing collision detection library in Charm++.
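The global step (one 3D bounding box per triangle trajectory) can be sketched as below. This assumes each vertex moves in a straight line within the window, so a box around the start and end positions encloses the whole trajectory. The function names are hypothetical, and the brute-force pair loop is only a stand-in for the Charm++ collision detection library, whose API is not shown here.

```python
from itertools import combinations

def trajectory_box(start_tri, end_tri):
    """Axis-aligned box around a triangle's start and end positions."""
    points = list(start_tri) + list(end_tri)
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi

def boxes_overlap(a, b):
    """Two boxes overlap only if their intervals overlap on every axis."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(3))

def candidate_pairs(boxes):
    """All index pairs whose boxes overlap (O(n^2), for illustration only)."""
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)
            if boxes_overlap(boxes[i], boxes[j])]

def shifted(tri, dx, dy, dz):
    return [(x + dx, y + dy, z + dz) for x, y, z in tri]

base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
boxes = [
    trajectory_box(base, shifted(base, 0, 0, 1.0)),          # moves up
    trajectory_box(shifted(base, 0.5, 0.5, 2.0),
                   shifted(base, 0.5, 0.5, 0.5)),            # moves down
    trajectory_box(shifted(base, 10, 0, 0),
                   shifted(base, 10, 0, 1.0)),               # far away
]
pairs = candidate_pairs(boxes)
```

Only the first two trajectories produce a candidate pair. Overlapping boxes mean a potential collision, which the narrow phase still has to confirm or reject.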

slide-30
SLIDE 30

Collision Detection

Narrow Phase

We apply the space-time separating planes method to filter out potential collisions.

slide-38
SLIDE 38

Narrow Phase

First Challenge: Computation Imbalance

  • Time spent on each potential collision pair is not uniform
  • Detection time depends on the trajectory length of each vertex in the potential pair
  • Solution: a profiling-based load balancer

[Figure: per-core timeline of Communication and Computation; axis: Time (ms), 0 to 800]

810 ms → 150 ms
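One way to use profiled costs is sketched below. The slides do not say which heuristic the profiling-based balancer uses, so this example swaps in a standard greedy "longest processing time first" rule. The per-task costs are hypothetical numbers standing in for measurements from an earlier window.

```python
import heapq

def balance_by_profile(costs, n_cores):
    """Greedy LPT: give the most expensive remaining task to the least-loaded core.

    costs: dict mapping task id -> measured cost (e.g. ms in a previous window).
    Returns (assignment, loads), where assignment maps task id -> core index.
    """
    heap = [(0.0, core) for core in range(n_cores)]  # (current load, core)
    heapq.heapify(heap)
    assignment = {}
    for task, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, core = heapq.heappop(heap)             # least-loaded core
        assignment[task] = core
        heapq.heappush(heap, (load + cost, core))
    loads = [0.0] * n_cores
    for task, core in assignment.items():
        loads[core] += costs[task]
    return assignment, loads

# Hypothetical profiled costs for five collision tasks.
profiled = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment, loads = balance_by_profile(profiled, n_cores=2)
```

With these numbers the two cores end up with loads of 17 and 13. Assigning the same tasks in arrival order, without cost information, could do noticeably worse.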

slide-40
SLIDE 40

Narrow Phase

Second Challenge: Communication Imbalance

The more widely the potential collision pairs are spread, the more communication requests are needed.

[Figure: per-core timeline of Communication and Computation; axis: Time (ms), 0 to 800]

slide-45
SLIDE 45

Narrow Phase: Communication Imbalance

Locality-Aware Load Balancer

[Figure: potential collisions between partition pairs (Partition 2 & Partition 3, Partition 3 & Partition 4, Partition 5 & Partition 2) become collision tasks, which are then assigned to nodes (Node 2, Node 3, Node 4, Node 5)]
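A locality-aware assignment can be sketched as follows. The slides give no details of the actual policy, so this example assumes one plausible rule: each collision task between two partitions is placed on a node that already owns one of them, choosing the less loaded of the two. The function name, the task costs, and the partition-to-node mapping are hypothetical.

```python
def assign_collision_tasks(tasks, owner):
    """Place each (partition_a, partition_b, cost) task on an owning node.

    owner: dict mapping partition id -> node id.
    Returns (placement, load). Every task lands on a node that already holds
    one of its two partitions, so at most one partition's data is remote.
    """
    load = {}
    placement = {}
    # Place expensive tasks first so cheaper ones can even out the load.
    for a, b, cost in sorted(tasks, key=lambda t: t[2], reverse=True):
        candidates = sorted({owner[a], owner[b]})
        node = min(candidates, key=lambda n: load.get(n, 0))
        placement[(a, b)] = node
        load[node] = load.get(node, 0) + cost
    return placement, load

# Partition i lives on node i in this toy setup.
owner = {2: 2, 3: 3, 4: 4, 5: 5}
tasks = [(2, 3, 5), (3, 4, 4), (5, 2, 3)]
placement, load = assign_collision_tasks(tasks, owner)
```

Here the three tasks go to nodes 2, 3 and 5, so no node receives two tasks and every task has half of its data local.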

slide-54
SLIDE 54

Narrow Phase: Communication Imbalance

Overlapping Computation and Communication

Let’s look at the flow on one node. L: list of potential collision tasks.

  • 1. Send data request for the external vertices in L
  • 2. On receiving message M, dispatch on M.type():
      • Data request: sendDataReply()
      • Data reply: T <- related task; if T.ready: if T.size > THRESHOLD, redistribute T within node; else process T
      • Work request: process subTask(M.start, M.end)
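The per-node flow above can be sketched as a small message handler. This is plain Python rather than Charm++ code: the `Node` and `Task` classes, the `outbox` list standing in for sent messages, and the `THRESHOLD` value are all hypothetical.

```python
from dataclasses import dataclass

THRESHOLD = 4  # hypothetical size above which a task is split within the node

@dataclass
class Task:
    pairs: list    # potential collision pairs in this task
    missing: set   # external vertices whose data has not arrived yet

    @property
    def ready(self):
        return not self.missing

    @property
    def size(self):
        return len(self.pairs)

class Node:
    def __init__(self, tasks, local_data, n_workers=2):
        self.tasks = tasks
        self.local_data = dict(local_data)
        self.n_workers = n_workers
        self.outbox = []      # messages this node would send
        self.processed = []   # pairs handed to the narrow-phase test

    def start(self):
        # Step 1: request every external vertex referenced by the task list L.
        wanted = sorted(set().union(*(t.missing for t in self.tasks)))
        self.outbox.append(("data_request", wanted))

    def on_message(self, kind, payload):
        # Step 2: dispatch on the message type.
        if kind == "data_request":
            reply = {v: self.local_data[v] for v in payload}
            self.outbox.append(("data_reply", reply))
        elif kind == "data_reply":
            self.local_data.update(payload)
            for task in self.tasks:
                if not task.missing.intersection(payload):
                    continue                      # reply is unrelated to this task
                task.missing.difference_update(payload)
                if task.ready:
                    self._run_or_split(task)
        elif kind == "work_request":
            task, start, end = payload
            self.processed.extend(task.pairs[start:end])

    def _run_or_split(self, task):
        if task.size > THRESHOLD:
            chunk = -(-task.size // self.n_workers)   # ceiling division
            for start in range(0, task.size, chunk):
                end = min(start + chunk, task.size)
                self.outbox.append(("work_request", (task, start, end)))
        else:
            self.processed.extend(task.pairs)
```

Because every action is triggered by an arriving message, a node can process tasks whose data is already there while replies for other tasks are still in flight.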

slide-55
SLIDE 55

Narrow Phase: Communication Imbalance

Node-level data cache

[Figure: Compute Objects 1 to 4 run between "Begin Narrow Phase" and "End Narrow Phase". An object that needs external data asks the node-level software cache: is the data present? Yes: use it. No: send a request to the remote cache. The node-level software cache also receives requests from remote caches and responds.]
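The cache behavior in the figure can be sketched as below. This is a synchronous, single-process stand-in: `NodeCache` and `fetch_remote` are hypothetical names, and a real implementation would issue the remote request asynchronously.

```python
class NodeCache:
    """Node-level software cache shared by the compute objects on one node."""

    def __init__(self, fetch_remote):
        self._data = {}
        self._fetch_remote = fetch_remote   # callable: key -> value from a remote cache
        self.remote_requests = 0

    def get(self, key):
        # Is the data present? If not, ask the remote cache once and keep the result.
        if key not in self._data:
            self.remote_requests += 1
            self._data[key] = self._fetch_remote(key)
        return self._data[key]

    def put(self, key, value):
        # Store locally owned data so that remote caches can request it.
        self._data[key] = value

    def respond(self, key):
        # Answer a request coming from a remote cache.
        return self._data[key]


# Two nodes; node B owns vertex 42 (toy values).
cache_b = NodeCache(fetch_remote=lambda key: None)
cache_b.put(42, (1.0, 2.0, 3.0))
cache_a = NodeCache(fetch_remote=cache_b.respond)

# Four compute objects on node A all need the same external vertex.
results = [cache_a.get(42) for _ in range(4)]
```

The four lookups trigger a single remote request, which is the point of sharing the cache at node level rather than per compute object.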

slide-56
SLIDE 56

Narrow Phase

[Figure: narrow-phase Time (s), log scale from 0.01 to 1, versus Number of Cores (12 to 192). Series: Naive, Profiling based, Fully optimized, Linear scaling.]

slide-60
SLIDE 60

Collision Response

Computation Imbalance

[Figure: timeline of Material Force Calculation and Penalty Force Calculation; axis: Time (ms)]

4 ms → 3.5 ms

slide-64
SLIDE 64

Collision Response

Importance of partial barrier

[Figure: timeline of Material Force Calculation and Penalty Force Calculation; axis: Time (ms)]

3.5 ms → 3.2 ms

slide-65
SLIDE 65

Results

Machines

  • Edison: a Cray XC30, Intel E5-2695 @ 2.4 GHz, 12-core Ivy Bridge
  • Brickland: a 4-socket system with Intel E7-4890 @ 2.8 GHz, 15-core Ivy Bridge

Examples

slide-66
SLIDE 66

Results

[Figure: Time (s), log scale, versus Number of Cores for four examples: (a) Bowline Knot, (b) Reef Knot, (c) Two Cloths Draped, (d) Twister. Series: Charm++ Time (Brickland), Charm++ Time (Edison), TBB Time (Brickland).]

slide-74
SLIDE 74

Long Twister

[Figure: Time (s) versus Simulated Time (s), with markers at 5s, 10s, 15s and 25s of simulated time]

slide-75
SLIDE 75

Long Twister

[Figure: Time [s], log scale from 0.01 to 1000, versus Number of Cores (1 to 768) at four points of the simulation: (a) 5s, (b) 10s, (c) 15s, (d) 25s. Series: Time per window, Force calculation, Collision detection.]

slide-76
SLIDE 76

Conclusion

  • Strong scaling to 384 cores on Edison
  • More than 10x speedup compared to the TBB shared-memory results
  • Charm++ is well suited to dynamic, irregular applications like ACM
  • Message-driven execution helps overlap communication and computation