Scalable Asynchronous Contact Mechanics using Charm++
Xiang Ni*, Laxmikant V. Kale* and Rasmus Tamstorf^
* University of Illinois at Urbana-Champaign
^Walt Disney Animation Studios
Asynchronous Contact Mechanics
What you want What you get
incorrect handling of collisions
pictures from Yi Wang at VT
communication
[Figure: Number of Active Contacts (2K to 12K); axes: Core ID, Simulated Time (s)]
Collision Window: Internal Force, Collision Detection, Internal Force
Collisions Detected?
Yes: Add penalty forces and rollback (Penalty Force)
No: Proceed to the next window
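The window/rollback flow above can be illustrated with a small toy. This is only a sketch under loud assumptions: a single point in 1D pushed toward a wall at x = 0, first-order (overdamped) dynamics, a constant stand-in internal force, and a penalty stiffness that starts at 8 and doubles on each rollback. None of these choices come from the slides; `run_windows` and its parameters are hypothetical names.

```python
def run_windows(x0, num_windows, steps_per_window=2, dt=0.125, layer=0.25):
    """Toy 1D collision-window loop: a point pushed toward a wall at x = 0.

    Assumed for illustration only: overdamped dynamics, a constant internal
    force of -1, and a penalty force active inside a layer of width `layer`.
    """
    x = x0
    stiffness = 0.0          # penalty stiffness; 0 means no penalty force yet
    rollbacks = 0
    for _ in range(num_windows):
        checkpoint = x       # state saved at the start of the collision window
        while True:
            x = checkpoint   # (re)start the window from the saved state
            collided = False
            for _ in range(steps_per_window):
                internal = -1.0                            # stand-in internal force
                penalty = stiffness * max(0.0, layer - x)  # pushes away from the wall
                x += dt * (internal + penalty)
                if x < 0.0:                                # stand-in collision detection
                    collided = True
            if not collided:
                break        # no collisions detected: proceed to the next window
            # Collisions detected: add (strengthen) the penalty force and roll back.
            stiffness = 8.0 if stiffness == 0.0 else 2.0 * stiffness
            rollbacks += 1
            if rollbacks > 64:
                raise RuntimeError("penalty force did not resolve the collision")
    return x, stiffness, rollbacks


if __name__ == "__main__":
    print(run_windows(0.3125, 6))
```

With `x0 = 0.3125`, the second window detects a penetration, a single rollback adds the penalty force, and the point then rests at x = 0.125 for the remaining windows.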
collision response
Broad Phase
Locally inside each partition, we use a 26-DOP hierarchy to fit the swept volumes of the triangles and detect potential collisions.
Globally among all the partitions, we fit the trajectory of each triangle to a 3D bounding box and then pass the boxes to the existing collision detection library in Charm++.
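As a rough illustration of the global step, the sketch below fits an axis-aligned 3D bounding box to a triangle's trajectory (its vertex positions at the start and end of a window) and reports overlapping boxes as potential collision pairs. The helper names are hypothetical, and the all-pairs loop is a stand-in: it does not reproduce the Charm++ collision detection library or the 26-DOP hierarchy used locally.

```python
def swept_aabb(tri_start, tri_end):
    """Axis-aligned box enclosing a triangle at the start and end of a window."""
    points = list(tri_start) + list(tri_end)
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def boxes_overlap(a, b):
    """Two boxes overlap only if their intervals overlap on every axis."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[i] <= bhi[i] and blo[i] <= ahi[i] for i in range(3))


def candidate_pairs(boxes):
    """Brute-force broad phase: indices of all overlapping box pairs."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]


if __name__ == "__main__":
    # Triangle A moves up by one unit; B sits in its path; C is far away.
    a0 = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
    a1 = [(0, 0, 1), (1, 0, 1), (0, 1, 1)]
    b = [(0.2, 0.2, 0.5), (0.8, 0.2, 0.5), (0.2, 0.8, 0.5)]
    c = [(10, 0, 0), (11, 0, 0), (10, 1, 0)]
    boxes = [swept_aabb(a0, a1), swept_aabb(b, b), swept_aabb(c, c)]
    print(candidate_pairs(boxes))
```

Overlapping boxes only mark potential collisions; the narrow phase still has to confirm or reject each pair.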
Narrow Phase
We apply the space-time separating planes method to filter
First Challenge: Computation Imbalance
Time spent on each potential collision pair is not uniform
Detection time depends on the trajectory length of each vertex in the potential pair
A profiling-based load balancer
[Figure: per-core timeline; x-axis: Time (ms), 100 to 800; legend: Communication, Computation]
810 ms -> 150 ms
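One way to picture a profiling-based load balancer is a greedy assignment driven by measured costs. The sketch below is an assumption, not the balancer from the talk: it takes per-task times measured in an earlier window (the hypothetical `measured_cost` dictionary) and always hands the most expensive remaining task to the currently least loaded core.

```python
import heapq


def balance_by_profile(measured_cost, num_cores):
    """Greedy longest-task-first assignment using profiled task costs.

    measured_cost: dict mapping a task id to its measured time (assumed to
    come from profiling an earlier window).
    Returns (assignment, loads): task -> core, and the total load per core.
    """
    heap = [(0.0, core) for core in range(num_cores)]   # (load, core)
    heapq.heapify(heap)
    assignment = {}
    # Most expensive tasks first; ties are broken by task id for determinism.
    for task, cost in sorted(measured_cost.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core = heapq.heappop(heap)                # least loaded core
        assignment[task] = core
        heapq.heappush(heap, (load + cost, core))
    loads = [0.0] * num_cores
    for task, core in assignment.items():
        loads[core] += measured_cost[task]
    return assignment, loads


if __name__ == "__main__":
    costs = {"a": 8.0, "b": 7.0, "c": 3.0, "d": 2.0, "e": 2.0}
    print(balance_by_profile(costs, 2))
```

Because detection time varies per pair, balancing on measured time rather than on pair counts is the point of the sketch; the specific greedy rule is only one possible choice.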
Second Challenge: Communication Imbalance
The more widely the potential collision pairs are spread, the more communication requests are needed.
Locality Aware Load Balancer
Potential Collisions / Collision Tasks:
Partition 2 & Partition 3
Partition 3 & Partition 4
Partition 5 & Partition 2
Assignment: Node 2, Node 3, Node 4, Node 5
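A locality-aware assignment can be sketched as follows. The rule used here is an assumption made for illustration: each collision task between two partitions may only be placed on a node that already owns one of the two partitions, and among those candidates the less loaded node wins. The partition-to-node mapping and the task costs in the example are hypothetical.

```python
def assign_tasks(tasks, owner):
    """Place each collision task on a node that owns one of its partitions.

    tasks: list of (partition_a, partition_b, cost) tuples.
    owner: dict mapping a partition id to the node that holds its data.
    Returns (placement, load): (a, b) -> node, and accumulated cost per node.
    """
    load = {node: 0.0 for node in set(owner.values())}
    placement = {}
    # Handle expensive tasks first so they get the emptiest candidate node.
    for a, b, cost in sorted(tasks, key=lambda t: -t[2]):
        candidates = {owner[a], owner[b]}          # keeps one partition local
        node = min(candidates, key=lambda n: (load[n], n))
        placement[(a, b)] = node
        load[node] += cost
    return placement, load


if __name__ == "__main__":
    owner = {2: 2, 3: 3, 4: 4, 5: 5}               # assumed: partition k on node k
    tasks = [(2, 3, 5.0), (3, 4, 4.0), (5, 2, 3.0)]
    print(assign_tasks(tasks, owner))
```

Restricting candidates to the owning nodes means at most one partition's data has to be fetched remotely per task, which is the locality benefit this sketch is meant to show.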
Overlapping Computation and Communication
Let’s look at the flow on one node
L: list of potential collision tasks
M.type()
Data request: sendDataReply()
Data reply: T <- related task
    if T.ready
        if T.size > THRESHOLD
            redistribute T within node
        else
            Process T
Work request: Process subTask(M.start, M.end)
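The per-node flow can be mocked up as a small message handler. This is a sketch with several assumptions: the `Message`, `Task` and `Node` classes are hypothetical, a task counts as ready once all of its outstanding data replies have arrived, and redistribution within the node is imitated by splitting the task into two work requests that are handled inline rather than sent to other cores.

```python
from collections import namedtuple

# type is required; task_id, start and end default to None.
Message = namedtuple("Message", "type task_id start end", defaults=(None, None, None))
THRESHOLD = 4   # assumed size above which a task is split within the node


class Task:
    def __init__(self, task_id, pairs, pending_replies):
        self.task_id = task_id
        self.pairs = pairs               # potential collision pairs
        self.pending = pending_replies   # data replies still outstanding

    @property
    def ready(self):
        return self.pending == 0

    @property
    def size(self):
        return len(self.pairs)


class Node:
    def __init__(self, tasks):
        self.tasks = {t.task_id: t for t in tasks}   # "L" in the slide
        self.log = []                                # records actions for inspection

    def handle(self, m):
        if m.type == "data_request":
            self.log.append(("send_data_reply", m.task_id))
        elif m.type == "data_reply":
            t = self.tasks[m.task_id]                # T <- related task
            t.pending -= 1
            if t.ready:
                if t.size > THRESHOLD:
                    # Redistribute within the node: here, two inline work requests.
                    half = t.size // 2
                    for start, end in ((0, half), (half, t.size)):
                        self.handle(Message("work_request", t.task_id, start, end))
                else:
                    self.process(t, 0, t.size)
        elif m.type == "work_request":
            self.process(self.tasks[m.task_id], m.start, m.end)

    def process(self, t, start, end):
        self.log.append(("process", t.task_id, start, end))


if __name__ == "__main__":
    node = Node([Task(1, list(range(3)), 1), Task(2, list(range(6)), 2)])
    for msg in (Message("data_reply", 1), Message("data_reply", 2),
                Message("data_reply", 2), Message("data_request", 1)):
        node.handle(msg)
    print(node.log)
```

Since every action is triggered by a message, a node can keep processing ready tasks while replies for other tasks are still in flight, which is the overlap the slide title refers to.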
Node level data cache
Compute Objects 1 to 4: begin narrow phase, need external data, end narrow phase
Node-level software cache: is the data present? (Yes / No)
No: send request to remote cache
Request from remote cache: respond
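The idea of a node-level cache can be sketched in a few lines. The `NodeCache` class below is hypothetical, and it fetches remote data with a direct, synchronous call for brevity, whereas the slide describes requests and responses exchanged between caches. What it shows is the effect: several compute objects on one node that need the same external data trigger only one remote request.

```python
class NodeCache:
    """Toy node-level software cache shared by the compute objects on a node."""

    def __init__(self, node_id, local_data):
        self.node_id = node_id
        self.data = dict(local_data)   # key -> data owned by or cached on this node
        self.remote_requests = 0       # number of requests sent to remote caches

    def respond(self, key):
        """Serve a request that arrived from a remote cache."""
        return self.data[key]

    def get(self, key, owner_cache):
        """Return data for key, fetching it from the owner's cache on a miss."""
        if key not in self.data:                       # "Is present?" -> No
            self.remote_requests += 1
            self.data[key] = owner_cache.respond(key)  # request to remote cache
        return self.data[key]                          # "Is present?" -> Yes


if __name__ == "__main__":
    node0 = NodeCache(0, {"p0": [1.0, 2.0]})
    node1 = NodeCache(1, {"p1": [3.0, 4.0]})
    # Four compute objects on node 0 all need partition p1 from node 1.
    results = [node0.get("p1", node1) for _ in range(4)]
    print(results, node0.remote_requests)
```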
[Figure: Time (s) versus Number of Cores (12 to 192); series: Naive, Profiling based, Fully optimized, Linear scaling]
Computation Imbalance
[Figure: timeline; axis: Time (ms); legend: Material Force Calculation, Penalty Force Calculation]
4 ms -> 3.5 ms
Importance of partial barrier
3.5 ms -> 3.2 ms
Machines
Edison: a Cray XC30, Intel E5-2695 @ 2.4GHz, 12-core Ivy Bridge
Brickland: a 4-socket system with Intel E7-4890 @ 2.8GHz, 15-core Ivy Bridge
Examples
[Figure: Time (s) versus Number of Cores for (a) Bowline Knot, (b) Reef Knot, (c) Two Cloths Draped, (d) Twister; series: Charm++ Time (Brickland), Charm++ Time (Edison), TBB Time (Brickland)]
[Figure: Time (s) versus Simulated Time (s); marked points at 5s, 10s, 15s, 25s]
[Figure: Time [s] versus Number of Cores (1 to 768) at simulated times (a) 5s, (b) 10s, (c) 15s, (d) 25s; series: Time per window, Force calculation, Collision detection]
shared memory results
applications like ACM
communication and computation