Scaling Alltoall Collective on Multi-core Systems
Rahul Kumar, Amith R Mamidala, Dhabaleswar K Panda
Department of Computer Science & Engineering
The Ohio State University
{kumarra, mamidala, panda}@cse.ohio-state.edu
Presentation Outline
The Ranger cluster at TACC has 16 cores per node and more than 60,000 cores in total
[Figure: offload architecture vs. onload architecture; nodes with cores and NICs connected by InfiniBand]
In an offload architecture, the network processing is offloaded to the network interface (NIC)
In an onload architecture, the CPU is involved in communication in addition to performing the computation
In an onload architecture, a faster CPU can speed up the communication. However, communication cannot be overlapped with computation
cores
cores simultaneously
[Figure: cores on Node 1 and Node 2]
by each core increases to two
[Figures: cores on Node 0 and Node 1 arranged in groups (GROUP 1, GROUP 2), with the exchange shown in steps (Step 1, Step 2, Step 3)]
receives the data
– Cluster A: 64 nodes (512 cores)
– Cluster B: 4 nodes (32 cores)
– Alltoall collective time
– CPMD Application performance
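To give a feel for why alltoall stresses a multi-core cluster, the sketch below models the collective's semantics (every process sends a distinct block to every other process) and counts how many point-to-point messages stay inside a node versus cross the network. It is an illustration only: it assumes 8 cores per node (512/64 and 32/4 from the cluster sizes above), a block mapping of consecutive ranks to the same node, and a direct pairwise exchange. It does not implement the leader-based or the proposed scheme, and the names `alltoall` and `message_counts` are hypothetical.

```python
def alltoall(send):
    """Reference semantics of alltoall: process j receives block send[i][j] from each i."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]


def message_counts(nodes, cores_per_node):
    """Count messages in a direct alltoall, split into (intra-node, inter-node).

    Assumes ranks are mapped to nodes in consecutive blocks of cores_per_node.
    """
    p = nodes * cores_per_node
    intra = inter = 0
    for src in range(p):
        for dst in range(p):
            if src == dst:
                continue  # a process does not send to itself over the network
            if src // cores_per_node == dst // cores_per_node:
                intra += 1
            else:
                inter += 1
    return intra, inter


if __name__ == "__main__":
    # Cluster sizes taken from the slide; 8 cores per node is derived, not stated.
    for name, nodes in (("Cluster A", 64), ("Cluster B", 4)):
        intra, inter = message_counts(nodes, 8)
        print(f"{name}: {intra} intra-node, {inter} inter-node messages")
```

Under these assumptions most of the traffic crosses node boundaries, and every core on a node competes for the same network interface at once.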
[Figure: Alltoall Time. Time (us) vs. Msg Size; series: Leader-based, proposed]
[Figure: Alltoall Time. Time (us) vs. # of Nodes; series: leader-based, proposed]
system size
[Figure: Time (us) vs. Msg Size; series: Leader-based, proposed]
[Figure: Alltoall Time. Time (us) vs. Msg Size; series: Leader-based, proposed]
cores on ConnectX architecture
[Figure: Execution Time (sec) for inputs 32-wat, si63-10ryd, si63-70ryd, si63-120ryd; series: Leader-based, proposed]
extensive use of alltoall communication.
128 core system
[Figure: CPMD. Time (secs) vs. System Size (8X8, 16X8, 32X8, 64X8); series: Leader-based, proposed]