An Overview of Supercomputers, Clusters and Grid

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

3/18/2005

A Growth-Factor of a Billion in Performance in a Career

[Chart: peak performance of leading systems, 1950-2010, on a log scale from 1 KFlop/s to 1 PFlop/s: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L; architectural eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Vector/Parallel]

1941                  1 Flop/s (floating point operations per second)
1945                100
1949              1,000  (1 KiloFlop/s, KFlop/s, 10^3)
1951             10,000
1961            100,000
1964          1,000,000  (1 MegaFlop/s, MFlop/s, 10^6)
1968         10,000,000
1975        100,000,000
1987      1,000,000,000  (1 GigaFlop/s, GFlop/s, 10^9)
1992     10,000,000,000
1993    100,000,000,000
1997  1,000,000,000,000  (1 TeraFlop/s, TFlop/s, 10^12)
2000  10,000,000,000,000
2003  35,000,000,000,000 (35 TFlop/s)

2X Transistors/Chip Every 1.5 Years
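A quick sanity check of the growth claim. The start and end points below are taken from the table above; the doubling-time arithmetic is my own illustration, not part of the slides.

    # Minimal sketch: a factor of ~10^9 over a working career is consistent
    # with performance doubling roughly every 1.5 years.
    from math import log

    start_year, start_flops = 1949, 1e3     # 1 KFlop/s
    end_year, end_flops = 1997, 1e12        # 1 TFlop/s
    years = end_year - start_year
    growth = end_flops / start_flops        # a factor of 10^9
    doubling_time = years * log(2) / log(growth)
    print(f"growth: {growth:.0e}x over {years} years")
    print(f"implied doubling time: {doubling_time:.1f} years")   # ~1.6 years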


♦ H. Meuer, H. Simon, E. Strohmaier, & JD
♦ Listing of the 500 most powerful computers in the world
♦ Yardstick: Rmax from the LINPACK MPP benchmark
     Ax = b, dense problem; TPP performance (rate vs. size); see the sketch after this list
♦ Updated twice a year
     SC'xy in the States in November
     Meeting in Mannheim, Germany in June
♦ All data available from www.top500.org
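A minimal sketch of what the Rmax yardstick measures (not the actual HPL benchmark code): solve a dense Ax = b and report a rate using the standard LINPACK operation count. The problem size n below is an arbitrary illustrative choice.

    import time
    import numpy as np

    n = 2000                               # illustrative problem size
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)              # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2  # nominal LINPACK flop count
    print(f"n = {n}: {flops / elapsed / 1e9:.2f} Gflop/s")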

TOP500 Performance - November 2004

[Chart: aggregate (SUM), #1 (N=1), and #500 (N=500) Linpack performance, 1993-2004, on a log scale from 100 Mflop/s to 1 Pflop/s. N=1 rises from 59.7 GF/s (Fujitsu 'NWT', NAL) through Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator to 70.72 TF/s (IBM BlueGene/L); N=500 rises from 0.4 GF/s to 850 GF/s; SUM rises from 1.167 TF/s to 1.127 PF/s. "My Laptop" is marked for scale.]

Vibrant Field for High Performance Computers

♦ Cray X1, XD1, XT3
♦ SGI Altix
♦ IBM Regatta
♦ IBM Blue Gene/L
♦ IBM eServer
♦ Sun
♦ HP
♦ Dawning
♦ Bull NovaScale
♦ Lenovo
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-8
♦ Apple
♦ Coming soon ...
     Cray BlackWidow
     Galactic Computing (Steve Chen)

Architecture/Systems Continuum (from tightly coupled to loosely coupled)

♦ Custom processor with custom interconnect
     Cray X1
     NEC SX-8
     IBM Regatta
     IBM Blue Gene/L
     Trade-offs: best processor performance for codes that are not "cache friendly"; good communication performance; simplest programming model; most expensive
♦ Commodity processor with custom interconnect
     SGI Altix (Intel Itanium 2)
     Cray XT3, "Red Storm" (AMD Opteron)
     Trade-offs: good communication performance; good scalability
♦ Commodity processor with commodity interconnect
     Clusters: Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, or Quadrics
     NEC TX7
     IBM eServer
     Dawning
     Trade-offs: best price/performance (for codes that work well with caches and are latency tolerant); more complex programming model

[Chart: share of TOP500 systems classified as custom, commodity, and hybrid, Jun 1993 - Jun 2004, 0-100%]


Architectures / Systems

[Chart: number of TOP500 systems by architecture class (SIMD, Single Processor, Cluster, Constellation, SMP, MPP), 1993-2004, 0-500 systems]

Processor Types

[Chart: number of TOP500 systems by processor type (SIMD, Vector, and Scalar: Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2004, 0-500 systems]

Commodity Processors

♦ Intel Pentium Nocona: 3.6 GHz, peak = 7.2 Gflop/s; Linpack 100 = 1.8 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron: 2.2 GHz, peak = 4.4 Gflop/s; Linpack 100 = 1.3 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2: 1.5 GHz, peak = 6 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 5.4 Gflop/s
♦ HP PA RISC
♦ Sun UltraSPARC IV
♦ HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s peak
♦ MIPS R16000
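The peak figures above follow from clock rate times floating point results per cycle. A small sketch of that relationship; the flops-per-cycle values are standard figures for these chips and are my assumption, not stated on the slide.

    processors = {
        # name: (clock in GHz, double-precision flops per cycle)
        "Intel Pentium Nocona": (3.6, 2),   # SSE2: 2 flops/cycle
        "AMD Opteron":          (2.2, 2),
        "Intel Itanium 2":      (1.5, 4),   # 2 fused multiply-adds/cycle
        "HP Alpha EV68":        (1.25, 2),
    }
    for name, (ghz, per_cycle) in processors.items():
        print(f"{name}: peak = {ghz * per_cycle:.1f} Gflop/s")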

Top500 Performance by Manufacturer (11/04)

IBM 49%, HP 21%, Others 14%, SGI 7%, NEC 4%, Fujitsu 2%, Cray 2%, Hitachi 1%, Sun 0%, Intel 0%

Commodity Interconnects

♦ Gig Ethernet
♦ Myrinet
♦ Infiniband
♦ QsNet
♦ SCI

                   Switch     Cost     Cost      Cost     MPI Lat / 1-way / Bi-Dir
                   topology   NIC      Sw/node   Node     (us) / MB/s / MB/s
Gigabit Ethernet   Bus        $   50   $   50    $  100   30 / 100 / 150
SCI                Torus      $1,600   $    0    $1,600    5 / 300 / 400
QsNetII (R)        Fat Tree   $1,200   $1,700    $2,900    3 / 880 / 900
QsNetII (E)        Fat Tree   $1,000   $  700    $1,700    3 / 880 / 900
Myrinet (D card)   Clos       $  595   $  400    $  995    6.5 / 240 / 480
Myrinet (E card)   Clos       $  995   $  400    $1,395    6 / 450 / 900
IB 4x              Fat Tree   $1,000   $  400    $1,400    6 / 820 / 790

Interconnects / Systems

[Chart: number of TOP500 systems by interconnect family (Others, Infiniband, Quadrics, Gigabit Ethernet, Cray Interconnect, Myrinet, SP Switch, Crossbar, N/A), 1993-2004, 0-500 systems]


24th List: The TOP10

Rank  Manufacturer  Computer                                  Rmax [TF/s]  Installation Site                       Country  Year  #Proc
  1   IBM           BlueGene/L β-System                           70.72    DOE/IBM                                 USA      2004  32768
  2   SGI           Columbia (Altix, Infiniband)                  51.87    NASA Ames                               USA      2004  10160
  3   NEC           Earth-Simulator                               35.86    Earth Simulator Center                  Japan    2002   5120
  4   IBM           MareNostrum (BladeCenter JS20, Myrinet)       20.53    Barcelona Supercomputer Center          Spain    2004   3564
  5   CCD           Thunder (Itanium2, Quadrics)                  19.94    Lawrence Livermore National Laboratory  USA      2004   4096
  6   HP            ASCI Q (AlphaServer SC, Quadrics)             13.88    Los Alamos National Laboratory          USA      2002   8192
  7   Self Made     X (Apple XServe, Infiniband)                  12.25    Virginia Tech                           USA      2004   2200
  8   IBM/LLNL      BlueGene/L DD1 (500 MHz)                      11.68    Lawrence Livermore National Laboratory  USA      2004   8192
  9   IBM           pSeries 655                                   10.31    Naval Oceanographic Office              USA      2004   2944
 10   Dell          Tungsten (PowerEdge, Myrinet)                  9.82    NCSA                                    USA      2003   2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top10 average 8K processors; 35 systems are in Germany

How Big Is Big?

♦ Every 10X brings new challenges
     64 processors was once considered large; it hasn't been "large" for quite a while
     1024 processors is today's "medium" size
     8096 processors is today's "large"; we're struggling even here
♦ 100K processor systems are in construction
     We have fundamental challenges in dealing with machines of this size ... and little in the way of programming support

[Chart: median size of the Top 10 MPPs and clusters, Sep 1995 - May 2009, 500 to 4,000 CPUs, with an exponential fit y = 7e-06 * e^(0.0005x)]

IBM BlueGene/L: 131,072 Processors

Chip (2 processors): 2.8/5.6 GF/s, 4 MB (cache)
Compute Card (2 chips, 2x1x1), 4 processors: 5.6/11.2 GF/s, 1 GB DDR
Node Card (32 chips, 4x4x2), 16 compute cards, 64 processors: 90/180 GF/s, 16 GB DDR
Rack (32 node boards, 8x8x16), 2048 processors: 2.9/5.7 TF/s, 0.5 TB DDR
System (64 racks, 64x32x32), 131,072 processors: 180/360 TF/s, 32 TB DDR

"Fastest Computer": BG/L, 700 MHz, 32K processors (16 racks), built from the BlueGene/L Compute ASIC
     Peak: 91.7 Tflop/s; Linpack: 70.7 Tflop/s (77% of peak)
     The full system totals 131,072 processors.
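The packaging numbers above follow from a simple per-processor rate. A back-of-envelope sketch; the 4 flops/cycle figure for the 700 MHz PowerPC 440 core is my assumption, not stated on the slide.

    per_proc_gflops = 0.7 * 4          # 700 MHz x 4 flops/cycle = 2.8 Gflop/s
    levels = {"chip": 2, "compute card": 4, "node card": 64,
              "rack": 2048, "system": 131072}
    for name, procs in levels.items():
        print(f"{name:12s}: {procs * per_proc_gflops:9.1f} Gflop/s peak")

    # The 16-rack (32K processor) configuration that ran Linpack:
    peak_tflops = 32768 * per_proc_gflops / 1000        # ~91.8 Tflop/s
    print(f"Linpack efficiency: {70.7 / peak_tflops:.0%}")   # ~77% of peak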

BlueGene/L Interconnection Networks

3 Dimensional Torus
  • Interconnects all compute nodes (65,536)
  • Virtual cut-through hardware routing
  • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  • 1 µs latency between nearest neighbors, 5 µs to the farthest
  • 4 µs latency for one hop with MPI, 10 µs to the farthest
  • Communications backbone for computations
  • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global Tree
  • Interconnects all compute and I/O nodes (1024)
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 2.8 Gb/s of bandwidth per link
  • Latency of one way tree traversal 2.5 µs
  • ~23 TB/s total binary tree bandwidth (64k machine)

Ethernet
  • Incorporated into every node ASIC
  • Active in the I/O nodes (1:64)
  • All external comm. (file I/O, control, user interaction, etc.)

Low Latency Global Barrier and Interrupt
  • Latency of round trip 1.3 µs

Control Network
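Rough arithmetic behind the torus figures; this is my own back-of-envelope check, not text from the slide.

    nodes = 65536
    link_gbps = 1.4                    # Gb/s per link, per direction
    links_per_node = 12                # 6 neighbours x 2 directions

    per_node_GBps = links_per_node * link_gbps / 8
    print(f"per-node torus bandwidth: {per_node_GBps:.1f} GB/s")    # ~2.1 GB/s

    total_TBps = nodes * per_node_GBps / 2 / 1000                   # each link shared by two nodes
    print(f"aggregate torus bandwidth: {total_TBps:.0f} TB/s")      # ~68 TB/s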

NASA Ames: SGI Altix Columbia, 10,240 Processor System

♦ Architecture: Hybrid Technical Server Cluster
♦ Vendor: SGI, based on Altix systems
♦ Deployment: Today
♦ Node: 1.5 GHz Itanium 2 processors, 512 procs/node (20 cabinets), dual FPUs per processor
♦ System: 20 Altix NUMA systems @ 512 procs/node = 10,240 procs; 320 cabinets (estimate 16 per node)
     Peak: 61.4 Tflop/s; LINPACK: 52 Tflop/s
♦ Interconnect: FastNumaFlex (custom hypercube) within node, Infiniband between nodes
♦ Pluses: large and powerful DSM nodes
♦ Potential problems (gotchas): power consumption of 100 kW per node (2 MW total)

SX-8 Architecture

♦ Upward compatible with the SX-5/SX-6
♦ Vector pipelines
     4 logical pipelines : 2 GHz
     144 KB vector register
     Hardware support of the SQRT instruction
♦ Scalar processor
     4-way superscalar RISC
♦ Main memory: 2 types of RAM
     DDR2-SDRAM: large capacity, 128 GB/node
     FCRAM: high speed, 64 GB/node
♦ Multi node system
     Up to 512 nodes
     64 TFLOPS
♦ Enhanced I/O performance
     Reduction of I/O overhead by adopting direct CPU control

[Diagram: the Central Processing Unit, with a scalar unit (scalar registers, scalar execution unit, cache memory), a 4-wide vector unit (vector registers, mask register, load/store, mask, logical, multiply, add/shift, and divide/SQRT pipes), the XMU, shared main memory, the input/output subsystem, and the inter-node connection]


SX-8 Single Node Module

♦ Up to 8 CPUs/node
     Peak Vector Performance (PVP): 16 GFLOPS/CPU, 128 GFLOPS/node
♦ Symmetric multiprocessing (SMP)
♦ Large capacity memory
     Up to 128 GB
♦ Ultra-high memory bandwidth
     64 GB/s per CPU
     Total 512 GB/s per node
♦ Large I/O throughput
     12.8 GB/s per node

[Diagram: up to 8 CPUs sharing the node memory and I/O subsystem, with a connection to the IXS inter-node switch]

Large Scale Multi Node System

Key points for high performance:
  1. Single node performance
  2. Maximum number of nodes
  3. Data transfer rate among nodes

♦ High speed inter-node switch (IXS)
     Very efficient non-blocking switch
     Optical interconnection
     Peak data transfer rate: max 8 TB/s (16 GB/s x 2 per node)
♦ Up to 512 nodes at 128 GFLOPS per node
♦ 2x, 4x, and 8x improvements relative to the SX-6 (as labeled in the diagram)

High speed processing of large data through high performance single nodes, a large number of nodes, and high speed interconnects among the nodes.

9.2 Tflop/s at HLRS (72 nodes)

[Diagram: up to 512 nodes (Node #0 through Node #511, each with a maximum of 8 CPUs, shared memory, and I/O processors) connected through the IXS]
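A quick check of how these figures compose; the multiplications are my own, using the per-node numbers from the single-node slide.

    node_gflops = 128                     # peak per SX-8 node
    nodes_max = 512
    print(f"full system peak: {nodes_max * node_gflops / 1000:.1f} Tflop/s")   # slide quotes 64 TFLOPS
    print(f"HLRS (72 nodes):  {72 * node_gflops / 1000:.1f} Tflop/s")          # ~9.2 Tflop/s
    print(f"IXS aggregate:    {nodes_max * 16 / 1000:.1f} TB/s per direction") # ~8 TB/s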

Fuel Efficiency: Gflops/Watt

[Chart: Gflops/Watt for the top 20 systems, based on processor power rating only, spanning roughly 0.1 to 0.9 Gflops/Watt. The BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440) leads; other systems shown include SGI Altix (1.5 GHz Itanium2, Voltaire Infiniband), the Earth-Simulator, eServer BladeCenter JS20+ (2.2 GHz PowerPC 970, Myrinet), Itanium2 Tiger4 and Integrity rx2600 clusters with Quadrics, ASCI Q (AlphaServer SC45, 1.25 GHz), Virginia Tech's dual 2.3 GHz Apple XServe cluster, the BlueGene/L DD1 and DD2 prototypes, various eServer pSeries 655/690 (Power4+) systems, Xeon and Opteron clusters with Myrinet or Quadrics, the RIKEN Super Combined Cluster, Dawning 4000A, ASCI White and other SP Power3 systems, TeraGrid Itanium2, the AIST Super Cluster, and Cray X1.]
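The metric itself is a simple ratio. A sketch with made-up numbers; the system size, Rmax, and per-processor wattage below are purely illustrative and are not taken from the chart.

    def gflops_per_watt(rmax_gflops, n_procs, watts_per_proc):
        # "based on processor power rating only": memory, interconnect,
        # and cooling power are not counted.
        return rmax_gflops / (n_procs * watts_per_proc)

    # a hypothetical 4,096-processor cluster sustaining 10 Tflop/s on Linpack,
    # with processors rated at 12 W each:
    print(f"{gflops_per_watt(10_000, 4_096, 12.0):.2f} Gflops/Watt")   # ~0.20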

KFlop/s per Capita (Flops/Pop)

[Chart: KFlop/s per capita by country, roughly 200 to 1,600, covering China, Brazil, Italy, Mexico, France, South Korea, Saudi Arabia, Germany, Belarus, Switzerland, Canada, Spain, Japan, Israel, the United Kingdom, New Zealand, and the United States. New Zealand's high standing is annotated "WETA Digital (Lord of the Rings)".]

Important Metrics: Sustained Performance and Cost

♦ Commodity processors
     Optimized for commercial applications.
     Meet the needs of most of the scientific computing market.
     Provide the shortest time-to-solution and the highest sustained performance per unit cost for a broad range of applications that have significant spatial and temporal locality (good cache use).
♦ Custom processors
     For bandwidth-intensive applications that do not cache well, custom processors are more cost effective, hence offering better capacity on just those applications.

High Bandwidth vs Commodity Systems

♦ High bandwidth systems have traditionally been vector computers
     Designed for scientific problems
     Capability computing
♦ Commodity processors are designed for web servers and the home PC market
     (we should be thankful that the manufacturers keep 64-bit floating point)
     Used for cluster based computers, leveraging their price point
♦ Scientific computing needs are different
     They require a better balance between data movement and floating point operations, which results in greater efficiency.

                            NEC SX-8     Cray X1       ASCI Q       Intel        VT Big Mac
                            (NEC)        (Cray)        (HP EV68)    (Dual Xeon)  (Dual IBM PPC)
Year of Introduction        2005         2003          2002         2004         2003
Node Architecture           Vector       Vector        Alpha        Pentium      Power PC
Processor Cycle Time        2 GHz        800 MHz       1.25 GHz     3.6 GHz      2 GHz
Peak Speed per Processor    16 Gflop/s   12.8 Gflop/s  2.5 Gflop/s  7.2 Gflop/s  8 Gflop/s
Bytes/flop (main memory)    4            2.6           0.8          0.88         0.5
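The bytes/flop row is main-memory bandwidth per processor divided by peak flop rate. A small sketch; the SX-8 figures come from the SX-8 slides above, while the 6.4 GB/s front-side-bus figure for the Xeon is my assumption.

    def bytes_per_flop(mem_bw_GBps, peak_gflops):
        # main-memory bandwidth per processor / peak floating point rate
        return mem_bw_GBps / peak_gflops

    print(f"NEC SX-8:  {bytes_per_flop(64.0, 16.0):.2f} bytes/flop")   # 4.0
    print(f"Dual Xeon: {bytes_per_flop(6.4, 7.2):.2f} bytes/flop")     # ~0.89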


Commodity: Memory Latency and Flop Rate

[Chart: memory latency (ns) and time per floating point operation (ns), Jan 1992 to Jan 2006, log scale from 0.1 to 1000 ns]

Commodity Processor Trends

                                         Annual increase   Typical value in 2004            Typical value in 2010             Typical value in 2020
Single-chip floating-point performance   59%               2 GFLOP/s                        32 GFLOP/s                        3300 GFLOP/s
Front-side bus bandwidth                 23%               1 GWord/s (0.5 word/flop)        3.5 GWord/s (0.11 word/flop)      27 GWord/s (0.008 word/flop)
DRAM bandwidth                           25%               100 MWord/s (0.05 word/flop)     380 MWord/s (0.012 word/flop)     3600 MWord/s (0.0011 word/flop)
DRAM latency                             (5.5%)            70 ns = 140 FP ops = 70 loads    50 ns = 1600 FP ops = 170 loads   28 ns = 94,000 FP ops = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
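The DRAM latency row expresses latency in units of peak floating point work. A quick check of that arithmetic (my own, using the table's values):

    scenarios = {
        # year: (DRAM latency in ns, single-chip peak in GFLOP/s)
        2004: (70, 2),
        2010: (50, 32),
        2020: (28, 3300),
    }
    for year, (ns, gflops) in scenarios.items():
        # 1 ns x 1 GFLOP/s = 1 floating point operation
        print(f"{year}: {ns} ns x {gflops} GFLOP/s = {ns * gflops:,} FP ops per DRAM access")
    # 2020 gives 92,400, close to the table's rounded 94,000.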

System Balance (Network)

Network speed (MB/s) vs node speed (flop/s): communication/computation balance in Bytes/Flop (higher is better)

[Chart: balance values from 2.00 down to 0.02 Bytes/Flop for NEC SX-8, Cray X1, Cray Red Storm, ASCI Red, Cray T3E/1200, Blue Gene/L, ASCI Blue Mountain, ASCI White, LANL Pink, PSC Lemieux, and ASCI Purple]

Performance Projection

[Chart: TOP500 extrapolation, 1993-2015, log scale from 100 Mflop/s to 1 Eflop/s, with N=1, N=500, and SUM trend lines, the DARPA HPCS target, BlueGene/L, and "My Laptop" marked]

SETI@home: Global Distributed Computing

♦ Running on 500,000 PCs, ~1300 CPU years per day; 1.3M CPU years so far
♦ Sophisticated data & signal processing analysis
♦ Distributes datasets from the Arecibo radio telescope

SETI@home

♦ Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle or being wasted, the software downloads a ~half-MB chunk of data for analysis; each client performs about 3 Tflop of work in roughly 15 hours.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ About 5M users
♦ Largest distributed computation project in existence, averaging 72 Tflop/s


♦ Google query attributes
     150M queries/day (2,000/second)
     100 countries
     8.0B documents in the index
♦ Data centers
     100,000 Linux systems in data centers around the world
     15 TFlop/s and 1000 TB total capability
     40-80 1U/2U servers per cabinet
     100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
     Growth from 4,000 systems (June 2000), 18M queries/day then
♦ Performance and operation
     Simple reissue of failed commands to new servers
     No performance debugging: problems are not reproducible

Source: Monika Henzinger, Google & Cleve Moler

Forward links are referred to in the rows; back links are referred to in the columns.
Eigenvalue problem: Ax = λx with n = 8x10^9 (see: MathWorks, Cleve's Corner).
The matrix is the transition probability matrix of the Markov chain, so the ranking vector satisfies Ax = x.
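A toy sketch of that eigenvalue problem (my own illustration on a hypothetical 4-page web, not Google's code or data): power iteration on a column-stochastic link matrix A converges to the vector x with Ax = x.

    import numpy as np

    # A[i, j] = probability of following a link from page j to page i
    A = np.array([
        [0.0, 0.5, 0.0, 0.0],
        [0.5, 0.0, 0.5, 1.0],
        [0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.5, 0.0],
    ])
    x = np.full(4, 0.25)          # start from a uniform distribution
    for _ in range(100):
        x = A @ x                 # one step of the Markov chain
        x /= x.sum()              # keep x a probability vector
    print(np.round(x, 3))         # stationary vector: the pages' "ranks"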

The Grid

♦ The Grid is about gathering resources ...
     run programs, access data, provide services, collaborate
♦ ... to enable and exploit large scale sharing of resources
♦ Virtual organization
     Loosely coordinated groups
♦ Provides for remote access to resources
     Scalable, secure, reliable mechanisms for discovery and access
♦ In some ideal setting: a user submits work, and the infrastructure finds an execution target; ideally you don't care where.

Science Grid Projects

TeraGrid 2003: Prototype for a National Cyberinfrastructure

[Map: TeraGrid sites connected by 40 Gb/s, 30 Gb/s, 20 Gb/s, and 10 Gb/s network links]

A German Grid Initiative: D-GRID

♦ Initially driven by the HGF centers and the DFN-Verein (2002)
♦ Meanwhile: more than 100 further partners in academia and industry
♦ Aims at a coordination of Grid activities in Germany
♦ Deployment of a new generation networking infrastructure (example: project VIOLA)
♦ Promotion of open standards for interfaces and protocols (GGF)

[Map: the German research network, with sites from Kiel and Rostock in the north to Garching in the south, 10 Gbit/s, 2.4 Gbit/s, and 622 Mbit/s links, and a global upstream]

Atmospheric Sciences Grid

[Diagram: real time data feeding a data fusion step that drives a general circulation model, a regional weather model, a photo-chemical pollution model, a particle dispersion model, and a bushfire model, drawing on topography, vegetation, and emissions inventory databases]


Standard Implementation

[Diagram: the same atmospheric-sciences workflow with its couplings labeled: GASS connects the real time data and the bushfire model, the component models communicate via MPI, data fusion and the databases are linked via GASS/GridFTP/GRC, and a "Change Models" annotation is shown]

The Grid: The Good, The Bad, and The Ugly

♦ Good: vision; community; developed functional software
♦ Bad: oversold the grid concept; still too hard to use; a solution in search of a problem; underestimated the technical difficulties; not enough of a scientific discipline
♦ Ugly: authentication and security

PlayStation 3

♦ The PlayStation 3's CPU is based on a chip codenamed "Cell"
♦ Each Cell contains 8 APUs
     An APU is a self contained vector processor which acts independently from the others
     4 floating point units capable of 32 Gflop/s (8 Gflop/s each)
     256 Gflop/s peak!
     32 bit floating point only; not even IEEE
     Datapaths "lite"

The Computing Continuum

♦ Each strikes a different balance
     computation/communication coupling
♦ Implications for execution efficiency
♦ Applications for diverse needs
     computing is only one part of the story!

[Diagram: continuum from tightly coupled to loosely coupled: Highly Parallel systems, Clusters, "Grids", and special purpose "SETI / Google" systems]

Grids vs. Capability vs. Cluster Computing

♦ Not an "either/or" question
     Each addresses different needs; each is part of an integrated solution
♦ Grid strengths
     Coupling necessarily distributed resources: instruments, software, hardware, archives, and people
     Eliminating time and space barriers: remote resource access and capacity computing
     Grids are not a cheap substitute for capability HPC
♦ Highest performance computing strengths
     Supporting foundational computations: terascale and petascale "nation scale" problems
     Engaging tightly coupled computations and teams
♦ Clusters
     Low cost, group solution
     Potential hidden costs
♦ The key is easy access to resources in a transparent way

The Real Crisis With HPC Is With The Software

♦ Programming is stuck
     Arguably it hasn't changed since the 60's
♦ It's time for a change
     Complexity is rising dramatically
          highly parallel and distributed systems: from 10 to 100 to 1,000 to 10,000 to 100,000 processors!!
          multidisciplinary applications
♦ A supercomputer application and its software are usually much longer-lived than the hardware
     Hardware life is typically five years at most
     Fortran and C are the main programming models
♦ Software is a major cost component of modern technologies
     The tradition in HPC system procurement is to assume that the software is free.
♦ We don't have many great ideas about how to solve this problem.


Collaborators

♦ TOP500
     H. Meuer, Mannheim U
     H. Simon, NERSC
     E. Strohmaier, NERSC