An Overview of Supercomputers, Clusters and Grid

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

3/18/2005

A Growth-Factor of a Billion in Performance in a Career

[Chart: peak performance of leading systems, 1950-2010, on a log scale from 1 KFlop/s to 1 PFlop/s: EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L; architectural eras: Scalar, Super Scalar, Vector, Parallel, Super Scalar/Vector/Parallel]

1941                  1 Flop/s (floating point operations per second)
1945                100
1949              1,000  (1 KiloFlop/s, KFlop/s, 10^3)
1951             10,000
1961            100,000
1964          1,000,000  (1 MegaFlop/s, MFlop/s, 10^6)
1968         10,000,000
1975        100,000,000
1987      1,000,000,000  (1 GigaFlop/s, GFlop/s, 10^9)
1992     10,000,000,000
1993    100,000,000,000
1997  1,000,000,000,000  (1 TeraFlop/s, TFlop/s, 10^12)
2000  10,000,000,000,000
2003  35,000,000,000,000 (35 TFlop/s)

2X Transistors/Chip Every 1.5 Years
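A quick sanity check of the growth claim. The start and end points below are taken from the table above; the doubling-time arithmetic is my own illustration, not part of the slides.

    # Minimal sketch: a factor of ~10^9 over a working career is consistent
    # with performance doubling roughly every 1.5 years.
    from math import log

    start_year, start_flops = 1949, 1e3     # 1 KFlop/s
    end_year, end_flops = 1997, 1e12        # 1 TFlop/s
    years = end_year - start_year
    growth = end_flops / start_flops        # a factor of 10^9
    doubling_time = years * log(2) / log(growth)
    print(f"growth: {growth:.0e}x over {years} years")
    print(f"implied doubling time: {doubling_time:.1f} years")   # ~1.6 years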


♦ H. Meuer, H. Simon, E. Strohmaier, & JD
♦ Listing of the 500 most powerful computers in the world
♦ Yardstick: Rmax from the LINPACK MPP benchmark
     Ax = b, dense problem; TPP performance (rate vs. size); see the sketch after this list
♦ Updated twice a year
     SC'xy in the States in November
     Meeting in Mannheim, Germany in June
♦ All data available from www.top500.org
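A minimal sketch of what the Rmax yardstick measures (not the actual HPL benchmark code): solve a dense Ax = b and report a rate using the standard LINPACK operation count. The problem size n below is an arbitrary illustrative choice.

    import time
    import numpy as np

    n = 2000                               # illustrative problem size
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)              # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2  # nominal LINPACK flop count
    print(f"n = {n}: {flops / elapsed / 1e9:.2f} Gflop/s")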

TOP500 Performance - November 2004

[Chart: aggregate (SUM), #1 (N=1), and #500 (N=500) Linpack performance, 1993-2004, on a log scale from 100 Mflop/s to 1 Pflop/s. N=1 rises from 59.7 GF/s (Fujitsu 'NWT', NAL) through Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator to 70.72 TF/s (IBM BlueGene/L); N=500 rises from 0.4 GF/s to 850 GF/s; SUM rises from 1.167 TF/s to 1.127 PF/s. "My Laptop" is marked for scale.]

Vibrant Field for High Performance Computers

♦ Cray X1, XD1, XT3
♦ SGI Altix
♦ IBM Regatta
♦ IBM Blue Gene/L
♦ IBM eServer
♦ Sun
♦ HP
♦ Dawning
♦ Bull NovaScale
♦ Lenovo
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-8
♦ Apple
♦ Coming soon ...
     Cray BlackWidow
     Galactic Computing (Steve Chen)

Architecture/Systems Continuum (from tightly coupled to loosely coupled)

♦ Custom processor with custom interconnect
     Cray X1
     NEC SX-8
     IBM Regatta
     IBM Blue Gene/L
     Trade-offs: best processor performance for codes that are not "cache friendly"; good communication performance; simplest programming model; most expensive
♦ Commodity processor with custom interconnect
     SGI Altix (Intel Itanium 2)
     Cray XT3, "Red Storm" (AMD Opteron)
     Trade-offs: good communication performance; good scalability
♦ Commodity processor with commodity interconnect
     Clusters: Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, or Quadrics
     NEC TX7
     IBM eServer
     Dawning
     Trade-offs: best price/performance (for codes that work well with caches and are latency tolerant); more complex programming model

[Chart: share of TOP500 systems classified as custom, commodity, and hybrid, Jun 1993 - Jun 2004, 0-100%]


Architectures / Systems

[Chart: number of TOP500 systems by architecture class (SIMD, Single Processor, Cluster, Constellation, SMP, MPP), 1993-2004, 0-500 systems]

Processor Types

[Chart: number of TOP500 systems by processor type (SIMD, Vector, and Scalar: Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2004, 0-500 systems]

Commodity Processors

♦ Intel Pentium Nocona: 3.6 GHz, peak = 7.2 Gflop/s; Linpack 100 = 1.8 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron: 2.2 GHz, peak = 4.4 Gflop/s; Linpack 100 = 1.3 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2: 1.5 GHz, peak = 6 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 5.4 Gflop/s
♦ HP PA RISC
♦ Sun UltraSPARC IV
♦ HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s peak
♦ MIPS R16000
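The peak figures above follow from clock rate times floating point results per cycle. A small sketch of that relationship; the flops-per-cycle values are standard figures for these chips and are my assumption, not stated on the slide.

    processors = {
        # name: (clock in GHz, double-precision flops per cycle)
        "Intel Pentium Nocona": (3.6, 2),   # SSE2: 2 flops/cycle
        "AMD Opteron":          (2.2, 2),
        "Intel Itanium 2":      (1.5, 4),   # 2 fused multiply-adds/cycle
        "HP Alpha EV68":        (1.25, 2),
    }
    for name, (ghz, per_cycle) in processors.items():
        print(f"{name}: peak = {ghz * per_cycle:.1f} Gflop/s")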

Top500 Performance by Manufacturer (11/04)

IBM 49%, HP 21%, Others 14%, SGI 7%, NEC 4%, Fujitsu 2%, Cray 2%, Hitachi 1%, Sun 0%, Intel 0%

Commodity Interconnects

♦ Gig Ethernet
♦ Myrinet
♦ Infiniband
♦ QsNet
♦ SCI

                   Switch     Cost     Cost      Cost     MPI Lat / 1-way / Bi-Dir
                   topology   NIC      Sw/node   Node     (us) / MB/s / MB/s
Gigabit Ethernet   Bus        $   50   $   50    $  100   30 / 100 / 150
SCI                Torus      $1,600   $    0    $1,600    5 / 300 / 400
QsNetII (R)        Fat Tree   $1,200   $1,700    $2,900    3 / 880 / 900
QsNetII (E)        Fat Tree   $1,000   $  700    $1,700    3 / 880 / 900
Myrinet (D card)   Clos       $  595   $  400    $  995    6.5 / 240 / 480
Myrinet (E card)   Clos       $  995   $  400    $1,395    6 / 450 / 900
IB 4x              Fat Tree   $1,000   $  400    $1,400    6 / 820 / 790

Interconnects / Systems

[Chart: number of TOP500 systems by interconnect family (Others, Infiniband, Quadrics, Gigabit Ethernet, Cray Interconnect, Myrinet, SP Switch, Crossbar, N/A), 1993-2004, 0-500 systems]


24th List: The TOP10

Rank  Manufacturer  Computer                                  Rmax [TF/s]  Installation Site                       Country  Year  #Proc
  1   IBM           BlueGene/L β-System                           70.72    DOE/IBM                                 USA      2004  32768
  2   SGI           Columbia (Altix, Infiniband)                  51.87    NASA Ames                               USA      2004  10160
  3   NEC           Earth-Simulator                               35.86    Earth Simulator Center                  Japan    2002   5120
  4   IBM           MareNostrum (BladeCenter JS20, Myrinet)       20.53    Barcelona Supercomputer Center          Spain    2004   3564
  5   CCD           Thunder (Itanium2, Quadrics)                  19.94    Lawrence Livermore National Laboratory  USA      2004   4096
  6   HP            ASCI Q (AlphaServer SC, Quadrics)             13.88    Los Alamos National Laboratory          USA      2002   8192
  7   Self Made     X (Apple XServe, Infiniband)                  12.25    Virginia Tech                           USA      2004   2200
  8   IBM/LLNL      BlueGene/L DD1 (500 MHz)                      11.68    Lawrence Livermore National Laboratory  USA      2004   8192
  9   IBM           pSeries 655                                   10.31    Naval Oceanographic Office              USA      2004   2944
 10   Dell          Tungsten (PowerEdge, Myrinet)                  9.82    NCSA                                    USA      2003   2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top10 average 8K processors; 35 systems are in Germany

How Big Is Big?

♦ Every 10X brings new challenges
     64 processors was once considered large; it hasn't been "large" for quite a while
     1024 processors is today's "medium" size
     8096 processors is today's "large"; we're struggling even here
♦ 100K processor systems are in construction
     We have fundamental challenges in dealing with machines of this size ... and little in the way of programming support

[Chart: median size of the Top 10 MPPs and clusters, Sep 1995 - May 2009, 500 to 4,000 CPUs, with an exponential fit y = 7e-06 * e^(0.0005x)]

IBM BlueGene/L: 131,072 Processors

Chip (2 processors): 2.8/5.6 GF/s, 4 MB (cache)
Compute Card (2 chips, 2x1x1), 4 processors: 5.6/11.2 GF/s, 1 GB DDR
Node Card (32 chips, 4x4x2), 16 compute cards, 64 processors: 90/180 GF/s, 16 GB DDR
Rack (32 node boards, 8x8x16), 2048 processors: 2.9/5.7 TF/s, 0.5 TB DDR
System (64 racks, 64x32x32), 131,072 processors: 180/360 TF/s, 32 TB DDR

"Fastest Computer": BG/L, 700 MHz, 32K processors (16 racks), built from the BlueGene/L Compute ASIC
     Peak: 91.7 Tflop/s; Linpack: 70.7 Tflop/s (77% of peak)
     The full system totals 131,072 processors.
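The packaging numbers above follow from a simple per-processor rate. A back-of-envelope sketch; the 4 flops/cycle figure for the 700 MHz PowerPC 440 core is my assumption, not stated on the slide.

    per_proc_gflops = 0.7 * 4          # 700 MHz x 4 flops/cycle = 2.8 Gflop/s
    levels = {"chip": 2, "compute card": 4, "node card": 64,
              "rack": 2048, "system": 131072}
    for name, procs in levels.items():
        print(f"{name:12s}: {procs * per_proc_gflops:9.1f} Gflop/s peak")

    # The 16-rack (32K processor) configuration that ran Linpack:
    peak_tflops = 32768 * per_proc_gflops / 1000        # ~91.8 Tflop/s
    print(f"Linpack efficiency: {70.7 / peak_tflops:.0%}")   # ~77% of peak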

BlueGene/L Interconnection Networks

3 Dimensional Torus
  • Interconnects all compute nodes (65,536)
  • Virtual cut-through hardware routing
  • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  • 1 µs latency between nearest neighbors, 5 µs to the farthest
  • 4 µs latency for one hop with MPI, 10 µs to the farthest
  • Communications backbone for computations
  • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global Tree
  • Interconnects all compute and I/O nodes (1024)
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 2.8 Gb/s of bandwidth per link
  • Latency of one way tree traversal 2.5 µs
  • ~23 TB/s total binary tree bandwidth (64k machine)

Ethernet
  • Incorporated into every node ASIC
  • Active in the I/O nodes (1:64)
  • All external comm. (file I/O, control, user interaction, etc.)

Low Latency Global Barrier and Interrupt
  • Latency of round trip 1.3 µs

Control Network
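Rough arithmetic behind the torus figures; this is my own back-of-envelope check, not text from the slide.

    nodes = 65536
    link_gbps = 1.4                    # Gb/s per link, per direction
    links_per_node = 12                # 6 neighbours x 2 directions

    per_node_GBps = links_per_node * link_gbps / 8
    print(f"per-node torus bandwidth: {per_node_GBps:.1f} GB/s")    # ~2.1 GB/s

    total_TBps = nodes * per_node_GBps / 2 / 1000                   # each link shared by two nodes
    print(f"aggregate torus bandwidth: {total_TBps:.0f} TB/s")      # ~68 TB/s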

NASA Ames: SGI Altix Columbia, 10,240 Processor System

♦ Architecture: Hybrid Technical Server Cluster
♦ Vendor: SGI, based on Altix systems
♦ Deployment: Today
♦ Node: 1.5 GHz Itanium 2 processors, 512 procs/node (20 cabinets), dual FPUs per processor
♦ System: 20 Altix NUMA systems @ 512 procs/node = 10,240 procs; 320 cabinets (estimate 16 per node)
     Peak: 61.4 Tflop/s; LINPACK: 52 Tflop/s
♦ Interconnect: FastNumaFlex (custom hypercube) within node, Infiniband between nodes
♦ Pluses: large and powerful DSM nodes
♦ Potential problems (gotchas): power consumption of 100 kW per node (2 MW total)

SX-8 Architecture

♦ Upward compatible with the SX-5/SX-6
♦ Vector pipelines
     4 logical pipelines : 2 GHz
     144 KB vector register
     Hardware support of the SQRT instruction
♦ Scalar processor
     4-way superscalar RISC
♦ Main memory: 2 types of RAM
     DDR2-SDRAM: large capacity, 128 GB/node
     FCRAM: high speed, 64 GB/node
♦ Multi node system
     Up to 512 nodes
     64 TFLOPS
♦ Enhanced I/O performance
     Reduction of I/O overhead by adopting direct CPU control

[Diagram: the Central Processing Unit, with a scalar unit (scalar registers, scalar execution unit, cache memory), a 4-wide vector unit (vector registers, mask register, load/store, mask, logical, multiply, add/shift, and divide/SQRT pipes), the XMU, shared main memory, the input/output subsystem, and the inter-node connection]


SX-8 Single Node Module

♦ Up to 8 CPUs/node
     Peak Vector Performance (PVP): 16 GFLOPS/CPU, 128 GFLOPS/node
♦ Symmetric multiprocessing (SMP)
♦ Large capacity memory
     Up to 128 GB
♦ Ultra-high memory bandwidth
     64 GB/s per CPU
     Total 512 GB/s per node
♦ Large I/O throughput
     12.8 GB/s per node

[Diagram: up to 8 CPUs sharing the node memory and I/O subsystem, with a connection to the IXS inter-node switch]

Large Scale Multi Node System

Key points for high performance:
  1. Single node performance
  2. Maximum number of nodes
  3. Data transfer rate among nodes

♦ High speed inter-node switch (IXS)
     Very efficient non-blocking switch
     Optical interconnection
     Peak data transfer rate: max 8 TB/s (16 GB/s x 2 per node)
♦ Up to 512 nodes at 128 GFLOPS per node
♦ 2x, 4x, and 8x improvements relative to the SX-6 (as labeled in the diagram)

High speed processing of large data through high performance single nodes, a large number of nodes, and high speed interconnects among the nodes.

9.2 Tflop/s at HLRS (72 nodes)

[Diagram: up to 512 nodes (Node #0 through Node #511, each with a maximum of 8 CPUs, shared memory, and I/O processors) connected through the IXS]
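A quick check of how these figures compose; the multiplications are my own, using the per-node numbers from the single-node slide.

    node_gflops = 128                     # peak per SX-8 node
    nodes_max = 512
    print(f"full system peak: {nodes_max * node_gflops / 1000:.1f} Tflop/s")   # slide quotes 64 TFLOPS
    print(f"HLRS (72 nodes):  {72 * node_gflops / 1000:.1f} Tflop/s")          # ~9.2 Tflop/s
    print(f"IXS aggregate:    {nodes_max * 16 / 1000:.1f} TB/s per direction") # ~8 TB/s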

Fuel Efficiency: Gflops/Watt

[Chart: Gflops/Watt for the top 20 systems, based on processor power rating only, spanning roughly 0.1 to 0.9 Gflops/Watt. The BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440) leads; other systems shown include SGI Altix (1.5 GHz Itanium2, Voltaire Infiniband), the Earth-Simulator, eServer BladeCenter JS20+ (2.2 GHz PowerPC 970, Myrinet), Itanium2 Tiger4 and Integrity rx2600 clusters with Quadrics, ASCI Q (AlphaServer SC45, 1.25 GHz), Virginia Tech's dual 2.3 GHz Apple XServe cluster, the BlueGene/L DD1 and DD2 prototypes, various eServer pSeries 655/690 (Power4+) systems, Xeon and Opteron clusters with Myrinet or Quadrics, the RIKEN Super Combined Cluster, Dawning 4000A, ASCI White and other SP Power3 systems, TeraGrid Itanium2, the AIST Super Cluster, and Cray X1.]
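The metric itself is a simple ratio. A sketch with made-up numbers; the system size, Rmax, and per-processor wattage below are purely illustrative and are not taken from the chart.

    def gflops_per_watt(rmax_gflops, n_procs, watts_per_proc):
        # "based on processor power rating only": memory, interconnect,
        # and cooling power are not counted.
        return rmax_gflops / (n_procs * watts_per_proc)

    # a hypothetical 4,096-processor cluster sustaining 10 Tflop/s on Linpack,
    # with processors rated at 12 W each:
    print(f"{gflops_per_watt(10_000, 4_096, 12.0):.2f} Gflops/Watt")   # ~0.20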

KFlop/s per Capita (Flops/Pop)

[Chart: KFlop/s per capita by country, roughly 200 to 1,600, covering China, Brazil, Italy, Mexico, France, South Korea, Saudi Arabia, Germany, Belarus, Switzerland, Canada, Spain, Japan, Israel, the United Kingdom, New Zealand, and the United States. New Zealand's high standing is annotated "WETA Digital (Lord of the Rings)".]

Important Metrics: Sustained Performance and Cost

♦ Commodity processors
     Optimized for commercial applications.
     Meet the needs of most of the scientific computing market.
     Provide the shortest time-to-solution and the highest sustained performance per unit cost for a broad range of applications that have significant spatial and temporal locality (good cache use).
♦ Custom processors
     For bandwidth-intensive applications that do not cache well, custom processors are more cost effective, hence offering better capacity on just those applications.

High Bandwidth vs Commodity Systems

♦ High bandwidth systems have traditionally been vector computers
     Designed for scientific problems
     Capability computing
♦ Commodity processors are designed for web servers and the home PC market
     (we should be thankful that the manufacturers keep 64-bit floating point)
     Used for cluster based computers, leveraging their price point
♦ Scientific computing needs are different
     They require a better balance between data movement and floating point operations, which results in greater efficiency.

                            NEC SX-8     Cray X1       ASCI Q       Intel        VT Big Mac
                            (NEC)        (Cray)        (HP EV68)    (Dual Xeon)  (Dual IBM PPC)
Year of Introduction        2005         2003          2002         2004         2003
Node Architecture           Vector       Vector        Alpha        Pentium      Power PC
Processor Cycle Time        2 GHz        800 MHz       1.25 GHz     3.6 GHz      2 GHz
Peak Speed per Processor    16 Gflop/s   12.8 Gflop/s  2.5 Gflop/s  7.2 Gflop/s  8 Gflop/s
Bytes/flop (main memory)    4            2.6           0.8          0.88         0.5
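The bytes/flop row is main-memory bandwidth per processor divided by peak flop rate. A small sketch; the SX-8 figures come from the SX-8 slides above, while the 6.4 GB/s front-side-bus figure for the Xeon is my assumption.

    def bytes_per_flop(mem_bw_GBps, peak_gflops):
        # main-memory bandwidth per processor / peak floating point rate
        return mem_bw_GBps / peak_gflops

    print(f"NEC SX-8:  {bytes_per_flop(64.0, 16.0):.2f} bytes/flop")   # 4.0
    print(f"Dual Xeon: {bytes_per_flop(6.4, 7.2):.2f} bytes/flop")     # ~0.89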


Commodity: Memory Latency and Flop Rate

[Chart: memory latency (ns) and time per floating point operation (ns), Jan 1992 to Jan 2006, log scale from 0.1 to 1000 ns]

Commodity Processor Trends

                                         Annual increase   Typical value in 2004            Typical value in 2010             Typical value in 2020
Single-chip floating-point performance   59%               2 GFLOP/s                        32 GFLOP/s                        3300 GFLOP/s
Front-side bus bandwidth                 23%               1 GWord/s (0.5 word/flop)        3.5 GWord/s (0.11 word/flop)      27 GWord/s (0.008 word/flop)
DRAM bandwidth                           25%               100 MWord/s (0.05 word/flop)     380 MWord/s (0.012 word/flop)     3600 MWord/s (0.0011 word/flop)
DRAM latency                             (5.5%)            70 ns = 140 FP ops = 70 loads    50 ns = 1600 FP ops = 170 loads   28 ns = 94,000 FP ops = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.
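The DRAM latency row expresses latency in units of peak floating point work. A quick check of that arithmetic (my own, using the table's values):

    scenarios = {
        # year: (DRAM latency in ns, single-chip peak in GFLOP/s)
        2004: (70, 2),
        2010: (50, 32),
        2020: (28, 3300),
    }
    for year, (ns, gflops) in scenarios.items():
        # 1 ns x 1 GFLOP/s = 1 floating point operation
        print(f"{year}: {ns} ns x {gflops} GFLOP/s = {ns * gflops:,} FP ops per DRAM access")
    # 2020 gives 92,400, close to the table's rounded 94,000.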

System Balance (Network)

Network speed (MB/s) vs node speed (flop/s): communication/computation balance in Bytes/Flop (higher is better)

[Chart: balance values from 2.00 down to 0.02 Bytes/Flop for NEC SX-8, Cray X1, Cray Red Storm, ASCI Red, Cray T3E/1200, Blue Gene/L, ASCI Blue Mountain, ASCI White, LANL Pink, PSC Lemieux, and ASCI Purple]

Performance Projection

[Chart: TOP500 extrapolation, 1993-2015, log scale from 100 Mflop/s to 1 Eflop/s, with N=1, N=500, and SUM trend lines, the DARPA HPCS target, BlueGene/L, and "My Laptop" marked]

SETI@home: Global Distributed Computing

♦ Running on 500,000 PCs, ~1300 CPU years per day; 1.3M CPU years so far
♦ Sophisticated data & signal processing analysis
♦ Distributes datasets from the Arecibo radio telescope

SETI@home

♦ Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle or being wasted, the software downloads a ~half-MB chunk of data for analysis; each client performs about 3 Tflop of work in roughly 15 hours.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ About 5M users
♦ Largest distributed computation project in existence, averaging 72 Tflop/s


♦ Google query attributes
     150M queries/day (2,000/second)
     100 countries
     8.0B documents in the index
♦ Data centers
     100,000 Linux systems in data centers around the world
     15 TFlop/s and 1000 TB total capability
     40-80 1U/2U servers per cabinet
     100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
     Growth from 4,000 systems (June 2000), 18M queries/day then
♦ Performance and operation
     Simple reissue of failed commands to new servers
     No performance debugging: problems are not reproducible

Source: Monika Henzinger, Google & Cleve Moler

Forward links are referred to in the rows; back links are referred to in the columns.
Eigenvalue problem: Ax = λx with n = 8x10^9 (see: MathWorks, Cleve's Corner).
The matrix is the transition probability matrix of the Markov chain, so the ranking vector satisfies Ax = x.
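A toy sketch of that eigenvalue problem (my own illustration on a hypothetical 4-page web, not Google's code or data): power iteration on a column-stochastic link matrix A converges to the vector x with Ax = x.

    import numpy as np

    # A[i, j] = probability of following a link from page j to page i
    A = np.array([
        [0.0, 0.5, 0.0, 0.0],
        [0.5, 0.0, 0.5, 1.0],
        [0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.5, 0.0],
    ])
    x = np.full(4, 0.25)          # start from a uniform distribution
    for _ in range(100):
        x = A @ x                 # one step of the Markov chain
        x /= x.sum()              # keep x a probability vector
    print(np.round(x, 3))         # stationary vector: the pages' "ranks"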

The Grid

♦ The Grid is about gathering resources ...
     run programs, access data, provide services, collaborate
♦ ... to enable and exploit large scale sharing of resources
♦ Virtual organization
     Loosely coordinated groups
♦ Provides for remote access to resources
     Scalable, secure, reliable mechanisms for discovery and access
♦ In some ideal setting: a user submits work, and the infrastructure finds an execution target; ideally you don't care where.

Science Grid Projects

TeraGrid 2003: Prototype for a National Cyberinfrastructure

[Map: TeraGrid sites connected by 40 Gb/s, 30 Gb/s, 20 Gb/s, and 10 Gb/s network links]

A German Grid Initiative: D-GRID

♦ Initially driven by the HGF centers and the DFN-Verein (2002)
♦ Meanwhile: more than 100 further partners in academia and industry
♦ Aims at a coordination of Grid activities in Germany
♦ Deployment of a new generation networking infrastructure (example: project VIOLA)
♦ Promotion of open standards for interfaces and protocols (GGF)

[Map: the German research network, with sites from Kiel and Rostock in the north to Garching in the south, 10 Gbit/s, 2.4 Gbit/s, and 622 Mbit/s links, and a global upstream]

Atmospheric Sciences Grid

[Diagram: real time data feeding a data fusion step that drives a general circulation model, a regional weather model, a photo-chemical pollution model, a particle dispersion model, and a bushfire model, drawing on topography, vegetation, and emissions inventory databases]


Standard Implementation

[Diagram: the same atmospheric-sciences workflow with its couplings labeled: GASS connects the real time data and the bushfire model, the component models communicate via MPI, data fusion and the databases are linked via GASS/GridFTP/GRC, and a "Change Models" annotation is shown]

The Grid: The Good, The Bad, and The Ugly

♦ Good: vision; community; developed functional software
♦ Bad: oversold the grid concept; still too hard to use; a solution in search of a problem; underestimated the technical difficulties; not enough of a scientific discipline
♦ Ugly: authentication and security

PlayStation 3

♦ The PlayStation 3's CPU is based on a chip codenamed "Cell"
♦ Each Cell contains 8 APUs
     An APU is a self contained vector processor which acts independently from the others
     4 floating point units capable of 32 Gflop/s (8 Gflop/s each)
     256 Gflop/s peak!
     32 bit floating point only; not even IEEE
     Datapaths "lite"

The Computing Continuum

♦ Each strikes a different balance
     computation/communication coupling
♦ Implications for execution efficiency
♦ Applications for diverse needs
     computing is only one part of the story!

[Diagram: continuum from tightly coupled to loosely coupled: Highly Parallel systems, Clusters, "Grids", and special purpose "SETI / Google" systems]

Grids vs. Capability vs. Cluster Computing

♦ Not an "either/or" question
     Each addresses different needs; each is part of an integrated solution
♦ Grid strengths
     Coupling necessarily distributed resources: instruments, software, hardware, archives, and people
     Eliminating time and space barriers: remote resource access and capacity computing
     Grids are not a cheap substitute for capability HPC
♦ Highest performance computing strengths
     Supporting foundational computations: terascale and petascale "nation scale" problems
     Engaging tightly coupled computations and teams
♦ Clusters
     Low cost, group solution
     Potential hidden costs
♦ The key is easy access to resources in a transparent way

The Real Crisis With HPC Is With The Software

♦ Programming is stuck
     Arguably it hasn't changed since the 60's
♦ It's time for a change
     Complexity is rising dramatically
          highly parallel and distributed systems: from 10 to 100 to 1,000 to 10,000 to 100,000 processors!!
          multidisciplinary applications
♦ A supercomputer application and its software are usually much longer-lived than the hardware
     Hardware life is typically five years at most
     Fortran and C are the main programming models
♦ Software is a major cost component of modern technologies
     The tradition in HPC system procurement is to assume that the software is free.
♦ We don't have many great ideas about how to solve this problem.


Collaborators

♦ TOP500
     H. Meuer, Mannheim U
     H. Simon, NERSC
     E. Strohmaier, NERSC