So you want to buy a supercomputer? James Davenport

slide-1
SLIDE 1

So you want to buy a supercomputer?

James Davenport
Hebron & Medlock Professor of Information Technology

University of Bath (U.K.) (visiting Waterloo)

15 May 2009
Many thanks to Prof. Guest (Cardiff)

slide-4
SLIDE 4

University of Bath

Good (9th out of 117 in the U.K.: Guardian, 12 May 2009)
Heavily co-op
Strengths in Science, Engineering, Mathematics
But small: 538 Faculty

slide-11
SLIDE 11

U.K. scene — generalities

Nationally run (EPSRC etc. ≈ NSERC) major supercomputers
HECToR (current one): 29th in the TOP500
Time bid for on competitive grants (virtual money)
Hence you need a ‘track record’
Basically, Mark 4 v 25: “to him that hath shall be given”.

slide-17
SLIDE 17

U.K. scene — recent developments

EPSRC etc. (≈ NSERC) now allow depreciation on computing resources to be charged to grants
(Previously, you had to buy your own machine and run it)
Government announce Science Research Infrastructure Fund (£500M/year) (largely buildings, but equipment not excluded)
Bath share about £5M/year
N.B. “year” = H.M. Treasury Year
Brainwave: if I purchase a supercomputer, then I can depreciate it, and have money to buy a new one.

slide-18
SLIDE 18

Recent UK spend, excluding machine rooms etc.

[Figure: recent UK spend; data not recoverable]

slide-23
SLIDE 23

Machine Rooms — a major problem

Cardiff: £1.6M on machine, £1.4M on converting machine room and (high-quality) air conditioning.
Bristol: £2M on machine, £2M+ on building machine room, including chilled water.
Imperial (Central London): £3M on CO2-cooled machine room.
Bath had an old machine room from the 1970s.

slide-29
SLIDE 29

Old Machine Rooms — a mixed blessing

+ I doubt very much Bath would have spent those sorts of sums on a new machine room
+ Comparative speed: I took under a year from initial decision to Phase 1 installed
− It will, just about, cope with the current smallish machine: I think in a few years we’ll need a new machine room
− The University don’t realise what a bargain they’re getting
− Despite the Estates Department’s promises, the power supply did need upgrading
+ Contracts signed this week on a new machine room with chilled water!

slide-36
SLIDE 36

Actual Timescale

1/2007 I am tasked with looking into this
5/2007 Top management buys the case
So what was the case?
Researchers think they can support £450K of equipment (i.e. earn that much depreciation over 3 years)
6-year commitment with 2-year reviews/refreshes
So 4 years' warning of decommitment
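
The arithmetic behind the case can be sketched as follows. This is only an illustration: it assumes straight-line depreciation, which the slides do not state, and the function name `straight_line_depreciation` is hypothetical.

```python
def straight_line_depreciation(cost_gbp: float, years: int) -> list[float]:
    """Return the equal annual charges that recover cost_gbp over `years`."""
    if years <= 0:
        raise ValueError("years must be positive")
    return [cost_gbp / years] * years


# Figures from the slide: £450K of equipment, recovered over 3 years.
annual_charges = straight_line_depreciation(450_000, 3)
total_recovered = sum(annual_charges)

print(annual_charges)   # three equal annual charges
print(total_recovered)  # the full purchase cost
```

Under that assumption, grants would need to carry about £150K of depreciation charges per year for the equipment to pay for its own replacement.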

slide-47
SLIDE 47

Actual Timescale

1/2007 I am tasked with looking into this
5/2007 Top management buys the case: RFP for £360K
  * There was already a national pre-qualified list
9/2007 “So what’s your final offer?”
10/2007 Purchase decision
1/2008 Phase 1 delivery
3/2008 Phase 1 acceptance
  * UK Treasury FY ends 5 April!
10/2008 Phase 2 decision (not to delay)
1/2009 Phase 2 delivery
5/2009 Acceptance

slide-56
SLIDE 56

Equipment Purchased

Clustervision: a UK/Dutch firm of system integrators: the boards are Supermicro.
100 nodes; 2 × 4-core 2.8GHz Intel Harpertown (3.0 gave less power/£; 2.66 pushed the power envelope)
2 nodes/power supply
2GB/core main memory
  * Specified this way as 2/4 core wasn’t obvious
  = 1.6TB main memory: it adds up!
Double Data Rate Infiniband
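
The memory total follows directly from the per-core specification. A small worked check, using only the figures on the slide and assuming 1TB = 1000GB (the constant names are illustrative):

```python
NODES = 100            # "100 nodes"
SOCKETS_PER_NODE = 2   # "2 × 4-core"
CORES_PER_SOCKET = 4
GB_PER_CORE = 2        # "2GB/core main memory"

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = NODES * cores_per_node
memory_per_node_gb = cores_per_node * GB_PER_CORE
total_memory_gb = total_cores * GB_PER_CORE

print(f"{total_cores} cores, {memory_per_node_gb}GB per node, "
      f"{total_memory_gb / 1000}TB in total")
```

Specifying memory per core rather than per node means the total scales automatically with whichever core count is eventually chosen.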

slide-63
SLIDE 63

Acceptance Tests

1 Phase 1: Linpack benchmark

We had linear algebra compiled for the previous chip!

2 Phase 2: a range of tests related to major users

* Very grateful to Prof. Guest for organising

MPI defaults were badly wrong
DDR Infiniband was running out of steam faster than expected
Several partial failures.

slide-75
SLIDE 75

Partial Failures

Very frustrating and hard to diagnose: typically one job would take “longer than expected”.
Observe this is happening, and feel very confused
Eventually spot that it happens when node 78 is used!
Convince the manufacturer to run their tests on node 78
Failure modes

1 Node 78 (and another one since): poor Infiniband
2 Twice so far: a node loses 4GB of memory on a reboot
3 Others?

“One footsore soldier can delay a regiment” — Duke of Wellington
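
One way to shorten the "eventually spot" step is to ask which nodes are over-represented among slow jobs. The sketch below is not the method used at Bath; the job records, the `slow_factor` threshold, and the function name `suspect_nodes` are all made up for illustration.

```python
from collections import defaultdict


def suspect_nodes(jobs, slow_factor=1.5, min_jobs=3):
    """Rank nodes by the fraction of their jobs that ran slow.

    jobs: iterable of (nodes_used, runtime, expected_runtime) tuples.
    A job counts as slow when runtime > slow_factor * expected_runtime.
    Nodes seen in fewer than min_jobs jobs are ignored.
    """
    total = defaultdict(int)
    slow = defaultdict(int)
    for nodes_used, runtime, expected in jobs:
        is_slow = runtime > slow_factor * expected
        for node in set(nodes_used):
            total[node] += 1
            if is_slow:
                slow[node] += 1
    ranked = [(slow[n] / total[n], n) for n in total if total[n] >= min_jobs]
    return sorted(ranked, reverse=True)


# Made-up job history: every job that touches node 78 overruns.
history = [
    ([1, 2, 3], 100, 100),
    ([2, 3, 78], 210, 100),
    ([1, 78, 4], 190, 100),
    ([4, 5, 6], 105, 100),
    ([78, 5, 6], 250, 100),
    ([1, 2, 4], 98, 100),
    ([3, 5, 6], 101, 100),
]
worst_fraction, worst_node = suspect_nodes(history)[0]
print(worst_node, worst_fraction)
```

A ranking like this only points at a candidate; the node still has to be tested directly, and it relies on having a credible expected runtime for each job.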

slide-79
SLIDE 79

Lessons I already knew

Get it in writing from Estates.
Know your (potential) users early (devise acceptance tests accordingly)
It’s hard to explain to management

slide-85
SLIDE 85

Lessons I know now

It’s very hard to explain to management
Acceptance tests are very important, especially Car-Parrinello Molecular Dynamics (CPMD) for interconnect
Partial failure is far worse than total failure
Even DDR Infiniband has trouble with 8 cores/node (There’s a good paper (now!) by HP)

slide-91
SLIDE 91

Lessons I know I still don’t know

Good ways of detecting partial failure
How to manage software licensing if you can’t afford to license every node
How to persuade management to deliver on the promised refreshes
Will the assumptions hold up:
  Assumptions on grant-getting
  Assumptions on actual usage ⇒ price/hour

slide-100
SLIDE 100

Price per node hour: 52p ≈ CAN$0.9

With the exception of a “short test” queue, allocation is based on whole nodes.
Allocation is based on entitlements rather than retrospective billing
The Maui scheduler has (too?) many knobs in this area

48% Equipment depreciation
15% Equipment maintenance
10% Machine electricity
8% Air conditioning (incl. depreciation)
17% 1 Programmer (1/3 of team of 3)
2% My time
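
The percentages can be turned into pence per node hour with a few lines. This is just the slide's own figures multiplied out; the variable names are illustrative.

```python
PRICE_PENCE = 52  # price per node hour, from the slide

# Percentage shares as listed on the slide.
shares = {
    "Equipment depreciation": 48,
    "Equipment maintenance": 15,
    "Machine electricity": 10,
    "Air conditioning (incl. depreciation)": 8,
    "1 Programmer (1/3 of team of 3)": 17,
    "My time": 2,
}

# Pence per node hour attributable to each item.
pence_by_item = {item: PRICE_PENCE * pct / 100 for item, pct in shares.items()}

for item, pence in pence_by_item.items():
    print(f"{item:40s} {shares[item]:3d}%  {pence:5.2f}p")
print(f"{'Total':40s} {sum(shares.values()):3d}%  "
      f"{sum(pence_by_item.values()):5.2f}p")
```

Roughly 25p of each node hour is equipment depreciation, which is why the price is so sensitive to the assumptions on actual usage.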

slide-102
SLIDE 102

Lessons I don’t know I don’t know?

Any questions?