Future Computing Platforms for Science in a Power Constrained Era

SLIDE 1

Future Computing Platforms for Science in a Power Constrained Era

David Abdurachmanov (FNAL), Peter Elmer (Princeton), Giulio Eulisse (FNAL), Robert Knight (Princeton)

SLIDE 2

Power in Data Centers

An Inconvenient Truth

Energy-related costs account for approximately 12% of overall expenditure and are the fastest-rising cost, according to Gartner, Inc. (29/9/2010).
For 2012 data, CMS used ~100K x86_64 cores out of the 350K cores at WLCG.
Scaling up from the mix of machines at FNAL, we estimate the WLCG aggregate power consumption at 10 MW.
CMS expects an increase of 2-3 orders of magnitude in data produced over 15 years.
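As a rough illustration of what these figures imply, the sketch below divides the estimated 10 MW evenly over the 350K WLCG cores. The even split and the assumption of continuous year-round operation are simplifications introduced here, not statements from the talk.

```python
# Figures quoted on the slide.
WLCG_POWER_W = 10e6        # estimated WLCG aggregate power: 10 MW
WLCG_CORES = 350_000       # cores available at WLCG
CMS_CORES_2012 = 100_000   # x86_64 cores CMS used for 2012 data

# Assumption made here: machines run continuously for a non-leap year.
HOURS_PER_YEAR = 24 * 365


def watts_per_core(total_power_w: float, cores: int) -> float:
    """Average power per core, assuming power is spread evenly over cores."""
    return total_power_w / cores


def annual_energy_gwh(power_w: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy in GWh consumed at constant power over the given hours."""
    return power_w * hours / 1e9


per_core_w = watts_per_core(WLCG_POWER_W, WLCG_CORES)
cms_share_w = per_core_w * CMS_CORES_2012

print(f"Power per core:       {per_core_w:.1f} W")
print(f"CMS share (2012):     {cms_share_w / 1e6:.2f} MW")
print(f"WLCG energy per year: {annual_energy_gwh(WLCG_POWER_W):.1f} GWh")
```

The per-core figure includes whatever infrastructure overhead is folded into the 10 MW estimate, so it should not be read as a CPU-only number.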

Think green

Local green and/or cheaper power sources, e.g. the Princeton energy plant (15 MW), which combines electricity, heat and cooling. When the cost of electricity rises, gas, diesel and/or bio-diesel fuel is used to power local generators. Hot water and steam are provided from waste energy.
Low-power and/or highly efficient hardware, e.g. Intel Atom, X-Gene (ARMv8 64-bit), GPUs, Xeon Phi.

SLIDE 3

Intel Xeon

Status quo

Obvious market leader, currently targeted at non-power-constrained applications.
De facto standard in HEP.
Needs to be the reference point even in the power-efficiency case, because the only way to win the game is to be both power efficient and performant.
Very diverse offering to match different performance needs; recently introduced "custom" silicon for big players.

Advantages

Intel's main advantage comes from being one generation ahead in terms of manufacturing process and architectural sophistication (e.g. large vector units), not to mention the maturity of the development toolset.

Many features introduced over the years to monitor power consumption (e.g. RAPL) and improve power efficiency (e.g. TurboBoost, SpeedStep).
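To make the RAPL mention concrete, here is a minimal sketch of how package power can be estimated from the RAPL energy counter that Linux exposes through the powercap sysfs interface. The path used (package 0 under `/sys/class/powercap/intel-rapl:0`) is an assumption about the machine layout; it may be absent, named differently, or readable only by root. This is not necessarily how the measurements in this talk were taken.

```python
import time
from pathlib import Path

# Assumption: Linux powercap layout for CPU package 0.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Microjoules used between two readings, allowing for one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped past max_range_uj between the two readings.
    return end_uj + (max_range_uj - start_uj)


def average_power_w(start_uj: int, end_uj: int, seconds: float,
                    max_range_uj: int) -> float:
    """Average power in watts over the sampling interval."""
    return energy_delta_uj(start_uj, end_uj, max_range_uj) / 1e6 / seconds


def read_int(path: Path) -> int:
    return int(path.read_text().strip())


def sample_package_power(interval_s: float = 1.0) -> float:
    """Read the energy counter twice and return the average power in watts."""
    max_range = read_int(RAPL_DIR / "max_energy_range_uj")
    start = read_int(RAPL_DIR / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = read_int(RAPL_DIR / "energy_uj")
    elapsed = time.monotonic() - t0
    return average_power_w(start, end, elapsed, max_range)


if __name__ == "__main__":
    try:
        print(f"Package power: {sample_package_power():.1f} W")
    except OSError as err:
        # Missing interface (non-Intel or non-Linux host) or no permission.
        print(f"RAPL counters not readable: {err}")
```

Keeping the arithmetic in `energy_delta_uj` and `average_power_w` separate from the file access makes the wraparound handling easy to check without RAPL hardware.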

SLIDE 4

APM X-Gene1 64-bit ARM

Old kid on the block

ARM32 is the obvious volume leader in the low-power, embedded world. It is interesting not only because it is power efficient, but also for its business model, where "custom" designs are the norm, and because it has the economy of scale of cell phones behind it. Since the last CHEP, a few 64-bit chips have started to appear, first in the embedded world (iPhone!) and now, thanks to Applied Micro X-Gene 1, in the server world.

Porting Effort

CMS Offline SW (CMSSW) has been ported to run on APM X-Gene1. Most of the work went into getting the core of the Linux distribution to work. Porting CMSSW itself to ARM64 was less of an issue, because the compatibility issues had either already been solved for ARM32 or were really 32-bit vs 64-bit problems in our code base. Choosing open-source software is key to being ready for new platforms.

SLIDE 5

Intel Atom

The Empire strikes back

Obviously Intel is not sitting idle in such a strategic and lucrative market. While Atom has so far been unable to touch ARM's dominance in the embedded world, it is becoming an attractive player in the server market.

Advantages

It is a standard x86_64 core, where tradeoffs (number of cores, cache size / hierarchy, no hyperthreading) have been made to focus on low power consumption. The production process edge is a clear strategic advantage.

SLIDE 6

Outsiders

IBM POWER8

Evolution of the old POWER / PowerPC architecture. Once an IBM / Motorola / Apple partnership, guidance now comes from the OpenPOWER consortium, similar to what happens with ARM. Pitched at very highly parallel workloads in the high-end server market. There is an effort to make porting to it much easier (e.g. it now has little-endian support).

Intel Xeon Phi

Intel's answer to GPGPU. It is becoming a somewhat popular platform, albeit not widely used in HEP. Provides a high number of small cores (61, with 4 threads each) with large vector units. The main advantage is that these cores are normal x86_64.

SLIDE 7

Software setup

ParFullCMS

Standalone CMS simulation using Geant4 (v10.1) with representative geometry (but simplified physics). Compiled with GCC 4.9.x (apart from Xeon Phi which uses ICC), with static binaries and multithreading support.

CMS Software (CMSSW)

Latest development version as of 1st of April 2015. Compiled using GCC 4.9.x. Reconstruction from local file, conditions from Frontier.

HEPSPEC06

Standard benchmark for HEP software, used in CMS as a metric for computing pledges. Compiled using GCC 4.9.x.

SLIDE 8

CPU Specs

             Vendor  Model     Year     Fab    Process
SandyBridge  Intel   E5-2650   Q1/12    Intel  32nm
Haswell      Intel   E5-2699   Q3/14    Intel  22nm
Atom         Intel   C2750     Q3/13    Intel  22nm
X-Gene1      APM     883408    Q3/13    TSMC   40nm
POWER8       IBM     8247-22L  Late 13  IBM    22nm
Xeon Phi     Intel   KNC7100   Q2/14    Intel  22nm

             Frequency (GHz)  Cores  Threads per Core
SandyBridge  2.0 (2.8)        8      2
Haswell      2.3 (3.6)        18     2
Atom         2.4              8      1
X-Gene1      2.4              8      1
POWER8       3.4              10     8
Xeon Phi     1.23             61     4
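The core and threads-per-core columns determine how many hardware threads each socket offers, which is the quantity a multithreaded benchmark can try to fill. A small sketch of that arithmetic, using only the counts listed above (the `Cpu` type and function name are illustrative):

```python
from typing import NamedTuple


class Cpu(NamedTuple):
    name: str
    cores: int
    threads_per_core: int


# Core and thread counts as listed in the CPU specs.
CPUS = [
    Cpu("SandyBridge", 8, 2),
    Cpu("Haswell", 18, 2),
    Cpu("Atom", 8, 1),
    Cpu("X-Gene1", 8, 1),
    Cpu("POWER8", 10, 8),
    Cpu("Xeon Phi", 61, 4),
]


def hardware_threads(cpu: Cpu) -> int:
    """Hardware threads available on one socket."""
    return cpu.cores * cpu.threads_per_core


for cpu in CPUS:
    print(f"{cpu.name:12s} {hardware_threads(cpu):4d} hardware threads")
```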

SLIDE 9

Raw performance

[Figure: single-core and single-socket performance for ParFullCMS, CMSSW and HEPSPEC06. Platforms: Xeon SandyBridge (8 cores, 2-2.8 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Phi (61 cores, 1.3 GHz).]

All numbers normalised to SandyBridge 1 core performance.
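The normalisation amounts to dividing every measured throughput by the throughput of one SandyBridge core. A minimal sketch follows; the platform names and throughput values are placeholders chosen for illustration, not measurements from this talk.

```python
def normalise(throughput_evts_per_s: dict[str, float],
              reference: str) -> dict[str, float]:
    """Express each throughput as a multiple of the reference entry."""
    ref = throughput_evts_per_s[reference]
    if ref <= 0:
        raise ValueError("reference throughput must be positive")
    return {name: value / ref for name, value in throughput_evts_per_s.items()}


# Placeholder throughputs in events/s (NOT results from the talk).
example = {
    "SandyBridge 1 core": 0.50,
    "Platform A": 0.25,
    "Platform B": 1.00,
}

normalised = normalise(example, "SandyBridge 1 core")
for name, value in normalised.items():
    print(f"{name:20s} {value:.2f}")
```

By construction the reference entry comes out as 1, so values above 1 mean faster than one SandyBridge core and values below 1 mean slower.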

SLIDE 10

Scalability

[Figure: ParFullCMS throughput (evts/s, normalised to one SandyBridge core) versus threads per core, with the hyperthreading regime marked. Platforms: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).]

SLIDE 11

Scalability

[Figure: ParFullCMS throughput per thread ((evts/s) / thread, normalised to one SandyBridge core) versus threads per core, with a Turbo Boost annotation. Platforms: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).]

SLIDE 12

Power Efficiency (single CPU)

[Figure: ParFullCMS performance (Evt/s, normalised to one SandyBridge core) versus power consumption (W). Platforms: Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz), Xeon Phi (61 cores, 1.3 GHz).]

Atom number is for a card, not for a CPU.

SLIDE 13

Power Efficiency (single CPU)

[Figure: ParFullCMS efficiency (Evt/J, normalised to a single SandyBridge core) versus threads per core. Platforms: Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz), Xeon Phi (61 cores, 1.3 GHz).]
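The efficiency metric follows from the two quantities on the previous slide: events per second divided by watts gives events per joule, since one watt is one joule per second. A minimal sketch, with placeholder numbers that are not results from this talk:

```python
def events_per_joule(events_per_second: float, power_watts: float) -> float:
    """Energy efficiency: (events/s) / (J/s) = events/J."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return events_per_second / power_watts


def relative_efficiency(platform: tuple[float, float],
                        reference: tuple[float, float]) -> float:
    """Efficiency of a platform as a multiple of the reference's efficiency.

    Each argument is an (events_per_second, power_watts) pair.
    """
    return events_per_joule(*platform) / events_per_joule(*reference)


# Placeholder (events/s, watts) pairs, NOT measurements from the talk.
reference = (1.0, 20.0)
candidate = (4.0, 40.0)

print(f"Relative efficiency: {relative_efficiency(candidate, reference):.2f}")
```

In this made-up case the candidate is four times faster but draws twice the power, so it ends up twice as efficient as the reference.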

SLIDE 14

Box to box comparison.

4.3U system which supports up to 45 cartridges, either Atom or X-Gene 1.

SLIDE 15

Power Efficiency (box)

[Figure: ParFullCMS box-level efficiency (Evt/J) versus number of active cartridges, for X-Gene1 (8 cores, 2.4 GHz) in HP Moonshot m400 cards and Atom (8 cores, 2.4 GHz) in HP Moonshot m300 cards, with projections for fully populated m300 and m400 systems and a 2-socket Haswell 3U system for comparison.]
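One simple way to see why box-level efficiency grows with the number of active cartridges is to assume a fixed chassis overhead (fans, power supplies, networking) that is amortised over the cartridges. The sketch below uses that model; both the model and every number except the 45-cartridge capacity are assumptions made for illustration, and this is not necessarily how the projections in the talk were obtained.

```python
def box_efficiency(active_cartridges: int,
                   evts_per_s_per_cartridge: float,
                   watts_per_cartridge: float,
                   chassis_overhead_watts: float) -> float:
    """Events per joule for a chassis with a fixed power overhead."""
    if active_cartridges <= 0:
        raise ValueError("need at least one active cartridge")
    total_rate = active_cartridges * evts_per_s_per_cartridge
    total_power = (chassis_overhead_watts
                   + active_cartridges * watts_per_cartridge)
    return total_rate / total_power


MAX_CARTRIDGES = 45  # chassis capacity quoted in the talk

# Placeholder values, NOT measurements: per-cartridge rate (evts/s),
# per-cartridge power (W) and chassis overhead (W).
RATE, CARD_W, OVERHEAD_W = 1.0, 10.0, 100.0

for n in (5, 15, 30, MAX_CARTRIDGES):
    eff = box_efficiency(n, RATE, CARD_W, OVERHEAD_W)
    print(f"{n:2d} cartridges: {eff:.4f} Evt/J")
```

Under this model the efficiency approaches the per-cartridge value (rate divided by cartridge power) as the chassis fills, which is why a fully populated system is the fair point of comparison with a conventional server.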

SLIDE 16

Outlook

Current market

The race is heating up, and Intel is not sitting idle. The fabrication process advantage is king. ARM64 is not as easy an answer as previously thought. Intel Atom and Intel Xeon are currently unmatched in terms of both performance and power efficiency. It will be interesting to see the next X-Gene iterations, but Intel has a roadmap as well.

More thoughts.

Haswell vs Atom clearly shows that we need to take volume and infrastructure limits into account in the equation as well. As we know very well by now, exploiting parallelism and multithreading is not an easy task. POWER8 and Xeon Phi really require effort to even remotely scale as advertised (just like GPUs).

SLIDE 17

Thanks!

For providing hardware and/or helping to set it up, we would like to thank:

▶ Applied Micro
▶ Intel
▶ CERN TechLab

SLIDE 18

Backup slides

SLIDE 19

Performance

ParFullCMS (multithreading, SandyBridge one core normalised)

[Figure: throughput (evts/s, SandyBridge normalised) versus number of cores (log scale). Platforms: Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz), X-Gene1 (8 cores, 2.4 GHz), Xeon Phi (61 cores, 1.3 GHz).]

SLIDE 20

Performance (multi-core)

CMSSW (multi-job, SandyBridge normalised)

[Figure: throughput (evts/s, SandyBridge normalised) versus number of cores (log scale). Platforms: X-Gene1 (8 cores, 2.4 GHz), Atom (8 cores, 2.4 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), Xeon SandyBridge (8 cores, 2-2.8 GHz).]

SLIDE 21

Performance (multi-core)

(HEPSPEC06)

[Figure: HEPSPEC06 performance index versus number of cores (log scale). Platforms: Xeon SandyBridge (8 cores, 2-2.8 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Atom (8 cores, 2.4 GHz).]
