Future Computing Platforms for Science in a Power Constrained Era
David Abdurachmanov (FNAL), Peter Elmer (Princeton), Giulio Eulisse (FNAL), Robert Knight (Princeton)
Power in Data Centers
An Inconvenient Truth
Energy-related costs account for approximately 12% of overall expenditure and are the fastest-rising cost, according to Gartner, Inc. (29/9/2010).
For 2012 data, CMS used ~100K x86_64 cores out of the 350K cores at WLCG.
Scaling up from the mix of machines at FNAL, we estimate the WLCG aggregate power consumption at 10 MW.
CMS expects a 2-3 orders of magnitude increase in data produced in 15 years.
Local green and/or cheaper power source, e.g. the Princeton energy plant (15 MW), which combines electricity, heat and cooling. When the cost of electricity increases, gas, diesel and/or bio-diesel fuel is used to power local generators. Hot water and steam are provided from waste energy.
Low-power and/or highly efficient hardware, e.g. Intel Atom, X-Gene (ARMv8 64-bit), GPUs, Xeon Phi.
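The scale of these figures can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the 10 MW WLCG estimate and the 100K-of-350K core counts quoted above. It assumes, purely for illustration, that power is shared in proportion to core count and that the load is constant all year; neither assumption comes from the slides, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic based on the figures quoted in the slide.
WLCG_POWER_MW = 10.0        # estimated WLCG aggregate power (from the slide)
WLCG_CORES = 350_000        # total WLCG cores (from the slide)
CMS_CORES = 100_000         # x86_64 cores used by CMS for 2012 data (from the slide)
HOURS_PER_YEAR = 24 * 365   # assumption: constant load over a non-leap year


def cms_power_share_mw(total_mw=WLCG_POWER_MW,
                       cms_cores=CMS_CORES,
                       total_cores=WLCG_CORES):
    """CMS share of WLCG power, ASSUMING power scales linearly with core count."""
    return total_mw * cms_cores / total_cores


def annual_energy_gwh(power_mw, hours=HOURS_PER_YEAR):
    """Energy in GWh consumed by a constant load of `power_mw` megawatts."""
    return power_mw * hours / 1000.0


if __name__ == "__main__":
    share = cms_power_share_mw()
    print(f"CMS share of WLCG power: {share:.2f} MW")
    print(f"WLCG energy per year:    {annual_energy_gwh(WLCG_POWER_MW):.1f} GWh")
    print(f"CMS energy per year:     {annual_energy_gwh(share):.1f} GWh")
```

Under these assumptions the CMS share comes out a little below 3 MW, which is the kind of number the rest of the talk tries to reduce per unit of physics output.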
Obvious market leader, currently targeted at non-power-constrained applications. De facto standard in HEP. Needs to be a reference point even in the power efficiency case, because the only way to win the game is to be both power efficient and performant. Very diverse offering to match different needs in terms of performance; recently introduced "custom" silicon for big players.
Intel's main advantage comes from being one generation ahead in terms of manufacturing process and architectural sophistication (e.g. large vector units), not to mention maturity.
Many features introduced over the years to monitor power consumption (e.g. RAPL) and improve power efficiency (e.g. TurboBoost, SpeedStep).
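As an illustration of the RAPL monitoring mentioned above, here is a minimal sketch that estimates average package power from the Linux powercap sysfs counters. It assumes a Linux host that exposes the `intel-rapl:0` domain and allows reading it; the path, the sampling interval and the helper names are assumptions for this example. The slides do not say how power was measured for the benchmarks, so this should not be read as the authors' method.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain in the Linux powercap tree.
# It may be absent (non-Intel host, VM) or unreadable without privileges.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(domain, name):
    """Read an integer microjoule value from a powercap sysfs file."""
    return int((domain / name).read_text().strip())


def average_power_watts(e_start_uj, e_end_uj, dt_s, max_range_uj):
    """Average power between two energy readings, handling one counter wrap."""
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The energy counter wrapped around between the two samples.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_s


def measure(domain=RAPL_DOMAIN, interval_s=1.0):
    """Sample the energy counter twice and return the average power in watts."""
    max_range = read_uj(domain, "max_energy_range_uj")
    e0 = read_uj(domain, "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    e1 = read_uj(domain, "energy_uj")
    dt = time.monotonic() - t0
    return average_power_watts(e0, e1, dt, max_range)


if __name__ == "__main__":
    try:
        print(f"Average package power: {measure():.1f} W")
    except OSError as exc:
        print(f"RAPL counters not available here: {exc}")
```

Keeping the arithmetic in `average_power_watts` separate from the file access makes the wrap-around handling easy to check on a machine without RAPL.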
ARM32 is the obvious volume leader for the low-power, embedded world. Interesting not only because it is power efficient, but also for the business model, where "custom" designs are the norm, and because it has the economy of scale of cell phones behind it. Since the last CHEP a few 64-bit chips have started to appear, first in the embedded world (iPhone!), and now, thanks to the Applied Micro X-Gene 1, in the server world.
CMS Offline SW (CMSSW) has been ported to work on the APM X-Gene 1. Most of the work was getting the core of the Linux distribution to work. Porting CMSSW to ARM64 was less of an issue, because the compatibility issues were either already solved for ARM32 or were really 32-bit vs 64-bit problems in our code base. Choosing open-source software is key to being ready for new platforms.
Obviously Intel is not sitting idle in such a strategic and lucrative market. While Atom has so far been unable to challenge ARM's dominance in the embedded world, it is becoming an attractive player in the server market.
It is a standard x86_64 core, where tradeoffs (number of cores, cache size / hierarchy, no hyperthreading) have been made to focus on low power consumption. The production process edge is a clear strategic advantage.
Evolution of the old POWER / PowerPC architecture. Once an IBM / Motorola / Apple partnership, guidance now comes from the OpenPOWER consortium, similar to what happens with ARM. Pitched at very highly parallel workloads in the high-end server market. There is an effort to make porting to it much easier (e.g. now with little-endian support).
Intel's answer to GPGPU. It is becoming a somewhat popular platform, albeit not widely used in HEP. Provides a high number of small cores (61 with 4 threads each) with large vector units. The main advantage is that such cores are normal x86_64.
ParFullCMS: standalone CMS simulation using Geant4 (v10.1) with representative geometry (but simplified physics). Compiled with GCC 4.9.x (apart from Xeon Phi, which uses ICC), with static binaries and multithreading support.
CMSSW: latest development version as of 1 April 2015. Compiled using GCC 4.9.x. Reconstruction from local file, conditions from Frontier.
HEPSPEC06: standard benchmark for HEP software, used in CMS as a metric for computing
             Vendor  Model     Year     Fab    Process
SandyBridge  Intel   E5-2650   Q1/12    Intel  32nm
Haswell      Intel   E5-2699   Q3/14    Intel  22nm
Atom         Intel   C2750     Q3/13    Intel  22nm
X-Gene1      APM     883408    Q3/13    TSMC   40nm
POWER8       IBM     8247-22L  Late 13  IBM    22nm
Xeon Phi     Intel   KNC7100   Q2/14    Intel  22nm

             Frequency (GHz)  Cores  Threads per Core
SandyBridge  2.0 (2.8)        8      2
Haswell      2.3 (3.6)        18     2
Atom         2.4              8      1
X-Gene1      2.4              8      1
POWER8       3.4              10     8
Xeon Phi     1.23             61     4
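The core and thread counts listed above determine how far the "threads per core" axis of the later plots can go on each machine. A small sketch of that arithmetic follows; the `Cpu` type and the function name are hypothetical, while the numbers are copied from the specifications above.

```python
from typing import NamedTuple


class Cpu(NamedTuple):
    """One row of the specification table (name, cores, threads per core)."""
    name: str
    cores: int
    threads_per_core: int


# Values copied from the specification table in the slides.
CPUS = [
    Cpu("SandyBridge", 8, 2),
    Cpu("Haswell", 18, 2),
    Cpu("Atom", 8, 1),
    Cpu("X-Gene1", 8, 1),
    Cpu("POWER8", 10, 8),
    Cpu("Xeon Phi", 61, 4),
]


def hardware_threads(cpu):
    """Maximum number of hardware threads on one socket or card."""
    return cpu.cores * cpu.threads_per_core


if __name__ == "__main__":
    for cpu in CPUS:
        print(f"{cpu.name:<12} {hardware_threads(cpu):>4} hardware threads")
```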
[Figure: Single Core and Single Socket performance for ParFullCMS, CMSSW and HEPSPEC06. Systems: Xeon SandyBridge (8 cores, 2-2.8 GHz), Xeon Haswell (18 cores, 2.3-3.6 GHz), X-Gene1 (8 cores, 2.4 GHz), Atom (8 cores, 2.4 GHz), POWER8 (10 cores, 3.4 GHz), Xeon Phi (61 cores, 1.3 GHz).]
[Figure: ParFullCMS. Throughput (evts/s, normalised to one SandyBridge core) vs. threads per core, with the hyperthreading regime marked. Systems: Atom, POWER8, Xeon Haswell, Xeon SandyBridge, X-Gene1, Xeon Phi.]
[Figure: ParFullCMS. Throughput per thread ((evts/s)/thread, normalised to one SandyBridge core) vs. threads per core, with a Turbo Boost annotation. Systems: Atom, POWER8, Xeon Haswell, Xeon SandyBridge, X-Gene1, Xeon Phi.]
[Figure: ParFullCMS, performance vs. power consumption. Performance (evt/s, normalised to one SandyBridge core) vs. power consumption (W). Systems: Atom, Xeon Haswell, X-Gene1, Xeon SandyBridge, Xeon Phi.]
[Figure: ParFullCMS. Efficiency (evt/J, normalised to a single SandyBridge core) vs. threads per core. Systems: Atom, Xeon Haswell, X-Gene1, Xeon SandyBridge, Xeon Phi.]
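The efficiency metric used in these plots, events per joule, is throughput divided by power, since 1 W = 1 J/s. The sketch below shows that calculation and the normalisation to a reference measurement. The function names are hypothetical, and the numbers in the usage section are placeholders chosen for illustration, not measurements from the slides.

```python
def events_per_joule(events_per_second, power_watts):
    """Energy efficiency: (evt/s) / (J/s) = evt/J."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return events_per_second / power_watts


def normalised(value, reference):
    """Express `value` relative to a reference measurement."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return value / reference


if __name__ == "__main__":
    # Placeholder inputs for illustration only; NOT data from the slides.
    reference = events_per_joule(events_per_second=1.0, power_watts=50.0)
    candidate = events_per_joule(events_per_second=4.0, power_watts=100.0)
    print(f"Relative efficiency: {normalised(candidate, reference):.2f}")
```

A machine can therefore be slower in absolute terms and still come out ahead on this metric if its power draw is proportionally lower.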
[Figure: ParFullCMS. Efficiency (evt/J) vs. number of active cartridges. Series: 2-socket Haswell 3U; X-Gene1 (8 cores, 2.4 GHz) in HP Moonshot m400 card; Atom (8 cores, 2.4 GHz) in HP Moonshot m300 card; projected fully populated m300 system; projected fully populated m400 system.]
The race is heating up, and Intel is not sitting idle. Fabrication process advantage is king. ARM64 is not the easy answer it was previously thought to be. Intel Atom and Intel Xeon are currently unmatched in terms of both performance and power efficiency. It will be interesting to see the next X-Gene iterations, but Intel has a roadmap as well.
Haswell vs Atom clearly shows that we need to take volume and infrastructure limits into account as well. As we know very well by now, exploiting parallelism and multithreading is not an easy task. POWER8 and Xeon Phi really require effort to even remotely scale as advertised (just like GPUs).
▶ Applied Micro
▶ Intel
▶ CERN TechLab
[Figure: ParFullCMS (multithreading). Throughput (evts/s, normalised to one SandyBridge core) vs. number of cores (log scale). Systems: Atom, POWER8, Xeon Haswell, Xeon SandyBridge, X-Gene1, Xeon Phi.]
[Figure: CMSSW (multi-job). Throughput (evts/s, SandyBridge normalised) vs. number of cores (log scale). Systems: X-Gene1, Atom, Xeon Haswell, Xeon SandyBridge.]
[Figure: HEPSPEC06. Performance index vs. number of cores (log scale). Systems: Xeon SandyBridge, Xeon Haswell, X-Gene1, POWER8, Atom.]