Application-controlled Frequency Scaling

SLIDE 1

Application-controlled Frequency Scaling

Jons-Tobias Wamhoff Stephan Diestelhorst Christof Fetzer Technische Universität Dresden, Germany Patrick Marlier Pascal Felber Université de Neuchâtel, Switzerland Dave Dice Oracle Labs, USA

SLIDE 2

Overview

  • Dynamic voltage and frequency scaling (DVFS)
    • traditionally: used to save energy or boost sequential bottlenecks/serial peak loads
    • today: improve performance by exposing asymmetric properties of applications
  • Outline
    • Recap DVFS features on current x86 multicores
    • DVFS properties: latency and power
    • Applying DVFS at the application level

SLIDE 3

P- and C-states

  • P-states: performance states
    • predefined frequency/voltage pairs
    • controlled through model-specific registers (MSRs, privileged rdmsr/wrmsr)
  • C-states: power states
    • trade entry/wakeup latency for higher power savings
    • entered by hlt or monitor/mwait

[Figure: P-states Pturbo, Pbase, …, Pslow (frequency/voltage) within C0; halted C-states C1-Cn]
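To make the MSR access path concrete, here is a minimal sketch of reading one register through the Linux `msr` driver (`/dev/cpu/<n>/msr`). It is an illustration under assumptions, not code from the talk: the register `IA32_PERF_STATUS` (0x198) applies to Intel CPUs only, the helper names `decode_msr` and `read_msr` are hypothetical, and the read needs root privileges with the `msr` kernel module loaded. If the device is missing or not permitted, the function returns `None`.

```python
import os
import struct

# Assumption: Intel CPU. IA32_PERF_STATUS is an Intel architectural MSR;
# AMD processors use different register addresses for P-state status.
IA32_PERF_STATUS = 0x198


def decode_msr(raw: bytes) -> int:
    """Interpret the 8 bytes returned by the msr driver as a little-endian u64."""
    return struct.unpack("<Q", raw)[0]


def read_msr(cpu: int, reg: int):
    """Return the value of MSR `reg` on logical CPU `cpu`, or None if unavailable."""
    path = f"/dev/cpu/{cpu}/msr"
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        # Driver not loaded, CPU does not exist, or insufficient privileges.
        return None
    try:
        # The msr driver uses the file offset as the register address.
        return decode_msr(os.pread(fd, 8, reg))
    except (OSError, struct.error):
        # Register not implemented on this CPU, or a short read.
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    value = read_msr(0, IA32_PERF_STATUS)
    print("unavailable" if value is None else hex(value))
```

Writing a P-state request works the same way with `os.pwrite`, but the target register and bit layout are vendor-specific and are deliberately left out of this sketch.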

SLIDE 4

Intel Turbo Boost & AMD Turbo CORE

  • Voltage and frequency domain: module vs. package
  • Boosting: deterministic vs. thermal
  • AMD only: asymmetric frequencies with manual boost

[Figure: per-core P-state configurations (Pbase, Pturbo, Pslow, cores in ≥C1) for the two processors; unit labels: HT, x86, FPU]

SLIDE 5

Evaluation Setup

  • Critical sections (CS) protected by MCS queue lock
  • Decorations on acquire/release → trigger DVFS
  • Variable size of CS → amortize DVFS cost
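  • Effective CS frequency: $f_{CS} = f_{base} \cdot \frac{t_{CS}}{t_{A+CS+R}}$
  • Energy for 1 hour at Pbase: $E_{NORM} = E_{sample} \cdot \frac{t_{A+CS+R}}{t_{CS}}$

[Figure: timeline at f_Pbase with the markers Acquire entry, Acquire exit, and Release, and the intervals t_wait and t_CS]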
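The two metrics can be evaluated directly. The sketch below is only a worked example of the formulas on this slide: the function names are hypothetical, `t_total` stands for the subscript A+CS+R (read here as the time for acquire, critical section, and release together), and the input numbers are made up for illustration.

```python
def effective_cs_frequency(f_base: float, t_cs: float, t_total: float) -> float:
    """f_CS = f_base * t_CS / t_(A+CS+R)."""
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("times must be positive")
    return f_base * t_cs / t_total


def normalized_energy(e_sample: float, t_cs: float, t_total: float) -> float:
    """E_NORM = E_sample * t_(A+CS+R) / t_CS."""
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("times must be positive")
    return e_sample * t_total / t_cs


if __name__ == "__main__":
    # Illustrative inputs: half of the acquire-to-release time is spent in the CS.
    print(effective_cs_frequency(f_base=3.1, t_cs=10_000, t_total=20_000))
    print(normalized_energy(e_sample=0.1, t_cs=10_000, t_total=20_000))
```

With these inputs the effective frequency is half the base frequency and the normalized energy is twice the sampled energy, which shows why lock overhead must be small relative to the critical section.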

SLIDE 6

Automatic Frequency Scaling

  • Decoration: spinning vs. blocking
  • P-state transitions triggered by hardware

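[Figure: timeline of automatic scaling between f_Pbase and f_Pturbo over t_wait and t_CS. Transitions: t_Pbase→Chalt, t_Chalt→Pbase, t_Pturbo→Pbase, t_ramp. OS actions: halt entry, wakeup. CPU actions: deeper C-state, boosted P-state]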

SLIDE 7

Blocking vs. Spinning Locks

[Figure: four panels (Frequency AMD, Frequency Intel, Energy AMD, Energy Intel) for the series spin and futex. x-axis: SizeCS (cycles, log scale). y-axes: fCS (GHz; ticks 0.0, 1.4, 3.1, 4.0 for AMD and 0.0, 0.8, 3.4, 3.9 for Intel) and ENORM (kWh). Annotations: ↑ 1.5M; ↑ 4M; 1M, twait = 7M; ↓ 10k; twait = 70k ↓]

SLIDE 8

Manual Frequency Scaling

  • Decoration: spin and application-level DVFS control

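[Figure: timeline of manual scaling between f_Pslow, f_Pbase, and f_Pturbo over t_wait and t_CS, with the transitions listed below and t_ramp]

Cost per step (k = thousand, presumably cycles):

              t_Pbase→Pslow   t_Pslow→Pturbo   t_Pturbo→Pbase
  ioctl       1k              1k               1k
  wrmsr       28k             2k               23k
  transition  2k              225k             1k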
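A rough way to reason about these costs is a break-even estimate: boosting only pays off once the critical section is long enough to amortize the transition overhead. The sketch below is a simplified model, not the authors' evaluation method. It assumes the overhead is counted in cycles at base frequency, the whole critical section runs at turbo frequency, and the cycle count of the critical section does not change with frequency. The function name and the input values are illustrative.

```python
def break_even_cs_size(overhead_cycles: float, f_base: float, f_turbo: float) -> float:
    """Smallest CS size (in cycles) at which boosting is no slower than staying at base.

    Model: time_base  = S / f_base
           time_boost = S / f_turbo + overhead_cycles / f_base
    Setting both equal and solving for S gives
           S = overhead_cycles * f_turbo / (f_turbo - f_base).
    """
    if f_turbo <= f_base:
        raise ValueError("f_turbo must exceed f_base")
    return overhead_cycles * f_turbo / (f_turbo - f_base)


if __name__ == "__main__":
    # Illustrative: 100k cycles of overhead, 3.1 GHz base, 4.0 GHz turbo.
    print(round(break_even_cs_size(100_000, 3.1, 4.0)))
```

With these example inputs the break-even size is roughly 444k cycles, so under this model the overhead has to be amortized over a critical section several times larger than the overhead itself.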

SLIDE 9

Manual Lock Boosting

  • spin: static Pbase
  • owner: dynamically boost
  • delegate: dedicated wrmsr core
  • migrate: statically boosted core

[Figure: two panels (Frequency AMD, Energy AMD) for the series spin, ownr, dlgt, mgrt. x-axis: SizeCS (cycles, log scale). y-axes: fCS (GHz; ticks 0.0, 1.4, 3.1, 4.0) and ENORM (kWh). Annotations: ↖ 600k; ↗ 200k; ↑ 400k; futex: 1.5M]
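The owner strategy can be pictured as a lock whose acquire and release are decorated with frequency changes. The following is a minimal sketch of that idea, not the TURBO library API: `BoostingLock`, `boost`, and `unboost` are hypothetical names, Python's `threading.Lock` stands in for the MCS queue lock used in the evaluation, and the callbacks only record events instead of writing P-state registers.

```python
import threading


class BoostingLock:
    """Lock wrapper that runs a callback after acquire and before release."""

    def __init__(self, boost, unboost):
        self._lock = threading.Lock()  # stand-in for an MCS queue lock
        self._boost = boost            # hypothetical: request the turbo P-state
        self._unboost = unboost        # hypothetical: return to the base P-state

    def __enter__(self):
        self._lock.acquire()
        self._boost()                  # the new owner boosts itself
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._unboost()            # drop back before handing over the lock
        finally:
            self._lock.release()       # always release, even if unboost fails
        return False                   # do not swallow exceptions from the CS


if __name__ == "__main__":
    events = []
    lock = BoostingLock(lambda: events.append("boost"),
                        lambda: events.append("unboost"))
    with lock:
        events.append("critical section")
    print(events)
```

The delegate and migrate strategies would replace the callbacks with a request to a dedicated core or a thread migration, which this sketch does not model.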

SLIDE 10

TURBO Library

  • Convenient programmatic application-level DVFS control
  • Testbed to explore challenges of future heterogeneous cores

https://bitbucket.org/donjonsn/turbo

[Figure: layered architecture of the library]

  • Linux kernel and hardware interfaces
  • Hardware abstraction
    • Topology
    • PCI-Configuration, MSR-Interface: P-states
    • PerfEvent: HW counters
  • Performance configuration
    • Thread: Migrate to core
    • P-States: Setting & configuration
    • PerformanceMonitor: Low-level profiling
  • Execution control
    • ThreadRegistry: Create/Register
    • ThreadControl: Decorate lock, barriers, …: boosting/profiling

SLIDE 11

Boosting Applications

  • Expose application knowledge
    • Asymmetric software transactional memory: up to 50% speedup with only 2% more energy
  • Tradeoffs when IPC depends on core frequency
    • Hash table resize in memcached: 9% speedup but 22% higher frequency
  • Outweigh P-state latency by delegating CS
    • High cross-module round-trip delay (2k cycles)
    • Intra-module delay scales with P-state (Pboost: 280 cycles)
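The memcached numbers illustrate the IPC tradeoff with simple arithmetic. The sketch below assumes that performance is proportional to IPC times frequency and that both percentages refer to the same code region. Neither assumption is stated on the slide, and the function name is hypothetical.

```python
def relative_ipc(speedup: float, freq_ratio: float) -> float:
    """Ratio of boosted IPC to base IPC, assuming performance = IPC * frequency."""
    if freq_ratio <= 0:
        raise ValueError("freq_ratio must be positive")
    return speedup / freq_ratio


if __name__ == "__main__":
    # 9% speedup at 22% higher frequency.
    print(f"{relative_ipc(1.09, 1.22):.3f}")  # prints 0.893
```

Under these assumptions IPC drops by about 11% when boosted, which is consistent with work that is partly bound by memory rather than by core frequency.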

SLIDE 12

Next Steps

  • Intel Haswell-EP supports per-core P-states
    • Allows giving hints
  • Application domains
    • Real-time scheduling
    • Fork-join benchmarks
    • …?
