Application-controlled Frequency Scaling (Jons-Tobias Wamhoff)



SLIDE 1

Application-controlled Frequency Scaling

Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer (Technische Universität Dresden, Germany)
Patrick Marlier, Pascal Felber (Université de Neuchâtel, Switzerland)
Dave Dice (Oracle Labs, USA)

SLIDE 2

Overview

Dynamic voltage and frequency scaling (DVFS)
traditionally: used to save energy or to boost sequential bottlenecks/serial peak loads
today: improve performance by exposing asymmetric properties of applications

Outline
Recap DVFS features on current x86 multicores
DVFS properties: latency and power
Applying DVFS at the application level

SLIDE 3

P- and C-states

P-states: performance states
predefined frequency/voltage pairs
controlled through model-specific registers (MSRs, accessed with the privileged rdmsr/wrmsr instructions)

C-states: power states
trade entry/wakeup latency for higher power savings

entered by hlt or monitor/mwait
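Because rdmsr/wrmsr are privileged, user-space code on Linux usually reaches MSRs through the msr driver, which exposes one device file per core. The following is a minimal sketch of a read, assuming the driver is loaded (modprobe msr) and the caller has root rights. The function name read_msr and its device parameter are made up for illustration and are not part of any library named in these slides.

```python
import os
import struct


def read_msr(msr_address, cpu=0, device=None):
    """Read one 64-bit MSR through the Linux msr driver.

    The driver maps the MSR address to the file offset, so a read is an
    8-byte pread at that offset. `device` is a hypothetical override that
    makes the function usable on an ordinary file for experiments.
    """
    path = device if device is not None else f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, msr_address)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at offset {msr_address:#x}")
    # x86 is little-endian; the MSR value is an unsigned 64-bit integer.
    return struct.unpack("<Q", raw)[0]
```

On an Intel machine, read_msr(0x198) would return the raw IA32_PERF_STATUS value. Decoding the P-state fields is vendor- and model-specific and is left out here.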

[Figure: P-states Pturbo, Pbase, …, Pslow (frequency/voltage) apply in C0; C1-Cn are halted states.]

SLIDE 4

AMD Turbo CORE & Intel Turbo Boost

Voltage and frequency domain: module vs. package
Boosting: deterministic vs. thermal
AMD only: asymmetric frequencies with manual boost

[Figure: example core configurations with cores at Pbase, Pturbo and Pslow and idle cores in ≥C1; component labels: HT, x86, FPU.]

SLIDE 5

Evaluation Setup

Critical sections (CS) protected by MCS queue lock
Decorations on acquire/release → trigger DVFS
Variable size of CS → amortize DVFS cost
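Effective CS frequency:

$$f_{CS} = f_{base} \cdot \frac{t_{CS}}{t_{A+CS+R}}$$

Energy for 1 hour at Pbase:

$$E_{NORM} = E_{sample} \cdot \frac{t_{A+CS+R}}{t_{CS}}$$

[Figure: timeline at fPbase with Acquire entry, Acquire exit and Release; twait precedes the critical section of length tCS.]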
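The two metrics can be written as small helper functions. This is a sketch that follows the formulas on this slide. The function and parameter names are chosen here for illustration, and the numbers in the usage note are made up.

```python
def effective_cs_frequency(f_base, t_cs, t_total):
    """f_CS = f_base * t_CS / t_(A+CS+R).

    t_cs is the critical-section size, t_total the measured time for
    acquire + critical section + release, both in the same unit.
    """
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("durations must be positive")
    return f_base * t_cs / t_total


def normalized_energy(e_sample, t_cs, t_total):
    """E_NORM = E_sample * t_(A+CS+R) / t_CS."""
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("durations must be positive")
    return e_sample * t_total / t_cs
```

For instance, with a hypothetical base frequency of 3.1 GHz, a critical section of 100,000 cycles and 125,000 cycles spent in acquire + CS + release, the effective frequency is 3.1 · 0.8 = 2.48 GHz: lock overhead lowers fCS below fbase, while a boosted critical section can raise it above fbase.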

SLIDE 6

Automatic Frequency Scaling

Decoration: spinning vs. blocking
P-state transitions triggered by hardware

[Figure: timeline with frequency levels fPturbo and fPbase and intervals twait, tCS, tPbase→Chalt, tChalt→Pbase, tPturbo→Pbase, tramp. Annotations: OS halt (entry, wakeup); CPU enters a deeper C-state; boosted P-state.]

SLIDE 7

Blocking vs. Spinning Locks

[Figure: four panels (Frequency AMD, Frequency Intel, Energy AMD, Energy Intel) comparing spin and futex locks over SizeCS (cycles, log scale). Frequency panels plot fCS (GHz), with axis ticks at 1.4, 3.1, 4.0 for AMD and 0.8, 3.4, 3.9 for Intel; energy panels plot ENORM (kWh, 0.0 to 0.6). Annotations: ↑ 1.5M; ↑ 4M; 1M, twait = 7M; ↓ 10k; twait = 70k ↓.]

SLIDE 8

Manual Frequency Scaling

Decoration: spin and application-level DVFS control

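[Figure: timeline with frequency levels fPturbo, fPbase, fPslow and intervals twait, tCS, tPbase→Pslow, tPslow→Pturbo, tPturbo→Pbase, tramp.]

Cost         tPbase→Pslow   tPslow→Pturbo   tPturbo→Pbase
ioctl        1k             1k              1k
wrmsr        28k            2k              23k
transition   2k             225k            1k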
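These costs determine how large a critical section must be before a manual boost pays off. The sketch below uses a deliberately simple model that is not taken from the slides: the boosted critical section runs fboost/fbase times faster, and the whole switching overhead is paid up front. It assumes the cost values are cycles at Pbase and that the three values per row follow the order of the transitions labelled in the timeline. All names are illustrative.

```python
def boosted_duration(cs_size, overhead, f_base, f_boost):
    """Time (in Pbase cycles) to run a critical section with boosting.

    cs_size:  critical-section size in cycles at Pbase
    overhead: total switching cost in Pbase cycles (assumed paid up front)
    """
    return overhead + cs_size * f_base / f_boost


def break_even_cs_size(overhead, f_base, f_boost):
    """Smallest cs_size for which boosted_duration(...) <= cs_size.

    Solves overhead + S * f_base / f_boost = S for S.
    """
    if f_boost <= f_base:
        raise ValueError("boost frequency must exceed base frequency")
    return overhead / (1.0 - f_base / f_boost)


# Assumed inputs: wrmsr (2k) + transition (225k) for the Pslow→Pturbo step,
# and 3.1 / 4.0 GHz read off the AMD frequency axis as Pbase / Pturbo.
example = break_even_cs_size(2_000 + 225_000, 3.1, 4.0)
```

Under these assumptions the break-even size comes out at roughly one million cycles, which illustrates why the slow voltage ramp towards Pturbo dominates the cost and why small critical sections cannot amortize it.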

SLIDE 9

Manual Lock Boosting

[Figure: two panels, Frequency AMD (fCS in GHz) and Energy AMD (ENORM in kWh), over SizeCS (cycles, log scale); legend entries include spin, dlgt and mgrt.]

spin: static Pbase
owner: dynamically boost
delegate: dedicated wrmsr core
migrate: statically boosted core
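The owner strategy can be pictured as a lock whose acquire and release are decorated with boost hooks. The sketch below is a hypothetical illustration, not the implementation evaluated in the talk: it wraps Python's threading.Lock instead of an MCS queue lock, and the hooks are plain callables supplied by the caller. A real hook would have to request a P-state change through a privileged interface.

```python
import threading


class DecoratedLock:
    """Lock with hooks that run right after acquire and right before release."""

    def __init__(self, on_acquired=None, on_release=None):
        self._lock = threading.Lock()
        # Hypothetical hooks, e.g. "boost this core" / "return to Pbase".
        self._on_acquired = on_acquired or (lambda: None)
        self._on_release = on_release or (lambda: None)

    def __enter__(self):
        self._lock.acquire()
        try:
            self._on_acquired()  # the owner boosts once it holds the lock
        except BaseException:
            self._lock.release()  # never leave the lock held on hook failure
            raise
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._on_release()  # drop the boost before handing over the lock
        finally:
            self._lock.release()
        return False  # do not swallow exceptions from the critical section
```

Used as `with DecoratedLock(boost, unboost): ...`, the critical section runs between the two hooks. The delegate and migrate strategies avoid paying the switching cost on the owner's core by moving either the wrmsr work or the critical section itself to another core.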

[Figure annotations: ↖ 600k; ↗ 200k; ↑ 400k; futex: 1.5M]

SLIDE 10

TURBO Library

Convenient programmatic application-level DVFS control
Testbed to explore challenges of future heterogeneous cores

https://bitbucket.org/donjonsn/turbo

Linux kernel and hardware interfaces

Hardware abstraction
Topology
PCI-Configuration, MSR-Interface: P-states
PerfEvent: HW counters

Performance configuration
Thread: migrate to core
P-States: setting & configuration
PerformanceMonitor: low-level profiling

Execution control
ThreadRegistry: create/register
ThreadControl: decorate lock, barriers, …: boosting/profiling
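The "migrate to core" operation can be approximated on Linux with the scheduler affinity calls in Python's standard library. This is a rough sketch of the idea only. It does not use or reproduce the TURBO library's interface, and the function name migrate_to_core is made up here.

```python
import os


def migrate_to_core(cpu):
    """Pin the calling thread to one core and return the previous affinity set.

    Linux-only: os.sched_setaffinity with pid 0 applies to the caller.
    """
    previous = os.sched_getaffinity(0)
    if cpu not in previous:
        raise ValueError(f"core {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous
```

A thread would call migrate_to_core on a core that is kept at a boosted P-state, run its critical section there, and then restore the returned affinity set with os.sched_setaffinity(0, previous).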

SLIDE 11

Boosting Applications

Expose application knowledge
Asymmetric software transactional memory: up to 50% speedup with only 2% more energy
Tradeoffs when IPC depends on core frequency
Hash table resize in memcached: 9% speedup but 22% higher frequency

Outweigh P-state latency by delegating CS
High cross-module round-trip delay (2k cycles)
Intra-module delay scales with P-state (Pboost: 280 cycles)

SLIDE 12

Next Steps

Intel Haswell-EP supports per-core P-states
Allows giving hints
Application domains
Real-time scheduling
Fork-join benchmarks
…?