VMM Emulation of Intel Hardware Transactional Memory



slide-1
SLIDE 1

VMM Emulation of Intel Hardware Transactional Memory

Maciej Swiech, Kyle Hale, Peter Dinda
Northwestern University
V3VEE Project: www.v3vee.org
Hobbes Project

slide-2
SLIDE 2

What will we talk about?

  • We added the capability to run Intel HTM code on a virtual machine with minimal emulation
  • We developed a new page-flipping technique that captures reads and writes at single-memory-reference granularity
  • A software implementation of HTM emulation allows arbitrary transaction sizes and code testing

slide-3
SLIDE 3

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-4
SLIDE 4

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-5
SLIDE 5

Motivation | transactional memory

  • Processors and applications are becoming more parallel and distributed to cope with the growing scale of data and research problems
  • There is a need for easier and more reliable methods for concurrent programming

slide-6
SLIDE 6

Background | transactional memory

do_the_things();

do_the_things() {
    write_shared_mem();
    read_shared_mem();
}

slide-7
SLIDE 7

Background | transactional memory

Instead of:

acquire_lock();
do_the_things();
release_lock();

do_the_things() {
    write_shared_mem();
    read_shared_mem();
}

slide-8
SLIDE 8

Background | transactional memory

Instead of:

acquire_lock();
do_the_things();
release_lock();

  • Have to track locks
  • Deadlock

slide-9
SLIDE 9

Background | transactional memory

Can do:

transaction {
    do_the_things();
}

Instead of:

acquire_lock();
do_the_things();
release_lock();

slide-10
SLIDE 10

Background | transactional memory

Can do:

transaction {
    do_the_things();
}

Instead of:

acquire_lock();
do_the_things();
release_lock();

  • Unsafe concurrent memory accesses are detected by TM
  • Easier to write safe code
  • UNSAFE:
      • Write after Read
      • Read after Write
      • Write after Write
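
The three unsafe orderings listed on this slide can be made concrete with a small Python sketch. `Access` and `classify_conflict` are hypothetical names introduced here for illustration only; they are not part of any TM implementation described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Access:
    thread: int     # which thread issued the access
    addr: int       # memory address touched
    is_write: bool  # True for a write, False for a read


def classify_conflict(first: Access, second: Access) -> Optional[str]:
    """Name the hazard when `second` follows `first`, or None if the pair is safe."""
    if first.thread == second.thread or first.addr != second.addr:
        return None  # same thread, or disjoint addresses
    if not first.is_write and not second.is_write:
        return None  # two reads never conflict
    if first.is_write and second.is_write:
        return "WaW"
    # Exactly one of the two is a write.
    return "RaW" if first.is_write else "WaR"
```

A read by one thread followed by a write from another thread to the same address is reported as "WaR"; two reads, or accesses to different addresses, are reported as safe.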

slide-11
SLIDE 11

Background | transactional memory

  • Transactions are
      • Composable
      • Easier to reason about
      • More optimistic than locking
          • Assumption: no other code will touch memory in TX
  • HTM is faster than STM (software transactional memory)

slide-12
SLIDE 12

Motivation | virtualizing

  • Currently only Intel Haswell and IBM processors have implementations of Hardware Transactional Memory
  • Adding HTM capabilities to a virtual machine monitor would allow anyone to run transactional code
  • Allows for testing the effects of new hardware implementations on code

slide-13
SLIDE 13

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-14
SLIDE 14

Intel HTM | background

  • In the Haswell generation of processors, Intel introduced 2 Hardware Transactional Memory implementations
      • RTM – Restricted Transactional Memory
      • HLE – Hardware Lock Elision
  • 4 new instructions added to the ISA
      • XBEGIN
      • XABORT
      • XEND
      • XTEST

slide-15
SLIDE 15

Intel HTM | ISA

  • XBEGIN rel32
      • Marks the beginning of a transaction and gives the abort label (as a relative offset)
  • XABORT imm8
      • Forces a transaction abort
  • XEND
      • Marks the end of a transaction
  • XTEST
      • Tests if the processor is currently in a transaction state

slide-16
SLIDE 16

Intel HTM | example

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>
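
The control flow of the RTM skeleton above can be modelled in a few lines of Python. This is only a behavioural sketch under stated assumptions: `run_transaction`, `TxAbort`, and the dictionary standing in for memory are hypothetical, and real RTM is executed by the processor rather than by software like this.

```python
class TxAbort(Exception):
    """Raised by the modelled XABORT; carries the abort code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def run_transaction(memory, body):
    """Model XBEGIN ... XEND: buffer writes, then commit them or discard them."""
    writes = {}  # buffered writes, invisible to `memory` until commit

    def read(addr):
        # A transaction sees its own buffered writes first.
        return writes.get(addr, memory.get(addr, 0))

    def write(addr, value):
        writes[addr] = value

    def xabort(code):
        raise TxAbort(code)

    try:
        body(read, write, xabort)      # <body of transaction>
    except TxAbort as abort:
        return ("aborted", abort.code)  # abort_label path: writes are dropped
    memory.update(writes)               # XEND: make the writes visible
    return ("committed", None)          # success_label path
```

A body that calls `xabort(0x42)` after writing leaves `memory` untouched and returns `("aborted", 0x42)`, mirroring the jump to `abort_label`.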


slide-17
SLIDE 17

Intel HTM | specification

  • Intel lists many reasons a transaction “may” abort
      • Operations that modify RIP, GPRs, status flags
      • Operations on XMM, YMM, MXCSR registers
      • Various other instructions
      • Synchronous exception events
      • Asynchronous events such as interrupts
      • Self-modifying code
      • Many others…
  • RaW, WaR, WaW conflicts trigger an abort

slide-18
SLIDE 18

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-19
SLIDE 19

Architecture | design

  • Hypervisor extension
  • TM events captured and handled in VMM
  • Redo-log-based design with garbage collection
  • Minimal instruction decoding

slide-20
SLIDE 20

Architecture | design

  • MIME (Memory and Instruction Meta Engine)
      • Generates a stream of memory reads/writes
  • RTME (Restricted Transactional Memory Engine)
      • Maintains the redo log
      • Tracks system state
      • Conflict detection
      • Garbage collection

slide-21
SLIDE 21

Architecture | RTME

  • Finite State Machine model
      • SYSTEM state
      • CORE state
  • TSX instructions generate #UD exceptions, driving state transitions
  • Maintains read/write logs for each transaction

Restricted Transactional Memory Engine

slide-22
SLIDE 22

Architecture | RTME

  • Keeps track of per-core and system transactional state
  • Places cores in single-stepping mode
      • If one core is single-stepping, all cores are
  • Launches garbage collection of log entries

Restricted Transactional Memory Engine
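
The per-core and system state tracking described for the RTME can be sketched as a tiny state machine. The class and method names below are hypothetical and chosen for illustration; the sketch assumes the system is in TM mode exactly when at least one core is, and that single-stepping is applied to all cores whenever the system is in TM mode.

```python
class RtmeState:
    """Toy model of the RTME's CORE and SYSTEM transactional state."""

    def __init__(self, num_cores):
        self.core_in_tx = [False] * num_cores  # CORE state, one flag per core

    @property
    def system_in_tx(self):
        # SYSTEM state: in TM mode while any core is inside a transaction.
        return any(self.core_in_tx)

    def on_xbegin(self, core):
        # In the VMM this would be driven by the #UD raised by XBEGIN.
        self.core_in_tx[core] = True

    def on_tx_end(self, core):
        # Reached on XEND (commit) or on any abort.
        self.core_in_tx[core] = False

    def single_stepping_cores(self):
        # If one core is single-stepping, all cores are.
        if self.system_in_tx:
            return list(range(len(self.core_in_tx)))
        return []
```

With two cores, a transaction beginning on core 0 puts both cores into single-stepping; once it ends and no core is in a transaction, the system leaves TM mode.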

slide-23
SLIDE 23

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

slide-24
SLIDE 24

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • System in TM mode
  • Core in TM mode

slide-25
SLIDE 25

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • Monitor abort conditions (incl. XABORT)
  • Maintain redo-log

slide-26
SLIDE 26

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • CHECK WaW conflicts
  • CHECK RaW conflicts
  • CHECK WaR conflicts

slide-27
SLIDE 27

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • COMMIT write log

slide-28
SLIDE 28

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • Core out of TM mode
  • Launch GC

slide-29
SLIDE 29

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • Core out of TM mode
  • Launch GC
  • If no cores in TM, System out of TM mode

slide-30
SLIDE 30

Architecture | example

Restricted Transactional Memory Engine

start_label:
    XBEGIN abort_label
    <body of transaction, may use XABORT>
    XEND
success_label:
    <handle transaction committed>
abort_label:
    <handle transaction aborted>

  • If any abort condition is triggered
  • Runs at given code point
  • All intermediate state is discarded

slide-31
SLIDE 31

Architecture | MIME

  • Leverages
      • Shadow Page Table page fault hooking
      • Instruction length decoding
      • Hypercall insertion
  • Memory access single-stepping
  • Staging page to keep writes hidden until commit

Memory and Instruction Meta Engine

slide-32
SLIDE 32

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   movq %rdx, %rbx
        ...
target: ...

Memory and Instruction Meta Engine

slide-33
SLIDE 33

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   movq %rdx, %rbx
        ...
target: ...

Memory and Instruction Meta Engine

  • Decode instruction length…

slide-34
SLIDE 34

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   VMCALL
        ...
target: ...
saved instr: movq %rdx, %rbx

Memory and Instruction Meta Engine

  • …replace next instr with hypercall

slide-35
SLIDE 35

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   VMCALL
        ...
target: ...
saved instr: movq %rdx, %rbx

Memory and Instruction Meta Engine

  • Flush the shadow page tables
  • All guest memory accesses → page fault

slide-36
SLIDE 36

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   VMCALL
        ...
target: ...
saved instr: movq %rdx, %rbx

Memory and Instruction Meta Engine

  • IFETCH → sPT fault
  • Map the instruction page in

slide-37
SLIDE 37

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   VMCALL
        ...
target: ...
saved instr: movq %rdx, %rbx

Memory and Instruction Meta Engine

  • Read: map page in as read-only
  • Write: map staging page in
  • Read: record address
  • Write: record address and value

slide-38
SLIDE 38

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   VMCALL
        ...
target: ...
saved instr: movq %rdx, %rbx

Memory and Instruction Meta Engine

  • Signals end of instruction
  • If staging page was used, copy data into redo log
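
The bookkeeping for one single-stepped instruction (record reads, stage writes, then move staged data into the redo log at the hypercall) can be sketched as follows. `MimeStep` and its method names are hypothetical, memory is modelled as a plain dictionary rather than pages, and the sketch does not model how a read of a previously staged value would be serviced.

```python
class MimeStep:
    """Toy model of MIME bookkeeping while single-stepping guest instructions."""

    def __init__(self, guest_memory):
        self.guest = guest_memory  # stands in for guest pages (addr -> value)
        self.staging = {}          # stands in for the staging page
        self.read_log = []         # addresses read so far
        self.redo_log = []         # (addr, value) pairs awaiting commit

    def on_read_fault(self, addr):
        # Read: the page is mapped read-only and the address is recorded.
        self.read_log.append(addr)
        return self.guest.get(addr, 0)

    def on_write_fault(self, addr, value):
        # Write: the staging page is mapped in, so guest memory is untouched.
        self.staging[addr] = value

    def on_hypercall(self):
        # The inserted hypercall signals the end of the instruction:
        # anything written to the staging page moves into the redo log.
        self.redo_log.extend(self.staging.items())
        self.staging.clear()
```

After a write fault and the following hypercall, the written value appears only in `redo_log`; `guest` is unchanged until a commit applies the log.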

slide-39
SLIDE 39

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   movq %rdx, %rbx
        ...
target: ...
saved instr: NULL

Memory and Instruction Meta Engine

  • Restore overwritten instruction

slide-40
SLIDE 40

Architecture | example

        addq %rbx, %rax
prev:   INSTRUCTION
cur:    movq %rdx, %rbx
        ...
target: ...
saved instr: NULL

Memory and Instruction Meta Engine

  • MIME begins again

slide-41
SLIDE 41

Architecture | example

prev:   addq %rbx, %rax
cur:    INSTRUCTION
next:   movq %rdx, %rbx
        ...
target: VMCALL
        ...
saved instr: ...

Memory and Instruction Meta Engine

  • If cur is a control flow instruction, overwrite target instead of next

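
The hypercall placement walked through in the slides above (decode the length of `cur`, overwrite `next` or the branch `target` with VMCALL, and restore the saved bytes afterwards) can be sketched over a byte buffer. The helper names are hypothetical. The bytes `0F 01 C1` are the Intel VMCALL encoding; a hypervisor on another platform would patch in that platform's hypercall instruction instead.

```python
VMCALL = bytes([0x0F, 0x01, 0xC1])  # Intel VMCALL encoding (3 bytes)


def hypercall_site(cur_addr, cur_len, branch_target=None):
    """Pick the address to patch: the branch target if `cur` is a
    control-flow instruction, otherwise the next sequential instruction."""
    if branch_target is not None:
        return branch_target
    return cur_addr + cur_len


def insert_hypercall(code, site):
    """Overwrite the bytes at `site` with VMCALL and return the saved bytes."""
    saved = bytes(code[site:site + len(VMCALL)])
    code[site:site + len(VMCALL)] = VMCALL
    return saved


def restore_instruction(code, site, saved):
    """Put the original bytes back once the hypercall has fired."""
    code[site:site + len(saved)] = saved
```

For a 3-byte instruction at offset 4, `hypercall_site(4, 3)` selects offset 7; passing `branch_target=12` selects offset 12 instead, as in the control-flow case.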
slide-42
SLIDE 42

Architecture | conflict checking

  • All transactions are given a number, which serves as a context and gives the transactions an ordering
  • 2 additional 2D hash tables are maintained
      • Record during which system state (TX number) memory accesses were made
      • Record if accesses were reads or writes

slide-43
SLIDE 43

Architecture | conflict checking

  • On a transaction end, every access in the RTME log is checked against the conflict tables
  • If a conflict is detected, the transaction is aborted

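
A check of this kind can be sketched with nested dictionaries keyed by address and transaction-number context. The layout and the conflict rule below are assumptions made for illustration, not the layout used in the actual implementation: an access is treated as conflicting when another core touched the same address in the same or a later context and at least one of the two accesses is a write.

```python
from collections import defaultdict


class ConflictTable:
    """Toy conflict table: addr -> context number -> core -> {"r", "w"}."""

    def __init__(self):
        self.table = defaultdict(lambda: defaultdict(dict))

    def record(self, addr, ctx, core, kind):
        """Note that `core` read ("r") or wrote ("w") `addr` during context `ctx`."""
        self.table[addr][ctx].setdefault(core, set()).add(kind)

    def has_conflict(self, core, start_ctx, log):
        """Check a finished transaction's log of (addr, kind) pairs."""
        for addr, kind in log:
            for ctx, by_core in self.table.get(addr, {}).items():
                if ctx < start_ctx:
                    continue  # access predates this transaction
                for other, kinds in by_core.items():
                    if other != core and (kind == "w" or "w" in kinds):
                        return True  # caller aborts the transaction
        return False
```

If core 1 wrote address `0x10` during context 5, a transaction on core 0 that started in context 5 and read `0x10` is flagged, while one that started in context 6 is not.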
slide-44
SLIDE 44

Architecture | Garbage Collection

  • Log entries and collision hashes will keep growing
      • Garbage collection is needed
  • Garbage collection is launched at transaction end
  • Transaction number context is monotonically increasing on each core
      • Easy to determine accesses made during contexts no longer referenced
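
Because context numbers only increase, stale entries can be found by comparing against the oldest context still in use. The sketch below assumes the log is a dictionary keyed by context number and that the caller supplies the set of contexts still referenced by running transactions; both the function name and that data layout are hypothetical.

```python
def collect_garbage(entries, active_contexts):
    """Drop log entries from contexts older than every context still in use.

    entries: dict mapping context number -> recorded accesses (mutated in place)
    active_contexts: context numbers still referenced by running transactions
    Returns the number of contexts removed.
    """
    oldest_live = min(active_contexts) if active_contexts else None
    stale = [ctx for ctx in entries
             if oldest_live is None or ctx < oldest_live]
    for ctx in stale:
        del entries[ctx]
    return len(stale)
```

With entries for contexts 3, 5, and 8 and active contexts {5, 9}, only context 3 is removed; with no active contexts, everything is removed.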


slide-45
SLIDE 45

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-46
SLIDE 46

Palacios | background

  • OS-independent, open source, BSD-licensed, publicly available embeddable VMM
  • Collaborative community resource development project involving Northwestern University, the University of New Mexico, University of Pittsburgh, Sandia National Labs, and Oak Ridge National Lab
  • Currently leveraged for Hobbes Node Virtualization Layer

slide-47
SLIDE 47

Palacios |

  • HTM implementation could be added to any hypervisor with shadow page table fault hooking
  • No instruction emulation necessary
  • ~1300 lines of code
  • RTME/MIME available as patchset

slide-48
SLIDE 48

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-49
SLIDE 49

Evaluation |

  • RTME/MIME vs Intel Haswell
  • RTME/MIME and Intel SDE vs ‘native’


slide-50
SLIDE 50

Evaluation | performance

  • HP Proliant DL320e
  • 1x quad-core Intel Xeon E3-1720v3
  • 8GB RAM.
  • Fedora 20 with a 3.13.5 kernel


slide-51
SLIDE 51

Evaluation | performance

  • Microbenchmark
      • One thread pinned to a single core
      • Enters a transaction, writes to a memory location, and then exits the transaction
      • Benchmark measures the time spent running 10 such transactions
      • Runtime averaged over 100 runs

slide-52
SLIDE 52

Evaluation | performance

HTM implementation    Average runtime
RTME/MIME             853.88 usec
Intel Haswell         2.57 usec

slide-53
SLIDE 53

Evaluation | performance

HTM implementation    Average runtime
RTME/MIME             853.88 usec
Intel Haswell         2.57 usec

  • Slowdown applies only during TX
  • ~3% overhead otherwise

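
The gap between the two rows works out to a slowdown of roughly 332x inside transactions. The short calculation below derives that figure from the reported averages; the per-transaction split assumes each reported runtime covers all 10 transactions of a benchmark run, as the microbenchmark description states.

```python
RTME_MIME_USEC = 853.88    # average runtime under RTME/MIME
HASWELL_USEC = 2.57        # average runtime on Intel Haswell hardware
TRANSACTIONS_PER_RUN = 10  # each timed run executes 10 transactions


def slowdown(emulated_usec, native_usec):
    """Ratio of emulated runtime to hardware runtime."""
    return emulated_usec / native_usec


def per_transaction(total_usec, count=TRANSACTIONS_PER_RUN):
    """Average time per transaction within one timed run."""
    return total_usec / count


ratio = slowdown(RTME_MIME_USEC, HASWELL_USEC)
print(f"slowdown inside transactions: ~{ratio:.0f}x")
print(f"per transaction: {per_transaction(RTME_MIME_USEC):.2f} usec emulated, "
      f"{per_transaction(HASWELL_USEC):.3f} usec on hardware")
```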
slide-54
SLIDE 54

Evaluation | correctness

  • Dell PowerEdge R415
  • 2x quad-core AMD Opteron 4122
  • 16 GB of memory.
  • Fedora 15 with a 2.6.38 kernel
  • 2 virtual cores
  • BusyBox environment based on Linux kernel 2.6.38
  • This machine does not have an HTM implementation.


slide-55
SLIDE 55

Evaluation | correctness

  • Suite of micro-benchmarks
      • Transaction calls XABORT without having written to memory
      • Transaction calls XABORT after having “written” to memory
      • Transaction writes memory with an immediate value
      • Transaction reads memory into a register
      • Transaction writes a register to memory
      • Transaction reads and writes the same memory location
      • Transaction thread writes to distinct addresses
      • Transaction and non-transactional thread write to overlapping addresses
  • Threads written using pthreads

slide-56
SLIDE 56

Evaluation | correctness

  • All test cases run on RTME/MIME and Intel SDE 5.31.0
  • All test cases run (without TSX instructions) on the host

Emulation Method      Slowdown vs. Native
RTME/MIME             ~1,500x
Intel SDE 5.31.0      ~90,000x

RTME/MIME is ~60x faster than Intel SDE

slide-57
SLIDE 57

Outline

  • Motivation / Background
  • Intel HTM
  • Architecture
  • Palacios
  • Evaluation
  • Conclusions


slide-58
SLIDE 58

Conclusion |

  • Developed RTME/MIME system
      • Software-implemented HTM emulation system
  • Developed MIME
      • Novel page-flipping ‘single stepping’ technique
  • Performance
      • Runs significantly faster than emulation

slide-59
SLIDE 59

Conclusion | limitations

  • Page boundaries
      • No support for instructions or memory accesses that cross page boundaries
  • Read-after-Write accesses
      • sPT hooking doesn’t allow detection of RaW
      • Fine for correctness of implementation
  • REP prefix
      • No support for instructions with multiple accesses

slide-60
SLIDE 60

Conclusion | future work

  • Extend MIME
      • Leverage instruction recording to capture detailed memory traces of application runs
      • Include support for breakpoints / stack traces to aid concurrent debugging tools

slide-61
SLIDE 61

Conclusion | future work

  • Leverage software cache to test limitations on transaction size

NatSys Labs

slide-62
SLIDE 62

Acknowledgements |

  • NU EECS 441 HTM Team
      • Marcel Flores
      • Zachary Bischof
  • Quix86 x86 decoder team
      • Alexander Kudryavtsev
      • Michael Solovyov

slide-63
SLIDE 63

Maciej Swiech <mswiech@u.northwestern.edu>
http://eecs.northwestern.edu/~msw978
Prescience Lab: www.presciencelab.org
V3VEE Project: www.v3vee.org

  • HTM emulation with minimal instruction emulation
  • Page-flipping technique for capturing memory accesses at memory access granularity
  • Software-controlled HTM emulation implementation for testing