
A Light-Weight Virtual Machine Monitor for Blue Gene/P

Jan Stoess, System Architecture Group, Department of Computer Science
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu
May 31, 2011


BG/P Programming Model

Traditional BG/P supercomputer programming model:
• Parallel programming run-time (MPI)
• Compute-Node Kernel (CNK)

CNK: OS for massively parallel applications
• Light-weight kernel, “POSIX-ish”
• Function-shipping to IO nodes

Perfect choice for current HPC apps:
• MPI programming model
• Low OS noise
• Performance, scalability, customizability

[Figure: compute nodes 0…63 each run MPI applications on CNK; a Linux IO node runs CIOD.]


Application Scale-Out

Standard server / commercial workloads are scaling out:
• Big data (Hadoop, stream processing, caching)
• Clouds (EC2, vCloud)
• Commodity OSes, runtimes, (HW): Linux, Java, Ethernet

CNK is not truly compatible:
• No full Linux/POSIX compatibility
• No compatibility with standardized networking protocols

[Figure: the same CNK/CIOD setup as before, with a question mark over running such scale-out workloads on it.]


HPC readiness vs. Compatibility

Commodity OSes are not designed for supercomputers:
• OS footprint and complexity
• Network protocol overhead
• A problem for standard scale-out software as well

BG could run such workloads – in principle:
• “cores, memory, interconnect”
• Reference HW for future data centers

Can we have
• … the HPC strength of CNK, and
• … the compatibility of a commodity OS / NW?

[Figure: the CNK/CIOD architecture again, with a question mark.]


A Light-Weight VMM for Supercomputers

Idea: use a light-weight kernel and a VMM.

The VMM provides HW compatibility:
• Can run Linux in a VM
• Can run Linux applications
• Can communicate via (virtualized) Ethernet

The light-weight kernel preserves a short path to the HW:
• Runs HPC apps “natively”
• Direct access to HPC interconnects
• Kernel small and customizable
• Low kernel footprint

A development path for converging platforms and workloads.

[Figure: MPI applications run natively on the LWK alongside a guest VM hosted by the VMM.]


Prototype

L4-based prototype:
• Small, privileged micro-kernel
• User-level VMM component

Current focus: the VMM layer (this talk):
• Virtual BG cores, memory, interconnects
• Support for standard OSes

Future work: native HPC app support:
• L4 has a native API
• Leverage existing research on L4 OS frameworks and native HPC app layers [Kitten/Palacios]

[Figure: L4 hosts the user-level VMM with a Linux 2.6 guest, next to a native L4 application.]


BG Overview and VMM agenda

Compute nodes:
• 4 PowerPC 450 cores
• 2 GB physical memory
• MMU/TLB
• Interrupt controller (BIC)
• Torus and collective interconnects
• Other HW (mailboxes, JTAG) not considered

IO nodes:
• Not virtualized
• Run a special Linux for booting

[Figure: a compute node's PPC 450 cores, TLB, BIC, torus, and collective links; the VMM on L4 exposes them to the guest VM as vPPC, vTLB, vBIC, vTorus, and vCollective.]


L4 and VMM architecture

L4 offers generic OS abstractions:
• Threads
• Address spaces
• Synchronous IPC
• IPC-based exception / IRQ handling

The VMM is just a user-level program (see the sketch below):
• Receives a “VM exit” message from the VM
• Emulates it and replies with an update message

L4 virtualization enhancements:
• Empty address spaces
• Extended VM/thread state handling
• Internal VM TLB handling

[Figure: the guest VM and the VMM exchange exit and resume messages via L4 IPC.]
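A minimal sketch of that exit-handling loop, in C. The message layout and the l4_ipc_wait / l4_ipc_reply / emulate helpers are illustrative assumptions, not the prototype's real API:

    /* Hypothetical VM-exit loop of the user-level VMM. */
    struct vm_exit_msg { unsigned long pc, msr, gpr[32]; };

    extern int  l4_ipc_wait(struct vm_exit_msg *m);              /* assumed wrapper */
    extern void l4_ipc_reply(int vcpu, const struct vm_exit_msg *m);
    extern void emulate(struct vm_exit_msg *m);

    void vmm_loop(void)
    {
        struct vm_exit_msg m;
        for (;;) {
            int vcpu = l4_ipc_wait(&m);  /* a trap arrives as a "VM exit" IPC */
            emulate(&m);                 /* decode and emulate the operation  */
            m.pc += 4;                   /* step past the trapped instruction */
            l4_ipc_reply(vcpu, &m);      /* kernel installs state, resumes VM */
        }
    }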


Virtual PowerPC processor

The VM runs in user mode, so privileged PPC instructions trap. L4 propagates each trap to the user-level VMM:
• Kernel-synthesized IPC
• VM/thread state included

The VMM receives the trap IPC:
• Decodes the message (faulting PC)
• Emulates the instruction (e.g. device IO); see the decode sketch below
• Sends back a reply IPC

Upon reception:
• The kernel installs the new state
• Resumes the guest

[Figure: a VM exit on, e.g., mtdcr ships PC, R1, the GPRs, and state to the VMM in an exit IPC; the resume IPC returns updated state with PC+4.]
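As one example of the emulation step, a sketch of decoding a trapped mtdcr (move-to-device-control-register) instruction. guest_fetch and vdev_dcr_write are hypothetical helpers; the opcode fields follow the PowerPC Book E encoding:

    #include <stdint.h>

    extern uint32_t guest_fetch(unsigned long pc);                  /* assumed helper */
    extern void     vdev_dcr_write(unsigned dcrn, unsigned long v); /* device model   */

    /* Reuses struct vm_exit_msg from the loop sketch above. */
    void emulate(struct vm_exit_msg *m)
    {
        uint32_t insn = guest_fetch(m->pc);
        /* mtdcr DCRN,rS: primary opcode 31, extended opcode 451;
         * the 10-bit DCR number is split 5:5, as in mtspr. */
        if ((insn >> 26) == 31 && ((insn >> 1) & 0x3ff) == 451) {
            unsigned rs   = (insn >> 21) & 0x1f;
            unsigned dcrn = (((insn >> 11) & 0x1f) << 5) | ((insn >> 16) & 0x1f);
            vdev_dcr_write(dcrn, m->gpr[rs]);   /* redirect the DCR write */
        }
        /* ... other privileged instructions elided ... */
    }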



Virtual MMU/TLB

PowerPC 450:
• 64-entry TLB
• No HW-walked page tables

Need to virtualize the MMU translation at two levels (see the sketch below):
• Guest virtual to guest physical (guest-managed)
• Guest physical to host physical (L4/VMM-managed)
• Compressed into the HW TLB

[Figure: TLB entry tables (vaddr, paddr, rwx, attr, sz, AS): the guest's GV→GP entries and the VMM's GP→HP map are compressed into single GV→HP entries.]


L4 keeps a per-VM “shadow TLB”:
• Intercepts guest TLB accesses (tlbwe, tlbre, …)
• Fills the shadow TLB accordingly
• Stores GV→GP mappings

L4 keeps per-address-space memory mappings:
• Standard L4 memory management
• Stores GP→HP mappings
• User-directed; the VMM carries it out

TLB miss handling (see the sketch below):
• On a guest-virtual TLB miss, deliver to the guest
• On a guest-physical TLB miss, deliver to the VMM

[Figure: TLB miss flow: L4 first checks the shadow TLB, then its mapping structures; a guest-virtual miss is injected into the guest, a guest-physical miss becomes a page-fault IPC to the VMM.]
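A sketch of that routing decision, reusing compose() from the previous sketch; shadow_tlb_lookup, inject_guest_tlb_miss, send_pf_ipc_to_vmm, and hw_tlb_install are hypothetical names:

    extern bool shadow_tlb_lookup(unsigned long gva, struct tlb_entry *e);
    extern void inject_guest_tlb_miss(unsigned long gva);  /* guest handles it  */
    extern void send_pf_ipc_to_vmm(unsigned long gpa);     /* VMM maps the page */
    extern void hw_tlb_install(const struct tlb_entry *e);

    void handle_tlb_miss(unsigned long gva)
    {
        struct tlb_entry gv_gp, hw;
        if (!shadow_tlb_lookup(gva, &gv_gp))
            inject_guest_tlb_miss(gva);       /* guest-virtual miss      */
        else if (!compose(&gv_gp, &hw))
            send_pf_ipc_to_vmm(gv_gp.paddr);  /* guest-physical miss     */
        else
            hw_tlb_install(&hw);              /* refill the 64-entry TLB */
    }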


PowerPC TLB protection features

• User/Kernel bits in TLB entries
• Address-space IDs (256 ASIDs)

Standard Linux behavior:
• U/K bits for kernel separation
• ASIDs for process separation

Requirements:
• Must virtualize protection (guest code runs at user level)
• Must support shared mappings
• Compressed, as for translation
• Minimize the number of TLB flushes

[Figure: a TLB entry's protection fields (U/K, ASID, TS) are matched against the PID register, the MSR, and the effective address.]


Virtual MMU/TLB Protection: Virtualized

L4/VMM:
• Uses U/K bits and ASIDs

VM (see the ASID sketch below):
• Everything runs at user level (no U/K separation)
• ASID = 1: guest kernel
• ASID = 2: guest user
• ASID = 0: shared mappings

Analysis:
• No TLB flush on guest syscalls
• No TLB flush on VM exits
• Flushes only on guest process or world switches

[Figure: guest user and kernel mappings are told apart by ASID and TS bits (TS=0/TS=1), so guest entries and L4/VMM entries coexist in the TLB without flushes on VM entry/exit.]
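A sketch of the ASID assignment described above. The helper name and mode flags are illustrative; the key point is that ASID 0 entries are global on PowerPC Book E (they match any PID), so shared mappings survive all switches:

    #include <stdbool.h>

    enum { ASID_SHARED = 0, ASID_GUEST_KERNEL = 1, ASID_GUEST_USER = 2 };

    /* Pick the ASID for a new shadow-TLB entry. */
    unsigned asid_for(bool guest_kernel_mapping, bool shared_mapping)
    {
        if (shared_mapping)
            return ASID_SHARED;    /* TID=0 entries match every PID */
        return guest_kernel_mapping ? ASID_GUEST_KERNEL : ASID_GUEST_USER;
    }

    /* On a guest syscall, only the PID register changes (2 -> 1);
     * no TLB flush is needed. */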


Virtual Collective Interconnect

Collective:
• Tree network, 7.8 Gbit/s, < 6 µs latency
• Packet-based, two virtual channels
• Packets: a header plus 16 × 128-bit FPU words of payload
• RX/TX FIFOs

[Figure: each node's OS drives the collective RX/TX FIFOs directly; the tree connects compute nodes 0…63 and the IO node.]



Virtualized collective (see the sketch below):
• TX: trap guest channel accesses and issue them on the physical collective link
• RX: copy the GPR/FPU payload into a private buffer, notify the guest, then trap the vCOLL accesses

[Figure: guest OSes see vCOLL RX/TX FIFOs; the VMM buffers received packets and forwards transmitted ones to the physical collective (pCOLL).]
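A sketch of the two data paths. The packet layout matches the slide (header plus 16 × 128-bit payload words); pcoll_send, pcoll_recv, ringbuf_push, and inject_guest_irq are hypothetical:

    #include <stdint.h>
    #include <stdbool.h>

    #define VIRQ_COLL_RX 1   /* arbitrary virtual IRQ number */

    /* Header plus 16 x 128-bit FPU words = 256 bytes of payload. */
    struct coll_packet { uint32_t header; uint8_t payload[16 * 16]; };

    extern void pcoll_send(const struct coll_packet *p);
    extern bool pcoll_recv(struct coll_packet *p);
    extern void ringbuf_push(const struct coll_packet *p); /* private RX buffer */
    extern void inject_guest_irq(int virq);

    /* TX: replay a trapped guest FIFO write on the physical link. */
    void vcoll_tx(const struct coll_packet *p) { pcoll_send(p); }

    /* RX: drain the HW FIFO into the private buffer, then raise a
     * virtual interrupt; the guest's subsequent vCOLL FIFO reads
     * trap and are served from that buffer. */
    void pcoll_rx_irq(void)
    {
        struct coll_packet p;
        while (pcoll_recv(&p))
            ringbuf_push(&p);
        inject_guest_irq(VIRQ_COLL_RX);
    }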


Virtual Torus Interconnect

Torus:
• 3D network, 40.8 Gbit/s, 5 µs latency
• Packet-based, 4 RX/TX groups (buffer-based) and rDMA

rDMA:
• Direct access by (user) software
• Memory descriptors
• put/get interface (direct-put, remote-get)

[Figure: OSes on the compute nodes exchange packets over the torus through SND/RCV buffers and rDMA operations such as rget(0,0) and dput(2,3).]


[Figure: guests now issue rDMA through vTORUS; the VMM translates descriptor addresses (GP→HP) and forwards them to the physical torus (pTORUS).]

Virtualized torus model (see the sketch below):
• Trap guest descriptor accesses
• Translate guest-physical to host-physical addresses
• Then issue the descriptor on the HW torus
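A sketch of handling one trapped descriptor write, reusing the gp_to_hp lookup from the MMU sketches; the descriptor layout and ptorus_dma_start are assumptions:

    #include <stddef.h>
    #include <stdbool.h>

    struct torus_desc { unsigned long pa; size_t len; int dest_x, dest_y, dest_z; };

    extern bool gp_to_hp(unsigned long gp, unsigned long *hp); /* as in the vTLB */
    extern void ptorus_dma_start(const struct torus_desc *d);

    /* Translate a guest rDMA descriptor, then inject it into the HW engine. */
    bool vtorus_inject(struct torus_desc *d)
    {
        unsigned long hp;
        if (!gp_to_hp(d->pa, &hp))
            return false;          /* unmapped buffer: fault to the VMM */
        d->pa = hp;                /* rewrite to host physical          */
        ptorus_dma_start(d);
        return true;
    }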


Status & Initial Evaluation

Functionally complete:
• Virtual PPC core and MMU
• Virtual torus and tree
• UP Linux 2.6 guests
• Virtualized Ethernet (within the guest)

Initial benchmarks:
• Ethernet performance (mapped onto the torus)
• Collective much worse; still chasing testing/setup problems

[Figure: several nodes each run a Linux 2.6 guest on the L4-based VMM, alongside a native Linux 2.6 node.]


Conclusion

Standard server / commercial workloads are scaling out:
• The current BG/P programming model makes the transition hard
• Perfect choice for traditional HPC apps
• Lacks compatibility with standard OSes (Linux) and network protocols (Ethernet)

Idea: use a light-weight kernel and a VMM:
• The VMM for HW compatibility, the LWK for a low footprint
• A development path for converging platforms and workloads
• The L4-based VMM prototype is functionally complete
• Performance ranges from acceptable (torus) to under-optimized (collective)

Things to explore:
• Native application support (MPICH2/L4 and Memcache/L4 for BG/P in preparation)
• Performance, isolation
