A Light-Weight Virtual Machine Monitor for Blue Gene/P
Jan Stoess, System Architecture Group, Department of Computer Science
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
BG/P Programming Model
Traditional BG/P supercomputer programming model:
- Parallel programming run-time (MPI)
- Compute-Node Kernel (CNK): the OS for massively parallel applications
  - Light-weight kernel, "POSIX-ish"
  - Function-shipping to IO nodes
- Perfect choice for current HPC apps:
  - MPI programming model
  - Low OS noise
  - Performance, scalability, customizability
[Figure: compute nodes 0..63 run CNK with MPI applications; IO requests are function-shipped to CIOD on a Linux IO node]
Application Scale-Out
Standard server / commercial workloads are scaling out:
- Big data (Hadoop, stream processing, caching)
- Clouds (EC2, vCloud)
- Commodity OSes, runtimes, (HW): Linux, Java, Ethernet

CNK is not truly compatible:
- No full Linux/POSIX compatibility
- No compatibility with standardized networking protocols
HPC readiness vs. Compatibility
Commodity OSes are not designed for supercomputers:
- OS footprint and complexity
- Network protocol overhead
- A problem for standard scale-out software as well

BG could run such workloads, in principle:
- "Cores, memory, interconnect"
- Reference HW for future data centers

Can we have
- ... the HPC strength of CNK, and
- ... the compatibility of a commodity OS / network stack?
A Light-Weight VMM for Supercomputers
Idea: use a light-weight kernel and a VMM
The VMM provides HW compatibility:
- Can run Linux in a VM
- Can run Linux applications
- Can communicate via (virtualized) Ethernet

The light-weight kernel preserves a short path to the HW:
- Runs HPC apps "natively"
- Direct access to HPC interconnects
- Kernel small and customizable, low kernel footprint

A development path for converging platforms and workloads.
[Figure: MPI applications run natively on the LWK, next to a guest VM hosted by the VMM]
Prototype
L4-based prototype:
- Small, privileged micro-kernel
- User-level VMM component

Current focus: the VMM layer (this talk)
- Virtual BG cores, memory, interconnects
- Support for standard OSes

Future work: native HPC app support
- L4 has a native API
- Leverage existing research on L4 OS frameworks and native HPC app layers [Kitten/Palacios]
[Figure: L4 micro-kernel hosting the user-level VMM with a Linux 2.6 guest and a native L4 app]
BG/P Overview and VMM Agenda

Compute nodes:
- 4 PowerPC 450 cores
- 2 GB physical memory
- MMU/TLB
- Interrupt controller (BIC)
- Torus and collective interconnects
- Other HW (mailboxes, JTAG) not considered

IO nodes:
- Not virtualized
- Run a special Linux for booting
[Figure: compute node HW (PPC 450, TLB, BIC, torus, collective links) and the virtualized counterparts (vPPC, vTLB, vBIC, vTorus, vCollective) exposed to the guest VM by L4 and the VMM]
L4 and VMM architecture
L4 offers generic OS abstractions:
- Threads
- Address spaces
- Synchronous IPC
- IPC-based exception / IRQ handling

The VMM is just a user-level program:
- Receives a "VM exit" message from the VM
- Emulates the exit and replies with a state-update message (see the loop sketch below)

L4 virtualization enhancements:
- Empty address spaces
- Extended VM/thread state handling
- Internal VM TLB handling
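A minimal sketch of this split, assuming a hypothetical L4-style IPC API (l4_ipc_recv_exit, l4_ipc_reply_resume) and message layout; the names are illustrative, not the actual L4 interface:

```c
/* Minimal sketch of the user-level VMM event loop. The IPC calls and
 * state layout are assumptions, not the real L4 API. */
#include <stdint.h>

typedef struct {
    uint32_t pc;        /* faulting guest program counter   */
    uint32_t msr;       /* guest machine-state register     */
    uint32_t gpr[32];   /* guest general-purpose registers  */
} vm_state_t;

/* Hypothetical kernel primitives: block for the next exit IPC of a
 * guest VM thread, and reply with updated state to resume it. */
extern void l4_ipc_recv_exit(int vm_tid, vm_state_t *state);
extern void l4_ipc_reply_resume(int vm_tid, const vm_state_t *state);
extern void emulate_instruction(vm_state_t *state); /* see next slides */

void vmm_loop(int vm_tid)
{
    vm_state_t state;
    for (;;) {
        /* 1. Kernel synthesizes an exit IPC when the guest traps.   */
        l4_ipc_recv_exit(vm_tid, &state);
        /* 2. Decode and emulate the trapping instruction.           */
        emulate_instruction(&state);
        /* 3. Reply; the kernel installs the new state and resumes.  */
        l4_ipc_reply_resume(vm_tid, &state);
    }
}
```

The point of the design is that all emulation policy lives in this unprivileged loop; the kernel only transports state.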
Virtual PowerPC processor
The VM runs in user mode; privileged PPC instructions trap. L4 propagates each trap to the user-level VMM:
- Kernel-synthesized IPC
- VM/thread state included in the message

The VMM receives the trap IPC:
- Decodes the message (faulting PC)
- Emulates the instruction (e.g., device IO; sketched below)
- Sends back a reply IPC

Upon reception, the kernel installs the new state and resumes the guest.
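A sketch of the emulation step for a privileged instruction such as mtdcr (move-to-device-control-register). The decode helpers and the virtual-device hook are assumptions; real PPC decoding (split DCR fields, etc.) is elided:

```c
#include <stdint.h>

typedef struct { uint32_t pc, msr, gpr[32]; } vm_state_t;

/* Assumed helpers: fetch the faulting instruction from guest memory,
 * recognize/decode mtdcr, and a virtual-device DCR write hook. */
extern uint32_t guest_fetch32(const vm_state_t *s, uint32_t addr);
extern int      is_mtdcr(uint32_t insn);
extern uint32_t decode_dcrn(uint32_t insn);
extern void     vdev_dcr_write(uint32_t dcrn, uint32_t val);

void emulate_instruction(vm_state_t *s)
{
    uint32_t insn = guest_fetch32(s, s->pc);
    uint32_t rs   = (insn >> 21) & 0x1f;   /* RS register field */

    if (is_mtdcr(insn))
        /* Redirect the privileged DCR write to the emulated device. */
        vdev_dcr_write(decode_dcrn(insn), s->gpr[rs]);
    /* ... other privileged instructions elided ... */

    s->pc += 4;   /* resume past the emulated instruction */
}
```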
[Figure: a trapping mtdcr causes an exit IPC carrying PC, R1, GPRs, and state to the VMM; the resume IPC carries the updated state back, with the PC advanced to PC+4]
Virtual MMU/TLB
PowerPC 450:
- 64-entry TLB
- No HW-walked page tables

Need to virtualize MMU translation at two levels:
- Guest virtual to guest physical (guest-managed)
- Guest physical to host physical (L4/VMM-managed)
- Both compressed into the HW TLB (see the sketch below)
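A sketch of the compression: a guest-written GV→GP entry is combined with the L4-managed GP→HP mapping to yield the GV→HP entry actually installed in hardware. Types and the lookup helper are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t vaddr, paddr;
    uint32_t rwx, attr, size, asid;
} tlb_entry_t;

/* L4/VMM-managed GP->HP lookup (standard L4 memory management). */
extern bool l4_lookup_gp_to_hp(uint32_t gpaddr, uint32_t *hpaddr);

/* Combine a guest entry (GV->GP) with the host mapping (GP->HP). */
bool make_hw_entry(const tlb_entry_t *guest, tlb_entry_t *hw)
{
    uint32_t hp;
    if (!l4_lookup_gp_to_hp(guest->paddr, &hp))
        return false;       /* GP->HP miss: VMM must map first */

    *hw = *guest;           /* keep rwx/attr/size from the guest;   */
    hw->paddr = hp;         /* a real VMM would also intersect      */
    return true;            /* permissions with the host mapping    */
}
```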
[Figure: guest TLB entries (GV→GP) are combined with L4 mappings (GP→HP) into HW TLB entries (GV→HP), each carrying vaddr, paddr, rwx, attributes, size, and AS bits]
Virtual MMU/TLB

L4 keeps a per-VM "shadow TLB":
- Intercepts guest TLB accesses (tlbwe, tlbre, ...)
- Fills the shadow TLB accordingly
- Stores GV→GP mappings

L4 keeps per-address-space memory mappings:
- Standard L4 memory management
- Stores GP→HP mappings
- User-directed; the VMM carries the mappings out

TLB miss handling (sketched below):
- On a guest-virtual TLB miss, deliver the miss to the guest
- On a guest-physical TLB miss, deliver it to the VMM (#pf IPC)
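A sketch of this two-way miss routing, with assumed helper names: a miss in the guest's own (shadow) entries goes to the guest, a miss in the L4 GP→HP mappings goes to the VMM as a #pf IPC, and a double hit fills the HW TLB:

```c
#include <stdint.h>
#include <stdbool.h>

extern bool shadow_tlb_lookup(uint32_t gvaddr, uint32_t *gpaddr);
extern bool l4_lookup_gp_to_hp(uint32_t gpaddr, uint32_t *hpaddr);
extern void inject_guest_tlb_miss(uint32_t gvaddr);  /* guest handler */
extern void send_pf_ipc_to_vmm(uint32_t gpaddr);     /* VMM handler   */
extern void hw_tlb_fill(uint32_t gvaddr, uint32_t hpaddr);

void handle_tlb_miss(uint32_t gvaddr)
{
    uint32_t gp, hp;

    if (!shadow_tlb_lookup(gvaddr, &gp)) {
        inject_guest_tlb_miss(gvaddr);   /* guest-virtual miss   */
        return;
    }
    if (!l4_lookup_gp_to_hp(gp, &hp)) {
        send_pf_ipc_to_vmm(gp);          /* guest-physical miss  */
        return;
    }
    hw_tlb_fill(gvaddr, hp);  /* both hit: install GV->HP entry */
}
```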
[Figure: TLB miss path; entry in the shadow TLB? entry in the L4 MM structures? otherwise a #pf IPC is sent to the VMM]
Virtual MMU/TLB Protection

PowerPC TLB protection features:
- User/kernel (U/K) bits in TLB entries
- Address-space IDs (256 ASIDs)

Standard Linux behavior:
- U/K bits for kernel separation
- ASIDs for process separation

Requirements:
- Must virtualize protection: guest code runs at user level
- Must support shared mappings
- Compressed, as for translation
- Minimize the number of TLB flushes
[Figure: TLB entry fields (VADDR, PADDR, rwx, U/K, size, ASID, TS) matched against the PID register, MSR, and effective address]
Virtual MMU/TLB Protection

Virtualized L4/VMM scheme: use U/K bits and ASIDs (sketch below)
- VM runs all user-level (no U/K distinction)
- ASID = 1: guest kernel
- ASID = 2: guest user
- ASID = 0: shared mappings

Analysis:
- No TLB flush on guest syscall
- No TLB flush on VM exit
- Flushes only on guest process switches or world switches
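A sketch of the ASID assignment: guest kernel, guest user, and shared mappings get fixed ASIDs, so neither guest syscalls nor VM exits need a TLB flush. The PID-register helper and flush helper are assumptions:

```c
#include <stdint.h>

enum {
    ASID_SHARED       = 0,  /* mappings visible in both guest modes */
    ASID_GUEST_KERNEL = 1,
    ASID_GUEST_USER   = 2,
};

extern void set_hw_pid(uint32_t asid); /* PPC PID write (assumed)  */
extern void flush_guest_asids(void);   /* flush helper (assumed)   */

/* Guest mode change (syscall/return): just switch the active ASID. */
void on_guest_mode_switch(int to_kernel)
{
    set_hw_pid(to_kernel ? ASID_GUEST_KERNEL : ASID_GUEST_USER);
}

/* Guest process switch: the guest's translations change, so stale
 * guest entries must go; shared entries (ASID 0) survive. */
void on_guest_process_switch(void)
{
    flush_guest_asids();
}
```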
[Figure: guest user/kernel and L4/VMM translations coexist in the TLB, separated by the TS bit together with the ASID and U/K fields and matched against PID, MSR, and effective address]
Virtual Collective Interconnect
The collective network:
- Tree network, 7.8 Gbit/s, < 6 µs latency
- Packet-based, two virtual channels
- Packets: a header plus 16 × 128-bit FPU words of payload (illustrated below)
- RX/TX FIFOs
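An illustrative layout of one collective packet, per the description above: a header followed by 16 quad-word (128-bit) payload slots, i.e. 256 bytes of payload. The field names are assumptions, not the BG/P register layout:

```c
#include <stdint.h>

typedef struct {
    uint32_t header;            /* route/channel/tag bits (assumed) */
    uint8_t  payload[16][16];   /* 16 x 128-bit FPU words           */
} coll_packet_t;

_Static_assert(sizeof(((coll_packet_t *)0)->payload) == 256,
               "16 x 128-bit payload words = 256 bytes");

/* On BG/P the FIFOs are fed/drained with 128-bit FPU loads/stores;
 * a portable sketch would copy one quad-word at a time. */
```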
[Figure: compute nodes and the IO node connected by the collective network; each OS drives RX and TX FIFOs]
The virtualized collective (sketched below):
- TX: trap guest channel accesses, then issue them on the physical collective link
- RX: copy GPR/FPU data into a private buffer, notify the guest, then trap the guest's vCOLL accesses
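A sketch of both paths, with assumed helper names: TX traps are replayed on the physical link; RX data is drained into a private buffer and handed out when the guest touches its vCOLL FIFO:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t header; uint8_t payload[256]; } coll_packet_t;

extern void pcoll_send(const coll_packet_t *p);    /* physical TX FIFO */
extern bool pcoll_recv(coll_packet_t *p);          /* physical RX FIFO */
extern void buf_push(int vm, const coll_packet_t *p);
extern bool buf_pop(int vm, coll_packet_t *p);
extern void notify_guest_irq(int vm);              /* virtual RX IRQ   */

/* Guest wrote a packet to its vCOLL TX FIFO (trapped access). */
void vcoll_tx_trap(int vm, const coll_packet_t *p)
{
    (void)vm;
    pcoll_send(p);              /* issue on the physical collective */
}

/* Physical RX interrupt: stash the packet, then poke the guest. */
void pcoll_rx_irq(void)
{
    coll_packet_t p;
    while (pcoll_recv(&p)) {
        buf_push(/*vm=*/0, &p); /* single-VM sketch: one buffer */
        notify_guest_irq(0);
    }
}

/* Guest read from its vCOLL RX FIFO (trapped access). */
bool vcoll_rx_trap(int vm, coll_packet_t *p)
{
    return buf_pop(vm, p);
}
```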
[Figure: each guest OS drives vCOLL FIFOs; the VMM forwards TX traffic to the physical collective (pCOLL) and buffers RX data (BUF) per guest]
Virtual Torus Interconnect
[Figure: nodes exchange packets over the torus through send/receive buffers and rDMA operations such as rget(0,0) and dput(2,3)]
The torus network:
- 3D network, 40.8 Gbit/s, 5 µs latency
- Packet-based, 4 RX/TX groups (buffer-based), plus rDMA
- rDMA: direct access by (user) software, memory descriptors, and a put/get interface (direct-put, remote-get); an illustrative descriptor shape follows
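An illustrative shape of an rDMA memory descriptor and the put/get interface: user software posts descriptors naming memory regions plus torus coordinates. Field and function names are assumptions, not the BG/P DMA layout:

```c
#include <stdint.h>

typedef struct {
    uint32_t paddr;   /* physical base of the memory region */
    uint32_t length;  /* region size in bytes               */
} dma_desc_t;

/* direct-put: write local data into a remote node's memory region. */
extern void dput(int x, int y, int z, const dma_desc_t *local_src,
                 const dma_desc_t *remote_dst);

/* remote-get: ask a remote node to send a region back to us. */
extern void rget(int x, int y, int z, const dma_desc_t *remote_src,
                 const dma_desc_t *local_dst);
```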
[Figure: guests drive vTORUS send/receive buffers; the VMM translates descriptors (GP→HP) and issues them on the physical torus (pTORUS)]
The virtualized torus model (sketched below):
- Trap guest descriptor accesses
- Translate guest-physical to host-physical addresses
- Then issue the descriptor on the HW torus
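A sketch of that interception, with assumed helper names: the VMM rewrites guest-physical addresses in a trapped descriptor to host-physical ones before posting it to the HW DMA:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t paddr, length; } dma_desc_t;

extern bool l4_lookup_gp_to_hp(uint32_t gpaddr, uint32_t *hpaddr);
extern void ptorus_post(const dma_desc_t *d);  /* physical torus DMA */

/* Guest wrote a DMA descriptor (trapped vTORUS access). */
bool vtorus_desc_trap(const dma_desc_t *guest_desc)
{
    dma_desc_t host = *guest_desc;

    /* Translate GP->HP; reject descriptors outside the VM's memory.
     * A real VMM would also check the whole region is contiguous in
     * host-physical memory. */
    if (!l4_lookup_gp_to_hp(guest_desc->paddr, &host.paddr))
        return false;

    ptorus_post(&host);   /* safe for HW: host-physical addresses now */
    return true;
}
```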
Status & Initial Evaluation
Functionally complete:
- Virtual PPC core and MMU
- Virtual torus and tree
- UP Linux 2.6 guests
- Virtualized Ethernet (within the guest)

Initial benchmarks:
- Ethernet performance (mapped onto the torus) is acceptable
- Collective performance is much worse; still chasing testing/setup problems
[Figure: multiple nodes running Linux 2.6 guests on the L4/VMM stack, alongside a native Linux 2.6 node]
Conclusion
Standard server / commercial workloads are scaling out:
- The current BG/P programming model makes this transition hard
- It is a perfect choice for traditional HPC apps
- But it lacks compatibility with standard OSes (Linux) and network protocols (Ethernet)

Idea: use a light-weight kernel and a VMM
- The VMM provides HW compatibility; the LWK keeps the footprint low
- A development path for converging platforms and workloads
- The L4-based VMM prototype is functionally complete
- Performance ranges from acceptable (torus) to under-optimized (collective)

Things to explore:
- Native application support (MPICH2/L4 and Memcache/L4 for BG/P in preparation)
- Performance, isolation