Managing NVM in The Machine
Rocky Craig, Master Linux Technologist Linux Foundation Vault 2016
The Machine Project from Hewlett Packard Enterprise
– Massive SoC pool
– Massive memory pool
– Photonic fabric
http://www.labs.hpe.com/research/themachine/
“The Machine: A New Kind of Computer”
Memory-Centric Computing: “No IO” from NVM persistence
// Give me some space in a way I can find it again tomorrow
int *vaddr = TheMachineVoodoo(...., identifier, ….., size, ….);
// Use it
*vaddr = 42;
// Don't lose it
exit(0);
The NVM Fabric of The Machine
[Diagram: SoCs with local DRAM connect through Fabric Bridges to a Fabric Switch and a shared pool of NVM.]
Hardware Point of View for Fabric-Attached Memory (FAM)
–Basic unit of SoC HW memory access is still the page
– Looks like DRAM, smells like DRAM...
– But it's not identified as DRAM
–Basic unit of NVM access granularity is the 8 GB “book”
– A collection of pages
– 4T per node == 512 books, goal of 80 nodes
–Memory-mapping operations provide direct load/store access
– FAM on the same node as the SoC doing load/store is cache-coherent
– FAM on a different node is not cache-coherent
Hardware Platform Basics
[Diagram: Nodes 1..N, each running Linux with a Fabric Bridge and local NVM, connected through Fabric Switches into a single load/store domain.]
[Diagram: Each node pairs an SoC and Fabric Bridge with 256 GB of DRAM and 1-4 TB of Fabric-Attached Memory.]
TheMachineVoodoo(): rough consensus and running code
Librarian File System (LFS)
– Runs on each node
– File system API under a mount point, typically “/lfs”
– Communicates to metadata server over SoC Ethernet
– Provides access to FAM books for applications on SoC

Librarian
– Running on Top of Rack Management Server (ToRMS)
– FS metadata (“shelves” and attributes) managed in SQL database
– Never sees actual book contents in FAM
Memory-Centric Computing under LFS
fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 10 * TB);
int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
*vaddr = 42;
Possible usage pattern
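The pattern above can be exercised today against an ordinary file standing in for a /lfs shelf. A minimal sketch (the path and size are illustrative, not LFS-specific): a store through the mapping persists across process exit and is visible on remap, with no read()/write() I/O path involved.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create/extend `path`, map it MAP_SHARED, store `value` at offset 0,
 * then unmap and close.  Returns 0 on success, -1 on failure. */
static int store_value(const char *path, int value, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, len) != 0) { close(fd); return -1; }
    int *vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { close(fd); return -1; }
    *vaddr = value;          /* direct store through the mapping */
    munmap(vaddr, len);
    close(fd);
    return 0;
}

/* Re-open and re-map `path` "tomorrow" and read the value back. */
static int load_value(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int *vaddr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (vaddr == MAP_FAILED) { close(fd); return -1; }
    int v = *vaddr;
    munmap(vaddr, len);
    close(fd);
    return v;
}
```

On The Machine the same calls against a shelf under /lfs resolve to FAM books instead of page-cache-backed file pages.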
Expected use patterns
Implications:
NUMA and cache coherency
[Diagram repeated: SoCs with local DRAM connect through Fabric Bridges to a Fabric Switch and the shared NVM pool.]
LFS POSIX Extended File Attributes
$ touch /lfs/myshelf
$ getfattr -d /lfs/myshelf
getfattr: Removing leading '/' from absolute path names
# file: lfs/myshelf
user.LFS.AllocationPolicy="RandomBooks"
user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."
user.LFS.<other stuff but you get the idea>
$ truncate -s40G /lfs/myshelf
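Programs can read and set these attributes with the standard Linux xattr calls; nothing below is LFS-specific, and the attribute name simply follows the slide. A sketch, exercised here against an ordinary file (some filesystems reject user.* attributes, which surfaces as ENOTSUP):

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Create an empty file, like `touch`.  Returns 0 on success. */
static int touch_file(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Set the shelf's allocation policy.  Returns 0 or -errno. */
static int set_alloc_policy(const char *path, const char *policy)
{
    if (setxattr(path, "user.LFS.AllocationPolicy",
                 policy, strlen(policy), 0) != 0)
        return -errno;
    return 0;
}

/* Read the policy back into `buf`.  Returns 0 or -errno. */
static int get_alloc_policy(const char *path, char *buf, size_t buflen)
{
    ssize_t n = getxattr(path, "user.LFS.AllocationPolicy", buf, buflen - 1);
    if (n < 0)
        return -errno;
    buf[n] = '\0';
    return 0;
}
```

Under /lfs, lfs_fuse.py would intercept these calls and forward the policy change to the Librarian's metadata database.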
Librarian and Librarian File System
[Diagram, one SoC: "myprocess" makes FS API system calls through the VFS and fuse.ko (/dev/fuse) into user-space lfs_fuse.py (fuse.py / libfuse.so), which talks over Ethernet to librarian.py and its SQL database on the ToRMS. Files under /lfs correspond to books and shelves.]
The database is initialized with the book layout and the topology of all nodes / enclosures / racks. During runtime it tracks shelves, usage, and attributes.
Where's the beef?
Oh this one again
[Diagram repeated: Nodes 1..N, each running Linux with a Fabric Bridge and local NVM, connected through Fabric Switches.]
Developing without hardware
Emulated sharing
[Diagram: Encapsulations 1..N, each running lfs_fuse.py over a slice of physical memory, connect to librarian.py over the LAN.]

Early LFS development: self-hosted
[Diagram: The same fuse stack as on a node (myprocess → VFS → fuse.ko → /dev/fuse → lfs_fuse.py via fuse.py / libfuse.so), with librarian.py and its SQL database reached over localhost.]
Shadow File
$ vi smalltm.ini # node count, book size, book total
$ create_db.py smalltm.ini smalltm.db
$ librarian.py …. --db_file=smalltm.db
$ truncate -s 16G /tmp/GlobalNVM
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2
: :
Address Translations
[Diagram: Address translation path. ARM cores translate VA (48 b, 256 TB) → PA (44-48 b, 16-256 TB) through per-core MMUs and caches; PCI, etc. share the PA space and node DRAM maxes out at 1 TB. The Fabric Bridge holds ~1900 PA → LA book descriptors (14.9 TB of apertures, worst case), plus a book firewall and fabric requester, mapping into the 53 b (8 PB) “book space” within the 75 b (32 ZB) SoC fabric space.]
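The sizes quoted above are easy to sanity-check: ~1900 descriptors of 8 GB books cover roughly 14.9 TB of apertures, a 53-bit book space is 8 PB, and a 4 TB node holds 512 books. A small arithmetic check (binary units assumed throughout):

```c
#include <stdint.h>

/* Total aperture capacity: descriptors each map one book. */
static uint64_t aperture_bytes(uint64_t descriptors, uint64_t book_bytes)
{
    return descriptors * book_bytes;
}
```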
Page and book faults
fd = open("/lfs/bigone"...          → passthrough to lfs_fuse::open()
ftruncate(fd, 20 * TB);             → passthrough to lfs_fuse::ftruncate()
int *vaddr = mmap(... fd, ...);     → stays in kernel (FuSE hook)
*vaddr = 42;                        → starts in kernel LFS page fault handler
Librarian File System – Data in FAM (Hardware)
[Diagram, one node: myprocess → VFS → tm-fuse.ko (/dev/fuse) → lfs_fuse.py (tm-fuse.py / tm-libfuse.so), with the Fabric Bridge FPGA providing the data path to FAM; metadata goes over Ethernet to librarian.py and its SQL database on the ToRMS.]
Descriptors are in short supply
*(vaddr + 1G) = 43;     → starts in kernel LFS page fault handler
(touch enough space to use all descriptors)
*onetoomany = 43;       → need to reclaim a descriptor
Lather, rinse, repeat
LFS & Driver Development on QEMU and IVSHMEM
IVSHMEM as Global FAM
[Diagram: QEMU guests act as nodes, each running lfs_fuse.py with emulated apertures; a modified Nahanni server manages the file used as the global backing store, and librarian.py coordinates.]
* Guest-private IVSHMEM regions emulate bridge resource space
Platforms and environments
[Table: Three environments — Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine — each carrying the same stack: Application (New APIs, POSIX APIs) → LFS → Librarian + Drivers → Firmware / Hardware.]
libpmem
–Part of http://pmem.io/nvml/
–API for controlling data persistence
– Flushing SoC caches
– Clearing memory controller buffers
–Accelerated APIs for persistent data movement
– Non-temporal copies
– Bypass SoC caches
–Additions for The Machine
– APIs for invalidating SoC caches
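The discipline libpmem enforces is store-then-flush: a store may sit in an SoC cache until it is explicitly pushed to media. Since libpmem itself may not be installed everywhere, the sketch below uses msync() as a portable stand-in for pmem_persist() (which flushes CPU caches for true pmem mappings); the file path is illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Portable stand-in for pmem_persist(): flush a mapped range to media.
 * msync() requires a page-aligned address, so align down first. */
static int persist_range(void *addr, size_t len)
{
    uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)addr & ~(ps - 1);
    size_t span = len + ((uintptr_t)addr - page);
    return msync((void *)page, span, MS_SYNC);
}

/* Map `path`, store `value`, and force it out of the cache hierarchy. */
static int store_and_persist(const char *path, int value)
{
    int fd = open(path, O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0) { close(fd); return -1; }
    int *vaddr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { close(fd); return -1; }
    *vaddr = value;                            /* may linger in SoC cache */
    int rc = persist_range(vaddr, sizeof *vaddr);  /* force to media */
    munmap(vaddr, 4096);
    close(fd);
    return rc;
}
```

On real pmem, pmem_persist() avoids the syscall entirely by issuing cache-flush instructions from user space.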
Fabric-Attached Memory Atomics
–Native SoC atomic instructions are cache-dependent
– Do not work between nodes
–Bridge and switch hardware includes fabric-native atomic operations
–Proprietary fam-atomic library provides API
– Atomic read/write, compare/exchange, add, bitwise and/or
– Cross-node spin locks
– Depends on LFS for VA → PA → FA translations
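The fam-atomic API is proprietary, so the sketch below only illustrates the semantics it provides (fetch-add, compare-exchange, and a CAS-built spinlock) using C11 node-local atomics. As the slide notes, these cache-dependent native instructions do not work across nodes; the real library issues fabric-native operations through the bridge instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fetch-and-add: returns the value held before the addition. */
static int fetch_add(atomic_int *v, int inc)
{
    return atomic_fetch_add(v, inc);
}

/* Compare-and-exchange: swap in `desired` only if *v == expected. */
static bool compare_exchange(atomic_int *v, int expected, int desired)
{
    return atomic_compare_exchange_strong(v, &expected, desired);
}

/* Spinlock built from CAS — the shape a cross-node FAM spinlock takes. */
static void spin_lock(atomic_int *lock)
{
    int zero = 0;
    while (!atomic_compare_exchange_weak(lock, &zero, 1))
        zero = 0;   /* failed CAS overwrites `zero`; reset and retry */
}

static void spin_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

In the fabric-native version, the lock word lives in FAM and the bridge performs the compare/exchange, so contenders on different nodes never rely on cache coherence.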
LFS native block devices
–Legacy applications or frameworks that need a block device
– File-system dependent (ext4)
– Ceph
–Triggered via mknod
–Simplifications for proof-of-concept
– Plagiarize drivers/nvdimm/pmem.c
– Avoid cache complications: node-local only
– Lock the descriptors
The Future
– Full integration into management infrastructure of The Machine
– Frameworks / Middleware / Demos / Applications / Stress testing
– Optimizations (e.g., huge pages)
– Learn, learn, learn
– More capable or specialized SoCs
– Deeper integration of fabric
– Enablement of NVM technologies at production scale
– Harden proven software (e.g., replace FuSE with a “real” file system)
– True concurrent file system
– Eliminate separate ToRMS server
– ???????
Open Source
–Yes, we're going to release all system software
– Librarian, LFS, kernel modules
–Started with FAM Emulation in December 2015
– http://github.com/FabricAttachedMemory
– x86 and Debian Jessie
– “Platform” only
How Fast Is The Field of Dreams?
Rocky Craig <first.last>@hpe.com