SLIDE 1

Managing NVM in The Machine

Rocky Craig, Master Linux Technologist
Linux Foundation Vault 2016

SLIDE 2

The Machine Project from Hewlett Packard Enterprise


  • Massive SoC pool
  • Massive memory pool
  • Photonic fabric

http://www.labs.hpe.com/research/themachine/

“The Machine: A New Kind of Computer”

SLIDE 3

Memory-Centric Computing: “No IO” from NVM persistence


// Give me some space in a way I can find it again tomorrow
int *vaddr = TheMachineVoodoo(...., identifier, ....., size, ....);

// Use it
*vaddr = 42;

// Don't lose it
exit(0);

SLIDE 4

The NVM Fabric of The Machine


[Diagram: four SoCs, each with local DRAM and a Fabric Bridge, connected through a central Fabric Switch to a pool of fabric-attached NVM.]

SLIDE 5

Hardware Point of View for Fabric-Attached Memory (FAM)

– Basic unit of SoC HW memory access is still the page
  – Looks like DRAM, smells like DRAM...
  – But it's not identified as DRAM
– Basic unit of NVM access granularity is the 8 GB “book”
  – A collection of pages
  – 4 TB per node == 512 books; goal of 80 nodes
– Memory-mapping operations provide direct load/store access
  – FAM on the same node as the SoC doing the load/store is cache-coherent
  – FAM on a different node is not cache-coherent


SLIDE 6

Hardware Platform Basics

[Diagram: Node 1 through Node N; each node runs Linux on n SoCs with a Fabric Bridge and local NVM, and all nodes are interconnected by Fabric Switches.]

SLIDE 7

Single Load/Store Domain

[Diagram: three nodes in a single load/store domain; each SoC has 256 GB of local DRAM and reaches 1-4 TB of Fabric-Attached Memory through its Fabric Bridge.]

SLIDE 8

TheMachineVoodoo(): rough consensus and running code

  • Provide a new file system for FAM allocation
  • File system daemon
    – Runs on each node
    – File system API under a mount point, typically “/lfs”
    – Communicates with the metadata server over SoC Ethernet
    – Provides access to FAM books for applications on the SoC
  • Librarian
    – Runs on the Top of Rack Management Server (ToRMS)
    – FS metadata (“shelves” and attributes) managed in an SQL database
    – Never sees actual book contents in FAM


SLIDE 9

Memory-Centric Computing under LFS


fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
ftruncate(fd, 10 * TB);
int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
*vaddr = 42;
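The same snippet, fleshed out into a minimal self-contained program (error handling added; the TB constant is defined here since the slide assumes it):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define TB (1024UL * 1024 * 1024 * 1024)    /* 1 TiB */

int main(void)
{
    /* Create a shelf; LFS backs it with 8 GB books on demand. */
    int fd = open("/lfs/bigone", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("open"); exit(1); }

    /* Size the shelf: the Librarian allocates books, not blocks. */
    if (ftruncate(fd, 10 * TB)) { perror("ftruncate"); exit(1); }

    /* Map it and simply store: load/store, no read()/write() I/O path. */
    int *vaddr = mmap(NULL, 10 * TB, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (vaddr == MAP_FAILED) { perror("mmap"); exit(1); }

    *vaddr = 42;    /* direct store into FAM; Slide 24's libpmem APIs
                       make such stores durable through the SoC caches */
    close(fd);      /* the shelf, and the 42, persist */
    return 0;
}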

SLIDE 10

Possible usage pattern

  • open(.....)
  • truncate(1 or 2 books)
  • mmap() and use “briefly”
  • read() or write() mixed in
  • truncate(up or down) a lot
  • close()
  • copy it, unlink it, save it for later...
  • open(....)
  • truncate(1 or 2 books)
  • Lather, rinse, repeat, especially across SoCs
SLIDE 11

Expected use patterns

  • open(.....)
  • truncate(1 or 2 books)
  • mmap() and use “briefly”
  • read() or write() mixed in
  • close()
  • unlink()
  • open(....)
  • truncate(1 or 2 books)
  • Lather, rinse, repeat
  • open()
  • truncate(thousands of books)
  • mmap() sections across many cores/SoCs
  • Run until solution convergence
  • Sporadically, truncate(increase size) (this loop is sketched after the implications below)

Implications:

  • Solution architectures need re-thinking
  • It's not only about persistence
  • File-system performance is not critical
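A minimal sketch of that convergence-loop pattern, assuming a hypothetical shelf name and stubbed-out solver routines (error handling omitted):

/* Illustrative only: a long-running solver growing its shelf as needed. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define BOOK (8UL << 30)    /* the 8 GB book from Slide 5 */

static int  converged(char *p, size_t n) { (void)p; (void)n; return 1; }
static void iterate(char *p, size_t n)   { (void)p; (void)n; }
static int  need_more_space(void)        { return 0; }

int main(void)
{
    int fd = open("/lfs/solver", O_CREAT | O_RDWR, 0666);
    size_t books = 2, len = books * BOOK;   /* truncate(1 or 2 books) */

    ftruncate(fd, len);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    while (!converged(p, len)) {
        iterate(p, len);                    /* run until convergence */
        if (need_more_space()) {            /* sporadic truncate(up) */
            munmap(p, len);
            books *= 2;
            len = books * BOOK;
            ftruncate(fd, len);
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        }
    }
    munmap(p, len);
    close(fd);    /* the shelf and its contents persist for later runs */
    return 0;
}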
SLIDE 12

NUMA and cache coherency

[Diagram: repeat of the Slide 4 fabric topology: SoCs with local DRAM and Fabric Bridges, connected through the Fabric Switch to the NVM pool.]

SLIDE 13

LFS POSIX Extended File Attributes

$ touch /lfs/myshelf
$ getfattr -d /lfs/myshelf
getfattr: Removing leading '/' from absolute path names
# file: lfs/myshelf
user.LFS.AllocationPolicy="RandomBooks"
user.LFS.AllocationPolicyList="RandomBooks,LocalNode,Nearest,...."
user.LFS.<other stuff but you get the idea>
$ truncate -s40G /lfs/myshelf
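Since the policy is a plain POSIX xattr, it can equally be set from C before sizing the shelf; a minimal sketch using the attribute name shown above, with a value taken from the AllocationPolicyList:

#include <string.h>
#include <sys/xattr.h>

/* Ask LFS to place this shelf's books on the local node. */
int use_local_books(const char *shelf)    /* e.g. "/lfs/myshelf" */
{
    const char *policy = "LocalNode";
    return setxattr(shelf, "user.LFS.AllocationPolicy",
                    policy, strlen(policy), 0);
}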

SLIDE 14

Librarian and Librarian File System

[Diagram: on one SoC, myprocess issues FS API system calls through the VFS; fuse.ko forwards them via /dev/fuse to lfs_fuse.py (fuse.py / libfuse.so) in user space. lfs_fuse.py talks over Ethernet to librarian.py and its SQL database on the ToRMS. Files under /lfs are shelves composed of books.]

The database is initialized with the book layout and the topology of all nodes, enclosures, and racks. During runtime it tracks shelves, usage, and attributes.

Where's the beef?

SLIDE 15

Oh this one again

[Diagram: repeat of the Slide 6 platform picture: Node 1 through Node N, each running Linux on n SoCs with a Fabric Bridge and local NVM, joined by Fabric Switches.]

SLIDE 16

Developing without hardware

[Diagram: Encapsulation 1 through Encapsulation N, each running lfs_fuse.py over its own physical memory with sharing emulated; all of them reach librarian.py over the LAN.]

SLIDE 17

Early LFS development: self-hosted

[Diagram: the same stack as Slide 14, but librarian.py, its SQL database, and the lfs_fuse.py instances all run on one machine and talk over localhost.]

Shadow File

$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs1 1
$ lfs_fuse.py --shadow_file /tmp/GlobalNVM localhost /mnt/lfs2 2
  : :
$ vi smalltm.ini        # node count, book size, book total
$ create_db.py smalltm.ini smalltm.db
$ librarian.py .... --db_file=smalltm.db
$ truncate -s 16G /tmp/GlobalNVM
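Since both mounts share one shadow file and one Librarian, cross-"node" visibility can be sanity-checked from a single process. A sketch, assuming the Librarian presents one global shelf namespace (shelf name illustrative, error handling omitted):

/* Write through "node 1", read back through "node 2". */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/mnt/lfs1/shelf", O_CREAT | O_RDWR, 0666);
    ftruncate(fd1, 8UL << 30);    /* one book */
    int *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd1, 0);
    *a = 42;
    msync(a, 4096, MS_SYNC);

    int fd2 = open("/mnt/lfs2/shelf", O_RDONLY, 0);
    int *b = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd2, 0);
    assert(*b == 42);             /* same books behind both mounts */
    return 0;
}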

SLIDE 18

Address Translations


[Diagram: each ARM core translates VA (48b, 256 TB) to PA (44-48b, 16-256 TB) through its caches and the coherent interconnect; node-local DRAM tops out at 1T, with PCI etc. alongside. Behind a book firewall, the Fabric Bridge's ~1900 PA → LA book descriptors expose up to 14.9T of apertures (worst case) into the 53b (8 PB) “book space”; the SoC fabric space is 75b (32 ZB).]
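As a consistency check on the figure's numbers: roughly 1900 book descriptors at one 8 GiB book apiece is 1900 × 8 GiB ≈ 15,200 GiB ≈ 14.8 TiB, which lines up with the bridge's quoted 14.9T of apertures (worst case).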

SLIDE 19

Page and book faults

fd = open("/lfs/bigone"...
ftruncate(fd, 20 * TB);
int *vaddr = mmap(... fd, ...);
*vaddr = 42;

open(): passthrough to lfs_fuse::open()
  • lfs_fuse converses with the Librarian to create a new shelf
  • lfs_fuse returns a file descriptor for VFS

ftruncate(): passthrough to lfs_fuse::ftruncate()
  • Requests keyed on fd
  • lfs_fuse converses with the Librarian to allocate books (LA)

mmap(): stays in the kernel (FUSE hook)
  • Allocate a VMA
  • LFS changes: set up caching structures to assist faulting

*vaddr = 42: starts in the kernel LFS page fault handler
  • If this is the first fault in a book:
    • Overload getxattr() into lfs_fuse
    • lfs_fuse converses with the Librarian to get the book LA info
    • The kernel caches the book LA
  • Get the book LA info from the cache
  • Select and program an unused descriptor
  • Map with vm_insert_pfn() (sketched below)
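A heavily condensed sketch of that first-fault path, for flavor only: the lfs_* and desc_* helpers and the book-shift constant are hypothetical stand-ins for the real module's internals, while vm_insert_pfn() is the stock kernel call of that era for mapping raw page frames:

/* Illustrative only: first store into a not-yet-mapped book. */
#include <linux/mm.h>

#define LFS_PAGES_PER_BOOK_SHIFT 21   /* hypothetical: 8 GiB / 4 KiB pages */

/* Hypothetical helpers standing in for the real LFS internals. */
extern struct book_la *lfs_book_cache_lookup(struct file *f, unsigned long bk);
extern struct book_la *lfs_get_book_la(struct file *f, unsigned long bk);
extern struct descriptor *desc_alloc_unused(void);
extern void desc_program(struct descriptor *d, struct book_la *la);
extern unsigned long la_to_pfn(struct book_la *la, pgoff_t pgoff);

static int lfs_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    unsigned long book = vmf->pgoff >> LFS_PAGES_PER_BOOK_SHIFT;
    struct book_la *la = lfs_book_cache_lookup(vma->vm_file, book);

    if (!la)    /* first fault in this book: getxattr() up to lfs_fuse,
                   which converses with the Librarian, then cache the LA */
        la = lfs_get_book_la(vma->vm_file, book);

    /* Aim an unused descriptor at the book's LA... */
    desc_program(desc_alloc_unused(), la);

    /* ...and wire the faulting page straight to the aperture's frame. */
    if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
                      la_to_pfn(la, vmf->pgoff)))
        return VM_FAULT_SIGBUS;
    return VM_FAULT_NOPAGE;
}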
SLIDE 20

Librarian File System – Data in FAM

[Diagram: the same per-node stack as Slide 14, now with the modified tm-fuse.ko, tm-fuse.py, and tm-libfuse.so components, and the Fabric Bridge FPGA hardware beneath the kernel carrying the data path into FAM.]

SLIDE 21

Descriptors are in short supply

*(vaddr + 1G) = 43;
(touch enough space to use all descriptors)
*onetoomany = 43;

Each store starts in the kernel LFS page fault handler:
  • If this is the first fault in a book:
    • Overload the getxattr hook to lfs_fuse
    • lfs_fuse converses with the Librarian to get the book LA info
    • The kernel caches the book LA
  • Get the book LA info from the cache
  • Reuse a previous descriptor/aperture as the address base
  • Map with vm_insert_pfn()

Lather, rinse, repeat. Eventually a descriptor must be reclaimed:
  • Select an LRU candidate
  • For all VMAs mapped into that descriptor (book LA):
    • flush caches
    • zap_vma_ptes()
  • Reprogram the selected descriptor with the new LA, then vm_insert_pfn() (sketched below)
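The reclaim leg, sketched under the same caveats (the descriptor bookkeeping is hypothetical; zap_vma_ptes() is the real kernel call that removes the stale mappings so the next touch refaults):

/* Illustrative only: steal the least-recently-used descriptor. */
#include <linux/list.h>
#include <linux/mm.h>

/* Hypothetical bookkeeping: which VMA ranges map through a descriptor. */
struct lfs_map {
    struct list_head node;
    struct vm_area_struct *vma;
    unsigned long start, len;
};
struct descriptor { struct list_head mappings; /* ... */ };

extern struct descriptor *desc_lru_victim(void);   /* hypothetical LRU pick */
extern void flush_book_caches(struct lfs_map *m);  /* hypothetical flush    */

static struct descriptor *desc_reclaim(void)
{
    struct descriptor *d = desc_lru_victim();
    struct lfs_map *m;

    list_for_each_entry(m, &d->mappings, node) {
        flush_book_caches(m);                      /* push dirty lines out */
        zap_vma_ptes(m->vma, m->start, m->len);    /* force refault on use */
    }
    return d;    /* caller reprograms it with the new LA + vm_insert_pfn() */
}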
SLIDE 22

LFS & Driver Development on QEMU and IVSHMEM

[Diagram: IVSHMEM as global FAM. Each node is a QEMU guest running lfs_fuse.py with its own aperture regions; a modified Nahanni server manages a file used as the backing store for global FAM, and librarian.py coordinates the guests.]

* Guest-private IVSHMEM regions emulate bridge resource space

SLIDE 23



Platforms and environments

[Diagram: three parallel stacks: Fabric-Attached Memory Emulation (Develop), The Machine Architectural Simulator (Validate), and The Machine itself. Each runs applications against the new APIs and POSIX APIs, over LFS, the Librarian, and drivers; the simulator and the real machine add the firmware and hardware layers underneath.]

SLIDE 24

libpmem

– Part of http://pmem.io/nvml/
– API for controlling data persistence (sketched below)
  – Flushing SoC caches
  – Clearing memory controller buffers
– Accelerated APIs for persistent data movement
  – Non-temporal copies
  – Bypass SoC caches
– Additions for The Machine
  – APIs for invalidating SoC caches

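A short sketch against the public NVML libpmem API (pmem_map_file(), pmem_persist(), pmem_msync(), and pmem_memcpy_persist() are documented calls; the shelf path and sizes are illustrative):

#include <libpmem.h>
#include <string.h>

int main(void)
{
    size_t maplen;
    int is_pmem;

    /* Map a shelf; libpmem reports whether CPU flushes suffice. */
    char *p = pmem_map_file("/lfs/myshelf", 1UL << 30, PMEM_FILE_CREATE,
                            0666, &maplen, &is_pmem);
    if (p == NULL)
        return 1;

    strcpy(p, "hello, FAM");
    if (is_pmem)
        pmem_persist(p, strlen(p) + 1);   /* flush SoC caches, drain buffers */
    else
        pmem_msync(p, strlen(p) + 1);     /* fall back to msync() semantics  */

    /* Non-temporal copy: moves data while bypassing the SoC caches. */
    pmem_memcpy_persist(p + 4096, p, strlen(p) + 1);

    pmem_unmap(p, maplen);
    return 0;
}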

SLIDE 25

Fabric-Attached Memory Atomics

– Native SoC atomic instructions are cache-dependent
  – Do not work between nodes
– Bridge and switch hardware includes fabric-native atomic operations
– The proprietary fam-atomic library provides the API (illustrative sketch below)
  – Atomic read/write, compare/exchange, add, bitwise and/or
  – Cross-node spin locks
  – Depends on LFS for VA → PA → FA translations

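The fam-atomic API itself isn't spelled out on the slide, so the following is purely illustrative: hypothetical function names sketching how a cross-node counter and spin lock over locations in an /lfs shelf might look:

/* Hypothetical sketch: these names are assumptions, not the real API. */
#include <stdint.h>

extern int64_t fam_atomic_64_fetch_add(int64_t *vaddr, int64_t delta);
extern void    fam_spin_lock(void *lock_vaddr);
extern void    fam_spin_unlock(void *lock_vaddr);

/* counter and lock live in an mmap()ed /lfs shelf, visible to all nodes.
   Native atomics would only be coherent node-locally (see above); these
   calls hand the operation to the bridge/switch hardware, using LFS for
   the VA -> PA -> FA translation. */
void record_result(int64_t *counter, void *lock)
{
    fam_spin_lock(lock);
    fam_atomic_64_fetch_add(counter, 1);
    fam_spin_unlock(lock);
}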

SLIDE 26

LFS native block devices

– Legacy applications or frameworks that need a block device
  – File-system dependent (ext4)
  – Ceph
– Triggered via mknod (sketched below)
– Simplifications for proof-of-concept
  – Plagiarize drivers/nvdimm/pmem.c
  – Avoid cache complications: node-local only
  – Lock the descriptors

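"Triggered via mknod" in C, as a guess at the convention (the path and device numbers are placeholders; the slides don't give the real ones):

/* Hypothetical sketch: create a block special file inside /lfs so the
   proof-of-concept LFS block driver can bind books to it (node-local). */
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

int make_lfs_blockdev(void)
{
    /* Placeholder major/minor: whatever the LFS block driver registered. */
    return mknod("/lfs/blockdev0", S_IFBLK | 0660, makedev(250, 0));
}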

SLIDE 27

The Future

  • Short-term

– Full integration into the management infrastructure of The Machine
– Frameworks / Middleware / Demos / Applications / Stress testing
– Optimizations (e.g., huge pages)
– Learn, learn, learn

  • And beyond

– More capable or specialized SoCs
– Deeper integration of the fabric
– Enablement of NVM technologies at production scale
– Harden proven software (e.g., replace FUSE with a “real” file system)
– A true concurrent file system
– Eliminate the separate ToRMS server
– ???????

SLIDE 28

Open Source

– Yes, we're going to release all system software

– Librarian, LFS, kernel modules

–Started with FAM Emulation in December 2015

– http://github.com/FabricAttachedMemory
– x86 and Debian Jessie
– “Platform” only


SLIDE 29

How Fast Is The Field of Dreams?


SLIDE 30

Thank you

Rocky Craig <first.last>@hpe.com
