Evaluating storage APIs for QEMU

SLIDE 1

Linux is a registered trademark of Linus Torvalds.

Evaluating storage APIs for QEMU

Anthony Liguori – aliguori@us.ibm.com
Open Virtualization, IBM Linux Technology Center
Linux Plumbers Conference 2009

SLIDE 2

The V-Word

  • QEMU is used by Xen and KVM for I/O but...
    – this is not a virtualization talk
  • Let's just think of QEMU as a userspace process that can run a variety of “workloads”
  • Think of it like dbench
  • These workloads tend to be very intelligent about how they access storage
  • Workloads have incredible performance demands
  • Our goal is to give our users the best possible performance by default
    – Should Just Work

SLIDE 3

We want

  • Asynchronous completion
  • Scatter/gather lists
  • Batch submission
  • Ability to tell the kernel about request ordering requirements
  • Ability to maintain CPU affinity for request processing

SLIDE 4

Hello World

SLIDE 5

Posix read()/write()

  • Our very first implementation
  • We handled requests synchronously, using read()/write()
  • Scatter/gather lists were bounced
  • Main problem with this approach:
    – Workload cannot run while processing an I/O request
    – I/O performance is terrible
    – Because the workload doesn't run while waiting for I/O, CPU performance is terrible too
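The bounced synchronous path can be sketched like this (a minimal illustration, not QEMU's actual code; `bounce_readv` is a made-up name). The single blocking read() is exactly where the whole workload stalls:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Emulate a scatter/gather read by bouncing: read into one contiguous
 * buffer, then copy pieces out into the caller's iovec elements. */
static ssize_t bounce_readv(int fd, const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;

    char *bounce = malloc(total);
    if (!bounce)
        return -1;

    ssize_t n = read(fd, bounce, total);   /* the workload is stalled here */
    if (n > 0) {
        size_t off = 0;
        for (int i = 0; i < iovcnt && off < (size_t)n; i++) {
            size_t len = iov[i].iov_len;
            if (len > (size_t)n - off)
                len = (size_t)n - off;
            memcpy(iov[i].iov_base, bounce + off, len);
            off += len;
        }
    }
    free(bounce);
    return n;
}
```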

SLIDE 6

Worker thread

SLIDE 7

First improvement

  • Have a single worker thread
  • I/O requests are now asynchronous

– No more horrendous CPU overhead

  • We still bounce
  • We can only handle one request at a time
  • Never merged upstream (Xen only)
SLIDE 8

posix-aio

SLIDE 9

Upstream solution

  • Use posix-aio to support portable AIO
  • Yay!
  • Reasonable API
    – Can batch requests
    – Supports async notification via signals
  • Except it's terrible
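The batching half of the API looks roughly like this (a hedged sketch; `batch_read2` is an invented name, and LIO_WAIT is used here precisely to dodge the signal-based completion path):

```c
#include <aio.h>
#include <string.h>

/* Batch two reads in one lio_listio() call. LIO_WAIT blocks until both
 * complete, avoiding signal-based notification entirely. */
static int batch_read2(int fd, void *a, void *b, size_t len)
{
    struct aiocb cb1, cb2;
    struct aiocb *list[2] = { &cb1, &cb2 };

    memset(&cb1, 0, sizeof(cb1));
    memset(&cb2, 0, sizeof(cb2));
    cb1.aio_fildes = fd; cb1.aio_buf = a; cb1.aio_nbytes = len;
    cb1.aio_offset = 0;          cb1.aio_lio_opcode = LIO_READ;
    cb2.aio_fildes = fd; cb2.aio_buf = b; cb2.aio_nbytes = len;
    cb2.aio_offset = (off_t)len; cb2.aio_lio_opcode = LIO_READ;

    if (lio_listio(LIO_WAIT, list, 2, NULL) < 0)
        return -1;
    return (aio_return(&cb1) == (ssize_t)len &&
            aio_return(&cb2) == (ssize_t)len) ? 0 : -1;
}
```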
SLIDE 10

Posix-aio shortcomings

  • Under the covers, it uses a thread pool
  • Requires bouncing
  • API is not extendable by mere mortals
    – New APIs must be accepted by POSIX before implementing in glibc (or so I was told)
  • Biggest problem was this comment in glibc:
    – “The current file descriptor is worked on. It makes no sense to start another thread since this new thread would fight with the running thread for the resources.”
  • Cannot support multiple AIO requests in flight on a single file descriptor; no response from Ulrich about removing this restriction
  • Signal based completion is painful to use
SLIDE 11

Other posix-aio's

  • It's not just glibc that screws it up
  • FreeBSD has a nice posix-aio implementation that's supported by a kernel module
  • If you use posix-aio without this module loaded, you get a SEGV
  • You need non-portable code to detect if this kernel module is not loaded, and then a fallback mechanism that isn't posix-aio, since a non-privileged user cannot load kernel modules
  • Posix-aio always requires a fallback
SLIDE 12

linux-aio: tux saves the day!

SLIDE 13

linux-aio

  • Forget portability, let's use a native Linux interface
  • Fall back to something lame for everything else
  • Very nice interface
    – Supports scatter/gather requests
    – Can submit multiple requests at once
  • Except it's terrible
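Against the raw syscalls, one request looks roughly like this (Linux-only sketch; the `sys_` wrapper names and `aio_read_once` are ours, and error handling is abbreviated):

```c
#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx) {
    return syscall(__NR_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbs) {
    return syscall(__NR_io_submit, ctx, n, iocbs);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *ev, struct timespec *ts) {
    return syscall(__NR_io_getevents, ctx, min_nr, nr, ev, ts);
}

/* Submit one asynchronous pread through the native linux-aio interface.
 * Note the trap from the next slide: on an unsupported descriptor,
 * io_submit() doesn't fail — it just quietly blocks. */
static ssize_t aio_read_once(int fd, void *buf, size_t len, long long off)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;

    if (sys_io_setup(1, &ctx) < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = fd;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (sys_io_submit(ctx, 1, list) != 1 ||
        sys_io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
        syscall(__NR_io_destroy, ctx);
        return -1;
    }
    syscall(__NR_io_destroy, ctx);
    return (ssize_t)ev.res;
}
```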
SLIDE 14

linux-aio shortcomings

  • Originally, no async notification
    – Must use special blocking function
    – Signal support added
    – Eventfd support added
    – Neither mechanism is probe-able in software, so you have to guess at compile time
    – Libaio spent a good period of time in an unmaintained state, making eventfd support unavailable in even modern distros (SLES11)
  • Only works on some types of file descriptors
    – Usually, O_DIRECT
  • If used on an unsupported file descriptor, you get no error; io_submit() just blocks
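For completeness, the eventfd flavour of completion notification looks like this (self-contained, Linux-only sketch with illustrative names; IOCB_FLAG_RESFD is the real ABI flag, but as the slide says, nothing lets you probe at runtime whether the running kernel honours it):

```c
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Eventfd-based completion: IOCB_FLAG_RESFD makes the kernel bump an
 * eventfd counter when the request finishes, so an event loop can poll
 * one fd instead of taking signals. */
static ssize_t aio_read_eventfd(int fd, void *buf, size_t len,
                                long long off, int efd)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    uint64_t completions;

    if (syscall(__NR_io_setup, 1, &ctx) < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = fd;
    cb.aio_buf = (unsigned long)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_flags = IOCB_FLAG_RESFD;
    cb.aio_resfd = efd;

    if (syscall(__NR_io_submit, ctx, 1, list) != 1) {
        syscall(__NR_io_destroy, ctx);
        return -1;
    }
    /* Blocks until the counter is non-zero, i.e. a request completed. */
    if (read(efd, &completions, sizeof(completions)) != sizeof(completions) ||
        syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL) != 1) {
        syscall(__NR_io_destroy, ctx);
        return -1;
    }
    syscall(__NR_io_destroy, ctx);
    return (ssize_t)ev.res;
}
```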

SLIDE 15

!@#!@@$#!!#@#!#@

SLIDE 16

linux-maybe-sometimes-aio

  • There is no right way to use this API if you actually care about asynchronous IO requests
  • You either have to
    – Require a user to enable linux-aio
    – Be extremely conservative and limit yourselves to things you know work today, like O_DIRECT on a physical device
  • No guarantee these cases will keep working
  • No way of detecting when new cases are added
  • The API desperately needs feature detection
  • It's only useful for databases and benchmarking tools

SLIDE 17

Let's fix posix-aio

SLIDE 18

Our own thread pool

  • Implement our own posix-aio, but don't enforce arbitrary limits
  • Still cannot submit multiple requests on a file descriptor because of the seek/read race
    – Thread1: lseek -> readv
    – Thread2: lseek -> (race) -> writev
  • Tried various work-arounds with dup() (FAIL)
  • Bounce buffers and use pread/pwrite
  • Introduce preadv/pwritev
    – We now have zero copy and simultaneous request processing
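The fix fits in one call (sketch; `worker_readv` is an illustrative name): preadv() takes the offset as an argument, so there is no shared file position for two worker threads to race on.

```c
#include <sys/uio.h>
#include <unistd.h>

/* preadv() fills a scatter/gather list at an explicit offset in a single
 * syscall: no lseek, so no shared-file-offset race between worker
 * threads, and no bounce buffer either. */
static ssize_t worker_readv(int fd, const struct iovec *iov, int iovcnt,
                            off_t offset)
{
    return preadv(fd, iov, iovcnt, offset);
}
```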

SLIDE 19

Shortcomings

  • Thread switch cost is non-negligible
  • We don't have a true batch submission API to the kernel
    – Tagging semantics don't map very well
  • Not very CFQ friendly
    – Each thread is considered a different IO context; CFQ waits for each thread to submit more requests, resulting in long delays
    – Fixable with CLONE_IO – not exposed through pthreads
    – Some attempts at improving upstream

SLIDE 20

Compromise

SLIDE 21

What we do today

  • We use linux-aio when we think it's safe
    – Gives us better performance
    – Only use with block devices
    – Lose features such as host page cache sharing
    – For certain configurations, like c _ _ _ d, making use of the host page cache is absolutely critical
    – Most users use file backed images
  • We fall back to our thread pool otherwise
    – Good compromise of performance and features
    – But we know we can do better
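The fallback decision can be sketched as follows (`pick_backend` is a hypothetical helper, not QEMU's actual code; the real check also depends on how the image was opened, e.g. cache settings):

```c
#include <sys/stat.h>

enum aio_backend { AIO_LINUX_NATIVE, AIO_THREAD_POOL };

/* Heuristic from the slide: only trust linux-aio on block devices; file
 * backed images fall back to the thread pool and keep host page cache
 * sharing. (A fuller check would also require an O_DIRECT-style open.) */
static enum aio_backend pick_backend(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == 0 && S_ISBLK(st.st_mode))
        return AIO_LINUX_NATIVE;
    return AIO_THREAD_POOL;
}
```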

SLIDE 22

What's coming

SLIDE 23

acall/syslets

  • Both are kernel thread pools
    – Avoid thread creation when a request can complete immediately (nice)
    – Lighter weight threads
    – Potentially better thread pool management
  • acall has a narrower scope
    – No clear benefit today over a userspace thread pool other than introducing interfaces
    – Seems easier to merge upstream
  • syslets have a broader scope
    – Complex ability to chain system calls without returning to userspace
    – Seems to have lost merge momentum

SLIDE 24

acall/syslet shortcomings

  • Still does not solve some of the fundamental semantic mapping issues
    – Neither is very useful for our workloads without preadv/pwritev
    – Neither helps request tagging, as request ordering is fundamentally lost in a thread pool
    – Still not obvious how to extend the preadv/pwritev paradigm to support tagging
    – Both have clear benefits though

SLIDE 25

Overall uncertainty

  • We're willing to fix linux-aio
  • We're willing to help solve the problems around acall/syslets
  • The lack of clarity around the future makes it difficult to begin, though
  • Other v-word solutions use custom userspace block IO interfaces to avoid these problems
    – Using confusing terms like “in-kernel paravirtual block device backend” to avoid real review
    – It would be much better to fix the generic interfaces so everyone benefits
    – It's a battle we're losing so far

SLIDE 26

Questions

  • Questions?