Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System

SLIDE 1

Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System

Eric Missimer*, Richard West and Ye Li
Computer Science Department
Boston University
Boston, MA 02215
Email: {missimer,richwest,liye}@cs.bu.edu
*VMware, Inc.

Eric Missimer, Richard West and Ye Li Real-Time Fault Tolerance 1

SLIDE 2

Quest-V: Virtualized Multi-Core System

Quest-V Background:
  Boston University’s in-house operating system + hypervisor
  Developed for real-time and high-confidence systems
Key Features:
  Virtualized Separation Kernel
  Simplified Hypervisor:
    Sandboxes are pinned to cores at boot, no need for scheduling
    I/O devices are partitioned amongst sandboxes, not shared or emulated
    Virtualization used for encapsulation
  Hypervisor is assumed to be a trusted code base
  Communication through explicit shared memory channels

SLIDE 3

Quest-V Design

SLIDE 4

Motivation

Safety-critical systems require component isolation and redundancy
  Integrated Modular Avionics (IMA), Automobiles
Multi-/many-core processors are increasingly popular in embedded systems
Multi-core processors can be used to consolidate redundant services onto a single platform

SLIDE 6

Motivation

Many processors now feature hardware virtualization
  ARM Cortex A15, Intel VT-x, AMD-V
Hardware virtualization provides an opportunity to efficiently partition resources amongst guest VMs
Not trying to remove all hardware redundancy, just lessen it
H/W Virtualization + Resource Partitioning/Isolation = Platform for Embedded Safety Critical Systems

SLIDE 7

Motivation

Focusing on hardware transient faults and software timing faults
  Random bit flips caused by radiation
  Asynchronous bugs in faulty device drivers

SLIDE 8

Quest-V N-Modular Redundancy

N redundant copies of a program, one per sandbox (at least three)
At least one voter
Hash-based fault detection and recovery
Virtualized separation kernel platform provides new n-modular redundancy configurations
Software-based dual core lock step (DCLS)
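
The voter's role can be illustrated with a minimal Python sketch. This is not Quest-V code: the function names `majority_vote` and `faulty_replicas` are hypothetical, and the sketch assumes each replica reports a digest of its state as a byte string.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(digests: Sequence[bytes]) -> Optional[bytes]:
    """Return the digest reported by a strict majority of replicas, or None."""
    if not digests:
        return None
    value, count = Counter(digests).most_common(1)[0]
    # A strict majority needs more than half of the replicas to agree.
    return value if count > len(digests) // 2 else None


def faulty_replicas(digests: Sequence[bytes]) -> List[int]:
    """Return indices of replicas that disagree with the majority digest."""
    winner = majority_vote(digests)
    if winner is None:
        raise ValueError("no majority: cannot tell which replicas are faulty")
    return [i for i, d in enumerate(digests) if d != winner]
```

With three replicas, a single faulty copy is outvoted by the other two, which is why the slide asks for at least three.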

SLIDE 9

N-Modular Redundancy

SLIDE 10

N-Modular Redundancy for Real-Time Applications

SLIDE 11

Fault Detection

Typical n-modular redundancy compares the output of the computation
  Pro: Fast
  Con: Don’t know what went wrong
Proposed detection method: compare application memory on a per-page basis via hashes
  Pro: Faster and generic recovery for complicated applications (discussed later)
  Con: Must hash memory state of process (slow)
  Can speed up comparison using a “summary” hash
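
Per-page hashing with a summary hash could look roughly like the Python sketch below. The slides do not name a hash function, a page size, or how the summary is formed, so SHA-256, a 4096-byte page, and a hash over the concatenated page hashes are all assumptions; the function names are hypothetical.

```python
import hashlib
from typing import List

PAGE_SIZE = 4096  # assumed page size; the slides do not specify one


def page_hashes(memory: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Hash each page of a memory image separately."""
    return [
        hashlib.sha256(memory[off:off + page_size]).digest()
        for off in range(0, len(memory), page_size)
    ]


def summary_hash(hashes: List[bytes]) -> bytes:
    """One digest over all page hashes, compared first as a fast path."""
    return hashlib.sha256(b"".join(hashes)).digest()


def differing_pages(a: List[bytes], b: List[bytes]) -> List[int]:
    """Return indices of pages whose hashes differ between two replicas."""
    if len(a) != len(b):
        raise ValueError("replicas must cover the same number of pages")
    if summary_hash(a) == summary_hash(b):
        return []  # summaries match, so skip the page-by-page comparison
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

In the common fault-free case only the two summary hashes are compared; the per-page comparison runs only when they disagree, and it then identifies which pages diverged.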

SLIDE 12

Fault Detection

SLIDE 13

N-Modular Redundancy Configurations

Voting mechanism and device driver in the hypervisor
Voting mechanism and device driver in one sandbox
Voting mechanism distributed across sandboxes and device driver is shared

SLIDE 14

Voting Mechanism and Device Driver in the Hypervisor

SLIDE 15

Voting Mechanism and Device Driver in the Hypervisor

Pros:
  No need to modify the operating system; could apply to Linux as well as Quest
  Need only n sandboxes
Cons:
  Conflicts with Quest-V hypervisor design
  Faulty device driver could jeopardize the entire system
  Need to duplicate the entire guest

SLIDE 16

Voting Mechanism and Device Driver in One Sandbox

SLIDE 17

Voting Mechanism and Device Driver in One Sandbox

Pros:
  Simpler hypervisor
  Application-level redundancy; no need to copy the entire sandbox
Cons:
  Need (n+1) sandboxes
  Need to modify guest

SLIDE 18

Voting is Distributed and Device Driver is Shared

SLIDE 19

Voting is Distributed and Device Driver is Shared

Pros:
  Need only n sandboxes
  Application-level redundancy; no need to copy the entire sandbox
Cons:
  Need to modify guest
  Complicated shared device driver

SLIDE 20

Recovery

Want recovery to be as generic as possible
Simple applications: rebooting might be sufficient
Complicated applications: rebooting could cause important state to be lost
Perform live migrations of either application or guest machine
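
One way per-page hashes could support generic recovery without a reboot is to overwrite only the divergent pages of a faulty replica with the majority's copy. The Python sketch below illustrates that idea under stated assumptions; it is not the migration mechanism used in Quest-V. The name `repair_replicas`, the SHA-256 hash, and the 4096-byte page size are all hypothetical choices.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size


def repair_replicas(
    replicas: List[bytearray], page_size: int = PAGE_SIZE
) -> List[Tuple[int, int]]:
    """Overwrite minority pages with the majority copy.

    Returns (replica_index, page_index) pairs for every page repaired.
    """
    if not replicas:
        return []
    size = len(replicas[0])
    if any(len(r) != size for r in replicas):
        raise ValueError("replicas must have identical memory sizes")

    repaired = []
    for page, lo in enumerate(range(0, size, page_size)):
        hi = lo + page_size
        digests = [hashlib.sha256(r[lo:hi]).digest() for r in replicas]
        winner, count = Counter(digests).most_common(1)[0]
        if count == len(replicas):
            continue  # all replicas agree on this page
        if count <= len(replicas) // 2:
            raise RuntimeError(f"no majority for page {page}")
        source = replicas[digests.index(winner)]
        for idx, digest in enumerate(digests):
            if digest != winner:
                # Copy only the faulty page; the rest of the state is kept.
                replicas[idx][lo:hi] = source[lo:hi]
                repaired.append((idx, page))
    return repaired
```

Because only mismatched pages are copied, the healthy state of the faulty replica survives, which addresses the concern above that rebooting a complicated application loses important state.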

SLIDE 21

Recovery

All performed within the context of the thread’s sporadic server

SLIDE 22

Quick Summary - Key Points to Take Away

Per-page hash based fault detection and recovery
Three n-modular redundancy configurations in a virtualized separation kernel
  Hypervisor Voting
  Sandbox Voting
  Distributed Voting

SLIDE 25

Conclusion

So what’s left?
  Further implementation and comparison
  Figure out solution for voter single point of failure
    Possibilities include arithmetic encoding and memory scrubbing

SLIDE 27

Conclusion

More Info: www.questos.org
Questions?
