
SLIDE 1

Rogério de Lemos ICSE 2006 SEAMS – May 2006 – 1

Adaptability and Fault Tolerance

Rogério de Lemos, University of Kent, UK

Context: self-* and dependability;

Focus: adaptability and fault tolerance;

State of the art;

Conclusions;

SLIDE 2

Self-* and Dependability

Dependability:

the ability to deliver service that can justifiably be trusted;

Self-* properties of systems:

the support for autonomy;

self-adaptable, self-managing, self-optimising, self-healing, self-repairing, self-configuring, etc.

Adaptability:

the ability of a system to accommodate changes while providing its specified services;

run-time changes;

SLIDE 3

Dependability

Dependability - the ability to avoid service failures that are more frequent and more severe than is acceptable;

threats - undesired, but in principle expected, circumstances:

faults, errors and failures;

attributes - properties of the system:

reliability, availability, integrity, confidentiality, and safety;

technologies - methods and techniques for providing, and reaching confidence in, the ability to attain dependability:

rigorous design, validation & verification, fault tolerance, and system evaluation;

SLIDE 4

Dependability - Threats

(Yves Deswarte & David Powell)

Fault: adjudged or hypothesized cause of an error;

Error: that part of system state which may lead to a failure;

Failure: occurs when delivered service deviates from implementing the system function;

[Figure: chain of threats: Fault (activation) -> Error (propagation) -> Failure (causation) -> Fault (activation) -> Error]

SLIDE 5

Adaptability - Initiators

Changes:

the act, process, or result of altering or modifying;

internal changes:

component failures, overload of resources, etc.

external changes:

environmental, requirements, etc.

There is no fundamental chain of adaptability initiators;

SLIDE 6

Threats and Initiators

Changes correspond to events (faults):

changes can be dormant if not activated;

What is the consequence of change (errors)?

what would be the equivalent of error-free and erroneous states?

these states are created when changes are activated and can remain latent until detected;

What is the equivalent of failure?

unsuccessful adaptation?

the system might continue to provide its services while ignoring the change;

SLIDE 7

Dependability - Technologies

Fault avoidance: build a system with no faults:

rigorous design – fault prevention;

formal and rigorous notations, processes, adapters, etc.

verification & validation – fault removal;

model checking, fault injection, testing, simulation, etc.

Fault acceptance: impossible to rid the system of faults:

fault tolerance;

system evaluation – fault forecasting;

empirical approaches, Markov models, etc.

SLIDE 8

Fault Tolerance

Fault tolerance aims at avoiding the failure of the system:

error detection:

detects the presence of errors;

recovery:

transforms a system state that contains errors or faults into an error-free state without faults that can be re-activated;

error handling:
eliminates errors from the system state;
fault handling:
prevents faults from being activated again;
diagnosis, isolation and reconfiguration;
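
The detection and error-handling steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the system state can be deep-copied as a checkpoint and that an application-specific acceptance check is available; the names `run_with_rollback`, `operation` and `acceptance_check` are hypothetical and do not come from the slides. Fault handling (diagnosis, isolation, reconfiguration) is deliberately left out.

```python
import copy


class ErrorDetected(Exception):
    """Raised when the acceptance check finds an erroneous state."""


def run_with_rollback(state, operation, acceptance_check):
    """Apply `operation` to a copy of `state`; roll back if an error is detected.

    Returns (resulting_state, recovered), where `recovered` is True when the
    checkpoint had to be restored.
    """
    checkpoint = copy.deepcopy(state)  # save a state assumed to be error free
    try:
        candidate = operation(copy.deepcopy(state))
        if not acceptance_check(candidate):  # error detection
            raise ErrorDetected("acceptance check failed")
        return candidate, False
    except Exception:
        # error handling by rollback: discard the erroneous state
        return checkpoint, True
```

For example, with an account state `{"balance": 100}` and an acceptance check that requires a non-negative balance, a withdrawal of 300 is rolled back and the original state is returned. Rollforward and compensation would replace the `except` branch with a transition to a new error-free state instead of the saved one.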

SLIDE 9

Fault Tolerance

(Yves Deswarte & David Powell)

Fault: adjudged or hypothesized cause of an error;

Error: that part of system state which may lead to a failure;

Failure: occurs when delivered service deviates from implementing the system function;

Error Detection

Error Handling: Rollback, Rollforward, Compensation;

Fault Handling: Diagnosis, Isolation, Reconfiguration, Reinitialization;

SLIDE 10

System Structure

Fault tolerance is about system structuring;

structure is what enables the system to generate the behaviour;

determines how effectively this structuring can be used to provide means of error confinement;

avoid the propagation of errors;

what interactions can exist and at what rate;

it is not restricted to system architecture;

Structural flexibility is the basis for adaptation;

SLIDE 11

Fault Assumptions

Faults are undesirable, though expected circumstances:

systems can fail in many different ways;

In the design of fault-tolerant systems, it is essential to define assumptions:

nature of faults - dictates the type of redundancy that must be implemented:

space or time; replication or diversification;

rate of faults - influences the amount of redundancy needed to attain a given dependability;
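
How the rate of faults drives the amount of redundancy can be illustrated with a small worked calculation. The sketch below rests on assumptions that are not in the slides: replicas fail independently, the service is available as long as at least one replica is, and each replica has the same availability a, so that n replicas give an availability of 1 - (1 - a)^n. The function name `replicas_needed` is hypothetical.

```python
def replicas_needed(component_availability, target_availability, max_replicas=100):
    """Smallest n such that 1 - (1 - a)**n reaches the target availability.

    Assumes independent failures and a service that needs only one live replica.
    """
    if not 0.0 < component_availability < 1.0:
        raise ValueError("component availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target availability must be strictly between 0 and 1")
    unavailability_budget = 1.0 - target_availability
    for n in range(1, max_replicas + 1):
        # probability that all n replicas are down at the same time
        if (1.0 - component_availability) ** n <= unavailability_budget:
            return n
    raise ValueError("target not reachable within max_replicas")
```

With replicas that are each available 95% of the time, a target of 99.9% needs three replicas, since 0.05^2 = 0.0025 is still above the allowed 0.001 while 0.05^3 = 0.000125 is below it. Correlated faults would break the independence assumption, which is one reason the nature of faults may call for diversification rather than plain replication.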

SLIDE 12

Fault Assumptions

How a component behaves when it fails:

crash fault being the simplest and most restrictive (or well-defined) type;

Byzantine being the least restrictive;

[Figure: failure modes, from most to least restrictive: crash, omission, timing, Byzantine]

The different types of changes need to be classified;

behavioural assumptions;
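
The effect of the failure-mode assumption on the design can be sketched with a majority voter. Under a crash assumption any reply that arrives can be trusted, whereas under less restrictive assumptions replies may carry wrong values and have to be compared. The sketch below assumes 2f + 1 replicas of which at most f are faulty, hashable reply values, and a voter that is itself fault free; none of these assumptions or names come from the slides, and agreement among the replicas themselves is not addressed.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value reported by a strict majority of all replicas.

    `replies` holds one entry per replica; None stands for a replica that
    crashed or omitted its reply. Raises ValueError when no value reaches
    a strict majority.
    """
    total = len(replies)
    counts = Counter(r for r in replies if r is not None)
    if counts:
        value, votes = counts.most_common(1)[0]
        if votes > total // 2:
            return value
    raise ValueError("no majority: too many faulty replicas")
```

With three replicas, `majority_vote([42, 42, 7])` masks one wrong value and `majority_vote([42, None, 42])` masks one missing reply, while three different replies raise an error because the fault assumption has been violated.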

SLIDE 13

State of the Art

Adaptive fault tolerance

property that enables a system to maintain and improve fault tolerance by adapting to changes in environment and policy;

monitor the system;

reconfigure the application when its configuration is not appropriate for the dependability requirements;

distributed systems:

different layers:

middleware / fault tolerance / adaptation;

consensus problem;
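
The monitor-and-reconfigure idea behind adaptive fault tolerance can be sketched as a single decision step that is run once per monitoring window. The policy fields, thresholds and names below are hypothetical and chosen only for illustration; a real system would derive them from its dependability requirements and would also have to coordinate the change across nodes.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPolicy:
    min_replicas: int = 2
    max_replicas: int = 7
    raise_above: float = 0.05  # failure ratio that triggers more redundancy
    lower_below: float = 0.01  # failure ratio that allows less redundancy


def adapt_replicas(current, failures, requests, policy):
    """Return the replica count to use next, given one monitoring window."""
    if requests == 0:
        return current  # nothing observed: keep the current configuration
    ratio = failures / requests
    if ratio > policy.raise_above:
        return min(current + 1, policy.max_replicas)
    if ratio < policy.lower_below:
        return max(current - 1, policy.min_replicas)
    return current
```

For instance, with the default policy, three replicas and 10 failures in 100 requests, the function returns 4; with no failures in the window it returns 2. Changing the policy object at run time corresponds to adapting to a change in policy rather than in the environment.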

SLIDE 14

State of the Art

AQuA - CORBA-based architecture;

dynamic replication of objects;

Proteus:

dynamic fault tolerance through adaptive reconfiguration;

allows the degree of dependability to be specified at the application level;

SLIDE 15

State of the Art

Chameleon - adaptive infrastructure;

allows different levels of availability requirements;

explicit representation of adaptive policies;

provides dependability through the use of ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability):

managers for monitoring and recovering resources;

daemons for providing communication;

common ARMORs for providing application-required dependability;

enables multiple fault tolerance strategies to co-exist;

SLIDE 16

State of the Art

Architectural fault tolerance

Error detection and recovery;

techniques based on exception-handling;

application dependent;
iC2C and iFTE;

Fault handling

system reconfiguration;

replacement of components, connectors and configurations;

dynamic reconfiguration;
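
The combination of exception-based error detection and fault handling by component replacement can be sketched as a connector that swaps in a spare when the active component signals a failure. This is an illustrative sketch only and does not reproduce iC2C or iFTE; the class and exception names are hypothetical, components are modelled as plain callables, and a failed component is assumed to signal its failure with an exception.

```python
class ComponentFailure(Exception):
    """Signalled by a component that cannot deliver its service."""


class FaultTolerantConnector:
    """Routes requests to an active component; on failure, swaps in a spare."""

    def __init__(self, components):
        if not components:
            raise ValueError("at least one component is required")
        self._spares = list(components)
        self._active = self._spares.pop(0)
        self.isolated = []  # components removed from the configuration

    def request(self, *args):
        while True:
            try:
                return self._active(*args)
            except ComponentFailure:
                self.isolated.append(self._active)  # isolation
                if not self._spares:
                    raise  # no spares left: the failure propagates
                self._active = self._spares.pop(0)  # reconfiguration
```

A connector built from a failing primary and a working backup answers the request from the backup and keeps using it for later requests. Placing the logic in the connector keeps the components unaware of each other, which is one way of confining the consequences of a failure to a part of the architecture.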

SLIDE 17

State of the Art

Bio-inspired computing and statistical methods:

data-oriented approaches:

data mining large quantities of observations for identifying patterns;

anomaly (fault and intrusion) detection;

neural networks, genetic algorithms, etc.;

adaptive error detection using artificial immune systems:

problem: how to learn from rare events!

statistical learning techniques (SLT) applied to system recovery;
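
The data-oriented view of anomaly detection can be illustrated with a deliberately simple statistical detector. It is a plain z-score threshold, swapped in here for brevity, and not an artificial immune system, neural network or genetic algorithm. The sketch assumes that the training observations come from fault-free operation and are roughly normally distributed; the function name and the default threshold of 3 standard deviations are illustrative choices.

```python
from statistics import mean, stdev


def find_anomalies(training, observations, threshold=3.0):
    """Return the observations lying more than `threshold` standard
    deviations away from the mean of the training data."""
    mu = mean(training)
    sigma = stdev(training)  # sample standard deviation, needs >= 2 points
    if sigma == 0:
        return [x for x in observations if x != mu]
    return [x for x in observations if abs(x - mu) / sigma > threshold]
```

With training response times of 100, 102, 98, 101, 99, 100, 103 and 97 ms (mean 100, sample standard deviation 2), the observations 250 and 92 are flagged while 101 and 99 are not. The sketch also shows the difficulty noted above: the detector only knows normal behaviour, because faulty behaviour is too rare to learn from directly.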

SLIDE 18

Conclusions

Changes are like faults, though:

they might be desired/undesired and expected/unexpected;

Classification of the types of changes:

otherwise becomes application dependent;
e.g., exception handling for the support of fault tolerance;

How does system structuring affect adaptability?

is software flexible enough to support run-time change?

impact of design-time change;

to scope the impact of change;

confinement of the consequence of change;

SLIDE 19

Conclusions

Dependability and adaptability:

the ability to deliver service:

D: rigorous design and fault tolerance;

A: rigour in the specification/reasoning about adaptability;

A: most work has focused on system reconfiguration;

confidence in that ability:

D: V&V and system evaluation;

A: very little has been done here;

adaptability vs. predictability;