SLIDE 1

A Framework to Control Emergent Survivability of Multi Agent Systems

Aaron Helsinger, Karl Kleinmann & Marshall Brinn

BBN Technologies

[ahelsing, kkleinmann, mbrinn]@bbn.com

SLIDE 2

AAMAS'04

The Problem

DMAS (distributed multi-agent systems) are complex

By definition, many independent entities autonomously pursuing goals, spread out over an unreliable network

Application function is itself emergent

As with any complex system, chaos is a fact of life

Predictability is impossible at the micro level

Multithreading, timing, etc.

The autonomy of agents exacerbates this, as does the network over which you distribute them.

A DMAS can fail in many unpredictable ways.

No complex system can anticipate all problems, nor be impervious to all attacks.

For widespread adoption, the agent community must provide confidence that DMAS will perform reliably under stress.

SLIDE 3

Emergent Survivability

Our only hope is to

Limit the impact to the micro level, and

Keep the macro stable.

Make tradeoffs, or suffer catastrophic functionality loss.

We engineer the system to tolerate degradation in some dimensions, while trying to maximize overall system performance.

Measure resources, application function, stresses, and survivability at runtime.

Build a hierarchy of control loops to measure performance at macro level and control behavior at micro level.

The system can reason about its survivability in real time and adjust resources in the face of attacks at multiple levels, producing Emergent Survivability.

[Figure labels: Designated Users, DMAS, Primary Application Function, Application Goals & Desired Behavior, SW, HW, Failures & Attacks]

Degrade without Failing

Herding Cats

SLIDE 4

1) Measure Performance

Identify the dimensions of application function

E.g. timeliness, correctness, completeness

Include survivability, e.g. integrity, accountability, robustness

Measure system resources, stresses, and performance

Must define these correctly

If they are too micro, they will vary wildly. If they measure the wrong quantities, they will not vary with the application performance

Build sensors for collecting these data

In-band, lightweight, and real-time

See my AAMAS03 paper for details

Functions for weighting measures and producing a scalar overall system score

[Figure: utility curves for MOP 3-1-1 (Time to compute a logistics plan) and MOP 3-1-3 (Time to present information to a user). Axis labels: Multiple of Baseline Time (ticks 0.5 to 4), Utility (ticks 20 to 100)]
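A utility curve of this kind can be sketched as a piecewise-linear function from a measurement (multiple of baseline time) to a utility score. This is only an illustration: the breakpoints below are hypothetical and are not read off the slide's curves, and the name `utility` is an assumption.

```python
from bisect import bisect_right

# Hypothetical breakpoints: (multiple of baseline time, utility on a 0-100 scale).
# Illustrative values only; they are not taken from the UltraLog curves.
TIME_UTILITY_CURVE = [(1.0, 100.0), (2.0, 60.0), (4.0, 0.0)]


def utility(multiple_of_baseline, curve=TIME_UTILITY_CURVE):
    """Piecewise-linear utility, clamped at both ends of the curve."""
    xs = [x for x, _ in curve]
    if multiple_of_baseline <= xs[0]:
        return curve[0][1]
    if multiple_of_baseline >= xs[-1]:
        return curve[-1][1]
    # Find the segment that contains the measurement and interpolate on it.
    i = bisect_right(xs, multiple_of_baseline)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (multiple_of_baseline - x0) / (x1 - x0)
```

With these assumed breakpoints, a plan computed in 1.5 times the baseline time scores halfway between the first two breakpoints, and anything slower than 4 times baseline scores zero.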

SLIDE 5

2) Hierarchy of Control

The key idea of our framework is to build a hierarchy

Reasoning at the macro level

Acting at the micro level

Decisions are made close to the resources in contention or actions capable of addressing the issue, without being susceptible to minor chaotic variations.

Succession of layers: one layer’s micro is another layer’s macro

These levels are managed by a nested set of control loops.

[Figure labels: Agent, Host / Node, Community, Society; Raw and Derived Sensor Data; Selected Control Actions]
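The nesting can be sketched as a tree of levels in which measurements are aggregated upward and control decisions are pushed downward. This is a minimal sketch, not the UltraLog implementation: the `Level` class, the load values, the 0.8 threshold, and the "shed_load" mode are all assumptions made for illustration.

```python
from statistics import mean


class Level:
    """One layer of the hierarchy (agent, node, community, or society)."""

    def __init__(self, name, children=None, load=0.0):
        self.name = name
        self.children = children or []
        self.load = load      # raw sensor value, used by leaf levels only
        self.mode = "normal"  # operating mode selected by a control loop

    def measure(self):
        # A layer's macro measure is the mean of its children's measures,
        # which damps fluctuations of any single child.
        if not self.children:
            return self.load
        return mean(child.measure() for child in self.children)

    def apply(self, mode):
        # A control action propagates to everything below this level.
        self.mode = mode
        for child in self.children:
            child.apply(mode)

    def control(self, threshold=0.8):
        # Each layer with children runs its own loop: decide on the
        # aggregate, act on the parts, then let the children's loops run.
        if not self.children:
            return
        if self.measure() > threshold:
            for child in self.children:
                child.apply("shed_load")
        for child in self.children:
            child.control(threshold)


# Hypothetical society fragment: one community with two nodes.
a1, a2 = Level("a1", load=0.95), Level("a2", load=0.90)
a3, a4 = Level("a3", load=0.20), Level("a4", load=0.30)
node1, node2 = Level("node1", [a1, a2]), Level("node2", [a3, a4])
community = Level("community", [node1, node2])
community.control()
```

Here the community-level average stays under the threshold, so no community-wide action is taken; only the overloaded node reacts, which is the "decide close to the resources in contention" idea.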

SLIDE 6

UltraLog Program

DARPA effort

Integrated contributions of 15-20 companies and universities

Show assessable wartime survivability

Prototype application is military logistics

Real algorithms and organizations

Plan, transport, and execute 180-day deployment FCS scenario

Resulting log plan has 250K+ individual elements representing demand and transport for 34K+ entities of 200+ types.

SLIDE 7

UltraLog Survivability Requirements

Program Goal (per original program description) : System will incur no greater than a 20% capabilities loss and a 30% performance loss under conditions of 45% information infrastructure loss, wartime loads, and directed information warfare

Stress, System Function and Degradation are Quantitative in Nature

Three categories of stress

Loss (total or partial) of hardware capabilities (CPU, BW, Memory, Disk)

Significant increases in legitimate work to perform

Attempts to circumvent system integrity (confidentiality, authentication, authorization)

Survivability: Extent to which system function is maintained under stress
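Because the goal is quantitative, it can be written as a simple check. The sketch below assumes that "loss" means the relative drop of a score from its unstressed baseline; that definition, and the function names, are assumptions for illustration and not the program's official scoring method. The 20% and 30% limits come from the program goal above.

```python
MAX_CAPABILITY_LOSS = 0.20   # from the program goal
MAX_PERFORMANCE_LOSS = 0.30  # from the program goal


def loss(baseline_score, stressed_score):
    """Relative drop from the baseline score (assumed definition of loss)."""
    return (baseline_score - stressed_score) / baseline_score


def meets_program_goal(capability_loss, performance_loss):
    """True when both losses stay within the program limits."""
    return (capability_loss <= MAX_CAPABILITY_LOSS
            and performance_loss <= MAX_PERFORMANCE_LOSS)
```

For example, with hypothetical scores of 100 at baseline and 85 (capability) and 75 (performance) under a 45% infrastructure loss, `meets_program_goal(loss(100.0, 85.0), loss(100.0, 75.0))` holds.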

SLIDE 8

The Cougaar Architecture

Cougaar architecture is designed to support

data intensive, inherently distributed applications, emphasizing scalability & configurability.

Cougaar is

100% Java agent architecture

Expressly for building large distributed MAS

Around 400K lines of code.

Prototype application

Uses over 1092 agents over a 9-LAN network of over 85 machines. It is

Data- and compute-intensive,

Inherently distributed, and must

Plan and execute a logistics deployment.

[Figure: Cougaar architecture diagram. Labels: Node, Agent, Plugin, Binder, Blackboard, Servlet Interface, Message Transport Service, YP/WP Directory Services, Community Services]

Developed under DARPA funding

Cougaar is Open-Source (BSD-style license)

http://www.cougaar.org

SLIDE 9

Prototype Application MOPs

UltraLog Survivability (Swing Weights, November 03)

  Capability 0.58
    MOE 1 Planning and Replanning 0.71
      MOP 1-1 Completeness of Plan 0.41
        MOP 1-1-1 Transport 0.64
          MOP 1-1-1-1 Near Term 0.85
          MOP 1-1-1-2 Far Term 0.15
        MOP 1-1-2 Supply 0.36
      MOP 1-2 Correctness of Plan 0.39
        MOP 1-2-1 Transport 0.55
          MOP 1-2-1-1 Near Term 0.85
          MOP 1-2-1-2 Far Term 0.15
        MOP 1-2-2 Supply 0.45
      MOP 1-3 Completeness for presentation 0.10
        MOP 1-3-1 Transport 0.64
          MOP 1-3-1-1 Near Term 0.85
          MOP 1-3-1-2 Far Term 0.15
        MOP 1-3-2 Supply 0.36
      MOP 1-4 Correctness for presentation 0.10
        MOP 1-4-1 Transport 0.55
          MOP 1-4-1-1 Near Term 0.85
          MOP 1-4-1-2 Far Term 0.15
        MOP 1-4-2 Supply 0.45
    MOE 2 Confidentiality & Accountability 0.29
      MOP 2-1 Memory data available 0.16
      MOP 2-2 Disk data available 0.16
      MOP 2-3 Transmission data available 0.31
      MOP 2-4 User actions counter to policy 0.21
      MOP 2-5 User actions recorded 0.04
      MOP 2-6 User violations recorded 0.12
  MOE 3 Performance 0.42
    MOP 3-1-1 Time to compute plan or replan 0.80
    MOP 3-1-2 reserved
    MOP 3-1-3 Time to present 0.20

Measure Performance

Weight Measures

Compute Overall Survivability Score
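The weighting step can be sketched as a recursive weighted sum over the measure tree. The weights below are the swing weights shown for MOP 1-1 (Completeness of Plan) and its children; the leaf scores are hypothetical measurements on a 0-100 utility scale, and the `rollup` function and data layout are assumptions for illustration.

```python
# Each entry maps a measure name to (weight, child), where child is either
# a nested dict of sub-measures or a leaf score.
# Weights: from the slide (MOP 1-1 subtree). Leaf scores: hypothetical.
COMPLETENESS_OF_PLAN = {
    "Transport": (0.64, {
        "Near Term": (0.85, 100.0),
        "Far Term": (0.15, 80.0),
    }),
    "Supply": (0.36, 90.0),
}


def rollup(node):
    """Weighted sum of child scores; a leaf is its own score."""
    if isinstance(node, dict):
        return sum(weight * rollup(child) for weight, child in node.values())
    return node


completeness_score = rollup(COMPLETENESS_OF_PLAN)
```

With these assumed leaf scores, Transport rolls up to 0.85 × 100 + 0.15 × 80 = 97, and Completeness of Plan to 0.64 × 97 + 0.36 × 90 = 94.48. Applying the same function to the full tree would yield the single overall survivability score.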

SLIDE 10

Library of Adaptive Services

Adaptive Robustness

No single points of failure (SPOFs)

Automated recovery from resource loss

Planned or unplanned agent and machine loss
Proactive response to perceived threat
Lost network component (temporary or permanent)

Resource management

Load balancing
Load shedding

Adaptive Security

Application software integrity:

Signed jars, Java security mgr

Data integrity:

Signed and encrypted messages
Signed and encrypted data files

Access control:

Maintain an identity and certificates for “Principals”
Policy-based access control of servlets, messages, and blackboard objects

SLIDE 11

UltraLog Control Hierarchy

Society

Top level, with user input

Policy manager

Cross-community coordinator

Community

Security, robustness, LAN communities & resources

Policy controlled, Defense Coordinator balances priorities

Host or JVM

Host level resources managed by policy, Adaptivity Engine, coordinator

Agent

Tailor local operations and goals

Adaptivity Engine reasons using a local book of plays, configuring local components

[Figure labels: Agent, Host / Node, Community, Society; Raw and Derived Sensor Data; Selected Control Actions]

SLIDE 12

Adaptivity Engine

The Adaptivity Engine is the heart of the Agent or Node-level control loop.

An Adaptivity Engine in an agent or node runs off a playbook that determines which operating modes and policies should be invoked on sub-components to achieve a desirable aggregate performance, based on measurements of current and expected performance and situation.

A playbook represents rules for adaptivity actions based on performance regions. Examples:

“Enter Operating mode X when CPU > X and RT-Performance=‘Falling Behind’”

“Establish Policy ABC when THREATCON>=3”

The Adaptivity Engine at any given level needs to make periodic measurements, determine the current operating region and take appropriate action (control loop).
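One pass of such a control loop can be sketched as: read the current conditions, find the plays whose conditions hold, and apply their operating modes. This is a minimal sketch and not the Cougaar Adaptivity Engine API; the `Play` class, the condition names, the thresholds, and the mode names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Conditions = Dict[str, object]


@dataclass
class Play:
    """One playbook rule: a condition test and the modes to set when it holds."""
    name: str
    applies: Callable[[Conditions], bool]
    operating_modes: Dict[str, str]


# Hypothetical playbook modeled on the two example rules above.
PLAYBOOK = [
    Play(
        "shed load",
        lambda c: c["cpu"] > 0.9 and c["rt_performance"] == "falling behind",
        {"planning_detail": "coarse"},
    ),
    Play(
        "raise security",
        lambda c: c["threatcon"] >= 3,
        {"message_policy": "sign_and_encrypt"},
    ),
]


def control_step(conditions, operating_modes, playbook=PLAYBOOK):
    """Return the operating modes after applying every matching play."""
    updated = dict(operating_modes)
    for play in playbook:
        if play.applies(conditions):
            updated.update(play.operating_modes)
    return updated
```

Calling `control_step` periodically with fresh sensor conditions gives the measure, classify, act cycle; later plays override earlier ones here, which is a simplification of whatever conflict handling a real playbook would need.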

SLIDE 13

Adaptivity Engine Architecture

[Figure: Adaptivity Engine architecture diagram]

Components: Processing Components (Plugins), Sensors, Blackboard, RelayLP, Other agents, InterAgent OperatingMode Policies, Operating Mode Policy Manager, Adaptivity Engine, OperatingMode Service, Condition Service, Playbook, Playbook Constrain Service, Playbook Read Service, Playbook Manager

Annotations:

Publish real-time sensor conditions: Load, Resource Availability, THREATCON, possibly Current Settings, Current Performance

Sets Operating Modes for Components based on plays in Playbook and current Sensor Conditions

Read Playbook

Constrain Playbook based on ‘leaf’ OperatingMode policy direction

Operating mode policy manager reads Operating mode policies (leaves of expansion or relayed from other agents) from blackboard

Operating Modes (Knob Settings)

Get Condition by name

Get OperatingMode by name

Publish changes to operating modes

Publish InterAgentOperatingModePolicies

SLIDE 14

Defense Coordinator

The Coordinator is the brains of the Host or Community level control loop.

Goals

Deconflict competing Defense actions

Choose “best” actions based on:

Collection of Defense diagnoses
Belief about threat environment
High-level policy (as specified by MAU curves)

Local decisions when possible for efficiency

Key design points

POMDP for belief calculations

XML TechSpecs for threats, assets, and defenses

Cost/Benefit analysis to select actions
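The combination of belief tracking and cost/benefit selection can be sketched in a much reduced form: a single Bayes update of the belief that a threat is present, followed by choosing the action with the highest expected net benefit. This is not a full POMDP and not the Coordinator's actual algorithm; the function names, probabilities, costs, and benefits are all hypothetical.

```python
def update_belief(prior, p_alarm_given_threat, p_alarm_given_no_threat, alarm):
    """Bayes update of P(threat) after one defense diagnosis."""
    like_threat = p_alarm_given_threat if alarm else 1 - p_alarm_given_threat
    like_clear = p_alarm_given_no_threat if alarm else 1 - p_alarm_given_no_threat
    evidence = like_threat * prior + like_clear * (1 - prior)
    return like_threat * prior / evidence


def choose_action(belief, actions):
    """Pick the action with the highest expected benefit minus cost."""
    def net_benefit(action):
        return belief * action["benefit_if_threat"] - action["cost"]
    return max(actions, key=net_benefit)["name"]


# Hypothetical defense actions with illustrative costs and benefits.
ACTIONS = [
    {"name": "do_nothing", "benefit_if_threat": 0.0, "cost": 0.0},
    {"name": "restart_agent", "benefit_if_threat": 10.0, "cost": 3.0},
    {"name": "isolate_node", "benefit_if_threat": 12.0, "cost": 8.0},
]

belief = update_belief(0.1, 0.9, 0.1, alarm=True)
selected = choose_action(belief, ACTIONS)
```

With these assumed numbers, one alarm raises the belief from 0.1 to 0.5, at which point restarting the agent has the best expected payoff; at the prior belief of 0.1 the cheapest choice is to do nothing.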

[Figure labels: Society Controller, Community #1 Controller, Community #2 Controller, Metrics Calculation, Community Behavior, MAU Score, Selected Defenses]

SLIDE 15

Coordinator Architecture

[Figure: Coordinator architecture diagram. Labels: Node, Agent, Management Agent, Local Defense, Local Defense Coordinator, Community Defense, Community Defense Coordinator; links labeled Diagnoses, Action Offers, Current Action, Allowed Actions When Connected, Allowed Actions When Disconnected]

SLIDE 16

Assessment of Prototype

Extensive integration lab

Annual testing cycle

Engineering testing

Security red team

Functional assessment

Survivability assessment

Web accessible run results

Tools for running & stressing society

Automatic report generation

Display debug information & survivability scores

SLIDE 17

2003 Assessment Results

Improved Stress Results

2000  2001  2002  2003  Stress
FAIL  OK    OK    PASS  Wartime loads 1
FAIL  FAIL  OK    OK    Wartime loads 2
FAIL  FAIL  OK    PASS  Wartime loads 3
FAIL  FAIL  OK    OK    Thrashing
FAIL  OK    OK    OK    Scaling of nodes and agents
FAIL  FAIL  OK    OK    Scaling logistics problem
OK    PASS  PASS  PASS  Fraudulent, untrusted code
FAIL  PASS  PASS  PASS  Untrusted communications
FAIL  PASS  PASS  PASS  Insecure / dangerous code
FAIL  FAIL  OK    PASS  Corruption of persisted state
FAIL  OK    OK    OK    Unauthorized processing
OK    OK    OK    PASS  Unexpected plugin behavior
OK    PASS  PASS  PASS  Component masquerade
FAIL  OK    OK    PASS  Compromised agents 1
FAIL  OK    OK    PASS  Compromised agents 2
FAIL  OK    OK    OK    Intrusion
FAIL  FAIL  OK    OK    Compromised communications
FAIL  FAIL  OK    OK    Snooping
OK    PASS  PASS  PASS  Message intercept
FAIL  OK    PASS  PASS  Processing failure 1
FAIL  OK    OK    PASS  Processing failure 2
FAIL  OK    PASS  PASS  Network failure 1
FAIL  OK    PASS  PASS  Network failure 2
FAIL  OK    OK    PASS  Processing contention
FAIL  OK    OK    PASS  DOS attack

Stress categories: Security, Robustness, Scalability

[Figure: two plots against Infrastructure Loss (0% to 100%). "Capability": Capability Score (ticks 10 to 100). "Time to Plan": Time to Plan in minutes (ticks 20 to 200)]

Passed 45% Loss Test

45% Infrastructure Loss

20% Capability Loss

More Stresses PASS in 03

SLIDE 18

Future Work

There is much we’d like to do in our UltraLog implementation

Coordinators that learn

Show how performance degrades with increased stress: where is the breaking point?

There is work to do in proving our approach

Apply to different application domains

What if anything is specific to logistics?

Experiment with alternate control frameworks

Performance
Survivability

Related work available at cougaar.org: Many posted papers

SLIDE 19

Summary

DMAS must be proven survivable to be adopted

Survivability can emerge from DMAS operations

To do so:

Measure survivability at runtime as part of system function

Create nested control loops that

measure at macro levels and
act at micro levels to
produce overall application function survivability