A Framework to Control Emergent Survivability of Multi Agent Systems
Aaron Helsinger, Karl Kleinmann & Marshall Brinn
BBN Technologies
[ahelsing, kkleinmann, mbrinn]@bbn.com
AAMAS'04
The Problem
Distributed multi-agent systems (DMAS) are complex
By definition, many independent entities autonomously pursuing goals, spread out over an unreliable network
Application function is itself emergent
As with any complex system, chaos is a fact of life
Predictability is impossible at the micro level
Multithreading, timing, etc.
The autonomy of agents exacerbates this, as does the network over which you distribute them.
A DMAS can fail in many unpredictable ways.
No complex system can anticipate all problems, nor be impervious to all attacks.
For widespread adoption, the agent community must provide confidence that DMAS will perform reliably under stress.
Emergent Survivability
Our only hope is to
Limit the impact to the micro level, and
Keep the macro stable.
Make tradeoffs, or suffer catastrophic functionality loss.
We engineer the system to tolerate degradation in some dimensions, while trying to maximize overall system performance.
Measure resources, application function, stresses, and survivability at runtime.
Build a hierarchy of control loops to measure performance at macro level and control behavior at micro level.
The system can reason about its survivability in real time and adjust resources in the face of attacks at multiple levels, producing Emergent Survivability.
[Figure: designated users, the DMAS and its primary application function, application goals and desired behavior, and SW/HW failures and attacks. Callouts: "Degrade without Failing" and "Herding Cats".]
1) Measure Performance
Identify the dimensions of application function
E.g. timeliness, correctness, completeness
Include survivability, e.g. integrity, accountability, robustness
Measure system resources, stresses, and performance
Must define these correctly
If they are too micro, they will vary wildly.
If they measure the wrong quantities, they will not vary with the application performance.
Build sensors for collecting these data
In-band, lightweight, and real-time
See my AAMAS03 paper for details
Functions for weighting measures and producing a scalar overall system score
[Figure: utility curves for MOP 3-1-1 (time to compute a logistics plan) and MOP 3-1-3 (time to present information to a user). x-axis: multiple of baseline time; y-axis: utility.]
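A utility curve of the kind plotted on this slide can be sketched as a piecewise-linear function from "multiple of baseline time" to a utility score. This is only an illustration: the breakpoints in TIME_TO_PLAN_CURVE are hypothetical, not the values from the UltraLog curves, and the function name is an assumption.

```python
def utility(multiple, breakpoints):
    """Piecewise-linear utility for a time-based measure of performance.

    `multiple` is the measured time divided by the baseline time.
    `breakpoints` is a list of (multiple, utility) pairs sorted by multiple.
    Values outside the breakpoints are clamped to the end utilities.
    """
    first_x, first_u = breakpoints[0]
    last_x, last_u = breakpoints[-1]
    if multiple <= first_x:
        return first_u
    if multiple >= last_x:
        return last_u
    for (x0, u0), (x1, u1) in zip(breakpoints, breakpoints[1:]):
        if x0 <= multiple <= x1:
            # Linear interpolation inside the segment [x0, x1].
            return u0 + (u1 - u0) * (multiple - x0) / (x1 - x0)


# Hypothetical curve: full utility at baseline, none at 4x baseline.
TIME_TO_PLAN_CURVE = [(1.0, 100.0), (2.0, 60.0), (4.0, 0.0)]
```

With this assumed curve, a plan computed in 1.5 times the baseline time scores 80, and one that takes 3 times the baseline scores 30. Mapping each raw measurement onto a common utility scale is what allows different measures to be weighted and combined into one score.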
2) Hierarchy of Control
The key idea of our framework is to build a hierarchy
Reasoning at the macro level
Acting at the micro level
Decisions are made close to the resources in contention or the actions capable of addressing the issue, without being susceptible to minor chaotic variations.
Succession of layers: one layer's micro is another layer's macro
These levels are managed by a nested set of control loops.
[Figure: control hierarchy with levels Agent, Host / Node, Community, and Society, exchanging raw and derived sensor data and selected control actions.]
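The nesting of control loops can be sketched as a tree in which each layer measures an aggregate of the layer below and, when that aggregate crosses a threshold, acts on one of its children. This is a minimal sketch, not the UltraLog implementation: the Level class, the load reading, the "shed_load" mode, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Level:
    name: str
    children: list = field(default_factory=list)
    load: float = 0.0      # raw sensor reading, used by leaves only (hypothetical)
    mode: str = "normal"   # control knob, set on leaves only (hypothetical)

    def measure(self):
        """Macro view: a leaf reports its load, a layer averages its children."""
        if not self.children:
            return self.load
        return mean(child.measure() for child in self.children)

    def apply(self, mode):
        """Push a selected control action down to the leaves below this level."""
        if not self.children:
            self.mode = mode
        for child in self.children:
            child.apply(mode)

    def control(self, threshold=0.8):
        """Run this layer's loop once, then the loops nested inside it."""
        if not self.children:
            return
        if self.measure() > threshold:
            # Reason on the aggregate, act on the busiest child.
            max(self.children, key=lambda c: c.measure()).apply("shed_load")
        for child in self.children:
            child.control(threshold)


a1, a2, a3 = Level("a1", load=0.95), Level("a2", load=0.90), Level("a3", load=0.20)
host1, host2 = Level("host1", [a1, a2]), Level("host2", [a3])
society = Level("society", [Level("community", [host1, host2])])
society.control()
```

In this example the community-level average stays below the threshold, so no community-wide action is taken; only host1's loop reacts, and only its busiest agent changes mode. That is the sense in which a decision is made close to the resource in contention.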
UltraLog Program
DARPA effort
Integrated contributions of 15-20 companies and universities
Show assessable wartime survivability
Prototype application is military logistics
Real algorithms and organizations
Plan, transport, and execute 180-day deployment
FCS scenario
Resulting logistics plan has 250K+ individual elements representing demand and transport for 34K+ entities of 200+ types.
UltraLog Survivability Requirements
Program Goal (per original program description): System will incur no greater than a 20% capabilities loss and a 30% performance loss under conditions of 45% information infrastructure loss, wartime loads, and directed information warfare.
Stress, System Function and Degradation are Quantitative in Nature
Three categories of stress
Loss (total or partial) of hardware capabilities (CPU, BW, Memory, Disk)
Significant increases in legitimate work to perform
Attempts to circumvent system integrity (confidentiality, authentication, authorization)
Survivability: Extent to which system function is maintained under stress
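Because the program goal is quantitative, it can be written as a simple check. The sketch below assumes that capability and performance are each summarized as a single score where higher is better, measured once at baseline and once under stress; the function names and example scores are hypothetical, and only the 20% and 30% limits come from the goal statement.

```python
def fractional_loss(baseline, stressed):
    """Fraction of the baseline score lost under stress (never negative)."""
    return max(0.0, (baseline - stressed) / baseline)


def meets_program_goal(capability, performance,
                       max_capability_loss=0.20, max_performance_loss=0.30):
    """Check a run against the stated loss limits.

    `capability` and `performance` are (baseline, stressed) score pairs,
    assumed to come from a run under the stated 45% infrastructure loss.
    """
    return (fractional_loss(*capability) <= max_capability_loss
            and fractional_loss(*performance) <= max_performance_loss)


# Hypothetical scores: 15% capability loss and 28% performance loss.
within_goal = meets_program_goal((100.0, 85.0), (100.0, 72.0))
# Hypothetical scores: 25% capability loss exceeds the 20% limit.
outside_goal = meets_program_goal((100.0, 75.0), (100.0, 72.0))
```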
The Cougaar Architecture
Cougaar architecture is designed to support
data intensive, inherently distributed applications, emphasizing scalability & configurability.
Cougaar is
A 100% Java agent architecture
Expressly for building large distributed MAS
Around 400K lines of code.
Prototype application
Uses over 1092 agents across a 9-LAN network of over 85 machines.
It is data- and compute-intensive, inherently distributed, and must plan and execute a logistics deployment.
[Figure: Cougaar architecture diagram. Labels: Node, Agent, Plugin, Binder, Blackboard, Servlet Interface, Message Transport Service, YP/WP Directory Services, Community Services.]
Developed under DARPA funding
Cougaar is Open-Source (BSD-style license)
http://www.cougaar.org
Prototype Application MOPs
UltraLog Survivability (Swing Weights, November 03)
  Capability 0.58
    MOE 1 Planning and Replanning 0.71
      MOP 1-1 Completeness of Plan 0.41
        MOP 1-1-1 Transport 0.64
          MOP 1-1-1-1 Near Term 0.85
          MOP 1-1-1-2 Far Term 0.15
        MOP 1-1-2 Supply 0.36
      MOP 1-2 Correctness of Plan 0.39
        MOP 1-2-1 Transport 0.55
          MOP 1-2-1-1 Near Term 0.85
          MOP 1-2-1-2 Far Term 0.15
        MOP 1-2-2 Supply 0.45
      MOP 1-3 Completeness for presentation 0.10
        MOP 1-3-1 Transport 0.64
          MOP 1-3-1-1 Near Term 0.85
          MOP 1-3-1-2 Far Term 0.15
        MOP 1-3-2 Supply 0.36
      MOP 1-4 Correctness for presentation 0.10
        MOP 1-4-1 Transport 0.55
          MOP 1-4-1-1 Near Term 0.85
          MOP 1-4-1-2 Far Term 0.15
        MOP 1-4-2 Supply 0.45
    MOE 2 Confidentiality & Accountability 0.29
      MOP 2-1 Memory data available 0.16
      MOP 2-2 Disk data available 0.16
      MOP 2-3 Transmission data available 0.31
      MOP 2-4 User actions counter to policy 0.21
      MOP 2-5 User actions recorded 0.04
      MOP 2-6 User violations recorded 0.12
  MOE 3 Performance 0.42
    MOP 3-1-1 Time to compute plan or replan 0.80
    MOP 3-1-2 reserved
    MOP 3-1-3 Time to present 0.20
- Measure Performance
- Weight Measures
- Compute Overall Survivability Score
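The three steps above (measure, weight, compute an overall score) can be sketched as a recursive weighted sum over the measures. The weights below are taken from this slide; the weighted-sum rule itself, the leaf scores, and treating MOE 1 and MOE 2 as leaves for brevity are assumptions made for the illustration.

```python
def rollup(node, leaf_scores):
    """Return the weighted-sum score of `node`.

    A node is either a leaf name (str) looked up in `leaf_scores`,
    or a list of (weight, child) pairs whose weights sum to 1.
    """
    if isinstance(node, str):
        return leaf_scores[node]
    return sum(weight * rollup(child, leaf_scores) for weight, child in node)


# Swing weights from the slide; deeper MOP levels are omitted for brevity.
CAPABILITY = [(0.71, "MOE 1"), (0.29, "MOE 2")]
PERFORMANCE = [(0.80, "MOP 3-1-1"), (0.20, "MOP 3-1-3")]
SURVIVABILITY = [(0.58, CAPABILITY), (0.42, PERFORMANCE)]

# Hypothetical measured utilities on a 0-100 scale.
scores = {"MOE 1": 90.0, "MOE 2": 80.0, "MOP 3-1-1": 70.0, "MOP 3-1-3": 100.0}
overall = rollup(SURVIVABILITY, scores)
```

With these assumed leaf scores the overall survivability score comes to about 82.4. The same function handles the full tree if each MOE is expanded into its own list of weighted MOPs.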
Library of Adaptive Services
Adaptive Robustness
No single points of failure (SPOFs)
Automated recovery from resource loss
- Planned or unplanned agent and machine loss
- Proactive response to perceived threat
- Lost network component (temporary or permanent)
Resource management
- Load balancing
- Load shedding
Adaptive Security
Application software integrity:
- Signed jars, Java security mgr
Data integrity:
- Signed and encrypted messages
- Signed and encrypted data files
Access control:
- Maintain an identity and certificates for “Principals”
- Policy-based access control of servlets, messages, and blackboard objects
UltraLog Control Hierarchy
Society
Top level, with user input
Policy manager
Cross-community coordinator
Community
Security, robustness, LAN communities & resources
Policy controlled, Defense Coordinator balances priorities
Host or JVM
Host-level resources managed by policy, Adaptivity Engine, coordinator
Agent
Tailor local operations and goals
Adaptivity Engine reasons using a local book of plays, configuring local components
[Figure: control hierarchy with levels Agent, Host / Node, Community, and Society, exchanging raw and derived sensor data and selected control actions.]
Adaptivity Engine
The Adaptivity Engine is the heart of the Agent or Node-level control loop.
An Adaptivity Engine in an agent or node runs off a playbook that determines which operating modes and policies should be invoked on sub-components to achieve a desirable aggregate performance, based on measurements of current and expected performance and situation.
A playbook represents rules for adaptivity actions based on performance regions. Examples:
“Enter Operating mode X when CPU > X and RT-Performance=‘Falling Behind’”
“Establish Policy ABC when THREATCON>=3”
The Adaptivity Engine at any given level needs to make periodic measurements, determine the current operating region and take appropriate action (control loop).
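One pass of such a control loop can be sketched as evaluating each play against the current sensor conditions and setting the operating mode named by every play that matches. This is an illustrative sketch and does not use the Cougaar API: the Play class, the knob names, the settings, and the 0.9 CPU threshold are hypothetical stand-ins for the two example rules above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Play:
    condition: Callable[[dict], bool]  # predicate over current sensor conditions
    knob: str                          # operating mode to set (hypothetical name)
    setting: str                       # value to set it to (hypothetical value)


def run_playbook(playbook, conditions, modes):
    """Apply every play whose condition holds; return the updated modes."""
    updated = dict(modes)
    for play in playbook:
        if play.condition(conditions):
            updated[play.knob] = play.setting
    return updated


PLAYBOOK = [
    # "Enter Operating mode X when CPU > X and RT-Performance='Falling Behind'"
    Play(lambda c: c["cpu"] > 0.9 and c["rt_performance"] == "Falling Behind",
         "planning_mode", "degraded"),
    # "Establish Policy ABC when THREATCON>=3"
    Play(lambda c: c["threatcon"] >= 3, "message_policy", "encrypt_all"),
]

conditions = {"cpu": 0.95, "rt_performance": "Falling Behind", "threatcon": 2}
modes = run_playbook(PLAYBOOK, conditions,
                     {"planning_mode": "normal", "message_policy": "sign_only"})
```

Here only the first play fires, so the planning mode is degraded while the message policy is left alone. A periodic loop would re-read the sensors and call run_playbook again on each cycle.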
Adaptivity Engine Architecture
[Figure: Adaptivity Engine architecture. Components: Sensors, Processing Components (Plugins), Adaptivity Engine, Playbook Manager, Playbook, Operating Mode Policy Manager, Blackboard, RelayLP, other agents. Services: OperatingMode Service, Condition Service, Playbook Constrain Service, Playbook Read Service.]
Sensors publish real-time sensor conditions: load, resource availability, THREATCON, possibly current settings, and current performance.
The Adaptivity Engine sets Operating Modes for components based on plays in the Playbook and current sensor conditions.
Playbook: read Playbook; constrain Playbook based on ‘leaf’ OperatingMode policy direction.
The Operating Mode Policy Manager reads Operating Mode policies (leaves of expansion or relayed from other agents) from the blackboard.
Processing Components (Plugins): Operating Modes (knob settings).
Service calls: get Condition by name; get OperatingMode by name.
Published: changes to operating modes; InterAgentOperatingModePolicies.
Defense Coordinator
The Coordinator is the brains of the Host or Community level control loop.
Goals
Deconflict competing Defense actions
Choose “best” actions based on:
- Collection of Defense diagnoses
- Belief about threat environment
- High-level policy (as specified by MAU curves)
Local decisions when possible for efficiency
Key design points
POMDP for belief calculations
XML TechSpecs for threats, assets, and defenses
Cost/Benefit analysis to select actions
[Figure: a Society Controller above Community #1 and Community #2 Controllers. Labels: Community Behavior, Metrics Calculation, MAU Score, Selected Defenses.]
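The belief calculation and cost/benefit selection can be illustrated with a single Bayesian update followed by an expected-value comparison. This is a deliberately simplified one-step sketch, not a full POMDP and not the UltraLog Coordinator: the threat states, the "alarm" diagnosis, the likelihoods, and the cost and benefit numbers are all hypothetical.

```python
def update_belief(prior, likelihood, diagnosis):
    """Bayes rule: posterior(state) is proportional to P(diagnosis | state) * prior(state)."""
    unnormalized = {s: likelihood[s][diagnosis] * p for s, p in prior.items()}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def choose_action(belief, benefit, cost):
    """Pick the action with the highest expected benefit minus cost."""
    def net_value(action):
        expected = sum(belief[s] * benefit[action][s] for s in belief)
        return expected - cost[action]
    return max(cost, key=net_value)


# Hypothetical threat model with two states and one defense.
prior = {"attack": 0.1, "benign": 0.9}
likelihood = {"attack": {"alarm": 0.9}, "benign": {"alarm": 0.2}}
benefit = {"none": {"attack": 0.0, "benign": 0.0},
           "isolate": {"attack": 100.0, "benign": 0.0}}
cost = {"none": 0.0, "isolate": 20.0}

posterior = update_belief(prior, likelihood, "alarm")
before = choose_action(prior, benefit, cost)
after = choose_action(posterior, benefit, cost)
```

With these assumed numbers, one alarm raises the belief in an attack from 0.1 to 1/3, which is enough to make isolating worth its cost; before the alarm, doing nothing is the better choice.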
Coordinator Architecture
[Figure: Coordinator architecture. Boxes: Node, Agent, Local Defense, Local Defense Coordinator, Management Agent, Community Defense, Community Defense Coordinator. Link labels between each defense and its coordinator: Diagnoses, Action Offers, Current Action, Allowed Actions When Connected, Allowed Actions When Disconnected.]
Assessment of Prototype
Extensive integration lab
Annual testing cycle
Engineering testing
Security red team
Functional assessment
Survivability assessment
Web accessible run results
Tools for running & stressing society
Automatic report generation
Display debug information & survivability scores
2003 Assessment Results
Improved Stress Results
2000  2001  2002  2003  Stress
FAIL  OK    OK    PASS  Wartime loads1
FAIL  FAIL  OK    OK    Wartime loads2
FAIL  FAIL  OK    PASS  Wartime loads3
FAIL  FAIL  OK    OK    Thrashing
FAIL  OK    OK    OK    Scaling of nodes and agents
FAIL  FAIL  OK    OK    Scaling logistics problem
OK    PASS  PASS  PASS  Fraudulent, untrusted code
FAIL  PASS  PASS  PASS  Untrusted communications
FAIL  PASS  PASS  PASS  Insecure / dangerous code
FAIL  FAIL  OK    PASS  Corruption of persisted state
FAIL  OK    OK    OK    Unauthorized processing
OK    OK    OK    PASS  Unexpected plugin behavior
OK    PASS  PASS  PASS  Component masquerade
FAIL  OK    OK    PASS  Compromised agents1
FAIL  OK    OK    PASS  Compromised agents2
FAIL  OK    OK    OK    Intrusion
FAIL  FAIL  OK    OK    Compromised communications
FAIL  FAIL  OK    OK    Snooping
OK    PASS  PASS  PASS  Message intercept
FAIL  OK    PASS  PASS  Processing failure1
FAIL  OK    OK    PASS  Processing failure2
FAIL  OK    PASS  PASS  Network failure1
FAIL  OK    PASS  PASS  Network failure2
FAIL  OK    OK    PASS  Processing contention
FAIL  OK    OK    PASS  DOS attack
Stress categories: Security, Robustness, Scalability
[Figure: two plots against infrastructure loss (0% to 100%): capability score, and time to plan in minutes.]
Passed 45% Loss Test
45% Infrastructure Loss
20% Capability Loss
More Stresses PASS in 03
Future Work
There is much we’d like to do in our UltraLog implementation
Coordinators that learn
Show how performance degrades with increased stress: where is the breaking point?
There is work to do in proving our approach
Apply to different application domains
- What, if anything, is specific to logistics?
Experiment with alternate control frameworks
- Performance
- Survivability
Related work available at cougaar.org: Many posted papers
Summary
DMAS must be proven survivable to be adopted
Survivability can emerge from DMAS operations
To do so:
Measure survivability at runtime as part of system function
Create nested control loops that
- measure at macro levels and
- act at micro levels to
- produce overall application function survivability