Provenance-based Intrusion Detection, Thomas Pasquier (PowerPoint presentation)

SLIDE 1

Provenance-based Intrusion Detection

Thomas Pasquier
University of Bristol
https://tfjmp.org
12/11/2020

SLIDE 2

Talk loosely based on the following publications

Han et al. “SIGL: Securing Software Installations Through Deep Graph Learning”, USENIX Security 2021

Han et al. “UNICORN: Runtime Provenance-Based Detector for Advanced Persistent Threats”, NDSS 2020

Pasquier et al. “Runtime Analysis of Whole-System Provenance”, ACM CCS 2018

Pasquier et al. “Practical Whole-System Provenance Capture”, ACM SoCC 2017


SLIDE 3

Motivation: System call based intrusion detection

System Calls


SLIDE 4

Motivation: System call based intrusion detection

Identify abnormal patterns
System Calls

SLIDE 5

Motivation: System call based intrusion detection

Identify abnormal patterns
Hidden among benign actions
System Calls

SLIDE 6

Motivation: System call based intrusion detection

Identify abnormal patterns
Hidden among benign actions
Masquerading as benign action
System Calls

SLIDE 7

Motivation: System call based intrusion detection

Identify abnormal patterns
Hidden among benign actions
Masquerading as benign action
Over a long period of time
[...] [...]
System Calls

SLIDE 8

What is provenance?


SLIDE 9

What is provenance?

From the French “provenir”, meaning “to come from”
Formal set of documents describing the origin of an art piece
Sequence of formal ownership, custody, and places of storage
Used for authentication

SLIDE 10

What is data-provenance?

Represents interactions between objects of different types: data-items (entities), processing (activities), individuals and organisations (agents)
Represented as a directed acyclic graph (think information flows)
Edges represent interactions between objects’ states as dependencies
It is a representation of the history of a system’s execution
Immutable (unless it’s 1984)
No dependency on the future
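
The properties listed on this slide can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name, the integer timestamps, and the example nodes are hypothetical, and the relation names are borrowed from the W3C PROV vocabulary rather than taken from the talk. The sketch records typed nodes, stores edges as dependencies, refuses to rewrite history, and rejects any dependency on a later state, which also keeps the graph acyclic.

```python
from dataclasses import dataclass, field

# The three kinds mirror the slide: entities, activities and agents.
KINDS = {"entity", "activity", "agent"}

@dataclass
class ProvenanceGraph:  # hypothetical name, not an API from the talk
    nodes: dict = field(default_factory=dict)  # node id -> (kind, creation time)
    edges: list = field(default_factory=list)  # (src, dst, relation): src depends on dst

    def add_node(self, node_id, kind, time):
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        if node_id in self.nodes:
            # Immutable history: a recorded state is never rewritten,
            # a new version is added as a new node instead.
            raise ValueError(f"{node_id} already recorded")
        self.nodes[node_id] = (kind, time)

    def add_edge(self, src, dst, relation):
        # No dependency on the future: src may only depend on a state that
        # is strictly older. Strict ordering also rules out cycles.
        if self.nodes[dst][1] >= self.nodes[src][1]:
            raise ValueError(f"{src} cannot depend on {dst}")
        self.edges.append((src, dst, relation))

g = ProvenanceGraph()
g.add_node("alice", "agent", 0)
g.add_node("report.txt#v1", "entity", 1)
g.add_node("editor", "activity", 2)
g.add_node("report.txt#v2", "entity", 3)
g.add_edge("editor", "alice", "wasAssociatedWith")
g.add_edge("editor", "report.txt#v1", "used")
g.add_edge("report.txt#v2", "editor", "wasGeneratedBy")
```

Writing to the file produces a new version node instead of updating the old one, which is one common way to keep the graph acyclic.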


SLIDE 11

How is this useful?


SLIDE 12

Provenance-based intrusion detection

▪ Intuition: provenance graph exposes causality relationships between events

SLIDE 13

Provenance-based intrusion detection

▪ Intuition: provenance graph exposes causality relationships between events

SLIDE 14

Provenance-based intrusion detection

▪ Related events are connected even across long periods of time
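
A small sketch of why this matters, using a made-up event log (the node names, timestamps, and the address from the documentation range are all hypothetical). A backward traversal over dependency edges links a late event to its early cause regardless of how much time or unrelated activity separates them.

```python
from collections import defaultdict

# Hypothetical event log: (time, dependent, dependency).
events = [
    (1,      "dropper.sh",  "socket:203.0.113.5"),  # script downloaded
    (2,      "cron.d/job",  "dropper.sh"),          # persistence written
    (50_000, "proc:backup", "cron.d/job"),          # triggered much later
    (50_001, "secrets.tar", "proc:backup"),
    (50_002, "proc:nginx",  "index.html"),          # unrelated benign activity
]

depends_on = defaultdict(set)
for _, src, dst in events:
    depends_on[src].add(dst)

def ancestors(node):
    """Return every state that `node` causally depends on (backward traversal)."""
    seen, stack = set(), [node]
    while stack:
        for parent in depends_on[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

suspicious = ancestors("secrets.tar")
```

Here `suspicious` contains the socket contacted at time 1 even though the archive was written around time 50,000, while the unrelated web server activity is left out.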


SLIDE 15

How to perform detection?


SLIDE 16

Assumptions (and limitations)

Runtime detection
We target environments with minimal human intervention and relatively consistent behaviour
e.g. web servers, CI pipelines, etc.
Build a model of system behaviour (unsupervised training)
in a controlled environment
from a representative workload (this is hard!)
Detect deviations from the model
Several approaches are being explored…
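
To make the train-then-detect workflow concrete, here is a deliberately naive sketch. It is not one of the approaches from the talk: the model is simply the set of (source type, edge type, target type) triples observed during training, and the threshold is a hypothetical value that would need tuning on a representative workload.

```python
def train(benign_edges):
    """Model = set of (source type, edge type, target type) triples seen in training."""
    return set(benign_edges)

def deviation(model, window):
    """Fraction of edges in the window whose triple was never seen in training."""
    if not window:
        return 0.0
    unseen = sum(1 for edge in window if edge not in model)
    return unseen / len(window)

# Training data captured in a controlled environment (made up here).
benign = [("thread", "read", "file"), ("thread", "write", "socket"),
          ("thread", "read", "socket"), ("thread", "write", "file")]
model = train(benign)

normal_window = [("thread", "read", "file"), ("thread", "write", "socket")]
odd_window = [("thread", "read", "file"), ("thread", "connect", "socket"),
              ("thread", "exec", "file")]

THRESHOLD = 0.5  # hypothetical cut-off
alerts = [deviation(model, w) > THRESHOLD for w in (normal_window, odd_window)]
```

The sketch also shows why a representative workload is hard to get right: any benign behaviour missing from training shows up as a deviation.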


SLIDE 17

Example: UNICORN

▪ Han et al. “UNICORN: Runtime Provenance-Based Detector for Advanced Persistent Threats”, NDSS 2020


SLIDE 18

Example: UNICORN

1) Graph streamed in, converted to histogram, labelled using (modified) struct2vec
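
As a rough illustration of turning a labelled graph into a histogram, the sketch below uses a Weisfeiler-Lehman-style relabelling as a stand-in. It is not the modified struct2vec from the slide and not UNICORN's implementation; it also processes the graph in one batch instead of as a stream. The function name, the `rounds` parameter, and the example graph are hypothetical.

```python
from collections import Counter
import hashlib

def neighbourhood_histogram(node_types, edges, rounds=2):
    """Count labels describing each node's neighbourhood up to `rounds` hops.

    node_types: {node: type}; edges: [(src, edge_type, dst)].
    """
    incoming = {n: [] for n in node_types}
    for src, edge_type, dst in edges:
        incoming[dst].append((edge_type, src))

    labels = dict(node_types)
    histogram = Counter(labels.values())  # round 0: plain vertex types
    for _ in range(rounds):
        new_labels = {}
        for node in node_types:
            # Sort so the label does not depend on edge arrival order.
            context = sorted(f"{etype}:{labels[src]}" for etype, src in incoming[node])
            raw = labels[node] + "|" + ",".join(context)
            new_labels[node] = hashlib.sha1(raw.encode()).hexdigest()[:8]
        labels = new_labels
        histogram.update(labels.values())
    return histogram

types = {"p1": "thread", "f1": "file", "s1": "socket"}
edges = [("f1", "read", "p1"), ("p1", "write", "s1")]
hist = neighbourhood_histogram(types, edges)
```

Each round folds one more hop of incoming context into a node's label, so the histogram counts small substructures and not just vertex types.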


SLIDE 19

Example: UNICORN

2) At regular intervals, the histogram is converted to a fixed-size vector using similarity-preserving graph sketching
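
The idea of a similarity-preserving, fixed-size sketch can be illustrated with a SimHash-style random projection. This is a swapped-in technique chosen for brevity, not the sketching algorithm used by UNICORN; the sketch size, function names, and example histograms are hypothetical.

```python
import random

SKETCH_BITS = 256  # hypothetical sketch size

def projection(label):
    """Deterministic pseudo-random Gaussian direction for one histogram label."""
    rng = random.Random(label)  # seeding with the label keeps sketches comparable
    return [rng.gauss(0.0, 1.0) for _ in range(SKETCH_BITS)]

def sketch(histogram):
    """Map a {label: count} histogram of any size to SKETCH_BITS sign bits."""
    totals = [0.0] * SKETCH_BITS
    for label, count in histogram.items():
        for i, component in enumerate(projection(label)):
            totals[i] += count * component
    return [1 if t >= 0 else 0 for t in totals]

def similarity(a, b):
    """Fraction of matching bits; higher means a smaller angle between histograms."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

base = {"thread": 10, "file": 5, "socket": 1}
close = {"thread": 10, "file": 5, "socket": 2}
far = {"pipe": 7, "inode": 3}

sim_close = similarity(sketch(base), sketch(close))
sim_far = similarity(sketch(base), sketch(far))
```

The histogram can grow without bound as new labels appear, yet every sketch has the same length, which is what makes the later clustering step straightforward.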


SLIDE 20

Example: UNICORN

3) Feature vectors are clustered
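
The slide does not say which clustering algorithm is used, so the following is only a placeholder: a greedy "leader" clustering that assigns each vector to the nearest centroid within a radius, or starts a new cluster otherwise. The `radius` parameter and the two-dimensional example vectors are hypothetical.

```python
import math

def cluster(vectors, radius):
    """Greedy clustering: join the nearest centroid within `radius`, else start a new cluster."""
    centroids, members, assignment = [], [], []
    for v in vectors:
        best, best_dist = None, radius
        for idx, c in enumerate(centroids):
            d = math.dist(v, c)
            if d <= best_dist:
                best, best_dist = idx, d
        if best is None:
            centroids.append(list(v))
            members.append([v])
            best = len(centroids) - 1
        else:
            members[best].append(v)
            # Recompute the centroid as the mean of its members.
            centroids[best] = [sum(col) / len(col) for col in zip(*members[best])]
        assignment.append(best)
    return centroids, assignment

vectors = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 0.5)]
centroids, assignment = cluster(vectors, radius=3)
```

Each resulting cluster index can then serve as a label for one recurring pattern of system behaviour.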


SLIDE 21

Example: UNICORN

4) Clusters form “meta-states”; transitions between them are modelled
In deployment, anomalies are detected via clustering and the “meta-state” model
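
One simple way to picture the meta-state model is as the set of transitions observed during training. This is an assumption for illustration, not necessarily how UNICORN models transitions. Meta-states are represented by hypothetical integer cluster ids, and -1 stands for a vector that fits no known cluster.

```python
def learn_transitions(state_sequences):
    """Collect every (previous meta-state, next meta-state) pair seen in training."""
    allowed = set()
    for seq in state_sequences:
        allowed.update(zip(seq, seq[1:]))
    return allowed

def first_anomaly(sequence, allowed, known_states):
    """Return the index of the first unknown state or unseen transition, else None."""
    for i, state in enumerate(sequence):
        if state not in known_states:
            return i
        if i > 0 and (sequence[i - 1], state) not in allowed:
            return i
    return None

training = [[0, 1, 2, 1, 2], [0, 1, 1, 2]]
allowed = learn_transitions(training)
known = {s for seq in training for s in seq}

ok_run = first_anomaly([0, 1, 2, 1], allowed, known)
bad_transition = first_anomaly([0, 2, 1], allowed, known)
unknown_state = first_anomaly([0, 1, -1], allowed, known)
```

The two checks correspond to the two detection routes on the slide: a vector that matches no cluster, and a valid cluster reached through a transition never seen in training.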


SLIDE 22

Relatively simple

▪ Labelled directed acyclic graph
– node/edge types
– security context (when available)
▪ Modification and combination of existing algorithms
– struct2vec
– similarity preserving hashing
– clustering
▪ Right combination + domain knowledge

SLIDE 23

Some insights from this work


SLIDE 24

We can build practical provenance-based IDSs

▪ We can detect intrusions from graph structure with little metadata
– Vertex type (thread, file, socket etc…)
– Edge type (read, write, connect etc…)
▪ Processing speed
– Current prototype
– Data generation speed < processing speed!

SLIDE 25

Proper evaluation is hard!

Datasets are hard to generate
What is a good quality dataset?
Hard to compare across papers, a lot is not available
Experiments (i.e. attacks)
Capture Mechanisms
Analysis pipelines
Leads to unsatisfactory evaluation
I may be able to compare to similar techniques (may reuse dataset)
… very hard for unrelated ones (i.e. those that ingest a different data type)
Adversarial ML?

SLIDE 26

Identifying threats: explainability is a problem

▪ There is a problem within the last batch of X graph elements
– 2,000 in previous figures
▪ Good luck finding out what went wrong
▪ Provenance forensics is an active field of research
– Promising work out of the DARPA programme
▪ … but could we do better during detection?

SLIDE 27

Ongoing projects


SLIDE 28

Towards more interpretable provenance-based IDSs

PhD student project (Xueyuan “Michael” Han)
Collaborators
○ Harvard University
○ UBC
○ NEC Labs America

Deep graph learning techniques
Precisely identifying attacks within a provenance-graph
Generating actionable reports


SLIDE 29

A framework for Provenance-based forensics

PhD student project (Priyanka Badva)
Collaborators

○ SRI International

Provenance graphs are large and complex (several million nodes)
Designing tools and techniques to identify/explain attacks
Working with my colleague Ryan


SLIDE 30

Distributed IDS

Edge network
Collaboration with Toshiba (£4M)
Exploring distributed learning: poisoning, mechanism, etc.
Large testbed planned (work starting January)
Hiring 2 postdocs at Bristol
Money available for a short-term intern (± covid)


SLIDE 31

Kernel partitioning

PhD student project (Soo Yee Lim)
Collaborators
○ HP Labs Bristol
○ Royal Holloway, University of London
○ University of Otago

Leveraging CHERI/ARM Morello hardware

○ Hardware capabilities

Implement kernel partitioning in the Linux OS


SLIDE 32

Thank you! Questions?

https://tfjmp.org
thomas.pasquier@bristol.ac.uk

SLIDE 33

How to evaluate?


SLIDE 34

Comparison with the state of the art

Manzoor et al. "Fast memory-efficient anomaly detection in streaming heterogeneous graphs", ACM KDD, 2016.
R -> neighborhood size for struct2vec algorithm

SLIDE 35

Evaluation with DARPA datasets


SLIDE 36

Evaluation with DARPA datasets

SUCH GOOD RESULTS ARE NOT NORMAL


SLIDE 37

Building our own dataset

▪ Attack designed to look similar to background activity


SLIDE 38

Building our own dataset

▪ Attack designed to look similar to background activity
▪ Is that enough?

SLIDE 39

Runtime performance


SLIDE 40

Runtime performance


SLIDE 41

Runtime performance

Memory usage: ~500MB
CPU usage: 15% on 1 core