Improving Malware Classification: Bridging the Static/Dynamic Gap

SLIDE 1

Improving Malware Classification: Bridging the Static/Dynamic Gap

Vinit Singh, 18th April 2017

Authors: Blake Anderson, Curtis Storlie, Terran Lane

CISC850 Cyber Analytics

SLIDE 2

INTRODUCTION

Why is there a need for machine learning in malware detection?

The need for different types of data sources and how to combine them.

A unified framework: a support vector machine trained with multiple kernel learning.

SLIDE 3

DATA SOURCES

STATIC SOURCES:

Binary, Disassembled Binary, Control Flow Graph

DYNAMIC SOURCES:

Dynamic Instruction Traces (DIT), Dynamic System Call Traces (DST)

MISCELLANEOUS FILE INFORMATION:

Entropy, Packers, Instructions in file, vertices and edges in CFG
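
The slide lists entropy as a file feature without defining it. A minimal sketch, assuming it means the Shannon entropy of the file's byte histogram (a common choice, since packed or encrypted files tend to score close to 8 bits per byte); the function name `byte_entropy` is hypothetical:

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)  # histogram of byte values
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A file made of one repeated byte scores 0, while a file in which all 256 byte values are equally frequent scores 8.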

SLIDE 4

METHOD

STEP 1: DATA REPRESENTATION

Markov chain representation for raw binary, disassembled binary, DIT and DST

Standard representation for Control Flow Graph

The miscellaneous file information is represented as a simple feature vector of length seven
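
The Markov chain representation can be sketched as follows: count how often each symbol follows another in a sequence, then normalize each row of counts into transition probabilities. This is a sketch under assumptions: the states here are raw instruction mnemonics, the chain is first-order, and the name `transition_probabilities` is hypothetical; the slides do not say how states are chosen.

```python
from collections import defaultdict


def transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities from a sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    # Count every observed (current state -> next state) transition.
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    probs = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        # Normalize so the outgoing probabilities of each state sum to 1.
        probs[state] = {nxt: n / total for nxt, n in successors.items()}
    return probs


# Toy instruction trace (illustrative only).
trace = ["mov", "push", "call", "mov", "push", "mov"]
chain = transition_probabilities(trace)
```

The resulting probabilities are the values a kernel can then compare between two files.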

SLIDE 5

STEP 2: KERNELS

The Kernel Trick
Exponential Kernel:

x_i: features of the file information, or transition probabilities of the Markov chain
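
The kernel formula itself is not reproduced in this transcript. A minimal sketch, assuming the common squared exponential form K(x, y) = sigma^2 * exp(-sum_i (x_i - y_i)^2 / (2 * lam^2)); the hyperparameter names `sigma` and `lam` and the function name are hypothetical:

```python
import math


def squared_exponential_kernel(x, y, sigma=1.0, lam=1.0):
    """Similarity of two equal-length feature vectors; equals sigma**2 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sigma ** 2 * math.exp(-sq_dist / (2.0 * lam ** 2))
```

The value decays toward 0 as the two vectors move apart, so files with similar transition probabilities receive a high similarity.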

Graphlet Kernel:

G: graph

k: number of nodes in each subgraph (graphlet)

D_G: normalized probability vector = f_G / (number of all graphlets of size k)

f_G: feature vector counting the number of times each unique subgraph of size k occurs
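
The definitions above can be sketched in code. The assumptions are loud ones: the graph is treated as undirected, k is fixed at 3 (so a graphlet's type is determined by how many of its three possible edges are present), and the kernel is taken to be the dot product of the two normalized vectors, since the slide's formula is not preserved. All function names are hypothetical.

```python
from itertools import combinations


def graphlet_distribution(edges, num_nodes):
    """Normalized counts D_G of 3-node graphlets, indexed by edge count 0..3."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]  # f_G: one slot per 3-node graphlet type
    for trio in combinations(range(num_nodes), 3):
        present = sum(frozenset(pair) in edge_set
                      for pair in combinations(trio, 2))
        counts[present] += 1
    total = sum(counts)  # number of all graphlets of size 3
    return [c / total for c in counts] if total else counts


def graphlet_kernel(d_g, d_h):
    """Dot product of two normalized graphlet vectors."""
    return sum(a * b for a, b in zip(d_g, d_h))


triangle = graphlet_distribution([(0, 1), (1, 2), (0, 2)], 3)
path = graphlet_distribution([(0, 1), (1, 2)], 3)
```

Enumerating every node triple is cubic in the number of nodes, so larger control flow graphs would need sampling or a faster counting scheme.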

SLIDE 6

Heatmaps for Individual Kernels

SLIDE 7

STEP 3: MULTIPLE KERNEL LEARNING

Optimization problem for classical kernel learning:

Subject to constraint:

Thus the decision function is:

But for multiple kernel learning we also need to estimate β_k, the weight of each kernel
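
The role of β_k can be illustrated with a small sketch. It assumes the usual multiple kernel learning convention that the combined kernel is K = sum_k β_k * K_k with non-negative weights that sum to 1, and the standard SVM decision rule sign(sum_i alpha_i * y_i * K(x_i, x) + b). The weights, matrices, and function names below are hand-picked for illustration; nothing here learns β_k.

```python
def combine_kernels(kernel_matrices, betas):
    """Weighted sum of equally sized kernel matrices: K = sum_k beta_k * K_k."""
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to 1")
    n = len(kernel_matrices[0])
    return [[sum(b * K[i][j] for b, K in zip(betas, kernel_matrices))
             for j in range(n)] for i in range(n)]


def decision(alphas, labels, kernel_row, bias=0.0):
    """SVM decision: sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = sum(a * y * k for a, y, k in zip(alphas, labels, kernel_row)) + bias
    return 1 if score >= 0 else -1


# Toy kernels for two files: one static view, one dynamic view.
K_static = [[1.0, 0.2], [0.2, 1.0]]
K_dynamic = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K_static, K_dynamic], [0.25, 0.75])
```

A larger β_k lets the corresponding data source dominate the combined similarity, which is why estimating the weights matters.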

SLIDE 8

Heatmap of Combined Kernel

SLIDE 9

RESULTS

Criterion 1: Accuracy

Accuracy is calculated using 10-fold cross-validation.
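
As a reminder of what 10-fold cross-validation computes, here is a minimal sketch: the data is split into 10 folds, each fold is held out once while the model is trained on the rest, and accuracy is pooled over all held-out predictions. The callback `train_and_predict` and both function names are hypothetical; the slides do not describe how the folds were drawn.

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(samples, labels, train_and_predict, k=10):
    """Fraction of samples predicted correctly while held out of training."""
    correct = 0
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(samples)
```

This sketch keeps the samples in their given order; in practice the data is usually shuffled, and often stratified by class, before splitting.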

SLIDE 10

Criterion 2: ROC Curves / AUC Values
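
The AUC value summarizes a ROC curve in one number. A minimal sketch, using the standard interpretation of AUC as the probability that a randomly chosen positive (malware) sample receives a higher score than a randomly chosen negative one, with ties counted as half; the function name and the convention that label 1 means malware are assumptions:

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    # Each positive/negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every malware sample is ranked above every benign one; 0.5 is what random scoring would give.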

SLIDE 11

Criterion 3: Speed to classify new instances

SLIDE 12

Criterion 4: Testing on a Large Malware Sample

Accuracy on a validation set consisting of 20k samples

SLIDE 13

OBSERVATIONS

A total of 19 false positives and false negatives were found among the 1556 instances of the original dataset.

Static analysis alone doesn't work well when the training instances have been packed.

SLIDE 14

LIMITATIONS AND DRAWBACKS

Selecting an appropriate value of n for n-gram analysis

Collecting dynamic system traces is too time-consuming and resource-intensive on a normal system

Choosing optimal instruction call categories

Intel Pin isn't transparent while tracing the program to collect instructions

SLIDE 15

RELATED WORK

Use of single data sources

Use of static data sources combined with ensemble learning

Result Fusion Model

Identifying packed and hidden code

SLIDE 16

CONCLUSION

Not restricting malware classification to a single data source improves classification accuracy.

In a resource-constrained environment, combined static analysis can result in high accuracy and a low number of false positives.

Static analysis is not an optimal solution when instances have been packed or have a high entropy.