Improving Malware Classification: Bridging the Static/Dynamic Gap
Authors: Blake Anderson, Curtis Storlie, Terran Lane
Vinit Singh
18th April 2017
CISC850 Cyber Analytics
INTRODUCTION
• Why is machine learning needed for malware detection?
• Why different types of data sources are needed, and how to combine them.
• A unified framework: a support vector machine trained with multiple kernel learning.
DATA SOURCES
• STATIC SOURCES: Binary, Disassembled Binary, Control Flow Graph (CFG)
• DYNAMIC SOURCES: Dynamic Instruction Traces (DIT), Dynamic System Call Traces (DST)
• MISCELLANEOUS FILE INFORMATION: Entropy, Packers, Instructions in file, vertices and edges in CFG
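Of the miscellaneous features listed above, entropy is the easiest to make concrete. The slides do not say how entropy was computed, so the following is only a minimal sketch that assumes Shannon entropy over byte values; `byte_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0).

    Assumption: entropy is taken over the byte-value distribution of the
    whole file; the slides do not specify the exact definition used.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Packed or encrypted files tend to score close to 8 bits per byte under this measure, which is why entropy is commonly used as a hint that a file has been packed.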
METHOD
STEP 1: DATA REPRESENTATION
• Markov chain representation for the raw binary, disassembled binary, DIT and DST
• Standard representation for the Control Flow Graph
• The miscellaneous file information is represented as a simple feature vector of length seven
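The Markov chain representation can be illustrated with a short sketch. It assumes that each distinct symbol in a sequence (a byte value, an instruction, or a system call) is a state, and that transition probabilities are row-normalized counts of adjacent pairs. The function name and the example trace are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities from a sequence.

    Returns a nested dict: probs[current][next] = P(next | current).
    Assumption: probabilities are plain relative frequencies, no smoothing.
    """
    counts = defaultdict(lambda: defaultdict(int))
    # Count every adjacent (current, next) pair in the sequence.
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Hypothetical instruction trace, for illustration only.
trace = ["mov", "push", "mov", "call", "mov", "push"]
chain = transition_probabilities(trace)
```

Flattening the resulting transition matrix gives a fixed-length vector per file, which is the kind of input a kernel over transition probabilities would operate on.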
STEP 2: KERNELS
• The Kernel Trick
• Exponential Kernel:
  x_i: features of the file information / transition probabilities of the Markov chain
• Graphlet Kernel:
  G: graph
  k: number of nodes in each subgraph (graphlet)
  D_G: normalized probability vector = f_G / (number of all graphlets of size k)
  f_G: feature vector counting the number of times each unique subgraph of size k occurs
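The kernel formulas themselves did not survive in this transcript. As a hedged reconstruction, the forms below are the standard squared-exponential (Gaussian) kernel and the standard graphlet kernel, written with the symbols defined above. The hyperparameters $\sigma$ and $\lambda$ are assumptions and may not match the notation on the original slide.

```latex
% Squared-exponential kernel on feature vectors x, x'
% (sigma, lambda: assumed hyperparameters)
K(x, x') = \sigma^{2} \exp\!\left( -\frac{1}{2\lambda^{2}} \sum_{i} \left( x_i - x'_i \right)^{2} \right)

% Graphlet kernel on graphs G, G' for graphlets of size k
D_G = \frac{f_G}{\#\,\text{graphlets of size } k \text{ in } G},
\qquad
K_g(G, G') = D_G^{\top} D_{G'}
```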
Heatmaps for Individual Kernels
STEP 3: MULTIPLE KERNEL LEARNING
• Classical kernel learning: an optimization problem, subject to a constraint, yields the decision function.
• For multiple kernel learning, the kernel weights β_k must also be estimated.
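The equations for this step are missing from the transcript. The block below gives the textbook dual form of a kernel support vector machine and the usual convex-combination form of multiple kernel learning, as a sketch of what the slide most likely showed. The symbols $\alpha_i$, $y_i$, $C$ and $b$ follow common convention and are assumptions here; only $\beta_k$ appears in the surviving text.

```latex
% Classical (single-kernel) SVM dual
\max_{\alpha} \; \sum_{i} \alpha_i
  - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)

% Subject to
\sum_{i} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C

% Decision function
f(x) = \operatorname{sgn}\!\left( \sum_{i} \alpha_i y_i \, K(x_i, x) + b \right)

% Multiple kernel learning: K is a weighted combination of base kernels K_k
K(x_i, x_j) = \sum_{k} \beta_k \, K_k(x_i, x_j),
\qquad \beta_k \ge 0, \quad \sum_{k} \beta_k = 1
```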
Heatmap of Combined Kernel
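A combined kernel of this kind is usually a weighted sum of the individual kernel matrices. The sketch below assumes that form, with non-negative weights β_k that have already been learned. `combine_kernels` is a hypothetical helper, and the matrices and weights are made-up values for illustration.

```python
def combine_kernels(kernels, betas):
    """Return the weighted sum of square kernel matrices (lists of lists).

    Assumption: the combined kernel is sum_k beta_k * K_k with beta_k >= 0,
    the usual multiple kernel learning form; weights are given, not learned.
    """
    if len(kernels) != len(betas):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(b < 0 for b in betas):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, beta in zip(kernels, betas):
        for i in range(n):
            for j in range(n):
                combined[i][j] += beta * kernel[i][j]
    return combined


# Two toy 2x2 kernel matrices with equal weights (illustrative values only).
k_static = [[1.0, 0.0], [0.0, 1.0]]
k_dynamic = [[1.0, 1.0], [1.0, 1.0]]
k_combined = combine_kernels([k_static, k_dynamic], [0.5, 0.5])
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the combined matrix can be passed to any kernel classifier that accepts a precomputed Gram matrix.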
RESULTS
• Criterion 1: Accuracy, calculated using 10-fold cross-validation
• Criterion 2: ROC curves / AUC values
• Criterion 3: Speed of classifying new instances
• Criterion 4: Testing on a large malware sample: accuracy on a validation set consisting of 20k samples
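For the AUC criterion, one generic way to compute the value is as the probability that a randomly chosen malicious sample scores higher than a randomly chosen benign one. The sketch below uses that pairwise definition. It is not the evaluation code behind the reported results, and `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for malware, 0 for benign (assumed encoding).
    scores: classifier scores, higher meaning "more likely malware".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, so a rank-based computation would be preferable for something the size of the 20k-sample validation set.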
OBSERVATIONS
• In total, 19 of the 1556 instances in the original dataset were misclassified (false positives and false negatives combined).
• Static analysis alone doesn’t work well when the training instances have been packed.
LIMITATIONS AND DRAWBACKS
• Selecting an appropriate value of n for n-gram analysis
• Collecting dynamic system traces is too time-consuming and resource-intensive on a normal system
• Choosing optimal instruction call categories
• Intel Pin isn’t transparent to the program it traces while collecting instructions
RELATED WORK
• Use of single data sources
• Use of static data sources combined with ensemble learning
• Result Fusion Model
• Identifying packed and hidden code
CONCLUSION
• Not restricting malware classification to a single data source improves classification accuracy.
• In a resource-constrained environment, combined static analysis can give high accuracy and a low number of false positives.
• Static analysis is not an optimal solution when instances have been packed or have high entropy.