SLIDE 1

A Scheme for Generating a Dataset for Anomalous Activity Detection in IoT Networks

Imtiaz Ullah and Qusay H. Mahmoud

33rd Canadian Conference on Artificial Intelligence 12-15 May 2020

SLIDE 2

Agenda

▪ Introduction
▪ Motivation
▪ Problem Statement
▪ Related Work
▪ Testbed Architecture
▪ Results and Analysis

  • Correlated Features
  • Feature Ranking
  • Learning Curve
  • Validation Curve
  • Classification

▪ Conclusion
▪ Future Work


SLIDE 3

Introduction

▪ Smart digital devices

  • Become part of our daily lives
  • Improve the quality of life
  • Make communication easier
  • Increase the data transfer and information sharing

▪ “Things” in the IoT could be anything

  • Physical
  • Virtual

▪ Technological challenges

  • Security
  • Power usage
  • Scalability
  • Communication mechanisms


SLIDE 4

Introduction Cont.

▪ The exponential growth of the IoT makes it an attractive target for attackers.
▪ The effects of cyber-attacks are becoming more destructive.

  • Fig. 1. Source: https://www.forbes.com/sites/gilpress/2016/09/02/internet-of-things-by-the-numbers-what-new-surveys-found/#a60d28116a0e

SLIDE 5


Motivation

▪ The exponential growth of Internet of Things (IoT) devices provides a large attack surface for intruders to launch more destructive cyber-attacks.
▪ New detection techniques and algorithms require a well-designed dataset for IoT networks.
▪ Available IoT intrusion datasets have a limited number of features.
▪ Available IoT datasets contain very few flow-based features.

SLIDE 6


Problem Statement

▪ Firstly, we reviewed the weaknesses of various intrusion detection datasets.
▪ Secondly, we proposed a new dataset, adapted from https://ieee-dataport.org/open-access/iot-network-intrusion-dataset
▪ Thirdly, we provided a significant set of features with their corresponding weights.
▪ Finally, we proposed a new detection classification methodology using the generated dataset.
▪ The IoT Botnet dataset can be accessed from https://sites.google.com/view/iot-network-intrusion-dataset.

SLIDE 7

Related Work


▪ DARPA 98 / 99

  • Developed at the MIT Lincoln Laboratory via an emulated network environment.
  • The DARPA 98 dataset contains seven days of network traffic.
  • The DARPA 99 dataset contains five weeks of network traffic.

▪ Lee and Stolfo developed the KDD99 dataset from DARPA 98/99.
▪ NSL-KDD removed redundant records from the KDD99 dataset.

  • Training data of KDD99 contains 78% redundant instances.
  • Testing data contains 75% redundant instances.
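The NSL-KDD cleanup described above boils down to keeping only the first occurrence of each record. A minimal sketch of that redundancy removal (the record format here is illustrative, not KDD99's actual 41-field layout):

```python
def deduplicate(records):
    """Keep only the first occurrence of each record, dropping exact
    duplicates -- the core of NSL-KDD's cleanup of the KDD99 data."""
    seen, unique = set(), []
    for record in records:
        key = tuple(record)          # hashable view of one record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Toy example: three records, one exact duplicate.
rows = [("tcp", "http", 215), ("udp", "dns", 44), ("tcp", "http", 215)]
clean = deduplicate(rows)            # duplicate third row is dropped
```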

▪ ISCX dataset, developed at the Canadian Institute for Cybersecurity (CIC), University of New Brunswick.

  • Systematic approach to generating normal and malicious traffic.
  • Multistage attacks.
  • Publicly available.
SLIDE 8

Related Work Cont.

▪ UNSW-NB15

  • Comprehensive modern normal network traffic.
  • Diverse intrusion scenarios.
  • In-depth structured network traffic information.
  • Publicly available.
  • 49 features.
  • Flow, basic, content, time, additional generated, connection, and labeled features.

▪ CICIDS2017

  • Modern normal and malicious network traffic.
  • 80 network features.
  • Reliable normal and malicious network flows.
  • Publicly available.

▪ CICDDOS2019

  • Up-to-date normal and malicious DDoS network traffic.
  • 12 DDoS attacks.
  • Publicly available.
  • Comprehensive metadata about IP addresses.

SLIDE 9

Related Work Cont.

▪ BoT-IoT Dataset

  • Developed via legitimate and emulated IoT networks.
  • A typical smart home configuration was designed.
  • Dataset is publicly available.
  • 49 features.

▪ Botnet IoT Dataset

  • Dataset generated using nine commercial IoT devices.
  • Two IoT-based botnets: BASHLITE and Mirai.
  • 115 network features.

SLIDE 10

Testbed Architecture

▪ A typical smart home environment.
▪ The smart home device SKT NGU and an EZVIZ Wi-Fi camera were used to generate the IoTID20 dataset.
▪ Other devices: laptops, tablets, and smartphones.
▪ The SKT NGU and EZVIZ Wi-Fi camera are the IoT victim devices; all other devices in the testbed are the attacking devices.
▪ CICFlowMeter was used to extract features.

  • Fig. 2. Source: https://ieee-dataport.org/open-access/iot-network-intrusion-dataset
SLIDE 11

Testbed Architecture Cont.

  • Fig. 3. IoTID20 Dataset Attack Taxonomy

▪ New IoTID20 dataset for anomalous activity detection in IoT networks.
▪ IoTID20 is available in CSV format.
▪ Various types of IoT attacks and families.
▪ Large number of general features.
▪ Large number of flow-based features.
▪ High-rank features.

SLIDE 12

Label Feature of IoTID20

Table 1. Binary, Category, and Sub-Category of the IoTID20 Dataset

Binary:      Normal, Anomaly
Category:    Normal, DoS, Mirai, MITM, Scan
Subcategory: Normal, Syn Flooding, Brute Force, HTTP Flooding, UDP Flooding, ARP Spoofing, Host Port, OS
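The three label levels nest: each sub-category rolls up into a category, and everything except Normal collapses to Anomaly. A sketch of a hypothetical helper expressing that hierarchy (sub-category spellings are illustrative, following Fig. 3 and Fig. 13; check the CSV for the exact names used in IoTID20):

```python
# Hypothetical mapping from IoTID20 sub-category to category,
# following Table 1 and the attack taxonomy in Fig. 3 / Fig. 13.
CATEGORY = {
    "Normal": "Normal",
    "Syn Flooding": "DoS",
    "Ack Flooding": "Mirai",
    "Host Brute Force": "Mirai",
    "HTTP Flooding": "Mirai",
    "UDP Flooding": "Mirai",
    "ARP Spoofing": "MITM",
    "Host Port Scan": "Scan",
    "OS Scan": "Scan",
}

def binary_label(subcategory):
    """Collapse a sub-category into the two-class (binary) label."""
    return "Normal" if subcategory == "Normal" else "Anomaly"
```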

SLIDE 13

Results and Analysis

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 · (Precision · Recall) / (Precision + Recall)
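These four metrics follow directly from confusion-matrix counts; a minimal sketch:

```python
def metrics(tp, tn, fp, fn):
    """Evaluation metrics from raw confusion-matrix counts (assumes at
    least one positive prediction and one positive label, so the
    denominators are non-zero)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```

For example, `metrics(80, 90, 20, 10)` gives accuracy 0.85 and precision 0.80 on 200 flows.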

SLIDE 14

IoTID20 Dataset Correlated Features

Table 2. IoTID20 Dataset Correlated Features

Total features: 12
Features: Active_Max, Bwd_IAT_Max, Bwd_Seg_Size_Avg, Fwd_IAT_Max, Fwd_Seg_Size_Avg, Idle_Max, PSH_Flag_Cnt, Pkt_Size_Avg, Subflow_Bwd_Byts, Subflow_Bwd_Pkts, Subflow_Fwd_Byts, Subflow_Fwd_Pkts

▪ Correlated features degrade the detection capability of a machine learning algorithm.
▪ A correlation coefficient threshold of 0.70 was used to remove the correlated features.
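Threshold-based removal of correlated features can be sketched in pure Python (in practice one would compute the correlation matrix with a library such as pandas; the greedy keep-or-drop loop and the toy data here are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.70):
    """features: dict of name -> list of values.  Greedily drop any
    feature whose |r| with an already-kept feature exceeds threshold
    (the 0.70 cut-off used on the slide)."""
    kept, dropped = [], []
    for name in features:
        if any(abs(pearson(features[name], features[k])) > threshold
               for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped
```

With pandas the same idea is usually written as `df.corr().abs()` plus a mask over the upper triangle of the matrix.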

SLIDE 15

Feature Ranking

  • [Chart: Shapiro-Wilk ranking score (0.1 to 1.0) on the y-axis for each IoTID20 feature on the x-axis, from Flow_ID through Idle_Min]

▪ High-ranked features improve feature selection.
▪ The Shapiro-Wilk algorithm was used to rank the IoTID20 features.
▪ More than 70% of the features ranked with a value greater than 0.50.

  • Fig. 4. Feature ranking with the Shapiro-Wilk algorithm
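Ranking features by the Shapiro-Wilk statistic can be sketched with SciPy, whose `shapiro` returns the W statistic in (0, 1] that serves as the score. A minimal sketch on toy data (the column names and distributions are illustrative stand-ins for IoTID20 columns):

```python
import random
from scipy.stats import shapiro  # Shapiro-Wilk normality test

random.seed(42)
# Two toy feature columns: one roughly normal, one heavily skewed.
features = {
    "Flow_Duration": [random.gauss(50, 10) for _ in range(100)],
    "Tot_Fwd_Pkts": [random.expovariate(0.1) for _ in range(100)],
}

# shapiro() returns (W, p-value); W is used here as the ranking score.
ranking = {name: shapiro(values)[0] for name, values in features.items()}
ranked = sorted(ranking, key=ranking.get, reverse=True)
```

The near-normal column should score higher W than the skewed one, mirroring how the figure orders features by score.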
SLIDE 16

Learning Curve


  • Fig. 5. Learning Curve for Binary Label

▪ A learning curve shows

  • The relationship between the training and validation performance of an algorithm for various training-set sizes.
  • Whether the algorithm would benefit from more data, or whether the data provided is enough for good performance.

  • [Chart: training and testing accuracy (50-100%) for the binary label vs. training-set size (32,000-102,000 samples)]
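The idea behind the curve can be sketched by hand: train on growing subsets and compare training accuracy against held-out accuracy at each size. A toy version with a 1-nearest-neighbour classifier on synthetic 1-D data (the slides' curves are computed on IoTID20; everything below is an illustrative stand-in):

```python
import random

def nn_predict(train, x):
    """1-nearest-neighbour: label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    return sum(nn_predict(train, x) == y for x, y in data) / len(data)

random.seed(0)
# Synthetic binary data: class 0 centred at 0.0, class 1 at 1.0.
data = [(random.gauss(c, 0.4), c) for _ in range(200) for c in (0, 1)]
random.shuffle(data)
train_pool, test = data[:300], data[300:]

# (size, training accuracy, testing accuracy) for growing subsets.
curve = [(n, accuracy(train_pool[:n], train_pool[:n]),
             accuracy(train_pool[:n], test))
         for n in (30, 100, 300)]
```

Plotting the second and third columns against the first gives exactly the two lines in the figure.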

SLIDE 17

Learning Curve

  • [Chart: training and testing accuracy (50-100%) for the category label vs. training-set size (32,000-102,000 samples)]

  • Fig. 6. Learning Curve for Category

SLIDE 18

Learning Curve

  • [Chart: training and testing accuracy (50-100%) for the sub-category label vs. training-set size (32,000-102,000 samples)]

  • Fig. 7. Learning Curve for Subcategory

SLIDE 19

Validation Curve

  • [Chart: training and testing accuracy (50-100%) for the binary label vs. hyperparameter values 1-10]

  • Fig. 8. Validation Curve for Binary Label

▪ A validation curve shows

  • The effectiveness of a classifier on the data it was trained on.
  • The efficiency of the classifier on new test data.
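Where a learning curve varies the training-set size, a validation curve varies a hyperparameter. A toy sketch that sweeps k of a k-nearest-neighbour classifier and records training vs. test accuracy (illustrative stand-in; the slides' curves are computed on IoTID20):

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """k-nearest-neighbour: majority label among the k closest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

random.seed(1)
data = [(random.gauss(c, 0.4), c) for _ in range(150) for c in (0, 1)]
random.shuffle(data)
train, test = data[:200], data[200:]

# (k, training accuracy, testing accuracy) across hyperparameter values.
curve = [(k, accuracy(train, train, k), accuracy(train, test, k))
         for k in (1, 3, 5, 7, 9)]
```

At k = 1 the training accuracy is a perfect 1.0 (each point is its own nearest neighbour) while the test accuracy is lower, which is the overfitting gap the curve is meant to expose.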
SLIDE 20

Validation Curve

  • [Chart: training and testing accuracy (50-100%) for the category label vs. hyperparameter values 1-10]

  • Fig. 9. Validation Curve for Category

SLIDE 21

Validation Curve

  • [Chart: training and testing accuracy (50-100%) for the sub-category label vs. hyperparameter values 1-10]

  • Fig. 10. Validation Curve for Subcategory

SLIDE 22

Binary Classification

  • [Chart: F-Score (10-100) for Normal and Anomaly classes per classifier: SVM, Gaussian NB, LDA, Logistic Regression, Decision Tree, Random Forest, Ensemble]

  • Fig. 11. F-Score for Binary Label

▪ Classifies the dataset as normal or malicious network traffic.
▪ SVM, Gaussian NB, LDA, and logistic regression performed poorly for binary label classification.
▪ The decision tree, random forest, and ensemble classifiers performed very well for binary label classification.
▪ 3, 5, and 10-fold cross-validation tests were run to check the overfitting of the classifiers.
▪ The cross-validation results remained unchanged across folds.
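The 3/5/10-fold check above can be sketched in a few lines: split the data into k folds, hold each fold out in turn, and compare the per-fold scores for stability. The classifier here is a toy threshold rule on synthetic data (on IoTID20 it would be, e.g., a decision tree):

```python
import random

def kfold_scores(data, k_folds, fit, predict):
    """Accuracy of the classifier on each held-out fold."""
    scores = []
    for i in range(k_folds):
        test = data[i::k_folds]                       # every k-th record
        train = [p for j, p in enumerate(data) if j % k_folds != i]
        model = fit(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return scores

# Stand-in classifier: threshold at the midpoint of the two class means.
def fit(train):
    means = {c: sum(x for x, y in train if y == c) /
                max(1, sum(1 for _, y in train if y == c)) for c in (0, 1)}
    return (means[0] + means[1]) / 2

def predict(threshold, x):
    return 0 if x < threshold else 1

random.seed(2)
data = [(random.gauss(c, 0.4), c) for _ in range(100) for c in (0, 1)]
random.shuffle(data)

# Per-fold scores for the three fold counts used on the slide.
results = {k: kfold_scores(data, k, fit, predict) for k in (3, 5, 10)}
```

If the mean score stays roughly the same for 3, 5, and 10 folds, the classifier is not overfitting any particular split, which is the "results remain unchanged" observation above.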

SLIDE 23

Category Classification

  • [Chart: F-Score (10-100) for Normal, DoS, Mirai, MITM, and Scan classes per classifier: SVM, Gaussian NB, LDA, Logistic Regression, Decision Tree, Random Forest, Ensemble]

  • Fig. 12. F-Score for Category Label

▪ Classifies the dataset as normal network traffic or one of the attack categories: DoS, Mirai, MITM, or Scan.
▪ The decision tree estimator performs very well for all attack categories.
▪ Poor performance by logistic regression, LDA, Gaussian NB, and SVM.

SLIDE 24

Subcategory Classification

  • [Chart: F-Score (10-100) for Normal, DoS-Synflooding, Mirai-Ackflooding, Mirai-Hostbruteforce, Mirai-HTTP Flooding, Mirai-UDP Flooding, MITM ARP Spoofing, Scan Hostport, and Scan Port per classifier]

  • Fig. 13. F-Score for Sub-Category Label

▪ Classifies the dataset into normal network traffic or one of the sub-categories, as shown in Figure 3.
▪ The decision tree classifier achieved the best performance for the sub-categories.
SLIDE 25

Performance Comparison

Table 3. IoTID20 Dataset Performance Results

Algorithm             Accuracy  Precision  Recall  F-Score
SVM                      40        55        37      16
Gaussian NB              73        70        66      62
LDA                      70        71        71      70
Logistic Regression      40        25        39      30
Decision Tree            88        88        88      88
Random Forest            84        85        84      84
Ensemble                 87        87        87      87

SLIDE 26


Conclusion

▪ New IoTID20 dataset for anomalous activity detection in IoT networks.
▪ New normal and malicious IoT network traffic.
▪ Various types of IoT attacks and families.
▪ Large number of general features.
▪ Large number of flow-based features.
▪ High-rank features.

SLIDE 27

Future Work

In the future, we plan to develop and evaluate a framework of anomalous activity detection models for IoT networks to improve accuracy.

SLIDE 28

Contact us…

{imtiaz.ullah, qusay.mahmoud}@ontariotechu.net

Thank you!