Similarity Analysis in Verona & IMDEA Roberto Giacobazzi - - PowerPoint PPT Presentation



slide-1
SLIDE 1 ⓒ Giacobazzi

Similarity Analysis in Verona & IMDEA

Roberto Giacobazzi Mila Dalla Preda Niccolò Marastoni

slide-2
SLIDE 2

ⓒ Giacobazzi

Big Data

The 3Vs

Volume: Terabytes, Zettabytes
Velocity: Batch, Streaming Data
Variety: Structured, Structured & Unstructured

We need Automation

slide-3
SLIDE 3

ⓒ Giacobazzi

Abstraction

Surveillance in Big Data

Pattern Recognition

slide-4
SLIDE 4

ⓒ Giacobazzi

Devices = Cameras
Abstraction = Pattern recognition
Analysis = Similarity

Automated Surveillance

slide-5
SLIDE 5

ⓒ Giacobazzi

Big Data vs Big Code

Big Code

Diversity Dependency Dimension

Static Dynamic Source

Transformed & Documented

Mobile Code Executable

Major Threat

My 3Ds

We need even more Automation

slide-6
SLIDE 6

ⓒ Giacobazzi

Devices = Compromised networks
Abstraction = Abstract Interpretation
Analysis = Code Similarity

Automated Surveillance

slide-7
SLIDE 7

ⓒ Giacobazzi

Similarity Analysis

slide-8
SLIDE 8

ⓒ Giacobazzi

On THE (im)possibility result!

CLASSES OF RECURSIVELY ENUMERABLE SETS AND THEIR DECISION PROBLEMS(1)

BY H. G. RICE

1. Introduction. In this paper we consider classes whose elements are recursively enumerable sets of non-negative integers. No discussion of recursively enumerable sets can avoid the use of such classes, so that it seems desirable to know some of their properties. We give our attention here to the properties of complete recursive enumerability and complete recursiveness (which may be intuitively interpreted as decidability). Perhaps our most interesting result (and the one which gives this paper its name) is the fact that no nontrivial class is completely recursive. We assume familiarity with a paper of Kleene [5](2), and with ideas which are well summarized in the first sections of a paper of Post [7].

I. Fundamental definitions

2. Partial recursive functions. We shall characterize recursively enumerable (r.e.) sets of non-negative integers by the partial recursive functions of Kleene. The set characterized (or, as we shall say more frequently, enumerated) by a partial recursive function of one variable will be taken as the range of values of the function. A function undefined for all arguments (and thus producing no values) will be considered to produce an enumeration of the empty set o. Kleene has shown [5, pp. 50-58] that a Gödel enumeration of the partial recursive functions is possible, so that we may designate any partial recursive function of one variable as φn(x), where n is a Gödel number of the function. Actually, it requires only a minor adjustment of Kleene's constructions to insure that, not only does every function have at least one number, but that every non-negative integer n is the number of some function. We shall assume this to be the situation, and shall make one other minor adjustment: φ0(x) is the identity function. Kleene further showed the existence of a recursive predicate T(x, y, z) and a primitive recursive function U(x) such that

Presented to the Society, December 28, 1951; received by the editors of the Journal for Symbolic Logic, November 16, 1951, subsequently transferred to the Transactions, and received in revised form May 26, 1952.

(1) Most of the results in this paper were contained in a thesis written under Professor Paul Rosenbloom, to whom the author wishes to express his gratitude, and presented toward the degree of Doctor of Philosophy at Syracuse University.
(2) Numbers in brackets refer to the bibliography at the end of the paper.

1952

We can only approximate!!!

W ∈ { P | P ≈ Q } ?

W Code

slide-9
SLIDE 9

ⓒ Giacobazzi

Example of static analysis (input)

{n0>=0}
n := n0;
{n0=n,n0>=0}
i := n;
{n0=i,n0=n,n0>=0}
while (i <> 0) do
    {n0=n,i>=1,n0>=i}
    j := 0;
    {n0=n,j=0,i>=1,n0>=i}
    while (j <> i) do
        {n0=n,j>=0,i>=j+1,n0>=i}
        j := j + 1
        {n0=n,j>=1,i>=j,n0>=i}
    od;
    {n0=n,i=j,i>=1,n0>=i}
    i := i - 1
    {i+1=j,n0=n,i>=0,n0>=i+1}
od
{n0=n,i=0,n0>=0}

Code

slide-10
SLIDE 10

ⓒ Giacobazzi

Example of static analysis (output)

{n0>=0}
n := n0;
{n0=n,n0>=0}
i := n;
{n0=i,n0=n,n0>=0}
while (i <> 0) do
    {n0=n,i>=1,n0>=i}
    j := 0;
    {n0=n,j=0,i>=1,n0>=i}
    while (j <> i) do
        {n0=n,j>=0,i>=j+1,n0>=i}
        j := j + 1
        {n0=n,j>=1,i>=j,n0>=i}
    od;
    {n0=n,i=j,i>=1,n0>=i}
    i := i - 1
    {i+1=j,n0=n,i>=0,n0>=i+1}
od
{n0=n,i=0,n0>=0}

Code Understanding
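
The annotations in braces are invariants that a static analyzer infers without running the program. As a rough sanity check only (an assumption of this note, not part of the slides), the sketch below transcribes the program into Python and turns each annotation into a runtime assertion. The function name run_with_invariants and the returned iteration counter are hypothetical. Passing assertions on a few inputs illustrates the invariants; it does not prove them, which is exactly what the static analysis adds.

```python
def run_with_invariants(n0: int) -> int:
    """Run the annotated program, asserting each invariant from the slide.

    Returns the number of inner-loop iterations (a hypothetical extra,
    used only to have something observable to test).
    """
    inner_steps = 0
    assert n0 >= 0                                   # {n0>=0}
    n = n0
    assert n0 == n and n0 >= 0                       # {n0=n,n0>=0}
    i = n
    assert n0 == i and n0 == n and n0 >= 0           # {n0=i,n0=n,n0>=0}
    while i != 0:
        assert n0 == n and i >= 1 and n0 >= i        # {n0=n,i>=1,n0>=i}
        j = 0
        assert n0 == n and j == 0 and i >= 1 and n0 >= i
        while j != i:
            assert n0 == n and j >= 0 and i >= j + 1 and n0 >= i
            j = j + 1
            inner_steps += 1
            assert n0 == n and j >= 1 and i >= j and n0 >= i
        assert n0 == n and i == j and i >= 1 and n0 >= i
        i = i - 1
        assert i + 1 == j and n0 == n and i >= 0 and n0 >= i + 1
    assert n0 == n and i == 0 and n0 >= 0            # {n0=n,i=0,n0>=0}
    return inner_steps
```

Calling run_with_invariants on any non-negative integer exercises every annotation along that one execution path.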

slide-11
SLIDE 11

ⓒ Giacobazzi

Code Obfuscation

Example of static analysis (output)

{n0>=0}
n := n0;
{n0=n,n0>=0}
i := n;
{n0=i,n0=n,n0>=0}
while (i <> 0) do
    {n0=n,i>=1,n0>=i}
    j := 0;
    {n0=n,j=0,i>=1,n0>=i}
    while (j <> i) do
        {n0=n,j>=0,i>=j+1,n0>=i}
        j := j + 1
        {n0=n,j>=1,i>=j,n0>=i}
    od;
    {n0=n,i=j,i>=1,n0>=i}
    i := i - 1
    {i+1=j,n0=n,i>=0,n0>=i+1}
od
{n0=n,i=0,n0>=0}

slide-12
SLIDE 12

ⓒ Giacobazzi

Code Obfuscation

slide-13
SLIDE 13

ⓒ Giacobazzi

Code Obfuscation

slide-14
SLIDE 14

ⓒ Giacobazzi

Another (im)possibility result!

2001

We can only partially obfuscate!!! W Code

W ∈

?

{ P | P ≈ Q }

?

VBB

Q

slide-15
SLIDE 15

ⓒ Giacobazzi

Can we build a theory in PL? (outside crypto)

slide-16
SLIDE 16

ⓒ Giacobazzi

x(t)

t

[ [P] ]

The Concrete Model

slide-17
SLIDE 17

ⓒ Giacobazzi

The Concrete Model

x(t)

t

Bad State 1 bug! We need computers to reason about computers

[ [P] ]

slide-18
SLIDE 18

ⓒ Giacobazzi

Partial Execution

x(t)

t

Cheap, efficient, but unsound!!! Bad State Still buggy!

stop

[ [P] ]

slide-19
SLIDE 19

ⓒ Giacobazzi

Testing & Dynamic analysis

x(t)

t

Efficient but unsound! Bad State Still buggy!

[ [P] ]

slide-20
SLIDE 20

ⓒ Giacobazzi

x(t)

t

Still too complicated, complex, undecidable

Abstracting the Model

α([ [P] ])

slide-21
SLIDE 21

ⓒ Giacobazzi

x(t)

t

Still too complicated, complex, undecidable

Abstracting the Model

α([ [P] ])

slide-22
SLIDE 22

ⓒ Giacobazzi

x(t)

t

This is NOT Abstract Interpretation!!! Bad State No bug!

Abstracting the Model

α([ [P] ])

slide-23
SLIDE 23

ⓒ Giacobazzi

x(t)

t

Abstract Interpretation

Affordable (sound) loss of precision

Abstract Interpretation by Cousot & Cousot ACM POPL 1977

[ [P] ]α

slide-24
SLIDE 24

ⓒ Giacobazzi

x(t)

t

I

Abstract Interpretation by Cousot & Cousot ACM POPL 1977

Abstract Interpretation

Affordable (sound) loss of precision

[ [P] ]α

slide-25
SLIDE 25

ⓒ Giacobazzi

x(t)

t

I II

Abstract Interpretation by Cousot & Cousot ACM POPL 1977

Abstract Interpretation

Affordable (sound) loss of precision

[ [P] ]α

slide-26
SLIDE 26

ⓒ Giacobazzi

x(t)

t

I II III

Abstract Interpretation by Cousot & Cousot ACM POPL 1977

Abstract Interpretation

Affordable (sound) loss of precision

[ [P] ]α

slide-27
SLIDE 27

ⓒ Giacobazzi

x(t)

t

Abstract Interpretation by Cousot & Cousot ACM POPL 1977

IV Fix-point

Abstract Interpretation

Affordable (sound) loss of precision

[ [P] ]α
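
The steps I, II, III, IV leading to the fix-point can be pictured as Kleene iteration in an abstract domain. The sketch below is an illustration under stated assumptions, not the construction from the slides: it uses the interval domain, a hypothetical toy loop "i := 0; while i < 10 do i := i + 1 od", and helper names (join, restrict_below, shift, loop_head, kleene) invented for this note. Because the loop is bounded, the iteration stabilizes without widening; real analyzers generally need widening to guarantee termination.

```python
from typing import Callable, List, Optional, Tuple

# An interval (lo, hi); None stands for bottom (no reachable value).
Interval = Optional[Tuple[int, int]]


def join(a: Interval, b: Interval) -> Interval:
    """Least upper bound of two intervals."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def restrict_below(a: Interval, k: int) -> Interval:
    """Abstract effect of the loop guard i < k."""
    if a is None or a[0] >= k:
        return None
    return (a[0], min(a[1], k - 1))


def shift(a: Interval, c: int) -> Interval:
    """Abstract effect of i := i + c."""
    return None if a is None else (a[0] + c, a[1] + c)


def loop_head(x: Interval) -> Interval:
    """One abstract step at the loop head of: i := 0; while i < 10 do i := i + 1 od."""
    return join((0, 0), shift(restrict_below(x, 10), 1))


def kleene(f: Callable[[Interval], Interval],
           max_steps: int = 100) -> Tuple[Interval, List[Interval]]:
    """Iterate f from bottom until the value stops changing."""
    x: Interval = None
    trace: List[Interval] = [x]
    for _ in range(max_steps):
        nxt = f(x)
        if nxt == x:
            return x, trace
        x = nxt
        trace.append(x)
    raise RuntimeError("no fix-point reached within max_steps")


fixpoint, trace = kleene(loop_head)
```

Each entry of trace is one iterate; the last one is the loop-head invariant, an over-approximation of every value i can take there.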

slide-28
SLIDE 28

ⓒ Giacobazzi

Affordable (sound) loss of precision

x(t)

t

Bad State

Guaranteed Security

Soundness

α([ [P] ]) ⊆ [ [P] ]α

[ [P] ]α

slide-29
SLIDE 29

ⓒ Giacobazzi

x(t)

t

Bad State

True Alarm

Soundness

Affordable (sound) loss of precision

α([ [P] ]) ⊆ [ [P] ]α

[ [P] ]α

slide-30
SLIDE 30

ⓒ Giacobazzi

x(t)

t

Bad State

False Alarms

α([ [P] ]) ⊆ [ [P] ]α

(In)completeness

Affordable (sound) loss of precision

[ [P] ]α

slide-31
SLIDE 31

ⓒ Giacobazzi

x(t)

t

Bad State X Completeness Domain Refinement

Giacobazzi et al. JACM 2000

Just true bugs!

You can always refine!!!

[ [P] ]α

slide-32
SLIDE 32

ⓒ Giacobazzi

x(t)

t

Bad State X Completeness Domain Refinement Just true bugs!

You can always refine!!!

α([ [P] ]) = [ [P] ]α

[ [P] ]α

slide-33
SLIDE 33

ⓒ Giacobazzi

Refine Simplify

Domain

Exploiting the (im)possibility results!

{ P | P ≈ Q }

W ∈ ? W Code

! / ?

α

slide-34
SLIDE 34

ⓒ Giacobazzi

Refine Simplify

Domain

Exploiting the (im)possibility results!

{ P | P ≈ Q }

W ∈ ? W Code

! / ?

α

slide-35
SLIDE 35

ⓒ Giacobazzi

Refine Simplify

Domain

Exploiting the (im)possibility results!

{ P | P ≈ Q }

W ∈ ? W Code

! / ?

α

slide-36
SLIDE 36

ⓒ Giacobazzi

Refine Simplify

Domain

Exploiting the (im)possibility results!

{ P | P ≈ Q }

W ∈ ? W Code

! / ?

α

slide-37
SLIDE 37

ⓒ Giacobazzi

De-obfuscate Obfuscate

Code

Exploiting the (im)possibility results!

{ P | P ≈ }

W ∈ ? W Code

! / ?

Q

α Q

slide-38
SLIDE 38

ⓒ Giacobazzi

De-obfuscate Obfuscate

Code

Exploiting the (im)possibility results!

{ P | P ≈ }

W ∈ ? W Code

! / ?

Q

α

slide-39
SLIDE 39

ⓒ Giacobazzi

De-obfuscate Obfuscate

Code

Exploiting the (im)possibility results!

{ P | P ≈ }

W ∈ ? W Code

! / ?

Q

α

slide-40
SLIDE 40

ⓒ Giacobazzi

On the Completeness Class

Giacobazzi et al. ACM POPL 2015

then Cα then Cα

Obfuscate De-obfuscate

Obfuscation/De-obfuscation is compilation between completeness classes

Complete Incomplete

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-41
SLIDE 41

ⓒ Giacobazzi

Infinite

A [ [skip] ] ;

On the Completeness Class

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-42
SLIDE 42

ⓒ Giacobazzi

Infinite

A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ;

On the Completeness Class

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-43
SLIDE 43

ⓒ Giacobazzi

Infinite

A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ;

On the Completeness Class

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-44
SLIDE 44

ⓒ Giacobazzi

Infinite

A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; A [ [skip] ] ; …… ;

On the Completeness Class

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-45
SLIDE 45

ⓒ Giacobazzi

Non Extensional

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

P complete, [ [P] ] = [ [Q] ] ⇏ Q complete

P : x := y
Q : x := y + 1; x := x − 1

[ [P] ]Sign {y/+} = {x/+, y/+}
[ [Q] ]Sign {y/+} = {x/Z, y/+}

[Figure: the Sign abstract domain (with elements such as + and −), related to ℘(Z) by α and γ]

On the Completeness Class

As well as complexity!
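
The P versus Q example can be replayed with a small Sign analysis. This is a minimal sketch under assumptions: the encoding of the domain as strings, the transfer functions add, neg and sub, and the names abstract_P and abstract_Q are all invented here for illustration. It reproduces the two abstract results on the slide: the programs compute the same function, yet the analysis of Q loses the sign of x.

```python
# Elements of a simple Sign domain; "Z" is top (any integer), as on the slide.
BOT, NEG, ZERO, POS, TOP = "bot", "-", "0", "+", "Z"


def sign_of_const(c: int) -> str:
    """Abstraction of an integer constant."""
    if c == 0:
        return ZERO
    return POS if c > 0 else NEG


def add(a: str, b: str) -> str:
    """Abstract addition on signs."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b and a in (POS, NEG):
        return a
    return TOP  # mixed signs or top: result sign unknown


def neg(a: str) -> str:
    """Abstract unary minus."""
    return {POS: NEG, NEG: POS}.get(a, a)


def sub(a: str, b: str) -> str:
    """Abstract subtraction, defined through addition and negation."""
    return add(a, neg(b))


def abstract_P(env: dict) -> dict:
    """P : x := y"""
    out = dict(env)
    out["x"] = env["y"]
    return out


def abstract_Q(env: dict) -> dict:
    """Q : x := y + 1; x := x - 1"""
    out = dict(env)
    out["x"] = add(env["y"], sign_of_const(1))
    out["x"] = sub(out["x"], sign_of_const(1))
    return out


result_P = abstract_P({"y": POS})
result_Q = abstract_Q({"y": POS})
```

Here result_P keeps x positive while result_Q sends x to Z, since the domain cannot tell how large the positive value is once 1 is subtracted. Completeness is therefore a property of the program text, not of the function it computes.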

slide-46
SLIDE 46

ⓒ Giacobazzi

Non Trivial

On the Completeness Class

For any nontrivial abstraction α there always exists an incomplete program!

Similar to Rice’s Theorem [1952]

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

C(α) = All Programs ⇔ α ∈ {λx.x, λx.⊤}

slide-47
SLIDE 47

ⓒ Giacobazzi

Hard

4.5. If α is recursive and non trivial (i.e. α ≠ id & α ≠ ⊤) then Cα and its complement are productive sets.

… … …

P ∈ Imp

On the Completeness Class

Completeness is harder to prove than termination

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

slide-48
SLIDE 48

ⓒ Giacobazzi

Hard

C(α) def= {P program | α([ [P] ]) = [ [P] ]α}

Automating the proof that α is complete for P is impossible

4.5. If α is recursive and non trivial (i.e. α ≠ id & α ≠ ⊤) then Cα and its complement are productive sets.

On the Completeness Class

Completeness is harder to prove than termination

slide-49
SLIDE 49

ⓒ Giacobazzi

Cα ≰m C̄α and C̄α ≰m Cα

On Completeness and impossibility

then Cα then Cα

{Q | [ [P] ] = [ [Q] ]}

obfuscated

de-obfuscated

slide-50
SLIDE 50

ⓒ Giacobazzi

The Challenges

Intensional and extensional aspects of computation: From computability and complexity to program analysis and security

Shonan Village Center, Japan January 22-25, 2018

slide-51
SLIDE 51 ⓒ Giacobazzi

Similarity Analysis How do we build signatures?

slide-52
SLIDE 52 ⓒ Giacobazzi

Given a set of APKs, group them with respect to their similarity

REHA Reverse Engineering Helper for Android

slide-53
SLIDE 53 ⓒ Giacobazzi

Reverse Engineering Helper for Android REHA

Given an APK P and the classified DB, REHA returns the methods of P that are in common with the APKs in the DB

slide-54
SLIDE 54 ⓒ Giacobazzi

HOT API vector (risky API)

API 1 API 2 API n

.method public static lIllilIliIlIiIiIllilllilllIiIilIiilIilllIlIiliiiiI(Ljava/lang/String;)I
    const-string v0, "amfh.zrtcdiani.kvmzahyjyvq.bw"
    invoke-static {v0}, Ljava/lang/Class;->forName(Ljava/lang/String;)Ljava/lang/Class;
    move-result-object v0
    invoke-virtual {v0, p0}, Ljava/lang/Class;->getField(Ljava/lang/String;)Ljava/lang/reflect/Field;
    move-result-object v0
    invoke-virtual {v0, v0}, Ljava/lang/reflect/Field;->getInt(Ljava/lang/Object;)I
    :try_end_0
    .catch Ljava/lang/Exception; {:try_start_0 .. :try_end_0} :catch_0
    move-result v0
    return v0
    :catch_0
    move-exception v0
    new-instance v1, Ljava/lang/RuntimeException;
    const-string v2, ""
    invoke-direct {v1, v2, v0}, Ljava/lang/RuntimeException;-><init>(Ljava/lang/String;Ljava/lang/Throwable;)V
    throw v1
.end method

How? Static

+

Centroid of the CFG: sequence #, # outgoing edges, # loop depth
slide-55
SLIDE 55 ⓒ Giacobazzi

Method & APK similarity

Given two methods M1 and M2:

Sim(M1,M2) = CDD(M1.centroid, M2.centroid) < C-threshold AND VDD(M1.API-Vector, M2.API-Vector) < V-threshold

CDD: Centroid Distance Degree, in [0,1]
VDD: Vector Distance Degree, in [0,1], returns 0 when they are identical

REHA

Sim(APK1, APK2) = | {M | M in APK1 ∩ APK2} | / | APK1 |

How much of APK1 is similar to APK2
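
The two similarity definitions above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the slides do not give the formulas for CDD and VDD, so the normalized distances below are hypothetical stand-ins that merely stay in [0,1] and return 0 on identical inputs, and the threshold values are arbitrary. The Method class and function names are likewise invented; this is not REHA's implementation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Method:
    name: str
    centroid: Tuple[float, float, float]  # (sequence #, # outgoing edges, loop depth)
    api_vector: Tuple[int, ...]           # HOT API usage counts (hypothetical encoding)


def cdd(c1: Sequence[float], c2: Sequence[float]) -> float:
    """Hypothetical Centroid Distance Degree in [0,1] for non-negative coordinates."""
    worst = 0.0
    for a, b in zip(c1, c2):
        if a + b > 0:
            worst = max(worst, abs(a - b) / (a + b))
    return worst


def vdd(v1: Sequence[int], v2: Sequence[int]) -> float:
    """Hypothetical Vector Distance Degree in [0,1]; 0 when the vectors are identical."""
    total = sum(a + b for a, b in zip(v1, v2))
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(v1, v2)) / total


def similar(m1: Method, m2: Method,
            c_threshold: float = 0.05, v_threshold: float = 0.1) -> bool:
    """Sim(M1,M2): both distances must fall below their thresholds."""
    return (cdd(m1.centroid, m2.centroid) < c_threshold
            and vdd(m1.api_vector, m2.api_vector) < v_threshold)


def apk_similarity(apk1: Sequence[Method], apk2: Sequence[Method]) -> float:
    """Fraction of APK1's methods that have a similar method in APK2."""
    if not apk1:
        return 0.0
    matched = sum(1 for m in apk1 if any(similar(m, other) for other in apk2))
    return matched / len(apk1)


a = Method("a", (3.0, 1.5, 0.5), (2, 0, 1))
b = Method("b", (3.0, 1.5, 0.5), (2, 0, 1))
c = Method("c", (10.0, 4.0, 2.0), (0, 5, 0))
```

Note that the APK measure is asymmetric: apk_similarity([a, c], [b]) asks how much of the first APK is covered by the second, and swapping the arguments can give a different value.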

slide-56
SLIDE 56 ⓒ Giacobazzi

REHA Scalability

REHA works in O(n²) for a dataset containing n methods.

Method similarity measures the distance between 3-dimensional centroid vectors. We consider only the regions of the 3-dimensional space that contain centroids close to the input one.

Search space reduction

Axes: sequence #, # outgoing edges, # loop depth
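
One common way to restrict the search to nearby regions is a uniform grid over the 3-dimensional centroid space. The sketch below assumes that technique; the slides do not say which data structure REHA uses, and the class CentroidGrid, its cell size and its method names are hypothetical. A query inspects only the cell of the input centroid and its 26 neighbours instead of all n methods.

```python
from collections import defaultdict
from itertools import product
from math import floor
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


class CentroidGrid:
    """Buckets centroids by grid cell so that lookups stay local."""

    def __init__(self, cell: float = 1.0) -> None:
        self.cell = cell
        self.buckets: Dict[Cell, List[Tuple[str, Point]]] = defaultdict(list)

    def _key(self, p: Point) -> Cell:
        # Integer coordinates of the cell containing p.
        return (floor(p[0] / self.cell),
                floor(p[1] / self.cell),
                floor(p[2] / self.cell))

    def add(self, name: str, p: Point) -> None:
        self.buckets[self._key(p)].append((name, p))

    def candidates(self, p: Point) -> List[str]:
        """Names of methods whose centroid lies in the same or an adjacent cell."""
        kx, ky, kz = self._key(p)
        found: List[str] = []
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            bucket = self.buckets.get((kx + dx, ky + dy, kz + dz), [])
            found.extend(name for name, _ in bucket)
        return found


grid = CentroidGrid(cell=1.0)
grid.add("near", (1.2, 0.9, 0.1))
grid.add("far", (9.0, 9.0, 9.0))
nearby = grid.candidates((1.0, 1.0, 0.0))
```

Any centroid within one cell width of the query on every axis is guaranteed to be among the candidates, so the exact distance check only has to run on that short list.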

slide-57
SLIDE 57 ⓒ Giacobazzi

Genome Project

APKs: 1.226
Smali files: 270.834
Methods: 1.600.000
Time: 22m 09s

It scales somehow
slide-58
SLIDE 58 ⓒ Giacobazzi

Ransomware Classification REHA

The first 12 families contain 564 samples (84% of the dataset)

[Chart: Number of Samples per family]

APKs: 673
Smali files: 180.119
Methods: 1.300.000
Time: 3m 44s

slide-59
SLIDE 59 ⓒ Giacobazzi

AVClass

A tool for malware labelling RAID 2016

slide-60
SLIDE 60 ⓒ Giacobazzi

AVClass vs REHA

slide-61
SLIDE 61 ⓒ Giacobazzi

AVClass vs REHA

multiple labels for the same malware

slide-62
SLIDE 62 ⓒ Giacobazzi

Still mostly syntactical!!!!

slide-63
SLIDE 63 ⓒ Giacobazzi

t q1 q2

Similarity by composition NIPS 2006

Work in Progress

slide-64
SLIDE 64 ⓒ Giacobazzi
less similar, more similar

Similarity by composition NIPS 2006

Work in Progress

slide-65
SLIDE 65 ⓒ Giacobazzi
𝑢:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

𝑟1:
    mov rsi, 14h
    mov rdi, rcx
    shr eax, 8
    mov ecx, r13
    add esi, 1h
    xor ebx, ebx
    test eax, eax
    jl short loc_22F4

𝑟2:
    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

less similar more similar

Statistical Similarity of Binaries PLDI 2016

Work in Progress

slide-66
SLIDE 66 ⓒ Giacobazzi
𝑢:
    shr eax, 8
    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

𝑟1:
    mov rsi, 14h
    mov rdi, rcx
    shr eax, 8
    mov ecx, r13
    add esi, 1h
    xor ebx, ebx
    test eax, eax
    jl short loc_22F4

𝑟2:
    mov r9, 13h
    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8

Cluster the APIs with respect to their behaviour. Methods/APKs are similar if they are made out of similar pieces

Work in Progress
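
The idea that code is similar when it is made out of similar pieces can be illustrated very roughly. The sketch below is an assumption of this note and much cruder than the cited approaches: it renames x86-64 registers to a placeholder, cuts each fragment into overlapping windows of instructions, and compares the two sets of windows with a Jaccard index. The regular expression, the window size and all function names are hypothetical.

```python
import re
from typing import List, Set, Tuple

# Matches common x86-64 register names (r8..r15 variants, rax/eax/ax, rsi/edi, rbp/esp, al/bh...).
REGISTER = re.compile(
    r"\b(?:r\d+[dwb]?|[er]?[abcd]x|[er]?[sd]i|[er]?[sb]p|[abcd][lh])\b"
)


def normalize(instr: str) -> str:
    """Lower-case an instruction and replace every register by REG."""
    return REGISTER.sub("REG", instr.strip().lower())


def pieces(code: List[str], size: int = 2) -> Set[Tuple[str, ...]]:
    """All windows of `size` consecutive normalized instructions."""
    norm = [normalize(i) for i in code]
    return {tuple(norm[k:k + size]) for k in range(len(norm) - size + 1)}


def composition_similarity(a: List[str], b: List[str], size: int = 2) -> float:
    """Jaccard index of the two sets of pieces, in [0, 1]."""
    pa, pb = pieces(a, size), pieces(b, size)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


t = ["mov r13, rbx", "lea rcx, [r13+3]", "mov [r13+1], al"]
q = ["mov r12, rbx", "lea rdi, [r12+3]", "mov [r12+1], bl"]
other = ["xor ebx, ebx", "test eax, eax", "jl short loc_22F4"]
```

Fragments t and q differ only in register allocation and score as identical, while other shares no piece with t. Purely syntactic pieces like these are easy to defeat with obfuscation, which is the motivation for moving towards semantic notions of a piece.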

slide-67
SLIDE 67 ⓒ Giacobazzi

  • Instrumentation of smali code to monitor HOT API: the API feature vector could be filled by dynamic information
  • Data flow dependences can be extracted dynamically to face certain obfuscations
  • Reflection, dynamically loaded code

Dynamic analysis Work in Progress

slide-68
SLIDE 68 ⓒ Giacobazzi

and Abstract Interpretation Work in Progress

[ [P] ]α

P

Code strands Abstract code strands

slide-69
SLIDE 69 ⓒ Giacobazzi

and Abstract Interpretation Work in Progress

Code strands Abstract code strands

O(P)

[ [O(P)] ]α

loss of precision

slide-70
SLIDE 70 ⓒ Giacobazzi

and Abstract Interpretation Work in Progress

  • Insert Opaque Predicates (IOP)
  • Opaque Branch Insertion (OBI)
  • Transparent Branch Insertion (TBI)
  • Irreducibility
  • Simple Opaque Predicates (SOP)

P

O(P)

?
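
A tiny example of the question mark between P and O(P): inserting an opaque predicate keeps the behaviour but can make an abstract analysis less precise. The sketch is illustrative only and rests on assumptions: the predicate x * (x + 1) being even is a textbook opaque predicate chosen here, not one taken from the benchmarks, and the two-element fragment of a Sign domain ("+" and top "Z") with its helper functions is invented for this note.

```python
def original(x: int) -> int:
    """P: increment."""
    return x + 1


def obfuscated(x: int) -> int:
    """O(P): same behaviour, guarded by an opaque predicate."""
    # x * (x + 1) is a product of two consecutive integers, hence always even,
    # so the else branch is dead code that an analyzer may fail to rule out.
    if (x * (x + 1)) % 2 == 0:
        return x + 1
    return x - 1


POS, TOP = "+", "Z"  # a two-element fragment of a Sign domain


def sign_add_one(s: str) -> str:
    # A positive value plus one stays positive; otherwise the sign is unknown.
    return POS if s == POS else TOP


def sign_sub_one(s: str) -> str:
    # A positive value minus one may be zero, so the sign is unknown here.
    return TOP


def sign_join(a: str, b: str) -> str:
    return a if a == b else TOP


def abstract_original(s: str) -> str:
    return sign_add_one(s)


def abstract_obfuscated(s: str) -> str:
    # Sign information cannot decide the parity test, so both branches are joined.
    return sign_join(sign_add_one(s), sign_sub_one(s))
```

For a positive input the analysis of the original returns "+", while the analysis of the obfuscated version returns "Z": the concrete semantics is unchanged, the abstract one has lost precision.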

slide-71
SLIDE 71 ⓒ Giacobazzi

and Abstract Interpretation Work in Progress

Benchmark              Warnings before  Obfuscation                 Warnings after  Change %
lusearch               4                IOP                         5               25
xalan-benchmark        2                IOP                         4               100
dacapo-tomcat          2                IOP                         3               50
commons-codec          11               Irreducibility + IOP        36              227.27
commons-io-1.3.1       84               SOP                         88              4.76
commons-logging        12               IOP + SOP + Irreducibility  20              66.67
commons-logging-1.0.4  5                IOP + SOP + Irreducibility  12              140
asm3.1                 51               BuggyCode                   52              1.96

slide-72
SLIDE 72 ⓒ Giacobazzi

Open problem

Code strands Abstract code strands

O(P)

[ [O(P)] ]α

O

slide-73
SLIDE 73 ⓒ Giacobazzi

Open problem

Code strands Abstract code strands

O(P)

[ [O(P)] ]α

O

slide-74
SLIDE 74 ⓒ Giacobazzi

Open problem

Code strands Abstract code strands

O(P)

[ [O(P)] ]α

O

slide-75
SLIDE 75 ⓒ Giacobazzi

Open problem

Code strands Abstract code strands

O(P)

[ [O(P)] ]α

O

Obfuscation signature?

slide-76
SLIDE 76 ⓒ Giacobazzi

THANKS