Similarity Analysis in Verona & IMDEA
Roberto Giacobazzi Mila Dalla Preda Niccolò Marastoni
ⓒ Giacobazzi
Big Data: The 3Vs
Volume: Terabytes to Zettabytes
Velocity: Batch to Streaming Data
Variety: Structured to Structured & Unstructured
We need Automation
Big Data
Abstraction
Surveillance in Big Data
Pattern Recognition
Devices = Cameras
Abstraction = Pattern recognition
Analysis = Similarity
Automated Surveillance
Big Data vs Big Code
Big Code
Diversity Dependency Dimension
Static Dynamic Source
Transformed & Documented
Mobile Code Executable
Major Threat
My 3Ds
We need even more Automation
Devices = Compromised networks
Abstraction = Abstract Interpretation
Analysis = Code Similarity
Automated Surveillance
Similarity Analysis
On THE (im)possibility result!
CLASSES OF RECURSIVELY ENUMERABLE SETS AND THEIR DECISION PROBLEMS(1)
BY […]
In this paper we consider classes whose elements are recursively enumerable sets of non-negative integers. No discussion […] recursively enumerable sets can avoid the use of such classes, so that it seems desirable to know some of their properties. We give our attention here to the properties […] recursive enumerability and complete recursiveness (which may be intuitively interpreted as decidability). Perhaps […] interesting result (and the one which gives this paper its name) is the fact that no nontrivial class is completely recursive. We assume familiarity with a paper […] [5](2), and with ideas which are well summarized in the first sections […] [7].
[…] definitions […] functions. We shall characterize recursively enumerable (r.e.) sets of non-negative integers by the partial recursive functions […] (or, as we shall say more frequently, enumerated) by a partial recursive function […] will be taken as the range of values of the function. A function undefined for all arguments (and thus producing no values) will be considered to produce an enumeration […] the empty set o. Kleene has shown [5, pp. 50-58] that a Gödel enumeration […] recursive functions is possible, so that we may designate any partial recursive function […] as φn(x), where n is a Gödel number […] Actually, it requires […] constructions to insure that, not only does every function have at least one number, but that every non-negative integer n is the number of some function. We shall assume this to be the situation, and shall make one other minor adjustment: φ0(x) is the identity function. Kleene further showed the existence of a recursive predicate T(x, y, z) and a primitive recursive function U(x) such that […]
Presented to the Society, December 28, 1951; received by the editors of the Journal for Symbolic Logic, November 16, 1951, subsequently transferred to the Transactions, and received in revised form May 26, 1952.
(1) Most of the results in this paper were contained in a thesis written under Professor Paul Rosenbloom, to whom the author wishes to express his gratitude, and presented toward the degree of Doctor of Philosophy at Syracuse University.
(2) Numbers in brackets refer to the bibliography at the end of the paper.
We can only approximate!!!
W Code: W ∈ { P | P ≈ Q } ?
Example of static analysis (input)
{n0>=0}
n := n0;
{n0=n,n0>=0}
i := n;
{n0=i,n0=n,n0>=0}
while (i <> 0) do
    {n0=n,i>=1,n0>=i}
    j := 0;
    {n0=n,j=0,i>=1,n0>=i}
    while (j <> i) do
        {n0=n,j>=0,i>=j+1,n0>=i}
        j := j + 1
        {n0=n,j>=1,i>=j,n0>=i}
    {n0=n,i=j,i>=1,n0>=i}
    i := i - 1
    {i+1=j,n0=n,i>=0,n0>=i+1}
{n0=n,i=0,n0>=0}
Code
Example of static analysis (output)
{n0>=0}
n := n0;
{n0=n,n0>=0}
i := n;
{n0=i,n0=n,n0>=0}
while (i <> 0) do
    {n0=n,i>=1,n0>=i}
    j := 0;
    {n0=n,j=0,i>=1,n0>=i}
    while (j <> i) do
        {n0=n,j>=0,i>=j+1,n0>=i}
        j := j + 1
        {n0=n,j>=1,i>=j,n0>=i}
    {n0=n,i=j,i>=1,n0>=i}
    i := i - 1
    {i+1=j,n0=n,i>=0,n0>=i+1}
{n0=n,i=0,n0>=0}
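Each braced annotation in the analysis output is an invariant that holds whenever control reaches that program point. As a rough illustration, the sketch below transliterates the program into Python (an assumption; the slide uses a small imperative language) and turns every annotation into a runtime assertion. The counter `inner_steps` is bookkeeping added for the example and is not part of the original program. Running it only tests the invariants on the chosen inputs, whereas the static analysis establishes them for every n0 >= 0.

```python
def run_with_invariants(n0: int) -> int:
    """Run the annotated program, checking each invariant; return the inner-loop step count."""
    assert n0 >= 0
    n = n0
    assert n0 == n and n0 >= 0
    i = n
    assert n0 == i and n0 == n and n0 >= 0
    inner_steps = 0  # bookkeeping only, not in the original program
    while i != 0:
        assert n0 == n and i >= 1 and n0 >= i
        j = 0
        assert n0 == n and j == 0 and i >= 1 and n0 >= i
        while j != i:
            assert n0 == n and j >= 0 and i >= j + 1 and n0 >= i
            j = j + 1
            inner_steps += 1
            assert n0 == n and j >= 1 and i >= j and n0 >= i
        assert n0 == n and i == j and i >= 1 and n0 >= i
        i = i - 1
        assert i + 1 == j and n0 == n and i >= 0 and n0 >= i + 1
    assert n0 == n and i == 0 and n0 >= 0
    return inner_steps


# Dynamic check on a handful of inputs: no assertion fires.
for sample in range(6):
    run_with_invariants(sample)
```

A failed assertion on some input would refute an invariant, but passing runs prove nothing about the inputs that were not tried.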
Code Understanding
Code Obfuscation
Another (im)possibility result!
2001
We can only partially obfuscate!!!
W Code: W ∈ { P | P ≈ Q } ?
[Diagram labels: VBB, Q]
Can we build a theory in PL? (outside crypto)
The Concrete Model
[Plot: x(t) against t, with a Bad State region]
1 bug! We need computers to reason about computers
Partial Execution
[Plot: x(t) against t, execution stopped early; Bad State region]
Cheap, efficient, but unsound!!! Still buggy!
Testing & Dynamic analysis
[Plot: x(t) against t; Bad State region]
Efficient but unsound! Still buggy!
Abstracting the Model
α(⟦P⟧)
[Plot: x(t) against t]
Still too complicated, complex, undecidable
Abstracting the Model
α(⟦P⟧)
[Plot: x(t) against t; Bad State region]
No bug! This is NOT Abstract Interpretation!!!
Abstract Interpretation
Affordable (sound) loss of precision
Abstract Interpretation by Cousot & Cousot, ACM POPL 1977
[Plot: x(t) against t; successive iterates I, II, III, IV (fix-point)]
Soundness
Affordable (sound) loss of precision
α(⟦P⟧) ⊆ ⟦P⟧^α
[Plot: x(t) against t; Bad State region]
Guaranteed Security
Soundness
Affordable (sound) loss of precision
α(⟦P⟧) ⊆ ⟦P⟧^α
[Plot: x(t) against t; Bad State region]
True Alarm
(In)completeness
Affordable (sound) loss of precision
α(⟦P⟧) ⊆ ⟦P⟧^α
[Plot: x(t) against t; Bad State region]
False Alarms
Completeness
Domain Refinement: Giacobazzi et al. JACM 2000
α(⟦P⟧) = ⟦P⟧^α
[Plot: x(t) against t; Bad State region]
Just true bugs!
You can always refine!!!
Exploiting the (im)possibility results!
W Code: W ∈ { P | P ≈ Q } ?
[Diagram labels: Refine, Simplify, α, ! / ?]
Exploiting the (im)possibility results!
W Code: W ∈ { P | P ≈ Q } ?
[Diagram labels: De-obfuscate, Obfuscate, α, Q, ! / ?]
On the Completeness Class
Giacobazzi et al. ACM POPL 2015
C(α) ≝ { P program | α(⟦P⟧) = ⟦P⟧^α }
Obfuscation/De-obfuscation is compilation between completeness classes
[Diagram: Obfuscate and De-obfuscate arrows between Complete and Incomplete]
On the Completeness Class: Infinite
C(α) ≝ { P program | α(⟦P⟧) = ⟦P⟧^α }
A ⟦skip⟧ ; A ⟦skip⟧ ; A ⟦skip⟧ ; …… ;
On the Completeness Class: Non Extensional
C(α) ≝ { P program | α(⟦P⟧) = ⟦P⟧^α }
P complete, ⟦P⟧ = ⟦Q⟧ ⇏ Q complete
P : x := y
Q : x := y + 1; x := x − 1
⟦P⟧^Sign {y/+} = {x/+, y/+}
⟦Q⟧^Sign {y/+} = {x/Z, y/+}
[Diagram: the Sign lattice (elements include ∅, +, −, Z) related to ℘(Z) by α and γ]
As well as complexity!
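The non-extensionality example (P : x := y versus Q : x := y + 1; x := x − 1) can be replayed with a tiny sign analysis. The sketch below is an illustration under assumptions: the five-element `Sign` domain (with an explicit zero), the helper names, and the abstract arithmetic are choices made for this example and may differ from the lattice drawn on the slide.

```python
from enum import Enum


class Sign(Enum):
    BOT = "∅"   # no value
    NEG = "-"   # strictly negative integers
    ZERO = "0"  # assumed extra element; the slide's lattice may not have it
    POS = "+"   # strictly positive integers
    TOP = "Z"   # any integer


def alpha(values):
    """Abstract a finite set of integers to the least Sign covering it."""
    signs = {Sign.NEG if v < 0 else Sign.POS if v > 0 else Sign.ZERO for v in values}
    if not signs:
        return Sign.BOT
    return signs.pop() if len(signs) == 1 else Sign.TOP


def negate(a):
    """Abstract unary minus: swaps NEG and POS, leaves the rest unchanged."""
    return {Sign.NEG: Sign.POS, Sign.POS: Sign.NEG}.get(a, a)


def add(a, b):
    """Abstract addition on signs."""
    if Sign.BOT in (a, b):
        return Sign.BOT
    if a is Sign.ZERO:
        return b
    if b is Sign.ZERO:
        return a
    return a if a is b else Sign.TOP  # mixed signs or TOP give TOP


def sub(a, b):
    """Abstract subtraction, defined as a + (-b)."""
    return add(a, negate(b))


def abstract_p(env):
    """Abstract semantics of P : x := y."""
    return {**env, "x": env["y"]}


def abstract_q(env):
    """Abstract semantics of Q : x := y + 1; x := x - 1."""
    one = alpha({1})
    x = add(env["y"], one)
    x = sub(x, one)
    return {**env, "x": x}


start = {"y": Sign.POS}
print(abstract_p(start)["x"].value)  # +
print(abstract_q(start)["x"].value)  # Z
```

P and Q compute the same function, so the abstraction of their concrete result for positive y is `+` in both cases. The analysis reproduces that for P but returns `Z` for Q, which is the loss of precision the slide attributes to Q being incomplete.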
On the Completeness Class: Non Trivial
C(α) ≝ { P program | α(⟦P⟧) = ⟦P⟧^α }
C(α) = All Programs ⟺ α ∈ {λx.x, λx.⊤}
For any nontrivial abstraction α there always exists an incomplete program!
Similar to Rice’s Theorem [1952]
On the Completeness Class: Hard
C(α) ≝ { P program | α(⟦P⟧) = ⟦P⟧^α }
4.5. If α is recursive and non trivial (i.e., α ≠ id & α ≠ ⊤) then Cα and its complement are productive sets.
[Diagram: Cα inside the set of programs P ∈ Imp]
Completeness is harder to prove than termination
Automating the proof that α is complete for P is impossible
On Completeness and impossibility
[Formulas: many-one non-reducibility (≰m) statements over Cα, each with a consequence "then Cα …"]
{Q | ⟦P⟧ = ⟦Q⟧}
de-obfuscated
The Challenges
Intensional and extensional aspects of computation: From computability and complexity to program analysis and security
Shonan Village Center, Japan January 22-25, 2018
Given a set of APKs, group them with respect to their similarity
REHA: Reverse Engineering Helper for Android
Given an APK P and the classified DB, REHA returns the methods of P that are in common with the APKs in the DB
HOT API vector (risky API)
API 1 API 2 API n
.method public static lIllilIliIlIiIiIllilllilllIiIilIiilIilllIlIiliiiiI(Ljava/lang/String;)I
    const-string v0, "amfh.zrtcdiani.kvmzahyjyvq.bw"
    invoke-static {v0}, Ljava/lang/Class;->forName(Ljava/lang/String;)Ljava/lang/Class;
    move-result-object v0
    invoke-virtual {v0, p0}, Ljava/lang/Class;->getField(Ljava/lang/String;)Ljava/lang/reflect/Field;
    move-result-object v0
    invoke-virtual {v0, v0}, Ljava/lang/reflect/Field;->getInt(Ljava/lang/Object;)I
    :try_end_0
    .catch Ljava/lang/Exception; {:try_start_0 .. :try_end_0} :catch_0
    move-result v0
    return v0
    :catch_0
    move-exception v0
    new-instance v1, Ljava/lang/RuntimeException;
    const-string v2, ""
    invoke-direct {v1, v2, v0}, Ljava/lang/RuntimeException;-><init>(Ljava/lang/String;Ljava/lang/Throwable;)V
    throw v1
.end method
How? Static
Centroid: sequence #, # outgoing edges, # loop depth
Method & APK similarity
Given two methods M1 and M2:
Sim(M1, M2) = CDD(M1.centroid, M2.centroid) < C-threshold AND VDD(M1.API-Vector, M2.API-Vector) < V-threshold
CDD: Centroid Distance Degree, in [0,1]
VDD: Vector Distance Degree, in [0,1]; returns 0 when they are identical
REHA
Sim(APK1, APK2) = |{M | M in APK1 ∩ APK2}| / |APK1|
How much of APK1 is similar to APK2
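The two similarity formulas can be sketched as follows. The slides do not define CDD or VDD, so the definitions here are assumptions chosen only to satisfy the stated properties (values in [0,1], 0 for identical inputs): `cdd` is the largest normalised per-coordinate gap and `vdd` is the fraction of differing API-vector entries. The `Method` class, the threshold values, and the reading of "M in APK1 ∩ APK2" as "M in APK1 with a similar method in APK2" are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    centroid: tuple    # (sequence #, # outgoing edges, # loop depth)
    api_vector: tuple  # one entry per HOT API, same length for all methods


def cdd(c1, c2):
    """Assumed Centroid Distance Degree: largest normalised coordinate gap, in [0, 1]."""
    gaps = []
    for a, b in zip(c1, c2):
        total = abs(a) + abs(b)
        gaps.append(abs(a - b) / total if total else 0.0)
    return max(gaps)


def vdd(v1, v2):
    """Assumed Vector Distance Degree: fraction of differing entries, in [0, 1]."""
    if not v1:
        return 0.0
    return sum(a != b for a, b in zip(v1, v2)) / len(v1)


def similar(m1, m2, c_threshold=0.1, v_threshold=0.2):
    """Sim(M1, M2): both distance degrees fall below their (hypothetical) thresholds."""
    return (cdd(m1.centroid, m2.centroid) < c_threshold
            and vdd(m1.api_vector, m2.api_vector) < v_threshold)


def apk_similarity(apk1, apk2):
    """Share of APK1's methods that have a similar method in APK2."""
    if not apk1:
        return 0.0
    shared = [m for m in apk1 if any(similar(m, other) for other in apk2)]
    return len(shared) / len(apk1)


a = Method("a", (10.0, 2.0, 1.0), (1, 0, 1, 0))
b = Method("b", (10.5, 2.0, 1.0), (1, 0, 1, 0))
c = Method("c", (40.0, 9.0, 3.0), (0, 1, 0, 1))
print(apk_similarity([a, c], [b]))  # 0.5
print(apk_similarity([b], [a, c]))  # 1.0
```

The two printed values differ because the measure is normalised by |APK1|, so it is not symmetric.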
REHA Scalability
REHA works in O(n²) for a dataset containing n methods.
Method similarity measures the centroid distance in a 3-dimensional space (sequence #, # outgoing edges, # loop depth).
Search space reduction: we consider only the regions that contain centroids close to the one given as input.
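One common way to restrict the search to nearby regions is a uniform grid over the 3-dimensional centroid space. The sketch below illustrates that general idea and is not a description of how REHA implements it: the cell width `CELL`, the function names, and the choice to scan the 27 surrounding cells are all assumptions.

```python
from collections import defaultdict
from itertools import product

CELL = 1.0  # hypothetical cell width along each centroid dimension


def cell_of(centroid, cell=CELL):
    """Map a 3-dimensional centroid to the integer coordinates of its grid cell."""
    return tuple(int(c // cell) for c in centroid)


def build_index(centroids, cell=CELL):
    """Group method identifiers by the grid cell their centroid falls in."""
    index = defaultdict(list)
    for method_id, centroid in centroids.items():
        index[cell_of(centroid, cell)].append(method_id)
    return index


def candidates(index, centroid, cell=CELL):
    """Return methods whose centroid lies in the query's cell or an adjacent one."""
    cx, cy, cz = cell_of(centroid, cell)
    found = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        found.extend(index.get((cx + dx, cy + dy, cz + dz), []))
    return found


centroids = {
    "m1": (10.2, 2.0, 1.0),
    "m2": (10.9, 2.4, 1.0),
    "m3": (40.0, 9.0, 3.0),
}
index = build_index(centroids)
print(sorted(candidates(index, (10.5, 2.1, 1.0))))  # ['m1', 'm2']
```

Only the returned candidates need a full centroid and API-vector comparison, so the pairwise cost applies to a small neighbourhood rather than to all n methods.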
Genome Project
APKs: 1,226
Smali files: 270,834
Methods: 1,600,000
Time: 22m 09s
It scales
Ransomware Classification (REHA)
APKs: 673
Smali files: 180,119
Methods: 1,300,000
Time: 3m 44s
The first 12 families contain 564 samples (84% of the dataset)
[Chart: number of samples per family]
AVClass
A tool for malware labelling RAID 2016
AVClass vs REHA
multiple labels for the same malware
Still mostly syntactical!!!!
Work in Progress
Similarity by composition, NIPS 2006
[Figure labels: t, q1, q2; similar, more similar]
Work in Progress
Statistical Similarity of Binaries, PLDI 2016
[Figure: three assembly fragments; labels: 𝑟2, less similar, more similar]

    lea r14d, [r12+13h]
    mov r13, rbx
    lea rcx, [r13+3]
    mov [r13+1], al
    mov [r13+2], r12b
    mov rdi, rcx

    mov rsi, 14h
    mov rdi, rcx
    shr eax, 8
    mov ecx, r13
    add esi, 1h
    xor ebx, ebx
    test eax, eax
    jl short loc_22F4

    mov r12, rbx
    add rbp, 3
    mov rsi, rbp
    lea rdi, [r12+3]
    mov [r12+2], bl
    lea r13d, [rcx+r9]
    shr eax, 8
Work in Progress
[Figure: the same three assembly fragments, with labels 𝑟1 and 𝑟2]
Cluster the APIs with respect to their behaviour
Methods/APKs are similar if they are made out of similar pieces
Dynamic analysis (Work in Progress)
Instrumentation of smali code to monitor HOT APIs: the API feature vector could be filled with dynamic information
Data flow dependences can be extracted dynamically to face certain obfuscations
Reflection, dynamically loaded code
... and Abstract Interpretation (Work in Progress)
[Diagram: P and O(P) with their code strands and abstract code strands, ⟦P⟧^α and ⟦O(P)⟧^α; label: loss of precision]
Benchmark              Warnings before  Obfuscation                 Warnings after  Change %
lusearch               4                IOP                         5               25
xalan-benchmark        2                IOP                         4               100
dacapo-tomcat          2                IOP                         3               50
commons-codec          11               Irreducibility + IOP        36              227.27
commons-io-1.3.1       84               SOP                         88              4.76
commons-logging        12               IOP + SOP + Irreducibility  20              66.67
commons-logging-1.0.4  5                IOP + SOP + Irreducibility  12              140
asm3.1                 51               BuggyCode                   52              1.96
Open problem
[Diagram: code strands and abstract code strands of O(P), ⟦O(P)⟧^α, and the obfuscation O]
Obfuscation signature?