SLIDE 1

Program Analysis

Attackers: need to analyze our program to modify it!
Defenders: need to analyze our program to protect it!

Two kinds of analyses:

1. static analysis tools collect information about a program by studying its code;
2. dynamic analysis tools collect information from executing the program.

SLIDE 8

Static and Dynamic Analyses

control-flow graphs: representation of functions.
call graphs: representation of (possible) function calls.
debugging: what path does the program take?
tracing: which functions/system calls get executed?
profiling: what gets executed the most?
disassembly: turn raw executables into assembly code.
decompilation: turn raw assembly code into source code.

SLIDE 9

Outline

1. Static Analysis
   Control-flow analysis
2. Reconstituting source
   Disassembly

SLIDE 10

Control-flow Graphs (CFGs)

A way to represent functions.
Nodes are called basic blocks.
Each block consists of straight-line code ending (possibly) in a branch.
An edge A → B: control could flow from A to B.

SLIDE 11

    int modexp(int y, int x[], int w, int n) {
        int R, L;
        int k = 0;
        int s = 1;
        while (k < w) {
            if (x[k] == 1)
                R = (s * y) % n;
            else
                R = s;
            s = R * R % n;
            L = R;
            k++;
        }
        return L;
    }

    ( 1) k=0
    ( 2) s=1
    ( 3) if (k>=w) goto (12)
    ( 4) if (x[k]!=1) goto (7)
    ( 5) R=(s*y)%n
    ( 6) goto (8)
    ( 7) R=s
    ( 8) s=R*R%n
    ( 9) L=R
    (10) k++
    (11) goto (3)
    (12) return L

SLIDE 12

The resulting graph

    B0: (1) k=0
        (2) s=1
    B1: (3) if (k>=w) goto B6
    B2: (4) if (x[k]!=1) goto B4
    B3: (5) R=(s*y) mod n
        (6) goto B5
    B4: (7) R=s
    B5: (8) s=R*R mod n
        (9) L=R
        (10) k++
        (11) goto B1
    B6: (12) return L

Edges: B0 → B1, B1 → B2, B1 → B6, B2 → B3, B2 → B4, B3 → B5, B4 → B5, B5 → B1.

SLIDE 13

BuildCFG(F):

1. Mark every instruction which can start a basic block as a leader:
   the first instruction is a leader;
   any target of a branch is a leader;
   the instruction following a conditional branch is a leader.
2. A basic block consists of the instructions from a leader up to, but not including, the next leader.
3. Add an edge A → B if A ends with a branch to B or can fall through to B.

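The three BuildCFG steps can be sketched in C and applied to the numbered modexp listing. This is only an illustration: the `struct insn` encoding, the `modexp_code` table, and the function names are assumptions made here, not part of the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the 12-instruction modexp listing.
   Instruction numbers are 1-based as on the slide; target 0 means "none". */
enum kind { PLAIN, COND_BRANCH, GOTO, RETURN };

struct insn {
    enum kind kind;
    int target;                /* 1-based branch target, 0 if not a branch */
};

const struct insn modexp_code[] = {
    {PLAIN, 0},                /* ( 1) k=0 */
    {PLAIN, 0},                /* ( 2) s=1 */
    {COND_BRANCH, 12},         /* ( 3) if (k>=w) goto (12) */
    {COND_BRANCH, 7},          /* ( 4) if (x[k]!=1) goto (7) */
    {PLAIN, 0},                /* ( 5) R=(s*y)%n */
    {GOTO, 8},                 /* ( 6) goto (8) */
    {PLAIN, 0},                /* ( 7) R=s */
    {PLAIN, 0},                /* ( 8) s=R*R%n */
    {PLAIN, 0},                /* ( 9) L=R */
    {PLAIN, 0},                /* (10) k++ */
    {GOTO, 3},                 /* (11) goto (3) */
    {RETURN, 0},               /* (12) return L */
};
enum { MODEXP_LEN = sizeof modexp_code / sizeof modexp_code[0] };

/* Step 1: mark leaders. leader[i] describes instruction i+1.
   Returns the number of leaders, which is also the number of basic blocks. */
int mark_leaders(const struct insn *code, int n, bool *leader)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        leader[i] = false;
    leader[0] = true;                              /* first instruction */
    for (int i = 0; i < n; i++) {
        if (code[i].target != 0)
            leader[code[i].target - 1] = true;     /* target of a branch */
        if (code[i].kind == COND_BRANCH && i + 1 < n)
            leader[i + 1] = true;                  /* follows a conditional branch */
    }
    int count = 0;
    for (int i = 0; i < n; i++)
        count += leader[i];
    return count;
}

/* Step 2: block_of[i] is the block (B0, B1, ...) holding instruction i+1. */
void assign_blocks(const bool *leader, int n, int *block_of)
{
    int b = -1;
    for (int i = 0; i < n; i++) {
        if (leader[i])
            b++;
        block_of[i] = b;
    }
}

/* Step 3: is there an edge from block a to block b? */
bool has_edge(const struct insn *code, int n, const bool *leader,
              const int *block_of, int a, int b)
{
    for (int i = 0; i < n; i++) {
        bool last = (i + 1 == n) || leader[i + 1]; /* last instruction of its block */
        if (!last || block_of[i] != a)
            continue;
        if (code[i].target != 0 && block_of[code[i].target - 1] == b)
            return true;                           /* explicit branch to b */
        bool falls = code[i].kind == PLAIN || code[i].kind == COND_BRANCH;
        if (falls && i + 1 < n && block_of[i + 1] == b)
            return true;                           /* fall-through into b */
    }
    return false;
}
```

Calling `mark_leaders` and then `assign_blocks` on `modexp_code` groups the twelve instructions into blocks B0 to B6, and `has_edge` answers queries such as whether B5 branches back to B1.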
SLIDE 14

Interprocedural control flow

Interprocedural analysis also considers flow of information between functions.
Call graphs are a way to represent possible function calls.
Each node represents a function.
An edge A → B: A might call B.

SLIDE 15

Building call-graphs

    void h();
    void f() { h(); }
    void g() { f(); }
    void h() { f(); g(); }
    void k() {}

Call graph nodes: h, f, g, main, k. Edges given by the code: f → h, g → f, h → f, h → g.
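As a rough sketch, the call graph for f, g, h and k can be stored as an adjacency matrix and queried for reachability. The `calls` matrix and `mark_reachable` are names assumed here for illustration; main is left out because its body is not shown on the slide.

```c
#include <assert.h>
#include <stdbool.h>

enum fn { F, G, H, K, NFUNCS };

/* calls[a][b] is true when the body of a contains a call to b.
   Entries follow the code on the slide; K calls nothing. */
const bool calls[NFUNCS][NFUNCS] = {
    [F] = { [H] = true },
    [G] = { [F] = true },
    [H] = { [F] = true, [G] = true },
};

/* Depth-first walk over call edges: marks every function that might
   execute once `from` is called. The caller zeroes `seen` first. */
void mark_reachable(enum fn from, bool seen[NFUNCS])
{
    assert(from < NFUNCS);
    if (seen[from])
        return;                      /* already visited: cycles such as f -> h -> f stop here */
    seen[from] = true;
    for (int callee = 0; callee < NFUNCS; callee++)
        if (calls[from][callee])
            mark_reachable((enum fn)callee, seen);
}
```

Starting from f, the walk marks f, h and g but never k, since no edge leads to k.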

SLIDE 17

Reconstituting source

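[Figure: the build pipeline. p.c is compiled (cc) to p.s, assembled (as) to p.o, linked (ld) to p, and stripped (strip) to p'. The boxes for p.o and p list header, .data, .text, symbols and relocation; the box for the stripped p' lists only header, .data and .text. The figure also carries a "trans" label next to p.c.]
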
SLIDE 18

Attacking stripped binary code

[Figure: attacks on the stripped binary p' (header, .data, .text). Figure labels: dis and p'.s; dcc and p'.c; edit and p''.c; cc and p''; hex editor, which turns p' directly into p''.]

SLIDE 26

Why is disassembly hard?

Variable-length instruction sets: instructions may overlap.
Mixing data and code: data can be misclassified as instructions.
Indirect jumps: must assume that any location could be the start of an instruction!
Finding the beginning of functions: hard if all calls are indirect.
Finding the end of functions: hard if there is no dedicated return instruction.
Handwritten assembly code: won't conform to the standard calling conventions.
Code compression: the code of two functions may overlap.
Self-modifying code.

SLIDE 27

Instruction set 1

    opcode  mnemonic  operands  semantics
    0       call      addr      function call to addr
    1       calli     reg       function call to address in reg
    2       brg       offset    branch to pc+offset if flags for > are set
    3       inc       reg       reg ← reg + 1
    4       bra       offset    branch to pc+offset
    5       jmpi      reg       jump to address in reg
    6       prologue            beginning of function
    7       ret                 return from function

Instruction set for a small architecture. All operators and operands are one byte long. Instructions can be 1-3 bytes long.

SLIDE 28

Instruction set 2

    opcode  mnemonic  operands      semantics
    8       load      reg1, (reg2)  reg1 ← [reg2]
    9       loadi     reg, imm      reg ← imm
    10      cmpi      reg, imm      compare reg and imm and set flags
    11      add       reg1, reg2    reg1 ← reg1 + reg2
    12      brge      offset        branch to pc+offset if flags for ≥ are set
    13      breq      offset        branch to pc+offset if flags for = are set
    14      store     (reg1), reg2  [reg1] ← reg2

SLIDE 29

Disassembly — example

    6 0 10 9 0 43 1 0 7 0 6 9 0 1 10 0 1 2 26 9 1 30 11 1 0 8 2 1 5 2
    32 37 9 1 3 4 7 9 1 4 4 2 7 6 9 0 3 7 6 9 0 1 7 42 2 4 3 1 7 4 3 4 1

The next few slides show the results of different disassembly algorithms. Correctly disassembled regions are in pink.

SLIDE 30

    main:                        # ORIGINAL PROGRAM
       0: [6]        prologue
       1: [0,10]     call   foo
       3: [9,0,43]   loadi  r0,43
       6: [1,0]      calli  r0
       8: [7]        ret
       9: [0]        .align 2
    foo:
      10: [6]        prologue
      11: [9,0,1]    loadi  r0,1
      14: [10,0,1]   cmpi   r0,1
      17: [2,26]     brg    26
      19: [9,1,30]   loadi  r1,30
      22: [11,1,0]   add    r1,r0
      25: [8,2,1]    load   r2,(r1)
      28: [5,2]      jmpi   r2
      30: [32]       .byte  32
      31: [37]       .byte  37
      32: [9,1,3]    loadi  r1,3
      35: [4,7]      bra    7
      37: [9,1,4]    loadi  r1,4
      40: [4,2]      bra    2
      42: [7]        ret
    bar:
      43: [6]        prologue
      44: [9,0,3]    loadi  r0,3
      47: [7]        ret
    baz:
      48: [6]        prologue
      49: [9,0,1]    loadi  r0,1
      52: [7]        ret
    life:
      53: [42]       .byte  42
    fred:
      54: [2,4]      brg    4
      56: [3,1]      inc    r1
      58: [7]        ret
      59: [4,3]      bra    3
      61: [4,1]      bra    1

SLIDE 31

                                 # LINEAR SWEEP DISASSEMBLY
       0: [6]        prologue
       1: [0,10]     call   10
       3: [9,0,43]   loadi  r0,43
       6: [1,0]      calli  r0
       8: [7]        ret
       9: [0,6]      call   6
      11: [9,0,1]    loadi  r0,1
      14: [10,0,1]   cmpi   r0,1
      17: [2,26]     brg    26
      19: [9,1,30]   loadi  r1,30
      22: [11,1,0]   add    r1,r0
      25: [8,2,1]    load   r2,(r1)
      28: [5,2]      jmpi   r2
      30: [32]       ILLEGAL 32
      31: [37]       ILLEGAL 37
      32: [9,1,3]    loadi  r1,3
      35: [4,7]      bra    7
      37: [9,1,4]    loadi  r1,4
      40: [4,2]      bra    2
      42: [7]        ret
      43: [6]        prologue
      44: [9,0,3]    loadi  r0,3
      47: [7]        ret
      48: [6]        prologue
      49: [9,0,1]    loadi  r0,1
      52: [7]        ret
      53: [42]       ILLEGAL 42
      54: [2,4]      brg    4
      56: [3,1]      inc    r1
      58: [7]        ret
      59: [4,3]      bra    3
      61: [4,1]      bra    1

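A linear sweep simply decodes one instruction after another from address 0. The following C sketch runs it over the 63-byte example; `operand_count`, `program` and `linear_sweep` are names assumed here, and treating an instruction that runs past the end of the buffer as illegal is a choice of this sketch rather than something the slides specify.

```c
#include <assert.h>
#include <stddef.h>

/* Operand bytes per opcode 0..14, read off the two instruction-set tables. */
static const int operand_count[15] = {
    1, /*  0 call   addr         */
    1, /*  1 calli  reg          */
    1, /*  2 brg    offset       */
    1, /*  3 inc    reg          */
    1, /*  4 bra    offset       */
    1, /*  5 jmpi   reg          */
    0, /*  6 prologue            */
    0, /*  7 ret                 */
    2, /*  8 load   reg1,(reg2)  */
    2, /*  9 loadi  reg,imm      */
    2, /* 10 cmpi   reg,imm      */
    2, /* 11 add    reg1,reg2    */
    1, /* 12 brge   offset       */
    1, /* 13 breq   offset       */
    2, /* 14 store  (reg1),reg2  */
};

/* The example program, one byte per value. */
const unsigned char program[] = {
    6, 0,10, 9,0,43, 1,0, 7, 0, 6, 9,0,1, 10,0,1, 2,26, 9,1,30,
    11,1,0, 8,2,1, 5,2, 32, 37, 9,1,3, 4,7, 9,1,4, 4,2, 7,
    6, 9,0,3, 7, 6, 9,0,1, 7, 42, 2,4, 3,1, 7, 4,3, 4,1
};

/* Linear sweep. start[i] becomes 1 where a legal instruction starts,
   -1 where an illegal opcode byte is skipped, and 0 for operand bytes.
   Returns the number of illegal bytes. */
int linear_sweep(const unsigned char *code, size_t len, signed char *start)
{
    assert(code != NULL && start != NULL);
    for (size_t i = 0; i < len; i++)
        start[i] = 0;

    int illegal = 0;
    size_t pc = 0;
    while (pc < len) {
        unsigned op = code[pc];
        /* legal only if the opcode exists and all operands fit in the buffer */
        int fits = op <= 14 && pc + (size_t)operand_count[op] < len;
        if (fits) {
            start[pc] = 1;
            pc += 1 + (size_t)operand_count[op];
        } else {
            start[pc] = -1;
            illegal++;
            pc += 1;                 /* skip one byte and resynchronize */
        }
    }
    return illegal;
}
```

On the example the sweep reports three illegal bytes (at 30, 31 and 53), and it decodes the padding byte at 9 as a two-byte `call`, which swallows the prologue at 10.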
SLIDE 32

    f0:                          # RECURSIVE TRAVERSAL
       0: [6]        prologue
       1: [0,10]     call   10
       3: [9,0,43]   loadi  r0,43
       6: [1,0]      calli  r0
       8: [7]        ret
       9: [0]        .byte  0
    f10:
      10: [6]        prologue
      11: [9,0,1]    loadi  r0,1
      14: [10,0,1]   cmpi   r0,1
      17: [2,26]     brg    26
      19: [9,1,30]   loadi  r1,30
      22: [11,1,0]   add    r1,r0
      25: [8,2,1]    load   r2,(r1)
      28: [5,2]      jmpi   r2
      30: [32]       .byte  32
      31: [37]       .byte  37
      32: [9,1,3]    loadi  r1,3
      35: [4,7]      bra    7
      37: [9,1,4]    loadi  r1,4
      40: [4,2]      bra    2
      42: [7]        ret
      43: [6]        prologue
      44: [9,0,3]    loadi  r0,3
      47: [7]        ret
      48: [6]        .byte  6
      49: [9]        .byte  9
      50: [0]        .byte  0
      51: [1]        .byte  1
      52: [7]        .byte  7
      53: [42]       .byte  42
      54: [2]        .byte  2
      ...
      59: [4]        .byte  4
      60: [3]        .byte  3
      61: [4]        .byte  4
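The idea behind recursive traversal can be sketched in C as follows. The sketch rests on assumptions the slides do not spell out: `call` operands are absolute addresses, branch offsets are non-negative and relative to the address of the branch instruction (inferred from `bra 7` at 35 reaching the `ret` at 42), and targets of `calli` and `jmpi` are never followed. Its result therefore need not match the listing above byte for byte. The names `traverse`, `program` and `operand_count` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CALL = 0, CALLI = 1, BRG = 2, BRA = 4, JMPI = 5, RET = 7,
       BRGE = 12, BREQ = 13, MAX_OPCODE = 14 };

/* Operand bytes per opcode 0..14, read off the instruction-set tables. */
static const int operand_count[MAX_OPCODE + 1] =
    { 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 2, 2, 1, 1, 2 };

/* The example program, one byte per value. */
const unsigned char program[] = {
    6, 0,10, 9,0,43, 1,0, 7, 0, 6, 9,0,1, 10,0,1, 2,26, 9,1,30,
    11,1,0, 8,2,1, 5,2, 32, 37, 9,1,3, 4,7, 9,1,4, 4,2, 7,
    6, 9,0,3, 7, 6, 9,0,1, 7, 42, 2,4, 3,1, 7, 4,3, 4,1
};

/* Decode from `entry`, following control flow instead of addresses.
   is_code[i] becomes true for every byte of a decoded instruction; a path
   stops as soon as it reaches a byte that is already marked.
   The caller zeroes is_code before the first call. */
void traverse(const unsigned char *code, size_t len, size_t entry, bool *is_code)
{
    assert(code != NULL && is_code != NULL);
    size_t pc = entry;
    while (pc < len && !is_code[pc]) {
        unsigned op = code[pc];
        if (op > MAX_OPCODE || pc + (size_t)operand_count[op] >= len)
            return;                               /* not a decodable instruction */
        for (int k = 0; k <= operand_count[op]; k++)
            is_code[pc + (size_t)k] = true;       /* opcode and operand bytes */
        size_t next = pc + 1 + (size_t)operand_count[op];

        switch (op) {
        case CALL:                                /* follow callee, then continue */
            traverse(code, len, code[pc + 1], is_code);
            break;
        case BRG: case BRGE: case BREQ:           /* follow target and fall-through */
            traverse(code, len, pc + code[pc + 1], is_code);
            break;
        case BRA:                                 /* follow target only */
            traverse(code, len, pc + code[pc + 1], is_code);
            return;
        case JMPI: case RET:                      /* target unknown, or function ends */
            return;
        default:                                  /* calli and the rest fall through */
            break;
        }
        pc = next;
    }
}
```

Started at address 0, the sketch reaches the function at 10 through the direct `call`, leaves the padding byte at 9 and the jump-table bytes at 30 and 31 undecoded, and never reaches 48 or 53, because nothing it can follow leads there.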

SLIDE 37

Algorithm reHM

Extends the standard recursive traversal algorithm with a collection of heuristics to increase precision.
First, follow all branches; this returns a set of function start addresses and a set of decoded addresses.
Next, try to decode any remaining undecoded bytes by looking for prologue instructions that could start a function.
Next, try to build a reasonable control flow graph from the remaining undecoded bytes.
Reasonable CFG: “there are no jumps into the middle of another instruction and the resulting function contains at least two control transfer instruction.”

SLIDE 38

    f0:                          # HARRIS/MILLER
       0: [6]        prologue
       1: [0,10]     call   10
       3: [9,0,43]   loadi  r0,43
       6: [1,0]      calli  r0
       8: [7]        ret
       9: [0]        .byte  0
    f10:
      10: [6]        prologue
      11: [9,0,1]    loadi  r0,1
      14: [10,0,1]   cmpi   r0,1
      17: [2,26]     brg    26
      19: [9,1,30]   loadi  r1,30
      22: [11,1,0]   add    r1,r0
      25: [8,2,1]    load   r2,(r1)
      28: [5,2]      jmpi   r2
      30: [32]       .byte  32
      31: [37]       .byte  37
      32: [9,1,3]    loadi  r1,3
    f43:
      43: [6]        prologue
      44: [9,0,3]    loadi  r0,3
      47: [7]        ret
    f48:
      48: [6]        prologue
      49: [9,0,1]    loadi  r0,1
      52: [7]        ret
      53: [42]       .byte  42
    f54:
      54: [2,4]      brg    4
      56: [3,1]      inc    r1
      58: [7]        ret
      59: [4]        .byte  4
      60: [3]        .byte  3

SLIDE 41

Algorithm reHM

Function f43 is only called indirectly and function f48 isn't called at all, but the disassembler still finds them by searching for their prologue instructions.
The disassembler next starts at location 53, realizes that 42 isn't a valid opcode, moves to location 54, and builds a valid CFG.
The algorithm recovered 95.6% of all functions over a set of Windows and Linux programs.
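The prologue search can be sketched as a scan over the bytes that an earlier pass left undecoded. In this C sketch the coverage bitmap `is_code` is an input, and the function name `find_prologues` is an assumption made here; a real implementation would go on to decode and validate each candidate, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PROLOGUE = 6 };               /* opcode 6 marks the beginning of a function */

/* The example program, one byte per value. */
const unsigned char program[] = {
    6, 0,10, 9,0,43, 1,0, 7, 0, 6, 9,0,1, 10,0,1, 2,26, 9,1,30,
    11,1,0, 8,2,1, 5,2, 32, 37, 9,1,3, 4,7, 9,1,4, 4,2, 7,
    6, 9,0,3, 7, 6, 9,0,1, 7, 42, 2,4, 3,1, 7, 4,3, 4,1
};

/* Collect the addresses of undecoded bytes that hold the prologue opcode.
   Writes at most max_out addresses to out and returns how many were written. */
size_t find_prologues(const unsigned char *code, const bool *is_code,
                      size_t len, size_t *out, size_t max_out)
{
    assert(code != NULL && is_code != NULL && out != NULL);
    size_t found = 0;
    for (size_t addr = 0; addr < len && found < max_out; addr++)
        if (!is_code[addr] && code[addr] == PROLOGUE)
            out[found++] = addr;     /* candidate function start */
    return found;
}
```

With a hypothetical first-pass result that covers only bytes 0 to 8 and 10 to 29 (f0 and the start of f10), the scan returns 43 and 48 as candidate function starts, the two functions named above.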