EDA045F: Program Analysis
Lecture 2: Dataflow Analysis 1
Christoph Reichenbach
In the last lecture. . .
◮ Uses of Program Analysis ◮ Static vs. Dynamic Program Analysis ◮ Soundness, Precision, Termination ◮ Abstraction and Simplification for Analysis ◮ Program Execution Pipeline ◮ Intermediate Representation 2 / 75
Announcements
◮ Moodle available ◮ Homework #1 on home page after class ◮ Group formation in the break! ◮ Needed: student representative 3 / 75
Intermediate Representations
. . .
 0: iload_0
 1: ifle 9
 4: iconst_1
 5: istore_1
 6: goto 11
 9: iconst_0
10: istore_1
11: iload_1
12: ireturn
. . .
◮ Simplify analysis ◮ Fewer cases to consider ◮ Reduce risk of bugs in analyses ◮ (Simplify code generation) ◮ (Simplify code transformation)
⇒ We will need code transformation for dynamic analysis
4 / 75
A Buggy Example
Java
int[] array = new int[]{23};
Set<Integer> set = null;
print(array.length, set.size());
// create nonempty set
set = new HashSet<Integer>(...);
Analysis: Connect dereference to null pointer
5 / 75
Example: Our program in Java bytecode
 0: iconst_1
 1: newarray int
 3: dup
 4: iconst_0
 5: bipush 23
 7: iastore
 8: astore_1
 9: aconst_null
10: astore_2
11: aload_1
12: arraylength
13: aload_2
14: invokeinterface java.util.Set.size()
19: invokestatic print(int, int)

Local variables: 1: array, 2: set/null
Operand stack contents change at every step (e.g. 23; array; array, set; array, set, set.size()).
The stack is not convenient for program analysis
6 / 75
Summary
◮ Stack: cumbersome for connecting analysis facts to the program ◮ Meaning of a stack slot depends on the position in the program ◮ Local variables: helpful for connecting ◮ Meaning is associated with a variable in the original program ◮ Dealing with intermediate results? ◮ No clear solution yet for dealing with e.g.:
((a > 0) ? null : array).length
7 / 75
Simplifying Analysis with Simpler IRs
◮ Goal: ◮ Make analyses easier to build ◮ Make analyses less error-prone ◮ Start with ASTs ◮ Refine: ◮ Simpler statements ◮ ‘Dummy names’ for intermediate results ◮ Representing control flow ◮ Breaking up multiple uses of the same name 8 / 75
A Tiny Language
name ::= id | name . id
expr ::= num | expr + expr | null | print expr | new() | name
stmt ::= name = expr | { stmt⋆ } | if expr stmt else stmt | while expr stmt | skip | return expr
9 / 75
Evaluation Order
ATL
v = print((print 1) + (print 2))
ATL with explicit order
tmp1 = print 1 tmp2 = print 2 tmp3 = tmp1 + tmp2 v = print(tmp3)
Java or C or C++
// Many challenging constructions: a[i++] = b[i > 10 ? i-- : i++] + c[f(i++, --i)];
Every analysis must remember the evaluation order rules!
10 / 75
A Tiny Language: Simplified
name ::= id | id . id
val ::= name | num
expr ::= val | val + val | null | print val | new()
stmt ::= name = expr | { stmt⋆ } | if val stmt else stmt | while val stmt | skip | return val
11 / 75
Eliminating Nesting
◮ No nested expressions
⇒ Evaluation order is explicit ⇒ Fewer patterns to analyse
◮ All intermediate results have a name
⇒ Easier to ‘blame’ subexpressions for errors
◮ Names might be just pointers in the implementation ◮ We still have nested statements ◮ Not all IRs de-nest as aggressively as this 12 / 75
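The de-nesting step above can be sketched in a few lines of Python; the tuple encoding of expression trees and the tmpN naming scheme are assumptions made for this sketch, not part of any particular compiler:

```python
# Sketch: lower nested expressions into three-address statements,
# inventing a fresh temporary name for each intermediate result.
# Expression trees are tuples; atoms (names/numbers) stand for themselves.

counter = 0

def fresh():
    global counter
    counter += 1
    return f"tmp{counter}"

def lower(expr, out):
    """Emit statements into `out`; return a name or constant holding expr's value."""
    if not isinstance(expr, tuple):       # atom: variable name or number
        return expr
    op, *args = expr
    vals = [lower(a, out) for a in args]  # left-to-right: order is now explicit
    tmp = fresh()
    out.append((tmp, op, vals))
    return tmp

stmts = []
result = lower(('+', ('print', 1), ('print', 2)), stmts)
for s in stmts:
    print(s)
```

Running this on the evaluation-order example from the previous slide produces exactly the tmp1/tmp2/tmp3 sequence shown there: each intermediate result gets a name, and the evaluation order is fixed by the emission order.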
Multiple Paths
ATL
v = new()
if condition { v = null } else { print v }
v.f = 1
ATL
v = new()
while condition { v = null }
v.f = 1
Need to reason about the order of execution of statements, too
13 / 75
Control-Flow Graphs
b0: v = new(); if condition
  true  → b1: v = null
  false → b2: print v
b1, b2  → b3: v.f = 1

Construct a graph to show the flow of control through the program
14 / 75
Making Flow Explicit
name ::= id | id . id
val ::= name | num
expr ::= val | val + val | null | print val | new()
stmt ::= name = expr | skip | return val
→ ::= stmt⋆ → | end | stmt⋆ if val → else →
For intuition only: → is not a ‘real’ nonterminal
15 / 75
Control-Flow Graphs

◮ Replace statement nesting by nodes (blocks of code, e.g. b0) and edges
◮ Multiple outgoing edges: label the condition (e.g. b0: if condition, with true and false edges)
◮ Can group statements into Basic Blocks or keep them separate:
  as one basic block:  b0: v = new(); if condition
  or kept separate:    b0a: v = new()  →  b0b: if condition
◮ Uniform representation for different control statements 16 / 75
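One way to realise such a CFG in code is a pair of plain tables, one for blocks and one for labelled edges; the block names and statements below mirror the running example, but the representation itself is only an illustrative sketch:

```python
# Sketch of a CFG: blocks hold statement lists; edges carry optional
# labels ("true"/"false") to distinguish conditional successors.
# Block names b0..b3 mirror the running example from the slides.

blocks = {
    "b0": ["v = new()", "if condition"],
    "b1": ["v = null"],
    "b2": ["print v"],
    "b3": ["v.f = 1"],
}
edges = [("b0", "b1", "true"), ("b0", "b2", "false"),
         ("b1", "b3", None), ("b2", "b3", None)]

def successors(b):
    return [(dst, label) for src, dst, label in edges if src == b]

def predecessors(b):
    return [src for src, dst, _ in edges if dst == b]

print(successors("b0"))    # conditional: two labelled out-edges
print(predecessors("b3"))  # join point: two in-edges
```

Successor and predecessor queries are all a dataflow analysis needs from the CFG, which is why this minimal structure suffices for the algorithms later in the lecture.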
Use-Def Chains
b0: v = new(); if condition
  true  → b1: v = null
  false → b2: print v
b1, b2  → b3: v.f = 1

Use-Def chain: map one use to all definitions that may reach it
Def-Use chain: map one definition to all uses it may reach (not shown here)
17 / 75
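For straight-line code, use-def chains can be computed in a single forward pass by remembering the most recent definition of each variable; the (index, defined-variable, used-variables) statement encoding below is a hypothetical choice for this sketch:

```python
# Sketch: use-def chains for straight-line code (no branches).
# Each statement is (index, defined_var_or_None, [used_vars]).

prog = [
    (0, "v", []),        # v = new()
    (1, None, ["v"]),    # print v
    (2, "v", []),        # v = null
    (3, None, ["v"]),    # v.f = 1   (uses v)
]

def use_def_chains(prog):
    """Map each use (stmt index, var) to the index of its reaching definition."""
    last_def = {}
    chains = {}
    for i, d, uses in prog:
        for u in uses:
            chains[(i, u)] = last_def.get(u)  # None if no definition yet
        if d is not None:
            last_def[d] = i
    return chains

print(use_def_chains(prog))
```

With branches, several definitions may reach one use, which is exactly why the general construction needs the CFG rather than a single linear scan.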
Alternative: Static Single Assignments
Idea: a unique name for every assignment

b0: v0 = null; print v0; v1 = new(); if condition
  true  → b1: v2 = null
  false → b2: print v1
b1, b2  → b3: v3 = Φ(v1, v2); v3.f = 1
18 / 75
Static Single Assignments Simplifies Def-Use/Use-Def Chains
without SSA:
b0: v = 0    b1: v = 1    b2: v = 2
all flow into b3: if ...
b3 → b4: print v,  b5: w = v,  b6: x = v + v

with SSA:
b0: v0 = 0    b1: v1 = 1    b2: v2 = 2
all flow into b3: v3 = Φ(v0, v1, v2); if ...
b3 → b4: print v3,  b5: w = v3,  b6: x = v3 + v3
19 / 75
Static Single Assignment Form
◮ From a static perspective: ◮ Each variable is set exactly once in the program ◮ Each name stands for exactly one computation ◮ Can connect definitions and uses without complex graphs ◮ Φ (Phi) functions merge values at control-flow merge points ◮ Minimal SSA eliminates unnecessary Φ functions ◮ Simpler Def-Use / Use-Def chains ◮ Similar representations: ◮ Continuation-Passing Style IR (CPS) ◮ A-Normal Form (ANF) 20 / 75
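For straight-line code, where no Φ functions are needed, SSA renaming reduces to keeping a version counter per variable; the (target, used-variables) statement encoding here is a simplifying assumption for this sketch:

```python
# Sketch: convert straight-line assignments to SSA by renaming.
# Statements are (target, [used_vars]); targets get version suffixes
# v0, v1, ... so that each name is defined exactly once.

def to_ssa(prog):
    version = {}
    out = []
    for target, uses in prog:
        renamed_uses = [f"{u}{version[u]}" for u in uses]  # read current version
        version[target] = version.get(target, -1) + 1      # new version on write
        out.append((f"{target}{version[target]}", renamed_uses))
    return out

# v = ...; w = v; v = v; x = v + v  (second v is a redefinition)
prog = [("v", []), ("w", ["v"]), ("v", ["v"]), ("x", ["v", "v"])]
print(to_ssa(prog))
```

After renaming, every use points at exactly one definition by name alone; the full construction additionally inserts Φ functions at control-flow merge points, which this branch-free sketch avoids.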
Summary
◮ Different Intermediate Representations (IRs) to pick from ◮ Usually eliminate nested expressions ◮ Make evaluation order explicit ◮ Control-Flow Graph (CFG): ◮ Represent control flow as Blocks and Control-Flow Edges ◮ Edges represent control flow, labelled to identify conditionals ◮ Blocks can be single statements or Basic Blocks ◮ Basic blocks are sequences of statements without branches ◮ IRs try to expose and link: ◮ Definitions of (= writes to) a variable ◮ Uses of (= reads from) a variable ◮ Use-Def Chain: links uses to all reaching definitions ◮ Def-Use Chain: links definitions to all reachable uses ◮ Static Single Assignment (SSA) form: ◮ Each variable has exactly one definition ◮ Use Φ (Phi) expressions to merge variables across control-flow edges 21 / 75
Basic Formal Notation
◮ Tuples: ◮ Notation: ⟨a⟩, ⟨a, b⟩ (pair), ⟨a, c, d⟩ (triple) ◮ Fixed-length (unlike lists) ◮ Group items, analogous to a (read-only) record/object
◮ Sets:
∅ = {} (the empty set)
{1} (singleton set containing precisely the number 1)
{2, 3} (set with two elements)
Z (the (infinite) set of integers)
R (the (infinite) set of real numbers)
22 / 75
Basic operations on sets
x ∈ S   Is x contained in S?   True: 1 ∈ {1} and 1 ∈ Z   False: 2 ∈ {1} and π ∈ Z
x ∉ S   Is x NOT contained in S?
A ∪ B   Set union   {1} ∪ {2} = {1, 2}   {1, 3} ∪ {2, 3} = {1, 2, 3}
A ∩ B   Set intersection   {1} ∩ {2} = ∅   {1, 3} ∩ {2, 3} = {3}
A ⊆ B   Subset relation   True: ∅ ⊆ {1} and Z ⊆ R   False: {2} ⊆ {1}
A × B   Product set   {1, 2} × {3, 4} = {⟨1, 3⟩, ⟨1, 4⟩, ⟨2, 3⟩, ⟨2, 4⟩}
23 / 75
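These operations map directly onto Python's built-in sets, which is convenient for prototyping analyses; the product set is available via itertools.product:

```python
# The set operations from the slide, using Python's built-in set type.
from itertools import product

A, B = {1, 3}, {2, 3}
print(A | B)                         # union: {1, 2, 3}
print(A & B)                         # intersection: {3}
print(set() <= {1})                  # subset: ∅ ⊆ {1} is True
print({2} <= {1})                    # {2} ⊆ {1} is False
print(set(product({1, 2}, {3, 4})))  # product set, as a set of pairs
```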
Graphs
A (directed) graph G is a tuple G = ⟨N, E⟩, where:
◮ N is the set of nodes of G ◮ E ⊆ N × N is the set of edges of G ◮ Often: add a function f : E → X to label edges
N = {n0, n1, n2, n3, n4}
E = {⟨n0, n1⟩, ⟨n0, n2⟩, ⟨n1, n3⟩, ⟨n2, n0⟩}
24 / 75
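The graph from this slide can be written down directly as sets of nodes and tuple-valued edges, with the optional labelling function as a dictionary; this is a sketch, not a recommended graph library:

```python
# Sketch: the graph G = ⟨N, E⟩ from the slide as plain Python sets,
# with an optional labelling function f : E → X as a dict.

N = {"n0", "n1", "n2", "n3", "n4"}
E = {("n0", "n1"), ("n0", "n2"), ("n1", "n3"), ("n2", "n0")}
f = {("n0", "n1"): "true", ("n0", "n2"): "false"}  # partial edge labelling

# The defining property E ⊆ N × N holds by construction:
assert E <= {(a, b) for a in N for b in N}

print(sorted(dst for src, dst in E if src == "n0"))  # successors of n0
```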
Summary
◮ Tuples group a fixed number of items ◮ Sets represent a (possibly infinite) number of unique elements ◮ Widely used in program analysis ◮ (Directed) Graphs represent nodes and edges between them ◮ Optional labels on edges possible ◮ Used e.g. for control-flow graphs 25 / 75
Dataflow Analysis: Example
ATL
x = new()
print x     // A
if z {
  x.f = 2   // B
  x = null
} else skip
x.f = 1     // C
◮ Analyse: Will there be an error at B or C? ◮ Must distinguish between x at A vs. x at B and C ◮ Need to model flow of information Suitable IRs: ◮ Control-Flow Graph (CFG) ◮ Static Single-Assignment Form (SSA)
Need analysis that can represent data flow through program
26 / 75
Control Flow
Understanding data flow requires understanding control flow:

x = new()
print x
if z {
  x.f = 2
  x = null
}
x.f = 1

Control flow vs. data flow (here as Def-Use chains)
27 / 75
Basic Ideas of Data Flow Analysis
x = new()   x ← object    x unknown → x nonnull
print x     (no change)   x nonnull
if z        (no change)   x nonnull
x.f = 2     (no change)   x nonnull
x = null    x ← null      x null
x.f = 1     (no change)   x either (after merging the two branches)
28 / 75
Another Analysis
ATL
z = ...
x = 1
y = 2
if z > ... {
  y = z
  if z < ... {
    z = 7
  }
}
print y
◮ Which assignments are unnecessary?
⇒ Possible oversights / bugs
(Live Variables Analysis)
29 / 75
Control Flow
z = ...      live-in: ∅       live-out: {z}
x = 1        live-in: {z}     live-out: {z}     (useless: x never read)
y = 2        live-in: {z}     live-out: {y, z}
if z > ...   live-in: {y, z}  live-out: {y, z}
y = z        live-in: {z}     live-out: {y, z}  (overwrite y ⇒ don’t need old y)
if z < ...   live-in: {y, z}  live-out: {y}
z = 7        live-in: {y}     live-out: {y}     (useless: z never read afterwards)
print y      live-in: {y}     live-out: ∅

Analysis effective: found useless assignments to z and x
30 / 75
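The core of this backward pass can be sketched for a branch-free fragment as follows; handling the if from the example would additionally require taking the union of the live sets of both successors, and the small program and its statement encoding below are illustrative assumptions:

```python
# Sketch: backward live-variables pass over a linear statement list.
# An assignment whose target is not live immediately afterwards is useless.
# Statements are (target_or_None, [used_vars]).

prog = [
    ("x", []),        # x = 1      (x is never read: useless)
    ("y", []),        # y = 2
    ("z", ["y"]),     # z = y      (z is never read: useless)
    (None, ["y"]),    # print y
]

def dead_assignments(prog):
    live, dead = set(), []
    for i in reversed(range(len(prog))):
        target, uses = prog[i]
        if target is not None:
            if target not in live:
                dead.append(i)     # value overwritten or never read
            live.discard(target)   # a definition kills liveness
        live |= set(uses)          # uses make variables live
    return sorted(dead)

print(dead_assignments(prog))  # indices of the useless assignments
```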
Observations
1. Data flow analysis can be run forward or backward
2. May have to join results from multiple sources
31 / 75
What about Loops? (1/2)
x = null      x unknown → x null
if (...)      x either (after merging the back edge)
  x = new()   x nonnull
print x       x either

◮ Analysis: Null Pointer Dereference ◮ Stop when we’re not learning anything new any more ◮ Works fine 32 / 75
What about Loops? (2/2)
x = 1         x = ?  →  x = 1
if (...)      x ∈ {1, 2, 3, . . .} (keeps growing each iteration)
  x = x + 1   x ∈ {2, 3, . . .}
print x       x ∈ {1, 2, 3, . . .}

◮ Analysis: Reaching Definitions

We need to bound repetitions!
33 / 75
Summary: Data-Flow Analysis (Introduction)
◮ Some important program analyses are flow sensitive: must consider how execution order affects variables ◮ Data flow depends on control flow ◮ Data flow analysis examines how variables change across control-flow edges ◮ May have to join multiple results ◮ Can run forward or backward wrt program control flow ◮ Handling loops is nontrivial 34 / 75
Engineering Data Flow Algorithms
1 Termination
◮ Assumption: Operate on Control Flow Graph ◮ Theory: Ensure termination
2 (Correctness)
35 / 75
Data Flow Analysis on CFGs
◮ in_b: knowledge at the entrance of basic block b
◮ out_b: knowledge at the exit of basic block b
◮ merge_b: merges all out_bi for all basic blocks bi that flow into b
◮ trans_b: computes out_b from in_b

[Figure: merge_b combines the predecessors’ out-values into in_b; trans_b maps in_b to out_b]
36 / 75
Characterising Data Flow Analyses Characteristics:
◮ Forward or backward analysis ◮ L: Abstract Domain (the ‘analysis domain’) ◮ transb : L → L ◮ mergeb : L × L → L
Require properties of L, transb, mergeb to ensure termination
37 / 75
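Given these ingredients, a forward analysis can be solved by iterating merge_b and trans_b until nothing changes; the following round-robin solver and the small null-pointer domain it is instantiated with are an illustrative sketch, not the lecture's reference implementation:

```python
# Sketch: round-robin forward dataflow solver, parameterised by the
# abstract domain's merge and per-block transfer functions. 'top' plays
# the role of 'no information yet', so merge(top, A) = A.

def solve(blocks, edges, top, trans, merge):
    preds = {b: [s for s, d in edges if d == b] for b in blocks}
    out = {b: top for b in blocks}
    changed = True
    while changed:                       # terminates if L satisfies the DCC
        changed = False
        for b in blocks:
            inb = top
            for p in preds[b]:
                inb = merge(inb, out[p])
            new = trans(b, inb)
            if new != out[b]:
                out[b], changed = new, True
    return out

# Null-pointer example on the b0..b3 CFG from the slides:
blocks = ["b0", "b1", "b2", "b3"]
edges = [("b0", "b1"), ("b0", "b2"), ("b1", "b3"), ("b2", "b3")]

def merge(a, b):
    if a == "top": return b
    if b == "top": return a
    return a if a == b else "either"

def trans(b, x):                         # b0: v = new(); b1: v = null
    return {"b0": "nonnull", "b1": "null"}.get(b, x)

out = solve(blocks, edges, "top", trans, merge)
print(out["b3"])   # "either": v may be null at v.f = 1
```

The solver reaches the fixpoint after one extra confirmation pass here; the termination argument is exactly the descending-chain reasoning developed on the following slides.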
Limiting Iteration
init while (...) x := x + 1; P0 P1 P2
b0 b1
◮ Does the following ever stop changing:
inb0 = mergeb0(P0, P2)
◮ Intuition: we keep generalising information ◮ Growth limit: bound amount of generalisation ◮ Make sure mergeb, transb never throw information away
Eventually, either nothing changes or we hit growth limit
38 / 75
Ordering Knowledge
A ⊑ B
◮ A describes at least as much knowledge as B ◮ Either: ◮ A = B (i.e., A ⊑ B ⊑ A), or ◮ A has strictly more knowledge than B 39 / 75
Intuition: Knowing Less, Knowing More

Structure of L: [Figure: a lattice with elements A and B above their merge A&B, descending towards ⊥; merge_b moves downwards, trans_b maps elements within the lattice; lower means more knowledge]

◮ merge_b must not lose knowledge: ◮ merge_b(A, B) ⊑ A ◮ merge_b(A, B) ⊑ B
◮ trans_b must be monotonic over amount of knowledge:
x ⊑ y ⟹ trans_b(x) ⊑ trans_b(y)
◮ Introduce bound: ⊥ means ‘too much information’ 40 / 75
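When L is finite, both requirements can be verified exhaustively; the four-element null-pointer domain below is an assumed example, with ⊑ encoded as an explicit set of pairs:

```python
# Sketch: brute-force check that merge_b is a lower bound of its
# arguments and that transfer functions are monotonic, over a small
# finite domain. The domain and order below are illustrative assumptions.

D = {"top", "null", "nonnull", "either"}

# (a, b) in leq means a ⊑ b: 'either' sits below the specific facts,
# 'top' ('nothing known yet') sits above everything.
leq = {(x, x) for x in D} | {
    ("either", "null"), ("either", "nonnull"), ("either", "top"),
    ("null", "top"), ("nonnull", "top"),
}

def merge(a, b):
    if a == "top": return b
    if b == "top": return a
    return a if a == b else "either"

transfers = {
    "v = new()": lambda x: "nonnull",  # ignores its input
    "print v":   lambda x: x,          # no change
}

# merge must be a lower bound of both arguments
assert all((merge(a, b), a) in leq and (merge(a, b), b) in leq
           for a in D for b in D)
# every transfer function must be monotonic: x ⊑ y ⟹ trans(x) ⊑ trans(y)
for t in transfers.values():
    assert all((t(x), t(y)) in leq for x, y in leq)
print("properties hold")
```

Exhaustive checking like this only works because the domain is finite, which is also the easiest way to satisfy the Descending Chain Condition below.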
Aggregating Knowledge
[Figure: blocks b0 and b1 both flow into a successor; their out-values A and B are merged]

P1 = merge_b0(A, B)
P2 = trans_b0(merge_b0(A, B))

◮ Interplay between trans_b and merge_b helps preserve knowledge
◮ merge_b(A, B) ⊑ A: as we add knowledge, P1 either ◮ stays equal ◮ ‘descends’
◮ Monotonicity of trans_b: if P1 descends, then P2 either ◮ stays equal ◮ ‘descends’
⇒ At each node, we either stay equal or descend

Now we must only set a growth limit. . .
41 / 75
Descending Chains
a0 ⊒ a1 ⊒ a2 ⊒ a3 ⊒ . . . ⊒ ak = ak+1 = . . .

◮ A (possibly infinite) sequence a0, a1, a2, . . . is a descending chain iff: ai+1 ⊑ ai (for all i ≥ 0)
◮ Descending Chain Condition (DCC): for every descending chain a0, a1, a2, . . . in the abstract domain L, there exists k ≥ 0 such that ak = ak+n for all n ≥ 0

The DCC is the formalisation of the growth limit
42 / 75
Top and Bottom
◮ Convention: we introduce two distinguished elements: ◮ Top ⊤: A ⊑ ⊤ for all A ◮ Bottom ⊥: ⊥ ⊑ A for all A
◮ Since merge_b(A, B) ⊑ A and merge_b(A, B) ⊑ B: ◮ merge_b(⊥, A) = ⊥ = merge_b(A, ⊥) ◮ merge_b(⊤, A) ⊑ A ⊒ merge_b(A, ⊤)
◮ In practice, it’s safe and simple to set: merge_b(⊤, A) = A = merge_b(A, ⊤)
◮ Intuition: ◮ ⊤ means ‘no information known yet’ ◮ ⊥ means ‘contradictory / too much information’ 43 / 75
Summary
◮ Designing a forward or backward analysis: ◮ Pick Abstract Domain L ◮ Must be partially ordered with (⊑) ⊆ L × L: A ⊑ B iff A ‘knows’ at least as much as B ◮ Unique top element ⊤ ◮ Unique bottom element ⊥ ◮ trans_b : L → L ◮ Must be monotonic: x ⊑ y ⟹ trans_b(x) ⊑ trans_b(y) ◮ merge_b : L × L → L must produce a lower bound for its parameters: ◮ merge_b(A, B) ⊑ A ◮ merge_b(A, B) ⊑ B ◮ Satisfy the Descending Chain Condition to ensure termination ◮ Easiest solution: make L finite 44 / 75
Abstract Domains Revisited
[Figure: the sign domain — ⊤ at the top; A−, A0, A+ below ⊤; A? below them; ⊥ at the bottom; α maps . . . , −3, −2, −1 to A−, 0 to A0, and 1, 2, 3, . . . to A+]

⊖ is compatible with neg means: for all i ∈ Z: ⊖(α(i)) ⊑ α(neg(i))
⊖ ⊤ = ⊤   ⊖ A0 = A0   ⊖ A+ = A−   ⊖ A− = A+   ⊖ A? = A?
⊖ is monotonic (and ⊕ extended with ⊤ is, too)
45 / 75
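The compatibility condition can be checked by brute force on a sample of integers; the string encoding of the domain and of ⊖ below is an assumption made for this sketch:

```python
# Sketch: checking the compatibility condition ⊖(α(i)) ⊑ α(neg(i))
# from the slide on a sample of integers. Domain elements are encoded
# as strings; "TOP" stands for ⊤.

def alpha(i):                     # abstraction function α : Z → L_A
    if i < 0:
        return "A-"
    if i == 0:
        return "A0"
    return "A+"

neg_abs = {"TOP": "TOP", "A0": "A0", "A+": "A-", "A-": "A+", "A?": "A?"}

def leq(a, b):                    # a ⊑ b in the lecture's order:
    return a == b or b == "TOP" or a == "A?"   # A? below the specific signs

assert all(leq(neg_abs[alpha(i)], alpha(-i)) for i in range(-50, 51))
print("abstract negation is compatible with neg on the sampled range")
```

Sampling cannot prove the condition for all of Z, of course; here it merely illustrates what the condition demands of each concrete integer.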
Summary
◮ We could extend {A+, A−, A0, A?} to an Abstract Domain by adding ⊤: L_A = {A+, A−, A0, A?, ⊤} ◮ L_A is finite, so the DCC holds trivially ◮ All our abstract operations are monotonic ◮ Making the abstraction function α : Z → L_A explicit allows us to check that our abstract operations are compatible: ⊖(α(i)) ⊑ α(neg(i)) (cf. ‘induced operation’ in Abstract Interpretation)
46 / 75
Soot IRs
◮ Exercise #1 uses Soot, which offers four IRs: ◮ Jimple: Soot’s main CFG-based IR ◮ Shimple: Jimple converted to SSA form ◮ Grimp: Jimple with nested expressions (intended for decompiling/pretty-printing) ◮ Baf: enhanced Java bytecode (intended for bytecode generation)
47 / 75
Example Program with Bug
Java
int[] array = new int[]{23};
Set<Integer> set = null;
print(array.length, set.size());
// create nonempty set
set = new HashSet<Integer>(...);
Soot’s Jimple IR
l0 := @this
$r0 = newarray (int)[1]
$r0[0] = 23
l2 = null
$i0 = lengthof $r0
$i1 = interfaceinvoke l2.<java.util.Set: int size()>()
staticinvoke <T2: void print(int,int)>($i0, $i1)
48 / 75
Order of Side Effects
Java
int[] one = new int[1];
int[] two = new int[2];
int counter = 0;
one[counter++] = two[counter++]++;
return one;

Jimple

one = newarray (int)[1]
two = newarray (int)[2]
counter = 0 + 1
$i0 = counter
$i1 = two[$i0]
$i2 = $i1 + 1
two[$i0] = $i2
one[0] = $i1
return one
49 / 75
Jimple IR
Block ::= Stmt⋆ Trap⋆
Stmt ::= nop | vr := v | vr = vr | Invoke | goto i | if v goto i | return v | return-void | entermonitor v | exitmonitor v | tableswitch . . . | lookupswitch . . . | breakpoint | ret | throw v
Trap ::= catch ty from i0 to i1 with ih
v ::= var | vc | vr | ve
vc ::= int | long | float | double | string | null | method | ty
ve ::= Invoke | new ty | newarray ty[int] | newmultiarray ty([int])⋆ | v+v | v-v . . .
vr ::= vr[vr] | @this | @parameter i | @caughtexception | vr.id | ty.id
Invoke ::= . . .
50 / 75
Homework
1. Find all main methods
2. Find all calls to deprecated methods
3. Simplified array out-of-bounds checking: find uses of negative array indices
4. Live Variables Analysis: find useless assignments
5. Make your analysis reusable
51 / 75
To be continued. . .
Next week:
◮ Lattice theory ◮ Understanding our precision ◮ Procedure calls 52 / 75