59th CREST Open Workshop, Centre for Research on Evolution, Search and Testing, University College London, London, United Kingdom
Carol V. Alexandru, Sebastiano Panichella, Sebastian Proksch, Harald C. Gall
Software Evolution and Architecture Lab


SLIDE 1

Carol V. Alexandru, Sebastiano Panichella, Sebastian Proksch, Harald C. Gall
Software Evolution and Architecture Lab, University of Zurich, Switzerland
{alexandru,panichella,proksch,gall}@ifi.uzh.ch

26.03.2018, 59th CREST Open Workshop, Centre for Research on Evolution, Search and Testing, University College London, London, United Kingdom

SLIDES 2-4

The Problem Domain

• Static analysis (e.g. #Attr., McCabe, coupling, ...)
• Many revisions, fine-grained historical data

(release timeline: v0.7.0, v1.0.0, v1.3.0, v2.0.0, v3.0.0, v3.3.0, v3.5.0)
SLIDES 5-9

A Typical Analysis Process

select project → clone (www) → select revision → checkout → apply purpose-built, language-specific tool → store analysis results (Res) → more revisions? → more projects?
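The loop above can be sketched in Python (a hedged sketch: `checkout` and `analyze` are hypothetical callbacks, not any real tool's API). The point is that every revision of every project pays for a full checkout before the tool can even run.

```python
# Naive per-revision analysis loop, as in the slide's pipeline diagram.
# 'checkout' and 'analyze' are hypothetical stand-ins for real tooling.
def analyze_projects(projects, checkout, analyze):
    results = {}
    for project, revisions in projects.items():   # more projects?
        for rev in revisions:                     # more revisions?
            working_dir = checkout(project, rev)  # full checkout to disk
            results[(project, rev)] = analyze(working_dir)
    return results

# Count how often the checkout cost is paid:
checkouts = []
res = analyze_projects(
    {"demo": ["v1", "v2", "v3"]},
    checkout=lambda p, r: checkouts.append((p, r)) or f"/tmp/{p}-{r}",
    analyze=lambda wd: len(wd),
)
print(len(checkouts))  # 3 checkouts for 3 revisions
```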

SLIDES 10-16

Redundancies all over...

Redundancies in historical code analysis:
• Across revisions: few files change; only small parts of a file change; changes may not even affect results.
• Across languages: each language has its own toolchain, yet they share many metrics.

Impact on code study tools:
• Repeated analysis of "known" code
• Storing redundant results
• Re-implementing identical analyses
• Generalizability is expensive

SLIDE 17

#1: Avoid Checkouts

SLIDES 18-21

Avoid checkouts

Baseline: clone → checkout (read + write) → analyze (read).
• For every file: 2 read ops + 1 write op
• Checkout includes irrelevant files
• Need 1 working directory for every revision to be analyzed in parallel

slide-22
SLIDE 22

Avoid checkouts

8

clone analyze

read

slide-23
SLIDE 23

Avoid checkouts

8

clone analyze Only read relevant files in a single read op No write ops No overhead for parallization

read

slide-24
SLIDE 24

Avoid checkouts

8

clone analyze Only read relevant files in a single read op No write ops No overhead for parallization

Git Analysis Tool File Abstraction Layer

read

slide-25
SLIDE 25

Avoid checkouts

8

clone analyze Only read relevant files in a single read op No write ops No overhead for parallization

Git Analysis Tool File Abstraction Layer

E.g. for the JDK Compiler:

class JavaSourceFromCharrArray(name: String, val code: CharBuffer) extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {

  • verride def getCharContent(): CharSequence = code

} read

slide-26
SLIDE 26

Avoid checkouts

clone analyze Only read relevant files in a single read op No write ops No overhead for parallization

Git Analysis Tool File Abstraction Layer

E.g. for the JDK Compiler:

class JavaSourceFromCharrArray(name: String, val code: CharBuffer) extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {

  • verride def getCharContent(): CharSequence = code

} read 9
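A Python analogue of the same file-abstraction idea (the `BlobStore` class below is a hypothetical stand-in for a Git object store, not LISA's API): the analyzer receives source text straight from memory, the way a Git library would supply blob contents, so no checkout ever touches disk.

```python
import ast

# Hypothetical in-memory blob store: (revision, path) -> source text.
class BlobStore:
    def __init__(self):
        self._blobs = {}

    def put(self, revision, path, text):
        self._blobs[(revision, path)] = text

    def read(self, revision, path):
        # single read op, no write op, no working directory needed
        return self._blobs[(revision, path)]

def count_defs(source):
    """Analyze source text directly, without any checkout."""
    tree = ast.parse(source)
    return sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))

store = BlobStore()
store.put("v1", "demo.py", "def run():\n    pass\n")
print(count_defs(store.read("v1", "demo.py")))  # 1 function definition
```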

SLIDE 27

#2: Use a multi-revision representation of your sources
SLIDES 28-37

Merge ASTs

• The ASTs of rev. 1, rev. 2, rev. 3 and rev. 4 are merged into a single graph, one revision at a time.
• Each node carries the revision range in which it exists, e.g. rev. range [1-4], rev. range [1-2].
• AspectJ (~440k LOC): 1 commit: 2.2M nodes; all >7000 commits: 6.5M nodes.
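A minimal sketch of such a merged representation, under an assumed, simplified data model (the `Node` class below is illustrative, not LISA's actual graph type): identical nodes are stored once and annotated with the revision range in which they exist.

```python
# Illustrative merged multi-revision AST node (not LISA's real data model).
class Node:
    def __init__(self, label, first_rev, last_rev):
        self.label = label
        self.range = (first_rev, last_rev)  # e.g. (1, 4) = revs 1 through 4
        self.children = []

    def exists_in(self, rev):
        return self.range[0] <= rev <= self.range[1]

# A class alive in revs 1-4, with one method deleted after rev. 2:
root = Node("TypeDeclaration", 1, 4)
kept = Node("Method", 1, 4)
dropped = Node("Method", 1, 2)
root.children += [kept, dropped]

def nodes_at(node, rev):
    """Project the single-revision AST size out of the merged graph."""
    if not node.exists_in(rev):
        return 0
    return 1 + sum(nodes_at(c, rev) for c in node.children)

print(nodes_at(root, 1))  # 3 nodes in rev. 1
print(nodes_at(root, 4))  # 2 nodes in rev. 4 (the dropped method is gone)
```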

SLIDE 38

#3: Store AST nodes only if they're needed for analysis

SLIDES 39-44

What's the complexity (1+#forks) and name for each method and class?

public class Demo {
    public void run() {
        for (int i = 1; i < 100; i++) {
            if (i % 3 == 0 || i % 5 == 0) {
                System.out.println(i);
            }
        }
    }
}

• parse: 140 AST nodes (using ANTLR), e.g. CompilationUnit, TypeDeclaration, Modifiers public, Members, Method, Name run, Name Demo, Parameters, ReturnType PrimitiveType VOID, Body, Statements, ...
• filtered parse: 7 AST nodes (using ANTLR): TypeDeclaration, Name Demo, Method, Name run, ForStatement, IfStatement, ConditionalExpression
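The same filtered computation can be sketched with Python's own `ast` module (LISA itself works on ANTLR parse trees; this is only an analogue): the name and the complexity (1 + #forks) fall out of visiting just a handful of node kinds and ignoring everything else.

```python
import ast

# Fork-introducing node kinds -- the only ones the metric needs to see.
FORKS = (ast.For, ast.While, ast.If, ast.BoolOp)

def complexities(source):
    """name -> 1 + #forks, for every class and function in the source."""
    tree = ast.parse(source)
    out = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
            forks = sum(isinstance(n, FORKS) for n in ast.walk(node))
            out[node.name] = 1 + forks
    return out

code = (
    "class Demo:\n"
    "    def run(self):\n"
    "        for i in range(1, 100):\n"
    "            if i % 3 == 0 or i % 5 == 0:\n"
    "                print(i)\n"
)
print(complexities(code))  # {'Demo': 4, 'run': 4}: for + if + or = 3 forks
```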

SLIDE 45

#4: Use non-duplicative data structures to store your results

SLIDES 46-49

Instead of one result row per revision (rev. 1, rev. 2, rev. 3, rev. 4), each revision range stores a single row of label / #attr / mcc values, e.g. ranges [1-1], [2-3] and [4-4] for an InnerClass (values 4, 1, 2, 4).
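A sketch of such a non-duplicative store, under the assumption that a range is simply kept open while the metric value is unchanged (illustrative code, not LISA's storage format):

```python
# Compress per-revision metric values into per-range values: a range is
# only closed when the value actually changes (assumed design sketch).
def compress(values_per_revision):
    """[(rev, value), ...] -> [((first, last), value), ...]"""
    ranges = []
    for rev, value in values_per_revision:
        if ranges and ranges[-1][1] == value:
            (first, _), v = ranges[-1]
            ranges[-1] = ((first, rev), v)   # extend the open range
        else:
            ranges.append(((rev, rev), value))
    return ranges

# mcc of a method over four revisions; only revisions 2 and 4 change it:
history = [(1, 1), (2, 4), (3, 4), (4, 2)]
print(compress(history))  # [((1, 1), 1), ((2, 3), 4), ((4, 4), 2)]
```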

SLIDE 50

LISA also does:
• #5: Parallel parsing
• #6: Asynchronous graph computation
• #7: Generic graph computations applying to ASTs from compatible languages

SLIDE 51

A light-weight view on multi-language analysis

SLIDE 52

Typical solutions

• Toolchains / frameworks
  • Integrate language-specific tooling
  • Lots of engineering required
• Meta-models
  • Translate language code to some common representation
  • Significant overhead / rigid models

SLIDE 53

Structure matters most

• Complexity?
• # of functions / attributes etc.
• Coupling between classes
• Call graphs

if (true) { if (true) { } }     # CYCLO: 3
if (true) { } if (true) { }     # CYCLO: 4

SLIDES 54-56

Relative structure is similar

public class Demo {
    public void run() {
        for (int i = 1; i < 100; i++) {
            if (i % 3 == 0 || i % 5 == 0) {
                System.out.println(i);
            }
        }
    }
}

class Demo:
    def run():
        for i in range(1, 100):
            if i % 3 == 0 or i % 5 == 0:
                print(i)

Shared structure in both versions: class → function → for → if → print

SLIDE 57

Entities are different (same Java/Python example):

generic     Java grammar    Python grammar
class       CLASS_DECL      classdef
function    METHOD_DECL     funcdef
for         FOR_STAT        forstat
if          IF_STAT         ifelsestat
print       METHOD_INVOK    atomexpr

SLIDES 58-59

Can filter irrelevant nodes: in either language, only the nodes mapped to class, function, for, if and print are kept.

SLIDE 60

How LISA handles multiple languages

SLIDES 61-66

Analysis formulation

• Signal/Collect (like "Google Pregel" for Scala)
• Graph vertices send information packets (signals) and do something when receiving (collecting) signals.
• Example: computing a func-count on a class vertex. Two function vertices each signal 1 and the class vertex collects them (fc = fc+1+1 = 2); a later signal of 1 is collected as fc = fc+1 = 3.
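The func-count example can be sketched as a toy signal/collect computation (a drastic simplification of the real Signal/Collect framework; the `Vertex` class below is hypothetical):

```python
# Toy signal/collect: vertices emit signals along edges, then fold
# received signals into their own state. Not the real framework's API.
class Vertex:
    def __init__(self, label):
        self.label = label
        self.state = 0     # e.g. func-count on a class vertex
        self.inbox = []
        self.edges = []    # outgoing edges (function -> enclosing class)

    def signal(self):
        for target in self.edges:
            target.inbox.append(1)   # "I am one function"

    def collect(self):
        self.state += sum(self.inbox)
        self.inbox = []

clazz = Vertex("class")
funcs = [Vertex("function") for _ in range(3)]
for f in funcs:
    f.edges.append(clazz)

for f in funcs:     # signal step
    f.signal()
clazz.collect()     # collect step
print(clazz.state)  # func-count = 3
```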

SLIDE 67

Analysis formulation

• Analyses use generic labels for entities


SLIDES 70-73

Light-weight entity mappings

• The generic labels used by analyses are mapped to the labels used by the parser or grammar.
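Such a mapping can be as small as one dictionary per grammar (label names taken from the slides; the helper below is illustrative, not LISA's actual mapping mechanism):

```python
# Map each grammar's node labels onto the generic labels analyses use.
JAVA_MAPPING = {
    "CLASS_DECL": "class", "METHOD_DECL": "function",
    "FOR_STAT": "for", "IF_STAT": "if", "METHOD_INVOK": "print",
}
PYTHON_MAPPING = {
    "classdef": "class", "funcdef": "function",
    "forstat": "for", "ifelsestat": "if", "atomexpr": "print",
}

def to_generic(labels, mapping):
    """Keep only mapped nodes, renamed to their generic label
    (unmapped labels are dropped -- this is the filtering from #3)."""
    return [label for label in (mapping.get(l) for l in labels) if label]

java_ast = ["CLASS_DECL", "MODIFIERS", "METHOD_DECL", "FOR_STAT", "IF_STAT"]
python_ast = ["classdef", "parameters", "funcdef", "forstat", "ifelsestat"]
print(to_generic(java_ast, JAVA_MAPPING))
print(to_generic(java_ast, JAVA_MAPPING) == to_generic(python_ast, PYTHON_MAPPING))  # True
```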

SLIDE 74

Parsing

• AST is kept as the parser supplies it
• ANTLRv4 integration
• Filtering can be enabled
  • Only AST nodes that correspond to a label used in an analysis are kept
  • Reduces graph size by a factor of 10 or more

SLIDE 75

Adding new languages

1. Integrate a parser (or generate one)
   • Graph interface allows adding nodes/edges
2. Write a node mapping
3. Re-use existing analyses on new ASTs


SLIDE 78

To Summarize...

SLIDES 79-82

The LISA Analysis Process

select project → clone (www) → parallel parse into merged graph → async. compute on graph → store analysis results (Res) → more projects?

• The ANTLRv4 grammar generates the parser and is used by the language mappings.
• The language mappings (grammar to analysis) determine which AST nodes are loaded.
• The analysis, formulated as a graph computation, runs on the graph and determines which data is persisted.
SLIDE 83

How well does it work, then?

SLIDE 84

Marginal cost for +1 revision

Average parsing+computation time per revision when analyzing n revisions of AspectJ (10 common metrics):

n revisions   1       10     100    1000   2000   3000   4000   5000   6000   7000
seconds       41.670  4.633  0.525  0.109  0.082  0.071  0.052  0.041  0.032  0.033

SLIDE 85

Overall Performance Stats

                          Java          C#            JavaScript    Python
#Projects                 1000          1000          1000          1000
#Revisions                2 Million     1.4 Million   1.5 Million   2.3 Million
#Files (parsed!)          10 Billion    3 Billion     380 Thousand  1.2 Billion
#Lines (parsed!)          1.6 Trillion  0.6 Trillion  43 Billion    0.3 Trillion
Median RT¹                8.35s         40.5s         14.4s         24.5s
Tot. Avg. RT per Rev.²    97ms          183ms         57ms          83ms
Med. Avg. RT per Rev.²    41ms          88ms          25ms          48ms

Total Runtime (RT)¹: 2d 5h / 3d 23h / 2d 4h (one of the four values is missing)

¹ Including cloning and persisting results
² Excluding cloning and persisting results

SLIDE 86

What's the catch?

(There are a few...)

SLIDES 87-88

The (not so) minor stuff

• Must implement analyses from scratch
  • No help from a compiler
  • Non-file-local analyses need some effort
• Moved files/methods etc. add overhead
  • Uniquely identifying files/entities is hard
  • (No impact on results, though)

SLIDE 89

Parser performance matters

(chart comparing parser performance for JavaScript, C#, Java and Python)

SLIDE 90

LISA is EXTREME

(spectrum from complex / feature-rich / heavyweight to simple / generic / lightweight)

SLIDE 91

Thank you for your attention

59th CREST Open Workshop, London, 26.03.2018

Read the paper: http://t.uzh.ch/Fj (the upcoming EMSE publication is much more detailed)
Try the tool: http://t.uzh.ch/Fk
Get the slides: http://t.uzh.ch/NR
Contact me: alexandru@ifi.uzh.ch