Coverage-Based Reduction of Test Execution Time: Lessons from a Very Large Industrial Project


slide-1
SLIDE 1

Thomas Bach

Coverage-Based Reduction of Test Execution Time:

Lessons from a Very Large Industrial Project

Thomas Bach, Artur Andrzejak, Ralf Pannemans

Heidelberg University – http://pvs.ifi.uni-heidelberg.de
SAP SE – http://www.sap.de

slide-2
SLIDE 2

Content

  • Academic-industry collaboration details
  • Test environment
  • Challenges and gaps between research and practice
  • Our results from coverage analysis

2

slide-3
SLIDE 3

Collaboration Details

  • Started in 2012
  • Recurring student activities (> 10 theses, internships)
  • PhD project: Testing in Very Large Software Projects

– PhD student at Heidelberg University and SAP

  • Success factors:

– Good combination: practically relevant and nontrivial research
– Real, large-scale software product as a use case

  • Challenges:

– Transfer research to production
– Find interested persons in charge

3

slide-4
SLIDE 4

Test Environment

  • SAP HANA

– In-memory database management system
– Core product platform of SAP
– Several million LOC C/C++, scales up to >600 cores

  • Testing

– More than 1000 test suites with more than 100 000 tests
– Coverage is line-based per test suite
– Test framework in Python

  • Test sends SQL to HANA and checks results
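As an illustration of this test pattern (send SQL, check the result), here is a minimal, self-contained sketch. It uses Python's built-in sqlite3 so it can run anywhere; the real framework sends the SQL to a HANA instance instead, and the table and values here are made up.

    # Minimal sketch of the test pattern "send SQL, check the result".
    # Uses sqlite3 for a self-contained example; the real SAP framework
    # talks to a HANA instance, and table/values here are invented.
    import sqlite3
    import unittest

    class ExampleSqlTest(unittest.TestCase):
        def setUp(self):
            self.conn = sqlite3.connect(":memory:")
            self.conn.execute("CREATE TABLE sales (year INTEGER, amount INTEGER)")
            self.conn.executemany("INSERT INTO sales VALUES (?, ?)",
                                  [(2016, 1000), (2016, 3200), (2015, 500)])

        def tearDown(self):
            self.conn.close()

        def test_sum_aggregation(self):
            # Send SQL and compare the result against the expected value.
            (total,) = self.conn.execute(
                "SELECT SUM(amount) FROM sales WHERE year = 2016").fetchone()
            self.assertEqual(total, 4200)

    if __name__ == "__main__":
        unittest.main()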

4

slide-5
SLIDE 5

GAPS BETWEEN RESEARCH AND PRACTICE

5

slide-6
SLIDE 6

Project goals and discovered gaps

  • We want to

– Reduce test runtime
– Increase the specificity of coverage-based test characterization

  • We encountered several issues with existing work

6

slide-7
SLIDE 7

Evaluation with Small Projects

  • Practitioners do not trust small evaluations

7

Work¹                  Size
Alspaugh et al. 2007   5 classes to 22 classes
Zhang et al. 2009      53 testcases to 209 testcases
Li et al. 2009         374 LOC to 11 kLOC
You et al. 2011        500 LOC to 10 kLOC
Zhang et al. 2013      2 kLOC to 80 kLOC
Do et al. 2008         7 kLOC to 80 kLOC
Elbaum et al. 2002     8 kLOC to 300 kLOC
Our work               > 3.50 MLOC

1 See paper for details

Related work comparing overlap-aware vs. non-overlap-aware solvers for test case selection (TCS) or test case prioritization (TCP)

slide-8
SLIDE 8

Flaky Tests

  • Execute test 1: OK
  • Execute test 1: OK
  • Execute test 1: OK
  • Execute test 1: Failed
  • Execute test 1: OK

8

Hardware Problems? Test dependencies? Test infrastructure? Real bug? (e.g. concurrency) Performance? Memory leak? and more …

Investigate? Ignore?

slide-9
SLIDE 9

Flaky Tests

  • Execute test 1: OK
  • Execute test 1: OK
  • Execute test 1: OK
  • Execute test 1: Failed
  • Execute test 1: OK

9

Hardware Problems? Test dependencies? Test infrastructure? Real bug? (e.g. concurrency) Performance? Memory leak? and more …

Investigate? Ignore?

The real world is not perfect, and return on investment considerations argue against striving for perfection. Flaky test detection and handling is time consuming.
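The deck does not describe SAP's detection mechanism; as an illustration only, a common low-cost heuristic is to re-run a failing test a few times and classify it as flaky if the verdict changes. A minimal sketch, where run_test stands for whatever executes a single test:

    # Illustrative sketch of rerun-based flaky-test detection (not necessarily
    # SAP's mechanism). run_test is any callable that executes one test and
    # returns True (pass) or False (fail).
    import random
    from typing import Callable

    def classify_failure(run_test: Callable[[], bool], reruns: int = 4) -> str:
        """Re-run a test that has just failed and classify the failure."""
        if any(run_test() for _ in range(reruns)):
            # The verdict changes across identical executions -> likely flaky:
            # hardware, test dependencies, infrastructure, concurrency, ...
            return "flaky"
        # Deterministic failure -> investigate as a potential real bug.
        return "consistent failure"

    # Example: a dummy test that fails nondeterministically in ~20% of runs.
    print(classify_failure(lambda: random.random() < 0.8))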

slide-10
SLIDE 10

Shared coverage

10

Diagram: Tests 1–4 around the database code; a large central region of the code is covered by nearly all tests

Large part of coverage is not specific

slide-11
SLIDE 11

Random Coverage

  • Coverage A: 651 074 lines hit
  • Coverage B: 651 845 lines hit
  • Coverage C: 651 862 lines hit
  • Coverage D: 652 015 lines hit

11

Venn diagram of the four coverage sets A, B, C, and D

slide-12
SLIDE 12

Random Coverage

  • Coverage A: 651 074 lines hit
  • Coverage B: 651 845 lines hit
  • Coverage C: 651 862 lines hit
  • Coverage D: 652 015 lines hit

12

In fact: A and B come from the same Test1; C and D come from the same Test2; Test2 contains Test1 plus more.

Venn diagram of the four coverage sets A, B, C, and D

With such nondeterministic coverage it is impossible to find exactly identical or included tests.
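One way to see the problem: exact set operations (equality, containment) between two coverage runs of the same test almost never hold, so duplicate or subsumed tests cannot be detected by exact comparison. The similarity-based check below is my illustration, not the approach from the talk, and the threshold is made up:

    # Illustrative comparison of coverage sets under nondeterministic ("random")
    # coverage. The similarity check is an assumption for illustration only.

    def jaccard(a: set, b: set) -> float:
        """Similarity of two coverage sets: |A intersect B| / |A union B|."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    # Two runs of the same test: a few lines are hit nondeterministically.
    common = set(range(651_000))
    run_a = common | {651_001, 651_003}
    run_b = common | {651_002, 651_004, 651_005}

    print(run_a == run_b)                # False: exact equality never holds
    print(run_a <= run_b)                # False: exact containment fails as well
    print(jaccard(run_a, run_b) > 0.99)  # True: near-identical coverage is visible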

slide-13
SLIDE 13

Size of Coverage Data

13

The size of the coverage data is nontrivial and increasing

slide-14
SLIDE 14

OUR RESULTS ON COVERAGE ANALYSIS

14

slide-15
SLIDE 15

Overlap-Aware Coverage Algorithms

  • Test Case Selection

    – Time budget of 1 h: Which tests to run?
      • Objective: coverage – maximum budgeted coverage problem
    – Which tests to run for full coverage?
      • Objective: cardinality – set cover problem
      • Objective: runtime – weighted set cover problem

  • Test Case Prioritization

    – Which tests to run first? Objective: coverage (per time)

15

These are unsafe algorithms: we could miss functionality.
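As a concrete reference point, here is a minimal sketch of the budgeted, overlap-aware greedy heuristic (my reconstruction for illustration, not the exact SAP implementation): repeatedly pick the test with the best ratio of newly covered lines to runtime until the budget is exhausted. Dropping the budget check and iterating until no new lines remain yields the (weighted) set cover variants.

    # Sketch of an overlap-aware greedy for the time-budget variant (maximum
    # budgeted coverage). Reconstruction for illustration only.
    from typing import Dict, List, Set

    def greedy_budgeted(coverage: Dict[str, Set[int]],
                        runtime: Dict[str, float],
                        budget: float) -> List[str]:
        covered: Set[int] = set()
        candidates = set(coverage)
        selected: List[str] = []
        while candidates:
            # Overlap-awareness: rank by *newly* covered lines per second.
            best = max(candidates,
                       key=lambda t: len(coverage[t] - covered) / runtime[t])
            candidates.discard(best)
            if not coverage[best] - covered or runtime[best] > budget:
                continue  # nothing new to gain, or it no longer fits the budget
            selected.append(best)
            covered |= coverage[best]
            budget -= runtime[best]
        return selected

    # Toy example: three overlapping test suites and a one-hour budget.
    cov = {"t1": {1, 2, 3}, "t2": {2, 3, 4, 5}, "t3": {6}}
    rt = {"t1": 1200.0, "t2": 1800.0, "t3": 600.0}
    print(greedy_budgeted(cov, rt, budget=3600.0))   # -> ['t1', 't3', 't2']

Replacing coverage[t] - covered with coverage[t] in the ranking gives a simple greedy that ignores overlap, which is the comparison shown on the following slides.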


slide-17
SLIDE 17

Overlap-Aware vs. Simple Greedy

17

Diagram: coverage of Test 1, Test 2, and Test 3, compared under simple greedy and overlap-aware greedy selection


slide-20
SLIDE 20

Comparison Overlap-Aware

20

Runtime for a single run: < 10 s
Also works for test clusters with buckets
Overlap-aware greedy reaches more coverage faster

slide-21
SLIDE 21

Parallel Variant for Test Clusters

21

Diagram: one Test Server A with a budget of 1 × 3 hours runs Tests 1–7, compared with Test Servers 1, 2, and 3, each with a budget of 1 hour, across which Tests 1–7 are distributed
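A hedged sketch of how such a parallel (bucketed) variant could distribute the greedily selected tests over several servers with individual budgets; this is my reconstruction from the diagram, not the exact SAP algorithm. Each selected test is placed on the server with the most remaining budget.

    # Sketch of a bucketed variant: distribute greedily selected tests over
    # several servers, each with its own time budget. Reconstruction only.
    from typing import Dict, List, Set

    def greedy_clustered(coverage: Dict[str, Set[int]],
                         runtime: Dict[str, float],
                         servers: int,
                         budget_per_server: float) -> List[List[str]]:
        buckets: List[List[str]] = [[] for _ in range(servers)]
        budgets = [budget_per_server] * servers
        covered: Set[int] = set()
        candidates = set(coverage)
        while candidates:
            best = max(candidates,
                       key=lambda t: len(coverage[t] - covered) / runtime[t])
            candidates.discard(best)
            if not coverage[best] - covered:
                continue
            # Place the selected test on the server with the most remaining budget.
            server = max(range(servers), key=lambda s: budgets[s])
            if runtime[best] > budgets[server]:
                continue                     # the test no longer fits anywhere
            buckets[server].append(best)
            budgets[server] -= runtime[best]
            covered |= coverage[best]
        return buckets

    # Toy example: Tests 1-7, three servers with a one-hour budget each.
    cov = {f"Test{i}": {i, i + 1} for i in range(1, 8)}
    rt = {t: 1500.0 for t in cov}
    print(greedy_clustered(cov, rt, servers=3, budget_per_server=3600.0))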


slide-23
SLIDE 23

Overlap-Aware for Test Clusters

23

Plot (coverage vs. time budget): overlap-aware greedy for test clusters with 1, 4, 8, 16, or 32 servers

Coverage decrease < 0.01% → the approach works for test clusters

slide-24
SLIDE 24

Coverage Redundancy

24

1 int example_function(int a, int b) {
2   int c = a + b;
3   int d = a - b;
4   return c*d;
5 }

slide-25
SLIDE 25

Coverage Redundancy

25

1 int example_function(int a, int b) {
2   int c = a + b;
3   int d = a - b;
4   return c*d;
5 }

     Test1  Test2  Test3
S1     x      x
S2     x      x
S3     x      x
S4     x      x
S5     x      x


slide-27
SLIDE 27

Coverage Redundancy

27

     Test1  Test2  Test3
S1     x      x
S2     x      x
S3     x      x
S4     x      x
S5     x      x

Coverage run   Lines hit   Line groups   Redundancy %
2015-11-15     2901575     79741         97.25
2016-05-19     3172337     93162         97.06
2016-08-04     3371109     97368         97.11
2016-10-25     3510727     104764        97.02
2016-11-01     3421780     104837        96.94
2016-11-15     3436853     106030        96.91

1 int example_function(int a, int b) {
2   int c = a + b;
3   int d = a - b;
4   return c*d;
5 }

Large part of coverage data is redundant
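The redundancy figures above follow from grouping lines that are covered by exactly the same set of test suites ("line groups"): redundancy = 1 - (line groups / lines hit), e.g. 1 - 79741/2901575 ≈ 97.25%. A minimal sketch of this grouping on toy data (my reconstruction of the computation, not the original tooling):

    # Sketch: collapse lines covered by exactly the same set of tests into
    # "line groups" and compute the redundancy ratio from the table above.
    from collections import defaultdict
    from typing import Dict, FrozenSet, List

    def line_groups(covering_tests: Dict[int, FrozenSet[str]]) -> List[List[int]]:
        """Group line numbers by their (identical) set of covering tests."""
        groups: Dict[FrozenSet[str], List[int]] = defaultdict(list)
        for line, tests in covering_tests.items():
            groups[tests].append(line)
        return list(groups.values())

    # Toy data: the 5 lines of example_function are always hit together,
    # plus one unrelated line hit only by Test3.
    cov = {line: frozenset({"Test1", "Test2"}) for line in range(1, 6)}
    cov[6] = frozenset({"Test3"})

    groups = line_groups(cov)
    redundancy = 1 - len(groups) / len(cov)
    print(f"{len(cov)} lines hit -> {len(groups)} line groups, "
          f"redundancy {redundancy:.0%}")   # 6 lines hit -> 2 line groups, 67%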

slide-28
SLIDE 28

Shared Coverage Problem

  • Ask SAP engineers where they expect coverage for Test1

28

Chart (lines hit per directory A–F): coverage expectation for Test1

slide-29
SLIDE 29

Shared Coverage Problem

  • Ask SAP engineers where they expect coverage for Test1
  • Measure Test1

29

Chart (lines hit per directory A–F): coverage expectation for Test1

Chart (lines hit per directory A–F): measured coverage for Test1

Coverage does not characterize Test1

slide-30
SLIDE 30

Filtering Shared Coverage Data

Considered two approaches:

a) Baseline approach: define a baseline test and remove the baseline coverage from all other tests

b) Testcount approach: remove all lines covered by more than, e.g., 238 tests (of, e.g., 1200 in total)

30
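A minimal sketch of both filters, assuming a mapping from test (suite) name to its set of covered lines; the baseline test name and the threshold below are illustrative values, not the production configuration:

    # Sketch of the two filtering approaches on a mapping test -> covered lines.
    from collections import Counter
    from typing import Dict, Set

    def baseline_filter(coverage: Dict[str, Set[int]],
                        baseline: str) -> Dict[str, Set[int]]:
        """a) Remove the baseline test's coverage from all other tests."""
        base = coverage[baseline]
        return {t: lines - base for t, lines in coverage.items() if t != baseline}

    def testcount_filter(coverage: Dict[str, Set[int]],
                         max_tests: int) -> Dict[str, Set[int]]:
        """b) Remove all lines covered by more than max_tests tests."""
        hits = Counter(line for lines in coverage.values() for line in lines)
        shared = {line for line, n in hits.items() if n > max_tests}
        return {t: lines - shared for t, lines in coverage.items()}

    # Toy example: line 1 is hit by every test, the other lines are specific.
    cov = {"base": {1}, "t1": {1, 2}, "t2": {1, 3}, "t3": {1, 4}}
    print(baseline_filter(cov, "base"))        # {'t1': {2}, 't2': {3}, 't3': {4}}
    print(testcount_filter(cov, max_tests=3))  # line 1 (hit by 4 tests) removed everywhere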

slide-31
SLIDE 31

Testcount Approach

31

Distribution plot: e.g., 80% of all lines hit are covered by 238 or fewer test suites, and 31% of all lines are covered by only one test
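A threshold such as 238 can be read off this distribution as a quantile of the per-line test counts. A minimal sketch of that derivation; the 80% quantile mirrors the slide, everything else (data, quantile handling) is illustrative:

    # Sketch: derive a testcount threshold from the distribution of how many
    # test suites cover each line.
    from collections import Counter
    from typing import Dict, Set

    def testcount_threshold(coverage: Dict[str, Set[int]], quantile: float) -> int:
        hits = Counter(line for lines in coverage.values() for line in lines)
        counts = sorted(hits.values())
        # Count at the `quantile` position of the sorted per-line counts.
        return counts[int(quantile * (len(counts) - 1))]

    cov = {"t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 5}, "t4": {1, 2}}
    print(testcount_threshold(cov, 0.8))   # per-line counts [1, 1, 1, 3, 4] -> 3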

slide-32
SLIDE 32

32

Filtering Shared Coverage Evaluation

Chart (lines hit per directory A–F): coverage for Test1 (measurement)

Chart (lines hit per directory A–F): filtered coverage for Test1 (after applying the approach)

slide-33
SLIDE 33
  • List of top 5 directories ordered by lines hit:
  • Ask SAP engineers if this fits their expectations:

33

Filtering Shared Coverage Evaluation

Chart (lines hit per directory A–F): coverage for Test1 (measurement) and filtered coverage for Test1 (after applying the approach)

Top 5 directories by lines hit – measurement: F, C, B, D, A; after approach: D, F, A, B, C

slide-34
SLIDE 34
  • List of top 5 directories ordered by lines hit:
  • Ask SAP engineers if this fits their expectations:

34

Filtering Shared Coverage Evaluation

Chart (lines hit per directory A–F): coverage for Test1 (measurement) and filtered coverage for Test1 (after applying the approach)

Top 5 directories by lines hit – measurement: F, C, B, D, A; after approach: D, F, A, B, C
Fits the engineers' expectations? Measurement: No; after approach: Yes

slide-35
SLIDE 35

Filtering Shared Coverage Evaluation

35

slide-36
SLIDE 36

Filtering Shared Coverage Evaluation

36

Specificity improved significantly

slide-37
SLIDE 37

Summary

37

Gaps between research and practice (recap): evaluation with small projects, flaky tests, random coverage, size of coverage data, shared coverage, and coverage redundancy.

Results (recap): overlap-aware coverage algorithms (comparison of overlap-aware vs. simple greedy) and filtering of shared coverage data.

slide-38
SLIDE 38

Backup Slides

38

slide-39
SLIDE 39

Filtering Shared Coverage Evaluation

39

File              # lines hit
DirA\File1        2
DirB\File2        3
DirB\File3        2
DirB\File4        5
DirB\DirM\File5   7

Coverage result for Test1

Directory   # lines hit
DirA        2
DirB        17

Coverage result for Test1 per directory

List of directories ordered by #lines hit: DirB, DirA

Ask SAP engineers whether DirA or DirB is the expected top directory for Test1. The top directory is wrong: the coverage is not specific.
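A minimal sketch of the aggregation shown here: sum the per-file line hits up to the top-level directory and order directories by that sum (toy data copied from the tables above):

    # Sketch of the per-directory aggregation from this backup slide.
    from collections import Counter
    from typing import Dict

    def lines_per_directory(per_file_hits: Dict[str, int]) -> Counter:
        per_dir: Counter = Counter()
        for path, hits in per_file_hits.items():
            per_dir[path.split("\\")[0]] += hits   # "DirB\DirM\File5" -> "DirB"
        return per_dir

    coverage_test1 = {
        r"DirA\File1": 2,
        r"DirB\File2": 3,
        r"DirB\File3": 2,
        r"DirB\File4": 5,
        r"DirB\DirM\File5": 7,
    }
    per_dir = lines_per_directory(coverage_test1)
    print(per_dir)                                 # Counter({'DirB': 17, 'DirA': 2})
    print([d for d, _ in per_dir.most_common()])   # ['DirB', 'DirA']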

slide-40
SLIDE 40

Overlap-Aware for Test Clusters

40

Plot (coverage vs. time budget): overlap-aware greedy for test clusters with parallelization factors from 1 to 50