Test amplification tools are being developed to generate and improve test cases. They include DSPOT, which generates missing assertions; DESCARTES, which speeds up mutation testing by making bigger changes; and CAMP, which amplifies tests through Docker mutations. Search-based techniques are also used to automatically reproduce crashes from stack traces: EvoCrash and Botsing employ genetic algorithms to evolve an initial random test suite towards reproducing the crash, and 200 real-world crashes have been used to evaluate them. Recent work explores using behavioral models or existing test cases to improve the initial test population.
1. Test Amplification
Arie van Deursen, TU Delft
Dutch Testing Day, November 6, 2018
Pouria Derakhshanfar, Xavier Devroey, Mozhan Soltani, Annibale Panichella, Andy Zaidman
3. Test Amplification Tools Under Construction
DSPOT:
Detect and generate missing assertions for JUnit test cases
DESCARTES:
Speed up mutation testing by making bigger changes (drop method body)
CAMP:
Environment test amplification through Docker mutations
EvoCrash / Botsing:
Test suite amplification via crash reproduction
https://www.stamp-project.eu/
https://github.com/STAMP-project
5. Crash Reproduction
- File an issue
- Discover steps to reproduce the crash
- Example: XWIKI-13031
Challenges:
- Labor intensive
- Hard to automate
6. Java Stack Trace (Issue XWIKI-13031)
java.lang.ClassCastException: […]
    at org….SolrEntityReferenceResolver.getWikiReference(….java:93)
    at org….SolrEntityReferenceResolver.getEntityReference(….java:70)
    at org….SolrEntityReferenceResolver.resolve(….java:63)
    at org….SolrDocumentReferenceResolver.resolve(….java:48)
    at …
(The first line names the exception; the "at" lines are the frames; the frame to reproduce is the target.)
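To automate reproduction, a tool first has to parse such a trace into the exception class and an ordered list of frames; the deepest frame of interest becomes the target. A minimal Python sketch of that step (the regex, names, and shortened trace are illustrative, not the actual code of EvoCrash or Botsing, which are Java tools):

```python
import re

# Matches one "at package.Class.method(File.java:line)" frame.
FRAME_RE = re.compile(r"\s*at\s+(?P<method>[\w.$]+)\((?P<file>[\w.]+):(?P<line>\d+)\)")

def parse_stack_trace(text):
    """Split a Java stack trace into (exception class, list of frames)."""
    lines = text.strip().splitlines()
    exception = lines[0].split(":")[0]  # e.g. java.lang.ClassCastException
    frames = []
    for line in lines[1:]:
        m = FRAME_RE.match(line)
        if m:
            frames.append((m.group("method"), m.group("file"), int(m.group("line"))))
    return exception, frames

trace = """java.lang.ClassCastException: illustrative message
    at org.xwiki.SolrEntityReferenceResolver.getWikiReference(SolrEntityReferenceResolver.java:93)
    at org.xwiki.SolrEntityReferenceResolver.resolve(SolrEntityReferenceResolver.java:63)"""
exception, frames = parse_stack_trace(trace)
print(exception)     # java.lang.ClassCastException
print(frames[0][2])  # 93: the line of the deepest (target) frame
```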
7. Search-Based Crash Reproduction
(Figure: a random initial test suite of method-call sequences is evolved by evolutionary search, guided by the given stack trace, into a crash-reproducing test case.)
Soltani, Panichella, van Deursen. "Search-Based Crash Reproduction and Its Impact on Debugging."
IEEE Transactions on Software Engineering, 2018. pure.tudelft.nl
8. Crash-reproducing Test Case
public void test0() throws Throwable {
    …
    SolrEntityReferenceResolver solrEntityReferenceResolver0 = new …();
    EntityReferenceResolver entityReferenceResolver0 = … mock(…);
    solrDocument0.put("wiki", (Object) entityType0);
    Injector.inject(solrEntityReferenceResolver0, …);
    Injector.validateBean(solrEntityReferenceResolver0, …);
    …
    // Undeclared exception!
    solrEntityReferenceResolver0.resolve(solrDocument0, entityType0, objectArray0);
}

Resulting stack trace:
java.lang.ClassCastException: […]
    at org….SolrEntityReferenceResolver.getWikiReference(….java:93)
    at org….SolrEntityReferenceResolver.getEntityReference(….java:70)
    at org….SolrEntityReferenceResolver.resolve(….java:63)
9. EvoSuite
- Search-based test generation
- Generates many random JUnit tests
- Optimized to maximize, e.g., branch coverage
- Combines and improves tests to optimize overall fitness
- http://www.evosuite.org/
(Figure: the genetic-algorithm loop: initialize population, evaluate fitness, then build the next generation via selection, crossover, mutation, and reinsertion, until fitness == 0 or the budget is exhausted.)
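The loop in the figure can be sketched as a generic search that minimizes a fitness function and stops on an exact match or when the budget runs out. This is an illustrative Python sketch under assumed names, not EvoSuite's actual API; the toy usage evolves lists of integers toward all zeros:

```python
import random

def genetic_search(init_population, fitness, crossover, mutate, budget=1000):
    """Minimize fitness; stop at fitness == 0 or when the budget is exhausted."""
    population = init_population()
    for _ in range(budget):
        scored = sorted(population, key=fitness)
        if fitness(scored[0]) == 0:
            return scored[0]                      # goal reached (e.g. crash reproduced)
        parents = scored[: len(scored) // 2]      # selection: keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in parents]
        population = parents + children           # reinsertion
    return min(population, key=fitness)           # best effort when budget exhausted

# Toy usage: fitness is the sum of the genes, so the optimum is all zeros.
best = genetic_search(
    init_population=lambda: [[random.randint(0, 9) for _ in range(5)] for _ in range(20)],
    fitness=sum,
    crossover=lambda a, b: a[:2] + b[2:],                      # single-point crossover
    mutate=lambda t: [max(0, g - random.randint(0, 1)) for g in t],
)
print(sum(best))  # 0
```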
10. EvoCrash
- Implemented on top of EvoSuite
- Requires:
  - A stack trace
  - Binaries (.jar files)
  - A time budget, set by the user
- Produces a single test case that reproduces the crash
11. Initialize Population
- Guided initialization:
  - Random method calls
  - Guarantees that a call to the target method is inserted in each test at least once
  - Direct for public and protected methods
  - Indirect for private methods
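Guided initialization can be sketched as below: every generated test is a random call sequence into which one call to the target method (the method in the target stack frame) is inserted. Names and sizes are illustrative, not the tool's implementation:

```python
import random

def guided_init(random_call, target_call, population_size=50, max_len=8):
    """Build an initial population in which every test contains the target call."""
    population = []
    for _ in range(population_size):
        test = [random_call() for _ in range(random.randint(1, max_len))]
        # Guarantee at least one call to the target method in each test.
        test.insert(random.randrange(len(test) + 1), target_call)
        population.append(test)
    return population

pop = guided_init(random_call=lambda: random.choice(["a()", "b()", "c()"]),
                  target_call="resolve()")
print(all("resolve()" in test for test in pop))  # True
```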
12. Evaluate Fitness
- Global fitness function to guide the generation process:
  - Line coverage: how far are we from the line where the exception is thrown?
  - Exception coverage: is the exception thrown?
  - Stack trace similarity: how similar is the produced stack trace to the original (given) stack trace?
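The three components can be combined into one distance that the search minimizes, reaching 0 exactly when the crash is reproduced. The weights and normalization below are illustrative, not EvoCrash's published fitness function:

```python
def crash_distance(line_distance, exception_thrown, trace_similarity):
    """Combine the three fitness components into one value to minimize.

    line_distance:    0.0 when the throwing line is reached (normalized to [0, 1])
    exception_thrown: True if the target exception type was raised
    trace_similarity: fraction of frames matching the given stack trace, in [0, 1]
    """
    d_line = line_distance
    d_exception = 0.0 if exception_thrown else 1.0
    d_trace = 1.0 - trace_similarity
    # Illustrative weighting: reaching the line matters most, then the exception.
    return 3 * d_line + 2 * d_exception + d_trace

# A test that reaches the line, throws the exception, and matches the whole trace:
print(crash_distance(0.0, True, 1.0))  # 0.0
```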
13. Next Generation
- Selection: the fittest tests according to the fitness function
- Guided crossover:
  - Single-point crossover
  - Checks that the call to the target method is preserved
- Guided mutation:
  - Add/change/drop statements
  - Checks that the call to the target method is preserved
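Both guided operators can be sketched as below: standard crossover and mutation, plus a check that the call to the target method survives. This is an illustrative Python sketch of the idea, not the tool's Java implementation:

```python
import random

def guided_crossover(parent1, parent2, target_call):
    """Single-point crossover; fall back to parent1 if the child loses the target call."""
    point = random.randint(1, min(len(parent1), len(parent2)) - 1)
    child = parent1[:point] + parent2[point:]
    return child if target_call in child else list(parent1)

def guided_mutation(test, target_call, random_call):
    """Add, change, or drop one statement, but never remove the target call."""
    mutant = list(test)
    op = random.choice(["add", "change", "drop"])
    i = random.randrange(len(mutant))
    if op == "add":
        mutant.insert(i, random_call())
    elif op == "change" and mutant[i] != target_call:
        mutant[i] = random_call()
    elif op == "drop" and mutant[i] != target_call:
        del mutant[i]
    return mutant

mutant = guided_mutation(["a()", "resolve()", "b()"], "resolve()",
                         lambda: random.choice(["c()", "d()"]))
print("resolve()" in mutant)  # True
```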
14. Does it Work?
The JCrashPack Crash Replication Benchmark
- 200 crashes from various open-source projects:
  - XWiki (STAMP partner): 51 crashes from the XWiki issue tracking system
  - Defects4J applications (state-of-the-art fault localization benchmark): 73 crashes, with fixes
  - Elasticsearch (selected for popularity): 76 crashes from the Elasticsearch issue tracking system
- Filtered, verified, cleaned up, with the right jar versions, …
https://github.com/STAMP-project/JCrashPack
15. ExRunner: Running Crash Replication
- Python tool to benchmark crash reproduction tools
- Multithreaded execution
- Monitors tool executions
(Figure: a JobGenerator turns stack traces, tool configurations, and jar files into jobs; each job runs in one of n threads; an Observer monitors the runs and collects logs, results, and the generated test cases.)
17. Identified 12 Key Challenges
- Input data generation: for complex inputs, generic types, etc.
- Environmental dependencies: the environment state is hard to manage at the unit level
- Complex code: long methods with lots of nested predicates
- Abstract classes and methods: cannot be instantiated, and one concrete implementation is picked randomly
- […]
Mozhan Soltani, Pouria Derakhshanfar, Xavier Devroey, Arie van Deursen.
An Empirical Evaluation of Search-Based Crash Reproduction. TU Delft, 2018. In preparation.
18. Work in Progress: Improved Seeding
Hypothesis: an initial test suite close to the actual usage of the class is more likely to lead to a crash-reproducing test case than a random one.
(Figure: the random initial test suite and the given stack trace, as in the earlier search-based setup.)
Pouria Derakhshanfar, Xavier Devroey, Gilles Perrouin, Andy Zaidman and Arie van Deursen.
Search-based Crash Reproduction using Behavioral Model Seeding. TU Delft, 2018. Submitted.
19. Test Seeding
Use the existing tests to generate the initial test suite.
(Figure: a subset of the existing tests replaces part of the random initial test suite before the search towards the given stack trace starts.)
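Test seeding can be sketched as below: with some probability Pr[clone], an individual in the initial population is a clone of an existing test instead of a random one (Pr[clone] is the parameter varied in the evaluation; all names are illustrative):

```python
import random

def seed_initial_population(existing_tests, random_test, size=50, p_clone=0.5):
    """Build the initial population: clone an existing test with probability
    p_clone (Pr[clone]); otherwise generate a random test."""
    population = []
    for _ in range(size):
        if existing_tests and random.random() < p_clone:
            population.append(list(random.choice(existing_tests)))  # clone
        else:
            population.append(random_test())
    return population

pop = seed_initial_population([["a()", "c()"], ["b()", "e()"]],
                              random_test=lambda: ["c()", "e()"],
                              size=10, p_clone=0.8)
print(len(pop))  # 10
```

With p_clone == 1.0 the search starts purely from existing tests; with 0.0 it degenerates to the random initialization of plain EvoCrash.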
20. Model Seeding
Use a model of method usage to generate the initial test suite.
(Figure: a call-sequence model over the methods a() to e() generates a model-driven initial test suite, which replaces the random one before the search towards the given stack trace starts.)
21. Call Sequence Models
- Model generated from sequences of method calls
- Coming from:
  - Source code (static analysis)
  - Test cases (dynamic analysis)
  - Operations logs (online analysis)
- N-gram inference from sequences such as:
  [b(), a(), e()]
  [c(), d(), a(), e()]
  [b(), a(), d(), a(), d(), a(), e()]
  …
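The n-gram inference can be sketched for n = 2 (a bigram model): for each method call, count which call follows it and normalize the counts into probabilities. The sketch below runs on the three example sequences from the slide; it is an illustration of the idea, not the paper's actual inference code:

```python
from collections import Counter, defaultdict

def infer_bigram_model(sequences):
    """Infer a 2-gram model: for each call, the distribution over the next call."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {call: {nxt: n / sum(following.values())
                   for nxt, n in following.items()}
            for call, following in counts.items()}

model = infer_bigram_model([
    ["b()", "a()", "e()"],
    ["c()", "d()", "a()", "e()"],
    ["b()", "a()", "d()", "a()", "d()", "a()", "e()"],
])
# In these sequences, a() is followed by e() three times and by d() twice.
print(model["a()"])  # {'e()': 0.6, 'd()': 0.4}
```

Walking such a model from call to call yields fresh, usage-like call sequences for the model-driven initial test suite.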
24. Model Seeding Approach
Guided Genetic Algorithm
(Figure: test seeding runs an instrumented execution of tests.jar to populate an objects pool with test cases. Behavioral model seeding combines static analysis of app.jar with instrumented execution to infer call-sequence models, from which selected abstract test cases populate the same pool. The guided genetic algorithm then evolves tests for the given stack trace through guided initialization, fitness evaluation, selection, guided crossover, guided mutation, and reinsertion, until fitness == 0 or the budget is exhausted; the probabilities Pr[clone], Pr[pick init], and Pr[pick mut] control how often seeded material from the pool is used.)
25. (Figure: bar chart of the number of frames per outcome (not started, failed, line reached, exception thrown, reproduced) on a scale of 0 to 75, for nine configurations: no seeding, test seeding with probability 0.2/0.5/0.8/1.0, and model seeding with probability 0.2/0.5/0.8/1.0.)
26. Model Seeding: Results
- Behavioral seeding outperforms test seeding and no seeding
- 13 more crashes could be reproduced
- No performance overhead
- Extra crashes are more complex (higher frame levels)
- Better at "industrial" cases
- Model seeding outperforms test seeding
27. Implementation: Botsing
- Open-source implementation of the EvoCrash approach
- Uses EvoSuite as a library for instrumentation
- Extensible, modular, tested
- Used as a test bed for new crash reproduction ideas (e.g., model seeding)
- https://github.com/STAMP-project/botsing