We present a technique for syntax analysis of a regular set of input strings. This problem is relevant for analysis of string-embedded languages when a host program generates clauses of embedded language at run time. Our technique is based on a generalization of RNGLR algorithm, which, inherently, allows us to construct a finite representation of parse forest for regularly approximated set of input strings. This representation can be further utilized for semantic analysis and transformations in the context of reengineering, code maintenance, program understanding etc. The approach in question implements relaxed parsing: non-recognized strings in approximation set are ignored with no error detection.
1 of 29
Download to read offline
More Related Content
Relaxed Parsing of Regular Approximations of String-Embedded Languages
1. Relaxed Parsing of Regular Approximations of
String-Embedded Languages
Author: Ekaterina Verbitskaia
Saint Petersburg State University
JetBrains Programming Languages and Tools Lab
26/08/2015
Ekaterina Verbitskaia (SPbSU) 26/08/2015 1 / 27
2. String-embedded code
Embedded SQL
SqlCommand myCommand = new SqlCommand(
"SELECT * FROM table WHERE Column = @Param2",
myConnection);
myCommand.Parameters.Add(myParam2);
Dynamic SQL
IF @X = @Y
SET @TBL = ¡¯ #table1 ¡¯
ELSE
SET @TBL = ¡¯ table2 ¡¯
SET @S = ¡¯SELECT x FROM¡¯ + @TBL + ¡¯WHERE ISNULL(n,0) > 1¡¯
EXECUTE (@S)
Ekaterina Verbitskaia (SPbSU) 26/08/2015 2 / 27
3. Problems
String-embedded code are expressions in some programming language
It may be necessary to support them in IDE: code highlighting,
autocomplete, refactorings
It may be necessary to transform them: migration of legacy software to
new platforms
It may be necessary to detect vulnerabilities in such code
Any other problems of programming languages can occur
Ekaterina Verbitskaia (SPbSU) 26/08/2015 3 / 27
4. Static analysis of string-embedded code
Performed without programm execution
Checks that the set of properties holds for each possible expression
value
Undecidable for string-embedded code in the general case
The set of possible expression values is over approximated and then
the approximation is analysed
Ekaterina Verbitskaia (SPbSU) 26/08/2015 4 / 27
5. Existing tools
PHP String Analyzer, Java String Analyzer, Alvor
Static analyzers for PHP, Java, and SQL embedded into Java
respectively
Kyung-Goo Doh et al.
Checks syntactical correctness of embedded code
PHPStorm
IDE for PHP with support of HTML, CSS, JavaScript
IntelliLang
PHPStorm and IDEA plugin, supports various languages
STRANGER
Vulnerability detection of PHP
Flaws
Limited functionality
Hard to extend them with new features or support new languages
Do not create structural representation of code
Ekaterina Verbitskaia (SPbSU) 26/08/2015 5 / 27
6. Static analysis of string-embedded code: the scheme
Identification of hotspots: points of interest, where the analysis is
desirable
Approximation construction
Lexical analysis
Syntactic analysis
Semantic analysis
Ekaterina Verbitskaia (SPbSU) 26/08/2015 6 / 27
7. Static analysis of string-embedded code: the scheme
Code: hotspot is marked string res = "";
for(i = 0; i < l; i++)
res = "()" + res;
use(res);
Possible values {"", "()", "()()", ..., "()"^l}
Regular approximation ("()")*
Approximation
Ekaterina Verbitskaia (SPbSU) 26/08/2015 7 / 27
8. Static analysis of string-embedded code: the scheme
Approximation
After lexing
Grammar
start ::= s
s ::= LBR s RBR s
s ::= ?
Parse forest
start_rule
prod 2
s !
prod 0 prod 0
LBR s RBR
eps
LBR s RBR
eps
Ekaterina Verbitskaia (SPbSU) 26/08/2015 8 / 27
9. Problem statement
The aim is to develop the algorithm suitable for syntactic analysis of
string-embedded code
Tasks:
Develop an algorithm for parsing of regular approximation of
embedded code which produce a finite parse forest
Parse forest should contain a parse tree for every correct (w.r.t.
reference grammar) string accepted by the input automaton
Incorrect strings should be omitted: no error detection
An algorithm should not depend on the language of the host program
and the language of embedded code
Ekaterina Verbitskaia (SPbSU) 26/08/2015 9 / 27
10. Algorithm
Input: reference DCF grammar G and DFA graph with no
?-transitions over the alphabeth of terminals of G
Output: finite representation of the trees corresponding to all correct
string accepted by input automaton
Ekaterina Verbitskaia (SPbSU) 26/08/2015 10 / 27
11. Right-Nulled Generalized LR algorithm
RNGLR processes context free grammars
In case when LR conflicts occur, parses in each possible way
Shift/Reduce conflict
Reduce/Reduce conflict
Uses data structures which reduce memory consumption and
guarantee appropriate time of analysis
Ekaterina Verbitskaia (SPbSU) 26/08/2015 11 / 27
12. RNGLR data structures: Graph-Structured Stack
0
2
a
3
b
9
B
4
A
6
a
8a
10
A
5
a
8
a
a
11a
1
S
Ekaterina Verbitskaia (SPbSU) 26/08/2015 12 / 27
13. RNGLR operations: shift
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
Ekaterina Verbitskaia (SPbSU) 26/08/2015 13 / 27
14. RNGLR operations: shift
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
Ekaterina Verbitskaia (SPbSU) 26/08/2015 13 / 27
15. RNGLR operations: reduce
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
Ekaterina Verbitskaia (SPbSU) 26/08/2015 14 / 27
16. RNGLR operations: reduce
0 2
a
3b
9
B
4
A
6
a
8
a
10
A
5
a
8
a
a
11
a
0
2
a
3
b
9
B
4
A
6
a
8a
10
A
5
a
8
a
a
11a
1
S
Ekaterina Verbitskaia (SPbSU) 26/08/2015 14 / 27
17. RNGLR
Read the input sequentially
Process all reductions
Shift the next token
Each time new vertex is added to the GSS, shift is calculated
Each time new edge is created in the GSS, reductions are calculated
Ekaterina Verbitskaia (SPbSU) 26/08/2015 15 / 27
18. Algorithm
Traverse the automaton graph and sequentially construct GSS,
similarly as in RNGLR
New type of ¡°conflict¡±: Shift/Shift
The set of LR-states is associated with each vertex of input graph
The order in which the vertices of input graph are traversed is
controled with a queue. Whenever new edge is added to GSS, its tail
vertex is enqueued
The algorithm implements relaxed parsing: errors are not detected,
erroneous strings are ignored
Ekaterina Verbitskaia (SPbSU) 26/08/2015 16 / 27
24. Parse forest construction
Shared Packed Parse Forest ¡ª graph, in which all derivation trees are
merged
Constructed in the same manner as in the RNGLR-algorithm
Fragments of derivation trees are associated with GSS edges
Each time shift is processed, new tree of one terminal is created
Each time reduction is processed, new tree with children accumulated
along the paths is created
Children are not copied, they are reused
The root of resulting tree is associated with GSS edge, corresponding
to reduction to the starting nonterminal
Unreachable vertices are deleted from resulting graph
Ekaterina Verbitskaia (SPbSU) 26/08/2015 22 / 27
25. Algorithm: correctness
Correct tree ¡ª derivation tree of
some string accumulated along
the path in the input graph
n start
n s
t LBR n s t RBR n s
eps t LBR n s t RBR
eps
Ekaterina Verbitskaia (SPbSU) 26/08/2015 23 / 27
26. Algorithm: correctness
Theorem (Termination)
Algorithm terminates for any input
Theorem (Correctness)
Every tree, generated from SPPF, is correct
Theorem (Correctness)
For every path p in the inner graph, recognized w.r.t. reference grammar, a
correct tree corresponding to p can be generated from SPPF
Ekaterina Verbitskaia (SPbSU) 26/08/2015 24 / 27
27. Implementation
The algorithm is implemented as a part of YaccConstructor project
using F# programming language
The generator of RNGLR parse tables and data structures for GSS and
SPPF are reused
Ekaterina Verbitskaia (SPbSU) 26/08/2015 25 / 27
28. Evaluation
The data is taken from the project of migration from MS-SQL to
Oracle Server
2,7 lines of code, 2430 queries, 2188 successfully processed
The number of queries which previously could not be processed
because of timeout is decreased from 45 to 1
Ekaterina Verbitskaia (SPbSU) 26/08/2015 26 / 27
29. Conclusion
The algorithm for parsing of regular approximation of dynamically
generated string which constructs the finite representation of parse
forest is developped
Its termination and correctness are proved
The algorithm is implemented as a part of YaccConstructor project
The evaluation demonstrated it could be used for complex tasks
Ekaterina Verbitskaia (SPbSU) 26/08/2015 27 / 27