A brief introduction to some Program Query Languages and tools, part of a larger course on Programming Paradigms, taught at UCLouvain university in Belgium by Prof. Kim Mens.
1 of 44
Download to read offline
More Related Content
A Brief Overview of (Static) Program Query Languages
2. De
fi
nition
A Program Query Language (PQL)
is a language or tool for querying information
from some program representation:
program source code text
binary code
execution traces
abstract syntax trees
control-
fl
ow graphs
data-
fl
ow graphs
These queries can reason about:
program structure
coding idioms, control
fl
ow,
class hierarchies,
program behaviour
function calls,
variable accesses,
patterns that indicate
potential issues such as
code smells, vulnerabilities
4. Maintenance : detect issues in the code that may require refactoring such as bad
smells, duplicated code, deprecated API usage
Testing : identify speci
fi
Veri
fi
Debugging : identify problematic code patterns (anti-patterns, known bugs) that
might lead to bugs or vulnerabilities
Teaching : provide personalised feedback to students by identifying issues,
suggesting improvements, or highlight excellence in their code
Relevance for Software Quality Assurance
5. Static vs. Dynamic PQLs
Static Program Query Languages
(SPQL)
reason over a static representation of
software programs
analyse the program without executing it
mostly focus on structural issues (but to
a limited extent behavioural ones too)
to improve code quality,
fi
nd possible
bugs, detect security vulnerabilities, or
violations of coding standards
Dynamic Program Query Languages
reason over a dynamic program
representation, obtained through code
execution
might not cover all possible execution
traces
mostly focus on behavioural issues that
only manifest at runtime
to monitor system behaviour, memory
usage, and performance issues
6. (Some) Static PQLs
PQL Program Representation Program Queries
CodeQL
Code Property Graph : a graph database that combines
elements of Abstract Syntax Trees (ASTs), Control Flow
Graphs (CFGs), and data dependency graphs.
SQL-like language to query CodeQL's
Code Property Graph
XPath
XQuery
XML documents
(e.g. srcML for source code)
Queries that navigate the hierarchical
structure of an XML document
Regular
Expressions
Text
(e.g. program source code)
Matching strings based on speci
fi
c
patterns
AST traversal
Python source code parsed into an AST
(e.g., using Pythons built-in AST module;
or using astroid library for AST parsing,
static analysis and type inference)
Parse tree visitors / walkers
PyLint
(or other lint tools)
Prede
fi
ned rules + possibility to create
your own customized rules (e.g., by
writing your own parse tree visitor)
Pyttern
Pyttern patterns, a Python-like
language with special wildcards
Logic
Metaprogramming
Code as logic facts extracted from the (parse tree)
structure of the source code
Logic queries
8. Coding Idioms
Conventional way of accomplishing particular programming tasks
Using language-speci
fi
c features and best practices
to achieve clear, e
ffi
cient and simple code
Not random patterns but generally recognised and widely adopted
solutions among experienced programmers within the community
Di
ff
er from design patterns in their focus on language-speci
fi
c solutions
to small-scale tasks or problems, rather than architectural issues
9. Coding Flaws
Broad spectrum of errors, inadequacies, or instances of poor practice
within program code
Not only syntax errors but also more subtle problems such as poor code
smells, problematic coding idioms, anti-patterns, and practices stemming
from misconceptions held by novices
Highlight areas in code that, while not necessarily leading to immediate
execution errors, represent a departure from best coding practices
Signal potential underlying problems in program logic, structure, readability
or maintainability
May indicate a deeper lack of understanding or pro
fi
ciency in coding
principles and programming language concepts
10. Accumulator Pattern
Common coding idiom to accumulate
sequence of values into single result
Using for-loop to accumulate results in
a variable as it iterates through the
items
Typical use case is summing numbers
Can be adapted for other operations:
product accumulation,
concatenating a string,
def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10
11. Accumulator Pattern
Write a program that
approximates the value of using
the Gregory-Leibniz series
Expected solution makes use of a
variant of the accumulator pattern
Code on the right is
fl
awed (why?)
Writing a pattern that detects the
accumulator pattern can help
identifying solutions that dont
def approx_pi(n):
var = 0
if n > 0 :
for i in range(n):
var = (-1)**i / (2*i + 1)
var *=4
return var
12. CodeQL
Code Query Language
Formerly known as SemmleQL
Code analysis tool currently maintained by Microsoft's GitHub
Treats code as data by converting the code into database of facts and
relationships
Allows developers to write declarative SQL-like queries over this database
Mostly used for security checks to
fi
nd potential bugs and vulnerabilities in code
13. CodeQL
Find all if-statements in a C++ program that directly contain a return statement
CodeQL allows to express sophisticated analyses of program behaviour in a way
that is both powerful and relatively accessible to those familiar with SQL.
import cpp
from IfStmt ifstmt, ReturnStmt ret
where ifstmt.getThen() = ret.getParent()
select ifstmt, "This 'if' statement directly contains a 'return'."
import CodeQL library to analyse C++ code
declare variables for if- and return-statements
condition to match: return statements
whose parent is body of an if-statement.
de
fi
nes what the query will return
+ message indicating what the if-statement does
14. CodeQL accumulator pattern
import python
from Function f,
Name n1,
Name n2,
Name n3,
Assign a,
File file,
Module m,
Return r,
For for,
AugAssign aa,
Variable v
where f.contains(a)
and a.getTarget(0) = n1
and a.getValue() instanceof Num
and n1.getVariable() = v
and m.contains(f)
and m.getFile() = file
and f.contains(for)
and for.contains(aa)
and aa.getTarget() = n2
and n2.getVariable() = v
and f.contains(r)
and r.getValue() = n3
and n3.getVariable() = v
select file, f, v, a, aa, r, "Assignment of a number to a variable"
arguably not the most readable
15. Regular Expressions
A regex is a sequence of characters that de
fi
nes a search pattern
Can be used for text processing and parsing & string searching
Can be used as basic PQL by pattern matching over source code text
Only works on a textual level; no explicit representation of underlying code
structure => more limited analysis capabilities
The re module in Python is a powerful library for string manipulation using
regular expressions
16. regex example
def append_to(element, to=[]) :
to.append(element)
return to
append_to(5)
append_to(6)
append_to(7)
list = append_to(8)
>>> print(list)
[5, 6, 7, 8] When append_to is called multiple times without specifying
the second argument, it keeps accumulating elements from
all the calls, which is often not the intended behavior.
Practical example of using regex to
analyse Python source code:
Detecting pattern of default arguments in
function de
fi
nitions.
Common stylistic breach/bug in Python
code, as default arguments retain their
state between function calls, which may
lead to unexpected behavior.
17. regex example
To detect this pattern, we can write a regex that matches function de
fi
nitions with default
arguments:
defs+w+(.*=s*([]|{}|set())
Let's break down this regular expression:
defs+w+( matches start of function de
fi
nition, including def followed by one or more
whitespace characters, followed by one or more word characters (function name), and an opening
parenthesis.
.*=s* matches zero or more of any character (non-greedy) followed by an equals sign =,
followed by zero or more whitespace characters. This part looks for assignment of the default
value
([]|{}|set()) matches empty list [], empty dictionary {}, or empty set set() as default argument.
def append_to(element, to=[]
18. regex example
import re
# Sample Python code
python_code = """
def append_to_list(item, lst=[]):
lst.append(item)
return lst
def add_to_set(element, s=set()):
s.add(element)
return s
def add_to_set(): # no default =[] here
s.add(element)
return s
"""
# Regular expression to
fi
nd mutable default arguments
pattern = r"defs+w+(.*=s*([]|{}|set())"
# Find matches in the sample code
matches = re.
fi
nditer(pattern, python_code)
# Print matches
for match in matches:
print("Found a function with a mutable default argument:", match.group())
Found a function with a mutable default argument: def append_to_list(item, lst=[]
Found a function with a mutable default argument: def add_to_set(element, s=set()
Found a function with a mutable default argument: def add_to_set(): # no default =[]
Beware of false positives !
19. Not very readable and still doesnt match the intended pattern exactly
(Why not? Can you improve it?)
regex accumulator pattern
defs+(?P<fun_name>.*)(.*):n
s*(?P<var>.*)s*=s*d+(?:n|.)*
for.*:(?:n|.)*
(?P<var>)s*+=s*d(?:n|.)*
return.*(?P<var>).*
n
def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10
20. AST
x = x + 1
print("Hello World")
Module
List
body
Assign
Name
BinOp
List
Constant
Add
Name
0
targets value
left op right
0
Expr
Call
Constant
Name List
1
value
func args
0
types_ignore
type_comment
x Store x Load
keywords
List
1
id ctx id ctx id kind
print Store
id ctx
"Hello
World"
id kind
21. AST Traversal with Visitor DP
Node
FunctionDef Assign AugAssign
For Return
NodeVisitor
generic_visit(Node)
Accumulator
visit_FunctionDef(FunctionDef)
visit_Assign(Assign)
visit_AugAssign(AugAssign)
visit_For(For)
visit_Return(Return)
...
Hierarchy of AST nodes to be visited
Speci
fi
c visitor per pattern to be detected
Well-known design pattern
but requires quite some boilerplate code to be written
25. AST accumulator pattern
def visit_Return(self, node: ast.Return) -> None:
print(self._saved_assign[-1])
if not isinstance(node.value, ast.Name):
self._has_acc = False
elif node.value.id not in self._saved_assign[-1]:
self._has_acc = False
self.generic_visit(node)
with open(/buggy-code/source/factor_list.py) as file:
tree = ast.parse(file.read(), file.name)
visitor = Accumulator()
visitor.visit(tree)
if not visitor._has_acc:
print("miss Accumulator")
else:
print("Accumulator")
26. PyLint
PyLint is a widely adopted static code analysis tool for Python projects
Focus on code style and quality issues: to enforce coding standards, look
for code smells, and can make suggestions about where/how to refactor
Easy to install and integrated in many Python IDEs
Comes with numerous pre-existing checks, is highly con
fi
gurable and
o
ff
ers customizable linting rules for code quality checks.
Permits to de
fi
ne your own custom checks by writing an AST visitor.
27. Pylint accumulator pattern
import astroid
from astroid import nodes
from typing import TYPE_CHECKING, Optional
from pylint.checkers import BaseChecker
if TYPE_CHECKING:
from pylint.lint import PyLinter
class AccumulatorChecker(BaseChecker):
name = "no-acculumator"
msgs = {
"W0002": (
"Doesn't use an accumulator variable.",
"no-accumulator",
"All functions should use an accumulator variable.",
),
}
...
Using the astroid AST module
29. ...
def visit_augassign(self, node: nodes.AugAssign) -> None:
in_loop = False
for ancestor in node.node_ancestors():
if isinstance(ancestor, (nodes.For, nodes.While)):
in_loop = True
break
if isinstance(ancestor, nodes.FunctionDef):
break
if in_loop:
if node.target.name in self._function_stack[-1]:
self._saved_assign[-1].append(node.target.name)
def visit_return(self, node: nodes.Return) -> None:
if not isinstance(node.value, nodes.Name):
self.add_message("no-accumulator", node=node)
else:
if node.value.name not in self._saved_assign[-1]:
self.add_message("no-accumulator", node=node)
Pylint accumulator pattern
30. Logic Metaprogramming (LMP)
Use of a logic programming language (e.g., Prolog) as SPQL
Operating on a representation of the program's abstract syntax tree (AST)
rei
fi
ed as facts and rules in a logic knowledge base
To declare logic predicates that reason over the program structure
Making use of the full power of logic programming (resolution, uni
fi
cation,
recursion, ) to express patterns in a high-level, declarative manner
31. Logic Metaprogramming (LMP)
Represent all nodes in the program AST as logical facts in the following format :
node_def(Type, Children, ValueTypes)
node_val(Type, Node_Id, Parent_Id, Child_Ids, Values)
Example :
node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno])
node_val(call,12,11,[[13],15,[]],[2,0,21,2])
type of AST node :
functiondef, assign, constant, for, call,
unique ID for this
particular node
ID of this nodes
parent node
34. Logic Metaprogramming (LMP)
Additional predicates are prede
fi
ned to avoid using this representation directly.
For example, id_type(Id,functiondef)
fi
nds all Ids of function de
fi
nition nodes
and info(Id, 'name', Name) yields the value of the name attribute of the node Id.
id_type(?Id, ?Type) Type is the AST node type of the AST node with identi
fi
er Id
info(+Id, +Name, -Info) gets Info for the attribute Name of the node with given Id
info(+Type, ?Id, +Name, -Info) same as above but guarantees that the node is of given Type
children(-Children, +Id) Children is list of identi
fi
ers of the children of the node with given Id
parent(-Parent_Id, +Id) Parent_Id is the identi
fi
er of the parent of the node with identi
fi
er Id
descendant(?Child_Id, ?Parent_Id) node Child_ID is (direct of indirect) descendant of node Parent_ID
35. Logic Metaprogramming (LMP)
In terms of which yet more readable predicates can de de
fi
ned.
ast_functiondef(Id) :- id_type(Id, functiondef).
36. LMP example
Predicate to
fi
nd Name of each function that contains a list of at least three hard-coded
constants inside.
contains_hardcoded list(-Name) :-
id_type(List_Id, 'list'),
info(List_Id, 'elts', Children_List),
findall( Child,
(member(Child, Children_List), id_type(Child,'constant')),
Constant_Children),
length(Constant_Children, N_Constant), N_Constant >= 3,
descendant(List_Id, _, Function_Id, 'functiondef'),
info(Function_Id, 'name', Name).
This is a (bad) pattern that some students try to use to
trick an autograder by hardcoding rather than
calculating expected solutions to an exercise.
37. Pyttern
A program query language that is easy to learn, read and write for non-expert
developers or software engineers
A simple query language for Python programs with a syntax that remains close
to Python
essentially Python syntax extended with wildcards (representing holes in
the program)
inspired by regex and SCRUPLE
yet allows for the easy declaration and detection of su
ffi
ciently complex coding
idioms
38. Pyttern Wildcards
Wildcard Meaning
? Matches a single node in the AST
?* Matches zero or more nodes in an AST list node
?{name} Matches a single node and names it for later reference
?: Matches the body of this wildcard at one nesting level
?*: Matches the body of this wildcard at any nesting level
39. Pyttern Python
def foo(?*):
?*
?*:
?*
for ? in range(?*):
?*
if ? == 0:
?*
?*
?*
?*
def foo(lst):
counter = 0
for i in range(len(lst)):
if lst[i] == 0:
counter += 1
return counter
43. Limitations of (some) PQLs
False positives:
reporting a problem that isn't one
False negatives:
failing to detect actual problems
Complexity/usability:
limited adoption due to required expertise
and steep learning curve
Depth of analysis:
limited expresiveness to declare
more than just super
fi
cial patterns
Performance:
too computationally intensive; no immediate
feedback when analysing large codebases
44. Description of the Experiment
Numerous program querying tools and languages exist that can be used
for detecting structural patterns in program source code
Some of them are more complex to use than others or have a steeper
learning curve
Some of them are more expressive (in the kind of patterns they can
detect) than others
We need your help to compare the relative expressiveness and usability
from di
ff
erent points of view of various existing program query languages