際際滷

際際滷Share a Scribd company logo
Program Query Languages
Kim Mens
April 2024
De
fi
nition
A Program Query Language (PQL)
is a language or tool for querying information
from some program representation:
program source code text
binary code
execution traces
abstract syntax trees
control-
fl
ow graphs
data-
fl
ow graphs
These queries can reason about:
program structure
coding idioms, control
fl
ow,
class hierarchies, 
program behaviour
function calls,
variable accesses, 
patterns that indicate
potential issues such as
code smells, vulnerabilities
Purpose
impact
program
checking
adherence
identifying
improving
Maintenance : detect issues in the code that may require refactoring such as bad
smells, duplicated code, deprecated API usage
Testing : identify speci
fi
Veri
fi
Debugging : identify problematic code patterns (anti-patterns, known bugs) that
might lead to bugs or vulnerabilities
Teaching : provide personalised feedback to students by identifying issues,
suggesting improvements, or highlight excellence in their code
Relevance for Software Quality Assurance
Static vs. Dynamic PQLs
Static Program Query Languages
(SPQL)
reason over a static representation of
software programs
analyse the program without executing it
mostly focus on structural issues (but to
a limited extent behavioural ones too)
to improve code quality,
fi
nd possible
bugs, detect security vulnerabilities, or
violations of coding standards
Dynamic Program Query Languages
reason over a dynamic program
representation, obtained through code
execution
might not cover all possible execution
traces
mostly focus on behavioural issues that
only manifest at runtime
to monitor system behaviour, memory
usage, and performance issues
(Some) Static PQLs
PQL Program Representation Program Queries
CodeQL
Code Property Graph : a graph database that combines
elements of Abstract Syntax Trees (ASTs), Control Flow
Graphs (CFGs), and data dependency graphs.
SQL-like language to query CodeQL's
Code Property Graph
XPath
XQuery
XML documents
(e.g. srcML for source code)
Queries that navigate the hierarchical
structure of an XML document
Regular
Expressions
Text
(e.g. program source code)
Matching strings based on speci
fi
c
patterns
AST traversal
Python source code parsed into an AST
(e.g., using Pythons built-in AST module;
or using astroid library for AST parsing,
static analysis and type inference)
Parse tree visitors / walkers
PyLint
(or other lint tools)
Prede
fi
ned rules + possibility to create
your own customized rules (e.g., by
writing your own parse tree visitor)
Pyttern
Pyttern patterns, a Python-like
language with special wildcards
Logic
Metaprogramming
Code as logic facts extracted from the (parse tree)
structure of the source code
Logic queries
Coding Patterns
Coding Idioms
Coding Flaws
Coding Idioms
Conventional way of accomplishing particular programming tasks
Using language-speci
fi
c features and best practices
to achieve clear, e
ffi
cient and simple code
Not random patterns but generally recognised and widely adopted
solutions among experienced programmers within the community
Di
ff
er from design patterns in their focus on language-speci
fi
c solutions
to small-scale tasks or problems, rather than architectural issues
Coding Flaws
Broad spectrum of errors, inadequacies, or instances of poor practice
within program code
Not only syntax errors but also more subtle problems such as poor code
smells, problematic coding idioms, anti-patterns, and practices stemming
from misconceptions held by novices
Highlight areas in code that, while not necessarily leading to immediate
execution errors, represent a departure from best coding practices
Signal potential underlying problems in program logic, structure, readability
or maintainability
May indicate a deeper lack of understanding or pro
fi
ciency in coding
principles and programming language concepts
Accumulator Pattern
Common coding idiom to accumulate
sequence of values into single result
Using for-loop to accumulate results in
a variable as it iterates through the
items
Typical use case is summing numbers
Can be adapted for other operations:
product accumulation,
concatenating a string, 
def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10
Accumulator Pattern
Write a program that
approximates the value of  using
the Gregory-Leibniz series
Expected solution makes use of a
variant of the accumulator pattern
Code on the right is
fl
awed (why?)
Writing a pattern that detects the
accumulator pattern can help
identifying solutions that dont
def approx_pi(n):
var = 0
if n > 0 :
for i in range(n):
var = (-1)**i / (2*i + 1)
var *=4
return var
CodeQL
Code Query Language
Formerly known as SemmleQL
Code analysis tool currently maintained by Microsoft's GitHub
Treats code as data by converting the code into database of facts and
relationships
Allows developers to write declarative SQL-like queries over this database
Mostly used for security checks to
fi
nd potential bugs and vulnerabilities in code
CodeQL
Find all if-statements in a C++ program that directly contain a return statement
CodeQL allows to express sophisticated analyses of program behaviour in a way
that is both powerful and relatively accessible to those familiar with SQL.
import cpp
from IfStmt ifstmt, ReturnStmt ret
where ifstmt.getThen() = ret.getParent()
select ifstmt, "This 'if' statement directly contains a 'return'."
import CodeQL library to analyse C++ code
declare variables for if- and return-statements
condition to match: return statements
whose parent is body of an if-statement.
de
fi
nes what the query will return
+ message indicating what the if-statement does
CodeQL accumulator pattern
import python
from Function f,
Name n1,
Name n2,
Name n3,
Assign a,
File file,
Module m,
Return r,
For for,
AugAssign aa,
Variable v
where f.contains(a)
and a.getTarget(0) = n1
and a.getValue() instanceof Num
and n1.getVariable() = v
and m.contains(f)
and m.getFile() = file
and f.contains(for)
and for.contains(aa)
and aa.getTarget() = n2
and n2.getVariable() = v
and f.contains(r)
and r.getValue() = n3
and n3.getVariable() = v
select file, f, v, a, aa, r, "Assignment of a number to a variable"
arguably not the most readable
Regular Expressions
A regex is a sequence of characters that de
fi
nes a search pattern
Can be used for text processing and parsing & string searching
Can be used as basic PQL by pattern matching over source code text
Only works on a textual level; no explicit representation of underlying code
structure => more limited analysis capabilities
The re module in Python is a powerful library for string manipulation using
regular expressions
regex example
def append_to(element, to=[]) :
to.append(element)
return to
append_to(5)
append_to(6)
append_to(7)
list = append_to(8)
>>> print(list)
[5, 6, 7, 8] When append_to is called multiple times without specifying
the second argument, it keeps accumulating elements from
all the calls, which is often not the intended behavior.
Practical example of using regex to
analyse Python source code:
Detecting pattern of default arguments in
function de
fi
nitions.
Common stylistic breach/bug in Python
code, as default arguments retain their
state between function calls, which may
lead to unexpected behavior.
regex example
To detect this pattern, we can write a regex that matches function de
fi
nitions with default
arguments:
defs+w+(.*=s*([]|{}|set())
Let's break down this regular expression:
defs+w+( matches start of function de
fi
nition, including def followed by one or more
whitespace characters, followed by one or more word characters (function name), and an opening
parenthesis.
.*=s* matches zero or more of any character (non-greedy) followed by an equals sign =,
followed by zero or more whitespace characters. This part looks for assignment of the default
value
([]|{}|set()) matches empty list [], empty dictionary {}, or empty set set() as default argument.
def append_to(element, to=[]
regex example
import re
# Sample Python code
python_code = """
def append_to_list(item, lst=[]):
lst.append(item)
return lst
def add_to_set(element, s=set()):
s.add(element)
return s
def add_to_set(): # no default =[] here
s.add(element)
return s
"""
# Regular expression to
fi
nd mutable default arguments
pattern = r"defs+w+(.*=s*([]|{}|set())"
# Find matches in the sample code
matches = re.
fi
nditer(pattern, python_code)
# Print matches
for match in matches:
print("Found a function with a mutable default argument:", match.group())
Found a function with a mutable default argument: def append_to_list(item, lst=[]
Found a function with a mutable default argument: def add_to_set(element, s=set()
Found a function with a mutable default argument: def add_to_set(): # no default =[]
Beware of false positives !
Not very readable and still doesnt match the intended pattern exactly
(Why not? Can you improve it?)
regex accumulator pattern
defs+(?P<fun_name>.*)(.*):n
s*(?P<var>.*)s*=s*d+(?:n|.)*
for.*:(?:n|.)*
(?P<var>)s*+=s*d(?:n|.)*
return.*(?P<var>).*
n
def accumulate(lst) :
accum = 0
for item in lst :
accum += item
return accum
>>> accumulate([1,2,3,4])
10
AST
x = x + 1
print("Hello World")
Module
List
body
Assign
Name
BinOp
List
Constant
Add
Name
0
targets value
left op right
0
Expr
Call
Constant
Name List
1
value
func args
0
types_ignore
type_comment
x Store x Load
keywords
List
1
id ctx id ctx id kind
print Store
id ctx
"Hello
World"
id kind
AST Traversal with Visitor DP
Node
FunctionDef Assign AugAssign
For Return 
NodeVisitor
generic_visit(Node)
Accumulator
visit_FunctionDef(FunctionDef)
visit_Assign(Assign)
visit_AugAssign(AugAssign)
visit_For(For)
visit_Return(Return)
...
Hierarchy of AST nodes to be visited
Speci
fi
c visitor per pattern to be detected
Well-known design pattern
but requires quite some boilerplate code to be written
AST accumulator pattern
import ast
class Accumulator(ast.NodeVisitor):
def __init__(self) -> None:
super().__init__()
self._function_stack = []
self._saved_assign = []
self._in_loop = False
self._has_acc = True
def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
self._function_stack.append([])
self._saved_assign.append([])
self.generic_visit(node)
self._function_stack.pop()
self._saved_assign.pop()
Using Pythons AST module
AST accumulator pattern
def visit_Assign(self, node: ast.Assign) -> None:
for target in node.targets:
if isinstance(target, ast.Name):
if isinstance(node.value, ast.Constant):
self._function_stack[-1].append(target.id)
self.generic_visit(node)
def visit_AugAssign(self, node: ast.AugAssign) -> None:
if self._in_loop:
if node.target.id in self._function_stack[-1]:
self._saved_assign[-1].append(node.target.id)
self.generic_visit(node)
AST accumulator pattern
def visit_For(self, node: ast.For) -> None:
self._visit_loop(node)
def visit_While(self, node: ast.For) -> None:
self._visit_loop(node)
def _visit_loop(self, node) -> None:
if not self._in_loop:
self._in_loop = True
self.generic_visit(node)
self._in_loop = False
else:
self.generic_visit(node)
AST accumulator pattern
def visit_Return(self, node: ast.Return) -> None:
print(self._saved_assign[-1])
if not isinstance(node.value, ast.Name):
self._has_acc = False
elif node.value.id not in self._saved_assign[-1]:
self._has_acc = False
self.generic_visit(node)
with open(/buggy-code/source/factor_list.py) as file:
tree = ast.parse(file.read(), file.name)
visitor = Accumulator()
visitor.visit(tree)
if not visitor._has_acc:
print("miss Accumulator")
else:
print("Accumulator")
PyLint
PyLint is a widely adopted static code analysis tool for Python projects
Focus on code style and quality issues: to enforce coding standards, look
for code smells, and can make suggestions about where/how to refactor
Easy to install and integrated in many Python IDEs
Comes with numerous pre-existing checks, is highly con
fi
gurable and
o
ff
ers customizable linting rules for code quality checks.
Permits to de
fi
ne your own custom checks by writing an AST visitor.
Pylint accumulator pattern
import astroid
from astroid import nodes
from typing import TYPE_CHECKING, Optional
from pylint.checkers import BaseChecker
if TYPE_CHECKING:
from pylint.lint import PyLinter
class AccumulatorChecker(BaseChecker):
name = "no-acculumator"
msgs = {
"W0002": (
"Doesn't use an accumulator variable.",
"no-accumulator",
"All functions should use an accumulator variable.",
),
}
...
Using the astroid AST module
Pylint accumulator pattern
...
def __init__(self, linter: Optional["PyLinter"] = None) -> None:
super().__init__(linter)
self._function_stack = []
self._saved_assign = []
def visit_functiondef(self, node: nodes.FunctionDef) -> None:
print("visit_functiondef")
self._function_stack.append([])
self._saved_assign.append([])
def leave_functiondef(self, node: nodes.FunctionDef) -> None:
self._function_stack.pop()
self._saved_assign.pop()
def visit_assign(self, node: nodes.Assign) -> None:
for target in node.targets:
if isinstance(target, nodes.AssignName):
if isinstance(node.value, nodes.Const):
self._function_stack[-1].append(target.name)
...
def visit_augassign(self, node: nodes.AugAssign) -> None:
in_loop = False
for ancestor in node.node_ancestors():
if isinstance(ancestor, (nodes.For, nodes.While)):
in_loop = True
break
if isinstance(ancestor, nodes.FunctionDef):
break
if in_loop:
if node.target.name in self._function_stack[-1]:
self._saved_assign[-1].append(node.target.name)
def visit_return(self, node: nodes.Return) -> None:
if not isinstance(node.value, nodes.Name):
self.add_message("no-accumulator", node=node)
else:
if node.value.name not in self._saved_assign[-1]:
self.add_message("no-accumulator", node=node)
Pylint accumulator pattern
Logic Metaprogramming (LMP)
Use of a logic programming language (e.g., Prolog) as SPQL
Operating on a representation of the program's abstract syntax tree (AST)
rei
fi
ed as facts and rules in a logic knowledge base
To declare logic predicates that reason over the program structure
Making use of the full power of logic programming (resolution, uni
fi
cation,
recursion, ) to express patterns in a high-level, declarative manner
Logic Metaprogramming (LMP)
Represent all nodes in the program AST as logical facts in the following format :
node_def(Type, Children, ValueTypes)
node_val(Type, Node_Id, Parent_Id, Child_Ids, Values)
Example :
node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno])
node_val(call,12,11,[[13],15,[]],[2,0,21,2])
type of AST node :
functiondef, assign, constant, for, call, 
unique ID for this
particular node
ID of this nodes
parent node
Logic Metaprogramming (LMP)
node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno])
node_val(call,12,11,[[13],15,[]],[2,0,21,2])
node_def node_val
node_def(module,[body,type_ignores],[])
node_def(assign,[value,targets,type_comment],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(binop,[op,left,right],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(add,[],[])
node_def(name,[ctx],
[id,lineno,col_offset,end_col_offset,end_lineno])
node_def(load,[],[])
node_def(constant,[kind],
[value,s,n,lineno,col_offset,end_col_offset,end_lineno])
node_def(null,[],[])
node_def(store,[],[])
node_def(expr,[value],
[lineno,col_offset,end_col_offset,end_lineno])
node_def(call,[args,func,keywords],
[lineno,col_offset,end_col_offset,end_lineno])
node_val(add,3,2,[],[])
node_val(load,5,4,[],[])
node_val(name,4,2,[5],[x,1,4,5,1])
node_val(null,7,6,[],[])
node_val(constant,6,2,[7],[1,1,1,1,8,9,1])
node_val(binop,2,1,[3,4,6],[1,4,9,1])
node_val(store,9,8,[],[])
node_val(name,8,1,[9],[x,1,0,1,1])
node_val(null,10,1,[],[])
node_val(assign,1,0,[2,[8],10],[1,0,9,1])
node_val(null,14,13,[],[])
node_val(constant,13,12,[14],[Hello World!,Hello
World!,Hello World!,2,6,20,2])
node_val(load,16,15,[],[])
node_val(name,15,12,[16],[print,2,0,5,2])
node_val(call,12,11,[[13],15,[]],[2,0,21,2])
node_val(expr,11,0,[12],[2,0,21,2])
node_val(module,0,-1,[[1,11],[]],[])
x = x + 1
print("Hello World)
Logic Metaprogramming (LMP)
Additional predicates are prede
fi
ned to avoid using this representation directly.
For example, id_type(Id,functiondef)
fi
nds all Ids of function de
fi
nition nodes
and info(Id, 'name', Name) yields the value of the name attribute of the node Id.
id_type(?Id, ?Type) Type is the AST node type of the AST node with identi
fi
er Id
info(+Id, +Name, -Info) gets Info for the attribute Name of the node with given Id
info(+Type, ?Id, +Name, -Info) same as above but guarantees that the node is of given Type
children(-Children, +Id) Children is list of identi
fi
ers of the children of the node with given Id
parent(-Parent_Id, +Id) Parent_Id is the identi
fi
er of the parent of the node with identi
fi
er Id
descendant(?Child_Id, ?Parent_Id) node Child_ID is (direct of indirect) descendant of node Parent_ID
Logic Metaprogramming (LMP)
In terms of which yet more readable predicates can de de
fi
ned.
ast_functiondef(Id) :- id_type(Id, functiondef).
LMP example
Predicate to
fi
nd Name of each function that contains a list of at least three hard-coded
constants inside.
contains_hardcoded list(-Name) :-
id_type(List_Id, 'list'),
info(List_Id, 'elts', Children_List),
findall( Child,
(member(Child, Children_List), id_type(Child,'constant')),
Constant_Children),
length(Constant_Children, N_Constant), N_Constant >= 3,
descendant(List_Id, _, Function_Id, 'functiondef'),
info(Function_Id, 'name', Name).
This is a (bad) pattern that some students try to use to
trick an autograder by hardcoding rather than
calculating expected solutions to an exercise.
Pyttern
A program query language that is easy to learn, read and write for non-expert
developers or software engineers
A simple query language for Python programs with a syntax that remains close
to Python
essentially Python syntax extended with wildcards (representing holes in
the program)
inspired by regex and SCRUPLE
yet allows for the easy declaration and detection of su
ffi
ciently complex coding
idioms
Pyttern Wildcards
Wildcard Meaning
? Matches a single node in the AST
?* Matches zero or more nodes in an AST list node
?{name} Matches a single node and names it for later reference
?: Matches the body of this wildcard at one nesting level
?*: Matches the body of this wildcard at any nesting level
Pyttern Python
def foo(?*):
?*
?*:
?*
for ? in range(?*):
?*
if ? == 0:
?*
?*
?*
?*
def foo(lst):
counter = 0
for i in range(len(lst)):
if lst[i] == 0:
counter += 1
return counter
Pyttern Python
function_def
[list]
body
any_wildcard
0
any_compound_wildcard
1
any_wildcard
2
[list]
body
any_wildcard
0
for
1
any_wildcard
2
simple_wildcard
target
call
iter
[list]
body
any_wildcard
0
if
1
any_wildcard
2
body
0
1
target iter body
id 0
function_def
[list]
assign
for
name call [list]
i if
def foo(?*):
?*
?*:
?*
for ? in range(?*):
?*
if ? == 0:
?*
?*
?*
?*
def foo(lst):
counter = 0
for i in range(len(lst)):
if lst[i] == 0:
counter += 1
return counter
value
id
return
name
counter
2
... ...
foo
name
[list]
any_wildcard
args
0
foo [list]
args
name
0
lst
... ...
Pyttern Python
def ?(?*):
?accumulator = 0
for ?:
?accumulator += ?
return ?accumulator
def accumulate(lst):
accum = 0
for item in lst :
accum += item
return accum
Pyttern accumulator pattern
def ?(?*):
?accumulator = 0
for ?:
?accumulator += ?
return ?accumulator
def accumulate(lst):
accum = 0
for item in lst :
accum += item
return accum
Pyttern Python
Limitations of (some) PQLs
False positives:
reporting a problem that isn't one
False negatives:
failing to detect actual problems
Complexity/usability:
limited adoption due to required expertise
and steep learning curve
Depth of analysis:
limited expresiveness to declare
more than just super
fi
cial patterns
Performance:
too computationally intensive; no immediate
feedback when analysing large codebases
Description of the Experiment
Numerous program querying tools and languages exist that can be used
for detecting structural patterns in program source code
Some of them are more complex to use than others or have a steeper
learning curve
Some of them are more expressive (in the kind of patterns they can
detect) than others
We need your help to compare the relative expressiveness and usability
from di
ff
erent points of view of various existing program query languages

More Related Content

A Brief Overview of (Static) Program Query Languages

  • 1. Program Query Languages Kim Mens April 2024
  • 2. De fi nition A Program Query Language (PQL) is a language or tool for querying information from some program representation: program source code text binary code execution traces abstract syntax trees control- fl ow graphs data- fl ow graphs These queries can reason about: program structure coding idioms, control fl ow, class hierarchies, program behaviour function calls, variable accesses, patterns that indicate potential issues such as code smells, vulnerabilities
  • 4. Maintenance : detect issues in the code that may require refactoring such as bad smells, duplicated code, deprecated API usage Testing : identify speci fi Veri fi Debugging : identify problematic code patterns (anti-patterns, known bugs) that might lead to bugs or vulnerabilities Teaching : provide personalised feedback to students by identifying issues, suggesting improvements, or highlight excellence in their code Relevance for Software Quality Assurance
  • 5. Static vs. Dynamic PQLs Static Program Query Languages (SPQL) reason over a static representation of software programs analyse the program without executing it mostly focus on structural issues (but to a limited extent behavioural ones too) to improve code quality, fi nd possible bugs, detect security vulnerabilities, or violations of coding standards Dynamic Program Query Languages reason over a dynamic program representation, obtained through code execution might not cover all possible execution traces mostly focus on behavioural issues that only manifest at runtime to monitor system behaviour, memory usage, and performance issues
  • 6. (Some) Static PQLs PQL Program Representation Program Queries CodeQL Code Property Graph : a graph database that combines elements of Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and data dependency graphs. SQL-like language to query CodeQL's Code Property Graph XPath XQuery XML documents (e.g. srcML for source code) Queries that navigate the hierarchical structure of an XML document Regular Expressions Text (e.g. program source code) Matching strings based on speci fi c patterns AST traversal Python source code parsed into an AST (e.g., using Pythons built-in AST module; or using astroid library for AST parsing, static analysis and type inference) Parse tree visitors / walkers PyLint (or other lint tools) Prede fi ned rules + possibility to create your own customized rules (e.g., by writing your own parse tree visitor) Pyttern Pyttern patterns, a Python-like language with special wildcards Logic Metaprogramming Code as logic facts extracted from the (parse tree) structure of the source code Logic queries
  • 8. Coding Idioms Conventional way of accomplishing particular programming tasks Using language-speci fi c features and best practices to achieve clear, e ffi cient and simple code Not random patterns but generally recognised and widely adopted solutions among experienced programmers within the community Di ff er from design patterns in their focus on language-speci fi c solutions to small-scale tasks or problems, rather than architectural issues
  • 9. Coding Flaws Broad spectrum of errors, inadequacies, or instances of poor practice within program code Not only syntax errors but also more subtle problems such as poor code smells, problematic coding idioms, anti-patterns, and practices stemming from misconceptions held by novices Highlight areas in code that, while not necessarily leading to immediate execution errors, represent a departure from best coding practices Signal potential underlying problems in program logic, structure, readability or maintainability May indicate a deeper lack of understanding or pro fi ciency in coding principles and programming language concepts
  • 10. Accumulator Pattern Common coding idiom to accumulate sequence of values into single result Using for-loop to accumulate results in a variable as it iterates through the items Typical use case is summing numbers Can be adapted for other operations: product accumulation, concatenating a string, def accumulate(lst) : accum = 0 for item in lst : accum += item return accum >>> accumulate([1,2,3,4]) 10
  • 11. Accumulator Pattern Write a program that approximates the value of using the Gregory-Leibniz series Expected solution makes use of a variant of the accumulator pattern Code on the right is fl awed (why?) Writing a pattern that detects the accumulator pattern can help identifying solutions that dont def approx_pi(n): var = 0 if n > 0 : for i in range(n): var = (-1)**i / (2*i + 1) var *=4 return var
  • 12. CodeQL Code Query Language Formerly known as SemmleQL Code analysis tool currently maintained by Microsoft's GitHub Treats code as data by converting the code into database of facts and relationships Allows developers to write declarative SQL-like queries over this database Mostly used for security checks to fi nd potential bugs and vulnerabilities in code
  • 13. CodeQL Find all if-statements in a C++ program that directly contain a return statement CodeQL allows to express sophisticated analyses of program behaviour in a way that is both powerful and relatively accessible to those familiar with SQL. import cpp from IfStmt ifstmt, ReturnStmt ret where ifstmt.getThen() = ret.getParent() select ifstmt, "This 'if' statement directly contains a 'return'." import CodeQL library to analyse C++ code declare variables for if- and return-statements condition to match: return statements whose parent is body of an if-statement. de fi nes what the query will return + message indicating what the if-statement does
  • 14. CodeQL accumulator pattern import python from Function f, Name n1, Name n2, Name n3, Assign a, File file, Module m, Return r, For for, AugAssign aa, Variable v where f.contains(a) and a.getTarget(0) = n1 and a.getValue() instanceof Num and n1.getVariable() = v and m.contains(f) and m.getFile() = file and f.contains(for) and for.contains(aa) and aa.getTarget() = n2 and n2.getVariable() = v and f.contains(r) and r.getValue() = n3 and n3.getVariable() = v select file, f, v, a, aa, r, "Assignment of a number to a variable" arguably not the most readable
  • 15. Regular Expressions A regex is a sequence of characters that de fi nes a search pattern Can be used for text processing and parsing & string searching Can be used as basic PQL by pattern matching over source code text Only works on a textual level; no explicit representation of underlying code structure => more limited analysis capabilities The re module in Python is a powerful library for string manipulation using regular expressions
  • 16. regex example def append_to(element, to=[]) : to.append(element) return to append_to(5) append_to(6) append_to(7) list = append_to(8) >>> print(list) [5, 6, 7, 8] When append_to is called multiple times without specifying the second argument, it keeps accumulating elements from all the calls, which is often not the intended behavior. Practical example of using regex to analyse Python source code: Detecting pattern of default arguments in function de fi nitions. Common stylistic breach/bug in Python code, as default arguments retain their state between function calls, which may lead to unexpected behavior.
  • 17. regex example To detect this pattern, we can write a regex that matches function de fi nitions with default arguments: defs+w+(.*=s*([]|{}|set()) Let's break down this regular expression: defs+w+( matches start of function de fi nition, including def followed by one or more whitespace characters, followed by one or more word characters (function name), and an opening parenthesis. .*=s* matches zero or more of any character (non-greedy) followed by an equals sign =, followed by zero or more whitespace characters. This part looks for assignment of the default value ([]|{}|set()) matches empty list [], empty dictionary {}, or empty set set() as default argument. def append_to(element, to=[]
  • 18. regex example import re # Sample Python code python_code = """ def append_to_list(item, lst=[]): lst.append(item) return lst def add_to_set(element, s=set()): s.add(element) return s def add_to_set(): # no default =[] here s.add(element) return s """ # Regular expression to fi nd mutable default arguments pattern = r"defs+w+(.*=s*([]|{}|set())" # Find matches in the sample code matches = re. fi nditer(pattern, python_code) # Print matches for match in matches: print("Found a function with a mutable default argument:", match.group()) Found a function with a mutable default argument: def append_to_list(item, lst=[] Found a function with a mutable default argument: def add_to_set(element, s=set() Found a function with a mutable default argument: def add_to_set(): # no default =[] Beware of false positives !
  • 19. Not very readable and still doesnt match the intended pattern exactly (Why not? Can you improve it?) regex accumulator pattern defs+(?P<fun_name>.*)(.*):n s*(?P<var>.*)s*=s*d+(?:n|.)* for.*:(?:n|.)* (?P<var>)s*+=s*d(?:n|.)* return.*(?P<var>).* n def accumulate(lst) : accum = 0 for item in lst : accum += item return accum >>> accumulate([1,2,3,4]) 10
  • 20. AST x = x + 1 print("Hello World") Module List body Assign Name BinOp List Constant Add Name 0 targets value left op right 0 Expr Call Constant Name List 1 value func args 0 types_ignore type_comment x Store x Load keywords List 1 id ctx id ctx id kind print Store id ctx "Hello World" id kind
  • 21. AST Traversal with Visitor DP Node FunctionDef Assign AugAssign For Return NodeVisitor generic_visit(Node) Accumulator visit_FunctionDef(FunctionDef) visit_Assign(Assign) visit_AugAssign(AugAssign) visit_For(For) visit_Return(Return) ... Hierarchy of AST nodes to be visited Speci fi c visitor per pattern to be detected Well-known design pattern but requires quite some boilerplate code to be written
  • 22. AST accumulator pattern import ast class Accumulator(ast.NodeVisitor): def __init__(self) -> None: super().__init__() self._function_stack = [] self._saved_assign = [] self._in_loop = False self._has_acc = True def visit_FunctionDef(self, node: ast.FunctionDef) -> None: self._function_stack.append([]) self._saved_assign.append([]) self.generic_visit(node) self._function_stack.pop() self._saved_assign.pop() Using Pythons AST module
  • 23. AST accumulator pattern def visit_Assign(self, node: ast.Assign) -> None: for target in node.targets: if isinstance(target, ast.Name): if isinstance(node.value, ast.Constant): self._function_stack[-1].append(target.id) self.generic_visit(node) def visit_AugAssign(self, node: ast.AugAssign) -> None: if self._in_loop: if node.target.id in self._function_stack[-1]: self._saved_assign[-1].append(node.target.id) self.generic_visit(node)
  • 24. AST accumulator pattern def visit_For(self, node: ast.For) -> None: self._visit_loop(node) def visit_While(self, node: ast.For) -> None: self._visit_loop(node) def _visit_loop(self, node) -> None: if not self._in_loop: self._in_loop = True self.generic_visit(node) self._in_loop = False else: self.generic_visit(node)
  • 25. AST accumulator pattern def visit_Return(self, node: ast.Return) -> None: print(self._saved_assign[-1]) if not isinstance(node.value, ast.Name): self._has_acc = False elif node.value.id not in self._saved_assign[-1]: self._has_acc = False self.generic_visit(node) with open(/buggy-code/source/factor_list.py) as file: tree = ast.parse(file.read(), file.name) visitor = Accumulator() visitor.visit(tree) if not visitor._has_acc: print("miss Accumulator") else: print("Accumulator")
  • 26. PyLint PyLint is a widely adopted static code analysis tool for Python projects Focus on code style and quality issues: to enforce coding standards, look for code smells, and can make suggestions about where/how to refactor Easy to install and integrated in many Python IDEs Comes with numerous pre-existing checks, is highly con fi gurable and o ff ers customizable linting rules for code quality checks. Permits to de fi ne your own custom checks by writing an AST visitor.
  • 27. Pylint accumulator pattern import astroid from astroid import nodes from typing import TYPE_CHECKING, Optional from pylint.checkers import BaseChecker if TYPE_CHECKING: from pylint.lint import PyLinter class AccumulatorChecker(BaseChecker): name = "no-acculumator" msgs = { "W0002": ( "Doesn't use an accumulator variable.", "no-accumulator", "All functions should use an accumulator variable.", ), } ... Using the astroid AST module
  • 28. Pylint accumulator pattern ... def __init__(self, linter: Optional["PyLinter"] = None) -> None: super().__init__(linter) self._function_stack = [] self._saved_assign = [] def visit_functiondef(self, node: nodes.FunctionDef) -> None: print("visit_functiondef") self._function_stack.append([]) self._saved_assign.append([]) def leave_functiondef(self, node: nodes.FunctionDef) -> None: self._function_stack.pop() self._saved_assign.pop() def visit_assign(self, node: nodes.Assign) -> None: for target in node.targets: if isinstance(target, nodes.AssignName): if isinstance(node.value, nodes.Const): self._function_stack[-1].append(target.name)
  • 29. ... def visit_augassign(self, node: nodes.AugAssign) -> None: in_loop = False for ancestor in node.node_ancestors(): if isinstance(ancestor, (nodes.For, nodes.While)): in_loop = True break if isinstance(ancestor, nodes.FunctionDef): break if in_loop: if node.target.name in self._function_stack[-1]: self._saved_assign[-1].append(node.target.name) def visit_return(self, node: nodes.Return) -> None: if not isinstance(node.value, nodes.Name): self.add_message("no-accumulator", node=node) else: if node.value.name not in self._saved_assign[-1]: self.add_message("no-accumulator", node=node) Pylint accumulator pattern
  • 30. Logic Metaprogramming (LMP) Use of a logic programming language (e.g., Prolog) as SPQL Operating on a representation of the program's abstract syntax tree (AST) rei fi ed as facts and rules in a logic knowledge base To declare logic predicates that reason over the program structure Making use of the full power of logic programming (resolution, uni fi cation, recursion, ) to express patterns in a high-level, declarative manner
  • 31. Logic Metaprogramming (LMP) Represent all nodes in the program AST as logical facts in the following format : node_def(Type, Children, ValueTypes) node_val(Type, Node_Id, Parent_Id, Child_Ids, Values) Example : node_def(call,[args,func,keywords],[lineno,col_offset,end_col_offset,end_lineno]) node_val(call,12,11,[[13],15,[]],[2,0,21,2]) type of AST node : functiondef, assign, constant, for, call, unique ID for this particular node ID of this nodes parent node
  • 33. node_def node_val node_def(module,[body,type_ignores],[]) node_def(assign,[value,targets,type_comment], [lineno,col_offset,end_col_offset,end_lineno]) node_def(binop,[op,left,right], [lineno,col_offset,end_col_offset,end_lineno]) node_def(add,[],[]) node_def(name,[ctx], [id,lineno,col_offset,end_col_offset,end_lineno]) node_def(load,[],[]) node_def(constant,[kind], [value,s,n,lineno,col_offset,end_col_offset,end_lineno]) node_def(null,[],[]) node_def(store,[],[]) node_def(expr,[value], [lineno,col_offset,end_col_offset,end_lineno]) node_def(call,[args,func,keywords], [lineno,col_offset,end_col_offset,end_lineno]) node_val(add,3,2,[],[]) node_val(load,5,4,[],[]) node_val(name,4,2,[5],[x,1,4,5,1]) node_val(null,7,6,[],[]) node_val(constant,6,2,[7],[1,1,1,1,8,9,1]) node_val(binop,2,1,[3,4,6],[1,4,9,1]) node_val(store,9,8,[],[]) node_val(name,8,1,[9],[x,1,0,1,1]) node_val(null,10,1,[],[]) node_val(assign,1,0,[2,[8],10],[1,0,9,1]) node_val(null,14,13,[],[]) node_val(constant,13,12,[14],[Hello World!,Hello World!,Hello World!,2,6,20,2]) node_val(load,16,15,[],[]) node_val(name,15,12,[16],[print,2,0,5,2]) node_val(call,12,11,[[13],15,[]],[2,0,21,2]) node_val(expr,11,0,[12],[2,0,21,2]) node_val(module,0,-1,[[1,11],[]],[]) x = x + 1 print("Hello World)
  • 34. Logic Metaprogramming (LMP) Additional predicates are prede fi ned to avoid using this representation directly. For example, id_type(Id,functiondef) fi nds all Ids of function de fi nition nodes and info(Id, 'name', Name) yields the value of the name attribute of the node Id. id_type(?Id, ?Type) Type is the AST node type of the AST node with identi fi er Id info(+Id, +Name, -Info) gets Info for the attribute Name of the node with given Id info(+Type, ?Id, +Name, -Info) same as above but guarantees that the node is of given Type children(-Children, +Id) Children is list of identi fi ers of the children of the node with given Id parent(-Parent_Id, +Id) Parent_Id is the identi fi er of the parent of the node with identi fi er Id descendant(?Child_Id, ?Parent_Id) node Child_ID is (direct of indirect) descendant of node Parent_ID
  • 35. Logic Metaprogramming (LMP) In terms of which yet more readable predicates can de de fi ned. ast_functiondef(Id) :- id_type(Id, functiondef).
  • 36. LMP example Predicate to fi nd Name of each function that contains a list of at least three hard-coded constants inside. contains_hardcoded list(-Name) :- id_type(List_Id, 'list'), info(List_Id, 'elts', Children_List), findall( Child, (member(Child, Children_List), id_type(Child,'constant')), Constant_Children), length(Constant_Children, N_Constant), N_Constant >= 3, descendant(List_Id, _, Function_Id, 'functiondef'), info(Function_Id, 'name', Name). This is a (bad) pattern that some students try to use to trick an autograder by hardcoding rather than calculating expected solutions to an exercise.
  • 37. Pyttern A program query language that is easy to learn, read and write for non-expert developers or software engineers A simple query language for Python programs with a syntax that remains close to Python essentially Python syntax extended with wildcards (representing holes in the program) inspired by regex and SCRUPLE yet allows for the easy declaration and detection of su ffi ciently complex coding idioms
  • 38. Pyttern Wildcards Wildcard Meaning ? Matches a single node in the AST ?* Matches zero or more nodes in an AST list node ?{name} Matches a single node and names it for later reference ?: Matches the body of this wildcard at one nesting level ?*: Matches the body of this wildcard at any nesting level
  • 39. Pyttern Python def foo(?*): ?* ?*: ?* for ? in range(?*): ?* if ? == 0: ?* ?* ?* ?* def foo(lst): counter = 0 for i in range(len(lst)): if lst[i] == 0: counter += 1 return counter
  • 40. Pyttern Python function_def [list] body any_wildcard 0 any_compound_wildcard 1 any_wildcard 2 [list] body any_wildcard 0 for 1 any_wildcard 2 simple_wildcard target call iter [list] body any_wildcard 0 if 1 any_wildcard 2 body 0 1 target iter body id 0 function_def [list] assign for name call [list] i if def foo(?*): ?* ?*: ?* for ? in range(?*): ?* if ? == 0: ?* ?* ?* ?* def foo(lst): counter = 0 for i in range(len(lst)): if lst[i] == 0: counter += 1 return counter value id return name counter 2 ... ... foo name [list] any_wildcard args 0 foo [list] args name 0 lst ... ...
  • 41. Pyttern Python def ?(?*): ?accumulator = 0 for ?: ?accumulator += ? return ?accumulator def accumulate(lst): accum = 0 for item in lst : accum += item return accum
  • 42. Pyttern accumulator pattern def ?(?*): ?accumulator = 0 for ?: ?accumulator += ? return ?accumulator def accumulate(lst): accum = 0 for item in lst : accum += item return accum Pyttern Python
  • 43. Limitations of (some) PQLs False positives: reporting a problem that isn't one False negatives: failing to detect actual problems Complexity/usability: limited adoption due to required expertise and steep learning curve Depth of analysis: limited expresiveness to declare more than just super fi cial patterns Performance: too computationally intensive; no immediate feedback when analysing large codebases
  • 44. Description of the Experiment Numerous program querying tools and languages exist that can be used for detecting structural patterns in program source code Some of them are more complex to use than others or have a steeper learning curve Some of them are more expressive (in the kind of patterns they can detect) than others We need your help to compare the relative expressiveness and usability from di ff erent points of view of various existing program query languages