The Boa Programming Guide - Program Analysis

@since 2019-10

Boa supports advanced static analysis capabilities, allowing you to easily perform more complex mining tasks on thousands of projects in parallel. In this section we investigate some of these advanced analysis features.

Building Program Graphs

Boa provides a set of domain-specific functions to automatically generate standard program graphs for you. These graphs include:

  • control-flow graphs (CFGs)
  • control-dependence graphs (CDGs)
  • data-dependence graphs (DDGs)
  • program-dependence graphs (PDGs)

All of the generator functions take a single Method as an argument and return a specific kind of graph for that method. For example, to build CFGs for every method you could do this:

visit(input, visitor { before m: Method -> cfg := getcfg(m); });

Every graph type has a nodes attribute, that returns the set of all nodes in the graph. This can allow you to manually inspect the nodes. Every node in a graph has a unique identifier and indicates what kind of node it is: an ENTRY node (used to mark nodes added to the graph for entry or exit), a CONTROL node to indicate the associated AST node contained a control point (if statements, loops, etc), a METHOD node indicating the AST node contained a method call, or any OTHER kind of node (mostly for sequential nodes).

As an example, consider the following code to generate a control-flow graph for a simple Java-like method:

G: output collection of string; code := """public class C { void m(int n) { int j = 0; while (n > 0) { if (j % 2 == 0) j++; else j--; n--; } System.out.println("assert 'j is positive'"); } }"""; tree := parse(code); visit(tree, visitor { before m: Method -> G << dot(getcfg(m)); });

This code will generate output that can be fed into a tool such as Graphviz Dot to generate the following CFG:

Working directly with the graphs can be a bit cumbersome, so in the next section we will describe language features for writing your program analysis task that operates on these graph types.

Writing Graph Traversals

Consider a rather simple task, where we simply want to print the names of each node in a CFG. To accomplish this, we need to traverse the graph and visit each node in the graph, get the name of the node, and then output that name. While this particular analysis is very trivial, it will allow us to investigate some of the more advanced features offered by Boa. The code for the analysis follows:

ids: output collection of string; printNodes := traversal(n: CFGNode) { ids << n.name; }; visit(input, visitor { before m: Method -> { cfg := getcfg(m); traverse(cfg, TraversalDirection.FORWARD, TraversalKind.DFS, printNodes); } });

In this code, there are 2 important new pieces to note. First, there is now a traversal (lines 3-5) named printNodes. This traversal operates similar to a function, and takes a single argument named n with type CFGNode. It then grabs the name from that node, and outputs it. This is an example of a traversal function in Boa.

To use a traversal function, on line 10 we call the traverse() function. This function expects the following arguments: a graph, a traversal direction (either FORWARD or BACKWARD), a traversal kind (such as DFS, POSTORDER, ITERATIVE, etc), and the name of a traversal function.

The general form for a traversal function is similar to a user-defined function, but uses the keyword traversal. It also only takes a single argument, which is a node of the graph being traversed. Traversals can optionally return a value. This value is associated with the node for this traversal. Node values can be retrieved by calling getvalue(n) for any node n and can be retrieved from other traversals by also specifying the traversal as a second argument getvalue(n, t).

The general form of a traversal is given below:

id := traversal(arg: graph_node_type) [ : retType ] { body [ return retVal; ] };

Fixed Point Analysis

Many program analyses are iterative, and require a notion of reaching a fixed point. In order to know if a traversal has reached its final iteration, users must provide a custom fixed point function. This is a function that takes two arguments whose type matches the return type of the traversal, and the function returns a boolean to indicate if it has reached a fixed point or not. This function is called on each graph node with the current node value and the previous node value. It must return true for all nodes in order to indicate a fixed point and terminate the iterations. When calling the traverse() method, the fixed point function is given as the fifth and last argument.

The general form of a fixed point function is given below:

id := fixp(arg1, arg2: traversal_ret_type) : bool { body return retVal; };

Example: Liveness Analysis

Let's consider a slightly more complex analysis. This example is a liveness analysis. We define a variable as live if it holds a value that will/might be used in the future.

To compute this analysis, we first define it as a set of dataflow equations. Every node needs to define both a gen set and a kill set:


Then for each node we compute the in and out sets:


The following traversal handles computing the GEN and KILL sets for every node:

type T_gen_kill = { gen: set of string, kill: string }; # traversal that gets all variable uses in a method genkill := traversal(node: CFGNode): T_gen_kill { return { node.useVariables, node.defVariables }; };

The next traversal will initialize every node's IN and OUT sets to be empty sets (the starting point of the analysis):

type T_inout = { in: set of string, out: set of string }; init := traversal(node: CFGNode): T_inout { in_set: set of string; out_set: set of string; val: T_inout = { in_set, out_set }; return val; };

Finally we come to the third and last traversal. This is the core of the analysis:

# cfg live variable analysis live := traversal(node: CFGNode): T_inout { cur_val: T_inout; if (def(getvalue(node))) cur_val = getvalue(node); else cur_val = getvalue(node, init); gen_kill := getvalue(node, genkill); # update out set succs := node.successors; foreach (i: int; def(succs[i])) { succ := getvalue(succs[i]); if (def(succ)) { cur_val.out = union(cur_val.out, succ.in); } } # update in set cur_val.in = clone(cur_val.out); remove(cur_val.in, gen_kill.kill); cur_val.in = union(gen_kill.gen, cur_val.in); return cur_val; };

Note that the above three traversals could actually be combined into a single traversal, but we separated them here for clarity.

The next piece of code handles determining when the analysis has reached a fixed point. For this analysis, we consider it at a fixed point if the current in set for every node is identical to the previous in set for the same node:

# user-defined fix point function that is used for analysis termination. fp := fixp(curr, prev: T_inout) : bool { return len(difference(curr.in, prev.in)) == 0; };

All that remains is the find methods, build a CFG for the method, and then call the three traversals in the proper order. Note that we also clear out the values for the traversals, otherwise the 2nd method we find would start with the prior method's node values.

visit(input, visitor { before m: Method -> { clear(genkill); clear(init); clear(live); cfg := getcfg(m); traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, genkill); traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, init); traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, live, fp); } });

If we run this code on the example graph shown at the top of the page, we can see the final result: