Example Boa Programs

Boa is a flexible language, capable of answering a wide variety of software repository mining questions. Here we provide several example questions and Boa programs to answer those questions.

  1. Programming Languages
    1. What are the ten most used programming languages?
    2. How many projects use more than one programming language?
    3. How many projects use the Scheme programming language?
  2. Project Management
    1. How many projects are created each year?
    2. How many projects self-classify into each topic provided by SourceForge?
    3. How many Java projects using SVN were active in 2011?
    4. In which year was SVN added to Java projects the most?
    5. How many revisions are there in all Java projects using SVN?
    6. How many revisions fix bugs in all Java projects using SVN?
    7. How many committers are there for each project?
    8. What are the churn rates for all projects?
    9. How did the number of commits for Java projects using SVN change over years?
    10. How often are popular Java build systems used?
  3. Legal
    1. What are the five most used licenses?
    2. How many projects use more than one license?
  4. Platform/Environment
    1. What are the five most supported operating systems?
    2. Which projects support multiple operating systems?
    3. What are the five most popular databases?
    4. What are the projects that support multiple databases?
    5. How often is each database used in each programming language?
  5. Source Code
    1. What are the five largest projects, in terms of AST nodes?
    2. How many valid Java files are in the latest snapshot?
    3. How many fixing revisions added null checks?
    4. What files have unreachable statements?
    5. How many generic fields are declared in each project?
    6. How is varargs used over time?
    7. How is the transient keyword used in Java?
  6. Software Engineering Metrics
    1. What is the number of attributes (NOA), per-project and per-type?
    2. What is the number of public methods (NPM), per-project and per-type?
  7. Program Analysis
    1. Dominator Analysis
    2. Live Variables
    3. Reaching Definitions
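
Most of the programs below share a small skeleton: bind the input project to a variable, declare an output aggregator, and emit values into it with `<<`. The following is a minimal sketch of that skeleton rather than one of the published examples; the output name `projects` is illustrative only. It counts every project in the dataset:

```
# minimal sketch: count every project in the dataset
# ("projects" is an illustrative name, not taken from the examples below)
p: Project = input;
projects: output sum of int;

projects << 1;
```

The examples that follow vary the aggregator (`sum`, `top`, `mean`, `collection`, and so on) and the condition under which a value is emitted.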

Programming Languages

What are the ten most used programming languages?

    # Counting the 10 most used programming languages
    p: Project = input;
    counts: output top(10) of string weight int;

    foreach (i: int; def(p.programming_languages[i]))
        counts << p.programming_languages[i] weight 1;

How many projects use more than one programming language?

    # Counting the number of projects written in more than one language
    p: Project = input;
    counts: output sum of int;

    if (len(p.programming_languages) > 1)
        counts << 1;

How many projects use the Scheme programming language?

    # Counting projects using Scheme
    p: Project = input;
    counts: output sum of int;

    foreach (i: int; match(`^scheme$`, lowercase(p.programming_languages[i])))
        counts << 1;

Project Management

How many projects are created each year?

    # How many projects created each year?
    p: Project = input;
    counts: output sum[int] of int;

    counts[yearof(p.created_date)] << 1;

How many projects self-classify into each topic provided by SourceForge?

    # how many projects self-classify into each topic?
    p: Project = input;
    values: output sum[string] of int;

    foreach (i: int; def(p.topics[i]))
        values[lowercase(p.topics[i])] << 1;

How many Java projects using SVN were active in 2011?

    # Counting the number of active Java projects with SVN
    p: Project = input;
    counts: output sum of int;

    visit(p, visitor {
        before n: Project ->
            ifall (i: int; !match(`^java$`, lowercase(n.programming_languages[i])))
                stop;
        before node: CodeRepository ->
            if (node.kind == RepositoryKind.SVN)
                exists (j: int; yearof(node.revisions[j].commit_date) == 2011)
                    counts << 1;
    });
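
The visitor idiom used above recurs in many of the later examples: a `before` clause runs when a node of the named type is reached, and `stop` skips that node's children. As a minimal sketch of the same idiom, assuming only the types already used in this section, the following counts every revision with no language or repository filter:

```
# minimal sketch: count all revisions of all projects, unfiltered
p: Project = input;
counts: output sum of int;

visit(p, visitor {
    before node: Revision -> counts << 1;
});
```

Adding a `before n: Project` or `before node: CodeRepository` clause that calls `stop`, as the programs in this section do, prunes the traversal before any revisions are counted.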

In which year was SVN added to Java projects the most?

    # which year were SVN projects added most
    p: Project = input;
    counts: output top(1) of int weight int;

    visit(p, visitor {
        before n: Project ->
            ifall (i: int; !match(`^java$`, lowercase(n.programming_languages[i])))
                stop;
        before node: CodeRepository ->
            if (node.kind == RepositoryKind.SVN && len(node.revisions) > 0)
                counts << yearof(node.revisions[0].commit_date) weight 1;
    });

How many revisions are there in all Java projects using SVN?

    # Counting the number of revisions for all Java projects with SVN
    p: Project = input;
    counts: output sum of int;

    visit(p, visitor {
        before n: Project ->
            ifall (i: int; !match(`^java$`, lowercase(n.programming_languages[i])))
                stop;
        before node: CodeRepository ->
            if (node.kind == RepositoryKind.SVN)
                counts << len(node.revisions);
    });

How many revisions fix bugs in all Java projects using SVN?

    # Counting the number of fixing revisions for all Java projects with SVN
    p: Project = input;
    counts: output sum of int;

    visit(p, visitor {
        before n: Project ->
            ifall (i: int; !match(`^java$`, lowercase(n.programming_languages[i])))
                stop;
        before node: CodeRepository ->
            if (node.kind != RepositoryKind.SVN)
                stop;
        before node: Revision ->
            if (isfixingrevision(node.log))
                counts << 1;
    });

How many committers are there for each project?

    # How many committers are in each project?
    p: Project = input;
    counts: output sum[string] of int;

    committers: map[string] of bool;

    visit(p, visitor {
        before node: Revision ->
            if (!haskey(committers, node.committer.username)) {
                committers[node.committer.username] = true;
                counts[p.id] << 1;
            }
    });

What are the churn rates for all projects?

    # what are the churn rates for all projects
    p: Project = input;
    counts: output mean[string] of int;

    visit(p, visitor {
        before node: Revision -> counts[p.id] << len(node.files);
    });

How did the number of commits for Java projects using SVN change over years?

    # how did # of commits for Java/SVN change over time?
    p: Project = input;
    counts: output sum[int] of int;

    visit(p, visitor {
        before n: Project ->
            ifall (i: int; !match(`^java$`, lowercase(n.programming_languages[i])))
                stop;
        before n: CodeRepository ->
            if (n.kind != RepositoryKind.SVN)
                stop;
        before n: Revision -> counts[yearof(n.commit_date)] << 1;
    });

How often are popular Java build systems used?

    # How often are popular Java build systems used?
    TOTAL: output sum of int;
    ANT: output sum of int;
    GRADLE: output sum of int;
    MAVEN: output sum of int;
    MAKE: output sum of int;
    NONE: output sum of int;

    hasAnt := false;
    hasGradle := false;
    hasMvn := false;
    hasMake := false;

    exists (i: int; lowercase(input.programming_languages[i]) == "java")
        visit(input, visitor {
            before Project -> TOTAL << 1;
            after Project -> {
                if (hasAnt) ANT << 1;
                if (hasGradle) GRADLE << 1;
                if (hasMvn) MAVEN << 1;
                if (hasMake) MAKE << 1;
                if (!(hasAnt || hasGradle || hasMvn || hasMake)) NONE << 1;
            }
            before node: CodeRepository -> {
                snapshot := getsnapshot(node);
                for (j := 0; j < len(snapshot); j++) {
                    if (match(`/build.xml$`, snapshot[j].name))
                        hasAnt = true;
                    else if (match(`/build.gradle$`, snapshot[j].name))
                        hasGradle = true;
                    else if (match(`/pom.xml$`, snapshot[j].name))
                        hasMvn = true;
                    else if (match(`/makefile$`, lowercase(snapshot[j].name)))
                        hasMake = true;
                }
                stop;
            }
        });

Legal

What are the five most used licenses?

    # Counting the 5 most frequently used licenses
    p: Project = input;
    counts: output top(5) of string weight int;

    foreach (i: int; def(p.licenses[i]))
        counts << p.licenses[i] weight 1;

How many projects use more than one license?

    # Counting the number of projects using more than 1 license
    p: Project = input;
    counts: output sum of int;

    if (len(p.licenses) > 1)
        counts << 1;

Platform/Environment

What are the five most supported operating systems?

    # what are the 5 most supported OSes?
    p: Project = input;
    counts: output top(5) of string weight int;

    foreach (i: int; def(p.operating_systems[i]))
        counts << p.operating_systems[i] weight 1;

Which projects support multiple operating systems?

    # which projects support multiple OSes?
    p: Project = input;
    counts: output collection[string] of string;

    if (len(p.operating_systems) > 1)
        counts[p.id] << p.project_url;

What are the five most popular databases?

    # what are the 5 most popular databases?
    p: Project = input;
    counts: output top(5) of string weight int;

    foreach (i: int; def(p.databases[i]))
        counts << p.databases[i] weight 1;

What are the projects that support multiple databases?

    # which projects support multiple databases?
    p: Project = input;
    counts: output collection[string] of string;

    if (len(p.databases) > 1)
        counts[p.id] << p.name;

How often is each database used in each programming language?

    # pairs of programming language/database
    p: Project = input;
    counts: output sum[string][string] of int;

    foreach (i: int; def(p.programming_languages[i]))
        foreach (j: int; def(p.databases[j]))
            counts[p.programming_languages[i]][p.databases[j]] << 1;

Source Code

What are the five largest projects, in terms of AST nodes?

    # What are the 5 largest projects, in terms of AST nodes?
    # Output is in Millions of AST nodes.
    p: Project = input;
    top5: output top(5) of string weight int;

    astCount := 0;

    visit(p, visitor {
        # only look at the latest snapshot
        before n: CodeRepository -> {
            snapshot := getsnapshot(n);
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        # by default, count all visited nodes
        before _ -> astCount++;
        # these nodes are not part of the AST, so do nothing when visiting
        before Project, ChangedFile -> ;
    });

    top5 << p.project_url weight astCount / 1000000;

How many valid Java files are in the latest snapshot?

    # count how many valid Java files are in the latest snapshot
    p: Project = input;
    counts: output sum of int;

    visit(p, visitor {
        before node: CodeRepository ->
            counts << len(getsnapshot(node, "SOURCE_JAVA_JLS"));
    });

How many fixing revisions added null checks?

    # How many fixing revisions added null checks?
    AddedNullCheck: output sum of int;
    p: Project = input;

    isfixing := false;
    count := 0;
    # map of file names to the last revision of that file
    files: map[string] of ChangedFile;

    visit(p, visitor {
        before node: Revision -> isfixing = isfixingrevision(node.log);
        before node: ChangedFile -> {
            # if this is a fixing revision and
            # there was a previous version of the file
            if (isfixing && haskey(files, node.name)) {
                # count how many null checks were previously in the file
                count = 0;
                visit(getast(files[node.name]));
                last := count;

                # count how many null checks are currently in the file
                count = 0;
                visit(getast(node));

                # if there are more null checks, output
                if (count > last)
                    AddedNullCheck << 1;
            }
            if (node.change == ChangeKind.DELETED)
                remove(files, node.name);
            else
                files[node.name] = node;
            stop;
        }
        before node: Statement ->
            # increase the counter if there is an IF statement
            # where the boolean condition is of the form:
            #   null == expr OR expr == null OR null != expr OR expr != null
            if (node.kind == StatementKind.IF)
                visit(node.expression, visitor {
                    before node: Expression ->
                        if (node.kind == ExpressionKind.EQ || node.kind == ExpressionKind.NEQ)
                            exists (i: int; isliteral(node.expressions[i], "null"))
                                count++;
                });
    });

What files have unreachable statements?

    # looking for dead code
    DEAD: output top(1000000) of string weight int;

    cur_file: string;
    cur_method: string;
    s: stack of bool;
    alive := true;

    visit(input, visitor {
        before _ -> if (!alive) stop;
        before node: CodeRepository -> {
            snapshot := getsnapshot(node);
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: ChangedFile -> cur_file = string(node);
        before node: Method -> {
            cur_method = node.name;
            push(s, alive);
            alive = true;
        }
        after node: Method -> alive = pop(s);
        before node: Statement -> {
            if (!alive) {
                DEAD << format("%s - %s", cur_file, cur_method) weight 1;
                stop;
            }
            switch (node.kind) {
                case StatementKind.BREAK:
                    if (def(node.expression))
                        break;
                case StatementKind.RETURN, StatementKind.THROW, StatementKind.CONTINUE:
                    alive = false;
                    break;
                case StatementKind.IF, StatementKind.LABEL:
                    stop;
                case StatementKind.FOR, StatementKind.DO, StatementKind.WHILE,
                        StatementKind.SWITCH, StatementKind.TRY:
                    foreach (i: int; def(node.statements[i])) {
                        push(s, alive);
                        visit(node.statements[i]);
                        alive = pop(s);
                    }
                    stop;
                default:
                    break;
            }
        }
    });

How many generic fields are declared in each project?

    # How many generic fields are declared in each project?
    p: Project = input;
    GenericFields: output sum[string] of int;

    visit(p, visitor {
        before node: Type ->
            if (strfind("<", node.name) > -1)
                GenericFields[p.project_url] << 1;

        before node: Declaration -> {
            # check all fields
            foreach (i: int; def(node.fields[i]))
                visit(node.fields[i]);

            # also look at nested declarations
            foreach (i: int; def(node.methods[i]))
                visit(node.methods[i]);
            foreach (i: int; def(node.nested_declarations[i]))
                visit(node.nested_declarations[i]);
            stop;
        }
        before node: Method -> {
            foreach (i: int; def(node.statements[i]))
                visit(node.statements[i]);
            stop;
        }
        before node: Statement -> {
            foreach (i: int; def(node.statements[i]))
                visit(node.statements[i]);
            if (def(node.type_declaration))
                visit(node.type_declaration);
            stop;
        }

        # fields can't be below expressions or modifiers
        before Expression, Modifier -> stop;
    });

How is varargs used over time?

    # How is varargs used over time?
    p: Project = input;
    Varargs: output collection[string][string][time] of int;

    file_name: string;
    commit_date: time;

    visit(p, visitor {
        before node: ChangedFile -> file_name = node.name;
        before node: Revision -> commit_date = node.commit_date;
        before node: Method ->
            if (len(node.arguments) > 0
                    && strfind("...", node.arguments[len(node.arguments) - 1].variable_type.name) > -1)
                Varargs[p.project_url][file_name][commit_date] << 1;
    });

How is the transient keyword used in Java?

    # How is transient keyword used in Java?
    p: Project = input;
    TransientTotal: output sum of int;
    TransientMax: output maximum(1) of string weight int;
    TransientMin: output minimum(1) of string weight int;
    TransientMean: output mean of int;

    count := 0;
    s: stack of int;

    visit(p, visitor {
        before node: CodeRepository -> {
            # only look at the latest snapshot
            # and only include Java files
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Declaration -> {
            # only interested in fields, which only occur inside (anonymous) classes
            if (node.kind == TypeKind.CLASS || node.kind == TypeKind.ANONYMOUS) {
                # store old value
                push(s, count);
                count = 0;

                # find uses and increment counter
                foreach (i: int; def(node.fields[i]))
                    foreach (j: int; node.fields[i].modifiers[j].kind == ModifierKind.OTHER
                            && node.fields[i].modifiers[j].other == "transient")
                        count++;
            } else
                stop;
        }
        after node: Declaration -> {
            # output result
            TransientTotal << count;
            TransientMax << p.id weight count;
            TransientMin << p.id weight count;
            TransientMean << count;

            # restore previous value
            count = pop(s);
        }
    });

Software Engineering Metrics

What is the number of attributes (NOA), per-project and per-type?

    # Computes Number of Attributes (NOA) for each project, per-type
    # Output is: NOA[ProjectID][TypeName] = NOA value
    p: Project = input;
    NOA: output sum[string][string] of int;

    visit(p, visitor {
        # only look at the latest snapshot
        before n: CodeRepository -> {
            snapshot := getsnapshot(n);
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Declaration ->
            if (node.kind == TypeKind.CLASS)
                NOA[p.id][node.name] << len(node.fields);
    });

What is the number of public methods (NPM), per-project and per-type?

    # Computes Number of Public Methods (NPM) for each project, per-type
    # Output is: NPM[ProjectID][TypeName] = NPM value
    p: Project = input;
    NPM: output sum[string][string] of int;

    visit(p, visitor {
        # only look at the latest snapshot
        before n: CodeRepository -> {
            snapshot := getsnapshot(n);
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Declaration ->
            if (node.kind == TypeKind.CLASS)
                foreach (i: int; has_modifier_public(node.methods[i]))
                    NPM[p.id][node.name] << 1;
    });

Program Analysis

Dominator Analysis

    op: output collection[string][string][string] of string;

    type T = set of string;
    allNodeIds: T;

    dominators := traversal(node: CFGNode): T {
        doms: T;
        if (def(getvalue(node))) {
            doms = getvalue(node);
        } else {
            if (node.id == 0)
                add(doms, string(node.id));
            else
                doms = clone(allNodeIds);
        }
        foreach (i: int; def(getvalue(node.predecessors[i])))
            doms = intersect(doms, getvalue(node.predecessors[i]));
        add(doms, string(node.id));
        return doms;
    };

    fp := fixp(curr, prev: T): bool {
        return curr == prev;
    };

    doms: map[string] of T;
    collect := traversal(node: CFGNode) {
        if (def(getvalue(node, dominators)))
            doms[string(node.id)] = getvalue(node, dominators);
    };

    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node);
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before method: Method -> {
            cfg := getcfg(method);
            if (def(cfg))
                for (i := 0; i < len(cfg.nodes); i++)
                    add(allNodeIds, string(i));

            traverse(cfg, TraversalDirection.FORWARD, TraversalKind.REVERSEPOSTORDER, dominators, fp);
            traverse(cfg, TraversalDirection.FORWARD, TraversalKind.ITERATIVE, collect);
            op[input.project_url][current(ChangedFile).name][method.name] << string(doms);

            clear(dominators);
            clear(collect);
            clear(doms);
            clear(allNodeIds);
        }
    });

Live Variables

    m: output collection[string][string][int] of string;

    # program analysis output
    type T = set of string;
    type T_gen_kill = { gen: T, kill: string };
    type T_inout = { in: T, out: T };

    m_name: string;

    # traversal that gets all variable uses in a method
    init := traversal(node: CFGNode) : T_gen_kill {
        cur_value: T_gen_kill;
        cur_value = { node.useVariables, node.defVariables };
        return cur_value;
    };

    # cfg live variable analysis
    live := traversal(node: CFGNode) : T_inout {
        cur_val: T_inout;
        if (def(getvalue(node))) {
            cur_val = getvalue(node);
        } else {
            in_set: T;
            out_set: T;
            cur_val = { in_set, out_set };
        }
        succs := node.successors;
        foreach (i: int; def(succs[i])) {
            succ := getvalue(succs[i]);
            if (def(succ)) {
                cur_val.out = union(cur_val.out, succ.in);
            }
        }
        gen_kill := getvalue(node, init);
        if (def(gen_kill)) {
            remove(cur_val.out, gen_kill.kill);
            cur_val.in = union(gen_kill.gen, cur_val.out);
        }
        return cur_val;
    };

    result := traversal(node: CFGNode) {
        if (def(getvalue(node, live))) {
            m[input.project_url][m_name][node.id] << string(getvalue(node, live).in);
        }
    };

    # user-defined fix point function that is used for analysis termination.
    fixp1 := fixp(curr, prev: T_inout) : bool {
        if (len(difference(curr.in, prev.in)) == 0)
            return true;
        return false;
    };

    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Method -> {
            cfg := getcfg(node);
            m_name = current(Declaration).name + "::" + node.name;

            traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, init);
            traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, live, fixp1);
            traverse(cfg, TraversalDirection.BACKWARD, TraversalKind.HYBRID, result);

            clear(init);
            clear(live);
        }
    });

Reaching Definitions

    m: output collection[string][string][int] of string;

    # program analysis output
    type T = set of string;
    type T_gen_kill = { gen: string, kill: string };
    type T_inout = { in: T, out: T };

    m_name: string;

    # traversal that accumulates generated values
    cfg_def := traversal(node: CFGNode) : T_gen_kill {
        cur_val: T_gen_kill = { "", "" };
        if (node.defVariables != "") {
            cur_val.gen = node.defVariables + "@" + string(node.id);
            cur_val.kill = node.defVariables;
        }
        return cur_val;
    };

    # cfg reaching definition analysis
    cfg_reach_def := traversal(n: CFGNode): T_inout {
        cur_val: T_inout;
        if (def(getvalue(n))) {
            cur_val = getvalue(n);
        } else {
            in_set: T;
            out_set: T;
            cur_val = { in_set, out_set };
        }
        preds := n.predecessors;
        foreach (i: int; def(preds[i])) {
            pred := getvalue(preds[i]);
            if (def(pred))
                cur_val.in = union(cur_val.in, pred.out);
        }
        cur_val.out = clone(cur_val.in);
        genkill := getvalue(n, cfg_def);
        if (genkill.kill != "") {
            tmp_out := values(cur_val.out);
            foreach (i: int; def(tmp_out[i])) {
                tmp1 := clone(tmp_out[i]);
                str_array := splitall(tmp1, "@");
                if (str_array[0] == genkill.kill)
                    remove(cur_val.out, tmp1);
            }
            add(cur_val.out, genkill.gen);
        }
        return cur_val;
    };

    result := traversal(node: CFGNode) {
        if (def(getvalue(node, cfg_reach_def)))
            m[input.project_url][m_name][node.id] << string(getvalue(node, cfg_reach_def).out);
    };

    # user-defined fix point function that is used for analysis termination.
    fixp1 := fixp(curr, prev: T_inout) : bool {
        if (len(difference(curr.out, prev.out)) == 0)
            return true;
        return false;
    };

    visit(input, visitor {
        before node: CodeRepository -> {
            snapshot := getsnapshot(node, "SOURCE_JAVA_JLS");
            foreach (i: int; def(snapshot[i]))
                visit(snapshot[i]);
            stop;
        }
        before node: Method -> {
            cfg := getcfg(node);
            m_name = current(Declaration).name + "::" + node.name;

            traverse(cfg, TraversalDirection.FORWARD, TraversalKind.HYBRID, cfg_def);
            traverse(cfg, TraversalDirection.FORWARD, TraversalKind.HYBRID, cfg_reach_def, fixp1);
            traverse(cfg, TraversalDirection.FORWARD, TraversalKind.HYBRID, result);

            clear(cfg_def);
            clear(cfg_reach_def);
        }
    });
