The Boa Programming Guide

Boa is a domain-specific programming language and infrastructure built to enable and ease research that mines software and its evolution at a large scale. Boa has the following major components:

  • a domain-specific programming language,
  • a compiler for the language,
  • a curated data set (almost 700k open source projects, see details),
  • a backend based on map-reduce to analyze the data set, and
  • a web-based front end to write mining software repositories (MSR) programs.

To get started, consider the simple Boa program below. Given a dataset, the goal of this program is to find the year in which the most projects in that dataset were created.

    # which year were projects created most?
    p: Project = input;
    yearCount: output top(1) of int weight int;
    yearCount << yearof(p.created_date) weight 1;

To understand this Boa program, assume for the sake of argument that you ran it on an input dataset of 700,000 open source projects. Boa's September 2013 input dataset does indeed contain roughly that many projects. A useful way to think about the program is that it instantiates one Boa task for each of the 700,000 projects in the input dataset (the Boa compiler of course applies optimizations to make this run fast, but for now let us focus on the semantic model). The job of each Boa task instance is to analyze a single project.

Another important point is that the output variable, here yearCount, can be thought of as a process shared by all of the Boa tasks in a program. A value can be sent to this process with the << operator, as in the last statement of the program, where yearCount is sent a value. Output variables can combine the values sent to them in various ways. Here, the declaration of yearCount says that it collects all of the integer values sent to it and retains the single top-ranked integer, as determined by an integer weight.
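The behavior of such an output variable can be sketched outside of Boa. The Python below is a toy model, not Boa's implementation: the class name TopOutput and its methods are hypothetical, and it assumes that weights sent for the same value are added together before the values are ranked.

```python
from collections import defaultdict


class TopOutput:
    """Toy model of a Boa `output top(k) of int weight int` variable."""

    def __init__(self, k):
        self.k = k
        self.totals = defaultdict(int)  # value -> accumulated weight

    def send(self, value, weight):
        # Plays the role of `yearCount << value weight w;`.
        # Assumption: weights for equal values are summed.
        self.totals[value] += weight

    def result(self):
        # Rank by total weight, highest first. Ties are broken by the
        # smaller value, which is an arbitrary choice made for this sketch.
        ranked = sorted(self.totals.items(), key=lambda item: (-item[1], item[0]))
        return ranked[: self.k]


# Six values sent by six imaginary tasks, each with weight 1.
year_count = TopOutput(k=1)
for year in [2009, 2011, 2011, 2012, 2011, 2009]:
    year_count.send(year, 1)

top = year_count.result()
print(top)  # [(2011, 3)]
```

Because every task sends a weight of 1, the accumulated weight of a year is simply the number of projects created in that year.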

In the sample program we are discussing, when a Boa task receives the project it has to analyze, it sends two values to the output variable: the year in which the project was created and a weight of 1.

Overall, when this Boa program runs, each Boa task sends a year to the output variable, which then computes and outputs the top-ranked year, thereby fulfilling the objective of the program. If you are using the Boa web-based infrastructure to run these programs, you can view the name of the output variable and its result from the web interface.
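The whole flow described above can be imitated in a few lines of Python. This is only a sketch of the semantic model under stated assumptions: the project records, the run_task and run_program functions, and the use of datetime values for created_date are all hypothetical stand-ins, and the real Boa runtime distributes and optimizes this work rather than looping over projects one by one.

```python
from collections import Counter
from datetime import datetime

# Hypothetical stand-ins for Boa Project records; only created_date is modeled.
projects = [
    {"name": "alpha", "created_date": datetime(2009, 3, 1)},
    {"name": "beta", "created_date": datetime(2011, 6, 15)},
    {"name": "gamma", "created_date": datetime(2011, 1, 20)},
    {"name": "delta", "created_date": datetime(2010, 11, 5)},
    {"name": "epsilon", "created_date": datetime(2011, 9, 30)},
]


def run_task(project, emit):
    """One logical task: look at a single project and send one value."""
    # Corresponds to `yearCount << yearof(p.created_date) weight 1;`.
    emit(project["created_date"].year, 1)


def run_program(all_projects):
    """Run one task per project against a shared output and return the top year."""
    weights = Counter()  # shared output: year -> accumulated weight

    def emit(value, weight):
        weights[value] += weight

    for project in all_projects:
        run_task(project, emit)

    # Keep only the single highest-weighted year, like top(1).
    return weights.most_common(1)


result = run_program(projects)
print(result)  # [(2011, 3)]
```

Each call to run_task sees exactly one project and knows nothing about the others; only the shared output combines what the tasks send.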

The links to the left provide details about the various components of the Boa language. We recommend that you first familiarize yourself with Boa's domain-specific types, and then look over our examples, which describe several common use cases for Boa. If you run into difficulties, you can always e-mail us or post your questions in the Boa user forum.