The BoaG Programming Guide

BoaG is a language and an infrastructure for deeper analysis of the public genomics databases at NCBI. BoaG has the following major components:

  • a domain-specific programming language,
  • a compiler for the language,
  • a backend based on map-reduce to analyze the data set, and
  • a web-based front end to write programs to analyze genomics data.

To get started, consider the simple BoaG program below. Given a dataset, the goal of this program is to find out how many taxonomic assignments each protein has.

    # Number of taxonomic assignments for each protein sequence
    s: Sequence = input;
    counts: output sum[string][string] of int;
    foreach (i: int; def(s.annotation[i]))
        counts[s.seqid][s.annotation[i].tax_name] << 1;

To understand this BoaG program, assume for the sake of argument that you ran it on an input dataset of 174M protein sequences. BoaG's NR input dataset does indeed contain that many protein sequences. The logical model for thinking about this program is that it instantiates one BoaG task for each of the 174M protein sequences in the input dataset. (The BoaG compiler of course applies optimizations to make this run fast, but for now let us focus on the semantic model.) The job of each BoaG task instance is to analyze a single protein sequence.

Another important point is that the output variable, here counts, can be thought of as a process shared by all of the BoaG tasks in a program. A task sends a value to that process with the << operator, as in the last statement of the program, where counts is sent the value 1. Output variables can combine the values sent to them in various ways. For example, the declaration of counts says that it sums the values it receives, indexed first by sequence ID and then by taxonomic name.
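The idea of an output variable as a shared summing process can be sketched outside BoaG. The Python below is only an illustration of the semantics described above, not how the BoaG backend is implemented; the class name SumOutput, its send method, and the sample IDs are hypothetical.

```python
from collections import defaultdict


class SumOutput:
    """Toy stand-in for `output sum[string][string] of int` (hypothetical name)."""

    def __init__(self):
        # Each (seqid, tax_name) index pair maps to a running integer total.
        self.totals = defaultdict(int)

    def send(self, seqid, tax_name, value):
        # Rough analogue of `counts[seqid][tax_name] << value`.
        self.totals[(seqid, tax_name)] += value


counts = SumOutput()
counts.send("P1", "Homo sapiens", 1)
counts.send("P1", "Homo sapiens", 1)
counts.send("P1", "Mus musculus", 1)
result = dict(counts.totals)
```

Values sent under the same pair of indices are added together, while values sent under different indices are kept apart.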

In the sample program we are discussing, when a BoaG task receives a protein sequence to analyze, it sends the value 1 to the output variable counts once for each taxonomic assignment of that sequence. The sum of these values is the total number of taxonomic assignments for the protein. A protein sequence may have multiple taxonomic assignments that originate from multiple databases.
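What a single task does can be approximated in Python as follows. This is a sketch under assumptions: a sequence is modeled as a plain dictionary whose keys mirror the fields used in the BoaG program (seqid, annotation, tax_name), and the function name run_task and the sample record are invented for illustration.

```python
def run_task(seq):
    """Return the values one task would send to `counts` for one sequence."""
    emissions = []
    for annotation in seq["annotation"]:
        # Mirrors `counts[s.seqid][s.annotation[i].tax_name] << 1`.
        emissions.append((seq["seqid"], annotation["tax_name"], 1))
    return emissions


# Hypothetical record with three taxonomic assignments, two sharing a name.
example = {
    "seqid": "SEQ_A",
    "annotation": [
        {"tax_name": "Escherichia coli"},
        {"tax_name": "Escherichia coli"},
        {"tax_name": "Shigella flexneri"},
    ],
}
emitted = run_task(example)
```

The task itself does no summing. It only emits one value per assignment and leaves the combination of those values to the output variable.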

Overall, when this BoaG program runs, each BoaG task sends taxonomic counts to the output variable, which aggregates all partial outputs and thereby fulfills the objective of the program. If you are using the BoaG web-based infrastructure to run these programs, you can view the name of the output variable and its result in the web interface.
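The complete run, with per-task partial outputs merged into one result, might be simulated as in the sketch below. It uses Python's collections.Counter as a stand-in for the sum aggregator and a plain loop in place of the map-reduce backend; run_task, run_program, and the two-record dataset are assumptions made for the example.

```python
from collections import Counter


def run_task(seq):
    """Map step: the partial output of the task for a single sequence."""
    partial = Counter()
    for annotation in seq["annotation"]:
        partial[(seq["seqid"], annotation["tax_name"])] += 1
    return partial


def run_program(dataset):
    """Reduce step: merge every task's partial output by summation."""
    counts = Counter()
    for seq in dataset:
        # Counter.update adds counts rather than overwriting them.
        counts.update(run_task(seq))
    return counts


# Hypothetical two-sequence dataset.
dataset = [
    {"seqid": "SEQ_A", "annotation": [{"tax_name": "Escherichia coli"},
                                      {"tax_name": "Escherichia coli"}]},
    {"seqid": "SEQ_B", "annotation": [{"tax_name": "Homo sapiens"}]},
]
totals = run_program(dataset)
```

Because summation is associative and commutative, the partial outputs can be merged in any order, which is what allows a backend to process sequences in parallel.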

The links to the left provide details about the various components of the BoaG language. We recommend that you first familiarize yourself with BoaG's domain-specific types, and then look over our examples, which describe several common use cases for BoaG. In case of difficulties, you can always e-mail us or post your questions in the BoaG user forum.