The BoaC Programming Guide

BoaC is a language and an infrastructure for deeper analysis of the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI. BoaC has the following major components:

  • A domain specific programming language,
  • compiler for the language,
  • a backend based on map-reduce to analyze the data set, and
  • a web-based front end to write programs to analyze COVID-19 papers.

To get started, let us consider a simple BoaC program below. Given a dataset, the goal of this program is to find out the years during which most number of papers were published in that dataset.

# which year were COVID-19 papers published most? p: Paper = input; yearCount: output top(1) of int weight int; yearCount << yearof(p.metadata.publish_time) weight 1;

To understand this BoaC program, let us for the sake of argument assume that you ran it on an input dataset that has 28k+ research papers. BoaC's COVID-19 input dataset does indeed contain that many papers. The logical model of thinking about this program is that it instantiates one BoaC task for each of the 28k+ papers in the input dataset (the BoaC compiler of course does some clever optimizations to make this run fast, but for now let us focus on the semantic model). The job of each BoaC task instance is to analyze a single paper.

Another important point to note is that the output variable, here yearCount, can be thought of as a process that is shared by all of the BoaC tasks in a program. The process corresponding to the output variable can be sent a value, e.g. using the syntax on line 3, where yearCount is being sent a value. Output variables can combine values that are being sent to them in various ways. For example, the output variable yearCount says that it combines all of the integer values being sent to it and retains the top most integer as determined by an integer weight.

In the sample program we are discussing, when a BoaC task gets a paper it has to analyze, it sends two values to the output variable: the year that the paper was published and a weight 1.

Overall, when this BoaC program runs, each BoaC task sends years to the output variable, which then computes and outputs the top-most year thereby fulfilling the objective of the program. If you are using the BoaC web-based infrastructure to run these programs, you can view the name of the output variable and its result from the web interface.

The links to the left provide details about various components of the BoaC language. We recommend that you familiarize yourself with BoaC's domain specific types quickly, and then look over our examples that describe several common use cases for BoaC. In case of difficulties, you could always e-mail us or post your queries in the BoaC user forum.