Dataset Notes - October 2019

Released: October 2019

Data source: GitHub (identical data to the September 2015 dataset)

This dataset adds many new Boa programming language features, including improved support for maps/sets/stacks/queues, and program analysis features such as CFG/CDG/DDG/PDG generation and (fixed-point) graph traversals.

Projects included

All projects have top-level metadata included. Only projects identified as Java projects also include repository history and source code.

Programming languages processed (stored as ASTs)

  • Java (any file with .java file extension) (up to and including Java 7 - not including Java 8 or newer)

Known Bugs/Limitations

  • The project with ID "400825" is accidentally in the dataset twice (but cloned at different times, so not 100% identical). We recommend filtering out projects with that ID.
  • All fields in Person (real_name, email, username) are set to the same value (they all contain the real name)
  • Some fields in Revision (author.real_name, committer.real_name, and are actually blank - see
  • Tags and branches are not stored
  • Commits are listed topologically and thus you cannot see/infer the commit graph (or which commits belong to master vs a branch).
  • The 2-expression form of assert statements loses the first expression - only the 2nd expression (possibly a non-boolean value) is stored.
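
As a minimal sketch of the filter recommended above for the duplicated project, the following Boa query counts projects while skipping ID "400825". The output variable name (projects) is illustrative only, and the comparison assumes the project ID is exposed as a string through the id attribute of Project.

```boa
# Sketch: count projects, skipping the duplicated project ID.
p: Project = input;

# 'projects' is an illustrative output name, not part of the dataset.
projects: output sum of int;

# Assumes Project.id is a string; both copies of the duplicate are skipped.
if (p.id != "400825")
    projects << 1;
```

The same guard can wrap any other per-project logic so that neither copy of the duplicated project contributes to the results.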