Dataset Notes - September 2015

Released: September 2015

Data source: GitHub

Projects included

All projects have top-level metadata included. Only projects identified as Java projects also include repository history and source code.

Programming languages processed (stored as ASTs)

  • Java (any file with a .java file extension), up to and including Java 7; Java 8 is not included

Known Bugs/Limitations

  • Project creation dates are off by a factor of 1000. Anyone wishing to use this field should correct it by dividing by 1000 (p.created_date / 1000). See this query for an example: http://boa.cs.iastate.edu/boa/?q=boa/job/public/14441
  • All fields in Person (real_name, email, username) are set to the same value (they all contain the real name)
  • Tags and branches are not stored
  • Commits are listed in topological order, so you cannot see or infer the commit graph (or which commits belong to master versus a branch).
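
For anyone post-processing exported results outside of a query, the creation date correction listed above (p.created_date / 1000) can be sketched as follows. This is a minimal illustration, not part of the dataset tooling: the function names and the sample value are hypothetical, and the conversion to a calendar date assumes the corrected value is in microseconds since the Unix epoch.

```python
from datetime import datetime, timezone

# From the limitation above: the stored creation date is 1000 times too large.
SCALE_ERROR = 1000


def corrected_created_date(raw_created_date: int) -> int:
    """Apply the documented fix, p.created_date / 1000, using integer division."""
    return raw_created_date // SCALE_ERROR


def to_utc_datetime(corrected_micros: int) -> datetime:
    """Convert a corrected value to a UTC datetime.

    ASSUMPTION: the corrected value is microseconds since the Unix epoch.
    """
    seconds, micros = divmod(corrected_micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


if __name__ == "__main__":
    # Hypothetical raw value, chosen for illustration only.
    raw = 1_262_304_000_000_000_000
    fixed = corrected_created_date(raw)
    print(fixed)                               # 1262304000000000
    print(to_utc_datetime(fixed).isoformat())  # 2010-01-01T00:00:00+00:00
```

Integer division is used so the corrected value stays an exact integer; if the unit assumption does not hold for your export, only to_utc_datetime needs to change.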