Dataset Notes - August 2020

Released: August 2020

Data source: GitHub

Compiler: compiler-2020-Python

Projects included

The dataset contains 1,558 GitHub projects with the following properties:

  • Original (not forked) project with Python as the primary language.
  • Contains at least one data science keyword, such as machine-learning or deep neural network, in the project description. The full list of keywords is given in the appendix.
  • Uses at least one data science library, such as PyTorch, Caffe, Keras, or TensorFlow. The full list of the 33 Python data science libraries considered is given in the appendix.
  • Has at least 80 stars.
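
The selection criteria above can be read as a single filter applied to each candidate project. The following is a minimal sketch of that filter, not the procedure actually used to build the dataset. The Project record, its field names, and the qualifies function are hypothetical, and the keyword and library sets are short stand-ins for the full lists in the appendix.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins: the real keyword and library lists are in the appendix.
KEYWORDS = {"machine-learning", "deep neural network"}
LIBRARIES = {"torch", "caffe", "keras", "tensorflow"}
MIN_STARS = 80


@dataclass
class Project:
    # Hypothetical record; the field names are illustrative, not the dataset schema.
    is_fork: bool
    primary_language: str
    description: str
    stars: int
    imported_modules: set = field(default_factory=set)


def qualifies(p: Project) -> bool:
    """Return True if the project meets all four selection criteria."""
    description = p.description.lower()
    return (
        not p.is_fork                                  # original, not forked
        and p.primary_language == "Python"             # Python is the primary language
        and any(k in description for k in KEYWORDS)    # keyword in the description
        and bool(p.imported_modules & LIBRARIES)       # uses a listed library
        and p.stars >= MIN_STARS                       # at least 80 stars
    )
```

How library usage is detected is not stated in these notes; the sketch assumes it is available as a set of imported module names.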

The dataset contains projects owned by both organizations and individual users. Some of the top-rated projects are TensorFlow Models, Keras, scikit-learn, pandas, spaCy, Spotify Luigi, NVIDIA FastPhotoStyle, and Theano.

Programming languages processed (stored as ASTs)

  • Python (any file with the .py extension)
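
As a rough illustration of what representing a .py file as an AST involves, the sketch below uses Python's standard ast module. This is an assumption made for illustration only: the dataset's compiler has its own AST representation, which these notes do not describe, and the helper names is_python_file and parse_source are hypothetical.

```python
import ast
from pathlib import PurePath


def is_python_file(path: str) -> bool:
    """Select files by extension, mirroring the '.py' rule above."""
    return PurePath(path).suffix == ".py"


def parse_source(source: str) -> ast.Module:
    """Parse Python source text into an abstract syntax tree."""
    return ast.parse(source)


# Parse a two-line snippet and list the node types of its top-level statements.
tree = parse_source("import torch\nx = torch.zeros(3)\n")
top_level = [type(node).__name__ for node in tree.body]
```

Here top_level holds the statement node types in source order, an import followed by an assignment.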

Known Bugs/Limitations

None