Released: August 2020
Data source: GitHub
Compiler: compiler-2020-Python
Projects included
The dataset contains 1,558 GitHub projects with the following properties:
- Original (not forked) project with Python as the primary language.
- Contains at least one data science keyword, such as machine-learning or deep neural network, in the project description. The full list of keywords is given in the appendix.
- Uses at least one data science library, such as PyTorch, Caffe, Keras, or TensorFlow. The full list of the 33 Python data science libraries considered is given in the appendix.
- Has at least 80 stars.
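The selection criteria above can be read as a single predicate over project metadata. The following is a minimal sketch under stated assumptions, not the pipeline actually used to build the dataset: the `Project` record, its field names, and the abbreviated `KEYWORDS` and `LIBRARIES` tuples are hypothetical stand-ins for the full lists in the appendix.

```python
from dataclasses import dataclass, field

# Hypothetical, abbreviated stand-ins for the appendix lists.
KEYWORDS = ("machine-learning", "deep neural network")
LIBRARIES = ("torch", "caffe", "keras", "tensorflow")
MIN_STARS = 80


@dataclass
class Project:
    """Hypothetical record holding the metadata the criteria refer to."""
    name: str
    is_fork: bool
    primary_language: str
    description: str
    stars: int
    imported_modules: set = field(default_factory=set)


def qualifies(project: Project) -> bool:
    """Return True if the project meets all four listed properties."""
    description = project.description.lower()
    return (
        not project.is_fork                                   # original project
        and project.primary_language == "Python"              # Python is primary
        and any(k in description for k in KEYWORDS)           # keyword in description
        and any(lib in project.imported_modules for lib in LIBRARIES)  # library usage
        and project.stars >= MIN_STARS                        # at least 80 stars
    )
```

For example, an original Python project described as "A machine-learning toolkit" that imports `keras` and has 120 stars would pass, while the same project with 79 stars, or a fork of it, would not.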
The dataset contains projects owned by both organizations and individual users. Some of the top-rated projects are TensorFlow Models, Keras, scikit-learn, pandas, spaCy, Spotify Luigi, NVIDIA FastPhotoStyle, and Theano.
Programming languages processed (stored as ASTs)
- Python (any file with the .py extension)
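To illustrate what working with a Python file as an AST looks like, here is a minimal sketch that uses Python's standard `ast` module to list the top-level packages a source file imports. It is an assumption-laden illustration only: the AST format produced by the dataset's own compiler may differ from the standard library's, and the `collect_imports` helper is hypothetical.

```python
import ast


def collect_imports(source: str) -> set:
    """Return the top-level package names imported anywhere in the source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import torch.nn as nn` contributes "torch"
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # `from keras.layers import Dense` contributes "keras";
            # relative imports (level > 0) are skipped
            modules.add(node.module.split(".")[0])
    return modules


sample = "import torch.nn as nn\nfrom keras.layers import Dense\n"
print(sorted(collect_imports(sample)))  # prints ['keras', 'torch']
```

Note that `ast.parse` uses the grammar of the interpreter running it, so files written for a different Python version (Python 2, for instance) may raise `SyntaxError`.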
Known Bugs/Limitations
None