This repository contains materials for a project analyzing the reuse of datasets and models on Hugging Face.
This study empirically explores how Natural Language Processing (NLP) and Computer Vision (CV) datasets and models are reused in the Hugging Face community. We find that NLP tasks (such as Zero-shot-classification, Sentence-similarity, and Feature-extraction) use, on average, more diverse datasets than CV tasks. NLP datasets, however, are reused less frequently than CV datasets, and CV models are reused more frequently than NLP models to develop other models. In conclusion, NLP models reuse diverse datasets for training, while CV datasets and models are reused more often and layered together to develop other models. This study contributes to the understudied area of dataset and model reuse in computing and to the broader data reuse subfield of Information Science.