Volume 11, Number 6

A Data Extraction Algorithm from Open Source Software Project Repositories for Building
Duration Estimation Models: Case Study of Github


Donatien K. Moulla (1, 2), Alain Abran (3) and Kolyang (4)
(1) Faculty of Mines and Petroleum Industries, University of Maroua, Cameroon
(2) LaRI Lab, University of Maroua, Cameroon
(3) École de Technologie Supérieure, Canada
(4) The Higher Teachers’ Training College, University of Maroua, Cameroon


Software project estimation is important for allocating resources and planning a reasonable work schedule. Estimation models are typically built using data from completed projects. While organizations have their own historical data repositories, it is difficult to obtain their collaboration due to privacy and competitive concerns. To overcome the lack of public access to private data repositories, this study proposes an algorithm to extract sufficient data from the GitHub repository for building duration estimation models. More specifically, this study extracts and analyses historical data on WordPress projects to estimate OSS project duration, using commits as an independent variable together with an improved classification of contributors based on the number of active days for each contributor within a release period. The results indicate that duration estimation models using data from OSS repositories perform well and partially solve the problem of the lack of data encountered in empirical research in software engineering.


Effort and duration estimation, software project estimation, project data, data extraction algorithm, GitHub repository.
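
As an illustration of the contributor measure mentioned in the abstract, the following minimal sketch counts the number of active days for each contributor within a release period. It is not the extraction algorithm proposed in this study. The function names, the (author, date) input format, the class labels and the threshold are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from datetime import date


def active_days_per_contributor(commits, release_start, release_end):
    """Count the distinct calendar days on which each contributor committed
    between release_start and release_end (both inclusive).

    `commits` is an iterable of (author, commit_date) pairs; this input
    format is an assumption made for the sketch.
    """
    days = defaultdict(set)
    for author, commit_date in commits:
        if release_start <= commit_date <= release_end:
            # A set keeps one entry per day, so several commits on the
            # same day count as a single active day.
            days[author].add(commit_date)
    return {author: len(d) for author, d in days.items()}


def classify_contributor(active_days, threshold):
    """Hypothetical two-class split; the labels and threshold are
    illustrative and are not taken from the study."""
    return "frequent" if active_days >= threshold else "occasional"
```

For example, given commits by one contributor on two different days of a release period and by another contributor on a single day, `active_days_per_contributor` returns 2 and 1 respectively, and commits dated outside the period are ignored.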