an attempt to understand the quality of the software in terms of some quantitative change, namely the changes in source code. Again, the authors of [2] have discussed the effect of identifier names on the quality of the software. According to them, low-quality identifier names lower the quality of the software as a whole. A tool was used to extract identifier names from the source code of Java projects. They have also discussed which types of identifier names lead to which types of problems. Here again, changes in the code affect the software quality. The authors of [3] have explained the effect of code refactoring on the quality of the software. They have developed techniques to improve software quality by refactoring code and have tried to formalize the method of code refactoring in their paper. The authors of [4] have, in a similar manner, analyzed structural changes across software versions, the chief motive being to measure changes related to the structure of the source code. Although they make no attempt to relate changes in code structure to the quality of the software, it is quite evident that there may be a relation between the two, and this may constitute an interesting study. However, these works look at software projects only from a static point of view. It is necessary to look at these projects from a dynamic perspective, from the viewpoint of their ever-changing nature, of their coming into being and going out of being. This has been the primary focus of the present paper.
Several studies have tried to understand the features of open-source projects through parameters such as the number of active contributors, the use of different programming languages, the particular structure of the project, and many other parameters that the authors consider quantitative. Many such works are found in [5] and [6]. Again, these look largely at popularity. There must also be a judgement of quality, using suitable metrics based on the events of open-source projects.
There are other works, such as [7] and [8], that analyze the data in datasets related to GitHub repositories. In such studies, mainly the stars, forks, and issues are considered. Many have also included code and an outline of how the data was retrieved, that is, they have elaborated the mining methods in detail. One such study selected words at random from a given word list and passed them to a GitHub API, which returned a list of repositories; some of these were then selected at random and mined to extract their data.
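The sampling procedure is not reproduced in those works in runnable form; the following is a minimal sketch of such random repository sampling, assuming the public GitHub search API and a hypothetical word list file words.txt.

import random
import requests

# Minimal sketch of random repository sampling through the GitHub
# search API. The word list file and the sample size are assumptions
# for illustration, not the exact procedure used in [7] or [8].
with open("words.txt") as f:
    words = [line.strip() for line in f if line.strip()]

word = random.choice(words)
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": word},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()

# The search endpoint returns matching repositories under "items";
# a few of them are picked at random for further mining.
repos = resp.json()["items"]
for repo in random.sample(repos, min(5, len(repos))):
    print(repo["full_name"], repo["stargazers_count"], repo["forks_count"])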
The study used a metric of popularity defined as popularity = stars + forks + pulls. The authors have tried to correlate the documentation of a project with this defined value of popularity. Though the method is not discussed in great detail, it does motivate the present study to think in similar terms.
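As a worked example, a repository with 120 stars, 40 forks, and 15 pull requests would score 120 + 40 + 15 = 175. Below is a minimal sketch of computing this score over mined repository records; the record layout and the example values are assumptions for illustration, not data from [7] or [8].

# Minimal sketch of the popularity score popularity = stars + forks +
# pulls applied to mined repository records. The record fields and
# the example values are illustrative assumptions.
def popularity(repo: dict) -> int:
    # Sum of stars, forks, and pull requests, as defined above.
    return repo["stars"] + repo["forks"] + repo["pulls"]

repos = [
    {"name": "example/repo-a", "stars": 120, "forks": 40, "pulls": 15},
    {"name": "example/repo-b", "stars": 30, "forks": 5, "pulls": 2},
]

# Rank the repositories by the popularity score, highest first.
for repo in sorted(repos, key=popularity, reverse=True):
    print(repo["name"], popularity(repo))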
In [9], the authors have adequately described that a large number of GitHub repositories are personal and not active. This may have a large effect on the conclusions that one may draw from a dataset of GitHub repositories. To show this, the authors analyzed parts of the GHTorrent datasets and sent surveys to GitHub users. They also highlighted the fact that a substantial number of projects had very few commits, so it might not be proper to jump to conclusions from the commit data of GitHub.
In [10], the authors have shown that the frequency of commits and the evolution of file versions in eight large GitHub projects have a certain degree of correlation. The projects discussed there are very successful. The study presented a picture of the number of commits and the number of lines of code changed in each file, along with a comparison between the number of commits and the file changes in different versions. All these works point towards the attempt to design a software quality vs. quantity model for understanding the relation between the two.
3 Methodology
Detailed mining has been done from a public dataset available on Google BigQuery. The total data processed in BigQuery amounted to about 43.2 TB. More than 170 queries were performed on the dataset to extract the data. Since the process is costly, it could be performed only once. The extracted data was cross-checked against ClickHouse [11], and all the data was tabulated.
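The individual queries are not reproduced here; the following is a minimal sketch of one such extraction step, assuming the public githubarchive dataset on BigQuery and the google-cloud-bigquery Python client with credentials already configured.

from google.cloud import bigquery

# Minimal sketch: count GH Archive events by type for one year.
# The table `githubarchive.year.2021` refers to the public GH Archive
# dataset on BigQuery; the query is illustrative, not one of the
# queries actually used in this study.
client = bigquery.Client()

sql = """
    SELECT type, COUNT(*) AS event_count
    FROM `githubarchive.year.2021`
    GROUP BY type
    ORDER BY event_count DESC
"""

for row in client.query(sql).result():
    print(row["type"], row["event_count"])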
It needs to be mentioned here that GH Archive has made available the data from GitHub for the last eleven years, that is, from 2011 to 2021. This is a detailed collection of several events. The events and their identifiers are as below:
1. CommitCommentEvent: triggered when a commit receives a comment
2. CreateEvent: triggered when a branch or a tag is created
3. DeleteEvent: triggered when a branch or a tag is deleted