k-means has its pros and cons. On the one hand, it is easy to implement and can partition the original, complex dataset into subsets without any additional preprocessing. In most cases the k-means algorithm is faster than hierarchical clustering and is also efficient in terms of computational cost, which is O(K*n*d) for the assignment step and O(n*d) for updating the centroids, [3].
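As a rough illustration of these per-iteration costs (an illustrative NumPy sketch, not code from [3]), one Lloyd iteration can be written as follows; the assignment step compares each of the n points of dimension d with all K centroids, while the update step passes over the points only once:

import numpy as np

def kmeans_iteration(X, centroids):
    # X: (n, d) data matrix, centroids: (K, d) current centers
    # Assignment step: squared distances to all K centroids -> O(K*n*d)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update step: mean of the points assigned to each cluster -> O(n*d)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(centroids.shape[0])
    ])
    return labels, new_centroids

Note that this vectorized assignment step also allocates an intermediate array of size n*K*d, which can become a practical limitation for big data.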
On the other hand, although k-means and other clustering methods, [4], are good at an initial separation of the data into clusters, k-means depends heavily on the location of the points rather than on the actual relations within the data itself. Furthermore, it can give significantly different results depending on the number of clusters k that is set, which makes the initial choice of k particularly important. Last but not least, it is also sensitive to outliers.
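A minimal demonstration of this sensitivity to k and to the initialization (the synthetic dataset and parameter values below are only illustrative) can be obtained with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
# Different values of k and different random initializations can
# lead to noticeably different partitions and inertia values.
for k in (2, 4, 8):
    for seed in (0, 1):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        print(k, seed, round(km.inertia_, 1))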
Outliers, and big data in general, can reduce the efficiency and therefore the reliability of the k-means algorithm, [5]. Such datasets may contain a large amount of unrelated and redundant information, which is likely to degrade the performance of the classification algorithm. In order to avoid these problems, feature selection methods have been used, [6], [7], [8].
Feature selection is often used as a preparatory stage of machine learning or in conjunction with it, [9]. Selecting the useful features is an important step towards increasing the efficiency of clustering and classification algorithms. Especially when the analysis concerns classification and/or regression problems with many observable features, feature selection becomes an important stage of the analysis: it can reduce the size of the dataset and the computational cost by removing irrelevant and unnecessary data such as outliers, while also improving accuracy and providing a deeper understanding of the structure of the model or of the data, [10]. Therefore, feature selection is considered necessary when dealing with big data. The purpose of feature selection algorithms is to create a subset of features from the initial one (the whole dataset), based on some criteria. Feature selection methods are divided into three main categories: filter methods, wrapper methods, and embedded methods.
Filter methods rely on the general characteristics of the data, without involving any learning algorithm. They use dependency measures such as mutual information, [11], conditional mutual information, [12], and Pearson's correlation, [13], in order to examine the relationships between the features and to build the best possible feature subset. These methods are relatively resistant to overfitting, but they often do not provide the optimal subset of features. However, because their computational cost is much lower than that of the other approaches, they are widely used by researchers when dealing with big data.
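For instance, a simple filter-style ranking can be built directly from such dependency measures; the following sketch (an illustration only, not the selection criterion used in this work) scores every feature against the target with mutual information and with Pearson's correlation and keeps the top-ranked ones:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_select(X, y, n_keep=10):
    # Dependency measures between each feature and the target
    mi = mutual_info_classif(X, y)                    # mutual information
    pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                        for j in range(X.shape[1])])  # |Pearson correlation|
    # Rank by mutual information; the Pearson scores could be used
    # (or combined) in the same way.
    ranked = np.argsort(mi)[::-1]
    return ranked[:n_keep], mi, pearson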
Wrapper methods, [14], [15], require a predefined learning algorithm and use its performance to evaluate and determine which features will be selected. Thus, for each new subset of features, the wrapper method has to train and evaluate the predefined hypothesis or classifier. It tends to find the features that best fit the chosen learning algorithm, which results in very good learning performance, but at a higher computational cost. Although this approach is effective and achieves high performance, its computational cost makes it prohibitive for large-scale datasets (big data), because the classifier needs to be trained many times.
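The general wrapper loop can be sketched as follows (a hedged illustration using a greedy forward search and an arbitrary classifier; the concrete learning algorithm and search strategy of [14], [15], and of this paper, may differ):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_wrapper(X, y, n_keep=5, estimator=None):
    estimator = estimator or KNeighborsClassifier()
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        # The classifier is retrained and scored for every candidate
        # subset, which is where the high computational cost comes from.
        scores = [cross_val_score(estimator, X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected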
Embedded methods, [16], [17], [18], combine the advantages of both filter and wrapper methods. Here, the algorithm learns which features contribute most to the accuracy of the model while the model itself is being built. Although they usually have a lower computational cost than wrapper methods, they are much slower than filter methods and, in addition, the selected features depend strongly on the learning machine.
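A common member of the embedded family (given only as an example; it is not necessarily the formulation used in [16], [17], [18]) is an L1-regularized linear model, which selects features while it is being fitted because the coefficients of irrelevant features are driven to zero:

import numpy as np
from sklearn.linear_model import LogisticRegression

def embedded_select(X, y, C=0.1):
    # The L1 penalty performs the selection during model fitting:
    # features whose coefficients are shrunk to zero are discarded.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    return np.flatnonzero(np.abs(model.coef_).max(axis=0) > 0)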
In this paper, a wrapper feature selection method is used. As mentioned earlier, wrapper methods lead to very good learning performance, but at a high computational cost. Parallelization can reduce this cost without compromising the efficiency of the methods. Recently, different parallelization techniques for feature selection have been proposed and applied to high-dimensional cases involving large amounts of data, such as cancer classification, [19], and wind forecasting, [20]. Several approaches have been proposed to overcome the increase in computation time caused by feature selection. Examples include a hybrid parallel implementation that uses MPI and OpenMP for feature selection on distributed-memory clusters, [21], a parallel heterogeneous ensemble feature selection scheme for multi-core architectures that applies parallelism to the CPU cores together with the GPU, [22], and the QCFS framework, which exploits the capabilities of modern CPUs by executing the feature selection process in parallel on four cores, [23]. Finally, a feature selection library has been proposed, [24], to help accelerate the feature selection process; it includes seven methods that follow a hybrid MPI/multithreaded approach.
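As a hedged illustration of the basic idea only (not the implementation of [21], [22], [23], [24], nor the specific variants examined in this paper), the independent subset evaluations of a wrapper loop can be distributed over CPU cores, for example with Python's multiprocessing:

import numpy as np
from multiprocessing import Pool
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def score_subset(args):
    X, y, subset = args
    # Each candidate subset is evaluated independently, so the
    # corresponding classifier trainings can run on separate cores.
    return cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=3).mean()

def parallel_scores(X, y, candidate_subsets, n_workers=4):
    with Pool(n_workers) as pool:
        return pool.map(score_subset, [(X, y, s) for s in candidate_subsets])

With p workers, the wall-clock time of the evaluation phase ideally approaches 1/p of the serial cost, which is the kind of time reduction quantified in the remainder of this paper.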
In this paper, different variations of the parallelization of feature selection algorithms based on k-means will be examined, in order to reduce the computational cost, to quantify the reduction in time resulting from each method and, finally, to select the most effective one. The rest of the paper is organized as follows: in Chapter 2, basic elements of theory, as well as the methodology followed, are presented. In Chapter 3, the Datasets