then determines their unweighted mean across all labels. This does not take label imbalance into account. Weighted averages, on the other hand, calculate the metrics for each label and take their mean weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance and can result in an F-score that is not between precision and recall. Classification accuracy is most informative when the samples are evenly distributed across the classes. Since the number of instances per class is imbalanced in this study, the weighted-average precision, recall, and F1 scores obtained on the training dataset with Count (bag-of-words), TF-IDF, and Ngrams were considered.
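To make the distinction concrete, the following is a minimal sketch (not the authors' code) contrasting macro and weighted averaging with scikit-learn; the toy labels and their imbalance are illustrative assumptions only.

from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted sentiment labels with imbalanced classes
y_true = ["pos", "pos", "pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "neg", "pos", "neg", "pos", "neu"]

# 'macro': unweighted mean over labels, ignoring label imbalance
p_m, r_m, f_m, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# 'weighted': mean over labels weighted by support (true instances per label)
p_w, r_w, f_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

print(f"macro:    P={p_m:.2f} R={r_m:.2f} F1={f_m:.2f}")
print(f"weighted: P={p_w:.2f} R={r_w:.2f} F1={f_w:.2f}")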
Higher weighted-average precision, recall, and F1 scores were desired in this case. Precision is the ratio of True Positives to the sum of True Positives and False Positives; it measures the classifier's exactness, so a low precision indicates a large number of False Positives. Recall is the number of True Positives divided by the sum of True Positives and False Negatives; it measures the classifier's completeness, so a low recall indicates a large number of False Negatives. The F1 score is 2 × ((precision × recall) / (precision + recall)) and reflects the balance between precision and recall.
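As an illustration of these definitions (the counts are assumed, not taken from the study), the three metrics can be computed directly from True Positive, False Positive, and False Negative counts:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # exactness: a low value means many False Positives
    recall = tp / (tp + fn)      # completeness: a low value means many False Negatives
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Assumed example counts
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # 0.80, 0.89, 0.84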
Results of Count vectorization on sentiment classification show that, among the base models, Logistic Regression and Support Vector Machines obtained the highest weighted-average precision, recall, and F1 scores. However, the ensemble outperformed both, yielding a precision of 0.87, recall of 0.88, and F1 score of 0.87, compared with Logistic Regression and Support Vector Machines, which each obtained 0.86 for precision, recall, and F1.
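A minimal sketch of this comparison, assuming a scikit-learn implementation, is shown below; the toy comments, the choice of base learners, and the hard-voting ensemble are illustrative assumptions rather than the authors' exact configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

# Toy comments and labels; the study's dataset is not reproduced here
texts = ["great course", "boring lecture", "very helpful teacher", "poor explanation"]
labels = ["positive", "negative", "positive", "negative"]

lr = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
svm = make_pipeline(CountVectorizer(), LinearSVC())
nb = make_pipeline(CountVectorizer(), MultinomialNB())
ensemble = VotingClassifier(estimators=[("lr", lr), ("svm", svm), ("nb", nb)],
                            voting="hard")  # majority vote over the base models

for name, model in [("LR", lr), ("SVM", svm), ("Ensemble", ensemble)]:
    model.fit(texts, labels)
    print(name, model.predict(["helpful and great", "boring and poor"]))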
Results on TF-IDF vectorization show that, of the two strongest base models, Support Vector Machines yielded the highest precision and recall at 0.87, while Logistic Regression obtained 0.86 for both; the two models achieved the same F1 score of 0.86. Naïve Bayes yielded a precision of 0.85, recall of 0.84, and F1 score of 0.83, while Random Forest obtained 0.85 for precision, recall, and F1. Meanwhile, the ensemble obtained 0.86 for precision, recall, and F1, indicating that it did not outperform Support Vector Machines and Logistic Regression in terms of F1 score. TF-IDF increased the precision and recall of Support Vector Machines by 0.01 but did not improve its F1 score, and it did not improve the precision, recall, or F1 scores of the other machine learning algorithms, including the ensemble. A study by [18] found that machine learning models generally achieved higher accuracy with TF-IDF, except for Multinomial Naïve Bayes and Neural Network I, where the Count vectorizer performed better. More specifically, for these two models TF-IDF was superior only on the IMDb movie reviews test set, the dataset on which the models were trained, and inferior on the other datasets, which consisted of reviews of clothing, food, hotels, and Amazon products, as well as tweets. In another study, [19], TF-IDF proved more efficient than the Count vectorizer on large-volume datasets, yet the two vectorizers performed approximately the same, except for the Single Layer Perceptron, where the Count vectorizer achieved 10% higher accuracy. The findings of the present study align with those of [18] and [19]: although the TF-IDF vectorizer is often regarded as better than the Count vectorizer, this does not generalize to all cases. Notably, in the two prior studies TF-IDF and the Count vectorizer were applied to various datasets, which suggests that differences in the performance of the two vectorizers can be attributed to the characteristics of the datasets. For instance, a Count vectorizer might be more effective when the documents are shorter and contain fewer distinct words [20].
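The kind of vectorizer comparison carried out in these studies and in the present one can be sketched as follows, again assuming scikit-learn; the toy corpus, the choice of classifier, and the cross-validation settings are assumptions for illustration only.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy corpus standing in for the student comments
texts = [
    "great course", "very helpful teacher", "clear and engaging", "excellent lecture",
    "boring lecture", "poor explanation", "confusing and dull", "waste of time",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Same classifier, two vectorizers: which one wins depends on the dataset
for name, vec in [("Count", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    model = make_pipeline(vec, LinearSVC())
    scores = cross_val_score(model, texts, labels, cv=2, scoring="f1_weighted")
    print(f"{name}: weighted F1 = {scores.mean():.2f}")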
Results on Ngram vectorization indicate that Ngrams outperformed Count in terms of F1 score when applied in the ensemble. These findings support those of [9], who, using Ngram analysis for feature extraction, found that an ensemble yielded better performance than individual classifiers in sentiment classification of students' comments. The findings of the present study also complement those of [21], who found that the ensemble model coped well with errors.
The weighted-average results on the training dataset using Ngrams are presented in Table 1, which shows that Ngrams with ngram_range set to (1, 2) (unigrams + bigrams) yielded the highest precision, recall, and F1 score when applied in the ensemble. A comparison of the F1 scores of the ensemble using Count and Ngrams is shown in Table 2; the results indicate that Ngrams outperformed Count in terms of F1 score when applied in the ensemble.
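The unigram + bigram setting in Table 1 corresponds to ngram_range=(1, 2) in a scikit-learn CountVectorizer; the sketch below, with assumed toy data and base learners, shows how such features would feed the voting ensemble.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

def ngram_pipeline(clf):
    # Unigrams + bigrams, as in the ngram_range = (1, 2) setting of Table 1
    return make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)

ensemble = VotingClassifier(
    estimators=[("lr", ngram_pipeline(LogisticRegression(max_iter=1000))),
                ("svm", ngram_pipeline(LinearSVC())),
                ("nb", ngram_pipeline(MultinomialNB()))],
    voting="hard")

# Toy comments; bigrams such as "not helpful" capture context that unigrams miss
texts = ["not helpful at all", "really helpful teacher", "not good", "very good lecture"]
labels = ["negative", "positive", "negative", "positive"]
ensemble.fit(texts, labels)
print(ensemble.predict(["not helpful", "really good"]))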