European language family due to its unique
phonological, grammatical, and lexical features and
its complex syntactic structure. Over the last 10 years,
several studies have applied mathematical methods
and computer science to the classification
of Albanian texts, authorship attribution, and author
identification. Our research focuses on
developing and adapting mathematical models and
statistical methods for identifying the authors
of Albanian texts, with the aim of authorship
attribution and the detection of plagiarism in
Albanian texts, [10]. Previously, we estimated the
probability of finding the correct author in Albanian
text classification using logistic and multinomial
logistic regression models, [11], [12]. More recently,
clustering methods have been applied to Albanian texts for
word classification, [13], and to datasets of short
comments from social networks, [14]. Several studies
on Albanian text classification and categorization
classify texts by topic and label short texts
as positive or negative, [15],
[16], [17], [18], [19]. In this paper, we
present agglomerative hierarchical
clustering for classifying Albanian documents by
author according to the similarity of their word
frequencies. We apply agglomerative
hierarchical clustering methods to a corpus
built from 100 Albanian documents by 10
different authors. The similarity of texts is
measured using cosine and Euclidean distances. The
application was developed using several text
mining packages in R, [20]. Considering the
importance of stop words in text classification
models, [21], [22], we ran the application in
two cases: one in which Albanian stop words are
removed during pre-processing of the corpus, and one
in which they are retained. To increase
the accuracy of classification, in this paper we
extend the set of Albanian stop words in R for
use with the hierarchical method as a text
classification technique. We evaluate the clustering of
Albanian texts using Dunn's index, and thereby
determine the optimal clustering.
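To make the evaluation criterion concrete, the following is a minimal Python sketch of one common form of Dunn's index: the smallest distance between points in different clusters divided by the largest within-cluster distance. It is not the R implementation used in this paper; the function name `dunn_index`, the Euclidean distance, and the toy points are assumptions made for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def dunn_index(points, labels):
    """One common variant: min between-cluster distance / max cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("Dunn's index needs at least two clusters")

    # Largest distance between two points of the same cluster (diameter).
    max_diameter = max(
        (dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(groups, 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Hypothetical toy data: two tight, well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # groups match the geometry
bad = dunn_index(points, [0, 1, 0, 1])   # groups cut across the geometry
```

A larger value indicates compact, well-separated clusters, so candidate clusterings can be compared by keeping the one with the highest index.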
2 Materials and Methods
Hierarchical clustering is a powerful technique
for identifying natural groupings within datasets,
which can be especially useful for
unsupervised text classification. Hierarchical
clustering successively merges the texts or
documents of a corpus into clusters
based on their similarity. Similarity can be
evaluated by cosine similarity, Euclidean
distance, Manhattan distance, maximum
distance, etc. Hierarchical methods have the
advantage that the clustering results are simple
to interpret and that they do
not require the number of clusters to be set in advance.
The goal of clustering is to minimize the distance
between the documents in the same cluster and to
maximize the distance between documents in
different clusters. There are two types of hierarchical
methods: agglomerative and divisive.
These techniques construct their hierarchies in
opposite directions.
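The distance measures listed above can be illustrated on word-frequency vectors with a short Python sketch. This is an illustration under stated assumptions, not the R code used in this paper: the helper names (`term_counts`, `to_vectors`, and the four distance functions), the whitespace tokenization, and the two toy strings are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_counts(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.lower().split())


def to_vectors(counts_a, counts_b):
    """Align two count tables on their shared, sorted vocabulary."""
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def maximum(u, v):
    # Largest coordinate-wise difference (also called Chebyshev distance).
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    # 1 - cosine similarity; depends only on the angle between u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy documents: the second repeats the first twice.
short_doc = "alpha beta alpha"
long_doc = "alpha beta alpha alpha beta alpha"
u, v = to_vectors(term_counts(short_doc), term_counts(long_doc))
```

In this toy case the two vectors point in the same direction, so the cosine distance is zero up to rounding, while the Euclidean, Manhattan, and maximum distances are all positive because the second document is longer.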
Agglomerative methods start with every object in its own
cluster; at each step two clusters are merged, until
only one is left. Divisive methods, by contrast,
start with all objects in a single cluster; at each
following step a cluster is split, until every object
forms its own cluster.
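The agglomerative procedure described above can be sketched in a few lines of Python. This is a naive illustration, not an efficient or authoritative implementation: the function names `agglomerate` and `complete_linkage`, the use of Euclidean distance, and the one-dimensional toy points are assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def complete_linkage(a, b):
    """Distance between clusters a and b: their farthest pair of members."""
    return max(dist(p, q) for p in a for q in b)


def agglomerate(points, linkage=complete_linkage):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical toy data: four points on a line forming two obvious pairs.
points = [(0.0,), (1.0,), (5.0,), (6.0,)]
history = agglomerate(points)
heights = [height for _, _, height in history]
```

The list of merge heights is what a dendrogram displays; cutting the hierarchy just before a large jump in height (here, before the final merge) recovers the two pairs.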
Agglomerative hierarchical clustering has been
widely used in document classification, where large
volumes of textual data are analyzed and categorized
into groups based on their similarity. The
agglomerative hierarchical clustering algorithm
starts by treating each document as a separate cluster,
and then iteratively merges the most similar clusters
until all documents are grouped into a single cluster.
A linkage criterion, such as the average linkage,
complete linkage, or Ward's method, is used to
determine the similarity between two clusters based
on the similarities between their members. Ward's
method is recognized as a highly effective technique
for text clustering. It is an agglomerative
technique that starts with each document in its own
cluster and, at each step, merges the pair of
clusters whose merger gives the smallest increase in
the total sum of squared distances
between each point and its corresponding centroid.
This method is sensitive to outliers, because it
minimizes squared distances between data points and their
respective centroids. Another successful
agglomerative clustering method for text
classification is the average linkage method, also
referred to as UPGMA (unweighted pair-group
method with arithmetic mean), which
calculates the distance between two clusters as the
mean of the distances between all pairs of
documents consisting of one member from each
cluster. The complete linkage method tends to find
uniform clusters; it defines the distance between two
clusters as the maximum distance between any pair of
documents, one from each cluster.
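The three linkage criteria discussed above can be compared on a small example. The Python sketch below is illustrative only: the function names, the Euclidean distance, and the two toy clusters are assumptions, and Ward's criterion is written as the increase in the within-cluster sum of squares caused by a merge.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage(a, b):
    """Mean distance over all pairs with one member from each cluster."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def complete_linkage(a, b):
    """Maximum distance over all pairs with one member from each cluster."""
    return max(dist(p, q) for p in a for q in b)


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def within_ss(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)


def ward_increase(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)


# Hypothetical toy clusters: two vertical pairs of points, 4 units apart.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
avg = average_linkage(a, b)
comp = complete_linkage(a, b)
ward = ward_increase(a, b)
```

At each agglomerative step the pair of clusters with the smallest value of the chosen criterion is merged, so the three criteria can produce different hierarchies from the same distance data.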
In this approach, documents are initially
represented as vectors in a high-dimensional feature
space, where each feature corresponds to a specific
term in the corpus. The similarity between two
documents is then measured using a distance metric.
A distance well suited to text classification is the
cosine distance. This measure considers the angle
between the document vectors and is less sensitive to
WSEAS TRANSACTIONS on COMPUTER RESEARCH
DOI: 10.37394/232018.2024.12.2
Denisa Kaçorri, Albina Basholli, Luela Prifti