extraction algorithm, so that the shortcoming of
TF-IDF is partially mitigated. A few techniques have
also been suggested in the literature to obtain
representative terms (with dispersion considered)
instead of using the whole vocabulary of the
document(s). The first approach [16] refers to the
Helmholtz principle in the Gestalt theory of human
perception and obtains a statistical value for each
term. The terms above a threshold value are accepted
as representative terms. The second approach [17] is
based on the “term frequency-inverse document
frequency” (TF-IDF) values of terms. Furthermore,
Litvak and Last’s work [18] is an example of
single-document summarization that uses a graph-based
approach to obtain representative words.
Another important technique studied is
comparing document structure. Marcu [19]
proposed a method that captures the rhetorical
structure of a document. It depends on a set of
constraints and assumes text coherence.
Rhetorical structure composition was also applied to
multiple documents by Yong-dong and colleagues
[20]. In another work, Salton [21]
suggested paragraph-based extraction using the
intra-document links between paragraphs, from which
a text relationship map is finally produced. Okazaki
[22] proposed a similar approach applied to
sentence interrelationships.
The purpose of this study is to devise a new
technique for selecting the more informative
paragraphs among similar ones, thereby minimizing
information duplication and enhancing summary
quality even at higher compression rates. The
devised technique is called RTHF (comprising at
least one Representative Term at the Highest
Frequency). RTHF ensures that a unit contains at
least one representative term whose frequency in
this unit is the greatest among all units in the
document(s). Moreover, a unit that includes many
representative terms, all with low frequencies, is
treated as garbage and is not considered for
summarization. Finally, RTHF units are ranked and
extracted in sequence until the summary size is
reached.
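The selection rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that units are already tokenized lists of terms, that a tie with the maximum counts as the highest frequency, that units are ranked by the summed frequency of their highest-frequency terms, and that the summary size is a token budget. The ranking criterion and the names `rthf_units`, `summarize` and `max_tokens` are hypothetical, since the text does not specify them.

```python
from collections import Counter


def rthf_units(units, representative_terms):
    """Return (unit index, score) for units that hold at least one
    representative term at its highest frequency over all units."""
    # Count only representative terms in every unit.
    counts = [
        Counter(tok for tok in unit if tok in representative_terms)
        for unit in units
    ]
    # Highest frequency of each representative term over all units.
    best = {}
    for c in counts:
        for term, freq in c.items():
            best[term] = max(best.get(term, 0), freq)

    selected = []
    for i, c in enumerate(counts):
        # Terms whose frequency in this unit equals the global maximum
        # (ties are accepted; this is an assumption).
        top_terms = [t for t, f in c.items() if f == best[t]]
        if top_terms:
            # Assumed score: summed frequency of the top terms.
            selected.append((i, sum(c[t] for t in top_terms)))
    return selected


def summarize(units, representative_terms, max_tokens):
    """Extract RTHF units in ranked order until the token budget is hit."""
    ranked = sorted(
        rthf_units(units, representative_terms),
        key=lambda pair: (-pair[1], pair[0]),
    )
    summary, size = [], 0
    for i, _ in ranked:
        if size + len(units[i]) > max_tokens:
            break
        summary.append(i)
        size += len(units[i])
    return summary
```

Under these assumptions, a unit that contains many representative terms but never reaches the maximum frequency of any of them is simply never selected, which corresponds to the discarded units mentioned above.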
Automatic multi-document summarization is
a complex task that requires both detection of
the related segments in documents and selection of
the more informative ones for extraction. Moreover,
the order in which the extracted segments should be
presented is another issue. In this work, the
successful paragraph-based work [6] is extended to
use representative term frequencies in order to select
more informative paragraphs; the extension is called
RTHF. The advantages of the technique are that it is
very simple (only vector operations) and applicable
to any EMDS. Representative term frequencies alone
are sufficient to complete the other tasks in an EMDS.
This work addresses the selection of informative
segments. The technique can also be adapted to order
segments in extraction by sorting the frequencies of
representative words in the selected segments (further
work).
RTHF is a heuristic technique and is introduced
for the first time in the literature. It was tested on
the paragraph-based EMDS [6], using a similar data
set (DUC 2006) in the experiments. The final ROUGE
metrics were compared with the values reported in
[6] and discussed. This study shows
that using RTHF improves ROUGE metrics by between
1% and 3%.
2 Problem Formulation
Assume D is the whole document set and T is the
set of all terms existing in D.
$D = \{d_1, d_2, \ldots, d_n\}$
$T = \{t_1, t_2, \ldots, t_m\}$
First, all stop words are removed. Then, the
remaining terms are stemmed. Synonymous terms
are counted together in the frequency calculation. It
is assumed that the terms in T are now independent.
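The preprocessing steps above can be sketched in Python as follows. This is only an illustration. The stop word list, the synonym map and the crude suffix-stripping function `naive_stem` are hypothetical stand-ins, since the text does not name the stop list, stemmer or synonym resource actually used.

```python
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "are", "in", "and", "to"}
SYNONYMS = {"automobile": "car"}  # stemmed term -> canonical term


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Remove stop words, stem, merge synonyms, then count terms."""
    tokens = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    kept = [w for w in tokens if w and w not in STOP_WORDS]
    stems = [naive_stem(w) for w in kept]
    canonical = [SYNONYMS.get(s, s) for s in stems]
    return Counter(canonical)
```

For the sentence "The cars and the automobile are parking.", this sketch counts `car` twice (once from "cars" by stemming and once from "automobile" by the synonym map) and `park` once.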
In order to obtain meaningful words that represent
the documents, the Term Dispersion Ratio (TDR)
metric is proposed in Equation (1). This metric
evaluates how common a word is across the
documents. In this work, the effect of TDR on
meaningful word selection is examined using
different TDR values in Equation (2).
(1)
then = { } (2)
Some readers may confuse TDR with TF-IDF.
However, the TF-IDF formula is
defined in Equation (3), where n is the number of
documents in the data set and the document frequency
$df_t$ is defined as the number of documents in the
collection that contain a term t.
$idf_t = \log \frac{n}{df_t}$ (3)
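As a quick numeric illustration of inverse document frequency, the following Python sketch assumes the standard form idf_t = log(n / df_t), with n the number of documents and df_t the number of documents containing term t. The function name `inverse_document_frequency` and the representation of documents as token lists are assumptions made for this example.

```python
import math


def inverse_document_frequency(documents):
    """Compute idf_t = log(n / df_t) for every term in the collection.

    `documents` is a list of token lists; df_t counts each document
    at most once per term.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}
```

With four documents, a term that appears in all of them gets log(4/4) = 0, a term in two documents gets log(2), and a term in a single document gets log(4).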
Thus the idf of a rare term is high, whereas the idf
of a frequent term is likely to be low. However,
TDR works in the reverse manner. It is interested in
WSEAS TRANSACTIONS on SYSTEMS
DOI: 10.37394/23202.2023.22.31