Selection Informative Units for Extractive Summarization

METİN TURAN

Computer Engineering Department,

İstanbul Ticaret University,

Küçükyalı, İstanbul,

TURKEY

Abstract: - An Extractive Multi-Document Summarizer must select the most informative units and prevent duplication in extraction. To achieve this goal, a new technique called “comprising at least one Representative Term at the Highest Frequency” (RTHF) is proposed in this work. Units that include representative terms only at low frequencies are not considered for extraction (selection of the most informative units). On the other hand, units that satisfy the RTHF feature precede other similar units in ranking (prevention of duplication). The heuristic behind RTHF is explained by probability. RTHF was tested on a previously developed and tested paragraph-based Extractive Multi-Document Summarizer. The results show that it improves the original system by 0.8% to 3.2% (Average-F values of ROUGE metrics).

Key-Words: - Document Summarization, Informative Units, TF-IDF, Paragraph Extraction, NLP, AI

Received: June 28, 2022. Revised: February 21, 2023. Accepted: March 8, 2023. Published: March 23, 2023.

1 Introduction

Automatic multi-document summarization is the job of producing a summary from a bulk of documents. Although the need for it results from the rapidly increasing number of publicly available documents, the fundamental objective is to produce summaries automatically that resemble those written by a human summarizer. A survey of Extractive Multi-Document Summarizer (EMDS) approaches can be found in the article by Kumar and Salim [1] or in the M.Sc. thesis of Sizov [2].

A document is composed of small units such as sentences, paragraphs or text segments. The sentence is the most common unit in summaries because it is easy to parse and process. Researchers have suggested different techniques [3, 4] in order to select more relevant sentences.

Comparatively few studies focus on the extraction of paragraphs in EMDS. The best-known research was done by Mitra and colleagues [5]. The latest system was developed in a doctoral thesis [6]. The results of the latter work show that a paragraph-based summary can be as effective as a sentence-based summary.

Document units can be identified by text features, and research has focused on discovering new ones. The pioneer was Edmundson [7], who suggested three additional features (cue, title, location) to evaluate sentence weights more accurately. Much later, Kupiec proposed a system [8] based on the probability of features in text, namely the Sentence Length Cutoff Feature, Fixed-Phrase Feature, Paragraph Feature, Thematic Word Feature, and Uppercase Word Feature. Another important work was done by Kumar and his colleagues [9]. They calculated sentence popularity using word features such as cue words, stigma words and keywords. Suanmali [10] proposed a fuzzy system that scores sentences using features suggested in the literature (proper noun, thematic word, numerical data). Finally, the work of Gupta and Lehal [11] is an example of recent research that uses the feature-based approach. They investigated text mining technologies and exemplified applications across a broad range.

Machine learning has recently been adopted to identify the weights of sentences to be selected for a summary. For example, Binwahlan [12] used the particle swarm optimization (PSO) technique in 2009, with pairs of documents and summaries used for training. A similar work was done by Bossard and Rodrigues [13], who used a genetic algorithm to determine the best weights for the features. Manne and Fatima [14] also suggested an HMM tagger to improve the quality of the summary by feature term identification.

The feature-based technique is simple; however, it does not explain how the terms are related or how they are dispersed through the documents. Li and colleagues [15]

used lexical chains and suggested a keyword

WSEAS TRANSACTIONS on SYSTEMS

DOI: 10.37394/23202.2023.22.31

Metin Turan

E-ISSN: 2224-2678

287

Volume 22, 2023

extraction algorithm, so that the shortcoming of TF-IDF is partially avoided. A few techniques in the literature obtain representative terms (with dispersion considered) instead of using the whole vocabulary that exists in the document(s). The first approach [16] refers to the Helmholtz principle from the Gestalt theory of human perception and obtains a statistical value for each term. Terms above a threshold value are accepted as representative terms. The second approach [17] is based on the “term frequency-inverse document frequency” (TF-IDF) values of terms. Furthermore, the work of Litvak and Last [18] is an example of single-document summarization that uses a graph-based approach to obtain representative words.

The other important technique studied is based on document structure. Marcu [19] proposed a method that captures the rhetorical structure of a document. It depends on a set of constraints and assumes text coherence. Rhetorical structure composition was also applied to multiple documents by Yong-dong and colleagues [20]. Salton [21] suggested paragraph-based extraction using the intra-document links between paragraphs, which finally produces a text relationship map. Okazaki [22] proposed a similar approach applied to the interrelationships of sentences.

The purpose of this study is to devise a new technique that selects the more informative paragraphs among similar ones, thereby minimizing information duplication and enhancing summary quality even at higher compression rates. The devised technique is called RTHF (comprising at least one Representative Term at the Highest Frequency). RTHF ensures that a unit contains at least one representative term whose frequency in this unit is the greatest among all units in the document(s). Moreover, if a unit includes many representative terms at low frequencies, it is treated as garbage and is not considered for summarization. Finally, RTHF units are ranked and extracted in sequence until the summary size is reached.

Automatic multi-document summarization is a complex task that requires both the detection of related segments in documents and the selection of the more informative ones for extraction. The order in which the extracted segments should be presented is another issue. In this work, the successful paragraph-based work [6] is extended to use representative term frequencies in order to select more informative paragraphs; this extension is called RTHF. The advantages of the technique are that it is very simple (only vector operations) and applicable to any EMDS. Representative term frequencies alone are sufficient to complete other tasks in an EMDS. The selection of informative segments is addressed in this work. The technique can also be adapted to order segments in extraction by sorting the frequencies of representative words in the selected segments (further work).

RTHF is a heuristic technique and is presented here for the first time in the literature. It was tested on the paragraph-based EMDS [6], using the same data set (DUC 2006). The final ROUGE metrics were compared with the values reported in [6]. This study shows that using RTHF improves the Average-F values of the ROUGE metrics by between 0.8% and 3.2%.

2 Problem Formulation

Assume D is the whole document set and T is the set of all terms existing in D.

$D = \{d_1, d_2, \ldots, d_n\}$

$T = \{t_1, t_2, \ldots, t_m\}$

First, all stop words are removed and the remaining terms are stemmed. Synonymous terms are counted together in the frequency calculation. It is assumed that the terms in T are now independent. In order to obtain the meaningful words ($T_R$) that represent the documents, the Term Dispersion Ratio (TDR) metric is proposed in Equation (1). This metric evaluates how commonly a word is seen across the documents. In this work, the effect of TDR on meaningful word selection is tested with different TDR values in Equation (2).

$TDR(t_i) = \frac{|\{d_k \in D : t_i \in d_k\}|}{n}$ (1)

if $TDR(t_i) \geq TDR$ then $T_R = T_R \cup \{t_i\}$ (2)

Some readers may confuse TDR with TF-IDF. However, the TF-IDF formula is defined in Equation (3), where $tf_{t,d}$ is the frequency of term $t$ in document $d$, $n$ is the number of documents in the data set, and the document frequency $df_t$ is the number of documents in the collection that contain the term $t$.

$\text{tf-idf}_{t,d} = tf_{t,d} \times \log \frac{n}{df_t}$ (3)

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. However, TDR works in the reverse manner. It is interested in


representative terms, that is, terms seen frequently (above a given ratio) in the document collection; all other terms are not representative.
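The contrast between TDR and idf can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy documents, the function names (`document_frequency`, `representative_terms`, `idf`) and the natural logarithm are not taken from the original system, and a term is assumed to be representative when it appears in at least ceil(n × TDR) documents.

```python
import math


def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return df


def representative_terms(docs, tdr):
    """Terms seen in at least ceil(n * tdr) documents (assumed reading of TDR)."""
    r = math.ceil(len(docs) * tdr)
    return {term for term, count in document_frequency(docs).items() if count >= r}


def idf(docs):
    """Inverse document frequency, log(n / df_t), for every term."""
    n = len(docs)
    return {term: math.log(n / count) for term, count in document_frequency(docs).items()}


# Hypothetical, already stemmed and stop-word-free documents.
docs = [
    ["storm", "flood", "rescue"],
    ["storm", "flood", "damage"],
    ["storm", "insurance"],
    ["storm", "rescue", "volunteer"],
]

reps = representative_terms(docs, 0.5)
weights = idf(docs)
print(sorted(reps))
print(weights["storm"], weights["damage"])
```

With a TDR of 0.5, the terms seen in at least two of the four documents are kept, whereas idf gives the term that occurs in every document a weight of zero and rewards the rare terms instead.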

As a result, TDR is the minimum proportion of documents required for a term to be selected as representative. Equivalently, it fixes the minimum number of documents that must include the term. Once a unit type (sentence, paragraph) is determined (a paragraph is used in the EMDS), all units in D can be represented as follows:

$U = \{u_1, u_2, \ldots, u_y\}$

If the frequency of a representative term $t_i$ in unit $u_j$ is given by $f(t_i, u_j)$, then the unit term vector ($\vec{u_j}$) over the $x$ representative terms can be represented as follows:

$\vec{u_j} = (f(t_1, u_j), f(t_2, u_j), \ldots, f(t_x, u_j))$

For document $d_k$, the document center vector ($\vec{dc_k}$) is defined by the highest frequency of each representative term seen in this document (Equation (4)).

$\vec{dc_k} = (\max\{f(t_1, u_j)\}, \ldots, \max\{f(t_x, u_j)\})$, where $u_j \in d_k$ (4)

Furthermore, $\vec{DC}$ is defined by the highest frequency of each representative term seen in all units of the document set D (Equation (5)). It is called the data set center vector.

$\vec{DC} = (\max\{f(t_1, u_j)\}, \ldots, \max\{f(t_x, u_j)\})$, where $u_j \in D$ (5)

Once the document center vectors and the data set center vector are constructed, the Euclidean distance between each document center vector ($\vec{dc_k}$) and the data set center vector ($\vec{DC}$) is calculated. Those that lie more than 2σ away are assumed to be outliers. The documents that are far away from the data set center vector ($\vec{DC}$) are discarded (outlier documents). The others are now candidates for the summary.
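A minimal Python sketch of the center vectors and the outlier filter follows. The helper names and the toy data are hypothetical, and the threshold is an assumption: the text only says "over 2σ", which is read here as a distance greater than the mean distance plus two (population) standard deviations.

```python
import math


def center_vector(vectors):
    """Component-wise maximum over equal-length frequency vectors."""
    return [max(column) for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_outlier_documents(documents):
    """documents maps a document name to the term vectors of its units.

    Returns the names of the kept documents and every document's distance
    to the data set center vector.
    """
    doc_centers = {name: center_vector(units) for name, units in documents.items()}
    # The maximum over all units equals the maximum over the document centers.
    dataset_center = center_vector(list(doc_centers.values()))
    distances = {name: euclidean(c, dataset_center) for name, c in doc_centers.items()}
    values = list(distances.values())
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    # ASSUMPTION: "over 2 sigma" means distance > mean + 2 * sigma.
    kept = [name for name, dist in distances.items() if dist <= mean + 2 * sigma]
    return kept, distances


# Hypothetical data: five documents share the center (3, 2, 2); d6 is off-topic.
documents = {f"d{i}": [[3, 0, 2], [1, 2, 0]] for i in range(1, 6)}
documents["d6"] = [[0, 0, 1]]

kept, distances = filter_outlier_documents(documents)
print(kept)
print(distances["d6"])
```

In this toy data only `d6` lies beyond the assumed threshold, so its paragraphs would not be candidates for the summary.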

The idea behind outlier units can be explained by document structure. A well-written document generally consists of one topic and its sub-topics, and a paragraph is expected to cover one of these sub-topics. The process of selecting representative terms is actually an attempt to associate the term(s) with a sub-topic. However, some paragraphs might include general information about the topic rather than a specific sub-topic (such a paragraph includes many infrequent representative terms and seems related to nearly all sub-topics). Although this seems a heuristic observation, it follows from the entropy law given by Equation (6).

$Entropy = -\sum_{i} p_i \log p_i$ (6)

Entropy indicates the stability of the system: the closer the entropy is to zero, the more stable the system. When the entropy approaches zero, the unit is related to only a few terms. If a unit could be assigned to only one term, its entropy would be zero, which would be the best result. Consequently, filtering out general units (those related to many representative terms) is suggested in this article. Units of this type should be detected and excluded from further processing (extraction).
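The entropy argument can be illustrated with a short Python sketch. It assumes that $p_i$ is the share of a unit's representative-term occurrences that belongs to term $i$ and that the logarithm is base 2; neither choice is stated in the text, and the function name `unit_entropy` is hypothetical.

```python
import math


def unit_entropy(term_frequencies):
    """Shannon entropy (base 2) of a unit's representative-term distribution."""
    total = sum(term_frequencies)
    if total == 0:
        return 0.0
    entropy = 0.0
    for frequency in term_frequencies:
        if frequency > 0:
            p = frequency / total
            entropy -= p * math.log2(p)
    return entropy


focused = unit_entropy([6, 0, 0, 0])  # a unit devoted to a single term
general = unit_entropy([1, 1, 1, 1])  # a unit touching every term equally
print(focused, general)
```

A unit tied to one term has zero entropy, while a unit that spreads evenly over four terms reaches the maximum of 2 bits, which is the kind of general unit the filter is meant to remove.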

Moreover, units can contain many terms that are not even representative terms, or only one representative term with low frequency. RTHF is proposed as a solution for detecting the more informative units in the documents and can be defined as follows:

“A unit must contain at least one representative term whose frequency in this unit is the highest among all units in the document”.

If a unit satisfies RTHF, it includes detailed information about one sub-topic. However, this still does not guarantee that the unit is not a general one; it only decreases the probability of its being a general unit. Moreover, RTHF helps to decrease overlapping (data duplication) in the summary by selecting one or a few informative units for each topic in the document (depending on the compression rate).
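The RTHF rule quoted above can be sketched as a simple column-maximum test. The function name `rthf_units` and the sample vectors are hypothetical; ties are kept on purpose, on the assumption that units sharing the highest frequency of a term are all candidates.

```python
def rthf_units(unit_vectors):
    """Indices of the units holding the highest frequency of at least one term.

    unit_vectors is a list of equal-length frequency vectors, one per unit,
    whose columns are the representative terms.
    """
    column_max = [max(column) for column in zip(*unit_vectors)]
    selected = []
    for index, vector in enumerate(unit_vectors):
        # A zero frequency never counts, even if the whole column is zero.
        if any(f > 0 and f == m for f, m in zip(vector, column_max)):
            selected.append(index)
    return selected


units = [
    [3, 0, 2],  # highest for the first and third terms
    [1, 1, 1],  # mentions every term, but never at the highest frequency
    [0, 2, 0],  # highest for the second term
    [3, 0, 0],  # ties with the first unit on the first term
]
selected = rthf_units(units)
print(selected)
```

The second unit mentions all three terms but is filtered out, which is the behaviour expected for a general unit.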

The usage of RTHF is exemplified by the paragraph vectors given in Table 1. The rows are the paragraphs ($p_j$) in the document and the columns are the representative terms ($t_i$).

Table 1. Example paragraph vectors.

[Table 1: frequencies of eight representative terms in six paragraphs; the cell values were not recoverable.]

If RTHF is applied to the paragraph vectors in Table 1, then $p_1$, $p_2$, $p_3$, $p_5$ and $p_6$ are selected for extraction. Although $p_4$ is possibly a general rather than a specific paragraph in the document (it includes nearly all representative terms, but none of them at the highest frequency), it


would still be selected by a simple Matching Percent (MP) similarity measure, which is defined as the number of representative terms seen in a paragraph divided by the total number of representative terms. It can be shown as follows:

$MP(p_1) = 4 / 8 = 0.5$,
$MP(p_2) = 4 / 8 = 0.5$,
$MP(p_3) = 4 / 8 = 0.5$,
$MP(p_4) = 7 / 8 = 0.875$ (highest),
$MP(p_5) = 5 / 8 = 0.625$,
$MP(p_6) = 4 / 8 = 0.5$.

On the other hand, when RTHF is applied, two paragraphs that share the highest frequency of the same representative term are both candidates for extraction. Moreover, a paragraph that holds the highest frequency of two representative terms is selected. However, the paragraph with the highest MP value has no representative term at the highest frequency, so it is not selected.
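The difference between MP and RTHF can be reproduced with a small Python sketch. The frequency table below is hypothetical: it was constructed only so that the MP values match the ones listed above (4/8, 4/8, 4/8, 7/8, 5/8, 4/8) and should not be read as the original Table 1. The function names are likewise illustrative.

```python
# Hypothetical 6 x 8 table: rows are paragraphs, columns are terms t1..t8.
table = {
    "p1": [5, 2, 0, 0, 1, 0, 0, 1],
    "p2": [0, 4, 3, 0, 0, 1, 1, 0],
    "p3": [0, 0, 3, 6, 1, 0, 0, 1],
    "p4": [1, 1, 1, 1, 1, 1, 1, 0],  # general paragraph: many terms, all rare
    "p5": [0, 1, 0, 1, 4, 5, 1, 0],
    "p6": [1, 0, 0, 0, 0, 1, 3, 2],
}


def matching_percent(vector):
    """Share of the representative terms that occur at least once."""
    return sum(1 for f in vector if f > 0) / len(vector)


def rthf(table):
    """Paragraphs holding the highest frequency of at least one term."""
    column_max = [max(column) for column in zip(*table.values())]
    return [
        name
        for name, vector in table.items()
        if any(f > 0 and f == m for f, m in zip(vector, column_max))
    ]


mp = {name: matching_percent(vector) for name, vector in table.items()}
best_by_mp = max(mp, key=mp.get)
selected = rthf(table)
print(mp)
print(best_by_mp, selected)
```

In this table `p2` and `p3` tie on the third term and `p5` holds the highest frequency of two terms, so all of them are selected, while `p4` ranks first under MP but is rejected by RTHF.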

3 Proof of RTHF

The effect of TDR can be described with the intersection property of set theory. The condition for a term to be a member of the representative terms set ($T_R$) is the minimum number of documents in which it must be seen, named $r$. $r$ is the ceiling of the product of $n$ and TDR, as given by Equation (7). The total number of combinational sets, $C(n, r)$, is then given by Equation (8).

$r = \lceil n \times TDR \rceil$ (7)

$C(n, r) = \frac{n!}{r!\,(n - r)!} = k$ (8)

Define an index $i$ on Equation (8), where $i \in \mathbb{N}$ and $i \leq k$, that is, $i \in \{1, 2, \ldots, k\}$. The members of the combinational set can then be written as $C_i(n, r)$. Stated explicitly, $C_i(n, r)$ is the set of document numbers of the $i$-th combination among the $r$-combinations of $n$ documents.
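The enumeration of the combinational sets can be sketched with Python's standard library. The function name `combinational_sets` is hypothetical, and the lexicographic order produced by `itertools.combinations` is an assumption about how the index $i$ is assigned.

```python
import math
from itertools import combinations


def combinational_sets(n, tdr):
    """Return r = ceil(n * tdr) and every r-combination of document numbers 1..n."""
    r = math.ceil(n * tdr)
    return r, [set(combo) for combo in combinations(range(1, n + 1), r)]


r, sets = combinational_sets(4, 0.5)
print(r, len(sets))
for i, members in enumerate(sets, start=1):
    print(f"C_{i}(4, {r}) = {sorted(members)}")
```

For four documents and a TDR of 0.5 this gives r = 2 and six combinational sets, in agreement with C(4, 2) = 6.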

As an example, assume the following 4 documents and, for simplicity, a TDR of 0.5. Then the $C_i(n, r)$ values are as follows:

$D = \{d_1, d_2, d_3, d_4\}$, where $r = 2$ and $C(4, 2) = 6$

$C_1(4, 2) = \{1, 2\}$, $C_2(4, 2) = \{1, 3\}$,
$C_3(4, 2) = \{1, 4\}$, $C_4(4, 2) = \{2, 3\}$,
$C_5(4, 2) = \{2, 4\}$, $C_6(4, 2) = \{3, 4\}$.

The representative terms ($T_{C_i(n,r)}$) of the combinational set $C_i(n, r)$ are determined by the intersection of the document center vectors ($\vec{dc_k}$) of the documents in the combinational set ($k \in C_i(n, r)$). If all the documents in the combinational set have a nonzero frequency for a term in their document center vector, then this term is selected as a representative term; otherwise it is not selected. To achieve this, the elements of the $\vec{dc_k}$ document center vector are first transformed into binary numbers by using the following sign function.

$sgn(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$

Because all frequencies are non-negative, the vector members are either one or zero. The document center vector after the sign function is applied is represented by $\vec{sdc_k}$. In this way, the frequencies are removed and only the existence of a representative term in a document is considered (1 means existence and 0 means non-existence). Then, the vectorial computation for the combinational set $C_i(n, r)$ is done as defined in Equation (9).

$\vec{C_i}(n, r) = \bigcap_{k \in C_i(n, r)} \vec{sdc_k}$ (9)

The intersection operator in Equation (9) results in a vector that marks the terms present in all the documents within the combinational set $C_i(n, r)$. The $\vec{C_i}(n, r)$ vector is then used to construct the representative terms set ($T_{C_i(n,r)}$) by applying the following rule, where $\vec{C_i}(n, r)[j]$ is the $j$-th element of the vector.

$t_j \in T_{C_i(n,r)}$ if $\vec{C_i}(n, r)[j] = 1$; $t_j \notin T_{C_i(n,r)}$ if $\vec{C_i}(n, r)[j] = 0$

Finally, the set of all representative terms for the document collection, $T_R(r)$, is calculated by Equation (10) as the union of the $T_{C_i(n,r)}$ sets obtained above.

$T_R(r) = \bigcup_{i=1}^{k} T_{C_i(n,r)}$ (10)


Let us continue with an example. Assume the document center vectors given in Table 2 and a TDR of 0.5.

Table 2. Example document center vectors.

[Table 2: document center vectors $\vec{dc_1}$ to $\vec{dc_4}$; the cell values were not recoverable.]

First, the document center vectors are converted into binary vectors using the sign function. The result is given in Table 3.

Table 3. Document center vectors converted into binary numbers.

[Table 3: binary vectors $\vec{sdc_1}$ to $\vec{sdc_4}$; the cell values were not recoverable.]

Then, the $\vec{C_i}(4, 2)$ vector values are computed using Equation (9). For a combination that contains documents $a$ and $b$, this is

$\vec{C_i}(4, 2) = \vec{sdc_a} \cap \vec{sdc_b}$

All results are given in Table 4.

Table 4. $\vec{C_i}(4, 2)$ vectors of the combinational sets $C_i(4, 2)$.

[Table 4: vectors $\vec{C_1}(4, 2)$ to $\vec{C_6}(4, 2)$; the cell values were not recoverable.]

Later, the $T_{C_i(4,2)}$ sets of representative terms are produced from the vectors given in Table 4, one set for each of the six combinations. Finally, the $T_R(2)$ set is constructed by the union of these sets:

$T_R(2) = T_{C_1(4,2)} \cup T_{C_2(4,2)} \cup T_{C_3(4,2)} \cup T_{C_4(4,2)} \cup T_{C_5(4,2)} \cup T_{C_6(4,2)}$

[The member terms of the individual sets and of their union were not recoverable.]

Although the other combination sizes between $r$ and $n$ must also be considered, the subset relationship between the representative terms sets holds ($T_R(n) \subseteq \ldots \subseteq T_R(r+1) \subseteq T_R(r)$), so the final representative terms set ($T_R$) can be defined by Equation (12) in a simple form instead of Equation (11).

$T_R = \bigcup_{j \in \{r, \ldots, n\}} T_R(j)$ (11)

$T_R = T_R(r)$ (12)

Consider the probability that a term ($t_i$) seen in the document set is a member of the $T_R$ set. Assume that the terms ($m$ terms) appear within the document set with equal probability ($1/m$). If $r$ is 1, then all terms are members of the $T_R(1)$ set. For a general $r$ value, however, the probability of selecting $k$ terms out of $m$ terms is given by Equation (13).

$P(T_R(r)) = \frac{|T_R(r)|}{m} = \frac{k}{m}$ (13)

The probability obtained in Equation (13) is inversely related to TDR: when TDR increases, the probability $P(T_R(r))$ decreases. This is a result of the subset relationship between the representative terms sets ($T_R(n) \subseteq \ldots \subseteq T_R(r+1) \subseteq T_R(r)$). It can be understood heuristically and modeled by the limit given in Equation (14).

$\lim_{r \to \infty} P(T_R(r)) = 0$ (14)

It follows from Equation (14) that, in the case of an infinite document set, $T_R$ would be an empty set. In other words, the units would not include a


common term all together. As a result, the number of selected representative terms can be controlled by adjusting the $r$ value. Since $r$ is determined by TDR, TDR plays an important role in selecting representative terms in a controlled way.
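The construction of the representative terms set and its dependence on $r$ can be sketched as follows. The document center vectors are hypothetical (they are not the values of the original example), terms are identified by their 1-based column index, and the component-wise minimum of binary vectors is used as the intersection; these choices are assumptions made for illustration.

```python
import math
from itertools import combinations


def sign_vector(vector):
    """Binary presence vector: 1 where the frequency is positive, else 0."""
    return [1 if f > 0 else 0 for f in vector]


def representative_set(doc_centers, r):
    """Union, over all r-combinations of documents, of the terms they all share."""
    binary = [sign_vector(v) for v in doc_centers]
    width = len(binary[0])
    terms = set()
    for combo in combinations(range(len(binary)), r):
        # Component-wise minimum of 0/1 vectors acts as the intersection.
        common = [min(binary[k][j] for k in combo) for j in range(width)]
        terms |= {j + 1 for j, bit in enumerate(common) if bit == 1}
    return terms


# Hypothetical document center vectors for four documents and five terms.
doc_centers = [
    [4, 0, 2, 1, 0],
    [3, 1, 0, 2, 0],
    [5, 0, 0, 0, 1],
    [2, 0, 3, 1, 0],
]
n = len(doc_centers)
r = math.ceil(n * 0.5)
sets_by_r = {j: representative_set(doc_centers, j) for j in range(1, n + 1)}
for j, terms in sets_by_r.items():
    print(j, sorted(terms))
```

The sets shrink as the required number of documents grows, so the union over all sizes from `r` to `n` is simply the set obtained for `r`, and raising TDR leaves fewer representative terms.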

On the other hand, the metrics called precision (P) and recall (R), defined by Equation (15) and Equation (16) respectively, must be increased for the EMDS to succeed. To increase precision and recall, the summary must include the most relevant paragraphs.

$P = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{retrieved\}|}$ (15)

$R = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{relevant\}|}$ (16)

RTHF plays an important role in increasing the number of relevant paragraphs in the summary. It identifies general paragraphs, that is, those that include many members of $T_R$ at low frequencies.

To illustrate the heuristic with a simplified model, assume the document set is composed of $y$ paragraphs and $T_R$ includes $x$ terms. If each paragraph contains only one term of $T_R$ at the highest frequency, then $x$ paragraphs satisfy RTHF and the following number of general paragraphs is determined.

$y - x$

The actual effect of RTHF is expected at higher TDR values. The results obtained in [6] also support this idea (the best scores there are reported for a TDR of 75%).

4 Experiments

The experiments use the same DUC 2006 corpus as [6]. The system summaries are limited to 250 words and extraction is paragraph-based.

ROUGE [23] is used to evaluate the RTHF model. We focus on the F-Score metric, given by Equation (17), which is the harmonic mean of Equations (15) and (16).

$F\text{-}Score = 2 \times \frac{P \times R}{P + R}$ (17)

The model was run with three TDR values (25%, 50%, 75%) on each of the 50 data sets of the DUC 2006 corpus. The averages of the ROUGE metrics were calculated separately for each TDR value. The abbreviations Average-R, Average-P and Average-F used in Table 5 denote average recall, average precision and average F-Score, respectively. Moreover, the maximum value of each row is marked to enhance readability.
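The three metrics can be sketched on sets of selected and relevant items. This is a set-based illustration of precision, recall and their harmonic mean only; it is not the ROUGE package, which counts overlapping n-grams, and the function name and paragraph labels are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall and F-score for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical system summary versus hypothetical reference paragraphs.
p, r, f = precision_recall_f({"p1", "p2", "p3", "p5"}, {"p1", "p2", "p6"})
print(p, r, f)
```

Two of the four selected paragraphs are relevant and two of the three relevant paragraphs are found, so precision is 1/2, recall is 2/3 and the F-score is 4/7.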

Table 5 compares the best ROUGE metrics reported in [6] with those of the same EMDS with RTHF applied, for different TDR values.

Table 5. Comparison of EMDS [6] and RTHF for different TDR values

Metric                   Best values   RTHF        RTHF        RTHF
                         for [6]       TDR (25%)   TDR (50%)   TDR (75%)
ROUGE-1    Average-R     0.60830       0.59882     0.61308     0.62069*
           Average-P     0.57537       0.57653*    0.57016     0.57277
           Average-F     0.58993       0.58595     0.58935     0.59472*
ROUGE-2    Average-R     0.38602       0.37872     0.39090     0.39865*
           Average-P     0.36308       0.36390     0.36317     0.36775*
           Average-F     0.37175       0.37021     0.37558     0.38191*
ROUGE-3    Average-R     0.32035       0.31241     0.32333     0.33144*
           Average-P     0.30734*      0.29972     0.30011     0.30560
           Average-F     0.30816       0.30514     0.31050     0.31744*
ROUGE-4    Average-R     0.28037       0.27335     0.28298     0.29069*
           Average-P     0.26259       0.26211     0.26248     0.26791*
           Average-F     0.26961       0.26691     0.27165     0.27835*
ROUGE-L    Average-R     0.54041       0.52834     0.54380     0.55423*
           Average-P     0.50852       0.50840     0.50529     0.51118*
           Average-F     0.52031       0.51682     0.52251     0.53090*
ROUGE-W    Average-R     0.14879       0.14563     0.14988     0.15237*
           Average-P     0.27307       0.27290     0.27108     0.27376*
           Average-F     0.19156       0.18950     0.19256     0.19551*
ROUGE-SU4  Average-R     0.36260       0.35148     0.36784     0.37583*
           Average-P     0.32567       0.32691*    0.31931     0.32134
           Average-F     0.33931       0.33544     0.33858     0.34415*

* maximum value of the row

With a TDR of 75%, the RTHF model improves Average-R and Average-F for every ROUGE metric, whereas Average-P improves for only some of them. Table 6 shows that the RTHF model improves Average-F by 0.8% to 3.2%.


Table 6. Improvement of the Average-F metric by RTHF

Metric                   Improvement
ROUGE-1    Average-F     +0.8%
ROUGE-2    Average-F     +2.7%
ROUGE-3    Average-F     +3.0%
ROUGE-4    Average-F     +3.2%
ROUGE-L    Average-F     +2.0%
ROUGE-W    Average-F     +2.0%
ROUGE-SU4  Average-F     +1.4%
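The percentages in Table 6 can be recomputed from the Average-F rows of Table 5, comparing the best values of [6] with RTHF at a TDR of 75%. The sketch below assumes a relative improvement truncated to one decimal place; the text does not state how the percentages were rounded, and rounding to the nearest tenth would give +2.1% for ROUGE-W.

```python
# Average-F values copied from Table 5.
baseline = {
    "ROUGE-1": 0.58993, "ROUGE-2": 0.37175, "ROUGE-3": 0.30816,
    "ROUGE-4": 0.26961, "ROUGE-L": 0.52031, "ROUGE-W": 0.19156,
    "ROUGE-SU4": 0.33931,
}
rthf_75 = {
    "ROUGE-1": 0.59472, "ROUGE-2": 0.38191, "ROUGE-3": 0.31744,
    "ROUGE-4": 0.27835, "ROUGE-L": 0.53090, "ROUGE-W": 0.19551,
    "ROUGE-SU4": 0.34415,
}


def improvement_percent(new, old):
    """Relative improvement in percent, truncated to one decimal (assumed)."""
    percent = (new / old - 1) * 100
    return int(percent * 10) / 10


gains = {m: improvement_percent(rthf_75[m], baseline[m]) for m in baseline}
for metric, gain in gains.items():
    print(f"{metric}: +{gain}%")
```

Under this assumption the computed gains range from 0.8% (ROUGE-1) to 3.2% (ROUGE-4), the same range reported in the abstract.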

5 Conclusion and Further Works

A summarizer is a tool composed of phases, each of which uses a different technique. In other words, developing a successful automatic summarizer requires techniques that work together in harmony.

This work aims to mark general paragraphs and prevent them from becoming candidates for the summary. The responsibility of RTHF is to apply extra filtering to units after the representative terms are selected. RTHF forces a unit to contain at least one term whose frequency in this unit is the highest among all units in the same document.

RTHF was applied to the existing EMDS. The results show that RTHF is a successful feature for selecting more informative paragraphs. Moreover, it produces the best values at the higher TDR (75%), as theoretically explained. RTHF is a unit-based approach, so it could also be applied to other extractive unit types (sentence, segment).

On the other hand, this technique has a drawback: how to select enough representative terms to produce the summary (which depends on the compression rate). In other words, the relationship between TDR and compression rate should be established.

It is obvious that RTHF prevents general paragraphs from being selected for the summary. On the other hand, the model still suffers from the MP measure, which gives a low score to paragraphs that include only one member of $T_R$ at the highest frequency and few other members of $T_R$.

Furthermore, RTHF is a sharp feature, which means that it selects only the best unit. Selecting paragraphs that have at least one member of $T_R$ above the term's average frequency in the document units would be better.

References:

[1] Kumar YJ, Salim N. Automatic multi

document summarization approaches. J

Computer Sci 2012; 8: 133-140.

[2] Sizov G. Extraction-based automatic

summarization - theoretical and empirical

investigation of summarization techniques.

MSc, Norwegian University, Norwegian,

Oslo, 2010.

[3] Nenkova A, McKeown K. A survey of text

summarization techniques. In: Aggarwal CC,

Zhai C-X, editors. Mining Text Data, USA:

Springer US, 2012. pp. 43-76.

[4] Das D, Martins AFT. A survey on automatic

text summarization. 2007; Language

Technologies Institute, Technical Report.

[5] Mitra M, Singhal A, Buckley C. Automatic

text summarization by paragraph extraction.

In: Workshop on Intelligent Scalable Text

Summarization; 11 July 1997, Madrid, Spain.

pp. 39-46.

[6] Turan M, Sönmez C, Ganiz MC. The

benchmark of paragraph and sentence

extraction summaries using outlier document

filtering based multi-document summarizer.

Inf Technol Control 2014; 43: 433-439.

[7] Edmundson HP. New methods in automatic

extracting. J ACM 1969; 16: 264-285.

[8] Kupiec J, Pedersen JO, Chen F. A trainable

document summarizer. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1995; Seattle, WA, USA: ACM. pp. 68-73.

[9] Kumar PA, Kumar KP, Rao TS, Reddy PK.

An improved approach to extract document

summaries based on popularity. Lect Notes

Comput Sc 2005; 3433: 310-318.

[10] Suanmali L, Salim N, Binwahlan MS. Fuzzy

logic based method for improving text

summarization. Int J Comput Sci Inf Secur

2009; 2: 1-6.

[11] Gupta V, Lehal GS. A survey of text mining

techniques and applications. J Emerg Technol

Web Intell 2009; 1: 60-76.

[12] Binwahlan MS, Salim N, Suanmali L. Swarm

based text summarization. J Comput Sci 2009;

5: 338-346.


[13] Bossard A, Rodrigues C. Combining a multi-

document update summarization system with

a genetic algorithm. In: Hatzilygeroudis I,

Prentzas J, editors. Smart Innovation, Systems

and Technologies. Berlin, Germany: Springer,

2011. pp.71-87.

[14] Manne S, Fatima SS. An extensive empirical

study of feature terms selection for text

summarization and categorization. In:

CCSEIT-12; 26-28 Oct 2012; Coimbatore,

India. pp. 606-613.

[15] Li X, Wu X, Hu X, Xie F, Jiang Z. Keyword

extraction based on lexical chains and word

co-occurrence for Chinese news web page.

2008 IEEE International Conference on Data

Mining Workshops; 15-19 Dec 2008; Pisa,

Italy: IEEE. pp. 744-751.

[16] Balinsky H, Balinsky A, Simske S. Document

sentences as a small world. International

Conference on Systems, Man and

Cybernetics; 9-12 Oct 2011; Los Alamitos,

CA, USA: IEEE. pp. 2583-2588.

[17] Wang M, Xi G, Wang X, Li C, Zhang Z.

Multi-document summarization based on

word feature mining. International Conference

on Computer Science and Software

Engineering; 12-14 Dec 2008; Wuhan, China:

IEEE. pp. 743-746.

[18] Litvak M, Last M. Graph-based keyword

extraction for single-document

summarization. MMIES '08 Proceedings of

the Workshop on Multi-Source Multilingual

Information Extraction and Summarization;

23 August 2008; Manchester, UK: ACM. pp.

17-24.

[19] Marcu D. Discourse trees are good indicators

of importance in text. Advances in Automatic

Text Summarization, MIT Press, 2009. pp.

123-136.

[20] Yong-dong X, Xiao-long W, Tao L, Zhi-ming

X. Multi-document summarization based on

rhetorical structure: sentence extraction and

evaluation. IEEE International Conference on

Systems, Man and Cybernetics; 7-10 Oct

2007; Montreal, Canada: IEEE. pp. 3034-

3039.

[21] Salton G, Singhal A, Mitra M, Buckley C.

Automatic text structuring and

summarization. Inform Process Manag 1997;

32: 53-65.

[22] Okazaki N, Matsuo Y, Matsumura N,

Ishizuka M. Sentence extraction by spreading

activation through sentence similarity. IEICE

Trans Inf Syst 2003; E86D: 1686-1694.

[23] Lin C-Y. ROUGE: A package for automatic

evaluation of summaries. In Proceedings of

the Workshop on Text Summarization

Branches Out (WAS); 25-26 July 2004;

Barcelona, Spain: Association for

Computational Linguistics. pp. 74-81.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

Metin Turan, implemented algorithm, proved theory

and carried out the experiments.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en_US


Conflict of Interest

The author has no conflict of interest to declare.