Visual and Quantitative Analyses of Virus Genomic Sequences using a

Metric-based Algorithm

ALEXANDRA BELINSKY

Mach-3dP Inc.

Burlington, ON, CANADA

GUENNADI A. KOUZAEV

Norwegian University of Science and Technology - NTNU

Trondheim, NORWAY

Abstract: - This work aims to study the virus RNAs using a novel accelerated algorithm to explore any-

length repetitive genomic fragments in sequences using Hamming distance between the binary-

expressed characters of an RNA and a query pattern. Primary attention is paid to the building and

analyzing 1-D distributions (walks) of atg-patterns - codon-starting triplets in genomes. These triplets

compose a distributed set called a word scheme of RNA. A complete genome map is built by plotting

the mentioned atg-walks, trajectories of separate (a-, c-, g-, and t-symbols) nucleotides, and the lines

designating the genomic words. The said map can be additionally equipped by gene’s designations

making this tool pertinent for multi-scale genomic analyses. The visual examination of atg-walks is

followed by calculating statistical parameters of genomic sequences, including estimating walk-

geometry deviation of RNAs and fractal properties of word-length distributions. This approach is

applied to the SARS CoV-2, MERS CoV, Dengue, and Ebola viruses, whose complete genomic

sequences are taken from GenBank and GISAID. The relative stability of these walks for SARS CoV-

2 and MERS CoV viruses was found, unlike the Dengue and Ebola distributions that showed an

increased deviation of their geometrical and fractal characteristics. The developed approach can be

useful in further studying mutations of viruses and building their phylogenic trees.

Key-Words: - Biological control systems, Modeling of Biological Systems, RNA,

Hamming-distance metric measure, quantitative RNA walks, SARS Cov-2 virus, MERS CoV virus,

Dengue virus, Ebola virus

Received: April 19, 2022. Revised: November 5, 2022. Accepted: November 24, 2022. Published: December 31, 2022.

1 Introduction

A virus is a tiny semi-live unit carrying genetic

material (RNA or DNA – double-helix RNA

structure) in a protein capsid covered by a lipid

coat. The virus penetrates the cell wall and

urges this bio-machine to ‘manufacture’ more

viruses.

Some viruses are RNA-based and transfer

genetic information by long chains of four

organic acids, namely, Adenine (a), Cytosine

(c), Guanine (g), and Uracil (u) [1]. DNA-based

viruses and double-stranded genetic polymers

carry the information by four nucleotides, but

one of them is Thymine (t) instead of Uracil. In

genomic databases, even the single-stranded

viral RNAs are registered as complementary

chains where Thymine substitutes Uracil due to

some instrumental specifics [2] that do not

hinder the mathematical aspects of the virus

theories. These complimentary RNAs will be

used further for numerical modeling in our

paper.

A complete RNA is a chain of codons

(exons) used to transfer genetic information and

introns. Unfortunately, the latter’s role is not

well known [3]. Sequencing of RNA or DNA is

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

323

Volume 21, 2022

searching and identifying nucleotides by

instrumental means. Codons in RNAs start with

an ‘aug’ combination of nucleotides and end

with one of the following three combinations:

‘uaa’, ‘uag’, or ‘uga’. Additionally, ‘aug’ may

play a coding role in genomes. Some DNA

strands consist of billions of nucleotides, so

mathematical methods are widely used in

genomics [4]. For instance, the RNA symbols

are substituted by number values, and this

process is called DNA/RNA mapping [5]-[8].

For example, in Ref. [5], eleven methods of

numerical representation of genomic sequences

are listed and analyzed to conclude that each is

preferable in a particular application, and no

universal mapping algorithm is equally

advantageous for all genomic studies.

Different retrieval algorithms can be

applied to genomic sequences, including signal-

processing means [7],[9]-[14]. The numerical

RNAs can be shown graphically for qualitative

analyses. For instance, each nucleotide is

represented by a unit vector in a 4-dimensional

(4-D) space, and an imaginary walker moves

along an RNA sequence, making a trajectory in

this space that can be described by fractional

order differential or integrodifferential

equations [15],[16].

The nucleotides are combined in a certain

way to avoid apparent difficulties with plotting

walks in multidimensional spaces [17],[18]. For

instance, each nucleotide is associated with one

of four unit vectors in 2-D space, which

projections on the plane axes can take positive

(+1) or negative (–1) values [19],[20]. A

trajectory is built moving along the consecutive

number of a nucleotide in the studied genomic

sequence.

In general, DNA walks allow the detection

of codons and introns, discover hidden RNA

periodicity [12]-[14], and calculate

phylogenetic distances between genomic

sequences [21], among others. Some additional

results and reviews on DNA imaging can be

found, for instance, in Refs. [17],[22]-[24],

where the necessity to use specified walks for

each class of genomic problems is shown.

Genomic walk analysis can be followed by

calculating fractal properties of distributions of

nucleotides [25]-[34]. Fractals are self-similar

or scale-invariant objects. It means small ‘sub-

chain’ geometry is repeated on larger geometry

scales, although it is randomly distorted.

A biopolymer chain in a solution is bent

fractally [28]. This fractality influences the

chemical reaction rate, diffusion, and surface

absorption of long-chain and globular

molecules, among others [27],[28],[35]-[38].

Because polar solvents have frequency-

dependent properties, they are adjusted by

microwaves that influence the polymer fractal

dimension. Thus, some bioreactions can be

controlled by a weak high-gradient microwave

field [39],[40].

Although many achievements are known in

the numerical mapping of RNAs, some

questions have not been resolved. For instance,

the known genomic walks are designed to track

single nucleotides or their pairs, leading to

crowdy trajectories and overloaded plots that

are challenging to analyze visually in one plot

[5]-[8].

Meanwhile, the complete RNAs of viruses

are composed chiefly of codons, and one

repetitive pattern therein is their starting atg-

triplet. We assign to these triplets a

mathematical construction - a viral RNA

scheme.

Our proposed pattern search algorithm

calculates the triplet distributions along an RNA

sequence. Additionally, the same algorithm can

make the walks of each of the four nucleotides.

These trajectories are found not twisted firmly

in comparison to curves from Refs. [5],[8], for

instance, and they are easier to be analyzed

visually. These graphs can be equipped with

marks pointing to genes and hyperlinks with the

gene names, making these figures interactive

means to analyze the genomes.

Some results of creating such a tool and

applications to genomes of several viruses are

given here. The source codes and visualizations

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

324

Volume 21, 2022

can be used for research and practical

applications. One of them is the studies of the

stability of the mentioned atg-schemes towards

mutations, variation of codon fillings, and the

fractality of atg-distributions, among others.

Section 2 considers our proposed calculation

algorithms and plotting techniques in detail

with application examples. The results of

applying these techniques to the SARS CoV-2,

MERS CoV, Dengue, and Ebola viruses are in

Section 3. They are further discussed in detail in

Section 4, and the conclusions are in Section 5.

In Appendix 1, all necessary data for the

analyzed RNAs taken from GenBank and

GISAID are tabular, including the parameters

calculated in this contribution.

2 Materials and Methods

2.1 Materials and Data Availability

In this paper, only complete arbitrary-chosen

genomic sequences without missed symbols are

studied taken from GenBank [41] and GISAID

[42]. Among them are 36 SARS CoV-2

genomic sequences from GISAID and one from

GenBank, 20 genomic sequences for the MERS

coronavirus (GenBank), 25 for the Dengue

virus from GenBank, and 15 Ebola virus

genomic sequences (GenBank). Data from the

GISAID are available after registration. All

names of genomic sequences are given in figure

legends and Tables 1–4.

2.2 Methods and Application Examples

2.2.1 Metric-based atg-triplet Algorithm

As has been stated above, for both DNA and

RNA descriptions by characters, their alphabet

consists of four nucleotides marked by a-, c-, g-

, and t-symbols. These designations are used to

study RNA or DNA if their physicochemical

properties are outside the research scope.

In many cases, RNA/DNA have repeated

patterns of nucleotide sequences, and these

regions are better conserved in mutations [9],

[43],[44]. As a rule, pattern discovery relates to

nondeterministic polynomial time problems

(NP-problems), i.e., solution time increases

exponentially with the sequence length.

A typical algorithm compares a query

character pattern with a length of nucleotides

with the following one-symbol shift of query

along a chain. In our code, we use these

techniques, as well. Usually, the search

algorithms working with characters with no

assigned numerical values are slower by 1.43-

2.37 times than those processing the binary

variables, as shown in Refs. [45],[46].

Moreover, even the computer arithmetic logic

units can be re-designed to fulfill the frequently

repeated patterns of binary operations to

accelerate them [47]. Then, in our code, all

RNA’s characters are transformed into binaries

before all operations using a Matlab operator

dec2bin(character) [48].

In computers, for instance, the UTF-8

format allows encoding all 1,112,064 valid

character code points, and it is widely used for

the World Wide Web [49]. The first 128

characters (US-ASCII) require only one byte

(eight binary numbers) in this format. If binary

units initially represent the sequences, then

calculating the DNA sequence’s numerical

properties in this form reduces the computation

time.

Because the binary sequences now write the

DNA/RNA chains, they can be characterized

quantitatively using a suitable technique. One

calculates a metric distance between the binary-

represented symbols and a query (base)

‘moving’ along a chain.

Many metric types are used in codes and big

data [50]-[55]. The advantage of using metric

estimates is that they can be applied in cluster

analysis for similar grouping nucleotide or

protein distribution patterns. For instance, it can

help classify the virus RNAs [56],[57]. Notably,

this distance can be the Hamming one [50] used

further in this paper.

Consider a flow chart of our code on

exploring patterns of arbitrary length

n

(Figure

1). It starts with importing sequence data

A

of

the length

N

from any genomic database in

FASTA format [41],[42] and defining a query

pattern

B

of any size

n

. From both files, the

empty spots should be removed [58].

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

325

Volume 21, 2022

Fig. 1: Algorithm flowchart.

In the second step, the data files are

transformed into binary strings, which are used

to calculate Hamming distance between each

binary symbol of

A

and

B

. This distance is a

metric for comparing two binary values, and it

is the number of bit positions in which the two

bits are different.

To calculate Hamming distance

H

d

between two strings

A

and

B

, the

XOR

operation

 

AB

is used, and the total number

of ‘1’s in the resultant string

C

is counted. The

distance value is zero if the compared binary-

represented symbols are the same. Only

n

characters are compared on each count; then,

the query is moved to one symbol, and

calculations are repeated.

Then, a string

C

of numerical estimates of

the length

M n N

is a product of step 3 of

this code. Not all registered genomic sequences

are divisible by

n

. In this case, a needed

number of characters

a

is added to the end of

the sequence

A

, or the Hamming metric

operation can be fulfilled using Levenstein’s

distance formula from Ref. [53], workable for

compared sub-strings of arbitrary length.

The following two parts of our algorithm are

calculating numbered

 

i

y

query positions in a

sequence

i

x

according to the Hamming-

distance data. All zeros in the string

C

are

initially obtained (step 4, Figure 1). Then, only

n

neighboring zeros corresponding to a query

are selected (step 5, Figure 1), and this query is

numbered in a sequential manner starting with

the first one found in RNA. The positions

i

x

of

these numbered queries

i

y

in a complete RNA

sequence are calculated analytically (step 6,

Figure 1).

Let us take the coordinate of the first symbol

in a numbered found query, then a set of points

can be built along a studied sequence. These

points, being connected, make a curve called a

query walk.

In this paper, the codon start-up atg-triplet is

used below as a query (pattern

B

). We define

the positions

i

x

of the first symbols of the

sequentially numbered atg-triplets in an RNA

A

, and the atg-walks are plotted.

Additionally, we calculate the word length

 

, 1 1

atg

i i i i

l x x





. In our algorithm, a ‘word’ is a

nucleotide sequence starting with an atg-triplet

and all symbols up to the next atg-one (Figure

1, right side).

The proposed algorithm was realized in the

Matlab environment [48], and it is a few-ten-

line code. The following Matlab library

functions were used:

1.

2 (character variable)dec bin

– to

transform a character variable into a

binary one

2.

 

2 , , ‘ ’ptisd a b hamming

– to calculate

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

326

Volume 21, 2022

the Hamming distance value between

two binary values

a

and

b

3.

 

stringzeros

– calculation of numbers

of zero-values in a string

4.

 

,plot y x

– plot function

 

yx

The developed algorithm was applied to

many available virus complete genomes.

Because the atg-triplets start the codons in most

cases, the main attention was paid to building

the distributions of these repeated patterns

called the RNA word schemes. The atg-

positions were compared with the found ones

from the given genomes to verify these

calculations.

2.2.2 Visualization Techniques

2.2.2.1 One-dimensional atg-walks

The viral RNAs, consisting of thousands of

nucleotides, are challenging to analyze, and

many visualizing methods are used. Among

these techniques is the plotting the DNA walks

projected on the spaces of appropriated

dimensions considered above (see

Introduction). Some contributions are full of

symbolic designations of nucleotides and

diagrams showing the positions of genes in the

complete DNA sequences. The preference for a

visualization method is dictated by the

specificity of applications, although there is a

need for a universal graphical tool.

This paper found that atg-walks could be

plotted by coordinates of the numbered atg-

triplets in a complete RNA sequence. Like the

known DNA walks, these dots can be

considered the points of a trajectory named here

as atg-walk (Figure 2A). A diagram showing

the positions of the symbol

a

in defined atg-

triplets is given in Figure 2B.

Similar to single-symbol DNA walks, the

atg-trajectories have fractal properties. Their

type was defined by analyzing distributions of

the coordinates of triplets shown by vertical

lines along an RNA sequence (Figure 2B).

These atg-distributions have repeating motifs

on different geometry scale levels, i.e., they

have fractal properties.

Fig. 2: Positions of atg-triplets along the genome

sequence of SARS-CoV-2 virus MN988668.1 (row 1,

Table 1, Appendix 1) given for the first 5 000 nucleotides

provided by points (A) and vertical lines in diagram (B).

Below, our initial assumption about the

fractality of atg-distributions will be confirmed:

we calculated the fractal dimensions of

complete genomes of several tens of virus

sequences. Presumably, the atg-triplets are

distributed along with the RNA sequences of

studied viruses according to the random Cantor

multifractal set rule [59].

2.2.2.2. Multi-scale Mapping of RNA Sequences

It is necessary to see full-scale virus RNA maps

and analyze all types of mutations. Previously,

the most attention was paid to mapping atg-

triplets, thinking they constructed a scheme of

RNA, a relatively stable structure. Besides the

structural mutations changing the atg-

distributions, the nucleotides vary their

positions inside codons. Our algorithm

considers even a single symbol as a pattern,

allowing the calculation of distribution curves

for each nucleotide similar to atg-ones. These

curves can be considered the first level of

spatial detailing of RNAs. The words in our

definitions (see Figure 1) compose the second

level. Words form a gene responsible for

synthesizing several proteins, and the genes

belong to the third level of spatial detailing of

RNAs.

A combined plotting of elements of the

hierarchical RNAs organization will be helpful

in the visual analyses of genomes. One of the

ways is shown in Figure 3.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

327

Volume 21, 2022

Fig. 3: Two-scale study results of a SARS-CoV-2 virus

MN988668.1 (row 1, Table 1, Appendix 1). In the inlet,

these symbols are pointed inside the words given for the

first 300 nucleotides.

Here, positions of a-symbols of atg-triplets

in a sequence are given by vertical lines (second

level of detailing). Words take spaces between

these vertical lines (see Figure 1). They are

filled by numbered nucleotides (points of a

different color), which are the first level of

RNA detailing. It allows distinguishing

nucleotides even at the beginning of coordinates

where visual clutter is seen (inlet in Figure 3).

The next level of hierarchical RNAs

organization is with genes. For instance, in

GenBank [41], the list of symbols of a genomic

sequence in FASTA format is followed by a

diagram where the genes are given by

horizontal bars with the gene’s literal

designations.

In our case, this diagram can be attached to

a two-scale plot considered above. Another

solution is to equip our figures with gene

hyperlinks, an interactive means highlighting

whose genes a nucleotide or codon belongs to,

shown by a pointer.

Thus, the developed pattern search algorithm

based on the Hamming distance applied to

binary representations of nucleotide symbols

allows building combined plotting of

hierarchical organization of the RNAs of

viruses. It can also be applied to the analyses of

more complex protein structures. Different from

many genomic walks, it produces spatially

simple curves that can be analyzed visually and

quantitatively.

2.2.3 Calculation Techniques for Fractal

Properties of atg-Distributions

In many previous studies, the fractality of

DNA/RNA sequences has been studied

[6],[20]-[26],[29]-[38]. The motifs of small-size

genetic patterns are repeated on large-scale

levels. Thus, the nucleotide distribution along a

genome is not entirely random due to this long-

range fractal correlation, as is mentioned in

many papers.

The large-size genomic data are often

patterned, and each pattern can have its fractal

dimension, i.e., the sequences can be

multifractals [33]. This effect is typical in

genomics, but it is also common in the theory of

nonlinear dynamical systems, signal processing,

and brain tissue morphology, among others

[60]-[68].

Discovering the fractality of genomic

sequences is preceded by their numerical

representation, for instance, by walks of

different types [6]-[8],[20],[22],[29],[33]. Then,

each step value of a chosen walk is considered a

sample of a continuous function, and the

methods of signal processing theory are applied

[12],[13].

The measure of self-similarity is its fractal

dimension

F

d

which can be calculated using

different methods. In our case, the fractal

dimension calculations can be applied directly

to the distribution of atg-consecutive numbers

i

y

(Figure 2a), but it gives this value close to 1

for the analyzed RNAs, i.e., the dependence

()

ii

yx

is close to the linear one. Then, these

calculations are not effective in analyses due to

their weak sensitivity. Instead, the word-length

distributions along RNAs sequences

 

,1

atg

ii

l

(see

sub-section 2.2.1 and Figure 1, right side) are

used.

A particular distribution of the word lengths

is shown in Figure 4 by bars whose heights

equal the word lengths. Then, the algorithms,

usually applied to the sampled signals, can be

used to compute the statistical properties of

these word-length distributions.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

328

Volume 21, 2022

Fig. 4: Word-length distribution in a SARS Cov-2 virus

MN988668.1 sequence (row 1, Table 1, Appendix 1).

In this paper, the fractal dimension of word

length distributions was calculated using the

software package FracLab 2.2 [67]. This code

provides results with reasonable accuracy if the

default parameters of FracLab are used.

Although many researchers tested this code, it

is again verified to calculate the Weierstrass

function, which is synthesized according to a

given fractal dimension value [68].

In a strong sense, the fractal dimension was

defined for the infinite sequences. In our case,

the studied RNAs have only 268-730 atg-

triplets depending on the virus. Then, the fractal

dimension values were estimated

approximately.

3 Results

3.1. Study of atg-walks of SARS CoV-2

Genome Sequences

In this research, essential attention was paid to

studying SARS CoV-2 complete RNA genome

sequences. A recent comprehensive review of

the genomics of this virus can be found in Refs.

[69],[70], for instance.

The data used here and throughout this

whole paper are from two genetic databases:

GenBank [41] and GISAID [42]. A part of the

studied genome sequences for this and other

viruses is provided in Appendix 1.

Here, the main unit, called a ‘word’, is a

nucleotide sequence starting with ‘atg’ and the

symbols up to the next triplet (Figure 1). The

number of atg-s was calculated by our code and

verified by a Matlab function

 

, ' 'count A atg

.

These results are shown in the third column of

Tables 1–4 (See Appendix 1). The Matlab

functions

 

word

median L

and

 

word

rms L

calculated the median and root-mean-square

(R.M.S) values of each sequence’s word-length

 

,1

atg

ii

l

distribution correspondingly. The results

are in columns 4 and 5 of the mentioned tables.

Consider applying the developed approach

to the complete genome of a Wuhan RNA

sample MN988668.1 (GenBank) as an example.

It consists of 29881 nucleotides and 725 atg-

triplets (See row 1, Table 1, Appendix 1).

Figure 2 shows the distribution by points of atg-

triplets for the first 5000 nucleotides of this

complete genome. In Figure 3, the entire study

of this virus is shown. Word-length distribution

of this sample is given in Figure 4.

Figure 5 illustrates the distribution (in lines)

of atg-triplets along with complete genome

sequences for thirty-seven SARS CoV-2

arbitrary-chosen virus samples, including Delta,

Omicron, and a bat-corona sample (see Table 1,

Appendix 1).

There were relatively compact localizations

of the triplet curves despite the viruses being of

different clades and lines. For instance, the

relative difference

 

1,21 1 21

( 29455) 100% 2

i

y x y y y



    

of

these curves 1-37 (Figure 5) is estimated at

29455

i

x

around only 1.6%. This confirms

the conclusions of many specialists [71] that no

new recombined strains have appeared up to

this moment, despite many mutations found to

date (2023), including the Omicron lineage.

Two inlets show the beginning and tails of

these curves to illustrate details. Although, in

general, these trajectories are woven firmly, the

tails are between the bat’s SARS-CoV-2 light-

blue curve (hCoV-

19/bat/Cambodia/RShSTT182/2010, row 6,

Table 1, Appendix 1) and the black trajectory

obtained for a sequence from Brazil (hCoV-

19/Brazil/RS-00674HM_LMM52649/2020, row

14, Table 1, Appendix 1).

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

329

Volume 21, 2022

Fig. 5: Distributions of atg-triplets of 31 SARS Cov-2

complete RNA sequences (Table 1, Appendix 1). The

inlets show these atg-distributions at the beginning and

end of genome sequences.

A detailed study of viruses from Table 1,

Appendix 1 shows that each considered here

sequence has individual atg-distribution. It

means that most mutations are combined with

the joint variations of word content, word

length, and the number of these words. Other

mutations with only word content variation may

exist. However, the atg-walks cannot see them,

and the single-symbol distributions considered

below will help us to detect these modifications

of viruses (See Section 2.2.2.2).

Figure 6 shows a detailed comparison of

samples of five viruses causing increased

trouble for specialists with the one from

Wuhan, China. The inlets offer the details of

these curves at their beginning and end. The

tails of the three curves are closed between the

Wuhan and Brazil trajectories. Although the

difference between these curves is not

significant, the mutations may have

complicated consequences in the rate of

contagiousness of viruses.

Different techniques for numerical

comparison of sequences are known from data

analytics, including, for instance, calculation of

correlation coefficients of unstructured data

sequences, data distance values, and clustering

of data, among others [10],[56],[57].

Researching RNA sequences, we suppose that

the error of nucleotide detection is essentially

less than one percent; otherwise, the results of

comparisons would be instrumentally noisy

Fig. 6: Detailed distributions of atg-triplets for five

trouble-making SARS Cov-2 complete RNA sequences

(rows 1,8,3,14,19, Table 1, Appendix 1). The inlets show

these atg-distributions at the beginning and end of

genome sequences.

We use a simplified algorithm for

quantitative comparing atg-distributions of

different virus samples. Each numbered atg-

triplet

 

i

y

has its coordinate along a sequence

 

i

x

. Thanks to mutations, the length of some

coding words varied together with the

coordinate

 

i

x

of a triplet.

In our case, we calculated the difference

(deviation) between the coordinates

 

i

x

of atg-

triplets of the same numbers

 

i

y

in the

compared sequences. This operation was

fulfilled only for the sequences of equal atg-

triplets; otherwise, excessive coding words are

neglected in comparisons. Of course, such a

technique for the comparison of geometrical

data has its disadvantages. Therefore, if a

compared sequence has several atg-triplets

fewer than the number of atg-ones in a

reference sequence, the atgs of reference RNA

are excluded from comparisons. Still, it allows

for obtaining some information on mutations of

viruses in a straightforward and resultative way

that will be seen below.

Our approach supposes choosing a reference

nucleotide sequence to compare the genomic

virus data of other samples, and it is a complete

genomic sequence MN988668.1 from GenBank

(row 1, Table 1). Several virus samples from

GenBank and GISAID have been studied in this

way [44], and some results of comparisons are

given in Figure 7.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

330

Volume 21, 2022

Fig. 7: Deviation of atg-coordinates in RNAs of four

SARS CoV-2 viruses relative to the reference RNA

MN988668.1 (see Table 1, Appendix 1).

The ordinary axis

x

in these plots shows

the deviation of coordinates

i

x

of atg-triplets

from the atg-coordinates of the reference

sequence. As a rule, due to the different number

of noncoding nucleotides at the beginning of

complete RNA sequences, the curves in Figure

7 have constant biasing along the

x

axis.

The straight parts of these curves mean that

the atg positions of a compared sequence are

not perturbed regarding the corresponding

coordinates in the reference RNA. It means that

there are no mutations, or they are only with the

variation of coding words without affecting

their lengths if these mutations have a place.

In some studied samples (here and in Refs.

[44],[72]), perturbations are near the end of the

orf1ab gene, as is seen using a graphical tool of

GenBank [41]. Perturbations were detected by

calculating the

i

x

coordinates according to

known

i

y

. The atg-perturbations could

generally occur in any RNA’s part, considering

the random nature of mutations (Figure 7C, D).

Relative deviation

i

xN

did not exceed 1–2%

for compared viruses. Although this deviation is

mathematically tiny, it may have severe

consequences in the biological sense.

Our study shows that these difference curves

(Figure 7) are individual for the studied

samples. Although mutations without affecting

the atg-distributions are possible, this

individuality, theoretically, may be lost.

There are repeating motifs of comparison

curves (Figure 7 and Refs. [44], [72]). The

origin of this is unknown, but it was not

coupled with the lineages of viruses and their

clades. Other viruses can be studied in a similar

manner.

3.2. Study of atg-walks of Complete Genome

Sequences of the Middle East Respiratory

Syndrome-related Coronavirus

Middle East Respiratory Syndrome-related

(MERS) is a viral illness. The virus’ origin is

unknown, but it initially spread through camels

and was first registered in Saudi Arabia

[73],[74]. Most people infected with the MERS

CoV virus developed a severe respiratory

disease, which resulted in multiple human

deaths.

Our simulation of atg-distributions of this

virus shows the compactness of the calculated

curves (Figure 8, and Table 2, Appendix 1), like

the SARS CoV-2 characteristics. It follows that

both viruses demonstrate relatively stable

features towards the strong mutations connected

with the recombination of the virus’s parts. For

instance, the relative difference

 

1,10 29982

i

yx





of these curves is estimated

at around 1% only. On average, the MERS

RNAs have fewer number of atg-triplets, and

longer nucleotide words than the SARS CoV-2

studied sequences.

In general, according to our more than three-

year observation, the two studied coronaviruses

(MERS CoV and SARS CoV-2) demonstrate

relatively strong stability of their atg-

distributions towards severe mutations, leading

to the variation of codon positions, word

lengths, and the number of words. It follows the

conclusions of many scientists working in

virology and virus genomics [71].

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

331

Volume 21, 2022

Fig. 8: Distributions of atg-triplets of 20 samples of the

MERS CoV complete RNA sequences (Table 2,

Appendix 1). Inlet shows the atg-distributions at the end

of genome sequences. The inlets offer these distributions

at the beginning and end of genome sequences.

3.3. Dengue Virus Study

The Dengue virus is spread through mosquito

bites. For instance, a comprehensive review of

the genomics of this virus can be found in Refs.

[75],[76].

Unlike the coronaviruses, the Dengue virus

(Table 3, Appendix 1) tends to form separate

families, i.e., it is less stable than SARS-CoV-2

and MERS viruses. It has five genotypes

(DENV 1–5) and around 47 strains.

Unfortunately, the genomic data for this virus

have been published less compared to

coronaviruses. Some of them for which the

complete genome data are available from

GenBank have been studied below.

Figure 9A (rows 1-5, Table 3, Appendix 1)

shows the atg-distributions of five sequences of

Dengue virus-1 found in China. A rather large

dispersion of these sequences is seen from these

graphs.

Figure 9B gives the atg-distributions of five

complete sequences of the Dengue virus-2.

These distributions are more compactly

localized, although their origin is from different

parts of the world. In general, the observed

Dengue virus-2 samples have an increased

number of shorter words compared to the

sequences of Dengue virus-1.

Fig. 9: Distributions of atg-triplets of complete RNA

sequences of the Dengue virus-1 - (A) and Dengue virus-

2 - (B). See rows 1-5 and 6-10, correspondingly from

Table 3, Appendix 1.

Figure 10A shows five data sets for different

strains of Dengue virus-3 registered in many

countries. They have about the same number of

nucleotides, and comparable averaged lengths

of words.

Figure 10B gives three atg-distributions of a

Gabon-strain [76] of Dengue virus-3. It is

supposed that this strain mutated from the

earlier registered Dengue virus lines (Figure

10A). However, they are different in the length

of complete genome sequences and their

statistical characteristics, which will be

considered in Section 3.5 below.

Fig. 10: Distributions of atg-triplets of complete RNA

sequences of Dengue virus-3 (rows 11-15 –(A) and 16-18

–(B) Table 3, Appendix 1).

Figure 11A shows the atg-distributions of

two Gabon-originated Dengue viruses that can

relate to predecessors of other Dengue viruses

of this family. Figure 11B presents the atg-

distributions of complete RNA sequences for

five Dengue virus-4 samples. They have

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

332

Volume 21, 2022

individual and statistical differences with the

above-considered Dengue viruses.

Fig. 11: Distributions of atg-triplets of complete RNA

sequences of Dengue virus-3 (rows 19, 20 - (A), Table 3,

Appendix 1) and Dengue-4 (rows 21-25 - (B), Table 4,

Appendix 1).

A consolidated plot of all atg-curves of the

Dengue RNAs studied here is shown in Figure

12. There is a substantial divergence of these

trajectories in agreement with the mutation rate

of this virus being relatively strong. For

instance, this relative difference

1,25

y



estimated at

10000

i

x

is 14.1%.

Fig. 12: Consolidated picture for all 25 Dengue virus

arbitrary-chosen samples studied in this paper. The

numbers of virus atg-curves correspond to Table 3,

Appendix 1.

3.4. Analysis of atg-walks of Complete

Genome Sequences of the Ebola Virus

There are four strains of the Ebola virus known

worldwide, although many other mutations of

this virus can be found. Like the Dengue virus,

Ebola shows instability and an increased rate of

mutations. Initially, the infection was registered

in South Sudan and the Democratic Republic of

the Congo, and it spreads due to contact with

the body fluids of primates and humans. This

fever is distinguished by a high death rate (from

25% to 90% of the infected individuals). A

recent comprehensive review of the genomics

of this virus can be found in Ref. [77].

The Ebola virus RNA consists of 19 000

nucleotides and more than three hundred atg-

triplets. Figure 13A shows four sequences of

this virus belonging to the EBOV strain

registered in Zaire and Gabon. Three of them

are very close to each other, but the mutant

Zaire virus (in red) has some differences from

the three others. The samples collected in Sudan

(SUDV) are closer to each other (Figure 13B),

but they have an increased number of atg-

triplets and shorter words.

Fig. 13: Distributions of atg-triplets of complete RNA

sequences of the Ebola - EBOV) virus from Zaire and

Gabon (rows 1-4 - (A), Table 4, Appendix 1) and Ebola

virus - SUDV from Sudan (rows 5-7 - (B), Table 4,

Appendix 1).

The Bombali virus, registered in Sierra

Leone, West Africa, is considered a new strain

of Ebola. The atg-distributions of the five found

in GenBank RNA sequences are different even

visually from the two reviewed above, as seen

in Figure 14A. Another Ebola strain that can be

compared with the one studied above is the

Bundibugyo (BDBV) virus, whose three atg-

distributions are shown in Figure 14B.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

333

Volume 21, 2022

Fig. 14: Distributions of atg-triplets of five complete

RNA sequences of the Ebola (Bombali) virus (rows 8-12

– (A), Table 4, Appendix 1) and three complete RNA

sequences of the Ebola Bundibugyo (BDBV) virus. See

rows 13-15 – (B), Table 4, Appendix 1.

The calculated 15 distributions are

consolidated in Figure 15 to compare all four

strains, where, instead of points, the results are

represented by thin curves to make these

distributions more visible. Here, the tendency of

atg-curves to divergence and form clusters is

seen. For instance, the relative difference of

strains

1,15

y



is estimated at around 9%.

Fig. 15: Consolidated representation of 15 atg-

distributions of four strains of the Ebola virus. Black

color – EBOV; Green color – SUDV; Violet color –

Bombali; Red color – BDBV. The numbers of virus atg-

curves correspond to Table 4, Appendix 1.

It follows that the atg-walk is an effective

visualization tool sensitive to the viral RNA

mutations coupled to the number of codons and

introns and word size variations. It allows the

detection of viruses with essentially unstable

genomes distinguished by their increased

deviation of atg-walks and their fractal

properties.

3.5 Statistical Characterisation of atg-walks:

Calculating, Mapping, and Processing of the

Inter-atg Distance Values

In this research, after applying the tool FracLab

(See Section 2.2.3), it was discovered that all

studied genomic sequences of the SARS CoV-

2, MERS CoV, Dengue, and Ebola viruses have

fractality in their word-length distributions. The

results on this matter are placed in columns 6 of

Tables 1-4, Appendix 1.

Figure 16A shows the fractal dimension

values

F

d

of 37 genome sequences of SARS

CoV-2 viruses. Figure 16B gives this parameter

for 20 MERS coronavirus samples. Calculations

show that the maximal normalized deviation

(relative difference) of

F

d

is only 1.8% for

SARS and 2% for MERS viruses, i.e., these

results are in accordance with the conclusion on

the stability of these viruses towards severe

mutations with the codon-length and codon-

content perturbations.

Fig. 16: Fractal dimensions

F

d

of word-length

 

,1

atg

ii

l

distributions of 37 complete genome sequences of the

SARS CoV-2 (A) and 25 complete genome sequences of

the MERS CoV (B) viruses. The number of samples

corresponds to Tables 1 and 2 of Appendix 1.

The Dengue virus has five families and 47

strains; they have different atg-distributions and

fractal dimensions. According to the fractal

calculations, some strains are close to each

other (Figure 17).

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

334

Volume 21, 2022

Fig. 17: Fractal dimensions

F

d

of word-length

 

,1

atg

ii

l

distributions of 25 complete genome sequences of the

Dengue 1-4 viruses and their strains (Table 3, Appendix

1).

At the same time, the maximal relative

deviation is around 22% estimated for all

studied samples, and it agrees that this virus is

unstable from strain to strain.

The same conclusion is evident in Figure 18,

where the fractal dimensions of several studied

strains of the Ebola virus are given with a

maximal relative deviation of around 18%.

Fig. 18: Fractal dimensions

F

d

of word-length

 

,1

atg

ii

l

distributions of 15 complete genome sequences of the

Ebola virus strains (Table 4, Appendix 1).

4. Discussion

The research on the RNAs and DNAs of viruses

and cellular organisms is a highly complex

problem because of the many nucleotides of

these organic polymers, unclear mechanisms of

their synthesis, and pathological mutation

consequences for host organisms. Although

many mathematical tools have been developed,

new studies are exciting and can be fruitful.

In this paper, the viral RNAs were studied

using a novel algorithm based on exploring the

RNA patterns of arbitrary length. One of the

operations of this algorithm is the numerical

mapping of RNA characters, which is

performed by calculating the Hamming distance

between the preliminary binary-expressed

queries and RNA symbols. This allows

fulfilling these steps approximately twice as fast

regarding the operations with real numbers [45].

The results of the application of this algorithm

are verified by comparing them with complete

RNA sequences. Considering this algorithm can

search arbitrary-length patterns, the trajectories

of separate symbols can be combined with the

atg-walks for multi-scale plotting and RNA

analysis, as shown in this contribution.

The mentioned atg-triplets compose

relatively stable 1-D distributions called the

RNA schemes in this paper. These distributions

have been studied using our algorithm applied

to arbitrary-chosen complete RNA sequences of

the SARS CoV-2, MERS CoV, Dengue, and

Ebola viruses.

The following properties of virus RNAs have

been found in our research and not seen earlier:

1. Stability of atg-schemes towards intra-

family mutations when the geometry of

atg-curves is only slightly distorted

according to the estimates of the relative

difference of curve points (Section 3)

2. The highly compact atg-curve sets of

the SARS CoV-2 and MERS CoV

viruses, despite their continuous

mutation (to the date of submission),

estimated visually and quantitively

calculating the relative difference of atg-

coordinates (δy=1-1.6%), see Sections

3.1 and 3.2

3. More substantial divergence (δy=9-

14%) of atg-curves of the Dengue and

Ebola viruses in comparison to SARS

CoV-2 and MERS CoV species

(Sections 3.3 and 3.4 and 3.1, 3.2)

4. A visually found tendency towards

clustering the atg-curves in the limits of

one virus family (Ebola case, Section

3.4)

5. Distribution of single RNA’s symbols

and atg-triplets according to the random

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

335

Volume 21, 2022

fractal Cantor rule (Section 2.2.2.1)

6. Possible global correlation of the inter-

triplet distances due to this fractality

(Section 2.2.3)

7. Correlation of dispersion of fractal

dimension values of atg-distributions

with the instability of viruses (Section

3.5)

5. Conclusion

In this paper, the visual and quantitative

analyses of viral RNAs have been performed

using a novel algorithm to calculate the RNA

pattern positions in the studied sequences. A

part of this code uses binary symbols of RNA

nucleotides for accelerated search. The

algorithm allows more effective genomic

studies by building 1-D distributions of

different patterns and combining these sets on a

single multi-scale plot.

The proposed techniques were applied to

analyze the SARS CoV-2, MERS CoV,

Dengue, and Ebola viruses. The 1-D

distributions of atg-triplets (atg-schemes of

RNAs) were calculated and plotted for these

species. The levels of stability of these

distributions have been estimated visually and

numerically by calculating the divergence of

triplet curves and deviation of word fractal

dimensions of RNAs.

The main finding of this work is that the

level of clustering of atg-curves is with the

degree of stability of RNAs shown comparing

the SARS-CoV-2 and MERS species and the

Dengue and Ebola viruses. In addition, the rank

of clustering is coupled with the deviation of

fractal dimensions of codon-length

distributions, which is more significant for the

unstable Dengue and Ebola viruses mentioned.

It may be found engaging in further study of the

mutation of viruses and building their

phylogenetic trees.

The developed approach is with the study of

the RNA words, their lengths, and fractal

properties of the word-length distributions. It

can be applied in the research of mammalian

DNAs where the gene length is the evolutionary

dynamic and partly defines the gene expression

level [78].

In addition, recent studies show that with

aging, an imbalance of short and long genes in

the transcriptome occurs, and the research of

these phenomena by our algorithm may play an

important role in the development of anti-aging

treatments [79].

Thus, even quantitative and qualitative

studies and modeling of short virus genes can

be helpful in the genetics of more complicated

mammalian DNAs consisting of billions of

nucleotides.

Abbreviations

RNA: Ribonucleic acid; DNA:

Deoxyribonucleic acid; SARS CoV-2: Severe

Acute Respiratory Syndrome Coronavirus 2;

cDNA: complementary DNA; GISAID: Global

Initiative on Sharing All Influenza Data; NP:

nondeterministic polynomial; UTF-8: Unicode

Transformation Format-8 bit; US-ASCII:

American Standard Code for Information

Interchange; MERS CoV: Middle-East

Respiratory Syndrome-related Corona Virus.

Acknowledgments

The authors thank the GenBank® [39] and

GISAID [40] genetic data banks, and all

researchers placed their genomic sequences in

them. The online text processing service of

https://onlinetexttools.com/ is appreciated.

References:

[1]. G. Meister, RNA Biology: An Introduction,

Weinheim, Wiley-VCH, 2011.

[2]. K.R. Kukurba and S.B. Montgomery, RNA

sequencing and analysis, Cold Spring Harb.

Protoc., Vol. 11, 2015, pp. 951-967.

https://dx.doi.org/10.1101%2Fpdb.top084970

[3]. G. Storz, An expanding universe of noncoding

RNAs, Science, Vol. 296, 2002, pp. 1260-1263.

https://doi.org/10.1126/science.1072249

[4]. C. Nello and M.W. Hahn, Introduction to

Computational Genomics: A Case Studies

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

336

Volume 21, 2022

Approach. Cambridge, University Press, 2012.

https://doi.org/10.1017/CBO9780511808982

[5]. H.K. Kwan and S.B. Arniker, Numerical

representation of DNA sequences. Proc. 2009

IEEE Int. Conf., Electro/Information

Technology, 2009, pp. 307-310.

http://dx.doi.org/10.1109/EIT.2009.5189632

[6]. C. Cattani, Complex representation of DNA

sequences, Commun. in Computer and Inform.

Sci., Vol. 13, 2008, pp. 528-537.

http://dx.doi.org/10.1007/978-3-540-70600-7_42

[7]. P.D. Cristea, Conversation of nucleotide

sequences into genomic signals, J. Cell. Mol.

Med., Vol. 6, 2002, pp. 279-303.

https://doi.org/10.1111/j.1582-

4934.2002.tb00196.x

[8]. F. Bai, J. Zhang, J. Zheng, C. Li, and L. Liu,

Vector representation and its application of DNA

sequences based on nucleotide triplet codons, J.

Mol. Graphics Modell., Vol. 62, 2015, pp. 150-

156. https://doi.org/10.1016/j.jmgm.2015.09.011

[9]. B. Brejová, T. Vinar, and M. Li, Pattern

discovery. In: Krawetz S.A., Womble D.D. (eds)

Introduction to Bioinformatics, Humana Press,

Totowa, NJ, 2003.

[10]. J. Zhang, Visualization for Information

Retrieval, Springer, 2007.

https://doi.org/10.1007/978-0-387-39940-9_954

[11]. M. Randic, M. Novic, and D. Plavsic.

Milestones in graphical bioinformatics, Int. J.

Quantum Chem., Vol. 113, 2013, pp. 2413-2446.

https://doi.org/10.1002/qua.24479

[12]. P.P. Vaidyanathan, Genomics and

proteomics: A signal processing tour, IEEE Circ.

Syst. Mag., 4th Quarter, 2004, pp. 6-29.

https://doi.org/10.1109/MCAS.2004.1371584

[13]. J.V. Lorenzo-Ginori, A. Rodríguez-

Fuentes, R.G. Ábalo, R. Grau, and R.S.

Rodríguez, Digital signal processing in the

analysis of genomic sequences, Current

Bioinformatics, Vol. 4, 2009, pp. 28-40.

https://doi.org/10.2174/157489309787158134

[14]. L. Das, S. Nanda, and J.K. Das, An

integrated approach for identification of exon

locations using recursive Gauss-Newton tuned

adaptive Kaiser window, Genomics, Vol. 111,

2019, pp. 284-296.

https://doi.org/10.1016/j.ygeno.2018.10.008

[15]. A. E. Lamairia, Nonexistence results of

global solutions for fractional order integral

equations on the Heisenberg group, WSEAS

Trans. Systems, Vol. 21, 2022, pp. 382-386.

http://dx.doi.org/10.37394/23202.2022.21.42

[16]. N. Viriyapong, Modification of

Sumudu Decomposition method for nonlinear

fractional Volterra integro-differential equations,

WSEAS Trans. Math., Vol. 21, 2022, pp. 187-

195. DOI: 10.37394/23206.2022.21.25

[17]. A. Czerniecka, D. Bielinska-Waz, P.

Waz, and T. Clark, 20D-dynamic representation

of protein sequences, Genomics, Vol. 107, 2016,

pp. 16-23.

https://doi.org/10.1016/j.ygeno.2015.12.003

[18]. E.R. Hamori and J. Raskin, H curves, a

novel method of representation of nucleotide

series especially suited for long DNA sequences,

J. Biol. Chem., Vol. 258, 1983, pp. 1318-1327.

https://doi.org/10.1016/S0021-9258(18)33196-X

[19]. M.A. Gates, Simpler DNA

representation, Nature, Vol. 316, 1985, pp. 219.

https://doi.org/10.1038/316219a0

[20]. C.L. Berthelsen, J.A. Glazier, and M.H.

Skolnick, Global fractal dimension of human

DNA sequences treated as pseudorandom walks,

Phys. Rev. A., Vol. 45, 1992, Paper No

89028913.

https://doi.org/10.1103/PhysRevA.45.8902

[21]. P. Licinio and R.B. Caligiorne,

Inference of phylogenetic distances from DNA-

walk divergences, Physica A, Vol. 341, 2004, pp.

471-481.

http://dx.doi.org/10.1016/j.physa.2004.03.098

[22]. J.A. Berger, S.K. Mitra, M. Carli, and

A. Neri, Visualization and analysis of DNA

sequences using DNA walks, J. Franklin Inst.,

Vol. 341, 2004, pp. 37-53.

https://doi.org/10.1016/j.jfranklin.2003.12.002

[23]. A. Rosas, E. Nogueira Jr., and J.F.

Fontanari, Multifractal analysis of DNA walks

and trails, Phys. Rev. E, Vol. 66, 2002, Paper No

061906.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

337

Volume 21, 2022

http://dx.doi.org/10.1103/PhysRevE.66.061906

[24]. A.D. Haimovich, B. Byrne, R.

Ramaswamy, and W.J. Welsh, Wavelet analysis

of DNA walks, J. Comput. Biol., Vol. 13, 2006,

pp. 1289-1298.

https://doi.org/10.1089/cmb.2006.13.1289

[25]. H. Namazi, V.V. Kulish, F. Delaviz, and

A. Delaviz, Diagnosis of skin cancer by

correlation and complexity analyses of damaged

DNA, Onkotarget, Vol. 6, 2015, pp. 42623-

42631.

https://dx.doi.org/10.18632%2Foncotarget.6003

[26]. B. Hewelt, H. Li, M.K. Jolly, P.

Kulkarni, I. Mambetsariev, and R. Salgia, The

DNA walk and its demonstration of

deterministic chaos—relevance to genomic

alterations in lung cancer. Bioinformat., Vol. 35,

2019, pp. 2738-2748.

https://doi.org/10.1093/bioinformatics/bty1021

[27]. K.S. Birdi, Fractals in Chemistry,

Geochemistry, and Biophysics, N.-Y., Plenum

Press, 1993.

[28]. T.G. Dewey, Fractals in Molecular

Biophysics, Cambridge, Oxford University

Press, 1997.

[29]. G. Abramson, H.A. Cerdeira, and C.

Bruschi, Fractal properties of DNA walks,

Biosystems, Vol. 49, 1999, pp. 63-70,

https://doi.org/10.1016/s0303-2647(98)00032-x

[30]. C. Cattani, Fractals and hidden

symmetries in DNA, Math. Problems Eng., Vol.

2010, 2010, Paper No 507056(1-31).

https://doi.org/10.1155/2010/507056

[31]. S.-A. Ouadfeul, Multifractal analysis of

SARS-CoV-2 coronavirus genomes using the

wavelet transforms, bioRxiv preprint:

https://doi.org/10.1101/2020.08.15.252411

[32]. B. Hao, H.C. Lee, and S. Zhang,

Fractals related to long DNA sequences and

complete genomes, Chaos, Solitons and

Fractals, Vol. 11, 2000, pp. 825-836.

https://doi.org/10.1016/S0960-0779(98)00182-9

[33]. Z.-Y. Su, T. Wu, and S.-Y. Wang, Local

scaling and multifractality spectrum analysis of

DNA sequences- GenBank data analysis, Chaos,

Solitons&Fractals, Vol. 40, 2009, pp. 1750-

1765.

https://doi.org/10.1016/j.chaos.2007.09.078

[34]. G. Durán-Meza, J. López-García, and

J.L. del Río-Correa, The self-similarity

properties and multifractal analysis of DNA

sequences, Appl. Math. Nonlin. Sci., Vol. 4,

2019, pp. 267-278.

https://doi.org/10.2478/AMNS.2019.1.00023

[35]. M.S. Swapna and S. Sankararaman,

Fractal applications in bio-nanosystems,

Bioequiv. Availab., Vol. 2, 2019, Paper No

OABB.000541.

[36]. X. Bin, E.H. Sargent, and S.O. Kelley,

Nanostructuring of sensors determines the

efficiency of biomolecular capture, Anal. Chem.,

Vol. 82, 2010, pp. 5928–5931.

https://doi.org/10.1021/ac101164n

[37]. J. Chen, Z. Luo, C. Sun, Z. Huang, C.

Zhou, S. Yin, Y. Duan, and Y. Li, Research

progress of DNA walker and its recent

applications in biosensor, TrAC Trends in Anal.

Chem., Vo l. 120, 2019, Paper No 115626.

https://doi.org/10.1016/j.trac.2019.115626

[38]. A. Sadana, Engineering Biosensors.

Kinetics and Design Application, San Diego,

California, Acad. Press, 2001.

https://doi.org/10.1016/B978-0-12-613763-

7.X5015-0

[39]. G.A. Kouzaev, Frequency dependence

of microwave-assisted electron-transfer

chemical reactions, Mol. Phys., Vol. 118, 2020,

paper No e1685691.

https://doi.org/10.1080/00268976.2019.1685691

[40]. S.V. Kapranov and G.A. Kouzaev,

Nonlinear dynamics of dipoles in microwave

electric field of a nanocoaxial tubular reactor,

Mol. Phys., Vol. 117, 2018, pp. 489-506.

https://doi.org/10.1080/00268976.2018.1524526

[41]. GenBank® [

https://www.ncbi.nlm.nih.gov/genbank/ ].

[42]. Global Initiative on Sharing All

Influenza Data (GISAID) [

https://www.gisaid.org/].

[43]. A. Belinsky and G.A. Kouzaev, Visual

and quantitative analyses of virus genomic

sequences using a metric-based algorithm,

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

338

Volume 21, 2022

bioArxiv preprint: bioArxiv 2021.06.17.448868;

Europe PMC: PPR: PPR358597.

https://doi.org/10.1101/2021.06.17.448868

[44]. A. Belinsky and G.A. Kouzaev,

Geometrical study of virus RNA sequences,

bioArxiv preprint: bioRxiv 2021.09.06.459135;

https://doi.org/10.1101/2021.09.06.459135;

Europe PMC:

https://europepmc.org/article/PPR/PPR391263

[45]. R. Mian, M. Shintani, and M. Inoue,

Hardware-software co-design for decimal

multiplication, Computers, Vol. 10, 2021, pp.

17(1-19).

https://doi.org/10.3390/computers10020017

[46]. N. Brisebarre, C. Lauter, M,

Mezzarobba, and J.-M. Muller, Comparison

between binary and decimal floating-point

numbers, IEEE Trans. Comput., Vol. 65, 2016,

pp. 2032-2044.

https://doi.org/10.1109/TC.2015.2479602

[47]. A. Kostadinov and G.A. Kouzaev, A

novel processor for artificial intelligence

acceleration, WSEAS Trans. Circ., Systems, Vol.

21, 2022, pp. 125-141.

https://doi.org/10.37394/23201.2022.21.14

[48]. Matlab® R2020b, version

9.9.0.1477703, [

https://se.mathworks.com/products/matlab.html]

[49]. Chapter 2. General Structure. The

Unicode Standard (6.0 ed.). Mountain View,

California, US: The Unicode Consortium. ISBN

978-1-936213-01-6.

[50]. R.W. Hamming, Error detecting and

error-correcting codes, Bell Syst. Techn. J., Vol.

29, 1950, pp. 147-160.

[51]. W.N. Waggener, Pulse Code

Modulation Techniques, Berlin-Heidelberg:

Springer Verlag, 1995.

[52]. G. Navarro and M. Raffinot, Flexible

Pattern Matching in Strings: Practical Online

Search Algorithms for Texts and Biological

Sequences, Cambridge: Cambridge University

Press, 2002.

https://doi.org/10.1017/CBO9781316135228

[53]. V.I. Levenshtein, Binary codes capable

of correcting deletions, insertions, and reversals,

Soviet Physics Doklady, Vol. 10, 1966, pp. 707–

710.

[54]. E. Gabidullin, Theory of codes with

maximum rank distance, Probl. Inform. Trans.,

Vol. 21, 1985, pp. 1-76.

[55]. E. Polityko, Calculation of distance

between strings

(https://www.mathworks.com/matlabcentral/file

exchange/17585-calculation-of-distance-

between-strings, MATLAB Central File

Exchange. Retrieved March 3, 2021.

[56]. X. Yang, N. Dong, E. Chan, and S.

Chen, Genetic cluster analysis of SARS-CoV-2

and the identification of those responsible for the

major outbreaks in various countries, Emerging

Microbes&Infect., Vol. 9, 2020, pp. 1287-1299.

https://doi.org/10.1080/22221751.2020.1773745

[57]. J. Tzeng, H.H.-S. Lu, and W.-H. Li,

Multidimensional scaling for large genomic data

sets, BMC Bioinformatics, Vol. 9, 2008, Article

No 179, pp. 1-17. https://doi.org/10.1186/1471-

2105-9-179

[58]. Online Text Tools

[https://onlinetexttools.com/].

[59]. J. Feder, Fractals, N.-Y., Plenum Press,

1988.

[60]. P. Grassberger and I. Procaccia,

Measuring the strangeness of strange attractors,

Physica D, Vol. 9, 1983, pp. 189-208.

https://doi.org/10.1016/0167-2789(83)90298-1

[61]. S.N. Rasband, Chaotic Dynamics of

Nonlinear Systems. Weinheim, J. Wiley & Sons,

1989.

[62]. B. Henry, N. Lovell, and F. Camacho,

Nonlinear Dynamics Time Series Analyses, In:

Nonlinear Biomedical Signal Processing:

Dynamic Analysis and Modeling. Edited by

Akay M., IEEE, 2000, pp. 1-39.

[63]. F. Roueff and J.L. Véhel, A

regularization approach to fractional dimension

estimation. In: Proc. Int. Conf. Fractals 98, Oct.

1998, Valletta, Malta. World Sci., 1998, pp. 1-

14.

[64]. J.L. Véhel and P. Legrand, Signal and

image processing with Fraclab, In: Thinking in

Patterns. World Sci., 2003, pp. 321-322.

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

339

Volume 21, 2022

[65]. G.A. Kouzaev, Application of

Advanced Electromagnetics. Components and

Systems. Berlin-Heidelberg: Springer, 2013.

https://doi.org/10.1007/978-3-642-30310-4

[66]. C. Guidolin, R. Tortorella, R. De Caro,

and L.F. Agnati, Does a self-similarity logic

shape the organization of the nervous system?

In: The Fractal Geometry of the Brain. Edited

by Di Leva A: Berlin-Heidelberg: Springer

Verlag, 2016, pp. 138-156.

http://dx.doi.org/10.1007/978-1-4939-3995-4

[67]. FracLab 2.2. A fractal analysis toolbox

for signal and image processing.

[https://project.inria.fr/fraclab/ ].

[68]. J. Monge-Álvarez, Weierstrass cosine

function (WCF)

[https://www.mathworks.com/matlabcentral/file

exchange/50292-weierstrass-cosine-function-

wcf], MATLAB Central File Exchange.

Retrieved March 21, 2021.

[69]. A. Rahimi, A. Mirzazadeh, and S.

Tavakopolour, Genetics and genomics of SARS-

CoV-2: A review of the literature with the

special focus on genetic diversity and SARS-

CoV-2 genome detection, Genomics, Vol. 113,

2021, pp. 1221-1232.

https://doi.org/10.1016/j.ygeno.2020.09.059

[70]. P. Forster. L. Forster, C. Renfrew, and

M. Forster, Phylogenic network analysis of

SARS-CoV-2 genomes. PNAS, Vol. 117, 2020,

pp. 9241-9243.

https://doi.org/10.1073/pnas.2004999117

[71]. V. Cooper, The coronavirus variants

don’t seem to be highly variable so far, Sci.

American, 2021, March 24.

[72]. G.A. Kouzaev, The geometry of ATG-

walks of the Omicron SARS CoV-2 Virus

RNAs, bioArxiv preprint: bioRxiv doi:

https://doi.org/10.1101/2021.12.20.473613;

Europe PMC: PPR: PPR435860.

[73]. S.A. El-Kafrawy, V.M. Corman, A.M.

Tolah, S.B. Al Masaudi, A.M. Hassan, M.A.

Müller, T. Bleicker, S. M. Harakeh, A.A.

Alzahrani, G.A.A. Abdulaziz, N. Alagili, A.M.

Hashem, A. Zumla, C. Drosten, and E.I. Azhar,

Enzootic patterns of Middle East respiratory

syndrome coronavirus in imported African and

local Arabian dromedary camels: a prospective

genomic study, The Lancet Planetary Health,

Vol. 3, 2019, pp. e521-e528.

https://doi.org/10.1016/S2542-5196(19)30243-8

[74]. M. Kim, H. Cho, S.-H. Lee, W-J. Park,

J.-M. Kim, J.-S. Moon, G.-W. Kim, W. Lee, H.-

G. Jung, J.-S. Yang, J.-H. Choi, J.-Y. Lee, S.S.

Kim, and J.-W. Oh, An infectious cDNA clone

of a growth attenuated Korean isolate of MERS

coronavirus KNIH002 in clade B, Emerg.

Microbes Infect., Vol. 9, 2020, pp. 2714-2720.

https://doi.org/10.1080/22221751.2020.1861914

[75]. V.D. Dwivedi, I.P. Tripathi, R.C.

Tripathi, S. Bharadwaj, and S.K Mishra,

Genomics, proteomics and evolution of dengue

virus, Briefings in Functional Genomics, Vol.

16, 2017, pp. 217-227.

https://doi.org/10.1093/bfgp/elw040

[76]. H. Abea, Y. Ushijimaa, M.M. Loembe,

R. Bikangui, G. Nguema-Ondo, P.I. Mpingabo,

V.R. Zadeh, C.M. Pemba, Y. Kurosaki, Y.

Igasaki, S.G. deVries, M.P. Grobusch, S.T.

Agnandji, B. Lell, and J. Yasuda, Re-emergence

of Dengue virus serotype 3 infections in Gabon

in 2016–2017, and evidence for the risk of

repeated Dengue virus infections, Int. J. Infect.

Diseases, Vol. 91, 2020, pp. 129-136.

https://doi.org/10.1016/j.ijid.2019.12.002

[77]. N. Di Paola, M. Sanchez-Lockhart, X.

Zeng, J.H. Kuhn, and G. Palacios, Viral

genomics in Ebola virus research, Nature Rev.

Microbiol., Vol. 8, 2020, pp. 365–378.

https://doi.org/10.1038/s41579-020-0354-7

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

340

Volume 21, 2022

[78]. V. Grishkevich and I. Yanai, Gene

length and expression level shape genomic

novelties, Genome Research, Vol. 24, 2014, pp.

1497-1503.

https://doi.org/10.1101%2Fgr.169722.113

[79]. T. Stoeger, R.A. Grant, A.C.

McQuattie-Pimentel, K.R. Anekalla, S.S. Liu, H.

Tejedor-Navarro, B.D. Singer, H. Abdala-

Valencia, M. Schwake, M.P. Tetreault, H.

Perlman, W E. Balch, N.S. Chandel, K.M.

Ridge, J.I. Sznajder, R.I. Morimoto, A.V.

Misharin G R. Scott Budinger, and L.A.N.

Amaral, Aging is associated with a systemic

length-associated transcriptosome imbalance,

Nature Aging, vol. 2, 2022, pp. 1191-1206.

https://doi.org/10.1038/s43587-022-00317-6

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

341

Volume 21, 2022

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

The authors equally contributed in the present

research, at all stages from the formulation of the

problem to the final findings and solution.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding was received for conducting this study.

Conflict of Interest

The authors have no conflicts of interest to declare

that are relevant to the content of this article.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en

_US

Appendix 1. Results of statistical characterization of complete genetic sequences of the SARS CoV-2,

MERS CoV, Dengue, and Ebola viruses

Table 1. Severe acute respiratory syndrome coronavirus 2, (GenBank, GISAID), atg-walk

#

GenBank or GISAID Virus Name,

Clade, Lineage, Registration Year,

Sequencing Technology

Number of

Nucleotides in

the Sequence

Number

of atg-

Triplets

in the

sequence

Word

Median

Length

RMS

Word

Length

Fractal

Dimension of

the Word-

length

Distribution

1

2

3

4

5

6

1

GenBank: MN988668.1, Severe

acute respiratory syndrome

coronavirus 2 isolate 2019-nCoV

WHU01, Wuhan, China, 2020,

Illumina

29881

725

29

57.93

2.17

2

hCoV-19/Japan/NGY-NNH-

075/2021, GR, B.1.1.64, Illumina

MiSeq, Sanger

29848

722

29

58.03

2.17

3

hCoV-19/India/ILSGS00925/2021,

G, (Delta) B.1.617.2, Illumina

NextSeq550

29782

723

28.05

57.77

2.16

4

hCoV-19/South

Korea/KDCA3504/2021, GH,

B.1.497, Illumina Miseq

29901

722

29

57.96

2.17

5

hCoV-19/Taiwan/TSGH-34/2020, S,

A.1, Illumina NovaSeq4000

29903

724

29

57.79

2.17

6

hCoV-

19/bat/Cambodia/RShSTT182/2010,

A.1, (bat virus), 2021, Illumina

NextSeq

29787

730

29

55.81

2.17

7

hCoV-19/Austria/CeMM3224/2021,

GR, B.1.1.244, Illumina NovaSeq

29782

721

30

59.03

2.16

8

hCoV-19/England/205341113/2020,

GV, B.1.177.54, Illumina NextSeq

29862

721

29

57.97

2.17

9

hCoV-19/Ireland/D-NVRL-

e84IRL94434/2021, GV, B.1.177,

Illumina

29523

719

29

59.56

2.17

10

hCoV-19/Netherlands/UT-RIVM-

13868/2021, GH, B. 1.160,

Nanopore MinION

29782

720

28

58.17

2.16

11

hCoV-19/Norway/0179/2021, GH,

B.1.36, Nanopore GridIon

29782

723

28

57.88

2.15

12

hCoV-19/Russia/IVA-CRIE-

L188N0202/2021, GR, B.1.1.317,

Illumina

29735

720

29

57.77

2.17

13

hCoV-19/Spain/RI-IBV-

99016064/2021, GV, B.1.221,

Illumina MiSeq

29865

719

29

59.56

2.17

14

hCoV-19/Brazil/RS-

00674HM_LMM52649/2020, GR,

B.1.1.33, Illumina Miseq

29867

719

29

58.31

2.17

15

hCoV-19/Canada/ON-S2383/2021,

GH, B. 1.36.38, Illumina MiniSeq

29830

722

29

57.89

2.16

16

hCoV-19/Mexico/CMX-INER-

0222/2020, G, B.1.551, Illumina

NextSeq

29885

724

29

57.83

2.17

17

hCoV-19/USA/TX-HHD-

2102044112/2021, GR, B.1.1.244,

29819

720

29

58.10

2.17

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

342

Volume 21, 2022

Illumina MiSeq

18

hCoV-19/USA/CA-LACPHL-

AF00513/2021, GH, B.1.429,

Illumina MiSeq

29844

723

29

57.86

2.17

19

hCoV-19/South Africa/KRISP-

K004540/2020, GR, B.1.1.56,

Illumina MiSeq

29851

722

29

57.90

2.17

20

hCoV-19/Canada/ON-NML-

254107/2021, GR, BA.1 (Omicron),

Oxford Nanopore GridION

29685

718

29

57.73

2.17

21

hCoV-19/England/MILK-

2D6B000/2021, GRA, BA.2

(Omicron), Illumina NovaSeq

29724

725

28.5

57.48

2.16

22

hCoV-19/USA/CA-CDC-

LC0934366/2022, GRA, BQ.1,

PacBio Sequel II

29642

718

28

57.31

2.19

23

hCoV-19/USA/CA-CDC-

LC0933659/2022, GRA, BQ.1,

PacBio Sequel II

29668

725

28

57.5

2.16

24

hCoV-19/Canada/QC-

L00549819001/2022, GRA, BQ.1,

Illumina NextSeq

29719

725

28

57.5

2.16

25

hCoV-

19/Malaysia/IMR/CV05212/2022,

GRA, BQ.1.23, Nanopore GridION

29636

725

28

57.5

2.17

26

hCoV-19/Ireland/CO-CUH-

S22C0165/2022, GRA, BQ.1,

Nanopore MinION

29646

723

28

57.4

2.18

27

hCoV-19/Canada/QC-

L00548493001/2022, GRA, BA.4,

Oxford Nanopore PromethION

29709

724

29

57.5

2.16

28

hCoV-19/Guatemala/7817-

LNS/2022 GRA, BA.4,

Illumina MiSeq

29714

724

29

57.48

2.16

29

hcov-19/USA/UT-UPHL-

221115393771/2022, GRA, BA.4.6,

Illumina NovaSeq 6000

29734

723

28.5

57.68

2.16

30

hCoV-19/Germany/RP-USAFSAM-

S20372/2022, GRA, BA.5,

Illumina_NextSeq_Mid

29693

722

29

57.66

2.15

31

hCoV-19/Russia/MOW-CRIE-

89961/2022, GRA, BA.5, Oxford

Nanopore

29646

719

29

57.65

2.17

32

hCoV-19/Norway/OUS-26253/2022,

GRA, BA.5.1, Illumina Swift

Amplicon SARS-CoV-2 protocol at

Norwegian Sequencing Centre

29605

719

29

57.61

2.17

33

hCoV-19/USA/CA-CDPH-

FS48102807/2022, GRA, BA.5.1.1,

Element Biosciences

29607

723

29

57.36

2.17

34

hCoV-19/USA/NY-

NYULH9985/2022, GRA, XBB.1.5,

Amplicon (Illumina), Illumina

NovaSeq

29773

726

29

57.44

2.17

35

hCoV-19/USA/NJ-PHEL-

V22054945/2022, GRA, XBB.1.5,

Oxford_Nanopore

29649

722

29

57.48

2.17

36

hCoV-19/Iceland/L-3254/2022,

GRA, XBB.1.5, Illumina MiSeq

29669

725

28.5

57.47

2.16

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

343

Volume 21, 2022

37

hCoV-19/Ireland/CO-CUH-

S23C0065/2023.txt, GRA, CH.1.1,

Nanopore MinION

29652

722

29

57.49

2.17

Table 2. The Middle East respiratory syndrome-related coronavirus, (GenBank), atg-walk

#

GenBank Virus Name and

Accession Number, Registration

Year, Sequencing Technology

Nucleotides

Number

of atg-

Triplets

Word

Median

Length

RMS

Word

Length

Fractal

Regularization

Dimension of

the Word-

length

Distribution

1

2

3

4

5

6

1

MF598617.1, Middle East

respiratory syndrome-related

coronavirus strain

camel/UAE_B25_2015, United

Arabian Emirates, AE, 2017,

Illumina; Sanger dideoxy

sequencing

30123

712

30

58.8

2.30

2

MF598595.1, Middle East

respiratory syndrome-related

coronavirus strain

camel/UAE_B2_2015, United

Arabian Emirates, 2017,

Illumina; Sanger dideoxy

30123

709

30

59.04

2.30

3

NC-019843.3, Middle East

respiratory syndrome-related

coronavirus isolate HCoV-

EMC/2012, Saudi Arabia, 2020,

Sanger dideoxy

30119

717

30

58.48

2.30

4

KY673148.1, Middle East

respiratory syndrome-related

coronavirus strain

Hu/Oman_50_2015, 2017,

Sanger dideoxy

30123

714

29

58.74

2.30

5

KT225476.2, Middle East

respiratory syndrome coronavirus

isolate MERS-

CoV/THA/CU/17_06_2015,

Oman/Thailand, 2017, Sanger

dideoxy

29809

703

30

59.03

2.25

6

MG923479.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-CoV

camel/Nigeria/NV1712/2016,

2018, Sanger dideoxy

29455

701

30

58.08

2.24

7

MK967708.1, Middle East

respiratory syndrome-related

coronavirus isolate

Merscov/Egypt/Camel/AHRI-

FAO-1/2018, 2019, CLC

genomic workbench

30106

711

30

58.05

2.30

8

MT361640.1, Mutant Middle East

respiratory syndrome-related

coronavirus clone MERS-CoV

YKC, South Korea, 2021,

sequencing technology is

described in [76]

30136

710

30

58.90

2.30

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

344

Volume 21, 2022

9

KT326819.1, Middle East

respiratory syndrome coronavirus

strain MERS-

CoV/KOR/KNIH/001_05_2015,

South Korea, 2017, Illumina and

Sanger dideoxy

29995

711

30

58.86

2.30

10

MK129253.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-

CoV/KOR/KCDC/001_2018-

TSVi, South Korea, 2019, Sanger

dideoxy

30150

712

30

58.81

2.29

11

OL622035.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-

CoV_Riyadh_2016, 2021, Oxford

Nanopore

29994

709

30

59.03

2.29

12

OP712625.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-

CoV/dromedary

camel/Egypt/NC4714/2016,

2022, Illumina

30108

719

30

57.88

2.3

13

MH734114.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-CoV

camel/Kenya/C1215/2018, 2018

30033

721

29.5

58.05

2.27

14

MG923468.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-CoV

camel/Ethiopia/AAU-EPHI-

HKU4458/2017, 2018

30091

722

30

57.65

2.3

15

KJ361503.1, Middle East

respiratory syndrome coronavirus

isolate Hu-France -

FRA2_130569-

2013_Isolate_Sanger, 2014,

Sanger dideoxy sequencing

30040

710

29

59.18

2.27

16

KM210277.1, Middle East

respiratory syndrome coronavirus

isolate England/4/2013, complete

genome, 2014, Sanger dideoxy

sequencing

30031

712

30

58.81

2.27

17

KF958702.1, Middle East

respiratory syndrome coronavirus

isolate MERS-CoV-Jeddah-

human-1, 2013, Sanger dideoxy

sequencing

29851

709

29

58.67

2.25

18

OP654179.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-

CoV/dromedary

camel/Egypt/NC270-P9/2015,

2022, Illumina

30131

711

30

58.87

2.3

19

MW086535.1, Middle East

respiratory syndrome-related

coronavirus isolate MERS-

CoV/JC32/Ramtha, 2020,

Illumina

29825

707

29.5

58.72

2.25

20

MZ268405.1, Middle East

30106

718

30

58.19

2.3

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

345

Volume 21, 2022

respiratory syndrome-related

coronavirus isolate MERS-

CoV/Camel/Kenya/HKU-

CAC10200/2020, 2021, Illumina

Table 3. The Dengue virus, (GenBank), atg-walk

#

GenBank Virus Name,

Registration Year, Sequencing

Technology

Nucleotides

Number

of atg-

Triplets

Word

Median

Length

RMS

Word

Length

Fractal

Regularization

Dimension of

the Word-

length

Distribution

1

2

3

4

5

6

1

KY672944.1, Dengue virus 1

isolate DENV-

1/China/YN/YNH22 (2013),

2019, Sanger dideoxy

10709

299

23

47.74

2.36

2

KY672937.1, Dengue virus 1

isolate DENV-

1/China/YN/DGRL-6(2014),

2019, Sanger dideoxy

10738

294

23

50.02

2.33

3

MW386865.1, Dengue virus 1

isolate YNBN04, China, 2020,

Sanger dideoxy

10742

289

24

50.81

2.36

4

MG560269.1, Dengue virus 1

isolate

P1253/China/GD/CZ/2014,

2018, Sanger dideoxy

10583

298

23

47.55

2.35

5

MG560267.1, Dengue virus 1

isolate

P1258/China/GD/CZ/2014,

2018, Sanger dideoxy

10583

299

23

47.22

2.35

6

MN566112.1, Dengue virus 2

isolate New Caledonia-2018-

AVS127, 2020, Illumina

10722

267

32

52.24

2.4

7

KY672955.1, Dengue virus 2

isolate DENV-

2/China/YN/15DGR65(2015),

2019, Sanger dideoxy

10723

273

28

52.77

2.44

8

KY672954.1, Dengue virus 2

isolate DENV-

2/China/YN/JH1516(2015),

2019, Sanger dideoxy

10665

271

29

51.50

2.48

9

MK268692.1, Dengue virus 2

isolate DENV-2/TH/1974,

Thailand, 2019, Sanger dideoxy

10721

274

28

52.67

2.45

10

MH069499.1, Dengue virus 2

strain DENV-

2/VE/IDAMS/910105,

Venezuela, 2018, Illumina

10712

275

28

52.84

2.49

11

MN018389.1, Dengue virus 3

isolate D17011, China, 2020,

Sanger dideoxy sequencing

10708

272

28

55.46

2.57

12

NC_001475.3, Dengue virus 3,

Sri Lanka, 2019, Illumina

10707

273

27

55.05

2.58

13

KY863456 .1, Dengue virus 3

isolate 201610225, Indonesia,

2017, IonTorrent, Sanger

10707

278

28

52.84

2.5

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

346

Volume 21, 2022

dideoxy sequencing

14

MH544649.1, Dengue virus 3

isolate

449686_Antioquia_CO_2015,

Colombia, 2018, Illumina;

Sanger dideoxy sequencing

10707

273

28

52.84

2.49

15

MH823209.1, Dengue virus 3

isolate SMD-031, Indonesia,

2019, Illumina

10707

272

28

52.84

2.46

16

LC379197.1, Dengue virus 3

strain SYMAV-17/Gabon/2017

genomic RNA, 2019, Illumina

10641

271

29

52.85

2.06

17

KY921907.1, Dengue virus 3

isolate SG(EHI)D3/15095Y15,

2017, Singapore, Sanger dideoxy

sequencing

10667

266

29

53.58

2.09

18

KF041255.1, Dengue virus 3

isolate D3/Pakistan/55505/2007,

2013, Sanger dideoxy

sequencing

10675

268

29

53.55

2.07

19

LC379196.1, Dengue virus 3

strain SYMAV-09/Gabon/2016

genomic RNA, 2019, Illumina

10663

273

29

53.64

2.53

20

LC379195.1, Dengue virus 3

strain SYMAV-07/Gabon/2016

genomic RNA, 2019, Illumina

10663

273

29

53.64

2.53

21

KJ579245.1, Dengue virus 4

strain DENV-

4/MT/BR23_TVP17909/2012,

Brazil, 2020, Illumina

10649

273

26

53.12

2.09

22

MG272274.1, Dengue virus 4

isolate D4/IND/PUNE/IRSHA-

FG-03 (S-49), complete genome,

India, 2018, Ion Proton System

10652

270

27

53.07

2.07

23

KY672960.1, Dengue virus 4

isolate DENV-

4/China/YN/15DGR394 (2015),

2019, Sanger dideoxy

10661

276

26

52.63

2.09

24

KX224312.2, Dengue virus 4

isolate SG(EHI)D4/02990Y14,

Singapore, 2017, Sanger dideoxy

sequencing

10652

275

27

52.21

2.09

25

MG272272.1, Dengue virus 4

isolate D4/IND/PUNE/IRSHA-

FG-01 (1028), India, 2018, Ion

Proton System

10652

272

27

54.57

2.09

Table 4. The Ebola Virus, (GenBank), atg-walk

#

Genbank Virus Name, Registration Year,

Sequencing Technology

Nucleotide

Number

of atg-

Triplets

Word

Median

Length

RMS

Word

Length

Fractal

Dimension

of the Word-

length

Distribution

1

2

3

4

5

6

1

MG572235.1, Zaire ebolavirus isolate

Ebola virus/H.sapiens-

tc/COD/1995/Kikwit-9510621, Zaire,

2019, PacBio; Illumina

18957

329

40.5

85.29

2.03

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

347

Volume 21, 2022

2

KU174137.1, Mutant Zaire ebolavirus

isolate Ebola virus/H.sapiens-

rec/COD/1976/Yambuku-Mayinga-

eGFP-BDBV_GP, Zaire, 2019, Illumina

19774

339

41.5

84.02

2.06

3

KY786025.1, Ebola virus strain Ebola

virus/M.fascicularis-

wt/GAB/2001/untreated-CCL053D5,

Gabon, 2018, IonTorrent

18871

327

40.5

85.06

2.03

4

KY785936.1, Ebola virus strain Ebola

virus/M.fascicularis-

wt/GAB/2001/100mg-CA470D5, Gabon,

2018, IonTorrent

18871

327

40.5

85.06

2.03

5

MH121162.1, Sudan ebolavirus isolate

Ebola virus/H.sapiens-

tc/Sudan/1976/Boniface-R4142L, 2019,

Illumina

18831

345

37.5

77.79

2.2

6

MK952150.1, Sudan ebolavirus isolate

Ebola virus/H.sapiens-

wt/SSD/1976/Maridi-BNI/DT, South

Sudan, 2020, Illumina

18847

345

37

77.84

2.2

7

MH121169.1, Sudan ebolavirus isolate

Ebola virus/H.sapiens-

tc/Sudan/2004/Yambio-HCM/SAV/017,

2019, PacBio; Illumina

18849

345

37

77.84

2.2

8

NC_039345.1, Bombali ebolavirus

isolate Bombali ebolavirus/Mops

condylurus/SLE/2016/PREDICT_SLAB

000156, Sierra Leone, 2018, Sanger

dideoxy, Illumina

19043

325

36

84.24

2.38

9

MW056492.1, Bombali ebolavirus

isolate X030, Kenya, 2020, Illumina

19025

326

36

84.11

2.37

10

MF319186.1, Bombali ebolavirus isolate

Bombali virus/C.pumilus-

wt/SLE/2016/Northern Province-

PREDICT_SLAB000047, Sierra Leone,

2019, Sanger dideoxy

19043

324

36

85.27

2.36

11

MK340750.1, Bombali ebolavirus isolate

B241, Kenya, 2019, Illumina

19025

328

36

83.59

2.36

12

MW056493.1, Bombali ebolavirus

isolate Z153, Kenya, 2020, Illumina

19025

332

36

81.46

2.41

13

MK028856.1, Bundibugyo ebolavirus

isolate Ebola virus/H.sapiens-

tc/Uganda/2007/Bundibugyo-

200706291, 2019, PacBio; Illumina

18940

325

39

84.62

2.01

14

MK028834.1, Bundibugyo ebolavirus

isolate Ebola virus/H.sapiens-

tc/Uganda/2007/Bundibugyo-R4386L,

Uganda, 2019, PacBio; Illumina

18917

325

39

84.62

2.01

15

MK028835.1, Bundibugyo ebolavirus

isolate Ebola virus/H.sapiens-

tc/Uganda/2007/Bundibugyo-

200706291, 2019, PacBio; Illumina

18936

325

39

84.63

2.01

WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS

DOI: 10.37394/23201.2022.21.35

Alexandra Belinsky, Guennadi A. Kouzaev

E-ISSN: 2224-266X

348

Volume 21, 2022