RNA Knowledge Graph Analysis via Embedding Methods

FRANCESCO TORGANO, EMANUELE CAVALLERI, JESSICA GLIOZZO,

FEDERICO STACCHIETTI, EMANUELE SAITTO, MARCO MESITI,

ELENA CASIRAGHI, GIORGIO VALENTINI

AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano

Via Celoria 18, Milano

ITALY

Abstract: Recent advances in RNA technologies opened the avenue to the design of novel vaccines as wit-

nessed by the success of the COVID-19 vaccine and also by new ongoing vaccines for cancer. New drugs

based on non-coding RNA can also be developed at lower costs considering the relatively simple structure of

these molecules with respect to classical recombinant protein technologies. We recently developed RNA-KG, a

biomedical Knowledge Graph focused on RNA, collecting information from more than 50 public databases and

bio-medical ontologies to support the study of RNA and the design of novel RNA-based drugs. In this work

we show that, by applying inductive machine learning methods on top of embedded node and edges obtained by

applying classical Graph Representation Learning methods, we can accurately predict the entities and the rela-

tionships between entities included in RNA-KG. Our results open the way to the analysis and the discovery of

novel relationships between RNAs and other bio-molecules and medical concepts represented in RNA-KG.

Key-Words: Artificial Intelligence methods for graph analysis, Graph Representation Learning, Knowledge

Graphs, RNA.

Received: January 26, 2024. Revised: August 3, 2024. Accepted: September 4, 2024. Published: October 3, 2024.

1 Introduction

RNA-based technologies introduced novel therapeu-

tics for the treatment and prevention of human dis-

eases, [1]. Indeed RNA molecules play a fundamen-

tal role in cell biology, performing a wide range of

functions either directly by regulating gene expres-

sion, exhibiting enzymatic activity, modifying and

regulating other RNAs and other bio-molecules, or

indirectly by being translated into proteins. Different

types of RNA are involved in regulatory processes:

small non-coding RNAs (sncRNAs) are associated

with RNA interference pathways, including short in-

terfering RNAs (siRNAs), microRNAs (miRNAs),

short hairpin RNAs (shRNAs), antisense oligonu-

cleotides (ASOs), piwi-interacting RNAs (piRNAs),

tRNA-derived fragments (tRFs), and tRNA-derived

small RNAs (tsRNAs). sncRNAs modulate mRNA

expression by inhibiting translation or facilitating the

degradation of the target transcript via complemen-

tary base pairing. Long non-coding RNAs (lncRNAs)

hold crucial importance in the onset and advancement

of diseases, [2], and are involved in competitive en-

dogenous RNA (ceRNA) regulation, transcriptional

and epigenetic regulation, [3].

More in general several studies revealed the func-

tional characteristics of a large variety of RNA

molecules, [4], [5], thus opening the door for the

design of mRNA-based vaccines for the COVID-19

pandemic, [6], for the treatment of melanom,a [7],

and for the development of new drugs that can target

both proteins and mRNA, as well as other non-coding

RNA, [8].

Recently we proposed RNA-KG, the first

ontology-based knowledge graph (KG) for repre-

senting coding and non-coding RNA molecules and

their interactions with other bio-molecules as well as

with pathways, abnormal phenotypes, and diseases to

support the study and the discovery of the biological

role of the “RNA-world”, [9]. RNA-KG represents

relationships between bio-molecules and bio-medical

concepts through Resource Description Framework

(RDF) triples extracted from more than 50 public data

sources and also integrates related bio-medical con-

cepts coded through biomedical ontologies including

the Human Phenotype Ontology, [10], the Monarch

Merged Disease Ontology, [11], Chemical Entities

of Biological Interests, [12], and other fundamental

biomedical ontologies, [9].

RNA-KG exploits PheKnowLator, [13], a soft-

ware system for the construction of semantically rich,

large-scale biomedical KGs that are Semantic Web

compliant and amenable to automatic OWL reason-

ing. The current version of RNA-KG includes about

600K nodes and 9M of edges and can be exported in

different data formats. RNA-KG has been designed

not only to represent information related to RNA in a

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

302

Volume 21, 2024

chemical

25.9%

(150,080) protein

16.6%

(96,197)

cell

7.6%

(44,175)

7.6%

(43,869)

sncRNA

13.4%

(77,741)

other bio-entities

7.7%

(44,399)

disease 4.0%(23,270)

phenotype 2.9%(16,865)

anatomy 2.5%(14,232)

vaccine 1.1%(6,246)

pathway 0.5%(2,606)

other terms 0.4%(2,427)

sequence 0.4%(2,363)

species 0.4%(2,148)

lncRNA 4.3% (24,749)

mRNA 3.6% (20,632)

viral RNA 1.0% (5,693)

unclassified RNA 0.1% (692)

ontology terms 69.9% (404K) RNA 22.3% (130K) other 7.7% (44K)

(a) Distribution of nodes.

RO properties

involving RNA

molecules

85.9%

(7,528,124)

other

RO properties

introduced by

the integration

2.4% (213,259)

subClassOf

9.2%

(806,681)

other properties

2.5% (220,518)

non-RO 11.7% (1,027,199) RO 88.3% (7,741,383)

(b) Distribution of edges.

Fig.1: Pie-chart of: (a) node distribution according to node types. (b) edge distribution according to edge

types

relational graph format but also to provide a KG ready

to be analyzed through graph-based AI methods for

inferring new knowledge about the “RNA world” and

supporting the discovery of new RNA-based drugs.

In this paper we show that Graph Representation

Learning (GRL) methods, [14], [15], can be applied

to the analysis of RNA-KG to both visually repre-

sent and to classify the different types of nodes and

edges that characterize the heterogeneous graph. We

model the entity and relation predictions in RNA-KG

as a multi-class classification problem, using graph

embedding methods to transform nodes and edges

into their vector representation and applying induc-

tive machine learning methods to classify the nodes

and edges of the KG.

2 The RNA Knowledge Graph

(RNA-KG)

RNA-KG is the first KG that aggregates biologi-

cal knowledge about RNAs from over 50 public

databases, integrating functional relationships be-

tween genes, proteins, chemicals, and ontologically

grounded biomedical concepts. The current release

of RNA-KG [16] has a single component with ap-

proximately 600K nodes and 9M edges, and it can be

queried via a SPARQL endpoint from the laboratory

website [17]. The nodes are typically mapped to ref-

erence biomedical vocabularies and ontologies, such

as NCBI Gene Entrez identifiers for unique identifica-

tion of genes and various types of non-coding RNAs

(ncRNAs), the Human Phenotype Ontology for phe-

notypes, the Monarch merged disease ontology for

diseases, and the Gene Ontology, [18], for annotat-

ing genes. Furthermore, all possible interactions are

represented using the Relation Ontology (RO, [19]),

which ensures consistent semantics for the different

relationships extracted from the sources.

Fig 1a (adapted from [9]) illustrates the distri-

bution of nodes within RNA-KG. Nodes are di-

vided into those representing ontology terms and

bio-entities without a direct mapping. The bio-

entities category is further split into RNA nodes

(which includes sncRNA, mRNA, lncRNA, viral

RNA, and unclassified RNA nodes), and non-RNA

nodes (termed other bio-entities), including for

example gene and variant (SNP) nodes. Fig 1b dis-

plays the distribution of edges in RNA-KG. Edges

are sorted into three groups: (i) edges represent-

ing RO terms that denote interactions among RNA

molecules from various sources, (ii) edges represent-

ing the subClassOf relationships, and (iii) edges rep-

resenting other types of relationships not covered by

RO. The subClassOf relationship arises from the

integration of bio-ontologies into RNA-KG, along

with the absence of a dedicated ontology for RNA

molecules. When RNA molecules cannot be pre-

cisely mapped to a reference ontology, they are clas-

sified as subClassOf an appropriate category within

the Sequence Ontology, [20], such as SO_0000276 for

miRNA molecules.

3 RNA-KG embedding

We constructed embedded representation of nodes

and edges of RNA-KG to assess whether their vector

representations can be used to visualize the resulting

graph in an Euclidean space and to predict the node

and edge types of the graph. In particular, we applied

node2vec, [21], and LINE, [22], embedding methods.

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

303

Volume 21, 2024

Fig.2: t-SNE projections of RNA-KG embedding of the main node types generated by LINE. (Left) First order

LINE; (right) Second order LINE

3.1 Node2vec and LINE embedding

Node2vec uses random walks (RWs) to obtain se-

quences of nodes to “linearize” the graph and then

applies a shallow neural network to obtain a vectorial

representation of the nodes and edges, using an ap-

proach similar to word2vec to embed text, [23]. More

precisely, the node2vec second order RW is defined

by a transition probability of the form:

πrvx =αp,q(x, v, r)·wv,x.(1)

where πrvx is the probability of moving from node v

to node xcoming from node r. In eq. 1 the term wv,x

denotes the weight of the edge (v, x)∈E, and αp,q

are node2vec parameters defined as:

αp,q =









pif dr,x = 0

1if dr,x = 1

qif dr,x = 2

where dr,x denotes the graph distance between the

nodes rand x, whereby d∈ {0,1,2}. By tun-

ing the parameters pand qwe can both leverage the

homophily (through a Depth-First Sampling (DFS)-

like visit) and the structural (through a Breadth-First

Sampling (BFS)-like visit) characteristics of the input

graph, thus obtaining embeddings that can capture the

topological features of the graph.

LINE (Large-scale Information Network Embed-

ding) provides embeddings that scale nicely with big

graphs, [22]. The proposed model optimizes an ob-

jective function that preserves the first and second or-

der proximity of nodes in the embedded spaces. First-

order proximity is defined as the local pairwise prox-

imity between two vertices, indicated by the weight of

the edge connecting them. Second-order proximity is

instead defined as the similarity between the neigh-

borhoods of the two vertices. The first-order LINE

model optimizes an objective function that considers

the following function:

O1=X

i,j∈E

KL(ˆp1(vi, vj), p1(vi, vj))

where KL is the Kullback-Leibler divergence be-

tween the joint probability p1(vi, vj) = 1

1+exp(−uT

i·uj)

and the empirical probability ˆp1(vi, vj) = wij

W, with

viand vj∈Vand ui, ujare their corresponding em-

bedding in a vector space, and wij is the weight of the

edge (vi, vj), with W=P(i,j)∈Ewij .

Similarly second-order LINE minimizes the KL

divergence between the second-order proximity em-

pirical distribution and the second-order proximity

distribution in the embedded space, [22].

3.2 Embedded representations of RNA-KG

We generated 100-dimensional representations of the

nodes through the LINE algorithm. Fig. 2 shows the t-

SNE, [24], two-dimensional projections of the nodes

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

304

Volume 21, 2024

obtained by LINE, while Fig. 3 shows the same em-

beddings obtained with node2vec. In both figures the

main seven most numerous node types are shown. We

can observe that all the embeddings methods can rea-

sonably separate the different node types. However

node2vec can neatly separate the node types, espe-

cially with the second-order Depth-First-like (DFS:

p= 5, q = 0.2) and Breadth-First-like (BFS: p=

0.2, q = 5) RWs; in particular while node2vec is able

to separate genes (brown points) from proteins (or-

ange points), this is not the case with LINE embed-

dings (Fig. 2 and Fig. 3).

A reasonable separation has been also achieved

with respect to edge types, even if for both LINE and

node2vec the separation of the different edge types is

not so clearly defined as for node types (Fig. 4).

4 Classification of embedded nodes

and edges

The embedded representations of nodes and edges are

used to train different learning machines for the pre-

diction of the node and the edge types of RNA-KG.

4.1 Experimental set-up

We applied the following models to classify the nodes

and edges of RNA-KG: Decision tree classifier, [25],

random forest ensembles, [26], a linear perceptron

classifier, a support vector machine classifier with

Gaussian kernel and a multi-layer perceptron (MLP)

classifier with one hidden layer.

We applied multiple hold-out (70% training and

30% test) on 2-dimensional t-SNE projections, [24],

of the 100-dimensional embedding of randomly sam-

pled nodes (20K from about 600K) and edges (10K

from about 9M) of RNA-KG. We also evaluated dif-

ferent multi-class classification tasks, considering for

node type prediction the classification of the 7most

represented classes of nodes, and also the multi-class

classification of the 20 and 54 most represented node

types. For edge type classification we considered re-

spectively the 7,15,40 and 74 most represented edge

types.

We performed limited model selection with de-

cision trees (tuning of the maximum depth of the

tree) and random forest (tuning of the number of base

learners) and no model selection at all for the Percep-

tron, MLP (one hidden layer with 100 neurons, ReLU

activation function, ADAM for weight optimization

and maximum number of iterations set to 500), and a

gaussian SVM with regularization parameter C= 1

and maximum number of iterations set to 200. All the

models have been implemented using the scikit-learn

Python library and the embeddings have been com-

puted using the GRAPE library, [27].

4.2 Classification of node and edge types of

the RNA-KG

Fig. 5 summarizes RNA-KG node and edge type pre-

dictions across the different classification tasks and

the different models using the 2-dimensional t-SNE

projections of the BFS-like node2vec embeddings.

Node type classification, decision trees, random

forests and SVMs achieved a balanced accuracy

across the 7most represented classes larger than 90%

and also with 20 classes we obtained a balanced ac-

curacy close or larger than 90%. With 54 node types,

a reasonable accuracy of about 50% is obtained (con-

sider that a random balanced accuracy with 54 classes

would be about 2%). The linear perceptron obtained

worse results, since classes are surely non linearly

separable (see Fig. 3).

Reasonable, but significantly worse results are ob-

tained for edge type classification (Fig. 5, bottom).

With random forests (the best performing method) we

obtained a balanced accuracy of about 75% with 7

classes but performances decrease with other models

or, as expected, when the number of classes is higher.

Fig. 6 shows the effect of model parameters in de-

cision trees (Fig. 6 a) and b) and random forests (Fig. 6

c).

Summarizing, results show that edge and espe-

cially node types of RNA-KG are predictable using

also simple prediction models (e.g., decision trees)

trained on top of the node and edge node2vec embed-

dings.

4.3 Prediction of the overall nodes and

edges of RNA-KG

Previous results were obtained on a random sample of

20K nodes of RNA-KG. Here we present the results

obtained on the analysis of larger numbers of nodes

till to the overall about 600K nodes of RNA-KG.

Table 1 and Table 2 report the results of the deci-

sion trees and random forest trained on different num-

ber of classes and including the original sample of

20K nodes but also a larger random sample of 100K

till to the overall 600K nodes of RNA-KG. Also in

this case we performed a multiple hold-out (repeated

5 times) by splitting the available data with 70% of

training and 30% test set.

5 Discussion

We predicted node and edge types of RNA-KG, us-

ing relatively simple embedding and classification

methods. Embeddings of nodes and edges visual-

ized through t-SNE projections show that the differ-

ent types of nodes and edges can be separated in the

euclidean space, even if edge types show a less clear

separation (Fig. 2, Fig.3, Fig. 4).

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

305

Volume 21, 2024

Fig.3: t-SNE projections of RNA-KG embedding of the main node types using node2vec according to three

different graph visiting strategies: a) DFS-like RW b) BFS-like RW c) First order RW

Fig.4: Embedding of the main edge types of RNA-KG using second-order LINE (left) and DeepWalk (right)

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

306

Volume 21, 2024

Fig.5: RNA-KG node and edge type classification. (top) Comparison of balanced accuracy results between

models on the most represented 7, 20, 54 node types; (bottom) Comparison of balanced accuracy results between

models on the most represented 7, 14, 40, 71 edge types. Vertical lines on top of the bars represent the standard

deviation

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

307

Volume 21, 2024

(a) (b)

Fig.6: Effect of the learning parameters on decision tree and random forest balanced accuracy performance.

(a) Decision tree node type prediction results with respect to the maximum tree depth; (b) Decision tree edge type

prediction results with respect to the maximum tree depth; (c) Random forest node type prediction results with

respect to the number of base learners. (d) Random forest edge type prediction results with respect to the number

of base learners

These results are also confirmed by the classifica-

tion performance obtained by machine learning mod-

els trained on top of the embeddings. All the models,

except the linear perceptron, achieve a reasonable bal-

anced accuracy (Fig. 5) for all the classification tasks.

The results on edge embeddings (Fig. 5, bottom),

even if significantly better than random guessing, are

worse with respect to node type prediction since cur-

rent edge type definitions in RNA-KG are too general

to functionally characterize the distinct types of edges

of RNA KG. For instance general relationships, such

as “interacts with” or “regulates activity of” can in-

volve different types of nodes, e.g. genes or proteins

or miRNA and mRNA, and refers to different func-

tional relationships all “pushed” into the same type of

edge.

Classification results on the overall nodes of RNA-

KG show a certain decrement in the performance of

decision trees and random forest classifiers, even if

with 7classes we obtained a balanced accuracy larger

than 80% with both decision trees and random forests

(Table 1 and Table 2). This may be due to the reduced

input dimension and also to the fact that we did not

perform a thorough model selection, or to a possible

larger presence of outliers.

Summarizing this preliminary analysis shows that

embeddings methods coupled with off-the-shelf clas-

sification methods can obtain good results on the pre-

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

308

Volume 21, 2024

Table 1.Decision tree node type prediction results on

RNA-KG. Balanced accuracy and empirical compu-

tational time are reported considering different num-

bers of classes and nodes of the graph.

#cl. #nodes balanced_acc time

7 20 000 92.81% ± 0.57% 0.45

7 100 000 92.34% ± 0.27% 2.81

7 578 384 81.84% ± 0.14% 12.93

20 20 000 90.18% ± 0.58% 0.51

20 100 000 89.73% ± 0.22% 3.22

20 578 384 77.19% ± 0.06% 14.75

54 20 000 55.43% ± 2.50% 0.67

54 578 384 42.14% ± 0.41% 19.00

68 100 000 52.77% ± 1.66% 4.37

68 578 384 33.39% ± 0.39% 20.90

81 578 384 32.36% ± 0.53% 22.72

Table 2.Random forest node type prediction re-

sults on RNA-KG. Balanced accuracy and empirical

computational time are reported considering different

numbers of classes and nodes of the graph. Est. rep-

resents the number of estimators (base learners).

#cl. #nodes #est. balanced_acc time

7 20 000 50 94.32% ± 0.22% 6.77

7 100 000 200 94.11% ± 0.23% 136.48

7 578 384 500 83.72% ± 0.11% 194.26

20 20 000 50 92.07% ± 0.43% 7.92

20 100 000 200 91.74% ± 0.44% 133.67

20 578 384 500 80.29% ± 0.19% 248.77

54 20 000 50 55.45% ± 2.78% 10.47

54 578 384 500 42.12% ± 0.44% 369.31

68 100 000 200 52.28% ± 2.89% 169.01

68 578 384 500 33.59% ± 0.33% 412.86

81 578 384 500 32.20% ± 0.67% 452.37

dictions of node and edge types of RNA-KG. This

is quite surprising since we used relatively simple

prediction methods without performing an accurate

model selection. We foresee that better results could

be obtained through fine-tuned models (some prelim-

inary results seem to confirm this hypothesis, data not

shown).

Moreover we used only bi-dimensional t-SNE pro-

jections of the embeddings to train the classifiers, but

by using full embeddings we could in principle further

improve the classification performances.

We also observe that for the embeddings we used

methods conceived for homogeneous graph embed-

dings. By applying graph embeddings for heteroge-

neous graphs, [28], we could further improve the re-

sults since RNA-KG is a heterogeneous graph com-

posed of a large number of different nodes and edge

types.

Our results show that GRL methods can be suc-

cessfully applied to the analysis of RNA-KG with ac-

curate predictions. These findings open the door to

more refined analyses of RNA-KG to detect novel

edges to support the investigation of the “RNA world”

and the discovery of new RNA-based drugs.

For future research we foresee that the applica-

tion of graph embedding methods aware of the hetero-

geneity of the RNA-KG, [29], [30], can significantly

improve the predictions of novel relationships be-

tween non coding RNAs and other target molecules,

as well as associations between ncRNA with abnor-

mal phenotypes and diseases. Another futuew re-

search direction is represented by the application of

graph neural networks that can solve the edge predic-

tion problem with a direct end-to-end approach, [31],

[32].

References:

[1] Sparmann, Anke and Vogel, Jörg. Rna-based

medicine: from molecular mechanisms to ther-

apy. The EMBO Journal, 42(21):e114760,

2023.

[2] John S. Mattick, Paulo P. Amaral, Piero Carn-

inci, Susan Carpenter, Howard Y. Chang, Ling-

Ling Chen, Runsheng Chen, Caroline Dean,

Marcel E. Dinger, Katherine A. Fitzgerald,

Thomas R. Gingeras, Mitchell Guttman, Tet-

suro Hirose, Maite Huarte, Rory Johnson, Chan-

drasekhar Kanduri, Philipp Kapranov, Jeanne B.

Lawrence, Jeannie T. Lee, Joshua T. Mendell,

Timothy R. Mercer, Kathryn J. Moore, Shinichi

Nakagawa, John L. Rinn, David L. Spector,

Igor Ulitsky, Yue Wan, Jeremy E. Wilusz, and

Mian Wu. Long non-coding rnas: defini-

tions, functions, challenges and recommenda-

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

309

Volume 21, 2024

tions. Nature Reviews Molecular Cell Biology,

24(6):430–447, January 2023.

[3] Lin Liu, Zhao Li, Chang Liu, Dong Zou, Qian-

peng Li, Changrui Feng, Wei Jing, Sicheng Luo,

Zhang Zhang, and Lina Ma. LncRNAWiki

2.0: a knowledgebase of human long non-

coding RNAs with enhanced curation model

and database system. Nucleic Acids Research,

50(D1):D190–D195, 2022.

[4] Lucia Lorenzi, Hua-Sheng Chiu, Francisco

Avila Cobos, Stephen Gross, Pieter-Jan Vold-

ers, Robrecht Cannoodt, Justine Nuytens, Ka-

trien Vanderheyden, Jasper Anckaert, Steve

Lefever, et al. The rna atlas expands the catalog

of human non-coding rnas. Nature biotechnol-

ogy, 39(11):1453–1465, 2021.

[5] Andreas Keller, Laura Gröger, Thomas Tsch-

ernig, Jeffrey Solomon, Omar Laham, Nicholas

Schaum, Viktoria Wagner, Fabian Kern,

Georges Pierre Schmartz, Yongping Li, et al.

mirnatissueatlas2: an update to the human

mirna tissue atlas. Nucleic acids research,

50(D1):D211–D221, 2022.

[6] Ann J. Barbier, Allen Yujie Jiang, Peng

Zhang, Richard Wooster, and Daniel G. Ander-

son. The clinical progress of mrna vaccines

and immunotherapies. Nature Biotechnology,

40(6):840–854, May 2022.

[7] Thiago Carvalho. Personalized anti-cancer

vaccine combining mrna and immunotherapy

tested in melanoma trial. Nature Medicine,

29(10):2379–2380, August 2023.

[8] Melanie Winkle, Sherien M. El-Daly, Muller

Fabbri, and George A. Calin. Noncoding

rna therapeutics — challenges and potential

solutions. Nature Reviews Drug Discovery,

20(8):629–651, June 2021.

[9] Cavalleri, E and Cabri, A and Soto-Gomez, M

and Bonfitto, S and Perlasca, P and Gliozzo, J

and Callahan, T and Reese, J and Robinson, P

and Casiraghi, E and Valentini, G and Mesiti,

M. Rna-kg: An ontology-based knowledge

graph for representing interactions involving rna

molecules. Scientific Data, Nature Publishing,

(in press), 2024.

[10] Peter N. Robinson, Sebastian Köhler, Sebas-

tian Bauer, Dominik Seelow, Denise Horn, and

Stefan Mundlos. The human phenotype ontol-

ogy: A tool for annotating and analyzing human

hereditary disease. The American Journal of Hu-

man Genetics, 83(5):610–615, November 2008.

[11] Lynn M Schriml, James B Munro, Mike Schor,

Dustin Olley, Carrie McCracken, Victor Fe-

lix, J Allen Baron, Rebecca Jackson, Su-

san M Bello, Cynthia Bearer, Richard Lichen-

stein, Katharine Bisordi, Nicole Campion Di-

alo, Michelle Giglio, and Carol Greene. The

human disease ontology 2022 update. Nu-

cleic Acids Research, 50(D1):D1255–D1261,

November 2021.

[12] K. Degtyarenko, P. de Matos, M. Ennis, J. Hast-

ings, M. Zbinden, A. McNaught, R. Alcan-

tara, M. Darsow, M. Guedj, and M. Ashburner.

Chebi: a database and ontology for chemical

entities of biological interest. Nucleic Acids

Research, 36(Database):D344–D350, Decem-

ber 2007.

[13] Tiffany J. Callahan, Ignacio J. Tripodi, Adri-

anne L. Stefanski, Luca Cappelletti, Sanya B.

Taneja, Jordan M. Wyrwa, Elena Casir-

aghi, Nicolas A. Matentzoglu, Justin Reese,

Jonathan C. Silverstein, Charles Tapley Hoyt,

Richard D. Boyce, Scott A. Malec, Deepak R.

Unni, Marcin P. Joachimiak, Peter N. Robinson,

Christopher J. Mungall, Emanuele Cavalleri,

Tommaso Fontana, Giorgio Valentini, Marco

Mesiti, Lucas A. Gillenwater, Brook Santan-

gelo, Nicole A. Vasilevsky, Robert Hoehndorf,

Tellen D. Bennett, Patrick B. Ryan, George

Hripcsak, Michael G. Kahn, Michael Bada,

William A. Baumgartner, and Lawrence E.

Hunter. An open source knowledge graph

ecosystem for the life sciences. Scientific Data,

11(1), April 2024.

[14] M.M. Li, K. Huang, and M. Zitnik. Graph rep-

resentation learning in biomedicine and health-

care. Nat. Biomed. Eng., 6:1353–1369, 2022.

[15] Luca Cappelletti, Lauren Rekerle, Tommaso

Fontana, Peter Hansen, Elena Casiraghi, Vida

Ravanmehr, Christopher J Mungall, Jeremy J

Yang, Leonard Spranger, Guy Karlebach,

J Harry Caufield, Leigh Carmody, Ben Cole-

man, Tudor I Oprea, Justin Reese, Giorgio

Valentini, and Peter N Robinson. Node-degree

aware edge sampling mitigates inflated classifi-

cation performance in biomedical random walk-

based graph representation learning. Bioinfor-

matics Advances, 4(1):vbae036, 03 2024.

[16] Emanuele Cavalleri et al. RNA-KG: data

and experiments code. Available at: https:

//doi.org/10.5281/zenodo.10418431. Ac-

cessed: 14 March 2024.

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

310

Volume 21, 2024

[17] RNA-KG website. Available at: http://

RNA-KG.anacleto.di.unimi.it. Accessed:

22 December 2023.

[18] Michael Ashburner, Catherine A. Ball, Ju-

dith A. Blake, David Botstein, Heather But-

ler, J. Michael Cherry, Allan P. Davis, Kara

Dolinski, Selina S. Dwight, Janan T. Eppig,

Midori A. Harris, David P. Hill, Laurie Issel-

Tarver, Andrew Kasarskis, Suzanna Lewis,

John C. Matese, Joel E. Richardson, Martin

Ringwald, Gerald M. Rubin, and Gavin Sher-

lock. Gene ontology: tool for the unification

of biology. Nature Genetics, 25(1):25–29, May

2000.

[19] Chris Mungall, Nico Matentzoglu, Jim Balhoff,

David Osumi-Sutherland, Bill Duncan, pgaudet,

Shawn Tan, Charles Tapley Hoyt, Clare Pil-

grim, James A. Overton, Lauren, Anita Caron,

Nomi Harris, Sierra Moxon, lschriml, Nicole

Vasilevsky, Sabrina Toro, Damien Goutte-

Gattat, Matthew Brush, Vasundra Touré, An-

thony Bretaudeau, Scott Cain, Melissa Haen-

del, diatomsRcool, Bide Zhang, Clint Dow-

land, Damion Dooley, actions user, and Jen

Hammock. oborel/obo-relations: 2023-08-18

release. Available at https://doi.org/10.

5281/zenodo.8263469, August 2023.

[20] Karen Eilbeck, Suzanna E Lewis, Christopher J

Mungall, Mark Yandell, Lincoln Stein, Richard

Durbin, and Michael Ashburner. The sequence

ontology: a tool for the unification of genome

annotations. Genome Biology, 6(5), April 2005.

[21] Aditya Grover and Jure Leskovec. Node2vec:

Scalable feature learning for networks. In Pro-

ceedings of the 22nd ACM SIGKDD Interna-

tional Conference on Knowledge Discovery and

Data Mining, KDD ’16, page 855–864, New

York, NY, USA, 2016. Association for Comput-

ing Machinery.

[22] Jian Tang, Meng Qu, Mingzhe Wang, Ming

Zhang, Jun Yan, and Qiaozhu Mei. Line:

Large-scale information network embedding. In

Proceedings of the 24th International Confer-

ence on World Wide Web, WWW ’15, page

1067–1077, Republic and Canton of Geneva,

CHE, 2015. International World Wide Web

Conferences Steering Committee.

[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg

Corrado, and Jeffrey Dean. Distributed repre-

sentations of words and phrases and their com-

positionality. In Proceedings of the 26th In-

ternational Conference on Neural Information

Processing Systems - Volume 2, NIPS’13, page

3111–3119, Red Hook, NY, USA, 2013. Curran

Associates Inc.

[24] Laurens van der Maaten and Geoffrey Hinton.

Visualizing data using t-sne. Journal of Machine

Learning Research, 9(86):2579–2605, 2008.

[25] L. Breiman, Jerome H. Friedman, Richard A.

Olshen, and C. J. Stone. Classification and re-

gression trees. Biometrics, 40:874, 1984.

[26] L. Breiman. Random forests. Machine Learn-

ing, 45:5–32, 2001.

[27] L. Cappelletti, T. Fontana, E. Casiraghi,

V. Ravanmehr, T.J. Callahan, C. Cano, M.P.

Joachimiak, C.J. Mungall, P.N. Robinson,

J. Reese, and G. Valentini. Grape for fast

and scalable graph processing and random

walk-based embedding. Nature Computational

Science, 3:552–568, 2023.

[28] Y. Xie, B. Yu, S. Lv, C. Zhang, G. Wang, and

M. Gong. A survey on heterogeneous network

representation learning. Pattern Recognition,

116(107936), 2021.

[29] Ayush Noori, Michelle M Li, Amelia LM

Tan, and Marinka Zitnik. Metapaths: similar-

ity search in heterogeneous knowledge graphs

via meta-paths. Bioinformatics, 39(5):btad297,

2023.

[30] Dengju Yao, Yuexiao Deng, Xiaojuan Zhan,

and Xiaorong Zhan. Predicting lncrna-disease

associations using multiple metapaths in hierar-

chical graph attention networks. BMC Bioinfor-

matics, 25(1), January 2024.

[31] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré,

and K. Murphy. Machine Learning on

Graphs: A Model and Comprehensive Taxon-

omy. Journal of Machine Learning Research,

23(89):1–64, 2022.

[32] Yixuan Liang and Yuan Wan. Learning

on heterogeneous graph neural networks with

consistency-based augmentation. Applied Intel-

ligence, 53(22):27624–27636, 2023.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

Francesco Torgano implemented the code.

Emanuele Cavalleri curated the graphs used in the ex-

periments.

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

311

Volume 21, 2024

Francesco Torgano, Jessica Gliozzo, Federico Stac-

chietti and Emanuele Saitto performed the experi-

ments, results visualization and evaluation.

Giorgio Valentini conceptualized the work, drafted

the original paper, and acquired fundings.

Marco Mesiti, Elena Casiraghi and Giorgio Valentini

validated the results and supervised the work.

All the authors revised, read and approved the final

manuscript.

Funding sources

This research was supported by the National Cen-

ter for Gene Therapy and Drugs based on RNA

Technology, PNRR-NextGenerationEU program

(G43C22001320007).

Conflicts of Interest

The authors have no conflicts of interest to declare

that are relevant to the content of this article.

Creative Commons Attribution License 4.0

(Attribution 4.0 International , CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en

_US

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2024.21.30

Francesco Torgano, Emanuele Cavalleri,

Jessica Gliozzo, Federico Stacchietti,

Emanuele Saitto, Marco Mesiti,

Elena Casiraghi, Giorgio Valentini

E-ISSN: 2224-2902

312

Volume 21, 2024