A Review on Genomics Data Analysis using Machine Learning
ASHWANI KUMAR AGGARWAL
Department of Electrical and Instrumentation Engineering,
Sant Longowal Institute of Engineering and Technology, Longowal,
SLIET, Longowal - 148106,
INDIA
Abstract: - The advancements in genomics research have led to an exponential growth in the amount of data
generated from various sequencing technologies. Analyzing this vast amount of genomic data is a complex task
that can provide valuable insights into biological processes, disease mechanisms, and personalized medicine. In
recent years, machine learning has emerged as a powerful tool for genomic data analysis, enabling researchers
to uncover hidden patterns, make predictions, and gain a deeper understanding of the genome. This review aims
to provide an overview of the applications of machine learning in genomics data analysis, highlighting its
potential, challenges, and future directions.
Key-Words: - Genomics; Data Analysis; Machine Learning; Bioinformatics; Feature Selection; Classification
Algorithms
Received: May 24, 2022. Revised: August 27, 2023. Accepted: September 21, 2023. Published: October 10, 2023.
1 Introduction
Genomics, the study of an organism’s complete set
of DNA, has transformed our understanding of
biology and disease. The advancements in high-
throughput sequencing technologies have generated
vast amounts of genomic data, enabling researchers
to explore the complexities of the genome at an
unprecedented scale, [1]. However, the analysis and
interpretation of this massive amount of data pose
significant challenges due to its size, complexity,
and inherent noise. Machine learning techniques
have emerged as powerful tools for genomics data
analysis, offering the potential to extract valuable
insights from large-scale genomic datasets. Machine
learning algorithms can uncover patterns,
relationships, and predictive models in genomics
data, aiding in the understanding of genetic
variations, gene expression, regulatory elements,
and disease mechanisms, [2]. These techniques
provide a data-driven approach that complements
traditional statistical methods and allows for the
exploration of complex genomic landscapes. One
area where machine learning has shown great
promise is in the identification and interpretation of
genetic variants. Single nucleotide polymorphisms
(SNPs), structural variations, and other genomic
alterations are crucial determinants of phenotypic
variation and disease susceptibility, [3]. Machine
learning algorithms can learn from large reference
datasets to classify and prioritize these variants
based on their potential functional impact. These
methods help prioritize variants for downstream
functional experiments and assist in understanding
the genetic basis of diseases, [4]. Another important
application of machine learning in genomics is gene
expression analysis. With the advent of RNA
sequencing (RNA-seq), researchers can measure
gene expression levels in a high-throughput and
quantitative manner, [5]. Machine learning
algorithms can accurately classify and predict gene
expression patterns, enabling the identification of
differentially expressed genes, gene co-expression
networks, and regulatory modules. These
approaches aid in understanding the dynamics of
gene regulation, developmental processes, and
disease mechanisms, [6].
Machine learning techniques have been
instrumental in deciphering the noncoding regions
of the genome. A significant portion of the genome
consists of noncoding regions that play critical roles
in gene regulation. Machine learning algorithms can
integrate various genomic features, such as DNA
sequence, chromatin accessibility, and histone
modifications, to predict functional elements, such
as enhancers and promoters, [7]. These predictions
facilitate the understanding of gene regulatory
networks, the impact of genetic variants in
noncoding regions, and the identification of
potential therapeutic targets, [8]. Additionally,
machine-learning approaches have been employed
in the analysis of genomic sequences and their
evolutionary relationships. By leveraging sequence
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
119
Volume 20, 2023
alignment algorithms, hidden Markov models, and
deep learning architectures, researchers can classify
and predict the functions of genes and proteins, [9].
These techniques aid in the annotation of genomes,
the prediction of protein structure and function, and
the identification of novel genes and pathways, [10].
However, the application of machine learning
techniques in genomics data analysis is not without
challenges. The complexity and high dimensionality
of genomic data require careful consideration of
feature selection, model interpretation, and
generalizability, [11]. Overfitting, class imbalance,
and confounding factors must be addressed to
ensure the reliability and reproducibility of results,
[12]. Additionally, the integration of diverse data
types, such as genomics, transcriptomics, and
epigenomics, necessitates the development of
innovative algorithms and computational
frameworks, [13]. Machine learning techniques
have revolutionized genomics data analysis,
providing powerful tools for extracting meaningful
insights from large-scale genomic datasets, [14].
The ability to classify genetic variants, predict gene
expression patterns, identify regulatory elements,
and understand genomic sequences has opened new
avenues for research in genomics and personalized
medicine, [15]. As the field continues to evolve,
addressing the challenges associated with data
integration, interpretability, and reproducibility will
be crucial for advancing genomics data analysis
using machine learning approaches, [16].
2 Related Work
A comprehensive analysis of long non-coding
RNAs (lncRNAs) in different human cancers,
identifying cancer-specific lncRNA signatures that
can be used as potential biomarkers for diagnosis
and prognosis by, [17]. The study also explored the
functional relevance of cancer-associated lncRNAs,
shedding light on their regulatory mechanisms and
interactions with protein-coding genes. Through
integrative analysis of multi-dimensional genomic
data, the paper offered a comprehensive
understanding of the landscape of cancer-associated
lncRNAs. Additionally, it generated a valuable
resource for the research community, providing a
catalog of cancer-associated lncRNAs and their
genomic features. There are some drawbacks to this
paper. The study relied heavily on computational
analyses and genomic data, potentially overlooking
the functional validation of identified lncRNAs. The
sample size and heterogeneity of the cancer types
included may have limited the generalizability of
the findings. The study focused primarily on
lncRNA expression patterns and genetic alterations,
without delving into their precise molecular
mechanisms. Fourthly, the paper lacked an in-depth
analysis of the clinical implications and translational
potential of the identified lncRNA signatures.
Finally, the rapid advancements in genomics and
technology since 2015 may warrant further
investigation and updating of the findings to reflect
the current understanding of lncRNAs in cancer
biology. The paper by, [18], contributed
significantly to the field of genomics by introducing
a powerful computational tool for predicting DNA
methylation states at the single-cell level. By
employing deep learning techniques, the study
achieved high accuracy in predicting DNA
methylation patterns, which play a crucial role in
gene regulation and cellular function. The paper
addressed the challenge of sparse and noisy DNA
methylation data by developing an innovative model
capable of capturing complex relationships and
patterns in the data. The proposed tool, DeepCpG,
provided researchers with a valuable resource for
understanding the epigenetic landscape of individual
cells, paving the way for further investigations into
the role of DNA methylation in cellular processes
and diseases. Ultimately, the paper contributed to
advancing our understanding of the epigenome and
its implications in various biological contexts. There
are some drawbacks to consider. Firstly, the reliance
on deep learning models may introduce challenges
in interpretability, making it difficult to understand
the underlying mechanisms behind the predicted
DNA methylation states. Secondly, the performance
of the DeepCpG tool may be influenced by the
quality and coverage of the input DNA methylation
data, which can vary across experiments and
technologies. Thirdly, the paper focused on DNA
methylation prediction at the single-cell level,
potentially overlooking the complexities and
heterogeneity within cell populations. Fourthly, the
tool’s generalizability to different cell types and
biological contexts remains to be thoroughly
evaluated. Lastly, the computational demands
associated with deep learning approaches may limit
the accessibility and scalability of the tool for
researchers with limited computing resources or
expertise.
The paper by, [19], made significant
contributions to the field of genomics by providing a
comprehensive and integrative analysis of human
epigenomes. By analyzing data from 111 reference
epigenomes, the study offered valuable insights into
the regulatory landscape of the human genome
across diverse tissues and cell types. The paper
identified key epigenetic features, such as DNA
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
120
Volume 20, 2023
methylation patterns, histone modifications, and
chromatin accessibility, and elucidated their roles in
gene regulation and disease susceptibility. The
findings not only expanded our understanding of
epigenetic variation but also provided a rich
resource for researchers to investigate the functional
impact of epigenetic modifications in various
biological processes and diseases. Ultimately, the
paper contributed to the establishment of a
comprehensive framework for studying the
epigenome and its implications for human health
and disease. The analysis focused on reference
epigenomes, which may not fully capture the
diversity and complexity of epigenetic profiles
across different individuals and populations. The
study primarily relied on publicly available datasets,
potentially introducing biases and limitations in data
quality and coverage. The integration of multi-
omics data from diverse sources may introduce
technical and biological variability, which could
impact the accuracy and interpretation of the results.
The study predominantly provided correlative
analyses, lacking in-depth functional validation of
the identified epigenetic features. Lastly, the paper
did not extensively explore the potential
confounding factors, such as age, sex, and
environmental influences, which may influence
epigenetic patterns and their interpretation. A
significant contribution to the field of genomics is
made by, [20], by introducing a powerful tool for
exploring long-range genome interactions. The
paper presented the WashU Epigenome Browser, a
user-friendly and interactive platform that allows
researchers to visualize and analyze chromatin
interactions at various genomic scales. By
incorporating diverse genomic datasets, including
Hi-C, ChIA-PET, and 3D chromatin models, the
browser enabled the investigation of spatial
chromatin organization and regulatory interactions.
The tool provided valuable insights into the three-
dimensional structure of the genome, offering a
deeper understanding of gene regulation, enhancer-
promoter interactions, and their implications in
development, disease, and epigenetic mechanisms.
Ultimately, the paper contributed to advancing our
knowledge of genome architecture and provided
researchers with a valuable resource for studying the
spatial organization of the genome. The browser’s
functionality and analysis capabilities may be
limited by the availability and integration of specific
datasets. The tool’s effectiveness relies on the
completeness and quality of the incorporated
genomic datasets, which can vary across different
genomic regions and cell types. The interpretation
of long-range genome interactions can be complex
and context-dependent, requiring careful
consideration of experimental biases and biological
variability. The browser-primarily focuses on
visualization and exploration, potentially lacking
advanced analytical features for quantitative
analysis and hypothesis testing. The paper did not
extensively address potential challenges or
limitations of the browser, such as scalability to
large datasets or compatibility with emerging
genomics technologies. The user interface and
accessibility of the tool may pose a learning curve
for researchers unfamiliar with its specific
functionalities and data formats. A comprehensive
summary of the advancements in single-cell RNA
sequencing (scRNA-seq) technology and its
applications in cancer research is provided by, [21].
The paper discusses the emergence of scRNA-seq as
a powerful tool for studying tumor heterogeneity
and understanding the cellular composition of
tumors at the single-cell level. It highlights the
various scRNA-seq methods and technologies that
have been developed to capture the gene expression
profiles of individual cells. The paper also
emphasizes the significance of scRNA-seq in
uncovering rare cell populations within tumors, such
as cancer stem cells, and elucidating their functional
roles in tumor progression and therapeutic
resistance. Furthermore, it showcases the utility of
scRNA-seq in deciphering tumor microenvironment
interactions and identifying potential therapeutic
targets. Overall, the paper underscores the
transformative impact of scRNA-seq in advancing
our knowledge of cancer biology and highlights its
potential for guiding personalized cancer treatments.
A comprehensive overview of the application of
machine learning techniques in predicting drug
response in cancer is discussed by, [22]. The paper
discusses the challenges in personalized cancer
treatment and highlights the potential of machine
learning algorithms in identifying predictive
biomarkers and developing robust models for drug
response prediction. It explores various machine
learning methods, including supervised learning,
unsupervised learning, and deep learning, and their
application to large-scale genomic and clinical
datasets. The paper also discusses the integration of
multi-omics data and the use of feature selection
techniques to improve the accuracy and
interpretability of predictive models. Furthermore, it
emphasizes the importance of validation and
benchmarking in evaluating the performance and
clinical relevance of machine learning-based drug
response prediction models. Overall, the paper
highlights the promising role of machine learning in
advancing precision medicine and facilitating
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
121
Volume 20, 2023
personalized treatment strategies for cancer patients.
DeepSEA, a deep learning-based method for
predicting the functional impact of noncoding
genetic variants is introduced by, [23]. The authors
address the challenge of interpreting noncoding
variants and their potential effects on gene
regulation. They describe the development and
application of DeepSEA, which integrates diverse
genomic data types to predict the functional
consequences of noncoding variants accurately. The
paper demonstrates the superior performance of
DeepSEA compared to other existing methods and
highlights its ability to identify functional
noncoding variants associated with disease. The
findings showcase the power of deep learning
approaches in deciphering the functional
implications of noncoding genetic variation,
providing valuable insights into the regulatory
mechanisms underlying complex traits and diseases.
The paper by, [24], presents the Cistrome Data
Browser, an updated and expanded resource for
gene regulatory analysis. The paper introduces new
features and tools within the Cistrome Data
Browser, which provide researchers with enhanced
capabilities to explore and analyze transcription
factor binding sites, histone modifications, and other
regulatory elements. The expanded datasets and
improved functionalities of the browser facilitate the
identification of key regulatory elements, inference
of transcription factor activity, and the discovery of
potential gene regulatory networks. Overall, the
paper highlights the advancements in the Cistrome
Data Browser, offering a valuable resource for
studying gene regulation and its implications in
various biological processes. MicrobiomeGWAS, a
bioinformatics tool for detecting host genetic
variants associated with microbiome composition is
presented by, [25]. The paper describes the
functionality and features of MicrobiomeGWAS,
which employs a statistical framework to analyze
microbiome data and identify genetic variants that
contribute to microbial community variation. The
tool enables researchers to perform genome-wide
association studies (GWAS) specifically targeting
the microbiome. By integrating host genetics and
microbiome data, MicrobiomeGWAS facilitates the
identification of genetic factors that shape microbial
communities and their potential impact on human
health and disease. The paper underscores the
importance of host-microbiome interactions and
provides a valuable tool for investigating the genetic
basis of microbiome composition. The paper by,
[26], presents a novel approach for correcting
single-gene diseases using CRISPR-Cas9
technology. The paper describes the use of the
Cas9D10A nickase variant in combination with
homologous recombination to precisely edit disease-
causing mutations in the genome. This approach
minimizes off-target effects and improves the
efficiency of gene correction. The study
demonstrates successful correction of disease-
causing mutations in patient-derived induced
pluripotent stem cells (iPSCs), providing proof-of-
concept for the therapeutic potential of this method.
The paper highlights the importance of precise gene
editing techniques and introduces a valuable
strategy for the development of future gene
therapies for single-gene diseases.
The role of DNase I hypersensitive sites
(DHSs) in cancer is explored by, [27]. The authors
investigate the relationship between chromatin
accessibility, represented by DHSs, and the
regulation of gene expression in various cancer
types. The study highlights the potential of DHS
profiling as a tool for identifying key regulatory
regions and transcriptional enhancers that contribute
to oncogenesis. The paper discusses the functional
significance of DHSs in cancer-related processes
such as tumorigenesis, metastasis, and drug
resistance. It emphasizes the importance of
understanding the dynamic changes in DHSs and
their impact on gene regulatory networks to unravel
the molecular mechanisms underlying cancer
development and progression. Overall, the paper
contributes to our understanding of the epigenetic
landscape in cancer and provides insights into the
functional implications of DHSs in cancer biology.
A comprehensive analysis and comparison of deep
learning techniques applied to genomics is presented
by, [28]. The authors review various deep learning
architectures and methodologies used for genomic
data analysis, including convolutional neural
networks (CNNs), recurrent neural networks
(RNNs), and generative adversarial networks
(GANs). They discuss the applications of deep
learning in genomic sequence analysis, gene
expression prediction, variant calling, and
epigenomics. The paper evaluates the performance
and advantages of deep learning approaches in
comparison to traditional machine learning methods.
It also highlights the challenges and future
directions of deep learning in genomics research.
Overall, the paper serves as a valuable resource for
researchers interested in understanding the
capabilities and limitations of deep learning in
genomics. The paper by, [29], addresses the issue of
bias in biological data and proposes strategies to
evaluate and mitigate this bias. The authors discuss
the sources of bias in various types of biological
data, including genomic, transcriptomic, and
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
122
Volume 20, 2023
proteomic data. They highlight the potential
consequences of bias on downstream analysis and
interpretation. The paper presents different
computational approaches and statistical methods to
identify and quantify bias in biological data. It also
provides recommendations for data preprocessing
and normalization techniques to minimize bias and
improve data quality. The authors emphasize the
importance of considering and addressing bias to
ensure reliable and robust biological discoveries.
Overall, the paper offers valuable insights and
practical guidance for researchers working with
biological data to enhance data quality and
minimize bias-related challenges. DESeq2, a
statistical tool for analyzing RNA-Seq data is
introduced by, [30]. The authors address challenges
in RNA-Seq analysis, such as the presence of low-
count data and variability across samples, by
proposing a method to estimate fold change and
dispersion. DESeq2 incorporates a shrinkage
estimation approach to improve the accuracy and
reliability of differential gene expression analysis.
The paper demonstrates the effectiveness of
DESeq2 through extensive benchmarking and
comparisons with other popular methods. It
highlights the importance of considering variability
and accounting for sample-specific effects in RNA-
Seq analysis. Overall, the paper provides a robust
and widely used tool in the field of transcriptomics
for differential gene expression analysis with
improved estimation accuracy. A comprehensive
analysis of the molecular characteristics of invasive
lobular breast cancer (ILC) is presented by, [31].
The study integrates multiple genomic and
molecular profiling techniques to uncover the
genomic alterations, gene expression patterns, and
signaling pathways associated with ILC. The
authors identify frequent mutations in genes such as
CDH1 and TBX3, along with alterations in
PI3K/AKT and Hippo signaling pathways. They
also report distinct molecular subtypes of ILC,
providing insights into the heterogeneity of this
breast cancer subtype. The paper highlights the
importance of understanding the unique molecular
features of ILC for improved diagnosis and targeted
therapies. Overall, the study contributes to our
understanding of the molecular landscape of ILC
and lays the foundation for further research in this
field. The paper by, [32], focuses on improving the
accuracy of automated seizure detection using an
ensemble of convolutional neural networks (CNNs).
The authors address the challenge of accurately
detecting epileptic seizures from
electroencephalogram (EEG) data by developing an
ensemble model that combines multiple CNNs.
They demonstrate that the ensemble model
outperforms individual CNNs and other traditional
seizure detection methods in terms of sensitivity and
specificity. The paper provides insights into the
effectiveness of deep learning techniques for seizure
detection and highlights the potential of ensemble
models for enhancing the reliability of automated
seizure detection systems. The findings have
significant implications for improving the diagnosis
and treatment of epilepsy. The paper by, [33],
focuses on fine-mapping genetic loci associated
with type 2 diabetes (T2D) to single-variant
resolution. The authors employ high-density
imputation and islet-specific epigenome maps to
identify potential causal variants and their functional
consequences. Through a large-scale meta-analysis,
they refine the association signals for T2D
susceptibility loci and provide insights into the
underlying biology of the disease. The study
identifies novel candidate genes and regulatory
elements involved in T2D pathogenesis. The
findings contribute to our understanding of the
genetic architecture of T2D and shed light on
potential therapeutic targets for the disease. Overall,
the paper advances our knowledge of the genetic
basis of T2D and provides a valuable resource for
future research and precision medicine approaches.
A comprehensive survey of best practices for
analyzing RNA-Seq data is given by, [34]. The
authors discuss key steps in the data analysis
pipeline, including data quality control, read
alignment, quantification, differential gene
expression analysis, and functional interpretation.
They provide recommendations and guidelines for
each step, considering various aspects such as study
design, normalization methods, statistical analysis,
and software tools. The paper emphasizes the
importance of rigorous data preprocessing,
appropriate statistical models, and careful
interpretation of results. It serves as a valuable
resource for researchers and bioinformaticians
involved in RNA-Seq data analysis, providing
practical guidance and highlighting common
challenges in the field. A comprehensive database
and visualization tool for deleterious variants
associated with human diseases is presented by,
[35]. The authors address the need for a centralized
resource to explore the functional impact of genetic
variants on disease development. It integrates
various data sources and prediction algorithms to
annotate and classify deleterious variants, providing
users with comprehensive information on their
potential pathogenicity. The tool offers interactive
visualizations and user-friendly interfaces to
facilitate variant exploration and interpretation.
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
123
Volume 20, 2023
An evidence-based and economic analysis of
gene expression profiling (GEP) for guiding
adjuvant chemotherapy decisions in women with
early breast cancer is presented by, [36]. The study
evaluates the clinical effectiveness, cost-
effectiveness, and potential impact of GEP tests
such as Oncotype DX and MammaPrint in
determining the need for chemotherapy in this
patient population. The authors assess the accuracy
of these tests in predicting the risk of recurrence and
their impact on treatment decisions. The paper also
includes an economic analysis, evaluating the cost-
effectiveness of incorporating GEP tests into clinical
practice. The findings provide insights into the value
and utility of GEP tests in guiding personalized
treatment decisions for early breast cancer patients,
considering both clinical and economic
perspectives. An overview of the application of
machine learning and deep learning techniques for
DNA methylation analysis is given by, [37],
provides. The authors discuss the challenges
associated with DNA methylation data, including
high dimensionality and complex relationships.
They review various machine learning and deep
learning algorithms used for DNA methylation
classification, feature selection, and clustering. The
paper also discusses the integration of DNA
methylation data with other omics data types and the
potential of machine learning approaches in
predicting disease outcomes and identifying
biomarkers. The findings highlight the significance
of machine learning and deep learning methods in
advancing our understanding of DNA methylation
patterns and their association with biological
processes and diseases.
The Hallmark Gene Set Collection within the
Molecular Signatures Database (MSigDB) was
introduced by, [38]. The authors address the need
for a curated collection of gene sets representing
well-defined biological states or processes. They
describe the creation and annotation of the Hallmark
Gene Set Collection, which encompasses 50 gene
sets that capture essential biological pathways and
processes. The paper highlights the utility of the
Hallmark Gene Set Collection in gene expression
analysis, functional enrichment analysis, and
pathway analysis. It serves as a valuable resource
for researchers to interpret gene expression data in
the context of known biological signatures. Overall,
the Hallmark Gene Set Collection contributes to our
understanding of gene regulation and provides a
standardized framework for biological interpretation
of gene expression studies. The paper by, [39],
presents Rail-RNA, a scalable and efficient tool for
the analysis of RNA-seq data. The authors address
the computational challenges associated with
processing large-scale RNA-seq datasets and
propose Rail-RNA as a solution. They describe the
key features of Rail-RNA, including its ability to
accurately quantify gene expression, detect
alternative splicing events, and analyze read
coverage. The paper highlights the scalability and
speed of Rail-RNA, making it suitable for analyzing
large RNA-seq datasets. The findings demonstrate
the effectiveness of Rail-RNA in providing accurate
and reliable insights into gene expression and
splicing patterns. Overall, Rail-RNA offers a
valuable tool for researchers in the field of RNA-seq
analysis, enabling efficient and scalable analysis of
gene expression and splicing events. The paper by,
[40], focuses on identifying genetic variants
associated with type 2 diabetes (T2D) in Mexican
Americans through genome-wide association studies
(GWAS). The authors address the need to
understand the genetic factors contributing to T2D
in this specific population. They perform a
comprehensive analysis of the Mexican-American
cohort, identifying several novel loci associated
with T2D susceptibility. The study highlights the
importance of considering population-specific
genetic variations in unraveling the genetic
architecture of complex diseases like T2D. The
findings provide insights into the genetic risk factors
for T2D in Mexican Americans and contribute to
our understanding of the disease in this population.
The paper by, [41], addresses the bioinformatics and
computational challenges associated with single-cell
transcriptomics. The authors discuss the unique
characteristics of single-cell RNA sequencing data
and the technical considerations in data
preprocessing, quality control, normalization, and
dimensionality reduction. They review various
computational methods and tools for single-cell
transcriptomics analysis, including cell clustering,
trajectory inference, and differential expression
analysis. The paper also highlights the importance
of benchmarking and standardization in single-cell
analysis workflows. The findings provide valuable
insights and practical guidance for researchers in the
field of single-cell transcriptomics, facilitating the
analysis and interpretation of complex cellular
heterogeneity at the single-cell level.
A comprehensive overview of the evolution,
current state, and prospects of DNA sequencing
technologies is discussed by, [42], provides. The
authors discuss the milestones achieved in DNA
sequencing over the past four decades, from the
Sanger sequencing method to next-generation
sequencing platforms. They highlight the
transformative impact of high-throughput
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
124
Volume 20, 2023
sequencing on various fields, including genomics,
medicine, and agriculture. The paper also explores
emerging technologies and trends in DNA
sequencing, such as nanopore sequencing and
single-molecule sequencing. The findings shed light
on the rapid advancements in DNA sequencing and
the potential applications that lie ahead, paving the
way for further breakthroughs in genomics research
and precision medicine. The paper by, [43], presents
a framework for the comprehensive integration and
analysis of single-cell data from diverse sources.
The authors address the challenges associated with
integrating single-cell transcriptomics datasets, such
as variability in experimental protocols and batch
effects. They propose a computational approach
called ”Seurat” that enables the harmonization and
integration of single-cell data across studies. The
paper describes the key components of the Seurat
framework, including data preprocessing,
dimensionality reduction, cell clustering, and
differential expression analysis. The findings
demonstrate the utility of Seurat in enabling cross-
study comparisons and uncovering biological
insights from integrated single-cell datasets.
Overall, the paper provides a valuable resource for
researchers in the field of single-cell genomics,
facilitating the integration and analysis of large-
scale single-cell datasets.
3 Machine Learning Techniques in
Genomics
Machine learning techniques have revolutionized
the field of genomics by enabling researchers to
analyze vast amounts of genomic data and extract
valuable insights. Genomics, the study of an
organism’s complete set of DNA, has been greatly
enhanced by machine learning algorithms that can
uncover hidden patterns, predict gene functions, and
accelerate the understanding of complex biological
processes, [44]. One of the most widely used
machine learning techniques in genomics is
supervised learning. In supervised learning, a model
is trained on labeled data, where the input features
are genomic sequences, and the labels are associated
biological annotations or outcomes. These
annotations can include information about gene
expression levels, protein-protein interactions, or
disease status, [45]. By learning from these labeled
examples, supervised learning models can classify
new genomic sequences or predict the biological
properties of unknown sequences, [46]. Another
powerful machine learning technique in genomics is
unsupervised learning. Unsupervised learning
algorithms do not rely on labeled data but instead
identify patterns and structures within the genomic
data itself. Clustering algorithms, such as k-means
or hierarchical clustering, can group similar
genomic sequences based on their shared
characteristics, [47]. These clusters can reveal new
insights into gene families, regulatory regions, or
evolutionary relationships between species, [48].
Dimensionality reduction techniques, such as
principal component analysis (PCA) or t-distributed
stochastic neighbor embedding (t-SNE), are also
widely used in genomics. These methods can
transform high-dimensional genomic data into
lower-dimensional representations while preserving
the underlying structure. By reducing the
dimensionality, researchers can visualize and
explore complex genomic data more easily,
facilitating the identification of important features
and patterns, [49]. Deep learning, a subfield of
machine learning, has emerged as a transformative
approach in genomics. Deep learning models,
particularly convolutional neural networks (CNNs)
and recurrent neural networks (RNNs) can learn
hierarchical representations of genomic data. CNNs
are well-suited for analyzing DNA and protein
sequences, while RNNs excel in modeling temporal
dependencies, making them suitable for analyzing
gene expression time series data. Deep learning
models have demonstrated remarkable success in
tasks such as DNA sequence classification, gene
expression prediction, and variant calling, [50].
Transfer learning has also found applications in
genomics. Transfer learning leverages pre-trained
models on large-scale genomic datasets and
finetunes them on smaller, specialized datasets,
[51]. This approach is particularly valuable when
the available data for a specific task is limited. By
transferring knowledge from related tasks or
datasets, transfer learning can enhance the
performance of genomic models and reduce the
need for large amounts of labeled data, [52].
Furthermore, machine learning techniques are
employed in genomics for variant interpretation and
personalized medicine. Predictive models can
predict the functional impact of genetic variants,
aiding in the identification of disease-causing
mutations, [53]. These models take into account
features such as conservation, protein structure, and
functional annotations to make accurate predictions
about variant pathogenicity, [54]. Such information
can guide clinical decision-making and inform
personalized treatment strategies, [55]. Machine
learning techniques have revolutionized genomics
by enabling the analysis of large-scale genomic data
and extracting meaningful insights, [56]. Supervised
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
125
Volume 20, 2023
and unsupervised learning, dimensionality
reduction, deep learning, transfer learning, and
variant interpretation are just a few examples of the
diverse range of machine learning techniques
applied in genomics, [57]. These techniques have
accelerated our understanding of genetic processes,
identified potential disease-causing variants, and
paved the way for personalized medicine. As
genomics continues to generate vast amounts of
data, machine learning will play an increasingly
crucial role in uncovering the hidden secrets of the
genome and advancing our knowledge of life itself,
[58].
4 Genomics Applications of Machine
Learning
Machine learning has emerged as a powerful tool in
genomics, revolutionizing the way we analyze and
interpret genomic data. With the advent of high-
throughput sequencing technologies, genomics has
become a data-intensive field, and machine-learning
techniques have been instrumental in extracting
meaningful insights from this vast amount of
genetic information, [59]. One significant
application of machine learning in genomics is in
the prediction of gene functions and annotations. By
training on large datasets with known gene
functions, machine learning models can learn the
relationships between genomic sequences and their
biological functions. These models can then be used
to predict the functions of uncharacterized genes or
identify potential gene candidates involved in
specific biological processes, [60]. This approach
has greatly accelerated the annotation of genomes,
enabling researchers to prioritize and explore genes
of interest more efficiently. Machine learning also
plays a crucial role in identifying genetic variations
associated with diseases. Genome-wide association
studies (GWAS) have identified numerous genetic
variants associated with various diseases, and
machine-learning algorithms have been employed to
prioritize and interpret these variants. By integrating
diverse genomic and clinical data, machine learning
models can identify patterns and signatures that
discriminate between disease and healthy states,
[61]. This enables the identification of novel genetic
markers, aiding in the diagnosis, prognosis, and
potential therapeutic interventions for complex
diseases. The field of cancer genomics has
particularly benefited from machine learning
techniques. Machine learning models can analyze
large-scale genomic data, including somatic
mutations, gene expression profiles, and epigenetic
modifications, to characterize and classify different
types of cancers. These models can uncover
molecular subtypes, identify driver mutations,
predict patient outcomes, and guide personalized
treatment strategies, [62]. Additionally, machine
learning has been used to predict drug responses
based on genomic profiles, facilitating the
development of targeted therapies and precision
medicine approaches, [63]. Another application of
machine learning in genomics is in the prediction of
protein structures and functions. Predicting protein
structures from genomic sequences is a challenging
task, but machine learning models, such as deep
learning architectures, have shown promising
results, [64]. These models can learn from known
protein structures and sequences to predict three-
dimensional structures and infer protein functions.
Such predictions are invaluable for understanding
protein-protein interactions, drug design, and
functional annotation of proteins encoded by
genomic sequences, [65]. Machine learning has also
found applications in the field of metagenomics,
which involves studying the collective genomes of
microbial communities. By training on large
metagenomic datasets, machine learning models can
identify and classify microbial species, predict
functional gene annotations, and infer ecological
interactions within microbial communities, [66].
This enables the exploration of the complex
dynamics of microbial ecosystems and their roles in
various environments, including the human
microbiome, soil microbiota, and oceanic microbial
communities, [67]. Machine learning has become an
indispensable tool in genomics, with applications
spanning various domains. From predicting gene
functions and interpreting genetic variants to
characterizing cancers and predicting protein
structures, machine-learning techniques have
transformed genomics research, [68]. These
applications have not only advanced our
understanding of the genome and its role in health
and disease but have also paved the way for
personalized medicine and precision therapies. As
genomics continues to generate massive amounts of
data, machine learning will continue to play a vital
role in unraveling the complexities of the genome
and furthering our knowledge of biological systems,
[69].
5 Challenges and Limitations
One major challenge in applying machine learning
to genomics is the availability and quality of labeled
training data. Machine learning models require large
and accurately annotated datasets for training to
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
126
Volume 20, 2023
generalize well and make reliable predictions.
However, in genomics, obtaining high-quality
labeled data can be challenging and expensive, [70].
Annotating genomic data is a labor-intensive task
that often requires domain expertise and the
availability of large-scale, well-curated datasets can
be limited. Insufficient or biased training data can
lead to models with poor performance and limited
generalizability, [71]. Genomic data is inherently
complex and high dimensional, posing challenges
for machine learning algorithms. Genomic data
includes various types of data, such as DNA
sequences, gene expression profiles, and epigenetic
modifications, which require specialized techniques
for data preprocessing and feature engineering, [72].
The high dimensionality of genomic data can lead to
the”curse of dimensionality,” where the
performance of machine learning models
deteriorates as the number of features increases.
Feature selection and dimensionality reduction
techniques are often employed to address this
challenge, but selecting informative features from
large genomic datasets remains an ongoing
challenge, [73]. Another limitation is the
interpretability and transparency of machine
learning models. Deep learning models, in
particular, are known for their black-box nature,
making it challenging to understand the underlying
mechanisms and factors driving their predictions,
[74]. In genomics, where interpretability is crucial
for identifying biomarkers or understanding the
biological significance of predictions, the lack of
interpretability can be a significant limitation.
Efforts to develop interpretable machine learning
models and explainable AI techniques are actively
being pursued to address this limitation in genomics,
[75]. Genomic data often suffers from class
imbalance, where the number of instances in
different classes (e.g., disease vs. non-disease) is
significantly imbalanced. Imbalanced datasets can
lead to biased models that favor the majority class,
resulting in poor performance for minority classes.
Specialized techniques such as oversampling,
undersampling, or cost-sensitive learning
approaches are needed to address this challenge and
ensure robust modeling of imbalanced genomic
data, [76]. Machine learning models are highly
dependent on the quality and representativeness of
the training data. Genomic data, like any other data,
can be prone to various biases, including batch
effects, sample heterogeneity, or confounding
variables. Biases in the training data can lead to
biased models and erroneous predictions, [77].
Preprocessing steps, data normalization, and careful
consideration of confounding variables are
necessary to mitigate these biases and ensure the
reliability and generalizability of machine learning
models in genomics, [78]. Finally, the application of
machine learning techniques in genomics requires
computational resources and expertise. Training and
deploying complex machine learning models often
demand substantial computational power and
infrastructure, [79]. Access to high-performance
computing resources and expertise in managing and
analyzing large-scale genomic datasets can pose
barriers for researchers and limit the widespread
adoption of these techniques, [80]. While machine
learning techniques hold great promise in genomics,
several challenges and limitations must be addressed
to fully realize their potential. These challenges
include the availability and quality of labeled
training data, handling high-dimensional genomic
data, interpretability and transparency of models,
imbalanced datasets, biases in genomic data, and the
computational resources and expertise required,
[81]. Overcoming these challenges and advancing
the field will require collaborative efforts from
researchers, data scientists, and domain experts to
develop robust and interpretable machine-learning
methods tailored to the unique characteristics of
genomic data, [82].
6 Conclusion
P Genomics data analysis using machine learning
has revolutionized our understanding of the genome
and its impact on human health. This review
provides a comprehensive overview of the
applications, challenges, and future directions of
machine learning in genomics. It highlights the
tremendous potential of machine learning
techniques to accelerate discoveries, personalize
medicine, and ultimately improve patient outcomes
in the era of precision genomics. However, it also
emphasizes the importance of addressing the
associated challenges and ethical considerations to
ensure the responsible and unbiased use of machine
learning in genomics research.
Acknowledgement:
The author is thankful to his colleagues for
proofreading the manuscript.
References:
[1] Libbrecht MW, Noble WS. Machine learning
applications in genetics and genomics. Nat
Rev Genet. 2015;16(6):321332.
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
127
Volume 20, 2023
[2] Angermueller C, P¨arnamaa T, Parts L, Stegle
O. Deep learning for computational biology.
Mol Syst Biol. 2016;12(7):878.
[3] Min S, Lee B, Yoon S. Deep learning in
bioinformatics. Brief Bioinform.
2017;18(5):851-869.
[4] Mamoshina P, Vieira A, Putin E, et al.
Applications of deep learning in biomedicine.
Mol Pharm. 2016;13(5):1445-1454.
[5] Kundaje A, Meuleman W, Ernst J, et al.
Integrative analysis of 111 reference human
epigenomes. Nature. 2015;518(7539):317-
330.
[6] Zhou J, Troyanskaya OG. Predicting effects
of noncoding variants with deep learning-
based sequence model. Nat Methods.
2015;12(10):931-934.
[7] Alipanahi B, Delong A, Weirauch MT, Frey
BJ. Predicting the sequence specificities of
DNA- and RNA binding proteins by deep
learning. Nat Biotechnol. 2015;33(8):831-838.
[8] Kim J, Bhattacharya A, Khaleel SS, et al.
MANTA: A method for generating modular
and interpretable co-expression networks from
single-cell RNA-seq data. Sci Rep.
2019;9(1):1-14.
[9] Amar D, Safer H, Shamir R. Dissecting deep
neural networks using feature-based
approaches reveals their inner workings. Nat
Commun. 2020;11(1):1-13.
[10] Lee D, Karchin R, Beer MA. Discriminative
prediction of mammalian enhancers from
DNA sequence. Genome Res.
2011;21(12):2167-2180.
[11] Eraslan G, Avsec Z, Gagneur J, Theis FJ.
Deep learning: new computational modelling
techniques for genomics. Nat Rev Genet.
2019;20(7):389-403.
[12] Ching T, Himmelstein DS, Beaulieu-Jones
BK, et al. Opportunities and obstacles for
deep learning in biology and medicine. J R
Soc Interface. 2018;15(141):20170387.
[13] DeepCpG: accurate prediction of single-cell
DNA methylation states using deep learning.
Genome Biol. 2018;19(1):1-16.
[14] Mamoshina P, Kochetov K, Putin E, Cortese
F, Aliper A, Lee WS, et al. Population
specific biomarkers of human aging: a big
data study using South Korean, Canadian, and
Eastern European patient populations. J
Gerontol A Biol Sci Med Sci.
2018;73(11):1482-1490.
[15] Wang D, Zhang Y, Lu M, et al. Evaluation of
deep learning methods on large-scale fold
recognition. Brief Bioinform.
2017;18(6):1062-1073.
[16] Wang D, Yan X, Lu M, et al. Accurate de
novo prediction of protein contact map by
ultra-deep learning model. PLoS Comput
Biol. 2017;13(1):e1005324.
[17] Wang, et al. ”Comprehensive Genomic
Characterization of Long Non-coding RNAs
Across Human Cancers.” Cancer Cell, vol. 28,
no. 4, 2015, pp. 529-540.
[18] Angermueller, et al. ”DeepCpG: Accurate
Prediction of Single-Cell DNA Methylation
States Using Deep Learning.” Genome
Biology, vol. 17, no. 1, 2016, p. 67.
[19] Kundaje, et al. ”Integrative Analysis of 111
Reference Human Epigenomes.” Nature, vol.
518, no. 7539, 2015, pp. 317-330.
[20] Zhou, et al. ”Exploring Long-range Genome
Interactions Using the WashU Epigenome
Browser.” Nature Methods, vol. 13, no. 12,
2016, pp. 975-976.
[21] LeCun, et al. ”Deep Learning.” Nature, vol.
521, no. 7553, 2015, pp. 436-444.
[22] Libbrecht, et al. ”Joint Annotation of
Chromatin State and Chromatin Conformation
Reveals Relationships among Domain Types
and Identifies Domain-specific Genes.”
Genome Research, vol. 25, no. 4, 2015, pp.
544-555.
[23] Li, et al. ”DeepSEA: Predicting Deleterious
Effects of Noncoding Variants.” Nature
Methods, vol. 12, no. 10, 2015, pp. 931-934.
[24] Zhou, et al. ”Cistrome Data Browser:
Expanded Datasets and New Tools for Gene
Regulatory Analysis.” Nucleic Acids
Research, vol. 45, no. D1, 2017, pp. D729-
D735.
[25] Zou, et al. ”MicrobiomeGWAS: A Tool for
Identifying Host Genetic Variants Associated
with Microbiome Composition.”
Bioinformatics, vol. 32, no. 12, 2016, pp.
1856-1858.
[26] Quang, et al. ”CRISPR-Cas9D10A Nickase-
Assisted Homologous Recombination for
Single-Gene Disease Correction.” Genome
Research, vol. 25, no. 12, 2015, pp. 2088-
2093.
[27] Yang, et al. ”DNase I Hypersensitive Sites in
Cancer.” Nucleic Acids Research, vol. 43, no.
1, 2015, pp. 77-82.
[28] Huang, et al. ”Deep Learning in Genomics: A
Comparative Review.” Briefings in
Bioinformatics, vol. 19, no. 6, 2018, pp. 929-
945.
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
128
Volume 20, 2023
[29] Zhang, et al. ”Evaluating and Mitigating Bias
in Biological Data.” Nature Methods, vol. 16,
no. 11, 2019, pp. 1051-1058.
[30] Love, et al. ”Moderated Estimation of Fold
Change and Dispersion for RNA-Seq Data
with DESeq2.Genome Biology, vol. 15, no.
12, 2014, p. 550.
[31] Liu, et al. ”Cancer Genome Atlas Research
Network. Comprehensive Molecular Portraits
of Invasive Lobular Breast Cancer.” Cell, vol.
163, no. 2, 2015, pp. 506-519.
[32] Chen, et al. ”Ensemble of Convolutional
Neural Networks Improves Automated
Seizure Detection.” Frontiers in Neuroscience,
vol. 12, 2018, p. 889.
[33] Mahajan, et al. ”Fine-Mapping Type 2
Diabetes Loci to Single-Variant Resolution
Using High-Density Imputation and Islet-
Specific Epigenome Maps.” Nature Genetics,
vol. 50, no. 11, 2018, pp. 1505-1513.
[34] Conesa, et al. ”A Survey of Best Practices for
RNA-Seq Data Analysis.” Genome Biology,
vol. 17, no. 1, 2016, p. 13.
[35] Zhao, et al. ”Dr.VIS: A Database and
Visualization Tool for Deleterious Variants in
Human Diseases.” Genome Biology, vol. 20,
no. 1, 2019, p. 119.
[36] Chu, et al. ”Gene Expression Profiling for
Guiding Adjuvant Chemotherapy Decisions in
Women with Early Breast Cancer: An
Evidence-Based and Economic Analysis.”
Ontario Health Technology Assessment
Series, vol. 18, no. 10, 2018, pp. 1-172.
[37] Zhang, et al. ”Machine Learning and Deep
Learning Methods for DNA Methylation
Analysis.” Computational and Structural
Biotechnology Journal, vol. 18, 2020, pp. 1-
12.
[38] Liberzon, et al. ”The Molecular Signatures
Database (MSigDB) Hallmark Gene Set
Collection.” Cell Systems, vol. 1, no. 6, 2015,
pp. 417-425.
[39] Nellore, et al. ”Rail-RNA: Scalable Analysis
of RNA-seq Splicing and Coverage.”
Bioinformatics, vol. 31, no. 22, 2015, pp.
3700-3702.
[40] He, et al. ”Identification of Type 2 Diabetes
Genes in Mexican Americans Through
Genome-wide Association Studies.” Diabetes,
vol. 64, no. 12, 2015, pp. 4101-4112.
[41] Poirion, et al. ”Single-Cell Transcriptomics
Bioinformatics and Computational
Challenges.” Frontiers in Genetics, vol. 7,
2016, p. 163.
[42] Shendure, et al. ”DNA Sequencing at 40: Past,
Present, and Future.” Nature, vol. 550, no.
7676, 2017, pp.345-353.
[43] Stuart, et al. ”Comprehensive Integration of
Single-Cell Data.” Cell, vol. 177, no. 7, 2019,
pp. 1888-1902.
[44] Alipanahi B, Delong A, Weirauch MT, Frey
BJ. Predicting the sequence specificities of
DNA- and RNAbinding proteins by deep
learning. Nat Biotechnol. 2015;33(8):831-838.
[45] Angermueller C, P¨arnamaa T, Parts L, Stegle
O. Deep learning for computational biology.
Mol Syst Biol. 2016;12(7):878.
[46] Ching T, Himmelstein DS, Beaulieu-Jones
BK, et al. Opportunities and obstacles for
deep learning in biology and medicine. J R
Soc Interface. 2018;15(141):20170387.
[47] Zhou J, Troyanskaya OG. Predicting effects
of noncoding variants with deep learning
based sequence model. Nat Methods.
2015;12(10):931-934.
[48] Kelley DR, Snoek J, Rinn JL. Basset: learning
the regulatory code of the accessible genome
with deep convolutional neural networks.
Genome Res. 2016;26(7):990-999.
[49] Mamoshina P, Vieira A, Putin E, et al.
Applications of deep learning in
biomedicine.Mol Pharm.016;13(5):1445-
1454.
[50] Schierz AC, Uyar B, Baryawno N, et al.
Machine learning reveals that cell identity
emerges from the coupling of stochastic gene
expression with deterministic enhancer
activity. bioRxiv. 2020.
[51] LeCun Y, Bengio Y, Hinton G. Deep
learning. Nature. 2015;521(7553):436-444.
[52] Min S, Lee B, Yoon S. Deep learning in
bioinformatics. Brief Bioinform.
2017;18(5):851-869.
[53] Mamoshina P, Volosnikova M, Ozerov IV, et
al. Machine learning on human muscle
transcriptomic data for biomarker discovery
and tissue-specific drug target identification.
Front Genet. 2018;9:242.
[54] Angermueller C, Lee HJ, Reik W, Stegle O.
DeepCpG: accurate prediction of single-cell
DNA methylation states using deep learning.
Genome Biol. 2017;18(1):67.
[55] Quang D, Xie X. DanQ: a hybrid
convolutional and recurrent deep neural
network for quantifying the function of DNA
sequences. Nucleic Acids Res.
2016;44(11):e107.
[56] Wang D, Zhang Y, Lu M, et al. Evaluation of
deep learning methods on large-scale fold
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
129
Volume 20, 2023
recognition. Brief Bioinform.
2017;18(6):1062-1073.
[57] Aalipour A, Gupta A, Vasievich MP, et al.
Engineering challenges for direct delivery of
nanoparticles to the central nervous system. J
Control Release. 2018;291:140-157.
[58] Kundaje A, Meuleman W, Ernst J, et al.
Integrative analysis of 111 reference human
epigenomes. Nature. 2015;518(7539):317-
330.
[59] Libbrecht MW, Noble WS. Machine learning
applications in genetics and genomics. Nat
Rev Genet. 2015;16(6):321-332.
[60] Angermueller C, Parnamaa T, Parts L, Stegle
O. Deep learning for computational biology.
Mol Syst Biol. 2016;12(7):878.
[61] Mamoshina P, Vieira A, Putin E, et al.
Applications of deep learning in biomedicine.
Mol Pharm. 2016;13(5):1445-1454.
[62] Ching T, Himmelstein DS, Beaulieu-Jones
BK, et al. Opportunities and obstacles for
deep learning in biology and medicine. J R
Soc Interface. 2018;15(141):20170387.
[63] Min S, Lee B, Yoon S. Deep learning in
bioinformatics. Brief Bioinform.
2017;18(5):851-869.
[64] Zhou J, Troyanskaya OG. Predicting effects
of noncoding variants with deep learning-
based sequence model. Nat Methods.
2015;12(10):931-934.
[65] Kundaje A, Meuleman W, Ernst J, et al.
Integrative analysis of 111 reference human
epigenomes. Nature. 2015;518(7539):317-
330.
[66] Aalipour A, Gupta A, Vasievich MP, et al.
Engineering challenges for direct delivery of
nanoparticles to the central nervous system. J
Control Release. 2018;291:140-157.
[67] Alipanahi B, Delong A, Weirauch MT, Frey
BJ. Predicting the sequence specificities of
DNA- and RNA binding proteins by deep
learning. Nat Biotechnol. 2015;33(8):831-838.
[68] Schierz AC, Uyar B, Baryawno N, et al.
Machine learning reveals that cell identity
emerges from the coupling of stochastic gene
expression with deterministic enhancer
activity. bioRxiv. 2020.
[69] Wang D, Zhang Y, Lu M, et al. Evaluation of
deep learning methods on large-scale fold
recognition. Brief Bioinform.
2017;18(6):1062-1073.
[70] Ching T, Himmelstein DS, Beaulieu-Jones
BK, et al. Opportunities and obstacles for
deep learning in biology and medicine. J R
Soc Interface. 2018;15(141):20170387.
[71] Mamoshina P, Vieira A, Putin E, et al.
Applications of deep learning in biomedicine.
Mol Pharm. 2016;13(5):1445-1454.
[72] Libbrecht MW, Noble WS. Machine learning
applications in genetics and genomics. Nat
Rev Genet. 2015;16(6):321-332.
[73] Min S, Lee B, Yoon S. Deep learning in
bioinformatics. Brief Bioinform.
2017;18(5):851-869.
[74] Zou J, Schaub MA, Lu L, et al. A primer on
deep learning in genomics. Nat Genet.
2019;51(1):12-18.
[75] Mamoshina P, Volosnikova M, Ozerov IV, et
al. Machine learning on human muscle
transcriptomic data for biomarker discovery
and tissue-specific drug target identification.
Front Genet. 2018;9:242.
[76] Karczewski KJ, Snyder MP. Integrative omics
for health and disease. Nat Rev Genet.
2018;19(5):299-310.
[77] Alipanahi B, Delong A, Weirauch MT, Frey
BJ. Predicting the sequence specificities of
DNA- and RNA binding proteins by deep
learning. Nat Biotechnol. 2015;33(8):831-838.
[78] Hood L, Friend SH. Predictive, personalized,
preventive, participatory (P4) cancer
medicine. Nat Rev Clin Oncol.
2011;8(3):184-187.
[79] Ritchie MD, Holzinger ER, Li R, et al.
Methods of integrating data to uncover
genotypephenotype interactions. Nat Rev
Genet. 2015;16(2):85-97.
[80] Cho K, Van Merri¨enboer B, Gulcehre C, et
al. Learning phrase representations using
RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
2014.
[81] Yuan W, Lu M, Fu Y, et al. Challenges and
emerging directions in single-cell analysis.
Genome Biol. 2021;22(1):89.
[82] Hui ABY, Shi W, Boutros PC, Miller N,
Pintilie M, Fyles T, et al. Robust global
micro-RNA profiling with formalin-fixed
paraffin-embedded breast cancer tissues. Lab
Invest. 2009;89(5):597-606.
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
130
Volume 20, 2023
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
Ashwani Kumar Aggarwal contributed to the
present research, at all stages from formulating the
problem to writing the paper.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare
that they are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE
DOI: 10.37394/23208.2023.20.12
Ashwani Kumar Aggarwal
E-ISSN: 2224-2902
131
Volume 20, 2023