A Review on Genomics Data Analysis using Machine Learning

ASHWANI KUMAR AGGARWAL

Department of Electrical and Instrumentation Engineering,

Sant Longowal Institute of Engineering and Technology, Longowal,

SLIET, Longowal - 148106,

INDIA

Abstract: - The advancements in genomics research have led to an exponential growth in the amount of data

generated from various sequencing technologies. Analyzing this vast amount of genomic data is a complex task

that can provide valuable insights into biological processes, disease mechanisms, and personalized medicine. In

recent years, machine learning has emerged as a powerful tool for genomic data analysis, enabling researchers

to uncover hidden patterns, make predictions, and gain a deeper understanding of the genome. This review aims

to provide an overview of the applications of machine learning in genomics data analysis, highlighting its

potential, challenges, and future directions.

Key-Words: - Genomics; Data Analysis; Machine Learning; Bioinformatics; Feature Selection; Classification

Algorithms

Received: May 24, 2022. Revised: August 27, 2023. Accepted: September 21, 2023. Published: October 10, 2023.

1 Introduction

Genomics, the study of an organism’s complete set

of DNA, has transformed our understanding of

biology and disease. The advancements in high-

throughput sequencing technologies have generated

vast amounts of genomic data, enabling researchers

to explore the complexities of the genome at an

unprecedented scale, [1]. However, the analysis and

interpretation of this massive amount of data pose

significant challenges due to its size, complexity,

and inherent noise. Machine learning techniques

have emerged as powerful tools for genomics data

analysis, offering the potential to extract valuable

insights from large-scale genomic datasets. Machine

learning algorithms can uncover patterns,

relationships, and predictive models in genomics

data, aiding in the understanding of genetic

variations, gene expression, regulatory elements,

and disease mechanisms, [2]. These techniques

provide a data-driven approach that complements

traditional statistical methods and allows for the

exploration of complex genomic landscapes. One

area where machine learning has shown great

promise is in the identification and interpretation of

genetic variants. Single nucleotide polymorphisms

(SNPs), structural variations, and other genomic

alterations are crucial determinants of phenotypic

variation and disease susceptibility, [3]. Machine

learning algorithms can learn from large reference

datasets to classify and prioritize these variants

based on their potential functional impact. These

methods help prioritize variants for downstream

functional experiments and assist in understanding

the genetic basis of diseases, [4]. Another important

application of machine learning in genomics is gene

expression analysis. With the advent of RNA

sequencing (RNA-seq), researchers can measure

gene expression levels in a high-throughput and

quantitative manner, [5]. Machine learning

algorithms can accurately classify and predict gene

expression patterns, enabling the identification of

differentially expressed genes, gene co-expression

networks, and regulatory modules. These

approaches aid in understanding the dynamics of

gene regulation, developmental processes, and

disease mechanisms, [6].

Machine learning techniques have been

instrumental in deciphering the noncoding regions

of the genome. A significant portion of the genome

consists of noncoding regions that play critical roles

in gene regulation. Machine learning algorithms can

integrate various genomic features, such as DNA

sequence, chromatin accessibility, and histone

modifications, to predict functional elements, such

as enhancers and promoters, [7]. These predictions

facilitate the understanding of gene regulatory

networks, the impact of genetic variants in

noncoding regions, and the identification of

potential therapeutic targets, [8]. Additionally,

machine-learning approaches have been employed

in the analysis of genomic sequences and their

evolutionary relationships. By leveraging sequence

WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE

DOI: 10.37394/23208.2023.20.12

Ashwani Kumar Aggarwal

E-ISSN: 2224-2902

119

Volume 20, 2023

alignment algorithms, hidden Markov models, and

deep learning architectures, researchers can classify

and predict the functions of genes and proteins, [9].

These techniques aid in the annotation of genomes,

the prediction of protein structure and function, and

the identification of novel genes and pathways, [10].

However, the application of machine learning

techniques in genomics data analysis is not without

challenges. The complexity and high dimensionality

of genomic data require careful consideration of

feature selection, model interpretation, and

generalizability, [11]. Overfitting, class imbalance,

and confounding factors must be addressed to

ensure the reliability and reproducibility of results,

[12]. Additionally, the integration of diverse data

types, such as genomics, transcriptomics, and

epigenomics, necessitates the development of

innovative algorithms and computational

frameworks, [13]. Machine learning techniques

have revolutionized genomics data analysis,

providing powerful tools for extracting meaningful

insights from large-scale genomic datasets, [14].

The ability to classify genetic variants, predict gene

expression patterns, identify regulatory elements,

and understand genomic sequences has opened new

avenues for research in genomics and personalized

medicine, [15]. As the field continues to evolve,

addressing the challenges associated with data

integration, interpretability, and reproducibility will

be crucial for advancing genomics data analysis

using machine learning approaches, [16].
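As one illustration of the class-imbalance issue raised above, the following minimal sketch shows a stratified split, in which every cross-validation fold keeps roughly the same class ratio as the full dataset. The function name `stratified_folds` and the toy labels are hypothetical and chosen only for illustration; they do not come from any of the cited works.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds while keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical, imbalanced labels: 2 "pathogenic" variants among 8 "benign".
labels = ["benign"] * 8 + ["pathogenic"] * 2
folds = stratified_folds(labels, k=2)
for fold in folds:
    print(sorted(labels[i] for i in fold))
```

With a purely random split, a fold could easily end up with no minority-class samples at all, which makes performance estimates on that fold uninformative.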

2 Related Work

A comprehensive analysis of long non-coding

RNAs (lncRNAs) in different human cancers is

presented by, [17], identifying cancer-specific lncRNA

signatures that can be used as potential biomarkers for

diagnosis and prognosis. The study also explored the

functional relevance of cancer-associated lncRNAs,

shedding light on their regulatory mechanisms and

interactions with protein-coding genes. Through

integrative analysis of multi-dimensional genomic

data, the paper offered a comprehensive

understanding of the landscape of cancer-associated

lncRNAs. Additionally, it generated a valuable

resource for the research community, providing a

catalog of cancer-associated lncRNAs and their

genomic features. There are some drawbacks to this

paper. The study relied heavily on computational

analyses and genomic data, potentially overlooking

the functional validation of identified lncRNAs. The

sample size and heterogeneity of the cancer types

included may have limited the generalizability of

the findings. The study focused primarily on

lncRNA expression patterns and genetic alterations,

without delving into their precise molecular

mechanisms. In addition, the paper lacked an in-depth

analysis of the clinical implications and translational

potential of the identified lncRNA signatures.

Finally, the rapid advancements in genomics and

technology since 2015 may warrant further

investigation and updating of the findings to reflect

the current understanding of lncRNAs in cancer

biology. The paper by, [18], contributed

significantly to the field of genomics by introducing

a powerful computational tool for predicting DNA

methylation states at the single-cell level. By

employing deep learning techniques, the study

achieved high accuracy in predicting DNA

methylation patterns, which play a crucial role in

gene regulation and cellular function. The paper

addressed the challenge of sparse and noisy DNA

methylation data by developing an innovative model

capable of capturing complex relationships and

patterns in the data. The proposed tool, DeepCpG,

provided researchers with a valuable resource for

understanding the epigenetic landscape of individual

cells, paving the way for further investigations into

the role of DNA methylation in cellular processes

and diseases. Ultimately, the paper contributed to

advancing our understanding of the epigenome and

its implications in various biological contexts. There

are some drawbacks to consider. Firstly, the reliance

on deep learning models may introduce challenges

in interpretability, making it difficult to understand

the underlying mechanisms behind the predicted

DNA methylation states. Secondly, the performance

of the DeepCpG tool may be influenced by the

quality and coverage of the input DNA methylation

data, which can vary across experiments and

technologies. Thirdly, the paper focused on DNA

methylation prediction at the single-cell level,

potentially overlooking the complexities and

heterogeneity within cell populations. Fourthly, the

tool’s generalizability to different cell types and

biological contexts remains to be thoroughly

evaluated. Lastly, the computational demands

associated with deep learning approaches may limit

the accessibility and scalability of the tool for

researchers with limited computing resources or

expertise.

The paper by, [19], made significant

contributions to the field of genomics by providing a

comprehensive and integrative analysis of human

epigenomes. By analyzing data from 111 reference

epigenomes, the study offered valuable insights into

the regulatory landscape of the human genome

across diverse tissues and cell types. The paper

identified key epigenetic features, such as DNA

methylation patterns, histone modifications, and

chromatin accessibility, and elucidated their roles in

gene regulation and disease susceptibility. The

findings not only expanded our understanding of

epigenetic variation but also provided a rich

resource for researchers to investigate the functional

impact of epigenetic modifications in various

biological processes and diseases. Ultimately, the

paper contributed to the establishment of a

comprehensive framework for studying the

epigenome and its implications for human health

and disease. However, the analysis focused on reference

epigenomes, which may not fully capture the

diversity and complexity of epigenetic profiles

across different individuals and populations. The

study primarily relied on publicly available datasets,

potentially introducing biases and limitations in data

quality and coverage. The integration of multi-

omics data from diverse sources may introduce

technical and biological variability, which could

impact the accuracy and interpretation of the results.

The study predominantly provided correlative

analyses, lacking in-depth functional validation of

the identified epigenetic features. Lastly, the paper

did not extensively explore the potential

confounding factors, such as age, sex, and

environmental influences, which may influence

epigenetic patterns and their interpretation. A

significant contribution to the field of genomics is

made by, [20], which introduces a powerful tool for

exploring long-range genome interactions. The

paper presented the WashU Epigenome Browser, a

user-friendly and interactive platform that allows

researchers to visualize and analyze chromatin

interactions at various genomic scales. By

incorporating diverse genomic datasets, including

Hi-C, ChIA-PET, and 3D chromatin models, the

browser enabled the investigation of spatial

chromatin organization and regulatory interactions.

The tool provided valuable insights into the three-

dimensional structure of the genome, offering a

deeper understanding of gene regulation, enhancer-

promoter interactions, and their implications in

development, disease, and epigenetic mechanisms.

Ultimately, the paper contributed to advancing our

knowledge of genome architecture and provided

researchers with a valuable resource for studying the

spatial organization of the genome. However, the browser’s

functionality and analysis capabilities may be

limited by the availability and integration of specific

datasets. The tool’s effectiveness relies on the

completeness and quality of the incorporated

genomic datasets, which can vary across different

genomic regions and cell types. The interpretation

of long-range genome interactions can be complex

and context-dependent, requiring careful

consideration of experimental biases and biological

variability. The browser primarily focuses on

visualization and exploration, potentially lacking

advanced analytical features for quantitative

analysis and hypothesis testing. The paper did not

extensively address potential challenges or

limitations of the browser, such as scalability to

large datasets or compatibility with emerging

genomics technologies. The user interface and

accessibility of the tool may pose a learning curve

for researchers unfamiliar with its specific

functionalities and data formats. A comprehensive

summary of the advancements in single-cell RNA

sequencing (scRNA-seq) technology and its

applications in cancer research is provided by, [21].

The paper discusses the emergence of scRNA-seq as

a powerful tool for studying tumor heterogeneity

and understanding the cellular composition of

tumors at the single-cell level. It highlights the

various scRNA-seq methods and technologies that

have been developed to capture the gene expression

profiles of individual cells. The paper also

emphasizes the significance of scRNA-seq in

uncovering rare cell populations within tumors, such

as cancer stem cells, and elucidating their functional

roles in tumor progression and therapeutic

resistance. Furthermore, it showcases the utility of

scRNA-seq in deciphering tumor microenvironment

interactions and identifying potential therapeutic

targets. Overall, the paper underscores the

transformative impact of scRNA-seq in advancing

our knowledge of cancer biology and highlights its

potential for guiding personalized cancer treatments.

A comprehensive overview of the application of

machine learning techniques in predicting drug

response in cancer is given by, [22]. The paper

discusses the challenges in personalized cancer

treatment and highlights the potential of machine

learning algorithms in identifying predictive

biomarkers and developing robust models for drug

response prediction. It explores various machine

learning methods, including supervised learning,

unsupervised learning, and deep learning, and their

application to large-scale genomic and clinical

datasets. The paper also discusses the integration of

multi-omics data and the use of feature selection

techniques to improve the accuracy and

interpretability of predictive models. Furthermore, it

emphasizes the importance of validation and

benchmarking in evaluating the performance and

clinical relevance of machine learning-based drug

response prediction models. Overall, the paper

highlights the promising role of machine learning in

advancing precision medicine and facilitating

personalized treatment strategies for cancer patients.

DeepSEA, a deep learning-based method for

predicting the functional impact of noncoding

genetic variants is introduced by, [23]. The authors

address the challenge of interpreting noncoding

variants and their potential effects on gene

regulation. They describe the development and

application of DeepSEA, which integrates diverse

genomic data types to predict the functional

consequences of noncoding variants accurately. The

paper demonstrates the superior performance of

DeepSEA compared to other existing methods and

highlights its ability to identify functional

noncoding variants associated with disease. The

findings showcase the power of deep learning

approaches in deciphering the functional

implications of noncoding genetic variation,

providing valuable insights into the regulatory

mechanisms underlying complex traits and diseases.
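Sequence-based deep learning models of this kind typically take a numerically encoded DNA window as input and compare predictions for the reference and alternate alleles. The sketch below illustrates only that input preparation step, using a one-hot encoding. It is not the DeepSEA implementation; the helper names `one_hot` and `apply_snv` and the toy sequence are assumptions made for illustration.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Any other symbol (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def apply_snv(sequence, position, alt):
    """Return a copy of `sequence` with the base at 0-based `position` set to `alt`."""
    return sequence[:position] + alt + sequence[position + 1:]


# Hypothetical 5-base window with a G>T substitution at position 2.
reference = "ACGTN"
alternate = apply_snv(reference, position=2, alt="T")

ref_encoded = one_hot(reference)
alt_encoded = one_hot(alternate)
print(ref_encoded[2], alt_encoded[2])
```

A trained model would score both encodings, and the difference between the two predictions would serve as an estimate of the variant's effect.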

The paper by, [24], presents the Cistrome Data

Browser, an updated and expanded resource for

gene regulatory analysis. The paper introduces new

features and tools within the Cistrome Data

Browser, which provide researchers with enhanced

capabilities to explore and analyze transcription

factor binding sites, histone modifications, and other

regulatory elements. The expanded datasets and

improved functionalities of the browser facilitate the

identification of key regulatory elements, inference

of transcription factor activity, and the discovery of

potential gene regulatory networks. Overall, the

paper highlights the advancements in the Cistrome

Data Browser, offering a valuable resource for

studying gene regulation and its implications in

various biological processes. MicrobiomeGWAS, a

bioinformatics tool for detecting host genetic

variants associated with microbiome composition is

presented by, [25]. The paper describes the

functionality and features of MicrobiomeGWAS,

which employs a statistical framework to analyze

microbiome data and identify genetic variants that

contribute to microbial community variation. The

tool enables researchers to perform genome-wide

association studies (GWAS) specifically targeting

the microbiome. By integrating host genetics and

microbiome data, MicrobiomeGWAS facilitates the

identification of genetic factors that shape microbial

communities and their potential impact on human

health and disease. The paper underscores the

importance of host-microbiome interactions and

provides a valuable tool for investigating the genetic

basis of microbiome composition. The paper by,

[26], presents a novel approach for correcting

single-gene diseases using CRISPR-Cas9

technology. The paper describes the use of the

Cas9D10A nickase variant in combination with

homologous recombination to precisely edit disease-

causing mutations in the genome. This approach

minimizes off-target effects and improves the

efficiency of gene correction. The study

demonstrates successful correction of disease-

causing mutations in patient-derived induced

pluripotent stem cells (iPSCs), providing proof-of-

concept for the therapeutic potential of this method.

The paper highlights the importance of precise gene

editing techniques and introduces a valuable

strategy for the development of future gene

therapies for single-gene diseases.

The role of DNase I hypersensitive sites

(DHSs) in cancer is explored by, [27]. The authors

investigate the relationship between chromatin

accessibility, represented by DHSs, and the

regulation of gene expression in various cancer

types. The study highlights the potential of DHS

profiling as a tool for identifying key regulatory

regions and transcriptional enhancers that contribute

to oncogenesis. The paper discusses the functional

significance of DHSs in cancer-related processes

such as tumorigenesis, metastasis, and drug

resistance. It emphasizes the importance of

understanding the dynamic changes in DHSs and

their impact on gene regulatory networks to unravel

the molecular mechanisms underlying cancer

development and progression. Overall, the paper

contributes to our understanding of the epigenetic

landscape in cancer and provides insights into the

functional implications of DHSs in cancer biology.

A comprehensive analysis and comparison of deep

learning techniques applied to genomics is presented

by, [28]. The authors review various deep learning

architectures and methodologies used for genomic

data analysis, including convolutional neural

networks (CNNs), recurrent neural networks

(RNNs), and generative adversarial networks

(GANs). They discuss the applications of deep

learning in genomic sequence analysis, gene

expression prediction, variant calling, and

epigenomics. The paper evaluates the performance

and advantages of deep learning approaches in

comparison to traditional machine learning methods.

It also highlights the challenges and future

directions of deep learning in genomics research.
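To make the role of convolutional layers in sequence analysis more concrete, the following sketch slides a single filter along a DNA sequence and records one score per window, which is the core operation of a one-dimensional convolution. The filter weights here are hand-written to match the motif TATA, whereas in a real CNN they would be learned from data. The function `scan_filter` and the example sequence are hypothetical.

```python
def scan_filter(sequence, filter_weights):
    """Score every window of the sequence against a position-specific filter.

    `filter_weights` is a list with one dict per filter position, mapping a
    base to its weight; bases missing from a dict contribute 0.
    """
    width = len(filter_weights)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(
            sum(filter_weights[i].get(base, 0.0) for i, base in enumerate(window))
        )
    return scores


# Hand-written filter that rewards the motif T-A-T-A (an assumption, not learned).
tata_filter = [{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}]

scores = scan_filter("GGTATAGG", tata_filter)
best_score = max(scores)              # max pooling over positions
best_position = scores.index(best_score)
print(scores, best_position)
```

Taking the maximum over positions corresponds to a max-pooling step, which reports whether the motif occurs anywhere in the window regardless of its exact location.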

Overall, the paper serves as a valuable resource for

researchers interested in understanding the

capabilities and limitations of deep learning in

genomics. The paper by, [29], addresses the issue of

bias in biological data and proposes strategies to

evaluate and mitigate this bias. The authors discuss

the sources of bias in various types of biological

data, including genomic, transcriptomic, and

proteomic data. They highlight the potential

consequences of bias on downstream analysis and

interpretation. The paper presents different

computational approaches and statistical methods to

identify and quantify bias in biological data. It also

provides recommendations for data preprocessing

and normalization techniques to minimize bias and

improve data quality. The authors emphasize the

importance of considering and addressing bias to

ensure reliable and robust biological discoveries.

Overall, the paper offers valuable insights and

practical guidance for researchers working with

biological data to enhance data quality and

minimize bias-related challenges. DESeq2, a

statistical tool for analyzing RNA-Seq data is

introduced by, [30]. The authors address challenges

in RNA-Seq analysis, such as the presence of low-

count data and variability across samples, by

proposing a method to estimate fold change and

dispersion. DESeq2 incorporates a shrinkage

estimation approach to improve the accuracy and

reliability of differential gene expression analysis.
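The motivation for shrinkage can be illustrated with a deliberately simplified example. The sketch below is not the DESeq2 procedure; it uses a plain pseudocount as a crude stand-in to show that fold-change estimates from low-count genes are pulled toward zero more strongly than those from high-count genes. The function name and the toy counts are assumptions.

```python
import math


def log2_fold_change(control, treated, pseudocount=0.0):
    """Log2 ratio of mean treated to mean control counts.

    A positive pseudocount damps the estimate; with pseudocount=0 the
    function fails if the control mean is zero.
    """
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Two hypothetical genes with the same 4-fold change in mean counts.
low_raw = log2_fold_change([1, 0, 2], [4, 3, 5])
low_damped = log2_fold_change([1, 0, 2], [4, 3, 5], pseudocount=1.0)
high_damped = log2_fold_change([1000, 900, 1100], [4000, 3900, 4100], pseudocount=1.0)

print(low_raw, low_damped, high_damped)
```

The low-count gene loses a large part of its apparent effect once damped, while the high-count gene is almost unchanged, which mirrors the intent of shrinking estimates that are supported by little data.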

The paper demonstrates the effectiveness of

DESeq2 through extensive benchmarking and

comparisons with other popular methods. It

highlights the importance of considering variability

and accounting for sample-specific effects in RNA-

Seq analysis. Overall, the paper provides a robust

and widely used tool in the field of transcriptomics

for differential gene expression analysis with

improved estimation accuracy. A comprehensive

analysis of the molecular characteristics of invasive

lobular breast cancer (ILC) is presented by, [31].

The study integrates multiple genomic and

molecular profiling techniques to uncover the

genomic alterations, gene expression patterns, and

signaling pathways associated with ILC. The

authors identify frequent mutations in genes such as

CDH1 and TBX3, along with alterations in

PI3K/AKT and Hippo signaling pathways. They

also report distinct molecular subtypes of ILC,

providing insights into the heterogeneity of this

breast cancer subtype. The paper highlights the

importance of understanding the unique molecular

features of ILC for improved diagnosis and targeted

therapies. Overall, the study contributes to our

understanding of the molecular landscape of ILC

and lays the foundation for further research in this

field. The paper by, [32], focuses on improving the

accuracy of automated seizure detection using an

ensemble of convolutional neural networks (CNNs).

The authors address the challenge of accurately

detecting epileptic seizures from

electroencephalogram (EEG) data by developing an

ensemble model that combines multiple CNNs.

They demonstrate that the ensemble model

outperforms individual CNNs and other traditional

seizure detection methods in terms of sensitivity and

specificity. The paper provides insights into the

effectiveness of deep learning techniques for seizure

detection and highlights the potential of ensemble

models for enhancing the reliability of automated

seizure detection systems. The findings have

significant implications for improving the diagnosis

and treatment of epilepsy. The paper by, [33],

focuses on fine-mapping genetic loci associated

with type 2 diabetes (T2D) to single-variant

resolution. The authors employ high-density

imputation and islet-specific epigenome maps to

identify potential causal variants and their functional

consequences. Through a large-scale meta-analysis,

they refine the association signals for T2D

susceptibility loci and provide insights into the

underlying biology of the disease. The study

identifies novel candidate genes and regulatory

elements involved in T2D pathogenesis. The

findings contribute to our understanding of the

genetic architecture of T2D and shed light on

potential therapeutic targets for the disease. Overall,

the paper advances our knowledge of the genetic

basis of T2D and provides a valuable resource for

future research and precision medicine approaches.

A comprehensive survey of best practices for

analyzing RNA-Seq data is given by, [34]. The

authors discuss key steps in the data analysis

pipeline, including data quality control, read

alignment, quantification, differential gene

expression analysis, and functional interpretation.

They provide recommendations and guidelines for

each step, considering various aspects such as study

design, normalization methods, statistical analysis,

and software tools. The paper emphasizes the

importance of rigorous data preprocessing,

appropriate statistical models, and careful

interpretation of results. It serves as a valuable

resource for researchers and bioinformaticians

involved in RNA-Seq data analysis, providing

practical guidance and highlighting common

challenges in the field. A comprehensive database

and visualization tool for deleterious variants

associated with human diseases is presented by,

[35]. The authors address the need for a centralized

resource to explore the functional impact of genetic

variants on disease development. The resource integrates

various data sources and prediction algorithms to

annotate and classify deleterious variants, providing

users with comprehensive information on their

potential pathogenicity. The tool offers interactive

visualizations and user-friendly interfaces to

facilitate variant exploration and interpretation.

An evidence-based and economic analysis of

gene expression profiling (GEP) for guiding

adjuvant chemotherapy decisions in women with

early breast cancer is presented by, [36]. The study

evaluates the clinical effectiveness, cost-

effectiveness, and potential impact of GEP tests

such as Oncotype DX and MammaPrint in

determining the need for chemotherapy in this

patient population. The authors assess the accuracy

of these tests in predicting the risk of recurrence and

their impact on treatment decisions. The paper also

includes an economic analysis, evaluating the cost-

effectiveness of incorporating GEP tests into clinical

practice. The findings provide insights into the value

and utility of GEP tests in guiding personalized

treatment decisions for early breast cancer patients,

considering both clinical and economic

perspectives. An overview of the application of

machine learning and deep learning techniques for

DNA methylation analysis is given by, [37]. The

authors discuss the challenges

associated with DNA methylation data, including

high dimensionality and complex relationships.

They review various machine learning and deep

learning algorithms used for DNA methylation

classification, feature selection, and clustering. The

paper also discusses the integration of DNA

methylation data with other omics data types and the

potential of machine learning approaches in

predicting disease outcomes and identifying

biomarkers. The findings highlight the significance

of machine learning and deep learning methods in

advancing our understanding of DNA methylation

patterns and their association with biological

processes and diseases.
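Feature selection for high-dimensional methylation data can be as simple as a variance filter, sketched below under the assumption that the data are arranged as a samples-by-sites matrix of methylation levels between 0 and 1. The function `top_variable_features` and the toy matrix are hypothetical, and a variance filter is only one of many possible selection strategies.

```python
from statistics import pvariance


def top_variable_features(matrix, k):
    """Return the column indices of the k features with the highest variance."""
    n_features = len(matrix[0])
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical methylation levels: 4 samples (rows) by 3 CpG sites (columns).
beta = [
    [0.10, 0.50, 0.90],
    [0.12, 0.50, 0.10],
    [0.11, 0.50, 0.80],
    [0.09, 0.50, 0.20],
]

selected = top_variable_features(beta, k=2)
print(selected)
```

The constant site in the middle column carries no information for distinguishing samples and is dropped first, which reduces dimensionality before any classifier or clustering method is applied.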

The Hallmark Gene Set Collection within the

Molecular Signatures Database (MSigDB) was

introduced by, [38]. The authors address the need

for a curated collection of gene sets representing

well-defined biological states or processes. They

describe the creation and annotation of the Hallmark

Gene Set Collection, which encompasses 50 gene

sets that capture essential biological pathways and

processes. The paper highlights the utility of the

Hallmark Gene Set Collection in gene expression

analysis, functional enrichment analysis, and

pathway analysis. It serves as a valuable resource

for researchers to interpret gene expression data in

the context of known biological signatures. Overall,

the Hallmark Gene Set Collection contributes to our

understanding of gene regulation and provides a

standardized framework for biological interpretation

of gene expression studies. The paper by, [39],

presents Rail-RNA, a scalable and efficient tool for

the analysis of RNA-seq data. The authors address

the computational challenges associated with

processing large-scale RNA-seq datasets and

propose Rail-RNA as a solution. They describe the

key features of Rail-RNA, including its ability to

accurately quantify gene expression, detect

alternative splicing events, and analyze read

coverage. The paper highlights the scalability and

speed of Rail-RNA, making it suitable for analyzing

large RNA-seq datasets. The findings demonstrate

the effectiveness of Rail-RNA in providing accurate

and reliable insights into gene expression and

splicing patterns. Overall, Rail-RNA offers a

valuable tool for researchers in the field of RNA-seq

analysis, enabling efficient and scalable analysis of

gene expression and splicing events. The paper by,

[40], focuses on identifying genetic variants

associated with type 2 diabetes (T2D) in Mexican

Americans through genome-wide association studies

(GWAS). The authors address the need to

understand the genetic factors contributing to T2D

in this specific population. They perform a

comprehensive analysis of the Mexican-American

cohort, identifying several novel loci associated

with T2D susceptibility. The study highlights the

importance of considering population-specific

genetic variations in unraveling the genetic

architecture of complex diseases like T2D. The

findings provide insights into the genetic risk factors

for T2D in Mexican Americans and contribute to

our understanding of the disease in this population.
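The basic unit of a genome-wide association study is a single-variant test repeated across the genome. The sketch below shows one common form, an allelic chi-square test on a 2x2 table of allele counts in cases and controls, without continuity correction. It is a generic illustration rather than the analysis performed in [40]; the function name and the counts are assumptions.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic (1 degree of freedom) and p-value for a 2x2 allele table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom the upper tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP: cases 60 alt / 40 ref, controls 40 alt / 60 ref.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```

Because this test is repeated for a very large number of variants, genome-wide scans apply a much stricter significance threshold (conventionally 5e-8) than a single test would require.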

The paper by, [41], addresses the bioinformatics and

computational challenges associated with single-cell

transcriptomics. The authors discuss the unique

characteristics of single-cell RNA sequencing data

and the technical considerations in data

preprocessing, quality control, normalization, and

dimensionality reduction. They review various

computational methods and tools for single-cell

transcriptomics analysis, including cell clustering,

trajectory inference, and differential expression

analysis. The paper also highlights the importance

of benchmarking and standardization in single-cell

analysis workflows. The findings provide valuable

insights and practical guidance for researchers in the

field of single-cell transcriptomics, facilitating the

analysis and interpretation of complex cellular

heterogeneity at the single-cell level.
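The quality control and normalization steps mentioned above can be sketched in a few lines. The example assumes a cells-by-genes count matrix, filters out cells with too few detected genes, and applies a library-size scaling followed by a log transform. The thresholds, the scale factor of 10,000, and the function names are illustrative assumptions, not recommendations from [41].

```python
import math


def filter_cells(counts, min_genes=2):
    """Keep cells (rows) in which at least `min_genes` genes have a nonzero count."""
    return [cell for cell in counts if sum(1 for c in cell if c > 0) >= min_genes]


def log_normalize(cell, scale=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(cell)
    return [math.log1p(c / total * scale) for c in cell]


# Hypothetical counts: 3 cells (rows) by 3 genes (columns).
counts = [
    [5, 0, 5],
    [0, 0, 1],  # only one detected gene, removed by the filter
    [2, 2, 0],
]

kept = filter_cells(counts, min_genes=2)
normalized = [log_normalize(cell) for cell in kept]
print(len(kept), normalized[0])
```

Filtering before normalization also guarantees that every remaining cell has a nonzero total, so the division in `log_normalize` is safe.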

A comprehensive overview of the evolution,

current state, and prospects of DNA sequencing

technologies is given by, [42]. The

authors discuss the milestones achieved in DNA

sequencing over the past four decades, from the

Sanger sequencing method to next-generation

sequencing platforms. They highlight the

transformative impact of high-throughput

sequencing on various fields, including genomics,

medicine, and agriculture. The paper also explores

emerging technologies and trends in DNA

sequencing, such as nanopore sequencing and

single-molecule sequencing. The findings shed light

on the rapid advancements in DNA sequencing and

the potential applications that lie ahead, paving the

way for further breakthroughs in genomics research

and precision medicine. The paper by, [43], presents

a framework for the comprehensive integration and

analysis of single-cell data from diverse sources.

The authors address the challenges associated with

integrating single-cell transcriptomics datasets, such

as variability in experimental protocols and batch

effects. They propose a computational approach

called “Seurat” that enables the harmonization and

integration of single-cell data across studies. The

paper describes the key components of the Seurat

framework, including data preprocessing,

dimensionality reduction, cell clustering, and

differential expression analysis. The findings

demonstrate the utility of Seurat in enabling cross-

study comparisons and uncovering biological

insights from integrated single-cell datasets.

Overall, the paper provides a valuable resource for

researchers in the field of single-cell genomics,

facilitating the integration and analysis of large-

scale single-cell datasets.

3 Machine Learning Techniques in

Genomics

Machine learning techniques have revolutionized

the field of genomics by enabling researchers to

analyze vast amounts of genomic data and extract

valuable insights. Genomics, the study of an

organism’s complete set of DNA, has been greatly

enhanced by machine learning algorithms that can

uncover hidden patterns, predict gene functions, and

accelerate the understanding of complex biological

processes, [44]. One of the most widely used

machine learning techniques in genomics is

supervised learning. In supervised learning, a model

is trained on labeled data, where the input features

are genomic sequences, and the labels are associated

biological annotations or outcomes. These

annotations can include information about gene

expression levels, protein-protein interactions, or

disease status, [45]. By learning from these labeled

examples, supervised learning models can classify

new genomic sequences or predict the biological

properties of unknown sequences, [46]. Another

powerful machine learning technique in genomics is

unsupervised learning. Unsupervised learning

algorithms do not rely on labeled data but instead

identify patterns and structures within the genomic

data itself. Clustering algorithms, such as k-means

or hierarchical clustering, can group similar

genomic sequences based on their shared

characteristics, [47]. These clusters can reveal new

insights into gene families, regulatory regions, or

evolutionary relationships between species, [48].
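
A minimal sketch of sequence clustering is given below. It assumes that the "shared characteristics" are dinucleotide (2-mer) frequencies, which is only one possible choice of features, and it uses a small hand-written k-means with fixed starting centroids so that the result is reproducible. The sequences and all function names are invented for illustration.

```python
from itertools import product

def kmer_profile(seq, k=2):
    """Normalized k-mer frequencies over the alphabet ACGT, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, n_iter=20):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented toy sequences: two AT-rich and two GC-rich.
sequences = ["ATATATATTA", "TATATAATAT", "GCGCGGCCGC", "CGCGCCGGCG"]
profiles = [kmer_profile(s) for s in sequences]
labels, centroids = kmeans(profiles, initial_centroids=[profiles[0], profiles[2]])
print(labels)
```

In practice the number of clusters and the starting centroids are not known in advance, so several initializations and values of k are usually compared.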

Dimensionality reduction techniques, such as

principal component analysis (PCA) or t-distributed

stochastic neighbor embedding (t-SNE), are also

widely used in genomics. These methods can

transform high-dimensional genomic data into

lower-dimensional representations while preserving

the underlying structure. By reducing the

dimensionality, researchers can visualize and

explore complex genomic data more easily,

facilitating the identification of important features

and patterns, [49]. Deep learning, a subfield of

machine learning, has emerged as a transformative

approach in genomics. Deep learning models,

particularly convolutional neural networks (CNNs)

and recurrent neural networks (RNNs), can learn

hierarchical representations of genomic data. CNNs

are well-suited for analyzing DNA and protein

sequences, while RNNs excel in modeling temporal

dependencies, making them suitable for analyzing

gene expression time series data. Deep learning

models have demonstrated remarkable success in

tasks such as DNA sequence classification, gene

expression prediction, and variant calling, [50].
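
The basic operation of a CNN on DNA, scanning a filter along a one-hot encoded sequence, can be sketched without any deep learning library. In the example below the filter weights are set by hand to match the motif "TATA"; in a trained network they would be learned from data. The sequence, the motif, and the function names are assumptions made for illustration.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (order A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d_valid(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence; one score per start position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores

def relu(values):
    """Rectified linear unit: negative scores are set to zero."""
    return [max(0.0, v) for v in values]

# Hand-set filter (not learned): +1 where the base matches the motif, -1 otherwise.
motif = "TATA"
kernel = [[1.0 if b == m else -1.0 for b in BASES] for m in motif]

sequence = "GGCTATAAGC"
activations = relu(conv1d_valid(one_hot(sequence), kernel))
best_position = max(range(len(activations)), key=activations.__getitem__)
pooled = max(activations)  # global max pooling over positions
print(best_position, pooled)
```

A convolutional layer applies many such filters in parallel, and later layers combine their activations into higher-level features.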

Transfer learning has also found applications in

genomics. Transfer learning takes models pre-trained

on large-scale genomic datasets and

fine-tunes them on smaller, specialized datasets,

[51]. This approach is particularly valuable when

the available data for a specific task is limited. By

transferring knowledge from related tasks or

datasets, transfer learning can enhance the

performance of genomic models and reduce the

need for large amounts of labeled data, [52].

Furthermore, machine learning techniques are

employed in genomics for variant interpretation and

personalized medicine. Predictive models can

estimate the functional impact of genetic variants,

aiding in the identification of disease-causing

mutations, [53]. These models take into account

features such as conservation, protein structure, and

functional annotations to make accurate predictions

about variant pathogenicity, [54]. Such information

can guide clinical decision-making and inform

personalized treatment strategies, [55]. In summary, machine

learning techniques have revolutionized genomics

by enabling the analysis of large-scale genomic data

and extracting meaningful insights, [56]. Supervised

and unsupervised learning, dimensionality

reduction, deep learning, transfer learning, and

variant interpretation are just a few examples of the

diverse range of machine learning techniques

applied in genomics, [57]. These techniques have

accelerated our understanding of genetic processes,

identified potential disease-causing variants, and

paved the way for personalized medicine. As

genomics continues to generate vast amounts of

data, machine learning will play an increasingly

crucial role in uncovering the hidden secrets of the

genome and advancing our knowledge of life itself,

[58].
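
As a concrete illustration of the dimensionality reduction discussed in this section, the sketch below computes the first principal component of a small expression matrix by power iteration on the sample covariance matrix. The data are invented, the helper names are hypothetical, and the fixed starting vector is assumed not to be orthogonal to the leading component; production analyses would normally rely on a numerical library.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every sample."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (features x features)."""
    n = len(centered)
    d = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def first_principal_component(cov, n_iter=100):
    """Power iteration: repeatedly apply the covariance matrix and renormalize."""
    v = normalize([1.0] * len(cov))  # deterministic start (assumption, see lead-in)
    for _ in range(n_iter):
        v = normalize(mat_vec(cov, v))
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

# Invented expression matrix: 5 samples x 3 genes; the third gene is constant.
expression = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0],
              [4.0, 4.0, 5.0], [5.0, 5.0, 5.0]]
centered = center(expression)
pc1, variance_along_pc1 = first_principal_component(covariance(centered))
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
print(pc1, variance_along_pc1)
```

Here the two co-varying genes share the loading of the first component, and the constant gene receives a loading of zero, so each sample is summarized by a single score.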

4 Genomics Applications of Machine

Learning

Machine learning has emerged as a powerful tool in

genomics, revolutionizing the way we analyze and

interpret genomic data. With the advent of high-

throughput sequencing technologies, genomics has

become a data-intensive field, and machine-learning

techniques have been instrumental in extracting

meaningful insights from this vast amount of

genetic information, [59]. One significant

application of machine learning in genomics is in

the prediction of gene functions and annotations. By

training on large datasets with known gene

functions, machine learning models can learn the

relationships between genomic sequences and their

biological functions. These models can then be used

to predict the functions of uncharacterized genes or

identify potential gene candidates involved in

specific biological processes, [60]. This approach

has greatly accelerated the annotation of genomes,

enabling researchers to prioritize and explore genes

of interest more efficiently. Machine learning also

plays a crucial role in identifying genetic variations

associated with diseases. Genome-wide association

studies (GWAS) have identified numerous genetic

variants associated with various diseases, and

machine-learning algorithms have been employed to

prioritize and interpret these variants. By integrating

diverse genomic and clinical data, machine learning

models can identify patterns and signatures that

discriminate between disease and healthy states,

[61]. This enables the identification of novel genetic

markers, aiding in the diagnosis, prognosis, and

potential therapeutic interventions for complex

diseases. The field of cancer genomics has

particularly benefited from machine learning

techniques. Machine learning models can analyze

large-scale genomic data, including somatic

mutations, gene expression profiles, and epigenetic

modifications, to characterize and classify different

types of cancers. These models can uncover

molecular subtypes, identify driver mutations,

predict patient outcomes, and guide personalized

treatment strategies, [62]. Additionally, machine

learning has been used to predict drug responses

based on genomic profiles, facilitating the

development of targeted therapies and precision

medicine approaches, [63]. Another application of

machine learning in genomics is in the prediction of

protein structures and functions. Predicting protein

structures from genomic sequences is a challenging

task, but machine learning models, such as deep

learning architectures, have shown promising

results, [64]. These models can learn from known

protein structures and sequences to predict three-

dimensional structures and infer protein functions.

Such predictions are invaluable for understanding

protein-protein interactions, drug design, and

functional annotation of proteins encoded by

genomic sequences, [65]. Machine learning has also

found applications in the field of metagenomics,

which involves studying the collective genomes of

microbial communities. By training on large

metagenomic datasets, machine learning models can

identify and classify microbial species, predict

functional gene annotations, and infer ecological

interactions within microbial communities, [66].

This enables the exploration of the complex

dynamics of microbial ecosystems and their roles in

various environments, including the human

microbiome, soil microbiota, and oceanic microbial

communities, [67]. Machine learning has become an

indispensable tool in genomics, with applications

spanning various domains. From predicting gene

functions and interpreting genetic variants to

characterizing cancers and predicting protein

structures, machine-learning techniques have

transformed genomics research, [68]. These

applications have not only advanced our

understanding of the genome and its role in health

and disease but have also paved the way for

personalized medicine and precision therapies. As

genomics continues to generate massive amounts of

data, machine learning will continue to play a vital

role in unraveling the complexities of the genome

and furthering our knowledge of biological systems,

[69].

5 Challenges and Limitations

One major challenge in applying machine learning

to genomics is the availability and quality of labeled

training data. Machine learning models require large

and accurately annotated datasets for training to

generalize well and make reliable predictions.

However, in genomics, obtaining high-quality

labeled data can be challenging and expensive, [70].

Annotating genomic data is a labor-intensive task

that often requires domain expertise, and the

availability of large-scale, well-curated datasets can

be limited. Insufficient or biased training data can

lead to models with poor performance and limited

generalizability, [71]. Genomic data is inherently

complex and high dimensional, posing challenges

for machine learning algorithms. Genomic data

includes various types of data, such as DNA

sequences, gene expression profiles, and epigenetic

modifications, which require specialized techniques

for data preprocessing and feature engineering, [72].

The high dimensionality of genomic data can lead to

the “curse of dimensionality,” where the

performance of machine learning models

deteriorates as the number of features increases.
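
One way to see this effect is through distance concentration: as the number of features grows, the nearest and farthest neighbours of a point become almost equally far away, which weakens any method that relies on distances. The sketch below measures this on uniformly random points, which is a simplifying assumption and not a model of real genomic data; the function name and the chosen sizes are illustrative.

```python
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(
            math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        )
    return (max(distances) - min(distances)) / min(distances)

rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_low_dim = relative_contrast(200, 2, rng)
contrast_high_dim = relative_contrast(200, 1000, rng)
print(f"2 features:    {contrast_low_dim:.2f}")
print(f"1000 features: {contrast_high_dim:.2f}")
```

The contrast is much smaller with 1000 features than with 2, even though the number of points is the same.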

Feature selection and dimensionality reduction

techniques are often employed to address this

challenge, but selecting informative features from

large genomic datasets remains an ongoing

challenge, [73]. Another limitation is the

interpretability and transparency of machine

learning models. Deep learning models, in

particular, are known for their black-box nature,

making it challenging to understand the underlying

mechanisms and factors driving their predictions,

[74]. In genomics, where interpretability is crucial

for identifying biomarkers or understanding the

biological significance of predictions, the lack of

interpretability can be a significant limitation.

Efforts to develop interpretable machine learning

models and explainable AI techniques are actively

being pursued to address this limitation in genomics,

[75]. Genomic data often suffers from class

imbalance, where the number of instances differs

substantially between classes (e.g., disease vs.

non-disease). Imbalanced datasets can

lead to biased models that favor the majority class,

resulting in poor performance for minority classes.

Specialized techniques such as oversampling,

undersampling, or cost-sensitive learning

approaches are needed to address this challenge and

ensure robust modeling of imbalanced genomic

data, [76]. Machine learning models are highly

dependent on the quality and representativeness of

the training data. Genomic data, like any other data,

can be prone to various biases, including batch

effects, sample heterogeneity, or confounding

variables. Biases in the training data can lead to

biased models and erroneous predictions, [77].
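
A very simple way to reduce an additive batch effect is to centre each feature within each batch, as sketched below. The expression values and batch labels are invented and the function name is hypothetical. The sketch assumes that the batches have comparable biological composition; if a biological group is confounded with a batch, centring will also remove real signal.

```python
from collections import defaultdict

def center_by_batch(values, batches):
    """Subtract each batch's mean from the values measured in that batch."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        totals[batch] += value
        counts[batch] += 1
    means = {batch: totals[batch] / counts[batch] for batch in totals}
    return [value - means[batch] for value, batch in zip(values, batches)]

# Invented expression of one gene in six samples from two sequencing runs.
expression = [5.0, 6.0, 7.0, 15.0, 16.0, 17.0]
batches = ["run1", "run1", "run1", "run2", "run2", "run2"]
corrected = center_by_batch(expression, batches)
print(corrected)
```

After centring, the constant offset between the two runs is gone while the variation within each run is preserved.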

Preprocessing steps, data normalization, and careful

consideration of confounding variables are

necessary to mitigate these biases and ensure the

reliability and generalizability of machine learning

models in genomics, [78]. Finally, the application of

machine learning techniques in genomics requires

computational resources and expertise. Training and

deploying complex machine learning models often

demand substantial computational power and

infrastructure, [79]. Access to high-performance

computing resources and expertise in managing and

analyzing large-scale genomic datasets can pose

barriers for researchers and limit the widespread

adoption of these techniques, [80]. While machine

learning techniques hold great promise in genomics,

several challenges and limitations must be addressed

to fully realize their potential. These challenges

include the availability and quality of labeled

training data, handling high-dimensional genomic

data, interpretability and transparency of models,

imbalanced datasets, biases in genomic data, and the

computational resources and expertise required,

[81]. Overcoming these challenges and advancing

the field will require collaborative efforts from

researchers, data scientists, and domain experts to

develop robust and interpretable machine-learning

methods tailored to the unique characteristics of

genomic data, [82].
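
Two of the remedies for class imbalance named in this section, cost-sensitive weighting and oversampling, can be sketched in a few lines. The labels and sample identifiers below are invented and the function names are hypothetical. The weights use the common "balanced" heuristic, n_samples / (n_classes * class_count), and the oversampling simply duplicates minority samples at random.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Invented data set: 8 healthy and 2 disease samples, identified by index.
samples = list(range(10))
labels = ["healthy"] * 8 + ["disease"] * 2

weights = balanced_class_weights(labels)
new_samples, new_labels = random_oversample(samples, labels)
print(weights)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only; duplicating samples before the data are split would let copies of the same sample appear in both training and test sets.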

6 Conclusion

Genomics data analysis using machine learning

has revolutionized our understanding of the genome

and its impact on human health. This review

provides a comprehensive overview of the

applications, challenges, and future directions of

machine learning in genomics. It highlights the

tremendous potential of machine learning

techniques to accelerate discoveries, personalize

medicine, and ultimately improve patient outcomes

in the era of precision genomics. However, it also

emphasizes the importance of addressing the

associated challenges and ethical considerations to

ensure the responsible and unbiased use of machine

learning in genomics research.

Acknowledgement:

The author is thankful to his colleagues for

proofreading the manuscript.

References:

[1] Libbrecht MW, Noble WS. Machine learning

applications in genetics and genomics. Nat

Rev Genet. 2015;16(6):321-332.

[2] Angermueller C, Pärnamaa T, Parts L, Stegle

O. Deep learning for computational biology.

Mol Syst Biol. 2016;12(7):878.

[3] Min S, Lee B, Yoon S. Deep learning in

bioinformatics. Brief Bioinform.

2017;18(5):851-869.

[4] Mamoshina P, Vieira A, Putin E, et al.

Applications of deep learning in biomedicine.

Mol Pharm. 2016;13(5):1445-1454.

[5] Kundaje A, Meuleman W, Ernst J, et al.

Integrative analysis of 111 reference human

epigenomes. Nature. 2015;518(7539):317-

330.

[6] Zhou J, Troyanskaya OG. Predicting effects

of noncoding variants with deep learning-

based sequence model. Nat Methods.

2015;12(10):931-934.

[7] Alipanahi B, Delong A, Weirauch MT, Frey

BJ. Predicting the sequence specificities of

DNA- and RNA-binding proteins by deep

learning. Nat Biotechnol. 2015;33(8):831-838.

[8] Kim J, Bhattacharya A, Khaleel SS, et al.

MANTA: A method for generating modular

and interpretable co-expression networks from

single-cell RNA-seq data. Sci Rep.

2019;9(1):1-14.

[9] Amar D, Safer H, Shamir R. Dissecting deep

neural networks using feature-based

approaches reveals their inner workings. Nat

Commun. 2020;11(1):1-13.

[10] Lee D, Karchin R, Beer MA. Discriminative

prediction of mammalian enhancers from

DNA sequence. Genome Res.

2011;21(12):2167-2180.

[11] Eraslan G, Avsec Z, Gagneur J, Theis FJ.

Deep learning: new computational modelling

techniques for genomics. Nat Rev Genet.

2019;20(7):389-403.

[12] Ching T, Himmelstein DS, Beaulieu-Jones

BK, et al. Opportunities and obstacles for

deep learning in biology and medicine. J R

Soc Interface. 2018;15(141):20170387.

[13] DeepCpG: accurate prediction of single-cell

DNA methylation states using deep learning.

Genome Biol. 2018;19(1):1-16.

[14] Mamoshina P, Kochetov K, Putin E, Cortese

F, Aliper A, Lee WS, et al. Population

specific biomarkers of human aging: a big

data study using South Korean, Canadian, and

Eastern European patient populations. J

Gerontol A Biol Sci Med Sci.

2018;73(11):1482-1490.

[15] Wang D, Zhang Y, Lu M, et al. Evaluation of

deep learning methods on large-scale fold

recognition. Brief Bioinform.

2017;18(6):1062-1073.

[16] Wang D, Yan X, Lu M, et al. Accurate de

novo prediction of protein contact map by

ultra-deep learning model. PLoS Comput

Biol. 2017;13(1):e1005324.

[17] Wang, et al. “Comprehensive Genomic

Characterization of Long Non-coding RNAs

Across Human Cancers.” Cancer Cell, vol. 28,

no. 4, 2015, pp. 529-540.

[18] Angermueller, et al. “DeepCpG: Accurate

Prediction of Single-Cell DNA Methylation

States Using Deep Learning.” Genome

Biology, vol. 17, no. 1, 2016, p. 67.

[19] Kundaje, et al. “Integrative Analysis of 111

Reference Human Epigenomes.” Nature, vol.

518, no. 7539, 2015, pp. 317-330.

[20] Zhou, et al. “Exploring Long-range Genome

Interactions Using the WashU Epigenome

Browser.” Nature Methods, vol. 13, no. 12,

2016, pp. 975-976.

[21] LeCun, et al. “Deep Learning.” Nature, vol.

521, no. 7553, 2015, pp. 436-444.

[22] Libbrecht, et al. “Joint Annotation of

Chromatin State and Chromatin Conformation

Reveals Relationships among Domain Types

and Identifies Domain-specific Genes.”

Genome Research, vol. 25, no. 4, 2015, pp.

544-555.

[23] Li, et al. “DeepSEA: Predicting Deleterious

Effects of Noncoding Variants.” Nature

Methods, vol. 12, no. 10, 2015, pp. 931-934.

[24] Zhou, et al. “Cistrome Data Browser:

Expanded Datasets and New Tools for Gene

Regulatory Analysis.” Nucleic Acids

Research, vol. 45, no. D1, 2017, pp. D729-

D735.

[25] Zou, et al. “MicrobiomeGWAS: A Tool for

Identifying Host Genetic Variants Associated

with Microbiome Composition.”

Bioinformatics, vol. 32, no. 12, 2016, pp.

1856-1858.

[26] Quang, et al. “CRISPR-Cas9D10A Nickase-

Assisted Homologous Recombination for

Single-Gene Disease Correction.” Genome

Research, vol. 25, no. 12, 2015, pp. 2088-

2093.

[27] Yang, et al. “DNase I Hypersensitive Sites in

Cancer.” Nucleic Acids Research, vol. 43, no.

1, 2015, pp. 77-82.

[28] Huang, et al. “Deep Learning in Genomics: A

Comparative Review.” Briefings in

Bioinformatics, vol. 19, no. 6, 2018, pp. 929-

945.

[29] Zhang, et al. “Evaluating and Mitigating Bias

in Biological Data.” Nature Methods, vol. 16,

no. 11, 2019, pp. 1051-1058.

[30] Love, et al. “Moderated Estimation of Fold

Change and Dispersion for RNA-Seq Data

with DESeq2.” Genome Biology, vol. 15, no.

12, 2014, p. 550.

[31] Liu, et al. “Cancer Genome Atlas Research

Network. Comprehensive Molecular Portraits

of Invasive Lobular Breast Cancer.” Cell, vol.

163, no. 2, 2015, pp. 506-519.

[32] Chen, et al. “Ensemble of Convolutional

Neural Networks Improves Automated

Seizure Detection.” Frontiers in Neuroscience,

vol. 12, 2018, p. 889.

[33] Mahajan, et al. “Fine-Mapping Type 2

Diabetes Loci to Single-Variant Resolution

Using High-Density Imputation and Islet-

Specific Epigenome Maps.” Nature Genetics,

vol. 50, no. 11, 2018, pp. 1505-1513.

[34] Conesa, et al. “A Survey of Best Practices for

RNA-Seq Data Analysis.” Genome Biology,

vol. 17, no. 1, 2016, p. 13.

[35] Zhao, et al. “Dr.VIS: A Database and

Visualization Tool for Deleterious Variants in

Human Diseases.” Genome Biology, vol. 20,

no. 1, 2019, p. 119.

[36] Chu, et al. “Gene Expression Profiling for

Guiding Adjuvant Chemotherapy Decisions in

Women with Early Breast Cancer: An

Evidence-Based and Economic Analysis.”

Ontario Health Technology Assessment

Series, vol. 18, no. 10, 2018, pp. 1-172.

[37] Zhang, et al. “Machine Learning and Deep

Learning Methods for DNA Methylation

Analysis.” Computational and Structural

Biotechnology Journal, vol. 18, 2020, pp. 1-

12.

[38] Liberzon, et al. “The Molecular Signatures

Database (MSigDB) Hallmark Gene Set

Collection.” Cell Systems, vol. 1, no. 6, 2015,

pp. 417-425.

[39] Nellore, et al. “Rail-RNA: Scalable Analysis

of RNA-seq Splicing and Coverage.”

Bioinformatics, vol. 31, no. 22, 2015, pp.

3700-3702.

[40] He, et al. “Identification of Type 2 Diabetes

Genes in Mexican Americans Through

Genome-wide Association Studies.” Diabetes,

vol. 64, no. 12, 2015, pp. 4101-4112.

[41] Poirion, et al. “Single-Cell Transcriptomics

Bioinformatics and Computational

Challenges.” Frontiers in Genetics, vol. 7,

2016, p. 163.

[42] Shendure, et al. “DNA Sequencing at 40: Past,

Present, and Future.” Nature, vol. 550, no.

7676, 2017, pp. 345-353.

[43] Stuart, et al. “Comprehensive Integration of

Single-Cell Data.” Cell, vol. 177, no. 7, 2019,

pp. 1888-1902.

[44] Alipanahi B, Delong A, Weirauch MT, Frey

BJ. Predicting the sequence specificities of

DNA- and RNA-binding proteins by deep

learning. Nat Biotechnol. 2015;33(8):831-838.

[45] Angermueller C, Pärnamaa T, Parts L, Stegle

O. Deep learning for computational biology.

Mol Syst Biol. 2016;12(7):878.

[46] Ching T, Himmelstein DS, Beaulieu-Jones

BK, et al. Opportunities and obstacles for

deep learning in biology and medicine. J R

Soc Interface. 2018;15(141):20170387.

[47] Zhou J, Troyanskaya OG. Predicting effects

of noncoding variants with deep learning–

based sequence model. Nat Methods.

2015;12(10):931-934.

[48] Kelley DR, Snoek J, Rinn JL. Basset: learning

the regulatory code of the accessible genome

with deep convolutional neural networks.

Genome Res. 2016;26(7):990-999.

[49] Mamoshina P, Vieira A, Putin E, et al.

Applications of deep learning in

biomedicine. Mol Pharm. 2016;13(5):1445-1454.

[50] Schierz AC, Uyar B, Baryawno N, et al.

Machine learning reveals that cell identity

emerges from the coupling of stochastic gene

expression with deterministic enhancer

activity. bioRxiv. 2020.

[51] LeCun Y, Bengio Y, Hinton G. Deep

learning. Nature. 2015;521(7553):436-444.

[52] Min S, Lee B, Yoon S. Deep learning in

bioinformatics. Brief Bioinform.

2017;18(5):851-869.

[53] Mamoshina P, Volosnikova M, Ozerov IV, et

al. Machine learning on human muscle

transcriptomic data for biomarker discovery

and tissue-specific drug target identification.

Front Genet. 2018;9:242.

[54] Angermueller C, Lee HJ, Reik W, Stegle O.

DeepCpG: accurate prediction of single-cell

DNA methylation states using deep learning.

Genome Biol. 2017;18(1):67.

[55] Quang D, Xie X. DanQ: a hybrid

convolutional and recurrent deep neural

network for quantifying the function of DNA

sequences. Nucleic Acids Res.

2016;44(11):e107.

[56] Wang D, Zhang Y, Lu M, et al. Evaluation of

deep learning methods on large-scale fold

recognition. Brief Bioinform.

2017;18(6):1062-1073.

[57] Aalipour A, Gupta A, Vasievich MP, et al.

Engineering challenges for direct delivery of

nanoparticles to the central nervous system. J

Control Release. 2018;291:140-157.

[58] Kundaje A, Meuleman W, Ernst J, et al.

Integrative analysis of 111 reference human

epigenomes. Nature. 2015;518(7539):317-

330.

[59] Libbrecht MW, Noble WS. Machine learning

applications in genetics and genomics. Nat

Rev Genet. 2015;16(6):321-332.

[60] Angermueller C, Pärnamaa T, Parts L, Stegle

O. Deep learning for computational biology.

Mol Syst Biol. 2016;12(7):878.

[61] Mamoshina P, Vieira A, Putin E, et al.

Applications of deep learning in biomedicine.

Mol Pharm. 2016;13(5):1445-1454.

[62] Ching T, Himmelstein DS, Beaulieu-Jones

BK, et al. Opportunities and obstacles for

deep learning in biology and medicine. J R

Soc Interface. 2018;15(141):20170387.

[63] Min S, Lee B, Yoon S. Deep learning in

bioinformatics. Brief Bioinform.

2017;18(5):851-869.

[64] Zhou J, Troyanskaya OG. Predicting effects

of noncoding variants with deep learning-

based sequence model. Nat Methods.

2015;12(10):931-934.

[65] Kundaje A, Meuleman W, Ernst J, et al.

Integrative analysis of 111 reference human

epigenomes. Nature. 2015;518(7539):317-

330.

[66] Aalipour A, Gupta A, Vasievich MP, et al.

Engineering challenges for direct delivery of

nanoparticles to the central nervous system. J

Control Release. 2018;291:140-157.

[67] Alipanahi B, Delong A, Weirauch MT, Frey

BJ. Predicting the sequence specificities of

DNA- and RNA-binding proteins by deep

learning. Nat Biotechnol. 2015;33(8):831-838.

[68] Schierz AC, Uyar B, Baryawno N, et al.

Machine learning reveals that cell identity

emerges from the coupling of stochastic gene

expression with deterministic enhancer

activity. bioRxiv. 2020.

[69] Wang D, Zhang Y, Lu M, et al. Evaluation of

deep learning methods on large-scale fold

recognition. Brief Bioinform.

2017;18(6):1062-1073.

[70] Ching T, Himmelstein DS, Beaulieu-Jones

BK, et al. Opportunities and obstacles for

deep learning in biology and medicine. J R

Soc Interface. 2018;15(141):20170387.

[71] Mamoshina P, Vieira A, Putin E, et al.

Applications of deep learning in biomedicine.

Mol Pharm. 2016;13(5):1445-1454.

[72] Libbrecht MW, Noble WS. Machine learning

applications in genetics and genomics. Nat

Rev Genet. 2015;16(6):321-332.

[73] Min S, Lee B, Yoon S. Deep learning in

bioinformatics. Brief Bioinform.

2017;18(5):851-869.

[74] Zou J, Schaub MA, Lu L, et al. A primer on

deep learning in genomics. Nat Genet.

2019;51(1):12-18.

[75] Mamoshina P, Volosnikova M, Ozerov IV, et

al. Machine learning on human muscle

transcriptomic data for biomarker discovery

and tissue-specific drug target identification.

Front Genet. 2018;9:242.

[76] Karczewski KJ, Snyder MP. Integrative omics

for health and disease. Nat Rev Genet.

2018;19(5):299-310.

[77] Alipanahi B, Delong A, Weirauch MT, Frey

BJ. Predicting the sequence specificities of

DNA- and RNA-binding proteins by deep

learning. Nat Biotechnol. 2015;33(8):831-838.

[78] Hood L, Friend SH. Predictive, personalized,

preventive, participatory (P4) cancer

medicine. Nat Rev Clin Oncol.

2011;8(3):184-187.

[79] Ritchie MD, Holzinger ER, Li R, et al.

Methods of integrating data to uncover

genotype–phenotype interactions. Nat Rev

Genet. 2015;16(2):85-97.

[80] Cho K, Van Merriënboer B, Gulcehre C, et

al. Learning phrase representations using

RNN encoder-decoder for statistical machine

translation. arXiv preprint arXiv:1406.1078.

2014.

[81] Yuan W, Lu M, Fu Y, et al. Challenges and

emerging directions in single-cell analysis.

Genome Biol. 2021;22(1):89.

[82] Hui ABY, Shi W, Boutros PC, Miller N,

Pintilie M, Fyles T, et al. Robust global

micro-RNA profiling with formalin-fixed

paraffin-embedded breast cancer tissues. Lab

Invest. 2009;89(5):597-606.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

Ashwani Kumar Aggarwal contributed to the

present research, at all stages from formulating the

problem to writing the paper.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding was received for conducting this study.

Conflict of Interest

The author has no conflicts of interest to declare

that are relevant to the content of this article.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en_US
