COVID-19 Medical Data Integration Approach
VIOLETA TODOROVA1, VESKA GANCHEVA2, VALERI MLADENOV1
1Department of Fundamentals of Electrical Engineering
2Department of Programming and Computer Technologies
Technical University of Sofia
Kliment Ohridski boul. 8, Sofia
BULGARIA
Abstract: - The need for automated methods of extracting knowledge from data arises from the
accumulation of large amounts of data. This paper presents a conceptual model for integrating and processing
medical data in three layers comprising a total of six phases, together with a model for integrating, filtering,
sorting and aggregating Covid-19 data. A medical data integration workflow was designed, including steps of data
integration, filtering and sorting. The workflow was applied to Covid-19 medical data from the clinical records
of 20,400 potential patients.
Key-Words: - Clinical Records, COVID-19, Data Analytics, Data Integration
Received: May 15, 2022. Revised: May 28, 2022. Accepted: June 19, 2022. Published: July 18, 2022.
1 Introduction
With the development of health care, science and
high technology, the amount of generated
information is growing rapidly [1]. As a result,
multiple heterogeneous data sets emerge that differ
in data types, storage formats and sources of
generation. The process of managing data from
different sources is called data integration. It is a
typical process in fields such as medicine, biology
and bioinformatics.
Big data in medicine includes biological,
biometric and electronic health data records [2].
Medical databases differ considerably in
terminology, record features and data presentation
[3]. This, in turn, causes problems when querying
multiple databases. Therefore, there is a need to
automate database integration so that it does much
more than simple data extraction and modification
[4, 5]. Records in different medical databases have
different formats. Integration requires the use of
common formats across databases, but high
dimensionality and redundancy make such
integration difficult.
During the data integration process, filtering
operations are performed to remove duplicate data,
convert data, or manage data. The data integration
model can also vary: extract, transform and load
(ETL); extract, load and transform (ELT); data
transformation; data replication; data virtualization;
or streaming data integration [6].
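The difference between the ETL and ELT variants named above lies only in the order of the steps. The following is a minimal Python sketch under stated assumptions: the functions `extract`, `transform` and `load`, the in-memory lists, and the record values are all hypothetical and are not part of the workflow described in this paper.

```python
# Toy records; the values are made up for illustration only.
source = [
    {"ID": "1", "AGE": "34"},
    {"ID": "1", "AGE": "34"},  # duplicate record
    {"ID": "2", "AGE": "61"},
]

def extract(rows):
    """Read records from the source (here: copy an in-memory list)."""
    return [dict(r) for r in rows]

def transform(rows):
    """Remove duplicate records and convert text fields to integers."""
    seen, out = set(), []
    for r in rows:
        key = (r["ID"], r["AGE"])
        if key not in seen:
            seen.add(key)
            out.append({"ID": int(r["ID"]), "AGE": int(r["AGE"])})
    return out

def load(rows, target):
    """Append records to the target store (here: a plain list)."""
    target.extend(rows)
    return target

# ETL: transform before loading, so the target only ever holds clean data.
etl_target = load(transform(extract(source)), [])

# ELT: load the raw data first, then transform inside the target.
elt_target = load(extract(source), [])
elt_clean = transform(elt_target)
```

Both orderings end with the same clean records; in the ELT case the target additionally keeps the raw, untransformed copy.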
This paper presents a conceptual model for
integrating and processing medical data in three
layers comprising a total of six phases, together with
a model for integrating, filtering, sorting and
aggregating Covid-19 data implemented in Talend
Open Studio [7].
2 Material and Methods
The structure of the proposed medical data
integration and processing model is illustrated in Fig.
1. The model is organized into three layers, each of
which brings together the tasks to be performed.
The first layer, "Data Management", prepares the
data for analysis, interpretation and visualization.
This preparation comprises three phases: medical
data collection, medical data storage and medical
data integration. The "Medical Data Collection"
phase depends on the data sources, the technical
devices providing visual data, and the specifics of
the generated data, including data types, data
formats, images and features. One of the main
sources of medical data is patient data obtained
from patient examinations: symptoms and personal
data including age, gender, medical history, etc.
Sensor data, omics data, electronic health data and
health records are also collected.
The second phase "Medical data storage" of the
proposed model is related to the data storage process.
Typically, clinical data is collected and stored in
various file formats such as “.xls”, “.xlsx”, “.csv”,
“.xlsm”, DICOM, etc. However, there are two main
MOLECULAR SCIENCES AND APPLICATIONS
DOI: 10.37394/232023.2022.2.11
Violeta Todorova, Veska Gancheva, Valeri Mladenov
E-ISSN: 2732-9992
102
Volume 2, 2022
problems with the data. First, the large amount of data
of different sizes, data types, and file formats requires
storage space and data processing tools. Second, the
variety of technical characteristics of the large
amount of data leads to heterogeneity.
In the third phase, "Medical Data Integration",
data collected from different sources is merged. This
includes cleaning, ETL/ELT, mapping and
transformation steps. The data integration process
covers data cleaning, unifying formats, grouping and
classification, regression and filtering, copying,
downloading, loading data warehouses, data
extraction, merging, aggregating, object and server
data management, and so on. Storing the data in a
data warehouse gives users quick access, in one
place, to data from different sources. This shortens
information retrieval time and makes it possible to
keep a large amount of data from previous periods.
The latter allows users to perform analysis over a
given period and make predictions about further
trends and events.
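Two of the integration steps listed above, unifying formats and merging, can be sketched in Python as follows. This is an illustrative sketch only: the two source layouts, the field mappings and the gender codes are hypothetical and do not come from the data set used in this paper.

```python
# Hypothetical field mappings from two source layouts to one target schema.
FIELD_MAP_A = {"patient_id": "ID", "sex": "GENDER", "age_years": "AGE"}
FIELD_MAP_B = {"id": "ID", "gender": "GENDER", "age": "AGE"}
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def unify(record, field_map):
    """Rename fields to the target schema and normalize value formats."""
    out = {target: record[source].strip() for source, target in field_map.items()}
    out["ID"] = int(out["ID"])
    out["AGE"] = int(out["AGE"])
    out["GENDER"] = GENDER_CODES[out["GENDER"].lower()]
    return out

def integrate(*sources):
    """Merge unified records from all sources, keeping one record per ID."""
    merged = {}
    for records, field_map in sources:
        for record in records:
            row = unify(record, field_map)
            merged.setdefault(row["ID"], row)  # first occurrence wins
    return [merged[key] for key in sorted(merged)]

# Made-up records from two differently formatted sources.
source_a = [{"patient_id": "2", "sex": "F", "age_years": "47"}]
source_b = [
    {"id": "1", "gender": "male", "age": "63 "},
    {"id": "2", "gender": "female", "age": "47"},
]
warehouse = integrate((source_a, FIELD_MAP_A), (source_b, FIELD_MAP_B))
```

Keeping the first record seen for each ID is one possible duplicate policy; a real integration would need a rule for conflicting values.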
The "Medical Data Processing" phase of the
second layer, "Data Analysis", covers the processes
and methods applied to the medical data. Data
processing involves manipulating the collected data
and performing functions and operations in order to
extract meaningful information. The operations
include validation, sorting, aggregation, analysis,
reporting and classification. Sorting arranges the
data according to the submitted requirements.
Aggregation combines multiple pieces of data.
Analysis transforms and models the data.
Classification separates the data into groups
according to requirements.
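As an illustration of the validation and aggregation operations described above, the following Python sketch filters out implausible records and then computes a per-group summary. The records are made up, and coding a negative COVID result as -1 is an assumption; the age bounds of 21 and 85 are the range reported for the data set in this paper.

```python
from collections import defaultdict

def validate(row):
    """Keep rows whose AGE lies in the 21-85 range reported for the data set."""
    return 21 <= row["AGE"] <= 85

def aggregate_mean_age(rows):
    """Group rows by the COVID flag and return (count, mean age) per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["COVID"]].append(row["AGE"])
    return {flag: (len(ages), sum(ages) / len(ages)) for flag, ages in groups.items()}

# Made-up records; COVID = -1 for "not infected" is an assumed coding.
rows = [
    {"ID": 1, "COVID": 1, "AGE": 70},
    {"ID": 2, "COVID": 1, "AGE": 50},
    {"ID": 3, "COVID": -1, "AGE": 40},
    {"ID": 4, "COVID": -1, "AGE": 130},  # fails validation
]
valid = [r for r in rows if validate(r)]
summary = aggregate_mean_age(valid)
```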
The "Medical Data Classification" phase involves
arranging data into groups based on predefined
criteria. Classification and clustering methods such
as k-Nearest Neighbor (kNN), k-Means, Support
Vector Machine (SVM), Artificial Neural Networks
(ANN), Convolutional Neural Networks (CNN),
Naive Bayes, etc. are applied for this purpose.
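Of the methods named above, kNN is the simplest to illustrate. The following is a minimal Python sketch, not the classification procedure of this paper: the function `knn_predict`, the blood pressure readings and the labels "normal" and "elevated" are all hypothetical and carry no clinical meaning.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up (SYSTOLIC, DIASTOLIC) readings with made-up labels.
train = [
    ((118, 76), "normal"),
    ((122, 80), "normal"),
    ((115, 74), "normal"),
    ((150, 95), "elevated"),
    ((160, 100), "elevated"),
    ((155, 98), "elevated"),
]

prediction = knn_predict(train, (120, 78))
```

The sketch uses Euclidean distance on unscaled features; with features on different scales, normalization would usually be applied first.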
The "Decision Making" phase is the last phase and
belongs to the "Problem Solving" layer. It builds
upon the results of the earlier phases, obtained using
different methods.
Figure 1: A conceptual model for medical data
integration
A workflow was developed for data integration,
filtering, sorting and aggregation of Covid-19 data
(Fig. 2). The raw medical data consisted of
information coming from clinical records covering a
period slightly longer than 10 months, from April
2020 to February 2021. The clinical records of
20,400 potential patients were used, of which 10,200
patients were infected with SARS-CoV-2, and the
remaining 10,200 hospital patients were not infected.
Medical records were linked to 10,232 women and
10,168 men, aged 21 to 85 years.
The data is organized into a structure of 17
variables of integer and text string data types,
arranged in the following order: “ID”, “GENDER”,
“AGE”, “COVID”, “COVID_SYMPTHOM”,
“HEART_DISEASE”, “HYPERTENSION”,
“DIASTOLIC”, “SYSTOLIC”, “DIABETS2”,
“HBA1C”, “CKD”, “CALCIUM”, “POTASSIUM”,
“PHOSPHORUS”, “CANCER”.
Each field of the proposed data integration and
processing model is configured by type, format and
length of data. Designed in Talend Open Studio, the
model performs four main tasks: data integration,
data filtering, data sorting and output.
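Configuring each field by type and length can be pictured as a small schema that every raw record must satisfy. The Python sketch below is an assumption-laden illustration: the class `FieldSpec`, the function `conform`, and the types and lengths chosen for ID, GENDER and AGE are hypothetical, since the paper does not state the per-field settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    py_type: type
    max_length: int

# Hypothetical specs for three of the variables; the actual types and
# lengths configured in the workflow are not given in the text.
SCHEMA = [
    FieldSpec("ID", int, 6),
    FieldSpec("GENDER", str, 1),
    FieldSpec("AGE", int, 3),
]

def conform(raw, schema=SCHEMA):
    """Convert a raw text record to typed values; raise ValueError on violations."""
    out = {}
    for spec in schema:
        text = raw[spec.name].strip()
        if len(text) > spec.max_length:
            raise ValueError(f"{spec.name}: '{text}' exceeds length {spec.max_length}")
        out[spec.name] = spec.py_type(text)  # int("abc") raises ValueError too
    return out

record = conform({"ID": "17", "GENDER": "F", "AGE": "42"})
```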
Figure 2: Medical Data Processing Workflow
[Figure 1 diagram labels: "Problem solving" layer (Making decision); "Data analysis" layer (Processing of medical data, Classification of medical data); "Data management" layer (Collection of medical data, Storage of medical data, Integration of medical data)]
3 Results and Discussion
The development of the data integration workflow
begins with the creation of a metadata file based on
a test database, represented by a .csv file containing
the COVID-19 data. The generated metadata file is
loaded at the input of the data integration workflow
via the tFileInputDelimited component. The names
of the variables used and the data types required for
the integration process are specified. All the
variables contained in the .csv file with statistical
data are selected for the given task. The tLogRow_1
component defines an output that shows the result of
loading the file with the selected attributes. Fig. 3
depicts the loading process and the number of
records processed. The resulting output (Fig. 4)
shows that the selected variables and records are
available for processing.
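For readers without access to Talend Open Studio, the loading step can be approximated with Python's standard `csv` module. This is a rough analogue, not the Talend component itself: the function `read_delimited`, the ";" delimiter, the four-column header and the two sample rows are assumptions made for illustration.

```python
import csv
import io

# Stand-in for the .csv file; the rows and the ";" delimiter are made up.
CSV_TEXT = "ID;GENDER;AGE;COVID\n1;F;34;1\n2;M;61;-1\n"
EXPECTED_COLUMNS = ["ID", "GENDER", "AGE", "COVID"]

def read_delimited(stream, expected_columns, delimiter=";"):
    """Read delimited rows, treating the first line as the header."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    if reader.fieldnames != expected_columns:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)

rows = read_delimited(io.StringIO(CSV_TEXT), EXPECTED_COLUMNS)

# Log the loaded rows, as a logging step placed after the input would.
for row in rows:
    print("|".join(row.values()))
print(f"{len(rows)} rows loaded")
```

All values are read as text here; converting them to the declared data types would be a separate step.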
Figure 3: Metadata file loading process
To achieve higher precision in the processing of
COVID-19 data, a filter has been added through the
tMap component. It contains the variables ID,
GENDER, AGE, COVID, COVID_SYMPTHOM,
HEART_DISEASE, HYPERTENSION,
DIASTOLIC, SYSTOLIC and DIABETS2. The
selection of variables is consistent with the medical
requirements for registering a COVID-19 illness.
According to these requirements, when the presence
of the virus is detected, the manifested symptoms are
reported, and patients are defined as high-risk in the
presence of heart disease, deviations in blood
pressure indicators, or diabetes.
The patient's age was also included for the analysis
of the patient's condition (Fig. 5).
Figure 4: Result of the metadata file loading process
Figure 5: Data processing filter design
A new variable, "PATIENT_DETAILS", is
defined for presenting patient details. It concatenates
the values of the "GENDER" and "AGE" variables,
with "," as the value separator. The following
expression implements this:
row1.GENDER + "," + row1.AGE
Through the developed filter, the diastolic and
systolic blood pressures are represented jointly in a
common variable, "DIASTOLIC_SYSTOLIC", with
"|" as the delimiter between the corresponding
values. For this purpose, the following expression is
used:
row1.DIASTOLIC + "|" + row1.SYSTOLIC
Fig. 6 shows the result of the action of the
developed filter.
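The effect of the filter on a single record can be reproduced outside Talend with a short Python sketch. The function `map_row` and the sample values are hypothetical, and passing the ten selected variables through unchanged alongside the two new ones is an assumption about the output schema.

```python
# The ten variables selected in the filter, spelled as in the data set.
SELECTED = [
    "ID", "GENDER", "AGE", "COVID", "COVID_SYMPTHOM",
    "HEART_DISEASE", "HYPERTENSION", "DIASTOLIC", "SYSTOLIC", "DIABETS2",
]

def map_row(row1):
    """Keep the selected variables and add the two concatenated ones."""
    out = {name: row1[name] for name in SELECTED}
    out["PATIENT_DETAILS"] = f"{row1['GENDER']},{row1['AGE']}"
    out["DIASTOLIC_SYSTOLIC"] = f"{row1['DIASTOLIC']}|{row1['SYSTOLIC']}"
    return out

# Made-up input record; HBA1C is present in the input but not selected.
row1 = {
    "ID": 1, "GENDER": "F", "AGE": 34, "COVID": 1, "COVID_SYMPTHOM": 1,
    "HEART_DISEASE": -1, "HYPERTENSION": -1, "DIASTOLIC": 80,
    "SYSTOLIC": 120, "DIABETS2": -1, "HBA1C": 5,
}
mapped = map_row(row1)
```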
After filtering the data, a sorting process is
applied. It is implemented through the tSortRow
component, which defines a schema of the variables
that can be processed when sorting and, among
them, the prioritized variables that will form the
output result (Fig. 7). The sorting component
defines the sorting rules for the given task. In the
case under consideration, to display all records of
patients with a positive COVID-19 test first, the
variable COVID is sorted in descending order. Thus,
the cases with the disease (marked with "1"), which
require intensive medical assistance, are displayed
as a priority. To
determine the degree of risk for patients with the
disease, the variables COVID_SYMPTHOM,
HEART_DISEASE, DIABETS2, AGE and ID have
been added, each configured in descending or
ascending order (Fig. 8). As a result, all cases of
COVID-19 characterized by severe symptoms are
prioritized and ordered by age. The result shows
records in which the presence or absence of heart
disease or diabetes is indicated by the values "1" and
"-1", respectively. To assess the patient's condition,
data on the presence or absence of hypertension,
data on systolic and diastolic blood pressure, and
additional information on the patient's gender and
age are included (Fig. 9).
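A multi-criteria sort of this kind can be sketched in Python with a composite key. The sort directions below are assumptions, since the text states only that COVID is sorted in descending order: here every flag and AGE are sorted descending and ID ascending. The records and the use of "1"/"-1" for COVID_SYMPTHOM are likewise made up.

```python
def sort_key(row):
    """Composite key; negation turns a numeric field into a descending criterion."""
    return (
        -row["COVID"],           # positive cases first (stated in the text)
        -row["COVID_SYMPTHOM"],  # assumed descending
        -row["HEART_DISEASE"],   # assumed descending
        -row["DIABETS2"],        # assumed descending
        -row["AGE"],             # assumed descending (oldest first)
        row["ID"],               # assumed ascending
    )

# Made-up records using the "1" / "-1" coding for presence and absence.
rows = [
    {"ID": 1, "COVID": -1, "COVID_SYMPTHOM": -1, "HEART_DISEASE": -1, "DIABETS2": -1, "AGE": 80},
    {"ID": 2, "COVID": 1, "COVID_SYMPTHOM": 1, "HEART_DISEASE": -1, "DIABETS2": 1, "AGE": 45},
    {"ID": 3, "COVID": 1, "COVID_SYMPTHOM": 1, "HEART_DISEASE": 1, "DIABETS2": -1, "AGE": 60},
    {"ID": 4, "COVID": 1, "COVID_SYMPTHOM": -1, "HEART_DISEASE": 1, "DIABETS2": 1, "AGE": 70},
    {"ID": 5, "COVID": 1, "COVID_SYMPTHOM": 1, "HEART_DISEASE": 1, "DIABETS2": -1, "AGE": 60},
]
ordered = sorted(rows, key=sort_key)
ordered_ids = [row["ID"] for row in ordered]
```

Because earlier criteria dominate later ones, AGE and ID only break ties among records that agree on all four flags.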
Figure 6: Result of the designed filter action
Figure 7: Configuration of the input and output
variables of the sorting component
Figure 8: Defining of rules for sorting by attributes
To store the result of the data integration process,
an .xls file containing the obtained data is generated.
For this purpose, the tFileOutputExcel_1 component
was used, connected to the output of the tLogRow_2
component, which displays the sorting result. The
component is configured to output and save the
result of the integration, with the first line treated as
the header.
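The output step can be approximated in Python as well. Note the substitution: Python's standard library cannot write .xls files, so this sketch writes CSV text instead. The function `write_with_header`, the chosen columns and the rows are hypothetical.

```python
import csv
import io

def write_with_header(rows, columns, stream):
    """Write the chosen columns of each row, with the column names as line 1."""
    writer = csv.DictWriter(stream, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

# Made-up rows standing in for the sorted result.
sorted_rows = [
    {"ID": 3, "COVID": 1, "PATIENT_DETAILS": "M,60", "AGE": 60},
    {"ID": 2, "COVID": -1, "PATIENT_DETAILS": "F,34", "AGE": 34},
]
buffer = io.StringIO()  # stands in for an output file opened with newline=""
write_with_header(sorted_rows, ["ID", "COVID", "PATIENT_DETAILS"], buffer)
output_text = buffer.getvalue()
```

Because the PATIENT_DETAILS values contain a comma, the CSV writer encloses them in quotes; a spreadsheet format stores each value in its own cell and avoids this issue.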
Figure 9: The result of applying a sort process
4 Conclusion
A conceptual model for medical data integration is
proposed, consisting of three layers and six phases.
The data management layer prepares the data for
analysis, interpretation and visualization, and
includes medical data collection, medical data
storage and medical data integration. A medical data
integration workflow was designed, including steps
of data integration, filtering and sorting. The
workflow was applied to SARS-CoV-2 medical data
from the clinical records of 20,400 potential
patients.
References:
[1] Chen P., Zhang C, Data-intensive applications,
challenges, techniques and technologies: A
survey on Big Data, Journal of Information
Sciences, 275:314–347, DOI:
10.1016/j.ins.2014.01.015.
[2] Mallappallil M, Sabu J, Gruessner A, Salifu M.
A review of big data and medical research.
SAGE Open Med. 2020;8:2050312120934839.
Published 2020 Jun 25.
doi:10.1177/2050312120934839.
[3] Chandra Sekhara Rao, DVLN Somayajulu,
Haider Banka, Sawrav Roy, Feature Binding
Technique for Integration of Biological
Databases with Optimized Search and Retrieve,
2nd International Conference on
Communication, Computing & Security
[ICCCS-2012], pp.622- 629.
[4] Paton N. et al. (eds.), Data Integration in the Life
Sciences: 6th International Workshop, DILS
2009, Manchester, UK, July 20-22, 2009,
Proceedings (Lecture Notes in Computer
Science / Lecture Notes in Bioinformatics),
Springer, ISBN-10: 3642028780, 2009.
[5] Zhang Zhang, Vladimir B. Bajic, Jun Yu, Kei-
Hoi Cheung and Jeffrey P. Townsend, Data
Integration in Bioinformatics: Current Efforts
and Challenges, Journal Bioinformatics
Trends and Methodologies, November, 2011,
pp. 41-56.
[6] Julyeta P.A. Runtuwene, Irene Tangkawarow,
C T M Manoppo, Salaki Reynaldo Joshua, A
Comparative Analysis of Extract,
Transformation and Loading (ETL) Process,
IOP Conference Series: Materials Science and
Engineering 306(1):012066, DOI:
10.1088/1757-899X/306/1/012066.
[7] Talend Open Studio
https://www.talend.com/products/talend-open-
studio/
Acknowledgments
The presented work was funded by the National
Science Fund, Ministry of Education and
Science, Republic of Bulgaria under contract
KP-06-N37/24, research project “Innovative
Platform for Intelligent Management and
Analysis of Big Data Streams Supporting
Biomedical Scientific Research”.
Creative Commons Attribution License 4.0 (Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US