Applications of Topological Data Analysis in Real Life
S. Z. RIDA1, ALAA HASSAN NORELDEEN2, FATEN. R. KARAR2
1Mathematics Department, Faculty of Science, South Valley University, Qena, EGYPT
2Mathematics Department, Faculty of Science, Aswan University, Aswan, EGYPT
Abstract: Statistical topological inference is a branch of algebraic topology that analyzes the global topological properties of the geometric structure underlying a point cloud dataset. There is an increasing need to analyze massive data sets and screen large databases to address real-world problems. A central challenge in modern applied mathematics is to develop tools that simplify high-dimensional data so that the important features and relationships can be extracted during analysis. Topological data analysis (TDA) inference is a growing field of study at the intersection of algebraic topology, computational geometry, and statistics. This study applies TDA tools to hypothesis testing between two high-dimensional data sets, one of the most important topics of statistical topological inference. Three TDA-based tests are discussed: hypothesis testing based on persistent homology, on persistence landscapes, and on density estimation. Moreover, a modification of these tests is proposed, together with a new test built on the nearest-neighbor distance function. A Monte Carlo simulation was conducted to compare the power of the tests. Finally, the tests were applied in two empirical applications within the biology field: we demonstrate their efficacy on the heart disease dataset from Statlog and the Wisconsin breast cancer dataset.
Keywords: Persistent homology; Topological data; Density estimation; Betti numbers; Persistent landscapes;
Hypothesis testing
2010 Mathematics Subject Classification: 55N35, 55U05, 62H99.
Received: August 31, 2022. Revised: October 1, 2022. Accepted: October 20, 2022. Available online: November 28, 2022.
1 Introduction
Topology, a field of mathematics that arose in an
attempt to describe global features of a space, can
provide new insights and tools for finding and
quantifying relationships in point cloud data.
Computational topology is particularly useful for
understanding data that are impractical to analyze
with standard statistical methods (e.g., canonical
correlation, principal component analysis, and
hierarchical clustering).
Topological data analysis combines techniques and
tools that allow researchers to discover and analyze
invariant structures in data, [1].
Those processes often take point cloud data as
input, commonly represented as a large finite dataset
in an $n$-dimensional metric space sampled from a
geometric object, perhaps with some noise. The
result is a set of data analyses and diagrams required
to evaluate the statistical properties of the data
accurately.
2 Simplicial Complexes
Simplicial complexes are used as the prime data
structure to represent topological spaces. Graphs are
commonly employed in many data analysis
applications since they store relationships between
data points. Simplicial complexes generalize the
notion of graphs by allowing for 2, 3, and higher
dimensional building blocks, called simplices.
2.1 Definition
Let $\mathbb{R}^n$ denote $n$-dimensional Euclidean space. Point
cloud data (PCD) is an unordered sequence of points
$\{x_1, x_2, \dots, x_m\}$ embedded in $\mathbb{R}^n$.
A simplicial complex on PCD is defined by
considering each point in the metric space as a
vertex of an approximation. An edge connects two
vertices based on their proximity, and higher-
dimensional simplices can then be defined on the
approximation in different ways. One of the most
commonly used complexes is the Vietoris-Rips
complex.
WSEAS TRANSACTIONS on MATHEMATICS
DOI: 10.37394/23206.2023.22.3
S. Z. Rida, Alaa Hassan Noreldeen, Faten R. Karar
E-ISSN: 2224-2880
22
Volume 22, 2023
To convert the point cloud data into a metric space
$(X, d)$, we use the points of the cloud as vertices of
the approximation, whose edges are determined by
proximity: two vertices are joined when they lie
within a specified scale $\epsilon$ under a distance metric $d$,
which satisfies the following conditions for all data
points $x_i, x_j, x_k$:
$d(x_i, x_j) \ge 0$, with $d(x_i, x_j) = 0$ if and only if $x_i = x_j$;
$d(x_i, x_j) = d(x_j, x_i)$; and $d(x_i, x_k) \le d(x_i, x_j) + d(x_j, x_k)$.
A simplicial complex $K$ is a finite collection of
simplices such that $\sigma \in K$ and $\tau$ a face of $\sigma$
implies $\tau \in K$, and $\sigma_1, \sigma_2 \in K$ implies $\sigma_1 \cap \sigma_2$
is either empty or a face of both. For
example, a 0-simplex is a point, a 1-simplex
is a line segment, a 2-simplex is a
triangle, and a 3-simplex is a tetrahedron
(Fig. 1). A lot more can be said about
simplices, but for our purposes
this will be satisfactory, [2].
Fig. 1: Shapes of $k$-simplices for $k = 0, 1, 2, 3$,
respectively.
Indeed, different constructions can be employed to filter the
simplicial complex, such as the Čech and Lazy Witness
complexes. The typical difficulty associated with any such
method is choosing a suitable $\epsilon$ that gives a decent
approximation to the structure underlying the point
cloud: for $\epsilon$ sufficiently small, the complex is a
discrete set; for $\epsilon$ sufficiently large, the complex is a
single high-dimensional simplex. There are many
notions of distance function that one can
reasonably use to obtain the Vietoris-Rips complex, [3],
such as the Euclidean distance
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
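As a sketch of the construction above, a Vietoris-Rips complex at a fixed scale can be built directly from pairwise Euclidean distances. The function name and the brute-force enumeration below are our own illustration, not the algorithm of any particular TDA library:

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, eps, max_dim=2):
    """Vietoris-Rips complex at scale eps: a simplex is included whenever
    every pairwise distance among its vertices is at most eps."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [[(i,) for i in range(n)]]          # 0-simplices: the vertices
    for k in range(1, max_dim + 1):
        simplices.append([
            s for s in combinations(range(n), k + 1)
            if all(d[i, j] <= eps for i, j in combinations(s, 2))
        ])
    return simplices

# Two nearby points and one outlier: only one edge appears at eps = 1.5.
cx = vietoris_rips([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], eps=1.5)
print(cx[1])  # [(0, 1)]
```

Growing `eps` fills in more simplices, which is exactly the filtration behaviour discussed above.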
3 Homology and Betti Numbers
Homology groups identify holes and loops
indirectly by examining the space around them,
while Betti numbers count the number of
distinct loops and holes. We begin building the
homology groups by examining structured sums of
simplices, which form an abelian group, [4].
3.1 Definition
The boundary of a $k$-simplex
$\sigma = [v_0, v_1, \dots, v_k]$ is defined as the formal sum of its
$(k-1)$-dimensional faces:
$\partial_k(\sigma) = \sum_{i=0}^{k} (-1)^i [v_0, \dots, \hat{v}_i, \dots, v_k]$, (1)
where $\hat{v}_i$ represents a point omitted from
the simplex. That is, for any vertices $a, b, c$:
for any 1-simplex, $\partial_1 [a, b] = [b] - [a]$; for any
2-simplex, $\partial_2 [a, b, c] = [b, c] - [a, c] + [a, b]$.
We may naturally extend the above definition to $k$-
chains by specifying the boundary of a $k$-chain
$c = \sum_i a_i \sigma_i$ as $\partial_k(c) = \sum_i a_i \partial_k(\sigma_i)$. We can thus establish
a family of boundary homomorphisms
connecting the various groups of $k$-chains of a
simplicial complex by mapping $k$-simplices to their
boundaries:
$\cdots \xrightarrow{\partial_{k+1}} C_k \xrightarrow{\partial_k} C_{k-1} \xrightarrow{\partial_{k-1}} \cdots \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0.$
Each $\partial_k$ is indeed a homomorphism, and by
construction we have the property
$\partial_k \circ \partial_{k+1} = 0.$
Such a sequence of chain groups and homomorphisms is
defined as a chain complex, denoted by
$C_*(K)$.
Hence, the $k$-th homology group can be
formulated as follows:
$H_k = Z_k / B_k = \ker \partial_k / \operatorname{im} \partial_{k+1}$,
where $Z_k = \ker \partial_k$ and $B_k = \operatorname{im} \partial_{k+1}$ denote the kernel
and the image of the boundary operator,
respectively. One can easily prove that $B_k \subseteq Z_k$,
and $Z_k \subseteq C_k$. Betti numbers are an
important feature linked with the homology group
because they convey relevant information about the
complex. The $k$-th Betti number $\beta_k$ represents the
number of $k$-dimensional independent holes
in $K$; thus, the number of connected components of $K$
is denoted as, the number of loops is denoted
as , the number of enclosed voids is denoted
as, (Fig. 2). Generally, can be computed as
follows:
󰇛󰇜󰇛󰇜󰇛󰇜,
since , thus .
Fig. 2: The circle has ,
The sphere has, , ,
The torus has  , .
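The rank formula can be checked numerically: over the rationals, $\beta_k = \dim C_k - \operatorname{rank} \partial_k - \operatorname{rank} \partial_{k+1}$, so Betti numbers follow from the matrix ranks of the boundary operators. A minimal sketch (the function name is ours; the boundary matrix encodes a hollow triangle):

```python
import numpy as np

def betti(boundary_maps, dims):
    """Betti numbers over the rationals via matrix ranks:
    beta_k = dim C_k - rank d_k - rank d_{k+1}
    (d_0 and the map above the top dimension are zero)."""
    ranks = [0 if m is None else int(np.linalg.matrix_rank(m))
             for m in boundary_maps]
    ranks.append(0)  # nothing above the top dimension
    return [int(dims[k] - ranks[k] - ranks[k + 1]) for k in range(len(dims))]

# Hollow triangle: vertices a, b, c and edges [a,b], [b,c], [a,c], no face.
# Columns of d1 are the boundaries d[a,b] = b - a, d[b,c] = c - b, d[a,c] = c - a.
d1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]], dtype=float)
print(betti([None, d1], dims=[3, 3]))  # [1, 1]: one component, one loop
```

Filling in the 2-simplex adds a column to $\partial_2$ and kills the loop, giving $\beta_1 = 0$.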
4 Persistent Homology
The concept of persistent homology was
developed in [5]. The main idea of persistence is
that topological characteristics which persist over a
considerable parameter range are signal features,
while short-lived characteristics can be ignored as
noise. Using persistent homology, one can avoid
choosing a single $\epsilon$; instead, we record the
interval of $\epsilon$ for which each feature occurs. In other
words, persistent homology is a method for
studying homology at multiple scales
simultaneously.
To see how persistent homology works,
assume that we have a sequence of Vietoris-Rips
complexes corresponding to a rising sequence of
parameters $\epsilon_1 \le \epsilon_2 \le \cdots \le \epsilon_r$. A chain of inclusion
maps exists as follows:
$VR(\epsilon_1) \hookrightarrow VR(\epsilon_2) \hookrightarrow \cdots \hookrightarrow VR(\epsilon_r).$
Instead of examining the homology of the individual
terms $VR(\epsilon_i)$, one examines the homology of the
iterated inclusions $VR(\epsilon_i) \hookrightarrow VR(\epsilon_j)$ for $i \le j$.
These chains reveal which features have long
persistence intervals: a feature is born at time $\epsilon_i$
if it is not in the image of the inclusion before time $\epsilon_i$,
whereas it dies entering time $\epsilon_j$ if it is no longer
supported by the inclusion map $VR(\epsilon_{j-1}) \hookrightarrow VR(\epsilon_j)$.
The birth at $\epsilon_i$ and death at $\epsilon_j$ of each feature are recorded
in the persistence diagram as an ordered pair
$(\epsilon_i, \epsilon_j)$.
Fig. 3 shows 18 uniformly randomly generated
points. We can observe from the figure
that at the smallest scale only ten of the points are connected;
at an intermediate scale almost all the points are
connected, giving birth to a circular hole; and at the largest scale
the Vietoris-Rips complex is complete.
Fig. 3: Example of persistent homology using a
Vietoris-Rips complex at three increasing values of $\epsilon$,
respectively.
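For dimension zero, the birth-death bookkeeping described above reduces to tracking connected components as $\epsilon$ grows. A minimal sketch with a Kruskal-style union-find over sorted edge lengths (an illustration of the idea, not the algorithm used by any specific TDA package):

```python
import numpy as np

def persistence_h0(points):
    """0-dimensional persistence of the Vietoris-Rips filtration: every
    component is born at eps = 0 and dies at the edge length that merges
    it into another component (union-find over edges sorted by length)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    diagram = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # one component dies here
            diagram.append((0.0, eps))
    diagram.append((0.0, float("inf")))     # the final component never dies
    return diagram

print(persistence_h0([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
# [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

Three collinear points give two finite deaths, at the two merge scales, plus one essential class.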
5 Persistent Landscapes
Persistence landscapes, introduced by Bubenik, [6],
can be considered a summary of the data contained
in the persistence diagram in a different form. The
basic advantage of persistence landscapes is that they
enable us to summarize the data with traditional
statistical indicators, such as the mean, median, and
variance, instead of a barcode plot or a persistence
diagram. A persistence landscape may be considered
a rotated version of a barcode diagram. To construct
the landscape, begin by building a triangle whose base
is a persistence interval $(b, d)$ and whose top vertex
lies at the intersection of the vertical line through the
midpoint $(\frac{b+d}{2}, 0)$ and the circle passing through
the endpoints, centered at the midpoint. Consequently,
an isosceles right triangle is formed with apex at
$(\frac{b+d}{2}, \frac{d-b}{2})$.
Furthermore, Bubenik proposed that the persistence
landscape descriptor $\lambda_k(t)$ is dependable for
statistical analysis and comparison studies. To obtain
$\lambda_k(t)$, it is required first to compute, for each
persistence pair $(b_i, d_i)$, the tent function
$f_{(b_i, d_i)}(t) = \max\bigl(0, \min(t - b_i, d_i - t)\bigr)$,
where $i$ runs from 1 to $N$ and $N$ is the number
of points in the persistence diagram. It is important
to emphasize that the landscape is produced separately
for each homology dimension. Then, $\lambda_k(t)$ is the $k$-th largest
value of $\{f_{(b_i, d_i)}(t)\}_{i=1}^{N}$ for the homology dimension
considered. At $k = 1$, $\lambda_1(t)$ may be understood
as the greatest possible radius of an interval
centered about $t$ that is contained in some persistence
interval. We may assume that persistence
landscapes represent an effective data analysis tool
in statistical topology with the above definition, [7].
Fig. 4 reveals the persistent landscapes for certain
points.
Fig. 4: The persistent landscapes diagram for
random data.
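Under the tent-function definition above, evaluating $\lambda_k$ on a grid is straightforward. A minimal sketch (the function name is ours):

```python
import numpy as np

def landscape(diagram, k, ts):
    """k-th persistence landscape (k = 1 is the topmost) on a grid ts.
    Each pair (b, d) contributes the tent f(t) = max(0, min(t - b, d - t));
    lambda_k(t) is the k-th largest tent value at t."""
    ts = np.asarray(ts, dtype=float)
    if k > len(diagram):
        return np.zeros_like(ts)     # fewer than k tents: landscape is zero
    tents = np.array([np.maximum(0.0, np.minimum(ts - b, d - ts))
                      for b, d in diagram])
    tents.sort(axis=0)               # ascending over the pairs
    return tents[-k]                 # k-th largest

dgm = [(0.0, 4.0), (1.0, 3.0)]
print(landscape(dgm, 1, [2.0]))  # [2.]: the widest tent peaks at 2
print(landscape(dgm, 2, [2.0]))  # [1.]: the second tent peaks at 1
```

Because $\lambda_k$ is an ordinary function, means and variances over samples of landscapes are well defined, which is what the tests in Section 6 exploit.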
6 Hypothesis Testing based on TDA
TDA inference is an emerging area of research at
the intersection of statistics, computational
geometry, and algebraic topology. The persistent
homology framework has been used to construct
statistical foundations for inference in the latest
work, e.g., [8], [9], and [10]. There are many
references about TDA, such as [11], [12], and [13].
Yet, this study focuses on the usage of TDA tools in
testing hypotheses between two high-dimensional
data sets.
a) Hypothesis Testing Based On Persistent
Homology
The authors in [14] designed a reliable test
for comparing two sets of persistence
diagrams, each containing a finite number of
diagrams. We may call this
situation a multivariate persistent homology test.
Since this topic is beyond our scope, we confine
ourselves to converting that test into the
univariate case.
The test statistic that may be used to compare two
persistence diagrams, based on [15], may be
expressed as:
$T_{PH} = W(X, Y)$, (2)
where $W(X, Y)$ is the Wasserstein distance
between the diagrams $X$ and $Y$, calculated using the
Hungarian algorithm. Let
$X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_m\}$ be the
points belonging to $X$ and $Y$.
The Hungarian algorithm requires two samples of
equal size, which is accomplished by adding $n$
points to the second sample and $m$ points to the first
sample, producing $n + m$ points in each sample.
The additional points are the orthogonal projections
of the opposite diagram's points onto the diagonal. The cost matrix is
then constructed, whose entries are the squared
Euclidean distances. Then, for each
row, the optimal column (with the least distance in that
column) is selected. Lastly, the Wasserstein distance (determined
independently for points of dimension zero, one,
two, and so on) is the total of the optimal matching distances,
the Hungarian method providing the
lowest cost value.
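The matching procedure just described can be sketched with scipy's Hungarian-algorithm solver, `linear_sum_assignment`; the diagonal augmentation and cost construction follow the steps above (the function name is ours, and this is an illustrative sketch rather than the exact implementation used in the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2(dgm_x, dgm_y):
    """2-Wasserstein distance between two persistence diagrams.
    Each diagram is augmented with the orthogonal projections of the
    other diagram's points onto the diagonal y = x, so both sides hold
    n + m points; the Hungarian algorithm finds the cheapest matching."""
    X = np.asarray(dgm_x, dtype=float).reshape(-1, 2)
    Y = np.asarray(dgm_y, dtype=float).reshape(-1, 2)
    proj = lambda D: np.column_stack([(D[:, 0] + D[:, 1]) / 2] * 2)
    Xa = np.vstack([X, proj(Y)])      # X plus diagonal copies of Y
    Ya = np.vstack([Y, proj(X)])      # Y plus diagonal copies of X
    # Cost matrix of squared Euclidean distances.
    cost = np.sum((Xa[:, None, :] - Ya[None, :, :]) ** 2, axis=-1)
    cost[len(X):, len(Y):] = 0.0      # diagonal-to-diagonal pairs are free
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].sum()))

# The point (0, 2) is matched to (0, 1); the diagonal copies pair up for free.
print(wasserstein2([(0.0, 2.0)], [(0.0, 1.0)]))  # 1.0
```

Matching a point to its own diagonal projection costs $(d-b)^2/2$, the squared perpendicular distance to the diagonal, so the assignment correctly trades off matching against discarding a point.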
Because the sampling distribution of $T_{PH}$ is unknown,
various nonparametric methods, such as jackknifing,
a permutation test, or bootstrapping, can be used to
obtain an empirical distribution for the test
statistic. For this technique to apply, the samples taken
should reflect their populations. Hence, Robinson
and Turner, [16], preferred to use a null
hypothesis significance test based on the permutation
technique to determine the significance of $T_{PH}$. The
permutation approach entails permuting the data in
the sample by shuffling their labels and computing
$T_{PH}$ for every permutation. The null distribution is
constructed by collecting $T_{PH}$ from the permuted data.
If the two groups are statistically identical, random
permutations applied to the observational data have
no effect; in this situation, the observed test statistic
falls within the range of the permuted values. The
following steps obtain a $p$-value using a
permutation test:
Data: $X$ and $Y$ with sample sizes $n$ and $m$,
respectively; the number of permutation samples is
denoted $P$.
Result: $p$-value for $T_{PH}$.
1. Calculate $T_{PH}$ from the original sample data.
2. For each of the $P$ permutations, split the pooled
labels at random into distinct groups of size $n$ and $m$.
3. Calculate $T_{PH}$ for each permuted sample and
record the results.
4. The $p$-value is the number of times the permuted
statistic is at least as large as the observed $T_{PH}$,
divided by $P$.
The main drawback of the test is that it depends
on all the points in the persistence diagram
without eliminating the noisy observations, [2].
Thus, our first proposed test operates on the
signal points of the persistence diagram only.
b) Hypothesis Testing Based On Persistent
Landscapes
The average of $\lambda_k(t)$ was used to create two new
statistical tests that may be employed to evaluate the
difference between two samples in the
high-dimensional case, [7]. To construct the first
test, define the average persistence landscape over all
persistence points, taking the dimension of the points
into account:
$\bar{\lambda}(t) = \frac{1}{N} \sum_{k=1}^{N} \lambda_k(t)$,
which yields a function on $[0, T]$. Hence, at a fixed $t$, we may
represent the test statistic as follows:
$T_{PL} = \dfrac{\bar{\lambda}_X(t) - \bar{\lambda}_Y(t)}{\sqrt{s_X^2/n + s_Y^2/m}}$ (3),
where $\bar{\lambda}_X(t)$ is the average of the $\lambda_k(t)$ corresponding
to the sample $X$. Although the exact sampling
distribution of $T_{PL}$ is unknown, Bubenik, [6], proved
that for large samples it is possible to use the
standard normal distribution as an asymptotic
distribution for $T_{PL}$, by the central limit theorem
and the law of large numbers.
In addition, he suggested that one could
evaluate $T_{PL}$ at all values of $t$ simultaneously using a
multivariate $t$-test or Hotelling's $T$-square test. The
second test statistic proposed by Bubenik in [6]
can be expressed as follows:
$T^2 = \frac{nm}{n+m} (\bar{\lambda}_X - \bar{\lambda}_Y)' S^{-1} (\bar{\lambda}_X - \bar{\lambda}_Y)$ (4),
where the landscapes are evaluated on a grid of $q$ points
in $[0, T]$ and $S$ is the pooled variance-covariance
matrix of order $q \times q$. The main drawback that
may be thrown at the $T_{PL}$ and $T^2$ tests is that they rely on
all the points in the persistence diagram, which
leads to all the landscape values being used. Thus,
our second and third proposed tests operate $T_{PL}$
and $T^2$ only on the significant points of the
persistence landscapes.
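The Hotelling computation on vectorized landscape values can be sketched as follows, assuming the landscapes of each sample have been evaluated on a common grid. The pooled-covariance form below is the textbook two-sample statistic, offered as an illustration rather than the exact implementation in the Hotelling R package:

```python
import numpy as np

def hotelling_t2(X, Y):
    """Two-sample Hotelling T-squared statistic. Rows of X and Y are
    samples; columns are the landscape values on a common grid of t."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n, m = len(X), len(Y)
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # Pooled variance-covariance matrix of the two groups.
    S = ((n - 1) * np.cov(X, rowvar=False) +
         (m - 1) * np.cov(Y, rowvar=False)) / (n + m - 2)
    S = np.atleast_2d(S)
    return float(n * m / (n + m) * diff @ np.linalg.solve(S, np.atleast_1d(diff)))

# Synthetic "landscape vectors": a clear mean shift gives a large statistic.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(30, 3))
Y = rng.normal(2.0, 1.0, size=(30, 3))
print(hotelling_t2(X, Y))   # large for well-separated groups
print(hotelling_t2(X, X))   # 0.0 for identical groups
```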
c) Hypothesis Testing Based On Density
Estimation
A density estimation approach is a nonparametric
method used to estimate the underlying continuous
distribution over a finite point set; it has been applied and
studied in various contexts. Versatile methods can
be adopted to estimate the density of the studied
data, [4]; however, $k$-Nearest Neighbors
($k$-NN) is adopted to estimate the density points for
our point cloud data. Although a broad range of
authors uses $k$-NN in classification and clustering,
we decided to utilize $k$-NN in testing whether the given
two groups of point cloud data are similar or not,
based on the following $k$-NN density estimator:
$\hat{\delta}(x) = \dfrac{k}{n \, v_d \, r_k(x)^d}$ (5)
where $n$ is the total number of points in the
dataset, $k$ is the number of points we want in our
neighborhood (often equal to 10% of $n$), $x$ is the vector
of our given point, $r_k(x)$ is the Euclidean distance
to the $k$-th nearest point, and $v_d$ is the volume of the
unit sphere in the dimension $d$ of the data, taking the
following expression:
$v_d = \dfrac{\pi^{d/2}}{\Gamma(d/2 + 1)}$. (6)
Herein, we can summarize the main idea of
our fourth test: first compute $\hat{\delta}(x)$ for the two
groups of the data, then calculate the following test
statistic:
$T_{NN} = \dfrac{\bigl| \bar{\hat{\delta}}_X - \bar{\hat{\delta}}_Y \bigr|}{\sqrt{\operatorname{Var}(\hat{\delta}_X) + \operatorname{Var}(\hat{\delta}_Y)}}$, (7)
where $\bar{\hat{\delta}}_X$ and $\bar{\hat{\delta}}_Y$ are the averages of the density
estimates for each group separately. Since the
sampling distribution of $T_{NN}$ is unknown, the critical
values can be obtained via the
permutation test, [4].
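Equations (5)-(7) can be sketched directly with numpy; the helper names are ours, and $k$ defaults to 10% of $n$ as suggested above:

```python
import numpy as np
from math import pi, gamma

def knn_density(points, k=None):
    """k-NN density estimate of Eq. (5) at every data point:
    delta(x) = k / (n * v_d * r_k(x)^d), with r_k(x) the distance to the
    k-th nearest neighbour and v_d the unit-ball volume of Eq. (6)."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    k = k if k is not None else max(1, round(0.1 * n))  # default: 10% of n
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dist.sort(axis=1)                 # column 0 is the point itself
    r_k = dist[:, k]                  # distance to the k-th nearest neighbour
    return k / (n * v_d * r_k ** d)

def density_test_stat(gx, gy):
    """Test statistic of Eq. (7): standardized gap between the mean
    density estimates of the two groups."""
    dx, dy = knn_density(gx), knn_density(gy)
    return abs(dx.mean() - dy.mean()) / np.sqrt(dx.var(ddof=1) + dy.var(ddof=1))

# A tight cluster has a much higher estimated density than a diffuse one.
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(50, 2))
spread = rng.normal(0.0, 5.0, size=(50, 2))
print(density_test_stat(tight, spread))
```

In practice the critical value for the printed statistic would come from a permutation test, as described above.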
7 Simulation Study
The practical performance of the above tests is
studied in this section. We compare the suggested
tests, i.e., the modified versions of $T_{PH}$, $T_{PL}$, and
$T^2$ together with $T_{NN}$, to the current tests
$T_{PH}$, $T_{PL}$, and $T^2$. The first three proposed tests
are computed based on the bottleneck distance, not
on the Hausdorff distance, as the latter has no
noteworthy effect on the tests. To implement the
comparison, we performed the above tests on
standard geometric objects, which may be produced
using the Geozoo package, then documented the $p$-
values for every test using the TDA
package, [4], [15]. When the two groups are
formed from identical geometric objects, the $p$-value
indicates the size of the test. Otherwise, the
power of the test is determined by the $p$-value.
Because it may be difficult to make theoretical
comparisons concerning the performance of past
experiments, one may turn to Monte Carlo
simulation, which is currently a widely employed
scientific technique for solving mathematically
intractable problems and avoiding high-cost experiments.
Even so, simulation has downsides: it may consume a
large amount of computational power, it cannot
provide perfect results, and the quality of the output
is determined by the model and inputs employed.
The comparison between statistical tests should be
conducted in several contexts, which may be stated
as follows:
1. Various sample sizes: For our simulation, we
run two distinct sample sizes, i.e., 50 and 100.
2. Data from various dimensional point clouds: In
this study, we decide to perform the simulation
at different dimensions of the generated data:
a. In two dimensions: the comparison is between a circle
with a radius equal to one and a normalized
square.
b. In three dimensions: the comparison is between a sphere
with a radius equal to two, and a torus with a
radius from the center equal to two.
c. In four dimensions: the comparison is between a flat
torus with a radius equal to one, and a Klein
bottle with an inner radius equal to one.
3. Different-dimension holes: the power and size of
every test are determined at several Betti
dimensions, allowing us to illustrate in which
dimensions the tests can completely represent
the object's topological properties.
4. Different levels of the peak: concerning $T_{PL}$ and
$T^2$, the power and size of every test are
computed at peak $k$ equal to one, two, and three,
and the tests $T_{PL}$ and $T^2$ are operated
simultaneously for each level.
Under the aforesaid settings, the Monte Carlo
simulation is run with 100
replicates (increasing the replicates does not change
the final results) and 100 permutations at 99
percent confidence, using the
Vietoris-Rips complex; these settings are compatible
with previous works, [8], [17]. Table 1 summarizes
and organizes the results. The overall results lead to a
number of conclusions, which are discussed in
detail:
1) The study shows that increasing the sample
size has a significant impact on the simulated
results: it increases the power of the tests and
reduces their size. Consequently, using these tests
with large sample sizes is recommended.
2) The Betti dimension is a factor that affects
the performance of all the tests. In general, at
high Betti dimensions, the final decision is
most likely to be correct. Conversely, the
dimension of the point cloud data has no systematic
impact on the presented results.
3) The tests based on persistence landscapes
may face some difficulties in real life, especially at
low sample sizes: in some situations, one cannot
compute the tests $T_{PL}$ and $T^2$ at high levels
of the peak $k$, which in turn prevents computing
their modified versions.
4) Another problematic issue associated with the
landscape-based tests is that different levels of $k$
yield a wide range of values, which may cause
some confusion for researchers during
decision-making. The results reveal that the
second and third levels of $k$ can be recommended
when the Betti dimension equals zero, but not
otherwise.
5) We observed the superiority of the power of
$T_{NN}$ over the remaining tests in almost all the
simulated cases. In contrast, its size is relatively far
from the ideal nominal level of 1%.
6) It is observed that the size of the tests based
on persistent homology is closer to the nominal level
of 1% than that of the tests based on persistence
landscapes and density estimation. In contrast, the
latter tests' power is superior.
This phenomenon reflects the fact that the tests
based on persistent homology tend to accept the null
hypothesis, whereas the tests based on either
persistence landscapes or density estimation tend to
reject it. Consequently, one can confidently
depend on the tests based on persistent homology in
the case of rejection, and on the tests based on persistence
landscapes or density estimation in the case of
acceptance.
7) Inherently, removing the points with short
lifetimes from the simulated persistence diagrams
improves, in most cases, the performance of the
tests. Therefore, future work should implement the
other noise-removal methods, [2], to study their
effect on the performance of the tests.
Table 1. Simulated size and power of the test statistics.
8 Real-Life Applications
TDA can be employed in a broad range of fields
since it is an extremely useful tool for evaluating
and analyzing massive amounts of data. TDA has
been widely used in the biology field over recent
decades. This work analyzes two empirical world
datasets on the Wisconsin breast cancer dataset from
the UCI repository (683 patterns) and the heart
disease dataset from Statlog (270 patterns). These
data were extensively cited and studied by other
researchers for different purposes. Therefore, it is
desirable to apply the above tests to
examine whether they can successfully
distinguish between patients with benign and
malignant breast cancer, and between patients with
and without heart disease.
The Wisconsin breast cancer dataset consists of one
predicted variable (benign-malignant) and nine
continuous attributes ranging from 1 to 10, on which
our analysis can depend. Furthermore, since the
points concerning benign patients take too
much time during the analysis, we select a
random sample of 240 points using the sequential
max-min landmark selection method implemented in JPlex.
However, the heart disease dataset includes one
predicted variable (absence-presence of heart
disease), six continuous attributes, and seven
dichotomous variables. Thus, the seven
dichotomous variables are omitted from the
analysis, and only the six continuous attributes are
utilized.
Fig. 5 reveals the persistent homology, the barcode
plots, and the probability distribution of the $k$-NN
density for the two datasets according to each
predicted variable.
However, Fig. 6 presents the persistence landscape
diagrams in different dimensions and at different
levels of $k$. Based on these figures, one can visually
compare the patients with benign and malignant
breast cancer. We found that the type of breast
cancer (benign-malignant) substantially
affects the shapes of the topological features.
Concurrently, the figures illustrate that the presence
or absence of heart disease has a weak effect on the
shapes of the topological features: at dimension zero, all
topological characteristics are identical, and the two
sampling distributions of the $k$-NN density are only
slightly different.
Conversely, Tables 2 and 3 display the $p$-
values associated with all the tests under study.
Characteristically, all the tests indicated a statistical
difference between patients with benign and
malignant breast cancer at a nominal level of 1%. In
contrast, for the heart disease dataset,
all the tests failed to reject the null hypothesis at
dimension zero and, for the most part, at dimension two.
Still, at dimension one, all the tests rejected the null
hypothesis, indicating a significant difference
between patients with and without heart disease
at a nominal level of 1%. One can notice,
surprisingly, that despite the similarity of the two
sampling distributions of the $k$-NN density in the
case of the heart disease dataset, the $T_{NN}$ test perfectly
distinguished between the two groups.
Consequently, in our view, $T_{NN}$ can be recommended
in practical life.
Table 2. The empirical $p$-values for the test statistics corresponding to the breast cancer database.
Table 3. The empirical $p$-values for the test statistics corresponding to the heart disease database.
Fig. 5: The topological features for the breast cancer and heart disease databases, respectively.
Fig. 6: The persistence landscapes for the breast cancer and heart disease databases, respectively.
9 Conclusion
In this paper, we presented TDA techniques for
hypothesis testing. A new test based on the nearest-
neighbour distance function was proposed, together
with suggested modifications of the existing TDA-
based tests. Across different patterns, a comparison
was performed among tests based on persistent
homology, tests based on a distance function, and
tests based on persistence landscapes, with respect to
two criteria: the test's size and power. According to
our observations, the size of the tests that depend on
persistent homology is much more appropriate;
however, the tests based on persistence landscapes
or the distance function have higher power than the
rest. Generally, all the TDA-based tests behave well
at dimension one, and increasing the sample size of
the point cloud data positively impacts all of the
tests. We demonstrated the efficacy of the above
tests on Statlog's heart disease dataset and the
Wisconsin breast cancer dataset. There is still much
work to be conducted in future studies: for instance,
generalizing the preceding tests to more than two
groups, comparing the various noise-removal
methods, [2], and performing an in-depth clustering
analysis based on TDA and evaluating it against
other existing statistical methods. As TDA
techniques improve, we expect many researchers to
apply topological analysis in their studies.
Acknowledgments:
The authors would like to thank the team of TDA,
namely Brittany Terese Fasy, Jisu Kim, and
Clement Maria, for their help, advice, vast expertise,
and willingness to share their time freely. Moreover,
a lot of gratitude to Dr. Fabrizio Lecci for her help
in sending her thesis.
References:
[1] Bubenik, Peter, and Nikola Milićević,
"Homological algebra for persistence modules",
Foundations of Computational Mathematics
21.5 (2021): 1233-1278.
https://doi.org/10.1007/s10208-020-09482-9.
[2] Fasy, Brittany Terese, et al. "Confidence sets for
persistence diagrams", The Annals of Statistics
(2014): 2301-2339. https://doi.org/10.1214/14-
AOS1252.
[3] Alaa, H. N., and S. A. Mohamed. "On the
topological data analysis extensions and
comparisons", Journal of the Egyptian
Mathematical Society 25.4 (2017): 406-413.
https://doi.org/10.1016/j.joems.2017.07.001.
[4] Fasy, Brittany Terese, et al. "Introduction to the
R package TDA", arXiv preprint
arXiv:1411.1830 (2014).
https://doi.org/10.48550/arXiv.1411.1830.
[5] Edelsbrunner, Herbert, David Letscher, and
Afra Zomorodian. "Topological persistence and
simplification", Proceedings 41st annual
symposium on foundations of computer science.
IEEE, 2000. DOI:10.1109/SFCS.2000.892133.
[6] Bubenik, Peter. "Statistical topological data
analysis using persistence landscapes", J.
Mach. Learn. Res.16.1 (2015): 77-102.
https://doi.org/10.48550/arXiv.1207.6437
[7] Balchin, Scott, and Etienne Pillin. "Comparing
Metrics on Arbitrary Spaces using Topological
Data Analysis", arXiv preprint
arXiv:1503.04619 (2015).
https://doi.org/10.48550/arXiv.1503.04619
[8] Carlsson, Gunnar. "Topology and data",
Bulletin of the American Mathematical Society
46.2 (2009): 255-308. DOI:10.1090/S0273-
0979-09-01249-X.
[9] Emmett, Kevin, et al. "Parametric inference
using persistence diagrams: A case study in
population genetics", arXiv preprint
arXiv:1406.4582 (2014).
https://doi.org/10.48550/arXiv.1406.4582
[10] Gamble, Jennifer, and Giseon Heo.
"Exploring uses of persistent homology for
statistical analysis of landmark-based shape
data", Journal of Multivariate Analysis 101.9
(2010): 2184-2199.
https://doi.org/10.1016/j.jmva.2010.04.016.
[11] Kim, Wonse, et al. "Investigation of flash
crash via topological data analysis", Topology
and its Applications 301 (2021): 107523.
https://doi.org/10.1016/j.topol.2020.107523.
[12] Dłotko, Paweł, and Thomas Wanner.
"Topological microstructure analysis using
persistence landscapes", Physica D: Nonlinear
Phenomena 334 (2016): 60-81.
https://doi.org/10.1016/j.physd.2016.04.015.
[13] Sizemore, Ann E., et al. "The importance of
the whole: topological data analysis for the
network neuroscientist", Network
Neuroscience 3.3 (2019): 656-673.
https://doi.org/10.1162/netn_a_00073.
[14] Artamonov, Oleg. "Topological methods for
the representation and analysis of exploration
data in oil industry", Diss. Technische
Universität Kaiserslautern, 2010.
urn:nbn:de:hbz:386-kluedo-25456.
[15] Curran, James. "Package 'Hotelling'", R
package (2017).
https://github.com/jmcurran/Hotelling.
[16] Robinson, Andrew, and Katharine Turner.
"Hypothesis testing for topological data
analysis", Journal of Applied and
Computational Topology 1.2 (2017): 241-261.
https://doi.org/10.1007/s41468-017-0008-7.
[17] Edelsbrunner, Herbert, and John L. Harer.
"Computational topology: an introduction",
American Mathematical Society, 2010.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors completed all aspects of the study
through diligent work and an analysis of numerous
sources and achievements in the field of
mathematics. The final version has been read and
accepted by all authors.
-S. Z. Rida, Mathematics Department, Faculty of
Science, South Valley University, Qena, Egypt
-Alaa Hassan Noreldeen, Mathematics Department,
Faculty of Science, Aswan University, Aswan,
Egypt.
-Faten. R. Karar, Mathematics Department, Faculty
of Science, Aswan University, Aswan, Egypt.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received.
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US