A Comparison of Statistical Dependency and Functional Dependency
between Attributes Based on Data
HYONTAI SUG
Department of Computer Engineering,
Dongseo University,
47 Jurye-ro, Sasang-gu, Busan, 47011,
REPUBLIC OF KOREA
Abstract: - The chi-squared test is a standard statistical test to ascertain independence between categorical
variables. It is therefore recommended to do the test on the attributes in a dataset and remove any redundant
attributes before supplying the dataset to machine learning algorithms. But if we have many attributes, as is
common in real-world datasets, it is not easy to choose two attributes for the independence test. On the other
hand, several automated algorithms have been suggested to find functional dependencies based on data.
Because functional dependencies show many-to-one relationships between values of attributes, we can
conjecture that there might be statistical dependence in the found functional dependencies. To overcome the
problem of choosing appropriate attributes for statistical dependency tests, we may therefore use an algorithm
for automated functional dependency discovery. We want to confirm that the found functional dependencies can
show statistical dependence between attributes in real-world datasets. Experiments were performed on three
different real-world datasets using SPSS to confirm the statistical dependence of the functional dependencies
found by FDTool, an open-source tool for automated functional dependency discovery. The experiments
confirmed that there exists statistical dependence in the found functional dependencies and showed
improvements in decision trees after removing dependent attributes.
Key-Words: - Artificial intelligence, machine learning, classification, statistical independence, functional
dependency, knowledge modeling, preprocessing, relations, data tables
Received: May 23, 2021. Revised: August 29, 2022. Accepted: September 24, 2022. Published: October 25, 2022.
1 Introduction
Determining functional dependency is an important theoretical basis for the normalization of relational
databases. The purpose of normalization is to store one fact in one place in the database, thereby minimizing
the potential errors of storing data redundantly. The functional dependencies among the attributes of a relation
are determined by checking whether they have a many-to-one relationship over all the values that may appear
in the attributes, and such a decision is made by the database designer, [1]. On the other hand, there is a series
of research works that find such functional dependencies automatically from stored data. Finding functional
dependencies based on stored data may require a lot of computing time unless we apply some elaborate
algorithms, because there are exponentially many combinations of attributes to consider. Automated algorithms
try to find functional dependencies as efficiently as possible, and several efficient algorithms have been
suggested, [2]. But functional dependencies found from data may not be real functional dependencies. Let's see
an example. We have a shipment relation like table 1, where S# and P# mean supplier number and part number
respectively.
Table 1. A shipment relation
S# P# Quantity
S1 P1 100
S1 P2 100
S2 P1 200
S2 P2 200
Then, based on the data, we find functional dependencies like:
{S#, P#} -> {Quantity},
{S#} -> {Quantity},
{Quantity} -> {S#}.
But we know that only the first one is a real functional dependency, while the others are not.
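The many-to-one check that underlies such discovery can be sketched in a few lines of Python; the following is a minimal pandas sketch over the table 1 data, not the algorithm of any particular discovery tool.

import pandas as pd

# The shipment relation of table 1.
shipment = pd.DataFrame({
    "S#": ["S1", "S1", "S2", "S2"],
    "P#": ["P1", "P2", "P1", "P2"],
    "Quantity": [100, 100, 200, 200],
})

def fd_holds(df, lhs, rhs):
    # {lhs} -> {rhs} holds in the data when every combination of
    # LHS values is associated with exactly one RHS value.
    return (df.groupby(lhs)[rhs].nunique() <= 1).all().all()

print(fd_holds(shipment, ["S#", "P#"], ["Quantity"]))  # True
print(fd_holds(shipment, ["S#"], ["Quantity"]))        # True, but not a real FD
print(fd_holds(shipment, ["Quantity"], ["S#"]))        # True, but not a real FD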
On the other hand, a statistical method to check the dependency between categorical attributes is the
chi-squared test, [3]. It is recommended that when we have a 2 by 2 contingency table, all expected frequencies
should be > 5, and that for a larger table no more than 20% of all cells may have an expected frequency < 5,
with all expected frequencies > 1, [4]. For example, assume that we surveyed the replies of 100 men and
women to a policy. We can have a contingency table like table 2.
Table 2. A contingency table of a survey

        yes   no
men     28    16
women   21    35
total   49    51
We can determine whether the replies are dependent on sex with the chi-squared test.
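As an illustration, the test on table 2 can be reproduced with a few lines of Python; a minimal sketch using scipy (note that scipy applies Yates' continuity correction to 2 by 2 tables by default):

from scipy.stats import chi2_contingency

# Observed frequencies from table 2 (rows: men, women; columns: yes, no).
observed = [[28, 16], [21, 35]]

chi2, p, dof, expected = chi2_contingency(observed)
# If p < 0.05, we reject the null hypothesis that the reply is
# independent of sex.
print(chi2, p, dof)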
Because of their good understandability, decision trees are important data mining tools when humans want to
understand the constructed knowledge models, for example in the biomedical domain. This kind of dependency
check between conditional attributes is therefore important for the classification task of data mining with
decision trees, which are considered among the most important machine learning algorithms, [5]. The target
datasets of decision trees consist of conditional attributes and decisional attributes. The precondition on the
conditional attributes is that they are independent of each other and dependent on the decisional attributes only.
Decision trees have the property of low bias but high variance, [6], [7], which means that they reflect the
composition of the data very well; so, if there is some dependency between conditional attributes, we may have
more complex trees as the final knowledge models, [8], [9]. For this reason, J.R. Quinlan, in his book on C4.5,
recommended getting rid of dependency between conditional attributes with a chi-squared dependency check
before we begin to generate decision trees, [10]. But if we have many conditional attributes, we can have
exponentially many combinations of attributes, so statistical dependency checking can be very time-consuming.
Because functional dependencies found from data represent many-to-one relationships between attributes,
there could be regularities in the occurrence of the values of the attributes; in other words, there could be
statistical dependence between the attributes. Therefore, we want to find out the statistical significance of
functional dependencies in real-world datasets by doing the chi-squared test on the same attribute sets that are
found to have functional dependencies. As a result, we may save the time that an exhaustive statistical
dependency check would otherwise consume.
2 Related Work
Functional dependencies are used to determine the many-to-one relationship of values between the attributes of
relations or relation variables, [11], and serve as a criterion for normalization. That is, a relation variable is in
the second normal form if it is in the first normal form and every non-key attribute is fully functionally
dependent on the primary key, and a relation variable is in the third normal form if it is in the second normal
form and does not contain any transitive functional dependency. In a practical sense, it is recommended that a
relation variable be at least in the third normal form. Since database designers sometimes make mistakes in
database design, several algorithms have been proposed to discover functional dependencies using the data
stored in relations or data tables, and we can refer to the found information for further normalization. The
developed algorithms try to find functional dependencies as efficiently as possible, and several efficient
algorithms have been suggested, [2], [12], [13]. On the other hand, unlike traditional functional dependencies,
functional dependencies limited to specific values of attributes, called conditional functional dependencies, are
used for data cleaning purposes, [14], [15]. Moreover, approximate conditional functional dependencies
(ACFD), which can be applied to subsets of the tuples in a relation, have been suggested, [16]. The format of
the functional dependencies of ACFD is similar to that of class association rules, [17], [18].
Because functional dependencies represent many-to-one correspondences of attribute values, including
one-to-one relationships of values that appear in a relation, if there is a functional dependency, there may be
some statistical dependency between the two attributes, or between the two sets of attributes, that appear on the
left-hand side (LHS) and the right-hand side (RHS) of the functional dependency. The chi-squared test is a
well-known method to test statistical independence between categorical attributes. So, we may want to do
statistical tests for the functional dependencies found from data.
We need to discretize continuous or numerical attributes before we apply the chi-squared test, because the test
can be applied to categorical attributes only. Many discretization algorithms have been suggested, and García
et al. compared 80 different discretization algorithms, [19]. For example, a simple method of discretization is
called binning, which is the process of dividing a continuous variable into constant intervals and converting
each interval into a value of a new categorical variable. Among the many discretization methods, we will use a
method based on information entropy and the minimum description length (MDL) principle by Fayyad and
Irani, [20].
Their discretization method finds the best split in which the bins are as pure as possible, where the purity of a
bin is measured by the ratio of the values in the bin having the same class label, and the split is chosen by the
MDL criterion based on entropy. We chose this method because we want to see the effect of eliminating
redundant attributes with a well-known decision tree algorithm, C4.5, which is also based on entropy and the
MDL principle. C4.5 is known as one of the top 10 data mining algorithms, [21].
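For illustration only, the simple binning idea mentioned above can be sketched with pandas; our experiments use Fayyad and Irani's supervised method as implemented in Weka, not this equal-width scheme.

import pandas as pd

# Equal-width binning: divide a continuous attribute into 5 constant
# intervals and use the interval labels as categorical values.
ages = pd.Series([19, 23, 31, 37, 42, 55, 63, 70])
binned = pd.cut(ages, bins=5)
print(binned.value_counts().sort_index())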
3 Problem Formulation
To do an independence test that determines whether two categorical variables are statistically related or
independent of each other, the chi-squared test is often used, [22]. In statistics we usually use the terminology
categorical variables, while in computer science we use the terminology nominal attributes; the two
terminologies have the same meaning.
Let's see a simple example that needs the independence test. We have a categorical variable called educational
level with 5 categorical values, such as primary school graduate, secondary school graduate, high school
graduate, college degree, and graduate degree, and a categorical variable called annual income with 3 values,
the upper, middle, and lower class. We can use a chi-squared test for a problem such as determining whether
they are independent of each other or not, by creating a contingency table and doing the chi-squared test to see
the relation.
If the two variables that classify the data are called X and Y, and the variable X has m and the variable Y has n
categories or different values, then an m×n contingency table can be created. In the contingency table, the
element at row i and column j, called Oij, represents the number of observations corresponding to the ith
category of X and the jth category of Y.
Under the null hypothesis H0 that the two variables are independent, the test statistic

  χ² = Σi Σj (Oij − Eij)² / Eij,  where Eij = (total of row i × total of column j) / N,

is known to approximately follow the chi-squared distribution with (m−1)(n−1) degrees of freedom. If the
observations O11, O12, ..., Omn do not differ from the corresponding expected frequencies E11, E12, ..., Emn,
the value of the test statistic will be 0. Conversely, if the difference between the observed and the expected
frequencies is large, the value of the test statistic will also be large, so H0 will be rejected. That is, we can make
the statistical judgment that the two variables are not independent. Functional dependency based on data can be
defined as follows, [11].
Definition 1. Let r be a relation over the set of attributes U, and let A, B be any subsets of U. Then B is
functionally dependent on A, written A -> B, if and only if each A value in r is associated with precisely one B
value. □
Based on the above definition, we can make a contingency table for the relation r as follows: for the two sets of
attributes A and B over r, where the attribute set A has m and the attribute set B has n different values, an m×n
contingency table can be created, in which Oij represents the number of observations corresponding to the ith
value of A and the jth value of B.
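Such a contingency table can be built directly from the data table; the following is a minimal pandas sketch, where df is assumed to hold the relation and A and B are lists of column names.

import pandas as pd

def contingency(df, A, B):
    # Combine the columns of each attribute set into a single value
    # per row, then cross-tabulate the two combined columns to get
    # the m x n table of observed frequencies Oij.
    a = df[A].astype(str).agg("_".join, axis=1)
    b = df[B].astype(str).agg("_".join, axis=1)
    return pd.crosstab(a, b)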
In the case of a chi-squared test, there is the inconvenience of having to specify the two attributes on which to
perform the test. On the other hand, because an algorithm that finds functional dependencies from data
automatically can find all the functional dependencies, we may use the automated algorithm complementarily
with the chi-squared test to select the two attributes. However, when the size of the data is small, many more
functional dependencies will be found than for large data, and some of them can be fake functional
dependencies because of the many-to-one property of functional dependency, as we can see from the simple
example relation in table 1. Therefore, large datasets are preferred, since they yield more real functional
dependencies based on data. Moreover, datasets with few columns are not of interest to us, because our final
goal is to find simpler decision trees, and the complexity of the trees is affected largely by having many
conditional attributes.
M. Buranosky et al. implemented an automated program called FDTool that finds functional dependencies in
datasets of tabular form, [23]. FDTool is a Python-based open-source program to mine functional dependencies
and candidate keys; the number of attributes of the input data is limited to 26. For the statistical tests we will
use a well-known tool, IBM SPSS, [24], for our experiment. For discretization and decision tree training we
will use an open-source tool called Weka, [25].
3.1 Experimental Procedure
We want to check whether a given dataset consisting of conditional and decisional attributes for data mining
has functional dependencies and statistical dependencies between the conditional attributes. To check
functional dependencies we use FDTool; then, if some functional dependencies are found, we do chi-squared
tests for the sets of attributes in the found functional dependencies before we supply the dataset to generate
decision trees. The detailed procedure for our experiment is as follows:
EXPERIMENTAL PROCEDURE:
INPUT: a dataset D in tabular form.
BEGIN
1. Discretize continuous attributes if they are
contained in conditional attributes in D;
2. Find functional dependencies based on data
in conditional attributes using FDTool;
3. For each found functional dependency Do
Do the chi-squared test between LHS and
RHS of found functional dependency
unless RHS has only one value;
End For;
4. For each LHS and RHS of found functional
dependency that is decided to have
statistical dependency Do
Let S be LHS or RHS of the found
functional dependency;
Remove the columns of attributes in S
from D;
Generate a decision tree;
End For;
5. Select the best decision tree regarding the
size and accuracy from the result of 4.
END.
In the procedure, LHS and RHS mean the left-hand side and the right-hand side of the found functional
dependencies respectively. Because FDTool finds functional dependencies based on data, we may find
functional dependencies whose RHS consists of only one value, that is, a constant; in that case we cannot do a
chi-squared test, as indicated in step 3 of the procedure.
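As a minimal sketch of step 3 (an illustration under our assumptions, not FDTool's own code), the chi-squared tests over the found dependencies could be run with pandas and scipy, where found_fds is assumed to be a list of (LHS, RHS) column-name pairs such as [(["education"], ["education-num"])]:

import pandas as pd
from scipy.stats import chi2_contingency

def test_found_fds(df, found_fds, alpha=0.05):
    # Step 3 of the procedure: chi-squared test between the LHS and
    # the RHS of each found functional dependency, skipping any
    # dependency whose RHS is a constant.
    results = []
    for lhs, rhs in found_fds:
        left = df[lhs].astype(str).agg("_".join, axis=1)
        right = df[rhs].astype(str).agg("_".join, axis=1)
        if right.nunique() <= 1:  # RHS has only one value
            continue
        chi2, p, dof, _ = chi2_contingency(pd.crosstab(left, right))
        results.append((lhs, rhs, chi2, dof, p, p < alpha))
    return results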
4 Experimentation
Because FDTool limits the number of input attributes, and we want datasets having a mix of continuous and
nominal attributes as well as a large number of instances, there are not many public datasets that meet our
criteria. So, we chose three public datasets, called adult, bank, and credit approval, from the UCI machine
learning repository for our experiments, [26]. The datasets consist of 14 ~ 16 conditional attributes and one
decisional attribute. Because the chi-squared test can be done between nominal attributes, we need to discretize
any continuous attributes in the datasets. We apply FDTool after discretization to find functional dependencies
between conditional attributes, and we expect many functional dependencies based on the stored data in each
dataset.
4.1 Adult Dataset
The adult dataset is a census dataset whose purpose is to predict whether income exceeds $50K/year depending
on various aspects. The dataset has 48,842 records, 14 conditional attributes, and one decisional attribute,
named 'class', having two different values, >50K or <=50K. The 14 conditional attributes consist of numerical
and categorical attributes as in table 3. Categorical attributes have nominal values, while numerical attributes
have numbers, so we need to discretize the latter. We applied Fayyad and Irani's multi-interval discretization
method implemented in Weka; table 3 shows the results. In the notation, '(' means '<' and ']' means '<='; for
example, (21.5~23.5] means '21.5 < an interval <= 23.5'.
Table 3. Discretized conditional attributes of the adult dataset

ATTRIBUTE       VALUES
age             Numeric => (-∞~21.5], (21.5~23.5], (23.5~24.5], (24.5~27.5], (27.5~30.5], (30.5~35.5], (35.5~41.5], (41.5~54.5], (54.5~61.5], (61.5~67.5], (67.5~∞]
workclass       8 nominal values (Private, …, Never-worked)
fnlwgt          Numeric => 'all'
education       16 nominal values (Bachelors, …, Preschool)
education-num   Numeric => (-∞~8.5], (8.5~9.5], (9.5~10.5], (10.5~12.5], (12.5~13.5], (13.5~14.5], (14.5~∞]
marital-status  7 nominal values (Married-civ-spouse, …, Married-AF-spouse)
occupation      14 nominal values (Tech-support, …, Armed-Forces)
relationship    6 nominal values (Wife, …, Unmarried)
race            5 nominal values (White, …, Black)
sex             2 nominal values (Female, Male)
capital-gain    Numeric => (-∞~57], (57~3048], (3048~3120], (3120~4243.5], (4243.5~4401], (4401~4668.5], (4668.5~4826], (4826~4932.5], (4932.5~4973.5], (4973.5~5119], (5119~5316.5], (5316.5~5505.5], (5505.5~5638.5], (5638.5~6389], (6389~6457.5], (6457.5~6505.5], (6505.5~6667.5], (6667.5~7055.5], (7055.5~∞]
capital-loss    Numeric => (-∞~1551.5], (1551.5~1568.5], (1568.5~1820.5], (1820.5~1834.5], (1834.5~1846], (1846~1859], (1859~1881.5], (1881.5~1894.5], (1894.5~1927.5], (1927.5~1975.5], (1975.5~1978.5], (1978.5~2168.5], (2168.5~2176.5], (2176.5~2218.5], (2218.5~2310.5], (2310.5~2364.5], (2364.5~2384.5], (2384.5~2450.5], (2450.5~2469.5], (2469.5~3089.5], (3089.5~∞]
hours-per-week  Numeric => (-∞~34.5], (34.5~39.5], (39.5~41.5], (41.5~49.5], (49.5~61.5], (61.5~∞]
native-country  41 nominal values (United-States, …, Holand-Netherlands)
4.1.1 Checking Functional Dependencies for
Adult Dataset
14 functional dependencies in the conditional attributes were found, as follows:
{age} -> {fnlwgt}
{workclass} -> {fnlwgt}
{education} -> {education-num}
{education} -> {fnlwgt}
{education-num} -> {fnlwgt}
{marital-status} -> {fnlwgt}
{occupation} -> {fnlwgt}
{relationship} -> {fnlwgt}
{race} -> {fnlwgt}
{sex} -> {fnlwgt}
{capital-gain} -> {fnlwgt}
{capital-loss} -> {fnlwgt}
{hours-per-week} -> {fnlwgt}
{native-country} -> {fnlwgt}
Because a functional dependency between attributes represents a many-to-one relationship between attribute
values, the found functional dependencies show such relationships. But the attribute fnlwgt has only one value,
'all', as we can see from the third row of table 3, so the functional dependencies that have fnlwgt on the
left-hand side (LHS) or the right-hand side (RHS) are meaningless for the chi-squared test. Therefore, only the
functional dependency {education} -> {education-num} is left for the statistical test. Note that the attribute
education has 16 different values, while the attribute education-num has 7 different values, as in table 3. The
chi-squared test was done for the two attributes, education and education-num. The contingency table (cross
table) of the two attributes can be summarized as in table 4. Note that the original cross table is 16×7, but only
one column in each row of the table has a non-zero value. So, for notational convenience and easy
understanding, it is summarized as shown in table 4, which shows that the values of the attribute education
have a many-to-one relationship to the attribute education-num.
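The summarization of table 4 can be reproduced from the cross table itself; a sketch, with the file name and column layout of the UCI adult data assumed:

import pandas as pd

# Load the adult dataset (the UCI file is assumed to have no header row).
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "class"]
adult = pd.read_csv("adult.data", header=None, names=cols,
                    skipinitialspace=True)

# Cross table of education against education-num; stacking it and
# keeping the non-zero cells yields (education, education-num,
# frequency) rows like those of table 4.
ct = pd.crosstab(adult["education"], adult["education-num"])
summary = ct.stack()
print(summary[summary > 0])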
Table 4. Corresponding values in the cross table of the two attributes, education and education-num, in the adult dataset

education      education-num   frequency
10th           (-∞~8.5]        1389
11th           (-∞~8.5]        1812
12th           (-∞~8.5]        657
1st-4th        (-∞~8.5]        247
5th-6th        (-∞~8.5]        509
7th-8th        (-∞~8.5]        955
9th            (-∞~8.5]        756
Assoc-acdm     (10.5~12.5]     1601
Assoc-voc      (10.5~12.5]     2061
Bachelors      (12.5~13.5]     8025
Doctorate      (14.5~∞]        594
HS-grad        (8.5~9.5]       15784
Masters        (13.5~14.5]     2657
Preschool      (-∞~8.5]        83
Prof-school    (14.5~∞]        834
Some-college   (9.5~10.5]      10878
TOTAL                          48842
Table 5 shows the result of the chi-squared test of the two attributes, education and education-num. Because
the value of the asymptotic significance is 0.0 and the Pearson chi-square is very large, we can decide that the
two attributes are dependent.
Table 5. The result of the chi-squared test of the two attributes, education and education-num, in the adult dataset

                     Value      Degree of freedom   Asymptotic significance (2-sided)
Pearson Chi-square   293052.0   90                  0.0
4.1.2 Generating Decision Trees
Table 6 shows the properties of the decision tree generated by J4.8, the Java implementation of C4.5 in Weka,
from the original but discretized adult dataset. All experiments are performed with 10-fold cross-validation.
Table 6. Decision tree from the discretized adult dataset

Number of leaves   522
Size of the tree   584
Accuracy           86.743%

Confusion matrix   Predicted >50K   Predicted ≤50K
Actual >50K        7226             4461
Actual ≤50K        2014             35141
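The attribute-removal experiments can be sketched as follows; this uses scikit-learn's CART decision trees as a stand-in for Weka's J4.8, and assumes X is a data frame of the (one-hot encoded) conditional attributes and y is the class attribute.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def accuracy_without(X, y, dropped_columns):
    # Drop the columns of the redundant attribute and estimate the
    # accuracy of the resulting tree with 10-fold cross-validation.
    X_reduced = X.drop(columns=dropped_columns)
    tree = DecisionTreeClassifier(random_state=0)
    return cross_val_score(tree, X_reduced, y, cv=10).mean()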
We then generate decision trees for the dataset with selected attributes only. Table 7 shows the properties of
the decision tree generated from the adult dataset in which the attribute education-num, the RHS of the
functional dependency {education} -> {education-num}, is omitted.
Table 7. Decision tree from the discretized adult dataset in which the education-num attribute is omitted

Number of leaves   577
Size of the tree   644
Accuracy           86.6447%

Confusion matrix   Predicted >50K   Predicted ≤50K
Actual >50K        7143             4544
Actual ≤50K        1976             35179
Table 8 shows the properties of the decision tree generated from the adult dataset in which the attribute
education, the LHS of the functional dependency {education} -> {education-num}, is omitted.
Table 8. Decision tree from the discretized adult dataset in which the education attribute is omitted

Number of leaves   516
Size of the tree   578
Accuracy           86.7515%

Confusion matrix   Predicted >50K   Predicted ≤50K
Actual >50K        7218             4469
Actual ≤50K        2002             35153
If we compare table 7 and table 8, omitting the attribute education has the better effect, reducing the size of the
tree as well as slightly increasing the accuracy. Note that the values of the attribute education have more
variety than those of the attribute education-num. In other words, we have many-to-one relationships in the
functional dependency {education} -> {education-num}, and there is a variety of them, as summarized in table
9. This is the reason why we get a better decision tree when we omit the attribute education.
Table 9. Attribute values of education and education-num that make up the many-to-one relationship

education                          education-num   frequency
10th, 11th, 12th, 1st-4th,
5th-6th, 7th-8th, 9th, Preschool   (-∞~8.5]        6408
Assoc-acdm, Assoc-voc              (10.5~12.5]     3662
Bachelors                          (12.5~13.5]     8025
Doctorate, Prof-school             (14.5~∞]        1428
HS-grad                            (8.5~9.5]       15784
Masters                            (13.5~14.5]     2657
Some-college                       (9.5~10.5]      10878
TOTAL                                              48842
4.2 Bank Dataset
The purpose of the bank dataset is to predict whether a client will subscribe to a term deposit, for the direct
marketing campaigns of a Portuguese banking institution. The dataset has 45,211 records, 16 conditional
attributes, and one decisional attribute, named 'y', having two different values, yes or no. The 16 conditional
attributes have a variety of values, as in table 10, where the values of the numeric attributes are discretized.
Table 10. Discretized conditional attributes of the bank dataset

ATTRIBUTE                                                                    VALUES
age                                                                          Numeric => (-∞~25.5], (25.5~29.5], (29.5~60.5], (60.5~∞]
job                                                                          12 nominal values (admin., …, services)
marital                                                                      3 nominal values (divorced, married, single)
education                                                                    4 nominal values (Primary, …, unknown)
default                                                                      2 nominal values (no, yes)
balance                                                                      Numeric => (-∞~46.5], (46.5~105.5], (105.5~1578.5], (1578.5~∞]
housing (housing loan)                                                       2 nominal values (no, yes)
loan (personal loan)                                                         2 nominal values (no, yes)
contact                                                                      3 nominal values (cellular, …, unknown)
day                                                                          Numeric => (-∞~1.5], (1.5~4.5], (4.5~9.5], (9.5~10.5], (10.5~16.5], (16.5~21.5], (21.5~25.5], (25.5~27.5], (27.5~29.5], (29.5~30.5], (30.5~∞]
month (last contact month of year)                                           12 nominal values (jan, ..., dec)
duration (last contact duration, in seconds)                                 Numeric => (-∞~77.5], (77.5~130.5], (130.5~206.5], (206.5~259.5], (259.5~410.5], (410.5~521.5], (521.5~647.5], (647.5~827.5], (827.5~∞]
campaign (number of contacts performed during this campaign)                 Numeric => (-∞~1.5], (1.5~3.5], (3.5~11.5], (11.5~∞]
pdays (number of days that passed by after the client was last contacted)    Numeric => (-∞~8.5], (8.5~86.5], (86.5~99.5], (99.5~107.5], (107.5~177.5], (177.5~184.5], (184.5~203.5], (203.5~316.5], (316.5~373.5], (373.5~∞]
previous (number of contacts performed before this campaign)                 Numeric => (-∞~0.5], (0.5~∞]
poutcome (outcome of the previous marketing campaign)                        4 nominal values (failure, …, success)
4.2.1 Checking Functional Dependencies for
Bank Dataset
One functional dependency was found in the
conditional attributes based on the dataset as follows:
{poutcome, pdays} -> {previous}
Based on the found functional dependency between conditional attributes, we can do a chi-squared test
between the attribute sets {poutcome, pdays} and {previous}. Before giving the input to SPSS, we combined
the values of the two attributes, pdays and poutcome, row by row into a single attribute named pdays_poutcome,
as sketched below. The chi-squared test was then done for the attributes. The cross table of the attributes can be
summarized as in table 11. Even though pdays has 10 nominal values and poutcome has 4 nominal values,
there are no rows corresponding to the values (177.5~184.5]unknown, (203.5~316.5]unknown,
(316.5~373.5]unknown, (8.5~86.5]unknown, and (99.5~107.5]unknown. As a result, table 11 has only 35 rows.
Table 11 shows the many-to-one relationship between the attribute pdays_poutcome and the attribute previous,
except for the second row, which represents a one-to-one correspondence of values. We can see that the
summary of the cross table of the bank dataset has a very simple many-to-one relationship of values compared
to that of the adult dataset in table 4.
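A sketch of the row-by-row combination in pandas, with the file name and separator of the UCI bank data assumed:

import pandas as pd

# The UCI bank data is assumed to be a semicolon-separated file.
bank = pd.read_csv("bank-full.csv", sep=";")

# Concatenate the (discretized) pdays value and the poutcome value of
# each row into the single attribute pdays_poutcome used for the test.
bank["pdays_poutcome"] = bank["pdays"].astype(str) + bank["poutcome"].astype(str)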
Table 11. Corresponding values in the cross table of the two attributes, pdays_poutcome and previous, in the bank dataset

pdays_poutcome          previous    frequency
(-∞~8.5]failure         (0.5~∞]     10
(-∞~8.5]unknown         (-∞~0.5]    36954
(-∞~8.5]other           (0.5~∞]     83
(-∞~8.5]success         (0.5~∞]     15
(107.5~177.5]failure    (0.5~∞]     971
(107.5~177.5]unknown    (0.5~∞]     1
(107.5~177.5]other      (0.5~∞]     318
(107.5~177.5]success    (0.5~∞]     124
(177.5~184.5]failure    (0.5~∞]     275
(177.5~184.5]other      (0.5~∞]     68
(177.5~184.5]success    (0.5~∞]     287
(184.5~203.5]failure    (0.5~∞]     346
(184.5~203.5]unknown    (0.5~∞]     1
(184.5~203.5]other      (0.5~∞]     131
(184.5~203.5]success    (0.5~∞]     173
(203.5~316.5]failure    (0.5~∞]     1116
(203.5~316.5]other      (0.5~∞]     402
(203.5~316.5]success    (0.5~∞]     123
(316.5~373.5]failure    (0.5~∞]     1377
(316.5~373.5]other      (0.5~∞]     506
(316.5~373.5]success    (0.5~∞]     66
(373.5~∞]failure        (0.5~∞]     185
(373.5~∞]unknown        (0.5~∞]     2
(373.5~∞]other          (0.5~∞]     69
(373.5~∞]success        (0.5~∞]     67
(8.5~86.5]failure       (0.5~∞]     224
(8.5~86.5]other         (0.5~∞]     92
(8.5~86.5]success       (0.5~∞]     138
(86.5~99.5]failure      (0.5~∞]     285
(86.5~99.5]unknown      (0.5~∞]     1
(86.5~99.5]other        (0.5~∞]     139
(86.5~99.5]success      (0.5~∞]     420
(99.5~107.5]failure     (0.5~∞]     112
(99.5~107.5]other       (0.5~∞]     32
(99.5~107.5]success     (0.5~∞]     98
TOTAL                               45211
Table 12 shows the result of the chi-squared test of the two attributes, pdays_poutcome and previous. Because
the value of the asymptotic significance is 0.0 and the Pearson chi-square is very large, we can decide that the
two attributes are dependent.
Table 12. The result of the chi-squared test of the two attributes, pdays_poutcome and previous, in the bank dataset

                     Value     Degree of freedom   Asymptotic significance (2-sided)
Pearson Chi-square   45211.0   34                  0.0
4.2.2 Generating Decision Trees
Table 13 shows the properties of the decision tree from the original but discretized bank dataset. All
experiments are performed with 10-fold cross-validation.
Table 13. Decision tree from the discretized bank dataset

Number of leaves   671
Size of the tree   807
Accuracy           90.3032%

Confusion matrix   Predicted yes   Predicted no
Actual yes         2253            3036
Actual no          1348            38574
We remove attributes from the original dataset based on the found functional dependency {poutcome, pdays}
-> {previous}, in the following order: first pdays and poutcome together, then previous. Table 14 shows the
properties of the decision tree generated from the bank dataset in which the attributes poutcome and pdays are
omitted.
Table 14. Decision tree from the bank dataset minus the pdays and poutcome attributes

Number of leaves   835
Size of the tree   1033
Accuracy           89.8653%

Confusion matrix   Predicted yes   Predicted no
Actual yes         2212            3077
Actual no          1505            38417
Table 15 shows the properties of the decision tree generated from the bank dataset in which the attribute
previous is omitted.
Table 15. Decision tree from the bank dataset minus the previous attribute

Number of leaves   671
Size of the tree   807
Accuracy           90.3032%

Confusion matrix   Predicted yes   Predicted no
Actual yes         2253            3036
Actual no          1348            38574
If we compare table 13, which is from the original data, and table 15 above, we have the same result, so
removing the redundant attribute has no effect on reducing the size or improving the accuracy of the decision
tree. This is because the functional dependency is a very simple many-to-one relationship.
4.3 Credit Approval Dataset
The purpose of the credit approval dataset is to make positive or negative decisions for credit card applications.
The dataset has 690 records, 15 conditional attributes, and one decisional attribute, named 'class', having two
different values, + or -. The dataset is provided with all attribute names and values changed to meaningless
symbols to protect confidentiality. The 15 conditional attributes have a variety of values, as in table 16, where
the values of the numeric attributes are discretized.
Table 16. Discretized conditional attributes of the credit approval dataset

ATTRIBUTE   VALUES
A1          2 nominal values (a, b)
A2          Numeric => (-∞~38.96], (38.96~∞]
A3          Numeric => (-∞~4.2075], (4.2075~∞]
A4          4 nominal values (u, y, l, t)
A5          3 nominal values (g, p, gg)
A6          14 nominal values (c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff)
A7          9 nominal values (v, h, bb, j, n, z, dd, ff, o)
A8          Numeric => (-∞~1.02], (1.02~∞]
A9          2 nominal values (t, f)
A10         2 nominal values (t, f)
A11         Numeric => (-∞~0.5], (0.5~2.5], (2.5~∞]
A12         2 nominal values (t, f)
A13         3 nominal values (g, p, s)
A14         Numeric => (-∞~105], (105~∞]
A15         Numeric => (-∞~492], (492~∞]
4.3.1 Checking Functional Dependencies for
Credit Approval Dataset
Three functional dependencies were found in the
conditional attributes based on the dataset as follows:
{A4} -> {A5}
{A5} -> {A4}
{A11} -> {A10}
Based on the found functional dependencies between conditional attributes, we can do chi-squared tests
between the attribute sets {A4, A5} and {A10, A11}. The cross table of the attributes {A4, A5} can be
summarized as in table 17. When we read table 17, we may wonder whether there are one-to-many relations
between A5 and A4; but because we have found the two functional dependencies {A4} -> {A5} as well as
{A5} -> {A4}, the relationship must be one-to-one. In the table, the value '?' means an unknown value. Even
though A4 and A5 have 4 and 3 different values respectively, the dataset has only four combinations of values,
as in table 17.
Table 17. Corresponding values in the cross table of the two attributes, A4 and A5, in the credit approval dataset

A4      A5   frequency
?       ?    6
l       g    2
u       g    519
y       p    163
TOTAL        690
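The one-to-one correspondence can be double-checked directly in the data; a sketch, with the file name and column names of the UCI credit approval data assumed:

import pandas as pd

# The UCI file crx.data is assumed to have no header row; name the 15
# conditional attributes A1, ..., A15 and the decisional attribute 'class'.
names = ["A%d" % i for i in range(1, 16)] + ["class"]
credit = pd.read_csv("crx.data", header=None, names=names)

# Both {A4} -> {A5} and {A5} -> {A4} were found, so the mapping
# between A4 and A5 values should be one-to-one; check both directions.
a4_to_a5 = (credit.groupby("A4")["A5"].nunique() <= 1).all()
a5_to_a4 = (credit.groupby("A5")["A4"].nunique() <= 1).all()
print(a4_to_a5 and a5_to_a4)  # expected: True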
Table 18 shows the result of the chi-squared test of the two attributes, A4 and A5. Because the value of the
asymptotic significance is 0.0 and the Pearson chi-square is very large, we can decide that the two attributes are
dependent.
Table 18. The result of the chi-squared test of the two attributes, A4 and A5, in the credit approval dataset

                     Value    Degree of freedom   Asymptotic significance (2-sided)
Pearson Chi-square   1380.0   6                   0.0
The cross table of the attributes {A10, A11} can be summarized as in table 19. Even though A10 and A11 have
2 and 3 different values respectively, the dataset has only three combinations of values, as in table 19.
Table 19. Corresponding values in the cross table of the two attributes, A10 and A11, in the credit approval dataset

A11         A10   frequency
(-∞~0.5]    f     395
(0.5~2.5]   t     116
(2.5~∞]     t     179
TOTAL             690
Table 20 shows the result of the chi-squared test of the two attributes, A10 and A11. Because the value of the
asymptotic significance is 0.0 and the Pearson chi-square is very large, we can decide that the two attributes are
dependent.
Table 20. The result of the chi-squared test of the two attributes, A10 and A11, in the credit approval dataset

                     Value   Degree of freedom   Asymptotic significance (2-sided)
Pearson Chi-square   690.0   2                   0.0
4.3.2 Generating Decision Trees
Table 21 shows the properties of the decision tree from the original but discretized credit approval dataset. All
experiments are performed with 10-fold cross-validation.
Table 21. Decision tree from the discretized credit approval dataset

Number of leaves   18
Size of the tree   25
Accuracy           87.2464%

Confusion matrix   Predicted +   Predicted -
Actual +           255           52
Actual -           36            347
We remove attributes from the original dataset based on the found functional dependencies {A4} -> {A5} and
{A5} -> {A4}, in the following order: A4, then A5. Table 22 shows the properties of the decision tree
generated from the credit approval dataset whose attribute A4 is omitted.
Table 22. Decision tree from the credit approval dataset minus the A4 attribute

Number of leaves   17
Size of the tree   24
Accuracy           87.2464%

Confusion matrix   Predicted +   Predicted -
Actual +           255           52
Actual -           36            347
Table 23 shows the properties of the decision tree generated from the credit approval dataset whose attribute
A5 is omitted.
Table 23. Decision tree from the credit approval dataset minus the A5 attribute

Number of leaves   18
Size of the tree   25
Accuracy           87.2464%

Confusion matrix   Predicted +   Predicted -
Actual +           255           52
Actual -           36            347
Next, we remove attributes from the original dataset based on the found functional dependency {A11} ->
{A10}, in the following order: A10, then A11. Table 24 shows the properties of the decision tree generated
from the credit approval dataset whose attribute A10 is omitted.
Table 24. Decision tree from the credit approval dataset minus the A10 attribute

Number of leaves   19
Size of the tree   26
Accuracy           86.9565%

Confusion matrix   Predicted +   Predicted -
Actual +           256           51
Actual -           39            344
Table 25 shows the properties of the decision tree generated from the credit approval dataset whose attribute
A11 is omitted.
Table 25. Decision tree from the credit approval dataset minus the A11 attribute

Number of leaves   18
Size of the tree   25
Accuracy           87.2464%

Confusion matrix   Predicted +   Predicted -
Actual +           255           52
Actual -           36            347
If we compare tables 21 ~ 25, removing the redundant attribute A4 gives somewhat better results, as shown in
table 22. But because the functional dependencies are relatively simple, we do not get much improvement in
the decision trees. All in all, we can conclude that checking functional dependency to remove redundant
conditional attributes is statistically valid for the real-world datasets, and that we may get a better decision tree
when the found functional dependencies contain some variety of values in their many-to-one relationships and
the value of the Pearson chi-square is relatively large, as summarized in table 26.
Table 26. Pearson chi-square of the attributes in the datasets of the experiment

Attributes                                  Pearson Chi-square
education, education-num (adult dataset)    293052
pdays_poutcome, previous (bank dataset)     45211
A4, A5 (credit approval dataset)            1380
A10, A11 (credit approval dataset)          690
5 Conclusion
When we supply training datasets for the task of data mining, checking the independence between the
conditional attributes in the datasets is recommended as a preprocessing task. A well-known statistical method
for the task is the chi-squared test: we choose two categorical or nominal attributes, and we can determine their
independence easily. But if we have many attributes in the datasets, choosing an appropriate pair of attributes is
a challenging task. Therefore, an automatic tool that can help us choose appropriate attributes for the test would
save much time in our data mining task. On the other hand, functional dependencies represent many-to-one
correspondences between subsets of attributes in a relation or table, including one-to-one relationships of
values that appear in the relation. So, if there is a functional dependency, it is highly probable that there is a
statistical dependency between the two attributes, or between the two subsets of attributes, that appear on the
left-hand side and the right-hand side of the functional dependency. There are several algorithms for the
automated discovery of functional dependencies based on data, but there can be many fake functional
dependencies when there is not enough data. So it is very natural to check the statistical dependence of the
functional dependencies found with the automatic tools.
In this work, we have checked the statistical dependence by using three different public datasets to see the
statistical significance of the found functional dependencies, and we have generated decision trees after
removing dependent attributes. The experiments revealed that the found functional dependencies have
statistical dependence in the real-world datasets. Additionally, we found that the generated decision trees
showed some better results after dropping redundant or dependent attributes, especially when the found
functional dependencies have a variety of values in their many-to-one relationships and the value of the
Pearson chi-square is relatively large for the attributes. Future work could be, for generality, an extension of
FDTool so that it can be applied to datasets with more than 26 attributes.
Acknowledgement:
This work was supported by Dongseo University,
“Dongseo Frontier Project” Research Fund of 2021.
References:
[1] C.J. Date. Database Design and Relational
Theory: Normal Forms and All That Jazz, 2nd
ed., Apress, 2019.
[2] N. Asghar, A. Ghenai, Automatic Discovery of
Functional Dependencies and Conditional
Functional Dependencies: A Comparative
Study, University of Waterloo, April 2015.
[3] G.K. Kanji, 100 Statistical Tests, 3rd
ed., SAGE Publications Ltd, 2006.
[4] SPSS Tutorial: Chi-square test of independence,
https://libguides.library.kent.edu/spss/chisquare, 2022.
[5] B.T. Jijo, A.M. Abdulazeez, Classification
Based on Decision Tree Algorithm for
Machine Learning, Journal of Applied Science
and Technology Trends, Vol.2, No.1, 2021,
pp. 20-28.
[6] M. Belkin, D. Hsu, S. Ma, S. Mandal,
Reconciling modern machine-learning
practice and the classical bias-variance trade-
off, PNAS, Vol. 116, No. 32, 2019, pp. 15849-
15854.
[7] P. Tare, S. Mishra, M. Lakhotia, K. Goyal,
Bias Variance Trade-off in Classification
Algorithms on the Census Income Dataset,
International Journal of Computer
Techniques, Vol. 6, Issue 3, 2019, pp. 1-5.
[8] M. Robnik-Sikonja, I. Kononenko, Attribute
dependencies, Understandability and Split
Selection in Tree-Based Models, Proceedings
of the Sixteenth International Conference on
Machine Learning, 1999, pp. 344-353.
[9] R. Elshawi, M.H. Al-Mallah, S. Sakr, On the
Interpretability of Machine Learning-based
Model for Predicting Hypertension, BMC
Medical Informatics and Decision Making,
Vol.19, Article 146, 2019.
[10] J.R. Quinlan, C4.5: Programs for Machine
Learning, Elsevier, 2014.
[11] C.J. Date, An Introduction to Database
Systems, 8th ed., Pearson, 2003.
[12] L. Caruccio, S. Cirillo, V. Deufemia, and G.
Polese, Incremental Discovery of Functional
Dependencies with a Bit-vector Algorithm,
Proceedings of the 27th Italian Symposium on
Advanced Database Systems, 2019, pp. 146-
157.
[13] J. Liu, J. Li, C. Liu, and Y. Chen, Discover
dependencies from data a review, IEEE
Transactions on Knowledge and Data
Engineering, Vol. 24, No. 2, 2012, pp. 251-
264.
[14] P. Bohannon, W. Fan, F. Geerts, X. Jia, A.
Kementsietsidis, Conditional Functional
Dependencies For Data Cleaning, IEEE 23rd
International Conference on Data
Engineering, 2007, DOI:
10.1109/ICDE.2007.367920
[15] R. Salem, A. Abdo, Fixing Rules for Data
Cleaning Based on Conditional Functional
Dependency, Future Computing and
Informatics Journal 1, 2016, pp. 10-26.
[16] F. Azzalini, C. Criscuolo, L. Tanca, FAIR-DB:
Functional Dependencies to Discover Data
Bias, Proceedings of the EDBT/ICDT 2021
Joint Conference, 2021.
[17] D. Nguyen, L.T.T. Nguyen, B. Vo, W.
Pedrycz, Efficient Mining of Class
Association Rules with the Itemset Constraint,
Knowledge-Based Systems, Vol.103, 2016, pp.
73-88.
[18] M. Nasr, M. Hamdy, D. Hegazy, K. Bahnasy,
An Efficient Algorithm for Unique Class
Association Rule Mining, Expert Systems with
Applications, Vol. 164, 113978, 2021,
https://doi.org/10.1016/j.eswa.2020.113978
[19] S. García, J. Luengo, J. Sáez, V. López, F.
Herrera, A Survey of Discretization
Techniques: Taxonomy and Empirical
Analysis in Supervised Learning, IEEE
Transactions on Knowledge and Data
Engineering, 2013,
DOI:10.1109/TKDE.2012.35
[20] U.M. Fayyad, K.B. Irani, Multi-interval
discretization of continuous-valued attributes
for classification learning, Proceedings of the
Thirteenth International Joint Conference on
Artificial Intelligence, 1993, pp.1022-1027.
[21] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q.
Yang, H. Motoda, G.J. McLachlan, A. Ng, B.
Liu, P.S. Yu, Z. Zhou, M. Steinbach, D.J.
Hand, D. Steinberg, Top 10 algorithms in data
mining, Knowledge and Information Systems,
Vol. 14, 2008, pp. 1-37.
[22] J.N.K. Rao, A.J. Scott, The Analysis of
Categorical Data from Complex Sample
Surveys: Chi-Squared Tests for Goodness of
Fit and Independence in Two-Way Tables,
Journal of the American Statistical
Association, Vol. 76, No. 374, 1981, pp. 221-
230.
[23] M. Buranosky, E. Stellnberger, E. Pfaff, D.
Diaz-Sanchez, C. Ward-Caviness, FDTool: a
Python application to mine for functional
dependencies and candidate keys in tabular
form [version 2, peer review: 2 approved],
F1000Research 2019, 7:1667,
https://doi.org/10.12688/f1000research.16483.2
[24] A. Field, Discovering Statistics Using IBM
SPSS Statistics: North American Edition, 5th
ed., SAGE Publications Ltd., 2017.
[25] E. Frank, M.A. Hall, I.H. Witten, Data
Mining: Practical Machine Learning Tools
and Techniques, Morgan Kaufmann, Fourth
Edition, 2016.
[26] D. Dua and C. Graff, UCI Machine Learning
Repository [http://archive.ics.uci.edu/ml]
Irvine, CA, University of California, School
of Information and Computer Science, 2019.