Machine Learning Enabled Crop Recommendation System for
Arid Land
BATOOL ALSOWAIQ, NOURA ALMUSAYNID, ESRA ALBHNASAWI, WADHA ALFENAIS
SURESH SANKARANARAYANAN
Department of Computer Science
College of Computer Science and Information Technology
King Faisal University
Al Hofuf, KINGDOM OF SAUDI ARABIA
Abstract: The agriculture industry plays a significant role in the economy of many countries, and farming is widely regarded as an essential
profession. To increase agricultural production, crops are recommended based on soil, weather, humidity, rainfall, and other variables,
which benefits farmers as well as the nation. This paper explores the use of machine learning algorithms to recommend crops
for arid land based on features selected from a tropical climate where crops grow effectively. Five machine learning models have been
validated for recommending crops for arid land, with Random Forest emerging as the best model.
Keywords: Crop recommendation, Machine Learning, Random Forest, Humidity, Rainfall
Received: October 24, 2022. Revised: May 19, 2023. Accepted: June 20, 2023. Published: July 10, 2023.
1. Introduction
Food and agriculture are two of the most important
sources of livelihood for most people in the world. With
the escalation of climate change, it is becoming
increasingly important for farmers to optimize their
production in order to maximize their yields and reduce
their costs. However, traditional methods of selecting
and recommending crops are insufficient for determining
the best possible crop for a given environment. Choosing
the wrong crop results in lower productivity, and
families who rely entirely on agriculture then have a
difficult time surviving. A good amount of work has been
done in employing machine learning algorithms such as
SVM, Naïve Bayes, Decision Tree, CHAID and LEM2 for
selecting the best crop. Most of this work covers only one
or two crops and focuses on tropical climatic conditions,
as in India and other countries, because most agricultural
production depends on the availability of fertile soil and
a tropical climate.
Accordingly, we focus here on validating different
machine learning models, namely SVM, Naïve Bayes, KNN,
Decision Tree and Random Forest, in terms of accuracy
and error, towards developing a machine learning enabled
crop recommendation system for arid land. The vision of
the Kingdom of Saudi Arabia is to promote agriculture in
this region and to increase yield and productivity, so a
crop recommendation system driven by machine learning is
needed as a first step to benefit arid land. Secondly, we
have chosen a crop dataset with the features NPK, pH,
temperature, rainfall and humidity, which describe crop
growth in fertile soil under good climatic conditions. A
model trained on this dataset can then be used to select
the best crop from real-time data received from arid land.
This system would provide farmers with the
opportunity to consider various agricultural benefits, such
as improved yields, reduced costs, and improved
decision-making processes. The rest of the paper is
organized as follows. Section 2 discusses the work carried
out on crop recommendation using machine learning.
Section 3 describes crop recommendation using machine
learning and the models considered. Section 4 discusses how
the performance of the machine learning models implemented
for crop recommendation is measured. Section 5 tabulates
the results of the models, followed by discussion.
Section 6 gives the concluding remarks and future work.
2. Literature Review
A good amount of work has been carried out on crop
recommendation using machine learning algorithms.
The authors of [1] developed a crop recommendation
system using an ensemble technique with majority voting
over the machine learning models Random Tree, CHAID,
K-NN and Naïve Bayes, achieving an accuracy of 88%. The
model was shown to be adaptive and is used by farmers
through a GUI developed as a web portal. This study
provided great insight into the growth of the agriculture
sector by minimizing the wrong choices made by farmers
and increasing their productivity. A drawback of the work
is that it covers only the crops of a single district in
India. In addition, the accuracy of 88% was achieved on a
limited dataset. No correlation is shown among the
features used for crop classification, which is very
important for recommendation. Precision and recall were
also not computed, so the work does not show how
accurately the classification performs in terms of true
positives and true negatives.
International Journal of Environmental Engineering and Development
DOI: 10.37394/232033.2023.1.7
Batool Alsowaiq, Noura Almusaynid, Esra Albhnasawi,
Wadha Alfenais, Suresh Sankaranarayanan
E-ISSN: 2945-1159
56
Volume 1, 2023
The research in [2] describes an ensembling technique
that combines three machine learning models, Random
Forest, Naïve Bayes and Linear SVM, for crop
recommendation, resulting in improved crop productivity.
The system is based on soil-specific physical and chemical
characteristics such as NPK, pH, soil type, soil porosity,
average rainfall, sowing season and surface temperature.
The work focuses on two crop types, Kharif and Rabi, and
reports an accuracy of 99.91%. One limitation of this
work is the small number of crop types used for
recommendation, although the dataset for them is large.
Secondly, although the accuracy achieved is high, no
correlation is shown among the features used for crop
recommendation. The classification was also carried out
only as an ensemble of all three models. Precision and
recall were not computed, so the work does not show how
accurately the classification performs in terms of true
positives and true negatives.
Researchers in [3] deployed six machine learning
models: SVM, Decision Tree, Naïve Bayes, Random Forest,
KNN and Logistic Regression. The appropriate crop was
selected with a classification accuracy of 89.66% for SVM,
followed by KNN, Random Forest, Naïve Bayes, Decision
Tree and Logistic Regression. Once the crop is selected,
the system suggests the pests that would affect it,
followed by recommendations for pest control. This work
covers a variety of crops. The model did not achieve
higher accuracy because the number of samples is low for
some crops and high for others, that is, the number of
samples per crop is not consistent. The resulting
possibility of data imbalance has not been explored, and
the correlation of the soil features contributing to crop
selection has not been evaluated. Precision and recall
were not computed, so the work does not show how
accurately the classification performs in terms of true
positives and true negatives.
The research paper [4] proposes a methodology for
improving agricultural crop yield by accounting for the
soil micro and macronutrient levels to predict crop
suitability. To achieve this, fuzzy logic and rough set rule
induction were used to create rules for the dataset and
evaluate different algorithms for their accuracy. The
results of the evaluation found that the “LEM2” algorithm
gave the highest prediction accuracy of 89% for the
dataset without fuzzy logic, while the AQ algorithm
showed better accuracy than others with 3 linguistic
variables. The paper concluded that the proposed
methodology could be used in all situations, as it could
help farmers to determine the crop that best suits their soil
type and could reduce soil erosion, as farmers could shift
to less water-intensive crops when water availability is
low. Although the work considered 23 crops with 16
features for crop selection, no machine learning model
was deployed for selecting crops with higher accuracy.
3. Crop Recommendation Using Machine Learning
Based on the literature reviewed on machine learning
for crop recommendation from soil features and climatic
conditions, we work in this paper with a dataset collected
under a tropical climate in order to recommend crops for
arid land. The tropical dataset is chosen to train the
model on soil features for crop recommendation. The
trained and evaluated model would then be used to
recommend crops for arid land based on features collected
in real time. No arid land dataset for crop
recommendation is available, as agriculture in arid land
is very challenging and is one of the primary focuses of
the Kingdom of Saudi Arabia. Before going into the
results and analysis of the machine learning models, we
outline the methodology of the proposed work for the
following models: Support Vector Machine, Decision Tree,
Random Forest, K-Nearest Neighbor and Naïve Bayes. Each
model is explained in brief.
3.1 Random Forest Algorithm
A Random Forest builds many Decision Trees on
different samples of the data and combines them by
majority voting for classification or by averaging for
regression. It can handle categorical variables in the
case of classification. Our project addresses a
multiclass problem. Random Forest follows the principle
of Bagging, an ensemble method also known as Bootstrap
Aggregation. Each tree is trained independently on its
own sample and produces its own result. The final
decision is based on majority voting, where the results
of all trees are merged; this step is termed aggregation.
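The bagging and majority-voting steps can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the data rows are hypothetical, and the base learner is a toy stand-in (it predicts the most frequent label in its bag) where a real Random Forest would grow a decision tree.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common label among the given predictions."""
    return Counter(predictions).most_common(1)[0][0]

def train_toy_learner(rows):
    """Toy base learner: always predicts the most frequent label in its bag.
    A real Random Forest would grow a decision tree on the bag instead."""
    top_label = majority_vote([lbl for _, lbl in rows])
    return lambda features: top_label

rng = random.Random(0)
# Hypothetical (features, label) rows, not taken from the paper's dataset.
data = [((90, 42), "rice"), ((85, 58), "rice"),
        ((60, 55), "rice"), ((20, 30), "maize")]

# Each learner is trained independently on its own bootstrap sample.
forest = [train_toy_learner(bootstrap_sample(data, rng)) for _ in range(25)]

# Aggregation: every learner votes and the majority label wins.
votes = [learner((80, 40)) for learner in forest]
prediction = majority_vote(votes)
```

Because every learner sees a slightly different bag, individual errors tend to be averaged out by the vote, which is the motivation for bagging.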
3.2 Decision Tree
In a decision tree algorithm, the root node provides the
optimal first split. It is represented by the predictor
variable that best divides the dataset into two or more
subsets. Entropy or the Gini index is used to measure
the homogeneity of a split, and the reduction in entropy
gives the information gain obtained by splitting a node.
Each subsequent node is then split on the predictor
variable that yields the best homogeneity. Building a
decision tree continues until all the leaf nodes are
pure, meaning that the samples in each of them belong to
the same class. The process stops there, and the final
tree is used to make predictions.
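The homogeneity measures mentioned above can be written down directly. The following sketch uses the standard definitions of the Gini index, entropy and information gain; the labels are hypothetical and chosen so that the split is perfect.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with two classes, split into two pure children.
parent = ["rice", "rice", "maize", "maize"]
children = [["rice", "rice"], ["maize", "maize"]]

parent_gini = gini(parent)                 # 0.5 for a 50/50 node
gain = information_gain(parent, children)  # 1.0 bit for a perfect split
```

A tree-building algorithm evaluates candidate splits in this way and keeps the one with the lowest impurity (or highest gain) at each node.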
3.3 Naïve Bayes
The Naïve Bayes algorithm is a supervised learning
method for classification problems. It makes its
prediction based on probability, which is why it is
called a probabilistic classifier. The model is based on
Bayes' Theorem, which calculates the probability of a
hypothesis given the data.
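For reference, Bayes' Theorem for a class c and features x_1, ..., x_n can be written as below. The second step is the "naïve" part: it assumes the features are conditionally independent given the class, which is the standard assumption of this classifier and is not stated explicitly in the text above.

```latex
\[
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]
```

The predicted crop is the class c with the largest value of the right-hand side.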
3.4 Support Vector Machine
Support Vector Machine (SVM) is used for
classification and regression. The algorithm finds a
hyperplane in N-dimensional space that classifies the
data points. The hyperplane essentially divides the input
feature space into two classes. It becomes difficult to
visualize when the number of features exceeds three.
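How a hyperplane separates two classes can be illustrated with a linear decision function. The weights and bias below are hypothetical values picked for illustration; in a trained SVM they would be learned from the data so that the margin between the classes is maximized.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify(weights, bias, features):
    """Return +1 on one side of the hyperplane and -1 on the other."""
    return 1 if decision_value(weights, bias, features) >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0 in a two-feature space.
weights, bias = [1.0, -1.0], 0.0

side_a = classify(weights, bias, [3.0, 1.0])  # score  2.0 -> +1
side_b = classify(weights, bias, [1.0, 3.0])  # score -2.0 -> -1
```

The same function works unchanged for any number of features, even though the hyperplane can no longer be drawn.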
3.5 K-Nearest Neighbors
The “K-Nearest Neighbors (KNN)” algorithm is a simple,
“supervised learning” algorithm used for classification
and regression. The algorithm works by comparing an
input sample to the k nearest neighbors in the training
dataset. The k neighbors are determined based on their
proximity to the input sample, typically using Euclidean
distance as a measure of similarity. KNN is known as a
non-parametric algorithm because it does not require
assumptions about the underlying probability distribution
of the data, making it a popular choice for many
classification and regression problems.
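The procedure described above fits in a few lines. This is a minimal sketch with hypothetical (temperature, humidity) training rows, using Euclidean distance and a majority vote among the k nearest neighbors.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (temperature, humidity) rows, not taken from the paper's dataset.
train = [
    ((20.0, 80.0), "rice"),
    ((22.0, 82.0), "rice"),
    ((21.0, 79.0), "rice"),
    ((30.0, 40.0), "chickpea"),
    ((31.0, 38.0), "chickpea"),
]

humid_site = knn_predict(train, (21.0, 81.0), k=3)
dry_site = knn_predict(train, (30.0, 39.0), k=3)
```

There is no training step beyond storing the data, which is why KNN makes no assumption about the underlying distribution.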
4. Performance of Machine Learning Model for Crop Recommendation
For our work, we acquired a dataset as a CSV file
from Kaggle [5]. The dataset contains eight columns and
2200 records, giving 17600 fields. The columns are N, the
amount of nitrogen in the soil; P, the amount of
phosphorus in the soil; K, the amount of potassium in the
soil; temperature, in degrees Celsius; humidity, the
relative humidity in %; pH, the pH value of the soil;
rainfall, in mm; and label, the name of the crop class.
The datatype of each column is shown in Table 1. The Crop
Recommendation dataset covers 22 crops: Rice, Maize, Jute,
Cotton, Coconut, Papaya, Orange, Apple, Muskmelon,
Watermelon, Grapes, Mango, Banana, Pomegranate, Lentil,
Black Gram, Mung Bean, Moth Beans, Pigeon Peas, Kidney
Beans, Chickpea and Coffee.
Table 1 The datatype of the dataset's features.
Feature            Datatype
N (Nitrogen)       int64
P (Phosphorous)    int64
K (Potassium)      int64
temperature        float64
humidity           float64
pH                 float64
rainfall           float64
label              string
4.1 Data Visualization and analysis
For visualizing the dataset content and characteristics,
we used Python's Pandas profiling library, which automates
exploratory data analysis. It generates a dataset profile
report that provides valuable insights. With the profile
report, we can decide which variables to use and which
ones to drop. Parts of the generated report are
illustrated and discussed as follows:
Fig. 1 Detailed overview of the dataset-1.
The first section of the report is the overview, which
shows summary statistics of the dataset. As shown in
Fig. 1, the number of columns is 8, the number of rows is
2200, the number of missing values is 0, etc.
Fig. 2 Detailed overview of the dataset-2.
Fig. 2 shows the features that strongly affect the label
(crop), which are P, K, humidity, and rainfall. It also
flags that feature N contains zero values in 27 rows, but
this does not affect the dataset.
Fig. 3 Missing values View.
The Missing values view shows how many values are present
in each feature, as in Fig. 3. Every feature has 2200
values, matching the 2200 rows of the dataset, which
means there are no missing values in our dataset.
Fig. 4 Correlation View.
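Fig. 4 shows the correlation view, which illustrates
the relationship between every pair of features. The sign
of a coefficient gives the direction of the relationship
(positive when two features increase together, negative
when one decreases as the other increases), while its
magnitude gives the strength: values near 1 or -1 indicate
a strong relationship and values near 0 a weak one.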
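To make the reading of a correlation coefficient concrete, the sketch below computes the Pearson coefficient, a common choice for such correlation views, by hand. The feature values are hypothetical and only illustrate a strongly positive and a perfectly negative relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature columns, not taken from the paper's dataset.
p_values = [10, 20, 30, 40]
k_values = [12, 22, 29, 41]   # rises together with p_values
inverse = [40, 30, 20, 10]    # falls as p_values rises

r_positive = pearson(p_values, k_values)  # close to +1: strong, same direction
r_negative = pearson(p_values, inverse)   # -1: strong, opposite direction
```

Both coefficients describe strong relationships; only the direction differs.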
Having visualized the dataset and the correlation of its
features, we next evaluate the machine learning models.
For this, the dataset is split into training and testing
sets, a basic step of data preprocessing that allows the
performance of a model to be measured on data it has not
seen during training. We take 80% of the dataset for
training and the remaining 20% for testing.
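An 80/20 split can be sketched as follows. The shuffle and the seed value are assumptions made for reproducibility of the example; the text does not say how the rows were selected for each set.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 2200 records of the dataset (row indices only).
rows = list(range(2200))
train_rows, test_rows = train_test_split(rows)
```

With 2200 records this yields 1760 training rows and 440 testing rows, and no row appears in both sets.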
After training a machine learning model on the training
set, it is important to use evaluation metrics to measure
its performance. Different families of machine learning
algorithms have different evaluation measures. We discuss
only classification evaluation metrics, since our selected
dataset, Crop Recommendation, belongs to this category.
Commonly used metrics of this type [6] are:
Confusion matrix
Accuracy
Precision
Recall
Specificity
F1 Score
A. Confusion Matrix
It is a tabular summary of the numbers of correct and
incorrect predictions made by the ML classifier, used to
measure its performance. It is a useful tool for
visualizing the quantities behind Accuracy, Precision,
Recall, and the AUC-ROC curve. It is an n × n table, where
n indicates the number of classes in the dataset. For a
given class, it consists of four parts [8]:
1) True positive (TP): It indicates the situation
where both the actual value and the predicted
value by the ML model are positive.
2) True negative (TN): It indicates the situation
where both the actual value and the predicted
value by the ML model are negative.
3) False positive (FP): It indicates a situation where
the actual value is negative, but the predicted
value by the ML model is positive.
4) False negative (FN): It indicates the situation
where the actual value is positive, but the
predicted value by the ML model is negative.
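For a multiclass problem such as crop recommendation, the four counts are obtained per class by treating that class as positive and all other classes as negative. A minimal sketch with hypothetical labels:

```python
def class_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels, not results from the paper.
actual    = ["rice", "rice",  "maize", "jute", "maize", "rice"]
predicted = ["rice", "maize", "maize", "jute", "maize", "rice"]

rice_counts = class_counts(actual, predicted, "rice")
maize_counts = class_counts(actual, predicted, "maize")
```

Here one rice sample is misclassified as maize, so it counts as a false negative for rice and as a false positive for maize.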
4.2 Accuracy
It is the ratio of the correct predictions made by
the ML model to the total number of instances
evaluated. It is a valid measure when a dataset is
balanced, that is, when each class has a similar
proportion of instances. It is misleading in the opposite
situation, where the number of instances per class
differs widely. The following is the accuracy formula:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
4.3 Precision
It is a useful metric for an imbalanced dataset.
It is the ratio of correct positive predictions to the
total number of instances classified in the positive class.
The following is the precision formula:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
4.4 Recall
Recall, also called Sensitivity, is a helpful measurement
when a dataset is imbalanced and the minority class is
positive. It is the ratio of correct positive
predictions to the dataset's overall number of positive
instances. The following is the recall formula:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
4.5 Specificity
It is the counterpart of Sensitivity for the negative
class, and is helpful when the minority class of an
imbalanced dataset is negative. It is the ratio of
correct negative predictions to the overall number
of negative instances in the dataset. The following is the
specificity formula:
\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \]
4.6 F1-Score
It is useful when it is important to avoid both false
positives and false negatives. This metric is the harmonic
mean of the recall and precision values. The following is
the F1-Score formula:
\[ F1\ \mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
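The five formulas of this section can be checked together with a short function. The counts passed in below are hypothetical and chosen only to sum to a test set of 440 instances; they are not results from this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics of this section from the four confusion-matrix counts.
    Assumes every denominator is non-zero; no zero-division handling is included."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for one class.
scores = classification_metrics(tp=18, fp=2, fn=1, tn=419)
```

Note that F1 lies between precision and recall and is pulled towards the smaller of the two, which is the property of the harmonic mean.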
5. Result and Discussion
We have evaluated five machine learning models,
namely SVC, Naïve Bayes, Decision Tree, Random Forest
and KNN, for recommending different crops based on the
correlated features. The models have been evaluated in
terms of precision, recall, F1 score and accuracy, which
are tabulated in Table 2. The training and testing
accuracy of each model is tabulated in Table 3. Fig. 5
shows the models' performance.
Table 2 Comparison between the performance scores of the models.
Model            Accuracy   Precision   Recall    F1
SVC              98.36%     98.36%      98.63%    98.40%
Decision Tree    98.18%     98.18%      98.25%    98.20%
Random Forest    99.45%     99.45%      99.60%    99.50%
KNN              98.00%     98.14%      98.10%    98.10%
Naïve Bayes      99.09%     99.10%      99.30%    99.15%
Table 3 Models' training and testing accuracy.
Model            Train accuracy   Test accuracy
SVC              97.88%           98.36%
Decision Tree    100%             98.18%
Random Forest    100%             99.45%
KNN              98.79%           98.00%
Naïve Bayes      99.58%           99.09%
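One simple way to read Table 3 is to look at the gap between training and testing accuracy, since a large positive gap is a common sign of overfitting. The sketch below uses the values of Table 3; the gap itself is an illustrative check and not necessarily the procedure the authors followed.

```python
# (train accuracy %, test accuracy %) per model, copied from Table 3.
table3 = {
    "SVC": (97.88, 98.36),
    "Decision Tree": (100.0, 98.18),
    "Random Forest": (100.0, 99.45),
    "KNN": (98.79, 98.00),
    "Naïve Bayes": (99.58, 99.09),
}

# Gap in percentage points between training and testing accuracy.
gaps = {model: round(train - test, 2) for model, (train, test) in table3.items()}

# Model with the highest testing accuracy.
best_model = max(table3, key=lambda model: table3[model][1])
```

For Random Forest the gap is 0.55 percentage points, and it also has the highest testing accuracy of the five models.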
Fig. 5 Comparison plot between the models based on the
accuracy.
6. Conclusion and future work
The rapid advancement of technology has seen the
emergence of machine learning algorithms, which could
help revolutionize agricultural production. Selecting the
wrong crops leads to lower productivity, and families
that rely entirely on this income then have a difficult
time surviving. This paper discussed the possibility of
employing machine learning algorithms to develop a crop
recommendation system, which helps farmers make
informed decisions on which crops to grow based on
factors such as climate and soil quality. This paper has
validated five machine learning algorithms, SVC, Decision
Tree, Random Forest, KNN and Naïve Bayes, for crop
recommendation. Random Forest achieved the highest
accuracy of 99.45%, followed by Naïve Bayes, SVC,
Decision Tree and KNN respectively. We chose Random
Forest as the best model based on accuracy, which is the
ratio of correct predictions made by the ML model to the
total number of instances evaluated and is a valid
measure here because the Crop Recommendation dataset is
balanced. Moreover, we checked for overfitting and
underfitting, and we observe that the Random Forest
model generalizes well.
In future, this work can explore more sophisticated
machine learning algorithms, such as deep neural networks
with optimization, for better accuracy of the
recommendation system. The system can be integrated with
crop yield prediction to provide better productivity.
Additionally, the system could incorporate larger
datasets with additional features such as market prices
and weather forecasts to improve its predictive accuracy.
References
[1] S. Pudumalar, E. Ramanujam, R. H. Rajashree, C. Kavya, T.
Kiruthika and J. Nisha, "Crop recommendation system for
precision agriculture," 2016 Eighth International Conference on
Advanced Computing (ICoAC), Chennai, India, 2017, pp. 32-36,
doi: 10.1109/ICoAC.2017.7951740.
[2] N. H. Kulkarni, G. N. Srinivasan, B. M. Sagar and N. K. Cauvery,
"Improving Crop Productivity Through A Crop Recommendation
System Using Ensembling Technique," 2018 3rd International
Conference on Computational Systems and Information
Technology for Sustainable Solutions (CSITSS), Bengaluru, India,
2018, pp. 114-119, doi: 10.1109/CSITSS.2018.8768790.
[3] A. Kumar, S. Sarkar and C. Pradhan, "Recommendation System
for Crop Identification and Pest Control Technique in Agriculture,"
2019 International Conference on Communication and Signal
Processing (ICCSP), Chennai, India, 2019, pp. 0185-0189, doi:
10.1109/ICCSP.2019.8698099.
[4] A. M. Rajeswari, A. S. Anushiya, K. S. A. Fathima, S. S. Priya and
N. Mathumithaa, "Fuzzy Decision Support System for
Recommendation of Crop Cultivation based on Soil Type," 2020
4th International Conference on Trends in Electronics and
Informatics (ICOEI)(48184), Tirunelveli, India, 2020, pp. 768-773,
doi: 10.1109/ICOEI48184.2020.9142899.
[5] A. Ingle, "Crop recommendation dataset," Kaggle, 2020. [Online].
Available:
https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset
[Accessed: 11-Feb-2023].
[6] "A Guide to Evaluation Metrics for Classification Models |
Deepchecks", Deepchecks, 2022. [Online]. Available:
https://deepchecks.com/a-guide-to-evaluation-metrics-for-classification-models/
[Accessed: 27-Apr-2022].
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors contributed equally to the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US