
problems with YOLO, the identification of small
objects in groups and the localization accuracy,
were intended to be addressed by YOLOv2 [14]. By
implementing batch normalization, YOLOv2 raises
the network's mean Average Precision. The addition
of anchor boxes, as suggested by YOLOv2, was a
considerably more significant improvement to the
YOLO algorithm. As is well known, YOLO predicts
one object for every grid cell. Although this
simplifies the constructed model, it causes problems
when a single cell contains several objects because
YOLO can only assign one class to the cell.
YOLOv2 removes this restriction by allowing the
prediction of several bounding boxes from a single
cell. To do this, the network predicts five
bounding boxes for each cell. YOLO9000
[14], which uses a network design similar to
YOLOv2, was presented as a technique to detect more
classes than the COCO object detection dataset
alone would allow. Although YOLO9000 has a
lower mean Average Precision than YOLOv2, it is
still a powerful algorithm because it can identify over
9000 classes. YOLOv3 [15] was proposed to enhance
YOLO with modern CNNs that utilize residual
networks and skip connections. YOLOv2 employs
DarkNet-19 as the model architecture, whereas YOLOv3
uses the significantly deeper DarkNet-53 as the model
backbone, giving a 106-layer network overall with
residual blocks and up-sampling layers. YOLOv3's
architectural innovation is that it makes predictions
at three different scales, using the feature maps
extracted at layers 82, 94, and 106.
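As a rough illustration of the difference between predicting anchor boxes on a single grid and predicting at three scales, the sketch below counts the candidate boxes produced in each case. The 416x416 input size, the strides of 32, 16, and 8, and the three anchor boxes per cell for YOLOv3 are assumptions taken from commonly used configurations and are not stated in this paper; the function name count_boxes is hypothetical.

```python
def count_boxes(input_size, strides, anchors_per_cell):
    """Return (grid sizes, total candidate boxes) for a square input image."""
    # Each stride yields one square grid of input_size // stride cells per side.
    grids = [input_size // s for s in strides]
    # Every cell of every grid predicts anchors_per_cell bounding boxes.
    total = sum(g * g * anchors_per_cell for g in grids)
    return grids, total


# YOLOv2-style: one grid, five bounding boxes per cell (stride 32 is assumed).
v2_grids, v2_total = count_boxes(416, [32], 5)

# YOLOv3-style: three scales, three boxes per cell (both values are assumed).
v3_grids, v3_total = count_boxes(416, [32, 16, 8], 3)

print(v2_grids, v2_total)  # [13] 845
print(v3_grids, v3_total)  # [13, 26, 52] 10647
```

Under these assumptions, the finer grids added by the multi-scale design account for most of the additional candidate boxes, which is one reason it is expected to help with small objects.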
YOLOv4 [16] follows recent research findings and is
built using CSPDarknet53 as the backbone, SPP
(Spatial Pyramid Pooling) and PAN (Path Aggregation
Network) as "the Neck," and YOLOv3 as "the Head."
This system uses the latest algorithm, YOLOv5, which
is built on the PyTorch framework and offers
advantages over YOLOv4 such as smaller size, higher
performance, and better integration.
2 Related Works
In [2] and [6], the authors present experimental
results showing that YOLOv4 achieved better F1 score,
precision, recall, and mAP values than the other
models compared in [2], and that YOLOv3 demonstrated
better performance and accuracy than R-CNN and
Fast R-CNN [6]. Yanhong
Yang [3] uses the SSD algorithm for vehicle
classification and positioning, and describes the
vehicle classification process in detail across
picture collection, picture calibration, model
training, and model detection. The PASCAL VOC
dataset was used, and the model was trained with the
TensorFlow framework using an SSD model with a VGG16
base network. In [2-9][11][12], the common vehicle
categories are bus, car, truck, and motorbike.
In [6-7], the main limitation was
how to effectively detect vehicles in complex
environments. Due to limitations of hardware and
time, in-depth research on improving detection
accuracy and calibration methods is left for future
work. A combination of YOLOv4 and
DeepSORT has been used in [7] for vehicle detection
and real-time object tracking, respectively.
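The precision, recall, and F1 score used to compare detectors in [2] follow the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN). The following is a minimal sketch of these definitions; the function name and the counts in the usage lines are hypothetical and are not results from any of the cited works.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 score from detection counts."""
    # Precision: fraction of predicted detections that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of ground-truth objects that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```

mAP additionally averages precision over recall levels and over classes, so it cannot be obtained from a single set of counts in this way.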
The work in [8] proposes a CNN model for vehicle
classification with low-resolution images from a
frontal perspective. The model was trained as a
multinomial logistic regression in which the
cross-entropy between the ground truth labels and the
model's predictions estimates the error. Data
augmentation was performed to prevent overfitting. A
leaky rectifier activation function (LReLU) was used
instead of ReLU for the convolution output. In addition, [10]
proposed a CNN architecture for vehicle type
classification. The system requires only one input, a
vehicle image. The model consists of two
convolution layers (the 1st and 2nd layers), two pooling
layers, and four ReLU activation functions; the 3rd,
4th, and 5th layers are fully connected. The network
developed in [12] has a total of 13
layers: 1 convolutional input layer, 11 intermediate
layers combining Rectified Linear
Unit (ReLU) activation, convolutional, dropout,
max-pooling, flatten, and densely connected layers,
and 1 Softmax output layer. In [6][11], the
datasets were gathered from public sources such as COCO,
OpenImage, and PASCAL VOC, while some works used
traffic data collected from camera sources. The
dataset split was 80:20, with 80% for training and 20% for
testing [6][9].
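The training objective and activation function described for [8] above can be sketched in a few lines. This is an illustrative sketch rather than the implementation used in [8]: the slope alpha = 0.01 for the leaky rectifier and the example class scores are assumptions, and the function names are hypothetical.

```python
import math


def leaky_relu(x, alpha=0.01):
    """Leaky rectifier: passes positive inputs, scales negative ones by alpha."""
    # alpha = 0.01 is an assumed slope; [8] may use a different value.
    return x if x > 0 else alpha * x


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Cross-entropy between a one-hot ground truth label and the prediction."""
    probs = softmax(logits)
    return -math.log(probs[true_class])


# Hypothetical scores for three vehicle classes; class 0 is the true label.
scores = [2.0, 0.5, -1.0]
loss = cross_entropy(scores, true_class=0)
print(loss)
```

The loss approaches zero as the probability assigned to the true class approaches 1, which is the sense in which the cross-entropy estimates the error of the multinomial logistic regression.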
In [1][12], the test data showed better
accuracy for high-definition pictures, while for
low-definition pictures the recognition accuracy
decreases. It is also observed in [1] that the
probability of identifying small cars as medium-sized
vehicles is only 8.69%, and the probability of
identifying them as large cars is lower, at only 2.14%.
Further improvements in prediction accuracy include
training on more high-quality images, to allow the model
to extract more features from the data, and further
dividing the data into more classes [12]. In [5][10], the
authors aim for better accuracy and stability
by searching for suitable hyperparameters. Research
gaps in [5-7] show the need to cover more variations
of vehicles, Cars Image datasets need more data to
classify, train, and real-time data analysis of the
International Journal of Computational and Applied Mathematics & Computer Science
DOI: 10.37394/232028.2023.3.3
Nadin Pethiyagoda, Mwp Maduranga,
Dmr Kulasekara, Tl Weerawardane