training data. Experimental results show that the proposed method improves model performance. The two pre-trained models initially showed 50% accuracy on the test data, and their performance continued to improve as data selected from the unlabeled pool were labeled and learned through active learning. The proposed method is expected to contribute to increasing the efficiency of the training data generation and labeling process for deep learning models.
The paper is organized as follows. Section 2 describes related concepts and research. Section 3 describes the structure of the proposed technique, and Section 4 presents experimental results. Finally, Section 5 concludes with a discussion and directions for future work.
2 Related Work
2.1 Active Learning
Active learning is a form of machine learning in which a model learns by selecting, on its own, the data from which it will learn. Machine learning models are typically trained on labeled training data and then make predictions on new data. However, obtaining labeled data is costly and time-consuming, and Active Learning was developed to overcome this constraint [4,6].
The goal of Active Learning is to obtain the maximum performance gain while labeling as few samples as possible. To do this, it selects the most useful samples from the unlabeled dataset and offloads the labeling task to an oracle (e.g., a human annotator), minimizing labeling costs while maintaining performance. Active Learning approaches can be categorized into three scenarios: Membership Query Synthesis, Stream-based Selective Sampling, and Pool-based Sampling [1-3].
Membership query synthesis generates samples from the input space and queries their labels. This approach primarily leverages generative adversarial networks (GANs) for data generation, where the most informative generated samples can play an important role in improving model performance.
Stream-based approaches allow the model to request labels for data that arrive sequentially as a stream. When the input distribution is uniform, stream-based methods can behave similarly to membership query learning, but when the distribution is non-uniform or unknown, it makes sense to draw queries from the actual underlying distribution. This scenario is less studied in vision-related tasks than membership query synthesis and pool-based strategies, but it is effective for tasks where large amounts of data are generated in real time.
The pool-based active learning approach is used in situations where a small amount of labeled data and a large pool of unlabeled data are available. Samples are selected from the pool and queried for labels. It is often the most practical scenario because large amounts of unlabeled data can usually be collected at once. Typical methods use entropy to measure uncertainty and select the samples with the highest uncertainty.
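As an illustration, the following Python sketch shows one iteration of such a pool-based loop, scoring the pool by predictive entropy and querying the most uncertain samples. The model is assumed to expose a scikit-learn-style predict_proba, and oracle_label stands in for the human annotator; both names are illustrative assumptions, not part of the proposed method.

import numpy as np

def entropy_scores(probs):
    # Predictive entropy of each row of class probabilities.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def pool_based_query(model, X_pool, batch_size=10):
    # Select the batch_size most uncertain pool samples by entropy.
    probs = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
    return np.argsort(entropy_scores(probs))[-batch_size:]

# One active-learning iteration with a hypothetical oracle:
# idx = pool_based_query(model, X_pool)
# y_new = oracle_label(X_pool[idx])              # annotator labels the queries
# X_train = np.vstack([X_train, X_pool[idx]])
# y_train = np.concatenate([y_train, y_new])
# X_pool = np.delete(X_pool, idx, axis=0)
# model.fit(X_train, y_train)                    # retrain on the enlarged labeled set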
Active Learning has great potential to reduce the cost of labeling data and to help develop efficient models, since it cuts the time and money required for labeling in real-world applications. It is also effective in areas where human annotators are required, as it minimizes the effort and domain expertise they must invest. As one of the core principles of data-driven learning, Active Learning is expected to show considerable potential in practice.
2.2 Sampling Strategy
In Active Learning, various sampling strategies
have been developed to select the most informative
data points to improve model performance.
Uncertainty sampling is a strategy that selects data
based on how uncertain the model is about its
current predictions. It uses methods such as least
confident, margin sampling, and entropy to calculate
uncertainty and select the most uncertain data.
The least confident method selects the data for which the model's most probable prediction has the lowest probability. Margin sampling selects the data with the smallest difference in probability between the most probable class and the next most probable class. The entropy method calculates the entropy of the predicted class distribution and selects the data with the highest entropy. Uncertainty sampling, especially the entropy method, is the most widely used sampling strategy because it is simple and effective.
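For concreteness, a minimal sketch of the three measures applied to a matrix of predicted class probabilities (each row summing to one); all names and numbers are illustrative:

import numpy as np

def least_confident(probs):
    # 1 minus the probability of the most likely class; higher = more uncertain.
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Gap between the top two classes; the smallest margins are queried.
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs):
    # Shannon entropy of the predictive distribution; higher = more uncertain.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

probs = np.array([[0.50, 0.30, 0.20],    # fairly uncertain prediction
                  [0.90, 0.05, 0.05]])   # confident prediction
print(least_confident(probs))            # [0.5  0.1 ]
print(margin(probs))                     # [0.2  0.85]
print(entropy(probs))                    # larger for the first row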
Other sampling strategies include Query-By-
Committee, Expected Model Change, Variance
Reduction, and Density-Weighted Methods.
Query-By-Committee uses multiple models, or an ensemble, to select data. Each committee member is trained on the current labeled data and makes its own predictions; the disagreement among members serves as the uncertainty estimate, and the samples on which the committee disagrees most are selected.
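A common way to quantify this disagreement is vote entropy over the members' hard predictions, as in the sketch below; the committee list of fitted classifiers is a hypothetical placeholder.

import numpy as np

def vote_entropy(committee, X_pool, n_classes):
    # Query-By-Committee score: entropy of the vote distribution per sample.
    # Samples on which the members disagree most receive the highest scores.
    votes = np.stack([m.predict(X_pool) for m in committee])  # (n_models, n_pool)
    scores = np.zeros(X_pool.shape[0])
    for c in range(n_classes):
        frac = (votes == c).mean(axis=0)       # fraction of members voting class c
        scores -= frac * np.log(frac + 1e-12)  # term is ~0 when frac == 0
    return scores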
Expected Model Change estimates how much the model would change if a candidate sample were added to the training set and selects the samples expected to induce the largest change, i.e., the largest information gain.