Visual Question Generation and Answering (VQG-VQA) using Machine
Learning Models
ATUL KACHARE, MUKESH KALLA, ASHUTOSH GUPTA
Computer Science and Engineering
Sir Padampat Singhania University
Udaipur, Rajasthan
INDIA
Abstract: - The presented automated visual question-answer system generates image-based question-answer pairs. The system consists of the Visual Question Generation (VQG) and Visual Question Answering (VQA) modules. VQG generates questions based on visual cues, and VQA provides matching answers to the questions produced by the VQG module. The VQG system generates questions using an LSTM and the VGG19 model, training its parameters and predicting the word with the highest probability at each output step. VQA uses the VGG-19 convolutional neural network for image encoding, a question embedding, and a multilayer perceptron to produce high-quality responses. The proposed system reduces the need for human annotation and thus supports the traditional education sector by significantly reducing the human intervention required to generate text queries. The system can be used in interactive interfaces to help young children learn.
Key-Words: - Visual Question Generation, Visual Question Answer, Image Feature Extraction, E-Learning
System
Received: October 12, 2022. Revised: May 19, 2023. Accepted: June 11, 2023. Published: June 28, 2023.
1 Introduction
Question answering (QA) and question generation
(QG) are essential tasks in communication due to
progress made in various areas of machine learning.
Neural networks have greatly improved the speed
and accuracy of image processing tasks such as ob-
ject recognition and image segmentation and natural
language processing tasks such as input recognition,
language generation, and question answering. The result is the multidisciplinary pair of tasks known as VQA and VQG, which combine computer vision and natural language processing techniques. VQA and VQG must make infer-
ences between text questions and answers based on
the content of related images. VQA focuses on an-
swering questions about images, while VQG aims to
generate meaningful questions based on the content
of the images and the answers given. Visual Ques-
tion Answering (VQA) and Visual Question Gener-
ation (VQG) are popular topics in computer vision
but are often studied separately despite their intrinsi-
cally complementary relationships. This paper aims to comprehensively review visual question generation and question answering, including their methods and existing datasets.
VQG, considered complementary to VQA, has re-
cently attracted considerable attention as a fascinat-
ing problem. Its objective is to generate meaningful
questions based on input images. This task involves
image comprehension and natural language genera-
tion, often employing deep learning techniques. In
VQG, the first step is comprehending the picture and
generating a coherent sequence of texts that consti-
tute syntactically and semantically valid questions.
Image comprehension involves successfully detect-
ing objects, classifying objects, labeling them, identi-
fying relationships among objects, understanding the
scene, and classifying the scene.
Visual question-answering systems aim to respond
to natural language questions based on visual input
accurately. A broader perspective of this problem is
to develop systems that can comprehend image con-
tent in a human-like manner and effectively commu-
nicate about it using natural language. This task is
challenging as it requires the interaction and synergy
between image-based and natural language models.
It is widely regarded as a crucial milestone in the de-
velopment of artificial intelligence and represents the
effort to make computers as intelligent as humans.
Some researchers have even proposed using visual
question answering as a benchmark for evaluating AI
systems’ capabilities, like the Turing Test concept,
[1].
To provide an overview of the subproblems in-
volved in visual question answering, consider the fol-
lowing examples in Table 1:
Solving these challenges involves four main steps:
Image Featurization, Question Featurization, Joint
Feature Representation, and Answer Generation. The
remainder of the paper is organized as follows: Section 2 briefly reviews the literature on VQG and VQA. The proposed automated visual question-answer system is presented in Section 3. The dataset, data preparation, exploratory data analysis, and results are reported in Section 4. Finally, Section 5 concludes the work.
Table 1: Computer vision tasks required to be solved by VQA, [2]

CV Task | Representative VQA Question
Object Recognition | What is in the image?
Object Detection | Are there any books in the image?
Attribute Classification | What color is the book cover?
Scene Classification | Is it day or night?
Counting | How many chairs are there?
Activity Recognition | What is the person doing?
Spatial Relationship among Objects | What is on the desk between the bottle and the computer?
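To make the four main steps named above (Image Featurization, Question Featurization, Joint Feature Representation, and Answer Generation) concrete, the short Python sketch below wires them together. It is only an illustration with tiny stand-in functions (mean pooling, a hashed bag of words); in the actual system a pre-trained CNN and an LSTM play these roles.

import numpy as np

def image_featurize(image):
    # Step 1, Image Featurization: in the paper a pre-trained CNN (VGG-19); here a mean-pooled stand-in.
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def question_featurize(question, dim=3):
    # Step 2, Question Featurization: in the paper word embeddings + LSTM; here a hashed bag of words.
    vec = np.zeros(dim)
    for token in question.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def joint_representation(img_feat, q_feat):
    # Step 3, Joint Feature Representation: element-wise fusion of the two modalities.
    return img_feat[:q_feat.size] * q_feat

def generate_answer(joint, answers=("yes", "no", "red")):
    # Step 4, Answer Generation: a classifier over a fixed answer vocabulary (argmax stand-in).
    return answers[int(np.argmax(joint)) % len(answers)]

dummy_image = np.random.rand(224, 224, 3)
question = "what color is the book cover"
joint = joint_representation(image_featurize(dummy_image), question_featurize(question))
print(generate_answer(joint))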
2 Review of Literature
A neural design for answering natural language questions about images using a combination of CNN and LSTM is reported in [3]. The authors performed experiments and concluded that the system using images performed better than the previous system without images, and they also proposed two metrics: Average Accord, which considers human divergence, and Min Accord, which captures the disagreement in human QA.
A free-form and open-ended VQA system that can answer any conversational language inquiry about a picture, combining VGGNet with an LSTM for image and question processing, is created in [4]. It can answer "wh" questions along with knowledge-representation questions. The paper suggested a need for a task-specific dataset to answer practical VQA application questions more efficiently.
In [5], the authors generated visually grounded questions of distinct varieties for an individual picture using VGGNet and LSTM. The authors experimented with the VisualQA and Visual7W datasets, generating captions and questions for comparison with existing systems. They found that the system could generate a more significant number of questions than previously existing systems, and the generated questions were reasonable and grammatically well-formed.
A unique neural network architecture that improves image-based question answering using a dependency parse tree and a CNN is proposed in [6]. The proposed model, a parse-tree-guided reasoning network (PTGRN), comprises three collaborative modules: an attention module that exploits local visual evidence, a gated residual composition module that aggregates previously mined evidence, and a parse-tree-guided propagation module that passes the mined evidence along the parse tree.
In [7], a system for generating free-form unconstrained video questions and answers using an attention model with bidirectional LSTM is developed. The researchers created a video question-answering dataset using automatic question generation and proposed two attention mechanisms. Sequential Video
Attention computes video attention while preserving
the sequential structure of the question, and Tempo-
ral Question Attention collects question attention for
each video frame.
A new approach that integrates high-level concepts into a CNN-RNN framework is proposed in [8]. The system includes an image analysis component that learns to combine image and semantic properties using a CNN. The language component uses an LSTM to learn the association between the attribute vector and the word sequence. The authors use a VGGNet pre-trained on ImageNet for initialization, adapting it for multi-label datasets and attribute prediction, and they use semantic attribute prediction values instead of raw visual features. The CNN is applied to proposal regions, and its outputs are aggregated into a high-level image representation.
In [9], the authors use a dual learning framework for joint learning of the VQG and VQA tasks. The system uses the GRNN and NeuralTalk2 models, with two agents built from pre-trained models adjusted through dual learning. The proposed model regularly outperforms existing VQA techniques, which suggests that dual learning offers a natural closed-loop strategy for both the VQG and VQA tasks and that the VQA task supports improving VQG performance.
In [10], the authors proposed improving the two tasks jointly by treating VQG as the dual task of VQA, using RNN and LSTM models. They formulated VQG and VQA as inverse processes, separating the base system from a duality regularizer. The proposed method reconstructs the VQA model into its dual VQG form, allowing a single model to be trained on the two conjugate tasks. The experimental results show that iQAN, with its two training tasks, learned the interaction between answers, questions, and images bi-directionally.
A fine-grained image and question architecture that allows deep neural networks and a co-attention framework to identify regions of interest is devised in [11]. The authors propose that to support VQA, three issues must be solved: learning distinctive fine-grained representations of both the image and the question; multi-modal feature fusion that can capture the intricate relationships among multi-modal features; and answer prediction that can account for the complicated associations among the distinct responses to the same question. Irrelevant characteristics can be successfully eliminated and distinctive characteristics obtained for images and questions by employing a unified attention paradigm.
In [12], the authors developed a system that automatically generates visual question-answer pairs by generating and answering questions for an image given as input. In this work, VQG and VQA are performed sequentially.
The VQG model uses CNN and RNN, while the VQA
model uses CNN to extract visual features, using the
question embedding and visual features as input to ob-
tain the answer. It eliminates the need for human in-
tervention in data input. This method creates an entity
pair and an attribute based on the question and the an-
swer. However, the system’s limitation is that it can
only give a one-word answer to a question.
In [13], the authors investigated the association between top-down saliency and text generation to determine whether accurate saliency maps benefit text generation. The application consists of a text-based top-down saliency network and a saliency-based VQG network. The first component takes an image and a question as input and creates a saliency map correlated with the question. The second component takes an image and a saliency map as input and generates an appropriate question based on the most critical areas. A network supervised on both tasks, with dynamic parameter prediction, was proposed. This method exploits the probabilistic relationship between top-down saliency and text generation while dynamically predicting parameters to fully encode the given text into convolutional network parameters. The proposed top-down saliency method correlates well with human attention.
The impact of parameter prediction on VQA was examined in [14], using Stochastic Gradient Descent (SGD) with pre-trained Faster R-CNN and LSTM models. Benchmark VQA datasets were used for evaluation. The proposed method effectively han-
dles various inquiries requiring different levels of se-
mantic understanding beyond simple image content
questions. The VQA model achieves improved accu-
racy and generates answers for free-form unrestricted
questions based on images. It consists of four compo-
nents: Image Features, Question Features, Parameter
Prediction, and training using the Stochastic Gradient
Descent Approach.
In [15], the authors discussed the different textual and visual feature extraction methods and the single- and multi-hop attention models used in visual question answering systems. For visual feature extraction, they covered models ranging from LeNet, with two CNN layers for handwritten digit recognition, to the more complex VGG, GoogleNet, and ResNet models. For textual feature extraction, they covered the family of RNNs, including LSTM, GRU, and Bi-LSTM, and they classified fusion strategies broadly into vector operators, neural networks, and bilinear pooling. The authors also highlighted single- and multi-hop attention models for the textual and visual channels, such as LSTM-Attention and MLAN.
In [16], the authors built a model that maximizes the mutual information between images, expected answers, and generated questions. They introduced a continuous latent space that varies with the expected answer to deal with differences in individual natural language phrasing. They regularize this latent space with a second latent space that guarantees consistent answer categories. In addition, if the system does not know the expected answer, the second latent space can still encode objects and categories, and the model can generate targeted questions to elicit them; this reflects the model's ability to store information about expected answer categories and results in more diverse, targeted questions.
A visual question-answering system for remote sensing data, built from convolutional and recurrent neural networks combined by a point-wise fusion, is designed in [17]. They built two datasets of image/question/answer triplets using low- and high-resolution images. A CNN and an RNN are combined for visual and natural-language question processing, fused by point-wise multiplication, with OpenStreetMap used for QA generation. Nevertheless, OSM's limitations degraded the system's accuracy, and the answers were limited compared to traditional VQA datasets.
In [18], the authors presented a new standardized dataset containing queries prepared by human annotators with the kinds of inquiries people would ask multimedia virtual assistants in mind. A pre-trained CNN and an LSTM text encoder analyze image content and metadata to generate meaningful queries. The presented dataset includes nearly four times the number of queries as the OK-VQA dataset. In addition, their approach was tested against industry-standard evaluation measures such as BLEU, METEOR, ROUGE, and CIDEr to determine the relevance of the generated questions relative to the questions submitted by users. They also examined the diversity of the produced queries using generative strength and originality criteria and found that their approach performed better.
Complex classification and answer generation tasks were decomposed into several simple tasks by [19]. They used a pre-trained ResNet152 for image feature extraction and three types of embeddings (position, segment, and character) for text processing. The authors also used a multi-head self-attention transformer to reduce the computational cost. The suggested system, CGMVQA, was tested on the ImageCLEF 2019 VQA-Med dataset. It demonstrated higher accuracy for fundamental questions, making it appropriate for early medical students and patient care.
In [20], the authors introduced an innovative answer-centric approach called the Radial Graph Convolutional Network (Radial-GCN). The paper proposed a method that identifies a set of candidate regions in an image and then identifies a core answer region among those regions. A radial graph is then constructed, and graph convolution is used to contextualize the graphical and semantic descriptions. The joint representation is then fed to a generic LSTM decoder to generate a meaningful question correlated with both the image and the answer. The central concept of the proposed system is that no extensive image analysis is required to prepare inputs for question generation. The model effectively uses graphical and semantic knowledge to determine the location of the answer area, which adds to the space and time complexity.
In [21], the authors improved the fairness of the answers in terms of ethically sensitive attributes with the help of Faster R-CNN and a Gated Recurrent Network. The system consists of two basic models: the SAP and VQA modules. The VQA module generates all possible (fair and unfair) answers, whereas the SAP module predicts the sensitive attributes of the answers. The answers from both modules are combined using a debiased fusion scheme to yield the final solution. The system's limitation is that it was studied only on gender attributes; hence, the generalizability of the system was not evaluated.
The VQA model presented in [22] is based on detecting visual relationships between multiple objects. Word-vector similarity concepts are used in the layout models, swapping original object features for image attributes and representing aspects and relational predicates. The classifier merges the image features and the question vector as input for producing the answer. The system comprises an object-detection model and an object-relation estimation model. An aspect ratio model was applied to increase the model's generalizability and aid in inferring relationships.
An intelligent manufacturing system using VQA, based on a ResNet pre-trained on the ImageNet dataset together with an LSTM, is proposed in [23]. The conceptual model con-
sists of five parts: the physical HMC system, the vir-
tual HMC system, the service system of the HMC sys-
tem, the DT data of the HMC system, and the link
between the four deployed components. The pro-
posed VQA model, a video-text association network,
can understand visual and textual information. It can
answer simple multiple-choice questions and create
a sentence to answer open-ended questions. In this
way, people and machines can work together more
conveniently and efficiently.
In [24], the authors proposed an attention-based mechanism for generating visual questions using a simple RNN and LSTM encoder-decoder model with a DNN-based attention mechanism. The paper compares the results of the simple encoder-decoder model with those of the attention-based model. The proposed model is efficient and simple. The downside is that the
system only focuses on valid questions about color
specifications.
A difficulty-driven generative network was proposed by [25], where an automatic question generator produces questions with difficulty levels adjusted according to the user's skill and experience, using an RCNN and an LSTM. They used a training-domain difficulty index to identify a difficulty variable representing the complexity level of the questions and combined it with their model to drive the generation of questions with controllable difficulty. The difficulty-management mechanism combines the difficulty information with the decoder initialization, and each time step contributes to managing the complexity level of the generated questions.
In [26], the authors proposed a knowledge-based Visual Question Generation model. They used pre-trained models to generate the object-level features of objects in images. The encoder model then combines visual object-level features with non-visual knowledge information. The overall system mainly consists of four components: the visual concept feature extractor, the knowledge feature extractor, the target object extractor, and the decoder module. In visual extraction, not only are image features extracted, but, with the help of a Graph Neural Network, the spatial relationships between multiple objects are detected and represented using a sparse graph. The Answer-Aware module, part of the knowledge feature extractor, is vital for finding non-visual information.
A solution involving a transformer-based vision-and-language model is proposed in [27]. The researchers used the Swin Transformer encoder to produce a multiscale visual representation. This representation serves as a prefix that helps a Generative Pretrained Transformer-2 decoder generate several questions in paragraph form, effectively parsing the rich visual information in remote sensing scene captures. The decoder was optimized using the RS dataset to generate relevant questions from images. The model was assessed using two VQA data sources, and a new, fully human-annotated TextRS-VQA data source was introduced to improve the assessment of VQG models.
In [28], the authors used a fully automated method to create the first comprehensive VAQA (Visual Arabic Question Answering) dataset. This dataset consists of approximately 138,000 image-question-answer (IQA) triplets focused on yes/no questions related to real-world photos. They created their database structure and their IQA ground-truth-generation technique exclusively for the automated compilation of VAQA datasets. The five components of the system are question preparation, visual feature extraction, textual feature extraction, feature fusion, and answer prediction. The authors identified the most efficient strategy for Arabic queries during the question preprocessing and representation stages of this study, as it was the first study to investigate VQA in Arabic. To do this, they created 24 Arabic VQA models that tested four LSTM networks using various question tokenization schemes, three word-embedding techniques, and architectural designs. To assess the efficacy of the several Arabic VQA models, the authors performed a thorough performance analysis on the VAQA dataset. According to the trial results, the Arabic VQA models achieved accuracies between 80.8% and 84.9%.
VQG and VQA can be combined to form a dual system that eliminates the need for human annotators and avoids relying on image captioning. The VQG component generates questions based on input images, learning to produce questions that are relevant to their visual content; with fine-grained parameters, it can be trained to produce more robust and detailed questions that capture various aspects of the visual information. The VQA component answers the questions generated by the VQG system: it takes the input image and the corresponding question and produces an answer, and with fine-grained parameters it can be trained to provide more accurate and detailed answers that consider subtle visual cues and nuances.
The combined VQG and VQA system can be trained using large datasets that contain paired images, questions, and answers. This eliminates the need for manual annotation, as the questions and answers can be automatically generated and paired with the images. The system can be trained end-to-end, optimizing the question generation and answer prediction tasks simultaneously. Fine-grained parameters let the system capture more nuanced and detailed information from the images and focus on specific visual attributes, objects, or relationships, improving its ability to understand and generate relevant questions and accurate answers.
Overall, the dual system of VQG and VQA, combined with fine-grained parameters, provides a self-contained framework for generating questions and answering them based on visual input. It reduces the reliance on human annotators and avoids the limitations of image captioning, while also enabling the system to achieve higher robustness and accuracy in understanding and processing visual information.
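As a minimal sketch of this chaining (not the authors' code), the snippet below shows how a trained VQG module and a trained VQA module could be composed so that (image, question, answer) triplets are produced without human annotators; the two Dummy classes and their methods are hypothetical stand-ins for the LSTM/VGG-19 modules described in the next section.

from typing import List, Tuple

class DummyVQG:
    # Stand-in for the trained VQG module (VGG-19 features + LSTM decoder in this paper).
    def generate_questions(self, image, k=3):
        return ["what is in the image?"] * k

class DummyVQA:
    # Stand-in for the trained VQA module (VGG-19 features + MLP answer classifier in this paper).
    def answer(self, image, question):
        return "book"

def build_qa_pairs(image, vqg, vqa, num_questions=3) -> List[Tuple[str, str]]:
    pairs = []
    for question in vqg.generate_questions(image, k=num_questions):
        # Each generated question is answered from the same image, yielding a QA pair.
        pairs.append((question, vqa.answer(image, question)))
    return pairs

print(build_qa_pairs(image=None, vqg=DummyVQG(), vqa=DummyVQA()))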
3 Proposed System
The proposed system combines the VQG and VQA systems to first create questions from the provided input images and then create the answers to those questions, which are used later in the system. In the Visual Question Generation (VQG) system shown in Figure 1, an LSTM produces questions, and a pre-trained CNN model extracts image features. The COCO and VQA datasets are used for training and evaluation.
The LSTM is trained on these parameters, with each image and its question forming a record in the embedding space. At each step, the method conditions on the previous embeddings to predict the next output state; the word with the highest probability is selected, and its embedding is fed to the next step. The generated questions are then output using the generated words.
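A minimal PyTorch sketch of this greedy, word-by-word decoding is given below; it assumes 4096-dimensional VGG-19 features and illustrative vocabulary, embedding, and hidden sizes, and it is not the exact training code used here.

import torch
import torch.nn as nn

class VQGDecoder(nn.Module):
    def __init__(self, img_feat_dim=4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)  # project VGG-19 features into the word space
        self.embed = nn.Embedding(vocab_size, embed_dim)    # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # scores over the vocabulary

    @torch.no_grad()
    def generate(self, img_feat, end_id=2, max_len=20):
        # The projected image feature is the first input; afterwards each predicted
        # word embedding is fed back in, always keeping the highest-probability word.
        inputs = self.img_proj(img_feat).unsqueeze(1)       # (B, 1, embed_dim)
        state, tokens = None, []
        for _ in range(max_len):
            hidden, state = self.lstm(inputs, state)
            word = self.out(hidden[:, -1]).argmax(dim=-1)   # greedy choice of the next word
            tokens.append(word)
            if (word == end_id).all():                      # stop at the end-of-question token
                break
            inputs = self.embed(word).unsqueeze(1)
        return torch.stack(tokens, dim=1)                   # (B, T) generated word ids

decoder = VQGDecoder()
print(decoder.generate(torch.randn(2, 4096)).shape)         # token ids for two images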
Figure 1: Visual Question Generation Model
Figure 2: Visual Question Answering Model
Figure 2 shows the high-level baseline architecture of our VQA system. The input image is scaled to 224x224. A convolutional neural network (CNN) based on VGG-19 receives the scaled image as input.
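The image-encoding step could be realized as in the following sketch, which assumes torchvision's pre-trained VGG-19 and drops its final classification layer so that a 4096-dimensional feature vector is returned for a 224x224 input.

import torch
from PIL import Image
from torchvision import models, transforms

vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg19.classifier = vgg19.classifier[:-1]   # drop the 1000-way ImageNet layer, keep the 4096-d output
vgg19.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),         # the 224x224 scale used by the system
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_embedding(path):
    # Returns the "picture embedding" used by the VQA module.
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg19(img).squeeze(0)       # 4096-dimensional feature vector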
The CNN outputs a feature vector that encodes the content of the image, known as the image embedding. The question is passed to an embedding layer, creating compact embedding vectors. The two representations are first projected to an equal number of dimensions using fully connected layers (linear transformations) and then fused with point-wise multiplication (multiplying the values in the corresponding dimensions). The final stage of the VQA model is a multilayer perceptron with a final SoftMax nonlinearity that outputs a score distribution over the top-k (1000) candidate answers. Casting the answers as a k-way classification task allows us to train the VQA model using the cross-entropy loss between the generated answer distribution and the ground truth.
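Under these assumptions (a 4096-dimensional image embedding, an LSTM question encoder, and a vocabulary of the top 1000 answers), the baseline head could look like the following PyTorch sketch; layer sizes are illustrative, not the exact configuration trained here.

import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, img_dim=4096, vocab_size=10000, embed_dim=300,
                 hidden_dim=512, fuse_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, fuse_dim)       # project the image embedding
        self.q_proj = nn.Linear(hidden_dim, fuse_dim)      # project the question encoding
        self.mlp = nn.Sequential(                          # multilayer perceptron head
            nn.Linear(fuse_dim, fuse_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(fuse_dim, num_answers),              # logits; softmax is applied inside the loss
        )

    def forward(self, img_feat, question_ids):
        _, (h_n, _) = self.lstm(self.embed(question_ids))  # final LSTM state encodes the question
        fused = self.img_proj(img_feat) * self.q_proj(h_n[-1])  # point-wise multiplication
        return self.mlp(fused)                             # scores over the 1000 candidate answers

# One training step: cross-entropy between the predicted distribution and the ground-truth answer.
model = VQABaseline()
logits = model(torch.randn(4, 4096), torch.randint(0, 10000, (4, 12)))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 1000, (4,)))
loss.backward()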
4 Experimental Setup
4.1 Dataset
We have used the VisualQA (VQA) and COCO
datasets to evaluate the proposed system.
VQA Dataset: The dataset contains 20,000 images in the training set and 60,000 question-answer pairs for the training images. There are also 10,000 validation images and 20,000 test images, with 30,000 question-answer pairs for validation.
COCO Dataset: The dataset contains 82,783 images in the training set and 443,757 question-answer pairs for the training images. There are also 40,504 validation images and 81,434 test images, with 214,354 question-answer pairs for validation.
4.2 Data Preparation
Since both unstructured and structured data are used, the generative models in the multimodal system must be managed properly and the data prepared appropriately for generating answers.
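A data-preparation sketch consistent with this setup is shown below: questions are tokenized and padded to a fixed length, and the most frequent answers form the 1000-way answer vocabulary. The record field names are illustrative assumptions, not the dataset's actual schema.

from collections import Counter

def build_answer_vocab(annotations, top_k=1000):
    # The top-k most frequent answers become the class labels of the VQA classifier.
    counts = Counter(a["answer"].lower().strip() for a in annotations)
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common(top_k))}

def encode_question(question, word_index, max_len=22, pad_id=0, unk_id=1):
    # Lower-case, split on whitespace, map words to ids, and pad/truncate to max_len.
    ids = [word_index.get(tok, unk_id) for tok in question.lower().split()]
    return (ids + [pad_id] * max_len)[:max_len]

annotations = [{"question": "What is in the image?", "answer": "Book"},
               {"question": "How many chairs are there?", "answer": "two"}]
answer_vocab = build_answer_vocab(annotations)
word_index = {w: i + 2 for i, w in enumerate(
    sorted({t for a in annotations for t in a["question"].lower().split()}))}
print(encode_question(annotations[0]["question"], word_index), answer_vocab)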
4.3 Exploratory Data Analysis
Figure 3, Figure 4, Figure 5, and Figure 6 show exploratory data analysis of the dataset.
Figure 3: Word Cloud on Question-Type
4.4 Results
Table 2 shows the system's accuracy with different batch sizes and datasets.
Figure 4: Length of Questions
Figure 5: Sample Triplet (Image, Question, Answer)
Figure 6: Output of the System
Table 2: Accuracy with different models, datasets, and batch sizes

Model | Batch Size | Top 1 | Top 3 | Top 5
VQA with VGG19 | 16 | 0.5108 | 0.7904 | 0.8456
VQA with VGG19 | 32 | 0.5165 | 0.7917 | 0.8459
VQA with VGG19 | 64 | 0.5224 | 0.7943 | 0.8514
VQA with VGG19 | 128 | 0.5136 | 0.7914 | 0.8443
VQA with VGG19 | 256 | 0.5251 | 0.7966 | 0.8521
VQA with VGG19 | 512 | 0.5151 | 0.7928 | 0.8473
VQA with ResNet 152 | 256 | 0.4883 | 0.7598 | 0.8206
VQA with ResNet 152 | 512 | 0.2520 | 0.4856 | 0.5447
COCO with VGG19 | 64 | 0.3509 | 0.6566 | 0.7179
COCO with VGG19 | 128 | 0.3566 | 0.6550 | 0.7157
COCO with VGG19 | 256 | 0.3603 | 0.6608 | 0.7220
COCO with VGG19 | 512 | 0.3701 | 0.6663 | 0.7261
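For reference, the Top-1/Top-3/Top-5 figures in Table 2 can be computed from the model's answer scores with a routine like the following sketch (illustrative, not the evaluation script used here).

import numpy as np

def top_k_accuracy(scores, labels, k):
    # scores: (N, num_answers) logits or softmax scores; labels: (N,) ground-truth class ids.
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k highest-scoring answers
    return float((top_k == labels[:, None]).any(axis=1).mean())

rng = np.random.default_rng(0)
scores, labels = rng.random((8, 1000)), rng.integers(0, 1000, size=8)
for k in (1, 3, 5):
    print(f"Top-{k} accuracy: {top_k_accuracy(scores, labels, k):.4f}")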
5 Conclusion
An interdisciplinary field called VQA combines ver-
bal expertise with visual information. The impor-
tance of this field rests in how it combines language
comprehension with visual interpretation. However,
a computerized tool combining VQG and VQA ca-
pabilities is not available. In our project, we de-
veloped a system that responds to questions that our
VQG model creates using our VQA model. Accord-
ing to our encouraging results, with enough training data, the system should be able to produce questions and answer them by employing a powerful question-answering component.
Our long-term goals include improving the sys-
tem's capabilities and addressing its flaws. We aim to make the VQA technique flexible enough to produce whole sentences, because the existing VQA module only answers the questions generated by VQG with a single word.
motion detection, and event comprehension into pho-
tos and creating pertinent answers based on these fea-
tures, we hope to improve the system. In addition, we want to improve the precision of the existing VQG and VQA approaches to create more natural question-answer pairings.
References:
[1] D. Geman, S. Geman, N. Hallonquist, and
L. Younes, “Visual turing test for computer
vision systems,” Proceedings of the National
Academy of Sciences, vol. 112, no. 12, pp. 3618–
3623, 2015.
[2] S. Manmadhan and B. C. Kovoor, “Visual ques-
tion answering: a state-of-the-art review,” Ar-
tificial Intelligence Review, vol. 53, pp. 5705–
5745, 2020.
[3] M. Malinowski, M. Rohrbach, and M. Fritz,
“Ask your neurons: A neural-based approach
to answering questions about images,” in Pro-
ceedings of the IEEE international conference
on computer vision, pp. 1–9, 2015.
[4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Ba-
tra, C. L. Zitnick, and D. Parikh, “Vqa: Vi-
sual question answering,” in Proceedings of the
IEEE international conference on computer vi-
sion, pp. 2425–2433, 2015.
[5] S. Zhang, L. Qu, S. You, Z. Yang, and J. Zhang,
“Automatic generation of grounded visual ques-
tions,” arXiv preprint arXiv:1612.06530, 2016.
[6] Q. Cao, X. Liang, B. Li, and L. Lin, “Inter-
pretable visual question answering by reason-
ing on dependency trees,” IEEE transactions
on pattern analysis and machine intelligence,
vol. 43, no. 3, pp. 887–901, 2019.
[7] H. Xue, Z. Zhao, and D. Cai, “Unifying the
video and question attentions for open-ended
video question answering,” IEEE Transactions
on Image Processing, vol. 26, no. 12, pp. 5656–
5666, 2017.
[8] Q. Wu, C. Shen, P. Wang, A. Dick, and A. Van
Den Hengel, “Image captioning and visual ques-
tion answering based on attributes and external
knowledge,” IEEE transactions on pattern anal-
ysis and machine intelligence, vol. 40, no. 6,
pp. 1367–1381, 2017.
[9] X. Xu, J. Song, H. Lu, L. He, Y. Yang, and
F. Shen, “Dual learning for visual question gen-
eration,” in 2018 IEEE International Confer-
ence on Multimedia and Expo (ICME), pp. 1–6,
IEEE, 2018.
[10] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang,
X. Wang, and M. Zhou, “Visual question gener-
ation as dual task of visual question answering,”
in Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pp. 6116–
6124, 2018.
[11] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, “Be-
yond bilinear: Generalized multimodal factor-
ized high-order pooling for visual question an-
swering,” IEEE transactions on neural networks
and learning systems, vol. 29, no. 12, pp. 5947–
5959, 2018.
[12] S. Nahar, S. Naik, N. Shah, S. Shah, and L. Ku-
rup, “Automated question generation and an-
swer verification using visual data,” Modern Ap-
proaches in Machine Learning and Cognitive
Science: A Walkthrough: Latest Trends in AI,
pp. 99–114, 2020.
[13] S. He, C. Han, G. Han, and J. Qin, “Explor-
ing duality in visual question-driven top-down
saliency,” IEEE transactions on neural net-
works and learning systems, vol. 31, no. 7,
pp. 2672–2679, 2019.
[14] S. Jha, A. Dey, R. Kumar, and V. Kumar, “A
novel approach on visual question answering by
parameter prediction using faster region based
convolutional neural network,” IJIMAI, vol. 5,
no. 5, pp. 30–37, 2019.
[15] D. Zhang, R. Cao, and S. Wu, “Information fu-
sion in visual question answering: A survey,”
Information Fusion, vol. 52, pp. 268–280, 2019.
[16] R. Krishna, M. Bernstein, and L. Fei-Fei, “In-
formation maximizing visual question genera-
tion,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recogni-
tion, pp. 2008–2018, 2019.
[17] S. Lobry, D. Marcos, J. Murray, and D. Tuia,
“Rsvqa: Visual question answering for re-
mote sensing data,” IEEE Transactions on Geo-
science and Remote Sensing, vol. 58, no. 12,
pp. 8555–8566, 2020.
[18] A. Patel, A. Bindal, H. Kotek, C. Klein,
and J. Williams, “Generating natural questions
from images for multimodal assistants,” in
ICASSP 2021-2021 IEEE International Confer-
ence on Acoustics, Speech and Signal Process-
ing (ICASSP), pp. 2270–2274, IEEE, 2021.
[19] F. Ren and Y. Zhou, “Cgmvqa: A new classi-
fication and generative model for medical vi-
sual question answering,” IEEE Access, vol. 8,
pp. 50626–50636, 2020.
[20] X. Xu, T. Wang, Y. Yang, A. Hanjalic, and
H. T. Shen, “Radial graph convolutional net-
work for visual question generation,” IEEE
transactions on neural networks and learning
systems, vol. 32, no. 4, pp. 1654–1667, 2020.
[21] S. Park, S. Hwang, J. Hong, and H. Byun, “Fair-
vqa: Fairness-aware visual question answer-
ing through sensitive attribute prediction,” IEEE
Access, vol. 8, pp. 215091–215099, 2020.
[22] Y. Xi, Y. Zhang, S. Ding, and S. Wan, “Visual
question answering model based on visual rela-
tionship detection,” Signal Processing: Image
Communication, vol. 80, p. 115648, 2020.
[23] T. Wang, J. Li, Z. Kong, X. Liu, H. Snoussi, and
H. Lv, “Digital twin improved via visual ques-
tion answering for vision-language interactive
mode in human–machine collaboration,” Jour-
nal of Manufacturing Systems, vol. 58, pp. 261–
269, 2021.
[24] C. Patil and A. Kulkarni, “Attention-based vi-
sual question generation,” in 2021 International
Conference on Emerging Smart Computing and
Informatics (ESCI), pp. 82–86, IEEE, 2021.
[25] F. Chen, J. Xie, Y. Cai, T. Wang, and Q. Li,
“Difficulty-controllable visual question gener-
ation,” in Web and Big Data: 5th Interna-
tional Joint Conference, APWeb-WAIM 2021,
Guangzhou, China, August 23–25, 2021, Pro-
ceedings, Part I 5, pp. 332–347, Springer, 2021.
[26] J. Xie, W. Fang, Y. Cai, Q. Huang, and Q. Li,
“Knowledge-based visual question generation,”
IEEE Transactions on Circuits and Systems for
Video Technology, vol. 32, no. 11, pp. 7547–
7558, 2022.
[27] L. Bashmal, Y. Bazi, F. Melgani, R. Ricci,
M. M. Al Rahhal, and M. Zuair, “Visual
question generation from remote sensing im-
ages,” IEEE Journal of Selected Topics in Ap-
plied Earth Observations and Remote Sensing,
vol. 16, pp. 3279–3293, 2023.
[28] S. M. Kamel, S. I. Hassan, and L. Elrefaei,
“Vaqa: Visual Arabic question answering,” Ara-
bian Journal for Science and Engineering,
pp. 1–21, 2023.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors equally contributed in the present re-
search, at all stages from the formulation of the prob-
lem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflicts of Interest
The authors have no conflicts of interest to
declare that are relevant to the content of this
article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US