Enhancing the Reliability of Academic Document Certification Systems
with Blockchain and Large Language Models
JEAN GILBERT MBULA MBOMA1, OBED TSHIMANGA TSHIPATA1, WITESYAVWIRWA
VIANNEY KAMBALE2,3, MOHAMED SALEM2, MUDIAMPIMPA TSHYSTER JOEL1,
KYANDOGHERE KYAMAKYA1,2
1Génie Electrique et Informatique
Université de Kinshasa (UNIKIN)
H8J5+6PX Kinshasa
DEMOCRATIC REPUBLIC OF THE CONGO
2Institute for Smart Systems Technologies
Universitaet Klagenfurt
9020 Klagenfurt
AUSTRIA
3Faculty of Information and Communication Technology
Tshwane University of Technology
Private Bag x680, Pretoria, 0001
SOUTH AFRICA
Abstract: Verifying the authenticity of documents, whether digital or physical, is a complex and crucial challenge
faced by a variety of entities, including governments, regulators, financial institutions, educational establishments,
and healthcare services. Rapid advances in technology have facilitated the creation of falsified or fraudulent docu-
ments, calling into question the credibility and authenticity of academic records. Most existing blockchain-based
verification methods and systems focus primarily on verifying the integrity of a document, paying less atten-
tion to examining the authenticity of the document’s actual content before it is validated and registered in the
system, thus opening loopholes for clever forgeries or falsifications. This paper details the design and imple-
mentation of a proof-of-concept system that combines GPT-3.5’s natural language processing prowess with the
Ethereum blockchain and the InterPlanetary File System (IPFS) for storing and verifying documents. It explains
how a Large Language Model like GPT-3.5 extracts essential information from academic documents and hashes
it before storing it on the blockchain, ensuring document integrity and authenticity. The system is tested for its
efficiency in handling both digital and physical documents, demonstrating increased security and reliability in
academic document verification.
Key-Words: Blockchain, Large Language Models, IPFS, Document Verification, Document Authentication,
Reliability, SHA-256, Digital Signature
Received: August 27, 2024. Revised: July 9, 2024. Accepted: August 13, 2024. Published: September 25, 2024.
1 Introduction
Large Language Models and blockchain may be two
technologies with opposite approaches, but one thing
seems certain: both have enormous potential to trans-
form our daily lives considerably and to solve many
problems whose solutions have hitherto remained in-
complete or even unsatisfactory.
The qualities of blockchain—decentralization, secu-
rity, immutability, and transparency—have drawn the
attention of numerous sectors since its inception in
2008, shortly after the paper on Bitcoin was pub-
lished [1]. Notably, large language models (LLMs)
like GPT-3 and GPT-4 have emerged as powerful tools
for natural language processing in recent years.
These models have demonstrated an impres-
sive ability to generate and understand content con-
textually and coherently while being remarkably ver-
satile in a variety of natural language-related tasks [2].
Therefore, the combination of blockchain technol-
ogy with LLMs presents the possibility of develop-
ing novel applications that capitalize on the advan-
tages of both fields of technology. On the other
hand, the steady increase in the number of falsi-
fied academic documents, whether digital or phys-
ical, has highlighted the limits of traditional verifi-
cation methods [3], paving the way for the explo-
ration of digital solutions better suited to contempo-
rary challenges. Consequently, alternatives such as
the digital signature [4], the QR code [5], [6], the
barcode [7], [8], [9], and various other means [10],
including the use of blockchain, have been imple-
mented to this end. Regarding blockchain in particular,
it is undeniable that its decentralization, immutability,
and transparency have greatly complicated the task of
counterfeiters; nevertheless, loopholes remain and can
lead to dramatic situations, especially when a counter-
feit goes unnoticed during verification and is mis-
taken for the original. Most of the work and sys-
tems [3], [11], [12], [13], [14], [15], [16], [17], [18]
[19], [20], [21] proposed to date have been based on
the following assumptions:
1. The system (private or public blockchain) is con-
sidered very reliable (system reliability): fraud
within the system (university, college, etc.) is
underestimated.
2. Any document stored in the blockchain cannot
be altered without obvious evidence of tampering
(Document Integrity).
3. A document, subject to verification, is consid-
ered valid if it is present in the system, i.e., if its
hash is stored in the blockchain. (Verification as-
sumption).
These hypotheses are certainly relevant, but present
certain vulnerabilities:
1. The system may be compromised to some extent;
some validators may be dishonest or malicious.
2. Instead of directly modifying the original stored
in the blockchain, the fraudster can manage to
have a counterfeit validated, although this is not
an easy task.
3. If a counterfeit manages to be validated in the
blockchain without being detected, it will easily
pass verification, which would be catastrophic.
It is therefore imperative to add a security mechanism
to considerably tighten the validation process before
registering a document. This would significantly in-
crease the efficiency and reliability of document au-
thenticity verification at a later stage.
2 Aims and Objectives
The main objective of our paper is to present a sys-
tem in which Large Language Models are combined
with blockchain technology to address this challenge.
Our approach aims to use an LLM, GPT-3.5 in this
case, to enhance the efficiency of document valida-
tion at registration and enable more reliable verifica-
tion at a later stage; the idea being that if validation
is much more rigorous, it is highly likely that docu-
ments registered on the blockchain are authentic and
valid, and verification becomes simpler. When regis-
tering a document in our system, the LLM’s role will
be to extract the essential information from a docu-
ment, whether it’s a quotation sheet, a diploma, a cer-
tificate, a parcel document, or any other commercial
or administrative document. This information will
then be hashed using a cryptographic hash func-
tion, specifically SHA-256. The result of this hash
will then be recorded on the Ethereum blockchain,
along with the document’s content identifier (CID).
This CID will be provided by the IPFS storage net-
work we’ll be using to host the document. It’s impor-
tant to note that the blockchain will mainly be used
to store the CID and the hash generated by the LLM.
Consequently, if we want to check the validity of any
document possibly issued by our system, the latter
will first extract the hash value associated with the key
information of said document and compare this value
with those of different documents present in the sys-
tem. If there is a perfect match, the file under exam-
ination is certified authentic; if not, further analysis
will be carried out to determine whether it is either
dubious or simply missing from the system and ap-
ply the appropriate measures. In the following lines,
we first outline the fundamental concepts related to
LLMs, blockchain, and the IPFS decentralized stor-
age protocol, while reviewing the work done in the
context of document verification and authentication
via blockchain. This is followed by a detailed pre-
sentation of our solution and the web platform used
to implement our approach, as well as a discussion of
the results obtained. At last, a conclusion is given in
which prospects of improvement are also mentioned.
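To make this concrete, the following minimal Python sketch (our own illustration, not the exact production code) shows how the key information returned by the LLM can be canonicalized and hashed with SHA-256; the field names follow the extraction prompt presented later (Fig. 10), and the canonicalization choices are assumptions:

import hashlib
import json

def key_info_digest(key_info: dict) -> str:
    # Serialize with sorted keys and lowercased values so that the same
    # key information always yields the same SHA-256 digest (assumption).
    canonical = json.dumps(
        {k: str(v).lower() for k, v in key_info.items()},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

info = {
    "document_title": "student transcript",
    "student_name": "jane doe",
    "academic_year": "2022-2023",
    "option": "computer science",
    "year_of_study": "3",
}
print(key_info_digest(info))  # identical fields -> identical digest; any change -> a new digest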
3 Background
3.1 Overview of LLMs
3.1.1 Introduction
Recently, sophisticated machine learning models ca-
pable of comprehending and producing natural lan-
guage text have emerged; these are known as Large
Language Models (LLMs).
These models are usually trained on large and exten-
sive textual datasets [22]. The development of these
LLMs has been inspired mainly by the introduction
of a neural network architecture called Transformers
which enables them to mimic the complex structure
of language and assists them in capturing long-range
dependencies. This architecture enables them to learn
and generate content with enhanced semantic richness
and coherence. LLMs such as GPT-3 [23], PaLM [24],
Galactica [25], and LLaMA [26] stand out due to
their remarkably large number of parameters, which
can typically reach tens or hundreds of billions.
A way to interact with these Large Language
Models (LLMs) often involves prompt engineering,
a method discussed in [23] and [27]. Prompt engi-
neering involves users crafting precise and accurate
prompts aimed at directing the LLMs to generate re-
quired outputs or performing certain tasks outlined by
the input prompts. This systematic approach is widely
used in contemporary evaluation methods, enabling
humans to engage with LLMs through questioning
or dialogue, essentially having conversations in nat-
ural language with these models. Table 1 provides a
succinct yet essential comparison of conventional ma-
chine learning, deep learning, and LLMs.
In order to obtain a thorough understanding, let us
examine the various language modeling approaches
that have been used in the creation of LLMs thus
far. Language modeling (LM) research has received
a great deal of attention in the literature, as demon-
strated by [28]. There are four major phases of devel-
opment that this research may be divided into:
Statistical language models (SLM): these
emerged in the 1990s and are based on sta-
tistical learning. Building a word predic-
tion model based on the Markov assumption—
that is, predicting the next word based on
the current context—is the fundamental idea.
SLMs have been widely used to enhance task
performance for natural language processing
(NLP) [29], [30], [31], and information retrieval
(IR) [32], [33]. However, because of the ex-
ponential number of transition probabilities that
must be computed, high-order language mod-
els frequently suffer from the "curse of dimensionality,"
making their accurate estimation challenging. Particular smoothing techniques,
including back-off estimation and Good-Turing
estimation, have been created to solve this is-
sue and lessen the impact of sparse data. These
techniques are intended to enhance the estimate
of high-order language models and mitigate the
problem of data sparsity.
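To illustrate the Markov assumption underlying SLMs, here is a toy bigram model in Python (our own example, not taken from the cited works): it estimates next-word counts from a tiny corpus and predicts the most likely continuation of the current word.

from collections import Counter, defaultdict

corpus = ("the student requests the transcript and "
          "the student receives a transcript").split()

# First-order Markov assumption: the next word depends only on the current one.
bigrams = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    bigrams[current][nxt] += 1

def predict_next(word: str) -> str:
    # Return the most frequent successor observed in the corpus.
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # -> 'student' (seen twice after 'the', vs. 'transcript' once)

With higher-order models, the number of contexts to estimate grows exponentially, which is exactly the sparsity problem the smoothing techniques above address.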
Neural language models (NLM): The founda-
tion of NLM is neural networks, such as Recur-
rent Neural Networks (RNNs) [1], [34]. To pre-
dict the next word, these large language models
(NLMs) rely on distributed word vectors, other-
wise known as aggregated context features which
capture the meaning of words based on their con-
text in a sentence. This approach enables the
efficient extraction of word vectors from an ex-
tensive textual dataset, using the capabilities of
models like Word2Vec [35], GloVe [36], and
FastText [37]. NLMs are also specifically de-
signed to capture not only dense vector repre-
sentations of individual words and sequences
but also their long-term contextual dependencies.
This enables well-trained NLMs to synthesize coher-
ent text through accurate next-word prediction
based on prior context. Notably, the capabilities
of these models extend beyond text generation
to various NLP tasks such as speech recognition,
machine translation, and many others [38].
Pre-trained language models (PLM): These
models become proficient at performing specific
natural language processing (NLP) tasks after
they’ve been trained on vast amounts of text in-
put. This training, also called pre-training, in-
volves exposing the model to large and varied
sets of text, enabling it to mimic the structures,
statistical patterns, and semantic relationships
found within language. As a result of this com-
prehensive training, these language models can
effectively predict and select the most likely next
word in a sentence, drawing on their understand-
ing of the context developed during training.
Thanks to this approach, the model can capture
different linguistic features and gain a sophisti-
cated understanding of language. Typically, the
pre-training phase is unsupervised, i.e. the model
acquires knowledge from the data without the
need for explicit annotations or labels.
After pre-training, PLMs can be fine-tuned for
specific downstream tasks such as text classification,
question answering, language translation, and other
related applications. Fine-tuning is a process during
which the model is trained on smaller, task-specific
datasets containing labeled instances. In this way, the
model adjusts its pre-training knowledge and abili-
ties to function well on particular tasks. On the other
hand, Pre-trained Language Models (PLMs) capture
contextual dependencies in both directions, in con-
trast to Neural Language Models (NLMs) which pre-
dict the next word based on the previous context.
PLMs consider both the words that come before and
after the word they are predicting to comprehend its
context properly. ELMo [39] is one example of this,
as it uses a bidirectional LSTM network (biLSTM)
for pre-training.
Table 1. Comparative study between traditional Machine Learning, Deep Learning, and LLMs (Source: [27])
Comparison            Traditional ML   Deep Learning   LLMs
Training Data Size    Large            Large           Very Large
Feature Engineering   Manual           Automatic       Automatic
Model Complexity      Limited          Complex         Very Complex
Interpretability      Good             Poor            Poorer
Performance           Moderate         High            Highest
Hardware Requirements Low              High            Very High

Large language models (LLM): These are
scaled-up versions of PLMs with, for the most
part, billions or hundreds of billions of parameters.
They are created by scaling PLMs (e.g., model
size or data size). In fact, research has demon-
strated that scaling PLMs frequently results in
the model’s increased capabilities and perfor-
mance on downstream tasks [40]. Larger PLMs,
such as the 175B-parameter GPT-3 and the 540B-
parameter PaLM, have been shown to exhibit
different behaviors during training than smaller
PLMs, like the 330M-parameter BERT and 1.5B-
parameter GPT-2, according to a study by [28].
These large PLMs perform very well on difficult
tasks because they possess emergent abilities that
allow them to tackle complex task sequences. As a
result, in the area of language modeling, AI al-
gorithms have grown remarkably powerful and effi-
cient.
It is crucial to remember that an LLM is not always
more capable than a small PLM, and certain LLMs
might not exhibit emergent abilities. We can list the
GPT series (GPT-3, GPT-3.5, GPT-4), Bard, Falcon,
Llama, Bloom, and so on as examples of LLMs. An
intriguing chronology of the major language models
that have been in use recently is presented in the lit-
erature [28] (Fig.1).
3.1.2 Emergent Abilities of LLMs
In the context of LLMs, emergent abilities refer to the
unexpected or unplanned capabilities that the models
display during training or application. These skills
come from the models’ exposure to enormous vol-
umes of language data rather than being specifically
programmed or taught to them. They are among the
key characteristics that set LLMs apart from earlier
PLMs [2]. It can be difficult to determine the critical
size for the emergent abilities of LLMs, i.e., the min-
imum scale needed to possess a specific capability,
because it varies with the model and the task.
Three emergent skills that are typical for LLMs are
introduced in the literature [28].
In-context learning (ICL): Formally introduced
by GPT-3 [23], it occurs when a model can
complete a task using just a prompt made up of
input-output examples. Without
any specific pre-training, the LLM can pick up
knowledge from these examples [41]. The GPT-
1 and GPT-2 models cannot be regarded as hav-
ing the same level of ICL capacity as the GPT-3
model [28].
Instruction following: Through instruction tun-
ing, which entails fine-tuning on a combination of
multi-task datasets organized with natural language
descriptions, LLMs have shown good performance
on previously unseen tasks that are likewise defined
via instructions [42].
Step-by-step reasoning: Unlike small language
models, LLMs can tackle difficult tasks involv-
ing numerous reasoning steps (like mathemati-
cal word problems) using the chain-of-thought
(CoT) prompting method. By using a prompt-
ing mechanism that includes intermediate levels
of reasoning to arrive at the final answer, LLMs
may handle problems of this nature [43].
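For illustration, a chain-of-thought prompt in the spirit of [43] could look like the following hypothetical Python string, where a worked example with intermediate steps precedes the question the model must answer:

cot_prompt = (
    "Q: A registrar prints 4 transcripts per batch and needs 10 transcripts. "
    "How many batches are required?\n"
    "A: 10 / 4 = 2.5, and partial batches are impossible, so 3 batches are required.\n"
    "Q: A registrar prints 6 diplomas per batch and needs 20 diplomas. "
    "How many batches are required?\n"
    "A:"  # the model is expected to continue with similar intermediate reasoning
)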
3.1.3 Key Techniques for LLMs
Several key techniques can greatly increase the capa-
bilities of LLMs; we briefly introduce a few of them
here [28].
Scaling: As previously observed, the scaling
laws show that expanding the model and dataset
in addition to increasing the training computation
often enhances the LLM’s capabilities and per-
formance [40], [44]. Furthermore, data scaling
requires the use of a suitable cleaning procedure.
Training: Because of their huge model size,
LLMs require distributed training algorithms to
learn their network parameters efficiently. Many
optimization frameworks, such as DeepSpeed [45]
and Megatron-LM [46], have been made available to
assist with the creation and deployment of paral-
lel algorithms.
Fig.1: A timeline of recent large language models (having a size larger than 10B) (Source: [28])
Ability eliciting: After extensive pre-training
on large-scale corpora, LLMs acquire the capacity
to solve common tasks, but these abilities may
not surface when performing specific tasks. To
elicit them, it is helpful to design relevant task
instructions or in-context learning strategies.
Techniques such as instruction tuning and chain-
of-thought prompting can further improve the
ability to generalize to unseen tasks.
Alignment tuning: Because LLMs are trained
to identify the features of diverse data sets,
they may produce biased or harmful material.
The InstructGPT technique [47] uses human
feedback in conjunction with reinforcement learn-
ing [47], [48] to steer LLMs in a manner consis-
tent with human values. Using a similar method-
ology, ChatGPT exhibits a strong alignment ca-
pability, generating well-mannered and non-
offensive responses and declining to respond to
inflammatory inquiries.
Tools manipulation: LLMs are primarily trained
to produce text from large corpora; consequently,
they perform worse on tasks that are not text-based
(e.g., numerical calculation), and they cannot cap-
ture information more recent than their training
data. One potential approach is to use external
tools such as a calculator, which can perform ac-
curate calculations [2], or a search engine, which
can help locate unknown material [49]. Addi-
tionally, ChatGPT's usage of third-party plugins
greatly increases LLMs' capabilities.
Having briefly examined the key facets of language
models, in particular LLMs, we now turn to the key
ideas of blockchain technology.
3.2 Overview of Blockchain
3.2.1 Blockchain network and structure
A blockchain is a decentralized distributed ledger.
It maintains an expanding list of immutable records
called "blocks" [1], [50], [51], [52], [53].
Blocks are linked together using a hash pro-
duced by a cryptographic technique, as seen in Fig.
2 [54]. Because of this, blockchain can function as a
trustworthy way to record transactions [55]. The peer-
to-peer network (Fig. 3) ensures that all nodes have a
copy of the full ledger and automatically corrects any
node that attempts a fraudulent change. This leads to
redundancy and security and eliminates the need for
a central authority [56].
The blockchain system comprises six layers, as de-
scribed in [57] and [58]. The core of blockchain architec-
ture is the data layer, which includes time stamp-
ing, chain structures, and blocks. Blocks function
as storage containers for transactions and their meta-
data in this layer. Bitcoin serves as an illustration of
this, as each block comprises the following necessary
components: the block size, block header, transac-
tion counter, and transactions [55]. A cryptographic
hash technique is applied to the block header in or-
der to create a block hash, which is used to guar-
antee the unique identification of every block in the
Fig.2: Blockchain structure
Fig.3: Peer-to-Peer network
blockchain [1], [53], [59]. The SHA256 hash algo-
rithm is used in the instance of Bitcoin. Additionally,
the hash of the parent block is incorporated into the
header of each block, forming a chronological link
between the blocks (Fig. 2). The genesis
block is the first block in a blockchain that exists on
its own without a parent block (Fig. 2). The inter-
connected structure created by the blockchain’s con-
struction ensures the security and integrity of the data
stored on it. Its chain of information is immutable
since every block has a distinct hash. The blockchain
is extremely resistant to tampering since any attempt
to change the content of a single block will result in an
inconsistent chain. These characteristics offer a solid
foundation for safe, decentralized data management
systems [59].
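This tamper evidence can be sketched in a few lines of Python (a deliberately simplified illustration: real block headers also carry timestamps, nonces, and Merkle roots):

import hashlib, json

def block_hash(block: dict) -> str:
    # SHA-256 over the serialized block, as in Bitcoin's block hashing.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(data: str, parent_hash: str) -> dict:
    # Each block embeds its parent's hash, which is what links the chain.
    return {"data": data, "parent_hash": parent_hash}

genesis = make_block("genesis", parent_hash="0" * 64)
block1 = make_block("document hash A", parent_hash=block_hash(genesis))

# Tampering with the genesis block changes its hash, so block1's stored
# parent_hash no longer matches and the inconsistency is detectable.
genesis["data"] = "forged"
print(block1["parent_hash"] == block_hash(genesis))  # False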
3.2.2 Consensus mechanisms
Blockchain’s consensus mechanism, which ensures
network participants agree on the veracity of transac-
tions, is one of its key features. To maintain trust and
reduce malicious activity within the network, a num-
ber of consensus algorithms, including popular ones
like Proof of Work (PoW), Proof of Stake (PoS),
and Practical Byzantine Fault Tolerance (PBFT), have
been developed. The popular consensus classification
is displayed in Fig. 4. Blockchain networks attain
high-security levels and do away with the possibility
of a single point of failure through consensus.
Proof of Work (PoW): Stands as a pioneer-
ing consensus mechanism within the realm of
blockchain technology, [60]. To create new
blocks for the blockchain, it uses the computa-
tional competition principle. In Proof of Work
(PoW), miners perform calculations to yield a
value, and the winner is the miner who is able to
create a value that is less than the network’s pre-
determined threshold [59], [60]. Proposals have
been made for Proof of Weight, Proof of Reputa-
tion, Proof of Space, Proof of History, and Proof
of Burn as variations of Proof of Work.
Proof-of-Stake (PoS): This method has a major
benefit over Proof-of-Work (PoW) in that it does
not require expensive mining equipment [43].
Nodes have the option to mine or validate blocks
in a Proof of Stake (PoS) system according to
their stake. The latter is simply the quantity of
coins they possess [60], [61]. With this method,
users buy cryptocurrency and use it to get ac-
cess to block creation opportunities. Introduced
in [61], Delegated Proof-of-Stake (DPoS) is an
additional variant of Proof of Stake.

Fig.4: Classification of Blockchain Consensus Mechanism
Practical Byzantine Fault Tolerance (PBFT): a
consensus algorithm that ensures fault tolerance
in distributed systems, especially in the presence
of malicious or faulty nodes [62].
Compared to Proof-of-Work and Proof-of-Stake,
PBFT does not depend on stake-based or mining pro-
cesses. Rather, to reach a consensus, PBFT makes
use of a sequence of message exchanges between
nodes. A designated leader node in PBFT proposes
a block of transactions to start the consensus pro-
cess. Other nodes participate in a multi-round vot-
ing procedure following a node’s proposal of a block.
They converse with one another during this procedure
to ascertain the validity of the suggested block [62].
When a significant number of nodes come to an agree-
ment, the suggested block is put into the blockchain
and deemed approved. By using PBFT, the system
may withstand a specific number of malfunctioning
nodes and still retain its liveness and safety charac-
teristics [62].
3.2.3 Smart contracts
A smart contract is a computer protocol designed to
autonomously execute, enforce, verify, and restrict
the execution of its instructions. It makes it possible
to execute transactions without the use of middlemen
between anonymous or untrusted parties [63]; such
transactions are irreversible and traceable. The com-
ponents of a smart contract include value, address,
function, and state. The associated code is run when
a transaction is input; this results in an output event
and a state change that is determined by the func-
tional logic that has been defined. Every party to the
smart contract agrees in advance to its terms and con-
ditions, including its triggering scenarios, state tran-
sition procedures, and liabilities for breaking the con-
tract. The smart contract is then deployed on the
blockchain as code, and it will start working auto-
matically as soon as the predefined requirements are
satisfied. Ethereum is the most widely used platform
for the development of smart contracts. According
to [64], Ethereum outperforms Hyperledger Fabric in
terms of the quantity of transactions that are com-
pleted successfully. The majority of developers write
smart contracts using Solidity and Serpent. Chain
code, also known as smart contracts, can be imple-
mented using Hyperledger Fabric. Usually, Go or
Java is used to develop it; the source [64] states that
Go is the preferred language for best performance.
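As a hint of what calling such a contract looks like from Python, the sketch below uses web3.py (v6 naming) against a local Ganache node; the registry contract, its storeDocument/isRegistered functions, and the deployed address are hypothetical stand-ins, not a published interface:

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:7545"))  # local Ganache test chain

# Minimal ABI for a hypothetical document-registry contract.
abi = [
    {"name": "storeDocument", "type": "function", "stateMutability": "nonpayable",
     "inputs": [{"name": "docHash", "type": "bytes32"},
                {"name": "ipfsCid", "type": "string"}], "outputs": []},
    {"name": "isRegistered", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "docHash", "type": "bytes32"}],
     "outputs": [{"name": "", "type": "bool"}]},
]
address = w3.to_checksum_address(
    "0x0000000000000000000000000000000000000000")  # replace with the deployed address
registry = w3.eth.contract(address=address, abi=abi)

doc_hash = bytes.fromhex("ab" * 32)  # e.g., the SHA-256 digest of the key information
if not registry.functions.isRegistered(doc_hash).call():
    tx = registry.functions.storeDocument(doc_hash, "QmExampleCid").transact(
        {"from": w3.eth.accounts[0]})
    w3.eth.wait_for_transaction_receipt(tx)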
3.2.4 Oracles
The inability of blockchain technology to directly ac-
cess external data initially hampered its integration
with the real world. The idea of oracles was pre-
sented as a way around this restriction. In order to
connect the blockchain to other domains and facilitate
its use across multiple industries, oracles serve as
trusted third parties that supply the blockchain
with real-world data (Fig. 5). There are two kinds:
centralized oracles, which are managed by a single
authority, and consensus oracles, which involve
groups of oracles [65].
Fig.5: An example of how Oracle interacts with
smart contracts
3.3 Overview of IPFS
The InterPlanetary File System (IPFS) is a decentral-
ized file storage system that connects all computing
devices to a single file system. To better understand
IPFS, we can compare it to the Internet, but instead
of depending on centralized servers, IPFS works us-
ing the collective power of the computers connected
to the network [66]. In practice, IPFS works a bit like
BitTorrent [60], where files are shared among users
in a decentralized manner. However, IPFS goes a
step further by using a hash system to uniquely iden-
tify files. Each file is assigned a unique "hash" based
on its content, called a Content Identifier (CID). This
means that if a file’s content changes, its hash changes
as well, ensuring that files are intact and unaltered.
This CID consists of 4 fields [67]:
Multibase prefix: which indicates one of the 24
basic encoding methods used to create the binary
content identifier (CID).
CID version identifier: which indicates the ver-
sion of the CID. There are currently two versions
(v0 and v1).
Multicode identifier: indicating how the ad-
dressed data was encoded.
Multihash: which contains metadata indicating
the default hash function used (SHA-256) and the
default length (32 bytes) of the actual hash of the
contents. The term "multihash" comes from the
fact that it can support any hash algorithm.
CID = <Multibase> (cid-version multicode multihash)
When content is added to IPFS, it is divided
into chunks (256 KB by default), and each chunk is
given its own CID (Content Identifier). The CID
of each chunk is obtained by applying a hash func-
tion to its content and adding the metadata mentioned
above. Once all chunks have a CID, IPFS constructs
a Merkle Directed Acyclic Graph (Merkle DAG) of
the file [66], [68]. This DAG is the form in which
the file is provided by the original content publisher. A
Merkle DAG is a data structure similar to a Merkle
tree but without balance requirements. The root node
combines all the CIDs of its descendant nodes to
form the final content CID (commonly referred to
as the root CID). In addition, all files exchanged on
the IPFS network are indexed in Distributed Hash Ta-
bles (DHTs) [11], [66], [67].
a distributed data structure that allows information
to be stored and retrieved using keys (in this case,
CIDs) to obtain the corresponding values (PeerIDs
and associated location information); the IPFS DHT
is based on the Kademlia protocol [66], [69], which
is a well-established technology for managing dis-
tributed DHTs and is similar to the way the BitTor-
rent Mainline DHT [70] works for file distribution on
the BitTorrent network. It’s worth noting that one of
the advantages of IPFS is that it has no single point of
failure. This means that there is no central server on
which the entire network rests, making it more robust
and resilient. IPFS network nodes work together to
store and share files, and they don’t need to trust each
other for the system to work.
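For illustration, a file can be added to a local IPFS node through the HTTP API exposed by a default daemon (the /api/v0/add endpoint; the ports below match the local node used in our prototype):

import requests

def ipfs_add(path: str) -> str:
    # Push the file to the local node; the JSON response contains its CID.
    with open(path, "rb") as f:
        resp = requests.post("http://localhost:5001/api/v0/add", files={"file": f})
    resp.raise_for_status()
    return resp.json()["Hash"]

cid = ipfs_add("transcript.pdf")
print(f"http://127.0.0.1:8080/ipfs/{cid}")  # gateway URL for retrieving the file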
4 Related Works
In this section, we review recent research related to
the blockchain-based verification of academic doc-
uments. Among the proposed approaches, a few
caught our attention: The study in [12] outlines the
importance of certificate verification and its impact
on our society. It briefly discusses traditional veri-
fication methods and their limitations and proposes
a blockchain-based graduation certificate verification
system. This system is used not only for verification
but also for generating new certificates; it generates
digitally signed certificates using the asymmetric key
and timestamp. Students receive a copy to use as they
wish, and employers can verify the authenticity of
these documents through the system by entering the
public key of the university (issuing institution) and
the digital signature applied. The authors in [13] dis-
cuss the current verification process and the prolifer-
ation of fake credentials. They proposed, as a proof-
of-concept, a prototype of an open-source blockchain-
based Ark platform that aims to provide higher edu-
cation institutions with a credit and rating system and
potential employers with a tool to validate a candi-
date’s academic information. However, the article did
not clearly present the technical details of implement-
ing the proposed solution. The work in [14] proposed
a solution that involves creating a platform for all
the credentials that a student may possess. Through
the platform, students store all their diplomas on the
blockchain. To verify a diploma, a person needs the
student’s login and password. The consensus algo-
rithm used for validation is Proof of Work (PoW), but
some details regarding validation are not clearly dis-
closed. The study in [11] offers an interesting imple-
mentation of a new document verification system that
combines blockchain and the InterPlanetary File Sys-
tem (IPFS) to increase the efficiency of detecting a
forged document. However, the authors point out that
one of the limitations of this system is that it only
checks the availability of documents, not their in-
tegrity, i.e., their content; in other words, the system
detects changes made to the file without examining
the file's content. As an alternative, they suggest im-
plementing Optical Character Recognition (OCR) in
the system to overcome the limitation of verifying file
content, and to create a better document verification
system based on blockchain technology. Instead of
the suggestion proposed in [11] to use OCR, we out-
line below our approach to effectively solve this ma-
jor limitation of current blockchain-based document
verification systems.
5 Solution Overview
Our proof of concept implements basic features to
demonstrate how an LLM is integrated to enhance se-
curity and prevent malicious certification actions. The
system has three main interactions:
Document Request: The student requests his
academic document from his institution and
awaits the request to be processed (step 1 in Fig.
6). Once ready he receives an email with the link
to the requested document that he can download
and share with other institutions (step 2* in Fig.
6).
Certification process: The institution will pro-
ceed with its normal process to check student in-
formation and the validity of the requested docu-
ment. Once the document is ready, an authorized
staff member will use our platform for certifica-
tion and send the document to the student by mail
(steps 2, 2', 2'' and 2''' in Fig. 6).
Verification by other institutions: A third-
party institution uses a mobile application by
scanning the QR code, using the link added at the
bottom of the document (Fig. 7), or inserting the
document hash to check the authenticity of the
document (step 3 in Fig. 6).
The main contribution of this paper resides in steps
2' and 2'' in Fig. 6. We added an LLM to extract the
key information that makes a document unique. Table
2 indicates the key information for a student transcript
or academic diploma, based on the student transcript
sample in Fig. 7. Therefore, before saving the docu-
ment in the blockchain or IPFS, the LLM extracts the
key information, and the system checks in the block-
chain whether the hash obtained by hashing this set of
information already exists. It should be
noted that the IPFS here is used to store the uploaded
file in the decentralized network and the blockchain
stores the key information hash and the IPFS file hash.
5.1 System Flow
The three main processes as stated in the previous sec-
tion are the Document Request process, Certification
process, and Verification process. The request pro-
cess is the simplest one: the student only has to pro-
vide his personal information, his academic informa-
tion, and the documents he needs.
Table 2. Key information that makes a student transcript unique
Key
Document Title
Student Name
Year of Study
Option
Academic Year

The certification process is described in Fig. 8. The
authorized staff member must first log in before taking
further actions. For certification purposes, he will pro-
vide the PDF of the document. The content of the
document is then concatenated with the prompt pro-
vided in Fig. 10. After the LLM has returned the de-
sired response, it is hashed with the SHA-256 algorithm
and used to check if such hash already exists in the
blockchain. We save this file in the IPFS network
only if the hash does not exist; finally, we save the
IPFS file hash, the document hash, and a timestamp in
the blockchain.
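A condensed sketch of this certification step is shown below; the chat-completions call uses the model named in Section 5.2, while exists_in_blockchain, ipfs_add_document, and save_to_blockchain are hypothetical helpers standing in for the smart-contract and IPFS interactions illustrated earlier:

import hashlib
from openai import OpenAI  # assumes the openai Python package (v1 client)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_key_info(document_text: str, prompt: str) -> str:
    # The prompt of Fig. 10 is concatenated with the document content.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt + "\n\n" + document_text}],
    )
    return response.choices[0].message.content  # JSON string of the key fields

def certify(document_text: str, prompt: str) -> bool:
    key_info = extract_key_info(document_text, prompt)
    doc_hash = hashlib.sha256(key_info.encode("utf-8")).hexdigest()
    if exists_in_blockchain(doc_hash):       # hypothetical contract lookup
        return False                         # duplicate key information: rejected
    cid = ipfs_add_document(document_text)   # hypothetical IPFS upload
    save_to_blockchain(doc_hash, cid)        # store both hashes and a timestamp
    return True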
The Verification process remains simple and is de-
scribed in Fig. 9. The third-party institution may
use the URL provided at the bottom of the document
(Fig. 7) or scan the QR code to get the document hash.
Given the hash, the platform returns the real file URL
saved in the IPFS network corresponding to the hash
provided, or else it returns an Error message.
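Verification itself then reduces to a lookup, sketched here in the same style with a hypothetical contract helper:

def verify(doc_hash: str) -> str:
    # Return the IPFS gateway URL of the certified file, or signal an error.
    cid = lookup_cid_in_blockchain(doc_hash)  # hypothetical contract call
    if cid is None:
        raise ValueError("Error: document not found in the system")
    return f"http://127.0.0.1:8080/ipfs/{cid}"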
5.2 Prototype Implementation
In a more detailed view, our system connects four
technologies (Fig. 11): the smart contract is written
in Solidity; the web application is developed in Python
and accessible via http://127.0.0.1:5000 (Fig. 12); the
IPFS network is installed locally; OpenAI provides the
LLM; and the Ethereum network is simulated by
Ganache. Details of the development environment are
given in Table 3.
The web application is divided into three sections:
An administration panel, a Request Page, and a Ver-
ification page for third-party institutions.
Table 3. Development Environment
Component              Description
Hardware               Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz
Memory                 12.0 GB
Operating System       Windows 10 Professional
Blockchain Platform    Ethereum with Ganache
IPFS Network           Local Desktop version
Programming Language   Python, Solidity
Fig.6: System actors and Operation flow
Fig.7: Sample of a certified student transcript
a. The Administration Panel: To access the admin-
istration zone, the authorized staff member has to
successfully log in to the system. He can view re-
quests and respond to a request by uploading the
requested PDF file.
b. Request page and Verification page: These two
pages are open to the public to allow students to
request documents or other institutions to verify
them.
c. OpenAI: The process of text extraction is han-
dled excellently by the general-purpose model GPT-
3.5-Turbo-16k, so we do not need to fine-tune a
model for this purpose. It receives the prompt
(Fig. 10) and the file content and returns the key
information as requested in the prompt.
d. IPFS network: a peer-to-peer content
delivery network that stores, retrieves, and
locates data based on the fingerprint of its
actual content rather than its name or location.
For our solution, we installed a single node on
our computer for test purposes. The backend
connects to the IPFS node via an API with
http://localhost:5001/api/v0 as the base URL to
push files. The URL of the certified file sent to
the user has this format:
http://127.0.0.1:8080/ipfs/filehash.
e. Ethereum blockchain: To avoid using the public
Ethereum network, we used Ganache (part of the Truffle
Suite) to run a local test network. The backend
connects to the smart contract through the RPC
endpoint http://127.0.0.1:7545.

Fig.8: Flow diagram for the Certification Process
5.3 Results and Discussion
Our prototype implementation successfully demon-
strated that the integration of LLM added a layer of
trust to the certification process. This trust is estab-
lished through the rigorous verification of the hash
obtained by hashing the key information of the doc-
ument with the SHA-256 algorithm. The test results
shown in Table 4 demonstrated that if key information
of a document, as described in Table 2, remains un-
changed, the hash value remains the same, indicating
that the uploaded file may be considered fraudulent.
Conversely, if any key information, as described in
Table 2, is modified or missing, it results in a change
of the hash value, making the document different or
not yet certified. Our solution is flexible: no major
modification is required if the document type changes,
and the solution may be used in any other industry
because only the key information will change. The
prompt would then be adapted, or fine-tuning would
be required, depending on the complexity of the doc-
ument.
Integrating an LLM into the certification process
raises several challenges. Privacy and data security
concerns are at the top of the list: issuing institu-
tions must ensure that sensitive student data is han-
dled securely and in compliance with relevant regu-
lations.
Fig.9: Flow diagram for the Verification Process
Return only a JSON with the following attributes from the transcript document below:
"document_title," "student_name," "academic_year," "option," and "year_of_study.".
Please note that no additional text should be included except the JSON.
NOTE: DO NOT PROVIDE ANY OTHER TEXT EXCEPT THE VALID JSON, AND THE EXTRACTED INFO HAS TO BE IN LOWERCASE
Fig.10: The Prompt used to extract key information from the document. This prompt is concatenated with the
document content and sent to the LLM
Table 4. Summary of use cases and results obtained by changing document content and evaluating the hash.
Case 1: Key information of the document is not modified. Hash: remains unchanged. Remark: if the key information described in Table 2 is not modified, the hash value remains the same even if any other information is modified, and the uploaded file will be considered fraudulent.
Case 2: Key information is missing from the document. Hash: changes. Remark: the document is considered different, and the student cannot use a document with missing information.
Case 3: Any of the key information is changed. Hash: changes. Remark: the student may, for example, be requesting a transcript of another academic year; we therefore consider it a different document.
To enforce security, we suggest that the platform be
made fully decentralized to avoid human interaction
with a centralized server; that is, the backend should
be replaced entirely by the smart contract. Another
challenge resides in the accuracy of the LLM in ex-
tracting key information. In this work, we use a
general-purpose model because it performs accurately;
however, if a document is too complex, a fine-tuned
model may be required, and the institution should
continuously evaluate and fine-tune the model to en-
sure accurate results.
Fig.11: Architecture of the entire system. The backend sends the document content to the LLM, which
returns the key information of the document to be hashed. Other institutions verify the integrity of documents
via the blockchain
(a) The verification page where the user inserts the
document hash for verification
(b) The Request Page, where the User provides his
personal and academic information and the docu-
ment he needs
(c) The login page where the authorized staff mem-
ber enters his email and password for authentication
(d) The Certification page. The Administrator only
has to drag and drop the file to be certified
Fig.12: Main Pages of the Web application platform
Finally, integrating the LLM into the certification
process adds latency to each request, because commu-
nication with the LLM depends largely on the network
bandwidth.
6 Conclusion
In this article, we present a new approach to improve
the security of academic document verification sys-
tems. We are convinced that to achieve fast, easy,
and credible verification, it is essential to take rigor-
ous care of each document to be saved in our sys-
tem; therefore, we have implemented an additional
security mechanism during the certification and file
upload to avoid any forgery or fraudulent maneu-
vers. This security mechanism is based on the use
of GPT-3.5 to extract key information from the docu-
ment and hash it using a cryptographic hash function
(SHA-256). This hash is then used to check whether
any previously saved document has the same signa-
ture; this additional check ensures that no student
transcript or diploma is issued twice with different
content. We have also presented a hybrid architec-
ture in which a centralized server issues documents,
a decentralized blockchain stores the file hash and the
document content hash, and the InterPlanetary File
System stores the issued documents. The system gen-
erates, stores, and delivers certified academic docu-
ments, and also verifies their authenticity on demand. The ver-
ification process is adapted to both digital and phys-
ical (printed) documents. By evaluating the impact
of the integration of LLM in the certification process,
we conclude that the run time increased by 40% and
tightly depends on the network bandwidth. In terms
of security, the certifying institution must ensure
that the LLM platform complies with internal and
governmental requirements for data security and pri-
vacy.
As future research, we can focus on optimizing al-
gorithms to reduce run time and on a deeper study of
the security vulnerabilities that such an integration
introduces. We also plan to modify our system to accommo-
date a wider variety of academic texts and languages.
This development will involve tailoring the AI model
to suit diverse document structures and language nu-
ances, depending on the specific context. The source
code used for this work is available on GitHub [71].
Declaration of Generative AI and
AI-assisted technologies in the writing
process
During the preparation of this work, the authors par-
tially used the tool Grammarly to polish the grammar
and some parts of the wording style. After using this
tool, the authors reviewed and edited the content as
needed and take full responsibility for the content of
the publication.
References:
[1] S. Nakamoto, “Bitcoin: A peer-to-peer elec-
tronic cash system,” 2008. [Online]. Available:
https://bitcoin.org/bitcoin.pdf
[2] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu,
M. Lomeli, E. Hambro, L. Zettlemoyer, N. Can-
cedda, and T. Scialom, “Toolformer: Language
models can teach themselves to use tools,” Ad-
vances in Neural Information Processing Sys-
tems, vol. 36, pp. 68539–68551, 2024.
[3] M. M. Rahman, M. T. K. Tonmoy, S. R. Shi-
hab, and R. Farhana, “Blockchain-based certifi-
cate authentication system with enabling correc-
tion,” arXiv preprint arXiv:2302.03877, 2023,
https://doi.org/10.48550/arXiv.2302.03877.
[4] M. Gonzalez-Lee, C. Santiago-Avila,
M. Nakano-Miyatake, and H. Perez-Meana,
“Watermarking based document authentication
in script format,” in 2009 52nd IEEE Inter-
national Midwest Symposium on Circuits and
Systems. Cancun, Mexico: IEEE, 2009, pp.
837–841,
https://doi.org/10.1109/MWSCAS.2009.
5235898.
[5] I. Tkachenko, W. Puech, C. Destruel, O. Strauss,
J.-M. Gaudin, and C. Guichard, “Two-level qr
code for private message sharing and document
authentication,” IEEE Transactions on Informa-
tion Forensics and Security, vol. 11, no. 3, pp.
571–583, 2015,
https://doi.org/10.1109/TIFS.2015.2506546.
[6] A. T. Arief, W. Wirawan, and Y. K. Suprapto,
“Authentication of printed document using
quick response (qr) code,” in 2019 International
Seminar on Intelligent Technology and Its Appli-
cations (ISITIA). Surabaya, Indonesia: IEEE,
2019, pp. 228–233,
https://doi.org/10.1109/ISITIA.2019.8937084.
[7] M. Salleh and T. C. Yew, “Application of 2d
barcode in hardcopy document verification
system,” in Advances in Information Security
and Assurance: Third International Conference
and Workshops, ISA 2009, Seoul, Korea, June
25-27, 2009. Proceedings 3. Seoul, Korea:
Springer, 2009, pp. 644–651,
https://doi.org/10.1007/978-3-642-02617-1_
65.
[8] A. Husain, M. Bakhtiari, and A. Zainal, “Printed
document integrity verification using barcode,”
Journal Teknologi (Sciences and Engineering),
vol. 70, no. 3, pp. 99–106, 2014.
[9] C. M. Li, P. Hu, and W. C. Lau, “Authpaper:
Protecting paper-based documents and creden-
tials using authenticated 2d barcodes,” in 2015
IEEE International Conference on Communica-
tions (ICC). London, UK: IEEE, 2015, pp.
7400–7406,
https://doi.org/10.1109/ICC.2015.7249509.
[10] M. A. A. Alameri, B. Ciylan, and B. Mahmood,
“Computational methods for forgery detec-
tion in printed official documents,” in 2022
ASU International Conference in Emerging
Technologies for Sustainability and Intelligent
Systems (ICETSIS). Alexandria, Egypt: IEEE,
2022, pp. 307–313,
https://doi.org/10.1109/ICETSIS55481.2022.
9888875.
[11] M. D. R. Zainuddin and K. Y. Choo, “De-
sign a document verification system based
on blockchain technology,” in Multimedia
University Engineering Conference (MECON
2022). Melaka, Malaysia: Atlantis Press,
2022, pp. 229–244,
https://doi.org/10.2991/978-94-6463-082-4_
23.
[12] O. Ghazali and O. S. Saleh, “A graduation cer-
tificate verification model via utilization of the
blockchain technology,” Journal of Telecommu-
nication, Electronic and Computer Engineering
(JTEC), vol. 10, no. 3-2, pp. 29–34, 2018.
[13] J. G. Dongre, S. M. Tikam, and V. B. Gharat,
“Education degree fraud detection and student
certificate verification using blockchain,” Int. J.
Eng. Res. Technol, vol. 9, no. 4, pp. 300–303,
2020.
[14] M. R. Suganthalakshmi, M. C. Praba, M. K.
Abhirami, M. S. Puvaneswari, and A. Prof,
“Blockchain based certificate validation sys-
tem,” 2022. [Online]. Available: https://www.
irjmets.com/uploadedfiles/paper//issue_7_july_
2022/28889/final/fin_irjmets1659003745.pdf
[15] S. Jayalakshmi and Y. Kalpana, “A pri-
vate blockchain-based distributed ledger stor-
age structure for enhancing data security of aca-
demic documents.” Grenze International Jour-
nal of Engineering & Technology (GIJET),
vol. 9, no. 1, pp. 25–35, 2023.
[16] F. M. Enescu, N. Bizon, and V. M. Ionescu,
“Blockchain technology protects diplomas
against fraud,” in 2021 13th International
Conference on Electronics, Computers and Ar-
tificial Intelligence (ECAI). Pitesti, Romania:
IEEE, 2021, pp. 1–6,
https://doi.org/10.1109/ECAI52376.2021.
9515107.
[17] A. Gayathiri, J. Jayachitra, and S. Matilda,
“Certificate validation using blockchain,” in
2020 7th International Conference on Smart
Structures and Systems (ICSSS). Chennai,
India: IEEE, 2020, pp. 1–4,
https://doi.org/10.1109/ICSSS49621.2020.
9201988.
[18] I. T. Imam, Y. Arafat, K. S. Alam, and S. A.
Shahriyar, “Doc-block: A blockchain based
authentication system for digital documents,”
in 2021 Third International Conference on
Intelligent Communication Technologies and
Virtual Mobile Networks (ICICV). Vellore,
India: IEEE, 2021, pp. 1262–1267,
https://doi.org/10.1109/ICICV50876.2021.
9388428.
[19] N. Malsa, V. Vyas, J. Gautam, A. Ghosh,
and R. N. Shaw, “Certbchain: a step by step
approach towards building a blockchain based
distributed application for certificate verifica-
tion system,” in 2021 IEEE 6th International
Conference on Computing, Communication and
Automation (ICCCA). Greater Noida, India:
IEEE, 2021, pp. 800–806,
https://doi.org/10.1109/ICCCA52192.2021.
9666311.
[20] A. D. B. Machado, M. Sousa, and F. D. S.
Pereira, “Applications of blockchain technology
to education policy,” Applications of blockchain
technology to education policy, pp. 157–163,
2019.
[21] V. Yfantis and K. Ntalianis, “A blockchain plat-
form for teaching services among the students,”
WSEAS Transactions on Advances in Engineer-
ing Education, vol. 19, pp. 141–146, 2022.
[22] M. Shanahan, “Talking about large language
models,” Communications of the ACM, vol. 67,
no. 2, pp. 68–79, 2024,
https://doi.org/10.1145/3624724.
[23] T. Brown, B. Mann, N. Ryder, M. Subbiah,
J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan,
R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Win-
ter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner,
S. McCandlish, A. Radford, I. Sutskever, and
D. Amodei, “Language models are few-shot
learners,” Advances in neural information pro-
cessing systems, vol. 33, pp. 1877–1901, 2020.
[24] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
G. Mishra, A. Roberts, P. Barham, H. W.
Chung, C. Sutton, S. Gehrmann, P. Schuh,
K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao,
P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran,
E. Reif, N. Du, B. Hutchinson, R. Pope, J. Brad-
bury, J. Austin, M. Isard, G. Gur-Ari, P. Yin,
T. Duke, A. Levskaya, S. Ghemawat, S. Dev,
H. Michalewski, X. Garcia, V. Misra, K. Robin-
son, L. Fedus, D. Zhou, D. Ippolito, D. Luan,
H. Lim, B. Zoph, A. Spiridonov, R. Sepassi,
D. Dohan, S. Agrawal, M. Omernick, A. M.
Dai, T. S. Pillai, M. Pellat, A. Lewkowycz,
E. Moreira, R. Child, O. Polozov, K. Lee,
Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat,
M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck,
J. Dean, S. Petrov, and N. Fiedel, “Palm: Scal-
ing language modeling with pathways,” Journal
of Machine Learning Research, vol. 24, no. 240,
pp. 1–113, 2023.
[25] R. Taylor, M. Kardas, G. Cucurull, T. Scialom,
A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez,
and R. Stojnic, “Galactica: A large lan-
guage model for science,” arXiv preprint
arXiv:2211.09085, 2022,
https://doi.org/10.48550/arXiv.2211.09085.
[26] H. Touvron, T. Lavril, G. Izacard, X. Mar-
tinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
N. Goyal, E. Hambro, F. Azhar, A. Rodriguez,
A. Joulin, E. Grave, and G. Lample, “Llama:
Open and efficient foundation language mod-
els,” arXiv preprint arXiv:2302.13971, 2023,
https://doi.org/10.48550/arXiv.2302.13971.
[27] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang,
K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang,
W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang,
and X. Xie, “A survey on evaluation of large
language models,” ACM Transactions on Intel-
ligent Systems and Technology, vol. 15, no. 3,
pp. 1–45, 2024,
https://doi.org/10.1145/3641289.
[28] W. X. Zhao, K. Zhou, J. Li, T. Tang,
X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang,
Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen,
J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu,
P. Liu, J.-Y. Nie, and J.-R. Wen, “A sur-
vey of large language models,” arXiv preprint
arXiv:2303.18223, 2023,
https://doi.org/10.48550/arXiv.2303.18223.
[29] S. M. Thede and M. Harper, “A second-order
hidden markov model for part-of-speech tag-
ging,” in Proceedings of the 37th annual meet-
ing of the Association for Computational Lin-
guistics, 1999, pp. 175–182.
[30] L. R. Bahl, P. F. Brown, P. V. De Souza, and
R. L. Mercer, “A tree-based statistical language
model for natural language speech recognition,”
IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. 37, no. 7, pp. 1001–
1008, 1989,
https://doi.org/10.1109/29.32278.
[31] T. Brants, A. Popat, P. Xu, F. J. Och, and
J. Dean, “Large language models in machine
translation,” in Proceedings of the 2007 Joint
Conference on Empirical Methods in Natu-
ral Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL),
2007, pp. 858–867.
[32] X. Liu and W. B. Croft, “Statistical language
modeling for information retrieval,” Annual Review of
Information Science and Technology, vol. 39, no. 1, pp. 1–31, 2005.
[33] C. Zhai, “Statistical language models for infor-
mation retrieval a critical review,” Foundations
and Trends® in Information Retrieval, vol. 2,
no. 3, pp. 137–213, 2008,
https://doi.org/10.1561/1500000008.
[34] T. Mikolov, M. Karafiát, L. Burget, J. Černocký,
and S. Khudanpur, “Recurrent neural network
based language model,” in Interspeech,
Makuhari, 2010, pp. 1045–1048.
[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean,
“Efficient estimation of word representations in
vector space,” arXiv preprint arXiv:1301.3781,
2013,
https://doi.org/10.48550/arXiv.1301.3781.
[36] J. Pennington, R. Socher, and C. D. Manning,
“GloVe: Global vectors for word representa-
tion,” in Proceedings of the 2014 conference on
empirical methods in natural language process-
ing (EMNLP), 2014, pp. 1532–1543.
[37] P. Bojanowski, E. Grave, A. Joulin, and
T. Mikolov, “Enriching word vectors with sub-
word information,” Transactions of the associ-
ation for computational linguistics, vol. 5, pp.
135–146, 2017,
https://doi.org/10.1162/tacl_a_00051.
[38] R. Collobert, J. Weston, L. Bottou, M. Karlen,
K. Kavukcuoglu, and P. Kuksa, “Natural lan-
guage processing (almost) from scratch,” Jour-
nal of machine learning research, vol. 12, pp.
2493–2537, 2011.
[39] M. E. Peters, M. Neumann, M. Iyyer, M. Gard-
ner, C. Clark, K. Lee, and L. Zettlemoyer,
“Deep contextualized word representa-
tions,” arXiv preprint arXiv:1802.05365, 2018,
https://doi.org/10.48550/arXiv.1802.05365.
[40] J. Kaplan, S. McCandlish, T. Henighan, T. B.
Brown, B. Chess, R. Child, S. Gray, A. Rad-
ford, J. Wu, and D. Amodei, “Scaling laws
for neural language models,” arXiv preprint
arXiv:2001.08361, 2020,
https://doi.org/10.48550/arXiv.2001.08361.
[41] S. M. Xie, A. Raghunathan, P. Liang, and
T. Ma, “An explanation of in-context learning
as implicit Bayesian inference,” arXiv preprint
arXiv:2111.02080, 2021,
https://doi.org/10.48550/arXiv.2111.02080.
[42] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W.
Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le,
“Finetuned language models are zero-shot learn-
ers,” arXiv preprint arXiv:2109.01652, 2021,
https://doi.org/10.48550/arXiv.2109.01652.
[43] J. Wei, X. Wang, D. Schuurmans, M. Bosma,
B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou,
“Chain-of-thought prompting elicits reasoning
in large language models,” Advances in neu-
ral information processing systems, vol. 35, pp.
24824–24837, 2022.
[44] J. Hoffmann, S. Borgeaud, A. Mensch,
E. Buchatskaya, T. Cai, E. Rutherford,
D. de Las Casas, L. A. Hendricks, J. Welbl,
A. Clark, T. Hennigan, E. Noland, K. Millican,
G. van den Driessche, B. Damoc, A. Guy,
S. Osindero, K. Simonyan, E. Elsen, J. W. Rae,
O. Vinyals, and L. Sifre, “Training compute-
optimal large language models,” arXiv preprint
arXiv:2203.15556, 2022,
https://doi.org/10.48550/arXiv.2203.15556.
[45] J. Rasley, S. Rajbhandari, O. Ruwase, and
Y. He, “DeepSpeed: System optimizations en-
able training deep learning models with over
100 billion parameters,” in Proceedings of the
26th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining, 2020,
pp. 3505–3506,
https://doi.org/10.1145/3394486.3406703.
[46] M. Shoeybi, M. Patwary, R. Puri, P. LeGres-
ley, J. Casper, and B. Catanzaro, “Megatron-
LM: Training multi-billion parameter language
models using model parallelism,” arXiv preprint
arXiv:1909.08053, 2019,
https://doi.org/10.48550/arXiv.1909.08053.
[47] L. Ouyang, J. Wu, X. Jiang, D. Almeida,
C. Wainwright, P. Mishkin, C. Zhang, S. Agar-
wal, K. Slama, A. Ray, J. Schulman, J. Hilton,
F. Kelton, L. Miller, M. Simens, A. Askell,
P. Welinder, P. F. Christiano, J. Leike, and
R. Lowe, “Training language models to follow
instructions with human feedback,” Advances in
neural information processing systems, vol. 35,
pp. 27730–27744, 2022.
[48] P. F. Christiano, J. Leike, T. Brown, M. Martic,
S. Legg, and D. Amodei, “Deep reinforcement
learning from human preferences,” Advances in
neural information processing systems, vol. 30,
2017.
[49] R. Nakano, J. Hilton, S. Balaji, J. Wu,
L. Ouyang, C. Kim, C. Hesse, S. Jain,
V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe,
T. Eloundou, G. Krueger, K. Button, M. Knight,
B. Chess, and J. Schulman, “WebGPT: Browser-
assisted question-answering with human feed-
back,” arXiv preprint arXiv:2112.09332, 2021,
https://doi.org/10.48550/arXiv.2112.09332.
[50] R. T. C. Kwok and M. Beale, “Aperiodic lin-
ear complexities of de Bruijn sequences,” in Ad-
vances in Cryptology—CRYPTO’88: Proceed-
ings 8. Springer, 1990, pp. 479–482,
https://doi.org/10.1007/0-387-34799-2_33.
[51] S. Haber and W. S. Stornetta, “Secure names for
bit-strings,” in Proceedings of the 4th ACM Con-
ference on Computer and Communications Se-
curity, 1997, pp. 28–35.
[52] J. G. M. Mboma, O. T. Tshipata, W. V. Kam-
bale, and K. Kyamakya, “Assessing how large
language models can be integrated with or
used for blockchain technology: Overview
and illustrative case study,” in 2023 27th
International Conference on Circuits, Systems,
Communications and Computers (CSCC).
Rhodes (Rodos) Island, Greece: IEEE, 2023,
pp. 59–70,
https://doi.org/10.1109/CSCC58962.2023.
00018.
[53] A. Narayanan, J. Bonneau, E. Felten, A. Miller,
and S. Goldfeder, Bitcoin and cryptocurrency
technologies: a comprehensive introduction.
Princeton University Press, 2016.
[54] G. Fox, “Peer-to-peer networks,” Computing in
Science & Engineering, vol. 3, no. 3, pp. 75–77,
2001,
https://doi.org/10.1109/5992.919270.
[55] M. Swan, Blockchain: Blueprint for a new econ-
omy. O’Reilly Media, Inc., 2015.
[56] Z. Zheng, S. Xie, H. Dai, X. Chen, and H. Wang,
“An overview of blockchain technology: Ar-
chitecture, consensus, and future trends,” in
2017 IEEE international congress on big data
(BigData congress). IEEE, 2017, pp. 557–564,
https://doi.org/10.1109/BigDataCongress.2017.
85.
[57] M. N. M. Bhutta, A. A. Khwaja, A. Nadeem,
H. F. Ahmad, M. K. Khan, M. A. Hanif,
H. Song, M. Alshamari, and Y. Cao, “A survey
on blockchain technology: Evolution, archi-
tecture and security,” IEEE Access, vol. 9, pp.
61048–61073, 2021,
https://doi.org/10.1109/ACCESS.2021.
3072849.
[58] C. V. B. Murthy, M. L. Shri, S. Kadry, and
S. Lim, “Blockchain based cloud computing:
Architecture and research challenges,” IEEE
Access, vol. 8, pp. 205190–205205, 2020,
https://doi.org/10.1109/ACCESS.2020.
3036812.
[59] S. Ghimire and H. Selvaraj, “A survey on
Bitcoin cryptocurrency and its mining,” in
2018 26th International Conference on Systems
Engineering (ICSEng). IEEE, 2018, pp. 1–6,
https://doi.org/10.1109/ICSENG.2018.
8638208.
[60] M. S. Ferdous, M. J. M. Chowdhury, M. A.
Hoque, and A. Colman, “Blockchain consen-
sus algorithms: A survey,” arXiv preprint
arXiv:2001.07091, 2020,
https://doi.org/10.48550/arXiv.2001.07091.
[61] E. Androulaki, A. Barger, V. Bortnikov,
C. Cachin, K. Christidis, A. D. Caro, D. Enyeart,
C. Ferris, G. Laventman, Y. Manevich,
S. Muralidharan, C. Murthy, B. Nguyen,
M. Sethi, G. Singh, K. Smith, A. Sorniotti,
C. Stathakopoulou, M. Vukolić, S. W. Cocco,
and J. Yellick, “Hyperledger fabric: a dis-
tributed operating system for permissioned
blockchains,” in Proceedings of the thirteenth
EuroSys conference, 2018, pp. 1–15,
https://doi.org/10.1145/3190508.3190538.
[62] M. Castro and B. Liskov, “Practical Byzantine
fault tolerance,” in OSDI, vol. 99, 1999,
pp. 173–186.
[63] M. Alharby and A. Van Moorsel, “Blockchain-
based smart contracts: A systematic mapping
study,” arXiv preprint arXiv:1710.06372, 2017,
https://doi.org/10.48550/arXiv.1710.06372.
[64] M. Dabbagh, M. Kakavand, M. Tahir, and
A. Amphawan, “Performance analysis of
blockchain platforms: Empirical evaluation
of Hyperledger Fabric and Ethereum,” in 2020
IEEE 2nd International conference on artificial
intelligence in engineering and technology
(IICAIET). IEEE, 2020, pp. 1–6,
https://doi.org/10.1109/IICAIET49801.2020.
9257811.
[65] G. Caldarelli, “Overview of blockchain oracle
research,” Future Internet, vol. 14, no. 6, p. 175,
2022,
https://doi.org/10.3390/fi14060175.
[66] J. Benet, “IPFS - content addressed, ver-
sioned, P2P file system,” arXiv preprint
arXiv:1407.3561, 2014,
https://doi.org/10.48550/arXiv.1407.3561.
[67] D. Trautwein, A. Raman, G. Tyson, I. Castro,
W. Scott, M. Schubotz, B. Gipp, and Y. Psaras,
“Design and evaluation of IPFS: a storage layer
for the decentralized web,” in Proceedings of the
ACM SIGCOMM 2022 Conference, 2022, pp.
739–752,
https://doi.org/10.1145/3544216.3544232.
[68] C. Helbling, “Directed graph hashing,” arXiv
preprint arXiv:2002.06653, 2020,
https://doi.org/10.48550/arXiv.2002.06653.
[69] P. Maymounkov and D. Mazieres, “Kademlia:
A peer-to-peer information system based on the
xor metric,” in International Workshop on Peer-
to-Peer Systems. Springer, 2002, pp. 53–65,
https://doi.org/10.1007/3-540-45748-8_5.
[71] M. Mboma, “document-certification-with-
blockchain-authority-dashboard,” 2023, ac-
cessed: Sep. 16, 2023. [Online]. Available:
https://t.ly/w0Uj7
Contribution of Individual Authors to the Cre-
ation of a Scientific Article (Ghostwriting Policy)
The authors contributed equally to the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US