Detecting Plagiarism in Student Assignments using Source Code

Analysis

SERHII MYRONENKO, YELYZAVETA MYRONENKO

System Design Department, Institute for Applied System Analysis,

National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute",

37, Peremohy Ave, Kyiv, 03056,

UKRAINE

Abstract: - The issue of accurately detecting semantically equivalent code remains a current problem for

students, teachers, and researchers. The goal of this article was to develop an alternative method for checking

software code for plagiarism using abstract syntax trees. To achieve this, an analysis of existing modern

methods of automatic comparison of original and modified program code for plagiarism detection was

conducted, focusing on abstract syntax trees, approaches based on machine learning, and token-based analysis.

Systems such as MOSS, JPlag, Codequiry, and Plaggie were reviewed. As a result of this work, it can be noted

that the proposed methods have sufficiently high indicators in the range of 79-85% for semantically equivalent

code and 93-95% similarity in cases demonstrating hash violation problems, which shows their potential

effectiveness in solving various tasks related to the detection of software code plagiarism. The practical value

of the research lies in the possibility of using the proposed method for checking small student assignments for

originality. Future research on this topic could include adding functionality for document collection, reducing

the time required for document verification, expanding the number of programming languages, and improving

system efficiency while operating large code fragments.

Key-Words: - Source Code Plagiarism, BERT, Code2Vec, PLP, Transformers, AST.

Received: January 19, 2024. Revised: July 2, 2024. Accepted: August 1, 2024. Published: September 4, 2024.

1 Introduction

Currently, the problem of plagiarism in student’s

works at higher educational institutions is quite

serious. This is linked to the fact that, as indicated in

articles, [1], [2], [3], the majority of experienced

students do not consider "borrowing" someone else's

work without citation as a violation of academic

integrity. Furthermore, some students rely on the

notion that information should be free and can be

"reused." As a result, learning, which is already

advantaged by a large number of practical and

laboratory works, becomes significantly time-

saving, even if it involves copying articles from the

Internet, which only takes a few minutes compared

to independent problem analysis, [4]. Another

argument for indulging in plagiarism for a student is

the awareness that their peers are also copying

works and receiving high grades for it.

To effectively assess the scale of the problem of

appearances in student works, existing research on

this topic was analyzed, which fully demonstrates

the problem's relevance. Several sociological

experiments, which were conducted both for

students and for lecturers at higher educational

institutions, were analyzed. Also were analyzed

articles, [2], [3], [5], [6], [7], which are dedicated to

the study of student psychology and their attitude

towards the problem of plagiarism. Several key

thoughts of the authors, which underline the

seriousness of the problem of plagiarism and, as a

consequence, the relevance of this work, will be

presented.

In the work, [5], the author considers that

plagiarism and cheating are natural, and people are

not particularly distinguished in using such a

deception tactic to gain a competitive advantage. At

the same time, the author points out that "reasonable

intervention" in this process is possible, and

numerous resources for lecturers can help create a

culture of integrity. The authors, [2], present similar

thoughts, and list several reasoned causes that lead

to the violation of academic integrity rules:

 Lack of time;

 Helping another in completing a task;

 The student does not understand the scope

of work required and is overwhelmed with

the task;

 External pressure (for example, parents

pushing for high grades);

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

367

Volume 12, 2024

 The student is unable to complete the task

independently;

 Laziness;

 The belief is that all other students are also

cheating.

The authors emphasize that the problem of

plagiarism needs to be solved at several levels –

from plagiarism detection systems to appropriate

normative acts and official rules and procedures. It

is also worth paying attention to the article, [2],

which describes in detail the students' point of view

on the problem of plagiarism. The authors of this

work fully share the higher education thoughts and

assertions and also present several key aspects:

 Students do not always clearly understand what

constitutes plagiarism. A comparable observation

was also noted in the article, [8];

 In different countries, there are different

understandings of plagiarism (Thus, without clear

rules, a student may not fully understand what is

considered plagiarism in one educational

institution).

 In this study, a sociological investigation was

conducted to understand students' perceptions of

what constitutes plagiarism in programming code.

The findings revealed that a significant number of

students do not fully grasp the concept of

plagiarism. For instance, according to materials

from the article, more than 50% of experienced

students believe that taking program code from a

book and rewriting it in another language is not

plagiarism. However, 94% of respondents

acknowledge that directly copying program code

from a book violates academic integrity rules.

After analyzing various articles, [2], [3], [5], [6],

[7], the following conclusions can be drawn (all

statistical data are taken from the materials of the

works):

 A large number of students significantly

underestimate the number of plagiarism incidents in

other works, which encourages them to violate

academic integrity rules;

 The majority of experienced students (more

than 85%) are neutral or positive towards the

practice of copying other works;

 Nearly 70% (mostly first-year students) support

the absence of punishment for plagiarism or limited

verbal warnings;

According to information from the article, [9],

most educators, and specifically 97% of experienced

ones, do not react to instances of academic

dishonesty in practice, with 60% of participants

limiting their response to verbal warnings.

Moreover, only 5% of experienced participants

allow the use of additional printed materials during

exams and other types of assessment works,

confirming the effectiveness of this practice in

combating plagiarism.

To highlight the prevalence of code plagiarism in

recent years, the following statistics from

researchers are provided:

 From the information from the article, [10], it

was confirmed that whereas in the 1940s only 20%

of college students admitted to cheating, nowadays

the percentage has increased to 75 - 98%.

 Approximately 42% of computer science

students admitted to copying code from other

students – that is the result of the survey, [11].

 Moreover, according to the article, [12], it was

proven that from 50% to 79% of students will

plagiarize at least once in their academic career.

As a premise, it can be concluded that the issue

of plagiarism in student works requires detailed

attention. The main justifications for performing this

research are maintaining academic integrity and the

potential support of academic institutions.

In the field of Information Technology (IT),

plagiarism can be broadly categorized into two

groups:

 Text plagiarism;

 Source code plagiarism.

A more detailed classification of plagiarism

types is presented in Figure 1, based on the

materials of works, [13], [14].

Notably, this work will focus on one of these

plagiarism groups, specifically source code

plagiarism. Further, more detailed descriptions of

the subgroups of program code plagiarism will be

provided. These include:

 Refactoring of structure: The developer

changes the order of functions and/or modifies the

syntax without altering the program's source.

 Addition of additional tabulation, spaces, and

comments, and further development of the work as

their own.

 Change of programming language: The

programmer rewrites the code in another

programming language without acknowledging the

authors of the original program code.

To prepare for this study, several works of other

researchers on the topic of code plagiarism detection

were analyzed. After careful studying of the selected

works, it is worth mentioning some of them,

particularly, [15], [16], [17] in more detail. The first

work for analysis was an article, [15], about the

DeepSim system. After analysis of this work, it can

be stated that the main advantages of this system are

feature learning, scalability, and generalization.

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

368

Volume 12, 2024

Fig. 1: Classification of plagiarism types

On the other hand, the main disadvantages of

this system are overfitting, interpretability, and

robustness.

In the article, [16], the use of natural language

processing methods was explored. The main pros of

this work are code reusability, semantic

comprehension, and context relevance. At the same

time, the cons of this research are domain

specificity, limited vocabulary, and semantic gap.

It is also worth noting research, [17]. The

authors of this article studied the use of different

embedding techniques to represent code snippets in

a continuous vector space for efficient code

plagiarism detection. The main concerns for this

study are data quality, hyperparameter tuning, and

interpretability, whereas, on the other hand, the

potential benefits of this work are semantic

understanding, efficiency, scalability, and

robustness.

The objective of this research was to propose an

alternative method for verifying program code for

plagiarism using abstract syntax trees and neural

networks. Consequently, the following tasks of this

research were identified:

 To analyze popular systems for analyzing

program code for plagiarism and identify the most

effective among them;

 To determine the most effective system in

terms of versatility (if possible);

 To propose an alternative approach for solving

the task of detecting plagiarism in program code

using abstract syntax trees and neural networks;

 To compare the results obtained from the

analysis.

2 Methodology

This research is based on data collected from

existing systems and methods for detecting

similarities in program code, necessary for

analyzing their effectiveness in identifying

plagiarism in program code. The work employs and

combines a range of theoretical research methods:

logical method, analysis, synthesis, classification,

generalization and analogy, comparison, induction,

and interpretation. The methodological basis of the

research consists of modern methodologies aimed at

model structures designed for detecting plagiarism

in program code.

In general, the procedure of this research can be

divided into the following main stages:

1. Formation of a selection of existing systems

for comparison;

2. Conducting a general comparative analysis of

the possibilities of the systems;

3. Analysis of the results obtained with the

subsequent selection of the two best systems;

4. Conducting a comparative analysis of the

selected systems for accuracy in detecting program

code plagiarism;

Types of plagiarism

Text plagiarism

Intellectual

Self-plagiarism

Mosaic paraphrase

Plagiarism of an idea, approach,etc

Plagiarism with the use of metaphors

Plagiarism from illegitimate sources

Paraphrase

Literal

Copying/duplication

Source code plagiarism

Structure refactoring

Adding additional tabs, indents,

spaces, etc

Change of programming language

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

369

Volume 12, 2024

5. Analysis of the results of the comparison,

identifying potential shortcomings of the systems;

6. Proposing alternative methods for detecting

program code plagiarism and their practical

comparison;

7. Analysis of the obtained results and

conclusions.

As can be seen from Table 1, for comparison, we

selected 4 existing systems currently designated for

checking source code for plagiarism: MOSS, JPlag,

Codequiry, and Plaggie.

The choice of MOSS, JPlag, Codequiry, and

Plaggie for comparative analysis with other

plagiarism detection systems can be justified based

on several key factors that make these tools

particularly relevant and representative in the

context of code plagiarism detection:

2.1 Various Methodologies

Each of these systems employs different algorithms

and methods for detecting similarities in code. This

diversity allows for a comprehensive understanding

of various approaches to detecting plagiarism in

programming. To highlight the contrast between the

selected systems, it is worth mentioning that MOSS

uses a hash-comparison algorithm for detecting

matches, whereas JPlag is based on searching for

structural similarity, [18]. Whilst the Codequiry

system browses a large library of samples, which

includes web resources as well, Plaggie proposes

customizable algorithms for code similarity

detection, [19].

2.2 Popularity and Usage

The plagiarism checker software used for this study

is reputable and applied in different educational

institutions. For example, MOSS is becoming a

benchmark in universities because of its reliability,

which is a crucial point in any kind of study.

2.3 Support for Various Languages

The selected systems can be called versatile and

applicable to a large variety of academic tasks due

to their multiple programming language support.

For example, JPlag is a solid choice for

programming courses with a free selection of coding

language, because the system is known for its

“language independence”, [18].

2.4 Accessibility and Transparency

Transparency is one of the most important criteria

for academic studies, due to the importance of

detection procedure and its value to the whole

learning process.

Moreover, the free accessibility of plagiarism

detection tools which provide transparent results,

makes them effective options for practical research

and improvement of the educational process.

2.5 Different Target Audiences and Usage

Scenarios

All the above mentioned tools are eligible for

particular tasks and work environments.

To elaborate on the previous point, the

Codequiry system can effectively evaluate modern

programming assignments due to its focus on deep

analysis and document comparison to the vast

library of sources, [20].

Clear knowledge of each system functioning in

different cases can significantly help educators and

developers choose a plagiarism detection system

according to their needs.

To summarize, the comparison of MOSS, JPlag,

Codequiry, and Plaggie is based on the next

important characteristics:

 Multiple programming languages support;

 Transparency;

 Ability to adapt to different use-case

scenarios;

 Prevalence of use;

 variety of methodologies.

This comparison study will give educators and

developers a comprehension of plagiarism detection

tools as well as an understanding on methods of

improving the educational process.

3 Results

As a part of this study, a comparative analysis of

source code plagiarism systems was held based on

selected criteria that can help evaluate the efficiency

of the selected system. The study focuses on the

following characteristics: ease of use, template code

exclusion, scalability, clear and understandable

results, capability to be used as a web service,

support for multiple programming languages, and

open source code availability. Definitions of the

selected criteria are provided below.

Ease of Use

One of the most important criteria for plagiarism

detection systems. Consequentially, systems with

user-friendly and intuitive interfaces will be the

obvious choice for teachers and students. As a

result, ease of use greatly contributes to the

increasing efficiency of the learning process and

time-saving.

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

370

Volume 12, 2024

Template Code Exclusion

This function helps to ensure the accuracy of the

code check and significantly reduces the number of

false positives. Template code removal is especially

useful for student assignment control as such tasks

often include some generic code, [21].

Scalability

In our digital era, working with the large

volumes of data is not a fancy feature, it is a

necessity. The ability to work with large sets of

documents simultaneously, fast adapting to different

data sizes, and ensuring that the required

computational power is provided are the necessary

characteristics of an effective plagiarism detection

system.

Clear and Understandable Results

Clear and understandable results are among the

most vital sources of information for educators.

Well-structured results provide the necessary

feedback for teachers, which greatly helps to

improve the educational process.

Capability to be used as a web service

A plagiarism detection system, that can be used

as a web service provides great opportunities for

people who work in different places. For example,

such a system is a convenient choice for teachers

who are regularly checking student assignments at

home. Usually, web services are always highly

maintained to stay up-to-date.

Support for Multiple Programming Languages

Support of multiple programming languages

makes the plagiarism detection system more

versatile and, as a result, it can be applied to a rather

wide variety of educational tasks. Furthermore, it

allows educational establishments to use a single

plagiarism detection system for their work.

Open Source Code Availability in Plagiarism

Detection Systems

Plagiarism detection systems with the open

source code are a proper choice for educational

institutions that value potential adaptability and

transparency. Furthermore, the availability of the

system’s source code can help to evaluate its

effectiveness, along with motivating the community

to improve existing functionality and deliver new

features.

Table 1 demonstrates the results of the

comparative study.

To sum it up, after examining the obtained

results, it can be ascertained that, according to the

selected characteristics, the MOSS system is the

most efficient.

Unfortunately, one of the candidates for

evaluation, the Codequiry tool, could not be fully

tested due to the absence of certain system functions

in open access.

The next step in the comparative analysis is to

compare the two best systems based on the results

of the previous evaluation (MOSS and JPlag) for

accuracy in detecting plagiarism (including all types

of code similarity). The results of this comparison

are presented in Table 2.

Table 1. Results of the Comparative Analysis of selected Plagiarism Detection Systems (MOSS, JPlag,

Codequiry, Plaggie) according to the chosen criteria

№

Characteristics of the comparison

MOSS

JPlag

Codequiry

Plaggie

Ease of use

Yes

Yes (free functional

was checked)

Template code exclusion

Yes

Scalability

Clear and understandable results;

result demonstration

Yes, meets all requirements

Yes, meets all

requirements

Yes (free functional

was checked)

Yes, meets all

requirements

Capability to be used as a web

service

Yes

Open source code availability

Absent, but several projects are

using MOSS algorithm

Absent

Support for multiple programming

languages

More than 20

Table 2. Results of the Comparative Analysis of the Two Best Systems (MOSS and JPlag) for Accuracy in

Detecting Plagiarism

Type of source code similarity

MOSS

JPlag

Identical code fragments

100 %

Changed variable names and/or their types

100 %

Added or removed individual operators

92 %

90 %

Semantically equivalent code

64 %

55 %

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

371

Volume 12, 2024

From the comparative table, it is evident that

both systems effectively handle the detection of the

1st and 2nd types of code similarity. The 3rd type of

code similarity was also detected quite effectively,

but the MOSS system showed more precise results.

Regarding the 4th type of similarity, the detection of

semantically equivalent code, both systems show

relatively low accuracy, but MOSS performs

slightly better than JPlag.

At first glance, the MOSS system appears quite

effective in detecting various types of plagiarism,

but it is not as effective after considering the general

principles of MOSS's operation and the creation of

systems that can "deceive" MOSS by using critical

system vulnerabilities, such as hash collision. The

problem of hash collision lies in the fact that if a

token is added to a hash window of length n, it

significantly reduces the effectiveness of MOSS. In

general, Mossad can generate up to 12 source code

variants, and after checking for plagiarism, the

original and MOSS do not show more than 25%

similarity. Due to this vulnerability, this work

proposes variants of program code testing for the

presence of obfuscation, which would be robust

against this vulnerability.

The approach proposed in this work involves a

shift from the traditional MOSS algorithm to more

o. The proposed block diagram is shown in Figure

For implementing the algorithm depicted in

Figure 2, two variants of realization for checking the

presence of obfuscation are proposed. A multilayer

perceptron was chosen for the first implementation

that compares vector representations of

programming code received from Code2Vec, [17],

[22]. As the second option, the existing TreeBERT

system, [23], [24], was modified to detect

plagiarism in source code. TreeBERT system uses

an enhanced vector representation, [24], [25], [26],

[27], whereas the first realization is working with

‘classic’ implementation using standard

representations received from Code2Vec.Each of

the selected approaches has its crucial strong points.

For instance, the first one is suitable for large-scale

programs as its primary focus is on the semantic

content of the code. The second one also has its

benefits: this approach offers more possibilities in

the field of structural and contextual analysis, which

is vital for complex plagiarism checks. The key

difference is that, unlike other systems, the proposed

models can work with different programming code

manipulations and coding styles, along with

ensuring high accuracy of plagiarism detection.

The choice of the proposed implementations was

driven by the need for a complex approach that

contains the main strengths of an effective

plagiarism detection system: adaptability to

different plagiarism scenarios, precise document

checking, and efficiency.

After implementing both proposed variants, a

comparative analysis of the effectiveness of

detecting various types of plagiarism in source code

was conducted. The results of this analysis are

presented in Table 3.

Fig. 2: Proposed Block Diagram of the Algorithm

Table 3. Results of the Comparative Analysis of the Proposed Solutions with the MOSS System

Type of plagiarism

TreeBERT

Multilayer perceptron + Code2Vec

MOSS

Identical source code fragments

100 %

Changed variable names and/or their types

95 %

94 %

100 %

Added or removed individual operators

95 %

92 %

Semantically equivalent code

79 %

85 %

64 %

Example of hash violation problem

93 %

98 %

10 %

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

372

Volume 12, 2024

It is important to note that the MOSS system is

present in this experiment to evaluate the

effectiveness of other proposed methods. It is also

significant that all analyzed pairs of examples

during this check contain plagiarism. Therefore, for

maximum evaluation, which the system can provide,

we consider 100%. A total of 5 types of situations

were examined (4 of them demonstrate different

types of program code plagiarism, and the 5th case

represents a problem of hash collision). Compared

to the MOSS system, neither developed method is

sensitive to the hash collision problem. In cases

where the MOSS system is vulnerable to this

problem and shows a similar result of only 10%, the

proposed methods achieve results of 93% and 98%,

respectively. Thus, the developed models are better

at detecting the fourth type of plagiarism, namely

semantically equivalent code.

When comparing the two developed models, it

can be concluded that the multilayer perceptron +

Code2Vec has better performance indicators.

3.1 Limitations of the Study

One of the limitations of this study is the incomplete

testing of the existing Codequiry system due to the

absence of a free trial version of the system.

Another limitation is the limited number of

programming languages for the proposed systems,

as the research and analysis of these systems'

performance were conducted using a selection of

code fragments in the Java programming language.

Additionally, the study faced limitations related to

the small size of the program code fragments used

for training and testing the systems.

4 Discussion

Today, several systems and methods have been

developed for detecting plagiarism, which helps

identify fragments of program code that are

plagiarized, and systems for checking program code

that can detect semantically equivalent code.

However, as numerical studies show, current

systems for detecting program code plagiarism can

very successfully identify identical program code

and code with changed variable names and

constants, and they can quite well identify code

fragments with added or removed operators. The

detection of semantically equivalent code is not as

successful. Another issue is the vulnerability to hash

collision attacks, which the system fails to recognize

as plagiarism. This vulnerability is present in one of

the popular systems for detecting plagiarism,

MOSS, [28], [29], [30], [31], [32]. The work, [29],

details the problem of hash collision and how the

attack works. It was also proven in this work that it

is possible to generate at least 12 variants of

program code that, when input into the system, these

variants will be semantically equivalent to the

original, and MOSS will not detect plagiarism when

comparing the original and one of the generated

code variants. Also, according to work, [32], the

MOSS system is one of the best systems for

detecting program code plagiarism, as demonstrated

by using various data selections and metrics like F-

measures and precision-recall curves. Therefore,

enhancing systems for detecting program code

plagiarism is an important task, especially in

detecting semantically equivalent code and the

system's resistance to hash collision attacks.

Research into existing systems and methods has

proven their effectiveness in identifying identical

fragments of program code. Methods and techniques

for detecting plagiarism of semantically equivalent

code require further refinement, [33]. This study

focused on models and methods that have recently

proven to be some of the most effective for

detecting plagiarism in source code. Particular

attention was paid to abstract syntax trees, [25],

[26], neural networks, and the possibilities of their

practical application in detecting program code

plagiarism.

The models analyzed by us demonstrate very

good results in the range of 79-85% for detecting

semantically equivalent code and results of 93-95%

in tests demonstrating the problem of hash collision,

which strongly indicates their potential effectiveness

in solving various tasks related to the detection of

program code plagiarism. Thus, BERT, which was

presented in the work, [34], and all subsequent

modifications of this model are among the most

promising for solving tasks related to the detection

of program code plagiarism.

5 Conclusions and Perspectives for

Further Research

During this research, two methods of checking the

output program code for the presence of

obfuscations were proposed – a multilayer

perceptron + Code2Vec and TreeBert. For the

TreeBert system, a function of checking for

similarity was additionally developed, and,

according to the authors, the main task of this model

is code generation. In turn, the multilayer

perceptron + Code2Vec was trained on a dataset

containing 14 million examples of program code in

Java. The designated code fragments from the

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

373

Volume 12, 2024

selection were pre-processed by Code2Vec to detect

similarities in vectors.

After conducting a comparative analysis, it was

concluded that both proposed models resist the

problem of hash collision, a major vulnerability of

the existing MOSS system. It is also worth noting

that the multilayer perceptron + Code2Vec shows

better indicators than TreeBert and MOSS,

especially in working with semantically equivalent

code and in cases demonstrating the problem of

hash collision. Since the method of this work was

developing a more effective method for checking

program code for plagiarism, the results of the work

fully meet the set goals.

As a direction for further development, it is

possible to highlight the optimization of the

necessary time for conducting checks, working with

a collection of documents, expanding the number of

programming languages with which the system can

work, and improving the system for analyzing large

fragments of program code. To experiment with

accuracy results and feasible results improvement,

periodically checking the existing datasets is

recommended. It is also possible to create personal

datasets for further experiments.

References:

[1] Stephens, J. M. How to Cheat and not Feel

Guilty: Cognitive Dissonance and its

Amelioration in the Domain of Academic

Dishonesty. Theory into Practice. Vol.56,

No.2, 2017, рр. 111-120,

https://doi.org/10.1080/00405841.2017.12835

71.

[2] Sraka, D., & Kaučič, B., Source Code

Plagiarism. Conference: Information

Technology Interfaces. Proceedings of the ITI

2009 31st International Conference, Cavtat,

Croatia, 2009,

https://doi.org/10.1109/ITI.2009.5196127.

[3] Cosma, G., & Joy, M. Source-Code

Plagiarism: A UK Academic Perspective,

2006, [Online].

https://www.researchgate.net/publication/228

712855_Source-

code_plagiarism_A_UK_academic_perspectiv

e (Accessed Date: October 23, 2023).

[4] Grimes, P. W. Dishonesty in Academics and

Business: A Cross-Cultural Evaluation of

Student Attitudes. Journal of Business Ethics.

Vol.49, No.3, 2004, рр. 273-290,

https://doi.org/10.1023/B:BUSI.0000017969.2

9461.30.

[5] Hard, S. F., Conway, J., & Moran, A. Faculty

and College Student Beliefs about the

Frequency of Student Academic Misconduct.

The Journal of Higher Education. Vol.77,

No.6, 2006, рр. 1058-1080,

https://doi.org/10.1353/jhe.2006.0048.

[6] Cosma, G., & Joy, M. towards a Definition of

Source-Code Plagiarism. IEEE Transactions

on Education. Vol.51, No.2, 2008, рр. 195–

200, https://doi.org/10.1109/TE.2007.906776.

[7] Joy, M., Cosma, G., Yin-Kim Yau, J., &

Sinclair, J. Source Code Plagiarism – A

Student Perspective. IEEE Transactions on

Education. Vol.54, No.1, 2011, рр. 125–132,

https://doi.org/10.1109/TE.2010.2046664.

[8] Baron, E. Computer Science Cheating

Incidents Part оf Widespread Problem:

Report. Stanford, UC Berkeley, 2017.

[9] MacGregor, J., & Stuebs, M. To cheat or not

to Cheat: Rationalizing Academic

Impropriety. Accounting Education. Vol.21,

No.3, 2012, рр. 265-287,

https://doi.org/10.1080/09639284.2011.61717

[10] Academic Cheating Fact Sheet. (1999)

Educational Testing Service, [Online].

http://www.glass-castle.com/clients/www-

nocheating-

org/adcouncil/research/cheatingfactsheet.html

(Accessed Date: October 22, 2023).

[11] McCabe, D. L., Butterfield, K. D., &

Treviño, L. K. Cheating in college: Why

students do it and what educators can do about

it. Baltimore, Johns Hopkins University Press,

2012 [Online].

https://pure.psu.edu/en/publications/cheating-

in-college-why-students-do-it-and-what-

educators-can-do- (Accessed Date: October

23, 2023).

[12] Cheers, H., Lin, Y., & Smith, S. P. Academic

source code plagiarism detection by

measuring program behavioural similarity.

arXiv, 2021,

https://doi.org/10.48550/arXiv.2102.03995.

[13] Soeharno, J., Keimpe, A., & van der Schot, D.

(Eds.). Plagiarism in Academic Research and

Education. Hague, Association of Universities

in the Netherlands, 2021, [Online].

https://www.researchgate.net/publication/354

904254_Plagiarism_in_Academic_Research_

and_Education (Accessed Date: October 23,

2023).

[14] Maryono, D., Yuana, R. A., & Hatta, P. The

Analysis of Source Code Plagiarism in Basic

Programming Course. Journal of Physics:

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

374

Volume 12, 2024

Conference Series, Vol.1193, 2019, paper

012027, https://doi.org/10.1088/1742-

6596/1193/1/012027

[15] Zhao, G., & Huang, J. DeepSim: Deep

learning code functional similarity.

ESEC/FSE 2018: Proceedings of the 2018

26th ACM Joint Meeting on European

Software Engineering Conference and

Symposium on the Foundations of Software.

New York, Association for Computing

Machinery, 2018, pp. 141–151,

https://doi.org/10.1145/3236024.3236068.

[16] Husain, M. A., Wu, H.-H., Gazit, T.,

Allamanis, M., & Brockschmidt, M.

CodeSearchNet Challenge: Evaluating the

State of Semantic Code Search, 2020,

https://doi.org/10.48550/arXiv.1909.09436.

[17] Alon, U., Zilberstein, M., Levy, O., & Yahav,

E. code2vec: Learning Distributed

Representations of Code, 2018,

https://doi.org/10.48550/arXiv.1803.09473.

[18] Prechelt, L., Malpohl, G., & Philippsen, M.

Finding Plagiarisms among a Set of Programs

with JPlag. Journal of Universal Computer

Science. Vol.8, No.11, 2003, pp. 1016-1038,

[Online].

https://www.researchgate.net/publication/283

2828_Finding_Plagiarisms_among_a_Set_of_

Programs_with_JPlag (Accessed Date:

October 23, 2023).

[19] Ahtiainen, A., Surakka, S., & Rahikainen, M.

Plaggie: GNU-licensed Source Code

Plagiarism Detection Engine for Java

Exercises. Baltic Sea '06: Proceedings of the

6th Baltic Sea conference on Computing

education research: Koli Calling. Uppsala,

Sweden, 2006, pp. 141–142,

https://doi.org/10.1145/1315803.1315831.

[20] Code Plagiarism & Similarity API.

Codequiry, n.d., [Online].

https://codequiry.com/usage/api (Accessed

Date: October 23, 2023).

[21] Simon, Karnalim, O., Sheard, J., Dema, I.,

Karkare, A., Leinonen, J., Liut, M., &

McCauley, R. Choosing Code Segments to

Exclude from Code Similarity Detection.

Conference: ITiCSE-WGR '20: The Working

Group Reports on Innovation and Technology

in Computer Science Education, Trondheim,

Norway, 2020,

https://doi.org/10.1145/3437800.3439201.

[22] Compton, R., Frank, E., Patros, P., & Koay,

A. Embedding Java Classes with code2vec:

Improvements from Variable Obfuscation,

2020, [Online].

https://www.researchgate.net/publication/340

499798_Embedding_Java_Classes_with_code

2vec_Improvements_from_Variable_Obfuscat

ion (Accessed Date: October 23, 2023).

[23] Vaswani, A., Shazeer, N., Parmar, N.,

Uszkoreit, J., Jones, L., Gomez, A. N.,

Kaiser, Ł., & Polosukhin, I. Attention is аll

You Need. 31st Conference on Neural

Information Processing Systems, Long Beach,

CA, USA, 2017, [Online].

https://arxiv.org/pdf/1706.03762.pdf

(Accessed Date: October 22, 2023).

[24] Jiang, X., Zheng, Zh., Lyu, Ch., Li, L., &

Lyu, L. TreeBERT: A Tree-Based Pre-

Trained Model for Programming Language.

37th Conference on Uncertainty in Artificial

Intelligence, Online, 2021, [Online].

https://arxiv.org/pdf/2105.12485.pdf

(Accessed Date: October 23, 2023).

[25] Jones, J. Abstract Syntax Tree Implementation

Idioms, 2003, [Online].

https://www.hillside.net/plop/plop2003/Papers

/Jones-ImplementingASTs.pdf (Accessed

Date: October 23, 2023).

[26] Tang, Z., Li, C., Ge, J., Shen, X., Zhu, Z., &

Luo, B. AST-Transformer: Encoding Abstract

Syntax Trees Efficiently for Code

Summarization, 2021, [Online].

https://arxiv.org/pdf/2112.01184.pdf

(Accessed Date: October 23, 2023).

[27] Kanade, A., & Maniatis, P., Balakrishnan, G.,

& Shi, K. Learning and Evaluating

Contextual Embedding of Source Code, 2020,

[Online].

https://arxiv.org/pdf/2001.00059v3.pdf

(Accessed Date: October 23, 2023).

[28] Yang, D. How MOSS Works, 2019, [Online].

https://yangdanny97.github.io/blog/2019/05/0

3/MOSS (Accessed Date: October 22, 2023).

[29] Devore-McDonald, B., & Berger, E. D.

Mossad: Defeating Software Plagiarism

Detection. Proceedings of the ACM on

Programming Languages, Vol. 4, No.

OOPSLA, 2020, Article 1, [Online].

https://arxiv.org/pdf/2010.01700.pdf

(Accessed Date: October 23, 2023).

[30] Moss, A System for Detecting Software

Similarity. n.d, [Online].

https://theory.stanford.edu/~aiken/moss/

(Accessed Date: October 23, 2023).

[31] Schleimer, S., Wilkerson, D. S., & Aiken А.

Winnowing: Local Algorithms for Document

Fingerprinting, 2003, SIGMOD '03:

Proceedings of the 2003 ACM SIGMOD

International Conference on Management of

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

375

Volume 12, 2024

Data, June 2003, pp. 76–85,

https://doi.org/10.1145/872757.872770.

[32] Heres, D., & Hage, J. A Quantitative

Comparison of Program Plagiarism Detection

Tools. Conference: the 6th Computer Science

Education Research Conference. New York,

NY, USA, 2017, pp. 73-82,

https://doi.org/10.1145/3162087.3162101.

[33] Karnalim, О., Simon, & Chivers, W. J.

Preprocessing for Source Code Similarity

Detection in Introductory Programming.

Conference: 20th Koli Calling International

Conference on Computing Education

Research, Koli, Finland, 2020, pp. 1-10,

https://doi.org/10.1145/3428029.3428065.

[34] Arase, Y., & Tsujii, J. 2021. Transfer fine-

tuning of BERT with phrasal paraphrases.

Computer Speech & Language. Vol.66, 2021,

paper 101164,

https://doi.org/10.1016/j.csl.2020.101164.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghostwriting

Policy)

The authors equally contributed in the present

research, at all stages from the formulation of the

problem to the final findings and solution.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding was received for conducting this study.

Conflict of Interest

The authors have no conflicts of interest to declare.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en

_US

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.36

Serhii Myronenko, Yelyzaveta Myronenko

E-ISSN: 2415-1521

376

Volume 12, 2024