Enterprise Malware Detection using Digital Forensic Artifacts and
Machine Learning
MATHIEU DROLET, VINCENT ROBERGE
Electrical and Computer Engineering,
Royal Military College of Canada,
13 General Crerar Crescent, Kingston, ON,
CANADA
Abstract: - Malware detection is a complex task. Numerous log aggregation solutions and intrusion detection
systems can help find anomalies within a host or a network and detect intrusions, but they require precise
calibration, skilled analysts, and cutting-edge technology. In addition, processing host-based data is
challenging, as every log, event, and configuration can be analyzed. To obtain trusted information about a host's state, a computer's memory can be analyzed, but acquiring a memory image and processing it are both difficult. To address this limitation, this paper proposes to collect artifacts
within a network environment. This approach involves remotely gathering memory-based and disk-based
artifacts from a simulated enterprise network using Velociraptor. The data was then processed using three
machine learning algorithms to detect the malware samples against regular user activity generated with a user
simulation tool for added realism. With this method, Random Forest and Support Vector Machine achieved a
perfect classification of 41 malware samples.
Key-Words: - Digital forensics, Host-based monitoring, Machine learning, Malware, Memory forensics, User
simulation, Volatility, Velociraptor.
Received: January 6, 2024. Revised: April 11, 2024. Accepted: June 13, 2024. Published: July 9, 2024.
1 Introduction
Detecting malware presents a complex challenge
due to its varied forms and ability to target diverse
processes, protocols, and devices. Traditionally,
three main strategies are utilized to defend against
intrusions. The first approach involves identifying
malware as it enters the network or computer, with
significant research focusing on detecting files
downloaded from suspicious URLs or when users
browse malicious websites. However, it is possible
for URLs to be altered to appear less suspicious or
to avoid blacklisted domains. A second strategy
involves monitoring activities within workstations,
such as tracking system calls in the operating system
and identifying launched processes and their parent
processes. This method often incorporates antivirus
software, which compares specific strings and
binary patterns against a database of known
malicious software. However, some types of
malware, such as rootkits, can affect the normal
operation of the OS and hide from antivirus
software, which makes detection difficult. The last
strategy consists of using digital forensics, which focuses on a compromised computer's hard drive and memory, post-exploitation, to better understand how
a specific malware works. This allows for signatures
to be developed so subsequent occurrences of the
intrusion can be detected. Digital forensics is
advantageous because it can provide more context
on the state of the machine when the disk or the
memory capture was collected. Files that have been
deleted or hidden can often be retrieved, thus
leading to malware analysis and a deeper
understanding of the adversary and its intent.
However, this technique is limited by the time
required by a skilled analyst to perform it.
Currently, host-based monitoring solutions, such
as Endpoint Detection and Response (EDR) tools
and antivirus software, use agents running on the different endpoints, all reporting back to a centralized server. Collating this data with network-based tools remains a challenging task, and it can be
difficult for network defenders to maintain
situational awareness. Additionally, EDR tools and
antivirus programs can generate numerous alerts due
to their ability to report on a wide range of events
and logs. Even though EDRs may lack the capacity
for the extensive behavioral analysis needed to
identify new threats, they are adept at detecting
known threats, which constitute the majority faced
by enterprise networks, [1]. Security Operation
Centers (SOC) often rely on supplementary tools
like threat-hunting feeds and indicators of
compromises.
Establishing an accurate behavior-based
detection system requires a robust baseline built on
sufficient data samples that capture typical network
and computer activities. However, the evolving
network demands and software usage by individuals
present a challenge in maintaining the baseline's
effectiveness. The primary focus of current digital
forensic approaches is on pinpointing how malware
infiltrated a system to aid in implementing
mitigating strategies. Nonetheless, the valuable
insights obtained from this method are frequently
delayed. The analysis of collected artifacts poses a
considerable challenge due to the vastness of the
dataset. One potential solution to this issue is the
application of machine learning to automate digital
forensic investigations. While existing research
mainly concentrates on extracting forensic artifacts
from individual computer memory captures, Virtual
Machine (VM) memory, or sandboxes, the
applicability of these findings to operational
networks can be complex, [2], [3]. To effectively
transfer the principles and conclusions of these
studies to live networks, the collection and analysis
of artifacts need to occur in near real-time. This
study seeks to overcome this obstacle by proposing
a solution suitable for live enterprise networks.
This research, which is based on a thesis, [4],
introduces four key contributions: identifying
Velociraptor as a valuable tool for generating
features to train a machine learning model for
effective malware detection, developing a
methodology to produce data using a user
simulation tool, pinpointing features capable of
identifying malware presence on an active computer
and comparing three machine learning algorithms in
the context of malware detection.
2 Previous Works
Machine learning has proven to be a valuable tool in
various aspects of computer security. In [5],
research is conducted in network security, focusing
on the identification of anomalous network traffic
through the use of the Isolation Forest algorithm.
Their goal was to detect covert channels, malware
usage, and other anomalies within the network.
Researchers in host-based analysis, such as in [6], have used machine learning to analyze features collected from hosts, such as system call traces. In [7], the authors
proposed an anomaly detection approach that used
Isolation Forest and K-means for real-time anomaly
detection using the network traffic logs. In [8], a
Deep Convolutional Neural Network is used to
detect intrusion. Their proposed algorithm achieved
better results compared to other current Intrusion
Detection System (IDS) implementations, such as
Deep Belief Network, while reducing the processing
time. They achieved an F1-Score of 0.97 to 0.98,
depending on the type of tested network attacks.
Machine learning has also been used for digital
forensic research. In their disk-based forensic
research, the authors of [9] looked at machine-
learning techniques for file system forensic analysis.
They aimed at detecting modified files to assist in
timeline reconstruction. Multiple memory forensic
researchers such as in [10], [11] and [12] have used
machine learning to automate and assist in the
detection of malware. In [10], the authors used
machine learning and artifacts found in memory
using Volatility to detect ransomware and Remote
Access Trojan (RAT) in a cloud computing server,
hosting hundreds of virtual machines (VMs). Using Volatility plugins, information found in memory, such as running processes, services, and DLLs, could be retrieved and features generated to feed into a machine learning algorithm. Their approach
had the advantage of enabling detection of fileless
malware, which does not have a presence on the
disk. They used VMware’s vSphere infrastructure to
collect snapshots of the VMs, then extracted the
memory capture from the snapshot files so it could
be analyzed using Volatility. They used a baseline
of 100 snapshots taken at ten-minute intervals, with
100 more snapshots taken for each of the nine programs run, both benign and ransomware. They used nine
machine learning algorithms for their datasets: J48,
Random Forest (RF), Naïve Bayes (NB), Bayesian
Network (BN), Logistic Regression (LR),
LogitBoost (LB), Sequential Minimal Optimization
(SMO), Bagging, and AdaBoost (AB). Out of their
multiple test cases, RF achieved the best overall results; this was also observed by the authors of [11], who researched kernel-level rootkits using memory forensics and machine learning with a
similar methodology, but different features. The
authors of [12] performed similar research to detect
unknown malware in Linux cloud environments.
More recently, the authors of [13] used a similar
methodology to detect different types of malware
using a custom Volatility plugin to gather specific
process data. Their test environment used a virtual machine from which the memory was
collected. In addition, recent work including [14]
has been focusing on analyzing the memory capture
of potential malware samples using computer vision
techniques by converting the executables into RGB
images and processing them using machine
learning.
In their assessments of various EDR solutions,
the authors of [15] tested multiple EDRs, including
Carbon Black, CrowdStrike Falcon, F-Secure
Elements EDR, McAfee Endpoint Protection, and
Symantec Endpoint Protection, in order to assess
their effectiveness against Tactics, Techniques, and
Procedures (TTPs) employed by Advanced Persistent Threat (APT) actors. Their research revealed that
none of the EDR solutions were capable of
identifying all threats. A significant number of the
EDRs examined performed poorly in identifying
DLL Side-loading [15]. EDR solutions are now
incorporating more machine learning algorithms and
processing capabilities in an effort to enhance
detection rates while also keeping false positives at
a minimum and identifying malware at an earlier
stage in the cyber kill chain [16] before significant
damage occurs. The primary focus for EDR vendors
is to identify effective features for evaluation while
ensuring a minimal impact on the host system and
requiring limited bandwidth, in addition to
effectively managing and processing vast amounts
of data [15]. Conversely, in another study, the
authors of [17] concentrate on the utilization of custom tools, like their Python-based tool, for gathering artifacts from Windows hosts.
Current research has certain significant limitations that could be addressed. The
effectiveness of signature-based monitoring is
constrained by the information stored in its
database, making it more effective in identifying
older threats that have already been scrutinized.
Furthermore, a few current studies, such as those
conducted in [10] and [18], may not be directly
applicable in a real-time setting as the collection of
memory images necessitates the use of snapshots for
analysis. This methodology does not scale
effectively with modern systems that have a large
memory capacity. Outside of the cloud-computing
environment, disk acquisition and memory capture
need to be done computer by computer, which
creates delays and can require a lot of bandwidth if
done remotely. In addition, if sandboxes are used, it
is possible to fool them. Some malware is context-aware and changes its execution based on whether it is running on a virtual machine or suspects it is executing in a sandbox environment, using sandbox fingerprinting techniques, [19]. This paper addresses this limitation
by using Velociraptor for data acquisition, which is
more lightweight than performing a full memory
capture, and is applicable to non-virtualized
environments.
Limitations presented in this section are
addressed by collecting digital forensic evidence
using Velociraptor at regular intervals. The
Velociraptor offline collector, running on each
workstation, can gather information about the
system state. Features can then be generated on a
dedicated server. This enables the artifacts to be collected from all computers in the network simultaneously and then processed using a machine learning algorithm. This process is
transparent to the user and could be applied to an
enterprise network. Using this technique, it is
possible to widen the range of artifacts available to
host-based detection tools by looking at both
volatile and non-volatile digital forensic artifacts. In
addition, to provide a more realistic test
environment, this paper uses an advanced
simulation environment; not simply a virtual
machine, but a complete enterprise network with
simulated user activity.
3 Background Theory
Evaluation of a machine learning algorithm's performance can be achieved through various
metrics, including accuracy, precision, recall, and
F1-Score. These metrics are based on confusion
matrices, which show the frequency of correct and
incorrect predictions made by the classifier
regarding the null hypotheses, [20]. For instance,
when determining whether a data point is malicious,
the confusion matrix consists of True Positive (TP)
for accurate identification as malicious, True
Negative (TN) for accurate identification as non-
malicious, False Positive (FP) or Type I error for
inaccurate malicious classification, and False
Negative (FN) or Type II error for inaccurate benign
classification, [20].
Equation 1 illustrates the concept of accuracy,
which offers insight into a model's performance, as
referenced in [11]. However, relying solely on
accuracy as a metric may be limited in its
usefulness, as it only signifies the percentage of
correctly classified samples. Precision, as depicted in Equation 2 and as referenced in [10], serves as a
measure of how many samples identified as
malicious are truly malicious, without considering
missed malware samples. A high precision score
suggests that minimal normal benign data has been
mistakenly classified as anomalous. On the other
hand, recall, outlined in Equation 3 and as shown in [11], further refines this assessment by indicating the proportion of malware samples correctly identified. An ideal classifier, accurately detecting all malware samples, would achieve a recall value of 1; recall, however, does not reflect the occurrence of FP. The
F1-Score, also known as F-measure, and presented
in Equation 4 and as seen in [11], merges precision
and recall to deliver a more holistic evaluation of
the algorithm's overall performance. Computed
through the harmonic mean of precision and recall,
a high F1-Score signifies elevated levels of
precision and recall, [20].

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (1)

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (2)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (3)

$\text{F1-Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (4)
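As an illustration, these four metrics can be computed directly from predicted labels with SKLearn, which this paper uses for its experiments. The following minimal Python sketch uses purely illustrative label vectors:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels only: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# For a binary problem, ravel() yields the four confusion matrix cells.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy :", accuracy_score(y_true, y_pred))   # Equation 1
print("Precision:", precision_score(y_true, y_pred))  # Equation 2
print("Recall   :", recall_score(y_true, y_pred))     # Equation 3
print("F1-Score :", f1_score(y_true, y_pred))         # Equation 4
```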
The comprehensive machine learning pipeline
encompasses all stages from data acquisition to
result generation. The process involves gathering
raw data, such as logs, and transforming them into a
format suitable for processing by machine learning
algorithms, typically numerical data. Not all features
generated are equally valuable, prompting the need
for careful selection of the most relevant ones. Each
chosen machine learning algorithm undergoes
model training with the data to evaluate its
performance. Optimization strategies involve
adjusting parameters within the selected algorithm
to enhance results. Ultimately, the model is applied
to new data for classification or anomaly detection,
[21]. The steps of the machine learning pipeline are
illustrated in Figure 1.
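The stages of this pipeline map naturally onto SKLearn's Pipeline abstraction. The sketch below is a minimal, hypothetical example that chains scaling and classification; the synthetic dataset merely stands in for features generated from forensic artifacts:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a numeric feature matrix built from artifacts.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling and model training chained as successive pipeline steps.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```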
The following paragraphs discuss the three machine learning algorithms used in this paper: Isolation Forest, Random Forest, and Support Vector Machines (SVM).
Isolation Forest is a tree-based algorithm that
separates all the data points into different nodes of a
tree, or splits. This process effectively isolates all
data points into different branches of the tree. The
more splits required to reach the data point, the
more normal the point is determined to be;
anomalous data points tend to be easier to isolate
and therefore require fewer splits. This results in
fewer branches and makes the process of walking
back to the top of the tree shorter. The algorithm
repeats this process for multiple trees, creating a
forest, [22]. The anomaly score of a given data point x, a value between 0 and 1, is calculated using Equation 5, as seen in [20]. In Equation 5, E(h(x)) is the average path length of the point x in the forest, c(n) is the average path length of any given data point in the dataset, and s(x,n) is the anomaly score of the data point x. The closer s(x,n) is to 1, the more likely the point is to be anomalous, [22].

$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$ (5)
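A minimal sketch of this behavior with SKLearn's IsolationForest follows; note that, by convention, its score_samples method returns the negated anomaly score, so values near -1 correspond to s(x,n) near 1. The data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(300, 2))   # dense cluster: "normal"
X_anom = np.array([[6.0, 6.0], [-5.0, 7.0]])     # isolated points: anomalies
X = np.vstack([X_normal, X_anom])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples is the opposite of s(x, n): anomalies score closest to -1.
print("anomaly scores:", iso.score_samples(X)[-2:])
print("normal scores :", iso.score_samples(X)[:2])
print("predictions   :", iso.predict(X)[-2:])     # -1 flags anomalies
```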
Random Forest is an ensemble of decision trees in which multiple trees are constructed in a random manner. This enhances the diversity of Random Forest compared to traditional decision tree models. In a typical decision tree, a node is divided based on the optimal feature for each split, whereas Random Forest makes the split based on the best feature from a randomly selected set of features to introduce variety. In SKLearn, the importance of each feature can be retrieved through the feature_importances attribute. The feature_importances value, also known as Mean
Decrease of Impurity (MDI) importance, is
computed for an individual tree using Equation 6, as
referenced in [23]. Each feature (Xm) in a given tree
T in the forest is assigned a score based on the MDI
of that particular feature. To obtain a more precise
estimation of the impact a specific feature can have
on the machine learning model, the average MDI
value of each tree is determined using Equation 7, as
illustrated in [23]. The prediction made by Random Forest is essentially the prediction that is most prevalent among all the trees in the ensemble. This characteristic enables this algorithm to outperform other tree-based techniques by minimizing the error that could be generated by a single tree, [24].

$\mathrm{Imp}_T(X_m) = \sum_{t \in T : v(t) = X_m} p(t)\,\Delta i(t)$ (6)

$\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \mathrm{Imp}_T(X_m)$ (7)

where v(t) is the feature used to split node t, p(t) is the proportion of samples reaching node t, Δi(t) is the impurity decrease produced by the split at node t, and N_T is the number of trees in the forest.
SVMs function by attempting to construct a
hyperplane that divides the distinct classes of data
into distinct regions. Two parallel auxiliary hyperplanes intersect the data points nearest to the separating hyperplane; these nearest points are known as the support vectors. SVMs have the capability to utilize
various kernels in order to establish these
hyperplanes. The kernel serves as the mathematical
representation of the hyperplane. The formula for
the linear SVM kernel, displayed as Equation 8 in
Table 1 of [25], is one of the options. Nevertheless,
this approach may not always be the most effective.
Alternative non-linear approaches, such as the
Gaussian Radial Basis function (RBF) depicted in
Equation 9 of the same reference, can be employed
to better suit the dataset and achieve a more
effective classification.
Fig. 1: Machine Learning Pipeline
Fig. 2: Test Environment
The RBF kernel stands out as the most widely used SVM kernel, [25]. It has applications in intrusion detection as well as in malware detection models, as stated in the same source.

$K(x_i, x_j) = x_i^{T} x_j$ (8)

$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$ (9)
During the process of model training, various
methods can be employed to select features. One
such technique is Principal Component Analysis
(PCA), which is a feature reduction approach
grounded in linear algebra. It involves a sequence of
orthogonal transformations designed to retain the
majority of dataset variance while decreasing its
dimensionality, [26].
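With SKLearn, the fraction of variance to retain can be passed directly to PCA. A minimal sketch, with a random matrix standing in for a scaled feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for a scaled feature matrix

# A float in (0, 1) keeps the smallest number of orthogonal components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))
```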
Feature importance can be assessed using
different tree-based algorithms, such as Random
Forest using the MDI, or feature_importances. Once
a model is fitted using SKLearn functions, the
variable feature_importances can be accessed and
used to select the best features. The higher the MDI score, the more useful the feature is to the model's predictions, [20].
The Predictive Power Score (PPS) is an
algorithm introduced in 2020, [27], to facilitate the
exploration of a dataset and help find relationships
between the different features and data points. It
looks at the ability of any given column, which is one of the features of the dataset, to predict another column, [28]. Using the PPS
algorithm, features that bring the most predictability
to the model can be selected. The PPS algorithm can
be used to perform the feature selection step by only
retaining the features of a dataset where the PPS
score for the row containing the label is above zero.
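A minimal sketch of this selection step, assuming the open-source ppscore package (a reference implementation of PPS) and a pandas DataFrame containing a label column; the data is synthetic:

```python
import numpy as np
import pandas as pd
import ppscore as pps

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_a": rng.normal(size=300),
    "feat_b": rng.normal(size=300),
})
df["label"] = (df["feat_a"] > 0).astype(int)   # only feat_a predicts the label

# pps.predictors scores every column's ability to predict the target;
# features with a PPS above zero are kept for the reduced dataset.
scores = pps.predictors(df, y="label")
kept = scores.loc[scores["ppscore"] > 0, "x"].tolist()
print(kept)
```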
A complex problem with machine learning
algorithms is the tuning of an algorithm's various
hyperparameters. These algorithms have two types
of parameters: the parameters, which are determined automatically during training, and the hyperparameters, which must be provided to the training method before training, [29]. A popular method of tuning the
algorithms is Grid Search. Its goal is to identify, out
of the different possible hyperparameters of an
algorithm, the values that will lead to the best
prediction while minimizing overfitting, which occurs when the algorithm is too perfectly tailored to the training set and has difficulty adapting to the test set. If the training and test sets are very similar, the algorithm may also find it difficult to adapt to new data in the future, [30]. A
range of values is specified to the algorithm, for
each hyperparameter, which makes a grid, and each
value is tested exhaustively, [31].
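A minimal GridSearchCV sketch with an illustrative hyperparameter grid for an SVM; the grid values here are examples, not the ones used in this paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in the grid is fitted and cross-validated; the
# combination with the best F1-Score is retained.
grid = GridSearchCV(
    estimator=SVC(),
    param_grid={"kernel": ["linear", "rbf"],
                "C": [0.1, 1, 10],
                "gamma": ["scale", 0.01, 0.1]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```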
The industry standard tool for memory forensics
is Volatility. The book “The Art of Memory Forensics”, [32], provides extensive detail about how to use this tool and how it obtains its data.
To conduct an investigation, memory forensics is
typically performed by capturing the RAM of a
computer before it is shut down. Due to the nature
of computer memory, once the computer is powered
off, the data in memory is lost and an analyst is no
longer able to gather information from it, which
prevents the creation of indicators of compromise
generated from the analysis of the malware behavior
in memory. The analysis of a live system by an
analyst gives an opportunity to monitor the system
behavior over a longer period of time and take more
memory captures if required. However, this can add risk to the network and to the host; if an adversary is exfiltrating data from the computer, shutting it down quickly to avoid further damage may be preferred.
Methods to capture the memory include the use of
virtual machine snapshots, Windows hibernation
files, or live-collection software tools such as
Dumpit, [32]. Using VM snapshots is trusted as it
does not depend on the Windows API to gather its
information.
Velociraptor, [33], is an endpoint monitoring and
Digital Forensics, Incident Response (DFIR) tool
that was designed to collect forensic evidence,
monitor events, facilitate enterprise threat-hunting
efforts and help security practitioners respond to an
incident in an enterprise network. Velociraptor has
hundreds of plugins and can be easily extended by
writing new custom plugins that are specifically
tailored to an analyst’s needs, [33]. Velociraptor uses a query language similar to the Structured Query Language (SQL), the Velociraptor Query Language (VQL), to analyze the data directly. VQL also allows querying the computers of the network as if they were all part of a database.
4 Methodology
4.1 Data Acquisition
To develop the implemented method, a test
environment was created using VMware vSphere.
The environment simulates a small company, which
has ten employees. The company network used for
this research contains the basic infrastructure that
would be required for such a company to operate.
The company users are simulated using the
Human Actor Like Orchestration (HALO) software,
developed by Field Effect Software, [34]. This tool
was used in the test environment to add realism and
generate background activity within the
environment. The network configuration is depicted
in Figure 2, illustrating the layout of the
environment. Ten Windows 10 workstations,
operating on Version 1909 (build 18363.778), were
configured for the experiment. To confirm the
effectiveness of the implemented method, the
workstations were divided into two groups of five
each, with one group utilizing Velociraptor for
artifact collection and the other employing Volatility
for validation purposes.
HALO replicates user behavior to enhance the
authenticity of the corporate network, encompassing
tasks such as running programs, managing emails,
and browsing the web. This platform facilitates the
establishment of a timetable for automating user
actions. Artifacts from desktops are gathered every
half-hour through a scheduled task. The task
initiates a series of actions, starting with a batch script that executes the Velociraptor offline collector on the five designated workstations. Subsequently, a PowerShell script is
activated to transfer all gathered artifacts to the
Windows collection server, serving as a central hub
for data aggregation and subsequent analysis.
Concurrently, memory images are captured at
regular intervals from the five monitored
workstations, with the plugin results being exported
to the Windows server for further processing.
Multiple servers are present in the environment
to ensure the proper functioning of the emulated
users by HALO. These servers include a domain
controller housing the company's DNS server, a file
share, a mail server, and a web server. To replicate
real-world scenarios, a grey infrastructure was
established, which consists of a Grey DNS server
serving as the authoritative DNS server for the
environment and a simulated Internet. The
simulated Internet within the environment
comprises numerous scraped websites to enhance
the authenticity of the simulation. HALO agents can
navigate the simulated web, retrieve actual web
pages, and simulate user activity realistically. For
the execution of malware, a red infrastructure was
necessary, with Kali Linux serving as the attack
platform. All malware samples were prepared from this workstation, with the Command and Control (C2) connectivity directed towards it. The initial phase involved collecting a
network baseline by observing user activities
conducted by the HALO users according to their
routines, such as document creation, email
correspondence, etc. A total of 1340 samples were
collected as a baseline over an 11-day period using
the Velociraptor offline collector executable from
all five workstations. Subsequently, each malware
was executed to generate diverse malware datasets.
After the data collection process, machine
learning algorithms were utilized to analyze the data
outside of the original environment. This analysis
was conducted through the implementation of
SKLearn libraries within a Jupyter notebook. The
collected data was divided into separate training and
testing sets for further evaluation. Subsequently, the
classification results of the test set were examined
and interpreted. The Velociraptor feature selection
approach drew inspiration from various scholarly
works, including, [9], [10] and [12], which focused
on disk-based and memory forensic investigations.
A total of 76 plugins were identified for potential
use based on insights gleaned from prior research
utilizing tools like Volatility in disk forensic studies.
Furthermore, experimental assessments were carried
out using the Velociraptor graphical user interface
to assess the efficacy of individual plugins in
enhancing detection capabilities concerning specific
types of malware.
Table 1 displays the number of attributes within
each specific domain. In accordance with the
methodology outlined by the authors of [3] for
classifying attributes, Velociraptor plugins were
chosen to achieve a comparable level of host
visibility through examination of Registries, DLLs,
APIs, and Network-related artifacts. In addition to
the four domains identified in their study, two
additional domains were incorporated: a file system category encompassing all file-based features generated in a similar fashion to [9], and a Windows event category comprising PowerShell events and remote login events, among others. All chosen
Velociraptor attributes align with one of the six
domains listed in Table 1.
In the experimental phase, 14 types of malware
and frameworks were utilized to produce 41 distinct
malware samples. Table 2 outlines the list of
malware and tools employed. The subsequent
section delves into the functionalities of these
programs and their intended impact on the
compromised host.
4.2 Exploratory Data Analysis and Data
Munging
The data was initially normalized through the application of a min-max scaler. Subsequently, utilizing the feature_importances attribute of SKLearn's tree-based classifiers, the most impactful features were determined, with a graphical representation of feature importance created to enhance clarity. A selection criterion
was applied to the top 20 features as a method for
reducing features. Another approach to feature
reduction involved utilizing the PPS matrix to
identify features that exhibited a correlation with the
label. This led to a reduced feature set of 15 features
that showed a correlation in predicting the label and
would help classify the data. A third feature
reduction method was to use PCA to retain 95%
and 99% of the variance.
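A minimal sketch of these preprocessing steps, assuming a numeric feature matrix X and label vector y; the PPS variant is omitted here since it operates on a labeled DataFrame, as sketched in Section 3:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

def reduced_datasets(X, y):
    """Build the reduced variants of the dataset described above."""
    X_scaled = MinMaxScaler().fit_transform(X)      # min-max normalization

    # Top 20 features ranked by Random Forest MDI importance.
    rf = RandomForestClassifier(random_state=0).fit(X_scaled, y)
    top20 = np.argsort(rf.feature_importances_)[::-1][:20]

    return {
        "all": X_scaled,
        "top20": X_scaled[:, top20],
        "pca95": PCA(n_components=0.95).fit_transform(X_scaled),
        "pca99": PCA(n_components=0.99).fit_transform(X_scaled),
    }
```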
4.3 Feature Engineering
During the feature engineering phase, five different
versions of the dataset were used to test the
performance of the algorithms. The five sets used different features: all the collected features; only the features having a positive PPS correlation with the label; PCA with 95% of the variance retained; PCA with 99% of the variance retained; and the top 20 features selected using feature_importances. Each machine learning algorithm was tested against each of these five datasets to assess performance and the impact of the feature reduction process.
Table 1. Type of features generated for Velociraptor

Domain         Features
Registry       24
DLL             7
API             1
Network        13
File System    18
Events         13
Table 2. Malware executables and tools used

Malware Name               Samples
CatfishHTTPSExfiltrator          1
Lyonfish                         1
CatfishFileShredder              1
CatfishSocket1                   1
CatfishExplorer                  1
CatfishPowerShell1               1
Metasploit                       3
PowerShell Empire                1
Cobalt Strike                   19
Living Off The Land [35]         1
CatfishPersister                 2
OffensivePH [36]                 3
77rootkit [37]                   1
Hidden [38]                      5
To conduct cross-validation, the train-test split method from SKLearn was employed ten times to create ten distinct test sets, enhancing the accuracy
of model evaluations. The datasets were divided
equally, with half allocated to the training set and
the other half to the test set, given the varied types
of malware samples and resulting artifacts. This
partitioning was necessitated by the limited
availability of only 41 malware collections,
resulting in approximately 20 to 21 samples per set.
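A minimal sketch of this evaluation loop, assuming a stratified 50/50 split so each half keeps roughly 20 of the 41 malware samples; the SVM classifier here is illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_evaluation(X, y, n_runs=10):
    """Ten 50/50 splits with different seeds; returns mean and std of F1."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=seed)
        model = SVC(kernel="rbf").fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te)))
    return np.mean(scores), np.std(scores)
```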
4.4 Model Learning and Evaluation
After data collection, the machine learning models
underwent training with Isolation Forest, Random
Forest, and SVM algorithms. Each algorithm was
utilized to create 11 different sets of trained models
through three tuning methods: default parameters of
the algorithm, Grid Search, and a manual exhaustive
implementation based on Grid Search with certain
manually chosen values. Furthermore, all algorithms
were trained with each feature selection method.
The specific combinations of feature selection
and tuning methods employed in this study can be
found in Table 3. The default parameters were only
utilized once to compare results with all features
included.
4.5 Validation
In order to validate both the outcomes and the
methodology, the approach of utilizing Velociraptor
and machine learning was compared to the
validation technique involving Volatility and
machine learning. The research utilizing Volatility
closely resembles the implemented method. The
validation technique was devised by replicating an
experiment carried out in previous studies;
characteristics from [10], [11] and [12] were utilized
to create the validation dataset. For the implemented
method, which utilizes Velociraptor, to be deemed
successful, it needed to surpass the performance of
the Volatility method. The same malware samples
were employed, and the gathering of artifacts with
Volatility was conducted using Dumpit and a
Volatility executable, at the same time intervals as
the implemented method. A key advantage of the
implemented method is that Velociraptor is
significantly more efficient and does not necessitate
a memory capture.
Table 3. Trained machine learning model combinations

Feature Selection                          Tuning Method
All features                               Default parameters
All features                               Grid Search
All features                               Exhaustive
Positive PPS correlation                   Grid Search
Positive PPS correlation                   Exhaustive
PCA with 95% variance retained             Grid Search
PCA with 95% variance retained             Exhaustive
PCA with 99% variance retained             Grid Search
PCA with 99% variance retained             Exhaustive
Top 20 features with feature_importances   Grid Search
Top 20 features with feature_importances   Exhaustive
5 Results
The findings indicate that the implemented approach utilizing Velociraptor and machine learning is more effective at malware detection than the use of Volatility and machine learning. In this section,
the two methodologies are defined as Velociraptor
and Volatility, respectively.
Each model underwent testing using the test
dataset. The model evaluation was repeated ten
times, with different random splits between training
and testing sets, in order to achieve a more precise
outcome; the performance could vary based on the
selection of malware for the training set. The results
from the model with the highest F1-score for each
methodology and algorithm are presented in
Table 4.
The standard deviation of the F1-Scores between each of these ten runs of the algorithm is also displayed in Table 4. SVM emerged as the most effective and consistent model, with a standard deviation of 0.008; there was only one false positive across two test cases.
Isolation Forest did not prove to be the most successful algorithm for either method. The
Velociraptor methodology yielded a lower F1-Score
compared to Volatility but achieved a higher recall.
In the Velociraptor method, the model that
demonstrated the best performance utilized all
features with manual exhaustive tuning. With
Random Forest, the optimum models for both
methods successfully detected all malware
instances. For Velociraptor, all models, with the
exception of those using PCA, achieved a flawless
or nearly flawless classification, while the four
PCA-based models detected none of the malware.
Utilizing SVM, the optimal models for the
Velociraptor method attained a perfect
classification, whereas the optimal models for the
Volatility method were able to identify the majority
of the malware samples.
Both the Random Forest and SVM algorithms
utilizing Velociraptor characteristics exhibited
instances of flawless classification or very poor
classification. Possible reasons for this variability
include a class imbalance in the dataset, where there
were significantly fewer malicious samples
compared to benign samples for both Velociraptor
and Volatility. This imbalance, with malicious to
benign sample ratios of 0.031 and 0.015
respectively, could have influenced the
classification results. To address this issue, two
common techniques are suggested: under-sampling
the majority class and oversampling the minority
class, [39]. Additionally, mixing baseline samples
with malware samples during training may have
impacted the outcomes. Future studies should
explore training methods with different datasets,
potentially focusing solely on baseline data.
Overfitting is also a concern with perfect
classification results, and alternative approaches
such as employing bagging or boosting techniques
through algorithms like bagging trees or AdaBoost
can help mitigate this issue.
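Both resampling techniques are available in the Imbalanced-Learn package discussed in [39]. A minimal sketch on a synthetic dataset with the same 41:1340 class split as this study:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1381, 5))        # illustrative feature matrix
y = np.array([1] * 41 + [0] * 1340)   # 41 malicious, 1340 benign samples

# Oversampling duplicates minority samples; undersampling drops majority
# samples. Either softens the class imbalance noted above before training.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_over), np.bincount(y_under))
```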
When it comes to validation, the method
incorporating Velociraptor and machine learning
surpassed the validation method using Volatility for
both Random Forest and SVM. This approach is
noted for being more efficient and less resource-
intensive, as Velociraptor eliminates the need for
memory capture and reduces the need for memory
structure scanning.
The detection rates for each algorithm utilizing
Velociraptor in the proposed method are detailed in
Table 5. The optimal models for Random Forest and
SVM resulted in a 100 percent detection rate; however, Isolation Forest exhibited shortcomings in detecting certain types of malware, such as CatfishPersister operating at the user-level privilege, data exfiltration attempts, and some RAT samples like CatfishPowershell1, CatfishExplorer, Cobalt
Strike process injection, and screenshot capture.
Table 6 presents the validation method results, which closely align with findings from similar studies. The outcomes obtained from the validation method are compared with the experiments on RATs and ransomware conducted by the authors of [10], and on rootkits by the authors of [11]. This study demonstrated a higher recall value than previous research, [10] and [11], and a slightly lower F1-Score. These
results validate the effectiveness of this study's
validation method utilizing Volatility and its ability
to evaluate the performance of the implemented
method with Velociraptor.
6 Conclusion
This paper presented a novel methodology to detect malware within an enterprise network using digital forensic artifacts collected with Velociraptor and analyzed with machine learning.
A total of 41 malicious samples and 1340 benign
samples were tested against three machine learning
algorithms. It was determined that Random Forest and SVM were the most effective classifiers for the dataset used, detecting all malicious samples in all occurrences with no or minimal false positives, an F1-Score of 1.0 for both algorithms, and a minimal standard deviation between the test occurrences. In order to validate this work, this
method was compared to a validation methodology based on [10], [11], and [12], which uses artifacts collected with the memory forensic tool Volatility and machine learning to detect malware in an enterprise network. The method proposed in this paper achieved the best results, which validated both the work and the methodology and showed that Velociraptor is an effective tool for this domain of research.
This paper addresses the limitation of live forensic data collection. Previous methods relying on
memory forensics required VMs to be suspended to
collect the memory image. The proposed method
addresses this by performing the data collection
using the Velociraptor Offline collector, which can
also collect some volatile data typically recovered
using memory forensics. This method is more
lightweight and does not require the suspension of
the user workstation. In addition, it enables the
collection of live data from an enterprise network
and can be applicable to non-virtualized
environments. It also shortens the delay between an incident and its investigation, as data can be processed rapidly once the model has been trained.
Table 4. Comparison of Volatility and Velociraptor methodologies' mean results, for each algorithm, after ten occurrences

Method         ML    Acc.    Recall   F1-Score   Std. Dev.
Volatility     IF    0.978   0.390    0.525      0.122
Velociraptor   IF    0.973   0.637    0.418      0.065
Volatility     RF    0.998   1.000    0.919      0.056
Velociraptor   RF    1.000   1.000    1.000      0.017
Volatility     SVM   0.997   0.955    0.879      0.082
Velociraptor   SVM   1.000   1.000    1.000      0.008
Table 5. Velociraptor malware type detection, by algorithm

Malware Type             Number of   Samples Detected
                         Malware     IF    RF    SVM
Credentials Stealing      2           1     2     2
Data Exfiltration         1           0     1     1
Persistence               7           7     7     7
Post-exploitation tool    3           3     3     3
RAT                      21          17    21    21
Ransomware                1           1     1     1
Rootkit                   6           6     6     6

Table 6. Comparison of the results obtained with Volatility with the works from [10] and [11]

Source       Malware Type(s)   ML   Recall   F1-Score
This Paper   Multiple          RF   1.000    0.919
From [10]    Ransomware        RF   0.923    0.924
From [10]    RAT               RF   0.927    0.947
From [11]    Rootkit           RF   0.984    0.986

This paper contributes significantly in four key areas. Firstly, it recognizes Velociraptor as a powerful tool for generating features to train a
machine-learning model for malware detection.
Secondly, it presents an effective approach for data
generation using a user simulation tool, HALO.
Thirdly, it pinpoints features that can reveal malware presence on an active computer. Lastly, it
compares the performance of three machine learning
algorithms in the context of malware detection.
In future works, the Velociraptor server could be
utilized in a live setting to monitor and identify
threats, replacing the Velociraptor offline collector.
This would enhance the speed of defensive actions
by network defenders, facilitating direct response
actions such as quarantining or terminating
malicious processes on infected hosts. As the Velociraptor server is not tailored for automated analysis, a bespoke program leveraging the Velociraptor API would be necessary for querying
and processing features. The detection model could
still be trained using the methodology outlined in
this paper. Furthermore, to enhance the model's
effectiveness, conducting experiments with a larger
array of malware samples through acquiring more
binary data or data augmentation methods is
recommended.
In terms of future research, alternative
algorithms could be explored for analyzing similar
data. While Random Forest and SVM yielded
positive outcomes in this study, future investigations
could delve into the utilization of deep learning
algorithms like neural networks. With Velociraptor
being a dynamically evolving and customizable
tool, the creation of new plugins for detecting
specific types of malware could be considered.
Declaration of Generative AI and AI-assisted
Technologies in the Writing Process
During the preparation of this work the authors used
myessaywriter.ai in order to improve the readability
and language of the manuscript. After using this
tool/service, the authors reviewed and edited the
content as needed and take full responsibility for the
content of the publication.
References:
[1] Securus360, ‘EDR is great, but its Limitations
can leave you open to Cyberattacks’, [Online].
https://www.securus360.com/blog/edr-is-
great-but-its-limitations-can-leave-you-open-
to-cyberattacks (Accessed Date: April 8,
2022).
[2] Periyadi, G. A. Mutiara, and R. Wijaya,
‘Digital forensics random access memory
using live technique based on network
attacked’, in 2017 5th International
Conference on Information and
Communication Technology (ICoIC7),
Melaka, Malaysia, May 2017, pp. 1–6. DOI:
10.1109/ICoICT.2017.8074695.
[3] M. Murthaja, B. Sahayanathan, A. N. T. S.
Munasinghe, D. Uthayakumar, L. Rupasinghe,
and A. Senarathne, ‘An Automated Tool for
Memory Forensics’, in 2019 International
Conference on Advancements in Computing
(ICAC), Malabe, Sri Lanka, Dec. 2019, pp. 1–6. DOI: 10.1109/ICAC49085.2019.9103416.
[4] M. Drolet, ‘Enterprise Malware Detection
Using Digital Forensic Artifacts And Machine
Learning’, M.A.Sc. Thesis, Royal Military
College of Canada, Kingston, Ontario, 2022,
[Online].
https://espace.rmc.ca/jspui/bitstream/11264/54
2/1/Thesis_Drolet_Completed.pdf (Accessed
Date: April 13, 2024).
[5] D. Spiekermann and J. Keller, ‘Unsupervised
packet-based anomaly detection in virtual
networks’, Computer Networks, vol. 192, p.
108017, Jun. 2021, DOI:
10.1016/j.comnet.2021.108017.
[6] E. Aghaei and G. Serpen, ‘Host-based
anomaly detection using Eigentraces feature
extraction and one-class classification on
system call trace data’, Journal of Information
Assurance & Security, vol. 14, no. 4, p. 11,
2019. arXiv:1911.11284
[7] M. T. R. Laskar, J. Huang, V. Smetana, C.
Stewart, K. Pouw, A. An, S. Chan and L. Liu,
‘Extending Isolation Forest for Anomaly
Detection in Big Data via K-Means’, ACM
Trans. Cyber-Phys. Syst., vol. 5, no. 4, pp. 1–26, Oct. 2021, DOI: 10.1145/3460976.
[8] R. Mendonça, A. Teodoro, R. Rosa, M. Saadi, D. C. Melgarejo, P. Nardelli and D. Rodríguez, ‘Intrusion Detection System Based
on Fast Hierarchical Deep Convolutional
Neural Network’, IEEE Access, vol. 9, pp.
61024–61034, 2021, DOI:
10.1109/ACCESS.2021.3074664.
[9] R. M. A. Mohammad and M. Alqahtani, ‘A
comparison of machine learning techniques
for file system forensics analysis’, Journal of
Information Security and Applications, vol.
46, pp. 53–61, Jun. 2019, DOI:
10.1016/j.jisa.2019.02.009.
[10] A. Cohen and N. Nissim, ‘Trusted detection of
ransomware in a private cloud using machine
learning methods leveraging meta-features
from volatile memory’, Expert Systems with
Applications, vol. 102, pp. 158–178, Jul. 2018,
DOI: 10.1016/j.eswa.2018.02.039.
[11] X. Wang, J. Zhang, A. Zhang and J. Ren,
‘TKRD: Trusted kernel rootkit detection for
cybersecurity of VMs based on machine
learning and memory forensic analysis’,
Mathematical Biosciences and Engineering,
vol. 16, no. 4, pp. 2650–2667, 2019, DOI:
10.3934/mbe.2019132.
[12] T. Panker and N. Nissim, ‘Leveraging
malicious behavior traces from volatile
memory using machine learning methods for
trusted unknown malware detection in Linux
cloud environments’, Knowledge-Based
Systems, vol. 226, p. 107095, Aug. 2021, DOI:
10.1016/j.knosys.2021.107095.
[13] S. Lyles, M. Desantis, J. Donaldson, M.
Gallegos, H. Nyholm, C. Taylor and K.
Monteith, ‘Machine Learning Analysis of
Memory Images for Process Characterization
and Malware Detection’, in 2022 52nd Annual
IEEE/IFIP International Conference on
Dependable Systems and Networks Workshops
(DSN-W), Baltimore, MD, USA, Jun. 2022,
pp. 162–169. DOI: 10.1109/DSN-
W54100.2022.00035.
[14] A. S. Bozkir, E. Tahillioglu, M. Aydos, and I.
Kara, ‘Catch them alive: A malware detection
approach through memory forensics, manifold
learning and computer vision’, Computers &
Security, vol. 103, p. 102166, Apr. 2021, DOI:
10.1016/j.cose.2020.102166.
[15] G. Karantzas and C. Patsakis, ‘An Empirical
Assessment of Endpoint Detection and
Response Systems against Advanced
Persistent Threats Attack Vectors’, JCP, vol.
1, no. 3, pp. 387–421, Jul. 2021, DOI:
10.3390/jcp1030021.
[16] E. M. Hutchins, M. J. Cloppert, and R. M.
Amin, ‘Intelligence-Driven Computer
Network Defense Informed by Analysis of
Adversary Campaigns and Intrusion Kill
Chains’, Leading Issues in Information
Warfare & Security Research, vol. 1, no. 1, p.
14, 2011.
[17] A. Hariyani, J. Undavia, N. Vaidya, and A.
Patel, ‘Forensic Evidence Collection From
Windows Host Using Python Based Tool’, in
2022 IEEE 4th International Conference on
Cybernetics, Cognition and Machine Learning
Applications (ICCCMLA), Goa, India: IEEE,
Oct. 2022, pp. 85–90. DOI:
10.1109/ICCCMLA56841.2022.9989295.
[18] A. M. A. Hameed, M. Daley, and L. Espinosa-
Anke, ‘A Machine Learning Approach for
Memory Forensic Investigation’, Cardiff
University, 2020.
[19] N. Miramirkhani, M. P. Appini, N.
Nikiforakis, and M. Polychronakis, ‘Spotless
Sandboxes: Evading Malware Analysis
Systems Using Wear-and-Tear Artifacts’, in
2017 IEEE Symposium on Security and
Privacy (SP), San Jose, CA, USA, May 2017,
pp. 1009–1024. DOI: 10.1109/SP.2017.42.
[20] A. Géron, Hands-on machine learning with
Scikit-Learn, Keras, and TensorFlow:
Concepts, tools, and techniques to build
intelligent systems, 2nd ed. Sebastopol, CA:
O’Reilly Media, Inc, 2019.
[21] B. Lachine, ‘Machine Learning Introduction’,
Kingston, Ontario, Oct. 13, 2020, [Online].
https://moodle.rmc.ca (Accessed Date:
October 13, 2021).
[22] F. T. Liu, K. M. Ting, and Z.-H. Zhou,
‘Isolation Forest’, in 2008 Eighth IEEE
International Conference on Data Mining,
Pisa, Italy, Dec. 2008, pp. 413–422. DOI:
10.1109/ICDM.2008.17.
[23] A. Sutera, G. Louppe, V. A. Huynh-Thu, L.
Wehenkel, and P. Geurts, ‘From global to
local MDI variable importances for random
forests and when they are Shapley values’, in
Advances in Neural Information Processing
Systems, Curran Associates, Inc., 2021, pp.
3533–3543, arXiv:2111.02218 [Online].
https://proceedings.neurips.cc/paper/2021/hash
/1cfa81af29c6f2d8cacb44921722e753-
Abstract.html (Accessed Date: January 23,
2023).
[24] T. Yiu, ‘Understanding Random Forest: How
the Algorithm Works and Why it is So
Effective’, Towards Data Science, [Online].
https://towardsdatascience.com/understanding-
random-forest-58381e0602d2 (Accessed Date: March 23, 2022).
[25] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-
Mazahua, and A. Lopez, ‘A comprehensive
survey on support vector machine
classification: Applications, challenges and
trends’, Neurocomputing, vol. 408, pp. 189-
215, Sep. 2020, DOI:
10.1016/j.neucom.2019.10.118.
[26] S. Khalid, T. Khalil, and S. Nasreen, ‘A
survey of feature selection and feature
extraction techniques in machine learning’,
Proceedings of 2014 Science and Information
Conference, SAI 2014, London, UK, pp. 372-
378, Oct. 2014, DOI:
10.1109/SAI.2014.6918213.
[27] J. H. Moedjahedy and G. Pramudya, ‘Student
Achievement Classification using Power
Predictive Score with Machine Learning’, in
2021 3rd International Conference on
Cybernetics and Intelligent System (ICORIS),
Makasar, Indonesia, Oct. 2021, pp. 1–6. DOI:
10.1109/ICORIS52787.2021.9649467.
[28] F. Wetschoreck, ‘RIP correlation. Introducing
the Predictive Power Score’, Towards Data
Science, [Online].
https://towardsdatascience.com/rip-
correlation-introducing-the-predictive-power-
score-3d90808b9598 (Accessed Date: March 17, 2022).
[29] L. Yang and A. Shami, ‘On Hyperparameter
Optimization of Machine Learning
Algorithms: Theory and Practice’,
Neurocomputing, vol. 415, pp. 295–316, Nov.
2020, DOI: 10.1016/j.neucom.2020.07.061.
[30] X. Ying, ‘An Overview of Overfitting and its
Solutions’, J. Phys.: Conf. Ser., vol. 1168, no.
2, p. 022022, Feb. 2019, DOI: 10.1088/1742-
6596/1168/2/022022.
[31] P. Liashchynskyi and P. Liashchynskyi, ‘Grid
Search, Random Search, Genetic Algorithm:
A Big Comparison for NAS’, Computing
Research Repository, vol. abs/1912.06059,
Dec. 2019, arXiv:1912.06059, [Online].
http://arxiv.org/abs/1912.06059 (Accessed
Date: March 28, 2022).
[32] M. H. Ligh, The Art of Memory Forensics.
Indianapolis, IN: Wiley, 2014.
[33] Rapid7, ‘Velociraptor - Dig deeper!’, [Online].
https://docs.velociraptor.app/ (Accessed Date:
February 1, 2021).
[34] Field Effect, ‘Field Effect: The most
sophisticated cyber threat monitoring on the
planet, made simple’, Field Effect Software
Inc., [Online]. https://fieldeffect.com/
(Accessed Date: March 21, 2022).
[35] M. Fischer, ‘Living Off The Land’. Apr. 04,
2022, [Online].
https://github.com/bytecode77/living-off-the-
land (Accessed Date: April 04, 2022).
[36] R. Ancarani, ‘Offensiveph’. Aug. 09, 2021,
[Online].
https://github.com/RiccardoAncarani/Offensiv
ePH (Accessed Date: April 4, 2022).
[37] M. Fischer, ‘r77 Rootkit’. Nov. 04, 2021,
[Online]. https://github.com/bytecode77/r77-
rootkit (Accessed Date: November 04, 2021).
[38] J. Kornev, ‘Hidden’. Apr. 02, 2022, [Online].
https://github.com/JKornev/hidden (Accessed
Date: April 4, 2022).
[39] F. López, ‘Class Imbalance: Random
Sampling and Data Augmentation with
Imbalanced-Learn’, Medium, [Online].
https://towardsdatascience.com/class-
imbalance-random-sampling-and-data-
augmentation-with-imbalanced-learn-
63f3a92ef04a (Accessed Date: June 13, 2022).
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors equally contributed in the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US