Optimization of Single-user Task Migration based on Improved DDPG
CAO NING1, HE YANG2, HU CAN1
1College of Computer Science and Software Engineering,
Hohai University,
Nanjing,
CHINA
2College of Information Science and Engineering,
Hohai University,
Nanjing,
CHINA
Abstract: - Aiming at the problems of slow convergence and unstable convergence of traditional reinforcement
learning algorithms in minimizing computational cost on edge servers with random task arrivals and time-
varying wireless channels, an improved DDPG algorithm (IDDPG) was proposed. The Critic network structure
of DDPG was replaced by the Dueling structure, which converged faster by splitting the state-action value function into a state value function and an advantage function. The update frequency of the Critic network was adjusted to be higher than that of the Actor-network to make the overall training more stable. Ornstein-Uhlenbeck noise was added to the actions selected through the Actor-network to improve the algorithm's exploration ability, and
the action noise size was set in segments to ensure the stability of convergence. Experimental results show that,
compared with other algorithms, the IDDPG algorithm can better minimize the computational cost and has a
certain improvement in the convergence speed and convergence stability.
Key-Words: - deep reinforcement learning; edge computing; task offloading; strategy optimization; network
structure; algorithm optimization.
Received: July 21, 2023. Revised: May 7, 2024. Accepted: June 22, 2024. Published: July 17, 2024.
1 Introduction
With the rapid development of artificial intelligence
and fifth-generation mobile communication
technology, a large number of computationally
intensive tasks have emerged on mobile devices, [1].
Examples include online 3D games, face recognition, and augmented or virtual reality, all of which are constrained by the limited computational capabilities of mobile devices, [2]. Additionally, in the Internet of
Things and intelligent transportation systems, [3],
wireless terminal devices with information
transmission capabilities need to preprocess massive
amounts of sensory data. To improve user
experience quality, Mobile Edge Computing (MEC)
technology, [4], has been proposed to bridge the gap
between the limited computational capabilities of
terminal devices and the massive computational
demand. Task offloading, as a core technology of
MEC, can offload computational tasks from mobile
devices to MEC servers close to base stations (BS).
Rational offloading strategies can not only minimize
transmission latency but also reduce the transmission
energy consumption of MEC servers, efficiently
completing massive computational tasks. To achieve a better user experience and higher transmission efficiency, task offloading strategies in MEC have been widely studied.
For short-term optimization on quasi-static
channels, literature [5], studied the optimal
offloading and resource allocation on software-
defined ultra-dense networks (SD-UDN). Literature
[6], studied from the perspective of controlling the
CPU working frequency, aiming to minimize
computational energy consumption and execution
time. Literature [7], discussed the trade-off between
energy and delay in environments with sensitive
delay and limited energy, and further improved the
allocation of computational resources and
communication. Literature [8], improved the
performance of MEC further by using emerging
technologies such as wireless power transfer and
non-orthogonal multiple access. For stochastic task
arrivals and time-varying wireless channels, dynamic
joint control of radio and computational resources
has also been studied. In the single-user scenario,
literature [9], considered the energy consumption in
single-antenna mobile devices as the research
objective, optimizing the time and power required
for offloading. Literature [10], used heuristic
algorithms to solve the dynamic task offloading
problem in simple scenarios. Markov decision
process (MDP) can also be used for dynamic control
analysis and design of computation offloading, [11].
In addition, literature [12], demonstrated how to
learn dynamic task offloading strategies through
reinforcement learning (RL) algorithms without prior
knowledge of the system. However, traditional RL algorithms can only solve small-scale problems and become ineffective as the complexity and dimensionality of the state space grow, [13]. Combining deep nonlinear network structures with reinforcement learning gives Deep Reinforcement Learning (DRL) end-to-end perception and control capabilities, thus alleviating the state-space explosion problem, [14]. In the MEC system, online
resource allocation and scheduling based on DRL
algorithms have been studied. Literature [15],
designed an online offloading algorithm to maximize
the weighted sum computation rate in a wireless
power supply system. Literature [16], proposed a
DRL-based task offloading strategy to select a MEC
server for offloading and determine the offloading
rate. Literature [17], presented a policy computation
offloading algorithm based on deep Q-network
(DQN), where mobile devices learn the optimal task
offloading and energy allocation based on the task
queue state, energy queue state, and channel quality,
aiming to maximize long-term utility. Literature
[18], proposed using Deep Deterministic Policy
Gradient (DDPG) as the offloading strategy
algorithm to solve the allocation problem of energy
consumption and delay in a single-user scenario with
stochastic task arrivals and time-varying wireless
channel models, aiming to minimize computational
cost. Literature [19], proposed the ECOO algorithm
based on DDPG for optimizing candidate networks,
solving the problem of stochastic task offloading.
Literature [20], used DDPG to jointly optimize
service cache placement, task offloading decisions,
and resource allocation.
In summary, in the single-user scenario, the current dynamic computation offloading strategies based on DRL rely on standard algorithms such as Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG), which suffer from slow and unstable convergence, leading to high computational cost and latency as the base station collects information and allocates resources to the user. Therefore, for a single-user MEC system, a more efficient task offloading strategy is needed.
In this paper, a system consisting of a single-user
MEC server and a base station connected to it is
constructed, where tasks arrive randomly and the
user's channel conditions are time-varying. The
mobile user independently learns a dynamic
computation offloading strategy based on local
observations of the system. An improved Deep
Deterministic Policy Gradient algorithm, called
IDDPG, is proposed, which operates in a continuous
action space and enables more efficient local
execution and task offloading power control, to
minimize computational cost in the long term.
2 System Model
This article presents a system consisting of a BS with N antennas, a MEC server, and a group of mobile users. The system adopts a discrete-time model in which each mobile user m ∈ {1, 2, ..., M} operates over slots of fixed duration τ0, and the time slots are indexed by t ∈ {0, 1, ..., T}. Because the channel conditions and task arrivals vary from slot to slot, the proportions of local execution and computation offloading must be chosen in every slot to balance the average energy consumption and latency of task processing, thus achieving cost minimization.
2.1 Network Model
For each time slot t, the received signal at the base station can be represented as follows:

y(t) = Σ_{m=1}^{M} √(p_{o,m}(t)) g_m(t) s_m(t) + n(t)    (1)

where p_{o,m}(t) represents the transmission power when user m migrates its task, s_m(t) represents the transmitted signal, and n(t) is an N-dimensional additive Gaussian white noise vector with variance σ_R². The channel vector g_m(t) is used to characterize the correlation between user m and time slot t, and its definition can be found in [21]. G(t) = [g_1(t), ..., g_M(t)] represents the N×M channel matrix between the mobile users and the base station. The base station detector can be represented by the channel matrix's pseudoinverse G†(t) = (G^H(t) G(t))^{-1} G^H(t). If the m-th row of G†(t) is represented as g̃_m(t), then g̃_m(t) y(t) represents the signal detected by the base station for user m. Therefore, the corresponding signal-to-noise ratio can be derived as shown in (2):

γ_m(t) = p_{o,m}(t) / (σ_R² ||g̃_m(t)||²)    (2)
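As an illustration of (1) and (2), the following minimal Python/NumPy sketch computes the pseudoinverse detector and the per-user SNR for a randomly drawn channel matrix; the dimensions and power values are illustrative and not taken from the authors' implementation.

```python
import numpy as np

N, M = 4, 2                      # receive antennas, mobile users
sigma_R2 = 1e-9                  # noise power sigma_R^2 (W), as in Section 4.1
p_o = np.array([1.5, 0.8])       # offloading transmit powers p_{o,m} (W)

rng = np.random.default_rng(0)
# Random complex N x M channel matrix G(t) standing in for the model of [21].
G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)

G_pinv = np.linalg.pinv(G)                                    # rows are g~_m(t)
snr = p_o / (sigma_R2 * np.sum(np.abs(G_pinv) ** 2, axis=1))  # gamma_m(t) in (2)
print(snr)
```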
2.2 Computation Model
We assume that the applications are fine-grained,
[22]. The definitions of the locally processed data d_{l,m}(t) and the offloaded data d_{o,m}(t) can be found in reference [21], while the definition of the task buffer queue B_m(t) is described in [20]. The relationship between B_m(t) and the task computation load is given by (3) as shown in [20]:

B_m(t+1) = [B_m(t) − (d_{l,m}(t) + d_{o,m}(t))]⁺ + A_m(t)    (3)

where [x]⁺ = max(x, 0), and A_m(t) represents the arrival task load.
The definitions of the local execution power p_{l,m}(t) and L_m can be found in [3]. By using the DVFS technique, [23], to adjust the chip voltage, the CPU frequency of the local device is determined as shown in (4), [20]:

f_m(t) = (p_{l,m}(t) / k)^(1/3)    (4)

where k represents the effective switching capacitance of the chip.
The amount of local data processed by mobile device m at time t is given by (5):

d_{l,m}(t) = τ0 f_m(t) / L_m    (5)

The amount of migration data processed by mobile device m at time t is given by (6):

d_{o,m}(t) = τ0 W log2(1 + γ_m(t))    (6)

where W is the system bandwidth.
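To make the slot dynamics concrete, the following Python sketch chains (3)-(6) over one slot using the parameter values of Section 4.1; the function and variable names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

tau0 = 1e-3    # slot length (s)
k = 1e-27      # effective switched capacitance
L_m = 500      # CPU cycles per bit
W = 1e6        # system bandwidth (Hz)

def slot_step(B_t, p_l, snr, A_t):
    """Advance the task buffer by one slot according to (3)-(6)."""
    f = (p_l / k) ** (1.0 / 3.0)                # CPU frequency from DVFS, (4)
    d_l = tau0 * f / L_m                        # locally processed bits, (5)
    d_o = tau0 * W * np.log2(1.0 + snr)         # offloaded bits, (6)
    B_next = max(B_t - (d_l + d_o), 0.0) + A_t  # buffer update, (3)
    return B_next, d_l, d_o

# Example: 2 Mbit backlog, 1 W local power, SNR of 1000, 2 kbit arriving.
B_next, d_l, d_o = slot_step(B_t=2e6, p_l=1.0, snr=1e3, A_t=2e3)
```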
3 DRL-based IDDPG
This paper proposes a DRL-based IDDPG algorithm
that enables mobile devices to learn computation
offloading policies dynamically, minimizing the
computational cost of energy consumption and
buffer delay in the MEC system. The proposed
strategy observes the environment from its
perspective and selects actions to allocate power for
local execution and computation offloading. The
following sections define the DRL framework,
followed by an introduction to the DDPG algorithm
and the Dueling network architecture. Finally, the
proposed IDDPG algorithm is presented.
3.1 Deep Reinforcement Learning
State space: It is assumed that the state of the mobile
device is determined by its local observations of the
system. At the beginning of time interval t, the
mobile device updates the queue length of the data
buffer B_m(t) using (3) and estimates the incoming uplink channel vector ĝ_m(t) from the signal-to-noise ratio received in the previous wireless communication with the MEC server. The current state of the mobile device is represented as follows:

s_t = [B_m(t), ĝ_m(t)]    (7)

where ĝ_m(t) denotes the estimated channel vector.
Action space: The mobile device selects an action a_t based on the observed state s_t, as represented by (8):

a_t = [p_{l,m}(t), p_{o,m}(t)]    (8)

where p_{l,m}(t) and p_{o,m}(t) are the powers allocated to local execution and computation offloading, respectively.
Reward function: The definition of the
relationship between energy consumption and delay
can be found in [24]. Based on this relationship, the
reward function r_t, received by the mobile device after time slot t, is defined as follows:

r_t = −(ω1 E_m(t) + ω2 B_m(t))    (9)

where E_m(t) denotes the total energy consumed by local execution and offloading in slot t, ω1 and ω2 are non-negative weighting factors, and ω1 + ω2 = 1. The meaning of the reward function is the negative weighted sum of the total energy consumption and the length of the task buffer queue at time t. The
weighted sum of energy consumption and delay
represents the computational cost required for task
offloading. The optimization objective of this paper
is to minimize the computational cost, which is
equivalent to maximizing the reward. By setting
different values for the weighting factors and
, dynamic adjustments can be made between
the energy consumption and task execution delay in
the task migration strategy.
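A compact Python sketch of the state, action, and reward definitions in (7)-(9) is given below; the energy term, the use of channel magnitudes as state features, and the clipping bound are assumptions consistent with the description above and Section 4.1, and all names are illustrative.

```python
import numpy as np

tau0 = 1e-3                  # slot length (s)
omega1, omega2 = 0.5, 0.5    # trade-off weights omega_1, omega_2
p_max = 2.0                  # maximum local / offloading power (W)

def build_state(B_t, g_hat_t):
    """State (7): buffer length plus the estimated uplink channel vector."""
    return np.concatenate(([B_t], np.abs(np.asarray(g_hat_t))))

def clip_action(a):
    """Action (8): [p_l, p_o] bounded to the feasible power range."""
    return np.clip(a, 0.0, p_max)

def reward(p_l, p_o, B_t):
    """Reward (9): negative weighted sum of slot energy and buffer length."""
    energy = tau0 * (p_l + p_o)
    return -(omega1 * energy + omega2 * B_t)
```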
3.2 DDPG
DDPG, [25], is a deterministic policy algorithm that
utilizes deep learning techniques and is based on the
Actor-Critic algorithm. This algorithm uses deep
neural networks to establish approximate functions.
It directly generates deterministic actions from the
Actor-network, evaluates the actions using the Critic
network, and guides the Actor network in selecting
the next action. Additionally, DDPG maintains a set
of parameters for both the Actor and Critic
networks, which are used to calculate the expected
value of action values for improved stability and
enhancement of the Critic's policy guidance level.
The networks that use the backup parameters are
referred to as target networks, and their
corresponding parameters are updated with small
increments each time. Another set of parameters,
corresponding to the Actor and Critic, is used to
generate actual interactive actions and calculate the
respective policy gradients, and these parameters are
updated after each learning iteration. This dual-
parameter setup aims to reduce non-convergence
occurrence due to the guidance of approximate data.
The specific usage scenarios for these four networks
are as follows:
Actor-network: The Actor network iteratively
updates its parameter θ. It generates specific actions
based on the current state s and interacts with the
environment to generate s', r.
Target Actor network: The target Actor network
periodically copies the parameters θ→θ'. It selects
the optimal exploratory action a' based on the
subsequent state s' provided by the environment.
Critic network: The Critic network iteratively
updates its parameter ω. It calculates the current
action value corresponding to the state s and the
generated action a.
Target Critic network: The target Critic network
periodically copies the parameters ω→ω'. It
calculates the target action value based on the
subsequent state s' and a'.
DDPG uses soft updates, where a portion of the
parameters is updated proportionally each time to
ensure training stability. The update expression is
shown as follows:

θ' ← τ θ + (1 − τ) θ',    ω' ← τ ω + (1 − τ) ω'    (10)
In action selection, DDPG does not use the ε-
greedy approach as in DQN. Instead, it adds a
certain amount of noise to the chosen action A to
increase randomness and learning coverage. The
expression is shown as follows:

A = μ(s | θ) + N    (11)

where μ(s | θ) represents the action policy of the Actor-network and N is the exploration noise.
For the current Critic network, the mean square
error loss function is defined as shown in (12):

L(ω) = (1/m) Σ_j (y_j − Q(s_j, a_j | ω))²    (12)

where y_j = r_j + γ Q'(s_j', μ'(s_j' | θ') | ω') serves as the Critic target Q-value. The current network parameters ω are updated using gradient backpropagation.
For the current Actor-network, the loss gradient
is defined as follows:

∇_θ J ≈ (1/m) Σ_j ∇_a Q(s, a | ω)|_{s=s_j, a=μ(s_j)} ∇_θ μ(s | θ)|_{s=s_j}    (13)

All parameters θ of the current Actor network are updated through gradient backpropagation.
3.3 Dueling Network Architecture
In reinforcement learning, the agent needs to
estimate the value for each state. However, for many states, the agent does not need to change its action to adapt to the new state, so evaluating every state-action pair for such states is neither efficient nor meaningful, [9]. The Dueling network architecture evaluates the state value and the action advantage separately, avoiding unnecessary action evaluations and enabling more accurate estimation of Q-values with faster convergence speed.
This paper incorporates the Dueling network
structure into the DDPG Critic main network and
target network, called the Dueling-Critic network.
The Dueling-Critic network separately evaluates the
observed state value and the action advantage
transmitted from the Actor-network. This network
allows finding stable policies in continuous action
spaces and speeds up convergence. The structure of
the Dueling-Critic network in this paper is
illustrated in Figure 1.
The state-action value function Q^π(s, a) represents the expected return when action a is chosen by policy π in state s, while the state value function V^π(s) represents the expected return generated by policy π in state s. The difference between the two, A^π(s, a) = Q^π(s, a) − V^π(s), represents the advantage of choosing action a in state s, as defined in [9].
Fig. 1: Dueling-Critic network architecture
The output of the Dueling-Critic network consists of the state value V(s; ω, α) and the action advantage A(s, a; ω, β). ω represents the shared parameters of the Dueling-Critic network, while α and β represent the parameters of the value function and the action advantage function, respectively, [9]. The output of the deep Q-network with the dueling network structure is shown as follows:

Q(s, a; ω, α, β) = V(s; ω, α) + A(s, a; ω, β)    (14)
Since the network directly outputs Q values
without explicitly knowing the state value V and
action advantage A, the action advantage A is centrally processed (mean-subtracted) to ensure that V and A remain identifiable, which improves optimization stability while maintaining performance. The modified Q value is defined as follows:

Q(s, a; ω, α, β) = V(s; ω, α) + (A(s, a; ω, β) − (1/|A|) Σ_{a'} A(s, a'; ω, β))    (15)
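A possible realization of the Dueling-Critic described above is sketched below in PyTorch; the hidden-layer sizes follow Section 4.1, and, because the action space is continuous, the advantage is centered over the mini-batch as one practical stand-in for the mean over actions in (15) (an assumption, not necessarily the authors' exact design):

```python
import torch
import torch.nn as nn

class DuelingCritic(nn.Module):
    """Critic whose last hidden layer splits into a state-value stream V(s)
    and an advantage stream A(s, a), combined as in (14)-(15)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(state_dim + action_dim, 400), nn.ReLU())
        self.h2 = nn.Sequential(nn.Linear(400, 300), nn.ReLU())
        self.value = nn.Linear(300, 1)       # V(s; omega, alpha)
        self.advantage = nn.Linear(300, 1)   # A(s, a; omega, beta)

    def forward(self, s, a):
        x = self.h2(self.h1(torch.cat([s, a], dim=-1)))
        v, adv = self.value(x), self.advantage(x)
        # Center the advantage before adding it to the state value.
        return v + (adv - adv.mean(dim=0, keepdim=True))

critic = DuelingCritic(state_dim=5, action_dim=2)
q = critic(torch.zeros(32, 5), torch.zeros(32, 2))   # shape (32, 1)
```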
3.4 Delayed Policy Updates
In DDPG, the Critic network samples a batch of experiences (s_j, a_j, r_j, s_j') from the replay buffer and computes the corresponding current Q-values. Then, the mean squared error is
computed to estimate the Q-values, and the current
network parameters of the Critic are updated
through gradient backpropagation. These updated
parameters are then transferred to the target network
at a fixed interval τ. Similarly, the Actor network
shares the same network structure as the Critic, so it
synchronizes the parameter updates with the Critic
at the fixed interval τ.
However, synchronous updates may cause issues: when the Actor-network reaches an optimal point during training, the Critic's Q-values may already have changed, so that point is no longer optimal under the updated Q-values. The Actor network can only continue searching for new optimal points, which increases the training duration. In the worst case, the
Actor-network may get trapped in suboptimal points
and fail to find the correct optimal point, leading to
poor training of both the Critic and Actor networks
and making it difficult for the model to converge.
This paper introduces asynchronous updates for
the Critic and Actor networks by setting τ1 and τ2 as
the soft update coefficients for the Actor and Critic
networks, respectively, and adjusting the update
frequency of the Critic network to be faster than the
Actor-network. The purpose is to stabilize the Critic
before the Actor proceeds with the next learning
step. Delayed updates not only reduce unnecessary
repeated updates but also minimize accumulated
errors in multiple updates.
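The asymmetric update schedule can be sketched as follows, with τ1 and τ2 taken from Table 2; the toy networks and the delay of two steps are illustrative assumptions that only show where the Critic and Actor updates sit in the loop:

```python
import torch.nn as nn

def soft_update(net, net_t, tau):
    for p, p_t in zip(net.parameters(), net_t.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Toy stand-ins for the Actor and the Dueling-Critic and their targets.
actor,  actor_t  = nn.Linear(4, 2), nn.Linear(4, 2)
critic, critic_t = nn.Linear(6, 1), nn.Linear(6, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())

tau1, tau2, actor_delay = 0.001, 0.15, 2   # Table 2; delay value is illustrative

for step in range(10):
    # ... Critic gradient step on the sampled mini-batch goes here ...
    soft_update(critic, critic_t, tau2)     # Critic target tracks quickly, every step
    if step % actor_delay == 0:
        # ... Actor gradient step goes here ...
        soft_update(actor, actor_t, tau1)   # Actor updated less frequently
```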
3.5 Noise Segmentation
Reinforcement learning often struggles with the
exploration-exploitation trade-off. If an agent
always selects actions based on the maximum Q-
value, it can hardly learn the optimal policy in
complex environments. This is because such an
action selection strategy lacks exploration, and some
actions may never be explored, leading to getting
stuck in suboptimal situations. Common methods to
address the exploration-exploitation dilemma
include:
1. ε-greedy: With a probability of (1-ε), the agent
greedily selects the action that corresponds to
the currently perceived maximum Q-value, and
with a probability of ε, it randomly selects an
action from all available options.
2. Decaying ε-greedy: As time progresses, the
probability of selecting a random action ε
decreases.
3. Uncertainty-based exploration: When the value
of an action is uncertain, the agent has a higher
probability of choosing that action.
4. Information value-based exploration: Approximating the value function and constructing a model based on the information state, then sampling to approximate the solution.
5. Adding Gaussian noise or Ornstein-Uhlenbeck
(OU) exploration noise to action selection,
where OU noise is defined in [5], to enhance
exploration.
In this paper, due to the small time granularity of the system, the agent adopts OU exploration noise.
To improve the efficiency of exploration, this paper
segments the magnitude of the OU noise, allowing it
to have different levels of exploration at different
stages of training. In the early stages of training, the
agent needs to explore a wide range of actions in
unknown environments, so a higher level of
exploration is required. As the agent interacts with
the environment and improves its learning ability,
the exploration level gradually decreases with the
number of steps.
This paper provides the settings for the noise
magnitude as shown in Table 1.
Table 1. Parameter configuration
Index   Step number                Noise magnitude
1       0 < Step < 60000           noise_sigma = 0.2
2       60000 ≤ Step < 150000      noise_sigma = 0.12
3       150000 ≤ Step < 300000     noise_sigma = 0.08
4       300000 ≤ Step < 400000     noise_sigma = 0.03
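The segmented OU noise can be implemented as below; the noise magnitudes follow Table 1, while the OU parameters theta and dt are illustrative choices rather than values reported in the paper:

```python
import numpy as np

def noise_sigma(step):
    """Segmented noise magnitude according to Table 1."""
    if step < 60000:
        return 0.20
    if step < 150000:
        return 0.12
    if step < 300000:
        return 0.08
    return 0.03

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with a step-dependent sigma."""
    def __init__(self, dim, theta=0.15, dt=1e-2):
        self.dim, self.theta, self.dt = dim, theta, dt
        self.x = np.zeros(dim)

    def sample(self, step):
        sigma = noise_sigma(step)
        dx = self.theta * (0.0 - self.x) * self.dt \
             + sigma * np.sqrt(self.dt) * np.random.randn(self.dim)
        self.x = self.x + dx
        return self.x

ou = OUNoise(dim=2)
# Noisy action as in (11), clipped to the feasible power range [0, 2] W.
noisy_action = np.clip(np.array([0.5, 1.0]) + ou.sample(step=1000), 0.0, 2.0)
```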
3.6 IDDPG
1 Initialize the Actor network, the Dueling-Critic network, their target networks, and the replay buffer Bm
2 for episode = 1 to episode_max do
3 Initialize state s
4 for t=1 to Tmax do
5 Observe the current state s and select an action a by using
the current policy network in conjunction with the
segmented exploration noise Δμ
6 Take action a, receive reward r, and then observe the next
state s'
7 Collect the set of (s, a, r, s') for the current execution and
store it in the buffer Bm
8 Randomly sample m experiences {(sj, aj, rj, sj')} from Bm
to create a mini-batch
9 Update the Dueling-Critic network by minimizing the
loss function L using the sampled mini-batch
10 Update the Actor-network using the sample gradient
policy
11 Soft update the Dueling-Critic target network and the Actor target network (τ1, τ2 << 1).
12 end for
13 end for
In the DRL-based IDDPG problem model, MEC
observes the environment and obtains an initial state
space s as (7). These high-dimensional states are
processed through deep neural networks to output
Q-value functions or policies. As an edge server, the
MEC selects actions a for local offloading power
and task migration power based on the current
policy network and segmented exploration noise Δμ
as shown in (8). Then, the reward value r based on
(9) is returned to the MEC, and the next state s' is
observed. The MEC collects the set (s, a, r, s') for
the current execution and stores it in the buffer Bm.
Afterward, a mini-batch of m experiences is
randomly sampled from Bm, and the Dueling-Critic
network Q(s, a, ω, α, β) is updated by minimizing
the loss function using the sampled gradient. The
Actor network is updated using the sample gradient.
The MEC receives the reward value r corresponding
to each action selection, and trains and improves the
network model to output the optimal policy that
maximizes the reward, which means minimizing the
MEC task offloading computational cost.
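Steps 7 and 8 of the procedure above (experience collection and mini-batch sampling) can be sketched with a simple replay buffer; the capacity and batch size are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer B_m used for steps 7-8 of the IDDPG procedure."""
    def __init__(self, capacity=100000):
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, m=64):
        batch = random.sample(self.buf, min(m, len(self.buf)))
        s, a, r, s_next = zip(*batch)
        return list(s), list(a), list(r), list(s_next)

buffer = ReplayBuffer()
buffer.store(s=[0.0], a=[0.5, 1.0], r=-0.3, s_next=[0.1])
states, actions, rewards, next_states = buffer.sample(m=1)
```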
4 Simulation
To assess the practicality of the IDDPG algorithm in a single-user MEC model, experimental simulations were conducted in an environment with Windows 10, CPU 8500, GTX 1080, Python 3.7, and TensorFlow 1.5. Four algorithms were compared: greedy local offloading (GD-Local), [20], greedy migration offloading (GD-Offload), [20], discrete-action DQN, [11], and continuous-action DDPG, [25].
4.1 Setup
In the MEC system, the time interval τ0 is set to 1
ms. At the beginning of each episode, the channel vectors are initialized according to the path-loss model, where the reference distance d0 = 1 m, the path loss exponent α = 3, and d represents the distance between the mobile device and the MEC server, which is set to 100 m in this paper. The channel correlation coefficient is ρ_m = 0.95, the bandwidth W = 1 MHz, the maximum transmission power of the mobile device p_o,m = 2 W, the noise power σ_R² = 10⁻⁹ W, the local execution parameter k = 10⁻²⁷, the CPU cycles per bit L_m = 500 cycles/bit, the CPU cycle frequency F_m = 1.26 GHz, and the maximum power for local execution p_l,m = 2 W.
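For reference, the settings above can be gathered into a single configuration dictionary (a sketch; the key names are illustrative):

```python
mec_params = {
    "slot_length_s": 1e-3,          # tau_0
    "ref_distance_m": 1.0,          # d_0
    "path_loss_exponent": 3.0,      # alpha
    "distance_m": 100.0,            # d
    "channel_correlation": 0.95,    # rho_m
    "bandwidth_hz": 1e6,            # W
    "max_tx_power_w": 2.0,          # p_{o,m}
    "noise_power_w": 1e-9,          # sigma_R^2
    "switched_capacitance": 1e-27,  # k
    "cycles_per_bit": 500,          # L_m
    "cpu_freq_hz": 1.26e9,          # F_m
    "max_local_power_w": 2.0,       # p_{l,m}
}
```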
To validate the effectiveness of the proposed
algorithm, the same neural network architecture and
parameters are used for DQN, DDPG, and IDDPG.
The number of neurons in the hidden layers is set to
400 and 300, respectively. In IDDPG, the last
hidden layer of the dueling-critic network is split
into an advantage function and a value function.
ReLU is used as the activation function for the
hidden layers of the neural network, and the actions
outputted by the Actor-network are mapped using a
sigmoid function. The neural network parameter
settings are shown in Table 2.
Table 2. Network parameter setting
Index   Parameter                         Value
1       Actor learning rate               0.001
2       Critic learning rate              0.0005
3       Actor soft update coefficient     0.001
4       Critic soft update coefficient    0.15
5       Noise parameter                   0.12
6       Noise parameter                   2.5×10^5
4.2 Training
The distance between the mobile device and the BS is set to d1 = 100 m. As shown in Figure 2, Figure 3 and
Figure 4, the training process for dynamic
offloading in the single-user scenario is conducted
with different values of ω1. In Figure 2, the left plot
corresponds to ω1=0.5, and the right plot
corresponds to ω1=0.8, where episode_max=2000,
Tmax=200. Each curve represents the average value
obtained from running the numerical simulation 10
times.
In each plot, two cases are shown, with task
arrival rates λ=2.0 Mbps and λ=3.0 Mbps. When
ω1=0.5, which represents a balance between delay
and energy consumption, although the average
energy consumption in Figure 3(left) is not the
lowest, the proposed IDDPG algorithm achieves
higher average rewards in Figure 2(left) compared
to the DDPG and DQN algorithms, by adjusting the
average delay in Figure 4(left).
Fig. 2: Average reward values for different task
arrival rates with different trade-off factors
Fig. 3: Average energy consumption values for
different task arrival rates with different trade-off
factors
Fig. 4: Average delay values for different task
arrival rates with different trade-off factors
When ω1=0.8, which indicates a higher emphasis
on energy consumption, the IDDPG algorithm
achieves lower average energy consumption in
Figure 3(right) compared to the compared
algorithms. Although there is a sacrifice in average
delay in Figure 4(right), the IDDPG algorithm still
achieves higher average rewards in Figure 2(right)
for task arrival rates λ=2.0 Mbps and λ=3.0 Mbps
compared to the DDPG and DQN algorithms.
Through comparative experiments, it can be
observed that the proposed IDDPG algorithm,
through interaction between the user agent and the
environment, achieves higher average rewards than
the compared algorithms under different trade-off
factors, and demonstrates greater stability. This
indicates that for continuous control problems, the
IDDPG algorithm is capable of better "exploration-
exploitation" and more efficient maximization of
rewards, i.e., minimizing computational costs, while
learning the optimal computation offloading
strategy.
The IDDPG algorithm demonstrates superior
performance in terms of average rewards and
convergence speed compared to the other algorithms,
regardless of the task arrival rate (λ = 2.0 Mbps or λ
= 3.0 Mbps) when the trade-off factor ω1 is set to
0.8. Furthermore, the advantage of the IDDPG algorithm over the DDPG and DQN algorithms is even greater than in the case with weighting factor ω1 = 0.5. This indicates that the IDDPG algorithm is
more suitable for scenarios that prioritize power
consumption. In situations with more complex task
computations, the IDDPG algorithm demonstrates
superior performance due to its reasonable action
exploration and network structure optimization.
4.3 Test
After training for a total of 2000 episodes, we
obtained dynamic computation offloading strategies
learned by the IDDPG, DDPG, and DQN
algorithms, respectively. To compare the
performance of different strategies, we conducted
numerical simulations in 100 episodes with a test
step size of T=10000. The average values were
taken from 10 simulations for each of the five task
arrival rates: λ=1.0 Mbps, λ=1.5 Mbps, λ=2.0 Mbps,
λ=2.5 Mbps, and λ=3.0 Mbps.
Fig. 5: Average reward value for different task
arrival rates with different ω1
In the scenario of ω1 =0.5 in Figure 5(left), when
λ=3.0 Mbps, the average reward of the DQN-based
strategy is approximately equal to the greedy
migration offloading. This is because the DQN
strategy has a limited number of discrete power
levels, and although finer granularity in the action
space discretization may lead to better performance,
the number of operations increases exponentially
with the degrees of freedom, making efficient
exploration more challenging and thus reducing the
performance of the DQN strategy.
From Figure 5, it can be observed that as the task
arrival rate increases, the average rewards of all the
algorithms gradually decrease. This is because
higher demands lead to increased computational
costs (average rewards in this paper are defined as
the negative value of costs). In the case of low
arrival rates, such as λ=1.0 Mbps and λ=1.5 Mbps,
the IDDPG algorithm shows little difference in
average rewards compared to the DDPG algorithm
but performs better than the DQN and other greedy
algorithms. This is because the computational
workload is small and does not require deep
exploration in actions. The improved network
structure and exploration noise advantage in IDDPG
are not prominent. However, as the arrival rate
increases, the IDDPG algorithm gradually
demonstrates its advantages in terms of faster and
more stable convergence.
In summary, the IDDPG algorithm outperforms
the compared strategies in terms of average rewards
for different task arrival rates under different trade-
off factors. This indicates that the IDDPG strategy
can allocate local execution power and task
offloading power more reasonably based on
observations of the environment in unknown
information settings, thereby balancing energy
consumption and latency and achieving the goal of
minimizing long-term costs, highlighting the
superiority of this strategy.
5 Conclusion
In this paper, an improved DDPG algorithm is
proposed to address the slow convergence and
instability issues in task offloading of traditional
single-user MEC systems using reinforcement
learning algorithms. The improved DDPG algorithm
aims to minimize computational costs more
efficiently. Through experimental validation, our
proposed algorithm has been shown to outperform
traditional DDPG, DQN, and other greedy strategies
in terms of power consumption and buffer delay.
The main innovation of this paper lies in replacing
the Critic network structure of DDPG with a
Dueling structure. This structure decomposes the
state value function into the advantage function and
value function, ensuring faster convergence speed
and higher stability. Additionally, we set the update
frequency of the Critic network higher than that of
the Actor-network to stabilize the Critic network's
training and improve the Actor network's action
output. Furthermore, we segment the action noise,
initially setting it at a higher value to ensure
comprehensive exploration during the initial
training stage. As the training progresses and
stabilizes, we gradually reduce the noise level to
ensure convergence stability. In our future work, we
plan to incorporate multi-user collaboration into the
IDDPG framework to further enhance task
offloading in MEC systems and improve strategy
performance.
References:
[1] SHI Weisong, ZHANG Xingzhou, WANG
Yifan, ZHANG Qingyang. Edge computing:
State-of-the-art and future directions. Journal
of Computer Research and Development,
2019, 56 (1):69-89 (in Chinese).
[2] W. Shi, J. Cao, Q. Zhang, Y. Li and L. Xu,
"Edge Computing: Vision and Challenges," in
IEEE Internet of Things Journal, vol. 3, no. 5,
pp. 637-646, Oct. 2016, doi: 10.1109/JIOT.2016.2579198.
[3] K. Zhang, Y. Mao, S. Leng, Y. He and Y.
ZHANG, "Mobile-Edge Computing for
Vehicular Networks: A Promising Network
Paradigm with Predictive Off-Loading," in
IEEE Vehicular Technology Magazine, vol.
12, no. 2, pp. 36-44, June 2017, doi:
10.1109/MVT.2017.2668838.
[4] Y. Mao, C. You, J. Zhang, K. Huang and K. B.
Letaief, "A Survey on Mobile Edge
Computing: The Communication
Perspective," in IEEE Communications
Surveys & Tutorials, vol. 19, no. 4, pp. 2322-
2358, Fourthquarter 2017, doi:
10.1109/COMST.2017.2745201
[5] Maria J.P. Peixoto and Akramul Azim. 2021.
Using time-correlated noise to encourage
exploration and improve autonomous agents
performance in Reinforcement Learning.
Procedia Comput. Sci., 191, C (2021), 85-92.
https://doi.org/10.1016/j.procs.2021.07.014.
[6] M. Chen and Y. Hao, "Task Offloading for
Mobile Edge Computing in Software Defined
Ultra-Dense Network," in IEEE Journal on
Selected Areas in Communications, vol. 36,
no. 3, pp. 587-597, March 2018, doi:
10.1109/JSAC.2018.2815360.
[7] H. Guo, J. Liu, J. Zhang, W. Sun and N. Kato,
"Mobile-Edge Computation Offloading for
Ultradense IoT Networks," in IEEE Internet
of Things Journal, vol. 5, no. 6, pp. 4977-
4988, Dec. 2018, doi:
10.1109/JIOT.2018.2838584.
[8] J. Zhang et al., "Energy-Latency Tradeoff for
Energy-Aware Offloading in Mobile Edge
Computing Networks," in IEEE Internet of
Things Journal, vol. 5, no. 4, pp. 2633-2645,
Aug. 2018, doi: 10.1109/JIOT.2017.2786343.
[9] S. Bi and Y. J. Zhang, "Computation Rate
Maximization for Wireless Powered Mobile-
Edge Computing With Binary Computation
Offloading," in IEEE Transactions on
Wireless Communications, vol. 17, no. 6, pp.
4177-4190, June 2018, doi:
10.1109/TWC.2018.2821664.
[10] J. Kwak, Y. Kim, J. Lee and S. Chong,
"DREAM: Dynamic Resource and Task
Allocation for Energy Minimization in Mobile
Cloud Systems," in IEEE Journal on Selected
Areas in Communications, vol. 33, no. 12, pp.
2510-2523, Dec. 2015, doi:
10.1109/JSAC.2015.2478718.
[11] N. Janatian, I. Stupia and L. Vandendorpe,
"Optimal resource allocation in ultra-low
power fog-computing SWIPT-based
networks," 2018 IEEE Wireless
Communications and Networking Conference
(WCNC), Barcelona, Spain, 2018, pp. 1-6, doi:
10.1109/WCNC.2018.8376974.
[12] M. Qin, L. Chen, N. Zhao, Y. Chen, F. R. Yu
and G. Wei, "Power-Constrained Edge
Computing With Maximum Processing
Capacity for IoT Networks," in IEEE Internet
of Things Journal, vol. 6, no. 3, pp. 4330-
4343, June 2019, doi:
10.1109/JIOT.2018.2875218.
[13] J. Liu, Y. Mao, J. Zhang and K. B. Letaief,
"Delay-optimal computation task scheduling
for mobile-edge computing systems," 2016
IEEE International Symposium on
Information Theory (ISIT), Barcelona, Spain,
2016, pp. 1451-1455, doi: 10.1109/ISIT.2016.7541539.
[14] T. Q. Dinh, Q. D. La, T. Q. S. Quek and H.
Shin, "Learning for Computation Offloading
in Mobile Edge Computing," in IEEE
Transactions on Communications, vol. 66, no.
12, pp. 6353-6367, Dec. 2018, doi:
10.1109/TCOMM.2018.2866572.
[15] Sutton R S, Barto A G. Reinforcement
learning: An introduction. MIT press, 2018.
[16] Mnih, V., Kavukcuoglu, K., Silver, D. et al.
Human-level control through deep
reinforcement learning. Nature, 518, 529–533
(2015), https://doi.org/10.1038/nature14236.
[17] L. Huang, S. Bi and Y. -J. A. Zhang, "Deep
Reinforcement Learning for Online
Computation Offloading in Wireless Powered
Mobile-Edge Computing Networks," in IEEE
Transactions on Mobile Computing, vol. 19,
no. 11, pp. 2581-2593, 1 Nov. 2020, doi:
10.1109/TMC.2019.2928811.
[18] M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu
and W. Zhuang, "Learning-Based
Computation Offloading for IoT Devices
With Energy Harvesting," in IEEE
Transactions on Vehicular Technology, vol.
68, no. 2, pp. 1930-1941, Feb. 2019, doi:
10.1109/TVT.2018.2890685.
[19] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji and
M. Bennis, "Optimized Computation
Offloading Performance in Virtual Edge
Computing Systems Via Deep Reinforcement
Learning," in IEEE Internet of Things Journal,
vol. 6, no. 3, pp. 4005-4018, June 2019, doi:
10.1109/JIOT.2018.2876279.
[20] Chen, Z., Wang, X. Decentralized
computation offloading for multi-user mobile
edge computing: a deep reinforcement
learning approach. J. Wireless Com Network,
2020, 188 (2020),
https://doi.org/10.1186/s13638-020-01801-6.
[21] H. Lu, C. Gu, F. Luo, W. Ding, S. Zheng and
Y. Shen, "Optimization of Task Offloading
Strategy for Mobile Edge Computing Based
on Multi-Agent Deep Reinforcement
Learning," in IEEE Access, vol. 8, pp.
202573-202584, 2020, doi:
10.1109/ACCESS.2020.3036416.
[22] Y. Yang, K. Wang, G. Zhang, X. Chen, X.
Luo and M. -T. Zhou, "MEETS: Maximal
Energy Efficient Task Scheduling in
Homogeneous Fog Networks," in IEEE
Internet of Things Journal, vol. 5, no. 5, pp.
4076-4087, Oct. 2018, doi:
10.1109/JIOT.2018.2846644.
[23] Asghari, A., Sohrabi, M.K. Combined use of
coral reefs optimization and multi-agent deep
Q-network for energy-aware resource
provisioning in cloud data centers using
DVFS technique. Cluster Comput., 25, 119-140 (2022), https://doi.org/10.1007/s10586-021-03368-3.
[24] Shortle JF, Thompson JM, Gross D, et al.
Fundamentals of queueing theory. New York:
John Wiley & Sons, 2018:1-576.
[25] Xu, Yi-Han, et al. “Deep Deterministic Policy
Gradient (DDPG)-Based Resource Allocation
Scheme for NOMA Vehicular
Communications.” IEEE Access, Jan. 2020,
pp. 18797-18807, https://doi.org/10.1109/access.2020.2968595.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors equally contributed in the present
research, at all stages from the formulation of the
problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en
_US