Continuous-Time Markov Decision Problems with Binary State
CHIARA BRAMBILLA, LUCA GROSSET, ELENA SARTORI
Dipartimento di Matematica "Tullio Levi-Civita"
Università degli Studi di Padova
Via Trieste, 63 - 35121 Padova
ITALY
Abstract: We analyse a binary state continuous-time Markov decision problem. The standard Hamilton–Jacobi–
Bellman equation is introduced and, with suitable assumptions on the probability rate and on the cost function, it
can be replaced by a simpler backward differential equation. Through a numerical example, we show how to find
an optimal feedback control using the results presented in this paper.
Key-Words: Continuous-time Finite-state Markov Process, Optimal Stochastic Control, Dynamic Programming.
Received: October 24, 2022. Revised: December 20, 2022. Accepted: January 18, 2023. Published: February 16, 2023.
1 Introduction
The theory of continuous-time Markov decision pro-
cesses is a rapidly growing area of research with
numerous practical applications; see, e.g., [4], and
[6]. This paper focuses on binary state processes
and prioritises an analytical solution over a compu-
tational one. The description of the evolution of bi-
nary state processes draws inspiration from the Curie-
Weiss model [5], but recent models have also incorpo-
rated human rationality by allowing decision-making
agents to modify their states. Initially, agents were
assumed to update their state according to a prede-
termined probabilistic transition rate based on their
surrounding environment. However, this approach does not take into account human rationality. In the models we consider, agents deciding whether to change their state face a decision problem and try to minimise or maximise a suitable objective (such as a cost, a gain, or a level of satisfaction).
To model this situation, we introduce in the transition
rate a control function, which is chosen by the agents
once they have solved their optimisation problem.
This leads to the definition of a controlled Markov
chain. It is well known that the Hamilton–Jacobi–
Bellman equation (HJB) associated with a controlled
continuous-time finite-state Markov process can be
difficult to treat. In this paper, we use techniques de-
veloped in [1] and [2] to derive closed or semi-closed
expressions for optimal control and state evolution.
2 The model
In this section, we describe the continuous-time
Markov decision process that we are going to study,
which is defined in the programming interval $[0, T]$ with $T > 0$. Let $S = \{-1, 1\}$ be the state space and let $[0, \upsilon]$, with $\upsilon > 0$, be the control set. For $t \in [0, T]$ denote by $X_t$ the state process and by $U_t \in [0, \upsilon]$ the control process. We formally define the dynamics of the process by
\[
P(X_{t+h} = -x \mid X_t = x,\ U_t = u) = \lambda(x, u)\, h + o(h), \qquad (1)
\]
where $\lambda(x, \cdot) \colon [0, \upsilon] \to [0, +\infty)$ is a continuous function for all $x \in S$. Now, assume that the system is controlled using a feedback function:
\[
U_t = u(t, X_t), \qquad (2)
\]
where
\[
u \colon [0, T] \times \{-1, 1\} \to [0, \upsilon] \qquad (3)
\]
is a measurable function. We denote by $\mathcal{U}$ the set of all feasible feedback control functions. We recall that for all feasible feedback functions, there exists a continuous-time Markov process defined by (1).
For the reader's convenience, we sketch how control and state are connected. We observe that the following approach is slightly more general than what we need, because it applies not only to feedback controls but also to more general non-anticipating controls. Nevertheless, we prefer to keep this approach since it is the most common one in stochastic optimal control problems. Let us denote by $B(S)$ the space of bounded real-valued functions on $S$, equipped with the supremum norm.
For all $t \in [0, T]$ and for all measurable control functions $u$, consider the following operator:
\[
\Lambda^u_t \colon B(S) \to B(S), \qquad f \mapsto \Lambda^u_t f, \qquad (4)
\]
where
\[
\Lambda^u_t f(x) := \lambda(x, u(t, x)) \big( f(-x) - f(x) \big)
\]
for $x \in S$. Let $D = D([0, T], S)$ be the space of right-continuous functions with finite left limits defined on $[0, T]$, taking values in the finite binary set $S = \{-1, 1\}$. We endow this space with the Skorokhod topology and denote by $\mathcal{S}$ the Borel $\sigma$-algebra on $D$. In the measurable space $(D, \mathcal{S})$, we denote by $(X_t)_{t \in [0, T]}$ the canonical process: $X_t(\omega) = \omega(t)$.
Let $\mu$ be a probability measure on $S$. A probability measure $P^u_{s,\mu}$ on $(D, \mathcal{S})$ is a solution to the martingale problem characterised by (4) if and only if the following two conditions hold:
1. $P^u_{s,\mu}(X_s \in A) = \mu(A)$ for $s \in [0, T]$, $A \subseteq S$;
2. for all functions $f \in B(S)$, the process $f(X_t) - \int_0^t \Lambda^u_r f(X_r)\, dr$ is a martingale with respect to the natural filtration $\mathcal{F}_t = \sigma(X_r,\ r \le t)$.
The process $X_t$ characterised by the previous martingale problem is the unique continuous-time Markov chain with infinitesimal generator given by (4).
This discussion allows us to clarify the connection
between a feedback control function and the associ-
ated state function. Therefore, we denote by $E^u_{s,\mu}$ the expectation with respect to the probability measure $P^u_{s,\mu}$. Furthermore, if $\mu$ is the measure that concentrates all probability in the state $x \in S$, we write $P^u_{s,x}$ and $E^u_{s,x}$.
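To make the connection between (1), the feedback (2) and the resulting process concrete, the following minimal Python sketch simulates one path of the controlled chain by a small-step Euler scheme: on each step of length h the state flips from x to -x with probability lambda(x, u(t, x))h. The rate below is the one used later in the example of Section 5, while the feedback u_fb is an arbitrary feasible choice (not an optimal one); both are placeholders for illustration only.

```python
import numpy as np

# Minimal sketch (not from the paper): Euler-type simulation of the controlled
# two-state chain defined by (1).  The rate lam is the one of the example in
# Section 5; the feedback u_fb is an arbitrary feasible choice in [0, upsilon].
rng = np.random.default_rng(0)

T, h, upsilon = 1.0, 1e-3, 2.0

def lam(x, u):                  # transition rate lambda(x, u) = 1 + x + u
    return 1.0 + x + u

def u_fb(t, x):                 # placeholder feedback with values in [0, upsilon]
    return 0.5 * (1.0 - x)      # applies some control only in state x = -1

def simulate(x0):
    """One path of X on [0, T]: flip x -> -x with probability lam(x, u) * h per step."""
    t, x = 0.0, x0
    while t < T:
        if rng.random() < lam(x, u_fb(t, x)) * h:
            x = -x
        t += h
    return x

final_states = [simulate(1) for _ in range(2000)]
print("empirical P(X_T = 1 | X_0 = 1) under u_fb:",
      np.mean([s == 1 for s in final_states]))
```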
3 Stochastic optimal control
In this section, we present the finite-time optimal con-
trol problem we are dealing with. We are looking for
a feasible feedback control function $u$ that minimises
\[
J(u) := E^u_{0,x}\left[ \int_0^T c(t, X_t, U_t)\, dt + g(X_T) \right], \qquad (5)
\]
where $x \in S$, $c(\cdot,\cdot,\cdot)$ is a continuous function and $g(\cdot)$ is given.
Let us introduce the optimal value function:
\[
V(t, x) = \inf_{u \in \mathcal{U}} E^u_{t,x}\left[ \int_t^T c(s, X_s, U_s)\, ds + g(X_T) \right].
\]
This function is the key to solving our optimal control problem because it solves a particular equation. More precisely, the HJB equation associated with the previous optimal control problem can be written as
\[
\begin{cases}
\partial_t V(t, x) + \min_{w \in [0, \upsilon]} \big\{ c(t, x, w) + \Lambda^w_t V(t, x) \big\} = 0 \\
V(T, x) = g(x),
\end{cases}
\qquad (6)
\]
where $x \in \{-1, 1\}$ and $t \in [0, T)$.
In the following theorem, we explain the connection between the optimal value function, a solution of (6), and an optimal feedback control function.

Theorem 1 (HJB Equation). Let us assume that:
• $S = \{-1, 1\}$;
• $\lambda(x, \cdot) \colon [0, \upsilon] \to [0, +\infty)$ is a continuous function for all $x \in S$;
• $c \colon [0, T] \times S \times [0, \upsilon] \to \mathbb{R}$ is a continuous function.
Then:
• there exists a unique function, bounded and continuously differentiable with respect to the first variable, which is a solution of the HJB equation;
• an optimal control in the feedback form $u^* \colon [0, T] \times \{-1, 1\} \to [0, \upsilon]$ is given by
\[
c(t, x, u^*(t, x)) + \Lambda^{u^*}_t V(t, x) = \min_{w \in [0, \upsilon]} \big\{ c(t, x, w) + \Lambda^w_t V(t, x) \big\}. \qquad (7)
\]

Proof. See [3], Theorem 2.4.
This standard result is very powerful when we want to
solve an optimal control problem driven by a stochas-
tic differential equation. However, when considering
a controlled continuous-time Markov chain with bi-
nary state, the HJB equation is not so easy to exploit directly. In the next section, we present an approach that allows us to use all the information contained in the HJB equation to find an optimal control.
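Although the route taken in the next section avoids solving (6) directly, the two-state HJB system can of course be integrated numerically. The following hedged sketch (our own illustration, not the authors' method) steps (6) backward in time with an explicit Euler scheme, minimising over a grid of controls w in [0, upsilon]; the data lam, c, g reproduce the example of Section 5, and all names and grid sizes are arbitrary choices.

```python
import numpy as np

# Hedged sketch: backward time-stepping of the two-state HJB system (6) with a
# pointwise minimisation over a control grid.  Model data follow the example of
# Section 5; this is an illustration, not the approach taken in the paper.
T, upsilon, N = 1.0, 2.0, 10_000
dt = T / N
w_grid = np.linspace(0.0, upsilon, 201)

lam = lambda x, w: 1.0 + x + w                  # transition rate
c   = lambda t, x, w: x / 2.0 + w**2 / 4.0      # running cost
g   = lambda x: x / 4.0                         # terminal cost

V = {1: g(1), -1: g(-1)}                        # terminal condition V(T, x) = g(x)
u_opt = {1: np.zeros(N), -1: np.zeros(N)}       # minimisers on the time grid
for k in range(N - 1, -1, -1):
    t = k * dt
    V_new = {}
    for x in (1, -1):
        # Hamiltonian values c(t, x, w) + lam(x, w) (V(t, -x) - V(t, x)) over the grid
        ham = c(t, x, w_grid) + lam(x, w_grid) * (V[-x] - V[x])
        j = int(np.argmin(ham))
        u_opt[x][k] = w_grid[j]
        V_new[x] = V[x] + dt * ham[j]           # dV/dt = -min(...), stepped backward
    V = V_new

print("V(0, 1)  =", V[1], "  u*(0, 1)  =", u_opt[1][0])
print("V(0, -1) =", V[-1], " u*(0, -1) =", u_opt[-1][0])
```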
4 From HJB to a backward ODE
This is the main section of this work. At this point,
we show how the HJB equation can be replaced by
an ordinary backward differential equation. The idea is to take the difference of the HJB equation evaluated at the two states of the system; this difference contains all the information needed to construct an optimal feedback control. Thus, the HJB equation, which seemed hardly treatable, reduces to an ordinary differential equation that must be solved backwards in time. Henceforth, we use the discrete gradient notation: for all $x \in S$ we define
\[
\nabla_x f(x) := f(-x) - f(x).
\]
Now we have all the necessary information to state and prove the following result.
Theorem 2 (Backward ODE). Assume that the optimal control problem presented in the previous sections has the following formulation:
• $\lambda(x, u) = \alpha(x) + \beta u$, with $\alpha(x) \ge 0$ for all $x \in S$ and $\beta > 0$;
• $c(t, x, u) = \gamma(t)\, x + \kappa(t)\, u^2 / 2$, with $\gamma(\cdot), \kappa(\cdot) \in C^0([0, T], \mathbb{R})$ and $\kappa(t) > 0$.
Moreover, suppose that we can solve the backward ODE:
\[
\begin{cases}
\dot z(t) = 2\gamma(t) + \big(\alpha(1) + \alpha(-1)\big)\, z(t) + \dfrac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t) \\
z(T) = \nabla_x g(1)
\end{cases}
\]
and, finally, assume that the solution $z(t)$ of this ODE satisfies the inequality
\[
[z(t)]^- \le \frac{\upsilon\, \kappa(t)}{\beta},
\]
so that the control constraint is inactive. Then the optimal feedback control is
\[
u^*(t, x) = \frac{\beta}{\kappa(t)} \big[ z(t)\, x \big]^-.
\]
Proof. Under the assumptions, the HJB equation becomes:
\[
\partial_t V(t, x) + \min_{w \in [0, \upsilon]} \left\{ \gamma(t)\, x + \kappa(t) \frac{w^2}{2} + \big( \alpha(x) + \beta w \big) \nabla_x V(t, x) \right\} = 0.
\]
Minimising over $w$ we get
\[
w^* = \min\left\{ \frac{\beta}{\kappa(t)} \big[ \nabla_x V(t, x) \big]^-,\ \upsilon \right\}.
\]
Since $\frac{\beta}{\kappa(t)} [\nabla_x V(t, x)]^- \le \upsilon$ by the above assumption, we can rewrite the HJB equation as
\[
\partial_t V(t, x) + \gamma(t)\, x + \frac{\beta^2}{\kappa(t)} \frac{1}{2} \big( [\nabla_x V(t, x)]^- \big)^2 + \left( \alpha(x) + \frac{\beta^2}{\kappa(t)} [\nabla_x V(t, x)]^- \right) \nabla_x V(t, x) = 0.
\]
Let us introduce a new key variable
\[
z(t) := \nabla_x V(t, 1).
\]
Now we rewrite the previous HJB equation for $x = 1$ and for $x = -1$; we obtain the system:
\[
\begin{cases}
\partial_t V(t, 1) + \gamma(t) + \alpha(1)\, z(t) + \dfrac{\beta^2}{\kappa(t)} \left( \dfrac{1}{2} \big( [z(t)]^- \big)^2 + [z(t)]^-\, z(t) \right) = 0 \\[2mm]
\partial_t V(t, -1) - \gamma(t) - \alpha(-1)\, z(t) + \dfrac{\beta^2}{\kappa(t)} \left( \dfrac{1}{2} \big( [-z(t)]^- \big)^2 - [-z(t)]^-\, z(t) \right) = 0.
\end{cases}
\]
Recalling that, for all $z \in \mathbb{R}$, we have that $[-z]^- = [z]^+$ and, taking the difference between the two equations, we get
\[
\dot z(t) - 2\gamma(t) - \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \frac{\beta^2}{\kappa(t)} \left( \frac{1}{2} \Big( \big( [z(t)]^+ \big)^2 - \big( [z(t)]^- \big)^2 \Big) - z(t) \big( [z(t)]^+ + [z(t)]^- \big) \right) = 0.
\]
Finally, for all $z \in \mathbb{R}$ it holds $[z]^+ + [z]^- = |z|$ and $[z]^+ - [z]^- = z$; hence we obtain:
\[
\dot z(t) = 2\gamma(t) + \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \frac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t).
\]
The final condition $V(T, x) = g(x)$ for all $x \in S$ gives us $z(T) = \nabla_x V(T, 1) = \nabla_x g(1)$. Therefore, the HJB equation reduces to
\[
\begin{cases}
\dot z(t) = 2\gamma(t) + \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \dfrac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t) \\
z(T) = \nabla_x g(1),
\end{cases}
\]
which is exactly the backward ODE we are looking for.
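A minimal numerical sketch of how Theorem 2 can be used in practice: the backward ODE is integrated from t = T down to t = 0 with scipy's solve_ivp, the inactivity condition [z(t)]^- <= upsilon kappa(t)/beta is checked along the solution, and the feedback u*(t, x) = (beta/kappa(t))[z(t)x]^- is assembled by interpolation. The model data (gamma, kappa, alpha, beta, g, upsilon) are illustrative placeholders that satisfy the assumptions of the theorem; they are not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hedged sketch: numerical use of Theorem 2 with illustrative placeholder data.
T, beta, upsilon = 1.0, 1.0, 2.0
gamma = lambda t: 0.5
kappa = lambda t: 0.5
alpha = {1: 1.0, -1: 1.0}              # alpha(x) >= 0 for x in S
g     = lambda x: x / 4.0

z_T = g(-1) - g(1)                     # terminal condition z(T) = grad_x g(1)

def rhs(t, z):
    # Backward ODE of Theorem 2
    return [2.0 * gamma(t) + (alpha[1] + alpha[-1]) * z[0]
            + beta**2 / (2.0 * kappa(t)) * abs(z[0]) * z[0]]

ts = np.linspace(T, 0.0, 501)          # integrate from t = T back to t = 0
sol = solve_ivp(rhs, (T, 0.0), [z_T], t_eval=ts, rtol=1e-8, atol=1e-10)
z = sol.y[0]

# Inactivity condition of the theorem: [z(t)]^- <= upsilon * kappa(t) / beta
neg_part = np.maximum(-z, 0.0)
bound = upsilon * np.array([kappa(t) for t in ts]) / beta
assert np.all(neg_part <= bound), "control constraint would be active"

def u_star(t, x):
    zt = np.interp(t, ts[::-1], z[::-1])          # z on a forward time grid
    return beta / kappa(t) * max(-zt * x, 0.0)    # (beta / kappa(t)) [z(t) x]^-

print("z(0) =", z[-1], " u*(0, 1) =", u_star(0.0, 1), " u*(0, -1) =", u_star(0.0, -1))
```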
5 A numerical example
We formally define the dynamics of the process, with $x \in S$, as follows:
\[
P(X_{t+h} = -x \mid X_t = x,\ U_t = u) = (1 + x + u)\, h + o(h). \qquad (8)
\]
We want to choose a feasible feedback control function $u$ which takes values in $[0, 2]$, in order to minimise
\[
J^u_{0,x} := E^u_{0,x}\left[ \int_0^1 \left( \frac{X_t}{2} + \frac{U_t^2}{4} \right) dt + \frac{X_T}{4} \right];
\]
then, the backward ODE is
\[
\begin{cases}
\dot z(t) = 1 + |z(t)|\, z(t) \\
z(1) = -1/2.
\end{cases}
\]
In a left neighborhood of $1$, for example in $(1 - \varepsilon, 1]$ with $\varepsilon > 0$, the ODE becomes
\[
\begin{cases}
\dot z(t) = 1 - (z(t))^2 \\
z(1) = -1/2.
\end{cases}
\]
The solution to the previous Cauchy problem is
\[
z(t) = 1 - \frac{6 e^{2}}{e^{2t} + 3 e^{2}}.
\]
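As a quick sanity check (not part of the paper), the closed form can be verified symbolically, for instance with sympy:

```python
import sympy as sp

# Verify that z(t) = 1 - 6 e^2 / (e^(2t) + 3 e^2) solves z' = 1 - z^2, z(1) = -1/2.
t = sp.symbols('t')
z = 1 - 6 * sp.exp(2) / (sp.exp(2 * t) + 3 * sp.exp(2))

assert sp.simplify(sp.diff(z, t) - (1 - z**2)) == 0
assert sp.simplify(z.subs(t, 1) + sp.Rational(1, 2)) == 0
print("closed-form solution verified")
```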
We observe that, for $t \in [0, 1]$, this function is always negative (i.e., we can take $\varepsilon > 1$); moreover, $[z(t)]^- \le -z(0) = (3e^{2} - 1)/(1 + 3e^{2}) < 1$, because the function is strictly negative and strictly increasing. Hence, the constraint on the control $u \in [0, 2]$ is not active and the optimal feedback is feasible. Using the optimal feedback control
\[
u^*(t, x) = 2 \left[ \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}}\, x \right]^-,
\]
we can find the evolution of the expected value of the
process. Let us define $m^*_x(t) := E^{u^*}_{0,x}(X_t)$; using the infinitesimal generator we obtain
\[
\dot m^*_x(t) = -2\, E^{u^*}_{0,x}\!\left[ X_t \left( 1 + X_t + 2 \left[ \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}}\, X_t \right]^- \right) \right],
\]
which becomes
\[
\dot m^*_x(t) = \left( 2\, \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}} - 2 \right) \big( m^*_x(t) + 1 \big) \qquad (9)
\]
with initial condition
\[
m^*_x(0) = E^{u^*}_{0,x}(X_0) = x.
\]
Equation (9) is a linear ODE whose coefficients are continuous functions for all $t \in [0, 1]$; therefore, there exists a unique solution $m^*_x(t)$ to the previous Cauchy problem. For $x = -1$ the solution is constant: $m^*_{-1}(t) \equiv -1$. On the other hand, for $x = 1$ we can find the analytical form of this function explicitly, but it is long and not very informative. We prefer to plot its graph in Figure 1. Moreover, using the evolution of the expected value of the optimal process starting from $x = 1$, we can find the evolution of the probability of each state: $p(t) := P^{u^*}_{0,1}(X_t = 1) = (1 + m^*_1(t))/2$. The probability is displayed in Figure 1.
Figure 1: $p(t)$ and $m^*_1(t)$.
We notice that, using the optimal feedback control, the process moves towards the state $-1$. When the initial position is $-1$, the process remains in this state; otherwise, when the initial position is $1$, the process changes its state with a strictly positive probability rate.
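For completeness, a short sketch that reproduces the curves of Figure 1: $z(t)$ is evaluated in closed form, equation (9) is integrated forward from $m(0) = 1$ by explicit Euler, and $p(t) = (1 + m^*_1(t))/2$ is computed. The step size and variable names are our own choices.

```python
import numpy as np

# Hedged sketch: integrate the linear ODE (9) forward from m(0) = 1 and recover
# p(t) = (1 + m) / 2, i.e. the two curves shown in Figure 1.
N = 10_000
ts = np.linspace(0.0, 1.0, N + 1)
dt = ts[1] - ts[0]

# z(t) in closed form, from the computation above
z = (np.exp(2 * ts) - 3 * np.exp(2)) / (np.exp(2 * ts) + 3 * np.exp(2))

m = np.empty(N + 1)
m[0] = 1.0                                   # process started at x = 1
for k in range(N):
    m[k + 1] = m[k] + dt * (2.0 * z[k] - 2.0) * (m[k] + 1.0)   # equation (9)

p = (1.0 + m) / 2.0                          # p(t) = P(X_t = 1)
print("m_1*(1) =", m[-1], "  p(1) =", p[-1])
```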
6 Conclusion
In this paper, we analyse the evolution of a
continuous-time Markov decision process charac-
terised by a binary state. We introduce the standard
Hamilton–Jacobi–Bellman equation and prove that,
under suitable analytical formulations of the rate of
transition and of the cost function, we can replace the
HJB equation with a backward ODE. All the information needed to characterise an optimal feedback control is then contained in the solution of the backward ODE. Using a numerical example, we show how
to find an optimal feedback control through the results
shown in this paper.
This research can be improved in various direc-
tions. First, we can try to extend the family of func-
tions that satisfies the hypotheses of Theorem 2. Sub-
sequently, we can investigate whether what is proven
in this paper is valid for analogous problems with an
infinite-time horizon, too. Finally, it may be inter-
esting to study whether this approach is useful for
analysing the interaction between multiple players.
References:
[1] A. Cecchin, P. Dai Pra, M. Fischer & G. Pelino,
On the Convergence Problem in Mean Field
Games: A Two State Model without Uniqueness,
SIAM Journal on Control and Optimization, Vol.
57, No. 4, 2019, pp. 2443-2466.
[2] P. Dai Pra, E. Sartori & M. Tolotti, Climb on
the Bandwagon: Consensus and Periodicity in
a Lifetime Utility Model with Strategic Interac-
tions, Dynamic Games and Applications, Vol. 9,
No. 4, 2019, pp. 1061-1075.
[3] M.K. Ghosh & S. Saha, Risk-sensitive Control
of Continuous-time Markov Chains, Stochastics,
Vol. 86, No.4, 2014, pp. 655-675.
[4] X. Guo & O. Hernández-Lerma, Continuous-
Time Markov Decision Processes, Springer
Berlin Heidelberg, 2009.
[5] M. Kochmański, T. Paszkiewicz & S. Wolski, Curie–Weiss Magnet: A Simple Model of Phase Transition, European Journal of Physics, Vol. 34, No. 6, 2013, pp. 1555-1578.
[6] W. Wang, X. Su, S. Gan, & L. Qian, Pricing
Vulnerable European Options Under a Markov-
Modulated Jump Diffusion Process, WSEAS
Transactions on Mathematics, Vol.16, 2017, pp.
123-132.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors contributed equally to the present research, at all stages from the formulation of the problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US