Continuous-Time Markov Decision Problems with Binary State
CHIARA BRAMBILLA, LUCA GROSSET, ELENA SARTORI
Dipartimento di Matematica "Tullio Levi-Civita"
Università degli Studi di Padova
Via Trieste, 63 - 35121 Padova
ITALY
Abstract: We analyse a binary state continuous-time Markov decision problem. The standard Hamilton–Jacobi–
Bellman equation is introduced and, with suitable assumptions on the probability rate and on the cost function, it
can be replaced by a simpler backward differential equation. Through a numerical example, we show how to find
an optimal feedback control using the results presented in this paper.
Key-Words: Continuous-time Finite-state Markov Process, Optimal Stochastic Control, Dynamic Programming.
Received: October 24, 2022. Revised: December 20, 2022. Accepted: January 18, 2023. Published: February 16, 2023.
1 Introduction
The theory of continuous-time Markov decision pro-
cesses is a rapidly growing area of research with
numerous practical applications; see, e.g., [4], and
[6]. This paper focuses on binary state processes
and prioritises an analytical solution over a compu-
tational one. The description of the evolution of bi-
nary state processes draws inspiration from the Curie-
Weiss model [5], but recent models have also incorpo-
rated human rationality by allowing decision-making
agents to modify their states. Initially, agents were
assumed to update their state according to a prede-
termined probabilistic transition rate based on their
surrounding environment. However, this approach does not take into account human rationality. In the models we consider, agents deciding whether to change their state face a decision problem and try to minimise or maximise a suitable objective (such as a cost, a gain, or a level of satisfaction).
To model this situation, we introduce in the transition
rate a control function, which is chosen by the agents
once they have solved their optimisation problem.
This leads to the definition of a controlled Markov
chain. It is well known that the Hamilton–Jacobi–
Bellman equation (HJB) associated with a controlled
continuous-time finite-state Markov process can be
difficult to treat. In this paper, we use techniques de-
veloped in [1] and [2] to derive closed or semi-closed
expressions for optimal control and state evolution.
2 The model
In this section, we describe the continuous-time
Markov decision process that we are going to study,
which is defined in the programming interval $[0, T]$ with $T > 0$. Let $S = \{-1, 1\}$ be the state space and let $[0, \upsilon]$, with $\upsilon > 0$, be the control set. For $t \in [0, T]$ denote by $X_t$ the state process and by $U_t \in [0, \upsilon]$ the control process. We formally define the dynamics of the process by
\[
P(X_{t+h} = -x \mid X_t = x,\ U_t = u) = \lambda(x, u)\, h + o(h), \qquad (1)
\]
where $\lambda(x, \cdot) \colon [0, \upsilon] \to [0, +\infty)$ is a continuous function for all $x \in S$. Now, assume that the system is controlled using a feedback function:
\[
U_t = u(t, X_t), \qquad (2)
\]
where
\[
u \colon [0, T] \times \{-1, 1\} \to [0, \upsilon] \qquad (3)
\]
is a measurable function. We denote by $\mathcal{U}$ the set of all feasible feedback control functions. We recall that for all feasible feedback functions, there exists a continuous-time Markov process defined by (1).
For the reader's convenience, we sketch how control and state are connected. We observe that the following approach is slightly more general than what we need, because it applies not only to feedback controls but also to more general non-anticipating controls. Nevertheless, we prefer to keep this approach since it is the most common one in stochastic optimal control problems. Let us denote by $B(S)$ the space of bounded real-valued functions on $S$, equipped with the supremum norm.
For all $t \in [0, T]$ and for all measurable control functions $u$, consider the following operator:
\[
\Lambda^u_t \colon B(S) \to B(S), \qquad f \mapsto \Lambda^u_t f, \qquad (4)
\]
where
\[
\Lambda^u_t f(x) := \lambda(x, u(t, x)) \big( f(-x) - f(x) \big)
\]
for $x \in S$. Let $D = D([0, T], S)$ be the space of right-continuous functions with finite left limits defined on $[0, T]$, taking values in the finite binary set $S = \{-1, 1\}$. We endow this space with the Skorokhod topology and denote by $\mathcal{S}$ the Borel $\sigma$-algebra on $D$. In the measurable space $(D, \mathcal{S})$, we denote by $(X_t)_{t \in [0, T]}$ the canonical process: $X_t(\omega) = \omega(t)$.
Let $\mu$ be a probability measure on $S$. A probability measure $P^u_{s,\mu}$ on $(D, \mathcal{S})$ is a solution to the martingale problem characterised by (4) if and only if the following two conditions hold:
1. $P^u_{s,\mu}(X_s \in A) = \mu(A)$ for $s \in [0, T]$, $A \subseteq S$;
2. for all functions $f \in B(S)$, the process $f(X_t) - \int_0^t \Lambda^u_r f(X_r)\, dr$ is a martingale with respect to the natural filtration $\mathcal{F}_t = \sigma(X_r,\ r \le t)$.
The process $X_t$ characterised by the previous martingale problem is the unique continuous-time Markov chain with infinitesimal generator given by (4).
This discussion allows us to clarify the connection
between a feedback control function and the associ-
ated state function. Therefore, we denote by $E^u_{s,\mu}$ the expectation with respect to the probability measure $P^u_{s,\mu}$. Furthermore, if $\mu$ is the measure that concentrates all probability in the state $x \in S$, we write $P^u_{s,x}$ and $E^u_{s,x}$.
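To make the connection between (1), the feedback (2) and the resulting process concrete, the following minimal Python sketch simulates one path of the controlled chain by a small-step Euler scheme: on each step of length h the state flips from x to -x with probability lambda(x, u(t, x))h. The rate below is the one used later in the example of Section 5, while the feedback u_fb is an arbitrary feasible choice (not an optimal one); both are placeholders for illustration only.

```python
import numpy as np

# Minimal sketch (not from the paper): Euler-type simulation of the controlled
# two-state chain defined by (1).  The rate lam is the one of the example in
# Section 5; the feedback u_fb is an arbitrary feasible choice in [0, upsilon].
rng = np.random.default_rng(0)

T, h, upsilon = 1.0, 1e-3, 2.0

def lam(x, u):                  # transition rate lambda(x, u) = 1 + x + u
    return 1.0 + x + u

def u_fb(t, x):                 # placeholder feedback with values in [0, upsilon]
    return 0.5 * (1.0 - x)      # applies some control only in state x = -1

def simulate(x0):
    """One path of X on [0, T]: flip x -> -x with probability lam(x, u) * h per step."""
    t, x = 0.0, x0
    while t < T:
        if rng.random() < lam(x, u_fb(t, x)) * h:
            x = -x
        t += h
    return x

final_states = [simulate(1) for _ in range(2000)]
print("empirical P(X_T = 1 | X_0 = 1) under u_fb:",
      np.mean([s == 1 for s in final_states]))
```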
3 Stochastic optimal control
In this section, we present the finite-time optimal con-
trol problem we are dealing with. We are looking for
a feasible feedback control function $u$ that minimises
\[
J(u) := E^u_{0,x}\left[ \int_0^T c(t, X_t, U_t)\, dt + g(X_T) \right], \qquad (5)
\]
where $x \in S$, $c(\cdot,\cdot,\cdot)$ is a continuous function and $g(\cdot)$ is given.
Let us introduce the optimal value function:
\[
V(t, x) = \inf_{u \in \mathcal{U}} E^u_{t,x}\left[ \int_t^T c(s, X_s, U_s)\, ds + g(X_T) \right].
\]
This function is the key to solving our optimal control problem because it solves a particular equation. More precisely, the HJB equation associated with the previous optimal control problem can be written as
\[
\begin{cases}
\partial_t V(t, x) + \min_{w \in [0, \upsilon]} \big\{ c(t, x, w) + \Lambda^w_t V(t, x) \big\} = 0 \\
V(T, x) = g(x),
\end{cases}
\qquad (6)
\]
where $x \in \{-1, 1\}$ and $t \in [0, T)$.
In the following theorem, we explain the connection between the optimal value function, a solution of (6), and an optimal feedback control function.

Theorem 1 (HJB Equation). Let us assume that:
• $S = \{-1, 1\}$;
• $\lambda(x, \cdot) \colon [0, \upsilon] \to [0, +\infty)$ is a continuous function for all $x \in S$;
• $c \colon [0, T] \times S \times [0, \upsilon] \to \mathbb{R}$ is a continuous function.
Then:
• there exists a unique function, bounded and continuously differentiable with respect to the first variable, which is a solution of the HJB equation;
• an optimal control in the feedback form $u^* \colon [0, T] \times \{-1, 1\} \to [0, \upsilon]$ is given by
\[
c(t, x, u^*(t, x)) + \Lambda^{u^*}_t V(t, x) = \min_{w \in [0, \upsilon]} \big\{ c(t, x, w) + \Lambda^w_t V(t, x) \big\}. \qquad (7)
\]

Proof. See [3], Theorem 2.4.
This standard result is very powerful when we want to
solve an optimal control problem driven by a stochas-
tic differential equation. However, when considering
a controlled continuous-time Markov chain with bi-
nary state, the HJB equation is not so easy to exploit directly. In the next section, we present an approach that allows us to use all the information contained in the HJB equation to find an optimal control.
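Although the route taken in the next section avoids solving (6) directly, the two-state HJB system can of course be integrated numerically. The following hedged sketch (our own illustration, not the authors' method) steps (6) backward in time with an explicit Euler scheme, minimising over a grid of controls w in [0, upsilon]; the data lam, c, g reproduce the example of Section 5, and all names and grid sizes are arbitrary choices.

```python
import numpy as np

# Hedged sketch: backward time-stepping of the two-state HJB system (6) with a
# pointwise minimisation over a control grid.  Model data follow the example of
# Section 5; this is an illustration, not the approach taken in the paper.
T, upsilon, N = 1.0, 2.0, 10_000
dt = T / N
w_grid = np.linspace(0.0, upsilon, 201)

lam = lambda x, w: 1.0 + x + w                  # transition rate
c   = lambda t, x, w: x / 2.0 + w**2 / 4.0      # running cost
g   = lambda x: x / 4.0                         # terminal cost

V = {1: g(1), -1: g(-1)}                        # terminal condition V(T, x) = g(x)
u_opt = {1: np.zeros(N), -1: np.zeros(N)}       # minimisers on the time grid
for k in range(N - 1, -1, -1):
    t = k * dt
    V_new = {}
    for x in (1, -1):
        # Hamiltonian values c(t, x, w) + lam(x, w) (V(t, -x) - V(t, x)) over the grid
        ham = c(t, x, w_grid) + lam(x, w_grid) * (V[-x] - V[x])
        j = int(np.argmin(ham))
        u_opt[x][k] = w_grid[j]
        V_new[x] = V[x] + dt * ham[j]           # dV/dt = -min(...), stepped backward
    V = V_new

print("V(0, 1)  =", V[1], "  u*(0, 1)  =", u_opt[1][0])
print("V(0, -1) =", V[-1], " u*(0, -1) =", u_opt[-1][0])
```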
4 From HJB to a backward ODE
This is the main section of this work. At this point,
we show how the HJB equation can be replaced by
an ordinary backward differential equation. The idea is to take the difference of the HJB equation evaluated at the two states of the system; this difference contains all the information needed to construct an optimal feedback control. Thus, the HJB equation, which seemed hardly treatable, reduces to an ordinary differential equation that must be solved backwards in time. Henceforth, we use the discrete gradient notation: for all $x \in S$ we define
\[
\nabla_x f(x) := f(-x) - f(x).
\]
Now we have all the necessary information to state and prove the following result.
Theorem 2 (Backward ODE). Assume that the optimal control problem presented in the previous sections has the following formulation:
• $\lambda(x, u) = \alpha(x) + \beta u$, with $\alpha(x) \ge 0$ for all $x \in S$ and $\beta > 0$;
• $c(t, x, u) = \gamma(t)\, x + \kappa(t)\, u^2 / 2$, with $\gamma(\cdot), \kappa(\cdot) \in C^0([0, T], \mathbb{R})$ and $\kappa(t) > 0$.
Moreover, suppose that we can solve the backward ODE:
\[
\begin{cases}
\dot z(t) = 2\gamma(t) + \big(\alpha(1) + \alpha(-1)\big)\, z(t) + \dfrac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t) \\
z(T) = \nabla_x g(1)
\end{cases}
\]
and, finally, assume that the solution $z(t)$ of this ODE satisfies the inequality
\[
[z(t)]^- \le \frac{\upsilon\, \kappa(t)}{\beta},
\]
so that the control constraint is inactive. Then the optimal feedback control is
\[
u^*(t, x) = \frac{\beta}{\kappa(t)} \big[ z(t)\, x \big]^-.
\]
Proof. Under the assumptions, the HJB equation becomes:
\[
\partial_t V(t, x) + \min_{w \in [0, \upsilon]} \left\{ \gamma(t)\, x + \kappa(t) \frac{w^2}{2} + \big( \alpha(x) + \beta w \big) \nabla_x V(t, x) \right\} = 0.
\]
Minimising over $w$ we get
\[
w^* = \min\left\{ \frac{\beta}{\kappa(t)} \big[ \nabla_x V(t, x) \big]^-,\ \upsilon \right\}.
\]
Since $\frac{\beta}{\kappa(t)} [\nabla_x V(t, x)]^- \le \upsilon$ by the above assumption, we can rewrite the HJB equation as
\[
\partial_t V(t, x) + \gamma(t)\, x + \frac{\beta^2}{\kappa(t)} \frac{1}{2} \big( [\nabla_x V(t, x)]^- \big)^2 + \left( \alpha(x) + \frac{\beta^2}{\kappa(t)} [\nabla_x V(t, x)]^- \right) \nabla_x V(t, x) = 0.
\]
Let us introduce a new key variable
\[
z(t) := \nabla_x V(t, 1).
\]
Now we rewrite the previous HJB equation for $x = 1$ and for $x = -1$; we obtain the system:
\[
\begin{cases}
\partial_t V(t, 1) + \gamma(t) + \alpha(1)\, z(t) + \dfrac{\beta^2}{\kappa(t)} \left( \dfrac{1}{2} \big( [z(t)]^- \big)^2 + [z(t)]^-\, z(t) \right) = 0 \\[2mm]
\partial_t V(t, -1) - \gamma(t) - \alpha(-1)\, z(t) + \dfrac{\beta^2}{\kappa(t)} \left( \dfrac{1}{2} \big( [-z(t)]^- \big)^2 - [-z(t)]^-\, z(t) \right) = 0.
\end{cases}
\]
Recalling that, for all $z \in \mathbb{R}$, we have that $[-z]^- = [z]^+$ and, taking the difference between the two equations, we get
\[
\dot z(t) - 2\gamma(t) - \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \frac{\beta^2}{\kappa(t)} \left( \frac{1}{2} \Big( \big( [z(t)]^+ \big)^2 - \big( [z(t)]^- \big)^2 \Big) - z(t) \big( [z(t)]^+ + [z(t)]^- \big) \right) = 0.
\]
Finally, for all $z \in \mathbb{R}$ it holds $[z]^+ + [z]^- = |z|$ and $[z]^+ - [z]^- = z$; hence we obtain:
\[
\dot z(t) = 2\gamma(t) + \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \frac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t).
\]
The final condition $V(T, x) = g(x)$ for all $x \in S$ gives us $z(T) = \nabla_x V(T, 1) = \nabla_x g(1)$. Therefore, the HJB equation reduces to
\[
\begin{cases}
\dot z(t) = 2\gamma(t) + \big( \alpha(1) + \alpha(-1) \big)\, z(t) + \dfrac{\beta^2}{2\kappa(t)}\, |z(t)|\, z(t) \\
z(T) = \nabla_x g(1),
\end{cases}
\]
which is exactly the backward ODE we are looking for.
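A minimal numerical sketch of how Theorem 2 can be used in practice: the backward ODE is integrated from t = T down to t = 0 with scipy's solve_ivp, the inactivity condition [z(t)]^- <= upsilon kappa(t)/beta is checked along the solution, and the feedback u*(t, x) = (beta/kappa(t))[z(t)x]^- is assembled by interpolation. The model data (gamma, kappa, alpha, beta, g, upsilon) are illustrative placeholders that satisfy the assumptions of the theorem; they are not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hedged sketch: numerical use of Theorem 2 with illustrative placeholder data.
T, beta, upsilon = 1.0, 1.0, 2.0
gamma = lambda t: 0.5
kappa = lambda t: 0.5
alpha = {1: 1.0, -1: 1.0}              # alpha(x) >= 0 for x in S
g     = lambda x: x / 4.0

z_T = g(-1) - g(1)                     # terminal condition z(T) = grad_x g(1)

def rhs(t, z):
    # Backward ODE of Theorem 2
    return [2.0 * gamma(t) + (alpha[1] + alpha[-1]) * z[0]
            + beta**2 / (2.0 * kappa(t)) * abs(z[0]) * z[0]]

ts = np.linspace(T, 0.0, 501)          # integrate from t = T back to t = 0
sol = solve_ivp(rhs, (T, 0.0), [z_T], t_eval=ts, rtol=1e-8, atol=1e-10)
z = sol.y[0]

# Inactivity condition of the theorem: [z(t)]^- <= upsilon * kappa(t) / beta
neg_part = np.maximum(-z, 0.0)
bound = upsilon * np.array([kappa(t) for t in ts]) / beta
assert np.all(neg_part <= bound), "control constraint would be active"

def u_star(t, x):
    zt = np.interp(t, ts[::-1], z[::-1])          # z on a forward time grid
    return beta / kappa(t) * max(-zt * x, 0.0)    # (beta / kappa(t)) [z(t) x]^-

print("z(0) =", z[-1], " u*(0, 1) =", u_star(0.0, 1), " u*(0, -1) =", u_star(0.0, -1))
```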
5 A numerical example
We formally define the dynamics of the process, with $x \in S$, as follows:
\[
P(X_{t+h} = -x \mid X_t = x,\ U_t = u) = (1 + x + u)\, h + o(h). \qquad (8)
\]
We want to choose a feasible feedback control function $u$ which takes values in $[0, 2]$, in order to minimise
\[
J^u_{0,x} := E^u_{0,x}\left[ \int_0^1 \left( \frac{X_t}{2} + \frac{U_t^2}{4} \right) dt + \frac{X_T}{4} \right];
\]
then, the backward ODE is
\[
\begin{cases}
\dot z(t) = 1 + |z(t)|\, z(t) \\
z(1) = -1/2.
\end{cases}
\]
In a left neighborhood of $1$, for example in $(1 - \varepsilon, 1]$ with $\varepsilon > 0$, the ODE becomes
\[
\begin{cases}
\dot z(t) = 1 - (z(t))^2 \\
z(1) = -1/2.
\end{cases}
\]
The solution to the previous Cauchy problem is
\[
z(t) = 1 - \frac{6 e^{2}}{e^{2t} + 3 e^{2}}.
\]
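As a quick sanity check (not part of the paper), the closed form can be verified symbolically, for instance with sympy:

```python
import sympy as sp

# Verify that z(t) = 1 - 6 e^2 / (e^(2t) + 3 e^2) solves z' = 1 - z^2, z(1) = -1/2.
t = sp.symbols('t')
z = 1 - 6 * sp.exp(2) / (sp.exp(2 * t) + 3 * sp.exp(2))

assert sp.simplify(sp.diff(z, t) - (1 - z**2)) == 0
assert sp.simplify(z.subs(t, 1) + sp.Rational(1, 2)) == 0
print("closed-form solution verified")
```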
We observe that, for $t \in [0, 1]$, this function is always negative (i.e., we can take $\varepsilon > 1$); moreover, $[z(t)]^- \le -z(0) = (3e^{2} - 1)/(1 + 3e^{2}) < 1$, because the function is strictly negative and strictly increasing. Hence, the constraint on the control $u \in [0, 2]$ is not active and the optimal feedback is feasible. Using the optimal feedback control
\[
u^*(t, x) = 2 \left[ \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}}\, x \right]^-,
\]
we can find the evolution of the expected value of the
process. Let us define $m^*_x(t) := E^{u^*}_{0,x}(X_t)$; using the infinitesimal generator we obtain
\[
\dot m^*_x(t) = -2\, E^{u^*}_{0,x}\!\left[ X_t \left( 1 + X_t + 2 \left[ \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}}\, X_t \right]^- \right) \right],
\]
which becomes
\[
\dot m^*_x(t) = \left( 2\, \frac{e^{2t} - 3 e^{2}}{e^{2t} + 3 e^{2}} - 2 \right) \big( m^*_x(t) + 1 \big) \qquad (9)
\]
with initial condition
\[
m^*_x(0) = E^{u^*}_{0,x}(X_0) = x.
\]
Equation (9) is a linear ODE whose coefficients are continuous functions for all $t \in [0, 1]$; therefore, there exists a unique solution $m^*_x(t)$ to the previous Cauchy problem. For $x = -1$ the solution is constant: $m^*_{-1}(t) \equiv -1$. On the other hand, for $x = 1$ we can find the analytical form of this function explicitly, but it is long and not very informative. We prefer to plot its graph in Figure 1. Moreover, using the evolution of the expected value of the optimal process starting from $x = 1$, we can find the evolution of the probability of each state: $p(t) := P^{u^*}_{0,1}(X_t = 1) = (1 + m^*_1(t))/2$. The probability is displayed in Figure 1.
Figure 1: $p(t)$ and $m^*_1(t)$.
We notice that, using the optimal feedback control, the process moves towards the state $-1$. When the initial position is $-1$, the process remains in this state; otherwise, when the initial position is $1$, the process changes its state with a strictly positive probability rate.
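For completeness, a short sketch that reproduces the curves of Figure 1: $z(t)$ is evaluated in closed form, equation (9) is integrated forward from $m(0) = 1$ by explicit Euler, and $p(t) = (1 + m^*_1(t))/2$ is computed. The step size and variable names are our own choices.

```python
import numpy as np

# Hedged sketch: integrate the linear ODE (9) forward from m(0) = 1 and recover
# p(t) = (1 + m) / 2, i.e. the two curves shown in Figure 1.
N = 10_000
ts = np.linspace(0.0, 1.0, N + 1)
dt = ts[1] - ts[0]

# z(t) in closed form, from the computation above
z = (np.exp(2 * ts) - 3 * np.exp(2)) / (np.exp(2 * ts) + 3 * np.exp(2))

m = np.empty(N + 1)
m[0] = 1.0                                   # process started at x = 1
for k in range(N):
    m[k + 1] = m[k] + dt * (2.0 * z[k] - 2.0) * (m[k] + 1.0)   # equation (9)

p = (1.0 + m) / 2.0                          # p(t) = P(X_t = 1)
print("m_1*(1) =", m[-1], "  p(1) =", p[-1])
```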
6 Conclusion
In this paper, we analyse the evolution of a
continuous-time Markov decision process charac-
terised by a binary state. We introduce the standard
Hamilton–Jacobi–Bellman equation and prove that,
under suitable analytical formulations of the rate of
transition and of the cost function, we can replace the
HJB equation with a backward ODE. All the information needed to characterise an optimal feedback control is then contained in the solution of the backward ODE. Using a numerical example, we show how
to find an optimal feedback control through the results
shown in this paper.
This research can be improved in various direc-
tions. First, we can try to extend the family of func-
tions that satisfies the hypotheses of Theorem 2. Sub-
sequently, we can investigate whether what is proven
in this paper is valid for analogous problems with an
infinite-time horizon, too. Finally, it may be inter-
esting to study whether this approach is useful for
analysing the interaction between multiple players.
References:
[1] A. Cecchin, P. Dai Pra, M. Fischer & G. Pelino,
On the Convergence Problem in Mean Field
Games: A Two State Model without Uniqueness,
SIAM Journal on Control and Optimization, Vol.
57, No. 4, 2019, pp. 2443-2466.
[2] P. Dai Pra, E. Sartori & M. Tolotti, Climb on
the Bandwagon: Consensus and Periodicity in
a Lifetime Utility Model with Strategic Interac-
tions, Dynamic Games and Applications, Vol. 9,
No. 4, 2019, pp. 1061-1075.
[3] M.K. Ghosh & S. Saha, Risk-sensitive Control
of Continuous-time Markov Chains, Stochastics,
Vol. 86, No.4, 2014, pp. 655-675.
[4] X. Guo & O. Hernández-Lerma, Continuous-
Time Markov Decision Processes, Springer
Berlin Heidelberg, 2009.
[5] M. Kochmański, T. Paszkiewicz & S. Wolski, Curie–Weiss Magnet: A Simple Model of Phase Transition, European Journal of Physics, Vol. 34, No. 6, 2013, pp. 1555-1578.
[6] W. Wang, X. Su, S. Gan, & L. Qian, Pricing
Vulnerable European Options Under a Markov-
Modulated Jump Diffusion Process, WSEAS
Transactions on Mathematics, Vol.16, 2017, pp.
123-132.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
The authors contributed equally to the present research, at all stages from the formulation of the problem to the final findings and solution.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflict of Interest
The authors have no conflicts of interest to declare
that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US