Medical Knowledge Integration into Reinforcement Learning Algorithms for Dynamic Treatment Regimes

[Uncaptioned image] Sophia Yazzourh
Institut de Mathématiques de Toulouse
UMR5219 - Université de Toulouse
CNRS - UPS IMT
F-31062 Toulouse Cedex 9, France
[email protected]
&[Uncaptioned image]  Nicolas Savy
Institut de Mathématiques de Toulouse
UMR5219 - Université de Toulouse
CNRS - UPS IMT
F-31062 Toulouse Cedex 9, France

&[Uncaptioned image]  Philippe Saint-Pierre
Institut de Mathématiques de Toulouse
UMR5219 - Université de Toulouse
CNRS - UPS IMT
F-31062 Toulouse Cedex 9, France

&[Uncaptioned image]  Michael R. Kosorok
Department of Biostatistics
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599, U.S.A.

Abstract

The goal of precision medicine is to provide individualized treatment at each stage of chronic diseases, a concept formalized by Dynamic Treatment Regimes (DTR). These regimes adapt treatment strategies based on decision rules learned from clinical data to enhance therapeutic effectiveness. Reinforcement Learning (RL) algorithms allow to determine these decision rules conditioned by individual patient data and their medical history. The integration of medical expertise into these models makes possible to increase confidence in treatment recommendations and facilitate the adoption of this approach by healthcare professionals and patients. In this work, we examine the mathematical foundations of RL, contextualize its application in the field of DTR, and present an overview of methods to improve its effectiveness by integrating medical expertise.

Keywords Expert Knowledge Integration  \cdot Precision Medicine  \cdot Adaptive Interventions  \cdot Medical Decision Making  \cdot Decision Process

1 Introduction

Modern medicine, with its remarkable advancements in care, drugs, and treatments, now seeks to enhance its ability to deliver personalized treatments for each individual patient. The paradigm of precision medicine (Kosorok and Laber, 2019) initiates a profound reflection on this question. Precision medicine aims to optimize the quality of healthcare by tailoring the medical approach to match the specific and continually changing health condition of every individual patient. The heterogeneity among patients’ populations and sub-populations leads to distinct reactions and, consequently, necessitates different treatment approaches. Initially, this research domain introduced statistical models (Chakraborty and Murphy, 2014; Kosorok and Laber, 2019; Kosorok and Moodie, 2015) aimed at facilitating decision-making support. Naturally, with the advent of data storage and the computational power, machine learning methods (Coronato et al., 2020; Yu et al., 2021) have also begun to be applied to address this issue.

In this context, one of the growing interests of modern medicine is to adapt prescribed treatments to the individual data, unique characteristics and particular medical history of the patient. Precision medicine seeks to put the patient’s own information at the center in order to improve their health. The motto behind is "The right treatment for the right patient (at the right time)". In a 2015 State of the Union address, President Obama announced a Precision Medicine Initiative to revolutionize how we improve health, research, and treat disease. The initiative defines precision medicine as "an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person" (Terry, 2015). In technical terms, Adaptive Treatment Strategies (ATS) or Dynamic Treatment Regimes (DTR) formalize the objective of enhancing the care pathway for patients by proposing an optimal and personalized treatment sequence. They aim to establish a decision rule at each stage of the care process. It conditions the treatment based on responses to previous prescriptions and medical history (Chakraborty and Murphy, 2014; Laber et al., 2014a). The goal is to optimize the patient’s long-term positive response to the sequence of treatment decisions while tailoring the treatment to their own medical information (Kosorok and Moodie, 2015).

In the past decades, machine learning has emerged as a solution to large-scale and high-complexity problems. When it comes to decision support, particularly in sequential scenarios, Reinforcement Learning (RL) (Sutton and Barto, 2018) offers the most effective solution. These methods excel in adapting to changing conditions and optimizing decisions over a series of steps, making them especially valuable in dynamic decision-making processes. The concept revolves around identifying a decision rule, referred to as policy, which is designed to optimize a long-term objective. This policy is crafted in order to make decisions over time that lead to the greatest cumulative benefit or outcome.

RL methods is thus an appealing candidate for precision medicine and has been intensely studied as a potential tool to guide medical decisions towards personalized medicine. First, the application of these methods to DTR is facilitated by modeling the underlying decision problem using a so-called Decision Process (DP), as detailed in Section 2. It is straightforward to express and establish connections between medical elements and its mathematical components. Second, the primary aim of RL is to identify this decision rule. In this context, there is a desire to establish this rule while maximizing long-term cumulative gains. In medicine, the effects of treatments and side effects are not immediate but can take several stages to manifest. The way the policy is constructed is a significant asset for precision medicine. Third, RL models have the capacity to simultaneously consider the extensive patient covariates data and address multi-stage decision problems. The scope of RL applications in precision medicine is in recent thematic reviews of major interest : a non-technical survey offering illustrations of RL applications in public health is proposed in Weltz et al. (2022). More specifically, RL applications in the context of mobile health are presented in Deliu et al. (2022). Two more technical reviews describe the methods for determining medical decision rules using off-policy RL approach (Uehara et al., 2022), or more specifically with the use of Q-learning (Clifton and Laber, 2020) and their empirical comparison with other estimation methods (Li et al., 2023).

While RL offers promising algorithms for sequential decision-making in healthcare, as detailed in the Supplementary Material, relying on a machine learning algorithm may create apprehension among all stakeholders in the process. This hesitation can originate from both the patient and the physician sides. In order to be operational in a clinical context, several points must be improved such as safer, more interpretative and efficient medical decision making (Eckardt et al., 2021). One approach to enhance the application of RL in healthcare is the integration of expertise or human knowledge into the models. The concept is to create a partnership between both machine learning capabilities and domain experts (Holzinger, 2016; Maadi et al., 2021). This "collaboration" would not only improve confidence in RL models and the recommendations they provide (Love et al., 2023) but also facilitate the utilization of this technology by healthcare professionals and patients within a clinical setting (Holzinger et al., 2019). This fusion of machine learning and human expertise yields to improved results compared to RL in isolation or expert decisions alone (Arzate Cruz and Igarashi, 2020; Li et al., 2019a). From a technical point, involving experts or medical knowledge also reduces the learning time, allowing for quicker adaptation and refinement of the methods, ultimately leading to more effective and patient-centered healthcare solutions.

The objective of this paper is to provide a comprehensive overview of RL applied to the optimization of treatment sequences. By facilitating an entry into this field for those interested in its practical application in precision medicine, we illustrate its mathematical framework and provide contextualization. This overview aims to help navigate the array of available algorithms. Additionally, we explore the integration of medical knowledge into RL models, highlighting considerations that could facilitate their clinical integration and application. We introduce these issues, offering initial questions and showcasing opportunities for further research in this area.

To achieve this objective, we structure the paper as follows. In Section 2, we delve into the mathematical foundations of RL approaches, specifically exploring decision processes and introducing key concepts and specific terms of the domain: policy, rewards, and value function. In Section 3, we contextualize this study within the realm of DTR, offering a more detailed explanation of how RL and the concept of precision medicine are intricately connected. We also explain the properties and classification of RL algorithms within our medical context. In Section 4, we provide an overview of methods to enhance reinforcement learning in the medical context by integrating expert knowledge. Various methods are presented and discussed. The paper ends by a concluding section 5.

2 Theoretical foundations of reinforcement learning

This section aims to outline the mathematical framework of RL applied in the DTR field. Typically, RL is explained in the context of a Markov Decision Process (MDP) and its evolution into a Partially Observable Markov Decision Process (POMPD). However, in this context, a return is made to a decision-making framework without the inclusion of Markov assumptions, which is referred to as a decision process. Subsequently, fundamental concepts are introduced : policy, value function, and the notion of optimality.

2.1 Decision process

2.1.1 General statement

The modeling context revolves around the realm of decision-making. A foundation proposed is DP, which acts as the initial framework for DTR. It represents a dynamic system which evolves through time t𝕋𝑡𝕋t\in\mathbb{T}italic_t ∈ roman_𝕋. This system navigates within the space of states 𝕊𝕊\mathbb{S}roman_𝕊 by executing actions within the realm of possibilities defined by the space of actions 𝔸𝔸\mathbb{A}roman_𝔸. The collection of non-empty measurable subsets of 𝔸𝔸\mathbb{A}roman_𝔸, denoted as {𝔸(s)|s𝕊}conditional-set𝔸𝑠𝑠𝕊\{\mathbb{A}(s)|s\in\mathbb{S}\}{ roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 }, represents the feasible actions that can be undertaken when the system finds itself in a specific state s𝕊𝑠𝕊s\in\mathbb{S}italic_s ∈ roman_𝕊.

Definition 2.1 (Decision Process).

A decision process (S,A,{𝔸(s)|s𝕊},ν)𝑆𝐴conditional-set𝔸𝑠𝑠𝕊𝜈(S,A,\{\mathbb{A}(s)|s\in\mathbb{S}\},\nu)( italic_S , italic_A , { roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } , italic_ν ) on 𝕋𝕋\mathbb{T}roman_𝕋 includes:

  • a family S𝑆Sitalic_S of 𝕊𝕊\mathbb{S}roman_𝕊-valued random variables {St,t𝕋}subscript𝑆𝑡𝑡𝕋\{S_{t},t\in\mathbb{T}\}{ italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ∈ roman_𝕋 }, 𝕊𝕊\mathbb{S}roman_𝕊 is called space of states.

  • a family A𝐴Aitalic_A of 𝔸𝔸\mathbb{A}roman_𝔸-valued random variables {At,t𝕋}subscript𝐴𝑡𝑡𝕋\{A_{t},t\in\mathbb{T}\}{ italic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ∈ roman_𝕋 }, 𝔸𝔸\mathbb{A}roman_𝔸 is called space of actions.

  • a family {𝔸(s)|s𝕊}conditional-set𝔸𝑠𝑠𝕊\{\mathbb{A}(s)|s\in\mathbb{S}\}{ roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } of non empty measurable subsets of 𝔸𝔸\mathbb{A}roman_𝔸, the set of realizable actions when the system is in the state s𝕊𝑠𝕊s\in\mathbb{S}italic_s ∈ roman_𝕊. The requirement is for 𝕂={(s,a)|s𝕊,a𝔸(s)}𝕂conditional-set𝑠𝑎formulae-sequence𝑠𝕊𝑎𝔸𝑠\mathbb{K}=\{(s,a)|s\in\mathbb{S},a\in\mathbb{A}(s)\}roman_𝕂 = { ( italic_s , italic_a ) | italic_s ∈ roman_𝕊 , italic_a ∈ roman_𝔸 ( italic_s ) } to be a measurable subset of 𝕊×𝔸𝕊𝔸\mathbb{S}\times\mathbb{A}roman_𝕊 × roman_𝔸.

  • a distribution ν𝜈\nuitalic_ν on 𝕊𝕊\mathbb{S}roman_𝕊.

Remark 2.1.

DP is initially characterized for Borel spaces 𝕊𝕊\mathbb{S}roman_𝕊 and 𝔸𝔸\mathbb{A}roman_𝔸. However, in most practical applications, these spaces typically have finite dimensions, context considered for the rest of the article.

Remark 2.2.

Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the state at time t𝑡titalic_t, this variable may be a vector including the state and several covariates observed at the time t𝑡titalic_t. In what follows no covariate are considered.

Remark 2.3.

In full generalities, 𝕋𝕋\mathbb{T}roman_𝕋 will be taken as continuous or discrete but for a sake of readability 𝕋𝕋\mathbb{T}roman_𝕋 will be a discrete space denoted by 𝕋={0=t0,t1,,tn,,τ}𝕋0subscript𝑡0subscript𝑡1subscript𝑡𝑛𝜏\mathbb{T}=\{0=t_{0},t_{1},\dots,t_{n},\dots,\tau\}roman_𝕋 = { 0 = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , … , italic_τ }, with τ𝜏\tauitalic_τ representing either a finite (τ=tN<𝜏subscript𝑡𝑁\tau=t_{N}<\inftyitalic_τ = italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT < ∞) or infinite (τ=𝜏\tau=\inftyitalic_τ = ∞) value. For the sake of simplicity, the variables Xtnsubscript𝑋subscript𝑡𝑛X_{t_{n}}italic_X start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT will be indicated as Xnsubscript𝑋𝑛X_{n}italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and Xτsubscript𝑋𝜏X_{\tau}italic_X start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT as Xsubscript𝑋X_{\infty}italic_X start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT in infinite horizon setting.

Definition 2.2.

For any n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, an admissible history at time n𝑛nitalic_n is a vector which contains the states traveled by the system together with the actions taken up to time n𝑛nitalic_n. The set of admissible histories at time n𝑛nitalic_n is denoted:

0=𝕊n=𝕂n1×𝕊formulae-sequencesubscript0𝕊subscript𝑛superscript𝕂𝑛1𝕊\mathbb{H}_{0}=\mathbb{S}\qquad\mathbb{H}_{n}=\mathbb{K}^{n-1}\times\mathbb{S}roman_ℍ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = roman_𝕊 roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = roman_𝕂 start_POSTSUPERSCRIPT italic_n - 1 end_POSTSUPERSCRIPT × roman_𝕊

An element hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT writes (s0,a0,,sn1,an1,sn)subscript𝑠0subscript𝑎0subscript𝑠𝑛1subscript𝑎𝑛1subscript𝑠𝑛(s_{0},a_{0},\dots,s_{n-1},a_{n-1},s_{n})( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) where for all 0jn1,(sj,aj)𝕂formulae-sequence0𝑗𝑛1subscript𝑠𝑗subscript𝑎𝑗𝕂0\leq j\leq n-1,~{}(s_{j},a_{j})\in\mathbb{K}0 ≤ italic_j ≤ italic_n - 1 , ( italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ∈ roman_𝕂.

The point of main importance to deal with the decision process is to exhibit the probability to reach state sn+1subscript𝑠𝑛1s_{n+1}italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT at time n+1𝑛1n+1italic_n + 1 given the history up to time n𝑛nitalic_n and the decision taken at time n𝑛nitalic_n this expresses as:

ν[Sn+1=sn+1|Hn=hn,An=an].\mathbb{P}_{\nu}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{n},A_{n}=a_{n}\right].roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] . (1)

In practice the computation of these probabilities requires significant computational resources because of the increasing length of the vector hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT as n𝑛nitalic_n increases. Rapidly working directly with such variable is intractable (usually when n4𝑛4n\geq 4italic_n ≥ 4).

2.1.2 Markov decision process

To overpass this difficulty the Markov assumption is of particular interest. It consists in simplifying the dependence on the past by considering that all the necessary information for is contained in the current state.

Definition 2.3 (Markov Decision Process).

A Markov decision process on 𝕋𝕋\mathbb{T}roman_𝕋 is a
decision process (𝕊,𝔸,{𝔸(s)|s𝕊},ν)𝕊𝔸conditional-set𝔸𝑠𝑠𝕊𝜈(\mathbb{S},\mathbb{A},\{\mathbb{A}(s)|s\in\mathbb{S}\},\nu)( roman_𝕊 , roman_𝔸 , { roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } , italic_ν ) satisfying:

ν[Sn+1=sn+1|Hn=hn,An=an]=ν[Sn+1=sn+1|Sn=sn,An=an].\mathbb{P}_{\nu}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{n},A_{n}=a_{n}\right]=% \mathbb{P}_{\nu}\left[S_{n+1}=s_{n+1}~{}|~{}S_{n}=s_{n},A_{n}=a_{n}\right].roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] = roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] . (2)

A MDP is thus governed by a family of probability transitions

Pan(sn,sn+1)=[Sn+1=sn+1|Sn=sn,An=an].P_{a_{n}}(s_{n},s_{n+1})=\mathbb{P}\left[S_{n+1}=s_{n+1}~{}|~{}S_{n}=s_{n},A_{% n}=a_{n}\right].italic_P start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) = roman_ℙ [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] .

which is the probability that action ansubscript𝑎𝑛a_{n}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT in state snsubscript𝑠𝑛s_{n}italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT at time tn𝕋subscript𝑡𝑛𝕋t_{n}\in\mathbb{T}italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝕋 leads to state sn+1subscript𝑠𝑛1s_{n+1}italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT at time tn+1subscript𝑡𝑛1t_{n+1}italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT.

The most traditional RL framework is MDP (Bellman, 1957; Garcia and Rachelson, 2013). The majority of optimizing application complete their decision models with the memory-less Markov assumption.

Remark 2.4.

Behind MDP modeling, there is a strong assumption that all the information necessary for the decisions observed. In reality, states space can be noisy or incomplete. To overpass this assumption, Partially Observable Markov Decision Process (POMDP) model introduced in Monahan (1982) provides a relaxation to this assumption. POMDP can be seen as a generalization of MDP and is broadly based on the same framework. The major difference comes from the expression of the state space. POMDP consider a distinction between observed data and unobserved data, whereas DP and MDP are based exclusively on the data which have been directly observed. Mathematically, POMDP defines as an MDP except S𝑆Sitalic_S which is a family of 𝕊obs×𝕊unobssuperscript𝕊𝑜𝑏𝑠superscript𝕊𝑢𝑛𝑜𝑏𝑠\mathbb{S}^{obs}\times\mathbb{S}^{unobs}roman_𝕊 start_POSTSUPERSCRIPT italic_o italic_b italic_s end_POSTSUPERSCRIPT × roman_𝕊 start_POSTSUPERSCRIPT italic_u italic_n italic_o italic_b italic_s end_POSTSUPERSCRIPT-valued random variables {(Snobs,Snunobs),n}subscriptsuperscript𝑆𝑜𝑏𝑠𝑛subscriptsuperscript𝑆𝑢𝑛𝑜𝑏𝑠𝑛𝑛\{(S^{obs}_{n},S^{unobs}_{n}),n\in\mathbb{N}\}{ ( italic_S start_POSTSUPERSCRIPT italic_o italic_b italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_S start_POSTSUPERSCRIPT italic_u italic_n italic_o italic_b italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) , italic_n ∈ roman_ℕ } where Sobssuperscript𝑆𝑜𝑏𝑠S^{obs}italic_S start_POSTSUPERSCRIPT italic_o italic_b italic_s end_POSTSUPERSCRIPT is observed and Sunobssuperscript𝑆𝑢𝑛𝑜𝑏𝑠S^{unobs}italic_S start_POSTSUPERSCRIPT italic_u italic_n italic_o italic_b italic_s end_POSTSUPERSCRIPT is not.

2.2 Policy

The crucial concept in addressing dynamic programming is the notion of a policy, which is formalized as follows :

A policy is a sequence π=(πn)n𝜋subscriptsubscript𝜋𝑛𝑛\pi=(\pi_{n})_{n\in\mathbb{N}}italic_π = ( italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n ∈ roman_ℕ end_POSTSUBSCRIPT of conditional distributions from 𝔸𝔸\mathbb{A}roman_𝔸 given nsubscript𝑛\mathbb{H}_{n}roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT defined, for any 𝒜(𝔸)𝒜𝔸\mathcal{A}\in\mathcal{B}(\mathbb{A})caligraphic_A ∈ caligraphic_B ( roman_𝔸 ) and all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, by:

πn(𝒜,hn)=[An𝒜|Hn=hn],subscript𝜋𝑛𝒜subscript𝑛delimited-[]subscript𝐴𝑛conditional𝒜subscript𝐻𝑛subscript𝑛\pi_{n}(\mathcal{A},h_{n})=\mathbb{P}\left[A_{n}\in\mathcal{A}~{}|~{}H_{n}=h_{% n}\right],italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( caligraphic_A , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = roman_ℙ [ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_A | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] ,

satisfying for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT :

πn(𝔸(sn),hn)=1,subscript𝜋𝑛𝔸subscript𝑠𝑛subscript𝑛1\pi_{n}(\mathbb{A}(s_{n}),h_{n})=1,italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = 1 ,

and for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and all an𝔸(sn)subscript𝑎𝑛𝔸subscript𝑠𝑛a_{n}\in\mathbb{A}(s_{n})italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )

πn(an,hn)>0.subscript𝜋𝑛subscript𝑎𝑛subscript𝑛0\pi_{n}(a_{n},h_{n})>0.italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) > 0 .

Decision-making is selecting an option based on environmental information. A policy represents a plan that establishes a sequence of actions. This strategy can be tailored to align with a specified objective. As a result, the focus will be on deriving the strategy that optimizes this objective. A policy πnsubscript𝜋𝑛\pi_{n}italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is a strategy that suggests, for every possible states sn𝕊subscript𝑠𝑛𝕊s_{n}\in\mathbb{S}italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝕊, an action an𝔸(sn)subscript𝑎𝑛𝔸subscript𝑠𝑛a_{n}\in\mathbb{A}(s_{n})italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) taking to account the history hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT of the system.

Theorem 2.1 (Hernández-Lerma and Lasserre (2012); Nivot (2016)).

Given a policy π𝜋\piitalic_π and the initial distribution ν𝜈\nuitalic_ν, there is a unique probability νπsuperscriptsubscript𝜈𝜋\mathbb{P}_{\nu}^{\pi}roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT such that, for all (𝕊)𝕊\mathcal{B}\in\mathcal{B}(\mathbb{S})caligraphic_B ∈ caligraphic_B ( roman_𝕊 ), the Borel algebra of 𝕊𝕊\mathbb{S}roman_𝕊, and 𝒜(𝔸)𝒜𝔸\mathcal{A}\in\mathcal{B}(\mathbb{A})caligraphic_A ∈ caligraphic_B ( roman_𝔸 ), the Borel algebra of 𝔸𝔸\mathbb{A}roman_𝔸 :

νπ[S0]superscriptsubscript𝜈𝜋delimited-[]subscript𝑆0\displaystyle\mathbb{P}_{\nu}^{\pi}\left[S_{0}\in\mathcal{B}\right]roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_S start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ caligraphic_B ] =ν(),absent𝜈\displaystyle=\nu(\mathcal{B}),= italic_ν ( caligraphic_B ) ,
νπ[An𝒜|Hn=hn]superscriptsubscript𝜈𝜋delimited-[]subscript𝐴𝑛conditional𝒜subscript𝐻𝑛subscript𝑛\displaystyle\mathbb{P}_{\nu}^{\pi}\left[A_{n}\in\mathcal{A}~{}|~{}H_{n}=h_{n}\right]roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_A | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] =πn(𝒜,hn)absentsubscript𝜋𝑛𝒜subscript𝑛\displaystyle=\pi_{n}(\mathcal{A},h_{n})= italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( caligraphic_A , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )

In the following, 𝔼νπsubscriptsuperscript𝔼𝜋𝜈\mathbb{E}^{\pi}_{\nu}roman_𝔼 start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT denotes the expectation associated with the probability νπsubscriptsuperscript𝜋𝜈\mathbb{P}^{\pi}_{\nu}roman_ℙ start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT for an arbitrary policy π𝜋\piitalic_π and an initial distribution ν𝜈\nuitalic_ν.

The following result is of major practical importance and expresses the likelihood to observe a trajectory hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT by means of the DP.

Theorem 2.2.

Given (S,A,{𝔸(s)|s𝕊},ν)𝑆𝐴conditional-set𝔸𝑠𝑠𝕊𝜈(S,A,\{\mathbb{A}(s)|s\in\mathbb{S}\},\nu)( italic_S , italic_A , { roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } , italic_ν ) a decision process on 𝕋𝕋\mathbb{T}roman_𝕋 and π𝜋\piitalic_π a policy, we have for all n𝑛superscriptn\in\mathbb{N}^{*}italic_n ∈ roman_ℕ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT,

νπ[Hn=hn]=j=1n[Sj=sj|Aj1=aj1,Hj1=hj1]π(aj1,hj1)ν(s0)\displaystyle\mathbb{P}_{\nu}^{\pi}\left[H_{n}=h_{n}\right]=\prod_{j=1}^{n}% \mathbb{P}\left[S_{j}=s_{j}~{}|~{}A_{j-1}=a_{j-1},H_{j-1}=h_{j-1}\right]\pi(a_% {j-1},h_{j-1})\nu(s_{0})roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_ℙ [ italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_A start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT ] italic_π ( italic_a start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_j - 1 end_POSTSUBSCRIPT ) italic_ν ( italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )

In the framework of MDP, to follow the same lines as in the proof of Theorem 2.2, an additional assumption on the policy is needed yielding to the concept of Markov policy:

Definition 2.4 (Markovian policy (Nivot, 2016)).

A Markovian policy π=(πn)n𝜋subscriptsubscript𝜋𝑛𝑛\pi=(\pi_{n})_{n\in\mathbb{N}}italic_π = ( italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_n ∈ roman_ℕ end_POSTSUBSCRIPT is a policy satisfying for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all 𝒜(𝔸)𝒜𝔸\mathcal{A}\in\mathcal{B}(\mathbb{A})caligraphic_A ∈ caligraphic_B ( roman_𝔸 ) and all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT:

[An𝒜|Hn=hn]=[An𝒜|Sn=sn]=πn(𝒜,sn).delimited-[]subscript𝐴𝑛conditional𝒜subscript𝐻𝑛subscript𝑛delimited-[]subscript𝐴𝑛conditional𝒜subscript𝑆𝑛subscript𝑠𝑛subscript𝜋𝑛𝒜subscript𝑠𝑛\mathbb{P}\left[A_{n}\in\mathcal{A}~{}|~{}H_{n}=h_{n}\right]=\mathbb{P}\left[A% _{n}\in\mathcal{A}~{}|~{}S_{n}=s_{n}\right]=\pi_{n}(\mathcal{A},s_{n}).roman_ℙ [ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_A | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] = roman_ℙ [ italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_A | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] = italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( caligraphic_A , italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) .

2.3 Rewards, valuation and optimization of policies

2.3.1 Rewards

As discussed in the Introduction, the aim of DP modeling is to find optimal policies associated to an objective. To do so, a criterion of optimality has to be introduced. This criterion is usually built by means of rewards functions which provides a temporal judgment of the desirability of a state-action pair and are formalized as follows:

Definition 2.5.

Reward is defined as a family of bounded \mathbb{R}roman_ℝ-valued random variables {Rn,n}subscript𝑅𝑛𝑛\{R_{n},n\in\mathbb{N}\}{ italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_n ∈ roman_ℕ }. For a sake of simplicity, let us denote for a given n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, for all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, all an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸 and all sn+1𝕊subscript𝑠𝑛1𝕊s_{n+1}\in\mathbb{S}italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝕊 :

n+1(hn,an,sn+1)=𝔼νπ[Rn+1|Hn=hn,An=an,Sn+1=sn+1].subscript𝑛1subscript𝑛subscript𝑎𝑛subscript𝑠𝑛1superscriptsubscript𝔼𝜈𝜋delimited-[]formulae-sequenceconditionalsubscript𝑅𝑛1subscript𝐻𝑛subscript𝑛formulae-sequencesubscript𝐴𝑛subscript𝑎𝑛subscript𝑆𝑛1subscript𝑠𝑛1\mathcal{R}_{n+1}(h_{n},a_{n},s_{n+1})=\mathbb{E}_{\nu}^{\pi}\left[R_{n+1}~{}|% ~{}H_{n}=h_{n},A_{n}=a_{n},S_{n+1}=s_{n+1}\right].caligraphic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) = roman_𝔼 start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ] .
Remark 2.5.

The concept of rewards functions are usually integrated in the definition of a decision process.

2.3.2 Valuation of policies and value-functions

State-value functions and state-action values functions are respectively known as V-function and Q-functions. These two concepts provide quantitative measures for evaluating policies, making meaningful policies comparisons and defining the optimal policy. These value-functions serve as qualitative evaluations for guiding strategic adaptations.

State-value functions allow to answer to : "How good is to be in state s𝑠sitalic_s after following the policy π𝜋{\pi}italic_π?" while action-value functions allow to answer to : "How good it is to have done the action a𝑎aitalic_a following policy π𝜋{\pi}italic_π knowing that they were in state s𝑠sitalic_s?". The key point is the evaluation is not assessing step-by-step evaluation but by means of the cumulative reward over time. In such a way, value functions focus on a long-term objective.

Definition 2.6.

Given γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ] a discount parameter, the stage n𝑛nitalic_n long term discounted reward function is defined for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, by:

Gn=j=n+1γjn1Rjsubscript𝐺𝑛superscriptsubscript𝑗𝑛1superscript𝛾𝑗𝑛1subscript𝑅𝑗G_{n}=\sum_{j=n+1}^{\infty}\gamma^{j-n-1}\,R_{j}italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_j - italic_n - 1 end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
Definition 2.7 (Value functions (Chakraborty and Murphy, 2014; Schulte et al., 2014)).

Given (S,A,{𝔸(s)|s𝕊},ν)𝑆𝐴conditional-set𝔸𝑠𝑠𝕊𝜈(S,A,\{\mathbb{A}(s)|s\in\mathbb{S}\},\nu)( italic_S , italic_A , { roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } , italic_ν ) a decision process on 𝕋𝕋\mathbb{T}roman_𝕋, {Rn,n}subscript𝑅𝑛𝑛\{R_{n},n\in\mathbb{N}\}{ italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_n ∈ roman_ℕ } a family of rewards, π𝜋\piitalic_π a policy and γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ] a discount parameter.

  • The stage n𝑛nitalic_n state-value function (V-function) for a history hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the total expected future rewards from stage n𝑛nitalic_n given by:

    Vnπ(hn)=𝔼νπ[Gn|Hn=hn].superscriptsubscript𝑉𝑛𝜋subscript𝑛superscriptsubscript𝔼𝜈𝜋delimited-[]conditionalsubscript𝐺𝑛subscript𝐻𝑛subscript𝑛V_{n}^{\pi}(h_{n})=\mathbb{E}_{\nu}^{\pi}\left[G_{n}~{}|~{}H_{n}=h_{n}\right].italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = roman_𝔼 start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] .
  • The stage n𝑛nitalic_n action-value function (Q-function) is the total expected future rewards starting from a history hnsubscript𝑛h_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, taking action ansubscript𝑎𝑛a_{n}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is given by

    Qnπ(hn,an)=𝔼νπ[Gn|Hn=hn,An=an].superscriptsubscript𝑄𝑛𝜋subscript𝑛subscript𝑎𝑛superscriptsubscript𝔼𝜈𝜋delimited-[]formulae-sequenceconditionalsubscript𝐺𝑛subscript𝐻𝑛subscript𝑛subscript𝐴𝑛subscript𝑎𝑛Q_{n}^{\pi}(h_{n},a_{n})=\mathbb{E}_{\nu}^{\pi}\left[G_{n}~{}|~{}H_{n}=h_{n},A% _{n}=a_{n}\right].italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = roman_𝔼 start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] .

The crucial aspect to observe in these definitions is that, instead of a step-by-step evaluation, the approach aims to assess a long-term objective. The goal is to evaluate the cumulative reward over time. Asa consequence of a decision, after each time step tnsubscript𝑡𝑛t_{n}italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, an immediate reward Rnsubscript𝑅𝑛R_{n}italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is received which is the most distinctive feature of RL. The value functions represent the total expected future reward starting at a particular state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and thereafter choosing actions according to the policy π𝜋{\pi}italic_π.

Remark 2.6.

The discount factor γ𝛾\gammaitalic_γ introduced in the definition of the long-term reward at each step n𝑛nitalic_n aims to strike a thoughtful balance between immediate rewards and long-term rewards. It allows for a balancing between striving for the highest cumulative reward and the aim to reach substantial benefits within a reasonable time (Coronato et al., 2020). This is also a mathematical trick to make the sum converge.

Remark 2.7.

In the finite horizon case τ=tN𝜏subscript𝑡𝑁\tau=t_{N}italic_τ = italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, the values functions can be defined in a similar way by considering

Gn=j=n+1Nγjn1Rjsubscript𝐺𝑛superscriptsubscript𝑗𝑛1𝑁superscript𝛾𝑗𝑛1subscript𝑅𝑗G_{n}=\sum_{j=n+1}^{N}\gamma^{j-n-1}\,R_{j}italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_j - italic_n - 1 end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT

Notice that in this framework, the introduction of a discount parameter is not needed and is usually fixed to 1 from the definitions.

Remark 2.8.

To consider valuation in infinite horizon, we have considered processes in infinite horizon and to do so, the Markov assumptions on the decision process and on the policy are necessary. The discount factor is now mandatory to insure the convergence of the long term discounted reward. The values functions can be defined in the same way by considering conditional to S𝑆Sitalic_S expectations:

Vnπ(sn)superscriptsubscript𝑉𝑛𝜋subscript𝑠𝑛\displaystyle V_{n}^{\pi}(s_{n})italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =𝔼νπ[Gn|Sn=sn].absentsuperscriptsubscript𝔼𝜈𝜋delimited-[]conditionalsubscript𝐺𝑛subscript𝑆𝑛subscript𝑠𝑛\displaystyle=\mathbb{E}_{\nu}^{\pi}\left[G_{n}~{}|~{}S_{n}=s_{n}\right].= roman_𝔼 start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] .
Qnπ(sn,an)superscriptsubscript𝑄𝑛𝜋subscript𝑠𝑛subscript𝑎𝑛\displaystyle Q_{n}^{\pi}(s_{n},a_{n})italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =𝔼νπ[Gn|Sn=sn,An=an].absentsuperscriptsubscript𝔼𝜈𝜋delimited-[]formulae-sequenceconditionalsubscript𝐺𝑛subscript𝑆𝑛subscript𝑠𝑛subscript𝐴𝑛subscript𝑎𝑛\displaystyle=\mathbb{E}_{\nu}^{\pi}\left[G_{n}~{}|~{}S_{n}=s_{n},A_{n}=a_{n}% \right].= roman_𝔼 start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] .

The following proposition highlights the link between V-functions and Q-functions.

Proposition 2.1 ((Kosorok and Moodie, 2015; Schulte et al., 2014; Sutton and Barto, 2018)).

For all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸, we have:

Vnπ(hn)superscriptsubscript𝑉𝑛𝜋subscript𝑛\displaystyle V_{n}^{\pi}(h_{n})italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =an𝔸(sn)Qnπ(hn,an)πn(hn,an)absentsubscriptsubscript𝑎𝑛𝔸subscript𝑠𝑛superscriptsubscript𝑄𝑛𝜋subscript𝑛subscript𝑎𝑛subscript𝜋𝑛subscript𝑛subscript𝑎𝑛\displaystyle=\sum_{a_{n}\in\mathbb{A}(s_{n})}Q_{n}^{\pi}(h_{n},a_{n})~{}\pi_{% n}(h_{n},a_{n})= ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) (3)
Qnπ(hn,an)subscriptsuperscript𝑄𝜋𝑛subscript𝑛subscript𝑎𝑛\displaystyle Q^{\pi}_{n}(h_{n},a_{n})italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =sn+1𝕊(n+1(hn,an,sn+1)+γVn+1π((hn,an,sn+1))\displaystyle=\sum_{s_{n+1}\in\mathbb{S}}\left(\mathcal{R}_{n+1}(h_{n},a_{n},s% _{n+1})+\gamma V_{n+1}^{\pi}((h_{n},a_{n},s_{n+1})~{}\right)= ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝕊 end_POSTSUBSCRIPT ( caligraphic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) + italic_γ italic_V start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) )
×νπ[Sn+1=sn+1|Hn=hn,An=an].\displaystyle\times\mathbb{P}_{\nu}^{\pi}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{% n},A_{n}=a_{n}\right].× roman_ℙ start_POSTSUBSCRIPT italic_ν end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] . (4)

The remaining issue consists in the computation of the value functions. To do so, the result of major importance is the recursive form of the value functions which states that the value functions can be decomposed into immediate reward plus discounted value of successor state.

Theorem 2.3 (Recursive form for value functions (Chakraborty and Murphy, 2014; Zhao et al., 2015)).

For all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸, we have:

Vnπ(hn)subscriptsuperscript𝑉𝜋𝑛subscript𝑛\displaystyle V^{\pi}_{n}(h_{n})italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =sn+1𝕊an𝔸(sn)(n+1(hn,an,sn+1)+γVn+1π(hn+1))absentsubscriptsubscript𝑠𝑛1𝕊subscriptsubscript𝑎𝑛𝔸subscript𝑠𝑛subscript𝑛1subscript𝑛subscript𝑎𝑛subscript𝑠𝑛1𝛾superscriptsubscript𝑉𝑛1𝜋subscript𝑛1\displaystyle=\sum_{s_{n+1}\in\mathbb{S}}~{}\sum_{a_{n}\in\mathbb{A}(s_{n})}% \left(\mathcal{R}_{n+1}(h_{n},a_{n},s_{n+1})+\gamma~{}V_{n+1}^{\pi}(h_{n+1})\right)= ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝕊 end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( caligraphic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) + italic_γ italic_V start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) )
×[Sn+1=sn+1|Hn=hn,An=an]π(an,hn)\displaystyle\times\mathbb{P}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{n},A_{n}=a_{% n}\right]\pi(a_{n},h_{n})× roman_ℙ [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] italic_π ( italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) (5)
Qnπ(hn,an)subscriptsuperscript𝑄𝜋𝑛subscript𝑛subscript𝑎𝑛\displaystyle Q^{\pi}_{n}(h_{n},a_{n})italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =sn+1𝕊(n+1(hn,an,sn+1)\displaystyle=\sum_{s_{n+1}\in\mathbb{S}}~{}\left(\mathcal{R}_{n+1}(h_{n},a_{n% },s_{n+1})\right.= ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝕊 end_POSTSUBSCRIPT ( caligraphic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT )
+γan+1𝔸(sn+1)Qn+1π(hn+1,an+1)π(hn+1,an+1))\displaystyle+\left.\gamma~{}\sum_{a_{n+1}\in\mathbb{A}(s_{n+1})}Q_{n+1}^{\pi}% (h_{n+1},a_{n+1})\pi(h_{n+1},a_{n+1})\right)+ italic_γ ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) italic_π ( italic_h start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) )
×[Sn+1=sn+1|Hn=hn,An=an]\displaystyle\times\mathbb{P}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{n},A_{n}=a_{% n}\right]× roman_ℙ [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] (6)

Equations (5) and (6) are known as Bellman’s equation. A policy being fixed, the Bellman equation can be solved, therefore making it possible to determine the values of the value functions and thus the values of Q-function. Indeed, in the case where the number of steps is finite, the Bellman equation actually hides a linear system of N𝑁Nitalic_N equations to N𝑁Nitalic_N unknowns. It can therefore be solved, once translated into a matrix equation, by a technique such as the Gaussian pivot.

2.3.3 Optimization of the policies

The key concern of the RL problem is to determine the optimal policy, denoted as πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, which represents the optimal strategy for maximizing our long-term reward function. In other words, it is about finding the best way to make decisions in an environment to obtain the highest long-term rewards. The search for the optimal policy is based on the Bellman optimality principle developed bellow.

Definition 2.8.

The optimal state-value functions (Vn)superscriptsubscript𝑉𝑛(V_{n}^{*})( italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) are defined for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT as the maximum value functions over all policies

Vn(hn)=maxπVnπ(hn)superscriptsubscript𝑉𝑛subscript𝑛subscript𝜋superscriptsubscript𝑉𝑛𝜋subscript𝑛V_{n}^{*}(h_{n})=\max_{\pi}~{}V_{n}^{\pi}(h_{n})italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = roman_max start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )

The optimal action-value functions (Qn)superscriptsubscript𝑄𝑛(Q_{n}^{*})( italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) are defined for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸, as the maximum action-value functions over all policies

Qn(hn,an)=maxπQnπ(hn,an)superscriptsubscript𝑄𝑛subscript𝑛subscript𝑎𝑛subscript𝜋superscriptsubscript𝑄𝑛𝜋subscript𝑛subscript𝑎𝑛Q_{n}^{*}(h_{n},a_{n})=\max_{\pi}~{}Q_{n}^{\pi}(h_{n},a_{n})italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = roman_max start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT )
Definition 2.9.

Consider the partial ordering over policies defined by:

ππif and only if, for all n, all hnn,Vnπ(hn)Vnπ(hn).formulae-sequencesuperscript𝜋𝜋if and only if, for all n, all hnn,superscriptsubscript𝑉𝑛superscript𝜋subscript𝑛superscriptsubscript𝑉𝑛𝜋subscript𝑛\pi^{\prime}\geq\pi\quad\text{if and only if, for all $n\in\mathbb{N}$, all $h% _{n}\in\mathbb{H}_{n}$,}\quad V_{n}^{\pi^{\prime}}(h_{n})\geq V_{n}^{\pi}(h_{n% }).italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≥ italic_π if and only if, for all italic_n ∈ roman_ℕ , all italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ≥ italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) .

This partial ordering allows to define optimal policy in the following way:

Proposition 2.2.

There exists an optimal policy πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT that is better than or equal to all other policies, ππsuperscript𝜋𝜋\pi^{*}\geq\piitalic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≥ italic_π for all π𝜋\piitalic_π.

Theorem 2.4.

All optimal policies achieve the optimal value functions and the optimal action-value functions, for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸,

Vnπ(hn)=Vn(hn)andQnπ(hn,an)=Qn(hn,an).formulae-sequencesuperscriptsubscript𝑉𝑛superscript𝜋subscript𝑛superscriptsubscript𝑉𝑛subscript𝑛andsuperscriptsubscript𝑄𝑛superscript𝜋subscript𝑛subscript𝑎𝑛superscriptsubscript𝑄𝑛subscript𝑛subscript𝑎𝑛V_{n}^{\pi^{*}}(h_{n})=V_{n}^{*}(h_{n})\qquad\text{and}\qquad Q_{n}^{\pi^{*}}(% h_{n},a_{n})=Q_{n}^{*}(h_{n},a_{n}).italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) and italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) .
Theorem 2.5 (Bellman Optimality Equations for Qnsuperscriptsubscript𝑄𝑛Q_{n}^{*}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT).

For all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ, all hnnsubscript𝑛subscript𝑛h_{n}\in\mathbb{H}_{n}italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_ℍ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and an𝔸subscript𝑎𝑛𝔸a_{n}\in\mathbb{A}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ roman_𝔸, we have

Qn(hn,an)subscriptsuperscript𝑄𝑛subscript𝑛subscript𝑎𝑛\displaystyle Q^{*}_{n}(h_{n},a_{n})italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) =sn+1𝕊(n+1(hn,an,sn+1)+γmaxa𝔸(sn+1)Qn+1(hn+1,a))absentsubscriptsubscript𝑠𝑛1𝕊subscript𝑛1subscript𝑛subscript𝑎𝑛subscript𝑠𝑛1𝛾subscript𝑎𝔸subscript𝑠𝑛1subscriptsuperscript𝑄𝑛1subscript𝑛1𝑎\displaystyle=\sum_{s_{n+1}\in\mathbb{S}}~{}\left(\mathcal{R}_{n+1}(h_{n},a_{n% },s_{n+1})+\gamma~{}\max_{a\in\mathbb{A}(s_{n+1})}Q^{*}_{n+1}(h_{n+1},a)\right)= ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ∈ roman_𝕊 end_POSTSUBSCRIPT ( caligraphic_R start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) + italic_γ roman_max start_POSTSUBSCRIPT italic_a ∈ roman_𝔸 ( italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_Q start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT , italic_a ) )
×[Sn+1=sn+1|Hn=hn,An=an]\displaystyle\times\mathbb{P}\left[S_{n+1}=s_{n+1}~{}|~{}H_{n}=h_{n},A_{n}=a_{% n}\right]× roman_ℙ [ italic_S start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT | italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] (7)

As a consequence of the Bellman Optimality Equation, we can claim that an optimal policy can be found by maximizing over Qnπ(s,a)superscriptsubscript𝑄𝑛superscript𝜋𝑠𝑎Q_{n}^{\pi^{*}}(s,a)italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_s , italic_a ) for all n𝑛n\in\mathbb{N}italic_n ∈ roman_ℕ and by considering the optimal policy defined as

πn(s,a)={1if aargmaxa𝔸(s)Qn(s,a)0otherwisesuperscriptsubscript𝜋𝑛𝑠𝑎cases1if 𝑎subscriptargmax𝑎𝔸𝑠superscriptsubscript𝑄𝑛𝑠𝑎0𝑜𝑡𝑒𝑟𝑤𝑖𝑠𝑒\pi_{n}^{*}(s,a)=\begin{cases}1&\text{if }a\in\operatorname*{arg\,max}_{a\in% \mathbb{A}(s)}Q_{n}^{*}(s,a)\\ 0&otherwise\end{cases}italic_π start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s , italic_a ) = { start_ROW start_CELL 1 end_CELL start_CELL if italic_a ∈ start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a ∈ roman_𝔸 ( italic_s ) end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s , italic_a ) end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_o italic_t italic_h italic_e italic_r italic_w italic_i italic_s italic_e end_CELL end_ROW (8)

Note that this policy is deterministic.

2.4 Reinforcement Learning

The mathematical foundations established in the previous sections serve as the basis for building algorithms to determine decision rules. In the field of RL, numerous algorithms aim to learn optimal policies. We have chosen to present two of these algorithms to illustrate a first distinction between online and offline application contexts. Furthermore, the second algorithm presented has been widely adopted to meet our application context. A discussion on the different RL algorithms suitable for our context will be the subject of Section 3.5.

2.4.1 Q-learning Forward

Q-learning, proposed in 1989 by Chris Watkins (Sutton and Barto, 2018; Watkins and Dayan, 1992), is one of the most famous and widely used algorithms in RL. It was historically developed in the so-called online context where the algorithm can dynamically interact with its application context. This is associated with the notion of "agent" which is an entity capable of interacting with the environment while receiving rewards. The concept of interaction is related to the exploitation-exploration dilemma. The agent must, through trial and error, choose between exploiting acquired knowledge to maximize immediate rewards or exploring new actions to discover better long-term strategies (Sutton and Barto, 2018). An excellent illustration of this problem is the ϵitalic-ϵ\epsilonitalic_ϵ-greedy strategy presented in the following definition:

Definition 2.10 (ϵitalic-ϵ\epsilonitalic_ϵ-greedy Policy).
πϵ(s)={random action from 𝔸(s)with probability ϵargmaxa𝔸(s)Q(s,a)with probability 1ϵsubscript𝜋italic-ϵ𝑠casesrandom action from 𝔸(s)with probability ϵsubscriptargmax𝑎𝔸𝑠𝑄𝑠𝑎with probability 1ϵ\pi_{\epsilon}(s)=\begin{cases}\text{random action from $\mathbb{A}(s)$}&\text% {with probability $\epsilon$}\\ \operatorname*{arg\,max}_{a\in\mathbb{A}(s)}Q(s,a)&\text{with probability $1-% \epsilon$}\end{cases}italic_π start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ( italic_s ) = { start_ROW start_CELL random action from roman_𝔸 ( italic_s ) end_CELL start_CELL with probability italic_ϵ end_CELL end_ROW start_ROW start_CELL start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a ∈ roman_𝔸 ( italic_s ) end_POSTSUBSCRIPT italic_Q ( italic_s , italic_a ) end_CELL start_CELL with probability 1 - italic_ϵ end_CELL end_ROW

where ϵ[0,1]italic-ϵ01\epsilon\in[0,1]italic_ϵ ∈ [ 0 , 1 ] is an hyperparemeter called the exploration rate.

Q-learning relies on the recursive Bellman equations (2.3). The idea is to estimate value functions based on the differences between current and previous estimates, and then to derive an optimal strategy from Equation (8) of Bellman optimality.

Algorithm 1 Q-learning
Initialisation : Q(s,a)𝑄𝑠𝑎Q(s,a)italic_Q ( italic_s , italic_a ) arbitrarily, set learning rate α𝛼\alphaitalic_α, discount factor γ𝛾\gammaitalic_γ, and exploration rate ϵitalic-ϵ\epsilonitalic_ϵ
for each history to build do
     Initialize state s𝑠sitalic_s
     while s𝑠sitalic_s has not reached the terminal stage do
         Choose action a𝑎aitalic_a using policy derived from Q𝑄Qitalic_Q (e.g., ϵitalic-ϵ\epsilonitalic_ϵ-greedy)
         Take action a𝑎aitalic_a, observe reward r𝑟ritalic_r and new state ssuperscript𝑠s^{\prime}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
         Update Q(s,a)𝑄𝑠𝑎Q(s,a)italic_Q ( italic_s , italic_a ) using the Q-learning update rule:
               Q(s,a)Q(s,a)+α(r+γmaxaQ(s,a)Q(s,a))𝑄𝑠𝑎𝑄𝑠𝑎𝛼𝑟𝛾subscriptsuperscript𝑎𝑄superscript𝑠superscript𝑎𝑄𝑠𝑎Q(s,a)\leftarrow Q(s,a)+\alpha\left(r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{% \prime})-Q(s,a)\right)italic_Q ( italic_s , italic_a ) ← italic_Q ( italic_s , italic_a ) + italic_α ( italic_r + italic_γ roman_max start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_Q ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_Q ( italic_s , italic_a ) )
         ss𝑠superscript𝑠s\leftarrow s^{\prime}italic_s ← italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
     end while
end for
Output: The optimal decison rule is determined such as π(s,a)=argmaxaQ(s,a)superscript𝜋𝑠𝑎subscriptargmax𝑎𝑄𝑠𝑎\pi^{*}(s,a)=\operatorname*{arg\,max}_{a}Q(s,a)italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s , italic_a ) = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_Q ( italic_s , italic_a )

2.4.2 Q-learning Backward

When exploration of the environment is challenging, learning can be conducted using existing data, allowing decision rules to be derived from a non-interactive environment. This is referred to as offline or batch-RL. In this context, the algorithm does not interact with its environment; learning relies on estimating value functions from pre-existing databases. This offline Q-learning (Ernst et al., 2005; Ormoneit and Sen, 2002) follows a backward approach illustrated in Figure 1.

Refer to caption
Figure 1: Illustration of the Backward Q-learning algorithm for estimating Q-values on a history with 4 steps.

The estimates of the Q-function are initialized at the terminal time and move backward in time step by step. This strategy allows for the consideration of a possible delay effect commonly observed in longitudinal data. To estimate the Q-functions, various regression algorithms can be used, such as linear regression, support vector machines, decision trees or by deep neural networks, among others.

Algorithm 2 Backward Q-learning
Input: A set of training offline data consists of patients admissible histories htsubscript𝑡h_{t}italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and their associated indexed reward rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, t=0,,τ𝑡0𝜏t=0,...,\tauitalic_t = 0 , … , italic_τ and a regression algorithm
Initialisation : Let t=τ+1𝑡𝜏1t=\tau+1italic_t = italic_τ + 1 and Q^tsubscript^𝑄𝑡\hat{Q}_{t}over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT be a function equal to zero everywhere on 𝕊×𝔸𝕊𝔸\mathbb{S}\times\mathbb{A}roman_𝕊 × roman_𝔸
while until t=0𝑡0t=0italic_t = 0 do
     tt1𝑡𝑡1t\leftarrow t-1italic_t ← italic_t - 1 (Backward)
     Qtsubscript𝑄𝑡Q_{t}italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is fitted with a regression algorithm though the following recursive equation : Qt(st,at)=rt+maxat+1Q^t+1(st+1,at+1)subscript𝑄𝑡subscript𝑠𝑡subscript𝑎𝑡subscript𝑟𝑡subscriptsubscript𝑎𝑡1subscript^𝑄𝑡1subscript𝑠𝑡1subscript𝑎𝑡1Q_{t}(s_{t},a_{t})=r_{t}+\max_{a_{t+1}}\hat{Q}_{t+1}(s_{t+1},a_{t+1})italic_Q start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + roman_max start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT )
end while
Output: Given the sequential estimates of {Q^0,,Q^τ}subscript^𝑄0subscript^𝑄𝜏\{\hat{Q}_{0},...,\hat{Q}_{\tau}\}{ over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT }, the sequential optimal policies {π^0,,π^τ}subscript^𝜋0subscript^𝜋𝜏\{\hat{\pi}_{0},...,\hat{\pi}_{\tau}\}{ over^ start_ARG italic_π end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , over^ start_ARG italic_π end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT } can be determined
Remark 2.9.

In an offline context, direct exploration is not present because decisions are made based on data collected in the database. Although there is no longer an exploration-exploitation dilemma as in the online context, it will be necessary to take into account a bias resulting from data where exploration-exploitation has already been performed.

3 Dynamic Treatment Regimes and Reinforcement Learning

3.1 Dynamic Treatment Regimes

Until the end of the 20th century, progress in medicine followed a "one-size-fits-all" approach. The search for the effect of a treatment or intervention was framed within evidence-based medicine on a target population. With the advent of massive data, particularly genomics, the paradigm has evolved. The volume of individual data collected has exploded, suggesting the possibility of integrating individual factors in the search for the effect of an intervention. The desired effect of treatment is no longer an average effect but a conditional effect on patient characteristics.

In this context, where the effect of an intervention is conditional to the variable characteristics of the patient which vary over time, the relevance of a treatment for a given individual may also vary over time. A central objective of precision medicine is to develop adaptive, and potentially optimal, intervention rules, where the definition of optimality must be clearly defined (Kahkoska et al., 2022a).

The search for adaptive (optimal) intervention rules is not a new question. A vast literature, primarily in the field of causal inference, exists and has real practical relevance. The foundational works in this context are attributed to Robins (1998), and the three extensions that allow for the effects of time-varying regimes in the presence of confounding variables: G-computation (Robins, 1986), the method of structural nested mean (Robins, 1994) models and G-estimation (Robins, 1992, 1989, 1998), as well as marginal structural models (Robins, 2000) and methods associated with inverse probability of treatment weighting (Chesnaye et al., 2022). Subsequently, a number of methods have been proposed, both in frequentist and Bayesian frameworks. All estimate the optimal DTR based on distributional assumptions of the data generation process via parametric models. We can consider them as direct resolution methods. These methods will not be further developed in this article; an up-to-date review including direct methods can be found in Deliu and Chakraborty (2022).

In the following section, we will detail the parallel that can be drawn between DTR and RL, which helps overcome a major barrier of direct methods, namely the risk of misspecification of underlying assumptions (Zhao et al., 2015). To address this limitation, in Murphy (2003), followed immediately by Robins (2004), semi-parametric methods were considered, marking the first examples of RL-based approaches in the literature on DTR. The innovations of RL have breathed new life into the search for optimal DTRs, gradually expanding its applicability domain.

3.2 Decision Process and Dynamic Treatment Regimes

In Section 2, we notably introduced decision processes, policy and rewards which forms the theoretical foundation for algorithms searching for optimal policies, namely reinforcement learning. To describe the contribution of RL algorithms in the medical context, we will begin by examining how the framework introduced and DTRs are linked.

As discussed in Section 3.1, an adaptive intervention involves making a treatment decision based on the patient’s characteristics and treatment history. An adaptive decision rule can thus be perceived as a policy in the theoretical sense presented in Section 2.2. To leverage the results of reinforcement learning, it is essential to define the applied framework of the underlying DP for DTRs.

Building upon the definition 2.1 of a decision process, it is natural to consider, in a medical context:

  • The state space 𝕊𝕊\mathbb{S}roman_𝕊 contains the selected covariates describing the patient’s state.

  • The action space 𝔸𝔸\mathbb{A}roman_𝔸 contains the selected treatments and their associated dosages.

  • The subset {𝔸(s)|s𝕊}conditional-set𝔸𝑠𝑠𝕊\{\mathbb{A}(s)|s\in\mathbb{S}\}{ roman_𝔸 ( italic_s ) | italic_s ∈ roman_𝕊 } states that the treatments feasible or accessible for a patient depend on a given state.

Remark 3.1.

It is worth noting that in our context, the variable Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT can be (and this is the usual situation) a vector containing both the patient’s health state and a set of covariates observed at time t𝑡titalic_t, which may influence the transition probabilities from one state to another.

The observed histories htsubscript𝑡h_{t}italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are then the care pathways of different patients. They contain health data and treatments administered up to decision t𝑡titalic_t.

One of the key elements of RL is the reward. In the medical context, rewards are defined to address the clinical objective. This is a very important point as optimization relies on it. The notion of reward will be central in the discussion on the integration of medical expertise in Section 4. Indeed, for a given situation, different rewards can be associated depending on the expertise of the physicians, the specific objectives of the clinical trial, either proximally (directly after the decision) or distally (at the end of the follow-up).

3.3 Specificities of the Medical Context

DTRs find their primary application in medical contexts where multiple treatment lines are possible or in contexts with multiple possible decision points (see Figure 2). These adaptive strategies are particularly relevant in areas such as intensive care, chronic diseases, psychiatry, or oncology.

Refer to caption
Figure 2: Illustration of medical history: treatment line for a patient.

The medical context is known for the great heterogeneity of its data (Kahkoska et al., 2022b; Sperger et al., 2020), whether in terms of care pathways, treatment response, side effects, social factors, or lifestyle. In this regard, data-driven methods offer interesting perspectives by overcoming the issue of model misspecification. Precision medicine would thus offer a path to more equitable access to treatments. Moreover, the decision-making process can take into account variables such as resource availability, finances, and other socio-economic or discriminatory factors, leading to fairer decision rules.

The timing of decision-making moments is a central issue in the problem of adaptive interventions. Typically, these decision points are linked to patient visits to the practitioner. It is therefore natural to consider these moments as discrete and finite and to model them using a finite-horizon DP introduced in Section 2.1. Two issues arise: the time interval between two decision points and their frequency.

The issue of non-homogeneous time intervals between patients in the context of DTRs is typically addressed by considering the time between two visits as a covariate (Laber et al., 2014a). In some scenarios, such as patient follow-up in oncology or diabetes care, the number of visits is indefinite and varies based on individual patient needs. These patients are regularly monitored through mobile Health (m-Health) initiatives, which operate in an online environment. Therefore, employing the Q-learning approach with backward induction, as explained in Section 2.4.2, becomes impractical. In Luckett et al. (2019), researchers identified optimal DTRs within an indefinite horizon framework using V-learning. This method aims to estimate the optimal policy from a predefined class of policies. Another approach, discussed in Ertefaie and Strawderman (2018), utilizes an inferential procedure for estimating Q-functions.

Remark 3.2.

In the rapidly expanding field of m-Health research, online approaches are particularly suitable. Just-In-Time Adaptive Interventions (JITAIs) have already been the subject of research efforts (Istepanian et al., 2007; Nahum-Shani et al., 2018; Rehg et al., 2017). A synthesis of JITAI research is provided in Deliu et al. (2022), along with a comparative study with DTRs. This study addresses the technical aspect of making decisions about adaptive treatments in an interactive online environment. We will not cover these aspects further in the work, as the framework of DTRs on real data is discussed in Section 2.4.2, which is only feasible in the context of offline algorithms.

3.4 Real Data Application

The Supplementary Material provides an overview of the RL research conducted in the context of DTRs. It is important to note that decision points are typically few in real data application context; many studies consider two or three decision points. This choice is primarily driven by computational challenges: the more decision points there are, the more complex it becomes to integrate the patient’s history into the models. An alternative approach is to impose a Markov assumption on the DP. However, in healthcare applications, this assumption is often unrealistic. The entire patient history can rarely be ignored or encapsulated in the current state.

As with any analysis on healthcare data, it is natural to question the biases inherent in the methods and the issue of causality (Hernan and Robins, 2023; Neuberg, 2003). Since machine learning techniques are not causal inference methods, their use requires unbiased data. The issue typically arises in terms of "potential outcomes", and it is common to consider causal inference assumptions such as the "stable unit treatment value" assumption and the "no unmeasured confounders" assumption, as explained in (Chakraborty and Murphy, 2014, Chap. 2). The question of causality in the field of reinforcement learning is also addressed more directly in the framework of "causal RL"111for details of ”causal RL” initiative, see https://crl.causalai.net/ (Chakraborty and Murphy, 2014; Zhang, 2020). The search for adaptive intervention rules relies on data with a specific longitudinal structure. Innovations in algorithms for finding optimal DTRs often begin with adjustments to existing observational databases.

The Medical Information Mart for Intensive Care (MIMIC) (Johnson et al., 2016) is a publicly accessible observational database containing information on 53,423 distinct admissions for patients in intensive care units between 2001 and 2012. It includes data on vital signs, medications, laboratory tests, measurements, caregiver notes, procedure and diagnostic codes, imaging reports, length of hospital stay, survival data, etc. Due to the wealth of available information and its longitudinal nature, MIMIC has been widely used by the RL community as a support for methods comparison (see (Roggeveen et al., 2021), Table 1 and Supplementary Material). It is also utilized as a training dataset for the development of data augmentation methods (Tseng et al., 2017) and the generation of interactive environment models (Peng et al., 2018; Raghu et al., 2017a).

Reference Model State Space Action Space Rewards
Komorowski et al. (2016) SARSA Discretised state space 25 unique actions based on a 5 by 5 binning procedure of maximum vasopressor dose and sum of intravenous fluids per 4h time interval Terminal reward at the end of each trajectory based on 90-day mortality
Raghu et al. (2017b) Dueling DDQN Ordinary and Sparse Auto-Encoders were used for latent state space representation As paper Komorowski et al. (2016) Terminal reward at the end of each trajectory based on in-hospital mortality
Raghu et al. (2017a) Dueling DDQN Continuous state space based on 4h aggregated features based on physiological parameters As paper Komorowski et al. (2016) Intermediate reward based on changes in critical care scores and lactate combined with a terminal reward for survival based on ICU mortality
Peng et al. (2018) Dueling DDQN Patient states are encoded recurrently using an LSTM autoencoder representing the cumulative history for each patient As paper Komorowski et al. (2016) The change in the negative mortality logodds of mortality between the current observations and the next observations.
Li et al. (2019b) Actor-Critic POMDP As paper Komorowski et al. (2016) As paper Komorowski et al. (2016)
Yu et al. (2019a) Dueling DDQN As paper [3] As paper Komorowski et al. (2016) Developed several reward functions based on 7 potential features most important during the treatment process
Table 1: Applications of RL algorithms on MIMIC database: highlighting various medical objectives with rewards design extract from Roggeveen et al. (2021).

Similarly to how randomized trials play a distinct role in clinical research and may be considered the gold standard for causal relationship investigation, the Sequential Multiple Assignment Randomized Trial (SMART) design (Cheung et al., 2015; Kosorok and Moodie, 2015) can be regarded as the gold standard for clinical trial design in the context of adaptive interventions. SMART designs involve an initial randomization of patients to various treatment options, followed by re-randomizations at each subsequent stage of some or all of the patients to another available treatment at that stage. With such a design, the stable unit treatment value assumption is "by design" fulfilled. However, SMART designs are challenging to implement, costly, and as a result, there is limited access to data from SMARTs. However, notable trials include :

  • CATIE (Clinical Antipsychotic Trials of Intervention Effectiveness) is a SMART study involving 1,460
    schizophrenia patients over 18 months aimed at evaluating the clinical effectiveness of specific sequences of antipsychotic medications (Shortreed et al., 2011).

  • ADHD (Attention Deficit Hyperactivity Disorder) is a SMART study involving 150 simulated participants, aimed at evaluating an adaptive intervention for children with this disorder. This study integrates behavior modification treatment along with medication treatment (Chakraborty and Murphy, 2014; Laber et al., 2014a).

  • STAR*D (Sequenced Treatment Alternatives to Relieve Depression) is a SMART study involving 4,041 patients with major depressive disorders. This study evaluated the effectiveness of different treatment regimens (Chakraborty and Murphy, 2014; Laber et al., 2014b).

3.5 Properties of Reinforcement Learning Applied to Dynamic Treatment Regimes

There is a wide range of RL algorithms offering various methodological approaches tailored to specific contexts, as illustrated in the Supplementary Material Table. Figure 3 below provides a non-exhaustive overview of the most common RL algorithms. It presents many dichotomies, which will be explained in the following paragraph and contextualized in DTRs applications.

Refer to caption
Figure 3: Classification of the most common RL algorithms.

3.5.1 Model-based vs. Model-free

The first dichotomy in Figure 3 is based on the distinction between a model-based approach and a model-free approach. This distinction is related to the concept of transition probability defined by equation (1). A procedure is considered "model-based" when it relies on knowledge of all transition probabilities from a model, which means having access to all dynamics of the system. A model-free method is able to bypass this model and is based on partial information of the associations between states and actions to determine the optimal strategy. In a model-based approach, all possible paths from an initial state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT are explored, and an optimal policy is one that maximizes the objective.

However, in a medical context, exploring all possibilities from the same starting point is infeasible, mainly for clinical and ethical reasons. The environment is thus inherently partially observed. This reality inherently places us in a model-free framework. It is worth noting the existence of an application on simulated patient data based on MIMIC (see Section 1) in the model-based framework in Raghu et al. (2018).

3.5.2 Policy-based vs. Value-based vs. Actor Critic

The second distinction involves two different approaches to determine the best strategy: policy-based methods and value-based methods. The former aim to directly find the optimal policy by formalizing the RL problem through a family of policies, introduced in (Sutton and Barto, 2018, Chapter 13). The latter seek the optimal policy through value functions, introduced in Section 2.3.2, and serve as the basis for algorithmic methods such as dynamic programming, Monte Carlo, and temporal-difference, also presented in the same book. These two approaches can be combined, thus forming actor-critic methods (Grondman et al., 2012; Sutton and Barto, 2018).

Policy-based

Policy-based methods are direct approaches to finding the optimal policy that rely on a parametric form of the strategy πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT for θΘ𝜃Θ\theta\in\Thetaitalic_θ ∈ roman_Θ. Optimization can be typically achieved through gradient descent :

θn+1=θn+𝔼πθ[Gn|θ]subscript𝜃𝑛1subscript𝜃𝑛subscript𝔼subscript𝜋𝜃delimited-[]conditionalsubscript𝐺𝑛𝜃\theta_{n+1}=\theta_{n}+\nabla\mathbb{E}_{\pi_{\theta}}[G_{n}|\theta]italic_θ start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT = italic_θ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT + ∇ roman_𝔼 start_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_θ ] (9)

where Gnsubscript𝐺𝑛G_{n}italic_G start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the cumulative long-term reward introduced in Remark 2.7.

This method has been applied to simulated HIV data (Yu et al., 2019b) as well as in the intensive care domain (Raghu et al., 2018). Note that the first application highlighted the challenges of converging to an optimal decision rule due to the simplification of simulation models. The main obstacle to using this method is the difficulty of convergence, which requires a large volume of data.

Value-based

Value-based methods evaluate the optimal policy indirectly based on value functions Vπsuperscript𝑉𝜋V^{\pi}italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT oder Qπsuperscript𝑄𝜋Q^{\pi}italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT introduced in Section 2.3.2. The general idea is to quantitatively evaluate states or action-state pairs using one of the value functions (Q-function or V-function). The optimal policy is then obtained by identifying actions that maximize these values. The success of these methods relies on the ability to model these value functions, as outlined in Section 2.4.2, through algorithms such as Backward Q-learning, making it a highly flexible approach.

The initial work was conducted by Murphy (2005a), who introduced an offline Q-learning, also known as batch learning, in a context of non-Markovian planning with a limited and restricted number of steps (n4)𝑛4(n\leq 4)( italic_n ≤ 4 ). This approach proves ideal for its application to DTRs and can serve as a starting point for many other applications. Research activity in this field quickly became significant, considering various parametric, semi-parametric, and non-parametric strategies to model the value function (Chakraborty and Murphy, 2014; Laber et al., 2014b; Murphy, 2005b; Tsiatis et al., 2019).

Value-based methods are better suited for application to DTRs. They enable the discovery of optimal decision rules in a non-Markovian framework with a small number of steps and data, unlike policy-based methods. This makes them easily applicable to real-world data. Moreover, they can offer a clearer interpretation, especially when Q-function estimation relies on a linear regression model (Laber et al., 2014b), thus providing interpretable decision rules. As shown in the Supplementary Material Table, this is the most widely used method in practice, particularly Q-learning approaches and its derivatives in the context of DTRs.

Actor-Critic

A third approach to address the question of finding an optimal strategy is known as the ’Actor-Critic’ method (Grondman et al., 2012). It takes a hybrid approach by combining an Actor based on policy-based methods with a Critic based on model-based methods, thus integrating the advantages of both previous methods. The Actor refines the parameterized policy under the guidance of the Critic. The latter uses value functions, also parameterized Vπθsuperscript𝑉subscript𝜋𝜃V^{\pi_{\theta}}italic_V start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT oder Qπθsuperscript𝑄subscript𝜋𝜃Q^{\pi_{\theta}}italic_Q start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, to guide learning. This third way of constructing decision rules was developed to correct biases in value-based methods and to counterbalance the high variability of the gradient part of policy-based methods in equation (9).

Actor-Critic methods have been applied to the MIMIC dataset. This compromise between policy-based and value-based methods converges towards a decision rule reducing patient mortality in (Wang et al., 2018) or providing a decision rule in line with physician’s usual opinions in (Li et al., 2020, 2018). This approach relies on gradient descent, similar to policy-based methods, thus necessitating databases containing a large number of individuals, often simulated data.

3.5.3 On-policy vs. Off-policy

This last dichotomy is closely related and sometimes confused with the concepts of offline and online algorithms presented in Section 2.4. DTRs on real data inherently operate in an offline context, seeking the optimal policy from previously collected data. Therefore, we are necessarily in an off-policy context, meaning that the strategy of generating the data (’behavior policy’) is not necessarily optimal. The optimal strategy (’target policy’) is deduced subsequently.

On-policy algorithms require an interactive online context where the strategy generating the data is optimized. The concepts of behavior and target policies are merged. The online framework can benefit from both on-policy algorithms, as is the case in the medical domain with Just-in-Time Adaptive Interventions (JITAIs) discussed in Deliu et al. (2022), and off-policy algorithms (see Figure 3). Some online algorithms, both off-policy and on-policy, have been explored within the context of DTRs, but exclusively in simulated data settings, as indicated in the Supplementary Material table.

4 Integrating Medical Knowledge into Reinforcement Learning Models

The previous section has highlighted the variety of algorithms available for seeking optimal decision rules. Regardless of the method used, the construction of decision rules remains algorithmic and data-driven. Therefore, the legitimate question arises regarding the explainability of the obtained decision rule, both for the patient and the practitioner. A prerequisite for the clinical application of these decision rules will be to address these concerns. To do so, we will explore how medical expertise can intervene in the construction of these decision rules.

This section has two main goals: firstly, to outline how medical knowledge intersects with RL algorithms in the search for treatment decision rules, and secondly, to propose adjustments to these algorithms to better suit their application to DTR. These twin aims are aimed at enhancing the safety, interpretability, and relevance of tools for medical decision-making.

4.1 Medical Knowledge and Model Preparation

Like any machine learning method, the search for the optimal DTR depends on the data from which the method was trained. Data preparation is therefore an essential step. Medical knowledge is certainly involved in this process. Indeed, in this causal context, the choice of variables to collect and the selection of confounding factors are crucial. These decisions are primarily guided by medical expertise, drawn from the experience of practitioners and medical literature, as detailed in Section 3.3. The construction of the training dataset is thus the very first intervention of medical knowledge in RL models. It is primarily a methodological consideration that may bias the constructed optimal decision rule (Remark 2.9).

The second step in the preparation phase of applying RL in the context of searching for optimal DTRs involves selecting an algorithm from the various possibilities presented in Figure 3. This choice is primarily based on how the data were collected, the chronology of events, juxtaposed with the different characteristics of RL algorithms discussed in Section 3.5. The choice of method thus depends mainly on the application context and available data, and therefore, on underlying medical knowledge. Again, this is primarily a methodological issue, where the medical specialist collaborates with the machine learning specialist to make this choice or develop a new ad-hoc method. This discussion could follow the decision tree outlined in the figure titled "Overview of the guideline for the application of RL to healthcare" in Coronato et al. (2020).

4.2 Medical Knowledge and Rewards

One crucial aspect of learning optimal strategies is the formulation of rewards. This is a key component and one of the primary mechanisms for integrating medical knowledge into RL methods. In practical terms, commonly, the choice of reward is directly based on medical expertise. It is primarily a methodological issue closely linked to the study’s objective. The selection of the reward is similar to choosing the primary outcome measure in the design of a clinical trial, with the same imperatives of precision and representativeness of the variable. Rewards mainly consist in scores or quantitative variables, such as changes in body mass index (BMI) in weight loss studies (Linn et al., 2015), or survival functions in critical care settings (Roggeveen et al., 2021). Additionally, more complex rewards can be found, such as compromises or combinations of variables, as seen in oncology contexts (Zhao et al., 2009), where the reward is evaluated considering tumor size, treatment toxicity, patient well-being, and survival rates. In Table 1, an illustration of various reward functions is provided, each aiming to achieve a specific medical objective.

It is evident that selecting an ad-hoc reward for the problem under study can entail choices that are either too arbitrary or too context-specific, potentially leading to overly restrictive learning objectives. An alternative approach is to replace this choice of reward with reward shaping. Several approaches have been developed in this direction.

One way to generalize and automatically construct rewards is through inverse reinforcement learning. This method uses patient trajectories generated with expert medical decision-making to extract an estimate of the underlying reward function for these choices. Thus, it also seeks to highlight the characteristics that should be considered for its formulation. The latent medical knowledge will then be encapsulated in the estimation of the reward function. This approach has been used in the context of alcohol addiction management (Shah et al., 2022) for the search for a personalized decision-making rule. The application of inverse reinforcement Learning to the framework of DTRs is also explored in Luckett et al. (2017), where the objective of this study is to construct a reward function as a linear combination of covariates. Inverse reinforcement Learning allows for the determination of rewards from data, thereby accelerating the learning of a decision rule compared to manually constructed rewards. It is important to note that these methods assume that the physicians who generated the training data made decisions aimed at maximizing the interests of each patient. Thus, the constructed rewards are sensitive not only to the quality of the data but also to medical decisions.

Another approach involves the use of preference learning. This method relies on medical expertise, where the physician expresses preferences regarding patient trajectories. Patient histories are pairwise compared by an expert, thus creating a ranking. These comparisons are then used in a probabilistic model such as the Bradley-Terry to construct rewards by maximum likelihood estimation. These rewards are then integrated into RL algorithms. Preference Learning methods, described as on-policy by Fürnkranz et al. (2012) and off-policy by Akrour et al. (2012), use preferences on trajectories on simulated data similar to the generic cancer scenario described by Zhao et al. (2009). In Fürnkranz et al. (2012), patient trajectories are compared using a partial order relation that considers survival, maximum toxicity over time, and final tumor size. Meanwhile, Akrour et al. (2012) formulate expert preferences by prioritizing trajectories with superior final outcomes, which include minimal tumor size and reduced toxicity levels. Preference Learning enables the construction of rewards based on expert preferences on trajectories, allowing learning to rely on explainable choices. However, the applications described in the articles are based on simulated cancer data and simulated preferences, and have been developed in an online framework, which is not suitable for direct clinical application.

Other methods for constructing rewards exist, such as human-centered reinforcement learning, which utilizes rewards directly provided by an expert. The agent interprets expert feedback as numerical rewards. These approaches are detailed in Li et al. (2019a), but they are generally applied in an online and on-policy context, which involves direct interaction of the agent with patients, thus raising ethical concerns and requiring a specific application framework beyond the scope of this article.

4.3 Medical Knowledge and Value Functions

The evaluation or estimation of value functions Vnπsuperscriptsubscript𝑉𝑛𝜋V_{n}^{\pi}italic_V start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT and Qnπsuperscriptsubscript𝑄𝑛𝜋Q_{n}^{\pi}italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT is also a key concept in RL. In the medical context, due to the complexity of environments and the volume of available data, these assessments often suffer from a lack of precision. Integrating medical expertise can be considered to improve results.

This is particularly true when medical expertise translates into knowledge of treatment response mechanisms. Indeed, these observations can then be integrated into RL methods to guide the learning of the optimal strategy. From a technical standpoint, it is conceivable to penalize the value function: decrease the value function when mechanisms identified by an expert indicate that the treatment is inappropriate and increase the value function when the treatment is considered relevant. Actions associated with a lower value function are less likely to be selected than those associated with a higher value function. This approach thus highlights actions considered more relevant by the expert and guides learning in the right direction. This approach was implemented for patients with renal failure in Gaweda et al. (2005). Medical experts identified that patients who do not respond to standard treatment require higher doses. The authors constructed a DTR by incorporating this clinical fact into a Q-learning algorithm. When a patient does not respond to a treatment dose, the Q-values of lower doses are penalized, thus favoring higher doses. This approach offers the advantage of reducing the need for exploration and hence the learning time. However, it was developed in an online framework using simulated data, limiting its applicability to real-world data.

The integration of medical expertise can also occur through relay collaboration. The principle involves considering two concurrent value functions: Q𝑄Qitalic_Q, the usual value function, and Qclinsuperscript𝑄𝑐𝑙𝑖𝑛Q^{clin}italic_Q start_POSTSUPERSCRIPT italic_c italic_l italic_i italic_n end_POSTSUPERSCRIPT, the value function under the practitioner’s strategy in a given situation. The latter comes into play only when the patient is in a critical state, as evidenced by their vital signs. Subsequently, this decision and the patient’s response to treatment will be used to enrich the learning model through an enhanced value function, denoted as Q+superscript𝑄Q^{+}italic_Q start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT. Thus, the strategy for updating the value functions involves recommending treatments suggested by the RL model while seeking the expertise of physicians when the patient’s condition is deemed critical. Q+superscript𝑄Q^{+}italic_Q start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT can therefore be formalized as:

Q+(st,at+)={Qclin(st,atclin) If the patient’s covariates indicate a critical stateQ(st,at)Otherwisesuperscript𝑄subscript𝑠𝑡subscriptsuperscript𝑎𝑡casessuperscript𝑄𝑐𝑙𝑖𝑛subscript𝑠𝑡subscriptsuperscript𝑎𝑐𝑙𝑖𝑛𝑡 If the patient’s covariates indicate a critical state𝑄subscript𝑠𝑡subscript𝑎𝑡OtherwiseQ^{+}(s_{t},a^{+}_{t})=\left\{\begin{array}[]{ll}Q^{clin}(s_{t},a^{clin}_{t})&% \mbox{ If the patient's covariates indicate a critical state}\\ Q(s_{t},a_{t})&\mbox{Otherwise}\end{array}\right.italic_Q start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = { start_ARRAY start_ROW start_CELL italic_Q start_POSTSUPERSCRIPT italic_c italic_l italic_i italic_n end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_c italic_l italic_i italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_CELL start_CELL If the patient’s covariates indicate a critical state end_CELL end_ROW start_ROW start_CELL italic_Q ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_CELL start_CELL Otherwise end_CELL end_ROW end_ARRAY

where aclinsuperscript𝑎𝑐𝑙𝑖𝑛a^{clin}italic_a start_POSTSUPERSCRIPT italic_c italic_l italic_i italic_n end_POSTSUPERSCRIPT is the treatment chosen by the clinician.

This approach has been deployed in the context of intensive care treatment in Wu et al. (2023) when the patient exhibits severe symptoms. In such situations, RL algorithms may propose aggressive treatment strategies to maximize reward, which can entail significant risk for the patient. In this study, a model based on value functions Q𝑄Qitalic_Q incorporates human expertise on the treatment of sepsis. Applied to the MIMIC database, this model is evaluated using a score reflecting the patient’s critical state. Expert intervention is triggered when the score is considered low. The application of this method demonstrates a higher survival rate compared to some similar methods without human expertise and also improves the estimation of the value function.

The principle of collaboration between the agent and the expert is also addressed in Sonabend et al. (2020) using the MIMIC database. It still impacts the Q𝑄Qitalic_Q functions, but now through a statistical test. The idea is to introduce exploration into an offline model by comparing risks between two strategies. One simulates standard medical decisions, while the other strategy suggests an alternative treatment. From a comparison test on state values associated with a policy, one of these strategies is adopted. The question is: when could a new treatment be better than conventional therapies? The solution seeks to balance choices of standard treatments with new options while assessing risks to discover promising alternatives that physicians have not considered.

This link between RL and medical expertise allows both supervision of treatments in complex cases and exploration of alternative treatments while assessing associated risks. Off-policy RL suffers from data biases that can be better controlled by these methods, providing critical evaluation of the strategy to be adopted. These are methods suited for real clinical applications.

4.4 Medical Expertise and Objective Function

As we have just seen, value-based approaches can benefit from the integration of medical expertise in determining optimal strategies. Similarly, methods for incorporating medical expertise have been proposed for policy-based approaches, which directly modify on the objective function.

Supervised reinforcement learning merges two subfields of machine learning: Supervised Learning (SL) and RL. The fundamental principle of this method is to maximize a long-term objective, with the supervision of an expert, in order to maintain consistency with clinical treatment standards. Its ultimate goal is to predict an optimal treatment policy, minimizing deviations from medical expert recommendations. In this framework, the expert plays a crucial role as a reference for training the RL algorithm, using a database containing all medical decisions made within a cohort. This control affects the objective function in two ways. The latter is simplified into two parts: the first, derived from an actor-critic algorithm, aims to perfectly mimic the experts through its "critic" part (Section 3.5.2). The second part of supervised learning minimizes the difference between predicted treatments and those traditionally administered. This method, described notably in Yu et al. (2020), is applied in the intensive care domain using the MIMIC database and focusing on ventilation and sedation dosing. The primary objective is to provide optimal care that respects both short-term and long-term goals for patients, while adhering to best clinical practices. In this context, research shows that the supervised reinforcement learning approach outperforms the classical Actor-Critic approach in terms of convergence speed and alignment with usual medical decisions. In the study by Wang et al. (2018), the supervised reinforcement learning approach was applied to the MIMIC dataset. The treatment recommendations obtained would lead to a decrease in patient mortality rates. Supervised reinforcement learning, in its fundamental construction, aims to perfectly mimic the usual treatment practices, making it an excellent means of emulating practitioners. However, it does not allow for the proposal of alternative or less explored treatments compared to usual care methods.

4.5 Medical Expertise and Policy

It is important to note that medical decision rules constructed within the framework of reinforcement learning recommend only a single action for a given state. The multiple policies approach involves proposing different equivalent or closely related strategies for a given patient state. Consequently, the specialist, relying on their expertise and the constraints of their environment, chooses the treatment from the selection of actions offered. This approach introduces the notion of quasi-equivalent actions that may take into account considerations such as side effects, less invasive treatments, and local availability. The general idea is essentially to train a set of policies evaluated by value functions, which learn a correspondence between each state and a collection of closely comparable actions. Subsequently, the approach involves restricting the choice of actions by evaluating the extent to which the deviation from optimal value is acceptable. This is the concept of worst-case value, referring to the expected gain in the worst possible scenario within the set of allowed actions. The level of deviation from optimality allowed will be controlled by a hyper-parameter.

The concept of multiple policies was introduced in Milani Fard and Pineau (2011) and applied in a simulated setting of sequential clinical trials for patients suffering from depression. It was developed within a model-based, on-policy, online framework with a finite horizon, not conducive to real data or real clinical applications. In the article Tang et al. (2020), the method evolved into a model-free and off-policy framework, still online using the Temporal Difference learning algorithm, and was applied in the simulated context of critical care based on MIMIC. Like the previous method, its development in an online environment does not align with our application context, but it establishes the foundation for a model-free approach, thus representing progress towards a model suitable for DTR.

Finally, the concept of multiple policies has also been employed in a multi-objective context, not based on expert opinion but on patient preferences, as detailed in Lizotte and Laber (2016). By combining the notion of equivalent strategies with a multi-objective framework and Pareto dominance, and considering the preferences of patients, less restrictive solutions can be obtained. This approach, applied in the CATIE study specifically tailored to the DTR context, offers decision-makers increased choice by a larger class of optimal policies. These could provide the basis for an application that integrates experts’ preferences and medical knowledge, thus addressing the issue outlined in this article.

5 Conclusions

This paper introduces and aims to facilitate the understanding of RL methods for precision medicine, especially its application to optimal DTR research. This topic is of major practical interest since it aims to determine an optimal decision rule for personalized treatments, with a large range of applications in areas such as intensive care, chronic diseases, psychiatry, and oncology. However, applying RL to medical research requires specific considerations and adaptations.

The main specificity arises from the data, typically derived from observational studies, which limits RL methods to offline applications. While an online setting is feasible, such as in m-health scenarios, for many cases, it is unethical to base treatment decisions solely on an algorithm. Therefore, since the data has already been collected beforehand, it is important to note that the well-known exploration-exploitation dilemma of online RL translates into an exploration-exploitation bias in offline RL settings. Section 3.5 details the properties of RL algorithms and helps identify the most desirable characteristics for an algorithm applied to DTR. First, due to clinical and ethical constraints, exploring all possibilities from the same starting point is impractical, necessitating the use of model-free algorithms. Secondly, value-based methods enable the discovery of optimal decision rules in a non-Markovian setting with limited steps and data, distinguishing them from policy-based approaches. Thirdly, off-policy algorithms are suited for offline contexts where data is already collected following a specific strategy, allowing for the determination of the optimal policy in a second phase. When these three characteristics converge, the result is an algorithm well suited for practical applications with real DTR data. Consequently, Backward Q-learning, also known as Fitted Q-Iteration, emerges as the most widely adopted and utilized algorithm in the realm of applying RL to DTR (Clifton and Laber, 2020).

Intimately linked to all work on observational data, the question of causality arises in the optimal DTR research context. A few research works directly focus on this challenge (Chakraborty and Murphy, 2014; Zhang, 2020), but most of the time causality is based on assumptions that are difficult to verify which make the results questionable. This limitation may be overcome by the experimental design relying on SMART designs but such designs are difficult and expensive to set up (Cheung et al., 2015; Kosorok and Moodie, 2015).

The classical formulation of RL relies on decision processes theory under the Markov assumption. However, this assumption is often too stringent in practical applications. Indeed, there is no guarantee that the current state under study contains all the necessary information to construct a precise decision. However most of the mathematical properties remain true without this Markov assumption by considering the entirety of the patient’s history. In practice, that necessitates huge computational capacities and restricts to the applications the determination of adaptive strategies where the number of DTR steps is small (less than 4).

In addition to the previous issues, another problem emerges in the search for an optimal treatment strategy: the acceptability of the optimal DTR to both patient and practitioner. This raises concerns about how understandable the decision rules are for both patients and physicians, which is crucial for their clinical use. Integrating medical expertise into machine learning methods for personalized treatments is essential to improve safety, interpretability, and effectiveness in real-world scenarios.

One way to overpass this issue is to consider algorithms involving, one way or another, medical expertise or knowledge. We have seen that various approaches and studies demonstrate how medical expertise can be integrated into RL methods for sequential treatment decisions. This integration can be done at various stages of algorithm implementation.

First, the medical knowledge is often integrated before the study, at the design of the experiment. Indeed, physicians contribute to selecting the variables used for learning the decision rule. Similarly, algorithm selection involves collaboration between medical and machine learning expert, based on the application framework and available data.

Second, the medical knowledge can be integrated by acting on the rewards. Rewards is one of the main elements of a RL algorithm. Since they influence and guide the determination of the decision rule. Their design is thus crucial. Traditionally, a variable representative of the study’s objective is chosen. Methods such as inverse reinforcement learning and preference learning attempt to generalize their construction through expert input. Preference learning (Fürnkranz et al., 2012; Akrour et al., 2012) and human-centered RL (Li et al., 2019a) directly incorporate expert knowledge into reward construction. However, this method suffers from being developed only in an online setup, which is not applicable to DTRs and real clinic application. Nonetheless, early research in this area can serve as a foundation for further exploration. On the other hand, inverse reinforcement learning is promising since it is developed within the offline context and it is well suited for real clinical application (Shah et al., 2022; Luckett et al., 2017).

Thirdly, the learning of decision rules can be achieved through value functions, allowing for the integration of medical expertise at this level. One approach is to incorporate observed medical mechanisms; specifically, the idea is to penalize the Q-values associated with non-decisive treatments (Gaweda et al., 2005). However, this method was initially developed in an online context and requires reassessment for offline settings. A second idea is to establish a relay between human decisions and decisions proposed by the algorithm. In one scenario, the physician would take over when the patient is in critical conditions (Wu et al., 2023). In another scenario, the algorithm would suggest alternative treatments to those traditionally proposed, along with associated risks (Sonabend et al., 2020). These hybrid methods seem promising for real clinical applications, but concrete evidence of their implementation is currently lacking. In the policy-based methodological framework, the integration of expertise can occur through a method called supervised RL (Yu et al., 2020; Wang et al., 2018). Its aim is to faithfully replicate common medical practices, offering precise emulation of physicians’ decisions. However, it does not allow for the discovery of alternative or underexplored treatments compared to conventional care methods.

Finally, the learning of decision rules can be approached methodologically through policy and it is worth noting that classical RL methods typically recommend only one policy, typically one treatment and one dose for each decision time. To enrich the context, multiple policies methods have been developed with the aim of offering an expert multiple equivalent treatment to choose from. The work of Lizotte and Laber (2016) is particularly suitable for application to real data-based DTRs, but it was developed within a framework of patient preferences and could be reassessed within an expert preference framework.

The integration of medical knowledge is a promising research field, exploring various innovative perspectives and methods. However, further research is needed to adapt them to the specific constraints and realities of precision medicine. These advancements have the potential to lead to practical clinical applications and significantly enhance daily hospital operations. This aligns with the broader challenge of applying mathematical solutions effectively in clinical practice. Particularly, the development of Health System Science enables the use of interdisciplinary skills to study the complexity of healthcare systems (Apostolopoulos et al., 2020; Kahkoska et al., 2022b). Practically speaking, the aim is to ease the transition of laboratory discoveries into clinical practices (Gilliland et al., 2019), achieved by forming interdisciplinary teams within healthcare systems. Combining progress in both research areas could establish a tangible framework for applying RL alongside medical expertise, simplifying the treatment decision process for the benefit of all involved parties. We hope this study will encourage collaboration between machine learning researchers and healthcare professionals, by showing a framework that helps practically applying RL for DTR context.

References

  • Kosorok and Laber [2019] Michael R Kosorok and Eric B Laber. Precision medicine. Annual review of statistics and its application, 6:263–286, 2019.
  • Chakraborty and Murphy [2014] Bibhas Chakraborty and Susan A Murphy. Dynamic treatment regimes. Annual review of statistics and its application, 1:447–464, 2014.
  • Kosorok and Moodie [2015] Michael R Kosorok and Erica EM Moodie. Adaptive treatment strategies in practice: planning trials and analyzing data for personalized medicine. SIAM, 2015.
  • Coronato et al. [2020] Antonio Coronato, Muddasar Naeem, Giuseppe De Pietro, and Giovanni Paragliola. Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine, 109:101964, 2020.
  • Yu et al. [2021] Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1–36, 2021.
  • Terry [2015] Sharon F Terry. Obama’s precision medicine initiative. Genetic testing and molecular biomarkers, 19(3):113, 2015.
  • Laber et al. [2014a] Eric B Laber, Daniel J Lizotte, Min Qian, William E Pelham, and Susan A Murphy. Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics, 8(1):1225, 2014a.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Weltz et al. [2022] Justin Weltz, Alex Volfovsky, and Eric B Laber. Reinforcement learning methods in public health. Clinical therapeutics, 44(1):139–154, 2022.
  • Deliu et al. [2022] Nina Deliu, Joseph Jay Williams, and Bibhas Chakraborty. Reinforcement learning in modern biostatistics: constructing optimal adaptive interventions. arXiv preprint arXiv:2203.02605, 2022.
  • Uehara et al. [2022] Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355, 2022.
  • Clifton and Laber [2020] Jesse Clifton and Eric Laber. Q-learning: Theory and applications. Annual Review of Statistics and Its Application, 7:279–301, 2020.
  • Li et al. [2023] Zhen Li, Jie Chen, Eric Laber, Fang Liu, and Richard Baumgartner. Optimal treatment regimes: a review and empirical comparison. International Statistical Review, 91(3):427–463, 2023.
  • Eckardt et al. [2021] Jan-Niklas Eckardt, Karsten Wendt, Martin Bornhaeuser, and Jan Moritz Middeke. Reinforcement learning for precision oncology. Cancers, 13(18):4624, 2021.
  • Holzinger [2016] Andreas Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131, 2016.
  • Maadi et al. [2021] Mansoureh Maadi, Hadi Akbarzadeh Khorshidi, and Uwe Aickelin. A review on human–ai interaction in machine learning and insights for medical applications. International journal of environmental research and public health, 18(4):2121, 2021.
  • Love et al. [2023] Tamlin Love, Ritesh Ajoodha, and Benjamin Rosman. Who should i trust? cautiously learning with unreliable experts. Neural Computing and Applications, 35(23):16865–16875, 2023.
  • Holzinger et al. [2019] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1312, 2019.
  • Arzate Cruz and Igarashi [2020] Christian Arzate Cruz and Takeo Igarashi. A survey on interactive reinforcement learning: Design principles and open challenges. In Proceedings of the 2020 ACM designing interactive systems conference, pages 1195–1209, 2020.
  • Li et al. [2019a] Guangliang Li, Randy Gomez, Keisuke Nakamura, and Bo He. Human-centered reinforcement learning: A survey. IEEE Transactions on Human-Machine Systems, 49(4):337–349, 2019a.
  • Bellman [1957] Richard Bellman. A markovian decision process. Journal of mathematics and mechanics, pages 679–684, 1957.
  • Garcia and Rachelson [2013] Frédérick Garcia and Emmanuel Rachelson. Markov decision processes. Markov Decision Processes in Artificial Intelligence, pages 1–38, 2013.
  • Monahan [1982] George E Monahan. State of the art—a survey of partially observable markov decision processes: theory, models, and algorithms. Management science, 28(1):1–16, 1982.
  • Hernández-Lerma and Lasserre [2012] Onésimo Hernández-Lerma and Jean B Lasserre. Discrete-time Markov control processes: basic optimality criteria, volume 30. Springer Science & Business Media, 2012.
  • Nivot [2016] Christophe Nivot. Analyse et étude des processus markoviens décisionnels. PhD thesis, Bordeaux, 2016.
  • Schulte et al. [2014] Phillip J Schulte, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. Q-and a-learning methods for estimating optimal dynamic treatment regimes. Statistical science: a review journal of the Institute of Mathematical Statistics, 29(4):640, 2014.
  • Zhao et al. [2015] Ying-Qi Zhao, Donglin Zeng, Eric B Laber, and Michael R Kosorok. New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association, 110(510):583–598, 2015.
  • Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279–292, 1992.
  • Ernst et al. [2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • Ormoneit and Sen [2002] Dirk Ormoneit and Śaunak Sen. Kernel-based reinforcement learning. Machine learning, 49:161–178, 2002.
  • Kahkoska et al. [2022a] Anna R Kahkoska, Kristen Hassmiller Lich, and Michael R Kosorok. Focusing on optimality for the translation of precision medicine. Journal of Clinical and Translational Science, 6(1):e118, 2022a.
  • Robins [1998] James M Robins. Correction for non-compliance in equivalence trials. Statistics in medicine, 17(3):269–302, 1998.
  • Robins [1986] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-12):1393–1512, 1986.
  • Robins [1994] James M Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and methods, 23(8):2379–2412, 1994.
  • Robins [1992] James Robins. Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika, 79(2):321–334, 1992.
  • Robins [1989] James M Robins. The analysis of randomized and non-randomized aids treatment trials using a new approach to causal inference in longitudinal studies. Health service research methodology: a focus on AIDS, pages 113–159, 1989.
  • Robins [2000] James M Robins. Marginal structural models versus structural nested models as tools for causal inference. In Statistical models in epidemiology, the environment, and clinical trials, pages 95–133. Springer, 2000.
  • Chesnaye et al. [2022] Nicholas C Chesnaye, Vianda S Stel, Giovanni Tripepi, Friedo W Dekker, Edouard L Fu, Carmine Zoccali, and Kitty J Jager. An introduction to inverse probability of treatment weighting in observational research. Clinical Kidney Journal, 15(1):14–20, 2022.
  • Deliu and Chakraborty [2022] Nina Deliu and Bibhas Chakraborty. Dynamic treatment regimes for optimizing healthcare. In The Elements of Joint Learning and Optimization in Operations Management, pages 391–444. Springer, 2022.
  • Murphy [2003] Susan A Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331–355, 2003.
  • Robins [2004] James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics: analysis of correlated data, pages 189–326. Springer, 2004.
  • Kahkoska et al. [2022b] Anna R Kahkoska, Nikki LB Freeman, and Kristen Hassmiller Lich. Systems-aligned precision medicine—building an evidence base for individuals within complex systems. In JAMA health forum, volume 3, pages e222334–e222334. American Medical Association, 2022b.
  • Sperger et al. [2020] John Sperger, Nikki LB Freeman, Xiaotong Jiang, David Bang, Daniel de Marchi, and Michael R Kosorok. The future of precision health is data-driven decision support. Statistical Analysis and Data Mining: The ASA Data Science Journal, 13(6):537–543, 2020.
  • Luckett et al. [2019] Daniel J Luckett, Eric B Laber, Anna R Kahkoska, David M Maahs, Elizabeth Mayer-Davis, and Michael R Kosorok. Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association, 2019.
  • Ertefaie and Strawderman [2018] Ashkan Ertefaie and Robert L Strawderman. Constructing dynamic treatment regimes over indefinite time horizons. Biometrika, 105(4):963–977, 09 2018. ISSN 0006-3444. doi:10.1093/biomet/asy043. URL https://doi.org/10.1093/biomet/asy043.
  • Istepanian et al. [2007] Robert Istepanian, Swamy Laxminarayan, and Constantinos S Pattichis. M-health: Emerging mobile health systems. Springer Science & Business Media, 2007.
  • Nahum-Shani et al. [2018] Inbal Nahum-Shani, Shawna N Smith, Bonnie J Spring, Linda M Collins, Katie Witkiewitz, Ambuj Tewari, and Susan A Murphy. Just-in-time adaptive interventions (jitais) in mobile health: key components and design principles for ongoing health behavior support. Annals of Behavioral Medicine, pages 1–17, 2018.
  • Rehg et al. [2017] James M Rehg, Susan A Murphy, and Santosh Kumar. Mobile health. Cham: Springer International Publishing, 2017.
  • Hernan and Robins [2023] M.A. Hernan and J.M. Robins. Causal Inference: What If. Chapman & Hall/CRC Monographs on Statistics & Applied Probab. CRC Press, 2023. ISBN 9781420076165. URL https://books.google.fr/books?id=_KnHIAAACAAJ.
  • Neuberg [2003] Leland Gerson Neuberg. Causality: models, reasoning, and inference, by judea pearl, cambridge university press, 2000. Econometric Theory, 19(4):675–685, 2003.
  • Zhang [2020] Junzhe Zhang. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International conference on machine learning, pages 11012–11022. PMLR, 2020.
  • Johnson et al. [2016] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  • Roggeveen et al. [2021] Luca Roggeveen, Ali El Hassouni, Jonas Ahrendt, Tingjie Guo, Lucas Fleuren, Patrick Thoral, Armand RJ Girbes, Mark Hoogendoorn, and Paul WG Elbers. Transatlantic transferability of a new reinforcement learning model for optimizing haemodynamic treatment for critically ill patients with sepsis. Artificial Intelligence in Medicine, 112:102003, 2021.
  • Tseng et al. [2017] Huan-Hsin Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, Randall K Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
  • Peng et al. [2018] Xuefeng Peng, Yi Ding, David Wihl, Omer Gottesman, Matthieu Komorowski, H Lehman Li-wei, Andrew Ross, Aldo Faisal, and Finale Doshi-Velez. Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. In AMIA Annual Symposium Proceedings, volume 2018, page 887. American Medical Informatics Association, 2018.
  • Raghu et al. [2017a] Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017a.
  • Komorowski et al. [2016] Matthieu Komorowski, A Gordon, LA Celi, and A Faisal. A markov decision process to suggest optimal treatment of severe infections in intensive care. In Neural Information Processing Systems Workshop on Machine Learning for Health, 2016.
  • Raghu et al. [2017b] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163. PMLR, 2017b.
  • Li et al. [2019b] Luchen Li, Matthieu Komorowski, and Aldo A Faisal. Optimizing sequential medical treatments with auto-encoding heuristic search in pomdps. arXiv preprint arXiv:1905.07465, 2019b.
  • Yu et al. [2019a] Chao Yu, Guoqi Ren, and Jiming Liu. Deep inverse reinforcement learning for sepsis treatment. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), pages 1–3, 2019a. doi:10.1109/ICHI.2019.8904645.
  • Cheung et al. [2015] Ying Kuen Cheung, Bibhas Chakraborty, and Karina W Davidson. Sequential multiple assignment randomized trial (smart) with adaptive randomization for quality improvement in depression treatment program. Biometrics, 71(2):450–459, 2015.
  • Shortreed et al. [2011] Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine learning, 84(1-2):109, 2011.
  • Laber et al. [2014b] Eric B Laber, Kristin A Linn, and Leonard A Stefanski. Interactive model building for q-learning. Biometrika, 101(4):831–847, 2014b.
  • Raghu et al. [2018] Aniruddh Raghu, Matthieu Komorowski, and Sumeetpal Singh. Model-based reinforcement learning for sepsis treatment. arXiv preprint arXiv:1811.09602, 2018.
  • Grondman et al. [2012] Ivo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, part C (applications and reviews), 42(6):1291–1307, 2012.
  • Yu et al. [2019b] Chao Yu, Yinzhao Dong, Jiming Liu, and Guoqi Ren. Incorporating causal factors into reinforcement learning for dynamic treatment regimes in hiv. BMC medical informatics and decision making, 19:19–29, 2019b.
  • Murphy [2005a] Susan A Murphy. A generalization error for q-learning. Journal of Machine Learning Research, 6:1073–1097, 2005a.
  • Murphy [2005b] Susan A Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in medicine, 24(10):1455–1481, 2005b.
  • Tsiatis et al. [2019] Anastasios A Tsiatis, Marie Davidian, Shannon T Holloway, and Eric B Laber. Dynamic treatment regimes: Statistical methods for precision medicine. Chapman and Hall/CRC, 2019.
  • Wang et al. [2018] Lu Wang, Wei Zhang, Xiaofeng He, and Hongyuan Zha. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2447–2456, 2018.
  • Li et al. [2020] Luchen Li, Ignacio Albert-Smet, and Aldo A Faisal. Optimizing medical treatment for sepsis in intensive care: from reinforcement learning to pre-trial evaluation. arXiv preprint arXiv:2003.06474, 2020.
  • Li et al. [2018] Luchen Li, Matthieu Komorowski, and Aldo A Faisal. The actor search tree critic (astc) for off-policy pomdp learning in medical decision making. arXiv preprint arXiv:1805.11548, 2018.
  • Linn et al. [2015] Kristin A Linn, Eric B Laber, and Leonard A Stefanski. iqlearn: Interactive q-learning in r. Journal of statistical software, 64(1), 2015.
  • Zhao et al. [2009] Yufan Zhao, Michael R Kosorok, and Donglin Zeng. Reinforcement learning design for cancer clinical trials. Statistics in medicine, 28(26):3294–3315, 2009.
  • Shah et al. [2022] Syed Ihtesham Hussain Shah, Antonio Coronato, and Muddasar Naeem. Inverse reinforcement learning based approach for investigating optimal dynamic treatment regime. In Workshops at 18th International Conference on Intelligent Environments (IE2022), pages 266–276. IOS Press, 2022.
  • Luckett et al. [2017] Daniel J Luckett, Eric B Laber, and Michael R Kosorok. Estimation and optimization of composite outcomes. arXiv preprint arXiv:1711.10581, 2017.
  • Fürnkranz et al. [2012] Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89:123–156, 2012.
  • Akrour et al. [2012] Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II 23, pages 116–131. Springer, 2012.
  • Gaweda et al. [2005] Adam E Gaweda, Mehmet K Muezzinoglu, George R Aronoff, Alfred A Jacobs, Jacek M Zurada, and Michael E Brier. Incorporating prior knowledge into q-learning for drug delivery individualization. In Fourth International Conference on Machine Learning and Applications (ICMLA’05), pages 6–pp. IEEE, 2005.
  • Wu et al. [2023] XiaoDan Wu, RuiChang Li, Zhen He, TianZhi Yu, and ChangQing Cheng. A value-based deep reinforcement learning model with human expertise in optimal treatment of sepsis. npj Digital Medicine, 6(1):15, 2023.
  • Sonabend et al. [2020] Aaron Sonabend, Junwei Lu, Leo Anthony Celi, Tianxi Cai, and Peter Szolovits. Expert-supervised reinforcement learning for offline policy learning and evaluation. Advances in Neural Information Processing Systems, 33:18967–18977, 2020.
  • Yu et al. [2020] Chao Yu, Guoqi Ren, and Yinzhao Dong. Supervised-actor-critic reinforcement learning for intelligent mechanical ventilation and sedative dosing in intensive care units. BMC medical informatics and decision making, 20(3):1–8, 2020.
  • Milani Fard and Pineau [2011] M. Milani Fard and J. Pineau. Non-deterministic policies in markovian decision processes. Journal of Artificial Intelligence Research, 40:1–24, January 2011. ISSN 1076-9757. doi:10.1613/jair.3175. URL http://dx.doi.org/10.1613/jair.3175.
  • Tang et al. [2020] Shengpu Tang, Aditya Modi, Michael Sjoding, and Jenna Wiens. Clinician-in-the-loop decision making: Reinforcement learning with near-optimal set-valued policies. In International Conference on Machine Learning, pages 9387–9396. PMLR, 2020.
  • Lizotte and Laber [2016] Daniel J Lizotte and Eric B Laber. Multi-objective markov decision processes for data-driven decision support. The journal of machine learning research, 17(1):7378–7405, 2016.
  • Apostolopoulos et al. [2020] Yorghos Apostolopoulos, Kristen Hassmiller Lich, and Michael K Lemke. Complex systems and population health. Oxford University Press, 2020.
  • Gilliland et al. [2019] C Taylor Gilliland, Julia White, Barry Gee, Rosan Kreeftmeijer-Vegter, Florence Bietrix, Anton E Ussi, Marian Hajduch, Petr Kocis, Nobuyoshi Chiba, Ryutaro Hirasawa, et al. The fundamental characteristics of a translational scientist, 2019.
  • Hassani et al. [2010] Amin Hassani et al. Reinforcement learning based control of tumor growth with chemotherapy. In 2010 International Conference on System Science and Engineering, pages 185–189. IEEE, 2010.
  • Ahn and Park [2011] Inkyung Ahn and Jooyoung Park. Drug scheduling of cancer chemotherapy based on natural actor-critic approach. BioSystems, 106(2-3):121–129, 2011.
  • Goldberg and Kosorok [2012] Yair Goldberg and Michael R Kosorok. Q-learning with censored data. Annals of statistics, 40(1):529, 2012.
  • Nahum-Shani et al. [2012] Inbal Nahum-Shani, Min Qian, Daniel Almirall, William E Pelham, Beth Gnagy, Gregory A Fabiano, James G Waxmonsky, Jihnhee Yu, and Susan A Murphy. Q-learning: a data analysis method for constructing adaptive interventions. Psychological methods, 17(4):478, 2012.
  • Zhao et al. [2011] Yufan Zhao, Donglin Zeng, Mark A Socinski, and Michael R Kosorok. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics, 67(4):1422–1433, 2011.
  • Vincent [2014] Robert Vincent. Reinforcement learning in models of adaptive medical treatment strategies. McGill University (Canada), 2014.
  • Liu et al. [2016] Ying Liu, Yuanjia Wang, Michael R Kosorok, Yingqi Zhao, and Donglin Zeng. Robust hybrid learning for estimating personalized dynamic treatment regimens. arXiv preprint arXiv:1611.02314, 2016.
  • Humphrey [2017] Kyle Humphrey. Using Reinforcement Learning to Personalize Dosing Strategies in a Simulated Cancer Trial with High Dimensional Data. PhD thesis, The University of Arizona, 2017.
  • Padmanabhan et al. [2017] Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Mathematical biosciences, 293:11–20, 2017.
  • Jalalimanesh et al. [2017] Ammar Jalalimanesh, Hamidreza Shahabi Haghighi, Abbas Ahmadi, and Madjid Soltani. Simulation-based optimization of radiotherapy: Agent-based modeling and reinforcement learning. Mathematics and Computers in Simulation, 133:235–248, 2017.
  • Liu et al. [2017] Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE international conference on healthcare informatics (ICHI), pages 380–385. IEEE, 2017.
  • Krakow et al. [2017] Elizabeth F Krakow, Michael Hemmer, Tao Wang, Brent Logan, Mukta Arora, Stephen Spellman, Daniel Couriel, Amin Alousi, Joseph Pidala, Michael Last, et al. Tools for the precision medicine era: how to develop highly personalized treatment recommendations from cohort and registry data using q-learning. American journal of epidemiology, 186(2):160–172, 2017.
  • Yauney and Shah [2018] Gregory Yauney and Pratik Shah. Reinforcement learning with action-derived rewards for chemotherapy and clinical trial dosing regimen selection. In Machine Learning for Healthcare Conference, pages 161–226. PMLR, 2018.
  • Yazdjerdi et al. [2019] Parisa Yazdjerdi, Nader Meskin, Mohammad Al-Naemi, Ala-Eddin Al Moustafa, and Levente Kovács. Reinforcement learning-based control of tumor growth under anti-angiogenic therapy. Computer methods and programs in biomedicine, 173:15–26, 2019.
  • Zade et al. [2020] Amir Ebrahimi Zade, Seyedhamidreza Shahabi Haghighi, and Madjid Soltani. Reinforcement learning for optimal scheduling of glioblastoma treatment with temozolomide. Computer methods and programs in biomedicine, 193:105443, 2020.
  • Daoud et al. [2020] Salma Daoud, Afef Mdhaffar, Mohamed Jmaiel, and Bernd Freisleben. Q-rank: reinforcement learning for recommending algorithms to predict drug sensitivity to cancer therapy. IEEE Journal of Biomedical and Health Informatics, 24(11):3154–3161, 2020.
  • Moreau et al. [2021] Grégoire Moreau, Vincent François-Lavet, Paul Desbordes, and Benoît Macq. Reinforcement learning for radiotherapy dose fractioning automation. Biomedicines, 9(2):214, 2021.
  • Shiranthika et al. [2022] Chamani Shiranthika, Kuo-Wei Chen, Chung-Yih Wang, Chan-Yun Yang, BH Sudantha, and Wei-Fu Li. Supervised optimal chemotherapy regimen based on offline reinforcement learning. IEEE Journal of Biomedical and Health Informatics, 26(9):4763–4772, 2022.
  • Kaushik et al. [2022] Pramod Kaushik, Sneha Kummetha, Perusha Moodley, and Raju S Bapi. A conservative q-learning approach for handling distribution shift in sepsis treatment strategies. arXiv preprint arXiv:2203.13884, 2022.

Supplementary Material

Table 2: Reinforcement Learning Applications for Sequential Decision in Healthcare. Ref, References; Environment, description of the medical application context; Data, Simulated or Real; Model, Decision Process (DP) or Markov Decision Process (MPD) or Partially Observable Markov Decision Process (POMDP); Stage, number of stages; Off/On, offline or online; Algorithm, standard reference algorithm of reinforcement learning without the paper specifications or innovation added.
Ref. Environment Data Model Stage Off/On Algorithm
Gaweda et al. [2005] Simulated patient with anemia due to kidney failure Real Data MDP 24 Online Q-learning
Zhao et al. [2009] ODE Simulation of cancer trial for advanced generic cancer of treatment Simulation DP 6 Offline Backward Q-learning
Hassani et al. [2010] ODE Simulation of cancer growth on a cell population level Simulation MDP K.A. Online Q-learning
Ahn and Park [2011] ODE Simulation of cancer growth on a cell population level Simulation DP K.A. Online Actor-Critic
Shortreed et al. [2011] CATIE Real Data DP 2 Offline Backward Q-learning
Akrour et al. [2012] Same as in Zhao et al. [2009] Simulation MDP 12 Online Policy Search
Goldberg and Kosorok [2012] Same as in Zhao et al. [2009] Simulation DP 3 Offline Backward Q-learning
Nahum-Shani et al. [2012] ADHD Real Data DP 2 Offline Backward Q-learning
Fürnkranz et al. [2012] Same as in Zhao et al. [2009] Simulation MDP 6 Online Policy Iteration
Zhao et al. [2011] Exponential distribution simulation of cancer for parties in phase III. Simulation DP 2 Offline Backward Q-learning
Vincent [2014] (Chapter 4) Electrical stimulation for epilepsy from in vitro experiments Real Data MDP K.A. Offline Backward Q-learning
Vincent [2014] (Chapter 5) Electrical stimulation for Parkinson’s disease Simulation MDP 2, 6, 9, 10 Online SARSA, Temporal Difference
Vincent [2014] (Chapter 5) Electrical stimulation for Parkinson’s disease Simulation MDP 2, 6, 9, 10 Offline Backward Q-learning
Vincent [2014] (Chapter 6) Fractionation scheduling for radiation therapy Simulation MDP 4, 10, 30 Offline Backward Q-learning
Laber et al. [2014a] ADHD Real Data DP 2 Offline Backward Q-learning
Laber et al. [2014b] STAR*D Real Data DP 2 Offline Backward Q-learning
Ertefaie and Strawderman [2018] Simulated patients with diabetes Simulation MDP Indefinite Online Q-learning
Cheung et al. [2015] Comparison of depression interventions after acute coronary syndrom (SMART) Real Data DP 2 Offline Backward Q-learning
Liu et al. [2016] STAR*D and ADHD Real Data DP 2 Offline Outcome-Weighted Learning and Q-learning
Humphrey [2017] Same as in Zhao et al. [2009] Simulation MDP K.A. Offline Backward Q-learning
Padmanabhan et al. [2017] ODE Simulation of cancer growth on a cell population level Simulation MDP K.A. Online Q-learning
Tseng et al. [2017] 114 patients used to construct synthetic data by GAN Simulation MDP 35 Online Deep Q-learning
Jalalimanesh et al. [2017] Model of tumor growth using NetLogo package, Agent-based simulation Simulation DP K.A. Online Q-learning
Liu et al. [2017] Registry data from 6021 AML patients who underwent allogeneic stem cell transplantation Real Data DP 5 Offline Deep Q-learning
Krakow et al. [2017] Nonrandomized registry data from 11,141 patients who underwent allogeneic stem cell transplantation Real Data DP 2 Offline Backward Q-learning
Raghu et al. [2017b] MIMIC Real Data MDP K.A. Offline Deep Q-Learning
Raghu et al. [2017a] MIMIC Real Data MDP K.A. Offline Deep Q-Learning
Yauney and Shah [2018] Linear and ODE Simulation of cancer trial Simulation MDP K.A. Online Deep Q-learning
Peng et al. [2018] MIMIC Real Data MDP K.A. Offline Deep Q-Learning
Yazdjerdi et al. [2019] ODE Simulation of cancer growth on a cell population level Simulation MDP K.A. Online Q-learning
Luckett et al. [2019] Simulated patients with diabetes Simulation MDP Indefinite Online V-learning
Li et al. [2019b] MIMIC Real Data POMDP K.A. Offline Deep Q-Learning
Zade et al. [2020] ODE Simulation of cancer growth on a cell population level Simulation MDP K.A. Online Q-learning
Daoud et al. [2020] Real dataset of breast cancer Real Data MDP K.A. Online Q-learning
Tang et al. [2020] MIMIC Real Data MDP 750 Offline Temporal Difference
Sonabend et al. [2020] MIMIC Real Data MDP K.A. Offline Q-learning
Moreau et al. [2021] Simulate tumor development inside healthy tissue and the effect of radiation therapy Simulation MDP K.A. Online Deep Policy Gradient
Wu et al. [2023] MIMIC Real Data MDP K.A. Offline Deep Q-Learning
Shiranthika et al. [2022] 40 patients of stage-four colon cancer Real Data MDP 6 Offline Deep Q-Learning
Kaushik et al. [2022] MIMIC Real Data MDP K.A. Offline Deep Q-Learning