Entropy for Optimal Control on a Simplex With an Application to Behavioral Nudging

We study the utilization of the entropy function of inputs in solving an Optimal Control Problem (OCP) with linear dynamics and inputs constrained to a variable-sized simplex in which the size is also an input. By using the entropy function as part of the objective functional in the OCP, we are able to derive a closed-form solution. Additionally, we present an example of how the studied OCP can be applied to choose between nudging techniques to discourage a specific behavior, such as non-adherence to medication, through the lens of behavioral momentum theory.


I. INTRODUCTION
Optimal control problems (OCPs) are concerned with finding an optimal input trajectory for a dynamical system which maximizes or minimizes an objective functional while satisfying specific constraints. A class of problems in which the input trajectory is constrained to be on a simplex arises in many applications such as portfolio optimization in finance, resource allocation in energy systems, mixing chemicals in chemical reactions, and when the control input is a discrete probability distribution. The use of entropy in the objective for continuous-time dynamics has been studied in [1] and [2]. In [1], the authors analyzed the use of the entropy function for stochastic linear optimal control problems. In [2], the authors derived a class of Hamilton-Jacobi-Bellman (HJB) equations for optimal control problems in which the input is a probability measure. The optimization in these papers is performed over a probability measure, with the inputs of the dynamics drawn from that measure, and the optimal control problem considers dynamics and objective terms that are averaged with respect to it. In this letter, we consider an OCP with linear time-varying dynamics and an input vector u constrained to a simplex of size v > 0, with v being an input itself. In particular, we show how the use of the entropy function in the objective, in addition to a linear objective in the state, a linear objective in u, and a quadratic objective in v, yields a closed-form solution using the necessary conditions of the maximum principle with Arrow-type sufficient conditions [3]. Although setting v = 1 in our setup makes our problem a special case of those considered in [1], [2], these works do not explicitly address and solve the case of a discrete probability measure with continuous linear time-varying dynamics and a linear objective function, as we do in this letter.
Moreover, we introduce the size of the simplex as an additional optimization input, which further expands the scope of the problem. Furthermore, we present an example of how the OCP of interest in this letter can be used to schedule different nudging techniques using behavioral momentum theory [4] to discourage an unhealthy behavior in people. Behavioral momentum theory suggests that behaviors that are reinforced more frequently are more resistant to changes in the environment. With this theory, we model the dynamics of the average rate of a target behavior by a linear model, with the average rate of a reinforcement being a parameter.
To discourage an unhealthy behavior, we introduce different nudges to the behavior as inputs to the OCP framework. In the context of our framework, we represent the probabilities of selecting the nudges as inputs belonging to a simplex, with the size of the simplex representing their overall rate. Our objective is to optimize the choice of different nudges and their average rate to minimize the average rate of the targeted behavior, while also considering the cost of each nudge and ensuring a diversity of nudges.
The use of computational and machine learning techniques has recently been investigated for the design of nudges for medical care professionals, such as in [5], and to encourage patients to adhere to their prescribed medicine in [6]. Additionally, the work in [7] considers the optimal design of nudges within a Markov decision process framework derived from resource-rational analysis. In this letter, we consider the problem of choosing between nudges while minimizing their average rate within a continuous optimal control framework derived from behavioral momentum theory. Our work offers an alternative framework and perspective for the problem of behavioral nudging in healthcare. We hope that our discussion in this letter can be one of the early works towards the application of control theory concepts in behavioral nudging of people in healthcare.
The contributions of this letter are summarized as follows:
• We derive a closed-form solution for an OCP with inputs constrained to a simplex in which the size of the simplex itself is also another input.
• We present examples of how the OCP of interest and behavioral momentum theory can be used to assist in the choice of different nudging techniques aimed at discouraging an undesired behavior, such as non-adherence to medication.
To our knowledge, this is the first time control theory techniques have been used in connection with behavioral momentum theory.

II. NOTATIONS
All vectors are considered as column vectors. We let $[a, b]$ denote the closed interval from $a$ to $b$, and $[a\ b]$ denote the row vector with coordinates $a$ and $b$. The symbols $I_n$ and $0_{n \times m}$ are used to denote the $n \times n$ identity and the $n \times m$ zero matrix, respectively. The symbol $\mathbf{1}_n$ is used to denote the $n$-dimensional column vector of 1s. The symbols $\geq_e$, $>_e$ are used for element-wise $\geq$ and $>$. For $u \in \Delta_v^n := \{u \in \mathbb{R}^n_{\geq 0} \mid \|u\|_1 = v\}$, we write the entropy function as $\varphi(u) = -\sum_{i=1}^{n} u_i \ln(u_i)$ and we take $0 \ln(0) := 0$. For $u \in \Delta_v^n$ and $w >_e 0$, we also use the Kullback-Leibler (KL) divergence (relative entropy) of $u$ with respect to $w$. We use $\exp_e(x)$ and $\ln_e(x)$ for the element-wise exponential and logarithm of a vector $x$, respectively.
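As a small illustration of the notation above, the following sketch evaluates the entropy function with the convention 0 ln(0) := 0. The function names are hypothetical, and the KL divergence is written in its standard form, the sum of u_i ln(u_i / w_i), which is an assumption here since the exact expression used in this letter is not reproduced in the text.

```python
import numpy as np

def entropy(u):
    """Entropy phi(u) = -sum_i u_i ln(u_i), using the convention 0 ln(0) := 0."""
    u = np.asarray(u, dtype=float)
    safe = np.where(u > 0, u, 1.0)  # ln(1) = 0, so zero entries contribute nothing
    return float(-np.sum(u * np.log(safe)))

def kl_divergence(u, w):
    """Standard KL form sum_i u_i ln(u_i / w_i) for u >= 0 and w > 0 (assumed form)."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    ratio = np.where(u > 0, u / w, 1.0)  # zero entries of u contribute nothing
    return float(np.sum(u * np.log(ratio)))

# A point on the simplex of size v is a scaled point of the unit simplex.
v = 2.0
u_bar = np.array([0.5, 0.3, 0.2])
u = v * u_bar
```

Under these definitions, a point u = v ū on the simplex of size v satisfies φ(u) = v φ(ū) − v ln(v), which can be checked numerically with the values above.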

III. SOLUTION OF THE OPTIMAL CONTROL PROBLEM
In this section, we first present the OCP of interest in Section III-A, and derive an explicit solution for it in Section III-B.

A. Problem Setup
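To define the OCP of interest (OCPv) in this letter, we begin by defining a running objective, which combines the entropy $\varphi(u)$ weighted through $\eta$, a term linear in $u$ with coefficient $c(t)$, a term linear in $x$ with coefficient $d$, and a term quadratic in $v$ with coefficient $q$, and the terminal objective $S(x) := e^T x$, with $\eta > 0$, $c(t) \in \mathbb{R}^{n_u}$ being continuously differentiable, $d \in \mathbb{R}^{n_x}$, $e \in \mathbb{R}^{n_x}$, and $q < 0$. The OCP in this letter has the form (1), with the linear dynamics (1b) and the simplex constraint (1c). Note that for the case in which $v$ is set to 1, the inputs $u$ will be constrained to the unit simplex $\Delta_1^{n_u}$. Also note that $B(t)$ is assumed to be an explicit function of time $t$, e.g., see below (11).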

B. Closed-Form Solution
In order to find the solution for OCPv, we use the necessary conditions of the maximum principle. To summarize the necessary conditions, it is convenient to define the Hamiltonian function (2) for our problem, where $\lambda$ is called the adjoint variable. The necessary conditions (3) for a candidate to be a solution of the OCP in (1) consist of the maximality condition (3a), the adjoint equation (3b), and the transversality condition (3c). From the maximality condition (3a), we get the maximizing inputs in (4) (see the Appendix). Since the maximized Hamiltonian is concave in $x$ and $S(x)$ is also concave in $x$, the necessary conditions (3) are sufficient (Arrow-type sufficient conditions [3]). Now using the adjoint equation (3b) together with the transversality condition (3c), we get the adjoint dynamics (5). Referring to [8], we can obtain the solution (6) to (5), which involves the matrix terms $M_A$ and $M_d$. Substituting the adjoint solution (6) in (4), we get the solution (7).
Remark 1: For the case when $v$ is set to 1 and is not optimized over, the solution $u^*$ can be shown to be $u^* = \exp_e(\zeta(t)) / (\mathbf{1}^T \exp_e(\zeta(t)))$ by following the same procedure used to obtain (7a). Moreover, if $d = e = 0$ and $c(t) = c$, then the solution (7) is a constant input $u^* = \exp_e(\eta c) / (\mathbf{1}^T \exp_e(\eta c))$. Additionally, if the inputs are weighted equally (i.e., $c = c_0 \mathbf{1}$ for some scalar $c_0 \in \mathbb{R}$), then the solution simplifies to $u^* = \frac{1}{n_u}\mathbf{1}$. This is the well-known solution for the maximum entropy on a simplex.
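The constant-input case of Remark 1 can be checked numerically. The sketch below assumes that the static problem being solved is the maximization of η⁻¹φ(u) + cᵀu over the unit simplex, which is our reading of the remark rather than a formula reproduced from the letter. The function names are hypothetical, and the values of c are borrowed from the example in Section IV-B.

```python
import numpy as np

def softmax_input(zeta):
    """u* = exp_e(zeta) / (1^T exp_e(zeta)), shifted by max(zeta) for numerical stability."""
    z = np.asarray(zeta, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()

def static_objective(u, c, eta):
    """Assumed static objective: (1/eta) * phi(u) + c^T u, with 0 ln(0) := 0."""
    u = np.asarray(u, dtype=float)
    safe = np.where(u > 0, u, 1.0)
    return float(-np.sum(u * np.log(safe)) / eta + c @ u)

eta = 1.0
c = -np.array([0.1, 0.5, 1.0])       # linear weights from the Section IV-B example
u_star = softmax_input(eta * c)      # candidate maximizer from Remark 1

# Compare against random points of the unit simplex (Dirichlet samples).
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=2000)
best_random = max(static_objective(u, c, eta) for u in candidates)

# Equal weights give the uniform distribution, i.e. the maximum-entropy point.
u_uniform = softmax_input(eta * 0.3 * np.ones(4))
```

Under this assumption, no sampled point exceeds the objective value at `u_star`, and that value equals η⁻¹ ln(1ᵀ exp_e(ηc)).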

Remark 2:
Incorporating the entropy function into the objective of OCPv encourages the utilization of all available inputs, as every component of the resulting solution in (7) is always non-zero. This encourages diversification in the inputs, which can be advantageous in applications where exploring diverse solutions is desirable or in situations where one or more inputs could lose their effectiveness, such as in the case of a faulty actuator. Using only a linear term for the inputs in the objective of OCPv will yield a bang-bang solution in step (4). Additionally, if we use a quadratic term $-u^T(t) Q u(t)$, $Q \geq 0$, for the inputs in place of the entropy function, then the problem in (4) becomes a standard quadratic optimization problem (StQP). However, determining explicit solutions for StQPs is known to be NP-hard [9], even though efficient algorithms are available. In contrast, despite the need to evaluate a matrix exponential for $M_A$ and $M_d$, computing the explicit solution in (7) can be more efficient to implement in many scenarios (e.g., when $A$ is diagonal). Moreover, if we intend to implement (7) recursively, as demonstrated in Section IV-D, the matrix exponential needs to be evaluated only once. Finally, obtaining an explicit solution may prove valuable for conducting further theoretical analyses of the implemented solution's dynamics.
Remark 3: The solution to OCPv can be used in a receding horizon fashion by recursively estimating the dynamics parameters and solving the OCP for a fixed horizon (see Section IV-D for an example). To ensure that the inputs between consecutive solutions are close to each other, we introduce the relative entropy objective (8), with parameters $\eta_p$, $q_p$, and $\rho$. The values $u_p$, $v_p$ are the last inputs from the previously computed solution. In this case, the solution in (4) is modified accordingly, with, among other terms, $\bar{q}(t) = q + q_p \exp(-\rho t)$.

IV. EXAMPLE WITH BEHAVIORAL MOMENTUM THEORY
In this section, we present a simple model derived from the principles of behavioral momentum theory. The model takes the form $\dot{x}(t) = B(t)u(t)$, with $\mathbf{1}^T u(t) = v(t)$, and is described in detail in Section IV-A. We then use this model to solve OCPv for various scenarios in Sections IV-B to IV-D.

A. Behavioral Momentum Model
Behavioral momentum theory provides a quantitative basis for the idea that the rate of a behavior that has been reinforced frequently in the past is more resistant to change under disruptions than if it had been reinforced less frequently [4], [10]. In [4], [10], mathematical representations for behavioral momentum theory were introduced and validated with data obtained from different experiments. In this letter, we use a simple continuous-time version based on an averaged model from [4], [10]. Let $\beta(t) \in \mathbb{R}_{\geq 0}$ be the average rate of occurrence for a specific behavior per unit time and define $x(t) := \log_{10}(\beta(t))$. Then the change $x(t) - x(t_1)$ with $\Delta t := t - t_1 \geq 0$ is modelled with respect to disruptions and reinforcers as $x(t) - x(t_1) = -\Delta t\, \delta(t)/\sqrt{r(t)}$, where $r(t) \in \mathbb{R}_{\geq 0}$ is the average rate of a reinforcer, and $\delta(t) = b(t)v(t)$ with $v(t) \in \mathbb{R}_{\geq 0}$ being the average rate of disruption events, and $b(t) \in \mathbb{R}_{\geq 0}$ being an effect factor for the disruption events. The value $\sqrt{r}$ represents a "behavioral inertia": a higher average reinforcer rate would require a higher average rate for the effect of disruptions $\delta$ to change the behavior. Dividing by $\Delta t$ and taking the limit for $\Delta t \to 0$, we obtain $\dot{x}(t) = -\delta(t)/\sqrt{r(t)}$ (9). Consider now that for a disruption happening with an average rate of $v(t)$, the disruption can be of $n_u$ different types, with a probability $\bar{u}_i$ of being of type $i$ with an effect factor $b_i$. In that case, $\delta(t)$ in (9) becomes $\delta(t) = v(t)\, b^T(t)\bar{u}(t)$ (10), with $\bar{u}(t) \in \Delta_1^{n_u}$ and $b(t) \in \mathbb{R}^{n_u}_{\geq 0}$. Here, the $i$th component $\bar{u}_i(t) \geq 0$ of $\bar{u}(t)$ can also be understood as the average rate of a type of disruption with respect to the other types in $\delta(t)$ (average rate ratio). Note that the sum $\sum_{i=1}^{n_u} v(t)\bar{u}_i(t) = v(t)$. The value $v(t)$ is usually desired to be small enough to avoid what is known as alert fatigue [11]. Alert fatigue occurs when the rate of disruptions is high enough that the disruptions lose their effect.
Note that the model (9) with (10) can be written in the form of the model of OCPv (1b) by introducing $v(t)$ in the constraint (1c): $\dot{x}(t) = B(t)u(t)$, $\mathbf{1}^T u(t) = v(t)$ (11), where $u(t) := v(t)\bar{u}(t)$ and $B(t) = -\frac{1}{\sqrt{r(t)}} b^T(t)$.
Remark 4: For a better understanding of the averaged representation and how to obtain (10), consider the compound Poisson process defined in (12), where $P(t)$ is a Poisson process representing the number of disruptions (jumps) up until time $t$ with rate $v(t)$, $T_k^-$ is the pre-disruption time value of the $k$th random disruption, and $W = \{W_k\}$ is an IID stochastic process in which $W_k$ represents the type of the $k$th disruption, such that $W_k \in \mathcal{W} = \{w_1, \ldots, w_{n_u}\}$ with $w_i$ being a vector of zeros except for the $i$th element being 1 and $P(W_k = w_i) = \bar{u}_i$. Taking the expectation of this process [12, Ch. 5] gives an expression that is equivalent to $x(t)$ in (9) with $\delta(t)$ chosen as in (10). This interpretation also gives us a method to apply the disruptions in real life by simulating (12).
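The averaging argument of Remark 4 can be illustrated with a short Monte Carlo sketch. It rests on simplifying assumptions that are ours: the rate v, the reinforcer rate r, the effect factors b, and the type probabilities ū are held constant, and each disruption of type i lowers x by b_i/√r. The function names and the particular ū are hypothetical.

```python
import numpy as np

def simulate_disruptions(x0, horizon, v, r, b, u_bar, rng):
    """One sample path: Poisson number of disruptions, each with a random type."""
    n_events = rng.poisson(v * horizon)                # number of jumps up to the horizon
    types = rng.choice(len(b), size=n_events, p=u_bar)  # type i drawn with probability u_bar[i]
    return x0 - np.sum(b[types]) / np.sqrt(r)           # each jump lowers x by b_i / sqrt(r)

def averaged_model(x0, horizon, v, r, b, u_bar):
    """Solution of x_dot = -v * b^T u_bar / sqrt(r) with constant parameters."""
    return x0 - horizon * v * (b @ u_bar) / np.sqrt(r)

rng = np.random.default_rng(1)
b = np.array([0.2, 0.3, 0.4])        # effect factors from the Section IV-B example
u_bar = np.array([0.5, 0.3, 0.2])    # hypothetical type probabilities
x0, horizon, v, r = 1.0, 24.0, 1.0, 7.0

samples = np.array([
    simulate_disruptions(x0, horizon, v, r, b, u_bar, rng) for _ in range(20000)
])
mc_mean = samples.mean()
exact = averaged_model(x0, horizon, v, r, b, u_bar)
```

With enough sample paths, `mc_mean` should approach `exact`, which is the sense in which the averaged model describes the expected effect of randomly applied nudges.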
In this letter, we will consider nudges as intentional disruptions that can change the reinforcement contingencies associated with a behavior, and we will examine three different examples. The first one in Section IV-B deals with a case when $v(t)$ is fixed to be 1 (see Remark 1) and $b$ is constant. The second case in Section IV-C is when $v(t)$ is optimized over, and the third case in Section IV-D is when $b(t)$ is time-varying, compared with a receding horizon setting. It is important to note that the examples discussed are simplified abstractions. The intention of presenting the examples is to show how the solution of OCPv in this letter can potentially be used for behavior nudging with elements from behavioral momentum theory. In all of the figures, we will report $\beta(t) = 10^{x(t)}$ and $\bar{u}(t) = \frac{1}{v(t)} u(t)$. The code for generating the results can be found on https://gitlab.com/aauadapt-t2d/nudging_entropyocp.

B. Case With a Constant Rate of Nudges
Consider a case in which a diabetic subject is not following their prescribed medication regimen, such as failing to administer the correct dose of insulin or taking a lower or higher dose than what was prescribed, due to some constant average rate of a reinforcer $r = 7$ [1/Week]. Here the reinforcer could be the inconvenience of administering the dose and/or the economic burden. Assume that we have three different types of disruptions: $\bar{u}_1$ being the probability of sending dose reminder text messages to the subject with an effect of $b_1 = 0.2$, $\bar{u}_2$ being the probability of sending personalized encouraging text messages to the subject (e.g., reminding them about the importance of their health to their family) with an effect of $b_2 = 0.3$, and $\bar{u}_3$ being the probability of a phone call from a medical staff member reminding them about the importance of their health with an effect of $b_3 = 0.4$. Our case study assumes that a call from a medical staff member is the most effective method, while sending unpersonalized reminders is the least effective. Additionally, consider that we desire to fix the rate of nudges to a constant $v = 1$ [1/Week]. Phone calls from medical staff can be costly and labor intensive. To account for this, we define a linear cost for the different options $c = -[0.1\ 0.5\ 1]^T$, giving a higher cost for $\bar{u}_3$ and a lower cost for $\bar{u}_1$. A higher cost for $\bar{u}_2$ than for $\bar{u}_1$ is used since the second type of nudge requires obtaining personal information regarding the subjects and formulating specific text messages for them. This cannot be easily automated when compared to just sending dose reminders with $\bar{u}_1$. Additionally, we choose $d = e = -2$ to lower $x(t)$ within a time horizon $t_f = 24$ [Week]. Finally, we select $\eta = 1$ for the entropy function.
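The schedule for this example can be sketched directly from Remark 1. The sketch below is our reading, not the letter's code: it assumes a maximized objective made of the integral of η⁻¹φ(u) + cᵀu + d x plus the terminal term e x(t_f), dynamics ẋ = B u with constant B = −bᵀ/√r and no state matrix, so that the adjoint satisfies λ̇ = −d with λ(t_f) = e, and ζ(t) = η(c + Bᵀλ(t)). All function names are hypothetical.

```python
import numpy as np

# Parameters of the Section IV-B example.
eta = 1.0
r = 7.0
b = np.array([0.2, 0.3, 0.4])
c = -np.array([0.1, 0.5, 1.0])
d = e = -2.0
t_f = 24.0

B = -b / np.sqrt(r)  # entries of the row vector B = -(1/sqrt(r)) b^T (constant here)

def adjoint(t):
    """Assumed adjoint: solves lambda_dot = -d backwards from lambda(t_f) = e."""
    return e + d * (t_f - t)

def nudge_probabilities(t):
    """Softmax form of Remark 1 with zeta(t) = eta * (c + B^T lambda(t)) (assumed)."""
    zeta = eta * (c + B * adjoint(t))
    w = np.exp(zeta - zeta.max())  # shift for numerical stability
    return w / w.sum()

times = np.linspace(0.0, t_f, 241)
U = np.array([nudge_probabilities(t) for t in times])  # rows: time, columns: nudge types
```

Under these assumptions, the phone-call probability is the largest at the start of the horizon and decreases over time, while plain text reminders become the most likely nudge near t_f, and every probability stays strictly positive.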
Figure 1 shows the results of applying the solution in (7) with $v = 1$ compared to a solution to the problem obtained numerically by using forward Euler with a discretization step $T_d = 0.01$ to discretize the dynamics, lifting the problem, and then solving it using SDPT3 [13] with CVX [14]. The numerical solution matches the closed-form solution, which further validates it. We can see from the solution that the reliance on medical staff and personalized reminders is higher at the beginning than on text reminders, but slowly decreases with time to reduce the burden on the medical staff. Additionally, none of the nudging techniques has a zero contribution at any point in time and there is always a mix of all of them ((4b) will always be strictly positive).
In Figure 2, we compare our closed-form solution with a numerical solution obtained using a quadratic cost $-u^T u$ instead of the entropy in OCPv. Additionally, we simulate the response $\beta(t)$ in a case where the medical staff become unavailable after the first week, rendering $b_3 = 0$ in simulation only and not in the calculation of the input nudges. We can see from the figure that the input nudges with the entropy objective are smoother than the ones calculated with a quadratic cost. Additionally, we see that for the quadratic cost case, the text reminders were not used at all until almost 10 weeks from the beginning of the scheduling of nudges. This is not preferable, since it is desired for the subject to become acquainted with the different nudging techniques as early as possible to handle technical and personal difficulties from the beginning. The average rates of the behavior $\beta(t)$ for the quadratic objective case and the entropy objective case are very similar. For the case when $b_3 = 0$, the entropy objective case has a lower $\beta(t)$ curve over time than the curve obtained with the quadratic objective.
This is expected, since maximizing the entropy encourages the use of all the available resources, which offers robustness in case of the sudden absence of one or more resources.

C. Case With a Time-Varying Nudge Rate
We consider in this section the same case as in Section IV-B, but now we optimize over $v(t)$. Figure 3 shows the results when we choose $q = -1$ against a numerical solution obtained using CVX and SDPT3. We observe from Figure 3 that the numerical solution matches the closed-form solution, which further validates it. We notice from the solution that the rate of nudges $v$ at the beginning has a value greater than 2 [1/Week], and phone calls from medical staff have the highest share of the different types of nudges. Afterwards, the nudge rate $v$ decreases to below 1 [1/Week] while the reliance on text reminders increases until it finally becomes the nudge with the highest contribution. Allowing $v$ to vary gives the opportunity to lower it while the average behavioral rate $\beta(t)$ is decreasing, which prevents overburdening the subject with nudges that could lead to alert fatigue. In Figure 4, we compare the solutions when $v = 1$ with two cases of varying $v$ with $q = -1$ and $q = -2$. We can see from the figure that for both cases of varying $v$, the average rate $\beta(t)$ decreases faster than in the case of a fixed rate, due to $v$ starting with a value greater than 1. Additionally, we observe that the inputs $\bar{u}$ are identical for all the cases, with $\bar{u}_3$ being the highest at the beginning and the lowest towards the end. Notice how increasing $|q|$ makes $v$ start at a lower value, which helps to reduce the risk of alert fatigue from the beginning.
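The observation that the normalized inputs are identical for every choice of q can be motivated by a scaling identity of the entropy. The following is a sketch under our own assumption that the part of the Hamiltonian depending on u has the form $\eta^{-1}\varphi(u) + w^T u$, with $w$ collecting the linear weights (a hypothetical symbol), and with $u = v\bar{u}$, $\bar{u}$ on the unit simplex:

```latex
\begin{align*}
\varphi(v\bar{u})
  &= -\sum_{i=1}^{n_u} v\bar{u}_i \ln(v\bar{u}_i)
   = v\,\varphi(\bar{u}) - v\ln v,
   \qquad \mathbf{1}^T\bar{u} = 1, \\
\eta^{-1}\varphi(v\bar{u}) + w^{T} v\bar{u}
  &= v\left[\eta^{-1}\varphi(\bar{u}) + w^{T}\bar{u}\right] - \eta^{-1} v\ln v .
\end{align*}
```

For any $v > 0$, maximizing over $\bar{u}$ only involves the bracketed term, whose maximizer is $\exp_e(\eta w)/(\mathbf{1}^T \exp_e(\eta w))$. Under this assumption the maximizer does not depend on $v$ or on $q$, which is consistent with the identical $\bar{u}$ curves reported for Figure 4.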

D. Receding Horizon Case
In this section, we demonstrate how the solution of OCPv can be used in a receding horizon fashion to adapt to changes in the parameters of the model. We introduce "feedback" by utilizing recursively estimated values of the model's parameters for the computation of a new scheduling scheme. We choose $\rho = 5$, $q_p = 0.5q$, and $\eta_p = 10\eta$ in (8) for the receding horizon solution. For the simulation, we consider a case in which the effect $b_3$ of a phone call from the medical staff vanishes for a while during treatment. For the receding horizon case, we consider that every week $t_j$ only a noisy and delayed estimate $\hat{b}(t_j)$ of $b(t)$ is available, with the noise generated from $\mathcal{N}(0, I)$. The solution of the receding horizon for each week then uses a constant $\hat{b}(t_j)$ for the entire week with $t_f = 24$ [Week]. Figure 5 shows the results. We can see from the results that the open-loop response of $\beta$ with perfect knowledge of $b(t)$ and the response with the receding horizon and imperfect knowledge of $b$ are very similar. As for the inputs, we can see how they are affected by the noise and the delay during the simulation. Despite the presence of noise and delay, the receding horizon solution is able to follow the trend of the optimal open-loop solution for the case of perfect knowledge of $b(t)$.

V. CONCLUSION AND FUTURE WORK
We presented an OCP in which the inputs are constrained to a variable-sized simplex, with the size being another input to optimize over. We showed that with the inclusion of the entropy function in the objective, it is possible to derive closed-form solutions when the dynamics are linear and the objectives are linear in the states and the simplex inputs, and quadratic in the size of the simplex. A possible future research direction is to study a more general class of OCPs with entropy and simplex constraints. We also demonstrated how the formulated OCP can potentially be used in conjunction with behavioral momentum theory to help schedule nudges that discourage unhealthy behaviors, such as non-adherence to medication. This letter is a starting point for utilizing control theory methods with behavioral momentum theory for nudging design. Future work will focus on incorporating more complex behavioral momentum models, comparing this framework with other frameworks such as the one in [7], developing and performing system identification for behavioral momentum models, and applying the solutions in a real-life setting using a receding horizon approach.