
5 Dynamic Programming

5.1 Derivation of Bellman's PDE

5.1.1 Dynamic Programming

We begin with some mathematical wisdom:

“It is sometimes easier to solve a problem by embedding it within a larger class of problems and then solving the larger class all at once.”

A Calculus Example

Suppose we wish to calculate the value of the integral

$$\int_0^\infty \frac{\sin x}{x}\,dx.$$

This is pretty hard to do directly, so let us instead embed it in a family of integrals by introducing a parameter $\alpha$:

$$I(\alpha):=\int_0^\infty e^{-\alpha x}\,\frac{\sin x}{x}\,dx.$$

We compute

$$I'(\alpha)=\int_0^\infty(-x)\,e^{-\alpha x}\,\frac{\sin x}{x}\,dx=-\int_0^\infty e^{-\alpha x}\sin x\,dx=-\frac{1}{\alpha^2+1},$$

where we integrated by parts twice to find the last equality. Consequently

$$I(\alpha)=-\arctan\alpha+C,$$

and we must compute the constant C. To do so, observe that

$$0=I(\infty)=-\arctan(\infty)+C=-\frac{\pi}{2}+C,$$

and so $C=\frac{\pi}{2}$. Hence $I(\alpha)=-\arctan\alpha+\frac{\pi}{2}$, and consequently

$$\int_0^\infty\frac{\sin x}{x}\,dx=I(0)=\frac{\pi}{2}.$$
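As a quick numerical sanity check of this parameter trick, the sketch below (using numpy/scipy; the sample values of $\alpha$ are arbitrary) compares the closed form $I(\alpha)=-\arctan\alpha+\frac{\pi}{2}$ with direct numerical integration:

```python
# Sanity check of I(alpha) = -arctan(alpha) + pi/2 (illustrative values of alpha only).
import numpy as np
from scipy.integrate import quad

def I_numeric(alpha):
    # np.sinc(x / np.pi) equals sin(x)/x and handles x = 0 without dividing by zero
    integrand = lambda x: np.exp(-alpha * x) * np.sinc(x / np.pi)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for alpha in (0.5, 1.0, 2.0):
    print(alpha, I_numeric(alpha), -np.arctan(alpha) + np.pi / 2)
```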

Back to control

We want to adapt some version of this idea to the vastly more complicated setting of control theory. For this, fix a terminal time T>0 and then look at the controlled dynamics

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=f(x(s),\alpha(s))\\ x(0)=x^0,\end{cases}$$

with the associated payoff functional

$$\text{(P)}\quad P[\alpha(\cdot)]=\int_0^T r(x(s),\alpha(s))\,ds+g(x(T)).$$

We embed this into a larger family of similar problems, by varying the starting times and starting points:

$$\text{(5.1)}\quad\begin{cases}\dot{x}(s)=f(x(s),\alpha(s)) & (t\le s\le T)\\ x(t)=x,\end{cases}$$

with

$$\text{(5.2)}\quad P_{x,t}[\alpha(\cdot)]=\int_t^T r(x(s),\alpha(s))\,ds+g(x(T)).$$

Consider the above problems for all choices of starting times $0\le t\le T$ and all initial points $x\in\mathbb{R}^n$.

Definition: value function

For $x\in\mathbb{R}^n$, $0\le t\le T$, define the value function $v(x,t)$ to be the greatest payoff possible if we start at $x\in\mathbb{R}^n$ at time $t$. In other words,

$$\text{(5.3)}\quad v(x,t):=\sup_{\alpha(\cdot)\in\mathcal{A}}P_{x,t}[\alpha(\cdot)]\qquad(x\in\mathbb{R}^n,\ 0\le t\le T).$$

Notice then that

$$\text{(5.4)}\quad v(x,T)=g(x)\qquad(x\in\mathbb{R}^n).$$

5.1.2 Derivation of Hamilton-Jacobi-Bellman Equation

Our first task is to show that the value function v satisfies a certain nonlinear partial differential equation.

Our derivation will be based upon the reasonable principle that

"it's better to be smart from the beginning, than to be stupid for a time and then become smart".

We want to convert this philosophy of life into mathematics.

To simplify, we hereafter suppose that the set A of control parameter values is compact.

Theorem 5.1 (Hamilton-Jacobi-Bellman Equation)

Assume that the value function $v$ is a $C^1$ function of the variables $(x,t)$. Then $v$ solves the nonlinear partial differential equation

$$\text{(HJB)}\quad v_t(x,t)+\max_{a\in A}\bigl\{f(x,a)\cdot\nabla_x v(x,t)+r(x,a)\bigr\}=0\qquad(x\in\mathbb{R}^n,\ 0\le t<T),$$

with the terminal condition

$$v(x,T)=g(x)\qquad(x\in\mathbb{R}^n).$$

Remark

We call (HJB) the Hamilton-Jacobi-Bellman equation, and can rewrite it as

$$\text{(HJB)}\quad v_t(x,t)+H(x,\nabla_x v)=0\qquad(x\in\mathbb{R}^n,\ 0\le t<T),$$

for the partial differential equations Hamiltonian

$$H(x,p):=\max_{a\in A}H(x,p,a)=\max_{a\in A}\bigl\{f(x,a)\cdot p+r(x,a)\bigr\},$$

where $x,p\in\mathbb{R}^n$.

Proof

  1. Let $x\in\mathbb{R}^n$, $0\le t<T$ and let $h>0$ be given. As always,
$$\mathcal{A}=\bigl\{\alpha(\cdot):[0,\infty)\to A\ \text{measurable}\bigr\}.$$

Pick any parameter $a\in A$ and use the constant control

$$\alpha(\cdot)\equiv a$$

for times $t\le s\le t+h$. The dynamics then arrive at the point $x(t+h)$, where $t+h<T$. Suppose now at time $t+h$ we switch to an optimal control and use it for the remaining times $t+h\le s\le T$.

What is the payoff of this procedure? Now for times $t\le s\le t+h$, we have

$$\begin{cases}\dot{x}(s)=f(x(s),a)\\ x(t)=x.\end{cases}$$

The payoff for this time period is $\int_t^{t+h}r(x(s),a)\,ds$. Furthermore, the maximum payoff obtainable from time $t+h$ to $T$ is $v(x(t+h),t+h)$, according to the definition of the value function $v$. Hence the total payoff is

$$\int_t^{t+h}r(x(s),a)\,ds+v(x(t+h),t+h).$$

But the greatest possible payoff if we start from $(x,t)$ is $v(x,t)$. Therefore

$$\text{(5.5)}\quad v(x,t)\ge\int_t^{t+h}r(x(s),a)\,ds+v(x(t+h),t+h).$$
  2. We now want to convert this inequality into a differential form. So we rearrange (5.5) and divide by $h>0$:
$$\frac{v(x(t+h),t+h)-v(x,t)}{h}+\frac{1}{h}\int_t^{t+h}r(x(s),a)\,ds\le 0.$$

Let $h\to 0$:

$$v_t(x,t)+\nabla_x v(x(t),t)\cdot\dot{x}(t)+r(x(t),a)\le 0.$$

But $x(\cdot)$ solves the ODE

$$\begin{cases}\dot{x}(s)=f(x(s),a) & (t\le s\le t+h)\\ x(t)=x.\end{cases}$$

Employ this above, to discover:

$$v_t(x,t)+f(x,a)\cdot\nabla_x v(x,t)+r(x,a)\le 0.$$

This inequality holds for all control parameters $a\in A$, and consequently

$$\text{(5.6)}\quad\max_{a\in A}\bigl\{v_t(x,t)+f(x,a)\cdot\nabla_x v(x,t)+r(x,a)\bigr\}\le 0.$$
  3. We next demonstrate that in fact the maximum above equals zero. To see this, suppose $\alpha^*(\cdot),x^*(\cdot)$ were optimal for the problem above. Let us utilize the optimal control $\alpha^*(\cdot)$ for $t\le s\le t+h$. The payoff is
$$\int_t^{t+h}r(x^*(s),\alpha^*(s))\,ds$$

and the remaining payoff is $v(x^*(t+h),t+h)$. Consequently, the total payoff is

$$\int_t^{t+h}r(x^*(s),\alpha^*(s))\,ds+v(x^*(t+h),t+h)=v(x,t).$$

Rearrange and divide by $h$:

$$\frac{v(x^*(t+h),t+h)-v(x,t)}{h}+\frac{1}{h}\int_t^{t+h}r(x^*(s),\alpha^*(s))\,ds=0.$$

Let $h\to 0$ and suppose $\alpha^*(t)=a^*\in A$. Then

$$v_t(x,t)+\nabla_x v(x,t)\cdot\dot{x}^*(t)+r(x,a^*)=0;$$

and therefore, since $\dot{x}^*(t)=f(x,a^*)$,

$$v_t(x,t)+f(x,a^*)\cdot\nabla_x v(x,t)+r(x,a^*)=0$$

for some parameter value $a^*\in A$. This proves (HJB).

5.1.3 The Dynamic Programming Method

Here is how to use the dynamic programming method to design optimal controls:

Step 1: Solve the Hamilton-Jacobi-Bellman equation, and thereby compute the value function v.

Step 2: Use the value function $v$ and the Hamilton-Jacobi-Bellman PDE to design an optimal feedback control $\alpha^*(\cdot)$, as follows. Define for each point $x\in\mathbb{R}^n$ and each time $0\le t\le T$,

$$\alpha(x,t)=a\in A$$

to be a parameter value where the maximum in (HJB) is attained. In other words, we select $\alpha(x,t)$ so that

$$v_t(x,t)+f(x,\alpha(x,t))\cdot\nabla_x v(x,t)+r(x,\alpha(x,t))=0.$$

Next we solve the following ODE, assuming $\alpha(\cdot,t)$ is sufficiently regular to let us do so:

$$\text{(ODE)}\quad\begin{cases}\dot{x}^*(s)=f(x^*(s),\alpha(x^*(s),s)) & (t\le s\le T)\\ x^*(t)=x.\end{cases}$$

Finally, define the feedback control

$$\text{(5.7)}\quad\alpha^*(s):=\alpha(x^*(s),s).$$

In summary, we design the optimal control this way: if the state of the system is $x$ at time $t$, use the control which at time $t$ takes on the parameter value $a\in A$ at which the maximum in (HJB) is attained.

We demonstrate next that this construction does indeed provide us with an optimal control.

Theorem 5.2 (Verification of Optimality)

The control $\alpha^*(\cdot)$ defined by the construction (5.7) is optimal.

Proof of Theorem 5.2

We have

$$P_{x,t}[\alpha^*(\cdot)]=\int_t^T r(x^*(s),\alpha^*(s))\,ds+g(x^*(T)).$$

Furthermore, according to the definition (5.7) of $\alpha^*(\cdot)$:

$$\begin{aligned}
P_{x,t}[\alpha^*(\cdot)]&=\int_t^T\bigl(-v_t(x^*(s),s)-f(x^*(s),\alpha^*(s))\cdot\nabla_x v(x^*(s),s)\bigr)\,ds+g(x^*(T))\\
&=-\int_t^T\bigl(v_t(x^*(s),s)+\nabla_x v(x^*(s),s)\cdot\dot{x}^*(s)\bigr)\,ds+g(x^*(T))\\
&=-\int_t^T\frac{d}{ds}v(x^*(s),s)\,ds+g(x^*(T))\qquad\text{(chain rule)}\\
&=-v(x^*(T),T)+v(x^*(t),t)+g(x^*(T))\\
&=-g(x^*(T))+v(x^*(t),t)+g(x^*(T))\\
&=v(x,t)=\sup_{\alpha(\cdot)\in\mathcal{A}}P_{x,t}[\alpha(\cdot)].
\end{aligned}$$

That is,

$$P_{x,t}[\alpha^*(\cdot)]=\sup_{\alpha(\cdot)\in\mathcal{A}}P_{x,t}[\alpha(\cdot)];$$

and so $\alpha^*(\cdot)$ is optimal, as asserted.

5.2 Examples

5.2.1 Example 1: Dynamics with Three Velocities

Let us begin with a fairly easy problem:

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=\alpha(s) & (0\le t\le s\le 1)\\ x(t)=x,\end{cases}$$

where our set of control parameters is

$$A=\{-1,0,1\}.$$

We want to minimize

$$\int_t^1|x(s)|\,ds,$$

and so take for our payoff functional

$$\text{(P)}\quad P_{x,t}[\alpha(\cdot)]=-\int_t^1|x(s)|\,ds.$$

As our first illustration of dynamic programming, we will compute the value function v(x,t) and confirm that it does indeed solve the appropriate Hamilton-Jacobi-Bellman equation. To do this, we first introduce the three regions:

(Figure: the three regions I, II, III in the $(x,t)$-plane.)

  • Region I $=\{(x,t)\mid x<t-1,\ 0\le t\le 1\}$.
  • Region II $=\{(x,t)\mid t-1<x<1-t,\ 0\le t\le 1\}$.
  • Region III $=\{(x,t)\mid x>1-t,\ 0\le t\le 1\}$.

We will consider the three cases as to which region the initial data $(x,t)$ lie within.

  • Region III: (Figure: the optimal path in Region III.)

In this case we should take $\alpha\equiv-1$, to steer as close to the origin $0$ as quickly as possible. Then

$$v(x,t)=-\text{area under path taken}=-(1-t)\cdot\tfrac{1}{2}\bigl(x+x+t-1\bigr)=-\frac{(1-t)}{2}(2x+t-1).$$

  • Region I: In this region, we should take $\alpha\equiv 1$, in which case we can similarly compute $v(x,t)=-\frac{(1-t)}{2}(-2x+t-1)$.

  • Region II: In this region we take $\alpha\equiv\mp 1$ until we hit the origin, after which we take $\alpha\equiv 0$. We therefore calculate that $v(x,t)=-\frac{x^2}{2}$ in this region.

Checking the Hamilton-Jacobi-Bellman PDE

Now the Hamilton-Jacobi-Bellman equation for our problem reads

$$\text{(5.8)}\quad v_t+\max_{a\in A}\{f\,v_x+r\}=0$$

for $f=a$, $r=-|x|$. We rewrite this as

$$v_t+\max_{a=\pm1,0}\{a\,v_x\}-|x|=0;$$

and so

$$\text{(HJB)}\quad v_t+|v_x|-|x|=0.$$

We must check that the value function $v$, defined explicitly above in Regions I–III, does in fact solve this PDE, with the terminal condition that $v(x,1)=g(x)=0$.

Now in Region II, $v=-\frac{x^2}{2}$, $v_t=0$, $v_x=-x$. Hence

$$v_t+|v_x|-|x|=0+|x|-|x|=0\quad\text{in Region II},$$

in accordance with (HJB). In Region III we have

$$v(x,t)=-\frac{(1-t)}{2}(2x+t-1);$$

and therefore

$$v_t=\frac12(2x+t-1)-\frac{(1-t)}{2}=x-1+t,\qquad v_x=t-1,\qquad |t-1|=1-t\ge 0.$$

Hence

$$v_t+|v_x|-|x|=x-1+t+|t-1|-|x|=0\quad\text{in Region III},$$

because $x>0$ there.

Likewise the Hamilton-Jacobi-Bellman PDE holds in Region I.
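As a sanity check, here is a small numerical sketch (the test points and step size are illustrative, not from the text) that compares the region-wise formulas for $v$ with the payoff obtained by simulating the feedback $\alpha=-\operatorname{sgn}(x)$, switched to $\alpha=0$ once the origin is reached:

```python
# Compare the closed-form value function of Example 1 with a direct simulation
# of the feedback alpha = -sgn(x) (alpha = 0 once the origin is reached).
import numpy as np

def v_closed_form(x, t):
    if x > 1 - t:                                  # Region III
        return -(1 - t) / 2 * (2 * x + t - 1)
    if x < t - 1:                                  # Region I
        return -(1 - t) / 2 * (-2 * x + t - 1)
    return -x**2 / 2                               # Region II

def v_simulated(x, t, dt=1e-4):
    payoff, s = 0.0, t
    while s < 1.0:
        a = 0.0 if abs(x) < dt else -np.sign(x)    # feedback control
        payoff -= abs(x) * dt                      # running payoff r = -|x|
        x += a * dt                                # Euler step of dx/ds = a
        s += dt
    return payoff

for x, t in [(1.2, 0.3), (-1.5, 0.2), (0.4, 0.1)]:  # one point per region
    print(x, t, v_closed_form(x, t), v_simulated(x, t))
```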

Remarks

  • In the example, $v$ is not continuously differentiable on the borderlines between Region II and Regions I and III.
  • In general, it may not be possible actually to find the optimal feedback control. For example, reconsider the above problem, but now with $A=\{-1,1\}$. We still have $\alpha=\operatorname{sgn}(v_x)$, but there is no optimal control in Region II.

5.2.2 Example 2: Rocket Railroad Car

Recall that the equations of motion in this model are

$$\begin{pmatrix}\dot{x}_1\\ \dot{x}_2\end{pmatrix}=\begin{pmatrix}0&1\\ 0&0\end{pmatrix}\begin{pmatrix}x_1\\ x_2\end{pmatrix}+\begin{pmatrix}0\\ 1\end{pmatrix}\alpha,\qquad |\alpha|\le 1,$$

and

$$P[\alpha(\cdot)]=-\bigl(\text{time to reach }(0,0)\bigr)=-\int_0^\tau 1\,dt=-\tau.$$

To use the method of dynamic programming, we define v(x1,x2) to be minus the least time it takes to get to the origin (0,0), given we start at the point (x1,x2).

What is the Hamilton-Jacobi-Bellman equation? Note v does not depend on t, and so we have

$$\max_{a\in A}\bigl\{f\cdot\nabla_x v+r\bigr\}=0$$

for

$$A=[-1,1],\qquad f=\begin{pmatrix}x_2\\ a\end{pmatrix},\qquad r=-1.$$

Hence our PDE reads

$$\max_{|a|\le 1}\bigl\{x_2 v_{x_1}+a\,v_{x_2}-1\bigr\}=0;$$

and consequently

$$\text{(HJB)}\quad\begin{cases}x_2 v_{x_1}+|v_{x_2}|-1=0 & \text{in }\mathbb{R}^2\\ v(0,0)=0.\end{cases}$$

Checking the Hamilton-Jacobi-Bellman PDE

We now confirm that v really satisfies (HJB). For this, define the regions

$$\text{I}:=\Bigl\{(x_1,x_2)\ \Big|\ x_1\ge -\tfrac12 x_2|x_2|\Bigr\},\qquad \text{II}:=\Bigl\{(x_1,x_2)\ \Big|\ x_1\le -\tfrac12 x_2|x_2|\Bigr\}.$$

A direct computation, the details of which we omit, reveals that

$$v(x)=\begin{cases}-x_2-2\bigl(x_1+\tfrac12 x_2^2\bigr)^{1/2} & \text{in Region I}\\[4pt] x_2-2\bigl(-x_1+\tfrac12 x_2^2\bigr)^{1/2} & \text{in Region II}.\end{cases}$$

In Region I we compute

$$v_{x_2}=-1-\Bigl(x_1+\tfrac{x_2^2}{2}\Bigr)^{-1/2}x_2,\qquad v_{x_1}=-\Bigl(x_1+\tfrac{x_2^2}{2}\Bigr)^{-1/2};$$

and therefore

$$x_2 v_{x_1}+|v_{x_2}|-1=-x_2\Bigl(x_1+\tfrac{x_2^2}{2}\Bigr)^{-1/2}+\Bigl[1+x_2\Bigl(x_1+\tfrac{x_2^2}{2}\Bigr)^{-1/2}\Bigr]-1=0.$$

This confirms that our (HJB) equation holds in Region I, and a similar calculation holds in Region II.

Optimal control

Since

$$\max_{|a|\le 1}\bigl\{x_2 v_{x_1}+a\,v_{x_2}-1\bigr\}=0,$$

the optimal control is

$$\alpha^*=\operatorname{sgn}(v_{x_2}).$$
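As a quick check, the following sketch (the sample points and step size are illustrative) evaluates the region-wise formula for $v$ and verifies the equation $x_2 v_{x_1}+|v_{x_2}|-1=0$ by central finite differences, away from the switching curve:

```python
# Finite-difference spot check that v satisfies x2*v_x1 + |v_x2| - 1 = 0 (Example 2).
import numpy as np

def v(x1, x2):
    if x1 >= -0.5 * x2 * abs(x2):                      # Region I
        return -x2 - 2.0 * np.sqrt(x1 + 0.5 * x2**2)
    return x2 - 2.0 * np.sqrt(-x1 + 0.5 * x2**2)       # Region II

def hjb_residual(x1, x2, h=1e-6):
    v_x1 = (v(x1 + h, x2) - v(x1 - h, x2)) / (2 * h)   # central differences
    v_x2 = (v(x1, x2 + h) - v(x1, x2 - h)) / (2 * h)
    return x2 * v_x1 + abs(v_x2) - 1.0

for x1, x2 in [(2.0, 1.0), (1.0, -0.5), (-2.0, 0.5), (-1.0, -1.0)]:
    print(x1, x2, hjb_residual(x1, x2))                # residuals should be ~0
```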

5.2.3 Example 3: General Linear-Quadratic Regulator

For this important problem, we are given matrices $M,B,D\in\mathbb{M}^{n\times n}$, $N\in\mathbb{M}^{n\times m}$, $C\in\mathbb{M}^{m\times m}$; and assume

$B$, $C$, $D$ are symmetric and nonnegative definite,

and

$C$ is invertible.

We take the linear dynamics

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=Mx(s)+N\alpha(s) & (t\le s\le T)\\ x(t)=x,\end{cases}$$

for which we want to minimize the quadratic cost functional

$$\int_t^T x(s)^\top B\,x(s)+\alpha(s)^\top C\,\alpha(s)\,ds+x(T)^\top D\,x(T).$$

So we must maximize the payoff

$$\text{(P)}\quad P_{x,t}[\alpha(\cdot)]=-\int_t^T x(s)^\top B\,x(s)+\alpha(s)^\top C\,\alpha(s)\,ds-x(T)^\top D\,x(T).$$

The control values are unconstrained, meaning that the control parameter values can range over all of $A=\mathbb{R}^m$.

We will solve by dynamic programming the problem of designing an optimal control. To carry out this plan, we first compute the Hamilton-Jacobi-Bellman equation

$$v_t+\max_{a\in\mathbb{R}^m}\bigl\{f\cdot\nabla_x v+r\bigr\}=0,$$

where

$$\begin{cases}f=Mx+Na\\ r=-x^\top Bx-a^\top Ca\\ g=-x^\top Dx.\end{cases}$$

Rewrite:

$$\text{(HJB)}\quad v_t+\max_{a\in\mathbb{R}^m}\bigl\{(\nabla v)^\top Na-a^\top Ca\bigr\}+(\nabla v)^\top Mx-x^\top Bx=0.$$

We also have the terminal condition

$$v(x,T)=-x^\top Dx.$$

Maximization

For what value of the control parameter $a$ is the maximum in (HJB) attained? To understand this, we define $Q(a):=(\nabla v)^\top Na-a^\top Ca$, and determine where $Q$ has a maximum by computing the partial derivatives $\frac{\partial Q}{\partial a_j}$ for $j=1,\dots,m$ and setting them equal to $0$. This gives the identities

$$\frac{\partial Q}{\partial a_j}=\sum_{i=1}^n v_{x_i}n_{ij}-2\sum_{i=1}^m a_i c_{ij}=0.$$

Therefore $(\nabla v)^\top N=2a^\top C$, and then $2Ca=N^\top\nabla v$. But $C^\top=C$. Therefore

$$a=\frac12 C^{-1}N^\top\nabla_x v.$$

This is the formula for the optimal feedback control: it will be very useful once we compute the value function $v$.

Finding the value function

We insert our formula $a=\frac12 C^{-1}N^\top\nabla v$ into (HJB), and this PDE then reads

$$\text{(HJB)}\quad\begin{cases}v_t+\frac14(\nabla v)^\top NC^{-1}N^\top\nabla v+(\nabla v)^\top Mx-x^\top Bx=0\\ v(x,T)=-x^\top Dx.\end{cases}$$

Our next move is to guess the form of the solution, namely

$$v(x,t)=x^\top K(t)\,x,$$

provided the symmetric $n\times n$-matrix valued function $K(\cdot)$ is properly selected. Will this guess work?

Now, since $x^\top K(T)x=v(x,T)=-x^\top Dx$, we must have the terminal condition that

$$K(T)=-D.$$

Next, compute that

$$v_t=x^\top\dot{K}(t)\,x,\qquad \nabla_x v=2K(t)x.$$

We insert our guess $v=x^\top K(t)x$ into (HJB), and discover that

$$x^\top\bigl\{\dot{K}(t)+K(t)NC^{-1}N^\top K(t)+2K(t)M-B\bigr\}x=0.$$

Look at the expression

$$2x^\top KMx=x^\top KMx+\bigl[x^\top KMx\bigr]^\top=x^\top KMx+x^\top M^\top Kx.$$

Then

$$x^\top\bigl\{\dot{K}+KNC^{-1}N^\top K+KM+M^\top K-B\bigr\}x=0.$$

This identity will hold if $K(\cdot)$ satisfies the matrix Riccati equation

$$\text{(R)}\quad\begin{cases}\dot{K}(t)+K(t)NC^{-1}N^\top K(t)+K(t)M+M^\top K(t)-B=0 & (0\le t<T)\\ K(T)=-D.\end{cases}$$

In summary, if we can solve the Riccati equation (R), we can construct an optimal feedback control

$$\alpha^*(t)=C^{-1}N^\top K(t)\,x^*(t).$$

Furthermore, (R) in fact does have a solution, as explained for instance in the book of Fleming-Rishel.
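To carry out this construction numerically, a minimal sketch could look as follows. The matrices $M,N,B,C,D$ below are illustrative placeholders (not from the text); the code integrates (R) backwards from $K(T)=-D$ and then simulates the closed loop under the feedback $\alpha^*(t)=C^{-1}N^\top K(t)x^*(t)$:

```python
# Backward integration of the Riccati equation (R) and closed-loop simulation (sketch).
import numpy as np
from scipy.integrate import solve_ivp

n, m, T = 2, 1, 5.0
M = np.array([[0.0, 1.0], [0.0, 0.0]])    # drift matrix (illustrative)
N = np.array([[0.0], [1.0]])              # input matrix
B = np.eye(n)                             # state cost
C = np.array([[1.0]])                     # control cost (invertible)
D = np.eye(n)                             # terminal cost

def riccati_rhs(t, k_flat):
    K = k_flat.reshape(n, n)
    # (R):  dK/dt = -K N C^{-1} N^T K - K M - M^T K + B
    dK = -K @ N @ np.linalg.solve(C, N.T) @ K - K @ M - M.T @ K + B
    return dK.ravel()

# integrate backwards in time from K(T) = -D
sol = solve_ivp(riccati_rhs, (T, 0.0), (-D).ravel(), dense_output=True, rtol=1e-8)

def feedback(t, x):
    K = sol.sol(t).reshape(n, n)
    return np.linalg.solve(C, N.T @ K @ x)            # alpha*(t) = C^{-1} N^T K(t) x(t)

closed_loop = solve_ivp(lambda t, x: M @ x + N @ feedback(t, x),
                        (0.0, T), np.array([1.0, 0.0]), rtol=1e-8)
print(closed_loop.y[:, -1])               # the state should be driven toward the origin
```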

5.2.4 Example 4: More General Linear-Quadratic Regulator with Cross-Term and Time-Varying Matrices

This example follows Chapter 11, Part 2 of Numerical Optimal Control by Moritz Diehl and Sebastien Gros.

For this important problem, we are given (possibly time-dependent) matrices $M,B,D\in\mathbb{M}^{n\times n}$, $N,Q\in\mathbb{M}^{n\times m}$, $C\in\mathbb{M}^{m\times m}$; and assume

$B$, $C$, $D$ are symmetric and nonnegative definite,

and

$C$ is invertible.

We take the linear dynamics

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=M(s)x(s)+N(s)\alpha(s) & (0\le s\le T)\\ x(0)=x^0,\end{cases}$$

for which we want to minimize the quadratic cost functional

$$\int_t^T\begin{bmatrix}x(s)\\ \alpha(s)\end{bmatrix}^\top\begin{bmatrix}B(s)&Q(s)\\ Q(s)^\top&C(s)\end{bmatrix}\begin{bmatrix}x(s)\\ \alpha(s)\end{bmatrix}ds+x(T)^\top D\,x(T).$$

So we must maximize the payoff

$$\text{(P)}\quad P_{x,t}[\alpha(\cdot)]=-\int_t^T\begin{bmatrix}x(s)\\ \alpha(s)\end{bmatrix}^\top\begin{bmatrix}B(s)&Q(s)\\ Q(s)^\top&C(s)\end{bmatrix}\begin{bmatrix}x(s)\\ \alpha(s)\end{bmatrix}ds-x(T)^\top D\,x(T),$$

and we define the value function as

$$v(x,t)=\sup_{\alpha(\cdot)}P_{x,t}[\alpha(\cdot)].$$

Also, we start from the assumption that the value function is quadratic. To motivate this, first observe that $v(x,T)=-x^\top Dx$ is quadratic. Thus, let us assume for now that $v(x,t)$ is quadratic for all time, i.e. $v(x,t)=x^\top K(t)x$ for some symmetric matrix $K(t)$. Under this assumption, the HJB equation reads

$$\begin{aligned}
-v_t&=\max_{a\in\mathbb{R}^m}\bigl\{f\cdot\nabla_x v+r\bigr\}\\
&=\max_{a\in\mathbb{R}^m}\Bigl\{2x^\top K(t)\bigl(M(t)x+N(t)a\bigr)-\begin{bmatrix}x\\ a\end{bmatrix}^\top\begin{bmatrix}B(t)&Q(t)\\ Q(t)^\top&C(t)\end{bmatrix}\begin{bmatrix}x\\ a\end{bmatrix}\Bigr\}\\
&=\max_{a\in\mathbb{R}^m}\Bigl\{-\begin{bmatrix}x\\ a\end{bmatrix}^\top\begin{bmatrix}B-KM-M^\top K & Q-KN\\ Q^\top-N^\top K & C\end{bmatrix}\begin{bmatrix}x\\ a\end{bmatrix}\Bigr\},
\end{aligned}$$

with the time arguments suppressed in the last line.

We now introduce the Schur complement lemma:

Lemma: Schur Complement

If the matrix $R$ is positive definite, then

$$\min_u\begin{bmatrix}x\\ u\end{bmatrix}^\top\begin{bmatrix}Q&S\\ S^\top&R\end{bmatrix}\begin{bmatrix}x\\ u\end{bmatrix}=x^\top\bigl(Q-SR^{-1}S^\top\bigr)x,$$

and the minimizer is $u^*(x)=-R^{-1}S^\top x$.
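A quick numerical spot check of the lemma, with random (purely illustrative) data:

```python
# Verify the Schur complement lemma on random data (sketch).
import numpy as np
rng = np.random.default_rng(0)

n, m = 3, 2
A = rng.standard_normal((m, m))
R = A @ A.T + m * np.eye(m)                 # positive definite block
S = rng.standard_normal((n, m))
Qb = rng.standard_normal((n, n)); Qb = Qb + Qb.T
x = rng.standard_normal(n)

u_star = -np.linalg.solve(R, S.T @ x)       # claimed minimizer
z = np.concatenate([x, u_star])
blk = np.block([[Qb, S], [S.T, R]])
print(z @ blk @ z, x @ (Qb - S @ np.linalg.solve(R, S.T)) @ x)   # should agree

u = rng.standard_normal(m)                  # any other u gives a value at least as large
print(np.concatenate([x, u]) @ blk @ np.concatenate([x, u]) >= z @ blk @ z)
```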

By the Schur complement lemma, the above HJB equation yields

$$-v_t=-x^\top\Bigl(B-KM-M^\top K-(Q-KN)C^{-1}\bigl(Q^\top-N^\top K\bigr)\Bigr)x,$$

which is again a quadratic form. Thus, as $v$ is quadratic at time $T$, it remains quadratic throughout the backwards evolution. The resulting matrix differential equation

$$\dot{K}=B-KM-M^\top K-(Q-KN)C^{-1}\bigl(Q^\top-N^\top K\bigr)$$

with terminal condition

$$K(T)=-D$$

is called the differential Riccati equation. Integrating it backwards in time allows us to compute the cost-to-go function for the above optimal control problem. By the Schur complement lemma, the corresponding feedback law is

$$\alpha(x,t)=-C(t)^{-1}\bigl(Q(t)^\top-N(t)^\top K(t)\bigr)x.$$
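For completeness, a small sketch of the corresponding Riccati right-hand side and feedback gain in the sign convention used above (all matrices evaluated at the current time; names are illustrative). It can be dropped into a backwards ODE solve exactly as in the sketch for Example 3:

```python
# Cross-term differential Riccati right-hand side and feedback gain (sketch).
import numpy as np

def riccati_cross_rhs(K, M, N, B, C, Q):
    # dK/dt = B - K M - M^T K - (Q - K N) C^{-1} (Q^T - N^T K)
    W = Q - K @ N
    return B - K @ M - M.T @ K - W @ np.linalg.solve(C, W.T)

def feedback_gain(K, N, C, Q):
    # alpha(x, t) = -C^{-1} (Q^T - N^T K(t)) x  =  -feedback_gain(...) @ x
    return np.linalg.solve(C, Q.T - N.T @ K)
```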

5.3 Dynamic Programming and the Pontryagin Maximum Principle

5.3.1 The Method of Characteristics.

Assume $H:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ and consider this initial-value problem for the Hamilton-Jacobi equation:

$$\text{(HJ)}\quad\begin{cases}u_t(x,t)+H(x,\nabla_x u(x,t))=0 & (x\in\mathbb{R}^n,\ 0<t<T)\\ u(x,0)=g(x).\end{cases}$$

A basic idea in PDE theory is to introduce some ordinary differential equations, the solution of which lets us compute the solution $u$. In particular, we want to find a curve $x(\cdot)$ along which we can, in principle at least, compute $u(x,t)$.

This section discusses this method of characteristics, to make clearer the connections between PDE theory and the Pontryagin Maximum Principle.

Notation

$$x(t)=\begin{pmatrix}x^1(t)\\ \vdots\\ x^n(t)\end{pmatrix},\qquad p(t)=\nabla_x u(x(t),t)=\begin{pmatrix}p^1(t)\\ \vdots\\ p^n(t)\end{pmatrix}.$$

Derivation of characteristic equations

We have

$$p^k(t)=u_{x_k}(x(t),t),$$

and therefore

$$\dot{p}^k(t)=u_{x_k t}(x(t),t)+\sum_{i=1}^n u_{x_k x_i}(x(t),t)\,\dot{x}^i(t).$$

Now suppose $u$ solves (HJ). We differentiate this PDE with respect to the variable $x_k$:

$$u_{t x_k}(x,t)=-H_{x_k}(x,\nabla u(x,t))-\sum_{i=1}^n H_{p_i}(x,\nabla u(x,t))\,u_{x_k x_i}(x,t).$$

Let $x=x(t)$ and substitute above:

$$\dot{p}^k(t)=-H_{x_k}\Bigl(x(t),\underbrace{\nabla_x u(x(t),t)}_{p(t)}\Bigr)+\sum_{i=1}^n\Bigl(\dot{x}^i(t)-H_{p_i}\Bigl(x(t),\underbrace{\nabla_x u(x(t),t)}_{p(t)}\Bigr)\Bigr)u_{x_k x_i}(x(t),t).$$

We can simplify this expression if we select $x(\cdot)$ so that

$$\dot{x}^i(t)=H_{p_i}(x(t),p(t))\qquad(1\le i\le n);$$

then

$$\dot{p}^k(t)=-H_{x_k}(x(t),p(t))\qquad(1\le k\le n).$$

These are Hamilton's equations, already discussed in a different context in §4.1:

$$\text{(H)}\quad\begin{cases}\dot{x}(t)=\nabla_p H(x(t),p(t))\\ \dot{p}(t)=-\nabla_x H(x(t),p(t)).\end{cases}$$

We next demonstrate that if we can solve (H), then this gives a solution to the PDE (HJ), satisfying the initial condition $u=g$ on $\{t=0\}$. Set $p^0=\nabla g(x^0)$. We solve (H), with $x(0)=x^0$ and $p(0)=p^0$. Next, let us calculate

$$\begin{aligned}
\frac{d}{dt}u(x(t),t)&=u_t(x(t),t)+\nabla_x u(x(t),t)\cdot\dot{x}(t)\\
&=-H\Bigl(x(t),\underbrace{\nabla_x u(x(t),t)}_{p(t)}\Bigr)+\underbrace{\nabla_x u(x(t),t)}_{p(t)}\cdot\nabla_p H(x(t),p(t))\\
&=-H(x(t),p(t))+p(t)\cdot\nabla_p H(x(t),p(t)).
\end{aligned}$$

Note also $u(x(0),0)=u(x^0,0)=g(x^0)$. Integrate, to compute $u$ along the curve $x(\cdot)$:

$$u(x(t),t)=\int_0^t\bigl(-H+\nabla_p H\cdot p\bigr)\,ds+g(x^0).$$

This gives us the solution, once we have calculated $x(\cdot)$ and $p(\cdot)$.
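A minimal numerical sketch of this procedure, for the illustrative choices $H(x,p)=\tfrac12 p^2$ and $u(x,0)=g(x)=\sin x$ (neither is from the text), integrating Hamilton's equations together with the value of $u$ along a characteristic:

```python
# Method of characteristics for u_t + H(x, u_x) = 0 with H = p^2/2, g = sin (sketch).
import numpy as np
from scipy.integrate import solve_ivp

g  = lambda x: np.sin(x)        # initial condition u(x, 0) = g(x)
dg = lambda x: np.cos(x)        # g'
H   = lambda x, p: 0.5 * p**2
H_p = lambda x, p: p
H_x = lambda x, p: 0.0

def char_rhs(t, y):
    x, p, u = y
    # Hamilton's equations (H) plus  du/dt = -H + p * H_p  along the curve
    return [H_p(x, p), -H_x(x, p), -H(x, p) + p * H_p(x, p)]

x0 = 0.3
sol = solve_ivp(char_rhs, (0.0, 1.0), [x0, dg(x0), g(x0)], rtol=1e-8)
x_t, p_t, u_t = sol.y[:, -1]
print("characteristic endpoint x(1):", x_t, " u(x(1), 1):", u_t)
```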

5.3.2 Connections between Dynamic Programming and the Pontryagin Maximum Principle

Return now to our usual control theory problem, with dynamics

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=f(x(s),\alpha(s)) & (t\le s\le T)\\ x(t)=x,\end{cases}$$

and payoff

$$\text{(P)}\quad P_{x,t}[\alpha(\cdot)]=\int_t^T r(x(s),\alpha(s))\,ds+g(x(T)).$$

As above, the value function is

$$v(x,t)=\sup_{\alpha(\cdot)}P_{x,t}[\alpha(\cdot)].$$

The next theorem demonstrates that the costate in the Pontryagin Maximum Principle is in fact the gradient in x of the value function v, taken along an optimal trajectory:

Theorem 5.3 (Costates and Gradients)

Assume $\alpha^*(\cdot),x^*(\cdot)$ solve the control problem (ODE), (P).

If the value function $v$ is $C^2$, then the costate $p^*(\cdot)$ occurring in the Maximum Principle is given by

$$p^*(s)=\nabla_x v(x^*(s),s)\qquad(t\le s\le T).$$

Proof

  1. As usual, suppress the superscript $*$. Define $p(t):=\nabla_x v(x(t),t)$. We claim that $p(\cdot)$ satisfies conditions (ADJ) and (M) of the Pontryagin Maximum Principle. To confirm this assertion, look at
$$\dot{p}^i(t)=\frac{d}{dt}v_{x_i}(x(t),t)=v_{x_i t}(x(t),t)+\sum_{j=1}^n v_{x_i x_j}(x(t),t)\,\dot{x}^j(t).$$

We know $v$ solves

$$v_t(x,t)+\max_{a\in A}\bigl\{f(x,a)\cdot\nabla_x v(x,t)+r(x,a)\bigr\}=0;$$

and, applying the optimal control $\alpha(\cdot)$, we find:

$$v_t(x(t),t)+f(x(t),\alpha(t))\cdot\nabla_x v(x(t),t)+r(x(t),\alpha(t))=0.$$
  2. Now freeze the time $t$ and define the function
$$h(x):=v_t(x,t)+f(x,\alpha(t))\cdot\nabla_x v(x,t)+r(x,\alpha(t))\le 0.$$

Observe that $h(x(t))=0$. Consequently $h(\cdot)$ has a maximum at the point $x=x(t)$; and therefore for $i=1,\dots,n$,

$$0=h_{x_i}(x(t))=v_{t x_i}(x(t),t)+f_{x_i}(x(t),\alpha(t))\cdot\nabla_x v(x(t),t)+f(x(t),\alpha(t))\cdot\nabla_x v_{x_i}(x(t),t)+r_{x_i}(x(t),\alpha(t)).$$

Substitute above:

$$\dot{p}^i(t)=v_{x_i t}+\sum_{j=1}^n v_{x_i x_j}f^j=v_{x_i t}+f\cdot\nabla_x v_{x_i}=-f_{x_i}\cdot\nabla_x v-r_{x_i}.$$

Recalling that $p(t)=\nabla_x v(x(t),t)$, we deduce that

$$\dot{p}(t)=-(\nabla_x f)^\top p-\nabla_x r.$$

Recall also

$$H=f\cdot p+r,\qquad \nabla_x H=(\nabla_x f)^\top p+\nabla_x r.$$

Hence

$$\dot{p}(t)=-\nabla_x H(x(t),p(t)),$$

which is (ADJ).

  3. Now we must check condition (M). According to (HJB),

$$v_t(x(t),t)+\max_{a\in A}\Bigl\{f(x(t),a)\cdot\underbrace{\nabla v(x(t),t)}_{p(t)}+r(x(t),a)\Bigr\}=0,$$

and the maximum occurs for $a=\alpha(t)$. Hence

$$\max_{a\in A}\bigl\{H(x(t),p(t),a)\bigr\}=H(x(t),p(t),\alpha(t));$$

and this is assertion (M) of the Maximum Principle.

Interpretations

The foregoing provides us with another way to look at transversality conditions:

  • Free endpoint problem: Recall that we stated earlier in Theorem 4.3 that for the free endpoint problem we have the condition
$$\text{(T)}\quad p^*(T)=\nabla g(x^*(T))$$

for the payoff functional

$$\int_t^T r(x(s),\alpha(s))\,ds+g(x(T)).$$

To understand this better, note $p^*(s)=\nabla v(x^*(s),s)$. But $v(x,T)=g(x)$, and hence the foregoing implies

$$p^*(T)=\nabla_x v(x^*(T),T)=\nabla g(x^*(T)).$$
  • Constrained initial and target sets:

Recall that for this problem we stated in Theorem 4.5 the transversality conditions that

$$\begin{cases}p^*(0)\ \text{is perpendicular to }T_0\\ p^*(\tau^*)\ \text{is perpendicular to }T_1,\end{cases}$$

where $\tau^*$ denotes the first time the optimal trajectory hits the target set $X_1$. Now let $v$ be the value function for this problem:

$$v(x)=\sup_{\alpha(\cdot)}P_x[\alpha(\cdot)],$$

with the constraint that we start at $x^0\in X_0$ and end at $x^1\in X_1$. But then $v$ will be constant on the set $X_0$ and also constant on $X_1$. Since $\nabla v$ is perpendicular to any level surface, $\nabla v$ is therefore perpendicular to both $X_0$ and $X_1$. And since

$$p^*(t)=\nabla v(x^*(t)),$$

this means that

$$\begin{cases}p^*\ \text{is perpendicular to }X_0\ \text{at }t=0\\ p^*\ \text{is perpendicular to }X_1\ \text{at }t=\tau^*.\end{cases}$$

5.4 Infinite Time Optimal Control

Let us now consider an infinite time optimal control problem, as follows:

$$\text{(ODE)}\quad\begin{cases}\dot{x}(s)=f(x(s),\alpha(s)), & s\in[0,\infty)\\ x(0)=x^0,\end{cases}$$

with the associated payoff functional

$$\text{(P)}\quad P[\alpha(\cdot)]=\int_0^\infty r(x(s),\alpha(s))\,ds.$$

The principle of optimality states that the value function of this problem, if it exists and is finite, must be stationary, i.e. independent of time. Setting $v_t(x,t)=0$ leads to the stationary HJB equation

$$0=\max_{a\in A}\bigl\{f(x,a)\cdot\nabla_x v(x)+r(x,a)\bigr\},$$

with the stationary optimal feedback control law

$$\alpha(x)=\arg\max_{a\in A}\bigl\{f(x,a)\cdot\nabla_x v(x)+r(x,a)\bigr\}.$$

This equation is easily solvable in the linear quadratic case, i.e., for an infinite-horizon linear quadratic optimal control problem with time-independent cost and system matrices. The solution is again quadratic and obtained by setting

$$\dot{K}=0$$

and solving

$$0=B-KM-M^\top K-(Q-KN)C^{-1}\bigl(Q^\top-N^\top K\bigr).$$

This equation is called the algebraic Riccati equation in continuous time. Its feedback law is a static linear gain:

$$\alpha(x)=-C^{-1}\bigl(Q^\top-N^\top K\bigr)x.$$
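In practice the algebraic Riccati equation is solved with standard linear-algebra routines. The sketch below (with illustrative matrices) uses scipy's solver for the minimization-form CARE, which corresponds to $K=-P$ in the sign convention used here, and then forms the static gain of the feedback law $\alpha(x)=-C^{-1}(Q^\top-N^\top K)x$:

```python
# Solve the continuous-time algebraic Riccati equation and form the static gain (sketch).
import numpy as np
from scipy.linalg import solve_continuous_are

M = np.array([[0.0, 1.0], [0.0, 0.0]])   # drift matrix (illustrative)
N = np.array([[0.0], [1.0]])             # input matrix
B = np.eye(2)                            # state cost
C = np.array([[1.0]])                    # control cost
Q = np.zeros((2, 1))                     # cross term (none in this example)

P = solve_continuous_are(M, N, B, C, s=Q)      # P = -K in the convention above
gain = np.linalg.solve(C, N.T @ P + Q.T)       # alpha(x) = -gain @ x
print(gain)
```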

Released under the MIT License.