Dynamic Programming#

Note

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. DP provides an essential foundation for the understanding of the methods presented in the rest of this book. In fact, all of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

Policy Evaluation#

First we consider how to compute the state-value function \(V_{\pi}\) for an arbitrary policy \(\pi\). This is called policy evaluation in the DP literature. Recall the Bellman equation for \(V_{\pi}\):

\[\begin{split} \begin{aligned} V_{\pi}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1})|S_{t}=s]\\ &= \sum_{a}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V_{\pi}(s')\right] \end{aligned} \end{split}\]

If the environment’s dynamics are completely known, then the Bellman equation is a system of \(|\mathcal{S}|\) linear equations in \(|\mathcal{S}|\) unknowns. In principle, its solution is a straightforward computation. For our purposes, iterative solution methods are most suitable. The initial approximation, \(V_{0}\), is chosen arbitrarily, and each successive approximation is obtained by using the Bellman equation for \(V_{\pi}\) as an update rule:

\[\begin{split} \begin{aligned} V_{k+1}(s) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma V_{k}(S_{t+1})|S_{t}=s]\\ &= \sum_{a}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V_{k}(s')\right] \end{aligned} \end{split}\]

Clearly, \(V_{k} = V_{\pi}\) is a fixed point of this update rule. Indeed, the sequence \(\{V_{k}\}\) can be shown in general to converge to \(V_{\pi}\) as \(k\to\infty\). This algorithm is called iterative policy evaluation.
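As a minimal sketch (not from the text above), assume the known dynamics are stored as a nested mapping `P[s][a]` of `(probability, next_state, reward)` triples with states numbered `0..n-1`, and the policy as `policy[s][a]` \(=\pi(a|s)\); these names and this layout are assumptions for illustration. Iterative policy evaluation then sweeps the Bellman backup until the values stop changing:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Approximate V_pi for a fixed policy by repeated Bellman backups.

    P      : P[s][a] -> list of (prob, next_state, reward) triples (assumed layout)
    policy : policy[s][a] -> pi(a|s)
    """
    V = np.zeros(len(P))                  # arbitrary initial approximation V_0
    while True:
        delta = 0.0
        for s in P:
            # expected one-step return under pi, bootstrapping from the current V
            v_new = sum(policy[s][a] * prob * (reward + gamma * V[s_next])
                        for a in P[s]
                        for prob, s_next, reward in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                  # in-place update; also converges to V_pi
        if delta < theta:                 # stop once the largest change is below theta
            break
    return V
```

The in-place ("sweeping") variant above overwrites each \(V(s)\) as soon as its new value is available; a two-array version that computes \(V_{k+1}\) entirely from \(V_{k}\) matches the equations literally, and both converge.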

Policy Improvement#

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function \(V_{\pi}\) for an arbitrary deterministic policy \(\pi\). For some state \(s\) we would like to know whether or not we should change the policy to deterministically choose an action \(a\ne \pi(s)\).

Theorem 6 (Policy improvement theorem)

Let \(\pi\) and \(\pi'\) be any pair of deterministic policies such that, for all \(s\in\mathcal{S}\),

\[Q_{\pi}(s, \pi'(s)) \ge V_{\pi}(s)\]

Then the policy \(\pi'\) must be as good as, or better than, \(\pi\); that is, \(V_{\pi'}(s) \ge V_{\pi}(s)\) for all \(s\in\mathcal{S}\).

Proof. The idea behind the proof of the policy improvement theorem is easy to understand: we keep expanding the \(Q_{\pi}\) side with the Bellman equation and reapplying the inequality \(Q_{\pi}(s, \pi'(s)) \ge V_{\pi}(s)\):

\[\begin{split} \begin{aligned} V_{\pi}(s) &\le Q_{\pi}(s, \pi'(s)) \\ &=\mathbb{E}_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1})|S_{t}=s,A_{t}=\pi'(s)]\\ &=\mathbb{E}_{\pi'}[R_{t+1} + \gamma V_{\pi}(S_{t+1})|S_{t}=s]\\ &\le\mathbb{E}_{\pi'}[R_{t+1} + \gamma Q_{\pi}(S_{t+1},\pi'(S_{t+1}))|S_{t}=s]\\ &=\mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}[R_{t+2}+\gamma V_{\pi}(S_{t+2})|S_{t+1},A_{t+1}=\pi'(S_{t+1})]|S_{t}=s]\\ &=\mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^{2}V_{\pi}(S_{t+2})|S_{t}=s]\\ &\le\mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3}V_{\pi}(S_{t+3})|S_{t}=s]\\ &\vdots\\ &\le\mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3}R_{t+4}+\dots|S_{t}=s]\\ &=V_{\pi'}(s) \end{aligned} \end{split}\]

Given a deterministic policy \(\pi\), consider the new greedy policy \(\pi'\) given by

\[ \begin{aligned} \pi'(s) := \arg\max_{a}Q_{\pi}(s, a) \end{aligned} \]

The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement. Suppose the new greedy policy, \(\pi'\), is as good as, but not better than, the old policy \(\pi\). Then \(V_{\pi}=V_{\pi'}\), and from the definition of \(\pi'\) it follows that \(V_{\pi'}(s)=\max_{a}Q_{\pi}(s,a)\) for all \(s\in\mathcal{S}\), which is precisely the Bellman optimality equation. Therefore \(V_{\pi'}\) must be \(V_{\ast}\), and both \(\pi\) and \(\pi'\) must be optimal policies.

So far we have considered the special case of deterministic policies. In the general case, a stochastic policy \(\pi\) specifies probabilities, \(\pi(a|s)\), for taking each action, \(a\), in each state, \(s\). The policy improvement theorem carries through as stated for the stochastic case.
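Under the same assumed `P[s][a]` layout as in the evaluation sketch, the improvement step is a one-step lookahead: compute \(Q_{\pi}(s,a)\) from \(V_{\pi}\) and act greedily. The helper names below are ours, not anything fixed by the text:

```python
def q_from_v(P, V, s, gamma=0.9):
    """One-step lookahead: Q_pi(s, a) for every action a, using the current V_pi."""
    return {a: sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in P[s][a])
            for a in P[s]}

def policy_improvement(P, V, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a Q_pi(s, a)."""
    greedy = {}
    for s in P:
        q = q_from_v(P, V, s, gamma)
        greedy[s] = max(q, key=q.get)   # ties broken by the insertion order of actions
    return greedy
```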

Policy Iteration#

Once a policy, \(\pi\), has been improved using \(V_{\pi}\) to yield a better policy, \(\pi'\), we can then compute \(V_{\pi'}\) and improve it again to yield an even better \(\pi''\). We can thus obtain a sequence of monotonically improving policies and value functions:

\[ \pi_{0}\overset{E}{\longrightarrow} V_{\pi_{0}}\overset{I}{\longrightarrow} \pi_{1}\overset{E}{\longrightarrow} V_{\pi_{1}}\overset{I}{\longrightarrow} \dots \]

where \(\overset{E}{\longrightarrow}\) denotes a policy evaluation and \(\overset{I}{\longrightarrow}\) denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). This process must converge to an optimal policy and optimal value function.
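Reusing the hypothetical `policy_evaluation` and `policy_improvement` sketches from the previous sections, a minimal policy iteration loop alternates the E and I steps and stops as soon as the greedy policy no longer changes:

```python
def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate evaluation (E) and greedy improvement (I) until the policy is stable."""
    # start from the equiprobable random policy
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma, theta)     # E: compute V_pi
        greedy = policy_improvement(P, V, gamma)           # I: act greedily w.r.t. V_pi
        new_policy = {s: {a: float(a == greedy[s]) for a in P[s]} for s in P}
        if new_policy == policy:    # the policy is stable, hence optimal
            return greedy, V
        policy = new_policy
```

Because a finite MDP has only finitely many deterministic policies and each improvement step yields a strictly better policy until the greedy policy stops changing, the loop terminates after finitely many iterations.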

Value Iteration#

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.

In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one update of each state. This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

\[ V_{k+1}(s) := \max_{a}\mathbb{E}[R_{t+1} + \gamma V_{k}(S_{t+1})|S_{t}=s,A_{t}=a] \]
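Expanding the expectation with the known dynamics, in the same notation as the policy evaluation update above, the backup replaces the average over actions under \(\pi\) with a maximum:

\[ V_{k+1}(s) = \max_{a}\sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V_{k}(s')\right] \]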

Another way of understanding value iteration is by reference to the Bellman optimality equation. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule.
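As a rough sketch under the same assumed `P[s][a]` layout as before, value iteration sweeps this max-backup until the values stop changing and then reads off a greedy policy from the converged values:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Turn the Bellman optimality equation into an update rule and sweep to convergence."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            # combined improvement + one-sweep evaluation: max over one-step lookaheads
            best = max(sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # extract a deterministic greedy policy from the converged values
    greedy = {s: max(P[s], key=lambda a, s=s: sum(prob * (reward + gamma * V[s_next])
                                                  for prob, s_next, reward in P[s][a]))
              for s in P}
    return greedy, V
```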