The Bellman Optimality Equation#

Note

The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore, necessary to define what optimal policies are.

Partial ordering between policies:

\[\pi_{1} \ge \pi_{2} \Leftrightarrow V_{\pi_1}(s) \ge V_{\pi_2}(s)\quad \forall s\in\mathcal{S}\]

Some policies are not comparable!

Optimal state-value function \(V^{\ast}(s)\) shows the maximum expected discounted return that one can achieve from state \(s\):

\[V^{\ast}(s) := \max_{\pi}V_{\pi}(s).\]

An optimal policy \(\pi^{\ast}\) is a policy whose state-value function attains the maximum over all policies simultaneously for all states, i.e. \(V_{\pi^{\ast}}(s) = V^{\ast}(s)\) for all \(s\in\mathcal{S}\).

This definition leads to many questions:

  • Existence: Does the optimal policy exist?

  • Uniqueness: Is the optimal policy unique?

  • Algorithm: How to obtain the optimal policy and the optimal state values?

Bellman optimality equation#

Bellman optimality equation (BOE):

\[\begin{split} \begin{aligned} v(s) &= \underset{\pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s,a)\left[R(s,a,s') + \gamma v(s')\right] \\ &= \underset{a}{\max}\sum_{s'\in\mathcal{S}}P(s'|s,a)\left[R(s,a,s') + \gamma v(s')\right] \end{aligned} \end{split}\]

where the second equality holds because a convex combination \(\sum_{a}\pi(a|s)\,q(s,a)\) is maximized by placing all probability on an action with the largest value \(q(s,a)\).

Matrix form:

\[v=\underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v) = f(v)\]

Tip

The BOE is an equation in the single unknown variable \(v\); a small sketch of the operator \(f\) is given below.
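To make the structure of \(f\) concrete, here is a minimal NumPy sketch of the elementwise BOE operator, assuming a finite MDP stored as hypothetical arrays `P` and `R` of shape `(S, A, S)` holding \(P(s'|s,a)\) and \(R(s,a,s')\):

```python
import numpy as np

def bellman_optimality_operator(v, P, R, gamma):
    """Apply f(v)(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma v(s')].

    P: assumed array of shape (S, A, S) with P[s, a, s'] = P(s'|s, a)
    R: assumed array of shape (S, A, S) with R[s, a, s'] = R(s, a, s')
    v: array of shape (S,) -- the current guess of the state values
    """
    # Expected one-step return for every (s, a) pair: shape (S, A)
    q = np.einsum("sap,sap->sa", P, R + gamma * v[None, None, :])
    # Elementwise maximization over actions gives f(v): shape (S,)
    return q.max(axis=1)
```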

Contraction mapping theorem#

Definition. Let \((X, d)\) be a metric space. Then a map \(T: X\to X\) is called a contraction mapping on \(X\) if there exists \(q\in[0,1)\) such that

\[d(Tx, Ty)\le qd(x, y)\]

for all \(x, y\in X\).

Theorem 2 (Banach fixed-point theorem)

Let \((X, d)\) be a non-empty complete metric space with a contraction mapping \(T: X\to X\). Then \(T\) admits a unique fixed-point \(x^{\ast}\) in \(X\) (i.e. \(T(x^{\ast})=x^{\ast}\)). Furthermore, \(x^{\ast}\) can be found as follows: start with an arbitrary element \(x_{0}\in X\) and define a sequence \(\{x_{n}\}_{n\in\mathbb{N}}\) by \(x_{n}=T(x_{n-1})\) for \(n\ge 1\). Then \(\lim_{n\to\infty}x_{n}=x^{\ast}\).

Proof. Omitted.
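As a minimal numerical illustration of the theorem (independent of any MDP), iterating a simple contraction on the real line converges to its unique fixed point:

```python
def fixed_point(T, x0, tol=1e-10, max_iter=1000):
    """Iterate x_n = T(x_{n-1}) until the update is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# T(x) = 0.5 x + 1 is a contraction with q = 0.5; its unique fixed point is x* = 2.
print(fixed_point(lambda x: 0.5 * x + 1, x0=100.0))  # -> approximately 2.0
```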

Theorem 3

The function \(f(v) = \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v)\) is a contraction mapping. In particular, for any \(v_{1},v_{2}\in\mathbb{R}^{|\mathcal{S}|}\), it holds that

\[\|f(v_{1}) - f(v_{2})\|_{\infty} \le \gamma\|v_{1}-v_{2}\|_{\infty},\]

where \(\gamma\in(0, 1)\) is the discount rate, and \(\|\cdot\|_{\infty}\) is the maximum norm, i.e. the maximum absolute value of the elements of a vector.

Proof. Consider any two vectors \(v_{1},v_{2}\in\mathbb{R}^{|\mathcal{S}|}\), and suppose that \(\pi_{1}^{\ast} = \arg\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v_{1})\) and \(\pi_{2}^{\ast} = \arg\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v_{2})\). Then

\[\begin{split}\begin{split} f(v_{1}) &= \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v_{1}) = r_{\pi_{1}^{\ast}}+ \gamma P_{\pi_{1}^{\ast}}v_{1} \ge r_{\pi_{2}^{\ast}}+ \gamma P_{\pi_{2}^{\ast}}v_{1} \\ f(v_{2}) &= \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v_{2}) = r_{\pi_{2}^{\ast}}+ \gamma P_{\pi_{2}^{\ast}}v_{2} \ge r_{\pi_{1}^{\ast}}+ \gamma P_{\pi_{1}^{\ast}}v_{2} \end{split}\end{split}\]

where \(\ge\) is an elementwise comparison. As a result,

\[\begin{split}\begin{split} f(v_{1}) - f(v_{2}) &= r_{\pi_{1}^{\ast}}+ \gamma P_{\pi_{1}^{\ast}}v_{1} - (r_{\pi_{2}^{\ast}}+ \gamma P_{\pi_{2}^{\ast}}v_{2}) \\ &\le r_{\pi_{1}^{\ast}}+ \gamma P_{\pi_{1}^{\ast}}v_{1} - (r_{\pi_{1}^{\ast}}+ \gamma P_{\pi_{1}^{\ast}}v_{2}) \\ &= \gamma P_{\pi_{1}^{\ast}}(v_{1} - v_{2}). \end{split}\end{split}\]

Similarly, it can be shown that \(f(v_{2}) - f(v_{1}) \le \gamma P_{\pi_{2}^{\ast}}(v_{2} - v_{1})\). Define

\[z = \max\{ |\gamma P_{\pi_{1}^{\ast}}(v_{1} - v_{2})|, |\gamma P_{\pi_{2}^{\ast}}(v_{1} - v_{2})|\} \in\mathbb{R}^{|\mathcal{S}|}.\]

It then follows that

\[\|f(v_{1}) - f(v_{2})\|_{\infty} \le \|z\|_{\infty}.\]

On the other hand, suppose \(z_{i}\) is the \(i\)-th entry of \(z\), and let \(p_{i}^{T}\) and \(q_{i}^{T}\) be the \(i\)-th rows of \(P_{\pi_{1}^{\ast}}\) and \(P_{\pi_{2}^{\ast}}\), respectively. Then

\[z_{i} = \max\{\gamma|p_{i}^{T}(v_{1} - v_{2})|, \gamma|q_{i}^{T}(v_{1} - v_{2})|\}\]

Since \(p_{i}\) is a vector with all nonnegative elements and the sum of the elements is equal to one, it follows that

\[|p_{i}^{T}(v_{1} - v_{2})| \le p_{i}^{T}|(v_{1} - v_{2})| \le \|v_{1} - v_{2}\|_{\infty}\]

Similarly, we have \(|q_{i}^{T}(v_{1} - v_{2})| \le \|v_{1} - v_{2}\|_{\infty}\). Therefore

\[\|z\|_{\infty} = \max_{i}|z_{i}| \le \gamma\|v_{1} - v_{2}\|_{\infty},\]

which, together with \(\|f(v_{1}) - f(v_{2})\|_{\infty} \le \|z\|_{\infty}\), gives \(\|f(v_{1}) - f(v_{2})\|_{\infty} \le \gamma\|v_{1} - v_{2}\|_{\infty}\) and completes the proof.
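As a sanity check (an illustration, not part of the proof), the contraction property of \(f\) can be verified numerically on a randomly generated MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: each row P[s, a, :] is normalized to a probability distribution.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A, S))

def f(v):
    """Elementwise BOE operator f(v)(s) = max_a sum_{s'} P(s'|s,a)[R + gamma v(s')]."""
    return np.einsum("sap,sap->sa", P, R + gamma * v[None, None, :]).max(axis=1)

v1, v2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(f(v1) - f(v2)))      # ||f(v1) - f(v2)||_inf
rhs = gamma * np.max(np.abs(v1 - v2))    # gamma * ||v1 - v2||_inf
assert lhs <= rhs + 1e-12
```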

Solving an optimal policy from the BOE#

By Theorem 3 and the contraction mapping theorem, the BOE \(v=f(v)=\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v)\) always has a unique solution \(v^{\ast}\), which can be computed iteratively via \(v_{k+1}=f(v_{k})\) starting from any initial guess \(v_{0}\).

Once the value of \(v^{\ast}\) has been obtained, we can easily obtain \(\pi^{\ast}\) by solving

\[\pi^{\ast} = \arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^{\ast}).\]
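A minimal value-iteration sketch of this procedure, assuming the same hypothetical `P` and `R` arrays of shape `(S, A, S)` as in the earlier sketches:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Solve v = f(v) by fixed-point iteration, then recover a greedy policy.

    P, R: assumed arrays of shape (S, A, S) holding P(s'|s, a) and R(s, a, s').
    Returns (v_star, pi_star), where pi_star[s] is the greedy action in state s.
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = np.einsum("sap,sap->sa", P, R + gamma * v[None, None, :])
        v_next = q.max(axis=1)                # v_{k+1} = f(v_k)
        if np.max(np.abs(v_next - v)) < tol:  # stop once the sup-norm update is tiny
            v = v_next
            break
        v = v_next
    # Extract a deterministic greedy policy from the (near-)optimal values.
    q = np.einsum("sap,sap->sa", P, R + gamma * v[None, None, :])
    return v, q.argmax(axis=1)
```

On the random MDP generated above, `value_iteration(P, R, 0.9)` approximates the unique fixed point of \(f\) together with a greedy policy.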

Theorem 4

The solution \(v^{\ast}\) is the optimal state value, and \(\pi^{\ast}\) is an optimal policy.

Proof. For any policy \(\pi\), it holds that

\[v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}\]

Since

\[v^{\ast} = \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v^{\ast}) = r_{\pi^{\ast}} + \gamma P_{\pi^{\ast}}v^{\ast} \ge r_{\pi} + \gamma P_{\pi}v^{\ast}\]

we have

\[v^{\ast} - v_{\pi} \ge (r_{\pi} + \gamma P_{\pi}v^{\ast}) - (r_{\pi} + \gamma P_{\pi}v_{\pi}) = \gamma P_{\pi}(v^{\ast} - v_{\pi})\]

Repeatedly applying the above inequality gives \(v^{\ast} - v_{\pi} \ge \gamma P_{\pi}(v^{\ast} - v_{\pi}) \ge \gamma^{2} P_{\pi}^{2}(v^{\ast} - v_{\pi})\ge \dots\). It follows that

\[v^{\ast} - v_{\pi} \ge \lim_{n\to\infty}\gamma^{n} P_{\pi}^{n}(v^{\ast} - v_{\pi})=0\]

where the last equality holds because \(\gamma<1\) and \(\|P_{\pi}^{n}(v^{\ast} - v_{\pi})\|_{\infty} \le \|v^{\ast} - v_{\pi}\|_{\infty}\) for every \(n\) (each \(P_{\pi}\) is a stochastic matrix). Hence \(v^{\ast} \ge v_{\pi}\) for every policy \(\pi\), which means \(v^{\ast}\) is the optimal state value and \(\pi^{\ast}\) is an optimal policy.

With \(v^{\ast}\) identified as the optimal state values, the BOE can be written elementwise as:

\[V^{\ast}(s) = \underset{\pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s,a)\left[R(s,a,s') + \gamma V^{\ast}(s')\right]\]

Tip

Now, it is clear why we must study the BOE: its solution corresponds to optimal state values and optimal policies.

Optimal action-value#

Definition of the optimal action-value:

\[Q^{\ast}(s, a) = \max_{\pi}Q_{\pi}(s, a)\]

From the relation between the state-value and action-value functions, it follows that:

\[Q^{\ast}(s, a) = \sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V^{\ast}(s')\right]\]
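Given \(v^{\ast}\) (for instance from the value-iteration sketch above), \(Q^{\ast}\) follows directly from this relation; a minimal sketch under the same assumed `P` and `R` arrays:

```python
import numpy as np

def optimal_action_values(v_star, P, R, gamma):
    """Q*(s, a) = sum_{s'} P(s'|s, a) [R(s, a, s') + gamma v*(s')].

    P, R: assumed arrays of shape (S, A, S); v_star: array of shape (S,).
    """
    return np.einsum("sap,sap->sa", P, R + gamma * v_star[None, None, :])
```

The deterministic policy of Theorem 5 below is then simply `Q.argmax(axis=1)`.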

Theorem 5

For any \(s\in\mathcal{S}\), the deterministic policy:

\[\begin{split}\pi_{\text{greedy}}(a|s) = \begin{cases} 1,\quad &a=\arg\max_{a'}Q^{\ast}(s, a')\\ 0,\quad &\text{otherwise} \end{cases}\end{split}\]

is an optimal policy for solving the BOE.

Proof. While the matrix-vector form of the optimal policy is \(\pi^{\ast} = \arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^{\ast})\), its elementwise form is

\[\begin{split}\begin{aligned} \pi^{\ast}(s) &= \arg\max_{\pi}\sum_{a}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V^{\ast}(s')\right] \\ & = \arg\max_{\pi}\sum_{a}\pi(a|s)Q^{\ast}(s, a) \end{aligned}\end{split}\]

It is clear that \(\sum_{a}\pi(a|s)Q^{\ast}(s, a)\) is maximized if \(\pi(s)\) selects the action with the greatest \(Q^{\ast}(s, a)\).

Other forms of Bellman Optimality Equation#

Since \(Q^{\ast}\) and \(V^{\ast}\) are both attained by the greedy policy \(\pi_{\text{greedy}}\), we obtain:

\[V^{\ast}(s) = \max_{a}Q^{\ast}(s, a)\]
\[Q^{\ast}(s, a) = \sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V^{\ast}(s')\right]\]

and

\[V^{\ast}(s) = \max_{a}\sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V^{\ast}(s')\right]\]
\[Q^{\ast}(s, a) = \sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma \max_{a'}Q^{\ast}(s', a')\right]\]
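The last form suggests iterating directly in \(Q\)-space; a minimal sketch, again under the assumed `P` and `R` arrays:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Q(s,a) <- sum_{s'} P(s'|s,a) [R(s,a,s') + gamma max_{a'} Q(s',a')].

    P, R: assumed arrays of shape (S, A, S) with the transition model and rewards.
    """
    Q = np.zeros(P.shape[:2])
    for _ in range(max_iter):
        target = R + gamma * Q.max(axis=1)[None, None, :]   # shape (S, A, S')
        Q_next = np.einsum("sap,sap->sa", P, target)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q
```

Up to the tolerance, `Q.max(axis=1)` should then coincide with \(v^{\ast}\), reflecting \(V^{\ast}(s) = \max_{a}Q^{\ast}(s, a)\).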