The Bellman Optimality Equation#
Note
The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore,
necessary to define what optimal policies are.
Partial ordering between policies: a policy \(\pi_{1}\) is said to be better than or equal to \(\pi_{2}\), written \(\pi_{1} \ge \pi_{2}\), if
\[V^{\pi_{1}}(s) \ge V^{\pi_{2}}(s), \quad \forall s\in\mathcal{S}.\]
Some policies are not comparable! Each may have the larger state value on a different subset of states, so neither dominates the other.
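As a concrete (and entirely hypothetical) illustration, the sketch below evaluates two deterministic policies on a tiny two-state MDP; each policy attains the higher value in a different state, so neither is better than the other. The MDP, the policies, and the use of NumPy are assumptions made only for this example.

```python
import numpy as np

# Hypothetical 2-state MDP (illustrative assumption, not from the text):
# every action moves to the other state; action 0 is rewarded only in state 0,
# action 1 only in state 1.
gamma = 0.9
P = np.array([[0.0, 1.0],      # next-state distribution from s0 (same for both actions)
              [1.0, 0.0]])     # next-state distribution from s1
r_A = np.array([1.0, 0.0])     # expected rewards under policy A: always take action 0
r_B = np.array([0.0, 1.0])     # expected rewards under policy B: always take action 1

# Closed-form policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi
I = np.eye(2)
v_A = np.linalg.solve(I - gamma * P, r_A)
v_B = np.linalg.solve(I - gamma * P, r_B)

print(v_A)  # ~[5.26, 4.74] -> A is better in state 0
print(v_B)  # ~[4.74, 5.26] -> B is better in state 1
# Neither v_A >= v_B nor v_B >= v_A holds in every state, so A and B are incomparable.
```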
Optimal state-value function \(V^{\ast}(s)\) is the maximum expected discounted return that can be achieved from state \(s\):
\[V^{\ast}(s) = \max_{\pi} V^{\pi}(s), \quad \forall s\in\mathcal{S}.\]
Optimal policy \(\pi^{\ast}\) is a policy whose state-value function attains this maximum over all policies simultaneously for all states, i.e., \(V^{\pi^{\ast}}(s) \ge V^{\pi}(s)\) for every policy \(\pi\) and every state \(s\in\mathcal{S}\).
This definition leads to many questions:
Existence: Does the optimal policy exist?
Uniqueness: Is the optimal policy unique?
Algorithm: How to obtain the optimal policy and the optimal state values?
Bellman optimality equation#
Bellman optimality equation (BOE), in elementwise form:
\[V(s) = \max_{\pi}\sum_{a}\pi(a|s)\Big(\sum_{r}p(r|s,a)\,r + \gamma\sum_{s'}p(s'|s,a)\,V(s')\Big), \quad \forall s\in\mathcal{S}.\]
Matrix-vector form:
\[v = \max_{\pi}(r_{\pi} + \gamma P_{\pi}v).\]
Tip
The BOE is an equation with the unknown variable \(v\): the maximizing policy \(\pi\) on the right-hand side itself depends on \(v\), so \(v\) appears on both sides.
Contraction mapping theorem#
Definition. Let \((X, d)\) be a metric space. A map \(T: X\to X\) is called a contraction mapping on \(X\) if there exists \(q\in[0,1)\) such that
\[d(T(x), T(y)) \le q\, d(x, y)\]
for all \(x, y\in X\).
(Banach fixed-point theorem)
Let \((X, d)\) be a non-empty complete metric space with a contraction mapping \(T: X\to X\). Then \(T\) admits a unique fixed-point \(x^{\ast}\) in \(X\) (i.e. \(T(x^{\ast})=x^{\ast}\)). Furthermore, \(x^{\ast}\) can be found as follows: start with an arbitrary element \(x_{0}\in X\) and define a sequence \(\{x_{n}\}_{n\in\mathbb{N}}\) by \(x_{n}=T(x_{n-1})\) for \(n\ge 1\). Then \(\lim_{n\to\infty}x_{n}=x^{\ast}\).
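A minimal numerical sketch of the iterative construction in the theorem, using an assumed scalar contraction \(T(x) = 0.5x + 1\) (contraction factor \(q = 0.5\), unique fixed point \(x^{\ast} = 2\)):

```python
# Fixed-point iteration x_n = T(x_{n-1}) for a simple scalar contraction.
def T(x):
    return 0.5 * x + 1.0   # |T(x) - T(y)| = 0.5 * |x - y|, so q = 0.5

x = 100.0                  # arbitrary starting point x_0
for _ in range(60):
    x = T(x)               # x_n = T(x_{n-1})

print(x)                   # ~2.0, the unique fixed point T(2) = 2
```

The same iteration, applied to the right-hand side of the BOE, is exactly the value-iteration procedure sketched later in this section.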
Proof. Omitted.
The function \(f(v) = \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v)\) is a contraction mapping. In particular, for any \(v_{1},v_{2}\in\mathbb{R}^{|\mathcal{S}|}\), it holds that
\[\|f(v_{1}) - f(v_{2})\|_{\infty} \le \gamma\|v_{1} - v_{2}\|_{\infty},\]
where \(\gamma\in(0, 1)\) is the discount rate and \(\|\cdot\|_{\infty}\) is the maximum norm, i.e., the maximum absolute value of the entries of a vector.
Proof. Consider any two vectors \(v_{1},v_{2}\in\mathbb{R}^{|\mathcal{S}|}\), and suppose that \(\pi_{1}^{\ast} = \arg\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v_{1})\) and \(\pi_{2}^{\ast} = \arg\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v_{2})\). Then
\[f(v_{1}) = r_{\pi_{1}^{\ast}} + \gamma P_{\pi_{1}^{\ast}}v_{1} \ge r_{\pi_{2}^{\ast}} + \gamma P_{\pi_{2}^{\ast}}v_{1},\]
\[f(v_{2}) = r_{\pi_{2}^{\ast}} + \gamma P_{\pi_{2}^{\ast}}v_{2} \ge r_{\pi_{1}^{\ast}} + \gamma P_{\pi_{1}^{\ast}}v_{2},\]
where \(\ge\) is an elementwise comparison. As a result,
\[f(v_{1}) - f(v_{2}) \le (r_{\pi_{1}^{\ast}} + \gamma P_{\pi_{1}^{\ast}}v_{1}) - (r_{\pi_{1}^{\ast}} + \gamma P_{\pi_{1}^{\ast}}v_{2}) = \gamma P_{\pi_{1}^{\ast}}(v_{1} - v_{2}).\]
Similarly, it can be shown that \(f(v_{2}) - f(v_{1}) \le \gamma P_{\pi_{2}^{\ast}}(v_{2} - v_{1})\). Define
\[z = \max\big(|\gamma P_{\pi_{1}^{\ast}}(v_{1} - v_{2})|,\; |\gamma P_{\pi_{2}^{\ast}}(v_{1} - v_{2})|\big) \in\mathbb{R}^{|\mathcal{S}|},\]
where \(\max(\cdot)\) and \(|\cdot|\) are applied elementwise. It then follows that
\[|f(v_{1}) - f(v_{2})| \le z.\]
On the other hand, suppose that \(z_{i}\) is the \(i\)-th entry of \(z\), and that \(p_{i}^{T}\) and \(q_{i}^{T}\) are the \(i\)-th rows of \(P_{\pi_{1}^{\ast}}\) and \(P_{\pi_{2}^{\ast}}\), respectively. Then
\[z_{i} = \max\big(\gamma|p_{i}^{T}(v_{1} - v_{2})|,\; \gamma|q_{i}^{T}(v_{1} - v_{2})|\big).\]
Since \(p_{i}\) is a vector with all nonnegative elements whose sum is equal to one, it follows that
\[|p_{i}^{T}(v_{1} - v_{2})| \le p_{i}^{T}|v_{1} - v_{2}| \le \|v_{1} - v_{2}\|_{\infty}.\]
Similarly, we have \(|q_{i}^{T}(v_{1} - v_{2})| \le \|v_{1} - v_{2}\|_{\infty}\). Therefore
\[\|z\|_{\infty} = \max_{i}|z_{i}| \le \gamma\|v_{1} - v_{2}\|_{\infty},\]
and hence
\[\|f(v_{1}) - f(v_{2})\|_{\infty} \le \|z\|_{\infty} \le \gamma\|v_{1} - v_{2}\|_{\infty},\]
which completes the proof.
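The bound \(\|f(v_{1}) - f(v_{2})\|_{\infty} \le \gamma\|v_{1} - v_{2}\|_{\infty}\) can also be checked numerically. The sketch below builds a random tabular MDP (the sizes, names, and seed are assumptions) and compares both sides of the inequality; since the maximum over policies is attained by a greedy deterministic policy, \(f(v)\) reduces to an elementwise maximum over actions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP: P[a, s, s'] = p(s' | s, a), r[s, a] = expected reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def f(v):
    # f(v) = max_pi (r_pi + gamma * P_pi v); the max is attained by a greedy
    # deterministic policy, hence an elementwise max over actions suffices.
    q = r + gamma * np.tensordot(P, v, axes=([2], [0])).T   # Q(s, a), shape (S, A)
    return q.max(axis=1)

v1, v2 = rng.standard_normal(n_states), rng.standard_normal(n_states)
lhs = np.max(np.abs(f(v1) - f(v2)))      # ||f(v1) - f(v2)||_inf
rhs = gamma * np.max(np.abs(v1 - v2))    # gamma * ||v1 - v2||_inf
print(lhs <= rhs + 1e-12, lhs, rhs)      # the contraction bound holds
```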
Solving an optimal policy from the BOE#
Now we have the following result: for the BOE \(v=f(v)=\max_{\pi}(r_{\pi}+ \gamma P_{\pi}v)\), there always exists a unique solution \(v^{\ast}\), which can be obtained iteratively via \(v_{k+1} = f(v_{k})\) starting from any initial guess \(v_{0}\).
Once the value of \(v^{\ast}\) has been obtained, we can easily obtain \(\pi^{\ast}\) by solving
\[\pi^{\ast} = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi}v^{\ast}).\]
The solution \(v^{\ast}\) is the optimal state value, and \(\pi^{\ast}\) is an optimal policy.
Proof. For any policy \(\pi\), it holds that
\[v^{\ast} = f(v^{\ast}) = \max_{\pi'}(r_{\pi'} + \gamma P_{\pi'}v^{\ast}) \ge r_{\pi} + \gamma P_{\pi}v^{\ast}.\]
Since
\[v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi},\]
we have
\[v^{\ast} - v_{\pi} \ge (r_{\pi} + \gamma P_{\pi}v^{\ast}) - (r_{\pi} + \gamma P_{\pi}v_{\pi}) = \gamma P_{\pi}(v^{\ast} - v_{\pi}).\]
Repeatedly applying the above inequality gives \(v^{\ast} - v_{\pi} \ge \gamma P_{\pi}(v^{\ast} - v_{\pi}) \ge \gamma^{2} P_{\pi}^{2}(v^{\ast} - v_{\pi})\ge \dots\). It follows that
\[v^{\ast} - v_{\pi} \ge \lim_{n\to\infty}\gamma^{n} P_{\pi}^{n}(v^{\ast} - v_{\pi}) = 0,\]
where the last equality holds because \(\gamma<1\) and \(\|P_{\pi}^{n}(v^{\ast} - v_{\pi})\|_{\infty} \le\dots\le \|v^{\ast} - v_{\pi}\|_{\infty}\) (each \(P_{\pi}\) is a stochastic matrix and hence does not increase the maximum norm).
Now we have \(v^{\ast} \ge v_{\pi}\) for any policy \(\pi\); that is, \(v^{\ast}\) is the greatest state value achievable by any policy, so it is indeed the optimal state value.
Tip
Now, it is clear why we must study the BOE: its solution corresponds to optimal state
values and optimal policies.
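Putting the pieces together: by the contraction mapping theorem, iterating \(v_{k+1} = f(v_{k})\) from any \(v_{0}\) converges to \(v^{\ast}\), and a greedy policy with respect to \(v^{\ast}\) is optimal. Below is a minimal value-iteration sketch on a randomly generated tabular MDP; the sizes, seed, and tolerance are illustrative assumptions, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP: P[a, s, s'] = p(s' | s, a), r[s, a] = expected reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def backup(v):
    # Q(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) * v(s')
    return r + gamma * np.tensordot(P, v, axes=([2], [0])).T

v = np.zeros(n_states)                      # arbitrary initial guess v_0
for _ in range(10_000):
    v_new = backup(v).max(axis=1)           # v_{k+1} = f(v_k)
    if np.max(np.abs(v_new - v)) < 1e-10:   # sup-norm stopping rule
        v = v_new
        break
    v = v_new

q_star = backup(v)                          # Q*(s, a) at the fixed point
pi_star = q_star.argmax(axis=1)             # deterministic greedy (optimal) policy
print(np.max(np.abs(q_star.max(axis=1) - v)))   # BOE residual, ~0
print(pi_star)                                  # one optimal action per state
```

By the contraction property established above, the iteration converges geometrically at rate \(\gamma\).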
Optimal action-value#
Definition of the optimal action-value function:
\[Q^{\ast}(s, a) = \sum_{r}p(r|s,a)\,r + \gamma\sum_{s'}p(s'|s,a)\,V^{\ast}(s').\]
It is easy to see, from the relation between the V-function and the Q-function, that
\[V^{\ast}(s) = \max_{a}Q^{\ast}(s, a).\]
For any \(s\in\mathcal{S}\), the deterministic greedy policy
\[\pi^{\ast}(a|s) = \begin{cases}1, & a = a^{\ast}(s),\\ 0, & a \ne a^{\ast}(s),\end{cases} \qquad \text{where } a^{\ast}(s) = \arg\max_{a}Q^{\ast}(s, a),\]
is an optimal policy for solving the BOE.
Proof. While the matrix-vector form of the optimal policy is \(\pi^{\ast} = \arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v^{\ast})\), its elementwise form is
\[\pi^{\ast}(s) = \arg\max_{\pi}\sum_{a}\pi(a|s)\Big(\sum_{r}p(r|s,a)\,r + \gamma\sum_{s'}p(s'|s,a)\,V^{\ast}(s')\Big) = \arg\max_{\pi}\sum_{a}\pi(a|s)Q^{\ast}(s, a).\]
It is clear that \(\sum_{a}\pi(a|s)Q^{\ast}(s, a)\) is maximized if \(\pi(s)\) selects the action with the greatest \(Q^{\ast}(s, a)\).
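A quick numerical sanity check of this last step (the numbers below are arbitrary): for any probability vector \(\pi(\cdot|s)\), the weighted average \(\sum_{a}\pi(a|s)Q^{\ast}(s, a)\) can never exceed \(\max_{a}Q^{\ast}(s, a)\), and the greedy choice attains that maximum.

```python
import numpy as np

rng = np.random.default_rng(2)
q_row = rng.standard_normal(4)                 # Q*(s, .) for one state (arbitrary values)
pi_row = rng.random(4)
pi_row /= pi_row.sum()                         # an arbitrary stochastic policy at s

print(pi_row @ q_row <= q_row.max() + 1e-12)   # True: no stochastic policy beats greedy
print(q_row.max())                             # value attained by the greedy action
```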
Other forms of Bellman Optimality Equation#
Since \(Q^{\ast}\) and \(V^{\ast}\) are both related to the greedy optimal policy \(\pi_{\text{greedy}}\), we can rewrite the BOE purely in terms of \(V^{\ast}\):
\[V^{\ast}(s) = \max_{a}\Big(\sum_{r}p(r|s,a)\,r + \gamma\sum_{s'}p(s'|s,a)\,V^{\ast}(s')\Big),\]
and purely in terms of \(Q^{\ast}\):
\[Q^{\ast}(s, a) = \sum_{r}p(r|s,a)\,r + \gamma\sum_{s'}p(s'|s,a)\max_{a'}Q^{\ast}(s', a').\]
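The \(Q^{\ast}\)-only form can be iterated directly in the same fixed-point fashion (often called Q-value iteration). A sketch under the same assumed random tabular setup as above:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)      # P[a, s, s'] = p(s' | s, a)
r = rng.random((n_states, n_actions))  # r[s, a] = expected immediate reward

Q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    # Q(s, a) <- sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) max_{a'} Q(s', a')
    Q_new = r + gamma * np.tensordot(P, Q.max(axis=1), axes=([2], [0])).T
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        Q = Q_new
        break
    Q = Q_new

V_star = Q.max(axis=1)     # recovers V*(s) = max_a Q*(s, a)
print(V_star)
```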