The Bellman Equation#

Note

The Bellman equation simplifies our state value or state-action value calculation.

Relation between V and Q function#

V from Q:

\[V_{\pi}(s) = \sum_{a\in\mathcal{A}}\pi(a|s)Q_{\pi}(s, a)\]

Q from V:

\[Q_{\pi}(s, a) = \sum_{s'\in\mathcal{S}}P(s'|s, a)\left[R(s,a,s') + \gamma V_{\pi}(s')\right]\]

Bellman equation of the state-value function#

With what we have learned so far, we know that if we calculate \(V_{\pi}(S_{t})\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after.

Instead of calculating the expected return for each state or each state-action pair, we can use the Bellman equation (if you know what Dynamic Programming is, this is very similar!). The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

The immediate reward \(R_{t+1}\) + the discounted value of the state that follows \(\gamma * V_{\pi}(S_{t+1})\).

Bellman equation of the action-value function#

The Bellman equation can also be expressed in terms of action values:

\[Q_{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma * Q_{\pi}(S_{t+1}, A_{t+1})|S_{t}=s, A_{t}=a]\]

Closed-form solution of the Bellman equation#

Bellman equation:

\[ V_{\pi}(s) = \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}P(s'|s,a)\left[R(s,a,s') + \gamma V_{\pi}(s')\right] \]

can be expressed in the matrix form:

\[v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}.\]

Next we will proof that \(I - \gamma P_{\pi}\) is invertible, so state-values \(v_{\pi} = (I - \gamma P_{\pi})^{-1}r_{\pi}\) is the unique solution of the Bellman equation.

Theorem 1 (Gershgorin circle theorem)

Let \(A\) be a \(n\times n\) matrix, with entries \(a_{ij}\). Let

\[R_i = \sum_{j\ne i}|a_{ij}|.\]

Let \(D(a_{ii}, R_i)\) be a closed disc centered at \(a_{ii}\) with radius \(R_{i}\). Such a disc is called a Gershgorin disc. Then every eigenvalue of \(A\) lies within at least one of the Gershgorin discs.

Proof. Let \(\lambda\) be an eigenvalue of \(A\) with corresponding eigenvector \(x\). Find \(i\) such that the element of \(x\) with the largest absolute value is \(x_{i}\). Since \(Ax=\lambda x\), in particular we take the \(i\)-th component of that equation to get:

\[\sum_{j}a_{ij}x_j = \lambda x_{i}.\]

Taking \(a_{ii}\) to the other side:

\[\sum_{j\ne i}a_{ij}x_j = (\lambda - a_{ii}) x_{i}.\]

Therefore:

\[|\lambda - a_{ii}| = \left|\sum_{j\ne i}\frac{a_{ij}x_j}{x_i}\right|\le \sum_{j\ne i}|a_{ij}| = R_{i}.\]

Now according to the Gershgorin circle theorem, every eigenvalue of \(I - \gamma P_{\pi}\) lies within at least one of the Gershgorin circles. The \(i\)-th Gershgorin circle has a center at \(1-\gamma p_{\pi}(s_i|s_i)\) and a radius equal to \(\sum_{j\ne i}\gamma p_{\pi}(s_j|s_i)\). Since \(\gamma<1\), we know that the radius is less than the magnitude of the center:

\[\sum_{j\ne i}\gamma p_{\pi}(s_j|s_i) < 1-\gamma p_{\pi}(s_i|s_i).\]

Therefore, all Gershgorin circles do not encircle the origin, and hence no eigenvalue \(I - \gamma P_{\pi}\) is zero.