The Bellman Equation#
Note
The Bellman equation simplifies our state value or state-action value calculation.
Relation between V and Q function#
V from Q:
Q from V:
Bellman equation of the state-value function#
With what we have learned so far, we know that if we calculate \(V_{\pi}(S_{t})\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after.
Instead of calculating the expected return for each state or each state-action pair, we can use the Bellman equation (if you know what Dynamic Programming is, this is very similar!). The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
The immediate reward \(R_{t+1}\) + the discounted value of the state that follows \(\gamma * V_{\pi}(S_{t+1})\).
Bellman equation of the action-value function#
The Bellman equation can also be expressed in terms of action values:
Closed-form solution of the Bellman equation#
Bellman equation:
can be expressed in the matrix form:
Next we will proof that \(I - \gamma P_{\pi}\) is invertible, so state-values \(v_{\pi} = (I - \gamma P_{\pi})^{-1}r_{\pi}\) is the unique solution of the Bellman equation.
(Gershgorin circle theorem)
Let \(A\) be a \(n\times n\) matrix, with entries \(a_{ij}\). Let
Let \(D(a_{ii}, R_i)\) be a closed disc centered at \(a_{ii}\) with radius \(R_{i}\). Such a disc is called a Gershgorin disc. Then every eigenvalue of \(A\) lies within at least one of the Gershgorin discs.
Proof. Let \(\lambda\) be an eigenvalue of \(A\) with corresponding eigenvector \(x\). Find \(i\) such that the element of \(x\) with the largest absolute value is \(x_{i}\). Since \(Ax=\lambda x\), in particular we take the \(i\)-th component of that equation to get:
Taking \(a_{ii}\) to the other side:
Therefore:
Now according to the Gershgorin circle theorem, every eigenvalue of \(I - \gamma P_{\pi}\) lies within at least one of the Gershgorin circles. The \(i\)-th Gershgorin circle has a center at \(1-\gamma p_{\pi}(s_i|s_i)\) and a radius equal to \(\sum_{j\ne i}\gamma p_{\pi}(s_j|s_i)\). Since \(\gamma<1\), we know that the radius is less than the magnitude of the center:
Therefore, all Gershgorin circles do not encircle the origin, and hence no eigenvalue \(I - \gamma P_{\pi}\) is zero.