Q-learning
Note
Recall that Sarsa can only estimate the action values of a given policy, and it must be combined with a policy improvement step to find optimal policies.
By contrast, Q-learning can directly estimate optimal action values and find
optimal policies.
The Q-Learning algorithm
Step 1: We initialize the Q-table
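As a quick illustration of this step, here is a minimal sketch, assuming a small discrete problem where the numbers of states and actions are known in advance (initializing every entry to 0 is a common choice):

```python
import numpy as np

def initialize_q_table(n_states, n_actions):
    # One row per state, one column per action; every estimate starts at 0.
    return np.zeros((n_states, n_actions))

q_table = initialize_q_table(n_states=16, n_actions=4)
```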
Step 2: Choose an action using the epsilon-greedy strategy
The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off. The idea is that:
With probability $1 - \epsilon$: we do exploitation (our agent selects the action with the highest state-action pair value).
With probability $\epsilon$: we do exploration (trying a random action).
At the beginning of training, the probability of doing exploration is huge since $\epsilon$ is very high, so most of the time the agent explores. As training goes on and the Q-table estimates improve, we progressively reduce $\epsilon$, since less and less exploration and more and more exploitation is needed.
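A minimal sketch of epsilon-greedy selection over the Q-table from Step 1 (the helper name and the `epsilon` schedule are illustrative assumptions, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(q_table, state, epsilon):
    # With probability epsilon: explore by trying a random action.
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    # With probability 1 - epsilon: exploit the current estimates.
    return int(np.argmax(q_table[state]))
```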
Step 3: Perform action $A_t$, get reward $R_{t+1}$ and next state $S_{t+1}$
Step 4: Update $Q(S_t, A_t)$
Remember that in TD learning, we update our value function after one step of the interaction.
In Q-learning, to produce our TD target, we use the immediate reward $R_{t+1}$ plus the discounted value of the best action in the next state, $\gamma \max_a Q(S_{t+1}, a)$. The update is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right],$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
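A sketch of this update in code, assuming the Q-table from Step 1 and hypothetical default values for the learning rate `alpha` and discount factor `gamma`:

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # TD target: immediate reward plus the discounted value of the best
    # action in the next state.
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a step of size alpha toward the TD target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```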
Tip
Q-learning is a stochastic approximation algorithm for solving the Bellman optimality equation expressed in terms of action values:

$$q(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right], \quad \text{for all } s, a.$$
Off-policy vs On-policy
Note
What makes Q-learning special compared to the other TD algorithms is that Q-learning is off-policy while the others are on-policy.
Two policies exist in any reinforcement learning task: a behavior policy and a target policy. The behavior policy is the one used to generate experience samples. The target policy is the one that is constantly updated to converge to an optimal policy. When the behavior policy is the same as the target policy, such a learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.
Sarsa is on-policy. The samples required by Sarsa in every iteration are $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, where $A_{t+1}$ is dependent on the target policy $\pi$: the policy that generates the experience must be the same policy that is being improved.
Q-learning is off-policy. The samples required by Q-learning in every iteration are $(S_t, A_t, R_{t+1}, S_{t+1})$. The estimation of the optimal action value of $(S_t, A_t)$ does not involve $A_{t+1}$, so we can use any policy to generate the samples.
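To make the contrast concrete, compare the Q-learning update sketched earlier with a Sarsa update; the extra `next_action` argument is exactly the piece that must come from the target policy, which is what makes Sarsa on-policy (the function signature below is illustrative, not from the original text):

```python
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    # On-policy TD target: uses the action the policy actually selected in
    # the next state, so the behavior policy must equal the target policy.
    td_target = reward + gamma * q_table[next_state, next_action]
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```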