1. Introduction: From Actor-Critic to Value-Based RL
In Actor-Critic methods, we learned two functions: a policy (actor) and a value function (critic). Value-Based Methods ask: Can we omit the explicit policy entirely?
The core idea is to define the policy implicitly using the value function. If we know the optimal value function $Q^\star(s, a)$, the optimal policy is simply to take the "greedy" action at every step:

$$\pi^\star(s) = \arg\max_a Q^\star(s, a)$$
This guarantees the policy is at least as good as the one that generated the values.
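As a concrete illustration, here is a minimal sketch, assuming a small discrete MDP and an already-learned Q-table (both hypothetical): the policy is never stored explicitly, it is recovered on demand by an argmax over Q-values.

```python
import numpy as np

# Hypothetical learned Q-values for a tiny MDP: 3 states x 2 actions.
Q = np.array([
    [1.0, 2.5],   # state 0: action 1 is best
    [0.3, 0.1],   # state 1: action 0 is best
    [4.0, 4.0],   # state 2: tie (argmax breaks ties toward action 0)
])

def greedy_policy(Q, s):
    """The 'implicit' policy: no separate actor, just an argmax over Q."""
    return int(np.argmax(Q[s]))

print([greedy_policy(Q, s) for s in range(Q.shape[0])])  # [1, 0, 0]
```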
2. Foundations & Definitions
To understand Value-Based RL, we must rigorously define our objective and the equations we solve.
2.1 The Objective: Optimal Value
We define $Q^\star(s, a)$ as the maximum expected total discounted reward starting from state $s$, taking action $a$, and acting optimally thereafter:

$$Q^\star(s, a) = \max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \;\middle|\; s_0 = s,\, a_0 = a\right]$$
2.2 The Recursive Logic (Bellman Equation)
This definition implies a recursive relationship. The value of the current step is the immediate reward plus the discounted value of the best possible future.
Why is $V^\star(s) = \max_a Q^\star(s, a)$?
Since $V^\star(s)$ represents the value of being in state $s$ and acting optimally thereafter, the agent will choose the action that maximizes the expected return. Thus, the value of the state is the value of the best action available in that state.
Bellman Optimality Equation:

$$Q^\star(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)}\!\left[r(s, a) + \gamma \max_{a'} Q^\star(s', a')\right]$$
2.3 Requirements for Validity
For this recursive equation to hold, we assume:
- Markov Property: The future depends only on the current state $s_t$ and action $a_t$, not on the full history.
- Stationarity: The dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$ do not change over time.
- Discounting ($\gamma < 1$): Ensures finite sums and convergence (the Bellman operator is a contraction).
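As a sketch of how the Bellman Optimality Equation is used in practice, assuming a hypothetical tabular MDP with known dynamics `P` and rewards `R`, repeatedly applying the backup converges to $Q^\star$ because of the $\gamma$-contraction:

```python
import numpy as np

# Hypothetical tabular MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize into valid distributions
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = Q.max(axis=1)                       # V*(s') = max_a' Q*(s', a')
    Q_new = R + gamma * P @ V               # Bellman optimality backup
    if np.max(np.abs(Q_new - Q)) < 1e-8:    # contraction: the error shrinks each sweep
        break
    Q = Q_new
```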
3. Model-Based vs. Model-Free Learning
3.1 Policy Iteration (Model-Based)
If we know the transition dynamics $p(s' \mid s, a)$, we can use Dynamic Programming.
- Evaluate: Compute $V^\pi(s)$ for the current policy $\pi$.
- Improve: Update $\pi(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[r(s, a) + \gamma V^\pi(s')\right]$.
Limitation: The "Improve" step requires calculating an expectation over $s'$. This requires a model ($p(s' \mid s, a)$).
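A minimal tabular sketch of this loop, assuming the same hypothetical known model `P` and `R` as above; note that the improvement step explicitly multiplies by the transition probabilities, which is exactly what a model-free agent cannot do:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, n_eval_sweeps=100):
    """Model-based policy iteration on a tabular MDP (P[s, a, s'], R[s, a])."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
    while True:
        # Evaluate: iterative policy evaluation of V^pi under the current policy
        V = np.zeros(n_states)
        for _ in range(n_eval_sweeps):
            V = R[np.arange(n_states), pi] + gamma * (P[np.arange(n_states), pi] @ V)
        # Improve: greedy w.r.t. a one-step lookahead -- this step needs the model P
        new_pi = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```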
3.2 The Q-Function Breakthrough (Model-Free)
Why do we learn $Q(s, a)$ instead of just $V(s)$?
- V-Function: To extract a policy from $V(s)$, we need a model to predict which action leads to the best next state: $\pi(s) = \arg\max_a \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[r(s, a) + \gamma V(s')\right]$.
- Q-Function: If we have $Q(s, a)$, the "future" and the "model" are already baked into the value. We simply pick the biggest number: $\pi(s) = \arg\max_a Q(s, a)$.
Conclusion: Learning $Q(s, a)$ allows Model-Free control.
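The contrast is easiest to see in code. A rough sketch (hypothetical tabular arrays again): extracting a policy from `V` needs the transition model `P`, while extracting it from `Q` is a pure table lookup.

```python
import numpy as np

def policy_from_V(V, P, R, gamma):
    """Needs the model: a one-step lookahead through P to score each action."""
    return (R + gamma * P @ V).argmax(axis=1)

def policy_from_Q(Q):
    """Model-free: the lookahead is already baked into Q."""
    return Q.argmax(axis=1)
```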
4. Fitted Q-Iteration (FQI)
When state spaces are large (e.g., images), we cannot use tables. We must use Function Approximation (Neural Networks) to estimate $Q^\star(s, a)$ with a parameterized function $Q_\phi(s, a)$.
4.1 The Algorithm
We solve the Bellman Optimality Equation by turning it into a regression problem: collect a dataset of transitions $(s_i, a_i, r_i, s_i')$, compute targets $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$, fit $\phi$ by minimizing $\sum_i \left(Q_\phi(s_i, a_i) - y_i\right)^2$, and repeat.
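A minimal PyTorch sketch of one outer iteration of Fitted Q-Iteration, assuming a hypothetical batch of transitions and a small MLP for $Q_\phi$; all names here are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration_step(batch, n_fit_steps=100):
    """One FQI iteration: build Bellman targets, then regress Q_phi onto them."""
    # s: (N, state_dim), a: (N,) LongTensor of action indices, r: (N,), s_next: (N, state_dim)
    s, a, r, s_next = batch
    with torch.no_grad():                   # targets are held fixed during the fit
        y = r + gamma * q_net(s_next).max(dim=1).values   # (terminal masking omitted for brevity)
    for _ in range(n_fit_steps):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()     # MSE regression onto the Bellman targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```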
4.2 Why Regression Works (Stochastic Targets)
The target uses a single sampled next state $s_i'$, but the Bellman equation requires an expectation $\mathbb{E}_{s' \sim p(s' \mid s, a)}[\cdot]$.
- Target: $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')$
- Regression: Minimizing MSE approximates the conditional expectation: $\arg\min_{Q} \mathbb{E}\left[(y - Q(s, a))^2\right]$ is achieved at $Q(s, a) = \mathbb{E}[y \mid s, a]$.
Thus, training on noisy samples allows the network to recover the true expected Bellman update.
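A toy numpy check of this fact, under the assumption of a single fixed $(s, a)$ pair with randomly sampled next states (the numbers are made up): fitting a single prediction by least squares recovers the mean of the noisy targets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy Bellman targets for one fixed (s, a): y = r + gamma * max_a' Q(s', a'), s' random.
y_samples = 1.0 + 0.99 * rng.normal(loc=5.0, scale=2.0, size=100_000)

# Gradient descent on the MSE for a single constant prediction q.
q = 0.0
for _ in range(2_000):
    q -= 0.01 * 2 * (q - y_samples).mean()   # d/dq of the mean squared error

print(q, y_samples.mean())   # the MSE minimizer converges to the empirical mean
```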
4.3 Q-Learning (Online)
Q-Learning is simply the online version of FQI with a batch size of 1: after observing a single transition $(s, a, r, s')$, take one gradient step on the squared error between $Q_\phi(s, a)$ and the target $y = r + \gamma \max_{a'} Q_\phi(s', a')$.
(Note: We treat the target $y$ as a fixed constant during the gradient step, even though it depends on $\phi$ through $Q_\phi(s', a')$.)
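A sketch of that single-transition step in PyTorch, reusing the hypothetical `q_net`, `optimizer`, and `gamma` from the FQI sketch above; the `torch.no_grad()` block (equivalently, `.detach()`) is what implements the "treat the target as a constant" note:

```python
def q_learning_step(s, a, r, s_next):
    """Online Q-Learning: one gradient step on one transition (batch size 1).
    Assumes q_net, optimizer, gamma are defined as in the FQI sketch above."""
    with torch.no_grad():                                # target is treated as a constant
        y = r + gamma * q_net(s_next.unsqueeze(0)).max(dim=1).values
    q_sa = q_net(s.unsqueeze(0))[0, a]                   # Q_phi(s, a) for the taken action
    loss = (q_sa - y.squeeze(0)) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```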
5. Theoretical Properties
5.1 Optimality of the Greedy Policy
Does maximizing $Q^\star$ really give the best policy? Yes.
- Definition: $Q^\star(s, a)$ is the maximum possible value achievable by any policy after taking action $a$ in state $s$.
- Connection: The policy that achieves this maximum satisfies $\pi^\star(s) = \arg\max_a Q^\star(s, a)$.
- Result: The Q-function for this greedy policy is $Q^{\pi^\star}(s, a) = Q^\star(s, a)$, which is exactly the definition of $Q^\star$. Therefore, finding $Q^\star$ is equivalent to finding the optimal policy.
5.2 Convergence Issues
- Tabular Case: The Bellman operator $\mathcal{B}$ is a contraction in the max-norm ($\|\cdot\|_\infty$). Iterating it guarantees convergence to $Q^\star$.
- Function Approximation: We cannot represent $Q^\star$ perfectly. Each step involves a projection $\Pi$ (fitting the neural net).
  - $\mathcal{B}$ contracts in $\|\cdot\|_\infty$.
  - $\Pi$ contracts in $\|\cdot\|_2$ (MSE).
  - The Problem: The combination $\Pi \mathcal{B}$ is not necessarily a contraction in any norm.
- Consequence: FQI and Q-Learning can diverge with neural networks (unlike Gradient Bandit or tabular methods).
5.3 Practical Reality: All States vs. Expectation
- Theory (Tabular): We seek Pointwise Optimality ($Q(s, a) = Q^\star(s, a)$ for all $s, a$).
- Practice (Deep RL): We minimize error weighted by the data distribution.
We maximize performance on average over the states we actually visit, potentially sacrificing accuracy in rare states.