1. Introduction: From Actor-Critic to Value-Based RL
In Actor-Critic methods, we learned two functions: a policy (actor) and a value function (critic). Value-Based Methods ask: Can we omit the explicit policy entirely?
The core idea is to define the policy implicitly using the value function. If we know the optimal value function $Q^\star(s, a)$, the optimal policy is simply to take the "greedy" action at every step:

$$\pi^\star(s) = \arg\max_a Q^\star(s, a)$$
This guarantees the policy is at least as good as the one that generated the values.
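As a concrete illustration, here is a minimal sketch, assuming a small discrete MDP and an already-learned Q-table (both hypothetical): the policy is never stored explicitly, it is recovered on demand by an argmax over Q-values.

```python
import numpy as np

# Hypothetical learned Q-values for a tiny MDP: 3 states x 2 actions.
Q = np.array([
    [1.0, 2.5],   # state 0: action 1 is best
    [0.3, 0.1],   # state 1: action 0 is best
    [4.0, 4.0],   # state 2: tie (argmax breaks ties toward action 0)
])

def greedy_policy(Q, s):
    """The 'implicit' policy: no separate actor, just an argmax over Q."""
    return int(np.argmax(Q[s]))

print([greedy_policy(Q, s) for s in range(Q.shape[0])])  # [1, 0, 0]
```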
2. Foundations & Definitions
To understand Value-Based RL, we must rigorously define our objective and the equations we solve.
2.1 The Objective: Optimal Value
We define $Q^\star(s, a)$ as the maximum expected total discounted reward starting from state $s$, taking action $a$, and acting optimally thereafter:

$$Q^\star(s, a) = \max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \;\middle|\; s_0 = s,\, a_0 = a\right]$$
2.2 The Recursive Logic (Bellman Equation)
This definition implies a recursive relationship. The value of the current step is the immediate reward plus the discounted value of the best possible future.
Why is $V^\star(s) = \max_a Q^\star(s, a)$?
Since $V^\star(s)$ represents the value of being in state $s$ and acting optimally thereafter, the agent will choose the action that maximizes the expected return. Thus, the value of the state is the value of the best action available in that state.
Bellman Optimality Equation:

$$Q^\star(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)}\!\left[r(s, a) + \gamma \max_{a'} Q^\star(s', a')\right]$$
2.3 Requirements for Validity
For this recursive equation to hold, we assume:
- Markov Property: The future depends only on the current state $s_t$ and action $a_t$, not on the full history.
- Stationarity: The dynamics $p(s' \mid s, a)$ and reward function $r(s, a)$ do not change over time.
- Discounting ($\gamma < 1$): Ensures finite sums and convergence (the Bellman operator is a contraction).
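As a sketch of how the Bellman Optimality Equation is used in practice, assuming a hypothetical tabular MDP with known dynamics `P` and rewards `R`, repeatedly applying the backup converges to $Q^\star$ because of the $\gamma$-contraction:

```python
import numpy as np

# Hypothetical tabular MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize into valid distributions
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = Q.max(axis=1)                       # V*(s') = max_a' Q*(s', a')
    Q_new = R + gamma * P @ V               # Bellman optimality backup
    if np.max(np.abs(Q_new - Q)) < 1e-8:    # contraction: the error shrinks each sweep
        break
    Q = Q_new
```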
3. Model-Based vs. Model-Free Learning
3.1 Policy Iteration (Model-Based)
If we know the transition dynamics $p(s' \mid s, a)$, we can use Dynamic Programming.
- Evaluate: Compute $V^\pi(s)$ for the current policy $\pi$.
- Improve: Update $\pi(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[r(s, a) + \gamma V^\pi(s')\right]$.
Limitation: The "Improve" step requires calculating an expectation over $s'$. This requires a model ($p(s' \mid s, a)$).
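A minimal tabular sketch of this loop, assuming the same hypothetical known model `P` and `R` as above; note that the improvement step explicitly multiplies by the transition probabilities, which is exactly what a model-free agent cannot do:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, n_eval_sweeps=100):
    """Model-based policy iteration on a tabular MDP (P[s, a, s'], R[s, a])."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
    while True:
        # Evaluate: iterative policy evaluation of V^pi under the current policy
        V = np.zeros(n_states)
        for _ in range(n_eval_sweeps):
            V = R[np.arange(n_states), pi] + gamma * (P[np.arange(n_states), pi] @ V)
        # Improve: greedy w.r.t. a one-step lookahead -- this step needs the model P
        new_pi = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```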
3.2 The Q-Function Breakthrough (Model-Free)
Why do we learn $Q(s, a)$ instead of just $V(s)$?
- V-Function: To extract a policy from $V(s)$, we need a model to predict which action leads to the best next state: $\pi(s) = \arg\max_a \mathbb{E}_{s' \sim p(s' \mid s, a)}\left[r(s, a) + \gamma V(s')\right]$.
- Q-Function: If we have $Q(s, a)$, the "future" and the "model" are already baked into the value. We simply pick the biggest number: $\pi(s) = \arg\max_a Q(s, a)$.
Conclusion: Learning $Q(s, a)$ allows Model-Free control.
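The contrast is easiest to see in code. A rough sketch (hypothetical tabular arrays again): extracting a policy from `V` needs the transition model `P`, while extracting it from `Q` is a pure table lookup.

```python
import numpy as np

def policy_from_V(V, P, R, gamma):
    """Needs the model: a one-step lookahead through P to score each action."""
    return (R + gamma * P @ V).argmax(axis=1)

def policy_from_Q(Q):
    """Model-free: the lookahead is already baked into Q."""
    return Q.argmax(axis=1)
```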
4. Fitted Q-Iteration (FQI)
When state spaces are large (e.g., images), we cannot use tables. We must use Function Approximation (Neural Networks) to estimate $Q^\star(s, a)$ with a parameterized function $Q_\phi(s, a)$.
4.1 The Algorithm
We solve the Bellman Optimality Equation by turning it into a regression problem: collect a dataset of transitions $(s_i, a_i, r_i, s_i')$, compute targets $y_i = r_i + \gamma \max_{a'} Q_\phi(s_i', a')$, fit $\phi$ by minimizing $\sum_i \left(Q_\phi(s_i, a_i) - y_i\right)^2$, and repeat.
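A minimal PyTorch sketch of one outer iteration of Fitted Q-Iteration, assuming a hypothetical batch of transitions and a small MLP for $Q_\phi$; all names here are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration_step(batch, n_fit_steps=100):
    """One FQI iteration: build Bellman targets, then regress Q_phi onto them."""
    # s: (N, state_dim), a: (N,) LongTensor of action indices, r: (N,), s_next: (N, state_dim)
    s, a, r, s_next = batch
    with torch.no_grad():                   # targets are held fixed during the fit
        y = r + gamma * q_net(s_next).max(dim=1).values   # (terminal masking omitted for brevity)
    for _ in range(n_fit_steps):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()     # MSE regression onto the Bellman targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```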
4.2 Why Regression Works (Stochastic Targets)
The target uses a single sampled next state $s_i'$, but the Bellman equation requires an expectation $\mathbb{E}_{s' \sim p(s' \mid s, a)}[\cdot]$.
- Target: $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')$
- Regression: Minimizing MSE approximates the conditional expectation: $\arg\min_{Q} \mathbb{E}\left[(y - Q(s, a))^2\right]$ is achieved at $Q(s, a) = \mathbb{E}[y \mid s, a]$.
Thus, training on noisy samples allows the network to recover the true expected Bellman update.
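A toy numpy check of this fact, under the assumption of a single fixed $(s, a)$ pair with randomly sampled next states (the numbers are made up): fitting a single prediction by least squares recovers the mean of the noisy targets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy Bellman targets for one fixed (s, a): y = r + gamma * max_a' Q(s', a'), s' random.
y_samples = 1.0 + 0.99 * rng.normal(loc=5.0, scale=2.0, size=100_000)

# Gradient descent on the MSE for a single constant prediction q.
q = 0.0
for _ in range(2_000):
    q -= 0.01 * 2 * (q - y_samples).mean()   # d/dq of the mean squared error

print(q, y_samples.mean())   # the MSE minimizer converges to the empirical mean
```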
4.3 Q-Learning (Online)
Q-Learning is simply the online version of FQI with a batch size of 1: after observing a single transition $(s, a, r, s')$, take one gradient step on the squared error between $Q_\phi(s, a)$ and the target $y = r + \gamma \max_{a'} Q_\phi(s', a')$.
(Note: We treat the target $y$ as a fixed constant during the gradient step, even though it depends on $\phi$ through $Q_\phi(s', a')$.)
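A sketch of that single-transition step in PyTorch, reusing the hypothetical `q_net`, `optimizer`, and `gamma` from the FQI sketch above; the `torch.no_grad()` block (equivalently, `.detach()`) is what implements the "treat the target as a constant" note:

```python
def q_learning_step(s, a, r, s_next):
    """Online Q-Learning: one gradient step on one transition (batch size 1).
    Assumes q_net, optimizer, gamma are defined as in the FQI sketch above."""
    with torch.no_grad():                                # target is treated as a constant
        y = r + gamma * q_net(s_next.unsqueeze(0)).max(dim=1).values
    q_sa = q_net(s.unsqueeze(0))[0, a]                   # Q_phi(s, a) for the taken action
    loss = (q_sa - y.squeeze(0)) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```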
5. Theoretical Properties
5.1 Optimality of the Greedy Policy
Does maximizing $Q^\star$ really give the best policy? Yes.
- Definition: $Q^\star(s, a)$ is the maximum possible value achievable by any policy after taking action $a$ in state $s$.
- Connection: The policy that achieves this maximum satisfies $\pi^\star(s) = \arg\max_a Q^\star(s, a)$.
- Result: The Q-function for this greedy policy is $Q^{\pi^\star}(s, a) = Q^\star(s, a)$, which is exactly the definition of $Q^\star$. Therefore, finding $Q^\star$ is equivalent to finding the optimal policy.
5.2 Convergence Issues
- Tabular Case: The Bellman operator $\mathcal{B}$ is a contraction in the max-norm ($\|\cdot\|_\infty$). Iterating it guarantees convergence to $Q^\star$.
- Function Approximation: We cannot represent $Q^\star$ perfectly. Each step involves a projection $\Pi$ (fitting the neural net).
  - $\mathcal{B}$ contracts in $\|\cdot\|_\infty$.
  - $\Pi$ contracts in $\|\cdot\|_2$ (MSE).
  - The Problem: The combination $\Pi \mathcal{B}$ is not necessarily a contraction in any norm.
- Consequence: FQI and Q-Learning can diverge with neural networks (unlike Gradient Bandit or tabular methods).
5.3 Practical Reality: All States vs. Expectation
- Theory (Tabular): We seek Pointwise Optimality ($Q(s, a) = Q^\star(s, a)$ for all $s, a$).
- Practice (Deep RL): We minimize error weighted by the data distribution.
We maximize performance on average over the states we actually visit, potentially sacrificing accuracy in rare states.