Reinforcement Learning Notes (II) -- Actor-Critic

1. Introduction and Motivation

In previous notes, we explored Policy Gradient (PG) methods (e.g., REINFORCE). While elegant and capable of handling continuous action spaces, PG methods suffer from a significant drawback: high variance. The gradient of the reinforcement learning objective is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}(s_t, a_t)\right]$$

where $\hat{Q}(s_t, a_t)$ is an estimate of the return (reward-to-go). In REINFORCE, we use the Monte Carlo estimate $\hat{Q}(s_t, a_t) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ (the actual sum of rewards observed in the rollout). Because trajectories can vary wildly due to stochasticity in the environment and policy, this estimate has very high variance, leading to unstable training and poor sample efficiency.

Actor-Critic algorithms aim to reduce this variance by replacing the high-variance Monte Carlo return with a lower-variance function approximator (the Critic).

2. The Actor-Critic Architecture

An Actor-Critic algorithm consists of two components:

  1. The Actor ($\pi_\theta(a \mid s)$): The policy that controls how the agent acts. It tries to maximize the expected return.
  2. The Critic ($\hat{V}^\pi_\phi(s)$): A value function that estimates the expected return of the current policy. It evaluates the actor's actions (a minimal sketch of both components follows below).
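To make these two components concrete, here is a minimal sketch of an actor and a critic as small neural networks. This assumes PyTorch and a discrete action space; the class names, layer sizes, and hyperparameters are illustrative, not prescribed by anything above.

```python
# Minimal sketch of the two components, assuming PyTorch and a discrete action space.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """The Actor: a policy network pi_theta(a|s) mapping a state to an action distribution."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


class Critic(nn.Module):
    """The Critic: a value network V_phi(s) estimating the expected return of the current policy."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)
```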

2.1. The One-Step Actor-Critic Algorithm

The interaction between the Actor and Critic creates a cycle: the Actor generates experience (rollout data), and the Critic learns from this data to provide better updates for the Actor.

Below is the standard One-Step Actor-Critic algorithm (online, episodic). Notice how the Critic's value estimate $\hat{V}^\pi_\phi(s)$ is used to compute the TD-error $\delta_t$, which then drives the updates for both the Actor and the Critic.
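As a code sketch of that loop (assuming a Gymnasium-style environment and the illustrative `Actor`/`Critic` modules above; the dimensions and hyperparameters are placeholders):

```python
# One-step (online, episodic) Actor-Critic sketch. Each transition yields one TD error
# delta = r + gamma * V(s') - V(s), which drives both the Critic and the Actor update.
actor, critic = Actor(obs_dim=4, n_actions=2), Critic(obs_dim=4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, num_episodes = 0.99, 500

for episode in range(num_episodes):
    obs, _ = env.reset()                      # env: a Gymnasium-style environment (assumed)
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = actor(obs_t)                   # pi_theta(. | s)
        action = dist.sample()
        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # TD error: bootstrap with V(s'), using 0 at terminal states.
        next_obs_t = torch.as_tensor(next_obs, dtype=torch.float32)
        with torch.no_grad():
            v_next = torch.tensor(0.0) if terminated else critic(next_obs_t)
            target = reward + gamma * v_next
        delta = target - critic(obs_t)

        # Update Critic: regress V(s) toward the bootstrapped target (squared TD error).
        critic_opt.zero_grad()
        delta.pow(2).backward()
        critic_opt.step()

        # Update Actor: policy gradient step weighted by the (detached) TD error.
        actor_opt.zero_grad()
        (-dist.log_prob(action) * delta.detach()).backward()
        actor_opt.step()

        obs = next_obs
```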

2.2. Policy Evaluation (Training the Critic)

The Critic's role is to evaluate the current policy by estimating the value function $V^\pi(s)$. To perform the Update Critic step shown above, we treat Policy Evaluation as a regression problem. The critic minimizes the Mean Squared Error (MSE) between its prediction $\hat{V}^\pi_\phi(s_t)$ and a target value $y_t$:

$$\mathcal{L}(\phi) = \frac{1}{2} \sum_t \left\| \hat{V}^\pi_\phi(s_t) - y_t \right\|^2$$

The gradient of this loss determines the update direction: $\nabla_\phi \mathcal{L}(\phi) = \sum_t \left(\hat{V}^\pi_\phi(s_t) - y_t\right) \nabla_\phi \hat{V}^\pi_\phi(s_t)$. In the algorithm above, the term $y_t - \hat{V}^\pi_\phi(s_t)$ corresponds exactly to the TD error $\delta_t$ when using the Bootstrapped Target.

2.2.1. The Choice of Target: Bias vs. Variance

The crucial difference between Actor-Critic and methods like REINFORCE lies in the choice of the target $y_t$.

  • Monte Carlo Target (REINFORCE): Uses the actual sum of rewards from the entire episode: $y_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
  • Bootstrapped (TD) Target (Actor-Critic): Uses the immediate reward plus the estimated value of the next state: $y_t = r_t + \gamma \hat{V}^\pi_\phi(s_{t+1})$.

Using the bootstrapped target introduces bias (because $\hat{V}^\pi_\phi$ is initially an incorrect estimate) but significantly reduces variance.

Rigorous Variance Decomposition:

To understand mathematically why bootstrapping reduces variance, we can use the Law of Total Variance to decompose the variance of the full Monte Carlo return $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$. We condition on the random variables of the immediate step, $X_t = (a_t, r_t, s_{t+1})$:

$$\mathrm{Var}(G_t) = \underbrace{\mathrm{Var}\!\left(\mathbb{E}[\,G_t \mid X_t\,]\right)}_{\text{Term A}} + \underbrace{\mathbb{E}\!\left[\mathrm{Var}(\,G_t \mid X_t\,)\right]}_{\text{Term B}}$$

  • Term A (Immediate Variance): The variance arising from the randomness of the current step (action choice $a_t$, immediate reward $r_t$, and transition to $s_{t+1}$).
  • Term B (Future Variance): The expected variance of the rewards from time $t+1$ onwards. In Monte Carlo, this sums up the noise of every future coin flip (stochastic actions and transitions) for the rest of the episode. This term is positive and typically large for long horizons.

The Actor-Critic Advantage: By using the TD target $r_t + \gamma \hat{V}^\pi_\phi(s_{t+1})$, we are effectively choosing to ignore Term B. We replace the highly variable future return with its deterministic expectation $\mathbb{E}[G_{t+1} \mid s_{t+1}] = V^\pi(s_{t+1})$ (assuming our critic is accurate).

Thus, the Critic "cuts off" the accumulation of noise from the future, leaving only the variance from the immediate step (Term A). This yields much lower-variance updates and, in practice, more stable training.
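To see this effect numerically, here is a small self-contained experiment on a toy random-walk MDP (invented purely for illustration; it is not part of the derivation above). It estimates the empirical variance of the full Monte Carlo return versus the one-step TD target built from the true value function:

```python
# Toy comparison of Var[G_t] (Monte Carlo) vs Var[r_t + gamma * V(s_{t+1})] (TD target).
# The MDP, reward noise, and horizon below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
gamma, n_states, horizon, n_rollouts = 0.99, 5, 50, 10_000

def step(s):
    """Random walk on {0, ..., n_states-1}; every step pays reward 1 plus Gaussian noise."""
    s_next = int(np.clip(s + rng.choice([-1, 1]), 0, n_states - 1))
    return s_next, 1.0 + rng.normal(scale=1.0)

# Because the expected reward is 1 in every state, the true value only depends on the
# number of remaining steps: V_k = sum_{i < k} gamma^i.
V_remaining = lambda k: (1 - gamma**k) / (1 - gamma)

mc_returns, td_targets = [], []
for _ in range(n_rollouts):
    s, G = 2, 0.0
    for k in range(horizon):
        s_next, r = step(s)
        if k == 0:
            # The TD target "cuts off" the future after one step using the value of s'.
            td_targets.append(r + gamma * V_remaining(horizon - 1))
        G += gamma**k * r
        s = s_next
    mc_returns.append(G)

print("Var[MC return]:", np.var(mc_returns))   # accumulates reward noise from all 50 steps
print("Var[TD target]:", np.var(td_targets))   # only the first step's reward noise remains
```

Because the value function is a constant here, the TD target carries only the first step's reward noise, while the Monte Carlo return accumulates noise from the entire horizon.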

2.3. Policy Improvement (Training the Actor)

The actor is updated using the approximate advantage calculated by the critic:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}^\pi(s_t, a_t)\right]$$

where the advantage estimate is often approximated using the temporal difference error (TD-error):

$$\hat{A}^\pi(s_t, a_t) \approx \delta_t = r_t + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$
2.4. The Challenge of Off-Policy Actor-Critic

A common question, usually asked in the hope of improving sample efficiency, is: "Can we just use a Replay Buffer to train Actor-Critic?"

  • Directly training the policy on old samples (transitions from a buffer) is mathematically incorrect for standard Policy Gradient, because the gradient requires samples from the current policy distribution $\pi_\theta$.
  • Using old samples introduces distribution shift, meaning the gradient estimate is no longer valid for the current policy.
  • Naive "Off-Policy AC" that treats old samples as current samples is often considered "broken". Correct Off-Policy algorithms require specific corrections (e.g., importance sampling, sketched below) to handle this mismatch.
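As a brief sketch of what such a correction looks like (using the notation above): if actions were drawn from an older behavior policy $\pi_{\text{old}}$, each sample's gradient contribution is reweighted by an importance ratio, while the mismatch in the state distribution is typically ignored or handled separately:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s_t,\, a_t \sim \pi_{\text{old}}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}^\pi(s_t, a_t)\right]$$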

3. Bias-Variance Tradeoff and N-Step Returns

The choice of target for the critic dictates the bias-variance tradeoff:

  • Monte Carlo (REINFORCE): Unbiased, High Variance.

  • One-step Actor-Critic: Biased (if Critic is imperfect), Low Variance.

  • Discount Factor as Variance Reduction:

    • Lower $\gamma$ (ignoring far-future rewards) significantly reduces variance but introduces bias (we optimize for a "myopic" horizon).
    • High $\gamma$ (near 1) is unbiased (w.r.t. the true objective) but has high variance.

3.1. Variance Comparison

The concept of "cutting off" the variance is visualized below:

[Figure: Variance Cutoff via Bootstrapping]

To understand why the Critic reduces variance, let's compare the variance of the Monte Carlo return ($G_t$) versus the Bootstrapped TD Target ($y_t^{TD} = r_t + \gamma \hat{V}^\pi_\phi(s_{t+1})$).

[Figure: Bias-Variance Tradeoff]

1. Variance of Monte Carlo Return ($G_t$): The MC return is the sum of all future discounted rewards. Assuming rewards at each step are independent random variables with variance $\sigma^2$:

$$\mathrm{Var}(G_t) = \mathrm{Var}\!\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right) = \sum_{k=0}^{\infty} \gamma^{2k} \sigma^2 = \frac{\sigma^2}{1 - \gamma^2}$$

Even with discounting, this variance accumulates from every future time step.

2. Variance of TD Target ($y_t^{TD}$): The one-step TD target only involves the immediate reward $r_t$ and the estimated value of the next state $\hat{V}^\pi_\phi(s_{t+1})$:

$$\mathrm{Var}(y_t^{TD}) = \mathrm{Var}\!\left(r_t + \gamma \hat{V}^\pi_\phi(s_{t+1})\right) = \sigma^2 + \gamma^2\, \mathrm{Var}\!\left(\hat{V}^\pi_\phi(s_{t+1})\right)$$

If the value function is a good estimator (or effectively constant once learned), its variance is typically much lower than the sum of all future reward variances.

Essentially, we replace the "infinite tail" of variance ($\sum_{k=1}^{\infty} \gamma^{2k} \sigma^2$) with the single-term variance of our estimator, $\gamma^2\, \mathrm{Var}(\hat{V}^\pi_\phi(s_{t+1}))$.

We can interpolate between these using n-step returns:

$$y_t^{(n)} = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r_{t'} + \gamma^n \hat{V}^\pi_\phi(s_{t+n})$$
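As a small illustrative helper (the function and argument names are hypothetical, not from the post), n-step bootstrapped targets can be computed for a finite trajectory like this:

```python
# Compute n-step bootstrapped targets for every time step of a finite trajectory:
# y_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n}),
# falling back to the episode end when fewer than n steps remain.
import numpy as np

def n_step_targets(rewards, values, n, gamma):
    """rewards: length-T list of r_t; values: length-(T+1) list of V(s_t) estimates,
    whose last entry is V(s_T) (use 0.0 if the episode terminated there)."""
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        k = min(n, T - t)                                   # steps of real reward available
        targets[t] = sum(gamma**i * rewards[t + i] for i in range(k))
        targets[t] += gamma**k * values[t + k]              # bootstrap with the critic
    return targets
```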

3.2. Generalized Advantage Estimation (GAE)

Instead of choosing a single $n$ (which is a hard hyperparameter to tune), Schulman et al. (2015) proposed GAE, which computes an exponentially weighted average of all possible n-step returns.

The key insight is to define the TD error at time $t$ as:

$$\delta_t = r_t + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$

This is itself an estimate of the advantage that looks ahead just one step (it is unbiased if $\hat{V}^\pi_\phi = V^\pi$).

The GAE Estimator is defined as the sum of discounted TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}$$

Why is this useful?

  1. $\lambda$ as a Knob: The parameter $\lambda$ allows us to smoothly interpolate between the high-bias TD target and the high-variance MC return.
    • $\lambda = 0$: $\hat{A}_t = \delta_t$. This is equivalent to TD(0) (low variance, high bias).
    • $\lambda = 1$: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = G_t - \hat{V}^\pi_\phi(s_t)$. This is equivalent to Monte Carlo (high variance, unbiased).
  2. Robustness: By averaging multiple n-step returns, we are less sensitive to the specific choice of exactly "how many steps" to look ahead.

Implementation Note

We don't actually compute infinite sums. We compute GAE efficiently by iterating backwards from the end of the trajectory:

$$\hat{A}_t^{\mathrm{GAE}} = \delta_t + \gamma \lambda\, \hat{A}_{t+1}^{\mathrm{GAE}}$$

This recursive form is $O(T)$ and very easy to implement in code.
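A minimal sketch of that backward pass (function and argument names are illustrative):

```python
# GAE via the backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, computed in O(T).
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones: length-T arrays; values: length-(T+1) array of critic estimates,
    where the extra last entry V(s_T) is used for bootstrapping."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])                 # stop the recursion at episode ends
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae       # A_t = delta_t + (gamma*lambda) A_{t+1}
        advantages[t] = gae
    returns = advantages + np.asarray(values[:T])           # critic targets: A_t + V(s_t)
    return advantages, returns
```

The returned advantages feed the actor update, while the returns serve as regression targets for the critic.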
