Reinforcement Learning Notes (I) -- Policy Gradient

This is the first post of a series of notes on reinforcement learning. The aim of this series is to document my understanding of RL and provide a minimal level of knowledge (but with a reasonable amount of mathematical rigor) for engineers to get started with RL.

1. The Goal of Reinforcement Learning

The goal is to learn a policy $\pi_\theta$ that maximizes the expected return $J(\theta)$:

$$\theta^* = \arg\max_\theta \, J(\theta) = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

where $\tau$ is a trajectory and $p_\theta(\tau)$ is the probability of the trajectory under policy $\pi_\theta$:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Where the notation is defined as follows:

  • $s_t$: State at time step $t$.
  • $a_t$: Action at time step $t$.
  • $r(s_t, a_t)$: Reward function.
  • $\pi_\theta(a_t \mid s_t)$: Policy with parameters $\theta$.
  • $p(s_{t+1} \mid s_t, a_t)$: Transition dynamics of the environment.
  • $p(s_0)$: Initial state distribution.
  • $\tau$: A trajectory sequence $(s_0, a_0, s_1, a_1, \dots, s_T)$.
  • $p_\theta(\tau)$: Probability of observing trajectory $\tau$ under policy $\pi_\theta$.
  • $R(\tau) = \sum_{t=0}^{T-1} r(s_t, a_t)$: Cumulative return of trajectory $\tau$.

The following diagram illustrates the relationship between these variables in a Markov Decision Process (MDP):

MDP Diagram

2. The Policy Gradient

We want to update the parameters $\theta$ of the policy by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. Using the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, we get:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$
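Written out, this is just the log-derivative trick plus exchanging the gradient and the integral:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]
\end{aligned}
$$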

Expanding $\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\right]$, the terms involving the dynamics and the initial state distribution do not depend on $\theta$, so their gradient is zero. We are left with:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$

Interpretation

This equation decomposes the gradient into two parts:

  1. Policy Direction: $\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ represents the direction in parameter space that increases the probability of the actions taken in the trajectory.
  2. Trajectory Reweighting: $R(\tau)$ acts as a scalar weight.

The gradient update pushes the policy parameters toward making high-return trajectories more likely ("trial and error"). Formally, it scales the gradient of the log-probability of each trajectory by its return.

The REINFORCE Algorithm

The REINFORCE algorithm (Williams, 1992) is the simplest implementation of the policy gradient. It uses Monte Carlo sampling to estimate the return $R(\tau)$: sample $N$ trajectories $\tau_i$ with the current policy and average,

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) R(\tau_i)$$
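A minimal PyTorch sketch of this update (the network architecture, the `Categorical` action distribution, and the single-trajectory batch are illustrative assumptions, not part of the original algorithm description):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Small categorical policy network; the sizes here are illustrative assumptions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, trajectory_return):
    """One REINFORCE step from a single sampled trajectory.

    states:            (T, obs_dim) float tensor of visited states
    actions:           (T,) long tensor of actions taken
    trajectory_return: scalar total return R(tau) of the trajectory
    """
    dist = Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)              # log pi_theta(a_t | s_t)
    # Surrogate loss: the gradient of -sum_t log pi * R(tau) is the
    # negative policy gradient for this trajectory.
    loss = -(log_probs.sum() * trajectory_return)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```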

Since it relies on the full trajectory return (Monte Carlo), REINFORCE is unbiased but suffers from high variance, often leading to slow convergence.

Why is variance high?

  1. Stochasticity: In a standard MDP, both the policy and the environment are stochastic. A single sampled trajectory is just one realization of a highly variable process. High-probability paths might yield low rewards due to a few unlucky transitions, and vice-versa.
  2. Difficulty in Credit Assignment: REINFORCE uses the total return to update all actions in the trajectory. If a trajectory has a high return, the algorithm reinforces every action taken, even if some were suboptimal. Without a critic to evaluate individual states, the signal ("good" or "bad") is smeared across the entire sequence, introducing significant noise.
  3. Magnitude of Returns: The gradient updates are scaled by the return $R(\tau)$. If returns vary wildly in magnitude (e.g., one path gives 0, another gives 1000), the gradient updates will swing violently, destabilizing the learning process.

3. Variance Reduction

The standard policy gradient estimator has high variance. We can reduce this variance using two main techniques: Causality and Baselines.

3.1 Exploiting Causality (Reward-to-Go)

The policy at time $t$ cannot affect rewards obtained in the past ($t' < t$). Therefore, we can replace the total return $R(\tau)$ with the reward-to-go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'})\right)$$

The term $\sum_{t'=t}^{T-1} r(s_{i,t'}, a_{i,t'})$ is often denoted as $\hat{Q}_{i,t}$.
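For reference, a small helper that computes the reward-to-go with a single backward pass over one trajectory's rewards; the discount factor `gamma` is an extra assumption not used in the derivation above (set `gamma=1.0` to match it exactly):

```python
def reward_to_go(rewards, gamma=1.0):
    """Compute Q_hat_t = sum_{t' >= t} gamma^(t'-t) * r_t' for one trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] -> reward-to-go [3, 2, 2] with gamma = 1.0
```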

3.2 Baselines

We can subtract a baseline $b$ from the return without introducing bias, as long as the baseline does not depend on the action $a_t$:

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b\right] = 0$$
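This holds because the score function has zero mean under the policy (a one-line check, using that $b$ is constant with respect to $a_t$):

$$b\, \mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = b \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t = b\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t = b\, \nabla_\theta 1 = 0$$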

Thus, the policy gradient with a baseline is:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\hat{Q}_{i,t} - b\right)$$

A common choice for the baseline is the average return across sampled trajectories or a learned value function $V^\pi(s_t)$. This significantly reduces variance.
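As an illustrative sketch, a learned value-function baseline can be fit by regressing onto the observed reward-to-go (Monte Carlo targets); the network size and number of regression steps below are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Small value network used as a baseline; the sizes are illustrative assumptions.
value_fn = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_optimizer = torch.optim.Adam(value_fn.parameters(), lr=1e-2)

def fit_baseline(states, rewards_to_go, iters=5):
    """Regress V(s) onto the observed reward-to-go for a batch of states."""
    for _ in range(iters):
        pred = value_fn(states).squeeze(-1)
        loss = ((pred - rewards_to_go) ** 2).mean()
        value_optimizer.zero_grad()
        loss.backward()
        value_optimizer.step()
```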

4. Implementation

In automatic differentiation frameworks (like PyTorch or TensorFlow), we don't compute the gradient manually. Instead, we construct a "surrogate loss" whose gradient equals the policy gradient:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{A}_{i,t}$$

where $\hat{A}_{i,t}$ is the estimated advantage (e.g., $\hat{Q}_{i,t} - V(s_{i,t})$). We treat $\hat{A}_{i,t}$ as a fixed constant (detached from the computation graph) during backpropagation.
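A sketch of this surrogate loss in PyTorch, reusing the `policy` network and `Categorical` distribution from the REINFORCE sketch above (the flattened batch layout is an assumption):

```python
from torch.distributions import Categorical

def surrogate_loss(states, actions, advantages):
    """Negative advantage-weighted log-likelihood; its gradient is the
    (negative) policy gradient estimate."""
    dist = Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # Detach the advantages: they are treated as fixed weights and are
    # not differentiated through.
    return -(log_probs * advantages.detach()).mean()
```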

Algorithm:

  1. Sample: Run policy $\pi_\theta$ to collect trajectories $\{\tau_i\}$.
  2. Estimate Return: Compute the reward-to-go $\hat{Q}_{i,t}$ and optionally fit a baseline $V(s)$.
  3. Update: Take a gradient step, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ (i.e., gradient descent on the surrogate loss $L(\theta)$).
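Tying the steps together, a minimal (single-trajectory-per-update) training loop might look like the sketch below. It assumes Gymnasium's `CartPole-v1` and reuses `policy`, `optimizer`, `reward_to_go`, and `surrogate_loss` from the earlier sketches; the hyperparameters are arbitrary, and the reward-to-go is used directly as the advantage (no baseline) for brevity.

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")

for iteration in range(100):
    # 1. Sample: roll out one trajectory with the current policy.
    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = Categorical(logits=policy(obs_t))
        action = dist.sample()
        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        states.append(obs_t)
        actions.append(action)
        rewards.append(float(reward))
        obs, done = next_obs, terminated or truncated

    # 2. Estimate return: reward-to-go at each time step (no baseline here).
    rtg = torch.tensor(reward_to_go(rewards), dtype=torch.float32)

    # 3. Update: gradient step on the surrogate loss.
    loss = surrogate_loss(torch.stack(states), torch.stack(actions), rtg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```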

Suggested Readings

  • Williams (1992): Simple statistical gradient-following algorithms (REINFORCE).
  • Sutton et al. (2000): Policy Gradient Theorem.
  • Schulman et al. (2015): Trust Region Policy Optimization (TRPO).