Policy Gradients Without the Fog
Policy gradient methods look mysterious until you strip them to bookkeeping: estimate how changing action probabilities would have changed sampled returns, then push probabilities toward better outcomes. Below is a compact walk from the objective to a working baseline‑enhanced algorithm, with notes on variance control and implementation gotchas.
1. Objective
We want to maximize expected return:
\(J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]\) where a trajectory $\tau = (s_0,a_0,\dots,s_T)$ has return $R(\tau)=\sum_t r_t$.
Differentiate: \(\nabla_\theta J = \mathbb{E}_{\tau}\big[ R(\tau) \, \nabla_\theta \log p_\theta(\tau) \big]\) (log‑derivative trick).
Because $\log p_\theta(\tau)=\sum_t \log \pi_\theta(a_t\mid s_t)+\log p(s_0)+\sum_t \log p(s_{t+1}\mid s_t,a_t)$ and dynamics terms drop (no $\theta$),
\[\nabla_\theta J = \mathbb{E}\Big[ \sum_t R(\tau) \, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \Big].\]
High variance: weighting every time step by the full $R(\tau)$ credits each action with rewards that were earned before it acted and that it could not have influenced.
Intuition: Score Function View
$\nabla_\theta \log \pi_\theta(a_t|s_t)$ is a signed sensitivity: moving the parameters along it increases the probability of producing similar $a_t$ at $s_t$. Multiplying by a scalar measure of “how good was the whole outcome” (the return) says: tilt the policy toward action patterns that preceded good episodes. The expectation gives the true gradient; a single sample gives a noisy but unbiased estimate.
One‑Step Bandit Analogy
Single state, stochastic policy over actions $a$. Objective $J = \sum_a \pi_\theta(a) r(a)$. True gradient is $\sum_a r(a) \nabla_\theta \pi_\theta(a)$. Our estimator samples one action $a$ and returns $r(a) \nabla_\theta \log \pi_\theta(a)$. Its expectation matches the true gradient because $\mathbb{E}[r(a) \nabla_\theta \log \pi_\theta(a)] = \sum_a r(a) \pi_\theta(a) \nabla_\theta \log \pi_\theta(a)=\sum_a r(a) \nabla_\theta \pi_\theta(a)$. Episodes are just longer bandits with delayed feedback.
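To make the bandit claim concrete, here is a minimal numpy check; the three-armed bandit, its rewards, and the softmax parameterization are invented for illustration. The averaged score-function samples should match the analytic gradient up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit: softmax policy over logits theta, fixed reward per arm.
theta = np.array([0.2, -0.1, 0.5])
r = np.array([1.0, 3.0, 0.0])
pi = np.exp(theta - theta.max())
pi /= pi.sum()

# True gradient: sum_a r(a) * grad_theta pi(a).
# For a softmax, grad_theta log pi(a) = onehot(a) - pi, so grad_theta pi(a) = pi(a) * (onehot(a) - pi).
true_grad = sum(r[a] * pi[a] * (np.eye(3)[a] - pi) for a in range(3))

# Score-function estimator: sample actions from pi, average r(a) * grad_theta log pi(a).
a = rng.choice(3, size=200_000, p=pi)
est_grad = (r[a][:, None] * (np.eye(3)[a] - pi)).mean(axis=0)

print(true_grad)  # analytic gradient
print(est_grad)   # Monte Carlo estimate; agrees up to sampling noise
```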
2. Causality Fix
Replace $R(\tau)$ with the return from $t$ onward (the reward‑to‑go): $G_t = \sum_{k=t}^T r_k$.
\[\nabla_\theta J = \mathbb{E}\Big[ \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \Big].\]
Intuition
An action taken at time $t$ cannot influence rewards that were already earned before $t$. Swapping $R(\tau)$ for the tail return removes variance due to those unrelated, already‑earned rewards, tightening the signal.
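The tail return is a single backward accumulation over the episode. A minimal sketch, where the function name and the optional discount argument are mine; the text's $G_t$ corresponds to $\gamma = 1$:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Tail returns G_t = sum_{k>=t} gamma**(k-t) * r_k, computed in one backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(rewards_to_go([1.0, 0.0, 2.0]))  # [3. 2. 2.]
print(rewards_to_go([0.0, 5.0]))       # [5. 5.]  (the 2-step example later in the post)
```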
3. Baseline
Subtract any baseline $b(s_t)$ not depending on $a_t$: \(\mathbb{E}\big[ (G_t - b(s_t)) \, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \big]\) has the same expectation but lower variance.
Common choice: $b(s_t)=V_\phi(s_t)$ learned via regression to $G_t$ (gives Advantage Actor‑Critic when using $A_t=G_t - V_\phi$).
Why It Stays Unbiased (Covariance Argument)
Let $X=\nabla_\theta \log \pi_\theta(a_t|s_t)$ and $Y=b(s_t)$. Conditioned on $s_t$, $Y$ is a constant and $\mathbb{E}[X \mid s_t]=0$ (score‑function property), so $\mathbb{E}[YX] = \mathbb{E}\big[Y\,\mathbb{E}[X\mid s_t]\big] = 0$. Subtracting $b$ therefore changes only the variance, not the mean.
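The score‑function property itself is just the derivative of the normalization constraint (written here for a discrete action set; the continuous case replaces the sum with an integral):
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot\mid s)}\big[\nabla_\theta \log \pi_\theta(a\mid s)\big]
= \sum_a \pi_\theta(a\mid s)\,\frac{\nabla_\theta \pi_\theta(a\mid s)}{\pi_\theta(a\mid s)}
= \nabla_\theta \sum_a \pi_\theta(a\mid s)
= \nabla_\theta 1 = 0.
\]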
Intuition
Baseline sets a moving neutrality level: above‑baseline outcomes push probabilities up; below push them down. If baseline fits the typical return well, the residuals (advantages) are smaller‑magnitude and less noisy.
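A quick numerical illustration, reusing the softmax bandit from the earlier sketch and subtracting the mean reward as a (state‑independent) baseline; all names are illustrative. The two gradient estimates agree in mean, while the baselined one shows lower variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical softmax bandit as above; the baseline is the mean reward under pi.
theta = np.array([0.2, -0.1, 0.5])
r = np.array([1.0, 3.0, 0.0])
pi = np.exp(theta - theta.max())
pi /= pi.sum()
b = float(pi @ r)  # single state, so the "state-value" baseline is just the expected reward

a = rng.choice(3, size=100_000, p=pi)
score = np.eye(3)[a] - pi                      # grad_theta log pi(a) for each sample

g_raw = r[a][:, None] * score
g_base = (r[a] - b)[:, None] * score

print(g_raw.mean(axis=0), g_base.mean(axis=0))            # same expectation (up to noise)
print(g_raw.var(axis=0).sum(), g_base.var(axis=0).sum())  # baselined version: lower total variance
```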
4. Advantage with Bootstrapping
Instead of Monte Carlo $G_t$, use temporal‑difference targets: \(\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).\) Generalized Advantage Estimation (GAE): \(A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\) truncated at episode end.
Balances variance (lower with bootstrapping) and bias (higher when $\lambda < 1$).
Intuition: A Leaky Accumulator
Think of $\delta_t$ as an instantaneous surprise: the reward just received plus the discounted value of where you landed, minus what you previously believed the current state was worth. GAE forms a decaying sum of future surprises. Large $\lambda$ lets distant surprises influence the current credit (low bias, high variance); small $\lambda$ damps them quickly (higher bias, lower variance). It smoothly interpolates between pure Monte Carlo ($\lambda=1$) and pure TD ($\lambda=0$).
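A minimal numpy sketch of the leaky accumulator, assuming one rollout's rewards and value predictions are already available; the function and argument names are mine.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one rollout: rewards r_0..r_{T-1}, values V(s_0)..V(s_{T-1}),
    last_value = V(s_T) (0 if the episode terminated, a bootstrap value otherwise)."""
    values = np.append(values, last_value)
    deltas = rewards + gamma * values[1:] - values[:-1]   # per-step surprises
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running       # leaky accumulator
        adv[t] = running
    return adv, adv + values[:-1]                         # advantages A_t and targets G^lambda_t

# With gamma = lam = 1 and the later 2-step example (critic predicts 4 everywhere):
adv, targets = gae(np.array([0.0, 5.0]), np.array([4.0, 4.0]), 0.0, gamma=1.0, lam=1.0)
print(adv, targets)  # [1. 1.] [5. 5.]
```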
5. Loss Functions
Actor loss (minimize): \(L_{\text{actor}} = -\mathbb{E}_t\big[ A_t \log \pi_\theta(a_t\mid s_t) \big].\)
Critic loss: mean squared error to the $\lambda$-return target, \(L_{\text{critic}} = \big\| V_\phi(s_t) - \hat G^{\lambda}_t \big\|_2^2\) with \(\hat G^{\lambda}_t = A_t + V_\phi(s_t)\), where the $V_\phi(s_t)$ inside the target is detached (stop‑gradient); otherwise the two value terms cancel and the loss carries no gradient.
Add an entropy bonus to the total loss, $-\beta H(\pi_\theta(\cdot\mid s_t))$, to encourage exploration early in training.
Intuition
Actor: perform weighted maximum likelihood where weights are advantages (positive = reinforce, negative = suppress). Critic: supervised regression to smoothed returns. Entropy: regularizer preventing premature collapse of the distribution.
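A sketch of the three terms in PyTorch for a discrete-action policy, assuming advantages and $\lambda$-return targets are precomputed and detached; the function and argument names are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def a2c_losses(logits, values, actions, advantages, returns_lam, beta=0.01):
    """Loss terms for a batch; advantages and returns_lam are treated as constants."""
    dist = torch.distributions.Categorical(logits=logits)

    actor_loss = -(advantages * dist.log_prob(actions)).mean()  # advantage-weighted log-likelihood
    critic_loss = F.mse_loss(values, returns_lam)               # regression to smoothed returns
    entropy = dist.entropy().mean()                             # keeps the policy from collapsing early

    return actor_loss - beta * entropy, critic_loss
```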
6. Practical Notes
- Normalize advantages per batch to zero mean / unit std.
- Clip gradient norms (helps with exploding updates on sparse rewards).
- Reward scaling or whitening can stabilize training.
- Use separate optimizers or learning rates for actor and critic.
- Ensure you stop gradients through bootstrap target $V_\phi(s_{t+1})$.
- For continuous actions, use a Gaussian policy, $\pi_\theta(a\mid s)=\mathcal{N}(\mu_\theta(s),\Sigma_\theta(s))$, and include log std parameters (see the sketch below).
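One possible shape for that Gaussian policy in PyTorch, here with a state-independent log std (a common simplification of $\Sigma_\theta(s)$); the class name and network sizes are illustrative.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: mean from a small MLP, log std as free parameters."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std

    def forward(self, obs):
        return torch.distributions.Normal(self.mu_net(obs), self.log_std.exp())

# Usage: dist = policy(obs); a = dist.sample(); logp = dist.log_prob(a).sum(-1)
```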
Tiny Worked 2‑Step Example
Suppose episode length 2 with rewards r0=0, r1=5. At t=0 action was exploratory; at t=1 action directly earned reward.
- Monte Carlo: $G_0=5, G_1=5$ so both steps get credit 5.
- Causality fix: unchanged here (the earlier reward is zero), but imagine a negative early reward followed by a later positive one: tail returns keep that early penalty out of the credit assigned to the later, good action.
- Baseline (say critic predicts 4): advantages become $(5-4, 5-4)=(1,1)$ — smaller variance across batch vs raw 5.
7. Tiny Pseudocode
for update in range(U):
    trajectories = collect(env, policy, T, N)         # N rollouts of length T
    compute values V(s_t)                             # critic forward pass
    deltas δ_t = r_t + γ V(s_{t+1}) - V(s_t)          # no gradient through V(s_{t+1})
    advantages A_t = discounted_sum(δ_t, γλ)          # GAE
    returns Gλ_t = A_t + V(s_t)                       # critic target, treated as constant
    normalize A_t                                     # zero mean, unit std per batch
    L_actor  = -mean(A_t * log π(a_t|s_t)) - β * entropy
    L_critic = mse(V(s_t), Gλ_t)
    optimizer.zero_grad()
    (L_actor + c * L_critic).backward()
    optimizer.step()
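For concreteness, here is one way the inner update might look in PyTorch for a discrete-action policy, assuming a precollected batch of tensors and combining the loss terms from section 5 with the practical notes above; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def update(actor, critic, optimizer, obs, actions, advantages, returns_lam,
           beta=0.01, value_coef=0.5, max_grad_norm=0.5):
    """One gradient step on a precollected batch (discrete actions).
    advantages and returns_lam are precomputed (e.g. via GAE) and detached."""
    # Normalize advantages per batch (practical note above).
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    dist = torch.distributions.Categorical(logits=actor(obs))
    values = critic(obs).squeeze(-1)

    actor_loss = -(advantages * dist.log_prob(actions)).mean() - beta * dist.entropy().mean()
    critic_loss = F.mse_loss(values, returns_lam)

    optimizer.zero_grad()
    (actor_loss + value_coef * critic_loss).backward()
    torch.nn.utils.clip_grad_norm_(
        list(actor.parameters()) + list(critic.parameters()), max_grad_norm)
    optimizer.step()
```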
8. Where It Fails
- Pure policy gradients struggle on long horizons with sparse rewards (variance explosion).
- Sensitive to reward scaling and poorly tuned entropy coefficient.
- Sample-inefficient compared with off‑policy methods (e.g., DDPG, SAC, TD3), which can reuse past experience.
9. Extensions
- PPO: trust region via clipped objective; reduces need for KL penalties.
- TRPO: solves constrained optimization with a Fisher metric.
- SAC: entropy maximization baked into objective for continuous control.
- IMPALA / V‑trace: importance weighting for distributed collection.
Mental Picture
Imagine each action leaves a faint arrow tugging parameters in the direction of making it more or less likely next time. The strength of the arrow is the (advantage) credit assigned after hindsight. Baselines and GAE just refine how sharply and how far those arrows propagate along the timeline.
Takeaway
The core idea is simple: push up log‑prob of actions that preceded above‑baseline returns, push down the rest. Most “advanced” methods differ in how they reduce variance, reuse data, or constrain destructive jumps.