KLip-PPO: A per-sample KL perspective on PPO-Clip

Riccardo Colletti & Robin Holzinger
// KLip-PPO

PPO-Clip is a KL penalty whose coefficient varies per sample.

The clipped surrogate and the KL penalty are usually treated as separate algorithms. We show that the clipped gradient is reproduced exactly by a KL penalty once its coefficient is allowed to change from one sample to the next, so the two train alike:

Figure: real logged returns on CartPole-v1.

Proximal Policy Optimization (Schulman et al., 2017) is the default algorithm for on-policy reinforcement learning. At each update it improves the policy while keeping it close to the policy that collected the data. This closeness constraint acts as a trust region: no single update may move the policy too far.

// 01  two surrogates, one trust region

The original paper enforces that closeness in two different ways.

PPO-Clip caps how far the probability ratio \(w_t=\pi_{\theta'}(a_t\mid s_t)/\pi_\theta(a_t\mid s_t)\) between the new policy \(\pi_{\theta'}\) and the old policy \(\pi_\theta\) may move before the update stops counting.

PPO-KL instead subtracts a penalty proportional to the Kullback–Leibler divergence between the two policies, paying for every unit of drift.
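The two surrogates can be compared on a single sample. The sketch below is a minimal illustration, not the authors' code: the function names, the default \(\epsilon=0.2\), the scalar coefficient `beta`, and the precomputed KL estimate `kl` are all assumptions made for the example.

```python
def clipped_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    w_clipped = max(1.0 - eps, min(w, 1.0 + eps))
    return min(w * adv, w_clipped * adv)


def kl_penalised_term(w, adv, kl, beta=1.0):
    """Per-sample PPO-KL term: w * A minus a scalar-weighted KL estimate.

    `kl` is assumed to be a KL estimate supplied by the caller, and
    `beta` a single scalar shared by all samples (both hypothetical).
    """
    return w * adv - beta * kl


if __name__ == "__main__":
    # With a positive advantage the clipped term stops growing past 1 + eps,
    # while the penalised term keeps growing and pays for drift through `kl`.
    for w in (1.0, 1.1, 1.5, 2.0):
        print(w, clipped_term(w, adv=1.0), kl_penalised_term(w, adv=1.0, kl=0.1))
```

With a positive advantage, the clipped term is flat for every ratio above \(1+\epsilon\), which is the "stops counting" behaviour described above.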

PPO-Clip: the surrogate rises with the ratio \(w_t\), then flattens once \(w_t\) leaves the band, so past the edge the gradient is clipped.
PPO-KL: the penalty grows with the divergence \(D_{\mathrm{KL}}(\pi_\theta\,\|\,\pi_{\theta'})\), pulling \(\pi_{\theta'}\) back toward \(\pi_\theta\).

Since 2017 these have been read as separate algorithms, with their own gradients and their own hyperparameters, and a large literature compares them task by task, asking which is better.

This work shows the two are not alternatives. The clip is itself a KL penalty, applied one sample at a time.

Figure: PPO-Clip training loop.

// 02  the per-sample identity

At a single update, every transition's ratio \(w_t\) lies either inside the allowed band \([1-\epsilon,\,1+\epsilon]\) or outside it. Outside the band the sign of the advantage \(\hat A_t\) also matters, so PPO-Clip distinguishes three cases:

  • inside the band (\(\mathcal I_{\mathrm{in}}\)): the clip is inactive and the gradient is the ordinary policy gradient.
  • outside, with the ratio already moved in the direction the advantage rewards (\(\mathcal I_{\mathrm{kill}}\): \(w_t>1+\epsilon\) with \(\hat A_t>0\), or \(w_t<1-\epsilon\) with \(\hat A_t<0\)): the clip freezes the gradient.
  • outside, with the ratio moved against the advantage, so the update corrects a harmful move (\(\mathcal I_{\mathrm{pass}}\)): the unclipped term stays active.
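The three cases can be written as a small classifier. This is a sketch under stated assumptions: the function name `classify`, the string labels, and the default \(\epsilon=0.2\) are illustrative choices, and a sample with zero advantage outside the band is filed under "pass" because its gradient vanishes either way.

```python
def classify(w, adv, eps=0.2):
    """Return 'in', 'kill', or 'pass' for one sample with ratio w and advantage adv."""
    if 1.0 - eps <= w <= 1.0 + eps:
        return "in"  # clip inactive, ordinary policy gradient
    # Outside the band, the clip bites only when the ratio has already
    # moved in the direction the advantage rewards.
    if (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0):
        return "kill"  # gradient frozen
    return "pass"  # unclipped term stays active


if __name__ == "__main__":
    for w, adv in [(1.05, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0), (0.5, 1.0)]:
        print(w, adv, classify(w, adv))
```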

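On the frozen set \(\mathcal I_{\mathrm{kill}}\) the clipped gradient is zero. Since \(\nabla_{\theta'}(w_t\hat A_t)=w_t\hat A_t\,\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\), a KL penalty with coefficient \(\beta_t=-w_t\hat A_t\), held constant when differentiating, cancels the gradient on exactly those samples; with \(\beta_t=0\) elsewhere it is inactive on the rest. Added to the unpenalised surrogate, it reproduces the PPO-Clip gradient sample for sample: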

Theorem 1 · per-sample gradient identity
\[ \nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \nabla_{\theta'}\,\mathbb{E}_t\!\big[\, w_t\hat A_t \;+\; \beta_t \,\log \pi_{\theta'}(a_t\mid s_t) \,\big], \qquad \beta_t = -\,w_t\hat A_t \,\mathbb{1}\!\left[t\in\mathcal I_{\mathrm{kill}}\right]. \]
Where the clip acts. Over the plane of ratio \(w_t\) and advantage \(\hat A_t\), the coefficient \(\beta_t=-w_t\hat A_t\) lives on the two kill corners (red); in the band and on the pass corners \(\beta_t=0\).

The match is exact and global: it holds at every parameter setting \(\theta'\) and across the whole inner loop, whereas the original PPO paper argued only a first-order agreement near the old parameters \(\theta\). The full statement and proof are given separately.
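The identity can be checked numerically on single samples. The sketch below is an illustration under stated assumptions, not the authors' code: it differentiates with respect to the log-probability \(\log\pi_{\theta'}(a_t\mid s_t)\) by central finite differences (the parameter gradient then follows by the chain rule), it treats \(\beta_t\) as a constant when differentiating, and all function names and the default \(\epsilon=0.2\) are hypothetical.

```python
import math


def clipped_term(logp, logp_old, adv, eps=0.2):
    """Per-sample PPO-Clip term as a function of the new log-probability."""
    w = math.exp(logp - logp_old)
    w_clipped = max(1.0 - eps, min(w, 1.0 + eps))
    return min(w * adv, w_clipped * adv)


def beta_coefficient(logp, logp_old, adv, eps=0.2):
    """beta_t = -w_t * A_t on the kill set, 0 elsewhere."""
    w = math.exp(logp - logp_old)
    kill = (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)
    return -w * adv if kill else 0.0


def penalised_term(logp, logp_old, adv, beta):
    """w * A + beta * log pi, with beta passed in as a fixed number
    (assumption: beta is held constant, i.e. not differentiated)."""
    return math.exp(logp - logp_old) * adv + beta * logp


def derivative(f, x, h=1e-6):
    """Central finite difference of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def gradients(logp, logp_old, adv, eps=0.2):
    """Return (clipped gradient, per-sample-KL gradient) w.r.t. logp."""
    beta = beta_coefficient(logp, logp_old, adv, eps)
    g_clip = derivative(lambda x: clipped_term(x, logp_old, adv, eps), logp)
    g_kl = derivative(lambda x: penalised_term(x, logp_old, adv, beta), logp)
    return g_clip, g_kl


if __name__ == "__main__":
    logp_old = -1.0
    # One sample per case; ratios are kept away from the band edges,
    # where the clipped term has a kink.
    for w, adv in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0), (0.5, 1.0)]:
        print(w, adv, gradients(logp_old + math.log(w), logp_old, adv))
```

The two derivatives should agree up to finite-difference error in every case, and both should vanish on the kill samples.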

// 03  learning curves

Because they share a gradient, the clip and the per-sample KL surrogate train the same way. The two panels below are the logged returns (mean over five seeds) for PPO-Clip and per-sample KL, on a shared scale, drawn in step with training. Selecting a task animates both; the curves stay together throughout.

The scalar baselines and the full four-way comparison are on the experiments page.

// 04  one gradient, three forms

The identity lets PPO-Clip be written in three ways that look different but compute the same gradient. All three start from the same reward term \(w_t\hat A_t\) and differ only in how they express the penalty the clip applies.

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\min(\,w_t\hat A_t,\ \mathrm{clip}(w_t,1-\epsilon,1+\epsilon)\,\hat A_t\,)\big] \]

Schulman et al., 2017. The \(\min\) hides the penalty; only its effect is visible.

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\,w_t\hat A_t - \Phi(w_t,\hat A_t)\,\big] \]

A dual form (Appendix A): the penalty \(\Phi\) subtracts how far \(w_t\) overshoots the band, on the frozen samples.
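Appendix A is not reproduced here, so the sketch below uses the \(\Phi\) implied by equating the first two forms, \(\Phi(w_t,\hat A_t)=w_t\hat A_t-\min(w_t\hat A_t,\ \mathrm{clip}(w_t,1-\epsilon,1+\epsilon)\,\hat A_t)\), which works out to the overshoot beyond the band edge times \(|\hat A_t|\) on the frozen samples. The function names and the default \(\epsilon=0.2\) are assumptions for illustration.

```python
def clipped_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    w_clipped = max(1.0 - eps, min(w, 1.0 + eps))
    return min(w * adv, w_clipped * adv)


def phi(w, adv, eps=0.2):
    """Overshoot penalty: distance of w beyond the band edge times |A|,
    non-zero only on the frozen (kill) samples."""
    if w > 1.0 + eps and adv > 0:
        return (w - (1.0 + eps)) * adv
    if w < 1.0 - eps and adv < 0:
        return ((1.0 - eps) - w) * (-adv)
    return 0.0


if __name__ == "__main__":
    # w * A - phi should reproduce the clipped term on every sample.
    for w in (0.5, 0.9, 1.0, 1.1, 1.5):
        for adv in (-2.0, -1.0, 1.0, 2.0):
            print(w, adv, w * adv - phi(w, adv), clipped_term(w, adv))
```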

\[ L_{\mathrm{CLIP}} = \mathbb{E}_t\big[\,w_t\hat A_t + \beta_t\,\log\pi_{\theta'}(a_t\mid s_t)\,\big],\quad \beta_t=-\,w_t\hat A_t\,\mathbb{1}\!\left[t\in\mathcal I_{\mathrm{kill}}\right] \]

The per-sample KL form. PPO-Clip fixes \(\beta_t\) to a step on the boundary; the per-sample view makes that step an explicit choice.

// 05  consequences

For years the clipped and KL-penalised surrogates have been treated as competing algorithms, and a sizeable literature ranks one against the other.

The per-sample identity removes the opposition. PPO-Clip is itself a KL penalty, with the per-sample coefficient \(\beta_t\). The reported advantage of clipping over PPO-KL is an advantage over a single scalar coefficient, which our experiments confirm on the high-dimensional tasks.

What remains to study is the shape of \(\beta_t\). The clip fixes it to a step on the trust-region boundary; soft relaxations, asymmetric variants, and position-aware variants are other choices of \(\beta_t\) in the same surrogate family.