The clipped surrogate and a KL penalty are treated as separate algorithms. We show the clipped gradient is reproduced exactly by a KL penalty, once its coefficient is allowed to change from one sample to the next, so the two train alike.
Proximal Policy Optimization (Schulman et al., 2017) is the default algorithm for on-policy reinforcement learning. At each update it improves the policy inside a trust region around the policy that collected the data, so that no single update moves the policy too far.
The original paper enforces that closeness in two different ways.
PPO-Clip caps how far the probability ratio \(w_t=\pi_{\theta'}/\pi_\theta\) between the new and the old policy may move before the update stops counting.
PPO-KL instead subtracts a penalty proportional to the Kullback–Leibler divergence between the two policies, paying for every unit of drift.
Since 2017 these have been read as separate algorithms, with their own gradients and their own hyperparameters, and a large literature compares them task by task, asking which is better.
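As a concrete reference point, the two per-sample objectives can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the function names are illustrative, `eps=0.2` is a commonly used clip range, and the KL term is assumed to be the single-sample estimate \(-\log w_t\) of the divergence from the old policy to the new one, evaluated on an action drawn from the old policy.

```python
import math


def ratio(logp_new: float, logp_old: float) -> float:
    """Probability ratio w_t = pi_new(a|s) / pi_old(a|s), from log-probabilities."""
    return math.exp(logp_new - logp_old)


def clip_surrogate(w: float, adv: float, eps: float = 0.2) -> float:
    """PPO-Clip per-sample objective: min(w*A, clip(w, 1-eps, 1+eps)*A)."""
    w_clipped = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, w_clipped * adv)


def kl_surrogate(w: float, adv: float, beta: float) -> float:
    """PPO-KL per-sample objective with one scalar coefficient beta.

    Assumption: the KL term is the single-sample estimate -log(w),
    which averages to KL(pi_old || pi_new) over actions from pi_old.
    """
    return w * adv - beta * (-math.log(w))


# A ratio above the band with a positive advantage is capped by the clip ...
capped = clip_surrogate(1.3, 2.0)
# ... while the same ratio with a negative advantage is not.
uncapped = clip_surrogate(1.3, -2.0)
```

The clip has one hyperparameter, `eps`, and the scalar penalty has one, `beta`; neither depends on the individual sample.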
This work shows the two are not alternatives. The clip is itself a KL penalty, applied one sample at a time.
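At a single update, every transition has its ratio \(w_t\) either inside the allowed band \([1-\epsilon,\,1+\epsilon]\) or outside it, and PPO-Clip treats these cases differently. Inside the band a sample contributes its full gradient. Outside it, a sample whose advantage pushes the ratio further out (\(w_t>1+\epsilon\) with \(\hat A_t>0\), or \(w_t<1-\epsilon\) with \(\hat A_t<0\)) is frozen; one whose advantage pulls the ratio back toward the band stays active.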
On the frozen set the clipped gradient is zero. A KL penalty with coefficient \(\beta_t=-w_t\hat A_t\) on those samples, and \(\beta_t=0\) on the rest, drives the gradient to zero on exactly the frozen samples and is inactive elsewhere. Added to the unpenalised surrogate, it reproduces the PPO-Clip gradient sample for sample.
The match is exact and global: it holds at every parameter setting \(\theta'\) and across the whole inner loop, while the original PPO paper argued only a first-order agreement near \(\theta_{\mathrm{old}}\).
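The identity can be checked numerically for a single sample. The sketch below rests on stated assumptions and is not the paper's code: the KL term is taken to be the single-sample estimate \(-\log w_t\), \(\beta_t\) is held constant when differentiating, the gradient is taken with respect to the new log-probability `u`, and all function names and the value of `EPS` are illustrative.

```python
import math

EPS = 0.2  # clip range; an illustrative value, not taken from the text above


def clip_objective(u, u_old, adv, eps=EPS):
    """PPO-Clip objective for one sample, as a function of the new log-prob u."""
    w = math.exp(u - u_old)
    w_clipped = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, w_clipped * adv)


def is_frozen(w, adv, eps=EPS):
    """True where the min selects the clipped (constant) branch."""
    return (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)


def beta(w, adv, eps=EPS):
    """Per-sample coefficient: -w*A on frozen samples, 0 elsewhere."""
    return -w * adv if is_frozen(w, adv, eps) else 0.0


def kl_form_grad(u, u_old, adv, eps=EPS):
    """d/du of  w*A - beta*(-log w)  with beta held constant.

    d(w*A)/du = w*A and d(log w)/du = 1, so the derivative is w*A + beta.
    """
    w = math.exp(u - u_old)
    return w * adv + beta(w, adv, eps)


def clip_grad_fd(u, u_old, adv, eps=EPS, h=1e-6):
    """Central finite difference of the clipped objective with respect to u."""
    return (clip_objective(u + h, u_old, adv, eps)
            - clip_objective(u - h, u_old, adv, eps)) / (2.0 * h)


# Compare the two gradients at ratios inside, above, and below the band,
# for both signs of the advantage (the kink points w = 1 +/- eps are avoided).
u_old = math.log(0.25)
max_gap = 0.0
for w in (0.5, 0.7, 0.9, 1.0, 1.1, 1.3, 2.0):
    for adv in (-1.5, 0.7):
        u = u_old + math.log(w)
        gap = abs(kl_form_grad(u, u_old, adv) - clip_grad_fd(u, u_old, adv))
        max_gap = max(max_gap, gap)

print(f"largest gradient gap: {max_gap:.2e}")
```

The reported gap should sit at the level of finite-difference error. The check covers ratios far from 1 as well as near it, which is the sense in which the match is global and not only first-order.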
Because they share a gradient, the clip and the per-sample KL surrogate train the same way. In the logged returns (mean over five seeds, plotted on a shared scale), the PPO-Clip and per-sample KL curves stay together throughout training.
The scalar baselines and the full four-way comparison are on the experiments page.
The identity lets PPO-Clip be written in three ways that look different but compute the same gradient. All three start from the same reward term \(w_t\hat A_t\) and differ only in how they express the penalty the clip applies.
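The clipped form (Schulman et al., 2017): \(\min\big(w_t\hat A_t,\ \mathrm{clip}(w_t,\,1-\epsilon,\,1+\epsilon)\,\hat A_t\big)\). The \(\min\) hides the penalty; only its effect is visible.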
A dual form (Appendix A): the penalty \(\Phi\) subtracts how far \(w_t\) overshoots the band, on the frozen samples.
The per-sample KL form. PPO-Clip fixes \(\beta_t\) to a step on the boundary; the per-sample view makes that step an explicit choice.
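One way to see that the three forms agree is the elementary identity \(\min(a,b)=a-\max(0,\,a-b)\). The derivation below is a sketch under assumptions: the explicit formula for \(\Phi_t\) is chosen to fit the description above and is not quoted from Appendix A, \(\bar w_t\) and the indicator \(\mathbb{1}_t\) of the frozen set are notation introduced here, and the per-sample KL term is taken to be the single-sample estimate \(-\log w_t\) with \(\beta_t\) held constant under the gradient.

```latex
\[
\begin{aligned}
L^{\mathrm{clip}}_t
  &= \min\big(w_t\hat A_t,\ \bar w_t\hat A_t\big),
  \qquad \bar w_t=\mathrm{clip}(w_t,\,1-\epsilon,\,1+\epsilon) \\
  &= w_t\hat A_t-\underbrace{\max\big(0,\ (w_t-\bar w_t)\,\hat A_t\big)}_{\Phi_t} \\[4pt]
\nabla_{\theta'}L^{\mathrm{clip}}_t
  &= (1-\mathbb{1}_t)\,w_t\hat A_t\,\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t) \\
  &= \nabla_{\theta'}\big[\,w_t\hat A_t-\beta_t\,(-\log w_t)\,\big],
  \qquad \beta_t=-\,w_t\hat A_t\,\mathbb{1}_t .
\end{aligned}
\]
```

Here \(\Phi_t\) is positive exactly when \(w_t\) has left the band in the direction its advantage favours, where it cancels the overshoot and leaves the constant \(\bar w_t\hat A_t\). The last line uses \(\nabla_{\theta'}w_t=w_t\nabla_{\theta'}\log\pi_{\theta'}\) and \(\nabla_{\theta'}\log w_t=\nabla_{\theta'}\log\pi_{\theta'}\).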
For years the clipped and KL-penalised surrogates have been treated as competing algorithms, and a sizeable literature ranks one against the other.
The per-sample identity removes the opposition. PPO-Clip is itself a KL penalty, with the per-sample coefficient \(\beta_t\). The reported advantage of clipping over PPO-KL is an advantage over a single scalar coefficient, which our experiments confirm on the high-dimensional tasks.
What remains to study is the shape of \(\beta_t\). The clip fixes it to a step on the trust-region boundary; soft relaxations, asymmetric and position-aware variants are other choices of \(\beta_t\) in the same surrogate family.
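To make the idea of choosing a shape for \(\beta_t\) concrete, the sketch below writes \(\beta_t=-w_t\hat A_t\,g_t\) with a gate \(g_t\in[0,1]\). The step gate is the one implied by the clip. The logistic gate is a purely hypothetical soft relaxation, included only as an illustration; it is not a variant described in the source. The names `step_gate`, `soft_gate`, `tau`, and the single-sample KL estimate \(-\log w_t\) behind `grad_weight` are all assumptions.

```python
import math


def step_gate(w, adv, eps=0.2):
    """1.0 on the frozen set, 0.0 elsewhere: the gate the clip implies."""
    frozen = (w > 1.0 + eps and adv > 0) or (w < 1.0 - eps and adv < 0)
    return 1.0 if frozen else 0.0


def soft_gate(w, adv, eps=0.2, tau=0.05):
    """Hypothetical relaxation: a logistic ramp of width tau at the boundary."""
    # Signed overshoot past the boundary, in the direction the advantage pushes.
    overshoot = (w - (1.0 + eps)) if adv > 0 else ((1.0 - eps) - w)
    return 1.0 / (1.0 + math.exp(-overshoot / tau))


def beta(w, adv, gate):
    """Per-sample coefficient beta_t = -w*A*g for a given gate g."""
    return -w * adv * gate(w, adv)


def grad_weight(w, adv, gate):
    """Coefficient multiplying grad log pi: w*A + beta_t (beta held constant)."""
    return w * adv + beta(w, adv, gate)


hard = grad_weight(1.3, 2.0, step_gate)   # frozen sample: gradient switched off
soft = grad_weight(1.3, 2.0, soft_gate)   # same sample: gradient damped, not zeroed
```

With the step gate the gradient weight drops to zero the moment a sample crosses the boundary. With the soft gate it decays smoothly, and shrinking `tau` moves the soft gate toward the step.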