Experiments

Five MuJoCo tasks, with CartPole and LunarLander as sanity checks; five seeds each.

// 01  setup

The four objectives share the trainer, the rollout collector and the value head, and follow the standard PPO configuration of CleanRL and Stable-Baselines3. They differ only in the policy loss.

shared configuration     value
actor-critic             \(64\)-\(64\) \(\tanh\), orthogonal init
advantages               GAE, \(\gamma=0.99\), \(\lambda=0.95\)
optimiser                Adam, \(3\times10^{-4}\), linear annealing
rollout                  \(2048\) steps, \(K=10\) epochs, minibatch \(64\)
normalisation            observations and rewards

variant                  trust-region knob
PPO-Clip                 \(\epsilon = 0.2\)
fixed-\(\beta\) PPO-KL          \(\beta = 1\)
adaptive-\(\beta\) PPO-KL       target \(D_{\mathrm{KL}} = 0.02\)
per-sample PPO-KL        \(\beta_t=-w\hat A\) on \(\mathcal I_{\mathrm{kill}}\) (Thm 1)
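
The setup gives only the KL target for the adaptive-\(\beta\) baseline, not the rule that moves \(\beta\). A minimal sketch, assuming the doubling and halving heuristic from the original PPO paper (the thresholds 1.5 and the factor 2 are that paper's defaults, not values confirmed here; `update_beta` is a hypothetical name):

```python
def update_beta(beta: float, observed_kl: float, target_kl: float = 0.02) -> float:
    """Adapt the scalar KL coefficient after one policy update.

    Assumed rule (not stated in the setup table): double beta when the
    measured KL overshoots the target by 1.5x, halve it when it undershoots
    by the same factor, and leave it alone otherwise.
    """
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0  # policy moved too far: penalise harder
    if observed_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta


# Example: starting from the fixed-beta value of 1.
beta = update_beta(1.0, observed_kl=0.05)  # overshoot, so beta is doubled
```

Under this rule a single scalar reacts to the batch-average KL, which is the contrast with the per-sample \(\beta_t\) in the last row.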

// 02  learning curves

Theorem 1 predicts that PPO-Clip and its per-sample KL twin trace the same learning curve. The curves plot the logged returns against training steps; the scalar-\(\beta\) baselines fall behind on the tasks that move the policy farthest from initialisation.

// 03  final return

PPO-Clip and per-sample KL agree on every task. The fixed and adaptive \(\beta\) baselines match them on the easier tasks but fall behind on Ant and Humanoid, where the largest fraction of the batch enters \(\mathcal I_{\mathrm{kill}}\).

task     PPO-Clip     per-sample     fixed \(\beta\)     adaptive \(\beta\)

Mean \(\pm\) std over 5 seeds, last 10% of training. The small residual gap between PPO-Clip and per-sample KL is numerical: the two losses use different floating-point operations, and the resulting rounding differences accumulate over the run.
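
One way to compute such a table entry is sketched below. It assumes the returns are stored as an array of shape (seeds, logged evaluations), that each seed is averaged over its last 10% of entries first, and that the reported spread is the population standard deviation across seeds; none of these details is stated in the caption, and `final_return` is a hypothetical helper.

```python
import numpy as np


def final_return(returns: np.ndarray, tail: float = 0.10) -> tuple[float, float]:
    """Mean and std across seeds of the tail-averaged return.

    returns: shape (n_seeds, n_evals), one row of logged returns per seed.
    tail:    fraction of the run, counted from the end, to average over.
    """
    n_evals = returns.shape[1]
    k = max(1, int(round(tail * n_evals)))    # number of trailing entries
    per_seed = returns[:, -k:].mean(axis=1)   # one scalar per seed
    # Assumption: population std (ddof=0); use ddof=1 for the sample std.
    return float(per_seed.mean()), float(per_seed.std())


# Synthetic example with 5 seeds and 10 logged evaluations each.
demo = np.tile(np.arange(10.0), (5, 1)) + np.arange(5.0)[:, None]
mean, std = final_return(demo)
```

Averaging within each seed before taking the spread keeps the std a between-seed quantity rather than mixing in within-run noise.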

// 04  the kill fraction

The penalty \(\beta_t\) is non-zero only on \(\mathcal I_{\mathrm{kill}}\). The peak fraction of the batch that falls in \(\mathcal I_{\mathrm{kill}}\) over training grows with task difficulty and exceeds half the batch on Humanoid, exactly where a scalar \(\beta\) cannot keep up.

peak fraction of the PPO-Clip minibatch in \(\mathcal I_{\mathrm{kill}}\)
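
The fraction can be measured directly from a minibatch. The sketch below assumes that \(\mathcal I_{\mathrm{kill}}\) is the set of samples on which the PPO-Clip objective has zero gradient, that is, where the ratio \(w\) has already left the clip interval in the direction the advantage favours; this reading is inferred from the name and is not a definition quoted from the paper. `kill_mask` and `kill_fraction` are hypothetical names.

```python
import numpy as np


def kill_mask(ratio, adv, eps: float = 0.2) -> np.ndarray:
    """Boolean mask of samples whose PPO-Clip gradient is zeroed by the clip.

    Assumed definition of the kill set:
      adv > 0 and ratio > 1 + eps, or adv < 0 and ratio < 1 - eps.
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    too_high = (adv > 0) & (ratio > 1.0 + eps)
    too_low = (adv < 0) & (ratio < 1.0 - eps)
    return too_high | too_low


def kill_fraction(ratio, adv, eps: float = 0.2) -> float:
    """Share of the minibatch that lies in the (assumed) kill set."""
    return float(kill_mask(ratio, adv, eps).mean())


# Toy minibatch: the first and fourth samples are clipped.
w = [1.3, 1.3, 0.7, 0.7, 1.0]
a = [1.0, -1.0, 1.0, -1.0, 1.0]
frac = kill_fraction(w, a)
```

Tracking the maximum of this quantity over all minibatches of a run would give a peak value of the kind plotted here.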

// 05  figures from the paper

PPO-Clip vs per-sample, task by task

PPO-Clip (left) and per-sample KL (right) on every task; the two panels are indistinguishable.

return, four variants

Episode return on each MuJoCo task. PPO-Clip and per-sample KL coincide; the scalar baselines trail on Ant and Humanoid.

the clipping partition

Fraction of the PPO-Clip batch in \(\mathcal I_{\mathrm{kill}}\) and \(\mathcal I_{\mathrm{pass}}\) over training. The penalty \(\beta_t\) acts only on the kill set.

the per-sample coefficient \(\beta_t\)

\(\beta_t\) over training: zero in median, with negative tails on the kill set that widen on the harder tasks.

All of the above is in the full write-up (PDF).