The four objectives share the trainer, the rollout collector and the value head, and follow the standard PPO configuration of CleanRL and Stable-Baselines3. They differ only in the policy loss.

| shared configuration | value |
|---|---|
| actor-critic | \(64\)-\(64\) \(\tanh\), orthogonal init |
| advantages | GAE, \(\gamma=0.99\), \(\lambda=0.95\) |
| optimiser | Adam, \(3\times10^{-4}\), linear annealing |
| rollout | \(2048\) steps, \(K=10\) epochs, minibatch \(64\) |
| normalisation | observations and rewards |

| variant | trust-region knob |
|---|---|
| PPO-Clip | \(\epsilon = 0.2\) |
| fixed-\(\beta\) PPO-KL | \(\beta = 1\) |
| adaptive-\(\beta\) PPO-KL | target \(D_{\mathrm{KL}} = 0.02\) |
| per-sample PPO-KL | \(\beta_t=-w\hat A\) on \(\mathcal I_{\mathrm{kill}}\) (Thm 1) |

Theorem 1 predicts that PPO-Clip and its per-sample KL twin trace the same learning curve. The lines below are the logged returns, drawn in step with training; the scalar-\(\beta\) baselines fall behind on the tasks that move the policy farthest from initialisation.
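For readers unfamiliar with the adaptive-\(\beta\) baseline, here is a minimal sketch of the coefficient update heuristic described in the original PPO paper: halve \(\beta\) when the measured KL drops below the target divided by 1.5, double it when the KL exceeds 1.5 times the target. Whether the experiments here use exactly these thresholds is an assumption; only the target \(D_{\mathrm{KL}} = 0.02\) comes from the table. The function name `adapt_beta`, the starting value and the KL readings are illustrative.

```python
def adapt_beta(beta: float, measured_kl: float, target_kl: float = 0.02) -> float:
    """One update of the scalar KL coefficient.

    Assumption: the 1.5 tolerance and the factor 2 follow the heuristic
    in the original PPO paper; the source only states the KL target.
    """
    if measured_kl < target_kl / 1.5:
        return beta / 2.0  # policy moved too little: relax the penalty
    if measured_kl > target_kl * 1.5:
        return beta * 2.0  # policy moved too far: tighten the penalty
    return beta            # inside the tolerance band: leave beta alone


# Illustrative run: made-up KL readings after four successive policy updates.
beta = 1.0  # hypothetical starting value
history = []
for kl in [0.05, 0.04, 0.02, 0.005]:
    beta = adapt_beta(beta, kl)
    history.append(beta)

print(history)
```

The point of the sketch is that a single scalar reacts to the batch-averaged KL only, so every sample in the batch receives the same penalty regardless of its own ratio or advantage.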
PPO-Clip and per-sample KL agree on every task. Fixed and adaptive \(\beta\) keep pace with them on the easier tasks but fall behind on Ant and Humanoid, where the largest fraction of the batch enters \(\mathcal I_{\mathrm{kill}}\).

| task | PPO-Clip | per-sample | fixed \(\beta\) | adaptive \(\beta\) |
|---|---|---|---|---|

Mean \(\pm\) std over 5 seeds, computed over the last 10% of training. The small residual gap between PPO-Clip and per-sample KL is numerical: the two losses use different floating-point operations, and the resulting rounding differences accumulate over the run.
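The aggregation described in the caption can be sketched as follows: average the last 10% of each seed's logged returns, then take the mean and standard deviation across seeds. The function names and the toy curves are hypothetical, and the use of the sample standard deviation (dividing by \(n-1\)) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def final_return(curve, tail_fraction=0.1):
    """Mean of the last `tail_fraction` of one seed's logged returns."""
    n_tail = max(1, round(len(curve) * tail_fraction))
    return mean(curve[-n_tail:])


def summarise(curves_per_seed, tail_fraction=0.1):
    """Return (mean, std) across seeds of each seed's final return."""
    finals = [final_return(c, tail_fraction) for c in curves_per_seed]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(finals), stdev(finals)


# Toy curves, NOT results from the experiments: three seeds, ten log points each.
toy_curves = [
    [0.0] * 9 + [10.0],
    [0.0] * 9 + [12.0],
    [0.0] * 9 + [14.0],
]
avg, spread = summarise(toy_curves)
print(f"{avg:.1f} +/- {spread:.1f}")
```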
The penalty \(\beta_t\) is non-zero only on \(\mathcal I_{\mathrm{kill}}\). The peak fraction of the batch it reaches over training grows with task difficulty and exceeds half the batch on Humanoid, exactly where a scalar \(\beta\) cannot keep up.
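To make the kill set concrete, here is a minimal sketch of how its share of a batch could be measured. It assumes \(\mathcal I_{\mathrm{kill}}\) is the set of samples whose PPO-Clip gradient is zeroed by the clip (ratio above \(1+\epsilon\) with positive advantage, or below \(1-\epsilon\) with negative advantage) and that \(w\) denotes the probability ratio; both readings are assumptions, as are the function names and the toy batch.

```python
def kill_mask(ratios, advantages, eps=0.2):
    """True where the clipped surrogate has zero gradient.

    Assumption: this is the kill set; all other samples form the pass set.
    """
    return [
        (a > 0 and w > 1 + eps) or (a < 0 and w < 1 - eps)
        for w, a in zip(ratios, advantages)
    ]


def kill_fraction(ratios, advantages, eps=0.2):
    """Share of the batch that falls in the kill set."""
    mask = kill_mask(ratios, advantages, eps)
    return sum(mask) / len(mask)


def per_sample_beta(ratios, advantages, eps=0.2):
    """beta_t = -w * A_hat on the kill set, zero elsewhere."""
    mask = kill_mask(ratios, advantages, eps)
    return [-w * a if k else 0.0 for w, a, k in zip(ratios, advantages, mask)]


# Toy batch, NOT data from the experiments.
ratios = [1.3, 1.3, 0.7, 0.7, 1.0, 1.1]
advantages = [1.0, -1.0, -1.0, 1.0, 0.5, 2.0]

frac = kill_fraction(ratios, advantages)
betas = per_sample_beta(ratios, advantages)
print(frac, betas)
```

In this toy batch only the first and third samples are clipped in the gradient-killing direction, so the penalty is zero on the other four.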
PPO-Clip (left) and per-sample KL (right) on every task; the two panels are indistinguishable.

Episode return on each MuJoCo task. PPO-Clip and per-sample KL coincide; the scalar baselines trail on Ant and Humanoid.

Fraction of the PPO-Clip batch in \(\mathcal I_{\mathrm{kill}}\) and \(\mathcal I_{\mathrm{pass}}\) over training. The penalty \(\beta_t\) acts only on the kill set.

\(\beta_t\) over training: zero in median, with negative tails on the kill set that widen on the harder tasks.

All of the above is in the full write-up (PDF).