Safa Messaoud, Billel Mokeddem, Zhenghai Xue, Linsey Pang, Bo An, Haipeng Chen, Sanjay Chawla
ICLR (2024)
Expressive stochastic policies are widely believed to offer better stability, sample efficiency, and robustness than deterministic ones. In Maximum Entropy RL (MaxEnt RL), the policy is modeled as an energy-based model over Q-values, but estimating the entropy of such a model remains an open problem. Prior methods either estimate the entropy implicitly, at high computational cost and with high variance (Soft Q-Learning, SQL), or fit a simplified actor distribution such as a Gaussian for tractability (Soft Actor-Critic, SAC). The paper proposes Stein Soft Actor-Critic (S²AC), which uses parameterized Stein Variational Gradient Descent (SVGD) as the policy and derives a closed-form, computationally cheap expression for the policy's entropy that depends only on first-order derivatives and vector products. On a multi-goal task, S²AC finds solutions closer to the optimum of the MaxEnt objective than SQL and SAC, and it outperforms both on the MuJoCo benchmark.
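To make the SVGD component concrete, here is a minimal NumPy sketch of a plain (non-parameterized) SVGD update that moves a set of action particles toward an energy-based target p(a) ∝ exp(Q(a)). It is an illustration under stated assumptions, not the authors' implementation: the quadratic Q-function, the fixed-bandwidth RBF kernel, the step size, the goal location, and all function names are hypothetical choices. It does not reproduce the paper's closed-form entropy expression.

```python
import numpy as np


def grad_q(actions, goal):
    # Gradient of a toy Q(a) = -0.5 * ||a - goal||^2 (assumed for illustration,
    # so that exp(Q) is an isotropic Gaussian centred on `goal`).
    return goal - actions


def svgd_step(actions, goal, step_size=0.1, bandwidth=1.0):
    """One SVGD update on an (n, d) array of action particles."""
    n = actions.shape[0]
    # diffs[i, j] = a_i - a_j, shape (n, n, d)
    diffs = actions[:, None, :] - actions[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # RBF kernel k(a_j, a_i); symmetric in i and j
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    # Attractive term: sum_j k(a_j, a_i) * grad_{a_j} Q(a_j)
    attract = kernel @ grad_q(actions, goal)
    # Repulsive term: sum_j grad_{a_j} k(a_j, a_i) = sum_j k_ij * (a_i - a_j) / bandwidth^2
    repulse = np.sum(kernel[:, :, None] * diffs, axis=1) / bandwidth ** 2
    phi = (attract + repulse) / n
    return actions + step_size * phi


def run_svgd(num_particles=50, dim=2, num_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    goal = np.array([2.0, -1.0])  # hypothetical mode of exp(Q)
    actions = rng.normal(size=(num_particles, dim))  # initial particles from N(0, I)
    for _ in range(num_steps):
        actions = svgd_step(actions, goal)
    return actions, goal


if __name__ == "__main__":
    particles, goal = run_svgd()
    print("particle mean:", particles.mean(axis=0))
    print("particle std: ", particles.std(axis=0))
```

The attractive term pulls particles toward high-Q regions, while the kernel-gradient term pushes them apart so they do not collapse onto a single action. Both terms use only first-order derivatives of Q and of the kernel.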