Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis

Rui Liu, Anish Gupta, Erfaun Noorani, Pratap Tokekar

University of Maryland, College Park

Abstract

Reinforcement Learning (RL) has shown exceptional performance across various applications, enabling autonomous agents to learn optimal policies through interaction with their environments. However, traditional RL frameworks often face challenges in iteration efficiency and robustness. Risk-sensitive policy gradient methods, which incorporate both expected return and risk measures, have been explored for their ability to yield more robust policies, yet their iteration complexity remains largely underexplored. In this work, we conduct a rigorous iteration complexity analysis for the risk-sensitive policy gradient method, focusing on the REINFORCE algorithm with an exponential utility function. We establish an iteration complexity of \( \mathcal{O}(\epsilon^{-2}) \) to reach an \( \epsilon \)-approximate first-order stationary point (FOSP). Furthermore, we investigate whether risk-sensitive algorithms can achieve better iteration complexity than their risk-neutral counterparts. Our analysis indicates that risk-sensitive REINFORCE can potentially converge faster. To validate our analysis, we empirically evaluate the learning performance and convergence efficiency of the risk-neutral and risk-sensitive REINFORCE algorithms in multiple environments: CartPole, MiniGrid, and Robot Navigation. Empirical results confirm that risk-averse cases can converge and stabilize faster than their risk-neutral counterparts.
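The objective analyzed is the exponential-utility criterion \( J_\beta(\theta) = \frac{1}{\beta}\log \mathbb{E}\big[e^{\beta G(\tau)}\big] \), where \( G(\tau) \) is the discounted return of trajectory \( \tau \), \( \beta < 0 \) gives risk-averse behavior, and \( \beta \to 0 \) recovers the risk-neutral expected return. As a rough illustration of how \( \beta \) enters a REINFORCE-style update, the minimal sketch below scales each trajectory's score function by \( \exp(\beta G)/\beta \), one standard Monte Carlo surrogate for the exponential-utility gradient; the toy chain environment, the tabular softmax policy, the function names, and all hyperparameters are illustrative assumptions, not the exact estimator or experimental setup from the paper.

    import numpy as np

    def softmax(logits):
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def sample_episode(theta, rng, horizon=10):
        """Roll out a tabular softmax policy on a toy 1-D chain:
        action 1 moves right (+1 reward), action 0 moves left (-1 reward).
        Purely illustrative; not one of the paper's environments."""
        states, actions, rewards, s = [], [], [], 0
        for _ in range(horizon):
            probs = softmax(theta[s])
            a = rng.choice(2, p=probs)
            states.append(s)
            actions.append(a)
            rewards.append(1.0 if a == 1 else -1.0)
            s = min(s + 1, theta.shape[0] - 1) if a == 1 else max(s - 1, 0)
        return states, actions, rewards

    def risk_sensitive_reinforce_update(theta, episodes, beta=-0.1, lr=0.05, gamma=0.99):
        """One batch update: each trajectory's score function is scaled by
        exp(beta * G) / beta, a Monte Carlo surrogate for the gradient of
        (1/beta) * log E[exp(beta * G)]; beta -> 0 recovers risk-neutral
        REINFORCE, and a baseline would reduce variance in practice."""
        grad = np.zeros_like(theta)
        for states, actions, rewards in episodes:
            G = sum(gamma**t * r for t, r in enumerate(rewards))  # discounted return
            weight = np.exp(beta * G) / beta                      # exponential-utility weight
            for s, a in zip(states, actions):
                score = -softmax(theta[s])                        # d log pi(a|s) / d logits
                score[a] += 1.0
                grad[s] += weight * score
        return theta + lr * grad / len(episodes)

    rng = np.random.default_rng(0)
    theta = np.zeros((5, 2))  # 5 chain states, 2 actions
    for _ in range(300):
        batch = [sample_episode(theta, rng) for _ in range(16)]
        theta = risk_sensitive_reinforce_update(theta, batch, beta=-0.1)

Setting beta closer to zero in this sketch makes the trajectory weights nearly uniform and the update behaves like standard risk-neutral REINFORCE; more negative beta values penalize low-return trajectories more heavily.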

CartPole
[Figure: CartPole learning curves]

Learning curves for risk-neutral and risk-averse algorithms with varying \( \beta \) values for the CartPole environment. The shaded area indicates standard deviation over 10 runs. When \( \beta = -0.1 \), risk-averse REINFORCE converges faster and achieves higher returns than risk-neutral REINFORCE.

Holonomic Robot Navigation
[Figure: panels (a) Risk-neutral, (b) \( \beta = -0.5 \), (c) \( \beta = -1.0 \), (d) \( \beta = -5.0 \)]

Learning curves for risk-neutral and risk-averse cases with varying \( \beta \) values in the holonomic robot navigation environment. The solid lines represent average returns over 10 runs, while the shaded lines show returns from individual runs. The risk-neutral policy exhibits large deviations and instability, with excessive oscillations. In contrast, the risk-averse policies with \( \beta = -0.5 \) and \( \beta = -1.0 \) demonstrate greater stability, faster convergence, and higher returns. However, when \( \beta = -5.0 \), learning efficiency decreases: the excessively large magnitude of \( \beta \) yields an overly conservative policy that prioritizes obstacle avoidance at the expense of progress toward the goal.

[Figure: sample navigation trajectories]

Sample navigation trajectories comparing risk-neutral and risk-averse policies with varying \( \beta \) values in the holonomic robot navigation environment. The light blue dot represents the starting position, the yellow dot indicates the goal, and the gray dot in between represents the obstacle. The risk-neutral policy exhibits aggressive movements, while the risk-averse policies follow more stable and conservative paths.

MiniGrid
[Figure: panels (a) Risk-neutral, (b) \( \beta = -0.1 \), (c) \( \beta = -0.5 \), (d) \( \beta = -10.0 \)]

The gradient norm of risk-neutral and risk-averse algorithms with varying \( \beta \) values for the MiniGrid environment. The gradient norm decreases more rapidly for the risk-averse algorithm with \( \beta = -0.1 \) and \( \beta = -0.5 \) than for its risk-neutral counterpart, and it exhibits smoother behavior in the risk-averse cases. However, when the magnitude of \( \beta \) becomes overly large (\( \beta = -10.0 \)), the gradient norm becomes large, impeding the learning process.
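The gradient norm plotted here is the quantity bounded by the \( \epsilon \)-FOSP criterion, \( \|\nabla J(\theta)\| \le \epsilon \), so these curves directly track progress toward a first-order stationary point. Below is a minimal sketch of how such a diagnostic can be logged when the policy is a PyTorch module; the PyTorch setup and the helper name `policy_gradient_norm` are illustrative assumptions, not the paper's implementation.

    import torch

    def policy_gradient_norm(policy: torch.nn.Module) -> float:
        """Euclidean norm of the concatenated parameter gradients.
        Call after loss.backward() and before the optimizer step; a run is
        near an epsilon-FOSP once this value stays below the chosen epsilon."""
        total = 0.0
        for p in policy.parameters():
            if p.grad is not None:
                total += p.grad.detach().pow(2).sum().item()
        return total ** 0.5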

[Figure: panels (a) Risk-neutral, (b) \( \beta = -0.1 \), (c) \( \beta = -0.5 \), (d) \( \beta = -10.0 \)]

Learning curves for risk-neutral and risk-averse algorithms with varying \( \beta \) values for the MiniGrid environment. Arrows (\( \uparrow \)) in (a) mark extreme values. The risk-neutral case displays greater variability and more significant extreme values. In contrast, the risk-averse cases with \( \beta = -0.1 \) and \( \beta = -0.5 \) exhibit less variability and fewer extreme values, suggesting that they require fewer episodes to converge and stabilize. However, when \( \beta = -10.0 \), the gradient norm becomes large, producing an oscillatory learning curve that hinders the learning process.

BibTeX


@misc{liu2025efficientrisksensitivepolicygradient,
  title={Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis},
  author={Rui Liu and Anish Gupta and Erfaun Noorani and Pratap Tokekar},
  year={2025},
  eprint={2403.08955},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2403.08955},
}