On principle of optimality for safety-constrained Markov Decision Process and p-Safe Reinforcement Learning

Rahul Misra*, Rafal Wisniewski*

*Corresponding author

Publication: Contribution to journal › Conference article in journal › Research › peer review


Abstract

We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider an undiscounted safety-constrained Markov decision process subject to random stopping times. The decision maker's goal is to reach a goal state while avoiding unsafe states with certain probabilistic guarantees. The underlying Markov chain is therefore multichain (non-ergodic) for any control policy, since by definition there exist a goal set and an unsafe set. As highlighted by a counterexample, Bellman's principle of optimality does not hold for such a safety-constrained Markov decision process in the multichain setting. We resolve this counterexample by considering a zero-sum game between the policy and the Lagrange multiplier vector. Under suitable assumptions on the existence of an admissible policy, we propose an off-policy RL algorithm for learning an optimal policy that satisfies the probabilistic safety guarantees. We then present a finite-time error bound for the proposed RL algorithm. Lastly, we present simulation results for the algorithm on a robot in a grid-world setting.
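To make the Lagrangian zero-sum-game idea concrete, the sketch below shows a generic primal-dual tabular Q-learning loop on a toy grid world with a goal cell and unsafe cells. This is an illustrative assumption, not the paper's algorithm: the environment, the violation threshold `p`, the learning rates, and the dual-ascent update are all hypothetical choices standing in for the off-policy method the abstract describes. The policy player minimizes the penalized objective while the Lagrange multiplier player ascends on the observed safety-violation frequency.

```python
import numpy as np

# Toy 4x4 grid world: start at (0, 0), goal at (3, 3), two unsafe cells.
# Episodes stop on reaching the goal set or the unsafe set (absorbing sets),
# mirroring the multichain structure in the abstract.
N = 4
GOAL, UNSAFE = (3, 3), {(1, 1), (2, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a, rng):
    # Slip to a random action with small probability (stochastic dynamics).
    if rng.random() < 0.1:
        a = ACTIONS[rng.integers(4)]
    ns = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    reward = 1.0 if ns == GOAL else 0.0
    cost = 1.0 if ns in UNSAFE else 0.0   # indicator of a safety violation
    done = ns == GOAL or ns in UNSAFE
    return ns, reward, cost, done

def train(p=0.1, episodes=3000, alpha=0.2, eta=0.01, gamma=0.99, seed=0):
    # Primal: epsilon-greedy Q-learning on the penalized reward r - lam * c.
    # Dual: projected gradient ascent on lam against the per-episode
    # violation indicator minus the tolerated violation probability p.
    rng = np.random.default_rng(seed)
    Q = np.zeros((N, N, 4))
    lam = 0.0
    for _ in range(episodes):
        s, violated = (0, 0), 0.0
        for _ in range(50):
            a = rng.integers(4) if rng.random() < 0.1 else int(np.argmax(Q[s]))
            ns, r, c, done = step(s, ACTIONS[a], rng)
            target = (r - lam * c) + (0.0 if done else gamma * np.max(Q[ns]))
            Q[s][a] += alpha * (target - Q[s][a])
            violated = max(violated, c)
            s = ns
            if done:
                break
        lam = max(0.0, lam + eta * (violated - p))  # dual ascent, lam >= 0
    return Q, lam
```

As `lam` grows whenever the empirical violation frequency exceeds `p`, the greedy policy is pushed toward routes that avoid the unsafe cells; this is the standard primal-dual mechanism, whereas the paper's contribution lies in making such a scheme sound in the undiscounted multichain setting where Bellman's principle fails.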

Original language: English
Book series: IFAC-PapersOnLine
Volume: 58
Issue number: 17
Pages (from-to): 338-343
Number of pages: 6
ISSN: 2405-8971
DOI
Status: Published - 1 Aug 2024
Event: 26th International Symposium on Mathematical Theory of Networks and Systems, MTNS 2024 - Cambridge, United Kingdom
Duration: 19 Aug 2024 - 23 Aug 2024

Conference

Conference: 26th International Symposium on Mathematical Theory of Networks and Systems, MTNS 2024
Country/Territory: United Kingdom
City: Cambridge
Period: 19/08/2024 - 23/08/2024
Sponsor: International Federation of Automatic Control (IFAC)

Bibliographical note

Publisher Copyright:
Copyright © 2024 The Authors.

