Learning algorithms for Markov decision processes with average cost (Q2753225)
scientific article; zbMATH DE number 1667501
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Learning algorithms for Markov decision processes with average cost | scientific article; zbMATH DE number 1667501 | |
Statements
29 October 2001
relative value iteration algorithm
stability
Learning algorithms for Markov decision processes with average cost (English)
The authors propose, and for the first time give a rigorous convergence analysis of, two \(Q\)-learning algorithms for the optimal average cost control problem. One is a stochastic approximation analogue of the original relative value iteration (RVI) algorithm; the other is a stochastic approximation analogue of a value iteration algorithm based on the stochastic shortest path (SSP) formulation of the problem [cf. \textit{D. Bertsekas} and \textit{J. N. Tsitsiklis}, Neuro-dynamic programming, Athena Scientific, Belmont, MA (1996; Zbl 0924.68163); \textit{V. S. Borkar} and \textit{S. P. Meyn}, SIAM J. Control Optimization 38, 447-469 (2000; Zbl 0990.62071)]. Both synchronous and asynchronous implementations are considered and analyzed by studying the almost sure boundedness of the iterates \(\{Q^n\}\) and the stability of the associated ODE for the \(Q\)-factors. The authors also point out that, since the convergence rate in reinforcement learning is usually slow and state spaces are large, the use of acceleration techniques is worth exploring; moreover, the case of an infinite state space remains open. These may be important directions for future work in this area.