Learning algorithms for Markov decision processes with average cost (Q2753225)
scientific article; zbMATH DE number 1667501
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Learning algorithms for Markov decision processes with average cost | scientific article; zbMATH DE number 1667501 | |
Statements
29 October 2001
relative value iteration algorithm
stability
Learning algorithms for Markov decision processes with average cost (English)
The authors propose, and for the first time give a rigorous convergence analysis of, two \(Q\)-learning algorithms for the optimal average cost control problem. One is a stochastic approximation analogue of the original relative value iteration (RVI) algorithm; the other is a stochastic approximation analogue of a value iteration algorithm based on the stochastic shortest path (SSP) formulation of the problem [cf. \textit{D. Bertsekas} and \textit{J. N. Tsitsiklis}, Neuro-dynamic programming, Athena Scientific, Belmont, MA (1996; Zbl 0924.68163); \textit{V. S. Borkar} and \textit{S. P. Meyn}, SIAM J. Control Optimization 38, 447-469 (2000; Zbl 0990.62071)]. Both synchronous and asynchronous implementations are considered and analyzed by studying the almost sure boundedness of the iterates \(\{Q^n\}\) and the stability of the associated ODE for the \(Q\)-factors. The authors also point out that, since the convergence rate in reinforcement learning is usually slow and state spaces are large, the use of acceleration techniques is worth exploring; moreover, the case of an infinite state space remains open. These may be important directions for future work in this area.