Batch policy learning in average reward Markov decision processes

Zhengling Qi, Runzhe Wan, Peng Liao, Predrag Klasnja, Susan A. Murphy

Publication date: 12 January 2023

Published in: The Annals of Statistics

Full work available at URL: https://arxiv.org/abs/2007.11771

zbMATH Keywords

Markov decision process; average reward; doubly robust estimator; policy optimization

Mathematics Subject Classification ID

Nonparametric estimation (62G05)

Related Items (3)

A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets
Off-policy evaluation in partially observed Markov decision processes under sequential ignorability
Projected state-action balancing weights for offline reinforcement learning

Uses Software

L-BFGS
Spearmint

Cites Work

Doubly robust policy evaluation and optimization
Dynamic treatment regimes: technical challenges and applications
Model selection in reinforcement learning
On the limited memory BFGS method for large scale optimization
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
Kernel-based reinforcement learning
The landscape of empirical risk for nonconvex losses
Statistical consistency and asymptotic normality for high-dimensional robust \(M\)-estimators
Learning Algorithms for Markov Decision Processes with Average Cost
Semiparametric efficiency bounds
Support Vector Machines
Asymptotic Statistics
Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
Marginal Mean Models for Dynamic Regimes
Constructing dynamic treatment regimes over indefinite time horizons
10.1162/1532443041827907
A Robust Method for Estimating Optimal Treatment Regimes
Double/debiased machine learning for treatment and structural parameters
Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning
New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes
Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions
Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health
Resampling‐based confidence intervals for model‐free robust inference on optimal treatment regimes
