Model selection for contextual bandits

From MaRDI portal
Publication:6319823

arXiv: 1906.00531 · MaRDI QID: Q6319823

Author name not available

Publication date: 2 June 2019

Abstract: We introduce the problem of model selection for contextual bandits, where a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \ldots$, where the $m^\star$-th class contains the optimal policy, and we design an algorithm that achieves $\tilde{O}(T^{2/3} d_{m^\star}^{1/3})$ regret with no prior knowledge of the optimal dimension $d_{m^\star}$. The algorithm also achieves regret $\tilde{O}(T^{3/4} + \sqrt{T d_{m^\star}})$, which is optimal for $d_{m^\star} \geq \sqrt{T}$. This is the first model selection result for contextual bandits with non-vacuous regret for all values of $d_{m^\star}$, and, to the best of our knowledge, the first positive result of this type for any online learning setting with partial information. The core of the algorithm is a new estimator for the gap in the best loss achievable by two linear policy classes, which we show admits a convergence rate faster than the rate required to learn the parameters of either class.
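The abstract's central object is an estimator for the gap in the best loss achievable by two nested linear policy classes. The paper's estimator has a faster convergence rate than parameter estimation; the sketch below is only a naive plug-in version on synthetic data (all names, dimensions, and noise levels are illustrative assumptions, not the paper's construction): fit least squares in each nested feature prefix and compare empirical square losses. When the smaller class misses signal the estimated gap is large; when both classes contain the optimum it is near zero, which is the signal a model-selection algorithm needs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: the true reward predictor is linear in the
# first d_star of d_big features (nested classes = feature prefixes).
n, d_big, d_star = 5000, 20, 5
theta = np.zeros(d_big)
theta[:d_star] = rng.normal(size=d_star)

X = rng.normal(size=(n, d_big))
y = X @ theta + 0.1 * rng.normal(size=n)

def best_loss(d):
    """Plug-in estimate of the best square loss achievable by the
    d-dimensional nested class: least squares on the first d features."""
    beta, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
    resid = y - X[:, :d] @ beta
    return np.mean(resid ** 2)

# Gap between a class that misses signal (d=2) and one containing the
# optimum (d=d_star): large.  Gap between two classes that both contain
# the optimum (d=d_star vs d=10): near zero.
gap_small = best_loss(2) - best_loss(d_star)
gap_large = best_loss(d_star) - best_loss(10)
print(gap_small, gap_large)
```

This naive comparison only converges at the rate of least-squares estimation itself; the point of the paper is that the gap can be estimated strictly faster than the parameters of either class, which is what makes the model-selection regret bounds above possible.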

Has companion code repository: https://github.com/akshaykr/oracle_cb

This page was built for publication: Model selection for contextual bandits
