Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback

Publication:6437978

arXiv: 2305.16074; MaRDI QID: Q6437978

Milan Vojnović, Yiliu Wang, Wei Chen

Publication date: 25 May 2023

Abstract: We consider a combinatorial multi-armed bandit problem for maximum value reward function under maximum value and index feedback. This is a new feedback structure that lies in between commonly studied semi-bandit and full-bandit feedback structures. We propose an algorithm and provide a regret bound for problem instances with stochastic arm outcomes according to arbitrary distributions with finite supports. The regret analysis rests on considering an extended set of arms, associated with values and probabilities of arm outcomes, and applying a smoothness condition. Our algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent and an $\tilde{O}(\sqrt{T})$ distribution-independent regret, where $k$ is the number of arms selected in each round, $\Delta$ is a distribution-dependent reward gap, and $T$ is the horizon time. Perhaps surprisingly, the regret bound is comparable to the previously known bound under more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.
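
To make the feedback structure in the abstract concrete, the following is a minimal Python sketch of a single round: the learner selects k arms, each arm draws an outcome from a finite-support distribution, and only the maximum value and the index of an arm attaining it are revealed. This is an illustration, not the authors' algorithm or their repository code. The names `sample_arm`, `max_value_index_feedback`, and `expected_max` are hypothetical, and both the independence of arm outcomes and the tie-breaking rule (smallest index wins) are assumptions made here for the example.

```python
import random
from typing import List, Sequence, Tuple

# An arm is a finite-support distribution: (value, probability) pairs.
Arm = Sequence[Tuple[float, float]]


def sample_arm(arm: Arm, rng: random.Random) -> float:
    """Draw one outcome from a finite-support arm distribution."""
    values = [v for v, _ in arm]
    probs = [p for _, p in arm]
    return rng.choices(values, weights=probs, k=1)[0]


def max_value_index_feedback(
    arms: Sequence[Arm], selected: Sequence[int], rng: random.Random
) -> Tuple[float, int]:
    """Return only (max value, index of an arm attaining it) for the selected set.

    Assumption: ties are broken toward the smallest arm index.
    """
    outcomes: List[Tuple[float, int]] = [
        (sample_arm(arms[i], rng), i) for i in selected
    ]
    best_value, best_index = max(outcomes, key=lambda o: (o[0], -o[1]))
    return best_value, best_index


def expected_max(arms: Sequence[Arm], selected: Sequence[int]) -> float:
    """Exact E[max] of the selected arms, assuming independent outcomes.

    Uses P(max <= v) = product over selected arms of P(X_i <= v).
    """
    support = sorted({v for i in selected for v, _ in arms[i]})
    total, prev_cdf = 0.0, 0.0
    for v in support:
        cdf = 1.0
        for i in selected:
            cdf *= sum(p for x, p in arms[i] if x <= v)
        total += v * (cdf - prev_cdf)
        prev_cdf = cdf
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [[(0.0, 0.5), (1.0, 0.5)], [(0.0, 0.5), (2.0, 0.5)]]
    print(max_value_index_feedback(arms, [0, 1], rng))
    print(expected_max(arms, [0, 1]))
```

Under semi-bandit feedback the learner would see every entry of `outcomes`, and under full-bandit feedback only `best_value`; the structure sketched here reveals `best_value` together with `best_index`, which is why it sits between the two.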




Has companion code repository: https://github.com/sketch-exp/kmax








