A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits
Publication: Q6376004
arXiv: 2108.11345 · MaRDI QID: Q6376004
Author name not available
Publication date: 25 August 2021
Abstract: This paper unifies the design and the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem for a class of risk functionals that are continuous and dominant. We prove generalised concentration bounds for these continuous and dominant risk functionals and show that a wide class of popular risk functionals belongs to this class. Using our newly developed analytical toolkits, we analyse the algorithm ρ-MTS (for multinomial distributions) and prove that it admits asymptotically optimal regret bounds among risk-averse algorithms under CVaR, proportional hazard, and other ubiquitous risk measures. More generally, we prove the asymptotic optimality of ρ-MTS for Bernoulli distributions for a class of risk measures known as empirical distribution performance measures (EDPMs); this includes the well-known mean-variance. Numerical simulations show that the regret bounds incurred by our algorithms are reasonably tight vis-à-vis algorithm-independent lower bounds.
Has companion code repository: https://github.com/joel-ql-chang/continuous-rho-ts
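The abstract describes Thompson sampling driven by a risk functional ρ rather than the mean: each arm's multinomial reward distribution gets a Dirichlet posterior, a distribution is sampled per arm, scored with ρ, and the best-scoring arm is pulled. The sketch below illustrates that loop with CVaR as ρ. It is a hypothetical illustration assuming the names `cvar` and `rho_mts_sketch`, a lower-tail CVaR convention (mean of the worst α-fraction of outcomes, so higher is better), and a uniform Dirichlet prior; it is not the authors' implementation (see the companion repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)


def cvar(support, probs, alpha=0.3):
    # Lower-tail CVaR of a discrete distribution: the mean of the worst
    # alpha-fraction of outcome mass (illustrative convention, not the
    # paper's exact definition).
    order = np.argsort(support)
    s, p = support[order], probs[order]
    clipped = np.minimum(np.cumsum(p), alpha)       # mass absorbed so far, capped at alpha
    take = np.diff(np.concatenate(([0.0], clipped)))  # mass taken from each outcome
    return float(np.dot(s, take) / alpha)


def rho_mts_sketch(arm_probs, support, horizon=1000, alpha=0.3):
    # Hypothetical risk-averse Thompson sampling loop in the spirit of
    # rho-MTS: Dirichlet posterior per multinomial arm, sample a
    # distribution per arm, score it with the risk functional (CVaR here),
    # and pull the arm whose sampled distribution scores best.
    K = len(arm_probs)
    counts = np.ones((K, len(support)))  # Dirichlet(1, ..., 1) prior
    pulls = np.zeros(K, dtype=int)
    for _ in range(horizon):
        sampled = [rng.dirichlet(counts[k]) for k in range(K)]
        scores = [cvar(support, q, alpha) for q in sampled]
        a = int(np.argmax(scores))  # higher lower-tail CVaR = less risky
        x = rng.choice(len(support), p=arm_probs[a])  # observe a reward index
        counts[a, x] += 1
        pulls[a] += 1
    return pulls
```

With two Bernoulli-like arms on support {0, 1}, the arm with the heavier upper tail has strictly larger lower-tail CVaR, so the sketch concentrates its pulls on that arm as the posteriors sharpen.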