Estimating the value of a discounted reward process (Q1196212)

The paper considers a discounted reward process defined by a sequence of random variables \(\{r_t\}\), \(t\in\{1,2,\dots\}\), and a discount factor \(\lambda\) with \(0\leq\lambda<1\). Under the assumption \(|E(r_t)|\leq M<\infty\) for some \(M>0\), the expected total discounted reward, or value, of the discounted reward process is defined by \(f(\lambda)\equiv E\sum_{t=1}^{\infty}\lambda^{t-1} r_t\). When \(t\in\{1,2,\dots,T\}\) for some stopping time \(T\), a terminating reward process results. For all but the simplest models, evaluating \(f(\lambda)\) requires simulation or experimentation. An unbiased estimator of \(f(\lambda)\) is obtained by sampling cumulative sums of the rewards up to independent negative binomial stopping times. When the rewards are positive, this estimator is monotone in the sample variate. Motivated by results of \textit{C. Derman}, \textit{B. L. Fox} and \textit{P. W. Glynn}, the approach is based on a differential equation that relates the expected total discounted return of a reward process to the expected total undiscounted return of the process terminated at a negative binomial stopping time \(T^*\). The practical advantages of this procedure are discussed, for instance in designing experiments in which the cost per experimental unit is high and the cost per time unit of observation is low. Another advantage is that the procedure provides an easily computed estimator of the derivative of \(f(\lambda)\) with respect to \(\lambda\), which describes the sensitivity of \(f(\lambda)\) to changes in \(\lambda\). Additional results concern variance properties of the estimator and simulations.
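
To illustrate the idea of estimating \(f(\lambda)\) from undiscounted cumulative sums, the following is a minimal sketch of the simplest special case only, not the estimator as constructed in the paper. It assumes a geometric stopping time \(T\) (a negative binomial time with a single terminating event) drawn independently of the rewards, with \(P(T\geq t)=\lambda^{t-1}\). Under that assumption, and given enough integrability to exchange expectation and summation, \(E\sum_{t=1}^{T} r_t=\sum_{t=1}^{\infty}P(T\geq t)\,E(r_t)=f(\lambda)\). The names `geometric_stopping_time`, `estimate_value` and `reward_fn` are hypothetical and chosen for this illustration.

```python
import random


def geometric_stopping_time(lam, rng):
    """Draw T on {1, 2, ...} with P(T >= t) = lam ** (t - 1)."""
    t = 1
    # Continue past the current period with probability lam.
    while rng.random() < lam:
        t += 1
    return t


def estimate_value(reward_fn, lam, n_samples, seed=0):
    """Average the undiscounted cumulative reward up to an independent
    geometric stopping time (assumed special case, see lead-in).

    reward_fn(t, rng) returns the reward r_t for period t; in this sketch
    each replication draws a fresh, independent reward path.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        T = geometric_stopping_time(lam, rng)
        # Undiscounted cumulative sum r_1 + ... + r_T for this replication.
        total += sum(reward_fn(t, rng) for t in range(1, T + 1))
    return total / n_samples


if __name__ == "__main__":
    # Constant reward r_t = 1, for which f(lam) = 1 / (1 - lam).
    lam = 0.5
    estimate = estimate_value(lambda t, rng: 1.0, lam, n_samples=20000)
    print(estimate, 1.0 / (1.0 - lam))
```

With positive rewards, each replication's cumulative sum can only grow as the sampled \(T\) grows, which is one way to read the monotonicity property mentioned above. Each replication observes only \(T\) periods, with \(E(T)=1/(1-\lambda)\) under the geometric assumption.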

zbMATH Keywords

discounted reward process

expected total discounted reward

unbiased estimator

sampling cumulative sums of the rewards

independent negative binomial stopping times
