Estimating the value of a discounted reward process (Q1196212)
scientific article; zbMATH DE number 77894
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Estimating the value of a discounted reward process | scientific article; zbMATH DE number 77894 | |
Statements
Estimating the value of a discounted reward process (English)
17 December 1992
The paper considers a discounted reward process defined by a sequence of random variables \(\{r_t\}\), \(t\in\{1,2,\dots\}\), and a discount factor \(\lambda\) with \(0\leq\lambda<1\). Under the assumption \(|E(r_t)|\leq M<\infty\) for some \(M>0\), the expected total discounted reward, or value, of the discounted reward process is defined by \(f(\lambda)\equiv E\sum_{t=1}^{\infty}\lambda^{t-1} r_t\). When \(t\in\{1,2,\dots,T\}\) for some stopping time \(T\), a terminating reward process results. For all but the simplest models, evaluating \(f(\lambda)\) requires simulation or experimentation. An unbiased estimator of \(f(\lambda)\) is obtained by sampling cumulative sums of the rewards up to independent negative binomial stopping times. When the rewards are positive, this estimator is monotone in the sample variate. Motivated by results of \textit{C. Derman}, \textit{B. L. Fox} and \textit{P. W. Glynn}, the approach is based on a differential equation that relates the expected total discounted return of a reward process to the expected total undiscounted return of the process terminated at a negative binomial stopping time \(T^*\). The practical advantages of this procedure are discussed, for instance in designing experiments in which the cost per experimental unit is high and the cost per time unit of observation is low. A further advantage is that the procedure yields an easily computed estimator of the derivative of \(f(\lambda)\) with respect to \(\lambda\), which describes the sensitivity of \(f(\lambda)\) to changes in \(\lambda\). Additional results concern variance properties of the estimator and simulations.
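The link between discounting and random stopping can be illustrated in the simplest special case, a geometric stopping time (a negative binomial time with a single "failure"). If \(T\) is independent of the rewards and \(P(T\geq t)=\lambda^{t-1}\), then \(E\sum_{t=1}^{T} r_t=\sum_{t=1}^{\infty}\lambda^{t-1}E(r_t)=f(\lambda)\), so the undiscounted cumulative sum up to \(T\) is unbiased for \(f(\lambda)\). The Python sketch below shows only this special case and is not the estimator constructed in the paper; the names `sample_geometric_horizon`, `estimate_value` and `reward_path` are hypothetical and introduced purely for illustration.

```python
import random


def sample_geometric_horizon(lam, rng):
    """Draw T with P(T >= t) = lam**(t-1): continue each period with probability lam."""
    t = 1
    while rng.random() < lam:
        t += 1
    return t


def estimate_value(reward_path, lam, n_samples, seed=0):
    """Monte Carlo average of undiscounted reward sums up to a geometric horizon.

    reward_path(horizon, rng) is a hypothetical user-supplied callable that
    returns the rewards r_1, ..., r_horizon of one simulated run as a list.
    The horizon is drawn before, and independently of, the rewards.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must satisfy 0 <= lam < 1")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        horizon = sample_geometric_horizon(lam, rng)
        total += sum(reward_path(horizon, rng))
    return total / n_samples


def constant_rewards(horizon, rng):
    """Toy model with r_t = 1 for all t, so that f(lam) = 1 / (1 - lam)."""
    return [1.0] * horizon


if __name__ == "__main__":
    lam = 0.5
    print("estimate:", estimate_value(constant_rewards, lam, 20000, seed=1))
    print("exact:   ", 1.0 / (1.0 - lam))
```

With constant unit rewards the exact value is \(1/(1-\lambda)\), which gives a quick sanity check on the simulated average. Each run is observed only up to its sampled horizon, and no discount weights are applied to the rewards.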
discounted reward process
expected total discounted reward
unbiased estimator
sampling cumulative sums of the rewards
independent negative binomial stopping times
differential equation
expected total discounted return
expected total undiscounted return
variance properties
simulations