Policies.UCBVtuned module

The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].

class Policies.UCBVtuned.UCBVtuned(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBV.UCBV

The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ V'_k(t) &= V_k(t) + \sqrt{\frac{2 \log(t)}{N_k(t)}}, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{\log(t) V'_k(t)}{N_k(t)}}.\end{split}\]

Where \(V'_k(t)\) is another estimator of the variance of the rewards, computed from \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\), the sum of rewards from arm k, and \(Z_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)^2\), the sum of squared rewards from arm k.
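The index formula above can be sketched as a standalone function. This is an illustrative reimplementation, not the library's actual method: the argument names `sum_rewards`, `sum_squared`, `pulls`, and `t` stand for \(X_k(t)\), \(Z_k(t)\), \(N_k(t)\), and the time step, and the convention of giving unpulled arms an infinite index is assumed.

```python
from math import log, sqrt

def ucbv_tuned_index(sum_rewards, sum_squared, pulls, t):
    """Sketch of the UCBV-Tuned index of one arm.

    sum_rewards = X_k(t), sum_squared = Z_k(t), pulls = N_k(t).
    """
    if pulls == 0:
        return float('inf')  # assumed: unpulled arms are tried first
    mean = sum_rewards / pulls                      # \hat{\mu}_k(t)
    variance = sum_squared / pulls - mean ** 2      # V_k(t), empirical variance
    tuned = variance + sqrt(2.0 * log(t) / pulls)   # V'_k(t), tuned correction
    return mean + sqrt(log(t) * tuned / pulls)      # I_k(t)
```

For example, with \(X_k(t) = 5\), \(Z_k(t) = 3\), \(N_k(t) = 10\) and \(t = 100\), the empirical mean is 0.5 and the index is roughly 1.18.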

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.
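A vectorized computation of all indexes at once could look like the following sketch (not the library's exact implementation): the per-arm statistics are NumPy arrays, and arms with zero pulls are assumed to receive an infinite index, as in the scalar case.

```python
import numpy as np

def ucbv_tuned_all_indexes(sum_rewards, sum_squared, pulls, t):
    """Sketch of a vectorized UCBV-Tuned index computation.

    sum_rewards, sum_squared, pulls: per-arm arrays X_k(t), Z_k(t), N_k(t).
    """
    with np.errstate(divide='ignore', invalid='ignore'):
        means = sum_rewards / pulls                           # \hat{\mu}_k(t)
        variances = sum_squared / pulls - means ** 2          # V_k(t)
        tuned = variances + np.sqrt(2.0 * np.log(t) / pulls)  # V'_k(t)
        indexes = means + np.sqrt(np.log(t) * tuned / pulls)  # I_k(t)
    indexes[pulls < 1] = float('inf')  # assumed: unpulled arms come first
    return indexes
```

Suppressing the divide/invalid warnings with `np.errstate` keeps the division clean for arms with `pulls == 0`; those entries are overwritten with `inf` afterwards.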

__module__ = 'Policies.UCBVtuned'