Policies.DiscountedThompson module

The Discounted Thompson (Bayesian) index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one per arm.
  • Reference: [“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727).

Warning

This is still highly experimental!

class Policies.DiscountedThompson.DiscountedThompson(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.DiscountedBayesianIndexPolicy.DiscountedBayesianIndexPolicy

The DiscountedThompson (Bayesian) index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one per arm.
  • Reference: [“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727).
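A minimal usage sketch follows. Only the constructor signature above is documented on this page; the startGame(), choice() and getReward() calls are assumed from the standard SMPyBandits policy interface inherited from the parent index-policy classes, and the import path, arm means and horizon are purely illustrative.

```python
# Hedged usage sketch: the constructor call matches the signature documented
# above; startGame(), choice() and getReward() are assumed from the parent
# SMPyBandits index-policy classes (not documented on this page).
import numpy as np
from Policies.DiscountedThompson import DiscountedThompson

nbArms = 3
policy = DiscountedThompson(nbArms, gamma=0.95)  # one DiscountedBeta posterior per arm
policy.startGame()

rng = np.random.default_rng(42)
true_means = [0.2, 0.5, 0.8]                     # hypothetical Bernoulli arms
for t in range(1000):
    arm = policy.choice()                        # argmax of the sampled indexes I_k(t)
    reward = float(rng.random() < true_means[arm])
    policy.getReward(arm, reward)                # discounted update of the posterior
```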
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, by sampling from the DiscountedBeta posterior.

\[\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \widetilde{S_k}(t), 1 + \widetilde{F_k}(t)).\end{split}\]
  • It keeps track of \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\), the discounted counts of successes and failures (S and F), for each arm k.
  • But instead of using \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{F_k}(t) = F_k(t)\), they are updated at each time step using the discount factor \(\gamma\):
\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]
\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]
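The sampling step and the discounted updates above can be written as a short self-contained NumPy sketch, independent of the library classes; the arm means and the change-point below are hypothetical, only the update rules follow the equations above.

```python
# Standalone sketch of the equations above (not the library implementation):
# sample I_k(t) ~ Beta(1 + S~_k, 1 + F~_k), play the argmax, then discount
# every arm's counts by gamma and add the new reward to the played arm.
import numpy as np

def sample_and_choose(S, F, rng):
    """I_k(t) ~ Beta(1 + S~_k(t), 1 + F~_k(t)); A(t) = argmax_k I_k(t)."""
    indexes = rng.beta(1.0 + S, 1.0 + F)
    return int(np.argmax(indexes))

def discounted_update(S, F, arm, reward, gamma=0.95):
    """Discount S~ and F~ for every arm, then update the played arm only."""
    S *= gamma                    # S~_k(t+1) = gamma * S~_k(t) for all k
    F *= gamma                    # F~_k(t+1) = gamma * F~_k(t) for all k
    S[arm] += reward              # + r(t) for the played arm A(t)
    F[arm] += 1.0 - reward        # + (1 - r(t)) for the played arm A(t)
    return S, F

rng = np.random.default_rng(0)
K = 3
S, F = np.zeros(K), np.zeros(K)
means = np.array([0.8, 0.5, 0.2])  # hypothetical arms; best arm changes at t = 500
for t in range(1000):
    if t == 500:
        means = means[::-1]        # non-stationary switch: the discount lets the policy adapt
    arm = sample_and_choose(S, F, rng)
    reward = float(rng.random() < means[arm])
    S, F = discounted_update(S, F, arm, reward, gamma=0.95)
```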
__module__ = 'Policies.DiscountedThompson'