Policies.PHE module

The PHE (Perturbed-History Exploration) policy for bounded bandits.

  • Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
Policies.PHE.DEFAULT_PERTURBATION_SCALE = 1.0

By default, the perturbation scale \(a\) in PHE is 1, that is, at the current time step \(t\), if arm \(i\) has \(s = T_{i,t-1}\) samples, PHE generates \(s\) pseudo-rewards (of mean \(1/2\)) for that arm.

class Policies.PHE.PHE(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The PHE (Perturbed-History Exploration) policy for bounded bandits.

  • Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
  • They prove that PHE achieves \(\mathcal{O}(K \Delta^{-1} \log(T))\) regret for horizon \(T\), where \(\Delta\) is the minimum gap between the expected rewards of the optimal and suboptimal arms, for \(a > 1\).
  • Note that the limit case of \(a=0\) gives the Follow-the-Leader algorithm (FTL), known to fail.
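
For context, a minimal usage sketch, assuming the usual SMPyBandits policy interface (startGame(), choice(), getReward()) and a hand-written Bernoulli environment used purely for illustration:

```python
import numpy as np
from Policies.PHE import PHE

# Toy 3-armed Bernoulli bandit; the means below are arbitrary, for illustration only.
means = [0.1, 0.5, 0.9]
rng = np.random.default_rng(42)

policy = PHE(nbArms=len(means), perturbation_scale=1.1)
policy.startGame()

horizon = 1000
for t in range(horizon):
    arm = policy.choice()                       # arm with the largest perturbed-history index
    reward = float(rng.random() < means[arm])   # bounded (here Bernoulli) reward in [0, 1]
    policy.getReward(arm, reward)

print("Pulls per arm:", policy.pulls)           # the best arm should dominate
```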
__init__(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
perturbation_scale = None

Perturbation scale, denoted \(a\) in their paper. Should be a float or an int. With \(s\) current samples, \(\lceil a s \rceil\) additional pseudo-rewards are generated.
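For example, with \(a = 2.1\) and \(s = 5\) observed rewards, \(\lceil 2.1 \times 5 \rceil = 11\) pseudo-rewards are added to the history of that arm.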

__str__()[source]

-> str

computeIndex(arm)[source]

Compute a randomized index by adding \(\lceil a s \rceil\) pseudo-rewards (of mean \(1/2\)) to the \(s\) current observations of this arm, and returning the empirical mean of this perturbed history.
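
As a rough illustration of this computation (a minimal sketch, assuming the pseudo-rewards are i.i.d. Bernoulli(\(1/2\)) draws as in the paper; the function name and arguments below are hypothetical, not the library's internals):

```python
import numpy as np

def phe_index(sum_rewards, s, a=1.0, rng=None):
    """Perturbed-history index of one arm: empirical mean of the s observed
    rewards together with ceil(a * s) pseudo-rewards drawn i.i.d. Bernoulli(1/2)."""
    rng = rng if rng is not None else np.random.default_rng()
    if s < 1:
        return float('+inf')  # unexplored arms get an infinite index, so they are pulled first
    nb_pseudo = int(np.ceil(a * s))
    pseudo_sum = rng.binomial(nb_pseudo, 0.5)  # sum of the Bernoulli(1/2) pseudo-rewards
    return (sum_rewards + pseudo_sum) / (s + nb_pseudo)
```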

__module__ = 'Policies.PHE'