Policies.Thompson module

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.
  • Reference: [Thompson - Biometrika, 1933].
class Policies.Thompson.Thompson(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.

  • The prior is initially flat, i.e., \(a=\alpha_0=1\) and \(b=\beta_0=1\).

  • The same non-flat prior for every arm can be given with parameters a and b, for instance (see also the usage sketch after this list):

    import numpy as np
    from Policies.Thompson import Thompson

    nbArms = 2
    prior_failures  = a = 100
    prior_successes = b = 50
    policy = Thompson(nbArms, a=a, b=b)
    np.mean([policy.choice() for _ in range(1000)])  # 0.515 ~= 0.5: both arms share the same prior!
    
  • A different prior for each arm can be given with the parameter params_for_each_posterior, for instance:

    import numpy as np
    from Policies.Thompson import Thompson

    nbArms = 2
    params0 = {'a': 10, 'b': 5}   # prior mean 1/3 (a counts failures, b successes, as above)
    params1 = {'a': 5, 'b': 10}   # prior mean 2/3
    params = [params0, params1]
    policy = Thompson(nbArms, params_for_each_posterior=params)
    np.mean([policy.choice() for _ in range(1000)])  # 0.9719 ~= 1: arm 1 is better than arm 0!
    
  • Reference: [Thompson - Biometrika, 1933].
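
The two examples above follow this library's convention, visible in the variable names of the first example: a counts prior failures and b counts prior successes, so the prior mean of an arm is b / (a + b).

As a complement, here is a minimal end-to-end usage sketch. It assumes the standard SMPyBandits policy interface (startGame(), choice(), getReward(arm, reward)) and the pulls attribute; the simulated Bernoulli bandit and its arm means are hypothetical, for illustration only:

    import numpy as np
    from Policies.Thompson import Thompson

    np.random.seed(42)        # for reproducibility
    true_means = [0.3, 0.7]   # hypothetical Bernoulli arm means
    policy = Thompson(nbArms=len(true_means))
    policy.startGame()

    for t in range(10000):
        arm = policy.choice()                               # sample each Beta posterior, play an argmax
        reward = float(np.random.rand() < true_means[arm])  # simulated Bernoulli reward
        policy.getReward(arm, reward)                       # update the pulled arm's posterior

    print(policy.pulls)  # most pulls should go to arm 1, the best arm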

__str__()[source]

Return a str representation of this policy.

computeIndex(arm)[source]

Compute the current index at time t, after \(N_k(t)\) pulls of arm k yielding \(S_k(t)\) rewards equal to 1, by sampling from the Beta posterior of each arm and then choosing uniformly at random among the arms of maximal index:

\[\begin{split}I_k(t) &\sim \mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\end{split}\]
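
For illustration, the sampling step above can be reproduced directly with NumPy. This is a minimal sketch, not the class's actual code; the counts S and N are hypothetical:

    import numpy as np

    def compute_index(S, N, k):
        """Sample I_k(t) ~ Beta(1 + S_k(t), 1 + N_k(t) - S_k(t))."""
        return np.random.beta(1 + S[k], 1 + N[k] - S[k])

    S = np.array([3, 8])    # hypothetical counts of rewards equal to 1
    N = np.array([10, 12])  # hypothetical numbers of pulls
    indexes = np.array([compute_index(S, N, k) for k in range(len(N))])
    choice = np.random.choice(np.flatnonzero(indexes == indexes.max()))  # A(t): uniform tie-breaking
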
__module__ = 'Policies.Thompson'