Policies.Thompson module

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.
  • Reference: [Thompson - Biometrika, 1933].
class Policies.Thompson.Thompson(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.

  • The prior is initially flat, i.e., \(a=\alpha_0=1\) and \(b=\beta_0=1\).

  • The same non-flat prior for every arm can be given with parameters a and b, for instance (see also the usage sketch after this list):

    import numpy as np
    from Policies.Thompson import Thompson

    nbArms = 2
    prior_failures  = a = 100
    prior_successes = b = 50
    policy = Thompson(nbArms, a=a, b=b)
    np.mean([policy.choice() for _ in range(1000)])  # 0.515 ~= 0.5: both arms share the same prior!
    
  • A different prior for each arm can be given with the parameter params_for_each_posterior, for instance:

    import numpy as np
    from Policies.Thompson import Thompson

    nbArms = 2
    params0 = {'a': 10, 'b': 5}   # prior mean 1/3 (a counts failures, b successes, as above)
    params1 = {'a': 5, 'b': 10}   # prior mean 2/3
    params = [params0, params1]
    policy = Thompson(nbArms, params_for_each_posterior=params)
    np.mean([policy.choice() for _ in range(1000)])  # 0.9719 ~= 1: arm 1 is better than arm 0!
    
  • Reference: [Thompson - Biometrika, 1933].
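
The two examples above follow this library's convention, visible in the variable names of the first example: a counts prior failures and b counts prior successes, so the prior mean of an arm is b / (a + b).

As a complement, here is a minimal end-to-end usage sketch. It assumes the standard SMPyBandits policy interface (startGame(), choice(), getReward(arm, reward)) and the pulls attribute; the simulated Bernoulli bandit and its arm means are hypothetical, for illustration only:

    import numpy as np
    from Policies.Thompson import Thompson

    np.random.seed(42)        # for reproducibility
    true_means = [0.3, 0.7]   # hypothetical Bernoulli arm means
    policy = Thompson(nbArms=len(true_means))
    policy.startGame()

    for t in range(10000):
        arm = policy.choice()                               # sample each Beta posterior, play an argmax
        reward = float(np.random.rand() < true_means[arm])  # simulated Bernoulli reward
        policy.getReward(arm, reward)                       # update the pulled arm's posterior

    print(policy.pulls)  # most pulls should go to arm 1, the best arm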

__str__()[source]

Return a str representation of this policy.

computeIndex(arm)[source]

Compute the current index at time t, after \(N_k(t)\) pulls of arm k yielding \(S_k(t)\) rewards equal to 1, by sampling from the Beta posterior of each arm and then choosing uniformly at random among the arms of maximal index:

\[\begin{split}I_k(t) &\sim \mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\end{split}\]
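
For illustration, the sampling step above can be reproduced directly with NumPy. This is a minimal sketch, not the class's actual code; the counts S and N are hypothetical:

    import numpy as np

    def compute_index(S, N, k):
        """Sample I_k(t) ~ Beta(1 + S_k(t), 1 + N_k(t) - S_k(t))."""
        return np.random.beta(1 + S[k], 1 + N[k] - S[k])

    S = np.array([3, 8])    # hypothetical counts of rewards equal to 1
    N = np.array([10, 12])  # hypothetical numbers of pulls
    indexes = np.array([compute_index(S, N, k) for k in range(len(N))])
    choice = np.random.choice(np.flatnonzero(indexes == indexes.max()))  # A(t): uniform tie-breaking
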
__module__ = 'Policies.Thompson'