Policies.UCBdagger module

The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB obtained by auto-tuning the confidence level.

  • Reference: [[Auto-tuning the Confidence Level for Optimistic Bandit Strategies, Lattimore, unpublished, 2017]](http://tor-lattimore.com/)
Policies.UCBdagger.ALPHA = 1

Default value for the parameter \(\alpha > 0\) for UCBdagger.

Policies.UCBdagger.log_bar(x)[source]

The function defined as \(\mathrm{l\overline{og}}\) by Lattimore:

\[\mathrm{l\overline{og}}(x) := \log\left((x+e)\sqrt{\log(x+e)}\right)\]
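For reference, here is a minimal NumPy sketch of this function, following the formula above directly (an assumed implementation, not necessarily the module's exact code):

>>> import numpy as np
>>> def log_bar(x):
...     # log-bar(x) = log((x + e) * sqrt(log(x + e)))
...     return np.log((x + np.e) * np.sqrt(np.log(x + np.e)))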

Some values:

>>> for x in np.logspace(0, 7, 8):
...     print("x = {:<5.3g} gives log_bar(x) = {:<5.3g}".format(x, log_bar(x)))
x = 1     gives log_bar(x) = 1.45
x = 10    gives log_bar(x) = 3.01
x = 100   gives log_bar(x) = 5.4
x = 1e+03 gives log_bar(x) = 7.88
x = 1e+04 gives log_bar(x) = 10.3
x = 1e+05 gives log_bar(x) = 12.7
x = 1e+06 gives log_bar(x) = 15.1
x = 1e+07 gives log_bar(x) = 17.5

Illustration:

>>> import matplotlib.pyplot as plt
>>> X = np.linspace(0, 1000, 2000)
>>> Y = log_bar(X)
>>> plt.plot(X, Y)
>>> plt.title(r"The $\mathrm{l\overline{og}}$ function")
>>> plt.show()
Policies.UCBdagger.Ki_function(pulls, i)[source]

Compute the \(K_i(t)\) index as defined in the article, for one arm i.
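A minimal sketch of this computation, assuming pulls is the vector of pull counts \(N_j(t)\) and every arm has been pulled at least once (a hypothetical helper mirroring the formula \(K_i(t) = \sum_j \min(1, \sqrt{N_j(t)/N_i(t)})\) used in the index below):

>>> import numpy as np
>>> def Ki_function(pulls, i):
...     # K_i(t) = sum_j min(1, sqrt(N_j(t) / N_i(t))); assumes pulls[i] >= 1
...     return sum(min(1.0, np.sqrt(pulls[j] / pulls[i])) for j in range(len(pulls)))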

Policies.UCBdagger.Ki_vectorized(pulls)[source]

Compute the \(K_i(t)\) index as defined in the article, for all arms (in a vectorized manner).

Warning

I didn’t find a fast vectorized formula, so don’t use this one.
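For comparison, one straightforward NumPy formulation for all arms at once is sketched below; it builds the full matrix of pairwise ratios, so it is vectorized but not claimed to be fast, and it is an assumption rather than the module's actual code:

>>> import numpy as np
>>> def Ki_vectorized(pulls):
...     # ratios[i, j] = sqrt(N_j(t) / N_i(t)); assumes every arm has been pulled at least once
...     pulls = np.asarray(pulls, dtype=float)
...     ratios = np.sqrt(pulls[np.newaxis, :] / pulls[:, np.newaxis])
...     return np.minimum(1.0, ratios).sum(axis=1)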

class Policies.UCBdagger.UCBdagger(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB obtained by auto-tuning the confidence level.

__init__(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter \(\alpha > 0\).

horizon = None

Parameter \(T > 0\).
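A minimal usage sketch, assuming the SMPyBandits package layout and the startGame() / choice() / getReward() interface inherited from IndexPolicy (this interaction loop is illustrative, not taken from the module):

>>> from Policies.UCBdagger import UCBdagger
>>> policy = UCBdagger(nbArms=3, horizon=1000, alpha=1)
>>> policy.startGame()
>>> arm = policy.choice()        # arm with the largest current index
>>> policy.getReward(arm, 0.7)   # reward observed for that arm, in [0, 1]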

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t and the pull count of that arm, and update its cumulative sum of rewards (rewards are normalized to [0, 1]).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \mathrm{l\overline{og}}\left( \frac{T}{H_k(t)} \right)}, \\ \text{where}\;\; & H_k(t) := N_k(t) K_k(t) \\ \text{and}\;\; & K_k(t) := \sum_{j=1}^{K} \min\left(1, \sqrt{\frac{N_j(t)}{N_k(t)}}\right).\end{split}\]
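As a standalone illustration of this formula, a hypothetical helper (not the class method itself; argument names are assumptions, and it reuses the log_bar sketch from above):

>>> import numpy as np
>>> def log_bar(x):
...     return np.log((x + np.e) * np.sqrt(np.log(x + np.e)))
>>> def ucb_dagger_index(sum_rewards_k, pulls_k, pulls, horizon, alpha=1.0):
...     # pulls is the vector of all N_j(t); pulls_k = N_k(t); sum_rewards_k = X_k(t)
...     if pulls_k == 0:
...         return float('+inf')
...     K_k = sum(min(1.0, np.sqrt(n_j / pulls_k)) for n_j in pulls)  # K_k(t)
...     H_k = pulls_k * K_k                                           # H_k(t) = N_k(t) K_k(t)
...     return sum_rewards_k / pulls_k + np.sqrt((2.0 * alpha / pulls_k) * log_bar(horizon / H_k))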
__module__ = 'Policies.UCBdagger'