Policies.AdSwitchNew module

The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)

  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

This implementation is still experimental!

Policies.AdSwitchNew.mymean(x)[source]

Simply numpy.mean() of x if x is non-empty, otherwise 0.0, which avoids the RuntimeWarning raised by numpy.mean() on an empty input:

>>> np.mean([])
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
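
A minimal sketch of what this helper is assumed to do (the exact implementation may differ):

```python
import numpy as np

def mymean(x):
    """np.mean(x) if x is non-empty, otherwise 0.0, avoiding the RuntimeWarning above."""
    return np.mean(x) if len(x) > 0 else 0.0
```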
Policies.AdSwitchNew.Constant_C1 = 16.1

Default value for the constant \(C_1\). Should be \(>0\) and as large as possible, but not too large. In their paper, section 4.2 (page 8), inequality (5) controls \(C_1\): for all \(s', t'\), \(C_1 > 8 \frac{2n - 1}{n}\) where \(n = n_{[s',t']}\), so \(C_1 > 16\).
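
As a purely illustrative check that the default satisfies this bound: \(8(2n-1)/n = 16 - 8/n < 16\) for every \(n \geq 1\), so \(C_1 = 16.1\) works whatever the interval counts are.

```python
# Illustration only: 8 * (2n - 1) / n = 16 - 8/n < 16 for all n >= 1,
# so the default C1 = 16.1 satisfies the inequality whatever n_[s',t'] is.
C1 = 16.1
assert all(C1 > 8 * (2 * n - 1) / n for n in range(1, 10000))
```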

Policies.AdSwitchNew.DELTA_T = 50

A small trick to speed up the computations: the checks for changes of good/bad arms are only performed with a step of DELTA_T time steps (instead of at every step).

Policies.AdSwitchNew.DELTA_S = 20

A small trick to speed up the computations: the loops on \(s_1\), \(s_2\) and \(s\) use a step of DELTA_S (instead of scanning every value).
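
A purely illustrative sketch of this trick; the names and loop bounds below are assumptions, not the actual code:

```python
DELTA_S, DELTA_T = 20, 50

def stepped_scan(start_of_episode, t):
    """Illustration: scan candidate change-points with coarse steps,
    instead of looking at every single time instant."""
    if t % DELTA_T != 0:
        return  # only run the (costly) change checks every DELTA_T steps
    for s_1 in range(start_of_episode, t + 1, DELTA_S):
        for s_2 in range(s_1, t + 1, DELTA_S):
            for s in range(start_of_episode, t + 1, DELTA_S):
                pass  # the statistical test on [s_1, s_2] and [s, t] would go here
```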

class Policies.AdSwitchNew.AdSwitchNew(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)

__init__(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]

New policy.
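
A hedged usage sketch, assuming the standard SMPyBandits policy interface documented below (startGame(), choice(), getReward()); the numbers are arbitrary:

```python
from Policies.AdSwitchNew import AdSwitchNew

policy = AdSwitchNew(nbArms=3, horizon=10000, C1=16.1, delta_s=20, delta_t=50)
policy.startGame()           # reset pulls and rewards
arm = policy.choice()        # arm to play at this time step
policy.getReward(arm, 0.7)   # feed back the observed reward (0.7 is arbitrary)
```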

horizon = None

Parameter \(T\) for the AdSwitchNew algorithm, the horizon of the experiment. TODO: try to use DoublingTrickWrapper to remove the dependency on \(T\)?

C1 = None

Parameter \(C_1\) for the AdSwitchNew algorithm.

delta_s = None

Parameter \(\delta_s\) for the AdSwitchNew algorithm.

delta_t = None

Parameter \(\delta_t\) for the AdSwitchNew algorithm.

ell = None

Variable \(\ell\) in the algorithm. Counts the number of episodes.

start_of_episode = None

Variable \(t_l\) in the algorithm. Stores the starting time of the current episode.

set_GOOD = None

Variable \(\mathrm{GOOD}_t\) in the algorithm. Set of “good” arms at current time.

set_BAD = None

Variable \(\mathrm{BAD}_t\) in the algorithm. Set of “bad” arms at current time. It always satisfies \(\mathrm{BAD}_t = \{1,\dots,K\} \setminus \mathrm{GOOD}_t\).

set_S = None

Variable \(S_t\) in the algorithm. A list of sets of sampling obligations of arm \(a\) at current time.

mu_tilde_of_l = None

Vector of variables \(\tilde{\mu}_{\ell}(a)\) in the algorithm. Stores the empirical average of arm \(a\).

gap_Delta_tilde_of_l = None

Vector of variables \(\tilde{\Delta}_{\ell}(a)\) in the algorithm. Stores the estimate of the gap of arm \(a\) against the best of the “good” arms.

all_rewards = None

Memory of all the rewards. A dictionary per arm, mapping time to reward. Its size grows until that arm is restarted!

history_of_plays = None

Memory of all the past actions played!
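
Illustration only (made-up values) of the shape of these two structures after a few time steps:

```python
# arm index -> {time step: observed reward}; an arm's entries are cleared when it restarts
all_rewards = {0: {0: 0.7, 2: 0.4, 3: 0.9}, 1: {1: 0.1}}

# arm played at each time step, in chronological order
history_of_plays = [0, 1, 0, 0]
```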

__str__() -> str[source]

new_episode()[source]

Start a new episode, lines 3-6 of the algorithm.
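
A hedged sketch of what these lines amount to, written only in terms of the attributes documented above (self.t, a running time counter, is an assumption; the real method may differ in details):

```python
def new_episode(self):
    """Sketch: open episode l+1 at the current time, with every arm 'good' again."""
    self.ell += 1                                      # next episode index
    self.start_of_episode = self.t                     # t_l, assuming a time counter self.t
    self.set_GOOD = set(range(self.nbArms))            # GOOD_t: all arms are good again
    self.set_BAD = set()                               # BAD_t = {1,...,K} \ GOOD_t
    self.set_S = [set() for _ in range(self.nbArms)]   # reset the sampling obligations
```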

startGame()[source]

Start the game (fill pulls and rewards with 0).

check_changes_good_arms()[source]

Check for changes of good arms.

  • I moved this into a function, in order to stop the 4 for loops (on good_arm, s_1, s_2, s) as soon as a change is detected (early stopping); see the hedged sketch below.
  • TODO this takes a huge \(\mathcal{O}(K t^3)\) time, it HAS to be done faster!
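
A hedged sketch of the loop structure with early stopping; change_detected is a hypothetical placeholder for the paper's confidence-interval test, not an actual method of the class:

```python
def check_changes_good_arms(self):
    """Sketch only: four nested loops, stopped as soon as one change is detected."""
    for good_arm in self.set_GOOD:
        for s_1 in range(self.start_of_episode, self.t + 1, self.delta_s):
            for s_2 in range(s_1, self.t + 1, self.delta_s):
                for s in range(self.start_of_episode, self.t + 1, self.delta_s):
                    if self.change_detected(good_arm, s_1, s_2, s, self.t):  # hypothetical test
                        self.new_episode()  # restart: lines 3-6 of the algorithm
                        return True
    return False
```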
check_changes_bad_arms()[source]

Check for changes of bad arms, in \(\mathcal{O}(K t)\) time.

  • I moved this into a function, in order to stop the 2 for loops (good_arm, s) as soon as a change is detected (early stopping).
getReward(arm, reward)[source]

Get a reward from an arm.

n_s_t(arm, s, t)[source]

Compute \(n_{[s,t]}(a) := \#\{\tau : s \leq \tau \leq t, a_{\tau} = a \}\), naively by using the dictionary of all plays all_rewards.
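
A minimal sketch, assuming all_rewards maps each arm to a {time: reward} dictionary as described above:

```python
def n_s_t(self, arm, s, t):
    """Number of times arm was played during [s, t], read off the all_rewards dictionary."""
    return sum(1 for tau in self.all_rewards[arm] if s <= tau <= t)
```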

mu_hat_s_t(arm, s, t)[source]

Compute \(\hat{\mu}_{[s,t]}(a) := \frac{1}{n_{[s,t]}(a)} \sum_{\tau : s \leq \tau \leq t, a_{\tau} = a} r_\tau\), naively by using the dictionary of all plays all_rewards.
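
A matching sketch for the empirical mean, reusing mymean() so that an interval with no observation of the arm gives 0.0:

```python
def mu_hat_s_t(self, arm, s, t):
    """Empirical mean of the rewards of arm observed during [s, t] (0.0 if none)."""
    rewards = [r for tau, r in self.all_rewards[arm].items() if s <= tau <= t]
    return mymean(rewards)
```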

find_max_i(gap)[source]

Follow the algorithm and, with a gap estimate \(\widehat{\Delta_k}\), find \(I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}\), where \(d_i := 2^{-i}\). There is no need to do an exhaustive search:

\[I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.\]
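
A small, purely illustrative check that the closed form agrees with the definition of \(I_k\):

```python
import math

def find_max_i(gap):
    """I_k = max{i : 2^{-i} >= gap} = floor(-log2(gap)), for a gap in (0, 1]."""
    return int(math.floor(-math.log2(gap)))

# d_i := 2^{-i}: check that d_{I_k} >= gap > d_{I_k + 1} on a few values
for gap in [0.9, 0.5, 0.3, 0.25, 0.01]:
    i = find_max_i(gap)
    assert 2.0 ** (-i) >= gap > 2.0 ** (-(i + 1))
```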
choice()[source]

Choose an arm, following the different phases of growing lengths, according to the AdSwitchNew algorithm.