Policies.CUSUM_UCB module

The CUSUM-UCB and PHT-UCB policies for non-stationary bandits.

  • Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)

  • It runs on top of a simple policy, e.g., UCB, and CUSUM_IndexPolicy is a wrapper (a fuller example follows this list):

    >>> policy = CUSUM_IndexPolicy(nbArms, policy=UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
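For instance, here is a minimal end-to-end sketch on a toy piecewise-stationary Bernoulli problem. It assumes SMPyBandits is installed and that the `policy=UCB` keyword is forwarded through `**kwargs` to the underlying wrapper (it is not part of the signature shown below); the toy problem itself is purely illustrative:

    import numpy as np
    from Policies import CUSUM_IndexPolicy, UCB

    nbArms, horizon = 3, 1000
    policy = CUSUM_IndexPolicy(nbArms, horizon=horizon,
                               max_nb_random_events=1,  # upper bound on nb of change points (illustrative)
                               policy=UCB)
    policy.startGame()

    means = np.array([0.2, 0.5, 0.8])   # illustrative Bernoulli means
    rng = np.random.default_rng(42)
    for t in range(horizon):
        if t == horizon // 2:           # one abrupt change at mid-horizon
            means = means[::-1]
        arm = policy.choice()
        reward = float(rng.random() < means[arm])  # Bernoulli reward
        policy.getReward(arm, reward)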

Warning

It only works on top of basic index policies based on empirical averages (and an exploration bias), like UCB; it cannot work with any Bayesian policy (for which we would have to remember all previous observations in order to reset to a shortened history)!

Policies.CUSUM_UCB.VERBOSE = False

Whether to be verbose when running the change-detection algorithm.

Policies.CUSUM_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.CUSUM_UCB.PER_ARM_RESTART = True

Should we reset the empirical average of only one arm, or of all arms? For CUSUM-UCB it is True by default (per-arm restart).

Policies.CUSUM_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm, or simply reset one arm's empirical average? For CUSUM-UCB it is False by default.

Policies.CUSUM_UCB.EPSILON = 0.01

Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).

Policies.CUSUM_UCB.LAMBDA = 1

Default value of \(\lambda\). Used only if \(h\) and \(\alpha\) are computed using compute_h_alpha_from_input_parameters__CUSUM_complicated().

Policies.CUSUM_UCB.MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT = 100

Hypothesis on the speed of changes: between two change points, there are at least \(M \times K\) time steps, where \(K\) is the number of arms and \(M\) is this constant.

Policies.CUSUM_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

Be lazy and run the change-detection test only once every X steps, for a small X like 10 or 20. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. For instance, X = 20 should speed up the test by about x20.
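In a hypothetical sketch (only the module constant is real, the surrounding names are illustrative), this laziness is just a modulo gate around the costly test:

    # Run the (costly) CD test only once every X steps, not at every step.
    X = LAZY_DETECT_CHANGE_ONLY_X_STEPS  # e.g. 10 or 20
    if pulls_of_this_arm_since_last_restart % X == 0:
        has_changed = detect_change(arm)  # the expensive CUSUM/PHT test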

Policies.CUSUM_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.

Policies.CUSUM_UCB.ALPHA0_SCALE_FACTOR = 1

For any algorithm with uniform exploration and a formula to tune it, \(\alpha\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps, a lot!

Policies.CUSUM_UCB.compute_h_alpha_from_input_parameters__CUSUM_complicated(horizon, max_nb_random_events, nbArms=None, epsilon=None, lmbda=None, M=None, scaleFactor=1)[source]

Compute the values \(C_1^+, C_1^-, C_1, C_2, h\) from the formulas in Theorem 2 and Corollary 2 in the paper.

Policies.CUSUM_UCB.compute_h_alpha_from_input_parameters__CUSUM(horizon, max_nb_random_events, scaleFactor=1, **kwargs)[source]

Compute the values \(h, \alpha\) from the simplified formulas in Theorem 2 and Corollary 2 in the paper.

\[\begin{split}h &= \log(\frac{T}{\Upsilon_T}),\\ \alpha &= \mathrm{scaleFactor} \times \sqrt{\frac{\Upsilon_T}{T} \log(\frac{T}{\Upsilon_T})}.\end{split}\]
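A direct transcription of these two simplified formulas, as a sketch (the function name is illustrative, and the final clamp of \(\alpha\) into \([0, 1]\) is an illustrative safeguard, since \(\alpha\) is used as a probability):

    from math import log, sqrt

    def compute_h_alpha_simplified(horizon, max_nb_random_events, scaleFactor=1):
        # T is the horizon, Upsilon_T the max number of change points
        T, UpsilonT = horizon, max_nb_random_events
        h = log(T / UpsilonT)
        alpha = scaleFactor * sqrt((UpsilonT / T) * log(T / UpsilonT))
        alpha = max(0.0, min(1.0, alpha))  # illustrative clamp: alpha is a probability
        return h, alpha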
class Policies.CUSUM_UCB.CUSUM_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The CUSUM-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__init__(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

New policy.

M = None

Parameter \(M\) for the test.

threshold_h = None

Parameter \(h\) for the test (threshold).

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

use_localization = None

Experiment to use localization of the break-point, i.e., restart the memory of the arm by keeping the observations \(s+1, \dots, n\) (where \(s\) is the estimated break-point) instead of just the last one.

__str__()[source]

-> str

getReward(arm, reward)[source]

Be sure that the underlying UCB or klUCB indexes are used with \(\log(n_t)\) for the exploration term, where \(n_t = \sum_{i=1}^K N_i(t)\) is the total number of pulls counted since each arm's last restart time (each arm has its own restart time, since CUSUM uses local restarts only).
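Concretely, a hypothetical sketch of such an index computation (names and vectorization are illustrative, not the library's actual code):

    import numpy as np

    def ucb_indexes_with_local_counts(local_sums, local_pulls):
        # n_t = sum_i N_i(t): total pulls counted since each arm's last restart
        n_t = np.sum(local_pulls)
        with np.errstate(divide="ignore", invalid="ignore"):
            indexes = local_sums / local_pulls + np.sqrt(np.log(n_t) / (2 * local_pulls))
        indexes[local_pulls < 1] = np.inf  # play unexplored (or just-reset) arms first
        return indexes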

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided CUSUM algorithm [Page, 1954].

  • For each data point \(k\), compute:
\[\begin{split}s_k^- &= (y_k - \hat{u}_0 - \varepsilon) 1(k > M),\\ s_k^+ &= (\hat{u}_0 - y_k - \varepsilon) 1(k > M),\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]
  • The change is detected if \(\max(g_k^+, g_k^-) > h\), where threshold_h is the threshold of the test,
  • And \(\hat{u}_0 = \frac{1}{M} \sum_{k=1}^{M} y_k\) is the mean of the first \(M\) samples, where \(M\) is the minimum number of observations between change points (a sketch of the test follows this list).
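A self-contained sketch of this two-sided test (the function name and return convention are illustrative):

    import numpy as np

    def cusum_detect_change(y, M, eps, h):
        # y: rewards of one arm since its last restart; M, eps, h as above
        u0 = np.mean(y[:M])          # hat{u}_0: mean of the first M observations
        gp = gm = 0.0
        for k, yk in enumerate(y):
            if k < M:                # the indicator 1(k > M): skip the first M samples
                continue
            gp = max(0.0, gp + (u0 - yk - eps))  # s_k^+
            gm = max(0.0, gm + (yk - u0 - eps))  # s_k^-
            if max(gp, gm) > h:
                return True, k       # change detected at sample k
        return False, None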
__module__ = 'Policies.CUSUM_UCB'
class Policies.CUSUM_UCB.PHT_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CUSUM_UCB.CUSUM_IndexPolicy

The PHT-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__module__ = 'Policies.CUSUM_UCB'
__str__()[source]

-> str

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided PHT algorithm [Hinkley, 1971].

  • For each data point \(k\), compute:
\[\begin{split}s_k^- &= y_k - \hat{y}_k - \varepsilon,\\ s_k^+ &= \hat{y}_k - y_k - \varepsilon,\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]
  • The change is detected if \(\max(g_k^+, g_k^-) > h\), where threshold_h is the threshold of the test,
  • And \(\hat{y}_k = \frac{1}{k} \sum_{s=1}^{k} y_s\) is the mean of the first \(k\) samples (a sketch of the test follows this list).
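The same sketch adapted to PHT (illustrative, like the CUSUM one above): the only difference is that the fixed reference mean \(\hat{u}_0\) is replaced by the running mean \(\hat{y}_k\):

    def pht_detect_change(y, eps, h):
        # y: rewards of one arm since its last restart; eps, h as above
        gp = gm = 0.0
        running_sum = 0.0
        for k, yk in enumerate(y, start=1):
            running_sum += yk
            y_hat = running_sum / k  # hat{y}_k: mean of the first k samples
            gp = max(0.0, gp + (y_hat - yk - eps))  # s_k^+
            gm = max(0.0, gm + (yk - y_hat - eps))  # s_k^-
            if max(gp, gm) > h:
                return True, k       # change detected at sample k
        return False, None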