Policies.SWHash_UCB module

The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

  • Instead of being restricted to UCB, it runs on top of any simple policy (e.g., UCB): SWHash_IndexPolicy() is a generic wrapper that applies this “sliding window” trick to any simple index policy:

    >>> policy = SWHash_IndexPolicy(nbArms, UCB, alpha=0.5, lmbda=1)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It requires an additional, non-fixed \(\mathcal{O}(\tau(t,\alpha))\) memory, and incurs an extra time cost at each step.

Warning

This implementation is still experimental!

Warning

It can only work on basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to store all previous observations in order to reset the history to a smaller one)!

Policies.SWHash_UCB.alpha_for_abruptly_changing_env(nu=0.5)[source]

For an abruptly-changing environment, if the number of break-points is \(\Upsilon_T = \mathcal{O}(T^{\nu})\), then the SW-UCB# algorithm chooses \(\alpha = \frac{1-\nu}{2}\).
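For concreteness, a minimal sketch of this rule (assuming the only input is the exponent \(\nu\) such that \(\Upsilon_T = \mathcal{O}(T^{\nu})\)):

    >>> def alpha_for_abruptly_changing_env(nu=0.5):
    ...     # alpha = (1 - nu) / 2, as recommended for abruptly-changing environments
    ...     return (1. - nu) / 2.
    >>> alpha_for_abruptly_changing_env(nu=0.5)
    0.25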

Policies.SWHash_UCB.alpha_for_slowly_varying_env(kappa=1)[source]

For a slowly-varying environment, if the change in mean reward between two time steps is bounded by \(\varepsilon_T = \mathcal{O}(T^{-\kappa})\), then the SW-UCB# algorithm chooses \(\alpha = \min\left(1, \frac{3\kappa}{4}\right)\).
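Similarly, a minimal sketch of this rule (assuming the only input is the exponent \(\kappa\) such that \(\varepsilon_T = \mathcal{O}(T^{-\kappa})\)):

    >>> def alpha_for_slowly_varying_env(kappa=1):
    ...     # alpha = min(1, 3 * kappa / 4), as recommended for slowly-varying environments
    ...     return min(1., 3. * kappa / 4.)
    >>> alpha_for_slowly_varying_env(kappa=1)
    0.75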

Policies.SWHash_UCB.ALPHA = 0.5

Default parameter for \(\alpha\).

Policies.SWHash_UCB.LAMBDA = 1

Default parameter for \(\lambda\).

Policies.SWHash_UCB.tau_t_alpha(t, alpha=0.5, lmbda=1)[source]

Compute \(\tau(t,\alpha) = \min(\lceil \lambda t^{\alpha} \rceil, t)\).
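A minimal sketch of this computation (one possible implementation, using math.ceil):

    >>> from math import ceil
    >>> def tau_t_alpha(t, alpha=0.5, lmbda=1):
    ...     # window length: ceil(lambda * t^alpha), capped at t
    ...     return min(int(ceil(lmbda * t ** alpha)), t)
    >>> tau_t_alpha(100, alpha=0.5, lmbda=1)
    10
    >>> tau_t_alpha(2, alpha=0.5, lmbda=1)
    2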

class Policies.SWHash_UCB.SWHash_IndexPolicy(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

__init__(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

alpha = None

The parameter \(\alpha\) for the SW-UCB# algorithm (see article for reference).

lmbda = None

The parameter \(\lambda\) for the SW-UCB# algorithm (see article for reference).

all_rewards = None

Keep in memory all the rewards obtained in all the past steps (the size of the window is evolving!).

all_pulls = None

Keep in memory all the pulls obtained in all the past steps (the size of the window is evolving!). Start with -1 (never seen).

__str__()[source]

-> str

tau

The current \(\tau(t,\alpha)\).

startGame(createNewPolicy=True)[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Give a reward: increase t and the number of pulls, update the cumulated sum of rewards, and update the total and partial histories of all arms (rewards are normalized in [0, 1]).

Warning

So far this is inefficiently implemented and the algorithm is VERY slow: it has to store the entire past, as the window length increases with t.
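For intuition only, here is a hedged standalone sketch of the kind of bookkeeping this update performs, using the growing all_pulls / all_rewards history and the window length \(\tau(t,\alpha)\); the local names (window_pulls, window_rewards, counts, sums) are illustrative and not the class's actual internals:

    >>> import numpy as np
    >>> nbArms, alpha, lmbda = 3, 0.5, 1
    >>> all_pulls, all_rewards = [], []   # full history, grows with t (hence the memory cost)
    >>> for t, (arm, reward) in enumerate([(0, 0.7), (1, 0.2), (0, 0.9), (2, 0.5)], start=1):
    ...     all_pulls.append(arm)
    ...     all_rewards.append(reward)    # rewards assumed already normalized in [0, 1]
    ...     tau = min(int(np.ceil(lmbda * t ** alpha)), t)   # current window length tau(t, alpha)
    ...     window_pulls = np.array(all_pulls[-tau:])
    ...     window_rewards = np.array(all_rewards[-tau:])
    ...     # per-arm pull counts and reward sums over the window, as used by an empirical-mean index
    ...     counts = np.array([np.sum(window_pulls == a) for a in range(nbArms)])
    ...     sums = np.array([np.sum(window_rewards[window_pulls == a]) for a in range(nbArms)])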

__module__ = 'Policies.SWHash_UCB'