Policies.DiscountedUCB module

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • \(\gamma\) should not be 1; in that case, use Policies.UCBalpha.UCBalpha instead.
  • The smaller the \(\gamma\), the shorter the “memory” of the algorithm.
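
A minimal usage sketch, assuming the package is importable as SMPyBandits.Policies and that the policy exposes the usual startGame() / choice() / getReward() interface; the two-armed non-stationary Bernoulli problem below is purely illustrative:

```python
import numpy as np
from SMPyBandits.Policies import DiscountedUCB

rng = np.random.default_rng(42)
means_before, means_after = [0.2, 0.8], [0.8, 0.2]  # best arm changes at t = 500

policy = DiscountedUCB(nbArms=2, alpha=1, gamma=0.99)
policy.startGame()
for t in range(1000):
    arm = policy.choice()                      # arm with the largest discounted index
    means = means_before if t < 500 else means_after
    reward = float(rng.random() < means[arm])  # Bernoulli reward in {0, 1}
    policy.getReward(arm, reward)              # update the discounted statistics
```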
Policies.DiscountedUCB.ALPHA = 1

Default parameter for alpha.

Policies.DiscountedUCB.GAMMA = 0.99

Default parameter for gamma.

class Policies.DiscountedUCB.DiscountedUCB(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

Bases: Policies.UCBalpha.UCBalpha

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

__init__(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
discounted_pulls = None

Discounted number of pulls of each arm

discounted_rewards = None

Discounted cumulated rewards of each arm

alpha = None

Parameter alpha

gamma = None

Parameter gamma

delta_time_steps = None

Keep memory of \(\Delta_k(t)\) for each arm, i.e., the number of time steps since that arm was last pulled.

useRealDiscount = None

Flag to use the exact update, which multiplies by \(\gamma^{1+\Delta_k(t)}\), instead of the simpler update which multiplies by \(\gamma\) only.

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t and the number of pulls, and update the cumulated sum of rewards for that arm (reward normalized in [0, 1]).

  • Keep up-to-date the following two quantities, using a slightly different definition and notation than the article, but consistent with the rest of this project:
\[\begin{split}N_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} N_k(s), \\ X_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} X_k(s).\end{split}\]
  • Instead of storing the whole history of rewards, as the formula suggests, we keep only the running sum of discounted rewards, because updating it is cheap (2 operations instead of 1 for the classical Policies.UCBalpha.UCBalpha, and 2 operations instead of \(\mathcal{O}(t)\) for the naive evaluation of the formula). Denote by \(\Delta_k(t)\) the number of time steps during which arm k was not selected (0 if it is selected twice in a row). Then the update is done simply by multiplying by \(\gamma^{1+\Delta_k(t)}\) (a code sketch of this update follows the formula below):
\[\begin{split}N_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times N_{k,\gamma}(\text{last pull}) + \mathbb{1}(A(t+1) = k), \\ X_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times X_{k,\gamma}(\text{last pull}) + X_k(t+1).\end{split}\]
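A minimal sketch of this update rule, using hypothetical NumPy arrays discounted_pulls, discounted_rewards and delta_time_steps (named after the attributes documented above); it illustrates the formula rather than reproducing the library's exact code:

```python
import numpy as np

def discounted_update(arm, reward, discounted_pulls, discounted_rewards,
                      delta_time_steps, gamma=0.99, use_real_discount=True):
    """Update the discounted statistics of the pulled arm, in place."""
    if use_real_discount:
        # Exact update: discount by gamma^(1 + Delta_k(t)) for the pulled arm.
        factor = gamma ** (1 + delta_time_steps[arm])
    else:
        # Simplified update: discount the pulled arm by gamma only.
        factor = gamma
    discounted_pulls[arm] = factor * discounted_pulls[arm] + 1.0
    discounted_rewards[arm] = factor * discounted_rewards[arm] + reward
    # Every arm accumulates one more elapsed time step, except the one just pulled.
    delta_time_steps += 1
    delta_time_steps[arm] = 0
```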
computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}I_k(t) &:= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} + \sqrt{\frac{\alpha \log(n_{\gamma}(t))}{2 N_{k,\gamma}(t)}}, \\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]
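A sketch of this index under the same assumptions (hypothetical helper, illustrative only); arms with a zero discounted pull count receive an infinite index so they get explored first:

```python
import numpy as np

def discounted_ucb_index(arm, discounted_pulls, discounted_rewards, alpha=1.0):
    """UCB index of one arm, computed from the discounted statistics."""
    if discounted_pulls[arm] <= 0:
        return float("inf")                    # never-pulled arm: explore it first
    n_gamma = np.sum(discounted_pulls)         # n_gamma(t) = sum_k N_{k,gamma}(t)
    mean = discounted_rewards[arm] / discounted_pulls[arm]
    # log is clamped at 0 to avoid a negative bonus term for very small
    # discounted counts (an assumption of this sketch, not part of the formula).
    bonus = np.sqrt(alpha * max(np.log(n_gamma), 0.0) / (2.0 * discounted_pulls[arm]))
    return mean + bonus
```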
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedUCBPlus(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-UCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\) if the horizon \(T\) is given and an upper-bound \(\Upsilon\) on the number of random events (“breakpoints”) is known; otherwise it uses the default value (see the sketch below).
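
A hedged sketch of this choice of \(\gamma\), falling back to the module default GAMMA = 0.99 when either quantity is unknown (the guard conditions are assumptions, not the library's exact code):

```python
import numpy as np

GAMMA = 0.99  # module default, used as a fallback

def gamma_for_horizon(horizon=None, max_nb_random_events=None):
    """gamma = 1 - (1/4) * sqrt(Upsilon / T), or the default GAMMA as a fallback."""
    if horizon is not None and max_nb_random_events is not None and horizon > 0:
        gamma = 1.0 - 0.25 * np.sqrt(max_nb_random_events / float(horizon))
        if 0.0 < gamma <= 1.0:   # keep gamma in (0, 1]
            return gamma
    return GAMMA

# Example: T = 10000 and Upsilon = 4 breakpoints give gamma = 1 - 0.25 * 0.02 = 0.995.
```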
__init__(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__module__ = 'Policies.DiscountedUCB'
Policies.DiscountedUCB.constant_c = 1.0

Default value of the constant \(c\) in the kl-UCB indexes, as it was in pymaBandits v1.0.

Policies.DiscountedUCB.tolerance = 0.0001

Default tolerance used when computing numerical approximations of the kl-UCB indexes.

class Policies.DiscountedUCB.DiscountedklUCB(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-klUCB index policy, with a discount factor of \(\gamma\in(0,1]\), using a kl-UCB index instead of the UCB index.

__init__(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
klucb = None

kl function to use

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(t)}{N_{k,\gamma}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]

Here rewards are assumed to be in \([a, b]\) (default \([0, 1]\)), \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default 1).
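
A hedged sketch of this index with discounted counts, using the Bernoulli KL divergence and a simple bisection for the supremum (all helpers here are illustrative and not the library's API; SMPyBandits uses its own Arms.kullback utilities):

```python
import numpy as np

def kl_bern(x, y, eps=1e-15):
    """KL divergence between Bernoulli(x) and Bernoulli(y)."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def klucb_bern(mean, level, tolerance=1e-4):
    """sup { q in [mean, 1] : kl(mean, q) <= level }, found by bisection."""
    lower, upper = mean, 1.0
    while upper - lower > tolerance:
        mid = (lower + upper) / 2.0
        if kl_bern(mean, mid) <= level:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2.0

def discounted_klucb_index(arm, t, discounted_pulls, discounted_rewards, c=1.0):
    """kl-UCB index of one arm, computed from the discounted statistics."""
    if discounted_pulls[arm] <= 0:
        return float("inf")                       # never-pulled arm: explore it first
    mean = discounted_rewards[arm] / discounted_pulls[arm]
    level = c * np.log(max(t, 1)) / discounted_pulls[arm]   # c * log(t) / N_{k,gamma}(t)
    return klucb_bern(mean, level)
```

The tolerance argument plays the same role as the module-level Policies.DiscountedUCB.tolerance constant documented above.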

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized, but by default it cannot be vectorized automatically.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedklUCBPlus(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedklUCB, Policies.DiscountedUCB.DiscountedUCBPlus

The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\) if the horizon \(T\) is given and an upper-bound \(\Upsilon\) on the number of random events (“breakpoints”) is known; otherwise it uses the default value.
__str__()[source]

-> str

__module__ = 'Policies.DiscountedUCB'