Policies.Exp3 module

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

Policies.Exp3.UNBIASED = True

self.unbiased is a flag indicating whether the rewards are used as plain (biased) estimates, i.e., just \(r_t\), or as unbiased estimates, \(r_t / \mathrm{trusts}_t\).

Policies.Exp3.GAMMA = 0.01

Default \(\gamma\) parameter.

class Policies.Exp3.Exp3(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

__init__(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Flag to use unbiased reward estimates.

weights = None

Weights on the arms

startGame()[source]

Start with uniform weights.

__str__()[source]

-> str

gamma

Constant \(\gamma_t = \gamma\).

trusts

Update the trust probabilities according to the Exp3 formula and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.
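
A minimal sketch of this mixture in NumPy (illustrative only; `weights` is assumed to be the normalized weight vector and `gamma` the current \(\gamma_t\)):

```python
import numpy as np

def exp3_trusts(weights, gamma):
    """Mix the normalized weights with the uniform distribution, then renormalize."""
    K = len(weights)
    trusts = (1 - gamma) * weights + gamma / K  # exploration mixture over the K arms
    return trusts / np.sum(trusts)              # renormalize, for numerical safety
```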

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

  • With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
  • But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
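
As an illustration of this update, here is a hedged sketch (a hypothetical helper, not the actual class method; `pulls[arm]` stands for \(N_k(t)\), the number of pulls of that arm):

```python
import numpy as np

def exp3_weight_update(weights, trusts, pulls, arm, reward, gamma, unbiased=True):
    """One Exp3 weight update for the pulled arm, following the formula above."""
    r = reward / trusts[arm] if unbiased else reward      # unbiased estimate divides by the trust
    new_weights = weights.copy()
    new_weights[arm] *= np.exp(r / (gamma * pulls[arm]))  # multiplicative update on arm k
    return new_weights / np.sum(new_weights)              # renormalize all the weights
```
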
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
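
For reference, the draw boils down to something like this (a sketch, assuming `trusts` sums to one):

```python
import numpy as np

trusts = np.array([0.5, 0.3, 0.2])             # example trust probabilities over 3 arms
arm = np.random.choice(len(trusts), p=trusts)  # one arm index, drawn with probabilities = trusts
```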

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
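
In NumPy terms this is essentially an argsort of the trust vector (a sketch, not the exact implementation):

```python
import numpy as np

trusts = np.array([0.5, 0.2, 0.3])
order = np.argsort(trusts)  # arms from least to most trusted, here array([1, 2, 0])
```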

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3WithHorizon(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with fixed gamma, \(\gamma_t = \gamma_0\), chosen with knowledge of the horizon.

__init__(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

gamma

Fixed temperature, small, knowing the horizon: \(\gamma_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
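
For instance, with \(K = 10\) arms and a horizon \(T = 10000\), this heuristic gives (a quick numerical check of the formula above):

```python
import numpy as np

K, T = 10, 10000
gamma = np.sqrt(2 * np.log(K) / (T * K))  # constant over time, about 0.0068 here
```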

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3Decreasing(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
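
A direct transcription of this schedule (sketch):

```python
import numpy as np

def gamma_decreasing(t, K):
    """Decreasing exploration rate, capped at 1/K, as in the heuristic above."""
    return min(1.0 / K, np.sqrt(np.log(K) / (t * K)))
```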

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3SoftMix(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Another Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma parameter with time: \(\gamma_t = c \frac{\log(t)}{t}\) (heuristic).
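
A sketch of this schedule (the constant \(c\) is not fixed by the formula above; \(c = 1\) is assumed here):

```python
import numpy as np

def gamma_softmix(t, c=1.0):
    """SoftMix-style decreasing exploration rate; c is an assumed constant."""
    return c * np.log(t) / t
```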

__module__ = 'Policies.Exp3'
Policies.Exp3.DELTA = 0.01

Default value for the confidence parameter delta

class Policies.Exp3.Exp3ELM(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

A variant of Exp3, apparently designed to work better in stochastic environments.

__init__(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

delta = None

Confidence parameter, given as input.

B = None

Constant B given by \(B = 4 (e - 2) (2 \log K + \log(2 / \delta))\).
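
For example, with \(K = 10\) arms and the default \(\delta = 0.01\) (a quick numerical check of the formula above):

```python
import numpy as np

K, delta = 10, 0.01
B = 4 * (np.e - 2) * (2 * np.log(K) + np.log(2 / delta))  # about 28.45 for these values
```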

availableArms = None

Set of available arms, starting with all arms; it can shrink at each step.

varianceTerm = None

Estimated variance term, for each arm.

__str__()[source]

-> str

choice()[source]

Choose among the remaining arms.

getReward(arm, reward)[source]

Get a reward and update the weights, as in Exp3, but also update the variance term \(V_k(t)\) of each arm and the set of available arms \(\mathcal{A}(t)\), by removing any arm whose empirical cumulative reward falls below that of the current best arm by more than a confidence radius.

\[\begin{split}a^*(t+1) &= \arg\max_a \hat{R}_{a}(t+1), \\ V_k(t+1) &= V_k(t) + \frac{1}{\mathrm{trusts}_k(t+1)}, \\ \mathcal{A}(t+1) &= \mathcal{A}(t) \setminus \left\{ a : \hat{R}_{a^*(t+1)}(t+1) - \hat{R}_{a}(t+1) > \sqrt{B (V_{a^*(t+1)}(t+1) + V_{a}(t+1))} \right\}.\end{split}\]
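
A hedged sketch of the elimination step alone (hypothetical names: `cumRewards` plays the role of \(\hat{R}_a\), `varianceTerm` of \(V_a\), `B` as above):

```python
import numpy as np

def eliminate_arms(availableArms, cumRewards, varianceTerm, B):
    """Remove arms whose gap to the current best exceeds the confidence radius."""
    arms = np.array(sorted(availableArms))
    best = arms[np.argmax(cumRewards[arms])]                         # a*(t+1)
    gaps = cumRewards[best] - cumRewards[arms]                       # empirical gaps
    radius = np.sqrt(B * (varianceTerm[best] + varianceTerm[arms]))  # confidence radius
    return set(arms[gaps <= radius])                                 # arms that survive
```
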
trusts

Update the trust probabilities according to the Exp3ELM formula and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - |\mathcal{A}_t| \gamma_t) w_k(t) + \gamma_t, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.
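
A direct transcription of this mixture (sketch; `n_available` stands for \(|\mathcal{A}_t|\)):

```python
import numpy as np

def exp3elm_trusts(weights, n_available, gamma):
    """Exp3ELM exploration mixture, following the formula above."""
    trusts = (1 - n_available * gamma) * weights + gamma
    return trusts / np.sum(trusts)
```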

__module__ = 'Policies.Exp3'
gamma

Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).