Policies.Exp3 module

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

Policies.Exp3.UNBIASED = True

self.unbiased is a flag indicating whether the rewards are used as plain (biased) estimates, i.e., just \(r_t\), or as unbiased estimates, \(r_t / \mathrm{trusts}_t\).

Policies.Exp3.GAMMA = 0.01

Default \(\gamma\) parameter.

class Policies.Exp3.Exp3(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

__init__(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Flag to use unbiased reward estimates.

weights = None

Weights on the arms

startGame()[source]

Start with uniform weights.

__str__()[source]

-> str

gamma

Constant \(\gamma_t = \gamma\).

trusts

Update the trust probabilities according to the Exp3 formula and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.
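
A minimal sketch of this mixture in NumPy (illustrative only; `weights` is assumed to be the normalized weight vector and `gamma` the current \(\gamma_t\)):

```python
import numpy as np

def exp3_trusts(weights, gamma):
    """Mix the normalized weights with the uniform distribution, then renormalize."""
    K = len(weights)
    trusts = (1 - gamma) * weights + gamma / K  # exploration mixture over the K arms
    return trusts / np.sum(trusts)              # renormalize, for numerical safety
```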

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

  • With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
  • But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
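
As an illustration of this update, here is a hedged sketch (a hypothetical helper, not the actual class method; `pulls[arm]` stands for \(N_k(t)\), the number of pulls of that arm):

```python
import numpy as np

def exp3_weight_update(weights, trusts, pulls, arm, reward, gamma, unbiased=True):
    """One Exp3 weight update for the pulled arm, following the formula above."""
    r = reward / trusts[arm] if unbiased else reward      # unbiased estimate divides by the trust
    new_weights = weights.copy()
    new_weights[arm] *= np.exp(r / (gamma * pulls[arm]))  # multiplicative update on arm k
    return new_weights / np.sum(new_weights)              # renormalize all the weights
```
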
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
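
For reference, the draw boils down to something like this (a sketch, assuming `trusts` sums to one):

```python
import numpy as np

trusts = np.array([0.5, 0.3, 0.2])             # example trust probabilities over 3 arms
arm = np.random.choice(len(trusts), p=trusts)  # one arm index, drawn with probabilities = trusts
```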

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
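
In NumPy terms this is essentially an argsort of the trust vector (a sketch, not the exact implementation):

```python
import numpy as np

trusts = np.array([0.5, 0.2, 0.3])
order = np.argsort(trusts)  # arms from least to most trusted, here array([1, 2, 0])
```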

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3WithHorizon(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with fixed gamma, \(\gamma_t = \gamma_0\), chosen with knowledge of the horizon.

__init__(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

gamma

Fixed temperature, small, knowing the horizon: \(\gamma_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
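
For instance, with \(K = 10\) arms and a horizon \(T = 10000\), this heuristic gives (a quick numerical check of the formula above):

```python
import numpy as np

K, T = 10, 10000
gamma = np.sqrt(2 * np.log(K) / (T * K))  # constant over time, about 0.0068 here
```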

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3Decreasing(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
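
A direct transcription of this schedule (sketch):

```python
import numpy as np

def gamma_decreasing(t, K):
    """Decreasing exploration rate, capped at 1/K, as in the heuristic above."""
    return min(1.0 / K, np.sqrt(np.log(K) / (t * K)))
```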

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3SoftMix(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Another Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma parameter with time: \(\gamma_t = c \frac{\log(t)}{t}\) (heuristic).
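
A sketch of this schedule (the constant \(c\) is not fixed by the formula above; \(c = 1\) is assumed here):

```python
import numpy as np

def gamma_softmix(t, c=1.0):
    """SoftMix-style decreasing exploration rate; c is an assumed constant."""
    return c * np.log(t) / t
```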

__module__ = 'Policies.Exp3'
Policies.Exp3.DELTA = 0.01

Default value for the confidence parameter delta

class Policies.Exp3.Exp3ELM(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

A variant of Exp3, apparently designed to work better in stochastic environments.

__init__(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

delta = None

Confidence parameter, given as input.

B = None

Constant B given by \(B = 4 (e - 2) (2 \log K + \log(2 / \delta))\).
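
For example, with \(K = 10\) arms and the default \(\delta = 0.01\) (a quick numerical check of the formula above):

```python
import numpy as np

K, delta = 10, 0.01
B = 4 * (np.e - 2) * (2 * np.log(K) + np.log(2 / delta))  # about 28.45 for these values
```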

availableArms = None

Set of available arms, starting with all arms; it can shrink at each step.

varianceTerm = None

Estimated variance term, for each arm.

__str__()[source]

-> str

choice()[source]

Choose among the remaining arms.

getReward(arm, reward)[source]

Get a reward and update the weights, as in Exp3, but also update the variance term \(V_k(t)\) of each arm and the set of available arms \(\mathcal{A}(t)\), by removing any arm whose empirical cumulative reward falls below that of the current best arm by more than a confidence radius.

\[\begin{split}a^*(t+1) &= \arg\max_a \hat{R}_{a}(t+1), \\ V_k(t+1) &= V_k(t) + \frac{1}{\mathrm{trusts}_k(t+1)}, \\ \mathcal{A}(t+1) &= \mathcal{A}(t) \setminus \left\{ a : \hat{R}_{a^*(t+1)}(t+1) - \hat{R}_{a}(t+1) > \sqrt{B (V_{a^*(t+1)}(t+1) + V_{a}(t+1))} \right\}.\end{split}\]
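
A hedged sketch of the elimination step alone (hypothetical names: `cumRewards` plays the role of \(\hat{R}_a\), `varianceTerm` of \(V_a\), `B` as above):

```python
import numpy as np

def eliminate_arms(availableArms, cumRewards, varianceTerm, B):
    """Remove arms whose gap to the current best exceeds the confidence radius."""
    arms = np.array(sorted(availableArms))
    best = arms[np.argmax(cumRewards[arms])]                         # a*(t+1)
    gaps = cumRewards[best] - cumRewards[arms]                       # empirical gaps
    radius = np.sqrt(B * (varianceTerm[best] + varianceTerm[arms]))  # confidence radius
    return set(arms[gaps <= radius])                                 # arms that survive
```
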
trusts

Update the trust probabilities according to the Exp3ELM formula and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - |\mathcal{A}_t| \gamma_t) w_k(t) + \gamma_t, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.
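
A direct transcription of this mixture (sketch; `n_available` stands for \(|\mathcal{A}_t|\)):

```python
import numpy as np

def exp3elm_trusts(weights, n_available, gamma):
    """Exp3ELM exploration mixture, following the formula above."""
    trusts = (1 - n_available * gamma) * weights + gamma
    return trusts / np.sum(trusts)
```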

__module__ = 'Policies.Exp3'
gamma

Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).