Policies.ExploreThenCommit module

Different variants of the Explore-Then-Commit policy.

Warning

They sometimes do not perform empirically as well as the theory predicts…

Warning

TODO: I should refactor all this code and write all these variants in a more "unified" way…

Policies.ExploreThenCommit.GAP = 0.1

Default value for the gap, \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\); \(\Delta = 0.1\) as in many basic experiments.
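For instance, the minimal pairwise gap for three arms can be computed as follows (a small illustrative snippet, not part of the module; the means are made up for the example):

    import itertools

    means = [0.1, 0.5, 0.9]  # hypothetical arm means, for illustration only
    # Minimal pairwise gap Delta = min_{i != j} |mu_i - mu_j|
    gap = min(abs(a - b) for a, b in itertools.combinations(means, 2))
    print(gap)  # 0.4 for these means; the module default GAP is 0.1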

class Policies.ExploreThenCommit.ETC_KnownGap(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\).

__init__(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

max_t = None

Time until pure exploitation, i.e., \(m\) exploration steps in each arm.

__str__() -> str[source]

epsilon

1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by:

\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
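For concreteness, here is a minimal sketch (not the class's actual implementation, which inherits from EpsilonGreedy) of how this exploration length and the resulting epsilon schedule can be computed from a known horizon and gap:

    from math import floor, log

    def exploration_length(horizon, gap):
        """Length T_0 of the uniform exploration phase: floor((4/gap^2) * log(horizon * gap^2 / 4))."""
        return max(0, floor((4.0 / gap**2) * log(horizon * gap**2 / 4.0)))

    def epsilon(t, horizon, gap):
        """Exploration probability at time t: 1 during the exploration phase, 0 afterwards."""
        return 1.0 if t <= exploration_length(horizon, gap) else 0.0

    # Example: T = 10000 and Delta = 0.1 give T_0 = floor(400 * log(25)) = 1287.
    print(exploration_length(10000, 0.1))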
__module__ = 'Policies.ExploreThenCommit'
Policies.ExploreThenCommit.ALPHA = 4

Default value for parameter \(\alpha\) for ETC_RandomStop

class Policies.ExploreThenCommit.ETC_RandomStop(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Variant of the Explore-Then-Commit policy, with known horizon \(T\) and random stopping time. Uniform exploration until the stopping time.

__init__(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

stillRandom = None

Still randomly exploring?

__str__() -> str[source]

epsilon

1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by:

\[\tau = \inf\{ t \in \mathbb{N},\; \max_{i \neq j} | \widehat{X}_i(t) - \widehat{X}_j(t) | > \sqrt{\frac{4 \log(T/t)}{t}} \}.\]
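As a rough sketch (assuming the empirical means and the current time are tracked externally, and exposing the constant 4 as the alpha parameter; the actual policy inherits from EpsilonGreedy and keeps a stillRandom flag), the stopping test can be written as:

    import numpy as np

    def should_stop_exploring(means, t, horizon, alpha=4.0):
        """Random stopping rule sketched from the formula above: stop the uniform
        exploration at time t once two empirical means differ by more than the threshold."""
        threshold = np.sqrt(alpha * np.log(horizon / t) / t)
        largest_gap = np.max(means) - np.min(means)  # = max_{i != j} |X_i(t) - X_j(t)|
        return largest_gap > threshold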
__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_FixedBudget(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\). Sequential exploration until the stopping time.

__init__(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

max_t = None

Time until pure exploitation.

round_robin_index = None

Internal index to keep track of the Round-Robin phase.

best_identified_arm = None

Arm to which we commit; not defined at the beginning.

__str__() -> str[source]

choice()[source]

For \(n\) rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with the highest empirical average.

\[n = \lfloor \frac{2}{\Delta^2} \mathcal{W}(\frac{T^2 \Delta^4}{32 \pi}) \rfloor.\]
  • Where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(\mathcal{W}(y) \exp(\mathcal{W}(y)) = y\) for any \(y > 0\) (computed with scipy.special.lambertw()).
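Evaluating this budget only requires one call to the Lambert W function; a minimal sketch (not the class's internal code) using scipy is:

    from math import floor, pi
    from scipy.special import lambertw

    def exploration_budget(horizon, gap):
        """n = floor((2/gap^2) * W(horizon^2 * gap^4 / (32*pi))), with W the Lambert W function."""
        y = (horizon ** 2) * (gap ** 4) / (32.0 * pi)
        return max(0, floor((2.0 / gap ** 2) * lambertw(y).real))

    # Example: exploration_budget(10000, 0.1) gives the number of Round-Robin
    # rounds before committing, for a horizon T = 10000 and gap Delta = 0.1.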
epsilon

1 while \(t \leq n\), 0 after.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\). Sequential exploration until the stopping time.

__init__(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

round_robin_index = None

Internal index to keep track of the Round-Robin phase.

best_identified_arm = None

Arm to which we commit; not defined at the beginning.

__str__() -> str[source]

choice()[source]

Choose each arm sequentially in a Round-Robin phase, as long as the following criterion is not satisfied, then commit to the arm with the highest empirical average:

\[(t/2) \max_{i \neq j} |\hat{\mu}_i - \hat{\mu}_j| < \log(T \Delta^2).\]
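A minimal sketch of this test (assuming the empirical means, the current time \(t\), the horizon \(T\) and the gap are available; the actual policy implements it through its stopping_criteria() method):

    import numpy as np

    def keep_exploring(means, t, horizon, gap):
        """Continue the Round-Robin phase as long as
        (t/2) * max_{i != j} |mu_i - mu_j| < log(T * gap^2)."""
        spread = np.max(means) - np.min(means)  # = max_{i != j} |mu_hat_i - mu_hat_j|
        return (t / 2.0) * spread < np.log(horizon * gap ** 2)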
stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

epsilon

1 while the policy has not yet committed to an arm, 0 after.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_SPRT(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria

The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\).

stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_BAI(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria

The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\).

__init__(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.DeltaUCB(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} |\mu_i - \mu_j|\).

__init__(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

epsilon_T = None

Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).

__str__() -> str[source]

choice()[source]

Choose between the most chosen and the least chosen arm, based on the following criterion:

\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t).\end{split}\]
\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log(\frac{T}{N_{A_{t,\min}}})}{N_{A_{t,\min}}}} \\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T\end{split}\]
\[A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\]
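A minimal, self-contained sketch of one decision step following this rule (assuming each arm has already been pulled at least once; the names below are illustrative, not the class's attributes):

    import numpy as np

    def delta_ucb_choice(means, pulls, horizon, gap, alpha=4.0):
        """One DeltaUCB decision between the least and the most pulled arm."""
        # epsilon_T = Delta * (log(e + T * Delta^2))^(-1/8), as defined above
        epsilon_T = gap * (np.log(np.e + horizon * gap ** 2)) ** (-1.0 / 8.0)
        a_min = int(np.argmin(pulls))  # A_{t,min}: least chosen arm
        a_max = int(np.argmax(pulls))  # A_{t,max}: most chosen arm
        ucb_min = means[a_min] + np.sqrt(alpha * np.log(horizon / pulls[a_min]) / pulls[a_min])
        ucb_max = means[a_max] + gap - alpha * epsilon_T
        return a_min if ucb_min >= ucb_max else a_max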
__module__ = 'Policies.ExploreThenCommit'