Policies.SIC_MMAB module

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

  • The algorithm is quite complicated; please see the paper (Algorithm 1, page 6) for details.
  • The UCB-H indexes are used; for more details, see Policies.UCBH.
Policies.SIC_MMAB.c = 1.0

Default value of the parameter \(c\), as it was in pymaBandits v1.0.

Policies.SIC_MMAB.TOLERANCE = 0.0001

Default value of the numerical tolerance used when computing approximations of the kl-UCB indexes.

class Policies.SIC_MMAB.State

Bases: enum.Enum

Different states (phases) of the SIC-MMAB algorithm

Fixation = 1
Estimation = 2
Exploration = 3
Communication = 4
Exploitation = 5
__module__ = 'Policies.SIC_MMAB'
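
For reference, a minimal plain-Python sketch matching the values listed above; the one-line phase descriptions paraphrase the paper and are not part of the module:

>>> from enum import Enum
>>> class State(Enum):
...     """Phases of the SIC-MMAB algorithm."""
...     Fixation = 1       # orthogonalization, in the style of Musical Chairs
...     Estimation = 2     # estimate the number of players and one's ranks
...     Exploration = 3    # pull the active arms and accumulate statistics
...     Communication = 4  # exchange statistics through forced collisions
...     Exploitation = 5   # commit to an arm identified as optimal
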
class Policies.SIC_MMAB.SIC_MMAB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.BasePolicy.BasePolicy

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon = 17, 10000
>>> player1 = SIC_MMAB(nbArms, horizon)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
phase = None

Current state

horizon = None

Horizon T of the experiment.

alpha = None

Parameter \(\alpha\) for the UCB/LCB computations.

Time0 = None

Parameter \(T_0 = \lceil K \log(T) \rceil\).
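
For instance, a one-line sketch of this computation (assuming numpy; the variable names are only illustrative):

>>> import numpy as np
>>> nbArms, horizon = 17, 10000  # K and T
>>> Time0 = int(np.ceil(nbArms * np.log(horizon)))  # T_0 = ceil(K * log(T))
>>> Time0
157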

ext_rank = None

External rank, -1 until known

int_rank = None

Internal rank, starts at 0 and increases when needed

nbPlayers = None

Estimated number of players, starts at 1

last_action = None

Memory of the last played action (initialized randomly)

t_phase = None

Time step counter within the current phase

round_number = None

Index of the current exploration/communication round

active_arms = None

Set of active arms (kept as a numpy array)

__str__()[source]

-> str

startGame()[source]

Reinitialize all the internal memory, and decide how to start (state 1, Fixation, or state 2, Estimation).

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

  • By default, the SIC-MMAB algorithm uses the UCB-H confidence bounds:
\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\end{split}\]
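
A minimal numpy sketch of this computation; rewards, pulls, horizon and alpha stand in for the policy's internal memory and are not the class's actual attribute names:

>>> import numpy as np
>>> def ucb_lcb_H(rewards, pulls, horizon, alpha=4.0):
...     means = rewards / pulls  # empirical means X_k(t) / N_k(t)
...     bias = np.sqrt(alpha * np.log(horizon) / (2.0 * pulls))  # UCB-H uses log(T)
...     return means + bias, means - bias  # (UCB_k(t), LCB_k(t))
>>> ucb, lcb = ucb_lcb_H(np.array([30.0, 50.0]), np.array([100, 100]), horizon=10000)
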
choice()[source]

Choose an arm, as described by the SIC-MMAB algorithm.

getReward(arm, reward, collision=False)[source]

Receive a reward on arm of index ‘arm’, as described by the SIC-MMAB algorithm.

  • If collision is False, the reward comes from a successful (collision-free) pull of the arm.
handleCollision(arm, reward=None)[source]

Handle a collision on the arm of index ‘arm’.

__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_UCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_UCB: SIC-MMAB with the simple UCB-1 confidence bounds.

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\end{split}\]
  • Reference: [Auer et al. 02].
  • Other possibilities include UCB-H (the default, see SIC_MMAB) and klUCB (see SIC_MMAB_klUCB).
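
The only change with respect to the parent class is to use \(\log(t)\) (current time) instead of \(\log(T)\) (horizon); in the same illustrative style as above:

>>> import numpy as np
>>> def ucb_lcb_1(rewards, pulls, t, alpha=4.0):
...     means = rewards / pulls
...     bias = np.sqrt(alpha * np.log(t) / (2.0 * pulls))  # UCB-1: log(t), not log(T)
...     return means + bias, means - bias
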
__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_klUCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_klUCB: SIC-MMAB with the kl-UCB confidence bounds.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon = 17, 10000
>>> player1 = SIC_MMAB_klUCB(nbArms, horizon)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB_klUCB, nbArms, horizon=HORIZON).children
c = None

Parameter \(c\) for the kl-UCB index computation (default: 1)

klucb = None

kl-UCB function to use (default: klucbBern, for Bernoulli rewards)

tolerance = None

Numerical tolerance for the kl-UCB computations

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \mathrm{UCB}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathrm{Bias}_k(t) &= \mathrm{UCB}_k(t) - \hat{\mu}_k(t),\\ \mathrm{LCB}_k(t) &= \hat{\mu}_k(t) - \mathrm{Bias}_k(t).\end{split}\]
  • Here the rewards are assumed to lie in \([a, b]\) (defaults to \([0, 1]\)), \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means \(x\) and \(y\) (see Arms.kullback), and \(c\) is the exploration parameter (defaults to 1).
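
A minimal sketch of this computation for Bernoulli rewards (so \([a, b] = [0, 1]\)), using a plain bisection search in place of the module's klucbBern helper; all names are illustrative:

>>> import numpy as np
>>> def kl_bern(x, y, eps=1e-7):
...     """Bernoulli Kullback-Leibler divergence kl(x, y)."""
...     x, y = min(max(x, eps), 1 - eps), min(max(y, eps), 1 - eps)
...     return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
>>> def klucb_bound(mean, pulls, t, c=1.0, tolerance=1e-4):
...     """sup { q >= mean : kl(mean, q) <= c * log(t) / pulls }, by bisection."""
...     level = c * np.log(t) / pulls
...     low, high = mean, 1.0
...     while high - low > tolerance:
...         mid = (low + high) / 2.0
...         if kl_bern(mean, mid) <= level:
...             low = mid  # mid still satisfies the constraint: search higher
...         else:
...             high = mid
...     return (low + high) / 2.0
>>> mean, pulls, t = 0.5, 100, 1000
>>> ucb = klucb_bound(mean, pulls, t)
>>> lcb = mean - (ucb - mean)  # the LCB uses the same bias as the UCB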

__module__ = 'Policies.SIC_MMAB'