Policies.SIC_MMAB module

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

  • The algorithm is quite complicated; please see the paper (Algorithm 1, page 6) for details.
  • The UCB-H indexes are used; for more details, see Policies.UCBH.
Policies.SIC_MMAB.c = 1.0

Default value of the parameter \(c\), as it was in pymaBandits v1.0.

Policies.SIC_MMAB.TOLERANCE = 0.0001

Default value of the numerical tolerance used when computing approximations of the kl-UCB indexes.

class Policies.SIC_MMAB.State

Bases: enum.Enum

Different states (phases) of the SIC-MMAB algorithm

Fixation = 1
Estimation = 2
Exploration = 3
Communication = 4
Exploitation = 5
__module__ = 'Policies.SIC_MMAB'
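
For reference, a minimal plain-Python sketch matching the values listed above; the one-line phase descriptions paraphrase the paper and are not part of the module:

>>> from enum import Enum
>>> class State(Enum):
...     """Phases of the SIC-MMAB algorithm."""
...     Fixation = 1       # orthogonalization, in the style of Musical Chairs
...     Estimation = 2     # estimate the number of players and one's ranks
...     Exploration = 3    # pull the active arms and accumulate statistics
...     Communication = 4  # exchange statistics through forced collisions
...     Exploitation = 5   # commit to an arm identified as optimal
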
class Policies.SIC_MMAB.SIC_MMAB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.BasePolicy.BasePolicy

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon = 17, 10000
>>> player1 = SIC_MMAB(nbArms, horizon)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
phase = None

Current state

horizon = None

Horizon T of the experiment.

alpha = None

Parameter \(\alpha\) for the UCB/LCB computations.

Time0 = None

Parameter \(T_0 = \lceil K \log(T) \rceil\).
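
For instance, a one-line sketch of this computation (assuming numpy; the variable names are only illustrative):

>>> import numpy as np
>>> nbArms, horizon = 17, 10000  # K and T
>>> Time0 = int(np.ceil(nbArms * np.log(horizon)))  # T_0 = ceil(K * log(T))
>>> Time0
157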

ext_rank = None

External rank, -1 until known

int_rank = None

Internal rank, starts at 0 and increases when needed

nbPlayers = None

Estimated number of players, starts at 1

last_action = None

Memory of the last played action (initialized randomly)

t_phase = None

Time step counter within the current phase

round_number = None

Index of the current exploration/communication round

active_arms = None

Set of active arms (kept as a numpy array)

__str__()[source]

-> str

startGame()[source]

Reinitialize all the internal memory, and decide how to start (state 1, Fixation, or state 2, Estimation).

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

  • By default, the SIC-MMAB algorithm uses the UCB-H confidence bounds:
\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\end{split}\]
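
A minimal numpy sketch of this computation; rewards, pulls, horizon and alpha stand in for the policy's internal memory and are not the class's actual attribute names:

>>> import numpy as np
>>> def ucb_lcb_H(rewards, pulls, horizon, alpha=4.0):
...     means = rewards / pulls  # empirical means X_k(t) / N_k(t)
...     bias = np.sqrt(alpha * np.log(horizon) / (2.0 * pulls))  # UCB-H uses log(T)
...     return means + bias, means - bias  # (UCB_k(t), LCB_k(t))
>>> ucb, lcb = ucb_lcb_H(np.array([30.0, 50.0]), np.array([100, 100]), horizon=10000)
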
choice()[source]

Choose an arm, as described by the SIC-MMAB algorithm.

getReward(arm, reward, collision=False)[source]

Receive a reward on arm of index ‘arm’, as described by the SIC-MMAB algorithm.

  • If collision is False, the reward comes from a successful (collision-free) pull of the arm.
handleCollision(arm, reward=None)[source]

Handle a collision on the arm of index ‘arm’.

__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_UCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_UCB: SIC-MMAB with the simple UCB-1 confidence bounds.

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\end{split}\]
  • Reference: [Auer et al. 02].
  • Other possibilities include UCB-H (the default, see SIC_MMAB) and klUCB (see SIC_MMAB_klUCB).
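
The only change with respect to the parent class is to use \(\log(t)\) (current time) instead of \(\log(T)\) (horizon); in the same illustrative style as above:

>>> import numpy as np
>>> def ucb_lcb_1(rewards, pulls, t, alpha=4.0):
...     means = rewards / pulls
...     bias = np.sqrt(alpha * np.log(t) / (2.0 * pulls))  # UCB-1: log(t), not log(T)
...     return means + bias, means - bias
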
__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_klUCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_klUCB: SIC-MMAB with the kl-UCB confidence bounds.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon = 17, 10000
>>> player1 = SIC_MMAB_klUCB(nbArms, horizon)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB_klUCB, nbArms, horizon=HORIZON).children
c = None

Parameter \(c\) for the kl-UCB index computation (default: 1)

klucb = None

kl-UCB function to use (default: klucbBern, for Bernoulli rewards)

tolerance = None

Numerical tolerance for the kl-UCB computations

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \mathrm{UCB}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathrm{Bias}_k(t) &= \mathrm{UCB}_k(t) - \hat{\mu}_k(t),\\ \mathrm{LCB}_k(t) &= \hat{\mu}_k(t) - \mathrm{Bias}_k(t).\end{split}\]
  • Here the rewards are assumed to lie in \([a, b]\) (defaults to \([0, 1]\)), \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means \(x\) and \(y\) (see Arms.kullback), and \(c\) is the exploration parameter (defaults to 1).
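
A minimal sketch of this computation for Bernoulli rewards (so \([a, b] = [0, 1]\)), using a plain bisection search in place of the module's klucbBern helper; all names are illustrative:

>>> import numpy as np
>>> def kl_bern(x, y, eps=1e-7):
...     """Bernoulli Kullback-Leibler divergence kl(x, y)."""
...     x, y = min(max(x, eps), 1 - eps), min(max(y, eps), 1 - eps)
...     return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
>>> def klucb_bound(mean, pulls, t, c=1.0, tolerance=1e-4):
...     """sup { q >= mean : kl(mean, q) <= c * log(t) / pulls }, by bisection."""
...     level = c * np.log(t) / pulls
...     low, high = mean, 1.0
...     while high - low > tolerance:
...         mid = (low + high) / 2.0
...         if kl_bern(mean, mid) <= level:
...             low = mid  # mid still satisfies the constraint: search higher
...         else:
...             high = mid
...     return (low + high) / 2.0
>>> mean, pulls, t = 0.5, 100, 1000
>>> ucb = klucb_bound(mean, pulls, t)
>>> lcb = mean - (ucb - mean)  # the LCB uses the same bias as the UCB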

__module__ = 'Policies.SIC_MMAB'