Policies.MusicalChair module

MusicalChair: implementation of the decentralized multi-player policy from [Multi-Player Bandits, a Musical Chairs Approach, Rosenski, Shamir & Szlak, 2015](https://arxiv.org/abs/1512.02866).

  • Each player has 3 states: the 1st is random exploration, the 2nd is musical chair, the 3rd is staying seated
  • 1st step:
    • Every player uniformly tries arms for \(T_0\) steps, computing the empirical mean of each arm and the number of observed collisions \(C_{T_0}\)
    • Finally, \(N^* = M\) = nbPlayers is estimated from the number of collisions \(C_{T_0}\) (see the sketch after this list), and the \(N^*\) best arms are computed from their empirical means
  • 2nd step:
    • Every player chooses an arm uniformly at random among the \(N^*\) best arms, until she encounters no collision right after choosing it
    • When an arm was chosen by only one player, she decides to sit on this chair (= arm)
  • 3rd step:
    • Every player stays seated on her chair for the rest of the game
    • \(\implies\) constant regret if \(N^*\) is well estimated and if the estimated N* best arms were correct
    • \(\implies\) linear regret otherwise
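
As an illustration of the estimation done at the end of the first step: with \(M\) players each picking uniformly among \(K\) arms, a given player observes no collision with probability \((1 - 1/K)^{M-1}\), so \(M\) can be recovered from the observed frequency of collisions. A minimal sketch of such an estimator (the function name and the exact rounding are illustrative; the class's actual code may differ):

>>> from math import log
>>> def estimate_nb_players(nbCollision, T0, nbArms):   # hypothetical helper, for illustration only
...     no_collision_rate = (T0 - nbCollision) / T0     # fraction of collision-free observations
...     return int(round(1 + log(no_collision_rate) / log(1 - 1 / nbArms)))
>>> estimate_nb_players(100, 1000, 10)   # 10% collisions among 10 arms: about 2 players
2
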
Policies.MusicalChair.optimalT0(nbArms=10, epsilon=0.1, delta=0.05)[source]

Compute the lower-bound suggesting “large-enough” values for \(T_0\) that should guarantee constant regret with probability at least \(1 - \delta\), if the gap \(\Delta\) is larger than \(\epsilon\).

Examples:

  • For \(K=2\) arms, in order to have a constant regret with probability at least \(95\%\) (\(\delta = 0.05\)), if the gap \(\Delta\) is known to be \(\geq \epsilon = 0.1\), their theoretical analysis suggests using \(T_0 \geq 18459\). That is already huge, for just two arms!
>>> optimalT0(2, 0.1, 0.05)     # Just 2 arms !
18459                           # ==> That's a LOT of steps for just 2 arms!
  • For a harder problem with \(K=6\) arms, a smaller gap \(\Delta \geq \epsilon = 0.01\) and the same risk \(\delta = 0.05\), they suggest at least \(T_0 \geq 7646924\), i.e., about 7 million trials. That is simply too much for any realistic system, and starts to be too large even for simulated systems.
>>> optimalT0(6, 0.01, 0.05)    # Constant regret with >95% proba
7646924                         # ==> That's a LOT of steps!
>>> optimalT0(6, 0.001, 0.05)   # Reasonable value of epsilon
764692376                       # ==> That's a LOT of steps!!!
  • For an even harder problem with \(K=17\) arms, the values given by their Theorem 1 start to be really unrealistic:
>>> optimalT0(17, 0.01, 0.05)   # Constant regret with >95% proba
27331794                        # ==> That's a LOT of steps!
>>> optimalT0(17, 0.001, 0.05)  # Reasonable value of epsilon
2733179304                      # ==> That's a LOT of steps!!!
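
For intuition about how these values scale, all the outputs above are consistent with a single term of the form \(\frac{16 K}{\epsilon^2} \log(4 K^2 / \delta)\). Below is a hedged reconstruction from the displayed values only (the actual function may take a maximum over several such terms from their Theorem 1):

>>> from math import log, ceil
>>> def optimalT0_sketch(nbArms, epsilon, delta):   # illustrative reconstruction, not the library function
...     return int(ceil((16 * nbArms / epsilon**2) * log(4 * nbArms**2 / delta)))
>>> optimalT0_sketch(2, 0.1, 0.05)
18459
>>> optimalT0_sketch(6, 0.01, 0.05)
7646924
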
Policies.MusicalChair.boundOnFinalRegret(T0, nbPlayers)[source]

Compute the upper bound on the final regret when \(T_0\) and \(M\) are known.

  • The “constant” regret of course grows linearly with \(T_0\), as:

    \[\forall T \geq T_0, \;\; R_T \leq T_0 K + 2 \mathrm{exp}(2) K.\]

Warning

This bound is not a deterministic result; it only holds with a certain probability (at least \(1 - \delta\), if \(T_0\) is chosen as given by optimalT0()).

>>> boundOnFinalRegret(18459, 2)        # Crazy constant regret!  # doctest: +ELLIPSIS
36947.5...
>>> boundOnFinalRegret(7646924, 6)      # Crazy constant regret!!  # doctest: +ELLIPSIS
45881632.6...
>>> boundOnFinalRegret(764692376, 6)    # Crazy constant regret!!  # doctest: +ELLIPSIS
4588154344.6...
>>> boundOnFinalRegret(27331794, 17)    # Crazy constant regret!!  # doctest: +ELLIPSIS
464640749.2...
>>> boundOnFinalRegret(2733179304, 17)  # Crazy constant regret!!  # doctest: +ELLIPSIS
46464048419.2...
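
For reference, the bound displayed above can be evaluated directly. A minimal sketch (assuming, as the doctest values suggest, that the multiplicative constant in the bound is the nbPlayers argument):

>>> from math import exp
>>> def bound_sketch(T0, nbPlayers):   # illustrative helper, not the library function
...     return T0 * nbPlayers + 2 * exp(2) * nbPlayers
>>> bound_sketch(18459, 2)   # consistent with the first doctest above  # doctest: +ELLIPSIS
36947.556...
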
class Policies.MusicalChair.State

Bases: enum.Enum

Different states during the Musical Chair algorithm

InitialPhase = 2
MusicalChair = 3
NotStarted = 1
Sitted = 4
__module__ = 'Policies.MusicalChair'
class Policies.MusicalChair.MusicalChair(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

MusicalChair: implementation of the decentralized multi-player policy from [Multi-Player Bandits, a Musical Chairs Approach, Rosenski, Shamir & Szlak, 2015](https://arxiv.org/abs/1512.02866).

__init__(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]
  • nbArms: number of arms,
  • Time0: required, the number of steps of the first step (pure random exploration by each player), or a fraction of the (optional) horizon Time1,
  • N: optional, exact value or upper bound on the number of players,
  • Time1: optional, only used to compute Time0 when Time0 is given as a fraction (e.g., 0.2).
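
A minimal sketch of how a fractional Time0 is meant to interact with the horizon Time1 (an assumption about the intended behavior; the actual constructor may handle corner cases differently):

>>> def resolve_T0(Time0, Time1=None):   # hypothetical helper, for illustration only
...     if 0 < Time0 < 1:                # fractional: interpret as a fraction of the horizon Time1
...         return int(Time0 * Time1)
...     return int(Time0)                # otherwise: already a number of steps
>>> resolve_T0(0.1, 10000)   # as in the example below
1000
>>> resolve_T0(500)
500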

Example:

>>> nbArms, Time0, Time1, N = 17, 0.1, 10000, 6
>>> player1 = MusicalChair(nbArms, Time0, Time1, N)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.25, Time1=HORIZON, N=NB_PLAYERS).children
state = None

Current state

Time0 = None

Parameter T0

nbPlayers = None

Number of players

chair = None

Current chair. Not seated yet.

cumulatedRewards = None

That’s the s_i(t) of the paper

nbObservations = None

That’s the o_i of the paper

A = None

A random permutation of the arms; after the initial phase it is reduced to the \(N^*\) best arms, hence of size nbPlayers.

nbCollision = None

Number of collisions; that's the \(C_{T_0}\) of the paper

t = None

Internal time

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

choice()[source]

Choose an arm, as described by the Musical Chair algorithm.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the Musical Chair algorithm.

  • If there was no collision, receive a reward after pulling the arm.
_endInitialPhase()[source]

Small computation needed at the end of the initial random exploration phase.

handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

  • Warning: this method has to be implemented in the collision model; it is NOT implemented in the EvaluatorMultiPlayers.
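
To make the interplay of choice(), getReward() and handleCollision() during the second step concrete, here is a hedged sketch of that step alone (the class, attribute names and method bodies are illustrative, not the actual implementation):

>>> import random
>>> class Phase2Player:
...     """Minimal model of the musical chair step only (for illustration)."""
...     def __init__(self, best_arms):
...         self.best_arms = best_arms   # the N* estimated best arms from the first step
...         self.chair = None            # tentative chair, none chosen yet
...         self.sitted = False          # not seated yet
...     def choice(self):
...         if self.chair is None:       # not seated: draw a new chair uniformly at random
...             self.chair = random.choice(self.best_arms)
...         return self.chair
...     def getReward(self, arm, reward):
...         if not self.sitted and arm == self.chair:
...             self.sitted = True       # no collision right after choosing: sit and stay there
...     def handleCollision(self, arm, reward=None):
...         if not self.sitted and arm == self.chair:
...             self.chair = None        # collision on the tentative chair: stand up and retry
>>> player = Phase2Player([0, 3, 4])
>>> arm = player.choice()
>>> player.handleCollision(arm)   # a collision occurred: a new chair will be drawn next time
>>> player.chair is None
True
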
__module__ = 'Policies.MusicalChair'