PoliciesMultiPlayers.RandTopM module

RandTopM: four proposals for an efficient multi-players learning policy. RandTopM and MCTopM are the two main algorithms, with variants (see below).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB, etc.),
  • But instead of aiming at the best (the 1-st best) arm, player j constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
  • When a collision occurs, or when the currently chosen arm lies outside of the current estimate of the M-best set, a new current arm is chosen (a minimal sketch of this rule is given after the note below).

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.
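
Here is a minimal, self-contained sketch of one round of the RandTopM / MCTopM decision for a single player. All names (one_round, indexes, seated, with_chair) are illustrative assumptions, not the library's internals, and the re-sampling rule shown is the plain uniform one of RandTopM.

import numpy as np

rng = np.random.default_rng(0)

def one_round(indexes, chosen_arm, collision, maxRank, seated=False, with_chair=False):
    """Return the arm to play next, given the current indexes g^j_k(t)."""
    Mbest = np.argsort(indexes)[-maxRank:]         # estimated M-best set: maxRank largest indexes
    must_move = chosen_arm not in Mbest            # the chosen arm left the M-best estimate
    if collision and not (with_chair and seated):  # also move on a collision, unless "seated"
        must_move = True
    if must_move:
        chosen_arm = rng.choice(Mbest)             # re-sample uniformly inside the M-best estimate
    return chosen_arm

# Player aims at one of the M = 2 best arms among 5; arm 0 is not in the M-best, so she moves.
print(one_round(np.array([0.1, 0.9, 0.4, 0.8, 0.2]), chosen_arm=0, collision=False, maxRank=2))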

PoliciesMultiPlayers.RandTopM.WITH_CHAIR = False

Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move on future collisions (she assumes the others will move). She will still change her chosen arm if it lies outside of the estimated M-best set. RandTopM (and its variants) uses False, MCTopM (and its variants) uses True.

PoliciesMultiPlayers.RandTopM.OPTIM_PICK_WORST_FIRST = False

XXX First experimental idea: when the currently chosen arm lies outside of the estimated M-best set, force the player to first try (at least once) the arm with the lowest index in the \(\hat{M}^j(t)\) set. Used by RandTopMCautious and RandTopMExtraCautious, and by MCTopMCautious and MCTopMExtraCautious.

PoliciesMultiPlayers.RandTopM.OPTIM_EXIT_IF_WORST_WAS_PICKED = False

XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set, leave it (even before it lies outside of \(\hat{M}^j(t)\)). Used by RandTopMExtraCautious and MCTopMExtraCautious.

PoliciesMultiPlayers.RandTopM.OPTIM_PICK_PREV_WORST_FIRST = True

XXX Third experimental idea: when the chosen arm has to be changed, restrict the new random choice to the arms of the estimated M-best set whose index at the previous step was not larger than the previous index of the current arm (the set returned by worst_previous__and__current_Mbest()). Now the default! False only for RandTopMOld and MCTopMOld.
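
Reading these three flags together with WITH_CHAIR, the eight variants documented below differ only in how they set these booleans. A hedged summary, derived from the descriptions above rather than copied from the source, listing only the deviations from the RandTopM defaults:

VARIANT_FLAGS = {
    "RandTopM":              {},  # withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True
    "RandTopMCautious":      {"pickWorstFirst": True},
    "RandTopMExtraCautious": {"pickWorstFirst": True, "exitIfWorstWasPicked": True},
    "RandTopMOld":           {"pickPrevWorstFirst": False},
    "MCTopM":                {"withChair": True},
    "MCTopMCautious":        {"withChair": True, "pickWorstFirst": True},
    "MCTopMExtraCautious":   {"withChair": True, "pickWorstFirst": True, "exitIfWorstWasPicked": True},
    "MCTopMOld":             {"withChair": True, "pickPrevWorstFirst": False},
}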

class PoliciesMultiPlayers.RandTopM.oneRandTopM(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random arm is sampled after observing a collision,
  • And the player does not aim at the best arm, but at one of the M best arms, based on her index policy.
  • (See variants for more details.)
__init__(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different.

chosen_arm = None

Current chosen arm.

sitted = None

Whether the player is already “seated” (initially she is not). After one step without collision, she stops reacting to collisions (but still reacts when the chosen arm lies outside the estimated M-best set).

prevWorst = None

Keep track of the arms that were estimated worse than the chosen one at the previous time step.

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

Mbest()[source]

Current estimate of the M-best arms. M is the maxRank given to the algorithm.

worst_Mbest()[source]

Index of the worst of the current estimate of the M-best arms. M is the maxRank given to the algorithm.
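
A plausible numpy rendering of these two helpers, assuming index holds the current indexes \(g^j_k(t)\); this is an illustrative sketch, not necessarily the library's exact code.

import numpy as np

def Mbest(index, maxRank):
    # The maxRank arms with the largest indexes.
    return np.argsort(index)[-maxRank:]

def worst_Mbest(index, maxRank):
    # The arm with the smallest index among the estimated M-best arms.
    best = Mbest(index, maxRank)
    return best[np.argmin(index[best])]

index = np.array([0.1, 0.9, 0.4, 0.8, 0.2])
print(Mbest(index, 2))        # [3 1]
print(worst_Mbest(index, 2))  # 3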

worst_previous__and__current_Mbest(current_arm)[source]

Return the set from which to select a random arm for MCTopM (the optimization is now the default):

\[\hat{M}^j(t) \cap \{ m : g^j_m(t-1) \leq g^j_k(t-1) \},\]

where \(k\) is the currently chosen arm.
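
A small sketch of how that set could be computed, where prev_index holds the indexes \(g^j_m(t-1)\) and k is the currently chosen arm (hypothetical names, not the library's exact code):

import numpy as np

def worst_previous_and_current_Mbest(index, prev_index, k, maxRank):
    # Arms of the current M-best estimate whose index at t-1 was not larger
    # than the previous index of the currently chosen arm k.
    current_Mbest = np.argsort(index)[-maxRank:]
    return np.array([m for m in current_Mbest if prev_index[m] <= prev_index[k]])

index = np.array([0.1, 0.9, 0.4, 0.8, 0.2])
prev_index = np.array([0.3, 0.7, 0.5, 0.6, 0.2])
print(worst_previous_and_current_Mbest(index, prev_index, k=1, maxRank=3))  # [2 3 1]
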
handleCollision(arm, reward=None)[source]

Get a new random arm from the current estimate of the M-best set, and give the reward to the underlying algorithm if it is not None.

getReward(arm, reward)[source]

Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

choice()[source]

Reconsider the choice of arm, and then use the chosen arm.

  • For all variants, if the chosen arm is no longer in the current estimate of the M-best set, a new one is selected,
  • The basic RandTopM selects an arm uniformly at random in the estimated M-best set,
  • MCTopM starts “not seated” on its newly chosen arm,
  • MCTopMCautious is forced to first try the arm with the lowest index (UCB or whatever index policy is used).
_index()[source]

Update and return the indexes of the underlying index policy.

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopM(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopM: a proposal for an efficient multi-players learning policy.

__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • withChair: see WITH_CHAIR,
  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
  • exitIfWorstWasPicked: see EXIT_IF_WORST_WAS_PICKED,
  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
  • maxRank: maximum rank allowed by the RandTopM child (defaults to nbPlayers, but for instance if there are 2 × RandTopM[UCB] + 2 × RandTopM[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!
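
A hedged sketch of how the children documented here could be driven by hand in a small simulation loop, using only the methods listed on this page (startGame, choice, getReward, handleCollision). The Bernoulli arms and the collision rule are illustrative; in practice the library's environment classes take care of this.

import numpy as np
from Policies import UCB
from PoliciesMultiPlayers.RandTopM import RandTopM

rng = np.random.default_rng(0)
means = rng.uniform(size=17)          # hypothetical Bernoulli arm means
s = RandTopM(6, 17, UCB)
for child in s.children:
    child.startGame()

for t in range(1000):
    choices = [child.choice() for child in s.children]
    for player, arm in enumerate(choices):
        if choices.count(arm) > 1:    # collision: several players chose the same arm
            s.children[player].handleCollision(arm)
        else:
            reward = float(rng.random() < means[arm])   # Bernoulli reward
            s.children[player].getReward(arm, reward)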

maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMCautious child (defaults to nbPlayers, but for instance if there are 2 × RandTopMCautious[UCB] + 2 × RandTopMCautious[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMExtraCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMExtraCautious child (defaults to nbPlayers, but for instance if there are 2 × RandTopMExtraCautious[UCB] + 2 × RandTopMExtraCautious[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMOld(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMOld: the previous version of RandTopM, without the pickPrevWorstFirst optimization (see OPTIM_PICK_PREV_WORST_FIRST).

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMOld child (defaults to nbPlayers, but for instance if there are 2 × RandTopMOld[UCB] + 2 × RandTopMOld[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopM(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopM: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopM child (defaults to nbPlayers, but for instance if there are 2 × MCTopM[UCB] + 2 × MCTopM[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopMCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMCautious child (defaults to nbPlayers, but for instance if there are 2 × MCTopMCautious[UCB] + 2 × MCTopMCautious[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopMExtraCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMExtraCautious child (defaults to nbPlayers, but for instance if there are 2 × MCTopMExtraCautious[UCB] + 2 × MCTopMExtraCautious[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__module__ = 'PoliciesMultiPlayers.RandTopM'
__str__()[source]

Return str(self).

class PoliciesMultiPlayers.RandTopM.MCTopMOld(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMOld: the previous version of MCTopM, without the pickPrevWorstFirst optimization (see OPTIM_PICK_PREV_WORST_FIRST).

Warning

Still very experimental! But it seems to be one of the most efficient decentralized MP algorithms we have so far… The two other variants of MCTopM seem even better!

__module__ = 'PoliciesMultiPlayers.RandTopM'
__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMOld child (defaults to nbPlayers, but for instance if there are 2 × MCTopMOld[UCB] + 2 × MCTopMOld[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).