PoliciesMultiPlayers.RandTopMEst module
RandTopMEst: four proposals for an efficient multi-players learning policy. RandTopMEst and MCTopMEst are the two main algorithms, with variants (see below).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB, etc.),
- But instead of aiming at the best (the 1st best) arm, player j constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
- When a collision occurs, or when the currently chosen arm lies outside of the current estimate of the M-best set, a new arm is chosen,
- The (fixed) number of players is learned on the run (a minimal sketch of this decision rule is given after the notes below).
Note
This is fully decentralized: players do not need to know the (fixed) number of players in advance!
- Reference: [Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017](https://hal.inria.fr/hal-01629733)
Warning
This is still very experimental!
Note
For a more generic approach, see the wrapper defined in EstimateM.EstimateM.
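The decision rule above can be summarized in a minimal sketch. This is illustrative pseudo-Python only, not the library's implementation; the function randtopm_step and its arguments are hypothetical names:

```python
import numpy as np

def randtopm_step(indexes, current_arm, M_hat, collided, rng):
    """One decision step of a RandTopM-style player (illustrative sketch only).

    indexes     -- index values g^j_k(t) (UCB, KL-UCB, ...), one per arm
    current_arm -- arm played at the previous round
    M_hat       -- current estimate of the number of players
    collided    -- whether the previous play collided
    """
    # Estimated M-best set: the arms with the M_hat largest indexes.
    top_M = np.argsort(indexes)[-M_hat:]
    if collided or current_arm not in top_M:
        # Collision, or the arm left the estimated M-best set:
        # re-sample a new arm uniformly at random inside that set.
        return int(rng.choice(top_M))
    return current_arm

# Toy usage: 5 arms, estimate of 2 players, previous play collided.
rng = np.random.default_rng(0)
indexes = np.array([0.1, 0.7, 0.4, 0.9, 0.2])
print(randtopm_step(indexes, current_arm=3, M_hat=2, collided=True, rng=rng))
```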
class PoliciesMultiPlayers.RandTopMEst.oneRandTopMEst(threshold, *args, **kwargs)
Bases: PoliciesMultiPlayers.RandTopM.oneRandTopM
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple: basically every player starts with \(\hat{M}_i(0) = 1\), and when colliding sets \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (governed by a threshold function). A hedged sketch follows.
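The sketch below assumes a threshold function playing the role of EstimateM.threshold_on_t_doubling_trick; the class EstimatorState and the function handle_collision are hypothetical names, not the library's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EstimatorState:
    """Minimal state for the sketch (hypothetical, not the library's class)."""
    nb_arms: int
    t: int = 0
    nb_players_estimate: int = 1      # hat{M}_i(0) = 1: start by assuming alone
    collision_count: np.ndarray = None

    def __post_init__(self):
        self.collision_count = np.zeros(self.nb_arms, dtype=int)

def handle_collision(state, arm, threshold):
    """On a collision, count it; once the count on some arm exceeds the
    threshold, increment the estimate and restart the counters."""
    state.collision_count[arm] += 1
    if state.collision_count[arm] >= threshold(state.t, state.nb_players_estimate):
        state.nb_players_estimate += 1    # hat{M}_i(t+1) = hat{M}_i(t) + 1
        state.collision_count[:] = 0      # reset counts since the last increase
```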
__init__(threshold, *args, **kwargs)
Initialize self. See help(type(self)) for accurate signature.
threshold = None
Threshold function used to decide when to increase nbPlayersEstimate.
nbPlayersEstimate = None
Estimate of the number of players. Optimistic: start by assuming the player is alone!
collisionCount = None
Count of collisions on each arm, since the last increase of nbPlayersEstimate.
timeSinceLastCollision = None
Time since the last collision (kept for experimentation; currently unused).
t = None
Internal time
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
PoliciesMultiPlayers.RandTopMEst.WITH_CHAIR = False
Whether to use the variant with the “chair”: after playing an arm successfully (no collision), a player will not move after future collisions (she assumes the others will move), but she will still change her chosen arm if it lies outside of the estimated M-best set. RandTopMEst (and variants) uses False, and MCTopMEst (and variants) uses True. A sketch of this rule is given just below.
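A hedged sketch of the chair rule, in the same illustrative conventions as the sketch above (chair_step and its arguments are hypothetical names):

```python
import numpy as np

def chair_step(indexes, current_arm, M_hat, seated, collided, rng):
    """One decision step with the chair (WITH_CHAIR = True) -- a sketch only.
    Returns the next arm and the updated 'seated' flag."""
    top_M = np.argsort(indexes)[-M_hat:]
    if current_arm not in top_M:
        # Even a seated player moves once her arm leaves the estimated M-best set.
        return int(rng.choice(top_M)), False
    if collided:
        if seated:
            return current_arm, True              # sit still: assume the other moves
        return int(rng.choice(top_M)), False      # not yet seated: re-sample
    return current_arm, True                      # successful play: sit on this arm
```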
PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_WORST_FIRST = False
First experimental idea: when the currently chosen arm lies outside of the estimated M-best set \(\hat{M}^j(t)\), force to first try (at least once) the arm with the lowest UCB index in this set. Used by RandTopMEstCautious and RandTopMEstExtraCautious, and by MCTopMEstCautious and MCTopMEstExtraCautious.
PoliciesMultiPlayers.RandTopMEst.OPTIM_EXIT_IF_WORST_WAS_PICKED = False
Second experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set \(\hat{M}^j(t)\), leave it (even before it lies outside of the set). Used by RandTopMEstExtraCautious and MCTopMEstExtraCautious.
PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_PREV_WORST_FIRST = True
Third experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set \(\hat{M}^j(t)\), leave it (even before it lies outside of the set). This is now the default; it is False only for RandTopMEstOld and MCTopMEstOld.
class PoliciesMultiPlayers.RandTopMEst.RandTopMEst(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)
Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopMEst: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.
__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- withChair: see WITH_CHAIR,
- pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
- exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,
- pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
- threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMEst(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
- To get a list of usable players, use s.children.
Warning
s._players is for internal use ONLY!
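The threshold function can also be passed explicitly. A hedged example, assuming the threshold functions are importable from PoliciesMultiPlayers.EstimateM as referenced above:
>>> from PoliciesMultiPlayers.EstimateM import threshold_on_t
>>> s2 = RandTopMEst(nbPlayers, nbArms, UCB, threshold=threshold_on_t)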
nbPlayers = None
Number of players
withChair = None
Using a chair?
pickWorstFirst = None
Using the first optimization?
exitIfWorstWasPicked = None
Using the second optimization?
pickPrevWorstFirst = None
Using the third optimization? Defaults to yes now.
children = None
List of children, fake algorithms
nbArms = None
Number of arms
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)
Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopMEstPlus: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.
__init__(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- horizon: need to know the horizon \(T\).
- withChair: see WITH_CHAIR,
- pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
- exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,
- pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
- threshold: the threshold function to use, see threshold_on_t_with_horizon() or threshold_on_t() above.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = RandTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
- To get a list of usable players, use s.children.
Warning
s._players is for internal use ONLY!
nbPlayers = None
Number of players
withChair = None
Using a chair?
pickWorstFirst = None
Using the first optimization?
exitIfWorstWasPicked = None
Using the second optimization?
pickPrevWorstFirst = None
Using the third optimization? Defaults to yes now.
children = None
List of children, fake algorithms
nbArms = None
Number of arms
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.MCTopMEst(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)
Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEst
MCTopMEst: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
__init__(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.MCTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)
Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus
MCTopMEstPlus: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
__init__(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- horizon: need to know the horizon \(T\).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
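To close, a minimal end-to-end simulation sketch on a toy Bernoulli problem. It assumes the children expose startGame(), choice(), getReward(arm, reward) and handleCollision(arm), as the ChildPointer interface does elsewhere in SMPyBandits; the arm means and the collision handling below are purely illustrative:

```python
from Policies import UCB
from PoliciesMultiPlayers.RandTopMEst import RandTopMEst
import numpy as np

nbArms, nbPlayers, horizon = 17, 6, 1000
s = RandTopMEst(nbPlayers, nbArms, UCB)
rng = np.random.default_rng(0)
means = rng.uniform(size=nbArms)          # toy Bernoulli arm means

for child in s.children:
    child.startGame()
for t in range(horizon):
    choices = [child.choice() for child in s.children]
    counts = np.bincount(choices, minlength=nbArms)   # detect collisions
    for child, arm in zip(s.children, choices):
        if counts[arm] > 1:
            child.handleCollision(arm)                # collided: no reward
        else:
            child.getReward(arm, float(rng.random() < means[arm]))
```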