PoliciesMultiPlayers.rhoLearnExp3 module
rhoLearnExp3: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using the Exp3 learning algorithm instead of random exploration for choosing the rank.
- Each child player is selfish and plays according to an index policy (any index policy, e.g., UCB, Thompson sampling, klUCB, BayesUCB, etc.),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M; when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, …, M], where M is the number of players.
- If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy (like Exp3 here), it might be better! Warning: no theoretical guarantees exist!
- Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J. Darak, C. Moy, J. Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), Algorithm 2 (for BayesUCB only).
Note
This is not fully decentralized: each child player needs to know the (fixed) number of players.
For the Exp3 algorithm:
- Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
- See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
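To make the rank-selection mechanism concrete, here is a minimal, hypothetical sketch of an Exp3 learner over the M ranks, in the spirit described above. The class and method names are illustrative only, not this module's actual implementation:

```python
# Minimal, hypothetical sketch of the rank-learning idea (NOT the actual
# implementation of this module): an Exp3 learner over the M possible ranks.
import numpy as np

class RankLearnerSketch:
    """Exp3 over the ranks {1, ..., M}; names are assumed, for illustration."""

    def __init__(self, M, gamma=0.1):
        self.M = M                        # number of players = number of ranks
        self.gamma = gamma                # exploration parameter of Exp3
        self.weights = np.ones(M)         # one weight per rank
        self.probs = np.full(M, 1.0 / M)  # last sampling distribution
        self.rank = 1                     # current rank, starting at 1

    def sample_rank(self):
        """Sample a rank from the exponentially-weighted distribution."""
        self.probs = ((1 - self.gamma) * self.weights / self.weights.sum()
                      + self.gamma / self.M)
        self.rank = 1 + np.random.choice(self.M, p=self.probs)
        return self.rank

    def update(self, reward):
        """Exp3 importance-weighted update for the last played rank."""
        i = self.rank - 1
        estimate = reward / self.probs[i]  # unbiased reward estimate
        self.weights[i] *= np.exp(self.gamma * estimate / self.M)
```

On a collision, one would call `update(0)` and then `sample_rank()`; after a collision-free play, `update(1)`, so that ranks leading to fewer collisions accumulate weight.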
- `PoliciesMultiPlayers.rhoLearnExp3.binary_feedback(sensing, collision)`: count 1 iff the sensing authorized communication and no collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= r_{j,t} = F_{j,t} \times (1 - c_{j,t}), \\ \text{where}\;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
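Assuming `sensing` and `collision` are 0/1 values, the formula above reduces to a one-line function; a minimal sketch:

```python
# Sketch of the binary feedback: 1 iff the channel was free AND no collision.
def binary_feedback(sensing, collision):
    return sensing * (1 - collision)
```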
- `PoliciesMultiPlayers.rhoLearnExp3.ternary_feedback(sensing, collision)`: count 1 iff the sensing authorized communication and no collision was observed, 0 if there was no communication, and -1 iff there was communication but a collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= F_{j,t} \times (2 r_{j,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{j,t} \times (1 - c_{j,t}), \\ \text{and} \;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
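The same computation in Python, reusing the binary feedback (still assuming 0/1 inputs); a minimal sketch:

```python
# Sketch of the ternary feedback: +1 on a successful transmission, -1 on a
# collision, 0 when the channel was busy (no transmission attempted).
def ternary_feedback(sensing, collision):
    r = sensing * (1 - collision)  # binary feedback, as above
    return sensing * (2 * r - 1)
```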
- `PoliciesMultiPlayers.rhoLearnExp3.generic_ternary_feedback(sensing, collision, bonus=1, malus=-1)`: count `bonus` iff the sensing authorized communication and no collision was observed, `malus` iff there was communication but a collision was observed, and 0 if there was no communication.
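This description translates to a simple branch; a minimal sketch, under the same 0/1 input assumption:

```python
# Sketch of the generic ternary feedback: bonus / malus / 0.
def generic_ternary_feedback(sensing, collision, bonus=1, malus=-1):
    if sensing and not collision:
        return bonus   # successful, collision-free transmission
    elif sensing and collision:
        return malus   # transmission attempted, but a collision occurred
    else:
        return 0       # channel busy, no transmission
```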
- `PoliciesMultiPlayers.rhoLearnExp3.generic_continuous_feedback(sensing, collision, bonus=1, malus=-1)`: count `bonus` iff the sensing authorized communication and no collision was observed, `malus` iff there was communication but a collision was observed; if there was no communication, the reward is the midpoint (bonus + malus)/2, which is not necessarily 0.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= \mathrm{malus} + (\mathrm{bonus} - \mathrm{malus}) \times \frac{r'_{j,t} + 1}{2}, \\ \text{where}\;\; r'_{j,t} &:= F_{j,t} \times (2 r_{j,t} - 1), \\ \text{and}\;\; r_{j,t} &:= F_{j,t} \times (1 - c_{j,t}), \\ \text{and} \;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
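Transcribing the formula above into code gives the following sketch; note that the no-communication case yields the midpoint (bonus + malus)/2, which is 0 only for the default bonus=1, malus=-1:

```python
# Sketch of the continuous generic feedback, per the formula above.
def generic_continuous_feedback(sensing, collision, bonus=1, malus=-1):
    r = sensing * (1 - collision)    # binary feedback
    r_prime = sensing * (2 * r - 1)  # ternary feedback, in {-1, 0, +1}
    return malus + (bonus - malus) * (r_prime + 1) / 2.0
```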
- `PoliciesMultiPlayers.rhoLearnExp3.reward_from_decoupled_feedback(sensing, collision)`: decide the default feedback function to use (`binary_feedback`, cf. the default value of feedback_function in rhoLearnExp3 below). FIXME try all of them!
- `PoliciesMultiPlayers.rhoLearnExp3.CHANGE_RANK_EACH_STEP = False`: should oneRhoLearnExp3 players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.
- class `PoliciesMultiPlayers.rhoLearnExp3.oneRhoLearnExp3(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)`
Bases: `PoliciesMultiPlayers.rhoRand.oneRhoRand`
Class that acts as a child policy, but in fact passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
- When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
- `__init__(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)`: Initialize self. See help(type(self)) for accurate signature.
- `maxRank = None`: max rank, usually nbPlayers but can be different.
- `rank = None`: current rank, starting at 1.
- `change_rank_each_step = None`: change rank at each step?
- `feedback_function = None`: feedback function: (sensing, collision) -> reward.
- `getReward(arm, reward)`: give a “good” reward to the rank selection algorithm (no collision), give the reward to the arm selection algorithm, and, if self.change_rank_each_step, select a (possibly new) rank. See the sketch after this list.
- `handleCollision(arm, reward)`: give a “bad” reward to the rank selection algorithm, and select a (possibly new) rank. See the sketch after this list.
- `__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'`
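To see how the two callbacks cooperate, here is a schematic, hypothetical sketch of their control flow. Attribute names such as armSelection are assumed for illustration; the real methods also inherit book-keeping from oneRhoRand:

```python
# Schematic sketch of the two callbacks (NOT the exact source code).
class OneRhoLearnSketch:
    def __init__(self, arm_policy, rank_policy, feedback_function,
                 change_rank_each_step=False):
        self.armSelection = arm_policy      # hypothetical name: index policy
        self.rankSelection = rank_policy    # learns over ranks 1..maxRank
        self.feedback_function = feedback_function
        self.change_rank_each_step = change_rank_each_step
        self.rank = 1

    def getReward(self, arm, reward):
        # "Good" feedback for the current rank: communication, no collision.
        self.rankSelection.getReward(self.rank - 1,
                                     self.feedback_function(1, 0))
        self.armSelection.getReward(arm, reward)  # update the index policy
        if self.change_rank_each_step:            # optional trick (P2)
            self.rank = 1 + self.rankSelection.choice()

    def handleCollision(self, arm, reward=None):
        # "Bad" feedback for the current rank, then resample a new rank.
        self.rankSelection.getReward(self.rank - 1,
                                     self.feedback_function(1, 1))
        self.rank = 1 + self.rankSelection.choice()
```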
- class `PoliciesMultiPlayers.rhoLearnExp3.rhoLearnExp3(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=Exp3Decreasing, maxRank=None, change_rank_each_step=False, feedback_function=binary_feedback, lower=0.0, amplitude=1.0, *args, **kwargs)`
Bases: `PoliciesMultiPlayers.rhoRand.rhoRand`
rhoLearnExp3: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of random exploration for choosing the rank.
- `__init__(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=Exp3Decreasing, maxRank=None, change_rank_each_step=False, feedback_function=binary_feedback, lower=0.0, amplitude=1.0, *args, **kwargs)`
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- rankSelectionAlgo: algorithm to use for selecting the ranks.
- maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
- *args, **kwargs: additional arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnExp3(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[0, 1, 9, 0, 10, 3]
>>> [ child.choice() for child in s.children ]
[11, 2, 0, 0, 4, 5]
- To get a list of usable players, use `s.children`.
- Warning: `s._players` is for internal use ONLY!
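For instance, to keep the default Exp3 rank selection but swap in the ternary feedback, a hypothetical usage sketch (assuming the imports below resolve as in the example above):

```python
# Hypothetical usage sketch: same setup as the example above, but swapping
# the feedback function (ternary_feedback is defined in this module).
from PoliciesMultiPlayers.rhoLearnExp3 import rhoLearnExp3, ternary_feedback
from Policies import UCB
from Policies.Exp3 import Exp3Decreasing

s = rhoLearnExp3(6, 17, UCB,
                 rankSelectionAlgo=Exp3Decreasing,    # the default anyway
                 feedback_function=ternary_feedback)  # instead of binary
players = s.children  # usable child players, one per user
```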
- `maxRank = None`: max rank, usually nbPlayers but can be different.
- `nbPlayers = None`: number of players.
- `children = None`: list of children, fake algorithms.
- `rankSelectionAlgo = None`: policy to use to choose the ranks.
- `nbArms = None`: number of arms.
- `change_rank_each_step = None`: change rank at each step?
- `__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'`