PoliciesMultiPlayers.rhoLearnExp3 module
rhoLearnExp3: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using the Exp3 learning algorithm instead of random exploration for choosing the rank.
- Each child player is selfish and plays according to an index policy (any index policy, e.g., UCB, Thompson sampling, klUCB, BayesUCB, etc.),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M; when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, …, M], where M is the number of players.
- If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy (like Exp3 here), it might be better! Warning: no theoretical guarantees exist!
- Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J. Darak, C. Moy, J. Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), Algorithm 2 (for BayesUCB only).
Note
This is not fully decentralized: each child player needs to know the (fixed) number of players.
For the Exp3 algorithm:
- Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S. Bubeck & N. Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
- See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
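To make the rank-selection mechanism concrete, here is a minimal, hypothetical sketch of an Exp3 learner over the M ranks, in the spirit described above. The class and method names are illustrative only, not this module's actual implementation:

```python
# Minimal, hypothetical sketch of the rank-learning idea (NOT the actual
# implementation of this module): an Exp3 learner over the M possible ranks.
import numpy as np

class RankLearnerSketch:
    """Exp3 over the ranks {1, ..., M}; names are assumed, for illustration."""

    def __init__(self, M, gamma=0.1):
        self.M = M                        # number of players = number of ranks
        self.gamma = gamma                # exploration parameter of Exp3
        self.weights = np.ones(M)         # one weight per rank
        self.probs = np.full(M, 1.0 / M)  # last sampling distribution
        self.rank = 1                     # current rank, starting at 1

    def sample_rank(self):
        """Sample a rank from the exponentially-weighted distribution."""
        self.probs = ((1 - self.gamma) * self.weights / self.weights.sum()
                      + self.gamma / self.M)
        self.rank = 1 + np.random.choice(self.M, p=self.probs)
        return self.rank

    def update(self, reward):
        """Exp3 importance-weighted update for the last played rank."""
        i = self.rank - 1
        estimate = reward / self.probs[i]  # unbiased reward estimate
        self.weights[i] *= np.exp(self.gamma * estimate / self.M)
```

On a collision, one would call `update(0)` and then `sample_rank()`; after a collision-free play, `update(1)`, so that ranks leading to fewer collisions accumulate weight.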
- `PoliciesMultiPlayers.rhoLearnExp3.binary_feedback(sensing, collision)`: count 1 iff the sensing authorized communication and no collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= r_{j,t} = F_{j,t} \times (1 - c_{j,t}), \\ \text{where}\;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
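Assuming `sensing` and `collision` are 0/1 values, the formula above reduces to a one-line function; a minimal sketch:

```python
# Sketch of the binary feedback: 1 iff the channel was free AND no collision.
def binary_feedback(sensing, collision):
    return sensing * (1 - collision)
```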
- `PoliciesMultiPlayers.rhoLearnExp3.ternary_feedback(sensing, collision)`: count 1 iff the sensing authorized communication and no collision was observed, 0 if there was no communication, and -1 iff there was communication but a collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= F_{j,t} \times (2 r_{j,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{j,t} \times (1 - c_{j,t}), \\ \text{and} \;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
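The same computation in Python, reusing the binary feedback (still assuming 0/1 inputs); a minimal sketch:

```python
# Sketch of the ternary feedback: +1 on a successful transmission, -1 on a
# collision, 0 when the channel was busy (no transmission attempted).
def ternary_feedback(sensing, collision):
    r = sensing * (1 - collision)  # binary feedback, as above
    return sensing * (2 * r - 1)
```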
- `PoliciesMultiPlayers.rhoLearnExp3.generic_ternary_feedback(sensing, collision, bonus=1, malus=-1)`: count `bonus` iff the sensing authorized communication and no collision was observed, `malus` iff there was communication but a collision was observed, and 0 if there was no communication.
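This description translates to a simple branch; a minimal sketch, under the same 0/1 input assumption:

```python
# Sketch of the generic ternary feedback: bonus / malus / 0.
def generic_ternary_feedback(sensing, collision, bonus=1, malus=-1):
    if sensing and not collision:
        return bonus   # successful, collision-free transmission
    elif sensing and collision:
        return malus   # transmission attempted, but a collision occurred
    else:
        return 0       # channel busy, no transmission
```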
- `PoliciesMultiPlayers.rhoLearnExp3.generic_continuous_feedback(sensing, collision, bonus=1, malus=-1)`: count `bonus` iff the sensing authorized communication and no collision was observed, `malus` iff there was communication but a collision was observed; if there was no communication, the reward is the midpoint (bonus + malus)/2, which is not necessarily 0.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= \mathrm{malus} + (\mathrm{bonus} - \mathrm{malus}) \times \frac{r'_{j,t} + 1}{2}, \\ \text{where}\;\; r'_{j,t} &:= F_{j,t} \times (2 r_{j,t} - 1), \\ \text{and}\;\; r_{j,t} &:= F_{j,t} \times (1 - c_{j,t}), \\ \text{and} \;\; F_{j,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{j,t} &\; \text{is the collision feedback (1 iff user } j \text{ experienced a collision)}.\end{split}\]
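Transcribing the formula above into code gives the following sketch; note that the no-communication case yields the midpoint (bonus + malus)/2, which is 0 only for the default bonus=1, malus=-1:

```python
# Sketch of the continuous generic feedback, per the formula above.
def generic_continuous_feedback(sensing, collision, bonus=1, malus=-1):
    r = sensing * (1 - collision)    # binary feedback
    r_prime = sensing * (2 * r - 1)  # ternary feedback, in {-1, 0, +1}
    return malus + (bonus - malus) * (r_prime + 1) / 2.0
```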
- `PoliciesMultiPlayers.rhoLearnExp3.reward_from_decoupled_feedback(sensing, collision)`: decide the default feedback function to use (`binary_feedback`, cf. the default value of feedback_function in rhoLearnExp3 below). FIXME try all of them!
- `PoliciesMultiPlayers.rhoLearnExp3.CHANGE_RANK_EACH_STEP = False`: should oneRhoLearnExp3 players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.
- class `PoliciesMultiPlayers.rhoLearnExp3.oneRhoLearnExp3(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)`
Bases: `PoliciesMultiPlayers.rhoRand.oneRhoRand`
Class that acts as a child policy, but in fact passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
- When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
- `__init__(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)`: Initialize self. See help(type(self)) for accurate signature.
- `maxRank = None`: max rank, usually nbPlayers but can be different.
- `rank = None`: current rank, starting at 1.
- `change_rank_each_step = None`: change rank at each step?
- `feedback_function = None`: feedback function: (sensing, collision) -> reward.
- `getReward(arm, reward)`: give a “good” reward to the rank selection algorithm (no collision), give the reward to the arm selection algorithm, and, if self.change_rank_each_step, select a (possibly new) rank. See the sketch after this list.
- `handleCollision(arm, reward)`: give a “bad” reward to the rank selection algorithm, and select a (possibly new) rank. See the sketch after this list.
- `__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'`
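To see how the two callbacks cooperate, here is a schematic, hypothetical sketch of their control flow. Attribute names such as armSelection are assumed for illustration; the real methods also inherit book-keeping from oneRhoRand:

```python
# Schematic sketch of the two callbacks (NOT the exact source code).
class OneRhoLearnSketch:
    def __init__(self, arm_policy, rank_policy, feedback_function,
                 change_rank_each_step=False):
        self.armSelection = arm_policy      # hypothetical name: index policy
        self.rankSelection = rank_policy    # learns over ranks 1..maxRank
        self.feedback_function = feedback_function
        self.change_rank_each_step = change_rank_each_step
        self.rank = 1

    def getReward(self, arm, reward):
        # "Good" feedback for the current rank: communication, no collision.
        self.rankSelection.getReward(self.rank - 1,
                                     self.feedback_function(1, 0))
        self.armSelection.getReward(arm, reward)  # update the index policy
        if self.change_rank_each_step:            # optional trick (P2)
            self.rank = 1 + self.rankSelection.choice()

    def handleCollision(self, arm, reward=None):
        # "Bad" feedback for the current rank, then resample a new rank.
        self.rankSelection.getReward(self.rank - 1,
                                     self.feedback_function(1, 1))
        self.rank = 1 + self.rankSelection.choice()
```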
- class `PoliciesMultiPlayers.rhoLearnExp3.rhoLearnExp3(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=Exp3Decreasing, maxRank=None, change_rank_each_step=False, feedback_function=binary_feedback, lower=0.0, amplitude=1.0, *args, **kwargs)`
Bases: `PoliciesMultiPlayers.rhoRand.rhoRand`
rhoLearnExp3: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of random exploration for choosing the rank.
- `__init__(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=Exp3Decreasing, maxRank=None, change_rank_each_step=False, feedback_function=binary_feedback, lower=0.0, amplitude=1.0, *args, **kwargs)`
- nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- rankSelectionAlgo: algorithm to use for selecting the ranks.
- maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
- *args, **kwargs: additional arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnExp3(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[0, 1, 9, 0, 10, 3]
>>> [ child.choice() for child in s.children ]
[11, 2, 0, 0, 4, 5]
- To get a list of usable players, use `s.children`.
- Warning: `s._players` is for internal use ONLY!
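For instance, to keep the default Exp3 rank selection but swap in the ternary feedback, a hypothetical usage sketch (assuming the imports below resolve as in the example above):

```python
# Hypothetical usage sketch: same setup as the example above, but swapping
# the feedback function (ternary_feedback is defined in this module).
from PoliciesMultiPlayers.rhoLearnExp3 import rhoLearnExp3, ternary_feedback
from Policies import UCB
from Policies.Exp3 import Exp3Decreasing

s = rhoLearnExp3(6, 17, UCB,
                 rankSelectionAlgo=Exp3Decreasing,    # the default anyway
                 feedback_function=ternary_feedback)  # instead of binary
players = s.children  # usable child players, one per user
```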
- `maxRank = None`: max rank, usually nbPlayers but can be different.
- `nbPlayers = None`: number of players.
- `children = None`: list of children, fake algorithms.
- `rankSelectionAlgo = None`: policy to use to choose the ranks.
- `nbArms = None`: number of arms.
- `change_rank_each_step = None`: change rank at each step?
- `__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'`