PoliciesMultiPlayers.EstimateM module

EstimateM: a generic wrapper around a multi-player decentralized learning policy, to learn the number of players on the run, adapted from the rhoEst policy of [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • The procedure to estimate \(\hat{M}_i(t)\) is not trivial: every player starts with \(\hat{M}_i(0) = 1\), and increases its estimate, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), once enough collisions have been observed since the last increase, as decided by a threshold function (see the sketch below).
  • My choice of threshold function, threshold_on_t(), does not need the horizon \(T\) either, and uses the current time \(t\) instead.
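
For concreteness, here is a minimal sketch of that estimation rule, using the attribute names documented below (nbPlayersEstimate, collisionCount) together with one of the threshold functions of this module; it only illustrates the mechanism and is not the library's exact code (which, for instance, counts collisions per arm):

def sketch_on_collision(t, nbPlayersEstimate, collisionCount, threshold):
    """Illustrative update performed when a collision is observed at time t."""
    collisionCount += 1                                   # one more collision since the last increase
    if collisionCount > threshold(t, nbPlayersEstimate):  # "enough" collisions have been seen
        nbPlayersEstimate += 1                            # hat{M}_i(t+1) = hat{M}_i(t) + 1
        collisionCount = 0                                # restart counting after each increase
    return nbPlayersEstimate, collisionCount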

Note

This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).

Warning

This is still very experimental!

Note

For a less generic approach, see the policies defined in rhoEst.rhoEst (generalizing rhoRand.rhoRand) and RandTopMEst.RandTopMEst (generalizing RandTopM.RandTopM).

PoliciesMultiPlayers.EstimateM.threshold_on_t_with_horizon(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(T, k)\) used as a threshold in rhoEstPlus.

Warning

It requires the horizon \(T\), and does not use the current time \(t\).

Example:

>>> threshold_on_t_with_horizon(1000, 3)  # doctest: +ELLIPSIS
14.287...
>>> threshold_on_t_with_horizon(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
16.357...
PoliciesMultiPlayers.EstimateM.threshold_on_t_doubling_trick(t, nbPlayersEstimate, horizon=None, base=2, min_fake_horizon=1000, T0=1)[source]

A trick to have a threshold depending on a growing horizon (doubling-trick).

  • Instead of using \(t\) or \(T\), a fake horizon \(T_t\) is used, corresponding to the horizon a doubling-trick algorithm would be using at time \(t\).
  • \(T_t = T_0 b^{\lceil \log_b(t) \rceil}\) is the default choice, with \(b = 2\) and \(T_0 = 1\) (the defaults of the signature above); see the sketch below.
  • If \(T_t\) is too small, min_fake_horizon is used instead.

Warning

This is ongoing research!
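
As an illustration, a small sketch of the fake horizon \(T_t\) described above, written for the defaults of the signature (base=2, T0=1, min_fake_horizon=1000); how \(T_t\) is then plugged into a threshold is left to the source code:

from math import ceil, log

def sketch_fake_horizon(t, base=2, T0=1, min_fake_horizon=1000):
    """T_t = T0 * base ** ceil(log_base(t)): the horizon a doubling-trick algorithm would use at time t >= 1."""
    fake_T = T0 * base ** ceil(log(t, base))
    return max(fake_T, min_fake_horizon)  # if T_t is too small, use min_fake_horizon instead

# For instance, sketch_fake_horizon(1000) is 1024, the next power of 2 above t = 1000.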

Example:

>>> threshold_on_t_doubling_trick(1000, 3)  # doctest: +ELLIPSIS
14.356...
>>> threshold_on_t_doubling_trick(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
14.356...
PoliciesMultiPlayers.EstimateM.threshold_on_t(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(t, k)\) used as a threshold in rhoEst.

  • 0 if nbPlayersEstimate is 0,
  • 1 if nbPlayersEstimate is 1,
  • My heuristic to be any-time (i.e., without needing to know the horizon) is to use a function of \(t\) (the current time) and not of \(T\) (the horizon).
  • The choice that seemed to perform best in practice was \(\xi(t, k) = c t\) for a small constant \(c\) (like 5 or 10); see the sketch below.

Example:

>>> threshold_on_t(1000, 3)  # doctest: +ELLIPSIS
47.730...
>>> threshold_on_t(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
47.730...
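
A hedged sketch of the piecewise behaviour listed above; the last branch is only an illustrative placeholder that happens to match the doctest values just above, and the exact expression used in the library may differ:

from math import log

def sketch_threshold_on_t(t, nbPlayersEstimate, horizon=None):
    """Any-time threshold: it depends on the current time t, never on the horizon."""
    if nbPlayersEstimate == 0:
        return 0
    elif nbPlayersEstimate == 1:
        return 1
    else:
        return log(1 + t) ** 2  # illustrative slowly growing choice, consistent with the doctests above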
class PoliciesMultiPlayers.EstimateM.oneEstimateM(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not trivial: every player starts with \(\hat{M}_i(0) = 1\), and increases its estimate, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), once enough collisions have been observed since the last increase, as decided by a threshold function.
__init__(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Estimated number of players. Optimistic: start by assuming the player is alone!

collisionCount = None

Count collisions on each arm, since last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

updateNbPlayers(nbPlayers=None)[source]

Change the value of nbPlayersEstimate, and propagate the change to the underlying policy, for parameters called maxRank or nbPlayers.
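
A hedged sketch of what this propagation could look like, based only on the description above (maxRank and nbPlayers are the parameter names mentioned there; the real method may do more):

def sketch_propagate_estimate(policy, nbPlayers):
    """Keep the underlying policy's maxRank / nbPlayers parameter in sync with the new estimate."""
    if hasattr(policy, "maxRank"):
        policy.maxRank = nbPlayers
    if hasattr(policy, "nbPlayers"):
        policy.nbPlayers = nbPlayers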

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

choice()[source]

Pass the call to self._policy.choice() with the player’s ID number.

choiceWithRank(rank=1)[source]

Pass the call to self._policy.choiceWithRank() with the player’s ID number.

choiceFromSubSet(availableArms='all')[source]

Pass the call to self._policy.choiceFromSubSet() with the player’s ID number.

choiceMultiple(nb=1)[source]

Pass the call to self._policy.choiceMultiple() with the player’s ID number.

choiceIMP(nb=1)[source]

Pass the call to self._policy.choiceIMP() with the player’s ID number.

estimatedOrder()[source]

Pass the call to self._policy.estimatedOrder() with the player’s ID number.

estimatedBestArms(M=1)[source]

Pass the call to self._policy.estimatedBestArms() with the player’s ID number.

__module__ = 'PoliciesMultiPlayers.EstimateM'
class PoliciesMultiPlayers.EstimateM.EstimateM(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

EstimateM: a generic wrapper for an efficient multi-player learning policy, with no prior knowledge of the number of players, and using any other MP policy.

__init__(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • nbArms: number of arms.
  • decentralizedPolicy: base MP decentralized policy.
  • threshold: the threshold function to use, see threshold_on_t_with_horizon(), threshold_on_t_doubling_trick() or threshold_on_t() above.
  • policyArgs: named arguments (a dictionary), given to decentralizedPolicy.
  • *args, **kwargs: positional and named arguments, given to decentralizedPolicy (they will most likely be forwarded to the underlying single-player policy).

Example:

>>> from Policies import *; from PoliciesMultiPlayers import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 4
>>> nbPlayers = 2
>>> s = EstimateM(nbPlayers, nbArms, rhoRand, UCBalpha, alpha=0.5)
>>> [ child.choice() for child in s.children ]
[0, 3]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!
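
For illustration, here is a hedged sketch of how the children could be driven in a hand-written simulation loop, continuing the example above; only choice(), getReward() and handleCollision() come from the API documented here, while the collision detection and the reward value are made up:

choices = [child.choice() for child in s.children]       # each child picks an arm
for arm in set(choices):
    pullers = [i for i, c in enumerate(choices) if c == arm]
    if len(pullers) == 1:                                 # transmission without collision
        s.children[pullers[0]].getReward(arm, 1.0)        # reward value is made up here
    else:                                                 # collision: warn every child involved
        for i in pullers:
            s.children[i].handleCollision(arm)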

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.EstimateM'