PoliciesMultiPlayers.EstimateM module

EstimateM: a generic wrapper around a multi-player decentralized learning policy, to learn the number of players on the run, adapted from the rhoEst policy of [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • The procedure to estimate \(\hat{M}_i(t)\) is not trivial: every player starts with \(\hat{M}_i(0) = 1\), and increases its estimate, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), once enough collisions have been observed since the last increase, as decided by a threshold function (see the sketch below).
  • My choice of threshold function, threshold_on_t(), does not need the horizon \(T\) either, and uses the current time \(t\) instead.
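
For concreteness, here is a minimal sketch of that estimation rule, using the attribute names documented below (nbPlayersEstimate, collisionCount) together with one of the threshold functions of this module; it only illustrates the mechanism and is not the library's exact code (which, for instance, counts collisions per arm):

def sketch_on_collision(t, nbPlayersEstimate, collisionCount, threshold):
    """Illustrative update performed when a collision is observed at time t."""
    collisionCount += 1                                   # one more collision since the last increase
    if collisionCount > threshold(t, nbPlayersEstimate):  # "enough" collisions have been seen
        nbPlayersEstimate += 1                            # hat{M}_i(t+1) = hat{M}_i(t) + 1
        collisionCount = 0                                # restart counting after each increase
    return nbPlayersEstimate, collisionCount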

Note

This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).

Warning

This is still very experimental!

Note

For a less generic approach, see the policies defined in rhoEst.rhoEst (generalizing rhoRand.rhoRand) and RandTopMEst.RandTopMEst (generalizing RandTopM.RandTopM).

PoliciesMultiPlayers.EstimateM.threshold_on_t_with_horizon(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(T, k)\) used as a threshold in rhoEstPlus.

Warning

It requires the horizon \(T\), and does not use the current time \(t\).

Example:

>>> threshold_on_t_with_horizon(1000, 3)  # doctest: +ELLIPSIS
14.287...
>>> threshold_on_t_with_horizon(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
16.357...
PoliciesMultiPlayers.EstimateM.threshold_on_t_doubling_trick(t, nbPlayersEstimate, horizon=None, base=2, min_fake_horizon=1000, T0=1)[source]

A trick to have a threshold depending on a growing horizon (doubling-trick).

  • Instead of using \(t\) or \(T\), a fake horizon \(T_t\) is used, corresponding to the horizon a doubling-trick algorithm would be using at time \(t\).
  • \(T_t = T_0 b^{\lceil \log_b(t) \rceil}\) is the default choice, with \(b = 2\) and \(T_0 = 1\) (the defaults of the signature above); see the sketch below.
  • If \(T_t\) is too small, min_fake_horizon is used instead.

Warning

This is ongoing research!
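
As an illustration, a small sketch of the fake horizon \(T_t\) described above, written for the defaults of the signature (base=2, T0=1, min_fake_horizon=1000); how \(T_t\) is then plugged into a threshold is left to the source code:

from math import ceil, log

def sketch_fake_horizon(t, base=2, T0=1, min_fake_horizon=1000):
    """T_t = T0 * base ** ceil(log_base(t)): the horizon a doubling-trick algorithm would use at time t >= 1."""
    fake_T = T0 * base ** ceil(log(t, base))
    return max(fake_T, min_fake_horizon)  # if T_t is too small, use min_fake_horizon instead

# For instance, sketch_fake_horizon(1000) is 1024, the next power of 2 above t = 1000.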

Example:

>>> threshold_on_t_doubling_trick(1000, 3)  # doctest: +ELLIPSIS
14.356...
>>> threshold_on_t_doubling_trick(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
14.356...
PoliciesMultiPlayers.EstimateM.threshold_on_t(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(t, k)\) used as a threshold in rhoEst.

  • 0 if nbPlayersEstimate is 0,
  • 1 if nbPlayersEstimate is 1,
  • My heuristic to be any-time (i.e., without needing to know the horizon) is to use a function of \(t\) (the current time) and not of \(T\) (the horizon).
  • The choice that seemed to perform best in practice was \(\xi(t, k) = c t\) for a small constant \(c\) (like 5 or 10); see the sketch below.

Example:

>>> threshold_on_t(1000, 3)  # doctest: +ELLIPSIS
47.730...
>>> threshold_on_t(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
47.730...
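
A hedged sketch of the piecewise behaviour listed above; the last branch is only an illustrative placeholder that happens to match the doctest values just above, and the exact expression used in the library may differ:

from math import log

def sketch_threshold_on_t(t, nbPlayersEstimate, horizon=None):
    """Any-time threshold: it depends on the current time t, never on the horizon."""
    if nbPlayersEstimate == 0:
        return 0
    elif nbPlayersEstimate == 1:
        return 1
    else:
        return log(1 + t) ** 2  # illustrative slowly growing choice, consistent with the doctests above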
class PoliciesMultiPlayers.EstimateM.oneEstimateM(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not trivial: every player starts with \(\hat{M}_i(0) = 1\), and increases its estimate, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), once enough collisions have been observed since the last increase, as decided by a threshold function.
__init__(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Estimated number of players. Optimistic: start by assuming the player is alone!

collisionCount = None

Count collisions on each arm, since last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

updateNbPlayers(nbPlayers=None)[source]

Change the value of nbPlayersEstimate, and propagate the change to the underlying policy, for parameters called maxRank or nbPlayers.
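
A hedged sketch of what this propagation could look like, based only on the description above (maxRank and nbPlayers are the parameter names mentioned there; the real method may do more):

def sketch_propagate_estimate(policy, nbPlayers):
    """Keep the underlying policy's maxRank / nbPlayers parameter in sync with the new estimate."""
    if hasattr(policy, "maxRank"):
        policy.maxRank = nbPlayers
    if hasattr(policy, "nbPlayers"):
        policy.nbPlayers = nbPlayers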

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

choice()[source]

Pass the call to self._policy.choice() with the player’s ID number.

choiceWithRank(rank=1)[source]

Pass the call to self._policy.choiceWithRank() with the player’s ID number.

choiceFromSubSet(availableArms='all')[source]

Pass the call to self._policy.choiceFromSubSet() with the player’s ID number.

choiceMultiple(nb=1)[source]

Pass the call to self._policy.choiceMultiple() with the player’s ID number.

choiceIMP(nb=1)[source]

Pass the call to self._policy.choiceIMP() with the player’s ID number.

estimatedOrder()[source]

Pass the call to self._policy.estimatedOrder() with the player’s ID number.

estimatedBestArms(M=1)[source]

Pass the call to self._policy.estimatedBestArms() with the player’s ID number.

__module__ = 'PoliciesMultiPlayers.EstimateM'
class PoliciesMultiPlayers.EstimateM.EstimateM(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

EstimateM: a generic wrapper for an efficient multi-player learning policy, with no prior knowledge of the number of players, and using any other MP policy.

__init__(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • nbArms: number of arms.
  • decentralizedPolicy: base MP decentralized policy.
  • threshold: the threshold function to use, see threshold_on_t_with_horizon(), threshold_on_t_doubling_trick() or threshold_on_t() above.
  • policyArgs: named arguments (a dictionary), given to decentralizedPolicy.
  • *args, **kwargs: positional and named arguments, given to decentralizedPolicy (they will most likely be forwarded to the underlying single-player policy).

Example:

>>> from Policies import *; from PoliciesMultiPlayers import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 4
>>> nbPlayers = 2
>>> s = EstimateM(nbPlayers, nbArms, rhoRand, UCBalpha, alpha=0.5)
>>> [ child.choice() for child in s.children ]
[0, 3]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!
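
For illustration, here is a hedged sketch of how the children could be driven in a hand-written simulation loop, continuing the example above; only choice(), getReward() and handleCollision() come from the API documented here, while the collision detection and the reward value are made up:

choices = [child.choice() for child in s.children]       # each child picks an arm
for arm in set(choices):
    pullers = [i for i, c in enumerate(choices) if c == arm]
    if len(pullers) == 1:                                 # transmission without collision
        s.children[pullers[0]].getReward(arm, 1.0)        # reward value is made up here
    else:                                                 # collision: warn every child involved
        for i in pullers:
            s.children[i].handleCollision(arm)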

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.EstimateM'