PoliciesMultiPlayers : contains various collision-avoidance protocols for the multi-player setting.
Selfish: a multi-player policy where every player is selfish: they do not try to handle collisions.
CentralizedNotFair: a multi-player policy which uses a centralized intelligence to assign each user to a FIXED arm.
CentralizedFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbArms.
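The fairness of this offset scheme is easy to check: as long as nbPlayers ≤ nbArms, the assigned arms are distinct at every time step, so no collision can ever occur. A minimal sketch (the variable names are assumptions, not the library's code):

```python
# Sketch of the round-robin assignment used by a CentralizedFair-style policy:
# player i gets offset i, and at time t plays arm (offset + t) % nbArms.
nbPlayers, nbArms, T = 3, 5, 10

for t in range(T):
    # one arm per player; they stay orthogonal because offsets are distinct
    arms = [(offset + t) % nbArms for offset in range(nbPlayers)]
    assert len(set(arms)) == nbPlayers  # never a collision
```

Each player cycles through all the arms, so over time every player gets the same share of the good arms, which is what makes the policy "fair".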
CentralizedIMP: a multi-player policy that uses centralized but non-omniscient learning to select K = nbPlayers arms at each time step.
OracleNotFair: a multi-player policy with full knowledge and centralized intelligence, which assigns each user to a FIXED arm among the best arms.
OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.
ALOHA: implementation of generic collision-avoidance algorithms, relying on a single-player bandit policy (e.g., Thompson sampling). And its variants:
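The ALOHA-style back-off idea can be sketched as follows. This is an illustrative assumption about the mechanism, not the library's actual ALOHA class: after a collision, a player abandons its current arm with some probability p and re-draws uniformly; otherwise it persists.

```python
import random

def aloha_step(current_arm, collided, nbArms, p=0.5, rng=random):
    """One ALOHA-style back-off step (hypothetical helper, for illustration).

    If a collision occurred, switch to a uniformly random arm with
    probability p; otherwise keep playing the current arm.
    """
    if collided and rng.random() < p:
        return rng.randrange(nbArms)  # back off to a random arm
    return current_arm                # persist on the current arm
```

In the real algorithm the arm choice comes from the underlying single-player bandit policy rather than a uniform draw; the sketch only shows the collision-triggered randomization.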
rhoCentralized is a semi-centralized version where orthogonal ranks 1..M are given to the players, instead of just giving them the value of M; a decentralized learning policy is still used to learn the best arms.
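The rank-based selection can be sketched like this (choice_with_rank is a hypothetical helper, not the library's API): a player holding rank r pulls the arm it currently estimates as r-th best, so players with distinct ranks and similar estimates target distinct arms.

```python
import numpy as np

def choice_with_rank(empirical_means, rank):
    """Return the arm estimated as `rank`-th best (rank is 1-based)."""
    order = np.argsort(empirical_means)[::-1]  # arm indices, best first
    return int(order[rank - 1])
```

With estimates [0.1, 0.9, 0.5], rank 1 picks arm 1, rank 2 picks arm 2, and rank 3 picks arm 0.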
RandTopM is another approach, similar to MusicalChair, which we hope performs better and which we were able to analyze more easily.
All policies have the same interface, as described in BaseMPPolicy for decentralized policies and in BaseCentralizedPolicy for centralized policies, in order to use them in any experiment with the following approach:
```python
my_policy_MP = Policy_MP(nbPlayers, nbArms)
children = my_policy_MP.children       # get a list of usable single-player policies
for one_policy in children:
    one_policy.startGame()             # start the game
for t in range(T):
    for i in range(nbPlayers):
        k_t[i] = children[i].choice()  # choose one arm, for each player
    for k in range(nbArms):
        players_who_played_k = [i for i in range(nbPlayers) if k_t[i] == k]
        reward = reward_t[k] = sampled from the arm k  # sample a reward
        if len(players_who_played_k) > 1:  # collision: the reward is lost
            reward = 0
        for i in players_who_played_k:
            children[i].getReward(k, reward)
```
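The loop above can be made fully runnable with a toy child policy standing in for a real single-player bandit. Everything below (RandomChild, the Bernoulli arm means) is a hypothetical stand-in chosen for the sketch, not the library's code:

```python
import random

random.seed(0)

class RandomChild:
    """Toy single-player policy with the interface used above."""
    def __init__(self, nbArms):
        self.nbArms = nbArms
    def startGame(self):
        pass
    def choice(self):
        return random.randrange(self.nbArms)
    def getReward(self, arm, reward):
        pass  # a real policy would update its estimates here

nbPlayers, nbArms, T = 2, 3, 100
means = [0.2, 0.5, 0.8]                # Bernoulli arm means (made up)
children = [RandomChild(nbArms) for _ in range(nbPlayers)]
for one_policy in children:
    one_policy.startGame()

k_t = [0] * nbPlayers
for t in range(T):
    for i in range(nbPlayers):
        k_t[i] = children[i].choice()
    for k in range(nbArms):
        players_who_played_k = [i for i in range(nbPlayers) if k_t[i] == k]
        reward = float(random.random() < means[k])  # sample a Bernoulli reward
        if len(players_who_played_k) > 1:           # collision: the reward is lost
            reward = 0
        for i in players_who_played_k:
            children[i].getReward(k, reward)
```

A real experiment would replace RandomChild with a learning policy (e.g., Thompson sampling) so that getReward actually updates the arm estimates.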
- PoliciesMultiPlayers.ALOHA module
- PoliciesMultiPlayers.BaseCentralizedPolicy module
- PoliciesMultiPlayers.BaseMPPolicy module
- PoliciesMultiPlayers.CentralizedCycling module
- PoliciesMultiPlayers.CentralizedFixed module
- PoliciesMultiPlayers.CentralizedIMP module
- PoliciesMultiPlayers.CentralizedMultiplePlay module
- PoliciesMultiPlayers.ChildPointer module
- PoliciesMultiPlayers.DepRound module
- PoliciesMultiPlayers.EstimateM module
- PoliciesMultiPlayers.OracleFair module
- PoliciesMultiPlayers.OracleNotFair module
- PoliciesMultiPlayers.RandTopM module
- PoliciesMultiPlayers.RandTopMEst module
- PoliciesMultiPlayers.Scenario1 module
- PoliciesMultiPlayers.Selfish module
- PoliciesMultiPlayers.rhoCentralized module
- PoliciesMultiPlayers.rhoEst module
- PoliciesMultiPlayers.rhoLearn module
- PoliciesMultiPlayers.rhoLearnEst module
- PoliciesMultiPlayers.rhoLearnExp3 module
- PoliciesMultiPlayers.rhoRand module
- PoliciesMultiPlayers.rhoRandALOHA module
- PoliciesMultiPlayers.rhoRandRand module
- PoliciesMultiPlayers.rhoRandRotating module
- PoliciesMultiPlayers.rhoRandSticky module
- PoliciesMultiPlayers.with_proba module