Welcome to SMPyBandits documentation!

Open-Source Python package for Single- and Multi-Players multi-armed Bandits algorithms.

A research framework for Single and Multi-Players Multi-Armed Bandits (MAB) algorithms: UCB, KL-UCB, Thompson sampling and many more for single-player settings, and MCTopM & RandTopM, MusicalChair, ALOHA, MEGA, rhoRand for multi-player simulations. It runs on Python 2 and 3, and is publicly released as open-source software under the MIT License.

This repository contains the code of my numerical environment, written in Python, used to perform numerical simulations of single-player and multi-player Multi-Armed Bandits (MAB) algorithms.


I (Lilian Besson) started my PhD in October 2016, and this is a part of my ongoing research since December 2016.



SMPyBandits

Open-Source Python package for Single- and Multi-Players multi-armed Bandits algorithms.

(Logo: logo_large.png)

This repository contains the code of Lilian Besson’s numerical environment, written in Python (2 or 3), for numerical simulations of :slot_machine: single-player and multi-player Multi-Armed Bandits (MAB) algorithms.

Quick presentation

It contains the most complete collection of single-player (classical) bandit algorithms on the Internet (over 65!), as well as implementations of all the state-of-the-art multi-player algorithms.

I follow the latest publications in Multi-Armed Bandits (MAB) research very actively, and I usually implement new algorithms quite quickly: for instance, Exp3++, CORRAL and SparseUCB were each introduced by articles presented at COLT in July 2017, LearnExp comes from a NIPS 2017 paper, and kl-UCB++ from an ALT 2017 paper. More recent examples are klUCBswitch, from a paper from May 2018, and MusicalChairNoSensing, from a paper from August 2018.


  • Classical MAB have many applications: clinical trials, A/B testing, game tree exploration, online content recommendation, etc. (my framework does not implement contextual bandits yet).
  • Multi-player MAB have applications in Cognitive Radio, and my framework implements all the collision models found in the literature, as well as all the algorithms from the last 10 years or so (rhoRand from 2009, MEGA from 2015, MusicalChair, and our state-of-the-art algorithms RandTopM and MCTopM, along with very recent algorithms SIC-MMAB from arXiv:1809.08151 and MusicalChairNoSensing from arXiv:1808.08416).
  • I’m working on adding clean support for non-stationary MAB problems, and I will soon implement all state-of-the-art algorithms for these problems.

With this numerical framework, simulations can run on a single CPU or a multi-core machine, and summary plots are automatically saved as high-quality PNG, PDF and EPS files (ready to be used in research articles). Making new simulations is very easy: one only needs to write a configuration script, and basically no code! See these examples (files named configuration_*.py).
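
For instance, here is a minimal sketch of what such a configuration script can look like (the keys and import paths follow the configuration_*.py examples, but should be checked against them; treat this as indicative, not as the definitive format):

from SMPyBandits.Arms import Bernoulli
from SMPyBandits.Policies import UCB, klUCB, Thompson

# A configuration script defines a single dictionary, named 'configuration'.
configuration = {
    "horizon": 10000,      # T: number of time steps
    "repetitions": 100,    # N: number of independent repetitions
    "n_jobs": 4,           # number of CPU cores to use (-1 means all)
    "verbosity": 6,
    # One problem: K = 3 Bernoulli arms with these means
    "environment": [{"arm_type": Bernoulli, "params": [0.1, 0.5, 0.9]}],
    # The policies to compare on this problem
    "policies": [
        {"archtype": UCB, "params": {}},
        {"archtype": klUCB, "params": {}},
        {"archtype": Thompson, "params": {}},
    ],
}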

A complete Sphinx documentation, for each algorithm and every piece of code (including the constants in the configurations!), is available here: SMPyBandits.GitHub.io. (I will use ReadTheDocs for this project, but I won’t use any continuous integration, don’t even think of it!)


I launched the documentation in March 2017, wrote my first research articles using this framework in 2017, and decided to (finally) open-source my project in February 2018.


How to cite this work?

If you use this package for your own work, please consider citing it with this piece of BibTeX:

@misc{SMPyBandits,
    title =   {{SMPyBandits: an Open-Source Research Framework for Single and Multi-Players Multi-Arms Bandits (MAB) Algorithms in Python}},
    author =  {Lilian Besson},
    year =    {2018},
    url =     {https://github.com/SMPyBandits/SMPyBandits/},
    howpublished = {Online at: \url{github.com/SMPyBandits/SMPyBandits}},
    note =    {Code at https://github.com/SMPyBandits/SMPyBandits/, documentation at https://smpybandits.github.io/}
}

I also wrote a small paper to present SMPyBandits, and I will send it to JMLR MLOSS. The paper can be consulted here on my website.

A DOI will arrive as soon as possible! I tried to publish a paper on both JOSS and MLOSS.

List of research publications using SMPyBandits

1st article, about the policy aggregation algorithm (aka model selection)

I designed and added the Aggregator policy, in order to test its validity and performance.

It is a “simple” voting algorithm that combines multiple bandit algorithms into one. Basically, it behaves like a simple MAB algorithm based on empirical means (even simpler than UCB), where the arms are the child algorithms A_1 .. A_N, each running in “parallel”.

For more details, refer to this file: Aggregation.md and this research article.
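
To make the idea concrete, here is a toy sketch of such a vote-by-empirical-means aggregation. It is purely illustrative and is not the actual Aggregator policy; it only assumes that each child exposes choice() and getReward(arm, reward), as in the policy API documented in API.md:

import numpy as np

class ToyAggregator:
    """Follow the child algorithm with the best empirical mean reward so far."""
    def __init__(self, children):
        self.children = children                    # child bandit algorithms A_1 .. A_N
        self.counts = np.zeros(len(children))       # how often each child was followed
        self.means = np.zeros(len(children))        # empirical mean reward of each child

    def choice(self):
        # Follow each child at least once, then the empirically best one
        untried = np.where(self.counts == 0)[0]
        self.last = untried[0] if untried.size else int(np.argmax(self.means))
        return self.children[self.last].choice()    # play the arm chosen by that child

    def getReward(self, arm, reward):
        # Update the followed child's empirical mean, and let every child learn
        c = self.last
        self.counts[c] += 1
        self.means[c] += (reward - self.means[c]) / self.counts[c]
        for child in self.children:
            child.getReward(arm, reward)
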
2nd article, about Multi-players Multi-Armed Bandits

There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting. The basic difference is about collisions: at each time t, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station’s point of view and from each player’s point of view.

For more details, refer to this file: MultiPlayers.md and this research article.
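
As an illustration, here is a minimal sketch of the collision model I usually consider (all colliding players receive a zero reward), assuming each player has already chosen a channel. It mirrors Environment.CollisionModels.onlyUniqUserGetsReward in spirit, but is not the package's code:

import collections
import numpy as np

def rewards_with_collisions(choices, arm_means, rng=np.random):
    """One time step: players whose chosen channel is unique get a Bernoulli
    reward from that channel, colliding players get 0."""
    counts = collections.Counter(choices)           # how many players on each channel
    rewards = np.zeros(len(choices))
    for player, channel in enumerate(choices):
        if counts[channel] == 1:                    # alone on this channel: sample a reward
            rewards[player] = rng.random() < arm_means[channel]
        # else: collision, reward stays 0
    return rewards

# Example: 3 players on K = 4 channels; players 0 and 1 collide on channel 2
print(rewards_with_collisions([2, 2, 0], [0.1, 0.5, 0.7, 0.9]))
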
3rd article, using Doubling Trick for Multi-Armed Bandits

I studied what the Doubling Trick can and cannot do to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandits algorithms.

For more details, refer to this file: DoublingTrick.md and this research article.
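
The idea can be sketched as follows: restart a horizon-dependent policy on a geometrically growing sequence of guessed horizons. This toy wrapper is an illustration, not the package's actual wrapper, and it assumes the underlying policy takes the horizon as a constructor parameter and exposes choice() and getReward():

class ToyDoublingTrick:
    """Restart a horizon-dependent policy with geometrically growing guessed horizons."""
    def __init__(self, policy_class, nbArms, first_horizon=100, factor=2):
        self.policy_class = policy_class
        self.nbArms = nbArms
        self.horizon = first_horizon      # current guess T_i = T_0 * factor^i
        self.factor = factor
        self.t = 0
        self._restart()

    def _restart(self):
        # Fresh instance of the policy, tuned for the current guessed horizon
        self.policy = self.policy_class(self.nbArms, horizon=self.horizon)

    def choice(self):
        if self.t >= self.horizon:        # guessed horizon exhausted:
            self.horizon *= self.factor   # grow the guess and restart the policy
            self._restart()
        return self.policy.choice()

    def getReward(self, arm, reward):
        self.t += 1
        self.policy.getReward(arm, reward)
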
4th article, about Piece-Wise Stationary Multi-Armed Bandits

With Emilie Kaufmann, we studied the Generalized Likelihood Ratio Test (GLRT) for sub-Bernoulli distributions, and proposed the B-GLRT algorithm for change-point detection in piece-wise stationary one-armed bandit problems. We combined the B-GLRT with the kl-UCB multi-armed bandit algorithm and proposed the GLR-klUCB algorithm for piece-wise stationary multi-armed bandit problems. We prove finite-time guarantees for the B-GLRT and the GLR-klUCB algorithm, and we illustrate their performance with extensive numerical experiments.

For more details, refer to this file: NonStationaryBandits.md and this research article.
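
For intuition, here is a rough sketch of a GLR-style change-point test on Bernoulli observations (illustrative only: the actual B-GLRT uses a carefully calibrated threshold, and this local klBern simply mirrors Arms.kullback.klBern):

import numpy as np

def klBern(x, y, eps=1e-15):
    """Bernoulli Kullback-Leibler divergence, with clipping to avoid log(0)."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def glr_change_detected(samples, threshold):
    """Generalized likelihood ratio test for one change point in Bernoulli data."""
    n = len(samples)
    mu_all = np.mean(samples)
    for s in range(1, n):                     # candidate change point after sample s
        mu1, mu2 = np.mean(samples[:s]), np.mean(samples[s:])
        glr = s * klBern(mu1, mu_all) + (n - s) * klBern(mu2, mu_all)
        if glr > threshold:
            return True
    return False

# Example: the mean jumps from 0.2 to 0.8 halfway through the observations
rng = np.random.RandomState(42)
data = np.concatenate([rng.rand(100) < 0.2, rng.rand(100) < 0.8]).astype(float)
print(glr_change_detected(data, threshold=np.log(200)))   # True: the change is detected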

Other interesting things

Single-player Policies
Arms and problems
  • My framework mainly targets stochastic bandits, with arms following Bernoulli, bounded (truncated) or unbounded Gaussian, Exponential, Gamma or Poisson distributions, and more.
  • The default configuration is to use a fixed problem for N repetitions (e.g. 1000 repetitions, use MAB.MAB), but there is also perfect support for “Bayesian” problems, where the mean vector µ1,…,µK changes at every repetition (see MAB.DynamicMAB); a small sketch is given after this list.
  • There is also a good support for Markovian problems, see MAB.MarkovianMAB, even though I didn’t implement any policies tailored for Markovian problems.
  • I’m actively working on adding very clean support for non-stationary MAB problems, and MAB.PieceWiseStationaryMAB is already working well. Use it with policies designed for piece-wise stationary problems, like Discounted-Thompson, one of the CD-UCB algorithms, M-UCB, SlidingWindowUCB or Discounted-UCB, or SW-UCB#.
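
For instance, a fixed Bernoulli problem can be built from a small dictionary. This is a minimal sketch; the exact import path and attribute names should be checked against the MAB.MAB documentation:

from SMPyBandits.Arms import Bernoulli
from SMPyBandits.Environment.MAB import MAB

# A fixed problem with K = 3 Bernoulli arms of means 0.1, 0.5, 0.9
problem = MAB({"arm_type": Bernoulli, "params": [0.1, 0.5, 0.9]})
print(problem.nbArms)   # 3
print(problem.means)    # array([0.1, 0.5, 0.9])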

Other remarks

  • Everything here is done in an imperative, object oriented style. The API of the Arms, Policy and MultiPlayersPolicy classes is documented in this file (API.md).
  • The code is clean, valid for both Python 2 and Python 3.
  • Some pieces of code come from the pymaBandits project, but most of them were refactored. Thanks to the initial project!
  • G.Varoquaux’s joblib is used for the Evaluator and EvaluatorMultiPlayers classes, so the simulations are easily parallelized on multi-core machines. (Put n_jobs = -1 or PARALLEL = True in the config file to use all your CPU cores, as it is by default).

How to run the experiments?

See this document: How_to_run_the_code.md for more details (or this documentation page).

TL;DR: this short Bash snippet shows how to clone the code, install the requirements for Python 3 (in a virtualenv), and start some simulations for N=100 repetitions of the default non-Bayesian Bernoulli-distributed problem, with K=9 arms, a horizon of T=10000, on 4 CPUs (each simulation should take about 20 minutes):

cd /tmp/  # or wherever you want
git clone -c core.symlinks=true https://GitHub.com/SMPyBandits/SMPyBandits.git
cd SMPyBandits
# just be sure you have the latest virtualenv from Python 3
sudo pip3 install --upgrade --force-reinstall virtualenv
# create and activate the virtualenv
virtualenv venv
. venv/bin/activate
type pip  # check it is /tmp/SMPyBandits/venv/bin/pip
type python  # check it is /tmp/SMPyBandits/venv/bin/python
# install the requirements in the virtualenv
pip install -r requirements_full.txt
# run a single-player simulation!
N=100 T=10000 K=9 N_JOBS=4 make single
# run a multi-player simulation!
N=100 T=10000 M=3 K=9 N_JOBS=4 make moremulti

You can also install it directly with pip and from GitHub:

cd /tmp/ ; mkdir SMPyBandits ; cd SMPyBandits/
virtualenv venv
. venv/bin/activate
type pip  # check it is /tmp/SMPyBandits/venv/bin/pip
type python  # check it is /tmp/SMPyBandits/venv/bin/python
pip install git+https://github.com/SMPyBandits/SMPyBandits.git#egg=SMPyBandits[full]
  • If speed matters to you and you want to use algorithms based on kl-UCB, you should take the time to build and install the fast C implementation of the utility KL functions. The default is to use kullback.py, but using the C version from Policies/C/ really speeds up the computations. Just follow the instructions; it should work well (you need gcc to be installed).
  • And if speed matters, be sure that you have a working version of Numba: it is used by many small functions to (try to) automatically speed up the computations, as sketched below.
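
Purely as an illustration (this is not a snippet from SMPyBandits itself), here is how Numba can JIT-compile a small numerical function such as a Bernoulli KL divergence:

from math import log

try:
    from numba import jit
except ImportError:
    def jit(**kwargs):          # harmless fallback if Numba is not installed
        return lambda f: f

@jit(nopython=True)
def klBern_fast(x, y):
    """Bernoulli KL divergence, compiled to machine code when Numba is available."""
    x = min(max(x, 1e-15), 1 - 1e-15)
    y = min(max(y, 1e-15), 1 - 1e-15)
    return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))

print(klBern_fast(0.1, 0.9))    # about 1.7578, as in the doctests of Arms.kullback below
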
Nix

A pinned Nix environment is available for this experimental setup in the nix/pkgs/ directory. From the root of the project:

$ nix-shell
nix-shell$ jupyter_notebook 
nix-shell$ N=100 T=10000 K=9 N_JOBS=4 make single

The following one-liner lets you explore one of the example notebooks from any Nix-enabled machine, without cloning the repository:

$ nix-shell https://github.com/SMPYBandits/SMPyBandits/archive/master.tar.gz --run 'jupyter-notebook $EXAMPLE_NOTEBOOKS/Example_of_a_small_Multi-Player_Simulation__with_Centralized_Algorithms.ipynb' 

:boom: Warning

Contributing?

I don’t expect issues or pull requests on this project, but you are welcome to open some.

Contributions (issues, questions, pull requests) are of course welcome, but this project is and will stay a personal environment designed for quick research experiments, and will never try to be an industry-ready module for applications of Multi-Armed Bandits algorithms. If you want to contribute, please have a look at the CONTRIBUTING.md file, and if you want to be more seriously involved, read the CODE_OF_CONDUCT.md file.

:boom: TODO

See this file TODO.md, and the issues on GitHub.

SMPyBandits modules

Arms package

Arms : contains different types of bandit arms: Constant, UniformArm, Bernoulli, Binomial, Poisson, Gaussian, Exponential, Gamma, DiscreteArm.

Each arm class follows the same interface:

> my_arm = Arm(params)
> my_arm.mean
0.5
> my_arm.draw()  # one random draw
0.0
> my_arm.draw_nparray(20)  # or ((3, 10)), many draws
array([ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,
        1.,  0.,  0.,  0.,  1.,  1.,  1.])

Also contains:

Arms.shuffled(mylist)[source]

Returns a shuffled version of the input 1D list. (Python has sorted() as a non-mutating counterpart of list.sort(), but no shuffled() counterpart of random.shuffle(), hence this helper…)

>>> from random import seed; seed(1234)  # reproducible results
>>> mylist = [ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9]
>>> shuffled(mylist)
[0.9, 0.4, 0.3, 0.6, 0.5, 0.7, 0.1, 0.2, 0.8]
>>> shuffled(mylist)
[0.4, 0.3, 0.7, 0.5, 0.8, 0.1, 0.9, 0.6, 0.2]
>>> shuffled(mylist)
[0.4, 0.6, 0.9, 0.5, 0.7, 0.2, 0.1, 0.3, 0.8]
>>> shuffled(mylist)
[0.8, 0.7, 0.3, 0.1, 0.9, 0.5, 0.6, 0.2, 0.4]
Arms.uniformMeans(nbArms=3, delta=0.05, lower=0.0, amplitude=1.0, isSorted=True)[source]

Return a list of means of arms, well spaced:

  • in [lower, lower + amplitude],
  • sorted in increasing order,
  • starting from lower + amplitude * delta, up to lower + amplitude * (1 - delta),
  • and there are nbArms arms.
>>> np.array(uniformMeans(2, 0.1))
array([0.1, 0.9])
>>> np.array(uniformMeans(3, 0.1))
array([0.1, 0.5, 0.9])
>>> np.array(uniformMeans(9, 1 / (1. + 9)))
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
Arms.uniformMeansWithSparsity(nbArms=10, sparsity=3, delta=0.05, lower=0.0, lowerNonZero=0.5, amplitude=1.0, isSorted=True)[source]

Return a list of means of arms, well spaced, in [lower, lower + amplitude].

  • Exactly nbArms-sparsity arms will have a mean = lower and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
  • All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234)  # reproducible results
>>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2))  # doctest: +ELLIPSIS
array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.55,  0.95])
>>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2, lowerNonZero=0.8, delta=0.03))  # doctest: +ELLIPSIS
array([ 0.   ,  0.   ,  0.   ,  0.   ,  0.806,  0.994])
>>> np.array(uniformMeansWithSparsity(nbArms=10, sparsity=2))  # doctest: +ELLIPSIS
array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.55,  0.95])
>>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2, delta=0.05))  # doctest: +ELLIPSIS
array([ 0.   ,  0.   ,  0.   ,  0.   ,  0.525,  0.975])
>>> np.array(uniformMeansWithSparsity(nbArms=10, sparsity=4, delta=0.05))  # doctest: +ELLIPSIS
array([ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.525,  0.675,
        0.825,  0.975])
Arms.randomMeans(nbArms=3, mingap=None, lower=0.0, amplitude=1.0, isSorted=True)[source]

Return a list of means of arms, randomly sampled uniformly in [lower, lower + amplitude], with a min gap >= mingap.

  • All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234)  # reproducible results
>>> randomMeans(nbArms=3, mingap=0.05)  # doctest: +ELLIPSIS
[0.191..., 0.437..., 0.622...]
>>> randomMeans(nbArms=3, mingap=0.01)  # doctest: +ELLIPSIS
[0.276..., 0.801..., 0.958...]
  • Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234)  # reproducible results
>>> randomMeans(nbArms=5, mingap=0.01, isSorted=True)  # doctest: +ELLIPSIS
[0.006..., 0.229..., 0.416..., 0.535..., 0.899...]
>>> randomMeans(nbArms=5, mingap=0.01, isSorted=False)  # doctest: +ELLIPSIS
[0.419..., 0.932..., 0.072..., 0.755..., 0.650...]
Arms.randomMeansWithGapBetweenMbestMworst(nbArms=3, mingap=None, nbPlayers=2, lower=0.0, amplitude=1.0, isSorted=True)[source]

Return a list of means of arms, randomly sampled uniformly in [lower, lower + amplitude], with a min gap >= mingap between the set Mbest and Mworst.

Arms.randomMeansWithSparsity(nbArms=10, sparsity=3, mingap=0.01, delta=0.05, lower=0.0, lowerNonZero=0.5, amplitude=1.0, isSorted=True)[source]

Return a list of means of arms, in [lower, lower + amplitude], with a min gap >= mingap.

  • Exactly nbArms-sparsity arms will have a mean = lower and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
  • All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234)  # reproducible results
>>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.05)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.595..., 0.811...]
>>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.718..., 0.892...]
  • Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234)  # reproducible results
>>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01, isSorted=True)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.636..., 0.889...]
>>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01, isSorted=False)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.900..., 0.638..., 0.0, 0.0]
Arms.randomMeansWithSparsity2(nbArms=10, sparsity=3, mingap=0.01, lower=-1.0, lowerNonZero=0.0, amplitude=2.0, isSorted=True)[source]

Return a list of means of arms, in [lower, lower + amplitude], with a min gap >= mingap.

  • Exactly nbArms-sparsity arms will have a mean sampled uniformly in [lower, lowerNonZero] and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
  • All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234)  # reproducible results
>>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.05)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.595..., 0.811...]
>>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.718..., 0.892...]
  • Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234)  # reproducible results
>>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01, isSorted=True)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.0, 0.0, 0.636..., 0.889...]
>>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01, isSorted=False)  # doctest: +ELLIPSIS
[0.0, 0.0, 0.900..., 0.638..., 0.0, 0.0]
Arms.array_from_str(my_str)[source]

Convert a string like “[0.1, 0.2, 0.3]” to a numpy array [0.1, 0.2, 0.3], using safe json.loads instead of exec.

>>> array_from_str("[0.1, 0.2, 0.3]")
array([0.1,  0.2,  0.3])
>>> array_from_str("0.1, 0.2, 0.3")
array([0.1,  0.2,  0.3])
>>> array_from_str("0.9")
array([0.9])
Arms.list_from_str(my_str)[source]

Convert a string like “[0.1, 0.2, 0.3]” to a list [0.1, 0.2, 0.3], using safe json.loads instead of exec.

>>> list_from_str("[0.1, 0.2, 0.3]")
[0.1, 0.2, 0.3]
>>> list_from_str("0.1, 0.2, 0.3")
[0.1, 0.2, 0.3]
>>> list_from_str("0.9")
[0.9]
Arms.tuple_from_str(my_str)[source]

Convert a string like “[0.1, 0.2, 0.3]” to a tuple (0.1, 0.2, 0.3), using safe json.loads instead of exec.

>>> tuple_from_str("[0.1, 0.2, 0.3]")
(0.1, 0.2, 0.3)
>>> tuple_from_str("0.1, 0.2, 0.3")
(0.1, 0.2, 0.3)
>>> tuple_from_str("0.9")
(0.9,)
Arms.optimal_selection_probabilities(M, mu)[source]

Compute the optimal selection probabilities of K arms of means \(\mu_i\) by \(1 \leq M \leq K\) players, if they all observe each other pulls and rewards, as derived in (15) p3 of [[The Effect of Communication on Noncooperative Multiplayer Multi-Armed Bandit Problems, by Noyan Evirgen, Alper Kose, IEEE ICMLA 2017]](https://arxiv.org/abs/1711.01628v1).

Warning

They consider a different collision model than the one I usually use: when two (or more) players ask for the same resource at the same time t, I usually consider that all the colliding players receive a zero reward (see Environment.CollisionModels.onlyUniqUserGetsReward()), whereas they consider that exactly one of the colliding players gets the reward and all the others get a zero reward (see Environment.CollisionModels.rewardIsSharedUniformly()).

Example:

>>> optimal_selection_probabilities(3, [0.1,0.1,0.1])
array([0.33333333,  0.33333333,  0.33333333])
>>> optimal_selection_probabilities(3, [0.1,0.2,0.3])  # weird ? not really...
array([0.        ,  0.43055556,  0.56944444])
>>> optimal_selection_probabilities(3, [0.1,0.3,0.9])  # weird ? not really...
array([0.        ,  0.45061728,  0.54938272])
>>> optimal_selection_probabilities(3, [0.7,0.8,0.9])
array([0.15631866,  0.35405647,  0.48962487])

Note

These results may sound counter-intuitive, but again they use a different collision model: in my usual collision model, it makes no sense to completely drop an arm when K=M=3, no matter the means \(\mu_i\), but in their collision model, a player wins more (on average) if she has a \(50\%\) chance of being alone on an arm with mean \(0.3\) (expected reward \(0.5 \times 0.3 = 0.15\)) than if she is sure to be alone on an arm with mean \(0.1\) (see examples 3 and 4).

Arms.geometricChangePoints(horizon=10000, proba=0.001)[source]

Change points following a geometric distribution: at each time, the probability of having a change point at the next step is proba.

>>> np.random.seed(0)
>>> geometricChangePoints(100, 0.1)
array([ 8, 20, 29, 37, 43, 53, 59, 81])
>>> geometricChangePoints(100, 0.2)
array([ 6,  8, 14, 29, 31, 35, 40, 44, 46, 60, 63, 72, 78, 80, 88, 91])
Arms.continuouslyVaryingMeans(means, sign=1, maxSlowChange=0.1, horizon=None, lower=0.0, amplitude=1.0, isSorted=True)[source]

New means, slightly modified from the previous ones.

  • The change and the sign of change are constants.
Arms.randomContinuouslyVaryingMeans(means, maxSlowChange=0.1, horizon=None, lower=0.0, amplitude=1.0, isSorted=True)[source]

New means, slightly modified from the previous ones.

  • The maximal amplitude c of the change is constant, but the change itself is randomly sampled in \(\mathcal{U}([-c,c])\).
Submodules
Arms.Arm module

Base class for an arm class.

class Arms.Arm.Arm(lower=0.0, amplitude=1.0)[source]

Bases: object

Base class for an arm class.

__init__(lower=0.0, amplitude=1.0)[source]

Base class for an arm class.

lower = None

Lower value of rewards

amplitude = None

Amplitude of value of rewards

min = None

Lower value of rewards

max = None

Higher value of rewards

lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

draw(t=None)[source]

Draw one random sample.

oracle_draw(t=None)[source]
set_mean_param(mean)[source]
draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

static kl(x, y)[source]

The kl(x, y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - mu) / KL(mu, mumax).

static oneHOI(mumax, mu)[source]

One term for the HOI factor for this arm.

__dict__ = mappingproxy({'__module__': 'Arms.Arm', '__doc__': ' Base class for an arm class.', '__init__': <function Arm.__init__>, 'lower_amplitude': <property object>, '__str__': <function Arm.__str__>, '__repr__': <function Arm.__repr__>, 'draw': <function Arm.draw>, 'oracle_draw': <function Arm.oracle_draw>, 'set_mean_param': <function Arm.set_mean_param>, 'draw_nparray': <function Arm.draw_nparray>, 'kl': <staticmethod object>, 'oneLR': <staticmethod object>, 'oneHOI': <staticmethod object>, '__dict__': <attribute '__dict__' of 'Arm' objects>, '__weakref__': <attribute '__weakref__' of 'Arm' objects>})
__module__ = 'Arms.Arm'
__weakref__

list of weak references to the object (if defined)

Arms.Bernoulli module

Bernoulli distributed arm.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> B03 = Bernoulli(0.3)
>>> B03
B(0.3)
>>> B03.mean
0.3

Examples of sampling from an arm:

>>> B03.draw()
0
>>> B03.draw_nparray(20)
array([1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1.,
       1., 1., 1.])
class Arms.Bernoulli.Bernoulli(probability)[source]

Bases: Arms.Arm.Arm

Bernoulli distributed arm.

__init__(probability)[source]

New arm.

probability = None

Parameter p for this Bernoulli arm

mean = None

Mean for this Bernoulli arm

draw(t=None)[source]

Draw one random sample.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

set_mean_param(probability)[source]
lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Bernoulli arms: (mumax - mu) / KL(mu, mumax).

__module__ = 'Arms.Bernoulli'
Arms.Binomial module

Binomial distributed arm.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> B03_10 = Binomial(0.3, 10)
>>> B03_10
Bin(0.3, 10)
>>> B03_10.mean
3.0

Examples of sampling from an arm:

>>> B03_10.draw()
3
>>> B03_10.draw_nparray(20)
array([4., 3., 3., 3., 3., 3., 5., 6., 3., 4., 3., 3., 5., 1., 1., 0., 4.,
       4., 5., 6.])
class Arms.Binomial.Binomial(probability, draws=1)[source]

Bases: Arms.Arm.Arm

Binomial distributed arm.

__init__(probability, draws=1)[source]

New arm.

probability = None

Parameter p for this Binomial arm

draws = None

Parameter n for this Binomial arm

mean = None

Mean for this Binomial arm

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

set_mean_param(probability, draws=None)[source]
lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

kl(x, y)[source]

The kl(x, y) to use for this arm.

oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Binomial arms: (mumax - mu) / KL(mu, mumax).

__module__ = 'Arms.Binomial'
Arms.Constant module

Arm with a constant reward. Useful for debugging.

Example of creating an arm:

>>> C013 = Constant(0.13)
>>> C013
Constant(0.13)
>>> C013.mean
0.13

Examples of sampling from an arm:

>>> C013.draw()
0.13
>>> C013.draw_nparray(3)
array([0.13, 0.13, 0.13])
class Arms.Constant.Constant(constant_reward=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Arms.Arm.Arm

Arm with a constant reward. Useful for debugging.

  • constant_reward is the constant reward,
  • lower, amplitude default to floor(constant_reward), 1 (so the constant reward always lies in [lower, lower + amplitude]).
>>> arm_0_5 = Constant(0.5)
>>> arm_0_5.draw()
0.5
>>> arm_0_5.draw_nparray((3, 2))
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.5, 0.5]])
__init__(constant_reward=0.5, lower=0.0, amplitude=1.0)[source]

New arm.

constant_reward = None

Constant value of rewards

lower = None

Known lower value of rewards

amplitude = None

Known amplitude of rewards

mean = None

Mean for this Constant arm

draw(t=None)[source]

Draw one constant sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of constant samples, of a certain shape.

set_mean_param(mean)[source]
__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) = abs(x - y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Constant arms: (mumax - mu) / KL(mu, mumax).

__module__ = 'Arms.Constant'
Arms.DiscreteArm module

Discretely distributed arm, of finite support.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> D3values = DiscreteArm({-1: 0.25, 0: 0.5, 1: 0.25})
>>> D3values
D({-1: 0.25, 0: 0.5, 1: 0.25})
>>> D3values.mean
0.0
  • Examples of sampling from an arm:
>>> D3values.draw()
0
>>> D3values.draw_nparray(20)
array([ 0,  0,  0,  0,  0,  0,  1,  1,  0,  1,  0,  0,  1, -1, -1, -1,  1,
        1,  1,  1])
  • Another example, with heavy tail:
>>> D5values = DiscreteArm({-1000: 0.001, 0: 0.5, 1: 0.25, 2:0.25, 1000: 0.001})
>>> D5values
D({-1e+03: 0.001, 0: 0.5, 1: 0.25, 2: 0.25, 1e+03: 0.001})
>>> D5values.mean
0.75

Examples of sampling from an arm:

>>> D5values.draw()
2
>>> D5values.draw_nparray(20)
array([0, 2, 0, 1, 0, 2, 1, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2, 1, 0, 0])
class Arms.DiscreteArm.DiscreteArm(values_to_proba)[source]

Bases: Arms.Arm.Arm

DiscreteArm distributed arm.

__init__(values_to_proba)[source]

New arm.

mean = None

Mean for this DiscreteArm arm

size = None

Number of different values in this DiscreteArm arm

draw(t=None)[source]

Draw one random sample.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) to use for this arm.

Warning

FIXME this is not correctly defined, except for the special case of having only 2 values, a DiscreteArm is NOT a one-dimensional distribution, and so the kl between two distributions is NOT a function of their mean!

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for DiscreteArm arms: (mumax - mu) / KL(mu, mumax).

Warning

FIXME this is not correctly defined, except for the special case of having only 2 values, a DiscreteArm is NOT a one-dimensional distribution, and so the kl between two distributions is NOT a function of their mean!

__module__ = 'Arms.DiscreteArm'
Arms.Exponential module

Exponentially distributed arm.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Exp03 = ExponentialFromMean(0.3)
>>> Exp03
\mathrm{Exp}(3.2, 1)
>>> Exp03.mean  # doctest: +ELLIPSIS
0.3000...

Examples of sampling from an arm:

>>> Exp03.draw()  # doctest: +ELLIPSIS
0.052...
>>> Exp03.draw_nparray(20)  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.18..., 0.10..., 0.15..., 0.18..., 0.26...,
       0.13..., 0.25..., 0.03..., 0.01..., 0.29... ,
       0.07..., 0.19..., 0.17..., 0.02... , 0.82... ,
       0.76..., 1.     , 0.05..., 0.07..., 0.04...])
class Arms.Exponential.Exponential(p, trunc=1)[source]

Bases: Arms.Arm.Arm

Exponentially distributed arm, possibly truncated.

  • Default is to truncate to 1 (so Exponential.draw() is in [0, 1]).
__init__(p, trunc=1)[source]

New arm.

p = None

Parameter p for Exponential arm

trunc = None

Max value of reward

mean = None

Mean of Exponential arm

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw one random sample. The parameter t is ignored in this Arm.

set_mean_param(p_inv)[source]
lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Exponential arms: (mumax - mu) / KL(mu, mumax).

oneHOI(mumax, mu)[source]

One term for the HOI factor for this arm.

__module__ = 'Arms.Exponential'
class Arms.Exponential.ExponentialFromMean(mean, trunc=1)[source]

Bases: Arms.Exponential.Exponential

Exponentially distributed arm, possibly truncated, defined by its mean and not its parameter.

  • Default is to truncate to 1 (so Exponential.draw() is in [0, 1]).
__init__(mean, trunc=1)[source]

New arm.

__module__ = 'Arms.Exponential'
class Arms.Exponential.UnboundedExponential(mu)[source]

Bases: Arms.Exponential.Exponential

Exponential distributed arm, not truncated, ie. trunc = oo.

__init__(mu)[source]

New arm.

__module__ = 'Arms.Exponential'
Arms.Gamma module

Gamma distributed arm.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Gamma03 = GammaFromMean(0.3)
>>> Gamma03
\Gamma(0.3, 1)
>>> Gamma03.mean
0.3

Examples of sampling from an arm:

>>> Gamma03.draw()  # doctest: +ELLIPSIS
0.079...
>>> Gamma03.draw_nparray(20)  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([1.35...e-01, 1.84...e-01, 5.71...e-02, 6.36...e-02,
       4.94...e-01, 1.51...e-01, 1.48...e-04, 2.25...e-06,
       4.56...e-01, 1.00...e+00, 7.59...e-02, 8.12...e-04,
       1.54...e-03, 1.14...e-01, 1.18...e-02, 7.30...e-02,
       1.76...e-06, 1.94...e-01, 1.00...e+00, 3.30...e-02])
class Arms.Gamma.Gamma(shape, scale=1.0, mini=0, maxi=1)[source]

Bases: Arms.Arm.Arm

Gamma distributed arm, possibly truncated.

__init__(shape, scale=1.0, mini=0, maxi=1)[source]

New arm.

shape = None

Shape parameter for this Gamma arm

scale = None

Scale parameter for this Gamma arm

mean = None

Mean for this Gamma arm

min = None

Lower value of rewards

max = None

Larger value of rewards

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

kl(x, y)[source]

The kl(x, y) to use for this arm.

oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - shape) / KL(shape, mumax).

oneHOI(mumax, mu)[source]

One term for the HOI factor for this arm.

__module__ = 'Arms.Gamma'
class Arms.Gamma.GammaFromMean(mean, scale=1.0, mini=0, maxi=1)[source]

Bases: Arms.Gamma.Gamma

Gamma distributed arm, possibly truncated, defined by its mean and not its scale parameter.

__init__(mean, scale=1.0, mini=0, maxi=1)[source]

As mean = scale * shape, shape = mean / scale is used.

__module__ = 'Arms.Gamma'
class Arms.Gamma.UnboundedGamma(shape, scale=1.0)[source]

Bases: Arms.Gamma.Gamma

Gamma distributed arm, not truncated, ie. supported in (-oo, oo).

__init__(shape, scale=1.0)[source]

New arm.

__module__ = 'Arms.Gamma'
Arms.Gaussian module

Gaussian distributed arm.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Gauss03 = Gaussian(0.3, 0.05)  # small variance
>>> Gauss03
N(0.3, 0.05)
>>> Gauss03.mean
0.3

Examples of sampling from an arm:

>>> Gauss03.draw()  # doctest: +ELLIPSIS
0.3470...
>>> Gauss03.draw_nparray(20)  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.388..., 0.320..., 0.348... , 0.412..., 0.393... ,
       0.251..., 0.347..., 0.292..., 0.294..., 0.320...,
       0.307..., 0.372..., 0.338..., 0.306..., 0.322...,
       0.316..., 0.374..., 0.289..., 0.315..., 0.257...])
class Arms.Gaussian.Gaussian(mu, sigma=0.05, mini=0, maxi=1)[source]

Bases: Arms.Arm.Arm

Gaussian distributed arm, possibly truncated.

  • Default is to truncate into [0, 1] (so Gaussian.draw() is in [0, 1]).
__init__(mu, sigma=0.05, mini=0, maxi=1)[source]

New arm.

mu = None

Mean of Gaussian arm

mean = None

Mean of Gaussian arm

sigma = None

Variance of Gaussian arm

min = None

Lower value of rewards

max = None

Higher value of rewards

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

set_mean_param(mean)[source]
lower_amplitude

(lower, amplitude)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

kl(x, y)[source]

The kl(x, y) to use for this arm.

oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - mu) / KL(mu, mumax).

oneHOI(mumax, mu)[source]

One term for the HOI factor for this arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_0_1(mu, sigma=0.05, mini=0, maxi=1)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [0, 1].

__init__(mu, sigma=0.05, mini=0, maxi=1)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_0_2(mu, sigma=0.1, mini=0, maxi=2)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [0, 2].

__init__(mu, sigma=0.1, mini=0, maxi=2)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_0_5(mu, sigma=0.5, mini=0, maxi=5)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [0, 5].

__init__(mu, sigma=0.5, mini=0, maxi=5)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_0_10(mu, sigma=1, mini=0, maxi=10)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [0, 10].

__init__(mu, sigma=1, mini=0, maxi=10)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_0_100(mu, sigma=5, mini=0, maxi=100)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [0, 100].

__init__(mu, sigma=5, mini=0, maxi=100)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_m1_1(mu, sigma=0.1, mini=-1, maxi=1)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [-1, 1].

__init__(mu, sigma=0.1, mini=-1, maxi=1)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_m2_2(mu, sigma=0.25, mini=-2, maxi=2)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [-2, 2].

__init__(mu, sigma=0.25, mini=-2, maxi=2)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_m5_5(mu, sigma=1, mini=-5, maxi=5)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [-5, 5].

__init__(mu, sigma=1, mini=-5, maxi=5)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_m10_10(mu, sigma=2, mini=-10, maxi=10)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [-10, 10].

__init__(mu, sigma=2, mini=-10, maxi=10)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.Gaussian_m100_100(mu, sigma=10, mini=-100, maxi=100)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, truncated to [-100, 100].

__init__(mu, sigma=10, mini=-100, maxi=100)[source]

New arm.

__module__ = 'Arms.Gaussian'
class Arms.Gaussian.UnboundedGaussian(mu, sigma=1)[source]

Bases: Arms.Gaussian.Gaussian

Gaussian distributed arm, not truncated, ie. supported in (-oo, oo).

__init__(mu, sigma=1)[source]

New arm.

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

__repr__()[source]

Return repr(self).

__module__ = 'Arms.Gaussian'
Arms.Poisson module

Poisson distributed arm, possibly truncated.

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Poisson5 = Poisson(5, trunc=10)
>>> Poisson5
P(5, 10)
>>> Poisson5.mean  # doctest: +ELLIPSIS
4.9778...

Examples of sampling from an arm:

>>> Poisson5.draw()  # doctest: +ELLIPSIS
9
>>> Poisson5.draw_nparray(20)  # doctest: +ELLIPSIS
array([ 5,  6,  5,  5,  8,  4,  5,  4,  3,  3,  7,  3,  3,  4,  5,  2,  1,
        7,  7, 10])
class Arms.Poisson.Poisson(p, trunc=1)[source]

Bases: Arms.Arm.Arm

Poisson distributed arm, possibly truncated.

  • Default is to not truncate.
  • Warning: the draw() method is QUITE inefficient! (15 seconds for 200000 draws, 62 µs for 1).
__init__(p, trunc=1)[source]

New arm.

p = None

Parameter p for Poisson arm

trunc = None

Max value of rewards

mean = None

Mean for this Poisson arm

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

set_mean_param(p)[source]
__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for Poisson arms: (mumax - mu) / KL(mu, mumax).

__module__ = 'Arms.Poisson'
class Arms.Poisson.UnboundedPoisson(p)[source]

Bases: Arms.Poisson.Poisson

Poisson distributed arm, not truncated, ie. trunc = oo.

__init__(p)[source]

New arm.

__module__ = 'Arms.Poisson'
Arms.RestedRottingArm module

author: Julien Seznec

Rested rotting arm, i.e. an arm whose mean value decays at each pull.

class Arms.RestedRottingArm.RestedRottingArm(decayingFunction, staticArm)[source]

Bases: Arms.Arm.Arm

__init__(decayingFunction, staticArm)[source]

Base class for an arm class.

draw(t=None)[source]

Draw one random sample.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingBernoulli(decayingFunction)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingBinomial(decayingFunction, draws=1)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction, draws=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingConstant(decayingFunction)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingExponential(decayingFunction)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingGaussian(decayingFunction, sigma=1)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction, sigma=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
class Arms.RestedRottingArm.RestedRottingPoisson(decayingFunction, sigma=1)[source]

Bases: Arms.RestedRottingArm.RestedRottingArm

__init__(decayingFunction, sigma=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestedRottingArm'
Arms.RestlessArm module

author: Julien Seznec

Restless arm, i.e. an arm whose mean value changes at each round.

class Arms.RestlessArm.RestlessArm(rewardFunction, staticArm)[source]

Bases: Arms.Arm.Arm

__init__(rewardFunction, staticArm)[source]

Base class for an arm class.

draw(t)[source]

Draw one random sample.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessBernoulli(rewardFunction)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessBinomial(rewardFunction, draws=1)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction, draws=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessConstant(rewardFunction)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessExponential(rewardFunction)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessGaussian(rewardFunction, sigma=1)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction, sigma=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
class Arms.RestlessArm.RestlessPoisson(rewardFunction, sigma=1)[source]

Bases: Arms.RestlessArm.RestlessArm

__init__(rewardFunction, sigma=1)[source]

Base class for an arm class.

__module__ = 'Arms.RestlessArm'
Arms.UniformArm module

Uniformly distributed arm in [0, 1], or [lower, lower + amplitude].

Example of creating an arm:

>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Unif01 = UniformArm(0, 1)
>>> Unif01
U(0, 1)
>>> Unif01.mean
0.5

Examples of sampling from an arm:

>>> Unif01.draw()  # doctest: +ELLIPSIS
0.8444...
>>> Unif01.draw_nparray(20)  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.54... , 0.71..., 0.60..., 0.54..., 0.42... ,
       0.64..., 0.43..., 0.89...  , 0.96..., 0.38...,
       0.79..., 0.52..., 0.56..., 0.92..., 0.07...,
       0.08... , 0.02... , 0.83..., 0.77..., 0.87...])
class Arms.UniformArm.UniformArm(mini=0.0, maxi=1.0, mean=None, lower=0.0, amplitude=1.0)[source]

Bases: Arms.Arm.Arm

Uniformly distributed arm, default in [0, 1],

  • default to (mini, maxi),
  • or [lower, lower + amplitude], if (lower=lower, amplitude=amplitude) is given.
>>> arm_0_1 = UniformArm()
>>> arm_0_10 = UniformArm(0, 10)  # maxi = 10
>>> arm_2_4 = UniformArm(2, 4)
>>> arm_m10_10 = UniformArm(-10, 10)  # also UniformArm(lower=-10, amplitude=20)
__init__(mini=0.0, maxi=1.0, mean=None, lower=0.0, amplitude=1.0)[source]

New arm.

lower = None

Lower value of rewards

amplitude = None

Amplitude of rewards

mean = None

Mean for this UniformArm arm

draw(t=None)[source]

Draw one random sample. The parameter t is ignored in this Arm.

draw_nparray(shape=(1, ))[source]

Draw a numpy array of random samples, of a certain shape.

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

static kl(x, y)[source]

The kl(x, y) to use for this arm.

static oneLR(mumax, mu)[source]

One term of the Lai & Robbins lower bound for UniformArm arms: (mumax - mu) / KL(mu, mumax).

__module__ = 'Arms.UniformArm'
Arms.kullback module

Kullback-Leibler divergence functions and klUCB utilities.

Warning

These functions are not vectorized: they assume a single value for each argument. If you want a vectorized function, use the wrapper numpy.vectorize:

>>> import numpy as np
>>> klBern_vect = np.vectorize(klBern)
>>> klBern_vect([0.1, 0.5, 0.9], 0.2)  # doctest: +ELLIPSIS
array([0.036..., 0.223..., 1.145...])
>>> klBern_vect(0.4, [0.2, 0.3, 0.4])  # doctest: +ELLIPSIS
array([0.104..., 0.022..., 0...])
>>> klBern_vect([0.1, 0.5, 0.9], [0.2, 0.3, 0.4])  # doctest: +ELLIPSIS
array([0.036..., 0.087..., 0.550...])

For some functions, you would be better off writing a vectorized version manually, for instance if you want to fix a value of some optional parameters:

>>> # WARNING using np.vectorize gave weird result on klGauss
>>> # klGauss_vect = np.vectorize(klGauss, excluded="y")
>>> def klGauss_vect(xs, y, sig2x=0.25):  # vectorized for first input only
...    return np.array([klGauss(x, y, sig2x) for x in xs])
>>> klGauss_vect([-1, 0, 1], 0.1)  # doctest: +ELLIPSIS
array([2.42, 0.02, 1.62])
Arms.kullback.eps = 1e-15

Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]

Arms.kullback.klBern(x, y)[source]

Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
>>> klBern(0.5, 0.5)
0.0
>>> klBern(0.1, 0.9)  # doctest: +ELLIPSIS
1.757779...
>>> klBern(0.9, 0.1)  # And this KL is symmetric  # doctest: +ELLIPSIS
1.757779...
>>> klBern(0.4, 0.5)  # doctest: +ELLIPSIS
0.020135...
>>> klBern(0.01, 0.99)  # doctest: +ELLIPSIS
4.503217...
  • Special values:
>>> klBern(0, 1)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
34.539575...
Arms.kullback.klBin(x, y, n)[source]

Kullback-Leibler divergence for Binomial distributions. https://math.stackexchange.com/questions/320399/kullback-leibner-divergence-of-binomial-distributions

  • It is simply n times klBern() applied to x and y.
\[\mathrm{KL}(\mathrm{Bin}(x, n), \mathrm{Bin}(y, n)) = n \times \left(x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}) \right).\]

Warning

The two distributions must have the same parameter n, and x, y are p, q in (0, 1).

>>> klBin(0.5, 0.5, 10)
0.0
>>> klBin(0.1, 0.9, 10)  # doctest: +ELLIPSIS
17.57779...
>>> klBin(0.9, 0.1, 10)  # And this KL is symmetric  # doctest: +ELLIPSIS
17.57779...
>>> klBin(0.4, 0.5, 10)  # doctest: +ELLIPSIS
0.20135...
>>> klBin(0.01, 0.99, 10)  # doctest: +ELLIPSIS
45.03217...
  • Special values:
>>> klBin(0, 1, 10)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
345.39575...
Arms.kullback.klPoisson(x, y)[source]

Kullback-Leibler divergence for Poison distributions. https://en.wikipedia.org/wiki/Poisson_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathrm{Poisson}(x), \mathrm{Poisson}(y)) = y - x + x \times \log(\frac{x}{y}).\]
>>> klPoisson(3, 3)
0.0
>>> klPoisson(2, 1)  # doctest: +ELLIPSIS
0.386294...
>>> klPoisson(1, 2)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klPoisson(3, 6)  # doctest: +ELLIPSIS
0.920558...
>>> klPoisson(6, 8)  # doctest: +ELLIPSIS
0.273907...
  • Special values:
>>> klPoisson(1, 0)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
33.538776...
>>> klPoisson(0, 0)
0.0
Arms.kullback.klExp(x, y)[source]

Kullback-Leibler divergence for exponential distributions. https://en.wikipedia.org/wiki/Exponential_distribution#Kullback.E2.80.93Leibler_divergence

\[\begin{split}\mathrm{KL}(\mathrm{Exp}(x), \mathrm{Exp}(y)) = \begin{cases} \frac{x}{y} - 1 - \log(\frac{x}{y}) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]
>>> klExp(3, 3)
0.0
>>> klExp(3, 6)  # doctest: +ELLIPSIS
0.193147...
>>> klExp(1, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.193147...
>>> klExp(2, 1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klExp(4, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.306852...
>>> klExp(6, 8)  # doctest: +ELLIPSIS
0.037682...
  • x, y have to be positive:
>>> klExp(-3, 2)
inf
>>> klExp(3, -2)
inf
>>> klExp(-3, -2)
inf
Arms.kullback.klGamma(x, y, a=1)[source]

Kullback-Leibler divergence for gamma distributions. https://en.wikipedia.org/wiki/Gamma_distribution#Kullback.E2.80.93Leibler_divergence

  • It is simply a times klExp() applied to x and y.
\[\begin{split}\mathrm{KL}(\Gamma(x, a), \Gamma(y, a)) = \begin{cases} a \times \left( \frac{x}{y} - 1 - \log(\frac{x}{y}) \right) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]

Warning

The two distributions must have the same parameter a.

>>> klGamma(3, 3)
0.0
>>> klGamma(3, 6)  # doctest: +ELLIPSIS
0.193147...
>>> klGamma(1, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.193147...
>>> klGamma(2, 1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klGamma(4, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.306852...
>>> klGamma(6, 8)  # doctest: +ELLIPSIS
0.037682...
  • x, y have to be positive:
>>> klGamma(-3, 2)
inf
>>> klGamma(3, -2)
inf
>>> klGamma(-3, -2)
inf
Arms.kullback.klNegBin(x, y, r=1)[source]

Kullback-Leibler divergence for negative binomial distributions. https://en.wikipedia.org/wiki/Negative_binomial_distribution

\[\mathrm{KL}(\mathrm{NegBin}(x, r), \mathrm{NegBin}(y, r)) = r \times \log((r + x) / (r + y)) - x \times \log(y \times (r + x) / (x \times (r + y))).\]

Warning

The two distributions must have the same parameter r.

>>> klNegBin(0.5, 0.5)
0.0
>>> klNegBin(0.1, 0.9)  # doctest: +ELLIPSIS
-0.711611...
>>> klNegBin(0.9, 0.1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
2.0321564...
>>> klNegBin(0.4, 0.5)  # doctest: +ELLIPSIS
-0.130653...
>>> klNegBin(0.01, 0.99)  # doctest: +ELLIPSIS
-0.717353...
  • Special values:
>>> klBern(0, 1)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
34.539575...
  • With other values for r:
>>> klNegBin(0.5, 0.5, r=2)
0.0
>>> klNegBin(0.1, 0.9, r=2)  # doctest: +ELLIPSIS
-0.832991...
>>> klNegBin(0.1, 0.9, r=4)  # doctest: +ELLIPSIS
-0.914890...
>>> klNegBin(0.9, 0.1, r=2)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
2.3325528...
>>> klNegBin(0.4, 0.5, r=2)  # doctest: +ELLIPSIS
-0.154572...
>>> klNegBin(0.01, 0.99, r=2)  # doctest: +ELLIPSIS
-0.836257...
Arms.kullback.klGauss(x, y, sig2x=0.25, sig2y=None)[source]

Kullback-Leibler divergence for Gaussian distributions of means x and y and variances sig2x and sig2y, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):

\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]

See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties

  • By default, sig2y is assumed to be sig2x (same variance).

Warning

The C version does not support different variances.

>>> klGauss(3, 3)
0.0
>>> klGauss(3, 6)
18.0
>>> klGauss(1, 2)
2.0
>>> klGauss(2, 1)  # And this KL is symmetric
2.0
>>> klGauss(4, 2)
8.0
>>> klGauss(6, 8)
8.0
  • x, y can be negative:
>>> klGauss(-3, 2)
50.0
>>> klGauss(3, -2)
50.0
>>> klGauss(-3, -2)
2.0
>>> klGauss(3, 2)
2.0
  • With other values for sig2x:
>>> klGauss(3, 3, sig2x=10)
0.0
>>> klGauss(3, 6, sig2x=10)
0.45
>>> klGauss(1, 2, sig2x=10)
0.05
>>> klGauss(2, 1, sig2x=10)  # And this KL is symmetric
0.05
>>> klGauss(4, 2, sig2x=10)
0.2
>>> klGauss(6, 8, sig2x=10)
0.2
  • With different values for sig2x and sig2y:
>>> klGauss(0, 0, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
-0.0284...
>>> klGauss(0, 0, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.2243...
>>> klGauss(0, 0, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
1.1534...
>>> klGauss(0, 1, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
0.9715...
>>> klGauss(0, 1, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.7243...
>>> klGauss(0, 1, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
3.1534...
>>> klGauss(1, 0, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
0.9715...
>>> klGauss(1, 0, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.7243...
>>> klGauss(1, 0, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
3.1534...

Warning

Using Policies.klUCB (and variants) with klGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Arms.kullback.klucb(x, d, kl, upperbound, precision=1e-06, lowerbound=-inf, max_iterations=50)[source]

The generic KL-UCB index computation.

  • x: value of the cum reward,
  • d: upper bound on the divergence,
  • kl: the KL divergence to be used (klBern(), klGauss(), etc),
  • upperbound, lowerbound=float('-inf'): the known bounds on the value x,
  • precision=1e-6: the threshold at which to stop the search,
  • max_iterations=50: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{klucb}(x, d) \simeq \sup_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) < d \}.\]

Note

It uses a bisection search, and one call to kl for each step of the bisection search.
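
As a rough sketch (not the package's exact implementation), the bisection behind klucb() looks like this:

def klucb_sketch(x, d, kl, upperbound, lowerbound=float('-inf'),
                 precision=1e-6, max_iterations=50):
    """Return (approximately) the largest y in [lowerbound, upperbound] with kl(x, y) <= d."""
    value = max(x, lowerbound)
    u = upperbound
    for _ in range(max_iterations):
        if u - value < precision:
            break
        m = (value + u) / 2.0        # midpoint of the current interval
        if kl(x, m) > d:
            u = m                    # m violates the constraint: shrink from above
        else:
            value = m                # m satisfies it: the answer is at least m
    return (value + u) / 2.0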

For example, for klucbBern(), the two steps are to first compute an upper bound (as precise as possible) and then compute the kl-UCB index:

>>> x, d = 0.9, 0.2   # mean x, exploration term d
>>> upperbound = min(1., klucbGauss(x, d, sig2x=0.25))  # variance 1/4 for [0,1] bounded distributions
>>> upperbound  # doctest: +ELLIPSIS
1.0
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=10)  # doctest: +ELLIPSIS
0.9941...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=10)  # doctest: +ELLIPSIS
0.9944...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=50)  # doctest: +ELLIPSIS
0.9941...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=100)  # more and more precise!  # doctest: +ELLIPSIS
0.994489...

Note

See below for more examples for different KL divergence functions.
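To make the bisection explicit, here is an illustrative sketch of the search that this function performs (a simplified re-implementation for exposition, not the packaged code):

    def klucb_sketch(x, d, kl, upperbound, lowerbound=float('-inf'),
                     precision=1e-6, max_iterations=50):
        """ Bisection search for sup { y : kl(x, y) < d } inside [lowerbound, upperbound]. """
        value = max(x, lowerbound)  # kl(x, x) = 0 < d, so x itself is always feasible
        u = upperbound
        nb_iterations = 0
        while nb_iterations < max_iterations and u - value > precision:
            nb_iterations += 1
            mid = (value + u) / 2.0
            if kl(x, mid) > d:
                u = mid        # mid violates the constraint: shrink from above
            else:
                value = mid    # mid is still feasible: the index is at least mid
        return (value + u) / 2.0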

Arms.kullback.klucbBern(x, d, precision=1e-06)[source]

KL-UCB index computation for Bernoulli distributions, using klucb().

  • Influence of x:
>>> klucbBern(0.1, 0.2)  # doctest: +ELLIPSIS
0.378391...
>>> klucbBern(0.5, 0.2)  # doctest: +ELLIPSIS
0.787088...
>>> klucbBern(0.9, 0.2)  # doctest: +ELLIPSIS
0.994489...
  • Influence of d:
>>> klucbBern(0.1, 0.4)  # doctest: +ELLIPSIS
0.519475...
>>> klucbBern(0.1, 0.9)  # doctest: +ELLIPSIS
0.734714...
>>> klucbBern(0.5, 0.4)  # doctest: +ELLIPSIS
0.871035...
>>> klucbBern(0.5, 0.9)  # doctest: +ELLIPSIS
0.956809...
>>> klucbBern(0.9, 0.4)  # doctest: +ELLIPSIS
0.999285...
>>> klucbBern(0.9, 0.9)  # doctest: +ELLIPSIS
0.999995...
Arms.kullback.klucbGauss(x, d, sig2x=0.25, precision=0.0)[source]

KL-UCB index computation for Gaussian distributions.

  • Note that it does not require any search.

Warning

It works only if the correct variance constant is given.
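Indeed, in this Gaussian case the index has an explicit closed form (which is why no bisection search is needed); with the default variance convention of the doctests below, it reads:

\[\mathrm{klucbGauss}(x, d, \sigma_x^2) = x + \sqrt{2 \sigma_x^2 d}.\]

For instance, \(0.1 + \sqrt{2 \times 0.25 \times 0.2} \simeq 0.416\), matching the first doctest below.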

  • Influence of x:
>>> klucbGauss(0.1, 0.2)  # doctest: +ELLIPSIS
0.416227...
>>> klucbGauss(0.5, 0.2)  # doctest: +ELLIPSIS
0.816227...
>>> klucbGauss(0.9, 0.2)  # doctest: +ELLIPSIS
1.216227...
  • Influence of d:
>>> klucbGauss(0.1, 0.4)  # doctest: +ELLIPSIS
0.547213...
>>> klucbGauss(0.1, 0.9)  # doctest: +ELLIPSIS
0.770820...
>>> klucbGauss(0.5, 0.4)  # doctest: +ELLIPSIS
0.947213...
>>> klucbGauss(0.5, 0.9)  # doctest: +ELLIPSIS
1.170820...
>>> klucbGauss(0.9, 0.4)  # doctest: +ELLIPSIS
1.347213...
>>> klucbGauss(0.9, 0.9)  # doctest: +ELLIPSIS
1.570820...

Warning

Using Policies.klUCB (and variants) with klucbGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Arms.kullback.klucbPoisson(x, d, precision=1e-06)[source]

KL-UCB index computation for Poisson distributions, using klucb().

  • Influence of x:
>>> klucbPoisson(0.1, 0.2)  # doctest: +ELLIPSIS
0.450523...
>>> klucbPoisson(0.5, 0.2)  # doctest: +ELLIPSIS
1.089376...
>>> klucbPoisson(0.9, 0.2)  # doctest: +ELLIPSIS
1.640112...
  • Influence of d:
>>> klucbPoisson(0.1, 0.4)  # doctest: +ELLIPSIS
0.693684...
>>> klucbPoisson(0.1, 0.9)  # doctest: +ELLIPSIS
1.252796...
>>> klucbPoisson(0.5, 0.4)  # doctest: +ELLIPSIS
1.422933...
>>> klucbPoisson(0.5, 0.9)  # doctest: +ELLIPSIS
2.122985...
>>> klucbPoisson(0.9, 0.4)  # doctest: +ELLIPSIS
2.033691...
>>> klucbPoisson(0.9, 0.9)  # doctest: +ELLIPSIS
2.831573...
Arms.kullback.klucbExp(x, d, precision=1e-06)[source]

KL-UCB index computation for exponential distributions, using klucb().

  • Influence of x:
>>> klucbExp(0.1, 0.2)  # doctest: +ELLIPSIS
0.202741...
>>> klucbExp(0.5, 0.2)  # doctest: +ELLIPSIS
1.013706...
>>> klucbExp(0.9, 0.2)  # doctest: +ELLIPSIS
1.824671...
  • Influence of d:
>>> klucbExp(0.1, 0.4)  # doctest: +ELLIPSIS
0.285792...
>>> klucbExp(0.1, 0.9)  # doctest: +ELLIPSIS
0.559088...
>>> klucbExp(0.5, 0.4)  # doctest: +ELLIPSIS
1.428962...
>>> klucbExp(0.5, 0.9)  # doctest: +ELLIPSIS
2.795442...
>>> klucbExp(0.9, 0.4)  # doctest: +ELLIPSIS
2.572132...
>>> klucbExp(0.9, 0.9)  # doctest: +ELLIPSIS
5.031795...
Arms.kullback.klucbGamma(x, d, precision=1e-06)[source]

KL-UCB index computation for Gamma distributions, using klucb().

  • Influence of x:
>>> klucbGamma(0.1, 0.2)  # doctest: +ELLIPSIS
0.202...
>>> klucbGamma(0.5, 0.2)  # doctest: +ELLIPSIS
1.013...
>>> klucbGamma(0.9, 0.2)  # doctest: +ELLIPSIS
1.824...
  • Influence of d:
>>> klucbGamma(0.1, 0.4)  # doctest: +ELLIPSIS
0.285...
>>> klucbGamma(0.1, 0.9)  # doctest: +ELLIPSIS
0.559...
>>> klucbGamma(0.5, 0.4)  # doctest: +ELLIPSIS
1.428...
>>> klucbGamma(0.5, 0.9)  # doctest: +ELLIPSIS
2.795...
>>> klucbGamma(0.9, 0.4)  # doctest: +ELLIPSIS
2.572...
>>> klucbGamma(0.9, 0.9)  # doctest: +ELLIPSIS
5.031...
Arms.kullback.kllcb(x, d, kl, lowerbound, precision=1e-06, upperbound=inf, max_iterations=50)[source]

The generic KL-LCB index computation.

  • x: value of the cumulated reward,
  • d: lower bound on the divergence,
  • kl: the KL divergence to be used (klBern(), klGauss(), etc),
  • lowerbound, upperbound=float('+inf'): the known bounds on the value x,
  • precision=1e-6: the threshold at which to stop the search,
  • max_iterations=50: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{kllcb}(x, d) \simeq \inf_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) > d \}.\]

Note

It uses a bisection search, and one call to kl for each step of the bisection search.

For example, for kllcbBern(), the two steps are to first compute a lower bound (as precise as possible) and then compute the kl-LCB index:

>>> x, d = 0.9, 0.2   # mean x, exploration term d
>>> lowerbound = max(0., kllcbGauss(x, d, sig2x=0.25))  # variance 1/4 for [0,1] bounded distributions
>>> lowerbound  # doctest: +ELLIPSIS
0.5837...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=10)  # doctest: +ELLIPSIS
0.29...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=10)  # doctest: +ELLIPSIS
0.29188...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=50)  # doctest: +ELLIPSIS
0.291886...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=100)  # more and more precise!  # doctest: +ELLIPSIS
0.29188611...

Note

See below for more examples for different KL divergence functions.

Arms.kullback.kllcbBern(x, d, precision=1e-06)[source]

KL-LCB index computation for Bernoulli distributions, using kllcb().

  • Influence of x:
>>> kllcbBern(0.1, 0.2)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.5, 0.2)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbBern(0.9, 0.2)  # doctest: +ELLIPSIS
0.89999...
  • Influence of d:
>>> kllcbBern(0.1, 0.4)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.1, 0.9)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.5, 0.4)  # doctest: +ELLIPSIS
0.4999...
>>> kllcbBern(0.5, 0.9)  # doctest: +ELLIPSIS
0.4999...
>>> kllcbBern(0.9, 0.4)  # doctest: +ELLIPSIS
0.8999...
>>> kllcbBern(0.9, 0.9)  # doctest: +ELLIPSIS
0.8999...
Arms.kullback.kllcbGauss(x, d, sig2x=0.25, precision=0.0)[source]

KL-LCB index computation for Gaussian distributions.

  • Note that it does not require any search.

Warning

It works only if the correct variance constant is given.
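As for klucbGauss(), the index has an explicit closed form in this Gaussian case; with the default variance convention of the doctests below, it reads:

\[\mathrm{kllcbGauss}(x, d, \sigma_x^2) = x - \sqrt{2 \sigma_x^2 d}.\]

For instance, \(0.1 - \sqrt{2 \times 0.25 \times 0.2} \simeq -0.216\), matching the first doctest below.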

  • Influence of x:
>>> kllcbGauss(0.1, 0.2)  # doctest: +ELLIPSIS
-0.21622...
>>> kllcbGauss(0.5, 0.2)  # doctest: +ELLIPSIS
0.18377...
>>> kllcbGauss(0.9, 0.2)  # doctest: +ELLIPSIS
0.58377...
  • Influence of d:
>>> kllcbGauss(0.1, 0.4)  # doctest: +ELLIPSIS
-0.3472...
>>> kllcbGauss(0.1, 0.9)  # doctest: +ELLIPSIS
-0.5708...
>>> kllcbGauss(0.5, 0.4)  # doctest: +ELLIPSIS
0.0527...
>>> kllcbGauss(0.5, 0.9)  # doctest: +ELLIPSIS
-0.1708...
>>> kllcbGauss(0.9, 0.4)  # doctest: +ELLIPSIS
0.4527...
>>> kllcbGauss(0.9, 0.9)  # doctest: +ELLIPSIS
0.2291...

Warning

Using Policies.kllCB (and variants) with kllcbGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Arms.kullback.kllcbPoisson(x, d, precision=1e-06)[source]

KL-LCB index computation for Poisson distributions, using kllcb().

  • Influence of x:
>>> kllcbPoisson(0.1, 0.2)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.5, 0.2)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.9, 0.2)  # doctest: +ELLIPSIS
0.89999...
  • Influence of d:
>>> kllcbPoisson(0.1, 0.4)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.1, 0.9)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.5, 0.4)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.5, 0.9)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.9, 0.4)  # doctest: +ELLIPSIS
0.89999...
>>> kllcbPoisson(0.9, 0.9)  # doctest: +ELLIPSIS
0.89999...
Arms.kullback.kllcbExp(x, d, precision=1e-06)[source]

KL-LCB index computation for exponential distributions, using kllcb().

  • Influence of x:
>>> kllcbExp(0.1, 0.2)  # doctest: +ELLIPSIS
0.15267...
>>> kllcbExp(0.5, 0.2)  # doctest: +ELLIPSIS
0.7633...
>>> kllcbExp(0.9, 0.2)  # doctest: +ELLIPSIS
1.3740...
  • Influence of d:
>>> kllcbExp(0.1, 0.4)  # doctest: +ELLIPSIS
0.2000...
>>> kllcbExp(0.1, 0.9)  # doctest: +ELLIPSIS
0.3842...
>>> kllcbExp(0.5, 0.4)  # doctest: +ELLIPSIS
1.0000...
>>> kllcbExp(0.5, 0.9)  # doctest: +ELLIPSIS
1.9214...
>>> kllcbExp(0.9, 0.4)  # doctest: +ELLIPSIS
1.8000...
>>> kllcbExp(0.9, 0.9)  # doctest: +ELLIPSIS
3.4586...
Arms.kullback.maxEV(p, V, klMax)[source]

Maximize the expectation of \(V\) with respect to \(q\), subject to \(\mathrm{KL}(p, q) < \text{klMax}\).
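Spelled out (a restatement of the same constraint, with \(q\) ranging over probability vectors of the same length as \(p\)):

\[\max_{q} \; \sum_{i} q_i V_i \quad \text{such that} \quad \mathrm{KL}(p, q) < \text{klMax}.\]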

Arms.kullback.reseqp(p, V, klMax, max_iterations=50)[source]

Solve f(reseqp(p, V, klMax)) = klMax, using Newton's method.

Note

This is a subroutine of maxEV().

Warning

np.dot is very slow!

Arms.kullback.reseqp2(p, V, klMax)[source]

Solve f(reseqp(p, V, klMax)) = klMax, using a black-box minimizer from scipy.optimize.

  • FIXME it does not work well yet!

Note

This is a subroutine of maxEV().

  • Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011].

Warning

np.dot is very slow!

Arms.usenumba module

Import numba.jit or a dummy decorator.

Arms.usenumba.USE_NUMBA = False

Configure the use of numba

Arms.usenumba.jit(f)[source]

Fake numba.jit decorator.
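A fallback decorator like this one can be sketched in a few lines (illustrative only, matching the description above):

    def jit(f):
        """ Fake numba.jit decorator: return the function unchanged. """
        return f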

Environment package

Environment module:

  • MAB, MarkovianMAB, ChangingAtEachRepMAB, IncreasingMAB, PieceWiseStationaryMAB, NonStationaryMAB objects, used to wrap the problems (essentially a list of arms).
  • Result and ResultMultiPlayers objects, used to wrap simulation results (list of decisions and rewards).
  • Evaluator environment, used to wrap simulation, for the single player case.
  • EvaluatorMultiPlayers environment, used to wrap simulation, for the multi-players case.
  • EvaluatorSparseMultiPlayers environment, used to wrap simulation, for the multi-players case with sparse activated players.
  • CollisionModels implements different collision models.

And useful constants and functions for the plotting and stuff:

  • DPI, signature(), maximizeWindow(), palette(), makemarkers(), wraptext(): for plotting,
  • notify(): send a desktop notification,
  • Parallel(), delayed(): joblib related,
  • tqdm: pretty range() loops,
  • sortedDistance, fairnessMeasures: science related,
  • getCurrentMemory(), sizeof_fmt(): to measure and pretty print memory consumption.
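As a quick orientation, a minimal single-player experiment built from these pieces could look like the following sketch (configuration keys and import paths are indicative, based on the pip-installed package layout; adapt them to your setup):

    from SMPyBandits.Arms import Bernoulli
    from SMPyBandits.Policies import UCB, klUCB
    from SMPyBandits.Environment.Evaluator import Evaluator

    configuration = {
        "horizon": 1000,        # time horizon T
        "repetitions": 10,      # Monte Carlo repetitions
        "n_jobs": 1,            # joblib parallelism
        "verbosity": 5,
        "environment": [        # one MAB problem: 3 Bernoulli arms
            {"arm_type": Bernoulli, "params": [0.1, 0.5, 0.9]},
        ],
        "policies": [           # algorithms to compare
            {"archtype": UCB, "params": {}},
            {"archtype": klUCB, "params": {}},
        ],
    }

    evaluation = Evaluator(configuration)
    evaluation.startAllEnv()          # run all the simulations
    evaluation.printFinalRanking()    # print the final ranking of the policies
    evaluation.plotRegrets(envId=0)   # plot the centralized cumulated regret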
Submodules
Environment.CollisionModels module

Define the different collision models.

Collision models are generic functions, taking:

  • the time: t
  • the arms of the current environment: arms
  • the list of players: players
  • the numpy array of their choices: choices
  • the numpy array to store their rewards: rewards
  • the numpy array to store their pulls: pulls
  • the numpy array to store their collisions: collisions

As of now, there are 4 different collision models implemented:

  • noCollision(): simple collision model where all players sample it and receive the reward.
  • onlyUniqUserGetsReward(): simple collision model, where only the players alone on one arm sample it and receive the reward (default).
  • rewardIsSharedUniformly(): in case of more than one player on one arm, only one player (uniform choice) can sample it and receive the reward.
  • closerUserGetsReward(): in case of more than one player on one arm, only the closest player can sample it and receive the reward. It can take, or create if not given, a random distance of each player to the base station (a random number in [0, 1]).
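For reference, here is a minimal sketch of what a custom collision model respecting this signature could look like (a hypothetical example, not one of the models shipped with the package; it assumes the players expose the same getReward() and handleCollision() methods used by the bundled models, and that pulls is indexed by (playerId, armId)):

    import numpy as np

    def lowestIndexUserGetsReward(t, arms, players, choices, rewards, pulls, collisions):
        """ Hypothetical collision model: on a collision, the player with the smallest index keeps the reward. """
        for armId in range(len(arms)):
            playersOnArm = np.nonzero(choices == armId)[0]
            if len(playersOnArm) == 0:
                continue
            winner = playersOnArm[0]                 # smallest index wins this arm
            rewards[winner] = arms[armId].draw(t)    # only the winner samples the arm
            players[winner].getReward(armId, rewards[winner])
            pulls[winner, armId] += 1
            if len(playersOnArm) > 1:                # a collision happened on this arm
                collisions[armId] += len(playersOnArm)
                for loser in playersOnArm[1:]:       # the losers are informed of the collision
                    players[loser].handleCollision(armId)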
Environment.CollisionModels.onlyUniqUserGetsReward(t, arms, players, choices, rewards, pulls, collisions)[source]

Simple collision model where only the players alone on an arm sample it and receive the reward.

  • This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
  • The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
Environment.CollisionModels.defaultCollisionModel(t, arms, players, choices, rewards, pulls, collisions)

Simple collision model where only the players alone on an arm sample it and receive the reward.

  • This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
  • The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
Environment.CollisionModels.onlyUniqUserGetsRewardSparse(t, arms, players, choices, rewards, pulls, collisions)[source]

Simple collision model where only the players alone on an arm sample it and receive the reward.

  • This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
  • The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
  • Supports non-activated players, who signal it by choosing a negative index.
Environment.CollisionModels.allGetRewardsAndUseCollision(t, arms, players, choices, rewards, pulls, collisions)[source]

A variant of the first simple collision model where all players sample their arm, receive their rewards, and are informed of the collisions.

Note

It is NOT the model we consider, and so our lower-bound on centralized regret is wrong for it (users do not account for collisions in their internal rewards, so the regret does not take collisions into account!).

  • This is NOT the default collision model, cf. [Liu & Zhao, 2009](https://arxiv.org/abs/0910.2065v3) collision model 1.
  • The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
Environment.CollisionModels.noCollision(t, arms, players, choices, rewards, pulls, collisions)[source]

Simple collision model where all players sample it and receive the reward.

  • It corresponds to the single-player simulation: each player is a policy, compared without collision.
  • The numpy array ‘collisions’ is not modified.
Environment.CollisionModels.rewardIsSharedUniformly(t, arms, players, choices, rewards, pulls, collisions)[source]

Less simple collision model where:

  • The players alone on one arm sample it and receive the reward.
  • In case of more than one player on one arm, only one player (uniform choice) can sample it and receive the reward. It is chosen by the base station.

Note

It can also model a choice from the users' point of view: in a time frame (e.g. 1 second), when there is a collision, each colliding user chooses (uniformly) a random small time offset (e.g. 20 ms), and starts sensing + emitting again after that time. The first one to sense is alone, so it transmits, and the next ones find the channel in use when sensing. So only one player is transmitting, and from the base station's point of view, it is the same as if it was chosen uniformly among the colliding users.

Environment.CollisionModels.closerUserGetsReward(t, arms, players, choices, rewards, pulls, collisions, distances='uniform')[source]

Simple collision model where:

  • The players alone on one arm sample it and receive the reward.
  • In case of more than one player on one arm, only the closest player can sample it and receive the reward. It can take, or create if not given, a distance of each player to the base station (numbers in [0, 1]).
  • If distances is not given, it is either generated randomly (random numbers in [0, 1]) or is a linspace of nbPlayers values in (0, 1), equally spaced (default).

Note

This kind of effect is known in telecommunications as the Near-Far effect or the Capture effect [Roberts, 1975](https://dl.acm.org/citation.cfm?id=1024920).

Environment.CollisionModels.collision_models = [<function onlyUniqUserGetsReward>, <function onlyUniqUserGetsRewardSparse>, <function allGetRewardsAndUseCollision>, <function noCollision>, <function rewardIsSharedUniformly>, <function closerUserGetsReward>]

List of possible collision models

Environment.CollisionModels.full_lost_if_collision = {'allGetRewardsAndUseCollision': True, 'closerUserGetsReward': False, 'noCollision': False, 'onlyUniqUserGetsReward': True, 'onlyUniqUserGetsRewardSparse': True, 'rewardIsSharedUniformly': False}

Mapping of collision model names to True or False, to know if a collision implies a lost communication or not in this model

Environment.Evaluator module

Evaluator class to wrap and run the simulations. Lots of plotting methods, to have various visualizations.

Environment.Evaluator.USE_PICKLE = False

Should we save the figure objects to a .pickle file at the end of the simulation?

Environment.Evaluator._nbOfArgs(function)[source]
Environment.Evaluator.REPETITIONS = 1

Default nb of repetitions

Environment.Evaluator.DELTA_T_PLOT = 50

Default sampling rate for plotting

Environment.Evaluator.plot_lowerbound = True

Default is to plot the lower-bound

Environment.Evaluator.USE_BOX_PLOT = True

True to use boxplot, False to use violinplot.

Environment.Evaluator.random_shuffle = False

Use basic random events of shuffling the arms?

Environment.Evaluator.random_invert = False

Use basic random events of inverting the arms?

Environment.Evaluator.nb_break_points = 0

Default nb of random events

Environment.Evaluator.STORE_ALL_REWARDS = False

Store all rewards?

Environment.Evaluator.STORE_REWARDS_SQUARED = False

Store rewards squared?

Environment.Evaluator.MORE_ACCURATE = True

Use the count of selections instead of rewards for a more accurate mean/var reward measure.

Environment.Evaluator.FINAL_RANKS_ON_AVERAGE = True

Final ranks are printed based on average on last 1% rewards and not only the last rewards

Environment.Evaluator.USE_JOBLIB_FOR_POLICIES = False

Don’t use joblib to parallelize the simulations on various policies (we parallelize the random Monte Carlo repetitions)

class Environment.Evaluator.Evaluator(configuration, finalRanksOnAverage=True, averageOn=0.005, useJoblibForPolicies=False, moreAccurate=True)[source]

Bases: object

Evaluator class to run the simulations.

__init__(configuration, finalRanksOnAverage=True, averageOn=0.005, useJoblibForPolicies=False, moreAccurate=True)[source]

Initialize self. See help(type(self)) for accurate signature.

cfg = None

Configuration dictionary

nbPolicies = None

Number of policies

horizon = None

Horizon (number of time steps)

repetitions = None

Number of repetitions

delta_t_plot = None

Sampling rate for plotting

random_shuffle = None

Random shuffling of arms?

random_invert = None

Random inversion of arms?

nb_break_points = None

How many random events?

plot_lowerbound = None

Should we plot the lower-bound?

moreAccurate = None

Use the count of selections instead of rewards for a more accurate mean/var reward measure.

finalRanksOnAverage = None

Final display of ranks are done on average rewards?

averageOn = None

How many last steps for final rank average rewards

useJoblibForPolicies = None

Use joblib to parallelize for loop on policies (useless)

useJoblib = None

Use joblib to parallelize for loop on repetitions (useful)

cache_rewards = None

Should we cache and precompute rewards

environment_bayesian = None

Is the environment Bayesian?

showplot = None

Show the plot (interactive display or not)

use_box_plot = None

To use box plot (or violin plot if False). Force to use boxplot if repetitions=1.

change_labels = None

Possibly empty dictionary to map ‘policyId’ to new labels (overwrite their name).

append_labels = None

Possibly empty dictionary to map ‘policyId’ to new labels (by appending the result from ‘append_labels’).

envs = None

List of environments

policies = None

List of policies

rewards = None

For each env, history of rewards, ie accumulated rewards

lastCumRewards = None

For each env, last accumulated rewards, to compute variance and histogram of whole regret R_T

minCumRewards = None

For each env, history of minimum of rewards, to compute amplitude (+- STD)

maxCumRewards = None

For each env, history of maximum of rewards, to compute amplitude (+- STD)

rewardsSquared = None

For each env, history of rewards squared

allRewards = None

For each env, full history of rewards

bestArmPulls = None

For each env, keep the history of best arm pulls

pulls = None

For each env, keep cumulative counts of all arm pulls

allPulls = None

For each env, keep cumulative counts of all arm pulls

lastPulls = None

For each env, keep cumulative counts of all arm pulls

runningTimes = None

For each env, keep the history of running times

memoryConsumption = None

For each env, keep the history of memory consumption

numberOfCPDetections = None

For each env, store the number of change-point detections by each algorithm, to print its average at the end (to check if a certain change-point detector algorithm detects too few or too many changes).

__initEnvironments__()[source]

Create environments.

__initPolicies__(env)[source]

Create or initialize policies.

compute_cache_rewards(arms)[source]

Compute only once the rewards, then launch the experiments with the same matrix (r_{k,t}).

startAllEnv()[source]

Simulate all envs.

startOneEnv(envId, env)[source]

Simulate that env.

saveondisk(filepath='saveondisk_Evaluator.hdf5')[source]

Save the content of the internal data into an HDF5 file on the disk.

getPulls(policyId, envId=0)[source]

Extract mean pulls.

getBestArmPulls(policyId, envId=0)[source]

Extract mean best arm pulls.

getRewards(policyId, envId=0)[source]

Extract mean rewards.

getAverageWeightedSelections(policyId, envId=0)[source]

Extract weighted count of selections.

getMaxRewards(envId=0)[source]

Extract max mean rewards.

getCumulatedRegret_LessAccurate(policyId, envId=0)[source]

Compute cumulative regret, based on accumulated rewards.

getCumulatedRegret_MoreAccurate(policyId, envId=0)[source]

Compute cumulative regret, based on counts of selections and not actual rewards.

getCumulatedRegret(policyId, envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

getLastRegrets_LessAccurate(policyId, envId=0)[source]

Extract last regrets, based on accumulated rewards.

getAllLastWeightedSelections(policyId, envId=0)[source]

Extract weighted count of selections.

getLastRegrets_MoreAccurate(policyId, envId=0)[source]

Extract last regrets, based on counts of selections and not actual rewards.

getLastRegrets(policyId, envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

getAverageRewards(policyId, envId=0)[source]

Extract mean rewards (not raw rewards but cumsum(rewards)/cumsum(1)).

getRewardsSquared(policyId, envId=0)[source]

Extract rewards squared.

getSTDRegret(policyId, envId=0, meanReward=False)[source]

Extract standard deviation of rewards.

Warning

FIXME experimental!

getMaxMinReward(policyId, envId=0)[source]

Extract amplitude of rewards as maxCumRewards - minCumRewards.

getRunningTimes(envId=0)[source]

Get the means and stds and list of running time of the different policies.

getMemoryConsumption(envId=0)[source]

Get the means and stds and list of memory consumptions of the different policies.

getNumberOfCPDetections(envId=0)[source]

Get the means and stds and list of numberOfCPDetections of the different policies.

printFinalRanking(envId=0, moreAccurate=None)[source]

Print the final ranking of the different policies.

_xlabel(envId, *args, **kwargs)[source]

Add xlabel to the plot, and if the environment has change-point, draw vertical lines to clearly identify the locations of the change points.

plotRegrets(envId=0, savefig=None, meanReward=False, plotSTD=False, plotMaxMin=False, semilogx=False, semilogy=False, loglog=False, normalizedRegret=False, drawUpperBound=False, moreAccurate=None)[source]

Plot the centralized cumulated regret, support more than one environments (use evaluators to give a list of other environments).

plotBestArmPulls(envId, savefig=None)[source]

Plot the frequency of pulls of the best channel.

  • Warning: does not adapt to dynamic settings!
printRunningTimes(envId=0, precision=3)[source]

Print the average+-std running time of the different policies.

plotRunningTimes(envId=0, savefig=None, base=1, unit='seconds')[source]

Plot the running times of the different policies, as a box plot for each.

printMemoryConsumption(envId=0)[source]

Print the average+-std memory consumption of the different policies.

plotMemoryConsumption(envId=0, savefig=None, base=1024, unit='KiB')[source]

Plot the memory consumption of the different policies, as a box plot for each.

printNumberOfCPDetections(envId=0)[source]

Print the average+-std number_of_cp_detections of the different policies.

plotNumberOfCPDetections(envId=0, savefig=None)[source]

Plot the number of change-point detections of the different policies, as a box plot for each.

__module__ = 'Environment.Evaluator'
__weakref__

list of weak references to the object (if defined)

printLastRegrets(envId=0, moreAccurate=False)[source]

Print the last regrets of the different policies.

plotLastRegrets(envId=0, normed=False, subplots=True, nbbins=15, log=False, all_on_separate_figures=False, sharex=False, sharey=False, boxplot=False, normalized_boxplot=True, savefig=None, moreAccurate=False)[source]

Plot histogram of the regrets R_T for all policies.

plotHistoryOfMeans(envId=0, horizon=None, savefig=None)[source]

Plot the history of means, as a plot with x axis being the time, y axis the mean rewards, and K curves one for each arm.

Environment.Evaluator.delayed_play(env, policy, horizon, random_shuffle=False, random_invert=False, nb_break_points=0, seed=None, allrewards=None, repeatId=0, useJoblib=False)[source]

Helper function for the parallelization.

Environment.Evaluator.EvaluatorFromDisk(filepath='/tmp/saveondiskEvaluator.hdf5')[source]

Create a new Evaluator object from the HDF5 file given in argument.

Environment.Evaluator.shuffled(mylist)[source]

Returns a shuffled copy of the input 1D list. sorted() exists as a non-mutating counterpart of list.sort(), but the standard library offers no shuffled() counterpart of random.shuffle()…

>>> from random import seed; seed(1234)  # reproducible results
>>> mylist = [ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9]
>>> shuffled(mylist)
[0.9, 0.4, 0.3, 0.6, 0.5, 0.7, 0.1, 0.2, 0.8]
>>> shuffled(mylist)
[0.4, 0.3, 0.7, 0.5, 0.8, 0.1, 0.9, 0.6, 0.2]
>>> shuffled(mylist)
[0.4, 0.6, 0.9, 0.5, 0.7, 0.2, 0.1, 0.3, 0.8]
>>> shuffled(mylist)
[0.8, 0.7, 0.3, 0.1, 0.9, 0.5, 0.6, 0.2, 0.4]
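A minimal implementation of such a helper (an illustrative sketch, not necessarily the exact packaged code) is simply:

    from random import shuffle

    def shuffled(mylist):
        """ Return a new, shuffled copy of the input list, leaving the original untouched. """
        copy = list(mylist)
        shuffle(copy)
        return copy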
Environment.EvaluatorMultiPlayers module

EvaluatorMultiPlayers class to wrap and run the simulations, for the multi-players case. Lots of plotting methods, to have various visualizations. See documentation.

Environment.EvaluatorMultiPlayers.USE_PICKLE = False

Should we save the figure objects to a .pickle file at the end of the simulation?

Environment.EvaluatorMultiPlayers._nbOfArgs(function)[source]
Environment.EvaluatorMultiPlayers.REPETITIONS = 1

Default nb of repetitions

Environment.EvaluatorMultiPlayers.DELTA_T_PLOT = 50

Default sampling rate for plotting

Environment.EvaluatorMultiPlayers.COUNT_RANKS_MARKOV_CHAIN = False

If true, count and then print a lot of statistics for the Markov Chain of the underlying configurations on ranks

Environment.EvaluatorMultiPlayers.MORE_ACCURATE = True

Use the count of selections instead of rewards for a more accurate mean/var reward measure.

Environment.EvaluatorMultiPlayers.plot_lowerbounds = True

Default is to plot the lower-bounds

Environment.EvaluatorMultiPlayers.USE_BOX_PLOT = True

True to use boxplot, False to use violinplot.

Environment.EvaluatorMultiPlayers.nb_break_points = 0

Default nb of random events

Environment.EvaluatorMultiPlayers.FINAL_RANKS_ON_AVERAGE = True

Default value for finalRanksOnAverage

Environment.EvaluatorMultiPlayers.USE_JOBLIB_FOR_POLICIES = False

Default value for useJoblibForPolicies. Using it does not speed things up (too much overhead from using too many threads), so it should really be disabled.

class Environment.EvaluatorMultiPlayers.EvaluatorMultiPlayers(configuration, moreAccurate=True)[source]

Bases: object

Evaluator class to run the simulations, for the multi-players case.

__init__(configuration, moreAccurate=True)[source]

Initialize self. See help(type(self)) for accurate signature.

cfg = None

Configuration dictionary

nbPlayers = None

Number of players

repetitions = None

Number of repetitions

horizon = None

Horizon (number of time steps)

collisionModel = None

Which collision model should be used

full_lost_if_collision = None

Is there a full loss of rewards in case of collision? Used to compute the correct decomposition of the regret.

moreAccurate = None

Use the count of selections instead of rewards for a more accurate mean/var reward measure.

finalRanksOnAverage = None

Final display of ranks are done on average rewards?

averageOn = None

How many last steps for final rank average rewards

nb_break_points = None

How many random events?

plot_lowerbounds = None

Should we plot the lower-bounds?

useJoblib = None

Use joblib to parallelize for loop on repetitions (useful)

showplot = None

Show the plot (interactive display or not)

use_box_plot = None

To use box plot (or violin plot if False). Force to use boxplot if repetitions=1.

count_ranks_markov_chain = None

If true, count and then print a lot of statistics for the Markov Chain of the underlying configurations on ranks

change_labels = None

Possibly empty dictionary to map ‘playerId’ to new labels (overwrite their name).

append_labels = None

Possibly empty dictionary to map ‘playerId’ to new labels (by appending the result from ‘append_labels’).

envs = None

List of environments

players = None

List of players

rewards = None

For each env, history of rewards

pulls = None

For each env, keep the history of arm pulls (mean)

lastPulls = None

For each env, keep the distribution of arm pulls

allPulls = None

For each env, keep the full history of arm pulls

collisions = None

For each env, keep the history of collisions on all arms

lastCumCollisions = None

For each env, last count of collisions on all arms

nbSwitchs = None

For each env, keep the history of switches (change of configuration of players)

bestArmPulls = None

For each env, keep the history of best arm pulls

freeTransmissions = None

For each env, keep the history of successful transmission (1 - collisions, basically)

lastCumRewards = None

For each env, last accumulated rewards, to compute variance and histogram of whole regret R_T

runningTimes = None

For each env, keep the history of running times

memoryConsumption = None

For each env, keep the history of memory consumption

__initEnvironments__()[source]

Create environments.

__initPlayers__(env)[source]

Create or initialize players.

startAllEnv()[source]

Simulate all envs.

startOneEnv(envId, env)[source]

Simulate that env.

saveondisk(filepath='saveondisk_EvaluatorMultiPlayers.hdf5')[source]

Save the content of the internal data into an HDF5 file on the disk.

loadfromdisk(filepath)[source]

Update the internal memory of the Evaluator object by loading data from the opened HDF5 file.

Warning

FIXME this is not YET implemented!

getPulls(playerId, envId=0)[source]

Extract mean pulls.

getAllPulls(playerId, armId, envId=0)[source]

Extract mean of all pulls.

getNbSwitchs(playerId, envId=0)[source]

Extract mean nb of switches.

getCentralizedNbSwitchs(envId=0)[source]

Extract average of mean nb of switches.

getBestArmPulls(playerId, envId=0)[source]

Extract mean of best arms pulls.

getfreeTransmissions(playerId, envId=0)[source]

Extract mean of successful transmission.

getCollisions(armId, envId=0)[source]

Extract mean of number of collisions.

getRewards(playerId, envId=0)[source]

Extract mean of rewards.

getRegretMean(playerId, envId=0)[source]

Extract the mean regret, for one arm and one player (which has little meaning on its own).

Warning

This is the centralized regret, for one arm, it does not make much sense in the multi-players setting!

getCentralizedRegret_LessAccurate(envId=0)[source]

Compute the empirical centralized regret: cumsum on time of the mean rewards of the M best arms - cumsum on time of the empirical rewards obtained by the players, based on accumulated rewards.

getFirstRegretTerm(envId=0)[source]

Extract and compute the first term \((a)\) in the centralized regret: losses due to pulling suboptimal arms.

getSecondRegretTerm(envId=0)[source]

Extract and compute the second term \((b)\) in the centralized regret: losses due to not pulling optimal arms.

getThirdRegretTerm(envId=0)[source]

Extract and compute the third term \((c)\) in the centralized regret: losses due to collisions.

getCentralizedRegret_MoreAccurate(envId=0)[source]

Compute the empirical centralized regret, based on counts of selections and not actual rewards.

getCentralizedRegret(envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

getLastRegrets_LessAccurate(envId=0)[source]

Extract last regrets, based on accumulated rewards.

getAllLastWeightedSelections(envId=0)[source]

Extract weighted count of selections.

getLastRegrets_MoreAccurate(envId=0)[source]

Extract last regrets, based on counts of selections and not actual rewards.

getLastRegrets(envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

getRunningTimes(envId=0)[source]

Get the means and stds and list of running time of the different players.

getMemoryConsumption(envId=0)[source]

Get the means and stds and list of memory consumptions of the different players.

plotRewards(envId=0, savefig=None, semilogx=False, moreAccurate=None)[source]

Plot the decentralized (vectorial) rewards, for each player.

plotFairness(envId=0, savefig=None, semilogx=False, fairness='default', evaluators=())[source]

Plot a certain measure of “fairness”, from these personal rewards, support more than one environments (use evaluators to give a list of other environments).

plotRegretCentralized(envId=0, savefig=None, semilogx=False, semilogy=False, loglog=False, normalized=False, evaluators=(), subTerms=False, sumofthreeterms=False, moreAccurate=None)[source]

Plot the centralized cumulated regret, support more than one environments (use evaluators to give a list of other environments).

  • The lower bounds are also plotted (Besson & Kaufmann, and Anandkumar et al).
  • The three terms of the regret are also plotted if evaluators = () (that’s the default).
plotNbSwitchs(envId=0, savefig=None, semilogx=False, cumulated=False)[source]

Plot cumulated number of switchs (to evaluate the switching costs), comparing each player.

plotNbSwitchsCentralized(envId=0, savefig=None, semilogx=False, cumulated=False, evaluators=())[source]

Plot the centralized cumulated number of switchs (to evaluate the switching costs), support more than one environments (use evaluators to give a list of other environments).

plotBestArmPulls(envId=0, savefig=None)[source]

Plot the frequency of pulls of the best channel.

  • Warning: does not adapt to dynamic settings!
plotAllPulls(envId=0, savefig=None, cumulated=True, normalized=False)[source]

Plot the frequency of use of every channel, one figure for each channel. Not so useful.

plotFreeTransmissions(envId=0, savefig=None, cumulated=False)[source]

Plot the frequency of free transmissions.

plotNbCollisions(envId=0, savefig=None, semilogx=False, semilogy=False, loglog=False, cumulated=False, upperbound=False, evaluators=())[source]

Plot the frequency or cum number of collisions, support more than one environments (use evaluators to give a list of other environments).

plotFrequencyCollisions(envId=0, savefig=None, piechart=True, semilogy=False)[source]

Plot the frequency of collision, in a pie chart (histogram not supported yet).

printRunningTimes(envId=0, precision=3, evaluators=())[source]

Print the average+-std running time of the different players.

printMemoryConsumption(envId=0, evaluators=())[source]

Print the average+-std memory consumption of the different players.

plotRunningTimes(envId=0, savefig=None, base=1, unit='seconds', evaluators=())[source]

Plot the running times of the different players, as a box plot for each evaluators.

plotMemoryConsumption(envId=0, savefig=None, base=1024, unit='KiB', evaluators=())[source]

Plot the memory consumption of the different players, as a box plot for each.

printFinalRanking(envId=0, verb=True)[source]

Compute and print the ranking of the different players.

printFinalRankingAll(envId=0, evaluators=())[source]

Compute and print the ranking of the different players.

printLastRegrets(envId=0, evaluators=(), moreAccurate=None)[source]

Print the last regrets of the different evaluators.

printLastRegretsPM(envId=0, evaluators=(), moreAccurate=None)[source]

Print the average+-std last regret of the different players.

plotLastRegrets(envId=0, normed=False, subplots=True, nbbins=15, log=False, all_on_separate_figures=False, sharex=False, sharey=False, boxplot=False, normalized_boxplot=True, savefig=None, moreAccurate=None, evaluators=())[source]

Plot histogram of the regrets R_T for all evaluators.

plotHistoryOfMeans(envId=0, horizon=None, savefig=None)[source]

Plot the history of means, as a plot with x axis being the time, y axis the mean rewards, and K curves one for each arm.

strPlayers(short=False, latex=True)[source]

Get a string of the players for this environment.

__module__ = 'Environment.EvaluatorMultiPlayers'
__weakref__

list of weak references to the object (if defined)

Environment.EvaluatorMultiPlayers.delayed_play(env, players, horizon, collisionModel, seed=None, repeatId=0, count_ranks_markov_chain=False, useJoblib=False)[source]

Helper function for the parallelization.

Environment.EvaluatorMultiPlayers._extract(text)[source]

Extract the str of a player, if it is a child, printed as ‘#[0-9]+<…>’ –> …

Environment.EvaluatorSparseMultiPlayers module

EvaluatorSparseMultiPlayers class to wrap and run the simulations, for the multi-players case with sparse activated players. Lots of plotting methods, to have various visualizations. See documentation.

Warning

FIXME this environment is not as up-to-date as Environment.EvaluatorMultiPlayers.

Environment.EvaluatorSparseMultiPlayers.REPETITIONS = 1

Default nb of repetitions

Environment.EvaluatorSparseMultiPlayers.ACTIVATION = 1

Default probability of activation

Environment.EvaluatorSparseMultiPlayers.DELTA_T_PLOT = 50

Default sampling rate for plotting

Environment.EvaluatorSparseMultiPlayers.MORE_ACCURATE = True

Use the count of selections instead of rewards for a more accurate mean/std reward measure.

Environment.EvaluatorSparseMultiPlayers.FINAL_RANKS_ON_AVERAGE = True

Default value for finalRanksOnAverage

Environment.EvaluatorSparseMultiPlayers.USE_JOBLIB_FOR_POLICIES = False

Default value for useJoblibForPolicies. Using it does not speed things up (too much overhead from using too many threads), so it should really be disabled.

Environment.EvaluatorSparseMultiPlayers.PICKLE_IT = True

Default value for pickleit for saving the figures. If True, then all plt.figure objects are saved (in pickle format).

class Environment.EvaluatorSparseMultiPlayers.EvaluatorSparseMultiPlayers(configuration, moreAccurate=True)[source]

Bases: Environment.EvaluatorMultiPlayers.EvaluatorMultiPlayers

Evaluator class to run the simulations, for the multi-players case.

__init__(configuration, moreAccurate=True)[source]

Initialize self. See help(type(self)) for accurate signature.

activations = None

Probability of activations

collisionModel = None

Which collision model should be used

full_lost_if_collision = None

Is there a full loss of rewards in case of collision? Used to compute the correct decomposition of the regret.

startOneEnv(envId, env)[source]

Simulate that env.

getCentralizedRegret_LessAccurate(envId=0)[source]

Compute the empirical centralized regret: cumsum on time of the mean rewards of the M best arms - cumsum on time of the empirical rewards obtained by the players, based on accumulated rewards.

getFirstRegretTerm(envId=0)[source]

Extract and compute the first term \((a)\) in the centralized regret: losses due to pulling suboptimal arms.

getSecondRegretTerm(envId=0)[source]

Extract and compute the second term \((b)\) in the centralized regret: losses due to not pulling optimal arms.

getThirdRegretTerm(envId=0)[source]

Extract and compute the third term \((c)\) in the centralized regret: losses due to collisions.

getCentralizedRegret_MoreAccurate(envId=0)[source]

Compute the empirical centralized regret, based on counts of selections and not actual rewards.

getCentralizedRegret(envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

getLastRegrets_LessAccurate(envId=0)[source]

Extract last regrets, based on accumulated rewards.

getAllLastWeightedSelections(envId=0)[source]

Extract weighted count of selections.

getLastRegrets_MoreAccurate(envId=0)[source]

Extract last regrets, based on counts of selections and not actual rewards.

getLastRegrets(envId=0, moreAccurate=None)[source]

Using either the more accurate or the less accurate regret count.

strPlayers(short=False, latex=True)[source]

Get a string of the players and their activations probability for this environment.

__module__ = 'Environment.EvaluatorSparseMultiPlayers'
Environment.EvaluatorSparseMultiPlayers.delayed_play(env, players, horizon, collisionModel, activations, seed=None, repeatId=0)[source]

Helper function for the parallelization.

Environment.EvaluatorSparseMultiPlayers.uniform_in_zero_one()

random() -> x in the interval [0, 1).

Environment.EvaluatorSparseMultiPlayers.with_proba(proba)[source]

True with probability = proba, False with probability = 1 - proba.

Examples:

>>> import random; random.seed(0)
>>> tosses = [with_proba(0.6) for _ in range(10000)]; sum(tosses)
5977
>>> tosses = [with_proba(0.111) for _ in range(100000)]; sum(tosses)
11158
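An illustrative one-line sketch of such a helper (using only the standard library, not necessarily the exact packaged code):

    from random import random

    def with_proba(proba):
        """ Return True with probability proba, False with probability 1 - proba. """
        return random() < proba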
Environment.MAB module

MAB, MarkovianMAB, ChangingAtEachRepMAB, IncreasingMAB, PieceWiseStationaryMAB and NonStationaryMAB classes to wrap the arms of some Multi-Armed Bandit problems.

Such a class has to have at least these methods:

  • draw(armId, t) to draw one sample from that armId at time t,
  • and reprarms() to pretty print the arms (for titles of a plot),
  • and more, see below.

Warning

FIXME it is still a work in progress, I need to add continuously varying environments. See https://github.com/SMPyBandits/SMPyBandits/issues/71

class Environment.MAB.MAB(configuration)[source]

Bases: object

Basic Multi-Armed Bandit problem, for stochastic and i.i.d. arms.

  • configuration can be a dict with ‘arm_type’ and ‘params’ keys. ‘arm_type’ is a class from the Arms module, and ‘params’ is a dict, used as a list/tuple/iterable of named parameters given to ‘arm_type’. Example:

    configuration = {
        'arm_type': Bernoulli,
        'params':   [0.1, 0.5, 0.9]
    }
    
    configuration = {  # for fixed variance Gaussian
        'arm_type': Gaussian,
        'params':   [0.1, 0.5, 0.9]
    }
    
  • But it can also accept a list of already created arms:

    configuration = [
        Bernoulli(0.1),
        Bernoulli(0.5),
        Bernoulli(0.9),
    ]
    
  • Both will create three Bernoulli arms, of parameters (means) 0.1, 0.5 and 0.9.
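For instance, a minimal usage sketch (import paths are indicative of the pip-installed package layout):

    from SMPyBandits.Arms import Bernoulli
    from SMPyBandits.Environment.MAB import MAB

    problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})
    print(problem.nbArms)     # 3
    print(problem.means)      # [0.1 0.5 0.9]
    print(problem.maxArm)     # 0.9
    sample = problem.draw(2)  # one draw from the third (best) arm, 0 or 1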

__init__(configuration)[source]

New MAB.

isChangingAtEachRepetition = None

Flag to know if the problem is changing at each repetition or not.

isDynamic = None

Flag to know if the problem is static or not.

isMarkovian = None

Flag to know if the problem is Markovian or not.

arms = None

List of arms

means = None

Means of arms

nbArms = None

Number of arms

maxArm = None

Max mean of arms

minArm = None

Min mean of arms

new_order_of_arm(arms)[source]

Feed a new order of the arms to the environment.

  • Updates means correctly.
  • Return the new position(s) of the best arm (to count and plot BestArmPulls correctly).

Warning

This is a very limited support of non-stationary environment: only permutations of the arms are allowed, see NonStationaryMAB for more.

__repr__()[source]

Return repr(self).

reprarms(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]

Return a str representation of the list of the arms (like repr(self.arms) but better)

  • If nbPlayers > 0, it surrounds the representation of the best arms by openTag, endTag (for plot titles, in a multi-player setting).
  • Example: openTag = ‘’, endTag = ‘^*’ for LaTeX tags to put a star exponent.
  • Example: openTag = ‘<red>’, endTag = ‘</red>’ for HTML-like tags.
  • Example: openTag = r’\textcolor{red}{‘, endTag = ‘}’ for LaTeX tags.
draw(armId, t=1)[source]

Return a random sample from the armId-th arm, at time t. Usually t is not used.

draw_nparray(armId, shape=(1, ))[source]

Return a numpy array of random sample from the armId-th arm, of a certain shape.

draw_each(t=1)[source]

Return a random sample from each arm, at time t. Usually t is not used.

draw_each_nparray(shape=(1, ))[source]

Return a numpy array of random sample from each arm, of a certain shape.

Mbest(M=1)[source]

Set of M best means.

Mworst(M=1)[source]

Set of M worst means.

sumBestMeans(M=1)[source]

Sum of the M best means.

get_minArm(horizon=None)[source]

Return the vector of min mean of the arms.

  • It is a vector of length horizon.
get_maxArm(horizon=None)[source]

Return the vector of max mean of the arms.

  • It is a vector of length horizon.
get_maxArms(M=1, horizon=None)[source]

Return the vector of sum of the M-best means of the arms.

  • It is a vector of length horizon.
get_allMeans(horizon=None)[source]

Return the vector of means of the arms.

  • It is a numpy array of shape (nbArms, horizon).
sparsity

Estimate the sparsity of the problem, i.e., the number of arms with positive means.

str_sparsity()[source]

Empty string if sparsity = nbArms, or a small string ‘, $s={}$’ if the sparsity is strictly less than the number of arms.

lowerbound()[source]

Compute the constant \(C(\mu)\), for the [Lai & Robbins] lower-bound for this MAB problem (complexity), using functions from kullback.py or kullback.so (see Arms.kullback).

lowerbound_sparse(sparsity=None)[source]

Compute the constant \(C(\mu)\), for [Kwon et al, 2017] lower-bound for sparse bandits for this MAB problem (complexity)

  • I recomputed the (sub-optimal) solution to the optimization problem, and found the same as in [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].
hoifactor()[source]

Compute the HOI factor H_OI(mu), the Optimal Arm Identification (OI) factor, for this MAB problem (complexity). Cf. (3.3) in Navikkumar MODI’s thesis, “Machine Learning and Statistical Decision Making for Green Radio” (2017).

lowerbound_multiplayers(nbPlayers=1)[source]

Compute our multi-players lower bound for this MAB problem (complexity), using functions from kullback.

upperbound_collisions(nbPlayers, times)[source]

Compute Anandkumar et al. multi-players upper bound for this MAB problem (complexity), for UCB only. Warning: it is HIGHLY asymptotic!

plotComparison_our_anandkumar(savefig=None)[source]

Plot a comparison of our lowerbound and their lowerbound.

plotHistogram(horizon=10000, savefig=None, bins=50, alpha=0.9, density=None)[source]

Plot a histogram of horizon=10000 draws of each arm.

__module__ = 'Environment.MAB'
__weakref__

list of weak references to the object (if defined)

Environment.MAB.RESTED = True

Default is rested Markovian.

Environment.MAB.dict_of_transition_matrix(mat)[source]

Convert a transition matrix (list of list or numpy array) to a dictionary mapping (state, state) to probabilities (as used by pykov.Chain).

Environment.MAB.transition_matrix_of_dict(dic)[source]

Convert a dictionary mapping (state, state) to probabilities (as used by pykov.Chain) to a transition matrix (numpy array).

class Environment.MAB.MarkovianMAB(configuration)[source]

Bases: Environment.MAB.MAB

Classic MAB problem but the rewards are drawn from a rested/restless Markov chain.

  • configuration is a dict with rested and transitions keys.
  • rested is a Boolean. See [Kalathil et al., 2012](https://arxiv.org/abs/1206.3582) page 2 for a description.
  • transitions is a list of K transition matrices or dictionaries (to specify non-integer states), one for each arm.

Example:

configuration = {
    "arm_type": "Markovian",
    "params": {
        "rested": True,  # or False
        # Example from [Kalathil et al., 2012](https://arxiv.org/abs/1206.3582) Table 1
        "transitions": [
            # 1st arm, Either a dictionary
            {   # Mean = 0.375
                (0, 0): 0.7, (0, 1): 0.3,
                (1, 0): 0.5, (1, 1): 0.5,
            },
            # 2nd arm, Or a right transition matrix
            [[0.2, 0.8], [0.6, 0.4]],  # Mean = 0.571
        ],
        # FIXME make this by default! include it in MAB.py and not in the configuration!
        "steadyArm": Bernoulli
    }
}
__init__(configuration)[source]

New MarkovianMAB.

isChangingAtEachRepetition = None

The problem is not changing at each repetition.

isDynamic = None

The problem is static.

isMarkovian = None

The problem is Markovian.

rested = None

Rested or not Markovian model?

nbArms = None

Number of arms

means = None

Means of each arm, from their steady distributions.

maxArm = None

Max mean of arms

minArm = None

Min mean of arms

states = None

States of each arm, initially they are all busy

__repr__()[source]

Return repr(self).

reprarms(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]

Return a str representation of the list of the arms (like repr(self.arms) but better).

  • For a Markovian MAB, the chain and the steady Bernoulli arm are represented.
  • If nbPlayers > 0, it surrounds the representation of the best arms by openTag, endTag (for plot titles, in a multi-player setting).
  • Example: openTag = '', endTag = '^*' for LaTeX tags to put a star exponent.
  • Example: openTag = '<red>', endTag = '</red>' for HTML-like tags.
  • Example: openTag = r'\textcolor{red}{', endTag = '}' for LaTeX tags.
draw(armId, t=1)[source]

Move along the Markov chain and return its new state as the reward (0 or 1, or another state value).

  • If rested Markovian, only the state of the Markov chain of arm armId changes. It is the simplest model, and the default one.
  • But if restless (non-rested) Markovian, the Markov chains of all arms change state (not only the one of arm armId).
__module__ = 'Environment.MAB'
Environment.MAB.VERBOSE = False

Whether to be verbose when generating new arms for Dynamic MAB

class Environment.MAB.ChangingAtEachRepMAB(configuration, verbose=False)[source]

Bases: Environment.MAB.MAB

Like a stationary MAB problem, but the arms are (randomly) regenerated for each repetition, with the newRandomArms() method.

  • M.arms and M.means are changed after each call to newRandomArms(), but not nbArms. All the other methods are carefully written to still make sense (Mbest, Mworst, minArm, maxArm).

Warning

It works perfectly fine, but it is still experimental, be careful when using this feature.

Note

Testing bandit algorithms against problems randomly generated at each repetition is usually referred to as “Bayesian problems” in the literature: a prior is set on problems (e.g., uniform on \([0,1]^K\), or a less obvious prior, for instance if a mingap is set), and the performance is assessed against this prior. It differs from the frequentist point of view of having one fixed problem and doing e.g. n=1000 repetitions on the same problem.

__init__(configuration, verbose=False)[source]

New ChangingAtEachRepMAB.

isChangingAtEachRepetition = None

The problem is changing at each repetition.

isDynamic = None

The problem is static.

isMarkovian = None

The problem is not Markovian.

newMeans = None

Function to generate the means

args = None

Args to give to function

nbArms = None

Number of arms.

__repr__()[source]

Return repr(self).

reprarms(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]

Cannot represent the dynamic arms, so print the ChangingAtEachRepMAB object

newRandomArms(t=None, verbose=False)[source]

Generate a new list of arms, from arm_type(params['newMeans'](*params['args'])). A hypothetical configuration sketch follows.
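
A hypothetical configuration sketch, following literally the call described above (the uniform_means function and its positional args tuple are illustrative, not part of the package):

import numpy as np
from Arms import Bernoulli
from Environment.MAB import ChangingAtEachRepMAB

def uniform_means(nbArms, mingap):
    """Draw nbArms means uniformly in [0, 1], pairwise separated by at least mingap."""
    while True:
        mus = np.sort(np.random.rand(nbArms))
        if nbArms < 2 or np.min(np.diff(mus)) >= mingap:
            return list(mus)

configuration = {
    'arm_type': Bernoulli,
    'params': {
        'newMeans': uniform_means,
        'args': (3, 0.05),
    }
}
M = ChangingAtEachRepMAB(configuration)
M.newRandomArms()   # regenerate the three Bernoulli arms for the next repetition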

arms

Return the current list of arms.

means

Return the list of means of arms for this ChangingAtEachRepMAB: after \(x\) calls to newRandomArms(), the returned mean of arm \(k\) is the average of the \(x\) means drawn for that arm.

Warning

Highly experimental!

Mbest(M=1)[source]

Set of M best means (averaged on all the draws of new means).

Mworst(M=1)[source]

Set of M worst means (averaged on all the draws of new means).

minArm

Return the smallest mean of the arms, for a dynamic MAB (averaged on all the draws of new means).

maxArm

Return the largest mean of the arms, for a dynamic MAB (averaged on all the draws of new means).

lowerbound()[source]

Compute the constant C(mu), for [Lai & Robbins] lower-bound for this MAB problem (complexity), using functions from kullback (averaged on all the draws of new means).

hoifactor()[source]

Compute the HOI factor H_OI(mu), the Optimal Arm Identification (OI) factor, for this MAB problem (complexity). Cf. (3.3) in Navikkumar MODI’s thesis, “Machine Learning and Statistical Decision Making for Green Radio” (2017) (averaged on all the draws of new means).

lowerbound_multiplayers(nbPlayers=1)[source]

Compute our multi-players lower bound for this MAB problem (complexity), using functions from kullback.

__module__ = 'Environment.MAB'
class Environment.MAB.PieceWiseStationaryMAB(configuration, verbose=False)[source]

Bases: Environment.MAB.MAB

Like a stationary MAB problem, but piece-wise stationary.

  • Give it a list of vector of means, and a list of change-point locations.
  • You can use plotHistoryOfMeans() to see a nice plot of the history of means.

Note

This is a generic class to implement one “easy” kind of non-stationary bandits: abruptly changing non-stationary bandits, where the change points are fixed and decided in advance.

Warning

It works fine, but it is still experimental, be careful when using this feature.

Warning

The number of arms is fixed, see https://github.com/SMPyBandits/SMPyBandits/issues/123 if you are curious about bandit problems with a varying number of arms (or sleeping bandits where some arms can be enabled or disabled at each time).
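
A hypothetical configuration sketch (the key names listOfMeans and changePoints are taken from the attributes documented below; the Bernoulli import is an assumption):

from Arms import Bernoulli
from Environment.MAB import PieceWiseStationaryMAB

configuration = {
    'arm_type': Bernoulli,
    'params': {
        # one vector of means per stationary interval
        'listOfMeans': [[0.1, 0.5, 0.9],
                        [0.9, 0.5, 0.1]],
        # the second vector of means is used from t = 1000 on
        'changePoints': [0, 1000],
    }
}
M = PieceWiseStationaryMAB(configuration)
M.plotHistoryOfMeans(horizon=2000)   # visual sanity check of the means over time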

__init__(configuration, verbose=False)[source]

New PieceWiseStationaryMAB.

isChangingAtEachRepetition = None

The problem is not changing at each repetition.

isDynamic = None

The problem is dynamic.

isMarkovian = None

The problem is not Markovian.

listOfMeans = None

The list of means

nbArms = None

Number of arms

changePoints = None

List of the change points

__repr__()[source]

Return repr(self).

reprarms(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]

Cannot represent the dynamic arms, so print the PieceWiseStationaryMAB object

newRandomArms(t=None, onlyOneArm=None, verbose=False)[source]

Fake function: there is nothing random here, it just tells the piece-wise stationary MAB problem to move on to the next interval if a change point has been reached.

plotHistoryOfMeans(horizon=None, savefig=None, forceTo01=False, showplot=True, pickleit=False)[source]

Plot the history of means, as a plot with x axis being the time, y axis the mean rewards, and K curves one for each arm.

arms

Return the current list of arms: at time \(t\), the mean of arm \(k\) is the mean during the time interval containing \(t\).

means

Return the list of means of arms for this PieceWiseStationaryMAB: at time \(t\), the mean of arm \(k\) is the mean during the time interval containing \(t\).

minArm

Return the smallest mean of the arms, for the current vector of means.

maxArm

Return the largest mean of the arms, for the current vector of means.

get_minArm(horizon=None)[source]

Return the vector of smallest mean of the arms, for a piece-wise stationary MAB.

  • It is a vector of length horizon.
get_minArms(M=1, horizon=None)[source]

Return the vector of sum of the M-worst means of the arms, for a piece-wise stationary MAB.

  • It is a vector of length horizon.
get_maxArm(horizon=None)[source]

Return the vector of max mean of the arms, for a piece-wise stationary MAB.

  • It is a vector of length horizon.
get_maxArms(M=1, horizon=None)[source]

Return the vector of sum of the M-best means of the arms, for a piece-wise stationary MAB.

  • It is a vector of length horizon.
get_allMeans(horizon=None)[source]

Return the vector of mean of the arms, for a piece-wise stationary MAB.

  • It is a numpy array of shape (nbArms, horizon).
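
For instance, a small sketch (with M a PieceWiseStationaryMAB instance as in the sketch above):

allMeans = M.get_allMeans(horizon=2000)   # numpy array of shape (nbArms, 2000)
bestMeans = allMeans.max(axis=0)          # best mean at each time step
worstMeans = allMeans.min(axis=0)         # worst mean at each time step
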
__module__ = 'Environment.MAB'
class Environment.MAB.NonStationaryMAB(configuration, verbose=False)[source]

Bases: Environment.MAB.PieceWiseStationaryMAB

Like a stationary MAB problem, but the arms can be modified at each time step, with the newRandomArms() method.

  • M.arms and M.means are changed after each call to newRandomArms(), but not nbArms. All the other methods are carefully written to still make sense (Mbest, Mworst, minArm, maxArm).

Note

This is a generic class to implement different kinds of non-stationary bandits:

  • Abruptly changing non-stationary bandits, in different variants: changepoints are randomly drawn (once for all the n repetitions, or at different locations for each repetition).
  • Slowly varying non-stationary bandits, where the underlying mean of each arm is slowly and randomly modified, and a bound on the speed of change (e.g., the Lipschitz constant of \(t \mapsto \mu_i(t)\)) is known.

Warning

It works fine, but it is still experimental, be careful when using this feature.

Warning

The number of arms is fixed, see https://github.com/SMPyBandits/SMPyBandits/issues/123 if you are curious about bandit problems with a varying number of arms (or sleeping bandits where some arms can be enabled or disabled at each time).

__init__(configuration, verbose=False)[source]

New NonStationaryMAB.

isChangingAtEachRepetition = None

The problem is not changing at each repetition.

isDynamic = None

The problem is dynamic.

isMarkovian = None

The problem is not Markovian.

newMeans = None

Function to generate the means

changePoints = None

List of the change points

onlyOneArm = None

None by default, but can be “uniform” to only change one arm at each change point.

args = None

Args to give to function

nbArms = None

Number of arms.

reprarms(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]

Cannot represent the dynamic arms, so print the NonStationaryMAB object

newRandomArms(t=None, onlyOneArm=None, verbose=False)[source]

Generate a new list of arms, from arm_type(params['newMeans'](t, **params['args'])).

  • If onlyOneArm is given and is an integer, the change of mean only occurs for this arm and the others stay the same.
  • If onlyOneArm="uniform", the change of mean only occurs for one arm and the others stay the same, and the changing arm is chosen uniformly at random.

Note

Only the means of the arms change (and so, their order), not their family.

Warning

TODO? So far the only change points we consider are when the means of the arms change, while the family of distributions stays the same. I could implement a more generic way, for instance to be able to test algorithms that detect changes between different families of distributions (e.g., from a Gaussian of variance=1 to a Gaussian of variance=2, with identical or different means).

get_minArm(horizon=None)[source]

Return the vector of smallest mean of the arms, for a non-stationary MAB.

  • It is a vector of length horizon.
get_maxArm(horizon=None)[source]

Return the vector of max mean of the arms, for a non-stationary MAB.

  • It is a vector of length horizon.
get_allMeans(horizon=None)[source]

Return the vector of mean of the arms, for a non-stationary MAB.

  • It is a numpy array of shape (nbArms, horizon).
__module__ = 'Environment.MAB'
Environment.MAB.static_change_lower_amplitude(t, l_t, a_t)[source]

A function called by IncreasingMAB at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\).

  • The first returned value is a boolean: True if a change occurred, False otherwise.
Environment.MAB.L0 = -1

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.A0 = 2

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.DELTA = 0

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.T0 = -1

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.DELTA_T = -1

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.ZOOM = 2

Default value for the doubling_change_lower_amplitude() function.

Environment.MAB.doubling_change_lower_amplitude(t, l_t, a_t, l0=-1, a0=2, delta=0, T0=-1, deltaT=-1, zoom=2)[source]

A function called by IncreasingMAB at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\).

  • At time 0, it forces the use of \(l_0, a_0\) if they are given and not None.
  • At step T0, it reduces \(l_t\) by delta (typically from 0 to -1).
  • Every deltaT steps, it multiplies both \(l_t\) and \(a_t\) by zoom.
  • The first returned value is a boolean: True if a change occurred, False otherwise.
Environment.MAB.default_change_lower_amplitude(t, l_t, a_t, l0=-1, a0=2, delta=0, T0=-1, deltaT=-1, zoom=2)

A function called by IncreasingMAB at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\).

  • At time 0, it forces the use of \(l_0, a_0\) if they are given and not None.
  • At step T0, it reduces \(l_t\) by delta (typically from 0 to -1).
  • Every deltaT steps, it multiplies both \(l_t\) and \(a_t\) by zoom.
  • The first returned value is a boolean: True if a change occurred, False otherwise.
class Environment.MAB.IncreasingMAB(configuration)[source]

Bases: Environment.MAB.MAB

Like a stationary MAB problem, but the range of the rewards is increased from time to time, to test the Policy.WrapRange policy.

  • M.arms and M.means are NOT changed after each call to newRandomArms(), and neither is nbArms.

Warning

It is purely experimental, be careful when using this feature.

__module__ = 'Environment.MAB'
__init__(configuration)[source]

New MAB.

isDynamic = None

Flag to know if the problem is static or not.

draw(armId, t=1)[source]

Return a random sample from the armId-th arm, at time t. Usually t is not used.

Environment.MAB.binomialCoefficient(k, n)[source]

Compute a binomial coefficient \(C^n_k\) by a direct multiplicative method: \(C^n_k = {k \choose n}\).

>>> binomialCoefficient(-3, 10)
0
>>> binomialCoefficient(1, -10)
0
>>> binomialCoefficient(1, 10)
10
>>> binomialCoefficient(5, 10)
80
>>> binomialCoefficient(5, 20)
12960
>>> binomialCoefficient(10, 30)
10886400
Environment.MAB_rotting module

Author: Julien SEZNEC. Code to launch (rotting) bandit games. It is written in a functional programming style: each execution returns arrays related to each run.

Environment.MAB_rotting.repetedRuns(policy, arms, rep=1000, T=10000, parallel=True, oracle=False)[source]
Environment.MAB_rotting.singleRun(policy, arms, T=10000, rep_index=0, oracle=False)[source]
Environment.MAB_rotting.play(arms, policy, T, Oracle=False)[source]
Environment.Result module

Result.Result class to wrap the simulation results.

class Environment.Result.Result(nbArms, horizon, indexes_bestarm=-1, means=None)[source]

Bases: object

Result accumulators.

__init__(nbArms, horizon, indexes_bestarm=-1, means=None)[source]

Create Result.

choices = None

Store all the choices.

rewards = None

Store all the rewards, to compute the mean.

pulls = None

Store the pulls.

indexes_bestarm = None

Also store the position of the best arm, in case of a dynamically switching environment.

running_time = None

Store the running time of the experiment.

memory_consumption = None

Store the memory consumption of the experiment.

number_of_cp_detections = None

Store the number of change points detected during the experiment.

store(time, choice, reward)[source]

Store results.
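
A minimal sketch of how a Result accumulator is typically filled during one simulated run (the uniformly random choice stands in for a real policy, and the imports assume the repository layout):

import numpy as np
from Arms import Bernoulli
from Environment.MAB import MAB
from Environment.Result import Result

horizon, nbArms = 1000, 3
M = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})
result = Result(nbArms, horizon, means=M.means)
for t in range(horizon):
    arm = np.random.randint(nbArms)   # a (dumb) uniformly random "policy", for illustration
    reward = M.draw(arm, t)
    result.store(t, arm, reward)
print(result.pulls)                   # number of times each arm was drawn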

change_in_arms(time, indexes_bestarm)[source]

Store the position of the best arm from this list of arms.

  • From that time t and after, the index of the best arm is stored as indexes_bestarm.

Warning

FIXME This is still experimental!

__dict__ = mappingproxy({'__module__': 'Environment.Result', '__doc__': ' Result accumulators.', '__init__': <function Result.__init__>, 'store': <function Result.store>, 'change_in_arms': <function Result.change_in_arms>, '__dict__': <attribute '__dict__' of 'Result' objects>, '__weakref__': <attribute '__weakref__' of 'Result' objects>})
__module__ = 'Environment.Result'
__weakref__

list of weak references to the object (if defined)

Environment.ResultMultiPlayers module

ResultMultiPlayers.ResultMultiPlayers class to wrap the simulation results, for the multi-players case.

class Environment.ResultMultiPlayers.ResultMultiPlayers(nbArms, horizon, nbPlayers, means=None)[source]

Bases: object

ResultMultiPlayers accumulators, for the multi-players case.

__init__(nbArms, horizon, nbPlayers, means=None)[source]

Create ResultMultiPlayers.

choices = None

Store all the choices of all the players

rewards = None

Store all the rewards of all the players, to compute the mean

pulls = None

Store the pulls of all the players

allPulls = None

Store all the pulls of all the players

collisions = None

Store the collisions on all the arms

running_time = None

Store the running time of the experiment

memory_consumption = None

Store the memory consumption of the experiment

store(time, choices, rewards, pulls, collisions)[source]

Store results.

__dict__ = mappingproxy({'__module__': 'Environment.ResultMultiPlayers', '__doc__': ' ResultMultiPlayers accumulators, for the multi-players case. ', '__init__': <function ResultMultiPlayers.__init__>, 'store': <function ResultMultiPlayers.store>, '__dict__': <attribute '__dict__' of 'ResultMultiPlayers' objects>, '__weakref__': <attribute '__weakref__' of 'ResultMultiPlayers' objects>})
__module__ = 'Environment.ResultMultiPlayers'
__weakref__

list of weak references to the object (if defined)

Environment.fairnessMeasures module

Define some functions to measure the fairness of a vector of cumulated rewards, of shape (nbPlayers, horizon).

Environment.fairnessMeasures.amplitude_fairness(X, axis=0)[source]

(Normalized) Amplitude fairness, homemade formula: \(1 - \min(X, axis) / \max(X, axis)\).

Examples:

>>> import numpy.random as rn; rn.seed(1)  # for reproducibility
>>> X = np.cumsum(rn.rand(10, 1000))
>>> amplitude_fairness(X)  # doctest: +ELLIPSIS
0.999...
>>> amplitude_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.999...
>>> amplitude_fairness(np.log(1 + np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.959...
>>> rn.seed(3)  # for reproducibility
>>> X = rn.randint(0, 10, (10, 1000)); Y = np.cumsum(X, axis=1)
>>> np.min(Y, axis=0)[0], np.max(Y, axis=0)[0]
(3, 9)
>>> np.min(Y, axis=0)[-1], np.max(Y, axis=0)[-1]
(4387, 4601)
>>> amplitude_fairness(Y, axis=0).shape
(1000,)
>>> list(amplitude_fairness(Y, axis=0))  # doctest: +ELLIPSIS
[0.666..., 0.764..., ..., 0.0465...]
>>> X[X >= 3] = 3; Y = np.cumsum(X, axis=1)
>>> np.min(Y, axis=0)[0], np.max(Y, axis=0)[0]
(3, 3)
>>> np.min(Y, axis=0)[-1], np.max(Y, axis=0)[-1]
(2353, 2433)
>>> amplitude_fairness(Y, axis=0).shape
(1000,)
>>> list(amplitude_fairness(Y, axis=0))  # Less spreadout # doctest: +ELLIPSIS
[0.0, 0.5, ..., 0.0328...]
Environment.fairnessMeasures.std_fairness(X, axis=0)[source]

(Normalized) Standard-variation fairness, homemade formula: \(2 * \mathrm{std}(X, axis) / \max(X, axis)\).

Examples:

>>> import numpy.random as rn; rn.seed(1)  # for reproducibility
>>> X = np.cumsum(rn.rand(10, 1000))
>>> std_fairness(X)  # doctest: +ELLIPSIS
0.575...
>>> std_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.594...
>>> std_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.470...
>>> rn.seed(2)  # for reproducibility
>>> X = np.cumsum(rn.randint(0, 10, (10, 100)))
>>> std_fairness(X)  # doctest: +ELLIPSIS
0.570...
>>> std_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.587...
>>> std_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.463...
Environment.fairnessMeasures.rajjain_fairness(X, axis=0)[source]

Raj Jain’s fairness index: \((\sum_{i=1}^{n} x_i)^2 / (n \times \sum_{i=1}^{n} x_i^2)\), projected to \([0, 1]\) instead of \([\frac{1}{n}, 1]\) as introduced in the reference article.

Examples:

>>> import numpy.random as rn; rn.seed(1)  # for reproducibility
>>> X = np.cumsum(rn.rand(10, 1000))
>>> rajjain_fairness(X)  # doctest: +ELLIPSIS
0.248...
>>> rajjain_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.441...
>>> rajjain_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.110...
>>> rn.seed(2)  # for reproducibility
>>> X = np.cumsum(rn.randint(0, 10, (10, 100)))
>>> rajjain_fairness(X)  # doctest: +ELLIPSIS
0.246...
>>> rajjain_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.917...
>>> rajjain_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.107...
Environment.fairnessMeasures.mo_walrand_fairness(X, axis=0, alpha=2)[source]

Mo and Walrand’s family fairness index: \(U_{\alpha}(X)\), NOT projected to \([0, 1]\).

\[\begin{split}U_{\alpha}(X) = \begin{cases} \frac{1}{1 - \alpha} \sum_{i=1}^n x_i^{1 - \alpha} & \;\text{if}\; \alpha\in[0,+\infty)\setminus\{1\}, \\ \sum_{i=1}^{n} \ln(x_i) & \;\text{otherwise}. \end{cases}\end{split}\]

Examples:

>>> import numpy.random as rn; rn.seed(1)  # for reproducibility
>>> X = np.cumsum(rn.rand(10, 1000))
>>> alpha = 0
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
24972857.013...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
82933940429.039...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
471371.219...
>>> alpha = 0.99999
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
1000075176.390...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
1000150358.528...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
1000037587.478...
>>> alpha = 1
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
75173.509...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
150347.019...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
37586.754...
>>> alpha = 1.00001
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
-999924829.359...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
-999849664.476...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
-999962413.957...
>>> alpha = 2
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
-22.346...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
-9.874...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
-283.255...
>>> alpha = 5
>>> mo_walrand_fairness(X, alpha=alpha)  # doctest: +ELLIPSIS
-8.737...
>>> mo_walrand_fairness(X ** 2, alpha=alpha)  # More spreadout  # doctest: +ELLIPSIS
-273.522...
>>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha)  # Less spreadout  # doctest: +ELLIPSIS
-2.468...
Environment.fairnessMeasures.mean_fairness(X, axis=0, methods=(<function amplitude_fairness>, <function std_fairness>, <function rajjain_fairness>))[source]

Fairness index, based on the mean of the 3 fairness measures: amplitude, STD, and Raj Jain fairness.

Examples:

>>> import numpy.random as rn; rn.seed(1)  # for reproducibility
>>> X = np.cumsum(rn.rand(10, 1000))
>>> mean_fairness(X)  # doctest: +ELLIPSIS
0.607...
>>> mean_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.678...
>>> mean_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.523...
>>> rn.seed(2)  # for reproducibility
>>> X = np.cumsum(rn.randint(0, 10, (10, 100)))
>>> mean_fairness(X)  # doctest: +ELLIPSIS
0.605...
>>> mean_fairness(X ** 2)  # More spreadout  # doctest: +ELLIPSIS
0.834...
>>> mean_fairness(np.sqrt(np.abs(X)))  # Less spreadout  # doctest: +ELLIPSIS
0.509...
Environment.fairnessMeasures.fairnessMeasure(X, axis=0, methods=(<function amplitude_fairness>, <function std_fairness>, <function rajjain_fairness>))

Default fairness measure

Environment.fairnessMeasures.fairness_mapping = {'Amplitude': <function amplitude_fairness>, 'Default': <function mean_fairness>, 'Mean': <function mean_fairness>, 'MoWalrand': <function mo_walrand_fairness>, 'RajJain': <function rajjain_fairness>, 'STD': <function std_fairness>}

Mapping of measure names to their functions.
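
For example, a short sketch using the mapping (the cumulated-rewards array is random placeholder data):

import numpy as np
from Environment.fairnessMeasures import fairness_mapping

X = np.cumsum(np.random.rand(4, 1000), axis=1)   # cumulated rewards of 4 players
measure = fairness_mapping["Default"]            # i.e. mean_fairness
print(measure(X, axis=0))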

Environment.memory_consumption module

Tiny module to measure and work on memory consumption.

It defines a utility function to get the memory consumed in the current process or the current thread (getCurrentMemory()), and a function to pretty-print memory sizes (sizeof_fmt()).

It also imports tracemalloc and defines a convenient function that pretty-prints the most costly lines after a run.

>>> return_code = start_tracemalloc()
Starting to trace memory allocations...
>>> # ... run your application ...
>>> display_top_tracemalloc()
<BLANKLINE>
Top 10 lines ranked by memory consumption:
#1: python3.6/doctest.py:1330: 636 B
    compileflags, 1), test.globs)
#2: <doctest __main__[1]>:1: 568 B
    display_top_tracemalloc()
#3: python3.6/doctest.py:1346: 472 B
    if check(example.want, got, self.optionflags):
#4: python3.6/doctest.py:1374: 464 B
    self.report_success(out, test, example, got)
#5: python3.6/doctest.py:1591: 456 B
    got = self._toAscii(got)
#6: ./memory_consumption.py:168: 448 B
    snapshot = tracemalloc.take_snapshot()
#7: python3.6/doctest.py:1340: 440 B
    self._fakeout.truncate(0)
#8: python3.6/doctest.py:1339: 440 B
    got = self._fakeout.getvalue()  # the actual output
#9: python3.6/doctest.py:1331: 432 B
    self.debugger.set_continue() # ==== Example Finished ====
#10: python3.6/doctest.py:251: 89 B
    result = StringIO.getvalue(self)
2 others: 78 B
<BLANKLINE>
Total allocated size: 4.4 KiB
4523

Warning

This is automatically used (for main.py at least) when DEBUGMEMORY=True (CLI environment variable).

Warning

This is experimental and does not work as well on Mac OS X and Windows as it works on GNU/Linux systems.

Environment.memory_consumption.getCurrentMemory(thread=False, both=False)[source]

Get the current memory consumption of the process, or the thread.

  • Example, before and after creating a huge random matrix in Numpy, and asking to invert it:
>>> currentMemory = getCurrentMemory()
>>> print("Consumed {} memory".format(sizeof_fmt(currentMemory)))  # doctest: +SKIP
Consumed 16.8 KiB memory
>>> import numpy as np; x = np.random.randn(1000, 1000)  # doctest: +SKIP
>>> diffMemory = getCurrentMemory() - currentMemory; currentMemory += diffMemory
>>> print("Consumed {} more memory".format(sizeof_fmt(diffMemory)))  # doctest: +SKIP
Consumed 18.8 KiB more memory
>>> y = np.linalg.pinv(x)  # doctest: +SKIP
>>> diffMemory = getCurrentMemory() - currentMemory; currentMemory += diffMemory
>>> print("Consumed {} more memory".format(sizeof_fmt(diffMemory)))  # doctest: +SKIP
Consumed 63.9 KiB more memory

Warning

This is still experimental for multi-threaded code.

Warning

It can break on some systems, see for instance [the issue #142](https://github.com/SMPyBandits/SMPyBandits/issues/142).

Warning

FIXME even on my own system, it works for the last few policies I test, but fails for the first??

Warning

This returns 0 on Microsoft Windows, because the resource module is not available on non-UNIX systems (see https://docs.python.org/3/library/unix.html).

Environment.memory_consumption.sizeof_fmt(num, suffix='B', longsuffix=True, usespace=True, base=1024)[source]

Returns a string representation of the size num.

  • Examples:
>>> sizeof_fmt(1020)
'1020 B'
>>> sizeof_fmt(1024)
'1 KiB'
>>> sizeof_fmt(12011993)
'11.5 MiB'
>>> sizeof_fmt(123456789)
'117.7 MiB'
>>> sizeof_fmt(123456789911)
'115 GiB'

Options include:

  • No space before unit:
>>> sizeof_fmt(123456789911, usespace=False)
'115GiB'
  • French style, with short suffix, the “O” suffix for “octets”, and a base 1000:
>>> sizeof_fmt(123456789911, longsuffix=False, suffix='O', base=1000)
'123.5 GO'
Environment.memory_consumption.start_tracemalloc()[source]

Wrapper function around tracemalloc.start(), to log the start of tracing memory allocation.

Environment.memory_consumption.display_top_tracemalloc(snapshot=None, key_type='lineno', limit=10)[source]

Display detailed information on the limit most costly lines in this memory snapshot.

Environment.notify module

Defines one useful function notify() to (try to) send a desktop notification.

Warning

Experimental support of Mac OS X has been added since #143 (https://github.com/SMPyBandits/SMPyBandits/issues/143).

Environment.notify.PROGRAM_NAME = 'SMPyBandits'

Program name

Environment.notify.ICON_PATH = 'logo.png'

Icon to use

Environment.notify.load_icon()[source]

Load and open the icon.

Environment.notify.has_Notify = False

Whether gi.repository.Notify could be imported.

Environment.notify.notify_gi(body, summary='SMPyBandits', icon='terminal', timeout=5)[source]

Send a notification, with gi.repository.Notify.

  • icon can be “dialog-information”, “dialog-warn”, “dialog-error”.
Environment.notify.notify_cli(body, summary='SMPyBandits', icon='terminal', timeout=5, gnulinux=True)[source]

Send a notification, with a subprocess call to ‘notify-send’.

Environment.notify.notify(body, summary='SMPyBandits', icon='terminal', timeout=5)[source]

Send a notification, using one of the previously defined methods, until one works. Usually it works.
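
For example (a one-line sketch, assuming the repository layout where Environment is importable):

from Environment.notify import notify
notify("Experiment finished!", summary="SMPyBandits")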

Environment.plot_Cmu_HOI module
Environment.plotsettings module

plotsettings: use it like this, in the Environment folder:

>>> import sys; sys.path.insert(0, '..')
>>> from .plotsettings import BBOX_INCHES, signature, maximizeWindow, palette, makemarkers, add_percent_formatter, wraptext, wraplatex, legend, show_and_save, nrows_ncols
Environment.plotsettings.monthyear = 'Mar.2021'

Month.Year date

Environment.plotsettings.signature = ''

A small string to use as a signature

Environment.plotsettings.DPI = 120

DPI to use for the figures

Environment.plotsettings.FIGSIZE = (16, 9)

Figure size, in inches!

Environment.plotsettings.HLS = True

Use the HLS mapping, or HUSL mapping

Environment.plotsettings.VIRIDIS = False

Use the Viridis colormap

Environment.plotsettings.BBOX_INCHES = None

Use this parameter for bbox

Environment.plotsettings.palette(nb, hls=True, viridis=False)[source]

Use a smart palette from seaborn, for nb different plots on the same figure.

>>> palette(10, hls=True)  # doctest: +ELLIPSIS
[(0.86..., 0.37..., 0.33...), (0.86..., 0.65..., 0.33...), (0.78..., 0.86..., 0.33...), (0.49..., 0.86..., 0.33...), (0.33..., 0.86..., 0.46...), (0.33..., 0.86..., 0.74...), (0.33..., 0.68..., 0.86...), (0.33..., 0.40..., 0.86...), (0.56..., 0.33..., 0.86...), (0.84..., 0.33..., 0.86...)]
>>> palette(10, hls=False)  # doctest: +ELLIPSIS
[[0.96..., 0.44..., 0.53...], [0.88..., 0.52..., 0.19...], [0.71..., 0.60..., 0.19...], [0.54..., 0.65..., 0.19...], [0.19..., 0.69..., 0.34...], [0.20..., 0.68..., 0.58...], [0.21..., 0.67..., 0.69...], [0.22..., 0.65..., 0.84...], [0.55..., 0.57..., 0.95...], [0.85..., 0.44..., 0.95...]]
>>> palette(10, viridis=True)  # doctest: +ELLIPSIS
[(0.28..., 0.13..., 0.44...), (0.26..., 0.24..., 0.52...), (0.22..., 0.34..., 0.54...), (0.17..., 0.43..., 0.55...), (0.14..., 0.52..., 0.55...), (0.11..., 0.60..., 0.54...), (0.16..., 0.69..., 0.49...), (0.31..., 0.77..., 0.41...), (0.52..., 0.83..., 0.28...), (0.76..., 0.87..., 0.13...)]
  • To visualize:
>>> sns.palplot(palette(10, hls=True))  # doctest: +SKIP
>>> sns.palplot(palette(10, hls=False))  # use HUSL by default  # doctest: +SKIP
>>> sns.palplot(palette(10, viridis=True))  # doctest: +SKIP
Environment.plotsettings.makemarkers(nb)[source]

Give a list of cycling markers. See http://matplotlib.org/api/markers_api.html

Note

This is what I consider the optimal sequence of markers: they are clearly distinguishable from one another, and all are pretty.

Examples:

>>> makemarkers(7)
['o', 'D', 'v', 'p', '<', 's', '^']
>>> makemarkers(12)
['o', 'D', 'v', 'p', '<', 's', '^', '*', 'h', '>', 'o', 'D']
Environment.plotsettings.PUTATRIGHT = False

Default parameter for legend(): if True, the legend is placed at the right side of the figure, not on it. This is almost mandatory for plots with more than 10 algorithms (good for experimenting, bad for publications).

Environment.plotsettings.SHRINKFACTOR = 0.75

Shrink factor if the legend is displayed on the right of the plot.

Warning

I still don’t really understand how this works. Just decrease it manually if the legend takes more space (i.e., more algorithms with longer names).

Environment.plotsettings.MAXNBOFLABELINFIGURE = 8

Default parameter for maximum number of label to display in the legend INSIDE the figure

Environment.plotsettings.legend(putatright=False, fontsize='large', shrinkfactor=0.75, maxnboflabelinfigure=8, fig=None, title=None)[source]

plt.legend() with good options, cf. http://matplotlib.org/users/recipes.html#transparent-fancy-legends.

Environment.plotsettings.maximizeWindow()[source]

Experimental function to try to maximize a plot.

Warning

This function is still experimental, but “it works on my machine” so I keep it.

Environment.plotsettings.FORMATS = ('png', 'pdf')

List of formats to use for saving the figures, by default. It is a smart idea to save in both a raster and a vector format.

Environment.plotsettings.show_and_save(showplot=True, savefig=None, formats=('png', 'pdf'), pickleit=False, fig=None)[source]

Maximize the window if need to show it, save it if needed, and then show it or close it.

Environment.plotsettings.add_percent_formatter(which='xaxis', amplitude=1.0, oldformatter='%.2g%%', formatter='{x:.1%}')[source]

Small function to use a Percentage formatter for xaxis or yaxis, of a certain amplitude.

  • which can be “xaxis” or “yaxis”,
  • amplitude is a float, default to 1.
  • More detail at http://stackoverflow.com/a/36320013/
  • Note that the use of matplotlib.ticker.PercentFormatter requires matplotlib >= 2.0.1.
  • If it is not available, matplotlib.ticker.StrMethodFormatter(“{:.0%}”) is used instead (see the usage sketch below).
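
A combined usage sketch of the helpers documented above (palette, legend, add_percent_formatter, show_and_save); it assumes add_percent_formatter acts on the current matplotlib axes and that seaborn is installed:

import matplotlib.pyplot as plt
from Environment.plotsettings import palette, legend, add_percent_formatter, show_and_save

fig = plt.figure()
colors = palette(3)   # 3 well-separated colors from seaborn
for i, color in enumerate(colors):
    xs = [t / 100. for t in range(100)]
    ys = [(i + 1) * t / 300. for t in range(100)]
    plt.plot(xs, ys, color=color, label="curve #{}".format(i))
add_percent_formatter("yaxis", 1.0)   # display the y ticks as percentages
legend()
show_and_save(showplot=True, savefig=None, fig=fig)
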
Environment.plotsettings.WIDTH = 95

Default value for the width parameter for wraptext() and wraplatex().

Environment.plotsettings.wraptext(text, width=95)[source]

Wrap the text, using textwrap module, and width.

Environment.plotsettings.wraplatex(text, width=95)[source]

Wrap the text, for LaTeX, using textwrap module, and width.

Environment.plotsettings.nrows_ncols(N)[source]

Return (nrows, ncols) to create a subplots for N plots of the good size.

>>> for N in range(1, 22):
...     nrows, ncols = nrows_ncols(N)
...     print("For N = {:>2}, {} rows and {} cols are enough.".format(N, nrows, ncols))
For N =  1, 1 rows and 1 cols are enough.
For N =  2, 2 rows and 1 cols are enough.
For N =  3, 2 rows and 2 cols are enough.
For N =  4, 2 rows and 2 cols are enough.
For N =  5, 3 rows and 2 cols are enough.
For N =  6, 3 rows and 2 cols are enough.
For N =  7, 3 rows and 3 cols are enough.
For N =  8, 3 rows and 3 cols are enough.
For N =  9, 3 rows and 3 cols are enough.
For N = 10, 4 rows and 3 cols are enough.
For N = 11, 4 rows and 3 cols are enough.
For N = 12, 4 rows and 3 cols are enough.
For N = 13, 4 rows and 4 cols are enough.
For N = 14, 4 rows and 4 cols are enough.
For N = 15, 4 rows and 4 cols are enough.
For N = 16, 4 rows and 4 cols are enough.
For N = 17, 5 rows and 4 cols are enough.
For N = 18, 5 rows and 4 cols are enough.
For N = 19, 5 rows and 4 cols are enough.
For N = 20, 5 rows and 4 cols are enough.
For N = 21, 5 rows and 5 cols are enough.
Environment.plotsettings.addTextForWorstCases(ax, n, bins, patches, rate=0.85, normed=False, fontsize=8)[source]

Add some text labels to the patches of a histogram, for the last ‘rate’%.

Use it like this, to add labels for the bins containing the 65% largest values of n:

>>> n, bins, patches = plt.hist(...)
>>> addTextForWorstCases(ax, n, bins, patches, rate=0.65)
Environment.plotsettings.myviolinplot(*args, nonsymmetrical=False, **kwargs)[source]
Environment.plotsettings.violin_or_box_plot(data=None, labels=None, boxplot=False, **kwargs)[source]

Automatically add labels to a box or violin plot.

Warning

Requires pandas (https://pandas.pydata.org/) to add the xlabel for violin plots.
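
A small usage sketch (random placeholder data; boxplot=True is used here to avoid the pandas requirement mentioned above):

import numpy as np
from Environment.plotsettings import violin_or_box_plot, show_and_save

data = [np.random.randn(100) + shift for shift in (0, 1, 2)]   # one distribution per algorithm
violin_or_box_plot(data=data, labels=["Algo A", "Algo B", "Algo C"], boxplot=True)
show_and_save(showplot=True)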

Environment.plotsettings.MAX_NB_OF_LABELS = 50

If more than MAX_NB_OF_LABELS labels have to be displayed on a boxplot, don’t put a legend.

Environment.plotsettings.adjust_xticks_subplots(ylabel=None, labels=(), maxNbOfLabels=50)[source]

Adjust the size of the xticks, and maybe change size of ylabel.

Environment.plotsettings.table_to_latex(mean_data, std_data=None, labels=None, fmt_function=None, name_of_table=None, filename=None, erase_output=False, *args, **kwargs)[source]

Tries to print the data from the input array, collection of arrays, or pandas.DataFrame to stdout and to the file filename (if it does not already exist).

Warning

FIXME this is still experimental! And useless, most of the time we simply do a copy/paste from the terminal to the LaTeX in the article…

Environment.pykov module

Pykov documentation.

Environment.pykov._del_cache(fn)[source]

Delete cache.

exception Environment.pykov.PykovError(value)[source]

Bases: Exception

Exception definition for Pykov errors.

__init__(value)[source]

Initialize self. See help(type(self)) for accurate signature.

__str__()[source]

Return str(self).

__module__ = 'Environment.pykov'
__weakref__

list of weak references to the object (if defined)

class Environment.pykov.Vector(data=None, **kwargs)[source]

Bases: collections.OrderedDict

__init__(data=None, **kwargs)[source]
>>> pykov.Vector({'A':.3, 'B':.7})
{'A':.3, 'B':.7}
>>> pykov.Vector(A=.3, B=.7)
{'A':.3, 'B':.7}
__getitem__(key)[source]
>>> q = pykov.Vector(C=.4, B=.6)
>>> q['C']
0.4
>>> q['Z']
0.0
__setitem__(key, value)[source]
>>> q = pykov.Vector(C=.4, B=.6)
>>> q['Z']=.2
>>> q
{'C': 0.4, 'B': 0.6, 'Z': 0.2}
>>> q['Z']=0
>>> q
{'C': 0.4, 'B': 0.6}
__mul__(M)[source]
>>> p = pykov.Vector(A=.3, B=.7)
>>> p * 3
{'A': 0.9, 'B': 2.1}
>>> q = pykov.Vector(C=.5, B=.5)
>>> p * q
0.35
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> p * T
{'A': 0.91, 'B': 0.09}
>>> T * p
{'A': 0.42, 'B': 0.3}
__rmul__(M)[source]
>>> p = pykov.Vector(A=.3, B=.7)
>>> 3 * p
{'A': 0.9, 'B': 2.1}
__add__(v)[source]
>>> p = pykov.Vector(A=.3, B=.7)
>>> q = pykov.Vector(C=.5, B=.5)
>>> p + q
{'A': 0.3, 'C': 0.5, 'B': 1.2}
__sub__(v)[source]
>>> p = pykov.Vector(A=.3, B=.7)
>>> q = pykov.Vector(C=.5, B=.5)
>>> p - q
{'A': 0.3, 'C': -0.5, 'B': 0.2}
>>> q - p
{'A': -0.3, 'C': 0.5, 'B': -0.2}
_toarray(el2pos)[source]
>>> p = pykov.Vector(A=.3, B=.7)
>>> el2pos = {'A': 1, 'B': 0}
>>> v = p._toarray(el2pos)
>>> v
array([ 0.7,  0.3])
_fromarray(arr, el2pos)[source]
>>> p = pykov.Vector()
>>> el2pos = {'A': 1, 'B': 0}
>>> v = numpy.array([ 0.7,  0.3])
>>> p._fromarray(v,el2pos)
>>> p
{'A': 0.3, 'B': 0.7}
sort(reverse=False)[source]

List of (state, probability) pairs, sorted according to the probability.

>>> p = pykov.Vector({'A':.3, 'B':.1, 'C':.6})
>>> p.sort()
[('B', 0.1), ('A', 0.3), ('C', 0.6)]
>>> p.sort(reverse=True)
[('C', 0.6), ('A', 0.3), ('B', 0.1)]
normalize()[source]

Normalize the vector so that the entries sum is 1.

>>> p = pykov.Vector({'A':3, 'B':1, 'C':6})
>>> p.normalize()
>>> p
{'A': 0.3, 'C': 0.6, 'B': 0.1}
choose(random_func=None)[source]

Choose a state according to its probability.

>>> p = pykov.Vector(A=.3, B=.7)
>>> p.choose()
'B'

Optionally, a function that generates a random number can be supplied.

>>> def FakeRandom(min, max): return 0.01
>>> p = pykov.Vector(A=.05, B=.4, C=.4, D=.15)
>>> p.choose(FakeRandom)
'A'

entropy()[source]

Return the entropy.

\[H(p) = - \sum_i p_i \ln p_i\]

See also

Khinchin, A. I. Mathematical Foundations of Information Theory Dover, 1957.

>>> p = pykov.Vector(A=.3, B=.7)
>>> p.entropy()
0.6108643020548935
relative_entropy(p)[source]

Return the Kullback-Leibler distance.

\[d(q,p) = \sum_i q_i \ln (q_i/p_i)\]

Note

The Kullback-Leibler distance is not symmetric.

>>> p = pykov.Vector(A=.3, B=.7)
>>> q = pykov.Vector(A=.4, B=.6)
>>> p.relative_entropy(q)
0.02160085414354654
>>> q.relative_entropy(p)
0.022582421084357485
copy()[source]

Return a shallow copy.

>>> p = pykov.Vector(A=.3, B=.7)
>>> q = p.copy()
>>> p['C'] = 1.
>>> q
{'A': 0.3, 'B': 0.7}
sum()[source]

Sum the values.

>>> p = pykov.Vector(A=.3, B=.7)
>>> p.sum()
1.0
dist(v)[source]

Return the distance between the two probability vectors.

\[d(q,p) = \sum_i |q_i - p_i|\]
>>> p = pykov.Vector(A=.3, B=.7)
>>> q = pykov.Vector(C=.5, B=.5)
>>> q.dist(p)
1.0
__module__ = 'Environment.pykov'
class Environment.pykov.Matrix(data=None)[source]

Bases: collections.OrderedDict

__init__(data=None)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
__getitem__(*args)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T[('A','B')]
0.3
>>> T['A','B']
0.3
>>>
0.0
__setitem__(**kwargs)[source]
__delitem__(**kwargs)[source]
pop(**kwargs)[source]
popitem(**kwargs)[source]
clear(**kwargs)[source]
update(**kwargs)[source]
setdefault(**kwargs)[source]
copy()[source]

Return a shallow copy.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> W = T.copy()
>>> T[('B','B')] = 1.
>>> W
{('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7}
__reduce__()[source]

Return state information for pickling

_dok_(el2pos, method='')[source]
_from_dok_(mat, pos2el)[source]
_numpy_mat(el2pos)[source]

Return a numpy.matrix object from a dictionary.

Parameters:

t_ij : the OrderedDict; values must be real numbers, keys should be tuples of two strings.
el2pos : see _map()

_from_numpy_mat(T, pos2el)[source]

Return a dictionary from a numpy.matrix object.

Parameters:

T : the numpy.matrix.
pos2el : see _map()

_el2pos_()[source]
stochastic()[source]

Make a right stochastic matrix.

Set the sum of every row equal to one, raise PykovError if it is not possible.

>>> T = pykov.Matrix({('A','B'): 3, ('A','A'): 7, ('B','A'): .2})
>>> T.stochastic()
>>> T
{('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7}
>>> T[('A','C')]=1
>>> T.stochastic()
pykov.PykovError: 'Zero links from node C'
pred(key=None)[source]

Return the predecessors of a state (if not indicated, of all states). In Matrix notation: return the column of the indicated state.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.pred()
{'A': {'A': 0.7, 'B': 1.0}, 'B': {'A': 0.3}}
>>> T.pred('A')
{'A': 0.7, 'B': 1.0}
succ(key=None)[source]

Return the successors of a state (if not indicated, of all states). In Matrix notation: return the row of the indicated state.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.succ()
{'A': {'A': 0.7, 'B': 0.3}, 'B': {'A': 1.0}}
>>> T.succ('A')
{'A': 0.7, 'B': 0.3}
remove(states)[source]

Return a copy of the Chain, without the indicated states.

Warning

All the links where the states appear are deleted, so that the result will not be in general a stochastic matrix.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.remove(['B'])
{('A', 'A'): 0.7}
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.,
                     ('C','D'): .5, ('D','C'): 1., ('C','B'): .5})
>>> T.remove(['A','B'])
{('C', 'D'): 0.5, ('D', 'C'): 1.0}
states()[source]

Return the set of states.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.states()
{'A', 'B'}
__pow__(n)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T**2
{('A', 'B'): 0.21, ('B', 'A'): 0.70, ('A', 'A'): 0.79, ('B', 'B'): 0.30}
>>> T**0
{('A', 'A'): 1.0, ('B', 'B'): 1.0}
pow(n)[source]
__mul__(v)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T * 3
{('B', 'A'): 3.0, ('A', 'B'): 0.9, ('A', 'A'): 2.1}
>>> p = pykov.Vector(A=.3, B=.7)
>>> T * p
{'A': 0.42, 'B': 0.3}
>>> W = pykov.Matrix({('N', 'M'): 0.5, ('M', 'N'): 0.7,
                      ('M', 'M'): 0.3, ('O', 'N'): 0.5,
                      ('O', 'O'): 0.5, ('N', 'O'): 0.5})
>>> W * W
{('N', 'M'): 0.15, ('M', 'N'): 0.21, ('M', 'O'): 0.35,
 ('M', 'M'): 0.44, ('O', 'M'): 0.25, ('O', 'N'): 0.25,
 ('O', 'O'): 0.5, ('N', 'O'): 0.25, ('N', 'N'): 0.6}
__rmul__(v)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> 3 * T
{('B', 'A'): 3.0, ('A', 'B'): 0.9, ('A', 'A'): 2.1}
__add__(M)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> I = pykov.Matrix({('A','A'):1, ('B','B'):1})
>>> T + I
{('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 1.7, ('B', 'B'): 1.0}
__sub__(M)[source]
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> I = pykov.Matrix({('A','A'):1, ('B','B'):1})
>>> T - I
{('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): -0.3, ('B', 'B'): -1}
trace()[source]

Return the Matrix trace.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.trace()
0.7
eye()[source]

Return the Identity Matrix.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.eye()
{('A', 'A'): 1., ('B', 'B'): 1.}
ones()[source]

Return a Vector instance with entries equal to one.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.ones()
{'A': 1.0, 'B': 1.0}
transpose()[source]

Return the transpose Matrix.

>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.transpose()
{('B', 'A'): 0.3, ('A', 'B'): 1.0, ('A', 'A'): 0.7}
_UMPFPACKSolve(b, x=None, method='UMFPACK_A')[source]

UMFPACK (Unsymmetric MultiFrontal PACKage)

method:

“UMFPACK_A” : \(\mathbf{A} x = b\) (default)
“UMFPACK_At” : \(\mathbf{A}^T x = b\)

A column pre-ordering strategy for the unsymmetric-pattern multifrontal method, T. A. Davis, ACM Transactions on Mathematical Software, vol 30, no. 2, June 2004, pp. 165-195.

__module__ = 'Environment.pykov'
class Environment.pykov.Chain(data=None)[source]

Bases: Environment.pykov.Matrix

move(state, random_func=None)[source]

Do one step from the indicated state, and return the final state.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.move('A')
'B'

Optionally, a function that generates a random number can be supplied.

>>> def FakeRandom(min, max): return 0.01
>>> T.move('A', FakeRandom)
'B'

pow(p, n)[source]

Find the probability distribution after n steps, starting from an initial Vector.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> p = pykov.Vector(A=1)
>>> T.pow(p,3)
{'A': 0.7629999999999999, 'B': 0.23699999999999996}
>>> p * T * T * T
{'A': 0.7629999999999999, 'B': 0.23699999999999996}
steady()[source]

With the assumption of ergodicity, return the steady state.

Note

Inverse iteration method (P is the Markov chain)

\[ \begin{align}\begin{aligned}Q = \mathbf{I} - P\\Q^T x = e\\e = (0,0,\dots,0,1)\end{aligned}\end{align} \]

See also

W. Stewart: Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Chichester, West Sussex, 1994.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.steady()
{'A': 0.7692307692307676, 'B': 0.23076923076923028}
entropy(p=None, norm=False)[source]

Return the Chain entropy, calculated with the indicated probability Vector (the steady state by default).

\[ \begin{align}\begin{aligned}H_i = - \sum_j P_{ij} \ln P_{ij}\\H = \sum_i \pi_i H_i\end{aligned}\end{align} \]

See also

Khinchin, A. I. Mathematical Foundations of Information Theory Dover, 1957.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.entropy()
0.46989561696530169

With normalization, the entropy belongs to [0,1]

>>> T.entropy(norm=True)
0.33895603665233132
mfpt_to(state)[source]

Return the Mean First Passage Times of every state to the indicated state.

See also

Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.

>>> d = {('R', 'N'): 0.25, ('R', 'S'): 0.25, ('S', 'R'): 0.25,
         ('R', 'R'): 0.5, ('N', 'S'): 0.5, ('S', 'S'): 0.5,
         ('S', 'N'): 0.25, ('N', 'R'): 0.5, ('N', 'N'): 0.0}
>>> T = pykov.Chain(d)
>>> T.mfpt_to('R')
{'S': 3.333333333333333, 'N': 2.666666666666667}
adjacency()[source]

Return the adjacency matrix.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.adjacency()
{('B', 'A'): 1, ('A', 'B'): 1, ('A', 'A'): 1}
walk(steps, start=None, stop=None)[source]

Return a random walk of n steps, starting and stopping at the indicated states.

Note

If the starting state is not indicated or is None, it is chosen according to its steady-state probability. If the stopping state is not None, the random walk stops early as soon as it is reached.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.walk(10)
['B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'A']
>>> T.walk(10,'B','B')
['B', 'A', 'A', 'A', 'A', 'A', 'B']
walk_probability(walk)[source]

Given a walk, return the log of its probability.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.walk_probability(['A','A','B','A','A'])
-1.917322692203401
>>> probability = math.exp(-1.917322692203401)
0.147
>>> p = T.walk_probability(['A','B','B','B','A'])
>>> math.exp(p)
0.0
mixing_time(cutoff=0.25, jump=1, p=None)[source]

Return the mixing time.

If the initial distribution (p) is not indicated, then it is set to p = {‘least probable state’: 1}.

Note

The mixing time is calculated here as the number of steps (n) needed to have

\[ \begin{align}\begin{aligned}|p(n)-\pi| < 0.25\\p(n)=p P^n\\\pi=\pi P\end{aligned}\end{align} \]

The parameter jump controls the iteration step; for example, with jump=2, n takes the values 2, 4, 6, 8, ...

>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4,
         ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2,
         ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2}
>>> T = pykov.Chain(d)
>>> T.mixing_time()
2
absorbing_time(transient_set)[source]

Mean number of steps needed to leave the transient set.

Return the Vector tau, where tau[i] is the mean number of steps needed to leave the transient set starting from state i. The parameter transient_set is a subset of nodes.

Note

If the starting point is a Vector p, then it is sufficient to calculate p * tau in order to weight the mean times according to the initial conditions.

>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4,
         ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2,
         ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2}
>>> T = pykov.Chain(d)
>>> p = pykov.Vector({'N':.3, 'S':.7})
>>> tau = T.absorbing_time(p.keys())
>>> p * tau
3.1333333333333329
absorbing_tour(p, transient_set=None)[source]

Return a Vector v, where v[i] is the mean total number of times the process visits the transient state i before leaving the transient set.

Note

v.sum() is equal to p * tau (see absorbing_time() method).

If not specified, the transient set is defined by means of the Vector p.

See also

Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.

>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4,
         ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2,
         ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2}
>>> T = pykov.Chain(d)
>>> p = pykov.Vector({'N':.3, 'S':.7})
>>> T.absorbing_tour(p)
{'S': 2.2666666666666666, 'N': 0.8666666666666669}
fundamental_matrix()[source]

Return the fundamental matrix.

See also

Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.fundamental_matrix()
{('B', 'A'): 0.17751479289940991, ('A', 'B'): 0.053254437869822958,
('A', 'A'): 0.94674556213017902, ('B', 'B'): 0.82248520710059214}
kemeny_constant()[source]

Return the Kemeny constant of the transition matrix.

>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.})
>>> T.kemeny_constant()
1.7692307692307712
accessibility_matrix()[source]

Return the accessibility matrix of the Markov chain.

See also: http://www.ssc.wisc.edu/~jmontgom/commclasses.pdf

is_accessible(i, j)[source]

Return whether state j is accessible from state i.

communicates(i, j)[source]

Return whether states i and j communicate.

communication_classes()[source]

Return a Set of all communication classes of the Markov chain.

See also: http://www.ssc.wisc.edu/~jmontgom/commclasses.pdf

>>> T = pykov.Chain({('A','A'): 1.0, ('B','B'): 1.0})
>>> T.communication_classes()
__module__ = 'Environment.pykov'
Environment.pykov.readmat(filename)[source]

Read an external file and return a Chain.

The file must be of the form:

A A .7
A B .3
B A 1

>>> P = pykov.readmat('/mypath/mat')
>>> P
{('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7}
Environment.pykov.readtrj(filename)[source]

In the case where the Chain instance must be created from a finite trajectory of states, the transition matrix is not fully defined. The function defines the transition probabilities as the maximum-likelihood probabilities calculated along the chain. Given the file /mypath/trj with the following content:

1
1
1
2
1
3

the Chain instance defined from that chain is:

>>> t = pykov.readtrj('/mypath/trj')
>>> t
(1, 1, 1, 2, 1, 3)
>>> p, P = maximum_likelihood_probabilities(t,lag_time=1, separator='0')
>>> p
{1: 0.6666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666}
>>> P
{(1, 2): 0.25, (1, 3): 0.25, (1, 1): 0.5, (2, 1): 1.0, (3, 3): 1.0}
>>> type(P)
<class 'pykov.Chain'>
>>> type(p)
<class 'pykov.Vector'>
Environment.pykov._writefile(mylist, filename)[source]

Export the list to a file.

mylist can be a list of lists.

>>> L = [[2,3],[4,5]]
>>> pykov.writefile(L,'tmp')
>>> l = [1,2]
>>> pykov.writefile(l,'tmp')
Environment.pykov.transitions(trj, nsteps=1, lag_time=1, separator='0')[source]

Return the temporal list of transitions observed.

trj : the symbolic trajectory.
nsteps : number of steps.
lag_time : step length.
separator : the special symbol indicating the presence of sub-trajectories.

>>> trj = [1,2,1,0,2,3,1,0,2,3,2,3,1,2,3]
>>> pykov.transitions(trj,1,1,0)
[(1, 2), (2, 1), (2, 3), (3, 1), (2, 3), (3, 2), (2, 3), (3, 1), (1, 2),
(2, 3)]
>>> pykov.transitions(trj,1,2,0)
[(1, 1), (2, 1), (2, 2), (3, 3), (2, 1), (3, 2), (1, 3)]
>>> pykov.transitions(trj,2,2,0)
[(2, 2, 1), (3, 3, 2), (2, 1, 3)]
Environment.pykov.maximum_likelihood_probabilities(trj, lag_time=1, separator='0')[source]

Return a Chain calculated by means of maximum likelihood probabilities.

Return two objects:

p : a Vector object, the probability distribution over the nodes.
T : a Chain object, the Markov chain.

trj : the symbolic trajectory.
lag_time : number of steps defining a transition.
separator : the special symbol indicating the presence of sub-trajectories.

>>> t = [1,2,3,2,3,2,1,2,2,3,3,2]
>>> p, T = pykov.maximum_likelihood_probabilities(t)
>>> p
{1: 0.18181818181818182, 2: 0.4545454545454546, 3: 0.36363636363636365}
>>> T
{(1, 2): 1.0, (3, 2): 0.7499999999999999, (2, 3): 0.5999999999999999, (3,
3): 0.25, (2, 2): 0.19999999999999998, (2, 1): 0.19999999999999998}
Environment.pykov._remove_dead_branch(transitions_list)[source]

Remove dead branches by inserting a self-loop in every node that has no outgoing links.

>>> trj = [1,2,3,1,2,3,2,2,4,3,5]
>>> tr = pykov.transitions(trj, nsteps=1)
>>> tr
[(1, 2), (2, 3), (3, 1), (1, 2), (2, 3), (3, 2), (2, 2), (2, 4), (4, 3),
(3, 5)]
>>> pykov._remove_dead_branch(tr)
>>> tr
[(1, 2), (2, 3), (3, 1), (1, 2), (2, 3), (3, 2), (2, 2), (2, 4), (4, 3),
(3, 5), (5, 5)]
Environment.pykov._machineEpsilon(func=<class 'float'>)[source]

Should give the same result as numpy.finfo(numpy.float).eps.

Environment.sortedDistance module

sortedDistance: define functions to measure the sortedness of permutations of [0..N-1].

Environment.sortedDistance.weightedDistance(choices, weights, n=None)[source]

Relative difference between the best possible weighted choices and the actual choices.

>>> weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
>>> choices = [8, 6, 5, 2]
>>> weightedDistance(choices, weights)  # not a bad choice  # doctest: +ELLIPSIS
0.8333...
>>> choices = [8, 6, 5, 7]
>>> weightedDistance(choices, weights)  # best choice!  # doctest: +ELLIPSIS
1.000...
>>> choices = [3, 2, 1, 0]
>>> weightedDistance(choices, weights)  # worst choice!  # doctest: +ELLIPSIS
0.3333...
Environment.sortedDistance.manhattan(permutation, comp=None)[source]

A certain measure of sortedness for the list A, based on Manhattan distance.

>>> perm = [0, 1, 2, 3, 4]
>>> manhattan(perm)  # sorted  # doctest: +ELLIPSIS
1.0...
>>> perm = [0, 1, 2, 5, 4, 3]
>>> manhattan(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.777...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]  # doctest: +ELLIPSIS
>>> manhattan(perm)
0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!  # doctest: +ELLIPSIS
>>> manhattan(perm)
0.72
Environment.sortedDistance.kendalltau(permutation, comp=None)[source]

A certain measure of sortedness for the list A, based on Kendall Tau ranking coefficient.

>>> perm = [0, 1, 2, 3, 4]
>>> kendalltau(perm)  # sorted  # doctest: +ELLIPSIS
0.98...
>>> perm = [0, 1, 2, 5, 4, 3]
>>> kendalltau(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.90...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]
>>> kendalltau(perm)  # doctest: +ELLIPSIS
0.211...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!
>>> kendalltau(perm)  # doctest: +ELLIPSIS
0.984...
Environment.sortedDistance.spearmanr(permutation, comp=None)[source]

A certain measure of sortedness for the list A, based on Spearman ranking coefficient.

>>> perm = [0, 1, 2, 3, 4]
>>> spearmanr(perm)  # sorted  # doctest: +ELLIPSIS
1.0...
>>> perm = [0, 1, 2, 5, 4, 3]
>>> spearmanr(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.92...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]
>>> spearmanr(perm)  # doctest: +ELLIPSIS
0.248...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!
>>> spearmanr(perm)  # doctest: +ELLIPSIS
0.986...
Environment.sortedDistance.gestalt(permutation, comp=None)[source]

A certain measure of sortedness for the list A, based on Gestalt pattern matching.

>>> perm = [0, 1, 2, 3, 4]
>>> gestalt(perm)  # sorted  # doctest: +ELLIPSIS
1.0...
>>> perm = [0, 1, 2, 5, 4, 3]
>>> gestalt(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.666...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]
>>> gestalt(perm)  # doctest: +ELLIPSIS
0.4...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!
>>> gestalt(perm)  # doctest: +ELLIPSIS
0.5...
>>> import random
>>> random.seed(0)
>>> ratings = [random.gauss(1200, 200) for i in range(100000)]
>>> gestalt(ratings)  # doctest: +ELLIPSIS
8e-05...
Environment.sortedDistance.meanDistance(permutation, comp=None, methods=(<function manhattan>, <function gestalt>))[source]

A certain measure of sortedness for the list A, based on mean of the 2 distances: manhattan and gestalt.

>>> perm = [0, 1, 2, 3, 4]
>>> meanDistance(perm)  # sorted  # doctest: +ELLIPSIS
1.0
>>> perm = [0, 1, 2, 5, 4, 3]
>>> meanDistance(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.722...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]  # doctest: +ELLIPSIS
>>> meanDistance(perm)
0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!  # doctest: +ELLIPSIS
>>> meanDistance(perm)
0.61

Warning

I removed kendalltau() and spearmanr() as they were giving 100% for many cases where there was clearly no reason to give 100%…

Environment.sortedDistance.sortedDistance(permutation, comp=None, methods=(<function manhattan>, <function gestalt>))

A certain measure of sortedness for the list A, based on mean of the 2 distances: manhattan and gestalt.

>>> perm = [0, 1, 2, 3, 4]
>>> meanDistance(perm)  # sorted  # doctest: +ELLIPSIS
1.0
>>> perm = [0, 1, 2, 5, 4, 3]
>>> meanDistance(perm)  # almost sorted!  # doctest: +ELLIPSIS
0.722...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5]  # doctest: +ELLIPSIS
>>> meanDistance(perm)
0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9]  # better sorted!  # doctest: +ELLIPSIS
>>> meanDistance(perm)
0.61

Warning

I removed kendalltau() and spearmanr() as they were giving 100% for many cases where there was clearly no reason to give 100%…

Environment.usejoblib module

Import Parallel and delayed from joblib, safely.

class Environment.usejoblib.Parallel(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)[source]

Bases: joblib.logger.Logger

Helper class for readable parallel mapping.

Read more in the User Guide.

n_jobs: int, default: None
The maximum number of concurrently running jobs, such as the number of Python worker processes when backend=”multiprocessing” or the size of the thread-pool when backend=”threading”. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. None is a marker for ‘unset’ that will be interpreted as n_jobs=1 (sequential execution) unless the call is performed under a parallel_backend context manager that sets another value for n_jobs.
backend: str, ParallelBackendBase instance or None, default: ‘loky’

Specify the parallelization backend implementation. Supported backends are:

  • “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
  • “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
  • “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).
  • finally, you can register backends by calling register_parallel_backend. This will allow you to implement a backend of your liking.

It is not recommended to hard-code the backend name in a call to Parallel in a library. Instead it is recommended to set soft hints (prefer) or hard constraints (require) so as to make it possible for library users to change the backend from the outside using the parallel_backend context manager.
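
For instance, here is a minimal sketch (assuming joblib is installed) of a library that only passes the soft hint prefer="threads", while the caller overrides the backend from the outside with the parallel_backend context manager:

from joblib import Parallel, delayed, parallel_backend

def work(i):
    # Some cheap function to map in parallel (illustrative only).
    return i * i

# Library code: only a soft hint, no hard-coded backend name.
results = Parallel(n_jobs=2, prefer="threads")(delayed(work)(i) for i in range(8))

# User code: override the backend choice from the outside.
with parallel_backend("loky", n_jobs=4):
    results = Parallel()(delayed(work)(i) for i in range(8))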

prefer: str in {‘processes’, ‘threads’} or None, default: None
Soft hint to choose the default backend if no specific backend was selected with the parallel_backend context manager. The default process-based backend is ‘loky’ and the default thread-based backend is ‘threading’. Ignored if the backend parameter is specified.
require: ‘sharedmem’ or None, default None
Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with parallel_backend.
verbose: int, optional
The verbosity level: if non-zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported.
timeout: float, optional
Timeout limit for each task to complete. If any task takes longer, a TimeOutError will be raised. Only applied when n_jobs != 1.
pre_dispatch: {‘all’, integer, or expression, as in ‘3*n_jobs’}
The number of batches (of tasks) to be pre-dispatched. Default is ‘2*n_jobs’. When batch_size=”auto” this is a reasonable default and the workers should never starve.
batch_size: int or ‘auto’, default: ‘auto’
The number of atomic tasks to dispatch at once to each worker. When individual evaluations are very fast, dispatching calls to workers can be slower than sequential computation because of the overhead. Batching fast computations together can mitigate this. The 'auto' strategy keeps track of the time it takes for a batch to complete, and dynamically adjusts the batch size to keep the time on the order of half a second, using a heuristic. The initial batch size is 1. batch_size="auto" with backend="threading" will dispatch batches of a single task at a time as the threading backend has very little overhead and using larger batch size has not proved to bring any gain in that case.
temp_folder: str, optional

Folder to be used by the pool for memmapping large arrays for sharing memory with worker processes. If None, this will try in order:

  • a folder pointed by the JOBLIB_TEMP_FOLDER environment variable,
  • /dev/shm if the folder exists and is writable: this is a RAM disk filesystem available by default on modern Linux distributions,
  • the default system temporary folder that can be overridden with TMP, TMPDIR or TEMP environment variables, typically /tmp under Unix operating systems.

Only active when backend=”loky” or “multiprocessing”.

max_nbytes int, str, or None, optional, 1M by default
Threshold on the size of arrays passed to the workers that triggers automated memory mapping in temp_folder. Can be an int in Bytes, or a human-readable string, e.g., ‘1M’ for 1 megabyte. Use None to disable memmapping of large arrays. Only active when backend=”loky” or “multiprocessing”.
mmap_mode: {None, ‘r+’, ‘r’, ‘w+’, ‘c’}
Memmapping mode for numpy arrays passed to workers. See ‘max_nbytes’ parameter documentation for more details.

This object uses workers to compute in parallel the application of a function to many different arguments. The main functionality it brings in addition to using the raw multiprocessing or concurrent.futures API are (see examples for details):

  • More readable code, in particular since it avoids constructing lists of arguments.
  • Easier debugging:
    • informative tracebacks even when the error happens on the client side
    • using ‘n_jobs=1’ makes it possible to turn off parallel computing for debugging without changing the codepath
    • early capture of pickling errors
  • An optional progress meter.
  • Interruption of multiprocess jobs with ‘Ctrl-C’
  • Flexible pickling control for the communication to and from the worker processes.
  • Ability to use shared memory efficiently with worker processes for large numpy-based datastructures.

A simple example:

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

Reshaping the output when the function has several return values:

>>> from math import modf
>>> from joblib import Parallel, delayed
>>> r = Parallel(n_jobs=1)(delayed(modf)(i/2.) for i in range(10))
>>> res, i = zip(*r)
>>> res
(0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5)
>>> i
(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0)

The progress meter: the higher the value of verbose, the more messages:

>>> from time import sleep
>>> from joblib import Parallel, delayed
>>> r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10)) #doctest: +SKIP
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.8s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    1.4s finished

Traceback example, note how the line of the error is indicated as well as the values of the parameter passed to the function that triggered the exception, even though the traceback happens in the child process:

>>> from heapq import nlargest
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(nlargest)(2, n) for n in (range(4), 'abcde', 3)) #doctest: +SKIP
#...
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
TypeError                                          Mon Nov 12 11:37:46 2012
PID: 12934                                    Python 2.7.3: /usr/bin/python
...........................................................................
/usr/lib/python2.7/heapq.pyc in nlargest(n=2, iterable=3, key=None)
    419         if n >= size:
    420             return sorted(iterable, key=key, reverse=True)[:n]
    421
    422     # When key is none, use simpler decoration
    423     if key is None:
--> 424         it = izip(iterable, count(0,-1))                    # decorate
    425         result = _nlargest(n, it)
    426         return map(itemgetter(0), result)                   # undecorate
    427
    428     # General case, slowest method
 TypeError: izip argument #1 must support iteration
___________________________________________________________________________

Using pre_dispatch in a producer/consumer situation, where the data is generated on the fly. Note how the producer is first called 3 times before the parallel loop is initiated, and then called to generate new data on the fly:

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> def producer():
...     for i in range(6):
...         print('Produced %s' % i)
...         yield i
>>> out = Parallel(n_jobs=2, verbose=100, pre_dispatch='1.5*n_jobs')(
...                delayed(sqrt)(i) for i in producer()) #doctest: +SKIP
Produced 0
Produced 1
Produced 2
[Parallel(n_jobs=2)]: Done 1 jobs     | elapsed:  0.0s
Produced 3
[Parallel(n_jobs=2)]: Done 2 jobs     | elapsed:  0.0s
Produced 4
[Parallel(n_jobs=2)]: Done 3 jobs     | elapsed:  0.0s
Produced 5
[Parallel(n_jobs=2)]: Done 4 jobs     | elapsed:  0.0s
[Parallel(n_jobs=2)]: Done 6 out of 6 | elapsed:  0.0s remaining: 0.0s
[Parallel(n_jobs=2)]: Done 6 out of 6 | elapsed:  0.0s finished
__init__(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)[source]
depth: int, optional
The depth of objects printed.
__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
_initialize_backend()[source]

Build a process or thread pool and return the number of workers

_effective_n_jobs()[source]
_terminate_backend()[source]
_dispatch(batch)[source]

Queue the batch for computing, with or without multiprocessing

WARNING: this method is not thread-safe: it should be only called indirectly via dispatch_one_batch.

dispatch_next()[source]

Dispatch more data for parallel processing

This method is meant to be called concurrently by the multiprocessing callback. We rely on the thread-safety of dispatch_one_batch to protect against concurrent consumption of the unprotected iterator.

dispatch_one_batch(iterator)[source]

Prefetch the tasks for the next batch and dispatch them.

The effective size of the batch is computed here. If there are no more jobs to dispatch, return False, else return True.

The iterator consumption and dispatching is protected by the same lock so calling this function should be thread safe.

_print(msg, msg_args)[source]

Display the message on stdout or stderr depending on verbosity

print_progress()[source]

Display the progress of the parallel execution, only a fraction of the time, controlled by self.verbose.

__module__ = 'joblib.parallel'
retrieve()[source]
__call__(iterable)[source]

Call self as a function.

__repr__()[source]

Return repr(self).

Environment.usejoblib.delayed(function)[source]

Decorator used to capture the arguments of a function.

Environment.usenumba module

Import numba.jit or a dummy decorator.

Environment.usenumba.USE_NUMBA = False

Configure the use of numba

Environment.usenumba.jit(f)[source]

Fake numba.jit decorator.

Environment.usetqdm module

Import tqdm from tqdm, safely.

class Environment.usetqdm.tqdm(iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, lock_args=None, nrows=None, colour=None, delay=0, gui=False, **kwargs)[source]

Bases: tqdm.utils.Comparable

Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.

monitor_interval = 10
monitor = None
_instances = <_weakrefset.WeakSet object>
static format_sizeof(num, suffix='', divisor=1000)[source]

Formats a number (greater than unity) with SI Order of Magnitude prefixes.

num : float
Number ( >= 1) to format.
suffix : str, optional
Post-postfix [default: ‘’].
divisor : float, optional
Divisor between prefixes [default: 1000].
out : str
Number with Order of Magnitude SI unit postfix.
static format_interval(t)[source]

Formats a number of seconds as a clock time, [H:]MM:SS

t : int
Number of seconds.
out : str
[H:]MM:SS
static format_num(n)[source]

Intelligent scientific notation (.3g).

n : int or float or Numeric
A Number.
out : str
Formatted number.
static status_printer(file)[source]

Manage the printing and in-place updating of a line of characters. Note that if the string is longer than a line, then in-place updating may not work (it will print a new line at each refresh).

static format_meter(n, total, elapsed, ncols=None, prefix='', ascii=False, unit='it', unit_scale=False, rate=None, bar_format=None, postfix=None, unit_divisor=1000, initial=0, colour=None, **extra_kwargs)[source]

Return a string-based progress bar given some parameters

n : int or float
Number of finished iterations.
total : int or float
The expected total number of iterations. If meaningless (None), only basic progress statistics are displayed (no ETA).
elapsed : float
Number of seconds passed since start.
ncols : int, optional
The width of the entire output message. If specified, dynamically resizes {bar} to stay within this bound [default: None]. If 0, will not print any bar (only stats). The fallback is {bar:10}.
prefix : str, optional
Prefix message (included in total width) [default: ‘’]. Use as {desc} in bar_format string.
ascii : bool, optional or str, optional
If not set, use unicode (smooth blocks) to fill the meter [default: False]. The fallback is to use ASCII characters ” 123456789#”.
unit : str, optional
The iteration unit [default: ‘it’].
unit_scale : bool or int or float, optional
If 1 or True, the number of iterations will be printed with an appropriate SI metric prefix (k = 10^3, M = 10^6, etc.) [default: False]. If any other non-zero number, will scale total and n.
rate : float, optional
Manual override for iteration rate. If [default: None], uses n/elapsed.
bar_format : str, optional
Specify a custom bar string formatting. May impact performance. [default: ‘{l_bar}{bar}{r_bar}’], where l_bar=’{desc}: {percentage:3.0f}%|’ and r_bar=’| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]’. Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s, eta.

Note that a trailing “: ” is automatically removed after {desc} if the latter is empty.

postfix : *, optional
Similar to prefix, but placed at the end (e.g. for additional stats). Note: postfix is usually a string (not a dict) for this method, and will if possible be set to postfix = ‘, ‘ + postfix. However other types are supported (#382).
unit_divisor : float, optional
[default: 1000], ignored unless unit_scale is True.
initial : int or float, optional
The initial counter value [default: 0].
colour : str, optional
Bar colour (e.g. ‘green’, ‘#00ff00’).

out : Formatted meter and stats, ready to display.

static __new__(cls, *_, **__)[source]

Create and return a new object. See help(type) for accurate signature.

classmethod _get_free_pos(instance=None)[source]

Skips specified instance.

classmethod _decr_instances(instance)[source]

Remove from list and reposition another unfixed bar to fill the new gap.

This means that by default (where all nested bars are unfixed), order is not maintained but screen flicker/blank space is minimised. (tqdm<=4.44.1 moved ALL subsequent unfixed bars up.)

classmethod write(s, file=None, end='\n', nolock=False)[source]

Print a message via tqdm (without overlap with bars).

classmethod external_write_mode(file=None, nolock=False)[source]

Disable tqdm within the context and refresh tqdm when it exits. Useful when writing to the standard output stream.

classmethod set_lock(lock)[source]

Set the global lock.

classmethod get_lock()[source]

Get the global lock. Construct it if it does not exist.

classmethod pandas(**tqdm_kwargs)[source]
Registers the current tqdm class with
pandas.core. ( frame.DataFrame | series.Series | groupby.(generic.)DataFrameGroupBy | groupby.(generic.)SeriesGroupBy ).progress_apply

A new instance will be created every time progress_apply is called, and each instance will automatically close() upon completion.

tqdm_kwargs : arguments for the tqdm instance

>>> import pandas as pd
>>> import numpy as np
>>> from tqdm import tqdm
>>> from tqdm.gui import tqdm as tqdm_gui
>>>
>>> df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
>>> tqdm.pandas(ncols=50)  # can use tqdm_gui, optional kwargs, etc
>>> # Now you can use `progress_apply` instead of `apply`
>>> df.groupby(0).progress_apply(lambda x: x**2)

<https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations-python>

__init__(iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, lock_args=None, nrows=None, colour=None, delay=0, gui=False, **kwargs)[source]
iterable : iterable, optional
Iterable to decorate with a progressbar. Leave blank to manually manage the updates.
desc : str, optional
Prefix for the progressbar.
total : int or float, optional
The number of expected iterations. If unspecified, len(iterable) is used if possible. If float(“inf”) or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.
leave : bool, optional
If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.
file : io.TextIOWrapper or io.StringIO, optional
Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.
ncols : int, optional
The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).
mininterval : float, optional
Minimum progress display update interval [default: 0.1] seconds.
maxinterval : float, optional
Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.
miniters : int or float, optional
Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.
ascii : bool or str, optional
If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters ” 123456789#”.
disable : bool, optional
Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.
unit : str, optional
String that will be used to define the unit of each iteration [default: it].
unit_scale : bool or int or float, optional
If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.
dynamic_ncols : bool, optional
If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].
smoothing : float, optional
Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].
bar_format : str, optional
Specify a custom bar string formatting. May impact performance. [default: ‘{l_bar}{bar}{r_bar}’], where l_bar=’{desc}: {percentage:3.0f}%|’ and r_bar=’| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]’. Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s, eta.

Note that a trailing “: ” is automatically removed after {desc} if the latter is empty.

initial : int or float, optional
The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.
position : int, optional
Specify the line offset to print this bar (starting from 0) Automatic if unspecified. Useful to manage multiple bars at once (eg, from threads).
postfix : dict or *, optional
Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).
unit_divisor : float, optional
[default: 1000], ignored unless unit_scale is True.
write_bytes : bool, optional
If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.
lock_args : tuple, optional
Passed to refresh for intermediate output (initialisation, iterating, and updating).
nrows : int, optional
The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.
colour : str, optional
Bar colour (e.g. ‘green’, ‘#00ff00’).
delay : float, optional
Don’t display until [default: 0] seconds have elapsed.
gui : bool, optional
WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(…) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].

out : decorated iterator.

__bool__()[source]
__nonzero__()[source]
__len__()[source]
__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
__del__()[source]
__str__()[source]

Return str(self).

_comparable
__hash__()[source]

Return hash(self).

__iter__()[source]

Backward-compatibility to use: for x in tqdm(iterable)

update(n=1)[source]

Manually update the progress bar, useful for streams such as reading files. E.g.:

>>> t = tqdm(total=filesize)  # Initialise
>>> for current_buffer in stream:
...     t.update(len(current_buffer))
>>> t.close()

The last line is highly recommended, but possibly not necessary if t.update() will be called in such a way that filesize will be exactly reached and printed.

n : int or float, optional
Increment to add to the internal counter of iterations [default: 1]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.
out : bool or None
True if a display() was triggered.
close()[source]

Cleanup and (if leave=False) close the progressbar.

clear(nolock=False)[source]

Clear current bar display.

__module__ = 'tqdm.std'
refresh(nolock=False, lock_args=None)[source]

Force refresh the display of this bar.

nolock : bool, optional
If True, does not lock. If [default: False]: calls acquire() on internal lock.
lock_args : tuple, optional
Passed to internal lock’s acquire(). If specified, will only display() if acquire() returns True.
unpause()[source]

Restart tqdm timer from last print time.

reset(total=None)[source]

Resets to 0 iterations for repeated use.

Consider combining with leave=True.

total : int or float, optional. Total to use for the new bar.

set_description(desc=None, refresh=True)[source]

Set/modify description of the progress bar.

desc : str, optional refresh : bool, optional

Forces refresh [default: True].
set_description_str(desc=None, refresh=True)[source]

Set/modify description without ‘: ‘ appended.

set_postfix(ordered_dict=None, refresh=True, **kwargs)[source]

Set/modify postfix (additional stats) with automatic formatting based on datatype.

ordered_dict : dict or OrderedDict, optional refresh : bool, optional

Forces refresh [default: True].

kwargs : dict, optional

set_postfix_str(s='', refresh=True)[source]

Postfix without dictionary expansion, similar to prefix handling.

moveto(n)[source]
format_dict

Public API for read-only member access.

display(msg=None, pos=None)[source]

Use self.sp to display msg in the specified pos.

Consider overloading this function when inheriting to use e.g.: self.some_frontend(**self.format_dict) instead of self.sp.

msg : str, optional. What to display (default: repr(self)).
pos : int, optional. Position to moveto (default: abs(self.pos)).
classmethod wrapattr(stream, method, total=None, bytes=True, **tqdm_kwargs)[source]

stream : file-like object.
method : str, “read” or “write”. The result of read() and the first argument of write() should have a len().
>>> with tqdm.wrapattr(file_obj, "read", total=file_obj.size) as fobj:
...     while True:
...         chunk = fobj.read(chunk_size)
...         if not chunk:
...             break

Policies package

Policies module : contains all the (single-player) bandits algorithms:

Note

The list above might not be complete, see the details below.

All policies have the same interface, as described in BasePolicy, in order to use them in any experiment with the following approach:

my_policy = Policy(nbArms)
my_policy.startGame()  # start the game
for t in range(T):
    chosen_arm_t = k_t = my_policy.choice()  # choose one arm
    reward_t     = ...                       # sample a reward from the arm k_t
    my_policy.getReward(k_t, reward_t)       # give it to the policy
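
As a concrete illustration, here is a small runnable sketch of this loop, assuming the SMPyBandits package is installed and that Policies.UCB follows the BasePolicy interface described above; the Bernoulli means are hand-picked for the example:

import numpy as np
from SMPyBandits.Policies import UCB

means = [0.1, 0.5, 0.9]          # illustrative Bernoulli arm means (not from the docs)
nbArms, T = len(means), 1000
rng = np.random.default_rng(42)

my_policy = UCB(nbArms)
my_policy.startGame()                              # start the game
for t in range(T):
    k_t = my_policy.choice()                       # choose one arm
    reward_t = float(rng.random() < means[k_t])    # sample a Bernoulli reward from arm k_t
    my_policy.getReward(k_t, reward_t)             # give it to the policy

print(my_policy.pulls)   # the arm with mean 0.9 should be pulled the most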
Policies.klucb_mapping = {'Bernoulli': <function klucbBern>, 'Exponential': <function klucbExp>, 'Gamma': <function klucbGamma>, 'Gaussian': <function klucbGauss>, 'Poisson': <function klucbPoisson>}

Maps names of arms to KL functions
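
For example, the mapping can be used to pick the right KL function by arm name (a hedged sketch, assuming the package is installed and exposes the mapping as documented above):

from SMPyBandits.Policies import klucb_mapping

klucb = klucb_mapping["Bernoulli"]   # -> klucbBern
# klucbBern(x, d) returns the KL-UCB upper-confidence bound for an empirical
# mean x and an exploration budget d; here roughly 0.787 for x=0.5, d=0.2.
print(klucb(0.5, 0.2))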

Subpackages
Policies.Experimentals package
Submodules
Policies.Experimentals.BlackBoxOpt module
Policies.Experimentals.KLempUCB module

The Empirical KL-UCB algorithm non-parametric policy. Reference: [Maillard, Munos & Stoltz - COLT, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2012].

class Policies.Experimentals.KLempUCB.KLempUCB(nbArms, maxReward=1.0, lower=0.0, amplitude=1.0)[source]

Bases: IndexPolicy.IndexPolicy

The Empirical KL-UCB algorithm non-parametric policy. References: [Maillard, Munos & Stoltz - COLT, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2012].

__init__(nbArms, maxReward=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

maxReward = None

Known upper bound on the rewards

pulls = None

Keep track of pulls of each arm

obs = None

UNBOUNDED dictionary for each arm: keeps track of how many observations of each reward were seen. Warning: KLempUCB works better for discrete distributions!

startGame()[source]

Initialize the policy for a new game.

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k.

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update count of observations for that arm.

static _KLucb(obs, klMax, debug=False)[source]

Optimization method.

__module__ = 'Policies.Experimentals.KLempUCB'
Policies.Experimentals.ThompsonRobust module

The Thompson (Bayesian) index policy, using an average of several sampled indexes (averageOn of them, 10 by default). By default, it uses a Beta posterior. Reference: [Thompson - Biometrika, 1933].

Policies.Experimentals.ThompsonRobust.AVERAGEON = 10

Default value of how many indexes are computed by sampling the posterior for the ThompsonRobust variant.

class Policies.Experimentals.ThompsonRobust.ThompsonRobust(nbArms, posterior=<class 'Posterior.Beta.Beta'>, averageOn=10, lower=0.0, amplitude=1.0)[source]

Bases: Thompson.Thompson

The Thompson (Bayesian) index policy, using an average of several sampled indexes (averageOn of them, 10 by default). By default, it uses a Beta posterior. Reference: [Thompson - Biometrika, 1933].

__init__(nbArms, posterior=<class 'Posterior.Beta.Beta'>, averageOn=10, lower=0.0, amplitude=1.0)[source]

Create a new Bayesian policy, by creating a default posterior on each arm.

averageOn = None

How many indexes are computed before averaging

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index for this arm, by sampling averageOn times the posterior and returning the average index.

At time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by sampling from the Beta posterior and averaging:

\[\begin{split}I_k(t) &= \frac{1}{\mathrm{averageOn}} \sum_{i=1}^{\mathrm{averageOn}} I_k^{(i)}(t), \\ I_k^{(i)}(t) &\sim \mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)).\end{split}\]
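
A minimal numerical sketch of this averaging step (not the class implementation), using numpy to sample the Beta posterior:

import numpy as np

def robust_thompson_index(S_k, N_k, averageOn=10, seed=None):
    """Average of averageOn samples from Beta(1 + S_k, 1 + N_k - S_k)."""
    rng = np.random.default_rng(seed)
    samples = rng.beta(1 + S_k, 1 + N_k - S_k, size=averageOn)
    return samples.mean()

print(robust_thompson_index(S_k=30, N_k=50, seed=0))   # index for an arm with 30 successes over 50 pulls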
__module__ = 'Policies.Experimentals.ThompsonRobust'
Policies.Experimentals.UCBcython module
Policies.Experimentals.UCBjulia module

The UCB policy for bounded bandits, with UCB indexes computed with Julia. Reference: [Lai & Robbins, 1985].

Warning

Using a Julia function from Python will not speed up anything, as there is a lot of overhead in the “bridge” protocol used by pyjulia. Naively using a tiny Julia function to speed up computations is basically useless.

A naive benchmark showed that with this approach, UCBjulia (used within Python) is about 125 times slower (!) than UCB.

Warning

This is only experimental, and purely useless. See https://github.com/SMPyBandits/SMPyBandits/issues/98

class Policies.Experimentals.UCBjulia.UCBjulia(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: IndexPolicy.IndexPolicy

The UCB policy for bounded bandits, with UCB indexes computed with Julia. Reference: [Lai & Robbins, 1985].

Warning

This is only experimental, and purely useless. See https://github.com/SMPyBandits/SMPyBandits/issues/98

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

Will fail directly if the bridge with julia is unavailable or buggy.

__module__ = 'Policies.Experimentals.UCBjulia'
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \log(t)}{N_k(t)}}.\]
Policies.Experimentals.UCBlog10 module

The UCB policy for bounded bandits, using \(\log_{10}(t)\) and not \(\log(t)\) for the UCB index. Reference: [Lai & Robbins, 1985].

class Policies.Experimentals.UCBlog10.UCBlog10(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: IndexPolicy.IndexPolicy

The UCB policy for bounded bandits, using \(\log_{10}(t)\) and not \(\log(t)\) for the UCB index. Reference: [Lai & Robbins, 1985].

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \log_{10}(t)}{N_k(t)}}.\]
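
A short sketch of this index, assuming \(N_k(t) > 0\) and \(t > 1\):

import numpy as np

def ucb_log10_index(X_k, N_k, t):
    """UCB index with a base-10 logarithm: X_k/N_k + sqrt(2*log10(t)/N_k)."""
    return X_k / N_k + np.sqrt(2.0 * np.log10(t) / N_k)

print(ucb_log10_index(X_k=30.0, N_k=50, t=1000))   # empirical mean 0.6 plus an exploration bonus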
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.Experimentals.UCBlog10'
Policies.Experimentals.UCBlog10alpha module

The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (to reduce collisions in the multi-player setting). Note: it uses \(\log_{10}(t)\) and not \(\log(t)\) for the UCB index. Reference: [Auer et al. 02].

Policies.Experimentals.UCBlog10alpha.ALPHA = 1

Default parameter for alpha

class Policies.Experimentals.UCBlog10alpha.UCBlog10alpha(nbArms, alpha=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Experimentals.UCBlog10.UCBlog10

The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (to reduce collisions in the multi-player setting). Note: it uses \(\log_{10}(t)\) and not \(\log(t)\) for the UCB index. Reference: [Auer et al. 02].

__init__(nbArms, alpha=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter alpha

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log_{10}(t)}{2 N_k(t)}}.\]
__module__ = 'Policies.Experimentals.UCBlog10alpha'
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

Policies.Experimentals.UCBoost_cython module
Policies.Experimentals.UCBoost_faster module
Policies.Experimentals.UCBoost_faster_cython module
Policies.Experimentals.UCBwrong module

The UCBwrong policy for bounded bandits, like UCB but with a typo on the estimator of means: \(\frac{X_k(t)}{t}\) is used instead of \(\frac{X_k(t)}{N_k(t)}\).

A 2009 paper by W. Jouini, C. Moy and J. Palicot contained this typo; I reimplemented it just to check that:

  • its performance is worse than simple UCB,
  • but not that bad…
class Policies.Experimentals.UCBwrong.UCBwrong(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: IndexPolicy.IndexPolicy

The UCBwrong policy for bounded bandits, like UCB but with a typo on the estimator of means.

A 2009 paper by W. Jouini, C. Moy and J. Palicot contained this typo; I reimplemented it just to check that:

  • its performance is worse than simple UCB
  • but not that bad…
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{t} + \sqrt{\frac{2 \log(t)}{N_k(t)}}.\]
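
A short sketch contrasting the correct UCB index with the UCBwrong variant described above (illustrative values only):

import numpy as np

def ucb_index(X_k, N_k, t):
    return X_k / N_k + np.sqrt(2.0 * np.log(t) / N_k)   # correct mean estimator X_k/N_k

def ucb_wrong_index(X_k, N_k, t):
    return X_k / t + np.sqrt(2.0 * np.log(t) / N_k)     # typo: the mean is divided by t

print(ucb_index(30.0, 50, 1000), ucb_wrong_index(30.0, 50, 1000))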
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.Experimentals.UCBwrong'
Policies.Experimentals.UnsupervisedLearning module
Policies.Experimentals.klUCBlog10 module

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].

class Policies.Experimentals.klUCBlog10.klUCBlog10(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: klUCB.klUCB

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log_{10}(t)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).
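
A self-contained sketch of this index for Bernoulli arms (not the package's optimized implementation), computing the supremum by bisection:

import numpy as np

def klBern(x, y, eps=1e-15):
    """Bernoulli KL divergence kl(x, y), clipped for numerical safety."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def klucb_log10_index(X_k, N_k, t, c=1.0, precision=1e-6):
    """Bisection search for sup{ q >= mu_hat : kl(mu_hat, q) <= c*log10(t)/N_k }."""
    mu_hat = X_k / N_k
    bound = c * np.log10(t) / N_k
    low, up = mu_hat, 1.0
    while up - low > precision:
        mid = (low + up) / 2.0
        if klBern(mu_hat, mid) <= bound:
            low = mid
        else:
            up = mid
    return low

print(klucb_log10_index(X_k=30.0, N_k=50, t=1000))   # an upper-confidence bound above the mean 0.6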

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.Experimentals.klUCBlog10'
Policies.Experimentals.klUCBloglog10 module

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].

class Policies.Experimentals.klUCBloglog10.klUCBloglog10(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: klUCB.klUCB

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log_{10}(t) + c \log(\max(1, \log_{10}(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.Experimentals.klUCBloglog10'
Policies.Experimentals.setup module
Policies.Posterior package

Posteriors for Bayesian Index policies:

  • Beta is the default for Thompson Sampling and BayesUCB, ideal for Bernoulli experiments,
  • Gamma and Gauss are better suited for Poisson and Gaussian arms, respectively,
  • DiscountedBeta is the default for Policies.DiscountedThompson (discounted Thompson Sampling), ideal for Bernoulli experiments on non-stationary bandits.
Submodules
Policies.Posterior.Beta module

Manipulate posteriors of Bernoulli/Beta experiments.

Rewards not in \(\{0, 1\}\) are handled with a trick, a “random binarization”, see bernoulliBinarization(), cf. [Agrawal12] (algorithm 2). When a reward \(r_t \in [0, 1]\) is observed, the player receives the result of a Bernoulli sample of mean \(r_t\): \(\tilde{r}_t \sim \mathrm{Bernoulli}(r_t)\), so it is indeed in \(\{0, 1\}\).

[Agrawal12]http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf
Policies.Posterior.Beta.bernoulliBinarization(r_t)[source]

Return a (random) binarization of a reward \(r_t\) in the continuous interval \([0, 1]\), as an observation in the discrete set \(\{0, 1\}\).

  • Useful to allow the use of a Beta posterior for non-Bernoulli experiments,
  • That way, Thompson sampling can be used for any continuous-valued bounded rewards.

Examples:

>>> import random
>>> random.seed(0)
>>> bernoulliBinarization(0.3)
1
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
0
class Policies.Posterior.Beta.Beta(a=1, b=1)[source]

Bases: Policies.Posterior.Posterior.Posterior

Manipulate posteriors of Bernoulli/Beta experiments.

__init__(a=1, b=1)[source]

Create a Beta posterior \(\mathrm{Beta}(\alpha, \beta)\) with no observation, i.e., \(\alpha = 1\) and \(\beta = 1\) by default.

N = None

List of two parameters [a, b]

__str__()[source]

Return str(self).

reset(a=None, b=None)[source]

Reset alpha and beta, both to 1 as when creating a new default Beta.

sample()[source]

Get a random sample from the Beta posterior (using numpy.random.betavariate()).

  • Used only by Thompson Sampling and AdBandits so far.
quantile(p)[source]

Return the p quantile of the Beta posterior (using scipy.stats.btdtri()).

  • Used only by BayesUCB and AdBandits so far.
mean()[source]

Compute the mean of the Beta posterior (should be useless).

forget(obs)[source]

Forget the last observation.

update(obs)[source]

Add an observation.

  • If obs is 1, update \(\alpha\) the count of positive observations,
  • If it is 0, update \(\beta\) the count of negative observations.

Note

Otherwise, a trick with bernoulliBinarization() has to be used.

__module__ = 'Policies.Posterior.Beta'
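
A hedged usage sketch of this posterior interface, assuming the Posterior package exports the Beta class as documented above:

from SMPyBandits.Policies.Posterior import Beta

post = Beta()              # Beta(1, 1) prior, i.e., no observation yet
for r in [1, 0, 1, 1]:     # binary observations
    post.update(r)
print(post.mean())         # posterior mean, here (1 + 3) / (2 + 4) = 2/3
print(post.quantile(0.95)) # upper quantile, as used by BayesUCB
print(post.sample())       # random sample, as used by Thompson Sampling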
Policies.Posterior.Beta.betavariate()

beta(a, b, size=None)

Draw samples from a Beta distribution.

The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function

\[f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]

where the normalization, B, is the beta function,

\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]

It is often seen in Bayesian inference and order statistics.

Note

New code should use the beta method of a default_rng() instance instead; please see the Quick Start.

a : float or array_like of floats
Alpha, positive (>0).
b : float or array_like of floats
Beta, positive (>0).
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a and b are both scalars. Otherwise, np.broadcast(a, b).size samples are drawn.
out : ndarray or scalar
Drawn samples from the parameterized beta distribution.

Generator.beta: which should be used for new code.

Policies.Posterior.Beta.random() → x in the interval [0, 1).
Policies.Posterior.DiscountedBeta module

Manipulate posteriors of Bernoulli/Beta experiments., for discounted Bayesian policies (Policies.DiscountedBayesianIndexPolicy).

Policies.Posterior.DiscountedBeta.GAMMA = 0.95

Default value for the discount factor \(\gamma\in(0,1)\). 0.95 is empirically a reasonable value for short-term non-stationary experiments.

class Policies.Posterior.DiscountedBeta.DiscountedBeta(gamma=0.95, a=1, b=1)[source]

Bases: Policies.Posterior.Beta.Beta

Manipulate posteriors of Bernoulli/Beta experiments, for discounted Bayesian policies (Policies.DiscountedBayesianIndexPolicy).

  • It keeps \(\tilde{S}(t)\) and \(\tilde{F}(t)\) the discounted counts of successes and failures (S and F).
__init__(gamma=0.95, a=1, b=1)[source]

Create a Beta posterior \(\mathrm{Beta}(\alpha, \beta)\) with no observation, i.e., \(\alpha = 1\) and \(\beta = 1\) by default.

N = None

List of two parameters [a, b]

gamma = None

Discount factor \(\gamma\in(0,1)\).

__str__()[source]

Return str(self).

reset(a=None, b=None)[source]

Reset alpha and beta, both to 0 as when creating a new default DiscountedBeta.

sample()[source]

Get a random sample from the DiscountedBeta posterior (using numpy.random.betavariate()).

  • Used only by Thompson Sampling and AdBandits so far.
quantile(p)[source]

Return the p quantile of the DiscountedBeta posterior (using scipy.stats.btdtri()).

  • Used only by BayesUCB and AdBandits so far.
forget(obs)[source]

Forget the last observation, and undiscount the count of observations.

update(obs)[source]

Add an observation, and discount the previous observations.

  • If obs is 1, update \(\alpha\) the count of positive observations,
  • If it is 0, update \(\beta\) the count of negative observations.
  • But instead of using \(\tilde{S}(t) = S(t)\) and \(\tilde{N}(t) = N(t)\), they are updated at each time step using the discount factor \(\gamma\):
\[\begin{split}\tilde{S}(t+1) &= \gamma \tilde{S}(t) + r(t), \\ \tilde{F}(t+1) &= \gamma \tilde{F}(t) + (1 - r(t)).\end{split}\]

Note

Otherwise, a trick with bernoulliBinarization() has to be used.
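
A minimal sketch of this discounted update (names are illustrative, not the class implementation):

def discounted_update(S_tilde, F_tilde, r, gamma=0.95):
    """One step of the discounted update for a binary reward r in {0, 1}."""
    S_tilde = gamma * S_tilde + r
    F_tilde = gamma * F_tilde + (1 - r)
    return S_tilde, F_tilde

S, F = 0.0, 0.0
for r in [1, 1, 0, 1]:
    S, F = discounted_update(S, F, r)
print(S, F)   # discounted counts of successes and failures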

discount()[source]

Simply discount the old observation, when no observation is given at this time.

\[\begin{split}\tilde{S}(t+1) &= \gamma \tilde{S}(t), \\ \tilde{F}(t+1) &= \gamma \tilde{F}(t).\end{split}\]
undiscount()[source]

Simply cancel the discount on the old observation, when no observation is given at this time.

\[\begin{split}\tilde{S}(t+1) &= \frac{1}{\gamma} \tilde{S}(t), \\ \tilde{F}(t+1) &= \frac{1}{\gamma} \tilde{F}(t).\end{split}\]
__module__ = 'Policies.Posterior.DiscountedBeta'
Policies.Posterior.DiscountedBeta.betavariate()

beta(a, b, size=None)

Draw samples from a Beta distribution.

The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function

\[f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]

where the normalization, B, is the beta function,

\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]

It is often seen in Bayesian inference and order statistics.

Note

New code should use the beta method of a default_rng() instance instead; please see the Quick Start.

a : float or array_like of floats
Alpha, positive (>0).
b : float or array_like of floats
Beta, positive (>0).
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if a and b are both scalars. Otherwise, np.broadcast(a, b).size samples are drawn.
out : ndarray or scalar
Drawn samples from the parameterized beta distribution.

Generator.beta: which should be used for new code.

Policies.Posterior.Gamma module

Manipulate a Gamma posterior. No need for tricks to handle non-binary rewards.

class Policies.Posterior.Gamma.Gamma(k=1, lmbda=1)[source]

Bases: Policies.Posterior.Posterior.Posterior

Manipulate a Gamma posterior.

__init__(k=1, lmbda=1)[source]

Create a Gamma posterior, \(\Gamma(k, \lambda)\), with \(k=1\) and \(\lambda=1\) by default.

k = None

Parameter \(k\)

lmbda = None

Parameter \(\lambda\)

__str__()[source]

Return str(self).

reset(k=None, lmbda=None)[source]

Reset k and lmbda, both to 1 as when creating a new default Gamma.

sample()[source]

Get a random sample from the Gamma posterior (using numpy.random.gammavariate()).

  • Used only by Thompson Sampling and AdBandits so far.
quantile(p)[source]

Return the p quantile of the Gamma posterior (using scipy.stats.gdtrix()).

  • Used only by BayesUCB and AdBandits so far.
mean()[source]

Compute the mean of the Gamma posterior (should be useless).

forget(obs)[source]

Forget the last observation.

update(obs)[source]

Add an observation: increase k by k0, and lmbda by obs (obs does not have to be normalized).

__module__ = 'Policies.Posterior.Gamma'
Policies.Posterior.Gamma.gammavariate()

gamma(shape, scale=1.0, size=None)

Draw samples from a Gamma distribution.

Samples are drawn from a Gamma distribution with specified parameters, shape (sometimes designated “k”) and scale (sometimes designated “theta”), where both parameters are > 0.

Note

New code should use the gamma method of a default_rng() instance instead; please see the Quick Start.

shape : float or array_like of floats
The shape of the gamma distribution. Must be non-negative.
scale : float or array_like of floats, optional
The scale of the gamma distribution. Must be non-negative. Default is equal to 1.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if shape and scale are both scalars. Otherwise, np.broadcast(shape, scale).size samples are drawn.
out : ndarray or scalar
Drawn samples from the parameterized gamma distribution.
scipy.stats.gamma : probability density function, distribution or
cumulative density function, etc.

Generator.gamma: which should be used for new code.

The probability density for the Gamma distribution is

\[p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},\]

where \(k\) is the shape and \(\theta\) the scale, and \(\Gamma\) is the Gamma function.

The Gamma distribution is often used to model the times to failure of electronic components, and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.

[1]Weisstein, Eric W. “Gamma Distribution.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/GammaDistribution.html
[2]Wikipedia, “Gamma distribution”, https://en.wikipedia.org/wiki/Gamma_distribution

Draw samples from the distribution:

>>> shape, scale = 2., 2.  # mean=4, std=2*sqrt(2)
>>> s = np.random.gamma(shape, scale, 1000)

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps  # doctest: +SKIP
>>> count, bins, ignored = plt.hist(s, 50, density=True)
>>> y = bins**(shape-1)*(np.exp(-bins/scale) /  # doctest: +SKIP
...                      (sps.gamma(shape)*scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')  # doctest: +SKIP
>>> plt.show()
Policies.Posterior.Gauss module

Manipulate a posterior of Gaussian experiments, which happens to also be a Gaussian distribution if the prior is Gaussian. Easy peasy!

Warning

TODO I have to test it!

class Policies.Posterior.Gauss.Gauss(mu=0.0)[source]

Bases: Policies.Posterior.Posterior.Posterior

Manipulate a posterior of Gaussian experiments, which happens to also be a Gaussian distribution if the prior is Gaussian.

The posterior distribution is a \(\mathcal{N}(\hat{\mu_k}(t), \hat{\sigma_k}^2(t))\), where

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \hat{\sigma}_k^2(t) &= \frac{1}{N_k(t)}.\end{split}\]

Warning

This works only for prior with a variance \(\sigma^2=1\) !
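
A tiny sketch of these posterior parameters, computed from the observed rewards of one arm (a simplification under the unit-variance assumption above, not the class implementation):

def gauss_posterior_params(rewards):
    """Return (mu_hat, sigma2_hat) = (sum of rewards / N, 1 / N)."""
    N = len(rewards)
    return sum(rewards) / N, 1.0 / N

print(gauss_posterior_params([0.2, 0.5, 0.8, 0.4]))   # -> (0.475, 0.25)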

__init__(mu=0.0)[source]

Create a posterior assuming the prior is \(\mathcal{N}(\mu, 1)\).

  • The prior is centered (\(\mu=0\)) by default, but the parameter mu can be used to change this default.
mu = None

Parameter \(\mu\) of the posterior

sigma = None

The parameter \(\sigma\) of the posterior

__str__()[source]

Return str(self).

reset(mu=None)[source]

Reset the two parameters \(\mu, \sigma\), as when creating a new Gauss posterior.

sample()[source]

Get a random sample \((x, \sigma^2)\) from the Gaussian posterior (using scipy.stats.invgamma() for the variance \(\sigma^2\) parameter and numpy.random.normal() for the mean \(x\)).

  • Used only by Thompson Sampling and AdBandits so far.
quantile(p)[source]

Return the p-quantile of the Gauss posterior.

Note

It now works fine with Policies.BayesUCB with Gauss posteriors, even if it is MUCH SLOWER than the Bernoulli posterior (Beta).

mean()[source]

Compute the mean, \(\mu\) of the Gauss posterior (should be useless).

variance()[source]

Compute the variance, \(\sigma\), of the Gauss posterior (should be useless).

update(obs)[source]

Add an observation \(x\) or a vector of observations, assumed to be drawn from an unknown normal distribution.

forget(obs)[source]

Forget the last observation. Should work, but should also not be used…

__module__ = 'Policies.Posterior.Gauss'
Policies.Posterior.Gauss.normalvariate()

normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently [2], is often called the bell curve because of its characteristic shape (see the example below).

The normal distributions occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution [2].

Note

New code should use the normal method of a default_rng() instance instead; please see the Quick Start.

loc : float or array_like of floats
Mean (“centre”) of the distribution.
scale : float or array_like of floats
Standard deviation (spread or “width”) of the distribution. Must be non-negative.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.
out : ndarray or scalar
Drawn samples from the parameterized normal distribution.
scipy.stats.norm : probability density function, distribution or
cumulative density function, etc.

Generator.normal: which should be used for new code.

The probability density for the Gaussian distribution is

\[p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },\]

where \(\mu\) is the mean and \(\sigma\) the standard deviation. The square of the standard deviation, \(\sigma^2\), is called the variance.

The function has its peak at the mean, and its “spread” increases with the standard deviation (the function reaches 0.607 times its maximum at \(x + \sigma\) and \(x - \sigma\) [2]). This implies that normal is more likely to return samples lying close to the mean, rather than those far away.

[1]Wikipedia, “Normal distribution”, https://en.wikipedia.org/wiki/Normal_distribution
[2](1, 2, 3) P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th ed., 2001, pp. 51, 51, 125.

Draw samples from the distribution:

>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)

Verify the mean and the variance:

>>> abs(mu - np.mean(s))
0.0  # may vary
>>> abs(sigma - np.std(s, ddof=1))
0.1  # may vary

Display the histogram of the samples, along with the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()

Two-by-four array of samples from N(3, 6.25):

>>> np.random.normal(3, 2.5, size=(2, 4))
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random
Policies.Posterior.Posterior module

Base class for a posterior. Cf. http://chercheurs.lille.inria.fr/ekaufman/NIPS13 Fig.1 for a list of posteriors.

class Policies.Posterior.Posterior.Posterior(*args, **kwargs)[source]

Bases: object

Manipulate posteriors for experiments.

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

reset(*args, **kwargs)[source]

Reset posterior, new experiment.

sample()[source]

Sample from the posterior.

quantile(p)[source]

Return the p-quantile of the posterior.

mean()[source]

Mean of the posterior.

forget(obs)[source]

Forget last observation (never used).

update(obs)[source]

Update posterior with this observation.

__dict__ = mappingproxy({'__module__': 'Policies.Posterior.Posterior', '__doc__': ' Manipulate posteriors experiments.', '__init__': <function Posterior.__init__>, 'reset': <function Posterior.reset>, 'sample': <function Posterior.sample>, 'quantile': <function Posterior.quantile>, 'mean': <function Posterior.mean>, 'forget': <function Posterior.forget>, 'update': <function Posterior.update>, '__dict__': <attribute '__dict__' of 'Posterior' objects>, '__weakref__': <attribute '__weakref__' of 'Posterior' objects>})
__module__ = 'Policies.Posterior.Posterior'
__weakref__

list of weak references to the object (if defined)
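To illustrate this interface, here is a minimal sketch of a concrete subclass, a simplified Beta posterior for rewards in [0, 1] (an illustration only, not the package’s Policies.Posterior.Beta class):

import numpy as np
from scipy.stats import beta as beta_dist
from Policies.Posterior.Posterior import Posterior

class SimpleBetaPosterior(Posterior):
    """Minimal Beta posterior for Bernoulli-like rewards (illustration only)."""
    def __init__(self, a=1.0, b=1.0):
        self._a0, self._b0 = a, b
        self.reset()
    def reset(self):
        self.a, self.b = self._a0, self._b0
    def sample(self):
        return np.random.beta(self.a, self.b)
    def quantile(self, p):
        return beta_dist.ppf(p, self.a, self.b)
    def mean(self):
        return self.a / (self.a + self.b)
    def update(self, obs):
        self.a += obs       # pseudo-count of successes
        self.b += 1 - obs   # pseudo-count of failures
    def forget(self, obs):
        self.a -= obs
        self.b -= 1 - obs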

Policies.Posterior.with_proba module

Simply defines a function with_proba() that is used everywhere.

Policies.Posterior.with_proba.with_proba(epsilon)[source]

Bernoulli test, with probability \(\varepsilon\), return True, and with probability \(1 - \varepsilon\), return False.

Example:

>>> from random import seed; seed(0)  # reproductible
>>> with_proba(0.5)
False
>>> with_proba(0.9)
True
>>> with_proba(0.1)
False
>>> if with_proba(0.2):
...     print("This happens 20% of the time.")
Policies.Posterior.with_proba.random() → x in the interval [0, 1).
Submodules
Policies.AdBandits module

The AdBandits bandit algorithm, mixing Thompson Sampling and BayesUCB.

Warning

This policy is not very famous, but for stochastic bandits it usually works VERY WELL! It is not anytime, though.

Policies.AdBandits.ALPHA = 1

Default value for the parameter \(\alpha\) for the AdBandits class.

class Policies.AdBandits.AdBandits(nbArms, horizon=1000, alpha=1, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The AdBandits bandit algorithm, mixing Thompson Sampling and BayesUCB.

Warning

This policy is not very famous, but for stochastic bandits it usually works VERY WELL! It is not anytime, though.

__init__(nbArms, horizon=1000, alpha=1, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Parameter \(\alpha\).

horizon = None

Parameter \(T\) = known horizon of the experiment. Default value is 1000.

posterior = None

Posterior for each arm. Stored as a list instead of a dict, for quicker access.

__str__()[source]

-> str

startGame()[source]

Reset each posterior.

getReward(arm, reward)[source]

Store the reward, and update the posterior for that arm.

epsilon

Time-varying parameter \(\varepsilon(t)\).

choice()[source]

With probability \(1 - \varepsilon(t)\), use a Thompson Sampling step, otherwise use a UCB-Bayes step, to choose one arm.
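A sketch of that mixing step, where thompson_choice() and bayes_ucb_choice() are hypothetical callables standing in for the two sub-steps:

import numpy as np

def mixed_choice(epsilon_t, thompson_choice, bayes_ucb_choice):
    """With probability 1 - epsilon_t, do a Thompson Sampling step, otherwise a UCB-Bayes step (sketch)."""
    if np.random.random() < 1.0 - epsilon_t:
        return thompson_choice()   # sample each posterior and play the argmax
    return bayes_ucb_choice()      # play the arm with the largest posterior quantile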

choiceWithRank(rank=1)[source]

With probability \(1 - \varepsilon(t)\), use a Thompson Sampling step, otherwise use a UCB-Bayes step, to choose one arm of a certain rank.

__module__ = 'Policies.AdBandits'
Policies.AdBandits.random() → x in the interval [0, 1).
Policies.AdSwitch module

The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)

  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

This implementation is still experimental!

class Policies.AdSwitch.Phase

Bases: enum.Enum

Different phases during the AdSwitch algorithm

Checking = 2
Estimation = 1
Exploitation = 3
__module__ = 'Policies.AdSwitch'
Policies.AdSwitch.mymean(x)[source]

Simply numpy.mean() on x if x is non-empty, otherwise 0.0, which avoids the warning and the nan returned by numpy.mean() on an empty input:

>>> np.mean([])
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
nan
>>> mymean([])
0.0
Policies.AdSwitch.Constant_C1 = 1.0

Default value for the constant \(C_1\). Should be \(>0\) and as large as possible, but not too large.

Policies.AdSwitch.Constant_C2 = 1.0

Default value for the constant \(C_2\). Should be \(>0\) and as large as possible, but not too large.

class Policies.AdSwitch.AdSwitch(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)

__init__(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]

New policy.

horizon = None

Parameter \(T\) for the AdSwitch algorithm, the horizon of the experiment. TODO: try to use DoublingTrickWrapper to remove the dependency on \(T\)?

C1 = None

Parameter \(C_1\) for the AdSwitch algorithm.

C2 = None

Parameter \(C_2\) for the AdSwitch algorithm.

phase = None

Current phase, exploration or exploitation.

current_exploration_arm = None

Currently explored arm. It cycles uniformly, in step 2.

current_exploitation_arm = None

Currently exploited arm. It is \(\overline{a_k}\) in the algorithm.

batch_number = None

Number of batches.

last_restart_time = None

Time step of the last restart (beginning of phase of Estimation)

length_of_current_phase = None

Length of the current tests phase, computed as \(s_i\), with compute_di_pi_si().

step_of_current_phase = None

Timer inside the current phase.

current_best_arm = None

Current best arm, when finishing step 3. Denoted \(\overline{a_k}\) in the algorithm.

current_worst_arm = None

Current worst arm, when finishing step 3. Denoted \(\underline{a_k}\) in the algorithm.

current_estimated_gap = None

Gap between the current best and worst arms, i.e., the largest gap, when finishing step 3. Denoted \(\widehat{\Delta_k}\) in the algorithm.

last_used_di_pi_si = None

Memory of the currently used \((d_i, p_i, s_i)\).

all_rewards = None

Memory of all the rewards. A dictionary per arm, mapping time to rewards. Growing size until restart of that arm!

__str__()[source]

-> str

startGame()[source]

Start the game (fill pulls and rewards with 0).

getReward(arm, reward)[source]

Get a reward from an arm.

read_range_of_rewards(arm, start, end)[source]

Read the all_rewards attribute to extract all the rewards for that arm, obtained between time start (included) and end (not included).

statistical_test(t, t0)[source]

Test if at time \(t\) there is a \(\sigma\), \(t_0 \leq \sigma < t\), and a pair of arms \(a,b\), satisfying this test:

\[| \hat{\mu_a}[\sigma,t] - \hat{\mu_b}[\sigma,t] | > \sqrt{\frac{C_1 \log T}{t - \sigma}}.\]

where \(\hat{\mu_a}[t_1,t_2]\) is the empirical mean for arm \(a\) for samples obtained from times \(t \in [t_1,t_2)\).

  • Return True, sigma, where sigma is the smallest \(\sigma\) satisfying the test, if the test is satisfied, or False, None otherwise.
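A direct (unoptimized) sketch of this test, assuming a helper empirical_mean(arm, s, t) that averages the rewards of an arm obtained at times in \([s, t)\):

import numpy as np

def statistical_test_sketch(t, t0, nbArms, C1, T, empirical_mean):
    """Return (True, sigma) for the smallest sigma satisfying the test, or (False, None) (sketch)."""
    for sigma in range(t0, t):
        threshold = np.sqrt(C1 * np.log(T) / (t - sigma))
        means = [empirical_mean(a, sigma, t) for a in range(nbArms)]
        if max(means) - min(means) > threshold:  # some pair (a, b) exceeds the threshold
            return True, sigma
    return False, None
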
find_Ik()[source]

Follow the algorithm and, with a gap estimate \(\widehat{\Delta_k}\), find \(I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}\), where \(d_i := 2^{-i}\). There is no need to do an exhaustive search:

\[I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.\]
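For instance, with a hypothetical gap estimate \(\widehat{\Delta_k} = 0.3\):

>>> import numpy as np
>>> int(np.floor(-np.log2(0.3)))  # d_1 = 0.5 >= 0.3 > d_2 = 0.25, so I_k = 1
1
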
__module__ = 'Policies.AdSwitch'
compute_di_pi_si()[source]

Compute the values of \(d_i\), \(p_{k,i}\), \(s_i\) according to the AdSwitch algorithm.

choice()[source]

Choose an arm following the different phase of growing lengths according to the AdSwitch algorithm.

Policies.AdSwitchNew module

The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)

  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

This implementation is still experimental!

Policies.AdSwitchNew.mymean(x)[source]

Simply numpy.mean() on x if x is non-empty, otherwise 0.0, which avoids the warning and the nan returned by numpy.mean() on an empty input:

>>> np.mean([])
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
nan
>>> mymean([])
0.0
Policies.AdSwitchNew.Constant_C1 = 16.1

Default value for the constant \(C_1\). Should be \(>0\) and as large as possible, but not too large. In their paper, in Section 4.2 (page 8), an inequality controls \(C_1\): Equation (5) states that for all \(s', t'\), \(C_1 > 8 (2n - 1)/n\) where \(n = n_{[s',t']}\), so any \(C_1 \geq 16\) works (hence the default 16.1).

Policies.AdSwitchNew.DELTA_T = 50

A small trick to speed up the computations: the checks for changes of good/bad arms are only performed every DELTA_T steps.

Policies.AdSwitchNew.DELTA_S = 20

A small trick to speed up the computations: the loops on \(s_1\), \(s_2\) and \(s\) use steps of size DELTA_S.

class Policies.AdSwitchNew.AdSwitchNew(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)

__init__(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]

New policy.

horizon = None

Parameter \(T\) for the AdSwitchNew algorithm, the horizon of the experiment. TODO: try to use DoublingTrickWrapper to remove the dependency on \(T\)?

C1 = None

Parameter \(C_1\) for the AdSwitchNew algorithm.

delta_s = None

Parameter \(\delta_s\) for the AdSwitchNew algorithm.

delta_t = None

Parameter \(\delta_t\) for the AdSwitchNew algorithm.

ell = None

Variable \(\ell\) in the algorithm. Counts the number of new episodes.

start_of_episode = None

Variable \(t_l\) in the algorithm. Stores the starting time of the current episode.

set_GOOD = None

Variable \(\mathrm{GOOD}_t\) in the algorithm. Set of “good” arms at current time.

set_BAD = None

Variable \(\mathrm{BAD}_t\) in the algorithm. Set of “bad” arms at current time. It always satisfies \(\mathrm{BAD}_t = \{1,\dots,K\} \setminus \mathrm{GOOD}_t\).

set_S = None

Variable \(S_t\) in the algorithm. A list of sets of sampling obligations of arm \(a\) at current time.

mu_tilde_of_l = None

Vector of variables \(\tilde{\mu}_{\ell}(a)\) in the algorithm. Stores the empirical average of arm \(a\).

gap_Delta_tilde_of_l = None

Vector of variables \(\tilde{\Delta}_{\ell}(a)\) in the algorithm. Stores the estimate of the gap of arm \(a\) against the best of the “good” arms.

all_rewards = None

Memory of all the rewards. A dictionary per arm, mapping time to rewards. Growing size until restart of that arm!

history_of_plays = None

Memory of all the past actions played!

__str__()[source]

-> str

new_episode()[source]

Start a new episode, lines 3-6 of the algorithm.

startGame()[source]

Start the game (fill pulls and rewards with 0).

check_changes_good_arms()[source]

Check for changes of good arms.

  • I moved this into a function, in order to stop the 4 for loops (good_arm, s_1, s_2, s) as soon as a change was detected (early stopping).
  • TODO this takes a crazy O(K t^3) time, it HAS to be done faster!
check_changes_bad_arms()[source]

Check for changes of bad arms, in O(K t).

  • I moved this into a function, in order to stop the 2 for loops (good_arm, s) as soon as a change was detected (early stopping).
getReward(arm, reward)[source]

Get a reward from an arm.

n_s_t(arm, s, t)[source]

Compute \(n_{[s,t]}(a) := \#\{\tau : s \leq \tau \leq t, a_{\tau} = a \}\), naively by using the dictionary of all plays all_rewards.

mu_hat_s_t(arm, s, t)[source]

Compute \(\hat{\mu}_{[s,t]}(a) := \frac{1}{n_{[s,t]}(a)} \sum_{\tau : s \leq \tau \leq t, a_{\tau} = a} r_\tau\), naively by using the dictionary of all plays all_rewards.
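Assuming all_rewards[arm] is a dictionary mapping each time step to the reward obtained when that arm was played, a naive sketch of both computations:

def n_s_t_sketch(all_rewards, arm, s, t):
    """Number of times the arm was played between times s and t, both included (sketch)."""
    return sum(1 for tau in all_rewards[arm] if s <= tau <= t)

def mu_hat_s_t_sketch(all_rewards, arm, s, t):
    """Empirical mean of the rewards of the arm between times s and t, both included (sketch)."""
    values = [r for tau, r in all_rewards[arm].items() if s <= tau <= t]
    return sum(values) / len(values) if values else 0.0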

__module__ = 'Policies.AdSwitchNew'
find_max_i(gap)[source]

Follow the algorithm and, with a gap estimate \(\widehat{\Delta_k}\), find \(I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}\), where \(d_i := 2^{-i}\). There is no need to do an exhaustive search:

\[I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.\]
choice()[source]

Choose an arm following the different phase of growing lengths according to the AdSwitchNew algorithm.

Policies.Aggregator module

My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.

The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).

  • At every step, the prediction of every slave is gathered, and a vote is done to decide A’s decision.
  • The vote is simply a majority vote, weighted by a trust probability. If \(A_i\) decides arm \(I_i\), then the probability of selecting \(k\) is the sum of trust probabilities, \(P_i\), of every \(A_i\) for which \(I_i = k\).
  • The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm \(k\) (the reward), the trust in each slave \(A_i\) is updated: \(P_i\) increases if \(A_i\) advised \(k\) (\(I_i = k\)), or decreases if \(A_i\) advised another arm.
  • The details about how to increase or decrease the probabilities are specified below.
  • Reference: [[Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017]](https://hal.inria.fr/hal-01705292)

Note

Why call it Aggregator? Because this algorithm is an efficient aggregation algorithm, and like The Terminator, he beats his opponents with an iron fist! (OK, that’s a stupid joke but a cool name, thanks Emilie!)

https://en.wikipedia.org/wiki/Terminator_T-800_Model_101

Note

I wanted to call it Aggragorn. Because this algorithm is like Aragorn the ranger, it starts like a simple bandit, but soon it will become king!!

https://en.wikipedia.org/wiki/Aragorn
Policies.Aggregator.UNBIASED = True

A flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).

Policies.Aggregator.UPDATE_LIKE_EXP4 = False

Flag to know if we should update the trust probabilities like in Exp4 or like in my initial Aggregator proposal

  • First choice: like Exp4, trusts are fully recomputed, trusts^(t+1) = exp(rate_t * estimated mean rewards up to time t),
  • Second choice: my proposal, trusts are just updated multiplicatively, trusts^(t+1) <-- trusts^t * exp(rate_t * estimated instant reward at time t).

Both choices seem fine, and anyway the trusts are renormalized to be a probability distribution, so it doesn’t matter much.
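A sketch of these two update rules on the vector of trusts, assuming numpy arrays for the trusts and for the estimated (instant or cumulated) rewards of the slaves:

import numpy as np

def update_trusts(trusts, rate_t, instant_estimates, cumulated_estimates, like_exp4=False):
    """Update the trust probabilities, either Exp4-style or multiplicatively (sketch)."""
    if like_exp4:
        # recompute the trusts from the cumulated estimated rewards up to time t
        new_trusts = np.exp(rate_t * cumulated_estimates)
    else:
        # multiplicative update from the estimated instant rewards at time t
        new_trusts = trusts * np.exp(rate_t * instant_estimates)
    return new_trusts / np.sum(new_trusts)  # renormalize to a probability distribution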

Policies.Aggregator.USE_LOSSES = False

Module-level flag (not a parameter of the class) to know if the Exp4-like update uses losses or rewards. Losses are 1 - reward, in which case rate_t is negative.

Policies.Aggregator.UPDATE_ALL_CHILDREN = False

Should all trusts be updated, or only the trusts of slaves Ai who advised the decision Aggregator[A1..AN] followed.

class Policies.Aggregator.Aggregator(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]

Bases: Policies.BasePolicy.BasePolicy

My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.

__init__(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]

New policy.

nbArms = None

Number of arms

lower = None

Lower values for rewards

amplitude = None

Larger values for rewards

unbiased = None

Flag, see above.

horizon = None

Horizon T, if given and not None, can be used to compute a “good” constant learning rate, \(\sqrt{\frac{2 \log(N)}{T K}}\) for N slaves, K arms (heuristic).

extra_str = None

A string to add at the end of the str(self), to specify which algorithms are aggregated for instance.

update_all_children = None

Flag, see above.

nbChildren = None

Number N of slave algorithms.

t = None

Internal time

update_like_exp4 = None

Flag, see above.

learningRate = None

Value of the learning rate (can be decreasing in time)

decreaseRate = None

Value of the constant used in the decreasing of the learning rate

children = None

List of slave algorithms.

trusts = None

Initial trusts in the slaves. Default to uniform, but a prior can also be given.

choices = None

Keep track of the last choices of each slave, to know whom to update if update_all_children is false.

children_cumulated_losses = None

Keep track of the cumulated loss (empirical mean)

index = None

Numerical index for each arm.

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

rate

Learning rate, can be constant if self.decreaseRate is None, or decreasing.

  • if horizon is known, use the formula which uses it,
  • if horizon is not known, use the formula which uses current time \(t\),
  • else, if decreaseRate is a number, use an exponentially decreasing learning rate, rate = learningRate * exp(- t / decreaseRate). Bad.
startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give reward for each child, and then update the trust probabilities.

_makeChildrenChoose()[source]

Convenience method to make every child choose its best arm, and store their decisions in self.choices.

choice()[source]

Make each child vote, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceWithRank(rank=1)[source]

Make each child vote, with rank, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceFromSubSet(availableArms='all')[source]

Make each child vote, on subsets of arms, then sample the decision by importance sampling on their votes with the trust probabilities.

__module__ = 'Policies.Aggregator'
choiceMultiple(nb=1)[source]

Make each child vote, multiple times, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Make each child vote, multiple times (with IMP scheme), then sample the decision by importance sampling on their votes with the trust probabilities.

estimatedOrder()[source]

Make each child vote for its estimated order of the arms, then randomly select an ordering by importance sampling with the trust probabilities. Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing means.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identifies the set of M best arms.

computeIndex(arm)[source]

Compute the current index of arm ‘arm’, by computing all the indexes of the children policies, and computing a convex combination using the trusts probabilities.

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized; by default it cannot be vectorized automatically.

handleCollision(arm, reward=None)[source]

Default to give a 0 reward (or self.lower).

Policies.ApproximatedFHGittins module

The approximated Finite-Horizon Gittins index policy for bounded bandits.

Policies.ApproximatedFHGittins.ALPHA = 0.5

Default value for the parameter \(\alpha > 0\) for ApproximatedFHGittins.

Policies.ApproximatedFHGittins.DISTORTION_HORIZON = 1.01

Default value for the parameter \(\tau \geq 1\) that is used to artificially increase the horizon, from \(T\) to \(\tau T\).

class Policies.ApproximatedFHGittins.ApproximatedFHGittins(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The approximated Finite-Horizon Gittins index policy for bounded bandits.

__init__(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter \(\alpha > 0\).

distortion_horizon = None

Parameter \(\tau > 0\).

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

m

\(m = T - t + 1\) is the number of steps to be played until end of the game.

Note

The article does not explain how to deal with an unknown horizon, but if \(T\) is wrong, this m can become negative. Empirically, I force it to be \(\geq 1\), to not mess up the \(\log(m)\) used below, by using \(\tau T\) instead of \(T\) (e.g., \(\tau = 1.01\) is enough not to ruin the performance in the last steps of the experiment).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \log\left( \frac{m}{N_k(t) \log^{1/2}\left( \frac{m}{N_k(t)} \right)} \right)}, \\ \text{where}\;\; & m = T - t + 1.\end{split}\]

Note

This \(\log^{1/2}(\dots) = \sqrt{\log(\dots)}\) term can be undefined, as soon as \(m < N_k(t)\), so empirically \(\sqrt{\max(0, \log(\dots))}\) is used instead, or a larger horizon can be used to make \(m\) artificially larger (e.g., \(T' = 1.1 T\)).
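A sketch of this index computation, including the two empirical safeguards from the notes above (the argument names X_k and N_k are illustrative):

import numpy as np

def approx_fh_gittins_index(X_k, N_k, t, T, alpha=0.5, distortion_horizon=1.01):
    """Approximated Finite-Horizon Gittins index of one arm, with clipping tricks (sketch)."""
    m = max(1.0, distortion_horizon * T - t + 1)     # remaining steps, forced to stay >= 1
    half_log = np.sqrt(max(0.0, np.log(m / N_k)))    # log^{1/2}(m / N_k), clipped at 0
    if half_log <= 0:
        return X_k / N_k                             # no exploration bonus if the log is undefined
    bonus = np.sqrt((2.0 * alpha / N_k) * max(0.0, np.log(m / (N_k * half_log))))
    return X_k / N_k + bonus
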

__module__ = 'Policies.ApproximatedFHGittins'
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

Policies.BESA module

The Best Empirical Sampled Average (BESA) algorithm.

Warning

This algorithm works VERY well but it looks weird at first sight. It sounds “too easy”, so take a look at the article before wondering why it should work.

Warning

Right now, it is between 10 and 25 times slower than Policies.klUCB and other single-player policies.

Policies.BESA.subsample_deterministic(n, m)[source]

Returns \(\{1,\dots,n\}\) if \(n < m\) or \(\{1,\dots,m\}\) if \(n \geq m\) (i.e., it is \(\{1,\dots,\min(n,m)\}\)).

Warning

The BESA algorithm is efficient only with the random sub-sampling, don’t use this one except for comparing.

>>> subsample_deterministic(5, 3)  # doctest: +ELLIPSIS
array([0, 1, 2, 3])
>>> subsample_deterministic(10, 20)  # doctest: +ELLIPSIS
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
Policies.BESA.subsample_uniform(n, m)[source]

Returns a uniform sub-set of size \(n\), from \(\{1,\dots,m\}\).

  • Fails if n > m.

Note

The BESA algorithm is efficient only with the random sub-sampling.

>>> np.random.seed(1234)  # reproducible results
>>> subsample_uniform(3, 5)  # doctest: +ELLIPSIS
array([4, 0, 1])
>>> subsample_uniform(10, 20)  # doctest: +ELLIPSIS
array([ 7, 16,  2,  3,  1, 18,  5,  4,  0,  8])
Policies.BESA.TOLERANCE = 1e-06

Numerical tolerance when comparing two means. Should not be zero!

Policies.BESA.inverse_permutation(permutation, j)[source]

Invert the permutation for a given input j, that is, find i such that permutation[i] = j.

>>> permutation = [1, 0, 3, 2]
>>> inverse_permutation(permutation, 1)
0
>>> inverse_permutation(permutation, 0)
1
Policies.BESA.besa_two_actions(rewards, pulls, a, b, subsample_function=<function subsample_uniform>)[source]

Core algorithm for the BESA selection, for two actions a and b:

  • N = min(Na, Nb),
  • Sub-sample N values from rewards of arm a, and N values from rewards of arm b,
  • Compute mean of both samples of size N, call them m_a, m_b,
  • If m_a > m_b, choose a,
  • Else if m_a < m_b, choose b,
  • And in case of a tie, break it by choosing i such that Ni is minimal (or uniformly at random from [a, b] if Na = Nb).

Note

rewards can be a numpy array of shape (at least) (nbArms, max(Na, Nb)) or a dictionary mapping a,b to lists (or iterators) of lengths >= max(Na, Nb).

>>> np.random.seed(2345)  # reproducible results
>>> pulls = [6, 10]; K = len(pulls); N = max(pulls)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 1 is better  # doctest: +ELLIPSIS
array([0.154..., 0.158...])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 0 is better in the first 6 samples  # doctest: +ELLIPSIS
array([0.341..., 0.019...])
>>> besa_two_actions(rewards, pulls, 0, 1, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
0
>>> [besa_two_actions(rewards, pulls, 0, 1, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
Policies.BESA.besa_K_actions__non_randomized(rewards, pulls, left, right, subsample_function=<function subsample_uniform>, depth=0)[source]

BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).

  • I prefer to implement for a discrete action set \(\{\text{left}, \dots, \text{right}\}\) (end included) instead of a generic actions vector, to speed up the code, but it is less readable.
  • The depth argument is just for pretty printing debugging information (useless).

Warning

The binary tournament is NOT RANDOMIZED here, this version is only for testing.

>>> np.random.seed(1234)  # reproducible results
>>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 0 is better
array([ 0.09876921, -0.18561207,  0.04463033,  0.0653539 ])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 1 is better in the first 6 samples
array([-0.06401484,  0.17366346,  0.05323033, -0.09514708])
>>> besa_K_actions__non_randomized(rewards, pulls, 0, K-1, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
3
>>> [besa_K_actions__non_randomized(rewards, pulls, 0, K-1, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
Policies.BESA.besa_K_actions__smart_divideandconquer(rewards, pulls, left, right, random_permutation_of_arm=None, subsample_function=<function subsample_uniform>, depth=0)[source]

BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).

  • I prefer to implement for a discrete action set \(\{\text{left}, \dots, \text{right}\}\) (end included) instead of a generic actions vector, to speed up the code, but it is less readable.
  • The depth argument is just for pretty printing debugging information (useless).

Note

The binary tournament is RANDOMIZED here, as it should be.

>>> np.random.seed(1234)  # reproducible results
>>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 0 is better
array([ 0.09876921, -0.18561207,  0.04463033,  0.0653539 ])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 1 is better in the first 6 samples
array([-0.06401484,  0.17366346,  0.05323033, -0.09514708])
>>> besa_K_actions__smart_divideandconquer(rewards, pulls, 0, K-1, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
3
>>> [besa_K_actions__smart_divideandconquer(rewards, pulls, 0, K-1, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
Policies.BESA.besa_K_actions(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]

BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).

  • The divide and conquer is implemented for a generic list of actions, it’s slower but simpler to write! Left and right divisions are just actions[:len(actions)//2] and actions[len(actions)//2:].
  • Actions is assumed to be shuffled before calling this function!
  • The depth argument is just for pretty printing debugging information (useless).

Note

The binary tournament is RANDOMIZED here, as it should be.

>>> np.random.seed(1234)  # reproducible results
>>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls)
>>> actions = np.arange(K)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 0 is better
array([ 0.09876921, -0.18561207,  0.04463033,  0.0653539 ])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 1 is better in the first 6 samples
array([-0.06401484,  0.17366346,  0.05323033, -0.09514708])
>>> besa_K_actions(rewards, pulls, actions, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
3
>>> [besa_K_actions(rewards, pulls, actions, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
Policies.BESA.besa_K_actions__non_binary(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]

BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).

  • Instead of doing a binary tree tournament (which results in \(\mathcal{O}(K^2)\) calls to the 2-arm procedure), we can do a line tournament: 1 vs 2, winner vs 3, winner vs 4, etc., winner vs K-1 (which results in \(\mathcal{O}(K)\) calls),
  • Actions is assumed to be shuffled before calling this function!
  • The depth argument is just for pretty printing debugging information (useless).
>>> np.random.seed(1234)  # reproducible results
>>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls)
>>> actions = np.arange(K)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 0 is better
array([ 0.09876921, -0.18561207,  0.04463033,  0.0653539 ])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 1 is better in the first 6 samples
array([-0.06401484,  0.17366346,  0.05323033, -0.09514708])
>>> besa_K_actions__non_binary(rewards, pulls, actions, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
3
>>> [besa_K_actions__non_binary(rewards, pulls, actions, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[3, 3, 3, 2, 0, 3, 3, 3, 3, 3]
Policies.BESA.besa_K_actions__non_recursive(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]

BESA non-recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).

  • No calls to besa_two_actions(), just generalize it to K actions instead of 2.
  • Actions is assumed to be shuffled before calling this function!
>>> np.random.seed(1234)  # reproducible results
>>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls)
>>> rewards = np.random.randn(K, N)
>>> np.mean(rewards, axis=1)  # arm 0 is better
array([ 0.09876921, -0.18561207,  0.04463033,  0.0653539 ])
>>> np.mean(rewards[:, :min(pulls)], axis=1)  # arm 1 is better in the first 6 samples
array([-0.06401484,  0.17366346,  0.05323033, -0.09514708])
>>> besa_K_actions__non_recursive(rewards, pulls, None, subsample_function=subsample_deterministic)  # doctest: +ELLIPSIS
3
>>> [besa_K_actions__non_recursive(rewards, pulls, None, subsample_function=subsample_uniform) for _ in range(10)]  # doctest: +ELLIPSIS
[1, 3, 0, 2, 2, 3, 1, 1, 3, 1]
class Policies.BESA.BESA(nbArms, horizon=None, minPullsOfEachArm=1, randomized_tournament=True, random_subsample=True, non_binary=False, non_recursive=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The Best Empirical Sampled Average (BESA) algorithm.

Warning

The BESA algorithm requires storing the whole history of rewards, so its memory usage for \(T\) rounds with \(K\) arms is \(\mathcal{O}(K T)\), which is huge for large \(T\), be careful! Aggregating different BESA instances is probably a bad idea because of this limitation!

__init__(nbArms, horizon=None, minPullsOfEachArm=1, randomized_tournament=True, random_subsample=True, non_binary=False, non_recursive=False, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Just to know the memory to allocate for rewards. It could be implemented without knowing the horizon, by using lists to keep all the reward history, but this would be way slower!

minPullsOfEachArm = None

Minimum number of pulls of each arm before using the BESA algorithm. Using 1 might not be the best choice

randomized_tournament = None

Whether to use a deterministic or random tournament.

random_subsample = None

Whether to use a deterministic or random sub-sampling procedure.

non_binary = None

Whether to use besa_K_actions() or besa_K_actions__non_binary() for the selection of K arms.

non_recursive = None

Whether to use besa_K_actions() or besa_K_actions__non_recursive() for the selection of K arms.

all_rewards = None

Keep all rewards of each arm. It consumes \(\mathcal{O}(K T)\) memory, that’s really bad!!

__str__()[source]

-> str

getReward(arm, reward)[source]

Add the current reward in the global history.

Note

There is no need to normalize the reward in [0,1], that’s one of the strong points of the BESA algorithm.

choice()[source]

Applies the BESA procedure with the current data history.

choiceFromSubSet(availableArms='all')[source]

Applies the BESA procedure with the current data history, to the restricted set of arms.

choiceMultiple(nb=1)[source]

Applies the multiple-choice BESA procedure with the current data history:

  1. select a first arm with basic BESA procedure with full action set,
  2. remove it from the set of actions,
  3. restart step 1 with the new, smaller set of actions, until nb arms were chosen by basic BESA.

Note

This was not studied or published before, and there are no theoretical results about it!

Warning

This is very inefficient! The BESA procedure is already quite slow (with my current naive implementation), this is crazily slow!

choiceWithRank(rank=1)[source]

Applies the ranked BESA procedure with the current data history:

  1. use choiceMultiple() to select rank actions,
  2. then take the rank-th chosen action (the last one).

Note

This was not studied or published before, and there are no theoretical results about it!

Warning

This is very inefficient! The BESA procedure is already quite slow (with my current naive implementation), this is crazily slow!

__module__ = 'Policies.BESA'
computeIndex(arm)[source]

Compute the current index of arm ‘arm’.

Warning

This index is not the one used for the choice of arm (which uses sub-sampling). It’s just the empirical mean of the arm.

computeAllIndex()[source]

Compute the current indexes for all arms (vectorized).

Warning

This index is not the one used for the choice of arm (which uses sub-sampling). It’s just the empirical mean of the arm.

handleCollision(arm, reward=None)[source]

Nothing special to do.

Policies.BasePolicy module

Base class for any policy.

  • If rewards are not in [0, 1], be sure to give the lower value and the amplitude. E.g., if rewards are in [-3, 3], lower = -3, amplitude = 6.
Policies.BasePolicy.CHECKBOUNDS = False

If True, every time a reward is received, a warning message is displayed if it lies outside of [lower, lower + amplitude].

class Policies.BasePolicy.BasePolicy(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: object

Base class for any policy.

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms

lower = None

Lower values for rewards

amplitude = None

Larger values for rewards

t = None

Internal time

pulls = None

Number of pulls of each arms

rewards = None

Cumulated rewards of each arms

__str__()[source]

-> str

startGame()[source]

Start the game (fill pulls and rewards with 0).

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
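A minimal sketch of that update, with the same lower/amplitude normalization (self is any BasePolicy-like object):

def get_reward_sketch(self, arm, reward):
    """Update the counters of the policy with a reward normalized to [0, 1] (sketch)."""
    self.t += 1
    self.pulls[arm] += 1
    normalized = (reward - self.lower) / self.amplitude  # maps [lower, lower + amplitude] to [0, 1]
    self.rewards[arm] += normalized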

choice()[source]

Not defined.

choiceWithRank(rank=1)[source]

Not defined.

choiceFromSubSet(availableArms='all')[source]

Not defined.

choiceMultiple(nb=1)[source]

Not defined.

__dict__ = mappingproxy({'__module__': 'Policies.BasePolicy', '__doc__': ' Base class for any policy.', '__init__': <function BasePolicy.__init__>, '__str__': <function BasePolicy.__str__>, 'startGame': <function BasePolicy.startGame>, 'getReward': <function BasePolicy.getReward>, 'choice': <function BasePolicy.choice>, 'choiceWithRank': <function BasePolicy.choiceWithRank>, 'choiceFromSubSet': <function BasePolicy.choiceFromSubSet>, 'choiceMultiple': <function BasePolicy.choiceMultiple>, 'choiceIMP': <function BasePolicy.choiceIMP>, 'estimatedOrder': <function BasePolicy.estimatedOrder>, '__dict__': <attribute '__dict__' of 'BasePolicy' objects>, '__weakref__': <attribute '__weakref__' of 'BasePolicy' objects>})
__module__ = 'Policies.BasePolicy'
__weakref__

list of weak references to the object (if defined)

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Not defined.

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing means.

  • For a base policy, it is completely random.
Policies.BaseWrapperPolicy module

Base class for any wrapper policy.

class Policies.BaseWrapperPolicy.BaseWrapperPolicy(nbArms, policy=<class 'Policies.UCB.UCB'>, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

Base class for any wrapper policy.

__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, *args, **kwargs)[source]

New policy.

startGame(createNewPolicy=True)[source]

Initialize the policy for a new game.

Warning

createNewPolicy=True creates a new object for the underlying policy, while createNewPolicy=False only calls BasePolicy.startGame().

getReward(arm, reward)[source]

Pass the reward, as usual, update t and sometimes restart the underlying policy.

choice()[source]

Pass the call to choice of the underlying policy.

index

Get attribute index from the underlying policy.

choiceWithRank(rank=1)[source]

Pass the call to choiceWithRank of the underlying policy.

choiceFromSubSet(availableArms='all')[source]

Pass the call to choiceFromSubSet of the underlying policy.

choiceMultiple(nb=1)[source]

Pass the call to choiceMultiple of the underlying policy.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Pass the call to choiceIMP of the underlying policy.

estimatedOrder()[source]

Pass the call to estimatedOrder of the underlying policy.

estimatedBestArms(M=1)[source]

Pass the call to estimatedBestArms of the underlying policy.

computeIndex(arm)[source]

Pass the call to computeIndex of the underlying policy.

computeAllIndex()[source]

Pass the call to computeAllIndex of the underlying policy.

__module__ = 'Policies.BaseWrapperPolicy'
Policies.BayesUCB module

The Bayes-UCB policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one by arm.
  • Reference: [Kaufmann, Cappé & Garivier - AISTATS, 2012]
class Policies.BayesUCB.BayesUCB(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

The Bayes-UCB policy.

  • Reference: [Kaufmann, Cappé & Garivier - AISTATS, 2012].

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by taking the \(1 - \frac{1}{t}\) quantile from the Beta posterior:

\[I_k(t) = \mathrm{Quantile}\left(\mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)), 1 - \frac{1}{t}\right).\]
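For instance, this quantile can be computed with scipy.stats.beta (a sketch, independent of the package’s Beta posterior class):

from scipy.stats import beta

def bayes_ucb_index(S_k, N_k, t):
    """1 - 1/t quantile of the Beta(1 + S_k, 1 + N_k - S_k) posterior (sketch)."""
    return beta.ppf(1.0 - 1.0 / t, 1 + S_k, 1 + N_k - S_k)
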
__module__ = 'Policies.BayesUCB'
Policies.BayesianIndexPolicy module

Basic Bayesian index policy. By default, it uses a Beta posterior.

class Policies.BayesianIndexPolicy.BayesianIndexPolicy(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.IndexPolicy.IndexPolicy

Basic Bayesian index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one by arm.
  • Use *args and **kwargs if you want to give parameters to the underlying posteriors.
  • Or use params_for_each_posterior as a list of parameters (as a dictionary) to give a different set of parameters for each posterior.
__init__(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Create a new Bayesian policy, by creating a default posterior on each arm.

posterior = None

Posterior for each arm. Stored as a list instead of a dict, for quicker access.

__str__()[source]

-> str

startGame()[source]

Reset the posterior on each arm.

getReward(arm, reward)[source]

Update the posterior on each arm, with the normalized reward.

computeIndex(arm)[source]

Compute the current index of arm ‘arm’.

__module__ = 'Policies.BayesianIndexPolicy'
Policies.BoltzmannGumbel module

The Boltzmann-Gumbel Exploration (BGE) index policy, a different formulation of the Exp3 policy with an optimally tuned decreasing sequence of temperature parameters \(\gamma_t\).

  • Reference: Section 4 of [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
  • It is an index policy with indexes computed from the empirical mean estimators and a random sample from a Gumbel distribution.
Policies.BoltzmannGumbel.SIGMA = 1

Default constant \(\sigma\) assuming the arm distributions are \(\sigma^2\)-subgaussian. 1 for Bernoulli arms.

class Policies.BoltzmannGumbel.BoltzmannGumbel(nbArms, C=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The Boltzmann-Gumbel Exploration (BGE) index policy, a different formulation of the Exp3 policy with an optimally tuned decreasing sequence of temperature parameters \(\gamma_t\).

  • Reference: Section 4 of [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
  • It is an index policy with indexes computed from the empirical mean estimators and a random sample from a Gumbel distribution.
__init__(nbArms, C=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

computeIndex(arm)[source]

Take a random index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \beta_k(t) Z_k(t), \\ \text{where}\;\; \beta_k(t) &:= \sqrt{C^2 / N_k(t)}, \\ \text{and}\;\; Z_k(t) &\sim \mathrm{Gumbel}(0, 1).\end{split}\]

Where \(\mathrm{Gumbel}(0, 1)\) is the standard Gumbel distribution. See [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.gumbel.html#numpy.random.gumbel) or [Wikipedia page](https://en.wikipedia.org/wiki/Gumbel_distribution) for more details.
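A sketch of one such index draw, using numpy.random.gumbel() for the perturbation (X_k and N_k are illustrative argument names):

import numpy as np

def boltzmann_gumbel_index(X_k, N_k, C=1.0):
    """Empirical mean plus a Gumbel perturbation scaled by sqrt(C^2 / N_k) (sketch)."""
    beta_k = np.sqrt(C ** 2 / N_k)
    Z_k = np.random.gumbel(0.0, 1.0)  # standard Gumbel(0, 1) sample
    return X_k / N_k + beta_k * Z_k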

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.BoltzmannGumbel'
Policies.CD_UCB module

The generic CD-UCB policies for non-stationary bandits.

  • Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)

  • It runs on top of a simple policy, e.g., UCB, and UCBLCB_IndexPolicy is a wrapper:

    >>> policy = UCBLCB_IndexPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

It can only work on basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to reset the history to a smaller one)!

Policies.CD_UCB.VERBOSE = False

Whether to be verbose when doing the change detection algorithm.

Policies.CD_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.CD_UCB.PER_ARM_RESTART = True

Should we reset one arm's empirical average, or all of them? Default is True, it’s usually more efficient!

Policies.CD_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm, or simply reset one arm's empirical average? Default is False, it’s usually more efficient!

Policies.CD_UCB.EPSILON = 0.05

Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).

Policies.CD_UCB.LAMBDA = 1

Default value of \(\lambda\).

Policies.CD_UCB.MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT = 50

Hypothesis on the speed of changes: between two change points, there are at least \(M * K\) time steps, where K is the number of arms, and M is this constant.

Policies.CD_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

XXX Be lazy and try to detect changes only every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. Set it to 1 to not use this feature; 10 should speed up the test by about x10.

class Policies.CD_UCB.CD_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The CD-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__init__(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

New policy.

epsilon = None

Parameter \(\varepsilon\) for the test.

lazy_detect_change_only_x_steps = None

Be lazy and try to detect changes only every X steps, where X is small, like 10 for instance.

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

all_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the number of times each arm was pulled since the last restart. Starts at -1 (never seen).

last_restart_times = None

Keep in memory the times of last restarts (for each arm).

number_of_restart = None

Keep in memory the number of restarts.

__str__()[source]

-> str

choice()[source]

With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to choice() of the underlying policy.

choiceWithRank(rank=1)[source]

With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to choiceWithRank() of the underlying policy.

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the change detection algorithm says so, with method detect_change(), for this arm at this current time step.

Warning

This is computationally costly, so an easy way to speed up this step is to use lazy_detect_change_only_x_steps \(= \mathrm{Step_t}\) for a small value (e.g., 10), so as not to test for all \(t\in\mathbb{N}^*\) but only for \(s\in\mathbb{N}^*\) with \(s \bmod \mathrm{Step_t} = 0\) (e.g., one out of every 10 steps).

Warning

If the detect_change() method also returns an estimate of the position of the change-point, \(\hat{\tau}\), then it is used to reset the memory of the changing arm and keep the observations from \(\hat{\tau}+1\).

detect_change(arm, verbose=False)[source]

Try to detect a change in the current arm.

Warning

This is not implemented for the generic CD algorithm, it has to be implemented by a child of the class CD_IndexPolicy.

__module__ = 'Policies.CD_UCB'
class Policies.CD_UCB.SlidingWindowRestart_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

A more generic implementation is the Policies.SlidingWindowRestart class.

Warning

I have no idea if what I wrote is correct or not!

detect_change(arm, verbose=False)[source]

Try to detect a change in the current arm.

Warning

This one is simply using a sliding-window of fixed size = 100. A more generic implementation is the Policies.SlidingWindowRestart class.

__module__ = 'Policies.CD_UCB'
Policies.CD_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS = 10

XXX Be lazy and only try values of \(s\) in steps of size steps_s. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. Using steps_s=1 disables this feature, steps_s=2 should already speed up the tests by about x2, and 10 by about x10.

Policies.CD_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that the localization always helps improve learning, so the default value is set to True.

class Policies.CD_UCB.UCBLCB_IndexPolicy(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The UCBLCB-UCB generic policy for non-stationary bandits, from [[Improved Changepoint Detection for Piecewise i.i.d Bandits, by S. Mukherjee & O.-A. Maillard, preprint 2018](https://subhojyoti.github.io/pdf/aistats_2019.pdf)].

Warning

This is still experimental! See https://github.com/SMPyBandits/SMPyBandits/issues/177

__init__(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

New policy.

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

lazy_try_value_s_only_x_steps = None

Be lazy and try to detect changes for \(s\) taking steps of size steps_s.

use_localization = None

Experimental: use localization of the break-point, i.e., restart the memory of the arm by keeping the observations s+1…n instead of just the last one.

__module__ = 'Policies.CD_UCB'
__str__()[source]

-> str

delta(t)[source]

Use \(\delta = \delta_0\) if it was given as an argument to the policy, or \(\frac{\delta_0}{t}\) as the confidence level of UCB/LCB test (default is \(\delta_0=1\)).

Warning

It is unclear (in the article) whether \(t\) is the time since the last restart or the total time?

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided UCB-LCB algorithm [Mukherjee & Maillard, 2018].

  • Let \(\hat{\mu}_{i,t:t'}\) be the empirical mean of the rewards obtained for arm i from time \(t\) to \(t'\), and \(N_{i,t:t'}\) the number of samples.
  • Let \(S_{i,t:t'} = \sqrt{\frac{\log(4 t^2 / \delta)}{2 N_{i,t:t'}}}\) be the length of the confidence interval.
  • When we have data starting at \(t_0=0\) (since last restart) and up-to current time \(t\), for each arm i,
    • For each intermediate time steps \(t' \in [t_0, t)\),
      • Compute \(LCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} - S_{i,t_0:t'}\),
      • Compute \(UCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} + S_{i,t_0:t'}\),
      • Compute \(LCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} - S_{i,t'+1:t}\),
      • Compute \(UCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} + S_{i,t'+1:t}\),
      • If \(UCB_{\text{before}} < LCB_{\text{after}}\) or \(UCB_{\text{after}} < LCB_{\text{before}}\), then restart.
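A direct (unoptimized) sketch of this test for one arm, where rewards is the list of observations of that arm since the last restart and delta is the confidence level from delta(t):

import numpy as np

def ucblcb_detect_change(rewards, delta):
    """Two-sided UCB/LCB change-point test on the rewards of one arm (sketch)."""
    t = len(rewards)
    for t_prime in range(1, t):               # split point between "before" and "after"
        before, after = rewards[:t_prime], rewards[t_prime:]
        s_before = np.sqrt(np.log(4 * t ** 2 / delta) / (2 * len(before)))
        s_after = np.sqrt(np.log(4 * t ** 2 / delta) / (2 * len(after)))
        mu_before, mu_after = np.mean(before), np.mean(after)
        if mu_before + s_before < mu_after - s_after or mu_after + s_after < mu_before - s_before:
            return True, t_prime              # a change is detected at this split point
    return False, None
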
Policies.CORRAL module

The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.

The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).

  • At every step, one slave algorithm is selected, by a random selection from a trust distribution on \([1,...,N]\).
  • Then its decision is listened to: it is played by the master algorithm, and a feedback reward is received.
  • The reward is reweighted by the trust of the listened algorithm, and given back to it.
  • The other slaves, whose decision was not even asked, receive a zero reward, or no reward at all.
  • The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm k (the reward), the trust \(P_i\) in each slave \(A_i\) is updated using the reward received.
  • The detail about how to increase or decrease the probabilities are specified in the reference article.

Note

Reference: [[“Corralling a Band of Bandit Algorithms”, by A. Agarwal, H. Luo, B. Neyshabur, R.E. Schapire, 01.2017](https://arxiv.org/abs/1612.06246v2)].

Policies.CORRAL.renormalize_reward(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]

Renormalize the reward to [0, 1]:

  • divide by (trust/mintrust) if unbiased is True.
  • simply project to [0, 1] if unbiased is False,

Warning

If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.

Policies.CORRAL.unnormalize_reward(reward, lower=0.0, amplitude=1.0)[source]

Project back reward to [lower, lower + amplitude].

Policies.CORRAL.log_Barrier_OMB(trusts, losses, rates)[source]

A step of the log-barrier Online Mirror Descent, updating the trusts:

  • Find \(\lambda \in [\min_i l_{t,i}, \max_i l_{t,i}]\) such that \(\sum_i \frac{1}{1/p_{t,i} + \eta_{t,i}(l_{t,i} - \lambda)} = 1\).
  • Return \(\mathbf{p}_{t+1,i}\) such that \(\frac{1}{p_{t+1,i}} = \frac{1}{p_{t,i}} + \eta_{t,i}(l_{t,i} - \lambda)\).
  • Note: uses scipy.optimize.minimize_scalar() for the optimization.
  • Reference: [Learning in games: Robustness of fast convergence, by D.Foster, Z.Li, T.Lykouris, K.Sridharan, and E.Tardos, NIPS 2016].
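A sketch of that step, treating trusts, losses and rates as numpy arrays; here the equation on \(\lambda\) is solved with scipy.optimize.minimize_scalar on its squared residual (assuming the rates are small enough for the denominators to stay positive):

import numpy as np
from scipy.optimize import minimize_scalar

def log_barrier_omd_sketch(trusts, losses, rates):
    """One step of log-barrier Online Mirror Descent on the trust vector (sketch)."""
    def squared_residual(lmbda):
        # residual of the equation  sum_i 1 / (1/p_i + eta_i * (l_i - lambda)) = 1
        return (np.sum(1.0 / (1.0 / trusts + rates * (losses - lmbda))) - 1.0) ** 2
    result = minimize_scalar(squared_residual, bounds=(np.min(losses), np.max(losses)), method='bounded')
    lmbda = result.x
    new_trusts = 1.0 / (1.0 / trusts + rates * (losses - lmbda))
    return new_trusts / np.sum(new_trusts)  # renormalize, to be safe
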
Policies.CORRAL.UNBIASED = True

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).

Policies.CORRAL.BROADCAST_ALL = False

Whether to give back a reward to only one slave algorithm (default, False) or to all slaves who voted for the same arm

class Policies.CORRAL.CORRAL(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.

__init__(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms.

lower = None

Lower values for rewards.

amplitude = None

Larger values for rewards.

unbiased = None

Flag, see above.

broadcast_all = None

Flag, see above.

gamma = None

Constant \(\gamma = 1 / T\).

beta = None

Constant \(\beta = \exp(1 / \log(T))\).

rates = None

Value of the learning rate (will be increasing in time).

children = None

List of slave algorithms.

trusts = None

Initial trusts in the slaves. Default to uniform, but a prior can also be given.

bar_trusts = None

Initial bar trusts in the slaves. Default to uniform, but a prior can also be given.

choices = None

Keep track of the last choices of each slave, to know whom to update if update_all_children is false.

last_choice = None

Remember the index of the last child trusted for a decision.

losses = None

For the log-barrier OMD step, a vector of losses has to be given. Faster to keep it as an attribute instead of reallocating it every time.

rhos = None

I use the inverses of the \(\rho_{t,i}\) from the algorithm in the reference article. Simpler to understand, fewer numerical errors.

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

__setattr__(name, value)[source]

Trick method, to update the \(\gamma\) and \(\beta\) parameters of the CORRAL algorithm if the horizon T changes.

Warning

Not tested yet!

startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give reward for each child, and then update the trust probabilities.

choice()[source]

Trust one of the slaves and listen to its choice.

choiceWithRank(rank=1)[source]

Trust one of the slaves and listen to its choiceWithRank.

choiceFromSubSet(availableArms='all')[source]

Trust one of the slaves and listen to its choiceFromSubSet.

__module__ = 'Policies.CORRAL'
choiceMultiple(nb=1)[source]

Trust one of the slaves and listen to its choiceMultiple.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Trust one of the slaves and listen to its choiceIMP.

estimatedOrder()[source]

Trust one of the slaves and listen to its estimatedOrder.

  • Return the estimated order of the arms, as a permutation of \([0,...,K-1]\) that would order the arms by increasing means.
estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the M-best set.

Policies.CPUCB module

The Clopper-Pearson UCB policy for bounded bandits. Reference: [Garivier & Cappé, COLT 2011](https://arxiv.org/pdf/1102.2490.pdf).

Policies.CPUCB.binofit_scalar(x, n, alpha=0.05)[source]

Parameter estimates and confidence intervals for binomial data.

For example:

>>> np.random.seed(1234)  # reproducible results
>>> true_p = 0.6
>>> N = 100
>>> x = np.random.binomial(N, true_p)
>>> (phat, pci) = binofit_scalar(x, N)
>>> phat
0.61
>>> pci  # 0.6 of course lies in the 95% confidence interval  # doctest: +ELLIPSIS
(0.507..., 0.705...)
>>> (phat, pci) = binofit_scalar(x, N, 0.01)
>>> pci  # 0.6 is also in the 99% confidence interval, but it is larger  # doctest: +ELLIPSIS
(0.476..., 0.732...)

Like binofit in MATLAB, see https://fr.mathworks.com/help/stats/binofit.html.

  • (phat, pci) = binofit_scalar(x, n) returns a maximum likelihood estimate of the probability of success in a given binomial trial based on the number of successes, x, observed in n independent trials.
  • (phat, pci) = binofit_scalar(x, n) returns the probability estimate, phat, and the 95% confidence intervals, pci, by using the Clopper-Pearson method to calculate confidence intervals.
  • (phat, pci) = binofit_scalar(x, n, alpha) returns the 100(1 - alpha)% confidence intervals. For example, alpha = 0.01 yields 99% confidence intervals.

For the Clopper-Pearson UCB algorithms:

  • x is the cumulative reward of some arm k, \(x = X_k(t)\),
  • n is the number of samples of that arm k, \(n = N_k(t)\),
  • and alpha is a small positive number, \(\alpha = \frac{1}{t^c}\) in this algorithm (for \(c > 1\), close to 1, for instance c = 1.01).

Returns: (phat, pci)

  • phat: is the estimate of p
  • pci: is the confidence interval

Note

My reference implementation was https://github.com/sjara/extracellpy/blob/master/extrastats.py#L35, but http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.proportion.proportion_confint.html can also be used (it implies an extra requirement for the project).

Policies.CPUCB.binofit(xArray, nArray, alpha=0.05)[source]

Parameter estimates and confidence intervals for binomial data, for vectorial inputs.

For example:

>>> np.random.seed(1234)  # reproducible results
>>> true_p = 0.6
>>> N = 100
>>> xArray = np.random.binomial(N, true_p, 4)
>>> xArray
array([61, 54, 61, 52])
>>> (phat, pci) = binofit(xArray, N)
>>> phat
array([0.61, 0.54, 0.61, 0.52])
>>> pci  # 0.6 of course lies in the 95% confidence intervals  # doctest: +ELLIPSIS
array([[0.507..., 0.705...],
       [0.437..., 0.640...],
       [0.507..., 0.705...],
       [0.417..., 0.620...]])
>>> (phat, pci) = binofit(xArray, N, 0.01)
>>> pci  # 0.6 is also in the 99% confidence intervals, but it is larger  # doctest: +ELLIPSIS
array([[0.476..., 0.732...],
       [0.407..., 0.668...],
       [0.476..., 0.732...],
       [0.387..., 0.650...]])
Policies.CPUCB.ClopperPearsonUCB(x, N, alpha=0.05)[source]

Returns just the upper-confidence bound of the confidence interval.

Policies.CPUCB.C = 1.01

Default value for the parameter c for CP-UCB

class Policies.CPUCB.CPUCB(nbArms, c=1.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The Clopper-Pearson UCB policy for bounded bandits. Reference: [Garivier & Cappé, COLT 2011].

__init__(nbArms, c=1.01, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c for the CP-UCB formula (see below)

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \mathrm{ClopperPearsonUCB}\left( X_k(t), N_k(t), \frac{1}{t^c} \right).\]

Where \(\mathrm{ClopperPearsonUCB}\) is defined above. The index is the upper-confidence bound of the binomial trial of \(N_k(t)\) samples from arm k, having mean \(\mu_k\), and empirical outcome \(X_k(t)\). The confidence interval is with \(\alpha = 1 / t^c\), for a \(100(1 - \alpha)\%\) confidence bound.
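
As a hedged illustration of this index (not the package's own code path, which goes through binofit()), the upper bound can also be computed directly from the Beta quantile of the Clopper-Pearson interval; the two-sided \(\alpha/2\) split below is an assumption consistent with the 95%/99% intervals shown above.

    from scipy.stats import beta

    def clopper_pearson_ucb(x, n, alpha=0.05):
        """Upper end of the 100*(1-alpha)% Clopper-Pearson interval, x successes in n trials."""
        if x >= n:
            return 1.0
        return beta.ppf(1.0 - alpha / 2.0, x + 1, n - x)

    def cpucb_index(X_k, N_k, t, c=1.01):
        """Index I_k(t) = ClopperPearsonUCB(X_k(t), N_k(t), 1/t**c), as in the formula above."""
        return clopper_pearson_ucb(X_k, N_k, alpha=1.0 / t**c)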

__module__ = 'Policies.CPUCB'
Policies.CUSUM_UCB module

The CUSUM-UCB and PHT-UCB policies for non-stationary bandits.

  • Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)

  • It runs on top of a simple policy, e.g., UCB, and CUSUM_IndexPolicy is a wrapper:

    >>> policy = CUSUM_IndexPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

It can only work on top of a basic index policy based on empirical averages (and an exploration bias), like UCB; it cannot work on a Bayesian policy, for which we would have to remember all previous observations in order to restart with a reduced history!

Policies.CUSUM_UCB.VERBOSE = False

Whether to be verbose when doing the change detection algorithm.

Policies.CUSUM_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.CUSUM_UCB.PER_ARM_RESTART = True

Should we reset only one arm's empirical average, or all of them? For CUSUM-UCB it is True (per-arm restart) by default.

Policies.CUSUM_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm, or simply reset one arm's empirical average? For CUSUM-UCB it is False by default.

Policies.CUSUM_UCB.EPSILON = 0.01

Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).

Policies.CUSUM_UCB.LAMBDA = 1

Default value of \(\lambda\). Used only if \(h\) and \(\alpha\) are computed using compute_h_alpha_from_input_parameters__CUSUM_complicated().

Policies.CUSUM_UCB.MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT = 100

Hypothesis on the speed of changes: between two change points, there are at least \(M \times K\) time steps, where K is the number of arms and M is this constant.

Policies.CUSUM_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

XXX Be lazy and try to detect changes only every X steps, where X is small, e.g., 10 or 20. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 0 disables this feature, and a value of 20 should speed up the test by roughly a factor of 20.

Policies.CUSUM_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.

Policies.CUSUM_UCB.ALPHA0_SCALE_FACTOR = 1

For any algorithm with uniform exploration and a formula to tune it, \(\alpha\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps a lot!

Policies.CUSUM_UCB.compute_h_alpha_from_input_parameters__CUSUM_complicated(horizon, max_nb_random_events, nbArms=None, epsilon=None, lmbda=None, M=None, scaleFactor=1)[source]

Compute the values \(C_1^+, C_1^-, C_1, C_2, h\) from the formulas in Theorem 2 and Corollary 2 in the paper.

Policies.CUSUM_UCB.compute_h_alpha_from_input_parameters__CUSUM(horizon, max_nb_random_events, scaleFactor=1, **kwargs)[source]

Compute the values \(h, \alpha\) from the simplified formulas in Theorem 2 and Corollary 2 in the paper.

\[\begin{split}h &= \log(\frac{T}{\Upsilon_T}),\\ \alpha &= \mathrm{scaleFactor} \times \sqrt{\frac{\Upsilon_T}{T} \log(\frac{T}{\Upsilon_T})}.\end{split}\]
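
For reference, a direct transcription of these two simplified formulas (a sketch; the package function may handle edge cases differently):

    import numpy as np

    def compute_h_alpha_cusum(horizon, max_nb_random_events, scaleFactor=1.0):
        T, Upsilon_T = float(horizon), float(max_nb_random_events)
        h = np.log(T / Upsilon_T)
        alpha = scaleFactor * np.sqrt((Upsilon_T / T) * np.log(T / Upsilon_T))
        return h, alpha
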
class Policies.CUSUM_UCB.CUSUM_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The CUSUM-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__init__(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

New policy.

M = None

Parameter \(M\) for the test.

threshold_h = None

Parameter \(h\) for the test (threshold).

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

use_localization = None

Experimental flag to use localization of the break-point, i.e., restart the arm's memory by keeping the observations \(s+1, \dots, n\) instead of just the last one.

__str__()[source]

-> str

getReward(arm, reward)[source]

Make sure that the underlying UCB or klUCB indexes are used with \(\log(n_t)\) for the exploration term, where \(n_t = \sum_{i=1}^K N_i(t)\) is the total number of pulls counted since each arm's last restart time (each arm has its own restart time, as CUSUM uses local restarts only).

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided CUSUM algorithm [Page, 1954].

  • For each data k, compute:
\[\begin{split}s_k^- &= (y_k - \hat{u}_0 - \varepsilon) 1(k > M),\\ s_k^+ &= (\hat{u}_0 - y_k - \varepsilon) 1(k > M),\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]
  • The change is detected if \(\max(g_k^+, g_k^-) > h\), where threshold_h is the threshold \(h\) of the test,
  • and \(\hat{u}_0 = \frac{1}{M} \sum_{k=1}^{M} y_k\) is the mean of the first M samples, where M is the minimum number of observations between change points (a short sketch is given below).
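
A minimal sketch of this two-sided test (illustrative only; the actual detect_change() works on the policy's stored data): data is the list of rewards observed on one arm since its last restart, M the number of initial samples used for \(\hat{u}_0\), eps the drift correction \(\varepsilon\) and h the threshold.

    import numpy as np

    def cusum_detect_change(data, M=100, eps=0.01, h=10.0):
        data = np.asarray(data, dtype=float)
        if len(data) <= M:
            return False                       # not enough samples to estimate u0
        u0 = np.mean(data[:M])                 # \hat{u}_0: mean of the first M samples
        g_plus, g_minus = 0.0, 0.0
        for y in data[M:]:                     # only samples with k > M contribute
            g_plus = max(0.0, g_plus + (u0 - y - eps))
            g_minus = max(0.0, g_minus + (y - u0 - eps))
            if max(g_plus, g_minus) > h:
                return True                    # change detected
        return False
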
__module__ = 'Policies.CUSUM_UCB'
class Policies.CUSUM_UCB.PHT_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CUSUM_UCB.CUSUM_IndexPolicy

The PHT-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__module__ = 'Policies.CUSUM_UCB'
__str__()[source]

-> str

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided PHT algorithm [Hinkley, 1971].

  • For each data k, compute:
\[\begin{split}s_k^- &= y_k - \hat{y}_k - \varepsilon,\\ s_k^+ &= \hat{y}_k - y_k - \varepsilon,\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]
  • The change is detected if \(\max(g_k^+, g_k^-) > h\), where threshold_h is the threshold of the test,
  • And \(\hat{y}_k = \frac{1}{k} \sum_{s=1}^{k} y_s\) is the mean of the first k samples.
Policies.DMED module

The DMED policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case this is not the algorithm of [Honda & Takemura, COLT 2010] (see the note below on the variant).

class Policies.DMED.DMED(nbArms, genuine=False, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The DMED policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case this is not the algorithm of [Honda & Takemura, COLT 2010] (see the note below on the variant).

__init__(nbArms, genuine=False, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

New policy.

kl = None

kl function to use

tolerance = None

Numerical tolerance

genuine = None

Flag to know which variant is implemented, DMED or DMED+

nextActions = None

List of next actions to play; at each step, play nextActions.pop(0).

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

choice()[source]

If there is still a next action to play, pop it and play it; otherwise, build a new list and play its first action.

The list of actions is obtained as all the indexes \(k\) satisfying the following condition (a sketch is given after the formulas below).

  • For the naive version (genuine = False), DMED:
\[\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log(t)}{N_k(t)}.\]
  • For the original version (genuine = True), DMED+:
\[\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log(\frac{t}{N_k(t)})}{N_k(t)}.\]

Where \(X_k(t)\) is the sum of rewards from arm k, \(\hat{\mu}_k(t)\) is the empirical mean, and \(\hat{\mu}^*(t)\) is the best empirical mean.

\[\begin{split}X_k(t) &= \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma) \\ \hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \hat{\mu}^*(t) &= \max_{k=1}^{K} \hat{\mu}_k(t)\end{split}\]
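
Here is a sketch of how such a list can be built (assuming Bernoulli rewards and the default klBern divergence; rewards and pulls are the per-arm \(X_k(t)\) and \(N_k(t)\), all assumed positive):

    import numpy as np

    def klBern(x, y, eps=1e-15):
        """Bernoulli KL divergence kl(x, y), clipped away from 0 and 1 for numerical safety."""
        x, y = min(max(x, eps), 1 - eps), min(max(y, eps), 1 - eps)
        return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

    def dmed_next_actions(rewards, pulls, t, genuine=False):
        """Indexes k satisfying the DMED (genuine=False) or DMED+ (genuine=True) condition above."""
        means = rewards / pulls
        best_mean = np.max(means)
        actions = []
        for k in range(len(means)):
            threshold = np.log(t / pulls[k]) / pulls[k] if genuine else np.log(t) / pulls[k]
            if klBern(means[k], best_mean) < threshold:
                actions.append(k)
        return actions
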
choiceMultiple(nb=1)[source]

If there are still enough actions to play, pop them and play them; otherwise, build a new list and play the first nb actions.

__module__ = 'Policies.DMED'
class Policies.DMED.DMEDPlus(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

Bases: Policies.DMED.DMED

The DMED+ policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case this is not the algorithm of [Honda & Takemura, COLT 2010].

__init__(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

New policy.

__module__ = 'Policies.DMED'
Policies.DiscountedBayesianIndexPolicy module

Discounted Bayesian index policy.

Warning

This is still highly experimental!

Policies.DiscountedBayesianIndexPolicy.GAMMA = 0.95

Default value for the discount factor \(\gamma\in(0,1)\). 0.95 is empirically a reasonable value for short-term non-stationary experiments.

class Policies.DiscountedBayesianIndexPolicy.DiscountedBayesianIndexPolicy(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

Discounted Bayesian index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one by arm.
  • Use discount factor \(\gamma\in(0,1)\).
  • It keeps \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\) the discounted counts of successes and failures (S and F), for each arm k.
  • But instead of using \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{F_k}(t) = F_k(t)\), they are updated at each time step using the discount factor \(\gamma\) (see the sketch after these formulas):
\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]
\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]
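
A minimal sketch of this discounted update for one time step (assuming a normalized reward \(r(t) \in [0,1]\); S and F are arrays of the discounted counts):

    import numpy as np

    def discounted_update(S, F, chosen_arm, r, gamma=0.95):
        S = gamma * np.asarray(S, dtype=float)   # discount every arm
        F = gamma * np.asarray(F, dtype=float)
        S[chosen_arm] += r                       # then add the new observation on A(t)
        F[chosen_arm] += 1.0 - r
        return S, F
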
__init__(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Create a new Bayesian policy, by creating a default posterior on each arm.

gamma = None

Discount factor \(\gamma\in(0,1)\).

__str__()[source]

-> str

getReward(arm, reward)[source]

Update the posterior on each arm, with the normalized reward.

__module__ = 'Policies.DiscountedBayesianIndexPolicy'
Policies.DiscountedThompson module

The Discounted Thompson (Bayesian) index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one by arm.
  • Reference: [[“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727)].

Warning

This is still highly experimental!

class Policies.DiscountedThompson.DiscountedThompson(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.DiscountedBayesianIndexPolicy.DiscountedBayesianIndexPolicy

The DiscountedThompson (Bayesian) index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one by arm.
  • Reference: [[“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727)].
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, by sampling from the DiscountedBeta posterior.

\[\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \widetilde{S_k}(t), 1 + \widetilde{F_k}(t)).\end{split}\]
  • It keeps \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\) the discounted counts of successes and failures (S and F), for each arm k.
  • But instead of using \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{F_k}(t) = F_k(t)\), they are updated at each time step using the discount factor \(\gamma\) (a sampling sketch follows these formulas):
\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]
\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]
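
The sampling step itself can be sketched as follows (ties broken uniformly at random, which is one way to realize the \(U(\arg\max \dots)\) above):

    import numpy as np

    def discounted_thompson_choice(S_tilde, F_tilde, rng=np.random):
        samples = rng.beta(1.0 + np.asarray(S_tilde), 1.0 + np.asarray(F_tilde))  # one draw per arm
        best = np.flatnonzero(samples == samples.max())
        return int(rng.choice(best))             # uniform tie-breaking over the argmax
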
__module__ = 'Policies.DiscountedThompson'
Policies.DiscountedUCB module

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • \(\gamma\) should not be 1; for \(\gamma = 1\), rather use Policies.UCBalpha.UCBalpha.
  • The smaller the \(\gamma\), the shorter the “memory” of the algorithm is.
Policies.DiscountedUCB.ALPHA = 1

Default parameter for alpha.

Policies.DiscountedUCB.GAMMA = 0.99

Default parameter for gamma.

class Policies.DiscountedUCB.DiscountedUCB(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

Bases: Policies.UCBalpha.UCBalpha

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

__init__(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
discounted_pulls = None

Number of pulls of each arm.

discounted_rewards = None

Cumulative rewards of each arm.

alpha = None

Parameter alpha

gamma = None

Parameter gamma

delta_time_steps = None

Keep memory of the \(\Delta_k(t)\) for each time step.

useRealDiscount = None

Flag to know if the real update should be used, the one with a multiplication by \(\gamma^{1+\Delta_k(t)}\) and not simply a multiplication by \(\gamma\).

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update the cumulative sum of rewards for that arm (normalized in [0, 1]).

  • Keep up to date the following two quantities, using a different definition and notation than in the article, but consistent within this project:
\[\begin{split}N_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} N_k(s), \\ X_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} X_k(s).\end{split}\]
  • Instead of keeping the whole history of rewards, as expressed in the math formula, we keep the sum of discounted rewards from s=0 to s=t, because updating it is easy (2 operations instead of just 1 for the classical Policies.UCBalpha.UCBalpha, and 2 operations instead of \(\mathcal{O}(t)\) as expressed mathematically). Denote \(\Delta_k(t)\) the number of time steps during which arm k was not selected (maybe 0 if it is selected twice in a row). Then the update can be done easily by multiplying by \(\gamma^{1+\Delta_k(t)}\) (a sketch of this update follows the formula below):
\[\begin{split}N_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times N_{k,\gamma}(\text{last pull}) + \mathbb{1}(A(t+1) = k), \\ X_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times X_{k,\gamma}(\text{last pull}) + X_k(t+1).\end{split}\]
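
A sketch of this update (names are illustrative; N_gamma, X_gamma and delta play the roles of \(N_{k,\gamma}\), \(X_{k,\gamma}\) and \(\Delta_k(t)\)):

    def discounted_ucb_update(N_gamma, X_gamma, delta, arm, reward, gamma=0.99):
        factor = gamma ** (1 + delta[arm])       # gamma^{1 + Delta_k(t)}
        N_gamma[arm] = factor * N_gamma[arm] + 1.0
        X_gamma[arm] = factor * X_gamma[arm] + reward
        for k in range(len(delta)):              # update the "time since last pull" counters
            delta[k] = 0 if k == arm else delta[k] + 1
        return N_gamma, X_gamma, delta
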
computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}I_k(t) &:= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} + \sqrt{\frac{\alpha \log(n_{\gamma}(t))}{2 N_{k,\gamma}(t)}}, \\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]
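
Written out in vectorized form, this index is simply (a sketch, with X_gamma and N_gamma the per-arm discounted sums):

    import numpy as np

    def discounted_ucb_indexes(X_gamma, N_gamma, alpha=1.0):
        n_gamma = np.sum(N_gamma)                # total discounted number of pulls
        return X_gamma / N_gamma + np.sqrt(alpha * np.log(n_gamma) / (2.0 * N_gamma))
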
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedUCBPlus(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-UCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.
__init__(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__module__ = 'Policies.DiscountedUCB'
Policies.DiscountedUCB.constant_c = 1.0

default value, as it was in pymaBandits v1.0

Policies.DiscountedUCB.tolerance = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.DiscountedUCB.DiscountedklUCB(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

__init__(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
klucb = None

kl function to use

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(t)}{N_{k,\gamma}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]

Here rewards are assumed to be in \([a, b]\) (default \([0, 1]\)), \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default 1).

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized; by default it cannot be vectorized automatically.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedklUCBPlus(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.DiscountedUCB.DiscountedklUCB, Policies.DiscountedUCB.DiscountedUCBPlus

The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

  • Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.
__str__()[source]

-> str

__module__ = 'Policies.DiscountedUCB'
Policies.DoublingTrickWrapper module

A policy that acts as a wrapper on another policy P, assumed to be horizon dependent (it has to know \(T\)), by implementing a “doubling trick”:

  • starts to assume that \(T=T_0=1000\), and run the policy \(P(T_0)\), from \(t=1\) to \(t=T_0\),
  • if \(t > T_0\), then the “doubling trick” is performed, by either re-initializing or just changing the parameter horizon of the policy P, for instance with \(T_1 = 10 \times T_0\),
  • and keep doing this until \(t = T\).

Note

This is implemented in a very generic way, with simply a function next_horizon(horizon) that gives the next horizon to try when crossing the current guess. It can be a simple linear function (next_horizon(horizon) = horizon + 100), a geometric growth to have the “real” doubling trick (next_horizon(horizon) = horizon * 10), or even functions growing exponentially fast (next_horizon(horizon) = horizon ** 1.1, next_horizon(horizon) = horizon ** 1.5, next_horizon(horizon) = horizon ** 2).

Note

My guess is that this “doubling trick” wrapping policy can only be efficient (for stochastic problems) if:

  • the underlying policy P is a very efficient horizon-dependent algorithm, e.g., the Policies.ApproximatedFHGittins,
  • the growth function next_horizon is growing faster than any geometric rate, so that the number of refreshes is \(o(\log T)\) and not \(O(\log T)\).

See also

Reference: [[What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018]](https://hal.inria.fr/hal-01736357), to be presented soon.

Warning

Interface: If FULL_RESTART=False (default), the underlying algorithm is not recreated at every breakpoint; instead, its attribute horizon (or _horizon) is updated. Be sure that this is enough to really change the internal value used by the policy. Some policies use T only once, to compute other parameters, which should then be updated as well. A manual implementation of the __setattr__ method can help.

Policies.DoublingTrickWrapper.default_horizonDependent_policy

alias of Policies.UCBH.UCBH

Policies.DoublingTrickWrapper.FULL_RESTART = False

Default constant to know what to do when restarting the underlying policy with a new horizon parameter.

  • True means that a new policy, initialized from scratch, will be created at every breakpoint.
  • False means that the same policy object is used but just its attribute horizon is updated (default).
Policies.DoublingTrickWrapper.DEFAULT_FIRST_HORIZON = 200

Default horizon, used for the first step.

Policies.DoublingTrickWrapper.ARITHMETIC_STEP = 200

Default stepsize for the arithmetic horizon progression.

Policies.DoublingTrickWrapper.next_horizon__arithmetic(i, horizon)[source]

The arithmetic horizon progression function:

\[\begin{split}T &\mapsto T + 200,\\ T_i &:= T_0 + 200 \times i.\end{split}\]
Policies.DoublingTrickWrapper.GEOMETRIC_STEP = 2

Default multiplicative constant for the geometric horizon progression.

Policies.DoublingTrickWrapper.next_horizon__geometric(i, horizon)[source]

The geometric horizon progression function:

\[\begin{split}T &\mapsto T \times 2,\\ T_i &:= T_0 2^i.\end{split}\]
Policies.DoublingTrickWrapper.EXPONENTIAL_STEP = 1.5

Default exponential constant for the exponential horizon progression.

Policies.DoublingTrickWrapper.next_horizon__exponential(i, horizon)[source]

The exponential horizon progression function:

\[\begin{split}T &\mapsto \left\lfloor T^{1.5} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.5^i} \right\rfloor.\end{split}\]
Policies.DoublingTrickWrapper.SLOW_EXPONENTIAL_STEP = 1.1

Default exponential constant for the slow exponential horizon progression.

Policies.DoublingTrickWrapper.next_horizon__exponential_slow(i, horizon)[source]

The exponential horizon progression function:

\[\begin{split}T &\mapsto \left\lfloor T^{1.1} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.1^i} \right\rfloor.\end{split}\]
Policies.DoublingTrickWrapper.FAST_EXPONENTIAL_STEP = 2

Default exponential constant for the fast exponential horizon progression.

Policies.DoublingTrickWrapper.next_horizon__exponential_fast(i, horizon)[source]

The exponential horizon progression function:

\[\begin{split}T &\mapsto \lfloor T^{2} \rfloor,\\ T_i &:= \lfloor T_0^{2^i} \rfloor.\end{split}\]
Policies.DoublingTrickWrapper.ALPHA = 2

Default constant \(\alpha\) for the generic exponential sequence.

Policies.DoublingTrickWrapper.BETA = 2

Default constant \(\beta\) for the generic exponential sequence.

Policies.DoublingTrickWrapper.next_horizon__exponential_generic(i, horizon)[source]

The generic exponential horizon progression function:

\[T_i := \left\lfloor \frac{T_0}{a} a^{b^i} \right\rfloor.\]
Policies.DoublingTrickWrapper.default_next_horizon(i, horizon)

The exponential horizon progression function:

\[\begin{split}T &\mapsto \left\lfloor T^{1.1} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.1^i} \right\rfloor.\end{split}\]
Policies.DoublingTrickWrapper.breakpoints(next_horizon, first_horizon, horizon, debug=False)[source]

Return the list of restart point (breakpoints), if starting from first_horizon to horizon with growth function next_horizon.

  • Also return the gap between the last guess for horizon and the true horizon. This gap should not be too large.
  • Nicely print all the values if debug=True.
  • First examples:
>>> first_horizon = 1000
>>> horizon = 30000
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([1000, 1200, 1400, ..., 29800, 30000], 0)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([1000, 2000, 4000, 8000, 16000, 32000], 2000)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([1000, 31622], 1622)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([1000, 1995, 4265, 9838, 24671, 67827], 37827)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([1000, 1000000], 970000)
  • Second examples:
>>> first_horizon = 5000
>>> horizon = 1000000
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([5000, 5200, ..., 999600, 999800, 1000000], 0)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([5000, 10000, 20000, 40000, 80000, 160000, 320000, 640000, 1280000], 280000)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([5000, 353553, 210223755], 209223755)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([5000, 11718, 29904, 83811, 260394, 906137, 3572014], 2572014)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([5000, 25000000], 24000000)
  • Third examples:
>>> first_horizon = 10
>>> horizon = 1123456
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([10, 210, 410, ..., 1123210, 1123410, 1123610], 154)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920, 163840, 327680, 655360, 1310720], 187264)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([10, 31, 172, 2255, 107082, 35040856], 33917400)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([10, 12, 15, 19, 25, 34, 48, 70, 107, 170, 284, 499, 928, 1837, 3895, 8903, 22104, 60106, 180638, 606024, 2294768], 1171312)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([10, 100, 10000, 100000000], 98876544)
Policies.DoublingTrickWrapper.constant_c_for_the_functions_f = 0.5

The constant c in front of the function f.

Policies.DoublingTrickWrapper.function_f__for_geometric_sequences(i, c=0.5)[source]

For the geometric doubling sequences, \(f(i) = c \times \log(i)\).

Policies.DoublingTrickWrapper.function_f__for_exponential_sequences(i, c=0.5)[source]

For the exponential doubling sequences, \(f(i) = c \times i\).

Policies.DoublingTrickWrapper.function_f__for_generic_sequences(i, c=0.5, d=0.5, e=0.0)[source]

For a certain generic family of doubling sequences, \(f(i) = c \times i^{d} \times (\log(i))^{e}\).

Warning

d should most probably be smaller than 1.

Policies.DoublingTrickWrapper.function_f__for_intermediate_sequences(i)[source]
Policies.DoublingTrickWrapper.function_f__for_intermediate2_sequences(i)[source]
Policies.DoublingTrickWrapper.function_f__for_intermediate3_sequences(i)[source]
Policies.DoublingTrickWrapper.function_f__for_intermediate4_sequences(i)[source]
Policies.DoublingTrickWrapper.function_f__for_intermediate5_sequences(i)[source]
Policies.DoublingTrickWrapper.alpha_for_Ti = 0.5

Value of the parameter \(\alpha\) for the Ti_from_f() function.

Policies.DoublingTrickWrapper.Ti_from_f(f, alpha=0.5, *args, **kwargs)[source]

For any non-negative and increasing function \(f: i \mapsto f(i)\), the corresponding sequence is defined by:

\[\forall i\in\mathbb{N},\; T_i := \lfloor \exp(\alpha \times \exp(f(i))) \rfloor.\]

Warning

\(f(i)\) may need other parameters (see the examples above); they can be given as *args or **kwargs to Ti_from_f().

Warning

It should be computed differently: one should give \(i \mapsto \exp(f(i))\) instead of \(f: i \mapsto f(i)\), in order to reduce the risk of overflow errors as much as possible!

Policies.DoublingTrickWrapper.Ti_geometric(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_geometric_sequences().

Policies.DoublingTrickWrapper.Ti_exponential(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_exponential_sequences().

Policies.DoublingTrickWrapper.Ti_intermediate_sqrti(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_intermediate_sequences().

Policies.DoublingTrickWrapper.Ti_intermediate_i13(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_intermediate2_sequences().

Policies.DoublingTrickWrapper.Ti_intermediate_i23(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_intermediate3_sequences().

Policies.DoublingTrickWrapper.Ti_intermediate_i12_logi12(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_intermediate4_sequences().

Policies.DoublingTrickWrapper.Ti_intermediate_i_by_logi(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]

Sequence \(T_i\) generated from the function \(f\) = function_f__for_intermediate5_sequences().

Policies.DoublingTrickWrapper.last_term_operator_LT(Ti, max_i=10000)[source]

For a certain function representing a doubling sequence, \(T: i \mapsto T_i\), this last_term_operator_LT() function returns the function \(L: T \mapsto L_T\), defined as:

\[\forall T\in\mathbb{N},\; L_T := \min\{ i \in\mathbb{N},\; T \leq T_i \}.\]

\(L_T\) is the only integer which satisfies \(T_{L_T - 1} < T \leq T_{L_T}\).

Policies.DoublingTrickWrapper.plot_doubling_sequences(i_min=1, i_max=30, list_of_f=(<function function_f__for_geometric_sequences>, <function function_f__for_intermediate_sequences>, <function function_f__for_intermediate2_sequences>, <function function_f__for_intermediate3_sequences>, <function function_f__for_intermediate4_sequences>, <function function_f__for_exponential_sequences>), label_of_f=('Geometric doubling (d=0, e=1)', 'Intermediate doubling (d=1/2, e=0)', 'Intermediate doubling (d=1/3, e=0)', 'Intermediate doubling (d=2/3, e=0)', 'Intermediate doubling (d=1/2, e=1/2)', 'Exponential doubling (d=1, e=0)'), *args, **kwargs)[source]

Display a plot to illustrate the values of the \(T_i\) as a function of \(i\) for some i.

  • Can accept many functions f (and labels).
Policies.DoublingTrickWrapper.plot_quality_first_upper_bound(Tmin=10, Tmax=100000000, nbTs=100, gamma=0.0, delta=1.0, list_of_f=(<function function_f__for_geometric_sequences>, <function function_f__for_intermediate_sequences>, <function function_f__for_intermediate2_sequences>, <function function_f__for_intermediate3_sequences>, <function function_f__for_intermediate4_sequences>, <function function_f__for_exponential_sequences>), label_of_f=('Geometric doubling (d=0, e=1)', 'Intermediate doubling (d=1/2, e=0)', 'Intermediate doubling (d=1/3, e=0)', 'Intermediate doubling (d=2/3, e=0)', 'Intermediate doubling (d=1/2, e=1/2)', 'Exponential doubling (d=1, e=0)'), show_Ti_m_Tim1=True, *args, **kwargs)[source]

Display a plot to compare numerically between the following sum \(S\) and the upper-bound we hope to have, \(T^{\gamma} (\log T)^{\delta}\), as a function of \(T\) for some values between \(T_{\min}\) and \(T_{\max}\):

\[S := \sum_{i=0}^{L_T} (T_i - T_{i-1})^{\gamma} (\log (T_i - T_{i-1}))^{\delta}.\]
  • Can accept many functions f (and labels).
  • Can use \(T_i\) instead of \(T_i - T_{i-1}\) if show_Ti_m_Tim1=False (default is to use the smaller possible bound, with difference of sequence lengths, \(T_i - T_{i-1}\)).

Warning

This is still ON GOING WORK.

Policies.DoublingTrickWrapper.MAX_NB_OF_TRIALS = 500

If the sequence \(T_i\) does not grow enough, artificially increase i until \(T_{i+1} > T_i\).

class Policies.DoublingTrickWrapper.DoublingTrickWrapper(nbArms, full_restart=False, policy=<class 'Policies.UCBH.UCBH'>, next_horizon=<function next_horizon__exponential_slow>, first_horizon=200, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

A policy that acts as a wrapper on another policy P, assumed to be horizon dependent (it has to know \(T\)), by implementing a “doubling trick”.

  • Reference: [[What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018]](https://hal.inria.fr/hal-01736357), to be presented soon.
__init__(nbArms, full_restart=False, policy=<class 'Policies.UCBH.UCBH'>, next_horizon=<function next_horizon__exponential_slow>, first_horizon=200, *args, **kwargs)[source]

New policy.

full_restart = None

Constant to know how to refresh the underlying policy.

__module__ = 'Policies.DoublingTrickWrapper'
next_horizon_name = None

Pretty string of the name of this growth function.

horizon = None

Last guess for the horizon

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Pass the reward, as usual, update t and sometimes restart the underlying policy.
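
A hedged sketch of this bookkeeping (the names wrapper._i, wrapper._policy and wrapper.next_horizon below are illustrative assumptions, not the package's exact internals):

    def doubling_trick_step(wrapper, arm, reward):
        wrapper.policy.getReward(arm, reward)        # forward the reward to the underlying policy
        wrapper.t += 1
        if wrapper.t > wrapper.horizon:              # crossed the current horizon guess
            wrapper.horizon = wrapper.next_horizon(wrapper._i, wrapper.horizon)
            wrapper._i += 1
            if wrapper.full_restart:                 # recreate the policy from scratch
                wrapper.policy = wrapper._policy(wrapper.nbArms, horizon=wrapper.horizon)
                wrapper.policy.startGame()
            else:                                    # only update its horizon attribute
                wrapper.policy.horizon = wrapper.horizon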

Policies.EmpiricalMeans module

The naive Empirical Means policy for bounded bandits: like UCB but without a bias correction term. Note that it is equal to UCBalpha with alpha=0, only quicker.

class Policies.EmpiricalMeans.EmpiricalMeans(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The naive Empirical Means policy for bounded bandits: like UCB but without a bias correction term. Note that it is equal to UCBalpha with alpha=0, only quicker.

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)}.\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.EmpiricalMeans'
Policies.EpsilonGreedy module

The epsilon-greedy random policies, with the naive one and some variants.

Warning

Unless \(\varepsilon(t)\) is optimally tuned for a specific problem, none of these policies can hope to be efficient.

class Policies.EpsilonGreedy.EpsilonGreedy(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The epsilon-greedy random policy.

__init__(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

epsilon
__str__()[source]

-> str

choice()[source]

With probability epsilon, explore (uniform choice); otherwise, exploit based on accumulated rewards only (not empirical mean rewards).

choiceWithRank(rank=1)[source]

With probability epsilon, explore (uniform choice); otherwise, exploit with the given rank, based on accumulated rewards only (not empirical mean rewards).

choiceFromSubSet(availableArms='all')[source]

Not defined.

choiceMultiple(nb=1)[source]

Not defined.

__module__ = 'Policies.EpsilonGreedy'
class Policies.EpsilonGreedy.EpsilonDecreasing(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The epsilon-decreasing random policy.

__init__(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

__str__()[source]

-> str

epsilon

Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\).

__module__ = 'Policies.EpsilonGreedy'
Policies.EpsilonGreedy.C = 0.1

Constant C in the MEGA formula

Policies.EpsilonGreedy.D = 0.5

Constant D in the MEGA formula

Policies.EpsilonGreedy.epsilon0(c, d, nbArms)[source]

MEGA heuristic:

\[\varepsilon_0 = \frac{c K^2}{d^2 (K - 1)}.\]
class Policies.EpsilonGreedy.EpsilonDecreasingMEGA(nbArms, c=0.1, d=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The epsilon-decreasing random policy, using MEGA’s heuristic for a good choice of epsilon0 value.

__init__(nbArms, c=0.1, d=0.5, lower=0.0, amplitude=1.0)[source]

New policy.

__str__()[source]

-> str

epsilon

Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\).

__module__ = 'Policies.EpsilonGreedy'
class Policies.EpsilonGreedy.EpsilonFirst(nbArms, horizon, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The epsilon-first random policy. Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies

__init__(nbArms, horizon, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

epsilon

1 while \(t \leq \varepsilon_0 T\), 0 after.

__module__ = 'Policies.EpsilonGreedy'
Policies.EpsilonGreedy.EPSILON = 0.1

Default value for epsilon for EpsilonDecreasing

Policies.EpsilonGreedy.DECREASINGRATE = 1e-06

Default value for the constant for the decreasing rate

class Policies.EpsilonGreedy.EpsilonExpDecreasing(nbArms, epsilon=0.1, decreasingRate=1e-06, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The epsilon exp-decreasing random policy.

__init__(nbArms, epsilon=0.1, decreasingRate=1e-06, lower=0.0, amplitude=1.0)[source]

New policy.

__module__ = 'Policies.EpsilonGreedy'
__str__()[source]

-> str

epsilon

Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 \exp(- t \tau))\).

Policies.EpsilonGreedy.random() → x in the interval [0, 1).
Policies.Exp3 module

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin & C. Szepesvári & P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

Policies.Exp3.UNBIASED = True

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or unbiased estimators, \(r_t / \mathrm{trusts}_t\).

Policies.Exp3.GAMMA = 0.01

Default \(\gamma\) parameter.

class Policies.Exp3.Exp3(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Exp3 randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin & C. Szepesvári & P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).

__init__(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Unbiased estimators ?

weights = None

Weights on the arms

startGame()[source]

Start with uniform weights.

__str__()[source]

-> str

gamma

Constant \(\gamma_t = \gamma\).

trusts

Update the trusts probabilities according to Exp3 formula, and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

If \(w_k(t)\) is the current weight from arm k.

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

  • With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
  • But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\) (a sketch of this update is given after the formula below).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
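
A compact sketch transcribing the displayed formulas (trusts from the weights, then the multiplicative update of the played arm, where pulls[arm] stands for \(N_k(t)\)):

    import numpy as np

    def exp3_trusts(weights, gamma):
        trusts = (1.0 - gamma) * weights + gamma / len(weights)
        return trusts / np.sum(trusts)

    def exp3_update(weights, trusts, pulls, arm, reward, gamma, unbiased=True):
        r_tilde = reward / trusts[arm] if unbiased else reward   # unbiased estimate r_t / trusts_t
        new_weights = np.array(weights, dtype=float)
        new_weights[arm] *= np.exp(r_tilde / (gamma * pulls[arm]))
        return new_weights / np.sum(new_weights)                 # renormalize
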
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the M-best set.

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3WithHorizon(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with fixed gamma, \(\gamma_t = \gamma_0\), chosen with a knowledge of the horizon.

__init__(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

gamma

Fixed temperature, small, knowing the horizon: \(\gamma_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3Decreasing(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma with the time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).

__module__ = 'Policies.Exp3'
class Policies.Exp3.Exp3SoftMix(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

Another Exp3 with decreasing parameter \(\gamma_t\).

__str__()[source]

-> str

gamma

Decreasing gamma parameter with the time: \(\gamma_t = c \frac{\log(t)}{t}\) (heuristic).

__module__ = 'Policies.Exp3'
Policies.Exp3.DELTA = 0.01

Default value for the confidence parameter delta

class Policies.Exp3.Exp3ELM(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

A variant of Exp3, apparently designed to work better in stochastic environments.

__init__(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]

New policy.

delta = None

Confidence parameter, given in input

B = None

Constant B given by \(B = 4 (e - 2) (2 \log K + \log(2 / \delta))\).

availableArms = None

Set of available arms, starting from all arms, and it can get reduced at each step.

varianceTerm = None

Estimated variance term, for each arm.

__str__()[source]

-> str

choice()[source]

Choose among the remaining arms.

getReward(arm, reward)[source]

Get reward and update the weights, as in Exp3, but also update the variance term \(V_k(t)\) for all arms, and the set of available arms \(\mathcal{A}(t)\), by removing arms whose empirical accumulated reward and variance term satisfy a certain inequality.

\[\begin{split}a^*(t+1) &= \arg\max_a \hat{R}_{a}(t+1), \\ V_k(t+1) &= V_k(t) + \frac{1}{\mathrm{trusts}_k(t+1)}, \\ \mathcal{A}(t+1) &= \mathcal{A}(t) \setminus \left\{ a : \hat{R}_{a^*(t+1)}(t+1) - \hat{R}_{a}(t+1) > \sqrt{B (V_{a^*(t+1)}(t+1) + V_{a}(t+1))} \right\}.\end{split}\]
trusts

Update the trusts probabilities according to Exp3ELM formula, and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - |\mathcal{A}_t| \gamma_t) w_k(t) + \gamma_t, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

If \(w_k(t)\) is the current weight from arm k.

__module__ = 'Policies.Exp3'
gamma

Decreasing gamma with the time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).

Policies.Exp3PlusPlus module

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

Policies.Exp3PlusPlus.ALPHA = 3

Value for the \(\alpha\) parameter.

Policies.Exp3PlusPlus.BETA = 256

Value for the \(\beta\) parameter.

class Policies.Exp3PlusPlus.Exp3PlusPlus(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

__init__(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

\(\alpha\) parameter for computations of \(\xi_t(a)\).

beta = None

\(\beta\) parameter for computations of \(\xi_t(a)\).

weights = None

Weights on the arms

losses = None

Cumulative sum of losses estimates for each arm

unweighted_losses = None

Cumulative sum of unweighted losses for each arm

startGame()[source]

Start with uniform weights.

__str__()[source]

-> str

eta

Decreasing sequence of learning rates, given by \(\eta_t = \frac{1}{2} \sqrt{\frac{\log K}{t K}}\).

gamma

Constant \(\gamma_t = \gamma\).

gap_estimate

Compute the gap estimate \(\widehat{\Delta}^{\mathrm{LCB}}_t(a)\) from the following steps (a sketch is given after this list):

  • Compute the UCB: \(\mathrm{UCB}_t(a) = \min\left( 1, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} + \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),
  • Compute the LCB: \(\mathrm{LCB}_t(a) = \max\left( 0, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} - \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),
  • Then the gap: \(\widehat{\Delta}^{\mathrm{LCB}}_t(a) = \max\left( 0, \mathrm{LCB}_t(a) - \min_{a'} \mathrm{UCB}_t(a') \right)\).
  • The gap should be in \([0, 1]\).
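
A sketch of this gap estimate (L_hat and N are the per-arm unweighted cumulative losses \(\widehat{L}_{t-1}(a)\) and pull counts \(N_{t-1}(a)\), assumed positive; alpha is the module's \(\alpha\)):

    import numpy as np

    def gap_estimate(L_hat, N, t, alpha=3.0):
        K = len(N)
        mean_loss = L_hat / N
        radius = np.sqrt(alpha * np.log(t * K ** (1.0 / alpha)) / (2.0 * N))
        ucb = np.minimum(1.0, mean_loss + radius)
        lcb = np.maximum(0.0, mean_loss - radius)
        return np.maximum(0.0, lcb - np.min(ucb))   # per-arm gap, in [0, 1]
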
xi

Compute the \(\xi_t(a) = \frac{\beta \log t}{t \widehat{\Delta}^{\mathrm{LCB}}_t(a)^2}\) vector of indexes.

epsilon

Compute the vector of parameters \(\eta_t(a) = \min\left(\frac{1}{2 K}, \frac{1}{2} \sqrt{\frac{\log K}{t K}}, \xi_t(a) \right)\).

trusts

Update the trusts probabilities according to Exp3PlusPlus formula, and the parameter \(\eta_t\).

\[\begin{split}\tilde{\rho}'_{t+1}(a) &= (1 - \sum_{a'=1}^{K}\eta_t(a')) w_t(a) + \eta_t(a), \\ \tilde{\rho}_{t+1} &= \tilde{\rho}'_{t+1} / \sum_{a=1}^{K} \tilde{\rho}'_{t+1}(a).\end{split}\]

If \(\rho_t(a)\) is the current weight from arm a.

getReward(arm, reward)[source]

Give a reward: accumulate losses on that arm a, then update the weight \(\rho_t(a)\) and renormalize the weights.

  • Divide by the trust on that arm a, i.e., the probability of observing arm a: \(\tilde{l}_t(a) = \frac{l_t(a)}{\tilde{\rho}_t(a)} 1(A_t = a)\).
  • Add this loss to the cumulative loss: \(\tilde{L}_t(a) := \tilde{L}_{t-1}(a) + \tilde{l}_t(a)\).
  • But the un-weighted loss is added to the other cumulative loss: \(\widehat{L}_t(a) := \widehat{L}_{t-1}(a) + l_t(a) 1(A_t = a)\).
\[\begin{split}\rho'_{t+1}(a) &= \exp\left( - \tilde{L}_t(a) \eta_t \right) \\ \rho_{t+1} &= \rho'_{t+1} / \sum_{a=1}^{K} \rho'_{t+1}(a).\end{split}\]
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the M-best set.

__module__ = 'Policies.Exp3PlusPlus'
Policies.Exp3R module

The Drift-Detection algorithm for non-stationary bandits.

Warning

It works on top of Exp3 or other parametrizations of the Exp3 policy, e.g., Exp3PlusPlus.

Policies.Exp3R.VERBOSE = False

Whether to be verbose when doing the search for valid parameter \(\ell\).

Policies.Exp3R.CONSTANT_C = 1.0

The constant \(C\) used in Corollary 1 of paper [[“EXP3 with Drift Detection for the Switching Bandit Problem”, Robin Allesiardo & Raphael Feraud]](https://www.researchgate.net/profile/Allesiardo_Robin/publication/281028960_EXP3_with_Drift_Detection_for_the_Switching_Bandit_Problem/links/55d1927808aee19936fdac8e.pdf).

class Policies.Exp3R.DriftDetection_IndexPolicy(nbArms, H=None, delta=None, C=1.0, horizon=None, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The Drift-Detection generic policy for non-stationary bandits, using a custom Drift-Detection test, for 1-dimensional exponential families.

__init__(nbArms, H=None, delta=None, C=1.0, horizon=None, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]

New policy.

H = None

Parameter \(H\) for the Drift-Detection algorithm. Default value is \(\lceil C \sqrt{T \log(T)} \rceil\), for some constant \(C\) (the attribute C, equal to CONSTANT_C by default).

delta = None

Parameter \(\delta\) for the Drift-Detection algorithm. Default value is \(\sqrt{\frac{\log(T)}{K T}}\) for \(K\) arms and horizon \(T\).

proba_random_exploration

Parameter \(\gamma\) for the Exp3 algorithm.

threshold_h

Parameter \(\varepsilon\) for the Drift-Detection algorithm.

\[\varepsilon = \sqrt{\frac{K \log(\frac{1}{\delta})}{2 \gamma H}}.\]
min_number_of_pulls_to_test_change

Compute \(\Gamma_{\min}(I) := \frac{\gamma H}{K}\), the minimum number of samples we should have for all arms before testing for a change.

__str__()[source]

-> str

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using a Drift-Detection test (DD).

\[\begin{split}k_{\max} &:= \arg\max_k \tilde{\rho}_k(t),\\ DD_t(k) &= \hat{\mu}_k(I) - \hat{\mu}_{k_{\max}}(I).\end{split}\]
  • The change is detected if there is an arm \(k\) such that \(DD_t(k) \geq 2 * \varepsilon = h\), where threshold_h is the threshold of the test, and \(I\) is the (number of the) current interval since the last (global) restart,
  • where \(\tilde{\rho}_k(t)\) is the trust probability of arm \(k\) from the Exp3 algorithm,
  • and where \(\hat{\mu}_k(I)\) is the empirical mean of arm \(k\) from the data in the current interval.

Warning

FIXME I know this implementation is not (yet) correct… I should count separately the samples obtained from the Gibbs distribution (when Exp3 uses the trust vector) and the samples obtained from the uniform distribution. This \(\Gamma_{\min}(I)\) is the minimum number of samples obtained from the uniform exploration (of probability \(\gamma\)). It seems painful to code correctly, so I will do it later.

__module__ = 'Policies.Exp3R'
class Policies.Exp3R.Exp3R(nbArms, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]

Bases: Policies.Exp3R.DriftDetection_IndexPolicy

The Exp3.R policy for non-stationary bandits.

__init__(nbArms, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]

New policy.

__str__()[source]

-> str

__module__ = 'Policies.Exp3R'
class Policies.Exp3R.Exp3RPlusPlus(nbArms, policy=<class 'Policies.Exp3PlusPlus.Exp3PlusPlus'>, *args, **kwargs)[source]

Bases: Policies.Exp3R.DriftDetection_IndexPolicy

The Exp3.R++ policy for non-stationary bandits.

__init__(nbArms, policy=<class 'Policies.Exp3PlusPlus.Exp3PlusPlus'>, *args, **kwargs)[source]

New policy.

__module__ = 'Policies.Exp3R'
__str__()[source]

-> str

Policies.Exp3S module

The historical Exp3.S algorithm for non-stationary bandits.

  • Reference: [[“The nonstochastic multiarmed bandit problem”, P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire, SIAM journal on computing, 2002]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8735&rep=rep1&type=pdf)

  • It is a simple extension of the Exp3 policy:

    >>> policy = Exp3S(nbArms, C=1)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

class Policies.Exp3S.Exp3S(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]

Bases: Policies.Exp3.Exp3

The historical Exp3.S algorithm for non-stationary bandits.

__init__(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]

New policy.

weights = None

Weights on the arms

__str__()[source]

-> str

gamma

Constant \(\gamma_t = \gamma\).

alpha

Constant \(\alpha_t = \alpha\).

startGame()[source]

Start with uniform weights.

trusts

Update the trusts probabilities according to Exp3 formula, and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

  • With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
  • But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
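As a standalone illustration of this update, a small NumPy sketch with made-up values (not the class's internal code):

    >>> import numpy as np
    >>> K, gamma_t = 3, 0.1
    >>> weights = np.full(K, 1.0 / K)                  # uniform weights at the start
    >>> trusts = (1 - gamma_t) * weights + gamma_t / K
    >>> trusts /= trusts.sum()
    >>> k, reward, N_k = 1, 0.8, 5                     # arm k was pulled N_k times so far
    >>> r_tilde = reward / trusts[k]                   # unbiased estimate of the reward
    >>> weights[k] *= np.exp(r_tilde / (gamma_t * N_k))
    >>> weights /= weights.sum()                       # renormalize the weights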
__module__ = 'Policies.Exp3S'
Policies.ExploreThenCommit module

Different variants of the Explore-Then-Commit policy.

Warning

They sometimes do not work empirically as well as the theory predicted…

Warning

TODO I should factor all this code and write all of them in a more “unified” way…

Policies.ExploreThenCommit.GAP = 0.1

Default value for the gap, \(\Delta = \min_{i\neq j} \mu_i - \mu_j\), \(\Delta = 0.1\) as in many basic experiments.

class Policies.ExploreThenCommit.ETC_KnownGap(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).

__init__(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

max_t = None

Time until pure exploitation, m_ steps in each arm.

__str__()[source]

-> str

epsilon

1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by:

\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
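For instance, a quick numeric check of this quantity (the values of \(T\) and \(\Delta\) are only illustrative):

    >>> from math import log, floor
    >>> T, Delta = 10000, 0.1
    >>> T_0 = floor((4 / Delta**2) * log(T * Delta**2 / 4))
    >>> T_0   # exploration lasts about 1300 steps before committing
    1287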
__module__ = 'Policies.ExploreThenCommit'
Policies.ExploreThenCommit.ALPHA = 4

Default value for parameter \(\alpha\) for ETC_RandomStop

class Policies.ExploreThenCommit.ETC_RandomStop(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Variant of the Explore-Then-Commit policy, with known horizon \(T\) and random stopping time. Uniform exploration until the stopping time.

__init__(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

stillRandom = None

Still randomly exploring?

__str__()[source]

-> str

epsilon

1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by:

\[\tau = \inf\{ t \in\mathbb{N},\; \max_{i \neq j} \| \widehat{X_i}(t) - \widehat{X_j}(t) \| > \sqrt{\frac{4 \log(T/t)}{t}} \}.\]
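A standalone sketch of this stopping test at a given time \(t\), from empirical means (illustrative values, not the class internals):

    >>> import numpy as np
    >>> T, t = 10000, 200
    >>> means = np.array([0.30, 0.50, 0.70])         # empirical means at time t
    >>> largest_gap = np.max(means) - np.min(means)  # = max_{i != j} |X_i(t) - X_j(t)|
    >>> threshold = np.sqrt(4 * np.log(T / t) / t)
    >>> bool(largest_gap > threshold)                # True means: stop exploring now
    True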
__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_FixedBudget(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.

__init__(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

max_t = None

Time until pure exploitation.

round_robin_index = None

Internal index to keep track of the Round-Robin phase

best_identified_arm = None

Arm on which we commit, not defined in the beginning.

__str__()[source]

-> str

choice()[source]

For n rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with highest empirical average.

\[n = \lfloor \frac{2}{\Delta^2} \mathcal{W}(\frac{T^2 \Delta^4}{32 \pi}) \rfloor.\]
  • Where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(W(y) \exp(W(y)) = y\) for any \(y > 0\) (and computed with scipy.special.lambertw()).
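A quick numeric sketch of this quantity, using scipy.special.lambertw() (the values of \(T\) and \(\Delta\) are only illustrative):

    >>> from math import floor, pi
    >>> from scipy.special import lambertw
    >>> T, Delta = 10000, 0.1
    >>> n = floor((2 / Delta**2) * lambertw(T**2 * Delta**4 / (32 * pi)).real)
    >>> # n is about 676 for these values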
epsilon

1 while \(t \leq n\), 0 after.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.EpsilonGreedy.EpsilonGreedy

Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.

__init__(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

round_robin_index = None

Internal index to keep track of the Round-Robin phase

best_identified_arm = None

Arm on which we commit, not defined in the beginning.

__str__()[source]

-> str

choice()[source]

Choose each arm sequentially in a Round-Robin phase, as long as the following criterion is not satisfied, then commit to the arm with highest empirical average.

\[(t/2) \max_{i \neq j} |\hat{\mu_i} - \hat{\mu_j}| < \log(T \Delta^2).\]
stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

epsilon

1 while not fixed, 0 after.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_SPRT(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria

The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).

stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.ETC_BAI(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria

The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\).

__init__(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

stopping_criteria()[source]

Test if we should stop the Round-Robin phase.

__module__ = 'Policies.ExploreThenCommit'
class Policies.ExploreThenCommit.DeltaUCB(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).

__init__(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

gap = None

Known gap parameter for the stopping rule.

alpha = None

Parameter \(\alpha\) in the formula (4 by default).

epsilon_T = None

Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).

__str__()[source]

-> str

choice()[source]

Choose between the most chosen and the least chosen arm, based on the following criterion:

\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t).\end{split}\]
\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log(\frac{T}{N_{A_{t,\min}}})}{N_{A_{t,\min}}}} \\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T\end{split}\]
\[\begin{split}A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\end{split}\]
__module__ = 'Policies.ExploreThenCommit'
Policies.FEWA module

author: Julien Seznec

Filtering on Expanding Window Algorithm for rotting bandits.

Reference: [Seznec et al., 2019a] Rotting bandits are not harder than stochastic ones; Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko ; Proceedings of Machine Learning Research, PMLR 89:2564-2572, 2019. http://proceedings.mlr.press/v89/seznec19a.html https://arxiv.org/abs/1811.11043 (updated version)

Reference : [Seznec et al., 2019b] A single algorithm for both rested and restless rotting bandits (WIP) Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko

class Policies.FEWA.EFF_FEWA(nbArms, alpha=0.06, subgaussian=1, m=None, delta=None, delay=False)[source]

Bases: Policies.BasePolicy.BasePolicy

Efficient Filtering on Expanding Window Average. The efficient trick is described in [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043] (m=2) and [Seznec et al., 2019b, WIP] (m<=2). We use the confidence level \(\delta_t = \frac{1}{t^\alpha}\).

__init__(nbArms, alpha=0.06, subgaussian=1, m=None, delta=None, delay=False)[source]

New policy.

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

choice()[source]

Not defined.

_append_thresholds(w)[source]
_compute_windows(first_window, add_size)[source]
_inlog()[source]
startGame()[source]

Start the game (fill pulls and rewards with 0).

__module__ = 'Policies.FEWA'
class Policies.FEWA.FEWA(nbArms, subgaussian=1, alpha=4, delta=None)[source]

Bases: Policies.FEWA.EFF_FEWA

Filtering on Expanding Window Average. Reference: [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043]. FEWA is equivalent to EFF_FEWA for \(m < 1+1/T\) [Seznec et al., 2019b, WIP]. This implementation is valid for \(T < 10^{15}\). For \(T > 10^{15}\), FEWA will have time and memory issues, as its time and space complexity is \(\mathcal{O}(KT)\) per round.

__init__(nbArms, subgaussian=1, alpha=4, delta=None)[source]

New policy.

__str__()[source]

-> str

__module__ = 'Policies.FEWA'
Policies.GLR_UCB module

The GLR-UCB policy and variants, for non-stationary bandits.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)

  • It runs on top of a simple policy, e.g., UCB, and BernoulliGLR_IndexPolicy is a wrapper:

    >>> policy = BernoulliGLR_IndexPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

It can only work on basic index policies based on empirical averages (and an exploration bias), like UCB; it cannot work on Bayesian policies (for which we would have to store all the past observations in order to restart from a truncated history)!

Policies.GLR_UCB.VERBOSE = False

Whether to be verbose when doing the change detection algorithm.

Policies.GLR_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.GLR_UCB.PER_ARM_RESTART = True

Should we reset one arm's empirical average or all of them? Default is True, it's usually more efficient!

Policies.GLR_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm or simply reset one arm's empirical average? Default is False, it's usually more efficient!

Policies.GLR_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

XXX Be lazy and try to detect changes only every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 0 disables this feature, and 10 should speed up the test by about x10.

Policies.GLR_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS = 10

XXX Be lazy and try to detect changes for \(s\) taking steps of size steps_s. Default is steps_s=1, but using only steps_s=2 should already speed up the test by about 2. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 1 disables this feature, and 10 should speed up the test by about x10.

Policies.GLR_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that the localization always helps improve learning, so the default value is set to True.

Policies.GLR_UCB.eps = 1e-10

Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]

Policies.GLR_UCB.klBern(x, y)[source]

Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
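A minimal standalone sketch of this divergence, with the values clipped away from 0 and 1 to avoid infinite values (illustrative, not the kullback module itself):

    >>> from math import log
    >>> def klBern(x, y, eps=1e-10):
    ...     x = min(max(x, eps), 1 - eps)  # truncate to [eps, 1 - eps]
    ...     y = min(max(y, eps), 1 - eps)
    ...     return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))
    >>> round(klBern(0.5, 0.1), 4)
    0.5108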
Policies.GLR_UCB.klGauss(x, y, sig2x=1)[source]

Kullback-Leibler divergence for Gaussian distributions of means x and y and variances sig2x and sig2y, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_x^2)\):

\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]

See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties

Policies.GLR_UCB.threshold_GaussianGLR(t, horizon=None, delta=None, variant=None)[source]

Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

  • The threshold is computed as (with \(t_0 = 0\)):
\[\beta(t_0, t, \delta) := \left(1 + \frac{1}{t - t_0 + 1}\right) 2 \log\left(\frac{2 (t - t_0) \sqrt{(t - t_0) + 2}}{\delta}\right).\]
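A quick numeric sketch of this threshold with \(t_0 = 0\) (values only illustrative):

    >>> from math import log, sqrt
    >>> t, delta = 1000, 0.01
    >>> beta = (1 + 1 / (t + 1)) * 2 * log(2 * t * sqrt(t + 2) / delta)
    >>> # beta is roughly 31.4 for these values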
Policies.GLR_UCB.function_h(u)[source]

The function \(h(u) = u - \log(u)\).

Policies.GLR_UCB.function_h_minus_one(x)[source]

The inverse function of \(h(u)\), that is \(h^{-1}(x) = u \Leftrightarrow h(u) = x\). It is given by the Lambert W function, see scipy.special.lambertw():

\[h^{-1}(x) = - \mathcal{W}(- \exp(-x)).\]
  • Example:
>>> np.random.seed(105)
>>> y = np.random.randn() ** 2
>>> print(f"y = {y}")
y = 0.060184682907834595
>>> x = function_h(y)
>>> print(f"h(y) = {x}")
h(y) = 2.8705220786966508
>>> z = function_h_minus_one(x)
>>> print(f"h^-1(x) = {z}")
h^-1(x) = 0.060184682907834595
>>> assert np.isclose(z, y), f"Error: h^-1(h(y)) = z = {z} should be very close to y = {y}..."
Policies.GLR_UCB.constant_power_function_h = 1.5

The constant \(\frac{3}{2}\), used in the definition of functions \(h\), \(h^{-1}\), \(\tilde{h}\) and \(\mathcal{T}\).

Policies.GLR_UCB.threshold_function_h_tilde = 3.801770285137458

The constant \(h^{-1}(1/\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).

Policies.GLR_UCB.constant_function_h_tilde = -0.90272045571788

The constant \(\log(\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).

Policies.GLR_UCB.function_h_tilde(x)[source]

The function \(\tilde{h}(x)\), defined by:

\[\begin{split}\tilde{h}(x) = \begin{cases} e^{1/h^{-1}(x)} h^{-1}(x) & \text{ if } x \ge h^{-1}(1/\ln (3/2)), \\ (3/2) (x-\ln \ln (3/2)) & \text{otherwise}. \end{cases}\end{split}\]
Policies.GLR_UCB.zeta_of_two = 1.6449340668482264

The constant \(\zeta(2) = \frac{\pi^2}{6}\).

Policies.GLR_UCB.function_T_mathcal(x)[source]

The function \(\mathcal{T}(x)\), defined by:

\[\mathcal{T}(x) = 2 \tilde h\left(\frac{h^{-1}(1+x) + \ln(2\zeta(2))}{2}\right).\]
Policies.GLR_UCB.approximation_function_T_mathcal(x)[source]

An efficiently computed approximation of \(\mathcal{T}(x)\), valid for \(x \geq 5\):

\[\mathcal{T}(x) \simeq x + 4 \log(1 + x + \sqrt{2 x}).\]
Policies.GLR_UCB.threshold_BernoulliGLR(t, horizon=None, delta=None, variant=None)[source]

Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

Warning

This is still experimental, you can try different variants of the threshold function:

  • Variant #0 (default) is:
\[\beta(t, \delta) := \log\left(\frac{3 t^{3/2}}{\delta}\right) = \log(\frac{1}{\delta}) + \log(3) + 3/2 \log(t).\]
  • Variant #1 is smaller:
\[\beta(t, \delta) := \log(\frac{1}{\delta}) + \log(1 + \log(t)).\]
  • Variant #2 is using \(\mathcal{T}\):
\[\beta(t, \delta) := 2 \mathcal{T}\left(\frac{\log(2 t^{3/2}) / \delta}{2}\right) + 6 \log(1 + \log(t)).\]
  • Variant #3 is using \(\tilde{\mathcal{T}}(x) = x + 4 \log(1 + x + \sqrt{2x})\) an approximation of \(\mathcal{T}(x)\) (valid and quite accurate as soon as \(x \geq 5\)):
\[\beta(t, \delta) := 2 \tilde{\mathcal{T}}\left(\frac{\log(2 t^{3/2}) / \delta}{2}\right) + 6 \log(1 + \log(t)).\]
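For instance, a quick numeric check of the default variant #0 (values only illustrative):

    >>> from math import log, isclose
    >>> t, delta = 1000, 0.01
    >>> beta = log(3 * t**1.5 / delta)
    >>> isclose(beta, log(1 / delta) + log(3) + 1.5 * log(t))  # same quantity, two writings
    True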
Policies.GLR_UCB.EXPONENT_BETA = 1.01

The default value of parameter \(\beta\) for the function decreasing_alpha__GLR().

Policies.GLR_UCB.ALPHA_T1 = 0.05

The default value of parameter \(\alpha_{t=1}\) for the function decreasing_alpha__GLR().

Policies.GLR_UCB.decreasing_alpha__GLR(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05)[source]

Either use a fixed alpha, or compute it with an exponential decay (if alpha0=None).

Note

I am currently exploring the following variant (November 2018):

  • The probability of uniform exploration, \(\alpha\), is computed as a function of the current time:
\[\forall t>0, \alpha = \alpha_t := \alpha_{t=1} \frac{1}{\max(1, t^{\beta})}.\]
  • with \(\beta > 1\), \(\beta\) = exponentBeta (=1.01 by default) and \(\alpha_{t=1} < 1\), \(\alpha_{t=1}\) = alpha_t1 (=0.05 by default).
  • the only requirement on \(\alpha_t\) seems to be that \(\sum_{t=1}^{T} \alpha_t < +\infty\) (i.e., that the sum is finite), which is the case for \(\alpha_t = \alpha = \frac{1}{T}\), but also for any \(\alpha_t = \frac{\alpha_1}{t^{\beta}}\) with \(\beta>1\) (cf. Riemann series); see the small numeric sketch below.
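A small numeric sketch of this decay (values only illustrative):

    >>> t, alpha_t1, exponentBeta = 100, 0.05, 1.01
    >>> alpha_t = alpha_t1 / max(1, t ** exponentBeta)
    >>> # alpha_t decays slightly faster than 1/t, here about 4.8e-4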
Policies.GLR_UCB.smart_delta_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1)[source]

Compute a smart estimate of the optimal value for the confidence level \(\delta\), with scaleFactor \(= \delta_0\in(0,1)\) a constant.

  • If per_arm_restart is True (Local option):
\[\delta = \frac{\delta_0}{\sqrt{K \Upsilon_T T}}.\]
  • If per_arm_restart is False (Global option):
\[\delta = \frac{\delta_0}{\sqrt{\Upsilon_T T}}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).

Policies.GLR_UCB.smart_alpha_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1)[source]

Compute a smart estimate of the optimal value for the fixed or random forced exploration probability \(\alpha\) (or tracking based), with scaleFactor \(= \alpha_0\in(0,1)\) a constant.

  • If per_arm_restart is True (Local option):
\[\alpha = \alpha_0 \times \sqrt{\frac{K \Upsilon_T}{T} \log(T)}.\]
  • If per_arm_restart is False (Global option):
\[\alpha = \alpha_0 \times \sqrt{\frac{\Upsilon_T}{T} \log(T)}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).

class Policies.GLR_UCB.GLR_IndexPolicy(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The GLR-UCB generic policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for 1-dimensional exponential families.

  • It works for any 1-dimensional exponential family, you just have to give a kl function.
  • For instance kullback.klBern(), for Bernoulli distributions, gives BernoulliGLR_IndexPolicy,
  • And kullback.klGauss(), for univariate Gaussian distributions, gives GaussianGLR_IndexPolicy.
  • threshold_function computes the threshold \(\beta(t, \delta)\), it can be for instance threshold_GaussianGLR() or threshold_BernoulliGLR().
  • From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__init__(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

New policy.

horizon = None

The horizon \(T\).

max_nb_random_events = None

The number of breakpoints \(\Upsilon_T\).

use_localization = None

Experimental option: use localization of the break-point, i.e., restart the arm's memory by keeping the observations \(s+1, \dots, n\) instead of just the last one.

delta = None

The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{\sqrt{T}}\) if horizon is given and delta=None but \(\Upsilon_T\) is unknown. Defaults to \(\delta=\frac{1}{\sqrt{\Upsilon_T T}}\) if both \(T\) and \(\Upsilon_T\) are given (horizon and max_nb_random_events).

kl = None

The parametrized Kullback-Leibler divergence (\(\mathrm{kl}(x,y) = KL(D(x),D(y))\)) for the 1-dimensional exponential family \(x\mapsto D(x)\). Example: kullback.klBern() or kullback.klGauss().

lazy_try_value_s_only_x_steps = None

Be lazy and try to detect changes for \(s\) taking steps of size steps_s.

compute_threshold_h(t)[source]

Compute the threshold \(h\) with _threshold_function.

proba_random_exploration

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

__str__()[source]

-> str

getReward(arm, reward)[source]

Do as CD_UCB to handle the new reward, and also, update the internal times of each arm for the indexes of klUCB_forGLR (or other index policies), which use \(f(t - \tau_i(t))\) for the exploration function of each arm \(i\) at time \(t\), where \(\tau_i(t)\) denotes the (last) restart time of the arm.

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the Generalized Likelihood Ratio test (GLR) and the kl function.

  • For each time step \(s\) between \(t_0=0\) and \(t\), compute:
\[G^{\mathrm{kl}}_{t_0:s:t} = (s-t_0+1) \mathrm{kl}(\mu_{t_0,s}, \mu_{t_0,t}) + (t-s) \mathrm{kl}(\mu_{s+1,t}, \mu_{t_0,t}).\]
  • The change is detected if there is a time \(s\) such that \(G^{\mathrm{kl}}_{t_0:s:t} > h\), where threshold_h is the threshold of the test,
  • And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).

Warning

This is computationally costly, so an easy way to speed up this test is to use lazy_try_value_s_only_x_steps \(= \mathrm{Step_s}\) for a small value (e.g., 10), so as not to test for all \(s\in[t_0, t-1]\) but only for \(s\in[t_0, t-1]\) such that \(s \mod \mathrm{Step_s} = 0\) (e.g., one out of every 10 steps); a small standalone sketch of this lazy scan is given below.
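A standalone NumPy sketch of this lazy scan over \(s\) for one arm's data, with a generic kl function (illustrative, not the class internals; the threshold value is made up):

    >>> import numpy as np
    >>> def glr_scan(data, kl, h, step_s=10):
    ...     """Return True if the GLR statistic exceeds the threshold h for some split s."""
    ...     t = len(data) - 1
    ...     mu_all = np.mean(data)                 # mu_{t0, t}, with t0 = 0
    ...     for s in range(0, t, step_s):          # lazy scan: one s out of every step_s
    ...         mu_left = np.mean(data[:s + 1])    # mu_{t0, s}
    ...         mu_right = np.mean(data[s + 1:])   # mu_{s+1, t}
    ...         glr = (s + 1) * kl(mu_left, mu_all) + (t - s) * kl(mu_right, mu_all)
    ...         if glr > h:
    ...             return True
    ...     return False
    >>> klGauss = lambda x, y: 2.0 * (x - y) ** 2  # Gaussian kl with sigma^2 = 1/4
    >>> data = np.concatenate([np.full(500, 0.2), np.full(500, 0.8)])
    >>> glr_scan(data, klGauss, h=20.0)            # a clear change of mean is detected
    True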

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GLR_IndexPolicy_WithTracking(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random, but is instead based on a tracking of the arms that haven't been explored enough.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
choice()[source]

If any arm is not explored enough (\(n_k \leq \frac{\alpha}{K} \times (t - n_k)\)), play uniformly at random one of these arms; otherwise, pass the call to choice() of the underlying policy.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

  • If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
choice()[source]

For some time steps, play uniformly at random one of these arms, otherwise, pass the call to choice() of the underlying policy.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (ie, \(\sigma^2\) known and fixed).

__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

New policy.

_sig2 = None

Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Default to \(\sigma^2 = \frac{1}{4}\).

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The BernoulliGLR-UCB policy for non-stationary bandits, for Bernoulli distributions.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__init__(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

New policy.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithTracking(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithDeterministicExploration(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (ie, \(\sigma^2\) known and fixed), but with our threshold designed for the sub-Bernoulli case.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

New policy.

_sig2 = None

Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Default to \(\sigma^2 = \frac{1}{4}\).

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but is based on a tracking of arms that haven't been explored enough, with our threshold designed for the sub-Bernoulli case.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed, but with our threshold designed for the sub-Bernoulli case.

  • Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
__module__ = 'Policies.GLR_UCB'
Policies.GLR_UCB.SubGaussianGLR_DELTA = 0.01

Default confidence level for SubGaussianGLR_IndexPolicy.

Policies.GLR_UCB.SubGaussianGLR_SIGMA = 0.25

By default, SubGaussianGLR_IndexPolicy assumes distributions are 0.25-sub Gaussian, like Bernoulli or any distributions with support on \([0,1]\).

Policies.GLR_UCB.SubGaussianGLR_JOINT = True

Whether to use the joint or disjoint threshold function (threshold_SubGaussianGLR_joint() or threshold_SubGaussianGLR_disjoint()) for SubGaussianGLR_IndexPolicy.

Policies.GLR_UCB.threshold_SubGaussianGLR_joint(s, t, delta=0.01, sigma=0.25)[source]

Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) according to this formula:

\[b^{\text{joint}}_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
Policies.GLR_UCB.threshold_SubGaussianGLR_disjoint(s, t, delta=0.01, sigma=0.25)[source]

Compute the threshold \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\) according to this formula:

\[b^{\text{disjoint}}_{t_0}(s,t,\delta) := \sqrt{2} \sigma \sqrt{\frac{1 + \frac{1}{s - t_0 + 1}}{s - t_0 + 1} \log\left( \frac{4 \sqrt{s - t_0 + 2}}{\delta}\right)} + \sqrt{\frac{1 + \frac{1}{t - s + 1}}{t - s + 1} \log\left( \frac{4 (t - t_0) \sqrt{t - s + 1}}{\delta}\right)}.\]
Policies.GLR_UCB.threshold_SubGaussianGLR(s, t, delta=0.01, sigma=0.25, joint=True)[source]

Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) or \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\).

class Policies.GLR_UCB.SubGaussianGLR_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The SubGaussianGLR-UCB policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for sub-Gaussian distributions.

  • It works for any sub-Gaussian family of distributions, being \(\sigma^2\)-sub Gaussian with known \(\sigma\).
  • From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
__init__(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

New policy.

horizon = None

The horizon \(T\).

max_nb_random_events = None

The number of breakpoints \(\Upsilon_T\).

delta = None

The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{T}\) if horizon is given and delta=None.

sigma = None

Parameter \(\sigma\) for the Sub-Gaussian-GLR test.

joint = None

Parameter joint for the Sub-Gaussian-GLR test.

lazy_try_value_s_only_x_steps = None

Be lazy and try to detect changes for \(s\) taking steps of size steps_s.

use_localization = None

Experimental option: use localization of the break-point, i.e., restart the arm's memory by keeping the observations \(s+1, \dots, n\) instead of just the last one.

compute_threshold_h(s, t)[source]

Compute the threshold \(h\) with threshold_SubGaussianGLR().

__module__ = 'Policies.GLR_UCB'
proba_random_exploration

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

__str__()[source]

-> str

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the non-parametric sub-Gaussian Generalized Likelihood Ratio test (GLR), which works like this:

  • For each time step \(s\) between \(t_0=0\) and \(t\), compute:
\[G^{\text{sub-}\sigma}_{t_0:s:t} = |\mu_{t_0,s} - \mu_{s+1,t}|.\]
  • The change is detected if there is a time \(s\) such that \(G^{\text{sub-}\sigma}_{t_0:s:t} > b_{t_0}(s,t,\delta)\), where \(b_{t_0}(s,t,\delta)\) is the threshold of the test,
  • The threshold is computed as:
\[b_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
  • And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).
Policies.GenericAggregation module

The GenericAggregation aggregation bandit algorithm: use a bandit policy A (master), managing several “slave” algorithms, \(A_1, ..., A_N\).

  • At every step, one slave algorithm A_i is selected, by the master policy A.
  • Then its decision is listened to and played by the master algorithm, and a feedback reward is received.
  • All slaves receive the observation (arm, reward).
  • The master also receives the same observation.
class Policies.GenericAggregation.GenericAggregation(nbArms, master=None, children=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The GenericAggregation aggregation bandit algorithm.

__init__(nbArms, master=None, children=None, lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms.

lower = None

Lower values for rewards.

amplitude = None

Larger values for rewards.

last_choice = None

Remember the index of the last child trusted for a decision.

children = None

List of slave algorithms.

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

startGame()[source]

Start the game for each child, and for the master.

getReward(arm, reward)[source]

Give reward for each child, and for the master.

choice()[source]

Trust one of the slaves and listen to its choice.

choiceWithRank(rank=1)[source]

Trust one of the slaves and listen to its choiceWithRank.

choiceFromSubSet(availableArms='all')[source]

Trust one of the slaves and listen to its choiceFromSubSet.

choiceMultiple(nb=1)[source]

Trust one of the slaves and listen to its choiceMultiple.

__module__ = 'Policies.GenericAggregation'
choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Trust one of the slaves and listen to its choiceIMP.

estimatedOrder()[source]

Trust one of the slaves and listen to its estimatedOrder.

  • Return the estimated order of the arms, as a permutation on \([0,...,K-1]\) that would order the arms by increasing means.
estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M-best arms. Identify the M-best set.

Policies.GenericAggregation.random() → x in the interval [0, 1).
Policies.GreedyOracle module

author: Julien Seznec

Oracle and near-minimax policy for rotting bandits without noise.

Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf] Tight Policy Regret Bounds for Improving and Decaying Bandits. Hoda Heidari, Michael Kearns, Aaron Roth. International Joint Conference on Artificial Intelligence (IJCAI) 2016, 1562.

class Policies.GreedyOracle.GreedyPolicy(nbArms)[source]

Bases: Policies.IndexPolicy.IndexPolicy

Greedy Policy for rotting bandits (A2 in the reference below). Selects arm with best last value. Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf]

__init__(nbArms)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized, by default it can not be vectorized automatically.

computeIndex(arm)[source]

Compute the mean of the last h values.

startGame()[source]

Initialize the policy for a new game.

__module__ = 'Policies.GreedyOracle'
class Policies.GreedyOracle.GreedyOracle(nbArms, arms)[source]

Bases: Policies.IndexPolicy.IndexPolicy

Greedy Oracle for rotting bandits (A0 in the reference below). Look 1 step forward and select next best value. Optimal policy for rotting bandits problem. Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf]

__init__(nbArms, arms)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
computeIndex(arm)[source]

Compute the current index of arm ‘arm’.

__module__ = 'Policies.GreedyOracle'
Policies.Hedge module

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

Policies.Hedge.EPSILON = 0.01

Default \(\varepsilon\) parameter.

class Policies.Hedge.Hedge(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf).

__init__(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

New policy.

weights = None

Weights on the arms

startGame()[source]

Start with uniform weights.

__str__()[source]

-> str

epsilon

Constant \(\varepsilon_t = \varepsilon\).

trusts

Update the trusts probabilities according to Hedge formula, and the parameter \(\varepsilon_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \varepsilon_t) w_k(t) + \varepsilon_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm k.

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\varepsilon_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M-best arms. Identify the M-best set.

__module__ = 'Policies.Hedge'
class Policies.Hedge.HedgeWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Hedge.Hedge

Hedge with fixed epsilon, \(\varepsilon_t = \varepsilon_0\), chosen with a knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

epsilon

Fixed temperature, small, knowing the horizon: \(\varepsilon_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
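For instance (values only illustrative):

    >>> from math import sqrt, log
    >>> K, T = 10, 10000
    >>> epsilon = sqrt(2 * log(K) / (T * K))
    >>> round(epsilon, 4)
    0.0068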

__module__ = 'Policies.Hedge'
class Policies.Hedge.HedgeDecreasing(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Hedge.Hedge

Hedge with decreasing parameter \(\varepsilon_t\).

__str__()[source]

-> str

epsilon

Decreasing epsilon with the time: \(\varepsilon_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).

__module__ = 'Policies.Hedge'
Policies.IMED module

The IMED policy of [Honda & Takemura, JMLR 2015].

Policies.IMED.Dinf(x=None, mu=None, kl=<function klBern>, lowerbound=0, upperbound=1, precision=1e-06, max_iterations=50)[source]

The generic Dinf index computation.

  • x: value of the cum reward,
  • mu: upperbound on the mean y,
  • kl: the KL divergence to be used (klBern(), klGauss(), etc),
  • lowerbound, upperbound=1: the known bound of the values y and x,
  • precision=1e-6: the threshold at which to stop the search,
  • max_iterations: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[D_{\inf}(x, d) \simeq \inf_{\max(\mu, \mathrm{lowerbound}) \leq y \leq \mathrm{upperbound}} \mathrm{kl}(x, y).\]

Note

It uses a call to scipy.optimize.minimize_scalar(). If this fails, it uses a bisection search, with one call to kl for each step of the bisection.

class Policies.IMED.IMED(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

Bases: Policies.DMED.DMED

The IMED policy of [Honda & Takemura, JMLR 2015].

__init__(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]

New policy.

__str__()[source]

-> str

one_Dinf(x, mu)[source]

Compute the \(D_{\inf}\) solution, for one value of x, and one value for mu.

Dinf(xs, mu)[source]

Compute the \(D_{\inf}\) solution, for a vector of value of xs, and one value for mu.

choice()[source]

Choose an arm with minimal index (uniformly at random):

\[A(t) \sim U(\arg\min_{1 \leq k \leq K} I_k(t)).\]

Where the indexes are:

\[I_k(t) = N_k(t) D_{\inf}(\hat{\mu_{k}}(t), \max_{k'} \hat{\mu_{k'}}(t)) + \log(N_k(t)).\]
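A rough NumPy sketch of these indexes for Bernoulli rewards, where \(D_{\inf}(x, \mu)\) reduces to \(\mathrm{kl}(x, \mu)\) whenever \(x \leq \mu\) (illustrative values, not the class internals):

    >>> import numpy as np
    >>> def klBern(x, y, eps=1e-10):
    ...     x, y = np.clip(x, eps, 1 - eps), np.clip(y, eps, 1 - eps)
    ...     return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))
    >>> pulls = np.array([20, 50, 30])        # N_k(t)
    >>> means = np.array([0.40, 0.55, 0.50])  # empirical means
    >>> indexes = pulls * klBern(means, np.max(means)) + np.log(pulls)
    >>> int(np.argmin(indexes))               # IMED plays an arm with minimal index
    2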
__module__ = 'Policies.IMED'
Policies.IndexPolicy module

Generic index policy.

  • If rewards are not in [0, 1], be sure to give the lower value and the amplitude. Eg, if rewards are in [-3, 3], lower = -3, amplitude = 6.
class Policies.IndexPolicy.IndexPolicy(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

Class that implements a generic index policy.

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
index = None

Numerical index for each arm

startGame()[source]

Initialize the policy for a new game.

computeIndex(arm)[source]

Compute the current index of arm ‘arm’.

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized, by default it can not be vectorized automatically.

choice()[source]

In an index policy, choose an arm with maximal index (uniformly at random):

\[A(t) \sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\]

Warning

In almost all cases, there is a unique arm with maximal index, so we lose a lot of time with this generic code, but I couldn't find a way to be more efficient without losing generality; a small sketch of the random tie-breaking is given below.
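For example, a minimal NumPy sketch of this uniformly-random tie-breaking over the maximal indexes (illustrative):

    >>> import numpy as np
    >>> index = np.array([0.3, 0.7, 0.7, 0.1])
    >>> best = np.nonzero(index == np.max(index))[0]  # all arms with maximal index
    >>> arm = np.random.choice(best)                  # uniform choice among them
    >>> int(arm) in (1, 2)
    True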

choiceWithRank(rank=1)[source]

In an index policy, choose an arm whose index is the (1+rank)-th best (uniformly at random).

  • For instance, if rank is 1, the best arm is chosen (the 1-st best).
  • If rank is 4, the 4-th best arm is chosen.

Note

This method is required for the PoliciesMultiPlayers.rhoRand policy.

choiceFromSubSet(availableArms='all')[source]

In an index policy, choose the best arm from sub-set availableArms (uniformly at random).

choiceMultiple(nb=1)[source]

In an index policy, choose nb arms with maximal indexes (uniformly at random).

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

In an index policy, the IMP strategy is hybrid: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing means.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M-best arms. Identify the M-best set.

__module__ = 'Policies.IndexPolicy'
Policies.LM_DSEE module

The LM-DSEE policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

This implementation is still experimental!

class Policies.LM_DSEE.State

Bases: enum.Enum

Different states during the LM-DSEE algorithm

Exploitation = 2
Exploration = 1
__module__ = 'Policies.LM_DSEE'
Policies.LM_DSEE.VERBOSE = False

Whether to be verbose when doing the search for valid parameter \(\ell\).

Policies.LM_DSEE.parameter_ell(a, N, b, gamma, verbose=False, max_value_on_l=1000000)[source]

Look for the smallest value of the parameter \(\ell\) that satisfies the following equations:

class Policies.LM_DSEE.LM_DSEE(nbArms, nu=0.5, DeltaMin=0.5, a=1, b=0.25, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

The LM-DSEE policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

__init__(nbArms, nu=0.5, DeltaMin=0.5, a=1, b=0.25, *args, **kwargs)[source]

New policy.

a = None

Parameter \(a\) for the LM-DSEE algorithm.

b = None

Parameter \(b\) for the LM-DSEE algorithm.

l = None

Parameter \(\ell\) for the LM-DSEE algorithm, as computed by the function parameter_ell().

gamma = None

Parameter \(\gamma\) for the LM-DSEE algorithm.

rho = None

Parameter \(\rho = \frac{1-\nu}{1+\nu}\) for the LM-DSEE algorithm.

phase = None

Current phase, exploration or exploitation.

current_exploration_arm = None

Currently explored arm.

current_exploitation_arm = None

Currently exploited arm.

batch_number = None

Number of batches

length_of_current_phase = None

Length of the current phase, either computed from length_exploration_phase() or length_exploitation_phase().

step_of_current_phase = None

Timer inside the current phase.

all_rewards = None

Memory of all the rewards. A list per arm. Growing list until restart of that arm?

__str__()[source]

-> str

startGame()[source]

Start the game (fill pulls and rewards with 0).

length_exploration_phase(verbose=False)[source]

Compute the value of the current exploration phase:

\[L_1(k) = L(k) = \lceil \gamma \log(k^{\rho} l b)\rceil.\]

Warning

I think there is a typo in the paper, as their formulas are weird (e.g., \(al\) is defined from \(a\)). See parameter_ell().

length_exploitation_phase(verbose=False)[source]

Compute the value of the current exploitation phase:

\[L_2(k) = \lceil a k^{\rho} l \rceil - K L_1(k).\]

Warning

I think there is a typo in the paper, as their formulas are weird (e.g., \(al\) is defined from \(a\)). See parameter_ell().

getReward(arm, reward)[source]

Get a reward from an arm.

__module__ = 'Policies.LM_DSEE'
choice()[source]

Choose an arm following the different phases of growing lengths according to the LM-DSEE algorithm.

Policies.LearnExp module

The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.

The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).

  • At every step, one slave algorithm is selected, by a random selection from a trust distribution on \([1,...,N]\).
  • Then its decision is listened to and played by the master algorithm, and a feedback reward is received.
  • The reward is reweighted by the trust of the listened algorithm, and given back to it with a certain probability.
  • The other slaves, whose decision was not even asked, receive nothing.
  • The trust probabilities are first uniform, \(P_i = 1/N\), and then, at every step, after receiving the feedback for one arm k (the reward), the trust \(P_i\) in each slave \(A_i\) is updated using the received reward.
  • The details about how to increase or decrease the probabilities are specified in the reference article.

Note

Reference: [[Learning to Use Learners’ Advice, A.Singla, H.Hassani & A.Krause, 2017](https://arxiv.org/abs/1702.04825)].

Policies.LearnExp.renormalize_reward(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]

Renormalize the reward to [0, 1]:

  • divide by (trust/mintrust) if unbiased is True.
  • simply project to [0, 1] if unbiased is False,

Warning

If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.

Policies.LearnExp.unnormalize_reward(reward, lower=0.0, amplitude=1.0)[source]

Project back reward to [lower, lower + amplitude].

Policies.LearnExp.UNBIASED = True

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or as unbiased estimators, \(r_t / p_t\), where \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).

Policies.LearnExp.ETA = 0.5

Default value for the constant Eta in (0, 1]

class Policies.LearnExp.LearnExp(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.

__init__(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms.

lower = None

Lower values for rewards.

amplitude = None

Larger values for rewards.

unbiased = None

Flag, see above.

eta = None

Constant parameter \(\eta\).

rate = None

Constant \(\eta / N\), faster computations if it is stored once.

children = None

List of slave algorithms.

last_choice = None

Remember the index of the last child trusted for a decision.

trusts = None

Initial trusts in the slaves \(p_j^t\). Default to uniform, but a prior can also be given.

weights = None

Weights \(w_j^t\).

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give reward for each child, and then update the trust probabilities.

choice()[source]

Trust one of the slaves and listen to its choice.

choiceWithRank(rank=1)[source]

Trust one of the slaves and listen to its choiceWithRank.

choiceFromSubSet(availableArms='all')[source]

Trust one of the slaves and listen to its choiceFromSubSet.

choiceMultiple(nb=1)[source]

Trust one of the slaves and listen to its choiceMultiple.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Trust one of the slaves and listen to its choiceIMP.

__module__ = 'Policies.LearnExp'
estimatedOrder()[source]

Trust one of the slaves and listen to its estimatedOrder.

  • Return the estimated order of the arms, as a permutation on \([0,...,K-1]\) that would order the arms by increasing means.
estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M-best arms. Identify the M-best set.

Policies.LearnExp.random() → x in the interval [0, 1).
Policies.MEGA module

MEGA: implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421).

The Multi-user epsilon-Greedy collision Avoiding (MEGA) algorithm is based on the epsilon-greedy algorithm introduced in [2], augmented by a collision avoidance mechanism that is inspired by the classical ALOHA protocol.

  • [2]: Finite-time analysis of the multi-armed bandit problem, P.Auer & N.Cesa-Bianchi & P.Fischer, 2002
class Policies.MEGA.MEGA(nbArms, p0=0.5, alpha=0.5, beta=0.5, c=0.1, d=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

MEGA: implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421).

__init__(nbArms, p0=0.5, alpha=0.5, beta=0.5, c=0.1, d=0.01, lower=0.0, amplitude=1.0)[source]
  • nbArms: number of arms.
  • p0: initial probability p(0); p(t) is the probability of persistence on the chosenArm at time t.
  • alpha: scaling in the update p(t+1) <- alpha * p(t) + (1 - alpha).
  • beta: exponent used in the interval [t, t + t^beta], from which a random time t_next(k) is sampled, until which the chosenArm is unavailable.
  • c, d: used to compute the exploration probability epsilon_t, cf. the function _epsilon_t().

Example:

>>> nbArms, p0, alpha, beta, c, d = 17, 0.5, 0.5, 0.5, 0.1, 0.01
>>> player1 = MEGA(nbArms, p0, alpha, beta, c, d)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, MEGA, nbArms, p0, alpha, beta, c, d).children
c = None

Parameter c

d = None

Parameter d

p0 = None

Parameter p0, should not be modified

p = None

Parameter p, can be modified

alpha = None

Parameter alpha

beta = None

Parameter beta

chosenArm = None

Last chosen arm

tnext = None

Only store the delta time

meanRewards = None

Mean rewards

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory.

choice()[source]

Choose an arm, as described by the MEGA algorithm.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the MEGA algorithm.

  • If there is no collision, a reward is received after pulling the arm.
handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

  • Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.

Note

We do not care which arm the collision occurred on.

_epsilon_t()[source]

Compute the value of decreasing epsilon(t), cf. Algorithm 1 in [Avner & Mannor, 2014](https://arxiv.org/abs/1404.5421).
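
For intuition, here is a sketch of a decreasing exploration schedule of the classical epsilon-greedy form from [Auer et al., 2002] (reference [2] above), from which MEGA is derived; this schedule is an assumption for illustration, and the exact constants used by _epsilon_t() may differ (see the source):

def epsilon_t(t, nbArms, c=0.1, d=0.01):
    """Decreasing exploration probability, epsilon-greedy style:
    epsilon_t = min(1, c * K / (d**2 * t)).
    (Assumed schedule; check _epsilon_t() for the exact constants used by MEGA.)"""
    return min(1.0, (c * nbArms) / (d ** 2 * max(1, t)))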

__module__ = 'Policies.MEGA'
Policies.MEGA.random() → x in the interval [0, 1).
Policies.MOSS module

The MOSS policy for bounded bandits. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).

class Policies.MOSS.MOSS(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The MOSS policy for bounded bandits. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
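
A vectorized sketch of this index, given the arrays of cumulated rewards \(X_k(t)\) and numbers of pulls \(N_k(t)\) and the current time t:

import numpy as np

def moss_indexes(rewards, pulls, t):
    """MOSS index for each arm: mean + sqrt(max(0, log(t / (K * N_k(t)))) / N_k(t))."""
    pulls = np.maximum(np.asarray(pulls), 1)   # avoid division by zero before the first pull
    means = np.asarray(rewards, dtype=float) / pulls
    K = len(pulls)
    exploration = np.sqrt(np.maximum(0.0, np.log(t / (K * pulls))) / pulls)
    return means + exploration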
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.MOSS'
Policies.MOSSAnytime module

The MOSS-Anytime policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).

Policies.MOSSAnytime.ALPHA = 1.0

Default value for the parameter \(\alpha\) for the MOSS-Anytime algorithm.

class Policies.MOSSAnytime.MOSSAnytime(nbArms, alpha=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.MOSS.MOSS

The MOSS-Anytime policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).

__init__(nbArms, alpha=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter \(\alpha \geq 0\) for the computations of the index. Optimal value seems to be \(1.35\).

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\left(\frac{1+\alpha}{2}\right) \max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.MOSSAnytime'
Policies.MOSSExperimental module

The MOSS-Experimental policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).

Warning

Nothing was proved for this heuristic!

class Policies.MOSSExperimental.MOSSExperimental(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.MOSS.MOSS

The MOSS-Experimental policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{ \max\left(0, \frac{\log\left(\frac{t}{\hat{H}(t)}\right)}{N_k(t)}\right)},\\ \text{where}\;\; \hat{H}(t) &:= \begin{cases} \sum\limits_{j=1, N_j(t) < \sqrt{t}}^{K} N_j(t) & \;\text{if it is}\; > 0,\\ K N_k(t) & \;\text{otherwise}\; \end{cases}\end{split}\]

Note

In the article, the authors do not explain this subtlety, and I do not see an argument to justify that at any time \(\hat{H}(t) > 0\), i.e., to justify that there is always some arm \(j\) such that \(0 < N_j(t) < \sqrt{t}\).
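
A small sketch of this index, including the fallback \(\hat{H}(t) = K N_k(t)\) described above (not optimized, for illustration only):

import numpy as np

def moss_experimental_indexes(rewards, pulls, t):
    """MOSS-Experimental indexes, with the data-dependent quantity hat(H)(t)."""
    pulls = np.maximum(np.asarray(pulls), 1)
    means = np.asarray(rewards, dtype=float) / pulls
    K = len(pulls)
    H_hat = pulls[pulls < np.sqrt(t)].sum()          # sum of the N_j(t) that are < sqrt(t)
    indexes = np.zeros(K)
    for k in range(K):
        H_k = H_hat if H_hat > 0 else K * pulls[k]   # fallback K * N_k(t) if the sum is 0
        indexes[k] = means[k] + np.sqrt(max(0.0, np.log(t / H_k)) / pulls[k])
    return indexes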

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.MOSSExperimental'
Policies.MOSSH module

The MOSS-H policy for bounded bandits, with knowing the horizon. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).

class Policies.MOSSH.MOSSH(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.MOSS.MOSS

The MOSS-H policy for bounded bandits, with knowing the horizon. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).

__init__(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{T}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.MOSSH'
Policies.Monitored_UCB module

The Monitored-UCB generic policy for non-stationary bandits.

  • Reference: [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)

  • It runs on top of a simple policy, e.g., UCB, and Monitored_IndexPolicy is a wrapper:

    >>> policy = Monitored_IndexPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(K w)\) memory for a window of size \(w\).

Warning

It can only work on top of basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to reset the history to a shorter one)!

Policies.Monitored_UCB.DELTA = 0.1

Default value for the parameter \(\delta\), the lower-bound on \(\delta_k^{(i)}\), the amplitude of change of arm k at a break-point.

Policies.Monitored_UCB.PER_ARM_RESTART = False

Should we reset one arm empirical average or all? For M-UCB it is False by default.

Policies.Monitored_UCB.FULL_RESTART_WHEN_REFRESH = True

Should we fully restart the algorithm or simply reset one arm empirical average? For M-UCB it is True by default.

Policies.Monitored_UCB.WINDOW_SIZE = None

Default value of the window-size. Give None to use the default value computed from a knowledge of the horizon and number of break-points.

Policies.Monitored_UCB.GAMMA_SCALE_FACTOR = 1

For any algorithm with uniform exploration and a formula to tune it, the prescribed exploration rate is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps a lot!

class Policies.Monitored_UCB.Monitored_IndexPolicy(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The Monitored-UCB generic policy for non-stationary bandits, from [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)

  • For a window size w, it uses only \(\mathcal{O}(K w)\) memory.
__init__(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]

New policy.

window_size = None

Parameter \(w\) for the M-UCB algorithm.

threshold_b = None

Parameter \(b\) for the M-UCB algorithm.

gamma = None

What they call \(\gamma\) in their paper: the share of uniform exploration.

last_update_time_tau = None

Keep in memory the last time a change was detected, i.e., the variable \(\tau\) in the algorithm.

last_w_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the times where each arm was last seen. Start with -1 (never seen)

last_restart_times = None

Keep in memory the times of last restarts (for each arm).

__str__()[source]

-> str

choice()[source]

Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choice of the underlying policy (e.g., UCB).

Warning

Actually, it’s more complicated:

  • If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
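The interleaved exploration above can be sketched as follows, with 0-based arm indices (a convention assumed here; it requires \(\gamma > 0\)):

import math

def monitored_choice(t, tau, nbArms, gamma, underlying_choice):
    """Interleave forced uniform exploration with the underlying policy's choice."""
    A = (t - tau) % int(math.ceil(nbArms / gamma))
    if A < nbArms:                  # forced exploration step: play arm A
        return A
    return underlying_choice()      # otherwise delegate to the underlying policy (e.g., UCB)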
choiceWithRank(rank=1)[source]

Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choiceWithRank of the underlying policy (e.g., UCB).

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the change detection algorithm says so.
__module__ = 'Policies.Monitored_UCB'
detect_change(arm)[source]

A change is detected for the current arm if the following test is true (as sketched below):

\[\left| \sum_{i=w/2+1}^{w} Y_i - \sum_{i=1}^{w/2} Y_i \right| > b ?\]
  • where \(Y_i\) is the i-th data point in the latest w data from this arm (i.e., \(X_k(t)\) for \(t = n_k - w + 1\) to \(t = n_k\), with \(n_k\) the current number of samples from arm k),
  • where threshold_b is the threshold b of the test, and window_size is the window size w.
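
A minimal sketch of this test, applied to the stored window of the last w observations of one arm:

import numpy as np

def detect_change(last_w_rewards, b):
    """M-UCB change-point test on the last w observations Y_1, ..., Y_w of one arm."""
    Y = np.asarray(last_w_rewards, dtype=float)
    w = len(Y)
    if w < 2 or w % 2 != 0:           # the test compares the two halves of an even window
        return False
    return abs(Y[w // 2:].sum() - Y[:w // 2].sum()) > b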

Warning

FIXED: only the last \(w\) data points are stored, using lists whose first element gets ``pop()``ed out (deleted). See https://github.com/SMPyBandits/SMPyBandits/issues/174

Policies.MusicalChair module

MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).

  • Each player has 3 states: the 1st is random exploration, the 2nd is musical chair, the 3rd is staying seated.
  • 1st step:
    • Every player tries arms uniformly at random for \(T_0\) steps, computing the empirical mean of each arm, and the number of observed collisions \(C_{T_0}\).
    • Finally, \(N^* = M\) = nbPlayers is estimated from the number of collisions \(C_{T_0}\) (see the sketch after this list), and the \(N^*\) best arms are computed from their empirical means.
  • 2nd step:
    • Every player chooses an arm uniformly at random, among the \(N^*\) best arms, until she does not encounter a collision right after choosing it.
    • When an arm was chosen by only one player, she decides to sit on this chair (= arm).
  • 3rd step:
    • Every player stays seated on her chair for the rest of the game,
    • \(\implies\) constant regret if \(N^*\) is well estimated and the estimated \(N^*\) best arms were correct,
    • \(\implies\) linear regret otherwise.
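
Here is a sketch of the estimation performed at the end of the 1st step, assuming the estimator of the reference article: with \(C_{T_0}\) observed collisions over \(T_0\) uniform pulls among \(K\) arms, the number of players is estimated by \(\hat{N}^* = \mathrm{round}\big(1 + \log((T_0 - C_{T_0})/T_0) / \log(1 - 1/K)\big)\) (this exact formula is an assumption; see [Shamir et al., 2015] and _endInitialPhase() for the authoritative version):

import math

def estimate_nb_players(nb_collisions, T0, nbArms):
    """Estimate N* from the collisions observed during the T0 uniform exploration steps.
    (Estimator assumed here; see the reference article for the exact statement.)"""
    if nb_collisions >= T0:           # degenerate case: a collision at every step
        return nbArms
    return int(round(1 + math.log((T0 - nb_collisions) / T0)
                     / math.log(1.0 - 1.0 / nbArms)))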
Policies.MusicalChair.optimalT0(nbArms=10, epsilon=0.1, delta=0.05)[source]

Compute the lower-bound suggesting “large-enough” values for \(T_0\) that should guarantee constant regret with probability at least \(1 - \delta\), if the gap \(\Delta\) is larger than \(\epsilon\).

Examples:

  • For \(K=2\) arms, and in order to have a constant regret with probability at least \(90\%\), if the gap \(\Delta\) is known to be \(\geq 0.05\), then their theoretical analysis suggests to use \(T_0 \geq 18459\). That’s very huge, for just two arms!
>>> optimalT0(2, 0.1, 0.05)     # Just 2 arms !
18459                           # ==> That's a LOT of steps for just 2 arms!
  • For a harder problem with \(K=6\) arms, for a risk smaller than \(1\%\) and a gap \(\Delta \geq 0.05\), they suggest at least \(T_0 \geq 7646924\), i.e., about 7 millions of trials. That is simply too much for any realistic system, and starts to be too large for simulated systems.
>>> optimalT0(6, 0.01, 0.05)    # Constant regret with >99% proba
7646924                         # ==> That's a LOT of steps!
>>> optimalT0(6, 0.001, 0.05)   # Reasonable value of epsilon
764692376                       # ==> That's a LOT of steps!!!
  • For an even harder problem with \(K=17\) arms, the values given by their Theorem 1 start to be really unrealistic:
>>> optimalT0(17, 0.01, 0.05)   # Constant regret with >99% proba
27331794                        # ==> That's a LOT of steps!
>>> optimalT0(17, 0.001, 0.05)  # Reasonable value of epsilon
2733179304                      # ==> That's a LOT of steps!!!
Policies.MusicalChair.boundOnFinalRegret(T0, nbPlayers)[source]

Use the upper-bound on regret when \(T_0\) and \(M\) are known.

  • The “constant” regret of course grows linearly with \(T_0\), as:

    \[\forall T \geq T_0, \;\; R_T \leq T_0 K + 2 \mathrm{exp}(2) K.\]

Warning

This bound is not a deterministic result; it is only valid with a certain probability (at least \(1 - \delta\), if \(T_0\) is chosen as given by optimalT0()).

>>> boundOnFinalRegret(18459, 2)        # Crazy constant regret!  # doctest: +ELLIPSIS
36947.5...
>>> boundOnFinalRegret(7646924, 6)      # Crazy constant regret!!  # doctest: +ELLIPSIS
45881632.6...
>>> boundOnFinalRegret(764692376, 6)    # Crazy constant regret!!  # doctest: +ELLIPSIS
4588154344.6...
>>> boundOnFinalRegret(27331794, 17)    # Crazy constant regret!!  # doctest: +ELLIPSIS
464640749.2...
>>> boundOnFinalRegret(2733179304, 17)  # Crazy constant regret!!  # doctest: +ELLIPSIS
46464048419.2...
class Policies.MusicalChair.State

Bases: enum.Enum

Different states during the Musical Chair algorithm

InitialPhase = 2
MusicalChair = 3
NotStarted = 1
Sitted = 4
__module__ = 'Policies.MusicalChair'
class Policies.MusicalChair.MusicalChair(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).

__init__(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]
  • nbArms: number of arms,
  • Time0: required, number of steps, or fraction of the horizon Time1 (optional), for the first step (pure random exploration by each player),
  • N: optional, exact or upper bound on the number of players,
  • Time1: optional, only used to compute Time0 if Time0 is fractional (eg. 0.2).

Example:

>>> nbArms, Time0, Time1, N = 17, 0.1, 10000, 6
>>> player1 = MusicalChair(nbArms, Time0, Time1, N)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.25, Time1=HORIZON, N=NB_PLAYERS).children
state = None

Current state

Time0 = None

Parameter T0

nbPlayers = None

Number of players

chair = None

Current chair. Not seated yet.

cumulatedRewards = None

That’s the s_i(t) of the paper

nbObservations = None

That’s the o_i of the paper

A = None

A random permutation of arms, it will then be of size nbPlayers!

nbCollision = None

Number of collisions, that’s the C_Time0 of the paper

t = None

Internal times

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

choice()[source]

Choose an arm, as described by the Musical Chair algorithm.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the Musical Chair algorithm.

  • If there is no collision, a reward is received after pulling the arm.
_endInitialPhase()[source]

Small computation needed at the end of the initial random exploration phase.

handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

  • Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
__module__ = 'Policies.MusicalChair'
Policies.MusicalChairNoSensing module

MusicalChairNoSensing: implementation of the decentralized multi-player policy from [[“Multiplayer bandits without observing collision information”, by Gabor Lugosi and Abbas Mehrabian]](https://arxiv.org/abs/1808.08416).

Note

The algorithm implemented here is Algorithm 1 (page 8) in the article, but the authors did not name it. I will refer to it as the Musical Chair algorithm with no sensing, or MusicalChairNoSensing in the code.

Policies.MusicalChairNoSensing.ConstantC = 1

A crazy large constant to get all the theoretical results working. The paper suggests \(C = 128\).

Warning

One can choose a much smaller value in order to (try to) get reasonable empirical performance! I have tried \(C = 1\). BUT the algorithm DOES NOT work better with a much smaller constant: every single simulation I tried ended up with a linear regret for MusicalChairNoSensing.

Policies.MusicalChairNoSensing.parameter_g(K=9, m=3, T=1000, constant_c=1)[source]

Length \(g\) of the phase 1, from parameters K, m and T.

\[g = 128 K \log(3 K m^2 T^2).\]

Examples:

>>> parameter_g(m=2, K=2, T=100)  # DOCTEST: +ELLIPSIS
3171.428...
>>> parameter_g(m=2, K=2, T=1000)  # DOCTEST: +ELLIPSIS
4350.352...
>>> parameter_g(m=2, K=3, T=100)  # DOCTEST: +ELLIPSIS
4912.841...
>>> parameter_g(m=3, K=3, T=100)  # DOCTEST: +ELLIPSIS
5224.239...
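
A small sketch reproducing the formula above (the role of the extra constant_c argument of the real function is not detailed here, so it is left out of this sketch):

import math

def parameter_g(K=9, m=3, T=1000):
    """Length g of phase 1: g = 128 * K * log(3 * K * m**2 * T**2)."""
    return 128 * K * math.log(3 * K * m ** 2 * T ** 2)

print(parameter_g(K=2, m=2, T=100))   # ~ 3171.4, matching the first example above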
Policies.MusicalChairNoSensing.estimate_length_phases_12(K=3, m=9, Delta=0.1, T=1000)[source]

Estimate the length of phase 1 and 2 from the parameters of the problem.

Examples:

>>> estimate_length_phases_12(m=2, K=2, Delta=0.1, T=100)
198214307
>>> estimate_length_phases_12(m=2, K=2, Delta=0.01, T=100)
19821430723
>>> estimate_length_phases_12(m=2, K=2, Delta=0.1, T=1000)
271897030
>>> estimate_length_phases_12(m=2, K=3, Delta=0.1, T=100)
307052623
>>> estimate_length_phases_12(m=2, K=5, Delta=0.1, T=100)
532187397
Policies.MusicalChairNoSensing.smallest_T_from_where_length_phases_12_is_larger(K=3, m=9, Delta=0.1, Tmax=1000000000.0)[source]

Compute the smallest horizon T from where the (estimated) length of phases 1 and 2 is larger than T.

Examples:

>>> smallest_T_from_where_length_phases_12_is_larger(K=2, m=1)
687194767
>>> smallest_T_from_where_length_phases_12_is_larger(K=3, m=2)
1009317314
>>> smallest_T_from_where_length_phases_12_is_larger(K=3, m=3)
1009317314

Examples with even longer phase 1:

>>> smallest_T_from_where_length_phases_12_is_larger(K=10, m=5)
1009317314
>>> smallest_T_from_where_length_phases_12_is_larger(K=10, m=10)
1009317314

With \(K=100\) arms, it starts to be crazy:

>>> smallest_T_from_where_length_phases_12_is_larger(K=100, m=10)
1009317314
class Policies.MusicalChairNoSensing.State

Bases: enum.Enum

Different states during the Musical Chair with no sensing algorithm

InitialPhase = 2
MusicalChair = 4
NotStarted = 1
Sitted = 5
UniformWaitPhase2 = 3
__module__ = 'Policies.MusicalChairNoSensing'
class Policies.MusicalChairNoSensing.MusicalChairNoSensing(nbPlayers=1, nbArms=1, horizon=1000, constant_c=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

MusicalChairNoSensing: implementation of the decentralized multi-player policy from [[“Multiplayer bandits without observing collision information”, by Gabor Lugosi and Abbas Mehrabian]](https://arxiv.org/abs/1808.08416).

__init__(nbPlayers=1, nbArms=1, horizon=1000, constant_c=1, lower=0.0, amplitude=1.0)[source]
  • nbArms: number of arms (K in the paper),
  • nbPlayers: number of players (m in the paper),
  • horizon: horizon (length) of the game (T in the paper),

Example:

>>> nbPlayers, nbArms, horizon = 3, 9, 10000
>>> player1 = MusicalChairNoSensing(nbPlayers, nbArms, horizon)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChairNoSensing, nbArms, nbPlayers=nbPlayers, horizon=horizon).children

or

>>> configuration["players"] = [ MusicalChairNoSensing(nbPlayers=nbPlayers, nbArms=nbArms, horizon=horizon) for _ in range(NB_PLAYERS) ]
state = None

Current state

nbPlayers = None

Number of players

nbArms = None

Number of arms

horizon = None

Parameter T (horizon)

chair = None

Current chair. Not seated yet.

cumulatedRewards = None

That’s the s_i(t) of the paper

nbObservations = None

That’s the o_i of the paper

A = None

A random permutation of arms, it will then be of size nbPlayers!

tau_phase_2 = None

Time when phase 2 starts

t = None

Internal times

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

choice()[source]

Choose an arm, as described by the Musical Chair with no Sensing algorithm.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the Musical Chair with no Sensing algorithm.

  • If there is no collision, a reward is received after pulling the arm.
__module__ = 'Policies.MusicalChairNoSensing'
_endPhase2()[source]

Small computation needed at the end of the initial random exploration phase.

handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

  • Here, as its name suggests, the MusicalChairNoSensing algorithm does not use any collision information, hence this method is empty.
  • Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
Policies.OCUCB module

The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits, with sub-Gaussian noise.

Policies.OCUCB.ETA = 2

Default value for parameter \(\eta > 1\) for OCUCB.

Policies.OCUCB.RHO = 1

Default value for parameter \(\rho \in (1/2, 1]\) for OCUCB.

class Policies.OCUCB.OCUCB(nbArms, eta=2, rho=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits, with sub-Gaussian noise.

__init__(nbArms, eta=2, rho=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
eta = None

Parameter \(\eta > 1\).

rho = None

Parameter \(\rho \in (1/2, 1]\).

__str__()[source]

-> str

_Bterm(k)[source]

Compute the extra term \(B_k(t)\) as follows:

\[\begin{split}B_k(t) &= \max\Big\{ \exp(1), \log(t), t \log(t) / C_k(t) \Big\},\\ \text{where}\; C_k(t) &= \sum_{j=1}^{K} \min\left\{ T_k(t), T_j(t)^{\rho} T_k(t)^{1 - \rho} \right\}\end{split}\]
_Bterms()[source]

Compute all the extra terms, \(B_k(t)\) for each arm k, in a naive manner, not optimized to be vectorial, but it works.

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \eta \log(B_k(t))}{N_k(t)}}.\]
  • Where \(\eta\) is a parameter of the algorithm,
  • And \(B_k(t)\) is the additional term defined above.
__module__ = 'Policies.OCUCB'
Policies.OCUCBH module

The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).

Policies.OCUCBH.PSI = 2

Default value for parameter \(\psi \geq 2\) for OCUCBH.

Policies.OCUCBH.ALPHA = 4

Default value for parameter \(\alpha \geq 2\) for OCUCBH.

class Policies.OCUCBH.OCUCBH(nbArms, horizon=None, psi=2, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.OCUCB.OCUCB

The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).

__init__(nbArms, horizon=None, psi=2, alpha=4, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
psi = None

Parameter \(\psi \geq 2\).

alpha = None

Parameter \(\alpha \geq 2\).

horizon = None

Horizon T.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha}{N_k(t)} \log(\frac{\psi T}{t})}.\]
  • Where \(\alpha\) and \(\psi\) are two parameters of the algorithm.
__module__ = 'Policies.OCUCBH'
class Policies.OCUCBH.AOCUCBH(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.OCUCBH.OCUCBH

The Almost Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).

__init__(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2}{N_k(t)} \log(\frac{T}{N_k(t)})}.\]
__module__ = 'Policies.OCUCBH'
Policies.OSSB module

Optimal Sampling for Structured Bandits (OSSB) algorithm.

Warning

This is the simplified OSSB algorithm for classical bandits. It can be applied to more general bandit problems, see the original paper.

  • The OSSB is for Bernoulli stochastic bandits, and GaussianOSSB is for Gaussian stochastic bandits, with a direct application of the result from their paper.
  • The SparseOSSB is for sparse Gaussian (or sub-Gaussian) stochastic bandits, of known variance.
  • I also added support for non-constant \(\varepsilon\) and \(\gamma\) rates, as suggested in a talk given by Combes, 24th of May 2018, Rotterdam (Workshop, “Learning while Earning”). See OSSB_DecreasingRate and OSSB_AutoDecreasingRate.

Policies.OSSB.klGauss_vect(xs, y, sig2x=0.25)[source]
class Policies.OSSB.Phase

Bases: enum.Enum

Different phases during the OSSB algorithm

__module__ = 'Policies.OSSB'
estimation = 3
exploitation = 2
exploration = 4
initialisation = 1
Policies.OSSB.EPSILON = 0.0

Default value for the \(\varepsilon\) parameter, 0.0 is a safe default.

Policies.OSSB.GAMMA = 0.0

Default value for the \(\gamma\) parameter, 0.0 is a safe default.

Policies.OSSB.solve_optimization_problem__classic(thetas)[source]

Solve the optimization problem (2)-(3) as defined in the paper, for classical stochastic bandits.

  • No need to solve anything, as they give the solution for classical bandits.
Policies.OSSB.solve_optimization_problem__gaussian(thetas, sig2x=0.25)[source]

Solve the optimization problem (2)-(3) as defined in the paper, for Gaussian classical stochastic bandits.

  • No need to solve anything, as they give the solution for Gaussian classical bandits.
Policies.OSSB.solve_optimization_problem__sparse_bandits(thetas, sparsity=None, only_strong_or_weak=False)[source]

Solve the optimization problem (2)-(3) as defined in the paper, for sparse stochastic bandits.

  • I recomputed suboptimal solution to the optimization problem, and found the same as in [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].
  • If only_strong_or_weak is True, the solution \(c_i\) are not returned, but instead strong_or_weak, k is returned (to know if the problem is strongly sparse or not, and if not, the k that satisfy the required constraint).
class Policies.OSSB.OSSB(nbArms, epsilon=0.0, gamma=0.0, solve_optimization_problem='classic', lower=0.0, amplitude=1.0, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

Optimal Sampling for Structured Bandits (OSSB) algorithm.

  • solve_optimization_problem can be "classic" or "bernoulli" for classic stochastic bandit with no structure, "gaussian" for classic bandit for Gaussian arms, or "sparse" for sparse stochastic bandit (give the sparsity s in a kwargs).
  • Reference: [[Minimal Exploration in Structured Stochastic Bandits, Combes et al, arXiv:1711.00400 [stat.ML]]](https://arxiv.org/abs/1711.00400)
__init__(nbArms, epsilon=0.0, gamma=0.0, solve_optimization_problem='classic', lower=0.0, amplitude=1.0, **kwargs)[source]

New policy.

epsilon = None

Parameter \(\varepsilon\) for the OSSB algorithm. Can be = 0.

gamma = None

Parameter \(\gamma\) for the OSSB algorithm. Can be = 0.

counter_s_no_exploitation_phase = None

Counter of the number of phases without exploitation.

phase = None

categorical variable for the phase

__str__()[source]

-> str

startGame()[source]

Start the game (fill pulls and rewards with 0).

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

choice()[source]

Applies the OSSB procedure, it’s quite complicated so see the original paper.

handleCollision(arm, reward=None)[source]

Nothing special to do.

__module__ = 'Policies.OSSB'
class Policies.OSSB.GaussianOSSB(nbArms, epsilon=0.0, gamma=0.0, variance=0.25, lower=0.0, amplitude=1.0, **kwargs)[source]

Bases: Policies.OSSB.OSSB

Optimal Sampling for Structured Bandits (OSSB) algorithm, for Gaussian Stochastic Bandits.

__init__(nbArms, epsilon=0.0, gamma=0.0, variance=0.25, lower=0.0, amplitude=1.0, **kwargs)[source]

New policy.

__module__ = 'Policies.OSSB'
class Policies.OSSB.SparseOSSB(nbArms, epsilon=0.0, gamma=0.0, sparsity=None, lower=0.0, amplitude=1.0, **kwargs)[source]

Bases: Policies.OSSB.OSSB

Optimal Sampling for Structured Bandits (OSSB) algorithm, for Sparse Stochastic Bandits.

__init__(nbArms, epsilon=0.0, gamma=0.0, sparsity=None, lower=0.0, amplitude=1.0, **kwargs)[source]

New policy.

__module__ = 'Policies.OSSB'
Policies.OSSB.DECREASINGRATE = 1e-06

Default value for the constant for the decreasing rate

class Policies.OSSB.OSSB_DecreasingRate(nbArms, epsilon=0.0, gamma=0.0, decreasingRate=1e-06, lower=0.0, amplitude=1.0, **kwargs)[source]

Bases: Policies.OSSB.OSSB

Optimal Sampling for Structured Bandits (OSSB) algorithm, with decreasing rates for both \(\varepsilon\) and \(\gamma\).

Warning

This is purely experimental, the paper does not discuss how to choose the decreasing rates. It is inspired by the rates for the Exp3 algorithm, cf. [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).

__init__(nbArms, epsilon=0.0, gamma=0.0, decreasingRate=1e-06, lower=0.0, amplitude=1.0, **kwargs)[source]

New policy.

__str__()[source]

-> str

epsilon

Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 \exp(- t \tau))\).

__module__ = 'Policies.OSSB'
gamma

Decreasing \(\gamma(t) = \min(1, \gamma_0 \exp(- t \tau))\).

class Policies.OSSB.OSSB_AutoDecreasingRate(nbArms, lower=0.0, amplitude=1.0, **kwargs)[source]

Bases: Policies.OSSB.OSSB

Optimal Sampling for Structured Bandits (OSSB) algorithm, with automatically-tuned decreasing rates for both \(\varepsilon\) and \(\gamma\).

Warning

This is purely experimental, the paper does not discuss how to choose the decreasing rates. It is inspired by the rates for the Exp3++ algorithm, [[One practical algorithm for both stochastic and adversarial bandits, S.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

__module__ = 'Policies.OSSB'
__init__(nbArms, lower=0.0, amplitude=1.0, **kwargs)[source]

New policy.

__str__()[source]

-> str

epsilon

Decreasing \(\varepsilon(t) = \frac{1}{2} \sqrt{\frac{\log(K)}{t K}}\).

gamma

Decreasing \(\gamma(t) = \frac{1}{2} \sqrt{\frac{\log(K)}{t K}}\).

Policies.OracleSequentiallyRestartPolicy module

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

  • It runs on top of a simple policy, e.g., UCB, and OracleSequentiallyRestartPolicy is a wrapper:

    >>> policy = OracleSequentiallyRestartPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses the knowledge of the breakpoints to restart the underlying algorithm at each breakpoint.

  • It is very simple but impractical: in any real problem it is impossible to know the locations of the breakpoints, but it acts as an efficient baseline.

Warning

It is an efficient baseline, but it has no reason to be the best algorithm on a given problem (empirically)! I found that Policies.DiscountedThompson.DiscountedThompson is usually the most efficient.

Policies.OracleSequentiallyRestartPolicy.PER_ARM_RESTART = True

Should we reset one arm empirical average or all? Default is True for this algorithm.

Policies.OracleSequentiallyRestartPolicy.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm or simply reset one arm empirical average? Default is False, it’s usually more efficient!

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_ALL_CHANGE = False

True if the algorithm resets one/all arm memories when a change occurs on any arm. False if the algorithm only resets one arm's memories when a change occurs on that arm (needs to know listOfMeans); this is the default, and it should be more efficient.

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_SUBOPTIMAL_CHANGE = True

True if the algorithm resets the memories of this arm no matter whether it stays optimal or becomes suboptimal (default, it should be more efficient). False if the algorithm resets memories only when a change makes the previously best arm become suboptimal.

class Policies.OracleSequentiallyRestartPolicy.OracleSequentiallyRestartPolicy(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

__init__(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

New policy.

reset_for_all_change = None

See RESET_FOR_ALL_CHANGE

reset_for_suboptimal_change = None

See RESET_FOR_SUBOPTIMAL_CHANGE

changePoints = None

Locations of the break points (or change points) of the switching bandit problem, for each arm. If None, an empty list is used.

all_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the times where each arm was last seen. Start with -1 (never seen)

compute_optimized_changePoints(changePoints=None, listOfMeans=None)[source]

Compute the list of change points for each arm.

__str__()[source]

-> str

__module__ = 'Policies.OracleSequentiallyRestartPolicy'
getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the current time step is in the list of change points.
detect_change(arm)[source]

Try to detect a change in the current arm.

Policies.PHE module

The PHE, Perturbed-History Exploration, policy for bounded bandits.

  • Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
Policies.PHE.DEFAULT_PERTURBATION_SCALE = 1.0

By default, the perturbation scale \(a\) in PHE is 1, that is, at the current time step t, if there are \(s = T_{i,t-1}\) samples of arm i, PHE generates \(s\) pseudo-rewards (of mean \(1/2\))

class Policies.PHE.PHE(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The PHE, Perturbed-History Exploration, policy for bounded bandits.

  • Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
  • They prove that PHE achieves \(\mathcal{O}(K \Delta^{-1} \log(T))\) regret for horizon \(T\), where \(\Delta\) is the minimum gap between the expected rewards of the optimal and suboptimal arms, for \(a > 1\). A sketch of the resulting index is given below.
  • Note that the limit case \(a=0\) gives the Follow-the-Leader algorithm (FTL), known to fail.
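
A minimal sketch of the perturbed-history index for rewards in \([0, 1]\): with \(s\) observed samples of total reward \(V_{i,s}\), add \(\lceil a s \rceil\) pseudo-rewards drawn as Bernoulli(1/2) and take the resulting empirical mean. This mirrors the description above; the package implementation may differ in details:

import math
import numpy as np

def phe_index(sum_rewards, nb_samples, a=1.0, rng=np.random):
    """Randomized PHE index: empirical mean over the real samples plus
    ceil(a * s) pseudo-rewards of mean 1/2."""
    s = nb_samples
    if s == 0:
        return float("inf")                   # force initial exploration of unseen arms
    nb_pseudo = int(math.ceil(a * s))
    pseudo = rng.binomial(nb_pseudo, 0.5)     # sum of the Bernoulli(1/2) pseudo-rewards
    return (sum_rewards + pseudo) / (s + nb_pseudo)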
__init__(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
perturbation_scale = None

Perturbation scale, denoted \(a\) in their paper. Should be a float or int number. With \(s\) current samples, \(\lceil a s \rceil\) additional pseudo-rewards are generated.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute a randomized index by adding \(a\) pseudo-rewards (of mean \(1/2\)) to the current observations of this arm.

__module__ = 'Policies.PHE'
Policies.ProbabilityPursuit module

The basic Probability Pursuit algorithm.

  • We use the simple version of the pursuit algorithm, as described in the seminal book by Sutton and Barto (1998), https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html.

  • Initially, a uniform probability is set on each arm, \(p_k(0) = 1/K\).

  • At each time step \(t\), the probabilities are all recomputed, following this equation:

    \[\begin{split}p_k(t+1) = \begin{cases} (1 - \beta) p_k(t) + \beta \times 1 & \text{if}\; \hat{\mu}_k(t) = \max_j \hat{\mu}_j(t) \\ (1 - \beta) p_k(t) + \beta \times 0 & \text{otherwise}. \end{cases}\end{split}\]
  • \(\beta \in (0, 1)\) is a learning rate, default is BETA = 0.5.

  • Arm \(A_k(t+1)\) is then randomly selected from the distribution \((p_k(t+1))_{1 \leq k \leq K}\), as sketched after this list.

  • References: [Kuleshov & Precup - JMLR, 2000](http://www.cs.mcgill.ca/~vkules/bandits.pdf#page=6), [Sutton & Barto, 1998]
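
A minimal sketch of the pursuit update and of the random selection described above:

import numpy as np

def pursuit_update(probabilities, empirical_means, beta=0.5):
    """One pursuit step: shrink every probability by (1 - beta) and move the mass beta
    onto the current greedy arm, as in the update equation above."""
    new_p = (1 - beta) * np.asarray(probabilities, dtype=float)
    new_p[np.argmax(empirical_means)] += beta
    return new_p

def pursuit_choice(probabilities, rng=np.random):
    """Random arm selection according to the distribution (p_k(t))."""
    return int(rng.choice(len(probabilities), p=probabilities))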

Policies.ProbabilityPursuit.BETA = 0.5

Default value for the beta parameter

class Policies.ProbabilityPursuit.ProbabilityPursuit(nbArms, beta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The basic Probability pursuit algorithm.

__init__(nbArms, beta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

New policy.

probabilities = None

Probabilities of each arm

startGame()[source]

Reinitialize probabilities.

beta

Constant parameter \(\beta(t) = \beta(0)\).

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the probabilities \(p_k(t)\) of each arm.

choice()[source]

One random selection, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to numpy.random.choice(), and select the last one (the least probable).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to numpy.random.choice().

__module__ = 'Policies.ProbabilityPursuit'
choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to numpy.random.choice().

Policies.RAWUCB module

author: Julien Seznec

Rotting Adaptive Window Upper Confidence Bounds for rotting bandits.

Reference: [Seznec et al., 2019b], A single algorithm for both rested and restless rotting bandits (WIP), by Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko.

class Policies.RAWUCB.EFF_RAWUCB(nbArms, alpha=0.06, subgaussian=1, m=None, delta=None, delay=False)[source]

Bases: Policies.FEWA.EFF_FEWA

Efficient Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2020]. Efficient trick described in [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043] (m=2) and [Seznec et al., 2020] (m<=2). We use the confidence level \(\delta_t = \frac{1}{t^\alpha}\).

choice()[source]

Not defined.

_compute_ucb()[source]
_append_thresholds(w)[source]
__str__()[source]

-> str

__module__ = 'Policies.RAWUCB'
class Policies.RAWUCB.EFF_RAWklUCB(nbArms, subgaussian=1, alpha=1, klucb=<function klucbBern>, tol=0.0001, m=2)[source]

Bases: Policies.RAWUCB.EFF_RAWUCB

Use a KL confidence bound instead of the closed-form approximation. Experimental work: much slower (!!) because we compute many UCBs at each round, for each arm.

__init__(nbArms, subgaussian=1, alpha=1, klucb=<function klucbBern>, tol=0.0001, m=2)[source]

New policy.

choice()[source]

Not defined.

__str__()[source]

-> str

__module__ = 'Policies.RAWUCB'
class Policies.RAWUCB.RAWUCB(nbArms, subgaussian=1, alpha=1)[source]

Bases: Policies.RAWUCB.EFF_RAWUCB

Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2020]. We use the confidence level \(\delta_t = \frac{1}{t^\alpha}\).

__init__(nbArms, subgaussian=1, alpha=1)[source]

New policy.

__str__()[source]

-> str

__module__ = 'Policies.RAWUCB'
class Policies.RAWUCB.EFF_RAWUCB_pp(nbArms, subgaussian=1, alpha=1, beta=0, m=2)[source]

Bases: Policies.RAWUCB.EFF_RAWUCB

Efficient Rotting Adaptive Window Upper Confidence Bound ++ (RAW-UCB++) [Seznec et al., 2020, Thesis]. We use the confidence level \(\delta_{t,h} = \frac{Kh}{t(1+\log(t/Kh)^\beta)}\).

__init__(nbArms, subgaussian=1, alpha=1, beta=0, m=2)[source]

New policy.

__str__()[source]

-> str

_compute_ucb()[source]
_inlog(w)[source]
__module__ = 'Policies.RAWUCB'
class Policies.RAWUCB.RAWUCB_pp(nbArms, subgaussian=1, beta=2)[source]

Bases: Policies.RAWUCB.EFF_RAWUCB_pp

Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2019b, WIP]. We use the confidence level \(\delta_t = \frac{Kh}{t^\alpha}\).

__init__(nbArms, subgaussian=1, beta=2)[source]

New policy.

__str__()[source]

-> str

__module__ = 'Policies.RAWUCB'
Policies.RCB module

The RCB, Randomized Confidence Bound, policy for bounded bandits.

  • Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
class Policies.RCB.RCB(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.RandomizedIndexPolicy.RandomizedIndexPolicy, Policies.UCBalpha.UCBalpha

The RCB, Randomized Confidence Bound, policy for bounded bandits.

  • Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
__module__ = 'Policies.RCB'
Policies.RandomizedIndexPolicy module

Generic randomized index policy.

  • Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.VERBOSE = False

True to debug information about the perturbations

Policies.RandomizedIndexPolicy.uniform_perturbation(size=1, low=-1.0, high=1.0)[source]

Uniform random perturbation, not from \([0, 1]\) but from \([-1, 1]\), that is \(\mathcal{U}niform([-1, 1])\).

  • Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.normal_perturbation(size=1, loc=0.0, scale=0.25)[source]

Normal (Gaussian) random perturbation, with mean loc=0 and scale (sigma2) scale=0.25 (by default), that is \(\mathcal{N}ormal(loc, scale)\).

  • Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.gaussian_perturbation(size=1, loc=0.0, scale=0.25)

Normal (Gaussian) random perturbation, with mean loc=0 and scale (sigma2) scale=0.25 (by default), that is \(\mathcal{N}ormal(loc, scale)\).

  • Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.exponential_perturbation(size=1, scale=0.25)[source]

Exponential random perturbation, with parameter (\(\lambda\)) scale=0.25 (by default), that is \(\mathcal{E}xponential(\lambda)\).

  • Reference: see Corollary 7 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.gumbel_perturbation(size=1, loc=0.0, scale=0.25)[source]

Gumbel random perturbation, with mean loc=0 and scale scale=0.25 (by default), that is \(\mathcal{G}umbel(loc, scale)\).

  • Reference: see Corollary 7 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
Policies.RandomizedIndexPolicy.map_perturbation_str_to_function = {'exponential': <function exponential_perturbation>, 'gaussian': <function normal_perturbation>, 'gumbel': <function gumbel_perturbation>, 'normal': <function normal_perturbation>, 'uniform': <function uniform_perturbation>}

Map perturbation names (like "uniform") to perturbation functions (like uniform_perturbation()).

class Policies.RandomizedIndexPolicy.RandomizedIndexPolicy(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.IndexPolicy.IndexPolicy

Class that implements a generic randomized index policy.

__init__(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • perturbation: [“uniform”, “normal”, “exponential”, “gaussian”] or a function like numpy.random.uniform(),
  • lower, amplitude: lower value and known amplitude of the rewards.
perturbation_name = None

Name of the function to generate the random perturbation.

perturbation = None

Function to generate the random perturbation.

__str__()[source]

-> str

computeIndex(arm)[source]

In a randomized index policy, with distribution \(\mathrm{Distribution}\) generating perturbations \(Z_k(t)\), with index \(I_k(t)\) and mean \(\hat{\mu}_k(t)\) for each arm \(k\), it chooses an arm with maximal perturbated index (uniformly at random):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ Z_k(t) &\sim \mathrm{Distribution}, \\ \mathrm{UCB}_k(t) &= I_k(t) - \hat{\mu}_k(t),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} \hat{\mu}_k(t) + \mathrm{UCB}_k(t) \cdot Z_k(t)).\end{split}\]
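
A minimal sketch of this perturbed selection, given the indexes \(I_k(t)\), the cumulated rewards and the pulls of each arm (the uniform perturbation on \([-1, 1]\) matches uniform_perturbation() above; other distributions can be passed instead):

import numpy as np

def uniform_perturbation(size=1, low=-1.0, high=1.0):
    """Uniform perturbation on [-1, 1], as in uniform_perturbation() above."""
    return np.random.uniform(low, high, size)

def randomized_choice(indexes, rewards, pulls, perturbation=uniform_perturbation):
    """Choose an arm by perturbing the exploration bonus UCB_k(t) = I_k(t) - mu_k(t)."""
    pulls = np.maximum(np.asarray(pulls), 1)
    means = np.asarray(rewards, dtype=float) / pulls        # hat(mu)_k(t)
    bonus = np.asarray(indexes, dtype=float) - means        # UCB_k(t)
    Z = perturbation(size=len(means))                       # one perturbation per arm
    perturbed = means + bonus * Z
    best = np.flatnonzero(perturbed == perturbed.max())
    return int(np.random.choice(best))                      # break ties uniformly at random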
__module__ = 'Policies.RandomizedIndexPolicy'
computeAllIndex()[source]

In a randomized index policy, with distribution \(\mathrm{Distribution}\) generating perturbations \(Z_k(t)\), with index \(I_k(t)\) and mean \(\hat{\mu}_k(t)\) for each arm \(k\), it chooses an arm with maximal perturbated index (uniformly at random):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ Z_k(t) &\sim \mathrm{Distribution}, \\ \mathrm{UCB}_k(t) &= I_k(t) - \hat{\mu}_k(t),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} \hat{\mu}_k(t) + \mathrm{UCB}_k(t) \cdot Z_k(t)).\end{split}\]
Policies.SIC_MMAB module

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

  • The algorithm is quite complicated, please see the paper (Algorithm 1, page 6).
  • The UCB-H indexes are used, for more details see Policies.UCBH.
Policies.SIC_MMAB.c = 1.0

default value, as it was in pymaBandits v1.0

Policies.SIC_MMAB.TOLERANCE = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.SIC_MMAB.State

Bases: enum.Enum

Different states during the Musical Chair algorithm

Communication = 4
Estimation = 2
Exploitation = 5
Exploration = 3
Fixation = 1
__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.BasePolicy.BasePolicy

SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon, N = 17, 10000, 6
>>> player1 = SIC_MMAB(nbArms, horizon, N)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
phase = None

Current state

horizon = None

Horizon T of the experiment.

alpha = None

Parameter \(\alpha\) for the UCB/LCB computations.

Time0 = None

Parameter \(T_0 = \lceil K \log(T) \rceil\).

ext_rank = None

External rank, -1 until known

int_rank = None

Internal rank, starts to be 0 then increase when needed

nbPlayers = None

Estimated number of players, starts to be 1

last_action = None

Keep memory of the last played action (starts randomly)

t_phase = None

Number of the phase XXX ?

round_number = None

Number of the round XXX ?

active_arms = None

Set of active arms (kept as a numpy array)

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

  • By default, the SIC-MMAB algorithm uses the UCB-H confidence bounds:
\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\end{split}\]
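A minimal sketch of these UCB-H style bounds, for all arms at once, given the cumulated rewards, the pulls and the known horizon T:

import numpy as np

def compute_ucb_lcb(sum_rewards, pulls, horizon, alpha=4.0):
    """UCB-H bounds used by SIC-MMAB: mean +/- sqrt(alpha * log(T) / (2 * N_k(t)))."""
    pulls = np.maximum(np.asarray(pulls), 1)
    means = np.asarray(sum_rewards, dtype=float) / pulls
    width = np.sqrt(alpha * np.log(horizon) / (2.0 * pulls))
    return means + width, means - width          # (UCB_k(t), LCB_k(t))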
choice()[source]

Choose an arm, as described by the SIC-MMAB algorithm.

getReward(arm, reward, collision=False)[source]

Receive a reward on arm of index ‘arm’, as described by the SIC-MMAB algorithm.

  • If there is no collision, a reward is received after pulling the arm.
handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_UCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_UCB: SIC-MMAB with the simple UCB-1 confidence bounds.

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\end{split}\]
  • Reference: [Auer et al. 02].
  • Other possibilities include UCB-H (the default, see SIC_MMAB) and klUCB (see SIC_MMAB_klUCB).
__module__ = 'Policies.SIC_MMAB'
class Policies.SIC_MMAB.SIC_MMAB_klUCB(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]

Bases: Policies.SIC_MMAB.SIC_MMAB

SIC_MMAB_klUCB: SIC-MMAB with the kl-UCB confidence bounds.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]
  • nbArms: number of arms,
  • horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
  • alpha: for the UCB/LCB computations.

Example:

>>> nbArms, horizon, N = 17, 10000, 6
>>> player1 = SIC_MMAB(nbArms, horizon, N)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
c = None

Parameter c

klucb = None

kl function to use

tolerance = None

Numerical tolerance

__str__()[source]

-> str

compute_ucb_lcb()[source]

Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \mathrm{UCB}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathrm{Bias}_k(t) &= \mathrm{UCB}_k(t) - \hat{\mu}_k(t),\\ \mathrm{LCB}_k(t) &= \hat{\mu}_k(t) - \mathrm{Bias}_k(t).\end{split}\]
  • Here rewards are assumed to be in \([a, b]\) (default \([0, 1]\)), \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default 1).

__module__ = 'Policies.SIC_MMAB'
Policies.SWA module

author: Julien Seznec

Sliding Window Average policy for rotting bandits.

Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf]. Advances in Neural Information Processing Systems 30 (NIPS 2017) Nir Levine, Koby Crammer, Shie Mannor

class Policies.SWA.SWA(nbArms, horizon=1, subgaussian=1, maxDecrement=1, alpha=0.2, doublingTrick=False)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The Sliding Window Average policy for rotting bandits. Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf].

__init__(nbArms, horizon=1, subgaussian=1, maxDecrement=1, alpha=0.2, doublingTrick=False)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
setWindow()[source]
getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

computeIndex(arm)[source]

Compute the mean of the last h values.

startGame(resetHorizon=True)[source]

Initialize the policy for a new game.

__module__ = 'Policies.SWA'
class Policies.SWA.wSWA(nbArms, firstHorizon=1, subgaussian=1, maxDecrement=1, alpha=0.2)[source]

Bases: Policies.SWA.SWA

SWA with doubling trick Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf].

__init__(nbArms, firstHorizon=1, subgaussian=1, maxDecrement=1, alpha=0.2)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

doublingTrick()[source]
getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

__module__ = 'Policies.SWA'
Policies.SWHash_UCB module

The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

  • Instead of being restricted to UCB, it runs on top of a simple policy, e.g., UCB, and SWHash_IndexPolicy() is a generic policy using any simple policy with this “sliding window” trick:

    >>> policy = SWHash_IndexPolicy(nbArms, UCB, tau=100, threshold=0.1)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional, non-fixed \(\mathcal{O}(\tau(t,\alpha))\) memory, and incurs an extra cost in time complexity.

Warning

This implementation is still experimental!

Warning

It can only work on a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to store all previous observations in order to restart with a reduced history)!

Policies.SWHash_UCB.alpha_for_abruptly_changing_env(nu=0.5)[source]

For an abruptly-changing environment, if the number of break-points is \(\Upsilon_T = \mathcal{O}(T^{\nu})\), then the SW-UCB# algorithm chooses \(\alpha = \frac{1-\nu}{2}\).

Policies.SWHash_UCB.alpha_for_slowly_varying_env(kappa=1)[source]

For a slowly-varying environment, if the change in mean reward between two time steps is bounded by \(\varepsilon_T = \mathcal{O}(T^{-\kappa})\), then the SW-UCB# algorithm chooses \(\alpha = \min\left(1, \frac{3\kappa}{4}\right)\).

Policies.SWHash_UCB.ALPHA = 0.5

Default parameter for \(\alpha\).

Policies.SWHash_UCB.LAMBDA = 1

Default parameter for \(\lambda\).

Policies.SWHash_UCB.tau_t_alpha(t, alpha=0.5, lmbda=1)[source]

Compute \(\tau(t,\alpha) = \min(\lceil \lambda t^{\alpha} \rceil, t)\).
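For illustration, a tiny re-implementation sketch of this formula (not the library's code):

    from math import ceil

    def tau_t_alpha(t, alpha=0.5, lmbda=1):
        # Sliding-window length: tau(t, alpha) = min(ceil(lmbda * t**alpha), t)
        return min(int(ceil(lmbda * t ** alpha)), t)

    # With the defaults alpha=0.5 and lmbda=1: tau_t_alpha(100) == 10, tau_t_alpha(10000) == 100.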

class Policies.SWHash_UCB.SWHash_IndexPolicy(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)

__init__(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

alpha = None

The parameter \(\alpha\) for the SW-UCB# algorithm (see article for reference).

lmbda = None

The parameter \(\lambda\) for the SW-UCB# algorithm (see article for reference).

all_rewards = None

Keep in memory all the rewards obtained in all the past steps (the size of the window is evolving!).

all_pulls = None

Keep in memory all the pulls obtained in all the past steps (the size of the window is evolving!). Start with -1 (never seen).

__str__()[source]

-> str

tau

The current \(\tau(t,\alpha)\).

startGame(createNewPolicy=True)[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update total history and partial history of all arms (normalized in [0, 1]).

Warning

So far this is badly implemented and the algorithm is VERY slow: it has to store the whole past, as the window length increases with t.

__module__ = 'Policies.SWHash_UCB'
Policies.SlidingWindowRestart module

An experimental policy, using a sliding window (of for instance \(\tau=100\) draws of each arm), and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).

  • Reference: none yet, idea from Rémi Bonnefoi and Lilian Besson.

  • It runs on top of a simple policy, e.g., UCB, and SlidingWindowRestart() is a generic policy using any simple policy with this “sliding window” trick:

    >>> policy = SlidingWindowRestart(nbArms, UCB, tau=100, threshold=0.1)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau)\) memory but does not cost anything else in terms of time complexity (the average is done with a sliding window, and costs \(\mathcal{O}(1)\) at every time step).

Warning

This is very experimental!

Warning

It can only work on a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to store all previous observations in order to restart with a reduced history)! Note that it works on Policies.Thompson.Thompson.
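For illustration, here is a sketch of the restart test described above (the idea only, not the library's exact criterion); last_rewards, full_sum and full_pulls are hypothetical per-arm quantities:

    import numpy as np

    def should_restart(last_rewards, full_sum, full_pulls, threshold=0.005):
        small_mean = np.mean(last_rewards)   # empirical mean on the last tau draws of this arm
        full_mean = full_sum / full_pulls    # empirical mean on the whole history of this arm
        return abs(small_mean - full_mean) >= threshold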

Policies.SlidingWindowRestart.TAU = 100

Size of the sliding window.

Policies.SlidingWindowRestart.THRESHOLD = 0.005

Threshold to know when to restart the base algorithm.

Policies.SlidingWindowRestart.FULL_RESTART_WHEN_REFRESH = True

Should we fully restart the algorithm, or simply reset one arm's empirical average?

class Policies.SlidingWindowRestart.SlidingWindowRestart(nbArms, policy=<class 'Policies.UCB.UCB'>, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

An experimental policy, using a sliding window of for instance \(\tau=100\) draws, and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).

__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]

New policy.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_pulls = None

Keep in memory the times when each arm was last seen. Start with -1 (never seen).

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the small average is too far away from it.
__module__ = 'Policies.SlidingWindowRestart'
class Policies.SlidingWindowRestart.SWR_UCB(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]

Bases: Policies.UCB.UCB

An experimental policy, using a sliding window of for instance \(\tau=100\) draws, and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).

Warning

FIXME I should remove this code, it’s useless now that the generic wrapper SlidingWindowRestart works fine.

__init__(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
tau = None

Size of the sliding window.

threshold = None

Threshold to know when to restart the base algorithm.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_pulls = None

Keep in memory the times when each arm was last seen. Start with -1 (never seen).

full_restart_when_refresh = None

Should we fully restart the algorithm, or simply reset one arm's empirical average?

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the small average is too far away from it.
__module__ = 'Policies.SlidingWindowRestart'
class Policies.SlidingWindowRestart.SWR_UCBalpha(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, alpha=4, *args, **kwargs)[source]

Bases: Policies.UCBalpha.UCBalpha

An experimental policy, using a sliding window of for instance \(\tau=100\) draws, and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).

Warning

FIXME I should remove this code, it’s useless now that the generic wrapper SlidingWindowRestart works fine.

__init__(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, alpha=4, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
tau = None

Size of the sliding window.

threshold = None

Threshold to know when to restart the base algorithm.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_pulls = None

Keep in memory the times when each arm was last seen. Start with -1 (never seen).

full_restart_when_refresh = None

Should we fully restart the algorithm, or simply reset one arm's empirical average?

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the small average is too far away from it.
__module__ = 'Policies.SlidingWindowRestart'
class Policies.SlidingWindowRestart.SWR_klUCB(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, *args, **kwargs)[source]

Bases: Policies.klUCB.klUCB

An experimental policy, using a sliding window of for instance \(\tau=100\) draws, and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).

Warning

FIXME I should remove this code, it’s useless now that the generic wrapper SlidingWindowRestart works fine.

__init__(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
tau = None

Size of the sliding window.

threshold = None

Threshold to know when to restart the base algorithm.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_pulls = None

Keep in memory the times when each arm was last seen. Start with -1 (never seen).

full_restart_when_refresh = None

Should we fully restart the algorithm, or simply reset one arm's empirical average?

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

  • Reset the whole empirical average if the small average is too far away from it.
__module__ = 'Policies.SlidingWindowRestart'
Policies.SlidingWindowUCB module

An experimental policy, using only a sliding window (of for instance \(\tau=1000\) steps, not counting draws of each arm) instead of using the full-size history.

  • Reference: [On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • It uses an additional \(\mathcal{O}(\tau)\) memory but do not cost anything else in terms of time complexity (the average is done with a sliding window, and costs \(\mathcal{O}(1)\) at every time step).

Warning

This is very experimental!

Note

This is similar to SlidingWindowRestart.SWR_UCB but slightly different: SlidingWindowRestart.SWR_UCB uses a window of size \(T_0=100\) to keep in memory the last 100 draws of each arm, and restart the index if the small history mean is too far away from the whole mean, while this SWUCB uses a fixed-size window of size \(\tau=1000\) to keep in memory the last 1000 steps.

Policies.SlidingWindowUCB.TAU = 1000

Size of the sliding window.

Policies.SlidingWindowUCB.ALPHA = 1.0

Default value for the constant \(\alpha\).

class Policies.SlidingWindowUCB.SWUCB(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]

Bases: Policies.IndexPolicy.IndexPolicy

An experimental policy, using only a sliding window (of for instance \(\tau=1000\) steps, not counting draws of each arm) instead of using the full-size history.

__init__(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
tau = None

Size \(\tau\) of the sliding window.

alpha = None

Constant \(\alpha\) in the square-root in the computation for the index.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_choices = None

Keep in memory the times where each arm was last seen.

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\tau}(t)\) pulls of arm \(k\):

\[\begin{split}I_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} + c_{k,\tau}(t),\\ \text{where}\;\; c_{k,\tau}(t) &:= \sqrt{\alpha \frac{\log(\min(t,\tau))}{N_{k,\tau}(t)}},\\ \text{and}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(s) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(s) = k).\end{split}\]
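For illustration, a sketch of this index computation with numpy; last_rewards and last_choices are illustrative stand-ins for the class attributes of the same names, restricted to the last \(\min(t,\tau)\) steps:

    import numpy as np

    def swucb_index(arm, last_rewards, last_choices, t, tau=1000, alpha=1.0):
        in_window = (last_choices == arm)
        N_k_tau = np.count_nonzero(in_window)
        if N_k_tau == 0:
            return float('+inf')              # an arm not seen in the window gets an infinite index
        mean = np.sum(last_rewards[in_window]) / N_k_tau
        bonus = np.sqrt(alpha * np.log(min(t, tau)) / N_k_tau)
        return mean + bonus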
__module__ = 'Policies.SlidingWindowUCB'
class Policies.SlidingWindowUCB.SWUCBPlus(nbArms, horizon=None, *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWUCB

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arm) instead of using the full-size history.

  • Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
__init__(nbArms, horizon=None, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

__module__ = 'Policies.SlidingWindowUCB'
Policies.SlidingWindowUCB.constant_c = 1.0

Default value, as it was in pymaBandits v1.0.

Policies.SlidingWindowUCB.tolerance = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.SlidingWindowUCB.SWklUCB(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWUCB

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arm) instead of using the full-size history, and using klUCB (see Policy.klUCB) indexes instead of UCB.

__init__(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
klucb = None

kl function to use

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(\min(t,\tau))}{N_{k,\tau}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(s) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(s) = k).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

__module__ = 'Policies.SlidingWindowUCB'
class Policies.SlidingWindowUCB.SWklUCBPlus(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWklUCB, Policies.SlidingWindowUCB.SWUCBPlus

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arm) instead of using the full-size history, and using klUCB (see Policy.klUCB) indexes instead of UCB.

  • Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
__str__()[source]

-> str

__module__ = 'Policies.SlidingWindowUCB'
Policies.Softmax module

The Boltzmann Exploration (Softmax) index policy.

Policies.Softmax.UNBIASED = False

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or as unbiased estimators, \(r_t / \mathrm{trusts}_t\).

class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Boltzmann Exploration (Softmax) index policy, with a constant temperature \(\eta_t\).

__init__(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Flag

startGame()[source]

Nothing special to do.

__str__()[source]

-> str

temperature

Constant temperature, \(\eta_t\).

trusts

Update the trusts probabilities according to the Softmax (i.e., Boltzmann) distribution on accumulated rewards, and with the temperature \(\eta_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= \exp\left( \frac{X_k(t)}{\eta_t N_k(t)} \right) \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

Where \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\) is the sum of rewards from arm k.
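For illustration, a sketch of this trusts computation; rewards and pulls are hypothetical per-arm arrays of cumulated rewards \(X_k(t)\) and pull counts \(N_k(t)\):

    import numpy as np

    def softmax_trusts(rewards, pulls, eta):
        means = rewards / np.maximum(pulls, 1)   # X_k(t) / N_k(t), avoiding division by zero
        weights = np.exp(means / eta)            # exponential weights with temperature eta
        return weights / np.sum(weights)         # normalized trusts probabilities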

choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with fixed temperature \(\eta_t = \eta_0\) chosen with a knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

temperature

Fixed temperature, small, knowing the horizon: \(\eta_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with decreasing temperature \(\eta_t\).

__str__()[source]

-> str

temperature

Decreasing temperature with the time: \(\eta_t = \sqrt{\frac{\log(K)}{t K}}\) (heuristic).

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftMix(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Another Softmax with decreasing temperature \(\eta_t\).

__str__()[source]

-> str

__module__ = 'Policies.Softmax'
temperature

Decreasing temperature with the time: \(\eta_t = c \frac{\log(t)}{t}\) (heuristic).

Policies.SparseUCB module

The SparseUCB policy, designed to tackle sparse stochastic bandit problems:

  • This means that only a small subset of size s of the K arms has non-zero means.
  • The SparseUCB algorithm requires to know exactly the value of s.
  • Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].

Warning

This algorithm only works for sparse Gaussian (or sub-Gaussian) stochastic bandits.

class Policies.SparseUCB.Phase

Bases: enum.Enum

Different states during the SparseUCB algorithm.

  • RoundRobin means all are sampled once.
  • ForceLog uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • UCB is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
ForceLog = 2
RoundRobin = 1
UCB = 3
__module__ = 'Policies.SparseUCB'
Policies.SparseUCB.ALPHA = 4

Default parameter for \(\alpha\) for the UCB indexes.

class Policies.SparseUCB.SparseUCB(nbArms, sparsity=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBalpha.UCBalpha

The SparseUCB policy, designed to tackle sparse stochastic bandit problems.

  • By default, assume sparsity = nbArms.
__init__(nbArms, sparsity=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
sparsity = None

Known value of the sparsity of the current problem.

phase = None

Current phase of the algorithm.

force_to_see = None

Binary array for the set \(\mathcal{J}(t)\).

goods = None

Binary array for the set \(\mathcal{K}(t)\).

offset = None

Next arm to sample, for the Round-Robin phase

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

update_j()[source]

Recompute the set \(\mathcal{J}(t)\):

\[\mathcal{J}(t) = \left\{ k \in [1,...,K]\;, \frac{X_k(t)}{N_k(t)} \geq \sqrt{\frac{\alpha \log(N_k(t))}{N_k(t)}} \right\}.\]
update_k()[source]

Recompute the set \(\mathcal{K}(t)\):

\[\mathcal{K}(t) = \left\{ k \in [1,...,K]\;, \frac{X_k(t)}{N_k(t)} \geq \sqrt{\frac{\alpha \log(t)}{N_k(t)}} \right\}.\]
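For illustration, a sketch of these two tests (assuming every arm has been pulled at least once); rewards and pulls are hypothetical per-arm arrays:

    import numpy as np

    def sparse_sets(rewards, pulls, t, alpha=4):
        means = rewards / pulls
        in_J = means >= np.sqrt(alpha * np.log(pulls) / pulls)   # boolean mask for J(t)
        in_K = means >= np.sqrt(alpha * np.log(t) / pulls)       # boolean mask for K(t)
        return in_J, in_K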
choice()[source]

Choose the next arm to play:

  • If still in a Round-Robin phase, play the next arm,
  • Otherwise, recompute the set \(\mathcal{J}(t)\),
  • If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
    • Start a new Round-Robin phase from arm 0.
  • Otherwise, recompute the second set \(\mathcal{K}(t)\),
  • If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
    • Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • Otherwise,
    • Play a UCB step by choosing an arm with highest UCB index from the set \(\mathcal{K}(t)\).
__module__ = 'Policies.SparseUCB'
Policies.SparseWrapper module

The SparseWrapper policy, designed to tackle sparse stochastic bandit problems:

  • This means that only a small subset of size s of the K arms has non-zero means.
  • The SparseWrapper algorithm requires to know exactly the value of s.
  • This SparseWrapper is a very generic version, and can use any index policy for both the decision in the UCB phase and the construction of the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).
  • The usual UCB indexes can be used for the set \(\mathcal{K}(t)\) by setting the flag use_ucb_for_set_K to true (but usually the indexes from the underlying policy can be used efficiently for set \(\mathcal{K}(t)\)), and for the set \(\mathcal{J}(t)\) by setting the flag use_ucb_for_set_J to true (as its formula is less easily generalized).
  • If used with Policy.UCBalpha or Policy.klUCB, it should be better to use directly Policy.SparseUCB or Policy.SparseklUCB.
  • Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)] who introduced SparseUCB.

Warning

This is very EXPERIMENTAL! I have no proof yet! But it works fine!!

Policies.SparseWrapper.default_index_policy

alias of Policies.UCBalpha.UCBalpha

class Policies.SparseWrapper.Phase

Bases: enum.Enum

Different states during the SparseWrapper algorithm.

  • RoundRobin means all are sampled once.
  • ForceLog uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • UCB is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
ForceLog = 2
RoundRobin = 1
UCB = 3
__module__ = 'Policies.SparseWrapper'
Policies.SparseWrapper.USE_UCB_FOR_SET_K = False

Default value for the flag controlling whether the usual UCB indexes are used for the set \(\mathcal{K}(t)\). The default is to use the indexes of the underlying policy, which could be more efficient.

Policies.SparseWrapper.USE_UCB_FOR_SET_J = False

Default value for the flag controlling whether the usual UCB indexes are used for the set \(\mathcal{J}(t)\). The default is to use the UCB indexes, as there is no clean and generic formula to obtain the indexes for \(\mathcal{J}(t)\) from the indexes of the underlying policy. Note that I found a formula, it is just dirty. See below.

Policies.SparseWrapper.ALPHA = 1

Default parameter for \(\alpha\) for the UCB indexes.

class Policies.SparseWrapper.SparseWrapper(nbArms, sparsity=None, use_ucb_for_set_K=False, use_ucb_for_set_J=False, alpha=1, policy=<class 'Policies.UCBalpha.UCBalpha'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The SparseWrapper policy, designed to tackle sparse stochastic bandit problems.

  • By default, assume sparsity = nbArms.
__init__(nbArms, sparsity=None, use_ucb_for_set_K=False, use_ucb_for_set_J=False, alpha=1, policy=<class 'Policies.UCBalpha.UCBalpha'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

sparsity = None

Known value of the sparsity of the current problem.

use_ucb_for_set_K = None

Whether the usual UCB indexes are used for the set \(\mathcal{K}(t)\).

use_ucb_for_set_J = None

Whether the usual UCB indexes are used for the set \(\mathcal{J}(t)\).

alpha = None

Parameter \(\alpha\) for the UCB indexes for the two sets, if not using the indexes of the underlying policy.

phase = None

Current phase of the algorithm.

force_to_see = None

Binary array for the set \(\mathcal{J}(t)\).

goods = None

Binary array for the set \(\mathcal{K}(t)\).

offset = None

Next arm to sample, for the Round-Robin phase

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

update_j()[source]

Recompute the set \(\mathcal{J}(t)\):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= I_k^{P}(t) - \hat{\mu}_k(t),\\ U^{\mathcal{J}}_k(t) &= U^{\mathcal{K}}_k(t) \times \sqrt{\frac{\log(N_k(t))}{\log(t)}},\\ \mathcal{J}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{J}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]
  • Yes, this is nothing but a hack, as there is no generic formula to retrieve the indexes used in the set \(\mathcal{J}(t)\) from the indexes \(I_k^{P}(t)\) of the underlying index policy \(P\).
  • If use_ucb_for_set_J is True, the same formula from Policies.SparseUCB is used.

Warning

FIXME rewrite the above with LCB and UCB instead of this weird U - mean.

__module__ = 'Policies.SparseWrapper'
update_k()[source]

Recompute the set \(\mathcal{K}(t)\):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= I_k^{P}(t) - \hat{\mu}_k(t),\\ \mathcal{K}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{K}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]
choice()[source]

Choose the next arm to play:

  • If still in a Round-Robin phase, play the next arm,
  • Otherwise, recompute the set \(\mathcal{J}(t)\),
  • If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
    • Start a new Round-Robin phase from arm 0.
  • Otherwise, recompute the second set \(\mathcal{K}(t)\),
  • If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
    • Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • Otherwise,
    • Play a UCB step by choosing an arm with highest index (from the underlying policy) from the set \(\mathcal{K}(t)\).
Policies.SparseklUCB module

The SparseklUCB policy, designed to tackle sparse stochastic bandit problems:

  • This means that only a small subset of size s of the K arms has non-zero means.
  • The SparseklUCB algorithm requires to know exactly the value of s.
  • This SparseklUCB is my version. It uses the KL-UCB index for both the decision in the UCB phase and the construction of the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).
  • The usual UCB indexes can be used for the sets by setting the flag use_ucb_for_sets to true.
  • Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)] who introduced SparseUCB.

Warning

This algorithm only works for sparse Gaussian (or sub-Gaussian) stochastic bandits, of known variance.

class Policies.SparseklUCB.Phase

Bases: enum.Enum

Different states during the SparseklUCB algorithm.

  • RoundRobin means all are sampled once.
  • ForceLog uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • UCB is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
ForceLog = 2
RoundRobin = 1
UCB = 3
__module__ = 'Policies.SparseklUCB'
Policies.SparseklUCB.c = 1.0

Default value, as it was in pymaBandits v1.0.

Policies.SparseklUCB.USE_UCB_FOR_SETS = False

Default value for the flag controlling whether the usual UCB indexes are used for the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\). The default is to use the KL-UCB indexes, which should be more efficient.

class Policies.SparseklUCB.SparseklUCB(nbArms, sparsity=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, use_ucb_for_sets=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The SparseklUCB policy, designed to tackle sparse stochastic bandit problems.

  • By default, assume sparsity = nbArms.
__init__(nbArms, sparsity=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, use_ucb_for_sets=False, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
sparsity = None

Known value of the sparsity of the current problem.

use_ucb_for_sets = None

Whether the usual UCB indexes are used for the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).

phase = None

Current phase of the algorithm.

force_to_see = None

Binary array for the set \(\mathcal{J}(t)\).

goods = None

Binary array for the set \(\mathcal{K}(t)\).

offset = None

Next arm to sample, for the Round-Robin phase

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

update_j()[source]

Recompute the set \(\mathcal{J}(t)\):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{J}}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(N_k(t))}{N_k(t)} \right\},\\ \mathcal{J}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{J}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]
update_k()[source]

Recompute the set \(\mathcal{K}(t)\):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathcal{K}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{K}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]
__module__ = 'Policies.SparseklUCB'
choice()[source]

Choose the next arm to play:

  • If still in a Round-Robin phase, play the next arm,
  • Otherwise, recompute the set \(\mathcal{J}(t)\),
  • If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
    • Start a new Round-Robin phase from arm 0.
  • Otherwise, recompute the second set \(\mathcal{K}(t)\),
  • If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
    • Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
  • Otherwise,
    • Play a UCB step by choosing an arm with highest KL-UCB index from the set \(\mathcal{K}(t)\).
Policies.SuccessiveElimination module

Generic policy based on successive elimination, mostly useless except to maintain a clear hierarchy of inheritance.

class Policies.SuccessiveElimination.SuccessiveElimination(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

Generic policy based on successive elimination, mostly useless except to maintain a clear hierarchy of inheritance.

choice()[source]

In a policy based on successive elimination, choosing an arm is the same as choosing an arm from the set of active arms (self.activeArms) with the method choiceFromSubSet.

__module__ = 'Policies.SuccessiveElimination'
Policies.TakeFixedArm module

TakeFixedArm: always select a fixed arm. This is the perfect static policy if armIndex = bestArmIndex (not realistic, for test only).

class Policies.TakeFixedArm.TakeFixedArm(nbArms, armIndex=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

TakeFixedArm: always select a fixed arm. This is the perfect static policy if armIndex = bestArmIndex (not realistic, for test only).

__init__(nbArms, armIndex=None, lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms

armIndex = None

Fixed arm

__str__()[source]

-> str

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Always the same choice.

choiceWithRank(rank=1)[source]

Ignore the rank.

__module__ = 'Policies.TakeFixedArm'
Policies.TakeRandomFixedArm module

TakeRandomFixedArm: first select a random sub-set of arms, then always select from it (not realistic, for test only).

class Policies.TakeRandomFixedArm.TakeRandomFixedArm(nbArms, lower=0.0, amplitude=1.0, nbArmIndexes=None)[source]

Bases: Policies.TakeFixedArm.TakeFixedArm

TakeRandomFixedArm: first selects a random sub-set of arms, then always selects from it.

__init__(nbArms, lower=0.0, amplitude=1.0, nbArmIndexes=None)[source]

New policy.

nbArms = None

Number of arms

armIndexes = None

Fix the set of arms

__str__()[source]

-> str

choice()[source]

Uniform choice from armIndexes.

__module__ = 'Policies.TakeRandomFixedArm'
Policies.Thompson module

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.
  • Reference: [Thompson - Biometrika, 1933].
class Policies.Thompson.Thompson(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

The Thompson (Bayesian) index policy.

  • By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.

  • Prior is initially flat, i.e., \(a=\alpha_0=1\) and \(b=\beta_0=1\).

  • A non-flat prior for each arm can be given with parameters a and b, for instance:

    nbArms = 2
    prior_failures  = a = 100
    prior_successes = b = 50
    policy = Thompson(nbArms, a=a, b=b)
    np.mean([policy.choice() for _ in range(1000)])  # 0.515 ~= 0.5: each arm has same prior!
    
  • A different prior for each arm can be given with parameters params_for_each_posterior, for instance:

    nbArms = 2
    params0 = { 'a': 10, 'b': 5}  # mean 1/3
    params1 = { 'a': 5, 'b': 10}  # mean 2/3
    params = [params0, params1]
    policy = Thompson(nbArms, params_for_each_posterior=params)
    np.mean([policy.choice() for _ in range(1000)])  # 0.9719 ~= 1: arm 1 is better than arm 0 !
    
  • Reference: [Thompson - Biometrika, 1933].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by sampling from the Beta posterior:

\[\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \tilde{S_k}(t), 1 + \tilde{N_k}(t) - \tilde{S_k}(t)).\end{split}\]
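For illustration, a sketch of this sampling step for Bernoulli rewards and a flat Beta(1, 1) prior; successes and pulls are hypothetical per-arm arrays playing the role of \(S_k(t)\) and \(N_k(t)\):

    import numpy as np

    def thompson_choice(successes, pulls):
        samples = np.random.beta(1 + successes, 1 + pulls - successes)   # one posterior draw per arm
        return np.argmax(samples)                                        # play an arm with the largest sample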
__module__ = 'Policies.Thompson'
Policies.TrekkingTSN module

TrekkingTSN: implementation of the decentralized multi-player policy from [R.Kumar, A.Yadav, S.J.Darak, M.K.Hanawal, Trekking based Distributed Algorithm for Opportunistic Spectrum Access in Infrastructure-less Network, 2018](XXX).

  • Each player has 3 states: not started, then the channel characterization phase, then the Trekking phase.
  • 1st step
    • FIXME
  • 2nd step:
    • FIXME
Policies.TrekkingTSN.special_times(nbArms=10, theta=0.01, epsilon=0.1, delta=0.05)[source]

Compute the lower-bound suggesting “large-enough” values for the different parameters \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) that should guarantee constant regret with probability at least \(1 - \delta\), if the gap \(\Delta\) is larger than \(\epsilon\) and the smallest mean is larger than \(\theta\).

\[\begin{split}T_{RH} &= \frac{\log(\frac{\delta}{3 K})}{\log(1 - \theta (1 - \frac{1}{K})^{K-1})} \\ T_{SH} &= (2 K / \varepsilon^2) \log(\frac{2 K^2}{\delta / 3}) \\ T_{TR} &= \lceil\frac{\log((\delta / 3) K XXX)}{\log(1 - \theta)} \rceil \frac{(K - 1) K}{2}.\end{split}\]
  • Cf. Theorem 1 of [Kumar et al., 2018](XXX).
  • Examples:
>>> nbArms = 8
>>> theta = Delta = 0.07
>>> epsilon = theta
>>> delta = 0.1
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(197, 26949, -280)
>>> delta = 0.01
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(279, 34468, 616)
>>> delta = 0.001
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(362, 41987, 1512)
Policies.TrekkingTSN.boundOnFinalRegret(T_RH, T_SH, T_TR, nbPlayers, nbArms)[source]

Use the upper-bound on regret when \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) and \(M\) are known.

  • The “constant” regret of course grows linearly with \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\), as:

    \[\forall T \geq T_{RH} + T_{SH} + T_{TR}, \;\; R_T \leq M (T_{RH} + (1 - \frac{M}{K}) T_{SH} + T_{TR}).\]
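For illustration, a one-line sketch of this bound (it reproduces the values in the examples below):

    def bound_on_final_regret(T_RH, T_SH, T_TR, nbPlayers, nbArms):
        M, K = nbPlayers, nbArms
        return M * (T_RH + (1.0 - float(M) / K) * T_SH + T_TR)

    # bound_on_final_regret(197, 26949, -280, 2, 8) == 40257.5, as in the first example below.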

Warning

This bound is not a deterministic result, it is only valid with a certain probability (at least \(1 - \delta\), if \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) are chosen as given by special_times()).

  • Cf. Theorem 1 of [Kumar et al., 2018](XXX).
  • Examples:
>>> boundOnFinalRegret(197, 26949, -280, 2, 8)  # doctest: +ELLIPSIS
40257.5
>>> boundOnFinalRegret(279, 34468, 616, 2, 8)   # doctest: +ELLIPSIS
53492.0
>>> boundOnFinalRegret(362, 41987, 1512, 2, 8)  # doctest: +ELLIPSIS
66728.5
  • For \(M=5\):
>>> boundOnFinalRegret(197, 26949, -280, 5, 8)  # doctest: +ELLIPSIS
50114.375
>>> boundOnFinalRegret(279, 34468, 616, 5, 8)   # doctest: +ELLIPSIS
69102.5
>>> boundOnFinalRegret(362, 41987, 1512, 5, 8)  # doctest: +ELLIPSIS
88095.625
  • For \(M=K=8\):
>>> boundOnFinalRegret(197, 26949, -280, 8, 8)  # doctest: +ELLIPSIS
-664.0  # there is something wrong with T_TR !
>>> boundOnFinalRegret(279, 34468, 616, 8, 8)   # doctest: +ELLIPSIS
7160.0
>>> boundOnFinalRegret(362, 41987, 1512, 8, 8)  # doctest: +ELLIPSIS
14992.0
class Policies.TrekkingTSN.State

Bases: enum.Enum

Different states during the Musical Chair algorithm

ChannelCharacterization = 2
NotStarted = 1
TrekkingTSN = 3
__module__ = 'Policies.TrekkingTSN'
class Policies.TrekkingTSN.TrekkingTSN(nbArms, theta=0.01, epsilon=0.1, delta=0.05, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

TrekkingTSN: implementation of the single-player policy from [R.Kumar, A.Yadav, S.J.Darak, M.K.Hanawal, Trekking based Distributed Algorithm for Opportunistic Spectrum Access in Infrastructure-less Network, 2018](XXX).

__init__(nbArms, theta=0.01, epsilon=0.1, delta=0.05, lower=0.0, amplitude=1.0)[source]
  • nbArms: number of arms,

Example:

>>> nbArms = 8
>>> theta, epsilon, delta = 0.01, 0.1, 0.05
>>> player1 = TrekkingTSN(nbArms, theta=theta, epsilon=epsilon, delta=delta)

For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, TrekkingTSN, nbArms, theta=theta, epsilon=epsilon, delta=delta).children
state = None

Current state

theta = None

Parameter \(\theta\).

epsilon = None

Parameter \(\epsilon\).

delta = None

Parameter \(\delta\).

T_RH = None

Parameter \(T_{RH}\) computed from special_times()

T_SH = None

Parameter \(T_{SH}\) computed from special_times()

T_CC = None

Parameter \(T_{CC} = T_{RH} + T_{SH}\)

T_TR = None

Parameter \(T_{TR}\) computed from special_times()

last_was_successful = None

That’s the l of the paper

last_choice = None

Keep memory of the last choice for CC phase

cumulatedRewards = None

That’s the V_n of the paper

nbObservations = None

That’s the S_n of the paper

lock_channel = None

That’s the L of the paper

t = None

Internal times

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

choice()[source]

Choose an arm, as described by the Musical Chair algorithm.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the Musical Chair algorithm.

  • If not collision, receive a reward after pulling the arm.
_endCCPhase()[source]

Small computation needed at the end of the initial CC phase.

handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

  • Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
__module__ = 'Policies.TrekkingTSN'
Policies.TsallisInf module

The 1/2-Tsallis-Inf policy for bounded bandits, (order) optimal for stochastic and adversarial bandits.

  • Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)
Policies.TsallisInf.ALPHA = 0.5

Default value for \(\alpha\), the parameter of the Tsallis entropy. We focus on the 1/2-Tsallis algorithm, i.e., with \(\alpha=\frac{1}{2}\).

class Policies.TsallisInf.TsallisInf(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

The 1/2-Tsallis-Inf policy for bounded bandits, (order) optimal for stochastic and adversarial bandits.

  • Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)
__init__(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Store the constant \(\alpha\) used by the Online-Mirror-Descent step using \(\alpha\) Tsallis entropy.

inverse_exponent = None

Store \(\frac{1}{\alpha-1}\) to only compute it once.

cumulative_losses = None

Keep in memory the vector \(\hat{L}_t\) of cumulative (unbiased estimates) of losses.

__str__()[source]

-> str

eta

Decreasing learning rate, \(\eta_t = \frac{1}{\sqrt{t}}\).

trusts

Trusts probabilities \(\mathrm{trusts}(t+1)\) are just the normalized weights \(w_k(t)\).

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then recompute the trusts.

Compute the trusts probabilities \(w_k(t)\) with one step of Online-Mirror-Descent for bandit, using the \(\alpha\) Tsallis entropy for the \(\Psi_t\) functions.

\[\begin{split}\mathrm{trusts}'_k(t+1) &= \nabla (\Psi_t + \mathcal{I}_{\Delta^K})^* (- \hat{L}_{t-1}), \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]
  • If \(\Delta^K\) is the probability simplex of dimension \(K\),
  • and \(\hat{L}_{t-1}\) is the cumulative loss vector, i.e., the sum of the (unbiased estimates) \(\hat{\ell}_t\) over the previous time steps,
  • where \(\hat{\ell}_{t,i} = 1(I_t = i) \frac{\ell_{t,i}}{\mathrm{trusts}_i(t)}\) is the unbiased estimate of the loss,
  • With \(\Psi_t = \Psi_{t,\alpha}(w) := - \sum_{k=1}^{K} \frac{w_k^{\alpha}}{\alpha \eta_t}\),
  • and with the (decreasing) learning rate \(\eta_t = \frac{1}{\sqrt{t}}\).
__module__ = 'Policies.TsallisInf'
Policies.UCB module

The UCB policy for bounded bandits.

  • Reference: [Lai & Robbins, 1985].
class Policies.UCB.UCB(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB policy for bounded bandits.

  • Reference: [Lai & Robbins, 1985].
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \log(t)}{N_k(t)}}.\]
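For illustration, a vectorized sketch of this index (assuming every arm has been pulled at least once); rewards and pulls are hypothetical per-arm arrays:

    import numpy as np

    def ucb_indexes(rewards, pulls, t):
        means = rewards / pulls
        return means + np.sqrt(2.0 * np.log(t) / pulls)   # empirical mean plus exploration bonus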
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCB'
Policies.UCBH module

The UCB-H policy for bounded bandits, with known horizon. Reference: [Audibert et al. 09].

class Policies.UCBH.UCBH(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBalpha.UCBalpha

The UCB-H policy for bounded bandits, with known horizon. Reference: [Audibert et al. 09].

__init__(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter alpha

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCBH'
Policies.UCBV module

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

class Policies.UCBV.UCBV(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

__str__()[source]

-> str

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
rewardsSquared = None

Keep track of the squared rewards, to compute an empirical variance.

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and of rewards squared for that arm (normalized in [0, 1]).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{2 \log(t) V_k(t)}{N_k(t)}} + 3 (b - a) \frac{\log(t)}{N_k(t)}.\end{split}\]

Where rewards are in \([a, b]\), and \(V_k(t)\) is an estimator of the variance of the rewards, obtained from \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\), the sum of rewards from arm k, and \(Z_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)^2\), the sum of squared rewards.
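For illustration, a vectorized sketch of this index (assuming every arm has been pulled at least once, and rewards in \([a, b] = [0, 1]\)); rewards, rewards_squared and pulls are hypothetical per-arm arrays:

    import numpy as np

    def ucbv_indexes(rewards, rewards_squared, pulls, t, a=0.0, b=1.0):
        means = rewards / pulls
        variances = rewards_squared / pulls - means ** 2        # empirical variance V_k(t)
        return (means
                + np.sqrt(2.0 * np.log(t) * variances / pulls)  # variance-aware bonus
                + 3.0 * (b - a) * np.log(t) / pulls)            # correction term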

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCBV'
Policies.UCBVtuned module

The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].

class Policies.UCBVtuned.UCBVtuned(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBV.UCBV

The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ V'_k(t) &= V_k(t) + \sqrt{\frac{2 \log(t)}{N_k(t)}}, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{\log(t) V'_k(t)}{N_k(t)}}.\end{split}\]

Where \(V'_k(t)\) is another estimator of the variance of the rewards, obtained from \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\), the sum of rewards from arm k, and \(Z_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)^2\), the sum of squared rewards.

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCBVtuned'
Policies.UCBalpha module

The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Reference: [Auer et al. 02].

Policies.UCBalpha.ALPHA = 4

Default parameter for alpha

class Policies.UCBalpha.UCBalpha(nbArms, alpha=4, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Reference: [Auer et al. 02].

__init__(nbArms, alpha=4, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter alpha

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\]
__module__ = 'Policies.UCBalpha'
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

Policies.UCBdagger module

The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.

  • Reference: [[Auto-tuning the Confidence Level for Optimistic Bandit Strategies, Lattimore, unpublished, 2017]](http://tor-lattimore.com/)
Policies.UCBdagger.ALPHA = 1

Default value for the parameter \(\alpha > 0\) for UCBdagger.

Policies.UCBdagger.log_bar(x)[source]

The function defined as \(\mathrm{l\overline{og}}\) by Lattimore:

\[\mathrm{l\overline{og}}(x) := \log\left((x+e)\sqrt{\log(x+e)}\right)\]

Some values:

>>> for x in np.logspace(0, 7, 8):
...     print("x = {:<5.3g} gives log_bar(x) = {:<5.3g}".format(x, log_bar(x)))
x = 1     gives log_bar(x) = 1.45
x = 10    gives log_bar(x) = 3.01
x = 100   gives log_bar(x) = 5.4
x = 1e+03 gives log_bar(x) = 7.88
x = 1e+04 gives log_bar(x) = 10.3
x = 1e+05 gives log_bar(x) = 12.7
x = 1e+06 gives log_bar(x) = 15.1
x = 1e+07 gives log_bar(x) = 17.5

Illustration:

>>> import matplotlib.pyplot as plt
>>> X = np.linspace(0, 1000, 2000)
>>> Y = log_bar(X)
>>> plt.plot(X, Y)
>>> plt.title(r"The $\mathrm{l\overline{og}}$ function")
>>> plt.show()
Policies.UCBdagger.Ki_function(pulls, i)[source]

Compute the \(K_i(t)\) index as defined in the article, for one arm i.

Policies.UCBdagger.Ki_vectorized(pulls)[source]

Compute the \(K_i(t)\) index as defined in the article, for all arms (in a vectorized manner).

Warning

I didn’t find a fast vectorized formula, so don’t use this one.

class Policies.UCBdagger.UCBdagger(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.

__init__(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter \(\alpha > 0\).

horizon = None

Parameter \(T > 0\).

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \mathrm{l}\overline{\mathrm{og}}\left( \frac{T}{H_k(t)} \right)}, \\ \text{where}\;\; & H_k(t) := N_k(t) K_k(t) \\ \text{and}\;\; & K_k(t) := \sum_{j=1}^{K} \min\left(1, \sqrt{\frac{N_j(t)}{N_k(t)}}\right).\end{split}\]
__module__ = 'Policies.UCBdagger'
Policies.UCBimproved module

The UCB-Improved policy for bounded bandits, with known horizon, as an example of a successive elimination algorithm.

Policies.UCBimproved.ALPHA = 0.5

Default value for parameter \(\alpha\).

Policies.UCBimproved.n_m(horizon, delta_m)[source]

Function \(\lceil \frac{2 \log(T \Delta_m^2)}{\Delta_m^2} \rceil\).
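For illustration, a direct sketch of this formula:

    from math import ceil, log

    def n_m(horizon, delta_m):
        # ceil(2 * log(T * delta_m**2) / delta_m**2)
        return int(ceil(2.0 * log(horizon * delta_m ** 2) / delta_m ** 2))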

class Policies.UCBimproved.UCBimproved(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.SuccessiveElimination.SuccessiveElimination

The UCB-Improved policy for bounded bandits, with known horizon, as an example of a successive elimination algorithm.

__init__(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter alpha

activeArms = None

Set of active arms

estimate_delta = None

Current estimate of the gap \(\Delta_0\)

max_nb_of_exploration = None

Keep in memory the \(n_m\) quantity, using n_m()

current_m = None

Current round m

max_m = None

Bound \(m = \lfloor \frac{1}{2} \log_2(\frac{T}{e}) \rfloor\)

when_did_it_leave = None

Also keep in memory when the arm was kicked out of the activeArms set, so a fake index can be given, for instance if we ask to order the arms.

__str__()[source]

-> str

update_activeArms()[source]

Update the set activeArms of active arms.

choice(recursive=False)[source]

In a policy based on successive elimination, choosing an arm is the same as choosing an arm from the set of active arms (self.activeArms) with the method choiceFromSubSet.

computeIndex(arm)[source]

Nothing to do, just copy from when_did_it_leave.

__module__ = 'Policies.UCBimproved'
Policies.UCBmin module

The UCB-min policy for bounded bandits, with a \(\min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right)\) term. Reference: [Anandkumar et al., 2010].

class Policies.UCBmin.UCBmin(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The UCB-min policy for bounded bandits, with a \(\min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right)\) term. Reference: [Anandkumar et al., 2010].

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right).\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCBmin'
Policies.UCBoost module

The UCBoost policy for bounded bandits (on [0, 1]).

Warning

The whole goal of their paper is to provide a numerically efficient alternative to kl-UCB, so for my comparison to be fair, I should either use the Python versions of klUCB utility functions (using kullback) or write C or Cython versions of this UCBoost module. My conclusion is that kl-UCB is always faster than UCBoost.

Policies.UCBoost.c = 0.0

Default value for better practical performance.

Policies.UCBoost.tolerance_with_upperbound = 1.0001

Tolerance when checking (with assert) that the solution(s) of any convex problem are correct.

Policies.UCBoost.CHECK_SOLUTION = False

Whether to check that the solution(s) of any convex problem are correct.

Warning

This is currently disabled, to try to optimize this module! WARNING bring it back when debugging!

Policies.UCBoost.squadratic_distance(p, q)[source]

The quadratic distance, \(d_{sq}(p, q) := 2 (p - q)^2\).

Policies.UCBoost.solution_pb_sq(p, upperbound, check_solution=False)[source]

Closed-form solution of the following optimisation problem, for \(d = d_{sq}\) the squadratic_distance() function:

\[\begin{split}P_1(d_{sq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{sq}(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[q^* = p + \sqrt{\frac{\delta}{2}}.\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
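As a quick illustration, a sketch of this closed-form solution (omitting the optional check_solution verification):

    from math import sqrt

    def solution_pb_sq(p, upperbound):
        return p + sqrt(upperbound / 2.0)   # q* = p + sqrt(delta / 2)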
class Policies.UCBoost.UCB_sq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB(d_sq) policy for bounded bandits (on [0, 1]).

__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{sq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.biquadratic_distance(p, q)[source]

The biquadratic distance, \(d_{bq}(p, q) := 2 (p - q)^2 + \frac{4}{9} (p - q)^4\).

Policies.UCBoost.solution_pb_bq(p, upperbound, check_solution=False)[source]

Closed-form solution of the following optimisation problem, for \(d = d_{bq}\) the biquadratic_distance() function:

\[\begin{split}P_1(d_{bq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{bq}(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[q^* = \min(1, p + \sqrt{-\frac{9}{4} + \sqrt{\frac{81}{16} + \frac{9}{4} \delta}}).\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
class Policies.UCBoost.UCB_bq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB(d_bq) policy for bounded bandits (on [0, 1]).

__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{bq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.hellinger_distance(p, q)[source]

The Hellinger distance, \(d_{h}(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2\).

Policies.UCBoost.solution_pb_hellinger(p, upperbound, check_solution=False)[source]

Closed-form solution of the following optimisation problem, for \(d = d_{h}\) the hellinger_distance() function:

\[\begin{split}P_1(d_h)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_h(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[q^* = \left( (1 - \frac{\delta}{2}) \sqrt{p} + \sqrt{(1 - p) (\delta - \frac{\delta^2}{4})} \right)^{2 \times \boldsymbol{1}(\delta < 2 - 2 \sqrt{p})}.\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
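A minimal sketch of this closed-form solution, as a hypothetical standalone function (the indicator in the exponent means that \(q^* = 1\) as soon as \(\delta \geq 2 - 2\sqrt{p}\)):

>>> from math import sqrt
>>> def solution_h(p, delta):  # sketch only, following the formula above
...     if delta >= 2 - 2 * sqrt(p):
...         return 1.0
...     return ((1 - delta / 2.0) * sqrt(p) + sqrt((1 - p) * (delta - delta**2 / 4.0))) ** 2
>>> solution_h(0.5, 0.2)  # doctest: +ELLIPSIS
0.89...
>>> solution_h(0.9, 0.2)  # here delta >= 2 - 2 sqrt(p), so q* = 1
1.0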
class Policies.UCBoost.UCB_h(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB(d_h) policy for bounded bandits (on [0, 1]).

__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_h)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.eps = 1e-15

Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]

Policies.UCBoost.kullback_leibler_distance_on_mean(p, q)[source]

Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{kl}(p, q) = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q)) = p \log\left(\frac{p}{q}\right) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
Policies.UCBoost.kullback_leibler_distance_lowerbound(p, q)[source]

Lower-bound on the Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[d_{lb}(p, q) = p \log\left( p \right) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
Policies.UCBoost.solution_pb_kllb(p, upperbound, check_solution=False)[source]

Closed-form solution of the following optimisation problem, for \(d = d_{lb}\) the proposed lower-bound on the Kullback-Leibler binary distance (kullback_leibler_distance_lowerbound()) function:

\[\begin{split}P_1(d_{lb})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{lb}(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[q^* = 1 - (1 - p) \exp\left(\frac{p \log(p) - \delta}{1 - p}\right).\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
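A minimal sketch of this closed-form solution, as a hypothetical standalone function (p is truncated to [eps, 1 - eps] to avoid log(0), as with Policies.UCBoost.eps above):

>>> from math import exp, log
>>> def solution_lb(p, delta, eps=1e-15):  # sketch only, following the formula above
...     p = min(max(p, eps), 1 - eps)
...     return 1 - (1 - p) * exp((p * log(p) - delta) / (1 - p))
>>> solution_lb(0.7, 0.23)  # doctest: +ELLIPSIS
0.93...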
class Policies.UCBoost.UCB_lb(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB(d_lb) policy for bounded bandits (on [0, 1]).

__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{lb})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.distance_t(p, q)[source]

A shifted tangent line function of kullback_leibler_distance_on_mean().

\[d_t(p, q) = \frac{2 q}{p + 1} + p \log\left(\frac{p}{p + 1}\right) + \log\left(\frac{2}{\mathrm{e}(p + 1)}\right).\]

Warning

I think there might be a typo in the formula in the article, as this \(d_t\) does not seem to “depend enough on q” (just intuition).

Policies.UCBoost.solution_pb_t(p, upperbound, check_solution=False)[source]

Closed-form solution of the following optimisation problem, for \(d = d_t\) the shifted tangent line approximation of kullback_leibler_distance_on_mean() (the distance_t() function):

\[\begin{split}P_1(d_t)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_t(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[q^* = \min\left(1, \frac{p + 1}{2} \left( \delta - p \log\left(\frac{p}{p + 1}\right) - \log\left(\frac{2}{\mathrm{e} (p + 1)}\right) \right)\right).\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
class Policies.UCBoost.UCB_t(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCB(d_t) policy for bounded bandits (on [0, 1]).

Warning

It has bad performance, as expected (see the paper for their remark).

__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_t)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.is_a_true_number(n)[source]

Check if n is a number or not (int, float, complex, etc.: any instance of the numbers.Number class).

class Policies.UCBoost.UCBoost(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCBoost policy for bounded bandits (on [0, 1]).

  • It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
  • set_D should be either a set of strings (and NOT functions), or a number (3, 4 or 5): 3 indicates using d_bq, d_h, d_lb; 4 adds d_t; and 5 adds d_sq (see the article, Corollary 3, p5, for more details).
  • Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
__init__(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
set_D = None

Set of strings that indicate which d functions are in the set of functions D. Warning: do not use real functions here, or the object won’t be hashable!

c = None

Parameter c

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
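As a rough illustration of this index computation, here is a sketch using only the \(d_{sq}\) and \(d_{lb}\) closed-form solutions recalled above (hypothetical standalone helpers, not the module's own functions):

>>> from math import exp, log, sqrt
>>> def solution_sq(p, delta):  # closed form for d_sq (sketch)
...     return p + sqrt(delta / 2.0)
>>> def solution_lb(p, delta, eps=1e-15):  # closed form for d_lb (sketch)
...     p = min(max(p, eps), 1 - eps)
...     return 1 - (1 - p) * exp((p * log(p) - delta) / (1 - p))
>>> def ucboost_index(mean, pulls, t, c=0.0):  # smallest of the upper bounds, sketch only
...     delta = (log(t) + c * log(max(1.0, log(t)))) / pulls
...     return min(sol(mean, delta) for sol in (solution_sq, solution_lb))
>>> ucboost_index(mean=0.7, pulls=20, t=100)  # doctest: +ELLIPSIS
0.93...

Here the \(d_{lb}\) solution gives the tighter of the two bounds.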
class Policies.UCBoost.UCBoost_bq_h_lb(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBoost.UCBoost

The UCBoost policy for bounded bandits (on [0, 1]).

  • It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
  • set_D is d_bq, d_h, d_lb (see the article, Corollary 3, p5, for more details).
  • Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
class Policies.UCBoost.UCBoost_bq_h_lb_t(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBoost.UCBoost

The UCBoost policy for bounded bandits (on [0, 1]).

  • It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
  • set_D is d_bq, d_h, d_lb, d_t (see the article, Corollary 3, p5, for more details).
  • Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
class Policies.UCBoost.UCBoost_bq_h_lb_t_sq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCBoost.UCBoost

The UCBoost policy for bounded bandits (on [0, 1]).

  • It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
  • set_D is d_bq, d_h, d_lb, d_t, d_sq (see the article, Corollary 3, p5, for more details).
  • Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
__init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
__module__ = 'Policies.UCBoost'
Policies.UCBoost.min_solutions_pb_from_epsilon(p, upperbound, epsilon=0.001, check_solution=False)[source]

List of closed-form solutions of the following optimisation problems, for \(d = d_s^k\) approximation of \(d_{kl}\) and any \(\tau_1(p) \leq k \leq \tau_2(p)\):

\[\begin{split}P_1(d_s^k)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_s^k(p, q) \leq \delta.\end{split}\]
  • The solution is:
\[\begin{split}q^* &= q_k^{\boldsymbol{1}(\delta < d_{kl}(p, q_k))},\\ d_s^k &: (p, q) \mapsto d_{kl}(p, q_k) \boldsymbol{1}(q > q_k),\\ q_k &:= 1 - \left( 1 - \frac{\varepsilon}{1 + \varepsilon} \right)^k.\end{split}\]
  • \(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
class Policies.UCBoost.UCBoostEpsilon(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The UCBoostEpsilon policy for bounded bandits (on [0, 1]).

  • It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
  • This variant uses solutions_pb_from_epsilon() to also compute the \(\varepsilon\) approximation of the kullback_leibler_distance_on_mean() function (see the article for details, Th.3 p6).
  • Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
__init__(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

__module__ = 'Policies.UCBoost'
epsilon = None

Parameter epsilon

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D_{\varepsilon}} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
Policies.UCBplus module

The UCB+ policy for bounded bandits, with a small trick on the index.

class Policies.UCBplus.UCBplus(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The UCB+ policy for bounded bandits, with a small trick on the index.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log(t / N_k(t))}{2 N_k(t)}\right)}.\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.UCBplus'
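For illustration, a minimal standalone sketch of this index (a hypothetical helper, not the class method), vectorized with numpy:

>>> import numpy as np
>>> def ucb_plus_indexes(rewards, pulls, t):  # sketch only, following the formula above
...     return rewards / pulls + np.sqrt(np.maximum(0.0, np.log(t / pulls) / (2.0 * pulls)))
>>> ucb_plus_indexes(np.array([10.0, 30.0, 5.0]), np.array([20.0, 50.0, 10.0]), t=100)  # doctest: +ELLIPSIS
array([0.70..., 0.68..., 0.83...])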
Policies.UCBrandomInit module

The UCB index policy, modified to take a random permutation order for the initial exploration of each arm (could reduce collisions in the multi-players setting). Reference: [Lai & Robbins, 1985].

class Policies.UCBrandomInit.UCBrandomInit(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.UCB.UCB

The UCB index policy, modified to take a random permutation order for the initial exploration of each arm (could reduce collisions in the multi-players setting). Reference: [Lai & Robbins, 1985].

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
choice()[source]

In an index policy, choose an arm with maximal index (uniformly at random):

\[A(t) \sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\]

Warning

In almost all cases, there is a unique arm with maximal index, so we lose a lot of time with this generic code, but I couldn’t find a way to be more efficient without losing generality.

__module__ = 'Policies.UCBrandomInit'
Policies.Uniform module

Uniform: the fully uniform policy, which selects an arm uniformly at random at each step (a deliberately naive baseline).

class Policies.Uniform.Uniform(nbArms, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

Uniform: the fully uniform policy, which selects an arm uniformly at random at each step (a deliberately naive baseline).

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

Nothing to do.

nbArms = None

Number of arms

__str__()[source]

-> str

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Uniform random choice between 0 and nbArms - 1 (inclusive).

choiceWithRank(rank=1)[source]

Ignore the rank!

__module__ = 'Policies.Uniform'
Policies.UniformOnSome module

UniformOnSome: a fully uniform policy, which selects an arm uniformly at random from a fixed set at each step (a deliberately naive baseline).

class Policies.UniformOnSome.UniformOnSome(nbArms, armIndexes=None, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Uniform.Uniform

UniformOnSome: a fully uniform policy, which selects an arm uniformly at random from a fixed set at each step (a deliberately naive baseline).

__init__(nbArms, armIndexes=None, lower=0.0, amplitude=1.0)[source]

Nothing to do.

nbArms = None

Number of arms

armIndexes = None

Arms from which to sample uniformly

__str__()[source]

-> str

choice()[source]

Uniform choice from armIndexes.

__module__ = 'Policies.UniformOnSome'
Policies.WrapRange module

A policy that acts as a wrapper around another policy P that requires knowing the range \([a, b]\) of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need to use rewards in \([0,1]\), and are easy to use if the rewards are known to be in some interval \([a, b]\) (I did this from the very beginning here, with [lower, lower+amplitude]). But if the interval \([a,b]\) is unknown, what can we do? The “Doubling Trick”, in this setting, refers to this algorithm:

  1. Start with \([a_0, b_0] = [0, 1]\),
  2. If a reward \(r_t\) is seen below \(a_i\), use \(a_{i+1} = r_t\),
  3. If a reward \(r_t\) is seen above \(b_i\), use \(b_{i+1} = r_t\) (so the new amplitude is \(r_t - a_i\)).

Instead of just doubling the length of the interval (“doubling trick”), we use \([r_t, b_i]\) or \([a_i, r_t]\), as it is the smallest interval compatible with the past observations and the new one \(r_t\).

  • Reference. I’m not sure which work is the first to have proposed this idea, but [[Normalized online learning, Stéphane Ross & Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf)] proposes a similar idea.

See also

See for instance Obandit.WrapRange by @freuk.

class Policies.WrapRange.WrapRange(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BasePolicy.BasePolicy

A policy that acts as a wrapper around another policy P that requires knowing the range \([a, b]\) of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

policy = None

Underlying policy

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Maybe change the current range and rescale all the past history, and then pass the reward, and update t.

Let us call \(r_s\) the reward at time \(s\), \(l_{t-1}\) and \(a_{t-1}\) the lower-bound and amplitude of rewards at the previous time \(t-1\), and \(l_t\) and \(a_t\) the new lower-bound and amplitude for the current time \(t\). The previous history is \(R_t := \sum_{s=1}^{t-1} r_s\).

The generic formula for rescaling the previous history is the following:

\[R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.\]

So we have the following efficient algorithm:

  1. If \(r < l_{t-1}\), let \(l_t = r\) and \(R_t := R_t + \frac{l_{t-1} - l_t}{a_t}\),
  2. Else if \(r > l_{t-1} + a_{t-1}\), let \(a_t = r - l_{t-1}\) and \(R_t := R_t \times \frac{a_{t-1}}{a_t}\),
  3. Otherwise, nothing to do, the current reward is still correctly in \([l_{t-1}, l_{t-1} + a_{t-1}]\), so simply keep \(l_t = l_{t-1}\) and \(a_t = a_{t-1}\).
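A small sketch of this update rule, as a hypothetical standalone function following the three cases above (it only returns the new rescaled history and range; the actual method also feeds the reward to the underlying policy):

>>> def rescale(R, lower, amplitude, r):  # sketch of the range update
...     if r < lower:
...         R, lower = R + (lower - r) / amplitude, r
...     elif r > lower + amplitude:
...         R, amplitude = R * amplitude / (r - lower), r - lower
...     return R, lower, amplitude
>>> rescale(R=5.0, lower=0.0, amplitude=1.0, r=2.5)  # the range grows to [0, 2.5]
(2.0, 0.0, 2.5)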
index

Get attribute index from the underlying policy.

choice()[source]

Pass the call to choice of the underlying policy.

choiceWithRank(rank=1)[source]

Pass the call to choiceWithRank of the underlying policy.

choiceFromSubSet(availableArms='all')[source]

Pass the call to choiceFromSubSet of the underlying policy.

choiceMultiple(nb=1)[source]

Pass the call to choiceMultiple of the underlying policy.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Pass the call to choiceIMP of the underlying policy.

estimatedOrder()[source]

Pass the call to estimatedOrder of the underlying policy.

estimatedBestArms(M=1)[source]

Pass the call to estimatedBestArms of the underlying policy.

computeIndex(arm)[source]

Pass the call to computeIndex of the underlying policy.

computeAllIndex()[source]

Pass the call to computeAllIndex of the underlying policy.

__module__ = 'Policies.WrapRange'
Policies.klUCB module

The generic KL-UCB policy for one-parameter exponential distributions.

Policies.klUCB.c = 1.0

default value, as it was in pymaBandits v1.0

Policies.klUCB.TOLERANCE = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.klUCB.klUCB(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The generic KL-UCB policy for one-parameter exponential distributions.

__init__(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
c = None

Parameter c

klucb = None

kl function to use

klucb_vect = None

kl function to use, in a vectorized way using numpy.vectorize().

tolerance = None

Numerical tolerance

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCB'
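For a self-contained illustration of this index for Bernoulli arms, here is a sketch using a simple bisection search on the Bernoulli KL divergence (the actual class relies on the utilities of Policies.kullback, documented below):

>>> from math import log
>>> def kl_bern(x, y, eps=1e-15):  # Bernoulli KL, inputs truncated to [eps, 1 - eps]
...     x, y = min(max(x, eps), 1 - eps), min(max(y, eps), 1 - eps)
...     return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))
>>> def klucb_bern_index(mean, pulls, t, c=1.0, precision=1e-6):  # sketch only
...     d, low, up = c * log(t) / pulls, mean, 1.0
...     while up - low > precision:
...         mid = (low + up) / 2.0
...         low, up = (low, mid) if kl_bern(mean, mid) > d else (mid, up)
...     return (low + up) / 2.0
>>> klucb_bern_index(mean=0.7, pulls=20, t=100)  # well above the empirical mean 0.7  # doctest: +ELLIPSIS
0.92...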
Policies.klUCBH module

The kl-UCB-H policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)

class Policies.klUCBH.klUCBH(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The kl-UCB-H policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)

__init__(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(T)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBH'
Policies.klUCBHPlus module

The improved kl-UCB-H+ policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)

class Policies.klUCBHPlus.klUCBHPlus(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The improved kl-UCB-H+ policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)

__init__(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(T / N_k(t))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBHPlus'
Policies.klUCBPlus module

The improved kl-UCB policy, for one-parameter exponential distributions. Reference: [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf)

class Policies.klUCBPlus.klUCBPlus(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The improved kl-UCB policy, for one-parameter exponential distributions. Reference: [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf)

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t / N_k(t))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBPlus'
Policies.klUCBPlusPlus module

The improved kl-UCB++ policy, for one-parameter exponential distributions. Reference: [Menard & Garivier, ALT 2017](https://hal.inria.fr/hal-01475078)

Policies.klUCBPlusPlus.logplus(x)[source]

\[\log^+(x) := \max(0, \log(x)).\]

Policies.klUCBPlusPlus.g(t, T, K)[source]

The exploration function g(t) (for t current time, T horizon, K nb arms), as defined on page 3 of the reference paper.

\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
Policies.klUCBPlusPlus.g_vect(t, T, K)[source]

The exploration function g(t) (for t current time, T horizon, K nb arms), as defined on page 3 of the reference paper, for numpy vectorized inputs.

\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
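A quick sketch of this exploration function, as a hypothetical standalone helper:

>>> from math import log
>>> def log_plus(x):  # log^+(x) = max(0, log(x))
...     return max(0.0, log(x)) if x > 0 else 0.0
>>> def g_sketch(t, T, K):  # sketch only, following the formula above
...     y = T / (K * t)
...     return log_plus(y * (1 + log_plus(y) ** 2))
>>> g_sketch(t=10, T=10000, K=10)  # doctest: +ELLIPSIS
7.70...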
class Policies.klUCBPlusPlus.klUCBPlusPlus(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The improved kl-UCB++ policy, for one-parameter exponential distributions. Reference: [Menard & Garivier, ALT 2017](https://hal.inria.fr/hal-01475078)

__init__(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c g(N_k(t), T, K)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1), and where \(g(t, T, K)\) is this function:

\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBPlusPlus'
Policies.klUCB_forGLR module

The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).

Policies.klUCB_forGLR.c = 3

Default value when using \(f(t) = \log(t) + c \log(\log(t))\), as klUCB_forGLR inherits from klUCBloglog.

Policies.klUCB_forGLR.TOLERANCE = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.klUCB_forGLR.klUCB_forGLR(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=3, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCBloglog.klUCBloglog

The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).

__init__(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=3, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
t_for_each_arm = None

Keep in memory not only the global time step \(t\), but also allow GLR_UCB to use a different time step \(t_k\) for each arm, in the exploration function \(f(t) = \log(t_k) + 3 \log(\log(t_k))\).

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t_k) + c \log(\log(t_k))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

Warning

The only difference with klUCB is that a custom \(t_k\) is used for each arm k, instead of a common \(t\). This policy is designed to be used with GLR_UCB.

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCB_forGLR'
Policies.klUCBloglog module

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: it uses \(\log(t) + c \log(\log(t))\) for the KL-UCB index, instead of just \(\log(t)\). Reference: [Garivier & Cappé - COLT, 2011].

Policies.klUCBloglog.c = 3

default value, as it was in pymaBandits v1.0

class Policies.klUCBloglog.klUCBloglog(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: it uses \(\log(t) + c \log(\log(t))\) for the KL-UCB index, instead of just \(\log(t)\). Reference: [Garivier & Cappé - COLT, 2011].

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t) + c \log(\max(1, \log(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBloglog'
Policies.klUCBloglog_forGLR module

The generic kl-UCB policy for one-parameter exponential distributions, with a restarted round count \(t_k\). It uses \(\log(t) + c \log(\log(t))\) for the KL-UCB index, instead of just \(\log(t)\). It is designed to be used with the wrapper GLR_UCB. By default, it assumes Bernoulli arms. Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).

Policies.klUCBloglog_forGLR.c = 3

Default value when using \(f(t) = \log(t) + c \log(\log(t))\), as klUCB_forGLR inherits from klUCBloglog.

Policies.klUCBloglog_forGLR.TOLERANCE = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.klUCBloglog_forGLR.klUCBloglog_forGLR(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=2, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB_forGLR.klUCB_forGLR

The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).

__init__(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=2, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t) + c \log(\max(1, \log(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.klUCBloglog_forGLR'
Policies.klUCBswitch module

The kl-UCB-switch policy, for bounded distributions.

Policies.klUCBswitch.TOLERANCE = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

Policies.klUCBswitch.threshold_switch_bestchoice(T, K, gamma=0.2)[source]

The threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).

\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 1/5.\]
Policies.klUCBswitch.threshold_switch_delayed(T, K, gamma=0.8888888888888888)[source]

Another threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).

\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 8/9.\]
Policies.klUCBswitch.threshold_switch_default(T, K, gamma=0.2)

The threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).

\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 1/5.\]
Policies.klUCBswitch.klucbplus_index(reward, pull, horizon, nbArms, klucb=<function klucbBern>, c=1.0, tolerance=0.0001)[source]

One kl-UCB+ index, from [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I^{KL+}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log\left(\frac{T}{K N_k(t)}\right)}{N_k(t)} \right\}.\end{split}\]
Policies.klUCBswitch.mossplus_index(reward, pull, horizon, nbArms)[source]

One MOSS+ index, from [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf):

\[I^{MOSS+}_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{T}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
class Policies.klUCBswitch.klUCBswitch(nbArms, horizon=None, threshold='best', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCB.klUCB

The kl-UCB-switch policy, for bounded distributions.

__init__(nbArms, horizon=None, threshold='best', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
horizon = None

Parameter \(T\) = known horizon of the experiment.

constant_threshold_switch = None

For klUCBswitch (not the anytime variant), we can precompute the threshold as it is constant, \(= f(T, K)\).

use_MOSS_index = None

Initialize internal memory: at first, every arm uses the kl-UCB index, then some will switch to MOSS. (Array of K bool).

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}U_k(t) = \begin{cases} U^{KL+}_k(t) & \text{if } N_k(t) \leq f(T, K), \\ U^{MOSS+}_k(t) & \text{if } N_k(t) > f(T, K). \end{cases}.\end{split}\]
__module__ = 'Policies.klUCBswitch'
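A small sketch of this switching rule, with a hypothetical threshold helper following threshold_switch_bestchoice() above:

>>> from math import floor
>>> def threshold_switch(T, K, gamma=0.2):  # f(T, K) = floor((T / K) ** gamma), sketch only
...     return floor((T / K) ** gamma)
>>> def use_moss_index(pulls, T, K):  # True once N_k(t) exceeds f(T, K)
...     return pulls > threshold_switch(T, K)
>>> threshold_switch(T=10000, K=10)
3
>>> use_moss_index(pulls=2, T=10000, K=10), use_moss_index(pulls=50, T=10000, K=10)
(False, True)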
Policies.klUCBswitch.logplus(x)[source]

The \(\log_+\) function.

\[\log_+(x) := \max(0, \log(x)).\]
Policies.klUCBswitch.phi(x)[source]

The \(\phi(x)\) function defined in equation (6) in their paper.

\[\phi(x) := \log_+(x (1 + (\log_+(x))^2)).\]
Policies.klUCBswitch.klucb_index(reward, pull, t, nbArms, klucb=<function klucbBern>, c=1.0, tolerance=0.0001)[source]

One kl-UCB index, from [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf):

\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I^{KL}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t / N_k(t))}{N_k(t)} \right\}.\end{split}\]
Policies.klUCBswitch.moss_index(reward, pull, t, nbArms)[source]

One MOSS index, from [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf):

\[I^{MOSS}_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
class Policies.klUCBswitch.klUCBswitchAnytime(nbArms, threshold='delayed', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

Bases: Policies.klUCBswitch.klUCBswitch

The anytime variant of the kl-UCB-switch policy, for bounded distributions.

__init__(nbArms, threshold='delayed', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__module__ = 'Policies.klUCBswitch'
threshold_switch = None

A function, like threshold_switch(), of T and K, to decide when to switch from kl-UCB indexes to MOSS indexes (for each arm).

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}U_k(t) = \begin{cases} U^{KL}_k(t) & \text{if } N_k(t) \leq f(t, K), \\ U^{MOSS}_k(t) & \text{if } N_k(t) > f(t, K). \end{cases}.\end{split}\]
Policies.kullback module

Kullback-Leibler divergence functions and klUCB utilities.

Warning

These functions are not vectorized, and assume only one value for each argument. If you want a vectorized function, use the wrapper numpy.vectorize:

>>> import numpy as np
>>> klBern_vect = np.vectorize(klBern)
>>> klBern_vect([0.1, 0.5, 0.9], 0.2)  # doctest: +ELLIPSIS
array([0.036..., 0.223..., 1.145...])
>>> klBern_vect(0.4, [0.2, 0.3, 0.4])  # doctest: +ELLIPSIS
array([0.104..., 0.022..., 0...])
>>> klBern_vect([0.1, 0.5, 0.9], [0.2, 0.3, 0.4])  # doctest: +ELLIPSIS
array([0.036..., 0.087..., 0.550...])

For some functions, you would be better off writing a vectorized version manually, for instance if you want to fix a value of some optional parameters:

>>> # WARNING using np.vectorize gave weird result on klGauss
>>> # klGauss_vect = np.vectorize(klGauss, excluded="y")
>>> def klGauss_vect(xs, y, sig2x=0.25):  # vectorized for first input only
...    return np.array([klGauss(x, y, sig2x) for x in xs])
>>> klGauss_vect([-1, 0, 1], 0.1)  # doctest: +ELLIPSIS
array([2.42, 0.02, 1.62])
Policies.kullback.eps = 1e-15

Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]

Policies.kullback.klBern(x, y)[source]

Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
>>> klBern(0.5, 0.5)
0.0
>>> klBern(0.1, 0.9)  # doctest: +ELLIPSIS
1.757779...
>>> klBern(0.9, 0.1)  # And this KL is symmetric  # doctest: +ELLIPSIS
1.757779...
>>> klBern(0.4, 0.5)  # doctest: +ELLIPSIS
0.020135...
>>> klBern(0.01, 0.99)  # doctest: +ELLIPSIS
4.503217...
  • Special values:
>>> klBern(0, 1)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
34.539575...
Policies.kullback.klBin(x, y, n)[source]

Kullback-Leibler divergence for Binomial distributions. https://math.stackexchange.com/questions/320399/kullback-leibner-divergence-of-binomial-distributions

  • It is simply the n times klBern() on x and y.
\[\mathrm{KL}(\mathrm{Bin}(x, n), \mathrm{Bin}(y, n)) = n \times \left(x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}) \right).\]

Warning

The two distributions must have the same parameter n, and x, y are p, q in (0, 1).

>>> klBin(0.5, 0.5, 10)
0.0
>>> klBin(0.1, 0.9, 10)  # doctest: +ELLIPSIS
17.57779...
>>> klBin(0.9, 0.1, 10)  # And this KL is symmetric  # doctest: +ELLIPSIS
17.57779...
>>> klBin(0.4, 0.5, 10)  # doctest: +ELLIPSIS
0.20135...
>>> klBin(0.01, 0.99, 10)  # doctest: +ELLIPSIS
45.03217...
  • Special values:
>>> klBin(0, 1, 10)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
345.39575...
Policies.kullback.klPoisson(x, y)[source]

Kullback-Leibler divergence for Poisson distributions. https://en.wikipedia.org/wiki/Poisson_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathrm{Poisson}(x), \mathrm{Poisson}(y)) = y - x + x \times \log(\frac{x}{y}).\]
>>> klPoisson(3, 3)
0.0
>>> klPoisson(2, 1)  # doctest: +ELLIPSIS
0.386294...
>>> klPoisson(1, 2)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klPoisson(3, 6)  # doctest: +ELLIPSIS
0.920558...
>>> klPoisson(6, 8)  # doctest: +ELLIPSIS
0.273907...
  • Special values:
>>> klPoisson(1, 0)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
33.538776...
>>> klPoisson(0, 0)
0.0
Policies.kullback.klExp(x, y)[source]

Kullback-Leibler divergence for exponential distributions. https://en.wikipedia.org/wiki/Exponential_distribution#Kullback.E2.80.93Leibler_divergence

\[\begin{split}\mathrm{KL}(\mathrm{Exp}(x), \mathrm{Exp}(y)) = \begin{cases} \frac{x}{y} - 1 - \log(\frac{x}{y}) & \text{if } x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]
>>> klExp(3, 3)
0.0
>>> klExp(3, 6)  # doctest: +ELLIPSIS
0.193147...
>>> klExp(1, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.193147...
>>> klExp(2, 1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klExp(4, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.306852...
>>> klExp(6, 8)  # doctest: +ELLIPSIS
0.037682...
  • x, y have to be positive:
>>> klExp(-3, 2)
inf
>>> klExp(3, -2)
inf
>>> klExp(-3, -2)
inf
Policies.kullback.klGamma(x, y, a=1)[source]

Kullback-Leibler divergence for gamma distributions. https://en.wikipedia.org/wiki/Gamma_distribution#Kullback.E2.80.93Leibler_divergence

  • It is simply the a times klExp() on x and y.
\[\begin{split}\mathrm{KL}(\Gamma(x, a), \Gamma(y, a)) = \begin{cases} a \times \left( \frac{x}{y} - 1 - \log(\frac{x}{y}) \right) & \text{if } x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]

Warning

The two distributions must have the same parameter a.

>>> klGamma(3, 3)
0.0
>>> klGamma(3, 6)  # doctest: +ELLIPSIS
0.193147...
>>> klGamma(1, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.193147...
>>> klGamma(2, 1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
0.306852...
>>> klGamma(4, 2)  # Only the proportion between x and y is used  # doctest: +ELLIPSIS
0.306852...
>>> klGamma(6, 8)  # doctest: +ELLIPSIS
0.037682...
  • x, y have to be positive:
>>> klGamma(-3, 2)
inf
>>> klGamma(3, -2)
inf
>>> klGamma(-3, -2)
inf
Policies.kullback.klNegBin(x, y, r=1)[source]

Kullback-Leibler divergence for negative binomial distributions. https://en.wikipedia.org/wiki/Negative_binomial_distribution

\[\mathrm{KL}(\mathrm{NegBin}(x, r), \mathrm{NegBin}(y, r)) = r \times \log((r + x) / (r + y)) - x \times \log(y \times (r + x) / (x \times (r + y))).\]

Warning

The two distributions must have the same parameter r.

>>> klNegBin(0.5, 0.5)
0.0
>>> klNegBin(0.1, 0.9)  # doctest: +ELLIPSIS
-0.711611...
>>> klNegBin(0.9, 0.1)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
2.0321564...
>>> klNegBin(0.4, 0.5)  # doctest: +ELLIPSIS
-0.130653...
>>> klNegBin(0.01, 0.99)  # doctest: +ELLIPSIS
-0.717353...
  • Special values:
>>> klBern(0, 1)  # Should be +inf, but 0 --> eps, 1 --> 1 - eps  # doctest: +ELLIPSIS
34.539575...
  • With other values for r:
>>> klNegBin(0.5, 0.5, r=2)
0.0
>>> klNegBin(0.1, 0.9, r=2)  # doctest: +ELLIPSIS
-0.832991...
>>> klNegBin(0.1, 0.9, r=4)  # doctest: +ELLIPSIS
-0.914890...
>>> klNegBin(0.9, 0.1, r=2)  # And this KL is non-symmetric  # doctest: +ELLIPSIS
2.3325528...
>>> klNegBin(0.4, 0.5, r=2)  # doctest: +ELLIPSIS
-0.154572...
>>> klNegBin(0.01, 0.99, r=2)  # doctest: +ELLIPSIS
-0.836257...
Policies.kullback.klGauss(x, y, sig2x=0.25, sig2y=None)[source]

Kullback-Leibler divergence for Gaussian distributions of means x and y and variances sig2x and sig2y, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):

\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \left(\frac{\sigma_x^2}{\sigma_y^2}\right)^2 - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]

See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties

  • By default, sig2y is assumed to be sig2x (same variance).

Warning

The C version does not support different variances.

>>> klGauss(3, 3)
0.0
>>> klGauss(3, 6)
18.0
>>> klGauss(1, 2)
2.0
>>> klGauss(2, 1)  # And this KL is symmetric
2.0
>>> klGauss(4, 2)
8.0
>>> klGauss(6, 8)
8.0
  • x, y can be negative:
>>> klGauss(-3, 2)
50.0
>>> klGauss(3, -2)
50.0
>>> klGauss(-3, -2)
2.0
>>> klGauss(3, 2)
2.0
  • With other values for sig2x:
>>> klGauss(3, 3, sig2x=10)
0.0
>>> klGauss(3, 6, sig2x=10)
0.45
>>> klGauss(1, 2, sig2x=10)
0.05
>>> klGauss(2, 1, sig2x=10)  # And this KL is symmetric
0.05
>>> klGauss(4, 2, sig2x=10)
0.2
>>> klGauss(6, 8, sig2x=10)
0.2
  • With different values for sig2x and sig2y:
>>> klGauss(0, 0, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
-0.0284...
>>> klGauss(0, 0, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.2243...
>>> klGauss(0, 0, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
1.1534...
>>> klGauss(0, 1, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
0.9715...
>>> klGauss(0, 1, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.7243...
>>> klGauss(0, 1, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
3.1534...
>>> klGauss(1, 0, sig2x=0.25, sig2y=0.5)  # doctest: +ELLIPSIS
0.9715...
>>> klGauss(1, 0, sig2x=0.25, sig2y=1.0)  # doctest: +ELLIPSIS
0.7243...
>>> klGauss(1, 0, sig2x=0.5, sig2y=0.25)  # not symmetric here!  # doctest: +ELLIPSIS
3.1534...

Warning

Using Policies.klUCB (and variants) with klGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Policies.kullback.klucb(x, d, kl, upperbound, precision=1e-06, lowerbound=-inf, max_iterations=50)[source]

The generic KL-UCB index computation.

  • x: value of the cum reward,
  • d: upper bound on the divergence,
  • kl: the KL divergence to be used (klBern(), klGauss(), etc),
  • upperbound, lowerbound=float('-inf'): the known bounds of the values x,
  • precision=1e-6: the precision threshold at which to stop the search,
  • max_iterations=50: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{klucb}(x, d) \simeq \sup_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) < d \}.\]

Note

It uses a bisection search, and one call to kl for each step of the bisection search.

For example, for klucbBern(), the two steps are to first compute an upperbound (as precise as possible) and then compute the kl-UCB index:

>>> x, d = 0.9, 0.2   # mean x, exploration term d
>>> upperbound = min(1., klucbGauss(x, d, sig2x=0.25))  # variance 1/4 for [0,1] bounded distributions
>>> upperbound  # doctest: +ELLIPSIS
1.0
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=10)  # doctest: +ELLIPSIS
0.9941...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=10)  # doctest: +ELLIPSIS
0.9944...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=50)  # doctest: +ELLIPSIS
0.9941...
>>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=100)  # more and more precise!  # doctest: +ELLIPSIS
0.994489...

Note

See below for more examples for different KL divergence functions.

Policies.kullback.klucbBern(x, d, precision=1e-06)[source]

KL-UCB index computation for Bernoulli distributions, using klucb().

  • Influence of x:
>>> klucbBern(0.1, 0.2)  # doctest: +ELLIPSIS
0.378391...
>>> klucbBern(0.5, 0.2)  # doctest: +ELLIPSIS
0.787088...
>>> klucbBern(0.9, 0.2)  # doctest: +ELLIPSIS
0.994489...
  • Influence of d:
>>> klucbBern(0.1, 0.4)  # doctest: +ELLIPSIS
0.519475...
>>> klucbBern(0.1, 0.9)  # doctest: +ELLIPSIS
0.734714...
>>> klucbBern(0.5, 0.4)  # doctest: +ELLIPSIS
0.871035...
>>> klucbBern(0.5, 0.9)  # doctest: +ELLIPSIS
0.956809...
>>> klucbBern(0.9, 0.4)  # doctest: +ELLIPSIS
0.999285...
>>> klucbBern(0.9, 0.9)  # doctest: +ELLIPSIS
0.999995...
Policies.kullback.klucbGauss(x, d, sig2x=0.25, precision=0.0)[source]

KL-UCB index computation for Gaussian distributions.

  • Note that it does not require any search.

Warning

It works only if the correct variance constant is given.

  • Influence of x:
>>> klucbGauss(0.1, 0.2)  # doctest: +ELLIPSIS
0.416227...
>>> klucbGauss(0.5, 0.2)  # doctest: +ELLIPSIS
0.816227...
>>> klucbGauss(0.9, 0.2)  # doctest: +ELLIPSIS
1.216227...
  • Influence of d:
>>> klucbGauss(0.1, 0.4)  # doctest: +ELLIPSIS
0.547213...
>>> klucbGauss(0.1, 0.9)  # doctest: +ELLIPSIS
0.770820...
>>> klucbGauss(0.5, 0.4)  # doctest: +ELLIPSIS
0.947213...
>>> klucbGauss(0.5, 0.9)  # doctest: +ELLIPSIS
1.170820...
>>> klucbGauss(0.9, 0.4)  # doctest: +ELLIPSIS
1.347213...
>>> klucbGauss(0.9, 0.9)  # doctest: +ELLIPSIS
1.570820...

Warning

Using Policies.klUCB (and variants) with klucbGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Policies.kullback.klucbPoisson(x, d, precision=1e-06)[source]

KL-UCB index computation for Poisson distributions, using klucb().

  • Influence of x:
>>> klucbPoisson(0.1, 0.2)  # doctest: +ELLIPSIS
0.450523...
>>> klucbPoisson(0.5, 0.2)  # doctest: +ELLIPSIS
1.089376...
>>> klucbPoisson(0.9, 0.2)  # doctest: +ELLIPSIS
1.640112...
  • Influence of d:
>>> klucbPoisson(0.1, 0.4)  # doctest: +ELLIPSIS
0.693684...
>>> klucbPoisson(0.1, 0.9)  # doctest: +ELLIPSIS
1.252796...
>>> klucbPoisson(0.5, 0.4)  # doctest: +ELLIPSIS
1.422933...
>>> klucbPoisson(0.5, 0.9)  # doctest: +ELLIPSIS
2.122985...
>>> klucbPoisson(0.9, 0.4)  # doctest: +ELLIPSIS
2.033691...
>>> klucbPoisson(0.9, 0.9)  # doctest: +ELLIPSIS
2.831573...
Policies.kullback.klucbExp(x, d, precision=1e-06)[source]

KL-UCB index computation for exponential distributions, using klucb().

  • Influence of x:
>>> klucbExp(0.1, 0.2)  # doctest: +ELLIPSIS
0.202741...
>>> klucbExp(0.5, 0.2)  # doctest: +ELLIPSIS
1.013706...
>>> klucbExp(0.9, 0.2)  # doctest: +ELLIPSIS
1.824671...
  • Influence of d:
>>> klucbExp(0.1, 0.4)  # doctest: +ELLIPSIS
0.285792...
>>> klucbExp(0.1, 0.9)  # doctest: +ELLIPSIS
0.559088...
>>> klucbExp(0.5, 0.4)  # doctest: +ELLIPSIS
1.428962...
>>> klucbExp(0.5, 0.9)  # doctest: +ELLIPSIS
2.795442...
>>> klucbExp(0.9, 0.4)  # doctest: +ELLIPSIS
2.572132...
>>> klucbExp(0.9, 0.9)  # doctest: +ELLIPSIS
5.031795...
Policies.kullback.klucbGamma(x, d, precision=1e-06)[source]

KL-UCB index computation for Gamma distributions, using klucb().

  • Influence of x:
>>> klucbGamma(0.1, 0.2)  # doctest: +ELLIPSIS
0.202...
>>> klucbGamma(0.5, 0.2)  # doctest: +ELLIPSIS
1.013...
>>> klucbGamma(0.9, 0.2)  # doctest: +ELLIPSIS
1.824...
  • Influence of d:
>>> klucbGamma(0.1, 0.4)  # doctest: +ELLIPSIS
0.285...
>>> klucbGamma(0.1, 0.9)  # doctest: +ELLIPSIS
0.559...
>>> klucbGamma(0.5, 0.4)  # doctest: +ELLIPSIS
1.428...
>>> klucbGamma(0.5, 0.9)  # doctest: +ELLIPSIS
2.795...
>>> klucbGamma(0.9, 0.4)  # doctest: +ELLIPSIS
2.572...
>>> klucbGamma(0.9, 0.9)  # doctest: +ELLIPSIS
5.031...
Policies.kullback.kllcb(x, d, kl, lowerbound, precision=1e-06, upperbound=inf, max_iterations=50)[source]

The generic KL-LCB index computation.

  • x: value of the cum reward,
  • d: lower bound on the divergence,
  • kl: the KL divergence to be used (klBern(), klGauss(), etc),
  • lowerbound, upperbound=float('+inf'): the known bounds of the values x,
  • precision=1e-6: the precision threshold at which to stop the search,
  • max_iterations=50: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{kllcb}(x, d) \simeq \inf_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) > d \}.\]

Note

It uses a bisection search, and one call to kl for each step of the bisection search.

For example, for kllcbBern(), the two steps are to first compute a lowerbound (as precise as possible) and then compute the kl-LCB index:

>>> x, d = 0.9, 0.2   # mean x, exploration term d
>>> lowerbound = max(0., kllcbGauss(x, d, sig2x=0.25))  # variance 1/4 for [0,1] bounded distributions
>>> lowerbound  # doctest: +ELLIPSIS
0.5837...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=10)  # doctest: +ELLIPSIS
0.29...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=10)  # doctest: +ELLIPSIS
0.29188...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=50)  # doctest: +ELLIPSIS
0.291886...
>>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=100)  # more and more precise!  # doctest: +ELLIPSIS
0.29188611...

Note

See below for more examples for different KL divergence functions.

Policies.kullback.kllcbBern(x, d, precision=1e-06)[source]

KL-LCB index computation for Bernoulli distributions, using kllcb().

  • Influence of x:
>>> kllcbBern(0.1, 0.2)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.5, 0.2)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbBern(0.9, 0.2)  # doctest: +ELLIPSIS
0.89999...
  • Influence of d:
>>> kllcbBern(0.1, 0.4)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.1, 0.9)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbBern(0.5, 0.4)  # doctest: +ELLIPSIS
0.4999...
>>> kllcbBern(0.5, 0.9)  # doctest: +ELLIPSIS
0.4999...
>>> kllcbBern(0.9, 0.4)  # doctest: +ELLIPSIS
0.8999...
>>> kllcbBern(0.9, 0.9)  # doctest: +ELLIPSIS
0.8999...
Policies.kullback.kllcbGauss(x, d, sig2x=0.25, precision=0.0)[source]

KL-LCB index computation for Gaussian distributions.

  • Note that it does not require any search.

Warning

It works only if the correct variance constant is given.

  • Influence of x:
>>> kllcbGauss(0.1, 0.2)  # doctest: +ELLIPSIS
-0.21622...
>>> kllcbGauss(0.5, 0.2)  # doctest: +ELLIPSIS
0.18377...
>>> kllcbGauss(0.9, 0.2)  # doctest: +ELLIPSIS
0.58377...
  • Influence of d:
>>> kllcbGauss(0.1, 0.4)  # doctest: +ELLIPSIS
-0.3472...
>>> kllcbGauss(0.1, 0.9)  # doctest: +ELLIPSIS
-0.5708...
>>> kllcbGauss(0.5, 0.4)  # doctest: +ELLIPSIS
0.0527...
>>> kllcbGauss(0.5, 0.9)  # doctest: +ELLIPSIS
-0.1708...
>>> kllcbGauss(0.9, 0.4)  # doctest: +ELLIPSIS
0.4527...
>>> kllcbGauss(0.9, 0.9)  # doctest: +ELLIPSIS
0.2291...

Warning

Using Policies.kllCB (and variants) with kllcbGauss() is equivalent to using Policies.UCB, so prefer the simpler version.

Policies.kullback.kllcbPoisson(x, d, precision=1e-06)[source]

KL-LCB index computation for Poisson distributions, using kllcb().

  • Influence of x:
>>> kllcbPoisson(0.1, 0.2)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.5, 0.2)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.9, 0.2)  # doctest: +ELLIPSIS
0.89999...
  • Influence of d:
>>> kllcbPoisson(0.1, 0.4)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.1, 0.9)  # doctest: +ELLIPSIS
0.09999...
>>> kllcbPoisson(0.5, 0.4)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.5, 0.9)  # doctest: +ELLIPSIS
0.49999...
>>> kllcbPoisson(0.9, 0.4)  # doctest: +ELLIPSIS
0.89999...
>>> kllcbPoisson(0.9, 0.9)  # doctest: +ELLIPSIS
0.89999...
Policies.kullback.kllcbExp(x, d, precision=1e-06)[source]

KL-LCB index computation for exponential distributions, using kllcb().

  • Influence of x:
>>> kllcbExp(0.1, 0.2)  # doctest: +ELLIPSIS
0.15267...
>>> kllcbExp(0.5, 0.2)  # doctest: +ELLIPSIS
0.7633...
>>> kllcbExp(0.9, 0.2)  # doctest: +ELLIPSIS
1.3740...
  • Influence of d:
>>> kllcbExp(0.1, 0.4)  # doctest: +ELLIPSIS
0.2000...
>>> kllcbExp(0.1, 0.9)  # doctest: +ELLIPSIS
0.3842...
>>> kllcbExp(0.5, 0.4)  # doctest: +ELLIPSIS
1.0000...
>>> kllcbExp(0.5, 0.9)  # doctest: +ELLIPSIS
1.9214...
>>> kllcbExp(0.9, 0.4)  # doctest: +ELLIPSIS
1.8000...
>>> kllcbExp(0.9, 0.9)  # doctest: +ELLIPSIS
3.4586...
Policies.kullback.maxEV(p, V, klMax)[source]

Maximize the expectation of \(V\) with respect to \(q\), subject to \(\mathrm{KL}(p, q) < \text{klMax}\).

Policies.kullback.reseqp(p, V, klMax, max_iterations=50)[source]

Solve f(reseqp(p, V, klMax)) = klMax, using Newton's method.

Note

This is a subroutine of maxEV().

Warning

np.dot is very slow!

Policies.kullback.reseqp2(p, V, klMax)[source]

Solve f(reseqp2(p, V, klMax)) = klMax, using a black-box minimizer from scipy.optimize.

  • FIXME it does not work well yet!

Note

This is a subroutine of maxEV().

  • Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011].

Warning

np.dot is very slow!
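
Concretely, maxEV() solves a KL-constrained linear maximization over the simplex. The sketch below only illustrates that optimization problem with a generic scipy.optimize solver; the names kl_discrete and max_expectation_under_kl are hypothetical, and the package itself uses the Newton-based reseqp search documented above.

import numpy as np
from scipy.optimize import minimize

def kl_discrete(p, q, eps=1e-15):
    """Discrete Kullback-Leibler divergence KL(p, q) between two probability vectors."""
    p = np.maximum(np.asarray(p, dtype=float), eps)
    q = np.maximum(np.asarray(q, dtype=float), eps)
    return float(np.sum(p * np.log(p / q)))

def max_expectation_under_kl(p, V, klMax):
    """Maximize E_q[V] = sum_i q_i V_i over distributions q such that KL(p, q) <= klMax."""
    p, V = np.asarray(p, dtype=float), np.asarray(V, dtype=float)
    constraints = [
        {"type": "eq", "fun": lambda q: np.sum(q) - 1.0},              # q sums to one
        {"type": "ineq", "fun": lambda q: klMax - kl_discrete(p, q)},  # KL(p, q) <= klMax
    ]
    res = minimize(lambda q: -np.dot(q, V), x0=p, bounds=[(0.0, 1.0)] * len(p),
                   constraints=constraints, method="SLSQP")
    return res.x

# Example: shift mass towards the arm with the largest value of V, within the KL ball.
q_star = max_expectation_under_kl([0.5, 0.3, 0.2], [0.0, 0.5, 1.0], klMax=0.1)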

Policies.kullback_cython module
Policies.setup module
Policies.usenumba module

Import numba.jit or a dummy decorator.

Policies.usenumba.USE_NUMBA = False

Configure the use of numba

Policies.usenumba.jit(f)[source]

Fake numba.jit decorator.
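
A module like this usually follows the guarded-import idiom: try to import the real numba.jit, and otherwise fall back to a transparent decorator. The sketch below illustrates that idiom only (it is an assumption about the pattern, not the module's actual code; fast_sum is just a toy function showing the decorator in use).

try:
    from numba import jit          # real JIT compiler, if numba is installed
    USE_NUMBA = True
except ImportError:
    USE_NUMBA = False
    def jit(f):
        """Fake numba.jit decorator: return the function unchanged."""
        return f

@jit
def fast_sum(xs):
    """JIT-compiled if numba is available, plain Python otherwise."""
    total = 0.0
    for x in xs:
        total += x
    return total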

Policies.with_proba module

Simply defines a function with_proba() that is used everywhere.

Policies.with_proba.with_proba(epsilon)[source]

Bernoulli test: with probability \(\varepsilon\) return True, and with probability \(1 - \varepsilon\) return False.

Example:

>>> from random import seed; seed(0)  # reproducible
>>> with_proba(0.5)
False
>>> with_proba(0.9)
True
>>> with_proba(0.1)
False
>>> if with_proba(0.2):
...     print("This happens 20% of the time.")
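
Such a helper is essentially a one-liner on top of random(); a minimal sketch (the name with_proba_sketch is hypothetical):

from random import random

def with_proba_sketch(epsilon):
    """Return True with probability epsilon, and False with probability 1 - epsilon."""
    return random() < epsilon
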
Policies.with_proba.random() → x in the interval [0, 1).

PoliciesMultiPlayers package

PoliciesMultiPlayers : contains various collision-avoidance protocols for the multi-players setting.

  • Selfish: a multi-player policy where every player is selfish and does not try to handle the collisions.
  • CentralizedNotFair: a multi-player policy which uses a centralized intelligence to assign users to a FIXED arm.
  • CentralizedFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbArms.
  • CentralizedMultiplePlay and CentralizedIMP: multi-player policies that use centralized but non-omniscient learning to select K = nbPlayers arms at each time step.
  • OracleNotFair: a multi-player policy with full knowledge and centralized intelligence to assign users to a FIXED arm, among the best arms.
  • OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.
  • rhoRand, ALOHA: implementations of generic collision avoidance algorithms, relying on a single-player bandit policy (e.g. UCB, Thompson, etc.), and variants: rhoRandRand, rhoRandSticky, rhoRandRotating, rhoRandEst, rhoLearn, rhoLearnEst, rhoLearnExp3, rhoRandALOHA.
  • rhoCentralized is a semi-centralized version where orthogonal ranks 1..M are given to the players, instead of just giving them the value of M, but a decentralized learning policy is still used to learn the best arms.
  • RandTopM is another approach, similar to rhoRandSticky and MusicalChair, but we hope it performs better, and we succeeded in analyzing it more easily.

All policies share the same interface, described in BaseMPPolicy for decentralized policies and in BaseCentralizedPolicy for centralized policies, so they can be used in any experiment with the following approach:

my_policy_MP = Policy_MP(nbPlayers, nbArms)
children = my_policy_MP.children             # get a list of usable single-player policies
for one_policy in children:
    one_policy.startGame()                   # start the game
k_t = [0] * nbPlayers                        # arm chosen by each player at time t
for t in range(T):
    for i in range(nbPlayers):
        k_t[i] = children[i].choice()        # choose one arm, for each player
    for k in range(nbArms):
        players_who_played_k = [ i for i in range(nbPlayers) if k_t[i] == k ]
        reward = arms[k].draw(t)             # sample a reward from arm k (assuming a list `arms` of arm objects)
        if len(players_who_played_k) > 1:    # collision: every player who chose arm k gets a zero reward
            reward = 0
        for i in players_who_played_k:
            children[i].getReward(k, reward)
Submodules
PoliciesMultiPlayers.ALOHA module

ALOHA: generalized implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421), for a generic single-player policy.

This policy uses the collision avoidance mechanism that is inspired by the classical ALOHA protocol, and any single-player policy.

PoliciesMultiPlayers.ALOHA.tnext_beta(t, beta=0.5)[source]

Simple function, as used in MEGA: upper_tnext(t) = \(t^{\beta}\). Defaults to \(t^{0.5}\).

>>> tnext_beta(100, beta=0.1)  # doctest: +ELLIPSIS
1.584...
>>> tnext_beta(100, beta=0.5)
10.0
>>> tnext_beta(100, beta=0.9)  # doctest: +ELLIPSIS
63.095...
>>> tnext_beta(1000)  # doctest: +ELLIPSIS
31.622...
PoliciesMultiPlayers.ALOHA.make_tnext_beta(beta=0.5)[source]

Returns the function \(t \mapsto t^{\beta}\).

>>> tnext = make_tnext_beta(0.5)
>>> tnext(100)
10.0
>>> tnext(1000)  # doctest: +ELLIPSIS
31.622...
PoliciesMultiPlayers.ALOHA.tnext_log(t, scaling=1.0)[source]

Other function, not the one used in MEGA, but our proposal: upper_tnext(t) = \(\text{scaling} * \log(1 + t)\).

>>> tnext_log(100, scaling=1)  # doctest: +ELLIPSIS
4.615...
>>> tnext_log(100, scaling=10)  # doctest: +ELLIPSIS
46.151...
>>> tnext_log(100, scaling=100)  # doctest: +ELLIPSIS
461.512...
>>> tnext_log(1000)  # doctest: +ELLIPSIS
6.908...
PoliciesMultiPlayers.ALOHA.make_tnext_log_scaling(scaling=1.0)[source]

Returns the function \(t \mapsto \text{scaling} * \log(1 + t)\).

>>> tnext = make_tnext_log_scaling(1)
>>> tnext(100)  # doctest: +ELLIPSIS
4.615...
>>> tnext(1000)  # doctest: +ELLIPSIS
6.908...
class PoliciesMultiPlayers.ALOHA.oneALOHA(nbPlayers, mother, playerId, nbArms, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: the ALOHA collision avoidance protocol is implemented here.
__init__(nbPlayers, mother, playerId, nbArms, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None)[source]

Initialize self. See help(type(self)) for accurate signature.

nbPlayers = None

Number of players

p0 = None

Initial probability, should not be modified

p = None

Current probability, can be modified

alpha_p0 = None

Parameter alpha for the recurrence equation for probability p(t)

beta = None

Parameter beta

tnext = None

Only store the delta time

t = None

Internal time

chosenArm = None

Last chosen arm

__str__()[source]

Return str(self).

startGame()[source]

Start game.

ftnext(t)[source]

Time until the arm is removed from the list of unavailable arms.

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the ALOHA protocol.

  • If there was no collision, receive a reward after pulling the arm.
handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

Warning

This method has to be implemented in the collision model; it is NOT implemented in the EvaluatorMultiPlayers.

Note

We do not care on which arm the collision occurred.
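
As a purely illustrative sketch of this back-off behaviour (hypothetical helper and argument names, and only one plausible reading of the protocol: persist with probability p, otherwise release the arm until a random future time and reset the persistence probability):

from random import random

def aloha_collision_sketch(p, p0, chosen_arm, tnext, t, ftnext):
    """One possible reaction to a collision, in the spirit of the ALOHA back-off.

    Returns the new persistence probability and the new chosen arm (None = pick again).
    """
    if random() < p:
        return p, chosen_arm                    # persist on the same arm
    tnext[chosen_arm] = t + 1 + int(ftnext(t))  # mark the arm unavailable until a random future time
    return p0, None                             # reset the probability; a new arm will be chosen later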

choice()[source]

Identify the available arms, and use the underlying single-player policy (UCB, Thompson etc) to choose an arm from this sub-set of arms.

__module__ = 'PoliciesMultiPlayers.ALOHA'
class PoliciesMultiPlayers.ALOHA.ALOHA(nbPlayers, nbArms, playerAlgo, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

ALOHA: implementation of the multi-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421), for a generic single-player policy.

__init__(nbPlayers, nbArms, playerAlgo, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • p0: initial probability p(0); p(t) is the probability of persistence on the chosenArm at time t
  • alpha_p0: scaling in the update for p[t+1] <- alpha_p0 * p[t] + (1 - alpha_p0)
  • ftnext: general function, defaulting to t -> t^beta, used to sample a random time t_next(k) until which the chosenArm is unavailable. t -> log(1 + t) is also possible.
  • (optional) beta: if present, overwrites ftnext, which will be t -> t^beta.
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> p0, alpha_p0 = 0.6, 0.5
>>> s = ALOHA(nbPlayers, nbArms, Thompson, p0=p0, alpha_p0=alpha_p0, ftnext=tnext_log)
>>> [ child.choice() for child in s.children ]
[6, 11, 8, 4, 8, 8]
>>> s = ALOHA(nbPlayers, nbArms, UCBalpha, p0=p0, alpha_p0=alpha_p0, beta=0.5, alpha=1)
>>> [ child.choice() for child in s.children ]
[1, 0, 5, 2, 15, 3]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
__module__ = 'PoliciesMultiPlayers.ALOHA'
nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

PoliciesMultiPlayers.ALOHA.random() → x in the interval [0, 1).
PoliciesMultiPlayers.BaseCentralizedPolicy module

Base class for any centralized policy, for the multi-players setting.

class PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy(nbArms)[source]

Bases: object

Base class for any centralized policy, for the multi-players setting.

__init__(nbArms)[source]

New policy

__str__()[source]

Return str(self).

startGame()[source]

Start the simulation.

getReward(arm, reward)[source]

Get a reward from that arm.

choice()[source]

Choose an arm.

__dict__ = mappingproxy({'__module__': 'PoliciesMultiPlayers.BaseCentralizedPolicy', '__doc__': ' Base class for any centralized policy, for the multi-players setting.', '__init__': <function BaseCentralizedPolicy.__init__>, '__str__': <function BaseCentralizedPolicy.__str__>, 'startGame': <function BaseCentralizedPolicy.startGame>, 'getReward': <function BaseCentralizedPolicy.getReward>, 'choice': <function BaseCentralizedPolicy.choice>, '__dict__': <attribute '__dict__' of 'BaseCentralizedPolicy' objects>, '__weakref__': <attribute '__weakref__' of 'BaseCentralizedPolicy' objects>})
__module__ = 'PoliciesMultiPlayers.BaseCentralizedPolicy'
__weakref__

list of weak references to the object (if defined)

PoliciesMultiPlayers.BaseMPPolicy module

Base class for any multi-players policy.

  • If rewards are not in [0, 1], be sure to give the lower value and the amplitude. Eg, if rewards are in [-3, 3], lower = -3, amplitude = 6.
class PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy[source]

Bases: object

Base class for any multi-players policy.

__init__()[source]

New policy

__str__()[source]

Return str(self).

_startGame_one(playerId)[source]

Forward the call to self._players[playerId].

_getReward_one(playerId, arm, reward)[source]

Forward the call to self._players[playerId].

_choice_one(playerId)[source]

Forward the call to self._players[playerId].

_choiceWithRank_one(playerId, rank=1)[source]

Forward the call to self._players[playerId].

_choiceFromSubSet_one(playerId, availableArms='all')[source]

Forward the call to self._players[playerId].

_choiceMultiple_one(playerId, nb=1)[source]

Forward the call to self._players[playerId].

_choiceIMP_one(playerId, nb=1)[source]

Forward the call to self._players[playerId].

_estimatedOrder_one(playerId)[source]

Forward the call to self._players[playerId].

_estimatedBestArms_one(playerId, M=1)[source]

Forward the call to self._players[playerId].

__dict__ = mappingproxy({'__module__': 'PoliciesMultiPlayers.BaseMPPolicy', '__doc__': ' Base class for any multi-players policy.', '__init__': <function BaseMPPolicy.__init__>, '__str__': <function BaseMPPolicy.__str__>, '_startGame_one': <function BaseMPPolicy._startGame_one>, '_getReward_one': <function BaseMPPolicy._getReward_one>, '_choice_one': <function BaseMPPolicy._choice_one>, '_choiceWithRank_one': <function BaseMPPolicy._choiceWithRank_one>, '_choiceFromSubSet_one': <function BaseMPPolicy._choiceFromSubSet_one>, '_choiceMultiple_one': <function BaseMPPolicy._choiceMultiple_one>, '_choiceIMP_one': <function BaseMPPolicy._choiceIMP_one>, '_estimatedOrder_one': <function BaseMPPolicy._estimatedOrder_one>, '_estimatedBestArms_one': <function BaseMPPolicy._estimatedBestArms_one>, '__dict__': <attribute '__dict__' of 'BaseMPPolicy' objects>, '__weakref__': <attribute '__weakref__' of 'BaseMPPolicy' objects>})
__module__ = 'PoliciesMultiPlayers.BaseMPPolicy'
__weakref__

list of weak references to the object (if defined)

PoliciesMultiPlayers.CentralizedCycling module

CentralizedCycling: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbArms.

  • It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
  • And it is perfectly fair on every run: each chosen arm is played successively by each player.
  • Note that it does NOT assign players to the best arms: it has no knowledge of the means of the arms, only of the number of arms nbArms.
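
As a rough illustration of the (offset + t) % nbArms rule (the class name below is hypothetical, not the module's):

class CyclingSketch:
    """Deterministic round-robin over the arms, shifted by a per-player offset."""

    def __init__(self, nbArms, offset):
        self.nbArms, self.offset, self.t = nbArms, offset, 0

    def choice(self):
        arm = (self.offset + self.t) % self.nbArms
        self.t += 1                 # advance the internal time after every choice
        return arm

Two players with distinct offsets never pick the same arm at the same time step, which is why no collision can occur.
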
class PoliciesMultiPlayers.CentralizedCycling.Cycling(nbArms, offset)[source]

Bases: PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy

Cycling: select an arm as (offset + t) % nbArms, with offset being decided by the CentralizedCycling multi-player policy.

__init__(nbArms, offset)[source]

Cycling with an offset.

nbArms = None

Number of arms

offset = None

Offset

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Choose the cycling arm.

__module__ = 'PoliciesMultiPlayers.CentralizedCycling'
class PoliciesMultiPlayers.CentralizedCycling.CentralizedCycling(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

CentralizedCycling: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbArms.

__init__(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]
  • nbPlayers: number of players to create (in self._players).
  • nbArms: number of arms.

Examples:

>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedCycling(2, 3)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[0, 2]
>>> [ child.choice() for child in s.children ]
[1, 0]
>>> [ child.choice() for child in s.children ]
[2, 1]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use
nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.CentralizedCycling'
_printNbCollisions()[source]

Print number of collisions.

PoliciesMultiPlayers.CentralizedFixed module

CentralizedFixed: a multi-player policy which uses a centralized intelligence to assign users to a FIXED arm.

  • It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
  • But it is NOT fair on ONE run: the best arm is played only by one player.
  • Note that on average, it is fair (who plays the best arm is randomly decided).
  • Note that it does NOT assign players to the best arms: it has no knowledge of the means of the arms, only of the number of arms nbArms.
class PoliciesMultiPlayers.CentralizedFixed.Fixed(nbArms, armIndex, lower=0.0, amplitude=1.0)[source]

Bases: PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy

Fixed: always select a fixed arm, as decided by the CentralizedFixed multi-player policy.

__init__(nbArms, armIndex, lower=0.0, amplitude=1.0)[source]

Fixed on this arm.

nbArms = None

Number of arms

armIndex = None

Index of the fixed arm

__str__()[source]

Return str(self).

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Choose the fixed arm.

__module__ = 'PoliciesMultiPlayers.CentralizedFixed'
class PoliciesMultiPlayers.CentralizedFixed.CentralizedFixed(nbPlayers, nbArms)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

CentralizedFixed: a multi-player policy which uses a centralized intelligence to assign users to a FIXED arm.

__init__(nbPlayers, nbArms)[source]
  • nbPlayers: number of players to create (in self._players).
  • nbArms: number of arms.

Examples:

>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedFixed(2, 3)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedFixed(4, 8)
>>> [ child.choice() for child in s.children ]
[7, 6, 1, 2]
>>> [ child.choice() for child in s.children ]
[7, 6, 1, 2]
>>> s = CentralizedFixed(10, 14)
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use
nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

_printNbCollisions()[source]

Print number of collisions.

_startGame_one(playerId)[source]

Pass the call to the player algorithm.

_getReward_one(playerId, arm, reward)[source]

Pass the call to the player algorithm.

__module__ = 'PoliciesMultiPlayers.CentralizedFixed'
_choice_one(playerId)[source]

Pass the call to the player algorithm.

PoliciesMultiPlayers.CentralizedIMP module

CentralizedIMP: a multi-player policy where ONE policy is used by a centralized agent; it asks the policy to select nbPlayers arms at each step, using a hybrid strategy: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].

class PoliciesMultiPlayers.CentralizedIMP.CentralizedIMP(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.CentralizedMultiplePlay.CentralizedMultiplePlay

CentralizedIMP: a multi-player policy where ONE policy is used by a centralized agent; it asks the policy to select nbPlayers arms at each step, using a hybrid strategy: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].

_choice_one(playerId)[source]

Use choiceIMP for each player.

__module__ = 'PoliciesMultiPlayers.CentralizedIMP'
PoliciesMultiPlayers.CentralizedMultiplePlay module

CentralizedMultiplePlay: a multi-player policy where ONE policy is used by a centralized agent; it asks the policy to select nbPlayers arms at each step.

class PoliciesMultiPlayers.CentralizedMultiplePlay.CentralizedChildPointer(mother, playerId)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Centralized version of the ChildPointer class.

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

__module__ = 'PoliciesMultiPlayers.CentralizedMultiplePlay'
class PoliciesMultiPlayers.CentralizedMultiplePlay.CentralizedMultiplePlay(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

CentralizedMultiplePlay: a multi-player policy where ONE policy is used by a centralized agent; it asks the policy to select nbPlayers arms at each step.

__init__(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • uniformAllocation: should the assignment of users always be uniform, or fixed once the UCB indexes have converged? The first choice is fairer but needs a linear number of switches; the second is not fair but needs only a constant number of switches.
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Examples:

>>> from Policies import *
>>> s = CentralizedMultiplePlay(2, 3, UCB)
>>> [ child.choice() for child in s.children ]
[2, 0]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
nbPlayers = None

Number of players

player = None

Only one policy

children = None

But nbPlayers children, fake algorithms

nbArms = None

Number of arms

uniformAllocation = None

Option: in case of multiple plays, should the assignment of users always be uniform, or fixed once the UCB indexes have converged? The first choice is fairer but needs a linear number of switches; the second is not fair but needs only a constant number of switches

choices = None

Choices, given by first call to internal algorithm

affectation_order = None

Affectation of choices to players

__str__()[source]

Return str(self).

_startGame_one(playerId)[source]

Pass the call to the player algorithm.

_getReward_one(playerId, arm, reward)[source]

Pass the call to the player algorithm.

_choice_one(playerId)[source]

Use the player algorithm on the first player's call to make the decisions, then distribute one choice to each player.

_handleCollision_one(playerId, arm, reward=None)[source]

Cannot be called!

_estimatedOrder_one(playerId)[source]

Use the centralized algorithm to estimate ranking of the arms.

__module__ = 'PoliciesMultiPlayers.CentralizedMultiplePlay'
PoliciesMultiPlayers.ChildPointer module

ChildPointer: Class that acts as a child policy, but in fact it passes all its method calls to the mother class (that can pass it to its internal i-th player, or use any centralized computation).
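
The forwarding pattern can be sketched in a few lines (hypothetical class name; the real ChildPointer forwards many more methods, listed below):

class ChildPointerSketch:
    """Minimal forwarding child: every call is delegated to the mother policy."""

    def __init__(self, mother, playerId):
        self.mother = mother          # the multi-player policy that owns the real players
        self.playerId = playerId      # index of this child in the mother's list of players

    def choice(self):
        return self.mother._choice_one(self.playerId)

    def getReward(self, arm, reward):
        return self.mother._getReward_one(self.playerId, arm, reward)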

class PoliciesMultiPlayers.ChildPointer.ChildPointer(mother, playerId)[source]

Bases: object

Class that acts as a child policy, but in fact it passes all its method calls to the mother class (that can pass it to its internal i-th player, or use any centralized computation).

__init__(mother, playerId)[source]

Initialize self. See help(type(self)) for accurate signature.

mother = None

Pointer to the mother class.

playerId = None

ID of player in the mother class list of players

nbArms = None

Number of arms (pretty print)

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

startGame()[source]

Pass the call to self.mother._startGame_one(playerId) with the player’s ID number.

getReward(arm, reward)[source]

Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

handleCollision(arm, reward=None)[source]

Pass the call to self.mother._handleCollision_one(playerId, arm, reward) with the player’s ID number.

choice()[source]

Pass the call to self.mother._choice_one(playerId) with the player’s ID number.

choiceWithRank(rank=1)[source]

Pass the call to self.mother._choiceWithRank_one(playerId) with the player’s ID number.

choiceFromSubSet(availableArms='all')[source]

Pass the call to self.mother._choiceFromSubSet_one(playerId) with the player’s ID number.

choiceMultiple(nb=1)[source]

Pass the call to self.mother._choiceMultiple_one(playerId) with the player’s ID number.

choiceIMP(nb=1)[source]

Pass the call to self.mother._choiceIMP_one(playerId) with the player’s ID number.

estimatedOrder()[source]

Pass the call to self.mother._estimatedOrder_one(playerId) with the player’s ID number.

estimatedBestArms(M=1)[source]

Pass the call to self.mother._estimatedBestArms_one(playerId) with the player’s ID number.

__dict__ = mappingproxy({'__module__': 'PoliciesMultiPlayers.ChildPointer', '__doc__': ' Class that acts as a child policy, but in fact it passes *all* its method calls to the mother class (that can pass it to its internal i-th player, or use any centralized computation).\n ', '__init__': <function ChildPointer.__init__>, '__str__': <function ChildPointer.__str__>, '__repr__': <function ChildPointer.__repr__>, 'startGame': <function ChildPointer.startGame>, 'getReward': <function ChildPointer.getReward>, 'handleCollision': <function ChildPointer.handleCollision>, 'choice': <function ChildPointer.choice>, 'choiceWithRank': <function ChildPointer.choiceWithRank>, 'choiceFromSubSet': <function ChildPointer.choiceFromSubSet>, 'choiceMultiple': <function ChildPointer.choiceMultiple>, 'choiceIMP': <function ChildPointer.choiceIMP>, 'estimatedOrder': <function ChildPointer.estimatedOrder>, 'estimatedBestArms': <function ChildPointer.estimatedBestArms>, '__dict__': <attribute '__dict__' of 'ChildPointer' objects>, '__weakref__': <attribute '__weakref__' of 'ChildPointer' objects>})
__module__ = 'PoliciesMultiPlayers.ChildPointer'
__weakref__

list of weak references to the object (if defined)

PoliciesMultiPlayers.DepRound module

DepRound(): implementation of the dependent rounding procedure, from [[Dependent rounding and its applications to approximation algorithms, by R Gandhi, S Khuller, S Parthasarathy, Journal of the ACM, 2006](http://dl.acm.org/citation.cfm?id=1147956)].

It solves the problem of efficiently selecting a set of \(k\) distinct actions from \(\{1,\dots,K\}\), while satisfying the condition that each action \(i\) is selected with probability \(p_i\) exactly.

The distribution \((p_1, \dots, p_K)\) on \(\{1,\dots,K\}\) is assumed to be given.

Dependent rounding, developed by [Gandhi et al.], is a technique that randomly selects a set of edges from a bipartite graph under some cardinality constraints.

  • It runs in \(\mathcal{O}(K)\) space complexity, and at most \(\mathcal{O}(K^2)\) time complexity (note that the article [Uchiya et al., 2010] wrongly claims it is in \(\mathcal{O}(K)\)).
  • References: see also https://www.cs.umd.edu/~samir/grant/jacm06.pdf
PoliciesMultiPlayers.DepRound.DepRound(weights_p, k=1)[source]

[[Algorithms for adversarial bandit problems with multiple plays, by T.Uchiya, A.Nakamura and M.Kudo, 2010](http://hdl.handle.net/2115/47057)] Figure 5 (page 15) is a very clean presentation of the algorithm.

  • Inputs: \(k < K\) and weights_p \(= (p_1, \dots, p_K)\) such that \(\sum_{i=1}^{K} p_i = k\) (or \(= 1\)).
  • Output: A subset of \(\{1,\dots,K\}\) with exactly \(k\) elements. Each action \(i\) is selected with probability exactly \(p_i\).

Example:

>>> import numpy as np; import random
>>> np.random.seed(0); random.seed(0)  # for reproducibility!
>>> K = 5
>>> k = 2
>>> weights_p = [ 2, 2, 2, 2, 2 ]  # all equal weights
>>> DepRound(weights_p, k)
[3, 4]
>>> DepRound(weights_p, k)
[3, 4]
>>> DepRound(weights_p, k)
[0, 1]
>>> weights_p = [ 10, 8, 6, 4, 2 ]  # decreasing weights
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[1, 2]
>>> DepRound(weights_p, k)
[3, 4]
>>> weights_p = [ 3, 3, 0, 0, 3 ]  # weights with some zeros
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 1]
PoliciesMultiPlayers.DepRound.random() → x in the interval [0, 1).
PoliciesMultiPlayers.EstimateM module

EstimateM: generic wrapper on a multi-player decentralized learning policy, to learn on the run the number of players, adapted from rhoEst from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
  • My choice for the threshold function, see threshold_on_t(), does not need the horizon either, and uses \(t\) instead.

Note

This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).

Warning

This is still very experimental!

Note

For a less generic approach, see the policies defined in rhoEst.rhoEst (generalizing rhoRand.rhoRand) and RandTopMEst.RandTopMEst (generalizing RandTopM.RandTopM).
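
The collision-counting rule described above can be sketched as follows (hypothetical class, only an illustration of the idea; the real logic lives in oneEstimateM and uses one of the threshold functions documented below):

import numpy as np

class EstimateOfM:
    """Keep an optimistic estimate of the number of players, increased when collisions pile up."""

    def __init__(self, nbArms, threshold):
        self.nbPlayersEstimate = 1                         # optimistic: start by assuming we are alone
        self.collisionCount = np.zeros(nbArms, dtype=int)  # collisions per arm since the last increase
        self.threshold = threshold                         # e.g. threshold_on_t or threshold_on_t_doubling_trick
        self.t = 0

    def on_collision(self, arm):
        self.collisionCount[arm] += 1
        if self.collisionCount[arm] > self.threshold(self.t, self.nbPlayersEstimate):
            self.nbPlayersEstimate += 1                    # one more player than previously assumed
            self.collisionCount[:] = 0                     # restart the counts after each increase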

PoliciesMultiPlayers.EstimateM.threshold_on_t_with_horizon(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(T, k)\) used as a threshold in rhoEstPlus.

Warning

It requires the horizon \(T\), and does not use the current time \(t\).

Example:

>>> threshold_on_t_with_horizon(1000, 3)  # doctest: +ELLIPSIS
14.287...
>>> threshold_on_t_with_horizon(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
16.357...
PoliciesMultiPlayers.EstimateM.threshold_on_t_doubling_trick(t, nbPlayersEstimate, horizon=None, base=2, min_fake_horizon=1000, T0=1)[source]

A trick to have a threshold depending on a growing horizon (doubling-trick).

  • Instead of using \(t\) or \(T\), a fake horizon \(T_t\) is used, corresponding to the horizon a doubling-trick algorithm would be using at time \(t\).
  • \(T_t = T_0 b^{\lceil \log_b(t) \rceil}\) is the default choice, for \(b=2\) and \(T_0 = 10\).
  • If \(T_t\) is too small, min_fake_horizon is used instead.

Warning

This is ongoing research!

Example:

>>> threshold_on_t_doubling_trick(1000, 3)  # doctest: +ELLIPSIS
14.356...
>>> threshold_on_t_doubling_trick(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
14.356...
PoliciesMultiPlayers.EstimateM.threshold_on_t(t, nbPlayersEstimate, horizon=None)[source]

Function \(\xi(t, k)\) used as a threshold in rhoEst.

  • 0 if nbPlayersEstimate is 0,
  • 1 if nbPlayersEstimate is 1,
  • My heuristic to be any-time (i.e., without needing to know the horizon) is to use a function of \(t\) (current time) and not \(T\) (horizon).
  • The choice which seemed to perform the best in practice was \(\xi(t, k) = c t\) for a small constant \(c\) (like 5 or 10).

Example:

>>> threshold_on_t(1000, 3)  # doctest: +ELLIPSIS
47.730...
>>> threshold_on_t(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
47.730...
class PoliciesMultiPlayers.EstimateM.oneEstimateM(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
__init__(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Number of players. Optimistic: start by assuming it is alone!

collisionCount = None

Count collisions on each arm, since last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

updateNbPlayers(nbPlayers=None)[source]

Change the value of nbPlayersEstimate, and propagate the change to the underlying policy, for parameters called maxRank or nbPlayers.

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

choice()[source]

Pass the call to self._policy.choice() with the player’s ID number.

choiceWithRank(rank=1)[source]

Pass the call to self._policy.choiceWithRank() with the player’s ID number.

choiceFromSubSet(availableArms='all')[source]

Pass the call to self._policy.choiceFromSubSet() with the player’s ID number.

choiceMultiple(nb=1)[source]

Pass the call to self._policy.choiceMultiple() with the player’s ID number.

choiceIMP(nb=1)[source]

Pass the call to self._policy.choiceIMP() with the player’s ID number.

estimatedOrder()[source]

Pass the call to self._policy.estimatedOrder() with the player’s ID number.

estimatedBestArms(M=1)[source]

Pass the call to self._policy.estimatedBestArms() with the player’s ID number.

__module__ = 'PoliciesMultiPlayers.EstimateM'
class PoliciesMultiPlayers.EstimateM.EstimateM(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

EstimateM: a generic wrapper for an efficient multi-players learning policy, with no prior knowledge of the number of players, and using any other MP policy.

__init__(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • nbArms: number of arms.
  • decentralizedPolicy: base MP decentralized policy.
  • threshold: the threshold function to use, see threshold_on_t_with_horizon(), threshold_on_t_doubling_trick() or threshold_on_t() above.
  • policyArgs: named arguments (dictionary), given to decentralizedPolicy.
  • *args, **kwargs: arguments, named arguments, given to decentralizedPolicy (will probably be given to the single-player decentralized policy under the hood, don’t care).

Example:

>>> from Policies import *; from PoliciesMultiPlayers import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 4
>>> nbPlayers = 2
>>> s = EstimateM(nbPlayers, nbArms, rhoRand, UCBalpha, alpha=0.5)
>>> [ child.choice() for child in s.children ]
[0, 3]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.EstimateM'
PoliciesMultiPlayers.OracleFair module

OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.

  • It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
  • And it is perfectly fair on every run: each chosen arm is played successively by each player.
  • Note that it IS assigning players to the best arms: it requires full knowledge of the means of the arms, not simply the number of arms.
  • Note that they need perfect knowledge of the arms, even though this is not physically plausible.
class PoliciesMultiPlayers.OracleFair.CyclingBest(nbArms, offset, bestArms=None)[source]

Bases: PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy

CyclingBest: select an arm among the best ones (bestArms), as (offset + t) % (len(bestArms)), with the offset being decided by the OracleFair multi-player policy.

__init__(nbArms, offset, bestArms=None)[source]

Cycling with an offset.

nbArms = None

Number of arms

offset = None

Offset

bestArms = None

List of index of the best arms to play

nb_bestArms = None

Number of best arms

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Choose the cycling arm.

__module__ = 'PoliciesMultiPlayers.OracleFair'
class PoliciesMultiPlayers.OracleFair.OracleFair(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one takes an orthogonal arm based on (offset + t) % nbArms.

__init__(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]
  • nbPlayers: number of players to create (in self._players).
  • armsMAB: MAB object that represents the arms.

Examples:

>>> import sys; sys.path.insert(0, '..'); from Environment import MAB; from Arms import Bernoulli
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
...
>>> s = OracleFair(2, problem)
>>> [ child.choice() for child in s.children ]
[1, 2]
>>> [ child.choice() for child in s.children ]
[2, 1]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use
nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.OracleFair'
_printNbCollisions()[source]

Print number of collisions.

PoliciesMultiPlayers.OracleNotFair module

OracleNotFair: a multi-player policy with full knowledge and centralized intelligence to assign users to a FIXED arm, among the best arms.

  • It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
  • But it is NOT fair on ONE run: the best arm is played only by one player.
  • Note that on average, it is fair (who plays the best arm is randomly decided).
  • Note that it IS assigning players to the best arms: it requires full knowledge of the means of the arms, not simply the number of arms.
  • Note that they need perfect knowledge of the arms, even though this is not physically plausible.
class PoliciesMultiPlayers.OracleNotFair.Fixed(nbArms, armIndex)[source]

Bases: PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy

Fixed: always select a fixed arm, as decided by the OracleNotFair multi-player policy.

__init__(nbArms, armIndex)[source]

Fixed on this arm.

nbArms = None

Number of arms

armIndex = None

Index of fixed arm

__str__()[source]

Return str(self).

startGame()[source]

Nothing to do.

getReward(arm, reward)[source]

Nothing to do.

choice()[source]

Choose the fixed arm.

__module__ = 'PoliciesMultiPlayers.OracleNotFair'
class PoliciesMultiPlayers.OracleNotFair.OracleNotFair(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

OracleNotFair: a multi-player policy which uses a centralized intelligence to assign users to a FIXED arm, among the best arms.

__init__(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]
  • nbPlayers: number of players to create (in self._players).
  • armsMAB: MAB object that represents the arms.

Examples:

>>> import sys; sys.path.insert(0, '..'); from Environment import MAB; from Arms import Bernoulli
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
...
>>> s = OracleNotFair(2, problem)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[2, 1]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use
nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.OracleNotFair'
_printNbCollisions()[source]

Print number of collisions.

PoliciesMultiPlayers.RandTopM module

RandTopM: four proposals for an efficient multi-players learning policy. RandTopM and MCTopM are the two main algorithms, with variants (see below).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
  • When a collision occurs or when the currently chosen arm lies outside of the current estimate of the M-best set, a new current arm is chosen.

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.
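
A rough sketch of the re-selection rule shared by these policies (hypothetical helper, ignoring the “chair” and the cautious variants described below):

import numpy as np

def randtopm_choice_sketch(indexes, chosen_arm, M, rng=np.random):
    """Keep the current arm if it is still in the estimated M-best set, else re-sample inside it.

    indexes: current indexes g^j_k(t) of the underlying single-player policy,
    chosen_arm: arm currently targeted by this player (None after a collision),
    M: the maxRank given to the algorithm (usually the number of players).
    """
    Mbest = set(np.argsort(indexes)[-M:])               # current estimate of the M best arms
    if chosen_arm is None or chosen_arm not in Mbest:
        chosen_arm = int(rng.choice(sorted(Mbest)))     # uniform re-sample inside the estimate
    return chosen_arm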

PoliciesMultiPlayers.RandTopM.WITH_CHAIR = False

Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the other will move). But she will still change her chosen arm if it lies outside of the estimated M-best. RandTopM (and variants) uses False and MCTopM (and variants) uses True.

PoliciesMultiPlayers.RandTopM.OPTIM_PICK_WORST_FIRST = False

XXX First experimental idea: when the currently chosen arm lies outside of the estimated Mbest set, force to first try (at least once) the arm with lowest UCB indexes in this Mbest_j(t) set. Used by RandTopMCautious and RandTopMExtraCautious, and by MCTopMCautious and MCTopMExtraCautious.

PoliciesMultiPlayers.RandTopM.OPTIM_EXIT_IF_WORST_WAS_PICKED = False

XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Used by RandTopMExtraCautious and MCTopMExtraCautious.

PoliciesMultiPlayers.RandTopM.OPTIM_PICK_PREV_WORST_FIRST = True

XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). This is now the default! False only for RandTopMOld and MCTopMOld.

class PoliciesMultiPlayers.RandTopM.oneRandTopM(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random arm is sampled after observing a collision,
  • And the player does not aim at the best arm, but at one of the M best arms, based on her index policy.
  • (See variants for more details.)
__init__(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different.

chosen_arm = None

Current chosen arm.

sitted = None

Not yet sitted. After 1 step without collision, don’t react to collision (but still react when the chosen arm lies outside M-best).

prevWorst = None

Keep track of the arms that were worse than the chosen one (at the previous time step).

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

Mbest()[source]

Current estimate of the M-best arms. M is the maxRank given to the algorithm.

worst_Mbest()[source]

Index of the worst of the current estimate of the M-best arms. M is the maxRank given to the algorithm.

worst_previous__and__current_Mbest(current_arm)[source]

Return the set from which to select a random arm for MCTopM (the optimization is now the default):

\[\hat{M}^j(t) \cap \{ m : g^j_m(t-1) \leq g^j_k(t-1) \}.\]
handleCollision(arm, reward=None)[source]

Get a new random arm from the current estimate of Mbest, and give reward to the algorithm if not None.

getReward(arm, reward)[source]

Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

choice()[source]

Reconsider the choice of arm, and then use the chosen arm.

  • For all variants, if the chosen arm is no longer in the current estimate of the Mbest set, a new one is selected,
  • The basic RandTopM selects uniformly an arm in estimate Mbest,
  • MCTopM starts by being “non sitted” on its new chosen arm,
  • MCTopMCautious is forced to first try the arm with lowest UCB indexes (or whatever index policy is used).
_index()[source]

Update and return the indexes of the underlying index policy.

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopM(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopM: a proposal for an efficient multi-players learning policy.

__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • withChair: see WITH_CHAIR,
  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,
  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
  • maxRank: maximum rank allowed by the RandTopM child (defaults to nbPlayers, but for instance if there are 2 × RandTopM[UCB] + 2 × RandTopM[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMCautious child (defaults to nbPlayers, but for instance if there are 2 × RandTopMCautious[UCB] + 2 × RandTopMCautious[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMExtraCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMExtraCautious child (defaults to nbPlayers, but for instance if there are 2 × RandTopMExtraCautious[UCB] + 2 × RandTopMExtraCautious[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.RandTopMOld(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

RandTopMOld: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the RandTopMOld child (defaults to nbPlayers, but for instance if there are 2 × RandTopMOld[UCB] + 2 × RandTopMOld[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopM(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopM: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopM child (defaults to nbPlayers, but for instance if there are 2 × MCTopM[UCB] + 2 × MCTopM[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopMCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMCautious child (defaults to nbPlayers, but for instance if there are 2 × MCTopMCautious[UCB] + 2 × MCTopMCautious[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopM'
class PoliciesMultiPlayers.RandTopM.MCTopMExtraCautious(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMExtraCautious child (defaults to nbPlayers, but for instance if there are 2 × MCTopMExtraCautious[UCB] + 2 × MCTopMExtraCautious[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__module__ = 'PoliciesMultiPlayers.RandTopM'
__str__()[source]

Return str(self).

class PoliciesMultiPlayers.RandTopM.MCTopMOld(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.RandTopM

MCTopMOld: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.

Warning

Still very experimental! But it seems to be one of the most efficient decentralized MP algorithms we have so far… The two other variants of MCTopM seem even better!

__module__ = 'PoliciesMultiPlayers.RandTopM'
__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the MCTopMOld child (defaults to nbPlayers, but for instance if there are 2 × MCTopMOld[UCB] + 2 × MCTopMOld[klUCB], maxRank should be 4 not 2).
  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
__str__()[source]

Return str(self).

PoliciesMultiPlayers.RandTopMEst module

RandTopMEst: four proposals for an efficient multi-players learning policy. RandTopMEst and MCTopMEst are the two main algorithms, with variants (see below).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
  • When a collision occurs or when the currently chosen arm lies outside of the current estimate of the M-best set, a new current arm is chosen.
  • The (fixed) number of players is learned on the run.

Note

This is fully decentralized: players do not need to know the (fixed) number of players!

Warning

This is still very experimental!

Note

For a more generic approach, see the wrapper defined in EstimateM.EstimateM.
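
As a rough illustration of the estimation rule mentioned above, here is a minimal sketch (not the SMPyBandits code; it only assumes a generic threshold function (t, currentEstimate) -> float, like the EstimateM.threshold_on_t* functions) of how one player could grow its estimate of the number of players:

# Minimal sketch (not the library code): increase the estimated number of
# players when collisions have been "too frequent" for the current estimate.
def update_estimate_on_collision(t, nbPlayersEstimate, collisionCount, threshold):
    collisionCount += 1
    # If more collisions were seen than the threshold tolerates, assume one more player.
    if collisionCount > threshold(t, nbPlayersEstimate):
        nbPlayersEstimate += 1
        collisionCount = 0   # restart the count for the new estimate
    return nbPlayersEstimate, collisionCount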

class PoliciesMultiPlayers.RandTopMEst.oneRandTopMEst(threshold, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.oneRandTopM

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically every player starts with \(\hat{M}_i(0) = 1\), and when colliding sets \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (governed by a somewhat complicated threshold).
__init__(threshold, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Number of players. Optimistic: start by assuming it is alone!

collisionCount = None

Count collisions on each arm, since last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
PoliciesMultiPlayers.RandTopMEst.WITH_CHAIR = False

Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the others will move). But she will still change her chosen arm if it lies outside of the estimated M-best set. RandTopMEst (and variants) uses False and MCTopMEst (and variants) uses True.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_WORST_FIRST = False

XXX First experimental idea: when the currently chosen arm lies outside of the estimated Mbest set, force the player to first try (at least once) the arm with the lowest UCB index in this Mbest_j(t) set. Used by RandTopMEstCautious and RandTopMEstExtraCautious, and by MCTopMEstCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_EXIT_IF_WORST_WAS_PICKED = False

XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Used by RandTopMEstExtraCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_PREV_WORST_FIRST = True

XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). This is now the default. It is False only for RandTopMEstOld and MCTopMEstOld.
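
For instance, the three flags above map directly to keyword arguments of the RandTopMEst class below. A hedged usage sketch (the import paths are assumed from the package layout; not a verbatim doctest): an MCTopMEst-like configuration is obtained from RandTopMEst by enabling the “chair”, and the Cautious/ExtraCautious-like variants by the two OPTIM_* flags:

from Policies import UCB
from PoliciesMultiPlayers import RandTopMEst

nbArms, nbPlayers = 17, 6
s = RandTopMEst(nbPlayers, nbArms, UCB,
                withChair=True,             # cf. WITH_CHAIR
                pickWorstFirst=True,        # cf. OPTIM_PICK_WORST_FIRST
                exitIfWorstWasPicked=True)  # cf. OPTIM_EXIT_IF_WORST_WAS_PICKED
players = s.children   # usable child players, as documented below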

class PoliciesMultiPlayers.RandTopMEst.RandTopMEst(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEst: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • withChair: see WITH_CHAIR,
  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,
  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
  • threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMEst(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

nbPlayers = None

Number of players

withChair = None

Using a chair ?

pickWorstFirst = None

Using first optimization ?

exitIfWorstWasPicked = None

Using second optimization ?

pickPrevWorstFirst = None

Using third optimization ? Default to yes now.

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEstPlus: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • horizon: need to know the horizon \(T\).
  • withChair: see WITH_CHAIR,
  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,
  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,
  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,
  • threshold: the threshold function to use, see threshold_on_t_with_horizon() or threshold_on_t() above.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = RandTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

nbPlayers = None

Number of players

withChair = None

Using a chair ?

pickWorstFirst = None

Using first optimization ?

exitIfWorstWasPicked = None

Using second optimization ?

pickPrevWorstFirst = None

Using third optimization ? Default to yes now.

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.MCTopMEst(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEst

MCTopMEst: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.
__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__str__()[source]

Return str(self).

class PoliciesMultiPlayers.RandTopMEst.MCTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus

MCTopMEstPlus: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__init__(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.
__str__()[source]

Return str(self).

PoliciesMultiPlayers.Scenario1 module

Scenario1: make a set of M experts with the following behavior, for K = 2 arms: at every round, one of them is chosen uniformly to predict arm 0, and the rest predict 1.

  • Reference: Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2011, April). Contextual Bandit Algorithms with Supervised Learning Guarantees. In AISTATS (pp. 19-26).
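
As a small illustration of this behavior, here is a minimal sketch (not the OneScenario1 code) of the predictions produced at one round, for K = 2 arms:

import random

# Minimal sketch of Scenario1: at each round, one of the M experts is drawn
# uniformly and predicts arm 0, all the others predict arm 1.
def scenario1_predictions(M):
    chosen = random.randrange(M)
    return [0 if j == chosen else 1 for j in range(M)]

print(scenario1_predictions(5))   # e.g. [1, 1, 0, 1, 1]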
class PoliciesMultiPlayers.Scenario1.OneScenario1(mother, playerId)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

OneScenario1: at every round, one of them is chosen uniformly to predict arm 0, and the rest predict 1.

__init__(mother, playerId)[source]

Initialize self. See help(type(self)) for accurate signature.

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

__module__ = 'PoliciesMultiPlayers.Scenario1'
class PoliciesMultiPlayers.Scenario1.Scenario1(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

Scenario1: make a set of M experts with the following behavior, for K = 2 arms: at every round, one of them is chosen uniformly to predict arm 0, and the rest predict 1.

  • Reference: Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2011, April). Contextual Bandit Algorithms with Supervised Learning Guarantees. In AISTATS (pp. 19-26).
__init__(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]
  • nbPlayers: number of players to create (in self._players).

Examples:

>>> s = Scenario1(10, 2)
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use
__str__()[source]

Return str(self).

_startGame_one(playerId)[source]

Forward the call to self._players[playerId].

_getReward_one(playerId, arm, reward)[source]

Forward the call to self._players[playerId].

_choice_one(playerId)[source]

Forward the call to self._players[playerId].

__module__ = 'PoliciesMultiPlayers.Scenario1'
PoliciesMultiPlayers.Selfish module

Selfish: a multi-player policy where every player is selfish, playing on their side.

  • without knowing how many players there are,
  • and not even knowing that they should try to avoid collisions. When a collision happens, the algorithm simply receives a 0 reward for the chosen arm.
class PoliciesMultiPlayers.Selfish.SelfishChildPointer(mother, playerId)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Selfish version of the ChildPointer class (just pretty printed).

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.Selfish'
PoliciesMultiPlayers.Selfish.PENALTY = None

Customize here the value given to a user after a collision. XXX If it is None, then player.lower (defaults to 0) is used instead.

class PoliciesMultiPlayers.Selfish.Selfish(nbPlayers, nbArms, playerAlgo, penalty=None, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

Selfish: a multi-player policy where every player is selfish, playing on their side.

  • without knowing how many players there are, and
  • not even knowing that they should try to avoid collisions. When a collision happens, the algorithm simply receives a 0 reward for the chosen arm (can be changed with the penalty= argument).
__init__(nbPlayers, nbArms, playerAlgo, penalty=None, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • penalty: the reward given on a collision (see PENALTY above).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Examples:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = Selfish(nbPlayers, nbArms, Uniform)
>>> [ child.choice() for child in s.children ]
[12, 13, 1, 8, 16, 15]
>>> [ child.choice() for child in s.children ]
[12, 9, 15, 11, 6, 16]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!

Warning

I want my code to stay compatible with Python 2, so I cannot use the new syntax of keyword-only arguments. It would make more sense to have *args, penalty=PENALTY, lower=0., amplitude=1., **kwargs instead of penalty=PENALTY, *args, **kwargs, but I can’t.
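
A hedged usage sketch (import paths assumed from the package layout): the penalty= argument can be used to give a negative reward on collisions instead of the default 0 (or player.lower):

from Policies import UCB
from PoliciesMultiPlayers import Selfish

nbArms, nbPlayers = 17, 6
s = Selfish(nbPlayers, nbArms, UCB, penalty=-1)   # -1 reward on every collision
players = s.children   # usable child players, as documented above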

nbPlayers = None

Number of players

penalty = None

Penalty = reward given in case of collision

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

_handleCollision_one(playerId, arm, reward=None)[source]

Give a reward of 0, or player.lower, or self.penalty, in case of collision.

__module__ = 'PoliciesMultiPlayers.Selfish'
PoliciesMultiPlayers.rhoCentralized module

rhoCentralized: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • Every player has rank_i = i + 1, as given by the base station.

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players, and an initial orthogonal configuration.

Warning

This policy is NOT efficient at ALL! Don’t use it! It seems a smart idea, but it’s not.

class PoliciesMultiPlayers.rhoCentralized.oneRhoCentralized(maxRank, mother, playerId, rank=None, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, mother, playerId, rank=None, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

keep_the_same_rank = None

If True, the rank is kept constant during the game, as if it was given by the Base Station

rank = None

Current rank, starting at 1 by default, or at rank if given as an argument

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new fully random rank, and give reward to the algorithm if not None.

choice()[source]

Choose with the current rank.

__module__ = 'PoliciesMultiPlayers.rhoCentralized'
class PoliciesMultiPlayers.rhoCentralized.rhoCentralized(nbPlayers, nbArms, playerAlgo, maxRank=None, orthogonalRanks=True, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

rhoCentralized: implementation of a variant of the multi-player rhoRand policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, orthogonalRanks=True, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the rhoCentralized child (defaults to nbPlayers, but for instance if there are 2 × rhoCentralized[UCB] + 2 × rhoCentralized[klUCB], maxRank should be 4, not 2).
  • orthogonalRanks: if True, orthogonal ranks 1..M are directly assigned to the players 1..M.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoCentralized(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

orthogonalRanks = None

Using orthogonal ranks from starting

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoCentralized'
PoliciesMultiPlayers.rhoEst module

rhoEst: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on \([1, \dots, \hat{M}_i(t)]\) where \(\hat{M}_i(t)\) is the current estimate of the number of players by player i,
  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically every player starts with \(\hat{M}_i(0) = 1\), and when colliding sets \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (governed by a somewhat complicated threshold).
  • My choice for the threshold function, see threshold_on_t(), does not need the horizon either, and uses \(t\) instead.

Note

This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).

Note

For a more generic approach, see the wrapper defined in EstimateM.EstimateM.
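
A hedged sketch of what the collision handling described above amounts to (not the actual oneRhoEst code): after a collision, the rank is re-drawn uniformly in \(\{1, \dots, \hat{M}_i(t)\}\), where the estimate \(\hat{M}_i(t)\) itself may just have been increased by the threshold rule of EstimateM:

import random

# Minimal sketch (not the library code): re-sample the rank after a collision,
# uniformly among {1, .., current estimate of the number of players}.
def resample_rank(nbPlayersEstimate):
    return random.randint(1, nbPlayersEstimate)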

class PoliciesMultiPlayers.rhoEst.oneRhoEst(threshold, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy,
  • The rhoEst policy is used to keep an estimate on the total number of players, \(\hat{M}_i(t)\).
  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically every player starts with \(\hat{M}_i(0) = 1\), and when colliding sets \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (governed by a somewhat complicated threshold).
__init__(threshold, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Number of players. Optimistic: start by assuming it is alone!

rank = None

Current rank, starting at 1

collisionCount = None

Count collisions on each arm, since last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

__module__ = 'PoliciesMultiPlayers.rhoEst'
class PoliciesMultiPlayers.rhoEst.rhoEst(nbPlayers, nbArms, playerAlgo, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoEst: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoEst(nbPlayers, nbArms, UCB, threshold=threshold_on_t)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoEst'
class PoliciesMultiPlayers.rhoEst.rhoEstPlus(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoEstPlus: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • horizon: need to know the horizon \(T\).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = rhoEstPlus(nbPlayers, nbArms, UCB, horizon=horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__module__ = 'PoliciesMultiPlayers.rhoEst'
__str__()[source]

Return str(self).

PoliciesMultiPlayers.rhoLearn module

rhoLearn: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players.
  • If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy, it might be better! Warning: no theoretical guarantees exist!
  • Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.

PoliciesMultiPlayers.rhoLearn.CHANGE_RANK_EACH_STEP = False

Should oneRhoLearn players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.
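
The rank-learning idea described above can be summarized by a minimal sketch (assuming only the standard choice()/getReward() interface of the single-player policies; this is not the oneRhoLearn code): each player keeps a second bandit algorithm over the ranks \(\{1, \dots, M\}\), rewarded 1 when its current rank caused no collision and 0 when it did.

# Minimal sketch of the "learn the rank" idea (not the library code).
class RankLearnerSketch:
    def __init__(self, rank_policy):
        self.rank_policy = rank_policy     # any index policy over M "arms" = ranks
        self.rank = 1

    def on_no_collision(self):
        self.rank_policy.getReward(self.rank - 1, 1)   # good rank: reward 1

    def on_collision(self):
        self.rank_policy.getReward(self.rank - 1, 0)   # bad rank: reward 0
        self.rank = 1 + self.rank_policy.choice()      # pick a (possibly new) rank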

class PoliciesMultiPlayers.rhoLearn.oneRhoLearn(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
  • When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

rank = None

Current rank, starting at 1

change_rank_each_step = None

Change rank at each step?

__str__()[source]

Return str(self).

startGame()[source]

Initialize both rank and arm selection algorithms.

getReward(arm, reward)[source]

Give a 1 reward to the rank selection algorithm (no collision), give reward to the arm selection algorithm, and if self.change_rank_each_step, select a (possibly new) rank.

handleCollision(arm, reward=None)[source]

Give a 0 reward to the rank selection algorithm, and select a (possibly new) rank.

__module__ = 'PoliciesMultiPlayers.rhoLearn'
class PoliciesMultiPlayers.rhoLearn.rhoLearn(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, maxRank=None, change_rank_each_step=False, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoLearn: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.

__init__(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, maxRank=None, change_rank_each_step=False, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • rankSelectionAlgo: algorithm to use for selecting the ranks.
  • maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> stickyTime = 5
>>> s = rhoLearn(nbPlayers, nbArms, UCB, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

rankSelectionAlgo = None

Policy to use to choose the ranks

nbArms = None

Number of arms

change_rank_each_step = None

Change rank at every step?

__module__ = 'PoliciesMultiPlayers.rhoLearn'
__str__()[source]

Return str(self).

PoliciesMultiPlayers.rhoLearnEst module

rhoLearnEst: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank, and without knowing the number of users.

  • It generalizes PoliciesMultiPlayers.rhoLearn.rhoLearn simply by letting the ranks be in \(\{1,\dots,K\}\) rather than \(\{1,\dots,M\}\), hoping that the learning algorithm will be “smart enough” to learn by itself that ranks should be \(\leq M\).
  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players.
  • If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy, it might be better! Warning: no theoretical guarantees exist!
  • Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)

Note

This is fully decentralized: each child player does not need to know the (fixed) number of players; it will learn to select ranks only in \(\{1,\dots,M\}\) instead of \(\{1,\dots,K\}\).

Warning

This policy does not work very well!

class PoliciesMultiPlayers.rhoLearnEst.oneRhoLearnEst(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoLearn.oneRhoLearn

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoLearnEst'
class PoliciesMultiPlayers.rhoLearnEst.rhoLearnEst(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, change_rank_each_step=False, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoLearn.rhoLearn

rhoLearnEst: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank, and without knowing the number of users.

__init__(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, change_rank_each_step=False, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • rankSelectionAlgo: algorithm to use for selecting the ranks.
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Difference with PoliciesMultiPlayers.rhoLearn.rhoLearn:

  • maxRank: maximum rank allowed by the rhoRand child, is not an argument, but it is always nbArms (= K).

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnEst(nbPlayers, nbArms, UCB, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
nbPlayers = None

Number of players

children = None

List of children, fake algorithms

rankSelectionAlgo = None

Policy to use to choose the ranks

nbArms = None

Number of arms

change_rank_each_step = None

Change rank at every step?

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoLearnEst'
PoliciesMultiPlayers.rhoLearnExp3 module

rhoLearnExp3: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using the Exp3 learning algorithm instead of a random exploration for choosing the rank.

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players.
  • If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy (like Exp3 here), it might be better! Warning: no theoretical guarantees exist!
  • Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.

For the Exp3 algorithm:

PoliciesMultiPlayers.rhoLearnExp3.binary_feedback(sensing, collision)[source]

Count 1 iff the sensing authorized to communicate and no collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= r_{j,t} = F_{m,t} \times (1 - c_{m,t}), \\ \text{where}\;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
PoliciesMultiPlayers.rhoLearnExp3.ternary_feedback(sensing, collision)[source]

Count 1 iff the sensing authorized to communicate and no collision was observed, 0 if no communication, and -1 iff communication but a collision was observed.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= F_{m,t} \times (2 r_{m,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{m,t} \times (1 - c_{m,t}), \\ \text{and} \;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
PoliciesMultiPlayers.rhoLearnExp3.generic_ternary_feedback(sensing, collision, bonus=1, malus=-1)[source]

Count ‘bonus’ iff the sensing authorized to communicate and no collision was observed, ‘malus’ iff communication but a collision was observed, and 0 if no communication.

PoliciesMultiPlayers.rhoLearnExp3.make_generic_ternary_feedback(bonus=1, malus=-1)[source]
PoliciesMultiPlayers.rhoLearnExp3.generic_continuous_feedback(sensing, collision, bonus=1, malus=-1)[source]

Count ‘bonus’ iff the sensing authorized to communicate and no collision was observed, ‘malus’ iff communication but a collision was observed, but possibly does not count 0 if no communication.

\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= \mathrm{malus} + (\mathrm{bonus} - \mathrm{malus}) \times \frac{r'_{j,t} + 1}{2}, \\ \text{where}\;\; r'_{j,t} &:= F_{m,t} \times (2 r_{m,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{m,t} \times (1 - c_{m,t}), \\ \text{and} \;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
PoliciesMultiPlayers.rhoLearnExp3.make_generic_continuous_feedback(bonus=1, malus=-1)[source]
PoliciesMultiPlayers.rhoLearnExp3.reward_from_decoupled_feedback(sensing, collision)

Decide the default function to use. FIXME try all of them!
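
The decoupled feedback maps above can be spelled out explicitly. A minimal sketch of the formulas (not the library functions binary_feedback and ternary_feedback themselves), enumerating the four (sensing, collision) cases:

# Minimal sketch of the formulas above.
def binary_feedback_sketch(sensing, collision):
    return sensing * (1 - collision)       # 1 iff the channel was free and no collision

def ternary_feedback_sketch(sensing, collision):
    r = sensing * (1 - collision)
    return sensing * (2 * r - 1)           # +1 success, -1 collision, 0 if channel busy

for F in (0, 1):
    for c in (0, 1):
        print(F, c, binary_feedback_sketch(F, c), ternary_feedback_sketch(F, c))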

PoliciesMultiPlayers.rhoLearnExp3.CHANGE_RANK_EACH_STEP = False

Should oneRhoLearnExp3 players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.

class PoliciesMultiPlayers.rhoLearnExp3.oneRhoLearnExp3(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
  • When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

rank = None

Current rank, starting at 1

change_rank_each_step = None

Change rank at each step?

feedback_function = None

Feedback function: (sensing, collision) -> reward

__str__()[source]

Return str(self).

startGame()[source]

Initialize both rank and arm selection algorithms.

getReward(arm, reward)[source]

Give a “good” reward to the rank selection algorithm (no collision), give reward to the arm selection algorithm, and if self.change_rank_each_step, select a (possibly new) rank.

handleCollision(arm, reward)[source]

Give a “bad” reward to the rank selection algorithm, and select a (possibly new) rank.

__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'
class PoliciesMultiPlayers.rhoLearnExp3.rhoLearnExp3(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Exp3.Exp3Decreasing'>, maxRank=None, change_rank_each_step=False, feedback_function=<function binary_feedback>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoLearnExp3: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.

__init__(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Exp3.Exp3Decreasing'>, maxRank=None, change_rank_each_step=False, feedback_function=<function binary_feedback>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • rankSelectionAlgo: algorithm to use for selecting the ranks.
  • maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnExp3(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[0, 1, 9, 0, 10, 3]
>>> [ child.choice() for child in s.children ]
[11, 2, 0, 0, 4, 5]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

rankSelectionAlgo = None

Policy to use to choose the ranks

nbArms = None

Number of arms

change_rank_each_step = None

Change rank at every step?

__module__ = 'PoliciesMultiPlayers.rhoLearnExp3'
__str__()[source]

Return str(self).

PoliciesMultiPlayers.rhoRand module

rhoRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.
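
The “aim at the rank_i-th best arm” rule can be illustrated with a minimal sketch (assuming a vector of indexes \(g_k(t)\) for one player; this is not the oneRhoRand code):

import numpy as np

# Minimal sketch: rank 1 targets the arm with the largest index, rank 2 the
# second-largest index, and so on.
def choice_with_rank(indexes, rank):
    order = np.argsort(indexes)            # arms sorted by increasing index
    return order[-rank]                    # the rank-th largest index

indexes = np.array([0.2, 0.9, 0.5, 0.7])
print(choice_with_rank(indexes, 1))        # 1: the best arm
print(choice_with_rank(indexes, 2))        # 3: the second best arm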

class PoliciesMultiPlayers.rhoRand.oneRhoRand(maxRank, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

rank = None

Current rank, starting at 1 by default

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new fully random rank, and give reward to the algorithm if not None.

choice()[source]

Choose with the current rank.

__module__ = 'PoliciesMultiPlayers.rhoRand'
class PoliciesMultiPlayers.rhoRand.rhoRand(nbPlayers, nbArms, playerAlgo, maxRank=None, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

rhoRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRand(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoRand'
PoliciesMultiPlayers.rhoRandALOHA module

rhoRandALOHA: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
  • The only difference with rhoRand is that when colliding, users have a small chance of keeping the same rank, following a Bernoulli experiment: with probability \(p(t)\) it keeps the same rank, and with probability \(1 - p(t)\) it changes its rank (uniformly in \(\{1,\dots,M\}\), so it might draw the same rank again? FIXME).
  • There is also a variant, as in MEGA (an ALOHA-like protocol), where the probability changes over time: p(t+1) = alpha p(t) + (1 - alpha).

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.

PoliciesMultiPlayers.rhoRandALOHA.new_rank(rank, maxRank, forceChange=False)[source]

Return a new rank, from \(1, \dots, \mathrm{maxRank}\), different from rank, drawn uniformly.

  • Internally, it uses a simple rejection sampling: keep drawing a new rank \(\sim U(\{1, \dots, \mathrm{maxRank}\})\) until it is different from rank (not the most efficient way to do it, but simpler).

Example:

>>> from random import seed; seed(0)  # reproducibility
>>> [ new_rank(1, 8, False) for _ in range(10) ]
[7, 7, 1, 5, 9, 8, 7, 5, 8, 6]
>>> [ new_rank(8, 8, False) for _ in range(10) ]
[4, 9, 3, 5, 3, 2, 5, 9, 3, 5]

Example with forceChange = True, where the new rank is forced to be different from the current one.

>>> [ new_rank(1, 8, True) for _ in range(10) ]
[2, 2, 6, 8, 9, 2, 6, 7, 6, 4]
>>> [ new_rank(5, 8, True) for _ in range(10) ]
[9, 8, 8, 9, 1, 9, 1, 2, 7, 1]
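
A minimal sketch of this rejection sampling (not the exact new_rank implementation): draw uniformly in \(\{1, \dots, \mathrm{maxRank}\}\) and, if forceChange, redraw until the result differs from the current rank.

import random

def new_rank_sketch(rank, maxRank, forceChange=False):
    r = random.randint(1, maxRank)          # uniform in {1, .., maxRank}
    while forceChange and r == rank:
        r = random.randint(1, maxRank)      # reject and redraw if unchanged
    return r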
class PoliciesMultiPlayers.rhoRandALOHA.oneRhoRandALOHA(maxRank, p0, alpha_p0, forceChange, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, p0, alpha_p0, forceChange, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

p0 = None

Initial probability, should not be modified.

p = None

Current probability of staying with the current rank after a collision. If 0, then it is like the initial rhoRand policy.

alpha_p0 = None

Parameter alpha for the recurrence equation for probability p(t)

rank = None

Current rank, starting at 1 by default

forceChange = None

Should a different rank be used when moving? Or not.

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new fully random rank, and give reward to the algorithm if not None.

getReward(arm, reward)[source]

Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

  • Additionally, if the current rank was good enough to not bring any collision during the last p0 time steps, the player “sits” on that rank.
__module__ = 'PoliciesMultiPlayers.rhoRandALOHA'
PoliciesMultiPlayers.rhoRandALOHA.P0 = 0.0

Default value for P0. Ideally, it should be of order 1/(K*M), where M is the number of players.

PoliciesMultiPlayers.rhoRandALOHA.ALPHA_P0 = 0.9999

Default value for ALPHA_P0. FIXME I have no idea what the best possible choice can be!

PoliciesMultiPlayers.rhoRandALOHA.FORCE_CHANGE = False

Default value for forceChange. Logically, it should be True.

class PoliciesMultiPlayers.rhoRandALOHA.rhoRandALOHA(nbPlayers, nbArms, playerAlgo, p0=None, alpha_p0=0.9999, forceChange=False, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoRandALOHA: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, p0=None, alpha_p0=0.9999, forceChange=False, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • p0: given to the oneRhoRandALOHA objects (see above).
  • alpha_p0: given to the oneRhoRandALOHA objects (see above).
  • forceChange: given to the oneRhoRandALOHA objects (see above).
  • maxRank: maximum rank allowed by the rhoRandALOHA child (defaults to nbPlayers, but for instance if there are 2 × rhoRandALOHA[UCB] + 2 × rhoRandALOHA[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> p0, alpha_p0, forceChange = 0.6, 0.5, True
>>> s = rhoRandALOHA(nbPlayers, nbArms, UCB, p0, alpha_p0, forceChange)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

p0 = None

Initial value for p, current probability of staying with the current rank after a collision

alpha_p0 = None

Parameter alpha for the recurrence equation for probability p(t)

forceChange = None

Should a different rank be used when moving? Or not.

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoRandALOHA'
PoliciesMultiPlayers.rhoRandRand module

rhoRandRand: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the k-th best arm, for k again uniformly drawn from [1, …, rank_i],
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, …, M] where M is the number of players.

Note

This algorithm is intended to be stupid! It does not work at all!!

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.

class PoliciesMultiPlayers.rhoRandRand.oneRhoRandRand(maxRank, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.ChildPointer.ChildPointer

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

rank = None

Current rank, starting at 1

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new rank.

choice()[source]

Choose with a RANDOM rank.

__module__ = 'PoliciesMultiPlayers.rhoRandRand'
class PoliciesMultiPlayers.rhoRandRand.rhoRandRand(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, maxRank=None, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

rhoRandRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, maxRank=None, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the rhoRand child (defaults to nbPlayers, but for instance if there are 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRandRand(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

nbArms = None

Number of arms

children = None

List of children, fake algorithms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoRandRand'
PoliciesMultiPlayers.rhoRandRotating module

rhoRandRotating: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
  • The only difference with rhoRand is that at every time step, the rank is increased by 1, cycling in [1, .., M].

Note

This is not fully decentralized: each child player needs to know the (fixed) number of players.

class PoliciesMultiPlayers.rhoRandRotating.oneRhoRandRotating(maxRank, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

rank = None

Current rank, starting at 1 by default

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new fully random rank, and give reward to the algorithm if not None.

choice()[source]

Choose with the new rank, then update the rank:

\[\mathrm{rank}_j(t+1) := \mathrm{rank}_j(t) + 1 \;\mathrm{mod}\; M.\]
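In 1-based notation, keeping the rank in \(\{1, \dots, \mathrm{maxRank}\}\), the cyclic update above can be written as follows (a minimal sketch, not the library code):

# Minimal sketch of the cyclic rank update: 1 -> 2 -> ... -> maxRank -> 1 -> ...
def next_rank(rank, maxRank):
    return (rank % maxRank) + 1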
__module__ = 'PoliciesMultiPlayers.rhoRandRotating'
class PoliciesMultiPlayers.rhoRandRotating.rhoRandRotating(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoRandRotating: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • maxRank: maximum rank allowed by the rhoRandRotating child (defaults to nbPlayers, but for instance if there are 2 × rhoRandRotating[UCB] + 2 × rhoRandRotating[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: positional and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRandRotating(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.
  • Warning: s._players is for internal use ONLY!
maxRank = None

Max rank, usually nbPlayers but can be different

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoRandRotating'
PoliciesMultiPlayers.rhoRandSticky module

rhoRandSticky: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
  • But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
  • At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
  • The only difference with rhoRand is that once a player has selected a rank and has not encountered a collision for STICKY_TIME time steps, she never changes her rank again. rhoRand corresponds to STICKY_TIME = +oo, MusicalChair is something like STICKY_TIME = 1, and this variant rhoRandSticky takes it as a parameter.

Note

This is not fully decentralized, as each child player needs to know the (fixed) number of players.
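A minimal sketch of this sticky rule, written as a standalone helper (hypothetical: the actual oneRhoRandSticky class stores rank, sitted and stepsWithoutCollisions as attributes):

import random

def update_sticky_rank(rank, sitted, stepsWithoutCollisions, collision, stickyTime, maxRank):
    """Return the updated (rank, sitted, stepsWithoutCollisions) triple after one time step."""
    if sitted:                                   # once sitting, the rank never changes again
        return rank, True, stepsWithoutCollisions
    if collision:                                # collision: resample a new uniform rank and reset the counter
        return random.randint(1, maxRank), False, 0
    stepsWithoutCollisions += 1                  # one more collision-free step with this rank
    return rank, stepsWithoutCollisions >= stickyTime, stepsWithoutCollisions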

PoliciesMultiPlayers.rhoRandSticky.STICKY_TIME = 10

Default value for STICKY_TIME

class PoliciesMultiPlayers.rhoRandSticky.oneRhoRandSticky(maxRank, stickyTime, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.oneRhoRand

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • Except for the handleCollision method: a new random rank is sampled after observing a collision,
  • And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
__init__(maxRank, stickyTime, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

maxRank = None

Max rank, usually nbPlayers but can be different

stickyTime = None

Number of time steps needed without collisions before sitting (never changing rank again)

rank = None

Current rank, starting at 1 by default

sitted = None

Not sitting yet. After stickyTime steps without collisions, the player sits and never changes her rank again.

stepsWithoutCollisions = None

Number of steps since we chose that rank without seeing any collision. As soon as this gets greater than stickyTime, the player sits.

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Get a new fully random rank, and give reward to the algorithm if not None.

getReward(arm, reward)[source]

Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

  • Additionally, if the current rank was good enough to not bring any collision during the last stickyTime time steps, the player “sits” on that rank.
__module__ = 'PoliciesMultiPlayers.rhoRandSticky'
class PoliciesMultiPlayers.rhoRandSticky.rhoRandSticky(nbPlayers, nbArms, playerAlgo, stickyTime=10, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.rhoRand.rhoRand

rhoRandSticky: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).

__init__(nbPlayers, nbArms, playerAlgo, stickyTime=10, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).
  • playerAlgo: class to use for every player.
  • nbArms: number of arms, given as first argument to playerAlgo.
  • stickyTime: given to the oneRhoRandSticky objects (see above).
  • maxRank: maximum rank allowed by the rhoRandSticky child (defaults to nbPlayers, but for instance if there are 2 × rhoRandSticky[UCB] + 2 × rhoRandSticky[klUCB], maxRank should be 4, not 2).
  • *args, **kwargs: arguments and keyword arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> stickyTime = 5
>>> s = rhoRandSticky(nbPlayers, nbArms, UCB, stickyTime=stickyTime)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

maxRank = None

Max rank, usually nbPlayers but can be different

stickyTime = None

Number of time steps needed without collisions before sitting (never changing rank again)

nbPlayers = None

Number of players

children = None

List of children, fake algorithms

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.rhoRandSticky'
PoliciesMultiPlayers.with_proba module

Simply defines a function with_proba() that is used everywhere.

PoliciesMultiPlayers.with_proba.with_proba(epsilon)[source]

Bernoulli test, with probability \(\varepsilon\), return True, and with probability \(1 - \varepsilon\), return False.

Example:

>>> from random import seed; seed(0)  # reproductible
>>> with_proba(0.5)
False
>>> with_proba(0.9)
True
>>> with_proba(0.1)
False
>>> if with_proba(0.2):
...     print("This happens 20% of the time.")
PoliciesMultiPlayers.with_proba.random() → x in the interval [0, 1).
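For reference, the behaviour of with_proba(epsilon) can be implemented in one line (a sketch, assuming the standard random module):

from random import random

def with_proba(epsilon):
    """Return True with probability epsilon, and False with probability 1 - epsilon."""
    return random() < epsilon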

complete_tree_exploration_for_MP_bandits module

Experimental code to perform complete tree exploration for Multi-Player bandits.

Algorithms:

  • Support Selfish 0-greedy, UCB, and klUCB in 3 different variants.
  • Also support RhoRand, RandTopM and MCTopM, even though they are not memory-less, by using another state representation (inlining the memory of each player, e.g., the ranks for RhoRand).

Features:

  • For the means of each arm, \(\mu_1, \dots, \mu_K\), this script can use exact formal computations with sympy, fractions with Fraction, or floating-point numbers.
  • The graph can contain all nodes from the root to the leaves, or only the leaves (with summed probabilities), and possibly only the absorbing nodes are shown.
  • Supports exporting the tree to a GraphViz dot graph, and saving it to SVG/PNG, LaTeX (with TikZ), PDF, etc.
  • By default, the root is highlighted in green and the absorbing nodes are in red.

Warning

I still have to fix these issues:

  • TODO: right now it is not very efficient; could it be improved? I don’t think I can do anything much smarter in pure Python.

Requirements:

  • ‘sympy’ module to use formal means \(\mu_1, \dots, \mu_K\) instead of numbers,
  • ‘numpy’ module for computations on indexes (e.g., np.where),
  • ‘graphviz’ module to generate the graph and save it,
  • ‘dot2tex’ module to generate nice LaTeX (with Tikz) graph and save it to PDF.

Note

The ‘dot2tex’ module only supports Python 2. However, I maintain an unpublished port of ‘dot2tex’ to Python 3, see [here](https://github.com/Naereen/dot2tex), which you can download and install manually (sudo python3 setup.py install) to get ‘dot2tex’ for Python 3 as well.


complete_tree_exploration_for_MP_bandits.oo = inf

Shortcut for float(‘+inf’).

complete_tree_exploration_for_MP_bandits.PLOT_DIR = 'plots/trees'

Directory for the plots

complete_tree_exploration_for_MP_bandits.tupleit1(anarray)[source]

Convert a non-hashable 1D numpy array to a hashable tuple.

complete_tree_exploration_for_MP_bandits.tupleit2(anarray)[source]

Convert a non-hashable 2D numpy array to a hashable tuple-of-tuples.

complete_tree_exploration_for_MP_bandits.prod(iterator)[source]

Product of the values in this iterator.
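Minimal sketches of these three helpers (tupleit1, tupleit2 and prod), assuming numpy arrays as inputs; illustrative only, not necessarily the exact code:

from functools import reduce
import operator

def tupleit1(anarray):
    """1D numpy array -> hashable tuple."""
    return tuple(anarray)

def tupleit2(anarray):
    """2D numpy array -> hashable tuple of tuples."""
    return tuple(tuple(row) for row in anarray)

def prod(iterator):
    """Product of the values of an iterator (multiplicative analogue of sum)."""
    return reduce(operator.mul, iterator, 1)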

complete_tree_exploration_for_MP_bandits.WIDTH = 200

Default value for the width parameter for wraptext() and wraplatex().

complete_tree_exploration_for_MP_bandits.wraptext(text, width=200)[source]

Wrap the text, using textwrap module, and width.

complete_tree_exploration_for_MP_bandits.mybool(s)[source]
complete_tree_exploration_for_MP_bandits.ONLYLEAFS = True

By default, aim at the most concise graph representation by only showing the leaves.

complete_tree_exploration_for_MP_bandits.ONLYABSORBING = False

By default, don’t aim at the most concise graph representation by only showing the absorbing leaves.

complete_tree_exploration_for_MP_bandits.CONCISE = True

By default, only show \(\tilde{S}\) and \(N\) in the graph representations, not all the 4 vectors.

complete_tree_exploration_for_MP_bandits.FULLHASH = False

Use only Stilde, N for hashing the states.

complete_tree_exploration_for_MP_bandits.FORMAT = 'svg'

Format used to save the graphs.

complete_tree_exploration_for_MP_bandits.FixedArm(j, state)[source]

Fake player j that always targets arm j.

complete_tree_exploration_for_MP_bandits.UniformExploration(j, state)[source]

Fake player j that always targets all arms.

complete_tree_exploration_for_MP_bandits.ConstantRank(j, state, decision, collision)[source]

Constant rank no matter what.

complete_tree_exploration_for_MP_bandits.choices_from_indexes(indexes)[source]

For deterministic index policies, if more than one index is maximum, return the list of positions attaining this maximum (ties), or only one position.
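An illustrative numpy sketch of this tie-breaking (not necessarily the exact code):

import numpy as np

def choices_from_indexes(indexes):
    """Positions of all indexes attaining the maximum (a single position if there is no tie)."""
    indexes = np.asarray(indexes)
    return list(np.nonzero(indexes == np.max(indexes))[0])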

complete_tree_exploration_for_MP_bandits.Selfish_0Greedy_U(j, state)[source]

Selfish policy + 0-Greedy index + U feedback.

complete_tree_exploration_for_MP_bandits.Selfish_0Greedy_Utilde(j, state)[source]

Selfish policy + 0-Greedy index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.Selfish_0Greedy_Ubar(j, state)[source]

Selfish policy + 0-Greedy index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.Selfish_UCB_U(j, state)[source]

Selfish policy + UCB_0.5 index + U feedback.

complete_tree_exploration_for_MP_bandits.Selfish_UCB(j, state)[source]

Selfish policy + UCB_0.5 index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.Selfish_UCB_Utilde(j, state)

Selfish policy + UCB_0.5 index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.Selfish_UCB_Ubar(j, state)[source]

Selfish policy + UCB_0.5 index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.Selfish_KLUCB_U(j, state)[source]

Selfish policy + Bernoulli KL-UCB index + U feedback.

complete_tree_exploration_for_MP_bandits.Selfish_KLUCB(j, state)[source]

Selfish policy + Bernoulli KL-UCB index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.Selfish_KLUCB_Utilde(j, state)

Selfish policy + Bernoulli KL-UCB index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.Selfish_KLUCB_Ubar(j, state)[source]

Selfish policy + Bernoulli KL-UCB index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.choices_from_indexes_with_rank(indexes, rank=1)[source]

For deterministic index policies, if more than one index is maximum, return the list of positions attaining the rank-th largest index (with more than one if ties, or only one position).
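Similarly, an illustrative numpy sketch for the rank-th largest index (not necessarily the exact code):

import numpy as np

def choices_from_indexes_with_rank(indexes, rank=1):
    """Positions of all indexes equal to the rank-th largest value (rank=1 is the maximum)."""
    indexes = np.asarray(indexes)
    target = np.sort(indexes)[::-1][rank - 1]   # value of the rank-th largest index
    return list(np.nonzero(indexes == target)[0])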

complete_tree_exploration_for_MP_bandits.RhoRand_UCB_U(j, state)[source]

RhoRand policy + UCB_0.5 index + U feedback.

complete_tree_exploration_for_MP_bandits.RhoRand_UCB_Utilde(j, state)[source]

RhoRand policy + UCB_0.5 index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.RhoRand_UCB_Ubar(j, state)[source]

RhoRand policy + UCB_0.5 index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.RhoRand_KLUCB_U(j, state)[source]

RhoRand policy + Bernoulli KL-UCB index + U feedback.

complete_tree_exploration_for_MP_bandits.RhoRand_KLUCB_Utilde(j, state)[source]

RhoRand policy + Bernoulli KL-UCB index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.RhoRand_KLUCB_Ubar(j, state)[source]

RhoRand policy + Bernoulli KL-UCB index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.RandomNewRank(j, state, decision, collision)[source]

RhoRand chooses a new uniform rank in {1,..,M} in case of collision, or keeps the same.

complete_tree_exploration_for_MP_bandits.default_policy(j, state)

RhoRand policy + UCB_0.5 index + U feedback.

complete_tree_exploration_for_MP_bandits.default_update_memory(j, state, decision, collision)

RhoRand chooses a new uniform rank in {1,..,M} in case of collision, or keeps the same.

complete_tree_exploration_for_MP_bandits.RandTopM_UCB_U(j, state, collision=False)[source]

RandTopM policy + UCB_0.5 index + U feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_UCB_Utilde(j, state, collision=False)[source]

RandTopM policy + UCB_0.5 index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_UCB_Ubar(j, state, collision=False)[source]

RandTopM policy + UCB_0.5 index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_KLUCB_U(j, state, collision=False)[source]

RandTopM policy + Bernoulli KL-UCB index + U feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_KLUCB_Utilde(j, state, collision=False)[source]

RandTopM policy + Bernoulli KL-UCB index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_KLUCB_Ubar(j, state, collision=False)[source]

RandTopM policy + Bernoulli KL-UCB index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.RandTopM_RandomNewChosenArm(j, state, decision, collision)[source]

RandTopM chooses a new arm after a collision or if the chosen arm lies outside of its estimatedBestArms set, uniformly from the set of estimated M best arms, or keeps the same.

complete_tree_exploration_for_MP_bandits.write_to_tuple(this_tuple, index, value)[source]

Tuples cannot be modified in place; this hack works around that.
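A minimal sketch of how such a "write" can be emulated (illustrative only):

def write_to_tuple(this_tuple, index, value):
    """Return a new tuple equal to this_tuple, with the entry at position index replaced by value."""
    as_list = list(this_tuple)   # tuples are immutable, so go through a mutable list
    as_list[index] = value
    return tuple(as_list)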

complete_tree_exploration_for_MP_bandits.MCTopM_UCB_U(j, state, collision=False)[source]

MCTopM policy + UCB_0.5 index + U feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_UCB_Utilde(j, state, collision=False)[source]

MCTopM policy + UCB_0.5 index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_UCB_Ubar(j, state, collision=False)[source]

MCTopM policy + UCB_0.5 index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_KLUCB_U(j, state, collision=False)[source]

MCTopM policy + Bernoulli KL-UCB index + U feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_KLUCB_Utilde(j, state, collision=False)[source]

MCTopM policy + Bernoulli KL-UCB index + Utilde feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_KLUCB_Ubar(j, state, collision=False)[source]

MCTopM policy + Bernoulli KL-UCB index + Ubar feedback.

complete_tree_exploration_for_MP_bandits.MCTopM_RandomNewChosenArm(j, state, decision, collision)[source]

RandTopMC chooses a new arm if the chosen arm lies outside of its estimatedBestArms set, uniformly from the set of estimated M best arms, or keeps the same.

complete_tree_exploration_for_MP_bandits.symbol_means(K)[source]

It is better to work directly with symbols and instantiate the results afterwards.

complete_tree_exploration_for_MP_bandits.random_uniform_means(K)[source]

If needed, generate an array of K (numerical) uniform means in [0, 1].

complete_tree_exploration_for_MP_bandits.uniform_means(nbArms=3, delta=0.1, lower=0.0, amplitude=1.0)[source]

Return a list of means of arms, well spaced:

  • in [lower, lower + amplitude],
  • sorted in increasing order,
  • starting from lower + amplitude * delta, up to lower + amplitude * (1 - delta),
  • and there are nbArms arms.
>>> np.array(uniform_means(2, 0.1))
array([ 0.1,  0.9])
>>> np.array(uniform_means(3, 0.1))
array([ 0.1,  0.5,  0.9])
>>> np.array(uniform_means(9, 1 / (1. + 9)))
array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
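The spacing described above corresponds to a simple linear grid, for instance (a sketch, not necessarily the exact code):

import numpy as np

def uniform_means(nbArms=3, delta=0.1, lower=0.0, amplitude=1.0):
    """nbArms means, evenly spaced from lower + amplitude*delta to lower + amplitude*(1 - delta)."""
    return list(np.linspace(lower + amplitude * delta, lower + amplitude * (1 - delta), nbArms))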
complete_tree_exploration_for_MP_bandits.proba2float(proba, values=None, K=None, names=None)[source]

Replace each mu_k by a numerical value and evaluate the formula.

complete_tree_exploration_for_MP_bandits.simplify(proba)[source]

Try to simplify the expression of the probability.

complete_tree_exploration_for_MP_bandits.proba2str(proba, latex=False, html_in_var_names=False)[source]

Pretty print a proba, either a number, a Fraction, or a sympy expression.

complete_tree_exploration_for_MP_bandits.tex2pdf(filename)[source]

Naive call to command line pdflatex, twice.

class complete_tree_exploration_for_MP_bandits.State(S, Stilde, N, Ntilde, mus, players, depth=0)[source]

Bases: object

Not space-efficient representation of a state in the system we model.

  • S, Stilde, N, Ntilde: are arrays of size (M, K),
  • depth, t, M, K: integers, to avoid recomputing them,
  • mus: the problem parameters (only for Bernoulli arms),
  • players: is a list of algorithms,
  • probas: list of transition probabilities,
  • children: list of all possible next states (transitions).
__init__(S, Stilde, N, Ntilde, mus, players, depth=0)[source]

Create a new state. Arrays S, Stilde, N, Ntilde are copied to avoid modifying previous values!

S = None

sensing feedback

Stilde = None

number of sensing trials

N = None

number of successful transmissions

Ntilde = None

number of trials without collisions

depth = None

current depth of the exploration tree

t = None

current time step. Simply = sum(N[0]) = sum(N[i]) for any player i, but it is easier to compute it once and store it

M = None

number of players

K = None

number of arms (channels)

children = None

list of next state, representing all the possible transitions

probas = None

probabilities of transitions

__str__(concise=True)[source]

Return str(self).

to_node(concise=True)[source]

Print the state as a small string to be attached to a GraphViz node.

to_dot(title='', name='', comment='', latex=False, html_in_var_names=False, ext='svg', onlyleafs=True, onlyabsorbing=False, concise=True)[source]

Convert the state to a .dot graph, using GraphViz. See http://graphviz.readthedocs.io/ for more details.

  • onlyleafs: only print the root and the leaves, to see a concise representation of the tree.
  • onlyabsorbing: only print the absorbing leaves, to see a really concise representation of the tree.
  • concise: whether to use the short representation of states (using \(\tilde{S}\) and \(N\)) or the long one (using the 4 variables).
  • html_in_var_names: experimental use of <SUB>..</SUB> and <SUP>..</SUP> in the label for the tree.
  • latex: experimental use of _{..} and ^{..} in the label for the tree, to use with dot2tex.
saveto(filename, view=True, title='', name='', comment='', latex=False, html_in_var_names=False, ext='svg', onlyleafs=True, onlyabsorbing=False, concise=True)[source]
copy()[source]

Get a new copy of that state with same S, Stilde, N, Ntilde but no probas and no children (and depth=0).

__hash__(full=False)[source]

Hash the matrix Stilde and N of the state.
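For instance, hashing only Stilde and N can be done by first converting the two matrices to nested tuples, in the spirit of tupleit2 above (hash_state is a hypothetical helper, only a sketch of the idea):

def hash_state(Stilde, N):
    """Hash two 2D arrays by converting them to hashable nested tuples first."""
    return hash((tuple(tuple(row) for row in Stilde),
                 tuple(tuple(row) for row in N)))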

is_absorbing()[source]

Try to detect if this state is absorbing, i.e., only one transition is possible, and the same holds recursively for its only child.

Warning

Still very experimental!

has_absorbing_child_whole_subtree()[source]

Try to detect if this state has an absorbing child in the whole subtree.

explore_from_node_to_depth(depth=1)[source]

Compute recursively the one_depth children of the root and its children.

compute_one_depth()[source]

Use all_deltas to store all the possible transitions and their probabilities. Increase depth by 1 at the end.

all_absorbing_states(depth=1)[source]

Generator that yields all the absorbing nodes of the tree, one by one.

  • It might not find any,
  • It does so without merging common nodes, in order to find the first absorbing node as quickly as possible.
absorbing_states_one_depth()[source]

Use all_deltas to yield all the absorbing one-depth children and their probabilities.

find_N_absorbing_states(N=1, maxdepth=8)[source]

Find at least N absorbing states, by considering a large depth.

all_deltas()[source]

Generator that yields functions transforming state to another state.

  • It is memory efficient as it is a generator.
  • Do not convert that to a list or it might use all your system memory: each returned value is a function with code and variables inside!
pretty_print_result_recursively()[source]

Print all the transitions, depth by depth (recursively).

get_all_leafs()[source]

Recurse and get all the leaves. Many different states can be present in the list of leaves, with possibly different probabilities (each corresponds to a trajectory).

get_unique_leafs()[source]

Compute all the leaves (deepest children) and merge the common ones to compute their full probabilities.

proba_reaching_absorbing_state()[source]

Compute the probability of reaching a leaf that is an absorbing state.

__dict__ = mappingproxy({'__module__': 'complete_tree_exploration_for_MP_bandits', '__doc__': 'Not space-efficient representation of a state in the system we model.\n\n - S, Stilde, N, Ntilde: are arrays of size (M, K),\n - depth, t, M, K: integers, to avoid recomputing them,\n - mus: the problem parameters (only for Bernoulli arms),\n - players: is a list of algorithms,\n - probas: list of transition probabilities,\n - children: list of all possible next states (transitions).\n ', '__init__': <function State.__init__>, '__str__': <function State.__str__>, 'to_node': <function State.to_node>, 'to_dot': <function State.to_dot>, 'saveto': <function State.saveto>, 'copy': <function State.copy>, '__hash__': <function State.__hash__>, 'is_absorbing': <function State.is_absorbing>, 'has_absorbing_child_whole_subtree': <function State.has_absorbing_child_whole_subtree>, 'explore_from_node_to_depth': <function State.explore_from_node_to_depth>, 'compute_one_depth': <function State.compute_one_depth>, 'all_absorbing_states': <function State.all_absorbing_states>, 'absorbing_states_one_depth': <function State.absorbing_states_one_depth>, 'find_N_absorbing_states': <function State.find_N_absorbing_states>, 'all_deltas': <function State.all_deltas>, 'pretty_print_result_recursively': <function State.pretty_print_result_recursively>, 'get_all_leafs': <function State.get_all_leafs>, 'get_unique_leafs': <function State.get_unique_leafs>, 'proba_reaching_absorbing_state': <function State.proba_reaching_absorbing_state>, '__dict__': <attribute '__dict__' of 'State' objects>, '__weakref__': <attribute '__weakref__' of 'State' objects>})
__module__ = 'complete_tree_exploration_for_MP_bandits'
__weakref__

list of weak references to the object (if defined)

class complete_tree_exploration_for_MP_bandits.StateWithMemory(S, Stilde, N, Ntilde, mus, players, update_memories, memories=None, depth=0)[source]

Bases: complete_tree_exploration_for_MP_bandits.State

State with a memory for each player, to represent and play with RhoRand etc.

__init__(S, Stilde, N, Ntilde, mus, players, update_memories, memories=None, depth=0)[source]

Create a new state. Arrays S, Stilde, N, Ntilde are copied to avoid modifying previous values!

memories = None

Personal memory for all players, can be a rank in {1,..,M} for rhoRand, or anything else.

__str__(concise=False)[source]

Return str(self).

to_node(concise=True)[source]

Print the state as a small string to be attached to a GraphViz node.

copy()[source]

Get a new copy of that state with same S, Stilde, N, Ntilde but no probas and no children (and depth=0).

__hash__(full=False)[source]

Hash the matrix Stilde and N of the state and memories of the players (ie. ranks for RhoRand).

is_absorbing()[source]

Try to detect if this state is absorbing, i.e., only one transition is possible, and the same holds recursively for its only child.

Warning

Still very experimental!

all_deltas()[source]

Generator that yields functions transforming state to another state.

  • It is memory efficient as it is a generator.
  • Do not convert that to a list or it might use all your system memory: each returned value is a function with code and variables inside!
__module__ = 'complete_tree_exploration_for_MP_bandits'
complete_tree_exploration_for_MP_bandits.main(depth=1, players=None, update_memories=None, mus=None, M=2, K=2, S=None, Stilde=None, N=None, Ntilde=None, find_only_N=None)[source]

Compute all the transitions, and print them.

complete_tree_exploration_for_MP_bandits.test(depth=1, M=2, K=2, S=None, Stilde=None, N=None, Ntilde=None, mus=None, debug=True, all_players=None, all_update_memories=None, find_only_N=None)[source]

Test the main exploration function for various all_players.

configuration module

Configuration for the simulations, for the single-player case.

configuration.CPU_COUNT = 2

Number of CPUs on the local machine

configuration.HORIZON = 10000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration.REPETITIONS = 4

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration.RANDOM_SHUFFLE = False

The arms won’t be shuffled (shuffle(arms)).

configuration.RANDOM_INVERT = False

The arms won’t be inverted (arms = arms[::-1]).

configuration.NB_BREAK_POINTS = 0

Number of true breakpoints. They are uniformly spaced in time steps (and the first one at t=0 does not count).

configuration.EPSILON = 0.1

Parameters for the epsilon-greedy and epsilon-… policies.

configuration.TEMPERATURE = 0.05

Temperature for the Softmax policies.

configuration.LEARNING_RATE = 0.01

Learning rate for my aggregated bandit (it can be autotuned)

configuration.TEST_WrapRange = False

To know if my WrapRange policy is tested.

configuration.CACHE_REWARDS = True

Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations and for each algorithm.

configuration.UPDATE_ALL_CHILDREN = False

Should the Aggregator policy update the trusts in each child, or just the one trusted for the last decision?

configuration.UNBIASED = False

Should the Aggregator policy use biased estimators of the rewards, i.e., just r_t, or unbiased estimators, r_t / p_t?
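In other words, with p_t denoting the probability with which the trusted child was chosen (a hypothetical helper, only to make the two estimators explicit):

def reward_estimate(r_t, p_t, unbiased=False):
    """Biased estimator r_t, or importance-weighted (unbiased) estimator r_t / p_t."""
    return r_t / p_t if unbiased else r_t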

configuration.UPDATE_LIKE_EXP4 = False

Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?

configuration.UNBOUNDED_VARIANCE = 1

Variance of unbounded Gaussian arms

configuration.NB_ARMS = 9

Number of arms for non-hard-coded problems (Bayesian problems)

configuration.LOWER = 0.0

Default value for the lower value of means

configuration.AMPLITUDE = 1.0

Default value for the amplitude value of means

configuration.VARIANCE = 0.05

Variance of Gaussian arms

configuration.ARM_TYPE

alias of Arms.Bernoulli.Bernoulli

configuration.ENVIRONMENT_BAYESIAN = False

True to use a Bayesian problem

configuration.MEANS = [0.05, 0.16249999999999998, 0.27499999999999997, 0.38749999999999996, 0.49999999999999994, 0.6125, 0.725, 0.8374999999999999, 0.95]

Means of arms for non-hard-coded problems (non Bayesian)

configuration.USE_FULL_RESTART = True

True to use full-restart Doubling Trick

configuration.configuration = {'append_labels': {}, 'cache_rewards': True, 'change_labels': {0: 'Pure exploration', 1: 'Pure exploitation', 2: '$\\varepsilon$-greedy', 3: 'Explore-then-Exploit', 5: 'Bernoulli kl-UCB', 6: 'Thompson sampling'}, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}], 'environment_bayesian': False, 'horizon': 10000, 'n_jobs': -1, 'nb_break_points': 0, 'plot_lowerbound': True, 'policies': [{'archtype': <class 'Policies.Uniform.Uniform'>, 'params': {}, 'change_label': 'Pure exploration'}, {'archtype': <class 'Policies.EmpiricalMeans.EmpiricalMeans'>, 'params': {}, 'change_label': 'Pure exploitation'}, {'archtype': <class 'Policies.EpsilonGreedy.EpsilonDecreasing'>, 'params': {'epsilon': 479.99999999999983}, 'change_label': '$\\varepsilon$-greedy'}, {'archtype': <class 'Policies.ExploreThenCommit.ETC_KnownGap'>, 'params': {'horizon': 10000, 'gap': 0.11250000000000004}, 'change_label': 'Explore-then-Exploit'}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>}, 'change_label': 'Bernoulli kl-UCB'}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>}, 'change_label': 'Thompson sampling'}], 'random_invert': False, 'random_shuffle': False, 'repetitions': 4, 'verbosity': 6}

This dictionary configures the experiments

configuration.nbArms = 9

Number of arms in the first environment

configuration.klucb(x, d, precision=1e-06)[source]

Warning: if using Exponential or Gaussian arms, gives klExp or klGauss to KL-UCB-like policies!

configuration_comparing_aggregation_algorithms module

Configuration for the simulations, for the single-player case, for comparing Aggregation algorithms.

configuration_comparing_aggregation_algorithms.HORIZON = 10000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_comparing_aggregation_algorithms.REPETITIONS = 4

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_comparing_aggregation_algorithms.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_comparing_aggregation_algorithms.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_comparing_aggregation_algorithms.NB_ARMS = 9

Number of arms for non-hard-coded problems (Bayesian problems)

configuration_comparing_aggregation_algorithms.RANDOM_SHUFFLE = False

The arms are shuffled (shuffle(arms)).

configuration_comparing_aggregation_algorithms.RANDOM_INVERT = False

The arms are inverted (arms = arms[::-1]).

configuration_comparing_aggregation_algorithms.NB_RANDOM_EVENTS = 5

Number of random events. They are uniformly spaced in time steps.

configuration_comparing_aggregation_algorithms.CACHE_REWARDS = False

Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations and for each algorithm.

configuration_comparing_aggregation_algorithms.UPDATE_ALL_CHILDREN = False

Should the Aggregator policy update the trusts in each child, or just the one trusted for the last decision?

configuration_comparing_aggregation_algorithms.UNBIASED = True

Should the Aggregator policy use biased estimators of the rewards, i.e., just r_t, or unbiased estimators, r_t / p_t?

configuration_comparing_aggregation_algorithms.UPDATE_LIKE_EXP4 = False

Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?

configuration_comparing_aggregation_algorithms.TRUNC = 1

Trunc parameter, ie amplitude, for Exponential arms

configuration_comparing_aggregation_algorithms.VARIANCE = 0.05

Variance of Gaussian arms

configuration_comparing_aggregation_algorithms.MINI = 0

lower bound on rewards from Gaussian arms

configuration_comparing_aggregation_algorithms.MAXI = 1

upper bound on rewards from Gaussian arms, ie amplitude = 1

configuration_comparing_aggregation_algorithms.SCALE = 1

Scale of Gamma arms

configuration_comparing_aggregation_algorithms.ARM_TYPE

alias of Arms.Bernoulli.Bernoulli

configuration_comparing_aggregation_algorithms.configuration = {'cache_rewards': False, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}], 'horizon': 10000, 'n_jobs': -1, 'nb_random_events': 5, 'policies': [{'archtype': <class 'Policies.Aggregator.Aggregator'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'update_all_children': False, 'decreaseRate': 'auto', 'update_like_exp4': False}}, {'archtype': <class 'Policies.Aggregator.Aggregator'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'update_all_children': False, 'decreaseRate': 'auto', 'update_like_exp4': True}}, {'archtype': <class 'Policies.LearnExp.LearnExp'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'eta': 0.9}}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'random_invert': False, 'random_shuffle': False, 'repetitions': 4, 'verbosity': 6}

This dictionary configures the experiments

configuration_comparing_aggregation_algorithms.LOWER = 0

And get LOWER, AMPLITUDE values

configuration_comparing_aggregation_algorithms.AMPLITUDE = 1

And get LOWER, AMPLITUDE values

configuration_comparing_aggregation_algorithms.klucbGauss(x, d, precision=0.0)[source]

klucbGauss(x, d, sig2x) with the right variance (= 0.05).
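Such a wrapper simply binds the variance used in this configuration, for instance (a sketch, assuming the generic klucbGauss helper from Policies.kullback; not necessarily the exact code):

from Policies.kullback import klucbGauss as generic_klucbGauss

VARIANCE = 0.05  # variance of the Gaussian arms in this configuration

def klucbGauss(x, d, precision=0.0):
    """klucbGauss(x, d, sig2x) with sig2x bound to the configuration's VARIANCE."""
    return generic_klucbGauss(x, d, VARIANCE)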

configuration_comparing_aggregation_algorithms.klucbGamma(x, d, precision=0.0)[source]

klucbGamma(x, d, sig2x) with the right scale (= 1).

configuration_comparing_doubling_algorithms module

Configuration for the simulations, for the single-player case, for comparing doubling-trick doubling schemes.

configuration_comparing_doubling_algorithms.CPU_COUNT = 2

Number of CPUs on the local machine

configuration_comparing_doubling_algorithms.HORIZON = 45678

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_comparing_doubling_algorithms.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_comparing_doubling_algorithms.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_comparing_doubling_algorithms.REPETITIONS = 1000

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_comparing_doubling_algorithms.UNBOUNDED_VARIANCE = 1

Variance of unbounded Gaussian arms

configuration_comparing_doubling_algorithms.VARIANCE = 0.05

Variance of Gaussian arms

configuration_comparing_doubling_algorithms.NB_ARMS = 9

Number of arms for non-hard-coded problems (Bayesian problems)

configuration_comparing_doubling_algorithms.lower = 0.0

Default value for the lower value of means

configuration_comparing_doubling_algorithms.amplitude = 1.0

Default value for the amplitude value of means

configuration_comparing_doubling_algorithms.ARM_TYPE

alias of Arms.Bernoulli.Bernoulli

configuration_comparing_doubling_algorithms.configuration = {'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}, {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': {'newMeans': <function randomMeans>, 'args': {'nbArms': 9, 'mingap': None, 'lower': 0.0, 'amplitude': 1.0, 'isSorted': True}}}], 'horizon': 45678, 'n_jobs': -1, 'policies': [{'archtype': <class 'Policies.UCB.UCB'>, 'params': {}}, {'archtype': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>, 'params': {'horizon': 45678}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__arithmetic>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__geometric>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_fast>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_slow>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__arithmetic>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__geometric>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_fast>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_slow>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}], 'repetitions': 1000, 'verbosity': 6}

This dictionary configures the experiments

configuration_comparing_doubling_algorithms.klucb(x, d, precision=1e-06)[source]

Warning: if using Exponential or Gaussian arms, gives klExp or klGauss to KL-UCB-like policies!

configuration_markovian module

Configuration for the simulations, for the single-player case for Markovian problems.

configuration_markovian.CPU_COUNT = 2

Number of CPUs on the local machine

configuration_markovian.HORIZON = 1000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_markovian.REPETITIONS = 100

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_markovian.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_markovian.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_markovian.VARIANCE = 10

Variance of Gaussian arms

configuration_markovian.TEST_Aggregator = True

To know if my Aggregator policy is tried.

configuration_markovian.configuration = {'environment': [{'arm_type': 'Markovian', 'params': {'rested': False, 'transitions': [{(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5}, [[0.2, 0.8], [0.6, 0.4]]], 'steadyArm': <class 'Arms.Bernoulli.Bernoulli'>}}], 'horizon': 1000, 'n_jobs': -1, 'policies': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {}}], 'repetitions': 100, 'verbosity': 6}

This dictionary configures the experiments

configuration_markovian.nbArms = 3

Number of arms in the first environment

configuration_markovian.klucb(x, d, precision=1e-06)[source]

Warning: if using Exponential or Gaussian arms, gives klExp or klGauss to KL-UCB-like policies!

configuration_multiplayers module

Configuration for the simulations, for the multi-players case.

configuration_multiplayers.HORIZON = 10000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_multiplayers.REPETITIONS = 200

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_multiplayers.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_multiplayers.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_multiplayers.NB_PLAYERS = 3

NB_PLAYERS : number of players for the game. Should be >= 2 and <= number of arms.

configuration_multiplayers.collisionModel(t, arms, players, choices, rewards, pulls, collisions)

The best collision model: none of the colliding users get any reward

configuration_multiplayers.VARIANCE = 0.05

Variance of Gaussian arms

configuration_multiplayers.CACHE_REWARDS = False

Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations and for each algorithm.

configuration_multiplayers.NB_ARMS = 6

Number of arms for non-hard-coded problems (Bayesian problems)

configuration_multiplayers.LOWER = 0.0

Default value for the lower value of means

configuration_multiplayers.AMPLITUDE = 1.0

Default value for the amplitude value of means

configuration_multiplayers.ARM_TYPE

alias of Arms.Bernoulli.Bernoulli

configuration_multiplayers.ENVIRONMENT_BAYESIAN = False

True to use a Bayesian problem

configuration_multiplayers.MEANS = [0.1, 0.26, 0.42000000000000004, 0.58, 0.74, 0.9]

Means of arms for non-hard-coded problems (non Bayesian)

configuration_multiplayers.configuration = {'averageOn': 0.001, 'collisionModel': <function onlyUniqUserGetsReward>, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.26, 0.42000000000000004, 0.58, 0.74, 0.9]}], 'finalRanksOnAverage': True, 'horizon': 10000, 'n_jobs': -1, 'players': [<Policies.SIC_MMAB.SIC_MMAB object>, <Policies.SIC_MMAB.SIC_MMAB object>, <Policies.SIC_MMAB.SIC_MMAB object>], 'plot_lowerbounds': False, 'repetitions': 200, 'successive_players': [[CentralizedMultiplePlay(kl-UCB), CentralizedMultiplePlay(kl-UCB), CentralizedMultiplePlay(kl-UCB)], [Selfish(kl-UCB), Selfish(kl-UCB), Selfish(kl-UCB)], [rhoRand(kl-UCB), rhoRand(kl-UCB), rhoRand(kl-UCB)], [MCTopM(kl-UCB), MCTopM(kl-UCB), MCTopM(kl-UCB)]], 'verbosity': 6}

This dictionary configures the experiments

configuration_multiplayers.nbArms = 6

Number of arms in the first environment

configuration_sparse module

Configuration for the simulations, for single-player sparse bandit.

configuration_sparse.HORIZON = 10000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_sparse.REPETITIONS = 100

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_sparse.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_sparse.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_sparse.RANDOM_SHUFFLE = False

The arms are shuffled (shuffle(arms)).

configuration_sparse.RANDOM_INVERT = False

The arms are inverted (arms = arms[::-1]).

configuration_sparse.NB_RANDOM_EVENTS = 5

Number of random events. They are uniformly spaced in time steps.

configuration_sparse.UPDATE_ALL_CHILDREN = False

Should the Aggregator policy update the trusts in each child, or just the one trusted for the last decision?

configuration_sparse.LEARNING_RATE = 0.01

Learning rate for my aggregated bandit (it can be autotuned)

configuration_sparse.UNBIASED = False

Should the Aggregator policy use biased estimators of the rewards, i.e., just r_t, or unbiased estimators, r_t / p_t?

configuration_sparse.UPDATE_LIKE_EXP4 = False

Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?

configuration_sparse.TEST_Aggregator = False

To know if my Aggregator policy is tried.

configuration_sparse.CACHE_REWARDS = False

Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations and for each algorithm.

configuration_sparse.TRUNC = 1

Trunc parameter, ie amplitude, for Exponential arms

configuration_sparse.MINI = 0

lower bound on rewards from Gaussian arms

configuration_sparse.MAXI = 1

upper bound on rewards from Gaussian arms, ie amplitude = 1

configuration_sparse.SCALE = 1

Scale of Gamma arms

configuration_sparse.NB_ARMS = 15

Number of arms for non-hard-coded problems (Bayesian problems)

configuration_sparse.SPARSITY = 7

Sparsity for non-hard-coded problems (Bayesian problems)

configuration_sparse.LOWERNONZERO = 0.25

Default value for the lower value of non-zero means

configuration_sparse.VARIANCE = 0.05

Variance of Gaussian arms

configuration_sparse.ARM_TYPE

alias of Arms.Gaussian.Gaussian

configuration_sparse.ENVIRONMENT_BAYESIAN = False

True to use a Bayesian problem

configuration_sparse.MEANS = [0.00125, 0.03660714285714286, 0.07196428571428572, 0.10732142857142857, 0.14267857142857143, 0.1780357142857143, 0.21339285714285713, 0.24875, 0.25375, 0.3775, 0.50125, 0.625, 0.74875, 0.8725, 0.99625]

Means of arms for non-hard-coded problems (non Bayesian)

configuration_sparse.ISSORTED = True

Whether to sort the means of the problems or not.

configuration_sparse.configuration = {'environment': [{'arm_type': <class 'Arms.Gaussian.Gaussian'>, 'params': [(0.05, 0.05, 0.0, 1.0), (0.07142857142857144, 0.05, 0.0, 1.0), (0.09285714285714286, 0.05, 0.0, 1.0), (0.1142857142857143, 0.05, 0.0, 1.0), (0.13571428571428573, 0.05, 0.0, 1.0), (0.15714285714285717, 0.05, 0.0, 1.0), (0.1785714285714286, 0.05, 0.0, 1.0), (0.2, 0.05, 0.0, 1.0), (0.4, 0.05, 0.0, 1.0), (0.47500000000000003, 0.05, 0.0, 1.0), (0.55, 0.05, 0.0, 1.0), (0.625, 0.05, 0.0, 1.0), (0.7000000000000001, 0.05, 0.0, 1.0), (0.7750000000000001, 0.05, 0.0, 1.0), (0.8500000000000001, 0.05, 0.0, 1.0)], 'sparsity': 7}], 'horizon': 10000, 'n_jobs': -1, 'nb_random_events': 5, 'policies': [{'archtype': <class 'Policies.EmpiricalMeans.EmpiricalMeans'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseUCB.SparseUCB'>, 'params': {'alpha': 1, 'sparsity': 7, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseklUCB.SparseklUCB'>, 'params': {'sparsity': 7, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.Thompson.Thompson'>, 'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.Thompson.Thompson'>, 'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.BayesUCB.BayesUCB'>, 'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'policy': <class 'Policies.BayesUCB.BayesUCB'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.OSSB.OSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0}}, {'archtype': <class 'Policies.OSSB.GaussianOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0, 'variance': 0.05}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.001, 'gamma': 0.0, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.01, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.001, 'gamma': 0.01, 
'sparsity': 7}}], 'random_invert': False, 'random_shuffle': False, 'repetitions': 100, 'verbosity': 6}

This dictionary configures the experiments

configuration_sparse.LOWER = 0

And get LOWER, AMPLITUDE values

configuration_sparse.AMPLITUDE = 1

And get LOWER, AMPLITUDE values

configuration_sparse.klucbGauss(x, d, precision=0.0)[source]

klucbGauss(x, d, sig2x) with the right variance (= 0.25).

configuration_sparse.klucbGamma(x, d, precision=0.0)[source]

klucbGamma(x, d, sig2x) with the right scale (= 1).

configuration_sparse_multiplayers module

Configuration for the simulations, for the multi-players case with sparse activated players.

configuration_sparse_multiplayers.HORIZON = 10000

HORIZON : number of time steps of the experiments. Warning: should be >= 10000 to be interesting “asymptotically”.

configuration_sparse_multiplayers.REPETITIONS = 4

REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.

configuration_sparse_multiplayers.DO_PARALLEL = True

To profile the code, turn off parallel computing

configuration_sparse_multiplayers.N_JOBS = -1

Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.

configuration_sparse_multiplayers.NB_PLAYERS = 2

NB_PLAYERS : number of players for the game. Should be >= 2 and <= number of arms.

configuration_sparse_multiplayers.ACTIVATION = 1.0

ACTIVATION : common probability of activation.

configuration_sparse_multiplayers.ACTIVATIONS = (1.0, 1.0)

ACTIVATIONS : probability of activation of each player.

configuration_sparse_multiplayers.VARIANCE = 0.05

Variance of Gaussian arms

configuration_sparse_multiplayers.NB_ARMS = 2

Number of arms for non-hard-coded problems (Bayesian problems)

configuration_sparse_multiplayers.ARM_TYPE

alias of Arms.Bernoulli.Bernoulli

configuration_sparse_multiplayers.MEANS = [0.3333333333333333, 0.6666666666666667]

Means of the arms

configuration_sparse_multiplayers.configuration = {'activations': (1.0, 1.0), 'averageOn': 0.001, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.3333333333333333, 0.6666666666666667]}], 'finalRanksOnAverage': True, 'horizon': 10000, 'n_jobs': -1, 'players': [Selfish(UCB), Selfish(UCB)], 'repetitions': 4, 'successive_players': [[Selfish(U(1..2)), Selfish(U(1..2))], [Selfish(UCB), Selfish(UCB)], [Selfish(Thompson Sampling), Selfish(Thompson Sampling)], [Selfish(kl-UCB), Selfish(kl-UCB)], [Selfish(Exp3++), Selfish(Exp3++)]], 'verbosity': 6}

This dictionary configures the experiments

configuration_sparse_multiplayers.nbArms = 2

Number of arms in the first environment

env_client module

Client to play a multi-armed bandit problem against. Many distributions of arms are supported, defaulting to Bernoulli.

Usage:
    env_client.py [--markovian | --dynamic] [--port=<PORT>] [--host=<HOST>] [--speed=<SPEED>] <json_configuration>
    env_client.py (-h|--help)
    env_client.py --version

Options:
    -h --help         Show this screen.
    --version         Show version.
    --markovian       Whether to use a Markovian MAB problem (default is simple MAB problems).
    --dynamic         Whether to use a Dynamic MAB problem (default is simple MAB problems).
    --port=<PORT>     Port to use for the TCP connection [default: 10000].
    --host=<HOST>     Address to use for the TCP connection [default: 0.0.0.0].
    --speed=<SPEED>   Speed of emission, in milliseconds [default: 1000].
env_client.default_configuration = {'arm_type': 'Bernoulli', 'params': {(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)}}

Example of configuration to pass from the command line. '{"arm_type": "Bernoulli", "params": (0.1, 0.5, 0.9)}'

env_client.read_configuration_env(a_string)[source]

Return a valid configuration dictionary to initialize a MAB environment, from the input string.

env_client.send_message(sock, message)[source]

Send this message to the socket.

env_client.client(env, host, port, speed)[source]

Launch a client that:

  • uses sockets to listen to input and reply,
  • creates a MAB environment from a JSON configuration (exactly like main.py does when it reads configuration.py),
  • then receives a chosen arm from the network, passes it to the MAB environment, listens to its reward = draw(arm) feedback, and sends this back over the network.
env_client.transform_str(params)[source]

Like a safe exec() on a dictionary that can contain special values:

  • strings are interpreted as variable names (e.g., policy names) from the current globals() scope,
  • lists are transformed to tuples, to be constant and hashable,
  • dictionaries are recursively transformed (see the sketch below).
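An illustrative sketch of this recursive transformation (not necessarily the exact code of the module):

def transform_str(params):
    """Recursively map strings to objects from globals(), lists to tuples, and descend into dicts."""
    if isinstance(params, str):
        return globals().get(params, params)          # e.g., "UCB" resolves to the UCB class, if imported
    if isinstance(params, (list, tuple)):
        return tuple(transform_str(value) for value in params)
    if isinstance(params, dict):
        return {key: transform_str(value) for key, value in params.items()}
    return params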
env_client.main(arguments)[source]

Take the arguments, construct the learning policy and start the server.

main module

main_multiplayers module

main_multiplayers_more module

main_sparse_multiplayers module

policy_server module

Server to play a multi-armed bandit problem against.

Usage:
    policy_server.py [--port=<PORT>] [--host=<HOST>] [--means=<MEANS>] <json_configuration>
    policy_server.py (-h|--help)
    policy_server.py --version

Options:
    -h --help        Show this screen.
    --version        Show version.
    --port=<PORT>    Port to use for the TCP connection [default: 10000].
    --host=<HOST>    Address to use for the TCP connection [default: 0.0.0.0].
    --means=<MEANS>  Means of arms used by the environment, to print regret [default: None].
policy_server.default_configuration = {'archtype': 'UCBalpha', 'nbArms': 10, 'params': {'alpha': 1}}

Example of configuration to pass from the command line. '{"nbArms": 3, "archtype": "UCBalpha", "params": { "alpha": 0.5 }}'

policy_server.read_configuration_policy(a_string)[source]

Return a valid configuration dictionary to initialize a policy, from the input string.

policy_server.server(policy, host, port, means=None)[source]

Launch a server that:

  • uses sockets to listen to input and reply,
  • creates a learning algorithm from a JSON configuration (exactly like main.py does when it reads configuration.py),
  • then receives feedback (arm, reward) from the network, passes it to the algorithm, listens to its arm = choice() suggestion, and sends this back over the network.
policy_server.transform_str(params)[source]

Like a safe exec() on a dictionary that can contain special values:

  • strings are interpreted as variable names (e.g., policy names) from the current globals() scope,
  • lists are transformed to tuples, to be constant and hashable,
  • dictionaries are recursively transformed.

Warning

It is still as unsafe as exec() : only use it with trusted inputs!

policy_server.main(args)[source]

Take the args, construct the learning policy and start the server.

How to run the code ?

This short page explains quickly how to install the requirements for this project, and then how to use the code to run simulations.

Required modules

Virtualenv

First, install the requirements, globally (or with a virtualenv, see below):

pip install -r requirements.txt
Some requirements are only needed for one policy (mostly the experimental ones), and for the documentation.
Nix

A pinned Nix environment is available:

nix-shell

Running some simulations

Then, it should be very straightforward to run some experiments. This will run the simulations, average them (over repetitions) and plot the results.

Single player
Single player
python main.py
# or
make main
Single player, aggregating algorithms
python main.py configuration_comparing_aggregation_algorithms
# or
make comparing_aggregation_algorithms
See these explanations: Aggregation.md
Single player, doubling-trick algorithms
python main.py configuration_comparing_doubling_algorithms
# or
make comparing_doubling_algorithms
See these explanations: DoublingTrick.md
Single player, with Sparse Stochastic Bandit
python main.py configuration_sparse
# or
make sparse
See these explanations: SparseBandits.md
Single player, with Markovian problem
python main.py configuration_markovian
# or
make markovian
Single player, with non-stationary problem
python main.py configuration_nonstationary
# or
make nonstationary
See these explanations: NonStationaryBandits.md
Multi-Player
Multi-Player, one algorithm
python main_multiplayers.py
# or
make multi
Multi-Player, comparing different algorithms
python main_multiplayers_more.py
# or
make moremulti
See these explanations: MultiPlayers.md

Using env variables ?

For all simulations, I recently added support for environment variables, to ease the customization of the main parameters of every simulation.

For instance, if the configuration_multiplayers_more.py file is correct, then you can customize it to use N=4 repetitions, horizon T=1000 and M=3 players, parallelized with N_JOBS=4 jobs (use the number of cores of your CPU for optimal performance):

N=4 T=1000 M=3 DEBUG=True SAVEALL=False N_JOBS=4 make moremulti
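
Internally, the configuration files read such variables from the environment; the pattern is essentially the standard os.getenv idiom. A minimal sketch (the names and default values below are illustrative, not the exact ones used in configuration_multiplayers_more.py):

import os

HORIZON     = int(os.getenv("T", "10000"))    # time horizon T
REPETITIONS = int(os.getenv("N", "100"))      # number of repetitions N
NB_PLAYERS  = int(os.getenv("M", "3"))        # number of players M
N_JOBS      = int(os.getenv("N_JOBS", "1"))   # number of parallel jobs
DEBUG       = os.getenv("DEBUG", "False") == "True"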

In a virtualenv ?

If you prefer not to install the requirements globally in your system-wide Python setup, you can (and should) use virtualenv.

$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/SMPyBandits/bin/python3
Also creating executable in /your/path/to/SMPyBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate  # in bash, use activate.csh or activate.fish if needed
$ type pip  # just to check
pip is /your/path/to/SMPyBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2

And then be sure to use the virtualenv binary for Python, bin/python, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate was executed).


Or with a Makefile ?

You can also use the provided Makefile file to do this simply:

make install       # install the requirements
make multiplayers  # run and log the main_multiplayers.py script

It can be used to check the quality of the code with pylint:

make lint lint3  # check the code with pylint

It is also used to clean the code, build the documentation, upload the documentation, etc. (these targets are mostly for my own use).


Or within a Jupyter notebook ?

I am writing some Jupyter notebooks, in this folder (notebooks/), so if you want to do the same for your small experiments, you can take inspiration from the few notebooks already written.

List of research publications using Lilian Besson’s SMPyBandits project

I (Lilian Besson) started my PhD in October 2016, and this project has been a part of my ongoing research since December 2016.


1st article, about policy aggregation algorithm (aka model selection)

I designed and added the Aggregator policy, in order to test its validity and performance.

It is a “simple” voting algorithm to combine multiple bandit algorithms into one. Basically, it behaves like a simple MAB policy based only on empirical means (even simpler than UCB), where the arms are the child algorithms A_1 .. A_N, each running in “parallel”.

For more details, refer to this file: Aggregation.md and this research article.

2nd article, about Multi-players Multi-Armed Bandits

There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting. The basic difference is about collisions: at each time t, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station's point of view and from each player's point of view.

For more details, refer to this file: MultiPlayers.md and this research article.

3rd article, using Doubling Trick for Multi-Armed Bandits

I studied what the Doubling Trick can and can't do to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandits algorithms.

For more details, refer to this file: DoublingTrick.md and this research article.

4th article, about Piece-Wise Stationary Multi-Armed Bandits

With Emilie Kaufmann, we studied the Generalized Likelihood Ratio Test (GLRT) for sub-Bernoulli distributions, and proposed the B-GLRT algorithm for change-point detection in piece-wise stationary one-armed bandit problems. We combined the B-GLRT with the kl-UCB multi-armed bandit algorithm and proposed the GLR-klUCB algorithm for piece-wise stationary multi-armed bandit problems. We prove finite-time guarantees for the B-GLRT and the GLR-klUCB algorithm, and we illustrate their performance with extensive numerical experiments.

For more details, refer to this file: NonStationaryBandits.md and this research article.

Other interesting things

Single-player Policies
Arms and problems
  • My framework mainly targets stochastic bandits, with arms following Bernoulli, bounded (truncated) or unbounded Gaussian, Exponential, Gamma or Poisson distributions.
  • The default configuration is to use a fixed problem for N repetitions (e.g. 1000 repetitions, use MAB.MAB), but there is also full support for “Bayesian” problems, where the mean vector µ1,…,µK changes at every repetition (see MAB.DynamicMAB).
  • There is also a good support for Markovian problems, see MAB.MarkovianMAB, even though I didn’t implement any policies tailored for Markovian problems.
  • I’m actively working on adding a very clean support for non-stationary MAB problems, and MAB.PieceWiseStationaryMAB is already working well. Use it with policies designed for piece-wise stationary problems, like Discounted-Thompson, CD-UCB, M-UCB, SW-UCB#.

Policy aggregation algorithms

Idea

The basic idea of a policy aggregation algorithm is to run several online learning algorithms in parallel, denoted $A_1,\ldots,A_N$, make them all vote at each step, and use some probabilistic scheme to select a decision from their votes.

Hopefully, if all the algorithms $A_i$ are not too bad and at least one of them is efficient for the problem at hand, the aggregation algorithm will learn to mainly trust the efficient one(s) and discard the votes of the others. An efficient aggregation algorithm should have performance similar to the best child algorithm $A_i$, on any problem.

The Exp4 algorithm by [Auer et al, 2002] is the first aggregation algorithm for online bandit algorithms; more recent algorithms include LearnExp ([Singla et al, 2017]) and CORRAL ([Agarwal et al, 2017]).


Mathematical explanations

Initially, every child algorithm $A_i$ has the same “trust” probability $p_i$. At every step, the aggregated bandit first listens to the decisions of all its children $A_i$ ($a_{i,t}$ in $\{1,\ldots,K\}$), and then decides which arm to select by a probabilistic vote: the probability of selecting arm $k$ is the sum of the trust probabilities of the children who voted for arm $k$. It could also be done the other way around: the aggregated bandit could first decide which child to listen to, and then trust it.

But we want to update the trust probabilities of all the child algorithms, not only one, whenever it was wise to trust them. Mathematically, when the aggregated bandit chooses to pull arm $k$ at step $t$, if it yielded a positive reward $r_{k,t}$, then the trust probability of every child algorithm $A_i$ that decided (independently) to choose $k$ (i.e., $a_{i,t} = k$) is increased multiplicatively: $p_i \leftarrow p_i \exp(+ \beta r_{k,t})$, where $\beta$ is a positive learning rate, e.g., $\beta = 0.1$.

It is also possible to multiplicatively decrease the trust of all the child algorithms that did not decide to choose arm $k$ at step $t$: if $a_{i,t} \neq k$, then $p_i \leftarrow p_i \exp(- \beta r_{k,t})$. I did not observe any difference in behavior between these two options (implemented with the Boolean parameter updateAllChildren).
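
To make the two update rules concrete, here is a minimal sketch of one trust update of such an aggregation scheme. It is only illustrative, and not the actual Aggregator code:

import numpy as np

def aggregation_step(trusts, votes, chosen_arm, reward, beta=0.1, update_all_children=False):
    """One trust update: votes[i] is the arm a_{i,t} proposed by child i, trusts is (p_i)_i."""
    trusts = np.asarray(trusts, dtype=float)
    for i, vote in enumerate(votes):
        if vote == chosen_arm:
            trusts[i] *= np.exp(+beta * reward)   # increase trust of children who voted for arm k
        elif update_all_children:
            trusts[i] *= np.exp(-beta * reward)   # optionally decrease trust of the others
    return trusts / trusts.sum()                  # renormalize to a probability vector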

Ensemble voting for MAB algorithms

This algorithm can be seen as the Multi-Armed Bandits (i.e., sequential reinforcement learning) counterpart of an ensemble voting technique, as used for classifiers or regression algorithms in usual supervised machine learning (see, e.g., sklearn.ensemble.VotingClassifier in scikit-learn).

Another approach could be to do some sort of grid search.

My algorithm: Aggregator

It is based on a modification of Exp4, and the details are given in its documentation, see Aggregator.

All the mathematical details can be found in my paper, [Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017], presented at the IEEE WCNC 2018 conference.


Configuration:

A simple python file, configuration_comparing_aggregation_algorithms.py, is used to import the arm classes, the policy classes and define the problems and the experiments.

For example, this will compare the classical MAB algorithms UCB, Thompson, BayesUCB and klUCB.

configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,  # Only Bernoulli is available as far as now
            "params": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.1]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": Thompson, "params": {} },
        {"archtype": klUCB, "params": {} },
        {"archtype": BayesUCB, "params": {} },
    ]
}

To add an aggregated bandit algorithm (Aggregator class), you can use this piece of code, to aggregate all the algorithms defined before and dynamically add it to configuration:

current_policies = configuration["policies"]
configuration["policies"] = current_policies + [
    {  # Add one Aggregator policy, from all the policies defined above
        "archtype": Aggregator,
        "params": {
            "learningRate": 0.05,  # Tweak this if needed
            "updateAllChildren": True,
            "children": current_policies,
        },
    }
]

The learning rate can also be tuned automatically, using the heuristic proposed by [Bubeck and Cesa-Bianchi, Theorem 4.2]: without knowledge of the horizon, use the decreasing learning rate $\eta_t = \sqrt{\frac{\log(N)}{t K}}$.
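
In code, this heuristic simply amounts to something like the following sketch, with N the number of children and K the number of arms:

import numpy as np

def decreasing_learning_rate(t, nb_children, nb_arms):
    """Time-dependent learning rate eta_t = sqrt(log(N) / (t * K)), for t >= 1."""
    return np.sqrt(np.log(nb_children) / (t * nb_arms))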


How to run the experiments ?

You should use the provided Makefile file to do this simply:

# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install  # install the requirements ONLY ONCE
make comparing_aggregation_algorithms   # run and log the main.py script

Some illustrations

Here are some plots illustrating the performances of the different policies implemented in this project, against various problems (with Bernoulli arms only):

On a “simple” Bernoulli problem (semi-log-y scale)

_images/main_semilogy____env1-4_932221613383548446.pngOn a "simple" Bernoulli problem (semi-log-y scale).

Aggregator is the most efficient, and very similar to Exp4 here.

On a “harder” Bernoulli problem

_images/main____env2-4_932221613383548446.pngOn a "harder" Bernoulli problem, they all have similar performances, except LearnExp.

They all have similar performances, except LearnExp, which performs badly. We can check that the problem is indeed harder as the lower-bound (in black) is much larger.

On an “easy” Gaussian problem

_images/main____env3-4_932221613383548446.pngOn an "easy" Gaussian problem, only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling.

Only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling. CORRAL and LearnExp clearly appear sub-efficient.

On a harder problem, mixing Bernoulli, Gaussian, Exponential arms

_images/main_semilogy____env4-4_932221613383548446.pngOn a harder problem, mixing Bernoulli, Gaussian, Exponential arms, with 3 arms of each types with the same mean.

This problem is much harder as it has 3 arms of each type with the same mean.

_images/main_semilogx____env4-4_932221613383548446.pngThe semi-log-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard "mixed" problem.

The semi-log-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard “mixed” problem.


Multi-players simulation environment

There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting.

The basic difference is about collisions: at each time $t$, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station's point of view and from each player's point of view.

Collision models

For example, I implemented these different collision models, in CollisionModels.py:

  • noCollision is a limited model where all the players sensing an arm sample it and receive the reward, even in case of collision. It corresponds to the single-player simulation: each player is a policy, compared without collision. This is for testing only, and is not very interesting.
  • onlyUniqUserGetsReward is a simple collision model where only the players alone on an arm sample it and receive the reward. This is the default collision model in the literature, see for instance [Shamir et al., 2015] (collision model 1) or [Liu & Zhao, 2009]. Our article also focuses on this model (a toy sketch of this rule is given after this list).
  • rewardIsSharedUniformly is similar: the players alone on an arm sample it and receive the reward, and in case of more than one player on the same arm, only one player (chosen uniformly at random by the base station) samples it and receives the reward.
  • closerUserGetsReward is similar but uses another approach to choose who can emit. Instead of randomly choosing the lucky player, it uses a given (or random) vector indicating the distance of each player to the base station (it can also indicate the quality of the communication), and when two (or more) players collide, only the one closest to the base station can transmit. It is the most physically plausible model.
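
As a toy illustration of the onlyUniqUserGetsReward rule referenced above, the following sketch computes the rewards of all players at one time step. It is not the real CollisionModels.py API, whose functions also update sensing and collision counters:

from collections import Counter

def only_unique_user_gets_reward_sketch(choices, draw_reward):
    """choices[j] is the arm sensed by player j; colliding players get no reward."""
    counts = Counter(choices)
    return [draw_reward(arm) if counts[arm] == 1 else 0.0 for arm in choices]

# Example: players 0 and 1 collide on arm 2 and get nothing, player 2 is alone on arm 5
# rewards = only_unique_user_gets_reward_sketch([2, 2, 5], my_mab.draw)  # my_mab.draw is hypothetical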

More details on the code

Have a look to:

Policies designed to be used in the multi-players setting


Configuration:

A simple python file, configuration_multiplayers.py, is used to import the arm classes, the policy classes and define the problems and the experiments. See the explanations given for the single-player case.

configuration["successive_players"] = [
    CentralizedMultiplePlay(NB_PLAYERS, klUCB, nbArms).children,
    RandTopM(NB_PLAYERS, klUCB, nbArms).children,
    MCTopM(NB_PLAYERS, klUCB, nbArms).children,
    Selfish(NB_PLAYERS, klUCB, nbArms).children,
    rhoRand(NB_PLAYERS, klUCB, nbArms).children,
]
  • The multi-players policies are added by giving a list of their children (e.g., Selfish(*args).children), which are instances of the proxy class ChildPointer. Each child's method calls are simply passed back to the mother class (the multi-players policy, e.g., Selfish), which can then handle the calls as it wants (centralized or not).

How to run the experiments ?

You should use the provided Makefile file to do this simply:

# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install            # install the requirements ONLY ONCE
make multiplayers       # run and log the main_multiplayers.py script
make moremultiplayers   # run and log the main_multiplayers_more.py script

Some illustrations of multi-players simulations

_images/MP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.pngplots/MP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.png

Figure 1 : Regret, $M=6$ players, $K=9$ arms, horizon $T=5000$, against $500$ problems $\mu$ uniformly sampled in $[0,1]^K$. rhoRand (top blue curve) is outperformed by the other algorithms (and the gain increases with $M$). MCTopM (bottom yellow) outperforms all the other algorithms in most cases.

_images/MP__K9_M6_T10000_N1000__4_algos__all_RegretCentralized_loglog____env1-1_8200873569864822246.pngplots/MP__K9_M6_T10000_N1000__4_algos__all_RegretCentralized_loglog____env1-1_8200873569864822246.png _images/MP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.pngplots/MP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.png

Figure 2 : Regret (in loglog scale), for $M=6$ players and $K=9$ arms, horizon $T=5000$, for $1000$ repetitions on the problem $\mu=[0.1,\ldots,0.9]$. RandTopM (yellow curve) outperforms Selfish (green), and both clearly outperform rhoRand. The regret of MCTopM is logarithmic, empirically with the same slope as the lower bound. The $x$ axis of the regret histograms has a different scale for each algorithm.

plots/MP__K9_M3_T123456_N100__8_algos__all_RegretCentralized_semilogy____env1-1_7803645526012310577.png

Figure 3 : Regret (in logy scale) for $M=3$ players and $K=9$ arms, horizon $T=123456$, for $100$ repetitions on the problem $\mu=[0.1,\ldots,0.9]$. With the parameters from their respective articles, MEGA and MusicalChair fail completely, even when MusicalChair knows the horizon.

Fairness vs. unfairness

For a multi-player policy, being fair means that on every simulation with $M$ players, each player accesses each of the $M$ best arms (about) the same amount of time. It is important to highlight that this has to be verified on each run of the MP policy: having this property on average is NOT enough.

  • For instance, the oracle policy OracleNotFair assigns each of the $M$ players to one of the $M$ best arms, orthogonally, but once they are assigned they always pull this arm. It is unfair because one player will be lucky and assigned to the best arm, while the others are unlucky. The centralized regret is optimal (null, on average), but it is not fair.
  • The other oracle policy, OracleFair, assigns to each of the $M$ players an offset corresponding to one of the $M$ best arms, orthogonally, and once they are assigned they cycle among the best $M$ arms. It is fair because every player pulls the $M$ best arms an equal number of times. And the centralized regret is also optimal (null, on average).
  • Usually, the Selfish policy is not fair: as each player is selfish and tries to maximize her personal reward, there is no reason for them to share the time on the $M$ best arms.
  • Conversely, the MusicalChair policy is not fair either, and cannot be: once each player has reached the last step, each one keeps choosing the same arm, orthogonally to the others, so they do not share the $M$ best arms.
  • The MEGA policy is designed to be fair: when players collide, they all have the same chance of leaving or staying on the arm, and they all sample the $M$ best arms equally.
  • The rhoRand policy is not designed to be fair on every run, but it is fair on average.
  • The same holds for our algorithms RandTopM and MCTopM, defined in RandTopM.

Doubling Trick for Multi-Armed Bandits

I studied what the Doubling Trick can and can't do for multi-armed bandits, to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandits algorithms.

The Doubling Trick algorithm, denoted $DT(A, (T_i))$ for a diverging increasing sequence $T_i$, is the following algorithm:

_images/DoublingTrick_algo1.pngPolicies/DoublingTrick.py
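
For intuition, here is a minimal Python sketch of this wrapper. It is not the real Policies.DoublingTrickWrapper; make_policy, next_horizon and draw are placeholder callables, and the policy API (choice, getReward, startGame) is the one described in the API section below.

def doubling_trick_sketch(make_policy, next_horizon, total_horizon, draw, full_restart=True):
    """Play for total_horizon steps, restarting (or re-parametrizing) the policy
    each time the current horizon guess T_i is reached."""
    horizon_i = next_horizon(0)                   # first guess T_0
    policy = make_policy(horizon_i)
    policy.startGame()
    for t in range(total_horizon):
        if t >= horizon_i:                        # end of the i-th sequence
            horizon_i = next_horizon(horizon_i)   # next guess T_{i+1}
            if full_restart:
                policy = make_policy(horizon_i)   # restart the policy from scratch
                policy.startGame()
            else:
                policy.horizon = horizon_i        # only update its horizon parameter
        arm = policy.choice()
        reward = draw(arm)                        # reward from the environment
        policy.getReward(arm, reward)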

Long story short, we proved the two following theorems.

For geometric sequences

It works for minimax regret bounds (in $R_T = \mathcal{O}(\sqrt{T})$), with a constant multiplicative loss $\leq 4$, but not for logarithmic regret bounds (in $R_T = \mathcal{O}(\log T)$).

_images/DoublingTrick_theorem1.pnghttps://hal.inria.fr/hal-01736357

For exponential sequences

It works for logarithmic regret bounds (in $R_T = \mathcal{O}(\log T)$), but not for minimax regret bounds (in $R_T = \mathcal{O}(\sqrt{T})$).

_images/DoublingTrick_theorem2.pnghttps://hal.inria.fr/hal-01736357


Article

I wrote a research article on that topic; as a self-contained document, it is a better introduction to this idea and the algorithms. Reference: [What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018].


Configuration

A simple python file, configuration_comparing_doubling_algorithms.py, is used to import the arm classes, the policy classes and define the problems and the experiments.

For example, we can compare the standard anytime klUCB algorithm against the non-anytime klUCBPlusPlus algorithm, as well as 3 versions of DoublingTrickWrapper applied to klUCBPlusPlus.

configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,
            "params": 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": klUCB, "params": {} },
        {"archtype": klUCBPlusPlus, "params": { "horizon": 10000 } },
    ]
}

Then, to add a Doubling-Trick bandit algorithm (the DoublingTrickWrapper class), you can use this piece of code:

configuration["policies"] += [
    {
        "archtype": DoublingTrickWrapper,
        "params": {
            "next_horizon": next_horizon,
            "full_restart": full_restart,
            "policy": BayesUCB,
        }
    }
    for full_restart in [ True, False ]
    for next_horizon in [
        next_horizon__arithmetic,
        next_horizon__geometric,
        next_horizon__exponential_fast,
        next_horizon__exponential_slow,
        next_horizon__exponential_generic
    ]
]

How to run the experiments ?

You should use the provided Makefile file to do this simply:

# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install  # install the requirements ONLY ONCE
make comparing_doubling_algorithms   # run and log the main.py script

Some illustrations

Here are some plots illustrating the performances of the different policies implemented in this project, against various problems (with Bernoulli and UnboundedGaussian arms only):

Doubling-Trick with restart, on a “simple” Bernoulli problem

_images/main____env1-1_1217677871459230631.pngDoubling-Trick with restart, on a "simple" Bernoulli problem

Regret for Doubling-Trick, for $K=9$ Bernoulli arms, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[0,1]^K$. Geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are too slow, and short first sequences make the regret blow up in the beginning of the experiment. At $t=40000$ we see clearly the effect of a new sequence for the best doubling trick ($T_i = 200 \times 2^i$). As expected, kl-UCB++ outperforms kl-UCB, and if the doubling sequence is growing fast enough then Doubling-Trick(kl-UCB++) can perform as well as kl-UCB++ (see for $t < 40000$).

Doubling-Trick with restart, on randomly taken Bernoulli problems

_images/main____env1-1_3633169128724378553.pngDoubling-Trick with restart, on randomly taken Bernoulli problems

Similarly, but for $\mu_1,\ldots,\mu_K$ evenly spaced in $[0,1]^K$ ($\{0.1,\ldots,0.9\}$). Both kl-UCB and kl-UCB++ are very efficient on “easy” problems like this one, and we can check visually that they match the lower bound from Lai & Robbins (1985). As before, we check that slow doubling sequences are too slow to give reasonable performance.

Doubling-Trick with restart, on randomly taken Gaussian problems with variance $V=1$

_images/main____env1-1_2223860464453456415.pngDoubling-Trick with restart, on randomly taken Gaussian problems with variance V=1

Regret for $K=9$ Gaussian arms $\mathcal{N}(\mu, 1)$, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[-5,5]^K$, with variance $V=1$. On “hard” problems like this one, both UCB and AFHG perform similarly and poorly w.r.t. the lower bound from Lai & Robbins (1985). As before, we check that geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are too slow, but a fast enough doubling sequence does give reasonable performance for the anytime AFHG obtained by Doubling-Trick.

Doubling-Trick with restart, on an easy Gaussian problem with variance $V=1$

_images/main____env1-1_6979515539977716717.pngDoubling-Trick with restart, on an easy Gaussian problems with variance V=1

Regret for Doubling-Trick, for $K=9$ Gaussian arms $\mathcal{N}(\mu, 1)$, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ uniformly spaced in $[-5,5]^K$. On “easy” problems like this one, both UCB and AFHG perform similarly and attain near-constant regret (identifying the best Gaussian arm is very easy here as they are sufficiently distinct). Each doubling trick also appears to attain near-constant regret, but geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are slower to converge and thus less efficient.

Doubling-Trick with no restart, on randomly taken Bernoulli problems

_images/main____env1-1_5964629015089571121.pngDoubling-Trick with no restart, on randomly taken Bernoulli problems

Regret for $K=9$ Bernoulli arms, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[0,1]^K$, for Doubling-Trick no-restart. Geometric doubling (e.g., $b=2$) and slow exponential doubling (e.g., $b=1.1$) are too slow, and short first sequences make the regret blow up in the beginning of the experiment. At $t=40000$ we see clearly the effect of a new sequence for the best doubling trick ($T_i = 200 \times 2^i$). As expected, kl-UCB++ outperforms kl-UCB, and if the doubling sequence grows fast enough then Doubling-Trick no-restart for kl-UCB++ can perform as well as kl-UCB++.

Doubling-Trick with no restart, on a “simple” Bernoulli problem

_images/main____env1-1_5972568793654673752.pngDoubling-Trick with no restart, on an "simple" Bernoulli problems

$K=9$ Bernoulli arms with $\mu_1,\ldots,\mu_K$ evenly spaced in $[0,1]^K$. On easy problems like this one, both kl-UCB and kl-UCB++ are very efficient, and here the geometric doubling sequence allows the Doubling-Trick no-restart anytime version of kl-UCB++ to outperform both kl-UCB and kl-UCB++.


Structure and Sparsity of Stochastic Multi-Armed Bandits

This page explains shortly what I studied about sparse stochastic multi-armed bandits. Assume a MAB problem with $K$ arms, each parametrized by its mean $\mu_k\in\mathbb{R}$. If you know in advance that only a small subset (of size $s$) of the arms have a positive mean, it sounds reasonable to hope to be more efficient in playing the bandit game, compared to an approach which is not aware of the sparsity.

The SparseUCB policy is an extension of the well-known UCB, and requires to know exactly the value of $s$. It works by identifying as fast as possible (actually, in a sub-logarithmic number of samples) the arms with non-positive means. Then it only plays the “good” arms with positive means, with a regular UCB policy.

I studied extensions of this idea, first of all the SparseklUCB policy as it was suggested in the original research paper, but mainly a generic “wrapper” black-box approach. For more details, see SparseWrapper.


Article

TODO finish! I am writing a small research article on that topic; as a self-contained document, it will be a better introduction to this idea and the algorithms. Reference: [Structure and Sparsity of Stochastic Multi-Arm Bandits, Lilian Besson and Emilie Kaufmann, 2018].

Example of simulation configuration

A simple python file, configuration_sparse.py, is used to import the arm classes, the policy classes and define the problems and the experiments.

For example, we can compare the standard UCB and BayesUCB algorithms, non aware of the sparsity, against the sparsity-aware SparseUCB algorithm, as well as 4 versions of SparseWrapper applied to BayesUCB.

configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {   # sparsity = number of arms with positive mean, = 3 here
            "arm_type": Bernoulli,
            "params": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": SparseUCB, "params": { "sparsity": 3 } },
        {"archtype": BayesUCB, "params": { } },
    ]
}

Then, to add a Sparse-Wrapper bandit algorithm (the SparseWrapper class), you can use this piece of code:

configuration["policies"] += [
    {
        "archtype": SparseWrapper,
        "params": {
            "policy": BayesUCB,
            "use_ucb_for_set_J": use_ucb_for_set_J,
            "use_ucb_for_set_K": use_ucb_for_set_K,
        }
    }
    for use_ucb_for_set_J in [ True, False ]
    for use_ucb_for_set_K in [ True, False ]
]

How to run the experiments ?

You should use the provided Makefile file to do this simply:

make install  # install the requirements ONLY ONCE
make sparse   # run and log the main.py script

Some illustrations

Here are some plots illustrating the performances of the different policies implemented in this project, against various sparse problems (with Bernoulli or UnboundedGaussian arms only):

3 variants of Sparse-Wrapper for UCB, on a “simple” sparse Bernoulli problem

plots/main____env1-1_XXX.png3 variants of Sparse-Wrapper for UCB, on a "simple" sparse Bernoulli problem

FIXME run some simulations and explain them!


Non-Stationary Stochastic Multi-Armed Bandits

A well-known and well-studied variant of the stochastic Multi-Armed Bandits is the so-called Non-Stationary Stochastic Multi-Armed Bandits. I give here a short introduction, with references below. If you are in a hurry, please read the first two pages of this recent article instead (arXiv:1802.08380).

  • The first studied variant considers piece-wise stationary problems, also referred to as abruptly changing, where the distributions of the $K$ arms are stationary on some intervals $[T_i,\ldots,T_{i+1}]$ with some abrupt change points $(T_i)$.
    • It is always assumed that the locations of the change points are unknown to the user, otherwise the problem is not harder: just play your favorite algorithm and restart it at each change point.
    • The change points can be fixed or randomly generated, but it is assumed that they are generated by a random source oblivious to the user's actions, so we can always consider that they were already generated before the game starts.
    • For instance, Arms.geometricChangePoints() generates change points by assuming that at every time step $t=1,\ldots,T$, there is a (small) probability p of having a change point (a toy sketch is given after this list).
    • The number of change points is usually denoted $L$ or $\Upsilon_T$, and should not be a constant w.r.t. $T$ (otherwise, when $T\to\infty$, only the last segment counts and gives a stationary problem, so it is not harder). Some algorithms require to know the value of $\Upsilon_T$, or at least an upper-bound, and some algorithms try to be efficient without knowing it (this is what we want!).
    • The goal is to have an efficient algorithm, but of course if $\Upsilon_T = \mathcal{O}(T)$ the problem is too hard to hope to be efficient and any algorithm will suffer a linear regret (i.e., be as efficient as a naive random strategy).
  • Another variant is the slowly varying problem, where the reward $r(t) = r_{A(t),t}$ is sampled at each time from a parametric distribution whose parameter(s) change at each time step (usually the distribution is parametrized by its mean). If we focus on 1D exponential families, or any family of distributions parametrized by their mean $\mu$, we denote this by $r(t) \sim D(\mu_{A(t)}(t))$, where $\mu_k(t)$ can vary with time. The slowly varying hypothesis is that every time step can be a break point, and that the speed of change $|\mu_k(t+1) - \mu_k(t)|$ is bounded.
  • Other variants include harder settings.
    • For instance, we can consider that an adversary is deciding the change points, adaptively to the user's actions. I consider it harder, as always with adversarial problems, and not very useful to model real-world problems.
    • Another harder setting is a “pseudo-Markovian rested” point of view: the mean (or parameters) of an arm's distribution can change only when it is sampled, either from time to time or at each time step. It makes sense for some applications, for instance Julien's work (in the SequeL Inria team), but for others it doesn't really make sense (e.g., cognitive radio applications).

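As a toy example of the geometric change-point generation mentioned in the list above (Arms.geometricChangePoints()), here is a short sketch; the function name and signature below are illustrative only:

import numpy as np

def geometric_change_points_sketch(horizon, proba, seed=None):
    """Each time step t = 1, ..., horizon-1 is a change point with (small) probability proba."""
    rng = np.random.default_rng(seed)
    return [t for t in range(1, horizon) if rng.random() < proba]

# Example: about horizon * proba change points on average
# print(geometric_change_points_sketch(5000, 0.001, seed=42))
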
TODO fix notations more precisely, include definitions! TODO what are the lower-bounds given in the more recent articles?

Applications

TL;DR: the world is non stationary, so it makes sense to study this!

TODO write more justifications about applications, mainly for IoT networks (like when I studied multi-player bandits).

References

Here is a partial list of references on this topic. For more, a good starting point is to read the references given in the mentioned article, as always.

Main references
  1. It is not about non-stationary but about non-stochastic (i.e., adversarial) bandits, but it can be a good read for the curious reader. [“The Non-Stochastic Multi-Armed Bandit Problem”. P. Auer, N. Cesa-Bianchi, Y. Freund and R. Schapire. SIAM Journal on Computing, 32(1), 48-77, 2002].
  2. The Sliding-Window and Discounted UCB algorithms were given in [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”. Aurélien Garivier and Éric Moulines, ALT 2011].
    • They are implemented in Policies.SlidingWindowUCB.SWUCB and Policies.DiscountedUCB.
    • Note that I also implemented the non-anytime heuristic given by the authors, Policies.SlidingWindowUCB.SWUCBPlus, which uses the knowledge of the horizon $T$ to try to guess a correct value for the sliding window size $\tau$.
    • I implemented this sliding window idea in a generic way, and Policies.SlidingWindowRestart is a generic wrapper that can work with (almost) any algorithm: it is an experimental policy, using a sliding window (of, for instance, $\tau=100$ draws of each arm), and resetting the underlying algorithm as soon as the small-window empirical average is too far away from the long-history empirical average (or restarting only one arm, if possible).
  3. [“Thompson sampling for dynamic multi-armed bandits”. N Gupta,. OC Granmo, A. Agrawala, 10th International Conference on Machine Learning and Applications Workshops. IEEE, 2011]
  4. [“Stochastic multi-armed-bandit problem with non-stationary rewards”, O. Besbes, Y. Gur, A. Zeevi. Advances in Neural Information Processing Systems (pp. 199-207), 2014]
  5. [“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017] introduced the CUSUM-UCB and PHT-UCB algorithms.
  6. [“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018] introduced the M-UCB algorithm.
Recent references

More recent articles include the following:

  1. [“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”. L. Wei and V. Srivastava. arXiv preprint arXiv:1802.08380, 2018] introduced the first algorithms that can (try to) tackle the two problems simultaneously, LM-DSEE and SW-UCB#.
    • They require to know the rate of change but not the number of changes. They either assume that the number of break points $\Upsilon_T$ is $\mathcal{O}(T^\nu)$ for some $\nu\in(0,1)$ (for the abruptly-changing case), or that the rate of change is $\max_t |\mu_{t+1} - \mu_{t}| \leq \varepsilon_T = \mathcal{O}(T^{-\kappa})$ (for the slowly-varying case). In both cases, their model assumes knowledge of $\nu$ or $\kappa$, or an upper-bound on it.
    • One advantage of their algorithms is their simplicity and ability to tackle both cases!
  2. [“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner. EWRL 2018, Lille], introduced the AdSwitch algorithm, which does not require to know the number $\Upsilon_T$ of change points.
    • One should check how to adapt it to $K\geq2$ arms and not just $K=2$ (it shouldn't be hard).
    • TODO adapt it to an unknown horizon (using doubling tricks?!).
  3. [“Memory Bandits: a Bayesian approach for the Switching Bandit Problem”. Réda Alami, Odalric Maillard, Raphaël Féraud. 31st Conference on Neural Information Processing Systems (NIPS 2017), hal-01811697], introduced the MemoryBandit algorithm, which does not require to know the number $\Upsilon_T$ of change points.
    • They use a generic idea of expert aggregation with an efficient tracking of a growing number of experts. The basic idea is the following: a new expert is started at every time step, and after a breakpoint, the expert started just after the breakpoint will essentially be the most efficient one (and efficient tracking is needed to identify it).
    • Their MemoryBandit algorithm is very efficient empirically, but it is not easy to implement and it requires a large memory (although some discussion is given in their article's appendix, where they evoke a heuristic that reduces the storage requirement).
  4. 🇫🇷 [“Algorithme de bandit et obsolescence : un modèle pour la recommandation”. Jonhathan Louëdec, Laurent Rossi, Max Chevalier, Aurélien Garivier and Josiane Mothe. 18ème Conférence francophone sur l’Apprentissage Automatique, 2016 (Marseille, France)] (in French) introduces and justifies possible applications of slowly-varying bandits to recommender systems. They study and present a model with an exponential decrease of the means, and the FadingUCB algorithm, which is efficient if a bound on the speed of the exponential decrease is known.

Example of simulation configuration

A simple python file, configuration_nonstationary.py, is used to import the arm classes, the policy classes and define the problems and the experiments. The main.py file is used to import the configuration and launch the simulations.

For example, we can compare the standard UCB and Thompson algorithms, not aware of the non-stationarity, against the non-stationarity-aware DiscountedUCB and SWUCB, and the efficient DiscountedThompson algorithm.

We also include our algorithm Bernoulli-GLR-UCB using kl-UCB, and compare it with CUSUM-UCB and M-UCB, the two other state-of-the-art actively adaptive algorithms.

horizon = 5000
change_points = [0, 1000, 2000, 3000, 4000]
nb_random_events = len(change_points) - 1 # t=0 is not a change-point
list_of_means = [
    [0.4, 0.5, 0.9], # from 0 to 1000
    [0.5, 0.4, 0.7], # from 1000 to 2000
    [0.6, 0.3, 0.5], # from 2000 to 3000
    [0.7, 0.2, 0.3], # from 3000 to 4000
    [0.8, 0.1, 0.1], # from 4000 to 5000
]

configuration = {
    "horizon": horizon,    # Finite horizon of the simulation
    "repetitions": 1000,  # number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [     # Bernoulli arms with non-stationarity
        {   # A non stationary problem: every step of the same repetition use a different mean vector!
            "arm_type": Bernoulli,
            "params": {
                "listOfMeans": list_of_means,
                "changePoints": change_points,
            }
        },
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        { "archtype": klUCB, "params": {} },
        { "archtype": Thompson, "params": {} },
        { "archtype": OracleSequentiallyRestartPolicy, "params": {
            "policy": klUCB,
            "changePoints": change_points,
            "list_of_means": list_of_means,
            "reset_for_all_change": True,
            "reset_for_suboptimal_change": False,
        }},
        { "archtype": SWklUCB, "params": { "tau":  # formula from [GarivierMoulines2011]
            2 * np.sqrt(horizon * np.log(horizon) / (1 + nb_random_events))
        } },
        { "archtype": DiscountedklUCB, "params": { "gamma": 0.95 } },
        { "archtype": DiscountedThompson, "params": { "gamma": 0.95 } },
        { "archtype": Monitored_IndexPolicy, "params": {
            "horizon": horizon, "policy": klUCB, "w": 150,
        } },
        { "archtype": CUSUM_IndexPolicy, "params": {
            "horizon": horizon, "policy": klUCB, "w": 150, "max_nb_random_events": nb_random_events, "lazy_detect_change_only_x_steps": 10, # Delta n to speed up
        } } ] + [
        { "archtype": BernoulliGLR_IndexPolicy_WithDeterministicExploration,
        "params": {
            "horizon": horizon, "policy": klUCB_forGLR, "max_nb_random_events": nb_random_events,
            "lazy_detect_change_only_x_steps": 10, # Delta n to speed up
            "lazy_try_value_s_only_x_steps": 10, # Delta s
            "per_arm_restart": per_arm_restart,
        } }
        for per_arm_restart in [True, False]
    ]
}

How to run the experiments ?

You should use the provided Makefile file to do this simply:

# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install         # install the requirements ONLY ONCE

Then modify the configuration_nonstationary.py file to specify the algorithms you want to compare (use the snippet above for inspiration), and run:

make nonstationary   # run and log the main.py script

There are a couple of different piece-wise stationary problems that we implemented for our article, and you can use environment variables to choose which experiment to run. For instance, to run problems 1 and 2, with horizon T=5000, N=1000 repetitions, using 4 cores, run:

PROBLEMS=1,2 T=5000 N=1000 N_JOBS=4 DEBUG=False SAVEALL=True make nonstationary

Some illustrations

Here are some plots illustrating the performances of the different policies implemented in this project, against various non-stationary problems (with Bernoulli arms only).

History of means for this simple problem

We consider a simple piece-wise stationary problem, with $K=3$ arms, a time horizon $T=5000$ and $N=1000$ repetitions. Each change concerns only one arm at a time, and there are $\Upsilon_T=4$ changes, at times $1000,2000,3000,4000$ ($C_T=\Upsilon_T=4$).

_images/NonStationary_example_HistoryOfMeans.pngplots/NonStationary_example_HistoryOfMeans.png

Figure 1 : history of means $\mu_i(t)$ for the $K=3$ arms. There is only one change of the optimal arm.

The next figures were obtained with the following command (at the date of writing, 31st of January 2019):

PROBLEMS=1 T=5000 N=1000 N_JOBS=4 DEBUG=False SAVEALL=True make nonstationary
Comparison of different algorithms

By using the configuration snippet shown above, we compare 9 algorithms. The plots below show how they perform. Our proposal is GLR-klUCB, with two options for Local or Global restarts (Generalized Likelihood Ratio test + klUCB), and it outperforms all the previous state-of-the-art approaches.

_images/NonStationary_example_Regret.pngplots/NonStationary_example_Regret.png

Figure 2 : plot of the mean regret $R_t$ as a function of the current time step $t$, for the different algorithms.

_images/NonStationary_example_BoxPlotRegret.pngplots/NonStationary_example_BoxPlotRegret.png

Figure 3 : box plot of the regret at $T=5000$, for the different algorithms.

_images/NonStationary_example_HistogramsRegret.pngplots/NonStationary_example_HistogramsRegret.png

Figure 4 : plot of the histograms of the regret at $T=5000$, for the different algorithms.
Comparison of time and memory consumptions

_images/NonStationary_example_RunningTimes.pngplots/NonStationary_example_RunningTimes.png

Figure 5 : comparison of the running times. Our approach, like the other actively adaptive approaches, is slower, but drastically more efficient in terms of regret!

_images/NonStationary_example_MemoryConsumption.pngplots/NonStationary_example_MemoryConsumption.png

Figure 6 : comparison of the memory consumption. Our approach, like the other actively adaptive approaches, is more costly in memory, but drastically more efficient in terms of regret!

Article?

Not yet! We are working on this! TODO

Short documentation of the API

This short document aims at documenting the API used in my SMPyBandits environment, and at closing issue #3.

Code organization

Layout of the code:

  • Arms/ contains the arm classes (one class per distribution, e.g., Bernoulli, Gaussian),
  • Policies/ contains the single-player policies,
  • PoliciesMultiPlayers/ contains the multi-players policies,
  • Environment/ contains the evaluation classes (Evaluator, EvaluatorMultiPlayers, Result, plotsettings, etc.),
  • main*.py and configuration*.py are the entry points and configuration files of the simulations.

UML diagrams
For more details, see these UML diagrams.

Question: How to change the simulations?

To customize the plots
  1. Change the default settings defined in Environment/plotsettings.py.
To change the configuration of the simulations
  1. Change the config file, i.e., configuration.py for single-player simulations, or configuration_multiplayers.py for multi-players simulations.
  2. A good example of a very simple configuration file is given in very_simple_configuration.py.
To change how the results are exploited
  1. Change the main script, i.e., main.py for single-player simulations, main_multiplayers.py for multi-players simulations. Some plots can be disabled or enabled by commenting a few lines, and some options are given as flags (constants in the beginning of the file).
  2. If needed, change, improve or add some methods to the simulation environment class, i.e., Environment.Evaluator for single-player simulations, and Environment.EvaluatorMultiPlayers for multi-players simulations. They use a class to store their simulation result, Environment.Result and Environment.ResultMultiPlayers.

Question: How to add something to this project?

In other words, what’s the API of this project?
For a new arm
  1. Make a new file, e.g., MyArm.py
  2. Save it in Arms/
  3. The file should contain a class of the same name, inheriting from Arms/Arm, e.g., like this class MyArm(Arm): ... (no need for any super call)
  4. This class MyArm has to have at least an __init__(...) method to create the arm object (with or without arguments, named or not); a __str__ method to print it as a string; a draw(t) method to draw a reward from this arm (t is the time, which can be used or not); and it should have a mean() method that gives/computes the mean of the arm.
  5. Finally, add it to the Arms/__init__.py file: from .MyArm import MyArm
  • For example, use this template:
from .Arm import Arm

class MyArm(Arm):
    def __init__(self, *args, **kwargs):
        pass  # TODO Finish this method that initializes the arm MyArm

    def __str__(self):
        return "MyArm({})".format('...')  # TODO

    def draw(self, t=None):
        pass  # TODO Simulate a pull of this arm. t might be used, but not necessarily

    def mean(self):
        pass  # TODO Return the mean of this arm

For a new (single-user) policy
  1. Make a new file, e.g., MyPolicy.py
  2. Save it in Policies/
  3. The file should contain a class of the same name, it can inherit from Policies/IndexPolicy if it is a simple index policy, e.g., like this, class MyPolicy(IndexPolicy): ... (no need for any super call), or simply like class MyPolicy(object): ...
  4. This class MyPolicy has to have at least an __init__(nbArms, ...) method to create the policy object (with or without arguments, named or not), with at least the parameter nbArms (the number of arms); a __str__ method to print it as a string; a choice() method to choose an arm (an index among 0, ..., nbArms - 1, e.g., at random, or based on a maximum index if it is an index policy); a getReward(arm, reward) method called when the arm arm gave the reward reward; and finally a startGame() method (possibly empty), which is called when a new simulation is run.
  5. Optionally, a policy class can have a handleCollision(arm) method to handle a collision after choosing the arm arm (eg. update an internal index, change a fixed offset etc).
  6. Finally, add it to the Policies/__init__.py file: from .MyPolicy import MyPolicy
  • For example, use this template:
import random

class MyPolicy(object):
    def __init__(self, nbArms, *args, **kwargs):
        self.nbArms = nbArms
        # TODO Finish this method that initializes the policy MyPolicy

    def __str__(self):
        return "MyPolicy({})".format('...')  # TODO

    def startGame(self):
        pass  # Can be non-trivial, TODO if needed

    def getReward(self, arm, reward):
        # TODO After the arm 'arm' has been pulled, it gave the reward 'reward'
        pass  # Can be non-trivial, TODO if needed

    def choice(self):
        # TODO Do a smart choice of arm
        return random.randint(0, self.nbArms - 1)

    def handleCollision(self, arm):
        pass  # Can be non-trivial, TODO if needed
Other choice...() methods can be added, if this policy MyPolicy has to be used for multiple play, ranked play, etc.

For a new multi-users policy
  1. Make a new file, e.g., MyPoliciesMultiPlayers.py
  2. Save it in PoliciesMultiPlayers/
  3. The file should contain a class, of the same name, e.g., like this, class MyPoliciesMultiPlayers(object):
  4. This class MyPoliciesMultiPlayers has to have at least an __init__ method to create the multi-players policy object; a __str__ method to print it as a string; and a children attribute that gives a list of players (single-player policies).
  5. Finally, add it to the PoliciesMultiPlayers/__init__.py file: from .MyPoliciesMultiPlayers import MyPoliciesMultiPlayers
For examples, see PoliciesMultiPlayers.OracleNotFair and PoliciesMultiPlayers.OracleFair for full-knowledge centralized policies (fair or not), and PoliciesMultiPlayers.CentralizedFixed and PoliciesMultiPlayers.CentralizedCycling for non-full-knowledge centralized policies (fair or not). There is also the PoliciesMultiPlayers.Selfish decentralized policy, where all players run without any knowledge of the number of players, and without any communication (fully decentralized).
PoliciesMultiPlayers.Selfish is the simplest possible example I could give as a template.

About parallel computations

This short page explains quickly how we used multi-core computations to speed up the simulations in SMPyBandits.

Nowadays, parallelism is everywhere in the computational world, and any serious framework for numerical simulations must explore at least one of the three main approaches to (try to) gain performance from parallelism.

For all the different numerical simulations for which SMPyBandits is designed, the setting is the same: we consider a small set of p different problems, of time horizon T, that we want to simulate for N independent runs (e.g., p=6, T=10000 and N=100). On the one hand, because of the fundamentally sequential nature of bandit games, each repetition of the simulation must be sequential with respect to the time steps t=1,…,T, so no parallelism can be used to speed up this axis. On the other hand, parallelism can help greatly for the two other axes: if we have a way to run 4 processes in parallel, and we have p=4 problems to simulate, then running one process for each problem directly brings a speed-up factor of 4. Similarly, if we want to run 100 repetitions of the same (random) problem, and we can run 4 processes in parallel, then running 100/4=25 repetitions on each process also brings a speed-up factor of 4.

In this page, we quickly review the chosen approach for SMPyBandits (multi-core on one machine), and we explain why the two other approaches were less appropriate for our study of multi-armed bandit problems.

What we did implement: Joblib for multi-core simulations.

The first approach is to use multiple cores of the same machine, and because it is both the simplest and the least costly, financially as well as ecologically, this is the approach implemented in SMPyBandits. The machines I had access to during my thesis, either my own laptop or a workstation hosted by the SCEE team on the CentraleSupélec campus, were equipped with i5 or i7 Intel CPUs with 4 or 12 cores.

As explained in the page How_to_run_the_code.html, we implemented in SMPyBandits an easy way to run any simulation on n cores of a machine, using the Joblib library. It is implemented in a completely transparent way: if one uses the command-line variables to configure experiments, switching from one core to all the cores of the machine only requires changing N_JOBS=1 to N_JOBS=-1, like in this example.

 BAYES=False ARM_TYPE=Bernoulli N=100 T=10000 K=9 N_JOBS=1 \
  python3 main.py configuration.py

As long as the number of jobs (N_JOBS here) is less than or equal to the number of physical cores of the computer's CPU, the final speed-up in terms of total computation runtime is almost optimal.
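
Under the hood, this corresponds to the standard Joblib pattern of distributing independent repetitions over worker jobs. A minimal sketch (the one_repetition function below is a placeholder, not the real Environment.Evaluator code):

import numpy as np
from joblib import Parallel, delayed

def one_repetition(seed, horizon=10000):
    """Placeholder for one independent repetition of a bandit simulation."""
    rng = np.random.default_rng(seed)
    return rng.random(horizon).sum()   # the real code would run the full bandit loop here

# Distribute N=100 repetitions over all available cores (n_jobs=-1), as with N_JOBS=-1
results = Parallel(n_jobs=-1)(delayed(one_repetition)(seed) for seed in range(100))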

But jobs are implemented as threads, so the speed-up cannot exceed the number of cores, and using, for instance, 20 jobs on 4 cores for 20 repetitions is sub-optimal: the CPU will essentially spend all its time (and memory) managing the different jobs, and not actually doing the simulations. Using the above example, we illustrate the effect of using multiple jobs and multiple cores on the time efficiency of simulations using SMPyBandits. We consider three values of N_JOBS: 1 to use only one core and one job, 4 to use all the 4 cores of my i5 Intel CPU, and 20 to use more jobs than cores.

We give in the table below an example of running times of an experiment with T=1000, for different numbers of repetitions and numbers of jobs. It clearly illustrates that using more jobs than the number of CPU cores is sub-optimal, and that as soon as the number of repetitions is large enough, using one job per available CPU core (i.e., here 4 jobs) gives a significant speed-up. Due to the cost of orchestrating the different jobs, and of memory exchanges at the end of each repetition, the parallel version is not 4 times faster, but empirically we always found it to be 2 to 3.5 times faster.

For a simulation with 9 different algorithms, K=9 arms and a time horizon of T=10000, we illustrate the effect on the running time of using N_JOBS jobs in parallel, for different numbers of repetitions and for N_JOBS equal to 1, 4 (= number of cores) and 20 (> number of cores):

  • 1 repetition: 15 seconds, 26 seconds, 43 seconds
  • 10 repetitions: 87 seconds, 51 seconds, 76 seconds
  • 100 repetitions: 749 seconds, 272 seconds, 308 seconds
  • 500 repetitions: 2944 seconds, 1530 seconds, 1846 seconds

_images/About_parallel_computations.pngThe table above shows the effect on the running time of using N_JOBS jobs in parallel, for a simulation with 9 different algorithms, for K=9 arms, a time horizon of T=10000.


Approaches we did not try

The two other approaches we could have considered are parallel computations running not on multiple cores but on multiple machines, in a computer cluster, and parallel computations running on a Graphics Processing Unit (GPU).

GPU

I did not try to add to SMPyBandits the possibility to run simulations on a GPU, or to use any general-purpose computation library offering a GPU backend. Initially designed for graphical simulations and mainly for video-game applications, the use of GPUs for scientific computations has been gaining attention in the research world over the last 15 years, and NVidia CUDA for GPGPU (General Purpose GPU) computing started to become popular around 2011. Since 2016, we have seen large press coverage as well as extensive use in research of deep learning libraries that let general-purpose machine learning algorithms train on the GPU of a user's laptop or on a cluster of GPUs. This success is mainly possible because of the heavy parallelism of such training algorithms, and the parallel nature of GPUs. To the best of the author's knowledge, nobody has tried to implement high-performance MAB simulations by using the “parallelism power” of a GPU (at least, no code for such experiments was made public in 2019).

I worked on a GPU, implementing fluid dynamics simulations during an internship in 2012, and I have since kept a curiosity about GPU-powered libraries and code. I have contributed to and used famous deep-learning libraries, like Theano or Keras, and my limited knowledge of such libraries made me believe that using a GPU for bandit simulations would not be easy, and most surely would not have been worth the time.

I would be very curious to understand how a GPU could be used to implement highly efficient simulations for sequential learning problems, because it seemed hard whenever I thought about it.

Large scale cluster

I also did not try to use any large-scale computer cluster, even though I was aware of the possibility offered by the Grid 5000 project, for instance. This is partly due to time constraints, as I would have been curious to try, but mainly because we found that a large-scale cluster would not have helped us much. The main reason is that in the multi-armed bandit and sequential learning literature, most research papers do not even include an experimental section, and for the papers that did take the time to implement and test their proposed algorithms, it is almost always done on just a few problems and for short- or medium-duration experiments.

For instance, the papers we consider to have the best empirical sections are Liu & Lee & Shroff, 2017, arXiv:1711.03539 and Cao & Wen & Kveton & Xie, 2018, arXiv:1802.03692, both for piece-wise stationary bandits, and they mainly consider reasonable problems of horizon T=10000 and no more than 1000 independent repetitions. Each paper considers one harder problem, of horizon T=1000000 and fewer repetitions.

In each article written during my thesis, we included extensive numerical simulations, and even the longest ones (for Besson & Kaufmann, 2019, HAL-02006471) were short enough to run in less than 12 hours on a 12-core workstation, so we could run a few large-scale simulations overnight. For such reasons, we preferred not to run simulations on a cluster.

Other ideas?

And you, dear reader, do you have any idea of a technology I should have tried? If so, please file an issue on GitHub! Thanks!


:boom: TODO

For others things to do, and issues to solve, see the issue tracker on GitHub.

Publicly release it and document it - OK

Other aspects

  • [x] publish on GitHub!

Presentation paper

  • [x] A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience
  • [x] A clear statement of need that illustrates the purpose of the software
  • [x] A list of key references including a link to the software archive
  • [x] Mentions (if applicable) of any ongoing research projects using the software or recent scholarly publications enabled by it

Clean up things - OK

Initial things to do! - OK

Improve and speed-up the code? - OK


More single-player MAB algorithms? - OK

Contextual bandits?

  • [ ] I should try to add support for (basic) contextual bandits.

Better storing of the simulation results

  • [ ] use HDF5 (with h5py) to store the data on the fly, so that no data is lost even if the simulation gets killed (see the sketch after this list).
  • [ ] even more “secure”: be able to interrupt the simulation, save its state, and then load it back if needed (for instance if you want to leave the office for the weekend).
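
These are only ideas for now; as a purely illustrative sketch (not what SMPyBandits currently does), saving results on the fly could rely on a resizable h5py dataset that is flushed after each repetition:

import h5py
import numpy as np

HORIZON, REPETITIONS = 10000, 100  # hypothetical sizes

with h5py.File("results_incremental.hdf5", "w") as f:
    # resizable dataset: one row appended per finished repetition
    regrets = f.create_dataset("cumulatedRegret", shape=(0, HORIZON),
                               maxshape=(None, HORIZON), dtype="f8")
    for rep in range(REPETITIONS):
        regret_curve = np.cumsum(np.random.rand(HORIZON))  # placeholder simulation
        regrets.resize(rep + 1, axis=0)
        regrets[rep, :] = regret_curve
        f.flush()  # data is already on disk, even if the run gets killed later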

Multi-players simulations - OK

Other Multi-Player algorithms
Dynamic settings
  • [ ] add the possibility to have a varying number of dynamic users in multi-player simulations…
  • [ ] implement the experiments from the [Musical Chair] and [rhoRand] articles, and Navik Modi’s experiments?

C++ library / bridge to C++

  • [ ] Finish writing a perfectly clean CLI client to my Python server
  • [ ] Write a small library that can be included in any other C++ program to: 1. start the socket connection to the server, 2. then play one step at a time
  • [ ] Check that the library can be used within a GNU Radio block!

Some illustrations for this project

Here are some plots illustrating the performance of the different policies implemented in this project, on various problems (with Bernoulli arms only):

Histogram of regrets at the end of some simulations

On a simple Bernoulli problem, we can compare 16 different algorithms (on a short horizon and a small number of repetitions, just as an example). If we plot the distribution of the regret at the end of each experiment, R_T, we can see this kind of plot:

_images/Histogramme_regret_monoplayer_2.pngHistogramme_regret_monoplayer_2.png

It helps a lot to see both the mean value of the regret (in solid black) and its distribution over a few runs (100 here). It can be used to detect algorithms that perform well on average but sometimes have really bad runs. Here, Exp3++ seems to have had one bad run.


Demonstration of different Aggregation policies

On a fixed Gaussian problem, we aggregate some algorithms tuned for this exponential family (i.e., they know the variance but not the means). Our algorithm, Aggregator, outperforms its ancestor Exp4 as well as the other state-of-the-art expert-aggregation algorithms, CORRAL and LearnExp.

_images/main____env3-4_932221613383548446.pngmain____env3-4_932221613383548446.png


Demonstration of multi-player algorithms

Regret plot on a random Bernoulli problem, with M=6 players accessing independently and in a decentralized way K=9 arms. Our algorithms (RandTopM and MCTopM, as well as Selfish) outperform the state-of-the-art rhoRand:

_images/MP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.pngMP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.png

Histogram on the same random Bernoulli problems. We see that all algorithms have a non-negligible variance on their regrets.

_images/MP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.pngMP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.png

Comparison with two other “state-of-the-art” algorithms (MusicalChair and MEGA), in semilogy scale to really see the different scales of regret between efficient and sub-optimal algorithms:

_images/MP__K9_M3_T123456_N100__8_algos__all_RegretCentralized_semilogy____env1-1_7803645526012310577.pngMP__K9_M3_T123456_N100__8_algos__all_RegretCentralized_semilogy____env1-1_7803645526012310577.png


Other illustrations

Piece-wise stationary problems

Comparing Sliding-Window UCB, Discounted UCB and UCB, on a simple Bernoulli problem with regular random shuffling of the arms.

_images/Demo_of_DiscountedUCB2.pngDemo_of_DiscountedUCB2.png

Sparse problem and Sparsity-aware algorithms

Comparing regular UCB, klUCB and Thompson sampling against their “sparse-aware” versions, on a simple Gaussian problem with K=10 arms but only s=4 arms with a non-zero mean.

_images/Demo_of_SparseWrapper_regret.pngDemo_of_SparseWrapper_regret.png


Demonstration of the Doubling Trick policy

  • On a fixed problem with full restart: _images/main____env1-1_3633169128724378553.pngmain____env1-1_3633169128724378553.png
  • On a fixed problem with no restart: _images/main____env1-1_5972568793654673752.pngmain____env1-1_5972568793654673752.png
  • On random problems with full restart: _images/main____env1-1_1217677871459230631.pngmain____env1-1_1217677871459230631.png
  • On random problems with no restart: _images/main____env1-1_5964629015089571121.pngmain____env1-1_5964629015089571121.png

Plots for the JMLR MLOSS paper

In the JMLR MLOSS paper I wrote to present SMPyBandits, an example of a simulation is presented, where we compare the standard anytime klUCB algorithm against its non-anytime variant klUCBPlusPlus, as well as UCB (with α=1) and Thompson (with a Beta posterior).

configuration["policies"] = [
  { "archtype": klUCB, "params": { "klucb": klucbBern } },
  { "archtype": klUCBPlusPlus, "params": { "horizon": HORIZON, "klucb": klucbBern } },
  { "archtype": UCBalpha, "params": { "alpha": 1 } },
  { "archtype": Thompson, "params": { "posterior": Beta } }
]

Running this simulation as shown below will save figures in a sub-folder, as well as save data (pulls, rewards and regret) in HDF5 files.

# 3. run a single-player simulation
$ BAYES=False ARM_TYPE=Bernoulli N=1000 T=10000 K=9 N_JOBS=4 \
  MEANS=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9] python3 main.py configuration.py

The two plots below show the average regret of these 4 algorithms. The regret is the difference between the cumulated rewards of the best fixed-arm strategy (which is the oracle strategy for stationary bandits) and the cumulated rewards of the considered algorithm.
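
In formula (this is the standard definition, matching the description above), for arms with means \mu_1, \ldots, \mu_K and optimal mean \mu^\star = \max_k \mu_k, the expected regret after T steps is

R_T = T \mu^\star - \mathbb{E}\Bigl[\sum_{t=1}^{T} r_t\Bigr],

where r_t is the reward received at time t.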

  • Average regret: _images/3.pngpaper/3.png
  • Histogram of regrets: _images/3_hist.pngpaper/3_hist.png
Example of a single-player simulation showing the average regret and histogram of regrets of 4 algorithms. They all perform very well: each algorithm is known to be order-optimal (i.e., its regret is proved to match the lower bound up to a constant), and all but UCB are known to be optimal (i.e., with the constant matching the lower bound). For instance, Thompson sampling is very efficient on average (in yellow), and UCB shows a larger variance (in red).

Saving simulation data to HDF5 file

This simulation produces this example HDF5 file, which contains attributes (e.g., horizon=10000, repetitions=1000, nbPolicies=4), and a collection of different datasets for each environment. Only one environment was tested, and for env_0 the HDF5 stores some attributes (e.g., nbArms=9 and means=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]) and datasets (e.g., bestArmPulls of shape (4, 10000), cumulatedRegret of shape (4, 10000), lastRegrets of shape (4, 1000), averageRewards of shape (4, 10000)). See the example: GitHub.com/SMPyBandits/SMPyBandits/blob/master/plots/paper/example.hdf5.

Note: HDFCompass is recommended to explore the file from a nice and easy-to-use GUI. Or use it from a Python script with h5py, or from a Julia script with HDF5.jl.

_images/example_HDF5_exploration_with_HDFCompass.pngExample of exploring this 'example.hdf5' file using HDFCompass
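
For instance, here is a minimal Python sketch using h5py (assuming a local copy of example.hdf5 and the attribute/dataset names listed above) to inspect such a file:

import h5py

with h5py.File("example.hdf5", "r") as f:
    print(dict(f.attrs))                  # e.g., horizon, repetitions, nbPolicies
    env = f["env_0"]
    print(dict(env.attrs))                # e.g., nbArms, means
    regrets = env["cumulatedRegret"][:]   # array of shape (nbPolicies, horizon)
    print(regrets.shape, regrets[:, -1])  # final cumulated regret of each policy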

Graph of time and memory consumptions

Time consumption

Note that I have added very clean support for time-consumption measurements: every simulation script will output (at the end) some lines looking like this:

Giving the mean and std running times ...
For policy #0 called 'UCB($\alpha=1$)' ...
    84.3 ms ± 7.54 ms per loop (mean ± std. dev. of 10 runs)
For policy #1 called 'Thompson' ...
    89.6 ms ± 17.7 ms per loop (mean ± std. dev. of 10 runs)
For policy #3 called 'kl-UCB$^{++}$($T=1000$)' ...
    2.52 s ± 29.3 ms per loop (mean ± std. dev. of 10 runs)
For policy #2 called 'kl-UCB' ...
    2.59 s ± 284 ms per loop (mean ± std. dev. of 10 runs)

_images/Demo_of_automatic_time_consumption_measure_between_algorithms1.pngDemo_of_automatic_time_consumption_measure_between_algorithms

Memory consumption

Note that I have added experimental support for memory-consumption measurements: every simulation script will output (at the end) some lines looking like this:

Giving the mean and std memory consumption ...
For players called '3 x RhoRand-kl-UCB, rank:1' ...
    23.6 KiB ± 52 B (mean ± std. dev. of 10 runs)
For players called '3 x RandTopM-kl-UCB' ...
    1.1 KiB ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x Selfish-kl-UCB' ...
    12 B ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x MCTopM-kl-UCB' ...
    4.9 KiB ± 86 B (mean ± std. dev. of 10 runs)
For players called '3 x MCNoSensing($M=3$, $T=1000$)' ...
    12 B ± 0 B (mean ± std. dev. of 10 runs)

_images/Demo_of_automatic_memory_consumption_measure_between_algorithms1.pngDemo_of_automatic_memory_consumption_measure_between_algorithms

It is still experimental!

Jupyter Notebooks :notebook: by Naereen @ GitHub

This folder hosts some Jupyter Notebooks, to present in a nice format some numerical experiments for my SMPyBandits project.

The wonderful Jupyter tools are awesome for writing interactive and nicely presented :snake: Python simulations!

https://img.shields.io/badge/Made%20with-Jupyter-1f425f.svgmade-with-jupyter https://img.shields.io/badge/Made%20with-Python-1f425f.svgmade-with-python


1. List of experiments presented with notebooks

MAB problems
Single-Player simulations
Multi-Player simulations

2. Question: How to read these documents?

2.a. View the notebooks statically :memo:

3. Question: Requirements to run the notebooks locally?

All the requirements can be installed with pip.

Note: if you use Python 3 instead of Python 2, you might have to replace pip and python by pip3 and python3 in the next commands (if both pip and pip3 are installed).

3.a. Jupyter Notebook and IPython

sudo pip install jupyter ipython

It will also install all the dependencies; afterwards, you should have a jupyter-notebook command (or a jupyter command, to be run as jupyter notebook) available in your PATH:

$ whereis jupyter-notebook
jupyter-notebook: /usr/local/bin/jupyter-notebook
$ jupyter-notebook --version  # version >= 4 is recommended
4.4.1
3.b. My numerical environment, SMPyBandits
  • First, install its dependencies (pip install -r requirements.txt).
  • Then, either install it (not yet), or be sure to work in the main folder.

Note: it’s probably better to use virtualenv, if you like it. I never really understood how and why virtualenvs are useful, but if you know why, you should know how to use them.

:information_desk_person: More information?

List of notebooks for SMPyBandits

Note

I wrote many other Jupyter notebooks covering various topics, see on my GitHub notebooks/ project.

A note on execution times, speed and profiling

A better approach?

In January, I tried to use the PyCharm Python IDE, and it has an awesome profiler included! But it was too cumbersome to use…

An even better approach?

Well now… I know my codebase, and I know how costly or efficient every new piece of code should be; if I find something empirically odd, I explore it with one of the above-mentioned modules…


logs files

This folder keeps some examples of log files to show the output of the simulation scripts.

Single player simulations

Example of output of the main.py program

Multi players simulations

Example of output of the main_multiplayers.py program

Linters

Pylint

Profilers



Note

Both this documentation and the code are publicly available, under the open-source MIT License. The code is hosted on GitHub at github.com/SMPyBandits/SMPyBandits.

Indices and tables

Stars of https://github.com/SMPyBandits/SMPyBandits/ Contributors of https://github.com/SMPyBandits/SMPyBandits/ Watchers of https://github.com/SMPyBandits/SMPyBandits/ Forks of https://github.com/SMPyBandits/SMPyBandits/

Releases of https://github.com/SMPyBandits/SMPyBandits/ Commits of https://github.com/SMPyBandits/SMPyBandits/ / Date of last commit of https://github.com/SMPyBandits/SMPyBandits/

Issues of https://github.com/SMPyBandits/SMPyBandits/ : Open issues of https://github.com/SMPyBandits/SMPyBandits/ / Closed issues of https://github.com/SMPyBandits/SMPyBandits/

Pull requests of https://github.com/SMPyBandits/SMPyBandits/ : Open pull requests of https://github.com/SMPyBandits/SMPyBandits/ / Closed pull requests of https://github.com/SMPyBandits/SMPyBandits/

ForTheBadge uses-badges ForTheBadge uses-git forthebadge made-with-python ForTheBadge built-with-science