Welcome to SMPyBandits documentation!¶
Open-Source Python package for Single- and Multi-Players multi-armed Bandits algorithms.
A research framework for Single and Multi-Players Multi-Armed Bandits (MAB) algorithms: UCB, KL-UCB, Thompson Sampling and many more for single-player settings, and MCTopM & RandTopM, MusicalChair, ALOHA, MEGA, rhoRand for multi-player simulations. It runs on Python 2 and 3, and is publicly released as open-source software under the MIT License.
Note
See more on the GitHub page for this project: https://github.com/SMPyBandits/SMPyBandits/.
The project is also hosted on Inria GForge, and the documentation can be seen online at https://smpybandits.github.io/ or http://banditslilian.gforge.inria.fr/ or https://smpybandits.readthedocs.io/.
This repository contains the code of my numerical environment, written in Python, in order to perform numerical simulations on single-player and multi-players Multi-Armed Bandits (MAB) algorithms.
I (Lilian Besson) started my PhD in October 2016, and this is part of my ongoing research since December 2016.
How to cite this work?¶
If you use this package for your own work, please consider citing it with this piece of BibTeX:
@misc{SMPyBandits,
title = {{SMPyBandits: an Open-Source Research Framework for Single and Multi-Players Multi-Arms Bandits (MAB) Algorithms in Python}},
author = {Lilian Besson},
year = {2018},
url = {https://github.com/SMPyBandits/SMPyBandits/},
howpublished = {Online at: \url{GitHub.com/SMPyBandits/SMPyBandits}},
note = {Code at https://github.com/SMPyBandits/SMPyBandits/, documentation at https://smpybandits.github.io/}
}
I also wrote a small paper to present SMPyBandits, and I will send it to JMLR MLOSS. The paper can be consulted here on my website.
SMPyBandits¶
Open-Source Python package for Single- and Multi-Players multi-armed Bandits algorithms.

This repository contains the code of Lilian Besson’s numerical environment, written in Python (2 or 3), for numerical simulations on :slot_machine: single-player and multi-player Multi-Armed Bandits (MAB) algorithms.
- A complete Sphinx-generated documentation is on SMPyBandits.GitHub.io.
- You can also browse online the results of extensive benchmarks, powered by Airspeed Velocity, on this page (code on SMPyBandits-benchmarks).
Quick presentation¶
It contains the most complete collection of single-player (classical) bandit algorithms on the Internet (over 65!), as well as implementations of all the state-of-the-art multi-player algorithms.
I follow the latest publications on Multi-Armed Bandits (MAB) research very actively, and usually implement new algorithms quite quickly. For instance, Exp3++, CORRAL and SparseUCB were each introduced by articles presented at COLT in July 2017, LearnExp comes from a NIPS 2017 paper, and kl-UCB++ from an ALT 2017 paper. More recent examples are klUCBswitch, from a paper from May 2018, and MusicalChairNoSensing, from a paper from August 2018.
- Classical MAB have many applications: clinical trials, A/B testing, game tree exploration, online content recommendation, etc. (my framework does not implement contextual bandits - yet).
- Multi-player MAB have applications in Cognitive Radio, and my framework implements all the collision models found in the literature, as well as all the algorithms from the last 10 years or so (rhoRand from 2009, MEGA from 2015, MusicalChair, our state-of-the-art algorithms RandTopM and MCTopM, along with the very recent algorithms SIC-MMAB from arXiv:1809.08151 and MusicalChairNoSensing from arXiv:1808.08416).
- I’m working on adding clean support for non-stationary MAB problems, and I will soon implement all state-of-the-art algorithms for these problems.
With this numerical framework, simulations can run on a single CPU or a multi-core machine, and summary plots are automatically saved as high-quality PNG, PDF and EPS (ready to be used in research articles).
Making new simulations is very easy: one only needs to write a configuration script, and basically no code! See these examples (files named configuration_*.py); a minimal sketch of such a script is given below.
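To make this concrete, here is a minimal sketch of what such a configuration script could contain. I hedge it explicitly: the keys shown (horizon, repetitions, n_jobs, environment with arm_type/params, policies with archtype/params) are the ones I believe the shipped configuration_*.py files use; refer to those files for the authoritative format.
# configuration_minimal.py -- illustrative sketch, adapt from the real configuration_*.py examples
from SMPyBandits.Arms import Bernoulli          # inside the repository, the prefix "SMPyBandits." is dropped
from SMPyBandits.Policies import UCB, klUCB, Thompson

configuration = {
    "horizon": 10000,       # T: number of time steps
    "repetitions": 100,     # N: number of independent repetitions
    "n_jobs": 4,            # number of CPU cores to use (-1 for all)
    "verbosity": 6,
    # One problem: 9 Bernoulli arms, similar to the default problem
    "environment": [{
        "arm_type": Bernoulli,
        "params": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    }],
    # The policies compared on this problem
    "policies": [
        {"archtype": UCB, "params": {}},
        {"archtype": klUCB, "params": {}},
        {"archtype": Thompson, "params": {}},
    ],
}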
A complete Sphinx documentation for each algorithm and every piece of code (including constants in the configurations!) is available here: SMPyBandits.GitHub.io. (I will use ReadTheDocs for this project, but I won’t use any continuous integration for it - don’t even think of it!)
I (Lilian Besson) started my PhD in October 2016, and this is part of my ongoing research since December 2016.
I launched the documentation in March 2017, wrote my first research articles using this framework in 2017, and decided to (finally) open-source my project in February 2018.
List of research publications using SMPyBandits¶
1st article, about the policy aggregation algorithm (aka model selection)¶
I designed and added the Aggregator policy, in order to test its validity and performance.
It is a “simple” voting algorithm to combine multiple bandit algorithms into one.
Basically, it behaves like a simple MAB bandit just based on empirical means (even simpler than UCB), where arms are the child algorithms A_1 .. A_N, each running in “parallel”.
For more details, refer to this file: Aggregation.md and this research article.
PDF : BKM_IEEEWCNC_2018.pdf | HAL notice : BKM_IEEEWCNC_2018 | BibTeX : BKM_IEEEWCNC_2018.bib | Source code and documentation | Published
2nd article, about Multi-players Multi-Armed Bandits¶
There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting.
The basic difference is about collisions: at each time t, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways from the base station’s point of view, and from each player’s point of view.
For more details, refer to this file: MultiPlayers.md and this research article.
PDF : BK__ALT_2018.pdf | HAL notice : BK__ALT_2018 | BibTeX : BK__ALT_2018.bib | Source code and documentation | Published
3rd article, using Doubling Trick for Multi-Armed Bandits¶
I studied what the Doubling Trick can and can’t do to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandits algorithms.
For more details, refer to this file: DoublingTrick.md and this research article.
PDF : BK__DoublingTricks_2018.pdf | HAL notice : BK__DoublingTricks_2018 | BibTeX : BK__DoublingTricks_2018.bib | Source code and documentation | Published
4th article, about Piece-Wise Stationary Multi-Armed Bandits¶
With Emilie Kaufmann, we studied the Generalized Likelihood Ratio Test (GLRT) for sub-Bernoulli distributions, and proposed the B-GLRT algorithm for change-point detection for piece-wise stationary one-armed bandit problems. We combined the B-GLRT with the kl-UCB multi-armed bandit algorithm and proposed the GLR-klUCB algorithm for piece-wise stationary multi-armed bandit problems. We prove finite-time guarantees for the B-GLRT and the GLR-klUCB algorithm, and we illustrate its performance with extensive numerical experiments.
For more details, refer to this file: NonStationaryBandits.md and this research article.
PDF : BK__COLT_2019.pdf | HAL notice : BK__COLT_2019 | BibTeX : BK__COLT_2019.bib | Source code and documentation | Published
Other interesting things¶
Single-player Policies¶
- More than 65 algorithms, including all known variants of the UCB, kl-UCB, MOSS and Thompson Sampling algorithms, as well as other less known algorithms (OCUCB, BESA, OSSB, etc.).
- For instance, SparseWrapper is a generalization of the SparseUCB from this article.
- Implementation of very recent Multi-Armed Bandits algorithms, e.g., kl-UCB++ (from this article), UCB-dagger (from this article), or MOSS-anytime (from this article).
- Experimental policies: BlackBoxOpt or UnsupervisedLearning (using Gaussian processes to learn the arms distributions).
A minimal sketch of the common single-player policy interface is shown just after this list.
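To give an idea of how these single-player policies are used in practice, here is a minimal sketch of the common interaction loop, following the interface documented in API.md (startGame(), choice(), getReward()); the arm means and the choice of UCB are purely illustrative, and the attribute pulls is the per-arm pull count as I understand the base policy API.
from SMPyBandits.Arms import Bernoulli
from SMPyBandits.Policies import UCB    # inside the repository, the prefix "SMPyBandits." is dropped

# A small Bernoulli problem (illustrative means)
arms = [Bernoulli(mu) for mu in [0.1, 0.5, 0.9]]
policy = UCB(nbArms=len(arms))
policy.startGame()                      # reset the internal state

horizon = 1000
for t in range(horizon):
    arm = policy.choice()               # index of the arm to play at this step
    reward = arms[arm].draw()           # sample a reward from that arm
    policy.getReward(arm, reward)       # update the policy's statistics

print("Number of pulls of each arm:", policy.pulls)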
Arms and problems¶
- My framework mainly targets stochastic bandits, with arms following Bernoulli, bounded (truncated) or unbounded Gaussian, Exponential, Gamma or Poisson distributions, and more.
- The default configuration is to use a fixed problem for N repetitions (e.g. 1000 repetitions, use MAB.MAB), but there is also perfect support for “Bayesian” problems, where the mean vector µ1,…,µK changes at every repetition (see MAB.DynamicMAB).
- There is also good support for Markovian problems, see MAB.MarkovianMAB, even though I didn’t implement any policies tailored for Markovian problems.
- I’m actively working on adding very clean support for non-stationary MAB problems, and MAB.PieceWiseStationaryMAB is already working well. Use it with policies designed for piece-wise stationary problems, like Discounted-Thompson, one of the CD-UCB algorithms, M-UCB, SlidingWindowUCB or Discounted-UCB, or SW-UCB#.
A small sketch of how such a problem object can be created is shown just after this list.
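As a small illustration of the problem classes mentioned above, here is a hedged sketch of building a fixed stochastic problem with MAB.MAB from a configuration dictionary; the 'arm_type'/'params' keys and the nbArms/means attributes are the format I believe the framework uses (inside the repository, the "SMPyBandits." import prefix is dropped).
from SMPyBandits.Arms import Bernoulli
from SMPyBandits.Environment.MAB import MAB

# A fixed 3-armed Bernoulli problem; other arm types follow the same 'arm_type'/'params' pattern
problem = MAB({"arm_type": Bernoulli, "params": [0.1, 0.5, 0.9]})
print(problem.nbArms)   # 3
print(problem.means)    # the vector of arm means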
Other remarks¶
- Everything here is done in an imperative, object-oriented style. The API of the Arms, Policy and MultiPlayersPolicy classes is documented in this file (API.md).
- The code is clean, and valid for both Python 2 and Python 3.
- Some pieces of code come from the pymaBandits project, but most of them were refactored. Thanks to the initial project!
- G. Varoquaux’s joblib is used for the Evaluator and EvaluatorMultiPlayers classes, so the simulations are easily parallelized on multi-core machines. (Put n_jobs = -1 or PARALLEL = True in the config file to use all your CPU cores, as is done by default.)
How to run the experiments?¶
See this document: How_to_run_the_code.md for more details (or this documentation page).
TL;DR: this short bash snippet shows how to clone the code, install the requirements for Python 3 (in a virtualenv), and start some simulations for N=100 repetitions of the default non-Bayesian Bernoulli-distributed problem, with K=9 arms, a horizon of T=10000 and 4 CPUs (it should take about 20 minutes for each simulation):
cd /tmp/ # or wherever you want
git clone -c core.symlinks=true https://GitHub.com/SMPyBandits/SMPyBandits.git
cd SMPyBandits
# just be sure you have the latest virtualenv from Python 3
sudo pip3 install --upgrade --force-reinstall virtualenv
# create and activate the virtualenv
virtualenv venv
. venv/bin/activate
type pip # check it is /tmp/SMPyBandits/venv/bin/pip
type python # check it is /tmp/SMPyBandits/venv/bin/python
# install the requirements in the virtualenv
pip install -r requirements_full.txt
# run a single-player simulation!
N=100 T=10000 K=9 N_JOBS=4 make single
# run a multi-player simulation!
N=100 T=10000 M=3 K=9 N_JOBS=4 make moremulti
You can also install it directly with pip and from GitHub:
cd /tmp/ ; mkdir SMPyBandits ; cd SMPyBandits/
virtualenv venv
. venv/bin/activate
type pip # check it is /tmp/SMPyBandits/venv/bin/pip
type python # check it is /tmp/SMPyBandits/venv/bin/python
pip install git+https://github.com/SMPyBandits/SMPyBandits.git#egg=SMPyBandits[full]
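After installing with pip, a quick smoke test (my suggestion, not an official check) is to import the package and draw a few samples from an arm:
# quick post-install check, run from any directory outside the cloned repository
import SMPyBandits
from SMPyBandits.Arms import Bernoulli

print("SMPyBandits version:", getattr(SMPyBandits, "__version__", "unknown"))
print("10 draws from a Bernoulli(0.5) arm:", Bernoulli(0.5).draw_nparray(10))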
- If speed matters to you and you want to use algorithms based on kl-UCB, you should take the time to build and install the fast C implementation of the KL utility functions. The default is to use kullback.py, but using the C version from Policies/C/ really speeds up the computations. Just follow the instructions, it should work well (you need gcc to be installed).
- And if speed matters, be sure that you have a working version of Numba: it is used by many small functions to (try to automatically) speed up the computations. A quick check is sketched just below.
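A quick way (again a suggestion, not part of SMPyBandits) to check that Numba is installed and able to JIT-compile is:
try:
    from numba import jit

    @jit(nopython=True)
    def _probe(x):
        return x + 1

    assert _probe(41) == 42
    print("Numba is available: JIT speed-ups will be used.")
except ImportError:
    print("Numba is not installed: the pure-Python versions will be used (slower).")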
Nix¶
A pinned Nix environment is available for this experimental setup in the nix/pkgs/ directory.
From the root of the project:
$ nix-shell
nix-shell$ jupyter_notebook
nix-shell$ N=100 T=10000 K=9 N_JOBS=4 make single
The following one-liner lets you explore one of the example notebooks from any Nix-enabled machine, without cloning the repository:
$ nix-shell https://github.com/SMPYBandits/SMPyBandits/archive/master.tar.gz --run 'jupyter-notebook $EXAMPLE_NOTEBOOKS/Example_of_a_small_Multi-Player_Simulation__with_Centralized_Algorithms.ipynb'
:boom: Warning¶
- This work is still experimental, even if it is well tested and stable! It’s active research. It should be completely bug-free, and every single module/file should work perfectly (as this pylint log and this other one say), but bugs are sometimes hard to spot, so if you encounter any issue, please file a bug ticket.
- Whenever I add a new feature, I run experiments to check that nothing is broken (and Travis CI helps too). But there are no unit tests (I don’t have time). You would have to trust me :sunglasses:!
- This project is NOT meant to be a library that you can use elsewhere, but a research tool.
Contributing?¶
I don’t expect issues or pull requests on this project, but you are welcome to open some.
Contributions (issues, questions, pull requests) are of course welcome, but this project is and will stay a personal environment designed for quick research experiments, and will never try to be an industry-ready module for applications of Multi-Armed Bandits algorithms. If you want to contribute, please have a look to the CONTRIBUTING.md file, and if you want to be more seriously involved, read the CODE_OF_CONDUCT.md file.
- You are welcome to submit an issue, if it was not previously answered.
- If you have an interesting example of use of SMPyBandits, please share it! (Jupyter Notebooks are preferred.) And file a pull request to add it to the notebook examples.
:boom: TODO¶
See this file TODO.md, and the issues on GitHub.
:scroll: License?¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson, with help from contributors.
SMPyBandits modules¶
Arms package¶
Arms : contains different types of bandit arms: Constant, UniformArm, Bernoulli, Binomial, Poisson, Gaussian, Exponential, Gamma, DiscreteArm.
Each arm class follows the same interface:
> my_arm = Arm(params)
> my_arm.mean
0.5
> my_arm.draw() # one random draw
0.0
> my_arm.draw_nparray(20) # or ((3, 10)), for many draws at once
array([ 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0.,
1., 0., 0., 0., 1., 1., 1.])
Also contains:
- uniformMeans(), to generate uniformly spaced means of arms.
- uniformMeansWithSparsity(), to generate uniformly spaced means of arms, with sparsity constraints.
- randomMeans(), to generate randomly spaced means of arms.
- randomMeansWithGapBetweenMbestMworst(), to generate randomly spaced means of arms, with a constraint on the gap between the M-best arms and the (K-M)-worst arms.
- randomMeansWithSparsity(), to generate randomly spaced means of arms with sparsity constraints.
- shuffled(), to return a shuffled version of a list.
- Utility functions array_from_str(), list_from_str() and tuple_from_str(), to obtain a numpy.ndarray, a list or a tuple from a string (used for the CLI env variables interface).
- optimal_selection_probabilities().
- geometricChangePoints(), to obtain randomly spaced change points.
- continuouslyVaryingMeans() and randomContinuouslyVaryingMeans(), to get new random means for continuously varying non-stationary MAB problems.
A short sketch combining some of these helpers is given just below.
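For instance, here is a hedged sketch combining some of these helpers to build the mean vector of a problem; the values in the comments follow the doctests documented below, and the import path assumes the pip-installed package (inside the repository the prefix "SMPyBandits." is dropped).
import numpy as np
from SMPyBandits.Arms import uniformMeans, randomMeans, geometricChangePoints

# 9 equally spaced means in [0.1, 0.9], as in the uniformMeans doctest below
means = uniformMeans(nbArms=9, delta=1 / (1. + 9))
print(np.array(means))            # [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]

# 5 random means with a minimal gap of 0.05 between any two of them
np.random.seed(1234)              # reproducible, as in the doctests
random_means = randomMeans(nbArms=5, mingap=0.05)

# random change points for a piece-wise stationary problem
tau = geometricChangePoints(horizon=10000, proba=0.001)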
-
Arms.shuffled(mylist)[source]¶ Returns a shuffled version of the input 1D list. sorted() exists instead of list.sort(), but shuffled() does not exist instead of random.shuffle()…
>>> from random import seed; seed(1234) # reproducible results >>> mylist = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] >>> shuffled(mylist) [0.9, 0.4, 0.3, 0.6, 0.5, 0.7, 0.1, 0.2, 0.8] >>> shuffled(mylist) [0.4, 0.3, 0.7, 0.5, 0.8, 0.1, 0.9, 0.6, 0.2] >>> shuffled(mylist) [0.4, 0.6, 0.9, 0.5, 0.7, 0.2, 0.1, 0.3, 0.8] >>> shuffled(mylist) [0.8, 0.7, 0.3, 0.1, 0.9, 0.5, 0.6, 0.2, 0.4]
-
Arms.uniformMeans(nbArms=3, delta=0.05, lower=0.0, amplitude=1.0, isSorted=True)[source]¶ Return a list of means of arms, well spaced:
- in [lower, lower + amplitude],
- sorted in increasing order,
- starting from lower + amplitude * delta, up to lower + amplitude * (1 - delta),
- and there is nbArms arms.
>>> np.array(uniformMeans(2, 0.1)) array([0.1, 0.9]) >>> np.array(uniformMeans(3, 0.1)) array([0.1, 0.5, 0.9]) >>> np.array(uniformMeans(9, 1 / (1. + 9))) array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
-
Arms.uniformMeansWithSparsity(nbArms=10, sparsity=3, delta=0.05, lower=0.0, lowerNonZero=0.5, amplitude=1.0, isSorted=True)[source]¶ Return a list of means of arms, well spaced, in [lower, lower + amplitude].
- Exactly nbArms-sparsity arms will have a mean = lower and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
- All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234) # reproducible results >>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2)) # doctest: +ELLIPSIS array([ 0. , 0. , 0. , 0. , 0.55, 0.95]) >>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2, lowerNonZero=0.8, delta=0.03)) # doctest: +ELLIPSIS array([ 0. , 0. , 0. , 0. , 0.806, 0.994]) >>> np.array(uniformMeansWithSparsity(nbArms=10, sparsity=2)) # doctest: +ELLIPSIS array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.55, 0.95]) >>> np.array(uniformMeansWithSparsity(nbArms=6, sparsity=2, delta=0.05)) # doctest: +ELLIPSIS array([ 0. , 0. , 0. , 0. , 0.525, 0.975]) >>> np.array(uniformMeansWithSparsity(nbArms=10, sparsity=4, delta=0.05)) # doctest: +ELLIPSIS array([ 0. , 0. , 0. , 0. , 0. , 0. , 0.525, 0.675, 0.825, 0.975])
-
Arms.randomMeans(nbArms=3, mingap=None, lower=0.0, amplitude=1.0, isSorted=True)[source]¶ Return a list of means of arms, randomly sampled uniformly in [lower, lower + amplitude], with a min gap >= mingap.
- All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234) # reproducible results >>> randomMeans(nbArms=3, mingap=0.05) # doctest: +ELLIPSIS [0.191..., 0.437..., 0.622...] >>> randomMeans(nbArms=3, mingap=0.01) # doctest: +ELLIPSIS [0.276..., 0.801..., 0.958...]
- Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234) # reproducible results >>> randomMeans(nbArms=5, mingap=0.01, isSorted=True) # doctest: +ELLIPSIS [0.006..., 0.229..., 0.416..., 0.535..., 0.899...] >>> randomMeans(nbArms=5, mingap=0.01, isSorted=False) # doctest: +ELLIPSIS [0.419..., 0.932..., 0.072..., 0.755..., 0.650...]
-
Arms.randomMeansWithGapBetweenMbestMworst(nbArms=3, mingap=None, nbPlayers=2, lower=0.0, amplitude=1.0, isSorted=True)[source]¶ Return a list of means of arms, randomly sampled uniformly in [lower, lower + amplitude], with a min gap >= mingap between the set Mbest and Mworst.
-
Arms.randomMeansWithSparsity(nbArms=10, sparsity=3, mingap=0.01, delta=0.05, lower=0.0, lowerNonZero=0.5, amplitude=1.0, isSorted=True)[source]¶ Return a list of means of arms, in [lower, lower + amplitude], with a min gap >= mingap.
- Exactly nbArms-sparsity arms will have a mean = lower and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
- All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234) # reproducible results >>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.05) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.595..., 0.811...] >>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.718..., 0.892...]
- Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234) # reproducible results >>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01, isSorted=True) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.636..., 0.889...] >>> randomMeansWithSparsity(nbArms=6, sparsity=2, mingap=0.01, isSorted=False) # doctest: +ELLIPSIS [0.0, 0.0, 0.900..., 0.638..., 0.0, 0.0]
-
Arms.randomMeansWithSparsity2(nbArms=10, sparsity=3, mingap=0.01, lower=-1.0, lowerNonZero=0.0, amplitude=2.0, isSorted=True)[source]¶ Return a list of means of arms, in [lower, lower + amplitude], with a min gap >= mingap.
- Exactly nbArms-sparsity arms will have a mean sampled uniformly in [lower, lowerNonZero] and the others are randomly sampled uniformly in [lowerNonZero, lower + amplitude].
- All means will be different, except if mingap=None, with a min gap > 0.
>>> import numpy as np; np.random.seed(1234) # reproducible results >>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.05) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.595..., 0.811...] >>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.718..., 0.892...]
- Means are sorted, except if isSorted=False.
>>> import random; random.seed(1234) # reproducible results >>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01, isSorted=True) # doctest: +ELLIPSIS [0.0, 0.0, 0.0, 0.0, 0.636..., 0.889...] >>> randomMeansWithSparsity2(nbArms=6, sparsity=2, mingap=0.01, isSorted=False) # doctest: +ELLIPSIS [0.0, 0.0, 0.900..., 0.638..., 0.0, 0.0]
-
Arms.array_from_str(my_str)[source]¶ Convert a string like “[0.1, 0.2, 0.3]” to a numpy array [0.1, 0.2, 0.3], using safe json.loads instead of exec.
>>> array_from_str("[0.1, 0.2, 0.3]") array([0.1, 0.2, 0.3]) >>> array_from_str("0.1, 0.2, 0.3") array([0.1, 0.2, 0.3]) >>> array_from_str("0.9") array([0.9])
-
Arms.list_from_str(my_str)[source]¶ Convert a string like “[0.1, 0.2, 0.3]” to a list [0.1, 0.2, 0.3], using safe json.loads instead of exec.
>>> list_from_str("[0.1, 0.2, 0.3]") [0.1, 0.2, 0.3] >>> list_from_str("0.1, 0.2, 0.3") [0.1, 0.2, 0.3] >>> list_from_str("0.9") [0.9]
-
Arms.tuple_from_str(my_str)[source]¶ Convert a string like “[0.1, 0.2, 0.3]” to a tuple (0.1, 0.2, 0.3), using safe json.loads instead of exec.
>>> tuple_from_str("[0.1, 0.2, 0.3]") (0.1, 0.2, 0.3) >>> tuple_from_str("0.1, 0.2, 0.3") (0.1, 0.2, 0.3) >>> tuple_from_str("0.9") (0.9,)
-
Arms.optimal_selection_probabilities(M, mu)[source]¶ Compute the optimal selection probabilities of K arms of means \(\mu_i\) by \(1 \leq M \leq K\) players, if they all observe each other’s pulls and rewards, as derived in (15) p3 of [[The Effect of Communication on Noncooperative Multiplayer Multi-Armed Bandit Problems, by Noyan Evirgen, Alper Kose, IEEE ICMLA 2017]](https://arxiv.org/abs/1711.01628v1).
Warning
They consider a different collision model than I usually do: when two (or more) players ask for the same resource at the same time t, I usually consider that all the colliding players receive a zero reward (see Environment.CollisionModels.onlyUniqUserGetsReward()), but they consider that exactly one of the colliding players gets the reward, and all the others get a zero reward (see Environment.CollisionModels.rewardIsSharedUniformly()). Example:
>>> optimal_selection_probabilities(3, [0.1,0.1,0.1]) array([0.33333333, 0.33333333, 0.33333333])
>>> optimal_selection_probabilities(3, [0.1,0.2,0.3]) # weird ? not really... array([0. , 0.43055556, 0.56944444])
>>> optimal_selection_probabilities(3, [0.1,0.3,0.9]) # weird ? not really... array([0. , 0.45061728, 0.54938272])
>>> optimal_selection_probabilities(3, [0.7,0.8,0.9]) array([0.15631866, 0.35405647, 0.48962487])
Note
These results may sound counter-intuitive, but again they use a different collision model: in my usual collision model, it makes no sense to completely drop an arm when K=M=3, no matter the probabilities \(\mu_i\), but in their collision model, a player wins more (on average) if she has a \(50\%\) chance of being alone on an arm with mean \(0.3\) than if she is sure to be alone on an arm with mean \(0.1\) (see examples 3 and 4).
-
Arms.geometricChangePoints(horizon=10000, proba=0.001)[source]¶ Change points following a geometric distribution: at each time, the probability of having a change point at the next step is proba.
>>> np.random.seed(0) >>> geometricChangePoints(100, 0.1) array([ 8, 20, 29, 37, 43, 53, 59, 81]) >>> geometricChangePoints(100, 0.2) array([ 6, 8, 14, 29, 31, 35, 40, 44, 46, 60, 63, 72, 78, 80, 88, 91])
-
Arms.continuouslyVaryingMeans(means, sign=1, maxSlowChange=0.1, horizon=None, lower=0.0, amplitude=1.0, isSorted=True)[source]¶ New means, slightly modified from the previous ones.
- The change and the sign of change are constants.
-
Arms.randomContinuouslyVaryingMeans(means, maxSlowChange=0.1, horizon=None, lower=0.0, amplitude=1.0, isSorted=True)[source]¶ New means, slightly modified from the previous ones.
- The amplitude c of the change is constant, but it is randomly sampled in \(\mathcal{U}([-c,c])\).
Submodules¶
Arms.Arm module¶
Base class for an arm class.
-
class
Arms.Arm.
Arm
(lower=0.0, amplitude=1.0)[source]¶ Bases:
object
Base class for an arm class.
-
lower
= None¶ Lower value of rewards
-
amplitude
= None¶ Amplitude of value of rewards
-
min
= None¶ Lower value of rewards
-
max
= None¶ Higher value of rewards
-
lower_amplitude
¶ (lower, amplitude)
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - mu) / KL(mu, mumax).
-
__dict__
= mappingproxy({'__module__': 'Arms.Arm', '__doc__': ' Base class for an arm class.', '__init__': <function Arm.__init__>, 'lower_amplitude': <property object>, '__str__': <function Arm.__str__>, '__repr__': <function Arm.__repr__>, 'draw': <function Arm.draw>, 'oracle_draw': <function Arm.oracle_draw>, 'set_mean_param': <function Arm.set_mean_param>, 'draw_nparray': <function Arm.draw_nparray>, 'kl': <staticmethod object>, 'oneLR': <staticmethod object>, 'oneHOI': <staticmethod object>, '__dict__': <attribute '__dict__' of 'Arm' objects>, '__weakref__': <attribute '__weakref__' of 'Arm' objects>})¶
-
__module__
= 'Arms.Arm'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
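To illustrate how this base class is meant to be used, here is a hedged sketch of a custom arm that subclasses Arms.Arm and provides the attributes and methods used throughout this page (mean, draw(), draw_nparray()); it is only an example with a hypothetical triangular distribution, and the existing arms (e.g. Arms.Bernoulli) remain the reference for the complete expected interface.
import numpy as np
from SMPyBandits.Arms.Arm import Arm   # inside the repository, the prefix "SMPyBandits." is dropped

class TriangularArm(Arm):
    """Hypothetical arm with a symmetric triangular distribution on [lower, lower + amplitude]."""

    def __init__(self, lower=0.0, amplitude=1.0):
        super(TriangularArm, self).__init__(lower=lower, amplitude=amplitude)
        self.mean = lower + amplitude / 2.0   # mean of the symmetric triangular distribution

    def __str__(self):
        return "Triangular({}, {})".format(self.lower, self.lower + self.amplitude)

    def draw(self, t=None):
        """One random sample (the time t is ignored, as for any i.i.d. arm)."""
        return np.random.triangular(self.lower, self.mean, self.lower + self.amplitude)

    def draw_nparray(self, shape=(1,)):
        """Several random samples at once, as a numpy array."""
        return np.random.triangular(self.lower, self.mean, self.lower + self.amplitude, size=shape)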
Arms.Bernoulli module¶
Bernoulli distributed arm.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> B03 = Bernoulli(0.3)
>>> B03
B(0.3)
>>> B03.mean
0.3
Examples of sampling from an arm:
>>> B03.draw()
0
>>> B03.draw_nparray(20)
array([1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1.,
1., 1., 1.])
-
class
Arms.Bernoulli.
Bernoulli
(probability)[source]¶ Bases:
Arms.Arm.Arm
Bernoulli distributed arm.
-
probability
= None¶ Parameter p for this Bernoulli arm
-
mean
= None¶ Mean for this Bernoulli arm
-
lower_amplitude
¶ (lower, amplitude)
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Bernoulli arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Bernoulli'¶
-
Arms.Binomial module¶
Binomial distributed arm.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> B03_10 = Binomial(0.3, 10)
>>> B03_10
Bin(0.3, 10)
>>> B03_10.mean
3.0
Examples of sampling from an arm:
>>> B03_10.draw()
3
>>> B03_10.draw_nparray(20)
array([4., 3., 3., 3., 3., 3., 5., 6., 3., 4., 3., 3., 5., 1., 1., 0., 4.,
4., 5., 6.])
-
class
Arms.Binomial.
Binomial
(probability, draws=1)[source]¶ Bases:
Arms.Arm.Arm
Binomial distributed arm.
-
probability
= None¶ Parameter p for this Binomial arm
-
draws
= None¶ Parameter n for this Binomial arm
-
mean
= None¶ Mean for this Binomial arm
-
lower_amplitude
¶ (lower, amplitude)
-
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Binomial arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Binomial'¶
-
Arms.Constant module¶
Arm with a constant reward. Useful for debugging.
Example of creating an arm:
>>> C013 = Constant(0.13)
>>> C013
Constant(0.13)
>>> C013.mean
0.13
Examples of sampling from an arm:
>>> C013.draw()
0.13
>>> C013.draw_nparray(3)
array([0.13, 0.13, 0.13])
-
class
Arms.Constant.
Constant
(constant_reward=0.5, lower=0.0, amplitude=1.0)[source]¶ Bases:
Arms.Arm.Arm
Arm with a constant reward. Useful for debugging.
- constant_reward is the constant reward,
- lower, amplitude default to floor(constant_reward), 1 (so the constant reward lies in [lower, lower + amplitude]).
>>> arm_0_5 = Constant(0.5) >>> arm_0_5.draw() 0.5 >>> arm_0_5.draw_nparray((3, 2)) array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
-
constant_reward
= None¶ Constant value of rewards
-
lower
= None¶ Known lower value of rewards
-
amplitude
= None¶ Known amplitude of rewards
-
mean
= None¶ Mean for this Constant arm
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Constant arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Constant'¶
Arms.DiscreteArm module¶
Discretely distributed arm, of finite support.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> D3values = DiscreteArm({-1: 0.25, 0: 0.5, 1: 0.25})
>>> D3values
D({-1: 0.25, 0: 0.5, 1: 0.25})
>>> D3values.mean
0.0
- Examples of sampling from an arm:
>>> D3values.draw()
0
>>> D3values.draw_nparray(20)
array([ 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, -1, -1, -1, 1,
1, 1, 1])
- Another example, with heavy tail:
>>> D5values = DiscreteArm({-1000: 0.001, 0: 0.5, 1: 0.25, 2:0.25, 1000: 0.001})
>>> D5values
D({-1e+03: 0.001, 0: 0.5, 1: 0.25, 2: 0.25, 1e+03: 0.001})
>>> D5values.mean
0.75
Examples of sampling from an arm:
>>> D5values.draw()
2
>>> D5values.draw_nparray(20)
array([0, 2, 0, 1, 0, 2, 1, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2, 1, 0, 0])
-
class
Arms.DiscreteArm.
DiscreteArm
(values_to_proba)[source]¶ Bases:
Arms.Arm.Arm
DiscreteArm distributed arm.
-
mean
= None¶ Mean for this DiscreteArm arm
-
size
= None¶ Number of different values in this DiscreteArm arm
-
lower_amplitude
¶ (lower, amplitude)
-
static
kl
(x, y)[source]¶ The kl(x, y) to use for this arm.
Warning
FIXME this is not correctly defined, except for the special case of having only 2 values, a
DiscreteArm
is NOT a one-dimensional distribution, and so the kl between two distributions is NOT a function of their mean!
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for DiscreteArm arms: (mumax - mu) / KL(mu, mumax).
Warning
FIXME this is not correctly defined, except for the special case of having only 2 values, a
DiscreteArm
is NOT a one-dimensional distribution, and so the kl between two distributions is NOT a function of their mean!
-
__module__
= 'Arms.DiscreteArm'¶
-
Arms.Exponential module¶
Exponentially distributed arm.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Exp03 = ExponentialFromMean(0.3)
>>> Exp03
\mathrm{Exp}(3.2, 1)
>>> Exp03.mean # doctest: +ELLIPSIS
0.3000...
Examples of sampling from an arm:
>>> Exp03.draw() # doctest: +ELLIPSIS
0.052...
>>> Exp03.draw_nparray(20) # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.18..., 0.10..., 0.15..., 0.18..., 0.26...,
0.13..., 0.25..., 0.03..., 0.01..., 0.29... ,
0.07..., 0.19..., 0.17..., 0.02... , 0.82... ,
0.76..., 1. , 0.05..., 0.07..., 0.04...])
-
class
Arms.Exponential.
Exponential
(p, trunc=1)[source]¶ Bases:
Arms.Arm.Arm
Exponentially distributed arm, possibly truncated.
- Default is to truncate to 1 (so Exponential.draw() is in [0, 1]).
-
p
= None¶ Parameter p for Exponential arm
-
trunc
= None¶ Max value of reward
-
mean
= None¶ Mean of Exponential arm
-
lower_amplitude
¶ (lower, amplitude)
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Exponential arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Exponential'¶
-
class
Arms.Exponential.
ExponentialFromMean
(mean, trunc=1)[source]¶ Bases:
Arms.Exponential.Exponential
Exponentially distributed arm, possibly truncated, defined by its mean and not its parameter.
- Default is to truncate to 1 (so Exponential.draw() is in [0, 1]).
-
__module__
= 'Arms.Exponential'¶
Arms.Gamma module¶
Gamma distributed arm.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Gamma03 = GammaFromMean(0.3)
>>> Gamma03
\Gamma(0.3, 1)
>>> Gamma03.mean
0.3
Examples of sampling from an arm:
>>> Gamma03.draw() # doctest: +ELLIPSIS
0.079...
>>> Gamma03.draw_nparray(20) # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([1.35...e-01, 1.84...e-01, 5.71...e-02, 6.36...e-02,
4.94...e-01, 1.51...e-01, 1.48...e-04, 2.25...e-06,
4.56...e-01, 1.00...e+00, 7.59...e-02, 8.12...e-04,
1.54...e-03, 1.14...e-01, 1.18...e-02, 7.30...e-02,
1.76...e-06, 1.94...e-01, 1.00...e+00, 3.30...e-02])
-
class
Arms.Gamma.
Gamma
(shape, scale=1.0, mini=0, maxi=1)[source]¶ Bases:
Arms.Arm.Arm
Gamma distributed arm, possibly truncated.
- Default is to truncate into [0, 1] (so Gamma.draw() is in [0, 1]).
- Cf. http://chercheurs.lille.inria.fr/ekaufman/NIPS13 Figure 1
-
shape
= None¶ Shape parameter for this Gamma arm
-
scale
= None¶ Scale parameter for this Gamma arm
-
mean
= None¶ Mean for this Gamma arm
-
min
= None¶ Lower value of rewards
-
max
= None¶ Larger value of rewards
-
lower_amplitude
¶ (lower, amplitude)
-
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - shape) / KL(shape, mumax).
-
__module__
= 'Arms.Gamma'¶
-
class
Arms.Gamma.
GammaFromMean
(mean, scale=1.0, mini=0, maxi=1)[source]¶ Bases:
Arms.Gamma.Gamma
Gamma distributed arm, possibly truncated, defined by its mean and not its scale parameter.
-
__init__
(mean, scale=1.0, mini=0, maxi=1)[source]¶ As mean = scale * shape, shape = mean / scale is used.
-
__module__
= 'Arms.Gamma'¶
-
Arms.Gaussian module¶
Gaussian distributed arm.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Gauss03 = Gaussian(0.3, 0.05) # small variance
>>> Gauss03
N(0.3, 0.05)
>>> Gauss03.mean
0.3
Examples of sampling from an arm:
>>> Gauss03.draw() # doctest: +ELLIPSIS
0.3470...
>>> Gauss03.draw_nparray(20) # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.388..., 0.320..., 0.348... , 0.412..., 0.393... ,
0.251..., 0.347..., 0.292..., 0.294..., 0.320...,
0.307..., 0.372..., 0.338..., 0.306..., 0.322...,
0.316..., 0.374..., 0.289..., 0.315..., 0.257...])
-
class
Arms.Gaussian.
Gaussian
(mu, sigma=0.05, mini=0, maxi=1)[source]¶ Bases:
Arms.Arm.Arm
Gaussian distributed arm, possibly truncated.
- Default is to truncate into [0, 1] (so Gaussian.draw() is in [0, 1]).
-
mu
= None¶ Mean of Gaussian arm
-
mean
= None¶ Mean of Gaussian arm
-
sigma
= None¶ Variance of Gaussian arm
-
min
= None¶ Lower value of rewards
-
max
= None¶ Higher value of rewards
-
lower_amplitude
¶ (lower, amplitude)
-
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Gaussian arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Gaussian'¶
-
class
Arms.Gaussian.
Gaussian_0_1
(mu, sigma=0.05, mini=0, maxi=1)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [0, 1].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_0_2
(mu, sigma=0.1, mini=0, maxi=2)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [0, 2].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_0_5
(mu, sigma=0.5, mini=0, maxi=5)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [0, 5].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_0_10
(mu, sigma=1, mini=0, maxi=10)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [0, 10].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_0_100
(mu, sigma=5, mini=0, maxi=100)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [0, 100].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_m1_1
(mu, sigma=0.1, mini=-1, maxi=1)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [-1, 1].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_m2_2
(mu, sigma=0.25, mini=-2, maxi=2)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [-2, 2].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_m5_5
(mu, sigma=1, mini=-5, maxi=5)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [-5, 5].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_m10_10
(mu, sigma=2, mini=-10, maxi=10)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [-10, 10].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
Gaussian_m100_100
(mu, sigma=10, mini=-100, maxi=100)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, truncated to [-100, 100].
-
__module__
= 'Arms.Gaussian'¶
-
-
class
Arms.Gaussian.
UnboundedGaussian
(mu, sigma=1)[source]¶ Bases:
Arms.Gaussian.Gaussian
Gaussian distributed arm, not truncated, ie. supported in (-oo, oo).
-
__module__
= 'Arms.Gaussian'¶
-
Arms.Poisson module¶
Poisson distributed arm, possibly truncated.
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Poisson5 = Poisson(5, trunc=10)
>>> Poisson5
P(5, 10)
>>> Poisson5.mean # doctest: +ELLIPSIS
4.9778...
Examples of sampling from an arm:
>>> Poisson5.draw() # doctest: +ELLIPSIS
9
>>> Poisson5.draw_nparray(20) # doctest: +ELLIPSIS
array([ 5, 6, 5, 5, 8, 4, 5, 4, 3, 3, 7, 3, 3, 4, 5, 2, 1,
7, 7, 10])
-
class
Arms.Poisson.
Poisson
(p, trunc=1)[source]¶ Bases:
Arms.Arm.Arm
Poisson distributed arm, possibly truncated.
- Default is to not truncate.
- Warning: the draw() method is QUITE inefficient! (15 seconds for 200000 draws, 62 µs for 1).
-
p
= None¶ Parameter p for Poisson arm
-
trunc
= None¶ Max value of rewards
-
mean
= None¶ Mean for this Poisson arm
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for Poisson arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.Poisson'¶
Arms.RestedRottingArm module¶
author: Julien Seznec. Rested rotting arms, i.e. arms whose mean value decays at each pull.
-
class
Arms.RestedRottingArm.
RestedRottingArm
(decayingFunction, staticArm)[source]¶ Bases:
Arms.Arm.Arm
-
__module__
= 'Arms.RestedRottingArm'¶
-
-
class
Arms.RestedRottingArm.
RestedRottingBernoulli
(decayingFunction)[source]¶ Bases:
Arms.RestedRottingArm.RestedRottingArm
-
__module__
= 'Arms.RestedRottingArm'¶
-
-
class
Arms.RestedRottingArm.
RestedRottingBinomial
(decayingFunction, draws=1)[source]¶ Bases:
Arms.RestedRottingArm.RestedRottingArm
-
__module__
= 'Arms.RestedRottingArm'¶
-
-
class
Arms.RestedRottingArm.
RestedRottingConstant
(decayingFunction)[source]¶ Bases:
Arms.RestedRottingArm.RestedRottingArm
-
__module__
= 'Arms.RestedRottingArm'¶
-
-
class
Arms.RestedRottingArm.
RestedRottingExponential
(decayingFunction)[source]¶ Bases:
Arms.RestedRottingArm.RestedRottingArm
-
__module__
= 'Arms.RestedRottingArm'¶
-
-
class
Arms.RestedRottingArm.
RestedRottingGaussian
(decayingFunction, sigma=1)[source]¶ Bases:
Arms.RestedRottingArm.RestedRottingArm
-
__module__
= 'Arms.RestedRottingArm'¶
-
Arms.RestlessArm module¶
author: Julien Seznec. Restless arms, i.e. arms whose mean value changes at each round.
-
class
Arms.RestlessArm.
RestlessArm
(rewardFunction, staticArm)[source]¶ Bases:
Arms.Arm.Arm
-
__module__
= 'Arms.RestlessArm'¶
-
-
class
Arms.RestlessArm.
RestlessBernoulli
(rewardFunction)[source]¶ Bases:
Arms.RestlessArm.RestlessArm
-
__module__
= 'Arms.RestlessArm'¶
-
-
class
Arms.RestlessArm.
RestlessBinomial
(rewardFunction, draws=1)[source]¶ Bases:
Arms.RestlessArm.RestlessArm
-
__module__
= 'Arms.RestlessArm'¶
-
-
class
Arms.RestlessArm.
RestlessConstant
(rewardFunction)[source]¶ Bases:
Arms.RestlessArm.RestlessArm
-
__module__
= 'Arms.RestlessArm'¶
-
-
class
Arms.RestlessArm.
RestlessExponential
(rewardFunction)[source]¶ Bases:
Arms.RestlessArm.RestlessArm
-
__module__
= 'Arms.RestlessArm'¶
-
-
class
Arms.RestlessArm.
RestlessGaussian
(rewardFunction, sigma=1)[source]¶ Bases:
Arms.RestlessArm.RestlessArm
-
__module__
= 'Arms.RestlessArm'¶
-
Arms.UniformArm module¶
Uniformly distributed arm in [0, 1], or [lower, lower + amplitude].
Example of creating an arm:
>>> import random; import numpy as np
>>> random.seed(0); np.random.seed(0)
>>> Unif01 = UniformArm(0, 1)
>>> Unif01
U(0, 1)
>>> Unif01.mean
0.5
Examples of sampling from an arm:
>>> Unif01.draw() # doctest: +ELLIPSIS
0.8444...
>>> Unif01.draw_nparray(20) # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
array([0.54... , 0.71..., 0.60..., 0.54..., 0.42... ,
0.64..., 0.43..., 0.89... , 0.96..., 0.38...,
0.79..., 0.52..., 0.56..., 0.92..., 0.07...,
0.08... , 0.02... , 0.83..., 0.77..., 0.87...])
-
class
Arms.UniformArm.
UniformArm
(mini=0.0, maxi=1.0, mean=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Arms.Arm.Arm
Uniformly distributed arm, default in [0, 1],
- default to (mini, maxi),
- or [lower, lower + amplitude], if (lower=lower, amplitude=amplitude) is given.
>>> arm_0_1 = UniformArm() >>> arm_0_10 = UniformArm(0, 10) # maxi = 10 >>> arm_2_4 = UniformArm(2, 4) >>> arm_m10_10 = UniformArm(-10, 10) # also UniformArm(lower=-10, amplitude=20)
-
lower
= None¶ Lower value of rewards
-
amplitude
= None¶ Amplitude of rewards
-
mean
= None¶ Mean for this UniformArm arm
-
static
oneLR
(mumax, mu)[source]¶ One term of the Lai & Robbins lower bound for UniformArm arms: (mumax - mu) / KL(mu, mumax).
-
__module__
= 'Arms.UniformArm'¶
Arms.kullback module¶
Kullback-Leibler divergence functions and klUCB utilities.
- A faster implementation can be found in a C file, in Policies/C, and should be compiled to speed up computations.
- However, the versions here have examples and doctests, and are jit-compiled on the fly (with numba, cf. http://numba.pydata.org/).
- Cf. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- Reference: [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf) and [Garivier & Cappé, 2011](https://arxiv.org/pdf/1102.2490.pdf)
Warning
None of these functions are vectorized; they assume only one value for each argument.
If you want a vectorized function, use the numpy.vectorize wrapper:
>>> import numpy as np
>>> klBern_vect = np.vectorize(klBern)
>>> klBern_vect([0.1, 0.5, 0.9], 0.2) # doctest: +ELLIPSIS
array([0.036..., 0.223..., 1.145...])
>>> klBern_vect(0.4, [0.2, 0.3, 0.4]) # doctest: +ELLIPSIS
array([0.104..., 0.022..., 0...])
>>> klBern_vect([0.1, 0.5, 0.9], [0.2, 0.3, 0.4]) # doctest: +ELLIPSIS
array([0.036..., 0.087..., 0.550...])
For some functions, you would be better off writing a vectorized version manually, for instance if you want to fix a value of some optional parameters:
>>> # WARNING using np.vectorize gave weird result on klGauss
>>> # klGauss_vect = np.vectorize(klGauss, excluded="y")
>>> def klGauss_vect(xs, y, sig2x=0.25): # vectorized for first input only
... return np.array([klGauss(x, y, sig2x) for x in xs])
>>> klGauss_vect([-1, 0, 1], 0.1) # doctest: +ELLIPSIS
array([2.42, 0.02, 1.62])
-
Arms.kullback.
eps
= 1e-15¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]
-
Arms.kullback.
klBern
(x, y)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]>>> klBern(0.5, 0.5) 0.0 >>> klBern(0.1, 0.9) # doctest: +ELLIPSIS 1.757779... >>> klBern(0.9, 0.1) # And this KL is symmetric # doctest: +ELLIPSIS 1.757779... >>> klBern(0.4, 0.5) # doctest: +ELLIPSIS 0.020135... >>> klBern(0.01, 0.99) # doctest: +ELLIPSIS 4.503217...
- Special values:
>>> klBern(0, 1) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 34.539575...
-
Arms.kullback.
klBin
(x, y, n)[source]¶ Kullback-Leibler divergence for Binomial distributions. https://math.stackexchange.com/questions/320399/kullback-leibner-divergence-of-binomial-distributions
- It is simply the n times
klBern()
on x and y.
\[\mathrm{KL}(\mathrm{Bin}(x, n), \mathrm{Bin}(y, n)) = n \times \left(x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}) \right).\]Warning
The two distributions must have the same parameter n, and x, y are p, q in (0, 1).
>>> klBin(0.5, 0.5, 10) 0.0 >>> klBin(0.1, 0.9, 10) # doctest: +ELLIPSIS 17.57779... >>> klBin(0.9, 0.1, 10) # And this KL is symmetric # doctest: +ELLIPSIS 17.57779... >>> klBin(0.4, 0.5, 10) # doctest: +ELLIPSIS 0.20135... >>> klBin(0.01, 0.99, 10) # doctest: +ELLIPSIS 45.03217...
- Special values:
>>> klBin(0, 1, 10) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 345.39575...
-
Arms.kullback.
klPoisson
(x, y)[source]¶ Kullback-Leibler divergence for Poisson distributions. https://en.wikipedia.org/wiki/Poisson_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathrm{Poisson}(x), \mathrm{Poisson}(y)) = y - x + x \times \log(\frac{x}{y}).\]>>> klPoisson(3, 3) 0.0 >>> klPoisson(2, 1) # doctest: +ELLIPSIS 0.386294... >>> klPoisson(1, 2) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klPoisson(3, 6) # doctest: +ELLIPSIS 0.920558... >>> klPoisson(6, 8) # doctest: +ELLIPSIS 0.273907...
- Special values:
>>> klPoisson(1, 0) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 33.538776... >>> klPoisson(0, 0) 0.0
-
Arms.kullback.
klExp
(x, y)[source]¶ Kullback-Leibler divergence for exponential distributions. https://en.wikipedia.org/wiki/Exponential_distribution#Kullback.E2.80.93Leibler_divergence
\[\begin{split}\mathrm{KL}(\mathrm{Exp}(x), \mathrm{Exp}(y)) = \begin{cases} \frac{x}{y} - 1 - \log(\frac{x}{y}) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]>>> klExp(3, 3) 0.0 >>> klExp(3, 6) # doctest: +ELLIPSIS 0.193147... >>> klExp(1, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.193147... >>> klExp(2, 1) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klExp(4, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.306852... >>> klExp(6, 8) # doctest: +ELLIPSIS 0.037682...
- x, y have to be positive:
>>> klExp(-3, 2) inf >>> klExp(3, -2) inf >>> klExp(-3, -2) inf
-
Arms.kullback.
klGamma
(x, y, a=1)[source]¶ Kullback-Leibler divergence for gamma distributions. https://en.wikipedia.org/wiki/Gamma_distribution#Kullback.E2.80.93Leibler_divergence
- It is simply the a times
klExp()
on x and y.
\[\begin{split}\mathrm{KL}(\Gamma(x, a), \Gamma(y, a)) = \begin{cases} a \times \left( \frac{x}{y} - 1 - \log(\frac{x}{y}) \right) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]Warning
The two distributions must have the same parameter a.
>>> klGamma(3, 3) 0.0 >>> klGamma(3, 6) # doctest: +ELLIPSIS 0.193147... >>> klGamma(1, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.193147... >>> klGamma(2, 1) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klGamma(4, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.306852... >>> klGamma(6, 8) # doctest: +ELLIPSIS 0.037682...
- x, y have to be positive:
>>> klGamma(-3, 2) inf >>> klGamma(3, -2) inf >>> klGamma(-3, -2) inf
-
Arms.kullback.
klNegBin
(x, y, r=1)[source]¶ Kullback-Leibler divergence for negative binomial distributions. https://en.wikipedia.org/wiki/Negative_binomial_distribution
\[\mathrm{KL}(\mathrm{NegBin}(x, r), \mathrm{NegBin}(y, r)) = r \times \log((r + x) / (r + y)) - x \times \log(y \times (r + x) / (x \times (r + y))).\]Warning
The two distributions must have the same parameter r.
>>> klNegBin(0.5, 0.5) 0.0 >>> klNegBin(0.1, 0.9) # doctest: +ELLIPSIS -0.711611... >>> klNegBin(0.9, 0.1) # And this KL is non-symmetric # doctest: +ELLIPSIS 2.0321564... >>> klNegBin(0.4, 0.5) # doctest: +ELLIPSIS -0.130653... >>> klNegBin(0.01, 0.99) # doctest: +ELLIPSIS -0.717353...
- Special values:
>>> klBern(0, 1) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 34.539575...
- With other values for r:
>>> klNegBin(0.5, 0.5, r=2) 0.0 >>> klNegBin(0.1, 0.9, r=2) # doctest: +ELLIPSIS -0.832991... >>> klNegBin(0.1, 0.9, r=4) # doctest: +ELLIPSIS -0.914890... >>> klNegBin(0.9, 0.1, r=2) # And this KL is non-symmetric # doctest: +ELLIPSIS 2.3325528... >>> klNegBin(0.4, 0.5, r=2) # doctest: +ELLIPSIS -0.154572... >>> klNegBin(0.01, 0.99, r=2) # doctest: +ELLIPSIS -0.836257...
-
Arms.kullback.
klGauss
(x, y, sig2x=0.25, sig2y=None)[source]¶ Kullback-Leibler divergence for Gaussian distributions of means x and y and variances sig2x and sig2y, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties
- By default, sig2y is assumed to be sig2x (same variance).
Warning
The C version does not support different variances.
>>> klGauss(3, 3) 0.0 >>> klGauss(3, 6) 18.0 >>> klGauss(1, 2) 2.0 >>> klGauss(2, 1) # And this KL is symmetric 2.0 >>> klGauss(4, 2) 8.0 >>> klGauss(6, 8) 8.0
- x, y can be negative:
>>> klGauss(-3, 2) 50.0 >>> klGauss(3, -2) 50.0 >>> klGauss(-3, -2) 2.0 >>> klGauss(3, 2) 2.0
- With other values for sig2x:
>>> klGauss(3, 3, sig2x=10) 0.0 >>> klGauss(3, 6, sig2x=10) 0.45 >>> klGauss(1, 2, sig2x=10) 0.05 >>> klGauss(2, 1, sig2x=10) # And this KL is symmetric 0.05 >>> klGauss(4, 2, sig2x=10) 0.2 >>> klGauss(6, 8, sig2x=10) 0.2
- With different values for sig2x and sig2y:
>>> klGauss(0, 0, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS -0.0284... >>> klGauss(0, 0, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.2243... >>> klGauss(0, 0, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 1.1534...
>>> klGauss(0, 1, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS 0.9715... >>> klGauss(0, 1, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.7243... >>> klGauss(0, 1, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 3.1534...
>>> klGauss(1, 0, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS 0.9715... >>> klGauss(1, 0, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.7243... >>> klGauss(1, 0, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 3.1534...
Warning
Using
Policies.klUCB
(and variants) withklGauss()
is equivalent to usePolicies.UCB
, so prefer the simpler version.
-
Arms.kullback.
klucb
(x, d, kl, upperbound, precision=1e-06, lowerbound=-inf, max_iterations=50)[source]¶ The generic KL-UCB index computation.
x
: value of the cum reward,d
: upper bound on the divergence,kl
: the KL divergence to be used (klBern()
,klGauss()
, etc),upperbound
,lowerbound=float('-inf')
: the known bound of the valuesx
,precision=1e-6
: the threshold from where to stop the research,max_iterations=50
: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{klucb}(x, d) \simeq \sup_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) < d \}.\]Note
It uses a bisection search, and one call to
kl
for each step of the bisection search.For example, for
klucbBern()
, the two steps are to first compute an upperbound (as precise as possible) and then compute the kl-UCB index:>>> x, d = 0.9, 0.2 # mean x, exploration term d >>> upperbound = min(1., klucbGauss(x, d, sig2x=0.25)) # variance 1/4 for [0,1] bounded distributions >>> upperbound # doctest: +ELLIPSIS 1.0 >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=10) # doctest: +ELLIPSIS 0.9941... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=10) # doctest: +ELLIPSIS 0.9944... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=50) # doctest: +ELLIPSIS 0.9941... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=100) # more and more precise! # doctest: +ELLIPSIS 0.994489...
Note
See below for more examples for different KL divergence functions.
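To connect this generic routine to how an index policy uses it: for an arm with empirical mean x (cumulated reward divided by the number of pulls N) at time t, the kl-UCB index is essentially klucb(x, f(t)/N, kl, upperbound) with the classical choice f(t) = log(t). Here is a hedged sketch for Bernoulli rewards using klucbBern() (documented just below); the import path follows the module path documented on this page, adapt it if your installation exposes these functions elsewhere (e.g. Policies.kullback).
from math import log
from SMPyBandits.Arms.kullback import klucbBern   # path as documented on this page

def klucb_index(sum_rewards, pulls, t):
    """kl-UCB index of one arm for Bernoulli rewards (sketch, with the classical f(t) = log(t))."""
    if pulls == 0:
        return float('+inf')          # unexplored arms get an infinite index
    mean = sum_rewards / pulls
    return klucbBern(mean, log(t) / pulls)

# Example: an arm pulled 20 times with 13 successes, evaluated at time t = 100
print(klucb_index(13.0, 20, 100))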
-
Arms.kullback.
klucbBern
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Bernoulli distributions, using
klucb()
.- Influence of x:
>>> klucbBern(0.1, 0.2) # doctest: +ELLIPSIS 0.378391... >>> klucbBern(0.5, 0.2) # doctest: +ELLIPSIS 0.787088... >>> klucbBern(0.9, 0.2) # doctest: +ELLIPSIS 0.994489...
- Influence of d:
>>> klucbBern(0.1, 0.4) # doctest: +ELLIPSIS 0.519475... >>> klucbBern(0.1, 0.9) # doctest: +ELLIPSIS 0.734714...
>>> klucbBern(0.5, 0.4) # doctest: +ELLIPSIS 0.871035... >>> klucbBern(0.5, 0.9) # doctest: +ELLIPSIS 0.956809...
>>> klucbBern(0.9, 0.4) # doctest: +ELLIPSIS 0.999285... >>> klucbBern(0.9, 0.9) # doctest: +ELLIPSIS 0.999995...
-
Arms.kullback.
klucbGauss
(x, d, sig2x=0.25, precision=0.0)[source]¶ KL-UCB index computation for Gaussian distributions.
- Note that it does not require any search.
Warning
it works only if the good variance constant is given.
- Influence of x:
>>> klucbGauss(0.1, 0.2) # doctest: +ELLIPSIS 0.416227... >>> klucbGauss(0.5, 0.2) # doctest: +ELLIPSIS 0.816227... >>> klucbGauss(0.9, 0.2) # doctest: +ELLIPSIS 1.216227...
- Influence of d:
>>> klucbGauss(0.1, 0.4) # doctest: +ELLIPSIS 0.547213... >>> klucbGauss(0.1, 0.9) # doctest: +ELLIPSIS 0.770820...
>>> klucbGauss(0.5, 0.4) # doctest: +ELLIPSIS 0.947213... >>> klucbGauss(0.5, 0.9) # doctest: +ELLIPSIS 1.170820...
>>> klucbGauss(0.9, 0.4) # doctest: +ELLIPSIS 1.347213... >>> klucbGauss(0.9, 0.9) # doctest: +ELLIPSIS 1.570820...
Warning
Using
Policies.klUCB
(and variants) withklucbGauss()
is equivalent to usePolicies.UCB
, so prefer the simpler version.
-
Arms.kullback.
klucbPoisson
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Poisson distributions, using
klucb()
.- Influence of x:
>>> klucbPoisson(0.1, 0.2) # doctest: +ELLIPSIS 0.450523... >>> klucbPoisson(0.5, 0.2) # doctest: +ELLIPSIS 1.089376... >>> klucbPoisson(0.9, 0.2) # doctest: +ELLIPSIS 1.640112...
- Influence of d:
>>> klucbPoisson(0.1, 0.4) # doctest: +ELLIPSIS 0.693684... >>> klucbPoisson(0.1, 0.9) # doctest: +ELLIPSIS 1.252796...
>>> klucbPoisson(0.5, 0.4) # doctest: +ELLIPSIS 1.422933... >>> klucbPoisson(0.5, 0.9) # doctest: +ELLIPSIS 2.122985...
>>> klucbPoisson(0.9, 0.4) # doctest: +ELLIPSIS 2.033691... >>> klucbPoisson(0.9, 0.9) # doctest: +ELLIPSIS 2.831573...
-
Arms.kullback.
klucbExp
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for exponential distributions, using
klucb()
.- Influence of x:
>>> klucbExp(0.1, 0.2) # doctest: +ELLIPSIS 0.202741... >>> klucbExp(0.5, 0.2) # doctest: +ELLIPSIS 1.013706... >>> klucbExp(0.9, 0.2) # doctest: +ELLIPSIS 1.824671...
- Influence of d:
>>> klucbExp(0.1, 0.4) # doctest: +ELLIPSIS 0.285792... >>> klucbExp(0.1, 0.9) # doctest: +ELLIPSIS 0.559088...
>>> klucbExp(0.5, 0.4) # doctest: +ELLIPSIS 1.428962... >>> klucbExp(0.5, 0.9) # doctest: +ELLIPSIS 2.795442...
>>> klucbExp(0.9, 0.4) # doctest: +ELLIPSIS 2.572132... >>> klucbExp(0.9, 0.9) # doctest: +ELLIPSIS 5.031795...
-
Arms.kullback.
klucbGamma
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Gamma distributions, using
klucb()
.- Influence of x:
>>> klucbGamma(0.1, 0.2) # doctest: +ELLIPSIS 0.202... >>> klucbGamma(0.5, 0.2) # doctest: +ELLIPSIS 1.013... >>> klucbGamma(0.9, 0.2) # doctest: +ELLIPSIS 1.824...
- Influence of d:
>>> klucbGamma(0.1, 0.4) # doctest: +ELLIPSIS 0.285... >>> klucbGamma(0.1, 0.9) # doctest: +ELLIPSIS 0.559...
>>> klucbGamma(0.5, 0.4) # doctest: +ELLIPSIS 1.428... >>> klucbGamma(0.5, 0.9) # doctest: +ELLIPSIS 2.795...
>>> klucbGamma(0.9, 0.4) # doctest: +ELLIPSIS 2.572... >>> klucbGamma(0.9, 0.9) # doctest: +ELLIPSIS 5.031...
-
Arms.kullback.
kllcb
(x, d, kl, lowerbound, precision=1e-06, upperbound=inf, max_iterations=50)[source]¶ The generic KL-LCB index computation.
x
: value of the cum reward,d
: upper bound on the divergence,kl
: the KL divergence to be used (klBern()
,klGauss()
, etc),lowerbound
,upperbound=float('inf')
: the known bounds on the value x
,precision=1e-6
: the threshold at which to stop the search,max_iterations=50
: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{kllcb}(x, d) \simeq \inf_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) < d \}.\]Note
It uses a bisection search, and one call to
kl
for each step of the bisection search.For example, for
kllcbBern()
, the two steps are to first compute a lowerbound (as precise as possible) and then compute the kl-LCB index:>>> x, d = 0.9, 0.2 # mean x, exploration term d >>> lowerbound = max(0., kllcbGauss(x, d, sig2x=0.25)) # variance 1/4 for [0,1] bounded distributions >>> lowerbound # doctest: +ELLIPSIS 0.5837... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=10) # doctest: +ELLIPSIS 0.29... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=10) # doctest: +ELLIPSIS 0.29188... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=50) # doctest: +ELLIPSIS 0.291886... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=100) # more and more precise! # doctest: +ELLIPSIS 0.29188611...
Note
See below for more examples for different KL divergence functions.
-
Arms.kullback.
kllcbBern
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for Bernoulli distributions, using
kllcb()
.- Influence of x:
>>> kllcbBern(0.1, 0.2) # doctest: +ELLIPSIS 0.09999... >>> kllcbBern(0.5, 0.2) # doctest: +ELLIPSIS 0.49999... >>> kllcbBern(0.9, 0.2) # doctest: +ELLIPSIS 0.89999...
- Influence of d:
>>> kllcbBern(0.1, 0.4) # doctest: +ELLIPSIS 0.09999... >>> kllcbBern(0.1, 0.9) # doctest: +ELLIPSIS 0.09999...
>>> kllcbBern(0.5, 0.4) # doctest: +ELLIPSIS 0.4999... >>> kllcbBern(0.5, 0.9) # doctest: +ELLIPSIS 0.4999...
>>> kllcbBern(0.9, 0.4) # doctest: +ELLIPSIS 0.8999... >>> kllcbBern(0.9, 0.9) # doctest: +ELLIPSIS 0.8999...
-
Arms.kullback.
kllcbGauss
(x, d, sig2x=0.25, precision=0.0)[source]¶ KL-LCB index computation for Gaussian distributions.
- Note that it does not require any search.
Warning
it works only if the correct variance constant is given.
- Influence of x:
>>> kllcbGauss(0.1, 0.2) # doctest: +ELLIPSIS -0.21622... >>> kllcbGauss(0.5, 0.2) # doctest: +ELLIPSIS 0.18377... >>> kllcbGauss(0.9, 0.2) # doctest: +ELLIPSIS 0.58377...
- Influence of d:
>>> kllcbGauss(0.1, 0.4) # doctest: +ELLIPSIS -0.3472... >>> kllcbGauss(0.1, 0.9) # doctest: +ELLIPSIS -0.5708...
>>> kllcbGauss(0.5, 0.4) # doctest: +ELLIPSIS 0.0527... >>> kllcbGauss(0.5, 0.9) # doctest: +ELLIPSIS -0.1708...
>>> kllcbGauss(0.9, 0.4) # doctest: +ELLIPSIS 0.4527... >>> kllcbGauss(0.9, 0.9) # doctest: +ELLIPSIS 0.2291...
Warning
Using
Policies.kllCB
(and variants) withkllcbGauss()
is equivalent to using Policies.UCB
, so prefer the simpler version.
-
Arms.kullback.
kllcbPoisson
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for Poisson distributions, using
kllcb()
.- Influence of x:
>>> kllcbPoisson(0.1, 0.2) # doctest: +ELLIPSIS 0.09999... >>> kllcbPoisson(0.5, 0.2) # doctest: +ELLIPSIS 0.49999... >>> kllcbPoisson(0.9, 0.2) # doctest: +ELLIPSIS 0.89999...
- Influence of d:
>>> kllcbPoisson(0.1, 0.4) # doctest: +ELLIPSIS 0.09999... >>> kllcbPoisson(0.1, 0.9) # doctest: +ELLIPSIS 0.09999...
>>> kllcbPoisson(0.5, 0.4) # doctest: +ELLIPSIS 0.49999... >>> kllcbPoisson(0.5, 0.9) # doctest: +ELLIPSIS 0.49999...
>>> kllcbPoisson(0.9, 0.4) # doctest: +ELLIPSIS 0.89999... >>> kllcbPoisson(0.9, 0.9) # doctest: +ELLIPSIS 0.89999...
-
Arms.kullback.
kllcbExp
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for exponential distributions, using
kllcb()
.- Influence of x:
>>> kllcbExp(0.1, 0.2) # doctest: +ELLIPSIS 0.15267... >>> kllcbExp(0.5, 0.2) # doctest: +ELLIPSIS 0.7633... >>> kllcbExp(0.9, 0.2) # doctest: +ELLIPSIS 1.3740...
- Influence of d:
>>> kllcbExp(0.1, 0.4) # doctest: +ELLIPSIS 0.2000... >>> kllcbExp(0.1, 0.9) # doctest: +ELLIPSIS 0.3842...
>>> kllcbExp(0.5, 0.4) # doctest: +ELLIPSIS 1.0000... >>> kllcbExp(0.5, 0.9) # doctest: +ELLIPSIS 1.9214...
>>> kllcbExp(0.9, 0.4) # doctest: +ELLIPSIS 1.8000... >>> kllcbExp(0.9, 0.9) # doctest: +ELLIPSIS 3.4586...
-
Arms.kullback.
maxEV
(p, V, klMax)[source]¶ Maximize expectation of \(V\) with respect to \(q\) s.t. \(\mathrm{KL}(p, q) < \text{klMax}\).
- Input args.: p, V, klMax.
- Reference: Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf).
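For intuition, here is a small illustrative re-implementation of the same optimization with a generic solver (this is not the module's maxEV(), which relies on the dedicated procedure of the reference above, but it solves the same problem: maximize \(q \cdot V\) over the probability simplex subject to \(\mathrm{KL}(p, q) \leq \text{klMax}\)):

import numpy as np
from scipy.optimize import minimize

def maxEV_sketch(p, V, klMax):
    """Maximize np.dot(q, V) over probability vectors q such that KL(p, q) <= klMax."""
    p = np.asarray(p, dtype=float)
    V = np.asarray(V, dtype=float)

    def kl(q):  # KL(p, q) = sum_i p_i * log(p_i / q_i), with the convention 0 * log(0) = 0
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-15)))

    constraints = [
        {'type': 'eq', 'fun': lambda q: np.sum(q) - 1.0},   # q is a probability vector
        {'type': 'ineq', 'fun': lambda q: klMax - kl(q)},   # KL(p, q) <= klMax
    ]
    result = minimize(lambda q: -np.dot(q, V), p, bounds=[(1e-12, 1.0)] * len(p),
                      constraints=constraints, method='SLSQP')
    return result.x

q = maxEV_sketch([0.4, 0.3, 0.3], [0.0, 0.5, 1.0], 0.1)
print(q, np.dot(q, [0.0, 0.5, 1.0]))  # q shifts probability mass towards the largest values of V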
-
Arms.kullback.
reseqp
(p, V, klMax, max_iterations=50)[source]¶ Solve
f(reseqp(p, V, klMax)) = klMax
, using Newton method.Note
This is a subroutine of
maxEV()
.- Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf).
Warning
np.dot is very slow!
-
Arms.kullback.
reseqp2
(p, V, klMax)[source]¶ Solve f(reseqp(p, V, klMax)) = klMax, using a blackbox minimizer, from scipy.optimize.
- FIXME it does not work well yet!
Note
This is a subroutine of
maxEV()
.- Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011].
Warning
np.dot is very slow!
Environment package¶
Environment
module:
MAB
,MarkovianMAB
,ChangingAtEachRepMAB
,IncreasingMAB
,PieceWiseStationaryMAB
,NonStationaryMAB
objects, used to wrap the problems (essentially a list of arms).Result
andResultMultiPlayers
objects, used to wrap simulation results (list of decisions and rewards).Evaluator
environment, used to wrap simulation, for the single player case.EvaluatorMultiPlayers
environment, used to wrap simulation, for the multi-players case.EvaluatorSparseMultiPlayers
environment, used to wrap simulation, for the multi-players case with sparse activated players.CollisionModels
implements different collision models.
And useful constants and functions for the plotting and stuff:
DPI
,signature()
,maximizeWindow()
,palette()
,makemarkers()
,wraptext()
: for plotting,notify()
: send a desktop notification,Parallel()
,delayed()
: joblib related,tqdm
: pretty range() loops,sortedDistance
,fairnessMeasures
: science related,getCurrentMemory()
,sizeof_fmt()
: to measure and pretty print memory consumption.
Submodules¶
Environment.CollisionModels module¶
Define the different collision models.
Collision models are generic functions, taking:
- the time:
t
- the arms of the current environment:
arms
- the list of players:
players
- the numpy array of their choices:
choices
- the numpy array to store their rewards:
rewards
- the numpy array to store their pulls:
pulls
- the numpy array to store their collisions:
collisions
As of now, there are 4 different collision models implemented (a small illustrative sketch of the default model is given right after this list):
noCollision()
: simple collision model where all players sample it and receive the reward.onlyUniqUserGetsReward()
: simple collision model, where only the players alone on one arm sample it and receive the reward (default).rewardIsSharedUniformly()
: in case of more than one player on one arm, only one player (uniform choice) can sample it and receive the reward.closerUserGetsReward()
: in case of more than one player on one arm, only the closer player can sample it and receive the reward. It can take, or create if not given, a random distance of each player to the base station (random number in [0, 1]).
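To make the default rule concrete, here is a self-contained sketch of the “only the unique user gets the reward” behaviour (an illustration of the rule described above, not the package's own onlyUniqUserGetsReward(); the array shapes rewards[player], pulls[player, arm] and collisions[arm] are assumptions for this sketch):

import numpy as np

def only_uniq_user_gets_reward_sketch(t, arms, players, choices, rewards, pulls, collisions):
    """An arm chosen by exactly one player is sampled and its reward given to that player;
    every player involved in a collision gets nothing and the collision counter of that
    arm is increased (once per colliding player, so it is not binary)."""
    nb_times_chosen = np.bincount(choices, minlength=len(arms))
    for playerId, armId in enumerate(choices):
        if nb_times_chosen[armId] == 1:   # alone on this arm: sample it and keep the reward
            rewards[playerId] = arms[armId].draw(t)
            pulls[playerId, armId] += 1
        else:                             # collision: no reward for anyone on this arm
            rewards[playerId] = 0.0
            collisions[armId] += 1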
-
Environment.CollisionModels.
onlyUniqUserGetsReward
(t, arms, players, choices, rewards, pulls, collisions)[source]¶ Simple collision model where only the players alone on one arm sample it and receive the reward.
- This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
- The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
-
Environment.CollisionModels.
defaultCollisionModel
(t, arms, players, choices, rewards, pulls, collisions)¶ Simple collision model where only the players alone on one arm sample it and receive the reward.
- This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
- The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
-
Environment.CollisionModels.
onlyUniqUserGetsRewardSparse
(t, arms, players, choices, rewards, pulls, collisions)[source]¶ Simple collision model where only the players alone on one arm sample it and receive the reward.
- This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).
- The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
- Supports non-activated players, by choosing a negative index.
-
Environment.CollisionModels.
allGetRewardsAndUseCollision
(t, arms, players, choices, rewards, pulls, collisions)[source]¶ A variant of the first simple collision model where all players sample their arm, receive their rewards, and are informed of the collisions.
Note
it is NOT the one we consider, and so our lower-bound on centralized regret is wrong (users don’t care about collisions for their internal rewards so regret does not take collisions into account!)
- This is NOT the default collision model, cf. [Liu & Zhao, 2009](https://arxiv.org/abs/0910.2065v3) collision model 1.
- The numpy array ‘collisions’ is increased according to the number of users who collided (it is NOT binary).
-
Environment.CollisionModels.
noCollision
(t, arms, players, choices, rewards, pulls, collisions)[source]¶ Simple collision model where all players sample it and receive the reward.
- It corresponds to the single-player simulation: each player is a policy, compared without collision.
- The numpy array ‘collisions’ is not modified.
-
Environment.CollisionModels.
rewardIsSharedUniformly
(t, arms, players, choices, rewards, pulls, collisions)[source]¶ Less simple collision model where:
- The players alone on one arm sample it and receive the reward.
- In case of more than one player on one arm, only one player (uniform choice) can sample it and receive the reward. It is chosen by the base station.
Note
it can also model a choice from the users’ point of view: in a time frame (eg. 1 second), when there is a collision, each colliding user chooses (uniformly) a random small time offset (eg. 20 ms), and starts sensing + emitting again after that time. The first one to sense is alone, so it transmits, and the next ones find the channel in use when sensing. So only one player is transmitting, and from the base station point of view, it is the same as if it was chosen uniformly among the colliding users.
-
Environment.CollisionModels.
closerUserGetsReward
(t, arms, players, choices, rewards, pulls, collisions, distances='uniform')[source]¶ Simple collision model where:
- The players alone on one arm sample it and receive the reward.
- In case of more than one player on one arm, only the closer player can sample it and receive the reward. It can take, or create if not given, a distance of each player to the base station (numbers in [0, 1]).
- If distances is not given, it is either generated randomly (random numbers in [0, 1]) or is a linspace of nbPlayers values in (0, 1), equally spaced (default).
Note
This kind of effect is known in telecommunications as the Near-Far effect or the Capture effect, cf. [Roberts, 1975](https://dl.acm.org/citation.cfm?id=1024920).
-
Environment.CollisionModels.
collision_models
= [<function onlyUniqUserGetsReward>, <function onlyUniqUserGetsRewardSparse>, <function allGetRewardsAndUseCollision>, <function noCollision>, <function rewardIsSharedUniformly>, <function closerUserGetsReward>]¶ List of possible collision models
-
Environment.CollisionModels.
full_lost_if_collision
= {'allGetRewardsAndUseCollision': True, 'closerUserGetsReward': False, 'noCollision': False, 'onlyUniqUserGetsReward': True, 'onlyUniqUserGetsRewardSparse': True, 'rewardIsSharedUniformly': False}¶ Mapping of collision model names to True or False, to know if a collision implies a lost communication or not in this model
Environment.Evaluator module¶
Evaluator class to wrap and run the simulations. Lots of plotting methods, to have various visualizations.
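As a quick reminder of how this class is typically driven (a minimal sketch: the configuration keys 'horizon', 'repetitions', 'n_jobs', 'environment', 'policies' and 'archtype' follow the usual SMPyBandits configuration files, and the import paths simply mirror the module names of this documentation; adapt both to your installation):

from Arms import Bernoulli
from Policies import UCB, klUCB, Thompson
from Environment.Evaluator import Evaluator

configuration = {
    "horizon": 10000,        # time horizon T
    "repetitions": 4,        # number of Monte Carlo repetitions
    "n_jobs": 1,             # joblib parallelism over the repetitions
    "environment": [         # one MAB problem with three Bernoulli arms
        {"arm_type": Bernoulli, "params": [0.1, 0.5, 0.9]},
    ],
    "policies": [            # one entry per algorithm to compare
        {"archtype": UCB, "params": {}},
        {"archtype": klUCB, "params": {}},
        {"archtype": Thompson, "params": {}},
    ],
}

evaluation = Evaluator(configuration)
evaluation.startAllEnv()          # run all repetitions on all environments
evaluation.printFinalRanking()    # final ranking of the policies
evaluation.plotRegrets(envId=0)   # centralized cumulated regret plot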
-
Environment.Evaluator.
USE_PICKLE
= False¶ Should we save the figure objects to a .pickle file at the end of the simulation?
-
Environment.Evaluator.
REPETITIONS
= 1¶ Default nb of repetitions
-
Environment.Evaluator.
DELTA_T_PLOT
= 50¶ Default sampling rate for plotting
-
Environment.Evaluator.
plot_lowerbound
= True¶ Default is to plot the lower-bound
-
Environment.Evaluator.
USE_BOX_PLOT
= True¶ True to use boxplot, False to use violinplot.
-
Environment.Evaluator.
random_shuffle
= False¶ Use basic random events of shuffling the arms?
-
Environment.Evaluator.
random_invert
= False¶ Use basic random events of inverting the arms?
-
Environment.Evaluator.
nb_break_points
= 0¶ Default nb of random events
-
Environment.Evaluator.
STORE_ALL_REWARDS
= False¶ Store all rewards?
-
Environment.Evaluator.
STORE_REWARDS_SQUARED
= False¶ Store rewards squared?
-
Environment.Evaluator.
MORE_ACCURATE
= True¶ Use the count of selections instead of rewards for a more accurate mean/var reward measure.
-
Environment.Evaluator.
FINAL_RANKS_ON_AVERAGE
= True¶ Final ranks are printed based on the average of the last 1% of rewards, and not only on the very last rewards
-
Environment.Evaluator.
USE_JOBLIB_FOR_POLICIES
= False¶ Don’t use joblib to parallelize the simulations on various policies (we parallelize the random Monte Carlo repetitions)
-
class
Environment.Evaluator.
Evaluator
(configuration, finalRanksOnAverage=True, averageOn=0.005, useJoblibForPolicies=False, moreAccurate=True)[source]¶ Bases:
object
Evaluator class to run the simulations.
-
__init__
(configuration, finalRanksOnAverage=True, averageOn=0.005, useJoblibForPolicies=False, moreAccurate=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
cfg
= None¶ Configuration dictionary
-
nbPolicies
= None¶ Number of policies
-
horizon
= None¶ Horizon (number of time steps)
-
repetitions
= None¶ Number of repetitions
-
delta_t_plot
= None¶ Sampling rate for plotting
-
random_shuffle
= None¶ Random shuffling of arms?
-
random_invert
= None¶ Random inversion of arms?
-
nb_break_points
= None¶ How many random events?
-
plot_lowerbound
= None¶ Should we plot the lower-bound?
-
moreAccurate
= None¶ Use the count of selections instead of rewards for a more accurate mean/var reward measure.
-
finalRanksOnAverage
= None¶ Final display of ranks are done on average rewards?
-
averageOn
= None¶ How many last steps for final rank average rewards
-
useJoblibForPolicies
= None¶ Use joblib to parallelize for loop on policies (useless)
-
useJoblib
= None¶ Use joblib to parallelize for loop on repetitions (useful)
-
cache_rewards
= None¶ Should we cache and precompute rewards
-
environment_bayesian
= None¶ Is the environment Bayesian?
-
showplot
= None¶ Show the plot (interactive display or not)
-
use_box_plot
= None¶ To use box plot (or violin plot if False). Force to use boxplot if repetitions=1.
-
change_labels
= None¶ Possibly empty dictionary to map ‘policyId’ to new labels (overwrite their name).
-
append_labels
= None¶ Possibly empty dictionary to map ‘policyId’ to new labels (by appending the result from ‘append_labels’).
-
envs
= None¶ List of environments
-
policies
= None¶ List of policies
-
rewards
= None¶ For each env, history of rewards, ie accumulated rewards
-
lastCumRewards
= None¶ For each env, last accumulated rewards, to compute variance and histogram of whole regret R_T
-
minCumRewards
= None¶ For each env, history of minimum of rewards, to compute amplitude (+- STD)
-
maxCumRewards
= None¶ For each env, history of maximum of rewards, to compute amplitude (+- STD)
-
rewardsSquared
= None¶ For each env, history of rewards squared
-
allRewards
= None¶ For each env, full history of rewards
-
bestArmPulls
= None¶ For each env, keep the history of best arm pulls
-
pulls
= None¶ For each env, keep cumulative counts of all arm pulls
-
allPulls
= None¶ For each env, keep cumulative counts of all arm pulls
-
lastPulls
= None¶ For each env, keep cumulative counts of all arm pulls
-
runningTimes
= None¶ For each env, keep the history of running times
-
memoryConsumption
= None¶ For each env, keep the history of memory consumption
-
numberOfCPDetections
= None¶ For each env, store the number of change-point detections by each algorithm, to print its average at the end (to check if a certain Change-Point detector algorithm detects too few or too many changes).
-
compute_cache_rewards
(arms)[source]¶ Compute only once the rewards, then launch the experiments with the same matrix (r_{k,t}).
-
saveondisk
(filepath='saveondisk_Evaluator.hdf5')[source]¶ Save the content of the internal data into an HDF5 file on the disk.
- See http://docs.h5py.org/en/stable/quick.html if needed.
-
getCumulatedRegret_LessAccurate
(policyId, envId=0)[source]¶ Compute cumulative regret, based on accumulated rewards.
-
getCumulatedRegret_MoreAccurate
(policyId, envId=0)[source]¶ Compute cumulative regret, based on counts of selections and not actual rewards.
-
getCumulatedRegret
(policyId, envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
getLastRegrets_LessAccurate
(policyId, envId=0)[source]¶ Extract last regrets, based on accumulated rewards.
-
getLastRegrets_MoreAccurate
(policyId, envId=0)[source]¶ Extract last regrets, based on counts of selections and not actual rewards.
-
getLastRegrets
(policyId, envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
getAverageRewards
(policyId, envId=0)[source]¶ Extract mean rewards (not raw rewards but cumsum(rewards)/cumsum(1)).
-
getSTDRegret
(policyId, envId=0, meanReward=False)[source]¶ Extract standard deviation of rewards.
Warning
FIXME experimental!
-
getMaxMinReward
(policyId, envId=0)[source]¶ Extract amplitude of rewards as maxCumRewards - minCumRewards.
-
getRunningTimes
(envId=0)[source]¶ Get the means and stds and list of running time of the different policies.
-
getMemoryConsumption
(envId=0)[source]¶ Get the means and stds and list of memory consumptions of the different policies.
-
getNumberOfCPDetections
(envId=0)[source]¶ Get the means and stds and list of numberOfCPDetections of the different policies.
-
printFinalRanking
(envId=0, moreAccurate=None)[source]¶ Print the final ranking of the different policies.
-
_xlabel
(envId, *args, **kwargs)[source]¶ Add xlabel to the plot, and if the environment has change-point, draw vertical lines to clearly identify the locations of the change points.
-
plotRegrets
(envId=0, savefig=None, meanReward=False, plotSTD=False, plotMaxMin=False, semilogx=False, semilogy=False, loglog=False, normalizedRegret=False, drawUpperBound=False, moreAccurate=None)[source]¶ Plot the centralized cumulated regret, support more than one environments (use evaluators to give a list of other environments).
-
plotBestArmPulls
(envId, savefig=None)[source]¶ Plot the frequency of pulls of the best channel.
- Warning: does not adapt to dynamic settings!
-
printRunningTimes
(envId=0, precision=3)[source]¶ Print the average+-std running time of the different policies.
-
plotRunningTimes
(envId=0, savefig=None, base=1, unit='seconds')[source]¶ Plot the running times of the different policies, as a box plot for each.
-
printMemoryConsumption
(envId=0)[source]¶ Print the average+-std memory consumption of the different policies.
-
plotMemoryConsumption
(envId=0, savefig=None, base=1024, unit='KiB')[source]¶ Plot the memory consumption of the different policies, as a box plot for each.
-
printNumberOfCPDetections
(envId=0)[source]¶ Print the average+-std number_of_cp_detections of the different policies.
-
plotNumberOfCPDetections
(envId=0, savefig=None)[source]¶ Plot the number of change-point detections of the different policies, as a box plot for each.
-
__dict__
= mappingproxy({'__module__': 'Environment.Evaluator', '__doc__': ' Evaluator class to run the simulations.', '__init__': <function Evaluator.__init__>, '__initEnvironments__': <function Evaluator.__initEnvironments__>, '__initPolicies__': <function Evaluator.__initPolicies__>, 'compute_cache_rewards': <function Evaluator.compute_cache_rewards>, 'startAllEnv': <function Evaluator.startAllEnv>, 'startOneEnv': <function Evaluator.startOneEnv>, 'saveondisk': <function Evaluator.saveondisk>, 'getPulls': <function Evaluator.getPulls>, 'getBestArmPulls': <function Evaluator.getBestArmPulls>, 'getRewards': <function Evaluator.getRewards>, 'getAverageWeightedSelections': <function Evaluator.getAverageWeightedSelections>, 'getMaxRewards': <function Evaluator.getMaxRewards>, 'getCumulatedRegret_LessAccurate': <function Evaluator.getCumulatedRegret_LessAccurate>, 'getCumulatedRegret_MoreAccurate': <function Evaluator.getCumulatedRegret_MoreAccurate>, 'getCumulatedRegret': <function Evaluator.getCumulatedRegret>, 'getLastRegrets_LessAccurate': <function Evaluator.getLastRegrets_LessAccurate>, 'getAllLastWeightedSelections': <function Evaluator.getAllLastWeightedSelections>, 'getLastRegrets_MoreAccurate': <function Evaluator.getLastRegrets_MoreAccurate>, 'getLastRegrets': <function Evaluator.getLastRegrets>, 'getAverageRewards': <function Evaluator.getAverageRewards>, 'getRewardsSquared': <function Evaluator.getRewardsSquared>, 'getSTDRegret': <function Evaluator.getSTDRegret>, 'getMaxMinReward': <function Evaluator.getMaxMinReward>, 'getRunningTimes': <function Evaluator.getRunningTimes>, 'getMemoryConsumption': <function Evaluator.getMemoryConsumption>, 'getNumberOfCPDetections': <function Evaluator.getNumberOfCPDetections>, 'printFinalRanking': <function Evaluator.printFinalRanking>, '_xlabel': <function Evaluator._xlabel>, 'plotRegrets': <function Evaluator.plotRegrets>, 'plotBestArmPulls': <function Evaluator.plotBestArmPulls>, 'printRunningTimes': <function Evaluator.printRunningTimes>, 'plotRunningTimes': <function Evaluator.plotRunningTimes>, 'printMemoryConsumption': <function Evaluator.printMemoryConsumption>, 'plotMemoryConsumption': <function Evaluator.plotMemoryConsumption>, 'printNumberOfCPDetections': <function Evaluator.printNumberOfCPDetections>, 'plotNumberOfCPDetections': <function Evaluator.plotNumberOfCPDetections>, 'printLastRegrets': <function Evaluator.printLastRegrets>, 'plotLastRegrets': <function Evaluator.plotLastRegrets>, 'plotHistoryOfMeans': <function Evaluator.plotHistoryOfMeans>, '__dict__': <attribute '__dict__' of 'Evaluator' objects>, '__weakref__': <attribute '__weakref__' of 'Evaluator' objects>})¶
-
__module__
= 'Environment.Evaluator'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
printLastRegrets
(envId=0, moreAccurate=False)[source]¶ Print the last regrets of the different policies.
-
-
Environment.Evaluator.
delayed_play
(env, policy, horizon, random_shuffle=False, random_invert=False, nb_break_points=0, seed=None, allrewards=None, repeatId=0, useJoblib=False)[source]¶ Helper function for the parallelization.
-
Environment.Evaluator.
EvaluatorFromDisk
(filepath='/tmp/saveondiskEvaluator.hdf5')[source]¶ Create a new Evaluator object from the HDF5 file given in argument.
-
Environment.Evaluator.
shuffled
(mylist)[source]¶ Returns a shuffled copy of the input 1D list. sorted() exists as a non-mutating counterpart of list.sort(), but there is no shuffled() counterpart to random.shuffle()…
>>> from random import seed; seed(1234) # reproducible results >>> mylist = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] >>> shuffled(mylist) [0.9, 0.4, 0.3, 0.6, 0.5, 0.7, 0.1, 0.2, 0.8] >>> shuffled(mylist) [0.4, 0.3, 0.7, 0.5, 0.8, 0.1, 0.9, 0.6, 0.2] >>> shuffled(mylist) [0.4, 0.6, 0.9, 0.5, 0.7, 0.2, 0.1, 0.3, 0.8] >>> shuffled(mylist) [0.8, 0.7, 0.3, 0.1, 0.9, 0.5, 0.6, 0.2, 0.4]
Environment.EvaluatorMultiPlayers module¶
EvaluatorMultiPlayers class to wrap and run the simulations, for the multi-players case. Lots of plotting methods, to have various visualizations. See documentation.
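A similar minimal sketch for the multi-players case (the 'players' and 'collisionModel' entries mirror the attributes documented below, but the exact configuration keys, the rhoRand argument order and the import paths are assumptions; check the multi-players example configurations of the package):

from Arms import Bernoulli
from Policies import UCB
from PoliciesMultiPlayers import rhoRand
from Environment.CollisionModels import onlyUniqUserGetsReward
from Environment.EvaluatorMultiPlayers import EvaluatorMultiPlayers

NB_PLAYERS, NB_ARMS = 3, 5
configuration = {
    "horizon": 10000,
    "repetitions": 4,
    "n_jobs": 1,
    "environment": [{"arm_type": Bernoulli, "params": [0.1, 0.3, 0.5, 0.7, 0.9]}],
    "collisionModel": onlyUniqUserGetsReward,   # the default collision model (see above)
    # One decentralized player per user, each running UCB with a random rank (rhoRand);
    # the (nbPlayers, nbArms, playerAlgo) argument order is an assumption.
    "players": rhoRand(NB_PLAYERS, NB_ARMS, UCB).children,
}

evaluation = EvaluatorMultiPlayers(configuration)
evaluation.startAllEnv()
evaluation.printFinalRanking()
evaluation.plotRegretCentralized(envId=0)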
-
Environment.EvaluatorMultiPlayers.
USE_PICKLE
= False¶ Should we save the figure objects to a .pickle file at the end of the simulation?
-
Environment.EvaluatorMultiPlayers.
REPETITIONS
= 1¶ Default nb of repetitions
-
Environment.EvaluatorMultiPlayers.
DELTA_T_PLOT
= 50¶ Default sampling rate for plotting
-
Environment.EvaluatorMultiPlayers.
COUNT_RANKS_MARKOV_CHAIN
= False¶ If true, count and then print a lot of statistics for the Markov Chain of the underlying configurations on ranks
-
Environment.EvaluatorMultiPlayers.
MORE_ACCURATE
= True¶ Use the count of selections instead of rewards for a more accurate mean/var reward measure.
-
Environment.EvaluatorMultiPlayers.
plot_lowerbounds
= True¶ Default is to plot the lower-bounds
-
Environment.EvaluatorMultiPlayers.
USE_BOX_PLOT
= True¶ True to use boxplot, False to use violinplot (default).
-
Environment.EvaluatorMultiPlayers.
nb_break_points
= 0¶ Default nb of random events
-
Environment.EvaluatorMultiPlayers.
FINAL_RANKS_ON_AVERAGE
= True¶ Default value for
finalRanksOnAverage
-
Environment.EvaluatorMultiPlayers.
USE_JOBLIB_FOR_POLICIES
= False¶ Default value for
useJoblibForPolicies
. Using it does not speed things up (too much overhead from using too many threads); so it should really be disabled.
-
class
Environment.EvaluatorMultiPlayers.
EvaluatorMultiPlayers
(configuration, moreAccurate=True)[source]¶ Bases:
object
Evaluator class to run the simulations, for the multi-players case.
-
__init__
(configuration, moreAccurate=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
cfg
= None¶ Configuration dictionary
-
nbPlayers
= None¶ Number of players
-
repetitions
= None¶ Number of repetitions
-
horizon
= None¶ Horizon (number of time steps)
-
collisionModel
= None¶ Which collision model should be used
-
full_lost_if_collision
= None¶ Is there a full loss of rewards in case of collision? Used to compute the correct decomposition of regret
-
moreAccurate
= None¶ Use the count of selections instead of rewards for a more accurate mean/var reward measure.
-
finalRanksOnAverage
= None¶ Final display of ranks are done on average rewards?
-
averageOn
= None¶ How many last steps for final rank average rewards
-
nb_break_points
= None¶ How many random events?
-
plot_lowerbounds
= None¶ Should we plot the lower-bounds?
-
useJoblib
= None¶ Use joblib to parallelize for loop on repetitions (useful)
-
showplot
= None¶ Show the plot (interactive display or not)
-
use_box_plot
= None¶ To use box plot (or violin plot if False). Force to use boxplot if repetitions=1.
-
count_ranks_markov_chain
= None¶ If true, count and then print a lot of statistics for the Markov Chain of the underlying configurations on ranks
-
change_labels
= None¶ Possibly empty dictionary to map ‘playerId’ to new labels (overwrite their name).
-
append_labels
= None¶ Possibly empty dictionary to map ‘playerId’ to new labels (by appending the result from ‘append_labels’).
-
envs
= None¶ List of environments
-
players
= None¶ List of players
-
rewards
= None¶ For each env, history of rewards
-
pulls
= None¶ For each env, keep the history of arm pulls (mean)
-
lastPulls
= None¶ For each env, keep the distribution of arm pulls
-
allPulls
= None¶ For each env, keep the full history of arm pulls
-
collisions
= None¶ For each env, keep the history of collisions on all arms
-
lastCumCollisions
= None¶ For each env, last count of collisions on all arms
-
nbSwitchs
= None¶ For each env, keep the history of switches (change of configuration of players)
-
bestArmPulls
= None¶ For each env, keep the history of best arm pulls
-
freeTransmissions
= None¶ For each env, keep the history of successful transmission (1 - collisions, basically)
-
lastCumRewards
= None¶ For each env, last accumulated rewards, to compute variance and histogram of whole regret R_T
-
runningTimes
= None¶ For each env, keep the history of running times
-
memoryConsumption
= None¶ For each env, keep the history of memory consumption
-
saveondisk
(filepath='saveondisk_EvaluatorMultiPlayers.hdf5')[source]¶ Save the content of the internal data into an HDF5 file on the disk.
- See http://docs.h5py.org/en/stable/quick.html if needed.
-
loadfromdisk
(filepath)[source]¶ Update the internal memory of the Evaluator object by loading data from the opened HDF5 file.
Warning
FIXME this is not YET implemented!
-
getRegretMean
(playerId, envId=0)[source]¶ Extract mean of regret, for one arm for one player (no meaning).
Warning
This is the centralized regret, for one arm, it does not make much sense in the multi-players setting!
-
getCentralizedRegret_LessAccurate
(envId=0)[source]¶ Compute the empirical centralized regret: cumsum on time of the mean rewards of the M best arms - cumsum on time of the empirical rewards obtained by the players, based on accumulated rewards.
-
getFirstRegretTerm
(envId=0)[source]¶ Extract and compute the first term \((a)\) in the centralized regret: losses due to pulling suboptimal arms.
-
getSecondRegretTerm
(envId=0)[source]¶ Extract and compute the second term \((b)\) in the centralized regret: losses due to not pulling optimal arms.
-
getThirdRegretTerm
(envId=0)[source]¶ Extract and compute the third term \((c)\) in the centralized regret: losses due to collisions.
-
getCentralizedRegret_MoreAccurate
(envId=0)[source]¶ Compute the empirical centralized regret, based on counts of selections and not actual rewards.
-
getCentralizedRegret
(envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
getLastRegrets_MoreAccurate
(envId=0)[source]¶ Extract last regrets, based on counts of selections and not actual rewards.
-
getLastRegrets
(envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
getRunningTimes
(envId=0)[source]¶ Get the means and stds and list of running time of the different players.
-
getMemoryConsumption
(envId=0)[source]¶ Get the means and stds and list of memory consumptions of the different players.
-
plotRewards
(envId=0, savefig=None, semilogx=False, moreAccurate=None)[source]¶ Plot the decentralized (vectorial) rewards, for each player.
-
plotFairness
(envId=0, savefig=None, semilogx=False, fairness='default', evaluators=())[source]¶ Plot a certain measure of “fairness”, from these personal rewards, support more than one environments (use evaluators to give a list of other environments).
-
plotRegretCentralized
(envId=0, savefig=None, semilogx=False, semilogy=False, loglog=False, normalized=False, evaluators=(), subTerms=False, sumofthreeterms=False, moreAccurate=None)[source]¶ Plot the centralized cumulated regret, support more than one environments (use evaluators to give a list of other environments).
- The lower bounds are also plotted (Besson & Kaufmann, and Anandkumar et al).
- The three terms of the regret are also plotted if evaluators = () (that’s the default).
-
plotNbSwitchs
(envId=0, savefig=None, semilogx=False, cumulated=False)[source]¶ Plot cumulated number of switchs (to evaluate the switching costs), comparing each player.
-
plotNbSwitchsCentralized
(envId=0, savefig=None, semilogx=False, cumulated=False, evaluators=())[source]¶ Plot the centralized cumulated number of switchs (to evaluate the switching costs), support more than one environments (use evaluators to give a list of other environments).
-
plotBestArmPulls
(envId=0, savefig=None)[source]¶ Plot the frequency of pulls of the best channel.
- Warning: does not adapt to dynamic settings!
-
plotAllPulls
(envId=0, savefig=None, cumulated=True, normalized=False)[source]¶ Plot the frequency of use of every channels, one figure for each channel. Not so useful.
-
plotFreeTransmissions
(envId=0, savefig=None, cumulated=False)[source]¶ Plot the frequency of free transmissions.
-
plotNbCollisions
(envId=0, savefig=None, semilogx=False, semilogy=False, loglog=False, cumulated=False, upperbound=False, evaluators=())[source]¶ Plot the frequency or cumulated number of collisions, support more than one environments (use evaluators to give a list of other environments).
-
plotFrequencyCollisions
(envId=0, savefig=None, piechart=True, semilogy=False)[source]¶ Plot the frequency of collision, in a pie chart (histogram not supported yet).
-
printRunningTimes
(envId=0, precision=3, evaluators=())[source]¶ Print the average+-std running time of the different players.
-
printMemoryConsumption
(envId=0, evaluators=())[source]¶ Print the average+-std memory consumption of the different players.
-
plotRunningTimes
(envId=0, savefig=None, base=1, unit='seconds', evaluators=())[source]¶ Plot the running times of the different players, as a box plot for each evaluators.
-
plotMemoryConsumption
(envId=0, savefig=None, base=1024, unit='KiB', evaluators=())[source]¶ Plot the memory consumption of the different players, as a box plot for each.
-
printFinalRanking
(envId=0, verb=True)[source]¶ Compute and print the ranking of the different players.
-
printFinalRankingAll
(envId=0, evaluators=())[source]¶ Compute and print the ranking of the different players.
-
printLastRegrets
(envId=0, evaluators=(), moreAccurate=None)[source]¶ Print the last regrets of the different evaluators.
-
printLastRegretsPM
(envId=0, evaluators=(), moreAccurate=None)[source]¶ Print the average+-std last regret of the different players.
-
plotLastRegrets
(envId=0, normed=False, subplots=True, nbbins=15, log=False, all_on_separate_figures=False, sharex=False, sharey=False, boxplot=False, normalized_boxplot=True, savefig=None, moreAccurate=None, evaluators=())[source]¶ Plot histogram of the regrets R_T for all evaluators.
-
plotHistoryOfMeans
(envId=0, horizon=None, savefig=None)[source]¶ Plot the history of means, as a plot with x axis being the time, y axis the mean rewards, and K curves one for each arm.
-
__dict__
= mappingproxy({'__module__': 'Environment.EvaluatorMultiPlayers', '__doc__': ' Evaluator class to run the simulations, for the multi-players case.\n ', '__init__': <function EvaluatorMultiPlayers.__init__>, '__initEnvironments__': <function EvaluatorMultiPlayers.__initEnvironments__>, '__initPlayers__': <function EvaluatorMultiPlayers.__initPlayers__>, 'startAllEnv': <function EvaluatorMultiPlayers.startAllEnv>, 'startOneEnv': <function EvaluatorMultiPlayers.startOneEnv>, 'saveondisk': <function EvaluatorMultiPlayers.saveondisk>, 'loadfromdisk': <function EvaluatorMultiPlayers.loadfromdisk>, 'getPulls': <function EvaluatorMultiPlayers.getPulls>, 'getAllPulls': <function EvaluatorMultiPlayers.getAllPulls>, 'getNbSwitchs': <function EvaluatorMultiPlayers.getNbSwitchs>, 'getCentralizedNbSwitchs': <function EvaluatorMultiPlayers.getCentralizedNbSwitchs>, 'getBestArmPulls': <function EvaluatorMultiPlayers.getBestArmPulls>, 'getfreeTransmissions': <function EvaluatorMultiPlayers.getfreeTransmissions>, 'getCollisions': <function EvaluatorMultiPlayers.getCollisions>, 'getRewards': <function EvaluatorMultiPlayers.getRewards>, 'getRegretMean': <function EvaluatorMultiPlayers.getRegretMean>, 'getCentralizedRegret_LessAccurate': <function EvaluatorMultiPlayers.getCentralizedRegret_LessAccurate>, 'getFirstRegretTerm': <function EvaluatorMultiPlayers.getFirstRegretTerm>, 'getSecondRegretTerm': <function EvaluatorMultiPlayers.getSecondRegretTerm>, 'getThirdRegretTerm': <function EvaluatorMultiPlayers.getThirdRegretTerm>, 'getCentralizedRegret_MoreAccurate': <function EvaluatorMultiPlayers.getCentralizedRegret_MoreAccurate>, 'getCentralizedRegret': <function EvaluatorMultiPlayers.getCentralizedRegret>, 'getLastRegrets_LessAccurate': <function EvaluatorMultiPlayers.getLastRegrets_LessAccurate>, 'getAllLastWeightedSelections': <function EvaluatorMultiPlayers.getAllLastWeightedSelections>, 'getLastRegrets_MoreAccurate': <function EvaluatorMultiPlayers.getLastRegrets_MoreAccurate>, 'getLastRegrets': <function EvaluatorMultiPlayers.getLastRegrets>, 'getRunningTimes': <function EvaluatorMultiPlayers.getRunningTimes>, 'getMemoryConsumption': <function EvaluatorMultiPlayers.getMemoryConsumption>, 'plotRewards': <function EvaluatorMultiPlayers.plotRewards>, 'plotFairness': <function EvaluatorMultiPlayers.plotFairness>, 'plotRegretCentralized': <function EvaluatorMultiPlayers.plotRegretCentralized>, 'plotNbSwitchs': <function EvaluatorMultiPlayers.plotNbSwitchs>, 'plotNbSwitchsCentralized': <function EvaluatorMultiPlayers.plotNbSwitchsCentralized>, 'plotBestArmPulls': <function EvaluatorMultiPlayers.plotBestArmPulls>, 'plotAllPulls': <function EvaluatorMultiPlayers.plotAllPulls>, 'plotFreeTransmissions': <function EvaluatorMultiPlayers.plotFreeTransmissions>, 'plotNbCollisions': <function EvaluatorMultiPlayers.plotNbCollisions>, 'plotFrequencyCollisions': <function EvaluatorMultiPlayers.plotFrequencyCollisions>, 'printRunningTimes': <function EvaluatorMultiPlayers.printRunningTimes>, 'printMemoryConsumption': <function EvaluatorMultiPlayers.printMemoryConsumption>, 'plotRunningTimes': <function EvaluatorMultiPlayers.plotRunningTimes>, 'plotMemoryConsumption': <function EvaluatorMultiPlayers.plotMemoryConsumption>, 'printFinalRanking': <function EvaluatorMultiPlayers.printFinalRanking>, 'printFinalRankingAll': <function EvaluatorMultiPlayers.printFinalRankingAll>, 'printLastRegrets': <function EvaluatorMultiPlayers.printLastRegrets>, 'printLastRegretsPM': <function EvaluatorMultiPlayers.printLastRegretsPM>, 
'plotLastRegrets': <function EvaluatorMultiPlayers.plotLastRegrets>, 'plotHistoryOfMeans': <function EvaluatorMultiPlayers.plotHistoryOfMeans>, 'strPlayers': <function EvaluatorMultiPlayers.strPlayers>, '__dict__': <attribute '__dict__' of 'EvaluatorMultiPlayers' objects>, '__weakref__': <attribute '__weakref__' of 'EvaluatorMultiPlayers' objects>})¶
-
__module__
= 'Environment.EvaluatorMultiPlayers'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Environment.EvaluatorSparseMultiPlayers module¶
EvaluatorSparseMultiPlayers class to wrap and run the simulations, for the multi-players case with sparse activated players. Lots of plotting methods, to have various visualizations. See documentation.
Warning
FIXME this environment is not as up-to-date as Environment.EvaluatorMultiPlayers
.
-
Environment.EvaluatorSparseMultiPlayers.
REPETITIONS
= 1¶ Default nb of repetitions
-
Environment.EvaluatorSparseMultiPlayers.
ACTIVATION
= 1¶ Default probability of activation
-
Environment.EvaluatorSparseMultiPlayers.
DELTA_T_PLOT
= 50¶ Default sampling rate for plotting
-
Environment.EvaluatorSparseMultiPlayers.
MORE_ACCURATE
= True¶ Use the count of selections instead of rewards for a more accurate mean/std reward measure.
-
Environment.EvaluatorSparseMultiPlayers.
FINAL_RANKS_ON_AVERAGE
= True¶ Default value for
finalRanksOnAverage
-
Environment.EvaluatorSparseMultiPlayers.
USE_JOBLIB_FOR_POLICIES
= False¶ Default value for
useJoblibForPolicies
. Using it does not speed things up (too much overhead from using too many threads); so it should really be disabled.
-
Environment.EvaluatorSparseMultiPlayers.
PICKLE_IT
= True¶ Default value for
pickleit
for saving the figures. If True, then allplt.figure
objects are saved (in pickle format).
-
class
Environment.EvaluatorSparseMultiPlayers.
EvaluatorSparseMultiPlayers
(configuration, moreAccurate=True)[source]¶ Bases:
Environment.EvaluatorMultiPlayers.EvaluatorMultiPlayers
Evaluator class to run the simulations, for the multi-players case.
-
__init__
(configuration, moreAccurate=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
activations
= None¶ Probability of activations
-
collisionModel
= None¶ Which collision model should be used
-
full_lost_if_collision
= None¶ Is there a full loss of rewards in case of collision? Used to compute the correct decomposition of regret
-
getCentralizedRegret_LessAccurate
(envId=0)[source]¶ Compute the empirical centralized regret: cumsum on time of the mean rewards of the M best arms - cumsum on time of the empirical rewards obtained by the players, based on accumulated rewards.
-
getFirstRegretTerm
(envId=0)[source]¶ Extract and compute the first term \((a)\) in the centralized regret: losses due to pulling suboptimal arms.
-
getSecondRegretTerm
(envId=0)[source]¶ Extract and compute the second term \((b)\) in the centralized regret: losses due to not pulling optimal arms.
-
getThirdRegretTerm
(envId=0)[source]¶ Extract and compute the third term \((c)\) in the centralized regret: losses due to collisions.
-
getCentralizedRegret_MoreAccurate
(envId=0)[source]¶ Compute the empirical centralized regret, based on counts of selections and not actual rewards.
-
getCentralizedRegret
(envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
getLastRegrets_MoreAccurate
(envId=0)[source]¶ Extract last regrets, based on counts of selections and not actual rewards.
-
getLastRegrets
(envId=0, moreAccurate=None)[source]¶ Using either the more accurate or the less accurate regret count.
-
strPlayers
(short=False, latex=True)[source]¶ Get a string of the players and their activation probabilities for this environment.
-
__module__
= 'Environment.EvaluatorSparseMultiPlayers'¶
-
-
Environment.EvaluatorSparseMultiPlayers.
delayed_play
(env, players, horizon, collisionModel, activations, seed=None, repeatId=0)[source]¶ Helper function for the parallelization.
-
Environment.EvaluatorSparseMultiPlayers.
uniform_in_zero_one
()¶ random() -> x in the interval [0, 1).
-
Environment.EvaluatorSparseMultiPlayers.
with_proba
(proba)[source]¶ True with probability = proba, False with probability = 1 - proba.
Examples:
>>> import random; random.seed(0) >>> tosses = [with_proba(0.6) for _ in range(10000)]; sum(tosses) 5977 >>> tosses = [with_proba(0.111) for _ in range(100000)]; sum(tosses) 11158
Environment.MAB module¶
MAB
, MarkovianMAB
, ChangingAtEachRepMAB
, IncreasingMAB
, PieceWiseStationaryMAB
and NonStationaryMAB
classes to wrap the arms of some Multi-Armed Bandit problems.
Such a class has to have at least these methods:
draw(armId, t)
to draw one sample from thatarmId
at timet
,- and
reprarms()
to pretty print the arms (for titles of a plot), - and more, see below.
Warning
FIXME it is still a work in progress, I need to add continuously varying environments. See https://github.com/SMPyBandits/SMPyBandits/issues/71
-
class
Environment.MAB.
MAB
(configuration)[source]¶ Bases:
object
Basic Multi-Armed Bandit problem, for stochastic and i.i.d. arms.
configuration can be a dict with ‘arm_type’ and ‘params’ keys. ‘arm_type’ is a class from the Arms module, and ‘params’ is a list/tuple/iterable of parameters (or a dict of named parameters) given to ‘arm_type’. Example:
configuration = { 'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9] } configuration = { # for fixed variance Gaussian 'arm_type': Gaussian, 'params': [0.1, 0.5, 0.9] }
But it can also accept a list of already created arms:
configuration = [ Bernoulli(0.1), Bernoulli(0.5), Bernoulli(0.9), ]
Both will create three Bernoulli arms, of parameters (means) 0.1, 0.5 and 0.9.
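For example (a small sketch, assuming the Arms and Environment modules are importable under the names used in this documentation):

from Arms import Bernoulli
from Environment.MAB import MAB

problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})
print(problem.nbArms)        # 3
print(problem.means)         # the three means 0.1, 0.5, 0.9
print(problem.draw(2))       # one Bernoulli sample (0 or 1) from the best arm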
-
isChangingAtEachRepetition
= None¶ Flag to know if the problem is changing at each repetition or not.
-
isDynamic
= None¶ Flag to know if the problem is static or not.
-
isMarkovian
= None¶ Flag to know if the problem is Markovian or not.
-
arms
= None¶ List of arms
-
means
= None¶ Means of arms
-
nbArms
= None¶ Number of arms
-
maxArm
= None¶ Max mean of arms
-
minArm
= None¶ Min mean of arms
-
new_order_of_arm
(arms)[source]¶ Feed a new order of the arms to the environment.
- Updates
means
correctly. - Return the new position(s) of the best arm (to count and plot
BestArmPulls
correctly).
Warning
This is a very limited support of non-stationary environments: only permutations of the arms are allowed, see
NonStationaryMAB
for more.- Updates
-
reprarms
(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]¶ Return a str representation of the list of the arms (like repr(self.arms) but better)
- If nbPlayers > 0, it surrounds the representation of the best arms by openTag, endTag (for plot titles, in a multi-player setting).
- Example: openTag = ‘’, endTag = ‘^*’ for LaTeX tags to put a star exponent.
- Example: openTag = ‘<red>’, endTag = ‘</red>’ for HTML-like tags.
- Example: openTag = r’\textcolor{red}{‘, endTag = ‘}’ for LaTeX tags.
-
draw
(armId, t=1)[source]¶ Return a random sample from the armId-th arm, at time t. Usually t is not used.
-
draw_nparray
(armId, shape=(1, ))[source]¶ Return a numpy array of random sample from the armId-th arm, of a certain shape.
-
draw_each_nparray
(shape=(1, ))[source]¶ Return a numpy array of random sample from each arm, of a certain shape.
-
get_minArm
(horizon=None)[source]¶ Return the vector of min mean of the arms.
- It is a vector of length horizon.
-
get_maxArm
(horizon=None)[source]¶ Return the vector of max mean of the arms.
- It is a vector of length horizon.
-
get_maxArms
(M=1, horizon=None)[source]¶ Return the vector of sum of the M-best means of the arms.
- It is a vector of length horizon.
-
get_allMeans
(horizon=None)[source]¶ Return the vector of means of the arms.
- It is a numpy array of shape (nbArms, horizon).
-
sparsity
¶ Estimate the sparsity of the problem, i.e., the number of arms with positive means.
-
str_sparsity
()[source]¶ Empty string if
sparsity = nbArms
, or a small string ‘, $s={}$’ if the sparsity is strictly less than the number of arms.
-
lowerbound
()[source]¶ Compute the constant \(C(\mu)\), for the [Lai & Robbins] lower-bound for this MAB problem (complexity), using functions from
kullback.py
orkullback.so
(seeArms.kullback
).
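This constant is the classical Lai & Robbins complexity term, \(C(\mu) = \sum_{k : \mu_k < \mu^*} (\mu^* - \mu_k) / \mathrm{kl}(\mu_k, \mu^*)\). A quick hand computation for Bernoulli arms (a sketch reusing klBern() from Arms.kullback, not the method itself):

from Arms.kullback import klBern

def lai_robbins_constant(means):
    """C(mu) = sum over suboptimal arms k of (mu_star - mu_k) / kl(mu_k, mu_star)."""
    mu_star = max(means)
    return sum((mu_star - mu) / klBern(mu, mu_star) for mu in means if mu < mu_star)

print(lai_robbins_constant([0.1, 0.5, 0.9]))  # complexity constant of the Bernoulli problem above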
-
lowerbound_sparse
(sparsity=None)[source]¶ Compute the constant \(C(\mu)\), for [Kwon et al, 2017] lower-bound for sparse bandits for this MAB problem (complexity)
- I recomputed the suboptimal solution to the optimization problem, and found the same result as in [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].
-
hoifactor
()[source]¶ Compute the HOI factor H_OI(mu), the Optimal Arm Identification (OI) factor, for this MAB problem (complexity). Cf. (3.3) in Navikkumar MODI’s thesis, “Machine Learning and Statistical Decision Making for Green Radio” (2017).
-
lowerbound_multiplayers
(nbPlayers=1)[source]¶ Compute our multi-players lower bound for this MAB problem (complexity), using functions from
kullback
.
-
upperbound_collisions
(nbPlayers, times)[source]¶ Compute Anandkumar et al. multi-players upper bound for this MAB problem (complexity), for UCB only. Warning: it is HIGHLY asymptotic!
-
plotComparison_our_anandkumar
(savefig=None)[source]¶ Plot a comparison of our lowerbound and their lowerbound.
-
plotHistogram
(horizon=10000, savefig=None, bins=50, alpha=0.9, density=None)[source]¶ Plot horizon=10000 draws of each arm.
-
__dict__
= mappingproxy({'__module__': 'Environment.MAB', '__doc__': " Basic Multi-Armed Bandit problem, for stochastic and i.i.d. arms.\n\n - configuration can be a dict with 'arm_type' and 'params' keys. 'arm_type' is a class from the Arms module, and 'params' is a dict, used as a list/tuple/iterable of named parameters given to 'arm_type'. Example::\n\n configuration = {\n 'arm_type': Bernoulli,\n 'params': [0.1, 0.5, 0.9]\n }\n\n configuration = { # for fixed variance Gaussian\n 'arm_type': Gaussian,\n 'params': [0.1, 0.5, 0.9]\n }\n\n - But it can also accept a list of already created arms::\n\n configuration = [\n Bernoulli(0.1),\n Bernoulli(0.5),\n Bernoulli(0.9),\n ]\n\n - Both will create three Bernoulli arms, of parameters (means) 0.1, 0.5 and 0.9.\n ", '__init__': <function MAB.__init__>, 'new_order_of_arm': <function MAB.new_order_of_arm>, '__repr__': <function MAB.__repr__>, 'reprarms': <function MAB.reprarms>, 'draw': <function MAB.draw>, 'draw_nparray': <function MAB.draw_nparray>, 'draw_each': <function MAB.draw_each>, 'draw_each_nparray': <function MAB.draw_each_nparray>, 'Mbest': <function MAB.Mbest>, 'Mworst': <function MAB.Mworst>, 'sumBestMeans': <function MAB.sumBestMeans>, 'get_minArm': <function MAB.get_minArm>, 'get_maxArm': <function MAB.get_maxArm>, 'get_maxArms': <function MAB.get_maxArms>, 'get_allMeans': <function MAB.get_allMeans>, 'sparsity': <property object>, 'str_sparsity': <function MAB.str_sparsity>, 'lowerbound': <function MAB.lowerbound>, 'lowerbound_sparse': <function MAB.lowerbound_sparse>, 'hoifactor': <function MAB.hoifactor>, 'lowerbound_multiplayers': <function MAB.lowerbound_multiplayers>, 'upperbound_collisions': <function MAB.upperbound_collisions>, 'plotComparison_our_anandkumar': <function MAB.plotComparison_our_anandkumar>, 'plotHistogram': <function MAB.plotHistogram>, '__dict__': <attribute '__dict__' of 'MAB' objects>, '__weakref__': <attribute '__weakref__' of 'MAB' objects>})¶
-
__module__
= 'Environment.MAB'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Environment.MAB.
RESTED
= True¶ Default is rested Markovian.
-
Environment.MAB.
dict_of_transition_matrix
(mat)[source]¶ Convert a transition matrix (list of lists or numpy array) to a dictionary mapping (state, state) to probabilities (as used by
pykov.Chain
).
-
Environment.MAB.
transition_matrix_of_dict
(dic)[source]¶ Convert a dictionary mapping (state, state) to probabilities (as used by
pykov.Chain
) to a transition matrix (numpy array).
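For example (illustrative; the exact ordering of the printed dictionary may differ, and the import path follows the module name used in this documentation):

from Environment.MAB import dict_of_transition_matrix, transition_matrix_of_dict

mat = [[0.7, 0.3],
       [0.5, 0.5]]
dic = dict_of_transition_matrix(mat)
print(dic)                              # expected: {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5}
print(transition_matrix_of_dict(dic))   # back to the 2x2 numpy array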
-
class
Environment.MAB.
MarkovianMAB
(configuration)[source]¶ Bases:
Environment.MAB.MAB
Classic MAB problem but the rewards are drawn from a rested/restless Markov chain.
- configuration is a dict with
rested
andtransitions
keys. rested
is a Boolean. See [Kalathil et al., 2012](https://arxiv.org/abs/1206.3582) page 2 for a description.transitions
is a list of K transition matrices or dictionaries (to specify non-integer states), one for each arm.
Example:
configuration = { "arm_type": "Markovian", "params": { "rested": True, # or False # Example from [Kalathil et al., 2012](https://arxiv.org/abs/1206.3582) Table 1 "transitions": [ # 1st arm, Either a dictionary { # Mean = 0.375 (0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5, }, # 2nd arm, Or a right transition matrix [[0.2, 0.8], [0.6, 0.4]], # Mean = 0.571 ], # FIXME make this by default! include it in MAB.py and not in the configuration! "steadyArm": Bernoulli } }
- This class requires the [pykov](https://github.com/riccardoscalco/Pykov) module to represent and use Markov chains.
-
isChangingAtEachRepetition
= None¶ The problem is not changing at each repetition.
-
isDynamic
= None¶ The problem is static.
-
isMarkovian
= None¶ The problem is Markovian.
-
rested
= None¶ Rested or not Markovian model?
-
nbArms
= None¶ Number of arms
-
means
= None¶ Means of each arms, from their steady distributions.
-
maxArm
= None¶ Max mean of arms
-
minArm
= None¶ Min mean of arms
-
states
= None¶ States of each arm, initially they are all busy
-
reprarms
(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]¶ Return a str representation of the list of the arms (like repr(self.arms) but better).
- For a Markovian MAB, the chain and the steady Bernoulli arm are represented.
- If nbPlayers > 0, it surrounds the representation of the best arms by openTag, endTag (for plot titles, in a multi-player setting).
- Example: openTag = ‘’, endTag = ‘^*’ for LaTeX tags to put a star exponent.
- Example: openTag = ‘<red>’, endTag = ‘</red>’ for HTML-like tags.
- Example: openTag = r’\textcolor{red}{‘, endTag = ‘}’ for LaTeX tags.
-
draw
(armId, t=1)[source]¶ Move on the Markov chain and return its state as a reward (0 or 1, or else).
- If rested Markovian, only the state of the Markov chain of arm armId changes. It is the simpler model, and the default model.
- But if restless (non rested) Markovian, the states of the Markov chains of all arms change (not only armId).
-
__module__
= 'Environment.MAB'¶
- configuration is a dict with
-
Environment.MAB.
VERBOSE
= False¶ Whether to be verbose when generating new arms for Dynamic MAB
-
class
Environment.MAB.
ChangingAtEachRepMAB
(configuration, verbose=False)[source]¶ Bases:
Environment.MAB.MAB
Like a stationary MAB problem, but the arms are (randomly) regenerated for each repetition, with the
newRandomArms()
method.M.arms
andM.means
are changed after each call to newRandomArms()
, but notnbArm
. All the other methods are carefully written to still make sense (Mbest
,Mworst
,minArm
,maxArm
).
Warning
It works perfectly fine, but it is still experimental, be careful when using this feature.
Note
Testing bandit algorithms against randomly generated problems at each repetition is usually referred to as “Bayesian problems” in the literature: a prior is set on problems (eg. uniform on \([0,1]^K\) or less obvious for instance if a
mingap
is set), and the performance is assessed against this prior. It differs from the frequentist point of view of having one fixed problem and doing eg.n=1000
repetitions on the same problem.-
isChangingAtEachRepetition
= None¶ The problem is changing at each repetition or not.
-
isDynamic
= None¶ The problem is static.
-
isMarkovian
= None¶ The problem is not Markovian.
-
newMeans
= None¶ Function to generate the means
-
args
= None¶ Args to give to function
-
nbArms
= None¶ Number of arms
-
reprarms
(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]¶ Cannot represent the dynamic arms, so print the ChangingAtEachRepMAB object
-
newRandomArms
(t=None, verbose=False)[source]¶ Generate a new list of arms, from
arm_type(params['newMeans'](*params['args']))
.
-
arms
¶ Return the current list of arms.
-
means
¶ Return the list of means of arms for this ChangingAtEachRepMAB: after \(x\) calls to
newRandomArms()
, the returned mean of arm \(k\) is the mean of the \(x\) means of that arm.Warning
Highly experimental!
-
minArm
¶ Return the smallest mean of the arms, for a dynamic MAB (averaged on all the draws of new means).
-
maxArm
¶ Return the largest mean of the arms, for a dynamic MAB (averaged on all the draws of new means).
-
lowerbound
()[source]¶ Compute the constant C(mu), for [Lai & Robbins] lower-bound for this MAB problem (complexity), using functions from
kullback
(averaged on all the draws of new means).
-
hoifactor
()[source]¶ Compute the HOI factor H_OI(mu), the Optimal Arm Identification (OI) factor, for this MAB problem (complexity). Cf. (3.3) in Navikkumar MODI’s thesis, “Machine Learning and Statistical Decision Making for Green Radio” (2017) (averaged on all the draws of new means).
-
lowerbound_multiplayers
(nbPlayers=1)[source]¶ Compute our multi-players lower bound for this MAB problem (complexity), using functions from
kullback
.
-
__module__
= 'Environment.MAB'¶
-
class
Environment.MAB.
PieceWiseStationaryMAB
(configuration, verbose=False)[source]¶ Bases:
Environment.MAB.MAB
Like a stationary MAB problem, but piece-wise stationary.
- Give it a list of vector of means, and a list of change-point locations.
- You can use
plotHistoryOfMeans()
to see a nice plot of the history of means.
Note
This is a generic class to implement one “easy” kind of non-stationary bandits, abruptly changing non-stationary bandits, if changepoints are fixed and decided in advance.
Warning
It works fine, but it is still experimental, be careful when using this feature.
Warning
The number of arms is fixed, see https://github.com/SMPyBandits/SMPyBandits/issues/123 if you are curious about bandit problems with a varying number of arms (or sleeping bandits where some arms can be enabled or disabled at each time).
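A minimal configuration sketch (the nesting of 'params' with 'listOfMeans' and 'changePoints' mirrors the attributes documented below, but the exact configuration format is an assumption; check the non-stationary example configurations of the package):

from Arms import Bernoulli
from Environment.MAB import PieceWiseStationaryMAB

# Three arms whose means change abruptly at t = 1000 and t = 2000.
problem = PieceWiseStationaryMAB({
    "arm_type": Bernoulli,
    "params": {
        "listOfMeans": [
            [0.2, 0.5, 0.9],   # means on [0, 1000)
            [0.2, 0.2, 0.9],   # means on [1000, 2000)
            [0.8, 0.5, 0.1],   # means on [2000, horizon)
        ],
        "changePoints": [0, 1000, 2000],
    },
})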
-
isChangingAtEachRepetition
= None¶ The problem is not changing at each repetition.
-
isDynamic
= None¶ The problem is dynamic.
-
isMarkovian
= None¶ The problem is not Markovian.
-
listOfMeans
= None¶ The list of means
-
nbArms
= None¶ Number of arms
-
changePoints
= None¶ List of the change points
-
reprarms
(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]¶ Cannot represent the dynamic arms, so print the PieceWiseStationaryMAB object
-
newRandomArms
(t=None, onlyOneArm=None, verbose=False)[source]¶ Fake function, there is nothing random here, it is just to tell the piece-wise stationary MAB problem to maybe use the next interval.
-
plotHistoryOfMeans
(horizon=None, savefig=None, forceTo01=False, showplot=True, pickleit=False)[source]¶ Plot the history of means, as a plot with x axis being the time, y axis the mean rewards, and K curves one for each arm.
-
arms
¶ Return the current list of arms: at time \(t\), the returned mean of arm \(k\) is the mean during the time interval containing \(t\).
-
means
¶ Return the list of means of arms for this PieceWiseStationaryMAB: at time \(t\), the returned mean of arm \(k\) is the mean during the time interval containing \(t\).
-
minArm
¶ Return the smallest mean of the arms, for the current vector of means.
-
maxArm
¶ Return the largest mean of the arms, for the current vector of means.
-
get_minArm
(horizon=None)[source]¶ Return the smallest mean of the arms, for a piece-wise stationary MAB
- It is a vector of length horizon.
-
get_minArms
(M=1, horizon=None)[source]¶ Return the vector of sum of the M-worst means of the arms, for a piece-wise stationary MAB.
- It is a vector of length horizon.
-
get_maxArm
(horizon=None)[source]¶ Return the vector of max mean of the arms, for a piece-wise stationary MAB.
- It is a vector of length horizon.
-
get_maxArms
(M=1, horizon=None)[source]¶ Return the vector of sum of the M-best means of the arms, for a piece-wise stationary MAB.
- It is a vector of length horizon.
-
get_allMeans
(horizon=None)[source]¶ Return the vector of mean of the arms, for a piece-wise stationary MAB.
- It is a numpy array of shape (nbArms, horizon).
-
__module__
= 'Environment.MAB'¶
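To make the semantics of arms, means and get_allMeans() concrete, here is a minimal, self-contained sketch (not the class's actual code) that builds the (nbArms, horizon) matrix of means from a list of mean vectors and a list of change points, assuming the first change point is 0:

import numpy as np

def all_means(list_of_means, change_points, horizon):
    """Return an array of shape (nbArms, horizon): column t holds the mean vector
    of the time interval containing t (assumes change_points[0] == 0)."""
    list_of_means = np.asarray(list_of_means)   # shape (nbIntervals, nbArms)
    means = np.zeros((list_of_means.shape[1], horizon))
    boundaries = list(change_points) + [horizon]
    for i, (start, stop) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        means[:, start:stop] = list_of_means[i][:, np.newaxis]
    return means

print(all_means([[0.1, 0.5, 0.9], [0.9, 0.5, 0.1]], [0, 5], 10))  # 3 arms, one change at t=5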
-
class
Environment.MAB.
NonStationaryMAB
(configuration, verbose=False)[source]¶ Bases:
Environment.MAB.PieceWiseStationaryMAB
Like a stationary MAB problem, but the arms can be modified at each time step, with the
newRandomArms()
method. M.arms and M.means are changed after each call to newRandomArms(), but not nbArms. All the other methods are carefully written to still make sense (Mbest, Mworst, minArm, maxArm).
Note
This is a generic class to implement different kinds of non-stationary bandits:
- Abruptly changing non-stationary bandits, in different variants: change points are randomly drawn (once for all
n
repetitions, or at a different location for each repetition). - Slowly varying non-stationary bandits, where the underlying mean of each arm is slowly and randomly modified, and a bound on the speed of change (e.g., a Lipschitz constant of \(t \mapsto \mu_i(t)\)) is known. (A hypothetical example of such a slowly varying mean function is sketched at the end of this class listing.)
Warning
It works fine, but it is still experimental, be careful when using this feature.
Warning
The number of arms is fixed, see https://github.com/SMPyBandits/SMPyBandits/issues/123 if you are curious about bandit problems with a varying number of arms (or sleeping bandits where some arms can be enabled or disabled at each time).
-
isChangingAtEachRepetition
= None¶ The problem is not changing at each repetition.
-
isDynamic
= None¶ The problem is dynamic.
-
isMarkovian
= None¶ The problem is not Markovian.
-
newMeans
= None¶ Function to generate the means
-
changePoints
= None¶ List of the change points
-
onlyOneArm
= None¶ None by default, but can be “uniform” to only change one arm at each change point.
-
args
= None¶ Args to give to function
-
nbArms
= None¶ Number of arms
-
reprarms
(nbPlayers=None, openTag='', endTag='^*', latex=True)[source]¶ Cannot represent the dynamic arms, so print the NonStationaryMAB object
-
newRandomArms
(t=None, onlyOneArm=None, verbose=False)[source]¶ Generate a new list of arms, from
arm_type(params['newMeans'](t, **params['args']))
.- If
onlyOneArm
is given and is an integer, the change of mean only occurs for this arm and the others stay the same. - If
onlyOneArm="uniform"
, the change of mean only occurs for one arm and the others stay the same, and the changing arm is chosen uniformly at random.
Note
Only the means of the arms change (and so, their order), not their family.
Warning
TODO? So far the only change points we consider are when the means of the arms change, while the family of distributions stays the same. I could implement a more generic way, for instance to be able to test algorithms that detect a change between different families of distributions (e.g., from a Gaussian of variance=1 to a Gaussian of variance=2, with possibly different means).
-
get_minArm
(horizon=None)[source]¶ Return the smallest mean of the arms, for a non-stationary MAB
- It is a vector of length horizon.
-
get_maxArm
(horizon=None)[source]¶ Return the vector of max mean of the arms, for a non-stationary MAB.
- It is a vector of length horizon.
-
get_allMeans
(horizon=None)[source]¶ Return the vector of mean of the arms, for a non-stationary MAB.
- It is a numpy array of shape (nbArms, horizon).
-
__module__
= 'Environment.MAB'¶
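As a hypothetical example of the params['newMeans'](t, **params['args']) interface used by newRandomArms() above, here is a sketch of a slowly varying mean function with a bounded per-step change (the second kind of non-stationarity mentioned in the note); names and constants are illustrative only:

import numpy as np

def slowly_varying_means(t, nbArms=3, speed=0.001):
    """Hypothetical newMeans function: arm k follows 0.5 + 0.4 * sin(speed * t + phase_k),
    so the change between two consecutive steps is bounded by roughly 0.4 * speed."""
    phases = np.linspace(0, np.pi, nbArms)
    return 0.5 + 0.4 * np.sin(speed * t + phases)

print(slowly_varying_means(0), slowly_varying_means(1000))  # e.g. params['args'] = {'nbArms': 3, 'speed': 0.001}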
-
Environment.MAB.
static_change_lower_amplitude
(t, l_t, a_t)[source]¶ A function called by
IncreasingMAB
at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\). - The first returned value is a boolean, True if a change occurred, False otherwise.
-
Environment.MAB.
L0
= -1¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
A0
= 2¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
DELTA
= 0¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
T0
= -1¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
DELTA_T
= -1¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
ZOOM
= 2¶ Default value for the
doubling_change_lower_amplitude()
function.
-
Environment.MAB.
doubling_change_lower_amplitude
(t, l_t, a_t, l0=-1, a0=2, delta=0, T0=-1, deltaT=-1, zoom=2)[source]¶ A function called by
IncreasingMAB
at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\). - At time 0, it forces the use of \(l_0, a_0\) if they are given and not
None
. - At step T0, it reduces \(l_t\) by delta (typically from 0 to -1).
- Every deltaT steps, it multiplies both \(l_t\) and \(a_t\) by zoom.
- The first returned value is a boolean, True if a change occurred, False otherwise. (A minimal sketch of this rule is given just below.)
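A minimal sketch of the rule described above (not the module's exact code), returning the same kind of (changed, l_t, a_t) triple:

def doubling_rule(t, l_t, a_t, l0=-1, a0=2, delta=0, T0=-1, deltaT=-1, zoom=2):
    """Sketch: force (l0, a0) at t = 0, shift l_t by delta at t = T0,
    and multiply both l_t and a_t by zoom every deltaT steps."""
    changed = False
    if t == 0 and l0 is not None and a0 is not None:
        l_t, a_t, changed = l0, a0, True
    if T0 >= 0 and t == T0:
        l_t, changed = l_t + delta, True
    if deltaT > 0 and t > 0 and t % deltaT == 0:
        l_t, a_t, changed = zoom * l_t, zoom * a_t, True
    return changed, l_t, a_t

print(doubling_rule(0, -1, 2), doubling_rule(100, -1, 2, deltaT=100))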
-
Environment.MAB.
default_change_lower_amplitude
(t, l_t, a_t, l0=-1, a0=2, delta=0, T0=-1, deltaT=-1, zoom=2)¶ A function called by
IncreasingMAB
at every time t, to compute the (possibly) new values for \(l_t\) and \(a_t\). - At time 0, it forces the use of \(l_0, a_0\) if they are given and not
None
. - At step T0, it reduces \(l_t\) by delta (typically from 0 to -1).
- Every deltaT steps, it multiplies both \(l_t\) and \(a_t\) by zoom.
- The first returned value is a boolean, True if a change occurred, False otherwise.
-
class
Environment.MAB.
IncreasingMAB
(configuration)[source]¶ Bases:
Environment.MAB.MAB
Like a stationary MAB problem, but the range of the rewards is increased from time to time, to test the
Policy.WrapRange
policy. - M.arms and M.means are NOT changed after each call to
newRandomArms()
, and neither is nbArms.
Warning
It is purely experimental, be careful when using this feature.
-
__module__
= 'Environment.MAB'¶
-
isDynamic
= None¶ Flag to know if the problem is static or not.
-
Environment.MAB.
binomialCoefficient
(k, n)[source]¶ Compute a binomial coefficient \(C^n_k\) by a direct multiplicative method: \(C^n_k = {n \choose k}\).
- Exact, using integers, not like https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.binom.html#scipy.special.binom which uses floating-point numbers.
- Complexity: \(\mathcal{O}(1)\) in memory, \(\mathcal{O}(n)\) in time.
- From https://en.wikipedia.org/wiki/Binomial_coefficient#Binomial_coefficient_in_programming_languages
- From: http://userpages.umbc.edu/~rcampbel/Computers/Python/probstat.html#ProbStat-Combin-Combinations
- Examples:
>>> binomialCoefficient(-3, 10) 0 >>> binomialCoefficient(1, -10) 0 >>> binomialCoefficient(1, 10) 10 >>> binomialCoefficient(5, 10) 80 >>> binomialCoefficient(5, 20) 12960 >>> binomialCoefficient(10, 30) 10886400
Environment.MAB_rotting module¶
Author: Julien SEZNEC. Code to launch (rotting) bandit games. It is written in a functional programming style: each execution returns arrays related to each run.
Environment.Result module¶
Result.Result class to wrap the simulation results.
-
class
Environment.Result.
Result
(nbArms, horizon, indexes_bestarm=-1, means=None)[source]¶ Bases:
object
Result accumulators.
-
choices
= None¶ Store all the choices.
-
rewards
= None¶ Store all the rewards, to compute the mean.
-
pulls
= None¶ Store the pulls.
-
indexes_bestarm
= None¶ Also store the position of the best arm, in case of a dynamically switching environment.
-
running_time
= None¶ Store the running time of the experiment.
-
memory_consumption
= None¶ Store the memory consumption of the experiment.
-
number_of_cp_detections
= None¶ Store the number of change points detected during the experiment.
-
change_in_arms
(time, indexes_bestarm)[source]¶ Store the position of the best arm from this list of arms.
- From that time t and after, the index of the best arm is stored as
indexes_bestarm
.
Warning
FIXME This is still experimental!
-
__dict__
= mappingproxy({'__module__': 'Environment.Result', '__doc__': ' Result accumulators.', '__init__': <function Result.__init__>, 'store': <function Result.store>, 'change_in_arms': <function Result.change_in_arms>, '__dict__': <attribute '__dict__' of 'Result' objects>, '__weakref__': <attribute '__weakref__' of 'Result' objects>})¶
-
__module__
= 'Environment.Result'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Environment.ResultMultiPlayers module¶
ResultMultiPlayers.ResultMultiPlayers class to wrap the simulation results, for the multi-players case.
-
class
Environment.ResultMultiPlayers.
ResultMultiPlayers
(nbArms, horizon, nbPlayers, means=None)[source]¶ Bases:
object
ResultMultiPlayers accumulators, for the multi-players case.
-
choices
= None¶ Store all the choices of all the players
-
rewards
= None¶ Store all the rewards of all the players, to compute the mean
-
pulls
= None¶ Store the pulls of all the players
-
allPulls
= None¶ Store all the pulls of all the players
-
collisions
= None¶ Store the collisions on all the arms
-
running_time
= None¶ Store the running time of the experiment
-
memory_consumption
= None¶ Store the memory consumption of the experiment
-
__dict__
= mappingproxy({'__module__': 'Environment.ResultMultiPlayers', '__doc__': ' ResultMultiPlayers accumulators, for the multi-players case. ', '__init__': <function ResultMultiPlayers.__init__>, 'store': <function ResultMultiPlayers.store>, '__dict__': <attribute '__dict__' of 'ResultMultiPlayers' objects>, '__weakref__': <attribute '__weakref__' of 'ResultMultiPlayers' objects>})¶
-
__module__
= 'Environment.ResultMultiPlayers'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Environment.fairnessMeasures module¶
Define some function to measure fairness of a vector of cumulated rewards, of shape (nbPlayers, horizon).
- All functions are valued in \([0, 1]\): \(100\%\) means fully unfair (one player has \(0\) rewards, another one has \(>0\) rewards), and \(0\%\) means fully fair (they all have exactly the same rewards).
- Reference: https://en.wikipedia.org/wiki/Fairness_measure and http://ica1www.epfl.ch/PS_files/LEB3132.pdf#search=%22max-min%20fairness%22.
-
Environment.fairnessMeasures.
amplitude_fairness
(X, axis=0)[source]¶ (Normalized) Amplitude fairness, homemade formula: \(1 - \min(X, axis) / \max(X, axis)\).
Examples:
>>> import numpy.random as rn; rn.seed(1) # for reproductibility >>> X = np.cumsum(rn.rand(10, 1000)) >>> amplitude_fairness(X) # doctest: +ELLIPSIS 0.999... >>> amplitude_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.999... >>> amplitude_fairness(np.log(1 + np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.959...
>>> rn.seed(3) # for reproductibility >>> X = rn.randint(0, 10, (10, 1000)); Y = np.cumsum(X, axis=1) >>> np.min(Y, axis=0)[0], np.max(Y, axis=0)[0] (3, 9) >>> np.min(Y, axis=0)[-1], np.max(Y, axis=0)[-1] (4387, 4601) >>> amplitude_fairness(Y, axis=0).shape (1000,) >>> list(amplitude_fairness(Y, axis=0)) # doctest: +ELLIPSIS [0.666..., 0.764..., ..., 0.0465...]
>>> X[X >= 3] = 3; Y = np.cumsum(X, axis=1) >>> np.min(Y, axis=0)[0], np.max(Y, axis=0)[0] (3, 3) >>> np.min(Y, axis=0)[-1], np.max(Y, axis=0)[-1] (2353, 2433) >>> amplitude_fairness(Y, axis=0).shape (1000,) >>> list(amplitude_fairness(Y, axis=0)) # Less spreadout # doctest: +ELLIPSIS [0.0, 0.5, ..., 0.0328...]
-
Environment.fairnessMeasures.
std_fairness
(X, axis=0)[source]¶ (Normalized) Standard-variation fairness, homemade formula: \(2 * \mathrm{std}(X, axis) / \max(X, axis)\).
Examples:
>>> import numpy.random as rn; rn.seed(1) # for reproductibility >>> X = np.cumsum(rn.rand(10, 1000)) >>> std_fairness(X) # doctest: +ELLIPSIS 0.575... >>> std_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.594... >>> std_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.470...
>>> rn.seed(2) # for reproductibility >>> X = np.cumsum(rn.randint(0, 10, (10, 100))) >>> std_fairness(X) # doctest: +ELLIPSIS 0.570... >>> std_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.587... >>> std_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.463...
-
Environment.fairnessMeasures.
rajjain_fairness
(X, axis=0)[source]¶ Raj Jain’s fairness index: \((\sum_{i=1}^{n} x_i)^2 / (n \times \sum_{i=1}^{n} x_i^2)\), projected to \([0, 1]\) instead of \([\frac{1}{n}, 1]\) as introduced in the reference article.
Examples:
>>> import numpy.random as rn; rn.seed(1) # for reproductibility >>> X = np.cumsum(rn.rand(10, 1000)) >>> rajjain_fairness(X) # doctest: +ELLIPSIS 0.248... >>> rajjain_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.441... >>> rajjain_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.110...
>>> rn.seed(2) # for reproductibility >>> X = np.cumsum(rn.randint(0, 10, (10, 100))) >>> rajjain_fairness(X) # doctest: +ELLIPSIS 0.246... >>> rajjain_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.917... >>> rajjain_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.107...
-
Environment.fairnessMeasures.
mo_walrand_fairness
(X, axis=0, alpha=2)[source]¶ Mo and Walrand’s family fairness index: \(U_{\alpha}(X)\), NOT projected to \([0, 1]\).
\[\begin{split}U_{\alpha}(X) = \begin{cases} \frac{1}{1 - \alpha} \sum_{i=1}^n x_i^{1 - \alpha} & \;\text{if}\; \alpha\in[0,+\infty)\setminus\{1\}, \\ \sum_{i=1}^{n} \ln(x_i) & \;\text{otherwise}. \end{cases}\end{split}\]Examples:
>>> import numpy.random as rn; rn.seed(1) # for reproductibility >>> X = np.cumsum(rn.rand(10, 1000))
>>> alpha = 0 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS 24972857.013... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS 82933940429.039... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS 471371.219...
>>> alpha = 0.99999 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS 1000075176.390... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS 1000150358.528... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS 1000037587.478...
>>> alpha = 1 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS 75173.509... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS 150347.019... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS 37586.754...
>>> alpha = 1.00001 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS -999924829.359... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS -999849664.476... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS -999962413.957...
>>> alpha = 2 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS -22.346... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS -9.874... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS -283.255...
>>> alpha = 5 >>> mo_walrand_fairness(X, alpha=alpha) # doctest: +ELLIPSIS -8.737... >>> mo_walrand_fairness(X ** 2, alpha=alpha) # More spreadout # doctest: +ELLIPSIS -273.522... >>> mo_walrand_fairness(np.sqrt(np.abs(X)), alpha=alpha) # Less spreadout # doctest: +ELLIPSIS -2.468...
-
Environment.fairnessMeasures.
mean_fairness
(X, axis=0, methods=(<function amplitude_fairness>, <function std_fairness>, <function rajjain_fairness>))[source]¶ Fairness index, based on mean of the 3 fairness measures: Amplitude, STD and Raj Jain fairness.
Examples:
>>> import numpy.random as rn; rn.seed(1) # for reproductibility >>> X = np.cumsum(rn.rand(10, 1000)) >>> mean_fairness(X) # doctest: +ELLIPSIS 0.607... >>> mean_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.678... >>> mean_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.523...
>>> rn.seed(2) # for reproductibility >>> X = np.cumsum(rn.randint(0, 10, (10, 100))) >>> mean_fairness(X) # doctest: +ELLIPSIS 0.605... >>> mean_fairness(X ** 2) # More spreadout # doctest: +ELLIPSIS 0.834... >>> mean_fairness(np.sqrt(np.abs(X))) # Less spreadout # doctest: +ELLIPSIS 0.509...
-
Environment.fairnessMeasures.
fairnessMeasure
(X, axis=0, methods=(<function amplitude_fairness>, <function std_fairness>, <function rajjain_fairness>))¶ Default fairness measure
-
Environment.fairnessMeasures.
fairness_mapping
= {'Amplitude': <function amplitude_fairness>, 'Default': <function mean_fairness>, 'Mean': <function mean_fairness>, 'MoWalrand': <function mo_walrand_fairness>, 'RajJain': <function rajjain_fairness>, 'STD': <function std_fairness>}¶ Mapping of names of measure to their function
Environment.memory_consumption module¶
Tiny module to measure and work on memory consumption.
It defines a utility function to get the memory consumed by the current process or the current thread (getCurrentMemory()
), and a function to pretty print memory size (sizeof_fmt()
).
It also imports tracemalloc
and defines a convenient function that pretty-prints the most costly lines after a run.
- Reference: https://docs.python.org/3/library/tracemalloc.html#pretty-top
- Example:
>>> return_code = start_tracemalloc()
Starting to trace memory allocations...
>>> # ... run your application ...
>>> display_top_tracemalloc()
<BLANKLINE>
Top 10 lines ranked by memory consumption:
#1: python3.6/doctest.py:1330: 636 B
compileflags, 1), test.globs)
#2: <doctest __main__[1]>:1: 568 B
display_top_tracemalloc()
#3: python3.6/doctest.py:1346: 472 B
if check(example.want, got, self.optionflags):
#4: python3.6/doctest.py:1374: 464 B
self.report_success(out, test, example, got)
#5: python3.6/doctest.py:1591: 456 B
got = self._toAscii(got)
#6: ./memory_consumption.py:168: 448 B
snapshot = tracemalloc.take_snapshot()
#7: python3.6/doctest.py:1340: 440 B
self._fakeout.truncate(0)
#8: python3.6/doctest.py:1339: 440 B
got = self._fakeout.getvalue() # the actual output
#9: python3.6/doctest.py:1331: 432 B
self.debugger.set_continue() # ==== Example Finished ====
#10: python3.6/doctest.py:251: 89 B
result = StringIO.getvalue(self)
2 others: 78 B
<BLANKLINE>
Total allocated size: 4.4 KiB
4523
Warning
This is automatically used (for main.py
at least) when DEBUGMEMORY=True
(an environment variable set from the command line).
Warning
This is experimental and does not work as well on Mac OS X and Windows as it works on GNU/Linux systems.
-
Environment.memory_consumption.
getCurrentMemory
(thread=False, both=False)[source]¶ Get the current memory consumption of the process, or the thread.
- Example, before and after creating a huge random matrix in Numpy, and asking to invert it:
>>> currentMemory = getCurrentMemory() >>> print("Consumed {} memory".format(sizeof_fmt(currentMemory))) # doctest: +SKIP Consumed 16.8 KiB memory
>>> import numpy as np; x = np.random.randn(1000, 1000) # doctest: +SKIP >>> diffMemory = getCurrentMemory() - currentMemory; currentMemory += diffMemory >>> print("Consumed {} more memory".format(sizeof_fmt(diffMemory))) # doctest: +SKIP Consumed 18.8 KiB more memory
>>> y = np.linalg.pinv(x) # doctest: +SKIP >>> diffMemory = getCurrentMemory() - currentMemory; currentMemory += diffMemory >>> print("Consumed {} more memory".format(sizeof_fmt(diffMemory))) # doctest: +SKIP Consumed 63.9 KiB more memory
Warning
This is still experimental for multi-threaded code.
Warning
It can break on some systems, see for instance [the issue #142](https://github.com/SMPyBandits/SMPyBandits/issues/142).
Warning
FIXME even on my own system, it works for the last few policies I test, but fails for the first??
Warning
This returns 0 on Microsoft Windows, because the
resource
module is not available on non-UNIX systems (see https://docs.python.org/3/library/unix.html).
-
Environment.memory_consumption.
sizeof_fmt
(num, suffix='B', longsuffix=True, usespace=True, base=1024)[source]¶ Returns a string representation of the size
num
.- Examples:
>>> sizeof_fmt(1020) '1020 B' >>> sizeof_fmt(1024) '1 KiB' >>> sizeof_fmt(12011993) '11.5 MiB' >>> sizeof_fmt(123456789) '117.7 MiB' >>> sizeof_fmt(123456789911) '115 GiB'
Options include:
- No space before unit:
>>> sizeof_fmt(123456789911, usespace=False) '115GiB'
- French style, with short suffix, the “O” suffix for “octets”, and a base 1000:
>>> sizeof_fmt(123456789911, longsuffix=False, suffix='O', base=1000) '123.5 GO'
- Reference: https://stackoverflow.com/a/1094933/5889533
-
Environment.memory_consumption.
start_tracemalloc
()[source]¶ Wrapper function around
tracemalloc.start()
, to log the start of tracing memory allocation.
Environment.notify module¶
Defines one useful function notify()
to (try to) send a desktop notification.
- Only tested on Ubuntu and Debian desktops.
- Should work on any FreeDesktop compatible desktop, see https://wiki.ubuntu.com/NotifyOSD.
Warning
Experimental support of Mac OS X has been added since #143 (https://github.com/SMPyBandits/SMPyBandits/issues/143).
-
Environment.notify.
PROGRAM_NAME
= 'SMPyBandits'¶ Program name
-
Environment.notify.
ICON_PATH
= 'logo.png'¶ Icon to use
-
Environment.notify.
has_Notify
= False¶ Trying to import gi.repository.Notify
-
Environment.notify.
notify_gi
(body, summary='SMPyBandits', icon='terminal', timeout=5)[source]¶ Send a notification, with gi.repository.Notify.
- icon can be “dialog-information”, “dialog-warn”, “dialog-error”.
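A minimal usage sketch (assuming a FreeDesktop-compatible desktop with gi.repository.Notify available, see has_Notify above):

from Environment.notify import notify_gi

# Send a 5-second desktop notification at the end of a long simulation
notify_gi("Simulation finished!", summary="SMPyBandits", icon="dialog-information", timeout=5)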
Environment.plot_Cmu_HOI module¶
Environment.plotsettings module¶
plotsettings: use it like this, in the Environment folder:
>>> import sys; sys.path.insert(0, '..')
>>> from .plotsettings import BBOX_INCHES, signature, maximizeWindow, palette, makemarkers, add_percent_formatter, wraptext, wraplatex, legend, show_and_save, nrows_ncols
-
Environment.plotsettings.
monthyear
= 'Mar.2021'¶ Month.Year date
-
Environment.plotsettings.
signature
= ''¶ A small string to use as a signature
-
Environment.plotsettings.
DPI
= 120¶ DPI to use for the figures
-
Environment.plotsettings.
FIGSIZE
= (16, 9)¶ Figure size, in inches!
-
Environment.plotsettings.
HLS
= True¶ Use the HLS mapping, or HUSL mapping
-
Environment.plotsettings.
VIRIDIS
= False¶ Use the Viridis colormap
-
Environment.plotsettings.
BBOX_INCHES
= None¶ Use this parameter for bbox
-
Environment.plotsettings.
palette
(nb, hls=True, viridis=False)[source]¶ Use a smart palette from seaborn, for nb different plots on the same figure.
>>> palette(10, hls=True) # doctest: +ELLIPSIS [(0.86..., 0.37..., 0.33...), (0.86..., 0.65..., 0.33...), (0.78..., 0.86..., 0.33...), (0.49..., 0.86..., 0.33...), (0.33..., 0.86..., 0.46...), (0.33..., 0.86..., 0.74...), (0.33..., 0.68..., 0.86...), (0.33..., 0.40..., 0.86...), (0.56..., 0.33..., 0.86...), (0.84..., 0.33..., 0.86...)] >>> palette(10, hls=False) # doctest: +ELLIPSIS [[0.96..., 0.44..., 0.53...], [0.88..., 0.52..., 0.19...], [0.71..., 0.60..., 0.19...], [0.54..., 0.65..., 0.19...], [0.19..., 0.69..., 0.34...], [0.20..., 0.68..., 0.58...], [0.21..., 0.67..., 0.69...], [0.22..., 0.65..., 0.84...], [0.55..., 0.57..., 0.95...], [0.85..., 0.44..., 0.95...]] >>> palette(10, viridis=True) # doctest: +ELLIPSIS [(0.28..., 0.13..., 0.44...), (0.26..., 0.24..., 0.52...), (0.22..., 0.34..., 0.54...), (0.17..., 0.43..., 0.55...), (0.14..., 0.52..., 0.55...), (0.11..., 0.60..., 0.54...), (0.16..., 0.69..., 0.49...), (0.31..., 0.77..., 0.41...), (0.52..., 0.83..., 0.28...), (0.76..., 0.87..., 0.13...)]
- To visualize:
>>> sns.palplot(palette(10, hls=True)) # doctest: +SKIP >>> sns.palplot(palette(10, hls=False)) # use HUSL by default # doctest: +SKIP >>> sns.palplot(palette(10, viridis=True)) # doctest: +SKIP
-
Environment.plotsettings.
makemarkers
(nb)[source]¶ Give a list of cycling markers. See http://matplotlib.org/api/markers_api.html
Note
This is what I consider the optimal sequence of markers: they are clearly distinguishable from one another and all are pretty.
Examples:
>>> makemarkers(7) ['o', 'D', 'v', 'p', '<', 's', '^'] >>> makemarkers(12) ['o', 'D', 'v', 'p', '<', 's', '^', '*', 'h', '>', 'o', 'D']
-
Environment.plotsettings.
PUTATRIGHT
= False¶ Default parameter for legend(): if True, the legend is placed at the right side of the figure, not on it. This is almost mandatory for plots with more than 10 algorithms (good for experimenting, bad for publications).
-
Environment.plotsettings.
SHRINKFACTOR
= 0.75¶ Shrink factor if the legend is displayed on the right of the plot.
Warning
I still don’t really understand how this works. Just decrease it manually if the legend takes more space (i.e., more algorithms with longer names).
-
Environment.plotsettings.
MAXNBOFLABELINFIGURE
= 8¶ Default parameter for maximum number of label to display in the legend INSIDE the figure
-
Environment.plotsettings.
legend
(putatright=False, fontsize='large', shrinkfactor=0.75, maxnboflabelinfigure=8, fig=None, title=None)[source]¶ plt.legend() with good options, cf. http://matplotlib.org/users/recipes.html#transparent-fancy-legends.
- It can place the legend to the right also, see https://stackoverflow.com/a/4701285/.
-
Environment.plotsettings.
maximizeWindow
()[source]¶ Experimental function to try to maximize a plot.
- Tries as well as possible to maximize the figure.
- Cf. https://stackoverflow.com/q/12439588/
Warning
This function is still experimental, but “it works on my machine” so I keep it.
-
Environment.plotsettings.
FORMATS
= ('png', 'pdf')¶ List of formats to use for saving the figures, by default. It is a smart idea to save in both raster and vector formats
-
Environment.plotsettings.
show_and_save
(showplot=True, savefig=None, formats=('png', 'pdf'), pickleit=False, fig=None)[source]¶ Maximize the window if it needs to be shown, save it if needed, and then show it or close it.
-
Environment.plotsettings.
add_percent_formatter
(which='xaxis', amplitude=1.0, oldformatter='%.2g%%', formatter='{x:.1%}')[source]¶ Small function to use a Percentage formatter for xaxis or yaxis, of a certain amplitude.
- which can be “xaxis” or “yaxis”,
- amplitude is a float, default to 1.
- More detail at http://stackoverflow.com/a/36320013/
- Note that the use of matplotlib.ticker.PercentFormatter requires matplotlib >= 2.0.1,
- but if it is not available, matplotlib.ticker.StrMethodFormatter(“{:.0%}”) is used instead.
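A small usage sketch, assuming the formatter acts on the current matplotlib axes:

import matplotlib.pyplot as plt
from Environment.plotsettings import add_percent_formatter

plt.plot([0.0, 0.25, 0.5, 0.75, 1.0])          # values in [0, 1]
add_percent_formatter("yaxis", amplitude=1.0)  # display the y axis as percentages
plt.show()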
-
Environment.plotsettings.
WIDTH
= 95¶ Default value for the
width
parameter for wraptext()
and wraplatex()
.
-
Environment.plotsettings.
wraptext
(text, width=95)[source]¶ Wrap the text, using
textwrap
module, and width
.
-
Environment.plotsettings.
wraplatex
(text, width=95)[source]¶ Wrap the text, for LaTeX, using
textwrap
module, and width
.
-
Environment.plotsettings.
nrows_ncols
(N)[source]¶ Return (nrows, ncols) to create a subplots for N plots of the good size.
>>> for N in range(1, 22): ... nrows, ncols = nrows_ncols(N) ... print("For N = {:>2}, {} rows and {} cols are enough.".format(N, nrows, ncols)) For N = 1, 1 rows and 1 cols are enough. For N = 2, 2 rows and 1 cols are enough. For N = 3, 2 rows and 2 cols are enough. For N = 4, 2 rows and 2 cols are enough. For N = 5, 3 rows and 2 cols are enough. For N = 6, 3 rows and 2 cols are enough. For N = 7, 3 rows and 3 cols are enough. For N = 8, 3 rows and 3 cols are enough. For N = 9, 3 rows and 3 cols are enough. For N = 10, 4 rows and 3 cols are enough. For N = 11, 4 rows and 3 cols are enough. For N = 12, 4 rows and 3 cols are enough. For N = 13, 4 rows and 4 cols are enough. For N = 14, 4 rows and 4 cols are enough. For N = 15, 4 rows and 4 cols are enough. For N = 16, 4 rows and 4 cols are enough. For N = 17, 5 rows and 4 cols are enough. For N = 18, 5 rows and 4 cols are enough. For N = 19, 5 rows and 4 cols are enough. For N = 20, 5 rows and 4 cols are enough. For N = 21, 5 rows and 5 cols are enough.
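Typical intended use, sketched with matplotlib:

import matplotlib.pyplot as plt
from Environment.plotsettings import nrows_ncols

N = 7                           # number of sub-plots needed
nrows, ncols = nrows_ncols(N)   # (3, 3), as in the table above
fig, axes = plt.subplots(nrows, ncols, figsize=(16, 9))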
-
Environment.plotsettings.
addTextForWorstCases
(ax, n, bins, patches, rate=0.85, normed=False, fontsize=8)[source]¶ Add some text labels to the patches of a histogram, for the last ‘rate’%.
Use it like this, to add labels for the bins containing the 65% largest values of n:
>>> n, bins, patches = plt.hist(...) >>> addTextForWorstCases(ax, n, bins, patches, rate=0.65)
-
Environment.plotsettings.
violin_or_box_plot
(data=None, labels=None, boxplot=False, **kwargs)[source]¶ Automatically add labels to a box or violin plot.
Warning
Requires pandas (https://pandas.pydata.org/) to add the xlabel for violin plots.
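A hypothetical usage sketch (the data and labels are made up; boxplot=True avoids the pandas requirement mentioned in the warning):

import numpy as np
from Environment.plotsettings import violin_or_box_plot

data = [np.random.randn(100) + shift for shift in (0, 1, 2)]  # e.g. final regrets of 3 algorithms
labels = ["UCB", "klUCB", "Thompson"]
violin_or_box_plot(data=data, labels=labels, boxplot=True)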
-
Environment.plotsettings.
MAX_NB_OF_LABELS
= 50¶ If more than MAX_NB_OF_LABELS labels have to be displayed on a boxplot, don’t put a legend.
-
Environment.plotsettings.
adjust_xticks_subplots
(ylabel=None, labels=(), maxNbOfLabels=50)[source]¶ Adjust the size of the xticks, and maybe change size of ylabel.
-
Environment.plotsettings.
table_to_latex
(mean_data, std_data=None, labels=None, fmt_function=None, name_of_table=None, filename=None, erase_output=False, *args, **kwargs)[source]¶ Tries to print the data from the input array, collection of arrays, or
pandas.DataFrame
to the stdout and to the file filename
(if it does not exist). - Give
std_data
to print mean +- std
instead of just mean
from mean_data
, - Give a list to
labels
to use as the header of the table, - Give a formatting function to
fmt_function
, like IPython.core.magics.execution._format_time()
to print running times, or memory_consumption.sizeof_fmt()
to print memory usages, or lambda s: "{:.3g}".format(s)
to print float
values (default), - Uses
tabulate.tabulate()
(https://bitbucket.org/astanin/python-tabulate/) or pandas.DataFrame.to_latex()
(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_latex.html#pandas.DataFrame.to_latex).
Warning
FIXME this is still experimental! And useless, most of the time we simply do a copy/paste from the terminal to the LaTeX in the article…
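A hypothetical usage sketch (the data is made up, and tabulate or pandas must be installed as noted above):

import numpy as np
from Environment.plotsettings import table_to_latex

mean_data = np.array([120.5, 98.2, 75.9])   # e.g. mean final regret of 3 policies
std_data = np.array([10.1, 8.7, 6.4])
labels = ["UCB", "klUCB", "Thompson"]
table_to_latex(mean_data, std_data=std_data, labels=labels,
               fmt_function=lambda s: "{:.3g}".format(s),
               name_of_table="Mean regret +- std", filename="regret_table.tex")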
Environment.pykov module¶
Pykov documentation.
-
exception
Environment.pykov.
PykovError
(value)[source]¶ Bases:
Exception
Exception definition for Pykov errors.
-
__module__
= 'Environment.pykov'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
-
class
Environment.pykov.
Vector
(data=None, **kwargs)[source]¶ Bases:
collections.OrderedDict
-
__init__
(data=None, **kwargs)[source]¶ >>> pykov.Vector({'A':.3, 'B':.7}) {'A':.3, 'B':.7} >>> pykov.Vector(A=.3, B=.7) {'A':.3, 'B':.7}
-
__setitem__
(key, value)[source]¶ >>> q = pykov.Vector(C=.4, B=.6) >>> q['Z']=.2 >>> q {'C': 0.4, 'B': 0.6, 'Z': 0.2} >>> q['Z']=0 >>> q {'C': 0.4, 'B': 0.6}
-
__mul__
(M)[source]¶ >>> p = pykov.Vector(A=.3, B=.7) >>> p * 3 {'A': 0.9, 'B': 2.1} >>> q = pykov.Vector(C=.5, B=.5) >>> p * q 0.35 >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> p * T {'A': 0.91, 'B': 0.09} >>> T * p {'A': 0.42, 'B': 0.3}
-
__add__
(v)[source]¶ >>> p = pykov.Vector(A=.3, B=.7) >>> q = pykov.Vector(C=.5, B=.5) >>> p + q {'A': 0.3, 'C': 0.5, 'B': 1.2}
-
__sub__
(v)[source]¶ >>> p = pykov.Vector(A=.3, B=.7) >>> q = pykov.Vector(C=.5, B=.5) >>> p - q {'A': 0.3, 'C': -0.5, 'B': 0.2} >>> q - p {'A': -0.3, 'C': 0.5, 'B': -0.2}
-
_toarray
(el2pos)[source]¶ >>> p = pykov.Vector(A=.3, B=.7) >>> el2pos = {'A': 1, 'B': 0} >>> v = p._toarray(el2pos) >>> v array([ 0.7, 0.3])
-
_fromarray
(arr, el2pos)[source]¶ >>> p = pykov.Vector() >>> el2pos = {'A': 1, 'B': 0} >>> v = numpy.array([ 0.7, 0.3]) >>> p._fromarray(v,el2pos) >>> p {'A': 0.3, 'B': 0.7}
-
sort
(reverse=False)[source]¶ List of (state, probability) pairs sorted according to the probability.
>>> p = pykov.Vector({'A':.3, 'B':.1, 'C':.6}) >>> p.sort() [('B', 0.1), ('A', 0.3), ('C', 0.6)] >>> p.sort(reverse=True) [('C', 0.6), ('A', 0.3), ('B', 0.1)]
-
normalize
()[source]¶ Normalize the vector so that the entries sum is 1.
>>> p = pykov.Vector({'A':3, 'B':1, 'C':6}) >>> p.normalize() >>> p {'A': 0.3, 'C': 0.6, 'B': 0.1}
-
choose
(random_func=None)[source]¶ Choose a state according to its probability.
>>> p = pykov.Vector(A=.3, B=.7) >>> p.choose() 'B'
Optionally, a function that generates a random number can be supplied. >>> def FakeRandom(min, max): return 0.01 >>> p = pykov.Vector(A=.05, B=.4, C=.4, D=.15) >>> p.choose(FakeRandom) 'A'
See also
-
entropy
()[source]¶ Return the entropy.
\[H(p) = - \sum_i p_i \ln p_i\]See also
Khinchin, A. I. Mathematical Foundations of Information Theory Dover, 1957.
>>> p = pykov.Vector(A=.3, B=.7) >>> p.entropy() 0.6108643020548935
-
relative_entropy
(p)[source]¶ Return the Kullback-Leibler distance.
\[d(q,p) = \sum_i q_i \ln (q_i/p_i)\]Note
The Kullback-Leibler distance is not symmetric.
>>> p = pykov.Vector(A=.3, B=.7) >>> q = pykov.Vector(A=.4, B=.6) >>> p.relative_entropy(q) 0.02160085414354654 >>> q.relative_entropy(p) 0.022582421084357485
-
copy
()[source]¶ Return a shallow copy.
>>> p = pykov.Vector(A=.3, B=.7) >>> q = p.copy() >>> p['C'] = 1. >>> q {'A': 0.3, 'B': 0.7}
-
dist
(v)[source]¶ Return the distance between the two probability vectors.
\[d(q,p) = \sum_i |q_i - p_i|\]>>> p = pykov.Vector(A=.3, B=.7) >>> q = pykov.Vector(C=.5, B=.5) >>> q.dist(p) 1.0
-
__module__
= 'Environment.pykov'¶
-
-
class
Environment.pykov.
Matrix
(data=None)[source]¶ Bases:
collections.OrderedDict
-
__getitem__
(*args)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T[('A','B')] 0.3 >>> T['A','B'] 0.3 >>> 0.0
-
copy
()[source]¶ Return a shallow copy.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> W = T.copy() >>> T[('B','B')] = 1. >>> W {('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7}
-
_numpy_mat
(el2pos)[source]¶ Return a numpy.matrix object from a dictionary.
– Parameters – t_ij : the OrderedDict, values must be real numbers, keys should be tuples of two strings. el2pos : see _map()
-
_from_numpy_mat
(T, pos2el)[source]¶ Return a dictionary from a numpy.matrix object.
– Parameters – T : the numpy.matrix. pos2el : see _map()
-
stochastic
()[source]¶ Make a right stochastic matrix.
Set the sum of every row equal to one, raise
PykovError
if it is not possible.>>> T = pykov.Matrix({('A','B'): 3, ('A','A'): 7, ('B','A'): .2}) >>> T.stochastic() >>> T {('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7} >>> T[('A','C')]=1 >>> T.stochastic() pykov.PykovError: 'Zero links from node C'
-
pred
(key=None)[source]¶ Return the predecessors of a state (if not indicated, of all states). In matrix notation: return the column of the indicated state.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.pred() {'A': {'A': 0.7, 'B': 1.0}, 'B': {'A': 0.3}} >>> T.pred('A') {'A': 0.7, 'B': 1.0}
-
succ
(key=None)[source]¶ Return the successors of a state (if not indicated, of all states). In Matrix notation: return the row of the indicated state.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.succ() {'A': {'A': 0.7, 'B': 0.3}, 'B': {'A': 1.0}} >>> T.succ('A') {'A': 0.7, 'B': 0.3}
-
remove
(states)[source]¶ Return a copy of the Chain, without the indicated states.
Warning
All the links where the states appear are deleted, so that the result will not be in general a stochastic matrix.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.remove(['B']) {('A', 'A'): 0.7} >>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1., ('C','D'): .5, ('D','C'): 1., ('C','B'): .5}) >>> T.remove(['A','B']) {('C', 'D'): 0.5, ('D', 'C'): 1.0}
-
states
()[source]¶ Return the set of states.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.states() {'A', 'B'}
-
__pow__
(n)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T**2 {('A', 'B'): 0.21, ('B', 'A'): 0.70, ('A', 'A'): 0.79, ('B', 'B'): 0.30} >>> T**0 {('A', 'A'): 1.0, ('B', 'B'): 1.0}
-
__mul__
(v)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T * 3 {('B', 'A'): 3.0, ('A', 'B'): 0.9, ('A', 'A'): 2.1} >>> p = pykov.Vector(A=.3, B=.7) >>> T * p {'A': 0.42, 'B': 0.3} >>> W = pykov.Matrix({('N', 'M'): 0.5, ('M', 'N'): 0.7, ('M', 'M'): 0.3, ('O', 'N'): 0.5, ('O', 'O'): 0.5, ('N', 'O'): 0.5}) >>> W * W {('N', 'M'): 0.15, ('M', 'N'): 0.21, ('M', 'O'): 0.35, ('M', 'M'): 0.44, ('O', 'M'): 0.25, ('O', 'N'): 0.25, ('O', 'O'): 0.5, ('N', 'O'): 0.25, ('N', 'N'): 0.6}
-
__rmul__
(v)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> 3 * T {('B', 'A'): 3.0, ('A', 'B'): 0.9, ('A', 'A'): 2.1}
-
__add__
(M)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> I = pykov.Matrix({('A','A'):1, ('B','B'):1}) >>> T + I {('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 1.7, ('B', 'B'): 1.0}
-
__sub__
(M)[source]¶ >>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> I = pykov.Matrix({('A','A'):1, ('B','B'):1}) >>> T - I {('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): -0.3, ('B', 'B'): -1}
-
trace
()[source]¶ Return the Matrix trace.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.trace() 0.7
-
eye
()[source]¶ Return the Identity Matrix.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.eye() {('A', 'A'): 1., ('B', 'B'): 1.}
-
ones
()[source]¶ Return a
Vector
instance with entries equal to one.>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.ones() {'A': 1.0, 'B': 1.0}
-
transpose
()[source]¶ Return the transpose Matrix.
>>> T = pykov.Matrix({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.transpose() {('B', 'A'): 0.3, ('A', 'B'): 1.0, ('A', 'A'): 0.7}
-
_UMPFPACKSolve
(b, x=None, method='UMFPACK_A')[source]¶ UMFPACK ( U nsymmetric M ulti F Rontal PACK age)
- method:
- “UMFPACK_A” : \(\mathbf{A} x = b\) (default), “UMFPACK_At” : \(\mathbf{A}^T x = b\)
A column pre-ordering strategy for the unsymmetric-pattern multifrontal method, T. A. Davis, ACM Transactions on Mathematical Software, vol 30, no. 2, June 2004, pp. 165-195.
-
__module__
= 'Environment.pykov'¶
-
-
class
Environment.pykov.
Chain
(data=None)[source]¶ Bases:
Environment.pykov.Matrix
-
move
(state, random_func=None)[source]¶ Do one step from the indicated state, and return the final state.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.move('A') 'B'
Optionally, a function that generates a random number can be supplied. >>> def FakeRandom(min, max): return 0.01 >>> T.move('A', FakeRandom) 'B'
-
pow
(p, n)[source]¶ Find the probability distribution after n steps, starting from an initial
Vector
.>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> p = pykov.Vector(A=1) >>> T.pow(p,3) {'A': 0.7629999999999999, 'B': 0.23699999999999996} >>> p * T * T * T {'A': 0.7629999999999999, 'B': 0.23699999999999996}
-
steady
()[source]¶ With the assumption of ergodicity, return the steady state.
Note
Inverse iteration method (P is the Markov chain)
\[ \begin{align}\begin{aligned}Q = \mathbf{I} - P\\Q^T x = e\\e = (0,0,\dots,0,1)\end{aligned}\end{align} \]See also
W. Stewart: Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Chichester, West Sussex, 1994.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.steady() {'A': 0.7692307692307676, 'B': 0.23076923076923028}
-
entropy
(p=None, norm=False)[source]¶ Return the
Chain
entropy, calculated with the indicated probability Vector (the steady state by default).\[ \begin{align}\begin{aligned}H_i = - \sum_j P_{ij} \ln P_{ij}\\H = \sum_i \pi_i H_i\end{aligned}\end{align} \]See also
Khinchin, A. I. Mathematical Foundations of Information Theory Dover, 1957.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.entropy() 0.46989561696530169
With normalization entropy belongs to [0,1]
>>> T.entropy(norm=True) 0.33895603665233132
-
mfpt_to
(state)[source]¶ Return the Mean First Passage Times of every state to the indicated state.
See also
Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.
>>> d = {('R', 'N'): 0.25, ('R', 'S'): 0.25, ('S', 'R'): 0.25, ('R', 'R'): 0.5, ('N', 'S'): 0.5, ('S', 'S'): 0.5, ('S', 'N'): 0.25, ('N', 'R'): 0.5, ('N', 'N'): 0.0} >>> T = pykov.Chain(d) >>> T.mfpt_to('R') {'S': 3.333333333333333, 'N': 2.666666666666667}
-
adjacency
()[source]¶ Return the adjacency matrix.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.adjacency() {('B', 'A'): 1, ('A', 'B'): 1, ('A', 'A'): 1}
-
walk
(steps, start=None, stop=None)[source]¶ Return a random walk of n steps, starting and stopping at the indicated states.
Note
If the starting state is not indicated or is None, it is chosen according to its steady-state probability. If the stopping state is not None, the random walk stops early when the stopping state is reached.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.walk(10) ['B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'A'] >>> T.walk(10,'B','B') ['B', 'A', 'A', 'A', 'A', 'A', 'B']
-
walk_probability
(walk)[source]¶ Given a walk, return the log of its probability.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.walk_probability(['A','A','B','A','A']) -1.917322692203401 >>> probability = math.exp(-1.917322692203401) 0.147 >>> p = T.walk_probability(['A','B','B','B','A']) >>> math.exp(p) 0.0
-
mixing_time
(cutoff=0.25, jump=1, p=None)[source]¶ Return the mixing time.
If the initial distribution (p) is not indicated, then it is set to p = {'least probable state': 1}.
Note
The mixing time is calculated here as the number of steps (n) needed to have
\[ \begin{align}\begin{aligned}|p(n)-\pi| < 0.25\\p(n)=p P^n\\\pi=\pi P\end{aligned}\end{align} \]The parameter
jump
controls the iteration step, for example withjump=2
n has values 2,4,6,8,..>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4, ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2, ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2} >>> T = pykov.Chain(d) >>> T.mixing_time() 2
-
absorbing_time
(transient_set)[source]¶ Mean number of steps needed to leave the transient set.
Return the
Vector tau
, where tau[i] is the mean number of steps needed to leave the transient set starting from state i. The parameter transient_set is a subset of nodes.
Note
If the starting point is a Vector p, then it is sufficient to calculate p * tau in order to weigh the mean times according to the initial conditions.
>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4, ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2, ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2} >>> T = pykov.Chain(d) >>> p = pykov.Vector({'N':.3, 'S':.7}) >>> tau = T.absorbing_time(p.keys()) >>> p * tau 3.1333333333333329
-
absorbing_tour
(p, transient_set=None)[source]¶ Return a
Vector v
, where v[i] is the mean of the total number of times the process is in a given transient state i before leaving the transient set.
Note
v.sum() is equal to p * tau (see the absorbing_time() method). If not specified, the transient set is defined by means of the Vector p.
See also
Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.
>>> d = {('R','R'):1./2, ('R','N'):1./4, ('R','S'):1./4, ('N','R'):1./2, ('N','N'):0., ('N','S'):1./2, ('S','R'):1./4, ('S','N'):1./4, ('S','S'):1./2} >>> T = pykov.Chain(d) >>> p = pykov.Vector({'N':.3, 'S':.7}) >>> T.absorbing_tour(p) {'S': 2.2666666666666666, 'N': 0.8666666666666669}
-
fundamental_matrix
()[source]¶ Return the fundamental matrix.
See also
Kemeny J. G.; Snell, J. L. Finite Markov Chains. Springer-Verlag: New York, 1976.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.fundamental_matrix() {('B', 'A'): 0.17751479289940991, ('A', 'B'): 0.053254437869822958, ('A', 'A'): 0.94674556213017902, ('B', 'B'): 0.82248520710059214}
-
kemeny_constant
()[source]¶ Return the Kemeny constant of the transition matrix.
>>> T = pykov.Chain({('A','B'): .3, ('A','A'): .7, ('B','A'): 1.}) >>> T.kemeny_constant() 1.7692307692307712
-
accessibility_matrix
()[source]¶ Return the accessibility matrix of the Markov chain.
..see also: http://www.ssc.wisc.edu/~jmontgom/commclasses.pdf
-
communication_classes
()[source]¶ Return a Set of all communication classes of the Markov chain.
..see also: http://www.ssc.wisc.edu/~jmontgom/commclasses.pdf
>>> T = pykov.Chain({('A','A'): 1.0, ('B','B'): 1.0}) >>> T.communication_classes()
-
__module__
= 'Environment.pykov'¶
-
-
Environment.pykov.
readmat
(filename)[source]¶ Read an external file and return a Chain.
The file must be of the form:
A A .7 A B .3 B A 1
>>> P = pykov.readmat('/mypath/mat') >>> P {('B', 'A'): 1.0, ('A', 'B'): 0.3, ('A', 'A'): 0.7}
-
Environment.pykov.
readtrj
(filename)[source]¶ In the case the
Chain
instance must be created from a finite chain of states, the transition matrix is not fully defined. The function defines the transition probabilities as the maximum likelihood probabilities calculated along the chain. Having the file /mypath/trj
with the following format: 1 1 1 2 1 3
the
Chain
instance defined from that chain is:>>> t = pykov.readtrj('/mypath/trj') >>> t (1, 1, 1, 2, 1, 3) >>> p, P = maximum_likelihood_probabilities(t,lag_time=1, separator='0') >>> p {1: 0.6666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666} >>> P {(1, 2): 0.25, (1, 3): 0.25, (1, 1): 0.5, (2, 1): 1.0, (3, 3): 1.0} >>> type(P) <class 'pykov.Chain'> >>> type(p) <class 'pykov.Vector'>
-
Environment.pykov.
_writefile
(mylist, filename)[source]¶ Export the list to a file.
mylist can be a list of lists.
>>> L = [[2,3],[4,5]] >>> pykov.writefile(L,'tmp') >>> l = [1,2] >>> pykov.writefile(l,'tmp')
-
Environment.pykov.
transitions
(trj, nsteps=1, lag_time=1, separator='0')[source]¶ Return the temporal list of transitions observed.
trj : the symbolic trajectory. nsteps : number of steps. lag_time : step length. separator: the special symbol indicating the presence of sub-trajectories.
>>> trj = [1,2,1,0,2,3,1,0,2,3,2,3,1,2,3] >>> pykov.transitions(trj,1,1,0) [(1, 2), (2, 1), (2, 3), (3, 1), (2, 3), (3, 2), (2, 3), (3, 1), (1, 2), (2, 3)] >>> pykov.transitions(trj,1,2,0) [(1, 1), (2, 1), (2, 2), (3, 3), (2, 1), (3, 2), (1, 3)] >>> pykov.transitions(trj,2,2,0) [(2, 2, 1), (3, 3, 2), (2, 1, 3)]
-
Environment.pykov.
maximum_likelihood_probabilities
(trj, lag_time=1, separator='0')[source]¶ Return a Chain calculated by means of maximum likelihood probabilities.
Return two objects: p : a Vector object, the probability distribution over the nodes. T : a Chain object, the Markov chain.
trj : the symbolic trajectory. lag_time : number of steps defining a transition. separator: the special symbol indicating the presence of sub-trajectories.
>>> t = [1,2,3,2,3,2,1,2,2,3,3,2] >>> p, T = pykov.maximum_likelihood_probabilities(t) >>> p {1: 0.18181818181818182, 2: 0.4545454545454546, 3: 0.36363636363636365} >>> T {(1, 2): 1.0, (3, 2): 0.7499999999999999, (2, 3): 0.5999999999999999, (3, 3): 0.25, (2, 2): 0.19999999999999998, (2, 1): 0.19999999999999998}
-
Environment.pykov.
_remove_dead_branch
(transitions_list)[source]¶ Remove dead branches by inserting a self-loop in every node that has no outgoing links.
>>> trj = [1,2,3,1,2,3,2,2,4,3,5] >>> tr = pykov.transitions(trj, nsteps=1) >>> tr [(1, 2), (2, 3), (3, 1), (1, 2), (2, 3), (3, 2), (2, 2), (2, 4), (4, 3), (3, 5)] >>> pykov._remove_dead_branch(tr) >>> tr [(1, 2), (2, 3), (3, 1), (1, 2), (2, 3), (3, 2), (2, 2), (2, 4), (4, 3), (3, 5), (5, 5)]
Environment.sortedDistance module¶
sortedDistance: define functions to measure the sortedness of permutations of [0..N-1].
-
Environment.sortedDistance.
weightedDistance
(choices, weights, n=None)[source]¶ Relative difference between the best possible weighted choices and the actual choices.
>>> weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] >>> choices = [8, 6, 5, 2] >>> weightedDistance(choices, weights) # not a bad choice # doctest: +ELLIPSIS 0.8333... >>> choices = [8, 6, 5, 7] >>> weightedDistance(choices, weights) # best choice! # doctest: +ELLIPSIS 1.000... >>> choices = [3, 2, 1, 0] >>> weightedDistance(choices, weights) # worst choice! # doctest: +ELLIPSIS 0.3333...
-
Environment.sortedDistance.
manhattan
(permutation, comp=None)[source]¶ A certain measure of sortedness for the list A, based on Manhattan distance.
>>> perm = [0, 1, 2, 3, 4] >>> manhattan(perm) # sorted # doctest: +ELLIPSIS 1.0...
>>> perm = [0, 1, 2, 5, 4, 3] >>> manhattan(perm) # almost sorted! # doctest: +ELLIPSIS 0.777...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] # doctest: +ELLIPSIS >>> manhattan(perm) 0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! # doctest: +ELLIPSIS >>> manhattan(perm) 0.72
-
Environment.sortedDistance.
kendalltau
(permutation, comp=None)[source]¶ A certain measure of sortedness for the list A, based on Kendall Tau ranking coefficient.
>>> perm = [0, 1, 2, 3, 4] >>> kendalltau(perm) # sorted # doctest: +ELLIPSIS 0.98...
>>> perm = [0, 1, 2, 5, 4, 3] >>> kendalltau(perm) # almost sorted! # doctest: +ELLIPSIS 0.90...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] >>> kendalltau(perm) # doctest: +ELLIPSIS 0.211...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! >>> kendalltau(perm) # doctest: +ELLIPSIS 0.984...
-
Environment.sortedDistance.
spearmanr
(permutation, comp=None)[source]¶ A certain measure of sortedness for the list A, based on Spearman ranking coefficient.
>>> perm = [0, 1, 2, 3, 4] >>> spearmanr(perm) # sorted # doctest: +ELLIPSIS 1.0...
>>> perm = [0, 1, 2, 5, 4, 3] >>> spearmanr(perm) # almost sorted! # doctest: +ELLIPSIS 0.92...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] >>> spearmanr(perm) # doctest: +ELLIPSIS 0.248...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! >>> spearmanr(perm) # doctest: +ELLIPSIS 0.986...
-
Environment.sortedDistance.
gestalt
(permutation, comp=None)[source]¶ A certain measure of sortedness for the list A, based on Gestalt pattern matching.
>>> perm = [0, 1, 2, 3, 4] >>> gestalt(perm) # sorted # doctest: +ELLIPSIS 1.0...
>>> perm = [0, 1, 2, 5, 4, 3] >>> gestalt(perm) # almost sorted! # doctest: +ELLIPSIS 0.666...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] >>> gestalt(perm) # doctest: +ELLIPSIS 0.4...
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! >>> gestalt(perm) # doctest: +ELLIPSIS 0.5...
>>> import random >>> random.seed(0) >>> ratings = [random.gauss(1200, 200) for i in range(100000)] >>> gestalt(ratings) # doctest: +ELLIPSIS 8e-05...
-
Environment.sortedDistance.
meanDistance
(permutation, comp=None, methods=(<function manhattan>, <function gestalt>))[source]¶ A certain measure of sortedness for the list A, based on mean of the 2 distances: manhattan and gestalt.
>>> perm = [0, 1, 2, 3, 4] >>> meanDistance(perm) # sorted # doctest: +ELLIPSIS 1.0
>>> perm = [0, 1, 2, 5, 4, 3] >>> meanDistance(perm) # almost sorted! # doctest: +ELLIPSIS 0.722...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] # doctest: +ELLIPSIS >>> meanDistance(perm) 0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! # doctest: +ELLIPSIS >>> meanDistance(perm) 0.61
Warning
I removed
kendalltau()
and spearmanr()
as they were giving 100% in many cases where there was clearly no reason to give 100%…
-
Environment.sortedDistance.
sortedDistance
(permutation, comp=None, methods=(<function manhattan>, <function gestalt>))¶ A certain measure of sortedness for the list A, based on mean of the 2 distances: manhattan and gestalt.
>>> perm = [0, 1, 2, 3, 4] >>> meanDistance(perm) # sorted # doctest: +ELLIPSIS 1.0
>>> perm = [0, 1, 2, 5, 4, 3] >>> meanDistance(perm) # almost sorted! # doctest: +ELLIPSIS 0.722...
>>> perm = [2, 9, 6, 4, 0, 3, 1, 7, 8, 5] # doctest: +ELLIPSIS >>> meanDistance(perm) 0.4
>>> perm = [2, 1, 6, 4, 0, 3, 5, 7, 8, 9] # better sorted! # doctest: +ELLIPSIS >>> meanDistance(perm) 0.61
Warning
I removed
kendalltau()
and spearmanr()
as they were giving 100% in many cases where there was clearly no reason to give 100%…
Environment.usejoblib module¶
Import Parallel and delayed from joblib, safely.
-
class
Environment.usejoblib.
Parallel
(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)[source]¶ Bases:
joblib.logger.Logger
Helper class for readable parallel mapping.
Read more in the User Guide.
- n_jobs: int, default: None
- The maximum number of concurrently running jobs, such as the number of Python worker processes when backend=”multiprocessing” or the size of the thread-pool when backend=”threading”. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. None is a marker for ‘unset’ that will be interpreted as n_jobs=1 (sequential execution) unless the call is performed under a parallel_backend context manager that sets another value for n_jobs.
- backend: str, ParallelBackendBase instance or None, default: ‘loky’
Specify the parallelization backend implementation. Supported backends are:
- “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
- “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).
- finally, you can register backends by calling register_parallel_backend. This will allow you to implement a backend of your liking.
It is not recommended to hard-code the backend name in a call to Parallel in a library. Instead it is recommended to set soft hints (prefer) or hard constraints (require) so as to make it possible for library users to change the backend from the outside using the parallel_backend context manager.
- prefer: str in {‘processes’, ‘threads’} or None, default: None
- Soft hint to choose the default backend if no specific backend
was selected with the parallel_backend context manager. The
default process-based backend is ‘loky’ and the default
thread-based backend is ‘threading’. Ignored if the
backend
parameter is specified. - require: ‘sharedmem’ or None, default None
- Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with parallel_backend.
- verbose: int, optional
- The verbosity level: if non-zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported.
- timeout: float, optional
- Timeout limit for each task to complete. If any task takes longer a TimeOutError will be raised. Only applied when n_jobs != 1
- pre_dispatch: {‘all’, integer, or expression, as in ‘3*n_jobs’}
- The number of batches (of tasks) to be pre-dispatched. Default is ‘2*n_jobs’. When batch_size=”auto” this is a reasonable default and the workers should never starve.
- batch_size: int or ‘auto’, default: ‘auto’
- The number of atomic tasks to dispatch at once to each
worker. When individual evaluations are very fast, dispatching
calls to workers can be slower than sequential computation because
of the overhead. Batching fast computations together can mitigate
this.
The
'auto'
strategy keeps track of the time it takes for a batch to complete, and dynamically adjusts the batch size to keep the time on the order of half a second, using a heuristic. The initial batch size is 1.batch_size="auto"
withbackend="threading"
will dispatch batches of a single task at a time as the threading backend has very little overhead and using larger batch size has not proved to bring any gain in that case. - temp_folder: str, optional
Folder to be used by the pool for memmapping large arrays for sharing memory with worker processes. If None, this will try in order:
- a folder pointed by the JOBLIB_TEMP_FOLDER environment variable,
- /dev/shm if the folder exists and is writable: this is a RAM disk filesystem available by default on modern Linux distributions,
- the default system temporary folder that can be overridden with TMP, TMPDIR or TEMP environment variables, typically /tmp under Unix operating systems.
Only active when backend=”loky” or “multiprocessing”.
- max_nbytes int, str, or None, optional, 1M by default
- Threshold on the size of arrays passed to the workers that triggers automated memory mapping in temp_folder. Can be an int in Bytes, or a human-readable string, e.g., ‘1M’ for 1 megabyte. Use None to disable memmapping of large arrays. Only active when backend=”loky” or “multiprocessing”.
- mmap_mode: {None, ‘r+’, ‘r’, ‘w+’, ‘c’}
- Memmapping mode for numpy arrays passed to workers. See ‘max_nbytes’ parameter documentation for more details.
This object uses workers to compute in parallel the application of a function to many different arguments. The main functionality it brings in addition to using the raw multiprocessing or concurrent.futures API are (see examples for details):
- More readable code, in particular since it avoids constructing list of arguments.
- Easier debugging:
- informative tracebacks even when the error happens on the client side
- using ‘n_jobs=1’ enables to turn off parallel computing for debugging without changing the codepath
- early capture of pickling errors
- An optional progress meter.
- Interruption of multiprocesses jobs with ‘Ctrl-C’
- Flexible pickling control for the communication to and from the worker processes.
- Ability to use shared memory efficiently with worker processes for large numpy-based datastructures.
A simple example:
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
Reshaping the output when the function has several return values:
>>> from math import modf
>>> from joblib import Parallel, delayed
>>> r = Parallel(n_jobs=1)(delayed(modf)(i/2.) for i in range(10))
>>> res, i = zip(*r)
>>> res
(0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5)
>>> i
(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0)
The progress meter: the higher the value of verbose, the more messages:
>>> from time import sleep
>>> from joblib import Parallel, delayed
>>> r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10)) #doctest: +SKIP
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.6s
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed:    0.8s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    1.4s finished
Traceback example, note how the line of the error is indicated as well as the values of the parameter passed to the function that triggered the exception, even though the traceback happens in the child process:
>>> from heapq import nlargest
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(nlargest)(2, n) for n in (range(4), 'abcde', 3)) #doctest: +SKIP
#...
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
TypeError                                          Mon Nov 12 11:37:46 2012
PID: 12934                                 Python 2.7.3: /usr/bin/python
...........................................................................
/usr/lib/python2.7/heapq.pyc in nlargest(n=2, iterable=3, key=None)
    419         if n >= size:
    420             return sorted(iterable, key=key, reverse=True)[:n]
    421
    422         # When key is none, use simpler decoration
    423         if key is None:
--> 424             it = izip(iterable, count(0,-1))          # decorate
    425             result = _nlargest(n, it)
    426             return map(itemgetter(0), result)         # undecorate
    427
    428         # General case, slowest method
TypeError: izip argument #1 must support iteration
___________________________________________________________________________
Using pre_dispatch in a producer/consumer situation, where the data is generated on the fly. Note how the producer is first called 3 times before the parallel loop is initiated, and then called to generate new data on the fly:
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> def producer():
...     for i in range(6):
...         print('Produced %s' % i)
...         yield i
>>> out = Parallel(n_jobs=2, verbose=100, pre_dispatch='1.5*n_jobs')(
...     delayed(sqrt)(i) for i in producer()) #doctest: +SKIP
Produced 0
Produced 1
Produced 2
[Parallel(n_jobs=2)]: Done 1 jobs | elapsed: 0.0s
Produced 3
[Parallel(n_jobs=2)]: Done 2 jobs | elapsed: 0.0s
Produced 4
[Parallel(n_jobs=2)]: Done 3 jobs | elapsed: 0.0s
Produced 5
[Parallel(n_jobs=2)]: Done 4 jobs | elapsed: 0.0s
[Parallel(n_jobs=2)]: Done 6 out of 6 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=2)]: Done 6 out of 6 | elapsed: 0.0s finished
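As recommended above, a library should pass a soft hint (prefer) rather than hard-code a backend, so that users can still override it from the outside with the parallel_backend context manager. A minimal sketch (the function square below is purely illustrative):
>>> from joblib import Parallel, delayed, parallel_backend
>>> def square(x):
...     return x ** 2
>>> Parallel(n_jobs=2, prefer="threads")(delayed(square)(i) for i in range(4)) #doctest: +SKIP
[0, 1, 4, 9]
>>> with parallel_backend("loky", n_jobs=2):  #doctest: +SKIP
...     results = Parallel()(delayed(square)(i) for i in range(4))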
-
__init__
(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)[source]¶ - depth: int, optional
- The depth of objects printed.
-
_dispatch
(batch)[source]¶ Queue the batch for computing, with or without multiprocessing
WARNING: this method is not thread-safe: it should be only called indirectly via dispatch_one_batch.
-
dispatch_next
()[source]¶ Dispatch more data for parallel processing
This method is meant to be called concurrently by the multiprocessing callback. We rely on the thread-safety of dispatch_one_batch to protect against concurrent consumption of the unprotected iterator.
-
dispatch_one_batch
(iterator)[source]¶ Prefetch the tasks for the next batch and dispatch them.
The effective size of the batch is computed here. If there are no more jobs to dispatch, return False, else return True.
The iterator consumption and dispatching is protected by the same lock so calling this function should be thread safe.
-
print_progress
()[source]¶ Display the progress of the parallel execution, only a fraction of the time, controlled by self.verbose.
-
__module__
= 'joblib.parallel'¶
Environment.usenumba module¶
Import numba.jit or a dummy decorator.
-
Environment.usenumba.
USE_NUMBA
= False¶ Configure the use of numba
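A minimal sketch of the import-or-dummy pattern this module implements (a guess at the shape of the code, not necessarily the exact source of Environment.usenumba):
try:
    from numba import jit  # use the real JIT compiler if numba is installed
    USE_NUMBA = True
except ImportError:
    USE_NUMBA = False
    def jit(function=None, *args, **kwargs):
        """Dummy decorator: return the function (or a pass-through decorator) unchanged."""
        if function is not None and callable(function):
            return function
        return lambda f: f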
Environment.usetqdm module¶
Import tqdm from tqdm, safely.
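A minimal sketch of such a safe import, with a do-nothing fallback (the flag USE_TQDM and the fallback below are illustrative, not necessarily the module's actual code):
try:
    from tqdm import tqdm  # real progress bars if tqdm is installed
    USE_TQDM = True
except ImportError:
    USE_TQDM = False
    def tqdm(iterable=None, *args, **kwargs):
        """Fallback: return the iterable unchanged, without any progress bar."""
        return iterable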
-
class
Environment.usetqdm.
tqdm
(iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, lock_args=None, nrows=None, colour=None, delay=0, gui=False, **kwargs)[source]¶ Bases:
tqdm.utils.Comparable
Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.
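For instance, wrapping any iterable is enough to get a live progress bar (standard tqdm usage, shown here only for reference):
>>> from tqdm import tqdm
>>> total = 0
>>> for i in tqdm(range(10000)):  #doctest: +SKIP
...     total += i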
-
monitor_interval
= 10¶
-
monitor
= None¶
-
_instances
= <_weakrefset.WeakSet object>¶
-
static
format_sizeof
(num, suffix='', divisor=1000)[source]¶ Formats a number (greater than unity) with SI Order of Magnitude prefixes.
- num : float
- Number ( >= 1) to format.
- suffix : str, optional
- Post-postfix [default: ‘’].
- divisor : float, optional
- Divisor between prefixes [default: 1000].
- out : str
- Number with Order of Magnitude SI unit postfix.
-
static
format_interval
(t)[source]¶ Formats a number of seconds as a clock time, [H:]MM:SS
- t : int
- Number of seconds.
- out : str
- [H:]MM:SS
-
static
format_num
(n)[source]¶ Intelligent scientific notation (.3g).
- n : int or float or Numeric
- A Number.
- out : str
- Formatted number.
-
static
status_printer
(file)[source]¶ Manage the printing and in-place updating of a line of characters. Note that if the string is longer than a line, then in-place updating may not work (it will print a new line at each refresh).
-
static
format_meter
(n, total, elapsed, ncols=None, prefix='', ascii=False, unit='it', unit_scale=False, rate=None, bar_format=None, postfix=None, unit_divisor=1000, initial=0, colour=None, **extra_kwargs)[source]¶ Return a string-based progress bar given some parameters
- n : int or float
- Number of finished iterations.
- total : int or float
- The expected total number of iterations. If meaningless (None), only basic progress statistics are displayed (no ETA).
- elapsed : float
- Number of seconds passed since start.
- ncols : int, optional
- The width of the entire output message. If specified, dynamically resizes {bar} to stay within this bound [default: None]. If 0, will not print any bar (only stats). The fallback is {bar:10}.
- prefix : str, optional
- Prefix message (included in total width) [default: ‘’]. Use as {desc} in bar_format string.
- ascii : bool, optional or str, optional
- If not set, use unicode (smooth blocks) to fill the meter [default: False]. The fallback is to use ASCII characters ” 123456789#”.
- unit : str, optional
- The iteration unit [default: ‘it’].
- unit_scale : bool or int or float, optional
- If 1 or True, the number of iterations will be printed with an appropriate SI metric prefix (k = 10^3, M = 10^6, etc.) [default: False]. If any other non-zero number, will scale total and n.
- rate : float, optional
- Manual override for iteration rate. If [default: None], uses n/elapsed.
- bar_format : str, optional
Specify a custom bar string formatting. May impact performance. [default: ‘{l_bar}{bar}{r_bar}’], where l_bar=’{desc}: {percentage:3.0f}%|’ and r_bar=’| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]’.
- Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s, eta.
Note that a trailing “: ” is automatically removed after {desc} if the latter is empty.
- postfix : *, optional
- Similar to prefix, but placed at the end (e.g. for additional stats). Note: postfix is usually a string (not a dict) for this method, and will if possible be set to postfix = ‘, ‘ + postfix. However other types are supported (#382).
- unit_divisor : float, optional
- [default: 1000], ignored unless unit_scale is True.
- initial : int or float, optional
- The initial counter value [default: 0].
- colour : str, optional
- Bar colour (e.g. ‘green’, ‘#00ff00’).
out : Formatted meter and stats, ready to display.
-
static
__new__
(cls, *_, **__)[source]¶ Create and return a new object. See help(type) for accurate signature.
-
classmethod
_decr_instances
(instance)[source]¶ Remove from list and reposition another unfixed bar to fill the new gap.
This means that by default (where all nested bars are unfixed), order is not maintained but screen flicker/blank space is minimised. (tqdm<=4.44.1 moved ALL subsequent unfixed bars up.)
-
classmethod
write
(s, file=None, end='\n', nolock=False)[source]¶ Print a message via tqdm (without overlap with bars).
-
classmethod
external_write_mode
(file=None, nolock=False)[source]¶ Disable tqdm within context and refresh tqdm when exits. Useful when writing to standard output stream
-
classmethod
pandas
(**tqdm_kwargs)[source]¶ - Registers the current tqdm class with
- pandas.core. ( frame.DataFrame | series.Series | groupby.(generic.)DataFrameGroupBy | groupby.(generic.)SeriesGroupBy ).progress_apply
A new instance will be created every time progress_apply is called, and each instance will automatically close() upon completion.
tqdm_kwargs : arguments for the tqdm instance
>>> import pandas as pd
>>> import numpy as np
>>> from tqdm import tqdm
>>> from tqdm.gui import tqdm as tqdm_gui
>>>
>>> df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
>>> tqdm.pandas(ncols=50)  # can use tqdm_gui, optional kwargs, etc
>>> # Now you can use `progress_apply` instead of `apply`
>>> df.groupby(0).progress_apply(lambda x: x**2)
Reference: https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations-python
-
__init__
(iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, lock_args=None, nrows=None, colour=None, delay=0, gui=False, **kwargs)[source]¶ - iterable : iterable, optional
- Iterable to decorate with a progressbar. Leave blank to manually manage the updates.
- desc : str, optional
- Prefix for the progressbar.
- total : int or float, optional
- The number of expected iterations. If unspecified, len(iterable) is used if possible. If float(“inf”) or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.
- leave : bool, optional
- If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.
- file : io.TextIOWrapper or io.StringIO, optional
- Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.
- ncols : int, optional
- The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).
- mininterval : float, optional
- Minimum progress display update interval [default: 0.1] seconds.
- maxinterval : float, optional
- Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.
- miniters : int or float, optional
- Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.
- ascii : bool or str, optional
- If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters ” 123456789#”.
- disable : bool, optional
- Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.
- unit : str, optional
- String that will be used to define the unit of each iteration [default: it].
- unit_scale : bool or int or float, optional
- If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.
- dynamic_ncols : bool, optional
- If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].
- smoothing : float, optional
- Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].
- bar_format : str, optional
Specify a custom bar string formatting. May impact performance. [default: ‘{l_bar}{bar}{r_bar}’], where l_bar=’{desc}: {percentage:3.0f}%|’ and r_bar=’| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]’.
- Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s, eta.
Note that a trailing “: ” is automatically removed after {desc} if the latter is empty.
- initial : int or float, optional
- The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.
- position : int, optional
- Specify the line offset to print this bar (starting from 0) Automatic if unspecified. Useful to manage multiple bars at once (eg, from threads).
- postfix : dict or *, optional
- Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).
- unit_divisor : float, optional
- [default: 1000], ignored unless unit_scale is True.
- write_bytes : bool, optional
- If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.
- lock_args : tuple, optional
- Passed to refresh for intermediate output (initialisation, iterating, and updating).
- nrows : int, optional
- The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.
- colour : str, optional
- Bar colour (e.g. ‘green’, ‘#00ff00’).
- delay : float, optional
- Don’t display until [default: 0] seconds have elapsed.
- gui : bool, optional
- WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(…) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].
out : decorated iterator.
-
_comparable
¶
-
update
(n=1)[source]¶ Manually update the progress bar, useful for streams such as reading files. E.g.:
>>> t = tqdm(total=filesize)  # Initialise
>>> for current_buffer in stream:
...     ...
...     t.update(len(current_buffer))
>>> t.close()
The last line is highly recommended, but possibly not necessary if t.update() will be called in such a way that filesize will be exactly reached and printed.
- n : int or float, optional
- Increment to add to the internal counter of iterations [default: 1]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.
- out : bool or None
- True if a display() was triggered.
-
__module__
= 'tqdm.std'¶
-
refresh
(nolock=False, lock_args=None)[source]¶ Force refresh the display of this bar.
- nolock : bool, optional
- If True, does not lock. If [default: False]: calls acquire() on internal lock.
- lock_args : tuple, optional
- Passed to internal lock’s acquire(). If specified, will only display() if acquire() returns True.
-
reset
(total=None)[source]¶ Resets to 0 iterations for repeated use.
Consider combining with leave=True.
total : int or float, optional. Total to use for the new bar.
-
set_description
(desc=None, refresh=True)[source]¶ Set/modify description of the progress bar.
desc : str, optional refresh : bool, optional
Forces refresh [default: True].
-
set_postfix
(ordered_dict=None, refresh=True, **kwargs)[source]¶ Set/modify postfix (additional stats) with automatic formatting based on datatype.
ordered_dict : dict or OrderedDict, optional refresh : bool, optional
Forces refresh [default: True].kwargs : dict, optional
-
set_postfix_str
(s='', refresh=True)[source]¶ Postfix without dictionary expansion, similar to prefix handling.
-
format_dict
¶ Public API for read-only member access.
-
display
(msg=None, pos=None)[source]¶ Use self.sp to display msg in the specified pos.
Consider overloading this function when inheriting to use e.g.: self.some_frontend(**self.format_dict) instead of self.sp.
msg : str, optional. What to display (default: repr(self)). pos : int, optional. Position to moveto
(default: abs(self.pos)).
-
classmethod
wrapattr
(stream, method, total=None, bytes=True, **tqdm_kwargs)[source]¶ stream : file-like object. method : str, “read” or “write”. The result of read() and the first argument of write() should have a len().
>>> with tqdm.wrapattr(file_obj, "read", total=file_obj.size) as fobj:
...     while True:
...         chunk = fobj.read(chunk_size)
...         if not chunk:
...             break
-
Policies package¶
Policies
module : contains all the (single-player) bandits algorithms:
- “Stupid” algorithms:
Uniform, UniformOnSome, TakeFixedArm, TakeRandomFixedArm,
- Greedy algorithms: EpsilonGreedy, EpsilonFirst, EpsilonDecreasing, EpsilonDecreasingMEGA, EpsilonExpDecreasing,
- And variants of the Explore-Then-Commit policy: ExploreThenCommit.ETC_KnownGap, ExploreThenCommit.ETC_RandomStop, ExploreThenCommit.ETC_FixedBudget, ExploreThenCommit.ETC_SPRT, ExploreThenCommit.ETC_BAI, ExploreThenCommit.DeltaUCB,
- Probabilistic weighting algorithms: Hedge, Softmax, Softmax.SoftmaxDecreasing, Softmax.SoftMix, Softmax.SoftmaxWithHorizon, Exp3, Exp3.Exp3Decreasing, Exp3.Exp3SoftMix, Exp3.Exp3WithHorizon, Exp3.Exp3ELM, ProbabilityPursuit, Exp3PlusPlus, a smart variant BoltzmannGumbel, and a recent extension TsallisInf,
- Index based UCB algorithms: EmpiricalMeans, UCB, UCBalpha, UCBmin, UCBplus, UCBrandomInit, UCBV, UCBVtuned, UCBH, CPUCB, UCBimproved,
- Index based MOSS algorithms: MOSS, MOSSH, MOSSAnytime, MOSSExperimental,
- Bayesian algorithms: Thompson, BayesUCB, and DiscountedThompson,
- Based on the Kullback-Leibler divergence: klUCB, klUCBloglog, klUCBPlus, klUCBH, klUCBHPlus, klUCBPlusPlus, klUCBswitch,
- Other index algorithms: DMED, DMED.DMEDPlus, IMED, OCUCBH, OCUCBH.AOCUCBH, OCUCB, UCBdagger,
- Hybrid algorithms, mixing Bayesian and UCB indexes: AdBandits,
- Aggregation algorithms: Aggregator (mine, it’s awesome, go on try it!), and CORRAL, LearnExp,
- Finite-Horizon Gittins index, approximated version: ApproximatedFHGittins,
- An experimental policy, using a sliding window of for instance 100 draws, and resetting the algorithm as soon as the small empirical average is too far away from the full-history empirical average (or just restarting one arm, if possible): SlidingWindowRestart, and 3 versions for UCB, UCBalpha and klUCB: SlidingWindowRestart.SWR_UCB, SlidingWindowRestart.SWR_UCBalpha, SlidingWindowRestart.SWR_klUCB (my algorithm, unpublished yet),
- An experimental policy, using just a sliding window of for instance 100 draws: SlidingWindowUCB.SWUCB, and SlidingWindowUCB.SWUCBPlus if the horizon is known. There is also SlidingWindowUCB.SWklUCB.
- Another experimental policy with a discount factor: DiscountedUCB and DiscountedUCB.DiscountedUCBPlus, as well as versions using klUCB, DiscountedUCB.DiscountedklUCB and DiscountedUCB.DiscountedklUCBPlus.
- Other policies for non-stationary problems: LM_DSEE, SWHash_UCB.SWHash_IndexPolicy, CD_UCB.CUSUM_IndexPolicy, CD_UCB.PHT_IndexPolicy, CD_UCB.UCBLCB_IndexPolicy, CD_UCB.GaussianGLR_IndexPolicy, CD_UCB.BernoulliGLR_IndexPolicy, Monitored_UCB.Monitored_IndexPolicy, OracleSequentiallyRestartPolicy, AdSwitch.
- A policy designed to tackle sparse stochastic bandit problems: SparseUCB, SparseklUCB, and SparseWrapper, which can be used with any index policy.
- A policy that implements a “smart doubling trick” to turn any horizon-dependent policy into a horizon-independent policy without losing in performance: DoublingTrickWrapper,
- An experimental policy, implementing another kind of doubling trick to turn any policy that needs to know the range \([a,b]\) of rewards into a policy that does not need to know the range, and that adapts dynamically to new observations: WrapRange,
- The Optimal Sampling for Structured Bandits (OSSB) policy: OSSB (it is more generic and can be applied to almost any kind of bandit problem; it works fine for classical stationary bandits but it is not optimal), a variant for Gaussian problems, GaussianOSSB, and a variant for sparse bandits, SparseOSSB. There are also two variants with decreasing rates, OSSB_DecreasingRate and OSSB_AutoDecreasingRate,
- The Best Empirical Sampled Average (BESA) policy: BESA (it works crazily well),
- New! The UCBoost (Upper Confidence bounds with Boosting) policies, first with no boosting: UCBoost.UCB_sq, UCBoost.UCB_bq, UCBoost.UCB_h, UCBoost.UCB_lb, UCBoost.UCB_t, then the ones with non-adaptive boosting: UCBoost.UCBoost_bq_h_lb, UCBoost.UCBoost_bq_h_lb_t, UCBoost.UCBoost_bq_h_lb_t_sq, UCBoost.UCBoost, and finally the epsilon-approximation boosting with UCBoost.UCBoostEpsilon,
- Some are designed only for (fully decentralized) multi-player games: MusicalChair, MEGA, TrekkingTSN, MusicalChairNoSensing, SIC_MMAB…
Note
The list above might not be complete, see the details below.
All policies have the same interface, as described in BasePolicy, in order to use them in any experiment with the following approach:
my_policy = Policy(nbArms)
my_policy.startGame()  # start the game
for t in range(T):
    chosen_arm_t = k_t = my_policy.choice()   # choose one arm
    reward_t = arms[k_t].draw()               # sample a reward from that arm (arms = list of Arm objects)
    my_policy.getReward(k_t, reward_t)        # give it to the policy
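For instance, a minimal self-contained run with Bernoulli arms and the UCB policy could look like the sketch below (assuming the package is installed and importable as SMPyBandits, so that SMPyBandits.Arms and SMPyBandits.Policies are available):
import numpy as np
from SMPyBandits.Arms import Bernoulli
from SMPyBandits.Policies import UCB

means = [0.1, 0.5, 0.9]
arms = [Bernoulli(mu) for mu in means]    # 3 Bernoulli arms
nbArms, T = len(arms), 1000

my_policy = UCB(nbArms)
my_policy.startGame()                     # start the game
rewards = np.zeros(T)
for t in range(T):
    k_t = my_policy.choice()              # choose one arm
    reward_t = arms[k_t].draw()           # sample a reward from that arm
    my_policy.getReward(k_t, reward_t)    # give it back to the policy
    rewards[t] = reward_t
print("Mean reward =", rewards.mean())    # should approach max(means) = 0.9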
-
Policies.
klucb_mapping
= {'Bernoulli': <function klucbBern>, 'Exponential': <function klucbExp>, 'Gamma': <function klucbGamma>, 'Gaussian': <function klucbGauss>, 'Poisson': <function klucbPoisson>}¶ Maps name of arms to kl functions
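This mapping can be used to pick the kl function matching the arm distributions when building a kl-UCB index policy. A small sketch (same SMPyBandits import assumption as above, using the klucb parameter shown in the kl-UCB signatures below):
from SMPyBandits.Policies import klUCB, klucb_mapping

klucb_gauss = klucb_mapping["Gaussian"]        # kl function suited to Gaussian arms
policy = klUCB(nbArms=3, klucb=klucb_gauss)    # pass it to the kl-UCB policy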
Subpackages¶
Policies.Experimentals package¶
The Empirical KL-UCB algorithm non-parametric policy. Reference: [Maillard, Munos & Stoltz - COLT, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2012].
-
class
Policies.Experimentals.KLempUCB.
KLempUCB
(nbArms, maxReward=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
IndexPolicy.IndexPolicy
The Empirical KL-UCB algorithm non-parametric policy. References: [Maillard, Munos & Stoltz - COLT, 2011], [Cappé, Garivier, Maillard, Munos & Stoltz, 2012].
-
__init__
(nbArms, maxReward=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
maxReward
= None¶ Known upper bound on the rewards
-
pulls
= None¶ Keep track of pulls of each arm
-
obs
= None¶ UNBOUNDED dictionary for each arm: keeps track of how many observations of each reward were seen. Warning: KLempUCB works better for discrete distributions!
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update count of observations for that arm.
-
__module__
= 'Policies.Experimentals.KLempUCB'¶
-
The Thompson (Bayesian) index policy, using an average of several sampled indexes (averageOn of them, 10 by default). By default, it uses a Beta posterior. Reference: [Thompson - Biometrika, 1933].
-
Policies.Experimentals.ThompsonRobust.
AVERAGEON
= 10¶ Default value of how many indexes are computed by sampling the posterior for the ThompsonRobust variant.
-
class
Policies.Experimentals.ThompsonRobust.
ThompsonRobust
(nbArms, posterior=<class 'Posterior.Beta.Beta'>, averageOn=10, lower=0.0, amplitude=1.0)[source]¶ Bases:
Thompson.Thompson
The Thompson (Bayesian) index policy, using an average of several sampled indexes (averageOn of them, 10 by default). By default, it uses a Beta posterior. Reference: [Thompson - Biometrika, 1933].
-
__init__
(nbArms, posterior=<class 'Posterior.Beta.Beta'>, averageOn=10, lower=0.0, amplitude=1.0)[source]¶ Create a new Bayesian policy, by creating a default posterior on each arm.
-
averageOn
= None¶ How many indexes are computed before averaging
-
computeIndex
(arm)[source]¶ Compute the current index for this arm, by sampling averageOn times the posterior and returning the average index.
At time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by sampling from the Beta posterior and averaging:
\[\begin{split}I_k(t) &= \frac{1}{\mathrm{averageOn}} \sum_{i=1}^{\mathrm{averageOn}} I_k^{(i)}(t), \\ I_k^{(i)}(t) &\sim \mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)).\end{split}\]
-
__module__
= 'Policies.Experimentals.ThompsonRobust'¶
-
The UCB policy for bounded bandits, with UCB indexes computed with Julia. Reference: [Lai & Robbins, 1985].
Warning
Using a Julia function from Python will not speed up anything, as there is a lot of overhead in the “bridge” protocol used by pyjulia. The idea of naively using a tiny Julia function to speed up computations is basically useless.
A naive benchmark showed that, in this approach, UCBjulia (used within Python) is about 125 times slower (!) than UCB.
Warning
This is only experimental, and purely useless. See https://github.com/SMPyBandits/SMPyBandits/issues/98
-
class
Policies.Experimentals.UCBjulia.
UCBjulia
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
IndexPolicy.IndexPolicy
The UCB policy for bounded bandits, with UCB indexes computed with Julia. Reference: [Lai & Robbins, 1985].
Warning
This is only experimental, and purely useless. See https://github.com/SMPyBandits/SMPyBandits/issues/98
-
__init__
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Will fail directly if the bridge with julia is unavailable or buggy.
-
__module__
= 'Policies.Experimentals.UCBjulia'¶
-
The UCB policy for bounded bandits, using \(\log_{10}(t)\) and not \(\log(t)\) for UCB index. Reference: [Lai & Robbins, 1985].
-
class
Policies.Experimentals.UCBlog10.
UCBlog10
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
IndexPolicy.IndexPolicy
The UCB policy for bounded bandits, using \(\log_{10}(t)\) and not \(\log(t)\) for UCB index. Reference: [Lai & Robbins, 1985].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \log_{10}(t)}{N_k(t)}}.\]
-
__module__
= 'Policies.Experimentals.UCBlog10'¶
-
The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Note: \(\log_{10}(t)\) and not \(\log(t)\) for UCB index. Reference: [Auer et al. 02].
-
Policies.Experimentals.UCBlog10alpha.
ALPHA
= 1¶ Default parameter for alpha
-
class
Policies.Experimentals.UCBlog10alpha.
UCBlog10alpha
(nbArms, alpha=1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Experimentals.UCBlog10.UCBlog10
The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Note: \(\log_{10}(t)\) and not \(\log(t)\) for UCB index. Reference: [Auer et al. 02].
-
__init__
(nbArms, alpha=1, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
alpha
= None¶ Parameter alpha
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log_{10}(t)}{2 N_k(t)}}.\]
-
__module__
= 'Policies.Experimentals.UCBlog10alpha'¶
-
The UCBwrong policy for bounded bandits, like UCB but with a typo on the estimator of means: \(\frac{X_k(t)}{t}\) is used instead of \(\frac{X_k(t)}{N_k(t)}\).
A 2009 paper by W. Jouini, C. Moy and J. Palicot contained this typo; I reimplemented it just to check that:
- its performance is worse than simple UCB,
- but not that bad…
-
class
Policies.Experimentals.UCBwrong.
UCBwrong
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
IndexPolicy.IndexPolicy
The UCBwrong policy for bounded bandits, like UCB but with a typo on the estimator of means.
A 2009 paper by W. Jouini, C. Moy and J. Palicot contained this typo; I reimplemented it just to check that:
- its performance is worse than simple UCB
- but not that bad…
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{t} + \sqrt{\frac{2 \log(t)}{N_k(t)}}.\]
-
__module__
= 'Policies.Experimentals.UCBwrong'¶
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].
-
class
Policies.Experimentals.klUCBlog10.
klUCBlog10
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
klUCB.klUCB
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log_{10}(t)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.Experimentals.klUCBlog10'¶
-
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].
-
class
Policies.Experimentals.klUCBloglog10.
klUCBloglog10
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
klUCB.klUCB
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: using \(\log_{10}(t)\) and not \(\log(t)\) for the KL-UCB index. Reference: [Garivier & Cappé - COLT, 2011].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log_{10}(t) + c \log(\max(1, \log_{10}(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.Experimentals.klUCBloglog10'¶
-
Policies.Posterior package¶
Posteriors for Bayesian Index policies:
Beta
is the default forThompson
Sampling andBayesUCB
, ideal for Bernoulli experiments,Gamma
andGauss
are more suited for respectively Poisson and Gaussian arms,DiscountedBeta
is the default forPolicies.DiscountedThompson
Sampling, ideal for Bernoulli experiments on non stationary bandits.
Manipulate posteriors of Bernoulli/Beta experiments.
Rewards not in \(\{0, 1\}\) are handled with a trick, see bernoulliBinarization()
, with a “random binarization”, cf. [Agrawal12] (algorithm 2).
When a reward \(r_t \in [0, 1]\) is observed, the posterior is updated with the outcome of a Bernoulli sample of mean \(r_t\): \(\tilde{r}_t \sim \mathrm{Bernoulli}(r_t)\), so the observation is indeed in \(\{0, 1\}\).
- See https://en.wikipedia.org/wiki/Bernoulli_distribution#Related_distributions
- And https://en.wikipedia.org/wiki/Conjugate_prior#Discrete_distributions
[Agrawal12] | http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf |
-
Policies.Posterior.Beta.
bernoulliBinarization
(r_t)[source]¶ Return a (random) binarization of a reward \(r_t\), in the continuous interval \([0, 1]\), as an observation in discrete \(\{0, 1\}\).
- Useful to allow to use a Beta posterior for non-Bernoulli experiments,
- That way,
Thompson
sampling can be used for any continuous-valued bounded rewards.
Examples:
>>> import random
>>> random.seed(0)
>>> bernoulliBinarization(0.3)
1
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.3)
0
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
1
>>> bernoulliBinarization(0.9)
0
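A minimal sketch of such a binarization (with a hypothetical name bernoulli_binarization; the package's own bernoulliBinarization() may handle edge cases differently):
from random import random

def bernoulli_binarization(r_t):
    """Return 1 with probability r_t and 0 otherwise, for r_t in [0, 1]."""
    return 1 if random() < r_t else 0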
-
class
Policies.Posterior.Beta.
Beta
(a=1, b=1)[source]¶ Bases:
Policies.Posterior.Posterior.Posterior
Manipulate posteriors of Bernoulli/Beta experiments.
-
__init__
(a=1, b=1)[source]¶ Create a Beta posterior \(\mathrm{Beta}(\alpha, \beta)\) with no observation, i.e., \(\alpha = 1\) and \(\beta = 1\) by default.
-
N
= None¶ List of two parameters [a, b]
-
sample
()[source]¶ Get a random sample from the Beta posterior (using
numpy.random.betavariate()
).- Used only by
Thompson
Sampling andAdBandits
so far.
- Used only by
-
quantile
(p)[source]¶ Return the p quantile of the Beta posterior (using
scipy.stats.btdtri()
).- Used only by
BayesUCB
andAdBandits
so far.
- Used only by
-
update
(obs)[source]¶ Add an observation.
- If obs is 1, update \(\alpha\) the count of positive observations,
- If it is 0, update \(\beta\) the count of negative observations.
Note
Otherwise, a trick with
bernoulliBinarization()
has to be used.
-
__module__
= 'Policies.Posterior.Beta'¶
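An illustrative use of this posterior in the spirit of Thompson sampling (a sketch only, assuming the package is importable as SMPyBandits; the actual policies wrap this logic inside their choice() and getReward() methods):
import numpy as np
from SMPyBandits.Policies.Posterior.Beta import Beta

true_means = [0.2, 0.5, 0.8]
posteriors = [Beta() for _ in true_means]      # one Beta posterior per arm
for t in range(1000):
    samples = [post.sample() for post in posteriors]
    k = int(np.argmax(samples))                # Thompson step: play the arm with the best sample
    reward = int(np.random.rand() < true_means[k])
    posteriors[k].update(reward)               # update only the played arm's posterior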
-
-
Policies.Posterior.Beta.
betavariate
()¶ beta(a, b, size=None)
Draw samples from a Beta distribution.
The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function
\[f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]where the normalization, B, is the beta function,
\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]It is often seen in Bayesian inference and order statistics.
Note
New code should use the
beta
method of adefault_rng()
instance instead; please see the Quick Start.- a : float or array_like of floats
- Alpha, positive (>0).
- b : float or array_like of floats
- Beta, positive (>0).
- size : int or tuple of ints, optional
- Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. If size isNone
(default), a single value is returned ifa
andb
are both scalars. Otherwise,np.broadcast(a, b).size
samples are drawn.
- out : ndarray or scalar
- Drawn samples from the parameterized beta distribution.
Generator.beta: which should be used for new code.
-
Policies.Posterior.Beta.
random
() → x in the interval [0, 1).¶
Manipulate posteriors of Bernoulli/Beta experiments., for discounted Bayesian policies (Policies.DiscountedBayesianIndexPolicy
).
-
Policies.Posterior.DiscountedBeta.
GAMMA
= 0.95¶ Default value for the discount factor \(\gamma\in(0,1)\).
0.95
is empirically a reasonable value for short-term non-stationary experiments.
-
class
Policies.Posterior.DiscountedBeta.
DiscountedBeta
(gamma=0.95, a=1, b=1)[source]¶ Bases:
Policies.Posterior.Beta.Beta
Manipulate posteriors of Bernoulli/Beta experiments, for discounted Bayesian policies (
Policies.DiscountedBayesianIndexPolicy
).- It keeps \(\tilde{S}(t)\) and \(\tilde{F}(t)\) the discounted counts of successes and failures (S and F).
-
__init__
(gamma=0.95, a=1, b=1)[source]¶ Create a Beta posterior \(\mathrm{Beta}(\alpha, \beta)\) with no observation, i.e., \(\alpha = 1\) and \(\beta = 1\) by default.
-
N
= None¶ List of two parameters [a, b]
-
gamma
= None¶ Discount factor \(\gamma\in(0,1)\).
-
reset
(a=None, b=None)[source]¶ Reset alpha and beta, both to 0 as when creating a new default DiscountedBeta.
-
sample
()[source]¶ Get a random sample from the DiscountedBeta posterior (using
numpy.random.betavariate()
).- Used only by
Thompson
Sampling andAdBandits
so far.
- Used only by
-
quantile
(p)[source]¶ Return the p quantile of the DiscountedBeta posterior (using
scipy.stats.btdtri()
).- Used only by
BayesUCB
andAdBandits
so far.
- Used only by
-
update
(obs)[source]¶ Add an observation, and discount the previous observations.
- If obs is 1, update \(\alpha\) the count of positive observations,
- If it is 0, update \(\beta\) the count of negative observations.
- But instead of using \(\tilde{S}(t) = S(t)\) and \(\tilde{N}(t) = N(t)\), they are updated at each time step using the discount factor \(\gamma\):
\[\begin{split}\tilde{S}(t+1) &= \gamma \tilde{S}(t) + r(t), \\ \tilde{F}(t+1) &= \gamma \tilde{F}(t) + (1 - r(t)).\end{split}\]Note
Otherwise, a trick with
bernoulliBinarization()
has to be used.
-
discount
()[source]¶ Simply discount the old observation, when no observation is given at this time.
\[\begin{split}\tilde{S}(t+1) &= \gamma \tilde{S}(t), \\ \tilde{F}(t+1) &= \gamma \tilde{F}(t).\end{split}\]
-
undiscount
()[source]¶ Simply cancel the discount on the old observation, when no observation is given at this time.
\[\begin{split}\tilde{S}(t+1) &= \frac{1}{\gamma} \tilde{S}(t), \\ \tilde{F}(t+1) &= \frac{1}{\gamma} \tilde{F}(t).\end{split}\]
-
__module__
= 'Policies.Posterior.DiscountedBeta'¶
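Illustrative use of this discounted posterior on a non-stationary Bernoulli stream (a sketch, same SMPyBandits import assumption as above):
from SMPyBandits.Policies.Posterior.DiscountedBeta import DiscountedBeta

post = DiscountedBeta(gamma=0.95)
for r in [1, 1, 1, 1, 0, 0, 0, 0]:   # the arm gets worse halfway through
    post.update(r)                    # recent observations weigh more than old ones
print(post.sample())                  # a sample from the discounted Beta posterior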
-
Policies.Posterior.DiscountedBeta.
betavariate
()¶ beta(a, b, size=None)
Draw samples from a Beta distribution.
The Beta distribution is a special case of the Dirichlet distribution, and is related to the Gamma distribution. It has the probability distribution function
\[f(x; a,b) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1},\]where the normalization, B, is the beta function,
\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt.\]It is often seen in Bayesian inference and order statistics.
Note
New code should use the
beta
method of adefault_rng()
instance instead; please see the Quick Start.- a : float or array_like of floats
- Alpha, positive (>0).
- b : float or array_like of floats
- Beta, positive (>0).
- size : int or tuple of ints, optional
- Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. If size isNone
(default), a single value is returned ifa
andb
are both scalars. Otherwise,np.broadcast(a, b).size
samples are drawn.
- out : ndarray or scalar
- Drawn samples from the parameterized beta distribution.
Generator.beta: which should be used for new code.
Manipulate a Gamma posterior. No need for tricks to handle non-binary rewards.
- See https://en.wikipedia.org/wiki/Gamma_distribution#Conjugate_prior
- And https://en.wikipedia.org/wiki/Conjugate_prior#Continuous_distributions
-
class
Policies.Posterior.Gamma.
Gamma
(k=1, lmbda=1)[source]¶ Bases:
Policies.Posterior.Posterior.Posterior
Manipulate a Gamma posterior.
-
__init__
(k=1, lmbda=1)[source]¶ Create a Gamma posterior, \(\Gamma(k, \lambda)\), with \(k=1\) and \(\lambda=1\) by default.
-
k
= None¶ Parameter \(k\)
-
lmbda
= None¶ Parameter \(\lambda\)
-
reset
(k=None, lmbda=None)[source]¶ Reset k and lmbda, both to 1 as when creating a new default Gamma.
-
sample
()[source]¶ Get a random sample from the Beta posterior (using
numpy.random.gammavariate()
).- Used only by
Thompson
Sampling andAdBandits
so far.
- Used only by
-
quantile
(p)[source]¶ Return the p quantile of the Gamma posterior (using
scipy.stats.gdtrix()
).- Used only by
BayesUCB
andAdBandits
so far.
- Used only by
-
update
(obs)[source]¶ Add an observation: increase k by k0, and lmbda by obs (do not have to be normalized).
-
__module__
= 'Policies.Posterior.Gamma'¶
-
-
Policies.Posterior.Gamma.
gammavariate
()¶ gamma(shape, scale=1.0, size=None)
Draw samples from a Gamma distribution.
Samples are drawn from a Gamma distribution with specified parameters, shape (sometimes designated “k”) and scale (sometimes designated “theta”), where both parameters are > 0.
Note
New code should use the
gamma
method of adefault_rng()
instance instead; please see the Quick Start.- shape : float or array_like of floats
- The shape of the gamma distribution. Must be non-negative.
- scale : float or array_like of floats, optional
- The scale of the gamma distribution. Must be non-negative. Default is equal to 1.
- size : int or tuple of ints, optional
- Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. If size isNone
(default), a single value is returned ifshape
andscale
are both scalars. Otherwise,np.broadcast(shape, scale).size
samples are drawn.
- out : ndarray or scalar
- Drawn samples from the parameterized gamma distribution.
- scipy.stats.gamma : probability density function, distribution or
- cumulative density function, etc.
Generator.gamma: which should be used for new code.
The probability density for the Gamma distribution is
\[p(x) = x^{k-1}\frac{e^{-x/\theta}}{\theta^k\Gamma(k)},\]where \(k\) is the shape and \(\theta\) the scale, and \(\Gamma\) is the Gamma function.
The Gamma distribution is often used to model the times to failure of electronic components, and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.
[1] Weisstein, Eric W. “Gamma Distribution.” From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/GammaDistribution.html
[2] Wikipedia, “Gamma distribution”, https://en.wikipedia.org/wiki/Gamma_distribution
Draw samples from the distribution:
>>> shape, scale = 2., 2.  # mean=4, std=2*sqrt(2)
>>> s = np.random.gamma(shape, scale, 1000)
Display the histogram of the samples, along with the probability density function:
>>> import matplotlib.pyplot as plt
>>> import scipy.special as sps  # doctest: +SKIP
>>> count, bins, ignored = plt.hist(s, 50, density=True)
>>> y = bins**(shape-1)*(np.exp(-bins/scale) /  # doctest: +SKIP
...      (sps.gamma(shape)*scale**shape))
>>> plt.plot(bins, y, linewidth=2, color='r')  # doctest: +SKIP
>>> plt.show()
Manipulate a posterior of Gaussian experiments, which happens to also be a Gaussian distribution if the prior is Gaussian. Easy peasy!
Warning
TODO I have to test it!
- Reference: [[Further optimal regret bounds for Thompson sampling, S. Agrawal and N. Goyal, In Artificial Intelligence and Statistics, pages 99–107, 2013.](http://proceedings.mlr.press/v31/agrawal13a.pdf)]
-
class
Policies.Posterior.Gauss.
Gauss
(mu=0.0)[source]¶ Bases:
Policies.Posterior.Posterior.Posterior
Manipulate a posterior of Gaussian experiments, which happens to also be a Gaussian distribution if the prior is Gaussian.
The posterior distribution is a \(\mathcal{N}(\hat{\mu_k}(t), \hat{\sigma_k}^2(t))\), where
\[\begin{split}\hat{\mu_k}(t) &= \frac{X_k(t)}{N_k(t)}, \\ \hat{\sigma_k}^2(t) &= \frac{1}{N_k(t)}.\end{split}\]Warning
This works only for prior with a variance \(\sigma^2=1\) !
-
__init__
(mu=0.0)[source]¶ Create a posterior assuming the prior is \(\mathcal{N}(\mu, 1)\).
- The prior is centered (\(\mu=0\)) by default, but parameter
mu
can be used to change this default.
- The prior is centered (\(\mu=1\)) by default, but parameter
-
mu
= None¶ Parameter \(\mu\) of the posterior
-
sigma
= None¶ The parameter \(\sigma\) of the posterior
-
reset
(mu=None)[source]¶ Reset the parameters \(\mu, \sigma\), as when creating a new Gauss posterior.
-
sample
()[source]¶ Get a random sample \((x, \sigma^2)\) from the Gaussian posterior (using
scipy.stats.invgamma()
for the variance \(\sigma^2\) parameter andnumpy.random.normal()
for the mean \(x\)).- Used only by
Thompson
Sampling andAdBandits
so far.
- Used only by
-
quantile
(p)[source]¶ Return the p-quantile of the Gauss posterior.
Note
It now works fine with
Policies.BayesUCB
with Gauss posteriors, even if it is MUCH SLOWER than the Bernoulli posterior (Gamma
).
-
update
(obs)[source]¶ Add an observation \(x\) or a vector of observations, assumed to be drawn from an unknown normal distribution.
-
__module__
= 'Policies.Posterior.Gauss'¶
-
-
Policies.Posterior.Gauss.
normalvariate
()¶ normal(loc=0.0, scale=1.0, size=None)
Draw random samples from a normal (Gaussian) distribution.
The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently [2], is often called the bell curve because of its characteristic shape (see the example below).
The normal distributions occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution [2].
Note
New code should use the
normal
method of adefault_rng()
instance instead; please see the Quick Start.- loc : float or array_like of floats
- Mean (“centre”) of the distribution.
- scale : float or array_like of floats
- Standard deviation (spread or “width”) of the distribution. Must be non-negative.
- size : int or tuple of ints, optional
- Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. If size isNone
(default), a single value is returned ifloc
andscale
are both scalars. Otherwise,np.broadcast(loc, scale).size
samples are drawn.
- out : ndarray or scalar
- Drawn samples from the parameterized normal distribution.
- scipy.stats.norm : probability density function, distribution or
- cumulative density function, etc.
Generator.normal: which should be used for new code.
The probability density for the Gaussian distribution is
\[p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },\]where \(\mu\) is the mean and \(\sigma\) the standard deviation. The square of the standard deviation, \(\sigma^2\), is called the variance.
The function has its peak at the mean, and its “spread” increases with the standard deviation (the function reaches 0.607 times its maximum at \(x + \sigma\) and \(x - \sigma\) [2]). This implies that normal is more likely to return samples lying close to the mean, rather than those far away.
[1] Wikipedia, “Normal distribution”, https://en.wikipedia.org/wiki/Normal_distribution
[2] (1, 2, 3) P. R. Peebles Jr., “Central Limit Theorem” in “Probability, Random Variables and Random Signal Principles”, 4th ed., 2001, pp. 51, 51, 125.
Draw samples from the distribution:
>>> mu, sigma = 0, 0.1  # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)
Verify the mean and the variance:
>>> abs(mu - np.mean(s))
0.0  # may vary
>>> abs(sigma - np.std(s, ddof=1))
0.1  # may vary
Display the histogram of the samples, along with the probability density function:
>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...          np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()
Two-by-four array of samples from N(3, 6.25):
>>> np.random.normal(3, 2.5, size=(2, 4))
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random
Base class for a posterior. Cf. http://chercheurs.lille.inria.fr/ekaufman/NIPS13 Fig.1 for a list of posteriors.
-
class
Policies.Posterior.Posterior.
Posterior
(*args, **kwargs)[source]¶ Bases:
object
Manipulate posteriors experiments.
-
__dict__
= mappingproxy({'__module__': 'Policies.Posterior.Posterior', '__doc__': ' Manipulate posteriors experiments.', '__init__': <function Posterior.__init__>, 'reset': <function Posterior.reset>, 'sample': <function Posterior.sample>, 'quantile': <function Posterior.quantile>, 'mean': <function Posterior.mean>, 'forget': <function Posterior.forget>, 'update': <function Posterior.update>, '__dict__': <attribute '__dict__' of 'Posterior' objects>, '__weakref__': <attribute '__weakref__' of 'Posterior' objects>})¶
-
__module__
= 'Policies.Posterior.Posterior'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Simply defines a function with_proba()
that is used everywhere.
-
Policies.Posterior.with_proba.
with_proba
(epsilon)[source]¶ Bernoulli test, with probability \(\varepsilon\), return True, and with probability \(1 - \varepsilon\), return False.
Example:
>>> from random import seed; seed(0)  # reproducible
>>> with_proba(0.5)
False
>>> with_proba(0.9)
True
>>> with_proba(0.1)
False
>>> if with_proba(0.2):
...     print("This happens 20% of the time.")
-
Policies.Posterior.with_proba.
random
() → x in the interval [0, 1).¶
Submodules¶
Policies.AdBandits module¶
The AdBandits bandit algorithm, mixing Thompson Sampling and BayesUCB.
- Reference: [AdBandit: A New Algorithm For Multi-Armed Bandits, F.S.Truzzi, V.F.da Silva, A.H.R.Costa, F.G.Cozman](http://sites.poli.usp.br/p/fabio.cozman/Publications/Article/truzzi-silva-costa-cozman-eniac2013.pdf)
- Code inspired from: https://github.com/flaviotruzzi/AdBandits/
Warning
This policy is not very famous, but for stochastic bandits it usually works VERY WELL! It is not anytime, though.
-
class
Policies.AdBandits.
AdBandits
(nbArms, horizon=1000, alpha=1, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The AdBandits bandit algorithm, mixing Thompson Sampling and BayesUCB.
- Reference: [AdBandit: A New Algorithm For Multi-Armed Bandits, F.S.Truzzi, V.F.da Silva, A.H.R.Costa, F.G.Cozman](http://sites.poli.usp.br/p/fabio.cozman/Publications/Article/truzzi-silva-costa-cozman-eniac2013.pdf)
- Code inspired from: https://github.com/flaviotruzzi/AdBandits/
Warning
This policy is not very famous, but for stochastic bandits it usually works VERY WELL! It is not anytime, though.
-
__init__
(nbArms, horizon=1000, alpha=1, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0)[source]¶ New policy.
-
alpha
= None¶ Parameter alpha
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment. Default value is 1000.
-
posterior
= None¶ Posterior for each arm. List instead of dict, quicker access
-
epsilon
¶ Time variating parameter \(\varepsilon(t)\).
-
choice
()[source]¶ With probability \(1 - \varepsilon(t)\), use a Thompson Sampling step, otherwise use a UCB-Bayes step, to choose one arm.
-
choiceWithRank
(rank=1)[source]¶ With probability \(1 - \varepsilon(t)\), use a Thompson Sampling step, otherwise use a UCB-Bayes step, to choose one arm of a certain rank.
-
__module__
= 'Policies.AdBandits'¶
-
Policies.AdBandits.
random
() → x in the interval [0, 1).¶
Policies.AdSwitch module¶
The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)
- It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
This implementation is still experimental!
-
class
Policies.AdSwitch.
Phase
¶ Bases:
enum.Enum
Different phases during the AdSwitch algorithm
-
Checking
= 2¶
-
Estimation
= 1¶
-
Exploitation
= 3¶
-
__module__
= 'Policies.AdSwitch'¶
-
-
Policies.AdSwitch.
mymean
(x)[source]¶ Simply
numpy.mean()
on x if x is non empty, otherwise0.0
.>>> np.mean([]) /usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
-
Policies.AdSwitch.
Constant_C1
= 1.0¶ Default value for the constant \(C_1\). Should be \(>0\) and as large as possible, but not too large.
-
Policies.AdSwitch.
Constant_C2
= 1.0¶ Default value for the constant \(C_2\). Should be \(>0\) and as large as possible, but not too large.
-
class
Policies.AdSwitch.
AdSwitch
(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)
-
horizon
= None¶ Parameter \(T\) for the AdSwitch algorithm, the horizon of the experiment. TODO try to use
DoublingTrickWrapper
to remove the dependency in \(T\) ?
-
C1
= None¶ Parameter \(C_1\) for the AdSwitch algorithm.
-
C2
= None¶ Parameter \(C_2\) for the AdSwitch algorithm.
-
phase
= None¶ Current phase, exploration or exploitation.
-
current_exploration_arm
= None¶ Currently explored arm. It cycles uniformly, in step 2.
-
current_exploitation_arm
= None¶ Currently exploited arm. It is \(\overline{a_k}\) in the algorithm.
-
batch_number
= None¶ Number of batches
-
last_restart_time
= None¶ Time step of the last restart (beginning of phase of Estimation)
-
length_of_current_phase
= None¶ Length of the current test phase, computed as \(s_i\), with
compute_di_pi_si()
.
-
step_of_current_phase
= None¶ Timer inside the current phase.
-
current_best_arm
= None¶ Current best arm, when finishing step 3. Denoted \(\overline{a_k}\) in the algorithm.
-
current_worst_arm
= None¶ Current worst arm, when finishing step 3. Denoted \(\underline{a_k}\) in the algorithm.
-
current_estimated_gap
= None¶ Gap between the current best and worst arms, i.e., the largest gap, when finishing step 3. Denoted \(\widehat{\Delta_k}\) in the algorithm.
-
last_used_di_pi_si
= None¶ Memory of the currently used \((d_i, p_i, s_i)\).
-
all_rewards
= None¶ Memory of all the rewards. A dictionary per arm, mapping time to rewards. Growing size until restart of that arm!
-
read_range_of_rewards
(arm, start, end)[source]¶ Read the
all_rewards
attribute to extract all the rewards for thatarm
, obtained between timestart
(included) andend
(not included).
-
statistical_test
(t, t0)[source]¶ Test if at time \(t\) there is a \(\sigma\), \(t_0 \leq \sigma < t\), and a pair of arms \(a,b\), satisfying this test:
\[| \hat{\mu_a}[\sigma,t] - \hat{\mu_b}[\sigma,t] | > \sqrt{\frac{C_1 \log T}{t - \sigma}}.\]where \(\hat{\mu_a}[t_1,t_2]\) is the empirical mean for arm \(a\) for samples obtained from times \(t \in [t_1,t_2)\).
- Return True, sigma if the test was satisfied, with the smallest \(\sigma\) satisfying the test, or False, None otherwise (see the sketch below).
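A hedged sketch of this test, assuming a hypothetical helper empirical_mean(arm, s, t) returning \(\hat{\mu}_a[s,t]\) (this is not the package's implementation):
import math
import itertools
def statistical_test_sketch(t, t0, nbArms, horizon, C1, empirical_mean):
    # scan all sigma in [t0, t) and all pairs of arms, return (True, sigma) for the smallest
    # sigma such that |mu_a[sigma,t] - mu_b[sigma,t]| > sqrt(C1 * log(T) / (t - sigma))
    for sigma in range(t0, t):
        threshold = math.sqrt(C1 * math.log(horizon) / (t - sigma))
        for a, b in itertools.combinations(range(nbArms), 2):
            if abs(empirical_mean(a, sigma, t) - empirical_mean(b, sigma, t)) > threshold:
                return True, sigma
    return False, None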
-
find_Ik
()[source]¶ Follow the algorithm and, with a gap estimate \(\widehat{\Delta_k}\), find \(I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}\), where \(d_i := 2^{-i}\). There is no need to do an exhaustive search:
\[I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.\]
-
__module__
= 'Policies.AdSwitch'¶
-
Policies.AdSwitchNew module¶
The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)
- It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
This implementation is still experimental!
-
Policies.AdSwitchNew.
mymean
(x)[source]¶ Simply
numpy.mean()
on x if x is non-empty, otherwise 0.0 (this avoids the RuntimeWarning: Mean of empty slice raised by np.mean([]) on an empty input).
-
Policies.AdSwitchNew.
Constant_C1
= 16.1¶ Default value for the constant \(C_1\). Should be \(>0\) and as large as possible, but not too large. In their paper (section 4.2, page 8), inequality (5) controls \(C_1\): for all \(s', t'\), \(C_1 > 8(2n-1)/n\) where \(n = n_{[s',t']}\), hence taking \(C_1 > 16\) is sufficient.
-
Policies.AdSwitchNew.
DELTA_T
= 50¶ A small trick to speed-up the computations, the checks for changes of good/bad arms are going to have a step
DELTA_T
.
-
Policies.AdSwitchNew.
DELTA_S
= 20¶ A small trick to speed-up the computations, the loops on \(s_1\), \(s_2\) and \(s\) are going to have a step
DELTA_S
.
-
class
Policies.AdSwitchNew.
AdSwitchNew
(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The AdSwitchNew policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner, 2019]](http://proceedings.mlr.press/v99/auer19a/auer19a.pdf)
-
__init__
(nbArms, horizon=None, C1=16.1, delta_s=20, delta_t=50, *args, **kwargs)[source]¶ New policy.
-
horizon
= None¶ Parameter \(T\) for the AdSwitchNew algorithm, the horizon of the experiment. TODO try to use
DoublingTrickWrapper
to remove the dependency on \(T\)?
-
C1
= None¶ Parameter \(C_1\) for the AdSwitchNew algorithm.
-
delta_s
= None¶ Parameter \(\delta_s\) for the AdSwitchNew algorithm.
-
delta_t
= None¶ Parameter \(\delta_t\) for the AdSwitchNew algorithm.
-
ell
= None¶ Variable \(\ell\) in the algorithm. Counts the number of new episodes.
-
start_of_episode
= None¶ Variable \(t_l\) in the algorithm. Stores the starting time of the current episode.
-
set_GOOD
= None¶ Variable \(\mathrm{GOOD}_t\) in the algorithm. Set of “good” arms at current time.
-
set_BAD
= None¶ Variable \(\mathrm{BAD}_t\) in the algorithm. Set of “bad” arms at current time. It always satisfies \(\mathrm{BAD}_t = \{1,\dots,K\} \setminus \mathrm{GOOD}_t\).
-
set_S
= None¶ Variable \(S_t\) in the algorithm. A list of sets of sampling obligations of arm \(a\) at current time.
-
mu_tilde_of_l
= None¶ Vector of variables \(\tilde{\mu}_{\ell}(a)\) in the algorithm. Stores the empirical average of arm \(a\).
-
gap_Delta_tilde_of_l
= None¶ Vector of variables \(\tilde{\Delta}_{\ell}(a)\) in the algorithm. Stores the estimate of the gap of arm \(a\) against the best of the “good” arms.
-
all_rewards
= None¶ Memory of all the rewards. A dictionary per arm, mapping time to rewards. Growing size until restart of that arm!
-
history_of_plays
= None¶ Memory of all the past actions played!
-
check_changes_good_arms
()[source]¶ Check for changes of good arms.
- I moved this into a function, in order to stop the 4 for loops (
good_arm
,s_1
,s_2
,s
) as soon as a change was detected (early stopping). - TODO this takes a crazy O(K t^3) time, it HAS to be done faster!
-
check_changes_bad_arms
()[source]¶ Check for changes of bad arms, in O(K t).
- I moved this into a function, in order to stop the 2 for loops (
good_arm
,s
) as soon as a change was detected (early stopping).
-
n_s_t
(arm, s, t)[source]¶ Compute \(n_{[s,t]}(a) := \#\{\tau : s \leq \tau \leq t, a_{\tau} = a \}\), naively by using the dictionary of all plays
all_rewards
.
-
mu_hat_s_t
(arm, s, t)[source]¶ Compute \(\hat{\mu}_{[s,t]}(a) := \frac{1}{n_{[s,t]}(a)} \sum_{\tau : s \leq \tau \leq t, a_{\tau} = a} r_\tau\), naively by using the dictionary of all plays
all_rewards
.
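A minimal sketch of these two naive computations, assuming all_rewards is a dictionary mapping each arm to a dictionary {time: reward} (illustrative only, not the module's exact code):
def n_s_t(all_rewards, arm, s, t):
    # number of plays of `arm` with s <= tau <= t
    return sum(1 for tau in all_rewards[arm] if s <= tau <= t)
def mu_hat_s_t(all_rewards, arm, s, t):
    # empirical mean of `arm` on [s, t], or 0.0 if it was never played in that window
    values = [r for tau, r in all_rewards[arm].items() if s <= tau <= t]
    return sum(values) / len(values) if values else 0.0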
-
__module__
= 'Policies.AdSwitchNew'¶
-
Policies.Aggregator module¶
My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.
The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).
- At every step, the prediction of every slave is gathered, and a vote is done to decide A’s decision.
- The vote is simply a majority vote, weighted by a trust probability. If \(A_i\) decides arm \(I_i\), then the probability of selecting \(k\) is the sum of trust probabilities, \(P_i\), of every \(A_i\) for which \(I_i = k\).
- The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm \(k\) (the reward), the trust in each slave \(A_i\) is updated: \(P_i\) increases if \(A_i\) advised \(k\) (\(I_i = k\)), or decreases if \(A_i\) advised another arm.
- The detail about how to increase or decrease the probabilities are specified below.
- Reference: [[Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017]](https://hal.inria.fr/hal-01705292)
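As an illustration of the trust-weighted vote described above, a minimal sketch of one decision of the master (children, trusts and nbArms being attributes of the master; this is not the actual implementation):
import numpy as np
def aggregator_choice_sketch(children, trusts, nbArms):
    # gather the advised arm of each slave, then sample an arm with probability
    # equal to the total trust of the slaves that advised it
    choices = [child.choice() for child in children]
    proba = np.zeros(nbArms)
    for trust, arm in zip(trusts, choices):
        proba[arm] += trust
    proba /= proba.sum()  # renormalize (the trusts already sum to 1)
    return np.random.choice(nbArms, p=proba)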
Note
Why call it Aggregator? Because this algorithm is an efficient aggregation algorithm, and like The Terminator, it beats its opponents with an iron fist! (OK, that’s a stupid joke but a cool name, thanks Emilie!)

Note
I wanted to call it Aggragorn. Because this algorithm is like Aragorn the ranger, it starts like a simple bandit, but soon it will become king!!

-
Policies.Aggregator.
UNBIASED
= True¶ A flag to know if the rewards are used as biased estimator, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).
-
Policies.Aggregator.
UPDATE_LIKE_EXP4
= False¶ Flag to know if we should update the trust probabilities like in Exp4 or like in my initial Aggregator proposal
- First choice: like Exp4, trusts are fully recomputed,
trusts^(t+1) = exp(rate_t * estimated mean rewards up to time t)
, - Second choice: my proposal, trusts are just updated multiplicatively,
trusts^(t+1) <-- trusts^t * exp(rate_t * estimate instant reward at time t)
.
Both choices seem fine, and anyway the trusts are renormalized to be a probability distribution, so it doesn’t matter much.
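A minimal sketch of these two updates (names are illustrative, not the actual implementation):
import numpy as np
def update_trusts_sketch(trusts, rate, reward_estimates, update_like_exp4=False):
    # Exp4-like: recompute the trusts from the cumulated reward estimates;
    # otherwise: multiplicative update with the instant reward estimates
    if update_like_exp4:
        new_trusts = np.exp(rate * reward_estimates)
    else:
        new_trusts = trusts * np.exp(rate * reward_estimates)
    return new_trusts / np.sum(new_trusts)  # renormalize to a probability distribution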
-
Policies.Aggregator.
USE_LOSSES
= False¶ Non parametric flag to know if the Exp4-like update uses losses or rewards. Losses are
1 - reward
, in which case therate_t
is negative.
-
Policies.Aggregator.
UPDATE_ALL_CHILDREN
= False¶ Should all trusts be updated, or only the trusts of slaves Ai who advised the decision
Aggregator[A1..AN]
followed.
-
class
Policies.Aggregator.
Aggregator
(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]¶ Bases:
Policies.BasePolicy.BasePolicy
My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.
-
__init__
(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]¶ New policy.
-
nbArms
= None¶ Number of arms
-
lower
= None¶ Lower values for rewards
-
amplitude
= None¶ Amplitude of the rewards
-
unbiased
= None¶ Flag, see above.
-
horizon
= None¶ Horizon T, if given and not None, can be used to compute a “good” constant learning rate, \(\sqrt{\frac{2 \log(N)}{T K}}\) for N slaves, K arms (heuristic).
-
extra_str
= None¶ A string to add at the end of the
str(self)
, to specify which algorithms are aggregated for instance.
-
update_all_children
= None¶ Flag, see above.
-
nbChildren
= None¶ Number N of slave algorithms.
-
t
= None¶ Internal time
-
update_like_exp4
= None¶ Flag, see above.
-
learningRate
= None¶ Value of the learning rate (can be decreasing in time)
-
decreaseRate
= None¶ Value of the constant used in the decreasing of the learning rate
-
children
= None¶ List of slave algorithms.
-
trusts
= None¶ Initial trusts in the slaves. Default to uniform, but a prior can also be given.
-
choices
= None¶ Keep track of the last choices of each slave, to know whom to update if update_all_children is false.
-
children_cumulated_losses
= None¶ Keep track of the cumulated loss (empirical mean)
-
index
= None¶ Numerical index for each arm
-
rate
¶ Learning rate, can be constant if self.decreaseRate is None, or decreasing.
- if horizon is known, use the formula which uses it,
- if horizon is not known, use the formula which uses current time \(t\),
- else, if decreaseRate is a number, use an exponentially decreasing learning rate,
rate = learningRate * exp(- t / decreaseRate)
. This last option usually performs badly.
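For illustration only, a sketch of these three cases (the precedence of the cases in the actual implementation may differ):
import numpy as np
def rate_sketch(t, nbArms, nbChildren, learningRate=None, decreaseRate=None, horizon=None):
    if decreaseRate is None and horizon is not None:
        return np.sqrt(2 * np.log(nbChildren) / (horizon * nbArms))    # constant rate, uses T
    elif decreaseRate is None:
        return np.sqrt(2 * np.log(nbChildren) / (max(t, 1) * nbArms))  # uses the current time t
    else:
        return learningRate * np.exp(- t / decreaseRate)               # exponential decay (bad)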
-
getReward
(arm, reward)[source]¶ Give reward for each child, and then update the trust probabilities.
-
_makeChildrenChoose
()[source]¶ Convenience method to make every child choose its best arm, and store their decisions in
self.choices
.
-
choice
()[source]¶ Make each child vote, then sample the decision by importance sampling on their votes with the trust probabilities.
-
choiceWithRank
(rank=1)[source]¶ Make each child vote, with rank, then sample the decision by importance sampling on their votes with the trust probabilities.
-
choiceFromSubSet
(availableArms='all')[source]¶ Make each child vote, on subsets of arms, then sample the decision by importance sampling on their votes with the trust probabilities.
-
__module__
= 'Policies.Aggregator'¶
-
choiceMultiple
(nb=1)[source]¶ Make each child vote, multiple times, then sample the decision by importance sampling on their votes with the trust probabilities.
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Make each child vote, multiple times (with IMP scheme), then sample the decision by importance sampling on their votes with the trust probabilities.
-
estimatedOrder
()[source]¶ Make each child vote for its estimated order of the arms, then randomly select an ordering by importance sampling with the trust probabilities. Return the estimated order of the arms, as a permutation on
[0..K-1]
that would order the arms by increasing means.
-
estimatedBestArms
(M=1)[source]¶ Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.
-
computeIndex
(arm)[source]¶ Compute the current index of arm ‘arm’, by computing all the indexes of the children policies, and computing a convex combination using the trust probabilities.
-
Policies.ApproximatedFHGittins module¶
The approximated Finite-Horizon Gittins index policy for bounded bandits.
- This is not the computationally costly Gittins index, but a simple approximation, using the knowledge of the horizon T.
- Reference: [Lattimore - COLT, 2016](http://www.jmlr.org/proceedings/papers/v49/lattimore16.pdf), and [his COLT presentation](https://youtu.be/p8AwKiudhZ4?t=276)
-
Policies.ApproximatedFHGittins.
ALPHA
= 0.5¶ Default value for the parameter \(\alpha > 0\) for ApproximatedFHGittins.
-
Policies.ApproximatedFHGittins.
DISTORTION_HORIZON
= 1.01¶ Default value for the parameter \(\tau \geq 1\) that is used to artificially increase the horizon, from \(T\) to \(\tau T\).
-
class
Policies.ApproximatedFHGittins.
ApproximatedFHGittins
(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The approximated Finite-Horizon Gittins index policy for bounded bandits.
- This is not the computationally costly Gittins index, but a simple approximation, using the knowledge of the horizon T.
- Reference: [Lattimore - COLT, 2016](http://www.jmlr.org/proceedings/papers/v49/lattimore16.pdf), and [his COLT presentation](https://youtu.be/p8AwKiudhZ4?t=276)
-
__init__
(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
alpha
= None¶ Parameter \(\alpha > 0\).
-
distortion_horizon
= None¶ Parameter \(\tau > 0\).
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
m
¶ \(m = T - t + 1\) is the number of steps to be played until end of the game.
Note
The article does not explain how to deal with unknown horizon, but eventually if \(T\) is wrong, this m becomes negative. Empirically, I force it to be \(\geq 1\), to not mess up with the \(\log(m)\) used below, by using \(\tau T\) instead of \(T\) (e.g., \(\tau = 1.01\) is enough to not ruin the performance in the last steps of the experiment).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \log\left( \frac{m}{N_k(t) \log^{1/2}\left( \frac{m}{N_k(t)} \right)} \right)}, \\ \text{where}\;\; & m = T - t + 1.\end{split}\]Note
This \(\log^{1/2}(\dots) = \sqrt{\log(\dots)}\) term can be undefined, as soon as \(m < N_k(t)\), so empirically, \(\sqrt{\max(0, \log(\dots))}\) is used instead, or a larger horizon can be used to make \(m\) artificially larger (e.g., \(T' = 1.1 T\)).
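A sketch of this index with the two empirical safeguards just mentioned (clipping the inner logarithm and using \(m = \tau T - t + 1 \geq 1\)); names and guards are illustrative, not the package's exact code:
import numpy as np
def approx_fh_gittins_index_sketch(X_k, N_k, t, horizon, alpha=0.5, distortion_horizon=1.01):
    m = max(1.0, distortion_horizon * horizon - t + 1)   # m = tau*T - t + 1, clipped to >= 1
    inner_log = max(1e-12, np.log(m / N_k))              # guard for log^{1/2}(m / N_k)
    argument = m / (N_k * np.sqrt(inner_log))
    exploration = np.sqrt((2.0 * alpha / N_k) * max(0.0, np.log(argument)))
    return X_k / N_k + exploration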
-
__module__
= 'Policies.ApproximatedFHGittins'¶
Policies.BESA module¶
The Best Empirical Sampled Average (BESA) algorithm.
- Reference: [[Sub-Sampling For Multi Armed Bandits, Baransi et al., 2014]](https://hal.archives-ouvertes.fr/hal-01025651)
- See also: https://github.com/SMPyBandits/SMPyBandits/issues/103 and https://github.com/SMPyBandits/SMPyBandits/issues/116
Warning
This algorithm works VERY well but it looks weird at first sight. It sounds “too easy”, so take a look at the article before wondering why it should work.
Warning
Right now, it is between 10 and 25 times slower than Policies.klUCB
and other single-player policies.
-
Policies.BESA.
subsample_deterministic
(n, m)[source]¶ Returns \(\{1,\dots,n\}\) if \(n < m\) or \(\{1,\dots,m\}\) if \(n \geq m\) (ie, it is \(\{1,\dots,\min(n,m)\}\)).
Warning
The BESA algorithm is efficient only with the random sub-sampling, don’t use this one except for comparing.
>>> subsample_deterministic(5, 3) # doctest: +ELLIPSIS array([0, 1, 2, 3]) >>> subsample_deterministic(10, 20) # doctest: +ELLIPSIS array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
-
Policies.BESA.
subsample_uniform
(n, m)[source]¶ Returns a uniform sub-set of size \(n\), from \(\{1,\dots,m\}\).
- Fails if n > m.
Note
The BESA algorithm is efficient only with the random sub-sampling.
>>> np.random.seed(1234) # reproducible results >>> subsample_uniform(3, 5) # doctest: +ELLIPSIS array([4, 0, 1]) >>> subsample_uniform(10, 20) # doctest: +ELLIPSIS array([ 7, 16, 2, 3, 1, 18, 5, 4, 0, 8])
-
Policies.BESA.
TOLERANCE
= 1e-06¶ Numerical tolerance when comparing two means. Should not be zero!
-
Policies.BESA.
inverse_permutation
(permutation, j)[source]¶ Invert the permutation for a given input j, that is, find i such that p[i] = j.
>>> permutation = [1, 0, 3, 2] >>> inverse_permutation(permutation, 1) 0 >>> inverse_permutation(permutation, 0) 1
-
Policies.BESA.
besa_two_actions
(rewards, pulls, a, b, subsample_function=<function subsample_uniform>)[source]¶ Core algorithm for the BESA selection, for two actions a and b:
- N = min(Na, Nb),
- Sub-sample N values from rewards of arm a, and N values from rewards of arm b,
- Compute mean of both samples of size N, call them m_a, m_b,
- If m_a > m_b, choose a,
- Else if m_a < m_b, choose b,
- And in case of a tie, break by choosing i such that Ni is minimal (or random [a, b] if Na=Nb).
Note
rewards
can be a numpy array of shape (at least) (nbArms, max(Na, Nb)), or a dictionary mapping a, b to lists (or iterators) of lengths >= max(Na, Nb)
.>>> np.random.seed(2345) # reproducible results >>> pulls = [6, 10]; K = len(pulls); N = max(pulls) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 1 is better # doctest: +ELLIPSIS array([0.154..., 0.158...]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 0 is better in the first 6 samples # doctest: +ELLIPSIS array([0.341..., 0.019...]) >>> besa_two_actions(rewards, pulls, 0, 1, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 0 >>> [besa_two_actions(rewards, pulls, 0, 1, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
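To make the steps above concrete, a rough sketch of the 2-arm comparison with uniform sub-sampling (assuming rewards is a numpy array of shape (nbArms, max(Na, Nb)); not the package's implementation):
import numpy as np
def besa_two_actions_sketch(rewards, pulls, a, b):
    N = min(pulls[a], pulls[b])
    sub_a = np.random.choice(rewards[a, :pulls[a]], size=N, replace=False)
    sub_b = np.random.choice(rewards[b, :pulls[b]], size=N, replace=False)
    m_a, m_b = np.mean(sub_a), np.mean(sub_b)
    if m_a > m_b:
        return a
    if m_b > m_a:
        return b
    # tie: pick the least sampled arm, at random if Na == Nb
    if pulls[a] != pulls[b]:
        return a if pulls[a] < pulls[b] else b
    return np.random.choice([a, b])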
-
Policies.BESA.
besa_K_actions__non_randomized
(rewards, pulls, left, right, subsample_function=<function subsample_uniform>, depth=0)[source]¶ BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).
- I prefer to implement for a discrete action set \(\{\text{left}, \dots, \text{right}\}\) (end included) instead of a generic
actions
vector, to speed up the code, but it is less readable. - The depth argument is just for pretty printing debugging information (useless).
Warning
The binary tournament is NOT RANDOMIZED here, this version is only for testing.
>>> np.random.seed(1234) # reproducible results >>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 0 is better array([ 0.09876921, -0.18561207, 0.04463033, 0.0653539 ]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 1 is better in the first 6 samples array([-0.06401484, 0.17366346, 0.05323033, -0.09514708]) >>> besa_K_actions__non_randomized(rewards, pulls, 0, K-1, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 3 >>> [besa_K_actions__non_randomized(rewards, pulls, 0, K-1, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
-
Policies.BESA.
besa_K_actions__smart_divideandconquer
(rewards, pulls, left, right, random_permutation_of_arm=None, subsample_function=<function subsample_uniform>, depth=0)[source]¶ BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).
- I prefer to implement for a discrete action set \(\{\text{left}, \dots, \text{right}\}\) (end included) instead of a generic
actions
vector, to speed up the code, but it is less readable. - The depth argument is just for pretty printing debugging information (useless).
Note
The binary tournament is RANDOMIZED here, as it should be.
>>> np.random.seed(1234) # reproducible results >>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 0 is better array([ 0.09876921, -0.18561207, 0.04463033, 0.0653539 ]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 1 is better in the first 6 samples array([-0.06401484, 0.17366346, 0.05323033, -0.09514708]) >>> besa_K_actions__smart_divideandconquer(rewards, pulls, 0, K-1, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 3 >>> [besa_K_actions__smart_divideandconquer(rewards, pulls, 0, K-1, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
-
Policies.BESA.
besa_K_actions
(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]¶ BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).
- The divide and conquer is implemented for a generic list of actions, it’s slower but simpler to write! Left and right divisions are just
actions[:len(actions)//2]
andactions[len(actions)//2:]
. - Actions is assumed to be shuffled before calling this function!
- The depth argument is just for pretty printing debugging information (useless).
Note
The binary tournament is RANDOMIZED here, as it should be.
>>> np.random.seed(1234) # reproducible results >>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls) >>> actions = np.arange(K) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 0 is better array([ 0.09876921, -0.18561207, 0.04463033, 0.0653539 ]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 1 is better in the first 6 samples array([-0.06401484, 0.17366346, 0.05323033, -0.09514708]) >>> besa_K_actions(rewards, pulls, actions, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 3 >>> [besa_K_actions(rewards, pulls, actions, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [3, 3, 2, 3, 3, 0, 0, 0, 2, 3]
-
Policies.BESA.
besa_K_actions__non_binary
(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]¶ BESA recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).
- Instead of doing these binary tree tournaments (which result in \(\mathcal{O}(K^2)\) calls to the 2-arm procedure), we can do a line tournament: 1 vs 2, winner vs 3, winner vs 4, etc., winner vs K-1 (which results in \(\mathcal{O}(K)\) calls),
- Actions is assumed to be shuffled before calling this function!
- The depth argument is just for pretty printing debugging information (useless).
>>> np.random.seed(1234) # reproducible results >>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls) >>> actions = np.arange(K) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 0 is better array([ 0.09876921, -0.18561207, 0.04463033, 0.0653539 ]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 1 is better in the first 6 samples array([-0.06401484, 0.17366346, 0.05323033, -0.09514708]) >>> besa_K_actions__non_binary(rewards, pulls, actions, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 3 >>> [besa_K_actions__non_binary(rewards, pulls, actions, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [3, 3, 3, 2, 0, 3, 3, 3, 3, 3]
-
Policies.BESA.
besa_K_actions__non_recursive
(rewards, pulls, actions, subsample_function=<function subsample_uniform>, depth=0)[source]¶ BESA non-recursive selection algorithm for an action set of size \(\mathcal{K} \geq 1\).
- No calls to
besa_two_actions()
, just generalize it to K actions instead of 2. - Actions is assumed to be shuffled before calling this function!
>>> np.random.seed(1234) # reproducible results >>> pulls = [5, 6, 7, 8]; K = len(pulls); N = max(pulls) >>> rewards = np.random.randn(K, N) >>> np.mean(rewards, axis=1) # arm 0 is better array([ 0.09876921, -0.18561207, 0.04463033, 0.0653539 ]) >>> np.mean(rewards[:, :min(pulls)], axis=1) # arm 1 is better in the first 6 samples array([-0.06401484, 0.17366346, 0.05323033, -0.09514708]) >>> besa_K_actions__non_recursive(rewards, pulls, None, subsample_function=subsample_deterministic) # doctest: +ELLIPSIS 3 >>> [besa_K_actions__non_recursive(rewards, pulls, None, subsample_function=subsample_uniform) for _ in range(10)] # doctest: +ELLIPSIS [1, 3, 0, 2, 2, 3, 1, 1, 3, 1]
-
class
Policies.BESA.
BESA
(nbArms, horizon=None, minPullsOfEachArm=1, randomized_tournament=True, random_subsample=True, non_binary=False, non_recursive=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The Best Empirical Sampled Average (BESA) algorithm.
- Reference: [[Sub-Sampling For Multi Armed Bandits, Baransi et al., 2014]](https://hal.inria.fr/hal-01025651)
Warning
The BESA algorithm requires to store all the history of rewards, so its memory usage for \(T\) rounds with \(K\) arms is \(\mathcal{O}(K T)\), which is huge for large \(T\), be careful! Aggregating different BESA instances is probably a bad idea because of this limitation!
-
__init__
(nbArms, horizon=None, minPullsOfEachArm=1, randomized_tournament=True, random_subsample=True, non_binary=False, non_recursive=False, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Just to know the memory to allocate for rewards. It could be implemented without knowing the horizon, by using lists to keep all the reward history, but this would be way slower!
-
minPullsOfEachArm
= None¶ Minimum number of pulls of each arm before using the BESA algorithm. Using 1 might not be the best choice
-
randomized_tournament
= None¶ Whether to use a deterministic or random tournament.
-
random_subsample
= None¶ Whether to use a deterministic or random sub-sampling procedure.
-
non_binary
= None¶ Whether to use
besa_K_actions()
orbesa_K_actions__non_binary()
for the selection of K arms.
-
non_recursive
= None¶ Whether to use
besa_K_actions()
orbesa_K_actions__non_recursive()
for the selection of K arms.
-
all_rewards
= None¶ Keep all rewards of each arms. It consumes a \(\mathcal{O}(K T)\) memory, that’s really bad!!
-
getReward
(arm, reward)[source]¶ Add the current reward in the global history.
Note
There is no need to normalize the reward in [0,1], that’s one of the strong points of the BESA algorithm.
-
choiceFromSubSet
(availableArms='all')[source]¶ Applies the BESA procedure with the current data history, to the restricted set of arms.
-
choiceMultiple
(nb=1)[source]¶ Applies the multiple-choice BESA procedure with the current data history:
- select a first arm with basic BESA procedure with full action set,
- remove it from the set of actions,
- restart step 1 with the new smaller set of actions, until nb arms were chosen by basic BESA.
Note
This was not studied or published before, and there are no theoretical results about it!
Warning
This is very inefficient! The BESA procedure is already quite slow (with my current naive implementation), this is crazily slow!
-
choiceWithRank
(rank=1)[source]¶ Applies the ranked BESA procedure with the current data history:
- use
choiceMultiple()
to selectrank
actions, - then take the
rank
-th chosen action (the last one).
Note
This was not studied or published before, and there are no theoretical results about it!
Warning
This is very inefficient! The BESA procedure is already quite slow (with my current naive implementation), this is crazily slow!
-
__module__
= 'Policies.BESA'¶
-
computeIndex
(arm)[source]¶ Compute the current index of arm ‘arm’.
Warning
This index is not the one used for the choice of arm (which uses sub-sampling). It’s just the empirical mean of the arm.
Policies.BasePolicy module¶
Base class for any policy.
- If rewards are not in [0, 1], be sure to give the lower value and the amplitude. Eg, if rewards are in [-3, 3], lower = -3, amplitude = 6.
-
Policies.BasePolicy.
CHECKBOUNDS
= False¶ If True, every time a reward is received, a warning message is displayed if it lies outside of
[lower, lower + amplitude]
.
-
class
Policies.BasePolicy.
BasePolicy
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
object
Base class for any policy.
-
nbArms
= None¶ Number of arms
-
lower
= None¶ Lower values for rewards
-
amplitude
= None¶ Amplitude of the rewards
-
t
= None¶ Internal time
-
pulls
= None¶ Number of pulls of each arm
-
rewards
= None¶ Cumulated rewards of each arm
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
__dict__
= mappingproxy({'__module__': 'Policies.BasePolicy', '__doc__': ' Base class for any policy.', '__init__': <function BasePolicy.__init__>, '__str__': <function BasePolicy.__str__>, 'startGame': <function BasePolicy.startGame>, 'getReward': <function BasePolicy.getReward>, 'choice': <function BasePolicy.choice>, 'choiceWithRank': <function BasePolicy.choiceWithRank>, 'choiceFromSubSet': <function BasePolicy.choiceFromSubSet>, 'choiceMultiple': <function BasePolicy.choiceMultiple>, 'choiceIMP': <function BasePolicy.choiceIMP>, 'estimatedOrder': <function BasePolicy.estimatedOrder>, '__dict__': <attribute '__dict__' of 'BasePolicy' objects>, '__weakref__': <attribute '__weakref__' of 'BasePolicy' objects>})¶
-
__module__
= 'Policies.BasePolicy'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
Policies.BaseWrapperPolicy module¶
Base class for any wrapper policy.
- It encapsulates another policy, and defers all method calls to the underlying policy.
- For instance, see
Policies.SparseWrapper
,Policies.DoublingTrickWrapper
orPolicies.SlidingWindowRestart
.
-
class
Policies.BaseWrapperPolicy.
BaseWrapperPolicy
(nbArms, policy=<class 'Policies.UCB.UCB'>, *args, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
Base class for any wrapper policy.
-
startGame
(createNewPolicy=True)[source]¶ Initialize the policy for a new game.
Warning
createNewPolicy=True
creates a new object for the underlying policy, while createNewPolicy=False only calls BasePolicy.startGame().
-
getReward
(arm, reward)[source]¶ Pass the reward, as usual, update t and sometimes restart the underlying policy.
-
index
¶ Get attribute
index
from the underlying policy.
-
choiceFromSubSet
(availableArms='all')[source]¶ Pass the call to
choiceFromSubSet
of the underlying policy.
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Pass the call to
choiceIMP
of the underlying policy.
-
__module__
= 'Policies.BaseWrapperPolicy'¶
-
Policies.BayesUCB module¶
The Bayes-UCB policy.
- By default, it uses a Beta posterior (
Policies.Posterior.Beta
), one by arm. - Reference: [Kaufmann, Cappé & Garivier - AISTATS, 2012]
-
class
Policies.BayesUCB.
BayesUCB
(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BayesianIndexPolicy.BayesianIndexPolicy
The Bayes-UCB policy.
- By default, it uses a Beta posterior (
Policies.Posterior.Beta
), one by arm.
- Reference: [Kaufmann, Cappé & Garivier - AISTATS, 2012].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by taking the \(1 - \frac{1}{t}\) quantile from the Beta posterior:
\[I_k(t) = \mathrm{Quantile}\left(\mathrm{Beta}(1 + S_k(t), 1 + N_k(t) - S_k(t)), 1 - \frac{1}{t}\right).\]
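For instance, with scipy this quantile can be computed as follows (a sketch, not the module's exact code, and assuming \(t \geq 2\)):
from scipy.stats import beta
def bayes_ucb_index_sketch(S_k, N_k, t):
    # quantile of order 1 - 1/t of the Beta(1 + S_k, 1 + N_k - S_k) posterior
    return beta.ppf(1.0 - 1.0 / t, 1 + S_k, 1 + N_k - S_k)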
-
__module__
= 'Policies.BayesUCB'¶
- By default, it uses a Beta posterior (
Policies.BayesianIndexPolicy module¶
Basic Bayesian index policy. By default, it uses a Beta posterior.
-
class
Policies.BayesianIndexPolicy.
BayesianIndexPolicy
(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
Basic Bayesian index policy.
- By default, it uses a Beta posterior (
Policies.Posterior.Beta
), one by arm. - Use
*args
and**kwargs
if you want to give parameters to the underlying posteriors. - Or use
params_for_each_posterior
as a list of parameters (as a dictionary) to give a different set of parameters for each posterior.
-
__init__
(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Create a new Bayesian policy, by creating a default posterior on each arm.
-
posterior
= None¶ Posterior for each arm. List instead of dict, quicker access
-
__module__
= 'Policies.BayesianIndexPolicy'¶
- By default, it uses a Beta posterior (
Policies.BoltzmannGumbel module¶
The Boltzmann-Gumbel Exploration (BGE) index policy, a different formulation of the Exp3
policy with an optimally tuned decreasing sequence of temperature parameters \(\gamma_t\).
- Reference: Section 4 of [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
- It is an index policy with indexes computed from the empirical mean estimators and a random sample from a Gumbel distribution.
-
Policies.BoltzmannGumbel.
SIGMA
= 1¶ Default constant \(\sigma\) assuming the arm distributions are \(\sigma^2\)-subgaussian. 1 for Bernoulli arms.
-
class
Policies.BoltzmannGumbel.
BoltzmannGumbel
(nbArms, C=1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The Boltzmann-Gumbel Exploration (BGE) index policy, a different formulation of the
Exp3
policy with an optimally tuned decreasing sequence of temperature parameters \(\gamma_t\).- Reference: Section 4 of [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
- It is an index policy with indexes computed from the empirical mean estimators and a random sample from a Gumbel distribution.
-
__init__
(nbArms, C=1, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Take a random index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \beta_k(t) Z_k(t), \\ \text{where}\;\; \beta_k(t) &:= \sqrt{C^2 / N_k(t)}, \\ \text{and}\;\; Z_k(t) &\sim \mathrm{Gumbel}(0, 1).\end{split}\]Where \(\mathrm{Gumbel}(0, 1)\) is the standard Gumbel distribution. See [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.gumbel.html#numpy.random.gumbel) or [Wikipedia page](https://en.wikipedia.org/wiki/Gumbel_distribution) for more details.
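A direct sketch of this random index (illustrative only):
import numpy as np
def boltzmann_gumbel_index_sketch(X_k, N_k, C=1.0):
    beta_k = np.sqrt(C ** 2 / N_k)        # beta_k(t) = sqrt(C^2 / N_k(t))
    Z_k = np.random.gumbel(0.0, 1.0)      # Z_k(t) ~ Gumbel(0, 1)
    return X_k / N_k + beta_k * Z_k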
-
__module__
= 'Policies.BoltzmannGumbel'¶
Policies.CD_UCB module¶
The CD-UCB generic policy for non-stationary bandits.
Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)
It runs on top of a simple policy, e.g.,
UCB
, andUCBLCB_IndexPolicy
is a wrapper:>>> policy = UCBLCB_IndexPolicy(nbArms, UCB) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
It can only work on top of basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to store all previous observations in order to be able to reset the history to a shorter one)!
-
Policies.CD_UCB.
VERBOSE
= False¶ Whether to be verbose when doing the change detection algorithm.
-
Policies.CD_UCB.
PROBA_RANDOM_EXPLORATION
= 0.1¶ Default probability of random exploration \(\alpha\).
-
Policies.CD_UCB.
PER_ARM_RESTART
= True¶ Should we reset one arm empirical average or all? Default is
True
, it’s usually more efficient!
-
Policies.CD_UCB.
FULL_RESTART_WHEN_REFRESH
= False¶ Should we fully restart the algorithm or simply reset one arm empirical average? Default is
False
, it’s usually more efficient!
-
Policies.CD_UCB.
EPSILON
= 0.05¶ Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).
-
Policies.CD_UCB.
LAMBDA
= 1¶ Default value of \(\lambda\).
-
Policies.CD_UCB.
MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT
= 50¶ Hypothesis on the speed of changes: between two change points, there is at least \(M * K\) time steps, where K is the number of arms, and M is this constant.
-
Policies.CD_UCB.
LAZY_DETECT_CHANGE_ONLY_X_STEPS
= 10¶ XXX Be lazy and try to detect changes only every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 10 should speed up the test by about x10 (use 1 to run the test at every time step).
-
class
Policies.CD_UCB.
CD_IndexPolicy
(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
The CD-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).
-
__init__
(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶ New policy.
-
epsilon
= None¶ Parameter \(\varepsilon\) for the test.
-
lazy_detect_change_only_x_steps
= None¶ Be lazy and try to detect changes only X steps, where X is small like 10 for instance.
-
proba_random_exploration
= None¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
all_rewards
= None¶ Keep in memory all the rewards obtained since the last restart on that arm.
-
last_pulls
= None¶ Keep in memory the number of pulls of each arm since the last restart. Starts at -1 (never seen)
-
last_restart_times
= None¶ Keep in memory the times of last restarts (for each arm).
-
number_of_restart
= None¶ Keep in memory the number of restarts.
-
choice
()[source]¶ With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to
choice()
of the underlying policy.
-
choiceWithRank
(rank=1)[source]¶ With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to
choiceWithRank()
of the underlying policy.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the change detection algorithm says so, with method
detect_change()
, for this arm at this current time step.
Warning
This is computationally costly, so an easy way to speed up this step is to use
lazy_detect_change_only_x_steps
\(= \mathrm{Step_t}\) for a small value (e.g., 10), so as not to test for all \(t\in\mathbb{N}^*\) but only for \(s\in\mathbb{N}^*\) with \(s \bmod \mathrm{Step_t} = 0\) (e.g., one out of every 10 steps).
If the detect_change() method also returns an estimate of the position of the change-point, \(\hat{\tau}\), then it is used to reset the memory of the changing arm and keep the observations from \(\hat{\tau}+1\).
-
detect_change
(arm, verbose=False)[source]¶ Try to detect a change in the current arm.
Warning
This is not implemented for the generic CD algorithm, it has to be implement by a child of the class
CD_IndexPolicy
.
-
__module__
= 'Policies.CD_UCB'¶
-
-
class
Policies.CD_UCB.
SlidingWindowRestart_IndexPolicy
(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
A more generic implementation is the
Policies.SlidingWindowRestart
class.Warning
I have no idea if what I wrote is correct or not!
-
detect_change
(arm, verbose=False)[source]¶ Try to detect a change in the current arm.
Warning
This one is simply using a sliding-window of fixed size = 100. A more generic implementation is the
Policies.SlidingWindowRestart
class.
-
__module__
= 'Policies.CD_UCB'¶
-
-
Policies.CD_UCB.
LAZY_TRY_VALUE_S_ONLY_X_STEPS
= 10¶ XXX Be lazy and try to detect changes for \(s\) taking steps of size
steps_s
. Setting steps_s=1 means no speed-up, steps_s=2 should already speed up the test by 2, and the value 10 used here by about x10. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173.
-
Policies.CD_UCB.
USE_LOCALIZATION
= True¶ Default value of
use_localization
for policies. All the experiments I tried showed that the localization always helps improving learning, so the default value is set to True.
-
class
Policies.CD_UCB.
UCBLCB_IndexPolicy
(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
The UCBLCB-UCB generic policy for non-stationary bandits, from [[Improved Changepoint Detection for Piecewise i.i.d Bandits, by S. Mukherjee & O.-A. Maillard, preprint 2018](https://subhojyoti.github.io/pdf/aistats_2019.pdf)].
Warning
This is still experimental! See https://github.com/SMPyBandits/SMPyBandits/issues/177
-
__init__
(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
proba_random_exploration
= None¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
lazy_try_value_s_only_x_steps
= None¶ Be lazy and try to detect changes for \(s\) taking steps of size
steps_s
.
-
use_localization
= None¶ Experiment to use localization of the break-point, i.e., restart the memory of the arm by keeping observations s+1…n instead of just the last one
-
__module__
= 'Policies.CD_UCB'¶
-
delta
(t)[source]¶ Use \(\delta = \delta_0\) if it was given as an argument to the policy, or \(\frac{\delta_0}{t}\) as the confidence level of UCB/LCB test (default is \(\delta_0=1\)).
Warning
It is unclear (in the article) whether \(t\) is the time since the last restart or the total time.
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using the two-sided UCB-LCB algorithm [Mukherjee & Maillard, 2018].
- Let \(\hat{\mu}_{i,t:t'}\) the empirical mean of rewards obtained for arm i from time \(t\) to \(t'\), and \(N_{i,t:t'}\) the number of samples.
- Let \(S_{i,t:t'} = \sqrt{\frac{\log(4 t^2 / \delta)}{2 N_{i,t:t'}}}\) the length of the confidence interval.
- When we have data starting at \(t_0=0\) (since last restart) and up-to current time \(t\), for each arm i,
- For each intermediate time steps \(t' \in [t_0, t)\),
- Compute \(LCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} - S_{i,t_0:t'}\),
- Compute \(UCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} + S_{i,t_0:t'}\),
- Compute \(LCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} - S_{i,t'+1:t}\),
- Compute \(UCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} + S_{i,t'+1:t}\),
- If \(UCB_{\text{before}} < LCB_{\text{after}}\) or \(UCB_{\text{after}} < LCB_{\text{before}}\), then restart.
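A sketch of this two-sided test for one arm, where data is the list of rewards since the last restart and delta the confidence level (a real implementation would also use the lazy steps lazy_try_value_s_only_x_steps; this is illustrative only):
import math
def ucb_lcb_detect_change_sketch(data, delta):
    t = len(data)
    def mean_and_radius(samples):
        n = len(samples)
        return sum(samples) / n, math.sqrt(math.log(4 * t ** 2 / delta) / (2 * n))
    for tprime in range(1, t - 1):
        mu_before, s_before = mean_and_radius(data[:tprime + 1])
        mu_after, s_after = mean_and_radius(data[tprime + 1:])
        # disjoint confidence intervals before/after the split point => change detected
        if mu_before + s_before < mu_after - s_after or mu_after + s_after < mu_before - s_before:
            return True, tprime
    return False, None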
-
Policies.CORRAL module¶
The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.
The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).
- At every step, one slave algorithm is selected, by a random selection from a trust distribution on \([1,...,N]\).
- Then its decision is listened to, played by the master algorithm, and a feedback reward is received.
- The reward is reweighted by the trust of the listened algorithm, and given back to it.
- The other slaves, whose decisions were not even asked for, receive a zero reward, or no reward at all.
- The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm k (the reward), the trust \(P_i\) in each slave \(A_i\) is updated according to the reward received.
- The detail about how to increase or decrease the probabilities are specified in the reference article.
Note
Reference: [[“Corralling a Band of Bandit Algorithms”, by A. Agarwal, H. Luo, B. Neyshabur, R.E. Schapire, 01.2017](https://arxiv.org/abs/1612.06246v2)].
-
Policies.CORRAL.
renormalize_reward
(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]¶ Renormalize the reward to [0, 1]:
- divide by (trust/mintrust) if unbiased is True,
- simply project to [0, 1] if unbiased is False.
Warning
If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.
-
Policies.CORRAL.
unnormalize_reward
(reward, lower=0.0, amplitude=1.0)[source]¶ Project back reward to [lower, lower + amplitude].
-
Policies.CORRAL.
log_Barrier_OMB
(trusts, losses, rates)[source]¶ A step of the log-barrier Online Mirror Descent, updating the trusts:
- Find \(\lambda \in [\min_i l_{t,i}, \max_i l_{t,i}]\) such that \(\sum_i \frac{1}{1/p_{t,i} + \eta_{t,i}(l_{t,i} - \lambda)} = 1\).
- Return \(\mathbf{p}_{t+1,i}\) such that \(\frac{1}{p_{t+1,i}} = \frac{1}{p_{t,i}} + \eta_{t,i}(l_{t,i} - \lambda)\).
- Note: uses
scipy.optimize.minimize_scalar()
for the optimization. - Reference: [Learning in games: Robustness of fast convergence, by D.Foster, Z.Li, T.Lykouris, K.Sridharan, and E.Tardos, NIPS 2016].
-
Policies.CORRAL.
UNBIASED
= True¶ self.unbiased is a flag to know if the rewards are used as biased estimator, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).
-
Policies.CORRAL.
BROADCAST_ALL
= False¶ Whether to give back a reward to only one slave algorithm (default, False) or to all slaves who voted for the same arm
-
class
Policies.CORRAL.
CORRAL
(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.
-
__init__
(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ New policy.
-
nbArms
= None¶ Number of arms.
-
lower
= None¶ Lower values for rewards.
-
amplitude
= None¶ Amplitude of the rewards.
-
unbiased
= None¶ Flag, see above.
-
broadcast_all
= None¶ Flag, see above.
-
gamma
= None¶ Constant \(\gamma = 1 / T\).
-
beta
= None¶ Constant \(\beta = \exp(1 / \log(T))\).
-
rates
= None¶ Value of the learning rate (will be increasing in time).
-
children
= None¶ List of slave algorithms.
-
trusts
= None¶ Initial trusts in the slaves. Default to uniform, but a prior can also be given.
-
bar_trusts
= None¶ Initial bar trusts in the slaves. Default to uniform, but a prior can also be given.
-
choices
= None¶ Keep track of the last choices of each slave, to know whom to update if update_all_children is false.
-
last_choice
= None¶ Remember the index of the last child trusted for a decision.
-
losses
= None¶ For the log-barrier OMD step, a vector of losses has to be given. Faster to keep it as an attribute instead of reallocating it every time.
-
rhos
= None¶ I use the inverses of the \(\rho_{t,i}\) from the Algorithm in the reference article. Simpler to understand, less numerical errors.
-
__setattr__
(name, value)[source]¶ Trick method, to update the \(\gamma\) and \(\beta\) parameters of the CORRAL algorithm if the horizon T changes.
- This is here just to eventually allow
Policies.DoublingTrickWrapper
to be used with a CORRAL player.
Warning
Not tested yet!
-
getReward
(arm, reward)[source]¶ Give reward for each child, and then update the trust probabilities.
-
choiceFromSubSet
(availableArms='all')[source]¶ Trust one of the slaves and listen to its choiceFromSubSet.
-
__module__
= 'Policies.CORRAL'¶
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Trust one of the slaves and listen to its choiceIMP.
-
Policies.CPUCB module¶
The Clopper-Pearson UCB policy for bounded bandits. Reference: [Garivier & Cappé, COLT 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
Policies.CPUCB.
binofit_scalar
(x, n, alpha=0.05)[source]¶ Parameter estimates and confidence intervals for binomial data.
For example:
>>> np.random.seed(1234) # reproducible results >>> true_p = 0.6 >>> N = 100 >>> x = np.random.binomial(N, true_p) >>> (phat, pci) = binofit_scalar(x, N) >>> phat 0.61 >>> pci # 0.6 of course lies in the 95% confidence interval # doctest: +ELLIPSIS (0.507..., 0.705...) >>> (phat, pci) = binofit_scalar(x, N, 0.01) >>> pci # 0.6 is also in the 99% confidence interval, but it is larger # doctest: +ELLIPSIS (0.476..., 0.732...)
Like binofit in MATLAB, see https://fr.mathworks.com/help/stats/binofit.html.
(phat, pci) = binofit_scalar(x, n)
returns a maximum likelihood estimate of the probability of success in a given binomial trial based on the number of successes,x
, observed inn
independent trials.(phat, pci) = binofit_scalar(x, n)
returns the probability estimate, phat, and the 95% confidence intervals, pci, by using the Clopper-Pearson method to calculate confidence intervals.(phat, pci) = binofit_scalar(x, n, alpha)
returns the100(1 - alpha)%
confidence intervals. For example,alpha = 0.01
yields99%
confidence intervals.
For the Clopper-Pearson UCB algorithms:
- x is the cumulated reward of some arm k, \(x = X_k(t)\),
- n is the number of samples of that arm k, \(n = N_k(t)\),
- and alpha is a small positive number, \(\alpha = \frac{1}{t^c}\) in this algorithm (for \(c > 1, \simeq 1\), for instance c = 1.01).
Returns: (phat, pci)
- phat: is the estimate of p
- pci: is the confidence interval
Note
My reference implementation was https://github.com/sjara/extracellpy/blob/master/extrastats.py#L35, but http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.proportion.proportion_confint.html can also be used (it implies an extra requirement for the project).
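For reference, the Clopper-Pearson interval itself can be written with Beta quantiles, for instance as in this sketch (not the module's exact code):
from scipy.stats import beta
def clopper_pearson_sketch(x, n, alpha=0.05):
    phat = x / n                                              # maximum likelihood estimate
    low = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upp = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return phat, (low, upp)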
-
Policies.CPUCB.
binofit
(xArray, nArray, alpha=0.05)[source]¶ Parameter estimates and confidence intervals for binomial data, for vectorial inputs.
For example:
>>> np.random.seed(1234) # reproducible results >>> true_p = 0.6 >>> N = 100 >>> xArray = np.random.binomial(N, true_p, 4) >>> xArray array([61, 54, 61, 52])
>>> (phat, pci) = binofit(xArray, N) >>> phat array([0.61, 0.54, 0.61, 0.52]) >>> pci # 0.6 of course lies in the 95% confidence intervals # doctest: +ELLIPSIS array([[0.507..., 0.705...], [0.437..., 0.640...], [0.507..., 0.705...], [0.417..., 0.620...]])
>>> (phat, pci) = binofit(xArray, N, 0.01) >>> pci # 0.6 is also in the 99% confidence intervals, but it is larger # doctest: +ELLIPSIS array([[0.476..., 0.732...], [0.407..., 0.668...], [0.476..., 0.732...], [0.387..., 0.650...]])
-
Policies.CPUCB.
ClopperPearsonUCB
(x, N, alpha=0.05)[source]¶ Returns just the upper-confidence bound of the confidence interval.
-
Policies.CPUCB.
C
= 1.01¶ Default value for the parameter c for CP-UCB
-
class
Policies.CPUCB.
CPUCB
(nbArms, c=1.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The Clopper-Pearson UCB policy for bounded bandits. Reference: [Garivier & Cappé, COLT 2011].
-
__init__
(nbArms, c=1.01, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c for the CP-UCB formula (see below)
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \mathrm{ClopperPearsonUCB}\left( X_k(t), N_k(t), \frac{1}{t^c} \right).\]Where \(\mathrm{ClopperPearsonUCB}\) is defined above. The index is the upper-confidence bound of the binomial trial of \(N_k(t)\) samples from arm k, having mean \(\mu_k\), and empirical outcome \(X_k(t)\). The confidence interval is with \(\alpha = 1 / t^c\), for a \(100(1 - \alpha)\%\) confidence bound.
-
__module__
= 'Policies.CPUCB'¶
-
Policies.CUSUM_UCB module¶
The CUSUM-UCB and PHT-UCB policies for non-stationary bandits.
Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)
It runs on top of a simple policy, e.g.,
UCB
, andCUSUM_IndexPolicy
is a wrapper:>>> policy = CUSUM_IndexPolicy(nbArms, UCB) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
It can only work on top of basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to store all previous observations in order to be able to reset the history to a shorter one)!
-
Policies.CUSUM_UCB.
VERBOSE
= False¶ Whether to be verbose when doing the change detection algorithm.
-
Policies.CUSUM_UCB.
PROBA_RANDOM_EXPLORATION
= 0.1¶ Default probability of random exploration \(\alpha\).
-
Policies.CUSUM_UCB.
PER_ARM_RESTART
= True¶ Should we reset one arm empirical average or all? For CUSUM-UCB it is
True
by default.
-
Policies.CUSUM_UCB.
FULL_RESTART_WHEN_REFRESH
= False¶ Should we fully restart the algorithm or simply reset one arm empirical average? For CUSUM-UCB it is
False
by default.
-
Policies.CUSUM_UCB.
EPSILON
= 0.01¶ Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).
-
Policies.CUSUM_UCB.
LAMBDA
= 1¶ Default value of \(\lambda\). Used only if \(h\) and \(\alpha\) are computed using
compute_h_alpha_from_input_parameters__CUSUM_complicated()
.
-
Policies.CUSUM_UCB.
MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT
= 100¶ Hypothesis on the speed of changes: between two change points, there is at least \(M * K\) time steps, where K is the number of arms, and M is this constant.
-
Policies.CUSUM_UCB.
LAZY_DETECT_CHANGE_ONLY_X_STEPS
= 10¶ XXX Be lazy and try to detect changes only every X steps, where X is small, like 20 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 20 should speed up the test by about x20 (use 1 to run the test at every time step).
-
Policies.CUSUM_UCB.
USE_LOCALIZATION
= True¶ Default value of
use_localization
for policies. All the experiments I tried showed that the localization always helps improving learning, so the default value is set to True.
-
Policies.CUSUM_UCB.
ALPHA0_SCALE_FACTOR
= 1¶ For any algorithm with uniform exploration and a formula to tune it, \(\alpha\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps a lot!
-
Policies.CUSUM_UCB.
compute_h_alpha_from_input_parameters__CUSUM_complicated
(horizon, max_nb_random_events, nbArms=None, epsilon=None, lmbda=None, M=None, scaleFactor=1)[source]¶ Compute the values \(C_1^+, C_1^-, C_1, C_2, h\) from the formulas in Theorem 2 and Corollary 2 in the paper.
-
Policies.CUSUM_UCB.
compute_h_alpha_from_input_parameters__CUSUM
(horizon, max_nb_random_events, scaleFactor=1, **kwargs)[source]¶ Compute the values \(h, \alpha\) from the simplified formulas in Theorem 2 and Corollary 2 in the paper.
\[\begin{split}h &= \log(\frac{T}{\Upsilon_T}),\\ \alpha &= \mathrm{scaleFactor} \times \sqrt{\frac{\Upsilon_T}{T} \log(\frac{T}{\Upsilon_T})}.\end{split}\]
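These simplified formulas translate directly into code, for instance (a sketch under the notation above, with Upsilon_T = max_nb_random_events):
import numpy as np
def compute_h_alpha_sketch(horizon, max_nb_random_events, scaleFactor=1.0):
    ratio = horizon / max_nb_random_events                 # T / Upsilon_T
    h = np.log(ratio)
    alpha = scaleFactor * np.sqrt(np.log(ratio) / ratio)   # scaleFactor * sqrt((Upsilon_T/T) log(T/Upsilon_T))
    return h, alpha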
-
class
Policies.CUSUM_UCB.
CUSUM_IndexPolicy
(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
The CUSUM-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).
-
__init__
(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
M
= None¶ Parameter \(M\) for the test.
-
threshold_h
= None¶ Parameter \(h\) for the test (threshold).
-
proba_random_exploration
= None¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
use_localization
= None¶ Experiment to use localization of the break-point, i.e., restart the memory of the arm by keeping observations s+1…n instead of just the last one
-
getReward
(arm, reward)[source]¶ Be sure that the underlying UCB or klUCB indexes are used with \(\log(n_t)\) for the exploration term, where \(n_t = \sum_{i=1}^K N_i(t)\) is the total number of pulls since the last restart times (a different restart time for each arm, since CUSUM uses local restarts only).
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using the two-sided CUSUM algorithm [Page, 1954].
- For each data k, compute:
\[\begin{split}s_k^- &= (y_k - \hat{u}_0 - \varepsilon) 1(k > M),\\ s_k^+ &= (\hat{u}_0 - y_k - \varepsilon) 1(k > M),\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]- The change is detected if \(\max(g_k^+, g_k^-) > h\), where
threshold_h
is the threshold of the test, - And \(\hat{u}_0 = \frac{1}{M} \sum_{k=1}^{M} y_k\) is the mean of the first M samples, where M is
M
, the minimum number of observations between change points (see the sketch below).
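Here is a minimal, self-contained sketch of this two-sided CUSUM test on the reward stream of a single arm, following the recursion written above; the names rewards, eps and threshold_h are illustrative and not the exact attributes of the class.
import numpy as np

def cusum_detect(rewards, M=100, eps=0.05, threshold_h=10.0):
    """Two-sided CUSUM test [Page, 1954] on one arm's reward stream.
    The first M samples only serve to estimate the reference mean u0.
    Returns the index at which a change is detected, or None."""
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) <= M:
        return None
    u0 = np.mean(rewards[:M])       # mean of the first M samples
    g_plus, g_minus = 0.0, 0.0
    for k in range(M, len(rewards)):
        y_k = rewards[k]
        s_plus = u0 - y_k - eps     # s_k^+ from the formula above
        s_minus = y_k - u0 - eps    # s_k^- from the formula above
        g_plus = max(0.0, g_plus + s_plus)
        g_minus = max(0.0, g_minus + s_minus)
        if max(g_plus, g_minus) > threshold_h:
            return k                # change detected at sample k
    return None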
-
__module__
= 'Policies.CUSUM_UCB'¶
-
-
class
Policies.CUSUM_UCB.
PHT_IndexPolicy
(nbArms, horizon=None, max_nb_random_events=None, lmbda=1, min_number_of_observation_between_change_point=100, full_restart_when_refresh=False, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CUSUM_UCB.CUSUM_IndexPolicy
The PHT-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).
-
__module__
= 'Policies.CUSUM_UCB'¶
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using the two-sided PHT algorithm [Hinkley, 1971].
- For each sample k, compute:
\[\begin{split}s_k^- &= y_k - \hat{y}_k - \varepsilon,\\ s_k^+ &= \hat{y}_k - y_k - \varepsilon,\\ g_k^+ &= \max(0, g_{k-1}^+ + s_k^+),\\ g_k^- &= \max(0, g_{k-1}^- + s_k^-).\end{split}\]- The change is detected if \(\max(g_k^+, g_k^-) > h\), where
threshold_h
is the threshold of the test, - And \(\hat{y}_k = \frac{1}{k} \sum_{s=1}^{k} y_s\) is the mean of the first k samples.
-
Policies.DMED module¶
The DMED policy of [Honda & Takemura, COLT 2010] in the special case of Bernoulli rewards (can be used on any [0,1]-valued rewards, but warning: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010]) (see note below on the variant).
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
class
Policies.DMED.
DMED
(nbArms, genuine=False, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The DMED policy of [Honda & Takemura, COLT 2010] in the special case of Bernoulli rewards (can be used on any [0,1]-valued rewards, but warning: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010]) (see note below on the variant).
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
__init__
(nbArms, genuine=False, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ New policy.
-
kl
= None¶ kl function to use
-
tolerance
= None¶ Numerical tolerance
-
genuine
= None¶ Flag to know which variant is implemented, DMED or DMED+
-
nextActions
= None¶ List of next actions to play; at every step the policy plays
nextActions.pop(0)
-
choice
()[source]¶ If there is still a next action to play, pop it and play it; otherwise build a new list and play its first action.
The list of actions is obtained as all the indexes \(k\) satisfying the following inequality.
- For the naive version (
genuine = False
), DMED:
\[\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log(t)}{N_k(t)}.\]- For the original version (
genuine = True
), DMED+:
\[\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log(\frac{t}{N_k(t)})}{N_k(t)}.\]Where \(X_k(t)\) is the sum of rewards from arm k, \(\hat{\mu}_k(t)\) is the empirical mean, and \(\hat{\mu}^*(t)\) is the best empirical mean.
\[\begin{split}X_k(t) &= \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma) \\ \hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \hat{\mu}^*(t) &= \max_{k=1}^{K} \hat{\mu}_k(t)\end{split}\]
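As an illustration of this selection rule, here is a hedged sketch that recomputes the list of next actions from per-arm statistics; kl_bern, sums and pulls are illustrative stand-ins for the kl function and the internal counters of the policy, not its actual attributes.
import numpy as np

def kl_bern(x, y, eps=1e-15):
    """Bernoulli Kullback-Leibler divergence kl(x, y), clipped away from 0 and 1."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def dmed_next_actions(sums, pulls, t, genuine=False):
    """Return the list of arms k satisfying the DMED (genuine=False)
    or DMED+ (genuine=True) condition written above."""
    means = sums / pulls
    best_mean = np.max(means)
    actions = []
    for k, (mu_k, n_k) in enumerate(zip(means, pulls)):
        # DMED uses log(t)/N_k(t); DMED+ uses log(t/N_k(t))/N_k(t)
        rhs = (np.log(t / n_k) if genuine else np.log(t)) / n_k
        if kl_bern(mu_k, best_mean) < rhs:
            actions.append(k)
    return actions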
-
choiceMultiple
(nb=1)[source]¶ If there are still enough actions to play, pop and play them; otherwise build a new list and play the first nb actions.
-
__module__
= 'Policies.DMED'¶
-
class
Policies.DMED.
DMEDPlus
(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.DMED.DMED
The DMED+ policy of [Honda & Takemura, COLT 2010] in the special case of Bernoulli rewards (can be used on any [0,1]-valued rewards, but warning: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010]).
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
__init__
(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ New policy.
-
__module__
= 'Policies.DMED'¶
Policies.DiscountedBayesianIndexPolicy module¶
Discounted Bayesian index policy.
- By default, it uses a DiscountedBeta posterior (
Policies.Posterior.DiscountedBeta
), one per arm. - It uses a discount factor \(\gamma\in(0,1)\).
Warning
This is still highly experimental!
-
Policies.DiscountedBayesianIndexPolicy.
GAMMA
= 0.95¶ Default value for the discount factor \(\gamma\in(0,1)\).
0.95
is empirically a reasonable value for short-term non-stationary experiments.
-
class
Policies.DiscountedBayesianIndexPolicy.
DiscountedBayesianIndexPolicy
(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BayesianIndexPolicy.BayesianIndexPolicy
Discounted Bayesian index policy.
- By default, it uses a DiscountedBeta posterior (
Policies.Posterior.DiscountedBeta
), one per arm. - It uses a discount factor \(\gamma\in(0,1)\).
- It keeps \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\) the discounted counts of successes and failures (S and F), for each arm k.
- But instead of using \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{N_k}(t) = N_k(t)\), they are updated at each time step using the discount factor \(\gamma\):
\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]-
__init__
(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Create a new Bayesian policy, by creating a default posterior on each arm.
-
gamma
= None¶ Discount factor \(\gamma\in(0,1)\).
-
__module__
= 'Policies.DiscountedBayesianIndexPolicy'¶
Policies.DiscountedThompson module¶
The Discounted Thompson (Bayesian) index policy.
- By default, it uses a DiscountedBeta posterior (
Policies.Posterior.DiscountedBeta
), one per arm. - Reference: [[“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727)].
Warning
This is still highly experimental!
-
class
Policies.DiscountedThompson.
DiscountedThompson
(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.DiscountedBayesianIndexPolicy.DiscountedBayesianIndexPolicy
The DiscountedThompson (Bayesian) index policy.
- By default, it uses a DiscountedBeta posterior (
Policies.Posterior.DiscountedBeta
), one per arm. - Reference: [[“Taming Non-stationary Bandits: A Bayesian Approach”, Vishnu Raj & Sheetal Kalyani, arXiv:1707.09727](https://arxiv.org/abs/1707.09727)].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, by sampling from the DiscountedBeta posterior.
\[\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \widetilde{S_k}(t), 1 + \widetilde{F_k}(t)).\end{split}\]- It keeps \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\) the discounted counts of successes and failures (S and F), for each arm k.
- But instead of using \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{N_k}(t) = N_k(t)\), they are updated at each time step using the discount factor \(\gamma\):
\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]
-
__module__
= 'Policies.DiscountedThompson'¶
Policies.DiscountedUCB module¶
The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).
- Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
- \(\gamma\) should not be 1, otherwise you should rather use
Policies.UCBalpha.UCBalpha
instead. - The smaller the \(\gamma\), the shorter the “memory” of the algorithm is.
-
Policies.DiscountedUCB.
ALPHA
= 1¶ Default parameter for alpha.
-
Policies.DiscountedUCB.
GAMMA
= 0.99¶ Default parameter for gamma.
-
class
Policies.DiscountedUCB.
DiscountedUCB
(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]¶ Bases:
Policies.UCBalpha.UCBalpha
The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).
- Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
-
__init__
(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
discounted_pulls
= None¶ Number of pulls of each arm
-
discounted_rewards
= None¶ Cumulated rewards of each arm
-
alpha
= None¶ Parameter alpha
-
gamma
= None¶ Parameter gamma
-
delta_time_steps
= None¶ Keep memory of the \(\Delta_k(t)\) for each time step.
-
useRealDiscount
= None¶ Flag to know if the real update should be used, the one with a multiplication by \(\gamma^{1+\Delta_k(t)}\) and not simply a multiplication by \(\gamma\).
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
- Keep up to date the following two quantities, using a different definition and notation than the article, but consistent within this project:
\[\begin{split}N_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} N_k(s), \\ X_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} X_k(s).\end{split}\]- Instead of keeping the whole history of rewards, as expressed in the math formula, we keep the sum of discounted rewards from
s=0
to s=t
, because updating it is easy (2 operations instead of just 1 for the classical Policies.UCBalpha.UCBalpha
, and 2 operations instead of \(\mathcal{O}(t)\) as expressed mathematically). Denote by \(\Delta_k(t)\) the number of time steps during which the arm k
was not selected (maybe 0 if it is selected twice in a row). Then the update can be done easily by multiplying by \(\gamma^{1+\Delta_k(t)}\):
\[\begin{split}N_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times N_{k,\gamma}(\text{last pull}) + \mathbb{1}(A(t+1) = k), \\ X_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times X_{k,\gamma}(\text{last pull}) + X_k(t+1).\end{split}\]
-
computeIndex
(arm)[source]¶ Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:
\[\begin{split}I_k(t) &:= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} + \sqrt{\frac{\alpha \log(n_{\gamma}(t))}{2 N_{k,\gamma}(t)}}, \\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]
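A minimal sketch of the lazy discounted update and of the index computation above, under the same notation; the arrays N_disc, X_disc and gap_since_pull are illustrative names, not the exact attributes of the class, and every arm is assumed to have been pulled at least once.
import numpy as np

def discounted_ucb_get_reward(N_disc, X_disc, gap_since_pull, arm, reward, gamma=0.99):
    """Lazy discounted update: multiply the pulled arm's statistics by
    gamma^(1 + Delta_k(t)) before adding the new observation."""
    factor = gamma ** (1 + gap_since_pull[arm])
    N_disc[arm] = factor * N_disc[arm] + 1.0
    X_disc[arm] = factor * X_disc[arm] + reward
    gap_since_pull[arm] = 0
    for k in range(len(gap_since_pull)):   # the other arms were not pulled this step
        if k != arm:
            gap_since_pull[k] += 1

def discounted_ucb_index(N_disc, X_disc, alpha=1.0):
    """Discounted UCB indexes I_k(t), as in the formula above (N_disc > 0 everywhere)."""
    n_gamma = np.sum(N_disc)
    return X_disc / N_disc + np.sqrt(alpha * np.log(n_gamma) / (2.0 * N_disc))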
-
__module__
= 'Policies.DiscountedUCB'¶
-
class
Policies.DiscountedUCB.
DiscountedUCBPlus
(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]¶ Bases:
Policies.DiscountedUCB.DiscountedUCB
The Discounted-UCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).
- Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
- Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.
-
__init__
(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
__module__
= 'Policies.DiscountedUCB'¶
-
Policies.DiscountedUCB.
constant_c
= 1.0¶ Default value, as it was in pymaBandits v1.0.
-
Policies.DiscountedUCB.
tolerance
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.DiscountedUCB.
DiscountedklUCB
(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]¶ Bases:
Policies.DiscountedUCB.DiscountedUCB
The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).
- Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
-
__init__
(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
klucb
= None¶ kl function to use
-
computeIndex
(arm)[source]¶ Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:
\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(t)}{N_{k,\gamma}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
computeAllIndex
()[source]¶ Compute the current indexes for all arms. Possibly vectorized; by default it cannot be vectorized automatically.
-
__module__
= 'Policies.DiscountedUCB'¶
-
class
Policies.DiscountedUCB.
DiscountedklUCBPlus
(nbArms, klucb=<function klucbBern>, *args, **kwargs)[source]¶ Bases:
Policies.DiscountedUCB.DiscountedklUCB
,Policies.DiscountedUCB.DiscountedUCBPlus
The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).
- Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
- Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.
-
__module__
= 'Policies.DiscountedUCB'¶
Policies.DoublingTrickWrapper module¶
A policy that acts as a wrapper on another policy P, assumed to be horizon dependent (it has to know \(T\)), by implementing a “doubling trick”:
- start by assuming that \(T=T_0=1000\), and run the policy \(P(T_0)\) from \(t=1\) to \(t=T_0\),
- if \(t > T_0\), then the “doubling trick” is performed, by either re-initializing the policy P or just changing its horizon parameter, for instance with \(T_2 = 10 \times T_0\),
- keep doing this until \(t = T\).
Note
This is implemented in a very generic way, with simply a function next_horizon(horizon) that gives the next horizon to try when crossing the current guess. It can be a simple linear function (next_horizon(horizon) = horizon + 100), a geometric growth to have the “real” doubling trick (next_horizon(horizon) = horizon * 10), or even functions growing exponentially fast (next_horizon(horizon) = horizon ** 1.1, next_horizon(horizon) = horizon ** 1.5, next_horizon(horizon) = horizon ** 2).
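To illustrate the note above, here is a hedged sketch of the generic doubling-trick loop around an arbitrary horizon-dependent policy; make_policy, choice() and getReward() are illustrative placeholders for whatever the wrapped policy exposes, not the exact interface of the wrapper class.
import math

def next_horizon_arithmetic(horizon):
    return horizon + 100

def next_horizon_geometric(horizon):
    return horizon * 10

def next_horizon_exponential(horizon):
    return int(math.floor(horizon ** 1.5))

def run_with_doubling_trick(make_policy, T, first_horizon=1000,
                            next_horizon=next_horizon_geometric,
                            full_restart=True):
    """Run a horizon-dependent policy up to time T with the doubling trick.
    make_policy(horizon) must return a fresh policy for a given horizon guess."""
    horizon_guess = first_horizon
    policy = make_policy(horizon_guess)
    for t in range(1, T + 1):
        if t > horizon_guess:                        # crossed the current guess
            horizon_guess = next_horizon(horizon_guess)
            if full_restart:
                policy = make_policy(horizon_guess)  # re-initialize from scratch
            else:
                policy.horizon = horizon_guess       # just update its horizon attribute
        arm = policy.choice()
        reward = 0.0                                 # ... observe the true reward here
        policy.getReward(arm, reward)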
Note
My guess is that this “doubling trick” wrapping policy can only be efficient (for stochastic problems) if:
- the underlying policy P is a very efficient horizon-dependent algorithm, e.g., the
Policies.ApproximatedFHGittins
, - the growth function next_horizon grows faster than any geometric rate, so that the number of refreshes is \(o(\log T)\) and not \(O(\log T)\).
See also
Reference: [[What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018]](https://hal.inria.fr/hal-01736357), to be presented soon.
Warning
Interface: If FULL_RESTART=False (default), the underlying algorithm is not recreated at every breakpoint; instead, its attribute horizon or _horizon is updated. Be sure that this is enough to really change the internal value used by the policy. Some policies use T only once, to compute other parameters, which should then be updated as well. A manual implementation of the __setattr__ method can help.
-
Policies.DoublingTrickWrapper.
default_horizonDependent_policy
¶ alias of
Policies.UCBH.UCBH
-
Policies.DoublingTrickWrapper.
FULL_RESTART
= False¶ Default constant to know what to do when restarting the underlying policy with a new horizon parameter.
- True means that a new policy, initialized from scratch, will be created at every breakpoint.
- False means that the same policy object is used but just its attribute horizon is updated (default).
-
Policies.DoublingTrickWrapper.
DEFAULT_FIRST_HORIZON
= 200¶ Default horizon, used for the first step.
-
Policies.DoublingTrickWrapper.
ARITHMETIC_STEP
= 200¶ Default stepsize for the arithmetic horizon progression.
-
Policies.DoublingTrickWrapper.
next_horizon__arithmetic
(i, horizon)[source]¶ The arithmetic horizon progression function:
\[\begin{split}T &\mapsto T + 100,\\ T_i &:= T_0 + 100 \times i.\end{split}\]
-
Policies.DoublingTrickWrapper.
GEOMETRIC_STEP
= 2¶ Default multiplicative constant for the geometric horizon progression.
-
Policies.DoublingTrickWrapper.
next_horizon__geometric
(i, horizon)[source]¶ The geometric horizon progression function:
\[\begin{split}T &\mapsto T \times 2,\\ T_i &:= T_0 2^i.\end{split}\]
-
Policies.DoublingTrickWrapper.
EXPONENTIAL_STEP
= 1.5¶ Default exponential constant for the exponential horizon progression.
-
Policies.DoublingTrickWrapper.
next_horizon__exponential
(i, horizon)[source]¶ The exponential horizon progression function:
\[\begin{split}T &\mapsto \left\lfloor T^{1.5} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.5^i} \right\rfloor.\end{split}\]
-
Policies.DoublingTrickWrapper.
SLOW_EXPONENTIAL_STEP
= 1.1¶ Default exponential constant for the slow exponential horizon progression.
-
Policies.DoublingTrickWrapper.
next_horizon__exponential_slow
(i, horizon)[source]¶ The exponential horizon progression function:
\[\begin{split}T &\mapsto \left\lfloor T^{1.1} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.1^i} \right\rfloor.\end{split}\]
-
Policies.DoublingTrickWrapper.
FAST_EXPONENTIAL_STEP
= 2¶ Default exponential constant for the fast exponential horizon progression.
-
Policies.DoublingTrickWrapper.
next_horizon__exponential_fast
(i, horizon)[source]¶ The exponential horizon progression function:
\[\begin{split}T &\mapsto \lfloor T^{2} \rfloor,\\ T_i &:= \lfloor T_0^{2^i} \rfloor.\end{split}\]
-
Policies.DoublingTrickWrapper.
ALPHA
= 2¶ Default constant \(\alpha\) for the generic exponential sequence.
-
Policies.DoublingTrickWrapper.
BETA
= 2¶ Default constant \(\beta\) for the generic exponential sequence.
-
Policies.DoublingTrickWrapper.
next_horizon__exponential_generic
(i, horizon)[source]¶ The generic exponential horizon progression function:
\[T_i := \left\lfloor \frac{T_0}{a} a^{b^i} \right\rfloor.\]
-
Policies.DoublingTrickWrapper.
default_next_horizon
(i, horizon)¶ The exponential horizon progression function:
\[\begin{split}T &\mapsto \left\lfloor T^{1.1} \right\rfloor,\\ T_i &:= \left\lfloor T_0^{1.1^i} \right\rfloor.\end{split}\]
-
Policies.DoublingTrickWrapper.
breakpoints
(next_horizon, first_horizon, horizon, debug=False)[source]¶ Return the list of restart points (breakpoints), when starting from
first_horizon
to horizon
with growth function next_horizon
.- Also return the gap between the last guess for horizon and the true horizon. This gap should not be too large.
- Nicely print all the values if
debug=True
. - First examples:
>>> first_horizon = 1000
>>> horizon = 30000
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([1000, 1200, 1400, ..., 29800, 30000], 0)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([1000, 2000, 4000, 8000, 16000, 32000], 2000)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([1000, 31622], 1622)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([1000, 1995, 4265, 9838, 24671, 67827], 37827)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([1000, 1000000], 970000)
- Second examples:
>>> first_horizon = 5000
>>> horizon = 1000000
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([5000, 5200, ..., 999600, 999800, 1000000], 0)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([5000, 10000, 20000, 40000, 80000, 160000, 320000, 640000, 1280000], 280000)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([5000, 353553, 210223755], 209223755)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([5000, 11718, 29904, 83811, 260394, 906137, 3572014], 2572014)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([5000, 25000000], 24000000)
- Third examples:
>>> first_horizon = 10
>>> horizon = 1123456
>>> breakpoints(next_horizon__arithmetic, first_horizon, horizon)  # doctest: +ELLIPSIS
([10, 210, 410, ..., 1123210, 1123410, 1123610], 154)
>>> breakpoints(next_horizon__geometric, first_horizon, horizon)
([10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480, 40960, 81920, 163840, 327680, 655360, 1310720], 187264)
>>> breakpoints(next_horizon__exponential, first_horizon, horizon)
([10, 31, 172, 2255, 107082, 35040856], 33917400)
>>> breakpoints(next_horizon__exponential_slow, first_horizon, horizon)
([10, 12, 15, 19, 25, 34, 48, 70, 107, 170, 284, 499, 928, 1837, 3895, 8903, 22104, 60106, 180638, 606024, 2294768], 1171312)
>>> breakpoints(next_horizon__exponential_fast, first_horizon, horizon)
([10, 100, 10000, 100000000], 98876544)
-
Policies.DoublingTrickWrapper.
constant_c_for_the_functions_f
= 0.5¶ The constant c in front of the function f.
-
Policies.DoublingTrickWrapper.
function_f__for_geometric_sequences
(i, c=0.5)[source]¶ For the geometric doubling sequences, \(f(i) = c \times \log(i)\).
-
Policies.DoublingTrickWrapper.
function_f__for_exponential_sequences
(i, c=0.5)[source]¶ For the exponential doubling sequences, \(f(i) = c \times i\).
-
Policies.DoublingTrickWrapper.
function_f__for_generic_sequences
(i, c=0.5, d=0.5, e=0.0)[source]¶ For a certain generic family of doubling sequences, \(f(i) = c \times i^{d} \times (\log(i))^{e}\).
d, e = 0, 1
gives function_f__for_geometric_sequences()
, d, e = 1, 0
gives function_f__for_exponential_sequences()
, d, e = 0.5, 0
gives an intermediate sequence, growing faster than any geometric sequence and slower than any exponential sequence,- any other combination has not been studied yet.
Warning
d
should most probably be smaller than 1.
-
Policies.DoublingTrickWrapper.
alpha_for_Ti
= 0.5¶ Value of the parameter \(\alpha\) for the
Ti_from_f()
function.
-
Policies.DoublingTrickWrapper.
Ti_from_f
(f, alpha=0.5, *args, **kwargs)[source]¶ For any non-negative and increasing function \(f: i \mapsto f(i)\), the corresponding sequence is defined by:
\[\forall i\in\mathbb{N},\; T_i := \lfloor \exp(\alpha \times \exp(f(i))) \rfloor.\]Warning
\(f(i)\) can need other parameters, see the examples above. They can be given as
*args
or**kwargs
toTi_from_f()
.Warning
It should be computed differently: I should pass \(i \mapsto \exp(f(i))\) instead of \(f: i \mapsto f(i)\), to reduce the risk of overflow errors as much as possible.
-
Policies.DoublingTrickWrapper.
Ti_geometric
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_geometric_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_exponential
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_exponential_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_intermediate_sqrti
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_intermediate_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_intermediate_i13
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_intermediate2_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_intermediate_i23
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_intermediate3_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_intermediate_i12_logi12
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_intermediate4_sequences()
.
-
Policies.DoublingTrickWrapper.
Ti_intermediate_i_by_logi
(i, horizon, alpha=0.5, first_horizon=200, *args, **kwargs)[source]¶ Sequence \(T_i\) generated from the function \(f\) =
function_f__for_intermediate5_sequences()
.
-
Policies.DoublingTrickWrapper.
last_term_operator_LT
(Ti, max_i=10000)[source]¶ For a certain function representing a doubling sequence, \(T: i \mapsto T_i\), this
last_term_operator_LT()
function returns the function \(L: T \mapsto L_T\), defined as:\[\forall T\in\mathbb{N},\; L_T := \min\{ i \in\mathbb{N},\; T \leq T_i \}.\]\(L_T\) is the only integer which satisfies \(T_{L_T - 1} < T \leq T_{L_T}\).
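For instance, a hedged sketch of this operator, for any doubling sequence given as a Python function Ti(i) (the names below are illustrative):
def last_term_operator(Ti, max_i=10000):
    """Return the function L: T -> L_T = min{ i : T <= Ti(i) }."""
    def L(T):
        for i in range(max_i + 1):
            if T <= Ti(i):
                return i
        raise ValueError("the sequence Ti does not reach T within max_i terms")
    return L

# Example: with Ti(i) = 1000 * 2**i, L(30000) == 5 since T_4 = 16000 < 30000 <= T_5 = 32000.
L = last_term_operator(lambda i: 1000 * 2 ** i)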
-
Policies.DoublingTrickWrapper.
plot_doubling_sequences
(i_min=1, i_max=30, list_of_f=(<function function_f__for_geometric_sequences>, <function function_f__for_intermediate_sequences>, <function function_f__for_intermediate2_sequences>, <function function_f__for_intermediate3_sequences>, <function function_f__for_intermediate4_sequences>, <function function_f__for_exponential_sequences>), label_of_f=('Geometric doubling (d=0, e=1)', 'Intermediate doubling (d=1/2, e=0)', 'Intermediate doubling (d=1/3, e=0)', 'Intermediate doubling (d=2/3, e=0)', 'Intermediate doubling (d=1/2, e=1/2)', 'Exponential doubling (d=1, e=0)'), *args, **kwargs)[source]¶ Display a plot to illustrate the values of the \(T_i\) as a function of \(i\) for some i.
- Can accept many functions f (and labels).
-
Policies.DoublingTrickWrapper.
plot_quality_first_upper_bound
(Tmin=10, Tmax=100000000, nbTs=100, gamma=0.0, delta=1.0, list_of_f=(<function function_f__for_geometric_sequences>, <function function_f__for_intermediate_sequences>, <function function_f__for_intermediate2_sequences>, <function function_f__for_intermediate3_sequences>, <function function_f__for_intermediate4_sequences>, <function function_f__for_exponential_sequences>), label_of_f=('Geometric doubling (d=0, e=1)', 'Intermediate doubling (d=1/2, e=0)', 'Intermediate doubling (d=1/3, e=0)', 'Intermediate doubling (d=2/3, e=0)', 'Intermediate doubling (d=1/2, e=1/2)', 'Exponential doubling (d=1, e=0)'), show_Ti_m_Tim1=True, *args, **kwargs)[source]¶ Display a plot to compare numerically between the following sum \(S\) and the upper-bound we hope to have, \(T^{\gamma} (\log T)^{\delta}\), as a function of \(T\) for some values between \(T_{\min}\) and \(T_{\max}\):
\[S := \sum_{i=0}^{L_T} (T_i - T_{i-1})^{\gamma} (\log (T_i - T_{i-1}))^{\delta}.\]- Can accept many functions f (and labels).
- Can use \(T_i\) instead of \(T_i - T_{i-1}\) if
show_Ti_m_Tim1=False
(default is to use the smaller possible bound, with difference of sequence lengths, \(T_i - T_{i-1}\)).
Warning
This is still ongoing work.
-
Policies.DoublingTrickWrapper.
MAX_NB_OF_TRIALS
= 500¶ If the sequence \(T_i\) does not grow enough, artificially increase i until \(T_{i+1} > T_i\).
-
class
Policies.DoublingTrickWrapper.
DoublingTrickWrapper
(nbArms, full_restart=False, policy=<class 'Policies.UCBH.UCBH'>, next_horizon=<function next_horizon__exponential_slow>, first_horizon=200, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
A policy that acts as a wrapper on another policy P, assumed to be horizon dependent (it has to know \(T\)), by implementing a “doubling trick”.
- Reference: [[What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018]](https://hal.inria.fr/hal-01736357), to be presented soon.
-
__init__
(nbArms, full_restart=False, policy=<class 'Policies.UCBH.UCBH'>, next_horizon=<function next_horizon__exponential_slow>, first_horizon=200, *args, **kwargs)[source]¶ New policy.
-
full_restart
= None¶ Constant to know how to refresh the underlying policy.
-
__module__
= 'Policies.DoublingTrickWrapper'¶
-
next_horizon_name
= None¶ Pretty string of the name of this growth function
-
horizon
= None¶ Last guess for the horizon
Policies.EmpiricalMeans module¶
The naive Empirical Means policy for bounded bandits: like UCB but without a bias correction term. Note that it is equal to UCBalpha with alpha=0, only quicker.
-
class
Policies.EmpiricalMeans.
EmpiricalMeans
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The naive Empirical Means policy for bounded bandits: like UCB but without a bias correction term. Note that it is equal to UCBalpha with alpha=0, only quicker.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)}.\]
-
__module__
= 'Policies.EmpiricalMeans'¶
-
Policies.EpsilonGreedy module¶
The epsilon-greedy random policies, with the naive one and some variants.
- At every time step, a fully uniform random exploration has probability \(\varepsilon(t)\) to happen, otherwise an exploitation is done on accumulated rewards (not means).
- Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
Warning
Unless \(\varepsilon(t)\) is optimally tuned for a specific problem, none of these policies can hope to be efficient.
-
class
Policies.EpsilonGreedy.
EpsilonGreedy
(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The epsilon-greedy random policy.
- At every time step, a fully uniform random exploration has probability \(\varepsilon(t)\) to happen, otherwise an exploitation is done on accumulated rewards (not means).
- Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
-
epsilon
¶
-
choice
()[source]¶ With a probability of epsilon, explore (uniform choice); otherwise exploit based on just accumulated rewards (not empirical mean rewards). See the sketch below.
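A minimal sketch of this choice rule, where cum_rewards is an illustrative array of accumulated rewards per arm:
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_choice(cum_rewards, epsilon=0.1):
    """With probability epsilon, pick a uniformly random arm (exploration);
    otherwise pick the arm with the largest accumulated reward (exploitation)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(cum_rewards)))
    return int(np.argmax(cum_rewards))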
-
choiceWithRank
(rank=1)[source]¶ With a probability of epsilon, explore (uniform choice); otherwise exploit with the rank, based on just accumulated rewards (not empirical mean rewards).
-
__module__
= 'Policies.EpsilonGreedy'¶
-
class
Policies.EpsilonGreedy.
EpsilonDecreasing
(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
The epsilon-decreasing random policy.
- \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\)
- Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
-
epsilon
¶ Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\).
-
__module__
= 'Policies.EpsilonGreedy'¶
-
Policies.EpsilonGreedy.
C
= 0.1¶ Constant C in the MEGA formula
-
Policies.EpsilonGreedy.
D
= 0.5¶ Constant D in the MEGA formula
-
Policies.EpsilonGreedy.
epsilon0
(c, d, nbArms)[source]¶ MEGA heuristic:
\[\varepsilon_0 = \frac{c K^2}{d^2 (K - 1)}.\]
-
class
Policies.EpsilonGreedy.
EpsilonDecreasingMEGA
(nbArms, c=0.1, d=0.5, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
The epsilon-decreasing random policy, using MEGA’s heuristic for a good choice of epsilon0 value.
- \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\)
- \(\varepsilon_0 = \frac{c K^2}{d^2 (K - 1)}\)
- Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
-
epsilon
¶ Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))\).
-
__module__
= 'Policies.EpsilonGreedy'¶
-
class
Policies.EpsilonGreedy.
EpsilonFirst
(nbArms, horizon, epsilon=0.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
The epsilon-first random policy. Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
epsilon
¶ 1 while \(t \leq \varepsilon_0 T\), 0 after.
-
__module__
= 'Policies.EpsilonGreedy'¶
-
-
Policies.EpsilonGreedy.
EPSILON
= 0.1¶ Default value for epsilon for
EpsilonDecreasing
-
Policies.EpsilonGreedy.
DECREASINGRATE
= 1e-06¶ Default value for the constant for the decreasing rate
-
class
Policies.EpsilonGreedy.
EpsilonExpDecreasing
(nbArms, epsilon=0.1, decreasingRate=1e-06, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
The epsilon exp-decreasing random policy.
- \(\varepsilon(t) = \varepsilon_0 \exp(-t \mathrm{decreasingRate})\).
- Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
-
__module__
= 'Policies.EpsilonGreedy'¶
-
epsilon
¶ Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 \exp(- t \tau))\).
-
Policies.EpsilonGreedy.
random
() → x in the interval [0, 1).¶
Policies.Exp3 module¶
The Exp3 randomized index policy.
Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin & C. Szepesvári & P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
-
Policies.Exp3.
UNBIASED
= True¶ self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or as unbiased estimators, \(r_t / \mathrm{trusts}_t\).
-
Policies.Exp3.
GAMMA
= 0.01¶ Default \(\gamma\) parameter.
-
class
Policies.Exp3.
Exp3
(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The Exp3 randomized index policy.
Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin & C. Szepesvári & P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
-
unbiased
= None¶ Unbiased estimators ?
-
weights
= None¶ Weights on the arms
-
gamma
¶ Constant \(\gamma_t = \gamma\).
-
trusts
¶ Update the trusts probabilities according to Exp3 formula, and the parameter \(\gamma_t\).
\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]where \(w_k(t)\) is the current weight of arm k.
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.
- With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
- But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
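A hedged sketch of one Exp3 round, following the trusts and weight-update formulas written above; weights, pulls and get_reward are illustrative names, and the update rule is taken literally from the formula above rather than from the actual implementation.
import numpy as np

rng = np.random.default_rng()

def exp3_trusts(weights, gamma):
    """Mix the normalized weights with uniform exploration, as in the trusts formula above."""
    w = weights / np.sum(weights)
    trusts = (1.0 - gamma) * w + gamma / len(w)
    return trusts / np.sum(trusts)

def exp3_step(weights, pulls, gamma, get_reward, unbiased=True):
    """One round: sample an arm from the trusts, observe a reward in [0, 1],
    then update and renormalize the weight of that arm."""
    trusts = exp3_trusts(weights, gamma)
    arm = int(rng.choice(len(weights), p=trusts))
    reward = get_reward(arm)
    pulls[arm] += 1
    r_tilde = reward / trusts[arm] if unbiased else reward   # unbiased estimator
    weights[arm] *= np.exp(r_tilde / (gamma * pulls[arm]))
    weights /= np.sum(weights)
    return arm, reward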
-
choice
()[source]¶ One random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceWithRank
(rank=1)[source]¶ Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
, and select the last one (the least probable).- Note that if not enough entries in the trust vector are non-zero, then
choice()
is called instead (rank is ignored).
-
choiceFromSubSet
(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceMultiple
(nb=1)[source]¶ Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
estimatedOrder
()[source]¶ Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
-
estimatedBestArms
(M=1)[source]¶ Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the set of the M best arms.
-
__module__
= 'Policies.Exp3'¶
-
-
class
Policies.Exp3.
Exp3WithHorizon
(nbArms, horizon, unbiased=True, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Exp3.Exp3
Exp3 with fixed gamma, \(\gamma_t = \gamma_0\), chosen with knowledge of the horizon.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
gamma
¶ Fixed temperature, small, knowing the horizon: \(\gamma_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
- Cf. Theorem 3.1 case #1 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Exp3'¶
-
-
class
Policies.Exp3.
Exp3Decreasing
(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Exp3.Exp3
Exp3 with decreasing parameter \(\gamma_t\).
-
gamma
¶ Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
- Cf. Theorem 3.1 case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Exp3'¶
-
-
class
Policies.Exp3.
Exp3SoftMix
(nbArms, gamma=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Exp3.Exp3
Another Exp3 with decreasing parameter \(\gamma_t\).
-
gamma
¶ Decreasing gamma parameter with time: \(\gamma_t = c \frac{\log(t)}{t}\) (heuristic).
- Cf. [Cesa-Bianchi & Fischer, 1998](http://dl.acm.org/citation.cfm?id=657473).
- Default value is \(c = \sqrt{\frac{\log(K)}{K}}\).
-
__module__
= 'Policies.Exp3'¶
-
-
Policies.Exp3.
DELTA
= 0.01¶ Default value for the confidence parameter delta
-
class
Policies.Exp3.
Exp3ELM
(nbArms, delta=0.01, unbiased=True, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Exp3.Exp3
A variant of Exp3, apparently designed to work better in stochastic environments.
- Reference: [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin & C. Szepasvari & P. Auer & Y. Abbasi-Adkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
-
delta
= None¶ Confidence parameter, given in input
-
B
= None¶ Constant B given by \(B = 4 (e - 2) (2 \log K + \log(2 / \delta))\).
-
availableArms
= None¶ Set of available arms, starting from all arms, and it can get reduced at each step.
-
varianceTerm
= None¶ Estimated variance term, for each arm.
-
getReward
(arm, reward)[source]¶ Get reward and update the weights, as in Exp3, but also update the variance term \(V_k(t)\) for all arms, and the set of available arms \(\mathcal{A}(t)\), by removing arms whose empirical accumulated reward and variance term satisfy a certain inequality.
\[\begin{split}a^*(t+1) &= \arg\max_a \hat{R}_{a}(t+1), \\ V_k(t+1) &= V_k(t) + \frac{1}{\mathrm{trusts}_k(t+1)}, \\ \mathcal{A}(t+1) &= \mathcal{A}(t) \setminus \left\{ a : \hat{R}_{a^*(t+1)}(t+1) - \hat{R}_{a}(t+1) > \sqrt{B (V_{a^*(t+1)}(t+1) + V_{a}(t+1))} \right\}.\end{split}\]
-
trusts
¶ Update the trusts probabilities according to Exp3ELM formula, and the parameter \(\gamma_t\).
\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - |\mathcal{A}_t| \gamma_t) w_k(t) + \gamma_t, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]where \(w_k(t)\) is the current weight of arm k.
-
__module__
= 'Policies.Exp3'¶
-
gamma
¶ Decreasing gamma with time: \(\gamma_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
- Cf. Theorem 3.1 case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
Policies.Exp3PlusPlus module¶
The EXP3++ randomized index policy, an improved version of the EXP3 policy.
Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].
See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].
-
Policies.Exp3PlusPlus.
ALPHA
= 3¶ Value for the \(\alpha\) parameter.
-
Policies.Exp3PlusPlus.
BETA
= 256¶ Value for the \(\beta\) parameter.
-
class
Policies.Exp3PlusPlus.
Exp3PlusPlus
(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The EXP3++ randomized index policy, an improved version of the EXP3 policy.
Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].
See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].
-
alpha
= None¶ \(\alpha\) parameter for computations of \(\xi_t(a)\).
-
beta
= None¶ \(\beta\) parameter for computations of \(\xi_t(a)\).
-
weights
= None¶ Weights on the arms
-
losses
= None¶ Cumulative sum of loss estimates for each arm
-
unweighted_losses
= None¶ Cumulative sum of unweighted losses for each arm
-
eta
¶ Decreasing sequence of learning rates, given by \(\eta_t = \frac{1}{2} \sqrt{\frac{\log K}{t K}}\).
-
gamma
¶ Constant \(\gamma_t = \gamma\).
-
gap_estimate
¶ Compute the gap estimate \(\widehat{\Delta}^{\mathrm{LCB}}_t(a)\) from:
- Compute the UCB: \(\mathrm{UCB}_t(a) = \min\left( 1, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} + \sqrt{\frac{a \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),
- Compute the LCB: \(\mathrm{LCB}_t(a) = \max\left( 0, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} - \sqrt{\frac{a \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),
- Then the gap: \(\widehat{\Delta}^{\mathrm{LCB}}_t(a) = \max\left( 0, \mathrm{LCB}_t(a) - \min_{a'} \mathrm{UCB}_t(a') \right)\).
- The gap should be in \([0, 1]\).
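A hedged sketch of this gap estimate; I read the constant inside the square root as the \(\alpha\) parameter (an assumption), and cum_losses and pulls are illustrative names standing for \(\widehat{L}_{t-1}(a)\) and \(N_{t-1}(a)\).
import numpy as np

def exp3pp_gap_estimate(cum_losses, pulls, t, alpha=3.0):
    """LCB-based gap estimates, following the three steps above."""
    K = len(pulls)
    means = cum_losses / pulls
    radius = np.sqrt(alpha * np.log(t * K ** (1.0 / alpha)) / (2.0 * pulls))
    ucb = np.minimum(1.0, means + radius)
    lcb = np.maximum(0.0, means - radius)
    return np.maximum(0.0, lcb - np.min(ucb))   # gaps, each in [0, 1]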
-
xi
¶ Compute the \(\xi_t(a) = \frac{\beta \log t}{t \widehat{\Delta}^{\mathrm{LCB}}_t(a)^2}\) vector of indexes.
-
epsilon
¶ Compute the vector of parameters \(\eta_t(a) = \min\left(\frac{1}{2 K}, \frac{1}{2} \sqrt{\frac{\log K}{t K}}, \xi_t(a) \right)\).
-
trusts
¶ Update the trusts probabilities according to Exp3PlusPlus formula, and the parameter \(\eta_t\).
\[\begin{split}\tilde{\rho}'_{t+1}(a) &= (1 - \sum_{a'=1}^{K}\eta_t(a')) w_t(a) + \eta_t(a), \\ \tilde{\rho}_{t+1} &= \tilde{\rho}'_{t+1} / \sum_{a=1}^{K} \tilde{\rho}'_{t+1}(a).\end{split}\]where \(\rho_t(a)\) is the current weight of arm a.
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate losses on that arm a, then update the weight \(\rho_t(a)\) and renormalize the weights.
- Divide by the trust on that arm a, i.e., the probability of observing arm a: \(\tilde{l}_t(a) = \frac{l_t(a)}{\tilde{\rho}_t(a)} 1(A_t = a)\).
- Add this loss to the cumulative loss: \(\tilde{L}_t(a) := \tilde{L}_{t-1}(a) + \tilde{l}_t(a)\).
- But the un-weighted loss is added to the other cumulative loss: \(\widehat{L}_t(a) := \widehat{L}_{t-1}(a) + l_t(a) 1(A_t = a)\).
\[\begin{split}\rho'_{t+1}(a) &= \exp\left( - \tilde{L}_t(a) \eta_t \right) \\ \rho_{t+1} &= \rho'_{t+1} / \sum_{a=1}^{K} \rho'_{t+1}(a).\end{split}\]
-
choice
()[source]¶ One random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceWithRank
(rank=1)[source]¶ Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
, and select the last one (the least probable).- Note that if not enough entries in the trust vector are non-zero, then
choice()
is called instead (rank is ignored).
-
choiceFromSubSet
(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceMultiple
(nb=1)[source]¶ Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
estimatedOrder
()[source]¶ Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
-
estimatedBestArms
(M=1)[source]¶ Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the set of the M best arms.
-
__module__
= 'Policies.Exp3PlusPlus'¶
-
Policies.Exp3R module¶
The Drift-Detection algorithm for non-stationary bandits.
Reference: [[“EXP3 with Drift Detection for the Switching Bandit Problem”, Robin Allesiardo & Raphael Feraud]](https://www.researchgate.net/profile/Allesiardo_Robin/publication/281028960_EXP3_with_Drift_Detection_for_the_Switching_Bandit_Problem/links/55d1927808aee19936fdac8e.pdf)
It runs on top of a simple policy like
Exp3
, andDriftDetection_IndexPolicy
is a wrapper:>>> policy = DriftDetection_IndexPolicy(nbArms, C=1) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
It works on Exp3
or other parametrizations of the Exp3 policy, e.g., Exp3PlusPlus
.
-
Policies.Exp3R.
VERBOSE
= False¶ Whether to be verbose when doing the search for valid parameter \(\ell\).
-
Policies.Exp3R.
CONSTANT_C
= 1.0¶ The constant \(C\) used in Corollary 1 of paper [[“EXP3 with Drift Detection for the Switching Bandit Problem”, Robin Allesiardo & Raphael Feraud]](https://www.researchgate.net/profile/Allesiardo_Robin/publication/281028960_EXP3_with_Drift_Detection_for_the_Switching_Bandit_Problem/links/55d1927808aee19936fdac8e.pdf).
-
class
Policies.Exp3R.
DriftDetection_IndexPolicy
(nbArms, H=None, delta=None, C=1.0, horizon=None, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
The Drift-Detection generic policy for non-stationary bandits, using a custom Drift-Detection test, for 1-dimensional exponential families.
- From [[“EXP3 with Drift Detection for the Switching Bandit Problem”, Robin Allesiardo & Raphael Feraud]](https://www.researchgate.net/profile/Allesiardo_Robin/publication/281028960_EXP3_with_Drift_Detection_for_the_Switching_Bandit_Problem/links/55d1927808aee19936fdac8e.pdf).
-
__init__
(nbArms, H=None, delta=None, C=1.0, horizon=None, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]¶ New policy.
-
H
= None¶ Parameter \(H\) for the Drift-Detection algorithm. Default value is \(\lceil C \sqrt{T \log(T)} \rceil\), for some constant \(C=\)
C
(=CONSTANT_C
by default).
-
delta
= None¶ Parameter \(\delta\) for the Drift-Detection algorithm. Default value is \(\sqrt{\frac{\log(T)}{K T}}\) for \(K\) arms and horizon \(T\).
-
proba_random_exploration
¶ Parameter \(\gamma\) for the Exp3 algorithm.
-
threshold_h
¶ Parameter \(\varepsilon\) for the Drift-Detection algorithm.
\[\varepsilon = \sqrt{\frac{K \log(\frac{1}{\delta})}{2 \gamma H}}.\]
-
min_number_of_pulls_to_test_change
¶ Compute \(\Gamma_{\min}(I) := \frac{\gamma H}{K}\), the minimum number of samples we should have for all arms before testing for a change.
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using a Drift-Detection test (DD).
\[\begin{split}k_{\max} &:= \arg\max_k \tilde{\rho}_k(t),\\ DD_t(k) &= \hat{\mu}_k(I) - \hat{\mu}_{k_{\max}}(I).\end{split}\]- The change is detected if there is an arm \(k\) such that \(DD_t(k) \geq 2 * \varepsilon = h\), where
threshold_h
is the threshold of the test, and \(I\) is the (number of the) current interval since the last (global) restart, - where \(\tilde{\rho}_k(t)\) is the trust probability of arm \(k\) from the Exp3 algorithm,
- and where \(\hat{\mu}_k(I)\) is the empirical mean of arm \(k\) from the data in the current interval.
Warning
FIXME I know this implementation is not (yet) correct… I should count differently the samples obtained from the Gibbs distribution (when Exp3 uses the trust vector) and from the uniform distribution. This \(\Gamma_{\min}(I)\) is the minimum number of samples obtained from the uniform exploration (of probability \(\gamma\)). It seems painful to code correctly; I will do it later.
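A simplified, hedged sketch of this test, ignoring the caveat in the warning above about which samples should be counted; interval_means and trusts are illustrative names for \(\hat{\mu}_k(I)\) and \(\tilde{\rho}_k(t)\).
import numpy as np

def drift_detection_test(interval_means, trusts, epsilon):
    """Flag a change if some arm's interval mean exceeds the mean of the
    most-trusted arm by at least 2 * epsilon, as in the formula above."""
    interval_means = np.asarray(interval_means, dtype=float)
    k_max = int(np.argmax(trusts))                  # most trusted arm of Exp3
    dd = interval_means - interval_means[k_max]     # DD_t(k) for every arm
    return bool(np.any(dd >= 2.0 * epsilon))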
-
__module__
= 'Policies.Exp3R'¶
-
class
Policies.Exp3R.
Exp3R
(nbArms, policy=<class 'Policies.Exp3.Exp3'>, *args, **kwargs)[source]¶ Bases:
Policies.Exp3R.DriftDetection_IndexPolicy
The Exp3.R policy for non-stationary bandits.
-
__module__
= 'Policies.Exp3R'¶
-
-
class
Policies.Exp3R.
Exp3RPlusPlus
(nbArms, policy=<class 'Policies.Exp3PlusPlus.Exp3PlusPlus'>, *args, **kwargs)[source]¶ Bases:
Policies.Exp3R.DriftDetection_IndexPolicy
The Exp3.R++ policy for non-stationary bandits.
-
__init__
(nbArms, policy=<class 'Policies.Exp3PlusPlus.Exp3PlusPlus'>, *args, **kwargs)[source]¶ New policy.
-
__module__
= 'Policies.Exp3R'¶
-
Policies.Exp3S module¶
The historical Exp3.S algorithm for non-stationary bandits.
Reference: [[“The nonstochastic multiarmed bandit problem”, P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire, SIAM journal on computing, 2002]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8735&rep=rep1&type=pdf)
It is a simple extension of the
Exp3
policy:>>> policy = Exp3S(nbArms, C=1) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
-
class
Policies.Exp3S.
Exp3S
(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]¶ Bases:
Policies.Exp3.Exp3
The historical Exp3.S algorithm for non-stationary bandits.
- Reference: [[“The nonstochastic multiarmed bandit problem”, P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire, SIAM journal on computing, 2002]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8735&rep=rep1&type=pdf)
-
__init__
(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]¶ New policy.
-
weights
= None¶ Weights on the arms
-
gamma
¶ Constant \(\gamma_t = \gamma\).
-
alpha
¶ Constant \(\alpha_t = \alpha\).
-
trusts
¶ Update the trusts probabilities according to Exp3 formula, and the parameter \(\gamma_t\).
\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]where \(w_k(t)\) is the current weight of arm k.
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.
- With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
- But with biased estimators, \(\tilde{r}_k(t) = r_k(t)\).
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
-
__module__
= 'Policies.Exp3S'¶
Policies.ExploreThenCommit module¶
Different variants of the Explore-Then-Commit policy.
- Reference: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
- And [Kaufmann & Moy, 2017, ICC](http://icc2017.ieee-icc.org/program/tutorials#TT01), E.Kaufmann’s slides at IEEE ICC 2017
- See also: https://github.com/SMPyBandits/SMPyBandits/issues/62 and https://github.com/SMPyBandits/SMPyBandits/issues/102
- Also [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf)
Warning
They sometimes do not work empirically as well as the theory predicts…
Warning
TODO: I should refactor all this code and write all the variants in a more unified way…
-
Policies.ExploreThenCommit.
GAP
= 0.1¶ Default value for the gap, \(\Delta = \min_{i\neq j} \mu_i - \mu_j\), \(\Delta = 0.1\) as in many basic experiments.
-
class
Policies.ExploreThenCommit.
ETC_KnownGap
(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap
= None¶ Known gap parameter for the stopping rule.
-
max_t
= None¶ Time until pure exploitation,
m_
steps in each arm.
-
epsilon
¶ 1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by:
\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
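A minimal sketch of this exploration length; the max(0, ...) clamp is my addition, to guard against a negative logarithm when \(T \Delta^2 < 4\).
import math

def etc_known_gap_T0(horizon, gap):
    """Exploration length T_0 from the formula above (clamped at 0)."""
    return max(0, int(math.floor((4.0 / gap ** 2) * math.log(horizon * gap ** 2 / 4.0))))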
-
__module__
= 'Policies.ExploreThenCommit'¶
-
-
Policies.ExploreThenCommit.
ALPHA
= 4¶ Default value for parameter \(\alpha\) for
ETC_RandomStop
-
class
Policies.ExploreThenCommit.
ETC_RandomStop
(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and random stopping time. Uniform exploration until the stopping time.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
alpha
= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
stillRandom
= None¶ Still randomly exploring?
-
epsilon
¶ 1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by:
\[\tau = \inf\{ t \in\mathbb{N},\; \max_{i \neq j} \| \widehat{X_i}(t) - \widehat{X_j}(t) \| > \sqrt{\frac{4 \log(T/t)}{t}} \}.\]
-
__module__
= 'Policies.ExploreThenCommit'¶
-
-
class
Policies.ExploreThenCommit.
ETC_FixedBudget
(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
- Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 1.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap
= None¶ Known gap parameter for the stopping rule.
-
max_t
= None¶ Time until pure exploitation.
-
round_robin_index
= None¶ Internal index to keep track of the Round-Robin phase
-
best_identified_arm
= None¶ Arm on which we commit, not defined in the beginning.
-
choice
()[source]¶ For n rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with highest empirical average.
\[n = \lfloor \frac{2}{\Delta^2} \mathcal{W}(\frac{T^2 \Delta^4}{32 \pi}) \rfloor.\]- Where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(W(y) \exp(W(y)) = y\) for any \(y > 0\) (and computed with
scipy.special.lambertw()
).
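A minimal sketch of this budget computation, using the real branch of scipy.special.lambertw (the function name etc_fixed_budget_n is illustrative):
import math
from scipy.special import lambertw

def etc_fixed_budget_n(horizon, gap):
    """Number of Round-Robin rounds n from the formula above."""
    w = lambertw(horizon ** 2 * gap ** 4 / (32.0 * math.pi)).real
    return int(math.floor((2.0 / gap ** 2) * w))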
-
epsilon
¶ 1 while \(t \leq n\), 0 after.
-
__module__
= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.
_ETC_RoundRobin_WithStoppingCriteria
(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.EpsilonGreedy.EpsilonGreedy
Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
- Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 2 and 3.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap
= None¶ Known gap parameter for the stopping rule.
-
round_robin_index
= None¶ Internal index to keep track of the Round-Robin phase
-
best_identified_arm
= None¶ Arm on which we commit, not defined in the beginning.
-
choice
()[source]¶ Choose each arm sequentially in a Round-Robin phase, as long as the following criterion is not satisfied, then commit to the arm with the highest empirical average.
\[(t/2) \max_{i \neq j} |\hat{\mu_i} - \hat{\mu_j}| < \log(T \Delta^2).\]
-
epsilon
¶ 1 while not fixed, 0 after.
-
__module__
= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.
ETC_SPRT
(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
- Very similar to
ETC_RandomStop
, but with a sequential exploration until the stopping time. - Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 2.
-
__module__
= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.
ETC_BAI
(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\).
- Very similar to
ETC_RandomStop
, but with a sequential exploration until the stopping time. - Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 3.
-
alpha
= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
__module__
= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.
DeltaUCB
(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
- Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 4.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap
= None¶ Known gap parameter for the stopping rule.
-
alpha
= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
epsilon_T
= None¶ Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).
-
choice
()[source]¶ Choose between the most chosen and the least chosen arm, based on the following criterion:
\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t).\end{split}\]\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log(\frac{T}{N_{A_{t,\min}}})}{N_{A_{t,\min}}}} \\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T\end{split}\]\[\begin{split}A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\end{split}\]
-
__module__
= 'Policies.ExploreThenCommit'¶
Policies.FEWA module¶
author: Julien Seznec
Filtering on Expanding Window Algorithm for rotting bandits.
Reference: [Seznec et al., 2019a] Rotting bandits are not harder than stochastic ones; Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko ; Proceedings of Machine Learning Research, PMLR 89:2564-2572, 2019. http://proceedings.mlr.press/v89/seznec19a.html https://arxiv.org/abs/1811.11043 (updated version)
Reference : [Seznec et al., 2019b] A single algorithm for both rested and restless rotting bandits (WIP) Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko
-
class
Policies.FEWA.
EFF_FEWA
(nbArms, alpha=0.06, subgaussian=1, m=None, delta=None, delay=False)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
Efficient Filtering on Expanding Window Average. Efficient trick described in [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043] (m=2) and [Seznec et al., 2019b, WIP] (m<=2). We use the confidence level \(\delta_t = \frac{1}{t^\alpha}\).
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
__module__
= 'Policies.FEWA'¶
-
-
class
Policies.FEWA.
FEWA
(nbArms, subgaussian=1, alpha=4, delta=None)[source]¶ Bases:
Policies.FEWA.EFF_FEWA
Filtering on Expanding Window Average. Reference: [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043]. FEWA is equivalent to EFF_FEWA for \(m < 1+1/T\) [Seznec et al., 2019b, WIP]. This implementation is valid for \(T < 10^{15}\). For \(T > 10^{15}\), FEWA will have time and memory issues as its time and space complexity is \(\mathcal{O}(KT)\) per round.
-
__module__
= 'Policies.FEWA'¶
-
Policies.GLR_UCB module¶
The GLR-UCB policy and variants, for non-stationary bandits.
Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
It runs on top of a simple policy, e.g.,
UCB
, and BernoulliGLR_IndexPolicy
is a wrapper:
>>> policy = BernoulliGLR_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
It can only work on basic index policy based on empirical averages (and an exploration bias), like UCB
, and cannot work on any Bayesian policy (for which we would have to store all the previous observations in order to restart with a reduced history)!
-
Policies.GLR_UCB.
VERBOSE
= False¶ Whether to be verbose when doing the change detection algorithm.
-
Policies.GLR_UCB.
PROBA_RANDOM_EXPLORATION
= 0.1¶ Default probability of random exploration \(\alpha\).
-
Policies.GLR_UCB.
PER_ARM_RESTART
= True¶ Should we reset one arm's empirical average or all of them? Default is
True
, it’s usually more efficient!
-
Policies.GLR_UCB.
FULL_RESTART_WHEN_REFRESH
= False¶ Should we fully restart the algorithm or simply reset one arm's empirical average? Default is
False
, it’s usually more efficient!
-
Policies.GLR_UCB.
LAZY_DETECT_CHANGE_ONLY_X_STEPS
= 10¶ XXX Be lazy and try to detect changes only every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. Default value is 0, to not use this feature, and 10 should speed up the test by about x10.
-
Policies.GLR_UCB.
LAZY_TRY_VALUE_S_ONLY_X_STEPS
= 10¶ XXX Be lazy and try to detect changes for \(s\) taking steps of size
steps_s
. Default is to havesteps_s=1
, but only usingsteps_s=2
should already speed up by 2. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173 Default value is 1, to not use this feature, and 10 should speed up the test by x10.
-
Policies.GLR_UCB.
USE_LOCALIZATION
= True¶ Default value of
use_localization
for policies. All the experiments I tried showed that the localization always helps improving learning, so the default value is set to True.
-
Policies.GLR_UCB.
eps
= 1e-10¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]
-
Policies.GLR_UCB.
klBern
(x, y)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
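A minimal stand-alone sketch of this divergence (the package's own kullback.klBern() also truncates x and y to [eps, 1 - eps], as the eps constant above indicates):
from math import log

eps = 1e-10  # same truncation constant as above

def klBern_sketch(x, y):
    """kl(B(x), B(y)) = x*log(x/y) + (1-x)*log((1-x)/(1-y)), with x, y truncated to [eps, 1-eps]."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * log(x / y) + (1 - x) * log((1 - x) / (1 - y))

print(klBern_sketch(0.5, 0.5))  # 0.0
print(klBern_sketch(0.9, 0.2))  # large divergence, about 1.15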
-
Policies.GLR_UCB.
klGauss
(x, y, sig2x=1)[source]¶ Kullback-Leibler divergence for Gaussian distributions of means
x
andy
and variancessig2x
andsig2y
, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties
-
Policies.GLR_UCB.
threshold_GaussianGLR
(t, horizon=None, delta=None, variant=None)[source]¶ Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
- The threshold is computed as (with \(t_0 = 0\)):
\[\beta(t_0, t, \delta) := \left(1 + \frac{1}{t - t_0 + 1}\right) 2 \log\left(\frac{2 (t - t_0) \sqrt{(t - t_0) + 2}}{\delta}\right).\]
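A minimal sketch of this threshold with \(t_0 = 0\) (a hypothetical local function, for illustration only):
from math import log, sqrt

def beta_gaussian_glr(t, delta):
    """beta(0, t, delta) = (1 + 1/(t+1)) * 2 * log(2 * t * sqrt(t + 2) / delta)."""
    return (1.0 + 1.0 / (t + 1.0)) * 2.0 * log(2.0 * t * sqrt(t + 2.0) / delta)

print(beta_gaussian_glr(t=1000, delta=0.05))  # threshold used by the Gaussian GLR test at time t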
-
Policies.GLR_UCB.
function_h_minus_one
(x)[source]¶ The inverse function of \(h(u)\), that is \(h^{-1}(x) = u \Leftrightarrow h(u) = x\). It is given by the Lambert W function, see
scipy.special.lambertw()
:\[h^{-1}(x) = - \mathcal{W}(- \exp(-x)).\]- Example:
>>> np.random.seed(105) >>> y = np.random.randn() ** 2 >>> print(f"y = {y}") y = 0.060184682907834595 >>> x = function_h(y) >>> print(f"h(y) = {x}") h(y) = 2.8705220786966508 >>> z = function_h_minus_one(x) >>> print(f"h^-1(x) = {z}") h^-1(x) = 0.060184682907834595 >>> assert np.isclose(z, y), "Error: h^-1(h(y)) = z = {z} should be very close to y = {}...".format(z, y)
-
Policies.GLR_UCB.
constant_power_function_h
= 1.5¶ The constant \(\frac{3}{2}\), used in the definition of functions \(h\), \(h^{-1}\), \(\tilde{h}\) and \(\mathcal{T}\).
-
Policies.GLR_UCB.
threshold_function_h_tilde
= 3.801770285137458¶ The constant \(h^{-1}(1/\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).
-
Policies.GLR_UCB.
constant_function_h_tilde
= -0.90272045571788¶ The constant \(\log(\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).
-
Policies.GLR_UCB.
function_h_tilde
(x)[source]¶ The function \(\tilde{h}(x)\), defined by:
\[\begin{split}\tilde{h}(x) = \begin{cases} e^{1/h^{-1}(x)} h^{-1}(x) & \text{ if } x \ge h^{-1}(1/\ln (3/2)), \\ (3/2) (x-\ln \ln (3/2)) & \text{otherwise}. \end{cases}\end{split}\]
-
Policies.GLR_UCB.
zeta_of_two
= 1.6449340668482264¶ The constant \(\zeta(2) = \frac{\pi^2}{6}\).
-
Policies.GLR_UCB.
function_T_mathcal
(x)[source]¶ The function \(\mathcal{T}(x)\), defined by:
\[\mathcal{T}(x) = 2 \tilde h\left(\frac{h^{-1}(1+x) + \ln(2\zeta(2))}{2}\right).\]
-
Policies.GLR_UCB.
approximation_function_T_mathcal
(x)[source]¶ An efficiently computed approximation of \(\mathcal{T}(x)\), valid for \(x \geq 5\):
\[\mathcal{T}(x) \simeq x + 4 \log(1 + x + \sqrt{2 x}).\]
-
Policies.GLR_UCB.
threshold_BernoulliGLR
(t, horizon=None, delta=None, variant=None)[source]¶ Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
Warning
This is still experimental, you can try different variants of the threshold function:
- Variant #0 (default) is:
\[\beta(t, \delta) := \log\left(\frac{3 t^{3/2}}{\delta}\right) = \log(\frac{1}{\delta}) + \log(3) + 3/2 \log(t).\]- Variant #1 is smaller:
\[\beta(t, \delta) := \log(\frac{1}{\delta}) + \log(1 + \log(t)).\]- Variant #2 is using \(\mathcal{T}\):
\[\beta(t, \delta) := 2 \mathcal{T}\left(\frac{\log(2 t^{3/2}) / \delta}{2}\right) + 6 \log(1 + \log(t)).\]- Variant #3 is using \(\tilde{\mathcal{T}}(x) = x + 4 \log(1 + x + \sqrt{2x})\) an approximation of \(\mathcal{T}(x)\) (valid and quite accurate as soon as \(x \geq 5\)):
\[\beta(t, \delta) := 2 \tilde{\mathcal{T}}\left(\frac{\log(2 t^{3/2}) / \delta}{2}\right) + 6 \log(1 + \log(t)).\]
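A minimal sketch of the default variant #0 (a hypothetical helper, for illustration only):
from math import log

def beta_bernoulli_glr(t, delta):
    """Variant #0: beta(t, delta) = log(3 * t**1.5 / delta) = log(1/delta) + log(3) + 1.5*log(t)."""
    return log(3.0 * t**1.5 / delta)

# The two equivalent writings of the formula indeed agree:
t, delta = 1000, 0.01
assert abs(beta_bernoulli_glr(t, delta) - (log(1 / delta) + log(3) + 1.5 * log(t))) < 1e-9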
-
Policies.GLR_UCB.
EXPONENT_BETA
= 1.01¶ The default value of parameter \(\beta\) for the function
decreasing_alpha__GLR()
.
-
Policies.GLR_UCB.
ALPHA_T1
= 0.05¶ The default value of parameter \(\alpha_{t=1}\) for the function
decreasing_alpha__GLR()
.
-
Policies.GLR_UCB.
decreasing_alpha__GLR
(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05)[source]¶ Either use a fixed alpha, or compute it with an exponential decay (if
alpha0=None
).Note
I am currently exploring the following variant (November 2018):
- The probability of uniform exploration, \(\alpha\), is computed as a function of the current time:
\[\forall t>0, \alpha = \alpha_t := \alpha_{t=1} \frac{1}{\max(1, t^{\beta})}.\]- with \(\beta > 1, \beta\) =
exponentBeta
(=1.01) and \(\alpha_{t=1} < 1, \alpha_{t=1}\) =alpha_t1
(=0.05). - the only requirement on \(\alpha_t\) seems to be that \(\sum_{t=1}^T \alpha_t < +\infty\) (i.e., be finite), which is the case for \(\alpha_t = \alpha = \frac{1}{T}\), but also any \(\alpha_t = \frac{\alpha_1}{t^{\beta}}\) for any \(\beta>1\) (cf. Riemann series).
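A minimal sketch of this decaying exploration probability (stand-alone illustration; the defaults mirror EXPONENT_BETA and ALPHA_T1 above):
def decreasing_alpha_sketch(t, alpha0=None, exponentBeta=1.01, alpha_t1=0.05):
    """Either a fixed alpha0, or alpha_t = alpha_t1 / max(1, t**exponentBeta)."""
    if alpha0 is not None:
        return alpha0
    return alpha_t1 / max(1.0, float(t)**exponentBeta)

print(decreasing_alpha_sketch(1))                 # 0.05 at t = 1
print(decreasing_alpha_sketch(1000))              # much smaller at t = 1000
print(decreasing_alpha_sketch(1000, alpha0=0.1))  # a fixed value if alpha0 is given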
-
Policies.GLR_UCB.
smart_delta_from_T_UpsilonT
(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1)[source]¶ Compute a smart estimate of the optimal value for the confidence level \(\delta\), with
scaleFactor
\(= \delta_0\in(0,1)\) a constant.- If
per_arm_restart
is True (Local option):
\[\delta = \frac{\delta_0}{\sqrt{K \Upsilon_T T}}.\]- If
per_arm_restart
is False (Global option):
\[\delta = \frac{\delta_0}{\sqrt{\Upsilon_T T}}.\]Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).
-
Policies.GLR_UCB.
smart_alpha_from_T_UpsilonT
(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1)[source]¶ Compute a smart estimate of the optimal value for the fixed or random forced exploration probability \(\alpha\) (or tracking based), with
scaleFactor
\(= \alpha_0\in(0,1)\) a constant.- If
per_arm_restart
is True (Local option):
\[\alpha = \alpha_0 \times \sqrt{\frac{K \Upsilon_T}{T} \log(T)}.\]- If
per_arm_restart
is False (Global option):
\[\alpha = \alpha_0 \times \sqrt{\frac{\Upsilon_T}{T} \log(T)}.\]Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).
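A minimal sketch of these two tuning rules in their Local (per-arm restart) version (hypothetical helpers mirroring the formulas above; \(\Upsilon_T\) is max_nb_random_events):
from math import log, sqrt

def smart_delta_local(horizon, max_nb_random_events=1, nbArms=1, delta0=1.0):
    """delta = delta0 / sqrt(K * Upsilon_T * T) (Local option)."""
    return delta0 / sqrt(nbArms * max_nb_random_events * horizon)

def smart_alpha_local(horizon, max_nb_random_events=1, nbArms=1, alpha0=0.1):
    """alpha = alpha0 * sqrt(K * Upsilon_T / T * log(T)) (Local option)."""
    return alpha0 * sqrt(nbArms * max_nb_random_events / horizon * log(horizon))

print(smart_delta_local(horizon=10000, max_nb_random_events=5, nbArms=3))
print(smart_alpha_local(horizon=10000, max_nb_random_events=5, nbArms=3))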
-
class
Policies.GLR_UCB.
GLR_IndexPolicy
(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
The GLR-UCB generic policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for 1-dimensional exponential families.
- It works for any 1-dimensional exponential family, you just have to give a
kl
function. - For instance
kullback.klBern()
, for Bernoulli distributions, givesBernoulliGLR_IndexPolicy
, - And
kullback.klGauss()
for univariate Gaussian distributions, givesGaussianGLR_IndexPolicy
. threshold_function
computes the threshold \(\beta(t, \delta)\), it can be for instancethreshold_GaussianGLR()
orthreshold_BernoulliGLR()
.- From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__init__
(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
horizon
= None¶ The horizon \(T\).
-
max_nb_random_events
= None¶ The number of breakpoints \(\Upsilon_T\).
-
use_localization
= None¶ Experimental option: use localization of the break-point, i.e., restart the memory of the arm by keeping observations s+1…n instead of just the last one
-
delta
= None¶ The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{\sqrt{T}}\) if
horizon
is given anddelta=None
but \(\Upsilon_T\) is unknown. Defaults to \(\delta=\frac{1}{\sqrt{\Upsilon_T T}}\) if both \(T\) and \(\Upsilon_T\) are given (horizon
andmax_nb_random_events
).
-
kl
= None¶ The parametrized Kullback-Leibler divergence (\(\mathrm{kl}(x,y) = KL(D(x),D(y))\)) for the 1-dimensional exponential family \(x\mapsto D(x)\). Example:
kullback.klBern()
orkullback.klGauss()
.
-
lazy_try_value_s_only_x_steps
= None¶ Be lazy and try to detect changes for \(s\) taking steps of size
steps_s
.
-
proba_random_exploration
¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
getReward
(arm, reward)[source]¶ Do as
CD_UCB
to handle the new reward, and also, update the internal times of each arm for the indexes ofklUCB_forGLR
(or other index policies), which use \(f(t - \tau_i(t))\) for the exploration function of each arm \(i\) at time \(t\), where \(\tau_i(t)\) denotes the (last) restart time of the arm.
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using the Generalized Likelihood Ratio test (GLR) and the
kl
function.- For each time step \(s\) between \(t_0=0\) and \(t\), compute:
\[G^{\mathrm{kl}}_{t_0:s:t} = (s-t_0+1) \mathrm{kl}(\mu_{t_0,s}, \mu_{t_0,t}) + (t-s) \mathrm{kl}(\mu_{s+1,t}, \mu_{t_0,t}).\]- The change is detected if there is a time \(s\) such that \(G^{\mathrm{kl}}_{t_0:s:t} > h\), where
threshold_h
is the threshold of the test, - And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).
Warning
This is computationally costly, so an easy way to speed up this test is to use
lazy_try_value_s_only_x_steps
\(= \mathrm{Step_s}\) for a small value (e.g., 10), so not test for all \(s\in[t_0, t-1]\) but only \(s\in[t_0, t-1], s \mod \mathrm{Step_s} = 0\) (e.g., one out of every 10 steps).
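The scan over \(s\) can be sketched in a few lines of numpy; this is a simplified stand-alone illustration with \(t_0 = 0\), a Bernoulli kl and a fixed threshold, not the actual method (which also supports other kl functions and the tuned threshold \(\beta(t, \delta)\)):
import numpy as np

def klBern(x, y, eps=1e-10):
    x, y = np.clip(x, eps, 1 - eps), np.clip(y, eps, 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def detect_change_glr(rewards, threshold_h, step_s=10):
    """True if max_s (s+1)*kl(mu_{0,s}, mu_{0,t}) + (t-s)*kl(mu_{s+1,t}, mu_{0,t}) > threshold_h."""
    rewards = np.asarray(rewards, dtype=float)
    t = len(rewards) - 1
    mu_0t = rewards.mean()
    for s in range(0, t, step_s):  # lazy scan: only test s = 0, step_s, 2*step_s, ...
        mu_0s = rewards[:s + 1].mean()
        mu_s1t = rewards[s + 1:].mean()
        glr = (s + 1) * klBern(mu_0s, mu_0t) + (t - s) * klBern(mu_s1t, mu_0t)
        if glr > threshold_h:
            return True
    return False

rng = np.random.default_rng(0)
data = np.concatenate([rng.binomial(1, 0.2, 500), rng.binomial(1, 0.8, 200)])
print(detect_change_glr(data, threshold_h=np.log(3 * len(data)**1.5 / 0.01)))  # True: a change is detected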
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
GLR_IndexPolicy_WithTracking
(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy
A variant of the GLR policy where the exploration is not forced to be uniformly random but is based on a tracking of the arms that haven’t been explored enough.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
choice
()[source]¶ If any arm is not explored enough (\(n_k \leq \frac{\alpha}{K} \times (t - n_k)\)), play uniformly at random one of these arms; otherwise, pass the call to
choice()
of the underlying policy.
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
GLR_IndexPolicy_WithDeterministicExploration
(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy
A variant of the GLR policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
- If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
choice
()[source]¶ For some time steps, play uniformly at random one of these arms, otherwise, pass the call to
choice()
of the underlying policy.
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
GaussianGLR_IndexPolicy
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy
The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (ie, \(\sigma^2\) known and fixed).
-
__init__
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ New policy.
-
_sig2
= None¶ Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to
kullback.klGauss()
. Default to \(\sigma^2 = \frac{1}{4}\).
-
__module__
= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.
GaussianGLR_IndexPolicy_WithTracking
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking
,Policies.GLR_UCB.GaussianGLR_IndexPolicy
A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.
-
__module__
= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.
GaussianGLR_IndexPolicy_WithDeterministicExploration
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration
,Policies.GLR_UCB.GaussianGLR_IndexPolicy
A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
-
__module__
= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.
BernoulliGLR_IndexPolicy
(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy
The BernoulliGLR-UCB policy for non-stationary bandits, for Bernoulli distributions.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__init__
(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ New policy.
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
BernoulliGLR_IndexPolicy_WithTracking
(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking
,Policies.GLR_UCB.BernoulliGLR_IndexPolicy
A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
BernoulliGLR_IndexPolicy_WithDeterministicExploration
(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration
,Policies.GLR_UCB.BernoulliGLR_IndexPolicy
A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
OurGaussianGLR_IndexPolicy
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy
The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (ie, \(\sigma^2\) known and fixed), but with our threshold designed for the sub-Bernoulli case.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__init__
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ New policy.
-
_sig2
= None¶ Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to
kullback.klGauss()
. Default to \(\sigma^2 = \frac{1}{4}\).
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
OurGaussianGLR_IndexPolicy_WithTracking
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking
,Policies.GLR_UCB.OurGaussianGLR_IndexPolicy
A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough, but with our threshold designed for the sub-Bernoulli case.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__module__
= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.
OurGaussianGLR_IndexPolicy_WithDeterministicExploration
(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration
,Policies.GLR_UCB.OurGaussianGLR_IndexPolicy
A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed, but with our threshold designed for the sub-Bernoulli case.
- Reference: [[“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits. E. Kaufmann and L. Besson, 2019]](https://hal.inria.fr/hal-02006471/)
-
__module__
= 'Policies.GLR_UCB'¶
-
Policies.GLR_UCB.
SubGaussianGLR_DELTA
= 0.01¶ Default confidence level for
SubGaussianGLR_IndexPolicy
.
-
Policies.GLR_UCB.
SubGaussianGLR_SIGMA
= 0.25¶ By default,
SubGaussianGLR_IndexPolicy
assumes distributions are 0.25-sub Gaussian, like Bernoulli or any distributions with support on \([0,1]\).
-
Policies.GLR_UCB.
SubGaussianGLR_JOINT
= True¶ Whether to use the joint or disjoint threshold function (
threshold_SubGaussianGLR_joint()
orthreshold_SubGaussianGLR_disjoint()
) forSubGaussianGLR_IndexPolicy
.
-
Policies.GLR_UCB.
threshold_SubGaussianGLR_joint
(s, t, delta=0.01, sigma=0.25)[source]¶ Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) according to this formula:
\[b^{\text{joint}}_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
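A minimal sketch of this joint threshold with \(t_0 = 0\) (a hypothetical local function, for illustration only):
from math import log, sqrt

def threshold_subgaussian_joint(s, t, delta=0.01, sigma=0.25):
    """b_joint(s, t, delta) = sigma * sqrt((1/(s+1) + 1/(t-s)) * (1 + 1/(t+1)) * 2*log(2*t*sqrt(t+2)/delta))."""
    return sigma * sqrt(
        (1.0 / (s + 1) + 1.0 / (t - s))
        * (1.0 + 1.0 / (t + 1))
        * 2.0 * log(2.0 * t * sqrt(t + 2.0) / delta)
    )

print(threshold_subgaussian_joint(s=100, t=500))  # threshold at split point s=100 for t=500 samples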
-
Policies.GLR_UCB.
threshold_SubGaussianGLR_disjoint
(s, t, delta=0.01, sigma=0.25)[source]¶ Compute the threshold \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\) according to this formula:
\[b^{\text{disjoint}}_{t_0}(s,t,\delta) := \sqrt{2} \sigma \sqrt{\frac{1 + \frac{1}{s - t_0 + 1}}{s - t_0 + 1} \log\left( \frac{4 \sqrt{s - t_0 + 2}}{\delta}\right)} + \sqrt{\frac{1 + \frac{1}{t - s + 1}}{t - s + 1} \log\left( \frac{4 (t - t_0) \sqrt{t - s + 1}}{\delta}\right)}.\]
-
Policies.GLR_UCB.
threshold_SubGaussianGLR
(s, t, delta=0.01, sigma=0.25, joint=True)[source]¶ Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) or \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\).
-
class
Policies.GLR_UCB.
SubGaussianGLR_IndexPolicy
(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy
The SubGaussianGLR-UCB policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for sub-Gaussian distributions.
- It works for any sub-Gaussian family of distributions, being \(\sigma^2\)-sub Gaussian with known \(\sigma\).
- From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
-
__init__
(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
horizon
= None¶ The horizon \(T\).
-
max_nb_random_events
= None¶ The number of breakpoints \(\Upsilon_T\).
-
delta
= None¶ The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{T}\) if
horizon
is given anddelta=None
.
-
sigma
= None¶ Parameter \(\sigma\) for the Sub-Gaussian-GLR test.
-
joint
= None¶ Parameter
joint
for the Sub-Gaussian-GLR test.
-
lazy_try_value_s_only_x_steps
= None¶ Be lazy and try to detect changes for \(s\) taking steps of size
steps_s
.
-
use_localization
= None¶ Experimental option: use localization of the break-point, i.e., restart the memory of the arm by keeping observations s+1…n instead of just the last one
-
compute_threshold_h
(s, t)[source]¶ Compute the threshold \(h\) with
threshold_SubGaussianGLR()
.
-
__module__
= 'Policies.GLR_UCB'¶
-
proba_random_exploration
¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
detect_change
(arm, verbose=False)[source]¶ Detect a change in the current arm, using the non-parametric sub-Gaussian Generalized Likelihood Ratio test (GLR), which works like this:
- For each time step \(s\) between \(t_0=0\) and \(t\), compute:
\[G^{\text{sub-}\sigma}_{t_0:s:t} = |\mu_{t_0,s} - \mu_{s+1,t}|.\]- The change is detected if there is a time \(s\) such that \(G^{\text{sub-}\sigma}_{t_0:s:t} > b_{t_0}(s,t,\delta)\), where \(b_{t_0}(s,t,\delta)\) is the threshold of the test,
- The threshold is computed as:
\[b_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]- And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).
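A simplified stand-alone sketch of this test with \(t_0 = 0\) and the joint threshold (illustration only; the real method also supports the lazy steps over \(s\)):
import numpy as np

def detect_change_subgaussian(rewards, delta=0.01, sigma=0.25):
    """True if |mean(y_0..y_s) - mean(y_{s+1}..y_t)| > b_joint(s, t, delta) for some split s."""
    y = np.asarray(rewards, dtype=float)
    t = len(y) - 1
    for s in range(t):
        gap = abs(y[:s + 1].mean() - y[s + 1:].mean())
        b = sigma * np.sqrt((1.0 / (s + 1) + 1.0 / (t - s)) * (1.0 + 1.0 / (t + 1))
                            * 2.0 * np.log(2.0 * t * np.sqrt(t + 2.0) / delta))
        if gap > b:
            return True
    return False

rng = np.random.default_rng(1)
data = np.concatenate([rng.binomial(1, 0.1, 300), rng.binomial(1, 0.9, 100)])
print(detect_change_subgaussian(data))  # True: the mean shift is detected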
Policies.GenericAggregation module¶
The GenericAggregation aggregation bandit algorithm: use a bandit policy A (master), managing several “slave” algorithms, \(A_1, ..., A_N\).
- At every step, one slave algorithm A_i is selected, by the master policy A.
- Then its decision is listened to, played by the master algorithm, and a feedback reward is received.
- All slaves receive the observation (arm, reward).
- The master also receives the same observation.
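The information flow described above boils down to a short loop. Below is a stripped-down, stand-alone illustration with toy master and slave classes (they only mimic the choice()/getReward() API of the package; the real GenericAggregation accepts any SMPyBandits policies as master and children):
import numpy as np

class FollowTheLeader:
    """A tiny slave policy: play the arm with the best empirical mean (illustration only)."""
    def __init__(self, nbArms):
        self.rewards = np.zeros(nbArms)
        self.pulls = np.zeros(nbArms)
    def choice(self):
        return int(np.argmax(self.rewards / np.maximum(1, self.pulls)))
    def getReward(self, arm, reward):
        self.rewards[arm] += reward
        self.pulls[arm] += 1

class UniformMaster:
    """A tiny master: picks a slave uniformly at random (a real master is itself a bandit policy)."""
    def __init__(self, nbChildren):
        self.nbChildren = nbChildren
    def choice(self):
        return np.random.randint(self.nbChildren)
    def getReward(self, child, reward):
        pass  # a real master would update its preference over the slaves here

nbArms, means = 3, [0.1, 0.5, 0.9]
children = [FollowTheLeader(nbArms) for _ in range(2)]
master = UniformMaster(len(children))
for t in range(1000):
    i = master.choice()                       # 1. the master selects one slave
    arm = children[i].choice()                # 2. the selected slave proposes an arm, which is played
    reward = float(np.random.rand() < means[arm])
    for child in children:                    # 3. all slaves observe (arm, reward)
        child.getReward(arm, reward)
    master.getReward(i, reward)               # 4. the master also receives the observation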
-
class
Policies.GenericAggregation.
GenericAggregation
(nbArms, master=None, children=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The GenericAggregation aggregation bandit algorithm.
-
nbArms
= None¶ Number of arms.
-
lower
= None¶ Lower values for rewards.
-
amplitude
= None¶ Larger values for rewards.
-
last_choice
= None¶ Remember the index of the last child trusted for a decision.
-
children
= None¶ List of slave algorithms.
-
choiceFromSubSet
(availableArms='all')[source]¶ Trust one of the slaves and listen to its choiceFromSubSet.
-
__module__
= 'Policies.GenericAggregation'¶
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Trust one of the slaves and listen to its choiceIMP.
-
-
Policies.GenericAggregation.
random
() → x in the interval [0, 1).¶
Policies.GreedyOracle module¶
author: Julien Seznec
Oracle and near-minimax policy for rotting bandits without noise.
Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf] Tight Policy Regret Bounds for Improving and Decaying Bandits. Hoda Heidari, Michael Kearns, Aaron Roth. International Joint Conference on Artificial Intelligence (IJCAI) 2016, 1562.
-
class
Policies.GreedyOracle.
GreedyPolicy
(nbArms)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
Greedy Policy for rotting bandits (A2 in the reference below). Selects arm with best last value. Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf]
-
__init__
(nbArms)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
computeAllIndex
()[source]¶ Compute the current indexes for all arms. Possibly vectorized, by default it can not be vectorized automatically.
-
__module__
= 'Policies.GreedyOracle'¶
-
-
class
Policies.GreedyOracle.
GreedyOracle
(nbArms, arms)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
Greedy Oracle for rotting bandits (A0 in the reference below). Look 1 step forward and select next best value. Optimal policy for rotting bandits problem. Reference: [Heidari et al., 2016, https://www.ijcai.org/Proceedings/16/Papers/224.pdf]
-
__init__
(nbArms, arms)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
__module__
= 'Policies.GreedyOracle'¶
-
Policies.Hedge module¶
The Hedge randomized index policy.
Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
-
Policies.Hedge.
EPSILON
= 0.01¶ Default \(\varepsilon\) parameter.
-
class
Policies.Hedge.
Hedge
(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The Hedge randomized index policy.
Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf).
-
weights
= None¶ Weights on the arms
-
epsilon
¶ Constant \(\varepsilon_t = \varepsilon\).
-
trusts
¶ Update the trusts probabilities according to Hedge formula, and the parameter \(\varepsilon_t\).
\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \varepsilon_t) w_k(t) + \varepsilon_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]where \(w_k(t)\) is the current weight of arm k.
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.
\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\varepsilon_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
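A minimal numpy sketch of these two updates for a fixed \(\varepsilon\), using the raw reward for \(\tilde{r}_k(t)\) (stand-alone illustration, not the class itself):
import numpy as np

K, epsilon = 4, 0.01
weights = np.full(K, 1.0 / K)
pulls = np.zeros(K)

def hedge_update(arm, reward):
    """Update w_k and return the new trusts, following the two formulas above."""
    pulls[arm] += 1
    weights[arm] *= np.exp(reward / (epsilon * pulls[arm]))
    weights[:] = weights / weights.sum()
    trusts = (1 - epsilon) * weights + epsilon / K
    return trusts / trusts.sum()

print(hedge_update(arm=2, reward=1.0))  # trust vector after one reward on arm 2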
-
choice
()[source]¶ One random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceWithRank
(rank=1)[source]¶ Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
, and select the last one (the least probable).- Note that if not enough entries in the trust vector are non-zero, then
choice()
is called instead (rank is ignored).
- Note that if not enough entries in the trust vector are non-zero, then
-
choiceFromSubSet
(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceMultiple
(nb=1)[source]¶ Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
estimatedOrder
()[source]¶ Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.
-
estimatedBestArms
(M=1)[source]¶ Return a (non-necessarily sorted) list of the indexes of the M-best arms. Identify the set M-best.
-
__module__
= 'Policies.Hedge'¶
-
-
class
Policies.Hedge.
HedgeWithHorizon
(nbArms, horizon, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Hedge.Hedge
Hedge with fixed epsilon, \(\varepsilon_t = \varepsilon_0\), chosen with a knowledge of the horizon.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
epsilon
¶ Fixed temperature, small, knowing the horizon: \(\varepsilon_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
- Cf. Theorem 3.1 case #1 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Hedge'¶
-
-
class
Policies.Hedge.
HedgeDecreasing
(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Hedge.Hedge
Hedge with decreasing parameter \(\varepsilon_t\).
-
epsilon
¶ Decreasing epsilon with the time: \(\varepsilon_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
- Cf. Theorem 3.1 case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Hedge'¶
-
Policies.IMED module¶
The IMED policy of [Honda & Takemura, JMLR 2015].
- Reference: [[“Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards”, J. Honda and A. Takemura, JMLR, 2015](http://jmlr.csail.mit.edu/papers/volume16/honda15a/honda15a.pdf)].
-
Policies.IMED.
Dinf
(x=None, mu=None, kl=<function klBern>, lowerbound=0, upperbound=1, precision=1e-06, max_iterations=50)[source]¶ The generic Dinf index computation.
x
: value of the cumulated reward,
mu
: upper bound on the mean y,
kl
: the KL divergence to be used (klBern(), klGauss(), etc),
lowerbound, upperbound=1
: the known bounds of the values y and x,
precision=1e-6
: the precision at which to stop the search,
max_iterations
: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[D_{\inf}(x, d) \simeq \inf_{\max(\mu, \mathrm{lowerbound}) \leq y \leq \mathrm{upperbound}} \mathrm{kl}(x, y).\]Note
It uses a call to
scipy.optimize.minimize_scalar()
. If this fails, it uses a bisection search, and one call tokl
for each step of the bisection search.
-
class
Policies.IMED.
IMED
(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.DMED.DMED
The IMED policy of [Honda & Takemura, JMLR 2015].
- Reference: [[“Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards”, J. Honda and A. Takemura, JMLR, 2015](http://jmlr.csail.mit.edu/papers/volume16/honda15a/honda15a.pdf)].
-
__init__
(nbArms, tolerance=0.0001, kl=<function klBern>, lower=0.0, amplitude=1.0)[source]¶ New policy.
-
one_Dinf
(x, mu)[source]¶ Compute the \(D_{\inf}\) solution, for one value of
x
, and one value formu
.
-
Dinf
(xs, mu)[source]¶ Compute the \(D_{\inf}\) solution, for a vector of value of
xs
, and one value formu
.
-
choice
()[source]¶ Choose an arm with minimal index (uniformly at random):
\[A(t) \sim U(\arg\min_{1 \leq k \leq K} I_k(t)).\]Where the indexes are:
\[I_k(t) = N_k(t) D_{\inf}(\hat{\mu_{k}}(t), \max_{k'} \hat{\mu_{k'}}(t)) + \log(N_k(t)).\]
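A small numpy sketch of these indexes, using a crude grid search in place of the Dinf() optimization and klBern as the divergence (illustration only):
import numpy as np

def klBern(x, y, eps=1e-10):
    x, y = np.clip(x, eps, 1 - eps), np.clip(y, eps, 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def Dinf_grid(x, mu, grid=np.linspace(0, 1, 1001)):
    """Crude approximation of D_inf(x, mu) = inf_{y >= mu} kl(x, y) on [0, 1]."""
    ys = grid[grid >= mu]
    return klBern(x, ys).min() if len(ys) else 0.0

means = np.array([0.4, 0.55, 0.6])   # empirical means of the arms
pulls = np.array([120, 80, 100])     # number of pulls of each arm
best = means.max()
indexes = np.array([N * Dinf_grid(m, best) + np.log(N) for m, N in zip(means, pulls)])
print(indexes.argmin())  # IMED plays an arm with minimal index; here the empirical best (index 2)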
-
__module__
= 'Policies.IMED'¶
Policies.IndexPolicy module¶
Generic index policy.
- If rewards are not in [0, 1], be sure to give the lower value and the amplitude. Eg, if rewards are in [-3, 3], lower = -3, amplitude = 6.
-
class
Policies.IndexPolicy.
IndexPolicy
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
Class that implements a generic index policy.
-
__init__
(nbArms, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
index
= None¶ Numerical index for each arm
-
computeAllIndex
()[source]¶ Compute the current indexes for all arms. Possibly vectorized, by default it can not be vectorized automatically.
-
choice
()[source]¶ In an index policy, choose an arm with maximal index (uniformly at random):
\[A(t) \sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\]Warning
In almost all cases, there is a unique arm with maximal index, so we lose a lot of time with this generic code, but I couldn’t find a way to be more efficient without losing generality.
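The uniform tie-breaking over the arg-max set amounts to one line of numpy; a minimal sketch of what this generic choice() does:
import numpy as np

def choice_with_random_tie_breaking(index):
    """Pick uniformly at random among the arms achieving the maximal index."""
    best = np.nonzero(index == np.max(index))[0]
    return int(np.random.choice(best))

index = np.array([0.3, 0.7, 0.7, 0.1])
print(choice_with_random_tie_breaking(index))  # 1 or 2, each with probability 1/2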
-
choiceWithRank
(rank=1)[source]¶ In an index policy, choose an arm whose index is the (1+rank)-th best (uniformly at random).
- For instance, if rank is 1, the best arm is chosen (the 1-st best).
- If rank is 4, the 4-th best arm is chosen.
Note
This method is required for the
PoliciesMultiPlayers.rhoRand
policy.
-
choiceFromSubSet
(availableArms='all')[source]¶ In an index policy, choose the best arm from sub-set availableArms (uniformly at random).
-
choiceMultiple
(nb=1)[source]¶ In an index policy, choose nb arms with maximal indexes (uniformly at random).
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ In an index policy, the IMP strategy is hybrid: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].
-
estimatedOrder
()[source]¶ Return the estimate order of the arms, as a permutation on [0..K-1] that would order the arms by increasing means.
-
estimatedBestArms
(M=1)[source]¶ Return a (non-necessarily sorted) list of the indexes of the M-best arms. Identify the set M-best.
-
__module__
= 'Policies.IndexPolicy'¶
-
Policies.LM_DSEE module¶
The LM-DSEE policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)
- It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
This implementation is still experimental!
-
class
Policies.LM_DSEE.
State
¶ Bases:
enum.Enum
Different states during the LM-DSEE algorithm
-
Exploitation
= 2¶
-
Exploration
= 1¶
-
__module__
= 'Policies.LM_DSEE'¶
-
-
Policies.LM_DSEE.
VERBOSE
= False¶ Whether to be verbose when doing the search for valid parameter \(\ell\).
-
Policies.LM_DSEE.
parameter_ell
(a, N, b, gamma, verbose=False, max_value_on_l=1000000)[source]¶ Look for the smallest value of the parameter \(\ell\) that satisfies the following equations:
-
class
Policies.LM_DSEE.
LM_DSEE
(nbArms, nu=0.5, DeltaMin=0.5, a=1, b=0.25, *args, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The LM-DSEE policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)
-
a
= None¶ Parameter \(a\) for the LM-DSEE algorithm.
-
b
= None¶ Parameter \(b\) for the LM-DSEE algorithm.
-
l
= None¶ Parameter \(\ell\) for the LM-DSEE algorithm, as computed by the function
parameter_ell()
.
-
gamma
= None¶ Parameter \(\gamma\) for the LM-DSEE algorithm.
-
rho
= None¶ Parameter \(\rho = \frac{1-\nu}{1+\nu}\) for the LM-DSEE algorithm.
-
phase
= None¶ Current phase, exploration or exploitation.
-
current_exploration_arm
= None¶ Currently explored arm.
-
current_exploitation_arm
= None¶ Currently exploited arm.
-
batch_number
= None¶ Current batch number
-
length_of_current_phase
= None¶ Length of the current phase, either computed from
length_exploration_phase()
or length_exploitation_phase().
-
step_of_current_phase
= None¶ Timer inside the current phase.
-
all_rewards
= None¶ Memory of all the rewards. A list per arm. Growing list until restart of that arm?
-
length_exploration_phase
(verbose=False)[source]¶ Compute the value of the current exploration phase:
\[L_1(k) = L(k) = \lceil \gamma \log(k^{\rho} l b)\rceil.\]Warning
I think there is a typo in the paper, as their formulas are weird (like \(al\) is defined from \(a\)). See
parameter_ell()
.
-
length_exploitation_phase
(verbose=False)[source]¶ Compute the value of the current exploitation phase:
\[L_2(k) = \lceil a k^{\rho} l \rceil - K L_1(k).\]Warning
I think there is a typo in the paper, as their formulas are weird (like \(al\) is defined from \(a\)). See
parameter_ell()
.
-
__module__
= 'Policies.LM_DSEE'¶
-
Policies.LearnExp module¶
The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.
The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).
- At every step, one slave algorithm is selected, by a random selection from a trust distribution on \([1,...,N]\).
- Then its decision is listened to, played by the master algorithm, and a feedback reward is received.
- The reward is reweighted by the trust of the listened algorithm, and given back to it with a certain probability.
- The other slaves, whose decision was not even asked, receive nothing.
- The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm k (the reward), the trust \(P_i\) in each slave \(A_i\) is updated using the reward received.
- The detail about how to increase or decrease the probabilities are specified in the reference article.
Note
Reference: [[Learning to Use Learners’ Advice, A.Singla, H.Hassani & A.Krause, 2017](https://arxiv.org/abs/1702.04825)].
-
Policies.LearnExp.
renormalize_reward
(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]¶ Renormalize the reward to [0, 1]:
- divide by (trust/mintrust) if unbiased is True.
- simply project to [0, 1] if unbiased is False,
Warning
If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.
-
Policies.LearnExp.
unnormalize_reward
(reward, lower=0.0, amplitude=1.0)[source]¶ Project back reward to [lower, lower + amplitude].
-
Policies.LearnExp.
UNBIASED
= True¶ self.unbiased is a flag to know if the rewards are used as biased estimator, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).
-
Policies.LearnExp.
ETA
= 0.5¶ Default value for the constant Eta in (0, 1]
-
class
Policies.LearnExp.
LearnExp
(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.
-
__init__
(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ New policy.
-
nbArms
= None¶ Number of arms.
-
lower
= None¶ Lower values for rewards.
-
amplitude
= None¶ Larger values for rewards.
-
unbiased
= None¶ Flag, see above.
-
eta
= None¶ Constant parameter \(\eta\).
-
rate
= None¶ Constant \(\eta / N\), faster computations if it is stored once.
-
children
= None¶ List of slave algorithms.
-
last_choice
= None¶ Remember the index of the last child trusted for a decision.
-
trusts
= None¶ Initial trusts in the slaves \(p_j^t\). Default to uniform, but a prior can also be given.
-
weights
= None¶ Weights \(w_j^t\).
-
getReward
(arm, reward)[source]¶ Give reward for each child, and then update the trust probabilities.
-
choiceFromSubSet
(availableArms='all')[source]¶ Trust one of the slaves and listen to its choiceFromSubSet.
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Trust one of the slaves and listen to its choiceIMP.
-
__module__
= 'Policies.LearnExp'¶
-
-
Policies.LearnExp.
random
() → x in the interval [0, 1).¶
Policies.MEGA module¶
MEGA: implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421).
The Multi-user epsilon-Greedy collision Avoiding (MEGA) algorithm is based on the epsilon-greedy algorithm introduced in [2], augmented by a collision avoidance mechanism that is inspired by the classical ALOHA protocol.
- [2]: Finite-time analysis of the multi-armed bandit problem, P.Auer & N.Cesa-Bianchi & P.Fischer, 2002
-
class
Policies.MEGA.
MEGA
(nbArms, p0=0.5, alpha=0.5, beta=0.5, c=0.1, d=0.01, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
MEGA: implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421).
-
__init__
(nbArms, p0=0.5, alpha=0.5, beta=0.5, c=0.1, d=0.01, lower=0.0, amplitude=1.0)[source]¶ - nbArms: number of arms.
- p0: initial probability p(0); p(t) is the probability of persistence on the chosenArm at time t
- alpha: scaling in the update for p(t+1) <- alpha p(t) + (1 - alpha)
- beta: exponent used in the interval [t, t + t^beta], from where to sample a random time t_next(k), until when the chosenArm is unavailable
- c, d: used to compute the exploration probability epsilon_t, cf the function
_epsilon_t()
.
Example:
>>> nbArms, p0, alpha, beta, c, d = 17, 0.5, 0.5, 0.5, 0.1, 0.01 >>> player1 = MEGA(nbArms, p0, alpha, beta, c, d)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, MEGA, nbArms, p0, alpha, beta, c, d).children
-
c
= None¶ Parameter c
-
d
= None¶ Parameter d
-
p0
= None¶ Parameter p0, should not be modified
-
p
= None¶ Parameter p, can be modified
-
alpha
= None¶ Parameter alpha
-
beta
= None¶ Parameter beta
-
chosenArm
= None¶ Last chosen arm
-
tnext
= None¶ Only store the delta time
-
meanRewards
= None¶ Mean rewards
-
getReward
(arm, reward)[source]¶ Receive a reward on arm of index ‘arm’, as described by the MEGA algorithm.
- If there is no collision, receive a reward after pulling the arm.
-
handleCollision
(arm, reward=None)[source]¶ Handle a collision, on arm of index ‘arm’.
- Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
Note
We do not care about which arm the collision occurred on.
-
_epsilon_t
()[source]¶ Compute the value of decreasing epsilon(t), cf. Algorithm 1 in [Avner & Mannor, 2014](https://arxiv.org/abs/1404.5421).
-
__module__
= 'Policies.MEGA'¶
-
-
Policies.MEGA.
random
() → x in the interval [0, 1).¶
Policies.MOSS module¶
The MOSS policy for bounded bandits. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).
-
class
Policies.MOSS.
MOSS
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The MOSS policy for bounded bandits. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
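A vectorized numpy sketch of this index (illustration only; sum_rewards and pulls are hypothetical local arrays standing for \(X_k(t)\) and \(N_k(t)\)):
import numpy as np

def moss_indexes(sum_rewards, pulls, t):
    """I_k(t) = X_k/N_k + sqrt(max(0, log(t / (K * N_k)) / N_k)) for each arm k."""
    K = len(pulls)
    means = sum_rewards / pulls
    exploration = np.sqrt(np.maximum(0.0, np.log(t / (K * pulls)) / pulls))
    return means + exploration

sum_rewards = np.array([12.0, 30.0, 18.0])
pulls = np.array([40, 60, 50])
print(moss_indexes(sum_rewards, pulls, t=150))  # MOSS index of each of the K = 3 arms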
-
__module__
= 'Policies.MOSS'¶
-
Policies.MOSSAnytime module¶
The MOSS-Anytime policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).
-
Policies.MOSSAnytime.
ALPHA
= 1.0¶ Default value for the parameter \(\alpha\) for the MOSS-Anytime algorithm.
-
class
Policies.MOSSAnytime.
MOSSAnytime
(nbArms, alpha=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.MOSS.MOSS
The MOSS-Anytime policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).
-
__init__
(nbArms, alpha=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
alpha
= None¶ Parameter \(\alpha \geq 0\) for the computations of the index. Optimal value seems to be \(1.35\).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\left(\frac{1+\alpha}{2}\right) \max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
-
__module__
= 'Policies.MOSSAnytime'¶
-
Policies.MOSSExperimental module¶
The MOSS-Experimental policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).
Warning
Nothing was proved for this heuristic!
-
class
Policies.MOSSExperimental.
MOSSExperimental
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.MOSS.MOSS
The MOSS-Experimental policy for bounded bandits, without knowing the horizon (and no doubling trick). Reference: [Degenne & Perchet, 2016](http://proceedings.mlr.press/v48/degenne16.pdf).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:
\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{ \max\left(0, \frac{\log\left(\frac{t}{\hat{H}(t)}\right)}{N_k(t)}\right)},\\ \text{where}\;\; \hat{H}(t) &:= \begin{cases} \sum\limits_{j=1, N_j(t) < \sqrt{t}}^{K} N_j(t) & \;\text{if it is}\; > 0,\\ K N_k(t) & \;\text{otherwise}\; \end{cases}\end{split}\]Note
In the article, the authors do not explain this subtlety, and I don’t see an argument to justify that at any time \(\hat{H}(t) > 0\), i.e., to justify that there is always some arm \(j\) such that \(0 < N_j(t) < \sqrt{t}\).
-
__module__
= 'Policies.MOSSExperimental'¶
-
Policies.MOSSH module¶
The MOSS-H policy for bounded bandits, with knowing the horizon. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).
-
class
Policies.MOSSH.
MOSSH
(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.MOSS.MOSS
The MOSS-H policy for bounded bandits, with knowing the horizon. Reference: [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf).
-
__init__
(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, if there is K arms:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{T}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
-
__module__
= 'Policies.MOSSH'¶
-
Policies.Monitored_UCB module¶
The Monitored-UCB generic policy for non-stationary bandits.
Reference: [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
It runs on top of a simple policy, e.g.,
UCB
, and Monitored_IndexPolicy
is a wrapper:
>>> policy = Monitored_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(K w)\) memory for a window of size \(w\).
Warning
It can only work on basic index policy based on empirical averages (and an exploration bias), like UCB
, and cannot work on any Bayesian policy (for which we would have to store all the previous observations in order to restart with a reduced history)!
-
Policies.Monitored_UCB.
DELTA
= 0.1¶ Default value for the parameter \(\delta\), the lower-bound for \(\delta_k^{(i)}\), the amplitude of change of arm k at a break-point. Default is
0.1
.
-
Policies.Monitored_UCB.
PER_ARM_RESTART
= False¶ Should we reset one arm's empirical average or all of them? For M-UCB it is
False
by default.
-
Policies.Monitored_UCB.
FULL_RESTART_WHEN_REFRESH
= True¶ Should we fully restart the algorithm or simply reset one arm's empirical average? For M-UCB it is
True
by default.
-
Policies.Monitored_UCB.
WINDOW_SIZE
= None¶ Default value of the window-size. Give
None
to use the default value computed from a knowledge of the horizon and number of break-points.
-
Policies.Monitored_UCB.
GAMMA_SCALE_FACTOR
= 1¶ For any algorithm with uniform exploration and a formula to tune it, \(\alpha\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps, a lot!
-
class
Policies.Monitored_UCB.
Monitored_IndexPolicy
(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
The Monitored-UCB generic policy for non-stationary bandits, from [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
- For a window size
w
, it uses only \(\mathcal{O}(K w)\) memory.
-
__init__
(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ New policy.
-
window_size
= None¶ Parameter \(w\) for the M-UCB algorithm.
-
threshold_b
= None¶ Parameter \(b\) for the M-UCB algorithm.
-
gamma
= None¶ What they call \(\gamma\) in their paper: the share of uniform exploration.
-
last_update_time_tau
= None¶ Keep in memory the last time a change was detected, ie, the variable \(\tau\) in the algorithm.
-
last_w_rewards
= None¶ Keep in memory all the rewards obtained since the last restart on that arm.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
last_restart_times
= None¶ Keep in memory the times of last restarts (for each arm).
-
choice
()[source]¶ Essentially play uniformly at random with probability \(\gamma\), otherwise, pass the call to
choice
of the underlying policy (e.g. UCB).
Warning
Actually, it’s more complicated:
- If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
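For illustration, a minimal Python sketch of this forced-exploration schedule, written with 0-based arm indices (so the test becomes A < K); the function name and the convention of returning None when no exploration is forced are illustrative assumptions.
import math

def forced_exploration_arm(t, tau, K, gamma):
    # A := (t - tau) mod ceil(K / gamma); if A indexes an arm, it is forced,
    # otherwise the underlying policy (e.g. UCB) chooses.
    A = (t - tau) % math.ceil(K / gamma)
    return A if A < K else None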
-
choiceWithRank
(rank=1)[source]¶ Essentially play uniformly at random with probability \(\gamma\), otherwise, pass the call to
choiceWithRank
of the underlying policy (eg. UCB).
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the change detection algorithm says so.
-
__module__
= 'Policies.Monitored_UCB'¶
-
detect_change
(arm)[source]¶ A change is detected for the current arm if the following test is true:
\[|\sum_{i=w/2+1}^{w} Y_i - \sum_{i=1}^{w/2} Y_i | > b ?\]- where \(Y_i\) is the i-th data in the latest w data from this arm (ie, \(X_k(t)\) for \(t = n_k - w + 1\) to \(t = n_k\) current number of samples from arm k).
- where
threshold_b
is the threshold b of the test, andwindow_size
is the window-size w.
Warning
FIXED: only the last \(w\) data points are stored, using lists whose first element gets ``pop()``-ed out (deleted). See https://github.com/SMPyBandits/SMPyBandits/issues/174
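For illustration, a minimal Python sketch of this two-halves test on the stored window of one arm, assuming the window is full and of even length; the names are illustrative, not part of the package.
def two_halves_test(last_w_rewards, threshold_b):
    # Compare the sums of the two halves of the last w rewards of one arm:
    # a change is declared if their absolute difference exceeds b.
    w = len(last_w_rewards)
    first_half = sum(last_w_rewards[:w // 2])
    second_half = sum(last_w_rewards[w // 2:])
    return abs(second_half - first_half) > threshold_b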
Policies.MusicalChair module¶
MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).
- Each player has 3 states: 1st is random exploration, 2nd is musical chair, 3rd is staying seated.
- 1st step:
- Every player tries arms uniformly at random for \(T_0\) steps, counting the empirical mean of each arm and the number of observed collisions \(C_{T_0}\).
- Finally, \(N^* = M\) =
nbPlayers
is estimated from the number of collisions \(C_{T_0}\), and the \(N^*\) best arms are computed from their empirical means.
- 2nd step:
- Every player chooses an arm uniformly at random among the \(N^*\) best arms, until she no longer encounters a collision right after choosing it.
- When an arm was chosen by only one player, she decides to sit on this chair (= arm).
- 3rd step:
- Every player stays seated on her chair for the rest of the game.
- \(\implies\) constant regret if \(N^*\) is well estimated and if the estimated \(N^*\) best arms are correct,
- \(\implies\) linear regret otherwise.
-
Policies.MusicalChair.
optimalT0
(nbArms=10, epsilon=0.1, delta=0.05)[source]¶ Compute the lower-bound suggesting “large-enough” values for \(T_0\) that should guarantee constant regret with probability at least \(1 - \delta\), if the gap \(\Delta\) is larger than \(\epsilon\).
- Cf. Theorem 1 of [Shamir et al., 2015](https://arxiv.org/abs/1512.02866).
Examples:
- For \(K=2\) arms, and in order to have a constant regret with probability at least \(90\%\), if the gap \(\Delta\) is known to be \(\geq 0.05\), then their theoretical analysis suggests using \(T_0 \geq 18459\). That's huge, for just two arms!
>>> optimalT0(2, 0.1, 0.05) # Just 2 arms ! 18459 # ==> That's a LOT of steps for just 2 arms!
- For a harder problem with \(K=6\) arms, for a risk smaller than \(1\%\) and a gap \(\Delta \geq 0.05\), they suggest at least \(T_0 \geq 7646924\), i.e., about 7 million trials. That is simply too much for any realistic system, and starts to be too large for simulated systems.
>>> optimalT0(6, 0.01, 0.05) # Constant regret with >99% proba 7646924 # ==> That's a LOT of steps! >>> optimalT0(6, 0.001, 0.05) # Reasonable value of epsilon 764692376 # ==> That's a LOT of steps!!!
- For an even harder problem with \(K=17\) arms, the values given by their Theorem 1 start to be really unrealistic:
>>> optimalT0(17, 0.01, 0.05) # Constant regret with >99% proba 27331794 # ==> That's a LOT of steps! >>> optimalT0(17, 0.001, 0.05) # Reasonable value of epsilon 2733179304 # ==> That's a LOT of steps!!!
-
Policies.MusicalChair.
boundOnFinalRegret
(T0, nbPlayers)[source]¶ Use the upper-bound on regret when \(T_0\) and \(M\) are known.
The “constant” regret of course grows linearly with \(T_0\), as:
\[\forall T \geq T_0, \;\; R_T \leq T_0 K + 2 \mathrm{exp}(2) K.\]
Warning
this bound is not a deterministic result, it is only valid with a certain probability (at least \(1 - \delta\), if \(T_0\) is chosen as given by
optimalT0()
).- Cf. Theorem 1 of [Shamir et al., 2015](https://arxiv.org/abs/1512.02866).
- Examples:
>>> boundOnFinalRegret(18459, 2) # Crazy constant regret! # doctest: +ELLIPSIS 36947.5.. >>> boundOnFinalRegret(7646924, 6) # Crazy constant regret!! # doctest: +ELLIPSIS 45881632.6... >>> boundOnFinalRegret(764692376, 6) # Crazy constant regret!! # doctest: +ELLIPSIS 4588154344.6... >>> boundOnFinalRegret(27331794, 17) # Crazy constant regret!! # doctest: +ELLIPSIS 464640749.2... >>> boundOnFinalRegret(2733179304, 17) # Crazy constant regret!! # doctest: +ELLIPSIS 46464048419.2...
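For illustration, a minimal Python sketch of this bound, taking K as the number of players following the convention of boundOnFinalRegret above (the function name is illustrative; the formula reproduces the doctest values, e.g. 36947.56 for T0=18459 and 2 players).
import math

def bound_on_final_regret_sketch(T0, nbPlayers):
    # R_T <= T0 * K + 2 * exp(2) * K, valid with probability at least 1 - delta.
    return T0 * nbPlayers + 2 * math.exp(2) * nbPlayers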
-
class
Policies.MusicalChair.
State
¶ Bases:
enum.Enum
Different states during the Musical Chair algorithm
-
InitialPhase
= 2¶
-
MusicalChair
= 3¶
-
NotStarted
= 1¶
-
Sitted
= 4¶
-
__module__
= 'Policies.MusicalChair'¶
-
-
class
Policies.MusicalChair.
MusicalChair
(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).
-
__init__
(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]¶ - nbArms: number of arms,
- Time0: required, number of steps (or fraction of the horizon Time1) for the first phase (pure random exploration by each player),
- N: optional, exact or upper bound on the number of players,
- Time1: optional, only used to compute Time0 if Time0 is fractional (e.g., 0.2).
Example:
>>> nbArms, Time0, Time1, N = 17, 0.1, 10000, 6 >>> player1 = MusicalChair(nbArms, Time0, Time1, N)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.25, Time1=HORIZON, N=NB_PLAYERS).children
-
state
= None¶ Current state
-
Time0
= None¶ Parameter T0
-
nbPlayers
= None¶ Number of players
-
chair
= None¶ Current chair. Not seated yet.
-
cumulatedRewards
= None¶ That’s the s_i(t) of the paper
-
nbObservations
= None¶ That’s the o_i of the paper
-
A
= None¶ A random permutation of arms, it will then be of size nbPlayers!
-
nbCollision
= None¶ Number of collisions, that’s the C_Time0 of the paper
-
t
= None¶ Internal times
-
startGame
()[source]¶ Just reinitialize all the internal memory, and decide how to start (state 1 or 2).
-
getReward
(arm, reward)[source]¶ Receive a reward on arm of index ‘arm’, as described by the Musical Chair algorithm.
- If not collision, receive a reward after pulling the arm.
-
_endInitialPhase
()[source]¶ Small computation needed at the end of the initial random exploration phase.
-
handleCollision
(arm, reward=None)[source]¶ Handle a collision, on arm of index ‘arm’.
- Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
-
__module__
= 'Policies.MusicalChair'¶
-
Policies.MusicalChairNoSensing module¶
MusicalChairNoSensing: implementation of the decentralized multi-player policy from [[“Multiplayer bandits without observing collision information”, by Gabor Lugosi and Abbas Mehrabian]](https://arxiv.org/abs/1808.08416).
Note
The algorithm implemented here is Algorithm 1 (page 8) in the article, but the authors did not name it. I will refer to it as the Musical Chair algorithm with no sensing, or MusicalChairNoSensing
in the code.
-
Policies.MusicalChairNoSensing.
ConstantC
= 1¶ The constant \(C\) needed to make the theoretical results work; the paper suggests the (very large) value \(C = 128\).
Warning
One can choose a much smaller value in order to (try to) obtain reasonable empirical performance! I have tried \(C = 1\). BUT the algorithm DOES NOT work better with a much smaller constant: every single simulation I tried ends up with a linear regret for
MusicalChairNoSensing
.
-
Policies.MusicalChairNoSensing.
parameter_g
(K=9, m=3, T=1000, constant_c=1)[source]¶ Length \(g\) of phase 1, from parameters
K
,m
andT
.\[g = 128 K \log(3 K m^2 T^2).\]Examples:
>>> parameter_g(m=2, K=2, T=100) # DOCTEST: +ELLIPSIS 3171.428... >>> parameter_g(m=2, K=2, T=1000) # DOCTEST: +ELLIPSIS 4350.352... >>> parameter_g(m=2, K=3, T=100) # DOCTEST: +ELLIPSIS 4912.841... >>> parameter_g(m=3, K=3, T=100) # DOCTEST: +ELLIPSIS 5224.239...
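For illustration, a minimal Python sketch of the formula above; the function name is illustrative, and how constant_c enters the formula is not shown in the displayed equation, so it is omitted here. With K=2, m=2, T=100 this gives about 3171.43, matching the first doctest.
import math

def parameter_g_sketch(K=9, m=3, T=1000):
    # g = 128 * K * log(3 * K * m**2 * T**2)
    return 128 * K * math.log(3 * K * m ** 2 * T ** 2)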
-
Policies.MusicalChairNoSensing.
estimate_length_phases_12
(K=3, m=9, Delta=0.1, T=1000)[source]¶ Estimate the length of phases 1 and 2 from the parameters of the problem.
Examples:
>>> estimate_length_phases_12(m=2, K=2, Delta=0.1, T=100) 198214307 >>> estimate_length_phases_12(m=2, K=2, Delta=0.01, T=100) 19821430723 >>> estimate_length_phases_12(m=2, K=2, Delta=0.1, T=1000) 271897030 >>> estimate_length_phases_12(m=2, K=3, Delta=0.1, T=100) 307052623 >>> estimate_length_phases_12(m=2, K=5, Delta=0.1, T=100) 532187397
-
Policies.MusicalChairNoSensing.
smallest_T_from_where_length_phases_12_is_larger
(K=3, m=9, Delta=0.1, Tmax=1000000000.0)[source]¶ Compute the smallest horizon T from where the (estimated) length of phases 1 and 2 is larger than T.
Examples:
>>> smallest_T_from_where_length_phases_12_is_larger(K=2, m=1) 687194767 >>> smallest_T_from_where_length_phases_12_is_larger(K=3, m=2) 1009317314 >>> smallest_T_from_where_length_phases_12_is_larger(K=3, m=3) 1009317314
Examples with even longer phase 1:
>>> smallest_T_from_where_length_phases_12_is_larger(K=10, m=5) 1009317314 >>> smallest_T_from_where_length_phases_12_is_larger(K=10, m=10) 1009317314
With \(K=100\) arms, it starts to be crazy:
>>> smallest_T_from_where_length_phases_12_is_larger(K=100, m=10) 1009317314
-
class
Policies.MusicalChairNoSensing.
State
¶ Bases:
enum.Enum
Different states during the Musical Chair with no sensing algorithm
-
InitialPhase
= 2¶
-
MusicalChair
= 4¶
-
NotStarted
= 1¶
-
Sitted
= 5¶
-
UniformWaitPhase2
= 3¶
-
__module__
= 'Policies.MusicalChairNoSensing'¶
-
-
class
Policies.MusicalChairNoSensing.
MusicalChairNoSensing
(nbPlayers=1, nbArms=1, horizon=1000, constant_c=1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
MusicalChairNoSensing: implementation of the decentralized multi-player policy from [[“Multiplayer bandits without observing collision information”, by Gabor Lugosi and Abbas Mehrabian]](https://arxiv.org/abs/1808.08416).
-
__init__
(nbPlayers=1, nbArms=1, horizon=1000, constant_c=1, lower=0.0, amplitude=1.0)[source]¶ - nbArms: number of arms (
K
in the paper), - nbPlayers: number of players (
m
in the paper), - horizon: horizon (length) of the game (
T
in the paper),
Example:
>>> nbPlayers, nbArms, horizon = 3, 9, 10000 >>> player1 = MusicalChairNoSensing(nbPlayers, nbArms, horizon)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChairNoSensing, nbArms, nbPlayers=nbPlayers, horizon=horizon).children
or
>>> configuration["players"] = [ MusicalChairNoSensing(nbPlayers=nbPlayers, nbArms=nbArms, horizon=horizon) for _ in range(NB_PLAYERS) ]
- nbArms: number of arms (
-
state
= None¶ Current state
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
horizon
= None¶ Parameter T (horizon)
-
chair
= None¶ Current chair. Not seated yet.
-
cumulatedRewards
= None¶ That’s the s_i(t) of the paper
-
nbObservations
= None¶ That’s the o_i of the paper
-
A
= None¶ A random permutation of arms, it will then be of size nbPlayers!
-
tau_phase_2
= None¶ Time when phase 2 starts
-
t
= None¶ Internal times
-
startGame
()[source]¶ Just reinitialize all the internal memory, and decide how to start (state 1 or 2).
-
getReward
(arm, reward)[source]¶ Receive a reward on arm of index ‘arm’, as described by the Musical Chair with no Sensing algorithm.
- If not collision, receive a reward after pulling the arm.
-
__module__
= 'Policies.MusicalChairNoSensing'¶
-
handleCollision
(arm, reward=None)[source]¶ Handle a collision, on arm of index ‘arm’.
- Here, as its name suggests, the
MusicalChairNoSensing
algorithm does not use any collision information, hence this method is empty. - Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
- Here, as its name suggests it, the
-
Policies.OCUCB module¶
The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits, with sub-Gaussian noise.
- Reference: [Lattimore, 2016](https://arxiv.org/pdf/1603.08661.pdf).
- There is also a horizon-dependent version,
OCUCBH.OCUCBH
, from [Lattimore, 2015](https://arxiv.org/pdf/1507.07880.pdf).
-
Policies.OCUCB.
ETA
= 2¶ Default value for parameter \(\eta > 1\) for OCUCB.
-
Policies.OCUCB.
RHO
= 1¶ Default value for parameter \(\rho \in (1/2, 1]\) for OCUCB.
-
class
Policies.OCUCB.
OCUCB
(nbArms, eta=2, rho=1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits, with sub-Gaussian noise.
- Reference: [Lattimore, 2016](https://arxiv.org/pdf/1603.08661.pdf).
-
__init__
(nbArms, eta=2, rho=1, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
eta
= None¶ Parameter \(\eta > 1\).
-
rho
= None¶ Parameter \(\rho \in (1/2, 1]\).
-
_Bterm
(k)[source]¶ Compute the extra term \(B_k(t)\) as follows:
\[\begin{split}B_k(t) &= \max\Big\{ \exp(1), \log(t), t \log(t) / C_k(t) \Big\},\\ \text{where}\; C_k(t) &= \sum_{j=1}^{K} \min\left\{ T_k(t), T_j(t)^{\rho} T_k(t)^{1 - \rho} \right\}\end{split}\]
-
_Bterms
()[source]¶ Compute all the extra terms \(B_k(t)\), for each arm k, in a naive (non-vectorized) manner, but it works.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \eta \log(B_k(t))}{N_k(t)}}.\]- Where \(\eta\) is a parameter of the algorithm,
- And \(B_k(t)\) is the additional term defined above.
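For illustration, a minimal Python sketch combining the two formulas above, assuming every arm has been pulled at least once and t >= 1; the function name and its list-style arguments are illustrative, not the package's API.
import math

def ocucb_index_sketch(k, sums, pulls, t, eta=2.0, rho=1.0):
    # C_k(t) = sum_j min(N_k, N_j**rho * N_k**(1 - rho))
    # B_k(t) = max(e, log(t), t * log(t) / C_k(t))
    # I_k(t) = mean_k + sqrt(2 * eta * log(B_k(t)) / N_k)
    C_k = sum(min(pulls[k], pulls[j] ** rho * pulls[k] ** (1 - rho))
              for j in range(len(pulls)))
    B_k = max(math.e, math.log(t), t * math.log(t) / C_k)
    return sums[k] / pulls[k] + math.sqrt(2.0 * eta * math.log(B_k) / pulls[k])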
-
__module__
= 'Policies.OCUCB'¶
Policies.OCUCBH module¶
The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).
- Reference: [Lattimore, 2015](https://arxiv.org/pdf/1507.07880.pdf)
- There is also a horizon-independent version,
OCUCB.OCUCB
, from [Lattimore, 2016](https://arxiv.org/pdf/1603.08661.pdf).
-
Policies.OCUCBH.
PSI
= 2¶ Default value for parameter \(\psi \geq 2\) for OCUCBH.
-
Policies.OCUCBH.
ALPHA
= 4¶ Default value for parameter \(\alpha \geq 2\) for OCUCBH.
-
class
Policies.OCUCBH.
OCUCBH
(nbArms, horizon=None, psi=2, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.OCUCB.OCUCB
The Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).
- Reference: [Lattimore, 2015](https://arxiv.org/pdf/1507.07880.pdf)
-
__init__
(nbArms, horizon=None, psi=2, alpha=4, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
psi
= None¶ Parameter \(\psi \geq 2\).
-
alpha
= None¶ Parameter \(\alpha \geq 2\).
-
horizon
= None¶ Horizon T.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha}{N_k(t)} \log(\frac{\psi T}{t})}.\]- Where \(\alpha\) and \(\psi\) are two parameters of the algorithm.
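For illustration, a minimal Python sketch of this horizon-dependent index, assuming the arm has been pulled at least once and \(t \leq \psi T\) so the logarithm is non-negative; the function name and arguments are illustrative.
import math

def ocucbh_index_sketch(sum_rewards, pulls, t, horizon, psi=2.0, alpha=4.0):
    # I_k(t) = X_k(t)/N_k(t) + sqrt((alpha / N_k(t)) * log(psi * T / t))
    return sum_rewards / pulls + math.sqrt((alpha / pulls) * math.log(psi * horizon / t))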
-
__module__
= 'Policies.OCUCBH'¶
-
class
Policies.OCUCBH.
AOCUCBH
(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.OCUCBH.OCUCBH
The Almost Optimally Confident UCB (OC-UCB) policy for bounded stochastic bandits. Initial version (horizon-dependent).
- Reference: [Lattimore, 2015](https://arxiv.org/pdf/1507.07880.pdf)
-
__init__
(nbArms, horizon=None, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2}{N_k(t)} \log(\frac{T}{N_k(t)})}.\]
-
__module__
= 'Policies.OCUCBH'¶
Policies.OSSB module¶
Optimal Sampling for Structured Bandits (OSSB) algorithm.
- Reference: [[Minimal Exploration in Structured Stochastic Bandits, Combes et al, arXiv:1711.00400 [stat.ML]]](https://arxiv.org/abs/1711.00400)
- See also: https://github.com/SMPyBandits/SMPyBandits/issues/101
Warning
This is the simplified OSSB algorithm for classical bandits. It can be applied to more general bandit problems, see the original paper.
- The
OSSB
is for Bernoulli stochastic bandits, andGaussianOSSB
is for Gaussian stochastic bandits, with a direct application of the result from their paper. - The
SparseOSSB
is for sparse Gaussian (or sub-Gaussian) stochastic bandits, of known variance. - I also added support for non-constant :math:`
arepsilon` and \(\gamma\) rates, as suggested in a talk given by Combes, 24th of May 2018, Rotterdam (Workshop, “Learning while Earning”). See OSSB_DecreasingRate
and OSSB_AutoDecreasingRate
.
-
class
Policies.OSSB.
Phase
¶ Bases:
enum.Enum
Different phases during the OSSB algorithm
-
__module__
= 'Policies.OSSB'¶
-
estimation
= 3¶
-
exploitation
= 2¶
-
exploration
= 4¶
-
initialisation
= 1¶
-
-
Policies.OSSB.
EPSILON
= 0.0¶ Default value for the \(\varepsilon\) parameter, 0.0 is a safe default.
-
Policies.OSSB.
GAMMA
= 0.0¶ Default value for the \(\gamma\) parameter, 0.0 is a safe default.
-
Policies.OSSB.
solve_optimization_problem__classic
(thetas)[source]¶ Solve the optimization problem (2)-(3) as defined in the paper, for classical stochastic bandits.
- No need to solve anything, as they give the solution for classical bandits.
-
Policies.OSSB.
solve_optimization_problem__gaussian
(thetas, sig2x=0.25)[source]¶ Solve the optimization problem (2)-(3) as defined in the paper, for Gaussian classical stochastic bandits.
- No need to solve anything, as they give the solution for Gaussian classical bandits.
-
Policies.OSSB.
solve_optimization_problem__sparse_bandits
(thetas, sparsity=None, only_strong_or_weak=False)[source]¶ Solve the optimization problem (2)-(3) as defined in the paper, for sparse stochastic bandits.
- I recomputed the suboptimal solution to the optimization problem, and found the same one as in [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].
- If only_strong_or_weak is
True
, the solution \(c_i\) are not returned, but insteadstrong_or_weak, k
is returned (to know if the problem is strongly sparse or not, and if not, the k that satisfies the required constraint).
-
class
Policies.OSSB.
OSSB
(nbArms, epsilon=0.0, gamma=0.0, solve_optimization_problem='classic', lower=0.0, amplitude=1.0, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
Optimal Sampling for Structured Bandits (OSSB) algorithm.
solve_optimization_problem
can be"classic"
or"bernoulli"
for classic stochastic bandit with no structure,"gaussian"
for classic bandit for Gaussian arms, or"sparse"
for sparse stochastic bandit (give the sparsitys
in akwargs
).- Reference: [[Minimal Exploration in Structured Stochastic Bandits, Combes et al, arXiv:1711.00400 [stat.ML]]](https://arxiv.org/abs/1711.00400)
-
__init__
(nbArms, epsilon=0.0, gamma=0.0, solve_optimization_problem='classic', lower=0.0, amplitude=1.0, **kwargs)[source]¶ New policy.
-
epsilon
= None¶ Parameter \(\varepsilon\) for the OSSB algorithm. Can be = 0.
-
gamma
= None¶ Parameter \(\gamma\) for the OSSB algorithm. Can be = 0.
-
counter_s_no_exploitation_phase
= None¶ counter of number of exploitation phase
-
phase
= None¶ categorical variable for the phase
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
__module__
= 'Policies.OSSB'¶
-
class
Policies.OSSB.
GaussianOSSB
(nbArms, epsilon=0.0, gamma=0.0, variance=0.25, lower=0.0, amplitude=1.0, **kwargs)[source]¶ Bases:
Policies.OSSB.OSSB
Optimal Sampling for Structured Bandits (OSSB) algorithm, for Gaussian Stochastic Bandits.
-
__init__
(nbArms, epsilon=0.0, gamma=0.0, variance=0.25, lower=0.0, amplitude=1.0, **kwargs)[source]¶ New policy.
-
__module__
= 'Policies.OSSB'¶
-
-
class
Policies.OSSB.
SparseOSSB
(nbArms, epsilon=0.0, gamma=0.0, sparsity=None, lower=0.0, amplitude=1.0, **kwargs)[source]¶ Bases:
Policies.OSSB.OSSB
Optimal Sampling for Structured Bandits (OSSB) algorithm, for Sparse Stochastic Bandits.
-
__init__
(nbArms, epsilon=0.0, gamma=0.0, sparsity=None, lower=0.0, amplitude=1.0, **kwargs)[source]¶ New policy.
-
__module__
= 'Policies.OSSB'¶
-
-
Policies.OSSB.
DECREASINGRATE
= 1e-06¶ Default value for the constant for the decreasing rate
-
class
Policies.OSSB.
OSSB_DecreasingRate
(nbArms, epsilon=0.0, gamma=0.0, decreasingRate=1e-06, lower=0.0, amplitude=1.0, **kwargs)[source]¶ Bases:
Policies.OSSB.OSSB
Optimal Sampling for Structured Bandits (OSSB) algorithm, with decreasing rates for both \(\varepsilon\) and \(\gamma\).
Warning
This is purely experimental, the paper does not talk about how to choose decreasing rates. It is inspired by the rates for the Exp3 algorithm, cf [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__init__
(nbArms, epsilon=0.0, gamma=0.0, decreasingRate=1e-06, lower=0.0, amplitude=1.0, **kwargs)[source]¶ New policy.
-
epsilon
¶ Decreasing \(\varepsilon(t) = \min(1, \varepsilon_0 \exp(- t \tau))\).
-
__module__
= 'Policies.OSSB'¶
-
gamma
¶ Decreasing \(\gamma(t) = \min(1, \gamma_0 \exp(- t \tau))\).
-
-
class
Policies.OSSB.
OSSB_AutoDecreasingRate
(nbArms, lower=0.0, amplitude=1.0, **kwargs)[source]¶ Bases:
Policies.OSSB.OSSB
Optimal Sampling for Structured Bandits (OSSB) algorithm, with automatically-tuned decreasing rates for both \(\varepsilon\) and \(\gamma\).
Warning
This is purely experimental, the paper does not talk about how to choose decreasing rates. It is inspired by the rates for the Exp3++ algorithm, [[One practical algorithm for both stochastic and adversarial bandits, Seldin & Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].
-
__module__
= 'Policies.OSSB'¶
-
epsilon
¶ Decreasing \(\varepsilon(t) = \frac{1}{2} \sqrt{\frac{\log(K)}{t K}}\).
-
gamma
¶ Decreasing \(\gamma(t) = \frac{1}{2} \sqrt{\frac{\log(K)}{t K}}\).
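For illustration, a minimal Python sketch of this automatically-tuned rate, valid for \(t \geq 1\) and \(K \geq 2\); the function name is illustrative.
import math

def auto_decreasing_rate_sketch(t, nbArms):
    # epsilon(t) = gamma(t) = 0.5 * sqrt(log(K) / (t * K))
    return 0.5 * math.sqrt(math.log(nbArms) / (t * nbArms))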
-
Policies.OracleSequentiallyRestartPolicy module¶
An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.
It runs on top of a simple policy, e.g.,
UCB
, andOracleSequentiallyRestartPolicy
is a wrapper:>>> policy = OracleSequentiallyRestartPolicy(nbArms, UCB) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses the knowledge of the breakpoints to restart the underlying algorithm at each breakpoint.
It is very simple but impractical: in any real problem it is impossible to know the locations of the breakpoints, but it acts as an efficient baseline.
Warning
It is an efficient baseline, but it has no reason to be the best algorithm on a given problem (empirically)! I found that Policies.DiscountedThompson.DiscountedThompson
is usually the most efficient.
-
Policies.OracleSequentiallyRestartPolicy.
PER_ARM_RESTART
= True¶ Should we reset one arm's empirical average or all of them? Default is
True
for this algorithm.
-
Policies.OracleSequentiallyRestartPolicy.
FULL_RESTART_WHEN_REFRESH
= False¶ Should we fully restart the algorithm or simply reset one arm empirical average? Default is
False
, it’s usually more efficient!
-
Policies.OracleSequentiallyRestartPolicy.
RESET_FOR_ALL_CHANGE
= False¶ True
if the algorithm resets one/all arm memories when a change occurs on any arm. False
if the algorithm only resets one arm's memories when a change occurs on that arm (needs to know listOfMeans
) (default, it should be more efficient).
-
Policies.OracleSequentiallyRestartPolicy.
RESET_FOR_SUBOPTIMAL_CHANGE
= True¶ True
if the algorithm resets the memories of this arm whether it stays optimal or becomes suboptimal (default, it should be more efficient). False
if the algorithm resets memories only when a change makes the previously best arm become suboptimal.
-
class
Policies.OracleSequentiallyRestartPolicy.
OracleSequentiallyRestartPolicy
(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.
-
__init__
(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]¶ New policy.
-
reset_for_all_change
= None¶
-
reset_for_suboptimal_change
= None¶
-
changePoints
= None¶ Locations of the break points (or change points) of the switching bandit problem, for each arm. If
None
, an empty list is used.
-
all_rewards
= None¶ Keep in memory all the rewards obtained since the last restart on that arm.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
compute_optimized_changePoints
(changePoints=None, listOfMeans=None)[source]¶ Compute the list of change points for each arm.
- If
reset_for_all_change
is True
, all change points concern all arms (suboptimal)!
- If
reset_for_all_change
is False
:
- If
reset_for_suboptimal_change
is True
, all change points where the mean of an arm changes concern it (still suboptimal)!
- If
reset_for_suboptimal_change
is False
, only the change points where an arm goes from optimal to suboptimal, or from suboptimal to optimal, concern it (optimal!).
-
__module__
= 'Policies.OracleSequentiallyRestartPolicy'¶
-
Policies.PHE module¶
The PHE, Perturbed-History Exploration, policy for bounded bandits.
- Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
-
Policies.PHE.
DEFAULT_PERTURBATION_SCALE
= 1.0¶ By default, the perturbation scale \(a\) in PHE is 1, that is, at the current time step t, if there are \(s = T_{i,t-1}\) samples of arm i, PHE generates \(s\) pseudo-rewards (of mean \(1/2\)).
-
class
Policies.PHE.
PHE
(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The PHE, Perturbed-History Exploration, policy for bounded bandits.
- Reference: [[Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089]](https://arxiv.org/abs/1902.10089)
- They prove that PHE achieves a regret of \(\mathcal{O}(K \Delta^{-1} \log(T))\) for horizon \(T\), where \(\Delta\) is the minimum gap between the expected rewards of the optimal and suboptimal arms, for any \(a > 1\).
- Note that the limit case of \(a=0\) gives the Follow-the-Leader algorithm (FTL), known to fail.
-
__init__
(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
perturbation_scale
= None¶ Perturbation scale, denoted \(a\) in their paper. Should be a float or int number. With \(s\) current samples, \(\lceil a s \rceil\) additional pseudo-rewards are generated.
-
computeIndex
(arm)[source]¶ Compute a randomized index by adding \(\lceil a s \rceil\) pseudo-rewards (of mean \(1/2\)) to the \(s\) current observations of this arm.
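For illustration, a minimal Python sketch of such a perturbed-history index, assuming rewards in [0, 1] and at least one pull of the arm; the function name, the use of numpy, and the rng argument are illustrative, not the package's implementation.
import math
import numpy as np

def phe_index_sketch(sum_rewards, pulls, perturbation_scale=1.0, rng=None):
    # With s = pulls observations, add ceil(a * s) Bernoulli(1/2) pseudo-rewards
    # and return the mean of the perturbed history.
    rng = rng or np.random.default_rng()
    nb_pseudo = int(math.ceil(perturbation_scale * pulls))
    pseudo_sum = rng.binomial(1, 0.5, size=nb_pseudo).sum()
    return (sum_rewards + pseudo_sum) / (pulls + nb_pseudo)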
-
__module__
= 'Policies.PHE'¶
Policies.ProbabilityPursuit module¶
The basic Probability Pursuit algorithm.
We use the simple version of the pursuit algorithm, as described in the seminal book by Sutton and Barto (1998), https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html.
Initially, a uniform probability is set on each arm, \(p_k(0) = 1/K\).
At each time step \(t\), the probabilities are all recomputed, following this equation:
\[\begin{split}p_k(t+1) = \begin{cases} (1 - \beta) p_k(t) + \beta \times 1 & \text{if}\; \hat{\mu}_k(t) = \max_j \hat{\mu}_j(t) \\ (1 - \beta) p_k(t) + \beta \times 0 & \text{otherwise}. \end{cases}\end{split}\]\(\beta \in (0, 1)\) is a learning rate, default is BETA = 0.5.
And then arm \(A_k(t+1)\) is randomly selected from the distribution \((p_k(t+1))_{1 \leq k \leq K}\).
References: [Kuleshov & Precup - JMLR, 2000](http://www.cs.mcgill.ca/~vkules/bandits.pdf#page=6), [Sutton & Barto, 1998]
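For illustration, a minimal Python sketch of one pursuit update following the equation above (ties broken by argmax here); the function name is illustrative.
import numpy as np

def pursuit_update_sketch(probabilities, empirical_means, beta=0.5):
    # The empirically best arm is pushed towards probability 1,
    # all other arms towards 0, with learning rate beta.
    target = np.zeros_like(probabilities)
    target[np.argmax(empirical_means)] = 1.0
    return (1.0 - beta) * probabilities + beta * target

The next arm is then drawn from the updated distribution, for instance with numpy.random.choice(len(p), p=p).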
-
Policies.ProbabilityPursuit.
BETA
= 0.5¶ Default value for the beta parameter
-
class
Policies.ProbabilityPursuit.
ProbabilityPursuit
(nbArms, beta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The basic Probability pursuit algorithm.
- References: [Kuleshov & Precup - JMLR, 2000](http://www.cs.mcgill.ca/~vkules/bandits.pdf#page=6), [Sutton & Barto, 1998]
-
probabilities
= None¶ Probabilities of each arm
-
beta
¶ Constant parameter \(\beta(t) = \beta(0)\).
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate rewards on that arm k, then update the probabilities \(p_k(t)\) of each arm.
-
choice
()[source]¶ One random selection, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice()
.
-
choiceWithRank
(rank=1)[source]¶ Multiple (rank >= 1) random selections, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice()
, and select the last one (the least probable).
-
choiceFromSubSet
(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice()
.
-
__module__
= 'Policies.ProbabilityPursuit'¶
-
choiceMultiple
(nb=1)[source]¶ Multiple (nb >= 1) random selections, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice()
.
Policies.RAWUCB module¶
Author: Julien Seznec.
Rotting Adaptive Window Upper Confidence Bounds for rotting bandits.
Reference: [Seznec et al., 2019b], “A single algorithm for both rested and restless rotting bandits” (WIP), by Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko.
-
class
Policies.RAWUCB.
EFF_RAWUCB
(nbArms, alpha=0.06, subgaussian=1, m=None, delta=None, delay=False)[source]¶ Bases:
Policies.FEWA.EFF_FEWA
Efficient Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2020]. Efficient trick described in [Seznec et al., 2019a, https://arxiv.org/abs/1811.11043] (m=2) and [Seznec et al., 2020] (m<=2). We use the confidence level \(\delta_t = \frac{1}{t^{\alpha}}\).
-
__module__
= 'Policies.RAWUCB'¶
-
-
class
Policies.RAWUCB.
EFF_RAWklUCB
(nbArms, subgaussian=1, alpha=1, klucb=<function klucbBern>, tol=0.0001, m=2)[source]¶ Bases:
Policies.RAWUCB.EFF_RAWUCB
Use a KL confidence bound instead of the closed-form approximation. Experimental work: much slower (!!), because we compute many UCB indexes per arm at each round.
-
__init__
(nbArms, subgaussian=1, alpha=1, klucb=<function klucbBern>, tol=0.0001, m=2)[source]¶ New policy.
-
__module__
= 'Policies.RAWUCB'¶
-
-
class
Policies.RAWUCB.
RAWUCB
(nbArms, subgaussian=1, alpha=1)[source]¶ Bases:
Policies.RAWUCB.EFF_RAWUCB
Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2020]. We use the confidence level \(\delta_t = \frac{1}{t^{\alpha}}\).
-
__module__
= 'Policies.RAWUCB'¶
-
-
class
Policies.RAWUCB.
EFF_RAWUCB_pp
(nbArms, subgaussian=1, alpha=1, beta=0, m=2)[source]¶ Bases:
Policies.RAWUCB.EFF_RAWUCB
Efficient Rotting Adaptive Window Upper Confidence Bound ++ (RAW-UCB++) [Seznec et al., 2020, Thesis]. We use the confidence level \(\delta_{t,h} = \frac{Kh}{t(1+\log(t/Kh)^{\beta})}\).
-
__module__
= 'Policies.RAWUCB'¶
-
-
class
Policies.RAWUCB.
RAWUCB_pp
(nbArms, subgaussian=1, beta=2)[source]¶ Bases:
Policies.RAWUCB.EFF_RAWUCB_pp
Rotting Adaptive Window Upper Confidence Bound (RAW-UCB) [Seznec et al., 2019b, WIP]. We use the confidence level \(\delta_t = \frac{Kh}{t^{\alpha}}\).
-
__module__
= 'Policies.RAWUCB'¶
-
Policies.RCB module¶
The RCB, Randomized Confidence Bound, policy for bounded bandits.
- Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
class
Policies.RCB.
RCB
(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.RandomizedIndexPolicy.RandomizedIndexPolicy
,Policies.UCBalpha.UCBalpha
The RCB, Randomized Confidence Bound, policy for bounded bandits.
- Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
__module__
= 'Policies.RCB'¶
Policies.RandomizedIndexPolicy module¶
Generic randomized index policy.
- Reference: [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
VERBOSE
= False¶ Set to True to print debug information about the perturbations.
-
Policies.RandomizedIndexPolicy.
uniform_perturbation
(size=1, low=-1.0, high=1.0)[source]¶ Uniform random perturbation, not from \([0, 1]\) but from \([-1, 1]\), that is \(\mathcal{U}niform([-1, 1])\).
- Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
normal_perturbation
(size=1, loc=0.0, scale=0.25)[source]¶ Normal (Gaussian) random perturbation, with mean
loc=0
and scale (sigma2)scale=0.25
(by default), that is \(\mathcal{N}ormal(loc, scale)\).- Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
gaussian_perturbation
(size=1, loc=0.0, scale=0.25)¶ Normal (Gaussian) random perturbation, with mean
loc=0
and scale (sigma2)scale=0.25
(by default), that is \(\mathcal{N}ormal(loc, scale)\).- Reference: see Corollary 6 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
exponential_perturbation
(size=1, scale=0.25)[source]¶ Exponential random perturbation, with parameter (\(\lambda\))
scale=0.25
(by default), that is \(\mathcal{E}xponential(\lambda)\).- Reference: see Corollary 7 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
gumbel_perturbation
(size=1, loc=0.0, scale=0.25)[source]¶ Gumbel random perturbation, with mean
loc=0
and scalescale=0.25
(by default), that is \(\mathcal{G}umbel(loc, scale)\).- Reference: see Corollary 7 from [[“On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems”, by Baekjin Kim, Ambuj Tewari, arXiv:1902.00610]](https://arxiv.org/pdf/1902.00610.pdf)
-
Policies.RandomizedIndexPolicy.
map_perturbation_str_to_function
= {'exponential': <function exponential_perturbation>, 'gaussian': <function normal_perturbation>, 'gumbel': <function gumbel_perturbation>, 'normal': <function normal_perturbation>, 'uniform': <function uniform_perturbation>}¶ Map perturbation names (like
"uniform"
) to perturbation functions (likeuniform_perturbation()
).
-
class
Policies.RandomizedIndexPolicy.
RandomizedIndexPolicy
(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
Class that implements a generic randomized index policy.
-
__init__
(nbArms, perturbation='uniform', lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- perturbation: [“uniform”, “normal”, “exponential”, “gaussian”] or a function like
numpy.random.uniform()
, - lower, amplitude: lower value and known amplitude of the rewards.
-
perturbation_name
= None¶ Name of the function to generate the random perturbation.
-
perturbation
= None¶ Function to generate the random perturbation.
-
computeIndex
(arm)[source]¶ In a randomized index policy, with distribution \(\mathrm{Distribution}\) generating perturbations \(Z_k(t)\), with index \(I_k(t)\) and mean \(\hat{\mu}_k(t)\) for each arm \(k\), it chooses an arm with maximal perturbed index (uniformly at random):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ Z_k(t) &\sim \mathrm{Distribution}, \\ \mathrm{UCB}_k(t) &= I_k(t) - \hat{\mu}_k(t),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} \hat{\mu}_k(t) + \mathrm{UCB}_k(t) \cdot Z_k(t)).\end{split}\]
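For illustration, a minimal Python sketch of the perturbed index for one arm, with the uniform-on-[-1, 1] perturbation described above as the default; the function name and arguments are illustrative.
import numpy as np

def randomized_index_sketch(index, mean, perturbation=lambda: np.random.uniform(-1.0, 1.0)):
    # The exploration bonus UCB_k(t) = I_k(t) - mu_hat_k(t) is rescaled by a
    # random draw Z_k(t) from the chosen perturbation distribution.
    return mean + (index - mean) * perturbation()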
-
__module__
= 'Policies.RandomizedIndexPolicy'¶
-
computeAllIndex
()[source]¶ In a randomized index policy, with distribution \(\mathrm{Distribution}\) generating perturbations \(Z_k(t)\), with index \(I_k(t)\) and mean \(\hat{\mu}_k(t)\) for each arm \(k\), it chooses an arm with maximal perturbed index (uniformly at random):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ Z_k(t) &\sim \mathrm{Distribution}, \\ \mathrm{UCB}_k(t) &= I_k(t) - \hat{\mu}_k(t),\\ A(t) &\sim U(\arg\max_{1 \leq k \leq K} \hat{\mu}_k(t) + \mathrm{UCB}_k(t) \cdot Z_k(t)).\end{split}\]
-
Policies.SIC_MMAB module¶
SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].
- The algorithm is quite complicated, please see the paper (Algorithm 1, page 6).
- The UCB-H indexes are used, for more details see
Policies.UCBH
.
-
Policies.SIC_MMAB.
c
= 1.0¶ default value, as it was in pymaBandits v1.0
-
Policies.SIC_MMAB.
TOLERANCE
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.SIC_MMAB.
State
¶ Bases:
enum.Enum
Different states during the SIC-MMAB algorithm
-
Communication
= 4¶
-
Estimation
= 2¶
-
Exploitation
= 5¶
-
Exploration
= 3¶
-
Fixation
= 1¶
-
__module__
= 'Policies.SIC_MMAB'¶
-
-
class
Policies.SIC_MMAB.
SIC_MMAB
(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
SIC_MMAB: implementation of the decentralized multi-player policy from [[“SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits”, by Etienne Boursier, Vianney Perchet, arXiv 1809.08151, 2018](https://arxiv.org/abs/1809.08151)].
-
__init__
(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]¶ - nbArms: number of arms,
- horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
- alpha: for the UCB/LCB computations.
Example:
>>> nbArms, horizon = 17, 10000 >>> player1 = SIC_MMAB(nbArms, horizon)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
-
phase
= None¶ Current state
-
horizon
= None¶ Horizon T of the experiment.
-
alpha
= None¶ Parameter \(\alpha\) for the UCB/LCB computations.
-
Time0
= None¶ Parameter \(T_0 = \lceil K \log(T) \rceil\).
-
ext_rank
= None¶ External rank, -1 until known
-
int_rank
= None¶ Internal rank, starts to be 0 then increase when needed
-
nbPlayers
= None¶ Estimated number of players, starts to be 1
-
last_action
= None¶ Keep memory of the last played action (starts randomly)
-
t_phase
= None¶ Number of the phase XXX ?
-
round_number
= None¶ Number of the round XXX ?
-
active_arms
= None¶ Set of active arms (kept as a numpy array)
-
startGame
()[source]¶ Just reinitialize all the internal memory, and decide how to start (state 1 or 2).
-
compute_ucb_lcb
()[source]¶ Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.
- By default, the SIC-MMAB algorithm uses the UCB-H confidence bounds:
\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\end{split}\]- Reference: [Audibert et al. 09].
- Other possibilities include UCB (see
SIC_MMAB_UCB
) and klUCB (seeSIC_MMAB_klUCB
).
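For illustration, a minimal Python sketch of these UCB-H style bounds, vectorized over the active arms and assuming each has been pulled at least once; the function name and numpy-array arguments are illustrative.
import numpy as np

def ucb_lcb_sketch(sums, pulls, horizon, alpha=4.0):
    # mean +/- sqrt(alpha * log(T) / (2 * N_k(t))) for each active arm
    means = sums / pulls
    width = np.sqrt(alpha * np.log(horizon) / (2.0 * pulls))
    return means + width, means - width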
-
getReward
(arm, reward, collision=False)[source]¶ Receive a reward on arm of index ‘arm’, as described by the SIC-MMAB algorithm.
- If not collision, receive a reward after pulling the arm.
-
__module__
= 'Policies.SIC_MMAB'¶
-
-
class
Policies.SIC_MMAB.
SIC_MMAB_UCB
(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False)[source]¶ Bases:
Policies.SIC_MMAB.SIC_MMAB
SIC_MMAB_UCB: SIC-MMAB with the simple UCB-1 confidence bounds.
-
compute_ucb_lcb
()[source]¶ Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.
SIC_MMAB_UCB
uses the simple UCB-1 confidence bounds:
\[\begin{split}\mathrm{UCB}_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}},\\ \mathrm{LCB}_k(t) &= \frac{X_k(t)}{N_k(t)} - \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\end{split}\]- Reference: [Auer et al. 02].
- Other possibilities include UCB-H (the default, see
SIC_MMAB
) and klUCB (seeSIC_MMAB_klUCB
).
-
__module__
= 'Policies.SIC_MMAB'¶
-
-
class
Policies.SIC_MMAB.
SIC_MMAB_klUCB
(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]¶ Bases:
Policies.SIC_MMAB.SIC_MMAB
SIC_MMAB_klUCB: SIC-MMAB with the kl-UCB confidence bounds.
-
__init__
(nbArms, horizon, lower=0.0, amplitude=1.0, alpha=4.0, verbose=False, tolerance=0.0001, klucb=<function klucbBern>, c=1.0)[source]¶ - nbArms: number of arms,
- horizon: to compute the time \(T_0 = \lceil K \log(T) \rceil\),
- alpha: for the UCB/LCB computations.
Example:
>>> nbArms, horizon = 17, 10000 >>> player1 = SIC_MMAB(nbArms, horizon)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, SIC_MMAB, nbArms, horizon=HORIZON).children
-
c
= None¶ Parameter c
-
klucb
= None¶ kl function to use
-
tolerance
= None¶ Numerical tolerance
-
compute_ucb_lcb
()[source]¶ Compute the Upper-Confidence Bound and Lower-Confidence Bound for active arms, at the current time step.
SIC_MMAB_klUCB
uses the simple kl-UCB confidence bounds:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \mathrm{UCB}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathrm{Biais}_k(t) &= \mathrm{UCB}_k(t) - \hat{\mu}_k(t),\\ \mathrm{LCB}_k(t) &= \hat{\mu}_k(t) - \mathrm{Biais}_k(t).\end{split}\]- If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
),
and c is the parameter (default to 1).
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
- Other possibilities include UCB-H (the default, see
SIC_MMAB
) and UCB (see SIC_MMAB_UCB
).
-
__module__
= 'Policies.SIC_MMAB'¶
-
Policies.SWA module¶
Author: Julien Seznec. Sliding Window Average policy for rotting bandits.
Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf], Advances in Neural Information Processing Systems 30 (NIPS 2017), by Nir Levine, Koby Crammer, Shie Mannor.
-
class
Policies.SWA.
SWA
(nbArms, horizon=1, subgaussian=1, maxDecrement=1, alpha=0.2, doublingTrick=False)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The Sliding Window Average policy for rotting bandits. Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf].
-
__init__
(nbArms, horizon=1, subgaussian=1, maxDecrement=1, alpha=0.2, doublingTrick=False)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
__module__
= 'Policies.SWA'¶
-
-
class
Policies.SWA.
wSWA
(nbArms, firstHorizon=1, subgaussian=1, maxDecrement=1, alpha=0.2)[source]¶ Bases:
Policies.SWA.SWA
SWA with doubling trick Reference: [Levine et al., 2017, https://papers.nips.cc/paper/6900-rotting-bandits.pdf].
-
__init__
(nbArms, firstHorizon=1, subgaussian=1, maxDecrement=1, alpha=0.2)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
__module__
= 'Policies.SWA'¶
-
Policies.SWHash_UCB module¶
The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)
Instead of being restricted to UCB, it runs on top of a simple policy, e.g.,
UCB
, andSWHash_IndexPolicy()
is a generic policy using any simple policy with this “sliding window” trick:>>> policy = SWHash_IndexPolicy(nbArms, UCB, tau=100, threshold=0.1) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional, non-fixed \(\mathcal{O}(\tau(t,\alpha))\) memory, and has an extra time cost.
Warning
This implementation is still experimental!
Warning
It can only work on basic index policy based on empirical averages (and an exploration bias), like UCB
, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to reset the history to a smaller one)!
-
Policies.SWHash_UCB.
alpha_for_abruptly_changing_env
(nu=0.5)[source]¶ For an abruptly-changing environment, if the number of break-points is \(\Upsilon_T = \mathcal{O}(T^{\nu})\), then the SW-UCB# algorithm chooses \(\alpha = \frac{1-\nu}{2}\).
-
Policies.SWHash_UCB.
alpha_for_slowly_varying_env
(kappa=1)[source]¶ For a slowly-varying environment, if the change in mean reward between two time steps is bounded by \(\varepsilon_T = \mathcal{O}(T^{-\kappa})\), then the SW-UCB# algorithm chooses \(\alpha = \min\{1, \frac{3\kappa}{4}\}\).
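For illustration, minimal Python sketches of these two tuning rules; the function names are illustrative, not the package's API.
def alpha_for_abrupt_sketch(nu=0.5):
    # With Upsilon_T = O(T^nu) break-points, use alpha = (1 - nu) / 2.
    return (1.0 - nu) / 2.0

def alpha_for_slowly_varying_sketch(kappa=1.0):
    # With drift epsilon_T = O(T^-kappa), use alpha = min(1, 3 * kappa / 4).
    return min(1.0, 3.0 * kappa / 4.0)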
-
Policies.SWHash_UCB.
ALPHA
= 0.5¶ Default parameter for \(\alpha\).
-
Policies.SWHash_UCB.
LAMBDA
= 1¶ Default parameter for \(\lambda\).
-
Policies.SWHash_UCB.
tau_t_alpha
(t, alpha=0.5, lmbda=1)[source]¶ Compute \(\tau(t,\alpha) = \min(\lceil \lambda t^{\alpha} \rceil, t)\).
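For illustration, a minimal Python sketch of this window length; the function name is illustrative.
import math

def tau_t_alpha_sketch(t, alpha=0.5, lmbda=1):
    # tau(t, alpha) = min(ceil(lmbda * t**alpha), t)
    return min(math.ceil(lmbda * t ** alpha), t)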
-
class
Policies.SWHash_UCB.
SWHash_IndexPolicy
(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
The SW-UCB# policy for non-stationary bandits, from [[“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”, by Lai Wei, Vaibhav Srivastava, 2018, arXiv:1802.08380]](https://arxiv.org/pdf/1802.08380)
-
__init__
(nbArms, policy=<class 'Policies.UCBalpha.UCBalpha'>, alpha=0.5, lmbda=1, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ New policy.
-
alpha
= None¶ The parameter \(\alpha\) for the SW-UCB# algorithm (see article for reference).
-
lmbda
= None¶ The parameter \(\lambda\) for the SW-UCB# algorithm (see article for reference).
-
all_rewards
= None¶ Keep in memory all the rewards obtained in all the past steps (the size of the window is evolving!).
-
all_pulls
= None¶ Keep in memory all the pulls obtained in all the past steps (the size of the window is evolving!). Start with -1 (never seen).
-
tau
¶ The current \(\tau(t,\alpha)\).
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update total history and partial history of all arms (normalized in [0, 1]).
Warning
So far this is badly implemented and the algorithm is VERY slow: it has to store the whole past, as the window length increases with t.
-
__module__
= 'Policies.SWHash_UCB'¶
-
Policies.SlidingWindowRestart module¶
An experimental policy, using a sliding window (of, for instance, \(\tau=100\) draws of each arm), and resetting the algorithm as soon as the windowed empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).
Reference: none yet, idea from Rémi Bonnefoi and Lilian Besson.
It runs on top of a simple policy, e.g.,
UCB
, andSlidingWindowRestart()
is a generic policy using any simple policy with this “sliding window” trick:>>> policy = SlidingWindowRestart(nbArms, UCB, tau=100, threshold=0.1) >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau)\) memory but does not cost anything else in terms of time complexity (the average is done with a sliding window, and costs \(\mathcal{O}(1)\) at every time step).
Warning
This is very experimental!
Warning
It can only work on basic index policy based on empirical averages (and an exploration bias), like UCB
, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to reset the history to a smaller one)! Note that it works on Policies.Thompson.Thompson
.
-
Policies.SlidingWindowRestart.
TAU
= 100¶ Size of the sliding window.
-
Policies.SlidingWindowRestart.
THRESHOLD
= 0.005¶ Threshold to know when to restart the base algorithm.
-
Policies.SlidingWindowRestart.
FULL_RESTART_WHEN_REFRESH
= True¶ Should we fully restart the algorithm or simply reset one arm empirical average ?
-
class
Policies.SlidingWindowRestart.
SlidingWindowRestart
(nbArms, policy=<class 'Policies.UCB.UCB'>, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
An experimental policy, using a sliding window of, for instance, \(\tau=100\) draws, and resetting the algorithm as soon as the windowed empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).
-
__init__
(nbArms, policy=<class 'Policies.UCB.UCB'>, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]¶ New policy.
-
last_rewards
= None¶ Keep in memory all the rewards obtained in the last \(\tau\) steps.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the small average is too far away from it.
-
__module__
= 'Policies.SlidingWindowRestart'¶
-
-
class
Policies.SlidingWindowRestart.
SWR_UCB
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]¶ Bases:
Policies.UCB.UCB
An experimental policy, using a sliding window of, for instance, \(\tau=100\) draws, and resetting the algorithm as soon as the windowed empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).
Warning
FIXME I should remove this code, it’s useless now that the generic wrapper
SlidingWindowRestart
works fine.-
__init__
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
tau
= None¶ Size of the sliding window.
-
threshold
= None¶ Threshold to know when to restart the base algorithm.
-
last_rewards
= None¶ Keep in memory all the rewards obtained in the last \(\tau\) steps.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
full_restart_when_refresh
= None¶ Should we fully restart the algorithm or simply reset one arm empirical average ?
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the small average is too far away from it.
-
__module__
= 'Policies.SlidingWindowRestart'¶
-
-
class
Policies.SlidingWindowRestart.
SWR_UCBalpha
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, alpha=4, *args, **kwargs)[source]¶ Bases:
Policies.UCBalpha.UCBalpha
An experimental policy, using a sliding window of, for instance, \(\tau=100\) draws, and resetting the algorithm as soon as the windowed empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).
Warning
FIXME I should remove this code, it’s useless now that the generic wrapper
SlidingWindowRestart
works fine.-
__init__
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, alpha=4, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
tau
= None¶ Size of the sliding window.
-
threshold
= None¶ Threshold to know when to restart the base algorithm.
-
last_rewards
= None¶ Keep in memory all the rewards obtained in the last \(\tau\) steps.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
full_restart_when_refresh
= None¶ Should we fully restart the algorithm or simply reset one arm empirical average ?
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the small average is too far away from it.
-
__module__
= 'Policies.SlidingWindowRestart'¶
-
-
class
Policies.SlidingWindowRestart.
SWR_klUCB
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, *args, **kwargs)[source]¶ Bases:
Policies.klUCB.klUCB
An experimental policy, using a sliding window of, for instance, \(\tau=100\) draws, and resetting the algorithm as soon as the windowed empirical average is too far away from the full-history empirical average (or just restarting for one arm, if possible).
Warning
FIXME I should remove this code, it’s useless now that the generic wrapper
SlidingWindowRestart
works fine.-
__init__
(nbArms, tau=100, threshold=0.005, full_restart_when_refresh=True, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
tau
= None¶ Size of the sliding window.
-
threshold
= None¶ Threshold to know when to restart the base algorithm.
-
last_rewards
= None¶ Keep in memory all the rewards obtained in the last \(\tau\) steps.
-
last_pulls
= None¶ Keep in memory the times where each arm was last seen. Start with -1 (never seen)
-
full_restart_when_refresh
= None¶ Should we fully restart the algorithm or simply reset one arm empirical average ?
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
- Reset the whole empirical average if the small average is too far away from it.
-
__module__
= 'Policies.SlidingWindowRestart'¶
-
Policies.SlidingWindowUCB module¶
An experimental policy, using only a sliding window (of, for instance, \(\tau=1000\) steps, not counting draws of each arm) instead of using the full-size history.
- Reference: [On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
- It uses an additional \(\mathcal{O}(\tau)\) memory but does not cost anything else in terms of time complexity (the average is done with a sliding window, and costs \(\mathcal{O}(1)\) at every time step).
Warning
This is very experimental!
Note
This is similar to SlidingWindowRestart.SWR_UCB
but slightly different: SlidingWindowRestart.SWR_UCB
uses a window of size \(T_0=100\) to keep in memory the last 100 draws of each arm, and restart the index if the small history mean is too far away from the whole mean, while this SWUCB
uses a fixed-size window of size \(\tau=1000\) to keep in memory the last 1000 steps.
-
Policies.SlidingWindowUCB.
TAU
= 1000¶ Size of the sliding window.
-
Policies.SlidingWindowUCB.
ALPHA
= 1.0¶ Default value for the constant \(\alpha\).
-
class
Policies.SlidingWindowUCB.
SWUCB
(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
An experimental policy, using only a sliding window (of, for instance, \(\tau=1000\) steps, not counting the draws of each arm) instead of using the full-size history.
-
__init__
(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
tau
= None¶ Size \(\tau\) of the sliding window.
-
alpha
= None¶ Constant \(\alpha\) in the square-root in the computation for the index.
-
last_rewards
= None¶ Keep in memory all the rewards obtained in the last \(\tau\) steps.
-
last_choices
= None¶ Keep in memory the times where each arm was last seen.
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).
-
computeIndex
(arm)[source]¶ Compute the current index, at time \(t\) and after \(N_{k,\tau}(t)\) pulls of arm \(k\):
\[\begin{split}I_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} + c_{k,\tau}(t),\\ \text{where}\;\; c_{k,\tau}(t) &:= \sqrt{\alpha \frac{\log(\min(t,\tau))}{N_{k,\tau}(t)}},\\ \text{and}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(t) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(t) = k).\end{split}\]
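For illustration, here is a small standalone sketch (plain numpy, independent of the class itself; the function name, signature and toy values are chosen only for the example) of how this index can be recomputed from the window of the last \(\tau\) (arm, reward) pairs:

import numpy as np

def swucb_index(window_arms, window_rewards, arm, t, tau=1000, alpha=1.0):
    # Illustrative recomputation of the SW-UCB index of one arm from the
    # last min(t, tau) pairs (chosen arm, obtained reward).
    arms = np.asarray(window_arms)[-tau:]
    rewards = np.asarray(window_rewards)[-tau:]
    mask = (arms == arm)
    N_k_tau = max(1, int(np.sum(mask)))       # N_{k,tau}(t), guarded against 0
    X_k_tau = float(np.sum(rewards[mask]))    # X_{k,tau}(t)
    return X_k_tau / N_k_tau + np.sqrt(alpha * np.log(min(t, tau)) / N_k_tau)

# Toy check with 3 arms and a short window:
index_of_arm_0 = swucb_index([0, 1, 0, 2, 0], [0.9, 0.1, 0.8, 0.3, 1.0], arm=0, t=5)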
-
__module__
= 'Policies.SlidingWindowUCB'¶
-
-
class
Policies.SlidingWindowUCB.
SWUCBPlus
(nbArms, horizon=None, *args, **kwargs)[source]¶ Bases:
Policies.SlidingWindowUCB.SWUCB
An experimental policy, using only a sliding window (of \(\tau\) steps, not counting the draws of each arm) instead of using the full-size history.
- Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
-
__init__
(nbArms, horizon=None, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
__module__
= 'Policies.SlidingWindowUCB'¶
-
Policies.SlidingWindowUCB.
constant_c
= 1.0¶ default value, as it was in pymaBandits v1.0
-
Policies.SlidingWindowUCB.
tolerance
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.SlidingWindowUCB.
SWklUCB
(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]¶ Bases:
Policies.SlidingWindowUCB.SWUCB
An experimental policy, using only a sliding window (of \(\tau\) steps, not counting the draws of each arm) instead of using the full-size history, and using klUCB (see
Policy.klUCB
) indexes instead of UCB.-
__init__
(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
klucb
= None¶ kl function to use
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(\min(t,\tau))}{N_{k,\tau}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(t) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(t) = k).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
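As an illustration, the upper bound above can be approximated by a simple bisection on the Bernoulli KL divergence; this is only a standalone sketch with illustrative names and toy values (the package has its own optimized KL utilities, see Arms.kullback):

import numpy as np

def kl_bern(p, q, eps=1e-15):
    # Bernoulli Kullback-Leibler divergence kl(p, q).
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def klucb_upper(mean, level, precision=1e-4):
    # Largest q in [mean, 1] such that kl(mean, q) <= level, found by bisection.
    lower, upper = mean, 1.0
    while upper - lower > precision:
        mid = (lower + upper) / 2.0
        if kl_bern(mean, mid) > level:
            upper = mid
        else:
            lower = mid
    return (lower + upper) / 2.0

# Sliding-window kl-UCB index of one arm, with toy values:
X_k_tau, N_k_tau, t, tau, c = 30.0, 50, 400, 1000, 1.0
index = klucb_upper(X_k_tau / N_k_tau, c * np.log(min(t, tau)) / N_k_tau)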
-
__module__
= 'Policies.SlidingWindowUCB'¶
-
-
class
Policies.SlidingWindowUCB.
SWklUCBPlus
(nbArms, tau=1000, klucb=<function klucbBern>, *args, **kwargs)[source]¶ Bases:
Policies.SlidingWindowUCB.SWklUCB
,Policies.SlidingWindowUCB.SWUCBPlus
An experimental policy, using only a sliding window (of \(\tau\) steps, not counting the draws of each arm) instead of using the full-size history, and using klUCB (see
Policy.klUCB
) indexes instead of UCB.- Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
-
__module__
= 'Policies.SlidingWindowUCB'¶
Policies.Softmax module¶
The Boltzmann Exploration (Softmax) index policy.
- Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
- Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
-
Policies.Softmax.
UNBIASED
= False¶ self.unbiased is a flag to know if the rewards are used as biased estimator, i.e., just \(r_t\), or unbiased estimators, \(r_t / trusts_t\).
-
class
Policies.Softmax.
Softmax
(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
The Boltzmann Exploration (Softmax) index policy, with a constant temperature \(\eta_t\).
- Reference: [Algorithms for the multi-armed bandit problem, V.Kuleshov & D.Precup, JMLR, 2008, §2.1](http://www.cs.mcgill.ca/~vkules/bandits.pdf) and [Boltzmann Exploration Done Right, N.Cesa-Bianchi & C.Gentile & G.Lugosi & G.Neu, arXiv 2017](https://arxiv.org/pdf/1705.10257.pdf).
- Very similar to Exp3 but uses a Boltzmann distribution. Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://sbubeck.com/SurveyBCB12.pdf)
-
unbiased
= None¶ Flag
-
temperature
¶ Constant temperature, \(\eta_t\).
-
trusts
¶ Update the trusts probabilities according to the Softmax (ie Boltzmann) distribution on accumulated rewards, and with the temperature \(\eta_t\).
\[\begin{split}\mathrm{trusts}'_k(t+1) &= \exp\left( \frac{X_k(t)}{\eta_t N_k(t)} \right) \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]where \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\) is the sum of rewards from arm k.
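A minimal numpy sketch of this update (illustrative function name and toy numbers, not the class's own code):

import numpy as np

def softmax_trusts(cum_rewards, pulls, temperature):
    # Boltzmann distribution over arms from cumulative rewards X_k(t) and pulls N_k(t).
    pulls = np.maximum(pulls, 1)                   # guard for arms never pulled yet
    scores = cum_rewards / (temperature * pulls)   # X_k(t) / (eta_t * N_k(t))
    scores = scores - scores.max()                 # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

trusts = softmax_trusts(np.array([12.0, 30.0, 5.0]), np.array([20, 40, 10]), temperature=0.1)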
-
choice
()[source]¶ One random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceWithRank
(rank=1)[source]¶ Multiple (rank >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
, and selecting the last one (the least probable one).- Note that if not enough entries in the trust vector are non-zero, then
choice()
is called instead (rank is ignored).
-
choiceFromSubSet
(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
choiceMultiple
(nb=1)[source]¶ Multiple (nb >= 1) random selection, with probabilities = trusts, thanks to
numpy.random.choice()
.
-
estimatedOrder
()[source]¶ Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
-
__module__
= 'Policies.Softmax'¶
-
class
Policies.Softmax.
SoftmaxWithHorizon
(nbArms, horizon, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Softmax.Softmax
Softmax with fixed temperature \(\eta_t = \eta_0\) chosen with a knowledge of the horizon.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
temperature
¶ Fixed temperature, small, chosen with knowledge of the horizon: \(\eta_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
- Cf. Theorem 3.1 case #1 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Softmax'¶
-
-
class
Policies.Softmax.
SoftmaxDecreasing
(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Softmax.Softmax
Softmax with decreasing temperature \(\eta_t\).
-
temperature
¶ Decreasing temperature with the time: \(\eta_t = \sqrt{\frac{\log(K)}{t K}}\) (heuristic).
- Cf. Theorem 3.1 case #2 of [Bubeck & Cesa-Bianchi, 2012](http://sbubeck.com/SurveyBCB12.pdf).
-
__module__
= 'Policies.Softmax'¶
-
-
class
Policies.Softmax.
SoftMix
(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Softmax.Softmax
Another Softmax with decreasing temperature \(\eta_t\).
-
__module__
= 'Policies.Softmax'¶
-
temperature
¶ Decreasing temperature with the time: \(\eta_t = c \frac{\log(t)}{t}\) (heuristic).
- Cf. [Cesa-Bianchi & Fisher, 1998](http://dl.acm.org/citation.cfm?id=657473).
- Default value is \(c = \sqrt{\frac{\log(K)}{K}}\).
-
Policies.SparseUCB module¶
The SparseUCB policy, designed to tackle sparse stochastic bandit problems:
- This means that only a small subset of size
s
of theK
arms has non-zero means. - The SparseUCB algorithm requires knowing exactly the value of
s
. - Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)].
Warning
This algorithm only works for sparse Gaussian (or sub-Gaussian) stochastic bandits.
-
class
Policies.SparseUCB.
Phase
¶ Bases:
enum.Enum
Different states during the SparseUCB algorithm.
RoundRobin
means all are sampled once.ForceLog
uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).UCB
is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
-
ForceLog
= 2¶
-
RoundRobin
= 1¶
-
UCB
= 3¶
-
__module__
= 'Policies.SparseUCB'¶
-
Policies.SparseUCB.
ALPHA
= 4¶ Default parameter for \(\alpha\) for the UCB indexes.
-
class
Policies.SparseUCB.
SparseUCB
(nbArms, sparsity=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBalpha.UCBalpha
The SparseUCB policy, designed to tackle sparse stochastic bandit problems.
- By default, assume
sparsity
=nbArms
.
-
__init__
(nbArms, sparsity=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
sparsity
= None¶ Known value of the sparsity of the current problem.
-
phase
= None¶ Current phase of the algorithm.
-
force_to_see
= None¶ Binary array for the set \(\mathcal{J}(t)\).
-
goods
= None¶ Binary array for the set \(\mathcal{K}(t)\).
-
offset
= None¶ Next arm to sample, for the Round-Robin phase
-
update_j
()[source]¶ Recompute the set \(\mathcal{J}(t)\):
\[\mathcal{J}(t) = \left\{ k \in [1,...,K]\;, \frac{X_k(t)}{N_k(t)} \geq \sqrt{\frac{\alpha \log(N_k(t))}{N_k(t)}} \right\}.\]
-
update_k
()[source]¶ Recompute the set \(\mathcal{K}(t)\):
\[\mathcal{K}(t) = \left\{ k \in [1,...,K]\;, \frac{X_k(t)}{N_k(t)} \geq \sqrt{\frac{\alpha \log(t)}{N_k(t)}} \right\}.\]
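For illustration, both sets can be recomputed in a few lines of numpy from the cumulative rewards and pull counts; this is only a sketch with illustrative names, assuming every arm has been pulled at least once:

import numpy as np

def sparse_ucb_sets(X, N, t, alpha=4.0):
    # Recompute the sets J(t) and K(t) as boolean arrays, from the cumulative
    # rewards X_k(t) and the pull counts N_k(t).
    means = X / N
    J = means >= np.sqrt(alpha * np.log(N) / N)   # set J(t): threshold uses log(N_k(t))
    K = means >= np.sqrt(alpha * np.log(t) / N)   # set K(t): threshold uses log(t)
    return J, K

J, K = sparse_ucb_sets(np.array([3.0, 0.4, 2.5]), np.array([10, 10, 10]), t=30)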
-
choice
()[source]¶ Choose the next arm to play:
- If still in a Round-Robin phase, play the next arm,
- Otherwise, recompute the set \(\mathcal{J}(t)\),
- If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
- Start a new Round-Robin phase from arm 0.
- Otherwise, recompute the second set \(\mathcal{K}(t)\),
- If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
- Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
- Otherwise,
- Play a UCB step by choosing an arm with highest UCB index from the set \(\mathcal{K}(t)\).
-
__module__
= 'Policies.SparseUCB'¶
- By default, assume
Policies.SparseWrapper module¶
The SparseWrapper policy, designed to tackle sparse stochastic bandit problems:
- This means that only a small subset of size
s
of theK
arms has non-zero means. - The SparseWrapper algorithm requires knowing exactly the value of
s
. - This SparseWrapper is a very generic version, and can use any index policy for both the decision in the UCB phase and the construction of the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).
- The usual UCB indexes can be used for the set \(\mathcal{K}(t)\) by setting the flag
use_ucb_for_set_K
to true (but usually the indexes from the underlying policy can be used efficiently for set \(\mathcal{K}(t)\)), and for the set \(\mathcal{J}(t)\) by setting the flaguse_ucb_for_set_J
to true (as its formula is less easily generalized). - If used with
Policy.UCBalpha
orPolicy.klUCB
, it should be better to use directlyPolicy.SparseUCB
orPolicy.SparseklUCB
. - Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)] who introduced SparseUCB.
Warning
This is very EXPERIMENTAL! I have no proof yet! But it works fine!!
-
Policies.SparseWrapper.
default_index_policy
¶ alias of
Policies.UCBalpha.UCBalpha
-
class
Policies.SparseWrapper.
Phase
¶ Bases:
enum.Enum
Different states during the SparseWrapper algorithm.
RoundRobin
means all are sampled once.ForceLog
uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).UCB
is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
-
ForceLog
= 2¶
-
RoundRobin
= 1¶
-
UCB
= 3¶
-
__module__
= 'Policies.SparseWrapper'¶
-
Policies.SparseWrapper.
USE_UCB_FOR_SET_K
= False¶ Default value for the flag controlling whether the usual UCB indexes are used for the set \(\mathcal{K}(t)\). The default is to use the indexes of the underlying policy, which could be more efficient.
-
Policies.SparseWrapper.
USE_UCB_FOR_SET_J
= False¶ Default value for the flag controlling whether the usual UCB indexes are used for the set \(\mathcal{J}(t)\). The default is to use the UCB indexes, as there is no clean and generic formula to obtain the indexes for \(\mathcal{J}(t)\) from the indexes of the underlying policy. Note that I found a formula, it is just dirty. See below.
-
Policies.SparseWrapper.
ALPHA
= 1¶ Default parameter for \(\alpha\) for the UCB indexes.
-
class
Policies.SparseWrapper.
SparseWrapper
(nbArms, sparsity=None, use_ucb_for_set_K=False, use_ucb_for_set_J=False, alpha=1, policy=<class 'Policies.UCBalpha.UCBalpha'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BaseWrapperPolicy.BaseWrapperPolicy
The SparseWrapper policy, designed to tackle sparse stochastic bandit problems.
- By default, assume
sparsity
=nbArms
.
-
__init__
(nbArms, sparsity=None, use_ucb_for_set_K=False, use_ucb_for_set_J=False, alpha=1, policy=<class 'Policies.UCBalpha.UCBalpha'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ New policy.
-
sparsity
= None¶ Known value of the sparsity of the current problem.
-
use_ucb_for_set_K
= None¶ Whether the usual UCB indexes are used for the set \(\mathcal{K}(t)\).
-
use_ucb_for_set_J
= None¶ Whether the usual UCB indexes are used for the set \(\mathcal{J}(t)\).
-
alpha
= None¶ Parameter \(\alpha\) for the UCB indexes for the two sets, if not using the indexes of the underlying policy.
-
phase
= None¶ Current phase of the algorithm.
-
force_to_see
= None¶ Binary array for the set \(\mathcal{J}(t)\).
-
goods
= None¶ Binary array for the set \(\mathcal{K}(t)\).
-
offset
= None¶ Next arm to sample, for the Round-Robin phase
-
update_j
()[source]¶ Recompute the set \(\mathcal{J}(t)\):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= I_k^{P}(t) - \hat{\mu}_k(t),\\ U^{\mathcal{J}}_k(t) &= U^{\mathcal{K}}_k(t) \times \sqrt{\frac{\log(N_k(t))}{\log(t)}},\\ \mathcal{J}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{J}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]- Yes, this is nothing but a hack, as there is no generic formula to retrieve the indexes used in the set \(\mathcal{J}(t)\) from the indexes \(I_k^{P}(t)\) of the underlying index policy \(P\).
- If
use_ucb_for_set_J
isTrue
, the same formula fromPolicies.SparseUCB
is used.
Warning
FIXME rewrite the above with LCB and UCB instead of this weird U - mean.
-
__module__
= 'Policies.SparseWrapper'¶
-
update_k
()[source]¶ Recompute the set \(\mathcal{K}(t)\):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= I_k^{P}(t) - \hat{\mu}_k(t),\\ \mathcal{K}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{K}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]- If
use_ucb_for_set_K
isTrue
, the same formula fromPolicies.SparseUCB
is used.
- If
-
choice
()[source]¶ Choose the next arm to play:
- If still in a Round-Robin phase, play the next arm,
- Otherwise, recompute the set \(\mathcal{J}(t)\),
- If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
- Start a new Round-Robin phase from arm 0.
- Otherwise, recompute the second set \(\mathcal{K}(t)\),
- If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
- Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
- Otherwise,
- Play a UCB step by choosing an arm with highest index (from the underlying policy) from the set \(\mathcal{K}(t)\).
- By default, assume
Policies.SparseklUCB module¶
The SparseklUCB policy, designed to tackle sparse stochastic bandit problems:
- This means that only a small subset of size
s
of theK
arms has non-zero means. - The SparseklUCB algorithm requires knowing exactly the value of
s
. - This SparseklUCB is my version. It uses the KL-UCB index for both the decision in the UCB phase and the construction of the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).
- The usual UCB indexes can be used for the sets by setting the flag
use_ucb_for_sets
to true. - Reference: [[“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017](https://arxiv.org/abs/1706.01383)] who introduced SparseUCB.
Warning
This algorithm only works for sparse Gaussian (or sub-Gaussian) stochastic bandits, of known variance.
-
class
Policies.SparseklUCB.
Phase
¶ Bases:
enum.Enum
Different states during the SparseklUCB algorithm.
RoundRobin
means all are sampled once.ForceLog
uniformly explores arms that are in the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).UCB
is the phase that the algorithm should converge to, when a normal UCB selection is done only on the “good” arms, i.e., \(\mathcal{K}(t)\).
-
ForceLog
= 2¶
-
RoundRobin
= 1¶
-
UCB
= 3¶
-
__module__
= 'Policies.SparseklUCB'¶
-
Policies.SparseklUCB.
c
= 1.0¶ default value, as it was in pymaBandits v1.0
-
Policies.SparseklUCB.
USE_UCB_FOR_SETS
= False¶ Default value for the flag controlling whether the usual UCB indexes are used for the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\). The default is to use the KL-UCB indexes, which should be more efficient.
-
class
Policies.SparseklUCB.
SparseklUCB
(nbArms, sparsity=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, use_ucb_for_sets=False, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The SparseklUCB policy, designed to tackle sparse stochastic bandit problems.
- By default, assume
sparsity
=nbArms
.
-
__init__
(nbArms, sparsity=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, use_ucb_for_sets=False, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
sparsity
= None¶ Known value of the sparsity of the current problem.
-
use_ucb_for_sets
= None¶ Whether the usual UCB indexes are used for the sets \(\mathcal{J}(t)\) and \(\mathcal{K}(t)\).
-
phase
= None¶ Current phase of the algorithm.
-
force_to_see
= None¶ Binary array for the set \(\mathcal{J}(t)\).
-
goods
= None¶ Binary array for the set \(\mathcal{K}(t)\).
-
offset
= None¶ Next arm to sample, for the Round-Robin phase
-
update_j
()[source]¶ Recompute the set \(\mathcal{J}(t)\):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{J}}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(N_k(t))}{N_k(t)} \right\},\\ \mathcal{J}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{J}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]- If
use_ucb_for_sets
isTrue
, the same formula fromPolicies.SparseUCB
is used.
- If
-
update_k
()[source]¶ Recompute the set \(\mathcal{K}(t)\):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U^{\mathcal{K}}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ \mathcal{K}(t) &= \left\{ k \in [1,...,K]\;, \hat{\mu}_k(t) \geq U^{\mathcal{K}}_k(t) - \hat{\mu}_k(t) \right\}.\end{split}\]- If
use_ucb_for_sets
isTrue
, the same formula fromPolicies.SparseUCB
is used.
- If
-
__module__
= 'Policies.SparseklUCB'¶
-
choice
()[source]¶ Choose the next arm to play:
- If still in a Round-Robin phase, play the next arm,
- Otherwise, recompute the set \(\mathcal{J}(t)\),
- If it is too small, i.e., if \(|\mathcal{J}(t)| < s\):
- Start a new Round-Robin phase from arm 0.
- Otherwise, recompute the second set \(\mathcal{K}(t)\),
- If it is too small, i.e., if \(|\mathcal{K}(t)| < s\):
- Play a Force-Log step by choosing an arm uniformly at random from the set \(\mathcal{J}(t) \setminus \mathcal{K}(t)\).
- Otherwise,
- Play a UCB step by choosing an arm with highest KL-UCB index from the set \(\mathcal{K}(t)\).
- By default, assume
Policies.SuccessiveElimination module¶
Generic policy based on successive elimination, mostly useless except to maintain a clear hierarchy of inheritance.
-
class
Policies.SuccessiveElimination.
SuccessiveElimination
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
Generic policy based on successive elimination, mostly useless except to maintain a clear hierarchy of inheritance.
-
choice
()[source]¶ In policy based on successive elimination, choosing an arm is the same as choosing an arm from the set of active arms (
self.activeArms
) with methodchoiceFromSubSet
.
-
__module__
= 'Policies.SuccessiveElimination'¶
-
Policies.TakeFixedArm module¶
TakeFixedArm: always select a fixed arm. This is the perfect static policy if armIndex = bestArmIndex (not realistic, for test only).
-
class
Policies.TakeFixedArm.
TakeFixedArm
(nbArms, armIndex=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
TakeFixedArm: always select a fixed arm. This is the perfect static policy if armIndex = bestArmIndex (not realistic, for test only).
-
nbArms
= None¶ Number of arms
-
armIndex
= None¶ Fixed arm
-
__module__
= 'Policies.TakeFixedArm'¶
-
Policies.TakeRandomFixedArm module¶
TakeRandomFixedArm: first selects a random sub-set of arms, then always selects from it (not realistic, for test only).
-
class
Policies.TakeRandomFixedArm.
TakeRandomFixedArm
(nbArms, lower=0.0, amplitude=1.0, nbArmIndexes=None)[source]¶ Bases:
Policies.TakeFixedArm.TakeFixedArm
TakeRandomFixedArm: first selects a random sub-set of arms, then always selects from it.
-
nbArms
= None¶ Number of arms
-
armIndexes
= None¶ Fix the set of arms
-
__module__
= 'Policies.TakeRandomFixedArm'¶
-
Policies.Thompson module¶
The Thompson (Bayesian) index policy.
- By default, it uses a Beta posterior (
Policies.Posterior.Beta
), one per arm. - Reference: [Thompson - Biometrika, 1933].
-
class
Policies.Thompson.
Thompson
(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BayesianIndexPolicy.BayesianIndexPolicy
The Thompson (Bayesian) index policy.
By default, it uses a Beta posterior (
Policies.Posterior.Beta
), one per arm. The prior is initially flat, i.e., \(a=\alpha_0=1\) and \(b=\beta_0=1\).
A non-flat prior for each arm can be given with parameters
a
andb
, for instance:nbArms = 2 prior_failures = a = 100 prior_successes = b = 50 policy = Thompson(nbArms, a=a, b=b) np.mean([policy.choice() for _ in range(1000)]) # 0.515 ~= 0.5: each arm has same prior!
A different prior for each arm can be given with parameters
params_for_each_posterior
, for instance:nbArms = 2 params0 = { 'a': 10, 'b': 5} # mean 1/3 params1 = { 'a': 5, 'b': 10} # mean 2/3 params = [params0, params1] policy = Thompson(nbArms, params_for_each_posterior=params) np.mean([policy.choice() for _ in range(1000)]) # 0.9719 ~= 1: arm 1 is better than arm 0 !
Reference: [Thompson - Biometrika, 1933].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k, giving \(S_k(t)\) rewards of 1, by sampling from the Beta posterior:
\[\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \tilde{S_k}(t), 1 + \tilde{N_k}(t) - \tilde{S_k}(t)).\end{split}\]
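A minimal numpy sketch of this sampling step (illustrative names and toy counts, not the class's own code), drawing one Beta sample per arm and playing the arm with the largest sample:

import numpy as np

def thompson_indexes(successes, pulls):
    # One Beta(1 + S_k, 1 + N_k - S_k) sample per arm; play the arm with the largest sample.
    return np.random.beta(1 + successes, 1 + pulls - successes)

samples = thompson_indexes(np.array([5, 2, 9]), np.array([10, 10, 10]))
arm = int(np.argmax(samples))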
-
__module__
= 'Policies.Thompson'¶
Policies.TrekkingTSN module¶
TrekkingTSN: implementation of the decentralized multi-player policy from [R.Kumar, A.Yadav, S.J.Darak, M.K.Hanawal, Trekking based Distributed Algorithm for Opportunistic Spectrum Access in Infrastructure-less Network, 2018](XXX).
- Each player has 3 states: the 1st phase is channel characterization, the 2nd is the Trekking phase
- 1st step
- FIXME
- 2nd step:
- FIXME
-
Policies.TrekkingTSN.
special_times
(nbArms=10, theta=0.01, epsilon=0.1, delta=0.05)[source]¶ Compute the lower-bound suggesting “large-enough” values for the different parameters \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) that should guarantee constant regret with probability at least \(1 - \delta\), if the gap \(\Delta\) is larger than \(\epsilon\) and the smallest mean is larger than \(\theta\).
\[\begin{split}T_{RH} &= \frac{\log(\frac{\delta}{3 K})}{\log(1 - \theta (1 - \frac{1}{K})^{K-1})} \\ T_{SH} &= (2 K / \varepsilon^2) \log(\frac{2 K^2}{\delta / 3}) \\ T_{TR} &= \lceil\frac{\log((\delta / 3) K XXX)}{\log(1 - \theta)} \rceil \frac{(K - 1) K}{2}.\end{split}\]- Cf. Theorem 1 of [Kumar et al., 2018](XXX).
- Examples:
>>> nbArms = 8
>>> theta = Delta = 0.07
>>> epsilon = theta
>>> delta = 0.1
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(197, 26949, -280)
>>> delta = 0.01
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(279, 34468, 616)
>>> delta = 0.001
>>> special_times(nbArms=nbArms, theta=theta, epsilon=epsilon, delta=delta)  # doctest: +ELLIPSIS
(362, 41987, 1512)
-
Policies.TrekkingTSN.
boundOnFinalRegret
(T_RH, T_SH, T_TR, nbPlayers, nbArms)[source]¶ Use the upper-bound on regret when \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) and \(M\) are known.
The “constant” regret of course grows linearly with \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\), as:
\[\forall T \geq T_{RH} + T_{SH} + T_{TR}, \;\; R_T \leq M (T_{RH} + (1 - \frac{M}{K}) T_{SH} + T_{TR}).\]
Warning
this bound is not a deterministic result, it is only valid with a certain probability (at least \(1 - \delta\), if \(T_{RH}\), \(T_{SH}\) and \(T_{TR}\) are chosen as given by
special_times()
).- Cf. Theorem 1 of [Kumar et al., 2018](XXX).
- Examples:
>>> boundOnFinalRegret(197, 26949, -280, 2, 8)  # doctest: +ELLIPSIS
40257.5
>>> boundOnFinalRegret(279, 34468, 616, 2, 8)  # doctest: +ELLIPSIS
53492.0
>>> boundOnFinalRegret(362, 41987, 1512, 2, 8)  # doctest: +ELLIPSIS
66728.5
- For \(M=5\):
>>> boundOnFinalRegret(197, 26949, -280, 5, 8)  # doctest: +ELLIPSIS
50114.375
>>> boundOnFinalRegret(279, 34468, 616, 5, 8)  # doctest: +ELLIPSIS
69102.5
>>> boundOnFinalRegret(362, 41987, 1512, 5, 8)  # doctest: +ELLIPSIS
88095.625
- For \(M=K=8\):
>>> boundOnFinalRegret(197, 26949, -280, 8, 8)  # doctest: +ELLIPSIS
-664.0  # there is something wrong with T_TR !
>>> boundOnFinalRegret(279, 34468, 616, 8, 8)  # doctest: +ELLIPSIS
7160.0
>>> boundOnFinalRegret(362, 41987, 1512, 8, 8)  # doctest: +ELLIPSIS
14992.0
-
class
Policies.TrekkingTSN.
State
¶ Bases:
enum.Enum
Different states during the TrekkingTSN algorithm
-
ChannelCharacterization
= 2¶
-
NotStarted
= 1¶
-
TrekkingTSN
= 3¶
-
__module__
= 'Policies.TrekkingTSN'¶
-
-
class
Policies.TrekkingTSN.
TrekkingTSN
(nbArms, theta=0.01, epsilon=0.1, delta=0.05, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
TrekkingTSN: implementation of the single-player policy from [R.Kumar, A.Yadav, S.J.Darak, M.K.Hanawal, Trekking based Distributed Algorithm for Opportunistic Spectrum Access in Infrastructure-less Network, 2018](XXX).
-
__init__
(nbArms, theta=0.01, epsilon=0.1, delta=0.05, lower=0.0, amplitude=1.0)[source]¶ - nbArms: number of arms,
Example:
>>> nbArms = 8
>>> theta, epsilon, delta = 0.01, 0.1, 0.05
>>> player1 = TrekkingTSN(nbArms, theta=theta, epsilon=epsilon, delta=delta)
For multi-players use:
>>> configuration["players"] = Selfish(NB_PLAYERS, TrekkingTSN, nbArms, theta=theta, epsilon=epsilon, delta=delta).children
-
state
= None¶ Current state
-
theta
= None¶ Parameter \(\theta\).
-
epsilon
= None¶ Parameter \(\epsilon\).
-
delta
= None¶ Parameter \(\delta\).
-
T_RH
= None¶ Parameter \(T_{RH}\) computed from
special_times()
-
T_SH
= None¶ Parameter \(T_{SH}\) computed from
special_times()
-
T_CC
= None¶ Parameter \(T_{CC} = T_{RH} + T_{SH}\)
-
T_TR
= None¶ Parameter \(T_{TR}\) computed from
special_times()
-
last_was_successful
= None¶ That’s the l of the paper
-
last_choice
= None¶ Keep memory of the last choice for CC phase
-
cumulatedRewards
= None¶ That’s the V_n of the paper
-
nbObservations
= None¶ That’s the S_n of the paper
-
lock_channel
= None¶ That’s the L of the paper
-
t
= None¶ Internal times
-
startGame
()[source]¶ Just reinitialize all the internal memory, and decide how to start (state 1 or 2).
-
getReward
(arm, reward)[source]¶ Receive a reward on arm of index ‘arm’, as described by the TrekkingTSN algorithm.
- If not collision, receive a reward after pulling the arm.
-
handleCollision
(arm, reward=None)[source]¶ Handle a collision, on arm of index ‘arm’.
- Warning: this method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
-
__module__
= 'Policies.TrekkingTSN'¶
-
Policies.TsallisInf module¶
The 1/2-Tsallis-Inf policy for bounded bandits, (order) optimal for stochastic and adversarial bandits.
- Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)
-
Policies.TsallisInf.
ALPHA
= 0.5¶ Default value for \(\alpha\), the parameter of the Tsallis entropy. We focus on the 1/2-Tsallis algorithm, i.e., with \(\alpha=\frac{1}{2}\).
-
class
Policies.TsallisInf.
TsallisInf
(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Exp3.Exp3
The 1/2-Tsallis-Inf policy for bounded bandits, (order) optimal for stochastic and adversarial bandits.
- Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)
-
alpha
= None¶ Store the constant \(\alpha\) used by the Online-Mirror-Descent step using \(\alpha\) Tsallis entropy.
-
inverse_exponent
= None¶ Store \(\frac{1}{\alpha-1}\) to only compute it once.
-
cumulative_losses
= None¶ Keep in memory the vector \(\hat{L}_t\) of cumulative (unbiased estimates) of losses.
-
eta
¶ Decreasing learning rate, \(\eta_t = \frac{1}{\sqrt{t}}\).
-
trusts
¶ Trusts probabilities \(\mathrm{trusts}(t+1)\) are just the normalized weights \(w_k(t)\).
-
getReward
(arm, reward)[source]¶ Give a reward: accumulate rewards on that arm k, then recompute the trusts.
Compute the trusts probabilities \(w_k(t)\) with one step of Online-Mirror-Descent for bandit, using the \(\alpha\) Tsallis entropy for the \(\Psi_t\) functions.
\[\begin{split}\mathrm{trusts}'_k(t+1) &= \nabla (\Psi_t + \mathcal{I}_{\Delta^K})^* (- \hat{L}_{t-1}), \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]- where \(\Delta^K\) is the probability simplex of dimension \(K\),
- \(\hat{L}_{t-1}\) is the cumulative loss vector, i.e., the sum of the unbiased loss estimates over the previous time steps,
- \(\hat{\ell}_{t,i} = 1(I_t = i) \frac{\ell_{t,i}}{\mathrm{trusts}_i(t)}\) is the unbiased estimate of the loss,
- \(\Psi_t = \Psi_{t,\alpha}(w) := - \sum_{k=1}^{K} \frac{w_k^{\alpha}}{\alpha \eta_t}\),
- and \(\eta_t = \frac{1}{\sqrt{t}}\) is the (decreasing) learning rate. A small numerical sketch of this step is given just below.
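Purely illustrative numerical sketch of this Online-Mirror-Descent step in the 1/2-Tsallis case, where (following the reference above) the weights take the form \(w_k = 4 (\eta_t (\hat{L}_{t-1,k} - x))^{-2}\) and the normalization constant \(x\) is found here by bisection; all names and values are chosen for the example only:

import numpy as np

def tsallis_inf_trusts(cum_losses, t, n_iter=60):
    # Illustrative OMD step for alpha = 1/2: weights w_k = 4 / (eta_t * (L_k - x))^2,
    # with the normalization x found by bisection so that the weights sum to 1.
    eta = 1.0 / np.sqrt(t)                    # decreasing learning rate eta_t = 1 / sqrt(t)
    L = np.asarray(cum_losses, dtype=float)   # cumulative (estimated) losses
    K = len(L)
    lo = L.min() - 2.0 * np.sqrt(K) / eta     # here the weights sum to at most 1
    hi = L.min() - 1e-12                      # here the weights sum to (much) more than 1
    for _ in range(n_iter):
        x = (lo + hi) / 2.0
        if (4.0 / (eta * (L - x)) ** 2).sum() > 1.0:
            hi = x
        else:
            lo = x
    w = 4.0 / (eta * (L - (lo + hi) / 2.0)) ** 2
    return w / w.sum()                        # renormalize away the residual bisection error

trusts = tsallis_inf_trusts(cum_losses=[3.2, 4.0, 4.5], t=10)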
-
__module__
= 'Policies.TsallisInf'¶
Policies.UCB module¶
The UCB policy for bounded bandits.
- Reference: [Lai & Robbins, 1985].
-
class
Policies.UCB.
UCB
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB policy for bounded bandits.
- Reference: [Lai & Robbins, 1985].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \log(t)}{N_k(t)}}.\]
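For illustration, the index and a minimal simulation loop look like this; the import path and the startGame() call are assumptions (based on the PyPI package name and the documented API), and the remaining names are chosen only for the example:

import numpy as np

def ucb_index(X_k, N_k, t):
    # The index above, from cumulative reward X_k(t), number of pulls N_k(t), and time t.
    return X_k / N_k + np.sqrt(2.0 * np.log(t) / N_k)

print(ucb_index(X_k=7.0, N_k=10, t=100))  # 0.7 + sqrt(2 * log(100) / 10) ~= 1.66

# Hypothetical minimal single-player loop (choice() and getReward() are documented above):
# from SMPyBandits.Policies import UCB
# policy = UCB(nbArms=3)
# policy.startGame()
# for t in range(1000):
#     arm = policy.choice()
#     reward = float(np.random.rand() < [0.1, 0.5, 0.9][arm])  # simulated Bernoulli arms
#     policy.getReward(arm, reward)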
-
__module__
= 'Policies.UCB'¶
Policies.UCBH module¶
The UCB-H policy for bounded bandits, with known horizon. Reference: [Audibert et al. 09].
-
class
Policies.UCBH.
UCBH
(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBalpha.UCBalpha
The UCB-H policy for bounded bandits, with known horizon. Reference: [Audibert et al. 09].
-
__init__
(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
alpha
= None¶ Parameter alpha
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(T)}{2 N_k(t)}}.\]
-
__module__
= 'Policies.UCBH'¶
-
Policies.UCBV module¶
The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].
-
class
Policies.UCBV.
UCBV
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].
-
__init__
(nbArms, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
rewardsSquared
= None¶ Keep track of the squares of the rewards, to compute an empirical variance
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards and of rewards squared for that arm (normalized in [0, 1]).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{2 \log(t) V_k(t)}{N_k(t)}} + 3 (b - a) \frac{\log(t)}{N_k(t)}.\end{split}\]Where rewards are in \([a, b]\), and \(V_k(t)\) is an estimator of the variance of rewards, obtained from \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\), the sum of rewards from arm k, and \(Z_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)^2\), the sum of squared rewards.
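A standalone sketch of this index (illustrative names and toy values, not the class's own code):

import numpy as np

def ucbv_index(X_k, Z_k, N_k, t, a=0.0, b=1.0):
    # UCB-V index from cumulative rewards X_k(t), cumulative squared rewards Z_k(t),
    # pulls N_k(t) and time t, for rewards in [a, b].
    mean = X_k / N_k
    var = max(0.0, Z_k / N_k - mean ** 2)   # empirical variance V_k(t), clipped at 0
    return mean + np.sqrt(2.0 * np.log(t) * var / N_k) + 3.0 * (b - a) * np.log(t) / N_k

index = ucbv_index(X_k=6.0, Z_k=4.2, N_k=10, t=100)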
-
__module__
= 'Policies.UCBV'¶
-
Policies.UCBVtuned module¶
The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].
-
class
Policies.UCBVtuned.
UCBVtuned
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBV.UCBV
The UCBV-Tuned policy for bounded bandits, with a tuned variance correction term. Reference: [Auer et al. 02].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ V'_k(t) &= V_k(t) + \sqrt{\frac{2 \log(t)}{N_k(t)}}, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{\log(t) V'_k(t)}{N_k(t)}}.\end{split}\]Where \(V'_k(t)\) is another estimator of the variance of rewards, obtained from \(X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)\), the sum of rewards from arm k, and \(Z_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)^2\), the sum of squared rewards.
-
__module__
= 'Policies.UCBVtuned'¶
-
Policies.UCBalpha module¶
The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Reference: [Auer et al. 02].
-
Policies.UCBalpha.
ALPHA
= 4¶ Default parameter for alpha
-
class
Policies.UCBalpha.
UCBalpha
(nbArms, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting). Reference: [Auer et al. 02].
-
__init__
(nbArms, alpha=4, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
alpha
= None¶ Parameter alpha
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{\alpha \log(t)}{2 N_k(t)}}.\]
-
__module__
= 'Policies.UCBalpha'¶
-
Policies.UCBdagger module¶
The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.
- Reference: [[Auto-tuning the Confidence Level for Optimistic Bandit Strategies, Lattimore, unpublished, 2017]](http://tor-lattimore.com/)
-
Policies.UCBdagger.
ALPHA
= 1¶ Default value for the parameter \(\alpha > 0\) for UCBdagger.
-
Policies.UCBdagger.
log_bar
(x)[source]¶ The function defined as \(\mathrm{l\overline{og}}\) by Lattimore:
\[\mathrm{l\overline{og}}(x) := \log\left((x+e)\sqrt{\log(x+e)}\right)\]Some values:
>>> for x in np.logspace(0, 7, 8):
...     print("x = {:<5.3g} gives log_bar(x) = {:<5.3g}".format(x, log_bar(x)))
x = 1     gives log_bar(x) = 1.45
x = 10    gives log_bar(x) = 3.01
x = 100   gives log_bar(x) = 5.4
x = 1e+03 gives log_bar(x) = 7.88
x = 1e+04 gives log_bar(x) = 10.3
x = 1e+05 gives log_bar(x) = 12.7
x = 1e+06 gives log_bar(x) = 15.1
x = 1e+07 gives log_bar(x) = 17.5
Illustration:
>>> import matplotlib.pyplot as plt
>>> X = np.linspace(0, 1000, 2000)
>>> Y = log_bar(X)
>>> plt.plot(X, Y)
>>> plt.title(r"The $\mathrm{l\overline{og}}$ function")
>>> plt.show()
-
Policies.UCBdagger.
Ki_function
(pulls, i)[source]¶ Compute the \(K_i(t)\) index as defined in the article, for one arm i.
-
Policies.UCBdagger.
Ki_vectorized
(pulls)[source]¶ Compute the \(K_i(t)\) index as defined in the article, for all arms (in a vectorized manner).
Warning
I didn’t find a fast vectorized formula, so don’t use this one.
-
class
Policies.UCBdagger.
UCBdagger
(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB-dagger (\(\mathrm{UCB}{\dagger}\), UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.
- Reference: [[Auto-tuning the Confidence Level for Optimistic Bandit Strategies, Lattimore, unpublished, 2017]](http://downloads.tor-lattimore.com/papers/XXX)
-
__init__
(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
alpha
= None¶ Parameter \(\alpha > 0\).
-
horizon
= None¶ Parameter \(T > 0\).
-
getReward
(arm, reward)[source]¶ Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \mathrm{l}\overline{\mathrm{og}}\left( \frac{T}{H_k(t)} \right)}, \\ \text{where}\;\; & H_k(t) := N_k(t) K_k(t) \\ \text{and}\;\; & K_k(t) := \sum_{j=1}^{K} \min(1, \sqrt{\frac{N_j(t)}{N_k(t)}}).\end{split}\]
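A standalone numpy sketch of these indexes (illustrative names and toy values; it follows the \(N_j(t)/N_k(t)\) notation of the formula above):

import numpy as np

def log_bar(x):
    # The \overline{log} function used by UCB-dagger.
    return np.log((x + np.e) * np.sqrt(np.log(x + np.e)))

def ucb_dagger_indexes(X, N, horizon, alpha=1.0):
    # All indexes at once, from cumulative rewards X_k(t) and pull counts N_k(t).
    X, N = np.asarray(X, dtype=float), np.asarray(N, dtype=float)
    K_t = np.array([np.minimum(1.0, np.sqrt(N / N_k)).sum() for N_k in N])  # K_k(t)
    H_t = N * K_t                                                           # H_k(t)
    return X / N + np.sqrt((2.0 * alpha / N) * log_bar(horizon / H_t))

indexes = ucb_dagger_indexes(X=[5.0, 20.0, 2.0], N=[10, 40, 5], horizon=1000)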
-
__module__
= 'Policies.UCBdagger'¶
Policies.UCBimproved module¶
The UCB-Improved policy for bounded bandits, with known horizon, as an example of a successive elimination algorithm.
- Reference: [[Auer et al, 2010](https://link.springer.com/content/pdf/10.1007/s10998-010-3055-6.pdf)].
-
Policies.UCBimproved.
ALPHA
= 0.5¶ Default value for parameter \(\alpha\).
-
Policies.UCBimproved.
n_m
(horizon, delta_m)[source]¶ Function \(\lceil \frac{2 \log(T \Delta_m^2)}{\Delta_m^2} \rceil\).
-
class
Policies.UCBimproved.
UCBimproved
(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.SuccessiveElimination.SuccessiveElimination
The UCB-Improved policy for bounded bandits, with known horizon, as an example of a successive elimination algorithm.
- Reference: [[Auer et al, 2010](https://link.springer.com/content/pdf/10.1007/s10998-010-3055-6.pdf)].
-
__init__
(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
alpha
= None¶ Parameter alpha
-
activeArms
= None¶ Set of active arms
-
estimate_delta
= None¶ Current estimate of the gap \(\Delta_0\)
-
current_m
= None¶ Current round m
-
max_m
= None¶ Bound \(m = \lfloor \frac{1}{2} \log_2(\frac{T}{e}) \rfloor\)
-
when_did_it_leave
= None¶ Also keep in memory when the arm was kicked out of the
activeArms
set, so that a fake index can be given if we ask to order the arms, for instance.
-
choice
(recursive=False)[source]¶ In policy based on successive elimination, choosing an arm is the same as choosing an arm from the set of active arms (
self.activeArms
) with methodchoiceFromSubSet
.
-
__module__
= 'Policies.UCBimproved'¶
Policies.UCBmin module¶
The UCB-min policy for bounded bandits, with a \(\min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right)\) term. Reference: [Anandkumar et al., 2010].
-
class
Policies.UCBmin.
UCBmin
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The UCB-min policy for bounded bandits, with a \(\min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right)\) term. Reference: [Anandkumar et al., 2010].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \min\left(1, \sqrt{\frac{\log(t)}{2 N_k(t)}}\right).\]
-
__module__
= 'Policies.UCBmin'¶
-
Policies.UCBoost module¶
The UCBoost policy for bounded bandits (on [0, 1]).
- Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
Warning
The whole goal of their paper is to provide a numerically efficient alternative to kl-UCB, so for my comparison to be fair, I should either use the Python versions of klUCB utility functions (using kullback
) or write C or Cython versions of this UCBoost module. My conclusion is that kl-UCB is always faster than UCBoost.
-
Policies.UCBoost.
c
= 0.0¶ Default value for better practical performance.
-
Policies.UCBoost.
tolerance_with_upperbound
= 1.0001¶ Tolerance when checking (with
assert
) that the solution(s) of any convex problem are correct.
-
Policies.UCBoost.
CHECK_SOLUTION
= False¶ Whether to check that the solution(s) of any convex problem are correct.
Warning
This is currently disabled, to try to optimize this module! WARNING bring it back when debugging!
-
Policies.UCBoost.
squadratic_distance
(p, q)[source]¶ The quadratic distance, \(d_{sq}(p, q) := 2 (p - q)^2\).
-
Policies.UCBoost.
solution_pb_sq
(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{sq}\) the
squadratic_distance()
function:\[\begin{split}P_1(d_{sq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{sq}(p, q) \leq \delta.\end{split}\]- The solution is:
\[q^* = p + \sqrt{\frac{\delta}{2}}.\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCB_sq
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB(d_sq) policy for bounded bandits (on [0, 1]).
- It uses
solution_pb_sq()
as a closed-form solution to compute the UCB indexes (using the quadratic distance). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{sq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
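A one-function sketch of this index, using the closed-form solution \(q^* = p + \sqrt{\delta/2}\) given above (illustrative names, with a small guard on \(\log\log t\) for the case \(c > 0\)):

import numpy as np

def ucb_sq_index(X_k, N_k, t, c=0.0):
    # UCB(d_sq) index: closed-form solution q* = p + sqrt(delta / 2) of P_1(d_sq).
    p = X_k / N_k
    delta = (np.log(t) + c * np.log(max(np.log(t), 1e-16))) / N_k
    return p + np.sqrt(delta / 2.0)

index = ucb_sq_index(X_k=7.0, N_k=10, t=100)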
-
__module__
= 'Policies.UCBoost'¶
- It uses
-
Policies.UCBoost.
biquadratic_distance
(p, q)[source]¶ The biquadratic distance, \(d_{bq}(p, q) := 2 (p - q)^2 + (4/9) * (p - q)^4\).
-
Policies.UCBoost.
solution_pb_bq
(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{bq}\) the
biquadratic_distance()
function:\[\begin{split}P_1(d_{bq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{bq}(p, q) \leq \delta.\end{split}\]- The solution is:
\[q^* = \min(1, p + \sqrt{-\frac{9}{4} + \sqrt{\frac{81}{16} + \frac{9}{4} \delta}}).\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCB_bq
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB(d_bq) policy for bounded bandits (on [0, 1]).
- It uses
solution_pb_bq()
as a closed-form solution to compute the UCB indexes (using the biquadratic distance). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{bq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
- It uses
-
Policies.UCBoost.
hellinger_distance
(p, q)[source]¶ The Hellinger distance, \(d_{h}(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2\).
-
Policies.UCBoost.
solution_pb_hellinger
(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{h}\) the
hellinger_distance()
function:\[\begin{split}P_1(d_h)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_h(p, q) \leq \delta.\end{split}\]- The solution is:
\[q^* = \left( (1 - \frac{\delta}{2}) \sqrt{p} + \sqrt{(1 - p) (\delta - \frac{\delta^2}{4})} \right)^{2 \times \boldsymbol{1}(\delta < 2 - 2 \sqrt{p})}.\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCB_h
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB(d_h) policy for bounded bandits (on [0, 1]).
- It uses
solution_pb_hellinger()
as a closed-form solution to compute the UCB indexes (using the Hellinger distance). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_h)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
- It uses
-
Policies.UCBoost.
eps
= 1e-15¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]
-
Policies.UCBoost.
kullback_leibler_distance_on_mean
(p, q)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{kl}(p, q) = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q)) = p \log\left(\frac{p}{q}\right) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
-
Policies.UCBoost.
kullback_leibler_distance_lowerbound
(p, q)[source]¶ Lower-bound on the Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[d_{lb}(p, q) = p \log\left( p \right) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
-
Policies.UCBoost.
solution_pb_kllb
(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{lb}\) the proposed lower-bound on the Kullback-Leibler binary distance (
kullback_leibler_distance_lowerbound()
) function:\[\begin{split}P_1(d_{lb})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{lb}(p, q) \leq \delta.\end{split}\]- The solution is:
\[q^* = 1 - (1 - p) \exp\left(\frac{p \log(p) - \delta}{1 - p}\right).\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCB_lb
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB(d_lb) policy for bounded bandits (on [0, 1]).
- It uses
solution_pb_kllb()
as a closed-form solution to compute the UCB indexes (using the lower-bound on the Kullback-Leibler distance). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{lb})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
- It uses
-
Policies.UCBoost.
distance_t
(p, q)[source]¶ A shifted tangent line function of
kullback_leibler_distance_on_mean()
.\[d_t(p, q) = \frac{2 q}{p + 1} + p \log\left(\frac{p}{p + 1}\right) + \log\left(\frac{2}{\mathrm{e}(p + 1)}\right).\]Warning
I think there might be a typo in the formula in the article, as this \(d_t\) does not seem to “depend enough on q” (just intuition).
-
Policies.UCBoost.
solution_pb_t
(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_t\) a shifted tangent line function of
kullback_leibler_distance_on_mean()
(distance_t()
) function:\[\begin{split}P_1(d_t)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_t(p, q) \leq \delta.\end{split}\]- The solution is:
\[q^* = \min\left(1, \frac{p + 1}{2} \left( \delta - p \log\left(\frac{p}{p + 1}\right) - \log\left(\frac{2}{\mathrm{e} (p + 1)}\right) \right)\right).\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCB_t
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCB(d_t) policy for bounded bandits (on [0, 1]).
- It uses
solution_pb_t()
as a closed-form solution to compute the UCB indexes (using a shifted tangent line function ofkullback_leibler_distance_on_mean()
). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
Warning
It has bad performance, as expected (see the paper for their remark).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_t)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
- It uses
-
Policies.UCBoost.
is_a_true_number
(n)[source]¶ Check if n is a number or not (
int
,float
,complex
etc, any instance ofnumbers.Number
class).
-
class
Policies.UCBoost.
UCBoost
(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCBoost policy for bounded bandits (on [0, 1]).
- It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
set_D
should be either a set of strings (and NOT functions), or a number (3, 4 or 5). 3 indicate usingd_bq
,d_h
,d_lb
, 4 addsd_t
, and 5 addsd_sq
(see the article, Corollary 3, p5, for more details).- Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
set_D
= None¶ Set of strings that indicate which d functions are in the set of functions D. Warning: do not use real functions here, or the object won’t be hashable!
-
c
= None¶ Parameter c
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
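For illustration, here is a standalone sketch of this index with the set \(D = \{d_{bq}, d_h, d_{lb}\}\), plugging in the three closed-form solutions given earlier in this module; names and the truncation constant mirror the documentation above, but the code itself is only an example:

import numpy as np

EPS = 1e-15  # same role as the eps constant above: keep p away from 0 and 1

def solution_bq(p, delta):
    # Closed form for the biquadratic distance d_bq.
    return min(1.0, p + np.sqrt(-9.0 / 4.0 + np.sqrt(81.0 / 16.0 + 9.0 / 4.0 * delta)))

def solution_h(p, delta):
    # Closed form for the Hellinger distance d_h.
    sqrt_p = np.sqrt(p)
    if delta >= 2.0 - 2.0 * sqrt_p:
        return 1.0
    return ((1.0 - delta / 2.0) * sqrt_p + np.sqrt((1.0 - p) * (delta - delta ** 2 / 4.0))) ** 2

def solution_lb(p, delta):
    # Closed form for the lower bound d_lb on the Bernoulli KL divergence.
    return 1.0 - (1.0 - p) * np.exp((p * np.log(p) - delta) / (1.0 - p))

def ucboost_index(X_k, N_k, t, c=0.0):
    # UCBoost index with D = {d_bq, d_h, d_lb}: smallest of the three closed-form bounds.
    p = min(max(X_k / N_k, EPS), 1.0 - EPS)
    delta = (np.log(t) + c * np.log(max(np.log(t), 1e-16))) / N_k
    return min(solution_bq(p, delta), solution_h(p, delta), solution_lb(p, delta))

index = ucboost_index(X_k=7.0, N_k=10, t=100)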
-
__module__
= 'Policies.UCBoost'¶
-
class
Policies.UCBoost.
UCBoost_bq_h_lb
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
- It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
set_D
isd_bq
,d_h
,d_lb
(see the article, Corollary 3, p5, for more details).- Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
-
class
Policies.UCBoost.
UCBoost_bq_h_lb_t
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
- It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
set_D
isd_bq
,d_h
,d_lb
,d_t
(see the article, Corollary 3, p5, for more details).- Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
-
class
Policies.UCBoost.
UCBoost_bq_h_lb_t_sq
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
- It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
set_D
isd_bq
,d_h
,d_lb
,d_t
,d_sq
(see the article, Corollary 3, p5, for more details).- Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
-
__module__
= 'Policies.UCBoost'¶
-
Policies.UCBoost.
min_solutions_pb_from_epsilon
(p, upperbound, epsilon=0.001, check_solution=False)[source]¶ List of closed-form solutions of the following optimisation problems, for \(d = d_s^k\) approximation of \(d_{kl}\) and any \(\tau_1(p) \leq k \leq \tau_2(p)\):
\[\begin{split}P_1(d_s^k)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_s^k(p, q) \leq \delta.\end{split}\]- The solution is:
\[\begin{split}q^* &= q_k^{\boldsymbol{1}(\delta < d_{kl}(p, q_k))},\\ d_s^k &: (p, q) \mapsto d_{kl}(p, q_k) \boldsymbol{1}(q > q_k),\\ q_k &:= 1 - \left( 1 - \frac{\varepsilon}{1 + \varepsilon} \right)^k.\end{split}\]- \(\delta\) is the
upperbound
parameter on the semi-distance between input \(p\) and solution \(q^*\).
-
class
Policies.UCBoost.
UCBoostEpsilon
(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The UCBoostEpsilon policy for bounded bandits (on [0, 1]).
- It is quite simple: using a set of kl-dominated and candidate semi-distances D, the UCB index for each arm (at each step) is computed as the smallest upper confidence bound given (for this arm at this time t) for each distance d.
- This variant uses
solutions_pb_from_epsilon()
 to also compute the \(\varepsilon\) approximation of the kullback_leibler_distance_on_mean()
function (see the article for details, Th.3 p6). - Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
-
__init__
(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
__module__
= 'Policies.UCBoost'¶
-
epsilon
= None¶ Parameter epsilon
Policies.UCBplus module¶
The UCB+ policy for bounded bandits, with a small trick on the index.
- Reference: [Auer et al. 2002], and [[Garivier et al. 2016](https://arxiv.org/pdf/1605.08988.pdf)] (it is noted \(\mathrm{UCB}^*\) in the second article).
-
class
Policies.UCBplus.
UCBplus
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The UCB+ policy for bounded bandits, with a small trick on the index.
- Reference: [Auer et al. 2002], and [[Garivier et al. 2016](https://arxiv.org/pdf/1605.08988.pdf)] (it is noted \(\mathrm{UCB}^*\) in the second article).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[I_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log(t / N_k(t))}{2 N_k(t)}\right)}.\]
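As a quick illustration, this index can be computed directly from the sufficient statistics; a minimal sketch (not the class's actual code):
import math

def ucbplus_index(sum_rewards, pulls, t):
    # empirical mean plus the UCB+ exploration bonus
    mean = sum_rewards / pulls
    bonus = math.sqrt(max(0.0, math.log(t / pulls) / (2 * pulls)))
    return mean + bonus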
-
__module__
= 'Policies.UCBplus'¶
Policies.UCBrandomInit module¶
The UCB index policy, modified to take a random permutation order for the initial exploration of each arm (could reduce collisions in the multi-players setting). Reference: [Lai & Robbins, 1985].
-
class
Policies.UCBrandomInit.
UCBrandomInit
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.UCB.UCB
The UCB index policy, modified to take a random permutation order for the initial exploration of each arm (could reduce collisions in the multi-players setting). Reference: [Lai & Robbins, 1985].
-
__init__
(nbArms, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
choice
()[source]¶ In an index policy, choose an arm with maximal index (uniformly at random):
\[A(t) \sim U(\arg\max_{1 \leq k \leq K} I_k(t)).\]Warning
In almost all cases, there is a unique arm with maximal index, so we lose a lot of time with this generic code, but I couldn’t find a way to be more efficient without losing generality.
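A minimal sketch of this generic tie-breaking rule, assuming the indexes are stored in a numpy array (as in the index policies of this package):
import numpy as np

def choice(index):
    # pick uniformly at random among the arms achieving the maximal index
    best_arms = np.flatnonzero(index == np.max(index))
    return np.random.choice(best_arms)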
-
__module__
= 'Policies.UCBrandomInit'¶
-
Policies.Uniform module¶
Uniform: the fully uniform policy that selects an arm uniformly at random at each step (stupid).
-
class
Policies.Uniform.
Uniform
(nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
Uniform: the fully uniform policy that selects an arm uniformly at random at each step (stupid).
-
nbArms
= None¶ Number of arms
-
__module__
= 'Policies.Uniform'¶
-
Policies.UniformOnSome module¶
UniformOnSome: a fully uniform policy that selects an arm uniformly at random from a fixed set, at each step (stupid).
-
class
Policies.UniformOnSome.
UniformOnSome
(nbArms, armIndexes=None, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.Uniform.Uniform
UniformOnSome: a fully uniform policy that selects an arm uniformly at random from a fixed set, at each step (stupid).
-
nbArms
= None¶ Number of arms
-
armIndexes
= None¶ Arms from where to uniformly sample
-
__module__
= 'Policies.UniformOnSome'¶
-
Policies.WrapRange module¶
A policy that acts as a wrapper around another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need to use rewards in \([0,1]\), and are easy to use if the rewards are known to be in some interval \([a, b]\) (I did this from the very beginning here, with [lower, lower+amplitude]
).
But if the interval \([a,b]\) is unknown, what can we do?
The “Doubling Trick”, in this setting, refers to this algorithm:
- Start with \([a_0, b_0] = [0, 1]\),
- If a reward \(r_t\) is seen below \(a_i\), use \(a_{i+1} = r_t\),
- If a reward \(r_t\) is seen above \(b_i\), use \(b_{i+1} = r_t\).
Instead of just doubling the length of the interval (“doubling trick”), we use \([r_t, b_i]\) or \([a_i, r_t]\), as it is the smallest interval compatible with the past and the new observation \(r_t\).
- Reference. I’m not sure which work is the first to have proposed this idea, but [[Normalized online learning, Stéphane Ross & Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf)] proposes a similar idea.
See also
See for instance Obandit.WrapRange by @freuk.
-
class
Policies.WrapRange.
WrapRange
(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
Policies.BasePolicy.BasePolicy
A policy that acts as a wrapper around another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
-
__init__
(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ New policy.
-
policy
= None¶ Underlying policy
-
getReward
(arm, reward)[source]¶ Maybe change the current range and rescale all the past history, and then pass the reward, and update t.
Call \(r_s\) the reward at time \(s\), \(l_{t-1}\) and \(a_{t-1}\) the lower-bound and amplitude of rewards at previous time \(t-1\), and \(l_t\) and \(a_t\) the new lower-bound and amplitude for current time \(t\). The previous history is \(R_t := \sum_{s=1}^{t-1} r_s\).
The generic formula for rescaling the previous history is the following:
\[R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.\]So we have the following efficient algorithm:
- If \(r < l_{t-1}\), let \(l_t = r\) and \(R_t := R_t + \frac{l_{t-1} - l_t}{a_t}\),
- Else if \(r > l_{t-1} + a_{t-1}\), let \(a_t = r - l_{t-1}\) and \(R_t := R_t \times \frac{a_{t-1}}{a_t}\),
- Otherwise, nothing to do, the current reward is still correctly in \([l_{t-1}, l_{t-1} + a_{t-1}]\), so simply keep \(l_t = l_{t-1}\) and \(a_t = a_{t-1}\).
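A minimal sketch of this rescaling step (hypothetical variable names, not the class's actual code):
def rescale_history(reward, lower, amplitude, cum_reward):
    # return the new (lower, amplitude, cum_reward) after observing `reward`
    if reward < lower:
        # new lower bound: shift the rescaled history accordingly
        new_lower = reward
        cum_reward += (lower - new_lower) / amplitude
        lower = new_lower
    elif reward > lower + amplitude:
        # new amplitude: rescale the history by the ratio of amplitudes
        new_amplitude = reward - lower
        cum_reward *= amplitude / new_amplitude
        amplitude = new_amplitude
    # otherwise the reward is already in [lower, lower + amplitude]: nothing to do
    return lower, amplitude, cum_reward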
-
index
¶ Get attribute
index
from the underlying policy.
-
choiceFromSubSet
(availableArms='all')[source]¶ Pass the call to
choiceFromSubSet
of the underlying policy.
-
choiceIMP
(nb=1, startWithChoiceMultiple=True)[source]¶ Pass the call to
choiceIMP
of the underlying policy.
-
__module__
= 'Policies.WrapRange'¶
-
Policies.klUCB module¶
The generic KL-UCB policy for one-parameter exponential distributions.
- By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
Policies.klUCB.
c
= 1.0¶ default value, as it was in pymaBandits v1.0
-
Policies.klUCB.
TOLERANCE
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.klUCB.
klUCB
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.IndexPolicy.IndexPolicy
The generic KL-UCB policy for one-parameter exponential distributions.
- By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
__init__
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
c
= None¶ Parameter c
-
klucb
= None¶ kl function to use
-
klucb_vect
= None¶ kl function to use, in a vectorized way using
numpy.vectorize()
.
-
tolerance
= None¶ Numerical tolerance
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
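In practice this index is obtained by numerically inverting the KL divergence; a minimal sketch for Bernoulli arms, using the klucbBern() helper documented below in Policies.kullback:
import math
from Policies.kullback import klucbBern

def klucb_index(sum_rewards, pulls, t, c=1.0, tolerance=1e-4):
    mean = sum_rewards / pulls
    level = c * math.log(t) / pulls  # exploration level c log(t) / N_k(t)
    # largest q such that kl(mean, q) <= level
    return klucbBern(mean, level, precision=tolerance)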
-
__module__
= 'Policies.klUCB'¶
Policies.klUCBH module¶
The kl-UCB-H policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)
-
class
Policies.klUCBH.
klUCBH
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The kl-UCB-H policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)
-
__init__
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(T)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.klUCBH'¶
-
Policies.klUCBHPlus module¶
The improved kl-UCB-H+ policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)
-
class
Policies.klUCBHPlus.
klUCBHPlus
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The improved kl-UCB-H+ policy, for one-parameter exponential distributions. Reference: [Lai 87](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350495)
-
__init__
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(T / N_k(t))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.klUCBHPlus'¶
-
Policies.klUCBPlus module¶
The improved kl-UCB policy, for one-parameter exponential distributions. Reference: [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf)
-
class
Policies.klUCBPlus.
klUCBPlus
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The improved kl-UCB policy, for one-parameter exponential distributions. Reference: [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf)
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t / N_k(t))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.klUCBPlus'¶
-
Policies.klUCBPlusPlus module¶
The improved kl-UCB++ policy, for one-parameter exponential distributions. Reference: [Menard & Garivier, ALT 2017](https://hal.inria.fr/hal-01475078)
-
Policies.klUCBPlusPlus.
g
(t, T, K)[source]¶ The exploration function g(t) (for t current time, T horizon, K nb arms), as defined in page 3 of the reference paper.
\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
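A direct transcription of this exploration function, as a small sketch (not the module's code):
import math

def log_plus(x):
    # log^+(x) = max(0, log(x)); also 0 for non-positive x
    return math.log(x) if x > 1.0 else 0.0

def g(t, T, K):
    y = T / (K * t)
    return log_plus(y * (1.0 + log_plus(y) ** 2))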
-
Policies.klUCBPlusPlus.
g_vect
(t, T, K)[source]¶ The exploration function g(t) (for t current time, T horizon, K nb arms), as defined in page 3 of the reference paper, for numpy vectorized inputs.
\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
-
class
Policies.klUCBPlusPlus.
klUCBPlusPlus
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The improved kl-UCB++ policy, for one-parameter exponential distributions. Reference: [Menard & Garivier, ALT 2017](https://hal.inria.fr/hal-01475078)
-
__init__
(nbArms, horizon=None, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c g(N_k(t), T, K)}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1), and where \(g(t, T, K)\) is this function:\[\begin{split}g(t, T, K) &:= \log^+(y (1 + \log^+(y)^2)),\\ y &:= \frac{T}{K t}.\end{split}\]
-
__module__
= 'Policies.klUCBPlusPlus'¶
-
Policies.klUCB_forGLR module¶
The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).
- It is designed to be used with the wrapper
GLR_UCB
. - By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
Policies.klUCB_forGLR.
c
= 3¶ Default value when using \(f(t) = \log(t) + c \log(\log(t))\), as
klUCB_forGLR
is inherited fromklUCBloglog
.
-
Policies.klUCB_forGLR.
TOLERANCE
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.klUCB_forGLR.
klUCB_forGLR
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=3, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCBloglog.klUCBloglog
The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).
- It is designed to be used with the wrapper
GLR_UCB
. - By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
__init__
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=3, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
t_for_each_arm
= None¶ Keep in memory not only the global time step \(t\), but also allow
GLR_UCB
 to use a different time step \(t_k\) for each arm, in the exploration function \(f(t) = \log(t_k) + 3 \log(\log(t_k))\).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t_k) + c \log(\log(t_k))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).Warning
The only difference with
klUCB
is that a custom \(t_k\) is used for each arm k, instead of a common \(t\). This policy is designed to be used withGLR_UCB
.
-
__module__
= 'Policies.klUCB_forGLR'¶
Policies.klUCBloglog module¶
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: it uses log(t) + c log(log(t)) for the KL-UCB index instead of just log(t). Reference: [Garivier & Cappé - COLT, 2011].
-
Policies.klUCBloglog.
c
= 3¶ default value, as it was in pymaBandits v1.0
-
class
Policies.klUCBloglog.
klUCBloglog
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The generic kl-UCB policy for one-parameter exponential distributions. By default, it assumes Bernoulli arms. Note: it uses log(t) + c log(log(t)) for the KL-UCB index instead of just log(t). Reference: [Garivier & Cappé - COLT, 2011].
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t) + c \log(\max(1, \log(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.klUCBloglog'¶
-
Policies.klUCBloglog_forGLR module¶
The generic kl-UCB policy for one-parameter exponential distributions with restarted round count t_k.
By default, it assumes Bernoulli arms.
Note: it uses log(t) + c log(log(t)) for the KL-UCB index instead of just log(t).
- It is designed to be used with the wrapper GLR_UCB
.
- By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
Policies.klUCBloglog_forGLR.
c
= 3¶ Default value when using \(f(t) = \log(t) + c \log(\log(t))\), as
klUCB_forGLR
is inherited fromklUCBloglog
.
-
Policies.klUCBloglog_forGLR.
TOLERANCE
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
class
Policies.klUCBloglog_forGLR.
klUCBloglog_forGLR
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=2, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB_forGLR.klUCB_forGLR
The generic KL-UCB policy for one-parameter exponential distributions, using a different exploration time step for each arm (\(\log(t_k) + c \log(\log(t_k))\) instead of \(\log(t) + c \log(\log(t))\)).
- It is designed to be used with the wrapper
GLR_UCB
. - By default, it assumes Bernoulli arms.
- Reference: [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf).
-
__init__
(nbArms, tolerance=0.0001, klucb=<function klucbBern>, c=2, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{\log(t) + c \log(\max(1, \log(t)))}{N_k(t)} \right\},\\ I_k(t) &= U_k(t).\end{split}\]If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see
Arms.kullback
), and c is the parameter (default to 1).
-
__module__
= 'Policies.klUCBloglog_forGLR'¶
Policies.klUCBswitch module¶
The kl-UCB-switch policy, for bounded distributions.
- Reference: [Garivier et al, 2018](https://arxiv.org/abs/1805.05071)
-
Policies.klUCBswitch.
TOLERANCE
= 0.0001¶ Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.
-
Policies.klUCBswitch.
threshold_switch_bestchoice
(T, K, gamma=0.2)[source]¶ The threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).
\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 1/5.\]
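A direct transcription of this threshold, as a small sketch:
import math

def threshold_switch(T, K, gamma=0.2):
    # floor((T / K) ** gamma), with gamma = 1/5 by default
    return math.floor((T / K) ** gamma)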
-
Policies.klUCBswitch.
threshold_switch_delayed
(T, K, gamma=0.8888888888888888)[source]¶ Another threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).
\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 8/9.\]
-
Policies.klUCBswitch.
threshold_switch_default
(T, K, gamma=0.2)¶ The threshold function \(f(T, K)\), to know when to switch from using \(I^{KL}_k(t)\) (kl-UCB index) to using \(I^{MOSS}_k(t)\) (MOSS index).
\[f(T, K) := \lfloor (T / K)^{\gamma} \rfloor, \gamma = 1/5.\]
-
Policies.klUCBswitch.
klucbplus_index
(reward, pull, horizon, nbArms, klucb=<function klucbBern>, c=1.0, tolerance=0.0001)[source]¶ One kl-UCB+ index, from [Cappé et al. 13](https://arxiv.org/pdf/1210.1136.pdf):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I^{KL+}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(T / (K N_k(t)))}{N_k(t)} \right\}.\end{split}\]
-
Policies.klUCBswitch.
mossplus_index
(reward, pull, horizon, nbArms)[source]¶ One MOSS+ index, from [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf):
\[I^{MOSS+}_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{T}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
-
class
Policies.klUCBswitch.
klUCBswitch
(nbArms, horizon=None, threshold='best', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCB.klUCB
The kl-UCB-switch policy, for bounded distributions.
- Reference: [Garivier et al, 2018](https://arxiv.org/abs/1805.05071)
-
__init__
(nbArms, horizon=None, threshold='best', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
horizon
= None¶ Parameter \(T\) = known horizon of the experiment.
-
constant_threshold_switch
= None¶ For klUCBswitch (not the anytime variant), we can precompute the threshold as it is constant, \(= f(T, K)\).
-
use_MOSS_index
= None¶ Initialize internal memory: at first, every arm uses the kl-UCB index, then some will switch to MOSS. (Array of K bool).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}U_k(t) = \begin{cases} U^{KL+}_k(t) & \text{if } N_k(t) \leq f(T, K), \\ U^{MOSS+}_k(t) & \text{if } N_k(t) > f(T, K). \end{cases}.\end{split}\]- It starts by using
klucbplus_index()
, then it callsthreshold_switch()
to know when to stop and start usingmossplus_index()
.
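A minimal sketch of this switching rule for one arm (hypothetical variable names; the actual class caches which arms have already switched):
from Policies.klUCBswitch import klucbplus_index, mossplus_index

def switch_index(sum_rewards, pulls, horizon, nbArms, threshold):
    # kl-UCB+ index until the arm has been pulled more than f(T, K) times,
    # then switch (permanently) to the MOSS+ index
    if pulls <= threshold:
        return klucbplus_index(sum_rewards, pulls, horizon, nbArms)
    else:
        return mossplus_index(sum_rewards, pulls, horizon, nbArms)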
-
__module__
= 'Policies.klUCBswitch'¶
-
Policies.klUCBswitch.
logplus
(x)[source]¶ The \(\log_+\) function.
\[\log_+(x) := \max(0, \log(x)).\]
-
Policies.klUCBswitch.
phi
(x)[source]¶ The \(\phi(x)\) function defined in equation (6) in their paper.
\[\phi(x) := \log_+(x (1 + (\log_+(x))^2)).\]
-
Policies.klUCBswitch.
klucb_index
(reward, pull, t, nbArms, klucb=<function klucbBern>, c=1.0, tolerance=0.0001)[source]¶ One kl-UCB index, from [Garivier & Cappé - COLT, 2011](https://arxiv.org/pdf/1102.2490.pdf):
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I^{KL}_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu}_k(t), q) \leq \frac{c \log(t / N_k(t))}{N_k(t)} \right\}.\end{split}\]
-
Policies.klUCBswitch.
moss_index
(reward, pull, t, nbArms)[source]¶ One MOSS index, from [Audibert & Bubeck, 2010](http://www.jmlr.org/papers/volume11/audibert10a/audibert10a.pdf):
\[I^{MOSS}_k(t) = \frac{X_k(t)}{N_k(t)} + \sqrt{\max\left(0, \frac{\log\left(\frac{t}{K N_k(t)}\right)}{N_k(t)}\right)}.\]
-
class
Policies.klUCBswitch.
klUCBswitchAnytime
(nbArms, threshold='delayed', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.klUCBswitch.klUCBswitch
The anytime variant of the kl-UCB-switch policy, for bounded distributions.
- It does not use a doubling trick, but an augmented exploration function: it replaces \(\log_+\) by \(\phi\) to obtain klucb_index() and moss_index() from klucbplus_index() and mossplus_index(). - Reference: [Garivier et al, 2018](https://arxiv.org/abs/1805.05071)
-
__init__
(nbArms, threshold='delayed', tolerance=0.0001, klucb=<function klucbBern>, c=1.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
- nbArms: the number of arms,
- lower, amplitude: lower value and known amplitude of the rewards.
-
__module__
= 'Policies.klUCBswitch'¶
-
threshold_switch
= None¶ A function, like
threshold_switch()
, of T and K, to decide when to switch from kl-UCB indexes to MOSS indexes (for each arm).
-
computeIndex
(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}U_k(t) = \begin{cases} U^{KL}_k(t) & \text{if } N_k(t) \leq f(t, K), \\ U^{MOSS}_k(t) & \text{if } N_k(t) > f(t, K). \end{cases}.\end{split}\]- It starts by using
klucb_index()
, then it callsthreshold_switch()
to know when to stop and start usingmoss_index()
.
Policies.kullback module¶
Kullback-Leibler divergence functions and klUCB utilities.
- A faster implementation can be found in a C file, in
Policies/C
, and should be compiled to speed up computations. - However, the versions here have examples, doctests, and are JIT-compiled on the fly (with numba, cf. http://numba.pydata.org/).
- Cf. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- Reference: [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf) and [Garivier & Cappé, 2011](https://arxiv.org/pdf/1102.2490.pdf)
Warning
None of these functions are vectorized; they assume a single value for each argument.
If you want a vectorized function, use the wrapper numpy.vectorize
:
>>> import numpy as np
>>> klBern_vect = np.vectorize(klBern)
>>> klBern_vect([0.1, 0.5, 0.9], 0.2) # doctest: +ELLIPSIS
array([0.036..., 0.223..., 1.145...])
>>> klBern_vect(0.4, [0.2, 0.3, 0.4]) # doctest: +ELLIPSIS
array([0.104..., 0.022..., 0...])
>>> klBern_vect([0.1, 0.5, 0.9], [0.2, 0.3, 0.4]) # doctest: +ELLIPSIS
array([0.036..., 0.087..., 0.550...])
For some functions, you would be better off writing a vectorized version manually, for instance if you want to fix a value of some optional parameters:
>>> # WARNING using np.vectorize gave weird result on klGauss
>>> # klGauss_vect = np.vectorize(klGauss, excluded="y")
>>> def klGauss_vect(xs, y, sig2x=0.25): # vectorized for first input only
... return np.array([klGauss(x, y, sig2x) for x in xs])
>>> klGauss_vect([-1, 0, 1], 0.1) # doctest: +ELLIPSIS
array([2.42, 0.02, 1.62])
-
Policies.kullback.
eps
= 1e-15¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]
-
Policies.kullback.
klBern
(x, y)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]>>> klBern(0.5, 0.5) 0.0 >>> klBern(0.1, 0.9) # doctest: +ELLIPSIS 1.757779... >>> klBern(0.9, 0.1) # And this KL is symmetric # doctest: +ELLIPSIS 1.757779... >>> klBern(0.4, 0.5) # doctest: +ELLIPSIS 0.020135... >>> klBern(0.01, 0.99) # doctest: +ELLIPSIS 4.503217...
- Special values:
>>> klBern(0, 1) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 34.539575...
-
Policies.kullback.
klBin
(x, y, n)[source]¶ Kullback-Leibler divergence for Binomial distributions. https://math.stackexchange.com/questions/320399/kullback-leibner-divergence-of-binomial-distributions
- It is simply n times
klBern()
on x and y.
\[\mathrm{KL}(\mathrm{Bin}(x, n), \mathrm{Bin}(y, n)) = n \times \left(x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}) \right).\]Warning
The two distributions must have the same parameter n, and x, y are p, q in (0, 1).
>>> klBin(0.5, 0.5, 10) 0.0 >>> klBin(0.1, 0.9, 10) # doctest: +ELLIPSIS 17.57779... >>> klBin(0.9, 0.1, 10) # And this KL is symmetric # doctest: +ELLIPSIS 17.57779... >>> klBin(0.4, 0.5, 10) # doctest: +ELLIPSIS 0.20135... >>> klBin(0.01, 0.99, 10) # doctest: +ELLIPSIS 45.03217...
- Special values:
>>> klBin(0, 1, 10) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 345.39575...
-
Policies.kullback.
klPoisson
(x, y)[source]¶ Kullback-Leibler divergence for Poisson distributions. https://en.wikipedia.org/wiki/Poisson_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathrm{Poisson}(x), \mathrm{Poisson}(y)) = y - x + x \times \log(\frac{x}{y}).\]>>> klPoisson(3, 3) 0.0 >>> klPoisson(2, 1) # doctest: +ELLIPSIS 0.386294... >>> klPoisson(1, 2) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klPoisson(3, 6) # doctest: +ELLIPSIS 0.920558... >>> klPoisson(6, 8) # doctest: +ELLIPSIS 0.273907...
- Special values:
>>> klPoisson(1, 0) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 33.538776... >>> klPoisson(0, 0) 0.0
-
Policies.kullback.
klExp
(x, y)[source]¶ Kullback-Leibler divergence for exponential distributions. https://en.wikipedia.org/wiki/Exponential_distribution#Kullback.E2.80.93Leibler_divergence
\[\begin{split}\mathrm{KL}(\mathrm{Exp}(x), \mathrm{Exp}(y)) = \begin{cases} \frac{x}{y} - 1 - \log(\frac{x}{y}) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]>>> klExp(3, 3) 0.0 >>> klExp(3, 6) # doctest: +ELLIPSIS 0.193147... >>> klExp(1, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.193147... >>> klExp(2, 1) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klExp(4, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.306852... >>> klExp(6, 8) # doctest: +ELLIPSIS 0.037682...
- x, y have to be positive:
>>> klExp(-3, 2) inf >>> klExp(3, -2) inf >>> klExp(-3, -2) inf
-
Policies.kullback.
klGamma
(x, y, a=1)[source]¶ Kullback-Leibler divergence for gamma distributions. https://en.wikipedia.org/wiki/Gamma_distribution#Kullback.E2.80.93Leibler_divergence
- It is simply a times
klExp()
on x and y.
\[\begin{split}\mathrm{KL}(\Gamma(x, a), \Gamma(y, a)) = \begin{cases} a \times \left( \frac{x}{y} - 1 - \log(\frac{x}{y}) \right) & \text{if} x > 0, y > 0\\ +\infty & \text{otherwise} \end{cases}\end{split}\]Warning
The two distributions must have the same parameter a.
>>> klGamma(3, 3) 0.0 >>> klGamma(3, 6) # doctest: +ELLIPSIS 0.193147... >>> klGamma(1, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.193147... >>> klGamma(2, 1) # And this KL is non-symmetric # doctest: +ELLIPSIS 0.306852... >>> klGamma(4, 2) # Only the proportion between x and y is used # doctest: +ELLIPSIS 0.306852... >>> klGamma(6, 8) # doctest: +ELLIPSIS 0.037682...
- x, y have to be positive:
>>> klGamma(-3, 2) inf >>> klGamma(3, -2) inf >>> klGamma(-3, -2) inf
-
Policies.kullback.
klNegBin
(x, y, r=1)[source]¶ Kullback-Leibler divergence for negative binomial distributions. https://en.wikipedia.org/wiki/Negative_binomial_distribution
\[\mathrm{KL}(\mathrm{NegBin}(x, r), \mathrm{NegBin}(y, r)) = r \times \log((r + x) / (r + y)) - x \times \log(y \times (r + x) / (x \times (r + y))).\]Warning
The two distributions must have the same parameter r.
>>> klNegBin(0.5, 0.5) 0.0 >>> klNegBin(0.1, 0.9) # doctest: +ELLIPSIS -0.711611... >>> klNegBin(0.9, 0.1) # And this KL is non-symmetric # doctest: +ELLIPSIS 2.0321564... >>> klNegBin(0.4, 0.5) # doctest: +ELLIPSIS -0.130653... >>> klNegBin(0.01, 0.99) # doctest: +ELLIPSIS -0.717353...
- Special values:
>>> klBern(0, 1) # Should be +inf, but 0 --> eps, 1 --> 1 - eps # doctest: +ELLIPSIS 34.539575...
- With other values for r:
>>> klNegBin(0.5, 0.5, r=2) 0.0 >>> klNegBin(0.1, 0.9, r=2) # doctest: +ELLIPSIS -0.832991... >>> klNegBin(0.1, 0.9, r=4) # doctest: +ELLIPSIS -0.914890... >>> klNegBin(0.9, 0.1, r=2) # And this KL is non-symmetric # doctest: +ELLIPSIS 2.3325528... >>> klNegBin(0.4, 0.5, r=2) # doctest: +ELLIPSIS -0.154572... >>> klNegBin(0.01, 0.99, r=2) # doctest: +ELLIPSIS -0.836257...
-
Policies.kullback.
klGauss
(x, y, sig2x=0.25, sig2y=None)[source]¶ Kullback-Leibler divergence for Gaussian distributions of means
x
 and y
 and variances sig2x
 and sig2y
 , \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties
- By default, sig2y is assumed to be sig2x (same variance).
Warning
The C version does not support different variances.
>>> klGauss(3, 3) 0.0 >>> klGauss(3, 6) 18.0 >>> klGauss(1, 2) 2.0 >>> klGauss(2, 1) # And this KL is symmetric 2.0 >>> klGauss(4, 2) 8.0 >>> klGauss(6, 8) 8.0
- x, y can be negative:
>>> klGauss(-3, 2) 50.0 >>> klGauss(3, -2) 50.0 >>> klGauss(-3, -2) 2.0 >>> klGauss(3, 2) 2.0
- With other values for sig2x:
>>> klGauss(3, 3, sig2x=10) 0.0 >>> klGauss(3, 6, sig2x=10) 0.45 >>> klGauss(1, 2, sig2x=10) 0.05 >>> klGauss(2, 1, sig2x=10) # And this KL is symmetric 0.05 >>> klGauss(4, 2, sig2x=10) 0.2 >>> klGauss(6, 8, sig2x=10) 0.2
- With different values for sig2x and sig2y:
>>> klGauss(0, 0, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS -0.0284... >>> klGauss(0, 0, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.2243... >>> klGauss(0, 0, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 1.1534...
>>> klGauss(0, 1, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS 0.9715... >>> klGauss(0, 1, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.7243... >>> klGauss(0, 1, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 3.1534...
>>> klGauss(1, 0, sig2x=0.25, sig2y=0.5) # doctest: +ELLIPSIS 0.9715... >>> klGauss(1, 0, sig2x=0.25, sig2y=1.0) # doctest: +ELLIPSIS 0.7243... >>> klGauss(1, 0, sig2x=0.5, sig2y=0.25) # not symmetric here! # doctest: +ELLIPSIS 3.1534...
Warning
Using
Policies.klUCB
 (and variants) with klGauss()
 is equivalent to using Policies.UCB
, so prefer the simpler version.
-
Policies.kullback.
klucb
(x, d, kl, upperbound, precision=1e-06, lowerbound=-inf, max_iterations=50)[source]¶ The generic KL-UCB index computation.
x
: value of the cum reward,d
: upper bound on the divergence,kl
: the KL divergence to be used (klBern()
,klGauss()
, etc),upperbound
,lowerbound=float('-inf')
: the known bound of the valuesx
,precision=1e-6
: the threshold at which to stop the search, max_iterations=50
: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{klucb}(x, d) \simeq \sup_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) < d \}.\]Note
It uses a bisection search, and one call to
kl
for each step of the bisection search.For example, for
klucbBern()
, the two steps are to first compute an upperbound (as precise as possible) and the compute the kl-UCB index:>>> x, d = 0.9, 0.2 # mean x, exploration term d >>> upperbound = min(1., klucbGauss(x, d, sig2x=0.25)) # variance 1/4 for [0,1] bounded distributions >>> upperbound # doctest: +ELLIPSIS 1.0 >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=10) # doctest: +ELLIPSIS 0.9941... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=10) # doctest: +ELLIPSIS 0.9944... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-3, max_iterations=50) # doctest: +ELLIPSIS 0.9941... >>> klucb(x, d, klBern, upperbound, lowerbound=0, precision=1e-6, max_iterations=100) # more and more precise! # doctest: +ELLIPSIS 0.994489...
Note
See below for more examples for different KL divergence functions.
-
Policies.kullback.
klucbBern
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Bernoulli distributions, using
klucb()
.- Influence of x:
>>> klucbBern(0.1, 0.2) # doctest: +ELLIPSIS 0.378391... >>> klucbBern(0.5, 0.2) # doctest: +ELLIPSIS 0.787088... >>> klucbBern(0.9, 0.2) # doctest: +ELLIPSIS 0.994489...
- Influence of d:
>>> klucbBern(0.1, 0.4) # doctest: +ELLIPSIS 0.519475... >>> klucbBern(0.1, 0.9) # doctest: +ELLIPSIS 0.734714...
>>> klucbBern(0.5, 0.4) # doctest: +ELLIPSIS 0.871035... >>> klucbBern(0.5, 0.9) # doctest: +ELLIPSIS 0.956809...
>>> klucbBern(0.9, 0.4) # doctest: +ELLIPSIS 0.999285... >>> klucbBern(0.9, 0.9) # doctest: +ELLIPSIS 0.999995...
-
Policies.kullback.
klucbGauss
(x, d, sig2x=0.25, precision=0.0)[source]¶ KL-UCB index computation for Gaussian distributions.
- Note that it does not require any search.
Warning
it works only if the right variance constant is given.
- Influence of x:
>>> klucbGauss(0.1, 0.2) # doctest: +ELLIPSIS 0.416227... >>> klucbGauss(0.5, 0.2) # doctest: +ELLIPSIS 0.816227... >>> klucbGauss(0.9, 0.2) # doctest: +ELLIPSIS 1.216227...
- Influence of d:
>>> klucbGauss(0.1, 0.4) # doctest: +ELLIPSIS 0.547213... >>> klucbGauss(0.1, 0.9) # doctest: +ELLIPSIS 0.770820...
>>> klucbGauss(0.5, 0.4) # doctest: +ELLIPSIS 0.947213... >>> klucbGauss(0.5, 0.9) # doctest: +ELLIPSIS 1.170820...
>>> klucbGauss(0.9, 0.4) # doctest: +ELLIPSIS 1.347213... >>> klucbGauss(0.9, 0.9) # doctest: +ELLIPSIS 1.570820...
Warning
Using
Policies.klUCB
 (and variants) with klucbGauss()
 is equivalent to using Policies.UCB
, so prefer the simpler version.
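Since the Gaussian KL divergence can be inverted in closed form, no bisection search is needed here; a minimal sketch of the closed form, consistent with the doctests above:
import math

def klucb_gauss(x, d, sig2x=0.25):
    # sup{ q : (q - x)^2 / (2 sig2x) <= d } = x + sqrt(2 sig2x d)
    return x + math.sqrt(2.0 * sig2x * d)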
-
Policies.kullback.
klucbPoisson
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Poisson distributions, using
klucb()
.- Influence of x:
>>> klucbPoisson(0.1, 0.2) # doctest: +ELLIPSIS 0.450523... >>> klucbPoisson(0.5, 0.2) # doctest: +ELLIPSIS 1.089376... >>> klucbPoisson(0.9, 0.2) # doctest: +ELLIPSIS 1.640112...
- Influence of d:
>>> klucbPoisson(0.1, 0.4) # doctest: +ELLIPSIS 0.693684... >>> klucbPoisson(0.1, 0.9) # doctest: +ELLIPSIS 1.252796...
>>> klucbPoisson(0.5, 0.4) # doctest: +ELLIPSIS 1.422933... >>> klucbPoisson(0.5, 0.9) # doctest: +ELLIPSIS 2.122985...
>>> klucbPoisson(0.9, 0.4) # doctest: +ELLIPSIS 2.033691... >>> klucbPoisson(0.9, 0.9) # doctest: +ELLIPSIS 2.831573...
-
Policies.kullback.
klucbExp
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for exponential distributions, using
klucb()
.- Influence of x:
>>> klucbExp(0.1, 0.2) # doctest: +ELLIPSIS 0.202741... >>> klucbExp(0.5, 0.2) # doctest: +ELLIPSIS 1.013706... >>> klucbExp(0.9, 0.2) # doctest: +ELLIPSIS 1.824671...
- Influence of d:
>>> klucbExp(0.1, 0.4) # doctest: +ELLIPSIS 0.285792... >>> klucbExp(0.1, 0.9) # doctest: +ELLIPSIS 0.559088...
>>> klucbExp(0.5, 0.4) # doctest: +ELLIPSIS 1.428962... >>> klucbExp(0.5, 0.9) # doctest: +ELLIPSIS 2.795442...
>>> klucbExp(0.9, 0.4) # doctest: +ELLIPSIS 2.572132... >>> klucbExp(0.9, 0.9) # doctest: +ELLIPSIS 5.031795...
-
Policies.kullback.
klucbGamma
(x, d, precision=1e-06)[source]¶ KL-UCB index computation for Gamma distributions, using
klucb()
.- Influence of x:
>>> klucbGamma(0.1, 0.2) # doctest: +ELLIPSIS 0.202... >>> klucbGamma(0.5, 0.2) # doctest: +ELLIPSIS 1.013... >>> klucbGamma(0.9, 0.2) # doctest: +ELLIPSIS 1.824...
- Influence of d:
>>> klucbGamma(0.1, 0.4) # doctest: +ELLIPSIS 0.285... >>> klucbGamma(0.1, 0.9) # doctest: +ELLIPSIS 0.559...
>>> klucbGamma(0.5, 0.4) # doctest: +ELLIPSIS 1.428... >>> klucbGamma(0.5, 0.9) # doctest: +ELLIPSIS 2.795...
>>> klucbGamma(0.9, 0.4) # doctest: +ELLIPSIS 2.572... >>> klucbGamma(0.9, 0.9) # doctest: +ELLIPSIS 5.031...
-
Policies.kullback.
kllcb
(x, d, kl, lowerbound, precision=1e-06, upperbound=inf, max_iterations=50)[source]¶ The generic KL-LCB index computation.
x
: value of the cum reward,d
: lower bound on the divergence,kl
: the KL divergence to be used (klBern()
,klGauss()
, etc),lowerbound
, upperbound=float('inf')
: the known bound of the valuesx
,precision=1e-6
: the threshold at which to stop the search, max_iterations=50
: max number of iterations of the loop (safer to bound it to reduce time complexity).
\[\mathrm{kllcb}(x, d) \simeq \inf_{\mathrm{lowerbound} \leq y \leq \mathrm{upperbound}} \{ y : \mathrm{kl}(x, y) > d \}.\]Note
It uses a bisection search, and one call to
kl
for each step of the bisection search.For example, for
kllcbBern()
, the two steps are to first compute an upperbound (as precise as possible) and the compute the kl-UCB index:>>> x, d = 0.9, 0.2 # mean x, exploration term d >>> lowerbound = max(0., kllcbGauss(x, d, sig2x=0.25)) # variance 1/4 for [0,1] bounded distributions >>> lowerbound # doctest: +ELLIPSIS 0.5837... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=10) # doctest: +ELLIPSIS 0.29... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=10) # doctest: +ELLIPSIS 0.29188... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-3, max_iterations=50) # doctest: +ELLIPSIS 0.291886... >>> kllcb(x, d, klBern, lowerbound, upperbound=0, precision=1e-6, max_iterations=100) # more and more precise! # doctest: +ELLIPSIS 0.29188611...
Note
See below for more examples for different KL divergence functions.
-
Policies.kullback.
kllcbBern
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for Bernoulli distributions, using
kllcb()
.- Influence of x:
>>> kllcbBern(0.1, 0.2) # doctest: +ELLIPSIS 0.09999... >>> kllcbBern(0.5, 0.2) # doctest: +ELLIPSIS 0.49999... >>> kllcbBern(0.9, 0.2) # doctest: +ELLIPSIS 0.89999...
- Influence of d:
>>> kllcbBern(0.1, 0.4) # doctest: +ELLIPSIS 0.09999... >>> kllcbBern(0.1, 0.9) # doctest: +ELLIPSIS 0.09999...
>>> kllcbBern(0.5, 0.4) # doctest: +ELLIPSIS 0.4999... >>> kllcbBern(0.5, 0.9) # doctest: +ELLIPSIS 0.4999...
>>> kllcbBern(0.9, 0.4) # doctest: +ELLIPSIS 0.8999... >>> kllcbBern(0.9, 0.9) # doctest: +ELLIPSIS 0.8999...
-
Policies.kullback.
kllcbGauss
(x, d, sig2x=0.25, precision=0.0)[source]¶ KL-LCB index computation for Gaussian distributions.
- Note that it does not require any search.
Warning
it works only if the right variance constant is given.
- Influence of x:
>>> kllcbGauss(0.1, 0.2) # doctest: +ELLIPSIS -0.21622... >>> kllcbGauss(0.5, 0.2) # doctest: +ELLIPSIS 0.18377... >>> kllcbGauss(0.9, 0.2) # doctest: +ELLIPSIS 0.58377...
- Influence of d:
>>> kllcbGauss(0.1, 0.4) # doctest: +ELLIPSIS -0.3472... >>> kllcbGauss(0.1, 0.9) # doctest: +ELLIPSIS -0.5708...
>>> kllcbGauss(0.5, 0.4) # doctest: +ELLIPSIS 0.0527... >>> kllcbGauss(0.5, 0.9) # doctest: +ELLIPSIS -0.1708...
>>> kllcbGauss(0.9, 0.4) # doctest: +ELLIPSIS 0.4527... >>> kllcbGauss(0.9, 0.9) # doctest: +ELLIPSIS 0.2291...
Warning
Using
Policies.kllCB
 (and variants) with kllcbGauss()
 is equivalent to using Policies.UCB
, so prefer the simpler version.
-
Policies.kullback.
kllcbPoisson
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for Poisson distributions, using
kllcb()
.- Influence of x:
>>> kllcbPoisson(0.1, 0.2) # doctest: +ELLIPSIS 0.09999... >>> kllcbPoisson(0.5, 0.2) # doctest: +ELLIPSIS 0.49999... >>> kllcbPoisson(0.9, 0.2) # doctest: +ELLIPSIS 0.89999...
- Influence of d:
>>> kllcbPoisson(0.1, 0.4) # doctest: +ELLIPSIS 0.09999... >>> kllcbPoisson(0.1, 0.9) # doctest: +ELLIPSIS 0.09999...
>>> kllcbPoisson(0.5, 0.4) # doctest: +ELLIPSIS 0.49999... >>> kllcbPoisson(0.5, 0.9) # doctest: +ELLIPSIS 0.49999...
>>> kllcbPoisson(0.9, 0.4) # doctest: +ELLIPSIS 0.89999... >>> kllcbPoisson(0.9, 0.9) # doctest: +ELLIPSIS 0.89999...
-
Policies.kullback.
kllcbExp
(x, d, precision=1e-06)[source]¶ KL-LCB index computation for exponential distributions, using
kllcb()
.- Influence of x:
>>> kllcbExp(0.1, 0.2) # doctest: +ELLIPSIS 0.15267... >>> kllcbExp(0.5, 0.2) # doctest: +ELLIPSIS 0.7633... >>> kllcbExp(0.9, 0.2) # doctest: +ELLIPSIS 1.3740...
- Influence of d:
>>> kllcbExp(0.1, 0.4) # doctest: +ELLIPSIS 0.2000... >>> kllcbExp(0.1, 0.9) # doctest: +ELLIPSIS 0.3842...
>>> kllcbExp(0.5, 0.4) # doctest: +ELLIPSIS 1.0000... >>> kllcbExp(0.5, 0.9) # doctest: +ELLIPSIS 1.9214...
>>> kllcbExp(0.9, 0.4) # doctest: +ELLIPSIS 1.8000... >>> kllcbExp(0.9, 0.9) # doctest: +ELLIPSIS 3.4586...
-
Policies.kullback.
maxEV
(p, V, klMax)[source]¶ Maximize expectation of \(V\) with respect to \(q\) s.t. \(\mathrm{KL}(p, q) < \text{klMax}\).
- Input args.: p, V, klMax.
- Reference: Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf).
-
Policies.kullback.
reseqp
(p, V, klMax, max_iterations=50)[source]¶ Solve
f(reseqp(p, V, klMax)) = klMax
, using Newton's method.Note
This is a subroutine of
maxEV()
.- Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011](https://arxiv.org/pdf/1004.5229.pdf).
Warning
np.dot is very slow!
-
Policies.kullback.
reseqp2
(p, V, klMax)[source]¶ Solve f(reseqp(p, V, klMax)) = klMax, using a blackbox minimizer, from scipy.optimize.
- FIXME it does not work well yet!
Note
This is a subroutine of
maxEV()
.- Reference: Eq. (4) in Section 3.2 of [Filippi, Cappé & Garivier - Allerton, 2011].
Warning
np.dot is very slow!
Policies.kullback_cython module¶
Policies.setup module¶
Policies.usenumba module¶
Import numba.jit or a dummy decorator.
-
Policies.usenumba.
USE_NUMBA
= False¶ Configure the use of numba
Policies.with_proba module¶
Simply defines a function with_proba()
that is used everywhere.
-
Policies.with_proba.
with_proba
(epsilon)[source]¶ Bernoulli test, with probability \(\varepsilon\), return True, and with probability \(1 - \varepsilon\), return False.
Example:
>>> from random import seed; seed(0) # reproductible >>> with_proba(0.5) False >>> with_proba(0.9) True >>> with_proba(0.1) False >>> if with_proba(0.2): ... print("This happens 20% of the time.")
-
Policies.with_proba.
random
() → x in the interval [0, 1).¶
PoliciesMultiPlayers package¶
PoliciesMultiPlayers : contains various collision-avoidance protocols for the multi-players setting.
Selfish
: a multi-player policy where every player is selfish, they do not try to handle the collisions.CentralizedNotFair
: a multi-player policy which uses a centralized intelligence to assign users to a FIXED arm.CentralizedFair
: a multi-player policy which uses a centralized intelligence to assign users an offset, each one taking an orthogonal arm based on (offset + t) % nbArms.CentralizedMultiplePlay
andCentralizedIMP
: multi-player policies that use centralized but non-omniscient learning to select K = nbPlayers arms at each time step.OracleNotFair
: a multi-player policy with full knowledge and centralized intelligence to assign users to a FIXED arm, among the best arms.OracleFair
: a multi-player policy which uses a centralized intelligence to assign users an offset, each one taking an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.rhoRand
,ALOHA
: implementation of generic collision avoidance algorithms, relying on a single-player bandit policy (eg.UCB
,Thompson
etc). And variants,rhoRandRand
,rhoRandSticky
,rhoRandRotating
,rhoRandEst
,rhoLearn
,rhoLearnEst
,rhoLearnExp3
,rhoRandALOHA
,rhoCentralized
is a semi-centralized version where orthogonal ranks 1..M are given to the players, instead of just giving them the value of M, but a decentralized learning policy is still used to learn the best arms.RandTopM
is another approach, similar torhoRandSticky
andMusicalChair
, but we hope it will be better, and we succeeded in analyzing it more easily.
All policies have the same interface, as described in BaseMPPolicy
for decentralized policies,
and BaseCentralizedPolicy
for centralized policies,
in order to use them in any experiment with the following approach:
my_policy_MP = Policy_MP(nbPlayers, nbArms)
children = my_policy_MP.children  # get a list of usable single-player policies
for one_policy in children:
    one_policy.startGame()  # start the game
for t in range(T):
    # each player chooses one arm
    k_t = [children[i].choice() for i in range(nbPlayers)]
    for k in range(nbArms):
        players_who_played_k = [i for i in range(nbPlayers) if k_t[i] == k]
        reward = reward_t[k] = arms[k].draw(t)  # sample a reward from arm k
        if len(players_who_played_k) > 1:  # collision on arm k: no reward
            reward = 0
        for i in players_who_played_k:
            children[i].getReward(k, reward)
Submodules¶
PoliciesMultiPlayers.ALOHA module¶
ALOHA: generalized implementation of the single-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421), for a generic single-player policy.
This policy uses the collision avoidance mechanism that is inspired by the classical ALOHA protocol, and any single-player policy.
-
PoliciesMultiPlayers.ALOHA.
tnext_beta
(t, beta=0.5)[source]¶ Simple function, as used in MEGA:
upper_tnext(t)
= \(t^{\beta}\). Default to \(t^{0.5}\).>>> tnext_beta(100, beta=0.1) # doctest: +ELLIPSIS 1.584... >>> tnext_beta(100, beta=0.5) 10.0 >>> tnext_beta(100, beta=0.9) # doctest: +ELLIPSIS 63.095... >>> tnext_beta(1000) # doctest: +ELLIPSIS 31.622...
-
PoliciesMultiPlayers.ALOHA.
make_tnext_beta
(beta=0.5)[source]¶ Returns the function \(t \mapsto t^{\beta}\).
>>> tnext = make_tnext_beta(0.5) >>> tnext(100) 10.0 >>> tnext(1000) # doctest: +ELLIPSIS 31.622...
-
PoliciesMultiPlayers.ALOHA.
tnext_log
(t, scaling=1.0)[source]¶ Other function, not the one used in MEGA, but our proposal:
upper_tnext(t)
= \(\text{scaling} * \log(1 + t)\).>>> tnext_log(100, scaling=1) # doctest: +ELLIPSIS 4.615... >>> tnext_log(100, scaling=10) # doctest: +ELLIPSIS 46.151... >>> tnext_log(100, scaling=100) # doctest: +ELLIPSIS 461.512... >>> tnext_log(1000) # doctest: +ELLIPSIS 6.908...
-
PoliciesMultiPlayers.ALOHA.
make_tnext_log_scaling
(scaling=1.0)[source]¶ Returns the function \(t \mapsto \text{scaling} * \log(1 + t)\).
>>> tnext = make_tnext_log_scaling(1) >>> tnext(100) # doctest: +ELLIPSIS 4.615... >>> tnext(1000) # doctest: +ELLIPSIS 6.908...
-
class
PoliciesMultiPlayers.ALOHA.
oneALOHA
(nbPlayers, mother, playerId, nbArms, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- Except for the handleCollision method: the ALOHA collision avoidance protocol is implemented here.
-
__init__
(nbPlayers, mother, playerId, nbArms, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
nbPlayers
= None¶ Number of players
-
p0
= None¶ Initial probability, should not be modified
-
p
= None¶ Current probability, can be modified
-
alpha_p0
= None¶ Parameter alpha for the recurrence equation for probability p(t)
-
beta
= None¶ Parameter beta
-
tnext
= None¶ Only store the delta time
-
t
= None¶ Internal time
-
chosenArm
= None¶ Last chosen arm
-
getReward
(arm, reward)[source]¶ Receive a reward on arm of index ‘arm’, as described by the ALOHA protocol.
- If there is no collision, receive a reward after pulling the arm.
-
handleCollision
(arm, reward=None)[source]¶ Handle a collision, on arm of index ‘arm’.
Warning
This method has to be implemented in the collision model, it is NOT implemented in the EvaluatorMultiPlayers.
Note
We do not care on which arm the collision occurred.
-
choice
()[source]¶ Identify the available arms, and use the underlying single-player policy (UCB, Thompson etc) to choose an arm from this sub-set of arms.
-
__module__
= 'PoliciesMultiPlayers.ALOHA'¶
-
class
PoliciesMultiPlayers.ALOHA.
ALOHA
(nbPlayers, nbArms, playerAlgo, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
ALOHA: implementation of the multi-player policy from [Concurrent bandits and cognitive radio network, O.Avner & S.Mannor, 2014](https://arxiv.org/abs/1404.5421), for a generic single-player policy.
-
__init__
(nbPlayers, nbArms, playerAlgo, p0=0.5, alpha_p0=0.5, ftnext=<function tnext_beta>, beta=None, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- p0: initial probability p(0); p(t) is the probability of persistence on the chosenArm at time t.
- alpha_p0: scaling in the update for p[t+1] <- alpha_p0 p[t] + (1 - alpha_p0)
- ftnext: general function, defaulting to t -> t^beta, used to sample a random time t_next(k) until which the chosenArm is unavailable. t -> log(1 + t) is also possible.
- (optional) beta: if present, overrides ftnext, which will be t -> t^beta.
- *args, **kwargs: positional and keyword arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> p0, alpha_p0 = 0.6, 0.5 >>> s = ALOHA(nbPlayers, nbArms, Thompson, p0=p0, alpha_p0=alpha_p0, ftnext=tnext_log) >>> [ child.choice() for child in s.children ] [6, 11, 8, 4, 8, 8] >>> s = ALOHA(nbPlayers, nbArms, UCBalpha, p0=p0, alpha_p0=alpha_p0, beta=0.5, alpha=1) >>> [ child.choice() for child in s.children ] [1, 0, 5, 2, 15, 3]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
__module__
= 'PoliciesMultiPlayers.ALOHA'¶
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
-
PoliciesMultiPlayers.ALOHA.
random
() → x in the interval [0, 1).¶
PoliciesMultiPlayers.BaseCentralizedPolicy module¶
Base class for any centralized policy, for the multi-players setting.
-
class
PoliciesMultiPlayers.BaseCentralizedPolicy.
BaseCentralizedPolicy
(nbArms)[source]¶ Bases:
object
Base class for any centralized policy, for the multi-players setting.
-
__dict__
= mappingproxy({'__module__': 'PoliciesMultiPlayers.BaseCentralizedPolicy', '__doc__': ' Base class for any centralized policy, for the multi-players setting.', '__init__': <function BaseCentralizedPolicy.__init__>, '__str__': <function BaseCentralizedPolicy.__str__>, 'startGame': <function BaseCentralizedPolicy.startGame>, 'getReward': <function BaseCentralizedPolicy.getReward>, 'choice': <function BaseCentralizedPolicy.choice>, '__dict__': <attribute '__dict__' of 'BaseCentralizedPolicy' objects>, '__weakref__': <attribute '__weakref__' of 'BaseCentralizedPolicy' objects>})¶
-
__module__
= 'PoliciesMultiPlayers.BaseCentralizedPolicy'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
PoliciesMultiPlayers.BaseMPPolicy module¶
Base class for any multi-players policy.
- If rewards are not in [0, 1], be sure to give the lower value and the amplitude. Eg, if rewards are in [-3, 3], lower = -3, amplitude = 6.
-
class
PoliciesMultiPlayers.BaseMPPolicy.
BaseMPPolicy
[source]¶ Bases:
object
Base class for any multi-players policy.
-
_choiceFromSubSet_one
(playerId, availableArms='all')[source]¶ Forward the call to self._players[playerId].
-
__module__
= 'PoliciesMultiPlayers.BaseMPPolicy'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
PoliciesMultiPlayers.CentralizedCycling module¶
CentralizedCycling: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one then takes an orthogonal arm based on (offset + t) % nbArms (see the sketch below).
- It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
- And it is perfectly fair on every run: each chosen arm is played successively by each player.
- Note that it does NOT assign players to the best arms: it has no knowledge of the means of the arms, only of the number of arms nbArms.
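As an illustration of this cycling rule (a sketch under the stated convention, not the package's Cycling class):

class CyclingSketch:
    """Pick arm (offset + t) % nbArms at every call (illustrative only)."""

    def __init__(self, nbArms, offset):
        self.nbArms = nbArms
        self.offset = offset
        self.t = 0

    def choice(self):
        arm = (self.offset + self.t) % self.nbArms
        self.t += 1
        return arm

With offsets 0, 1, ..., M-1 handed out centrally, the M players always occupy M distinct arms at every time step, hence no collisions.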
-
class
PoliciesMultiPlayers.CentralizedCycling.
Cycling
(nbArms, offset)[source]¶ Bases:
PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy
Cycling: select an arm as (offset + t) % nbArms, with offset being decided by the CentralizedCycling multi-player policy.
-
nbArms
= None¶ Number of arms
-
offset
= None¶ Offset
-
t
= None¶ Internal time
-
__module__
= 'PoliciesMultiPlayers.CentralizedCycling'¶
-
-
class
PoliciesMultiPlayers.CentralizedCycling.
CentralizedCycling
(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
CentralizedCycling: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one then takes an orthogonal arm based on (offset + t) % nbArms.
-
__init__
(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]¶ - nbPlayers: number of players to create (in self._players).
- nbArms: number of arms.
Examples:
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedCycling(2, 3)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[0, 2]
>>> [ child.choice() for child in s.children ]
[1, 0]
>>> [ child.choice() for child in s.children ]
[2, 1]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
__module__
= 'PoliciesMultiPlayers.CentralizedCycling'¶
-
PoliciesMultiPlayers.CentralizedFixed module¶
CentralizedFixed: a multi-player policy which uses a centralized intelligence to assign each user to a FIXED arm.
- It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
- But it is NOT fair on ONE run: the best arm is played by only one player.
- Note that on average it is fair (which player gets the best arm is randomly decided).
- Note that it does NOT assign players to the best arms: it has no knowledge of the means of the arms, only of the number of arms nbArms.
-
class
PoliciesMultiPlayers.CentralizedFixed.
Fixed
(nbArms, armIndex, lower=0.0, amplitude=1.0)[source]¶ Bases:
PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy
Fixed: always select a fixed arm, as decided by the CentralizedFixed multi-player policy.
-
nbArms
= None¶ Number of arms
-
armIndex
= None¶ Index of the fixed arm
-
__module__
= 'PoliciesMultiPlayers.CentralizedFixed'¶
-
-
class
PoliciesMultiPlayers.CentralizedFixed.
CentralizedFixed
(nbPlayers, nbArms)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
CentralizedFixed: a multi-player policy which uses a centralized intelligence to assign each user to a FIXED arm.
-
__init__
(nbPlayers, nbArms)[source]¶ - nbPlayers: number of players to create (in self._players).
- nbArms: number of arms.
Examples:
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedFixed(2, 3)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> s = CentralizedFixed(4, 8)
>>> [ child.choice() for child in s.children ]
[7, 6, 1, 2]
>>> [ child.choice() for child in s.children ]
[7, 6, 1, 2]
>>> s = CentralizedFixed(10, 14)
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
__module__
= 'PoliciesMultiPlayers.CentralizedFixed'¶
-
PoliciesMultiPlayers.CentralizedIMP module¶
CentralizedIMP: a multi-player policy where ONE policy is used by a centralized agent, which asks the policy to select nbPlayers arms at each step, using a hybrid strategy: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].
-
class
PoliciesMultiPlayers.CentralizedIMP.
CentralizedIMP
(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.CentralizedMultiplePlay.CentralizedMultiplePlay
CentralizedIMP: a multi-player policy where ONE policy is used by a centralized agent, which asks the policy to select nbPlayers arms at each step, using a hybrid strategy: choose nb-1 arms with maximal empirical averages, then 1 arm with maximal index. Cf. algorithm IMP-TS [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779].
-
__module__
= 'PoliciesMultiPlayers.CentralizedIMP'¶
-
PoliciesMultiPlayers.CentralizedMultiplePlay module¶
CentralizedMultiplePlay: a multi-player policy where ONE policy is used by a centralized agent; asking the policy to select nbPlayers arms at each step.
-
class
PoliciesMultiPlayers.CentralizedMultiplePlay.
CentralizedChildPointer
(mother, playerId)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Centralized version of the ChildPointer class.
-
__module__
= 'PoliciesMultiPlayers.CentralizedMultiplePlay'¶
-
-
class
PoliciesMultiPlayers.CentralizedMultiplePlay.
CentralizedMultiplePlay
(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
CentralizedMultiplePlay: a multi-player policy where ONE policy is used by a centralized agent; asking the policy to select nbPlayers arms at each step.
-
__init__
(nbPlayers, nbArms, playerAlgo, uniformAllocation=False, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- uniformAllocation: should the assignment of users to arms always be uniform (random), or fixed once the UCB indexes have converged? The first choice is fairer but incurs a linear number of switches; the second is not fair but only a constant number of switches (see the sketch below).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Examples:
>>> from Policies import *
>>> s = CentralizedMultiplePlay(2, 3, UCB)
>>> [ child.choice() for child in s.children ]
[2, 0]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
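For intuition, here is a hedged sketch of one centralized decision step, assuming (as the ChildPointer interface below suggests) that the single-player policy exposes a choiceMultiple(nb) method; the package's actual allocation logic may differ.

import numpy as np

def centralized_step(central_policy, nbPlayers, uniformAllocation, rng=np.random):
    """One decision step (illustrative): the central policy picks nbPlayers arms,
    which are then handed out to the players."""
    choices = central_policy.choiceMultiple(nbPlayers)  # nbPlayers distinct arms
    if uniformAllocation:
        order = rng.permutation(nbPlayers)   # fair: reshuffle the assignment every step
    else:
        order = np.arange(nbPlayers)         # fixed: player i always gets slot i
    return [choices[order[i]] for i in range(nbPlayers)]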
-
nbPlayers
= None¶ Number of players
-
player
= None¶ Only one policy
-
children
= None¶ But nbPlayers children, fake algorithms
-
nbArms
= None¶ Number of arms
-
uniformAllocation
= None¶ Option: in case of multiple plays, should the assignment of users to arms always be uniform (random), or fixed once the UCB indexes have converged? The first choice is fairer but incurs a linear number of switches; the second is not fair but only a constant number of switches
-
choices
= None¶ Choices, given by first call to internal algorithm
-
affectation_order
= None¶ Affectation of choices to players
-
_choice_one
(playerId)[source]¶ Use the internal single-player algorithm to make the choices when called for the first player, then reuse the stored choices for the other players.
-
_estimatedOrder_one
(playerId)[source]¶ Use the centralized algorithm to estimate ranking of the arms.
-
__module__
= 'PoliciesMultiPlayers.CentralizedMultiplePlay'¶
-
PoliciesMultiPlayers.ChildPointer module¶
ChildPointer: Class that acts as a child policy, but in fact it passes all its method calls to the mother class (that can pass it to its internal i-th player, or use any centralized computation).
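A simplified sketch of this delegation pattern (illustrative only; the real class forwards many more methods, listed below):

class ChildPointerSketch:
    """Simplified delegation: every call is forwarded to the mother policy
    together with this child's playerId (illustration, not the real class)."""

    def __init__(self, mother, playerId):
        self.mother = mother
        self.playerId = playerId

    def startGame(self):
        return self.mother._startGame_one(self.playerId)

    def getReward(self, arm, reward):
        return self.mother._getReward_one(self.playerId, arm, reward)

    def choice(self):
        return self.mother._choice_one(self.playerId)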
-
class
PoliciesMultiPlayers.ChildPointer.
ChildPointer
(mother, playerId)[source]¶ Bases:
object
Class that acts as a child policy, but in fact it passes all its method calls to the mother class (that can pass it to its internal i-th player, or use any centralized computation).
-
mother
= None¶ Pointer to the mother class.
-
playerId
= None¶ ID of player in the mother class list of players
-
nbArms
= None¶ Number of arms (pretty print)
-
startGame
()[source]¶ Pass the call to self.mother._startGame_one(playerId) with the player’s ID number.
-
getReward
(arm, reward)[source]¶ Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.
-
handleCollision
(arm, reward=None)[source]¶ Pass the call to self.mother._handleCollision_one(playerId, arm, reward) with the player’s ID number.
-
choiceWithRank
(rank=1)[source]¶ Pass the call to self.mother._choiceWithRank_one(playerId) with the player’s ID number.
-
choiceFromSubSet
(availableArms='all')[source]¶ Pass the call to self.mother._choiceFromSubSet_one(playerId) with the player’s ID number.
-
choiceMultiple
(nb=1)[source]¶ Pass the call to self.mother._choiceMultiple_one(playerId) with the player’s ID number.
-
choiceIMP
(nb=1)[source]¶ Pass the call to self.mother._choiceIMP_one(playerId) with the player’s ID number.
-
estimatedOrder
()[source]¶ Pass the call to self.mother._estimatedOrder_one(playerId) with the player’s ID number.
-
estimatedBestArms
(M=1)[source]¶ Pass the call to self.mother._estimatedBestArms_one(playerId) with the player’s ID number.
-
__module__
= 'PoliciesMultiPlayers.ChildPointer'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
PoliciesMultiPlayers.DepRound module¶
DepRound()
: implementation of the dependent rounding procedure, from [[Dependent rounding and its applications to approximation algorithms, by R Gandhi, S Khuller, S Parthasarathy, Journal of the ACM, 2006](http://dl.acm.org/citation.cfm?id=1147956)].
It solves the problem of efficiently selecting a set of \(k\) distinct actions from \(\{1,\dots,K\}\), while satisfying the condition that each action \(i\) is selected with probability \(p_i\) exactly.
The distribution \((p_1, \dots, p_K)\) on \(\{1,\dots,K\}\) is assumed to be given.
Dependent rounding, developed by [Gandhi et al.], is a technique that randomly selects a set of edges from a bipartite graph under cardinality constraints.
- It runs in \(\mathcal{O}(K)\) space complexity, and at most \(\mathcal{O}(K^2)\) time complexity (note that the article [Uchiya et al., 2010] wrongly claims it is in \(\mathcal{O}(K)\)).
- References: see also https://www.cs.umd.edu/~samir/grant/jacm06.pdf
-
PoliciesMultiPlayers.DepRound.
DepRound
(weights_p, k=1)[source]¶ [[Algorithms for adversarial bandit problems with multiple plays, by T.Uchiya, A.Nakamura and M.Kudo, 2010](http://hdl.handle.net/2115/47057)] Figure 5 (page 15) is a very clean presentation of the algorithm.
- Inputs: \(k < K\) and weights_p \(= (p_1, \dots, p_K)\) such that \(\sum_{i=1}^{K} p_i = k\) (or \(= 1\)).
- Output: A subset of \(\{1,\dots,K\}\) with exactly \(k\) elements. Each action \(i\) is selected with probability exactly \(p_i\).
Example:
>>> import numpy as np; import random
>>> np.random.seed(0); random.seed(0)  # for reproducibility!
>>> K = 5
>>> k = 2
>>> weights_p = [ 2, 2, 2, 2, 2 ]  # all equal weights
>>> DepRound(weights_p, k)
[3, 4]
>>> DepRound(weights_p, k)
[3, 4]
>>> DepRound(weights_p, k)
[0, 1]
>>> weights_p = [ 10, 8, 6, 4, 2 ]  # decreasing weights
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[1, 2]
>>> DepRound(weights_p, k)
[3, 4]
>>> weights_p = [ 3, 3, 0, 0, 3 ]  # weights with some zeros
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 4]
>>> DepRound(weights_p, k)
[0, 1]
- See [[Gandhi et al, 2006](http://dl.acm.org/citation.cfm?id=1147956)] for the details.
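For intuition, here is a hedged sketch of the classical dependent-rounding loop of [Gandhi et al., 2006] (an illustration, not the package's exact DepRound): two fractional probabilities are repeatedly picked and probability mass is shifted between them, preserving each expectation, until every \(p_i\) is 0 or 1.

import random

def dep_round_sketch(p, rng=random):
    """Dependent rounding (illustrative): return the indices i whose final p_i is 1.
    Assumes 0 <= p_i <= 1 and sum(p) = k for an integer k; index i is kept with
    probability exactly p_i."""
    p = list(p)
    eps = 1e-12
    while True:
        frac = [i for i, pi in enumerate(p) if eps < pi < 1 - eps]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        alpha, beta = min(1 - p[i], p[j]), min(p[i], 1 - p[j])
        # shift mass so that one of p_i, p_j reaches 0 or 1, keeping expectations unchanged
        if rng.random() < beta / (alpha + beta):
            p[i], p[j] = p[i] + alpha, p[j] - alpha
        else:
            p[i], p[j] = p[i] - beta, p[j] + beta
    return [i for i, pi in enumerate(p) if pi > 1 - eps]

The sketch assumes the weights are already scaled so that they sum to \(k\); the package's DepRound also accepts unnormalized weights, as in the doctest above.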
-
PoliciesMultiPlayers.DepRound.
random
() → x in the interval [0, 1).¶
PoliciesMultiPlayers.EstimateM module¶
EstimateM: generic wrapper on a multi-player decentralized learning policy, to learn on the run the number of players, adapted from rhoEst from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
- My choice for the threshold function, see
threshold_on_t()
, does not need the horizon either, and uses \(t\) instead.
Note
This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).
Warning
This is still very experimental!
Note
For a less generic approach, see the policies defined in rhoEst.rhoEst
(generalizing rhoRand.rhoRand
) and RandTopMEst.RandTopMEst
(generalizing RandTopM.RandTopM
).
-
PoliciesMultiPlayers.EstimateM.
threshold_on_t_with_horizon
(t, nbPlayersEstimate, horizon=None)[source]¶ Function \(\xi(T, k)\) used as a threshold in
rhoEstPlus
.- 0 if nbPlayersEstimate is 0,
- 1 if nbPlayersEstimate is 1,
- any function such that: \(\xi(T, k) = \omega(\log T)\) for all k > 1. (cf. http://mathworld.wolfram.com/Little-OmegaNotation.html). I choose \(\log(1 + T)^2\) or \(\log(1 + T) \log(1 + \log(1 + T))\), as it seems to work just fine and satisfies the condition (25) from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
Warning
It requires the horizon \(T\), and does not use the current time \(t\).
Example:
>>> threshold_on_t_with_horizon(1000, 3)  # doctest: +ELLIPSIS
14.287...
>>> threshold_on_t_with_horizon(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
16.357...
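A minimal sketch consistent with the description and the doctest values above (using the choice \(\log(1+T)\log(1+\log(1+T))\) and falling back to \(t\) when no horizon is given); the actual implementation may differ in details.

import math

def threshold_on_t_with_horizon_sketch(t, nbPlayersEstimate, horizon=None):
    # trivial cases stated above
    if nbPlayersEstimate <= 1:
        return nbPlayersEstimate
    T = horizon if horizon is not None else t
    # one admissible xi(T, k) = omega(log T) choice mentioned above
    return math.log(1 + T) * math.log(1 + math.log(1 + T))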
-
PoliciesMultiPlayers.EstimateM.
threshold_on_t_doubling_trick
(t, nbPlayersEstimate, horizon=None, base=2, min_fake_horizon=1000, T0=1)[source]¶ A trick to have a threshold depending on a growing horizon (doubling-trick).
- Instead of using \(t\) or \(T\), a fake horizon \(T_t\) is used, corresponding to the horizon a doubling-trick algorithm would be using at time \(t\).
- \(T_t = T_0 b^{\lceil \log_b(t) \rceil}\) is the default choice, for \(b=2\) \(T_0 = 10\).
- If \(T_t\) is too small,
min_fake_horizon
is used instead.
Warning
This is ongoing research!
Example:
>>> threshold_on_t_doubling_trick(1000, 3)  # doctest: +ELLIPSIS
14.356...
>>> threshold_on_t_doubling_trick(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
14.356...
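A hedged, self-contained sketch of this doubling-trick threshold, consistent with the doctest values above (defaults follow the signature, \(T_0 = 1\), \(b = 2\)); the actual implementation may differ.

import math

def threshold_on_t_doubling_trick_sketch(t, nbPlayersEstimate, horizon=None,
                                         base=2, min_fake_horizon=1000, T0=1):
    if nbPlayersEstimate <= 1:
        return nbPlayersEstimate
    # fake horizon T_t = T0 * base^ceil(log_base(t)), clipped below by min_fake_horizon;
    # the horizon argument is accepted but unused (both doctest calls give the same value)
    T_t = max(T0 * base ** math.ceil(math.log(max(t, 2), base)), min_fake_horizon)
    return math.log(1 + T_t) * math.log(1 + math.log(1 + T_t))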
-
PoliciesMultiPlayers.EstimateM.
threshold_on_t
(t, nbPlayersEstimate, horizon=None)[source]¶ Function \(\xi(t, k)\) used as a threshold in
rhoEst
.- 0 if nbPlayersEstimate is 0,
- 1 if nbPlayersEstimate is 1,
- My heuristic to be any-time (ie, without needing to know the horizon) is to use a function of \(t\) (current time) and not \(T\) (horizon).
- The choice which seemed to perform the best in practice was \(\xi(t, k) = c t\) for a small constant \(c\) (like 5 or 10).
Example:
>>> threshold_on_t(1000, 3)  # doctest: +ELLIPSIS
47.730...
>>> threshold_on_t(1000, 3, horizon=2000)  # doctest: +ELLIPSIS
47.730...
-
class
PoliciesMultiPlayers.EstimateM.
oneEstimateM
(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
-
__init__
(nbArms, playerAlgo, threshold, decentralizedPolicy, *args, lower=0.0, amplitude=1.0, horizon=None, args_decentralizedPolicy=None, kwargs_decentralizedPolicy=None, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
threshold
= None¶ Threshold function
-
nbPlayersEstimate
= None¶ Number of players. Optimistic: start by assuming it is alone!
-
collisionCount
= None¶ Count collisions on each arm, since last increase of nbPlayersEstimate
-
timeSinceLastCollision
= None¶ Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!
-
t
= None¶ Internal time
-
updateNbPlayers
(nbPlayers=None)[source]¶ Change the value of
nbPlayersEstimate
, and propagate the change to the underlying policy, for parameters calledmaxRank
ornbPlayers
.
-
choiceWithRank
(rank=1)[source]¶ Pass the call to self._policy.choiceWithRank() with the player’s ID number.
-
choiceFromSubSet
(availableArms='all')[source]¶ Pass the call to self._policy.choiceFromSubSet() with the player’s ID number.
-
choiceMultiple
(nb=1)[source]¶ Pass the call to self._policy.choiceMultiple() with the player’s ID number.
-
estimatedOrder
()[source]¶ Pass the call to self._policy.estimatedOrder() with the player’s ID number.
-
estimatedBestArms
(M=1)[source]¶ Pass the call to self._policy.estimatedBestArms() with the player’s ID number.
-
__module__
= 'PoliciesMultiPlayers.EstimateM'¶
-
class
PoliciesMultiPlayers.EstimateM.
EstimateM
(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
EstimateM: a generic wrapper for an efficient multi-players learning policy, with no prior knowledge of the number of players, and using any other MP policy.
-
__init__
(nbPlayers, nbArms, decentralizedPolicy, playerAlgo, policyArgs=None, horizon=None, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- nbArms: number of arms.
- decentralizedPolicy: base MP decentralized policy.
- threshold: the threshold function to use, see
threshold_on_t_with_horizon()
,threshold_on_t_doubling_trick()
orthreshold_on_t()
above. - policyArgs: named arguments (dictionary), given to
decentralizedPolicy
. - *args, **kwargs: arguments, named arguments, given to
decentralizedPolicy
(will probably be given to the single-player decentralized policy under the hood, don’t care).
Example:
>>> from Policies import *; from PoliciesMultiPlayers import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 4
>>> nbPlayers = 2
>>> s = EstimateM(nbPlayers, nbArms, rhoRand, UCBalpha, alpha=0.5)
>>> [ child.choice() for child in s.children ]
[0, 3]
- To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.EstimateM'¶
-
PoliciesMultiPlayers.OracleFair module¶
OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one then takes an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.
- It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
- And it is perfectly fair on every run: each chosen arm is played successively by each player.
- Note that it DOES assign players to the best arms: it requires full knowledge of the means of the arms, not simply the number of arms.
- Note that such perfect knowledge of the arms is needed, even though it is not physically plausible.
-
class
PoliciesMultiPlayers.OracleFair.
CyclingBest
(nbArms, offset, bestArms=None)[source]¶ Bases:
PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy
CyclingBest: select an arm in the best ones (bestArms) as (offset + t) % (len(bestArms)), with offset being decided by the OracleFair multi-player policy.
-
nbArms
= None¶ Number of arms
-
offset
= None¶ Offset
-
bestArms
= None¶ List of index of the best arms to play
-
nb_bestArms
= None¶ Number of best arms
-
t
= None¶ Internal time
-
__module__
= 'PoliciesMultiPlayers.OracleFair'¶
-
-
class
PoliciesMultiPlayers.OracleFair.
OracleFair
(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
OracleFair: a multi-player policy which uses a centralized intelligence to assign each user an offset; each one then takes an orthogonal arm based on (offset + t) % nbBestArms, among the best arms.
-
__init__
(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]¶ - nbPlayers: number of players to create (in self._players).
- armsMAB: MAB object that represents the arms.
Examples:
>>> import sys; sys.path.insert(0, '..'); from Environment import MAB; from Arms import Bernoulli
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
...
>>> s = OracleFair(2, problem)
>>> [ child.choice() for child in s.children ]
[1, 2]
>>> [ child.choice() for child in s.children ]
[2, 1]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
__module__
= 'PoliciesMultiPlayers.OracleFair'¶
-
PoliciesMultiPlayers.OracleNotFair module¶
OracleNotFair: a multi-player policy with full knowledge and centralized intelligence to assign each user to a FIXED arm, among the best arms.
- It guarantees absolutely no collisions, as long as there are more channels than users (always assumed).
- But it is NOT fair on ONE run: the best arm is played by only one player.
- Note that on average it is fair (which player gets the best arm is randomly decided).
- Note that it DOES assign players to the best arms: it requires full knowledge of the means of the arms, not simply the number of arms.
- Note that such perfect knowledge of the arms is needed, even though it is not physically plausible.
-
class
PoliciesMultiPlayers.OracleNotFair.
Fixed
(nbArms, armIndex)[source]¶ Bases:
PoliciesMultiPlayers.BaseCentralizedPolicy.BaseCentralizedPolicy
Fixed: always select a fixed arm, as decided by the OracleNotFair multi-player policy.
-
nbArms
= None¶ Number of arms
-
armIndex
= None¶ Index of fixed arm
-
__module__
= 'PoliciesMultiPlayers.OracleNotFair'¶
-
-
class
PoliciesMultiPlayers.OracleNotFair.
OracleNotFair
(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
OracleNotFair: a multi-player policy which uses a centralized intelligence to assign each user to a FIXED arm, among the best arms.
-
__init__
(nbPlayers, armsMAB, lower=0.0, amplitude=1.0)[source]¶ - nbPlayers: number of players to create (in self._players).
- armsMAB: MAB object that represents the arms.
Examples:
>>> import sys; sys.path.insert(0, '..'); from Environment import MAB; from Arms import Bernoulli
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> problem = MAB({'arm_type': Bernoulli, 'params': [0.1, 0.5, 0.9]})  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
...
>>> s = OracleNotFair(2, problem)
>>> [ child.choice() for child in s.children ]
[2, 1]
>>> [ child.choice() for child in s.children ]
[2, 1]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
__module__
= 'PoliciesMultiPlayers.OracleNotFair'¶
-
PoliciesMultiPlayers.RandTopM module¶
RandTopM: four proposals for an efficient multi-players learning policy. RandTopM
and MCTopM
are the two main algorithms, with variants (see below).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player j constantly aims at one of the M best arms (their estimated set is denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
- When a collision occurs or when the currently chosen arm lies outside of the current estimate of the set M-best, a new current arm is chosen.
Note
This is not fully decentralized: each child player needs to know the (fixed) number of players.
- Reference: [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733)
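To make the reselection rule concrete, here is a heavily simplified, hedged sketch of one decision step of the RandTopM/MCTopM family (it ignores the "chair" and the "previous worst" refinements described below); indexes stands for the vector \(g^j_k(t)\) and maxRank for M.

import numpy as np

def randtopm_step(chosen_arm, indexes, maxRank, collided, rng=np.random):
    """One simplified RandTopM-style reselection: keep the chosen arm while it stays
    in the estimated M-best set and no collision forces a move; otherwise resample
    uniformly inside the estimated M-best set."""
    Mbest = set(int(k) for k in np.argsort(indexes)[-maxRank:])  # estimated M best arms
    if chosen_arm is None or collided or chosen_arm not in Mbest:
        chosen_arm = int(rng.choice(list(Mbest)))
    return chosen_arm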
-
PoliciesMultiPlayers.RandTopM.
WITH_CHAIR
= False¶ Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the others will move). But she will still change her chosen arm if it lies outside of the estimated M-best.
RandTopM
(and variants) uses False andMCTopM
(and variants) uses True.
-
PoliciesMultiPlayers.RandTopM.
OPTIM_PICK_WORST_FIRST
= False¶ XXX First experimental idea: when the currently chosen arm lies outside of the estimated Mbest set, force to first try (at least once) the arm with lowest UCB indexes in this Mbest_j(t) set. Used by
RandTopMCautious
andRandTopMExtraCautious
, and byMCTopMCautious
andMCTopMExtraCautious
.
-
PoliciesMultiPlayers.RandTopM.
OPTIM_EXIT_IF_WORST_WAS_PICKED
= False¶ XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Used by
RandTopMExtraCautious
andMCTopMExtraCautious
.
-
PoliciesMultiPlayers.RandTopM.
OPTIM_PICK_PREV_WORST_FIRST
= True¶ XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Default now!. False only for
RandTopMOld
andMCTopMOld
.
-
class
PoliciesMultiPlayers.RandTopM.
oneRandTopM
(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- Except for the handleCollision method: a new random arm is sampled after observing a collision,
- And the player does not aim at the best arm, but at one of the best arm, based on her index policy.
- (See variants for more details.)
-
__init__
(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different.
-
chosen_arm
= None¶ Current chosen arm.
-
sitted
= None¶ Not yet sitted. After 1 step without collision, don’t react to collision (but still react when the chosen arm lies outside M-best).
-
prevWorst
= None¶ Keep track of the arms worse than the chosen one at the previous time step.
-
t
= None¶ Internal time
-
worst_Mbest
()[source]¶ Index of the worst of the current estimate of the M-best arms. M is the maxRank given to the algorithm.
-
worst_previous__and__current_Mbest
(current_arm)[source]¶ Return the set from which to select a random arm for
MCTopM
(the optimization is now the default):\[\hat{M}^j(t) \cap \{ m : g^j_m(t-1) \leq g^j_k(t-1) \}.\]
-
handleCollision
(arm, reward=None)[source]¶ Get a new random arm from the current estimate of Mbest, and give reward to the algorithm if not None.
-
getReward
(arm, reward)[source]¶ Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.
-
choice
()[source]¶ Reconsider the choice of arm, and then use the chosen arm.
- For all variants, if the chosen arm is no longer in the current estimate of the Mbest set, a new one is selected,
- The basic RandTopM selects uniformly an arm in estimate Mbest,
- MCTopM starts by being “non sitted” on its new chosen arm,
- MCTopMCautious is forced to first try the arm with lowest UCB indexes (or whatever index policy is used).
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
class
PoliciesMultiPlayers.RandTopM.
RandTopM
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopM: a proposal for an efficient multi-players learning policy.
-
__init__
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- withChair: see
WITH_CHAIR
, - pickWorstFirst: see
OPTIM_PICK_WORST_FIRST
, - exitIfWorstWasPicked: see
EXIT_IF_WORST_WAS_PICKED
, - pickPrevWorstFirst: see
OPTIM_PICK_PREV_WORST_FIRST
, - maxRank: maximum rank allowed by the RandTopM child (default to nbPlayers, but for instance if there is 2 × RandTopM[UCB] + 2 × RandTopM[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
- To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
withChair
= None¶ Using a chair ?
-
pickWorstFirst
= None¶ Using first optimization ?
-
exitIfWorstWasPicked
= None¶ Using second optimization ?
-
pickPrevWorstFirst
= None¶ Using third optimization ? Default to yes now.
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
RandTopMCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the RandTopMCautious child (default to nbPlayers, but for instance if there is 2 × RandTopMCautious[UCB] + 2 × RandTopMCautious[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
RandTopMExtraCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the RandTopMExtraCautious child (default to nbPlayers, but for instance if there is 2 × RandTopMExtraCautious[UCB] + 2 × RandTopMExtraCautious[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
RandTopMOld
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMOld: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the RandTopMOld child (default to nbPlayers, but for instance if there is 2 × RandTopMOld[UCB] + 2 × RandTopMOld[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
MCTopM
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopM: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the MCTopM child (default to nbPlayers, but for instance if there is 2 × MCTopM[UCB] + 2 × MCTopM[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopM(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
MCTopMCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the MCTopMCautious child (default to nbPlayers, but for instance if there is 2 × MCTopMCautious[UCB] + 2 × MCTopMCautious[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
MCTopMExtraCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMExtraCautious: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the MCTopMExtraCautious child (default to nbPlayers, but for instance if there is 2 × MCTopMExtraCautious[UCB] + 2 × MCTopMExtraCautious[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMExtraCautious(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
-
class
PoliciesMultiPlayers.RandTopM.
MCTopMOld
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMOld: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be one of the most efficient decentralized MP algorithm we have so far… The two other variants of MCTopM seem even better!
-
__module__
= 'PoliciesMultiPlayers.RandTopM'¶
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the MCTopMOld child (default to nbPlayers, but for instance if there is 2 × MCTopMOld[UCB] + 2 × MCTopMOld[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = MCTopMOld(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
-
PoliciesMultiPlayers.RandTopMEst module¶
RandTopMEst: four proposals for an efficient multi-players learning policy. RandTopMEst
and MCTopMEst
are the two main algorithms, with variants (see below).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player j constantly aims at one of the M best arms (their estimated set is denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),
- When a collision occurs or when the currently chosen arm lies outside of the current estimate of the set M-best, a new current arm is chosen.
- The (fixed) number of players is learned on the run.
Note
This is fully decentralized: players do not need to know the (fixed) number of players!
- Reference: [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733)
Warning
This is still very experimental!
Note
For a more generic approach, see the wrapper defined in EstimateM.EstimateM
.
-
class
PoliciesMultiPlayers.RandTopMEst.
oneRandTopMEst
(threshold, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.oneRandTopM
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
-
__init__
(threshold, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
threshold
= None¶ Threshold function
-
nbPlayersEstimate
= None¶ Number of players. Optimistic: start by assuming it is alone!
-
collisionCount
= None¶ Count collisions on each arm, since last increase of nbPlayersEstimate
-
timeSinceLastCollision
= None¶ Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!
-
t
= None¶ Internal time
-
__module__
= 'PoliciesMultiPlayers.RandTopMEst'¶
-
PoliciesMultiPlayers.RandTopMEst.
WITH_CHAIR
= False¶ Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the others will move). But she will still change her chosen arm if it lies outside of the estimated M-best.
RandTopMEst
(and variants) uses False andMCTopMEst
(and variants) uses True.
-
PoliciesMultiPlayers.RandTopMEst.
OPTIM_PICK_WORST_FIRST
= False¶ XXX First experimental idea: when the currently chosen arm lies outside of the estimated Mbest set, force to first try (at least once) the arm with lowest UCB indexes in this Mbest_j(t) set. Used by
RandTopMEstCautious
andRandTopMEstExtraCautious
, and byMCTopMEstCautious
andMCTopMEstExtraCautious
.
-
PoliciesMultiPlayers.RandTopMEst.
OPTIM_EXIT_IF_WORST_WAS_PICKED
= False¶ XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Used by
RandTopMEstExtraCautious
andMCTopMEstExtraCautious
.
-
PoliciesMultiPlayers.RandTopMEst.
OPTIM_PICK_PREV_WORST_FIRST
= True¶ XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Default now!. False only for
RandTopMEstOld
andMCTopMEstOld
.
-
class
PoliciesMultiPlayers.RandTopMEst.
RandTopMEst
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopMEst: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.
-
__init__
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- withChair: see
WITH_CHAIR
, - pickWorstFirst: see
OPTIM_PICK_WORST_FIRST
, - exitIfWorstWasPicked: see
EXIT_IF_WORST_WAS_PICKED
, - pickPrevWorstFirst: see
OPTIM_PICK_PREV_WORST_FIRST
, - threshold: the threshold function to use, see
EstimateM.threshold_on_t_with_horizon()
,EstimateM.threshold_on_t_doubling_trick()
orEstimateM.threshold_on_t()
above. - *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMEst(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
- To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
withChair
= None¶ Using a chair ?
-
pickWorstFirst
= None¶ Using first optimization ?
-
exitIfWorstWasPicked
= None¶ Using second optimization ?
-
pickPrevWorstFirst
= None¶ Using third optimization ? Default to yes now.
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.RandTopMEst'¶
-
-
class
PoliciesMultiPlayers.RandTopMEst.
RandTopMEstPlus
(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopMEstPlus: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.
-
__init__
(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- horizon: need to know the horizon \(T\).
- withChair: see
WITH_CHAIR
, - pickWorstFirst: see
OPTIM_PICK_WORST_FIRST
, - exitIfWorstWasPicked: see
EXIT_IF_WORST_WAS_PICKED
, - pickPrevWorstFirst: see
OPTIM_PICK_PREV_WORST_FIRST
, - threshold: the threshold function to use, see
threshold_on_t_with_horizon()
orthreshold_on_t()
above. - *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = RandTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
- To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
withChair
= None¶ Using a chair ?
-
pickWorstFirst
= None¶ Using first optimization ?
-
exitIfWorstWasPicked
= None¶ Using second optimization ?
-
pickPrevWorstFirst
= None¶ Using third optimization ? Default to yes now.
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.RandTopMEst'¶
-
-
class
PoliciesMultiPlayers.RandTopMEst.
MCTopMEst
(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopMEst.RandTopMEst
MCTopMEst: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__init__
(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
-
__module__
= 'PoliciesMultiPlayers.RandTopMEst'¶
-
-
class
PoliciesMultiPlayers.RandTopMEst.
MCTopMEstPlus
(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus
MCTopMEstPlus: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…
-
__module__
= 'PoliciesMultiPlayers.RandTopMEst'¶
-
__init__
(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
-
PoliciesMultiPlayers.Scenario1 module¶
Scenario1: make a set of M experts with the following behavior, for K = 2 arms: at every round, one of them is chosen uniformly to predict arm 0, and the rest predict 1.
- Reference: Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2011, April). Contextual Bandit Algorithms with Supervised Learning Guarantees. In AISTATS (pp. 19-26).
-
class
PoliciesMultiPlayers.Scenario1.
OneScenario1
(mother, playerId)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
OneScenario1: at every round, one of the M experts is chosen uniformly to predict arm 0, and the rest predict 1.
-
__module__
= 'PoliciesMultiPlayers.Scenario1'¶
-
-
class
PoliciesMultiPlayers.Scenario1.
Scenario1
(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
Scenario1: make a set of M experts with the following behavior, for K = 2 arms: at every round, one of them is chosen uniformly to predict arm 0, and the rest predict 1.
- Reference: Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2011, April). Contextual Bandit Algorithms with Supervised Learning Guarantees. In AISTATS (pp. 19-26).
-
__init__
(nbPlayers, nbArms, lower=0.0, amplitude=1.0)[source]¶ - nbPlayers: number of players to create (in self._players).
Examples:
>>> s = Scenario1(10)
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use
-
__module__
= 'PoliciesMultiPlayers.Scenario1'¶
PoliciesMultiPlayers.Selfish module¶
Selfish: a multi-player policy where every player is selfish, playing on their own,
- without knowing how many players there are,
- and not even knowing that they should try to avoid collisions. When a collision happens, the algorithm simply receives a 0 reward for the chosen arm (see the sketch below).
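A hedged sketch (not the package's code) of this collision handling: the colliding player is simply fed the penalty value as if it were the observed reward for that arm.

def handle_collision_sketch(player, arm, penalty=None):
    """Selfish-style collision handling (illustrative): the colliding player is given
    the penalty, or its own `lower` bound (default 0), as the observed reward."""
    fallback = getattr(player, 'lower', 0.0)
    player.getReward(arm, penalty if penalty is not None else fallback)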
-
class
PoliciesMultiPlayers.Selfish.
SelfishChildPointer
(mother, playerId)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Selfish version of the ChildPointer class (just pretty printed).
-
__module__
= 'PoliciesMultiPlayers.Selfish'¶
-
-
PoliciesMultiPlayers.Selfish.
PENALTY
= None¶ Customize here the value given to a user after a collision. If it is None, then player.lower (default 0) is used instead.
-
class
PoliciesMultiPlayers.Selfish.
Selfish
(nbPlayers, nbArms, playerAlgo, penalty=None, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
Selfish: a multi-player policy where every player is selfish, playing on their side.
- without knowing how many players there are, and
- not even knowing that they should try to avoid collisions. When a collision happens, the algorithm simply receives a 0 reward for the chosen arm (this can be changed with the penalty= argument).
-
__init__
(nbPlayers, nbArms, playerAlgo, penalty=None, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Examples:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = Selfish(nbPlayers, nbArms, Uniform)
>>> [ child.choice() for child in s.children ]
[12, 13, 1, 8, 16, 15]
>>> [ child.choice() for child in s.children ]
[12, 9, 15, 11, 6, 16]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
Warning
I want my code to stay compatible with Python 2, so I cannot use the new syntax of keyword-only argument. It would make more sense to have
*args, penalty=PENALTY, lower=0., amplitude=1., **kwargs
instead ofpenalty=PENALTY, *args, **kwargs
but I can’t.
-
nbPlayers
= None¶ Number of players
-
penalty
= None¶ Penalty = reward given in case of collision
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
_handleCollision_one
(playerId, arm, reward=None)[source]¶ Give a reward of 0, or player.lower, or self.penalty, in case of collision.
-
__module__
= 'PoliciesMultiPlayers.Selfish'¶
PoliciesMultiPlayers.rhoCentralized module¶
rhoCentralized: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- Every player has rank_i = i + 1, as given by the base station.
Note
This is not fully decentralized: each child player needs to know the (fixed) number of players, and an initial orthogonal configuration.
Warning
This policy is NOT efficient at ALL! Don’t use it! It seems a smart idea, but it’s not.
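The rank-based selection described above boils down to picking the arm whose index is the rank_i-th largest; a minimal sketch (illustrative only, not the package's code):

import numpy as np

def choice_with_rank_sketch(indexes, rank):
    """Pick the arm whose index is the rank-th largest (rank=1 is the best arm);
    roughly what each child does here with its fixed rank_i = i + 1."""
    order = np.argsort(indexes)   # arms sorted by increasing index
    return int(order[-rank])      # the rank-th best arm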
-
class
PoliciesMultiPlayers.rhoCentralized.
oneRhoCentralized
(maxRank, mother, playerId, rank=None, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which passes them to its i-th player.
- The player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, mother, playerId, rank=None, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
keep_the_same_rank
= None¶ If True, the rank is kept constant during the game, as if it was given by the Base Station
-
rank
= None¶ Current rank, starting at 1 by default, or at 'rank' if given as an argument
-
handleCollision
(arm, reward=None)[source]¶ Get a new fully random rank, and give reward to the algorithm if not None.
-
__module__
= 'PoliciesMultiPlayers.rhoCentralized'¶
-
class
PoliciesMultiPlayers.rhoCentralized.
rhoCentralized
(nbPlayers, nbArms, playerAlgo, maxRank=None, orthogonalRanks=True, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
rhoCentralized: implementation of a variant of the multi-player rhoRand policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, orthogonalRanks=True, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the rhoCentralized child (default to nbPlayers, but for instance if there is 2 × rhoCentralized[UCB] + 2 × rhoCentralized[klUCB], maxRank should be 4 not 2).
- orthogonalRanks: if True, orthogonal ranks 1..M are directly affected to the players 1..M.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoCentralized(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
orthogonalRanks
= None¶ Whether orthogonal ranks are used from the start
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoCentralized'¶
-
PoliciesMultiPlayers.rhoEst module¶
rhoEst: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on \([1, \dots, \hat{M}_i(t)]\) where \(\hat{M}_i(t)\) is the current estimate of the number of players by player i,
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple: basically every player starts with \(\hat{M}_i(0) = 1\), and the estimate is increased, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), once enough collisions have been observed, as decided by a threshold function (see the sketch after the notes below).
- My choice for the threshold function, threshold_on_t(), does not need the horizon either, and uses \(t\) instead.
Note
This is fully decentralized: each child player does NOT need to know the number of players and does NOT require the horizon \(T\).
Note
For a more generic approach, see the wrapper defined in EstimateM.EstimateM.
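The estimation loop can be pictured with a toy model. The following sketch uses hypothetical names and a made-up threshold rule in place of EstimateM.threshold_on_t() (which is different); it only illustrates the idea of starting from \(\hat{M}_i(0) = 1\) and incrementing the estimate once collisions accumulate past a threshold:

import random

def toy_threshold(t):
    # Made-up rule: "too many" collisions = about sqrt(t).
    # The real threshold functions in EstimateM are different.
    return max(1, int(round((t + 1) ** 0.5)))

class PlayerCountEstimator:
    """Sketch of the rhoEst idea: start from hat(M) = 1 and increase the
    estimate when collisions accumulate beyond a threshold."""
    def __init__(self):
        self.nb_players_estimate = 1   # optimistic: assume we are alone
        self.collision_count = 0
        self.t = 0

    def step(self, collided):
        self.t += 1
        if collided:
            self.collision_count += 1
            if self.collision_count >= toy_threshold(self.t):
                self.nb_players_estimate += 1
                self.collision_count = 0   # restart the count after each increase
        return self.nb_players_estimate

random.seed(0)
estimator = PlayerCountEstimator()
for _ in range(50):
    estimator.step(collided=(random.random() < 0.5))   # fake collision stream
print(estimator.nb_players_estimate)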
-
class
PoliciesMultiPlayers.rhoEst.
oneRhoEst
(threshold, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy,
- The rhoEst policy is used to keep an estimate on the total number of players, \(\hat{M}_i(t)\).
- The procedure to estimate \(\hat{M}_i(t)\) is not so simple, but basically everyone starts with \(\hat{M}_i(0) = 1\), and when colliding \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), for some time (with a complicated threshold).
-
__init__
(threshold, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
threshold
= None¶ Threshold function
-
nbPlayersEstimate
= None¶ Number of players. Optimistic: start by assuming it is alone!
-
rank
= None¶ Current rank, starting at 1
-
collisionCount
= None¶ Count collisions on each arm, since last increase of nbPlayersEstimate
-
timeSinceLastCollision
= None¶ Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!
-
t
= None¶ Internal time
-
__module__
= 'PoliciesMultiPlayers.rhoEst'¶
-
class
PoliciesMultiPlayers.rhoEst.
rhoEst
(nbPlayers, nbArms, playerAlgo, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoEst: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoEst(nbPlayers, nbArms, UCB, threshold=threshold_on_t)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoEst'¶
-
-
class
PoliciesMultiPlayers.rhoEst.
rhoEstPlus
(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoEstPlus: implementation of the 2nd multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- horizon: need to know the horizon \(T\).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = rhoEstPlus(nbPlayers, nbArms, UCB, horizon=horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoEst'¶
-
PoliciesMultiPlayers.rhoLearn module¶
rhoLearn: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players (see the sketch after the note below).
- If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy, it might be better! Warning: no theoretical guarantees exist!
- Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
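The feedback protocol between a child player and its rank-selection algorithm can be summarized as: reward 1 to the current rank when no collision is observed, reward 0 and a fresh draw when a collision occurs. A minimal sketch, with hypothetical class names and a Uniform placeholder standing in for the rankSelectionAlgo:

import random

class UniformRankSelector:
    """Placeholder rankSelectionAlgo: with Uniform, this degenerates to rhoRand.
    Any single-player bandit over the M ranks could be plugged in instead."""
    def __init__(self, max_rank):
        self.max_rank = max_rank
    def choice(self):
        return random.randint(1, self.max_rank)
    def getReward(self, rank, reward):
        pass   # Uniform ignores feedback; a learning policy would use it

class RankLearningChild:
    """Sketch of the rhoLearn feedback protocol for one child player."""
    def __init__(self, rank_selector):
        self.rank_selector = rank_selector
        self.rank = rank_selector.choice()

    def on_no_collision(self):
        # Reward 1 to the rank currently in use: it did not provoke a collision.
        self.rank_selector.getReward(self.rank, 1)

    def on_collision(self):
        # Reward 0 to that rank, then draw a (possibly new) rank from the learner.
        self.rank_selector.getReward(self.rank, 0)
        self.rank = self.rank_selector.choice()

random.seed(0)
child = RankLearningChild(UniformRankSelector(max_rank=6))
child.on_collision()
print(child.rank)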
-
PoliciesMultiPlayers.rhoLearn.
CHANGE_RANK_EACH_STEP
= False¶ Should oneRhoLearn players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.
-
class
PoliciesMultiPlayers.rhoLearn.
oneRhoLearn
(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
- When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
rank
= None¶ Current rank, starting at 1
-
change_rank_each_step
= None¶ Change rank at each step?
-
getReward
(arm, reward)[source]¶ Give a 1 reward to the rank selection algorithm (no collision), give reward to the arm selection algorithm, and if self.change_rank_each_step, select a (possibly new) rank.
-
handleCollision
(arm, reward=None)[source]¶ Give a 0 reward to the rank selection algorithm, and select a (possibly new) rank.
-
__module__
= 'PoliciesMultiPlayers.rhoLearn'¶
-
class
PoliciesMultiPlayers.rhoLearn.
rhoLearn
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, maxRank=None, change_rank_each_step=False, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoLearn: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.
-
__init__
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, maxRank=None, change_rank_each_step=False, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- rankSelectionAlgo: algorithm to use for selecting the ranks.
- maxRank: maximum rank allowed by the rhoRand child (default to nbPlayers, but for instance if there is 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearn(nbPlayers, nbArms, UCB, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
rankSelectionAlgo
= None¶ Policy to use to choose the ranks
-
nbArms
= None¶ Number of arms
-
change_rank_each_step
= None¶ Change rank at every step?
-
__module__
= 'PoliciesMultiPlayers.rhoLearn'¶
-
PoliciesMultiPlayers.rhoLearnEst module¶
rhoLearnEst: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank, and without knowing the number of users.
- It generalizes PoliciesMultiPlayers.rhoLearn.rhoLearn simply by letting the ranks be in \(\{1,\dots,K\}\) and not in \(\{1,\dots,M\}\), hoping the learning algorithm will be “smart enough” to learn by itself that ranks should be \(\leq M\).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players.
- If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy, it might be better! Warning: no theoretical guarantees exist!
- Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)
Note
This is fully decentralized: each child player does not need to know the (fixed) number of players, it will learn to select ranks only in \(\{1,\dots,M\}\) instead of \(\{1,\dots,K\}\).
Warning
This policy does not work very well!
-
class
PoliciesMultiPlayers.rhoLearnEst.
oneRhoLearnEst
(maxRank, rankSelectionAlgo, change_rank_each_step, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoLearn.oneRhoLearn
-
__module__
= 'PoliciesMultiPlayers.rhoLearnEst'¶
-
-
class
PoliciesMultiPlayers.rhoLearnEst.
rhoLearnEst
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, change_rank_each_step=False, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoLearn.rhoLearn
rhoLearnEst: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank, and without knowing the number of users.
-
__init__
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Uniform.Uniform'>, lower=0.0, amplitude=1.0, change_rank_each_step=False, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- rankSelectionAlgo: algorithm to use for selecting the ranks.
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Difference with PoliciesMultiPlayers.rhoLearn.rhoLearn:
- maxRank, the maximum rank allowed by the rhoRand child, is not an argument: it is always nbArms (= K).
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnEst(nbPlayers, nbArms, UCB, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
rankSelectionAlgo
= None¶ Policy to use to choose the ranks
-
nbArms
= None¶ Number of arms
-
change_rank_each_step
= None¶ Change rank at every step?
-
__module__
= 'PoliciesMultiPlayers.rhoLearnEst'¶
-
PoliciesMultiPlayers.rhoLearnExp3 module¶
rhoLearnExp3: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using the Exp3 learning algorithm instead of a random exploration for choosing the rank.
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is given by a second learning algorithm, playing on arms = ranks from [1, .., M], where M is the number of players.
- If rankSelection = Uniform, this is like rhoRand, but if it is a smarter policy (like Exp3 here), it might be better! Warning: no theoretical guarantees exist!
- Reference: [Proof-of-Concept System for Opportunistic Spectrum Access in Multi-user Decentralized Networks, S.J.Darak, C.Moy, J.Palicot, EAI 2016](https://doi.org/10.4108/eai.5-9-2016.151647), algorithm 2. (for BayesUCB only)
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
For the Exp3 algorithm:
- Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)
- See also [Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments, Y. Seldin, C. Szepesvári, P. Auer & Y. Abbasi-Yadkori, 2012](http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf).
-
PoliciesMultiPlayers.rhoLearnExp3.
binary_feedback
(sensing, collision)[source]¶ Count 1 iff the sensing authorized the user to communicate and no collision was observed.
\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= r_{j,t} = F_{m,t} \times (1 - c_{m,t}), \\ \text{where}\;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
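A direct transcription of this formula (hypothetical name binary_feedback_sketch; the module's binary_feedback is presumably equivalent):

def binary_feedback_sketch(sensing, collision):
    # r = F * (1 - c): 1 only if the channel was sensed free AND no collision occurred.
    return sensing * (1 - collision)

print(binary_feedback_sketch(1, 0))   # 1: free channel, successful transmission
print(binary_feedback_sketch(1, 1))   # 0: free channel but a collision
print(binary_feedback_sketch(0, 0))   # 0: busy channel, no transmission at all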
-
PoliciesMultiPlayers.rhoLearnExp3.
ternary_feedback
(sensing, collision)[source]¶ Count 1 iff the sensing authorized to communicate and no collision was observed, 0 if no communication, and -1 iff communication but a collision was observed.
\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= F_{m,t} \times (2 r_{m,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{m,t} \times (1 - c_{m,t}), \\ \text{and} \;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
-
PoliciesMultiPlayers.rhoLearnExp3.
generic_ternary_feedback
(sensing, collision, bonus=1, malus=-1)[source]¶ Count ‘bonus’ iff the sensing authorized to communicate and no collision was observed, ‘malus’ iff communication but a collision was observed, and 0 if no communication.
-
PoliciesMultiPlayers.rhoLearnExp3.
generic_continuous_feedback
(sensing, collision, bonus=1, malus=-1)[source]¶ Count ‘bonus’ iff the sensing authorized to communicate and no collision was observed, ‘malus’ iff communication but a collision was observed, but possibly does not count 0 if no communication.
\[\begin{split}\mathrm{reward}(\text{user}\;j, \text{time}\;t) &:= \mathrm{malus} + (\mathrm{bonus} - \mathrm{malus}) \times \frac{r'_{j,t} + 1}{2}, \\ \text{where}\;\; r'_{j,t} &:= F_{m,t} \times (2 r_{m,t} - 1), \\ \text{where}\;\; r_{j,t} &:= F_{m,t} \times (1 - c_{m,t}), \\ \text{and} \;\; F_{m,t} &\; \text{is the sensing feedback (1 iff channel is free)}, \\ \text{and} \;\; c_{m,t} &\; \text{is the collision feedback (1 iff user j experienced a collision)}.\end{split}\]
-
PoliciesMultiPlayers.rhoLearnExp3.
reward_from_decoupled_feedback
(sensing, collision)¶ Decide the default function to use. FIXME try all of them!
-
PoliciesMultiPlayers.rhoLearnExp3.
CHANGE_RANK_EACH_STEP
= False¶ Should oneRhoLearnExp3 players select a (possibly new) rank at each step? The algorithm P2 from https://doi.org/10.4108/eai.5-9-2016.151647 suggests doing so, but I found it works better without this trick.
-
class
PoliciesMultiPlayers.rhoLearnExp3.
oneRhoLearnExp3
(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a (possibly new) rank is sampled after observing a collision, from the rankSelection algorithm.
- When no collision is observed on an arm, a small reward is given to the rank used for this play, in order to learn the best ranks with rankSelection.
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, rankSelectionAlgo, change_rank_each_step, feedback_function, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
rank
= None¶ Current rank, starting at 1
-
change_rank_each_step
= None¶ Change rank at each step?
-
feedback_function
= None¶ Feedback function: (sensing, collision) -> reward
-
getReward
(arm, reward)[source]¶ Give a “good” reward to the rank selection algorithm (no collision), give reward to the arm selection algorithm, and if self.change_rank_each_step, select a (possibly new) rank.
-
handleCollision
(arm, reward)[source]¶ Give a “bad” reward to the rank selection algorithm, and select a (possibly new) rank.
-
__module__
= 'PoliciesMultiPlayers.rhoLearnExp3'¶
-
class
PoliciesMultiPlayers.rhoLearnExp3.
rhoLearnExp3
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Exp3.Exp3Decreasing'>, maxRank=None, change_rank_each_step=False, feedback_function=<function binary_feedback>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoLearnExp3: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/), using a learning algorithm instead of a random exploration for choosing the rank.
-
__init__
(nbPlayers, nbArms, playerAlgo, rankSelectionAlgo=<class 'Policies.Exp3.Exp3Decreasing'>, maxRank=None, change_rank_each_step=False, feedback_function=<function binary_feedback>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- rankSelectionAlgo: algorithm to use for selecting the ranks.
- maxRank: maximum rank allowed by the rhoRand child (default to nbPlayers, but for instance if there is 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoLearnExp3(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[0, 1, 9, 0, 10, 3]
>>> [ child.choice() for child in s.children ]
[11, 2, 0, 0, 4, 5]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
rankSelectionAlgo
= None¶ Policy to use to choose the ranks
-
nbArms
= None¶ Number of arms
-
change_rank_each_step
= None¶ Change rank at every step?
-
__module__
= 'PoliciesMultiPlayers.rhoLearnExp3'¶
-
PoliciesMultiPlayers.rhoRand module¶
rhoRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players (see the sketch after the note below).
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
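The collision rule of rhoRand for one child player fits in a few lines. This is only an illustrative sketch with hypothetical names, not the code of oneRhoRand:

import random

class OneRhoRandSketch:
    """Illustrative sketch of the rhoRand collision rule for one child player."""
    def __init__(self, max_rank):
        self.max_rank = max_rank
        self.rank = random.randint(1, max_rank)   # initial random rank in {1, ..., M}

    def handle_collision(self):
        # After a collision, draw a brand new uniform rank (possibly the same one).
        self.rank = random.randint(1, self.max_rank)

random.seed(0)
player = OneRhoRandSketch(max_rank=6)
player.handle_collision()
print(player.rank)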
-
class
PoliciesMultiPlayers.rhoRand.
oneRhoRand
(maxRank, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
rank
= None¶ Current rank, starting at 1 by default
-
handleCollision
(arm, reward=None)[source]¶ Get a new fully random rank, and give reward to the algorithm if not None.
-
__module__
= 'PoliciesMultiPlayers.rhoRand'¶
-
class
PoliciesMultiPlayers.rhoRand.
rhoRand
(nbPlayers, nbArms, playerAlgo, maxRank=None, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
rhoRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the rhoRand child (default to nbPlayers, but for instance if there is 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRand(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoRand'¶
-
PoliciesMultiPlayers.rhoRandALOHA module¶
rhoRandALOHA: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
- The only difference with rhoRand is that when colliding, users have a small chance of keeping the same rank, following a Bernoulli experiment: with probability \(p(t)\) the player keeps the same rank, and with probability \(1 - p(t)\) it changes its rank (uniformly in \(\{1,\dots,M\}\), so there is a chance it picks the same rank again? FIXME).
- There is also a variant, as in MEGA (ALOHA-like protocol), where the probability evolves over time: \(p(t+1) = \alpha p(t) + (1 - \alpha)\) (see the sketch after the note below).
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
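The collision rule described above can be sketched as follows (hypothetical names; whether \(p(t)\) is updated at every step or only on collisions is a detail of the real implementation, this sketch updates it on collisions):

import random

def aloha_collision_update(rank, p, max_rank, alpha_p0=0.9999):
    """Sketch of the rhoRandALOHA collision rule: keep the current rank with
    probability p, otherwise redraw it uniformly, then apply the ALOHA-like
    recurrence p <- alpha * p + (1 - alpha)."""
    if random.random() >= p:                  # with probability 1 - p: change rank
        rank = random.randint(1, max_rank)
    p = alpha_p0 * p + (1 - alpha_p0)         # p(t+1) = alpha * p(t) + (1 - alpha)
    return rank, p

random.seed(0)
rank, p = 3, 0.6
rank, p = aloha_collision_update(rank, p, max_rank=6, alpha_p0=0.5)
print(rank, p)   # possibly a new rank, and p moved towards 1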
-
PoliciesMultiPlayers.rhoRandALOHA.
new_rank
(rank, maxRank, forceChange=False)[source]¶ Return a new rank, from \(1, \dots, \mathrm{maxRank}\), different from rank, uniformly.
- Internally, it uses simple rejection sampling: keep drawing a new rank \(\sim U(\{1, \dots, \mathrm{maxRank}\})\) until it is different from rank (not the most efficient way to do it, but simpler).
Example:
>>> from random import seed; seed(0)  # reproducibility
>>> [ new_rank(1, 8, False) for _ in range(10) ]
[7, 7, 1, 5, 9, 8, 7, 5, 8, 6]
>>> [ new_rank(8, 8, False) for _ in range(10) ]
[4, 9, 3, 5, 3, 2, 5, 9, 3, 5]
Example with forceChange = True, where the new rank is picked different from the current one.
>>> [ new_rank(1, 8, True) for _ in range(10) ]
[2, 2, 6, 8, 9, 2, 6, 7, 6, 4]
>>> [ new_rank(5, 8, True) for _ in range(10) ]
[9, 8, 8, 9, 1, 9, 1, 2, 7, 1]
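A sketch of such a rejection-sampling step (hypothetical name new_rank_sketch; the package's new_rank may differ in details, e.g., in how it handles forceChange=False):

import random

def new_rank_sketch(rank, max_rank, force_change=False):
    """Rejection sampling: draw uniform ranks until one differs from the
    current rank (only enforced when force_change is True)."""
    candidate = random.randint(1, max_rank)
    while force_change and candidate == rank:
        candidate = random.randint(1, max_rank)
    return candidate

random.seed(0)
print([new_rank_sketch(1, 8, True) for _ in range(5)])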
-
class
PoliciesMultiPlayers.rhoRandALOHA.
oneRhoRandALOHA
(maxRank, p0, alpha_p0, forceChange, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, p0, alpha_p0, forceChange, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
p0
= None¶ Initial probability, should not be modified.
-
p
= None¶ Current probability of staying with the current rank after a collision. If 0, then it is like the initial rhoRand policy.
-
alpha_p0
= None¶ Parameter alpha for the recurrence equation for probability p(t)
-
rank
= None¶ Current rank, starting at 1 by default
-
forceChange
= None¶ Should a different rank be used when moving? Or not.
-
handleCollision
(arm, reward=None)[source]¶ Get a new fully random rank, and give reward to the algorithm if not None.
-
getReward
(arm, reward)[source]¶ Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.
- Additionally, if the current rank was good enough to not bring any collision during the last p0 time steps, the player “sits” on that rank.
-
__module__
= 'PoliciesMultiPlayers.rhoRandALOHA'¶
-
PoliciesMultiPlayers.rhoRandALOHA.
P0
= 0.0¶ Default value for P0. Ideally it should be about 1/(K*M), depending on the numbers of arms and players.
-
PoliciesMultiPlayers.rhoRandALOHA.
ALPHA_P0
= 0.9999¶ Default value for ALPHA_P0, FIXME I have no idea what the best possible choice can be!
-
PoliciesMultiPlayers.rhoRandALOHA.
FORCE_CHANGE
= False¶ Default value for forceChange. Logically, it should be True.
-
class
PoliciesMultiPlayers.rhoRandALOHA.
rhoRandALOHA
(nbPlayers, nbArms, playerAlgo, p0=None, alpha_p0=0.9999, forceChange=False, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoRandALOHA: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, p0=None, alpha_p0=0.9999, forceChange=False, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- p0: given to the oneRhoRandALOHA objects (see above).
- alpha_p0: given to the oneRhoRandALOHA objects (see above).
- forceChange: given to the oneRhoRandALOHA objects (see above).
- maxRank: maximum rank allowed by the rhoRandALOHA child (default to nbPlayers, but for instance if there is 2 × rhoRandALOHA[UCB] + 2 × rhoRandALOHA[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> p0, alpha_p0, forceChange = 0.6, 0.5, True
>>> s = rhoRandALOHA(nbPlayers, nbArms, UCB, p0, alpha_p0, forceChange)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
p0
= None¶ Initial value for p, current probability of staying with the current rank after a collision
-
alpha_p0
= None¶ Parameter alpha for the recurrence equation for probability p(t)
-
forceChange
= None¶ Should a different rank be used when moving? Or not.
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoRandALOHA'¶
-
-
PoliciesMultiPlayers.rhoRandALOHA.
random
() → x in the interval [0, 1).¶
PoliciesMultiPlayers.rhoRandRand module¶
rhoRandRand: implementation of a variant of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the k-th best arm, for k again uniformly drawn from [1, …, rank_i],
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, …, M] where M is the number of players.
Note
This algorithm is intended to be stupid! It does not work at all!!
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
-
class
PoliciesMultiPlayers.rhoRandRand.
oneRhoRandRand
(maxRank, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
rank
= None¶ Current rank, starting at 1
-
__module__
= 'PoliciesMultiPlayers.rhoRandRand'¶
-
class
PoliciesMultiPlayers.rhoRandRand.
rhoRandRand
(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, maxRank=None, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
rhoRandRand: implementation of the multi-player policy from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, maxRank=None, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the rhoRand child (default to nbPlayers, but for instance if there is 2 × rhoRand[UCB] + 2 × rhoRand[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRandRand(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
nbArms
= None¶ Number of arms
-
children
= None¶ List of children, fake algorithms
-
__module__
= 'PoliciesMultiPlayers.rhoRandRand'¶
-
PoliciesMultiPlayers.rhoRandRotating module¶
rhoRandRotating: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
- The only difference with rhoRand is that at every time step, the rank is updated by 1, and cycles in [1, .., M] iteratively.
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
-
class
PoliciesMultiPlayers.rhoRandRotating.
oneRhoRandRotating
(maxRank, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
rank
= None¶ Current rank, starting at 1 by default
-
handleCollision
(arm, reward=None)[source]¶ Get a new fully random rank, and give reward to the algorithm if not None.
-
choice
()[source]¶ Choose with the new rank, then update the rank:
\[\mathrm{rank}_j(t+1) := \mathrm{rank}_j(t) + 1 \;\mathrm{mod}\; M.\]
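With ranks living in \(\{1,\dots,M\}\), one way to realize this cyclic update is (illustrative sketch, not the actual code of oneRhoRandRotating.choice):

def rotate_rank(rank, max_rank):
    # Cycle through the ranks 1, 2, ..., M, 1, 2, ... at every time step.
    return 1 + (rank % max_rank)

rank = 5
for _ in range(4):         # with M = 6: 5 -> 6 -> 1 -> 2 -> 3
    rank = rotate_rank(rank, 6)
print(rank)                # 3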
-
__module__
= 'PoliciesMultiPlayers.rhoRandRotating'¶
-
class
PoliciesMultiPlayers.rhoRandRotating.
rhoRandRotating
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoRandRotating: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- maxRank: maximum rank allowed by the rhoRandRotating child (default to nbPlayers, but for instance if there is 2 × rhoRandRotating[UCB] + 2 × rhoRandRotating[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = rhoRandRotating(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
. - Warning:
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoRandRotating'¶
-
PoliciesMultiPlayers.rhoRandSticky module¶
rhoRandSticky: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
- Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB etc),
- But instead of aiming at the best (the 1-st best) arm, player i aims at the rank_i-th best arm,
- At first, every player has a random rank_i from 1 to M, and when a collision occurs, rank_i is sampled from a uniform distribution on [1, .., M] where M is the number of players.
- The only difference with rhoRand is that once a player has selected a rank and not encountered a collision for STICKY_TIME time steps, it never changes its rank again. rhoRand corresponds to STICKY_TIME = +oo, MusicalChair is something like STICKY_TIME = 1, and this variant rhoRandSticky takes it as a parameter (see the sketch after the note below).
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
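The sitting mechanism can be sketched as follows (hypothetical names; the real oneRhoRandSticky combines this with the index-policy calls forwarded to the mother class):

import random

class OneRhoRandStickySketch:
    """Sketch of the rhoRandSticky rule: behave like rhoRand until the current
    rank survives stickyTime steps without collision, then sit on it forever."""
    def __init__(self, max_rank, sticky_time):
        self.max_rank = max_rank
        self.sticky_time = sticky_time
        self.rank = random.randint(1, max_rank)
        self.sitted = False                   # has the player sat on a rank yet?
        self.steps_without_collision = 0

    def on_no_collision(self):
        self.steps_without_collision += 1
        if self.steps_without_collision >= self.sticky_time:
            self.sitted = True                # never change rank again

    def on_collision(self):
        if not self.sitted:
            self.rank = random.randint(1, self.max_rank)
            self.steps_without_collision = 0

random.seed(0)
p = OneRhoRandStickySketch(max_rank=6, sticky_time=5)
for _ in range(5):
    p.on_no_collision()
print(p.sitted)   # True: the rank is now fixed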
-
PoliciesMultiPlayers.rhoRandSticky.
STICKY_TIME
= 10¶ Default value for STICKY_TIME
-
class
PoliciesMultiPlayers.rhoRandSticky.
oneRhoRandSticky
(maxRank, stickyTime, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.oneRhoRand
Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.
- Except for the handleCollision method: a new random rank is sampled after observing a collision,
- And the player does not aim at the best arm, but at the rank-th best arm, based on her index policy.
-
__init__
(maxRank, stickyTime, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
stickyTime
= None¶ Number of time steps needed without collisions before sitting (never changing rank again)
-
rank
= None¶ Current rank, starting at 1 by default
-
sitted
= None¶ Whether the player has sat down yet. After stickyTime steps without collisions, the player sits and never changes rank again.
-
stepsWithoutCollisions
= None¶ Number of steps since we chose that rank without seeing any collision. As soon as this gets greater than stickyTime, the player sits.
-
handleCollision
(arm, reward=None)[source]¶ Get a new fully random rank, and give reward to the algorithm if not None.
-
getReward
(arm, reward)[source]¶ Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.
- Additionally, if the current rank was good enough to not bring any collision during the last stickyTime time steps, the player “sits” on that rank.
-
__module__
= 'PoliciesMultiPlayers.rhoRandSticky'¶
-
class
PoliciesMultiPlayers.rhoRandSticky.
rhoRandSticky
(nbPlayers, nbArms, playerAlgo, stickyTime=10, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.rhoRand.rhoRand
rhoRandSticky: implementation of a variant of the multi-player policy rhoRand from [Distributed Algorithms for Learning…, Anandkumar et al., 2010](http://ieeexplore.ieee.org/document/5462144/).
-
__init__
(nbPlayers, nbArms, playerAlgo, stickyTime=10, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ - nbPlayers: number of players to create (in self._players).
- playerAlgo: class to use for every player.
- nbArms: number of arms, given as first argument to playerAlgo.
- stickyTime: given to the oneRhoRandSticky objects (see above).
- maxRank: maximum rank allowed by the rhoRandSticky child (default to nbPlayers, but for instance if there is 2 × rhoRandSticky[UCB] + 2 × rhoRandSticky[klUCB], maxRank should be 4 not 2).
- *args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> stickyTime = 5
>>> s = rhoRandSticky(nbPlayers, nbArms, UCB, stickyTime=stickyTime)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
>>> [ child.choice() for child in s.children ]
[9, 4, 6, 12, 1, 6]
- To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!
-
maxRank
= None¶ Max rank, usually nbPlayers but can be different
-
stickyTime
= None¶ Number of time steps needed without collisions before sitting (never changing rank again)
-
nbPlayers
= None¶ Number of players
-
children
= None¶ List of children, fake algorithms
-
nbArms
= None¶ Number of arms
-
__module__
= 'PoliciesMultiPlayers.rhoRandSticky'¶
-
PoliciesMultiPlayers.with_proba module¶
Simply defines a function with_proba() that is used everywhere.
-
PoliciesMultiPlayers.with_proba.
with_proba
(epsilon)[source]¶ Bernoulli test, with probability \(\varepsilon\), return True, and with probability \(1 - \varepsilon\), return False.
Example:
>>> from random import seed; seed(0)  # reproducible
>>> with_proba(0.5)
False
>>> with_proba(0.9)
True
>>> with_proba(0.1)
False
>>> if with_proba(0.2):
...     print("This happens 20% of the time.")
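The function is presumably just a thresholded call to random(); a sketch (hypothetical name with_proba_sketch):

from random import random

def with_proba_sketch(epsilon):
    # True with probability epsilon, False with probability 1 - epsilon.
    return random() < epsilon

if with_proba_sketch(0.2):
    print("This happens 20% of the time.")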
-
PoliciesMultiPlayers.with_proba.
random
() → x in the interval [0, 1).¶
complete_tree_exploration_for_MP_bandits module¶
Experimental code to perform complete tree exploration for Multi-Player bandits.
Algorithms:
- Support Selfish 0-greedy, UCB, and klUCB in 3 different variants.
- Support also RhoRand, RandTopM and MCTopM, even though they are not memory-less, by using another state representation (inlining the memory of each player, eg the ranks for RhoRand).
Features:
- For the means of each arm, \(\mu_1, \dots, \mu_K\), this script can use exact formal computations with sympy, or fractions with Fraction, or float number.
- The graph can contain all nodes from root to leafs, or only leafs (with summed probabilities), and possibly only the absorbing nodes are shown.
- Support export of the tree to a GraphViz dot graph, and can save it to SVG/PNG and LaTeX (with Tikz) and PDF etc.
- By default, the root is highlighted in green and the absorbing nodes are in red.
Warning
I still have to fix these issues:
- TODO : right now, it is not so efficient, could it be improved? I don’t think I can do anything in a smarter way, in pure Python.
Requirements:
- ‘sympy’ module to use formal means \(\mu_1, \dots, \mu_K\) instead of numbers,
- ‘numpy’ module for computations on indexes (e.g.,
np.where
), - ‘graphviz’ module to generate the graph and save it,
- ‘dot2tex’ module to generate nice LaTeX (with Tikz) graph and save it to PDF.
Note
The ‘dot2tex’ module only supports Python 2. However, I maintain an unpublished port of ‘dot2tex’ for Python 3, see [here](https://github.com/Naereen/dot2tex); you can download it and install it manually (sudo python3 setup.py install) to also have ‘dot2tex’ on Python 3.
About:
- Date: 16/09/2017.
- Author: Lilian Besson, (C) 2017
- Licence: MIT Licence (http://lbesson.mit-license.org).
-
complete_tree_exploration_for_MP_bandits.
oo
= inf¶ Shortcut for float(‘+inf’).
-
complete_tree_exploration_for_MP_bandits.
PLOT_DIR
= 'plots/trees'¶ Directory for the plots
-
complete_tree_exploration_for_MP_bandits.
tupleit1
(anarray)[source]¶ Convert a non-hashable 1D numpy array to a hashable tuple.
-
complete_tree_exploration_for_MP_bandits.
tupleit2
(anarray)[source]¶ Convert a non-hashable 2D numpy array to a hashable tuple-of-tuples.
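Both helpers are presumably one-liners; a sketch (hypothetical names) showing why the conversion matters, since the resulting tuples can be used as dictionary keys while numpy arrays cannot:

import numpy as np

def tupleit1_sketch(anarray):
    # 1D numpy array -> hashable tuple
    return tuple(anarray)

def tupleit2_sketch(anarray):
    # 2D numpy array -> hashable tuple of tuples (one per row)
    return tuple(tuple(row) for row in anarray)

seen_states = {}
key = tupleit2_sketch(np.zeros((2, 3)))
seen_states[key] = "some merged state"   # a numpy array could not be a dict key
print(seen_states)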
-
complete_tree_exploration_for_MP_bandits.
prod
(iterator)[source]¶ Product of the values in this iterator.
-
complete_tree_exploration_for_MP_bandits.
WIDTH
= 200¶ Default value for the
width
parameter forwraptext()
andwraplatex()
.
-
complete_tree_exploration_for_MP_bandits.
wraptext
(text, width=200)[source]¶ Wrap the text, using
textwrap
module, andwidth
.
-
complete_tree_exploration_for_MP_bandits.
ONLYLEAFS
= True¶ By default, aim at the most concise graph representation by only showing the leafs.
-
complete_tree_exploration_for_MP_bandits.
ONLYABSORBING
= False¶ By default, don’t aim at the most concise graph representation by only showing the absorbing leafs.
-
complete_tree_exploration_for_MP_bandits.
CONCISE
= True¶ By default, only show \(\tilde{S}\) and \(N\) in the graph representations, not all the 4 vectors.
-
complete_tree_exploration_for_MP_bandits.
FULLHASH
= False¶ Use only Stilde, N for hashing the states.
-
complete_tree_exploration_for_MP_bandits.
FORMAT
= 'svg'¶ Format used to save the graphs.
-
complete_tree_exploration_for_MP_bandits.
FixedArm
(j, state)[source]¶ Fake player j that always targets at arm j.
-
complete_tree_exploration_for_MP_bandits.
UniformExploration
(j, state)[source]¶ Fake player j that always targets all arms.
-
complete_tree_exploration_for_MP_bandits.
ConstantRank
(j, state, decision, collision)[source]¶ Constant rank no matter what.
-
complete_tree_exploration_for_MP_bandits.
choices_from_indexes
(indexes)[source]¶ For deterministic index policies, if more than one index is maximum, return the list of positions attaining this maximum (ties), or only one position.
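A sketch of that tie-collecting step (hypothetical name; the actual choices_from_indexes may return the positions in another form):

import numpy as np

def choices_from_indexes_sketch(indexes):
    # All positions attaining the maximum index: the set of tied best arms.
    indexes = np.asarray(indexes)
    return [int(k) for k in np.nonzero(indexes == np.max(indexes))[0]]

print(choices_from_indexes_sketch([0.5, 0.9, 0.9, 0.1]))   # [1, 2]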
-
complete_tree_exploration_for_MP_bandits.
Selfish_0Greedy_U
(j, state)[source]¶ Selfish policy + 0-Greedy index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_0Greedy_Utilde
(j, state)[source]¶ Selfish policy + 0-Greedy index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_0Greedy_Ubar
(j, state)[source]¶ Selfish policy + 0-Greedy index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_UCB_U
(j, state)[source]¶ Selfish policy + UCB_0.5 index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_UCB
(j, state)[source]¶ Selfish policy + UCB_0.5 index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_UCB_Utilde
(j, state)¶ Selfish policy + UCB_0.5 index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_UCB_Ubar
(j, state)[source]¶ Selfish policy + UCB_0.5 index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_KLUCB_U
(j, state)[source]¶ Selfish policy + Bernoulli KL-UCB index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_KLUCB
(j, state)[source]¶ Selfish policy + Bernoulli KL-UCB index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_KLUCB_Utilde
(j, state)¶ Selfish policy + Bernoulli KL-UCB index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
Selfish_KLUCB_Ubar
(j, state)[source]¶ Selfish policy + Bernoulli KL-UCB index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
choices_from_indexes_with_rank
(indexes, rank=1)[source]¶ For deterministic index policies, if more than one index is maximum, return the list of positions attaining the rank-th largest index (with more than one if ties, or only one position).
-
complete_tree_exploration_for_MP_bandits.
RhoRand_UCB_U
(j, state)[source]¶ RhoRand policy + UCB_0.5 index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
RhoRand_UCB_Utilde
(j, state)[source]¶ RhoRand policy + UCB_0.5 index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
RhoRand_UCB_Ubar
(j, state)[source]¶ RhoRand policy + UCB_0.5 index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
RhoRand_KLUCB_U
(j, state)[source]¶ RhoRand policy + Bernoulli KL-UCB index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
RhoRand_KLUCB_Utilde
(j, state)[source]¶ RhoRand policy + Bernoulli KL-UCB index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
RhoRand_KLUCB_Ubar
(j, state)[source]¶ RhoRand policy + Bernoulli KL-UCB index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
RandomNewRank
(j, state, decision, collision)[source]¶ RhoRand chooses a new uniform rank in {1,..,M} in case of collision, or keep the same.
-
complete_tree_exploration_for_MP_bandits.
default_policy
(j, state)¶ RhoRand policy + UCB_0.5 index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
default_update_memory
(j, state, decision, collision)¶ RhoRand chooses a new uniform rank in {1,..,M} in case of collision, or keep the same.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_UCB_U
(j, state, collision=False)[source]¶ RandTopM policy + UCB_0.5 index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_UCB_Utilde
(j, state, collision=False)[source]¶ RandTopM policy + UCB_0.5 index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_UCB_Ubar
(j, state, collision=False)[source]¶ RandTopM policy + UCB_0.5 index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_KLUCB_U
(j, state, collision=False)[source]¶ RandTopM policy + Bernoulli KL-UCB index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_KLUCB_Utilde
(j, state, collision=False)[source]¶ RandTopM policy + Bernoulli KL-UCB index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_KLUCB_Ubar
(j, state, collision=False)[source]¶ RandTopM policy + Bernoulli KL-UCB index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
RandTopM_RandomNewChosenArm
(j, state, decision, collision)[source]¶ RandTopM chooses a new arm after a collision or if the chosen arm lies outside of its estimatedBestArms set, uniformly from the set of estimated M best arms, or keep the same.
-
complete_tree_exploration_for_MP_bandits.
write_to_tuple
(this_tuple, index, value)[source]¶ Tuple cannot be written, this hack fixes that.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_UCB_U
(j, state, collision=False)[source]¶ MCTopM policy + UCB_0.5 index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_UCB_Utilde
(j, state, collision=False)[source]¶ MCTopM policy + UCB_0.5 index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_UCB_Ubar
(j, state, collision=False)[source]¶ MCTopM policy + UCB_0.5 index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_KLUCB_U
(j, state, collision=False)[source]¶ MCTopM policy + Bernoulli KL-UCB index + U feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_KLUCB_Utilde
(j, state, collision=False)[source]¶ MCTopM policy + Bernoulli KL-UCB index + Utilde feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_KLUCB_Ubar
(j, state, collision=False)[source]¶ MCTopM policy + Bernoulli KL-UCB index + Ubar feedback.
-
complete_tree_exploration_for_MP_bandits.
MCTopM_RandomNewChosenArm
(j, state, decision, collision)[source]¶ MCTopM chooses a new arm if the chosen arm lies outside of its estimatedBestArms set, uniformly from the set of estimated M best arms, or keeps the same.
-
complete_tree_exploration_for_MP_bandits.
symbol_means
(K)[source]¶ Better to work directly with symbols and instantiate the results after.
-
complete_tree_exploration_for_MP_bandits.
random_uniform_means
(K)[source]¶ If needed, generate an array of K (numerical) uniform means in [0, 1].
-
complete_tree_exploration_for_MP_bandits.
uniform_means
(nbArms=3, delta=0.1, lower=0.0, amplitude=1.0)[source]¶ Return a list of means of arms, well spaced:
- in [lower, lower + amplitude],
- sorted in increasing order,
- starting from lower + amplitude * delta, up to lower + amplitude * (1 - delta),
- and there is nbArms arms.
>>> np.array(uniform_means(2, 0.1))
array([ 0.1, 0.9])
>>> np.array(uniform_means(3, 0.1))
array([ 0.1, 0.5, 0.9])
>>> np.array(uniform_means(9, 1 / (1. + 9)))
array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
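The doctest above is consistent with a simple linspace over \([\mathrm{lower} + \mathrm{amplitude}\,\delta,\ \mathrm{lower} + \mathrm{amplitude}(1 - \delta)]\); a sketch (hypothetical name uniform_means_sketch):

import numpy as np

def uniform_means_sketch(nbArms=3, delta=0.1, lower=0.0, amplitude=1.0):
    # nbArms means, evenly spaced in [lower + amplitude*delta, lower + amplitude*(1 - delta)].
    return list(np.linspace(lower + amplitude * delta,
                            lower + amplitude * (1 - delta), nbArms))

print(np.round(uniform_means_sketch(3, 0.1), 3))   # [0.1 0.5 0.9]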
-
complete_tree_exploration_for_MP_bandits.
proba2float
(proba, values=None, K=None, names=None)[source]¶ Replace mu_k by a numerical value and evaluate the formula.
-
complete_tree_exploration_for_MP_bandits.
simplify
(proba)[source]¶ Try to simplify the expression of the probability.
-
complete_tree_exploration_for_MP_bandits.
proba2str
(proba, latex=False, html_in_var_names=False)[source]¶ Pretty print a proba, either a number, a Fraction, or a sympy expression.
-
complete_tree_exploration_for_MP_bandits.
tex2pdf
(filename)[source]¶ Naive call to command line pdflatex, twice.
-
class
complete_tree_exploration_for_MP_bandits.
State
(S, Stilde, N, Ntilde, mus, players, depth=0)[source]¶ Bases:
object
Not space-efficient representation of a state in the system we model.
- S, Stilde, N, Ntilde: are arrays of size (M, K),
- depth, t, M, K: integers, to avoid recomputing them,
- mus: the problem parameters (only for Bernoulli arms),
- players: is a list of algorithms,
- probas: list of transition probabilities,
- children: list of all possible next states (transitions).
-
__init__
(S, Stilde, N, Ntilde, mus, players, depth=0)[source]¶ Create a new state. Arrays S, Stilde, N, Ntilde are copied to avoid modify previous values!
-
S
= None¶ sensing feedback
-
Stilde
= None¶ number of sensing trials
-
N
= None¶ number of successful transmissions
-
Ntilde
= None¶ number of trials without collisions
-
depth
= None¶ current depth of the exploration tree
-
t
= None¶ current time step. Simply = sum(N[0]) = sum(N[i]) for all player i, but easier to compute it once and store it
-
M
= None¶ number of players
-
K
= None¶ number of arms (channels)
-
children
= None¶ list of next state, representing all the possible transitions
-
probas
= None¶ probabilities of transitions
-
to_dot
(title='', name='', comment='', latex=False, html_in_var_names=False, ext='svg', onlyleafs=True, onlyabsorbing=False, concise=True)[source]¶ Convert the state to a .dot graph, using GraphViz. See http://graphviz.readthedocs.io/ for more details.
- onlyleafs: only print the root and the leafs, to see a concise representation of the tree.
- onlyabsorbing: only print the absorbing leafs, to see a really concise representation of the tree.
- concise: whether to use the short representation of states (using \(\tilde{S}\) and \(N\)) or the long one (using the 4 variables).
- html_in_var_names: experimental use of
<SUB>..</SUB>
and<SUP>..</SUP>
in the label for the tree. - latex: experimental use of
_{..}
and^{..}
in the label for the tree, to use with dot2tex.
-
saveto
(filename, view=True, title='', name='', comment='', latex=False, html_in_var_names=False, ext='svg', onlyleafs=True, onlyabsorbing=False, concise=True)[source]¶
-
copy
()[source]¶ Get a new copy of that state with same S, Stilde, N, Ntilde but no probas and no children (and depth=0).
-
is_absorbing
()[source]¶ Try to detect if this state is absorbing, i.e., only one transition is possible, and again infinitely for its only child.
Warning
Still very experimental!
-
has_absorbing_child_whole_subtree
()[source]¶ Try to detect if this state has an absorbing child in the whole subtree.
-
explore_from_node_to_depth
(depth=1)[source]¶ Compute recursively the one_depth children of the root and its children.
-
compute_one_depth
()[source]¶ Use all_deltas to store all the possible transitions and their probabilities. Increase depth by 1 at the end.
-
all_absorbing_states
(depth=1)[source]¶ Generator that yields all the absorbing nodes of the tree, one by one.
- It might not find any,
- It does so without merging common nodes, in order to find the first absorbing node as quickly as possible.
-
absorbing_states_one_depth
()[source]¶ Use all_deltas to yield all the absorbing one-depth child and their probabilities.
-
find_N_absorbing_states
(N=1, maxdepth=8)[source]¶ Find at least N absorbing states, by considering a large depth.
-
all_deltas
()[source]¶ Generator that yields functions transforming state to another state.
- It is memory efficient as it is a generator.
- Do not convert that to a list or it might use all your system memory: each returned value is a function with code and variables inside!
-
get_all_leafs
()[source]¶ Recurse and get all the leafs. Many different states can be present in the list of leafs, with possibly different probabilities (each corresponds to a trajectory).
-
get_unique_leafs
()[source]¶ Compute all the leafs (deepest children) and merge the common ones to compute their full probabilities.
-
proba_reaching_absorbing_state
()[source]¶ Compute the probability of reaching a leaf that is an absorbing state.
-
__dict__
= mappingproxy({'__module__': 'complete_tree_exploration_for_MP_bandits', '__doc__': 'Not space-efficient representation of a state in the system we model.\n\n - S, Stilde, N, Ntilde: are arrays of size (M, K),\n - depth, t, M, K: integers, to avoid recomputing them,\n - mus: the problem parameters (only for Bernoulli arms),\n - players: is a list of algorithms,\n - probas: list of transition probabilities,\n - children: list of all possible next states (transitions).\n ', '__init__': <function State.__init__>, '__str__': <function State.__str__>, 'to_node': <function State.to_node>, 'to_dot': <function State.to_dot>, 'saveto': <function State.saveto>, 'copy': <function State.copy>, '__hash__': <function State.__hash__>, 'is_absorbing': <function State.is_absorbing>, 'has_absorbing_child_whole_subtree': <function State.has_absorbing_child_whole_subtree>, 'explore_from_node_to_depth': <function State.explore_from_node_to_depth>, 'compute_one_depth': <function State.compute_one_depth>, 'all_absorbing_states': <function State.all_absorbing_states>, 'absorbing_states_one_depth': <function State.absorbing_states_one_depth>, 'find_N_absorbing_states': <function State.find_N_absorbing_states>, 'all_deltas': <function State.all_deltas>, 'pretty_print_result_recursively': <function State.pretty_print_result_recursively>, 'get_all_leafs': <function State.get_all_leafs>, 'get_unique_leafs': <function State.get_unique_leafs>, 'proba_reaching_absorbing_state': <function State.proba_reaching_absorbing_state>, '__dict__': <attribute '__dict__' of 'State' objects>, '__weakref__': <attribute '__weakref__' of 'State' objects>})¶
-
__module__
= 'complete_tree_exploration_for_MP_bandits'¶
-
__weakref__
¶ list of weak references to the object (if defined)
-
class
complete_tree_exploration_for_MP_bandits.
StateWithMemory
(S, Stilde, N, Ntilde, mus, players, update_memories, memories=None, depth=0)[source]¶ Bases:
complete_tree_exploration_for_MP_bandits.State
State with a memory for each player, to represent and play with RhoRand etc.
-
__init__
(S, Stilde, N, Ntilde, mus, players, update_memories, memories=None, depth=0)[source]¶ Create a new state. Arrays S, Stilde, N, Ntilde are copied to avoid modifying previous values!
-
memories
= None¶ Personal memory for all players, can be a rank in {1,..,M} for rhoRand, or anything else.
-
copy
()[source]¶ Get a new copy of that state with same S, Stilde, N, Ntilde but no probas and no children (and depth=0).
-
__hash__
(full=False)[source]¶ Hash the matrices Stilde and N of the state, and the memories of the players (i.e., ranks for RhoRand).
-
is_absorbing
()[source]¶ Try to detect if this state is absorbing, i.e., only one transition is possible, and again infinitely for its only child.
Warning
Still very experimental!
-
all_deltas
()[source]¶ Generator that yields functions transforming state to another state.
- It is memory efficient as it is a generator.
- Do not convert that to a list or it might use all your system memory: each returned value is a function with code and variables inside!
-
__module__
= 'complete_tree_exploration_for_MP_bandits'¶
-
configuration module¶
Configuration for the simulations, for the single-player case.
-
configuration.
CPU_COUNT
= 2¶ Number of CPU on the local machine
-
configuration.
HORIZON
= 10000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration.
REPETITIONS
= 4¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration.
RANDOM_SHUFFLE
= False¶ The arms won’t be shuffled (
shuffle(arms)
).
-
configuration.
RANDOM_INVERT
= False¶ The arms won’t be inverted (
arms = arms[::-1]
).
-
configuration.
NB_BREAK_POINTS
= 0¶ Number of true breakpoints. They are uniformly spaced in time steps (and the first one at t=0 does not count).
-
configuration.
EPSILON
= 0.1¶ Parameters for the epsilon-greedy and epsilon-… policies.
-
configuration.
TEMPERATURE
= 0.05¶ Temperature for the Softmax policies.
-
configuration.
LEARNING_RATE
= 0.01¶ Learning rate for my aggregated bandit (it can be autotuned)
-
configuration.
TEST_WrapRange
= False¶ To know if my WrapRange policy is tested.
-
configuration.
CACHE_REWARDS
= True¶ Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations for each algorithm.
-
configuration.
UPDATE_ALL_CHILDREN
= False¶ Should the Aggregator policy update the trusts in each child or just the one trusted for last decision?
-
configuration.
UNBIASED
= False¶ Should the Aggregator policy use the biased estimator of the rewards (just r_t), or the unbiased estimator (r_t / p_t)?
-
configuration.
UPDATE_LIKE_EXP4
= False¶ Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?
-
configuration.
UNBOUNDED_VARIANCE
= 1¶ Variance of unbounded Gaussian arms
-
configuration.
NB_ARMS
= 9¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration.
LOWER
= 0.0¶ Default value for the lower value of means
-
configuration.
AMPLITUDE
= 1.0¶ Default value for the amplitude value of means
-
configuration.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration.
ARM_TYPE
¶ alias of
Arms.Bernoulli.Bernoulli
-
configuration.
ENVIRONMENT_BAYESIAN
= False¶ True to use bayesian problem
-
configuration.
MEANS
= [0.05, 0.16249999999999998, 0.27499999999999997, 0.38749999999999996, 0.49999999999999994, 0.6125, 0.725, 0.8374999999999999, 0.95]¶ Means of arms for non-hard-coded problems (non Bayesian)
-
configuration.
USE_FULL_RESTART
= True¶ True to use full-restart Doubling Trick
-
configuration.
configuration
= {'append_labels': {}, 'cache_rewards': True, 'change_labels': {0: 'Pure exploration', 1: 'Pure exploitation', 2: '$\\varepsilon$-greedy', 3: 'Explore-then-Exploit', 5: 'Bernoulli kl-UCB', 6: 'Thompson sampling'}, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}], 'environment_bayesian': False, 'horizon': 10000, 'n_jobs': -1, 'nb_break_points': 0, 'plot_lowerbound': True, 'policies': [{'archtype': <class 'Policies.Uniform.Uniform'>, 'params': {}, 'change_label': 'Pure exploration'}, {'archtype': <class 'Policies.EmpiricalMeans.EmpiricalMeans'>, 'params': {}, 'change_label': 'Pure exploitation'}, {'archtype': <class 'Policies.EpsilonGreedy.EpsilonDecreasing'>, 'params': {'epsilon': 479.99999999999983}, 'change_label': '$\\varepsilon$-greedy'}, {'archtype': <class 'Policies.ExploreThenCommit.ETC_KnownGap'>, 'params': {'horizon': 10000, 'gap': 0.11250000000000004}, 'change_label': 'Explore-then-Exploit'}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>}, 'change_label': 'Bernoulli kl-UCB'}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>}, 'change_label': 'Thompson sampling'}], 'random_invert': False, 'random_shuffle': False, 'repetitions': 4, 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration.
nbArms
= 9¶ Number of arms in the first environment
configuration_comparing_aggregation_algorithms module¶
Configuration for the simulations, for the single-player case, for comparing Aggregation algorithms.
-
configuration_comparing_aggregation_algorithms.
HORIZON
= 10000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_comparing_aggregation_algorithms.
REPETITIONS
= 4¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_comparing_aggregation_algorithms.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_comparing_aggregation_algorithms.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_comparing_aggregation_algorithms.
NB_ARMS
= 9¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration_comparing_aggregation_algorithms.
RANDOM_SHUFFLE
= False¶ The arms are shuffled (
shuffle(arms)
).
-
configuration_comparing_aggregation_algorithms.
RANDOM_INVERT
= False¶ The arms are inverted (
arms = arms[::-1]
).
-
configuration_comparing_aggregation_algorithms.
NB_RANDOM_EVENTS
= 5¶ Number of random events. They are uniformly spaced in time steps.
-
configuration_comparing_aggregation_algorithms.
CACHE_REWARDS
= False¶ Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations for each algorithm.
-
configuration_comparing_aggregation_algorithms.
UPDATE_ALL_CHILDREN
= False¶ Should the Aggregator policy update the trusts in each child or just the one trusted for last decision?
-
configuration_comparing_aggregation_algorithms.
UNBIASED
= True¶ Should the Aggregator policy use the biased estimator of the rewards (just r_t), or the unbiased estimator (r_t / p_t)?
-
configuration_comparing_aggregation_algorithms.
UPDATE_LIKE_EXP4
= False¶ Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?
-
configuration_comparing_aggregation_algorithms.
TRUNC
= 1¶ Trunc parameter, ie amplitude, for Exponential arms
-
configuration_comparing_aggregation_algorithms.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration_comparing_aggregation_algorithms.
MINI
= 0¶ lower bound on rewards from Gaussian arms
-
configuration_comparing_aggregation_algorithms.
MAXI
= 1¶ upper bound on rewards from Gaussian arms, ie amplitude = 1
-
configuration_comparing_aggregation_algorithms.
SCALE
= 1¶ Scale of Gamma arms
-
configuration_comparing_aggregation_algorithms.
ARM_TYPE
¶ alias of
Arms.Bernoulli.Bernoulli
-
configuration_comparing_aggregation_algorithms.
configuration
= {'cache_rewards': False, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}], 'horizon': 10000, 'n_jobs': -1, 'nb_random_events': 5, 'policies': [{'archtype': <class 'Policies.Aggregator.Aggregator'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'update_all_children': False, 'decreaseRate': 'auto', 'update_like_exp4': False}}, {'archtype': <class 'Policies.Aggregator.Aggregator'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'update_all_children': False, 'decreaseRate': 'auto', 'update_like_exp4': True}}, {'archtype': <class 'Policies.LearnExp.LearnExp'>, 'params': {'children': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'unbiased': True, 'eta': 0.9}}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbExp>}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'lower': 0, 'amplitude': 1, 'klucb': <function klucbGauss>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'lower': 0, 'amplitude': 1}}], 'random_invert': False, 'random_shuffle': False, 'repetitions': 4, 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration_comparing_aggregation_algorithms.
LOWER
= 0¶ And get LOWER, AMPLITUDE values
-
configuration_comparing_aggregation_algorithms.
AMPLITUDE
= 1¶ And get LOWER, AMPLITUDE values
configuration_comparing_doubling_algorithms module¶
Configuration for the simulations, for the single-player case, for comparing doubling-trick doubling schemes.
-
configuration_comparing_doubling_algorithms.
CPU_COUNT
= 2¶ Number of CPU on the local machine
-
configuration_comparing_doubling_algorithms.
HORIZON
= 45678¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_comparing_doubling_algorithms.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_comparing_doubling_algorithms.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_comparing_doubling_algorithms.
REPETITIONS
= 1000¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_comparing_doubling_algorithms.
UNBOUNDED_VARIANCE
= 1¶ Variance of unbounded Gaussian arms
-
configuration_comparing_doubling_algorithms.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration_comparing_doubling_algorithms.
NB_ARMS
= 9¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration_comparing_doubling_algorithms.
lower
= 0.0¶ Default value for the lower value of means
-
configuration_comparing_doubling_algorithms.
amplitude
= 1.0¶ Default value for the amplitude value of means
-
configuration_comparing_doubling_algorithms.
ARM_TYPE
¶ alias of
Arms.Bernoulli.Bernoulli
-
configuration_comparing_doubling_algorithms.
configuration
= {'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]}, {'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': {'newMeans': <function randomMeans>, 'args': {'nbArms': 9, 'mingap': None, 'lower': 0.0, 'amplitude': 1.0, 'isSorted': True}}}], 'horizon': 45678, 'n_jobs': -1, 'policies': [{'archtype': <class 'Policies.UCB.UCB'>, 'params': {}}, {'archtype': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>, 'params': {'horizon': 45678}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__arithmetic>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__geometric>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_fast>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_slow>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic>, 'full_restart': True, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__arithmetic>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__geometric>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_fast>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_slow>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}, {'archtype': <class 'Policies.DoublingTrickWrapper.DoublingTrickWrapper'>, 'params': {'next_horizon': <function next_horizon__exponential_generic>, 'full_restart': False, 'policy': <class 'Policies.klUCBPlusPlus.klUCBPlusPlus'>}}], 'repetitions': 1000, 'verbosity': 6}¶ This dictionary configures the experiments
configuration_markovian module¶
Configuration for the simulations, for the single-player case for Markovian problems.
-
configuration_markovian.
CPU_COUNT
= 2¶ Number of CPU on the local machine
-
configuration_markovian.
HORIZON
= 1000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_markovian.
REPETITIONS
= 100¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_markovian.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_markovian.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_markovian.
VARIANCE
= 10¶ Variance of Gaussian arms
-
configuration_markovian.
TEST_Aggregator
= True¶ To know if my Aggregator policy is tried.
-
configuration_markovian.
configuration
= {'environment': [{'arm_type': 'Markovian', 'params': {'rested': False, 'transitions': [{(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5}, [[0.2, 0.8], [0.6, 0.4]]], 'steadyArm': <class 'Arms.Bernoulli.Bernoulli'>}}], 'horizon': 1000, 'n_jobs': -1, 'policies': [{'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {}}], 'repetitions': 100, 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration_markovian.
nbArms
= 3¶ Number of arms in the first environment
configuration_multiplayers module¶
Configuration for the simulations, for the multi-players case.
-
configuration_multiplayers.
HORIZON
= 10000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_multiplayers.
REPETITIONS
= 200¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_multiplayers.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_multiplayers.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_multiplayers.
NB_PLAYERS
= 3¶ NB_PLAYERS : number of players for the game. Should be >= 2 and <= number of arms.
-
configuration_multiplayers.
collisionModel
(t, arms, players, choices, rewards, pulls, collisions)¶ The best collision model: none of the colliding users get any reward
-
configuration_multiplayers.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration_multiplayers.
CACHE_REWARDS
= False¶ Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations for each algorithm.
-
configuration_multiplayers.
NB_ARMS
= 6¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration_multiplayers.
LOWER
= 0.0¶ Default value for the lower value of means
-
configuration_multiplayers.
AMPLITUDE
= 1.0¶ Default value for the amplitude value of means
-
configuration_multiplayers.
ARM_TYPE
¶ alias of
Arms.Bernoulli.Bernoulli
-
configuration_multiplayers.
ENVIRONMENT_BAYESIAN
= False¶ True to use bayesian problem
-
configuration_multiplayers.
MEANS
= [0.1, 0.26, 0.42000000000000004, 0.58, 0.74, 0.9]¶ Means of arms for non-hard-coded problems (non Bayesian)
-
configuration_multiplayers.
configuration
= {'averageOn': 0.001, 'collisionModel': <function onlyUniqUserGetsReward>, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.1, 0.26, 0.42000000000000004, 0.58, 0.74, 0.9]}], 'finalRanksOnAverage': True, 'horizon': 10000, 'n_jobs': -1, 'players': [<Policies.SIC_MMAB.SIC_MMAB object>, <Policies.SIC_MMAB.SIC_MMAB object>, <Policies.SIC_MMAB.SIC_MMAB object>], 'plot_lowerbounds': False, 'repetitions': 200, 'successive_players': [[CentralizedMultiplePlay(kl-UCB), CentralizedMultiplePlay(kl-UCB), CentralizedMultiplePlay(kl-UCB)], [Selfish(kl-UCB), Selfish(kl-UCB), Selfish(kl-UCB)], [rhoRand(kl-UCB), rhoRand(kl-UCB), rhoRand(kl-UCB)], [MCTopM(kl-UCB), MCTopM(kl-UCB), MCTopM(kl-UCB)]], 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration_multiplayers.
nbArms
= 6¶ Number of arms in the first environment
configuration_sparse module¶
Configuration for the simulations, for single-player sparse bandit.
-
configuration_sparse.
HORIZON
= 10000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_sparse.
REPETITIONS
= 100¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_sparse.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_sparse.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_sparse.
RANDOM_SHUFFLE
= False¶ The arms are shuffled (
shuffle(arms)
).
-
configuration_sparse.
RANDOM_INVERT
= False¶ The arms are inverted (
arms = arms[::-1]
).
-
configuration_sparse.
NB_RANDOM_EVENTS
= 5¶ Number of random events. They are uniformly spaced in time steps.
-
configuration_sparse.
UPDATE_ALL_CHILDREN
= False¶ Should the Aggregator policy update the trusts in each child or just the one trusted for last decision?
-
configuration_sparse.
LEARNING_RATE
= 0.01¶ Learning rate for my aggregated bandit (it can be autotuned)
-
configuration_sparse.
UNBIASED
= False¶ Should the Aggregator policy use the biased estimator of the rewards (just r_t), or the unbiased estimator (r_t / p_t)?
-
configuration_sparse.
UPDATE_LIKE_EXP4
= False¶ Should we update the trust probabilities like in Exp4, or like in my initial Aggregator proposal?
-
configuration_sparse.
TEST_Aggregator
= False¶ To know if my Aggregator policy is tried.
-
configuration_sparse.
CACHE_REWARDS
= False¶ Should we cache rewards? The random rewards will be the same for all the REPETITIONS simulations for each algorithm.
-
configuration_sparse.
TRUNC
= 1¶ Trunc parameter, ie amplitude, for Exponential arms
-
configuration_sparse.
MINI
= 0¶ lower bound on rewards from Gaussian arms
-
configuration_sparse.
MAXI
= 1¶ upper bound on rewards from Gaussian arms, ie amplitude = 1
-
configuration_sparse.
SCALE
= 1¶ Scale of Gamma arms
-
configuration_sparse.
NB_ARMS
= 15¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration_sparse.
SPARSITY
= 7¶ Sparsity for non-hard-coded problems (Bayesian problems)
-
configuration_sparse.
LOWERNONZERO
= 0.25¶ Default value for the lower value of non-zero means
-
configuration_sparse.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration_sparse.
ARM_TYPE
¶ alias of
Arms.Gaussian.Gaussian
-
configuration_sparse.
ENVIRONMENT_BAYESIAN
= False¶ True to use bayesian problem
-
configuration_sparse.
MEANS
= [0.00125, 0.03660714285714286, 0.07196428571428572, 0.10732142857142857, 0.14267857142857143, 0.1780357142857143, 0.21339285714285713, 0.24875, 0.25375, 0.3775, 0.50125, 0.625, 0.74875, 0.8725, 0.99625]¶ Means of arms for non-hard-coded problems (non Bayesian)
-
configuration_sparse.
ISSORTED
= True¶ Whether to sort the means of the problems or not.
-
configuration_sparse.
configuration
= {'environment': [{'arm_type': <class 'Arms.Gaussian.Gaussian'>, 'params': [(0.05, 0.05, 0.0, 1.0), (0.07142857142857144, 0.05, 0.0, 1.0), (0.09285714285714286, 0.05, 0.0, 1.0), (0.1142857142857143, 0.05, 0.0, 1.0), (0.13571428571428573, 0.05, 0.0, 1.0), (0.15714285714285717, 0.05, 0.0, 1.0), (0.1785714285714286, 0.05, 0.0, 1.0), (0.2, 0.05, 0.0, 1.0), (0.4, 0.05, 0.0, 1.0), (0.47500000000000003, 0.05, 0.0, 1.0), (0.55, 0.05, 0.0, 1.0), (0.625, 0.05, 0.0, 1.0), (0.7000000000000001, 0.05, 0.0, 1.0), (0.7750000000000001, 0.05, 0.0, 1.0), (0.8500000000000001, 0.05, 0.0, 1.0)], 'sparsity': 7}], 'horizon': 10000, 'n_jobs': -1, 'nb_random_events': 5, 'policies': [{'archtype': <class 'Policies.EmpiricalMeans.EmpiricalMeans'>, 'params': {'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.UCBalpha.UCBalpha'>, 'params': {'alpha': 1, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseUCB.SparseUCB'>, 'params': {'alpha': 1, 'sparsity': 7, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.klUCB.klUCB'>, 'params': {'klucb': <function klucbBern>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseklUCB.SparseklUCB'>, 'params': {'sparsity': 7, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.Thompson.Thompson'>, 'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.Thompson.Thompson'>, 'params': {'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.Thompson.Thompson'>, 'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'policy': <class 'Policies.BayesUCB.BayesUCB'>, 'posterior': <class 'Policies.Posterior.Beta.Beta'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.BayesUCB.BayesUCB'>, 'params': {'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.SparseWrapper.SparseWrapper'>, 'params': {'sparsity': 7, 'posterior': <class 'Policies.Posterior.Gauss.Gauss'>, 'policy': <class 'Policies.BayesUCB.BayesUCB'>, 'use_ucb_for_set_J': True, 'use_ucb_for_set_K': True, 'lower': 0, 'amplitude': 1}}, {'archtype': <class 'Policies.OSSB.OSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0}}, {'archtype': <class 'Policies.OSSB.GaussianOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0, 'variance': 0.05}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.0, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.001, 'gamma': 0.0, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.0, 'gamma': 0.01, 'sparsity': 7}}, {'archtype': <class 'Policies.OSSB.SparseOSSB'>, 'params': {'epsilon': 0.001, 'gamma': 0.01, 'sparsity': 7}}], 'random_invert': 
False, 'random_shuffle': False, 'repetitions': 100, 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration_sparse.
LOWER
= 0¶ And get LOWER, AMPLITUDE values
-
configuration_sparse.
AMPLITUDE
= 1¶ And get LOWER, AMPLITUDE values
configuration_sparse_multiplayers module¶
Configuration for the simulations, for the multi-players case with sparse activated players.
-
configuration_sparse_multiplayers.
HORIZON
= 10000¶ HORIZON : number of time steps of the experiments. Warning Should be >= 10000 to be interesting “asymptotically”.
-
configuration_sparse_multiplayers.
REPETITIONS
= 4¶ REPETITIONS : number of repetitions of the experiments. Warning: Should be >= 10 to be statistically trustworthy.
-
configuration_sparse_multiplayers.
DO_PARALLEL
= True¶ To profile the code, turn down parallel computing
-
configuration_sparse_multiplayers.
N_JOBS
= -1¶ Number of jobs to use for the parallel computations. -1 means all the CPU cores, 1 means no parallelization.
-
configuration_sparse_multiplayers.
NB_PLAYERS
= 2¶ NB_PLAYERS : number of players for the game. Should be >= 2 and <= number of arms.
-
configuration_sparse_multiplayers.
ACTIVATION
= 1.0¶ ACTIVATION : common probability of activation.
-
configuration_sparse_multiplayers.
ACTIVATIONS
= (1.0, 1.0)¶ ACTIVATIONS : probability of activation of each player.
-
configuration_sparse_multiplayers.
VARIANCE
= 0.05¶ Variance of Gaussian arms
-
configuration_sparse_multiplayers.
NB_ARMS
= 2¶ Number of arms for non-hard-coded problems (Bayesian problems)
-
configuration_sparse_multiplayers.
ARM_TYPE
¶ alias of
Arms.Bernoulli.Bernoulli
-
configuration_sparse_multiplayers.
MEANS
= [0.3333333333333333, 0.6666666666666667]¶ Means of the arms
-
configuration_sparse_multiplayers.
configuration
= {'activations': (1.0, 1.0), 'averageOn': 0.001, 'environment': [{'arm_type': <class 'Arms.Bernoulli.Bernoulli'>, 'params': [0.3333333333333333, 0.6666666666666667]}], 'finalRanksOnAverage': True, 'horizon': 10000, 'n_jobs': -1, 'players': [Selfish(UCB), Selfish(UCB)], 'repetitions': 4, 'successive_players': [[Selfish(U(1..2)), Selfish(U(1..2))], [Selfish(UCB), Selfish(UCB)], [Selfish(Thompson Sampling), Selfish(Thompson Sampling)], [Selfish(kl-UCB), Selfish(kl-UCB)], [Selfish(Exp3++), Selfish(Exp3++)]], 'verbosity': 6}¶ This dictionary configures the experiments
-
configuration_sparse_multiplayers.
nbArms
= 2¶ Number of arms in the first environment
env_client module¶
Client to play multi-armed bandits problem against. Many distribution of arms are supported, default to Bernoulli.
- Usage:
- env_client.py [--markovian | --dynamic] [--port=<PORT>] [--host=<HOST>] [--speed=<SPEED>] <json_configuration>
- env_client.py (-h|--help)
- env_client.py --version
- Options:
- -h --help Show this screen.
- --version Show version.
- --markovian Whether to use a Markovian MAB problem (default is simple MAB problems).
- --dynamic Whether to use a Dynamic MAB problem (default is simple MAB problems).
- --port=<PORT> Port to use for the TCP connection [default: 10000].
- --host=<HOST> Address to use for the TCP connection [default: 0.0.0.0].
- --speed=<SPEED> Speed of emission, in milliseconds [default: 1000].
-
env_client.
default_configuration
= {'arm_type': 'Bernoulli', 'params': {(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)}}¶ Example of configuration to pass from the command line.
'{"arm_type": "Bernoulli", "params": (0.1, 0.5, 0.9)}'
-
env_client.
read_configuration_env
(a_string)[source]¶ Return a valid configuration dictionary to initialize a MAB environment, from the input string.
-
env_client.
client
(env, host, port, speed)[source]¶ Launch a client that:
- uses sockets to listen to input and reply
- creates a MAB environment from a JSON configuration (exactly like main.py does when it reads configuration.py),
- then receives a choice arm from the network, passes it to the MAB environment, listens to its reward = draw(arm) feedback, and sends this reward back to the network.
-
env_client.
transform_str
(params)[source]¶ Like a safe exec() on a dictionary that can contain special values:
- strings are interpreted as variable names (e.g., policy names) from the current globals() scope,
- lists are transformed to tuples, to be constant and hashable,
- dictionaries are recursively transformed (a small sketch of this transformation is given below).
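A minimal illustrative sketch (not the actual code of transform_str; the name transform_str_sketch and the explicit scope argument are assumptions) of this kind of recursive transformation could look like:
def transform_str_sketch(params, scope):
    """Recursively interpret strings as names from scope, turn lists into tuples, recurse on dicts."""
    if isinstance(params, str):
        return scope.get(params, params)      # e.g., a policy name -> the policy class, if known
    if isinstance(params, (list, tuple)):
        return tuple(transform_str_sketch(v, scope) for v in params)   # constant and hashable
    if isinstance(params, dict):
        return {k: transform_str_sketch(v, scope) for k, v in params.items()}
    return params

# Example: transform_str_sketch({"archtype": "UCB", "params": {"alphas": [1, 2]}}, globals())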
main module¶
main_multiplayers module¶
main_multiplayers_more module¶
main_sparse_multiplayers module¶
policy_server module¶
Server to play multi-armed bandits problem against.
- Usage:
- policy_server.py [--port=<PORT>] [--host=<HOST>] [--means=<MEANS>] <json_configuration>
- policy_server.py (-h|--help)
- policy_server.py --version
- Options:
- -h --help Show this screen.
- --version Show version.
- --port=<PORT> Port to use for the TCP connection [default: 10000].
- --host=<HOST> Address to use for the TCP connection [default: 0.0.0.0].
- --means=<MEANS> Means of arms used by the environment, to print regret [default: None].
-
policy_server.
default_configuration
= {'archtype': 'UCBalpha', 'nbArms': 10, 'params': {'alpha': 1}}¶ Example of configuration to pass from the command line.
'{"nbArms": 3, "archtype": "UCBalpha", "params": { "alpha": 0.5 }}'
-
policy_server.
read_configuration_policy
(a_string)[source]¶ Return a valid configuration dictionary to initialize a policy, from the input string.
-
policy_server.
server
(policy, host, port, means=None)[source]¶ Launch a server that:
- uses sockets to listen to input and reply
- creates a learning algorithm from a JSON configuration (exactly like main.py does when it reads configuration.py),
- then receives feedback (arm, reward) from the network, passes it to the algorithm, listens to its arm = choice() suggestion, and sends this suggestion back to the network.
-
policy_server.
transform_str
(params)[source]¶ Like a safe
exec()
on a dictionary that can contain special values:
- strings are interpreted as variable names (e.g., policy names) from the current globals() scope,
- lists are transformed to tuples, to be constant and hashable,
- dictionaries are recursively transformed.
Warning
It is still as unsafe as exec(): only use it with trusted inputs!
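For instance, here is a hedged usage example (the port, host, speed and JSON strings below simply reuse the defaults and examples documented above; adapt them to your setup), running the learning algorithm in one terminal and the environment in another:
python policy_server.py --port=10000 --host=0.0.0.0 '{"nbArms": 3, "archtype": "UCBalpha", "params": {"alpha": 0.5}}'
python env_client.py --speed=1000 --port=10000 --host=0.0.0.0 '{"arm_type": "Bernoulli", "params": (0.1, 0.5, 0.9)}'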
How to run the code ?¶
This short page explains quickly how to install the requirements for this project, and then how to use the code to run simulations.
Required modules¶
Running some simulations¶
Then it should be very straightforward to run some experiments.
This will run the simulations, average them (over repetitions) and plot the results.
Single player¶
Single player¶
python main.py
# or
make main
Single player, aggregating algorithms¶
python main.py configuration_comparing_aggregation_algorithms
# or
make comparing_aggregation_algorithms
See these explanations: Aggregation.md
Single player, doubling-trick algorithms¶
python main.py configuration_comparing_doubling_algorithms
# or
make comparing_doubling_algorithms
See these explanations: DoublingTrick.md
Single player, with Sparse Stochastic Bandit¶
python main.py configuration_sparse
# or
make sparse
See these explanations: SparseBandits.md
Single player, with Markovian problem¶
python main.py configuration_markovian
# or
make markovian
Single player, with non-stationary problem¶
python main.py configuration_nonstationary
# or
make nonstationary
See these explanations: NonStationaryBandits.md
Multi-Player¶
Multi-Player, one algorithm¶
python main_multiplayers.py
# or
make multi
Multi-Player, comparing different algorithms¶
python main_multiplayers_more.py
# or
make moremulti
See these explanations: MultiPlayers.md
Using env
variables ?¶
For all simulations, I recently added support for environment variables, to ease the customization of the main parameters of every simulation.
For instance, if the configuration_multiplayers_more.py
file is correct,
then you can customize it to use N=4
repetitions, for horizon T=1000
and M=3
players, parallelized with N_JOBS=4
jobs (use the number of cores of your CPU for optimal performance):
N=4 T=1000 M=3 DEBUG=True SAVEALL=False N_JOBS=4 make moremulti
In a virtualenv
?¶
If you prefer not to install the requirements globally on your system-wide Python setup, you can (and should) use virtualenv
.
$ virtualenv .
Using base prefix '/usr'
New python executable in /your/path/to/SMPyBandits/bin/python3
Also creating executable in /your/path/to/SMPyBandits/bin/python
Installing setuptools, pip, wheel...done.
$ source bin/activate # in bash, use activate.csh or activate.fish if needed
$ type pip # just to check
pip is /your/path/to/SMPyBandits/bin/pip
$ pip install -r requirements.txt
Collecting numpy (from -r requirements.txt (line 5))
...
Installing collected packages: numpy, scipy, cycler, pytz, python-dateutil, matplotlib, joblib, pandas, seaborn, tqdm, sphinx-rtd-theme, commonmark, docutils, recommonmark
Successfully installed commonmark-0.5.4 cycler-0.10.0 docutils-0.13.1 joblib-0.11 matplotlib-2.0.0 numpy-1.12.1 pandas-0.19.2 python-dateutil-2.6.0 pytz-2016.10 recommonmark-0.4.0 scipy-0.19.0 seaborn-0.7.1 sphinx-rtd-theme-0.2.4 tqdm-4.11.2
And then be sure to use the virtualenv binary for Python, bin/python
, instead of the system-wide one, to launch the experiments (the Makefile should use it by default, if source bin/activate
was executed).
Or with a Makefile
?¶
You can also use the provided Makefile
file to do this simply:
make install # install the requirements
make multiplayers # run and log the main_multiplayers.py script
It can be used to check the quality of the code with pylint:
make lint lint3 # check the code with pylint
It is also used to clean the code, build the doc, send the doc, etc. (This should not be used by others)
Or within a Jupyter notebook ?¶
I am writing some Jupyter notebooks, in this folder (notebooks/
), so if you want to do the same for your small experiments, you can be inspired by the few notebooks already written.
:scroll: License ?
GitHub license¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson.
List of research publications using Lilian Besson’s SMPyBandits project¶
I (Lilian Besson) have started my PhD in October 2016, and this project is a part of my ongoing research since December 2016.
1st article, about policy aggregation algorithm (aka model selection)¶
I designed and added the Aggregator
policy, in order to test its validity and performance.
It is a “simple” voting algorithm to combine multiple bandit algorithms into one.
Basically, it behaves like a simple MAB bandit just based on empirical means (even simpler than UCB), where arms are the child algorithms A_1 .. A_N
, each running in “parallel”.
For more details, refer to this file: Aggregation.md and this research article.
PDF : BKM_IEEEWCNC_2018.pdf | HAL notice : BKM_IEEEWCNC_2018 | BibTeX : BKM_IEEEWCNC_2018.bib | Source code and documentation
2nd article, about Multi-players Multi-Armed Bandits¶
There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting.
The basic difference is about collisions : at each time t
, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station’s point of view and from each player’s point of view.
For more details, refer to this file: MultiPlayers.md and this research article.
PDF : BK__ALT_2018.pdf | HAL notice : BK__ALT_2018 | BibTeX : BK__ALT_2018.bib | Source code and documentation
3rd article, using Doubling Trick for Multi-Armed Bandits¶
I studied what Doubling Trick can and can’t do to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandit algorithms.
For more details, refer to this file: DoublingTrick.md and this research article.
PDF : BK__DoublingTricks_2018.pdf | HAL notice : BK__DoublingTricks_2018 | BibTeX : BK__DoublingTricks_2018.bib | Source code and documentation
4th article, about Piece-Wise Stationary Multi-Armed Bandits¶
With Emilie Kaufmann, we studied the Generalized Likelihood Ratio Test (GLRT) for sub-Bernoulli distributions, and proposed the B-GLRT algorithm for change-point detection for piece-wise stationary one-armed bandit problems. We combined the B-GLRT with the kl-UCB multi-armed bandit algorithm and proposed the GLR-klUCB algorithm for piece-wise stationary multi-armed bandit problems. We prove finite-time guarantees for the B-GLRT and the GLR-klUCB algorithm, and we illustrate its performance with extensive numerical experiments.
For more details, refer to this file: NonStationaryBandits.md and this research article.
PDF : BK__COLT_2019.pdf | HAL notice : BK__COLT_2019 | BibTeX : BK__COLT_2019.bib | Source code and documentation
Other interesting things¶
Single-player Policies¶
- More than 65 algorithms, including all known variants of the UCB, kl-UCB, MOSS and Thompson Sampling algorithms, as well as other less-known algorithms (OCUCB, BESA, OSSB, etc.).
- SparseWrapper is a generalization of the SparseUCB from this article.
- Implementation of very recent Multi-Armed Bandits algorithms, e.g., kl-UCB++ (from this article), UCB-dagger (from this article), or MOSS-anytime (from this article).
- Experimental policies: BlackBoxOpt or UnsupervisedLearning (using Gaussian processes to learn the arms’ distributions).
Arms and problems¶
- My framework mainly targets stochastic bandits, with arms following
Bernoulli
, bounded (truncated) or unboundedGaussian
,Exponential
,Gamma
orPoisson
distributions. - The default configuration is to use a fixed problem for N repetitions (e.g. 1000 repetitions, use
MAB.MAB
), but there is also a perfect support for “Bayesian” problems where the mean vector µ1,…,µK change at every repetition (seeMAB.DynamicMAB
). - There is also a good support for Markovian problems, see
MAB.MarkovianMAB
, even though I didn’t implement any policies tailored for Markovian problems. - I’m actively working on adding a very clean support for non-stationary MAB problems, and
MAB.PieceWiseStationaryMAB
is already working well. Use it with policies designed for piece-wise stationary problems, like Discounted-Thompson, CD-UCB, M-UCB, SW-UCB#.
:scroll: License ?
GitHub license¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson.
Note: I have worked on other topics during my PhD; you can find my research articles on my website, or have a look at my Google Scholar profile or résumé on HAL.
Policy aggregation algorithms¶
- Remark: I wrote a small research article on that topic; it is a better introduction, as a small self-contained document, to this idea and these algorithms. Reference: [Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017], presented at the IEEE WCNC 2018 conference.
PDF : BKM_IEEEWCNC_2018.pdf | HAL notice : BKM_IEEEWCNC_2018 | BibTeX : BKM_IEEEWCNC_2018.bib | Source code and documentation
Idea¶
The basic idea of a policy aggregation algorithm is to run in parallel some online learning algorithms, denoted $A_1,\ldots,A_N$
($A_i$
), and make them all vote at each step, and use some probabilistic scheme to select a decision from their votes.
Hopefully, if all the algorithms $A_i$
are not too bad and at least one of them is efficient for the problem at hand, the aggregation algorithm will learn to mainly trust the efficient one(s) and discard the votes from the others.
An efficient aggregation algorithm should have performances similar to the best child algorithm $A_i$
, in any problem.
The Exp4 algorithm by [Auer et al, 2002] is the first aggregation algorithm for online bandit algorithms, and recently other algorithms include LearnExp
([Singla et al, 2017]) and CORRAL
([Agarwal et al, 2017]).
Mathematical explanations¶
Initially, every child algorithm $A_i$ has the same “trust” probability $p_i$, and at every step, the aggregated bandit first listens to the decisions from all its children $A_i$ ($a_{i,t}$ in $\{1,\ldots,K\}$), and then decides which arm to select by a probabilistic vote: the probability of selecting arm $k$ is the sum of the trust probabilities of the children who voted for arm $k$.
It could also be done the other way: the aggregated bandit could first decide which child to listen to, then trust that child alone. But we want to update the trust probabilities of all the children algorithms, not only one, whenever it was wise to trust them.
Mathematically, when the aggregated bandit chooses to pull arm $k$ at step $t$, if it yielded a positive reward $r_{k,t}$, then the probabilities of all children algorithms $A_i$ that decided (independently) to choose $k$ (i.e., $a_{i,t} = k$) are increased multiplicatively: $p_i \leftarrow p_i * \exp(+ \beta * r_{k,t})$, where $\beta$ is a positive learning rate, e.g., $\beta = 0.1$.
It is also possible to decrease multiplicatively the trust of all the children algorithms that did not decide to choose arm $k$ at step $t$: if $a_{i,t} \neq k$ then $p_i \leftarrow p_i * \exp(- \beta * r_{k,t})$. I did not observe any difference of behavior between these two options (implemented with the Boolean parameter updateAllChildren).
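To make the vote and the update concrete, here is a minimal standalone sketch (assuming numpy; these helper functions are illustrative and are not the actual Aggregator code):
import numpy as np

def aggregated_choice(rng, trusts, votes, nb_arms):
    """Probabilistic vote: P(select arm k) = sum of the trusts of the children voting for k."""
    proba = np.zeros(nb_arms)
    for p_i, a_i in zip(trusts, votes):
        proba[a_i] += p_i
    return rng.choice(nb_arms, p=proba / proba.sum())

def update_trusts(trusts, votes, chosen_arm, reward, beta=0.1, update_all_children=False):
    """Multiplicative update of the trusts, as described above."""
    new_trusts = np.array(trusts, dtype=float)
    for i, a_i in enumerate(votes):
        if a_i == chosen_arm:
            new_trusts[i] *= np.exp(+beta * reward)   # p_i <- p_i * exp(+ beta * r_{k,t})
        elif update_all_children:
            new_trusts[i] *= np.exp(-beta * reward)   # p_i <- p_i * exp(- beta * r_{k,t})
    return new_trusts / new_trusts.sum()              # renormalize to a probability vector

# Example: rng = np.random.default_rng(); trusts = np.ones(3) / 3; votes = [0, 2, 2]
# arm = aggregated_choice(rng, trusts, votes, nb_arms=4)
# trusts = update_trusts(trusts, votes, arm, reward=1.0)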
Ensemble voting for MAB algorithms¶
This algorithm can be seen as the Multi-Armed Bandits (i.e., sequential reinforcement learning) counterpart of an ensemble voting technique, as used for classifiers or regression algorithm in usual supervised machine learning (see, e.g., sklearn.ensemble.VotingClassifier
in scikit-learn).
Another approach could be to do some sort of grid search.
My algorithm: Aggregator¶
It is based on a modification of Exp4, and the details are given in its documentation, see Aggregator
.
All the mathematical details can be found in my paper, [Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017], presented at the IEEE WCNC 2018 conference.
Configuration:¶
A simple python file, configuration_comparing_aggregation_algorithms.py
, is used to import the arm classes, the policy classes and define the problems and the experiments.
For example, this will compare the classical MAB algorithms UCB, Thompson, BayesUCB and klUCB.
configuration = {
"horizon": 10000, # Finite horizon of the simulation
"repetitions": 100, # number of repetitions
"n_jobs": -1, # Maximum number of cores for parallelization: use ALL your CPU
"verbosity": 5, # Verbosity for the joblib calls
# Environment configuration, you can set up more than one.
"environment": [
{
"arm_type": Bernoulli, # Only Bernoulli is available as far as now
"params": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.05, 0.05, 0.05, 0.1]
}
],
# Policies that should be simulated, and their parameters.
"policies": [
{"archtype": UCB, "params": {} },
{"archtype": Thompson, "params": {} },
{"archtype": klUCB, "params": {} },
{"archtype": BayesUCB, "params": {} },
]
}
To add an aggregated bandit algorithm (Aggregator
class), you can use this piece of code, to aggregate all the algorithms defined before and dynamically add it to configuration
:
current_policies = configuration["policies"]
configuration["policies"] = current_policies +
[{ # Add one Aggregator policy, from all the policies defined above
"archtype": Aggregator,
"params": {
"learningRate": 0.05, # Tweak this if needed
"updateAllChildren": True,
"children": current_policies,
},
}]
The learning rate can be tuned automatically, by using the heuristic proposed by [Bubeck and Cesa-Bianchi, Theorem 4.2], without knowledge of the horizon: a decreasing learning rate $\eta_t = \sqrt{\frac{\log(N)}{t K}}$.
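As a tiny sketch (assuming numpy; the function name is only illustrative), this heuristic simply computes:
import numpy as np

def decreasing_learning_rate(t, nb_children, nb_arms):
    """eta_t = sqrt(log(N) / (t * K)) for N children algorithms and K arms, at step t >= 1."""
    return np.sqrt(np.log(nb_children) / (t * nb_arms))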
How to run the experiments ?¶
You should use the provided Makefile
file to do this simply:
# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install # install the requirements ONLY ONCE
make comparing_aggregation_algorithms # run and log the main.py script
Some illustrations¶
Here are some plots illustrating the performances of the different policies implemented in this project, against various problems (with Bernoulli
arms only):
On a “simple” Bernoulli problem (semi-log-y scale)¶
On a "simple" Bernoulli problem (semi-log-y scale).
Aggregator is the most efficient, and very similar to Exp4 here.
On a “harder” Bernoulli problem¶
On a "harder" Bernoulli problem, they all have similar performances, except LearnExp.
They all have similar performances, except LearnExp, which performs badly. We can check that the problem is indeed harder as the lower-bound (in black) is much larger.
On an “easy” Gaussian problem¶
On an "easy" Gaussian problem, only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling.
Only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling. CORRAL and LearnExp clearly appear sub-efficient.
On a harder problem, mixing Bernoulli, Gaussian, Exponential arms¶
This problem is much harder as it has 3 arms of each type with the same mean.
The semi-log-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard “mixed” problem.
These illustrations come from my article, [Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017], presented at the IEEE WCNC 2018 conference.
:scroll: License ?
GitHub license¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson.
Multi-players simulation environment¶
For more details, refer to this article. Reference: [Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017], presented at the International Conference on Algorithmic Learning Theory 2018.
PDF : BK__ALT_2018.pdf | HAL notice : BK__ALT_2018 | BibTeX : BK__ALT_2018.bib | Source code and documentation
There is another point of view: instead of comparing different single-player policies on the same problem, we can make them play against each other, in a multi-player setting.
The basic difference is about collisions : at each time $t$
, if two or more users choose to sense the same channel, there is a collision. Collisions can be handled in different ways, from the base station’s point of view and from each player’s point of view.
Collision models¶
For example, I implemented these different collision models, in CollisionModels.py
:
- noCollision is a limited model where all players can sample an arm even in case of collision. It corresponds to the single-player simulation: each player is a policy, compared without collision. This is for testing only, not so interesting.
- onlyUniqUserGetsReward is a simple collision model where only a player alone on an arm samples it and receives the reward. This is the default collision model in the literature, for instance cf. [Shamir et al., 2015] collision model 1 or cf. [Liu & Zhao, 2009]. Our article also focuses on this model (a minimal sketch of it is given after this list).
- rewardIsSharedUniformly is similar: the players alone on one arm sample it and receive the reward, and in case of more than one player on one arm, only one player (chosen uniformly at random by the base station) can sample it and receive the reward.
- closerUserGetsReward is similar but uses another approach to choose who can transmit. Instead of randomly choosing the lucky player, it uses a given (or random) vector indicating the distance of each player to the base station (it can also indicate the quality of the communication), and when two (or more) players collide, only the one closest to the base station can transmit. It is the most physically plausible model.
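As a minimal illustrative sketch (this is not the actual code from CollisionModels.py; the function name and the shapes of the arrays are assumptions), the default onlyUniqUserGetsReward model can be summarized as: a player keeps the reward drawn from its chosen arm only if it is alone on that arm, otherwise the draw is discarded and a collision is counted.
from collections import Counter
import numpy as np

def only_uniq_user_gets_reward_sketch(choices, draws, pulls, collisions):
    """choices[j] = arm chosen by player j, draws[j] = reward drawn from that arm,
    pulls is an (M, K) array of pull counts, collisions a length-K array of collision counts."""
    rewards = np.array(draws, dtype=float)
    occupancy = Counter(choices)                 # how many players chose each arm
    for j, arm in enumerate(choices):
        pulls[j, arm] += 1
        if occupancy[arm] > 1:                   # two or more players sensed this arm
            collisions[arm] += 1                 # count a collision on this arm
            rewards[j] = 0.0                     # none of the colliding players gets the reward
    return rewards

# Example: with 2 players both choosing arm 1, nobody gets the drawn reward:
# pulls = np.zeros((2, 3), dtype=int); collisions = np.zeros(3, dtype=int)
# only_uniq_user_gets_reward_sketch([1, 1], [1.0, 1.0], pulls, collisions)  # -> array([0., 0.])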
More details on the code¶
Have a look at:
main_multiplayers.py
and configuration_multiplayers.py
to run and configure the simulation,- the
EvaluatorMultiPlayers
class that performs the simulation, - the
ResultMultiPlayers
class to store the results, - and some naive policies are implemented in the
PoliciesMultiPlayers/
folder. As far as now, there is theSelfish
,CentralizedFixed
,CentralizedCycling
,OracleNotFair
,OracleFair
multi-players policy.
Policies designed to be used in the multi-players setting¶
- The first one I implemented is the “Musical Chair” policy, from [Shamir et al., 2015], in MusicalChair.
- Then I implemented the “MEGA” policy from [Avner & Mannor, 2014], in MEGA. But it has too many parameters, and the question of how to choose them is not easy.
- The rhoRand policy and its variants are from [Distributed Algorithms for Learning…, Anandkumar et al., 2010].
- Our algorithms introduced in [Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017] are in RandTopM: RandTopM and MCTopM.
- We also studied deeply the Selfish policy, without being able to prove that it is as efficient as rhoRand, RandTopM and MCTopM.
Configuration:¶
A simple python file, configuration_multiplayers.py, is used to import the arm classes, the policy classes, and to define the problems and the experiments.
See the explanations given for the single-player case.
configuration["successive_players"] = [
    CentralizedMultiplePlay(NB_PLAYERS, klUCB, nbArms).children,
    RandTopM(NB_PLAYERS, klUCB, nbArms).children,
    MCTopM(NB_PLAYERS, klUCB, nbArms).children,
    Selfish(NB_PLAYERS, klUCB, nbArms).children,
    rhoRand(NB_PLAYERS, klUCB, nbArms).children,
]
- The multi-players policies are added by giving a list of their children (e.g., Selfish(*args).children), which are instances of the proxy class ChildPointer. Each method call on a child is simply passed back to the mother class (the multi-players policy, e.g., Selfish), which can then handle the calls as it wants (centralized or not). A minimal sketch of this proxy pattern is given below.
How to run the experiments ?¶
You should use the provided Makefile to do this simply:
# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install # install the requirements ONLY ONCE
make multiplayers # run and log the main_multiplayers.py script
make moremultiplayers # run and log the main_more_multiplayers.py script
Some illustrations of multi-players simulations¶
plots/MP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.png
Figure 1 : Regret, $M=6$ players, $K=9$ arms, horizon $T=5000$, against $500$ problems $\mu$ uniformly sampled in $[0,1]^K$. rhoRand (top blue curve) is outperformed by the other algorithms (and the gain increases with $M$). MCTopM (bottom yellow) outperforms all the other algorithms in most cases.
plots/MP__K9_M6_T10000_N1000__4_algos__all_RegretCentralized_loglog____env1-1_8200873569864822246.png
plots/MP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.png
Figure 2 : Regret (in loglog scale), for $M=6$ players, $K=9$ arms, horizon $T=5000$, for $1000$ repetitions on problem $\mu=[0.1,\ldots,0.9]$. RandTopM (yellow curve) outperforms Selfish (green), both clearly outperform rhoRand. The regret of MCTopM is logarithmic, empirically with the same slope as the lower bound. The $x$ axis on the regret histograms has a different scale for each algorithm.
Figure 3 : Regret (in logy scale) for $M=3$ players, $K=9$ arms, horizon $T=123456$, for $100$ repetitions on problem $\mu=[0.1,\ldots,0.9]$. With the parameters from their respective articles, MEGA and MusicalChair fail completely, even when the horizon is known for MusicalChair.
These illustrations come from my article, [Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017], presented at the International Conference on Algorithmic Learning Theory (ALT) 2018.
Fairness vs. unfairness¶
For a multi-player policy, being fair means that on every simulation with $M$ players, each player accesses each of the $M$ best arms (about) the same amount of time.
It is important to highlight that this has to be verified on each run of the MP policy: having this property on average is NOT enough.
- For instance, the oracle policy OracleNotFair assigns each of the $M$ players to one of the $M$ best arms, orthogonally, but once they are assigned they always pull this arm. It is unfair because one player will be lucky and assigned to the best arm, while the others are unlucky. The centralized regret is optimal (null, on average), but it is not fair.
- The other oracle policy OracleFair assigns to each of the $M$ players an offset corresponding to one of the $M$ best arms, orthogonally, and once they are assigned they cycle among the $M$ best arms. It is fair because every player pulls the $M$ best arms an equal number of times. And the centralized regret is also optimal (null, on average).
- Usually, the Selfish policy is not fair: as each player is selfish and tries to minimize her personal regret, there is no reason for them to share the time on the $M$ best arms.
- Conversely, the MusicalChair policy is not fair either, and cannot be: once each player has reached the last step, i.e., each of them is settled on one arm, orthogonally, they do not share the $M$ best arms.
- The MEGA policy is designed to be fair: when players collide, they all have the same chance of leaving or staying on the arm, and they all sample the $M$ best arms equally.
- The rhoRand policy is not designed to be fair for every run, but it is fair on average.
- Similarly for our algorithms RandTopM and MCTopM, defined in RandTopM.
Doubling Trick for Multi-Armed Bandits¶
I studied what the Doubling Trick can and can’t do for multi-armed bandits, to obtain efficient anytime versions of non-anytime optimal Multi-Armed Bandits algorithms.
The Doubling Trick algorithm, denoted $DT(A, (T_i))$ for a diverging increasing sequence $(T_i)$, is the following algorithm:
Policies/DoublingTrick.py
Long story short, we proved the two following theorems.
For geometric sequences¶
It works for minimax regret bounds (in $R_T = \mathcal{O}(\sqrt{T})$), with a constant multiplicative loss $\leq 4$, but not for logarithmic regret bounds (in $R_T = \mathcal{O}(\log T)$).
https://hal.inria.fr/hal-01736357
For exponential sequences¶
It works for logarithmic regret bounds (in $R_T = \mathcal{O}(\log T)$), but not for minimax regret bounds (in $R_T = \mathcal{O}(\sqrt{T})$).
https://hal.inria.fr/hal-01736357
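To fix the ideas, here is a small sketch of the two kinds of sequences (one possible parametrization, chosen for illustration; the package provides its own next_horizon__* functions):

def next_horizon_geometric_sketch(i, T0=200, b=2):
    """Geometric sequence T_i = T0 * b^i (e.g., 200, 400, 800, ...):
    good for minimax bounds in O(sqrt(T)), not for logarithmic bounds."""
    return int(T0 * b ** i)

def next_horizon_exponential_sketch(i, T0=200, b=2):
    """Exponential sequence T_{i+1} = T_i^b, i.e., T_i = T0^(b^i) (e.g., 200, 40000, ...):
    good for logarithmic bounds in O(log T), not for minimax bounds."""
    return int(T0 ** (b ** i))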
Article¶
I wrote a research article on this topic; it is a better introduction, as a self-contained document, to explain this idea and the algorithms. Reference: [What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018].
PDF : BK__ALT_2018.pdf | HAL notice : BK__ALT_2018 | BibTeX : BK__ALT_2018.bib | Source code and documentation | Published
Configuration¶
A simple python file, configuration_comparing_doubling_algorithms.py, is used to import the arm classes, the policy classes, and to define the problems and the experiments.
For example, we can compare the standard anytime klUCB algorithm against the non-anytime klUCBPlusPlus algorithm, as well as 3 versions of DoublingTrickWrapper applied to klUCBPlusPlus.
configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # Number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {
            "arm_type": Bernoulli,
            "params": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": klUCB, "params": {} },
        {"archtype": klUCBPlusPlus, "params": { "horizon": 10000 } },
    ]
}
Then, to add a Doubling-Trick bandit algorithm (the DoublingTrickWrapper class), you can use this piece of code:
configuration["policies"] += [
    {
        "archtype": DoublingTrickWrapper,
        "params": {
            "next_horizon": next_horizon,
            "full_restart": full_restart,
            "policy": BayesUCB,
        }
    }
    for full_restart in [ True, False ]
    for next_horizon in [
        next_horizon__arithmetic,
        next_horizon__geometric,
        next_horizon__exponential_fast,
        next_horizon__exponential_slow,
        next_horizon__exponential_generic
    ]
]
How to run the experiments ?¶
You should use the provided Makefile to do this simply:
# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install # install the requirements ONLY ONCE
make comparing_doubling_algorithms # run and log the main.py script
Some illustrations¶
Here are some plots illustrating the performances of the different policies implemented in this project, against various problems (with Bernoulli and UnboundedGaussian arms only):
Doubling-Trick with restart, on a “simple” Bernoulli problem¶
Doubling-Trick with restart, on a "simple" Bernoulli problem
Regret for Doubling-Trick, for $K=9$ Bernoulli arms, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[0,1]^K$.
Geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are too slow, and short first sequences make the regret blow up in the beginning of the experiment.
At $t=40000$ we clearly see the effect of a new sequence for the best doubling trick ($T_i = 200 \times 2^i$).
As expected, kl-UCB++ outperforms kl-UCB, and if the doubling sequence grows fast enough then Doubling-Trick(kl-UCB++) can perform as well as kl-UCB++ (see for $t < 40000$).
Doubling-Trick with restart, on randomly taken Bernoulli problems¶
Doubling-Trick with restart, on randomly taken Bernoulli problems
Similarly, but for $\mu_1,\ldots,\mu_K$ evenly spaced in $[0,1]^K$ (${0.1,\ldots,0.9}$).
Both kl-UCB and kl-UCB++ are very efficient on “easy” problems like this one, and we can check visually that they match the lower bound from Lai & Robbins (1985).
As before, we check that slow doubling sequences are too slow to give reasonable performance.
Doubling-Trick with restart, on randomly taken Gaussian problems with variance $V=1$¶
Doubling-Trick with restart, on randomly taken Gaussian problems with variance V=1
Regret for $K=9$ Gaussian arms $\mathcal{N}(\mu, 1)$, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[-5,5]^K$, with variance $V=1$.
On “hard” problems like this one, both UCB and AFHG perform similarly and poorly w.r.t. the lower bound from Lai & Robbins (1985).
As before, we check that geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are too slow, but a fast enough doubling sequence does give reasonable performance for the anytime AFHG obtained by Doubling-Trick.
Doubling-Trick with restart, on an easy Gaussian problem with variance $V=1$¶
Doubling-Trick with restart, on an easy Gaussian problem with variance V=1
Regret for Doubling-Trick, for $K=9$ Gaussian arms $\mathcal{N}(\mu, 1)$, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ uniformly spaced in $[-5,5]^K$.
On “easy” problems like this one, both UCB and AFHG perform similarly and attain near-constant regret (identifying the best Gaussian arm is very easy here as they are sufficiently distinct).
Each doubling trick also appears to attain near-constant regret, but geometric doubling ($b=2$) and slow exponential doubling ($b=1.1$) are slower to converge and thus less efficient.
Doubling-Trick with no restart, on randomly taken Bernoulli problems¶
Doubling-Trick with no restart, on randomly taken Bernoulli problems
Regret for $K=9$ Bernoulli arms, horizon $T=45678$, $n=1000$ repetitions and $\mu_1,\ldots,\mu_K$ taken uniformly in $[0,1]^K$, for Doubling-Trick no-restart.
Geometric doubling (e.g., $b=2$) and slow exponential doubling (e.g., $b=1.1$) are too slow, and short first sequences make the regret blow up in the beginning of the experiment.
At $t=40000$ we clearly see the effect of a new sequence for the best doubling trick ($T_i = 200 \times 2^i$).
As expected, kl-UCB++ outperforms kl-UCB, and if the doubling sequence grows fast enough then Doubling-Trick no-restart for kl-UCB++ can perform as well as kl-UCB++.
Doubling-Trick with no restart, on a “simple” Bernoulli problem¶
Doubling-Trick with no restart, on a "simple" Bernoulli problem
$K=9$ Bernoulli arms with $\mu_1,\ldots,\mu_K$ evenly spaced in $[0,1]^K$.
On easy problems like this one, both kl-UCB and kl-UCB++ are very efficient, and here the geometric doubling allows the Doubling-Trick no-restart anytime version of kl-UCB++ to outperform both kl-UCB and kl-UCB++.
These illustrations come from my article, [What the Doubling Trick Can or Can’t Do for Multi-Armed Bandits, Lilian Besson and Emilie Kaufmann, 2018].
Structure and Sparsity of Stochastic Multi-Armed Bandits¶
This page explains shortly what I studied about sparse stochastic multi-armed bandits.
Assume a MAB problem with $K$ arms, each parametrized by its mean $\mu_k\in\mathbb{R}$.
If you know in advance that only a small subset (of size $s$) of the arms have a positive mean, it sounds reasonable to hope to be more efficient in playing the bandit game, compared to an approach which is unaware of the sparsity.
The SparseUCB algorithm is an extension of the well-known UCB, and it requires to know exactly the value of $s$.
It works by identifying as fast as possible (actually, in a sub-logarithmic number of samples) the arms with non-positive means.
Then it only plays the “good” arms with positive means, with a regular UCB policy.
I studied extensions of this idea, first of all the SparseklUCB policy as it was suggested in the original research paper, but mainly a generic “wrapper” black-box approach.
For more details, see SparseWrapper.
- Reference: [“Sparse Stochastic Bandits”, by J. Kwon, V. Perchet & C. Vernade, COLT 2017]. Note that this algorithm only works for sparse Gaussian (or sub-Gaussian) stochastic bandits, which includes Bernoulli arms.
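To illustrate the idea (and only the idea), here is a toy sketch of this identify-then-play mechanism; it is a simplification for exposition, not the actual SparseUCB algorithm, and arm_draw is a hypothetical callback returning one reward of an arm:

import numpy as np

def sparse_ucb_toy(arm_draw, nbArms, sparsity, horizon):
    """Toy sketch: discard arms whose UCB index falls below 0, then keep playing
    a plain UCB restricted to the remaining arms (believed to have positive means)."""
    pulls = np.zeros(nbArms, dtype=int)
    sums = np.zeros(nbArms)
    active = list(range(nbArms))  # arms still believed to have a positive mean

    def ucb_index(k, t):
        return sums[k] / pulls[k] + np.sqrt(2 * np.log(t) / pulls[k])

    for t in range(1, horizon + 1):
        not_pulled = [k for k in active if pulls[k] == 0]
        if not_pulled:                       # initialization: pull each active arm once
            k = not_pulled[0]
        else:                                # then play UCB on the active arms only
            k = max(active, key=lambda k: ucb_index(k, t))
        sums[k] += arm_draw(k)
        pulls[k] += 1
        if len(active) > sparsity:           # discard arms detected as non-positive
            active = [k for k in active if pulls[k] == 0 or ucb_index(k, t) > 0]
    return sums, pulls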
Article¶
TODO finish! I am writing a small research article on this topic; it is a better introduction, as a self-contained document, to explain this idea and the algorithms. Reference: [Structure and Sparsity of Stochastic Multi-Arm Bandits, Lilian Besson and Emilie Kaufmann, 2018].
Example of simulation configuration¶
A simple python file, configuration_sparse.py, is used to import the arm classes, the policy classes, and to define the problems and the experiments.
For example, we can compare the standard UCB and BayesUCB algorithms, unaware of the sparsity, against the sparsity-aware SparseUCB algorithm, as well as 4 versions of SparseWrapper applied to BayesUCB.
configuration = {
    "horizon": 10000,    # Finite horizon of the simulation
    "repetitions": 100,  # Number of repetitions
    "n_jobs": -1,        # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,      # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [
        {   # sparsity = number of means > 0, = 3 here
            "arm_type": Bernoulli,
            "params": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3]
        }
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        {"archtype": UCB, "params": {} },
        {"archtype": SparseUCB, "params": { "sparsity": 3 } },
        {"archtype": BayesUCB, "params": { } },
    ]
}
Then, to add a Sparse-Wrapper bandit algorithm (the SparseWrapper class), you can use this piece of code:
configuration["policies"] += [
    {
        "archtype": SparseWrapper,
        "params": {
            "policy": BayesUCB,
            "use_ucb_for_set_J": use_ucb_for_set_J,
            "use_ucb_for_set_K": use_ucb_for_set_K,
        }
    }
    for use_ucb_for_set_J in [ True, False ]
    for use_ucb_for_set_K in [ True, False ]
]
How to run the experiments ?¶
You should use the provided Makefile to do this simply:
make install # install the requirements ONLY ONCE
make sparse # run and log the main.py script
Some illustrations¶
Here are some plots illustrating the performances of the different policies implemented in this project, against various sparse problems (with Bernoulli or UnboundedGaussian arms only):
3 variants of Sparse-Wrapper for UCB, on a “simple” sparse Bernoulli problem¶
3 variants of Sparse-Wrapper for UCB, on a "simple" sparse Bernoulli problem
FIXME run some simulations and explain them!
These illustrations come from my (work in progress) article, [Structure and Sparsity of Stochastic Multi-Arm Bandits, Lilian Besson and Emilie Kaufmann, 2018].
Non-Stationary Stochastic Multi-Armed Bandits¶
A well-known and well-studied variant of the stochastic Multi-Armed Bandits is the so-called Non-Stationary Stochastic Multi-Armed Bandits. I give here a short introduction, with references below. If you are in a hurry, please read the first two pages of this recent article instead (arXiv:1802.08380).
- The first studied variant considers piece-wise stationary problems, also referred to as abruptly changing, where the distributions of the $K$ arms are stationary on some intervals $[T_i,\ldots,T_{i+1}]$ between some abrupt change points $(T_i)$.
  - It is always assumed that the locations of the change points are unknown to the user, otherwise the problem is not harder: just play your favorite algorithm, and restart it at each change point.
  - The change points can be fixed or randomly generated, but it is assumed that they are generated by a random source oblivious of the user’s actions, so we can always consider that they were already generated before the game starts.
  - For instance, Arms.geometricChangePoints() generates change points by assuming that at every time step $t=1,\ldots,T$, there is a (small) probability $p$ of having a change point (a small sketch of this is given after this list).
  - The number of change points is usually denoted $L$ or $\Upsilon_T$, and it should not be a constant w.r.t. $T$ (otherwise when $T\to\infty$ only the last segment counts and gives a stationary problem, so it is not harder). Some algorithms require knowing the value of $\Upsilon_T$, or at least an upper-bound, and some algorithms try to be efficient without knowing it (this is what we want!).
  - The goal is to have an efficient algorithm, but of course if $\Upsilon_T = \mathcal{O}(T)$ the problem is too hard to hope to be efficient, and any algorithm will suffer a linear regret (i.e., be as efficient as a naive random strategy).
- Another variant is the slowly varying problem, where the reward $r(t) = r_{A(t),t}$ is sampled at each time from a parametric distribution whose parameter(s) change at each time step (usually the distribution is parametrized by its mean). If we focus on 1D exponential families, or any family of distributions parametrized by their mean $\mu$, we denote this by $r(t) \sim D(\mu_{A(t)}(t))$, where $\mu_k(t)$ can vary with time. The slowly varying hypothesis is that every time step can be a break point, but that the speed of change $|\mu_k(t+1) - \mu_k(t)|$ is bounded.
is bounded. - Other variants include harder settings.
- For instance, we can consider that an adversarial is deciding the change points, by being adaptative to the user’s actions. I consider it harder, as always with adversarial problems, and not very useful to model real-world problems.
- Another harder setting is a “pseudo-Markovian rested” point-of-view: the mean (or parameters) of an arm’s distribution can change only when it is sampled, either from time to time or at each time step. It makes sense for some applications, for instance Julien’s work (in SequeL Inria team), but for others it doesn’t really make sense (e.g., cognitive radio applications).
TODO fix notations more precisely, include definitions! TODO what are the lower-bounds given in the more recent articles?
Applications¶
TL;DR: the world is non-stationary, so it makes sense to study this!
TODO write more justifications about applications, mainly for IoT networks (like when I studied multi-player bandits).
References¶
Here is a partial list of references on this topic. For more, a good starting point is to read the references given in the mentioned article, as always.
Main references¶
- It is not about non-stationary but about non-stochastic (i.e., adversarial) bandits, but it can be a good read for the curious reader: [“The Non-Stochastic Multi-Armed Bandit Problem”. P. Auer, N. Cesa-Bianchi, Y. Freund and R. Schapire. SIAM Journal on Computing, 32(1), 48-77, 2002].
- The Sliding-Window and Discounted UCB algorithms were given in [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”. Aurélien Garivier and Éric Moulines, ALT 2011].
  - They are implemented in Policies.SlidingWindowUCB.SWUCB and Policies.DiscountedUCB.
  - Note that I also implemented the non-anytime heuristic given by the authors, Policies.SlidingWindowUCB.SWUCBPlus, which uses the knowledge of the horizon $T$ to try to guess a correct value for the sliding window size $\tau$.
  - I implemented this sliding-window idea in a generic way, and Policies.SlidingWindowRestart is a generic wrapper that can work with (almost) any algorithm: it is an experimental policy, using a sliding window (of, for instance, $\tau=100$ draws of each arm), and it resets the underlying algorithm as soon as the small-window empirical average is too far away from the long-history empirical average (or just restarts for one arm, if possible).
- [“Thompson sampling for dynamic multi-armed bandits”. N. Gupta, O.-C. Granmo, A. Agrawala. 10th International Conference on Machine Learning and Applications Workshops. IEEE, 2011].
- [“Stochastic multi-armed-bandit problem with non-stationary rewards”, O. Besbes, Y. Gur, A. Zeevi. Advances in Neural Information Processing Systems (pp. 199-207), 2014].
- [“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017] introduced the CUSUM-UCB and PHT-UCB algorithms.
- [“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018] introduced the M-UCB algorithm.
Recent references¶
More recent articles include the following:
- [“On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems”. L. Wei and V. Srivastav. arXiv preprint arXiv:1802.08380, 2018] introduced the first algorithms that can (try to) tackle the two problems simultaneously, LM-DSEE and SW-UCB#.
  - They require to know the rate of change, but not the number of changes. They either assume that the number of break points $\Upsilon_T$ is $\mathcal{O}(T^\nu)$ for some $\nu\in(0,1)$ (for abruptly-changing problems), or that the rate of change is $\max_t |\mu_{t+1} - \mu_{t}| \leq \varepsilon_T = \mathcal{O}(T^{-\kappa})$ (for slowly-varying problems). In both cases, their model assumes to know $\nu$ or $\kappa$, or an upper-bound on it.
  - One advantage of their algorithms is their simplicity and their ability to tackle both cases!
- [“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner. EWRL 2018, Lille] introduced the AdSwitch algorithm, which does not require to know the number $\Upsilon_T$ of change points.
  - TODO check how to adapt it to $K\geq2$ arms and not just $K=2$ (it shouldn’t be hard).
  - TODO adapt it to an unknown horizon (using doubling tricks?!).
- [“Memory Bandits: a Bayesian approach for the Switching Bandit Problem”. Réda Alami, Odalric Maillard, Raphaël Féraud. 31st Conference on Neural Information Processing Systems (NIPS 2017), hal-01811697] introduced the MemoryBandit algorithm, which does not require to know the number $\Upsilon_T$ of change points.
  - They use a generic idea of expert aggregation with an efficient tracking of a growing number of experts. The basic idea is the following: a new expert is started at every time step, and at a breakpoint the expert started just after the breakpoint will essentially be the most efficient one (and we need efficient tracking to know it).
  - Their MemoryBandit algorithm is very efficient empirically, but not easy to implement, and it requires a large memory (although some discussion is given in their article’s appendix, where they evoke a heuristic that reduces the storage requirement).
- 🇫🇷 [“Algorithme de bandit et obsolescence : un modèle pour la recommandation” (Bandit algorithm and obsolescence: a model for recommendation). Jonhathan Louëdec, Laurent Rossi, Max Chevalier, Aurélien Garivier and Josiane Mothe. 18ème Conférence francophone sur l’Apprentissage Automatique, 2016 (Marseille, France)] (in French), introduces and justifies possible applications of slowly-varying bandits to recommender systems. They study and present a model with an exponential decrease of the means, and the FadingUCB algorithm, which is efficient if a bound on the speed of the exponential decrease is known.
Other references¶
Other interesting references:
- [“The Non-Stationary Stochastic Multi Armed Bandit Problem”. R. Allesiardo, Raphaël Féraud and Odalric-Ambrym Maillard. International Journal of Data Science and Analytics, 3(4), 267-283, 2017] introduced the Exp3R algorithm.
- [“Taming non-stationary bandits: A Bayesian approach”. V. Raj and S. Kalyani. arXiv preprint arXiv:1707.09727, 2017] introduced the DiscountedThompson algorithm.
Example of simulation configuration¶
A simple python file, configuration_nonstationary.py, is used to import the arm classes, the policy classes, and to define the problems and the experiments.
The main.py file is used to import the configuration and launch the simulations.
For example, we can compare the standard UCB and Thompson algorithms, unaware of the non-stationarity, against the non-stationarity-aware DiscountedUCB and SWUCB, and the efficient DiscountedThompson algorithm.
We also include our algorithm Bernoulli-GLR-UCB using kl-UCB, and compare it with CUSUM-UCB and M-UCB, the two other state-of-the-art actively adaptive algorithms.
# (This snippet assumes that numpy is imported as np, and that the arm and policy classes come from the SMPyBandits imports.)
horizon = 5000
change_points = [0, 1000, 2000, 3000, 4000]
nb_random_events = len(change_points) - 1  # t=0 is not a change-point
list_of_means = [
    [0.4, 0.5, 0.9],  # from 0 to 1000
    [0.5, 0.4, 0.7],  # from 1000 to 2000
    [0.6, 0.3, 0.5],  # from 2000 to 3000
    [0.7, 0.2, 0.3],  # from 3000 to 4000
    [0.8, 0.1, 0.1],  # from 4000 to 5000
]
configuration = {
    "horizon": horizon,   # Finite horizon of the simulation
    "repetitions": 1000,  # Number of repetitions
    "n_jobs": -1,         # Maximum number of cores for parallelization: use ALL your CPU
    "verbosity": 5,       # Verbosity for the joblib calls
    # Environment configuration, you can set up more than one.
    "environment": [  # Bernoulli arms with non-stationarity
        {   # A non-stationary problem: different steps of the same repetition use different mean vectors!
            "arm_type": Bernoulli,
            "params": {
                "listOfMeans": list_of_means,
                "changePoints": change_points,
            }
        },
    ],
    # Policies that should be simulated, and their parameters.
    "policies": [
        { "archtype": klUCB, "params": {} },
        { "archtype": Thompson, "params": {} },
        { "archtype": OracleSequentiallyRestartPolicy, "params": {
            "policy": klUCB,
            "changePoints": change_points,
            "list_of_means": list_of_means,
            "reset_for_all_change": True,
            "reset_for_suboptimal_change": False,
        }},
        { "archtype": SWklUCB, "params": { "tau":  # formula from [GarivierMoulines2011]
            2 * np.sqrt(horizon * np.log(horizon) / (1 + nb_random_events))
        } },
        { "archtype": DiscountedklUCB, "params": { "gamma": 0.95 } },
        { "archtype": DiscountedThompson, "params": { "gamma": 0.95 } },
        { "archtype": Monitored_IndexPolicy, "params": {
            "horizon": horizon, "policy": klUCB, "w": 150,
        } },
        { "archtype": CUSUM_IndexPolicy, "params": {
            "horizon": horizon, "policy": klUCB, "w": 150,
            "max_nb_random_events": nb_random_events,
            "lazy_detect_change_only_x_steps": 10,  # Delta n, to speed up the change detection
        } },
    ] + [
        { "archtype": BernoulliGLR_IndexPolicy_WithDeterministicExploration,
          "params": {
            "horizon": horizon, "policy": klUCB_forGLR,
            "max_nb_random_events": nb_random_events,
            "lazy_detect_change_only_x_steps": 10,  # Delta n, to speed up the change detection
            "lazy_try_value_s_only_x_steps": 10,    # Delta s
            "per_arm_restart": per_arm_restart,     # per-arm (local) or global restarts
        } }
        for per_arm_restart in [True, False]
    ],
}
How to run the experiments ?¶
You should use the provided Makefile to do this simply:
# if not already installed, otherwise update with 'git pull'
git clone https://github.com/SMPyBandits/SMPyBandits/
cd SMPyBandits
make install # install the requirements ONLY ONCE
Then modify the configuration_nonstationary.py file to specify the algorithms you want to compare (use the snippet above for inspiration), and run with:
make nonstationary # run and log the main.py script
There are a couple of different piece-wise stationary problems that we implemented for our article, and you can use environment variables to modify the experiment to run. For instance, to run problems 1 and 2, with horizon T=5000, N=1000 repetitions, using 4 cores, run:
PROBLEMS=1,2 T=5000 N=1000 N_JOBS=4 DEBUG=False SAVEALL=True make nonstationary
Some illustrations¶
Here are some plots illustrating the performances of the different policies implemented in this project, against various non-stationary problems (with Bernoulli arms only).
History of means for this simple problem¶
We consider a simple piece-wise stationary problem, with $K=3$ arms, a time horizon $T=5000$ and $N=1000$ repetitions. Each change concerns only one arm at a time, and there are $\Upsilon_T=4$ changes, at times $1000, 2000, 3000, 4000$ ($C_T=\Upsilon_T=4$).
plots/NonStationary_example_HistoryOfMeans.png
Figure 1 : history of means $\mu_i(t)$ for the $K=3$ arms. There is only one change of the optimal arm.
The next figures were obtained with the following command (at the date of writing, 31st of January 2019):
PROBLEMS=1 T=5000 N=1000 N_JOBS=4 DEBUG=False SAVEALL=True make nonstationary
Comparison of different algorithms¶
By using the configuration snippet shown above, we compare 9 algorithms. The plots below show how they perform. Our proposal is GLR-klUCB (Generalized Likelihood Ratio test + klUCB), with two options for Local or Global restarts, and it outperforms all the previous state-of-the-art approaches.
plots/NonStationary_example_Regret.png
Figure 2 : plot of the mean regret $R_t$ as a function of the current time step $t$, for the different algorithms.
plots/NonStationary_example_BoxPlotRegret.png
Figure 3 : box plot of the regret at $T=5000$, for the different algorithms.
plots/NonStationary_example_HistogramsRegret.png
Figure 4 : plot of the histograms of the regret at $T=5000$, for the different algorithms.
Comparison of time and memory consumptions¶
plots/NonStationary_example_RunningTimes.png
Figure 5 : comparison of the running times. Our approach, like the other actively adaptive approaches, is slower, but drastically more efficient in terms of regret!
plots/NonStationary_example_MemoryConsumption.png
Figure 6 : comparison of the memory consumption. Our approach, like the other actively adaptive approaches, is more costly, but drastically more efficient in terms of regret!
Article?¶
Not yet! We are working on this! TODO
Short documentation of the API¶
This short document aims at documenting the API used in my SMPyBandits environment, and at closing issue #3.
Code organization¶
Layout of the code:¶
- Arms are defined in this folder (Arms/), see for example Arms.Bernoulli.
- MAB algorithms (also called policies) are defined in this folder (Policies/), see for example Policies.Dummy for a fully random policy, Policies.EpsilonGreedy for the epsilon-greedy random policy, Policies.UCB for the “simple” UCB algorithm, or also Policies.BayesUCB and Policies.klUCB for two UCB-like algorithms, Policies.AdBandits for the AdBandits algorithm, and Policies.Aggregator for my aggregated bandits algorithms.
- Environments to encapsulate data are defined in this folder (Environment/): MAB problems use the class Environment.MAB, simulation results are stored in an Environment.Result, and the class to evaluate a multi-policy single-player multi-environment simulation is Environment.Evaluator.
- very_simple_configuration.py imports all the classes, and defines the simulation parameters as a dictionary (JSON-like).
- main.py runs the simulations, then displays the final ranking of the different policies and plots the results (saved to this folder (plots/)).
UML diagrams¶
For more details, see these UML diagrams.
Question: How to change the simulations?¶
To customize the plots¶
- Change the default settings defined in Environment/plotsettings.py.
To change the configuration of the simulations¶
- Change the config file, i.e., configuration.py for single-player simulations, or configuration_multiplayers.py for multi-players simulations.
- A good example of a very simple configuration file is given in very_simple_configuration.py.
To change how the results are exploited¶
- Change the main script, i.e., main.py for single-player simulations, or main_multiplayers.py for multi-players simulations. Some plots can be disabled or enabled by commenting a few lines, and some options are given as flags (constants in the beginning of the file).
- If needed, change, improve or add some methods to the simulation environment class, i.e., Environment.Evaluator for single-player simulations, and Environment.EvaluatorMultiPlayers for multi-players simulations. They use a class to store their simulation results, Environment.Result and Environment.ResultMultiPlayers.
Question: How to add something to this project?¶
In other words, what’s the API of this project?
For a new arm¶
- Make a new file, e.g., MyArm.py
- Save it in Arms/
- The file should contain a class of the same name, inheriting from Arms/Arm, e.g., like this: class MyArm(Arm): ... (no need for any super call)
- This class MyArm has to have at least an __init__(...) method to create the arm object (with or without arguments, named or not); a __str__ method to print it as a string; a draw(t) method to draw a reward from this arm (t is the time, which can be used or not); and it should have a mean() method that gives/computes the mean of the arm
- Finally, add it to the Arms/__init__.py file: from .MyArm import MyArm
- For examples, see Arms.Bernoulli, Arms.Gaussian, Arms.Exponential, Arms.Poisson.
- For example, use this template:
from .Arm import Arm

class MyArm(Arm):
    def __init__(self, *args, **kwargs):
        pass  # TODO Finish this method that initializes the arm MyArm

    def __str__(self):
        return "MyArm({})".format('...')  # TODO

    def draw(self, t=None):
        pass  # TODO Simulate a pull of this arm. t might be used, but not necessarily

    def mean(self):
        pass  # TODO Return the mean of this arm
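As an illustration of this template, here is a (hypothetical) toy arm, written to be self-contained and so not inheriting from Arm:

import random

class ConstantNoisyArm(object):
    """Toy arm: rewards are the mean plus a small uniform noise."""
    def __init__(self, mean, noise=0.05):
        self._mean = mean
        self.noise = noise

    def __str__(self):
        return "ConstantNoisyArm({}, {})".format(self._mean, self.noise)

    def draw(self, t=None):
        # t is not used here, as the rewards are i.i.d.
        return self._mean + self.noise * (2 * random.random() - 1)

    def mean(self):
        return self._mean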
For a new (single-user) policy¶
- Make a new file, e.g., MyPolicy.py
- Save it in Policies/
- The file should contain a class of the same name; it can inherit from Policies/IndexPolicy if it is a simple index policy, e.g., like this: class MyPolicy(IndexPolicy): ... (no need for any super call), or simply like class MyPolicy(object): ...
- This class MyPolicy has to have at least an __init__(nbArms, ...) method to create the policy object (with or without arguments, named or not), with at least the parameter nbArms (number of arms); a __str__ method to print it as a string; a choice() method to choose an arm (an index among 0, ..., nbArms - 1, e.g., at random, or based on a maximum index if it is an index policy); a getReward(arm, reward) method called when the arm arm gave the reward reward; and finally a startGame() method (possibly empty) which is called when a new simulation is run.
- Optionally, a policy class can have a handleCollision(arm) method to handle a collision after choosing the arm arm (e.g., update an internal index, change a fixed offset, etc.).
- Finally, add it to the Policies/__init__.py file: from .MyPolicy import MyPolicy
- For examples, see Policies.Uniform for a fully randomized policy, Policies.EpsilonGreedy for a simple exploratory policy, Policies.Softmax for another simple approach, and Policies.UCB for the classical Upper Confidence Bounds policy based on indexes (so inheriting from Policies/IndexPolicy). There are also Policies.Thompson and Policies.BayesUCB for Bayesian policies (using a posterior, e.g., Beta), and Policies.klUCB for a policy based on the Kullback-Leibler divergence.
  - For less classical examples, Policies.AdBandits is an approach combining Bayesian and frequentist points of view, and Policies.Aggregator is my aggregating policy.
- For example, use this template:
import random

class MyPolicy(object):
    def __init__(self, nbArms, *args, **kwargs):
        self.nbArms = nbArms
        # TODO Finish this method that initializes the policy MyPolicy

    def __str__(self):
        return "MyPolicy({})".format('...')  # TODO

    def startGame(self):
        pass  # Can be non-trivial, TODO if needed

    def getReward(self, arm, reward):
        # TODO After the arm 'arm' has been pulled, it gave the reward 'reward'
        pass  # Can be non-trivial, TODO if needed

    def choice(self):
        # TODO Do a smart choice of arm
        return random.randrange(self.nbArms)  # purely random choice among 0, ..., nbArms - 1

    def handleCollision(self, arm):
        pass  # Can be non-trivial, TODO if needed
Other choice...() methods can be added, if this policy MyPolicy has to be used for multiple plays, ranked play, etc.
For a new multi-users policy¶
- Make a new file, e.g., MyPoliciesMultiPlayers.py
- Save it in PoliciesMultiPlayers/
- The file should contain a class of the same name, e.g., like this: class MyPoliciesMultiPlayers(object):
- This class MyPoliciesMultiPlayers has to have at least an __init__ method to create the object; a __str__ method to print it as a string; and a children attribute that gives a list of players (single-player policies).
- Finally, add it to the PoliciesMultiPlayers/__init__.py file: from .MyPoliciesMultiPlayers import MyPoliciesMultiPlayers
- For examples, see PoliciesMultiPlayers.OracleNotFair and PoliciesMultiPlayers.OracleFair for full-knowledge centralized policies (fair or not), and PoliciesMultiPlayers.CentralizedFixed and PoliciesMultiPlayers.CentralizedCycling for non-full-knowledge centralized policies (fair or not). There is also the PoliciesMultiPlayers.Selfish decentralized policy, where all players run without any knowledge of the number of players, and with no communication (fully decentralized).
PoliciesMultiPlayers.Selfish is the simplest possible example I could give as a template.
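For completeness, here is a minimal sketch of such a class, in the spirit of Selfish (here the children are directly the single-player policies, whereas SMPyBandits wraps them in ChildPointer proxies; see the real PoliciesMultiPlayers.Selfish for the exact convention):

class MySimpleMultiPlayers(object):
    """Minimal sketch: each player runs her own independent single-player policy."""
    def __init__(self, nbPlayers, playerAlgo, nbArms, *args, **kwargs):
        self.nbPlayers = nbPlayers
        # The required 'children' attribute: a list of single-player policies
        self.children = [playerAlgo(nbArms, *args, **kwargs) for _ in range(nbPlayers)]

    def __str__(self):
        return "MySimpleMultiPlayers({} x {})".format(self.nbPlayers, str(self.children[0]))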
About parallel computations¶
This short page explains quickly how we used multi-core computations to speed up the simulations in SMPyBandits.
Nowadays, parallelism is everywhere in the computational world, and any serious framework for numerical simulations must explore at least one of the three main approaches to (try to) gain performance from parallelism.
For all the different numerical simulations for which SMPyBandits is designed, the setting is the same: we consider a small set of p different problems, of time horizon T, that we want to simulate for N independent runs (e.g., p=6, T=10000 and N=100). On the one hand, because of the fundamentally sequential nature of bandit games, each repetition of the simulation must be sequential regarding the time steps t=1,…,T, and so no parallelism can be used to speed up this axis. On the other hand, parallelism can help greatly for the two other axes: if we have a way to run 4 processes in parallel, and we have p=4 problems to simulate, then running one process for each problem directly brings a speed-up factor of 4. Similarly, if we want to run 100 repetitions of the same (random) problem, and we can run 4 processes in parallel, then running 100/4=25 repetitions on each process also brings a speed-up factor of 4.
In this page, we quickly review the chosen approach for SMPyBandits (multi-core on one machine), and we explain why the two other approaches were less appropriate for our study of multi-armed bandit problems.
What we did implement: Joblib
for multi-core simulations.¶
The first approach is to use multiple cores of the same machine, and because it is both the simplest and the least financially as well as ecologically costly, this is the approach implemented in SMPyBandits. The machines I had access to during my thesis, either my own laptop or a workstation hosted by the SCEE team on the CentraleSupélec campus, were equipped with i5 or i7 Intel CPUs with 4 or 12 cores.
As explained in the page How_to_run_the_code.html, we implemented in SMPyBandits an easy way to run any simulation on n cores of a machine, using the Joblib library.
It is implemented in a completely transparent way, and if someone uses the command-line variables to configure experiments, switching from one core to all the cores of the machine is just a matter of changing N_JOBS=1 to N_JOBS=-1, like in this example.
BAYES=False ARM_TYPE=Bernoulli N=100 T=10000 K=9 N_JOBS=1 \
python3 main.py configuration.py
As long as the number of jobs (N_JOBS here) is less than or equal to the number of physical cores in the CPU of the computer, the final speed-up in terms of total computation runtime is almost optimal.
But jobs are implemented as threads, so the speed-up cannot be more than the number of cores, and using for instance 20 jobs on 4 cores for 20 repetitions is sub-optimal, as the CPU will essentially spend all its time (and memory) managing the different jobs, and not actually doing the simulations.
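To show the pattern on a toy example, here is a minimal sketch of how Joblib parallelizes independent repetitions (one_repetition is a dummy placeholder for the work actually done by the Evaluator classes):

from joblib import Parallel, delayed
import numpy as np

def one_repetition(seed, horizon=10000):
    """Dummy stand-in for one full repetition of a bandit simulation."""
    rng = np.random.RandomState(seed)
    return rng.random_sample(horizon).sum()

# Run N = 100 independent repetitions on 4 cores (like N_JOBS=4):
results = Parallel(n_jobs=4, verbose=5)(
    delayed(one_repetition)(seed) for seed in range(100)
)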
Using the above example, we illustrate the effect of using multiple jobs and multiple cores on the time efficiency of simulations using SMPyBandits. We consider three values of N_JOBS: 1 to use only one core and one job, 4 to use all the 4 cores of my i5 Intel CPU, and 20 to use more jobs than cores.
We give in the table below an example of the running time of an experiment with T=10000, for different numbers of repetitions and numbers of jobs. It clearly illustrates that using more jobs than the number of CPU cores is sub-optimal, and that as soon as the number of repetitions is large enough, using one job per available CPU core (i.e., here 4 jobs) gives a significant speed-up. Due to the cost of orchestrating the different jobs, and of memory exchanges at the end of each repetition, the parallel version is not 4 times faster, but empirically we always found it to be 2 to 3.5 times faster.
For a simulation with 9 different algorithms, for K=9 arms and a time horizon of T=10000, the table below shows the effect on the running time of using N_JOBS jobs in parallel, for 1, 4 (= nb of cores) and 20 (> nb of cores) jobs:

| Repetitions | 1 job | 4 jobs | 20 jobs |
|---|---|---|---|
| 1 | 15 seconds | 26 seconds | 43 seconds |
| 10 | 87 seconds | 51 seconds | 76 seconds |
| 100 | 749 seconds | 272 seconds | 308 seconds |
| 500 | 2944 seconds | 1530 seconds | 1846 seconds |
Approaches we did not try¶
The two other approaches we could have considered are parallel computations running not on multiple cores but on multiple machines, in a computing cluster, and parallel computations running on a Graphics Processing Unit (GPU).
GPU¶
I did not try to add to SMPyBandits the possibility to run simulations on a GPU, or with any general-purpose computation library offering a GPU backend. Initially designed for graphical simulations and mainly for video-game applications, the use of GPUs for scientific computations has been gaining attention for numerical simulation in the research world over the last 15 years, and NVidia CUDA for GPGPU (General Purpose GPU) started to become popular around 2011. Since 2016, we have seen large press coverage as well as an extensive use in research of deep-learning libraries that make general-purpose machine-learning algorithms train on the GPU of a user’s laptop or on a cluster of GPUs. This success is mainly possible because of the heavy parallelism of such training algorithms, and the parallel nature of GPUs. To the best of the author’s knowledge, nobody has tried to implement high-performance MAB simulations by using the “parallelism power” of a GPU (at least, no code for such experiments was made public as of 2019).
I worked on a GPU, implementing fluid-dynamics simulations during an internship in 2012, and I have since then kept a curiosity for GPU-powered libraries and code. I have contributed to and used famous deep-learning libraries, like Theano or Keras, and my limited knowledge of such libraries made me believe that it would not be easy to use a GPU for bandit simulations, and most surely it would not have been worth the time.
I would be very curious to understand how a GPU could be used to implement highly efficient simulations for sequential learning problems, because it seemed hard whenever I thought about it.
Large scale cluster¶
I also did not try to use any large-scale computer cluster, even though I was aware of the possibilities offered by the Grid 5000 project, for instance. It is partly due to time constraints, as I would have been curious to try, but mainly because we found that using a large-scale cluster would not have helped us much. The main reason is that in the multi-armed bandit and sequential learning literature, most research papers do not even include an experimental section, and for the papers which did take the time to implement and test their proposed algorithms, it is almost always done on just a few problems and for short- or medium-duration experiments.
For instance, the papers we consider to be the best ones regarding their empirical sections are Liu & Lee & Shroff, 2017, arXiv:1711.03539 and Cao & Wen & Kveton & Xie, 2018, arXiv:1802.03692, for piece-wise stationary bandits, and they mainly consider reasonable problems of horizon T=10000 and no more than 1000 independent repetitions. Each paper considers one harder problem, of horizon T=1000000 and fewer repetitions.
In each article written during my thesis, we included extensive numerical simulations, and even the longest ones (for Besson & Kaufmann, 2019, HAL-02006471) were short enough to run in less than 12 hours on a 12-core workstation, so we could run a few large-scale simulations overnight. For such reasons, we preferred not to try to run simulations on a cluster.
Other ideas?¶
And you, dear reader, do you have any idea of a technology I should have tried? If so, please fill an issue on GitHub! Thanks!
:boom: TODO¶
For others things to do, and issues to solve, see the issue tracker on GitHub.
Publicly release it and document it - OK¶
Other aspects¶
- [x] publish on GitHub!
Presentation paper¶
- [x] A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience
- [x] A clear statement of need that illustrates the purpose of the software
- [x] A list of key references including a link to the software archive
- [x] Mentions (if applicable) of any ongoing research projects using the software or recent scholarly publications enabled by it
Clean up things - OK¶
Initial things to do! - OK¶
Improve and speed-up the code? - OK¶
More single-player MAB algorithms? - OK¶
Contextual bandits?¶
- [ ] I should try to add support for (basic) contextual bandit.
Better storing of the simulation results¶
Multi-players simulations - OK¶
Other Multi-Player algorithms¶
- [ ] “Dynamic Musical Chair” that regularly reinitializes “Musical Chair”…
- [ ] “TDFS” from [Liu & Zhao, 2009].
Dynamic settings¶
- [ ] add the possibility to have a varying number of dynamic users for multi-users simulations…
- [ ] implement the experiments from [Musical Chair], [rhoRand] articles, and Navik Modi’s experiments?
C++ library / bridge to C++¶
- [ ] Finish writing a perfectly clean CLI client to my Python server
- [ ] Write a small library that can be included in any other C++ program, to: 1. start the socket connection to the server, 2. then play one step at a time,
- [ ] Check that the library can be used within a GNU Radio block !
Some illustrations for this project¶
Here are some plots illustrating the performances of the different policies implemented in this project, against various problems (with Bernoulli arms only):
Histogram of regrets at the end of some simulations¶
On a simple Bernoulli problem, we can compare 16 different algorithms (on a short horizon and a small number of repetitions, just as an example).
If we plot the distribution of the regret at the end of each experiment, R_T, we can see this kind of plot:
Histogramme_regret_monoplayer_2.png
It helps a lot to see both the mean value (in solid black) of the regret, and its distribution over a few runs (100 here). It can be used to detect algorithms that perform well on average, but sometimes with really bad runs. Here, Exp3++ seems to have had one bad run.
Demonstration of different Aggregation policies¶
On a fixed Gaussian problem, we aggregate some algorithms tuned for this exponential family (i.e., they know the variance but not the means). Our algorithm, Aggregator, outperforms its ancestor Exp4 as well as the other state-of-the-art experts aggregation algorithms, CORRAL and LearnExp.
main____env3-4_932221613383548446.png
Demonstration of multi-player algorithms¶
Regret plot on a random Bernoulli problem, with M=6 players accessing independently and in a decentralized way K=9 arms.
Our algorithms (RandTopM and MCTopM, as well as Selfish) outperform the state-of-the-art rhoRand:
MP__K9_M6_T5000_N500__4_algos__all_RegretCentralized____env1-1_8318947830261751207.png
Histogram on the same random Bernoulli problems. We see that all algorithms have a non-negligible variance on their regrets.
MP__K9_M6_T10000_N1000__4_algos__all_HistogramsRegret____env1-1_8200873569864822246.png
Comparison with two other “state-of-the-art” algorithms (MusicalChair and MEGA, in semilogy scale to really see the different scale of regret between efficient and sub-optimal algorithms):
MP__K9_M3_T123456_N100__8_algos__all_RegretCentralized_semilogy____env1-1_7803645526012310577.png
Other illustrations¶
Piece-wise stationary problems¶
Comparing Sliding-Window UCB, Discounted UCB and UCB, on a simple Bernoulli problem with regular random shuffling of the arms.
Demo_of_DiscountedUCB2.png
Sparse problem and Sparsity-aware algorithms¶
Comparing regular UCB, klUCB and Thompson sampling against “sparse-aware” versions, on a simple Gaussian problem with K=10 arms but only s=4 with non-zero mean.
Demo_of_SparseWrapper_regret.png
Demonstration of the Doubling Trick policy¶
- On a fixed problem with full restart:
main____env1-1_3633169128724378553.png
- On a fixed problem with no restart:
main____env1-1_5972568793654673752.png
- On random problems with full restart:
main____env1-1_1217677871459230631.png
- On random problems with no restart:
main____env1-1_5964629015089571121.png
Plots for the JMLR MLOSS paper¶
In the JMLR MLOSS paper I wrote to present SMPyBandits, an example of a simulation is presented, where we compare the standard anytime klUCB algorithm against the non-anytime variant klUCBPlusPlus, as well as UCB (with $\alpha=1$) and Thompson (with a Beta posterior).
configuration["policies"] = [
    { "archtype": klUCB, "params": { "klucb": klucbBern } },
    { "archtype": klUCBPlusPlus, "params": { "horizon": HORIZON, "klucb": klucbBern } },
    { "archtype": UCBalpha, "params": { "alpha": 1 } },
    { "archtype": Thompson, "params": { "posterior": Beta } }
]
Running this simulation as shown below will save figures in a sub-folder, as well as save data (pulls, rewards and regret) in HDF5 files.
# 3. run a single-player simulation
$ BAYES=False ARM_TYPE=Bernoulli N=1000 T=10000 K=9 N_JOBS=4 \
MEANS=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9] python3 main.py configuration.py
The two plots below show the average regret for these 4 algorithms. The regret is the difference between the cumulated rewards of the best fixed-arm strategy (which is the oracle strategy for stationary bandits) and the cumulated rewards of the considered algorithm.
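In formula, if the arm means are $\mu_1,\ldots,\mu_K$ with $\mu^* = \max_k \mu_k$, and $A(t)$ denotes the arm chosen at time $t$, this expected regret is the usual $R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_{A(t),t}\right]$ (using the same notation $r_{A(t),t}$ for rewards as above).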
- Average regret:
paper/3.png
- Histogram of regrets:
paper/3_hist.png
Example of a single-player simulation showing the average regret and histogram of regrets of 4 algorithms. They all perform very well: each algorithm is known to be order-optimal (i.e., its regret is proved to match the lower bound up to a constant), and each but UCB is known to be optimal (i.e., with the constant matching the lower bound). For instance, Thompson sampling is very efficient on average (in yellow), while UCB shows a larger variance (in red).
Saving simulation data to HDF5 file¶
This simulation produces this example HDF5 file, which contains attributes (e.g., horizon=10000, repetitions=1000, nbPolicies=4), and a collection of different datasets for each environment. Only one environment was tested, and for env_0 the HDF5 file stores some attributes (e.g., nbArms=9 and means=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]) and datasets (e.g., bestArmPulls of shape (4, 10000), cumulatedRegret of shape (4, 10000), lastRegrets of shape (4, 1000), averageRewards of shape (4, 10000)).
See the example:
GitHub.com/SMPyBandits/SMPyBandits/blob/master/plots/paper/example.hdf5.
Note: HDFCompass is recommended to explore the file from a nice and easy-to-use GUI. Or use it from a Python script with h5py, or from a Julia script with HDF5.jl.
Example of exploring this 'example.hdf5' file using HDFCompass
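For instance, assuming the file has been downloaded locally as example.hdf5, a minimal h5py snippet to inspect it could look like this (the group and dataset names are the ones described above):

import h5py

with h5py.File("example.hdf5", "r") as f:
    print(dict(f.attrs))                   # e.g., horizon, repetitions, nbPolicies
    env = f["env_0"]                       # the first (and here only) environment
    print(dict(env.attrs))                 # e.g., nbArms, means
    last_regrets = env["lastRegrets"][:]   # load a dataset as a numpy array, shape (nbPolicies, repetitions)
    print(last_regrets.mean(axis=1))       # average final regret of each policy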
Graph of time and memory consumptions¶
Time consumption¶
Note that I added a very clean support for time consumption measures: every simulation script will output (at the end) some lines looking like this:
Giving the mean and std running times ...
For policy #0 called 'UCB($\alpha=1$)' ...
84.3 ms ± 7.54 ms per loop (mean ± std. dev. of 10 runs)
For policy #1 called 'Thompson' ...
89.6 ms ± 17.7 ms per loop (mean ± std. dev. of 10 runs)
For policy #3 called 'kl-UCB$^{++}$($T=1000$)' ...
2.52 s ± 29.3 ms per loop (mean ± std. dev. of 10 runs)
For policy #2 called 'kl-UCB' ...
2.59 s ± 284 ms per loop (mean ± std. dev. of 10 runs)
Demo_of_automatic_time_consumption_measure_between_algorithms
Memory consumption¶
Note that I also added an experimental support for memory consumption measures: every simulation script will output (at the end) some lines looking like this:
Giving the mean and std memory consumption ...
For players called '3 x RhoRand-kl-UCB, rank:1' ...
23.6 KiB ± 52 B (mean ± std. dev. of 10 runs)
For players called '3 x RandTopM-kl-UCB' ...
1.1 KiB ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x Selfish-kl-UCB' ...
12 B ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x MCTopM-kl-UCB' ...
4.9 KiB ± 86 B (mean ± std. dev. of 10 runs)
For players called '3 x MCNoSensing($M=3$, $T=1000$)' ...
12 B ± 0 B (mean ± std. dev. of 10 runs)
Demo_of_automatic_memory_consumption_measure_between_algorithms
It is still experimental!
Jupyter Notebooks :notebook: by Naereen @ GitHub¶
This folder hosts some Jupyter Notebooks, to present in a nice format some numerical experiments for my SMPyBandits project.
The wonderful Jupyter tools are awesome for writing interactive and nicely presented :snake: Python simulations!
1. List of experiments presented with notebooks¶
MAB problems¶
- Easily creating various Multi-Armed Bandit problems explains the interface of the Environment.MAB module (a minimal creation example is sketched right after this list).
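For instance, a 9-armed Bernoulli problem like the one used in the simulations above could be created as follows. This is only a sketch: the import paths below assume the SMPyBandits package is installed and may differ slightly if you work directly from the main folder.

# Sketch only: import paths assume the SMPyBandits package is installed
from SMPyBandits.Arms.Bernoulli import Bernoulli
from SMPyBandits.Environment.MAB import MAB

# A 9-armed Bernoulli problem with evenly spaced means, as in the examples above
problem = MAB({
    "arm_type": Bernoulli,
    "params": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
})
print(problem.nbArms)   # 9
print(problem.means)    # the vector of means of the 9 arms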
Single-Player simulations¶
- A simple example of Single-Player simulation, comparing UCB1 (for two values of $\alpha$, 1 and 1/2), Thompson Sampling, BayesUCB and kl-UCB.
- Do we even need UCB? demonstrates the need for an algorithm smarter than the naive EmpiricalMeans.
- Lai-Robbins lower-bound for doubling-tricks algorithms with full restart.
Active research on Single-Player MAB¶
Multi-Player simulations¶
- A simple example of Multi-Player simulation with 4 Centralized Algorithms, comparing CentralizedMultiplePlay and CentralizedIMP with UCB and Thompson Sampling.
- A simple example of Multi-Player simulation with 2 Decentralized Algorithms, comparing rhoRand and Selfish (for the “collision avoidance” part), combined with UCB and Thompson Sampling for learning the arms. Spoiler: Selfish beats rhoRand!
(Old) Experiments¶
2. Question: How to read these documents?¶
2.a. View the notebooks statically :memo:¶
- Either directly in GitHub: see the list of notebooks;
- Or on nbviewer.jupyter.org: list of notebooks.
2.b. Play with the notebooks dynamically (on MyBinder) :boom:¶
Anyone can use the mybinder.org website (by clicking on the icon above) to run the notebook in her/his web-browser. You can then play with it as long as you like, for instance by modifying the values or experimenting with the code.
- Do_we_even_need_UCB.ipynb (Binder)
- Easily_creating_MAB_problems.ipynb (Binder)
- Example_of_a_small_Single-Player_Simulation.ipynb (Binder)
- Example_of_a_small_Multi-Player_Simulation__with_Centralized_Algorithms.ipynb (Binder)
- Example_of_a_small_Multi-Player_Simulation__with_rhoRand_and_Selfish_Algorithms.ipynb (Binder)
- Lai_Robbins_Lower_Bound_for_Doubling_Trick_with_Restarting_Algorithms.ipynb (Binder)
- Exploring different doubling tricks for different kinds of regret bounds.ipynb (Binder)
- Experiments of statistical tests for piecewise stationary bandits.ipynb (Binder)
- Demonstrations of Single-Player Simulations for Non-Stationary-Bandits.ipynb (Binder)
2.c. Play with the notebooks dynamically (on Google Colab) :boom:¶
Anyone can use the colab.research.google.com/notebook website (by clicking on the icon above) to run the notebook in her/his web-browser. You can then play with it as long as you like, for instance by modifying the values or experimenting with the code.
- Do_we_even_need_UCB.ipynb (Google Colab)
- Easily_creating_MAB_problems.ipynb (Google Colab)
- Example_of_a_small_Single-Player_Simulation.ipynb (Google Colab)
- Example_of_a_small_Multi-Player_Simulation__with_Centralized_Algorithms.ipynb (Google Colab)
- Example_of_a_small_Multi-Player_Simulation__with_rhoRand_and_Selfish_Algorithms.ipynb (Google Colab)
- Lai_Robbins_Lower_Bound_for_Doubling_Trick_with_Restarting_Algorithms.ipynb (Google Colab)
- Exploring different doubling tricks for different kinds of regret bounds.ipynb (Google Colab)
- Experiments of statistical tests for piecewise stationary bandits.ipynb (Google Colab)
- Demonstrations of Single-Player Simulations for Non-Stationary-Bandits.ipynb (Google Colab)
3. Question: Requirements to run the notebooks locally?¶
All the requirements can be installed with pip.
3.a. Jupyter Notebook and IPython¶
sudo pip install jupyter ipython
It will also install all the dependencies; afterward you should have a jupyter-notebook command (or a jupyter command, to be run as jupyter notebook) available in your PATH:
$ whereis jupyter-notebook
jupyter-notebook: /usr/local/bin/jupyter-notebook
$ jupyter-notebook --version # version >= 4 is recommended
4.4.1
3.b. My numerical environment, SMPyBandits¶
- First, install its dependencies (pip install -r requirements).
- Then, either install it (not yet), or be sure to work in the main folder.
Note: it’s probably better to use virtualenv, if you like it. I never really understood how and why virtualenvs are useful, but if you know why, you should know how to use one.
:information_desk_person: More information?¶
- More information about notebooks (on the documentation of IPython) or on the FAQ on Jupyter’s website.
- More information about mybinder.org: on this example repository.
:scroll: License ?
GitHub license¶
All the notebooks in this folder are published under the terms of the MIT License (file LICENSE.txt). © Lilian Besson, 2016-18.
List of notebooks for SMPyBandits¶
Note
I wrote many other Jupyter notebooks covering various topics; see my GitHub notebooks/ project.
A note on execution times, speed and profiling¶
- About (time) profiling with Python (2 or 3): cProfile or profile in the Python 2 documentation (or the Python 3 documentation), this StackOverflow thread, this blog post, and the documentation of line_profiler (to profile lines instead of functions), pycallgraph (to illustrate function calls) and yappi (which seems to be thread-aware). A minimal cProfile example is sketched right after this list.
- See also pyreverse to get nice UML-like diagrams illustrating the relationships between packages and classes.
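For example, profiling a piece of code with the standard cProfile module only takes a few lines. This is plain standard-library usage, not a SMPyBandits-specific tool; expensive_simulation below is a hypothetical placeholder for one simulation run:

import cProfile
import pstats

def expensive_simulation():
    """Hypothetical placeholder for one simulation run."""
    return sum(i ** 0.5 for i in range(10**6))

# Profile one call and print the 10 most time-consuming functions
profiler = cProfile.Profile()
profiler.enable()
expensive_simulation()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)

A whole script can also be profiled from the command line with python3 -m cProfile -o profile.out main.py configuration.py, and the resulting profile.out file explored later with pstats.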
A better approach?¶
In January, I tried the PyCharm Python IDE, which includes an awesome profiler! But it was too cumbersome to use…
An even better approach?¶
Well, now… I know my codebase, and I know how costly or efficient every new piece of code should be; if I empirically find something odd, I explore it with one of the above-mentioned modules…
:scroll: License ?
GitHub license¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson.
UML diagrams¶
These UML diagrams have been generated using pyreverse.
Packages in SMPyBandits¶
UML Diagram - Packages of SMPyBandits.git
Packages in SMPyBandits.PoliciesMultiPlayers¶
Classes in SMPyBandits¶
UML Diagram - classes of SMPyBandits.git
Classes in SMPyBandits.PoliciesMultiPlayers¶
Classes in SMPyBandits.Policies.Experimentals¶
How to generate them?¶
See the rules generate_uml and uml2others in this Makefile script.
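If you only want to regenerate the diagrams without going through the Makefile, pyreverse can be called directly. A minimal sketch follows: the output format and project name are illustrative, not necessarily the exact options used by the Makefile, and PNG output requires Graphviz.

import subprocess

# Calls the `pyreverse` command (shipped with pylint) on the SMPyBandits package.
# This should produce packages_SMPyBandits.png and classes_SMPyBandits.png
# in the current folder; PNG output needs Graphviz installed.
subprocess.run(
    ["pyreverse", "-o", "png", "-p", "SMPyBandits", "SMPyBandits/"],
    check=True,
)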
logs files¶
This folder keeps some examples of log files to show the output of the simulation scripts.
Multi players simulations¶
Example of output of the main_multiplayers.py program¶
Example of output of the main_multiplayers_more.py program¶
Linters¶
Pylint¶
- See main_pylint_log.txt for the Python 2 (generic) linting report.
- See main_pylint3_log.txt for the Python 3 (specific) linting report.
Profilers¶
- See main_py3_kernprof_log.txt from kernprof profiling.
- See main_py3_profile_log.txt for an example of a line-by-line time profiler.
- See main_py3_memory_profiler_log.txt for an example of a line-by-line memory profiler.
Graph of time and memory consumptions¶
Time consumption¶
Note that I added very clean support for time consumption measures: every simulation script will output (at the end) some lines looking like this:
Giving the mean and std running times ...
For policy #0 called 'UCB($\alpha=1$)' ...
84.3 ms ± 7.54 ms per loop (mean ± std. dev. of 10 runs)
For policy #1 called 'Thompson' ...
89.6 ms ± 17.7 ms per loop (mean ± std. dev. of 10 runs)
For policy #3 called 'kl-UCB$^{++}$($T=1000$)' ...
2.52 s ± 29.3 ms per loop (mean ± std. dev. of 10 runs)
For policy #2 called 'kl-UCB' ...
2.59 s ± 284 ms per loop (mean ± std. dev. of 10 runs)
Demo of automatic time consumption measure between algorithms
Memory consumption¶
Note that I added experimental support for memory consumption measures: every simulation script will output (at the end) some lines looking like this:
Giving the mean and std memory consumption ...
For players called '3 x RhoRand-kl-UCB, rank:1' ...
23.6 KiB ± 52 B (mean ± std. dev. of 10 runs)
For players called '3 x RandTopM-kl-UCB' ...
1.1 KiB ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x Selfish-kl-UCB' ...
12 B ± 0 B (mean ± std. dev. of 10 runs)
For players called '3 x MCTopM-kl-UCB' ...
4.9 KiB ± 86 B (mean ± std. dev. of 10 runs)
For players called '3 x MCNoSensing($M=3$, $T=1000$)' ...
12 B ± 0 B (mean ± std. dev. of 10 runs)
Demo of automatic memory consumption measure between algorithms
It is still experimental!
Other examples¶
Example of output of a script¶
For the complete_tree_exploration_for_MP_bandits script, see the file complete_tree_exploration_for_MP_bandits_py3_log.txt.
:scroll: License ?
GitHub license¶
MIT Licensed (file LICENSE).
© 2016-2018 Lilian Besson.
Note
Both this documentation and the code are publicly available, under the open-source MIT License. The code is hosted on GitHub at github.com/SMPyBandits/SMPyBandits.
Indices and tables¶
- Index
- Module Index
- Class Index,
- Function Index,
- Method Index,
- Static Method Index,
- Attribute Index,
- Search Page.