Single-Player policies

See the documentation: docs/Policies

List of policies

The Policies module contains various (single-player) bandit algorithms:

“Stupid” algorithms: Uniform, UniformOnSome, TakeFixedArm, TakeRandomFixedArm,
Greedy algorithms: EpsilonGreedy, EpsilonFirst, EpsilonDecreasing,
And two variants of the Explore-Then-Commit policy: ExploreThenCommit.ETC_KnownGap, ExploreThenCommit.ETC_RandomStop,
Probabilistic weighting algorithms: Hedge, Softmax, Softmax.SoftmaxDecreasing, Softmax.SoftMix, Softmax.SoftmaxWithHorizon, Exp3, Exp3.Exp3Decreasing, Exp3.Exp3SoftMix, Exp3.Exp3WithHorizon, Exp3.Exp3ELM, ProbabilityPursuit, Exp3PlusPlus, and a smart variant BoltzmannGumbel,
Index-based UCB algorithms: EmpiricalMeans, UCB, UCBlog10, UCBwrong, UCBlog10alpha, UCBalpha, UCBmin, UCBplus, UCBrandomInit, UCBV, UCBVtuned, UCBH, CPUCB,
Index-based MOSS algorithms: MOSS, MOSSH, MOSSAnytime, MOSSExperimental,
Bayesian algorithms: Thompson, ThompsonRobust, BayesUCB,
Based on Kullback-Leibler divergence: klUCB, klUCBlog10, klUCBloglog, klUCBloglog10, klUCBPlus, klUCBH, klUCBHPlus, klUCBPlusPlus,
Empirical KL-UCB algorithm: KLempUCB (FIXME),
Other index algorithms: DMED, DMED.DMEDPlus, OCUCB, UCBdagger,
Hybrid algorithms, mixing Bayesian and UCB indexes: AdBandits,
Aggregation algorithms: Aggregator (mine, it’s awesome, go on and try it!), CORRAL, and LearnExp,
Finite-Horizon Gittins index, approximated version: ApproximatedFHGittins,
An experimental policy, using Unsupervised Learning: UnsupervisedLearning,
An experimental policy, using Black-box optimization: BlackBoxOpt,
An experimental policy, SlidingWindowRestart, that uses a sliding window of, for instance, 100 draws and resets the algorithm as soon as the empirical average over the window is too far from the empirical average over the full history (or restarts a single arm only, if possible), with 3 versions for UCB, UCBalpha and klUCB: SlidingWindowRestart.SWR_UCB, SlidingWindowRestart.SWR_UCBalpha, SlidingWindowRestart.SWR_klUCB (my algorithm, not yet published),
An experimental policy that uses just a sliding window of, for instance, 100 draws: SlidingWindowUCB.SWUCB, and SlidingWindowUCB.SWUCBPlus if the horizon is known.
Another experimental policy with a discount factor, DiscountedUCB and DiscountedUCB.DiscountedUCBPlus.
Policies designed to tackle sparse stochastic bandit problems: SparseUCB, SparseklUCB, and SparseWrapper, which can be used with any index policy.
A policy that implements a “smart doubling trick” to turn any horizon-dependent policy into a horizon-independent policy without losing performance: DoublingTrickWrapper,
An experimental policy that implements another kind of doubling trick, to turn any policy that needs to know the range [a,b] of rewards into a policy that does not need to know the range and that adapts dynamically to new observations: WrapRange,
The Optimal Sampling for Structured Bandits (OSSB) policy: OSSB (it is more generic and can be applied to almost any kind of bandit problem; it works fine for classical stationary bandits, but it is not optimal),
New! The Best Empirical Sampled Average (BESA) policy: BESA (it works crazily well),
Some are designed only for (fully decentralized) multi-player games: MusicalChair, MEGA.
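
The sliding-window restart idea mentioned in the list above can be illustrated with a short sketch. This is only a hedged illustration of the test "the windowed average is too far from the full-history average", not the code of SlidingWindowRestart: the function name should_restart, the window size tau and the threshold epsilon are hypothetical names and values chosen for this example.

```python
def should_restart(rewards, tau=100, epsilon=0.1):
    """Hypothetical restart test for one arm.

    Returns True when the empirical mean of the last `tau` rewards differs
    from the empirical mean of the full history by more than `epsilon`.
    """
    if len(rewards) < tau:
        return False  # the window is not full yet, so no decision is made
    full_mean = sum(rewards) / len(rewards)
    window_mean = sum(rewards[-tau:]) / tau
    return abs(window_mean - full_mean) > epsilon


# An arm that paid 1.0 for 200 draws, then 0.0 for the last 100 draws:
# the recent window disagrees with the full history, so a restart is suggested.
history = [1.0] * 200 + [0.0] * 100
restart_needed = should_restart(history)
```

With a stationary history (for instance 200 rewards all equal to 1.0), the two averages coincide and the test does not trigger.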

API

All policies have the same interface, as described in BasePolicy, so that they can be used in any experiment with the following approach:

my_policy = Policy(nbArms)                   # any policy class from this module
my_policy.startGame()                        # start the game
for t in range(T):                           # T is the horizon (number of rounds)
    chosen_arm_t = k_t = my_policy.choice()  # choose one arm
    reward_t = draw_reward(k_t)              # placeholder: sample a reward from arm k_t
    my_policy.getReward(k_t, reward_t)       # give it to the policy
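
To make the loop above concrete, here is a minimal, self-contained sketch that runs without the library. It assumes nothing beyond the three methods named above (startGame, choice, getReward). The class EpsilonGreedySketch, the helper run and the Bernoulli arms simulated with random are hypothetical stand-ins written for this example; they are not the EpsilonGreedy class of this module.

```python
import random


class EpsilonGreedySketch:
    """Toy policy exposing the startGame / choice / getReward interface."""

    def __init__(self, nbArms, epsilon=0.1):
        self.nbArms = nbArms
        self.epsilon = epsilon
        self.startGame()

    def startGame(self):
        # Reset the per-arm statistics before a new game.
        self.pulls = [0] * self.nbArms
        self.rewards = [0.0] * self.nbArms

    def choice(self):
        # Explore uniformly with probability epsilon.
        if random.random() < self.epsilon:
            return random.randrange(self.nbArms)
        # Otherwise exploit: pick the best empirical mean (unpulled arms first).
        means = [r / n if n > 0 else float("inf")
                 for r, n in zip(self.rewards, self.pulls)]
        return max(range(self.nbArms), key=means.__getitem__)

    def getReward(self, arm, reward):
        self.pulls[arm] += 1
        self.rewards[arm] += reward


def run(policy, arm_means, T, seed=0):
    """Play T rounds against simulated Bernoulli arms, return the total reward."""
    random.seed(seed)
    policy.startGame()
    total = 0.0
    for t in range(T):
        k_t = policy.choice()                                   # choose one arm
        reward_t = 1.0 if random.random() < arm_means[k_t] else 0.0  # sample a reward
        policy.getReward(k_t, reward_t)                         # give it to the policy
        total += reward_t
    return total


arm_means = [0.1, 0.5, 0.9]
policy = EpsilonGreedySketch(len(arm_means))
total_reward = run(policy, arm_means, T=1000)
```

Because the experiment only calls startGame, choice and getReward, any object with these three methods can be passed to run in place of the toy policy.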
