Policies.ApproximatedFHGittins module

The approximated Finite-Horizon Gittins index policy for bounded bandits.

Policies.ApproximatedFHGittins.ALPHA = 0.5

Default value for the parameter \(\alpha > 0\) for ApproximatedFHGittins.

Policies.ApproximatedFHGittins.DISTORTION_HORIZON = 1.01

Default value for the parameter \(\tau \geq 1\) that is used to artificially increase the horizon, from \(T\) to \(\tau T\).

class Policies.ApproximatedFHGittins.ApproximatedFHGittins(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.IndexPolicy.IndexPolicy

The approximated Finite-Horizon Gittins index policy for bounded bandits.

__init__(nbArms, horizon=None, alpha=0.5, distortion_horizon=1.01, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
alpha = None

Parameter \(\alpha > 0\).

distortion_horizon = None

Parameter \(\tau \geq 1\).

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__() -> str[source]

m

\(m = T - t + 1\) is the number of remaining steps to be played until the end of the game.

Note

The article does not explain how to deal with an unknown horizon, but if \(T\) is underestimated, this \(m\) eventually becomes negative. Empirically, it is forced to be \(\geq 1\), so as not to break the \(\log(m)\) term used below, by using \(\tau T\) instead of \(T\) (e.g., \(\tau = 1.01\) is enough to not ruin the performance in the last steps of the experiment).
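A minimal sketch of this guard, under the assumption that the distorted horizon is rounded up; the helper name is illustrative, not the library's actual code:

```python
from math import ceil

def remaining_steps(t, horizon, distortion_horizon=1.01):
    """Hypothetical helper: m = tau*T - t + 1, clamped to be >= 1
    so that the log(m) used in the index stays well defined."""
    distorted_horizon = int(ceil(distortion_horizon * horizon))
    return max(1, distorted_horizon - t + 1)
```

With \(\tau = 1\) this is exactly \(m = T - t + 1\); with \(\tau > 1\) the clamp at 1 only kicks in past the (distorted) horizon.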

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \log\left( \frac{m}{N_k(t) \log^{1/2}\left( \frac{m}{N_k(t)} \right)} \right)}, \\ \text{where}\;\; & m = T - t + 1.\end{split}\]

Note

This \(\log^{1/2}(\dots) = \sqrt{\log(\dots)}\) term can be undefined as soon as \(m < N_k(t)\), so empirically \(\sqrt{\max(0, \log(\dots))}\) is used instead, or a larger horizon can be used to make \(m\) artificially larger (e.g., \(T' = 1.1 T\)).
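A scalar sketch of this index computation, with the \(\max(0, \dots)\) guard from the note; the function name and the fallback to the empirical mean when the inner term vanishes are illustrative assumptions, not the library's exact code:

```python
from math import log, sqrt

def approximated_fh_gittins_index(mean, pulls, m, alpha=0.5):
    """Sketch of I_k(t) = X_k/N_k + sqrt((2*alpha/N_k) *
    log(m / (N_k * log^{1/2}(m / N_k)))), guarded with max(0, .)."""
    if pulls <= 0:
        return float('inf')  # unpulled arms are explored first
    inner = sqrt(max(0.0, log(m / pulls)))  # log^{1/2}(m / N_k), guarded
    if inner <= 0.0:
        return mean  # hypothetical fallback when m <= N_k
    return mean + sqrt((2.0 * alpha / pulls)
                       * max(0.0, log(m / (pulls * inner))))
```

For example, with a mean of 0.5, 10 pulls, and \(m = 100\), the exploration bonus is about 0.43, giving an index near 0.93.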

__module__ = 'Policies.ApproximatedFHGittins'
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.
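A possible vectorized sketch of the same index with NumPy, assuming illustrative names rather than the library's exact code; unpulled arms get an infinite index so they are tried first, and the same \(\max(0, \dots)\) guard is applied elementwise:

```python
import numpy as np

def compute_all_indexes(rewards, pulls, m, alpha=0.5):
    """Sketch: vectorized Approximated FH-Gittins indexes for all arms."""
    rewards = np.asarray(rewards, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    indexes = np.full(pulls.shape, np.inf)  # unpulled arms -> +inf
    with np.errstate(divide='ignore', invalid='ignore'):
        # log^{1/2}(m / N_k), guarded to stay defined when m < N_k
        inner = np.sqrt(np.maximum(0.0, np.log(m / pulls)))
        # fall back to a zero bonus where the inner term vanishes
        outer = np.where(inner > 0,
                         np.maximum(0.0, np.log(m / (pulls * inner))),
                         0.0)
        vals = rewards / pulls + np.sqrt(2.0 * alpha / pulls * outer)
    mask = pulls > 0
    indexes[mask] = vals[mask]
    return indexes
```

Computing all arms at once avoids a Python-level loop over `computeIndex(arm)`, which matters when the number of arms is large.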