M10 Bandit Lab
The best lever is
not the one that paid last.
A live multi-armed bandit simulator. Tests four strategies side by side on the same hidden reward distribution. Shows why the slot machine that paid most recently is rarely the one that pays most overall, and how much that mistake costs.
Five arms. Hidden reward probabilities. Pull levers, watch algorithms learn, and compare cumulative regret in real time. Thompson Sampling, UCB1, Epsilon-Greedy, and Random play 500 rounds on identical ground truth. Nothing is preloaded. Every run is new.
Nothing is preloaded. Every run is new. Ground-truth probabilities documented in § VI.
§ I Why the winning variant is usually a trap
The slot machine that just paid out is not more likely to pay again.
This is the explore-exploit dilemma in its purest form. You have several options, you don't know which is best, and every pull of a suboptimal arm costs you the difference between the best possible reward and what you actually got. The temptation is to keep pulling whatever paid last. The math says that strategy is usually wrong.
A/B testing falls into the same trap. Run a test for two weeks, declare a winner, and deploy it forever. But early winners often regress. The variant that looked best at week two may have been lucky, not good. A bandit algorithm, by contrast, never stops exploring. It reduces the rate of exploration as evidence accumulates, but it never locks in a premature winner.
The cost of impatience is measured in regret: the cumulative difference between the reward you earned and the reward you would have earned had you known the best arm from the start. A good algorithm keeps regret low. A bad one lets it grow linearly with time.
§ II · M10.1 The casino · pull the levers
§ III How it works
The explore-exploit dilemma
Every pull is a choice between gathering information (explore) and using what you know (exploit). Explore too much and you waste pulls on bad arms. Exploit too early and you lock in a suboptimal choice before the data is clear. The optimal balance changes over time: explore heavily at first, then gradually shift to exploitation as confidence grows.
Optimism in the face of uncertainty
UCB1 adds an optimism bonus to every arm's empirical mean. Arms that have been pulled less often get a larger bonus, which forces exploration of the unknown. The bonus shrinks as the pull count grows, so no arm is neglected forever. The result is a principled upper confidence bound that automatically balances explore and exploit.
Probability matching
Thompson Sampling does not compute a single best guess for each arm. It draws a random sample from each arm's posterior distribution, then picks the arm with the highest draw. This means an arm with a wide posterior (high uncertainty) still has a chance of winning the sample, which drives exploration. As the posterior tightens, the best arm wins almost every time.
§ IV · M10.2 The cost of impatience
Not all bandit problems are equally hard. When one arm dominates, even a naive strategy finds it quickly. When two arms are close, the cost of confusion compounds. And when the best arm starts with a string of bad luck, only the patient algorithms recover.
| Scenario | Random | Epsilon-Greedy | UCB1 | Thompson |
|---|
§ V Receipts
§ VI Methodology & Colophon
Pure JavaScript. Each arm is a Bernoulli trial with a hidden probability sampled from Beta(2, 2). Four strategies update their beliefs and choose arms every frame. Regret is computed against the ground-truth best arm. All randomness is seeded by Math.random(); no server calls, no precomputed data.
K arms, T rounds, binary rewards. Random and Epsilon-Greedy are frequentist baselines. UCB1 uses optimism in the face of uncertainty. Thompson Sampling maintains a Beta posterior per arm and samples from it. Non-stationary mode adds a Gaussian random walk to each hidden probability after every pull.
Lattimore & Szepesvari, Bandit Algorithms (Cambridge, 2020)
Chapelle & Li, An Empirical Evaluation of Thompson Sampling (NIPS 2011)
Auer et al., Finite-time Analysis of the Multiarmed Bandit Problem (ML 2002)
Stationary environment by default; non-stationary toggle is a simple random walk. Binary rewards only. No contextual features, no batched updates, no delayed feedback. The Beta(2,2) prior for ground-truth generation is arbitrary; real-world reward distributions are rarely symmetric.