M10 Bandit Lab

The best lever is
not the one that paid last.

A live multi-armed bandit simulator. Tests four strategies side by side on the same hidden reward distribution. Shows why the slot machine that paid most recently is rarely the one that pays most overall, and how much that mistake costs.

Five arms. Hidden reward probabilities. Pull levers, watch algorithms learn, and compare cumulative regret in real time. Thompson Sampling, UCB1, Epsilon-Greedy, and Random play 500 rounds on identical ground truth. Nothing is preloaded. Every run is new.

Filed

Engine

Pure JS · Monte Carlo · Beta distributions

Arms

Rounds

500

Nothing is preloaded. Every run is new. Ground-truth probabilities documented in § VI.

§ I Why the winning variant is usually a trap

The slot machine that just paid out is not more likely to pay again.

This is the explore-exploit dilemma in its purest form. You have several options, you don't know which is best, and every pull of a suboptimal arm costs you the difference between the best possible reward and what you actually got. The temptation is to keep pulling whatever paid last. The math says that strategy is usually wrong.

A/B testing falls into the same trap. Run a test for two weeks, declare a winner, and deploy it forever. But early winners often regress. The variant that looked best at week two may have been lucky, not good. A bandit algorithm, by contrast, never stops exploring. It reduces the rate of exploration as evidence accumulates, but it never locks in a premature winner.

The cost of impatience is measured in regret: the cumulative difference between the reward you earned and the reward you would have earned had you known the best arm from the start. A good algorithm keeps regret low. A bad one lets it grow linearly with time.

§ II · M10.1 The casino · pull the levers

Number of arms

Number of rounds

500

Round 0 / 500

Random baseline

Regret: 0.00

Cumulative regret

Pulls per arm

Epsilon-Greedy ε = 0.1

Regret: 0.00

Cumulative regret

Pulls per arm

UCB1 optimism

Regret: 0.00

Cumulative regret

Pulls per arm

Thompson sampling

Regret: 0.00

Cumulative regret

Pulls per arm

Beta posteriors

§ III How it works

Fig. A

The explore-exploit dilemma

Every pull is a choice between gathering information (explore) and using what you know (exploit). Explore too much and you waste pulls on bad arms. Exploit too early and you lock in a suboptimal choice before the data is clear. The optimal balance changes over time: explore heavily at first, then gradually shift to exploitation as confidence grows.

Fig. B

Optimism in the face of uncertainty

UCB1 adds an optimism bonus to every arm's empirical mean. Arms that have been pulled less often get a larger bonus, which forces exploration of the unknown. The bonus shrinks as the pull count grows, so no arm is neglected forever. The result is a principled upper confidence bound that automatically balances explore and exploit.

Fig. C

Probability matching

Thompson Sampling does not compute a single best guess for each arm. It draws a random sample from each arm's posterior distribution, then picks the arm with the highest draw. This means an arm with a wide posterior (high uncertainty) still has a chance of winning the sample, which drives exploration. As the posterior tightens, the best arm wins almost every time.

§ IV · M10.2 The cost of impatience

Not all bandit problems are equally hard. When one arm dominates, even a naive strategy finds it quickly. When two arms are close, the cost of confusion compounds. And when the best arm starts with a string of bad luck, only the patient algorithms recover.

Preset

Non-stationary (reward drift)

idle

§ V Receipts

§ VI Methodology & Colophon

Engine

Pure JavaScript. Each arm is a Bernoulli trial with a hidden probability sampled from Beta(2, 2). Four strategies update their beliefs and choose arms every frame. Regret is computed against the ground-truth best arm. All randomness is seeded by Math.random(); no server calls, no precomputed data.

Model spec

K arms, T rounds, binary rewards. Random and Epsilon-Greedy are frequentist baselines. UCB1 uses optimism in the face of uncertainty. Thompson Sampling maintains a Beta posterior per arm and samples from it. Non-stationary mode adds a Gaussian random walk to each hidden probability after every pull.

Reading list

Lattimore & Szepesvari, Bandit Algorithms (Cambridge, 2020)
Chapelle & Li, An Empirical Evaluation of Thompson Sampling (NIPS 2011)
Auer et al., Finite-time Analysis of the Multiarmed Bandit Problem (ML 2002)

Limitations

Stationary environment by default; non-stationary toggle is a simple random walk. Binary rewards only. No contextual features, no batched updates, no delayed feedback. The Beta(2,2) prior for ground-truth generation is arbitrary; real-world reward distributions are rarely symmetric.

← Back to the portfolio View the notebook on GitHub ↗

M10 Bandit Lab

The best lever is not the one that paid last.

§ I Why the winning variant is usually a trap

§ II · M10.1 The casino · pull the levers

§ III How it works

The explore-exploit dilemma

Optimism in the face of uncertainty

Probability matching

§ IV · M10.2 The cost of impatience

§ V Receipts

§ VI Methodology & Colophon

The best lever is
not the one that paid last.