Bandits with Switching Costs: T^{2/3} Regret

Consider the adversarial two-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round regret in this setting (i.e., his excess loss compared to the better of the two actions) is T^{2/3} (up to a log term). In the corresponding full-information problem, the minimax regret is known to grow at the slower rate of T^{1/2}. The gap between these two rates shows that learning with bandit feedback (i.e., observing only the loss of the action actually played, not the alternative) can be significantly harder than learning with full-information feedback. It also shows that any algorithm achieving the optimal T^{1/2} regret in the standard bandit problem, without switching costs, must sometimes switch actions very frequently. The proof is based on an information-theoretic analysis of a loss process arising from a multi-scale random walk.
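To give a concrete picture of the multi-scale random walk mentioned above, here is a minimal sketch in Python. The parent rule t − 2^{δ(t)}, where δ(t) is the exponent of the largest power of 2 dividing t, follows the construction in the linked paper; the Gaussian step size and the function names are illustrative assumptions, not taken from the talk.

```python
import random

def delta(t):
    """Exponent of the largest power of 2 dividing t, e.g. delta(12) = 2."""
    d = 0
    while t % 2 == 0:
        t //= 2
        d += 1
    return d

def multiscale_walk(T, sigma=1.0, seed=0):
    """Sketch of a multi-scale random walk W_1..W_T.

    Each W_t is a Gaussian increment on top of W at the "parent" time
    t - 2^{delta(t)}, so the chain of dependencies from any t back to 0
    has only logarithmic depth.
    """
    rng = random.Random(seed)
    W = [0.0] * (T + 1)  # W[0] = 0 anchors the walk
    for t in range(1, T + 1):
        parent = t - 2 ** delta(t)
        W[t] = W[parent] + rng.gauss(0.0, sigma)
    return W[1:]
```

In the lower-bound construction, losses built from such a walk are hard to track with bandit feedback unless the player switches often, which is what the switching cost penalizes.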

(Joint work with Ofer Dekel, Jian Ding and Tomer Koren; to appear in STOC 2014, available at http://arxiv.org/abs/1310.2997)

Speaker Details

Yuval Peres is a Principal Researcher in the Theory group at Microsoft Research, Redmond. His research encompasses many areas of probability theory, including random walks, Brownian motion, percolation, point processes and random graphs, as well as connections with Ergodic Theory, PDE, Combinatorics, Fractals and Theoretical Computer Science; cf. http://arxiv.org/find/math/1/au:+Peres_Y/0/1/0/all/0/1. He has advised 20 PhD students; see http://www.genealogy.math.ndsu.nodak.edu/id.php?id=22523&fChrono=1

Date:
Speakers: Yuval Peres
Affiliation: Microsoft Research Redmond