Alekh Agarwal, Ofer Dekel, and Lin Xiao
Bandit convex optimization is a special case of online convex optimization with partial information. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at a single point. In some cases, the minimax regret of these problems is known to be strictly worse than the minimax regret in the corresponding full information setting. We introduce the multi-point bandit setting, in which the player can query each loss function at multiple points. When the player is allowed to query each function at two points, we prove regret bounds that closely resemble bounds for the full information case. This suggests that knowing the value of each loss function at two points is almost as useful as knowing the value of each function everywhere. When the player is allowed to query each function at d+1 points (d being the dimension of the space), we prove regret bounds that are exactly equivalent to full information bounds for smooth functions.
|Published in||Proceedings of the 23rd Annual Conference on Learning Theory (COLT)|