My research focuses on machine learning in interactive problems, where the objective is not merely to make accurate predictions, but to optimize a certain reward function by taking good actions. An example is online advertising, whose ultimate goal is to decide which ad to show to maximize revenue over time. These problems are known as reinforcement learning in general, and multi-armed bandits in certain simplified settings.
Multi-armed bandits are the most basic model for online learning with interaction: in each round, a learner takes an action and in return receives a numerical reward. The goal of the learner is to optimize her action-selection policy to maximize the total reward received. Very often in practice, the learner has access to contextual information in each round to infer which action leads to the highest reward. Bandit problems of this kind, known as contextual bandits, capture a number of important Web applications such as news recommendation, targeted advertising, and ranking. The key challenge here is the exploration/exploitation tradeoff, especially when contextual information is used. My work has resulted in principled and effective algorithms with parametric models such as linear or generalized linear models, as well as expert-style algorithms for adversarial scenarios.
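The round-by-round protocol described above can be illustrated with a minimal sketch. It uses plain epsilon-greedy exploration on simulated Bernoulli rewards, which is a textbook baseline chosen only for brevity and not one of the algorithms referred to above. The function name, the reward probabilities, and all parameter values are assumptions made for illustration.

```python
import random


def run_epsilon_greedy(reward_probs, rounds=5000, epsilon=0.1, seed=0):
    """Simulate a (context-free) bandit with Bernoulli rewards.

    reward_probs[a] is the hypothetical success probability of action a;
    the learner never sees it and only observes sampled rewards.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions    # number of times each action was taken
    means = [0.0] * n_actions   # running average reward of each action
    total_reward = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: try an action uniformly at random.
            action = rng.randrange(n_actions)
        else:
            # Exploit: take the action with the best estimate so far.
            action = max(range(n_actions), key=lambda a: means[a])
        # The environment returns a numerical reward for the chosen action only.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        means[action] += (reward - means[action]) / counts[action]
        total_reward += reward
    return total_reward, counts, means


if __name__ == "__main__":
    total, counts, means = run_epsilon_greedy([0.2, 0.5, 0.8])
    print("total reward:", total)
    print("pulls per action:", counts)
```

A contextual variant would replace the per-action averages with a model that predicts reward from the context, for example one linear model per action.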
Sample complexity of exploration in reinforcement learning studies the fundamental question of how fast a reinforcement learner can approach a near-optimal policy by interacting with an initially unknown environment. The key challenge is again the exploration/exploitation tradeoff, which is harder to resolve than in the simpler bandit setting. Building on the Knows What It Knows framework for prediction-error-aware supervised learning, we developed a meta-algorithm that provably enjoys polynomial sample complexity in a wide range of reinforcement learning problems. More details can be found in the (shorter) survey and the (longer) dissertation.
Offline evaluation of learning algorithms aims to provide reliable performance estimates based on logged data, without deploying the algorithms in the real system (for cost/risk reasons). In interactive problems, offline evaluation (sometimes known as off-policy reinforcement learning) becomes much harder because of counterfactual effects. Through analysis of real data, we showed that offline evaluation can be done reliably for multi-armed bandits. Extensions are possible with the help of statistical techniques such as importance sampling, rejection sampling, and doubly robust estimation. I also helped release benchmark datasets for bandit algorithms while at Yahoo!, one of which was used in a PASCAL2 Challenge.
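As an illustration of the importance-sampling idea mentioned above, the following minimal sketch estimates the average reward of a deterministic target policy from logs collected by a different, randomized logging policy. The log format, the synthetic environment, and the function names are assumptions made for this example; it is not the evaluation procedure used in the work described here.

```python
import random


def ips_value(logs, target_policy):
    """Importance-sampling estimate of a deterministic policy's average reward.

    Each log entry is (context, action, reward, propensity), where propensity
    is the probability with which the logging policy chose the logged action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Only rounds where the target policy agrees with the logged action
        # contribute; reweighting by 1/propensity corrects for how often
        # the logging policy took that action.
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logs)


def make_logs(n, seed=0):
    """Generate synthetic logs (hypothetical environment, for illustration)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = rng.randrange(2)
        action = rng.randrange(2)   # logging policy: uniform over two actions
        propensity = 0.5
        # Matching the action to the context pays off more often.
        p_reward = 0.9 if action == context else 0.1
        reward = 1.0 if rng.random() < p_reward else 0.0
        logs.append((context, action, reward, propensity))
    return logs


if __name__ == "__main__":
    logs = make_logs(20000)
    print("matching policy:", ips_value(logs, lambda c: c))
    print("mismatching policy:", ips_value(logs, lambda c: 1 - c))
```

The estimate is unbiased only when the logged propensities are correct and every action the target policy might take has nonzero probability under the logging policy.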
Online learning provides a viable solution to large-scale machine learning. Truncated gradient is one such example that solves the celebrated Lasso approximately and efficiently in a stochastic-gradient-descent fashion. Another example is propensity score estimation, a critical step for answering what-if questions in counterfactual analysis. Online learning can also be combined with parallel computing to yield even greater speedup. I also helped develop the Vowpal Wabbit software that has found many uses in industry.
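The truncation idea can be sketched as follows: after each stochastic-gradient step on the squared loss, weights that are close to zero are pulled further toward zero, which encourages sparsity in the spirit of the Lasso. This is a simplified illustration that truncates after every step; the parameter names (`eta`, `gravity`, `theta`), their values, and the synthetic data are assumptions, and the code is not the Vowpal Wabbit implementation.

```python
import random


def truncate(v, alpha, theta):
    """Shrink v toward zero by alpha if |v| <= theta; never cross zero."""
    if 0.0 <= v <= theta:
        return max(0.0, v - alpha)
    if -theta <= v < 0.0:
        return min(0.0, v + alpha)
    return v


def truncated_gradient_sgd(examples, dim, eta=0.05, gravity=0.05,
                           theta=float("inf"), epochs=20):
    """SGD on squared loss with a truncation step after every update."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y  # gradient of 0.5 * (pred - y)^2 w.r.t. pred
            # Gradient step followed by truncation with amount eta * gravity.
            w = [truncate(wi - eta * err * xi, eta * gravity, theta)
                 for wi, xi in zip(w, x)]
    return w


def make_data(n=200, dim=5, seed=0):
    """Synthetic regression data where only the first feature matters."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = 2.0 * x[0] + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


if __name__ == "__main__":
    w = truncated_gradient_sgd(make_data(), dim=5)
    print("learned weights:", [round(wi, 3) for wi in w])
```

With `theta` set to infinity every weight is shrunk at each step; a finite `theta` leaves large weights untouched and only truncates small ones.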
Online recommendation of content (such as ads, news, friends, and music) is ubiquitous on the Web, and can be naturally modeled as a multi-armed bandit where the reward can be clicks, revenue, etc. I am particularly interested in personalized online recommendation, which takes advantage of contextual information about users and queries to produce recommendations of greater quality. Algorithms that we developed for multi-armed bandits have shown great benefits in problems like news recommendation and targeted advertising.
Spoken dialog management is one application area of reinforcement learning, which can be used to optimize the policy of a conversational system so that it can better interact with, and assist, humans. I have worked on fast feature selection in the past, and am interested in other aspects such as imitation learning at present.
- Microsoft Research, Researcher, 2012 - Present
- Yahoo! Research, Research Scientist, 2010 - 2012
- Yahoo! Research, Postdoctoral Scientist, 2009 - 2010
- Rutgers University, PhD (Computer Science), 2009
- AT&T Shannon Labs, Research Intern, summer 2008
- Yahoo! Research, Research Intern, summer 2007
- Google, Engineer Intern, summer 2006
- University of Alberta, MS (Computing Science), 2004
- Tsinghua University, BE (Computer Science and Technology), 2002
To be added.
- Area Chair for ICML'2012 and ICML'2013.
- Senior Program Committee Member for IJCAI'2011.
- Reviewer for AAAI, AISTATS, COLT, ECML, KDD, ICML, IJCAI, NIPS, UAI, UbiComp, WSDM, WWW.
- Regular reviewing services for Journal of Machine Learning Research, Machine Learning, Journal of Artificial Intelligence Research, and Artificial Intelligence, among other journals and transactions.