Jianfeng Gao, Tianbing Xu, Lin Xiao, and Xiaodong He
We propose a new algorithm based on the dual averaging method for large-scale discriminative training in natural language processing (NLP), as an alternative to the perceptron algorithm or stochastic gradient descent (SGD). The new algorithm estimates the parameters of linear models by minimizing L1-regularized objectives and is effective in producing sparse solutions, which is particularly desirable for large-scale NLP tasks. We then give the mistake bound of the algorithm and show how the bound is affected by the additional L1 regularization term. Evaluations on parse reranking and statistical machine translation attest to the effectiveness of the new algorithm.
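To make the flavor of the approach concrete, the following is a minimal sketch of an L1-regularized dual-averaging-style update in the spirit described above, not the paper's exact algorithm. It uses a hinge loss on binary examples as a stand-in for the structured losses of parse reranking and machine translation, and the step-size form and hyperparameters (`lam`, `gamma`) are illustrative assumptions. The closed-form soft-thresholding step is what drives coordinates of the weight vector exactly to zero, yielding the sparsity the abstract highlights.

```python
import numpy as np

def l1_rda_train(examples, dim, lam=0.01, gamma=1.0, epochs=1):
    """Sketch of L1-regularized dual-averaging-style training.

    `examples` is a list of (x, y) pairs with y in {-1, +1}. The hinge
    loss here is a stand-in for the structured losses in the paper; the
    step-size schedule sqrt(t)/gamma is one common choice, assumed here.
    """
    w = np.zeros(dim)
    g_sum = np.zeros(dim)      # running sum of subgradients (the "dual average")
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            if y * np.dot(w, x) < 1.0:   # hinge-loss subgradient on a mistake/margin violation
                g_sum += -y * x
            g_bar = g_sum / t            # averaged subgradient over all steps so far
            # Closed-form update with L1: soft-threshold the averaged
            # subgradient, so small-magnitude coordinates become exactly 0.
            shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
            w = -(np.sqrt(t) / gamma) * shrunk
    return w
```

Because the update soft-thresholds the *averaged* subgradient rather than the current weight, features whose accumulated signal stays below `lam` remain exactly zero, which is why this family of methods produces genuinely sparse models rather than merely small weights.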