Our research
Research area: Computational linguistics
Publications 1–25 of 258
William D. Lewis, Christian Federmann, and Ying Xin

Cross Entropy Difference (CED) has proven to be a very effective method for selecting domain-specific data from large corpora of out-of-domain or general domain content. It is used in a number of different scenarios, and is particularly popular in bake-off competitions in which participants have a limited set of resources to draw from, and need to sub-sample the data in such a way as to ensure better results on domain-specific test sets. The underlying algorithm is handy since one can provide a set of...
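As a rough illustration only (not the paper's implementation), here is a minimal Python sketch of the cross-entropy difference scoring idea: each candidate sentence is scored by its per-word cross-entropy under an in-domain language model minus its cross-entropy under a general-domain model, and the lowest-scoring sentences are kept. The unigram "language models" and toy sentences below are placeholders I introduce for the example.

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Build a tiny add-one-smoothed unigram model; a stand-in for a real LM."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(sentence, lm):
    """Per-word cross-entropy (bits) of a sentence under a unigram model."""
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / max(len(words), 1)

in_domain  = ["the patient was given aspirin", "dosage depends on body weight"]
general    = ["the game went into overtime", "stocks fell sharply on friday"]
candidates = ["aspirin dosage for adults", "the final score was close"]

lm_in, lm_gen = unigram_lm(in_domain), unigram_lm(general)
# Lower (in-domain entropy minus general entropy) means "looks more in-domain".
scored = sorted(candidates,
                key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen))
print(scored[0])
```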

Publication details
Date: 4 December 2015
Type: Inproceeding
Sauleh Eetemadi, William Lewis, Kristina Toutanova, and Hayder Radha

Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. data-to-models takes an inordinate amount of time). Moreover, the training...

Publication details
Date: 1 December 2015
Type: Article
Publisher: Springer
Royal Sequiera, Monojit Choudhury, Parth Gupta, Paolo Rosso, Shubham Kumar, Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay, Gokul Chittaranjan, Amitava Das, and Kunal Chakma

The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents, where both the queries and documents were either in Hindi written in Devanagari or in Roman...

Publication details
Date: 1 December 2015
Type: Inproceeding
Publisher: FIRE
Royal Sequiera, Monojit Choudhury, and Kalika Bali

We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and also present a new feature set that addresses the transliteration problem inherent in social media. We achieve 84% accuracy with the new feature set. We show that context and joint modeling of the language detection and POS tag layers do not help in POS tagging.

Publication details
Date: 1 December 2015
Type: Inproceeding
Publisher: NLPAI
William D. Lewis

In 1966, Star Trek introduced us to the notion of the Universal Translator. Such a device allowed Captain Kirk and his crew to communicate with alien species, such as the Gorn, who did not speak their language, or even converse with species who did not speak at all (e.g., the Companion from the episode Metamorphosis). In 1979, Douglas Adams introduced us to the “Babelfish” in the Hitchhiker's Guide to the Galaxy which, when inserted into the ear, allowed the main character to do...

Publication details
Date: 27 November 2015
Type: Inproceeding
Shyam Upadhyay and Ming-Wei Chang

We present DRAW, a dataset consisting of 1000 linear algebra word problems, semi-automatically annotated for the evaluation of automatic solvers. Details of the annotation process are described, which involves a novel template reconciliation procedure for reducing equivalent templates. DRAW also consists of richer annotations, including gold coefficient alignments and equation system templates, which were absent in existing benchmarks.

We present a quantitative comparison of DRAW to existing...

Publication details
Date: 1 October 2015
Type: Technical report
Number: MSR-TR-2015-78
Yi Yang, Wen-tau Yih, and Christopher Meek

We describe the WikiQA dataset, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. Most previous work on answer sentence selection focuses on a dataset created using the TREC-QA data, which includes editor-generated questions and candidate answer sentences selected by matching content words in the question. WikiQA is constructed using a more natural process and is more than an order of magnitude larger than the previous...

Publication details
Date: 21 September 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon

Models that learn to represent textual and knowledge base relations in the same continuous latent space are able to perform joint inferences among the two kinds of relations and obtain high accuracy on knowledge base completion (Riedel et al. 2013). In this paper we propose a model that captures the compositional structure of textual relations, and jointly optimizes entity, knowledge base, and textual relation representations. The proposed model significantly improves performance over a model that...
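Purely as a generic illustration of scoring a knowledge-base triple with entity and relation vectors in one continuous space (the kind of latent representation such joint models build on, not the model proposed in the paper), a small numpy sketch with invented vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Hypothetical entity and relation embeddings; in practice these are learned.
entities = {name: rng.normal(size=dim) for name in ["Seattle", "Washington", "Paris"]}
relations = {name: rng.normal(size=dim) for name in ["located_in"]}

def score(head, relation, tail):
    """Bilinear (DistMult-style) score: higher means the triple looks more plausible."""
    return float(np.sum(entities[head] * relations[relation] * entities[tail]))

print(score("Seattle", "located_in", "Washington"))
print(score("Paris", "located_in", "Washington"))
```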

Publication details
Date: 17 September 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Nicholas Ruiz, Qin Gao, William Lewis, and Marcello Federico

In the spoken language translation pipeline, machine translation systems that are trained solely on written bitexts are often unable to recover from speech recognition errors due to the mismatch in training data. We propose a novel technique to simulate the errors generated by an ASR system, using the ASR system’s pronunciation dictionary and language model. Lexical entries in the pronunciation dictionary are converted into phoneme sequences using a text-to-speech (TTS) analyzer and stored in a...

Publication details
Date: 1 September 2015
Type: Inproceeding
Publisher: ISCA - International Speech Communication Association
Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li

This paper proposes a novel hierarchical recurrent neural network language model (HRNNLM) for document modeling. After establishing an RNN to capture the coherence between sentences in a document, HRNNLM integrates it as the sentence history information into the word level RNN to predict the word sequence with cross-sentence contextual information. A two-step training approach is designed, in which sentence-level and word-level language models are approximated for the convergence in a pipeline style....
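As a loose illustration of feeding sentence-level context into a word-level RNN (not the HRNNLM architecture or training procedure from the paper), here is a minimal numpy sketch in which a single word-level recurrent step is conditioned on a placeholder sentence-history vector.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, S = 20, 16, 8   # vocab size, word-RNN hidden size, sentence-context size (assumed)

Wx = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden
Wh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden
Ws = rng.normal(scale=0.1, size=(H, S))   # sentence-context-to-hidden
Wo = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output

def onehot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_rnn_step(word_id, h_prev, sent_ctx):
    """One word-level step conditioned on the previous hidden state and a
    sentence-level context vector (here just a random placeholder)."""
    h = np.tanh(Wx @ onehot(word_id, V) + Wh @ h_prev + Ws @ sent_ctx)
    p = softmax(Wo @ h)          # distribution over the next word
    return h, p

sent_ctx = rng.normal(size=S)    # stand-in for the sentence-level RNN's summary
h = np.zeros(H)
for w in [3, 7, 1]:              # toy word-id sequence
    h, p = word_rnn_step(w, h, sent_ctx)
print("next-word distribution sums to", round(p.sum(), 3))
```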

Publication details
Date: 1 September 2015
Type: Inproceeding
Publisher: EMNLP
Dilek Hakkani-Tur, Yun-Cheng Ju, Geoffrey Zweig, and Gokhan Tur

Spoken language understanding (SLU) in today’s conversational systems focuses on recognizing a set of domains, intents, and associated arguments, that are determined by application developers. User requests that are not covered by these are usually directed to search engines, and may remain unhandled. We propose a method that aims to find common user intents amongst these uncovered, out-of-domain utterances, with the goal of supporting future phases of dialog system design. Our approach relies on...
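Purely as a generic illustration of grouping unhandled, out-of-domain utterances to surface candidate new intents (not the method proposed in the paper), a tiny scikit-learn sketch with made-up utterances and an arbitrary cluster count:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical unhandled utterances that a dialog system routed to search.
utterances = [
    "book me a table for two tonight",
    "reserve a table at an italian place",
    "what's my horoscope for today",
    "read me today's horoscope",
]
X = TfidfVectorizer().fit_transform(utterances)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for utt, lab in zip(utterances, labels):
    print(lab, utt)   # each cluster suggests a candidate intent to design for
```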

Publication details
Date: 1 September 2015
Type: Inproceeding
Publisher: Interspeech 2015 Conference
Daniel Preotiuc-Pietro, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras

Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on...
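For readers unfamiliar with the model family mentioned above, here is a hedged scikit-learn sketch of non-linear regression with a Gaussian Process; the features and income values are synthetic placeholders, not the paper's Twitter data or experimental setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                   # toy user-behaviour features
y = 30_000 + 5_000 * np.sin(X[:, 0]) + rng.normal(scale=500, size=50)  # toy "income"

# RBF kernel for smooth non-linear structure, WhiteKernel for observation noise.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)
pred, std = gp.predict(X[:3], return_std=True)
print(pred.round(0), std.round(1))   # predictions with uncertainty estimates
```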

Publication details
Date: 1 September 2015
Type: Article
Publisher: PLOS – Public Library of Science
Young-Bum Kim, Karl Stratos, Ruhi Sarikaya, and Minwoo Jeong

In natural language understanding (NLU), a user utterance can be labeled differently depending on the domain or application (e.g., weather vs. calendar). Standard domain adaptation techniques are not directly applicable to take advantage of the existing annotations because they assume that the label set is invariant. We propose a solution based on label embeddings induced from canonical correlation analysis (CCA) that reduces the problem to a standard domain adaptation task and allows use of a number of...
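A rough sketch of the general CCA idea behind label embeddings: correlate a one-hot label view with a feature view of the utterances carrying each label, and use the projected label view as a continuous label representation. The toy matrices below are invented, and this is not the paper's exact construction.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, n_labels, n_feats = 40, 5, 30
labels = rng.integers(0, n_labels, size=n)
Y = np.eye(n_labels)[labels]                             # one-hot label view
X = rng.poisson(1.0, size=(n, n_feats)).astype(float)    # stand-in feature view

cca = CCA(n_components=3)
X_c, Y_c = cca.fit_transform(X, Y)                       # project both views
# Average the projected label view per label to get one embedding per label.
label_emb = np.vstack([Y_c[labels == k].mean(axis=0) for k in range(n_labels)])
print(label_emb.shape)   # (5, 3): a low-dimensional embedding for each label
```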

Publication details
Date: 29 August 2015
Type: Proceedings
Publisher: ACL – Association for Computational Linguistics
Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya

In this paper, we apply the concept of pre-training to hidden-unit conditional random fields (HUCRFs) to enable learning on unlabeled data. We present a simple yet effective pre-training technique that learns to associate words with their clusters, which are obtained in an unsupervised manner. The learned parameters are then used to initialize the supervised learning process. We also propose a word clustering technique based on canonical correlation analysis (CCA) that is sensitive to multiple word...

Publication details
Date: 28 August 2015
Type: Proceedings
Publisher: ACL – Association for Computational Linguistics
Young-Bum Kim, Karl Stratos, Xiaohu Liu, and Ruhi Sarikaya

In this paper, we introduce the task of selecting a compact lexicon from large, noisy gazetteers. This scenario arises often in practice, in particular in spoken language understanding (SLU). We propose a simple and effective solution based on matrix decomposition techniques: canonical correlation analysis (CCA) and rank-revealing QR (RRQR) factorization. CCA is first used to derive low-dimensional gazetteer embeddings from domain-specific search logs. Then RRQR is used to find a subset of...
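A hedged sketch of the selection step only: given low-dimensional embeddings of gazetteer entries (random placeholders below, standing in for CCA-derived vectors), pivoted (rank-revealing) QR picks a small subset of entries whose embeddings approximately span the rest. This illustrates the RRQR idea, not the paper's full pipeline.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
entries = [f"entry_{i}" for i in range(200)]   # hypothetical gazetteer entries
E = rng.normal(size=(200, 25))                 # one embedding per entry (placeholder)

k = 20                                         # desired compact-lexicon size
# Pivoted QR on the transpose treats each entry as a column; the first k pivots
# index the most linearly independent columns, i.e. the entries to keep.
_, _, piv = qr(E.T, pivoting=True, mode='economic')
selected = [entries[i] for i in piv[:k]]
print(selected[:5])
```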

Publication details
Date: 27 August 2015
Type: Proceedings
Publisher: ACL – Association for Computational Linguistics
Timothy Baldwin, Marie Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu

This paper presents the results of the two shared tasks associated with W-NUT 2015: (1) a text normalization task with 10 participants; and (2) a named entity tagging task with 8 participants. We outline the task, annotation process and dataset statistics, and provide a high-level overview of the participating systems for each shared task.

Publication details
Date: 1 August 2015
Type: Proceedings
Publisher: ACL – Association for Computational Linguistics
Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, and Jiawei Han

Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition with increase in name ambiguity and context sparsity, requiring entity detection without domain restriction. In this paper, we investigate entity recognition (ER) with distant-supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs...

Publication details
Date: 1 August 2015
Type: Inproceeding
Publisher: ACM – Association for Computing Machinery
Chi Wang, Xueqing Liu, Yanglei Song, and Jiawei Han

Automatic construction of user-desired topical hierarchies over large volumes of text data is a highly desirable but challenging task. This study proposes to give users freedom to construct topical hierarchies via interactive operations such as expanding a branch and merging several branches. Existing hierarchical topic modeling techniques are inadequate for this purpose because (1) they cannot consistently preserve the topics when the hierarchy structure is modified; and (2) the slow inference prevents...

Publication details
Date: 1 August 2015
Type: Inproceeding
Publisher: ACM – Association for Computing Machinery
Jian Tang, Meng Qu, and Qiaozhu Mei

Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without...

Publication details
Date: 1 August 2015
Type: Inproceeding
Publisher: ACM – Association for Computing Machinery
Kristina Toutanova, Waleed Ammar, Pallavi Choudhury, and Hoifung Poon

Model selection (picking, for example, the feature set and the regularization strength) is crucial for building high-accuracy NLP models. In supervised learning, we can estimate the accuracy of a model on a subset of the labeled data and choose the model with the highest accuracy. In contrast, here we focus on type-supervised learning, which uses constraints over the possible labels for word types for supervision, and labeled data is either not available or very small. For the setting where no...

Publication details
Date: 30 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Kristina Toutanova and Danqi Chen

In this paper we show the surprising effectiveness of a simple observed features model in comparison to latent feature models on two benchmark knowledge base completion datasets – FB15K and WN18. We also compare latent and observed feature models on a more challenging dataset derived from FB15K, and additionally coupled with textual mentions from a web-scale corpus. We show that the observed features model is most effective at capturing the information present for entity pairs with textual relations,...

Publication details
Date: 30 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao

We propose a novel semantic parsing framework for question answering using a knowledge base. We define a query graph that resembles subgraphs of the knowledge base and can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. Unlike traditional approaches, our method leverages the knowledge base in an early stage to prune the search space and thus simplifies the semantic matching problem. By applying an advanced entity linking...

Publication details
Date: 28 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Rafael E. Banchs, Min Zhang, Xiangyu Duan, Haizhou Li, and A Kumaran

This report presents the results from the Machine Transliteration Shared Task conducted as part of The Fifth Named Entities Workshop (NEWS 2015) held at ACL 2015 in Beijing, China. Similar to previous editions of NEWS Workshop, the Shared Task featured machine transliteration of proper names over 14 different language pairs, including 12 different languages and two different Japanese scripts. A total of 6 teams participated in the evaluation, submitting 194 standard and 12 non-standard runs, involving...

Publication details
Date: 1 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell

Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits...

Publication details
Date: 1 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics
Igor Labutov, Sumit Basu, and Lucy Vanderwende

We develop an approach for generating deep (i.e., high-level) comprehension questions from novel text that bypasses the myriad challenges of creating a full semantic representation. We do this by decomposing the task into an ontology-crowd-relevance workflow, consisting of first representing the original text in a low-dimensional ontology, then crowd-sourcing candidate question templates aligned with that space, and finally ranking potentially relevant templates for a novel region of text. If ontological...

Publication details
Date: 1 July 2015
Type: Inproceeding
Publisher: ACL – Association for Computational Linguistics (to appear in Proceedings of ACL 2015)