A large and growing number of web pages display contex- tual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the impor- tance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with “relevant” keywords. Based on this training, it can then extract new keywords from previ- ously unseen pages. Accuracy is substantially better than several baseline systems
Implicit query systems examine a document and automatically conduct searches for the most relevant information. In this paper, we offer three contributions to implicit query research. First, we show how to use query logs from a search engine: by constraining results to commonly issued queries, we can get dramatic improvements. Second, we describe a method for optimizing parameters for an implicit query system, by using logistic regression training. The method is designed to estimate the probability that any particular suggested query is a good one. Third, we show which features beyond standard TF-IDF features are most helpful in our logistic regression model: query frequency information, capitalization information, subject line information, and message length information. Using the optimization method and the additional features, we are able to produce a system with up to 6 times better results on top-1 score than a simple TF-IDF system.
A somewhat less technical introduction to the problem of spam, methods that spammers use, and the various ways to stop it. Contains descriptions of more speculative, future-looking technology. Also, there is an allegedly pornographic picture of my wife's hands.
In this paper, we describe a simple method for selling advertising, pay-per-percentage of impressions, that is immune to both click fraud and impression fraud. We describe assumptions required to guarantee the immunity, which impact the design of the system. In particular, ads must be shown in a truly random way, across the percentage of impressions purchased. We describe prefix-match: a system that is similar to broad-match, but more compatible with pay-per-percentage. We show how to auction pay-per-percentage matches, including prefix matches in a revenue maximizing way. Finally, we describe variations on the technique that may make it easier to sell to advertisers.
This document gives a tutorial overview of different technologies for stopping spam, including the pros and cons of each technique. Topics covered include machine learning systems, matching (hashing) systems, challenge response systems (including HIPs). The paper also discusses the techniques that spammers use.
IP addresses are an important tool for fighting spam, used for safe lists, blackhole lists, anti-spoofing and related purposes. While it is trivial to find the sender's IP address in most email server software, it turns out to be surprisingly difficult to do so in email client software: we explain why. This implies that either alternative approaches are needed (typically signature-based) or that new standards for communicating between clients and servers are needed. We suggest several forms such standards might take.
In this paper we analyze a very large junk e-mail corpus which was generated by a hundred thousand volunteer users of the Hotmail e-mail service. We describe how the corpus is being collected, and analyze: the geographic origins of the e-mail; who the e-mail is targeting; and what the e-mail is selling.
We look at how statistical techniques can be used to analyze and stop spam. First, we show how to create good filters, with statistical analyses. Next, we describe challenge response approaches. We describe computational puzzles, and how these can be used with challenge response, and improved through statistical variance reduction techniques. A combination of filtering, challenge response, and others is a promising approach to spam.
We analyze the problem of preventing outgoing spam. We show that some conventional techniques for limiting outgoing spam are likely to be ineffective. We show that while imposing per message costs would work, less annoying techniques also work. In particular, it is only necessary that the average cost to the spammer over the lifetime of an account exceed his profits, meaning that not every message need be challenged. We develop three techniques, one based on additional HIP challenges, one based on computational challenges, and one based on paid subscriptions. Each system is designed to impose minimal costs on legitimate users, while being too costly for spammers. We also show that maximizing complaint rates is a key factor, and suggest new standards to encourage high complaint rates.
This document describes a plan to stop spam combining machine learning, challenge response systems, computational puzzles, Turing Tests, and monetary payment systems. The pieces are already well known in the spam fighting community, but the unique combination makes the plan feasible. Using a machine learning system to identify likely spam, we drastically reduce the number of challenges. By offering a choice of proofs (computation, turing test, or money) we make sure that there is some method of payment for any sender.
Spam is a huge and growing problem. I'll first survey solutions to spam, including filtering approaches (machine learning, fuzzy hashing, and blackhole lists) and "postage" approaches, including Turing tests, computational puzzles, and monetary challenges. Our favorite technique is, of course, a machine learning/text classification approach. I'll talk about problems and solutions we've had in practice, especially how we have collected a huge corpus of labeled training data, both good and spam. I'll also talk briefly about my current research on personalizing spam filters, which turns out to be important, but harder than we thought.
Maximum entropy models (also known as logistic regression) are a common modeling technique, but extremely prone to overfitting. We show that using an Exponential distribution as a prior leads to bounded absolute discounting by a constant. We show that this prior is better motivated by the data than previous techniques such as a Gaussian prior, and often produces lower error rates. Exponential priors also lead to a simpler learning algorithm and easier to understand behavior. Furthermore, Exponential priors help explain the success of some previous smoothing techniques, and suggest simple variations that work better.
See AAAI tutorial below. This version customized for a machine translation audience.
This tutorial covers the state-of-the-art in language modeling. The tutorial should be accessible to anyone with an elementary knowledge of probability. The tutorial includes a quick review of probability, an introduction to language modeling, and then descriptions of various techniques, including smoothing, caching, skipping, clustering, sentence-mixture models. This version includes material on parsing-based models, and more time on applications.
We demonstrate a problem with the standard technique for learning probabilistic decision lists. We describe a simple, incremental algorithm that avoids this problem, and show how to implement it efficiently. We also show a variation that adds thresholding to the standard sorting algorithm for decision lists, leading to similar improvements. Experimental results show that the new algorithm produces substantially lower error rates and entropy, while simultaneously learning lists that are over an order of magnitude smaller than those produced by the standard algorithm.
We describe a speedup for training conditional maximum entropy models. The algorithm is a simple variation on Generalized Iterative Scaling, but converges roughly an order of magnitude faster, depending on the number of constraints, and the way speed is measured. Rather than attempting to train all model parameters simultaneously, the algorithm trains them sequentially. The algorithm is easy to implement, typically uses only slightly more memory, and will lead to improvements for most maximum entropy problems.
I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline, Naive Bayes. And finally, I review the literature, showing that the ideas of the Letter are not novel. I conclude by suggesting that Physical Review Letters should not publish Letters unrelated to physics.
This paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all using the maximum likelihood principle, which is consistent with the trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
In this chapter, we repeat our previous results on speeding up data-oriented parsing (DOP). In addition, we clarify certain results, such as that our maximum constituents approach can be used wither independently, or in combination with our the fact efficient STSG approach. We also describe how recent improvements to DOP can benefit from the same speedup techniques.
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus, and on a Chinese heterogeneous corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated our research on the best way to use the clusters. Experimental results show that some novel techniques we present work much better than previous methods, and achive up to more than 40% size reduction at the same perplexity.
Language Modeling for Soft Keyboards, with Gina Venolia, Keith Steury, and Chauncey Parker. In Proceedings of Intelligent User Interfaces 2002, San Francisco. Poster version. Extended technical report version.
Language models predict the probability of letter sequences. Soft keyboards are images of keyboards on a touch screen for input on Personal Digital Assistants. When a soft keyboard user hits a key near the boundary of a key position, the language model and key press model are combined to select the most probable key sequence. This leads to an overall error rate reduction by a factor of 1.67 to 1.87.
This is an extended version of a paper that appeared in Computer Speech and Language, October 2001, pages 403-434.
In the past several years, a number of different language modeling
improvements over simple trigram models have been found, including
caching, higher-order n-grams, skipping, interpolated Kneser-Ney
smoothing, and clustering. We present explorations of variations on,
or of the limits of, each of these techniques, including showing that
sentence mixture models may have more potential. While all of these
techniques have been studied separately, they have rarely been studied
in combination. We find some significant interactions, especially
with smoothing and clustering techniques. We compare a combination of
all techniques together to a Katz smoothed trigram model with no count
cutoffs. With a very small amount of training data, we get up to a
50% reduction in perplexity -- one bit of entropy -- although, as the
data size increases, our improvement decreases; on the largest data
sets, our reduction is at most 41% for data with punctuation, 38% for
data without. Our perplexity reductions are perhaps the highest
reported compared to a fair baseline. This is the extended version of
the paper; it contains additional details and proofs, and is designed
to be a good introduction to the state of the art in language
modeling.
We show that maximum entropy models can be modeled with certain kinds of Hidden Markov Models (HMMs). This allows us to easily construct maximum entropy-style models with hidden variables, hidden state sequences, or other characteristics. The resulting models can be easily trained using standard algorithms with guaranteed locally, and in some cases globally, optimal parameter settings. We also give experimental results showing that a maximum entropy model with a hidden variable outperforms conventional techniques on subject-verb agreement.
Maximum entropy models are considered by many to be one of the most promising avenues of language modeling research. Unfortunately, long training times make maximum entropy research difficult. We present a novel speedup technique: we change the form of the model to use classes. Our speedup works by creating two maximum entropy models, the first of which predicts the class of each word, and the second of which predicts the word itself. This factoring of the model leads to fewer non-zero indicator functions, and faster normalization, achieving speedups of up to a factor of 35 over one of the best previous techniques. It also results in typically slightly lower perplexities. The same trick can be used to speed training of other machine learning techniques, e.g. neural networks, applied to any problem with a large number of outputs, such as language modeling.
Several techniques are known for reducing the size of language models, including count cutoffs, Weighted Difference pruning, Stolcke pruning, and clustering. We compare all of these techniques and show some surprising results. For instance, at low pruning thresholds, Weighted Difference and Stolcke pruning underperform count cutoffs. We then show novel clustering techniques that can be combined with Stolcke pruning to produce the smallest models at a given perplexity. The resulting models can be a factor of three or more smaller than models pruned with Stolcke pruning, at the same perplexity. The technique creates clustered models that are often larger than the unclustered models, but which can be pruned to models that are smaller than unclustered models with the same perplexity.
We synthesize work on parsing algorithms, deductive parsing, and the theory of algebra applied to formal languages into a general system for describing parsers. Each parser performs abstract computations using the operations of a semiring. The system allows a single, simple representation to be used for describing parsers that compute recognition, derivation forests, Viterbi, n-best, inside values, and other values, simply by substituting the operations of different semirings. We also show how to use the same representation interpreted differently to compute outside values. The system can be used to describe a wide variety of parsers, including Earley's algorithm, Tree Adjoining Grammar parsing, Graham Harrison Ruzzo parsing, and prefix value computation.
This is a really neat paper. Don't let the technical stuff scare you -- I'm no mathematician. If you want to compute some sort of hairy outside values, or succintly describe a complex parser, or you want a deep understanding of the dynamic programming that underlies almost all parsing algorithms, read this paper. Otherwise, don't.
In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, modified Kneser-Ney smoothing, and clustering. While all of these techniques have been studied separately, they have rarely been studied in combination. We find some significant interactions, especially with smoothing techniques. The combination of all techniques leads to up to a 45% perplexity reduction over a Katz smoothed trigram model with no count cutoffs, the highest such perplexity reduction reported.
This tutorial covers the state-of-the-art in language modeling. The tutorial should be accessible to anyone with an elementary knowledge of probability. The tutorial includes a quick review of probability, an introduction to language modeling, and then descriptions of various techniques, including smoothing, caching, skipping, clustering, sentence-mixture models, and structured language models. Experimental results comparing most of these techniques are given. See the AAAI Tutorial for a more recent version.
We present a tutorial introduction to n-gram language modeling, and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of these smoothing techniques. We investigate how factors such as training data size, training corpus, count cutoffs and n-gram order affect relative performance (measured by perplexity.) Our results show that previous comparisons have not been complete enough. We introduce methodologies for examining smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented.
The inside-outside probabilities are typically used for reestimating Probabilistic Context Free Grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs. In this thesis, I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by 500 times; and 30 times faster PCFG thresholding at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities; and a parser description formalism, which makes it easy to derive inside-outside formulas and many others.
Amusing thesis fact: Citeseer says there are 6 citations of my thesis, 5 of them from Rens Bod. Thanks Rens.
We present a new formalism, probabilistic feature grammar (PFG). PFGs combine most of the best properties of several other formalisms, including those of Collins, Magerman, and Charniak, and in experiments have comparable or better performance. PFGs generate features one at a time, probabilistically, conditioning the probabilities of each feature on other features in a local context. Because the conditioning is local, efficient polynomial time parsing algorithms exist for computing inside, outside, and Viterbi parses. PFGs can produce probabilities of strings, making them potentially useful for language modeling. Precision and recall results are comparable to the state of the art with words, and the best reported without words.
We present a variation on classic beam thresholding techniques that is up to an order of magnitude faster than the traditional method, at the same performance level. We also present a new thresholding technique, global thresholding, which, combined with the new beam thresholding, gives an additional factor of two improvement, and a novel technique, multiple pass parsing, that can be combined with the others to yield yet another 50% improvement. We use a new search algorithm to simultaneously optimize the thresholding parameters of the various algorithms.
Many different metrics exist for evaluating parsing results, including Viterbi, Crossing Brackets Rate, Zero Crossing Brackets Rate, and several others. However, most parsing algorithms, including the Viterbi algorithm, attempt to optimize the same metric, namely the probability of getting the correct labelled tree. By choosing a parsing algorithm appropriate for the evaluation metric, better performance can be achieved. We present two new algorithms: the ``Labelled Recall Algorithm,'' which maximizes the expected Labelled Recall Rate, and the ``Bracketed Recall Algorithm,'' which maximizes the Bracketed Recall Rate. Experimental results are given, showing that the two new algorithms have improved performance over the Viterbi algorithm on many criteria, especially the ones that they optimize.
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
Excellent results have been reported for Data-Oriented Parsing (DOP) of natural language texts (Bod, 1993). Unfortunately, existing algorithms are both computationally intensive and difficult to implement. Previous algorithms are expensive due to two factors: the exponential number of rules that must be generated and the use of a Monte Carlo parsing algorithm. In this paper we solve the first problem by a novel reduction of the DOP model to a small, equivalent probabilistic context-free grammar. We solve the second problem by a novel deterministic parsing strategy that maximizes the expected number of correct constituents, rather than the probability of a correct parse tree. Using the optimizations, experiments yield a 97 crossing brackets rate and 88 zero crossing brackets rate. This differs significantly from the results reported by Bod, and is comparable to results from a duplication of Pereira and Schabes's (Pereira, 1992) experiment on the same data. We show that Bod's results are at least partially due to an extremely fortuitous choice of test data, and partially due to using cleaner data than other researchers.