Stanley Kok and Wen-tau Yih
17 July 2009
Email receipts (e-receipts) frequently record e-commerce transactions between users and online retailers, and contain a wealth of product information. Such information could be used in a variety of applications if it could be reliably extracted. However, extracting product information from e-receipts poses several challenges. For example, the high labor cost of annotating e-receipts makes traditional supervised approaches infeasible. E-receipts may also be generated from a variety of templates, and are usually encoded in plain text rather than HTML, making it difficult to discover the regularity of how product information is presented. In this paper, we present an approach that addresses all these challenges. Our approach is based on Markov logic, a language that combines probability and logic. From a corpus of unlabeled e-receipts, we identify all possible templates by jointly clustering the e-receipts and the lines in them. From the non-template portions of e-receipts, we learn patterns describing how product information is laid out, and use them to extract the product information. Experiments on a corpus of real-world e-receipts demonstrate that our approach performs well. Furthermore, the extracted information can be reliably used as labeled data to bootstrap a supervised statistical model, and our experiments show that such a model is able to extract even more product information.
Published in: Sixth Conference on Email and Anti-Spam
Copyright (c) 2009