Using Sketches to Estimate Two-way and Multi-way Associations

  • Ping Li ,
  • Kenneth Church ,
  • Ken Church

MSR-TR-2005-115 |

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. The proposed method has smaller errors and more flexibility than the original sketch method. Not unsurprisingly, computational work and statistical accuracy (variance or errors) depend on sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10^-4 may suffice.