Two Approaches to Handling Noisy Variation in Text Mining

  • Un Yong Nahm
  • Mikhail Bilenko
  • Raymond J. Mooney

Proceedings of the ICML-2002 Workshop on Text Learning

Variation and noise in textual database entries
can prevent text mining algorithms from discovering
important regularities. We present
two novel methods to cope with this problem:
(1) an adaptive approach to “hardening” noisy
databases by identifying duplicate records, and
(2) mining “soft” association rules. For identifying
approximately duplicate records, we present
a domain-independent two-level method for improving
duplicate detection accuracy based on
machine learning. For mining soft matching
rules, we introduce an algorithm that discovers
association rules by allowing partial matching of
items based on a textual similarity metric such as
edit distance or cosine similarity. Experimental
results on real and synthetic datasets show that
our methods outperform traditional techniques
for noisy textual databases.
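
To make the idea of partial matching concrete, here is a minimal Python sketch of how support for an itemset might be counted when items match "softly" under edit distance. It is an illustration only, not the algorithm from the paper: the function names (edit_distance, similarity, soft_contains, soft_support), the normalization of edit distance by the longer string's length, the 0.8 threshold, and the sample records are all assumptions introduced here.

```python
from typing import Iterable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def soft_contains(record: Iterable[str], item: str, threshold: float) -> bool:
    """True if any value in the record is similar enough to the item."""
    return any(similarity(item, value) >= threshold for value in record)


def soft_support(records: Sequence[Iterable[str]],
                 itemset: Iterable[str],
                 threshold: float = 0.8) -> float:
    """Fraction of records that softly contain every item in the itemset."""
    if not records:
        return 0.0
    items = list(itemset)
    hits = sum(all(soft_contains(r, it, threshold) for it in items)
               for r in records)
    return hits / len(records)


if __name__ == "__main__":
    # Hypothetical noisy records: the same author written three ways.
    records = [
        {"Raymond J. Mooney", "text mining"},
        {"Raymond Mooney", "text mining"},
        {"R. Mooney", "data mining"},
    ]
    print(soft_support(records, ["Raymond J. Mooney", "text mining"]))
```

With exact matching, only the first record would count toward the itemset; under the assumed threshold, the second record's variant spelling also counts, which is the kind of regularity that noise would otherwise hide. The threshold trades recall against spurious matches and would need tuning per field.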