Data Cleaning

Established: July 1, 2002

Poor data quality is a well-known problem in data warehouses, arising for a variety of reasons such as data entry errors and differences in data representation among data sources. For example, one source may use abbreviated state names while another may use fully expanded state names. Yet high-quality data is essential for accurate data analysis. Data cleaning is the process of detecting and correcting errors and inconsistencies in data.

Goal

Typical data cleaning tasks include record matching, deduplication, and column segmentation, which often require logic that goes beyond traditional relational queries. This has led to the development of utilities for data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category is that of ETL tools, such as Microsoft SQL Server Integration Services (SSIS), which can be characterized as “horizontal” platforms applicable across a variety of domains. These platforms provide a suite of data cleaning operators in addition to relational operators such as select, project, and equi-join.

In the Data Cleaning project, our goal is to define a repertoire of “built-in” operators that extends traditional relational operators with a few core data cleaning operators, such that with very little extra code we can obtain a rich variety of data cleaning solutions. We also investigate their efficient implementation on horizontal ETL engines (such as SQL Server Integration Services). While generic, domain-neutral operators are desirable for wide usage, it is also important that the operators allow domain-specific customization. Examples of such customizations include plugging in synonyms (e.g., Robert and Bob when dealing with people's names), domain-specific tokenization, and token weighting.
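To make these customization hooks concrete, the following is a minimal Python sketch of a weighted token similarity with pluggable synonyms, tokenization, and token weights. The names (SYNONYMS, TOKEN_WEIGHTS, tokenize, weighted_jaccard) and the weighted-Jaccard measure are illustrative assumptions, not the project's actual operators or API.

    from typing import Dict, Set

    # Hypothetical customization hooks; a real deployment would plug in
    # domain-specific synonym tables and token weights.
    SYNONYMS: Dict[str, str] = {"bob": "robert"}    # map variant tokens to a canonical form
    TOKEN_WEIGHTS: Dict[str, float] = {"inc": 0.1}  # down-weight uninformative tokens

    def tokenize(s: str) -> Set[str]:
        """Whitespace tokenizer with synonym normalization."""
        return {SYNONYMS.get(t, t) for t in s.lower().split()}

    def weight(token: str) -> float:
        """Per-token weight; unknown tokens default to 1.0."""
        return TOKEN_WEIGHTS.get(token, 1.0)

    def weighted_jaccard(a: str, b: str) -> float:
        """Weighted Jaccard similarity between two strings."""
        ta, tb = tokenize(a), tokenize(b)
        union = sum(weight(t) for t in ta | tb)
        if union == 0.0:
            return 0.0
        return sum(weight(t) for t in ta & tb) / union

    print(weighted_jaccard("Robert Smith Inc", "Bob Smith"))  # ~0.95: synonyms and weighting at work

A record-matching operator such as Fuzzy Lookup would pair a similarity measure of this kind with a similarity index, so candidate matches can be retrieved without comparing every pair of records.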

The operators that we have identified and implemented so far include:

  • Fuzzy Lookup (FL), one of the core operators, which performs record matching. The matching can be customized by configuring tokenization and token weighting, and by providing transformation rules. We address efficiency by designing a similarity index.
  • Fuzzy Grouping, which is used for deduplication; it clusters the pairwise matches produced by Fuzzy Lookup (see the sketch after this list).
  • Column segmentation, which segments input strings based on regular-expression specifications that can also reference tables in a database.
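To illustrate the clustering step behind Fuzzy Grouping, here is a minimal sketch that groups records whose pairwise similarity scores exceed a threshold by computing connected components with a union-find structure. The threshold value, the input format, and the helper names are illustrative assumptions, not the shipped implementation.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    def cluster_matches(records: List[str],
                        matches: List[Tuple[int, int, float]],
                        threshold: float = 0.85) -> List[List[str]]:
        """Group records whose pairwise similarity meets the threshold
        (hypothetical deduplication helper, not the actual operator)."""
        parent = list(range(len(records)))

        def find(i: int) -> int:
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        def union(i: int, j: int) -> None:
            parent[find(i)] = find(j)

        for i, j, score in matches:
            if score >= threshold:
                union(i, j)

        groups: Dict[int, List[str]] = defaultdict(list)
        for idx, rec in enumerate(records):
            groups[find(idx)].append(rec)
        return list(groups.values())

    records = ["Robert Smith", "Bob Smith", "Alice Jones"]
    matches = [(0, 1, 0.95), (0, 2, 0.10)]    # (index_a, index_b, similarity) from record matching
    print(cluster_matches(records, matches))  # [['Robert Smith', 'Bob Smith'], ['Alice Jones']]

Each resulting group can then be collapsed into a single canonical record to complete deduplication.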

One of the contributions of the project is a Data Profiling tool that analyzes data quality through a variety of statistics, ranging from simple statistics such as the number of null values in a column to more complex statistics such as the strength of keys, foreign keys, and functional dependencies.
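As a rough illustration of the kinds of statistics involved, the sketch below computes per-column null counts, a key-strength score (fraction of distinct values), and the strength of a candidate functional dependency city -> state. The pandas-based approach, the column names, and the exact definitions are assumptions for illustration, not the tool's implementation.

    import pandas as pd

    # Toy table; real profiling runs over warehouse tables.
    df = pd.DataFrame({
        "id":    [1, 2, 3, 4],
        "city":  ["Seattle", "Seattle", "Seattle", "Palo Alto"],
        "state": ["WA", "WA", "Washington", None],
    })

    # Simple statistic: number of null values in each column.
    null_counts = df.isna().sum()

    # Key strength: fraction of distinct values; 1.0 means the column is a key.
    key_strength = {c: df[c].nunique(dropna=False) / len(df) for c in df.columns}

    # Strength of the functional dependency city -> state: fraction of rows
    # agreeing with the most frequent state within each city group.
    fd_strength = (
        df.groupby("city")["state"]
          .apply(lambda s: s.value_counts(dropna=False).iloc[0])
          .sum() / len(df)
    )

    print(null_counts)   # state has 1 null
    print(key_strength)  # id: 1.0, city: 0.5, state: 0.75
    print(fd_strength)   # 0.75: the "Washington" row breaks the dependency for Seattle

Low key strength or low functional-dependency strength of this kind is exactly the sort of signal that points an analyst toward columns needing cleaning.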

Impact

If you have questions about this project, please contact the Data Cleaning research team (dcrt@microsoft.com).