Share this page
Share this page E-mail this page Print this page RSS feeds
Home > Projects > Data Cleaning
Data Cleaning

Data warehouses, which are repositories of data collected from several data sources, form the backbone of most current CRM and decision support applications. Since data sources are independent, they may adopt independent and potentially inconsistent conventions. For example, one source may adopt the use of standard while another source adopts the use of fully expanded descriptions. Moreover, data entry mistakes at any of these sources introduce more errors. Since high quality data is essential for gaining the confidence of users of CRM and decision support applications developed over data warehouses, ensuring high data quality is critical to the success of data warehouse implementations. Therefore, significant amount of time and money are spent on the process of detecting and correcting errors and inconsistencies. The process of cleaning dirty data is often referred to as data cleaning. Since the types of errors and inconsistencies can be domain-specific, it is important and challenging to develop generic dom

Goal

Data cleaning is an essential step in populating and maintaining data warehouses. Owing to likely differences in conventions between the external sources and the target data warehouse as well as due to a variety of errors, data from external sources may not conform to the standards and requirements at the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into the warehouse so that downstream data analysis is reliable and accurate. This is usually accomplished through an Extract-Transform-Load (ETL) process.

Typical data cleaning tasks include record matching, deduplication, and column segmentation which often go beyond traditional relational operators. This has led to development of utilities that support data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals such as Trillium that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS) that can be characterized as ``horizontal'' platforms that are applicable across a variety of domains. These platforms provide a suite of operators including relational operators such as select, project and equi-join. A common feature across these frameworks is extensibility--applications can plug in their own custom operators. A data transformation and cleaning solution is built by composing these (default and custom) operators to obtain an operator tree or a graph.

While the second category of software can in principle support arbitrarily complex logic by virtue of being extensible, it has the obvious limitation that most of the data cleaning logic needs to be incorporated as custom code since creating optimized custom code for data cleaning software is nontrivial. It would be desirable to extend its repertoire of "built-in" operators beyond traditional relational operators with a few core data cleaning operators such that with very less extra code, we can obtain a rich variety of data cleaning solutions.

In our Data Cleaning project, we seek to achieve the above goal. Thus, we aspire to identify key primitive data cleaning operators and then ensure their efficient implementation on horizontal ETL engines such as SSIS. Thus, we adopt the approach of developing a domain-neutral framework of generic data cleaning operators. We believe that decomposing a data cleaning solution into simpler well-defined operators makes it easier to compose data cleaning operators with each other and with other (relational and non-relational) operators.

If you have questions about this project, please contact the Data Cleaning research team (dcrt@microsoft.com).

Publications