Data Debugger Project

 
Goal

Data cleaning is an essential step in populating and maintaining data warehouses. Because external sources and the target data warehouse are likely to follow different conventions, and because of a variety of errors, data from external sources may not conform to the standards and requirements of the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into the warehouse so that downstream data analysis is reliable and accurate. This is usually accomplished through an Extract-Transform-Load (ETL) process.

Typical data cleaning tasks include record matching, deduplication, and column segmentation, which often go beyond traditional relational operators. This has led to the development of utilities that support data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals such as Trillium that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The second category consists of ETL tools such as Microsoft SQL Server Integration Services (SSIS), which can be characterized as "horizontal" platforms that are applicable across a variety of domains. These platforms provide a suite of operators, including relational operators such as select, project, and equi-join. A common feature across these frameworks is extensibility: applications can plug in their own custom operators. A data transformation and cleaning solution is built by composing these (default and custom) operators to obtain an operator tree or graph.
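To illustrate what composing default and custom operators can look like, here is a minimal Python sketch. It is not SSIS code and does not reflect any SSIS API. The names select, project, normalize_whitespace, and compose are hypothetical, and a linear chain is used as the simplest case of an operator tree.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]
Operator = Callable[[Iterable[Row]], Iterator[Row]]


def select(predicate: Callable[[Row], bool]) -> Operator:
    """Relational select: keep only rows that satisfy the predicate."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        return (r for r in rows if predicate(r))
    return run


def project(*columns: str) -> Operator:
    """Relational project: keep only the named columns."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        return ({c: r[c] for c in columns} for r in rows)
    return run


def normalize_whitespace(column: str) -> Operator:
    """Hypothetical custom operator: trim and collapse whitespace in one column."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for r in rows:
            cleaned = dict(r)
            cleaned[column] = " ".join(r[column].split())
            yield cleaned
    return run


def compose(*operators: Operator) -> Operator:
    """Chain operators so that each one consumes the previous one's output."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        for op in operators:
            rows = op(rows)
        return iter(rows)
    return run


# A small pipeline mixing relational operators with the custom one.
pipeline = compose(
    select(lambda r: r["country"] == "US"),
    normalize_whitespace("name"),
    project("name", "city"),
)

source = [
    {"name": "  Jane   Doe ", "city": "Seattle", "country": "US"},
    {"name": "John Roe", "city": "Paris", "country": "FR"},
]
result = list(pipeline(source))
```

Because every operator shares the same rows-in, rows-out shape, a custom cleaning step can be placed anywhere among the relational ones without special handling.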

While the second category of software can in principle support arbitrarily complex logic by virtue of being extensible, it has the obvious limitation that most of the data cleaning logic needs to be incorporated as custom code, and creating optimized custom code for data cleaning is nontrivial. It would be desirable to extend its repertoire of "built-in" operators beyond traditional relational operators with a few core data cleaning operators, such that with very little extra code we can obtain a rich variety of data cleaning solutions.

In our Data Debugger project, we seek to achieve the above goal. We aspire to identify key primitive data cleaning operators and ensure their efficient implementation on horizontal ETL engines such as SSIS. To that end, we adopt the approach of developing a domain-neutral framework of generic data cleaning operators. We believe that decomposing a data cleaning solution into simpler, well-defined operators makes it easier to compose data cleaning operators with each other and with other (relational and non-relational) operators.
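As a rough illustration of what a domain-neutral primitive might look like, the sketch below implements a naive similarity join in Python using Jaccard similarity over token sets. It is an assumption-laden example, not the project's implementation: the names word_tokens, jaccard, and similarity_join are hypothetical, the nested-loop evaluation is deliberately simple rather than efficient, and the sample company names are made up.

```python
from typing import Callable, Iterable, List, Set, Tuple


def word_tokens(text: str) -> Set[str]:
    """Default tokenizer (an assumption): lowercase, whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: |a AND b| / |a OR b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(
    left: Iterable[str],
    right: Iterable[str],
    threshold: float,
    tokenize: Callable[[str], Set[str]] = word_tokens,
) -> List[Tuple[str, str, float]]:
    """Return every (left, right, score) pair whose similarity meets the threshold.

    Naive nested-loop evaluation, for illustration only.
    """
    right_tokens = [(r, tokenize(r)) for r in right]
    matches = []
    for l in left:
        l_tokens = tokenize(l)
        for r, r_tokens in right_tokens:
            score = jaccard(l_tokens, r_tokens)
            if score >= threshold:
                matches.append((l, r, score))
    return matches


# Matching incoming records against a reference table.
incoming = ["Microsoft Corp Redmond", "Contoso Ltd"]
reference = ["Microsoft Corp", "Contoso Limited", "Fabrikam Inc"]
pairs = similarity_join(incoming, reference, threshold=0.5)
```

The operator knows nothing about addresses, names, or products. Domain knowledge enters only through the tokenize function and the threshold, which is what lets the same primitive be reused across record matching and deduplication tasks.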

 
People

Arvind Arasu

Surajit Chaudhuri

Zhimin Chen

Kris Ganjam

Venkatesh Ganti

Gunjan Jha

Raghav Kaushik

Vivek Narasayya

 

 
Publications

The following papers are in PDF format.

Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik. Transformation based Framework for Record Matching. ICDE 2008.

Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, Raghav Kaushik. Example Driven Design of Efficient Record Matching Queries. VLDB 2007.

Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, Raghav Kaushik. Leveraging Aggregate Constraints for Deduplication. SIGMOD 2007, Beijing, China.

Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. Proceedings of the 32nd International Conference on Very Large Databases (VLDB) 2006, Seoul, South Korea.

Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the International Conference on Data Engineering (ICDE) 2006, Atlanta.

Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. Robust Identification of Fuzzy Duplicates. Proceedings of the International Conference on Data Engineering (ICDE) 2005, Tokyo, Japan.

Eugene Agichtein, Venkatesh Ganti. Mining Reference Tables for Automatic Text Segmentation. Proceedings of the International Conference on Knowledge Discovery in Databases (SIGKDD) 2004, Seattle, WA.

Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. Proceedings of the ACM Conference on Management of Data (SIGMOD) 2003, San Diego.

Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. Proceedings of the 28th International Conference on Very Large Databases (VLDB) 2002, Hong Kong.

If you have questions about this project, please contact Surajit Chaudhuri (surajitc@microsoft.com).


©2008 Microsoft Corporation. All rights reserved.