Data warehouses, which are repositories of data collected from several data sources, form the backbone of most current CRM and decision support applications. Since data sources are independent, they may adopt independent and potentially inconsistent conventions. For example, one source may adopt the use of standard while another source adopts the use of fully expanded descriptions. Moreover, data entry mistakes at any of these sources introduce more errors. Since high quality data is essential for gaining the confidence of users of CRM and decision support applications developed over data warehouses, ensuring high data quality is critical to the success of data warehouse implementations. Therefore, significant amount of time and money are spent on the process of detecting and correcting errors and inconsistencies. The process of cleaning dirty data is often referred to as data cleaning. Since the types of errors and inconsistencies can be domain-specific, it is important and challenging to develop generic dom
Goal
Data cleaning is an essential step in populating and maintaining data warehouses. Owing to likely differences in conventions between the external sources and the target data warehouse as well as due to a variety of errors, data from external sources may not conform to the standards and requirements at the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into the warehouse so that downstream data analysis is reliable and accurate. This is usually accomplished through an Extract-Transform-Load (ETL) process.
Typical data cleaning tasks include record matching, deduplication, and column segmentation which often go beyond traditional relational operators. This has led to development of utilities that support data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals such as Trillium that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS) that can be characterized as ``horizontal'' platforms that are applicable across a variety of domains. These platforms provide a suite of operators including relational operators such as select, project and equi-join. A common feature across these frameworks is extensibility--applications can plug in their own custom operators. A data transformation and cleaning solution is built by composing these (default and custom) operators to obtain an operator tree or a graph.
While the second category of software can in principle support arbitrarily complex logic by virtue of being extensible, it has the obvious limitation that most of the data cleaning logic needs to be incorporated as custom code since creating optimized custom code for data cleaning software is nontrivial. It would be desirable to extend its repertoire of "built-in" operators beyond traditional relational operators with a few core data cleaning operators such that with very less extra code, we can obtain a rich variety of data cleaning solutions.
In our Data Cleaning project, we seek to achieve the above goal. Thus, we aspire to identify key primitive data cleaning operators and then ensure their efficient implementation on horizontal ETL engines such as SSIS. Thus, we adopt the approach of developing a domain-neutral framework of generic data cleaning operators. We believe that decomposing a data cleaning solution into simpler well-defined operators makes it easier to compose data cleaning operators with each other and with other (relational and non-relational) operators.
If you have questions about this project, please contact the Data Cleaning research team (dcrt@microsoft.com).
- Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB, Very Large Data Bases Endowment Inc., August 2009
- Arvind Arasu and Raghav Kaushik, A Declarative Entity Representation Framework, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2009
- Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD, Association for Computing Machinery, Inc., June 2009
- Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, IEEE Computer Society, 29 March 2009
- Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, IEEE Computer Society, June 2008
- Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik., Leveraging Aggregate Constraints for Deduplication, in SIGMOD, Association for Computing Machinery, Inc., 2007
- Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB, Very Large Data Bases Endowment Inc., 2007
- Eugene Agichtein and Venkatesh Ganti, Mining reference tables for automatic text segmentation, in SIGKDD, Association for Computing Machinery, Inc., 2004
- Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, Association for Computing Machinery, Inc., 2003
- Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti, Eliminating Fuzzy Duplicates in Data Warehouses, in VLDB, 2002



