Poor data quality is a well-known problem in data warehouses that arises for a variety of reasons such as data entry errors and differences in data representation among data sources. For example, one source may use abbreviated state names while another source may use fully expanded state names. However, high quality data is essential for accurate data analysis. Data cleaning is the process of detecting and correcting errors and inconsistencies in data.
Typical data cleaning tasks include record matching, deduplication, and column segmentation which often need logic that go beyond using traditional relational queries. This has led to development of utilities for data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS) that can be characterized as “horizontal” platforms that are applicable across a variety of domains. These platforms provide a suite of additional data cleaning operators but also including relational operators such as select, project and equi-join.
In the Data Cleaning project, our goal is to define a repertoire of "built-in" operators beyond traditional relational operators with a few core data cleaning operators such that with very less extra code, we can obtain a rich variety of data cleaning solutions. We also investigate their efficient implementation on horizontal ETL engines (such as. SQL Server Integration Services). While for wide usage it is desirable to have generic, domain-neutral operators, it is also important that the operators allow domain specific customizations. Examples of such customizations include plugging in synonyms (e.g., Robert and Bob while dealing with people names), domain-specific tokenization, and token weighting.
The operators that we have identified and implemented so far include
- Fuzzy Lookup (FL) that is used as one of the core operators to perform record matching. The matching can be customized by configuring the tokenization, token weighting and providing transformation rules. We address efficiency by designing a similarity index.
- Fuzzy Grouping that is used for deduplication. Fuzzy Grouping clusters the pairwise matches produced by Fuzzy Lookup.
- Column segmentation that segments input strings based on specification of regular expressions that can also reference tables in a database.
One of the contributions of the project is a Data Profiling tool that is used to analyze the data quality through a variety of statistics ranging from simple statistics such as the number of null values in a column to more complex statistics such as the strength of keys, foreign keys and functional dependencies.
- Fuzzy Lookup and Fuzzy grouping are currently shipping as part of Microsoft SQL Server Integration Services (SSIS)
- Fuzzy Lookup is used for geocoding incoming queries in Bing Maps.
- Column Segmentation and Fuzzy Grouping are used for de-duplication of product names and descriptions at the back-end of Bing Shopping.
- Fuzzy Lookup is used in Microsoft’s internal master data management project that maintains information about Microsoft’s customers to match new customers with existing customers.
- The Data Profiling technology ships as part of Microsoft SQL Server Integration Services (SSIS). A part of the technology that discovers foreign keys in a database ships in Microsoft PowerPivot for Excel as part of the SQL Server 2008 R2 release.
- We have released a Fuzzy Lookup Add-in for Microsoft Excel 2010 via Microsoft Business Intelligence (BI) Labs that allows users to invoke our approximate matching technology on spreadsheet data in Excel.
If you have questions about this project, please contact the Data Cleaning research team (firstname.lastname@example.org).
- Arvind Arasu, Michaela Götz, and Raghav Kaushik, On Active Learning of Record Matching Packages, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2010.
- Parag Agrawal, Arvind Arasu, and Raghav Kaushik, On Indexing Error-Tolerant Set Containment, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2010.
- Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB, Very Large Data Bases Endowment Inc., August 2009.
- Arvind Arasu and Raghav Kaushik, A Declarative Entity Representation Framework, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), June 2009.
- Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD, Association for Computing Machinery, Inc., June 2009.
- Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, IEEE Computer Society, 29 March 2009.
- Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, IEEE Computer Society, June 2008.
- Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik, Leveraging Aggregate Constraints for Deduplication, in SIGMOD, Association for Computing Machinery, Inc., 2007.
- Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB, Very Large Data Bases Endowment Inc., 2007.
- Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik, Efficient Exact Set-Similarity Joins, in Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB 2006, Very Large Data Bases Endowment Inc., August 2006.
- Eugene Agichtein and Venkatesh Ganti, Mining reference tables for automatic text segmentation, in SIGKDD, Association for Computing Machinery, Inc., 2004.
- Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, Association for Computing Machinery, Inc., 2003.
- Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti, Eliminating Fuzzy Duplicates in Data Warehouses, in VLDB, 2002.