Data Debugger Project
Goal
Data cleaning is an essential step in populating
and maintaining data warehouses. Data from external sources may
not conform to the standards and requirements of the target data
warehouse, both because the sources and the warehouse are likely
to follow different conventions and because of a variety of errors.
The data therefore has to be transformed and cleaned before it is
loaded into the warehouse, so that downstream data analysis is
reliable and accurate. This is usually accomplished through an
Extract-Transform-Load (ETL) process.
People
Gunjan Jha
Publications
Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik. Transformation based Framework for Record Matching. ICDE 2008.
Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, Raghav Kaushik. Example Driven Design of Efficient Record Matching Queries. VLDB 2007.
Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, Raghav Kaushik. Leveraging Aggregate Constraints for Deduplication. SIGMOD 2007, Beijing, China.
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. Proceedings of the 32nd International Conference on Very Large Databases (VLDB) 2006, Seoul, South Korea.
Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. Proceedings of the International Conference on Data Engineering (ICDE) 2006, Atlanta.
Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. Robust Identification of Fuzzy Duplicates. Proceedings of the International Conference on Data Engineering (ICDE) 2005, Tokyo, Japan.
Eugene Agichtein, Venkatesh Ganti. Mining Reference Tables for Automatic Text Segmentation. Proceedings of the International Conference on Knowledge Discovery in Databases (SIGKDD) 2004, Seattle, WA.
Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. Proceedings of the 28th International Conference on Very Large Databases (VLDB) 2002, Hong Kong.
Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. Proceedings of the ACM Conference on Management of Data (SIGMOD) 2003, San Diego.
If you have questions about this project, please contact Surajit Chaudhuri (surajitc@microsoft.com).