Surajit Chaudhuri

Data Management, Exploration and Mining, Microsoft Research

Research Interests

  • Self-Tuning Database Systems
  • Monitoring Database Systems
  • Data Cleaning
  • Synergy of Information Retrieval and Databases
  • Query Optimization

Projects

I lead the Data Management, Exploration and Mining group at Microsoft Research.

I am actively involved with the AutoAdmin project, which we started in 1997. The goal of this project is to make databases self-tuning and self-administering by exploiting knowledge of the workload. Our primary focus has been on automated physical database design (VLDB 1997, SIGMOD 1998, VLDB 2000) as well as on automated statistics management in relational systems. I work closely with the other members of this project and with the Microsoft SQL Server product group on this research. The Index Tuning Wizards in Microsoft SQL Server 7.0 and SQL Server 2000 are based on technology that we developed as part of this project. They were the first workload-driven commercial physical design tools for relational systems, recommending indexes and indexes plus materialized views, respectively. We are further expanding the scope of the automated physical design technology in the Database Tuning Advisor feature of the upcoming SQL Server 2005 release. In 1998, we initiated work on exploiting execution feedback to define “self-tuning” histograms (SIGMOD 1999, SIGMOD 2001). More recently, I have become interested in the problem of monitoring database systems. Specifically, we worked on the problem of estimating the progress of SQL queries (“What percentage of the query execution has been completed?”; SIGMOD 2004, SIGMOD 2005) as well as on a broader architecture for monitoring database servers (SQLCM; IEEE ICDE 2004).

The Data Cleaning project develops tools and server infrastructure to support data preparation, an essential step before effective data analysis, be it simple aggregation, OLAP or data mining. Our work in this area strives to uncover fundamental, generic building blocks that allow data cleaning to be defined flexibly. In cooperation with SQL Server, we will be enabling fuzzy matching and fuzzy de-duplication operations for the first time in the upcoming SQL Server 2005 product (as part of Data Transformation Services).

Both text documents and structured relational data are sources of information. Integrated querying and browsing of structured relational databases and of text is vitally important for our ability to harness information effectively. I am investigating how relational querying can be enriched by borrowing ideas from information retrieval. These include supporting keyword-based search over databases as well as automatic ranking of answers to database queries. This technology promises to address the “empty answer” problem (you ask a query and get no hits) and the “many answers” problem (you ask a query and get too many hits) in databases. Our papers in IEEE ICDE 2002, CIDR 2003, VLDB 2004 and CIDR 2005 highlight our research directions.

Finally, I am interested in understanding the database systems challenges involved in supporting business intelligence and decision support more effectively on database platforms. In the past, I have worked on optimization of complex SQL queries, e.g., optimization of queries with group-by (VLDB 1994), user-defined predicates (VLDB 1996), exploiting factorization for index union/intersection plans (SIGMOD 2003), and data mining predicates (IEEE ICDE 2002). My more recent focus is on revisiting the fundamental assumptions in query optimization. Brian Babcock and I have a recent paper on this topic in SIGMOD 2005.

Selected Professional Activities

  • ACM Transactions on Database Systems (TODS): Associate Editor
  • IEEE Transactions on Knowledge and Data Engineering (TKDE): Associate Editor, 2001-2005
  • IEEE Data Engineering Bulletin: Associate Editor, 1998-1999
  • ACM Digital Review: Member of the Editorial Board
  • 2006 ACM Conference on Management of Data (SIGMOD): Program Chair
  • 1999 ACM Conference on Knowledge Discovery and Data Mining (KDD): Program Co-chair
  • 2003 ACM SIGMOD Conference: Industrial Track Chair and Member of the Best Paper Awards Committee
  • 2001 ACM Conference on Knowledge Discovery and Data Mining: Industrial Track Co-chair
  • 1999 ACM SIGMOD Conference: Industrial Track Co-chair
  • 1998 IEEE Conference on Data Engineering (ICDE): Industrial Track Chair
  • 2002 IEEE Conference on Data Engineering (ICDE): Chair, OLAP and Data Warehousing Track
  • 2002 VLDB 10-year award committee, Member
  • NSF Panelist

Invited Talks and Tutorials

  • Surajit Chaudhuri, Gerhard Weikum: Foundations of automated database tuning, Tutorial presented at ACM SIGMOD 2005.
  • Surajit Chaudhuri, Benoît Dageville, Guy M. Lohman: Self-Managing Technology in Database Management Systems, Tutorial presented at VLDB 2004.
  • Databases and IR: Perspectives of a SQL Guy, NSF Information and Data Management PI Workshop, Seattle, 2003.
  • Storage and Retrieval of XML Data Using Relational Databases. Tutorial presented at VLDB 2001 and IEEE ICDE 2002 Conferences.
  • An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, March 1997 (with Umesh Dayal). Tutorials presented at 1996 VLDB, 1997 SIGMOD, 1998 EDBT and 1998 IEEE ICDE Conferences.
  • An Overview of Query Optimization in Relational Systems. Proceedings of 1998 ACM PODS. Invited Tutorial at ACM PODS Conference, 1998.

Selected Recent Publications

For a complete list of my publications, please see DBLP.

  • Towards a Robust Query Optimizer: A Principled and Practical Approach, ACM SIGMOD 2005. (with Brian Babcock)
  • When Can We Trust Progress Estimators for SQL Queries? ACM SIGMOD 2005. (with Raghav Kaushik, Ravishankar Ramamurthy)
  • Automatic Physical Database Tuning: A Relaxation-based Approach. ACM SIGMOD 2005.  (with Nicolas Bruno)
  • Robust Identification of Fuzzy Duplicates. IEEE ICDE 2005. (with Venkatesh Ganti, Rajeev Motwani)
  • Effective Use of Block-Level Sampling in Statistics Estimation. ACM SIGMOD 2004. (with Gautam Das and Utkarsh Srivastava)
  • Probabilistic Ranking of Database Query Results. VLDB 2004. (with Gautam Das, Vagelis Hristidis, Gerhard Weikum)
  • Estimating Progress of Long Running SQL Queries. ACM SIGMOD 2004. (with Vivek Narasayya, Ravishankar Ramamurthy)
  • SQLCM: A Continuous Monitoring Framework for Relational Database Engines. IEEE ICDE 2004. (with Christian König, Vivek Narasayya)
  • Factorizing Complex Predicates in Queries to Exploit Indexes. SIGMOD 2003. (with Prasanna Ganesan, Sunita Sarawagi) 
  • Robust and efficient fuzzy match for online data cleaning, ACM SIGMOD 2003 (with Kris Ganjam, Venkatesh Ganti, Rajeev Motwani).
  • Automated Ranking of Database Query Results. CIDR 2003 (with Sanjay Agrawal, Gautam Das, and Aristides Gionis)
  • DBXplorer: A System For Keyword-Based Search Over Relational Databases. IEEE ICDE 2002. (with Sanjay Agrawal and Gautam Das).
  • Efficient Evaluation of Queries with Mining Predicates. Proceedings of IEEE International Conference on Data Engineering, 2002. (with Vivek Narasayya and Sunita Sarawagi). 
  • STHoles: A Multidimensional Workload-Aware Histogram. Proceedings of the ACM SIGMOD 2001.  (with Nicolas Bruno and Luis Gravano).
  • Integrating Data Mining with SQL Databases: OLE DB for Data Mining, Proceedings of 17th International Conference on Data Engineering, 2001 (with Amir Netz, Usama M. Fayyad, Jeff Bernhardt)
  • Overcoming Limitations of Sampling for Aggregation Queries. Proceedings of 17th International Conference on Data Engineering, 2001 (with Gautam Das, Mayur Datar, Rajeev Motwani and Vivek Narasayya).
  • Rethinking Database System Architecture: Towards a Self-tuning, RISC-style Database System. Proceedings of the 26th International Conference on Very Large Databases (VLDB00) (with Gerhard Weikum).
  • Automated Selection of Materialized Views and Indexes for SQL Databases. Proceedings of the 26th International Conference on Very Large Databases (VLDB00) (with Sanjay Agrawal and Vivek Narasayya).
  • Towards Estimation Error Guarantees for Distinct Values. 19th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, Dallas, USA. 2000 (with Moses Charikar, Rajeev Motwani, and Vivek Narasayya).
  • Automating Statistics Management for Query Optimizers. Proceedings of 16th International Conference on Data Engineering, San Diego, USA 2000 (with Vivek Narasayya).
  • Evaluating Top-k Selection Queries. Proceedings of 25th VLDB Conference, Edinburgh, Scotland, UK. 1999 (with Luis Gravano)
  • Self-Tuning Histograms: Building Histograms Without Looking at Data, Proceedings of ACM SIGMOD, Philadelphia, 1999 (with Ashraf Aboulnaga)
  • On Random Sampling over Joins, ACM SIGMOD 1999 (with Rajeev Motwani and Vivek Narasayya)
  • Random Sampling for Histogram Construction: How much is enough? Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya and Rajeev Motwani)
  • AutoAdmin "What-If" Index Analysis Utility. Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya).
  • An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. Proceedings of the 23rd International Conference on Very Large Databases (VLDB97), Athens, Greece, 1997, pp. 146-155 (with Vivek Narasayya).
  • Data Mining and Database Systems: Where is the Intersection? IEEE Data Engineering Bulletin, March 1998

Selected Publications (Pre-MSR)

  • Optimizing Queries with User-Defined Predicates, VLDB Conference 1996 (with Kyuseok Shim)
  • Optimizing Queries over Multimedia Repositories, SIGMOD Conference 1996 (with Luis Gravano)
  • Optimizing Queries with Aggregate Views, EDBT 1996 (with Kyuseok Shim)
  • An Overview of Cost-based Optimization of Queries with Aggregates Data Engineering Bulletin 18(3): 3-9, 1995 (with Kyuseok Shim)
  • Join Queries with External Text Sources: Execution and Optimization Techniques SIGMOD Conference 1995: 410-422 (with Umeshwar Dayal and Tak W. Yan)
  • Optimizing Queries with Materialized Views ICDE 1995: 190-200 (with Ravi Krishnamurthy, Spyros Potamianos, Kyuseok Shim)
  • Including Group-By in Query Optimization VLDB 1994: 354-366 (with Kyuseok Shim)
  • Optimization of Real Conjunctive Queries PODS 1993: 59-70 (with Moshe Y. Vardi)

Microsoft Research
One Microsoft Way
Redmond, WA 98052 USA

Contact information (please, no soliciting):
Email surajitc@microsoft.com
Telephone 425-703-1938
Fax 425-936-7329