Surajit Chaudhuri

DEPUTY MG. DIR - DISTING. SCI.

Bio

I am a distinguished scientist and the managing director of XCG (part of Microsoft Research). My technical work is done in collaboration with members of the Data Management, Exploration and Mining group in XCG. Prior to joining Microsoft Research in January 1996, I worked at HP Labs, Palo Alto, from 1992 to 1995. I received my Ph.D. from Stanford University and my B.Tech. from IIT, Kharagpur.

Research Interests

  • Self-Tuning Technology for Database Systems
  • Multi-Tenant Database Systems
  • Enterprise Data Analytics
  • Text Analytics, Structured Data and Search

Projects I worked on

I started the AutoAdmin project in 1996, soon after joining MSR. The goal of this project is to make databases self-tuning and self-administering by exploiting knowledge of the workload. Vivek Narasayya was my primary collaborator in the early years, and we were subsequently joined by other colleagues in this effort. Our primary focus was on automated physical database design as well as automated statistics management in relational systems. The Index Tuning Wizards in Microsoft SQL Server 7.0 and SQL Server 2000 are based on the technology that we developed as part of this project. They were the first workload-driven commercial physical design tools for relational systems, recommending indexes and indexes plus materialized views, respectively. The scope of the automated physical design technology has since been expanded and made available in the Database Tuning Advisor feature of SQL Server 2005 and subsequent releases. The AutoAdmin project page has a detailed description of the project and its publications. Recently, I have become interested in the related problem of resource management for multi-tenant database systems.

I initiated the Data Cleaning project in 2000 with the goal of developing tools and server infrastructure to support data preparation, an essential step before effective data analysis. Venkatesh Ganti was our leading researcher on this project in the early years. Our work led to the fuzzy matching and fuzzy de-duplication transforms in the SQL Server Integration Services component of SQL Server 2005 and subsequent releases. In recent years, we have incorporated our Data Cleaning technology in Bing.

Text documents as well as structured relational data are sources of our information. Understanding the synergy between these two sources of information has been a longstanding interest of mine. I started looking at this problem in the mid-nineties (SIGMOD 1995), when we studied the problem of "join" between relational tables and text repositories. Later, we investigated the problem of keyword search over structured databases (IEEE ICDE 2002) and the problem of auto-ranking of answers in database queries (CIDR 2003, VLDB 2004, CIDR 2005). More recently, we have been looking at the problem of entity search (WWW 2008). Ideas from this project have been incorporated in Bing.

Last but not least, I am interested in the problem of supporting business intelligence and decision support queries more effectively on data platforms. In the past, I have worked on optimization of complex SQL queries, e.g., optimization of queries with group-by (VLDB 1994), user-defined predicates (VLDB 1996), exploiting factorization for index union/intersection plans (SIGMOD 2003), and data mining predicates (IEEE ICDE 2002). One of the directions I have pursued is that of revisiting the fundamental assumptions in query optimization (SIGMOD 2005, SIGMOD 2009). Currently, I am exploring techniques and tools for "Big Data" enterprise analytic platforms.

Awards

  • 2012 ICDE Influential Paper Award
  • 2011 ACM SIGMOD Edgar F. Codd Innovations Award
  • 2008 VLDB Best Paper Award (with Nico Bruno)
  • 2007 VLDB 10-Year Best Paper Award (with Vivek Narasayya)
  • 2005 ACM Fellow
  • 2004 ACM SIGMOD Contributions Award
  • 2000 IEEE ICDE Best Paper Award (with Vivek Narasayya)

Selected Professional Activities 

  • 2010 ACM Symposium on Cloud Computing (SOCC): Program Co-Chair
  • 2006 ACM Conference on Management of Data (SIGMOD): Program Chair
  • 1999 ACM Conference on Knowledge Discovery and Data Mining (KDD): Program Co-Chair
  • 2011 IEEE Data Engineering Conference: Industrial Track Co-Chair
  • 2003 ACM SIGMOD Conference: Industrial Track Chair
  • 2001 ACM Conference on Knowledge Discovery and Data Mining: Industrial Track Co-Chair
  • 1999 ACM SIGMOD Conference: Industrial Track Co-Chair
  • 1998 IEEE Conference on Data Engineering (ICDE): Industrial Track Chair
  • 2002 IEEE Conference on Data Engineering (ICDE): Chair, OLAP and Data Warehousing Track
  • 2008 VLDB 10-year award committee, Chair
  • 2002 VLDB 10-year award committee, Member
  • ACM Transactions on Database Systems (TODS): Associate Editor, 2001-2007
  • IEEE Transactions on Knowledge and Data Engineering (TKDE): Associate Editor, 2001-2005
  • IEEE Data Engineering Bulletin: Associate Editor, 1998-1999

Invited Talks, Tutorials, and Surveys

  • Experiences with Problem #9: Invited Talk, SIGMOD 2011, Athens.
  • A Programming Framework for Data Cleaning, Distinguished Lecture, University of British Columbia, 2009.
  • An Overview of Business Intelligence Technology, CACM 2011. (with Umeshwar Dayal, Vivek Narasayya)
  • Self-Tuning Database Systems: A Decade of Progress. VLDB 2007. (with Vivek Narasayya)
  • Foundations of automated database tuning, Tutorial presented at ACM SIGMOD 2005, VLDB 2006. (with Gerhard Weikum)
  • Self-Managing Technology in Database Management Systems, Tutorial presented at VLDB 2004. (with Benoît Dageville, Guy M. Lohman)
  • Databases and IR: Perspectives of a SQL Guy, NSF Information and Data Management PI Workshop, Seattle, 2003.
  • An Overview of Data Warehousing and OLAP technology. SIGMOD Record, March 1997. Tutorials presented at the 1996 VLDB, 1997 SIGMOD, 1998 EDBT, and 1998 IEEE ICDE conferences. (with Umeshwar Dayal)
  • An Overview of Query Optimization in Relational Systems. Proceedings of 1998 ACM PODS. Invited Tutorial at ACM PODS Conference, 1998.

Technology Transfer (in collaboration with project members)

  • SQL Server Index Tuning Wizard and Database Tuning Advisor (AutoAdmin project)

  • Fuzzy Lookup and Fuzzy Grouping Transforms in SQL Server Integration Services (Data Cleaning project)

  • Query Services and Catalog Data Quality for Bing Shopping (Data Cleaning and Entity Search projects)

Selected Publications

For a complete list of my publications, please look up DBLP.

  • Interval-based pruning for top-k processing over compressed lists. IEEE ICDE 2011. (with Kaushik Chakrabarti, Venky Ganti)
  • Query optimizers: time to rethink the contract? SIGMOD Conference, 2009.
  • Extending autocompletion to tolerate errors. SIGMOD Conference, 2009. (with Raghav Kaushik)
  • Exploiting web search engines to search structured databases. WWW 2009 (with Sanjay Agrawal, Kaushik Chakrabarti, Venkatesh Ganti, Arnd Christian König, Dong Xin)
  • Exploiting web search to generate synonyms for entities. WWW 2009. (with Venkatesh Ganti, Dong Xin)
  • Transformation-based Framework for Record Matching. IEEE ICDE 2008. (with Arvind Arasu, Raghav Kaushik)
  • Constrained physical design tuning. VLDB 2008 (with Nico Bruno)
  • Fine Grained Authorization Through Predicated Grants. IEEE ICDE 2007 (with Tanmoy Dutta, S. Sudarshan)
  • An Online Approach to Physical Design Tuning. IEEE ICDE 2007. (with Nico Bruno)
  • Leveraging aggregate constraints for deduplication. SIGMOD 2007 (with Venky Ganti, Raghav Kaushik, Anish Das Sarma)
  • A Primitive Operator for Similarity Joins in Data Cleaning. IEEE ICDE 2006 (with Venky Ganti, Raghav Kaushik)
  • Towards a Robust Query Optimizer: A Principled and Practical Approach, ACM SIGMOD 2005. (with Brian Babcock)
  • When Can We Trust Progress Estimators for SQL Queries? ACM SIGMOD 2005. (with Raghav Kaushik, Ravishankar Ramamurthy)
  • Effective Use of Block-Level Sampling in Statistics Estimation. ACM SIGMOD 2004. (with Gautam Das and Utkarsh Srivastava)
  • Estimating Progress of Long Running SQL Queries. ACM SIGMOD 2004. (with Vivek Narasayya, Ravishankar Ramamurthy)
  • Robust and efficient fuzzy match for online data cleaning, ACM SIGMOD 2003 (with Kris Ganjam, Venkatesh Ganti, Rajeev Motwani).
  • DBXplorer: A System For Keyword-Based Search Over Relational Databases. IEEE ICDE 2002. (with Sanjay Agrawal and Gautam Das).
  • Efficient Evaluation of Queries with Mining Predicates. Proceedings of IEEE International Conference on Data Engineering, 2002. (with Vivek Narasayya and Sunita Sarawagi).
  • STHoles: A Multidimensional Workload-Aware Histogram. Proceedings of the ACM SIGMOD 2001. (with Nicolas Bruno and Luis Gravano).
  • Integrating Data Mining with SQL Databases: OLE DB for Data Mining, Proceedings of 17th International Conference on Data Engineering, 2001 (with Amir Netz, Usama M. Fayyad, Jeff Bernhardt)
  • Rethinking Database System Architecture: Towards a Self-tuning, RISC-style Database System. Proceedings of the 26th International Conference on Very Large Databases (VLDB00) (with Gerhard Weikum).
  • Automated Selection of Materialized Views and Indexes for SQL Databases. Proceedings of the 26th International Conference on Very Large Databases (VLDB00) (with Sanjay Agrawal and Vivek Narasayya).
  • Towards Estimation Error Guarantees for Distinct Values. 19th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, Dallas, USA, 2000 (with Moses Charikar, Rajeev Motwani, and Vivek Narasayya).
  • Automating Statistics Management for Query Optimizers. Proceedings of 16th International Conference on Data Engineering, San Diego, USA 2000 (with Vivek Narasayya).
  • Evaluating Top-k Selection Queries. Proceedings of 25th VLDB Conference, Edinburgh, Scotland, UK, 1999 (with Luis Gravano)
  • Self-Tuning Histograms: Building Histograms Without Looking at Data, Proceedings of ACM SIGMOD, Philadelphia, 1999 (with Ashraf Aboulnaga)
  • On Random Sampling over Joins, ACM SIGMOD 1999 (with Rajeev Motwani and Vivek Narasayya)
  • Random Sampling for Histogram Construction: How much is enough? Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya and Rajeev Motwani)
  • AutoAdmin "What-If" Index Analysis Utility. Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya).
  • An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. Proceedings of the 23rd International Conference on Very Large Databases (VLDB97), Athens, Greece, 1997, pp. 146-155 (with Vivek Narasayya).
  • Data Mining and Database Systems: Where is the Intersection? IEEE Data Engineering Bulletin, March 1998

Selected Publications (Pre-MSR)

  • Optimizing Queries with User-Defined Predicates, VLDB Conference 1996 (with Kyuseok Shim)
  • Join Queries with External Text Sources: Execution and Optimization Techniques SIGMOD Conference 1995: 410-422 (with Umeshwar Dayal and Tak W. Yan)
  • Optimizing Queries with Materialized Views ICDE 1995: 190-200 (with Ravi Krishnamurthy, Spyros Potamianos, Kyuseok Shim)
  • Including Group-By in Query Optimization VLDB 1994: 354-366 (with Kyuseok Shim)
  • Optimization of Real Conjunctive Queries PODS 1993: 59-70 (with Moshe Y. Vardi)