Data Mining: Efficient Data Exploration and Modeling

 
Goal

The Knowledge Discovery and Data Mining (KDD) process consists of data selection, data cleaning, data transformation and reduction, mining, interpretation and evaluation, and finally incorporation of the mined "knowledge" into the larger decision-making process.  The goals of this research project include developing efficient computational approaches to data modeling (finding patterns), data cleaning, and data reduction for large, high-dimensional databases.  Methods from databases, statistics, algorithmic complexity, and optimization are used to build efficient, scalable systems that integrate seamlessly with the Relational/OLAP database structure.  This lets database developers readily access data mining technology and apply it successfully in their applications.

 
Current Status

This is a long-term project.  In the short term, the focus will be on automating the data mining process over data warehouses.  This includes work in the following areas:

  • Integration of data mining with database systems: The success of data mining as an enterprise technology depends crucially on its seamless integration with enterprise databases.  In this project, in collaboration with the SQL Server Product Group, we identify opportunities for new abstractions and interfaces that enable this integration.  Our joint work resulted in the definition of OLE-DB DM, an extension of OLE-DB that exposes data mining functionality.  Our future work will focus on exploiting data mining for advanced data summarization and on enabling tighter coupling between database querying and data mining.
  • Scalable data mining algorithms: We are exploring scalable algorithms for modeling large databases.  Methods considered include those for predictive modeling (for example, predicting the products a customer is likely to purchase based on the other products in their basket) and segmentation/clustering (grouping together customers that are "similar" to each other).  Specifically, we have focused on scalable decision tree algorithms for prediction, scalable probabilistic clustering algorithms, algorithms for detecting similarity between data objects, and mining sequence data.  We are particularly interested in building data mining models efficiently, in linear or near-linear time.
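To make the idea of linear-time model building concrete, here is a minimal sketch in Python of one common approach for decision trees: gather per-attribute class counts in a single scan over the data, then choose a split attribute from those counts alone. This is an illustration under stated assumptions, not the project's actual algorithm. The function names (gather_counts, best_split), the toy basket data, and the use of information gain as the split criterion are all hypothetical choices made for the example. In a database setting, the counts would more plausibly come from a grouped aggregate query than from a Python loop.

```python
from collections import Counter
from math import log2


def gather_counts(rows, attributes, target):
    """Scan the rows once and count (attribute, value, class) co-occurrences."""
    counts = Counter()
    class_totals = Counter()
    for row in rows:
        label = row[target]
        class_totals[label] += 1
        for attr in attributes:
            counts[(attr, row[attr], label)] += 1
    return counts, class_totals


def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    class_counts = list(class_counts)
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def best_split(counts, class_totals, attributes):
    """Pick the attribute with the highest information gain, using counts only."""
    n = sum(class_totals.values())
    base = entropy(class_totals.values())
    gains = {}
    for attr in attributes:
        # Regroup the flat counts into one class histogram per attribute value.
        by_value = {}
        for (a, value, label), c in counts.items():
            if a == attr:
                by_value.setdefault(value, Counter())[label] += c
        # Weighted entropy of the partitions induced by this attribute.
        remainder = sum(
            (sum(hist.values()) / n) * entropy(hist.values())
            for hist in by_value.values()
        )
        gains[attr] = base - remainder
    return max(gains, key=gains.get), gains


# Hypothetical toy data: does a basket containing bread or milk also contain butter?
rows = [
    {"bread": "yes", "milk": "yes", "buys_butter": "yes"},
    {"bread": "yes", "milk": "no", "buys_butter": "yes"},
    {"bread": "no", "milk": "yes", "buys_butter": "no"},
    {"bread": "no", "milk": "no", "buys_butter": "no"},
]
attributes = ["bread", "milk"]

counts, class_totals = gather_counts(rows, attributes, "buys_butter")
best, gains = best_split(counts, class_totals, attributes)
print(best, gains)
```

The scan touches each row once, so the cost of gathering counts grows linearly with the number of rows. The split decision then works only on the count table, whose size depends on the number of distinct attribute values and classes, not on the number of rows.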

 
People

Surajit Chaudhuri

Gautam Das

Venky Ganti

 
Publications


Chaudhuri S., Narasayya V., and Sarawagi S. Efficient Evaluation of Queries with Mining Predicates. Proceedings of the 18th International Conference on Data Engineering, San Jose, USA, 2002.

Netz A., Bernhardt J., Chaudhuri S., and Fayyad U. Integrating Data Mining with SQL Databases: OLE DB for Data Mining. Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2001.

Fayyad U. M., Chaudhuri S., and Bradley P. S. Data Mining and its Role in Database Systems. Tutorial, Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

Netz A., Chaudhuri S., Bernhardt J., and Fayyad U. Integration of Data Mining and Relational Databases. Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.

Bernhardt J., Chaudhuri S., and Fayyad U. Scalable Classification over SQL Databases. Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, 1999.

Chaudhuri S. Data Mining and Database Systems: Where is the Intersection? Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, March 1998.

Graefe G., Fayyad U. M., and Chaudhuri S. On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998.

If you have questions about this project, please contact Surajit Chaudhuri (surajitc@microsoft.com).


