Goal
The Knowledge Discovery and Data Mining (KDD) process consists of data selection, data cleaning, data transformation and reduction, mining, interpretation and evaluation, and finally incorporation of the mined "knowledge" into the larger decision-making process. The goals of this research project include developing efficient computational approaches to data modeling (finding patterns), data cleaning, and data reduction for large, high-dimensional databases. Methods from databases, statistics, algorithmic complexity, and optimization are used to build efficient, scalable systems that are seamlessly integrated with the relational/OLAP database structure. This enables database developers to easily access and successfully apply data mining technology in their applications.
Current Status
This is a long-term project. In the short term, the focus will be on automating the data mining process over data warehouses. This includes work in the following areas:
- Integration of data mining with database systems: The success of data mining as an enterprise technology depends crucially on its seamless integration with enterprise databases. In this project, in collaboration with the SQL Server Product Group, we identify opportunities for new abstractions and interfaces that enable this integration. Our joint work resulted in the definition of OLE DB for Data Mining (OLE DB DM), an extension of OLE DB that exposes data mining functionality. Our future work will focus on exploiting data mining for advanced data summarization and on enabling tighter coupling between database querying and data mining.
- Scalable data mining algorithms: We are exploring scalable algorithms for modeling large databases. Methods considered include those for predictive modeling (for example, predicting products a customer is likely to purchase based on other products in their basket) and segmentation/clustering (grouping together customers that are "similar" to each other). Specifically, we have focused on scalable decision tree algorithms for prediction, scalable probabilistic clustering algorithms, algorithms for detecting similarity between data objects, and algorithms for mining sequence data. We are particularly interested in building data mining models in time that is linear or near-linear in the size of the data.
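As a rough illustration of the kind of computation behind scalable decision tree building, the sketch below makes a single pass over the data to collect (attribute, value, class) counts and then picks a split attribute from those counts alone, without revisiting the rows. It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the information-gain criterion, and the toy basket data are all hypothetical choices made for this example.

```python
from collections import Counter
from math import log2


def gather_counts(rows, attributes, target):
    """Single scan: count (attribute, value, class) co-occurrences."""
    counts = Counter()
    class_totals = Counter()
    for row in rows:
        label = row[target]
        class_totals[label] += 1
        for attr in attributes:
            counts[(attr, row[attr], label)] += 1
    return counts, class_totals


def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)


def best_split(counts, class_totals, attributes):
    """Choose the attribute with the highest information gain,
    using only the counts gathered above (no second pass over rows)."""
    n = sum(class_totals.values())
    base = entropy(class_totals.values())
    gains = {}
    for attr in attributes:
        # Class distribution for each value of this attribute.
        per_value = {}
        for (a, value, label), c in counts.items():
            if a == attr:
                per_value.setdefault(value, Counter())[label] += c
        remainder = sum(
            sum(dist.values()) / n * entropy(dist.values())
            for dist in per_value.values()
        )
        gains[attr] = base - remainder
    return max(gains, key=gains.get), gains


# Hypothetical toy data: does a basket containing chips also contain salsa?
rows = [
    {"bread": "yes", "chips": "yes", "salsa": "yes"},
    {"bread": "yes", "chips": "no", "salsa": "no"},
    {"bread": "no", "chips": "yes", "salsa": "yes"},
    {"bread": "no", "chips": "no", "salsa": "no"},
]
counts, class_totals = gather_counts(rows, ["bread", "chips"], "salsa")
best, gains = best_split(counts, class_totals, ["bread", "chips"])
print(best, gains)
```

The scan touches each row once, so its cost grows linearly with the number of rows, and the counts table depends only on the number of distinct attribute values and classes. In a database setting the same counts could plausibly be produced by a grouped aggregate query instead of a client-side loop.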
Read more about how data mining is integrated into SQL Server.
- Surajit Chaudhuri, Vivek Narasayya, and Sunita Sarawagi, Efficient Evaluation of Queries with Mining Predicates, in Proceedings of 18th International Conference on Data Engineering, IEEE Computer Society, 2002
- Amir Netz, Surajit Chaudhuri, Usama Fayyad, and Jeff Bernhardt, Integrating Data Mining with SQL Databases: OLE DB for Data Mining, in Proceedings of 17th International Conference on Data Engineering, IEEE Computer Society, 2001
- Amir Netz, Surajit Chaudhuri, Jeff Bernhardt, and Usama Fayyad, Integration of Data Mining and Relational Databases, in Proceedings of the 26th International Conference on Very Large Data Bases, Very Large Data Bases Endowment Inc., 2000
- Usama Fayyad, Surajit Chaudhuri, and Paul Bradley, Data Mining and its Role in Database Systems, Very Large Data Bases Endowment Inc., 2000
- Surajit Chaudhuri, Usama Fayyad, and Jeff Bernhardt, Scalable Classification over SQL Databases, in Proceedings of 15th International Conference on Data Engineering, IEEE Computer Society, 1999
- Surajit Chaudhuri, Data Mining and Database Systems: Where is the Intersection?, in Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, IEEE Computer Society, 1998
- Goetz Graefe, Usama Fayyad, and Surajit Chaudhuri, On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases, in Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, 1998



