Data Mining

Goal

The Knowledge Discovery and Data Mining (KDD) process consists of data selection, data cleaning, data transformation and reduction, mining, interpretation and evaluation, and finally incorporation of the mined "knowledge" into the larger decision-making process. The goals of this research project include developing efficient computational approaches to data modeling (finding patterns), data cleaning, and data reduction for large, high-dimensional databases. Methods from databases, statistics, algorithmic complexity, and optimization are used to build efficient, scalable systems that integrate seamlessly with the relational/OLAP database structure. This enables database developers to easily access and successfully apply data mining technology in their applications.

Current Status

This is a long-term project. In the short term, the focus will be on automating the data mining process over data warehouses. This includes work in the following areas:

  • Integration of data mining with database systems: The success of data mining as an enterprise technology depends crucially on seamless integration with enterprise databases. In this project, in collaboration with the SQL Server Product Group, we identify opportunities for new abstractions and interfaces that enable this integration. Our joint work resulted in the definition of OLE-DB DM, an extension of OLE-DB that exposes data mining functionality. Our future work will focus on exploiting data mining for advanced data summarization and on enabling tighter coupling between database querying and data mining.
  • Scalable Data Mining Algorithms: We are exploring scalable algorithms for modeling large databases. Methods considered include those for predictive modeling (for example, predicting products a customer is likely to purchase based on the other products in their basket) and segmentation/clustering (grouping together customers that are "similar" to each other). Specifically, we have focused on scalable decision tree algorithms for prediction, scalable probabilistic clustering algorithms, algorithms for detecting similarity between data objects, and mining sequence data. We are particularly interested in building data mining models in time that is linear or near-linear in the size of the data.
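
To make the basket-prediction example above concrete, here is a minimal sketch in Python. It is not the project's algorithm: it swaps in a simple co-occurrence count as a stand-in for the predictive models described, and the function names `count_cooccurrences` and `recommend` are hypothetical. What it does illustrate is the scalability goal: the model is built in a single scan over the baskets, so build time grows linearly with the number of baskets, assuming basket size is bounded.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_cooccurrences(baskets):
    """Build the model in one pass: count how often each ordered
    pair of distinct products appears in the same basket."""
    pair_counts = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicate items within a single basket.
        for a, b in permutations(set(basket), 2):
            pair_counts[a][b] += 1
    return pair_counts


def recommend(pair_counts, basket, top_n=3):
    """Rank products not already in the basket by their summed
    co-occurrence with the products that are in it."""
    in_basket = set(basket)
    scores = Counter()
    for item in in_basket:
        for other, count in pair_counts.get(item, {}).items():
            if other not in in_basket:
                scores[other] += count
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Illustrative data only, not taken from the project.
    baskets = [
        ["bread", "butter", "jam"],
        ["bread", "butter"],
        ["bread", "milk"],
        ["milk", "cereal"],
    ]
    model = count_cooccurrences(baskets)
    print(recommend(model, ["bread"]))
```

A production system would more likely push the counting step into the database itself, in line with the integration goal described in the first bullet; the in-memory dictionary here is only for illustration.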

 

Read more about how data mining is integrated into SQL Server.

Publications