The last decade has seen corporations and users own and store large quantities of digital data. This has led to the development of a wide array of data mining tools which provide ways to quickly extract useful information from such data. These tools also encourage users to pool their data in an attempt to recover qualitatively better information than would be possible from the data owned by individual users. But if the data contains sensitive/private information about users and/or their organizations, there are worries that these data mining algorithms may inadvertently reveal sensitive information along with the originally intended output. In this project, we are interested in developing practical algorithms for privacy-sensitive data mining.
Our current research is focused on developing privacy preserving Frequent Pattern Mining (FPM) algorithms. FPM is a fundamental problem in data mining and forms the basis of several tasks in data mining including association rule mining. An FPM algorithm finds the most frequent patterns in a dataset. A user can either specify the number of desired patterns in the output or specify a minimum threshold for the ‘support’ that each pattern enjoys in the dataset. For instance, FPM may help doctors find strong correlations, trends and rules from a pool of patient medical records. It is not hard to see that patients may be worried about their records being included in the pool unless they are assured that no “specific” information about them is revealed in the process.
The notion of privacy that we have considered for our current work is called Differential Privacy, a recently introduced definition which provides meaningful privacy guarantees in the presence of arbitrary external information. Differential privacy is a condition on the algorithm that releases some statistic about a dataset x. In rough terms, it states it should not be possible to detect a small change to x by merely observing the released statistics before and after the change. This implies that no matter what the adversary knows ahead of time, he learns the same thing about Alice whether or not her data is actually included in the dataset x.
We have developed two privacy preserving FPM algorithms which work in the following setting: The mechanism is designed to output the top-K patterns in the dataset. The user has complete control over the value of K (as long as it is computationally feasible to recover the top-K patterns). Our algorithms are differentially private and have been tested on several datasets from the FIMI repository. For most datasets, our algorithms have running times comparable to that of the non-private FPM algorithm.
Private FIM tool
(only for authenticated users on intranet)
- Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta, Discovering frequent patterns in sensitive data, in 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), Washington, DC, USA, Association for Computing Machinery, Inc., July 2010.