Wei Cheng, Xiaoming Jin, and Jian-Tao Sun
Retrieving similar data has drawn many research efforts in literature due to its importance in data mining, database, and information retrieval. This problem is challenging when the data is incomplete. In previous research, data incompleteness refers to the fact that data values for some dimensions are unknown. However, in many practical applications (e.g., data collection by sensor network under bad environment), not only data values but even data dimension information may also be missing, which will make most similarity query algorithms infeasible. In this work, we proposed the novel similarity query problem on dimension incomplete data and adopted a probabilistic framework to model this problem. For this problem, users can specify a distance threshold and a probability threshold to specify their retrieval requirements. The distance threshold is used to specify the allowed distance between query and data objects and the probability threshold is used to require that the retrieval results satisfy the distance condition at least with the given probability. Instead of enumerating all possible cases to recover the missed dimensions, we propose an efficient approach to speed up the retrieval process by leveraging inherent relations between query and dimension incomplete data objects. When query is processed, we estimate the lower/upper bounds of the probability that the query is satisfied by a given data object, and utilize these bounds in filtering data objects efficiently. Furthermore, a probabilistic triangle inequality is proposed to further speed up query processing. According to experiments on real data sets, our similarity query method is verified to be effective and effi cient on dimension incomplete data.
|Published in||The 2009 edition of the IEEE International Conference on Data Mining series (ICDM 2009)|
|Publisher||IEEE Communications Society|
Copyright © 2007 IEEE. Reprinted from IEEE Communications Society. This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to firstname.lastname@example.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.