Ye Yuan, Guoren Wang, Haixun Wang, and Lei Chen
Given a set of model graphs D and a query graph q, containment search aims to find all model graphs g ∈ D such that q contains g (q ⊇ g). Due to the wide adoption of graph models, fast containment search of graph data finds applications in many domains. In comparison to traditional graph search, which retrieves all the graphs containing q (q ⊆ g), containment search has its own indexing characteristics that have not yet been examined. In this paper, we perform a systematic study of these characteristics and propose a contrast subgraph-based indexing model, called cIndex. Contrast subgraphs capture the structural differences between model graphs and query graphs, and are thus well suited for indexing due to their high selectivity. Using a redundancy-aware feature selection process, cIndex can sort out a set of significant and distinctive contrast subgraphs and maximize its indexing capability. We show that it is NP-complete to choose the best set of indexing features, and that our greedy algorithm approximates the one-level optimal index within a ratio of 1 − 1/e. Taking this solution as a base indexing model, we further extend it to accommodate hierarchical indexing methodologies and apply data space clustering and sampling techniques to reduce the index construction time. The proposed methodology provides a general solution to containment search and indexing, not only for graphs but also for any data with transitive relations. Experimental results on real test data show that cIndex achieves near-optimal pruning power on various containment search workloads and confirm its clear advantage over indices built for traditional graph search in this new scenario.
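To make the pruning idea concrete, the following is a minimal sketch and not the cIndex implementation. It assumes a toy setting in which each graph is a set of edge labels and containment is plain set inclusion, which is one instance of a transitive relation; real graph data would need a subgraph isomorphism test instead. A feature f prunes a model graph g for a query q when f ⊆ g but f ⊄ q, since g then cannot be contained in q. Feature selection is sketched as greedy maximum coverage over (query, model graph) pairs from a hypothetical query log, which is the standard setting in which a greedy choice attains the 1 − 1/e ratio. The function names `covered_pairs`, `greedy_select`, and `containment_search` are illustrative only.

```python
def covered_pairs(feature, queries, graphs):
    """(query, graph) index pairs that `feature` prunes.

    A pair is pruned when the feature occurs in the graph but not in the
    query: the graph then cannot be contained in the query.
    """
    return {
        (qi, gi)
        for qi, q in enumerate(queries)
        for gi, g in enumerate(graphs)
        if feature <= g and not feature <= q
    }


def greedy_select(candidates, queries, graphs, k):
    """Pick up to k features by greedy maximum coverage of pruned pairs."""
    cover = {f: covered_pairs(f, queries, graphs) for f in candidates}
    chosen, covered = [], set()
    for _ in range(k):
        # Take the feature that prunes the most pairs not yet covered.
        best = max(cover, key=lambda f: len(cover[f] - covered), default=None)
        if best is None or not cover[best] - covered:
            break  # no remaining feature adds pruning power
        chosen.append(best)
        covered |= cover.pop(best)
    return chosen, covered


def containment_search(query, graphs, features):
    """Return all model graphs contained in `query`, pruning with features."""
    # Features absent from the query rule out every graph that has them.
    missing = [f for f in features if not f <= query]
    survivors = [g for g in graphs if not any(f <= g for f in missing)]
    # Survivors still need an exact containment check.
    return [g for g in survivors if g <= query]


# Toy data (hypothetical): graphs as frozensets of edge labels.
graphs = [frozenset({"ab", "bc"}),
          frozenset({"ab", "cd"}),
          frozenset({"bc", "cd", "de"})]
queries = [frozenset({"ab", "bc", "cd"}), frozenset({"ab", "bc"})]
candidates = [frozenset({"cd"}), frozenset({"de"}), frozenset({"ab"})]

features, pruned = greedy_select(candidates, queries, graphs, k=2)
answers = containment_search(queries[1], graphs, features)
```

In this toy run the feature {"ab"} is never selected: it occurs in every logged query, so it cannot prune any model graph, which mirrors the observation that useful features for containment search are those that contrast model graphs with queries.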