Supporting Efficient and Effective Keyword Search: From Set Intersection to Taxonomies and Text Cubes

This talk is about index structures and data models to support efficient and effective keyword search, from a micro level (fundamental-operator level), e.g., data structures for fast set intersection, to a macro level (system level), e.g., leveraging taxonomies in keyword search.

In the first part of the talk, I will discuss algorithms to improve the performance of a key operation in keyword search for both structured and unstructured data: the efficient computation of set intersections. Here, I will introduce compact data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, I will show how to compute their intersection in expected time O(n/sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. Subsequently, I will discuss a much more practical version of this algorithm, which has weaker asymptotic guarantees but is much simpler and performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.

In the second part, I will describe how to optimize (inverted) indexes to support processing a type of keyword query that allows word substitutions defined by a given taxonomy of terms. This problem is challenging because each term in a query can have a large number of substitutions and the original query can be rewritten into any of their combinations. We propose to build an additional index (besides inverted index) to efficiently process these types of queries. For a given query workload, we formulate an optimization problem which chooses the additional index structure, aiming at minimizing the query evaluation cost, under given index space constraints. We show the NP-hardness of the problem, and propose a pseudo-polynomial time algorithm using dynamic programming, as well as an (1−1/e)/4-approximation algorithm to solve the problem. Experimental results show that, with only 10% additional index space, our approach can greatly reduce the query evaluation cost.

Finally, I will briefly introduce our Text Cube-TopCell project which aims to support keyword queries and text analytics in structured data with both structural dimensions and text dimensions (or attributes). In the text cube model, text data is aggregated on different subsets of the structural dimensions, forming a hierarchy of entities called cube cells. Given a keyword query, our goal is to find the top-k most relevant cells in the text cube, which correspond to hidden entities at different granularities. Efficient algorithms to find relevant cube cells are proposed and a real system has been implemented (supported by NASA).

Speaker Details

Bolin Ding is a 5th year Ph.D. student of Computer Science at University of Illinois at Urbana-Champaign (UIUC), where he is working in the Data Mining Group under the supervision of Dr. Jiawei Han. Prior to joining UIUC, he received his M.Phil. Degree from The Chinese University of Hong Kong in 2007, and Bachelor’s degree from Renmin University of China in 2005. His research interests center around database systems and data mining. Specifically, he is interested in problems on the management and analysis of massive structured data, including efficient and effective searching, indexing, data mining algorithms, and data privacy. His research has led to over 30 research papers in high-profile international conferences and journals, and won the best student paper award in the International Conference on Data Engineering in 2007 (ICDE’07).

Homepage: https://netfiles.uiuc.edu/bding3/www/
Google Scholar: http://scholar.google.com/citations?user=AjYkTi8AAAAJ&hl=en

Date:
Speakers:
Bolin Ding
Affiliation:
University of Illinois
    • Portrait of Bolin Ding

      Bolin Ding

      Researcher

    • Portrait of Jeff Running

      Jeff Running