Large Multidimensional Datasets inside a Database System (PhD Thesis)

Many modern database applications deal with large amounts of multidimensional data. Examples include multimedia content-based retrieval (high dimensional multimedia feature data), time-series similarity retrieval, data mining/OLAP and spatial/spatio-temporal applications. To be able to handle multidimensional data efficiently, we need access methods (AMs) to selectively access some data items in a large collection associatively.

Traditional database AMs like B+-tree and hashing are not suitable for multidimensional data as they can handle only one dimensional data. Using multiple B+-trees (one per dimension) or space linearization followed by B+-tree indexing are not efficient solutions. We need multidimensional index structures: those that can index data based on multiple dimensions simultaneously. Most multidimensional index structures proposed so far do not scale beyond 10-15 dimensional spaces and are hence not suitable for high dimensional spaces that arise in modern database applications like multimedia retrieval (e.g., 64-d color histograms), data mining/OLAP (e.g., 52-d bank data in clustering) and time series/scientific/medical applications (e.g., 20-d Space Shuttle data, 64-d Electrocardiogram data). A simple sequential scan through the entire dataset to answer the query is often faster than using a multidimensional index structure.

To address the above need, we design and implement the hybrid tree, a multidimensional index structure that scales to high dimensional spaces. The hybrid tree combines the positive aspects of the two types of multidimensional index structures, namely data partitioning (e.g., R-tree and derivatives) and space partitioning (e.g., kdB-tree and derivatives), to achieve search performance more scalable to high dimensionalities than either of the above techniques. Our experiments show that the hybrid tree scales well to high dimensionalities for real-life datasets.

To achieve further scalability, we develop the local dimensionality reduction (LDR) technique to reduce the dimensionality of high dimensional data. The reduced space can be indexed more effectively using a multidimensional index structure. LDR exploits local, as opposed to global, correlations in the data and hence can reduce dimensionality with significantly lower loss of distance information compared to global dimensionality reduction techniques. This implies fewer false positives and hence significantly better search performance.

Another challenge in multidimensional indexing is handling time-series data which constitutes a major portion of all financial, medical and scientific information. We develop a new dimensionality reduction technique, called Adaptive Piecewise Constant Approximation (APCA), for time series data. APCA takes the idea of LDR one step further; it adapts locally to each time series object in the database and chooses the best reduced-representation for that object. We show how the APCA representation can be indexed using a multidimensional index structure. Our experiments show that APCA outperforms the other techniques by one to two orders of magnitude in terms of search performance.

Before multidimensional index structures can be supported as AMs in ”commercial-strength” database systems, efficient techniques to provide transactional access to data via the index structure must be developed. We develop concurrency control techniques for multidimensional index structures. Our solution, based on granular locking, offers a high degree of concurrency and has a low lock overhead.

An alternate technique to handle huge data volumes and fast search time requirements in multidimensional datasets is approximate query answering. This is especially true for decision support/OLAP applications where queries are usually exploratory in nature; fast approximate answers are often preferred to exact answers that take hours to compute. We develop a wavelet-based approximate query answering tool for DSS data. Our technique constructs compact synopses (comprising of wavelet coefficients) of the relevant database tables and subsequently answers any SQL query by working exclusively on the compact synopses. Our approach provides more accurate answers and faster response times compared to other approximate query answering techniques, namely random sampling and histograms, especially for high dimensional data.

Despite the increasing application need, commercial database management systems (DBMSs) lag far behind in their support for multidimensional data. One of the main reasons is the lack of scalable and effective techniques to manage large amounts of multidimensional data residing inside the DBMS. We believe that the techniques developed in this thesis address that problem. We hope that our solutions will encourage commercial database vendors to provide better support for multidimensional data in the future.