G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

MSR-TR-2011-138 |

Graphs are widely used for modeling complicated data such as social networks, bibliographical networks and knowledge bases. The growing sizes of graph databases motivate the crucial need for developing powerful and scalable graphbased query engines. We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language enables the expression of different types of graph queries that are of large interest in the databases that are modeled as large graph such as: pattern matching, reachability and shortest path queries. Each query can combine both of structural predicates and value-based predicates (on the attributes of the graph nodes/edges). We describe an algebraic compilation mechanism for our proposed query language which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern. We describe an efficient hybrid Memory/Disk representation of large attributed graphs where only the topology of the graph is maintained in memory while the data of the graph are stored in a relational database. The execution engine of our proposed query language splits parts of the query plan to be pushed inside the relational database (using SQL) while the execution of other parts of the query plan are processed using memory-based algorithms, as necessary. Experimental results on real and synthetic datasets demonstrate the efficiency and the scalability of our approach and show that our approach outperforms native graph databases by several factors.