Web Search and Data Management
- Trinity: Trinity is a general-purpose distributed graph system built on a memory cloud, a globally addressable, in-memory key-value store spread over a cluster of machines. This distributed in-memory storage gives Trinity fast random access to large data sets. Combining fast graph exploration with distributed parallel computing, Trinity supports both low-latency online query processing and high-throughput offline analytics on graphs with billions of nodes.
- Probase: Knowledge is indispensable to understanding. The goal of Probase is to model the concepts (common knowledge and commonsense knowledge) in our mental world, represent them in a computable, probabilistic form, and enable machines to better understand natural language.
- WebSensor (InformationSensor): With the rapid growth of the web, making sense of web data poses grand challenges: big volume, high velocity, high variety, and unknown veracity. In the physical world, a sensor is a converter that measures a physical quantity and converts it into a signal that can be read by an observer or by an instrument (today, mostly an electronic one). This project creates a virtual WebSensor layer atop the web.
- Intent and Diversity (INDI): Users who submit the same query may have different intents. For an ambiguous query, users may seek different interpretations. For a faceted topic, users may be interested in different subtopics. In this project, we investigate how many queries are ambiguous in real search logs; we propose methods to diversify search results; we experiment with new metrics to measure diversity; and we organize the NTCIR INTENT and IMINE tasks to provide common data for the IR community.
- Web Page Analysis (WEPA): A Web page is not atomic but rich in structure. In this project, we take advantage of the HTML DOM structure and associated visual features, such as the font size, width, and height of a DOM element, to understand the author's purpose in creating a page. We model the importance of blocks in a page; we extract structured data from pages across websites; we learn templates from a set of mixed pages from a website; and we identify the article title, body, and images in pages to improve the reading experience.
- Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang, A Distributed Graph Engine for Web Scale RDF Data, in PVLDB, August 2013
- Bin Shao, Haixun Wang, and Yatao Li, Trinity: A Distributed Graph Engine on a Memory Cloud, in Proceedings of SIGMOD 2013, ACM SIGMOD, 26 June 2013
- Zhao Sun, Hongzhi Wang, Bin Shao, Haixun Wang, and Jianzhong Li, Efficient Subgraph Matching on Billion Node Graphs, in PVLDB, August 2012
- Lijun Chang, Jeffrey Yu, Lu Qin, Yuanyuan Zhu, and Haixun Wang, Finding Information Nebula over Large Networks, in ACM CIKM, October 2011
- Ruoming Jin, Lin Liu, Bolin Ding, and Haixun Wang, Reachability Computation in Uncertain Graphs, in VLDB, September 2011
- Ruoming Jin, Yang Xiang, Ruan Ning, and Haixun Wang, Path-Tree: An Efficient Reachability Indexing Scheme for Large Directed Graphs, in ACM Transactions on Database Systems (TODS), 2011