Web Search and Data Management
Recent years have brought dramatic improvements in machine learning, knowledge mining, graph databases, and crowdsourcing, giving search engines new capabilities for deeper processing and understanding of data and text. For example, deep learning offers a “bottom-up” capability to learn representations from big data, and there has been exciting progress in using it to embed words, queries, and documents semantically into a vector space for advanced text processing. Knowledge mining, which refers to a variety of techniques that extract knowledge such as entities and relationships from the Web, has made significant advances in the past few years; search engines use these techniques to create a comprehensive knowledge graph that offers a “top-down” capability for reasoning, inference, and understanding of the data. Combined with these bottom-up and top-down capabilities, crowdsourcing, which cleverly includes human computation in the loop to receive implicit or explicit feedback and verification, has led to many new ideas and systems that help automatic algorithms achieve higher accuracy and performance. Together, these capabilities advance the state of the art in machine comprehension of text and artificial intelligence.
The Web Search and Data Management Group is performing cutting-edge research in these related areas and developing new capabilities to empower next-generation search engines and intelligent applications. We also work closely with researchers in the Natural Language Computing, Knowledge Mining, and Machine Learning groups.
- Intent and Diversity (INDI)
A single query can carry different intents. For an ambiguous query, users may seek different interpretations; for a faceted topic, they may be interested in different subtopics. In this project, we investigate how many queries in real search logs are ambiguous, propose methods to diversify search results, experiment with new metrics to measure diversity, and organize the NTCIR INTENT and IMINE tasks to provide common data for the IR community.
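To make the idea of diversifying results and measuring diversity concrete, here is a minimal sketch. It is not the project's method: it uses a simple greedy subtopic-coverage re-ranker and the subtopic recall metric, and the query, document IDs, and subtopic labels are all hypothetical.

```python
def subtopic_recall(ranking, doc_subtopics, all_subtopics, k):
    """Fraction of the known subtopics covered by the top-k documents."""
    if not all_subtopics:
        return 0.0
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_subtopics.get(doc, set())
    return len(covered & all_subtopics) / len(all_subtopics)


def greedy_diversify(ranking, doc_subtopics):
    """Reorder so that each pick adds the most not-yet-covered subtopics.

    max() returns the first best candidate, so ties keep the original
    (relevance) order.
    """
    remaining = list(ranking)
    covered, result = set(), []
    while remaining:
        best = max(remaining,
                   key=lambda d: len(doc_subtopics.get(d, set()) - covered))
        result.append(best)
        covered |= doc_subtopics.get(best, set())
        remaining.remove(best)
    return result


# Hypothetical data for the ambiguous query "jaguar".
all_subtopics = {"animal", "car", "os"}
doc_subtopics = {"d1": {"car"}, "d2": {"car"}, "d3": {"animal"}, "d4": {"os"}}
original = ["d1", "d2", "d3", "d4"]  # assumed relevance-only ranking

diversified = greedy_diversify(original, doc_subtopics)
recall_before = subtopic_recall(original, doc_subtopics, all_subtopics, k=2)
recall_after = subtopic_recall(diversified, doc_subtopics, all_subtopics, k=2)
```

In this toy example the re-ranked list covers more interpretations of the query within the top two results than the original ranking does. Real systems would also have to weigh relevance scores and intent probabilities, which this sketch ignores.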
- Probase
Knowledge is indispensable to understanding. The goal of Probase is to model the concepts (common knowledge and commonsense knowledge) in our mental world, represent them in computable, probabilistic form, and enable machines to better understand natural language.
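One simple way to picture concepts represented "in probabilities" is to normalize is-a evidence counts into a distribution over concepts for a given term. The sketch below does only that; the counts are made up for illustration, are not Probase data, and the project's actual model may differ.

```python
# Hypothetical (instance, concept) -> number of observed "is-a" mentions.
isa_counts = {
    ("apple", "fruit"): 6,
    ("apple", "company"): 3,
    ("apple", "tree"): 1,
    ("pear", "fruit"): 4,
}


def concept_distribution(counts, instance):
    """Estimate P(concept | instance) by normalizing the instance's counts."""
    matching = {c: n for (i, c), n in counts.items() if i == instance}
    total = sum(matching.values())
    if total == 0:
        return {}  # no evidence for this instance
    return {c: n / total for c, n in matching.items()}


apple = concept_distribution(isa_counts, "apple")
most_likely = max(apple, key=apple.get)
```

With these toy counts, "fruit" comes out as the most likely concept for "apple", while "company" keeps a non-trivial share, which is the kind of graded ambiguity a probabilistic representation can express.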
- Trinity
Trinity is a general-purpose distributed graph system over a memory cloud. A memory cloud is a globally addressable, in-memory key-value store over a cluster of machines. Through this distributed in-memory storage, Trinity provides fast random data access over a large data set. With fast graph exploration and distributed parallel computing, Trinity supports both low-latency online query processing and high-throughput offline analytics on large graphs at billion-node scale.
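The memory cloud idea can be pictured with a toy, single-process sketch: a key-value store whose keys are spread over several partitions, with adjacency lists as values and a breadth-first exploration on top. This is an illustration only. The class and function names are hypothetical, the modulo partitioning is an assumption, and none of it reflects Trinity's actual API or storage layout.

```python
from collections import deque


class ToyMemoryCloud:
    """Toy stand-in for a memory cloud; each dict plays one machine's memory."""

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]

    def _owner(self, key):
        # Assumed scheme: integer keys are assigned to partitions by modulo.
        return self.partitions[key % len(self.partitions)]

    def put(self, key, value):
        self._owner(key)[key] = value

    def get(self, key, default=None):
        return self._owner(key).get(key, default)


def explore(cloud, start, max_hops):
    """Breadth-first exploration; returns {node_id: hops} within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in cloud.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical graph: node id -> list of out-neighbors.
cloud = ToyMemoryCloud(num_partitions=3)
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
for node_id, neighbors in edges.items():
    cloud.put(node_id, neighbors)

two_hop = explore(cloud, start=1, max_hops=2)
```

Every neighbor lookup here is a random access by key, which is why keeping the whole store in memory matters for exploration-heavy workloads. In a real cluster, each lookup may also cross the network, a cost this sketch does not model.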
- Web Page Analysis (WEPA)
A Web page is not atomic; it is rich in structure. In this project, we take advantage of the HTML DOM structure and associated visual features, such as the font size, width, and height of a DOM element, to understand the author's purpose in creating a page. We model the importance of blocks in a page, extract structured data from pages across websites, learn templates from a set of mixed pages from a website, and identify the article title, body, and images in pages to improve the reading experience.
- WebSensor (InformationSensor)
The rapid growth of the web poses grand challenges for making sense of web data: big volume, high velocity, high variety, and unknown veracity. In the physical world, a sensor is a converter that measures a physical quantity and converts it into a signal that can be read by an observer or by an instrument (today, mostly electronic). This project creates a virtual WebSensor layer atop the web.
- Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang, A Distributed Graph Engine for Web Scale RDF Data, in PVLDB, August 2013
- Bin Shao, Haixun Wang, and Yatao Li, Trinity: A Distributed Graph Engine on a Memory Cloud, in Proceedings of SIGMOD 2013, ACM SIGMOD, 26 June 2013
- Zhao Sun, Hongzhi Wang, Bin Shao, Haixun Wang, and Jianzhong Li, Efficient Subgraph Matching on Billion Node Graphs, in PVLDB, August 2012
- Lijun Chang, Jeffrey Yu, Lu Qin, Yuanyuan Zhu, and Haixun Wang, Finding Information Nebula over Large Networks, in ACM CIKM, October 2011
- Ruoming Jin, Lin Liu, Bolin Ding, and Haixun Wang, Reachability Computation in Uncertain Graphs, in VLDB, September 2011
- Ruoming Jin, Yang Xiang, Ruan Ning, and Haixun Wang, Path-Tree: An Efficient Reachability Indexing Scheme for Large Directed Graphs, in ACM Transactions on Database Systems (TODS), 2011