Search Objective Gets a Refined Approach
By Rob Knies
June 28, 2006 12:00 AM PT

Search technology has become ubiquitous among Internet users. Who among us doesn’t find themselves using search on a daily—often hourly—basis? It’s become virtually a Web given: Find white rectangle, start typing. It’s what we expect; it’s what we do.

But while search has achieved deep penetration into the nooks and crannies of the Web environment, much remains to be achieved to plumb the depths of search’s potential. While the technology has come a long way in recent years, it’s still in its infancy. Much work remains to enable users to gain effortless access to the information they seek.

Ji-Rong Wen and Zaiqing Nie are at the forefront of that continuing effort.

Wen and Nie, researchers with the Web Search & Mining Group within Microsoft Research Asia, are pioneering research into Object-Level Vertical Search. The technique has shown promising results: It sifts through a variety of Web pages and assembles fine-grained results that answer user queries with precise information.

“We want to develop a better search engine for some specific domains,” Wen says. “We want to develop a better search engine.”

Object-Level Vertical Search takes a refined approach that is a significant advance from traditional Web search. The latter paradigm is based on a page-level relevance ranking approach, in which pages that receive links from many other pages are adjudged to have more value by the very fact that they are popular. If more people link to a given page, it must have something to offer—that is the presumption.

In reality, we all know what happens. A search query returns a list of Web pages, some of which may have more relevance to what we are seeking, some with less. It’s up to us, then, to start clicking on likely candidates and scanning the pages for the information we want.

It works, to a degree. We’re in the neighborhood, but we’re still looking for the right house. Object-Level Vertical Search is designed to put us on the doorstep.

“In Object-Level Vertical Search,” Wen says, “we want to extract and integrate information from the Web about specific objects.

“For example, in academic search for a researcher, his information may be distributed on different Web sites. We need to collect, extract, and integrate all of this information. On one Web site, we may find the e-mail address of this person. On another Web site, we can find his telephone number and his publications.

“We collect all this information and integrate it. Then, after extraction and integration, the results will be a virtual page containing all the related information about this person.”

The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.

Wen and Nie began working on the concept a couple of years ago, on a project called Libra that eventually developed into Windows Live Academic Search.

“At that time,” Nie recalls, “we came up with this object-level vertical-search idea.”

Adds Wen: “We realized that traditional search technology could not meet our requirements.”

Those requirements were to improve on existing search.

“Object-level search is better,” Nie says, “in terms of enabling us to get more specific information about real-world objects.

“In the vertical domain, people are really interested in information about specific objects, not the pages themselves. For example, if you are a researcher, you always want to find information about other researchers and conferences and journals. If you want to find information about the best researchers in the world, and you use a basic search engine, it’s very difficult to find who the popular researchers are in a particular domain. But using our object-level search engines will specifically give you a list of researchers and extract and integrate the information together. The user can have a much better understanding as a result of a query.”

The approach taken by Wen and Nie is the first to develop the idea fully on a large scale from a Web-search perspective. In a 2005 paper entitled “Object-Level Ranking: Bringing Order to Web Objects,” authors Nie, Yuanzhi Zhang of Peking University, Wen, and Wei-Ying Ma, a principal researcher for Microsoft Research Asia, debuted the concept of PopRank, which computes the popularity of the objects within a specific domain, rather than the popularity of the Web pages.

“We found,” Nie says, “that, using our PopRank model, our ranking accuracy improved a lot.”

Such success doesn’t come easily, though.

“Our model is much more complicated,” Wen confirms. “In the academia-search domain, there are multiple types of objects: authors, papers, conferences, journals—all these things.

“In addition, the relationship is more complicated. There are paper-author relationships, author-author relationships, paper-paper relationships, paper-conference relationships. We need to differentiate different object types and different relationships at the algorithm level. We need to assign different weights to those different relationships and then change the algorithm.”
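To make the idea of weighting relationships concrete, here is a minimal sketch of popularity flowing across typed links. It is not the PopRank algorithm from the paper: the weight values, the link-type names, the damping factor, and the function name typed_popularity are all assumptions chosen for illustration, and the method shown is an ordinary PageRank-style iteration in which each link's share depends on its relationship type.

```python
from collections import defaultdict

# Hypothetical weights per relationship type (illustrative, not from the article).
LINK_WEIGHTS = {
    "paper-cites-paper": 0.5,
    "paper-by-author": 0.3,
    "paper-in-conference": 0.2,
}


def typed_popularity(nodes, edges, weights, damping=0.85, iterations=50):
    """Propagate popularity over typed links.

    edges is a list of (source, target, link_type) tuples. Each source
    splits its score among its targets in proportion to the link weights.
    """
    # Total outgoing weight per source, used to normalize its contributions.
    out_weight = defaultdict(float)
    for src, _dst, kind in edges:
        out_weight[src] += weights[kind]

    count = len(nodes)
    score = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / count for n in nodes}
        for src, dst, kind in edges:
            nxt[dst] += damping * score[src] * weights[kind] / out_weight[src]
        # Objects with no outgoing links spread their score evenly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0.0)
        for n in nodes:
            nxt[n] += damping * dangling / count
        score = nxt
    return score


# Toy academic graph: two papers cite p1; p1 links to its author and venue.
nodes = ["p1", "p2", "p3", "a1", "c1"]
edges = [
    ("p2", "p1", "paper-cites-paper"),
    ("p3", "p1", "paper-cites-paper"),
    ("p1", "a1", "paper-by-author"),
    ("p1", "c1", "paper-in-conference"),
]
scores = typed_popularity(nodes, edges, LINK_WEIGHTS)
```

In this toy graph, the cited paper p1 ends up more popular than the papers that cite it, and the author a1 receives more of p1's popularity than the conference c1 only because the assumed author weight is larger. Changing the weights changes the ranking, which is why choosing them per relationship type matters.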

It’s the same popularity-based ranking idea taken to a whole new, more refined level.

“If you’re looking for papers in data mining, how do you select papers to read?” Nie asks. “You may want to read some papers because they are cited by many other papers. You may also want to read a paper if it is written by a famous researcher. You may also want to read a paper if it was published in conjunction with a high-quality conference. Because of all of these reasons, a paper may become popular. But our research was about the more important factor: What determines the popularity of a paper?”

The Object-Level Vertical Search process involves a series of steps:

  • Web Crawling: collecting relevant information on the Web efficiently.
  • Classification: determining whether a page contains information on products, papers, people, or some other desired category.
  • Extraction: pulling specific information about the search query from the relevant Web pages. For a product, for instance, that could mean product name, brand, image, description, and price.

“After this stage,” Wen says, “you can see that we have transformed the Web-page information into structured data.”

  • Integration: combining the gathered object information into a concise whole. This includes resolving Web-page idiosyncrasies and naming conventions and making sure that similarly named objects are integrated only if they relate to the actual object being sought.
  • Ranking: There are two types of ranking. One, static rank, is handled well by the PopRank algorithm. The second, relevance, is trickier, because an object might be popular, but irrelevant to the query at hand. Because the object description is integrated from multiple Web pages, developing a ranking mechanism is a challenge.
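The last two steps can be sketched in a few lines. This is a deliberately naive illustration, not the researchers' method: the record fields, the sample people, the name-normalization rule, the term-overlap relevance measure, and the alpha mixing parameter are all assumptions. The article says the real extraction and integration algorithms rely on machine-learning techniques.

```python
from collections import defaultdict


def normalize(name):
    """Collapse case, periods, and extra spaces so name variants match."""
    return " ".join(name.lower().replace(".", " ").split())


def integrate(records):
    """Merge records extracted from different pages into one object per name."""
    merged = defaultdict(dict)
    for rec in records:
        key = normalize(rec["name"])
        for field, value in rec.items():
            # Keep the first value seen for each field.
            merged[key].setdefault(field, value)
    return dict(merged)


def rank(objects, query, popularity, alpha=0.5):
    """Order objects by a mix of query relevance and static popularity."""
    terms = set(query.lower().split())

    def score(key):
        text = " ".join(str(v) for v in objects[key].values()).lower()
        words = set(text.split())
        relevance = len(terms & words) / len(terms) if terms else 0.0
        return alpha * relevance + (1.0 - alpha) * popularity.get(key, 0.0)

    return sorted(objects, key=score, reverse=True)


# Hypothetical records, as if extracted from three different Web pages.
records = [
    {"name": "Jane Doe", "email": "jane@example.org"},
    {"name": "jane  doe", "phone": "555-0100", "topics": "data mining"},
    {"name": "John Roe", "topics": "databases"},
]
objects = integrate(records)
popularity = {"jane doe": 0.2, "john roe": 0.9}  # assumed static ranks
ordering = rank(objects, "data mining", popularity)
print(ordering)  # ['jane doe', 'john roe']
```

The two records for Jane Doe merge into a single object holding her e-mail address, telephone number, and topics. John Roe has the higher assumed static rank but does not match the query, so he falls to second place, which is the popular-but-irrelevant case described above. Matching on names alone would also merge two different people who share a name, which is one reason the integration step is hard.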

“The most important part,” Nie says, “is extraction and integration.”

Adds Wen: “Web pages are very diverse. The same information is encoded with different formats on different Web pages. Our extraction and integration algorithms have to be very robust to deal with the variety of Web pages. But the good news is that we have developed a very good algorithm to do this, based on some machine-learning techniques.”

Wen and Nie continue to refine their techniques to provide an even more effective search experience.

“We are currently working with the MSN® search team and shopping team to build better search technologies,” Nie confirms. “We need to focus on two things. The first is to continue to improve the extraction and the integration accuracy. Actually, they want us to do it perfectly! This is challenging, but we are approaching that goal.

“The second thing is that, after we extract and integrate the objects from Web pages, we want to provide better ranking. Object ranking is different from page ranking, so we need to investigate this problem more deeply.

“And we want to apply our framework to more domains. We want to use this for blog search, restaurant search, job search. You can imagine that there are many, many instances in which this could be used.”

For the foreseeable future, the page-level search technique will continue to play a significant role in delivering search results to Internet users. Object-Level Vertical Search, though, adds another dimension to the evolution of search technology.

“I think they will co-exist,” Nie says. “The traditional Web search engine is still viable in terms of coverage, easier to use. We still need something like page-level search.

“But in our Object-Level Vertical Search engine, we have a new ranking algorithm. We have a whole new architecture to do search. We just hope that, in the future, if we expand our technology to more domains, we hope more and more users will use our search engines to get more precise and accurate information.”

Wen agrees.

“As we build more search engines for important vertical domains that enable users to get more precise answers and knowledge about objects in that domain,” he says, “more people will make a switch from general domain page-level Web search engines. That’s an ambitious goal.”