By Rob Knies
May 23, 2007 2:00 PM PT
The breathtaking ascendance of Internet search over the past decade has tended to obscure the limitations of the underlying technology. So quickly has search been embraced by hundreds of millions worldwide that it is entirely natural for people to spend more time marveling over what they’ve gained than focusing on the potential for improvement.
Luckily, though, while users revel in their unprecedented access to information, Silviu-Petru Cucerzan has his sights trained on the horizons of search.
Cucerzan, a researcher for the Text Mining, Search and Navigation group within Microsoft Research Redmond, is working on an approach he calls Information-Centric Browsing and Search, and his work promises to make the search experience much more robust, productive, and user-friendly.
His project explores the space of contextual search and instant access to information, in which the content of a document is analyzed to provide hyperlinks to key concepts. In the process of doing so, it identifies the most appropriate matches for ambiguous terms by considering them in the context in which they are used.
“I believe this could change a lot of what we’re doing,” Cucerzan says, “a lot of how search is done in general.”
Let’s say you’re reading a Web story on college football that mentions a running back named Bush winning the Heisman Trophy. You want to know a bit more, so you begin a Web search for “Bush.” What happens? You get a multitude of search results, few of which have anything to do with the football player.
But what if you had a tool that could analyze the story you’re reading, understand that the Bush you’re seeking information about is Reggie Bush, who played football for the University of Southern California, and deliver only links about that player, along with other pertinent information, such as the USC team itself, the conference in which it plays, or his 2005 Heisman award?
With such a tool at hand, the search process moves from looking for a needle in a haystack to looking for a needle in a pincushion. You get what you need.
“This would enable applications to communicate with each other in a space of concepts,” Cucerzan says. “Right now, applications do not communicate with each other in terms of information. If I browse a document in a window and then go to a search engine, the context is completely lost. There’s no communication between different applications, even between different instances of the same application. My search engine is completely unaware of the kind of document I’m reading or the document I’m editing or whatever I have on my computer.”
Part of the disconnect is that we use search in different ways at different times. The way we search during work can be entirely different from the way we search during our leisure time.
“We shift contexts a lot,” Cucerzan says. “The fact that I’m reading a lot of machine-learning documents doesn’t mean that all that stuff is relevant when I read news. It depends on which persona is using the system.
“The most important thing, to me, is the current context. If I’m reading a news story and I query something, that’s probably what the query is about.”
That’s the motivation for his work.
“I was trying,” he says, “to create some technology to bias search-engine results by making the engine aware of what I’m currently doing. What exactly is the additional information that one has to send to the search engine? Just by looking at a document I’m browsing, it’s pretty difficult to say. But if we’re in a space of concepts and we can predict what the most important concepts from the document are, in absolute terms or with respect to a query, then the search engine or any other application could ask for these concepts and use them to meet the user’s informational needs as captured by the current context.
“It gives really good results,” Cucerzan says of his technology, “especially for ambiguous queries—and a lot of queries are ambiguous. It takes you from generic results to results that look beautiful in a particular context.”
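One way to picture the biasing Cucerzan describes is a re-ranking step that favors results mentioning the concepts found in the document being read. The sketch below is only an illustration under stated assumptions: the function name `rerank`, the result fields `title` and `snippet`, and the sample data are all hypothetical, and simple substring counting stands in for whatever scoring the actual system uses.

```python
def rerank(results, context_concepts):
    """Order results so those mentioning more context concepts come first.

    Hypothetical scoring: count how many concept strings appear in the
    title or snippet. sorted() is stable, so ties keep their original order.
    """
    def score(result):
        text = (result["title"] + " " + result["snippet"]).lower()
        return sum(1 for concept in context_concepts if concept.lower() in text)

    return sorted(results, key=score, reverse=True)


# Illustrative results for the ambiguous query "Bush" (made-up sample data).
results = [
    {"title": "George W. Bush", "snippet": "Biography of the U.S. president"},
    {"title": "Bush (band)", "snippet": "British rock band"},
    {"title": "Reggie Bush", "snippet": "USC running back and Heisman Trophy winner"},
]

# Concepts assumed to have been extracted from the story being read.
context = ["USC", "Heisman Trophy", "college football"]

reranked = rerank(results, context)
print([r["title"] for r in reranked])
```

With this sample input the football player moves to the top, while the two results that match no context concept keep their original relative order.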
The project offers a novel user interface to analyze a document and provide contextually relevant search results. An enhanced browser view is divided into two panes and adds a few specialized buttons. The pane on the left displays the document being viewed. A button on the address bar enables the user to process the document for contextual analysis. The right-hand pane offers relevant information from authoritative collections and relevant Web news and image search results. Other buttons enable the user to toggle on or off the search-contextualization and query-disambiguation features, if desired.
Once the process button is pressed, the tool analyzes the document, identifies key concepts, and links those concepts to the appropriate Web pages. Having seen words such as “football” and “USC” in the same story as “Bush,” the tool retrieves links to pages about Reggie, not George W. And, after the analysis, if ambiguity remains about precisely which meaning a term has, a list of associations appears, from which the user can select the most appropriate association and receive the search results he or she is seeking.
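The disambiguation step just described can be sketched in a few lines. This is a toy illustration, not the project's method: the candidate list, the context words attached to each candidate, and the function names are assumptions made for the example, and plain word overlap stands in for the real scoring. When the top candidates tie, the function returns None, mirroring the case in which the user is shown a list of associations to choose from.

```python
import re

# Hypothetical candidate concepts for the ambiguous term "Bush".
# The context words are illustrative, not taken from the actual system.
CANDIDATES = {
    "Reggie Bush": {"football", "usc", "heisman", "running", "trophy"},
    "George W. Bush": {"president", "white", "house", "texas", "election"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank_candidates(document, candidates):
    """Score each candidate by how many of its context words occur in the document."""
    words = tokenize(document)
    scores = {name: len(words & context) for name, context in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def disambiguate(document, candidates):
    """Return the best candidate, or None when the top scores tie."""
    ranked = rank_candidates(document, candidates)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # still ambiguous: a real tool would ask the user
    return ranked[0][0]


story = "USC running back Bush won the Heisman Trophy after a dominant football season."
print(disambiguate(story, CANDIDATES))
```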
The preferred results are retained for a particular document, giving the user a personalized Web resource for any analyzed document. This can be particularly effective for amassing a collection of concept-based bookmarks.
“They become active as the concept becomes active in context,” Cucerzan explains. “Now, I have about 300 or 400 bookmarks. I’m afraid to bookmark one more page, because I know I’m not going to be able to find it and it will make it harder to find anything else.
“With this browser,” he adds, “the only bookmarks that are active are those for which the concepts are active. That makes a huge difference.”
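The idea of bookmarks that become active only when their concept is in context can be modeled as a store keyed by concept rather than by folder. The following is a minimal sketch under assumptions: the class name `ConceptBookmarks`, its methods, and the example.com URLs are hypothetical and say nothing about how the actual browser stores its data.

```python
from collections import defaultdict


class ConceptBookmarks:
    """Hypothetical bookmark store that files pages under concepts."""

    def __init__(self):
        self._by_concept = defaultdict(list)

    def add(self, concept, url):
        """File a page under a concept, ignoring duplicates."""
        if url not in self._by_concept[concept]:
            self._by_concept[concept].append(url)

    def active(self, document_concepts):
        """Return only the bookmarks whose concept occurs in the current document."""
        return {
            concept: list(self._by_concept[concept])
            for concept in document_concepts
            if concept in self._by_concept
        }


store = ConceptBookmarks()
store.add("Reggie Bush", "https://example.com/reggie-bush")
store.add("Machine learning", "https://example.com/ml-notes")

# Concepts assumed to have been recognized in the story being read.
print(store.active(["Reggie Bush", "USC"]))
```

Only the football bookmark surfaces here; the machine-learning bookmark stays out of the way until a document about that concept is open.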
Instead of having to sift through a collection of irrelevant search results and a plethora of bookmarks not currently pertinent, Cucerzan’s technology delivers tailor-made information to the user when the user needs it.
“It’s really nice to have all that information at your fingertips,” he says. “This is very powerful. I can create my personalized view of the Web, based on concepts.”
That personalization is saved on the user’s computer, so the next time a similar document is analyzed, the same conceptual links can be invoked.
The possibilities for such a scenario are many and varied, but consider a couple. What if lots of users chose to share their preferences and their contextual searches?
“If we were to collect this information from a lot of users,” Cucerzan suggests, “the system could learn from them and get better and better. Of course, we’d need an agreement from the users, but if they knew that they could help improve the system to their own and other users’ benefit, I am sure most would agree to provide implicit feedback.”
Another possible use could have an even more far-reaching effect.
“What if 70 percent of the people that have at least one bookmark for Reggie Bush have one particular page bookmarked?” Cucerzan asks rhetorically. “Then, when somebody searches for ‘Reggie Bush,’ what is the best page to show up at the top? At that point, I may not care about other algorithmic search results. I know that hundreds of thousands of people have bookmarked that page in their personalized view of the Web, and I will trust their judgment.
“It could change the paradigm of search if we had bookmarks on this growing space of concepts from a tremendous number of people. Basically, people would be voting with their bookmarking clicks on what’s important on the Web.”
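The voting idea can be made concrete with a small tally. This is a sketch under assumptions, not a description of any deployed ranking: the function name `top_bookmarked`, the data layout (user to concept to list of URLs), and the sample users are invented for illustration. It computes, among users with at least one bookmark for a concept, the share who bookmarked the most popular page.

```python
from collections import Counter


def top_bookmarked(user_bookmarks, concept):
    """Return (url, share) for the page bookmarked by the most users.

    The share is computed over users who have at least one bookmark
    for the concept. Returns None if nobody has bookmarked it.
    """
    votes = Counter()
    voters = 0
    for bookmarks in user_bookmarks.values():
        urls = set(bookmarks.get(concept, []))  # one vote per user per page
        if urls:
            voters += 1
            votes.update(urls)
    if voters == 0:
        return None
    url, count = votes.most_common(1)[0]
    return url, count / voters


# Made-up sample data: four users who opted in to sharing bookmarks.
users = {
    "u1": {"Reggie Bush": ["https://example.com/a"]},
    "u2": {"Reggie Bush": ["https://example.com/a", "https://example.com/b"]},
    "u3": {"Reggie Bush": ["https://example.com/a"]},
    "u4": {"Reggie Bush": ["https://example.com/c"], "USC": ["https://example.com/d"]},
}

print(top_bookmarked(users, "Reggie Bush"))
```

A search engine following the quote above could promote a page once its share crosses some threshold, such as the 70 percent Cucerzan mentions; the threshold itself is a design choice this sketch leaves open.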
Cucerzan has taken an interesting route to arrive at this point. His early interest in mathematics led to a fascination with computers, and while pursuing a bachelor’s degree in science at the University of Bucharest in his native Romania, he worked on optical character recognition, building a system that made it to the finals of the European Academic Software Award. That sparked an interest in natural-language processing, for which he received his Ph.D. from Johns Hopkins University. Upon joining Microsoft, he began working on query-log mining and information extraction, supplying a new spelling-correction technology for several Microsoft products and a tool for question answering for Encarta®.
His current project has been bolstered by the contributions of a couple of Microsoft Research colleagues, Mike Schultz and Robert Ragno. Schultz built the data infrastructure employed by the Information-Centric Browsing and Search tool. And Ragno built an application-programming interface to pull information from Windows Live™ Search.
The system uses pre-processed information from Wikipedia and Encarta and now features 1.6 million indexed concepts. It’s still growing, to the benefit of hundreds of people who are using it. And Cucerzan is looking for more.
“The experience wouldn’t be quite complete unless we also link it to the search box in Internet Explorer®,” he says. “It would be nice to get feedback from users who are using this on a daily basis.
“Everybody who sees it says, ‘Wow, I want this!’ That’s our feedback as of now: Everybody wants such a technology. But would people actually use it daily instead of their regular browser? We don’t know.”
What he does know is that the project is plowing new ground in the fields of search and text mining.
“What’s new here is the large scale of this concept recognition and disambiguation,” Cucerzan says. “There is no other system that can go to any document on the Web right now and extract all this stuff.
“The other exciting technology is the context-aware search. If we contextualize the search, we can get more relevant results, based on what the important concepts are in a document we are reading or editing.”
Blazing trails in the still-nascent years of Web search is rewarding.
“What I’m really happy about,” Cucerzan says, “is that it started from an idea that we weren’t sure was doable, because nobody else had done it before on such a scale.
“To get a fully implemented system that works and people can download and use as their browser—that’s a really nice thing.”