By Rob Knies
July 21, 2008 5:59 AM PT
Mapping has become one of the killer apps of the Internet era. It has become practically second nature to do a quick Web search for directions before heading to an unfamiliar destination. It’s so easy: Just plug in your address and where you want to go, and voilà, you’re practically there.
Well, maybe you are, if you live in places, such as the United States, where addresses have a structured form—street number, street name, city, state, postal code. Geocoding applications can parse that data to produce useful results.
But what if you live in a part of the world where addresses do not conform to such regimented structures? What if the key detail in an unstructured address is a spatial relationship to a nearby landmark? What if the address information is incorrectly typed? Geocoding applications are ill-equipped to cope with such instances.
Most of them, anyway, but not Robust Location Search. A research project from Microsoft Research India, it works across multiple countries via an ingenious technique that quickly whittles down a set of potential results until the correct one becomes apparent. And it can even handle incorrectly typed addresses and transliteration across different alphabets.
“If a person is entering text that has some reference to locations, an address or something more informal, we want to find out where it is on a map and not get confused if there are extraneous terms or conflicting or wrong information,” says Joseph Joy, principal architect at the India lab. “That’s the problem we’re trying to solve, and we’ve been really quite successful.”
Joy, working with colleagues Vibhuti Sengar and Tanuja Joshi, has devised a unique approach to the challenge of defining and locating unstructured addresses within a geographical space. By adopting an approach that recasts the research question at hand, they have developed a technique that rewrites the rules for mapping search and directions—and could prove useful in other domains, as well.
The problem is simple: There is no single address format that applies to all nations of the world. No global schema can be applied, and there are many exceptions to any “typical” format. Consider a sample address not that far from the Microsoft Research India facility:
2nd main 10th cross, Indira Nagar 1st stage, near RTO office Bangalore 560038
Try typing that one into your favorite mapping service. It’s obvious you won’t get very far.
In addition, users have a vexing tendency to misspell words, to omit information, or to alter the order of items even in more structured addresses. Such well-intentioned efforts can wreak havoc with existing systems, which depend on rule-based parsing into a formal schema, either manually or learned by training, to issue a query to a custom database. Specialized rules are necessary for each distinct region, and even those are unable to cope with unstructured or ill-formed input.
Joy, Sengar, and Joshi avoid those obstacles by applying a combination of textual similarity, defined by analysis of the similarity of words in a query with those found on maps, and spatial coherence, the degree of spatial overlap between potential geographic matches and places exhibiting textual similarity to the query. The result of the search is a list of possible interpretations. All this is accomplished, by the way, in a matter of milliseconds.
“We treat queries as plain text,” Joy says, “and we make minimal assumptions about the structure of that text. The first step is to obtain a large number of possible names that subsequences of the text may be referring to by using approximate text search. We identify similar names in the geographic database. We now have to find which combination of names produce the best geographical localization.
“The challenge,” Joy adds, “is how do you find the subset efficiently? There could easily be over a billion different ways you could put these combinations together. Our key insight is to use spatial overlap operations to prune candidates in a depth-first search for viable combinations.
This insight is incorporated into an algorithm called TEXSPACE, the core of the Robust Location Search system of efficiently finding viable query interpretations from a large number or potential query solutions.
TEXSPACE and its evaluation are explained in detail in a paper called Robust Location Search from Text Queries, written by Joy along with colleagues Joshi, Sengar, Samarth Prakash, and Kentaro Toyama. The paper was presented in November during the Association for Computing Machinery’s (ACM’s) 15th annual International Symposium on Advances in Geographic Information Systems (GIS).
“Because we do some pre-computing of approximate representations in space of these entities,” Joy explains, “spatial intersection overhead is very low. The key innovation is that we are able to rapidly converge on promising solutions by leveraging information from the spatial domain.”
The system, designed to be accessed via a Web service built atop Live Search Maps, relies on Microsoft Fuzzy Lookup technology, which provides error-tolerant text matching based on work from the Data Management, Exploration and Mining (DMX) team within Microsoft Research Redmond.
“Vivek Narasayya, a senior researcher in Redmond, happened to be visiting our lab,” Joy recalls, “and describing some of the work done by the DMX group, and that included being able to match comparable addresses. We were able to use their approximate text-searching technology, along with our algorithm, to build a complete solution.”
The project began in the summer of 2006, largely influenced by the surroundings in which Joy found himself. Having worked for 14 years in product development, mostly in the Microsoft Windows Operating System division, he found Microsoft Research’s opening of a lab in his native India irresistible and moved to Bangalore in 2005.
“The initial motivation for this project,” he says, “was that you can’t do geocoding of addresses in India because of the complexity and messiness of addresses. We decided to tackle this in an unusual way.”
That unconventional approach was used to build a prototype to test the technique using addresses in three cities with widely varying address formats: Greater Seattle, London, and Bangalore. The elements for analysis were points, representing landmarks; polylines, representing roads; and polygons, representing locality boundaries. About 200,000 entities were included in the study.
The prototype was evaluated against commercial systems for addresses that included errors and unstructured elements, and performance was found to be significantly better against commercial geocoding systems when errors or ambiguities are introduced. The results were published in the ACM GIS paper.
“We have, as far as we know, the best technology available for a place like India,” Joy smiles. “We are much better able to tackle complex addresses. We are also distinctly better when it comes to handling errors.
“We’ve taken a problem which is fairly old, a problem that people have built industries around, and we’ve found a completely different way of solving it.”
The Robust Location Search results have not gone unnoticed within Microsoft. The Windows Live Local team has been evaluating the technology. And given the nature of the project, the benefits could prove wide-ranging.
“I’m pretty excited by the prospects,” Joy says. “One great benefit of this is that it doesn’t take us any time to customize it to any particular area, because we don’t rely on any training or handcrafted rules.”
That means the entire planet is within the project’s scope. And Joy and colleagues aren’t kidding.
In fact, Joshi, Sengar, and Joy have collaborated with A Kumaran and Tobias Kellner of Microsoft Research India’s Multilingual Systems group, as well as Udayan Khurana of the Microsoft India Development Center, on cross-lingual address search, where the query language can be different than the language of the underlying data. The team combined a machine-learning-based automatic transliterator and fed the results to TEXSPACE. A paper on their work, titled Crosslingual Location Search, will be presented during the ACM’s SIGIR 2008 conference on information retrieval from July 20-24.
Not only that, but the techniques invoked are not necessarily limited to the geospatial realm.
“Thinking longer term,” Joy says, “our core algorithms are not even specific to two-dimensional, map-area domains. We are on the lookout for other, completely different applications.”
Other collaborators on the Robust Location Search project included Venky Ganti and Kris Ganjam, also of the DMX team. Joy is delighted that the project demonstrates how Microsoft Research India, and the Advanced Development and Prototyping group in particular, are able to deliver value.
“This project is a great example,” he says, “of how our group is able to spot a business need and combine our own algorithms with technologies from across Microsoft Research and do the additional engineering we need to get to a compelling solution.”