WordBreaker4Web

Word breaking is a well-established natural language processing (NLP) task. It is important for processing many European languages that use compound words (e.g., German, Dutch, Greek), and is critical for East Asian languages (e.g., Chinese, Japanese, Korean) whose writing systems do not use white space to mark word boundaries.

The Web has raised the need for word breaking to a new height. For example, because the URL format does not allow white space, we are all forced to concatenate words when specifying, say, file paths or domain names. The hashtag convention in tweets is another example. On the web, we regularly need to parse the domain name "247moms" as "twenty-four seven moms" (a salute to all the moms in the world!), and must take care not to misunderstand what the web site called "penisland" is all about.

As a byproduct of developing statistical NLP technologies to understand search queries, we found that the same techniques can also be applied to word breaking for "web languages". Below is a demo we showed at WWW-2010 and NAACL/HLT-2010; a detailed description can be found in the WWW-2011 paper. Just type in your concatenated string (i.e., one with no spaces or other obvious word boundary markers) and click the "Break" button. The app will show the top few plausible results along with their log probabilities (base 10).
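To give a feel for what a ranked list of breaks with base-10 log probabilities looks like, here is a minimal Python sketch. It is not the demo's implementation: it assumes a toy unigram model with made-up log10 probabilities (TOY_LOGP), a hypothetical per-character penalty for unknown words (UNKNOWN_LOGP_PER_CHAR), and a hypothetical function name (top_breaks). It keeps the k best segmentations of every prefix and extends them one word at a time.

```python
import heapq

# Hypothetical toy unigram log10 probabilities. These values are invented
# for illustration and do not come from any real N-gram service.
TOY_LOGP = {
    "pen": -3.0,
    "island": -3.5,
    "penis": -4.5,
    "land": -3.2,
    "is": -2.0,
}

# Assumed penalty (log10) charged per character of an out-of-vocabulary word.
UNKNOWN_LOGP_PER_CHAR = -5.0


def top_breaks(text, k=3, max_word_len=12):
    """Return up to k (segmentation, log10 score) pairs, best first."""
    # best[i] holds the k best (score, words) pairs covering text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in TOY_LOGP:
                word_score = TOY_LOGP[word]
            else:
                word_score = UNKNOWN_LOGP_PER_CHAR * len(word)
            # Extend every kept segmentation of the prefix with this word.
            for score, words in best[start]:
                candidates.append((score + word_score, words + [word]))
        best[end] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return [(" ".join(words), score) for score, words in best[-1]]


if __name__ == "__main__":
    for segmentation, score in top_breaks("penisland"):
        print(f"{score:7.2f}  {segmentation}")
```

With these toy numbers, "pen island" ranks above "penis land" only because of the invented probabilities; a real model would base the ranking on web-scale counts and on longer N-gram context rather than on single words.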

The demo is powered by Microsoft Web N-gram Services. How fast and smoothly the progress bar moves depends on your internet connection. The N-gram model is trained on web documents indexed by Bing in the EN-US market but, as pointed out in the NAACL/HLT paper, it seems to understand languages other than English to some extent (e.g., even Chinese pinyin written in the Roman alphabet). Also, the N-gram model follows Bing's tokenization, in which punctuation marks, dollar signs and even apostrophes are all treated as spaces. Please do not be surprised if the word breaking results show you something like "don t surprise".
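The effect of that tokenization can be approximated with a short Python sketch. This is only a guess at the behavior described above, not Bing's actual tokenizer: the function name approx_tokenize and the exact character classes are assumptions, and details such as case folding are deliberately left out because the text does not specify them.

```python
import re


def approx_tokenize(text):
    """Replace punctuation, dollar signs, apostrophes, etc. with spaces, then split.

    Assumption: every character that is not a letter, digit, or white space
    is treated as a space. The real tokenizer may differ.
    """
    spaced = re.sub(r"[^\w\s]|_", " ", text)
    return spaced.split()


if __name__ == "__main__":
    print(" ".join(approx_tokenize("don't surprise")))
```

Under this assumption the apostrophe in "don't" becomes a word boundary, which is why a lone "t" can show up as a separate token in the results.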

Feedback and Questions?
