Document Processing and Understanding

Overview

Document manipulation is a major business for Microsoft. Our research focuses on developing new technologies to understand, recover, and generate documents, both in printed form and in electronic ink. Examples of printed document technologies include recovery of electronic documents from their bitmap versions (reviving "dead bits"), document compression, and automatic generation of document layouts. For electronic ink, our focus is on handwriting recognition, shape recognition, understanding free-form handwritten notes, and parsing annotations.

People

Primary Contact: Patrice Simard



Jacobs, Chuck
Luu, Chau
Steinkraus, Dave

 

Affiliate Members


   
 
Projects

Layout analysis: Layout analysis is used to discover the intended structure (words, lines, blocks, pictures, etc.) of a document. Our approach relies on robust statistics and multiple passes, and it does not depend on language. For printed text, layout analysis is used for OCR (Optical Character Recognition), compression, reflow, and re-purposing.
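To make the idea of language-independent, statistics-driven grouping concrete, here is a minimal Python sketch. It is not the group's algorithm: the box format `(x, y_top, width, height)`, the function name `group_into_lines`, and the half-height tolerance are all assumptions chosen for illustration. It uses the median box height as a robust estimate of text size, so that one large picture does not distort the scale used to group words into lines.

```python
from statistics import median


def group_into_lines(boxes):
    """Group word boxes (x, y_top, width, height) into text lines.

    Hypothetical sketch: the median height serves as a robust scale,
    and a box joins the current line when its vertical center lies
    within half that height of the line's center.
    """
    if not boxes:
        return []
    # The median is barely affected by a few very large boxes (pictures).
    typical_height = median(h for _, _, _, h in boxes)
    lines = []
    # Visit boxes from top to bottom by vertical center.
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        center = box[1] + box[3] / 2
        if lines and abs(center - lines[-1]["center"]) <= typical_height / 2:
            line = lines[-1]
            line["boxes"].append(box)
            # Re-estimate the line center robustly from its members.
            line["center"] = median(b[1] + b[3] / 2 for b in line["boxes"])
        else:
            lines.append({"center": center, "boxes": [box]})
    # Within each line, order boxes left to right.
    return [sorted(line["boxes"], key=lambda b: b[0]) for line in lines]


words = [(50, 10, 30, 12), (0, 11, 40, 12), (0, 40, 35, 12),
         (40, 41, 20, 12), (0, 100, 200, 150)]  # last box: a picture
lines = group_into_lines(words)
```

Nothing in the sketch looks at character shapes or dictionaries; it works on geometry alone, which is what makes this style of grouping independent of language.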

Compression: SLIm (Segmented Layered Image) is a new document format used to analyze and compress color documents. SLIm separates a document image into two layers, a foreground and a background, plus a mask that indicates for each pixel whether it belongs to the foreground or the background. Text and graphic lines (or structure) are captured in the mask and compressed using the BLC (Binary Level Codec) format, while the continuous color information is encoded in the foreground and the background using the PTC (Progressive Transform Codec) format. This separation not only yields compression factors up to 10X better than JPEG for images that contain mostly text, it also enables the basic document understanding that is necessary for operations such as OCR, reflow, and re-purposing.
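The mask-plus-two-layers decomposition can be illustrated with a small Python sketch. This is not the SLIm, BLC, or PTC implementation: the fixed grayscale threshold, the function names `split_layers` and `recompose`, and the choice to fill "don't care" pixels with the layer average are assumptions made only to show the structure of the representation.

```python
from statistics import mean


def split_layers(image, threshold=128):
    """Split a grayscale image (list of rows) into mask, foreground, background.

    Hypothetical sketch: pixels darker than `threshold` are treated as
    foreground (text, lines). Pixels a layer does not own are filled
    with that layer's average, which keeps each layer smooth.
    """
    mask = [[1 if px < threshold else 0 for px in row] for row in image]
    pairs = [(px, m) for row, mrow in zip(image, mask)
             for px, m in zip(row, mrow)]
    fg_values = [px for px, m in pairs if m]
    bg_values = [px for px, m in pairs if not m]
    fg_fill = round(mean(fg_values)) if fg_values else 0
    bg_fill = round(mean(bg_values)) if bg_values else 255
    foreground = [[px if m else fg_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[bg_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return mask, foreground, background


def recompose(mask, foreground, background):
    """Rebuild the image by picking each pixel from the layer the mask selects."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


page = [[250, 10, 245],
        [20, 240, 15]]
mask, fg, bg = split_layers(page)
restored = recompose(mask, fg, bg)
```

In this toy version the round trip is exact. In a real codec each of the three parts would be compressed separately, the binary mask with a bi-level coder and the two smooth layers with a continuous-tone coder, which is where the gain over a single-layer format comes from.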

Handwriting Recognition: Recognition is used to infer the text content (assign Unicode characters to squiggles). Our approach is OCR-based and uses convolutional neural networks. We currently hold the world's best performance on the MNIST database (handwritten digits collected by the National Institute of Standards and Technology, or NIST). Our technology is used for the recognition of one- and two-stroke Japanese characters in Tablet PC. We are currently developing an image-based cursive recognizer for Roman-script languages. This work is done in close collaboration with the handwriting recognition product group.
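For readers unfamiliar with convolutional networks, the following Python sketch shows the core operation of a single convolutional layer: sliding a small kernel over an image and applying a nonlinearity. It is only an illustration, not the recognizer described here. The kernel values, the ReLU nonlinearity, and the function names are assumptions, and a real network learns many kernels and stacks several such layers.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) without padding.

    As in most neural-network code, this is technically a
    cross-correlation: the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# A 4x4 image containing one vertical stroke in column 1.
stroke_image = [[0, 1, 0, 0] for _ in range(4)]
# A hand-picked kernel that responds to a dark-to-bright step, left to right.
edge_kernel = [[-1, 1],
               [-1, 1]]
features = relu(conv2d_valid(stroke_image, edge_kernel))
```

The resulting feature map is nonzero only where the left edge of the stroke sits. Because the same kernel is applied at every position, the response follows the stroke wherever it is drawn, a property that suits image-based recognition of handwriting.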

Cursive Parsing: Free-form note taking presents a particularly difficult challenge when it comes to recognition, selection, and reflowing. We work closely with the Tablet PC group to develop robust cursive parsing algorithms that tolerate the inherent variability and diversity of human note taking. These algorithms are similar to the printed layout analysis algorithms, except that they also take advantage of the timing information available in electronic ink.
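One way timing information can complement geometry is sketched below in Python. This is an illustrative assumption, not the parsing algorithm developed with Tablet PC: the `Stroke` fields, the function `group_strokes`, and both thresholds are hypothetical. Strokes are merged into the same group only when they are close in both time and horizontal position.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """Hypothetical ink stroke: pen-down/pen-up times and horizontal extent."""
    start_ms: int
    end_ms: int
    x_min: float
    x_max: float


def group_strokes(strokes, max_pause_ms=600, max_gap=30.0):
    """Group strokes written in quick succession and near each other.

    A stroke joins the current group when the pause since the previous
    stroke and the horizontal gap to it are both small. Thresholds are
    illustrative values, not tuned ones.
    """
    groups = []
    for stroke in sorted(strokes, key=lambda s: s.start_ms):
        if groups:
            prev = groups[-1][-1]
            pause = stroke.start_ms - prev.end_ms
            # Horizontal distance between extents; 0 when they overlap.
            gap = max(stroke.x_min - prev.x_max,
                      prev.x_min - stroke.x_max, 0.0)
            if pause <= max_pause_ms and gap <= max_gap:
                groups[-1].append(stroke)
                continue
        groups.append([stroke])
    return groups


ink = [
    Stroke(0, 200, 0.0, 10.0),
    Stroke(300, 500, 12.0, 20.0),       # quick and adjacent: same group
    Stroke(2000, 2200, 22.0, 30.0),     # adjacent, but after a long pause
    Stroke(2300, 2500, 200.0, 210.0),   # quick, but far away
]
groups = group_strokes(ink)
```

A purely geometric grouping would merge the third stroke with the first two; the pause is what separates them here. Printed documents carry no such signal, which is the difference from printed layout analysis noted above.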

Annotations: Annotation of printed documents is a particularly useful feature for many applications. In collaboration with Tablet PC, we have developed an annotation recognition engine, which recognizes basic annotation marks such as margin signs, selections, underlining, and textual insertions.

 
Product Contributions

Here are some of the Microsoft products in which our technologies are found:

  • Compression: Letter clustering, layout analysis, and noise cleaning for BLC (Binary Level Codec). Ships with Tablet PC and with Microsoft Office Document Imaging.
  • Cursive Layout Parsing: Close collaboration and IP transfer on cursive layout parsing for the Journal application. Ships with Tablet PC.
  • Handwriting recognition: Neural-network-based recognizer for Japanese characters with one or two strokes. Ships with Tablet PC.


©2008 Microsoft Corporation. All rights reserved.