Speaker: Taylor Berg-Kirkpatrick
Host: Hoifung Poon
Date recorded: 6 November 2013
Documents from the printing-press era are difficult for OCR systems to transcribe because they are extremely noisy. However, the noise originates from processes that are causally understood. For example, thickened glyphs are caused by over-inking, and vertical offset is caused by slop in a mechanical baseline. We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of such documents. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and transcribe images into text more accurately. Overall, our approach gives state-of-the-art results on two datasets of historical document images.
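To make the idea of a generative rendering process concrete, here is a minimal toy sketch in Python. It is not the model presented in the talk: the 5x5 glyph template, the function names (`thicken`, `shift_down`, `render`), and the parameters (`p_overink`, `max_offset`, `p_flip`) are all assumptions made up for illustration. It only samples in the forward direction: pick a latent inking level and a latent vertical offset, render the glyph, then add pixel noise.

```python
import random

# Hypothetical 5x5 binary template for one glyph (1 = ink); not from the talk.
TEMPLATE = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
]


def thicken(glyph):
    """Toy over-inking: a pixel is inked if it or a 4-neighbour is inked."""
    h, w = len(glyph), len(glyph[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and glyph[rr][cc]:
                    out[r][c] = 1
                    break
    return out


def shift_down(glyph, offset):
    """Toy baseline slop: move the glyph by `offset` rows (positive = down)."""
    h, w = len(glyph), len(glyph[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        src = r - offset
        if 0 <= src < h:
            out[r] = list(glyph[src])
    return out


def render(glyph, rng, p_overink=0.3, max_offset=1, p_flip=0.02):
    """Sample one noisy rendering and return it with its latent variables."""
    overinked = rng.random() < p_overink          # latent inking level
    offset = rng.randint(-max_offset, max_offset)  # latent vertical offset
    img = thicken(glyph) if overinked else [row[:] for row in glyph]
    img = shift_down(img, offset)
    # Independent pixel flips stand in for residual, unexplained noise.
    img = [[px ^ (rng.random() < p_flip) for px in row] for row in img]
    return img, overinked, offset


if __name__ == "__main__":
    image, overinked, offset = render(TEMPLATE, random.Random(0))
    print("overinked:", overinked, "offset:", offset)
    for row in image:
        print("".join("#" if px else "." for px in row))
```

A transcription system built on such a model would work in the opposite direction, inferring the text and the latent noise variables from the observed image. The sketch above shows only how structured noise sources like inking and offset can be written down as explicit random variables.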
©2013 Microsoft Corporation. All rights reserved.