Unsupervised Transcription of Historical Documents

Speaker  Taylor Berg-Kirkpatrick

Host  Hoifung Poon

Duration  00:47:07

Date recorded  6 November 2013

Printing-press era documents are difficult for OCR systems to transcribe because these documents are extremely noisy. However, the noise originates from processes that are causally understood. For example, thickened glyphs are caused by over-inking, and vertical offset is caused by slop in a mechanical baseline. We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our approach gives state-of-the-art results on two datasets of historical document images.

©2013 Microsoft Corporation. All rights reserved.