Hand-drawing Analysis

Martin Szummer, Markus Svensén, Christopher Bishop, Tom Minka

Introduction

Diagrams and graphics are most naturally drawn with a pen.  It has now become easy to capture pen input directly on a computer, with the spread of pen computers such as TabletPCs, PDAs, mobile phones with styluses and drawing tablets.  However, the tools for editing, searching and managing such electronic ink are still very limited.  The computer is poor at "understanding" the structure or meaning of drawing and writing.  Without such understanding it is difficult for the computer to offer intelligent editing of the ink: for example, it cannot recognise whether the user has been writing text or drawing a figure.  The main challenge is that users want the freedom to scribble almost anything with the pen, without going through menus or other constraints on input.   The resulting input consists of a seemingly infinite variety of types (maps, networks, graphs), in many different styles.  Every user is different!

Much pen input consists of mixed drawings containing both graphics and text.  Our research focuses on interpreting diagrams and drawings, and on the general structure and layout of ink.  For example, before we can attempt to recognize handwriting, we must first distinguish writing from drawing.  To make sense of a wide variety of diagrams, we recognize commonly occurring graphical elements, such as containers and connectors.

Machine learning approaches

In the past, people tried to craft ink recognizers by hand: for example, detecting boxes by looking for pen strokes forming corners or parallel lines.  Unfortunately, due to the wide variability between users, it is very difficult and tedious to make such recognizers accurate.  Machine learning is a more promising way to tackle the ink recognition problem.  It is relatively easy to collect real writing and drawing from users.  We gather many hand-drawn diagrams and other sketches from users, label them with the classes we want to recognize, and then apply new machine learning techniques to build recognizers.  Based on the user examples, our techniques learn which aspects or features of the pen strokes are important for recognizing each class, such as boxes.


Projects

Drawing Parsing


We develop learning algorithms for analysing sketches and diagrams, e.g., organization charts.  Such charts and other diagrams consist of containers and connectors, which we attempt to recognize. Our algorithms have the ability to learn from:

  • Context: A pen stroke has little meaning by itself; its meaning often depends on nearby strokes.  For example a straight pen stroke could form one side of a container, be part of a connector, or be the digit '1' or the letter 'l'.  Context is essential in disambiguating such cases. Our context model captures relations between spatially adjacent and temporally consecutive strokes.

  • Simultaneous grouping and classification: given a page of pen strokes, it is difficult to recognize the types of objects without knowing how the strokes are grouped into objects.  Conversely, it is difficult to group strokes before knowing what objects they form.  There is an analogous long-standing problem in computer vision, where it is challenging to recognize objects before segmenting the image.  We solve this "chicken and egg" problem by simultaneously grouping and classifying strokes.

  • Flexible features: our algorithms handle large numbers of correlated features.  We simultaneously include temporal information about the sequence of strokes, as well as spatial information such as distances, angles, and histograms of neighbor properties.

  • User feedback: we incorporate constraints from user interaction with the system; for example, the user may have corrected the system output, and the system will then use the correction when recognizing subsequent input.
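
As a rough illustration of the context relations described above, the sketch below links strokes that are temporally consecutive or spatially close.  The stroke representation (a list of (x, y) points in drawing order), the function names min_distance and context_edges, and the distance threshold max_dist are all illustrative assumptions, not details of our system.

```python
from itertools import combinations
from math import hypot

def min_distance(stroke_a, stroke_b):
    """Smallest point-to-point distance between two strokes."""
    return min(hypot(ax - bx, ay - by)
               for ax, ay in stroke_a
               for bx, by in stroke_b)

def context_edges(strokes, max_dist=10.0):
    """Return index pairs (i, j), i < j, of strokes assumed to be related.

    strokes: list of strokes in drawing order; each is a list of (x, y).
    max_dist: hypothetical threshold for spatial adjacency.
    """
    edges = set()
    # Temporal context: strokes drawn one after the other.
    for i in range(len(strokes) - 1):
        edges.add((i, i + 1))
    # Spatial context: strokes that come close to each other.
    for i, j in combinations(range(len(strokes)), 2):
        if min_distance(strokes[i], strokes[j]) <= max_dist:
            edges.add((i, j))
    return sorted(edges)

strokes = [
    [(0, 0), (50, 0)],          # top side of a box
    [(50, 0), (50, 30)],        # right side of the box
    [(200, 200), (260, 200)],   # a distant stroke, drawn next
    [(0, 0), (0, 30)],          # left side of the box, drawn last
]
print(context_edges(strokes))
```

In this toy page the left side of the box is drawn last, so only the spatial rule connects it to the top side; the temporal rule alone would have linked it to the distant stroke instead.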

We base our model on a Conditional Random Field (CRF).  This is a graphical model in which nodes correspond to ink strokes, and edges connect pairs of nodes whose dependencies (context) we want to model.  Thanks to the sparseness of drawings, we are able to perform exact inference in this model, using the junction-tree algorithm.  All this is detailed in our CRF paper, which includes noise-robust potentials.  Also read about new developments, including joint labeling and partitioning, Bayesian model averaging, and user feedback.
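
To make the idea of a CRF over strokes concrete, here is a minimal sketch of a pairwise model with one node per stroke and one compatibility term per edge.  It computes exact label marginals by brute-force enumeration, which stands in for the junction-tree algorithm and is only feasible for a handful of strokes.  The label set, the scores and the function name exact_marginals are assumptions made for illustration.

```python
from itertools import product
from math import exp

LABELS = ("container", "connector")

def exact_marginals(unary, edges, pairwise):
    """Exact per-stroke label marginals of a small pairwise CRF.

    unary: list of dicts, unary[i][label] = score of label for stroke i.
    edges: list of (i, j) stroke index pairs that share context.
    pairwise: dict, pairwise[(label_i, label_j)] = compatibility score.
    Brute-force enumeration replaces the junction-tree algorithm here.
    """
    n = len(unary)
    marginals = [{label: 0.0 for label in LABELS} for _ in range(n)]
    total = 0.0
    for labeling in product(LABELS, repeat=n):
        score = sum(unary[i][labeling[i]] for i in range(n))
        score += sum(pairwise[(labeling[i], labeling[j])] for i, j in edges)
        weight = exp(score)  # unnormalised probability of this labeling
        total += weight
        for i, label in enumerate(labeling):
            marginals[i][label] += weight
    return [{label: w / total for label, w in m.items()} for m in marginals]

# Strokes 0 and 1 look like container sides; stroke 2 is ambiguous alone.
unary = [
    {"container": 2.0, "connector": 0.0},
    {"container": 2.0, "connector": 0.0},
    {"container": 0.0, "connector": 0.0},
]
edges = [(0, 1), (1, 2)]
# Hypothetical compatibility: neighbouring strokes tend to share a label.
pairwise = {(a, b): (1.0 if a == b else 0.0) for a in LABELS for b in LABELS}

marginals = exact_marginals(unary, edges, pairwise)
print(marginals[2])
```

Stroke 2 has no evidence of its own, yet its marginal leans towards "container" because of its neighbour, which is the disambiguating role that context plays in the model.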


Text vs. Graphics

We have used machine learning techniques to separate pen strokes into text and 'graphics' strokes (sketches, diagrams, etc.).  We combined features of individual strokes with the temporal characteristics of stroke sequences, and we also considered using features of the 'gaps' between strokes, when the pen is not touching the writing surface.  All these features were extracted from real user data, collected at Microsoft Research in Cambridge.  We use the resulting model to make predictions for each stroke in a new stroke sequence, saying how likely it is to be either text or graphics.  Further information and results are available in our paper from IWFHR 2004.
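
One simple way to combine per-stroke predictions with temporal context is to smooth them along the stroke sequence.  The sketch below assumes that some classifier has already produced a probability of "text" for each stroke on its own, treats those probabilities as evidence with a uniform prior, and applies forward-backward smoothing over a two-state chain.  The function name smooth_text_probabilities and the stay probability of 0.8 are hypothetical choices for illustration; the model in the paper may differ.

```python
def smooth_text_probabilities(p_text, stay=0.8):
    """Combine per-stroke text probabilities with temporal context.

    p_text: list of P(text) values, one per stroke, in drawing order.
    stay: assumed probability that consecutive strokes share a class.
    Returns the posterior P(text) per stroke under a two-state chain.
    """
    n = len(p_text)
    if n == 0:
        return []
    # Evidence for (text, graphics) at each stroke.
    evidence = [(p, 1.0 - p) for p in p_text]
    trans = ((stay, 1.0 - stay), (1.0 - stay, stay))

    def normalise(pair):
        total = pair[0] + pair[1]
        return (pair[0] / total, pair[1] / total)

    # Forward pass: evidence from strokes up to and including stroke i.
    forward = [normalise(evidence[0])]
    for i in range(1, n):
        prev = forward[-1]
        pred = [sum(prev[a] * trans[a][b] for a in range(2)) for b in range(2)]
        forward.append(normalise((pred[0] * evidence[i][0],
                                  pred[1] * evidence[i][1])))

    # Backward pass: evidence from the strokes after stroke i.
    backward = [(1.0, 1.0)] * n
    for i in range(n - 2, -1, -1):
        nxt = backward[i + 1]
        msg = [sum(trans[a][b] * evidence[i + 1][b] * nxt[b] for b in range(2))
               for a in range(2)]
        backward[i] = normalise((msg[0], msg[1]))

    return [normalise((f[0] * b[0], f[1] * b[1]))[0]
            for f, b in zip(forward, backward)]

# The third stroke looks slightly like graphics when judged on its own.
smoothed = smooth_text_probabilities([0.9, 0.9, 0.4, 0.9])
print([round(p, 3) for p in smoothed])
```

Because its neighbours are confidently text, the third stroke's smoothed probability rises above 0.5 even though its individual score was 0.4.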

Some of the models resulting from this work have been incorporated into the ink analysis SDK for the Windows XP Tablet PC Edition.


Shape matching

 

We can match and recognize arbitrary shapes from a single example, as long as they are rigid and only deform according to affine transformations (scaling, shearing and rotation).  We determine the best match among multiple templates (lines, squares, triangles, circles) by employing Bayesian model comparison.  Details are given in our paper about shape matching.
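
The following sketch illustrates affine template matching in its simplest form: each template is fitted to the observed points by a least-squares affine map, and the template with the smallest residual wins.  Comparing residuals is only a crude stand-in for Bayesian model comparison, and the sketch assumes that point correspondences between template and drawing are already known.  The helper names solve3 and affine_residual and the eight-point templates are hypothetical.

```python
from math import cos, sin, pi

def solve3(m, v):
    """Solve the 3x3 linear system m p = v by Gauss-Jordan elimination."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def affine_residual(template, observed):
    """Least-squares affine fit of template points onto observed points.

    Both arguments are equal-length lists of (x, y) with assumed known
    correspondence. Returns the sum of squared distances after the fit.
    """
    rows = [(x, y, 1.0) for x, y in template]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
              for i in range(3)]
    residual = 0.0
    for axis in range(2):  # fit the x and y output coordinates separately
        target = [p[axis] for p in observed]
        rhs = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(3)]
        coef = solve3(normal, rhs)
        residual += sum((sum(c * x for c, x in zip(coef, r)) - t) ** 2
                        for r, t in zip(rows, target))
    return residual

# Eight points around a unit square: corners and edge midpoints.
square = [(0, 0), (0.5, 0), (1, 0), (1, 0.5),
          (1, 1), (0.5, 1), (0, 1), (0, 0.5)]
# Eight points on a circle, starting at the angle of the first corner.
circle = [(0.5 + 0.5 * cos(5 * pi / 4 + k * pi / 4),
           0.5 + 0.5 * sin(5 * pi / 4 + k * pi / 4)) for k in range(8)]
# A drawing that is a scaled, sheared and shifted square.
observed = [(2 * x + 0.5 * y + 3, -0.3 * x + 1.5 * y + 1) for x, y in square]

templates = {"square": square, "circle": circle}
best = min(templates, key=lambda name: affine_residual(templates[name], observed))
print(best)
```

The square template fits the deformed drawing almost exactly, while the circle template leaves a clear residual, so the square is selected.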


Relevant publications

All our publications relevant to drawing analysis.

