Paul Viola
Partner Development Manager

Bing

Brief Bio:

-       Ph.D., MIT, 1995

-       Compaq Research Cambridge, 2000

-       Mitsubishi Electric Research, 2002

-       Microsoft Research, 2006

-       Live Labs, 2007

 


Bio

 

 

Paul manages a team of over 120 engineers on 4 continents. We deliver the algorithms that interpret users' queries to create the Search result page for Bing. This involves matching the query to web results, as well as other types of “structured” data. Our team also measures the overall quality of Bing results across the entire page.

 

Paul came to Microsoft as a Researcher in 2002.  Before moving to Search, Paul and his team worked on numerous efforts to use machine learning in the analysis of documents, emails, and web pages.  Results of this work can be seen in products like Windows, Live Search, and Microsoft Dynamics.  In collaboration with the Live Toolbar team we built the technology behind “smart menus”.  The Tablet PC team uses our technology to extract the structure in handwritten ink notes.  East Asian Office is using his technology to extract contact information from incoming emails.  Dynamics/CRM is using his technology to automatically route and analyze incoming faxes.   Live Search uses similar technology to classify and extract information from documents and queries.

 

Paul has served on the program committees of conferences such as Neural Information Processing Systems (NIPS), Computer Vision and Pattern Recognition (CVPR), and the International Conference on Computer Vision (ICCV).  He has received the Marr Prize for the best paper in computer vision (at ICCV 2003).  An earlier paper on medical image processing received an honorable mention for the Marr Prize in 1995.  He received an honorable mention for best paper at AAAI 2004. While at MIT he received the NSF CAREER award as one of the top junior faculty members in Computer Science.

 

Paul’s interest in intelligent systems goes back quite a ways.  As an IBM intern in 1986, Paul Viola constructed a semi-autonomous robotic wheelchair called “Mr. Ed”.  This ride-on robot incorporated sensors and a microprocessor, and required no more than coarse and infrequent feedback to navigate a hallway or go through a doorway.  Until 1990, he studied robotics with Rod Brooks at MIT.  There he received a master’s degree for the construction of an autonomous visual robot patterned on the simple mammalian oculomotor processes.  After brief forays into neuroscience and structural molecular biology (at Arris Pharmaceutical, now Axys), Paul returned to MIT to complete his Ph.D. on biomedical image processing.

 

Paul’s thesis work on the registration of images from various medical sensors has been widely used and reimplemented (his thesis has been referenced more than 800 times).  It is now a standard technique that appears in many commercial products,  and is widely considered the best and most reliable registration technique for assistance in surgical planning. 

 

In 1995, Dr. Viola returned to MIT as an assistant professor and later an associate professor.  His work focused on statistical learning for image processing and computer vision.   In the area of computer graphics his work with Jeremy De Bonet is considered the first effective texture synthesis algorithm for complex textural patterns.  Other work included techniques for image database retrieval and 3D reconstruction.

 

While visiting Compaq Cambridge research labs in 2000,  he created the world’s first real-time face detection system.  This system has been widely adopted and reimplemented: Intel distributes this algorithm in a computer vision toolkit;  MSRA has adopted and improved the algorithm.  Though it was developed outside of MS, it is actually in wide use inside of MSR (as well as many other companies and universities).  After leaving MIT to join the Mitsubishi Electric Research Labs, he worked on face recognition, surveillance, and real-time computer vision.

 

For more or less complete lists of my references please try: DBLP, IEEE, SCHOLAR, or CITESEER.

 


Old Research Overview

My past work (with a wide range of collaborators both inside of MSR and in the product groups) is at the intersection of Machine Learning, Natural Language Processing, and Computer Vision.  We have constructed systems that understand documents well enough to route them to the correct recipient, extract structured information, or repurpose them for other tasks.  For example, the Tablet PC makes it easy to jot down notes or to derive equations.  I am working with the Tablet team to understand these ink documents so that they can be reused and edited.

 

We have built a number of systems:

 

 

Fax Routing.  (DAS 2004 paper) We have created a system that can route incoming fax images.  Optical character recognition finds the words, and they are then evaluated to determine which are relevant.  For example, words are relevant if they are near the word “TO”.  The relevant words are then compared to a database of recipients using a fuzzy matching algorithm.
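The routing steps described above can be sketched roughly as follows. This is only an illustration, not the system from the DAS 2004 paper: the function names `words_near_to` and `best_recipient`, the window size, and the 0.75 threshold are all hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the real system uses.

```python
from difflib import SequenceMatcher


def words_near_to(ocr_words, window=3):
    """Keep the words that appear within `window` positions after a 'TO' marker."""
    keep = []
    for i, word in enumerate(ocr_words):
        if word.strip(":").upper() == "TO":
            keep.extend(ocr_words[i + 1 : i + 1 + window])
    return keep


def best_recipient(words, recipients, threshold=0.75):
    """Return (name, score) for the best fuzzy match, or None if nothing is close."""
    if not words:
        return None
    best = None
    for name in recipients:
        tokens = name.lower().split()
        # Score each name token by its best match among the OCR words, then average.
        score = sum(
            max(SequenceMatcher(None, tok, w.lower()).ratio() for w in words)
            for tok in tokens
        ) / len(tokens)
        if best is None or score > best[1]:
            best = (name, score)
    return best if best[1] >= threshold else None


# Usage: OCR has misread the letter "l" as the digit "1".
ocr = ["FAX", "TO:", "Pau1", "Vio1a", "FROM:", "Someone"]
match = best_recipient(words_near_to(ocr, window=2), ["Paul Viola", "Rod Brooks"])
```

Averaging per-token scores tolerates a few OCR character errors per word while still rejecting text that resembles no recipient.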

 

 


Contact Parsing (AAAI 2004 paper, SIGIR 2005 paper) Given an address block from the bottom of an email, web page, or scanned document, automatically extract the key fields and fill them into a form.  The system works along with a novel UI which makes correcting errors easy.

 

See also the internal web site.

 

Send mail if you would like to download a demo.

 

Ink Outline and List Analysis (IWFHR 2004 paper) Processes handwritten notes from the tablet PC to find list and outline structure.  Once found, the structure allows you to provide more powerful editing (like opening and closing sub-trees).  It is also easier to import the notes into Word and OneNote.

 

 

Recognition and Grouping of Ink.  (DAS 2004 paper) Given a page of ink strokes there are two related challenges.  First you must group the strokes on the page into valid sets (e.g. group the 3 strokes in an H).  Second you must recognize the groups.  This is difficult to do well unless you perform both tasks simultaneously.

 

 

 

Document Structure Extraction (ICDAR 2005 paper, ICCV 2005 paper)  From a document scan or a PDF file, the words and lines can be extracted accurately.  What is missing is higher level information about the document.  Is it one or two columns?  Where is the title?  Is this block a part of a footnote, or a section of the main text?  If you had this information it would be easy to import the text + structure back into Word to make editing easy.

 

 

 


Older Work


Robust Real-time Object Detection

We have created a new visual object detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions.  The first is the introduction of a new image representation called the “Integral Image”, which allows the features used by our detector to be computed very quickly.  The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features and yields extremely efficient classifiers.  The third contribution is a method for combining classifiers in a “cascade”, which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions.  A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems.  Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
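To make the first and third contributions concrete, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows; the function names, the zero-padded layout of the table, and the `(score_fn, threshold)` representation of a cascade stage are illustrative choices, not taken from the papers.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img over rows < y and columns < x.

    The table has one extra row and column of zeros so lookups need no bounds checks.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a rectangle (width is assumed to be even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


def run_cascade(stages, window):
    """Reject a window as soon as one stage scores it below that stage's threshold."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False
    return True


ii = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Once the table is built, any rectangle sum costs the same four lookups regardless of its size, which is what makes evaluating many rectangle features per window cheap. The early exit in `run_cascade` means most background windows are rejected after only the first stage or two.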

The best overview of the approach is available in these papers: IJCV or CVPR 2001 (shorter).

We also proposed a new learning algorithm called AsymBoost which improves performance of the cascade: NIPS 14, Dec 2001.

This work grew out of earlier research on image database retrieval, CVPR 2000 (see below).
 

Mutual Information Matching

In 1995 we developed a new approach for solving computer vision problems based on entropy. This approach can be used to derive algorithms for pose estimation, object recognition, shape from shading, and lightness compensation. Each of these algorithms is based on a simple non-parametric estimate for the entropy of a signal.
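As an illustration of what a non-parametric entropy estimate can look like, the sketch below uses a Gaussian Parzen-window density built from one sample and averages the negative log density over a second sample. The function names, the one-dimensional setting, and the fixed `sigma` are assumptions made for this example; the thesis should be consulted for the estimator actually used.

```python
import math


def gaussian(x, sigma):
    """One-dimensional Gaussian kernel with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def parzen_entropy(sample_a, sample_b, sigma=1.0):
    """Estimate entropy in nats.

    The density is a Parzen-window estimate built from sample_a; the entropy is
    the average negative log density evaluated at the points of sample_b.
    A point far from every element of sample_a can underflow to zero density.
    """
    total = 0.0
    for b in sample_b:
        density = sum(gaussian(b - a, sigma) for a in sample_a) / len(sample_a)
        total += math.log(density)
    return -total / len(sample_b)


# A tightly clustered signal should score lower than a widely spread one.
tight = parzen_entropy([0.0, 0.1, -0.1, 0.05], [0.02, -0.03])
wide = parzen_entropy([-10.0, -5.0, 0.0, 5.0, 10.0], [-7.0, 3.0, 8.0])
```

Because the estimate is a smooth function of the samples, it can be differentiated with respect to alignment parameters, which is what makes it usable inside an optimization loop.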

My thesis contains a good overview of these ideas. Other papers include: IJCV-97 and Medical Image Analysis-96.  

Complex Feature Recognition

In 1996 we developed a new Bayesian framework for visual object recognition which is based on the insight that images of objects can be modeled as a conjunction of local features. This framework can be used to both derive an object recognition algorithm and an algorithm for learning the features themselves. The overall approach, called complex feature recognition or CFR, is unique for several reasons: it is broadly applicable to a wide range of object types, it makes constructing object models easy, it is capable of identifying either the class or the identity of an object, and it is computationally efficient--requiring time proportional to the size of the image.

Instead of a single simple feature such as an edge, CFR uses a large set of complex features that are learned from experience with model objects. The response of a single complex feature contains much more class information than does a single edge. This significantly reduces the number of possible correspondences between the model and the image. In addition, CFR takes advantage of a type of image processing called oriented energy. Oriented energy is used to efficiently pre-process the image to eliminate some of the difficulties associated with changes in lighting and pose.

A paper describing CFR was published as AI-MEMO-1591. More recently we have extended these ideas using a generative multi-scale statistical model for images ICCV-99.
 

Non-parametric Multi-scale Model for Images

In 1997 we created a novel multi-scale statistical model for images. One of the original motivations for this work was a flaw in the mutual information approach described above. In that framework the entropy of the image and model were estimated as if the pixels were independent. This multi-scale approach provided a much more powerful model for the dependencies in images.

While there have been many proposed approaches to the principled statistical modeling of images, each has been limited in either the complexity of the models or the complexity of the images. Our approach is much more general and can be used for recognition, image de-noising, and in a “generative mode” to synthesize high quality textures. Several papers describing this approach can be found here: NIPS-97, SIGGRAPH-97 and CVPR-98.
 

Image Database Retrieval (and Text too!)

Starting in 1997 we began to study the role of high dimensional representations in image database retrieval. Contrary to most work in the field, we created a very large set of features from each image. These features were designed to be very selective--each only responds to a very small percentage of images.

At first it might seem that the introduction of tens of thousands of features could only make the query learning process infeasible. How can a problem which is difficult given ten to twenty features become tractable with 10,000? Two recent results in machine learning argue that this is not necessarily a terrible mistake: “support vector machines (SVM)” and “boosting”. Both approaches have been shown to generalize well in very high dimensional spaces because they maximize the margin between positive and negative examples. Boosting provides the closer fit to our problem domain because it greedily selects a small number of features from a very large number of potential features. This small set of features can be used to rapidly scan very large databases.
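The greedy selection performed by boosting can be sketched with a small AdaBoost loop over threshold "stumps", where each round picks the single feature that best separates the currently weighted examples. The stump form, the function names, and the toy data are assumptions for illustration; the retrieval system's actual features and weak learners are described in the papers.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Weak classifier: +polarity if the feature is at or above threshold, else -polarity."""
    return polarity if row[feature] >= threshold else -polarity


def best_stump(X, y, w):
    """Exhaustively find the (error, feature, threshold, polarity) with least weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost(X, y, rounds):
    """Return a list of (alpha, feature, threshold, polarity), one entry per round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of the selected stumps."""
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy data: only feature 2 separates the positive (+1) and negative (-1) examples.
X = [[5, 1, 0.10, 3], [2, 7, 0.20, 3], [9, 4, 0.80, 3],
     [1, 6, 0.90, 3], [7, 2, 0.15, 3], [3, 8, 0.85, 3]]
y = [-1, -1, 1, 1, -1, 1]
model = adaboost(X, y, rounds=2)
```

The final classifier touches only the features that were selected, so scanning a large database requires computing a handful of features per item rather than all of them.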

The best paper in this area appeared in IJCV in 2003. A paper describing an early version of this approach was published in NIPS-97.

Satisfyingly, very similar ideas have proven valuable in text retrieval: NIPS-98 (PDF).
 

Handwritten Mathematical Expression Recognition

We have built a number of systems that can parse and interpret handwritten mathematical expressions. What makes this hard is that the semantics of a mathematical expression comes from the spatial arrangement of the symbols. In a sense this is a computer vision problem.

A paper describing an early version of this approach was published in AAAI-98. More recently, Nick Matsakis has written a Master's thesis describing these ideas. Nick has also put together a demo and some other related information.
 

The Computer Vision Macroscope

At MIT my students and I constructed a real-time 3D reconstruction and event recording suite.

Our first paper in this area describes a very fast algorithm for 3D reconstruction which uses prior information to improve the results of silhouette intersection. Silhouette intersection is one approach for reconstructing the 3-dimensional shape of an object from multiple views. Using this approach, the task is to produce a binary labeling of a set of voxels that determines which voxels are filled and which are empty. In this paper, we give an energy minimization formulation of the silhouette intersection problem. The global minimum of this energy can be rapidly computed with a single graph cut, using a result due to Greig, Porteous and Seheult (CVPR-00).
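The idea of finding a globally optimal binary labeling with one graph cut can be illustrated on a toy problem. The sketch below is not the energy from the CVPR-00 paper: it assumes a generic energy made of a per-voxel cost for each label plus a fixed penalty whenever two neighboring voxels disagree, and it uses a plain Edmonds-Karp max-flow. All names and numbers are hypothetical.

```python
from collections import deque


def min_cut_labels(cost_filled, cost_empty, neighbors, smoothness):
    """Return a 0/1 label per voxel (1 = filled) minimizing
    sum of per-voxel label costs + smoothness * (number of disagreeing neighbor pairs)."""
    n = len(cost_filled)
    source, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)

    for i in range(n):
        add_edge(source, i, cost_empty[i])  # cut if voxel i ends up labeled empty
        add_edge(i, sink, cost_filled[i])   # cut if voxel i ends up labeled filled
    for i, j in neighbors:
        add_edge(i, j, smoothness)
        add_edge(j, i, smoothness)

    def reachable():
        """Breadth-first search from the source over edges with spare capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if sink not in parent:
            break
        # Walk back from the sink to recover the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    # Voxels still connected to the source after the cut are labeled filled.
    return [1 if i in parent else 0 for i in range(n)]


# Five voxels in a row; the middle one has weak, noisy evidence for "empty".
cost_filled = [0, 0, 1, 0, 0]
cost_empty = [10, 10, 0, 10, 10]
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
smoothed = min_cut_labels(cost_filled, cost_empty, chain, smoothness=2)
unsmoothed = min_cut_labels(cost_filled, cost_empty, chain, smoothness=0)
```

With the smoothness term switched on, the isolated noisy voxel is filled in because disagreeing with both neighbors costs more than ignoring its weak evidence; with the term set to zero each voxel simply follows its own cheapest label.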