The 3rd Microsoft Research India Computer Vision and Graphics Shindig

Microsoft Research India, Bangalore, India

December 16 and 17, 2010 


This shindig was a gathering of researchers working in a particular area to share ideas and discuss future research directions. Invited speakers spoke on contemporary research topics in computer vision, graphics, and image processing during the two days of the event. There were productive research discussions with speakers, invited faculty members, and students.


Contact Information

For any queries, please email Microsoft External Research – India with the subject line "Shindig on Computer Vision, Graphics and Image Processing".


Talk Abstracts

Compressive Sensing: Is It the Next Best Hope for Computer Vision?

Rama Chellappa (University of Maryland, College Park)


Since the early 1970s, computer vision researchers have relied on concepts from physics, mathematics, and statistics to develop new approaches for many computer vision problems. These include image formation models, regularization approaches, optimization techniques, Markov random fields, Bayesian inference, machine learning, manifold learning, and more recently, compressive sensing.


In this talk, I will explore the notion that the latest excitement about compressive sensing and sparse representations is justified in the context of generating novel algorithms for computer vision problems. Examples from 3-D modeling from sparse gradients, dictionary-based face recognition, image reconstruction from gradients, and estimation of BRDFs will be provided to support the discussions.





Are Categories Necessary for Recognition? 

Alexei "Alyosha" Efros (Carnegie Mellon University) 


The use of categories to represent concepts (e.g., visual objects) is so prevalent in computer vision and machine learning that most researchers don't give it a second thought. Faced with a new task, one simply carves up the solution space into classes (e.g., cars, people, buildings), assigns class labels to training examples, and applies one of the many popular classifiers to arrive at a solution.


In this talk, I will discuss a different way of thinking about object recognition: not as object naming, but rather as object association. Rather than asking "What is it?", a better question might be "What is it like?" [M. Bar]. The etymology of the very word "re-cognize" (to know again) supports the view that association plays a key role in recognition. Under this model, when faced with a novel object, the task is to associate it with the most similar objects in one's memory, which can then be used directly for knowledge transfer, bypassing the categorization step altogether. I will present some very preliminary results on our new model, termed "The Visual Memex," which aims to use object associations (in terms of visual similarity and spatial context) to reason about and parse visual scenes. We show that our model offers better performance at certain tasks than standard category-driven approaches.



Modeling Deformable Surfaces from Single Videos

Pascal Fua (EPFL)


Without a strong model, 3-D shape recovery of non-rigid surfaces from monocular video sequences is a severely under-constrained problem. Prior models are required to resolve the inherent ambiguities. In our work, we have investigated several approaches to incorporating such priors without making unwarranted assumptions about the physical properties of the surfaces we are dealing with.


In this talk, I will present these approaches and discuss their relative strengths and weaknesses. I will also demonstrate that they can be incorporated into effective algorithms that can capture very complex deformations.





Non-Photorealistic Rendering and the Science of Art

Aaron Hertzmann (University of Toronto)


Non-Photorealistic Rendering describes algorithms for creating images and animations inspired by traditional media for art and illustration. I will present some of my work in this area. Moreover, I argue that Non-Photorealistic Rendering (NPR) research will play a key role in the scientific understanding of visual art and illustration. NPR can contribute to scientific understanding of two kinds of problems: how do artists create imagery, and how do observers respond to artistic imagery? I sketch out some of the open problems, how NPR can help, and what some possible theories might look like. Additionally, I discuss the thorny problem of how to evaluate NPR research and theories.





Self-Paced Learning for Specific-Class Semantic Segmentation

M. Pawan Kumar (Stanford University)

Algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that the algorithm should be presented with the training data in a meaningful order: easy samples first, difficult samples later. As we are often not provided with a readily computable measure of easiness, we design a novel self-paced learning algorithm that simultaneously selects easy samples and learns the parameters. We empirically demonstrate that self-paced learning outperforms the state of the art on several standard applications.
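
The alternation between selecting easy samples and refitting parameters can be illustrated with a minimal sketch. This is not the speaker's algorithm: the one-parameter model `y = w * x`, the squared-error notion of "easy", and the `threshold`, `growth`, and `rounds` parameters are all assumptions chosen to keep the example short.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the toy model y = w * x (no intercept)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den if den else 0.0


def self_paced_fit(xs, ys, threshold=1.0, growth=1.5, rounds=5):
    """Alternate between selecting easy samples and refitting the parameter.

    A sample counts as easy when its squared error under the current
    model is below `threshold` (an assumed easiness measure). The
    threshold grows each round so harder samples can be admitted later.
    """
    w = fit_slope(xs, ys)  # initial fit on all samples
    selected = list(range(len(xs)))
    for _ in range(rounds):
        losses = [(y - w * x) ** 2 for x, y in zip(xs, ys)]
        easy = [i for i, loss in enumerate(losses) if loss < threshold]
        if easy:
            # Refit using only the samples currently judged easy.
            selected = easy
            w = fit_slope([xs[i] for i in easy], [ys[i] for i in easy])
        threshold *= growth  # relax the easiness criterion
    return w, selected
```

Calling `self_paced_fit([1, 2, 3, 4, 5], [2, 4, 6, 8, 50], threshold=20.0)` starts from a fit distorted by the last sample, then settles on the four consistent samples and leaves the outlier unselected for the rounds shown.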


Next, we consider the task of learning a model that provides a complete segmentation of an image by assigning each of its pixels to a specific semantic class. The main problem we face is the lack of fully supervised data. To address this issue, we develop a principled framework for learning the parameters of a specific-class segmentation model using diverse data (with varying levels of supervision). More precisely, we formulate our problem as a latent structural support vector machine (LSVM), where the latent variables model any missing information in the human annotation. In order to deal with the noise inherent in weakly supervised annotations, we train the LSVM with self-paced learning. Using large, publicly available datasets, we show that our approach is able to exploit the information offered by different annotations to improve the accuracy of specific-class segmentation.



Activity Recognition by Structured Models Learned from a Small Set of 2-D Video Examples

Ram Nevatia (University of Southern California)


Activity recognition in videos is a key task in video content extraction; it is needed for applications such as monitoring and alerts, content-based indexing and human-computer interaction. There are several alternative approaches to this task. One approach is to compute local spatio-temporal features and then use their global distribution for classification. In this work, we take a more structural approach where an activity is defined by a sequence of states where each state characterizes the actor’s body pose and relations to objects of interest (and to other actors).


The structural approach leads not only to recognition of the activities but also to their detailed descriptions; however, there are several challenges in following this approach. One of these is to estimate the body poses in video frames; in earlier work, we have shown that simultaneous computation of pose tracks and activities can lead to robust and efficient solutions. Here, pose inference is guided by the activity models, whereas the activity inferences, in turn, depend on evidence for different poses in the images.


To account for the variations in observed pose due to viewpoint changes, we use 3-D pose sequence models and render them in 2-D as needed. However, acquiring the 3-D models is a difficult task in itself and requires capturing action videos in a Mocap environment. In recent work, we have developed methods to infer the 3-D pose sequences by lifting from 2-D videos with a relatively small amount of manual intervention. This allows for rapid learning of new activity patterns from just a few (possibly only one) video training examples.


This talk will describe our recent work in activity recognition using the structural approach, including the task of learning the models.





Perils of Developing Quantitative Methods for Biomedical Imaging Applications

Jens Rittscher (GE)

By providing data sets such as the PASCAL Object Recognition Database Collection and similar collections, the computer vision community has established an accepted benchmark that clearly defines a challenge problem for the research community at large. Although the lack of similar data sets for biomedical applications has been broadly acknowledged, there are some inherent challenges that need to be addressed. Ground truth labeling of some of these very complex data sets appears to be extremely difficult. Experts often disagree, and systematic approaches to finding a consensus interpretation need to be applied. The underlying variation of biological specimens is another factor that needs to be taken into account.


In this context, I will introduce the concept of edit-based visualization and illustrate how it has been applied to larger-scale data sets. The goal of the talk is to stimulate an exchange of ideas on this topic and to discuss which lessons learnt from the computer vision challenge problems should be taken into account when designing a reference data set for biomedical imaging applications.



Jens Rittscher joined the Visualization and Computer Vision Laboratory at GE Global Research in Niskayuna in 2001. He received a Diploma in Mathematics and Computer Science from the University of Bonn, Germany, in 1997, and completed his DPhil under the supervision of Andrew Blake at the University of Oxford in 2001. His research interests include the analysis of visual motion, automatic video annotation, and model-based image segmentation techniques. More recently, he has focused his research efforts on biomedical imaging. In 2008, he published the volume Microscopic Image Analysis for Life Science Applications together with Raghu Machiraju and Stephen Wong. Currently, he holds an adjunct professorship at the Rensselaer Polytechnic Institute, Troy, NY.


Learning to Re-rank: Query-dependent Image Re-ranking using Click Data

Manik Varma (Microsoft Research India)


Our objective is to improve the performance of keyword-based image search engines by re-ranking their baseline results. We address three limitations of existing search engines in this paper. First, there is no straightforward, fully automated way of going from textual queries to visual features. Image search engines are therefore forced to rely on static and textual features alone for ranking. Visual features are used only for secondary tasks such as finding similar images. Second, image rankers are trained on query-image pairs labeled with relevance judgements determined by human experts. Such labels are well known to be noisy due to various factors including ambiguous queries, unknown user intent and subjectivity in human judgements. This leads to learning a sub-optimal ranker. Finally, a static ranker is typically built to handle disparate user queries. The ranker is therefore unable to adapt its parameters to suit the query at hand, which again leads to sub-optimal results. All these problems can be mitigated by incorporating a second re-ranking stage leveraging user click data.


We hypothesise that images clicked in response to a query are mostly relevant to the query. We therefore aim to re-rank the original search results so as to promote images that are likely to be clicked to the top of the ranked list. This is achieved by using Gaussian Process regression to predict the normalised click count for each image. Re-ranking is then carried out based on the predicted click counts and the original ranking scores. It is demonstrated that the proposed algorithm can significantly boost the performance of a baseline search engine such as Bing image search.
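
The re-ranking step can be sketched in a few lines. This is only an illustration under stated assumptions, not the system described in the talk: the squared-exponential kernel, the `noise` variance, the linear blend controlled by `weight`, and all function names are hypothetical choices, since the abstract does not say how predicted click counts and original scores are combined.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(train_X, train_y, test_X, noise=0.1):
    """Posterior mean of a zero-mean GP regressor at the test points."""
    n = len(train_X)
    # Kernel matrix with the noise variance added on the diagonal.
    K = [[rbf(train_X[i], train_X[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, train_y)
    return [sum(rbf(x, train_X[i]) * alpha[i] for i in range(n))
            for x in test_X]


def rerank(features, base_scores, clicked_X, clicked_y, weight=0.5):
    """Order image indices by a blend of base score and predicted clicks.

    The linear blend is an assumption made for this sketch.
    """
    predicted = gp_predict(clicked_X, clicked_y, features)
    blended = [weight * s + (1.0 - weight) * p
               for s, p in zip(base_scores, predicted)]
    return sorted(range(len(features)), key=lambda i: -blended[i])
```

Here `clicked_X` and `clicked_y` stand for feature vectors and normalised click counts of images with click history for the query. With equal base scores, an image whose features resemble heavily clicked images moves ahead of one that resembles rarely clicked images.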





Human Focussed Video Analysis

Andrew Zisserman (University of Oxford)


Determining the pose and actions of humans is one of the central problems of image and video analysis. The visual problem is challenging because humans are articulated, wear loose and varying clothing, self-occlude, and stand against difficult and confusing backgrounds. Nevertheless, the area has seen great progress over the last decade due to advances in modelling, learning, and the efficiency of algorithms.


We describe approaches for recognizing human actions and interactions, and for determining 2-D upper body pose. Results will be shown for various TV videos and feature films, and a live demonstration will be given of pose-based video retrieval.

This is joint work with Vitto Ferrari, Nataraj Jammalamadaka, C. V. Jawahar, Alexander Klaeser, Marcin Marszalek, Alonso Patron-Perez, Ian Reid, and Cordelia Schmid.