Vision Seminar Series

The Vision Seminar Series covers recent advances across the broad field of visual processing, including computer vision, image processing, computer graphics, and human vision. Our mission is to bring together multi-disciplinary scholars to pursue important and fun research topics, especially in deep image understanding and advanced image reconstruction. The talks are aimed at a broad audience of students, faculty, and researchers.

Past Speakers

From Organizing Internet Image Collections To Recognizing Human Actions In Videos

Josef Sivic, INRIA / Ecole Normale Superieure, Paris, France
Tuesday, November 23, 2010
4:00 PM – 5:30 PM

Description
Automatic recognition of places, objects, and human activities in images and videos remains a challenging task. The imaged appearance of an object or a person can vary dramatically due to changes in camera viewpoint, illumination, or partial occlusion. Human actions introduce additional variation: people wear different clothing, and actions such as 'drinking coffee' or 'opening a door' can be performed in different ways by different individuals.
In the first part of the talk I will give an overview of our recent results in large-scale visual search for particular objects and places, with a focus on place recognition in structured, geotagged image databases. In the second part of the talk, I will show that human action detectors can be automatically learned from videos together with readily-available but imprecise and noisy text annotation in the form of movie scripts and subtitles.
Results will be shown on collections of Internet images from Flickr and Street View, as well as feature-length movies. Joint work with F. Bach, O. Duchenne, J. Knopp, I. Laptev, J. Ponce, and T. Pajdla.

Biography
Josef Sivic received a degree from the Czech Technical University, Prague, in 2002 and the PhD degree from the University of Oxford in 2006. His thesis, on efficient visual search of images and videos, was awarded the British Machine Vision Association 2007 Sullivan Thesis Prize and was shortlisted for the British Computer Society 2007 Distinguished Dissertation Award. His research interests include visual search and object recognition applied to large image and video collections. After spending six months as a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, he is currently an INRIA researcher at the Departement d’Informatique, Ecole Normale Superieure, Paris. http://www.di.ens.fr/~josef

Seeing Scenes in Scenes

James Hays, Brown
Tuesday, August 24, 2010
4:00 PM – 5:30 PM

Description
Nearly all scene recognition work is based on the implicit assumption that one image depicts one scene category, yet photos commonly contain diverse, distinct, local scenes. We construct a novel database with tens of thousands of local subscene annotations and use it as a test set to evaluate a wide range of object, scene, and texture recognition features on the novel task of localized scene detection. We show encouraging results using a multiple kernel SVM classifier to combine these features.
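
For context on the classifier mentioned above, here is a minimal sketch of a multiple kernel combination, assuming one precomputed Gram matrix per feature type and fixed uniform weights standing in for learned ones; the function names and scikit-learn setup are illustrative, not the authors' code.

```python
# Minimal multiple-kernel SVM sketch: combine per-feature Gram matrices
# with fixed weights and train an SVM on the blended kernel.
# Assumes kernels are precomputed; real MKL would learn the weights.
import numpy as np
from sklearn.svm import SVC

def mkl_predict(kernels_train, kernels_test, y_train, weights=None):
    """kernels_train: list of (n, n) Gram matrices, one per feature type;
    kernels_test: list of (m, n) test-vs-train kernel matrices."""
    k = len(kernels_train)
    w = weights if weights is not None else np.full(k, 1.0 / k)
    K_train = sum(wi * Ki for wi, Ki in zip(w, kernels_train))
    K_test = sum(wi * Ki for wi, Ki in zip(w, kernels_test))
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```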

Biography
James Hays is an assistant professor of computer science at Brown University. Before joining Brown, James worked with Antonio Torralba as a post-doc at Massachusetts Institute of Technology. He received a Ph.D. in Computer Science from Carnegie Mellon University in 2009 while working with Alexei Efros, and a B.S. in Computer Science from Georgia Institute of Technology in 2003. His research interests are in computer vision and computer graphics, focusing on image understanding and manipulation leveraging massive amounts of data.

Scene and Object Recognition In Context

Antonio Torralba, MIT
Tuesday, August 17, 2010
4:00 PM – 5:30 PM

Description
Recognizing objects in images is an active area of research in computer vision. In the last two decades there has been much progress, and object recognition systems are already operating in commercial products. Most algorithms for detecting objects perform an exhaustive search across all locations and scales in the image, comparing local image regions with an object model. That approach ignores the semantic structure of scenes and tries to solve the recognition problem by brute force. In the real world, however, objects tend to co-vary with other objects and scenes, providing a rich collection of contextual associations. Those contextual associations can be used to reduce the search space, by looking only in places where the object is expected to be, and to increase performance, by rejecting image patterns that look like the target object but appear in impossible places.
In this talk I will present results on scene and object recognition obtained with a new database of more than 400 scene categories and more than 100 object classes. When hundreds of categories become available, new challenges and opportunities emerge. One challenge is to devise efficient and accurate algorithms able to handle hundreds or thousands of categories. But there are also new opportunities. On scene recognition, we can test the performance of global features in classifying scenes into a very large number of possible settings, covering most of the places encountered by humans. On the object recognition side, we can develop context-based models that benefit from the interactions between hundreds of object categories.
The first part of the talk is work in collaboration with Jianxiong Xiao, James Hays, Krista Ehinger, and Aude Oliva (http://groups.csail.mit.edu/vision/SUN/).
The second part is based on work done in collaboration with Myung Jin Choi, Joseph Lim, and Alan Willsky (http://web.mit.edu/~myungjin/www/HContext.html).
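
As a toy illustration of contextual associations (and not the tree-structured context model presented in the talk), detector outputs can be rescored by how plausibly each detected class co-occurs with the scene's most confident detection; the names, the co-occurrence table, and the blending scheme below are all assumptions.

```python
# Illustrative context rescoring: damp detections whose class rarely
# co-occurs with the most confident detection in the image.
import numpy as np

def rescore_with_context(scores, classes, cooc, alpha=0.5):
    """scores: (n,) detector confidences; classes: (n,) class ids;
    cooc: (C, C) co-occurrence probabilities estimated from training data."""
    anchor = int(np.argmax(scores))          # most confident detection
    rescored = scores.copy()
    for i in range(len(scores)):
        if i == anchor:
            continue
        plausibility = cooc[classes[anchor], classes[i]]
        # Blend the raw score with a context-weighted version of it.
        rescored[i] = (1 - alpha) * scores[i] + alpha * scores[i] * plausibility
    return rescored
```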

Biography
Antonio Torralba is an Associate Professor of Electrical Engineering and Computer Science at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. Following his degree in telecommunications engineering, obtained at Telecom BCN, Spain, he was awarded a Ph.D. in Signal, Image, and Speech Processing from the Institut National Polytechnique de Grenoble, France, in 2000. He then pursued postdoctoral training at the Brain and Cognitive Sciences Department and the Computer Science and Artificial Intelligence Laboratory at MIT.

Semi-Supervised Learning in Gigantic Image Collections

Yair Weiss, Hebrew University
Thursday, August 12, 2010
4:00 PM – 5:30 PM

Description
With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained for a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for gigantic collections with hundreds of millions of images and thousands of classes. In this talk I will show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet.
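
To make the scaling issue concrete, here is a minimal sketch of classical graph-based semi-supervised learning; the dense eigendecomposition below is exactly the polynomial-cost step that the talk's eigenfunction approximation replaces with a computation linear in the number of images. Function and parameter names are illustrative.

```python
# Classical graph-Laplacian semi-supervised learning on a small dataset.
# The eigendecomposition is O(n^3); the talk approximates these eigenvectors
# with eigenfunctions of weighted Laplace-Beltrami operators to reach O(n).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_ssl(X, y, labeled, k_eig=10, sigma=1.0):
    """X: (n, d) features; y: (n,) labels in {-1, +1}; labeled: (n,) bool mask."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    _, vecs = eigh(L)                 # eigenvalues ascending: smoothest first
    U = vecs[:, :k_eig]               # restrict to the k smoothest eigenvectors
    # Fit a smooth function on the graph that agrees with the labeled points.
    alpha, *_ = np.linalg.lstsq(U[labeled], y[labeled], rcond=None)
    return U @ alpha                  # soft labels for every image
```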

Biography
Yair Weiss is a Professor at the Hebrew University School of Computer Science and Engineering. He received his Ph.D. from MIT working with Ted Adelson on motion analysis and did postdoctoral work at UC Berkeley. His research interests include machine learning and human and machine vision. Since 2005 he has been a fellow of the Canadian Institute for Advanced Research. With his students and colleagues he has co-authored award-winning papers at NIPS (2002), ECCV (2006), UAI (2008), and CVPR (2009).

A Probabilistic Image Jigsaw Puzzle Solver

Shai Avidan, Tel-Aviv University
Tuesday, August 3, 2010
4:00 PM – 5:30 PM

Description
I will present an ongoing effort to solve jigsaw puzzles, a good testbed for a variety of computer vision problems. Completing jigsaw puzzles is challenging and requires expertise even for humans, and the problem is known to be NP-complete. We depart from previous methods that treat the problem as constraint satisfaction and instead develop a graphical model to solve it. I will show a number of results and several potential applications.

Biography
Shai Avidan received the PhD degree from the School of Computer Science at the Hebrew University, Jerusalem, Israel, in 1999. He is a professor at Tel-Aviv University, Israel. He was a postdoctoral researcher at Microsoft Research, a project leader at MobilEye, a research scientist at Mitsubishi Electric Research Labs (MERL), and a senior research scientist at Adobe Systems. He has published extensively in the fields of object tracking in video sequences and 3D object modeling from images. Recently, he has been working on Internet vision applications such as privacy-preserving image analysis, distributed algorithms for image analysis, and media retargeting, the problem of properly fitting images and video to displays of various sizes.

Adapting Object Models Using Regularized Cross-Domain Transforms

Kate Saenko, Harvard (SEAS) and UC Berkeley (EECS & ICSI)
Tuesday, July 27, 2010
4:00 PM – 5:30 PM

Description
I will describe a method for adapting object models acquired in a specific visual domain to new imaging conditions. This is done by learning a transformation which minimizes the effect of domain-induced changes in the feature distribution. In addition to being one of the first studies of domain adaptation for object recognition, this work develops a general theoretical framework for adaptation that could be applied to non-image data. The transformation is learned in a supervised manner, and can be applied to categories unseen at training time. The resulting model may be kernelized to learn non-linear transformations under a variety of regularizers. I will present a new image database for studying the effects of visual domain shift on object recognition, and demonstrate the ability of our method to improve recognition on categories with few or no target domain labels, moderate to large changes in the imaging conditions, and even changes in the feature representation.
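
As a rough, hypothetical simplification of the approach (the actual method uses a regularized metric-learning objective and can be kernelized), one can learn a linear map that pulls transformed source features toward same-class target features and pushes different-class pairs apart:

```python
# Simplified cross-domain transform learning by gradient descent on
# pairwise constraints; an illustration only, not the talk's formulation.
import numpy as np

def learn_transform(Xs, Xt, pairs, lr=0.01, steps=500, margin=1.0):
    """Xs: (n, d) source features; Xt: (m, d) target features;
    pairs: list of (i, j, s) with s=+1 for same class, s=-1 for different."""
    W = np.eye(Xs.shape[1])                    # start from the identity map
    for _ in range(steps):
        grad = np.zeros_like(W)
        for i, j, s in pairs:
            diff = W @ Xs[i] - Xt[j]
            if s > 0:                          # same class: pull together
                grad += 2.0 * np.outer(diff, Xs[i])
            elif diff @ diff < margin:         # different class: push apart
                grad -= 2.0 * np.outer(diff, Xs[i])
        W -= lr * grad / max(len(pairs), 1)
    return W                                   # apply as (W @ x) to source features
```

Because the learned map acts on features rather than on any particular class model, it can be applied to categories unseen at training time, matching the property described in the abstract.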

Biography
Kate Saenko is a postdoctoral researcher with joint appointments at Harvard University (SEAS) and UC Berkeley (EECS and ICSI). She received her B.Sc. degree in Computer Science from the University of British Columbia in 2000 and completed her Ph.D. at MIT in 2009. Kate's research interests lie in modeling the joint semantics of images, speech and text to improve human-computer interaction. Her thesis work focused on unsupervised disambiguation of image senses using online sources and semantic ontologies.

How to Come up with New Ideas in Imaging: Using the Idea Hexagon

Ramesh Raskar, MIT Media Lab
Tuesday, July 20, 2010
4:00 PM – 5:30 PM

Description
If you hear a great idea ‘X’, how can you come up with what is neXt? You don’t need a Ph.D. to come up with new ideas. But the process of invention seems very unstructured and confusing. In this talk, I will share my thoughts about the research process based on my own experience (see projects at http://cameraculture.info). Come prepared to contribute your thoughts on how you solve problems and invent solutions. (See a really old version at http://www.slideshare.net/cameraculture/how-to-come-up-with-new-ideas-raskar-feb09)

Biography
Ramesh Raskar joined the Media Lab from Mitsubishi Electric Research Laboratories in 2008 as head of the Lab’s Camera Culture research group. His research interests span the fields of computational photography, inverse problems in imaging, and human-computer interaction. Recent inventions include transient imaging to look around corners, a next-generation CAT-scan machine, imperceptible markers for motion capture (Prakash), long-distance barcodes (Bokode), touch+hover 3D interaction displays (BiDi screen), low-cost eye care devices (Netra), and new theoretical models to augment light fields (ALF) to represent wave phenomena.

He is a recipient of the TR100 award from Technology Review (2004), the Global Indus Technovator Award for the top 20 Indian technology innovators worldwide (2003), an Alfred P. Sloan Research Fellowship (2009), and a DARPA Young Faculty Award (2010). He holds 40 US patents and has received four Mitsubishi Electric Invention Awards. He is currently co-authoring a book on Computational Photography. http://raskar.info

Visual Recognition with Humans in the Loop

Serge Belongie, University of California, San Diego
Tuesday, July 13, 2010
4:00 PM – 5:30 PM

Description
We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset. Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm.
This work is part of the Visipedia project, in collaboration with Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder and Pietro Perona.
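
A sketch of the core loop may help. Given a class posterior (which in the real system would be seeded by a computer vision algorithm's scores rather than the uniform prior implied here), the system asks the attribute question with the highest expected information gain and updates on the user's answer; all names below are illustrative assumptions, and the paper additionally models noisy user responses.

```python
# Illustrative "visual 20 questions" loop: greedy expected-information-gain
# selection over binary attribute questions, with a Bayes update per answer.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_question(posterior, p_yes, asked):
    """posterior: (C,) over classes; p_yes: (Q, C) prob. of 'yes' per class."""
    best_q, best_gain = None, -np.inf
    h = entropy(posterior)
    for q in range(p_yes.shape[0]):
        if q in asked:
            continue
        m_yes = float(p_yes[q] @ posterior)            # marginal prob of "yes"
        post_yes = p_yes[q] * posterior
        post_no = (1.0 - p_yes[q]) * posterior
        post_yes /= post_yes.sum()
        post_no /= post_no.sum()
        gain = h - (m_yes * entropy(post_yes) + (1 - m_yes) * entropy(post_no))
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q

def update_posterior(posterior, p_yes, q, answered_yes):
    """Bayes update of the class posterior after the user answers question q."""
    lik = p_yes[q] if answered_yes else 1.0 - p_yes[q]
    post = lik * posterior
    return post / post.sum()
```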

Biography
Serge Belongie received the B.S. degree (with honor) in Electrical Engineering from the California Institute of Technology in 1995 and the M.S. and Ph.D. degrees in Electrical Engineering and Computer Sciences (EECS) at U.C. Berkeley in 1997 and 2000, respectively. While at Berkeley, his research was supported by a National Science Foundation Graduate Research Fellowship. He is also a co-founder of Digital Persona, Inc., and the principal architect of the Digital Persona fingerprint recognition algorithm. He is currently an associate professor in the Computer Science and Engineering Department at U.C. San Diego. His research interests include computer vision and pattern recognition. He is a recipient of the NSF CAREER Award and the Alfred P. Sloan Research Fellowship. In 2004 MIT Technology Review named him to the list of the 100 top young technology innovators in the world (TR100).

Physics-Based Computer Vision, Again

Todd Zickler, Harvard
Tuesday, July 6, 2010
4:00 PM – 5:30 PM

Description
Computer vision systems are tasked with understanding an intricate visual world. A typical scene contains hundreds of surfaces that scatter light in distinct and complex ways, and these surfaces interact by occluding one another, casting shadows, and mutually reflecting light.
For computer vision systems to succeed, they must exploit structure within this complexity, and this requires appropriate models of the way in which images are formed. Vision techniques that rely on such models are often labeled "physics-based", and traditionally, physics-based methods have achieved tractability by making rather severe assumptions about the world. They possess elegant formulations but are often hard to apply in natural settings.
Over the past few years, our group has been working to relax these restrictions by developing models that provide a better balance between tractability and accuracy. We seek computational models that are complex enough to accurately describe the visual world, and at the same time, are simple enough to be "inverted" for inference purposes. The hope is to build a stronger foundation for the computer vision systems of the future.
In this talk I will summarize our recent work and describe two approaches for recovering shape and reflectance information in natural environments.

Biography
Todd Zickler received a B.Eng. degree in honors electrical engineering from McGill University in 1996 and a Ph.D. degree in electrical engineering from Yale University in 2004. He subsequently joined the Harvard School of Engineering and Applied Sciences, where he is currently an associate professor. Todd's interests span computer vision, image processing, computer graphics, and human perception, and much of his work is devoted to developing efficient representations for appearance. He received an NSF CAREER Award in 2006 and was named an Alfred P. Sloan Research Fellow in 2008. More information can be found on his website: http://www.eecs.harvard.edu/~zickler.

Applied Principles of Human Perception and Memory of Pictures

Aude Oliva, MIT
Tuesday, June 29, 2010
4:00 PM – 5:30 PM

Description
In this talk, I will describe our work on understanding how human observers perceive, organize, and memorize information from natural images and objects. I will discuss how these principles relate to making aesthetic decisions about images, organizing and memorizing pictures, and presenting images on small or large displays, among other visual tasks. Using examples of dual-purpose thumbnail images, which can be added to desktops and web pages, I will show how the brain can extract information from displays in counter-intuitive and interesting ways.

Biography
Aude Oliva is an Associate Professor of Cognitive Science in the Department of Brain and Cognitive Sciences at MIT. After a French baccalaureate in Physics and Mathematics and two M.Sc. degrees, she was awarded a Ph.D. in Cognitive Science from the Institut National Polytechnique de Grenoble, France. Beginning early in her professional career, she pursued multi-disciplinary work in human psychophysics, visual cognition, and computer vision through postdoctoral research in the UK, Japan, France, and the US, giving her a distinctive background for working on the challenging topic of visual scene understanding. She is the recipient of an NSF CAREER Award in Computational Neuroscience, and she is a Fellow of the Association for Psychological Science.

Microgeometry Capture using GelSight

Kimo Johnson, MIT
Tuesday, June 22, 2010
4:00 PM – 5:30 PM

Description
GelSight is a novel sensor capable of measuring surface texture and shape at high resolutions. It is essentially a slab of clear elastomer covered with a reflective skin. When an object presses into the sensor, the skin distorts to conform to the shape of the object's surface. Viewed from behind, the reflective skin appears as a relief replica of the surface, and we can estimate the surface geometry of the object using photometric stereo techniques. We have built several different systems using GelSight sensors, including a real-time texture scanner, a multi-touch trackpad, a robotic fingertip, and recently a microgeometry capture system capable of resolving geometry below 5 microns. In this talk, I will give an overview of the sensor and then describe how the sensor materials, illumination design, and capture techniques can be configured for microgeometry capture. I will show that the measured microgeometry can produce high levels of detail in rendering and can create effects that are typically modeled with statistical abstractions such as the BRDF. 
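
Since the abstract mentions photometric stereo, here is a minimal sketch of the classical Lambertian variant, assuming three or more captures under known light directions and the roughly uniform reflectance that the sensor skin provides; GelSight's actual calibration and capture pipeline is more involved.

```python
# Classical photometric stereo: recover per-pixel surface normals from
# several grayscale images taken under known, distant light directions.
import numpy as np

def photometric_stereo(images, lights):
    """images: (k, h, w) grayscale captures, k >= 3;
    lights: (k, 3) unit light directions."""
    k, h, w = images.shape
    I = images.reshape(k, -1)                       # one column per pixel
    # Lambertian model: I = L @ G, where G = albedo * normal (3 x pixels).
    G, *_ = np.linalg.lstsq(lights, I, rcond=None)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)          # unit normals, (3, h*w)
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```

Integrating the recovered normal field (for example by solving a Poisson equation on the implied gradient field) then yields the height map from which fine surface detail can be rendered.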

Biography
Micah Kimo Johnson is a Postdoctoral Fellow at MIT. He is primarily interested in the concept of realism in images, including aspects of images that contribute to realism, or conversely, aspects that indicate that an image is fake. His Ph.D. work focused on image forensics, i.e., detecting tampered images, but he has since worked on a variety of techniques for improving realism in computer graphics. He holds undergraduate degrees in Mathematics and Music Theory from the University of New Hampshire, an A.M. in Electro-Acoustic Music from Dartmouth College, and a Ph.D. in Computer Science from Dartmouth College.

High-Throughput Science

Hanspeter Pfister, Harvard
Tuesday, June 8, 2010
4:00 PM – 5:30 PM

Description
How is the brain wired? How did the universe start? How can we predict and prevent heart attacks? These are some of the great scientific challenges of our times, and answering them requires bigger scientific instruments, increasingly precise imaging equipment, and ever more complex computer simulations. The traditional model is to process the data on a remote supercomputer. However, low data transmission rates, high energy consumption, and the high price of large parallel machines are obstacles for many scientists. In this talk I will suggest that commodity high-throughput GPU computing is enabling high-throughput science, where we process massive data streams efficiently and analyze them rapidly, all the way from the instrument to the desktop. I will present an overview of three projects at Harvard that leverage GPUs for high-throughput science, ranging from neuroscience and radio astronomy to computational fluid-flow simulations.

Biography
Hanspeter Pfister is Gordon McKay Professor of the Practice in the School of Engineering and Applied Sciences at Harvard University. His research lies at the intersection of visualization, computer graphics, and computer vision. Before joining Harvard he worked for 11 years at Mitsubishi Electric Research Laboratories where he was most recently Associate Director and Senior Research Scientist. Pfister has a Ph.D. in Computer Science from the State University of New York at Stony Brook and an M.S. in Electrical Engineering from the Swiss Federal Institute of Technology, Zurich, Switzerland. You can contact him at pfister@seas.harvard.edu.

Upcoming Speakers
  • TBD
Where

Microsoft Research New England
First Floor Conference Center
One Memorial Drive, Cambridge, MA

Arrival Guidance

Upon arrival, be prepared to show a picture ID and sign the Building Visitor Log at the Lobby Floor Security Desk. Tell security the name of the event you are attending and ask to be directed to the appropriate floor. Talks are typically held in the First Floor Conference Center, but the location may occasionally change.

Mailing List

To subscribe to the mailing list, send an email to this email address.