Holoportation is a new type of 3D capture technology that allows high-quality 3D models of people to be reconstructed, compressed, and transmitted anywhere in the world in real time. When combined with mixed-reality displays such as HoloLens, this technology allows users to see and interact with remote participants in 3D as if they were actually present in their physical space. Communicating and interacting with remote users becomes as simple as face-to-face communication.
We are developing a system for the acquisition, transmission and display of real-time 3D digital content. Our goal is to enable live, immersive 3D communications and entertainment experiences. Our strategy is to acquire the 3D signal within a cubical volume. We represent this signal using high resolution colored voxels, and have created algorithms for acquisition, encoding, decoding, streaming, and display of this voxel data designed specifically for modern massively data-parallel GPUs.
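To illustrate the kind of compact, GPU-friendly layout such a voxel pipeline relies on, the sketch below packs a colored voxel into a single 64-bit integer: 10 bits per coordinate (a 1024^3 volume) plus 8 bits per color channel. This is an illustrative layout only, not the project's actual encoding.

```python
# Illustrative colored-voxel packing (assumed layout, not the project's codec):
# 10 bits per coordinate + 8 bits per RGB channel = 54 bits in one 64-bit word.

def pack_voxel(x, y, z, r, g, b):
    """Pack 10-bit coordinates and 8-bit color channels into one integer."""
    return (x << 44) | (y << 34) | (z << 24) | (r << 16) | (g << 8) | b

def unpack_voxel(v):
    """Invert pack_voxel, recovering (x, y, z, r, g, b)."""
    return ((v >> 44) & 0x3FF, (v >> 34) & 0x3FF, (v >> 24) & 0x3FF,
            (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF)

voxel = pack_voxel(100, 200, 300, 10, 20, 30)
assert unpack_voxel(voxel) == (100, 200, 300, 10, 20, 30)
```

A fixed-width word like this is convenient for massively data-parallel GPU kernels, since every voxel record has the same size and alignment.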
Room2Room is a life-size telepresence system that leverages projected augmented reality to enable co-present interaction between two remote participants. We enable a face-to-face conversation by performing 3D capture of the local user with color + depth cameras and projecting their virtual copy into the remote space at life-size scale. This creates an illusion of the remote person’s presence in the local space, as well as a shared understanding of verbal and non-verbal cues (e.g., gaze).
The capability of managing personal photos is becoming crucial. In this work, we have attempted to solve the following pain points for mobile users: 1) intelligent photo tagging, best photo selection, event segmentation and album naming, 2) speech recognition and user intent parsing of time, location, people attributes and objects, 3) search by arbitrary queries.
We propose a novel learning scheme called network morphism. It morphs a parent network into a child network, allowing fast knowledge transfer: the child network achieves the performance of the parent network immediately, and its performance continues to improve as training proceeds. The proposed scheme supports morphing any network in an expanding mode for arbitrary non-linear neurons, including depth, width, kernel-size, and subnet morphing operations.
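A minimal sketch of a function-preserving width morph in the spirit described above (a Net2Net-style widening with hypothetical variable names; the actual operators generalize this to arbitrary non-linear neurons): one hidden unit is duplicated and its outgoing weights are split, so the child computes exactly the same function as the parent.

```python
import numpy as np

def widen(W1, b1, W2, idx):
    """Function-preserving width morphism (illustrative sketch):
    duplicate hidden unit `idx` and split its outgoing weight in half,
    so the widened child network computes the same function as the parent."""
    W1c = np.vstack([W1, W1[idx:idx + 1]])        # copy incoming weights
    b1c = np.append(b1, b1[idx])                  # copy the bias
    W2c = np.vstack([W2, W2[idx:idx + 1] * 0.5])  # new unit gets half the weight
    W2c[idx] *= 0.5                               # original unit keeps the other half
    return W1c, b1c, W2c

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # hidden = relu(W1 @ x + b1)
W2 = rng.normal(size=(4, 2))                          # output = hidden @ W2
x = rng.normal(size=3)

parent = np.maximum(W1 @ x + b1, 0) @ W2
W1c, b1c, W2c = widen(W1, b1, W2, idx=2)
child = np.maximum(W1c @ x + b1c, 0) @ W2c
assert np.allclose(parent, child)  # child matches the parent immediately
```

Training then continues from the child's weights, which is why its performance can only improve on the parent's starting point.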
We study the problem of image captioning, i.e., automatically describing an image with a sentence. This is a challenging problem: unlike other computer vision tasks such as image classification and object detection, image captioning requires not only understanding the image but also knowledge of natural language. We formulate the problem as a multimodal translation task and develop novel algorithms to solve it.
We study the problem of food image recognition via deep learning techniques. Our goal is to develop a robust service that recognizes thousands of popular Asian and Western foods. Several prototypes have been developed to support diverse applications. We are also developing a prototype called Im2Calories, which automatically estimates calories and performs nutrition analysis from an image of a dish.
Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. In this project, we present a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding.
Spatial Audio is a project about creating a 3D audio experience using headphones. Spatial audio is also known as 3D stereo sound, or simply 3D audio. The primary applications are augmented and virtual reality, but this technology also enhances everyday activities such as listening to music or watching a movie on the screen of your tablet.
NUIgraph is a prototype Windows 10 app for visually exploring data in order to discover and share insight.
The RoomAlive Toolkit is an open source SDK that enables developers to calibrate a network of multiple Kinect sensors and video projectors. The toolkit also provides a simple projection mapping sample that can be used as a basis to develop new immersive augmented reality experiences similar to those of the IllumiRoom and RoomAlive research projects.
Mano-a-Mano is a unique spatial augmented reality system that combines dynamic projection mapping, multiple perspective views, and device-less interaction to support face-to-face, or dyadic, interaction with 3D virtual objects. Its main advantage over more traditional AR approaches is that users can interact with 3D virtual objects and with each other without cumbersome devices that obstruct face-to-face interaction.
RoomAlive is a proof-of-concept prototype that transforms any room into an immersive, augmented, magical entertainment experience. RoomAlive presents a unified, scalable approach for interactive projection mapping that dynamically adapts content to any room. Users can touch, shoot, stomp, dodge and steer projected content that seamlessly co-exists with their existing physical environment.
Animated computer graphics are projected onto the base of a fiber optic tree to create a sparse 3D display within the tree. This was done as an entry into Microsoft Research's MakeFest and demonstrated on 1/10/2014 to the MSRMakeFest community.
Using the Internet as a (noisy) knowledge base to mine semantics for multimedia data.
This paper presents a method for acquiring dense nonrigid shape and deformation from a single monocular depth sensor. We focus on modeling the human hand, and assume that a single rough template model is available. We combine and extend existing work on model-based tracking, subdivision surface fitting, and mesh deformation to acquire detailed hand models from as few as 15 frames of depth data.
Online 3D reconstruction is gaining newfound interest due to the availability of real-time consumer depth cameras. The basic problem takes live overlapping depth maps as input and incrementally fuses these into a single 3D model. This is challenging particularly when real-time performance is desired without trading quality or scale. We contribute an online system for large and fine scale volumetric reconstruction based on a memory and speed efficient data structure.
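A minimal sketch of the incremental fusion step underlying such systems, assuming a standard truncated signed distance function (TSDF) representation; the hypothetical `fuse` below folds one depth observation into per-voxel distance and weight arrays along a single camera ray (the actual system does this for full depth maps, in parallel, on the GPU):

```python
import numpy as np

TRUNC = 0.05  # truncation band in meters (assumed value)

def fuse(tsdf, weight, voxel_depth, observed_depth):
    """Fold one depth observation into a TSDF along a single camera ray.
    voxel_depth: depth of each voxel center along the ray.
    observed_depth: depth measured by the sensor for that ray."""
    sdf = observed_depth - voxel_depth        # signed distance to the surface
    mask = sdf > -TRUNC                       # skip voxels far behind the surface
    d = np.clip(sdf, -TRUNC, TRUNC) / TRUNC   # truncate and normalize to [-1, 1]
    # running weighted average, so repeated noisy observations converge
    tsdf[mask] = (tsdf[mask] * weight[mask] + d[mask]) / (weight[mask] + 1)
    weight[mask] += 1
    return tsdf, weight

z = np.linspace(0.0, 1.0, 11)                 # voxel centers along one ray
tsdf, w = np.zeros_like(z), np.zeros_like(z)
tsdf, w = fuse(tsdf, w, z, observed_depth=0.5)
# the surface lies where the TSDF crosses zero, i.e., near z = 0.5
```

The model surface is then extracted as the zero crossing of the fused field, e.g., by ray casting or marching cubes.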
We develop novel eye-gaze tracking technologies to make gaze tracking ubiquitously available for improved natural user interaction (NUI).
We built the Sketch2Cartoon system, an automatic cartoon-making system. It enables users to sketch the major curves of the characters and props they imagine; real-time search results drawn from millions of clipart images can then be selected to compose a cartoon. The selected components are vectorized and can therefore be further edited. With sketch-based input, even a child too young to read or write can draw whatever he or she imagines and get an interesting cartoon image.
Microsoft Research is happy to continue hosting this series of Image Recognition (Retrieval) Grand Challenges. Do you have what it takes to build the best image recognition system? Enter these MSR Image Recognition Challenges in ACM Multimedia and/or IEEE ICME to develop your image recognition system based on real world large scale data.
We argue that the massive amount of click data from commercial search engines provides a dataset that is unique in bridging the semantic and intent gaps. Search engines generate millions of click records (i.e., image-query pairs), which provide almost "unlimited" yet strong connections between semantics and images, as well as between users' intents and their queries. This site introduces one such dataset, Clickture.
Mobile video is quickly becoming a mass consumer phenomenon. More and more people are using their smartphones to search and browse video content while on the move. This project develops an innovative instant mobile video search system: users discover videos by simply pointing their phones at a screen and capturing a few seconds of what they are watching.
Gigapixel ArtZoom is an interactive panoramic image of Seattle that captures artists and performers in action throughout a 360-degree view of the city. You can zoom into the image to see dancers, acrobats, painters, performance artists, actors, jugglers, and sculptors—all appearing simultaneously within a single 20-gigapixel image. Visit the web site to explore the panorama and find out more.
We address the fundamental challenge of scalability for real-time volumetric surface reconstruction methods. We present a memory-efficient, streamable, hierarchical GPU data structure for 3D reconstruction of large-scale scenes with fine geometric details in real time. The system fuses live depth maps from a moving Kinect camera to generate high-quality 3D models of unbounded size.
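A toy illustration of the on-demand allocation such a memory-efficient structure relies on (a hypothetical class that ignores the hierarchical and streaming aspects of the actual GPU data structure): voxel blocks are created only where data lands, so memory grows with the observed surface rather than with the bounding volume.

```python
import numpy as np

BLOCK = 8  # 8x8x8 voxel blocks, allocated on demand (illustrative layout)

class SparseVolume:
    """Toy sparse voxel volume: a dictionary maps block coordinates to
    dense 8^3 arrays, so empty space costs nothing."""
    def __init__(self):
        self.blocks = {}  # (bx, by, bz) -> 8x8x8 float32 array

    def set(self, x, y, z, value):
        key = (x // BLOCK, y // BLOCK, z // BLOCK)
        block = self.blocks.setdefault(key, np.zeros((BLOCK,) * 3, np.float32))
        block[x % BLOCK, y % BLOCK, z % BLOCK] = value

    def get(self, x, y, z):
        block = self.blocks.get((x // BLOCK, y // BLOCK, z // BLOCK))
        return 0.0 if block is None else float(block[x % BLOCK, y % BLOCK, z % BLOCK])

vol = SparseVolume()
vol.set(3, 9, 17, 1.5)
assert vol.get(3, 9, 17) == 1.5
assert len(vol.blocks) == 1  # only one block allocated so far
```

The real system additionally streams blocks between GPU and host memory as the camera moves, which is what removes the bound on scene size.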
The goal of the Spin project is to enable users to capture photorealistic 3D models of objects using just an ordinary camera -- with no special lighting, sensors, or other equipment. Our approach works equally well for a mobile phone, a point-and-shoot camera, or a digital SLR camera. The results can be shared and viewed on a phone, in a web browser, or in a desktop application.