Making Virtual Meetings Feel Real

By Rob Knies, Managing Editor, Microsoft Research

Zhengyou Zhang has a vision: to bring people together.

He also has a strategy to achieve that vision: by utilizing multimedia technology. His latest tactic to reach the goal is the Personal Telepresence Station.

That’s the name of the combination of hardware and software that Zhang thinks, someday soon, will enable people half a world away to intermingle and exchange ideas and information as naturally as if they were in the same room.

“These days,” says Zhang, a principal researcher who manages the Multimodal Collaboration Research Team within the Communication and Collaborations Systems group at Microsoft Research Redmond, “with economic globalization and workforce mobilization, people are scattered around, and we need more and more collaboration between people.

Zhengyou Zhang

“Of course, people always think about how to save money, which is important, especially in this current economic situation, but they also see a lot of other benefits in easy, instantaneous collaboration. If you have an easy way to collaborate with others, you can resolve an issue immediately without scheduling a meeting in a specific room or traveling to a city on a different continent. Saving time, increasing productivity, and improving work-life balance can be more significant than saving money. The best collaboration is a face-to-face meeting, so the Personal Telepresence Station idea is to replicate the experience of a face-to-face meeting as much as possible.”

His technological solution uses multiple monitors, microphones, cameras, and speakers to create the illusion that the people with whom you are interacting are sitting in the same room, not dispersed across the face of the globe. Call it long-distance face-to-face communication.

“What do we have currently?” Zhang asks. “We have the conference call. We have everyone connected to a single phone number or using instant messaging.”

Those communication conduits can enable the exchange of audio or textual information, but they hardly replicate the complex human interactions that an in-person meeting can convey. You can’t use visual or auditory information to determine who is speaking to whom. Facial expressions and body language can’t be detected. Eye contact is absent. Critical secondary cues that enhance understanding are unavailable.

Zhang’s work in computer vision goes back to 1987. He began to focus on multimedia meetings in 2000, and not long thereafter, he began to examine audio speech enhancement. While work on the Personal Telepresence Station goes back only a couple of years, it’s actually the latest manifestation of a lifetime of work.

“Trying to improve the meeting or collaboration experience,” Zhang says, “has been our goal for the last decade.”

With such a background, he is quite aware of the limitations of current technology.

In a conference call, all audio signals are mixed into a single, monaural channel. You have to expend effort not only to determine what is being said, but also who is saying it. It’s a significant cognitive load.

Zhang recounts a personal experience from three years ago, during a conference call with colleagues in Beijing. At one point, he became confused because a colleague appeared to be contradicting arguments she had made just a few minutes earlier. Only later did he realize that the arguments had come from two different people.

“In our daily life,” Zhang explains, “when we talk to people in a meeting, we experience what we call the cocktail-party effect. Our ears naturally have the ability to tune to the voice of a particular talker, even if there are many others speaking at the same time. When you talk, I know you are talking. I don’t need to expend any effort. When another person is talking and the sound is coming from a different direction, I know this is not you talking. That makes everything much easier.”

From a visual standpoint, the challenges are similar.

“When I look at another person,” Zhang says, “you know I’m looking at that person because you see my side view. The vision of the Personal Telepresence Station project is to replicate a person’s spatial audio effect and spatial vision.”

“When I talk to you, I really want to look at your eyes,” Zhang adds. “Other people know I’m talking to you, not them. When I turn my head, they know I’m talking to them, not you. Imagine talking to somebody with sunglasses on. For me, that’s very troubling. I cannot see their eyes. I don’t know if they’re looking at me, if they’re paying attention. You really want to see the reaction of others.”

Current video-conferencing systems offer some help but are far from perfect.

“You can display multiple videos of remote parties on a screen, but there’s only one camera,” he explains, “and that camera angle is sent to everyone in the remote location. It’s as if they share the same eye. Whether I’m looking at person A or person B or person C, it doesn’t matter, because they have no idea. It’s the same image.”

The first step in adding nuance to such sterile virtual environments is to use spatialization by means of an array of hardware in the sites where communication will be taking place. Augmented by software, the Personal Telepresence Station can bring entirely new dimensions to what has been an experience starved of sensory nuance.

The Personal Telepresence Station can use multiple monitors to enhance the teleconferencing experience.

“We have either a big screen or multiple screens,” Zhang says. “If it’s multiple screens, we can dedicate each screen as the representation of a remote person. We attach a camera and a microphone to that screen and associate a loudspeaker with that screen. Each set of display, camera, speaker, and microphone is standing in for the remote person. When you talk to that set, you’re really talking to the virtual person from the remote site and looking into her eyes. That’s the idea.”

The screen of a large monitor, similarly, can be divided into portions so that each can be associated with a remote person. The concept of dedicated video and audio, shared across a network, remains the same.

There are practical constraints. The solution could scale to large numbers, but realistically, who contemplates having an office with dozens of monitors? In real-world settings, the solution seems likely to work best among small groups of two to five individuals.

“Theoretically, there is no limit,” Zhang says, “but we have to consider the cost.”

Also to be considered are studies on the size of workplace meetings, which indicate that those with just a handful of participants constitute a large percentage of all meetings that people attend.

Surprisingly, the use of the Personal Telepresence Station does not require equivalent hardware setups in both the local and remote locations. If each has a couple of loudspeakers and a couple of cameras, the virtual presence of meeting participants can be detected and spatialized. Attribute that to the magic of software.

“Through signal processing, we can virtually generate audio from any direction,” Zhang states. “Whether you come from the middle, 40 degrees to the left, or 40 degrees to the right, we can create virtual loudspeakers from two or more real ones. With this audio-spatialization technique, you don’t need the same number of loudspeakers as the number of remote participants.”
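
Zhang does not spell out the algorithm here, but the simplest way to picture a “virtual loudspeaker” is constant-power amplitude panning between two real speakers. The sketch below is a minimal Python illustration of that idea, not the team’s actual signal-processing pipeline; the function name, the 40-degree spread, and the example voices are assumptions for illustration, and real spatialization may also rely on head-related transfer functions or crosstalk cancellation.

```python
import numpy as np

def pan_to_angle(mono, angle_deg, spread_deg=40.0):
    """Place a mono voice at a virtual direction between two loudspeakers
    using constant-power amplitude panning. angle_deg runs from -spread_deg
    (left speaker) to +spread_deg (right speaker). Returns (left, right)."""
    # Map the angle to a pan position in [0, 1]; sine/cosine gains keep the
    # perceived loudness roughly constant across directions.
    p = (np.clip(angle_deg, -spread_deg, spread_deg) + spread_deg) / (2 * spread_deg)
    theta = p * np.pi / 2
    return np.cos(theta) * mono, np.sin(theta) * mono

# Illustrative use: render one remote participant 40 degrees to the left,
# another 40 degrees to the right, then mix into a stereo signal.
# left_a, right_a = pan_to_angle(voice_a, -40)
# left_b, right_b = pan_to_angle(voice_b, +40)
# stereo = np.stack([left_a + left_b, right_a + right_b], axis=1)
```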

The use of loudspeakers, however, causes a well-known echo problem. When a remote party’s voice is played, it is picked up by the microphone in the room and transmitted back to the remote party, who hears it as an echo of her own voice. To solve this, Zhang uses a technique called multichannel acoustic echo cancellation, which lets the microphones transmit the audio from the local participants but not the audio from the loudspeakers.
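
Acoustic echo cancellation is typically done with an adaptive filter that learns the loudspeaker-to-microphone echo path and subtracts its estimate from the microphone signal. Below is a minimal single-channel normalized-LMS sketch of the idea; the article describes a multichannel version that handles several loudspeakers jointly, and the function name, filter length, and step size here are illustrative assumptions rather than details of Zhang’s system.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=256, mu=0.5, eps=1e-6):
    """Subtract an estimate of the loudspeaker echo from the microphone signal.

    far_end : samples sent to the local loudspeaker (the remote voice)
    mic     : samples captured by the local microphone (echo + local speech)
    Returns the echo-cancelled microphone signal."""
    w = np.zeros(filter_len)       # adaptive estimate of the room's echo path
    buf = np.zeros(filter_len)     # most recent far-end samples, newest first
    out = np.zeros(len(mic))

    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_estimate = w @ buf
        e = mic[n] - echo_estimate           # residual: local speech + leftover echo
        out[n] = e
        # Normalized LMS update of the echo-path estimate.
        w += (mu / (buf @ buf + eps)) * e * buf
    return out
```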

To spatialize visual effects from a couple of cameras, he relies on a technique called Virtual Views.

“We are working on generating Virtual Views from a sparse set of cameras,” he explains. “If you have two cameras but there is a third remote person, this technology generates another view from the two real videos, a virtual video that makes it appear that you are looking at the third remote viewer.”
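
The article does not describe how the Virtual Views are computed. One common way to interpolate a view between two rectified cameras is to estimate per-pixel disparity and then warp one image partway toward the other, as in the rough sketch below. It ignores occlusions and hole filling, and OpenCV’s block-matching stereo stands in for whatever the team actually uses; treat it as an illustration of the general approach only.

```python
import cv2
import numpy as np

def interpolate_view(left_gray, right_gray, alpha=0.5,
                     num_disparities=64, block_size=15):
    """Synthesize a rough intermediate view between two rectified grayscale
    cameras. alpha=0 is approximately the left view, alpha=1 the right."""
    # Estimate per-pixel disparity between the rectified pair.
    stereo = cv2.StereoBM_create(numDisparities=num_disparities, blockSize=block_size)
    disparity = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity = np.clip(disparity, 0, None)          # drop invalid (negative) values

    h, w = left_gray.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))

    # Approximate backward warp: shift each pixel a fraction of its disparity
    # toward the virtual viewpoint (no occlusion handling).
    map_x = (xs + alpha * disparity).astype(np.float32)
    return cv2.remap(left_gray, map_x, ys, interpolation=cv2.INTER_LINEAR)
```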

It’s hard to overstate the importance of non-verbal visual information. Albert Mehrabian, professor emeritus of psychology at UCLA, has stated the 7%-38%-55% Rule, often known as the “three V’s”—verbal, vocal, and visual. This often-quoted dictum holds that there are three elements in face-to-face communication of feelings and attitudes: Words account for 7 percent of the communicative value, tone of voice accounts for 38 percent, and visual cues account for 55 percent.

Furthermore, Zhang notes, such information takes on additional import in cross-cultural communication.

“People have found out that in cross-cultural collaboration, video is very important,” he says, “because when you have a language barrier, you try to use more gestures to convey the meaning of what you want to say. The reaction is not expressed in audio, and gaze and other things are very important.”

There are various high-end technologies, costing hundreds of thousands of dollars, that create the illusion that remote participants are in the same room, but those are primarily hardware solutions for chief executive officers and corporate boardrooms, not for information workers. The Personal Telepresence Station uses a variety of techniques to maximize the visuals of virtual communication economically and effectively.

One such technique is Virtual Lighting.

“In the high-end telepresence systems,” Zhang says, “people build a dedicated room with the right material on the wall and the right position of the lights, and that’s very expensive. We use image-processing technology to analyze the video and compare it with the best video we want to achieve, then try to adjust.

“But the question is: What makes the best video? We use [still-photography] celebrity images as targets and modify the video so that it looks like the celebrity images, because those images were taken by professionals, with appealing color schemes and nice lighting.”
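
One plausible reading of “compare it with the best video we want to achieve, then try to adjust” is to match simple color statistics of the live frame to those of a well-lit reference portrait. The sketch below does Reinhard-style mean-and-standard-deviation matching in Lab color space; it is an illustrative stand-in, not the team’s actual algorithm, and the function and image names are assumptions.

```python
import cv2
import numpy as np

def match_color_statistics(frame_bgr, target_bgr):
    """Nudge a webcam frame toward the color statistics of a well-lit target
    image, channel by channel in Lab space."""
    frame = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    target = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

    for c in range(3):
        f_mean, f_std = frame[..., c].mean(), frame[..., c].std() + 1e-6
        t_mean, t_std = target[..., c].mean(), target[..., c].std()
        # Shift and rescale this channel so its statistics match the target's.
        frame[..., c] = (frame[..., c] - f_mean) * (t_std / f_std) + t_mean

    frame = np.clip(frame, 0, 255).astype(np.uint8)
    return cv2.cvtColor(frame, cv2.COLOR_LAB2BGR)
```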

Comparison with celebrity images also underpins a second technique, Active Lighting. The Personal Telepresence Station uses an array of computer-controlled LEDs, mounted along the side or the top of a monitor, so the computer can set the intensity of each light and thereby optimize the image. The celebrity shots help to fine-tune skin tone and facial contrast.
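
If the camera response is roughly linear in light, choosing LED intensities can be framed as a small non-negative least-squares problem: find the mix of per-LED “basis” images that best approximates a target look. The sketch below assumes such basis images have been captured one LED at a time; that setup, the linearity assumption, and the function names are illustrative guesses, not a description of the actual controller.

```python
import numpy as np
from scipy.optimize import nnls

def solve_led_intensities(basis_images, target_image):
    """Choose non-negative intensities for each computer-controlled LED so the
    combined illumination approximates a target image, assuming the camera
    response is roughly linear in light.

    basis_images : frames captured with exactly one LED at full power each
    target_image : the desired look (e.g., derived from a well-lit portrait)"""
    # Each column of A is one LED's contribution; solve A @ x ~= b with x >= 0.
    A = np.stack([b.reshape(-1).astype(np.float64) for b in basis_images], axis=1)
    b = target_image.reshape(-1).astype(np.float64)
    intensities, _residual = nnls(A, b)
    return intensities
```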

“The goal is the same thing: to reach the best video quality,” Zhang says. “With the lights, it’s better than just doing the image processing, especially in low-light rooms. By image processing alone, you can’t improve the signal-to-noise ratio much. But when you add the lights, you can really increase the signal.”

A third effort to make the virtual communication experience more lifelike is to address gaze awareness.

“Eye-gaze correction is also very important for video conferencing,” he says. “When you look at the remote person, you look at the screen. The remote person is looking at you through the camera, which is mounted above the screen. There is a divergence of angles. A study says that those angles should diverge by less than five degrees, and if the angle is bigger than that, it’s not good. So what we do is to generate a virtual video such that the video looks like it’s taken from the back of the screen.”
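
A quick back-of-the-envelope check shows why a camera mounted above the screen usually breaks that five-degree rule: the divergence is simply the arctangent of the camera’s offset over the viewing distance. The numbers below are hypothetical but typical.

```python
import math

# Hypothetical numbers: the camera sits 6 cm above where the remote person's
# eyes appear on screen, and the viewer sits 60 cm from the display.
camera_offset_m = 0.06
viewing_distance_m = 0.60

divergence = math.degrees(math.atan2(camera_offset_m, viewing_distance_m))
print(f"gaze divergence ~ {divergence:.1f} degrees")  # ~5.7 degrees, over the 5-degree threshold
```

Rendering the video as if the camera sat behind the screen, where the remote person’s eyes are displayed, drives that angle toward zero.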

While this research might seem quite applied and specific, Zhang assures that it is all part of a virtuous cycle that ultimately contributes to basic-research efforts to advance the state of the art.

“We want to solve practical problems that improve the collaboration,” he stipulates. “But the technologies behind it require very basic research. They touch the core of computer vision and multimedia: how to improve video and audio quality, how to generate novel views. To generate novel views, you have to know the camera direction, you have to know how the point in 3-D is related to a pixel in the image. You have to know how a pixel in one image is related to another pixel in the other camera.”
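
The relationship Zhang alludes to, between a point in 3-D and a pixel in the image, is the standard pinhole projection x ~ K[R|t]X used throughout computer vision. A toy sketch, with made-up intrinsics and an identity pose:

```python
import numpy as np

def project_point(K, R, t, X_world):
    """Map a 3-D point to a pixel with the pinhole model x ~ K [R | t] X.
    K holds the intrinsics (focal length, principal point); R, t are the pose."""
    X_cam = R @ X_world + t      # world -> camera coordinates
    x = K @ X_cam                # camera -> homogeneous image coordinates
    return x[:2] / x[2]          # divide out depth to get the pixel

# Toy example with made-up numbers:
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
print(project_point(K, R, t, np.array([0.1, 0.05, 2.0])))  # -> roughly [360., 260.]
```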

In fact, this is a fundamental problem in computer vision, and Zhang has developed a camera-calibration technique that is used by almost every computer-vision research group in the world. Another widely adopted result of his research is robust image matching through recovery of epipolar geometry. His team also helped develop the Microsoft RoundTable™ technology.
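
Planar-pattern calibration of the kind Zhang pioneered is what, for example, OpenCV’s calibrateCamera routine performs: photograph a checkerboard from several angles, detect its corners, and recover the intrinsics and per-view poses. A minimal sketch, assuming a 9x6 board and hypothetical image filenames:

```python
import cv2
import glob
import numpy as np

pattern = (9, 6)  # inner corners per row and column of the checkerboard

# Known 3-D corner positions on the board's plane (Z = 0), in board units.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("checkerboard_*.png"):      # hypothetical filenames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Recover intrinsics, distortion, and per-view pose from the planar views.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
print("intrinsics:\n", K)
```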

“We are reaching a point that we can solve practical problems,” he reports. “These practical problems also help us to define new research problems, basic research problems. From basic research to practical things and then many more basic research problems to solve—it’s a positive loop.”

Zhang’s list of collaborators is long and varied, including Philip A. Chou, also a principal researcher and manager of the Communication and Collaboration Systems group, and Multimodal Collaboration team members Zicheng Liu, Rajesh Hegde, Qin Cai, and Cha Zhang. The project is a joint effort with Xuedong Huang’s Communications Incubation Center, which includes Christian Huitema, a Microsoft distinguished engineer, Jayman Dalal, and Wanghong Yuan. Design guru Bill Buxton, another principal researcher, has provided invaluable insights. And Wei-ge Chen, software architect for the Advanced Development Team, contributed technological assistance.

Together, they have surmounted a number of challenges. Zhang cites as one the necessity of performing multidisciplinary research on both sociological and real-world customer needs. Another is the complexity of reproducing the richness of real environments.

There are more to address. Zhang and his team are exploring ways to enable three-dimensional conferencing, to provide a more immersive experience. They also are examining data collaboration: Hegde has been working on a collaborative Visual Studio®, in which software developers in different locations work on different parts of the same program at the same time, and the software flags any conflicts automatically, in context, so the programmers can confer via audio and video to resolve them. Figuring out how to reduce the bandwidth these solutions need in real-world environments represents another opportunity for further study. But then, what would researchers do without further fields of inquiry?

“I am really excited about inventing new stuff,” Zhang says. “It’s a really exciting journey. Researchers are always proud when something we invented is used by others. I expect many more things to be proud of in the future.”
