Video Investigations Enable Researcher Hua to Find Success
By Rob Knies
October 14, 2008 1:00 PM PT

You’re 34 years old. You’re 7,000 miles from home. You’re at the prestigious Massachusetts Institute of Technology, speaking during the Emerging Technologies Conference, facing an illustrious gathering of some of the brightest minds in the world. You’ve got 90 seconds to explain what your work is all about.

How do you respond?

If you’re Xian-Sheng Hua, you do as you have so many times before: You step forward, you meet the challenge head-on, and you conquer it by sheer force of intellect and will.

Xian-Sheng Hua

Hua, a lead researcher in the Internet Media Group within Microsoft Research Asia, specializes in the exploration of multimedia advertising and multimedia content analysis. His work has been so successful that he recently was honored by Technology Review, one of the most renowned technology magazines in the world, as one of the 2008 recipients of the TR35 award, which annually recognizes 35 outstanding innovators under the age of 35.

“This is a recognition of Xian-Sheng’s amazing work in the content-based video-analysis area,” says Shipeng Li, principal researcher and research manager of the Internet Media Group. “More importantly, this is also a recognition of Microsoft Research Asia’s great achievement in the past 10 years as the hottest computer lab on the planet. This honor belongs to Xian-Sheng, and it also speaks volumes about everyone at Microsoft Research Asia.”

It is, indeed, a dizzying achievement for Hua, a native of Yingshan County, Hubei Province, in East Central China who received his bachelor’s degree and Ph.D. from Peking University. But he rose to the occasion. Asked to deliver a brief “elevator speech” to attendees of the Emerging Technologies Conference, here’s what he had to say:

“Existing video search engines are mainly based on indexing surrounding text or recognized speech, which frequently cannot reflect the real content of videos. A recent solution to this issue is to convert pixels in a video into text, using machine-learning techniques. That means learning visual patterns for certain keywords such as objects and events and then recognizing these keywords from video. However, due to the large gap between the pixels and the semantics, both the speed and the accuracy of the system will be out of control with the increase of the video data and the number of keywords.

“Recently, we are trying to solve this issue by proposing a new scheme which contains two key factors. One is that we’re not only using dedicated training data labelers, but also intelligently using a large amount of grassroots Internet users in the learning cycle. The other is a core approach called online multi-label active learning, which enables us to efficiently, intelligently, and endlessly leverage these labeling contributions to improve both the speed and the accuracy. We believe that this scheme is not only applicable for video indexing, but also for image search and even other, larger-scale classification problems.”

What Hua is suggesting is a novel technique to attain the Holy Grail of video search: indexing the visual content of the video itself, not merely the words being used on a Web site to describe the video.

Xian-Sheng Hua's TR35 citation.

That’s the sort of innovative approach that gained Hua the TR35 recognition, one he shared this year with Microsoft Research colleague Meredith Ringel Morris, who has gained significant attention with her Microsoft Search Together project. In addition, two others from Microsoft made the list: Blaise Agüera y Arcas, a Live Labs architect who helped create Photosynth™, and Johnny Chung Lee, a researcher for the Applied Sciences team.

Hua, the third person from Microsoft Research Asia to win the Technology Review young-innovators award in the past five years, was delighted and surprised by the honor.

“I got an e-mail from Jason Pontin, editor in chief and publisher of Technology Review, that the editors of the magazine had named me one of this year’s TR35,” Hua recalls. “I was a little bit surprised, as this award seldom goes to China, especially to a person who has grown up totally in China without overseas experiences.”

Hua’s work, however, had made him difficult to overlook. He has authored or co-authored more than 130 academic papers, 60 of them related to multimedia search. He has more than 30 patents issued or pending. He has served as associate editor of IEEE Transactions on Multimedia and as an editorial-board member of the journal Multimedia Tools and Applications. And he is an adjunct professor of the University of Science and Technology of China.

Not bad for a 34-year-old, but then, Hua is passionate about his work.

“I’m interested,” he says, “in doing something really helpful for people who want to access, utilize, and share information.”

From that standpoint, Hua is having the impact he’s seeking, on both the academic and professional fronts.

  • During the Association for Computing Machinery (ACM) Multimedia 2007 conference, held in Augsburg, Germany, his paper Correlative Multi-Label Video Annotation—co-written with Guo-Jun Qi and Jinhui Tang of the University of Science and Technology of China, Microsoft Research Asia colleague Tao Mei, Yong Rui of the Microsoft China Research & Development Group, and Hong-Jiang Zhang of the Microsoft Advanced Technology Center—won Best Paper honors.
  • Another presentation, Video Collage—co-authored with Mei, Xueliang Liu and He-Qin Zhou of the University of Science and Technology of China, and Bo Yang of Tsinghua University—won the ACM Multimedia 2007 award for Best Demo. Video Collage assembles a synthesized collage consisting of representative image samples from a video, enabling efficient video browsing in a manner similar to what AutoCollage, a recent product from Microsoft Research Cambridge, does for a collection of still photographs.
  • The paper When Multimedia Advertising Meets the New Internet Era, authored by Hua, Microsoft Research Asia colleague Tao Mei, and Li, won the Best Poster Paper Award during the Institute of Electrical and Electronics Engineers’ 2008 International Workshop on Multimedia Signal Processing, held Oct. 8-10 in Cairns, Australia.
  • In the 2007 TREC Video Retrieval Evaluation, Hua finished first in fully automatic video search, a competition designed to promote progress in content-based retrieval from digital video.
  • He and his Microsoft Research Asia colleagues continue to make significant contributions to international academic conferences. ACM Multimedia 2007 featured eight long papers from Microsoft Research Asia, five of them from the Internet Media Group. This year’s event, to be held in Vancouver, British Columbia, from Oct. 27-31, features seven long papers from the lab, five of them from the Internet Media Group. The group also placed five papers in this summer’s IEEE Conference on Computer Vision and Pattern Recognition.
  • Then there are his contributions to Microsoft products. No fewer than five feature advancements catalyzed by Hua: Windows Movie Maker, Windows XP Media Center Edition, Windows Vista, MSN Video, and Live Search Video.

“I’m proud,” he enthuses, “of what I have transferred into Microsoft products.”

These days, together with Mei, Hua is spending much of his time on the Make Sense multimedia-advertisement project, which incorporates three key components: VideoSense, ImageSense, and MakeSense 2.0.

VideoSense addresses the problem of placing context-sensitive ads into a video stream while respecting viewers' sensitivities.

“There are two basic problems,” Hua explains. “The first is: Where do we insert it? One method is to insert ads in the video, like the ads in TV programs. We need to find appropriate insertion points according to the content of the video.”

To determine that, Hua uses two measures: content discontinuity, making ads less intrusive and less interruptive, and the attractiveness or importance of the video content.

“We will detect major discontinuities,” he says. “The higher those are defined, the better the probability that we can insert ads there. This is from the viewpoint of the user. They don’t want to be interrupted.”

Then, of course, there is the viewpoint of the advertiser.

“They want their ads to be viewed by as many people as possible,” Hua says. “We want to put ads at this point, for example, when something very exciting happens but the scene changes immediately.”
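The balance between the two viewpoints can be illustrated with a small Python sketch. It is not the VideoSense implementation: the Boundary structure, the 0-to-1 scores, and the weight parameter are all assumptions made for illustration. It simply ranks candidate insertion points by a weighted blend of content discontinuity (the viewer's concern) and attractiveness (the advertiser's concern).

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    """A candidate insertion point between two shots (hypothetical structure)."""
    time_sec: float
    discontinuity: float   # assumed 0..1: how sharply the content changes here
    attractiveness: float  # assumed 0..1: how engaging the preceding content is


def rank_insertion_points(boundaries, weight=0.5, top_k=1):
    """Return the top_k boundaries by a weighted blend of the two measures.

    weight is an assumed tuning knob: 1.0 considers discontinuity only
    (viewer's side), 0.0 considers attractiveness only (advertiser's side).
    """
    def score(b):
        return weight * b.discontinuity + (1.0 - weight) * b.attractiveness
    return sorted(boundaries, key=score, reverse=True)[:top_k]


candidates = [
    Boundary(time_sec=12.0, discontinuity=0.9, attractiveness=0.2),
    Boundary(time_sec=47.5, discontinuity=0.8, attractiveness=0.9),
    Boundary(time_sec=80.0, discontinuity=0.1, attractiveness=0.95),
]
best = rank_insertion_points(candidates)[0]
print(best.time_sec)  # 47.5: exciting content followed by a clear scene change
```

With equal weighting, the sketch favors the point where an exciting passage is followed by a sharp scene change, which is the case Hua describes.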

The second problem with placement of video ads is relevance: Which ads should be inserted?

“For text-based ads, many people are working on this using text matching, like search,” he says. “But for video, we only have speech or surrounding text, but not any information of the video around the insertion point. We still use the textual information, if we have it, but sometimes we can use some content-based features.”

To accomplish this, he analyzes the features in the video stream, such as the level of motion, the volume or abrupt changes in the music, and highly dominant colors.

“For a video with lots of motion, we can use an ad with lots of motion,” Hua says, “or we can do the opposite and use a very slow ad. This kind of insertion might get much more attention. Besides, semantic video annotation also provides extra ‘hidden’ textual information for us to do relevance matching.”
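The two strategies Hua mentions, matching the video's motion or deliberately contrasting with it, might look like this in a minimal Python sketch. The function name, the single 0-to-1 motion score, and the sample ads are hypothetical; the real system also weighs other features such as audio and color.

```python
def pick_ad_by_motion(video_motion, ads, contrast=False):
    """Pick an ad name from `ads`, a dict of name -> motion level (assumed 0..1).

    contrast=False picks the ad whose motion is closest to the video's;
    contrast=True picks the ad whose motion is farthest from it.
    """
    def distance(name):
        return abs(ads[name] - video_motion)
    return max(ads, key=distance) if contrast else min(ads, key=distance)


ads = {"sports_car": 0.9, "tea": 0.1, "phone": 0.5}
print(pick_ad_by_motion(0.8, ads))                 # sports_car
print(pick_ad_by_motion(0.8, ads, contrast=True))  # tea
```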

In utilizing such techniques, Hua finds himself exploring the similarities and the differences between online and televised video.

“For online video,” he explains, “we have much more opportunity to do relevance matching. Most TV ads are inserted manually or controlled by time interval. Seldom are they relevant to the content.”

In addition, short-form online videos need ad treatments that vary from those of TV shows and movies longer in duration.

“For short videos, we cannot insert too much into a frame,” Hua says. “Most likely we’ll be inserting ads after or before the video, or only inserting one clip into the stream. It will work, but we need to control the duration of the ads we insert.”


Still, the key is to attract user attention, not to drive it away. There are two ways to capture interest.

“The first principle is that we want to make ads less intrusive,” Hua says. “The duration should be short, the ad should be inserted at a point of high discontinuity, and it should be relevant to the content surrounding it.

“The second thing is that the content needs to be free. For a movie or some TV programs, you might be charged, but if you want to get it for free, as on YouTube, you have to view some ads.”

There are also efforts to make the ads themselves more interesting and to entice the user to engage with the content. That’s where Make Sense 2.0 comes in. For example, a photo can be transformed into a game by dividing it into blocks, ordered randomly, and asking the viewer to rearrange the blocks to create a unified image.

“It’s a very simple game,” Hua says, “and we can find a way to show some ads while a user is playing the game. This turns video or image ads into games.”
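The block-rearranging game can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's code: the image is reduced to a list of block indices, and the function names are hypothetical.

```python
import random


def make_puzzle(rows, cols, rng):
    """Return a shuffled list of block indices for a rows x cols grid.

    order[p] == i means the block that belongs in slot i is shown in slot p.
    """
    if rows * cols < 2:
        raise ValueError("need at least two blocks to shuffle")
    order = list(range(rows * cols))
    while order == sorted(order):  # make sure the puzzle starts unsolved
        rng.shuffle(order)
    return order


def swap(order, a, b):
    """Swap the blocks shown in slots a and b (one player move)."""
    order[a], order[b] = order[b], order[a]


def is_solved(order):
    """The image is whole again when every block is back in its own slot."""
    return all(block == pos for pos, block in enumerate(order))


puzzle = make_puzzle(3, 3, random.Random(0))
# Simulate a player: move the correct block into each slot in turn.
for pos in range(len(puzzle)):
    swap(puzzle, pos, puzzle.index(pos))
print(is_solved(puzzle))  # True
```

An ad could be shown between moves or once is_solved returns True; where exactly it appears is a design choice the article leaves open.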

There are other tactics that might prove valuable. For example, mousing over an image might change something, which could pique a user’s interest and prompt further mouse-overs.

Video advertising, as Hua foresees it, will be a disaggregated effort.

“These ads are inserted on the fly,” he says. “We don’t need to generate video with the ad in it. The video will play to a certain point, then the ad will be played, and then we’ll come back to another screen. It’s very flexible.”

ImageSense, the third part of the Make Sense project, focuses on overlaying ad content on parts of an image. With progress being achieved in the computer analysis of visual content, a relatively static or non-informative portion of an image can be used to display an ad.

“We can put ads into an area that is less interesting,” Hua says. “We’ll put ads in the area that will not affect the whole image, because we have technology that detects the salient area of the image, and then we can put ads in the non-salient area.”
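As a rough illustration of placing an ad in a non-salient area, the Python sketch below assumes a saliency map has already been computed (one score per cell, higher meaning more eye-catching) and searches for the ad-sized window with the lowest total saliency. How ImageSense actually detects salience and positions ads is not described here, so the function and the sample map are hypothetical.

```python
def least_salient_window(saliency, win_h, win_w):
    """Return (row, col) of the top-left corner of the win_h x win_w window
    with the lowest total saliency. `saliency` is a list of equal-length rows.
    """
    h, w = len(saliency), len(saliency[0])
    if win_h > h or win_w > w:
        raise ValueError("window larger than image")
    # Summed-area table: sat[r][c] holds the sum of saliency[0:r][0:c].
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            sat[r + 1][c + 1] = (saliency[r][c] + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])
    best, best_pos = None, None
    for r in range(h - win_h + 1):
        for c in range(w - win_w + 1):
            total = (sat[r + win_h][c + win_w] - sat[r][c + win_w]
                     - sat[r + win_h][c] + sat[r][c])
            if best is None or total < best:
                best, best_pos = total, (r, c)
    return best_pos


# Toy saliency map: the subject sits on the left, the top right is plain.
saliency_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.7, 0.0, 0.0],
    [0.2, 0.3, 0.6, 0.5],
]
print(least_salient_window(saliency_map, 2, 2))  # (0, 2)
```

The summed-area table keeps each window's total a constant-time lookup, which matters once the map is as large as a real image.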

He also stresses the importance of advertising to improving the multimedia experience for Internet users.

“Ads are a very promising way to monetize online multimedia data,” Hua says. “There is a lot of multimedia on the Web, but currently, it’s very difficult to monetize it. If we cannot get money from it, we have difficulty improving it.

“Most existing ad platforms are only using text. But we’re using content analysis. This is our core technology to improve ads on multimedia content. This can be regarded as a new multimedia ad platform. It’s a novel way of doing ads.”

Video search, multimedia ads, video annotation—each represents an effort to further Xian-Sheng Hua’s personal research mission.

“I want,” he says, “to enable effective, efficient, and natural access of visual content, videos and images. I feel really good when I see my research work in Microsoft products, getting used by so many people.”