Ralf Herbrich estimates that more than 250,000 English-language news articles are posted online every 24 hours, and that is a conservative figure.
Herbrich, director of Microsoft's Future Social Experiences (FUSE) Labs in Cambridge, U.K., is well-acquainted with statistics on news feeds, tweets, and traffic volumes, thanks to his work on Project Emporia, an experimental web service that sifts through news streams in real time to bring stories of interest to users, categorized by topic area and ranked for relevance.
The difference between Project Emporia and other news services?
“What Project Emporia provides,” Herbrich explains, “is very fine-grained user control of news selection. It serves up news based on your specific preferences, rather than leaving it up to the editor of a news site. And it’s easy to state your preferences: You just vote on the stories.”
Available for the web and Windows Phone 7, Project Emporia groups news articles under eight broad categories. It all looks fairly standard until the user scans or reads the stories and gives each a vote of “show me more like this” or “show me less like this.” Almost immediately, Project Emporia begins to recommend stories based on how the user has been voting. Each of the user’s votes improves the relevance of Project Emporia’s choices, for that individual and for every other user.
Launched in June 2010, Project Emporia has just passed the six-month mark. During this time, Herbrich and his colleagues have been discussing Emporia online with users of the service, an activity which has led to many refinements—as well as lively debates within the team. For a quick look under the hood, we asked Herbrich to reflect on his experience with the project.
Q: Emporia “discovers” news stories by ingesting tweets that contain URLs to articles. Why the decision to use Twitter?
Herbrich: The core algorithms behind Emporia have to work in large-scale web scenarios, so Twitter is a good candidate because it generates high volumes of content. But also, Twitter has been changing from a micro-blog to a news-distribution service. There are over 90 million tweets per day, and the percentage of tweets that are referential—containing links to rich objects such as web pages or videos—has gone up from 12 percent to nearly 25 percent. Twitter can be viewed as a distribution network for rich content, as well as a social network, and those statistics triggered our decision to use Twitter.
But Twitter is just one type of news source. Emporia, and the technology it’s built on, is not a Twitter reader. It's a news-story reader, and there is no reason why we can’t apply the same technology to other news sources.
Q: Scalability to handle such volumes was obviously a challenge, but what other challenges were involved in using tweets as news sources?
Herbrich: As you can imagine, a single news story could generate multiple tweets. We need to make sure we are never ingesting the same story twice, so we check the URLs. Then we fetch the actual web page and analyze the entities and keywords on the page. We also use clustering, to avoid serving up too many versions of the same news item. In the Technology category, for example, we begin by selecting the 400 most important stories. As a result, when you are in Emporia, you may see only one or two articles about Kinect at the Consumer Electronics Show. But in reality, with all the journalists and bloggers covering the show, there were more than 500 articles, and we have sifted through them in real time. This is pretty impressive, because whenever anyone tweets a news article, we usually discover it within a few seconds after it comes into Twitter. Then it takes five to 10 seconds to serve up that article. The process of analysis, storage, and pushing it into the memory index happens within minutes, and we process on the order of 15 million tweets and 1 million stories each day.
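The de-duplication and clustering steps Herbrich describes can be illustrated with a minimal sketch. This is not Emporia's code: the function names, the tracking-parameter filter, the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for illustration, and keyword extraction from the fetched page is taken as given.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query-parameter prefixes to ignore when comparing URLs.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url):
    """Normalize a URL so trivially different links to one story compare equal."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def ingest(tweeted_urls, seen):
    """Return only the URLs not ingested before, recording them in `seen`."""
    fresh = []
    for url in tweeted_urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_stories(stories, threshold=0.5):
    """Greedily group (url, keywords) pairs whose keywords resemble a cluster's first story."""
    clusters = []
    for url, keywords in stories:
        for cluster in clusters:
            if jaccard(keywords, cluster[0][1]) >= threshold:
                cluster.append((url, keywords))
                break
        else:
            clusters.append([(url, keywords)])
    return clusters


seen_urls = set()
fresh_urls = ingest(
    [
        "http://Example.com/story/?utm_source=twitter",
        "http://example.com/story",
        "http://example.com/other",
    ],
    seen_urls,
)
clusters = cluster_stories(
    [
        ("a", {"kinect", "ces", "microsoft"}),
        ("b", {"kinect", "ces", "demo"}),
        ("c", {"royal", "wedding"}),
    ]
)
```

In this toy run, the first two tweeted links collapse to one story, and the two Kinect articles land in the same cluster, so only one of them would need to be shown. A production system would need a far more scalable clustering method than this greedy pass.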
Q: You started building Emporia in late 2009 and launched it by June 2010. That seems like a quick turnaround.
Herbrich: First of all, while it’s true we began in September of 2009, many of the features of Emporia were based on research that people had been working on for the last three to five years. For example, the personalized-recommendation technology came from some research called Matchbox. We had already developed algorithms for click-through prediction. We also had an active-learning online-classification system for new keywords and other content indicators in tweets. And without Azure, we could not have built the entire Emporia system end to end in such a short time, with only a few people, all the while being certain it would scale.
Another reason the project got delivered in a timely manner was the wonderful collaboration between Microsoft Research Cambridge, the Search Technology Center Europe (STC-E), and FUSE Labs. We never differentiated whether someone worked at Microsoft Research Cambridge, STC-E, or FUSE Labs. We are all on the same team.
Q: You are intimately familiar with all the technology in Project Emporia. But once you started using it, did anything unexpected occur?
Herbrich: Yes. There’s the mathematics behind the solution, and then there is how the application of the mathematics manifests itself. I’ll tell you about something unexpected that happened. When we first came up with the idea of a front page that displayed a top story for each category, we saw this as a way to use clustering to achieve diversity in the news, to avoid showing just Technology stories, for instance.
Initially, the design goal of the user interface was to provide a discovery experience without using keywords, with no textual way of adding news channels. But when we tested it with users, they said it took too long to get to the news they wanted to see. For example, if you want news on the upcoming royal wedding, and this news is not among the first 10 items of the Entertainment category, you just can’t get there without working through a lot of other stories first. So we added the ability to create keyword-based channels.
When we added keywords on top of diversity, we had a real aha moment. After we entered “Kate Middleton” as a keyword, stories about Kate came up in many different categories. In Technology, we got a story about the BBC’s technology challenges for broadcasting and streaming wedding coverage. In Business, we found a story that said the economic impact of the wedding will be £650,000,000. In Lifestyle, there was a story on possible wedding-dress designers and one about Kate Middleton look-alike contests. That was so cool! You do the math and build the algorithms, but then you see something like this, and you think, “How could we ever have predicted this?”
Q: What was your biggest challenge?
Herbrich: In a project such as Emporia, you have a personalization technology that you can’t really test along the way during the development phase. It's not like developing the kernel for an operating system, where you have parameters to work to, performance figures you want to reach, and ways to measure your effectiveness. We had to select the right feature sets for Emporia and do it in an objective and mathematical way—for example, to be able to say, “with this set of features, we only have keywords in the title, but it’s almost as good as having keywords from both the title and the body of the story.” But how can you make that determination without ever having gone live? We trusted in the mathematics, but the system needs to work on data from real people, because it's all about personal observation and what's relevant to you. Unless you build the system end to end and use it, you can't compare.
What we did to address this was to have daily “voting sessions” within the team. We’d take the unfinished product and spend an hour each day over the course of several weeks, looking at the news and voting. Someone got assigned to vote on Technology, someone on World Events and so on. It gave us a way to check whether things were working as expected. But what that meant was a lot of British news articles came up when Emporia first launched, because we are all based in the U.K.!
Q: In the meantime, you had to have faith in the mathematics.
Herbrich: We did. We totally had faith. But there is faith, and then there is guts!
Q: Do you think news reading will change with the adoption of technology such as Project Emporia?
Herbrich: What Emporia demonstrates is a way to put very fine-grained selection into your own hands instead of having news that is curated by an editor who provides articles of interest to the widest possible audience. With Emporia, your vote is what trains the system to your preferences. In the beginning, I would vote on 30 to 50 stories a day. After two or three weeks, Project Emporia pretty much had my taste nailed down. Every person on Emporia will see slightly different content after a few votes. A hundred thousand different users will get a hundred thousand different experiences. I think that's very powerful.
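To make the idea of votes training the system concrete, here is a minimal sketch of learning from “more like this” and “less like this” votes. It uses a simple online logistic regression over story keywords, which is a stand-in chosen for illustration and not the Matchbox technology mentioned in the interview; the class name, learning rate, and keyword features are all assumptions.

```python
import math


class VoteModel:
    """Toy per-user preference model updated one vote at a time (illustrative only)."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}  # keyword -> learned preference weight
        self.learning_rate = learning_rate

    def score(self, keywords):
        """Estimated probability that the user wants more stories like this one."""
        z = sum(self.weights.get(k, 0.0) for k in keywords)
        return 1.0 / (1.0 + math.exp(-z))

    def vote(self, keywords, more):
        """Nudge keyword weights toward the vote (True = more, False = less)."""
        target = 1.0 if more else 0.0
        error = target - self.score(keywords)
        for k in keywords:
            self.weights[k] = self.weights.get(k, 0.0) + self.learning_rate * error


model = VoteModel()
model.vote({"kinect", "ces"}, more=True)
model.vote({"celebrity", "gossip"}, more=False)

# Rank unseen stories by predicted interest, highest first.
candidates = {
    "Kinect sales update": {"kinect", "sales"},
    "Celebrity gossip roundup": {"celebrity", "gossip"},
}
ranked = sorted(candidates, key=lambda title: model.score(candidates[title]), reverse=True)
```

After just two votes, the sketch already ranks the Kinect story above the gossip story for this user, which mirrors the behavior Herbrich describes: a few votes are enough for each reader to start seeing different content.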