January 31, 2011 9:00 AM PT
Today’s world, says Sumit Basu, increasingly is driven by massive amounts of data.
“Take sociology,” says Basu, a researcher with the Knowledge Tools group at Microsoft Research Redmond. “In the past, most sociologists would use surveys and ethnographic studies as their primary research tools, but now, there are a growing number of sociologists working on topics such as social networks and gathering all their data from a computer. Same for fields like economics and finance—it’s all becoming very data-oriented. There is a need to process that data, mine that data, and run the sophisticated algorithms that make sense out of it all.”
At the same time, most experts in areas such as sociology, economics, or finance are not expert programmers and do not wish to use professional development tools such as Microsoft Visual Studio. They do, however, need to access libraries and data that were created for the professional development world. Helping to solve this problem is Sho, a programming tool that bridges the gap between the interactive prototyping environments many scientists and engineers employ and the statically compiled code developed with powerful tools such as Microsoft Visual Studio and Microsoft .NET. Sho was developed by a small team: senior research software-development engineers Erin Renshaw and Chuck Jacobs; Basu; and John Platt, research area manager.
Sho became available via download on Jan. 26. The team says that Sho is already widely used within Microsoft, both in Microsoft Research and in product groups, and believes that scientists, researchers, and others with complex data-analysis needs will find it useful.
They describe Sho (a name that refers to, among other things, an obscure letter in the Greek alphabet) as a programming environment in which scientists and programmers using different tools can connect their code seamlessly. Currently, many scientists and others who are experts in particular fields of study tend to use scripting languages, such as Matlab, R, or Python, for their research and prototyping. These languages and their interactive environments enable a user to develop algorithms, analyze data, visualize data, and more, all rapidly and without mastering high-end programming tools or carefully structuring and designing code.
“Programming is not their core competency,” Basu says. “Their focus is on machine learning or biology, and they use programming as a tool to help them do the science. They don’t come up with complex software designs.”
In many cases, these experts need to call on the help of software developers to extend their analytical capabilities. For instance, they might need to connect to a web-data source, run their code on a high-performance computing (HPC) cluster, or have their code run every night on incoming data. In the world of software development, common tools are built around statically compiled languages such as C# or C++. As a result, the researchers say, applications written for Matlab or Python often have to be rewritten in one of these compiled languages to connect with other code libraries or to run in a production environment.
“There are lots of problems with writing code twice,” Basu explains. “A person rewriting the code from one system to another might not fully understand the code and could take shortcuts or make mistakes that cause subtle errors.”
Sho helps avoid those problems by giving researchers and scientists a programming “sandbox” that can be shared with and understood by both developers and domain experts. Sho adds .NET libraries useful to researchers, such as linear algebra and data visualization, along with a user-friendly interactive console. It also makes it easier for researchers to leverage the complete Microsoft .NET stack, including Microsoft SQL Server, Microsoft SharePoint, Windows HPC Server, and Windows Azure.
“Sho gives greater reach to Python people,” Basu says, “and powerful math and visualization libraries to C# people.”
Or, as Platt puts it, “Now, people who want to do more technical computing using .NET can use our math and visualization libraries without having to incorporate their own.”
Platt and Basu say that Sho is a general tool that should find application in many computing fields. Platt, for instance, recently used Sho to help write a paper analyzing semantics across multiple languages. In this case, the goal was to find documents in different languages that have similar content and to use that as a tool to improve machine translation. Platt used Sho to help mine a large corpus of documents to learn a vector representation of the documents. Those with similar vectors have similar content, regardless of language.
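The idea that documents with similar vectors have similar content can be illustrated with a small sketch. The article does not say how Platt's vectors were learned or compared, so the code below is an assumption throughout: it uses cosine similarity, a common choice for comparing vectors, and the three-dimensional vectors and document names are invented for illustration. It is written in plain Python and does not use any Sho library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Return (doc_id, score) for the candidate vector closest to the query."""
    scored = ((doc_id, cosine_similarity(query, vec))
              for doc_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical document vectors; real ones would be learned from a corpus.
english_doc = [0.9, 0.1, 0.3]
spanish_docs = {
    "es_sports": [0.1, 0.8, 0.2],
    "es_finance": [0.8, 0.2, 0.4],
}

best_id, best_score = most_similar(english_doc, spanish_docs)
print(best_id, round(best_score, 3))
```

With these made-up numbers, the English document is matched to the Spanish document whose vector points in nearly the same direction, even though the two texts would share few or no words.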
In another case, Basu worked with Microsoft Office user-experience researcher Julie Guinn to help solve a problem common at Microsoft—and, no doubt, at other large corporations. The pair developed a prototype in Sho that helps give users working on group projects or across time zones a way to manage and organize the hundreds of sticky notes on which participants jot down ideas. At Microsoft, it’s not uncommon to walk into a conference room and find a whiteboard covered with sticky notes. Team members often use a process called “affinity diagramming” to organize the seemingly random bits of knowledge written on the notes into a more cohesive whole.
The end product of Basu and Guinn’s work with Sho, called Sticky Sorter, helps collaborators sort through and organize potentially hundreds of digital sticky notes to discover the connections between different ideas. Sho made it quick and easy for Basu and Guinn to prototype and iterate on a solution, complete with a graphical user interface, that they could deploy to actual users. The insights they gained from this prototype led to the final version of Sticky Sorter now available from Office Labs.
Basu sees Sho as having wide application in business.
“Let’s say that a company has tons of sales data and needs to crunch the numbers with sophisticated new algorithms for which ads are most useful for selling their products—really advanced data analytics,” he says. “Or you have tons of servers and want to do some predictions on which ones might have malware or are about to fail. Sho is a great environment for implementing the advanced algorithms to make these calculations, using easy-to-use .NET tools.”
Work on Sho began largely because of the researchers’ frustrations in trying to stitch together the worlds of script-based code, such as Matlab, and compiled code, such as C#.
“We really built Sho for ourselves,” Platt says. “We like .NET, and we work with C# and all the wonderful tools that .NET provides, but we also like the experience that we get with interactive scientific tools such as Matlab and R. Sho gave us the best of both worlds by tying the two together.”
Soon, coworkers in Microsoft Research heard about Sho and asked to use it, and through word of mouth, it spread across much of Microsoft.
“We now get emails out of the blue from people we’ve never met,” Basu says. “They say, ‘Oh, we heard about Sho from some other person and installed it and have been using it for months.’”
Among other examples, the Microsoft Biology Foundation uses Sho for high-performance statistics and math capabilities.
Sho’s creators say their work simply builds on what already has gone into tools such as IronPython, .NET, and Visual Studio.
“We’re really standing on the shoulders of giants,” Platt says. “Sho wraps the work already done on a lot of great Microsoft tools.”
The general release of Sho will connect to both Windows HPC Server 2008 R2 and Windows Azure. Sho can be used with server clusters or as part of a Windows Azure account, in which case Sho will “talk” directly to Azure.