Using Software to Enhance Healthcare
October 12, 2010 6:00 AM PT

Researchers at Johnson & Johnson Pharmaceutical Research and Development (J&J PRD) faced a challenge. Over the years, they had built a state-of-the-art platform to enable discovery of small-molecule drugs, but the expanding role of biologics in pharmaceutical research required a new set of tools to handle large-molecule compounds.

Developing such functionality from scratch was a daunting proposition. It would take time and resources while delaying development of novel treatments for debilitating diseases and disorders.

Researchers at Microsoft Research had a solution. Their new, open-source library of bioinformatics functions, the Microsoft Biology Foundation (MBF), part of the Microsoft Biology Initiative, was designed to address just such a challenge. When the J&J PRD researchers learned about this, they immediately became intrigued.

This confluence of need and opportunity occurred in late November 2009. Now, less than a year later, the benefit has become apparent. Instead of spending costly time building a foundation for the new biological infrastructure, J&J PRD was able to focus on delivering the value-added functionality needed to facilitate development of innovative treatments that have the potential to improve the health and quality of life of patients around the world.

“By using MBF, we were able to provide our users with a greater level of functionality in less time to our users for our initial development phase in the large-molecule space,” says Jeremy Kolpak, J&J PRD senior analyst, who will be discussing his team’s MBF deployment during the 2010 eScience Workshop, being held in Berkeley, Calif., from Oct. 11-13. “It allowed us to focus on value-added functionality for our scientists and has helped us adapt to new requests quite easily.”

Such testimony brings a smile to the face of Simon Mercer, director of Health and Wellbeing for External Research, a division of Microsoft Research.

Simon Mercer

“The principal advantage of MBF,” Mercer says, “is that, because it’s free and open-source, as a programmer, you get a certain amount of prewritten functionality that you can just build on top of. It gives you more time to do the real science, because we’ve already supplied the basics.”

It didn’t take long for J&J PRD to grasp the implications of MBF.

“We were in the process of developing our own infrastructure to work with sequences,” Kolpak explains. “This was part of a larger move in our organization to improve how R&D with large molecules was performed and integrate that process with an existing and mature framework for working with small molecules.

“We have been using MBF from the day we heard of it.”

That is precisely the focus of the Health and Wellbeing effort within External Research: to collaborate openly with the bioinformatics community by applying advanced computing technologies to provide unprecedented insight into disease and human healthcare.

MBF, built on the Microsoft .NET Framework and aimed at making it easier to implement biological applications on the Windows platform, was launched in Boston on July 9 during the 11th annual Bioinformatics Open Source Conference. Since then, thousands of bioinformaticians have downloaded the tool kit.

Microsoft Research Biology Extension for Excel
The Microsoft Research Biology Extension for Excel, displaying the contents of a FASTA file containing an Influenza A virus sequence.

“There are a lot of biologists who start as post-docs but don’t end up going into biological research themselves,” Mercer says. “They end up managing the data and writing the scientific applications that the biologists need to do research. They can be anywhere on the continuum between full biologists with no computing background to full computer scientists with little or no biological background.

“They work alongside the biological scientists, but they won’t necessarily be those scientists. They’ll write scripts and write programs to help the lab run, and they’ll also probably do some data analysis.”

Companies and academics that pursue such work, naturally, are more concerned with the value they can derive from using software tools than with building the tools themselves.

“I’ve heard it over and over again from executives of different pharmaceutical companies,” Mercer says. “Possibly 90 percent of their software stack has been developed in house but offers them no competitive advantage. The real crown jewels in bioinformatics are relatively small compared with the huge bulk of software they have to maintain.

“They’re often in a situation where they want to exchange data with other pharmaceutical companies on a pre-compete level, and they find that hard, because their processing pipelines are uniquely their own. A lot of commercial companies are looking for things like MBF to adopt as a common platform, so they are using the same tools, analyzing the data in the same way, and they are able to share data sets and cut costs.”

In other words, MBF helps make bioinformaticians’ work a bit simpler. That certainly appears to be the case at J&J PRD.

“We have integrated it into our data-analysis and -visualization platform, Third Dimension Explorer, which has been developed in house,” Kolpak says. “This platform is used in a multitude of different contexts.”

With regard to J&J PRD’s large-molecule exploration, he lists the ability to achieve five distinct tasks:

  • View sequences with their associated assay data to see how variations across compounds impact targets.
  • Align multiple sequences.
  • View aligned sequences and their associated metadata, such as complementarity-determining regions.
  • Extract and translate regions of sequences.
  • Work with sequences of different formats to provide a generic platform for scientists to import and analyze them in one place.
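
Two of the tasks above, reading a sequence file and extracting and translating a region, can be illustrated with a minimal sketch. It is written in Python for brevity and does not use MBF or reproduce its API; the function names `parse_fasta` and `translate_region` and the sample record are hypothetical, chosen only for illustration. The codon table is the standard genetic code.

```python
from typing import Iterator, Optional, Tuple

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def translate_region(dna: str, start: int = 0, end: Optional[int] = None) -> str:
    """Translate dna[start:end] codon by codon; '*' marks a stop codon.

    Trailing bases that do not fill a codon are ignored, and codons with
    unrecognized characters are reported as 'X'.
    """
    region = dna[start:end]
    usable = len(region) - len(region) % 3
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    # Hypothetical two-line record, used only to exercise the functions.
    sample = ">demo\nCCATGGCC\nAAATGGTAACC\n"
    for name, seq in parse_fasta(sample):
        print(name, translate_region(seq, 2, 17))  # prints: demo MAKW*
```

A library such as MBF supplies this kind of parsing and translation, along with alignment and support for further file formats, so application developers do not have to write and validate it themselves.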

Third Dimension Explorer with Johnson & Johnson Pharmaceutical Research and Development's sequence viewer extension
The Third Dimension Explorer sequence viewer extension enables users to view data in different forms and correlate it directly to the sequence where the data originated. The table contains the sequence data, and the top view shows the aligned sequences, color-coded by hydrophobicity. The views on the right are examples of visualizations of the assay data in the table.

“The goal,” Kolpak says, “is to capture operations that are performed routinely and make it extremely efficient to execute in one place. But at the same time, we are not trying to replace existing sequence-analysis tools for the more complex and less used operations.”

At Johnson & Johnson Pharmaceutical R&D, there are hundreds of users of the Third Dimension Explorer tool. The MBF-related development is still being completed and rolled out, but 40 people already are using the enhanced data-analysis platform—and deriving significant benefits.

“It’s hard to quantify the amount of time it has saved us,” Kolpak says, “due to the fact we work with an agile development methodology and, for each iteration, we are finding new functionality in MBF that we can utilize. I would say that, for our initial rollout, which required a large amount of framework implementation, it saved us around three months during a six-month initial development cycle.”

Biological work might not be the first thing that comes to mind when people think about Microsoft, but the company supports scientists in that field nevertheless.

Basic Local Alignment Search Tool query results displayed in the Microsoft Research Sequence Assembler
The Microsoft Research Sequence Assembler presents the results of a Basic Local Alignment Search Tool query for the current sequence in Silver Map, a visual control developed by the Queensland University of Technology.

“Inside Microsoft Research, we’ve done lots of biology,” Mercer says. “It’s not what everybody would expect, but a lot of researchers apply their computer-science research in the biological domain for healthcare. How can you apply Microsoft technologies to scientific research? We often do that through collaborations with academics, where the academic brings the biology, in this case, and Microsoft brings the computer science. Together, hopefully, we advance further than either side would have done independently.

“Eventually, you have to ask yourself the question, ‘Why don’t we just build a platform so that all of the common elements are written once and don’t need to be written again for every single project?’ And once that platform exists, and it’s open-source and free, why not give it away to the community so it can benefit?”

There are specific ways in which MBF can assist in the biological domain, such as with modularity, extensibility, and code maintenance.

“Those sorts of things that professional programmers think of aren’t necessarily the first things in the minds of those who are writing scripts to support a lab,” Mercer continues. “MBF sits in the middle, with prewritten functionality in nice, digestible chunks, very standardized.”

Quite a few other biological libraries akin to MBF are already in use, some of them for a decade or more. But over time, they have grown unwieldy and hard to extend, and they tend to be written in scripting languages that have no compile-time type checking. MBF, on the other hand, offers type checking and the guarantees it brings. It is built atop the .NET Common Language Runtime, so it can be used from any of the more than 70 languages that work with .NET, making it easy for a heterogeneous community to use without having to conform to a single language.

“We’ve also wrapped the individual bits of MBF as workflow activities for our Trident workflow workbench,” Mercer adds, “which is also free and downloadable. You don’t even have to be a programmer to use MBF. You can just drag and drop and connect the building blocks together to build workflow pipelines.”

External Research attempts to understand the precise scientific challenge encountered by its MBF partners, a methodology termed scenario-based development that identifies areas where MBF can be made more useful. That methodology will be a key component of the next wave of the tool’s enhancement.

DNA sequences displayed in the Microsoft Research Sequence Assembler
The Microsoft Research Sequence Assembler displays a series of short DNA sequences assembled by the Parallel De Novo Assembler algorithm into a contig.

“We’re approaching our partners in the academic community and the commercial world to define those scenarios,” Mercer says, “and that’s what’s driving the direction in MBF v2. We encourage the wider community—people who download the source code, understand it, and start developing their own extensions to support their own science—to participate, because the more of those we get, the more broadly we can develop MBF. It will grow by the actions of the community, to support the science that the community wants to support.”

In the case of J&J PRD, that is exactly what is happening.

“A lot of what is on our wish list we have been developing in stride,” Kolpak says, “mainly a visualization tool for viewing sequences, in addition to some other sequence file-format supports that contain more than just sequence data. These are all things we plan to contribute back to the MBF development.”

And the community at which MBF is focused expects to use open-source code.

“If we want to run a project that would be recognizable and familiar in form to the academic community,” Mercer says, “then that would be a software-development project that is open-source, because open-source is a very common model there. We want to get contributions from as broad a set of people as possible.

“We want scientists to get a value out of using Windows,” he concludes. “We want scientists to pick up different tools that we have and understand that they can help them do their research more effectively and reach insights more quickly than they would otherwise manage to do. We’ve got a lot of value to offer in that area.”

The folks at Johnson & Johnson Pharmaceutical Research and Development couldn’t agree more.

“I am a software developer by trade,” Kolpak says, “and by using MBF, I have the confidence that what I am providing our users is not just solid code, but also that the science behind it is accurate.”