NCBI BLAST on Windows Azure

Making bioinformatics data more accessible to researchers worldwide

Built on Windows Azure, NCBI BLAST on Windows Azure enables researchers to take advantage of the scalability of the Windows Azure platform to perform analysis of vast proteomics and genomic data in the cloud.

BLAST on Windows Azure is a cloud-based implementation of the Basic Local Alignment Search Tool (BLAST) of the National Center for Biotechnology Information (NCBI). BLAST is a suite of programs that is designed to search all available sequence databases for similarities between a protein or DNA query and known sequences. BLAST allows quick matching of near and distant sequence relationships, providing scores that allow the user to distinguish real matches from background hits with a high degree of statistical accuracy. Scientists frequently use such searches to gain insight into the function and biological importance of gene products.
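
For readers who want to try a search locally, the sketch below wraps a single BLAST+ protein search (blastp) in Python. It assumes the NCBI BLAST+ executables are installed and on the PATH; the file and database names are hypothetical placeholders.

    import subprocess

    def run_blastp(query_fasta, database, out_path, evalue=1e-5):
        """Run one protein-vs-protein BLAST search (blastp) on the local machine.

        Assumes the NCBI BLAST+ executables are on the PATH and that
        `database` has been formatted with makeblastdb.
        """
        cmd = [
            "blastp",
            "-query", query_fasta,   # FASTA file of query protein sequences
            "-db", database,         # formatted BLAST database, e.g. "nr"
            "-evalue", str(evalue),  # report only hits below this E-value
            "-outfmt", "6",          # tabular output: query, subject, %identity, ...
            "-out", out_path,        # where to write the results
        ]
        subprocess.run(cmd, check=True)

    # Hypothetical example: search one set of proteins against the nr database.
    # run_blastp("queries.fasta", "nr", "queries_vs_nr.tsv")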

BLAST on Windows Azure extends the power of the BLAST suite of programs by allowing researchers to rent processing time on the Windows Azure cloud platform. The availability of these programs over the cloud allows laboratories, or even individuals, to have large-scale computational resources at their disposal at a very low cost per run. For researchers who don’t have access to large computer resources, this greatly increases the options to analyze their data. They can now undertake more complex analyses or try different approaches that were simply not feasible before.

BLAST on Windows Azure in Action

One of the major challenges for many bioinformatics laboratories has been obtaining and maintaining the very costly computational infrastructure required for the analysis of vast proteomics and genomics data. NCBI BLAST on Windows Azure addresses that need.

Seattle Children’s Hospital: Solving a Six-Year Problem in One Week

At Seattle Children’s Hospital, researchers interested in protein interactions wanted to know more about the interrelationships of known protein sequences. Due to the sheer number of known proteins—nearly 10 million—this would have been a very difficult problem for even the most state-of-the-art computer to solve. When the researchers first approached the Microsoft Extreme Computing Group (XCG) to see if NCBI BLAST on Windows Azure could help solve this problem, initial estimates indicated that it would take a single computer more than six years to compute the results. But by leveraging the power of the cloud, they could cut the computing time substantially.

Spreading the computational workload across multiple data centers allowed BLAST on Windows Azure to vastly reduce the time needed to complete the analysis.

BLAST on Windows Azure enabled the researchers to split millions of protein sequences into groups and distribute them to data centers in multiple countries (spanning two continents) for analysis. By using the cloud, the researchers obtained results in about one week. This has been the largest research project to date run on Windows Azure.

Fueling Hydrogen Research

Scientists at the University of Washington’s Harwood Lab are working on a project to identify key drivers for producing hydrogen, a promising alternative fuel. The method they adopted characterizes a population of strains of the bacterium Rhodopseudomonas palustris and uses integrative genomics approaches to dissect the molecular networks of hydrogen production.

The process consists of a series of steps using BLAST to query 16 strains to sort out the genetic relationships among them, looking for homologs and orthologs.
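
The Harwood Lab’s exact pipeline is not spelled out here, but a common way to call orthologs from such pairwise runs is the reciprocal-best-hit criterion: two proteins are treated as orthologs if each is the other’s best BLAST hit. The sketch below assumes two tabular (-outfmt 6) BLAST result files, one for strain A searched against strain B and one for the reverse; the file names are hypothetical.

    def best_hits(blast_tsv):
        """Map each query to its best-scoring subject in a tabular BLAST file.

        Assumes NCBI tabular output (-outfmt 6), where the query ID is the
        first column, the subject ID is the second, and the bit score is last.
        """
        best = {}
        with open(blast_tsv) as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                query, subject, bitscore = fields[0], fields[1], float(fields[-1])
                if query not in best or bitscore > best[query][1]:
                    best[query] = (subject, bitscore)
        return {q: s for q, (s, _) in best.items()}

    def reciprocal_best_hits(a_vs_b_tsv, b_vs_a_tsv):
        """Return (protein_in_A, protein_in_B) pairs that are each other's best hit."""
        a_best = best_hits(a_vs_b_tsv)
        b_best = best_hits(b_vs_a_tsv)
        return [(a, b) for a, b in a_best.items() if b_best.get(b) == a]

    # Hypothetical usage with two all-vs-all runs between strains:
    # orthologs = reciprocal_best_hits("strainA_vs_strainB.tsv", "strainB_vs_strainA.tsv")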

Each step can be very computation intensive. Each of the 16 strains, for example, is computationally predicted to have approximately 5,000 proteins. A BLAST run can require three hours or more to analyze each strain. When Harwood Lab’s local resource was unable to handle the computation, the researchers submitted their request to a nationwide computer cluster, but the request was rejected after two days due to the long job-queuing time.

The researchers then contacted the XCG team to see if BLAST on Windows Azure could help them with this problem before their deadline—and it did. BLAST on Windows Azure significantly saved computing time:

  • The time for BLAST on Windows Azure to process the 5,000 sequences of one strain was reduced from three hours to less than 30 minutes.
  • The entire analysis, coordinated by the SEGA and XCG teams, was completed in three days.

The on-demand nature of BLAST on Windows Azure eliminates job-queuing time, which can be even longer than the computation time itself on the high-performance public computing resources that researchers often rely upon.

Scientists are using BLAST on Windows Azure to sort out genetic relationships among 16 strains of the bacterium Rhodopseudomonas palustris, looking for homologs and orthologs.

Implementing BLAST on Windows Azure

The implementation of NCBI BLAST on Windows Azure consists of two distinct stages. The first stage is a preparation stage, in which the environment for the BLAST executable is staged and sent to each cloud “worker”—or compute node. In the second stage, the actual BLAST runs are carried out in response to input from the user.

Two crucial items need to be made available to each cloud worker that will run a portion of the BLAST job. The first is the BLAST application. BLAST on Windows Azure uses the latest version of BLAST+ executables (BlastP, BlastN, and BlastX) that are made available by NCBI. These applications can be used without any modification. In addition, the user needs to have access to one or more databases against which the BLAST application will search for its results. These are available from several sources on the web, including NCBI.

BLAST executables are bundled as a resource inside a cloud service package. Once the user deploys the package on their Windows Azure account, the BLAST+ executables get deployed on each worker for local execution. NCBI databases (such as nr, alu, and human_genome) are downloaded from the NCBI FTP site to Azure Blob storage by using a database download task that is executed by any available worker.
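
The service performs this download with its own worker task; purely as an illustration, the sketch below mirrors one NCBI database archive into Blob storage using Python’s ftplib and the current azure-storage-blob SDK (which post-dates the original Windows Azure libraries). The connection string, container name, and archive name are placeholders.

    import ftplib
    from azure.storage.blob import BlobServiceClient  # modern SDK, shown for illustration

    def mirror_ncbi_db_to_blob(db_file, connection_string, container="blast-databases"):
        """Download one BLAST database archive from the NCBI FTP site and
        copy it into Azure Blob storage.

        `db_file` is an archive name such as "nr.00.tar.gz"; the connection
        string and container name are hypothetical placeholders.
        """
        # Fetch the archive from ftp.ncbi.nlm.nih.gov/blast/db/ to a local file.
        with ftplib.FTP("ftp.ncbi.nlm.nih.gov") as ftp, open(db_file, "wb") as local:
            ftp.login()                      # anonymous login
            ftp.cwd("blast/db")
            ftp.retrbinary(f"RETR {db_file}", local.write)

        # Upload the local copy to Blob storage so every worker can fetch it.
        service = BlobServiceClient.from_connection_string(connection_string)
        blob = service.get_blob_client(container=container, blob=db_file)
        with open(db_file, "rb") as data:
            blob.upload_blob(data, overwrite=True)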

A web role provides the user with an interface where they can initiate, monitor, and manage their BLAST jobs. The user can enter the location of the input file on their local machine for upload, specify the number of partitions into which they want to break down their job, and specify BLAST-specific parameters for their job.
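
The web role itself is not shown here; purely to make the submitted parameters concrete, the sketch below models a job request as a small data class. The field names are illustrative and do not reflect the service’s actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class BlastJobRequest:
        """Illustrative model of the parameters a user supplies when submitting a job."""
        input_file: str            # local path of the FASTA query file to upload
        partitions: int            # number of pieces to split the input into
        program: str = "blastp"    # which BLAST+ executable to run
        database: str = "nr"       # database the workers search against
        blast_args: dict = field(default_factory=dict)  # extra BLAST parameters, e.g. {"-evalue": "1e-5"}

    # Hypothetical example:
    # job = BlastJobRequest("queries.fasta", partitions=100, blast_args={"-evalue": "1e-5"})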

Once the user submits a job through the web role interface, a new split-task entry is created in an Azure table. All available workers look for tasks in this table. The first available worker retrieves the task and splits the sequences in the input file into the number of partitions specified by the user. For each segment of the input file, a new task is created in the task table, and the next available worker retrieves that task from the table.
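
To make the split step concrete, here is a stand-alone sketch that partitions a FASTA query file into a user-specified number of pieces, one per worker task; it is an illustration, not the service’s actual split-task code.

    def split_fasta(input_path, partitions, prefix="part"):
        """Split a FASTA file into `partitions` smaller files, one per worker task.

        Sequences are dealt out round-robin so the pieces are roughly equal in size.
        Returns the list of partition file names (e.g. part_000.fasta, ...).
        """
        # Read the file as a list of records, each starting with a ">" header line.
        records, current = [], []
        with open(input_path) as fh:
            for line in fh:
                if line.startswith(">") and current:
                    records.append("".join(current))
                    current = []
                current.append(line)
            if current:
                records.append("".join(current))

        # Write each record to one of the partition files in round-robin order.
        names = [f"{prefix}_{i:03d}.fasta" for i in range(partitions)]
        handles = [open(name, "w") for name in names]
        try:
            for i, record in enumerate(records):
                handles[i % partitions].write(record)
        finally:
            for handle in handles:
                handle.close()
        return names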

Once all of the tasks in the queue have been completed, a worker merges the output of each task into a single file. That file is placed in a blob in Windows Azure storage, and a URL to the result data is recorded in the job history for the user. The output can then be downloaded from the web role user interface.
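
A minimal sketch of the merge step, under the same assumptions as the download sketch above: concatenate the per-task result files and upload the combined file to Blob storage, returning the URL that would be recorded in the job history. The connection string and container name are hypothetical placeholders.

    from azure.storage.blob import BlobServiceClient  # modern SDK, shown for illustration

    def merge_and_upload(result_files, merged_path, connection_string, container="blast-results"):
        """Concatenate per-task BLAST output files and store the result as a single blob."""
        # Merge the partial outputs, in task order, into one local file.
        with open(merged_path, "w") as merged:
            for name in result_files:
                with open(name) as part:
                    merged.write(part.read())

        # Upload the merged file to Blob storage and return its URL.
        service = BlobServiceClient.from_connection_string(connection_string)
        blob = service.get_blob_client(container=container, blob=merged_path)
        with open(merged_path, "rb") as data:
            blob.upload_blob(data, overwrite=True)
        return blob.url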

Lessons Learned

The application of BLAST on Windows Azure to the large runs provided to us by the University of Washington and Seattle Children’s Hospital groups taught Windows Azure researchers many important lessons about how to structure large-scale research projects in the cloud. Most of what we learned is applicable not just to the BLAST case but to any parallel job run at scale in the cloud.

Design for failure: Large-scale data-set computation of this sort will nearly always result in some sort of failure. In the week-long run of the Children’s Hospital project, we saw a number of failures: failures of individual machines and entire data centers taken down for regular updates. In each of these cases, the Windows Azure framework provided us with messages about the failure and had mechanisms in place to make sure jobs were not lost.

Structure for speed: Structuring the individual tasks in an optimal way can significantly reduce the total run time of the computation. Researchers conducted several test runs before embarking on runs of the whole dataset to make sure that the input data was partitioned in a way that got the most use out of each worker node. For example, Windows Azure expects individual tasks to complete in less than two hours. If a task takes longer than two hours, Windows Azure assumes that it has failed and starts a new task doing the same work. If tasks are too short, you don't get the full benefit of running them in parallel.
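
As a rough worked example of that sizing decision: if roughly 5,000 sequences take about three hours on one worker, each sequence costs a little over two seconds, so targeting 30-minute tasks implies roughly 800 sequences per task. The sketch below captures this back-of-the-envelope calculation; all numbers are illustrative estimates taken from test runs.

    import math

    def choose_partitions(total_sequences, seconds_per_sequence, target_task_seconds=1800):
        """Pick a partition count so each task finishes well under Azure's two-hour window.

        All timings are rough, user-supplied estimates from small test runs.
        """
        sequences_per_task = max(1, int(target_task_seconds / seconds_per_sequence))
        return math.ceil(total_sequences / sequences_per_task)

    # Illustrative numbers: ~5,000 sequences at ~2.2 s each, targeting 30-minute tasks.
    # choose_partitions(5000, 2.2)  # -> 7 tasks of roughly 700 sequences each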

Scale for cost savings: If a few long-running jobs are processing alongside many shorter jobs, it is important to not have idle worker nodes continuing to run up costs once their piece of the job is done. Researchers learned to detect which computers were idle and shut them down to avoid unnecessary costs.
