Accelerating the pace of environmental research
MODISAzure is a pipeline for downloading, processing, and reducing diverse satellite imagery. It uses Windows Azure to deliver the results of massive cloud computation to researchers’ desktops.
A mosaic of daily surface temperature data aggregated and reprojected from MODIS imagery using Windows Azure
For Dr. Youngryel Ryu, biogeoscience researcher at the University of California, Berkeley, the MODISAzure project dramatically changed the scale and the pace of the environmental research he can pursue.
“My initial intention was to study (evaporation and carbon fixing processes) for California. I am now using 10-year MODIS data for the U.S.,” Dr. Ryu said.
“The Azure system allowed me to ask more exciting and important questions and will help to answer them. It incredibly accelerated my work.”
Four-Stage Image Processing Pipeline
The MODISAzure project helped solve several obstacles to accessing the vast and varied remote sensing data from the MODIS satellites and other sources. The four-stage pipeline automates:
- Data upload from repositories such as the LPDAAC (Land Processes Distributed Active Archive Center) and Goddard Space Flight Center.
- Image reprojection and harmonization to common coordinates and resolutions.
- On-demand execution of two algorithms provided by the scientist. The first algorithm is commonly used to generate a science variable; the second algorithm produces graphs, tables, and maps used to analyze the science result.
- Delivery of final results to the scientist’s desktop.
The four stages of the MODISAzure pipeline
The MODISAzure Service is a Windows Azure web role that receives all user requests and sorts requests to appropriate job queues—for download, reprojection, or reduction.
The Service Monitor is a dedicated worker role that parses all job requests into tasks, which are recoverable units of work, and tracks the execution status of all jobs and tasks using Windows Azure tables for persistent storage.
MODISAzure architectural overview
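The routing and task-splitting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in-memory deques and a dictionary stand in for Azure queues and tables, and the names `Job`, `route_request`, and `parse_into_tasks` are hypothetical, not taken from the MODISAzure code.

```python
from collections import deque
from dataclasses import dataclass

# The three job types named in the article.
JOB_TYPES = ("download", "reprojection", "reduction")


@dataclass
class Job:
    job_id: str
    kind: str    # one of JOB_TYPES
    items: list  # hypothetical work items, e.g. one entry per day


def make_queues():
    """One in-memory queue per job type (stand-in for Azure queues)."""
    return {kind: deque() for kind in JOB_TYPES}


def route_request(queues, job):
    """Web-role step: place a request on the queue that matches its type."""
    if job.kind not in queues:
        raise ValueError(f"unknown job type: {job.kind}")
    queues[job.kind].append(job)


def parse_into_tasks(job, status_table):
    """Service Monitor step: split a job into independently recoverable tasks
    and record each one in a status table (stand-in for an Azure table)."""
    tasks = []
    for index, item in enumerate(job.items):
        task = {"task_id": f"{job.job_id}-{index}", "job_id": job.job_id,
                "kind": job.kind, "item": item}
        status_table[task["task_id"]] = "Queued"
        tasks.append(task)
    return tasks


queues = make_queues()
status_table = {}
route_request(queues, Job("job-1", "download", ["2002-001", "2002-002", "2002-003"]))
tasks = parse_into_tasks(queues["download"].popleft(), status_table)
print(len(tasks), status_table)
```

Keeping each task small and separately recorded is what makes it recoverable: a failed task can be rerun without repeating the rest of the job.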
Role of the Generic Worker
A specialized worker role, called the Generic Worker, performs all work in the pipeline. The Generic Worker dequeues tasks created by the Service Monitor from the appropriate Task Queue and creates a corresponding entry in the appropriate TaskStatus table. To execute the task, the worker loads any necessary libraries and the science executable and marshals data between Azure blob storage and files local to the Azure worker instance. Task status is tracked during execution in the TaskStatus table entry. The Generic Worker retries a failed task up to three times to mask transient errors in the Azure fabric, blob store, or VM configuration.
The Generic Worker tracks tasks in the pipeline
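A minimal sketch of the retry behavior described above follows. It assumes that "up to three retries" means one initial attempt plus at most three more, and it uses a plain dictionary in place of the TaskStatus table. The function name, status strings, and the choice to catch every exception are illustrative assumptions, not details of the actual Generic Worker.

```python
def run_with_retries(task, execute, status_table, max_retries=3):
    """Run execute(task); after a failure, retry up to max_retries times.

    status_table is a dict standing in for the TaskStatus table.
    Returns the result of execute, or None if every attempt failed.
    """
    entry = status_table.setdefault(task["task_id"], {"attempts": 0})
    for attempt in range(max_retries + 1):  # first try plus the retries
        entry["attempts"] = attempt + 1
        entry["status"] = "Running"
        try:
            result = execute(task)
        except Exception as exc:  # a real worker would catch only transient errors
            entry["last_error"] = str(exc)
            entry["status"] = "Retrying" if attempt < max_retries else "Failed"
            continue
        entry["status"] = "Succeeded"
        return result
    return None
```

Because the status entry is updated on every attempt, a monitor reading the table can tell a task that is still retrying from one that has exhausted its attempts.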
Storage for MODISAzure is separated by usage to simplify management policies. The four storage types and principles for managing them are:
- Original source image download: Can be deleted when all dependent reprojections are complete.
- Reduction results: A zip-file blob is created for each job to simplify the download. Older results are removed based on age.
- Reprojection results: May include the same target tile at different spatial resolutions.
- Metadata, including geospatial lookup, known application library binaries: Necessary for service function, this data is never directly accessed by scientist code.
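The first two management policies above can be expressed as simple predicates. The sketch below is hypothetical: the dependency map, the blob names, and the 30-day age limit are assumptions made for illustration, since the article does not state how dependencies are recorded or how old a result must be before removal.

```python
from datetime import datetime, timedelta


def deletable_sources(source_to_reprojections, completed):
    """Source files whose dependent reprojections have all completed.

    Sources with no recorded dependents are kept, as a conservative choice.
    """
    return sorted(
        source
        for source, dependents in source_to_reprojections.items()
        if dependents and all(d in completed for d in dependents)
    )


def expired_results(result_created_at, now, max_age=timedelta(days=30)):
    """Reduction result blobs older than max_age (30 days is an assumed value)."""
    return sorted(
        name for name, created in result_created_at.items()
        if now - created > max_age
    )


# Hypothetical blob names and timestamps.
dependencies = {"swath-a.hdf": ["tile-1", "tile-2"], "swath-b.hdf": ["tile-3"]}
print(deletable_sources(dependencies, completed={"tile-1", "tile-2"}))
print(expired_results(
    {"job-1.zip": datetime(2010, 1, 1), "job-2.zip": datetime(2010, 3, 1)},
    now=datetime(2010, 3, 15),
))
```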
MODISAzure in Action
In the example illustrated below, the user submits a request to download the needed source files. The user specifies the “Aqua” satellite, the “atmospheric aerosol” science variable, the “h08v05” sinusoidal tile covering California, and the calendar year 2002. The Service Monitor performs the appropriate lookups to convert “Aqua satellite” and “atmospheric aerosol” to the MODIS naming convention of MYD04_L2. It then uses the ScanTimeList table for a geospatial lookup that identifies each source MYD04_L2 swath tile covering the h08v05 sinusoidal tile for each day in 2002. For example, MYD04_L2.A2002185.2005.005.2007068182447.hdf is one of the source files for July 4, 2002. The Service Monitor then checks the current contents of the original source image download blob store to determine which tiles, if any, are already resident. Lastly, the Service Monitor schedules tasks to download the remaining requested MYD04_L2 source swath tiles from the Goddard DAAC FTP site.
A request to download required source files for analysis of a tile of MODIS data
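Two parts of this request can be made concrete: reading the acquisition date from a source file name, and working out which files still need to be downloaded. The sketch below assumes that the `AYYYYDDD` field holds the year and day of year and that the following field holds the time as `HHMM`. The lookup table holds only the one mapping given in the article, and all function names are hypothetical. The example file name is written with the MYD04_L2 product prefix.

```python
from datetime import datetime

# Only this mapping comes from the article; the table itself is a stand-in.
PRODUCT_NAMES = {("Aqua", "atmospheric aerosol"): "MYD04_L2"}


def acquisition_time(filename):
    """Parse the 'AYYYYDDD' and 'HHMM' fields of a MODIS-style file name."""
    fields = filename.split(".")
    for index, field in enumerate(fields[:-1]):
        if len(field) == 8 and field[0] == "A" and field[1:].isdigit():
            # %j is the day of the year, so 2002 + 185 resolves to a calendar date.
            return datetime.strptime(f"{field[1:]} {fields[index + 1]}", "%Y%j %H%M")
    raise ValueError(f"no acquisition date field in {filename!r}")


def files_to_download(required, resident):
    """Required source files that are not yet in the blob store."""
    return sorted(set(required) - set(resident))


name = "MYD04_L2.A2002185.2005.005.2007068182447.hdf"
print(PRODUCT_NAMES[("Aqua", "atmospheric aerosol")])
print(acquisition_time(name).date())  # 2002-07-04
```

Day 185 of 2002 is July 4, which matches the date given for this file in the text.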
Once the source files are downloaded, the reprojection service processes the request, illustrated in the figure below, and stores the data.
MODISAzure reprojection service
The user initiates another request to process the results of the source-file download and the reprojection data. The results of the reduction are stored and the user can download them.
MODISAzure reduction service
What’s Next For MODISAzure
Richer Geo-spatial Reductions
Researchers are integrating a separate user and data product handling capability to enable a new kind of reduction. By adding a precursor technology that creates and manages a mask layer image, scientists will be able to extract portions of the imagery that correspond to key features such as watershed, biome, plant species extent, or soil chemical classification. In other words, a scientist can use that mask image to select the “right” pixels from the source imagery to study specific locations or phenomena.
The first planned use of this innovation will be to compare Ryu’s evapotranspiration computation with ground-based sensor data from key watersheds across the United States. Other scientists have already identified other areas of inquiry that the new capabilities—which also simplify user job submission and generic worker integration—will help them pursue.
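The idea of using a mask image to select the “right” pixels can be illustrated with a small sketch. It assumes the mask is a grid of integer feature labels with the same shape as the source image. The function name, the label values, and the sample numbers are invented for illustration and do not describe the planned implementation.

```python
def select_masked_pixels(image, mask, feature_id):
    """Return the image values whose mask cell carries the given feature label."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        value
        for image_row, mask_row in zip(image, mask)
        for value, label in zip(image_row, mask_row)
        if label == feature_id
    ]


# Hypothetical 2 x 3 tile of values and a mask where label 1 marks one watershed.
image = [[290.0, 291.0, 300.0],
         [289.0, 292.0, 301.0]]
mask = [[1, 1, 0],
        [1, 2, 0]]

watershed_pixels = select_masked_pixels(image, mask, feature_id=1)
print(watershed_pixels)                               # [290.0, 291.0, 289.0]
print(sum(watershed_pixels) / len(watershed_pixels))  # 290.0
```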
Broadened Scientific Applications
Jie Li, the University of Virginia computer science researcher who built MODISAzure, said the pipeline may have many other scientific applications. Other earth imagery such as LIDAR, or even medical imagery such as PET scans, has data flows similar to those of the current application and may benefit from the MODISAzure framework.
MODISAzure is already one of the largest applications running on Windows Azure. For the 10-year U.S. continental scale water balance computation, MODISAzure manages:
- 5 TB data upload (600,000 files) from the NASA sites (six days for upload)
- 35,000 hours for reprojection
- 12,000 hours for derivation reduction
- 3,000 hours for analysis reduction
- 50 GB reduced results delivered to the desktop
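A few derived figures help put these numbers in perspective. The short calculation below uses only the values listed above and assumes decimal units (1 TB = 10^12 bytes); the averages are rough, since the article gives rounded totals.

```python
TB, GB, MB = 10**12, 10**9, 10**6  # decimal units (an assumption)

upload_bytes = 5 * TB
file_count = 600_000
upload_days = 6
hours = {"reprojection": 35_000, "derivation reduction": 12_000, "analysis reduction": 3_000}
reduced_bytes = 50 * GB

avg_file_mb = upload_bytes / file_count / MB                      # average source file size
avg_upload_mb_per_s = upload_bytes / (upload_days * 86_400) / MB  # sustained upload rate
total_hours = sum(hours.values())                                 # all three processing stages
reduction_ratio = upload_bytes / reduced_bytes                    # input size / delivered size

print(f"average file size:   {avg_file_mb:.2f} MB")            # 8.33 MB
print(f"average upload rate: {avg_upload_mb_per_s:.2f} MB/s")  # 9.65 MB/s
print(f"total processing:    {total_hours} hours")             # 50000 hours
print(f"data reduction:      {reduction_ratio:.0f}x")          # 100x
```

In other words, roughly 50,000 hours of processing reduce the source data by a factor of about 100 before it reaches the desktop.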
And the researchers are not done yet. Li, Ryu, and others on the project are beginning to scale up the computation—to include data from the worldwide FLUXNET towers in Europe, Canada, Asia, South America, and beyond.