Appendix 3: Project Sequoia 2000 Scenarios

A3.1 Overview

This appendix presents scenarios that illustrate important aspects of the way in which EOSDIS is likely to be used by the research community. The scenarios demonstrate how scientific imperatives are translated into requirements and specifications for the data system.

These scenarios are not fanciful "blue sky" conjectures; they are based on current research by Earth scientists participating in Project Sequoia 2000 at several University of California campuses. Therefore, they come directly from knowledgeable EOS customers who will be the end users of the data stream. Further, they depend only on current technology and on conservative estimates of technological evolution.

The scenarios have been chosen to be representative of several characteristic areas of research. We begin with the use of satellite measurements in the study of an important aspect of the global carbon cycle: the biomass burning source of atmospheric carbon dioxide. Remote sensing data (AVHRR today; MODIS in the future) can be employed to infer changes in the amount of carbon dioxide produced by fires in California. Together with fossil fuel combustion, biomass burning is thought to be a major source of the observed increase in atmospheric carbon dioxide, and its importance may be even greater in the future. This type of research is a key element to understanding the likely evolution of the chemical composition of the atmosphere and, hence, of climate change.

The second scenario demonstrates the use of satellite imagery in monitoring snow cover, an application that has many practical hydrological implications, including water supply estimation and flood forecasting, and which is also central to climate research. Changes in snow cover are key signals of short-term climate variability, closely related to phenomena such as the Indian monsoon and El Niño. They are also sensitive indicators of secular trends in climate and are expected to respond strongly to global warming. Snow cover is a critical element in several climate feedback processes involving changes in surface albedo, sea level, and other variables. Our scenario shows how the coverage of mountain snow packs can be estimated from satellite measurements (AVHRR and Landsat today; MODIS, ASTER, and Landsat in the future).

We conclude with a multi-part scenario illustrating a complex of research tasks involving atmospheric and oceanic general circulation models. These models are among the main tools used by theoreticians to understand and predict climate change.

Two recent trends in climate modeling have especially clear implications for EOSDIS system requirements. The first is a massive increase in the demand for data to validate the models. As the complexity and realism of climate simulations have increased, researchers have sought more and more observational evidence against which to compare the general circulation model (GCM) results: additional variables, finer spatial and temporal resolution, longer time series, more derived products, greater accuracy, etc. There is no doubt that this trend will continue and intensify.

The second trend is the continuous growth in the number of research groups that use GCMs to support their work. Thirty years ago there were about 3 GCM projects in the world. Today the number is closer to 40, depending on where one draws the boundary between GCMs and less comprehensive models. It is clear that the GCM population is growing quickly as codes become public and computer power becomes more accessible: today, research groups interested in GCMs can run freeware versions of major models on traditional supercomputers, on individual high-end workstations, or on clusters of workstations. Our GCM scenario illustrates how these trends will drive demands on EOSDIS.

A3.2 Monitoring Carbon Dioxide Changes Due to Dry Fuel Loading

Our first scenario originates in current research on the use of remote-sensing data to estimate changes in the quantity of carbon dioxide released by fires in California. An important goal of the research is to develop algorithms that are improvements over current methods of detecting changes in image pixels.

The data of the scenario are multispectral AVHRR images downloaded from the NOAA satellite 3 times a day (night, morning, and afternoon). For each image and each band, the data pertaining to California are extracted. They are mapped onto a standard coordinate system by tagging recognizable points (ground control points) in the image. A composite value, over a 2-week interval, of each pixel in each band is derived by algorithms under development. Then a single master-image for the interval is formed by applying a classification algorithm that acts on all the wavelength bands in which the pixel is viewed. Finally, another algorithm highlights pixel-level changes in the classified data.

A3.2.1 Interfaces with EOSDIS

Very similar algorithms will be applied to MODIS data, obtained from the EROS Data Center.

A3.2.2 Data Flow Model

The top-level of a data flow model of the scenario is shown in Figure A3-1. The principal components of the model are discussed below.

Data are captured in real-time from the AVHRR receiving system 3 times per day. The images are saved in proprietary format by the satellite receiver on its internal file system. The ultimate output of the scenario is a set of images highlighting land areas where change has occurred.

Figure A3-1: Data Flow Model of the Dry Fuel Loading Scenario

The value of a pixel in a single band of an image is denoted by P(x, y, l, t) or P(W, l, t), where x and y are image coordinates, W is the geographic position (latitude and longitude), l is the spectral wavelength, and t is time. In addition, let Wo be the observed location and Ws be the location on a standard grid. The following operations are then performed on the data (a sketch of the compositing, classification, and change-detection steps follows the list):

1. Extract AVHRR: The satellite receiving system software extracts pertinent data and writes them to an NFS file in a standard format.
2. Rectify image: The satellite cameras view the Earth at a slant angle, and standard cartographic methods are used to measure the distortion. Identifiable ground control points with known geographic position and elevation are tagged through visual examination of the image. The orbit/instrument model is thereby refined to define the x,y to Wo mapping.
3. Resample: The spectral intensities P(Wo, l, ti) are resampled to give P(Ws, l, ti). This operation involves smoothing and interpolation.
4. Smooth pixels over time: The P(Ws, l, ti) data are smoothed over a set of times (typically 1 day) using a nonlinear function based on the thermal band of the AVHRR imagery with constraints on the viewing angle off nadir. The purpose is to select the "best" or most informative P at each spot that is not obscured by clouds. The smoothing function may change from pixel to pixel. This operation produces a composite image, Pc(Ws, l, tj), where the time interval is typically 2 weeks.
5. Combine all spectral bands: Each pixel of the composite image is classified using a function that depends on the spectral intensity in each band. This produces the classified image PC(Ws, tj).
6. Detect changes: The set of classified images, made fortnightly, are intercompared to find pixels whose classification has changed.
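
A minimal sketch of steps 4 through 6, assuming the rectified and resampled band intensities are held in NumPy arrays. The maximum-brightness compositing rule, the two-band threshold classifier, and all array and function names are illustrative stand-ins for the algorithms under development, not the project's actual code.

    import numpy as np

    def composite(images, thermal_band, valid):
        """Step 4: pick, for each pixel, the observation judged 'best' over the
        compositing interval.

        images: (T, B, H, W) array -- T observations, B spectral bands
        valid:  (T, H, W) boolean array, False where a pixel is cloud-obscured
                or the off-nadir viewing angle is too large
        """
        score = np.where(valid, images[:, thermal_band], -np.inf)
        best_t = np.argmax(score, axis=0)   # (H, W) chosen time index per pixel
        # Pixels with no valid observation default to the first time step (good enough for a sketch).
        T, B, H, W = images.shape
        comp = np.empty((B, H, W), dtype=images.dtype)
        for b in range(B):
            comp[b] = np.take_along_axis(images[:, b], best_t[None], axis=0)[0]
        return comp                         # plays the role of Pc(Ws, l, tj)

    def classify(comp):
        """Step 5: toy per-pixel classifier combining bands -- here a simple
        normalized-difference ratio of two bands, thresholded into 2 classes."""
        red, nir = comp[0].astype(float), comp[1].astype(float)
        ndvi = (nir - red) / np.maximum(nir + red, 1e-6)
        return (ndvi > 0.3).astype(np.uint8)   # plays the role of PC(Ws, tj)

    def detect_changes(classified_prev, classified_curr):
        """Step 6: flag pixels whose class label changed between fortnights."""
        return classified_prev != classified_curr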
A3.2.3 Data Sets

    The following data sets are produced in this scenario:

    A3.2.4 Systems Architecture

    The major systems components of the scenario are:

    A3.2.5 Data

    An Illustra DBMS controls the processing steps in this real-time scenario. The following table estimates the data volumes on hand after 1 year of continuous operation.

    Table A3-1: Data Volume Estimates

Type                               Volume (MB)   Assumptions
AVHRR multispectral images         8,760         12 MB/image, 1 image/day; 2 x 365 x 12.
Classified and composite images    104           2 MB/image, 1 image/14 days; 2 x 26 x 2.
Warping functions/data             Nil
Digital Elevation Model            20            From USGS.
Auxiliary data                     <50
Total                              ~10,000
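
The image-volume entries above follow from the factors quoted in the Assumptions column; a quick check of the arithmetic:

    avhrr_mb     = 2 * 365 * 12   # table's stated factor: 2 x 365 x 12 = 8,760 MB
    composite_mb = 2 * 26 * 2     # 2 MB/image, 26 fortnights, factor of 2 = 104 MB
    total_mb = avhrr_mb + composite_mb + 20 + 50   # + DEM (20 MB) + auxiliary data (<50 MB)
    # avhrr_mb = 8760, composite_mb = 104, total_mb ~ 8,900 MB: "~10,000 MB" in round figures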

    A3.2.6 Software Architecture

    Software of the Operating Environment

    The scenario runs under AIX (version 3.2.5) and OSF/1.

    COTS and Public-domain Software

The satellite receiver software is Terascan. The Image Processing Workbench (IPW), version 1.0, is available by ftp from crseo.ucsb.edu. Arc/Info (version 7.0, Beta 2 and 3) is available from Environmental Systems Research Institute.

    Scenario Software

    This scenario relies on the following major modules:

    A3.2.7 Hardware Architecture

    The scenario can be implemented on two different computer systems on a local-area network:

    An additional 1 GB of magnetic disk storage is available on the network. Old AVHRR data can be stored off-line on an NFS-mountable Metrum tape storage device and catalogued for recovery.

    A3.3 Monitoring Snow Cover Using Multispectral Satellite Imagery

The snow-covered area of a drainage basin or a glacier has both hydrological and climatological significance. Snow measurements throughout the winter are needed as inputs to snowmelt runoff models, and snow measurements in the summer can help assess the glacier's mass balance. Long-term trends in snow coverage, such as an increase in the mean elevation of the snow line, may be related to climate change.

This scenario illustrates a method for estimating the coverage of mountain snow packs from Landsat Thematic Mapper (TM) multispectral imagery. The method is applicable now to measurements from AVHRR and Landsat, and it will be generalized during the EOS era to MODIS, ASTER, and Landsat data.

The scenario requires considerable human processing during the algorithm-development phase but little human intervention once the algorithms are developed and tested. Thus it is pertinent to automatic "product generation" within EOSDIS.

    A3.3.1 Interfaces with EOSDIS

The scenario creates a product that is being developed (and, hence, is not a "standard" product) at an EOSDIS Science Computing Facility (SCF) at UC Santa Barbara. In the immediate future, data will be acquired over a network. We assume a circuit of adequate bandwidth (T3) will exist between UC Santa Barbara and the JPL DAAC by the end of 1994. Through this access point, Landsat TM, ASTER, AVHRR, and MODIS data will be acquired from either the Climate DAAC (GSFC) or the Land Processes DAAC (EDC). Version 0 data transfer protocols will be utilized.
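
For scale, a T3 circuit carries roughly 45 Mbit/s, so moving a full Thematic Mapper scene (about 250 MB; see A3.3.6) takes on the order of a minute. A back-of-the-envelope check, ignoring protocol overhead:

    scene_mb = 250                           # one full TM scene (see A3.3.6)
    t3_mbit_per_s = 45                       # nominal DS3/T3 payload rate
    seconds = scene_mb * 8 / t3_mbit_per_s   # ~44 s per scene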

Some AVHRR data will also be acquired directly by a ground station at the SCF. After the launch of EOS-AM1, data from some MODIS spectral bands will be acquired through the same ground station. These will be useful for "real-time" analyses.

    A3.3.2 Data Flow Model

    The top-level of a data flow model of this scenario is shown in Figure A3-2. The principal components of the model are discussed below. The model does not show the loops and iterations that will occur for a real analysis problem.

    Figure A3-2: Data Flow Model of the Snow Cover Scenario

Remote-sensing data are acquired from an appropriate NASA DAAC. The end-product of the scenario is a set of maps delineating snow coverage at certain defined times over selected mountain ranges: Sierra Nevada, Rocky Mountains, Andes, Tien Shan, Alps, etc.

    A3.3.3 Processes

1. Download data: Landsat TM data will be transferred electronically and stored on tertiary storage, either at UCSB or elsewhere in the Project Sequoia 2000 network, i.e., at Berkeley or San Diego.
2. Estimate surface reflectance: This process converts every pixel value into an Earth-surface reflectance. The atmospheric path radiance is estimated by an iterative ratioing technique. This involves visual inspection of the intermediate products of the algorithm until convergence is detected. The path radiance thus derived is then subtracted from the scene, and the remainder is multiplied by a transmittance factor and scaled by the incoming solar irradiance. This process will be coded primarily in IPW.
3. Interactively catalog the spectral "signature" of the reflectance from known types of surfaces: A catalog is made by interactively analyzing small parts of representative scenes. Based on field knowledge, land cover maps, and local aerial photos (called here "true coverage maps"), the operator selects regions on the image that represent each of the 3 broad spectral classes (snow, vegetation, and bare rock). The reflectance spectra for each coverage class are saved for later processing. This process will be coded primarily in IPW.
4. Create fractional coverage images: A spectrally heterogeneous (with respect to the reference classes) sub-image of the scene is "unmixed" by comparing its spectrum with the reference spectra. This analysis creates a new image for each of the surface-coverage classes being investigated. These are called "fractional coverage images." The value of each pixel in each of the images is proportional to the portion of the pixel covered by the pertinent surface material. This process will be coded primarily in IPW; a sketch of the unmixing computation appears after this list.
5. Form decision tree: To avoid running the spectral unmixing model on the entire TM scene, we generate a decision tree based on the sub-images. The TM sub-scenes and their corresponding fractional coverage layers are fed to MATLAB, which generates an optimal decision tree mapping the sub-scene into the fractional coverage layers. This process will be coded primarily in MATLAB.
6. Apply decision tree: The decision tree is applied to the full TM scene, yielding 2 data products: a mask of maximum extent of the snow coverage and an image of per-pixel fractional coverage for each coverage class (snow, rock, and vegetation). This process will be coded primarily using IPW.
7. Create false-color image: A synthetic image is generated by assigning a primary color to each of the 3 coverage classes, then converting the fractional coverage to intensities of that color. This image is visually inspected for obvious artifacts (e.g., snow below a threshold elevation). This process will be coded primarily in IPW.
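
A minimal sketch of the unmixing in step 4, assuming the reference spectra cataloged in step 3 are available as rows of a small matrix. The least-squares formulation shown here is a generic linear-mixture solution, not the project's IPW implementation, and the array names are hypothetical.

    import numpy as np

    def unmix(pixels, endmembers):
        """Linear spectral unmixing of a sub-image.

        pixels:     (N, B) surface reflectances -- N pixels, B spectral bands
        endmembers: (C, B) reference spectra for the C classes (snow, vegetation, rock)

        Returns an (N, C) array of per-pixel fractional coverages.
        """
        # Model each pixel spectrum as a mixture of the reference spectra and
        # solve in the least-squares sense: endmembers.T @ fractions ~= pixel.
        fractions, *_ = np.linalg.lstsq(endmembers.T, pixels.T, rcond=None)
        fractions = np.clip(fractions.T, 0.0, None)                            # crude non-negativity
        fractions /= np.maximum(fractions.sum(axis=1, keepdims=True), 1e-9)    # sum to one
        return fractions

    # Reshaping each column of the result to the sub-image's rows and columns
    # gives one "fractional coverage image" per class.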
A3.3.4 Data Sets

    The following data sets are produced in this scenario:

    A3.3.5 Systems Architecture

    The architecture of the system executing this scenario is typical of NASA-supported SCFs at research universities. There is a local-area network of workstations using FDDI protocols, linked through gateways to the Internet and to the Project Sequoia 2000 private network. A link will be obtained to the JPL DAAC to support data transfers.

    There is a mixture of online, nearline, shelf, and remote storage systems. Both automatic and manual techniques are used to manage data storage.

    A variety of home-grown, commercial, and public-domain software is employed. There is no overall software framework that smoothly binds the software elements together. However, the main applications (IDL, MATLAB, and IPW) have primitive capabilities in interprocess communication. Only a few processes will be executing concurrently, so there is no need to automate process management.

    All processes in the scenario, in theory, could be accommodated on a single general-purpose workstation. However, in practice, the hardware architecture must accommodate multiple, heterogeneous hosts. For example, in the current implementation of the scenario, the IDL and MATLAB packages are bound to specific workstations by licensing constraints. The spectral unmixing step is constrained mainly by processor speed, while the decision tree evaluation requires online access to full TM scenes.

    Two steps in the scenario might benefit from parallel computing. The spectral unmixing algorithm, theoretically, could be evaluated in parallel for each pixel and, thus, is a good candidate for massive, fine-grained parallelism. The decision tree evaluation step could be accelerated similarly; it could be applied to multiple images simultaneously and, thus, could benefit from modest, coarse-grained parallelism.
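
A minimal sketch of the coarse-grained option, assuming per-scene decision-tree evaluation is already wrapped in a function. The use of a Python process pool, the placeholder function, and the file names are purely illustrative of farming scenes out to several processors, not part of the scenario's software.

    from multiprocessing import Pool

    def evaluate_scene(scene_path):
        """Placeholder: load one full TM scene, apply the trained decision tree,
        and write out its fractional-coverage layers and snow mask."""
        ...

    if __name__ == "__main__":
        scene_paths = ["tm_scene_%03d.dat" % i for i in range(20)]   # hypothetical inputs
        with Pool(processes=4) as pool:                              # one worker per processor
            pool.map(evaluate_scene, scene_paths)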

    A3.3.6 Data

    Data are stored as large objects in Illustra.

The primary data are the 6 reflective spectral bands of a full TM scene (6 bands, 6,000 lines, 7,000 samples = 250 MB). About 200 scenes worldwide (50 GB) will be acquired during a single snow season. A smaller amount of data will be generated by the scenario. Size estimates of the principal data stores are given in Table A3-2.
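
The 250-MB figure is consistent with the scene dimensions at the Thematic Mapper's 8 bits per sample; a quick check:

    bands, lines, samples = 6, 6000, 7000
    scene_mb = bands * lines * samples / 1e6   # 1 byte per 8-bit sample -> 252 MB, i.e. ~250 MB
    season_gb = 200 * scene_mb / 1e3           # ~200 scenes per snow season -> ~50 GB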

    Non-EOSDIS data requirements are limited to ancillary data, such as geologic and vegetation maps, that may be required to help a human interpreter identify homogeneous areas on a seasonal reference image.

    Table A3-2: Data Volume Estimates

Type                                       Volume (MB)   Assumptions
Landsat Thematic Mapper images             50,000        250 MB/image, 2 images per month for 5 months in 20 areas; 6 bands, 6,000 lines, 7,000 samples per image.
True coverage maps                         1,000         Digitized aerial photos at finer than 1-m resolution.
Reference reflectance spectra              Nil           6-vectors for a few thousand pixels.
Fractional surface coverage (sub-images)   1,000         Results of the spectral unmixing model.
Decision tree coefficients                 Nil           Weighting vectors for the few-thousand 6-vectors.
Fractional surface coverage (full scenes)  10,000        200 images at full Landsat resolution, 6,000 x 7,000.
Snow cover mask                            100           Highly compressible.
Total                                      ~62,000

    A3.3.7 Software Architecture

    Software of the Operating Environment

    The scenario will run under a UNIX O/S, currently OSF/1. The X Windows System will be used for graphical displays. Network File System (NFS) will be used to share data and programs across the network.

    Software development will be based on ANSI C compilers, with commercial or public-domain analysis packages. RCS will be the software configuration management system.

    COTS and Public-domain Software

    IDL and MATLAB will be used extensively. Most applications code will be written in the scripting language of these applications.

    Public-domain software for image analysis (IPW) and display (xv) will be used.

    Scenario Software

    Most of the software will consist of scripts written in IDL or MATLAB scripting languages. Some functions will be written in C and linked to IDL. IPW will be employed for spectral unmixing and decision tree evaluation. The Illustra database management system will control the processing and data flows.

    A3.3.8 Hardware Architecture

    The scenario is to be developed and run on both Sun Microsystems and DEC workstations. Machine-specific licenses are one reason that multiple hosts are required. A high-resolution color display is essential for the atmospheric correction step and to evaluate the final product.

    Approximately 100 GB of disk space will be required to process an entire snow season. This will be kept online, obviating the need for hierarchical data storage.

    A3.4 Validating Climate Models with EOSDIS Data

    The main element of this scenario illustrates how current data and future EOSDIS data can be used to validate a representative major climate model, the UCLA coupled GCM. This model consists of an atmospheric general circulation model (AGCM) coupled to an ocean general circulation model (OGCM). A secondary element of the scenario shows how the migration of GCMs to a workstation environment will magnify the demands on EOSDIS rapidly by multiplying the number of scientists and models in the research community.

A typical global climate model follows the evolution of tens of field variables defined on a spatial grid with a horizontal resolution of a few degrees of latitude and longitude and a vertical resolution of about 20 layers in the atmosphere and ocean. The state is saved about every 12 simulated hours, and a typical numerical experiment generates hundreds of megabytes of stored data. Such models are frequently run for different combinations of boundary conditions and parameterizations of physical processes. Each such run, or numerical integration, which ordinarily might span several simulated years, is called an "experiment."

Two principal methods are used to validate these experiments. In the more traditional and widely used method, observables are calculated from the model data. This may involve interpolating onto a different space-time grid or forming new combinations of model fields. In the second method, large-scale patterns are recognized in the model data and similar behaviors sought in measured data. This latter method, which has great potential but still requires considerable development, is called "feature extraction."

Observed data sets come from direct ship, aircraft, and spacecraft observations, as well as observations assimilated into the GCMs of other organizations. Thus, GCMs today are an essential part of the observing system, as well as theoretical tools. Assimilated data from the National Meteorological Center (NMC) and the European Centre for Medium-Range Weather Forecasts (ECMWF) are both widely used in current research. In the EOSDIS era, data from NASA's Data Assimilation Office (DAO) will also be used in similar fashion.

At UCLA, the method of validation by feature extraction has recently been enhanced by the development of an entire analysis system, QUEST. This system provides content-based access to large data sets by finding and indexing features in these data sets. All data and features are held in a DBMS, and they can be extracted and displayed using standard query languages. Currently, the UCLA GCM group uses QUEST to study cyclones in model and observational data but intends to extend the system to embrace other features of climate in the atmosphere and ocean.

    The validation of GCM simulations inevitably involves massive data sets. Increasingly, these data are installed in database management systems and retrieved with DBMS queries submitted from data-processing application programs. In the EOSDIS era, we expect to connect the offspring of our current software directly to EOSDIS database systems and to conduct this research interactively. Further, it is highly desirable that we be able to install some software in the DAAC to preprocess data before downloading to our local DBMS.

    The validation scenario described below illustrates the following points:

• Intimate connection between user software and EOSDIS software and data.
• Use of data provided by diverse sources.
• Importance of ad hoc queries.
• Use of EOSDIS as an interactive data server and data preprocessor.
• Importance of electronic connectivity for interactive analysis.

    A3.4.1 Interface with EOSDIS

The coupling of an "intelligent" feature extraction system to an advanced DBMS (Postgres) is not only feasible, but practical and desirable. In our scenario, the feature extraction system communicates with the DBMS via IP sockets, using Postquel as the DBMS command language. The feature extraction system is a useful tool for testing and improving EOSDIS capabilities for end-user (pull) data flow, because of the volume of data its inferences can require and the number and variety of patterns that can be inferred from both observational and model data.

The scenario consists of interfacing the feature extraction system to EOSDIS via an "intimate" virtual communications path (i.e., IP sockets) and using a DBMS command language (SQL) to extract, in near real-time, subsets or slices of multiple data sets to satisfy ad hoc queries generated by the feature extraction system inference engine. It is important that queries be run interactively, so that the investigator has the opportunity to find, explore, and analyze interesting phenomena.

The queries are run across multiple four-dimensional (latitude, longitude, vertical level, and time) data sets to extract spatial or temporal patterns whose indices can be used to infer related patterns and events. The rule set can be configured to analyze any type of data, allowing the system to process and test a heterogeneous fusion of different data sources, formats, and types. Data required by the rule set can be drawn from multiple DAACs through the "middleware" to satisfy the query.
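
A minimal sketch of the kind of ad hoc hyper-plane query the feature extraction system would push to the DBMS over its socket connection. The table and column names, host, and port are hypothetical, and a production client would speak the DBMS's own wire protocol (Postquel or SQL) rather than sending raw text; this only illustrates the shape of the exchange.

    import socket

    # Hypothetical 4-D data set: one row per (time, level, latitude, longitude) value.
    QUERY = """
    SELECT time, lat, lon, slp, u, v
    FROM   troposphere_ecmwf
    WHERE  level = 700                                -- 700 mb hyper-plane
      AND  time BETWEEN '1985-12-01' AND '1986-02-28'
      AND  lat  BETWEEN 20 AND 70                     -- Northern Hemisphere storm-track band
    ORDER BY time, lat, lon;
    """

    def send_query(host, port, query=QUERY):
        """Ship the query text to the DBMS over an IP socket and collect the reply."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(query.encode())
            sock.shutdown(socket.SHUT_WR)             # signal end of request
            chunks = []
            while True:
                block = sock.recv(1 << 16)
                if not block:
                    break
                chunks.append(block)
        return b"".join(chunks)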

    A3.4.2 Data Flow Model

Figure A3-3 shows a data flow model of the scenario. The scenario is performed through the workstation interface of the feature extraction system. This interface presents appropriate menus and selection boxes to describe the nature of the science problem. The feature extraction system acts on the user's directives through commands and data exchanged with other processes of both the model and the external agent, the EOS DAAC.

    Systems External to the Scenario

The scenario assumes that 1 or more DAACs are the primary repository of the data to be used for validation, but that the data in the DAAC are more copious than needed. It is assumed that the DAAC proffers an SQL interface to Process 2, which is used to submit queries pertaining to the availability and lineage of data. To limit both WAN bandwidth and the size of the local storage system, data will be preprocessed by UCLA software running at the DAAC (Process 2). Initially, these will be private codes that interact with the DAAC's DBMS through an SQL interface and a wide-band data channel. It is likely that the data preprocessing functions will be of wider applicability and that many of them will be embedded as DBMS extensions and made available for general use.

    Scenario Processes

    Process 1: This is the QUEST program. The scenario is orchestrated from the user interface of this process.

Process 2: UCLA software at the DAAC. These will be modules that perform preprocessing of chosen data. The functions might include coordinate transformations, re-gridding, and changes in units of measurement. Input data are provided by the DAAC in response to SQL queries. Output data (both field variables and allied metadata) are transmitted over the wire to the UCLA DBMS in an appropriate format. This process executes over a direct process-to-process link with the feature extraction process (Process 1).

    Figure A3-3: Model of the UCLA Scenario for Validating GCM Models

Process 3: This represents the SQL Database Management System at UCLA. This DBMS will be linked directly to the DAAC DBMS, although that flow is not indicated in the figure. The DBMS accepts and manages data passed from Process 2.

    Process 4: Visualization. In the current architecture, IDL is used for showing validation data and GCM model data. Visualization may be an integral part of a future DBMS, in which case this process becomes superfluous.

    Data Flows and Data Stores of the Scenario

The data flows are self-explanatory. SQL denotes queries submitted to a DBMS and the DBMS responses. Data and data-over-the-wire stand for multi-MB binary objects. In the first case, the data pass from the DAAC DBMS to the preprocessor over the DAAC internal network. In the second case, the data are transported over the wide-area network, formatted according to a DAAC protocol. Ftp is not an adequate protocol for this exchange.

    Data received at UCLA are saved in the local DBMS. We expect, at least in the early versions, that voluminous data will be saved as large objects on the file system (data stores D1 and D2), but that they will be accessed both through the DBMS SQL interface (e.g., the flow named fields) and directly by third-party applications (e.g., the un-named flows to the visualization process).

    A3.4.3 System Architecture

The scenario is based on an end-to-end vertical system using distributed processes. The data extraction and transformation functions are envisioned to reside in, or as close as possible to, the data repository, ideally running in the DAAC or a tightly coupled local compute engine. The feature extraction system and user interfaces will execute on workstations local to the user. The processes will communicate using Internet Protocol (IP) interprocess communications, and a portion of the code should be an integral part of the EOSDIS Information Management System and Database Management System.

    A3.4.4 Assumptions

    The researcher needs fast interactive access to the data with a response to the worst-case question posed to the feature extraction system within 1 week. The system must be reliable with average up-times of weeks, not days. Information navigation must be effective and not time-consuming. NCSA Mosaic or similar software needs to be available to locate and transfer the data of interest. The data extraction, transformation, communications, and DBMS command language should be as transparent to the user as is practicable.

    The DAAC must provide several critical services to support this scenario. The DAAC has to be highly interactive for efficient transaction request and response. Sufficient local DAAC cycle servers must be available to do the information navigation, system optimization, metadata extraction, data mining, and data transformations.

    Data staging should be optimized to reduce data downloading latency. This scenario also requires that data set chunks, once staged, remain accessible in secondary storage as long as is necessary for the efficient extraction of requested data.

    Network connectivity has to be reliable, fault-tolerant, and with high bandwidth. Data network transparency has to be ensured.

    A3.4.5 Data

Data from different sources need, in most cases, to be gridded or re-gridded to a common resolution. Vertical levels, units, and variables have to be made common among data sets. The processes that extract, transform, and re-grid the data should be as close to the data repository as possible. All data should reside in the DAAC, and metadata should be stored in the DAAC DBMS. Metadata should be extracted from the DBMS by direct SQL queries. Data should be extracted using internal DAAC processes to abstract external large objects. Data should be transferred across the network in a network-transparent format.
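
A minimal sketch of the kind of re-gridding Process 2 would apply before shipping data to UCLA: bilinear interpolation of one 2-D field from its native grid onto the AGCM's 4-degree by 5-degree grid. The grid shapes are taken from Tables A3-3 and A3-4, but the exact placement of grid points and the interpolation routine are illustrative assumptions, not the project's preprocessing code.

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    # Illustrative target grid: 72 longitudes x 44 latitudes for the 4 x 5 degree AGCM (Table A3-4).
    DST_LATS = np.linspace(-86.0, 86.0, 44)      # 4-degree spacing, offset from the poles (assumed layout)
    DST_LONS = np.arange(0.0, 360.0, 5.0)        # 5-degree spacing

    def regrid_to_model(field, src_lats, src_lons):
        """Bilinearly interpolate one 2-D (lat, lon) field onto the AGCM grid."""
        interp = RegularGridInterpolator((src_lats, src_lons), field,
                                         bounds_error=False, fill_value=None)
        lat2d, lon2d = np.meshgrid(DST_LATS, DST_LONS, indexing="ij")
        return interp(np.stack([lat2d, lon2d], axis=-1))   # shape (44, 72)

    # Example: an ECMWF tropospheric field on its 144 x 72 grid (Table A3-3).
    src_lats = np.linspace(-88.75, 88.75, 72)
    src_lons = np.arange(0.0, 360.0, 2.5)
    field = np.zeros((72, 144))                  # placeholder field values
    model_field = regrid_to_model(field, src_lats, src_lons)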

    Table A3-3, below, contains the sizes of representative observational data sets. The sizes of the data sets are what the National Center for Atmospheric Research currently stores on its Mass Storage Subsystem and, therefore, are usually in a compressed format. The field size is the number of bytes for a particular variable and level at the horizontal resolution of the data set and with a storage format of 8 bytes per value. Table A3-4 contains the sizes of selected data sets from UCLA AGCM runs.

    Table A3-3: Sizes of Representative Observational Data Sets

Data set                                   Resolution                       Data set size (MB/data_year)   Field size (MB)
Troposphere (ECMWF)                        144x72, 14 levels                296                            0.083
Stratosphere (NMC)                         65x65, 8 levels, 2 hemispheres   252                            0.068
Cloud (ISCCP)                              192x96                           2623                           0.147
Visible/IR satellite images (GOES)         6000x3000                        8846                           3.03
Precipitation                              7500 stations                    238                            0.06
Ocean temperature and currents (Levitus)   180x70                           401                            0.1
Sea-surface temperature (Oort)             144x72                           3.93                           0.083

    Table A3-4: Sizes of Selected AGCM Data Sets
Data set           Resolution          Data set size (MB/data_year)   Field size (MB)
Model 4x5, 15L     72x44, 15 levels    3821                           0.025
Model 2x2.5, 15L   144x89, 15 levels   15284                          0.103

Rule sets can be added to the Feature Extraction System (FES) to accommodate any ad hoc set of inferences. Table A3-5, below, provides information for selected inferences that use 20-year data sets with twice-daily observational or AGCM data. The DAAC needs to process approximately 36.5 GB of data to satisfy one calendar year's worth of twice-daily hyper-plane queries. This estimate assumes that the data will be broken into 100-MB tiles, that statistically ½ of a data set is read to find a requested hyper-plane, and that the amount of data required to do the indexing is small. The entries in the "DAAC data use" column are calculated as the amounts of data that must be processed by the DAAC to satisfy a 20-year inference for all required variables and are defined as:

    Data_use = 36.5 GB/year/variable * 20 years * Number_of_variables

The "Fields (MB)" column in the following tables refers to the size of the data requested by the feature extraction system at the horizontal resolution of the original data. "FES data input (MB)" refers to the size of the data ingested into the feature extraction system after the data have been re-gridded to the AGCM's horizontal grid (in this case, 4 degrees latitude by 5 degrees longitude).
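
As a worked example of the formula, Query 2 below uses three observational variables (SLP plus the two 700-mb wind components):

    variables = 3                          # SLP, 700 mb u, 700 mb v
    data_use_gb = 36.5 * 20 * variables    # = 2190 GB, vs. 2192 GB listed in Table A3-5
    # Query 3 adds SST (4 variables): 36.5 * 20 * 4 = 2920 GB, vs. 2922 GB in the table.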

    Table A3-5: Data Processed for Selected Inferences

Query 1: Search for cyclones and anticyclones.
                   Required variables     DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       SLP, 700 mb u,v        219.2                3635          1095
Model (4x5, 15L)   SLP, 700 mb u,v        219.1                1095          1095

Query 2: Explore relationships between cyclone tracks and anticyclones.
                   Required variables     DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       SLP, 700 mb u,v        2192                 7270          2190
Model (4x5, 15L)   SLP, 700 mb u,v        2191                 2190          2190

Query 3: Search for cyclones and anticyclones during El Niño.
                   Required variables     DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       SLP, SST, 700 mb u,v   2922                 4847          1460
Model (4x5, 15L)   SLP, SST, 700 mb u,v   2921                 1460          1460

Query 4: Explore relationships between precipitation anomalies and El Niño events.
                   Required variables     DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       PRECIP, SST            1461                 2088          730
Model (4x5, 15L)   PRECIP, SST            1460                 730           730

Query 5: Explore relationships between satellite-measured cloudiness anomalies and El Niño events.
                   Required variables     DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       Cloudiness, SST        1461                 2701          730
Model (4x5, 15L)   Cloudiness, SST        1460                 730           730

Query 6: Explore the motions of isopycnal balloons in relation to the stratospheric polar vortex.
                   Required variables             DAAC data use (GB)   Fields (MB)   FES data input (MB)
Observations       Balloon positions, 3-D u,v,T   2223                 65437         19710
Model (4x5, 15L)   Balloon positions, 3-D u,v,T   2198                 16425         16425

The data-use numbers listed above represent a scenario in which sequential access is used from the beginning of the data tile for each query (no indexing into, or retention of file pointers within, the 100-MB data tiles), and there is no caching of intermediate results. These numbers can be reduced significantly with intelligent database storage. There is a reduction of between 3 and 4 orders of magnitude in the amount of data transferred from one side of the field extraction process to the other. Gridding/re-gridding of data does not significantly reduce the amount of data transferred, except when going from a pixel map to the coarse model grid.

The length of time required to do an inference depends on the rule set used and the platform on which it is run. In the simple case of extracting cyclones, the feature extraction system can process 1 month of data in approximately 1 minute running on a Sun SPARC 10 workstation. This requires a sustained data flow of approximately 0.075 MB per second to keep up with the analysis.
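
This rate is consistent with the FES input volumes in Table A3-5, assuming the cyclone rule set consumes the re-gridded SLP and 700-mb wind fields; a quick check:

    fes_input_mb  = 1095                       # FES data input for the cyclone inference, 20 years (Table A3-5)
    mb_per_month  = fes_input_mb / (20 * 12)   # ~4.6 MB of re-gridded fields per month of data
    mb_per_second = mb_per_month / 60          # 1 month processed per minute -> ~0.076 MB/s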

Once features such as cyclone and anticyclone tracks have been extracted, a user can run the following example queries:

The total data requirements are summarized in Table A3-6, below.

    Table A3-6: Total Data Requirements

                                Observation data (GB)   Model data (4x5) (GB)   Model data (2x2.5) (GB)
Cyclone project                 20.5                    6.2                     24.8
Stratospheric balloon project   63.3                    16.4                    65.6
Total                           83.8                    22.6                    90.4

    A3.4.6 Software and Hardware Architecture

    We envision that the end user for the most part would never interact directly with the DAACs. Information exploration and data mining would be done through graphical user-interface programs. These programs would send the appropriate command sequences to the DBMS for resolution. QUEST, the feature extraction system, is an example of such a program. The user selects the data sets, inference, search constraints, and other operational parameters, then the feature extraction system issues the ad hoc queries to the DBMS for the data required by the inference engine.

    QUEST comprises a graphical user interface, a query manager, a visualization manager, and a database API. The query manager maintains a unified schema of the data stored in the distributed, heterogeneous repositories with a common query language (API) to the underlying repositories and uses a Field Model Execution Engine as the query engine. The visualization manager supports static plotting (2D and 3D graphics) of data, analysis of data, and animation of data sets. The visualization manager runs on top of IDL. The graphical user interface (GUI) provides the ability to query scientific data interactively and extract features in the information repository, and is built upon Tool Command Language (Tcl) and X11 Toolkit (Tk).

The data extraction, assimilation, gridding, and transformation routines will run as separate processes. Data queries to the DBMS will be made via a database query language. Only the snippets of data required for the current FES operation will be requested from the DBMS at any one time.

    This scenario runs under the UNIX operating system with the X Window System or PostScript for graphical display. Network File System (NFS) or IP sockets are used to share or transfer data across the network.

    UNIX scientific workstations, in conjunction with remotely mounted file systems, will be used for this scenario.

    A3.4.7 The Migration of GCMs to Workstations: Implications for EOSDIS

Until very recently, all GCM research was carried out on supercomputers. This requirement severely limited the number of GCM groups and the pace of GCM research. Now, however, the rapid improvement in the price/performance of workstations has dramatically altered the prospects for a large increase in the size of the GCM user community.

In 1993, the National Center for Atmospheric Research (NCAR) made its current atmospheric GCM available in CRAY source code form via anonymous ftp. This GCM, the Community Climate Model, Version 2 (CCM2), consists of about 50,000 lines of Fortran. Unlike many GCMs, this model is professionally programmed and extensively documented. In many respects, it is a typical modern GCM.

    At about the same time, NCAR ported CCM2 to IBM and Sun Microsystems workstations and made that version of the code public as well. Under Project Sequoia 2000, scientists at Scripps Institution of Oceanography (SIO), UC San Diego, ported the workstation version of the model to the DEC Alpha platform. The timings for the various versions of CCM2 are revealing:

    On the 8-processor CRAY Y-MP at NCAR, the model in standard resolution (truncated triangularly at 42 waves in the horizontal, with 18 levels in the vertical) runs in about 1.1 CPU hours per simulated year. This spectral resolution is equivalent to a horizontal global grid of 128 x 64 grid points.

    This version of the model on the DEC Alpha runs in about 200 hours per simulated year. Thus, although it is not practical to make many decadal-scale integrations on the Alpha, a dedicated workstation is adequate for research with shorter integration times. Put differently, although the workstation is about 200 times slower than the supercomputer, many scientists have access to a dedicated workstation, but few have 0.5% of a CRAY C90.

    NCAR has also run CCM2 successfully on a cluster of IBM RS/6000 workstations. As such workstation farms become more common, and as software for distributing large jobs on workstations becomes more mature, it is highly likely that GCMs will be run in this mode.

    At present, many GCM groups routinely run coarse-resolution models. Truncation at 15 waves rhomboidally, for example, allows long integrations to be made, and for many applications the loss of smaller-scale detail may be tolerable. This spectral resolution is equivalent to a horizontal global grid of 48 x 40 gridpoints. CCM2 has been run extensively at this resolution, and intercomparisons with higher-resolution versions have been published. So the trade-off between economy and realism is well understood.

    The 18-level, 15-wave rhomboidal version of CCM2 runs on the DEC Alpha workstations at SIO in about 7 minutes per simulated day, or 42 hours per simulated year. This speed allows significant research progress in a reasonable time. The SIO group is using this model for tests of alternative parameterizations of cloud-radiation processes.

    The UCLA GCM has been run on a cluster of 8 DEC Alpha workstations. The 4x5, 9-layer version of the model requires about 24 hours to run 1 year, so it is possible to run decades-long simulations to study, for example, the atmospheric variability associated with El Niño. Such a simulation would generate approximately 100 GB of output.

    These examples of using workstations for climate modeling research are likely to become much more common in the near future, as additional GCMs are made public and as workstations proliferate and improve in performance and price. Already, e-mail networks of CCM2 workstation research groups are in place. Thus, the number and variety of potential users of EOSDIS products for GCM validation are likely to increase rapidly. The scenario given above for the validation of a major climate modeling system may be the product of a single GCM group today. In the EOS time frame, however, it is likely to be representative of the demands of a substantial research community.