Section 1: Our Vision for EOSDIS

Better data management is crucial to the success of scientific investigations of global change. Achieving the goals of NASA's Earth Observing System (EOS) will depend not only on the capabilities of its instruments, but also on how well its information system helps scientists integrate reliable, small- and large-scale data sets into models of geophysical and biological phenomena, and on how successfully the investigators are able to interact with each other.

Today, progress in the use of remote sensing for science is hampered by two problems:

The EOS Data and Information System (EOSDIS) addresses both problems. EOSDIS should enable new modes of research about the Earth, especially through synergistic interactions between observations and models. Further, it will be able to store, organize, and distribute massive amounts of diverse data and allow the Earth Science community to access, visualize, and analyze them. In general, it will allow a much wider population of scientists to study environmental change at regional and global scales via remote sensing.

To achieve these goals, EOSDIS must have a forward-looking, scalable architecture, and the purpose of this report is to present such an architecture. The remainder of this section presents the Earth Science community's requirements for EOSDIS and a summary of the software and hardware architecture of a system we have conceived that satisfies these requirements.

Section 2 contains a collection of user scenarios that support our contention in Section 1 that 80-90% of all accesses to EOSDIS will be ad hoc queries. Sections 3 and 4 contain, respectively, software and hardware details of our proposal. Lastly, Section 5 presents a cost model, which we use to forecast total EOSDIS software, hardware, and operations costs.

We begin in Section 1.1 by indicating what we expect EOSDIS to accomplish. Section 1.2 turns to the services required in an information system to meet the goals of Section 1.1. Section 1.3 asserts that the services must be provided with a focus on ad hoc usage rather than delivery of standard products. Section 1.4 addresses an important sizing question.

The second part of Section 1 describes our concept for an information system to satisfy the requirements presented in the first part. In particular, Section 1.5 indicates the assumptions that we have made and Section 1.6 discusses a conceptual model of EOSDIS that will prove useful in later sections. We follow, in sections 1.7 and 1.8, with summaries of our proposed hardware and software architecture. Section 1.9 discusses the major innovations that we have introduced into the design and acquisition of EOSDIS, and Section 1.10 identifies major risks to the proposal we have outlined.

1.1 The Purpose of EOS

Investigation of the causes and magnitudes of environmental change, especially on large regional and global scales, depends on excellence at tasks where the record of the scientific establishment is not encouraging – the integration of data sets and geophysical and biological products of established quality and reliability. To monitor, understand, and ultimately predict global change – both natural and human-induced – requires a close interaction among the following types of models:

The most important advance the world expects from the EOS program is the improvement in the latter two types of models. There are several disparate roles for these models, as Table 1-1 shows. Each makes special demands on the data system.

Table 1-1: Roles of Models

Researchers' goals                                 | Services needed
---------------------------------------------------|---------------------------------------------------
Encapsulate existing knowledge.                    | Understand relevant state variables. Process algorithms.
Analyze model sensitivity to uncertain parameters. | Provide plausible ranges for critical parameters. Compare comprehensive models with process models.
Assimilate observations.                           | Provide measurements of global state variables.
Validate model performance.                        | Provide records of natural variability, including past states and measurements of phenomena.
Monitor change.                                    | Provide sustained global measurements of selected state variables.
Analyze cause and effect.                          | Provide tools for changing external forcing or internal processes.
Predict the future and assess uncertainty.         | Provide tools to analyze model sensitivity.

1.2 Services Needed

The researchers' goals in Table 1-1 show the demand for an information system with diverse capabilities. EOS is a NASA flight project, and the instruments on the NASA platforms will produce a larger amount of data than scientists have ever seen. But these scientists must also access data streams provided by other agencies or colleagues. The services needed, then, represent the list of everything that is necessary to turn the bit stream from EOS instruments and other sources into useful information for studying global change. This is information processing on a grand scale.

Processing can be done on the following types of remote-sensing and model output data:

Scientists require an information system that will allow them to correlate these four types of information easily. Moreover, the user community has diverse capabilities and interests. Some EOSDIS users and knowledgeable data providers generate algorithms and code to compute scientific products from the EOS instruments. Some use raw data from multiple instruments to study geophysical phenomena that occur worldwide. Others use scientific products exclusively and may focus on small areas. Some approach the system knowing what data they want and how to find it; others use a less structured search-and-discover approach. Obviously, EOSDIS must accommodate a wide range of users.

We now turn to the manner in which the user community will interact with EOSDIS.

1.3 Ad Hoc Inquiry vs. Use of Standard Products

Figure 1-1, below, illustrates a simple, common data-processing scheme. The example pertains to data from the NOAA Advanced Very High-Resolution Radiometer (AVHRR), the satellite data most commonly used to study clouds, sea-surface temperature, vegetation, and snow. With the launch of the first EOS platform in 1998, MODIS data will replace AVHRR data for many applications, but similar processing methods are planned.

Studies of land-surface features almost always require cloud-free images, but only at weekly intervals or longer. A common method of eliminating cloud cover when mapping land-surface features, whose rate of change is slower than that of the atmosphere or ocean, is to composite images over time. In this procedure, the analyst co-registers images from the AVHRR afternoon overpasses for some interval, say 10 days. The composite is a single image for the interval, in which each pixel is the best pixel from all the overpasses at that location. The criterion for "best" can depend on many factors, including normalized-difference vegetation index (NDVI), temperature, and scan angle. Because clouds move, the day on which a pixel is least obscured usually yields the most cloud-free value. Once the composite image is formed, the data can be used to map surface features, e.g., snow cover, vegetation index, and land use.
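The compositing procedure can be sketched in a few lines of code. The sketch below is illustrative rather than AVHRR production code: the pixel records (here `(ndvi, scan_angle)` tuples) and the `score` function are assumptions standing in for the real per-pixel selection criteria.

```python
def composite(overpasses, score):
    """Build one composite image from a stack of co-registered overpasses.

    overpasses: list of images, each a list of rows of pixel records.
    score: maps a pixel record to a comparable value; at each location
           the highest-scoring pixel across all overpasses wins.
    """
    rows, cols = len(overpasses[0]), len(overpasses[0][0])
    return [
        [max((img[r][c] for img in overpasses), key=score) for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical pixel records: (ndvi, scan_angle). Clouds depress NDVI,
# so taking the maximum-NDVI pixel per location tends to reject clouds.
day1 = [[(0.12, 5), (0.55, 5)]]
day2 = [[(0.48, 20), (0.10, 20)]]
best = composite([day1, day2], score=lambda p: p[0])
# best[0] == [(0.48, 20), (0.55, 5)]
```

Swapping in a different `score`, e.g., one that also penalizes large scan angles, changes the compositing criterion without touching the loop structure.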

Figure 1-1: Processing Scheme for AVHRR Composites

Here are two examples that illustrate how an investigator requires control over processing of AVHRR data:

Both examples produce new scientific information and better algorithms for processing AVHRR data. Once these are accepted by the community, a careful investigator will not be satisfied with a standard product that uses less-than-optimal compositing or navigation methods. So there will be a compelling impetus to replace old algorithms with ones that produce better results.

We believe that the examples in this section are typical of the expected usage of EOSDIS. While it is possible to set up standard processing steps that implement Figure 1-1 sequentially, the range of possible loops makes it unlikely that such standard products would satisfy most users. Moreover, creating all possible science products wherever composite data are available would likely fill the archive with data that no one wants before the algorithms change. Hence this eager processing model for EOSDIS – pre-computing all products – will probably waste storage and computing cycles.

Instead, we argue that lazy processing, whereby a vegetation map of a region for a specific time-window is not created until a user requests one, will better satisfy users and better optimize storage, processing, and network traffic. A common feature of any large archive is that most of the demand is for a small fraction of the data, but it is difficult to forecast which small fraction of data will be most in demand. The lazy processing model responds to the demand.
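The lazy model is essentially memoization at archive scale: compute a product on first request, then serve subsequent requests from storage. A minimal sketch, using Python's standard `functools.lru_cache` as a stand-in for the archive's product store; `vegetation_map` and its arguments are hypothetical.

```python
import functools

calls = []  # records how many times real processing actually runs

@functools.lru_cache(maxsize=None)
def vegetation_map(region, window):
    calls.append((region, window))  # stands in for expensive compositing
    return f"NDVI map for {region} during {window}"

vegetation_map("Sahel", "1998-05")  # computed on first demand
vegetation_map("Sahel", "1998-05")  # served from the cache; no recomputation
# len(calls) == 1
```

Unlike eager pre-computation over the whole globe, only the regions and time windows users actually request are ever materialized.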

In summary, EOSDIS could be architected to:

We believe the standard-product model is appropriate only for a few well-understood products with widespread user communities and algorithms that change infrequently.

In contrast, the algorithms used to create most science products will change often, especially during the early years of the mission. Such products may well change before being requested by any user, making a compute-on-demand strategy desirable for generating the result of a user's ad hoc query.

We believe it is imperative that EOSDIS have flexibility to switch, on a product-by-product basis, between these two options. Moreover, we expect the vast majority of the products, probably 80-90%, will be in the second category. Section 2 discusses a collection of detailed user studies that support this assertion.

1.4 Sizing

The current model for EOSDIS is precise about the demands for creating standard scientific products and vague about access by users. We argue that forecasting usage of a new, innovative information system by examining historical usage of old, cumbersome systems is likely to lead to an underestimate of demand. When combined with good search tools and user interfaces and fast networks, electronic delivery of information becomes very appealing to a large population of scientists, educators, students, and commercial consumers. While the long-term exponential growth in the volume of Internet transfers and number of network nodes cannot continue forever, it shows no sign of slowing. Moreover, the rapid growth of Internet services and electronic provision of information about the Earth and space sciences seems to indicate a rapid increase in demand for services in the next half-decade. Specifically,

We believe that an innovative EOSDIS will generate similar demands. It is crucial that the architecture scale to meet them. In addition, in our hardware analysis, we propose to support substantial extra capacity to meet such unforeseen ad hoc queries.

1.5 Assumptions

In this section we indicate our major assumptions. Each is a significant driving force of our architecture, and we list them in order of their importance:

1.6 EOSDIS Conceptual Model

Figure 1-1 showed a specific scenario of Earth Science use of satellite imagery. Figure 1-2, below, conceptualizes and abstracts this processing model, and it will be used to present our system architecture. Here, we show a directed graph of processing nodes, interconnected by data flow edges. The left side of the figure shows the various EOSDIS data feeds as a single input data flow edge. This represents raw data input from the West Virginia processing site.

Instrument-specific processing is noted by the fan-out of the raw feed into several processing steps labeled P1, ..., Pk. Further processing takes place on most of the data streams; for example, P2 is followed by step Pn. P2 might remove noise, and Pn might align the data to some standard grid. In addition, some nodes may have multiple inputs. In this case, the processing step requires more than 1 data stream to perform the desired calculation. For example, Pm requires some non-EOS data set to perform its functions. Whenever a node has multiple inputs, synchronization is required, and processing cannot proceed until all data sets are available.
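The synchronization rule, namely that a node cannot fire until every one of its input streams is available, can be rendered as a toy dataflow evaluator. The node names follow the figure; the processing bodies are placeholder functions.

```python
# Each node lists its input edges and a placeholder processing function.
graph = {
    "P2": {"inputs": ["raw"],           "fn": lambda raw: raw + "|denoised"},
    "Pn": {"inputs": ["P2"],            "fn": lambda x: x + "|gridded"},
    "Pm": {"inputs": ["P2", "non_eos"], "fn": lambda x, y: f"merged({x},{y})"},
}

def run(graph, available):
    """Repeatedly fire every node whose inputs are all available."""
    done = dict(available)
    pending = set(graph)
    while pending:
        ready = [n for n in pending
                 if all(i in done for i in graph[n]["inputs"])]
        if not ready:
            break  # remaining nodes are blocked on missing inputs
        for n in ready:
            done[n] = graph[n]["fn"](*[done[i] for i in graph[n]["inputs"]])
            pending.remove(n)
    return done

out = run(graph, {"raw": "feed"})
# P2 and Pn run; Pm stays blocked because its non-EOS input has not arrived.
```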

The right side of the figure shows an example user, who requires 2 data sets. The first one is the output of Pp, which might be a standard data product. The second one requires user-specific processing, Pq, and the availability of another external data set. This illustrates the user requirement to be able to obtain both standard data products and the results of ad hoc queries that might subset standard products, correlate standard products with other data sets, or perform further user-specific processing. As previously stated, we believe that ad hoc queries will constitute at least 80-90% of the load on EOSDIS.

In our opinion, the raw feed must be stored reliably by EOSDIS. The output of every other processing step can, if necessary, be recreated from the feed data.

Figure 1-2 is a conceptualization of only a small part of EOSDIS processing: the full system will serve several hundred users and encompass perhaps one thousand or more processing steps. The results of a step should be stored – i.e., evaluated eagerly – only if doing so is deemed more efficient than lazy evaluation.

Figure 1-2: A Conceptual View of Data Flows in EOSDIS

1.7 Hardware Architecture

Figure 1-3, below, illustrates our hardware architecture to implement the conceptual model. We envision exactly 2 superDAACs, each a professionally operated computing system under a single roof with O(10^15) bytes of storage and appropriate processors. As discussed in Section 5, providing a number of superDAACs greater than 2 will drive up the total EOSDIS cost but provide no technical benefit whatsoever.

Figure 1-3: EOSDIS Hardware Architecture

Each superDAAC must store the entire raw data feed indicated in Figure 1-2. In this way, researchers can be assured of fail-safe data storage even in the event of a disaster (flood, earthquake, etc.). The 2 superDAACs have a high-speed network connection for rapid interchange of information. Besides storing the raw feed, each superDAAC is available to implement a portion of the processing in Figure 1-2. The scheduling of such work to the superDAACs is discussed in Section 1.8.

A superDAAC must be capable of reading the entire data archive in a reasonable time interval (1 month for the initial configuration, declining to 1 week by mid-way through the project).
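This requirement fixes the aggregate read bandwidth the storage system must sustain. A back-of-envelope calculation, assuming an archive of order 10^15 bytes:

```python
archive_bytes = 1e15                 # order of a superDAAC archive
month, week = 30 * 86400, 7 * 86400  # scan intervals in seconds

mb_per_s_month = archive_bytes / month / 1e6
mb_per_s_week = archive_bytes / week / 1e6
# roughly 386 MB/s sustained for a one-month scan,
# rising to about 1,650 MB/s for a one-week scan
```

Rates of this order are achievable only by reading many storage devices in parallel.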

The remainder of Figure 1-2 is executed on N peerDAACs. There are four reasons to distinguish peerDAACs from superDAACs:

Notice that we have not specified the value of N in our 2+N architecture. It is possible that N = 0, and all processing is at 2 very large superDAACs. It is plausible that N = 50, and some of the processing of Figure 1-2 occurs at the superDAACs and the remainder at the peerDAACs. It is also possible that N = 500, and the superDAACs are only storage sites, with virtually all processing occurring at the peerDAACs.

An essential element of our architecture is that it is easy to change the value of N over time. As technology develops over the next 20 years, it is likely that the desirable value of N will change; hence, it should not be hard coded.

In Section 4 of this report we indicate hardware designs for a superDAAC and a peerDAAC. These designs are used to drive our hardware cost models in Section 5 to produce a total cost for EOSDIS.

1.8 Software Architecture

Users will interact with EOSDIS through various client programs written in a variety of languages by themselves and others. These programs will run on the user's desktop or at one of the 2+N sites. Typically, users will interact with EOSDIS by specifying queries in SQL-*, which will be the first-class communication protocol among elements of EOSDIS. Other protocols (e.g., HTTP and Z39.50) may be supported through gateways, which map foreign protocols into SQL-*.

Clients will deal with EOSDIS as if it were a single computer system, while middleware will be responsible for accepting location-independent SQL-* queries from clients and executing them. The main tasks middleware will have to perform are job scheduling, optimizing the eager versus lazy processing strategy, deciding which of several copies of objects to use, and decomposing a location-independent query into a sequence of local queries to be processed by local servers. Middleware will also be responsible for obtaining parallelism in a distributed-processing environment.
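The decomposition task can be illustrated with a toy example: a catalog (entirely made up here) maps a fragmented table to the sites holding its pieces, and the middleware fans the query out to each fragment and merges the answers. The site names, stub servers, and `run_local` callback are all hypothetical.

```python
catalog = {"granules": ["superDAAC-1", "peerDAAC-7"]}  # fragment locations

def decompose(table, predicate):
    """Split a location-independent query into per-site subqueries."""
    return [(site, f"SELECT * FROM {table} WHERE {predicate}")
            for site in catalog[table]]

def execute(table, predicate, run_local):
    answers = []
    for site, subquery in decompose(table, predicate):
        answers.extend(run_local(site, subquery))  # could run in parallel
    return answers

# Stub local servers, each holding one fragment of the table.
local_data = {"superDAAC-1": [("g1",)], "peerDAAC-7": [("g2",)]}
rows = execute("granules", "lat > 40", lambda site, q: local_data[site])
# rows == [("g1",), ("g2",)]
```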

Each of the 2+N local DBMS servers will be required to process queries in the SQL-* dialect for data objects stored at its site and return the resulting answers. Interactions with data supplied by other agencies, such as NOAA, or with GCDIS or UserDIS, will take place through gateways, which will map SQL-* to whatever interface the foreign system accepts. Such gateways will also be useful to map between SQL-* and various vendors' specific implementations of enhanced SQLs.

Figure 1-4: EOSDIS Software Architecture

In the remainder of this section we discuss, in more detail, several points implicit in Figure 1-4.

EOSDIS should use COTS technology.

EOSDIS should be assembled primarily from Commercial-Off-The-Shelf (COTS) software and entirely from COTS hardware. The contractor should view itself as a system integrator of COTS products rather than as a code developer.

Specifically, the contractor should buy a COTS SQL-* DBMS. To achieve vendor independence, it should also buy at least 2 gateways between that DBMS and others. COTS hierarchical storage managers (HSMs) and SQL-* middleware systems that meet EOSDIS needs are unlikely to be available. As such, the contractor should invest in 2 HSM vendors and 2-3 middleware vendors to stimulate their COTS products to meet EOSDIS requirements. The contractor must build only class libraries and job-scheduling code. These decisions are factored into the cost model for EOSDIS, where we demonstrate that software development costs decline dramatically from current levels through this greater reliance on COTS solutions.

A high degree of automation is the best way to obtain maximum value from EOS data.

To make EOSDIS economical as well as functional, humans must not be in the loop except where absolutely necessary. For example, EOSDIS must deliver data to users without having humans handle tape. To make these data accessible, many (on the order of 100) tape transports must operate with a high degree of parallelism, and the physical data model must be tailored to this parallelism.

Investment in software tools for data management can reduce the number of staff needed to achieve a given standard in quality control, data administration, and EOSDIS help desk service. Automation will make the help desk much more effective and less costly. Modern computing companies now provide most user support through NCSA Mosaic pages and electronic mail.

Companies' experience with electronic mail shows that productivity per consultant increases 5-fold over voice contact. The key to automation (or at least facilitation) of these tasks (the first two of which are related to the processing steps described earlier) is an effective DBMS. We view a call to 800-HELP-EOS as a last resort.

 

Many of the processing steps in Figure 1-2 can be invoked by compute-on-demand control.

Processing at the nodes in Figure 1-2 should be made an optimization decision. The output of every node can be evaluated eagerly, i.e., computed as soon as its inputs become available. Alternatively, it can be evaluated lazily, i.e., only when its output is needed; in this case, processing is deferred until a user requests something that requires executing this processing step.

EOSDIS could be designed as a push system, in which data are pushed from feeds into standard products, which are stored. Alternatively, it could be viewed as a pull system, in which user queries are answered by performing only as much processing as required. We believe EOSDIS should be both a push (eager) system and a pull (lazy) system, as circumstances dictate.

The eager versus lazy optimization is complex. Response time will be much better with eager evaluation because the processing has been performed in advance. However, that processing is wasted if nobody asks for the result, and if the definition of the processing step changes, eager evaluation requires re-computation. An essential element of our architecture is that the eager versus lazy choice be changeable as conditions warrant. The details of this dynamic optimization are presented in Section 3.
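At its simplest, the choice reduces to a cost comparison over a product's stable lifetime, i.e., until the next algorithm change invalidates it. The function below is a deliberately crude sketch; the cost figures are placeholders, and the actual optimization is the subject of Section 3.

```python
def prefer_eager(expected_requests, compute_cost, storage_cost):
    """Eager pays off when the compute work saved across expected
    requests exceeds computing once plus storing the result until
    the algorithm changes."""
    lazy_total = expected_requests * compute_cost
    eager_total = compute_cost + storage_cost
    return eager_total < lazy_total

prefer_eager(100, compute_cost=10, storage_cost=50)  # True: popular product
prefer_eager(0, compute_cost=10, storage_cost=50)    # False: never requested
```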

 

Current COTS database management systems are an appropriate foundation for the EOSDIS database management system.

EOSDIS must use a specific dialect of SQL, which we call SQL-*. This dialect adds type extension, inheritance, complex objects, arrays, and user-defined functions to standard SQL-89. Section 3 discusses SQL-* in more detail.

Vendors will write gateways from SQL-* to their DBMS engines. EOSDIS should have gateways to at least 2 systems to achieve DBMS vendor independence.

Users should request data from the distributed EOSDIS system using SQL-*.

There will be only one data language used in this system, SQL-*, making it the lingua franca. This results in a simpler system with cleaner interfaces. A graphical user interface will save most users the need to express queries in SQL-* directly. We do not believe there is merit in inventing a special Earth Science query language, in the sense of devising a grammar and syntax unique to the EOSDIS data. Rather, the special needs of EOSDIS should be met by extending SQL.

Moreover, a user will query the distributed EOSDIS system in a location-independent manner by generating SQL-* queries. If a query is local to a specific site, the local DBMS can respond. If it spans more than one site, middleware will decompose it into sub-pieces that can be executed at specific sites, interspersed with requests to move data from one site to another. A key feature of this middleware is the ability to execute queries in parallel across the many sites for faster execution.

In addition, if there are multiple copies of data objects, middleware will need to decide which copy to use. One strategy is to choose the copy at the most lightly loaded site, which facilitates load balancing among the sites. Section 3 discusses these points in detail.
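As a sketch, replica choice reduces to picking the least-loaded site among those holding a copy; the site names and load figures below are illustrative.

```python
def choose_replica(sites_with_copy, load):
    """Route to the most lightly loaded site holding a replica."""
    return min(sites_with_copy, key=lambda site: load[site])

load = {"superDAAC-1": 0.92, "superDAAC-2": 0.40, "peerDAAC-3": 0.75}
choose_replica(["superDAAC-1", "superDAAC-2"], load)  # "superDAAC-2"
```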

EOSDIS must focus on database design.

Each site will store data using a local schema, which indicates what classes (tables) are present at that site, the columns in each table, and what data type each column is. Moreover, there will be a client schema that contains a location-independent view of the data in all the local schemas. It will include the class libraries for Earth Science data, including a data dictionary.

Schema design is considered difficult in relational environments. In EOSDIS, it will be harder than normal because of the much richer SQL-* environment and because of the multiple sites.

It is crucial, therefore, that EOSDIS make database design a focus of considerable energy. Getting it right will make downstream management of EOSDIS much easier.

Many of the processing steps in Figure 1-2 should be written as user-defined functions of the DBMS.

 

The goal of the software system should be to write as many of the processing steps as possible in the form of user-defined functions of the DBMS. This will bring applications within the scope of the database, simplifying the interfaces and extending SQL-* to embrace scientific data processing. This approach will also allow the knowledge of whether a processing step is eager or lazy as well as the decision whether to store or discard intermediate analysis products to be contained inside the DBMS. Moreover, the DBMS mechanisms of views and triggers will make it possible to move between eager and lazy evaluation. The result will be considerable software simplicity, as is discussed in detail in Section 3.
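To make the idea concrete, the sketch below registers a processing step as a user-defined function using SQLite's `create_function`, a modest stand-in for the richer SQL-* extension mechanism; the `ndvi` function and the raw channel counts are illustrative.

```python
import sqlite3

def ndvi(nir, red):
    """Normalized-difference vegetation index from two channel counts."""
    return (nir - red) / (nir + red)

conn = sqlite3.connect(":memory:")
conn.create_function("ndvi", 2, ndvi)  # expose the step inside the DBMS
conn.execute("CREATE TABLE pixels (nir INTEGER, red INTEGER)")
conn.execute("INSERT INTO pixels VALUES (6, 2), (5, 5)")

# The processing step now runs inside the query, not in client code.
rows = conn.execute("SELECT ndvi(nir, red) FROM pixels").fetchall()
# rows == [(0.5,), (0.0,)]
```

With the step inside the DBMS, the decisions of whether to store or discard its output and whether to run it eagerly or lazily stay under the database's control.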

DAACs should use an SQL-optimized protocol for communication.

All sites will communicate with each other by sending queries and receiving answers. As such, the inter-site protocol should be optimized for SQL. As discussed in Section 3, CORBA, Z39.50, and DCE are poor technical choices in an SQL environment and should be rejected. Rather, EOSDIS should use an SQL-oriented protocol.

1.9 Innovations

Our approach introduces a number of innovations to the design and acquisition of EOSDIS. These innovations -- either ideas we believe to be missing from the current design or existing design ideas we wish to reinforce -- include the following:

SQL-* is the principal development language. We are confident SQL will continue to evolve in the directions we have described and that such evolution, with some delay, will become codified into future standards.

A substantial amount of the algorithmic software will be attached to the database engine and invoked through queries written in SQL-*.

SQL-* middleware will support location transparency, multiple copies, and query optimization.

On-demand processing, which we call eager/lazy processing, is a cornerstone of the architecture, and we describe a method to implement it.

The superDAACs will have a built-in ability to re-process several petabytes of data within 1 month.

Science Computing Facilities (SCFs) have been promoted to DAAC status, and will be developed with the same hardware and software architecture.

Partnerships with industry can help accelerate the development of new hardware and software technologies. This view appears consistent with the goals of the Earth Observation Commercialization Application Program (EOCAP).

1.10 Mitigation of Risks

In many ways, EOSDIS represents the greatest challenge that designers of information systems have ever faced. Today, no one has an online database as large as 10 TB, yet EOSDIS will grow at more than one TB per day at the end of the century. Its end-to-end requirements for collecting, storing, managing, moving, presenting, and analyzing data push existing technologies to their limits and beyond. For that reason, it provides an extraordinary opportunity for developing and deploying state-of-the-art computer technologies.

Yet, the goal of EOSDIS is not the development of technology per se, but the use of technology to serve the evolving needs of a large, diverse community of scientists. A successful EOSDIS will use a judicious mix of available technologies with a few carefully chosen inventions. NASA can depend on many advances in the computing industry to solve today's difficult problems. However, there are some areas where the EOSDIS requirements are uncommon, and NASA and Hughes must mitigate risks by investing in companies or research groups.

There are five risks that our architecture for EOSDIS faces, which we discuss in this section.

 

The ad hoc query load may overwhelm EOSDIS.

 

The current NASA model is one of ordering standard data products. As users see the capabilities of the system, they will move from the traditional paradigm to one of submitting ad hoc queries. Such queries may be computationally demanding and come in greater volumes than we have estimated. The solution is for EOSDIS to increase N or increase the processing capability at the existing 2+N sites. However, this may not be economically possible.

Storage technology may be late.

Costs may be driven up by the late appearance of future technology. In our opinion, this is unlikely as technology forecasts have always been beaten in the past. Also, we are quite conservative, as will be noted in later sections. Hence this risk is small.

NASA will need to take an aggressive approach towards fostering the development of hierarchical storage managers and their marriage to DBMSes. This is still a research issue; it is neither commercial off-the-shelf technology, nor is it a problem in systems integration. Data must flow up and down the hierarchy freely, and this flow will be hard to manage. Several tape transports will be unavailable at any given moment, but this cannot be allowed to compromise data accessibility.

DBMS technology may be late.

Our architecture assumes the existence of COTS technology supporting SQL-*. This capability is available now from several start-up companies and is expected within the next 2-3 years from the larger vendors. EOSDIS will have a smaller or larger set of choices, depending on how fast the COTS vendors market enhanced systems. EOSDIS must also influence the direction of SQL-*.

In our opinion, there is low risk that fewer than 2 vendors will have workable solutions by late 1997.

COTS middleware may be late.

As stated earlier, NASA will have to foster the creation of appropriate COTS middleware. There is always the risk that the investment will not produce results. To reduce this risk, we suggested investing in 2-3 middleware efforts. The risk that all will fail is, in our opinion, much lower than the risk of the contractor failing to deliver working middleware if it undertakes the effort in-house.

Networks may be too slow.

Our architecture assumes that hardly any data are delivered via physical media. Everything moves over the network. If the state of networking in 1998-2000 is no better than today, costs will increase because data will have to be transferred to physical media and then shipped by mail or courier. Our view is that commercial applications with large markets (e.g., delivery of multimedia information) will ensure rapid improvement in network throughput and reliability. Hence this risk is small.

1.11 Conclusion

In general, NASA should engage in an aggressive research and prototyping program tied to EOSDIS objectives with a goal of reducing EOSDIS risks through technical innovation. The major areas where such innovation would be helpful are