Appendix 5: Large-object Management in EOSDIS

A5.1 Introduction and Summary

According to present projections, EOSDIS will hold more than 1 PB of data by the turn of the century, and the data store will be growing by more than 1 TB/day. Satellite measurements by scanning/sounding systems and scientific products derived from these measurements will comprise the greatest part of this volume. This appendix discusses issues involved in handling these data automatically under the supervision of a database management system (DBMS). We call these data "large objects," following database nomenclature for nontraditional data types that are large in size or complex in structure or both. The usual NASA terminology for these data is "granules."

We consider the state of EOSDIS on January 1, 2001, assume that all products are being generated continuously, and define a large-object catalog that is split over a few tables of a DBMS. The catalog holds pointers to the data, some of which are assumed to be saved in a hierarchical storage manager (HSM), and others of which are stored on magnetic disk. The push system and the pull system access the catalog concurrently. The former adds rows (as new data come into being) and reads rows (to locate data for subsequent processing). The latter, in response to ad hoc queries, reads the catalog and provides data to the user.

For the sake of analysis, we assume that the DBMS is centralized at one superDAAC. This simplifies the analysis and design, and provides an optimistic basis for sizing, since network traffic is ignored. This is an appropriate assumption at this stage of the work, for we anticipate that the ad hoc queries of the pull system will stress predominantly the input/output channels and the DBMS transaction manager. Thus, issues of distributed data management and long-haul networks are excluded. CPU-related issues, such as creating browse images on the fly or performing other "lazy" evaluations, are also beyond the scope of this appendix.

The approach to large-object management described here is consistent with the DBMS-centric architecture of EOSDIS we advocate. This architecture accommodates, within a single framework, contemporaneous product generation, re-analysis of data with improved algorithms, and ad hoc user requests for data. The architecture assumes virtually all data are online, or nearly online (for instance, saved in a farm of tape robots controlled by a hierarchical storage manager), and that all applications access data by submitting SQL-* queries to the DBMS engine.

This study leads to the following conclusions:

• To catalog the image-related products of routine analysis requires appending on the order of 30,000 rows/day to the governing table. This table will have about 20M rows and fill 20 GB of disk by the end of the year 2000. The table size and append rate are within the capacity of current COTS DBMSes.

• The large-object store can be factored into 2 parts. The first part manages data associated with imaging instruments for data of levels 0 through 2. The second part manages imaging data of level 3 or more and all sounding data.

• The subsystem for imaging data (level 2 or lower) will require 1,000 times the storage space and 100 times the table space of the other subsystem.

A5.2 Methodology and Assumptions

The primary objective has been to provide quantitative estimates of how ad hoc queries of the Earth Science community will load EOSDIS. Despite the size and complexity of EOSDIS, the ECS scenarios show that a few representative queries occur repeatedly (see Appendix 4). Thus, a first approximation to the load on the DBMS can be found by analyzing a relatively small number of cases.

A5.2.1 Methodology

The work in this appendix complements and extends the work described in Appendix 4. There, a logical data model was constructed, and scenarios were phrased as queries against that logical model. Here we construct a physical model of the database and consider the effort required to satisfy queries of the logical model.

The rationale for this approach is our view that there is no essential difference between the push system and the pull system when the definition of the pull system is extended to embrace not just data selection but data analysis. Furthermore, our concept of eager/lazy evaluation of even the standard products extends the notion of pull processing arbitrarily close to the level 0 data at the head of the processing chain.

Definition of the Database

The database schema is assumed to consist of 2 abstract data types. Data of type "raster" hold the multi-band data recorded by imaging systems. Data of type "sounder" hold the data recorded by sounding systems. For reasons of efficiency and expected usage, we split the tables holding (or pointing to) the large objects into 2 categories: data of processing level 2 or lower and data of processing level 3 or higher.

The table definitions are not given explicitly; we assume each instance (row) is 1,000 bytes in size. The size of the object itself is not included in this count. The table definitions are assumed to hold enough "metadata" to characterize the object uniquely, find it in the storage system, and supply enough information to read and write it properly. One realization of this model would be to store in each object's table entry a pointer to a file on an (extended) NFS file system.
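
Although the table definitions are left unspecified, the following is a minimal sketch of what a 1,000-byte catalog row might carry; the field names and types are illustrative assumptions, not part of any ECS design.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CatalogRow:
    """Illustrative catalog entry for one large object (~1,000 bytes in all).

    Field names are hypothetical; the report deliberately leaves the schema open.
    """
    object_id: int        # unique key for the large object
    instrument: str       # e.g. "MODIS", "AIRS"
    product_level: str    # "0", "1A", "1B", "2", "3", ...
    data_type: str        # "raster" or "sounder"
    start_time: datetime  # temporal coverage of the stored chunk
    stop_time: datetime
    bounding_box: tuple   # (lon_min, lat_min, lon_max, lat_max)
    byte_count: int       # size of the stored chunk (at most 100 MB)
    storage_path: str     # pointer, e.g. a file name on an (extended) NFS server
    format_info: str      # enough information to read and write the object
```

A dozen or so attributes of this sort, plus foreign keys into subsidiary metadata tables, would plausibly account for the assumed 1,000 bytes per row.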

The size of each table and the cumulative size of the large objects to which they point are needed. We use the projections for the state of the system on January 1, 2001, which are detailed in Appendix 1.

Specification of the Physical Storage

A physical system for storing tables and large objects is conceptually defined. In this system, there are magnetic disks, magnetic tapes, and storage management software. There is also an indeterminate number of processors running 1 or more COTS DBMSes. To simplify the analysis, we assume the entire system is at a single location, but this does not vitiate the analysis for the reasons stated in the introduction.

Specification of the Load from Push Processing

Each element of the physical storage system is stressed by the continuous flux of incoming and processed data. This flux is also estimated from the expected situation on January 1, 2001.

A5.2.2 Assumptions

The following paragraphs summarize our assumptions. Some are based on the overall Project Sequoia 2000 architecture; others have been made only to simplify this particular study. Our assumptions are:

• Push processing and pull processing occur simultaneously and involve concurrent access to the DBMS.

• The database is centralized. Neither tables nor large objects are fragmented across the network.

• The load from reprocessing is ignored.

• Browse products are calculated on-the-fly.

• The load from catalog searches alone, when these do not result in substantial data transfer, is ignored. These queries will constitute a minor part of the storage and processing load.

• Large objects are stored in chunks no larger than 100 MB.

• Large objects are retrieved in their entirety.

A5.3 Database Model

The database model consists of online large-object catalogs, and online and nearline large-object storage systems. The catalogs are controlled by a COTS DBMS. The storage systems are controlled either by the DBMS or the file manager (NFS) of the operating system.

We assume the availability of an HSM coupled to the DBMS. The storage manager consists of a number of tape robots fronted by a large (on the order of 10 TB) disk cache. Conceptually, the DBMS always reads large objects from disk and writes large objects to disk. The HSM has software, working independently of the DBMS, that pushes stale data from disk to tape. In particular, standard products not actively in use are either moved onto tape automatically by the HSM or purged from the system by the eager/lazy component of the DBMS.
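
To make the HSM's role concrete, the sketch below shows a toy staleness-driven migration pass. The 14-day threshold, the use of POSIX access times, the file layout, and the tape interface are all invented for the example; nothing in this report specifies the real policy.

```python
import shutil
import time
from pathlib import Path

STALE_AFTER = 14 * 86_400   # assumed: migrate objects untouched for 14 days

def migrate_stale_objects(cache_dir: Path, tape_dir: Path) -> None:
    """Push least-recently-used large objects from the disk cache to tape.

    A toy stand-in for the HSM's migration daemon, run independently of
    the DBMS; real systems would use their own policies and tape drivers.
    """
    cached = sorted(cache_dir.glob("*.obj"), key=lambda p: p.stat().st_atime)
    now = time.time()
    for obj in cached:                           # oldest access times first
        if now - obj.stat().st_atime < STALE_AFTER:
            break                                # everything after this is still warm
        shutil.copy2(obj, tape_dir / obj.name)   # write the tape-resident copy
        obj.unlink()                             # free the disk-cache slot
```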

A schema for the large-object catalogs is not detailed explicitly but can be thought of as a catalog table for each of the principal abstract data types described in Appendix 4. These tables hold the minimum amount of information needed to find, retrieve, and read (or write) the large objects. A number of the attributes will be foreign keys to other tables of the schema that hold detailed information. It is reasonable to assume that these subsidiary tables are smaller in size than the catalog tables.

Our architecture assumes that users are provided no superfluous data and that the data are delivered over a wide-area network. On the basis of each query's where clause, appropriate projections, hyper-slices, smoothings, etc., are performed by the DBMS. Ideally, the specification of the relevant subsets is built into the data retrieval system, and the minimum amount of data needed to satisfy the query are drawn from the deep archive onto disk cache and passed from disk into physical memory. One way to establish a tiling policy is to formulate it as an optimization problem.
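
As a sketch of the chunk-selection step, the fragment below assumes each catalog row carries a geographic bounding box (as in the illustrative row layout above); only the chunks whose footprints intersect the query's where-clause region would be staged from the archive.

```python
from typing import Iterable, List, Tuple

Box = Tuple[float, float, float, float]   # (lon_min, lat_min, lon_max, lat_max)

def overlaps(a: Box, b: Box) -> bool:
    """True if two geographic bounding boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def chunks_to_fetch(query_box: Box,
                    catalog: Iterable[Tuple[str, Box]]) -> List[str]:
    """Return storage paths of only those chunks whose footprints intersect
    the query region; everything else stays in the deep archive.

    `catalog` stands for the (storage_path, bounding_box) pairs that the
    DBMS would supply from the large-object catalog.
    """
    return [path for path, box in catalog if overlaps(query_box, box)]
```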

A5.3.1 Bottom-up Description

The state of the large-object database on January 1, 2001 is described in Appendix 1. The numbers in that appendix summarize material on satellite launch dates, instrument systems, and data products provided by HAIS in May. We have organized these data somewhat differently to get a better view of the database and the rate of change at the beginning of 2001.

The overall quantity of level 0 through 2 products is given at the head of Table A1-1. Those figures and the HAIS figures for the combined amount of data at the Goddard, Langley, and EROS DAACs are reproduced in Table A5-1, below. Our working numbers are consistent with the HAIS estimates for the two large DAACs (GSFC and Langley) but are about half the total HAIS count. We don't know the cause of the discrepancy, but it is not important in the present discussion.

It is important to note, however, that there are sensors that will provide data to EOSDIS but are not in the table titled "Products - SDR Baseline." A few are shown in Appendix 1, but we do not have size estimates for most of them. Also missing from the count are GCM modeling and assimilation data.

Table A5-1: Database Size and Rate of Change on January 1, 2001

Metric                               Appendix 1    HAIS, May 8
Stored data (TB)                     1,500         2,750
Data production (GB/day)             2,300         ~4,000
Large objects (millions)             16.7          not given
Large-object production (per day)    30,000        not given

Data granules smaller than 100 MB in size are saved as distinct large objects with exactly the size projected by the ECS designers. Granules larger than 100 MB will be divided into multiple 100-MB large objects. This splitting has no effect on the quantity of stored data, but it does influence the number of rows in the catalog tables. There are approximately twice as many large objects, as defined above, as there are granules in the SDR-baseline projection.
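
The arithmetic behind the row counts is simple; the sketch below, using an invented granule size purely for illustration, shows how the number of catalog rows per granule follows from the 100-MB ceiling.

```python
import math

CHUNK_BYTES = 100 * 10**6   # 100-MB ceiling on a single large object

def chunk_count(granule_bytes: int) -> int:
    """Number of catalog rows (large objects) one granule produces."""
    return max(1, math.ceil(granule_bytes / CHUNK_BYTES))

# A hypothetical 230-MB granule becomes 3 large objects, while a 40-MB
# granule stays whole; over the SDR-baseline mix of granule sizes this
# splitting roughly doubles the object count relative to the granule count.
assert chunk_count(230 * 10**6) == 3
assert chunk_count(40 * 10**6) == 1
```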

It is informative to look more closely at the quantities of data according to product level and instrument type. The 2 pertinent types are imaging data and sounding data. The result of using this organization for the data described in Appendix 1 is shown in Tables A5-2 through A5-5.

Table A5-2 shows that there will be about 16M large objects from the imaging systems with product level lower than 3, and that the objects themselves occupy 1.5 PB of storage space. When instruments fly on multiple platforms (CERES and MODIS), the number of platforms is shown in parentheses, and the figures give the total quantity of data.

Table A5-3 shows the data flux on the fiducial date. The flux of large objects is exactly the rate at which rows are added to the catalog table. The quantity of data with product level lower than 3, 2.7 TB/day, is equivalent to 31.3 MB/s.

Table A5-2: Stored Data of Imaging Systems on January 1, 2001

Instrument   Variables   Data Levels
                         0        1A       1B       2        Σ(0-2)   3
ASTER        Mrows       0.70     0.70     1.47     1.47     4.34     x
             TB          81       87       179      63       410      x
CERES        Mrows       0.03     x        0.10     0.14     0.27     0.02
             TB          0.30     x        0.45     12.60    13.35    0.50
MIMR         Mrows       Nil
             TB          Nil
MISR         Mrows       0.41     0.54     1.27     0.16     2.38     0.03
             TB          37       53       127      18       235      2.70
MODIS        Mrows       0.68     1.13     4.76     3.01     9.58     0.03
             TB          61       114      473      242      890      3.06
SeaWiFS      Mrows       0.01     0.01     0.01     0.02     0.05     x
             TB          0.42     0.05     0.32     0.01     0.80     x
Total        Mrows       1.83     2.38     7.61     4.80     16.62    0.08
             TB          179.72   254      780      336      1549     6.26

Table A5-3: Write Activity for Image Data on January 1, 2001

Instrument   Variables   Data Levels
                         0       1A      1B       2       Σ(0-2)   3
ASTER        Krows/day   0.78    0.78    1.60     1.60    4.76     x
             GB/day      90      97      200      70      457      x
CERES (3)    Krows/day   0.03    x       0.01     0.27    0.31     0.05
             GB/day      0.40    x       1.00     28.00   29.40    1.00
MIMR         Krows/day   0.02    x       0.06     0.17    0.24     0.01
             GB/day      0.72    x       5.10     0.10    5.92     0.02
MISR         Krows/day   0.40    0.60    1.41     0.18    2.59     0.03
             GB/day      41      59      141      20      261      3.00
MODIS (2)    Krows/day   1.50    2.50    10.60    6.60    21.20    0.07
             GB/day      134     250     1040     532     1956     6.80
SeaWiFS      Krows/day   0.02    0.02    0.02     0.03    0.07     x
             GB/day      0.60    0.07    0.46     0.01    1.14     x
Total        Krows/day   2.74    3.90    13.70    8.84    29.17    0.16
             GB/day      267     406     1388     650     2710     10.82

Equivalent summaries for the data of the sounding systems are shown in the 2 tables below. These data are less copious because several of the systems will only just have been launched. It is also noteworthy that relatively few level 3 products have been defined for this type of data.

Table A5-4: Stored Data of Sounder Systems on January 1, 2001

Instrument   Variables   Data Levels
                         0       1A      1B      2       Σ(0-2)   3
AIRS         Krows       Nil
             GB          Nil
AMSU         Krows       Nil
             GB          Nil
LIS          Krows       20.00   x       20.00   20.00   60.00    1.00
             GB          80      x       80      80      240      180.00
MHS          Krows       Nil
             GB          Nil
MOPITT       Krows       10.00   x       ??      3.00    13.00    2.70
             GB          60      x       ??      90      150      20.00
SAGE         Krows       1.50    x       1.50    14.00   17.00    x
             GB          0.30    x       0.45    12.60   13.35    x
SWS          Krows       10.00   x       10.00   20.00   40.00    x
             GB          30      x       30      250     310      x
Total        Krows       41.50   x       31.50   57.00   130      3.70
             GB          170     x       110     433     713      200

Table A5-5: Write Activity for Sounder Data on January 1, 2001

Instrument   Variables   Data Levels
                         0       1A      1B      2       Σ(0-2)   3
AIRS         Krows/day   0.15    0.15    0.15    0.12    0.57     x
             GB/day      15.3    16      15      2       48       x
AMSU         Krows/day   0.015   0.015   0.015   x       0.05     x
             GB/day      0.04    0.03    0.03    x       0.10     x
LIS          Krows/day   0.02    x       0.02    0.02    0.05     x
             GB/day      0.07    x       0.07    0.07    0.21     x
MHS          Krows/day   0.02    0.02    0.02    x       0.05     x
             GB/day      0.05    0.04    0.04    x       0.13     x
MOPITT       Krows/day   0.02    x       0.001   0.003   0.02     0.003
             GB/day      0.07    x       0.10    0.10    0.27     0.02
SAGE         Krows/day   0.02    x       0.02    0.14    0.17     x
             GB/day      0.26    x       0.02    0.01    0.29     x
SWS          Krows/day   0.02    x       0.02    0.03    0.06     x
             GB/day      0.06    x       0.05    0.50    0.61     x
Total        Krows/day   0.24    0.18    0.23    0.30    0.95     0.00
             GB/day      15.85   16.07   15.31   2.68    49.91    0.02

A5.3.2 Summary

Imager data, levels 0-2, dominate all other data by 2 or more orders of magnitude. Of the 1.5 PB in the working model for this data category, level 0 amounts to about 10% of each metric (total data, total rows, data flux, and row flux). This, of course, is based on the current assumption that all the intermediate products are saved permanently (except for the 12 on-demand products already identified in the ASTER stream). It is this heavy weighting of the storage system by derived products that motivates the eager/lazy model of data processing discussed in the body of this report.

For imager data, the level 3 data products and data fluxes are typically 1% of the cumulative products and fluxes at the lower levels. However, this generalization does not hold for sounder data. First, the AIRS instrument (a 2,000-channel spectrometer that will just have been launched) will be sending data at an enormous rate. Second, only a few level 3 products have been identified for the sounder instruments. The combination of these 2 effects is that the stored level 3 data for the sounders are almost entirely LIS data, but the bulk of the product generation system will be devoted to the AIRS data.

A summary of the data model is presented in Table A5-6. The top half of the table has the 4 pertinent metrics for the image-type data. The second half of the table has the metrics for the sounder-type data. These results are discussed from the viewpoint of the push processing in the next section.

Table A5-6: Database Size and Rate of Change on January 1, 2001 for Imager Data and Sounder Data

Class                          Amount          Derivative
Imager data
  Stored data, level 0-2       1,549 TB        2.7 TB/day
  Database rows, level 0-2     16.6 million    0.030 million/day
  Stored data, level ≥3        6 TB            0.011 TB/day
  Database rows, level ≥3      80,000          160/day
Sounder data
  Stored data, level 0-2       713 GB          50 GB/day
  Database rows, level 0-2     130,000         1,000/day
  Stored data, level ≥3        200 GB          0.02 GB/day
  Database rows, level ≥3      4,000           negligible

A5.4 Large-object Management for Push Processing

Figure A5-1 shows our model (somewhat like an entity-relationship diagram) for capturing the essence of the year 2001 ECS database from the perspective of large-object management by the push system. The model is organized around 4 key tables, and these are shown as boxes in the central part of the figure. Each table has a row for each stored large object, and one attribute is a pointer to its location. This is represented by the dashed arrow. Raster_2 is the table for data of type "raster" (produced by the imaging systems) and processing level lower than or equal to 2. Raster_3 applies to raster data of processing level higher than or equal to 3. There is a similar pair of tables for sounder data. The number of rows in each table is taken from Table A5-6, and the total table size (in bytes) assumes each row is 1,000 bytes long.
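
As a check on the table sizes shown in the figure, the sketch below simply multiplies the row counts of Table A5-6 by the assumed 1,000-byte row width; the table names mirror the four boxes in the figure.

```python
ROW_BYTES = 1_000                 # assumed width of one catalog row

# Row counts taken from Table A5-6 (January 1, 2001)
rows = {
    "Raster_2":  16_600_000,      # imager data, levels 0-2
    "Raster_3":      80_000,      # imager data, level 3 and higher
    "Sounder_2":    130_000,      # sounder data, levels 0-2
    "Sounder_3":      4_000,      # sounder data, level 3 and higher
}

for table, count in rows.items():
    print(f"{table}: {count * ROW_BYTES / 1e9:.2f} GB of table space")
# Raster_2 dominates at roughly 17 GB; the other three catalogs together
# occupy well under 1 GB.
```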

Large objects are stored on magnetic disk when the cumulative data in the class is less than 10 TB. Otherwise data are saved in a hierarchical storage manager. In simple terms, the cost of a 10-TB disk farm (discussed in detail in Sections 4 and 5) is $10M now, but less than $200,000 in 6 years if the deflator is 0.5/year. Furthermore, a flux of 10 GB/day can be kept online perpetually by spending $200,000 per year at year-2000 prices.
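
The price projection is straightforward compound deflation; the short sketch below reproduces the arithmetic for the 10-TB disk farm.

```python
def deflated_cost(cost_now: float, deflator: float, years: int) -> float:
    """Projected price after `years` of annual deflation at the given rate."""
    return cost_now * deflator ** years

# A 10-TB disk farm at roughly $10M today falls to about $156,000 after
# 6 years of halving prices, comfortably under the $200,000 figure quoted.
print(deflated_cost(10e6, 0.5, 6))
```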

Figure A5-1 also shows the quantities of stored raster data and sounder data, the sizes of the tables that point to the data, and the growth rates of the 8 data types from push processing.

On the top and bottom of the figure, we show the data fluxes into each part of the model that are associated with push processing. These fluxes are taken from the totals of Tables A5-3 and A5-5.

Previously, we noted 2 anomalous data channels, AIRS and LIS. AIRS is a sounder but has such a high data rate that the level 0-2 products will swamp an affordable disk system. These data are shown flowing onto the hierarchical storage manager. Currently, there is no level 3 product defined for AIRS, so this low-level flood has no effect on the high-level store for sounder data. The stored data for that table, however, are dominated by LIS, a lightning mission that will have terminated. The fluxes for sounder data of level ≥3 are paltry, which follows from the fact that only MOPITT has a defined level 3 data product.

Figure A5-1: Snapshot of EOSDIS Data System

This simplified model illustrates a number of architectural issues:

• Online magnetic disk will be an effective and economical way to save all sounder data and level 3 and higher imager data. Data growth can be handled by adding disks to the system. We project that the price of disks will decrease faster than the data will grow, indicating the desirability of just-in-time purchasing (see Appendix 2).

• Level 3 data will be extremely popular with non-specialists. Having those data on disk will make it easy to service Internet requests (e.g., those via NCSA Mosaic clients). Separating the data catalog from product generation provides a kind of firewall with distinct advantages. Different security mechanisms can be instituted, and heavy access to level 3 data will not impede routine analysis of the lower-level products.

• Maintaining distinct but linked DBMSes and automating data migration among them is within the current state of the art.

• None of the large-object catalogs is too big for current DBMSes, either with respect to the row count or the total table size. Current DBMSes, given tables as large as the largest in the model, can append rows orders of magnitude faster than the maximum model rate, (1 row)/(3 seconds). They can maintain all indices in real time as well.

• The database model neglects the DBMS tables that would be associated with optimized eager/lazy evaluation. An optimizer needs to monitor usage of large objects, and the simplest way to do this is to log usage in additional tables. These logs would need at least 1 additional row for each large object, which might double the table space (a minimal sketch of such a log row follows this list).

• The model is drawn as if it were centralized, but we do not preclude distribution of the large objects, the affiliated tables, or both across network links, either uniquely or with duplication, as is discussed in the body of this report.

• The large bulk of processed raster data and the relatively small number of large objects indicate that eager/lazy storage optimization, as discussed in the body of the report, will not be hard to manage.
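
As referenced above, a minimal sketch of such a usage-log row is given here; the attributes are hypothetical and meant only to suggest what an eager/lazy optimizer would need to record.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UsageLogRow:
    """Hypothetical access-log entry, at least one per large object."""
    object_id: int         # foreign key into the large-object catalog
    last_read: datetime    # most recent retrieval
    read_count: int        # cumulative number of retrievals
    recompute_cost: float  # estimated cost to regenerate the product lazily
    storage_cost: float    # cost to keep the product materialized on disk/tape
```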

The fluxes of the model are based on the assumption that each large object is written once to the storage system and its pointer appended once to a DBMS table. In the transaction-oriented system we advocate, in which the DBMS is an integral part of the processing, most objects would be read as well, many more than once. This would entail "selects" from the catalog table and "reads" from the storage manager. To make the latter efficient, the disk fronting the hierarchical store should have capacity for several days' worth of data (on the order of 10 TB at the 2.7-TB/day flux of low-level imager data). These added loads are not included in the numbers displayed in the figure.

A5.5 Large-object Management for Pull Processing

Operations involving large objects will be the predominant load "pull" users will exert on an EOSDIS built to our architectural specifications. These operations will involve 4 principal activities:

• Executing SQL-* queries involving complex where clauses, including joins between different data types. This will exercise the DBMS engine.

• Fetching data from the storage system. This will exercise the storage manager and the input/output system.

• Processing data by EOSDIS functions invoked from the database. These include both algorithmic processing and data display. A pertinent example is the calculation and visualization of a browse image. These will exercise the central processors of the DBMS engine or other compute servers enslaved to it.

• Delivering data to a remote user. Whenever possible, this is to be accomplished by direct network links. This will exercise routers and network bandwidth.

The ad hoc nature of these user activities makes it fruitless to collect the kinds of data and construct the system simulator that are appropriate for the push processing. However, the written scenarios collected by the User Model Team and our work with them, described in Appendix 4, provide a good basis for making quantitative estimates of how representative users will stress the system.

We assume ad hoc users submit queries phrased in SQL-*, which has been enhanced with EOSDIS-specific extensions as described in Section 3. Appendix 4 contains many examples of HAIS scenarios re-phrased in this language. To estimate the loads of these users, we adopt a logical and physical model of the database to receive the queries, such as that described above. Finally, the response of the model to the queries is assessed by estimating a number of pertinent factors: the work demanded of the DBMS engine, the volume of data fetched from the storage system, the processing applied within the database, and the network capacity needed to deliver the results.
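
A rough way to turn one scenario into such load figures is sketched below; the structure and the unit costs (for instance, the 1-MB/s effective delivery rate) are assumptions chosen for illustration, not measured EOSDIS parameters.

```python
from dataclasses import dataclass

@dataclass
class QueryLoadEstimate:
    """Back-of-the-envelope load figures for one ad hoc query, mirroring
    the four activities listed above."""
    catalog_rows_scanned: int   # DBMS engine work
    bytes_fetched: int          # storage-manager and input/output work
    cpu_seconds: float          # in-database processing (e.g. a browse image)
    bytes_delivered: int        # wide-area network delivery

    def transfer_seconds(self, net_bytes_per_s: float = 1e6) -> float:
        # assumed 1 MB/s effective long-haul bandwidth, purely illustrative
        return self.bytes_delivered / net_bytes_per_s

# e.g. a query touching 5,000 catalog rows, pulling two 100-MB chunks,
# and delivering a 10-MB browse product:
estimate = QueryLoadEstimate(5_000, 200 * 10**6, 30.0, 10 * 10**6)
print(estimate.transfer_seconds())   # about 10 seconds at the assumed rate
```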

Some examples of this methodology are given in Section 2.