CyberAll: A Personal Store for Everything
CyberAll is a project to archive all my personal and professional information content including that which has been computer generated (since the mid 70s), scanned and recognized, and recorded on VHS tapes. The archive includes books, correspondence (i.e. letters, memos, and email), transactions, papers, photos and photo albums, and video taped lectures. In 2000, only 10 gigabytes, costing $100 incrementally, are required, and the accumulation rate is projected to be 1-2 gigabytes per year. Encoding, indexing, and data-management costs swamp storage costs – by 1000:1 or more. The clear challenge is to automate the capture, search, and retrieval so that it comes close to the storage cost. It is inconceivable to think of manually managing or purging this electronic file since the storage costs are only $100. Indeed, copies are stored in 2 or 3 locations for redundancy.
Michael Lesk (Lesk, 1997) provides a comprehensive view of the problem of storing everything at a national or international scale, including the problems of encoding existing and evolving libraries of all types. In March 2000, Brewster Kahle’s non-profit organization, www.archive.org, archived the 1 billion web pages in 14 Terabytes. It is beginning to archive the output of 20 television channels.
In contrast, CyberAll is aimed at the personal scale. It is my sole store for all personal documents, photos, music, and videos as described by Bush over 50 years ago (Bush, 1947) and more recently by Gates (Gates, 1997).
CyberAll holds personal reference articles e.g. Amdahl’s Law, special computer manuals e.g. Digital PDP-1, CDC 6600[1], and magazines and clipped news articles e.g. Economist graphs that heretofore would be stored in files or on shelves. At present, only books remain in “atomic” form; but it will include all books as soon as they become e-books. Already, three books I authored have been scanned and are on my website (http://research.microsoft.com/~gbell).
Within the next decade personal computers will store a terabyte. In 2000, 40 gigabyte drives costing $400 are more than adequate to hold the content for most of a professional’s lifetime reading, presentations, and audio recordings. A CD encoded at 128 kilobits per second can be stored at a cost of $0.60. A typical user’s CD collection requires about the same space as the scanned and OCR’d versions of all his paper-based files.
The next phase of CyberAll will deal with voice capture of conversations, interviews, meetings, and presentations. Recording all the audio conversations in one’s personal and professional lives would require over a terabyte when encoded at 8 kilobits per second. Since a terabyte costs about $10,000 now and should be $1,000 in 5 years, recording conversations seems like a reasonable near-term goal. Clearly a ubiquitous, high-quality 360 degree camera/microphone that would attach to a personal computer would be a useful and welcome device.
Video is more challenging. For home use, a terabyte holds only 500 hours of DVD quality videos and 1500 CDs, but more compression increases the content by a factor of at least 10. Recording a lifetime of everything seen via video requires 100 terabytes. Doing this economically is still a decade or more away – now it would cost more than $10,000 per year. But in two decades, it should cost only $100 per year.
This paper presents the decisions, logistics, time, and costs to CyberAll my documents. Nearly all of the basic technologies for cyberization are improving at a rate approximating Moore’s Law: getting two times better every 18 months. There is extraordinary progress in all areas, ranging from processor speed, storage capacity, scanner speed and accuracy, camera resolution and software, OCR accuracy and capability (e.g. scan to HTML), audio and video encoding, printing and display, and standards. Thus, one can always rationalize waiting for a better system or standard – things will be SO much better in 18 months. However, the cost of content capture is increasing also – so it is important to start now, especially with the compelling economics.
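The "wait or start now" trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, assuming prices halve exactly every 18 months (the Moore's Law approximation above); the function name `projected_price` and its parameters are illustrative and not part of any tool used in this project.

```python
def projected_price(price_now: float, years: float, halving_months: float = 18.0) -> float:
    """Price after `years`, assuming it halves every `halving_months` months."""
    halvings = years * 12.0 / halving_months
    return price_now / (2.0 ** halvings)


if __name__ == "__main__":
    # A terabyte at roughly $10,000 in 2000, projected five years out.
    print(f"${projected_price(10_000, 5):,.0f} per terabyte after 5 years")
```

Under this assumption a $10,000 terabyte falls to a little under $1,000 in five years, which is consistent with the five-year storage price quoted earlier.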
CyberAll raises questions about:
Longevity and Long-Term Retrievability – Paper and film can have centuries of lifetimes (although most of our 50+ year old film and photos show fading), while current digitized formats are almost certain to be un-readable in 10, 20, or 50 years based on media, platform/file, and applications obsolescence. So, digital content requires frequent conversion to new media and often to new formats (because the old formats are no longer supported). Historically, these format conversions have been lossy. ASCII is the only format that has stood the test of time, but it carries no semantics or application behavior. Automatic and failsafe backup is critical. CyberAll requires that digital documents never be lost and are forever preserved.
Access and Access Control – Access to personal information must be easily controlled by the owner. Privacy suggests that, by default, others should not have access to the content. However, those of us with public web sites need a simpler way to map information in our CyberAll onto a variety of increasingly public sites, rather than having to maintain an array of separate sites. In essence, the more public sites are cached slaves of CyberAll.
Databases And Retrieval Tools For Non-Textual Information – Handling photos, photo albums, conversations, audio, and video is a fertile, new product area. Current products have a long way to go to satisfy the very wide range of CyberAll users.
Usability – Building and using CyberAll today is tedious and requires technical skill. Just setting up CyberAll is a major problem. New products, standards, and services are needed to make using it a painless process so that everyone in a family could easily store items that would be forever retained. Storing items needs to be as easy as discarding them… in fact, storage is just one step away from the recycling bin.
The motivation for CyberAll ranges from the technical challenges (i.e., “because we can” or will soon be able to) to a desire to provide an archive for our progeny. High on the list is simply coping with the exponential increase in the amount of information (e.g. web pages, pictures, audio, and video) that is becoming part of our personal and professional lives. Given the tools to easily produce documents en masse, we are well on our way to converting ourselves into a world of filing clerks! This cycle has to stop.
CyberAll is consistent with or parallel to Nathan’s Laws of Software: (1.) Software is a gas that expands to fill the container it is in. (2.) Software grows until it is limited by Moore’s Law. (3.) Software makes Moore’s Law possible. (4.) Software is only limited by human ambition and expectation. One could replace the word “software” with the word “data” and get Nathan’s four Laws of Data.
Many share my “pack rat” mentality that wants to store everything in case we need it to remind us, or in case we need to remind others. This is a strong motivation that creates an infinite storage appetite. In essence, CyberAll is an almost infinite attic that can store anything that could conceivably be used to answer some future question or to help explain to others (e.g. our progeny) what it was like when. It is both a memory aid and a device to help tell stories. For some, this might mean storing everything from second grade spelling tests and grade cards to home videos.
The notion of the paperless office has been out of fashion for several decades. Rather, we have built ever more productive tools to generate paper. Surprisingly, the amount of paper and file folders only grows with inflation, while printer capacity continues to grow at a 20% annual rate! File storage capacity and area devoted to paper storage grow slowly with population as people seem to retain a constant amount of paper.
CyberAll aims to eliminate paper that is used for storage and transmission, but not for certain viewing applications where paper’s advantages are well known. CyberAll’s near-term goal is to reduce the need for paper document filing while appropriately handling the transactions that would have required paper for transmission, reading, and permanent storage. The two-year goal is to eliminate all paper except documents that represent money i.e. plain old money, notes, stock, and unfortunately, cancelled checks. Tragically, the financial community – hiding behind “user resistance” – is decades behind in their thinking or ability to electronically deal with all of these items, except money!
In order to replace paper for reading, screens may need a resolution of 200 dpi and higher contrast ratios. Paper is also lighter and more portable for small documents. Still, there is extraordinary progress in display resolution, size, price, and weight.
The advent of a standard image format will have the most impact on document archiving and use because it will provide a single and universal format for storing documents, including images and recognized text for searching. In this way, it will no longer be necessary to store or transmit paper documents. The next generation TIF standard that can hold images and recognized text could eliminate the need to store or transmit paper. PDF, MIME, MHT, and DjVu are also candidates for such a standard.
At last, electronic filing cabinets such as Ricoh’s eCabinet (Ricoh, 1999) are being introduced that can accept both computer generated and scanned documents and know all of the words in the documents they hold! Of course, existing filing systems (e.g. Windows 2000, Office) include the ability to index their documents. However, scanned documents first need a recognized form.
Table 1 shows the various kinds of content that occur in an individual’s personal and professional lives for archival (mainly reference) and daily (working) use, e.g. cancelled checks, email, and music. It also shows some of the uses of the content that arise in these contexts. The content ranges from encoded legacy content (e.g. papers, photos, audio and video tapes) to computer generated papers, presentations, JPEG images, “ripped” CDs, and video tapes.
Table 1. Data-types and use for timeliness and user context.

| User Context / Timeliness | Personal | Professional (job related) |
|---|---|---|
| Archival (historical reference) | Documents, photos, music, video memory-aid, entertainment, medical history, progeny | Books, papers, reference documents |
| Working | Documents, email, photos, audio (CDs), video communication, entertainment, finance, record | Documents, email, |
Table 2 gives the storage requirements and costs for holding various kinds of data items of potential interest. It is clear that all written information and photographs cost nearly zero to store and these will reside in everyone’s cyberspace within the next decade. Also, the risk of deleting a potentially useful file is much higher than the value of the space saved; hence, storing everything costs substantially less than any alternative.
It should also be noted that the cross-over for storing encoded CDs is about 1/20th the cost of the original CD, not counting the time to attend to the encoding. Unless the encoding can be done in parallel with some other task, the encoding times and cost swamp the cost of the CD. Emerging music storage appliances and personal computers will likely change the entire music distribution system. MP3.com sells recorded music via the web and also offers a service that transmits content to an owner of a CD, thereby reducing the users’ encoding cost.
Table 2. Storage requirements and cost for common data items

| Item | Size (Bytes) | Encoded size | Items/GByte | Cost ($)/item* |
|---|---|---|---|---|
| page (b/w) fax | 100 K | 4 K | 10 - 250 K | 0.00004 - 0.001 |
| page (color) | 6 M | 0.3 M (JPEG) | 160 - 3,500 | 0.003 - 0.06 |
| business card | 5 K | 500 | 200 K | 0.00005 |
| Photograph | 3 M | 25-400 K | 10,000 | 0.001 |
| book 350 pp | 25 M | 1-2 M | 40-750 | 0.01 - 0.25 |
| CD (1 hr) | 640 M | 60 M | 1.5 - 16 | 0.60 |
| LowQ video/hr | 50-300 Kb/s | 20-300 M | 3.3 - 50 | 0.002 - 3.30 |
| MPEG video/hr | 1.5 Mb/s | 670 M | 1.5 | 6.70 |
| HiQ video/hr | DVD 4 Mb/s | 1.8 G | 0.6 | 18 |

*2000 system prices of $10,000 per terabyte or $10 per Gigabyte
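The last two columns of Table 2 follow directly from the encoded size and the footnoted price of $10 per gigabyte. The sketch below shows that arithmetic for a few rows; it assumes decimal gigabytes (10^9 bytes), and the function names are illustrative only.

```python
DOLLARS_PER_GB = 10.0  # 2000 system price from the Table 2 footnote
BYTES_PER_GB = 1e9     # assumption: decimal gigabytes


def items_per_gb(encoded_bytes: float) -> float:
    """How many items of the given encoded size fit in one gigabyte."""
    return BYTES_PER_GB / encoded_bytes


def cost_per_item(encoded_bytes: float, dollars_per_gb: float = DOLLARS_PER_GB) -> float:
    """Incremental storage cost of one item at the given price per gigabyte."""
    return encoded_bytes / BYTES_PER_GB * dollars_per_gb


if __name__ == "__main__":
    rows = [("CD (1 hr)", 60e6), ("MPEG video/hr", 670e6), ("HiQ video/hr", 1.8e9)]
    for name, size in rows:
        print(f"{name}: {items_per_gb(size):.1f} items/GB, ${cost_per_item(size):.2f} per item")
```

For the encoded CD, for example, 60 megabytes at $10 per gigabyte gives the $0.60 shown in the table.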
Table 3 estimates the storage requirements for storing various types of content arising in an individual’s life. It is clear that an individual will be able to record all of the information accumulated in one’s entire personal and professional life in a few terabytes, including everything spoken, but not including anything captured via video recording. Certainly this archive would include all home videos for most families, hopefully with editing. The table shows the various jumps in storage required going from recording lifetime text, transcribed or encoded speech, and video. The need to recognize and only handle transcribed speech is clear based on storage and on the ability to search.
Table 3. Size for storing everything read/written, heard/spoken, photographed and seen (via video)

| Data-types | Rate | Per day / | Lifetime amount |
|---|---|---|---|
| read text, few pictures | 200 K | 2 - 10 M/G | 60-300 G |
| Email, papers, written text | | 0.5 M/G | 15 G |
| photos w/voice @100 KB | 200 K | 2 M/G | 60 G |
| photos @200 KB | Ten photos/day | 2 M/2 G | 150 G |
| spoken text @120 wpm | 43 K | 0.5 M/G | 15 G |
| Spoken text @8 Kbps | 3.6 M | 40 M/40 G | 1.2 T |
| video-lite 50 Kb/s POTS | 22 M | 0.25 G/T | 25 T |
| video 200 Kb/s VHS-lite | 90 M | 1 G/T | 100 T |
| DVD video 4.3 Mb/s | 1.8 G | 20 G/T | 1 P |
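The hourly rates for the audio and video rows of Table 3 can be reproduced from the bit rates. The sketch below does so and also scales a per-day amount to a lifetime. The 30,000-day lifetime is an assumption inferred here from the text and speech rows (for example, 40 M per day against 1.2 T); the table does not state it, and the video rows do not follow it. All names are illustrative.

```python
def bytes_per_hour(bits_per_second: float) -> float:
    """Storage consumed by one hour of a stream at the given bit rate."""
    return bits_per_second / 8 * 3600


def lifetime_bytes(bytes_per_day: float, days: float = 30_000) -> float:
    """Scale a daily amount to a lifetime (assumed here to be 30,000 days)."""
    return bytes_per_day * days


if __name__ == "__main__":
    for label, rate in [("speech @8 Kbps", 8_000),
                        ("video-lite @50 Kb/s", 50_000),
                        ("video @200 Kb/s", 200_000)]:
        print(f"{label}: {bytes_per_hour(rate) / 1e6:.1f} MB per hour")
    print(f"speech, 40 MB/day: {lifetime_bytes(40e6) / 1e12:.1f} TB per lifetime")
```

At 8 Kbps an hour of speech is 3.6 megabytes, so 40 megabytes per day corresponds to roughly eleven hours of recorded conversation.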
The actual amount of storage used (Table 4) is considerably less than the lifetime estimate, because until recently the author purged files to stay within file cabinet constraints. Only a few documents were preserved.
The author has a number of albums that archive family and trips, some of which have been posted on a website (Bell, 2000). Typical albums occupy 3-5 Mbytes, consisting of 30 pages of JPEG photos encoded at 150 KB per page.
Table 4. Author’s document, photograph, videotape, and 150 CD archive

| What | Files | Size (MB) | MB/file | GB/Yr |
|---|---|---|---|---|
| Archive of scanned TIF & PDF | 2,897* | 4,665 | 1.6 | |
| Computer files 10 yr archive (3K) & working | 5,927 | 712 | 0.2 | |
| GB books (4 encoded) | 2,027 | 494 | | |
| Photos: digital | 997 | 158 | 0.2 | |
| Photos: scanned albums, pictures, slides | 1,730 | 480 | 0.3 | |
| Mail (last 2 years only) | 4 | 236 | | 200 |
| GB Videos (lectures, 8mm family movies) | 20 | 4,000 | 200.0 | |
| Total personal/prof. archive & working | 10,705 | 10,745 | | |
| 150 CDs MS WMA multimedia encoding @16 KBps | 1200 | 8,640 | 57.6 | 1000 |
| Grand Total | 11,905 | 19,385 | | |
Table 5 lists the items that one might want to cyberize and the potential formats to use. Legacy data-types, e.g. paper, photos, and videotapes, have stood the test of time. There are various kinds of “players” that allow them to be converted to computer readable form to exist in Cyberspace. For computer created data, the application program that created the data is often no longer available, so the document is essentially lost. Over the long term, complex programs like databases, word processors, and computer games can no longer run on new systems. This means the information about the various documents, i.e. meta-data, might appear within the files. In the future, I would anticipate that systems should be able to deduce much of the meta-data about a document (e.g. type, title, author, keywords, creation date). The document creation date is probably the second most useful meta-data, and is often missing. Information must be held in as few, golden primitive forms as possible.
This golden data format problem will be discussed in the following section.
Table 5. Taxonomy of legacy and computer data item types and storage formats

| Information | Encoding |
|---|---|
| Legacy (non-computer generated to encode) | |
| Paper: b/w, color, mixed | B/W TIF, PDF, DOC/RTF, HTML |
| Voice including phone | MP3 |
| Photos, slides, overhead transparencies | JPEG (future TIF standard will encode n-photos) |
| Photo albums, slide shows, slide talks | JPEG folder, PDF, DOC/RTF, PPT, HTML thicket, MHT (HTML thicket as a single file) |
| Music: CDs, tapes, and records | MP3 |
| Videotapes and film | MPEG-j, |
| Computer generated | |
| “golden” formats: TXT, TIF, JPEG, MP3, MPEG-j | |
| Files & containers (DOC, RTF, PPT, HTML, PDF, XLS) | |
| Databases (e.g. Access, dbase II, etc.) Email databases (e.g. Eudora, Outlook) | Questionable long-term access! Unreadable indexes! Eudora: TXT database! |
| Applications (e.g. Money, Quicken) | Annual versions that may have to be upgraded. Reports convert to TXT! |
Encoded documents are stored in two formats to increase the likelihood of reading the document in the distant future. Black and white documents are retained in their primitive scanned TIF formats and, in addition, converted to either PDF, DOC/RTF, or HTML to enable the document to be searched, viewed on a screen, printed at a high quality level to allow the recipient to recreate the same feel as the original, and quite possibly recreated so it can be edited. For photographs, retrieval by content is an unsolved problem, although systems such as the Altavista search exist to attempt to find images using various attributes, e.g. color, people, or buildings and then to find similar photos with those attributes.
Some documents (mixed -- black/white text, color figures, and photos) hold the color images, original, and recognized text in one or more files. For example, a scanned TIF copy of the 13-page 1889 Hollerith patent requires 700 Kbytes in black and white and 79 Mbytes in color. Storing the color scan as JPEG images in containers such as Word, PowerPoint, or PDF requires about 2 Mbytes. This file produces a near likeness of the original, aged document. Depending on how the document was scanned, it can be OCR’d, but “on-screen” viewing is difficult. The black and white image stored in a PDF file occupies 950 Kbytes and contains the original image for limited on-screen viewing and printing, and the OCR’d text for searching. DjVu appears to encode compound color and text documents in half the size of other formats.
One of the most difficult tasks is to cut a relatively rare bound book, paper, or report apart for scanning and then to discard it (Bell, 2000). Some content (e.g. engineering notebooks and handwritten notes) is not being captured at this time due to the inability to recognize the material and the difficulty of reading low contrast material.
We used the HP Digital Sender (a scan server connected to Ethernet) to scan to either black and white or color TIF or PDF. Adobe Circulate converts among the various data types (e.g. PDF, TIF, and JPEG). Several other programs, e.g. Caere's PageKeeper and ScanSoft's Pagis and PaperPort, scan to alternative, proprietary TIF format dialects. They also recognize text and build search indices for retrieval. The author uses PaperPort for holding temporary working, professional documents – if a document is likely to be preserved, it is converted to TIF or PDF.
TIF format is the basis for virtually all OCR and page input programs:
Document → Scan → TIF (future versions of TIF include OCR’d text)
|→ Acrobat → PDF (with OCR’d text)
|→ e.g. Omnipage + manual effort → DOC | HTML & Σimages
TIF is a golden format because it is a non-proprietary and evolving standard that has a huge installed base. Future, proposed versions of TIF contain the OCR’d text. Currently, a TIF is converted via Acrobat to PDF[2] with the image recognized so that the document can be searched and easily distributed.
Alternatively, an OCR program converts the image into a Word document in a near likeness of the original that provides for repurposing. In this way, a new “original” document is created for subsequent use. As a further alternative, a program converts the document into an HTML page, including the file of images (also known as the HTML thicket) for on-screen viewing. A new format, MHT, derived from the MIME encoding of a mailable HTML page, holds the thicket in a single file and is possibly a future standards competitor.
Future TIF standards include the image plus the OCR’d text to enable searching as well as meta-data (e.g. creation date, scan date, author, document type, and keywords that further describe the document). As previously discussed, future retrieval systems should be able to deduce many of the attributes. With time, scanners will evolve to include the encoding software. Scanners that directly connect to a personal computer usually just provide bitmap images to the computer and, depending on the interface software, images can be stored in a variety of formats. In the future, scanners will continue to have more capability for scanning images and converting these into a variety of other forms, e.g. TIF with encoding, and JPEG.
Finally, the evolution of TIF and HTML-XML to be able to hold different image encodings, including the recognized text, will make scanning more economical by allowing all scanned documents to be recognized and indexed within a single archive. These two capabilities are critical to eliminating the need to store or transmit paper.
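To make the idea of an image that carries its recognized text and meta-data concrete, here is a minimal sketch of such a record and a naive search over an archive of them. It is purely illustrative: the class `ScannedDocument`, its field names, and the `search` function are hypothetical, do not correspond to any TIF, PDF, or product schema, and simply mirror the attributes listed above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ScannedDocument:
    """Hypothetical record: one scanned image plus recognized text and meta-data."""
    image_path: str                       # e.g. the archived TIF file
    ocr_text: str = ""                    # recognized text used for searching
    document_type: Optional[str] = None
    author: Optional[str] = None
    creation_date: Optional[date] = None  # often missing for legacy paper
    scan_date: Optional[date] = None
    keywords: List[str] = field(default_factory=list)

    def matches(self, term: str) -> bool:
        """Case-insensitive match against the recognized text and the keywords."""
        needle = term.lower()
        return needle in self.ocr_text.lower() or any(
            needle in keyword.lower() for keyword in self.keywords
        )


def search(archive: List[ScannedDocument], term: str) -> List[ScannedDocument]:
    """Return every document whose text or keywords contain the term."""
    return [doc for doc in archive if doc.matches(term)]


if __name__ == "__main__":
    archive = [
        ScannedDocument("hollerith.tif", ocr_text="(recognized patent text)",
                        document_type="patent", keywords=["Hollerith"]),
        ScannedDocument("pdp1.tif", ocr_text="PDP-1 manual", document_type="manual"),
    ]
    print([doc.image_path for doc in search(archive, "hollerith")])
```

A real archive would index the text rather than scan it linearly; the point is only that search works once the recognized text travels with the image.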
We encoded non-digital or legacy photos in two ways. The photos are scanned separately and put into a folder that is containerized by PowerPoint into a .PPT document that holds the set of photos as an album. PowerPoint has a special plug-in to collect a set of images to build the album. Depending on the expected use, either folders or PowerPoint holds the photos.
ΣPhotos → Scan → ΣJPEG → PowerPoint → .PPT
The second method is to directly scan an album (i.e. just a collection of pages of photos) into a single PDF document that holds the various JPEG images of each page. The PDF document can also be unstacked and each page converted into a folder of JPEG images. The process can continue on to create a single PowerPoint document or Word file (.doc) for storage or display. For web hosting an album, an HTML document is an alternative storage container.
Album → Scan → PDF(JPEG) → Circulate → ΣJPEG → PowerPoint → .PPT | .DOC
Note that TIF is not used as the intermediate image format because of size. Virtually all images for personal use are JPEG encoded. Alternatively, I send 35mm slides, negatives, and photos of virtually any size to Kodak for conversion at a cost of $1/image. The Kodak Photo CD holds 100, 5 Mbyte images in multiple JPEG resolutions in its .PCD format.
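The document and photo pipelines described in this section can be viewed as a small graph of format conversions. The sketch below is one hedged way to write that down: the `CONVERSIONS` table is a hand transcription of the steps named in the text, not an exhaustive list of what the tools can do, and `conversion_path` is a hypothetical helper that finds the shortest chain of steps with a breadth-first search.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# source format -> [(target format, tool)], transcribed from the pipelines in the text
CONVERSIONS: Dict[str, List[Tuple[str, str]]] = {
    "paper": [("TIF", "scan")],
    "TIF": [("PDF", "Acrobat"),
            ("DOC", "OCR program plus manual effort"),
            ("HTML", "OCR program plus manual effort")],
    "photo": [("JPEG", "scan")],
    "album": [("PDF", "scan")],
    "PDF": [("JPEG", "Circulate")],
    "JPEG": [("PPT", "PowerPoint")],
}


def conversion_path(source: str, target: str) -> Optional[List[str]]:
    """Shortest chain of conversion steps from source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, steps = queue.popleft()
        if fmt == target:
            return steps
        for next_fmt, tool in CONVERSIONS.get(fmt, []):
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, steps + [f"{fmt} -> {next_fmt} via {tool}"]))
    return None


if __name__ == "__main__":
    for step in conversion_path("album", "PPT") or []:
        print(step)
```

Writing the pipelines this way makes it easy to see which formats are dead ends (nothing in this table converts out of PPT, DOC, or HTML), which is one reason to keep the scanned original as well.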
Table 6 gives the advantages and disadvantages of holding the documents in various formats. It is important to note that none are ideal, but PDF comes relatively close because it can act as a container for virtually any data-type (e.g. TIF, GIF, JPEG), although extracting the data in the correct format using the labyrinth of Adobe tools is a great challenge. In addition, the Acrobat 4 OCR facility, Capture, provides recognition so that documents can be searched while retaining the original document as an image.
Table 6 also shows the need for future standards that can retain printed images as well as the recognized text for searching, together with the ability to display the images on a variety of screens.
Table 6. Characteristics of various scanned document formats

| Format | Advantages | Limitations |
|---|---|---|
| TIF | A “golden format” from scanning. Evolving standard will hold images and recognized TXT. | Must be OCR’d to search. |
| PDF | Defined to carry both image and recognized text. Holds all data-types. | Sole sourced tools, not editable, poor on screen viewing |
| DOC/ | Many editors and viewers, well-defined, container for all data-types | Separate software for OCR |
| HTML->XML | Open standard, editing tools, on screen via browsers, most universal | Compound documents create an “HTML Thicket” of files; this is solved with MHT |
Table 7 gives times and/or costs to scan and encode various legacy documents. As a rule, each item (page, photo, or slide) costs about $1 from commercial services. For legacy documents, using Acrobat to create PDF for its indexing capability saves an incredible amount of time versus having to recognize and recreate a perfect copy of the document. In certain cases, one may want to recognize the document and convert it to a Word document (i.e. DOC/RTF) or an HTML document for web viewing. This requires “perfect” recognition together with the need to format the document exactly like the original. In essence, a document is being re-published. Scanning, recognizing, and editing a page to create a formatted document that is suitable for repurposed use can easily require 10 minutes.
TIF is constantly being evolved by Caere, Kodak, Scansoft, Xerox, etc. to hold compound documents: text, tables, and TIF and JPEG images. However, like PDF, all of the company formats are unique, proprietary, and constantly being evolved. Provided a relatively high-resolution image can be regenerated from the TIF components, character recognition can be done on the document to create searchable text.
Table 7. Time or cost to encode legacy documents, photos, and CDs

| Task | Time (min) | Cost ($) |
|---|---|---|
| Page scan | 1 | 0.10-1.00 |
| 10 page paper scan (HP Sender) | | |