National Historical Publications and Records Commission (NHPRC)

Annotation, NHPRC Newsletter
Vol. 26:1  ISSN 0160-8460  March 1998

The Universal Preservation Format: Background and Fundamentals
by Thom Shepard, Project Coordinator

Background

Sponsored by the WGBH Educational Foundation and funded in part by NHPRC Grant No. 97-029, the Universal Preservation Format initiative advocates a format for the long-term storage of electronically generated media. Dave MacCarn, Chief Technologist at WGBH, is the architect of UPF. He and Mary Ide, Director of the Media Archives and Preservation Center at WGBH, are the Project Co-Directors. I am the Project Coordinator, with my feet planted not-always-so-firmly in both engineering and archival camps.

Working with representatives from standards organizations, hardware and software companies, museums, academic institutions, archives, and libraries, this project will produce and publish a document called a Recommended Practice. This document will be submitted to the Society of Motion Picture and Television Engineers (SMPTE), and will suggest guidelines for engineers to follow when designing computer applications that involve or interact with digital storage. We expect to make the process of preserving and accessing electronic records (both original and migrated) more efficient, more cost-effective, and simpler.

Once upon a time, you could access most media through sheer cleverness. With analog media, such as a record or a film slide, there is an "analogy" between process and form. In practical terms, even without playback equipment, you could simulate the media experience. For example, when I was around Cub Scout age, I built a phonograph player, using rolled-up cardboard to amplify the sound and a sewing needle for a stylus. I can tell you, it was not very popular with my parents, whose records I sometimes borrowed for my prototype, but it worked. I could "get at" the sound.

Getting at digital media is not so easy. You need some form of decoder. Too often, you must have the exact decoder. Our project hopes to change all that. The UPF standard would serve as a universal decoder, co-existing and interchanging with proprietary formats in the same way that RTF ("rich text format") co-exists with Word or WordPerfect formats in your word processor.

I don't need to remind you about the value of standards. Just think about them the next time you replace a light bulb in your living room lamp. One standard that has made the professional lives of archivists easier is acid-free paper. Established in 1984 by the National Information Standards Organization, ANSI Z39.48-1984 set the requirements for the durability and longevity of paper. Paper that complies with this standard will last several hundred years. What made this standard a reality, particularly the 1992 revision, were joint efforts among paper makers, publishers, printers, and the preservation community. The UPF is sounding a similar call for cooperation and communication between engineers and archivists.

Technical Specifics of the UPF

Digital information consists of binary code (zeros and ones). When these zeros and ones are arranged in a particular way, you build digital objects. These objects can be data types, such as video or music, or they can be information about the data types, which is called "metadata." When talking about metadata in terms of its function, there are four basic categories: format, description, association, and composition.

The wrapper (or container) is a file format for storing both the media content, or "essence," and the information that describes it. Think of it as the equivalent of a digital burrito, with the basic ingredients as the "essence" and the optional hot sauce as its metadata. When Dave MacCarn first proposed the UPF in 1996, his model for the wrapper was Apple's Bento Container. Since that time, Apple has dropped its development of Bento. However, the UPF project is currently exploring several next-generation wrapper technologies. Most promising are:

  • JavaBeans, a portable, platform-independent component model written in Java;
  • IronDoc, developed by David McCusker, former Apple engineer in charge of OpenDoc storage and Bento; and
  • QuickTime 3.0, Apple's own follow-up to Bento.

The wrapper is a file format that has a framework structure. Anyone familiar with the Dublin Core metadata initiative, specifically the Warwick Framework Architecture, may have some understanding of frameworks as a method for managing data. Warwick posits a metadata structure in which material describing certain objects may either be embedded in the source or be referenced to files or storage areas external to the source. This information may include domain-specific descriptions, terms and conditions for document use, pointers to all manifestations of documents, and archival responsibility.

A typical web page offers a practical example of this referencing: some information is embedded in the homepage itself, while other information is reached through links to other pages. In terms of digital storage, the UPF will explore with archivists a Recommended Practice that will delineate what kinds of information should be embedded, or "carved in stone," and what kinds might be referenced and editable through time.
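The split between embedded and referenced metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the UPF design: the `Wrapper` class, its field names, and the sample file paths are all hypothetical.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, Mapping


@dataclass
class Wrapper:
    """Hypothetical container holding essence plus two kinds of metadata."""
    essence: bytes                    # the media content itself
    embedded: Mapping[str, str]       # "carved in stone" with the essence
    referenced: Dict[str, str] = field(default_factory=dict)  # external pointers

    def __post_init__(self) -> None:
        # Copy the embedded metadata and expose it through a read-only view,
        # so its entries cannot be changed in place after creation.
        self.embedded = MappingProxyType(dict(self.embedded))

    def describe(self) -> Dict[str, str]:
        """Merge both kinds of metadata; embedded values win on conflict."""
        merged = dict(self.referenced)
        merged.update(self.embedded)
        return merged


clip = Wrapper(
    essence=b"\x00\x01\x02",
    embedded={"format": "example-video", "title": "Sample clip"},
    referenced={"catalog_record": "external/catalog/sample-clip.txt"},
)

# Referenced metadata stays editable through time.
clip.referenced["rights"] = "external/rights/sample-clip.txt"
```

In this sketch, an attempt to assign to `clip.embedded["title"]` raises a `TypeError`, while the referenced pointers can be added or updated freely.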

While several initiatives deal with subject access and descriptors for faster access, projects that are working toward the standardization of codes to represent digital objects, like the Association of American Publishers' Digital Object Identifier (DOI) System, have perhaps received less attention. Identifying digital objects as unique entities is essential to establishing archival integrity, especially when it is so easy to misplace, corrupt, or delete digital information. As files are modified, you need to distinguish the offspring from the parent, but also map the "blood lines," so to speak.

The UPF project is looking at initiatives dealing with unique identifiers, and we expect to include such a system or systems in our Recommended Practice. Basically, each object carries an identifier that is unique within its container. As this object undergoes changes, often called "versioning," each new generation is assigned its own identifier, which always references its parent.
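As a rough sketch of this parent-referencing scheme, the Python below assigns each object an identifier that is unique within one container and records the parent of every new generation. The `Container` and `ObjectRecord` names, and the use of sequential integers as identifiers, are assumptions made for illustration; the project had not selected an identifier system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ObjectRecord:
    identifier: int          # unique within its container
    parent: Optional[int]    # None for an original object
    note: str


class Container:
    """Hypothetical container that hands out identifiers and tracks lineage."""

    def __init__(self) -> None:
        self._records: Dict[int, ObjectRecord] = {}
        self._next_id = 1

    def add(self, note: str, parent: Optional[int] = None) -> int:
        """Store a new object (or a new version of `parent`) and return its id."""
        if parent is not None and parent not in self._records:
            raise KeyError(f"unknown parent identifier: {parent}")
        identifier = self._next_id
        self._next_id += 1
        self._records[identifier] = ObjectRecord(identifier, parent, note)
        return identifier

    def lineage(self, identifier: int) -> List[int]:
        """Walk the "blood line" from an object back to its original."""
        chain: List[int] = []
        current: Optional[int] = identifier
        while current is not None:
            chain.append(current)
            current = self._records[current].parent
        return chain


box = Container()
original = box.add("original transfer")
corrected = box.add("color corrected", parent=original)
trimmed = box.add("trimmed for access", parent=corrected)
print(box.lineage(trimmed))  # [3, 2, 1]
```

Because every version keeps a pointer to its parent, the offspring can always be told apart from the original while the full chain of generations remains recoverable.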

The UPF uses a digital Rosetta Stone to get at the range of data types held in a digital storage bank. The original Rosetta Stone was a stone tablet, dating to 196 B.C., which contained the same message written in three scripts (hieroglyphic, demotic, and Greek). Discovered in 1799 near the Rosetta mouth of the Nile River, it was used in the early 19th century to decipher Egyptian hieroglyphics.

The digital Rosetta Stone would serve as a key, defining data types and encapsulating algorithms for deciphering those files. This is not a new idea. Jeff Rothenberg, in an article published in Scientific American, has suggested encapsulating software with the stored digital media as a way to get at the media through time. Dave MacCarn proposes the use of platform-independent algorithms to decipher file types.

For example, it might state in effect, "This system uses MARC, which is defined as such-and-such," or, "This system was originally recorded on 422 Video, which is defined as so-and-so." In addition, the Rosetta Stone might include some form of mapping among multimedia file formats or even classification or cataloging systems. The Rosetta Stone would also serve as a registry for unique identifiers.
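One way to picture the digital Rosetta Stone is as a registry that pairs each data type with a human-readable definition and an algorithm for deciphering it. The Python sketch below is a toy under stated assumptions: the `RosettaStone` class, the two type names, and their decoders are invented for illustration and do not correspond to any format the project named.

```python
from typing import Callable, Dict, NamedTuple


class TypeEntry(NamedTuple):
    definition: str                   # plain-language statement of the type
    decode: Callable[[bytes], str]    # algorithm that deciphers stored bytes


class RosettaStone:
    """Hypothetical key: defines data types and encapsulates their decoders."""

    def __init__(self) -> None:
        self._entries: Dict[str, TypeEntry] = {}

    def register(self, type_name: str, definition: str,
                 decode: Callable[[bytes], str]) -> None:
        if type_name in self._entries:
            raise ValueError(f"type already registered: {type_name}")
        self._entries[type_name] = TypeEntry(definition, decode)

    def define(self, type_name: str) -> str:
        """Return the statement 'this system uses X, which is defined as ...'."""
        entry = self._entries[type_name]
        return f"This system uses {type_name}, which is defined as {entry.definition}"

    def decipher(self, type_name: str, payload: bytes) -> str:
        """Apply the registered algorithm to stored bytes."""
        return self._entries[type_name].decode(payload)


stone = RosettaStone()
stone.register("toy-text", "UTF-8 encoded characters",
               lambda raw: raw.decode("utf-8"))
stone.register("toy-hex", "UTF-8 text written as hexadecimal digit pairs",
               lambda raw: bytes.fromhex(raw.decode("ascii")).decode("utf-8"))

print(stone.define("toy-hex"))
print(stone.decipher("toy-hex", b"555046"))  # UPF
```

The point of the sketch is that a reader who holds only the stored bytes and the registry can recover the content, without needing the application that originally wrote the file.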

The actual moving of data would be performed by a media compiler. It would remove the baggage of the acquisition format as it imported the data into the archive. It would optionally export whatever metadata you needed from the archive. Specifically, you could pre-select which set of relationships or media formats you wished to transport for a given need, such as Internet access. And because the relationships among your data objects would be built-in, you could very easily "package" information. For example, you could extract certain media objects, along with their associative text files, based on a scholar's search patterns. These materials could then be burned into a CD-ROM or transferred onto some other portable storage vehicle, and then loaned to the scholar for a fee, or sold to him outright.
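The "packaging" step described above might look something like the following Python sketch, which selects media objects that match a search and carries their associated text files along. The `ArchiveObject` structure, the `build_package` function, and the sample identifiers are hypothetical; the article does not specify how a media compiler would be implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class ArchiveObject:
    identifier: str
    media_format: str
    associated: List[str] = field(default_factory=list)  # built-in relationships


def build_package(archive: Dict[str, ArchiveObject],
                  matches: Iterable[str],
                  formats: Set[str]) -> List[str]:
    """Collect matching objects of the wanted formats plus their associated files."""
    selected: List[str] = []
    seen: Set[str] = set()
    for identifier in matches:
        obj = archive[identifier]
        if obj.media_format not in formats:
            continue  # pre-selected formats only
        for member in [identifier, *obj.associated]:
            if member not in seen:
                seen.add(member)
                selected.append(member)
    return selected


archive = {
    "film-1": ArchiveObject("film-1", "video", ["transcript-1"]),
    "audio-1": ArchiveObject("audio-1", "audio", ["notes-1"]),
    "transcript-1": ArchiveObject("transcript-1", "text"),
    "notes-1": ArchiveObject("notes-1", "text"),
}

# A scholar's search matched two objects, but only video was requested.
package = build_package(archive, ["film-1", "audio-1"], {"video"})
print(package)  # ['film-1', 'transcript-1']
```

Because the relationships travel with the objects, the export needs no separate lookup to find the transcript that belongs with the film.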

Recent Steps for the UPF

Let me now turn to what we've been doing lately. On September 22, 1997, SMPTE assigned the UPF an official Study Group (ST13.14). Entitled "Requirements for a Universal Preservation Format," and chaired by Dave MacCarn, the group first met to establish an agenda and to hash out a statement of objectives, which includes gathering input from the archival community.

On December 9, 1997, Dave MacCarn and I attended the first SMPTE work study forum. Robin Dale of the Research Libraries Group joined us as we met with about 20 SMPTE engineers at the Sony headquarters in San Jose, California, to discuss the components of the UPF with respect to the stated needs and concerns of archivists, as expressed in our User Survey.

What are these needs and concerns? Though many archivists said that they realized they would have to "migrate" at some point, most could not justify the costs either of migrating to digital or of investing in new digital equipment that will only become obsolete in a few years. Running throughout these commentaries was the frustration that archivists had no control over new technologies. And while digital has qualities that are enormously appealing to archivists - searchability, mobility, longevity - computer technologies seem disposable, like snakes shedding their skins. Some archivists also reported that they were feeling pressure from administrators to go digital for all the wrong reasons: consolidating their collections, for example.

Related to these issues are the changing hiring practices within archival institutions. Commentators mentioned the need to hire people with computer skills at the expense of adding much-needed personnel with library or archival backgrounds and education. Managing these people is also a challenge. Our survey commentaries say it over and over: digital is not a replacement for existing analog collections. Digital must co-exist with analog.

Some of our questions bordered on "blue sky" issues. For example, we proposed a scenario in which embedded information would describe media through what is called "metadata streaming." This embedded information could be applied to video, to an image collection, to a piece of music, or even to a collection of records. Although the idea in itself was appealing, the unanswered questions were: who would input all this information, and who could afford it? And if a single picture is worth a thousand words, how many of those words do you include in your metadata? Answers may not be available here and now, but we believe that a UPF would help establish a foundation upon which these questions might be realistically explored.

For those already involved in some form of digital conversion, the strategy has generally been to convert from analog to digital in an ad hoc manner. No one has developed strategies for replacing analog collections with digital formats. Always the plan is to hold onto analog while experimenting with digital for purposes of access. Robin Dale said it best: "...[I]n the best of all worlds, institutions prefer to aim for an analog copy for long-term preservation and a digital copy for easy and readily available access."

We recently published the results of our survey on our web site. You can read what archivists have written verbatim, as well as our summaries. In addition, we have posted follow-up questions that we invite you to comment upon. We will include these commentaries in future site revisions. Even if you are not interested in the exact details of this project, we urge you to read these often-inspiring commentaries from some very respected people in this field.

Conclusion

A worthy standard for long-term digital storage will carry forth the traditional practices of analog collections. Specifically, a Recommended Practice must respect provenance and original order. Its framework must be robust, allowing for certain types of metadata to be embedded with the media, with other types to be referenced externally. By concentrating on elemental concepts of how data and information about that data might be stored over time, the Universal Preservation Format initiative is attempting to construct a bridge between engineers and information scientists, between those who make and market technical specifications and those who must learn to use the tools of technology to preserve the rapidly decaying fruits of our cultural heritage.
