Building the Archives of the Future
Address to the IEEE Symposium on Mass Storage Systems and Technologies
San Diego, California
April 19, 2001
Good morning. It's a pleasure to join you.
Unlike most of the other speakers at this conference, I am not an expert on mass storage technologies. I'm probably the most "low-tech" of anyone here. I'm here without animated PowerPoint slides and colored charts with sound effects, and I'm going to give a speech that's printed out on a stack of plain white paper. In fact, I'm only carrying a cell phone because my staff insists on it.
However, as fate would have it, as the Archivist of the United States, I have one of the biggest mass storage problems of anyone in the country. And here is why:
In the past few decades, information technology has made many things a lot easier, faster, and more accurate. Unfortunately, records preservation is not one of them.
Think about it - at the National Archives in Washington, DC, we preserve and display the most valuable documents of our country - the Declaration of Independence, the Constitution, and the Bill of Rights.
While the preservation process is painstaking and the security surrounding those documents is tremendous, anyone who wants to can walk up and read these Charters of Freedom that have been around for over 200 years.
We also have in our care hundreds of thousands of government records that date from the Revolutionary War through the 20th century. Many of these records have survived more than a century, even though the National Archives and Records Administration was not created to care for them until 1934.
As you know, the same cannot be said of electronic records. Records created just a few years ago are already unreadable by today's technology.
We cannot even begin to fathom how the hardware and software of the next century will work. And when you combine the rate of technological obsolescence with the explosive number of electronic records being created by the government every day, then you can begin to imagine the challenge that we face at the National Archives and Records Administration.
Believe me, the Y2K problem was a piece of cake compared to this.
Today, we are responsible for preserving and providing public access to over four billion paper records, as well as more than 40 million special media items such as photos, films, sound recordings, maps and charts, and even presidential gifts. We are responsible for the records of all three branches of the Federal Government.
The National Archives and Records Administration preserves these records not just for the sake of history, but also so that the records of our Government can be examined by its citizens. We enable people to inspect for themselves the records of what government has done. We enable officials and agencies to review their actions, and we help citizens hold them accountable for those actions. We also hold records that document the rights of private citizens. Materials such as military records, or land rights, for example, enable citizens to claim their rightful entitlements.
This accountability of the Government to its people and the protection of their rights is the very cornerstone of the democracy in which we live, and I am proud to play a role in safeguarding the records of this democracy.
Obviously, we have a huge job now, but it's nothing compared to what is coming with the literally billions of electronic records that are being generated as we speak.
Along with the rest of the world - both government and private sector - the National Archives and Records Administration still lacks proven methods for preserving most forms of electronic records that will be created in the near and long-term future with the unending changes to information technology.
Our challenge is to build an archives that can keep the essential records of government retrievable, readable, and authentic for as long as they remain valuable. That time frame ranges from a few years to hundreds of years into the future. Unfortunately, there are now no ready-to-use solutions available to keep records for even 20 or 25 years, which, for us, is a very short time.
It is vital that government agencies are able to guarantee the integrity of their records over time for their own business needs alone.
For example, the Food and Drug Administration must keep records on adverse reactions to drugs for as long as the drug is on the market. The Social Security Administration needs to keep records on the accounts of all citizens until all possible claims have been exhausted, which they estimate is through the lives of the grandchildren of the original account holder.
The problem of preserving electronic records hits the Air Force too, because after a few years they find they can't access research and development records to find out why decisions - such as the set of the wing of a jet fighter - were made during the development of a weapon system. Without this data, the military may be faced with the very costly need to repeat research and testing that has already been done.
And, those are just examples of the critical information that will be lost if the government can't preserve its electronic records. Every government agency has its own specific requirements, but the over-arching problem of electronic records management is the same.
Not that long ago, if I were speaking about "Building the Archives of the Future," I would probably tell you about the construction of our new state-of-the-art archives in College Park, Maryland.
Today, when I speak of the "Archives of the Future," I am talking about a digital National Archives that will make Government records available to anyone, at any time, and in any place, for as long as needed.
Two hundred years from now, if a student wants to read George W. Bush's first State of the Union address, a couple of clicks of a mouse (or whatever the folks of the 23rd century are using) should bring it to him or her.
In the nearer future, if you want to see patent information on software technology developed by one of your colleagues, you should be able to get it almost instantly.
Right now, over at Fort Bragg, North Carolina, there are young soldiers in boot camp. Twenty-five or thirty years from now they may need veterans' benefits and will be able to access their electronic records to verify their service.
Until very recently, building this "Archives of the Future" seemed to be technologically impossible. A few years ago, when we looked for a way to preserve accessible and authentic electronic records, we found no one who could offer us a solution. Today, definitive solutions still do not exist, but we have reason to be optimistic.
In the next few minutes, I want to tell you a little about the challenges we are facing, the partnerships we have forged to help overcome those challenges, and how we are working to build the digital "Archives of the Future." Perhaps most importantly, I hope to illustrate for you how the technological solutions we seek can also benefit countless other organizations in both the public and private sectors.
While this is the first time an Archivist of the United States has addressed an IEEE conference, it is obvious from your program that there are significant common interests between your community and ours.
You have heard our partners at the San Diego Supercomputer Center, Richard Marciano and Reagan Moore, speak about "Emerging Information and Knowledge Management Technologies." Other speakers are addressing "Persistent Storage Allocation," "High-Speed Data Transfer," and "Enabling Data Management in a Distributed World."
Although I will not pretend to have expertise in any of these topics, I do recognize their relevance to the problems we face in trying to build the "National Archives of the Future." And I am confident that your professional community can not only help us find solutions to problems like obsolescence, but will also help us deliver the essential evidence of our government to the American people for generations to come.
The research being done by your community - by computer scientists and electronics engineers working with archivists and records managers - is yielding possible sustainable solutions for preserving and providing access to electronic records.
Going into this, I quickly realized that although the National Archives and Records Administration has a digital information storage challenge that is unique in its sheer size, complexity, and need for longevity, we don't have the resources or, frankly, the clout to drive technology to satisfy our needs.
The Defense Department or NASA may be able to do this, but not the National Archives. We needed to leverage the expertise and resources of other organizations with similar problems with digital records management.
And, we are now laying the cornerstone of the "Archives of the Future," thanks to the partnerships we have built with the San Diego Supercomputer Center, as well as the Georgia Tech Research Institute and several Government agencies including the National Science Foundation, the Defense Department, and the Patent and Trademark Office.
This challenge, to build an information management architecture for a persistent archives, represents a new relationship between the scientific and archival communities. Until very recently, engineers and archivists didn't interact very much. But technology has given us common interests and common problems.
I have been told that engineers like nothing better than to solve problems, and I am more than happy to provide you with one heck of a problem to solve. I should point out that I'm not the only one who thinks we have a heck of a problem. When we explained our situation to the director of one of the major computer research and development programs in the government, he said, "Your problem is so big it's probably stupid to even try to solve it." Since he's an engineer, we knew he was hooked, and we were able to establish an ongoing partnership with his organization.
We did, however, quickly find out that the scientific community and the archival community tend to view records a bit differently. A staff member told me that one of the engineers came to him very excited that he had found mistakes in the information contained in some of our records - and that now we could correct them. We had to explain that while engineers are concerned with the absolute accuracy of information, archivists are concerned with the authenticity of a record.
To put it another way, a map that contains errors might just be a bad map. But, if the map was used by an Army general in the decision to launch air raids, then that inaccurate map becomes a very important part of a specific set of records.
In partnering with the technological community, we also soon found out that the solutions researchers are coming up with can be applied in any number of areas. We needed the infrastructure to support a persistent archives that is capable of permanently storing billions of records, while preserving the authenticity of the information.
Research by our partners has begun to answer our needs, and now the technologies developed for use by the National Archives and Records Administration are paving the way for further developments in the field of records management that have countless other applications.
For example, the National Science Foundation is currently looking at using the same kind of infrastructure as we are to develop a National Science Digital Library to store and manage science curricula for students from kindergarten through college.
The California Digital Library is using a similar infrastructure to support the Art Museum Image Consortium, which consists of approximately 200 museums in this country. By using the same sort of technology developed for us, the art museums can provide access to images of art and corresponding data. However, they are not yet able to provide long-term storage of this information, which is key for us.
The neuroscience community also has a similar infrastructure used to store and access a variety of different research projects. Again, while this information is capable of being stored and retrieved, we are not yet at the point where we can preserve records long-term and ensure their authenticity.
The key to this kind of records management is model-based mediation, which not only stores data, but groups it by how it relates to other data. This is how the National Archives and Records Administration stores records as well, because in order for records to be relevant over time, and to tell their story, they must be preserved within the context of other related records. This is a concept archivists call provenance.
In addition to the infrastructure to support our digital archives, we also need the tools to intelligently process and manage collections of records. Collections of records come from all over the government and are diverse in subject matter, technical aspects and sensitivity levels.
For example, we now have about 700 computer hard drives used during the first Bush Administration. To deal with them, we first have to identify the actual records on these drives, and that is no easy chore. In a way that's harder than looking for a needle in a haystack, because at least the needle and the hay have very different attributes.
We are finding that the vast majority of files on these drives are things like software, help files, tutorials, and samples provided by the vendors. Therefore, we have to filter through all this to find the user-created files that might be records.
Next, we have to determine which records are governmental in nature and which are personal. And among the government records, we must identify those that are protected by the Privacy Act or other legal restrictions.
We have to figure out how individual files relate to each other, and how digital records relate to other paper records. And, to make things even more confusing, we must determine whether thousands of records with no security markings should be treated as sensitive documents.
Obviously, with the volumes of records we are talking about, we need tools to help us, and the research funded by the National Archives and Records Administration, our partners, and other Government agencies is beginning to show results.
Advanced technologies being applied by our partners at Georgia Tech look especially promising. At the core of the work Georgia Tech is doing is the development of advanced technology using natural language processors to address a variety of problems.
For example, they are working on tools that would identify system files on computer hard drives, making it easier for us to identify actual records. Other tools would identify personal information like a Social Security number to help us determine the sensitivity level of a record.
Several weeks ago, we launched a pilot system at the George Bush Library in Texas, which uses some of this new technology developed by researchers at Georgia Tech. We are excited about this new system, and we will be using it to conduct experiments on automatically generating basic descriptions of records.
Despite recent developments that will help us, the commercial products we need are not here yet. More research will be needed as technology continues to evolve, but we have begun to identify our needs and expect to see software solutions to deal with several aspects of our problem within two years. And as we continue to learn more about our needs, we believe research will yield increasingly sophisticated technology and products to help solve our problems.
Of course, the tools we need are also of interest to the private sector. Just as the FAA must keep records on aircraft, private airlines must do the same. Pharmaceutical companies are required to save information on drugs, just as the FDA is. Architectural and engineering firms need access to building plans for as long as the building is standing.
Local governments need to keep long-term records on bridges, sewer systems, and road construction and maintenance. Oil companies keep huge amounts of highly sensitive data on the likely locations of large deposits of oil.
But, perhaps the worst-case scenario of a need for long-term records storage lies with the nuclear power plants. Along with the Department of Energy, they are required to keep records on exhausted fuels for as long as those fuels remain radioactive. According to nuclear scientists, this can be as long as ten thousand to a hundred thousand years.
As you can see, the need for long-term record storage, and the uses for the tools to accomplish it, are endless. We don't yet have all the answers, but we are optimistic that answers can and will be found.
And we have come a long way in just a few short years. In 1998, we told the folks at the San Diego Supercomputer Center that we needed the ability to process millions of records as quickly as possible. At the time, they succeeded in bringing in a million email messages from the Internet, processing them into a preservable format, and bringing them back out in a different format. This all took less than two days on a supercomputer two years ago. Now they can do 10 million records in a day, and are exploring the feasibility of processing 100 million records a day.
We expect that in the near future we will have the technology to process a million messages a day from a single workstation. And that also means this technology can be scalable for smaller state, local, and university archives.
Three years ago the technology to build the "Archives of the Future" simply did not exist. Now, within a year we expect to launch a pilot program that will allow researchers at our facilities direct access to some of our 10,000 government databases.
Next year we will also begin taking in more than a million digital messages a year from the State Department. Users will be able to access messages related to things like international drug trade or the European Economic Union. Within five years we will be able to accept digital personnel files from the Defense Department.
Full deployment of the digital Archives will be progressive and dependent on the availability of technology. The research we are working with now uses technology that is not yet in the marketplace. We expect the products to come, and as they do, we will continue to build.
I was very pleased that the President's 2002 budget, released last week, included a request for a substantial increase of $20 million for our Electronic Records Archives.
These resources would allow us, with help from our partners, to further define a research agenda of archival and technical questions for the development of a digital Archives and to collaborate in sponsoring research to answer these questions. It would also allow us to begin translating the research results into engineering solutions that work.
Of course, this budget still has to be approved by Congress, but the fact that the President has recognized the critical importance of dealing with our nation's electronic records challenges tells me that we must push even harder to make the digital Archives a reality. It is now even more clear that the Administration agrees that this investment is critical to the health of our democracy.
When I became the Archivist of the United States in 1995, I took on the mission of preserving and providing access for our citizens to our nation's vital records. The explosion of digital records has forever changed the way the National Archives and Records Administration carries out this mission.
The "Archives of the Future" will not consist of many buildings scattered across the country. Instead, the "Archives of the Future" will be available on the desktop of any American who chooses to explore the records of his or her country. Our friends around the world will also be able to visit the "Archives of the Future" through their computers.
Building this new, digital archives is not and will not be easy. But we have no alternative. Electronic records, like records in traditional forms, are critical for the effective functioning of a democracy.
A society whose records are closed cannot be open. A people who cannot document their rights cannot exercise them. A nation without access to its history cannot analyze itself. And a government whose records are lost cannot accountably govern.
If we do not succeed, the Archives of the United States will cease to exist. And a democracy without open access to its government's records is no longer a democracy.
I believe that it is important that those of you in the technology field understand the National Archives that we are trying to build. In the management of electronic records, all of us must learn from each other.
Pressure from the public, our customers, and our organizations' own business needs is driving the search for solutions. The progress we are making at the National Archives and Records Administration may help you, just as what you are doing and learning will help us.
Finally, let me leave you with this thought - what we are doing at the National Archives and Records Administration matters to you not only as technology professionals; it matters to you as citizens of this nation. Effective electronic records management is critical to keeping your rights protected and your government accountable to you.
Thanks to the expertise, talent, and dedication of those of you in the scientific and engineering communities, I'm not worried about the demise of the National Archives and Records Administration. I can see the "Archives of the Future" taking shape right now, and I can't wait to see how it turns out.