Information for the NARA - Virginia Tech AI Conference, April-May 2021
Welcome
Please read the Archivist's blog post about the conference, Exploring the Future Together.
David S. Ferriero was confirmed as 10th Archivist of the United States on November 6, 2009. Early in 2010, he committed the National Archives and Records Administration to the principles of Open Government: transparency, participation, and collaboration. Before joining the New York Public Library (NYPL) in 2004, Mr. Ferriero served in top positions at two of the nation's major academic libraries, the Massachusetts Institute of Technology in Cambridge, MA, and Duke University in Durham, NC. Mr. Ferriero earned bachelor's and master's degrees in English literature from Northeastern University in Boston and a master's degree from the Simmons College of Library and Information Science, also in Boston. Mr. Ferriero served as a Navy hospital corpsman during the Vietnam War.
NARA Datasets for Artificial Intelligence and Machine Learning
The following datasets represent a variety of the data available in the National Archives Catalog, including textual records, which make up the majority of NARA’s current holdings, and electronic records such as emails and databases.
These datasets may be accessed with the National Archives Catalog API or as JSON files for download. Each dataset contains all of the metadata for the records listed. This includes object-level data, such as citizen archivist tags and transcriptions or optical character recognition (OCR) data extracted from the records (where present, as noted below), and the URLs for the digital objects of the digitized or electronic records.
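As a rough illustration of working with one of the downloadable JSON files, the Python sketch below collects every URL-like string from a parsed file. The layout of the files is not described on this page, so the code assumes nothing about field names: it simply walks whatever nested structure it finds. The function names `iter_urls` and `load_urls` are hypothetical and are not part of any NARA tooling.

```python
import json
from typing import Any, Iterator, List


def iter_urls(node: Any) -> Iterator[str]:
    """Yield every string value in a nested JSON structure that looks like a URL."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node


def load_urls(path: str) -> List[str]:
    """Parse a downloaded JSON file (path is an assumption) and list its URLs."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_urls(json.load(handle)))
```

Because the walk is schema-agnostic, it will also return URLs that do not point to digital objects; once the actual field names are known from the file or the API documentation, filtering on them would be more precise.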
Datasets with Object-Level Metadata
This table lists datasets from the National Archives Catalog that have object-level metadata, which can serve as a baseline for comparison with what AI/ML tools can achieve. Object-level metadata includes both data produced by optical character recognition (OCR) and citizen archivist contributions. These datasets reflect only digitized textual records, which make up the vast majority of NARA’s holdings.
Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | OCR data? | Citizen archivist data? | National Archives Catalog API query | JSON file |
---|---|---|---|---|---|---|---|
RG 109 - War Department Collection of Confederate Records | | 94,063 | 1,401 | Yes | Yes | | JSON |
RG 92 - Office of the Quartermaster General | | 47,100 | 982 | Yes | Yes | | JSON |
RG 75 - Bureau of Indian Affairs | Correspondence Relating to Reindeer Herds in Alaska, 1911-1960 | 21,426 | 636 | No | Yes | | JSON |
RG 120 - Records of the American Expeditionary Forces (World War I) | | 9,120 | 2,405 | Yes | Yes | | JSON |
RG 276 - U.S. Court of Appeals | | 8,238 | 27 | Yes | Yes | | JSON |
Collection LBJ-PCTJWHD | | 4,017 | 1,604 | Yes | Yes | | JSON |
RG 412 - Environmental Protection Agency | Program Development Files on Seabrook Nuclear Power Plant, 1/1/1973 - 12/31/1979 | 3,547 | 29 | Yes | Yes | | JSON |
RG 341 - U.S. Air Force | Reports Regarding Proposed Air Force Academy Site Selection, 1950 - 1950 | 2,872 | 43 | Yes | No | | JSON |
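One way to use the object-level metadata above as a baseline is to score machine-generated text against a citizen archivist transcription of the same page. The Python sketch below uses the standard library's `difflib.SequenceMatcher` to compute a similarity ratio between two strings. It is only an illustrative metric, not a method NARA prescribes, and the helper names `normalize` and `similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def similarity(ocr_text: str, transcription: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    matcher = SequenceMatcher(
        None,
        normalize(ocr_text),
        normalize(transcription),
        autojunk=False,  # disable the popularity heuristic, which can skew long pages
    )
    return matcher.ratio()
```

For example, `similarity("Hello  World", "hello world")` returns `1.0` because case and spacing are normalized away before comparison. A stricter evaluation would typically report character or word error rates instead.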
Datasets without Object-Level Metadata
This table provides information about datasets from the National Archives Catalog that do not have object-level metadata but may be used to understand how AI/ML tools will apply to records without pre-existing object-level metadata and how the results may or may not meet both NARA’s needs and user expectations. These datasets describe digitized textual records, which are the vast majority of NARA’s holdings.
Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | OCR data? | Citizen archivist data? | National Archives Catalog API query | JSON file |
---|---|---|---|---|---|---|---|
RG 181 - Naval Districts and Shore Establishments | | 25,694 | 48 | No | No | | JSON |
RG 472 - U.S. Forces in Southeast Asia | | 23,959 | 260 | No | No | | JSON |
RG 60 - Department of Justice | Files of Associate Deputy General Merrick B. Garland, 1994 - 1997 | 23,552 | 447 | No | No | | JSON |
RG 22 - U.S. Fish and Wildlife Service | | 7,144 | 42 | No | No | | JSON |
Datasets of Born-Electronic Records
Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | National Archives Catalog API query | JSON File |
---|---|---|---|---|---|
RG 541 - Assassination Records Review Board | Electronic Records Relating to John F. Kennedy Assassination Research, 4/1/1994 - 9/30/1998 [includes emails] | 83 | 83 | | JSON |
RG 330 - Office of the Secretary of Defense | Defense Casualty Analysis System (DCAS) Files, ca. 2001 - 3/16/2009 | 4 | 4 | | JSON |
Datasets with Photographic Records of People
Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | National Archives Catalog API query | JSON File |
---|---|---|---|---|---|
BHO-WHPO - Records of the White House Photo Office (Obama Administration) | | 8,010 | 8,010 | | JSON |
WJC-WHPO - Photographs of the White House Photograph Office (Clinton Administration) | Photographs Relating to the Clinton Administration, 1/20/1993 - 1/20/2001 | 499 | 499 | | JSON |
FL - Frank W. Legg Photographic Collection of Portraits of Nineteenth-Century Notables | | 91 | 91 | | JSON |
For questions about these datasets, please contact catalog@nara.gov.
Terminology
This table provides definitions of key terminology used on this page and in the related resources listed below to assist in understanding these datasets.
Term | Definition |
---|---|
Description | Archival metadata describing archival records. A description may or may not be associated with digital objects. |
Digital Object | Digital files associated with descriptions, usually digitized pages of archival records. Some are born-electronic records. |
Public Contribution | Tags, transcriptions, and comments added to descriptions and objects by citizen archivists, who are our public volunteers. |
Authority Record | Authoritative terms established by NARA staff to identify creators of and to establish links between records. |
Presenters
William A. Ingram is Assistant Dean and Director of IT for University Libraries at Virginia Tech. He received an M.S. in Library and Information Science from the University of Illinois at Urbana-Champaign in 2008. Since then, he has been involved in projects and services related to scholarly communication, digital preservation, repositories, and digital libraries. He is currently pursuing a Ph.D. in Computer Science at Virginia Tech with a dissertation focused on the application of NLP and machine/deep learning to large-scale scholarly data. Earlier this year, he was awarded a 3-year National Digital Infrastructures research grant from the Institute of Museum and Library Services (LG-37-19-0078-19-0) to study the application of computational methods and resources to large corpora of electronic theses and dissertations.
Sylvester A. Johnson is Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech. He is the founding director of Virginia Tech’s Center for Humanities, which supports human-centered research and humanistic approaches to the guidance of technology. Johnson’s research has examined religion, race, and empire in the Atlantic world; religion and sexuality; national security practices; and the impact of intelligent machines and human enhancement on human identity and race governance. In addition to co-facilitating a national working group on religion and the US empire, Johnson led an Artificial Intelligence project that developed a successful proof-of-concept machine learning application to ingest and analyze a humanities text. He is the author of The Myth of Ham in Nineteenth-Century American Christianity (Palgrave 2004), a study of race and religious hatred that won the American Academy of Religion’s Best First Book award; and African American Religions, 1500-2000 (Cambridge 2015), an award-winning interpretation of five centuries of democracy, colonialism, and freedom in the Atlantic world. Johnson has also co-edited The FBI and Religion: Faith and National Security Before and After 9/11 (University of California 2017). He is a founding co-editor of the Journal of Africana Religions. He is currently producing a digital scholarly edition of an early English history of global religions and writing a book on human identity in an age of intelligent machines and human-machine symbiosis.
In 2012, Pamela Wright was selected by the Archivist of the United States to be the first Chief Innovation Officer at the National Archives and Records Administration (NARA). Since then, she has focused on projects that combine NARA’s values to collaborate, innovate, and learn, with the exploration and use of emerging technologies. In support of NARA's strategic goals to make access happen, connect with customers, and maximize our value to the nation, she launched NARA's first social media program and Citizen Archivist program. She developed the agency’s digitization program, which has resulted in making 130 million records available through NARA's online Catalog. The Catalog newsletter reaches over 275,000 subscribers. Her focus on making the records shareable has resulted in substantial numbers of digital copies of NARA's records on platforms across the internet, including Wikipedia and Wikidata, Giphy, and more. She and her staff of archives specialists, community and project managers, user experience designers, and IT specialists run the agency’s web, description, next-generation finding aids, and digital reference programs. She is a member of the advisory board for the Digital Public Library of America. Prior to joining the National Archives, she worked as a research historian for Historical Research Associates. She holds degrees in English and history from the University of Montana.
Tanu Mitra is an Assistant Professor at the University of Washington, Information School, where she leads the Social Computing research group. She and her students study and build large-scale social computing systems to understand and counter problematic information online. Her research spans auditing online systems for misinformation and conspiratorial content, understanding digital misinformation in the context of the news ecosystem, unraveling narratives of online extremism and hate, and building technology to foster critical thinking online. Her work employs a range of interdisciplinary methods from the fields of human-computer interaction, data mining, machine learning, and natural language processing. Dr. Mitra’s work has been supported by grants from the NSF, DoD, Social Science One, and other foundations. Her research has been recognized through multiple awards and honors, including an NSF-CRII award, an ICTAS Junior Faculty Award, the Virginia Tech College of Engineering Outstanding New Assistant Professor Award, and Georgia Tech’s GVU Center’s Foley Scholarship for excellence in research innovation and potential impact, along with several best paper honorable mention awards. Dr. Mitra received her PhD in Computer Science from Georgia Tech’s School of Interactive Computing and her Masters in Computer Science from Texas A&M University.
Erica Boudreau is an archives specialist for data standards in the Office of Innovation's Digital Public Access Branch (VEO) at NARA. Her work focuses on archival description review and the preparation of digital objects and metadata provided by NARA's partners for import into the National Archives Catalog. Prior to joining VEO, she spent 12 years at the John F. Kennedy Presidential Library where she led digitization and description efforts and gained experience processing large and complex collections.
Jason Clingerman is the director of the Digital Engagement Division in the Office of Innovation at NARA. He has 13 years of experience in archives, archival standards, information systems, and user experience. Jason oversees the processes and platforms for making our nation's historical records publicly available online. In his role, he oversees the programs that contribute to two of NARA's strategic objectives to Make Access Happen: to make 500 million digitized pages of records available through the National Archives Catalog and to provide digital, next-generation finding aids to the holdings described in the Catalog.
Michael L. Knight is the Web Branch Chief in the Office of Innovation's Digital Engagement Division. He has led numerous web development projects since joining the agency in 2017, including efforts to redesign and migrate Presidential Library websites to the Drupal content management system and the NARA enterprise cloud. Michael is a Certified Scrum Professional (CSP-SM), Certified Scrum Master (CSM), and Project Management Professional (PMP).
Billy Wade joined the Still Picture Branch in July 2000 after studying history at Salisbury University. In 2005, he became an archivist specializing in the accessioning and processing of both analog and digital photography as well as the digitization of legacy holdings, and in 2012 he became one of two subject matter experts within the branch. In 2016, he transitioned to his current position as supervisor for Still Picture accessioning, processing, and digitization activities.
Presentations and Videos
Below are slide decks and recorded presentations relevant to these datasets and collaborations with external stakeholders in exploring the use cases of artificial intelligence and machine learning for NARA’s data.
Presentation Title | Presenter | Description | Slides | Recording |
---|---|---|---|---|
Opening Presentation (Why We’re Here: Ensuring Scholarly Access to Government Archives and Records) | William A. Ingram, Assistant Dean and Director of IT for University Libraries at Virginia Tech | Background information on the conference, partners involved, and overview of workshop goals. | ||
Workshop Series Overview | Sylvester A. Johnson, Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech | An overview of the conference series workshops, including processes, methods, and approaches. | ||
National Archives Catalog | Erica Boudreau, Archives Specialist | Overview of the National Archives Catalog, including NARA’s archival standard, the Lifecycle Data Requirements Guide (LCDRG); NARA’s archival hierarchy; what’s in the Catalog; the Catalog’s technology stack; how to search the Catalog user interface; and basics on the Catalog API. | YouTube | |
NARA Datasets for Artificial Intelligence / Machine Learning | Jason Clingerman, Digital Engagement Division Director | Description of the potential use cases for datasets listed on this page and more details on the datasets themselves. | YouTube | |
NARA’s Digital Personas | Michael L. Knight, Project Manager | Background on NARA’s digital personas, including the research that informed the development of the personas and overviews of the personas. | YouTube | |
Using Object Recognition with NARA Photographs and Other Graphic Materials | Billy Wade, Supervisory Archivist | Description of potential use cases for object recognition for NARA's photographs and other graphic materials. | YouTube |
Related Resources
- National Archives Catalog: NARA’s central repository for providing online access to digitized holdings, descriptions of holdings, authority records (e.g., people, organizations, geographic locations, and topical subjects), public contributions by citizen archivists, and indexed National Archives and Presidential Library web pages.
- National Archives Catalog API GitHub Repository: Contains documentation for using the National Archives Catalog API.
- Lifecycle Data Requirements Guide (LCDRG): NARA’s standard for describing archival records, organizations, persons, and digital objects.