Information for the NARA - Virginia Tech AI Conference, April-May 2021

Welcome

Please read the Archivist's blog post about the conference, Exploring the Future Together.

David S. Ferriero was confirmed as 10th Archivist of the United States on November 6, 2009. Early in 2010, he committed the National Archives and Records Administration to the principles of Open Government—transparency, participation, and collaboration.

Previously, Mr. Ferriero served as the Andrew W. Mellon Director of the New York Public Libraries (NYPL). He was part of the leadership team responsible for integrating the four research libraries and 87 branch libraries into one seamless service for users, creating the largest public library system in the United States and one of the largest research libraries in the world.

Before joining the NYPL in 2004, Mr. Ferriero served in top positions at two of the nation's major academic libraries, the Massachusetts Institute of Technology in Cambridge, MA, and Duke University in Durham, NC. Mr. Ferriero earned bachelor's and master's degrees in English literature from Northeastern University in Boston and a master's degree from the Simmons College of Library and Information Science, also in Boston. Mr. Ferriero served as a Navy hospital corpsman during the Vietnam War.

NARA Datasets for Artificial Intelligence and Machine Learning

The following datasets represent a variety of the data available in the National Archives Catalog including textual records, which make up the majority of NARA’s current holdings, and electronic records such as emails and databases.

These datasets may be accessed through the National Archives Catalog API or downloaded as JSON files. Each dataset contains all of the metadata for the records listed, including the URLs of the digital objects for the digitized or electronic records. Where present and noted below, it also includes object-level data such as citizen archivist tags and transcriptions, or text extracted from the records by optical character recognition (OCR).
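As a rough illustration of working with a downloaded JSON file, the sketch below collects every URL-like string from a parsed JSON document. The layout of the actual dataset files is not described on this page, so the sketch deliberately assumes no field names and simply walks the whole structure. The sample data, its keys, and the example.org addresses are invented for illustration only.

```python
import json
from typing import Any, Iterator


def iter_urls(node: Any) -> Iterator[str]:
    """Yield every string value that looks like an http(s) URL, at any depth."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node


# Hypothetical stand-in for a downloaded dataset file; real field names may differ.
sample = json.loads("""
{
  "records": [
    {"title": "Example description",
     "objects": [{"file": "https://example.org/object-1.jpg"},
                 {"file": "https://example.org/object-2.jpg"}]},
    {"title": "Description without objects", "objects": []}
  ]
}
""")

urls = list(iter_urls(sample))
print(len(urls), "URL(s) found")
```

For a real file, `json.loads` on the inline string would be replaced by `json.load` on the opened download. A schema-aware reader would be preferable once the field names are confirmed against the Catalog API documentation.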

Datasets with Object-Level Metadata

This table provides information about datasets from the National Archives Catalog that have object-level metadata, which can serve as a baseline for comparison with what AI/ML tools can achieve. Object-level metadata includes both data produced by optical character recognition (OCR) and citizen archivist contributions. These datasets reflect only digitized textual records, which make up the vast majority of NARA’s holdings.

Record Group / Collection	Series	Number of digital objects	Number of descriptions with digital objects	OCR data?	Citizen archivist data?	National Archives Catalog API query	JSON file
RG 109 - War Department Collection of Confederate Records	Record Books of Executive, Legislative, and Judicial Offices of the Confederate Government, 1874 - 1899	94,063	1,401	Yes	Yes	link	JSON
RG 92 - Office of the Quartermaster General	Correspondence, Reports, Telegrams, Applications, and Other Papers Relating to Burials of Service Personnel, 1915-1939	47,100	982	Yes	Yes	link	JSON
RG 75 - Bureau of Indian Affairs	Correspondence Relating to Reindeer Herds in Alaska, 1911-1960	21,426	636	No	Yes	link	JSON
RG 120 - Records of the American Expeditionary Forces (World War I)	Records of Divisions, 1918-1942	9,120	2,405	Yes	Yes	link	JSON
RG 276 - U.S. Court of Appeals	Case Files, 1891 - 1997	8,238	27	Yes	Yes	link	JSON
Collection LBJ-PCTJWHD	Lady Bird Johnson's Daily Diary, 12/1963 - 1/31/1969	4,017	1,604	Yes	Yes	link	JSON
RG 412 - Environmental Protection Agency	Program Development Files on Seabrook Nuclear Power Plant, 1/1/1973 - 12/31/1979	3,547	29	Yes	Yes	link	JSON
RG 341 - U.S. Air Force	Reports Regarding Proposed Air Force Academy Site Selection, 1950 - 1950	2,872	43	Yes	No	link	JSON
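
One way the existing object-level metadata might serve as a baseline is to score a tool's text output against a citizen archivist transcription of the same page. The sketch below is only an assumption about how such a comparison could be set up. It uses the similarity ratio from Python's standard `difflib` module rather than any metric chosen by NARA, and the two sample strings are invented.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not penalized."""
    return " ".join(text.lower().split())


def similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 similarity ratio between two texts after normalization."""
    matcher = SequenceMatcher(
        None, normalize(reference), normalize(candidate), autojunk=False
    )
    return matcher.ratio()


# Invented sample strings, for illustration only.
transcription = "Report on the reindeer herd at Nome"
tool_output = "Report  on the reindeer\nherd at Nome"

score = similarity(transcription, tool_output)
print(f"similarity: {score:.2f}")
```

A ratio near 1 means the tool output closely matches the reference text. A production evaluation would more likely use character or word error rate, which this sketch does not compute.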

Datasets without Object-Level Metadata

This table provides information about datasets from the National Archives Catalog that do not have object-level metadata but may be used to understand how AI/ML tools will apply to records without pre-existing object-level metadata and how the results may or may not meet both NARA’s needs and user expectations. These datasets describe digitized textual records, which are the vast majority of NARA’s holdings.

Record Group / Collection	Series	Number of digital objects	Number of descriptions with digital objects	OCR data?	Citizen archivist data?	National Archives Catalog API query	JSON file
RG 181 - Naval Districts and Shore Establishments	Shipyard Logs, 1888 - 1958	25,694	48	No	No	link	JSON
RG 472 - U.S. Forces in Southeast Asia	General Records, 1965 - 1972	23,959	260	No	No	link	JSON
RG 60 - Department of Justice	Files of Associate Deputy General Merrick B. Garland, 1994 - 1997	23,552	447	No	No	link	JSON
RG 22 - U.S. Fish and Wildlife Service	Endangered Species Delisting Files, 1975 - 2000	7,144	42	No	No	link	JSON

Datasets of Born-Electronic Records

Record Group / Collection	Series	Number of digital objects	Number of descriptions with digital objects	National Archives Catalog API query	JSON file
RG 541 - Assassination Records Review Board	Electronic Records Relating to John F. Kennedy Assassination Research, 4/1/1994 - 9/30/1998 [includes emails]	83	83	link	JSON
RG 330 - Office of the Secretary of Defense	Defense Casualty Analysis System (DCAS) Files, ca. 2001 - 3/16/2009	4	4	link	JSON

Datasets with Photographic Records of People

Record Group / Collection	Series	Number of digital objects	Number of descriptions with digital objects	National Archives Catalog API query	JSON file
BHO-WHPO - Records of the White House Photo Office (Obama Administration)	Presidential Photographs, 1/20/2009 - 1/20/2017	8,010	8,010	link	JSON
WJC-WHPO - Photographs of the White House Photograph Office (Clinton Administration)	Photographs Relating to the Clinton Administration, 1/20/1993 - 1/20/2001	499	499	link	JSON
FL - Frank W. Legg Photographic Collection of Portraits of Nineteenth-Century Notables	Portraits, 1862 - 1884	91	91	link	JSON

For questions about these datasets, please contact catalog@nara.gov.

Terminology

This table provides definitions of key terminology used on this page and in the related resources listed below to assist in understanding these datasets.

Term	Definition
Description	Archival metadata describing archival records. A description may or may not be associated with digital objects.
Digital Object	Digital files associated with descriptions, usually digitized pages of archival records. Some are born-electronic records.
Public Contribution	Tags, transcriptions, and comments added to descriptions and objects by citizen archivists, who are our public volunteers.
Authority Record	Authoritative terms established by NARA staff to identify creators of and to establish links between records.
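
The relationships among these terms can be sketched as a small data model. This is an illustrative reading of the definitions above, not the Catalog's actual schema. Every class name, field name, and sample value below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuthorityRecord:
    term: str  # authoritative term, e.g. identifying the creator of records


@dataclass
class PublicContribution:
    kind: str  # "tag", "transcription", or "comment"
    text: str


@dataclass
class DigitalObject:
    url: str
    contributions: List[PublicContribution] = field(default_factory=list)


@dataclass
class Description:
    title: str
    creators: List[AuthorityRecord] = field(default_factory=list)
    objects: List[DigitalObject] = field(default_factory=list)  # may be empty
    contributions: List[PublicContribution] = field(default_factory=list)

    def has_objects(self) -> bool:
        """True when at least one digital object is attached."""
        return bool(self.objects)


# Hypothetical sample: one description with one transcribed digital object.
desc = Description(
    title="Example description",
    creators=[AuthorityRecord(term="Example creator")],
    objects=[
        DigitalObject(
            url="https://example.org/page-1.jpg",
            contributions=[PublicContribution(kind="transcription", text="Example text")],
        )
    ],
)
print(desc.has_objects(), len(desc.objects[0].contributions))
```

Contributions appear on both descriptions and objects because the definition above says citizen archivists add them to either.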

Presenters

	William A. Ingram is Assistant Dean and Director of IT for University Libraries at Virginia Tech. He received an M.S. in Library and Information Science from the University of Illinois at Urbana-Champaign in 2008. Since then, he has been involved in projects and services related to scholarly communication, digital preservation, repositories, and digital libraries. He is currently pursuing a Ph.D. in Computer Science at Virginia Tech with a dissertation focused on the application of NLP and machine/deep learning to large-scale scholarly data. Earlier this year, he was awarded a 3-year National Digital Infrastructures research grant from the Institute of Museum and Library Services (LG-37-19-0078-19-0) to study the application of computational methods and resources to large corpora of electronic theses and dissertations.
	Sylvester A. Johnson is Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech. He is the founding director of Virginia Tech’s Center for Humanities, which is supporting human-centered research and humanistic approaches to the guidance of technology. Sylvester’s research has examined religion, race, and empire in the Atlantic world; religion and sexuality; national security practices; and the impact of intelligent machines and human enhancement on human identity and race governance. In addition to co-facilitating a national working group on religion and the US empire, Johnson led an Artificial Intelligence project that developed a successful proof-of-concept machine learning application to ingest and analyze a humanities text. He is the author of The Myth of Ham in Nineteenth-Century American Christianity (Palgrave 2004), a study of race and religious hatred that won the American Academy of Religion’s Best First Book award; and African American Religions, 1500-2000 (Cambridge 2015), an award-winning interpretation of five centuries of democracy, colonialism, and freedom in the Atlantic world. Johnson has also co-edited The FBI and Religion: Faith and National Security Before and After 9/11 (University of California 2017). He is a founding co-editor of the Journal of Africana Religions. He is currently producing a digital scholarly edition of an early English history of global religions and writing a book on human identity in an age of intelligent machines and human-machine symbiosis.
	In 2012, Pamela Wright was selected by the Archivist of the United States to be the first Chief Innovation Officer at the National Archives and Records Administration (NARA). Since then, she has focused on projects that combine NARA’s values to collaborate, innovate, and learn, with the exploration and use of emerging technologies. In support of NARA's strategic goals to make access happen, connect with customers, and maximize our value to the nation, she launched NARA's first social media program and Citizen Archivist program. She developed the agency’s digitization program, which has resulted in making 130 million records available through NARA's online Catalog. The Catalog newsletter reaches over 275,000 subscribers. Her focus on making the records shareable has resulted in substantial numbers of digital copies of NARA's records on platforms across the internet, including Wikipedia and Wikidata, Giphy, and more. She and her staff of archives specialists, community and project managers, user experience designers, and IT specialists run the agency’s web, description, next-generation finding aids, and digital reference programs. She is a member of the advisory board for the Digital Public Library of America. Prior to joining the National Archives, she worked as a research historian for Historical Research Associates. She holds degrees in English and history from the University of Montana.
	Tanu Mitra is an Assistant Professor at the University of Washington, Information School, where she leads the Social Computing research group. She and her students study and build large-scale social computing systems to understand and counter problematic information online. Her research spans auditing online systems for misinformation and conspiratorial content, understanding digital misinformation in the context of the news ecosystem, unraveling narratives of online extremism and hate, and building technology to foster critical thinking online. Her work employs a range of interdisciplinary methods from the fields of human-computer interaction, data mining, machine learning, and natural language processing. Dr. Mitra’s work has been supported by grants from the NSF, DoD, Social Science One, and other foundations. Her research has been recognized through multiple awards and honors, including an NSF-CRII award, an ICTAS Junior Faculty Award, the Virginia Tech College of Engineering Outstanding New Assistant Professor Award, and Georgia Tech’s GVU Center’s Foley Scholarship for excellence in research innovation and potential impact, along with several best paper honorable mention awards. Dr. Mitra received her PhD in Computer Science from Georgia Tech’s School of Interactive Computing and her master’s in Computer Science from Texas A&M University.
	Erica Boudreau is an archives specialist for data standards in the Office of Innovation's Digital Public Access Branch (VEO) at NARA. Her work focuses on archival description review and the preparation of digital objects and metadata provided by NARA's partners for import into the National Archives Catalog. Prior to joining VEO, she spent 12 years at the John F. Kennedy Presidential Library where she led digitization and description efforts and gained experience processing large and complex collections.
	Jason Clingerman is the director of the Digital Engagement Division in the Office of Innovation at NARA. He has 13 years of experience in archives, archival standards, information systems, and user experience. Jason oversees the processes and platforms for making our nation's historical records publicly available online. In his role, he oversees the programs that contribute to two of NARA's strategic objectives to Make Access Happen: to make 500 million digitized pages of records available through the National Archives Catalog and to provide digital, next-generation finding aids to the holdings described in the Catalog.
	Michael L. Knight is the Web Branch Chief in the Office of Innovation's Digital Engagement Division. He has led numerous web development projects since joining the agency in 2017, including efforts to redesign and migrate Presidential Library websites to the Drupal content management system and the NARA enterprise cloud. Michael is a Certified Scrum Professional (CSP-SM), Certified Scrum Master (CSM), and Project Management Professional (PMP).
	Billy Wade joined the Still Picture Branch in July 2000 after studying history at Salisbury University. In 2005, he became an archivist specializing in the accessioning and processing of both analog and digital photography as well as the digitization of legacy holdings, and in 2012 he became one of two subject matter experts within the branch. In 2016, he transitioned to his current position as supervisor for Still Picture accessioning, processing, and digitization activities.

Presentations and Videos

Below are slide decks and recorded presentations relevant to these datasets and collaborations with external stakeholders in exploring the use cases of artificial intelligence and machine learning for NARA’s data.

Presentation Title	Presenter	Description	Slides	Recording
Opening Presentation (Why We’re Here: Ensuring Scholarly Access to Government Archives and Records)	William A. Ingram, Assistant Dean and Director of IT for University Libraries at Virginia Tech	Background information on the conference, partners involved, and overview of workshop goals.	PDF
Workshop Series Overview	Sylvester A. Johnson, Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech	An overview of the conference series workshops, including processes, methods, and approaches.	PDF
National Archives Catalog	Erica Boudreau, Archives Specialist	Overview of the National Archives Catalog, including NARA’s archival standard, the Lifecycle Data Requirements Guide (LCDRG); NARA’s archival hierarchy; what’s in the Catalog; the Catalog’s technology stack; how to search the Catalog user interface; and basics on the Catalog API.	PDF	YouTube
NARA Datasets for Artificial Intelligence / Machine Learning	Jason Clingerman, Digital Engagement Division Director	Description of the potential use cases for datasets listed on this page and more details on the datasets themselves.	PDF	YouTube
NARA’s Digital Personas	Michael L. Knight, Project Manager	Background on NARA’s digital personas, including the research that informed the development of the personas and overviews of the personas.	PDF	YouTube
Using Object Recognition with NARA Photographs and Other Graphic Materials	Billy Wade, Supervisory Archivist	Description of potential use cases for object recognition for NARA's photographs and other graphic materials.	PDF	YouTube

Related Resources

National Archives Catalog: NARA’s central repository for providing online access to digitized holdings, descriptions of holdings, authority records (e.g., people, organizations, geographic locations, and topical subjects), public contributions by citizen archivists, and indexed National Archives and Presidential Library web pages.
National Archives Catalog API GitHub Repository: Contains documentation for using the National Archives Catalog API.
Lifecycle Data Requirements Guide (LCDRG): NARA’s standard for describing archival records, organizations, persons, and digital objects.
