Information for the NARA - Virginia Tech AI Conference, April-May 2021

National Archives and Virginia Tech

Welcome

Please read the Archivist's blog post about the conference, "Exploring the Future Together."

David S. Ferriero was confirmed as the 10th Archivist of the United States on November 6, 2009. Early in 2010, he committed the National Archives and Records Administration to the principles of Open Government: transparency, participation, and collaboration.

Previously, Mr. Ferriero served as the Andrew W. Mellon Director of the New York Public Libraries (NYPL). He was part of the leadership team responsible for integrating the four research libraries and 87 branch libraries into one seamless service for users, creating the largest public library system in the United States and one of the largest research libraries in the world.

Before joining the NYPL in 2004, Mr. Ferriero served in top positions at two of the nation's major academic libraries, the Massachusetts Institute of Technology in Cambridge, MA, and Duke University in Durham, NC. Mr. Ferriero earned bachelor's and master's degrees in English literature from Northeastern University in Boston and a master's degree from the Simmons College of Library and Information Science, also in Boston. Mr. Ferriero served as a Navy hospital corpsman during the Vietnam War.

NARA Datasets for Artificial Intelligence and Machine Learning

The following datasets represent a variety of the data available in the National Archives Catalog, including textual records, which make up the majority of NARA’s current holdings, and electronic records such as emails and databases.

These datasets may be accessed through the National Archives Catalog API or downloaded as JSON files. Each dataset contains all of the metadata for the records listed, including the URLs for the digital objects of the digitized or electronic records and, where present and noted below, object-level data such as citizen archivist tags and transcriptions or text extracted from the records by optical character recognition (OCR).
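As a minimal sketch of working with one of the downloadable JSON files: the layout of the JSON is not described on this page, so the example below assumes nothing about field names and simply walks whatever structure it finds, collecting every string value that looks like a web address. The function name `collect_urls` and the sample record are hypothetical and do not represent NARA's actual schema.

```python
import json
from typing import Any, Iterator


def collect_urls(node: Any) -> Iterator[str]:
    """Yield every string value in parsed JSON that starts with http:// or https://."""
    if isinstance(node, dict):
        for value in node.values():
            yield from collect_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_urls(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node


# Hypothetical stand-in for a downloaded file; real exports will have different keys.
sample = json.loads("""
{"description": {"title": "Example record",
                 "objects": [{"file": {"url": "https://example.org/page-1.jpg"}},
                             {"file": {"url": "https://example.org/page-2.jpg"}}]}}
""")

urls = list(collect_urls(sample))
print(urls)
```

For a real download, the `json.loads` call would be replaced by `json.load` on the opened file; the walker itself does not need to change.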

Datasets with Object-Level Metadata

This table provides information about datasets from the National Archives Catalog that have object-level metadata, which can serve as a baseline for comparison with what AI/ML tools can achieve. Object-level metadata includes both data produced by optical character recognition (OCR) and citizen archivist contributions. These datasets reflect only digitized textual records, which make up the vast majority of NARA’s holdings.
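One hedged way to picture the baseline comparison: where a record has both a volunteer transcription and machine output (OCR, or text produced by an AI/ML tool), a similarity score between the two gives a rough measure of agreement. The sketch below uses Python's standard `difflib.SequenceMatcher`. The function name `agreement` and the sample strings are invented for illustration, and this ratio is only one of several possible measures, not a method NARA prescribes.

```python
from difflib import SequenceMatcher


def agreement(reference: str, candidate: str) -> float:
    """Return a 0.0-1.0 similarity ratio after normalizing case and whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(reference), normalize(candidate)).ratio()


# Invented sample strings, not taken from any NARA record.
transcription = "Received of the Quartermaster one wagon and four mules."
ocr_output = "Recieved of the Quartermaster  one wagon and four rnules."

score = agreement(transcription, ocr_output)
print(f"agreement: {score:.2f}")
```

A score of 1.0 means the two texts match after normalization; lower scores indicate more disagreement between the machine output and the human reference.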

Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | OCR data? | Citizen archivist data? | National Archives Catalog API query | JSON file
RG 109 - War Department Collection of Confederate Records | Record Books of Executive, Legislative, and Judicial Offices of the Confederate Government, 1874 - 1899 | 94,063 | 1,401 | Yes | Yes | link | JSON
RG 92 - Office of the Quartermaster General | Correspondence, Reports, Telegrams, Applications, and Other Papers Relating to Burials of Service Personnel, 1915-1939 | 47,100 | 982 | Yes | Yes | link | JSON
RG 75 - Bureau of Indian Affairs | Correspondence Relating to Reindeer Herds in Alaska, 1911-1960 | 21,426 | 636 | No | Yes | link | JSON
RG 120 - Records of the American Expeditionary Forces (World War I) | Records of Divisions, 1918-1942 | 9,120 | 2,405 | Yes | Yes | link | JSON
RG 276 - U.S. Court of Appeals | Case Files, 1891 - 1997 | 8,238 | 27 | Yes | Yes | link | JSON
Collection LBJ-PCTJWHD | Lady Bird Johnson's Daily Diary, 12/1963 - 1/31/1969 | 4,017 | 1,604 | Yes | Yes | link | JSON
RG 412 - Environmental Protection Agency | Program Development Files on Seabrook Nuclear Power Plant, 1/1/1973 - 12/31/1979 | 3,547 | 29 | Yes | Yes | link | JSON
RG 341 - U.S. Air Force | Reports Regarding Proposed Air Force Academy Site Selection, 1950 - 1950 | 2,872 | 43 | Yes | No | link | JSON

Datasets without Object-Level Metadata

This table provides information about datasets from the National Archives Catalog that do not have object-level metadata but may be used to understand how AI/ML tools will apply to records without pre-existing object-level metadata and how the results may or may not meet both NARA’s needs and user expectations. These datasets describe digitized textual records, which are the vast majority of NARA’s holdings.

Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | OCR data? | Citizen archivist data? | National Archives Catalog API query | JSON file
RG 181 - Naval Districts and Shore Establishments | Shipyard Logs, 1888 - 1958 | 25,694 | 48 | No | No | link | JSON
RG 472 - U.S. Forces in Southeast Asia | General Records, 1965 - 1972 | 23,959 | 260 | No | No | link | JSON
RG 60 - Department of Justice | Files of Associate Deputy General Merrick B. Garland, 1994 - 1997 | 23,552 | 447 | No | No | link | JSON
RG 22 - U.S. Fish and Wildlife Service | Endangered Species Delisting Files, 1975 - 2000 | 7,144 | 42 | No | No | link | JSON

Datasets of Born-Electronic Records

Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | National Archives Catalog API query | JSON file
RG 541 - Assassination Records Review Board | Electronic Records Relating to John F. Kennedy Assassination Research, 4/1/1994 - 9/30/1998 [includes emails] | 83 | 83 | link | JSON
RG 330 - Office of the Secretary of Defense | Defense Casualty Analysis System (DCAS) Files, ca. 2001 - 3/16/2009 | 4 | 4 | link | JSON

Datasets with Photographic Records of People

Record Group / Collection | Series | Number of digital objects | Number of descriptions with digital objects | National Archives Catalog API query | JSON file
BHO-WHPO - Records of the White House Photo Office (Obama Administration) | Presidential Photographs, 1/20/2009 - 1/20/2017 | 8,010 | 8,010 | link | JSON
WJC-WHPO - Photographs of the White House Photograph Office (Clinton Administration) | Photographs Relating to the Clinton Administration, 1/20/1993 - 1/20/2001 | 499 | 499 | link | JSON
FL - Frank W. Legg Photographic Collection of Portraits of Nineteenth-Century Notables | Portraits, 1862 - 1884 | 91 | 91 | link | JSON

For questions about these datasets, please contact catalog@nara.gov.

Terminology 

This table provides definitions of key terminology used on this page and in the related resources listed below to assist in understanding these datasets.

Term | Definition
Description | Archival metadata describing archival records; may or may not be associated with digital objects.
Digital Object | Digital files associated with descriptions, usually digitized pages of archival records. Some are born-electronic records.
Public Contribution | Tags, transcriptions, and comments added to descriptions and objects by citizen archivists, who are our public volunteers.
Authority Record | Authoritative terms established by NARA staff to identify creators of records and to establish links between records.
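
The relationships among these terms can be pictured as a small data model. The sketch below is an illustration only: the class and field names are hypothetical and are not taken from the Catalog API or the LCDRG, and authority records are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PublicContribution:
    kind: str  # "tag", "transcription", or "comment"
    text: str


@dataclass
class DigitalObject:
    url: str
    contributions: List[PublicContribution] = field(default_factory=list)


@dataclass
class Description:
    title: str
    # A description may or may not have digital objects attached.
    objects: List[DigitalObject] = field(default_factory=list)
    contributions: List[PublicContribution] = field(default_factory=list)


# A description with one digitized page that carries a volunteer tag.
page = DigitalObject(
    url="https://example.org/page-1.jpg",
    contributions=[PublicContribution("tag", "reindeer")],
)
record = Description(title="Example file unit", objects=[page])

# A description with no digital objects at all.
bare = Description(title="Description with no digital objects")

print(len(record.objects), len(bare.objects))
```

Contributions appear on both the description and the object because, per the definitions above, citizen archivists can add them at either level.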

Presenters

William A. Ingram

William A. Ingram is Assistant Dean and Director of IT for University Libraries at Virginia Tech. He received an M.S. in Library and Information Science from the University of Illinois at Urbana-Champaign in 2008. Since then, he has been involved in projects and services related to scholarly communication, digital preservation, repositories, and digital libraries. He is currently pursuing a Ph.D. in Computer Science at Virginia Tech with a dissertation focused on the application of NLP and machine/deep learning to large-scale scholarly data. Earlier this year, he was awarded a 3-year National Digital Infrastructures research grant from the Institute of Museum and Library Services (LG-37-19-0078-19-0) to study the application of computational methods and resources to large corpora of electronic theses and dissertations.

Sylvester A. Johnson

Sylvester A. Johnson is Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech. He is the founding director of Virginia Tech’s Center for Humanities, which supports human-centered research and humanistic approaches to the guidance of technology. Sylvester’s research has examined religion, race, and empire in the Atlantic world; religion and sexuality; national security practices; and the impact of intelligent machines and human enhancement on human identity and race governance. In addition to co-facilitating a national working group on religion and the US empire, Johnson led an Artificial Intelligence project that developed a successful proof-of-concept machine learning application to ingest and analyze a humanities text. He is the author of The Myth of Ham in Nineteenth-Century American Christianity (Palgrave 2004), a study of race and religious hatred that won the American Academy of Religion’s Best First Book award; and African American Religions, 1500-2000 (Cambridge 2015), an award-winning interpretation of five centuries of democracy, colonialism, and freedom in the Atlantic world. Johnson has also co-edited The FBI and Religion: Faith and National Security Before and After 9/11 (University of California 2017). He is a founding co-editor of the Journal of Africana Religions. He is currently producing a digital scholarly edition of an early English history of global religions and writing a book on human identity in an age of intelligent machines and human-machine symbiosis.

Pamela Wright

In 2012, Pamela Wright was selected by the Archivist of the United States to be the first Chief Innovation Officer at the National Archives and Records Administration (NARA). Since then, she has focused on projects that combine NARA’s values to collaborate, innovate, and learn, with the exploration and use of emerging technologies. In support of NARA's strategic goals to make access happen, connect with customers, and maximize our value to the nation, she launched NARA's first social media program and Citizen Archivist program. She developed the agency’s digitization program, which has resulted in making 130 million records available through NARA's online Catalog. The Catalog newsletter reaches over 275,000 subscribers. Her focus on making the records shareable has resulted in substantial numbers of digital copies of NARA's records on platforms across the internet, including Wikipedia and Wikidata, Giphy, and more. She and her staff of archives specialists, community and project managers, user experience designers, and IT specialists run the agency’s web, description, next-generation finding aids, and digital reference programs. She is a member of the advisory board for the Digital Public Library of America. Prior to joining the National Archives, she worked as a research historian for Historical Research Associates. She holds degrees in English and history from the University of Montana.

Tanu Mitra

Tanu Mitra is an Assistant Professor at the University of Washington, Information School, where she leads the Social Computing research group. She and her students study and build large-scale social computing systems to understand and counter problematic information online. Her research spans auditing online systems for misinformation and conspiratorial content, understanding digital misinformation in the context of the news ecosystem, unraveling narratives of online extremism and hate, and building technology to foster critical thinking online. Her work employs a range of interdisciplinary methods from the fields of human-computer interaction, data mining, machine learning, and natural language processing.

Dr. Mitra’s work has been supported by grants from the NSF, DoD, Social Science One, and other foundations. Her research has been recognized through multiple awards and honors, including an NSF-CRII award, an ICTAS Junior Faculty Award, the Virginia Tech College of Engineering Outstanding New Assistant Professor Award, and Georgia Tech’s GVU Center’s Foley Scholarship for excellence in research innovation and potential impact, along with several best paper honorable mention awards. Dr. Mitra received her PhD in Computer Science from Georgia Tech’s School of Interactive Computing and her master's in Computer Science from Texas A&M University.

Erica Boudreau

Erica Boudreau is an archives specialist for data standards in the Office of Innovation's Digital Public Access Branch (VEO) at NARA. Her work focuses on archival description review and the preparation of digital objects and metadata provided by NARA's partners for import into the National Archives Catalog. Prior to joining VEO, she spent 12 years at the John F. Kennedy Presidential Library, where she led digitization and description efforts and gained experience processing large and complex collections.

Jason Clingerman

Jason Clingerman is the director of the Digital Engagement Division in the Office of Innovation at NARA. He has 13 years of experience in archives, archival standards, information systems, and user experience. Jason oversees the processes and platforms for making our nation's historical records publicly available online. In his role, he oversees the programs that contribute to two of NARA's strategic objectives to Make Access Happen: to make 500 million digitized pages of records available through the National Archives Catalog and to provide digital, next-generation finding aids to the holdings described in the Catalog.

Michael L. Knight

Michael L. Knight is the Web Branch Chief in the Office of Innovation's Digital Engagement Division. He has led numerous web development projects since joining the agency in 2017, including efforts to redesign and migrate Presidential Library websites to the Drupal content management system and the NARA enterprise cloud. Michael is a Certified Scrum Professional (CSP-SM), Certified Scrum Master (CSM), and Project Management Professional (PMP).

Billy Wade

Billy Wade joined the Still Picture Branch in July 2000 after studying history at Salisbury University. In 2005, he became an archivist specializing in the accessioning and processing of both analog and digital photography as well as digitization of legacy holdings, and in 2012 he became one of two subject matter experts within the branch. In 2016, he transitioned to his current position as supervisor for Still Picture accessioning, processing, and digitization activities.

Presentations and Videos

Below are slide decks and recorded presentations relevant to these datasets and collaborations with external stakeholders in exploring the use cases of artificial intelligence and machine learning for NARA’s data.

Presentation Title | Presenter | Description | Slides | Recording
Opening Presentation (Why We’re Here: Ensuring Scholarly Access to Government Archives and Records) | William A. Ingram, Assistant Dean and Director of IT for University Libraries at Virginia Tech | Background information on the conference, partners involved, and overview of workshop goals. | PDF |
Workshop Series Overview | Sylvester A. Johnson, Assistant Vice Provost for the Humanities and Executive Director of the “Tech for Humanity” initiative at Virginia Tech | An overview of the conference series workshops, including processes, methods, and approaches. | PDF |
National Archives Catalog | Erica Boudreau, Archives Specialist | Overview of the National Archives Catalog, including NARA’s archival standard, the Lifecycle Data Requirements Guide (LCDRG); NARA’s archival hierarchy; what’s in the Catalog; the Catalog’s technology stack; how to search the Catalog user interface; and basics on the Catalog API. | PDF | YouTube
NARA Datasets for Artificial Intelligence / Machine Learning | Jason Clingerman, Digital Engagement Division Director | Description of the potential use cases for datasets listed on this page and more details on the datasets themselves. | PDF | YouTube
NARA’s Digital Personas | Michael L. Knight, Project Manager | Background on NARA’s digital personas, including the research that informed the development of the personas and overviews of the personas. | PDF | YouTube
Using Object Recognition with NARA Photographs and Other Graphic Materials | Billy Wade, Supervisory Archivist | Description of potential use cases for object recognition for NARA's photographs and other graphic materials. | PDF | YouTube

Related Resources

  • National Archives Catalog: NARA’s central repository for providing online access to digitized holdings, descriptions of holdings, authority records (e.g., people, organizations, geographic locations, and topical subjects), public contributions by citizen archivists, and indexed National Archives and Presidential Library web pages.
  • National Archives Catalog API GitHub Repository: Contains documentation for using the National Archives Catalog API.
  • Lifecycle Data Requirements Guide (LCDRG): NARA’s standard for describing archival records, organizations, persons, and digital objects.
