Selected Publications from Our Applied Research Partners
“Services Needed for Management, Preservation and Access to Digital Records of Scientific & Engineering Research”
Description: “This is a position paper from two of our Research Partners – Bill Underwood from Georgia Tech Research Institute (GTRI) and Richard Marciano from the University of North Carolina at Chapel Hill (UNC-CH) for the Research Data Management Implementations Workshop held in Arlington, VA on March 13-14, 2013. The paper describes their efforts to develop services for managing the digital data and records of active scientific and engineering research projects and for preserving and providing access to the digital data and records after the conclusion of the projects.”
Tags: “2013,applied research,richard marciano,william underwood,access,digital preservation,metadata,file format identification,gtri,unc-ch,file format validation”
"Assessing the Preservation Condition of Large and Heterogeneous Electronic Records Collections with Visualization"
Description: “As collections become larger in size, more complex in structure and increasingly diverse in composition, new approaches are needed to help curators assess digital files and make decisions about their long-term preservation. We present research on the use of interactive visualization to analyze file characterization information for the purpose of assessing the preservation condition of a vast collection of complex electronic records. The case study collection contains over 1,000,000 files of diverse formats arranged in varied record structures and record groups. The visualization application uses treemaps and a relational database management system (RDBMS) to represent the collection’s arrangement and to show available characterization information at different levels of aggregation, classification, and abstraction. Through this visualization interface, curators can interact dynamically with the collections’ characterization information to discover trends, as well as compare and contrast various file characteristics across the collection. Archivists may select and weight the variables that they want to analyze. They can pursue analysis workflows that go from a high-level overview of the collection’s preservation condition based on file format risks to obtaining more detailed results about the condition of record groups and individual records. While there are various digital preservation planning tools available, to our knowledge none have been designed specifically to visually present assessment information across vast and complex collections. We present research to address the need for such a tool."
Tags: "2011,applied research,digital preservation,electronic records,maria esteva,nara,nara transcontinental persistent archives prototyp,suyog dott jain,tacc,visualization,weijia xu"
"Visualization for Archival Appraisal of Large Digital Collections"
Description: “Our research examines data-driven visualization methods for archival purposes. Using data extracted from a large and heterogeneous digital collection, we created an information visualization that uses RDBMS and treemap to enable archival analysis. Different views present the collection’s structure and properties at different levels of aggregation and abstraction, transforming 1,000,000 data points into information that enables observation and decision-making."
Tags: "2010,access,applied research,appraisal,archival process,digital preservation,electronic records,information management,nara,nara transcontinental persistent archives prototyp,scalability,tacc,visualization"
"An Alternate Approach to the Exchange of Ship Product Model Data"
Description: “This paper considers an alternative approach to the exchange of ship product model data based on general-purpose STEP application protocols. The vision is to provide the functionality defined in the shipbuilding application protocols using a combination of STEP AP239, AP214, and reference data libraries. It is expected that AP239 translators will soon be available, thus enabling the exchange of significant portions of ship product model data."
Tags: "2007,applied research,cad,digital preservation,digital preservation of complex engineering data,electronic records,iso 10303,nara,navsea,step data exchange"
"NASA Report to NARA on OAIS Based Federated Registry/Repository Research: May 2005-January 2006"
Description: “The National Aeronautics and Space Administration’s Goddard Space Flight Center through its Space Sciences and Exploration Directorate’s National Space Science Data Center is performing research into advanced information encapsulation, information models and procedures, and highly scalable ingest mechanisms based on the Open Archival Information System Reference Model (ISO 14721:2003) and the emerging XML Formatted Data Unit (XFDU) technologies for contributions supporting NARA’s requirements to provide the American public with access to federal, presidential, and congressional electronic records collections.”
Tags: "2006,access,applied research,electronic records,information packaging,ingest,nara,nasa,oais,scalability,xfdu"
"Examples of Performative Sentences in Presidential Records Working Paper 09-01_1"
Description: “Underwood argues that an archivist’s ability to understand the acts carried out by records is fundamental to his capabilities to describe and review records for possible restrictions on disclosure. To support archivists in performing these tasks, he proposes an approach for automatically recognizing the speech acts performed by the sentences in electronic records. This method is dependent, in part, on the capability to recognize performative verbs and interpret performative sentences. Verbs like recommend, request, and promise whose action is accomplished merely by saying them or writing them are termed performative verbs. A performative verb has a performative use in a performative sentence if the form of the verb is first person (singular or plural), present tense, indicative, and active (or passive) voice. There are also performative sentences in which the verb is in the present continuous, in nominalized form, or in the passive voice. In this report, 201 performative verbs are defined and examples are provided that were found in the Public Papers of the Presidents. These examples are being analyzed to determine syntactic and semantic features that will enable the implementation of that part of the speech act recognition method for recognizing and interpreting performative sentences.”
Tags: "2009,access,applied research,army research lab,automated archival description,content summarization,document type identification,electronic records,george hw bush,gtri,nara,presidential records,speech act recognition"
File: WP 09-01_1
"Advanced Decision Support for Archival Processing of Presidential Electronic Records: Final Scientific and Technical Report TR 09-05"
Description: “The overall objective of this project is to develop and apply advanced information technology to decision problems that archivists at the Presidential Libraries encounter when processing electronic records. Among issues and problems to be addressed are areas responsive to national security, including automated content analysis, automatic summarization, advanced information retrieval, advanced support of decision making for access restrictions and declassification, information security, and Global Information Grid technology, which are also important research areas for the U.S. Army."
Tags: "2009,access,applied research,army research lab,content summarization,decision support for archival review,declassification,electronic records,file format identification,foia,george hw bush,gtri,information extraction,nara,perpos,presidential records"
"An Overview of 3D Data Content, File Formats and Viewers NCSA isda-2008-002"
Description: “This report presents an overview of 3D data content, 3D file formats, and 3D viewers. It attempts to enumerate the past and current file formats used for storing 3D data and several software packages for viewing 3D data. The report also provides more specific details on a subset of file formats, as well as several pointers to existing 3D data sets. This overview serves as a foundation for understanding the information loss introduced by 3D file format conversions with many of the software packages designed for viewing and converting 3D data files."
Tags: "2008,3d file formats,digital preservation,digital preservation of complex engineering data,electronic records,file format migration,kenton mchenry,measuring information loss,nara,ncsa,peter bajcsy,technical report"
"Experiments in Data Format Interoperation Using Defuddle"
Description: “This document discusses the Defuddle parser and work conducted as part of the “Innovative Systems and Software: Applications to NARA Research Problems” project. Robust sharing, reuse, and curation of data requires a clean separation of issues related to bits, formats, and logical content. To address these issues the Open Grid Forum is defining the Data Format Description Language (DFDL) standard for describing the structure of binary and textual files and data streams so that their format, structure, and metadata can be exposed as XML. While this is sufficient for describing the internal layout of data (the “syntax”), interoperability and curation also require a description of logical relationships within and between data sets in terms of globally understood concepts (the “semantics”).”
Tags: "2009,access,applied research,dfdl,digital preservation,electronic records,file format obsolescence,nara,ncsa"
"Recovery of a Digital Image Collection Through the SDSC/UMD/NARA Prototype Persistent Archive"
Description: “The San Diego Supercomputer Center (SDSC), the University of Maryland, and the National Archives and Records Administration (NARA) are collaborating on building a pilot persistent archive using and extending data grid and digital library technologies. The current prototype consists of node servers at SDSC, University of Maryland, and NARA, connected through the Storage Request Broker (SRB) data grid middleware, and currently holds several terabytes of NARA selected collections. In particular, a historically important image collection that was on the verge of becoming inaccessible was fully restored and ingested into our pilot system. In this report, we describe the methodology behind our approach to fully restore this image collection and the process used to ingest it into the prototype persistent archive."
Tags: "2004,data recovery,digital preservation,electronic records,file format obsolescence,hardware obsolescence,ingest,nara,nara transcontinental persistent archives prototyp,regular expressions,sdsc,srb,umiacs"
"PERPOS II: Annual Technical Status Report July 1, 2004 – June 30, 2005 TR 05-4"
Description: “Annual Technical Report for the GTRI Presidential Electronic Records PilOt System (PERPOS) II Project.”
Tags: "2005,access,applied research,army research lab,automated archival description,content extraction,decision support for archival review,document type recognition,electronic records,foia,george hw bush,gtri,information assurance,information extraction,knowledge representation,nara,natural language processing,perpos,presidential records,summarization"
File: PERPOS TR 05-4
"Tradeoff Studies about Storage and Retrieval Efficiency of Boundary Data Representations for LLS, TIGER and DLG Data Structures"
Description: “We present our theoretical comparisons and experimental evaluations of three boundary data representations in terms of storage and information retrieval efficiency. We focus on three boundary data representations, such as, location list data structure (LLS), digital line graphs (DLGs) and topologically integrated geographic encoding and referencing (TIGER) data organizations. These three boundary data representations are used frequently in the GIS domain, and are known as ESRI Shapefiles (LLS), the SSURGO DLG-3 soil files (DLG), and the U.S. Census Bureau 2000 TIGER/Line files (TIGER). Boundary information is viewed as an efficient representation of image documents describing spatial regions. The goal of our work is to study the impacts of choosing boundary information representation on document image management and information retrieval, as well as to improve our understanding of the processing noise introduced during representation conversions. Our storage and retrieval efficiency tradeoff evaluations are based on load time, computer memory, and hard disk space requirements. The experimental measurements are obtained with test data sets derived from the SSURGO DLG-3 soil files and the U.S. Census Bureau 2000 TIGER/Line files. Based on our experiments, we concluded that LLS files will provide the fastest boundary retrieval (40 times faster than TIGER and 2.5 times faster than DLG) at the price of file size (storage redundancy for LLS files is between 70% and 180% in our experiments). DLG format offers a smaller file size, but is less efficient for boundary retrieval. TIGER format also offers a compact physical representation, at the cost of more processing for boundary retrievals.”
Tags: "2005,applied research,david clutter,dlg,esri,file format migration,geospatial electronic records,gis,lls,nara,ncsa,peter bajcsy,scalability,tiger"
"Factors Affecting I/O performance When Accessing Large Arrays in HDF5 on NCSA’s TeraGrid Cluster"
Description: “Achieving good I/O performance depends on a number of interacting components and how these are configured. These include the access patterns of the application itself, the architecture of the computing system, middleware such as I/O libraries, the type of file system, the mode of access, data set size, data storage layout, and the external storage configuration. This study of I/O performance on NCSA’s TeraGrid computing system examines the role of access patterns and types of I/O with arrays stored in HDF5 using different storage layouts. Access patterns involve accessing data from an image by one or more columns per access. Three different I/O modes are compared (serial, independent parallel, collective parallel), involving a relatively small and large array (48 MB vs. 1 GB), and using both chunked and contiguous storage layouts. In the parallel modes, the effect of the number of nodes was also included. For the configuration used in the study both collective and independent I/O were better than serial I/O for 8 or more processors when accessing the large image, but not when accessing the small images. Collective I/O performed best when data was stored contiguously. Chunking improved all I/O modes for the access patterns tested, especially for serial I/O, and especially for the large image. Chunking was particularly effective for independent I/O when the contents of chunks corresponded closely to the access patterns.”
Tags: "2005,applied research,bandwidth,electronic records,hdf,mike folk,nara,ncsa,network performance,scalability,teragrid,vailin choi"
File: Factors Affecting IO Performance
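The closing observation of this description, that chunking helps most when chunk contents match the access pattern, can be illustrated with a simple counting model. The sketch below is not taken from the study and measures nothing: it only counts how many chunks a column-wise read intersects in a two-dimensional array. The function name, array size, and chunk shapes are all hypothetical.

```python
from math import ceil


def chunks_touched(rows: int, cols: int, chunk_rows: int, chunk_cols: int,
                   first_col: int, n_cols: int) -> int:
    """Count the chunks intersected by reading n_cols full columns.

    The array is rows x cols, tiled by chunk_rows x chunk_cols chunks.
    The read covers columns first_col .. first_col + n_cols - 1.
    """
    if min(rows, cols, chunk_rows, chunk_cols, n_cols) <= 0 or first_col < 0:
        raise ValueError("sizes must be positive and first_col non-negative")
    if first_col + n_cols > cols:
        raise ValueError("requested columns fall outside the array")
    # A full-column read crosses every chunk row.
    chunk_rows_crossed = ceil(rows / chunk_rows)
    # Index of the first and last chunk column the read overlaps.
    first_chunk_col = first_col // chunk_cols
    last_chunk_col = (first_col + n_cols - 1) // chunk_cols
    return chunk_rows_crossed * (last_chunk_col - first_chunk_col + 1)


# Hypothetical 1024 x 1024 array, reading 16 columns at a time.
aligned = chunks_touched(1024, 1024, 1024, 16, 0, 16)     # tall, narrow chunks
misaligned = chunks_touched(1024, 1024, 16, 1024, 0, 16)  # wide row strips
print(aligned, misaligned)
```

Under these assumed shapes, the column-aligned layout touches a single chunk while the row-strip layout touches every chunk in the array. That is one plausible reading of why alignment between chunks and access pattern mattered in the study.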
"NARA Email Collections Summary Report "
Description: “A summary report on activities undertaken by the National Center for Supercomputing Applications to explore the applicability of scientific data management tools and techniques for the records management requirements of the National Archives and Records Administration. This work specifically focused on collections of email."
Tags: "2004,applied research,data management,data mining,electronic records,email,nara,ncsa,records"
File: NARA E-mail Collections Summary Report
"A Security Architecture For A Web Portal Of Sensitive Archival Records"
Description: “Web portals are ubiquitous, but an operational portal containing sensitive electronic archival records has not existed in a public network due to security concerns; therefore, setting one up requires serious consideration of defensive security measures. The key topic of this paper is a security architecture for the protection of an experimental portal that provides its stakeholders with convenient and low-cost means for sharing and processing sensitive electronic archival records. The portal also serves as a testbed for conducting empirical research activities in support of the future building of a secure operational portal of sensitive electronic archival records. The proposed architectural framework applies the technical facet of the defense-in-depth strategy that was developed and widely implemented by the Department of Defense.”
Tags: "2004,access,applied research,army research lab,binh nguyen,declassification,defense in depth strategy,electronic records,foia,information assurance,security architecture,sensitive electronic archives"
File: A Security Architecture for a Web Portal of Sensitive Archival Records
"Mobile Agents For Distributed Processing Of Electronic Records Archives"
Description: “Distributive processing of electronic records archives (ERA) could be a niche for mobile agents to outrival established client-server network computing methods. Archival records are usually very large computer files that consume a large bandwidth when they are transferred from one networked computer to another. This paper describes an ideational ERA distributive processing scenario in which the advantages of using mobile agents or knowledgeable objects could outweigh disadvantages and overcome some of the hurdles that are facing mobile agents."
Tags: "2004,applied research,army research lab,bandwidth,binh nguyen,distributed processing,electronic records,knowledgeable objects,mobile agents,nara"
File: Mobile Agents for Distributed Processing of Electronic Records Archives
"A Virtual Test Bed for Distributed Processing Of Archives"
Description: “Performing empirical research requiring numerous interactions with web servers and transfers of very large archival data files without affecting operational information system infrastructure is highly desirable by separating unsteady test-bed environments from steady administrative networks. The key theme of this paper is the building of a virtual network of heterogeneous computing systems using virtual machine technologies. The virtual network exists to provide a low-cost and convenient environment for conducting applied research in support of operational requirements for the secure transfer and storage of distributed archival electronic records over a public network, such as the Internet. First, the paper describes a virtual network environment in which security technologies and methods are being developed, tested, and evaluated. Second, it discusses the concepts and methods for building such a virtual system. The paper then concludes with positive preliminary results and suggestions for processing electronic records archives using virtual machine technologies."
Tags: "2004,applied research,army research lab,binh nguyen,digital preservation,electronic records,nara"
File: A Virtual Test Bed for Distributed Processing of Archives
"Integrating Multi-Touch In High-Resolution Display Environments"
Description: “High-resolution display environments consisting of many individual displays arrayed to form a single visible surface are commonly used to present large scale data. Using these displays often involves a control paradigm where interactions become cumbersome and non-intuitive. By combining high-resolution displays with multi-touch and gesture interactive hardware, researchers can explore data more naturally, efficiently and collaboratively. This fusion of technology is necessary to effectively use tiled-display environments and mediate their primary weakness - interaction. In order to realize these objectives, a team at the Texas Advanced Computing Center (TACC) developed an economical display system using a combination of commodity hardware and customized software. In this paper, we explain the requirements, design process, functions and best practices for constructing such displays. In addition, we explain how these systems can be used effectively with application examples.”
Tags: "access,applied research,electronic records,maria esteva,nara,tacc,visualization,weijia xu"
"Preservation of Digital Data with Self Validating Self-Instantiating Knowledge-Based Archives"
Description: “Digital archives are dedicated to the long-term preservation of electronic information and have the mandate to enable sustained access despite rapid technology changes. Persistent archives are confronted with heterogeneous data formats, helper applications, and platforms being used over the lifetime of the archive. This is not unlike the interoperability challenges, for which mediators are devised. To prevent technological obsolescence over time and across platforms, a migration approach for persistent archives is proposed based on an XML infrastructure.”
Tags: "2001,applied research,digital preservation,nara transcontinental persistent archives prototyp,persistent archives,reagan moore,richard marciano,sdsc"
Description: “The tools available to professionals looking to do things with geographic data have not grown to meet the Big Data problem. We present our vision for a third-generation solution, RENCI’s Geoanalytics Cyberinfrastructure, for working with geographic data.”
Tags: "2011,access,geospatial electronic records,technical report,unc-ch,visualization"
"Semantic Annotation of Presidential E-records, TR 08-01"
Description: “The capability to extract metadata from electronic records depends on the capability to automatically recognize and annotate semantic categories in the text such as persons’ names, dates, job titles, and postal addresses. The capability to recognize document types such as correspondence, memoranda, schedules, minutes of meetings and press releases also depends on the capability to automatically recognize semantic categories in text. This technical report discusses an experiment that was conducted using records from the Bush Presidential personal computer records. The results are an overall average precision of 0.9178, overall average recall of 0.9282, and an overall F-measure of 0.9108.”
Tags: "2008,applied research,army research lab,content summarization,document type recognition,george hw bush,gtri,information extraction,nara,presidential records,sheila isbell,technical report,william underwood"
File: Semantic Annotation of Presidential E-Records
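For readers comparing the three figures quoted in this description, the sketch below computes the standard F-measure (the harmonic mean of precision and recall) from the quoted overall averages. The function name is hypothetical, and the listing does not say how the report aggregated its scores. The assumption here is only that the definition is the usual one. An F-measure averaged per category need not equal the value derived from averaged precision and recall, which may explain why the reported 0.9108 differs from the result below.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overall averages quoted in the description above.
overall_precision = 0.9178
overall_recall = 0.9282

# F1 recomputed from the two averages (an illustration, not the report's method).
f1_from_averages = f_measure(overall_precision, overall_recall)
print(round(f1_from_averages, 4))
```

The recomputed value comes out at about 0.923.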
"Extensions Of The Unix File Command And Magic File For File Type Identification TR 09-02"
Description: “File format identification is a core requirement for digital archives. The UNIX file command is among the most promising technologies for file type identification. This report describes extensions to the file command and magic file that enhance their utility for file format identification in archival systems.”
Tags: "2009,applied research,army research lab,digital preservation,droid,file format identification,gtri,nara,pronom"
File: Extensions of the UNIX File Command and Magic File for File Type Identification
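The general idea behind magic-file tests, comparing bytes at a known offset against a known signature, can be sketched in a few lines. This is not the report's extension mechanism and not the real magic file syntax. The signature table and the `identify` function are hypothetical and cover only a handful of well-known formats.

```python
from typing import Optional

# Hypothetical, hand-picked signature table: (offset, magic bytes, label).
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"PK\x03\x04", "ZIP archive"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first signature found in data, else None."""
    for offset, magic, label in SIGNATURES:
        # Slicing past the end yields a shorter string, so short inputs
        # simply fail the comparison instead of raising.
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


print(identify(b"%PDF-1.4\n..."))
print(identify(b"plain text with no signature"))
```

The real file command goes well beyond this, with typed comparisons, nested tests, and text heuristics. The table-driven lookup above only shows the basic matching step.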
"Grammar Based Recognition Of Documentary Forms And Extraction Of Metadata"
Description: “Metadata extraction is a critical aspect of the ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categories such as persons’ names, job titles, dates, and postal addresses that may occur in a record. It extends this method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form of the document to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.”
Tags: "2010,applied research,army research lab,digital preservation,document type recognition,electronic records,gtri,information extraction,nara,presidential records,william underwood"
File: Grammar-Based Recognition of Documentary Forms and Extraction of Metadata
"FOIA Processing In The Presidential Electronic Records Pilot System (PERPOS) TR 06-05"
Description: “This technical report describes functionality added to PERPOS that supports FOIA processing, including:
- Indexing the accessioned electronic records
- Creating a FOIA case
- Searching the indexed records for records relevant to a FOIA Request
- Automatic estimation of the number of pages of e-records relevant to a request
- Reviewing records for a FOIA case
- Creating the Scope and Content Note for a FOIA case
- Automatically creating a finding aid and a container for a FOIA Collection”
Tags: "2006,access,applied research,decision support for archival review,foia,george hw bush,gtri,nara,perpos,presidential records,redaction,review,sandra laib,william underwood"
File: FOIA Processing in the Presidential Electronic Records Pilot System PERPOS TR 06-05
"Access Restriction Checker PERPOS TR 2005-07"
Description: “Part of the PERPOS project has been to analyze the kinds of knowledge that archivists use to review Presidential Records for Presidential Records Act (PRA) restrictions and Freedom of Information Act (FOIA) exceptions, and to develop an automated tool that could use this knowledge to support archivists’ decisions in reviewing Presidential Records. We have begun prototyping such a tool, which we call the Access Restriction Checker. The results of our initial exploration show great promise for such a tool and we believe it would be a great labor saver as a component in the future archivist’s tool kit. Such a tool is not a replacement for the judgment of archivists, whose responsibility is to review the records; rather the tool is a decision support tool. This technical report provides an overview of our initial work on the Access Restriction Checker. Additional work is required to broaden the knowledge coverage to other types of access restrictions.”
Tags: "2005,access,applied research,army research lab,brian harris,decision support for archival review,elizabeth whitaker,foia,gtri,nara,perpos,presidential records,redaction,review,robert simpson"
File: Access Restriction Checker PERPOS TR 2005-07
"Grammar-Based Specification And Parsing Of Binary File Formats"
Description: “The capability to validate and view or play binary file formats, as well as to convert binary file formats to standard or current file formats, is critically important to the preservation of digital data and records. This paper describes the extension of context-free grammars from strings to binary files. Binary files are arrays of data types, such as long and short integers, floating-point numbers and pointers, as well as characters. The concept of an attribute grammar is extended to these context-free array grammars. This attribute grammar has been used to define a number of chunk-based and directory-based binary file formats. A parser generator has been used with some of these grammars to generate syntax checkers (recognizers) for validating binary file formats. Among the potential benefits of an attribute grammar-based approach to specification and parsing of binary file formats is that attribute grammars not only support format validation, but support generation of error messages during validation of format, validation of semantic constraints, attribute value extraction (characterization), generation of viewers or players for file formats, and conversion to current or standard file formats. The significance of these results is that with these extensions to core computer science concepts, traditional parser/compiler technologies can potentially be used as a part of a general, cost-effective curation strategy for binary file formats.”
Tags: "2012,applied research,file format identification,grammars,gtri,international journal of digital curation,nara,william underwood"
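To make the notion of a recognizer for a chunk-based format concrete, here is a minimal hand-written sketch. It is not generated from an attribute grammar and does not reproduce the paper's notation. It assumes a hypothetical format in which a file is a sequence of chunks, each made of a four-letter ASCII tag, a 4-byte big-endian length, and a payload of exactly that length. The length field plays the role of an attribute constraining the payload.

```python
import struct
from typing import List, Tuple


def parse_chunks(data: bytes) -> List[Tuple[str, int]]:
    """Validate a hypothetical chunk-based file and list (tag, payload length).

    Informal grammar:  file  -> chunk*
                       chunk -> tag(4 ASCII letters) length(uint32 BE) payload
    Raises ValueError with an offset when the input does not conform.
    """
    chunks = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < 8:
            raise ValueError(f"truncated chunk header at offset {pos}")
        tag, length = struct.unpack_from(">4sI", data, pos)
        if not tag.isalpha():
            raise ValueError(f"invalid chunk tag {tag!r} at offset {pos}")
        pos += 8
        # Semantic constraint: the declared length must fit in the file.
        if pos + length > len(data):
            raise ValueError(f"chunk {tag!r} overruns the file at offset {pos}")
        chunks.append((tag.decode("ascii"), length))
        pos += length
    return chunks


sample = (b"HEAD" + struct.pack(">I", 2) + b"\x00\x01"
          + b"DATA" + struct.pack(">I", 3) + b"abc")
print(parse_chunks(sample))
```

A generated parser would derive this control flow from the grammar rather than have it written by hand. The error messages with offsets hint at the validation reporting the description mentions.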
"Designing A Scalable Cross-Platform Imposed Code Reuse Framework"
Description: “In order to construct a file format conversion service supporting as many formats as possible we have introduced the notion of imposed code reuse in our past work (McHenry et al. [2-4]). Traditional code reuse, when an option, can save a significant amount of energy and time with regard to new software development while at the same time adding robustness through the use of code that has been proven over time. Imposed code reuse comes into play when original source code is not available. Consider proprietary software where only a compiled binary version is available. Such software will often provide only a graphical user interface (GUI) allowing humans to utilize the software with a mouse, keyboard, and monitor. While a GUI is useful for human interaction it is not exactly useful for accessing functionality programmatically by software developers. Imposed code reuse attempts to bridge this access to functionality locked away in compiled software by wrapping it through various scripting languages in such a manner that it can be used in other software through an API-like interface.”
Tags: "2010,ncsa,file format migration,scalability,applied research,electronic records,digital preservation"
"The Conversion Software Registry"
Description: “We have designed the web-based Conversion Software Registry (CSR) for collecting information about the software that is capable of file format conversions. The work is motivated by a community need for finding file format conversions inaccessible via current search engines and by the specific need to support systems that could actually perform conversions, such as the NCSA Polyglot. In addition, the value of CSR is in complementing the existing file format registries such as the Unified Digital Formats Registry (UDFR, formerly GDFR) and PRONOM, and introducing software quality information obtained by content-based comparisons of files before and after conversions. The contribution of this work is in the CSR data model design that includes file format extension-based conversion, as well as software scripts, software quality measures and test file specific information for evaluating software quality. We have populated the CSR with the help of the National Archives and Records Administration (NARA) staff.”
Tags: "2010,ncsa,file format migration,digital preservation,electronic records,applied research,scalability,conversion software registry"
"Towards A Universal Viewer For Digital Content"
Description: “In this paper, we present a distributed setup for the viewing of digital data within a large number of formats. Through the use of software servers, 3rd party software is automated and made available as functions that can be called within other programs. Using multiple machines containing software and running software servers a conversion service is built. Being built on top of 3rd party software the conversion service is easily extensible and thus can be made to support a large number of conversions. We use this service to build a “universal” viewer by converting given files among a large set of file formats to a relatively small subset of formats that are renderable by a given viewer (e.g. web browser). We describe this service and the underlying software servers as well as the future directions we are planning on taking.”
Tags: "2011,ncsa,file format migration,digital preservation,electronic records,applied research,scalability,conversion software registry,polyglot"
"A Mosaic of Software"
Description: “In this paper, we describe a Software Server. Where conventional web servers allow data to be accessible from anywhere within the web, Software Servers allow arbitrary software functionality to be accessible over the web.”
Tags: "2011,software servers,ncsa,applied research,electronic records,digital preservation"
"The ISDA Tools: Preserving 3D Digital Content"
Description: “NCSA has developed a number of tools to aid in the preservation of digital records. In this paper, we present these tools in the context of preserving 3D digital files. Tools such as the Conversion Software Registry, Software Servers, Polyglot, 3D Utilities, and Versus provide users with a scalable means of discovering and carrying out large file migration tasks in a manner that can take into account and quantify the unavoidable information loss that occurs when moving from one format to another. We present each of these tools and describe how they can be used.”
Tags: "2011,ncsa,digital preservation,conversion software registry,csr,polyglot,versus,applied research,file format migration"
"A Framework For Understanding File Format Conversions"
Description: "This paper addresses the question: Can data generated from the infancy of the digital age be ingestible by software today? We have prototyped a set of e-services that serve as a framework for understanding content preservation, automation and computational requirements on the preservation of electronic records. The framework consists of e-services for (a) finding file format conversion software, (b) executing file format conversions using available software, and (c) evaluating information loss across conversions. While the target audience for the technology is the US National Archives, these basic e-services are of interest to any manager of electronic records and to all citizens trying to keep their files current with the rapidly changing information technology. The novelty of the framework is in organizing the information about file format conversions, providing services about file format conversion paths, in prototyping a general architecture for reusing existing third-party software with import/export capabilities, and in evaluating information loss due to file format conversions. The impact of these e-services is in the widely accessible conversion software registry (CSR), conversion engine (Polyglot) and comparison engine (Versus) which can increase the productivity of the digital preservation community and other users of digital files."
Tags: "2011,ncsa,digital preservation,conversion software registry,csr,polyglot,versus,applied research,file format migration,scalability"
"A Framework To Access Handwritten Information Within Large Digitized Paper Collections"
Description: “This refereed conference paper describes NCSA’s efforts with NARA to provide a form of automated search of handwritten content within large digitized document archives. With a growing push towards the digitization of paper archives, there is an imminent need to develop tools capable of searching the resulting unstructured image data, as data from such collections offer valuable historical records that can be mined for information pertinent to a number of fields from the geosciences to the humanities. To carry out the search we use a Computer Vision technique called Word Spotting. A form of content-based image retrieval, it avoids the still difficult task of directly recognizing the text by allowing a user to search using a query image containing handwritten text and ranking a database of images in terms of those that contain more similar looking content. In order to make this search capability available on an archive, three computationally expensive pre-processing steps are required. We describe these steps, the open-source framework we have developed, and how it can be used not only on the recently released 1940 Census data containing nearly 4 million high resolution scanned forms, but also on other collections of forms. With a growing demand to digitize our wealth of paper archives we see this type of automated search as a low-cost scalable alternative to the costly manual transcription that would otherwise be required.”
Tags: "2012,ncsa,applied research,1940 Census,advanced search,access,handwriting recognition,scalability"
"Digitization And Search: A Non-Traditional Use Of HPC"
Description: "Automated search of handwritten content is a highly interesting and applicative subject, especially important today due to the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, a still very difficult task, we use a content-based image retrieval technique known as Word Spotting. Through this paradigm, the system is queried by the use of handwritten text images instead of ASCII text and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to accomplish the pre-processing of the data so as to make this search capability available on an archive. The required preprocessing steps and the open-source framework developed are discussed, focusing specifically on HPC considerations that are relevant when preparing to provide searchable access to sizeable collections, such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate the processing of the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can be used to provide an alternative to costly manual transcriptions for a variety of digitized paper archives."
Tags: "2012,ncsa,applied research,1940 Census,advanced search,access,handwriting recognition,scalability"
"Computational Scalability of Large Size Image Dissemination"
Description: "We have investigated the computational scalability of image pyramid building needed for dissemination of very large image data. The sources of large images include high-resolution microscopes and telescopes, remote sensing and airborne imaging, and high-resolution scanners. The term ‘large’ is understood from a user perspective, which means either larger than a display size or larger than a memory/disk to hold the image data. The application drivers for our work are digitization projects such as the Lincoln Papers project (each image scan is about 100-150MB or about 5000x8000 pixels with the total number to be around 200,000) and the UIUC library scanning project for historical maps from the 17th and 18th centuries (smaller number but larger images). The goal of our work is to understand computational scalability of web-based dissemination using image pyramids for these large image scans, as well as the preservation aspects of the data. We report our computational benchmarks for (a) building image pyramids to be disseminated using the Microsoft Seadragon library, (b) a computation execution approach using hyper-threading to generate image pyramids and to utilize the underlying hardware, and (c) an image pyramid preservation approach using various hard drive configurations of Redundant Array of Independent Disks (RAID) drives for input/output operations. The benchmarks are obtained with a map (334.61 MB, JPEG format, 17591x15014 pixels). The discussion combines the speed and preservation objectives."
Tags: "2011,scalability,applied research,image pyramids,gigapixel images,access,advanced search,ncsa"
"A Framework for Relationship Discovery Among Files of Different Types"
Description: “This poster presents a framework for relationship discovery from heterogeneous data systems. The framework consists of modules for automated file system analysis, file content analysis, integration of the results from analyses, storage of metadata and data-driven decision support for discovering relationships among files.”
Tags: "applied research,electronic records,file type identification,nara,national archives and records administration,national center for supercomputing applications,ncsa,rdf"
"Appraisal and Data Mining of Large Size Complex Documents"
Description: “This poster addresses the problems of comprehensive document comparisons and computational scalability of document mining using cluster computing and the Map and Reduce programming paradigm. While the volume of contemporary documents and the number of embedded object types have been steadily growing, there is a lack of understanding of (a) how to compare documents containing heterogeneous digital objects, and (b) what hardware and software configurations would be cost-efficient for handling document processing operations such as document appraisals. The novelty of our work is in designing a methodology and a mathematical framework for comprehensive document comparisons including text, image and vector graphics components of documents, and in supporting decisions for using the Hadoop implementation of the Map/Reduce paradigm to perform counting operations.”
Tags: "applied research,appraisal,data mining,electronic records,hadoop,nara,ncsa,peter bajcsy"
"PERPOS II: Scientific and Technical Report June 19, 2003 – June 18, 2004, TR 04-8"
Description: “The report covers the following topics: progress and results in applying advanced information and content extraction technologies to document type identification and description of folder and record series contents; results in identifying and representing the knowledge needed for PRA and FOIA review; an architecture for an Access Restriction Checker; the PERPOS2 network and security policy and its use as a testbed for evaluating NIAP certified security products (firewalls, intrusion detection systems, virtual private networks, and encryption); and the results of pilot testing of Archival Processing Tools at the Bush Presidential Library.”
Tags: "2004,applied research,army research lab,digital preservation,document type recognition,electronic records,foia,george hw bush,gtri,information extraction,nara,perpos,presidential records,william underwood"
"Towards a Semantic Preservation System"
Description: “Preserving access to file content requires preserving not just bits but also meaningful logical structures. The Data Format Description Language (DFDL), currently under development, is a completely general standard that addresses this need. The Defuddle parser is a generic parser that can use DFDL-style format descriptions to extract logical structures from ASCII or binary files written in those formats. DFDL and Defuddle provide a preservation capability that has minimal format-specific software and cleanly separates issues related to bits, formats, and logical content. Such a system has the potential to greatly reduce overall system development and maintenance costs as well as the per-file-format costs for long term preservation.
This project is investigating extending this model to extract descriptions of the structure and relations in the data into standard semantic web languages (the Resource Description Framework (RDF) and the Web Ontology Language (OWL)). Our approach is a two-step process: the standard DFDL processing within Defuddle to generate XML, and a second phase to extract semantic descriptions from the XML using the Gleaning Resource Descriptions from Dialects of Languages (GRDDL) standard as a standard mechanism for declaring these transformations.”
Tags: "2009,data format description language,dfdl,digital preservation,electronic records,nara,ncsa,owl,web ontology language"
"Collection-based Long-term Preservation Project Status Report June 1999"
Description: “This is an early report from our Research Partners then located at the San Diego Supercomputer Center (SDSC) in which they explore the feasibility of preserving electronic records at scale. Their estimates of the costs associated with preserving one billion electronic records are particularly interesting to read with the benefit of hindsight.”
Tags: "1999,applied research,digital preservation,electronic records,nara,scalability,sdsc"
"TR 07-04 Results of Pilot Testing of FOIA Processing Using PERPOS"
Description: “Technical Report on the use of PERPOS tools for FOIA review of Presidential Records at the George H. W. Bush Presidential Library.”
Tags: "2007,applied research,army research lab,brooke clement,debbie carter,foia,george hw bush,gtri,nara,perpos,presidential records,redaction,sandra laib,william underwood"
File: TR 07-04
"TR 08-03 Recognizing Speech Acts in Presidential E-records"
Description: “Among the challenges facing contemporary archivists at Presidential Libraries and the National Archives are the tasks of reading, understanding, describing, accessing and reviewing terabyte- and petabyte-sized collections of electronic records. The Advanced Decision Support for Archival Processing of Presidential E-Records Project seeks to apply natural language processing technologies to the support of these archival tasks.”
Tags: "2008,applied research,army research lab,digital preservation,electronic records,george hw bush,gtri,nara,presidential records,speech act recognition,william underwood"
"Language of Records Disposition by William Underwood"
Description: “This is a paper written in the mid-1990s about constructing records disposition schedules in such a way that they could be machine-executable.”
Tags: "1994,applied research,army research lab,artificial intelligence atlanta inc,electronic records,gtri,nara,records disposition,records management,records schedules,william underwood"