National Archives Applied Research

National Center for Supercomputing Applications-NCSA

  • Bajcsy, Peter. Technologies for Appraising and Managing Electronic Records. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. Invited Lecture, National Archives, College Park, MD. September 23, 2009. PowerPoint slides as PDF.

Abstract: Discusses discovery of relationships among digital file collections (File2Learn); Doc2Learn: a comprehensive comparison of contemporary documents; automated file format conversion software; Polyglot: conversion quality assessment tool; and the design of technologies for appraising and managing electronic records.

  • Bajcsy, Peter. Appraisal of 3D Data Conversions and Visualization Software Packages. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. January 21, 2009. PowerPoint Presentation PPT.

Abstract: Discusses appraisal of 3D digital data; managing 3D file formats; scalability of appraisals, 3D data conversions; components of Polyglot; challenges with conversion software; Polyglot as a web service; and international collaborations with the PRONOM/DROID/JHOVE projects. Project URLs: http://isda.ncsa.uiuc.edu/NARA/index.html and http://isda.ncsa.uiuc.edu/CompTradeoffs/ Other NCSA publications: http://isda.ncsa.uiuc.edu/publications

  • Bajcsy, Peter. To Preserve or Not to Preserve? How Can Computers Help with Appraisals. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. October 16, 2008. PPT.

Abstract: Discusses past and current research; computer-assisted appraisal of documents; approach and methodology; PDF documents; experimental results; grouping, ranking, and integrity verification; and computational scalability. Project URL: http://isda.ncsa.uiuc.edu/CompTradeoffs/

  • Bajcsy, Peter and Kooper, Rob. Comprehensive Appraisals of Contemporary Documents. In 5th International IEEE eScience conference (IEEE e-Science 2009). National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. Oxford, UK, December, 2009. PDF.

Abstract: Describes problems related to contemporary document analyses. Contemporary documents contain multiple digital objects of different types. These digital objects have to be extracted from document containers, represented as data structures, and described by features suitable for comparing digital objects. In many archival and machine learning applications, documents are compared by using multiple metrics, checked for integrity and authenticity, and grouped based on similarity. The objective of our work is to design methodologies for contemporary document processing, visual exploration, grouping, and integrity verification.

  • Bajcsy, Peter, Kooper, Rob, and McHenry, Kenton. Towards a Universal, Quantifiable, and Scalable File Format Converter. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. In 2009 Fifth IEEE International Conference on e-Science, pp. 140–147. Oxford, UK. PDF.

Abstract: Addresses the problem of designing a universal file format converter. Discusses NCSA’s Polyglot system for data conversion.

  • Bajcsy, Peter, Kooper, Rob, McHenry, Kenton, McFadden, William, Ondrejcek, Michal, and Yahja, Alex.  Advanced Information Systems for Archival Appraisals of Contemporary Documents. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. In IEEE Fourth International Conference on eScience, 2008. eScience'08, pp. 440–441. PDF.

Abstract: Addresses the problem of designing a scalable framework for archival appraisals of contemporary PDF documents. Discusses methodologies for bringing multiple information types into one comprehensive analytical framework; small- and large-scale computational studies; comparisons of contemporary documents containing text, images, and vector graphics; a framework for including 3D and 3D+time data sets in the appraisal analyses; exploratory archival appraisal analyses with small-scale data sets; infrastructure supporting the transition from small-scale to large-scale computations using commodity and high-performance computing resources; and mathematical frameworks and prototypes for comprehensive and scalable document appraisals that include text, images, vector graphics, and high-dimensional data.

  • Bajcsy, Peter, and McHenry, Kenton. Key Aspects in 3D File Format Conversions. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. PPT.

Abstract: Discusses 3D file formats; Polyglot as support for archival processes; automation of file format conversions; quality of file format conversions; scalability with volume; and demonstrations.

  • Bajcsy, Peter, and McHenry, Kenton. An Overview of 3D Data Content, File Formats and Viewers. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. Technical Report. October 31, 2008. PDF.

Abstract:  Presents an overview of 3D data content, 3D file formats and 3D viewers. It attempts to enumerate the past and current file formats used for storing 3D data and several software packages for viewing 3D data. The report also provides more specific details on a subset of file formats, as well as several pointers to existing 3D data sets. This overview serves as a foundation for understanding the information loss introduced by 3D file format conversions with many of the software packages designed for viewing and converting 3D data files.

  • Bajcsy, Peter, Kastner, Jason, Kooper, Rob, and Ondrejcek, Michal. A Methodology for File Relationship Discovery. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. In 2009 Fifth IEEE International Conference on e-Science, pp. 193–200. PDF.

Abstract: Addresses the problem of discovering temporal and contextual relationships across document, data, and software categories of electronic records; investigates automation of metadata extraction from engineering drawings and storage requirements for metadata extraction. Keywords: data processing, data conversion, optical character recognition.

  • Bajcsy, Peter, Kastner, Jason, Kooper, Rob, and Ondrejcek, Michal. Information Extraction from Scanned Engineering Drawings. National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. Technical Report. December 31, 2009. PDF.

Abstract: This work is motivated by the need to discover and preserve relationships among engineering drawings and their modern counterparts such as 3D CAD models. We have developed a general prototype system called File2Learn for (1) extracting file system level information using Aperture software, (2) performing content based analyses of 2D engineering drawings by optical character recognition (OCR) techniques and of 3D CAD models by string matching, and (3) discovering and establishing relationships among files by using a semi-automated exploratory framework.

  • Kastner, Jason, McGrath, Robert E., and Myers, Jim.  Experiments in Data Format Interoperation Using Defuddle.  National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. Technical Report, September, 2009. PDF.

Abstract: Discusses Defuddle technology and the status of the Defuddle parser and recent work conducted as part of the “Innovative Systems and Software: Applications to NARA Research Problems” project; file characterization; recognition of 3D data format interoperation; and recognition of “MIME-type” of a file.

  • Steffen, Craig. FPGA Data Ingest Processing for NARA Electronic Records.  National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, Urbana, IL. PPT.

Abstract: Discusses field-programmable gate arrays (FPGAs) and the design process.
