Federal Records Management

Bulletin 2005-02b


NARA is aware of the following harvesting tools, listed alphabetically. The default settings of web harvesters may not capture all content in a manner acceptable to NARA, so please examine each harvester's configuration settings to ensure compatibility with NARA's Transfer Instructions for Permanent Electronic Records: Web Content Records, which are available online at http://www.archives.gov/records-mgmt/initiatives/web-content-records.html. (Note: The quotation following each URL is taken directly from the respective web site and is not a NARA opinion.)

Heritrix
http://crawler.archive.org/
"Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project."

HTTrack
http://www.httrack.com/
"HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" web site in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system."

Teleport Pro
http://www.tenmax.com/teleport/pro/home.htm
"PC Magazine's Editors' Choice for offline browsers, Teleport Pro is an all-purpose high-speed tool for getting data from the Internet."

GNU Wget
http://www.gnu.org/software/wget/wget.html
"GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X support, etc."
