Here at the Getty’s Institutional Archives, our job is to preserve the history of the entire Getty. This includes archiving the significant electronic records staff have produced over the years.
These born-digital files—meaning files that originated in digital form, as opposed to being digitally scanned from papers and photos—document the vast array of work that the Getty conducts in areas such as exhibitions, publications, public programming, art history research, and art and cultural heritage conservation. Both Getty staff and visiting researchers are welcome to consult these records.
In the past, digital files have come to us hidden within paper collections on old floppy disks and CDs. In more recent years, staff have been actively transferring their files to us through hard drives, network drives, email, and drop boxes. The files we deal with encompass your standard text documents and photographs, but also sound and video files, emails, databases, CAD files, digital art installations, and even the Getty’s website and social media presence.
How do we handle all this? Well, Institutional Archives’ strategy for preserving born-digital files is very much a work in progress. In fact, you’ll be hard-pressed to find any organization that has digital preservation completely figured out. It’s common knowledge that you should back up your files. But that by itself isn’t sufficient as a long-term preservation strategy. The reality is that there’s no magical one-step solution for preserving electronic files.
When Files Age and Rot
Digital archivists deal with two major challenges: obsolescence and bit rot. Bit rot happens when the bits of a file change over time due to physical deterioration of storage media, which can corrupt files and even make them inaccessible. This is common with CDs, but it can also happen with files sitting on hard drives or network drives.
Even with pristine files, there’s still the issue of obsolescence. Remember LaserDiscs? Technology is constantly evolving, making it difficult to ensure that disks and drives, and more importantly the files they hold, can still be accessed in the future. Because of these vulnerabilities, when it comes to digital preservation, there’s no such thing as permanent storage. Files need to be periodically moved from one storage medium to the next.
Forensics to the Rescue
That’s why an important step in our workflow is to get content off disks and hard drives. To do this, we’ve borrowed digital forensics tools used by law enforcement agencies. Although our motives may differ, archivists and digital forensic investigators are both concerned with making exact copies of files without altering them.
We maintain two forensic workstations, one of which is a FRED (Forensic Recovery of Evidence Device). FRED has ports to connect various kinds of internal and external hard drives and a built-in write-blocker to prevent file modification. Our second workstation, which we’ve named Fluffy, is a standard laptop with old drives attached by USB to read 5.25″ and 3.5″ floppies and Zip disks. These two workstations, along with forensic imaging software, allow us to make exact copies of disks and hard drives that we then preserve as our master copy.
We also use forensic software to examine the content of digital collections. The software allows us to safely view files in various formats, including obsolete formats, without accidentally modifying them. It also has keyword and pattern search functionalities so we can flag files containing sensitive information like social security numbers and credit card numbers.
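To give a feel for what pattern searching involves, here is a minimal Python sketch. It is not the forensic software we use; the patterns, the function names, and the sample text are all assumptions made up for illustration. It flags strings shaped like social security numbers, and strings shaped like credit card numbers that also pass the Luhn check digit test, which helps weed out random runs of digits.

```python
import re

# Illustrative patterns only; real tools use far more thorough rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def passes_luhn(digits: str) -> bool:
    """Return True if a string of digits satisfies the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def flag_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for strings that look sensitive."""
    hits = [("ssn", match.group()) for match in SSN_PATTERN.finditer(text)]
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if passes_luhn(digits):
            hits.append(("card", match.group()))
    return hits


# Made-up sample text; 4111 1111 1111 1111 is a widely used test card number.
sample = "SSN 123-45-6789 and card 4111 1111 1111 1111 on file."
for kind, value in flag_sensitive(sample):
    print(kind, value)
```

A search like this only finds candidates. Someone still has to review the flagged files and decide what to restrict.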
Don’t Drag and Drop—Bag It!
To deal with file corruption, we use software to capture and monitor the checksums of all the files we accession. Checksums are alphanumeric strings computed from a file’s contents that are, for practical purposes, unique to each file, like a digital fingerprint; if a file changes, its checksum will also change. We use the checksum to verify that we made an exact copy of the original and to make sure the file doesn’t change over time.
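If you are curious what a checksum looks like, the short Python sketch below computes one with the standard library. The choice of the SHA-256 algorithm, the function name, and the sample file are assumptions for illustration; they are not a description of the specific software or algorithm we use.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so very large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical example file, created in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    memo = Path(folder) / "memo.txt"
    memo.write_text("Exhibition checklist, draft 1\n", encoding="utf-8")
    original_checksum = sha256_of(memo)

    # Changing a single character stands in for corruption or an edit.
    memo.write_text("Exhibition checklist, draft 2\n", encoding="utf-8")
    changed_checksum = sha256_of(memo)

print(original_checksum)
print(changed_checksum)
print(original_checksum == changed_checksum)  # False
```

Even a one-character change produces a completely different string, which is what makes checksums useful for spotting silent changes.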
Files are particularly vulnerable to corruption during transfers, so rather than using the usual drag-and-drop or copy-and-paste function, we use Bagger, a tool developed by the Library of Congress to move digital content from one location to another. Bagger calculates and compares the checksums of the original and copied files and verifies that the checksums match. This can take a while when we’re moving really large sets of files—sometimes an entire day.
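The idea behind a verified transfer can be sketched in a few lines of Python. To be clear, this is not Bagger and it does not produce a Library of Congress "bag"; it is a simplified illustration, with made-up function names and sample files, of copying files and confirming that each copy's checksum matches the original's.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source_dir: Path, dest_dir: Path) -> dict[str, str]:
    """Copy every file under source_dir into dest_dir, verifying each copy.

    Returns a manifest mapping relative paths to checksums and raises
    ValueError as soon as a copy does not match its original.
    """
    manifest = {}
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(source_dir)
        target = dest_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        before = sha256_of(source)
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        if sha256_of(target) != before:
            raise ValueError(f"Checksum mismatch for {relative}")
        manifest[relative.as_posix()] = before
    return manifest


# Hypothetical transfer between two temporary folders.
with tempfile.TemporaryDirectory() as folder:
    source_dir = Path(folder) / "transfer"
    dest_dir = Path(folder) / "archive"
    (source_dir / "images").mkdir(parents=True)
    (source_dir / "report.txt").write_text("Annual report\n", encoding="utf-8")
    (source_dir / "images" / "scan.bin").write_bytes(bytes(range(256)))

    manifest = copy_with_verification(source_dir, dest_dir)
    for relative_path, checksum in manifest.items():
        print(checksum, relative_path)
```

Every file gets read twice, once at the source and once at the destination, which is part of why verified transfers of large collections are slow.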
Automating File Monitoring
The final storage destination for our digital files is Rosetta, the Getty Research Institute’s digital preservation system. Rosetta extracts technical metadata from the files and monitors checksums on a regular basis. If a checksum mismatch is identified, the system can restore the file to its previous, uncorrupted version from a backup copy. We can also program the system to provide alerts if a file format has become obsolete, so we can decide whether to convert files to another format.
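For readers who want to see the logic of this kind of monitoring, here is a rough Python sketch. It does not use Rosetta or reflect how Rosetta works internally; the folder layout, the manifest of expected checksums, and the function names are all assumptions for illustration. It rechecks each stored file against its recorded checksum and restores any mismatch from a backup copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_and_repair(storage_dir: Path, backup_dir: Path,
                     manifest: dict[str, str]) -> list[str]:
    """Recheck stored files against the manifest and restore bad ones.

    Returns the relative paths that were restored from backup. Raises
    RuntimeError if the backup copy is missing or also fails its check.
    """
    restored = []
    for relative, expected in manifest.items():
        stored = storage_dir / relative
        if stored.is_file() and sha256_of(stored) == expected:
            continue  # file is intact
        backup = backup_dir / relative
        if not backup.is_file() or sha256_of(backup) != expected:
            raise RuntimeError(f"No valid backup for {relative}")
        stored.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup, stored)
        restored.append(relative)
    return restored


# Hypothetical storage and backup folders with one file in each.
with tempfile.TemporaryDirectory() as folder:
    storage_dir = Path(folder) / "storage"
    backup_dir = Path(folder) / "backup"
    storage_dir.mkdir()
    backup_dir.mkdir()
    (storage_dir / "catalog.txt").write_bytes(b"Catalog essay, final")
    (backup_dir / "catalog.txt").write_bytes(b"Catalog essay, final")
    expected = {"catalog.txt": sha256_of(storage_dir / "catalog.txt")}

    # Simulate bit rot by changing one byte of the stored copy.
    (storage_dir / "catalog.txt").write_bytes(b"Catalog essay, finak")

    repaired = audit_and_repair(storage_dir, backup_dir, expected)
    print(repaired)  # ['catalog.txt']
```

Running a check like this on a schedule is what turns a one-time checksum into ongoing monitoring.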
Even though we have the basic tools and strategies in place, there’s still a great deal for us to do. Conversion of file formats is an area we have yet to navigate. We’re in the early stages of working with the Internet Archive’s web archiving service Archive-It to preserve the Getty’s webpages. We’re still trying to decide how exactly we want to integrate descriptions of digital materials into our finding aids. Providing remote access to complex born-digital collections, particularly ones with obsolete file formats, is also a bit of a head-scratcher. As new tools come out and the digital preservation field continues to develop, we will continue to reevaluate and refine our procedures.
This post marks Electronic Records Day, which spotlights digital records and the need to manage and preserve them. If you’re interested in preserving your own digital files, see these tips put together by the Council of State Archivists [PDF], which sponsors the day.