Digital Preservation in Practice

An archivist and a library assistant at the Getty Research Institute describe the challenges they face in preserving digital files

Three figures in a room with tall white columns, on top of a circular glass structure. — Linta Kunnathuparambil, Teresa Soleau, and Lorain Wang under the oculus at the Getty Research Institute.

By Teresa Soleau, Lorain Wang, Linta Kunnathuparambil

Nov 30, 2017

At the Getty Research Institute in Los Angeles we have been using Rosetta, the digital preservation solution from Ex Libris, since 2012.

Although Rosetta is a vendor solution, and therefore proprietary in some ways, it is based on OAIS (Open Archival Information System) principles and uses many standard community-developed digital preservation tools and metadata formats, such as JHOVE, DROID, PREMIS, and METS. The vendor, Ex Libris, works closely with Rosetta customers to continually enhance the product following best practices for digital preservation.

Even with a vendor solution, there is still quite a bit of opportunity for customization and local configuration. We started out preserving materials that we digitized and more recently began depositing born-digital institutional records as well. Below you’ll hear from two of the staff members at the Getty Research Institute who interact with Rosetta on a regular basis. They describe some of the issues they encounter in trying to preserve our resources.

Preserving Born-Digital Materials

Lorain Wang, Institutional Archivist

A figure sitting and working at their desk, surrounded by file boxes — Lorain Wang in her office

“Born digital” refers to materials that originated in digital form, as opposed to files produced from digitization projects. They tend to be messier and more complicated than digitized files because they often come in a wider variety of file formats, and because they’re less likely to be well-formed (i.e., to meet the specifications of their format). Born-digital files are also more likely to be organized in a complex hierarchical structure, which can cause problems with software, such as bulk format-conversion tools, that cannot retain the original file structure. They are also more likely to suffer from bad file-naming practices, such as the use of problematic characters, opaque names, or names that are too long. Then there are the duplicate files, drafts, and various versions of a document that often make it difficult to determine which file is the final product.

Getting people to follow best practices in creating and managing their files could resolve many of these issues, but that is more easily said than done. In addition, files often come to the archives long after the file creator has left.

The Getty’s Institutional Archives began ingesting born-digital materials into Rosetta over a year ago. We are currently taking a broad approach; our focus is on getting as much as possible into the preservation system. We ingest files even if they are unprocessed (which is the case for most of our born-digital materials). Appraisal generally does not go beyond the accession/collection level, which means we are ingesting files that we may not necessarily want to preserve. We do not transform files to preservation formats, and we do not try to address files that are not well-formed or that have incorrect extensions. We’re ingesting the “good,” the “bad,” and the “ugly.” At the bare minimum we strip problematic characters from file names and remove system files.
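As a rough illustration of that bare-minimum cleanup, here is a minimal Python sketch. It is not the workflow described in this post, which relies on a desktop renaming tool. The list of system files and the definition of a "problematic" character are assumptions made for the example, and `clean_name` and `is_system_file` are hypothetical helper names.

```python
import re
import unicodedata

# Assumption: typical operating-system clutter; the post does not say
# which system files are actually removed.
SYSTEM_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}

# Assumption: anything other than ASCII letters, digits, dot, underscore,
# and hyphen counts as "problematic".
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def is_system_file(name: str) -> bool:
    """Return True for files an operating system creates on its own."""
    return name in SYSTEM_FILES or name.startswith("._")


def clean_name(name: str) -> str:
    """Replace each run of problematic characters with one underscore."""
    # Split accented letters into base letter + accent, then drop non-ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = _UNSAFE.sub("_", ascii_name).strip("_")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "unnamed"


clean_name("Rapport final (v2)?.docx")  # "Rapport_final_v2_.docx"
```

Two different original names can clean to the same result, so a real workflow would also need to detect collisions and keep a log that maps each original name to its replacement.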

We want to avoid taking a “boutique” approach, but the reality is that this approach is sometimes unavoidable. When you’re trying to ingest the “bad” and the “ugly,” the preservation system will quickly put up a fight and respond with system errors.

Troubleshooting is time-consuming, particularly when system errors are not always clear. At times I can’t tell if it’s a bug with the system, a problem with the files, or just a problem with me. Our preservation system does not list all of the problem files, so it’s not unusual to fix one problem and to try ingesting again, only for the system to spit up on the next file. A deposit can sometimes require multiple ingest attempts, which can translate into a few weeks (or months) on a single deposit that I still can’t manage to push through.
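One way to cut down on repeated ingest attempts is to scan a deposit beforehand and list every suspect file name in a single pass. The sketch below shows the idea only: it does not reproduce Rosetta's own validation, and the character rule, the length limit, and the `find_problems` name are all assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumptions: the real limits depend on the preservation system and
# the storage underneath it.
MAX_NAME_LENGTH = 100
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def find_problems(relative_paths):
    """Return (path, reason) pairs for every suspect file name.

    All paths are checked, so the report covers the whole deposit
    instead of stopping at the first failure.
    """
    problems = []
    for rel in relative_paths:
        name = PurePosixPath(rel).name
        bad = sorted(set(_UNSAFE.findall(name)))
        if bad:
            listed = " ".join(repr(ch) for ch in bad)
            problems.append((rel, "unsafe characters: " + listed))
        if len(name) > MAX_NAME_LENGTH:
            problems.append(
                (rel, f"name longer than {MAX_NAME_LENGTH} characters")
            )
    return problems


for path, reason in find_problems(["memos/ok.txt", "memos/bad name?.txt"]):
    print(path, "->", reason)
```

The function takes plain path strings, so it can be fed from a directory walk or from a file listing exported by another tool.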

The program, ReNamer Pro, uses a list of rules to rename problematic file names. — Lorain uses this program to rename problem file names.

Fortunately, most of the problems that we’ve encountered have been due to file-name issues. They can be difficult to pinpoint, but at least they’re fairly easy to fix. Frankly, I wouldn’t know where to begin to fix a file that isn’t well formed.

Are we going to regret our approach? Possibly. Our strategy brings up many questions that we’ve been putting off. Ingested files do not necessarily represent the final form of our information package. When we do eventually process a set of files, do we want to replace the original package or ingest them as a separate package? And what about the sets of files that we will need to keep adding to every few years? In our efforts to preserve as much as possible now, are we creating more work for ourselves in the future? Realistically, given our backlog, it’s also very possible that many of these ingested information packages will remain the final form.

Preserving Digitized Resources

Linta Kunnathuparambil, Library Assistant

A figure sitting and working at their desk in front of a computer — Linta Kunnathuparambil at her desk

Although Rosetta is a preservation-only system for most born-digital institutional records, it also serves the purpose of delivering digitized resources to end users through the Research Institute’s digital collections. This requires some key pieces of metadata (rights and descriptive), which sometimes present challenges that we need to tackle in order to get the items preserved and available to users.

When we digitize two-dimensional items like manuscripts, photographs, and correspondence, we make TIF and JPG files. The preservation and modified master TIFs are only accessible to staff with Rosetta accounts, while the JPGs, or access files, are made available to the end user through Primo Search, our discovery system. Unlike born-digital materials, we have control over the format, size, structural hierarchy, and filenames of our digitized resources. We follow a standardized file-naming convention, so any issues related to filenames are due to human error. The TIFs have a standardized resolution and size regardless of rights, but the access files vary based on the copyright of the collection that’s being digitized. The copyright category determines the resolution, pixel dimensions, and off-site accessibility of the digitized material through Primo Search.

We assign one rights category per digital object. Although we are able to change the category after a deposit has been made, we are limited to only one choice. This is problematic when there are collections with items that have partial or mixed rights. For example, some items could be in the public domain while others are still within copyright. In those instances, we work with our rights analyst to determine the best possible solution. This means that sometimes we choose the more restrictive rights category and apply it to all the files in a digital object, resulting in the down-sizing or restriction of some files that we have the rights to deliver at high resolution.
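The trade-off in choosing the more restrictive category can be sketched in a few lines of Python. The category names and their ordering below are invented for illustration and are not the Research Institute's actual rights categories; `object_rights` and `over_restricted` are likewise hypothetical names.

```python
from enum import IntEnum


class Rights(IntEnum):
    """Hypothetical categories, ordered from least to most restrictive."""
    PUBLIC_DOMAIN = 0
    LOW_RESOLUTION_ONLY = 1
    ON_SITE_ONLY = 2


def object_rights(item_rights):
    """Pick the single category for a digital object: the most
    restrictive one found among its items.

    Raises ValueError if the object has no items.
    """
    return max(item_rights.values())


def over_restricted(item_rights):
    """List items that could have been delivered more openly."""
    chosen = object_rights(item_rights)
    return sorted(
        name for name, rights in item_rights.items() if rights < chosen
    )


album = {
    "page_001.tif": Rights.PUBLIC_DOMAIN,
    "page_002.tif": Rights.ON_SITE_ONLY,
}
print(object_rights(album).name)   # ON_SITE_ONLY
print(over_restricted(album))      # ['page_001.tif']
```

Listing the over-restricted items makes the cost of a single-category rule visible, which may help when discussing a mixed-rights collection with a rights analyst.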

Another challenge is the level of metadata we apply to each digital object. In order for the deposit team to create a Dublin Core record, the Research Institute special collections catalogers review existing metadata in the MARC records and finding aids. While it is possible to check the accuracy of records for small deposits, it is very cumbersome to do this for large-scale digitization projects involving thousands of digital objects.

With these large-scale digitization projects, there is only so much you can do within the given deadline. You are limited by the number of staff, funds, technology, and the requirements of your preservation and digital asset management systems. There isn’t enough time to check every single item and verify that the metadata is correct. Therefore, we often have to agree on generalized metadata that is applicable to a wider set of material within the collection. This enables us to have control over the metadata that’s going into Rosetta for preservation and access, but it sometimes means we are not providing complete descriptions of these materials.

Best Effort, Documentation, and Speed

Working within the limitations of our digital preservation system and available resources can be challenging. While we can standardize the workflows involved in digitization and digital preservation, adjustments sometimes need to be made to work with our varied and diverse collections. We try our hardest to make decisions that follow best practices for digital preservation, but at the end of the day we’re committed to getting the material into the system quickly. That’s why we have checksums and file format information, and why multiple copies of files are saved in our data archive and checked regularly to verify they haven’t changed. Best effort at this moment in time, with clear documentation, has to be enough.
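For readers unfamiliar with how checksums show that a file hasn’t changed, here is a minimal Python sketch of a fixity check. It is a generic illustration and not how Rosetta or our data archive implements it. The choice of SHA-256 and the function names `build_manifest` and `changed_files` are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root at deposit time."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """Return the files that are missing or no longer match the manifest."""
    root = Path(root)
    changed = []
    for relative, expected in manifest.items():
        path = root / relative
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(relative)
    return sorted(changed)
```

Running `changed_files` on a schedule against each stored copy is the regular check: an empty list means every file still matches the checksum recorded when it was deposited.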
