
Digital Collections

Rockefeller-McCormick New Testament (Ms. 965) from The Goodspeed Manuscript Collection. Photographer: Michael Kenny.

Contents

  1. Overview
  2. Managing digital collections projects
  3. Quality control
  4. Website development
  5. Digital archiving
  6. Serving images
  7. Artifacts
  8. Automatic image cropping

Overview

Some statistics about digital collections: 5.7 TB of data stored in an OCFL file share; 991 scanned prints from the Speculum Romanae Magnificentiae collection; 22,258 high-resolution scans of pages from the Goodspeed Manuscript Collection; 48,776 digitized photographs from the University of Chicago Photographic Archive.

Much of my work has been providing infrastructure for digital library collections. I manage a digital repository, build websites, provide search interfaces for cultural heritage materials, and look for ways to build searches that let people explore in satisfying ways.

Digitized items from the Goodspeed Manuscript Collection, by century

Each digitized manuscript is represented by its own block. The width and height of each block are scaled to match the pixel dimensions of each image, and the depth of each block is scaled to match the number of pages that have been scanned for each item. The project contains over 22,000 scans total. This graphic was produced using D3. See the data here, and see the source code here.

Managing digital collections projects

These cultural heritage materials include things like notes, letters, engravings, photographs, manuscripts, sound recordings, and videos. The first step in getting these materials online is to run a digitization project that creates digital surrogates of the items. These projects need to be managed: they need accurate inventories to be sure that every physical item is accounted for throughout the digitization process. They also need specifications for things like the resolution and bit depth of the resulting files, file naming schemes and how files should be laid out on disk, and standards for image, sound, or video quality. In parallel with digitization, metadata specialists collect and enhance metadata describing each digital object, and some type of identifier scheme needs to connect the files on disk with specific metadata records. I have worked on teams managing these projects both in-house and with external vendors, and I have also provided consulting and advice for teams taking on these projects elsewhere.
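As an illustration, a reconciliation check like the one below can catch files that don't match the naming scheme and inventory items that have no scans at all. The project prefix, filename pattern, and identifier format here are hypothetical, a minimal sketch rather than any project's actual specification:

```python
import re

# Hypothetical naming scheme: <project>-<item number>-<sequence>.tif,
# e.g. "good-0123-001.tif". The pattern would follow the project's real spec.
FILENAME_PATTERN = re.compile(r"^good-(\d{4})-(\d{3})\.tif$")

def reconcile(filenames, inventory_ids):
    """Compare files on disk against an inventory of item identifiers.

    Returns (items_missing_files, files_without_inventory_entry).
    """
    seen_items = set()
    orphan_files = []
    for name in filenames:
        m = FILENAME_PATTERN.match(name)
        if m is None or m.group(1) not in inventory_ids:
            # Either the name doesn't fit the scheme, or the item number
            # doesn't appear in the inventory.
            orphan_files.append(name)
        else:
            seen_items.add(m.group(1))
    missing_items = sorted(inventory_ids - seen_items)
    return missing_items, orphan_files
```

Run against a directory listing and the inventory spreadsheet's identifier column, a report like this gives digitization staff a concrete worklist: items still to scan, and files to rename or investigate.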

Quality control

A network graph showing the relationships between printers and publishers in The Speculum Romanae Magnificentiae digital collection. Printer nodes are rendered in gray, publisher nodes are rendered in black, and individuals who worked as both a printer and publisher have gray nodes with a black outline. Edges in the graph represent the number of works where specific printers and publishers worked together. Thicker lines mean the two collaborated more often. To produce this graph I wrote a Python script to output graph data to Gephi and cleaned the resulting graph up in Illustrator.

Every digitization project I have worked on has also required some type of quality control and rework process to make sure that all standards have been met, that files have been named and arranged accurately, and that metadata for every digital object is complete. It is easy for this rework process to take longer than expected. I use scripts to find inconsistencies in digital objects and metadata files, and I work with vendors, digitization staff, and metadata specialists to correct problems. I also maintain programs that automatically correct file-naming errors and modify files, for example by automatically cropping and deskewing images. I validate that files and metadata are well formed, and that metadata matches the expectations of all stakeholders. This can involve working with metadata in a variety of formats, including MARC, MARCXML, XML documents such as TEI, DC, and VRA Core, relational databases, and linked data.
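For example, one common check verifies that each metadata record is well-formed XML and that required fields are present and non-empty. The required-field list below is an assumption for illustration; the real checks follow each project's metadata specification:

```python
import xml.etree.ElementTree as ET

# Hypothetical required Dublin Core fields; the actual list depends on the
# project's metadata specification.
REQUIRED_FIELDS = ("title", "identifier", "date")
DC = "{http://purl.org/dc/elements/1.1/}"

def check_record(xml_string):
    """Return a list of problems found in one metadata record:
    parse errors (not well-formed) or missing required DC fields."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    for field in REQUIRED_FIELDS:
        element = root.find(f".//{DC}{field}")
        if element is None or not (element.text or "").strip():
            problems.append(f"missing or empty dc:{field}")
    return problems
```

Collecting these problem lists across thousands of records turns vague "the metadata has issues" conversations into a specific punch list for rework.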

Website development

Wireframes. Each digital collection uses a common template.

In my experience these projects almost always involve displaying the newly digitized objects on some type of website, so it is important to have a specification for that interface that can inform metadata and digitization work. We need to be sure that project metadata not only supports long-term preservation and discovery of items, but that it can also populate an interface for the project's specific stakeholders and use cases. I often write reports and test assumptions about metadata to be sure it does what the website development team needs it to do.

Digital archiving

In our setup, digital object data are loaded into a digital repository, which feeds into our image server and into a linked data triplestore for metadata. This ensures that whatever digital objects and metadata users see on a website are also being archived in the digital repository. I mint permanent identifiers for each digital object, which tie together the object's working identifier, its files in the digital repository, and its metadata. Because these identifiers are permanent, we track them and put policies in place to provide access to them for the long term.
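A minimal sketch of what that minting step looks like, assuming ARK-style identifiers under a placeholder NAAN (99999). A production setup would use a minting service such as Noid or EZID and persist the registry in a database rather than an in-memory dict:

```python
# A toy identifier registry: each minted permanent identifier ties together
# the object's working identifier, its repository location, and its metadata.
class IdentifierRegistry:
    def __init__(self):
        self._records = {}
        self._counter = 0

    def mint(self, working_id, repository_path, metadata_uri):
        """Mint an ARK-style permanent identifier for one digital object."""
        self._counter += 1
        # "99999" is a placeholder NAAN; "b2" is a hypothetical shoulder.
        ark = f"ark:/99999/b2{self._counter:06d}"
        self._records[ark] = {
            "working_id": working_id,
            "repository_path": repository_path,
            "metadata_uri": metadata_uri,
        }
        return ark

    def resolve(self, ark):
        """Look up everything the permanent identifier points to."""
        return self._records[ark]
```

The point of the pattern is that the permanent identifier is the stable join key: working identifiers, file layouts, and metadata systems can all change, but the ARK keeps pointing at the same object.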

Serving images

Gargoyles, from The University of Chicago Photographic Archive. See the originals on the photo archive site for each image, numbered from upper left to lower right: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. Various photographers.

The image server and manifest server pull images and metadata together into a standard form that can be used for website display. Currently we use IIIF image servers, IIIF manifests, and EDM linked data to do this. Standards like these not only serve the specific collection we're working with; they also put our data into interoperable formats that allow digital objects to be combined with objects from other institutions, browsed and searched as a whole, and more.
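For example, a IIIF Image API 2.x request follows the fixed URL pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}, which is what lets any IIIF-aware viewer request tiles and thumbnails from any compliant server. A small helper, with a hypothetical server base URL:

```python
from urllib.parse import quote

# Hypothetical IIIF Image API 2.x endpoint.
BASE = "https://iiif.example.org/iiif/2"

def image_url(identifier, region="full", size="full",
              rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API 2.x request URL for one image."""
    # Identifiers containing slashes must be percent-encoded per the spec.
    return (f"{BASE}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")
```

So a request for a thumbnail constrained to 400 pixels, `image_url("good-0965-001", size="!400,400")`, works the same against our server or any other institution's, which is exactly the interoperability the standard buys.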

Artifacts

I have worked on digital collections projects for a few years. The first thumbnail below is for the current digital collections template at my place of work, and all thumbnails after that are for legacy projects. For the current template I worked as part of a team including a designer and two other developers, while for legacy projects I did both design and development work on my own.

Automatic image cropping

Automatic image cropping. The pink rectangle is a crop of
the page image only, without the color bar.
When many of these archival images are captured, they include things like rulers and color bars. You can see my code to automatically crop out color bars here.

This is an example of one of the image processing scripts I have written to manage images produced in a digitization project. When images are initially captured, scans or photographs often include a ruler or color bar that may be cropped out when the image is viewed on a website. Because it would be labor-intensive for a person to open hundreds or thousands of images in an image editor to manually crop these extraneous elements out, I wrote this script to automatically detect the position of a color bar in an image so that it could be cropped automatically.
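The real script is linked above; the sketch below shows the general idea in a much-simplified form, assuming the color bar appears as a run of dark columns along the right edge of a grayscale scan. The actual detection has to handle bars on any edge, rulers, and lighting variation:

```python
import numpy as np

def find_colorbar_columns(gray, threshold=128, min_run=5):
    """Locate a vertical color bar along the right edge of a scan.

    `gray` is a 2-D array of 8-bit grayscale values. A column is "dark"
    if its mean brightness falls below `threshold`; a run of at least
    `min_run` dark columns at the right edge is treated as the bar.
    Returns the column where the crop should end (exclusive), i.e. the
    left edge of the detected bar, or the full width if none is found.
    """
    column_means = gray.mean(axis=0)
    dark = column_means < threshold
    # Walk inward from the right edge while columns stay dark.
    end = gray.shape[1]
    while end > 0 and dark[end - 1]:
        end -= 1
    if gray.shape[1] - end >= min_run:
        return end
    return gray.shape[1]
```

Once the bar's position is known, the crop itself is just a slice, `gray[:, :find_colorbar_columns(gray)]`, applied in a loop over every image in the batch instead of by hand in an image editor.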