MDQC Case Studies
March 14, 2014
By Alex Duryee
You can also download a PDF version of these case studies here.
INTRODUCTION
The sheer quantity of digital assets created in migration projects and large-scale ingests often overwhelms library staff. Outside of digitization on demand, objects are typically digitized at scale in order to capitalize on efficiencies of volume. In such cases it is not uncommon for small archival teams to handle many thousands of digital assets, each of which must go through an ingest workflow. The most important part of the ingest workflow, quality control of incoming preservation masters, is often the most time-consuming step for digital archivists. These assets are typically reviewed manually at the item level. As a result, a bottleneck forms as the rate at which quality control is performed falls behind the rate at which newly digitized assets are created or digital collections are acquired.
Quality verification also tends to be an ineffective use of staff time: despite its importance, the work is tedious and a poor fit for skilled labor. Digitization projects and departments can sink unanticipated amounts of valuable time and resources into item-level quality control, detracting from other services (both real and potential). All told, asset quality control is a step in archival workflows that is ripe for improvement.
Tool development
AVPreserve developed MDQC (Metadata Quality Control) to address these bottlenecks and expedite digitization workflows. MDQC is an open source tool based on Exiftool and MediaInfo that allows users to set baseline rules for specifications related to digital media asset quality (such as resolution, framerate, and colorspace) and embedded metadata (such as date formatting, completed fields, and standard values). Once a set of rules is created, it can be applied across an entire collection at once, reporting any assets that fail to meet the quality standard (e.g. wrong colorspace, below minimum resolution, gaps in descriptive metadata, wrong sample rate). From these reports, which require almost no staff time to generate, an archivist can separate problematic assets from those that meet the required specifications. MDQC thus tremendously expedites the quality control of digital media collections, replacing a manual item-level task with an automated collection-level one.
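To make the rule-checking concept concrete, the sketch below shows one way a baseline rule could be evaluated against a file's embedded metadata using Exiftool's JSON output. This is an illustrative example only, not MDQC's actual code, and the rule fields and values (TIFF format, 600 ppi minimum, 16 bits per sample) are hypothetical.

# Illustrative sketch only, not MDQC's actual code. Assumes the exiftool
# command-line tool is installed and available on the PATH.
import json
import subprocess

# Hypothetical rule template: field name -> (comparison, required value),
# analogous to the baseline rules described above.
RULES = {
    "FileType":      ("equals", "TIFF"),
    "XResolution":   ("at_least", 600),
    "BitsPerSample": ("equals", 16),
}

def read_metadata(path):
    """Extract embedded and technical metadata as a dict via exiftool."""
    out = subprocess.check_output(["exiftool", "-json", "-n", path])
    return json.loads(out)[0]

def check_asset(path, rules=RULES):
    """Return a list of readable failures; an empty list means the asset passes."""
    meta = read_metadata(path)
    failures = []
    for field, (comparison, expected) in rules.items():
        actual = meta.get(field)
        if actual is None:
            failures.append(f"{field}: missing")
        elif comparison == "equals" and str(actual) != str(expected):
            failures.append(f"{field}: expected {expected}, found {actual}")
        elif comparison == "at_least" and float(actual) < float(expected):
            failures.append(f"{field}: {actual} is below the minimum of {expected}")
    return failures

Running the same check over every file in a collection is what turns this from an item-level task into a collection-level one.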
CASE STUDIES
Overview
During the development of MDQC, AVPreserve worked with two organizations to test and implement MDQC in a production setting. The Digital Lab at the American Museum of Natural History applied MDQC in a large-scale image digitization project and used it to greatly expedite their processing workflow. Similarly, the Carnegie Hall Archives used MDQC to rapidly verify whether vendor-generated assets met the preservation quality specified in the statement of work.
The following brief case studies outline how the two organizations implemented MDQC and its effect on their digital asset workflows.
Unsupervised Image Digitization: American Museum of Natural History
Background and practices
The Digital Lab at the American Museum of Natural History (AMNH) is working on an ambitious project to digitize historical photonegatives, with the goal of scanning each one – over one million in total – and making them accessible in a public digital asset management system for research use. Currently, the AMNH digitizes these photonegatives using a volunteer force that generates roughly 200-300 images per week, in tandem with a small team of archivists performing quality control and image processing. Due to the nature of volunteer labor, changing standards over time, and turnover, quality control is tremendously important to the Digital Lab’s project. Traditionally, it was performed on a per-image basis: scans were loaded into Photoshop and assessed visually and technically. This process was slow and repetitive, and it became a bottleneck in the imaging workflow.
Selection and implementation
AVPreserve selected the Digital Lab as a pilot partner for MDQC, as its scenario was ideal for testing and implementing the tool. The Digital Lab was able to set its imaging quality standard for resolution, color space, file format, compression, and bits per sample. While this does not capture every aspect of image quality control (a brief visual inspection is still needed for alignment, cropping, and the like), it allows for rapid automated testing against basic technical standards. This tremendously expedites the image review step in the digitization workflow, as images can now be assessed hundreds at a time for technical quality.
One area in which MDQC had unexpected success was legacy asset management. The Digital Lab, when first embarking on its project, did not have established standards or workflows for its volunteer scanning efforts. As such, there were an overwhelming number of images – approximately sixty thousand – created without a standard specification in mind. These images may or may not meet the current standard, and may or may not need to be reprocessed. Manually performing quality control on these legacy images would be arduous and a poor use of staff time, and would create a backlog among the new images that require quality control every day. By automating technical quality control, MDQC has allowed the Digital Lab to bring these legacy assets under control. The archivist can encode the current imaging standard in a rule template and apply it across thousands of images at once, automatically sorting images that meet the specification from those that do not. As of this writing, MDQC has helped the Digital Lab bring three thousand legacy assets forward into their workflow, saving the Lab weeks of labor.
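As an illustration of this kind of collection-level triage (again a hypothetical sketch rather than MDQC itself, reusing the check_asset() function from the earlier example), a rule template can be applied recursively to a directory of scans and the results written to a simple report:

# Hypothetical collection-level triage, reusing the check_asset() sketch above.
# Paths and the report format are illustrative, not the Digital Lab's own.
import csv
from pathlib import Path

def triage_collection(root, report_path="qc_report.csv"):
    """Check every TIFF under root and write a pass/fail report."""
    passed, failed = [], []
    with open(report_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "status", "problems"])
        for tif in sorted(Path(root).rglob("*.tif")):
            problems = check_asset(str(tif))
            if problems:
                failed.append(tif)
                writer.writerow([tif, "FAIL", "; ".join(problems)])
            else:
                passed.append(tif)
                writer.writerow([tif, "PASS", ""])
    print(f"{len(passed)} passed, {len(failed)} need attention")
    return passed, failed

The failing files can then be routed back for rescanning while the passing files continue into the processing workflow.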
Benefits to the organization
MDQC has also allowed the AMNH to expand its digitization services and production processes into new areas. Due to the sheer number of images to be processed, the Digital Lab is always looking for new scanning sources. The Lab has recently implemented project-based digitization, in which researchers scan sub-collections of images both for personal use and to contribute to the general collection. Previously, this was a difficult service to support in production, as it required additional processing and review for collections of images outside the standard workflow and expected weekly volume.
By employing MDQC, the Digital Lab is able to quickly assess researchers' images for baseline quality and bring them into its production workflow. MDQC has also allowed the archivists in the Digital Lab to develop a training program on image processing for interns, as there are now plenty of verified images to work with and prepare for final deposit, and no pressing image review backlog for the staff.
Vendor Performed Mass Digitization: Carnegie Hall
Background and practices
In 2012, the Carnegie Hall Archives launched the Digital Archives Project, a comprehensive digitization program to digitize and store a majority of their media holdings. Due to the scope and speed of the project, the Archives used a vendor service to digitize manuscripts, audio, and video recordings, which were returned in bulk on hard disks. Because the vendor-supplied materials would serve as the digital masters for these assets, the archivists at Carnegie Hall implemented a quality control workflow for returned assets.
Before implementing MDQC, the workflow involved a technician opening each file in Adobe Bridge and comparing its technical metadata against a standard recorded in an Excel spreadsheet. This step was important for guaranteeing that the vendor met the minimum quality standard, but it was also tremendously time-consuming. The archivist estimates that a technician could process 70-100 images per hour; with 35,816 images digitized, quality control for the images alone would have required roughly 400 hours of labor, not to mention the 1,235 audio and 1,376 video assets also in the pipeline.
Selection and implementation
The Digital Archives Project (DAP) was developing backlogs of material to review, making MDQC a natural fit in its production and ingest workflow. The manual step of verifying technical quality could be automated with MDQC by establishing baseline rules (as outlined in the service contract with the vendor) and testing returned assets against those rules. This fit neatly into the Archives' workflow, as returned assets could be scanned in place on the hard drives before further action was taken.
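The same approach extends to the audio and video deliverables. The sketch below assumes the pymediainfo Python wrapper around MediaInfo and uses entirely hypothetical rule values and paths (the actual specifications live in the vendor contract); it shows how returned audio masters could be checked in place on a delivery drive:

# Illustrative sketch assuming the pymediainfo library (a wrapper around
# MediaInfo). The rule values and paths below are hypothetical examples,
# not Carnegie Hall's actual specification.
from pathlib import Path
from pymediainfo import MediaInfo

AUDIO_RULES = {"format": "PCM", "sampling_rate": 96000, "bit_depth": 24}

def check_audio_master(path, rules=AUDIO_RULES):
    """Compare the first audio track's technical metadata against the rules."""
    tracks = [t for t in MediaInfo.parse(path).tracks if t.track_type == "Audio"]
    if not tracks:
        return [f"{path}: no audio track found"]
    track, failures = tracks[0], []
    for field, expected in rules.items():
        actual = getattr(track, field, None)
        if actual != expected:
            failures.append(f"{path}: {field} is {actual}, expected {expected}")
    return failures

# Scan a returned delivery drive in place (the mount point is hypothetical).
for wav in sorted(Path("/Volumes/vendor_delivery").rglob("*.wav")):
    for problem in check_audio_master(str(wav)):
        print(problem)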
Benefits to the organization
By adopting MDQC, the Carnegie Hall Archives tremendously expedited their digitization workflow. Returned batches of digitized assets were assessed for technical quality (resolution, compression, format, colorspace) within minutes instead of weeks or months. While there is still a need for human review of assets (for issues such as digital artifacts and playback problems), that review can be performed more efficiently now that the analysis of technical quality is automated. As a result, the Archives were able to accelerate their workflow and make remarkable headway on this aspect of DAP in a very short time.
CONCLUSIONS
MDQC has allowed our pilot organizations to greatly accelerate their digitization workflows. By automating technical quality control, it has let these organizations focus their time and labor on more fruitful tasks: technicians can concentrate on processing and ingest instead of technical standards, and interns can be trained on more valuable work than the rote checking of technical metadata. Additionally, by expediting the previously slow process of quality control, more assets can move through production than ever before. We are excited about the possibilities of MDQC for increasing digitization throughput and archival productivity.
The most surprising and exciting development from our pilot program was how dramatically MDQC could affect an organization: by automating a tedious and time-intensive task, it opened the door to new services as well as expediting existing ones. The AMNH was able to use MDQC to offer new research services by applying it to patron-generated assets, creating a new source of materials for its digital archive. This was possible because of how quickly MDQC allows assets to be quality controlled – verifying a small batch requires minimal additional work by the archivist and can easily be folded into a daily workflow. We hope that other organizations find similar success with MDQC and are excited to see what springs from it.
MDQC is a free application developed by AVPreserve. Download and further information can be found at https://www.avpreserve.com/avpsresources/tools/, along with many other free digital preservation resources.
ACKNOWLEDGEMENTS
AVPreserve would like to extend special thanks to the following staff and stakeholders for their contributions and generous feedback that made these case studies possible:
Testing Support:
Miwa Yokoyama, Carnegie Hall
Anna Rybakov, AMNH
Jen Cwiok, AMNH
Development Support:
Phil Harvey (Exiftool)
Jerome Martinez (MediaInfo)