MDQC Case Studies
14 March 2014
By Alex Duryee
You can also download a PDF version of these case studies here.
INTRODUCTION
The sheer quantity of digital assets created in migration projects and large-scale ingests often overwhelms library staff. Outside of digitization on demand, objects are typically digitized at scale in order to capitalize on efficiencies of volume. In such cases it is not uncommon for small archival teams to handle many thousands of digital assets, each of which must pass through an ingest workflow. The most important part of that workflow, quality control of incoming preservation masters, is often the most time-consuming step for digital archivists, since assets are typically reviewed manually at the item level. A bottleneck forms as the rate at which quality control is performed falls behind the rate at which newly digitized assets are created or digital collections are acquired.
Quality verification also tends to be an ineffective use of staff time: however important, it is tedious work that ties up skilled labor. Digitization projects and departments can sink unanticipated amounts of valuable time and resources into item-level quality control, detracting from other services (both real and potential). All told, asset quality control is a step in archival workflows that is ripe for improvement.
Tool development
AVPreserve developed MDQC (Metadata Quality Control) to address these bottlenecks and expedite digitization workflows. MDQC is an open source tool, built on ExifTool and MediaInfo, that allows users to set baseline rules for digital media asset quality (such as resolution, frame rate, and color space) and embedded metadata (such as date formatting, completed fields, and standard values). Once a set of rules is created, it can be applied across an entire collection at once, reporting any assets that fail to meet the quality standard (e.g. wrong color space, below minimum resolution, gaps in descriptive metadata, wrong sample rate). From these reports, which require almost no staff time to generate, an archivist can separate problematic assets from those that meet the required specifications. MDQC thus tremendously expedites the quality control of digital media collections, replacing a manual item-level task with an automated collection-level one.
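To illustrate the general approach, here is a minimal sketch of rule-based quality control in Python. This is a sketch of the concept, not MDQC's actual implementation: it extracts technical metadata with ExifTool's JSON output and compares each file against a baseline rule set. The tag names and target values are hypothetical examples.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical baseline rules for preservation master images;
# keys are ExifTool tag names, values are the required values.
RULES = {
    "FileType": "TIFF",
    "Compression": "Uncompressed",
    "BitsPerSample": 16,
    "XResolution": 600,
}

def check_assets(directory):
    """Run ExifTool recursively over a directory and report
    every asset that breaks at least one rule."""
    output = subprocess.run(
        ["exiftool", "-json", "-r", str(directory)],
        capture_output=True, text=True, check=True,
    ).stdout
    failures = {}
    for record in json.loads(output):
        broken = [
            f"{tag}: expected {want!r}, got {record.get(tag)!r}"
            for tag, want in RULES.items()
            if record.get(tag) != want
        ]
        if broken:
            failures[record["SourceFile"]] = broken
    return failures

if __name__ == "__main__":
    for path, problems in check_assets(Path("masters")).items():
        print(f"{path}: {'; '.join(problems)}")
```

A report like this replaces item-level inspection: staff only need to look at the files that appear in the failure list.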
CASE STUDIES
Overview
During the development of MDQC, AVPreserve worked with two organizations to test and implement it in a production setting. The Digital Lab at the American Museum of Natural History applied MDQC in a large-scale image digitization project and used it to greatly expedite its processing workflow. Similarly, the Carnegie Hall Archives used MDQC to rapidly verify that vendor-generated assets met the preservation quality specified in the statement of work.
The following brief case studies outline how the two organizations implemented MDQC and its effect on their digital asset workflows.
Unsupervised Image Digitization: American Museum of Natural History
Background and practices
The Digital Lab at the American Museum of Natural History (AMNH) is working on an ambitious project to digitize historical photonegatives, with the goal of scanning each one – over one million in total – and making them accessible in a public digital asset management system for research use. Currently, the AMNH digitizes these photonegatives using a volunteer force, which generates roughly 200–300 images per week, in tandem with a small team of archivists performing quality control and image processing. Due to the nature of volunteer labor, changing standards over time, and turnover, quality control is tremendously important to the Digital Lab's project. Traditionally it was performed on a per-image basis: each scan was loaded into Photoshop for visual and technical assessment. This process was slow and repetitive, and it became a bottleneck in the imaging workflow.
Selection and implementation
AVPreserve selected the Digital Lab as a pilot partner for MDQC, as its scenario was ideal for testing and implementing the tool. The Digital Lab was able to set its imaging quality standard for resolution, color space, file format, compression, and bits per sample. While this does not capture every aspect of image quality control (a brief visual inspection is still needed for alignment, cropping, and the like), it allows for rapid automated testing against basic technical standards. This tremendously expedites the image review step in the digitization workflow, as images can now be assessed hundreds at a time for technical quality.
One area in which MDQC had unexpected success was legacy asset management. When the Digital Lab first embarked on its project, it did not have established standards or workflows for its volunteer scanning efforts. As a result, an overwhelming number of images – approximately sixty thousand – were created without a standard specification in mind. Any given image may or may not meet the current standard, and may or may not need to be reprocessed. Manually performing quality control on these legacy images would be arduous and a poor use of staff time, and would create backlogs in the newly scanned images that arrive for quality control every day. By automating technical quality control, MDQC has allowed the Digital Lab to bring these legacy assets under control: the archivist can encode the current imaging standard in a rule template, apply it across thousands of images at once, and automatically sort images that meet the specification from those that fail it, as sketched below. As of this writing, MDQC has helped the Digital Lab bring three thousand legacy assets forward into its workflow, saving the Lab weeks of labor.
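A rough sketch of that triage step, reusing the check_assets() function from the earlier example (the directory names here are hypothetical): failing images are moved aside for reprocessing, while everything that passes continues into the workflow.

```python
import shutil
from pathlib import Path

# Assumes the earlier check_assets() sketch is saved as mdqc_sketch.py.
from mdqc_sketch import check_assets

LEGACY = Path("legacy_scans")        # hypothetical backlog location
REVIEW = Path("needs_reprocessing")  # hypothetical holding area
REVIEW.mkdir(exist_ok=True)

# Move every failing image aside for manual reprocessing; whatever
# remains in the legacy directory already meets the current standard.
for path in check_assets(LEGACY):
    shutil.move(path, REVIEW / Path(path).name)
```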
Benefits to the organization
MDQC has also allowed the AMNH to expand its digitization services and production processes into new areas. Due to the sheer number of images to be processed, the Digital Lab is always looking for new scanning sources. The Lab recently implemented project-based digitization, in which researchers scan sub-collections of images both for personal use and to contribute to the general collection. Previously this was a difficult service to offer, as it required additional processing and review for a collection of images outside the standard workflow and expected weekly volume.
By employing MDQC, the Digital Lab can quickly assess a researcher's images for baseline quality and bring them into its production workflow. MDQC has also allowed the archivists in the Digital Lab to develop a training program on image processing for interns, as there are now plenty of verified images to prepare for final deposit and no pressing backlog of images awaiting staff review.
Vendor Performed Mass Digitization: Carnegie Hall
Background and practices
In 2012, the Carnegie Hall Archives launched the Digital Archives Project, a comprehensive program to digitize and store the majority of their media holdings. Due to the scope and speed of the project, the Archives used a vendor service to digitize manuscripts, audio, and video recordings, which were returned in bulk on hard disks. Because the vendor-supplied materials would serve as the digital masters for these assets, the archivists at Carnegie Hall implemented a quality control workflow for returned assets.
Before implementing MDQC, this workflow involved a technician opening each file in Adobe Bridge and comparing its technical metadata against a standard recorded in an Excel spreadsheet. This step is important for guaranteeing that the vendor met the minimum quality standard, but it is also tremendously time-consuming. The archivist estimates that a technician could process 70–100 images per hour; with 35,816 images digitized, quality control would have required roughly 400 hours of labor for the images alone, not to mention the 1,235 audio and 1,376 video assets also in the pipeline.
Selection and implementation
The Digital Archives Project (DAP) was developing backlogs of material to review, making MDQC a natural fit for its production and ingest workflow. The manual step of verifying technical quality could be automated with MDQC by establishing baseline rules (as outlined in the service contract with the vendor) and testing returned assets against them. This fit neatly into the Archives' workflow, as returned assets could be scanned in place on the hard drives before any further action was taken.
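The same pattern extends to time-based media via MediaInfo. As a sketch only, assuming the pymediainfo Python bindings and hypothetical contract values and paths, an archivist could verify vendor-returned audio in place on a delivery drive like so:

```python
from pathlib import Path

from pymediainfo import MediaInfo  # pip install pymediainfo

# Hypothetical contract specifications for preservation audio masters.
AUDIO_SPEC = {"format": "PCM", "sampling_rate": 96000, "bit_depth": 24}

def verify_delivery(drive_root):
    """Scan a vendor delivery drive in place and report files
    whose audio track misses the contracted specification."""
    problems = {}
    for wav in Path(drive_root).rglob("*.wav"):
        for track in MediaInfo.parse(str(wav)).tracks:
            if track.track_type != "Audio":
                continue
            broken = [
                f"{field}: expected {want}, got {getattr(track, field)}"
                for field, want in AUDIO_SPEC.items()
                if getattr(track, field) != want
            ]
            if broken:
                problems[str(wav)] = broken
    return problems

if __name__ == "__main__":
    # Hypothetical mount point for a returned vendor drive.
    for path, issues in verify_delivery("/Volumes/VENDOR_DRIVE").items():
        print(f"{path}: {'; '.join(issues)}")
```

Because the files are checked where they sit on the delivery drive, nothing has to be copied before a batch can be accepted or rejected.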
Benefits to the organization
With MDQC, the Carnegie Hall Archives tremendously expedited their digitization workflow: returned batches of digitized assets are now assessed for technical quality (resolution, compression, format, color space) within minutes instead of weeks or months. While human review of assets is still needed (for issues such as digital artifacts and playback problems), it can be performed far more efficiently once the analysis of technical quality is automated. As a result, the Archives were able to accelerate their workflow and make remarkable headway on this aspect of DAP in a very short time.
CONCLUSIONS
MDQC has allowed our pilot organizations to greatly accelerate their digitization workflows. By automating technical quality control, it has freed these organizations to focus their time and labor on more fruitful tasks: technicians can concentrate on processing and ingest instead of technical standards, and interns can be trained on more valuable work than the rote checking of technical metadata. Additionally, by expediting the previously slow process of quality control, more assets can move through production than ever before. We are excited about the possibilities of MDQC for increasing digitization throughput and archival productivity.
The most surprising development from our pilot program was how dramatically MDQC could affect an organization: by automating a tedious, time-intensive task, it opened the door to new services while expediting existing ones. The AMNH used MDQC to offer new research services by applying it to patron-generated assets, creating a new source of materials for its digital archive. This came about because of how quickly MDQC allows assets to be quality controlled – verifying a small batch requires minimal additional work from the archivist and can easily be done as part of a daily workflow. We hope that other organizations find similar success with MDQC, and we are excited to see what springs from it.
MDQC is a free application developed by AVPreserve. Download and further information can be found at https://www.avpreserve.com/avpsresources/tools/, along with many other free digital preservation resources.
ACKNOWLEDGEMENTS
AVPreserve would like to extend special thanks to the following staff and stakeholders for their contributions and generous feedback that made these case studies possible:
Testing Support:
Miwa Yokoyama, Carnegie Hall
Anna Rybakov, AMNH
Jen Cwiok, AMNH
Development Support:
Phil Harvey (ExifTool)
Jerome Martinez (MediaInfo)