Data Is A Simple Machine

By: AVP
January 10, 2014

This post is a (I hope) friendly response to (or perhaps more a restatement of position regarding) Megan McShea’s response to a prior blog post of mine posted on archivesnext.com regarding the use and effectiveness of EAD. I love what Megan wrote, but as the Al Chet prayer of Yom Kippur reminds us, there are many sins of pride and stiff-neckedness that push one to respond in such situations.

*****

I had all four of my wisdom teeth extracted in one fell swoop when I was 19. My dentist assured me I could get through it without going under — just Novocain (and maybe a valium to calm the nerves). Dr. Payne — no joke — said he could just numb up my mouth, have at the offending teeth, and send me on my merry way without the need to recover from anesthesia. I was 19 and interested in saving money on the whole procedure, so I went along with his plan. I mean, really, why would Dr. Payne steer me wrong?

I reckon the whole thing wasn’t all that bad — he used a lot of Novocain — but I distinctly recall the odd sensation of mentally feeling teeth being pulled out of my head but not physically feeling it. Something to make me go “hmmm” while also needing to strongly hold back the urge to stop what was happening and/or gag.

I also recall the point at which I realized Dr. Payne was repurposing my forehead as a fulcrum, using the leverage of his arm against my skull to help work the teeth out. Yes, I thought, my head is a simple machine.

*****

This is still the case.

*****

Megan, I appreciate your thoughtful and extensive response to — or smackdown of — my blog post on EAD. If I had thought anyone would read and respond to my blogs I might’ve changed my mind about posting them! Really though, I think your point of view as a practitioner in an archive is valuable in opposition to mine as a consultant. The organizational and resource realities of archives are always foremost in my mind in my work as I try to find the point where the ideal, the possible, and the practical can meet and shake hands. I think the policies and processes the AAA has in place are at the high end of the curve on that graph, and it’s not easy to argue against your points because, well, you’re right.

But I still think you’re wrong.

I agree that, at the base of it, we’re talking across communities of practice. I deal with many broadcast- or production-centric collections, which produce a specific set of asset types and present use cases that may be less researcher-focused than those of academic or research archives. A finding aid is fairly useless in this situation because it is not specific enough to identify and locate the correct item amidst what may be multiple versions, instantiations, and fragments, the same way one would need to be able to discern among manuscripts and various print editions (perhaps with major revisions or printing errors) of written material.

But I also think that all collections are now mixed collections (as all archivists are digital archivists) and it’s making less sense to make such distinctions. Also, many of those production-based collections eventually end up in traditional or institutional archives because they were produced internally by the organization or, at some point, the television station or documentarian or radio producer ends up donating their materials somewhere. That results in the ingest of boxes and boxes of (for example) camera originals, U-matic or Betacam film transfers, rough cuts, alternate audio mixes, masters, viewing copies on VHS or DVD, and perhaps some commercial distribution copies. As in the production environment, item-level processing is really the only way to tell what’s what and find the right pieces to preserve or transfer for access.

Again, yes, this case does not apply across the board if a collection is primarily published releases or one-off recordings. But with unique recordings, more frequently than not I see either materials that are not annotated well enough (if at all) to be identified, or well-annotated materials that get stuck in the backlog and are never moved into the preservation queue because they are undescribed or unreckoned with.

And that really gets to the crux of my concern: whether dealing with a media-specific collection or mixed collection, audiovisual (and all digital) media needs to be made accessible either through reformatting or through the acquisition of proper playback equipment. Equally so, whether dealing with a media-specific collection or mixed collection, audiovisual (and all digital) media that is considered of lasting value needs to be preserved through reformatting or migration at some point in its lifecycle.

In dealing with magnetic media and certain other physical formats, the deadline for that reformatting is sooner rather than later. If you are not performing the activities that move those materials into a preservation queue and reformat them in the next 10-15 years, you are essentially discarding them. EAD does not meet this level of need. In my view the reliance on EAD has resulted in it becoming an endpoint or cul-de-sac, not a pivot point, a situation that much research is now trying to dig out of in order to take advantage of how data is used today.

*****

Now I totally understand researcher-driven preservation. When resources are limited, of course it makes sense to funnel them where they best support the institutional mission. Though I know it happens, I seldom see the individual items make it into MARC catalogs (with notable exceptions, such as a number of research collections at NYPL). Description via EAD is of course considered a driver of that process by making collections more findable and promoting researcher requests. I see a few problems here.

– Poorly annotated items can only be described at the collection name level with little clue as to actual content. Even if a minimal finding aid at the collection level is created, it will not necessarily give a clue to the contents. If we were looking for some rare Doris Wishman outtakes, without extra information would we necessarily know that the Mike Vraney Collection may contain them? Or that there may be home movie footage of a historical event mixed in with the primary type of content in such a collection if I do know the general scope?

– Just because someone is researching a topic does not make it of significant historical value. It may just as easily be an esoteric one-off or a wrong turn down a blind alley. That’s fine if materials are already accessible, but problematic if resources must first go to reformatting.

– Not all collections are research collections, and even in research collections the use cases for search and potential users may rely more on keywords, time-based metadata, transcripts, dates, etc.

– We simply don’t have that long to wait around.

Cumulatively these factors make the reactive position untenable. Of course we cannot (nor should we) preserve everything, but with the volume, time constraints, and severe findability/access issues with audiovisual and digital materials, we are at risk of slashing and burning much more than we need to or mean to.

*****

One of my regrets prior to and after posting about EAD was that the post didn’t quite express my original thought: not that EAD should be completely jettisoned, but that data collection should be granular and flexible. EAD has its place and its uses, but I often get the sense of a hyperfocus on generating EAD that comes at the expense of collecting data in a way that would support finding aids as well as the other activities that support collection management. The dataset generated should be eminently deployable in its fundamental form. The information should be in a holding ground of sorts where an EAD finding aid, a MARC record, a viewer/media player, or some other online portal pulls the data it needs and maps it per its use in that form.
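To make that holding-ground idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `ItemRecord` fields, the function names, and the sample values are made up, and the EAD and MARC outputs are heavily simplified stand-ins rather than valid, complete records. It only shows the direction of flow, with one neutral item-level record and several outputs that each pull what they need.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class ItemRecord:
    """One item-level record in the neutral store (hypothetical fields)."""
    item_id: str
    title: str
    date: str
    media_format: str
    duration_seconds: int
    file_path: str


def to_ead_component(item: ItemRecord) -> str:
    """Map the record into a simplified EAD-style <c> component (not a full finding aid)."""
    c = ET.Element("c", {"level": "item"})
    did = ET.SubElement(c, "did")
    ET.SubElement(did, "unitid").text = item.item_id
    ET.SubElement(did, "unittitle").text = item.title
    ET.SubElement(did, "unitdate").text = item.date
    ET.SubElement(did, "physdesc").text = item.media_format
    return ET.tostring(c, encoding="unicode")


def to_marc_like(item: ItemRecord) -> dict:
    """Map the record into a dict keyed by MARC tags (title in 245, physical description in 300)."""
    minutes = item.duration_seconds // 60
    return {
        "245": {"a": item.title},
        "300": {"a": f"1 {item.media_format} ({minutes} min.)"},
    }


def to_player_payload(item: ItemRecord) -> dict:
    """Map the record into what a hypothetical media player would need."""
    return {
        "id": item.item_id,
        "title": item.title,
        "src": item.file_path,
        "duration": item.duration_seconds,
    }


# Made-up sample item; none of these values come from a real collection.
item = ItemRecord(
    item_id="AV-0001",
    title="Rough cut, reel 2",
    date="1987-03-14",
    media_format="U-matic",
    duration_seconds=1800,
    file_path="/media/av-0001.mov",
)

print(to_ead_component(item))
print(to_marc_like(item))
print(to_player_payload(item))
```

The technical fields (duration, file path) never need to fit into the finding aid; they stay in the record for the player and for collection management, while the EAD mapping takes only the descriptive pieces.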

Despite the frothy-mouthed proclamations over Big Data or the mystical gurus of data like Nate Silver and the Obama reelection campaign, data is a simple machine. It is nothing more than chunks of information (or more often, per the axiom, chunks of garbage) of varying length, moving in and out of systems. The complexities of data are merely a result of the Rube Goldbergian structure of schemas or relational tables (fragile systems of levers and pulleys performing various tasks) or a result of garbage-y, uncontrolled data.

If data is collected in a manner or form that privileges the endpoint (as opposed to a step or a goal stage), we risk getting stuck in that endpoint. I appreciate that EAD is an open standard and the XML structure allows portability, and the projects you list out are great examples of that use. But I also know a number of developers who hate working with it and have been stymied in creating automated upload/mapping utilities for EAD because the application of the standard is frequently idiosyncratic. Also, when dealing with audiovisual and digital media, technical and process metadata is integral to collection management. Description is just one piece of supporting archival services.

In short, yes there are people mapping data out of EAD, but in my opinion we need to flip things and not just try to find solutions for porting out of EAD because that’s what we’re stuck in, but rather use EAD as something that is ported into as just one option for archival activities.

When MARC goes away, it’s going to be incredibly messy, both from the data migration standpoint and the resistance standpoint. And EAD will go away, too. First, because at some point you cannot keep revising a data structure for new media, new situations, and new technology without it eventually falling apart. At some point things need to be scrapped and begun anew. Second, because things go away. It is the nature of life and, therefore, of archives.

— Joshua Ranger