My Digitization Vendor Always Sends .md5 Files As Part Of Their Deliverables
December 5, 2015
A file’s checksum is a hash computed from all of the individual bytes that make up the file. A variety of algorithms can be used to perform this computation, e.g., MD5, SHA-1, or SHA-256. For example, an MD5 checksum is a 128-bit value, usually written as a 32-character hexadecimal string that may look something like this: ff839faf604272ba094741b62c7e4254. Because the computation takes into account every byte of a given file, changing any byte will, for practical purposes, produce a different hash. An identical set of bytes will always produce the same hash when processed through the same checksum algorithm.
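To make the byte sensitivity concrete, here is a minimal sketch using Python's standard hashlib module. The sample byte strings and the helper name md5_hex are made up for illustration and have nothing to do with any vendor's tooling.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up sample bytes; only the final byte differs between the two.
original = b"archival audio payload"
altered = b"archival audio payloaD"

print(md5_hex(original))
print(md5_hex(altered))

# Identical bytes always hash to the same value; the altered bytes do not.
print(md5_hex(original) == md5_hex(original))
print(md5_hex(original) == md5_hex(altered))
```

The two printed hashes share no obvious resemblance even though the inputs differ by a single byte, which is what makes the comparison useful for detecting change.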
When exchanging files over a network (or delivering files on hard drives or optical media), comparing a file’s checksum before its transfer over the network to the file’s checksum after it was received helps to verify whether the file was corrupted in transit. If the checksums match, then it is generally safe to assume that the received file is byte-for-byte identical to the original before it was transmitted over the network. If the checksums do not match, then it is an inconsistency that needs to be investigated.
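As a rough sketch of that comparison, assuming the sender supplied the expected MD5 as a hexadecimal string, the receiving side could recompute the hash and compare. The function names file_md5 and transfer_ok are hypothetical, and the file is read in chunks so that large audiovisual files do not have to fit in memory.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_ok(received_path, expected_md5):
    """True if the received file hashes to the checksum taken before transfer."""
    # Normalize case and whitespace, since tools differ in how they print hashes.
    return file_md5(received_path) == expected_md5.strip().lower()
```

A call such as transfer_ok("file1.wav", "ff839faf604272ba094741b62c7e4254") returning False is the inconsistency that needs to be investigated.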
It is common practice to generate checksums for digital files as early as possible in the lifespan of the file, especially when the files are considered important assets or resources (e.g., archival collections, digital library collections, or broadcast content).
Vendors that offer digitization services also recognize the value of checksums, and many organizations that send content to audiovisual vendors for digitization are probably accustomed to receiving files and checksums from the vendor after completion of the job.
For example, maybe you requested MD5 checksums for each digital file and the vendor returns something that looks like this:
file1.wav
file1.wav.md5
file2.wav
file2.wav.md5
file3.wav
file3.wav.md5
…
The challenge that some organizations face when receiving files like this is that verification can be tedious and labor intensive, often performed file by file and frequently by opening the .md5 files in a text editor to retrieve the hash value. Some MD5 applications provide a batch way of generating and verifying .md5 files in a scenario like this. However, there are no standards for the information or formatting contained in a .md5 file, and different applications behave in different ways. Practically speaking, this means that a .md5 file written by one application is unlikely to be readable by another application, and often the application that created the .md5 file is unknown.
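To illustrate the formatting problem, here is a hedged sketch of a tolerant line parser. It covers three layouts seen in practice: the GNU md5sum style ("hash  name" or "hash *name"), the BSD/Mac OS md5 style ("MD5 (name) = hash"), and a bare hash with no file name. Other applications may write something else entirely, and the function name parse_md5_line is made up for this example.

```python
import re

# GNU md5sum style: "<hash>  <name>" or "<hash> *<name>" (a single space also matches).
HASH_FIRST = re.compile(r"^([0-9a-fA-F]{32})\s+\*?(.+)$")
# BSD / Mac OS md5 style: "MD5 (<name>) = <hash>".
BSD_STYLE = re.compile(r"^MD5 \((.+)\) = ([0-9a-fA-F]{32})$")
# Some tools write only the hash, with no file name at all.
BARE_HASH = re.compile(r"^([0-9a-fA-F]{32})$")


def parse_md5_line(line):
    """Return (md5, file name or None), or None if the layout is not recognized."""
    line = line.strip()
    match = BSD_STYLE.match(line)
    if match:
        return match.group(2).lower(), match.group(1)
    match = HASH_FIRST.match(line)
    if match:
        return match.group(1).lower(), match.group(2)
    match = BARE_HASH.match(line)
    if match:
        return match.group(1).lower(), None
    return None


print(parse_md5_line("ff839faf604272ba094741b62c7e4254  file1.wav"))
print(parse_md5_line("MD5 (file1.wav) = ff839faf604272ba094741b62c7e4254"))
```

Returning None for unrecognized lines, rather than guessing, keeps an unknown layout visible as a problem to investigate instead of a silent mismatch.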
At the request of a colleague from the Smithsonian Institution who was receiving such files, I started a simple bash script (currently only for Mac OS) that will evaluate this content and report the results of attendance and fixity, while additionally generating a single checksum/file manifest that can be used instead of the individual .md5 files moving forward (if the organization prefers).
The script is called vmanifest and it is available on AVPreserve’s GitHub repo here for using, forking, and/or adding to in situ.
It is worth noting here that the BagIt specification and associated tools offer a great way to provide attendance and fixity checking. In fact, keep an eye out for one such tool, named Exactly, coming out of our recent collaboration with Doug Boyd at the University of Kentucky. However, in cases where BagIt is not used—such as when you simply receive a directory of files with their checksums in separate, individual files—BagIt-based tools are of no help. But organizations still need to validate the attendance (are all files accounted for?) and the fixity (are all files the same as they were when created?) of the content received.
Current vmanifest Assumptions:
- All files must be in the same directory (with no subdirectories).
- The script assumes that for each content file there is an ancillary .md5 file (raw text) that contains one line in the following structure:
[md5] [file name]
(these two variables are delimited by a single space)
- The script will write the results to a new folder created on the user’s desktop called “md5_verification”.
- The script assumes that MD5 is the algorithm used for the original checksums.
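For readers who want to see what a check under these assumptions might involve, below is a minimal Python sketch. It is not the vmanifest script itself, which is written in bash. The function names, status strings, and manifest path are all hypothetical, and the sketch assumes a flat directory in which each content file has a sidecar named after it with .md5 appended, holding one "[md5] [file name]" line.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(target):
    """Return (file name, status, computed md5 or None) tuples, sorted by name."""
    target = Path(target)
    names = {p.name for p in target.iterdir() if p.is_file()}
    results = []
    for name in sorted(names):
        if name.endswith(".md5"):
            content_name = name[: -len(".md5")]
            # Attendance: a sidecar whose content file never arrived.
            if content_name not in names:
                results.append((content_name, "missing content file", None))
            continue
        sidecar = target / (name + ".md5")
        if not sidecar.is_file():
            # Attendance: a content file delivered without a checksum.
            results.append((name, "missing .md5 file", None))
            continue
        lines = sidecar.read_text().splitlines()
        # Assumed layout: "[md5] [file name]", split on the first space.
        expected = lines[0].split(" ", 1)[0].lower() if lines else ""
        # Fixity: recompute the hash and compare it with the delivered one.
        actual = file_md5(target / name)
        status = "ok" if actual == expected else "checksum mismatch"
        results.append((name, status, actual))
    return sorted(results, key=lambda row: row[0])


def write_manifest(results, manifest_path):
    """Write one "[md5] [file name]" line for each file that passed."""
    with open(manifest_path, "w") as out:
        for name, status, actual in results:
            if status == "ok":
                out.write(f"{actual} {name}\n")
```

Calling verify_directory on the delivery folder and passing the results to write_manifest yields a single manifest in the same "[md5] [file name]" layout, covering only the files that passed both checks.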
Current Known To-Do’s:
- Existence of .DS_Store files in the target folder may cause the script to report a failure, unless there is a corresponding .md5 file for the particular .DS_Store file.
- Not heavily tested for possible errors. The script was written as a demonstration to solve a specific use case. It may need to be adjusted as the use cases begin to vary.
A set of test files (test.zip [26.7 MB .zip file]) can be downloaded to your local environment in order to test the functionality of vmanifest. Read more here: vmanifest GitHub repo.