
Archives Week, Addendum: More Notes on Technique

by Alex Wellerstein, published December 26th, 2011

My post on Day 2 of Archives Week got a few people asking me if I could elaborate on my post-processing methods for all those photos I take — the conversion from JPEG to PDF that I hinted at.

I've played with a few different ways of doing this, from the very simple to the reasonably sophisticated, and have come up with a way that, in the end, is "good enough" in the sense that it is easy, saves me time, and does a fine enough job.

Warning: this is a long post! There's a document at the end of it, though, for those of you who don't care much about how I make the files.

Step 1: Take the photos

First a brief refresher: as I said in the just-linked-to post, I take basically no citation information down while I am photographing. I get around this because I photograph the side of each box before I use it, and then the title of each folder before I photograph its contents. This takes a lot of discipline — miss one box, or one folder, and you can't cite anything. But if you can trust yourself to do it, it saves eons of time and hassle.

A refresh: the basic photographing technique.

After photographing a given folder, I will usually then just download the contents of the camera to my computer. I used to wait until I was done with lots of folders and then sort them out later, and you can do that. I just found it's easier if I download each folder as I look at it — it saves on the post-processing time (less sorting).

So each folder gets a directory, named after its box and folder title. These are then stored under whatever the big series title is. So for example, I've been looking at the Joint Committee on Atomic Energy (JCAE) records, in Series 2, their General Correspondence file. One of my folders in Box 60 is titled "Thermonuclear Program." So my directory tree looks like this:

(Yes, I'm using a Mac. That shouldn't matter for anything I'm describing though, and these days I usually do the post-processing on a PC anyway, just because the PC in my office is faster than my Mac.)

Now as you can see, those JPEGs are pretty huge — they are 2592 x 1944 pixels, which is way larger than I really need, and even though I have the camera set to "black and white," the camera still saves them as if they were full RGB images, so they're larger than they ought to be (2.3 MB per image).

(One camera-use pro-tip: make sure that the "auto-rotate" settings are disabled when photographing documents. The camera is usually confused about how you are holding it anyway, when you are taking photos of documents, and what you really want is a consistent rotation, so later you can rotate them all at once. If the camera is guessing about the rotation, it adds a lot of time later.)

Step 2: Prepare the JPEGs

I use two stages of post-processing from this point. Both use proprietary Adobe software. There might be open-source or free ways to do this that don't require learning how to be a computer programmer — I really don't know, but I doubt it. If you have Adobe Photoshop and Adobe Acrobat Professional, though, the method below is super easy.

First we use Photoshop. Our goal here is to cut down the file sizes of the images. My method of doing this is pretty crude: 1. convert them to "true" grayscale (since I'm not using color anyway), 2. adjust the "levels" of the photograph to make the whites whiter and the blacks blacker (this is where the consistent lighting I mentioned in the previous post pays off), 3. reduce the overall dimensions of the image to something more reasonable (I find 1080 x 1440 pixels to be good enough for OCR, or about 120 dpi), and, 4. save it as a JPG with some medium-level compression (which is good enough for OCR and research use, and cuts the file sizes down considerably).

The end result is files that are only 300KB or so per page, which is much more manageable (not just in terms of hard drive space, but in terms of rendering in your PDF reader as well). Of course, they're highly reduced in quality (purposefully so). So I usually keep "untouched" copies of all the original photos somewhere else as well, in case I do need them in the future.

How does one perform such miracles? I use Photoshop Actions, which are little recordable macros. There are lots of tutorials on the web on making and using Actions in Photoshop; a pretty straightforward one by Adobe is here. Basically I just "record" myself doing it once, and then have the option to "replay" back the same image manipulations on the entire directory of images.

These are the Action settings I use, in order (all command paths given are for Photoshop CS5 Extended for the Mac, but they should exist in all versions of Photoshop, somewhere):

  1. Convert mode to Grayscale. (Image > Mode > Grayscale)
  2. Auto levels (makes the lightest point pure white, the darkest point pure black, and stretches everything in between). (Image > Adjustments > Levels > Auto)
  3. Resize to something sensible. (Image > Image Size > Whatever the size you want. I adjust the DPI setting for the size, because then it doesn't matter if the image is inadvertently rotated by the camera. So I use 100 DPI, which on my camera is 1080 x 1440 pixels, which is plenty for the PDFs.)
  4. Save the file as a JPEG with a suitable level of compression (e.g. high or medium level of compression quality; I usually save it with a quality of 5 or 6). (File > Save > JPEG > 6 quality).

The last step is there because otherwise when you run Actions on the entire folder, Photoshop will demand that you tell it what level of JPEG compression you want for every file. This gets tiring when running on hundreds of files. If you make it explicit, you don't have to do that.
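For anyone who wants to check the resize numbers from step 3 above, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration. The 180 DPI figure it prints is only inferred from the numbers in this post (a 1944-pixel side becoming 1080 pixels at 100 DPI); it is not a setting the post states.

```python
def target_size(width, height, short_side=1080):
    """Scale (width, height) so the shorter side equals short_side.

    The aspect ratio is kept, and the result is the same whether the
    camera saved the frame as landscape or portrait.
    """
    shorter = min(width, height)
    return (round(width * short_side / shorter),
            round(height * short_side / shorter))


def implied_source_dpi(original_short, resized_short, resized_dpi):
    """DPI tag the original must carry if changing only the DPI to
    resized_dpi shrinks original_short pixels to resized_short pixels."""
    return resized_dpi * original_short / resized_short


print(target_size(2592, 1944))              # landscape frame
print(target_size(1944, 2592))              # same frame, rotated
print(implied_source_dpi(1944, 1080, 100))  # inferred, not stated in the post
```

Both orientations come out with a 1080-pixel short side and a 1440-pixel long side, which is the point of resizing by DPI rather than by a fixed width and height.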

I name my action "Prepare for PDF," which seems straightforward enough. Once I've got the Action recorded and saved, I can then run it on a folder with a Batch operation (File > Automate > Batch). I can run it on the whole collection of folders (with the "Include All Subfolders" option), or on specific folders. If you check the "Override Action Save As Commands" option, the JPEG settings from your recorded "Save" step (#4 above) are still applied, but the files are written to the destination chosen in the Batch dialog rather than wherever the Action was recorded.

My Batch settings, using my custom Action.
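Before running a batch over a whole collection with subfolders included, it can help to see what it is about to touch. The following is a small Python sketch, separate from Photoshop, that walks a directory tree and groups the JPEGs by folder. The function name and the suffix list are assumptions for illustration.

```python
from pathlib import Path

# File extensions treated as JPEGs (compared case-insensitively).
JPEG_SUFFIXES = {".jpg", ".jpeg"}


def jpegs_by_folder(root):
    """Map each directory at or below root to a sorted list of its JPEG files."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in JPEG_SUFFIXES:
            found.setdefault(path.parent, []).append(path)
    return found


def summarize(root):
    """Return one 'folder: N JPEGs' line per folder that contains JPEGs."""
    return [f"{folder}: {len(files)} JPEGs"
            for folder, files in jpegs_by_folder(root).items()]
```

Calling `summarize` on the top-level collection directory gives a quick page count per folder, which is also a cheap way to spot a folder that was never downloaded from the camera.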

I then run it on the whole group, which converts my initial folder of JPEGs into something much smaller. Generally the JPEGs produced by this set of Actions are about 250-550KB in size, down from 2.4MB each. So that's a pretty significant size reduction. From the point of view of actual research, I find the final product fairly improved:

A sample page, before and after processing, a quarter of actual size (just to preserve bandwidth).

You can see above why I keep harping on even lighting: that little shadow on the bottom of the page has become much darker with the auto leveling, even as the levels have caused the "paper" part to become much lighter, bringing out the text contrast better.
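To see why a shadow gets worse, here is a simplified Python model of auto levels as a plain linear stretch. It is a sketch under stated assumptions, not Photoshop's actual algorithm (which, among other things, clips a small fraction of extreme pixels), and the three gray values are hypothetical.

```python
def auto_levels(pixels):
    """Linearly stretch 8-bit gray values so the darkest becomes 0
    and the lightest becomes 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# Hypothetical gray values: ink, a shadowed patch of paper, evenly lit paper.
sample = [90, 170, 230]
print(auto_levels(sample))  # [0, 146, 255]
```

The evenly lit paper goes from 230 to pure white, while the shadowed patch drops from 170 to 146. The gap between the two nearly doubles, so a shadow that was barely visible in the original becomes an obvious gray smudge.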

Step 3: Make the PDFs

Now that I have a whole folder of reduced-size, adjusted JPEGs, I convert them into PDFs. This is fairly simple. In Adobe Acrobat Professional, you need only find the "Create PDF > Merge Files into Single PDF" command (the location and naming can vary based on your version of Acrobat), and add the entire directory to it. Then click "Begin." This will then create one big PDF of all of your JPEGs of the selected directory and ask you to save it — I always save it with the same name as the directory.

I then rotate all of the images as necessary (Document > Rotate Pages on mine), so they are all the correct orientation. (I usually do a quick run-through of the file to make sure they are all facing the right direction.)

Finally, I run the OCR command on them ("Recognize Text Using OCR") to make them semi-searchable. This is time consuming for large files, so I usually run a batch OCR command on the entire directory of PDFs. How to do this varies quite a bit with your version of Acrobat (Adobe is nothing if not inconsistent across versions of this particular program), but Googling "batch OCR Acrobat X" (or whatever your version is) will turn up useful answers.

Having created the PDFs, I can get rid of the directories of compressed JPEGs. (I've still got the originals kept somewhere for safe-keeping.)
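If deleting things by hand makes you nervous, a short script can at least confirm that every JPEG directory has a PDF before you remove it. This Python sketch assumes each PDF sits next to its directory and carries the same name plus ".pdf"; that layout, and the function name, are assumptions. It only reports and deletes nothing.

```python
from pathlib import Path


def folders_safe_to_remove(collection_dir):
    """Return subdirectories that have a non-empty PDF of the same name
    sitting beside them in collection_dir."""
    collection = Path(collection_dir)
    safe = []
    for folder in sorted(p for p in collection.iterdir() if p.is_dir()):
        # Append ".pdf" to the full name so titles containing dots survive.
        pdf = collection / (folder.name + ".pdf")
        if pdf.is_file() and pdf.stat().st_size > 0:
            safe.append(folder)
    return safe
```

Any directory missing from the returned list either has no PDF yet or has an empty one, and is worth a second look before it goes in the trash.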

Step 4: Use the PDFs

At the end of this process, I've now got a directory of PDFs. Remember that I've saved my directory structure along the following lines: Archive > Collection > PDFs, where the PDFs have the name of "Box X - Folder Title." This allows me to very quickly retrace the citation information for this particular document, and the PDFs are much easier to deal with than huge folders of JPEGs.

The final product: orderly directories that correspond to archival collections, with PDFs that correspond to specific folders.
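Because the citation lives in the path, it can also be recovered mechanically. Here is a minimal Python sketch that splits an "Archive > Collection > Box X - Folder Title" path into its parts. The directory names in the example are hypothetical, as are the function and field names.

```python
import re
from pathlib import PurePosixPath

# Matches file names of the form "Box X - Folder Title".
PDF_NAME = re.compile(r"^Box (?P<box>\S+) - (?P<folder>.+)$")


def citation_from_path(pdf_path):
    """Split 'Archive/Collection/Box X - Folder Title.pdf' into citation fields."""
    path = PurePosixPath(pdf_path)
    if len(path.parts) < 3:
        raise ValueError(f"expected Archive/Collection/file.pdf, got: {pdf_path}")
    match = PDF_NAME.match(path.stem)
    if match is None:
        raise ValueError(f"not in 'Box X - Folder Title' form: {path.name}")
    return {
        "archive": path.parts[-3],
        "collection": path.parts[-2],
        "box": match.group("box"),
        "folder": match.group("folder"),
    }


# Hypothetical path, for illustration only.
print(citation_from_path("NARA/JCAE Series 2/Box 60 - Thermonuclear Program.pdf"))
```

A file that doesn't follow the naming pattern raises an error instead of producing a half-filled citation, which mirrors the discipline the photographing stage depends on.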

The final PDFs created with this method have a file size of approximately 350KB per page. So one of my folders contained some 400 pages of documents, and is now a 145MB sized PDF. That's large, but still very manageable on today's computer hardware. When I switch pages in the PDF, I don't have to wait at all for them to render, and that's a key thing.
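The size figures above are easy to sanity-check. This is a back-of-the-envelope Python sketch, assuming decimal units (1 MB = 1000 KB) and ignoring any overhead from the PDF wrapper or the OCR text layer; the function names are illustrative.

```python
def estimated_pdf_mb(pages, kb_per_page=350):
    """Rough PDF size in MB for a given page count and average page size."""
    return pages * kb_per_page / 1000


def kb_per_page(pdf_mb, pages):
    """Average per-page size in KB implied by a finished PDF."""
    return pdf_mb * 1000 / pages


print(estimated_pdf_mb(400))   # 140.0
print(kb_per_page(145, 400))   # 362.5
```

So roughly 400 pages at about 350KB each lands within a few percent of the 145MB file, which is as close as round numbers like these can be expected to get.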

Once you have the PDFs, you can do lots of nice things with them, like using bookmarks to find specific pages again, or taking notes, or what have you. I use a custom database system for keeping track of the documents within the PDFs, which isn't something you are going to want to waste a lot of time setting up (I already have a lot of "sunk costs" with regards to database programming), but the point is, a PDF-based system is a lot more flexible than a JPEG-based system. And again, once you've got the PDFs set up, not only can you start to use them, but it's relatively easy to retrace your citation data, even though you didn't record it while you were at the archive.

Here's an honest-to-god example of what the final product's quality looks like. This is from my most recent archival jaunt, with no special editing done to it. The document is from the JCAE staffer J. Kenneth Mansfield to JCAE Chairman Senator Stirling Cole, October 2, 1953. It contains Mansfield's recommendations on possible steps to take with regards to further hydrogen bomb development, specifically the question of whether the AEC was deficient in its attempts to make H-bombs. Mansfield describes the AEC effort as "incredibly shabby" on the whole, and emphasizes the possibility of Fuchs giving the information to the Soviets. It's all part of the late-1953 effort to discredit the anti-H-bomb faction of the AEC, a precursor to the salvos launched against Oppenheimer.[1]

Click for the full PDF.

As you can see, the final product is small (377KB for three pages), searchable (not perfectly so, but it's something), and very readable. It's not print-ready, but it's fine for research. Since I've saved the original JPEGs, there is no loss by having reduced the quality.

Is it a perfect system? Not really. Is it reasonably easy, and does it facilitate my research? It does. And this, in the end, is what any computer-assisted system of this sort needs to do: help the human being on the other end of it!

  1. Citation: J. Kenneth Mansfield to Stirling Cole, "Possible steps to be taken in the thermonuclear field," (2 October 1953), in Records of the Joint Committee on Atomic Energy, RG 128, National Archives and Records Administration, Washington, D.C., Series 2: General Subject Files, Box 58, "Thermonuclear Program."
