As is often the case with DH projects, my current endeavor has involved going back to an older research project and resurrecting and reformatting ideas and materials that I haven’t looked at in a while.
A few years ago I started working with the journal Novosel’e (I’ll write another post about my research question and its evolution). I was able to physically locate the entire run – 35 issues bound together in 7 volumes – and spent hours (hours) at the Electronic Text Services Center at Columbia University (now the Digital Humanities Center) scanning every single page.
So when I came back to this project a few months ago, I had PDFs of every issue at my disposal and didn’t have to go through the time-consuming process of creating them. But what to do with these PDFs? What is the most efficient and effective way to make them machine-readable and machine-actionable?
Luckily I could ask my multitalented colleague at the Princeton CDH, Ben Johnson, for advice. He sat down with me and helped me figure out how to extract clean text from those PDFs – text that could be plugged into, processed, and analyzed by the programs and tools I’m interested in.
OCR produces the best text when the source images are black and white (not color or greyscale) and scanned at a resolution of about 300dpi. Cleaning up the images in image-editing software (cropping out edges and noise in the margins) also helps; a quick sketch of that kind of preprocessing is below. I chose not to spend the time cropping the text – my images seemed clean enough.
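If you’d rather script the cleanup than do it by hand in an image editor, here is a minimal preprocessing sketch using the Pillow library. The filenames and the crop box are hypothetical and would need tuning for your own scans.

```python
# A minimal preprocessing sketch, assuming the Pillow package is installed.
# Filenames and crop margins are hypothetical; adjust per scan.
from PIL import Image

img = Image.open("scan_raw.png")

# Optional: crop away dark edges and margin noise (left, upper, right, lower).
img = img.crop((50, 50, img.width - 50, img.height - 50))

# Convert to bilevel black/white, which OCR engines generally prefer.
img = img.convert("1")

# Tag the saved file with ~300dpi metadata (scanning at 300dpi in the
# first place is what actually determines image quality).
img.save("scan_clean.png", dpi=(300, 300))
```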
The next step was gauging the quality of the OCR. There are many options for OCR – Adobe Acrobat Pro, ABBYY FineReader, OmniPage, Google’s Tesseract, etc. Since I created my files in Adobe Acrobat Pro, that’s what I first used to OCR my texts. The result was not great, and certainly not good enough to create the clean text that I require.
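For anyone who wants a scriptable, open-source route, Tesseract (here via the pytesseract wrapper) can be pointed at Russian with its `rus` language pack. A minimal sketch, assuming Tesseract and its Russian language data are installed; the page filenames are hypothetical:

```python
# OCR a single scanned page with Tesseract's Russian model.
# Assumes the tesseract binary, its 'rus' language pack, and the
# pytesseract and Pillow Python packages are installed.
import pytesseract
from PIL import Image

page = Image.open("novoselye_issue01_p001.png")
text = pytesseract.image_to_string(page, lang="rus")

with open("novoselye_issue01_p001.txt", "w", encoding="utf-8") as f:
    f.write(text)
```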
I then ran my PDFs through ABBYY FineReader and got much better results. What was so much better? This software seems to read and understand Russian: it is good at recognizing and “guessing” words, and it even rejoins words split at the end of a line (ABBYY is headquartered in Moscow, btw). These features are all very important for creating machine-actionable texts that can be read and analyzed at the word level.
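That last point – rejoining words hyphenated across line breaks – matters more than it might seem for word-level analysis. If your OCR engine doesn’t handle it, a rough post-processing pass might look like the sketch below. This is not ABBYY’s actual algorithm, just an illustration of the idea:

```python
# A rough dehyphenation sketch (not ABBYY's algorithm): rejoin words that
# OCR left split across line breaks with a trailing hyphen.
import re

def dehyphenate(text: str) -> str:
    # Glue a word ending in a hyphen at end-of-line back onto its
    # continuation. \w is Unicode-aware in Python 3, so this works for
    # Cyrillic too. Caveat: it also merges legitimate hyphenated
    # compounds that happen to break across lines.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

print(dehyphenate("книго-\nиздательство"))  # -> книгоиздательство
```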
So the takeaway from this post is: use ABBYY to OCR your Russian texts.
Anyone have any other experiences?
In the next post I’ll describe how I plan to model my data, and which tools I will use to encode it.