Step 1: making my Russian texts machine readable

As is often the case with DH projects, my current endeavor has involved going back to an older research project, and resurrecting and reformatting ideas and materials that I haven’t looked at in a while.

A few years ago I started working with the journal Novosel’e (I’ll write another post about my research question and its evolution), and was able to physically locate the entire run – 35 issues bound together in 7 volumes – and spent hours (hours) at the Electronic Text Services Center at Columbia University (now the Digital Humanities Center) scanning every single page.

So when I came back to this project a few months ago, I had PDFs of every issue at my disposal, and didn’t have to go through the time-consuming process of creating them. But what to do with these PDFs? What is the most efficient and effective way to make them machine-readable and machine-actionable?

Luckily I could ask my multitalented colleague at the Princeton CDH, Ben Johnson, for advice. He sat down and helped me figure out how to extract clean text from those PDFs, text that could be fed into, processed by, and analyzed with the programs and tools I’m interested in.

The best text files come from images that are black and white (not color or greyscale) and scanned at a resolution of about 300 dpi. Cleaning up the images with image editing software (cropping out edges and noise in the margins) will also help. I chose not to spend the time cropping the text – my images seemed clean enough.

The next step was gauging the quality of the OCR. There are many options for OCR, such as Adobe Acrobat Pro, ABBYY FineReader, OmniPage, and Google’s Tesseract. Since I created my files in Adobe Acrobat Pro, that’s what I first used to OCR my texts. The result was not great, and certainly not good enough to create the clean text that I require.
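For anyone who wants a number to go with "not great": one common way to gauge OCR quality is to hand-type a short passage from a page and compute the character error rate against the OCR output for that same passage. The sketch below is not what I actually did (I judged the results by eye); it is a minimal illustration using only the Python standard library, and the function names are my own invention.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the hand-typed reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: the OCR misread one letter of a six-letter word.
    print(character_error_rate("журнал", "журпал"))
```

A lower rate is better, and 0.0 means the OCR output matches the hand-typed passage exactly. Running the same sample passage through two OCR programs gives a rough basis for comparing them, though a single passage says little about the whole corpus.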

I then ran my PDFs through ABBYY FineReader and got much better results. What was so much better? This software seems to read and understand Russian, is good at recognizing and “guessing” words, and even puts together words split at the end of a line (ABBYY is headquartered in Moscow btw). These are all very important for creating machine-actionable texts that can be read and analyzed at the word-level.
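If your OCR software does not rejoin words split at the end of a line, a rough version of that step can be approximated afterwards on the plain text. The sketch below is only an illustration of the idea and not how ABBYY does it internally (I don't know how it does); the function name and the lowercase-continuation rule are my own assumptions.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by a
# lowercase continuation at the start of the next line. Requiring a
# lowercase continuation is an assumption meant to leave hyphenated
# proper names alone.
_SPLIT_WORD = re.compile(r"(\w+)-[ \t]*\n[ \t]*([а-яёa-z]\w*)")


def join_split_words(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "рус-\nский жур-\nнал"
    print(join_split_words(sample))
```

One caveat: a word that is legitimately hyphenated (кто-то, for example) and happens to break at its hyphen will lose the hyphen with this rule, so results should be spot-checked. Telling the two cases apart needs a dictionary, which is presumably part of why software that "knows" Russian does this better.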

So the takeaway from this post is: use ABBYY to OCR your Russian texts.

Anyone have any other experiences?

In the next post I’ll describe how I plan to model my data, and which tools I will use to encode it.


DH tools & technologies for Russian émigré periodicals

As part of my R&D time at Princeton’s Center for Digital Humanities during the Spring 2015 semester, I will be turning my attention to the Russian émigré journal Novosel’e (Housewarming), published by Sofia Pregel in New York between 1942 and 1950, and testing out how some of today’s widely used digital humanities tools and technologies can help me learn more about Russian émigré periodical culture.

My aim is to run my corpus of Russian-language text and metadata through tools for text analysis, processing and visualization (e.g. NLTK, TextBlob, Voyant), markup and analysis (e.g. CATMA), named entity recognition (e.g. Stanford NER), topic modeling (e.g. MALLET), and network analysis (e.g. Gephi, Raw, Palladio).

I’ll document my process in a series of blog posts on SEEEPS in the hopes that this may be helpful to other scholars curious about how digital humanities methodologies can shed light on Slavic and East European periodical studies.

Stay tuned!