Using Optical Character Recognition (OCR): Observations

In my current contract I had the opportunity to work with optical character recognition (OCR). We had over 50 paper documents, all published before 1991, that needed to be digitized and published on the internet. While these documents are old, they contain really in-depth knowledge that simply needed to be shared with the world. OCR, however, has its quirks and is not all that straightforward. Some quirks are due to the age and handling of the original documents over the years, and some are due to the typographical or layout decisions of the original publishers. No matter the reason, the publishers are not to be found and you need these documents on the internet, so the monkey is now on your back.

Artifacts: Artifacts, as defined here, are imperfections in the original printed document that fool the OCR software into adding superfluous characters to your text. These imperfections can be creases, folds that are no longer folded, three-hole punch marks, stains, etc. All of these add really odd and unwanted text to your document that you must manually edit out. Most old documents were never stored with future digitization in mind, so special storage and handling were never considered.

Non-English Quotes: Maybe these are called European quotes, I am unsure, but they are quotes nonetheless. Instead of the traditional English symbol for the opening quote, they use something like quote marks sitting on the baseline, called a “low 9 quote” („ This is a quote ”). Though my documents were written in English, the writers were from Austria, so they used Austrian or German language rules. My OCR software replaces the opening quotes with two commas “,,”.
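The two-comma substitution is regular enough that it can be repaired in bulk after the OCR pass. Here is a minimal Python sketch of that idea. The function name `restore_opening_quotes` and the exact pattern are my own assumptions, not something the OCR software provides, and the pattern only covers the simple case where the two commas sit directly in front of the quoted text.

```python
import re

# Assumption: the OCR output renders the opening low-9 quote as two commas
# placed right before the first quoted word (optionally with one space).
# The lookbehind skips ",," that directly follows a word or another comma,
# which is more likely a stray double comma than a quote mark.
LOW9_AS_COMMAS = re.compile(r"(?<![,\w]),,(?=\s?\w)")


def restore_opening_quotes(text: str, replacement: str = "\u201c") -> str:
    """Replace a ',,' that opens a quotation with a real opening quote.

    The default replacement is the English opening quote; pass "\u201e"
    to keep the original low-9 quote instead.
    """
    return LOW9_AS_COMMAS.sub(replacement, text)


if __name__ == "__main__":
    print(restore_opening_quotes("He said ,,This is a quote\u201d today."))
```

Any space between the commas and the first word is left alone, so you may still want to tidy the spacing by hand. Proofread the result, because a pattern like this can also miss or mis-handle unusual cases.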

Italics and Script Font: In summary, they really screw up the OCR. I am sure this depends on the font type, but the fancier the script, the worse the OCR. Add italics on top and you have something that even a native English speaker might have trouble deciphering. You will need to completely retype these sections by sight. Keep the italics to a minimum, and eschew the script font altogether.

Blue Ink: Blue ink was not something that I thought would give trouble, but it gave my OCR software a hard time. I am sure that dark blue ink is OK, but after some 30 years blue ink will fade, resulting in an OCR problem. It is best to stick with black. I did not have a chance to try scanning other colours.

Different Font Sizes, All Uppercase: Setting everything in uppercase letters, with a larger font standing in for uppercase and a smaller font for lowercase, is just a bad idea. It is not even creative. There is very little to be gained by this strategy: it is harder to type because you keep changing fonts, it does not gain much from a typographic standpoint, and it plays hell with OCR.

Type of Font: The type of font plays a critical role in how well your OCR will go. Of course, 30 years ago nobody was thinking about this. Substitutions are common, such as the numeral “1” for lowercase “i” (eye). Sometimes a lowercase “l” (el) is thrown in for good measure. There are many others. You could carefully run the text through a spell checker and do global substitutions, which would help immensely. There are other substitutions that occur randomly and need to be fixed manually.
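One way to make the spell-check pass less painful is to have a script list the suspicious words first. The sketch below, in Python, flags any word that mixes letters and digits and proposes fixes by swapping “1” for “i” or “l”. The function names and the `known_words` set are hypothetical. You would supply your own word list, and the script only suggests candidates rather than changing the text for you.

```python
import re

# A token is suspect if it contains at least one letter AND at least one
# digit, e.g. "f1rst". Pure numbers such as "1991" are left alone.
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def find_suspect_tokens(text: str) -> list[str]:
    """Return the distinct letter/digit mixtures found in the text, sorted."""
    return sorted(set(SUSPECT.findall(text)))


def candidate_fixes(token: str, known_words: set[str]) -> list[str]:
    """Try replacing every '1' with 'i' or with 'l' and keep real words.

    known_words is assumed to be a set of lowercase dictionary words
    that you provide yourself.
    """
    candidates = {token.replace("1", ch) for ch in "il"}
    return sorted(c for c in candidates if c.lower() in known_words)


if __name__ == "__main__":
    sample = "Th1s is the f1rst of 12 pages from 1991."
    words = {"this", "first"}  # tiny stand-in for a real word list
    for tok in find_suspect_tokens(sample):
        print(tok, "->", candidate_fixes(tok, words))
```

This only catches digits sitting inside words. A lowercase “l” standing in for an “i” still looks like a normal word to this check, so the spell checker and a proofread remain necessary.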

Typographic Errors: Columns that are set too close together on a two-column page can trick the OCR software into thinking that there is only one column with wonky spacing. The software then reads straight across both columns, so each line of output holds half a line from each column, and you need to separate the halves and put each column back together. This is quite a difficult and tedious procedure, best avoided. Random and “creative” typesetting will also fool your OCR software, putting snippets of text in odd places. Keep your typesetting simple and flowing from left to right and from top to bottom. Your OCR software will cooperate better with you.

There is OCR software out there that will create web pages, but it often adds so much superfluous HTML that you will need to strip it all out. I was using WordPress as the CMS, so plain text was necessary because WordPress and our chosen theme would take care of the typesetting. Removing all the generated HTML was more hassle than it was worth, taking more time than a simple OCR to text.
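If you do end up with generated HTML, stripping it back to plain text can at least be scripted. This is a rough Python sketch using only the standard library's `html.parser`. The class name `TextExtractor`, the function `html_to_text`, and the list of block tags are my own choices for illustration, not part of any OCR package.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document and forget the markup."""

    # Assumption: these tags mark paragraph-like breaks in the OCR output.
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return plain text with one blank line between paragraphs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = '<div class="ocr"><p><span>Hello</span> <b>world</b></p><p>Second &amp; last</p></div>'
    print(html_to_text(page))
```

The sketch keeps everything that is not a tag, so the contents of any script or style blocks would come through as text and would need to be removed separately. The plain-text result can then be pasted into WordPress and left to the theme for typesetting.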

In summary, I could find no easy and suitable method of getting a paper document into my WordPress CMS except a simple OCR, warts and all. There will always be some errors that need to be overcome, as no system is perfect for your individual application. You simply reduce the amount of extra work you need to do as best you can, go with this solution, and slog through the reams of documents until you are done. Then proofread.
