Image-Only PDF: Convert to Searchable and Text Copyable on Ubuntu 16.04 Using OCRmyPDF

Manulife provides its blacklist of health service providers as a PDF, but all pages are images. You therefore cannot copy or search the 22 pages.

Our insurer, Manulife, has a list of blacklisted health service providers, available as a PDF. If you use any of them, Manulife will not reimburse you. The listed companies go back to 2015. My doctor recommended a provider near my house, so I wanted to check whether that provider was blacklisted.

Very clever: Manulife provides a PDF that looks like text but is actually 22 pages of graphics that look like text. Therefore you can neither copy it nor text search it. This means you need to read each of the 22 pages yourself to determine whether your prospective provider is blacklisted. This is very inconvenient and error prone. It would be much more helpful to insureds if there were a search facility for this document. This paper disclosure is disingenuous.

I am sure that Manulife has its reasons for producing an all-graphics PDF, but ease of use is not one of them. With this PDF they can say that Manulife did provide the list to all insureds, but it is very onerous for the insured to use.

Here is my install of OCRmyPDF, which reads the document and creates a PDF that is searchable and text copyable. As I am on Ubuntu 16.04, I followed these instructions. The only prerequisite I did not have was python3-pip, which installed easily. The ocrmypdf install was also easy. OCRmyPDF uses tesseract-ocr, from Google.

After the OCRmyPDF install I went to my directory with the Manulife file and then ran:

ocrmypdf old.pdf new.pdf

It did take a while. I also received the message:

You are using qpdf version 6.0.0 which has known issues including security vulnerabilities with certain malformed PDFs. Consider upgrading to version 7.0.0 or newer.

but it did run to completion. The output file was copyable and text searchable. Yay. One odd behaviour: when searching, the document text would slowly scroll down a little, but it did eventually stop. I was able to page down to see the search result.
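
Since the PDF viewer's search scrolls oddly, another option is to search from the command line. This is only a sketch of what I would try, not something from the instructions I followed: it assumes pdftotext (from the poppler-utils package) is installed, and find_provider and new.txt are names I made up for illustration.

```shell
# Hypothetical helper: case-insensitive, literal search of extracted text.
# $1 = text file extracted from the OCR'd PDF
# $2 = provider name (or part of one) to look for
# Prints matching lines prefixed with their line numbers; exits non-zero if none match.
find_provider() {
    grep -i -n -F -- "$2" "$1"
}

# Usage, assuming new.pdf is the OCR output and poppler-utils is installed:
#   pdftotext -layout new.pdf new.txt
#   find_provider new.txt "name of my provider"
```

OCR is not perfect, so a miss here does not prove the provider is absent. Searching for a short, distinctive part of the name or the street address is safer than searching for the full name.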

The document has, on every page, the footer “Due to the sensitive nature of this communication, please do not post it publicly online.”

You can easily tell from the document that there is a lot of fraud happening. When you see three company names listed under the same address you know something is not right. This pattern is repeated throughout the document.

B* S* H* C* C*
335 Sheppard Ave. E
Toronto, ON
M2N 383

Dr. B* L*
335 Sheppard Ave. E
Toronto, ON
M2N 383

BL S*
335 Sheppard Ave. E
Toronto, ON
M2N 383
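
Once the text is extracted, the same-address pattern could be checked mechanically instead of by eye. The following is a rough sketch under some assumptions: the entries have been saved to a text file as blocks separated by blank lines, with the name on the first line and the address on the remaining lines, as in the entries above. The function name shared_addresses and the file name providers.txt are made up for illustration.

```shell
# Hypothetical helper: report addresses that appear under more than one name.
# $1 = text file of blank-line-separated blocks (name, then address lines)
shared_addresses() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one block per record, one line per field
    {
        # Join every line after the name into a single address key.
        addr = $2
        for (i = 3; i <= NF; i++) addr = addr ", " $i
        count[addr]++
        if (count[addr] == 1) names[addr] = $1
        else names[addr] = names[addr] "; " $1
    }
    END {
        for (a in count)
            if (count[a] > 1)
                printf "%d names at %s: %s\n", count[a], a, names[a]
    }' "$1"
}

# Usage:
#   shared_addresses providers.txt
```

The match is exact, so OCR slips or stray spaces in an address will hide a duplicate. Treat the output as a list of leads to check, not as a complete count.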
