When a requester (a person or a bot) requests a resource from your server, Apache records that request in its raw access log. The requester also sends some information about itself in the HTTP request headers. Logging the full set of headers is not standard on Apache, but with a little bit of PHP added to the HTML, this extra information can be logged and examined to help determine whether the requester is a bot or a human.
Because an additional file is created each day, I opted to put these files into a subdirectory. The headers are logged, one per line, into a headers-yyyymmdd.log file. The contents look free-form, because different requesters send different sets of headers.
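As a rough illustration of the idea, here is a minimal sketch of what such a logging snippet could look like. It is not the exact code from the post. The function name `log_request_headers`, the `header-logs` subdirectory, and the `---` marker line that separates one request from the next are all assumptions made for this example.

```php
<?php
// Hypothetical helper: append one request's headers to a daily log file.
// The directory name and the line layout are assumptions for this sketch.
function log_request_headers(array $headers, string $dir, ?int $now = null): string
{
    $now = $now ?? time();

    // Create the subdirectory on first use.
    if (!is_dir($dir)) {
        mkdir($dir, 0750, true);
    }

    // One file per day: headers-yyyymmdd.log
    $file = $dir . '/headers-' . date('Ymd', $now) . '.log';

    // Start each request with a marker line so entries can be told apart.
    $lines = ['--- ' . date('c', $now) . ' ' . ($_SERVER['REMOTE_ADDR'] ?? '-')];
    foreach ($headers as $name => $value) {
        // Replace CR/LF so a crafted header value cannot forge extra log lines.
        $lines[] = $name . ': ' . str_replace(["\r", "\n"], ' ', (string) $value);
    }

    // Append with an exclusive lock so concurrent requests do not interleave.
    file_put_contents($file, implode("\n", $lines) . "\n", FILE_APPEND | LOCK_EX);

    return $file;
}

// getallheaders() is provided under Apache (and under PHP-FPM since PHP 7.3).
if (function_exists('getallheaders')) {
    log_request_headers(getallheaders(), __DIR__ . '/header-logs');
}
```

One way to use this would be to save it as its own file (say, a hypothetical `log-headers.php`) and pull it into each page with `<?php require 'log-headers.php'; ?>`. It is worth making sure the log subdirectory is not served to the public, since the files hold visitor IP addresses and header details.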
In my current contract I had the opportunity to work with optical character recognition (OCR). We had over 50 paper documents, published before 1991, that needed to be digitized and published on the internet. While these documents are old, they contain really in-depth knowledge that simply needed to be shared with the world. OCR, however, has its quirks and is not all that straightforward. Some quirks are due to the age and handling of the original documents over the years, and some are due to the typographical or layout decisions of the original publishers. No matter the reason, they are not to be found, and you need these documents on the internet, so the monkey is now on your back.
This is a preview of Using Optical Character Recognition (OCR): Observations. Read the full post (834 words, 0 images, estimated 3:20 mins reading time).