Tag: HTML

Importing Table Data from the Web to a Spreadsheet

It is not very difficult once you know how:

Highlight your web table and copy it (Ctrl-C).
Open your spreadsheet and paste (Ctrl-V) into it.
If the table has header rows you do not want, delete the whole row.
You can delete unwanted columns the same way: remove the whole column.
To remove images: Home > Find & Select > Go To Special > Objects. All images will be selected; press Delete.
Then you can manipulate your spreadsheet.
Save your spreadsheet.

To highlight an area, click the top-left cell, move to the bottom-right cell, and Shift-click it.
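
For tables that need the same cleanup again and again, the copy-and-paste steps above can also be approximated in a script. The following is a minimal sketch, not part of the original method: it assumes the table's HTML is already available as a string, and the names `TableParser`, `table_to_csv`, `skip_rows` and `drop_columns` are made up for illustration. It uses only the Python standard library and writes CSV, which any spreadsheet can open.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        # <img> tags carry no text, so images simply leave an empty cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # join the fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html, skip_rows=0, drop_columns=()):
    """Return the table as CSV text, minus unwanted leading rows and columns."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in parser.rows[skip_rows:]:
        writer.writerow([c for i, c in enumerate(row) if i not in drop_columns])
    return out.getvalue()


# Made-up sample table: a header row, an image column, two data rows.
sample = """
<table>
  <tr><th>Name</th><th>Logo</th><th>Price</th></tr>
  <tr><td>Widget</td><td><img src="w.png"></td><td>4.50</td></tr>
  <tr><td>Gadget</td><td><img src="g.png"></td><td>7.25</td></tr>
</table>
"""

# Drop the header row and the image column, as in the manual steps.
csv_text = table_to_csv(sample, skip_rows=1, drop_columns={1})
print(csv_text)
```

Saving `csv_text` to a file ending in .csv gives something the spreadsheet can open directly. The sketch treats every row on the page as part of one table, so a page with several tables would need extra handling.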


Request Header-Based Logging for Apache

When a requester, whether a person or a bot, requests a resource from your server, Apache logs the request in the raw access log. The requester also leaves some information about itself in the HTTP request headers. Logging these headers is not standard on Apache, but with a little bit of PHP added to the HTML, this extra information can be logged and examined to help determine whether the requester is a bot or a human.

Because an additional file is created each day, I opted to put these files into a subdirectory. The headers are logged, one per line, into a headers-yyyymmdd.log file, which is essentially free-form: different requesters leave different sets of headers.
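
The post does this with PHP. As a language-neutral illustration of the file layout just described, here is a minimal Python sketch. The function name `log_headers`, the `header-logs` directory name, the sample headers and the blank line between requests are all assumptions for the example, not details from the original setup.

```python
import datetime
import os
import tempfile


def log_headers(headers, log_dir, now=None):
    """Append headers, one per line, to <log_dir>/headers-yyyymmdd.log."""
    now = now or datetime.datetime.now()
    os.makedirs(log_dir, exist_ok=True)  # the subdirectory holding the daily files
    path = os.path.join(log_dir, "headers-" + now.strftime("%Y%m%d") + ".log")
    with open(path, "a", encoding="utf-8") as fh:
        for name, value in headers.items():
            # Strip CR/LF so a crafted header value cannot forge extra log lines.
            clean = str(value).replace("\r", " ").replace("\n", " ")
            fh.write(f"{name}: {clean}\n")
        fh.write("\n")  # assumed separator: a blank line between requests
    return path


# Made-up headers standing in for what a real request would carry.
example_headers = {
    "Host": "example.com",
    "User-Agent": "ExampleBot/1.0",
    "Accept": "*/*",
}

log_path = log_headers(
    example_headers,
    os.path.join(tempfile.mkdtemp(), "header-logs"),
    now=datetime.datetime(2024, 1, 15, 9, 30, 0),
)
print(log_path)
```

Because the file name is derived from the date, a new file starts automatically each day, and the per-request blocks can vary in length without breaking anything that reads them line by line.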


Using Optical Character Recognition (OCR): Observations

In my current contract I had the opportunity to work with optical character recognition (OCR). We had over 50 paper documents, published before 1991, that needed to be digitized and published on the internet. While these documents are old, they contain really in-depth knowledge that simply needed to be shared with the world. OCR, however, has its quirks and is not all that straightforward. Some quirks are due to the age and handling of the original documents over the years, and some are due to the typographical or layout decisions of the original publishers. No matter the reason, they are not to be found and you need these documents on the internet, so the monkey is now on your back.
