Request Header-Based Logging for Apache

When a requester, whether a person or a bot, asks your server for a resource, Apache records the request in the raw access log. The requester also sends some information about itself in the form of HTTP request headers. Apache does not log these by default, but with a little PHP added to the pages, this extra information can be captured and examined to help determine whether the requester is a bot or a human.

Because a new file is created daily, I opted to put these files into a subdirectory. The headers are logged, one per line, into a headers-yyyymmdd.log file. The content is essentially free form: different requesters send different sets of headers.

I added a logheaders.php file to a subdirectory under public_html, and changed this line in the PHP code to account for the subdirectory:

$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-" . date('Ymd') . ".log", "a");
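For context, here is a sketch of what the rest of logheaders.php could look like. This is my own reconstruction, not the original file: it assumes apache_request_headers() is available (it is when PHP runs under Apache), and the helper name format_header_block is mine.

```php
<?php
// Sketch of logheaders.php: write the requested URL plus every request
// header, one per line, then a blank line, to a per-day log file.

// Build one log block from the request URI and a header map.
function format_header_block(string $uri, array $headers): string {
    $block = "URL: " . $uri . "\n";
    foreach ($headers as $name => $value) {
        $block .= $name . ": " . $value . "\n";
    }
    return $block . "\n"; // blank line separates one request from the next
}

// apache_request_headers() exists when PHP runs under Apache.
$headers = function_exists('apache_request_headers') ? apache_request_headers() : [];
$block   = format_header_block($_SERVER['REQUEST_URI'] ?? '', $headers);

// Same path scheme as the fopen() line above.
if (!empty($_SERVER['DOCUMENT_ROOT'])) {
    file_put_contents(
        $_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-" . date('Ymd') . ".log",
        $block,
        FILE_APPEND
    );
}
```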

I added this to the header.php of my WordPress theme:

<?php include($_SERVER['DOCUMENT_ROOT'] . '/loghead/logheaders.php') ?>

I also added the PHP call to 404.php and 403.php, but those pages do not work with the script in a subdirectory, so I had to duplicate the code in public_html.

The resulting file logging looks like this, with each header on a separate line:

URL: /wp/2016/03/16/ubuntu-1404-chinese-sunpinyin-pinyin-fixes/
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0

You can now correlate obvious human activity in the raw access log with its request header entries. Humans leave an unmistakable trail: they read a single page at a time, so they log a single set of headers per page, and then take some time to read the content.

Bots can ask for multiple pages at a time, so multiple sets of headers get logged in quick succession, a dead giveaway that the requester is not human.
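To spot that pattern in bulk, a day's log can be split into blocks and tallied per User-Agent. This is a sketch of my own, not part of the original setup; it assumes the log format shown above, with a blank line between requests.

```php
<?php
// Sketch: count header blocks per User-Agent in one day's log, a quick
// way to spot requesters that pulled many pages.
function count_by_user_agent(string $log): array {
    $counts = [];
    // Blocks are separated by a blank line.
    foreach (preg_split('/\n\s*\n/', trim($log)) as $block) {
        $ua = '(no User-Agent)';
        if (preg_match('/^User-Agent: (.*)$/m', $block, $m)) {
            $ua = $m[1];
        }
        $counts[$ua] = ($counts[$ua] ?? 0) + 1;
    }
    arsort($counts); // busiest requesters first
    return $counts;
}

// Usage: feed it the contents of a headers-yyyymmdd.log file, e.g.
// print_r(count_by_user_agent(file_get_contents(
//     $_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-" . date('Ymd') . ".log")));
```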

Header-based logging seems to be rare; searches turn up very little on the subject, especially in the context of blocking bots.

Blocking options

The Connection header splits requesters roughly like this:

Connection: close                             Connection: keep-alive
Baidu, Sogou,                                 human
Googlebot, Yisou, Bingbot, Seznam, Yandex
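One way to act on the Connection header is with mod_setenvif. The following is a sketch, not my exact rule set, using Apache 2.4 syntax: it flags Connection: close, which browsers on HTTP/1.1 rarely send, and denies flagged requests. Note that it would also catch legitimate HTTP/1.0 clients, hence the exceptions below.

```apache
# Sketch: flag requests that send "Connection: close", then deny anything
# flagged. bad_header matches the variable name used in the Google Image
# Proxy exception further down.
SetEnvIf Connection ^close$ bad_header

<RequireAll>
    Require all granted
    Require not env bad_header
</RequireAll>
```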

Poking holes: now that I have some request header-based rules, I can make exceptions for those bots that would otherwise get caught:

# Google Image Proxy
SetEnvIf Remote_Addr ^64\.233\.173\.[0-9]{1,3} !bad_header
Other headers worth examining for blocking rules:

Accept header
Host header (the HTTP/1.1 specification states that a client MUST include a Host header field, so a request without one is suspect)
Accept-Language header
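For the Accept-Language check, here is a mod_rewrite sketch of my own (an assumption, not the post's rule): nearly all real browsers send Accept-Language, while many simple bots omit it.

```apache
# Sketch: return 403 for requests with no Accept-Language header at all.
# %{HTTP:Accept-Language} expands to the empty string when the header
# is absent. Poke holes for known good bots before enabling this.
RewriteEngine On
RewriteCond %{HTTP:Accept-Language} ^$
RewriteRule .* - [F,L]
```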
