Because a new log file is created every day, I opted to put these files into a subdirectory. The headers are logged, one per line, to a headers-yyyymmdd.log file. The format is free-form, since different requesters send different sets of headers.
I added the logheaders.php file to a subdirectory under public_html and changed this line in the PHP code to account for the subdirectory:
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-" . date('Ymd') . ".log", "a");
I then added this to the header.php of the WordPress theme:
<?php include($_SERVER['DOCUMENT_ROOT'] . '/loghead/logheaders.php'); ?>
I also added the PHP include to 404.php and 403.php, but those pages did not work with the script in a subdirectory, so I had to duplicate the code in public_html.
The resulting log looks like this, with each header on a separate line:
2018-07-17:07:44:58
URL: /wp/2016/03/16/ubuntu-1404-chinese-sunpinyin-pinyin-fixes/
IP: 42.249.25.227
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Host: dontai.com
Referer: http://cn.bing.com/search?q=ubuntu14+chinese+input&qs=n&form=QBRE&sp=-1&pq=ubuntu14+chinese+input&sc=1-22&sk=&cvid=F172705FDEF146F0A86DFF1E28B0C438
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
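For reference, here is a minimal sketch of a logger that would produce entries in the shape shown above. It is not the actual logheaders.php; the function names `format_header_entry` and `collect_request_headers` are made up for illustration, and the fallback to `$_SERVER` assumes a setup where `getallheaders()` is not available.

```php
<?php
// Hypothetical sketch of a header logger; function names are assumptions.

/**
 * Build one log entry: timestamp, URL, IP, then one "Name: value" line per header.
 */
function format_header_entry(array $headers, string $url, string $ip, int $timestamp): string
{
    $lines = [
        date('Y-m-d:H:i:s', $timestamp),
        'URL: ' . $url,
        'IP: ' . $ip,
    ];
    foreach ($headers as $name => $value) {
        // Strip CR/LF so a crafted header value cannot forge extra log lines.
        $clean = str_replace(["\r", "\n"], ' ', (string) $value);
        $lines[] = $name . ': ' . $clean;
    }
    return implode("\n", $lines) . "\n";
}

/**
 * Collect request headers. getallheaders() is not available under every SAPI,
 * so fall back to rebuilding header names from the HTTP_* keys in $_SERVER.
 */
function collect_request_headers(): array
{
    if (function_exists('getallheaders')) {
        return getallheaders() ?: [];
    }
    $headers = [];
    foreach ($_SERVER as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            // HTTP_USER_AGENT -> User-Agent
            $words = ucwords(strtolower(str_replace('_', ' ', substr($key, 5))));
            $headers[str_replace(' ', '-', $words)] = $value;
        }
    }
    return $headers;
}

// Only log while serving a web request, so the file can also be loaded from the CLI.
if (PHP_SAPI !== 'cli' && isset($_SERVER['DOCUMENT_ROOT'], $_SERVER['REQUEST_URI'])) {
    $path = $_SERVER['DOCUMENT_ROOT'] . '/loghead/headers-' . date('Ymd') . '.log';
    $entry = format_header_entry(
        collect_request_headers(),
        $_SERVER['REQUEST_URI'],
        $_SERVER['REMOTE_ADDR'] ?? '-',
        time()
    );
    // FILE_APPEND with LOCK_EX keeps concurrent requests from interleaving entries.
    file_put_contents($path, $entry, FILE_APPEND | LOCK_EX);
}
```

Keeping the formatting in its own function makes it easy to check the output without a live request, and stripping line breaks from header values matters because everything in this log is supplied by the requester.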
You can now correlate the obviously human activity in your raw access log with the matching request header entries. Humans leave an unmistakable trail: they read a single page at a time, so they log a single set of headers per page, and they then take some time to read the content.
Bots can ask for multiple pages at a time, so there will be multiple sets of headers logged in quick succession, a dead giveaway that this is not a human.
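The "quick succession" test can be sketched in a few lines of PHP that read a headers log back in and list the IPs with several entries inside a short time window. This is only an illustration: the function names, the 2-second window and the 3-request threshold are my assumptions, not tuned values, and visitors behind a shared address could trip it too.

```php
<?php
// Hypothetical helpers; the window and threshold defaults are assumptions.

/**
 * Split a headers log into entries. An entry starts at a timestamp line
 * such as "2018-07-17:07:44:58" and runs until the next timestamp line.
 */
function parse_header_log(string $text): array
{
    $entries = [];
    $current = null;
    // Only differences between timestamps matter here, so parse them all as UTC.
    $utc = new DateTimeZone('UTC');
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2}$/', $line)) {
            if ($current !== null) {
                $entries[] = $current;
            }
            $when = DateTimeImmutable::createFromFormat('Y-m-d:H:i:s', $line, $utc);
            $current = $when === false ? null : ['time' => $when->getTimestamp(), 'fields' => []];
            continue;
        }
        if ($current !== null && strpos($line, ': ') !== false) {
            [$name, $value] = explode(': ', $line, 2);
            $current['fields'][$name] = $value;
        }
    }
    if ($current !== null) {
        $entries[] = $current;
    }
    return $entries;
}

/**
 * Return the IPs that logged at least $minRequests entries within $windowSeconds.
 */
function find_bursts(array $entries, int $windowSeconds = 2, int $minRequests = 3): array
{
    $byIp = [];
    foreach ($entries as $entry) {
        $ip = $entry['fields']['IP'] ?? null;
        if ($ip !== null) {
            $byIp[$ip][] = $entry['time'];
        }
    }
    $suspects = [];
    foreach ($byIp as $ip => $times) {
        sort($times);
        $count = count($times);
        // Slide over the sorted timestamps looking for a tight cluster.
        for ($i = 0; $i + $minRequests - 1 < $count; $i++) {
            if ($times[$i + $minRequests - 1] - $times[$i] <= $windowSeconds) {
                $suspects[] = $ip;
                break;
            }
        }
    }
    return $suspects;
}
```

Calling `find_bursts(parse_header_log(file_get_contents($path)))` on one day's log gives a shortlist of addresses to look up in the raw access log.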
Header-based logging seems pretty rare: I am finding very little on the subject in searches, and even less on using it to block bots.
| Connection: close | Connection: keep-alive |
| Baidu, Sogou | human |
| Googlebot, Yisou, Bingbot, Seznam, Yandex, |
Poking holes: Now that I have some request header-based rules, I can make exceptions for those bots that would otherwise get caught:
# Google Image Proxy
SetEnvIf Remote_Addr ^64\.233\.173\.[0-9]{1,3}$ !bad_header
https://www.abyssguard.com/wiki/features/headers-check/
Header Accept
A client MUST include a Host header field
https://websiteadvantage.com.au/Request-HTTP-Header-Info
http://www.nicholassolutions.com/tutorials/php/headers.html#requestheaders
Accept-Language Headers
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language
https://www.w3.org/International/questions/qa-accept-lang-locales
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html
https://log0.wordpress.com/2008/07/12/how-to-differentiate-bots-and-humans/
https://www.usenix.org/legacy/event/usenix06/tech/full_papers/park/park_html/paper.html
https://www.sans.org/reading-room/whitepapers/detection/http-header-heuristics-malware-detection-34460
