When someone, such as a person or a bot, the requester, requests a resource from your server, this request, for Apache, is logged in the raw access log. The requester also leaves some information about itself called http request headers. While not standard to log on Apache, with a little bit of php added to the html, this extra information can be logged and examined to help determine if the requester is a bot or human.
As an additional file will be created daily, I opted to put these files into a subdirectory. The headers, one per line, are being logged into a headers-yyyymmdd.log file, which seems free form. Different requesters leave different sets of headers.
Apache, the server and not the Indian tribe, is a fickle mistress. She is more than a little unpredictable, or at least it feels this way on Site5. While I realize that Apache is a web server, a computer who should be very logical, often times I notice very odd behaviour. Maybe it is the server setup, caching, or even traffic volume, I do not know. I do know that if you have some error in your htaccess file, the Apache server will then display a combination of ip addresses and host names. Once you fix the error, which no one can point out and there is no error message to go by, you will be back to only ip addresses.
Does your raw access log display a host name of “0”, or zero? Very odd, is it not? I have been struggling with this for a couple of months, and my ISP Site5 had no answers. It turns out that one of my spammers, NFORCE_ENTERTAINMENT, puts an unprintable character into their host table, so that when my ISP looks them up, they display the unprintable character in my log as “0”.
Trying to control your site’s spam can be challenging. If you try to ban an IP that is simply 0, or a host name of “0” you will fail, because there is no zero in their host name, but an unprintable character. Ban these guys instead.
My htaccess file is getting large as I continually ban more bad bots of the world. As it gets larger there are bound to be more mistakes. One of the mistakes can occur in “deny from” lines, which account for the vast majority of lines in the htaccess. If you add any alpha characters to the ip addresses in “deny from” lines, the Apache server will do all host lookups and try to not return IP addresses. This means that some spammers’ ip addresses will be hidden behind bogus host names. For accuracy it is best for the Apache server to return their IP addresses. Using IPs you can then do host and search lookups, find them and ban them.