Apache, the web server and not the Indian tribe, is a fickle mistress. She is more than a little unpredictable, or at least it feels this way on Site5. Apache is a web server, a machine that should be very logical, yet I often notice very odd behaviour. Maybe it is the server setup, caching, or even traffic volume; I do not know. What I do know is that if you have certain errors in your htaccess file, the access log will display a mix of IP addresses and host names. There is no error message to go by and no one can point the error out, but once you fix it the log goes back to IP addresses only.
Apache htaccess log problem: the log now displays a mix of IP addresses and host names in column 1. Photo by Don Tai.
An IP-only log is preferred because anyone who spams you can be tracked down and banned by IP. This is important. While some DDoS attacks spoof the source IP address, anyone browsing your site, be they a person or a bot, has to use their real IP address in order to receive the requested info. If they give a bogus IP address, the reply goes to the bogus address and they never receive what they asked for.
The interesting thing about the mixed log is that the most common bots are already identified by their host name, which makes the log much easier to read. The residual IPs, about 30% of the entries, are displayed because they have no host name. With the IP and host name mix I can go through my log much faster.
Take for example the mixed IP and host name access log snippet above. I allow both archive.org and googlebot to crawl my site. As they are clearly identified by name in the IP column, I need not look up their IP addresses for identification. It would be very difficult for another IP address to spoof the host name, provided the name also resolves back to the same IP when looked up in the forward direction, so in that case the host name can be trusted.
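The trust check described here, IP to name and then name back to IP, can be sketched in a few lines of Python. This is only a rough illustration and not what Apache itself runs: the function name `forward_confirmed`, the list of approved suffixes, and the stub lookup tables are all my own assumptions, and the addresses come from the ranges reserved for documentation.

```python
import socket


def forward_confirmed(ip, allowed_suffixes, reverse=None, forward=None):
    """Return the host name for ip if it passes a forward-confirmed check, else None."""
    # Default to real DNS; gethostbyname_ex only returns IPv4 addresses.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        host = reverse(ip)  # IP -> name (reverse lookup)
        if not host.lower().endswith(tuple(allowed_suffixes)):
            return None  # name is not in a domain we approve of
        # name -> IPs (forward lookup); must lead back to the same IP
        return host if ip in forward(host) else None
    except OSError:
        return None  # lookup failed, treat as unverified


# Stub lookups with made-up data so the example runs without touching DNS.
PTR = {"192.0.2.10": "crawl-1.googlebot.com", "192.0.2.66": "crawl-9.googlebot.com"}
A = {"crawl-1.googlebot.com": ["192.0.2.10"], "crawl-9.googlebot.com": ["198.51.100.7"]}
SUFFIXES = (".googlebot.com", ".archive.org")  # assumed, adjust to your approved bots

for ip in PTR:
    print(ip, forward_confirmed(ip, SUFFIXES, PTR.__getitem__, A.__getitem__))
```

With the stub data the first address is confirmed and the second is rejected, because its claimed name points at a different IP. Leave out the last two arguments to use real DNS lookups instead of the stubs.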
The mixed log is also much slower for the Apache server, because every visitor's host name must be looked up. Would it not be better for Apache to report the htaccess error with a message of some kind, rather than spend more resources looking up host names?
Host Name Hiding
The problem with the IP and host name mix is that a returned host name hides the actual IP address. As the mixed version of the log does not provide both, a visitor logged by host name can elude detection and the banning of their IP address.
In the mixed IP and host name access log snippet above, note the entry for “mirror.ambrust.me”. A DNS lookup of this host name says there is no IP address associated with it, yet there it appears in my log. This means that while this bot was hitting my site, my Apache server was able to look up the name from the IP (a reverse DNS lookup), but the forward DNS entry, which would turn the name back into the IP, is not available. My log is not providing all the info I need, even though at the time of the visit the server had everything, both IP address and host name.
hn.kd.ny.adsl is another good example. They have hundreds of IP addresses that they cycle through, but they hit you with only one. In your mixed log you see “hn.kd.ny.adsl” in the IP address column, but what IP do you ban? You have no IP, so you search the internet, only to find you have a choice of a couple of hundred. Your only option is to start banning large ranges in the hope that you’ll hit the elusive IP address. This may also ban many legitimate visitors to your site, as well as slow down the server and your site.
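To get a feel for how much collateral damage a range ban can do, here is a small Python sketch that counts the addresses covered by a ban. The helper name `ban_size` and the ranges are made up for illustration (they are documentation addresses, not the ranges of any real provider).

```python
import ipaddress


def ban_size(cidr):
    """Return how many addresses a ban on the given address or CIDR range covers."""
    # strict=False accepts an address with host bits set, e.g. 198.51.100.23/24
    return ipaddress.ip_network(cidr, strict=False).num_addresses


for cidr in ("198.51.100.23/32", "198.51.100.0/24", "198.51.0.0/16"):
    print(cidr, "blocks", ban_size(cidr), "addresses")
```

A single address is one ban, a /24 shuts out 256 addresses, and a /16 shuts out 65536, most of which probably never bothered you.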
Apache htaccess log problem fixed: the log now displays only IP addresses in column 1. Photo by Don Tai.
The Benefits of IP Addresses Only
The benefit of an IP-only log is that you can track down anyone who visits your site. The drawback is that you see a huge column of IP numbers that you need to look up. It is neither as easy nor as fast to work through IP addresses. Some bots identify themselves but many do not. Microsoft is well known for not always providing a User Agent name, or for not mentioning that they are Microsoft in it. This does not help.
Ideally the access log would provide both the IP address and the host name. This way I could more easily scan the log for potential issues and skip the entries from my approved bots. I could also see who is trying to spoof the UAs of known whitelisted bots.
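Until the server does this for you, one workaround is to add the host name yourself after the fact. The Python sketch below takes an IP-only log line and inserts a host name column after the IP. It assumes the IP is the first space-separated field; `make_annotator` and `stub_resolve` are names I made up, and the stub table stands in for real DNS so the example runs anywhere.

```python
import socket
from functools import lru_cache


def make_annotator(resolve=None):
    """Build a function that inserts a host name column after the IP column."""
    resolve = resolve or (lambda ip: socket.gethostbyaddr(ip)[0])

    @lru_cache(maxsize=None)  # each IP is looked up only once
    def host_for(ip):
        try:
            return resolve(ip)
        except OSError:
            return "-"  # no reverse DNS entry for this IP

    def annotate(line):
        ip, sep, rest = line.partition(" ")
        return f"{ip} {host_for(ip)} {rest}" if sep else line

    return annotate


def stub_resolve(ip):
    """Made-up lookup table standing in for real DNS."""
    table = {"192.0.2.10": "crawler.example.com"}
    if ip not in table:
        raise OSError("no reverse DNS entry")
    return table[ip]


annotate = make_annotator(stub_resolve)
print(annotate('192.0.2.10 - - "GET / HTTP/1.1" 200 512'))
print(annotate('198.51.100.7 - - "GET /x HTTP/1.1" 404 0'))
```

Call `make_annotator()` with no argument to use real reverse lookups. Keep in mind that the lookup then happens when you read the log, not at the time of the visit, so a host name may have changed or disappeared in between.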
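If there were an Apache wish list I would add this to their requirements: allow us to have both the IP and the host name in the access log. Apache's LogFormat directive does have separate codes for the two, %a for the client IP address and %h for the remote host name, but the log format is set in the main server configuration, which on shared hosting the customer usually cannot edit.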
If you have a mixed IP and host name log, spend the time to find and fix the htaccess error. Yes, htaccess is difficult to manage. It might take a couple of days, but you will get the IP address of every visitor. They will have nowhere to hide.
My htaccess mistakes:
deny from 22.214.171.124 ExitNode2319 tor
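A likely reason a line like this causes the mix: Apache's documentation for Allow and Deny says that a host name argument makes the server do a double reverse DNS lookup on every client, whatever HostnameLookups is set to. Apache also does not allow a comment on the same line as a directive, so the extra words after the IP are read as host names. Whether this is the whole story on my server is an assumption on my part. The Python sketch below scans htaccess text for allow/deny arguments that are not addresses; the function names and the sample text are made up for illustration.

```python
import ipaddress
import re

# Partial IPv4 prefixes such as "10.1", which Apache accepts as a subnet.
PARTIAL_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){0,2}\.?$")


def is_address_like(token):
    """True for 'all', env= tests, full or partial IPs, and CIDR or netmask ranges."""
    lowered = token.lower()
    if lowered == "all" or lowered.startswith("env="):
        return True
    if PARTIAL_IPV4.match(token):
        return True
    try:
        ipaddress.ip_network(token, strict=False)
        return True
    except ValueError:
        return False  # anything else is read by Apache as a host name


def find_host_arguments(text):
    """Yield (line_number, token) for allow/deny arguments that are not addresses."""
    for number, line in enumerate(text.splitlines(), start=1):
        words = line.split()
        if len(words) < 3 or words[0].startswith("#"):
            continue  # blank, comment, or too short to be "deny from x"
        if words[0].lower() in ("allow", "deny") and words[1].lower() == "from":
            for token in words[2:]:
                if not is_address_like(token):
                    yield number, token


sample = """\
# block list
deny from 22.214.171.124 ExitNode2319 tor
deny from 192.0.2.0/24
allow from all
"""
print(list(find_host_arguments(sample)))
```

On the sample it flags `ExitNode2319` and `tor` on line 2. Moving notes like these onto their own line starting with `#` keeps them as comments.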