For some odd reason my htaccess was banning some Bingbot and Yandexbot crawls, along with others, such as the University of Toronto, and I could not figure out why. I blamed my host provider, and I was incorrect: there was no other cache of IP blocks. It turned out to be, as usual, a user error. My user error.
Through repeated bisection of my htaccess, 12 rounds with my patient Big Weed, I isolated the errant code. It turned out to be syntactically correct but logically incorrect, which is why it passed both of my htaccess syntax checkers. The error was, in CIDR format:
Hotlinking is simply not cool. Referrer spam is also not cool. I get both of these from 4 Blogspot sites, and have struggled to contain their mess. The problem is that they are hosted by Google, on its Blogger platform, at GoogleUserContent.com. Though Blogger is free, these blogs are very difficult to kill. Here’s what I did to combat the problem.
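One way to fight back in htaccess is to refuse any request whose referrer points at those blogs. The lines below are a minimal sketch rather than my exact rules, with badblogone and badblogtwo standing in as placeholders for the real Blogspot subdomains:
RewriteEngine On
# refuse hotlinks and referrer spam coming from the offending blogs
RewriteCond %{HTTP_REFERER} badblog(one|two)\.blogspot\. [NC]
RewriteRule .* - [F,L]
Returning 403 Forbidden covers both problems at once: hotlinked images stop loading on their pages, and the spam requests are turned away with a bare 403.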
Puzzling, it is at times, that my htaccess does not always behave as intended. As a computer scientist I expect my programs and their file inputs to produce consistent, stable and reliable results immediately. This is not the case with my htaccess file, hosted on Site5, my web hosting provider.
Delays in htaccess Implementation
When I make certain changes to my htaccess, there may be delays of a day or two before they take effect. This is very odd to me, because supposedly the htaccess is checked on every server request. Maybe there is some caching that I do not know about. Nevertheless it seems like the htaccess has a personality of its own. I know that I should not anthropomorphize a computer, much less a security file such as the htaccess on an Apache server, but it is difficult not to.
Apache, the server and not the Indian tribe, is a fickle mistress. She is more than a little unpredictable, or at least it feels that way on Site5. While I realize that Apache is a web server, a program that should be perfectly logical, I often notice very odd behaviour. Maybe it is the server setup, caching, or even traffic volume; I do not know. I do know that if you have an error in your htaccess file, the Apache server will display a combination of IP addresses and host names. Once you fix the error, which no one can point out and for which there is no error message to go by, you will be back to only IP addresses.
After a long while your htaccess might get a tad long. My favourite htaccess checker only processes files up to 5,000 lines. Often the length is due to lots of comments, which I encourage. Let us cover some ways you can shorten your htaccess:
Combine your user agents/referrers
If you have multiple user agents or referrers with similar names, combine them into a single statement,
from:
RewriteCond %{HTTP_USER_AGENT} ^.*Blackboard\ Safeassign [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*BlackWidow [OR]
to:
RewriteCond %{HTTP_USER_AGENT} ^.*Black(board\ Safeassign|Widow) [OR]
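The same consolidation works for referrers; spamsiteone and spamsitetwo below are placeholders, not real domains,
from:
RewriteCond %{HTTP_REFERER} ^.*spamsiteone\.com [OR]
RewriteCond %{HTTP_REFERER} ^.*spamsitetwo\.com [OR]
to:
RewriteCond %{HTTP_REFERER} ^.*spamsite(one|two)\.com [OR]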
Difficult it was, this afternoon, after 7 days of Googlebot crawl error 500s, but I am learning. One errant htaccess regex line was the cause. Hopefully the errors will now go away.
I tried to compress some HTTP_USER_AGENT mod_rewrite rules in my htaccess into a single line, in order to shorten the file, much like the example above.
My htaccess file is getting large as I continually ban more of the bad bots of the world. As it gets larger, there are bound to be more mistakes. One mistake can occur in “deny from” lines, which account for the vast majority of lines in the htaccess. If you let any alpha characters slip into the IP addresses in “deny from” lines, the Apache server will do host lookups and return host names rather than IP addresses. This means that some spammers’ IP addresses will be hidden behind bogus host names. For accuracy it is best to have the Apache server return their IP addresses. With the IPs you can then do host and search lookups, find the spammers and ban them.
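For example, the first two lines below are the form I want, plain IPs or CIDR ranges, while the commented-out line shows the sort of entry with alpha characters that triggers the host lookups (crawler.example.com is just a placeholder):
# good: plain IP addresses or CIDR ranges
deny from 51.255.65.0/24
deny from 151.80.31.0/24
# bad: alpha characters force Apache into host lookups
# deny from crawler.example.com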
These user agents, or bots, somehow fool and subvert my .htaccess user agent rules and continue to scrape my site. I’ve looked at my htaccess user agent rules many times and don’t know why. The next step is to ban their IPs.
AhrefsBot is a large content scraper that hits my site hard. It reads robots.txt but ignores it, and it fools my htaccess. Its user agent string is “Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)”. A sketch of the corresponding htaccess lines follows the list of IP ranges below.
OVH 51.254.0.0 – 51.255.255.255
51.255.65.0/24
51.255.66.0/24
OVH 151.80.16.0 – 151.80.31.255
151.80.31.0/24
OVH 164.132.0.0 – 164.132.255.255
164.132.161.0/24
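A sketch of those htaccess lines, banning AhrefsBot both by user agent and by the /24 ranges above:
# ban AhrefsBot by user agent
RewriteCond %{HTTP_USER_AGENT} ^.*AhrefsBot [NC]
RewriteRule .* - [F,L]
# and ban the OVH ranges listed above
deny from 51.255.65.0/24
deny from 51.255.66.0/24
deny from 151.80.31.0/24
deny from 164.132.161.0/24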
Busy I have been recently, with not much time for my blog, but it was all for a good cause. My web hosting provider informed me that I was taking up too much CPU time on their shared service and banned me. I am a good guy and generally follow the rules, so getting banned is out of character. After a frantic email they restored my account so that I could figure out what had happened. I truly am a “less is more” type of guy, and that includes IT resources, and my online sites are pretty consistent, so a flood of new content was not the issue. Eventually I took some steps to rein in the numerous bots that were scraping and doing whatever to my site, running up CPU usage on my tab and eventually getting me banned. If your site is suffering the same fate, you may glean some hints and tips for reducing your CPU usage.