Busy I have been recently, with not much time for my blog, but it was all for a good cause. My internet service provider (ISP) informed me that I was taking up too much CPU time on their shared service and banned me. I am a good guy and generally follow the rules, so getting banned is out of character. After a frantic email they restored my account so that I could figure out what happened. I truly am a “less is more” type of guy, and that includes IT resources, and my online sites are pretty consistent, so a flood of new content was not the issue. Eventually I took some steps to rein in the numerous bots that were scraping and doing whatever to my site, running up CPU time on my tab and eventually getting me banned. If your site is suffering the same fate, you may glean some hints and tips here for reducing your CPU usage.
Firstly, while my WordPress and Drupal sites are both content management systems (CMS), neither is very active. I really should say that they are semi-static sites, not that I would want to return to coding my sites in HTML. Suffice it to say that it was not me who was pumping out content sufficiently to warrant the 3.5% of CPU I was chewing up on my shared Site5 server. With such a semi-static site, how exactly was I using up so much CPU time?
Site5 could only tell me that I was using up too much CPU time, and not very much more. There is no real monitoring of which processes were causing trouble, and this made it more difficult for me to solve the problem. I would have to change something, and then ask their tech support if it made any difference, a process that was as slow as it was maddening. They are supposedly working on a tool that will come out some time in the future.
Here are some steps that are helpful, but in the end made no appreciable difference:
- Upgrade WordPress and Drupal, and any plugins, to the newest versions
- Optimize your MySQL databases. Also ensure that your tables use a UTF-8 collation; if they do not, your logs will fill with errors. Mine were inadvertently set to latin1_swedish_ci for some unknown reason, which is why I could not store Chinese characters in my databases.
- Reduce the frequency of your cron jobs. This made almost no difference for me.
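If you want to check and fix the collation yourself rather than trust a plugin, a couple of MySQL statements will do it. This is a sketch only; `wp_posts` is a placeholder table name, and you should back up before converting:

```sql
-- See which collation each database is actually using
SELECT schema_name, default_collation_name
FROM information_schema.schemata;

-- Convert one table to UTF-8 (replace wp_posts with your own table name)
ALTER TABLE wp_posts CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```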
Here are some steps that made a big difference:
- Use AWStats: These are statistics collected by your ISP on your site usage, including a list of hosts that are hitting your site and how much bandwidth they are chewing up. I looked up their IP addresses and searched Google to see whether each was a known spam site. Included in these were Googlebot and the Bing bot, both benevolent and well-identified bots that help my site get indexed. I used Whois to identify the site owner, and then MyIP to search for any history of spam or untoward behaviour. The IP addresses eating up the largest portion of my CPU turned out to be well-known spam sites, so I used my IP ban manager to ban them. Once banned they no longer chew up my CPU time and bandwidth. When banning IPs I usually ban the whole last octet range, from 0 to 255 (a /24). If I see further activity from that neighbourhood I will widen the ban one octet at a time. Requests from banned sites will show up in your 403 error report, and better there than wasting your CPU resources.
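cPanel's IP ban manager ultimately writes deny rules into your .htaccess file, so you can also maintain them by hand. A minimal sketch in the classic Apache 2.2 syntax, using documentation-reserved example addresses rather than real spammers:

```apache
# .htaccess -- ban a single address and a whole last-octet range
# (addresses below are made-up examples, not real spam hosts)
order allow,deny
deny from 192.0.2.15         # one specific offender
deny from 198.51.100.0/24    # the whole last-octet range, 0 to 255
allow from all
```

Banned clients then receive a 403 response, which is why they show up on the 403 error report.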
Also scan AWStats’ downloads and commonly viewed URLs and see if there is anything odd. An IP address should not be going after system or login files; the only reason I can think of for doing so is to try to break into your site. If this behaviour is consistent, I ban them.
- Scan the latest visitor stats: On Site5 these stats cover the most recent 300 visitors. Ensure that these visitors are hitting legitimate web pages that you wish them to visit. If you see any untoward behaviour, simply ban the IP address.
- Use robots.txt to tell bots not to crawl certain parts of your site. From AWStats you can see which parts of your site are well used. If certain sections are static, or you do not want that content indexed, then add them to your robots.txt. Legitimate bots will read your robots.txt and not crawl these areas. If you find a bot that disregards your robots.txt, and there are a lot of them out there, ban it. Note that bots do not read your robots.txt on every crawl, so give them a day or so. I found that Googlebot took over a day to pick up my robots.txt changes, while Bingbot was much faster.
In my case I was running an aggregator of RSS feeds, and legitimate bots were indexing this content. Because this aggregator runs multiple times a day the bots were going crazy and hitting my site hard. After adding the aggregator to my robots.txt their onslaught stopped and my ISP let me live another day.
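A robots.txt to match this situation is only a few lines. This is a sketch with hypothetical paths; the aggregator path in particular is whatever URL your own feed aggregator lives at:

```
# robots.txt -- goes in the site root; paths below are examples
User-agent: *
Disallow: /aggregator/
Disallow: /static-archive/
# Honoured by some bots (e.g. Bingbot) but not by Googlebot:
Crawl-delay: 10
```

Remember that this is purely advisory: well-behaved bots obey it, and the badly behaved ones are the reason for the banning steps above and below.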
- Ban a list of known bad bots using your .htaccess file. Several sites publish ready-made blocklists; you simply cut and paste from the examples given and let your web server do the rest. Known bad bots will be banned.
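These blocklists typically match on the User-Agent header with mod_rewrite rules. A minimal sketch of the pattern, with a few illustrative bot names standing in for a real, maintained list:

```apache
# .htaccess -- refuse requests from bots identifying with bad User-Agents
# (names here are illustrative; paste in a maintained blocklist instead)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (EmailCollector|WebCopier|HTTrack) [NC]
RewriteRule .* - [F,L]
```

The `[F]` flag sends a 403 Forbidden, so these refusals also appear on your 403 error report. Note that malicious bots can forge their User-Agent, which is why IP bans and the bot trap are still needed.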
- Find out who these anonymous bots are: It was clear that the problem was not my site’s content, but whom it attracted. Much of the heavy traffic to my site was not live human visitors but automated bots, many of them scraping my site hard enough to bring the CPU to its knees. One suggestion from Site5 was to create a bot trap: a section of my site linked from a normal web page via a tiny 1 x 1 pixel transparent dot, which cannot be seen by humans but will be followed by bots. The link goes to a subdirectory that I have specifically told bots not to enter, using the robots.txt file. If a bot gets to this area of my site, I ban it. I also log its IP address and bot name, and send myself an email notification when this happens.
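The visible half of the trap is just an invisible link on a normal page. A sketch, with hypothetical paths (`/bot-trap/` and the dot image are placeholders you would choose yourself):

```html
<!-- Invisible 1x1 link: humans never see or click it, bots follow it -->
<a href="/bot-trap/"><img src="/images/dot.gif" width="1" height="1"
  alt="" border="0"></a>
```

The matching robots.txt entry is `Disallow: /bot-trap/`, so any visitor that reaches `/bot-trap/` has, by definition, ignored your robots.txt and earned a ban.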
- One of the benefits of the bot trap code is that it also documents your HTTP 403 errors: requests where someone tried to get into a part of your site that they should not. This log is useful because you can see what they were trying to access, track down their IP addresses and, if they are not legitimate, ban them. Bots and people should not be trying to access unauthorized areas of your site, and if they consistently do, you know there is a problem. IP addresses that you have already banned will also be logged in this 403 error report.
- Download your raw access logs: These are available from cPanel or your ISP and contain all transactions on your site that did not generate errors. The file will be compressed, so you will need to uncompress it. The file name will be your site name, so rename it with a .txt file extension or you won’t be able to open it. I used Excel to process it: Excel recognizes it as a delimited text file and asks for a delimiter; I used a hyphen (“-”) and it read in easily. Sort the first column of IP addresses. For any IP address with more than a page of transactions, look it up in Google, and if you don’t recognize it, ban it. Googlebot will be in this list, so don’t ban everyone. I am still looking for three anonymous bots that are taking up lots of bandwidth, and this may be the way to find their IP addresses. Once you have done this import into Excel, processing your raw access log goes very quickly. I am only looking for the large bandwidth users, but you could also use the log to find malicious users and bots.
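If you would rather skip the Excel import, the same per-IP tally can be sketched in a few lines of Python. This assumes the usual Apache combined log format, where the client IP is the first field on each line; the sample lines below are made up for illustration:

```python
from collections import Counter

def top_talkers(lines, limit=10):
    """Count requests per IP across access-log lines, busiest first.

    In the Apache combined log format the client IP is the first
    whitespace-separated field on each line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:                      # skip any blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)

# Tiny inline sample standing in for a real raw access log
sample = [
    '198.51.100.7 - - [10/Oct/2013:13:55:36 -0700] "GET /feed HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2013:13:55:37 -0700] "GET /feed HTTP/1.1" 200 512',
    '192.0.2.15 - - [10/Oct/2013:13:55:38 -0700] "GET / HTTP/1.1" 200 1024',
]
print(top_talkers(sample))  # busiest IPs first
```

For a real log, open the uncompressed file and pass the file object in: `top_talkers(open("yoursite.txt"))`. The IPs at the top of the list are the ones worth looking up before deciding whether to ban.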
You will obviously see legitimate browsers visiting your site. This is the whole reason for having a web site. People will download certain documents, view graphics and such. These are legitimate uses of your site that you approve. If you do not approve then you can do something about it, because any web activity has a cost, a cost borne by the web site provider, which is you.
Work continues on my site to try to track down the anonymous bots and to ban them. I realize they are crafty and do not want to be found. Bots can take over an unsuspecting computer and use it to spam sites like mine. If I need to ban the unsuspecting computer from my site, then so be it. Until I find another way this crude method will have to do. If you have any suggestions please feel free to add a comment.