City of Toronto internet scraper bot scrapes my site a couple of times per month. Why? Toronto, Canada
I live in the City of Toronto, and write about Toronto-related subjects. What is surprising is that the City of Toronto has an internet bot that randomly scrapes content from my site a couple of times each month. The bot started scraping me near the end of January 2017.
What is interesting is that I, a concerned citizen, actually emailed them because I thought they had a zombie PC taken over by a bot, or some other security issue. I sent the City a log of the relevant entries related to their IP address. Was I naive! Here is their reply (isg@toronto.ca):
Today I received a massive 1,000-line scraper attack from spbot, from OpenLinkProfiler.org. The IP address is 138.197.47.148, a Digital Ocean IP, which I have banned. I’ve also added spbot to my robots.txt. I sent a complaint letter to Digital Ocean at abuse@digitalocean.com:
Hi there,
Today I received a 1000 line scrape from one of your IP addresses:
138.197.47.148
The UA is Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )
Please have them cease their scraping activity as it unnecessarily uses up my bandwidth and CPU time.
I have included today’s log entry with their activity:
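As an aside on the robots.txt entry mentioned earlier: a rule for spbot can be sanity-checked with Python's standard `urllib.robotparser` before relying on it. This is a minimal sketch, not my actual file; the rules in `ROBOTS_TXT` and the helper name `is_allowed` are assumptions for illustration, and robots.txt only helps if the crawler chooses to honour it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block spbot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: spbot
Disallow: /

User-agent: *
Disallow:
"""

def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # Pass the bot's product token ("spbot"), not the full User-Agent header.
    print(is_allowed("spbot", "/any/page.html"))
    print(is_allowed("SomeOtherBot", "/any/page.html"))
```

The parser matches on the product token, so passing the whole "Mozilla/5.0 (compatible; spbot/5.0.3; ...)" string would not match the spbot rule.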
Domain Crawler hit my server with a 500-transaction attack today, using 5 IP addresses, all from Sweden. They scraped me hard! Their user agent is “DomainCrawler/3.0 (info@domaincrawler.com; http://www.domaincrawler.com/dontai.com)”. I have banned all of these IP addresses, right down to the last octet. Good riddance.
80.248.225.142 Internetbolaget Se domaincrawler
80.248.227.107 Internetbolaget Se domaincrawler
176.74.192.36 Tralex Se domaincrawler
193.183.102.178 Internetbolaget Se domaincrawler
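Banning by full address (last octet included) is different from banning a whole range. The sketch below uses Python's standard `ipaddress` module to show both; the helper name `is_banned` and the /24 range in the usage lines are assumptions for illustration, and how the ban is actually enforced depends on the web server or firewall in use.

```python
import ipaddress

# The addresses listed above, banned as exact matches.
BANNED_ADDRESSES = frozenset(
    ipaddress.ip_address(a)
    for a in (
        "80.248.225.142",
        "80.248.227.107",
        "176.74.192.36",
        "193.183.102.178",
    )
)

def is_banned(addr, addresses=BANNED_ADDRESSES, networks=()):
    """Return True if addr is an exact banned address or falls in a banned network."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # not a valid IP address string
    return ip in addresses or any(ip in net for net in networks)

if __name__ == "__main__":
    print(is_banned("80.248.225.142"))  # exact match
    print(is_banned("80.248.225.143"))  # neighbouring address, not banned
    # Hypothetical wider ban covering the whole /24 instead:
    whole_range = [ipaddress.ip_network("80.248.225.0/24")]
    print(is_banned("80.248.225.143", networks=whole_range))
```

Exact-address bans avoid blocking innocent neighbours, at the cost of having to add each new address the crawler moves to.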
This lot of five came onto my site with an innocent but fake User Agent name of “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”, scraped some documents, and then proceeded to try to break into my site’s security. Cheeky bastards.
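A Googlebot user agent is easy to fake, but the claim can be checked with a reverse DNS lookup followed by a forward lookup, which is the approach Google publishes for verifying its crawler. The following is a minimal sketch under assumptions: the function name `is_real_googlebot` and the injectable lookup parameters are my own, the accepted host name suffixes should be checked against Google's current guidance, and the default forward lookup only handles IPv4.

```python
import socket

# Host name suffixes treated as Google's (an assumption; verify against current docs).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse (PTR) record
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # resolves, but not to a Google host name
    try:
        return ip in forward_lookup(host)  # forward-confirm the host name
    except OSError:
        return False

if __name__ == "__main__":
    # 203.0.113.9 is a documentation-only address, used here as a stand-in.
    print(is_real_googlebot("203.0.113.9"))
```

The forward lookup matters because whoever controls an address can set its reverse record to any name they like; only the real owner of the domain can make that name resolve back to the same address.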
Seven attempts at document scraping, followed by 9 attempted logins. The interesting thing is that when you use a computer to run these campaigns, unless you are clever they really do look computer-generated and are thus easy to identify. What real user would behave like this? Of course they have all been banned.
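A pattern like this (a handful of page fetches, then a burst of logins) can be pulled out of an access log with a few lines of Python. This is only a sketch under assumptions: it expects the common/combined log format, the login path `/login` is hypothetical and should be replaced with the site's real one, and the threshold of 3 is arbitrary.

```python
import re
from collections import defaultdict

# Matches the start of a common/combined log line: IP, then the quoted request.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3})')

LOGIN_PATHS = ("/login",)  # hypothetical; use the site's actual login URL(s)

def summarize(lines, login_paths=LOGIN_PATHS):
    """Count page requests and login requests per client IP."""
    stats = defaultdict(lambda: {"pages": 0, "logins": 0})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, _method, path, _status = match.groups()
        if path.split("?")[0] in login_paths:
            stats[ip]["logins"] += 1
        else:
            stats[ip]["pages"] += 1
    return dict(stats)

def suspicious(lines, max_logins=3, login_paths=LOGIN_PATHS):
    """Return the sorted IPs that made more than max_logins login requests."""
    stats = summarize(lines, login_paths)
    return sorted(ip for ip, s in stats.items() if s["logins"] > max_logins)

if __name__ == "__main__":
    with open("access.log") as log:  # hypothetical log file name
        for ip in suspicious(log):
            print(ip)
```

A human visitor rarely tries to log in nine times in a row, so even a crude per-IP count like this separates the bots from the readers.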
These hosts try hard to hide the identity behind their IP addresses, in order to scrape content from web sites and sometimes break into them. They have specifically scraped mine, and so I hunted them down and banished them. Oftentimes the unix host command returns nothing, so further research is required. This usually works.