City of Toronto internet scraper bot scrapes my site a couple of times per month. Why? Toronto, Canada
I live in the City of Toronto, and write about Toronto-related subjects. What is surprising is that the City of Toronto has an internet bot that randomly scrapes content from my site a couple of times each month. The bot started scraping me near the end of January 2017.
What is interesting is that I, a concerned citizen, actually emailed them because I thought they had a zombie PC taken over by a bot, or some other security issue. I sent the City a log of the relevant entries for their IP address. Was I naive! Here is their reply (email@example.com):
Today I received a massive 1,000-line scraper attack from spbot, from OpenLinkProfiler.org. The IP address is 184.108.40.206, a Digital Ocean IP, which I have banned. I’ve also added spbot to my robots.txt. I sent a complaint letter to Digital Ocean at firstname.lastname@example.org:
Today I received a 1000 line scrape from one of your IP addresses:
The UA is Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )
Please have them cease their scraping activity as it unnecessarily uses up my bandwidth and CPU time.
I have included today’s log entry with their activity:
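Separately from the letter, the robots.txt entry for spbot can be sanity-checked before relying on it. The sketch below assumes the rule is a plain "User-agent: spbot / Disallow: /" pair (my actual file's wording is not shown here) and uses Python's standard urllib.robotparser to confirm that the rule blocks spbot while leaving other agents alone. The helper name is_allowed is hypothetical.

```python
from urllib import robotparser

# Assumed wording of the robots.txt rule; the real file may differ.
ROBOTS_TXT = """\
User-agent: spbot
Disallow: /
"""

def is_allowed(agent, url="/", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    for agent in ("spbot", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

Keep in mind that robots.txt is only advisory: a bot that ignores it still has to be stopped with an IP or user agent ban.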
Domain Crawler hit my server with a 500-transaction attack today, using 5 IP addresses, all from Sweden. They scraped me hard! Their user agent is “DomainCrawler/3.0 (email@example.com; http://www.domaincrawler.com/dontai.com)”. I have banned all these IP addresses with their last octet. Good riddance.
220.127.116.11 Internetbolaget Se domaincrawler
18.104.22.168 Internetbolaget Se domaincrawler
22.214.171.124 Tralex Se domaincrawler
126.96.36.199 Internetbolaget Se domaincrawler
This lot of five came to my site with an innocent-looking but fake User Agent name of “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”, scraped some documents, and then proceeded to try to break into my site’s security. Cheeky bastards.
Seven attempts at document scraping, followed by 9 attempted logins. The interesting thing is that when you use a computer to run these campaigns, if you are not clever the requests really do look computer-generated and are thus easy to identify. What real user would behave like this? Of course they have all been banned.
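A user agent string is trivial to fake, so a claimed Googlebot can be checked by DNS instead. The sketch below is one way to do it, not necessarily how I caught this lot: reverse-resolve the IP, check that the host name ends in a Google crawler domain, then forward-resolve that name and confirm it maps back to the same IP. The suffix list and the function name is_real_googlebot are assumptions for illustration, and the default lookups handle IPv4 only.

```python
import socket

# Assumed list of host name suffixes used by Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=None, forward=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    `reverse` and `forward` can be swapped out for testing; by default
    they use the system resolver.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
    except OSError:
        # No PTR record at all: cannot be confirmed as Googlebot.
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The claimed host name must resolve back to the original IP.
        return ip in forward(host)
    except OSError:
        return False
```

Calling is_real_googlebot with an IP address taken from the access log returns False for any visitor whose reverse DNS does not belong to Google, whatever its user agent claims.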
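The scrape-then-login pattern is regular enough to pick out of an access log automatically. Here is a minimal sketch, assuming Apache/Nginx "combined" style log lines and a hypothetical login path of /wp-login.php (adjust both to your own server); the function name and the threshold of three attempts are also just illustrative choices.

```python
import re
from collections import defaultdict

# Assumption: "combined" log format, e.g.
# 203.0.113.9 - - [01/Feb/2017:10:00:00 -0500] "GET /page HTTP/1.1" 200 ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)')
LOGIN_PATH = "/wp-login.php"  # hypothetical login URL

def scrape_then_login(lines, min_logins=3):
    """Return IPs that fetched pages and then made repeated login attempts."""
    pages = defaultdict(int)
    logins = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, _method, path = match.groups()
        if path.startswith(LOGIN_PATH):
            # Count a login attempt only if this IP fetched a page earlier.
            if pages[ip]:
                logins[ip] += 1
        else:
            pages[ip] += 1
    return sorted(ip for ip, count in logins.items() if count >= min_logins)
```

Feeding it the lines of a day's access log gives a short list of IP addresses to look at before banning.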
This is a preview of “Bad Bot: Cheeky Scraper campaign, then login attempts”. Read the full post (1522 words, 1 image, estimated 6:05 mins reading time)
These hosts try hard to hide who is behind their IP addresses, in order to scrape content from web sites and sometimes break into them. They have specifically scraped mine, so I hunted them down and banished them. Oftentimes the Unix host command returns nothing, so research is required. This usually works.