User Agents I Could Not Ban with htaccess

These user agents, or bots, somehow fool and subvert my .htaccess user agent rules and continue to scrape my site. I’ve looked at my htaccess user agent rule many times and don’t know why. The next step is to ban their IP.
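The full rule set is not shown in this post, so the following is only a sketch of how user agent rules like the ones quoted further down are usually assembled. It assumes mod_rewrite is enabled and that .htaccess overrides are allowed; the bot names are taken from the rules in this post, but the way they are combined here is an assumption.

```apache
# Sketch only: assumes mod_rewrite is loaded and AllowOverride permits it.
<IfModule mod_rewrite.c>
RewriteEngine On
# [NC] makes the match case-insensitive; [OR] joins a condition to the next one.
# The pattern is unanchored, so a leading ^.* is not needed.
RewriteCond %{HTTP_USER_AGENT} BoardReader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Semrush [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SeznamBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twingly [NC,OR]
# The last condition carries no [OR], since nothing follows it.
RewriteCond %{HTTP_USER_AGENT} magpie-crawler [NC]
# Answer any matching request with 403 Forbidden.
RewriteRule ^ - [F]
</IfModule>
```

Two things worth checking when a rule like this appears to be ignored: a stray [OR] left on the final condition, and whether the server actually reads .htaccess for the directory in question.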

AhrefsBot is a large content scraper that hits my site hard, reads robots.txt but ignores it, fools my htaccess, bot is “Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)”
OVH 51.254.0.0 – 51.255.255.255
51.255.65.0/24
51.255.66.0/24
OVH 151.80.16.0 – 151.80.31.255
151.80.31.0/24
OVH 164.132.0.0 – 164.132.255.255
164.132.161.0/24

BoardReader Blog Indexer(http://boardreader.com)
BoardReader/1.0 (http://boardreader.com/info/robots.htm)-CommentCrawler5
This is a terrible bot. It does not read robots.txt, and ravages my site repeatedly. I could not ban it using htaccess for an unknown reason, so had to use IP banning. I banned the whole company, Waveform Technology, Troy, MI
htaccess rule: ^.*BoardReader [NC,OR]
199.16.184.0 – 199.16.191.255
204.11.32.0 – 204.11.35.255
208.64.36.0 – 208.64.39.255
208.79.208.0 – 208.79.215.255
208.92.216.0 – 208.92.233.255
deny from 199.16.184.0/21
deny from 204.11.32.0/22
deny from 208.64.36.0/22
deny from 208.79.208.0/21
deny from 208.92.216.0/21
deny from 208.92.224.0/21
deny from 208.92.232.0/23
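The deny lines above use the older Apache 2.2 syntax. Below is a sketch of how they might sit in .htaccess on either server generation. The post does not say which Apache version is in use, so the split by version is an assumption, as is the idea that everyone else should be allowed in.

```apache
# Apache 2.4 (mod_authz_core present): allow everyone except the listed ranges.
<IfModule mod_authz_core.c>
    <RequireAll>
        Require all granted
        Require not ip 199.16.184.0/21
        Require not ip 204.11.32.0/22
        Require not ip 208.64.36.0/22
        Require not ip 208.79.208.0/21
        Require not ip 208.92.216.0/21
        Require not ip 208.92.224.0/21
        Require not ip 208.92.232.0/23
    </RequireAll>
</IfModule>

# Apache 2.2 (no mod_authz_core): same effect with the older directives.
<IfModule !mod_authz_core.c>
    Order Allow,Deny
    Allow from all
    Deny from 199.16.184.0/21
    Deny from 204.11.32.0/22
    Deny from 208.64.36.0/22
    Deny from 208.79.208.0/21
    Deny from 208.92.216.0/21
    Deny from 208.92.224.0/21
    Deny from 208.92.232.0/23
</IfModule>
```

Unlike a user agent rule, an IP ban does not depend on anything the bot sends, which is why it keeps working when the bot changes its name.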

Buzzbot scrapes content, reads but gets an error for Robots.txt, bot is “Buzzbot/1.0 (Buzzbot; http://www.buzzstream.com; buzzbot@buzzstream.com)”
Amazon 52.0.0.0-52.31.255.255
52.0.19.0

Facebook has an image scraper bot “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)”, does not read robots.txt, fools my htaccess. They scrape so much and use so many IPs that I banned their whole 173.252 range.
31.13.97.0/24
31.13.98.0/24
31.13.110.0/24
31.13.113.0/24
66.220.144.0 – 66.220.159.255
69.63.176.0 – 69.63.191.255 69.63.176.0/20
69.171.224.0 – 69.171.255.255 69.171.224.0/19
173.252.64.0 – 173.252.127.255 173.252.64.0/18
203.104.128.0 – 203.104.158.255 a company called LINE fakes the facebook user agent!

Flamingo_SearchEngine, a feed scraper bot is “Flamingo_SearchEngine (+http://www.flamingosearch.com/bot)”
I was about to ban this search engine because when I looked them up there was no search and no info online. Looking up their IPs I find that they are hosted by Amazon. They do not read robots.txt
54.144.0.0 – 54.159.255.255 54.144.0.0/12
54.147.188.0/24
54.224.0.0 – 54.239.255.255

GroupHigh scrapes feeds and content, does not read the Robots.txt. They scrape multiple times a day, eating up bandwidth. Their bot “Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)”
50.128.0.0-50.255.255.255
50.203.216.14

ips-agent is a content scraper, reads robots.txt, bot is “Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:14.0; ips-agent) Gecko/20100101 Firefox/14.0.1”
Verisign 69.58.176.0-69.58.191.255
69.58.178.56

Kik is an image scraper, does not read robots.txt, fools my htaccess, bot is “Kik/9.10.0.5037 (Android 5.1.1) Mozilla/5.0 (Linux; Android 5.1.1; LG-H901 Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/48.0.2564.106 Mobile Safari/537.36”. Kikbot’s history shows that they are using single IPs and not a range, making it harder for me to kill. On the plus side, they use a combo of US and Canadian ISPs, while on the negative side they are a pain in the ass.
DNA 37.219.0.0 – 37.219.127.255
37.219.97.70
Cogeco 45.78.160.0 – 45.78.191.255
45.78.166.37
St. Clair County Regional Educational 64.90.128.0 – 64.90.143.255
64.90.141.2
Sprint 66.87.0.0 – 66.87.255.255
66.87.79.191
Cox Comm 68.96.192.0 – 68.96.223.255
68.96.203.104
Charter 68.184.0.0 – 68.191.255.255
68.188.187.84
Cellco 70.192.0.0 – 70.223.255.255
70.197.143.69
Charter 71.9.0.0 – 71.9.15.255
71.9.11.0/24
71.9.11.237
CABLE-1 73.0.0.0 – 73.255.255.255
73.28.231.55
Comcast 73.184.0.0 – 73.184.255.255
73.184.11.45
BSKYB 90.210.0.0 – 90.211.255.255
90.211.189.92
AT&T 99.128.0.0 – 99.191.255.255
99.189.21.187
OPTUS 110.20.0.0 – 110.23.255.255
110.22.88.36
Consumer_DSL 112.207.0.0 – 112.207.127.255
112.207.119.20
ISC 149.20.0.0 – 149.20.255.255
149.20.80.82
Service Provider 166.128.0.0 – 166.255.255.255
166.137.118.87
T-Mobile 172.32.0.0 – 172.63.255.255
172.56.15.0/24 172.58.233.0/24
172.56.15.164
172.58.233.53
Telus 173.180.0.0 – 173.183.255.255
173.180.95.39
173.184.11.45

magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net). Does not read robots.txt. There is a second bot associated with the same IP that has the name “robots”. Banning this IP removes both bots.
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
htaccess rule: ^.*magpie\-crawler, ^.*robots
94.228.34.0/24
94.228.34.248

NetTrekker is an image scraper, does not read Robots.txt, fools my htaccess, bot is “Mozilla/5.0 (compatible;netTrekker-Link-Checker-AAMS/1.0)”. It always reads the same image.
209.235.192.0-209.235.255.255
209.235.202.0/24
209.235.220.0/24

OrangeBot is a scraper that reads my Robots.txt but ignores it.
Mozilla/5.0 (compatible; OrangeBot/2.0; support.orangebot@orange.com)
81.52.142.0 – 81.52.143.255
193.252.118.0 – 193.252.118.255

Qwant is a content and tag scraper, does not read robots.txt, fools my htaccess, bot is “Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)/*”
194.187.168.0 – 194.187.171.255
194.187.168.0/24

Seznam CZ is an image scraper. It reads but ignores my Robots.txt and fools my htaccess. Bot is “Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)”
77.75.77.0/24
77.75.77.17

SPUTNIK does not read Robots.txt and tricks my htaccess
Mozilla/5.0 (compatible; SputnikImageBot/2.3; +http://corp.sputnik.ru/webmaster)
5.143.224.0 – 5.143.231.255
5.143.231.0/24

Sysomos scans for my feeds multiple times a day, every day, but my content does not change that often. It is simply too much. They are kind enough to use only a single IP, so I don’t need to ban the range. They don’t read the robots.txt, and bypass my htaccess rule. Bot is “Mozilla/5.0 (compatible; Sysomos/1.0; +http://www.sysomos.com/; Sysomos)”
Savvis 205.138.0.0 – 205.140.175.255
205.139.141.54

Uptimebot scrapes the HEAD daily. I don’t really know what it does. Bot is “Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.uptime.com/uptimebot)”, does not read Robots.txt, fools my htaccess.
Linode 45.79.0.0-45.79.255.255
45.79.81.0/24
45.79.89.0/24

Photon is from Automattic, of WordPress fame.
They do not read the robots.txt, and can somehow fool my htaccess. Multiple IP addresses, so had to ban the whole shebang. They were constantly scraping images.
Photon/1.0
192.0.64.0-192.0.127.255
192.0.64.0/18

192.243.55.138 Mozilla/5.0 (compatible; SemrushBot/1~bl; +http://www.semrush.com/bot.html)
Does not read robots.txt
Mozilla/5.0 (compatible; SemrushBot/1~bl; +http://www.semrush.com/bot.html)
192.243.55.0/24
htaccess rule: ^.*Semrush [NC,OR]

Seznam reads my robots.txt but ignores it. It’s a combo image scraper and feed bot “Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)”. It fools my htaccess.
htaccess rule: ^.*SeznamBot [OR]
77.75.76.0 – 77.75.76.255
77.75.76.0/24

Twingly Recon receives feeds only, does not read robots.txt. It fools my htaccess. It is tricky because it does not have the common bot, spider, crawler name
Mozilla/5.0 (compatible; Twingly Recon; twingly.com)
htaccess rule: ^.*Twingly [NC,OR]
80.252.171.64 – 80.252.171.95
80.252.171.64/27

unrulymedia scrapes my site every couple of days, does not read Robots.txt, fools my htaccess. Bot is “viralvideochart.unrulymedia.com vvcscanner@unrulymedia.com”
Amazon.com 23.20.0.0-23.23.255.255
23.22.131.0/24

Mozilla/5.0 (compatible; XoviBot/2.0; +http://www.xovibot.net/)
Could not stop it with htaccess, does not read Robots.txt, so had to ban the IPs.
185.53.44.0/24
212.224.119.128-212.224.119.191

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon; .NET CLR 1.1.4322)
I thought .NET CLR was a bot, but it turns out I got heavy comment spam from Colocrossing/Hudson Valley Host. Banned a huge number of their IP ranges. They hit me with 45 comment spam messages in one evening.

Anonymous Bots
Accenture, anonymous scraper, does not read robots.txt
170.251.176.0 – 170.252.255.255
170.252.0.0/16, 170.251.192.0/18, 170.251.176.0/20

Alisoft is scraping for images using bot “Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/WK10171 Safari/8536.25”
121.40.0.0 – 121.43.255.255

Amazon is coming in with a scraper bot “Ruby”. Why so anonymous? It turns out that Microsoft, Google and others have anonymous bots as well. Amazon also has a scraper bot “-“. Odd that “Mozilla/5.0 (compatible; DomainAppender /1.0; +http://www.profound.net/domainappender)” and “Mozilla/5.0 (compatible; Cliqzbot/1.0 +http://cliqz.com/company/cliqzbot)” are also using Amazon IPs. I’ve emailed them and was told it was resolved, but the next day I still have these bots scraping my site, so I banned IPs.
52.32.0.0 – 52.63.255.255
52.32.0.0/11
52.34.246.205
52.36.100.209
52.36.176.116 POSTed a spam comment
52.37.147.174

Bank of America is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”. Why? I’m a Canadian and am not even in their jurisdiction. They do not read robots.txt, and even if they did they have no unique User Agent name
171.128.0.0 – 171.206.255.255
171.128.0.0/10, 171.192.0.0/13, 171.200.0.0/14, 171.204.0.0/15, 171.206.0.0/16
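This user agent has no bot name to key on, but the whole string can still be matched exactly. A sketch, assuming mod_rewrite is available and assuming that no wanted visitor sends this exact string:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Anchored with ^ and $ so only this exact string matches, not every
# browser whose user agent merely starts with "Mozilla/4.0 (compatible;".
# The pattern is quoted because it contains a space.
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible;\)$"
RewriteRule ^ - [F]
</IfModule>
```

The same string shows up for several other organizations in this list, so one rule of this kind would cover all of them without banning whole IP ranges.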

Bank of Montreal is running an anon scraper bot “Mozilla/4.0 (compatible;)”
198.96.168.0 – 198.96.183.255
198.96.180.245

a bot from BITSRO, South Korea, is an anonymous content scraper “Mozilla/5.0 (compatible; MSIE or Firefox mutant;) Daum 4.1”
203.133.160.0 – 203.133.191.255
203.133.169.0/24

Government of Canada Shared Services is scraping my site for images?!? Bot is “Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)”. What the…
198.103.0.0 – 198.103.255.255
198.103.180.1

CHINANET Guangdong, anon scraper using bot “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36”
183.0.0.0 – 183.63.255.255
183.60.243.0/24

China TieTong Telecommunications has a massive scraper, with a wide variety of bot names, too numerous to document. They also POST.
123.64.0.0 – 123.95.255.255
123.72.158.215 Dalvik/1.6.0 (Linux; U; Android 4.2.2; Lenovo A369i Build/JDQ39)

Dalvik-based anon scraper bot “Dalvik/1.6.0 (Linux; U; Android 4.2.2; Lenovo A369i Build/JDQ39)”, different builds, all coming in with a referrer of “-“.
AT&T 108.192.0.0 – 108.255.255.255
108.234.27.159
XL_INFRASTRUKTUR 112.215.0.0 – 112.215.75.0
112.215.63.0/24
TELEFÔNICA BRASIL 189.68.0.0 – 189.69.255.255
189.68.21.243
Albanian Mobile 31.22.48.0 – 31.22.55.255
31.22.50.0/24
31.22.52.0/24

Drake Holdings has a scraper that takes up a huge amount of bandwidth on my site. Their bot is an anonymous “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)”, and was hard to find.
204.79.180.0/24

Eaton.com, a power management company, is scraping my site with an anon bot “Mozilla/4.0 (compatible;)”
192.104.67.0 – 192.104.67.255
192.104.67.121

Fibrenoir is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”.
208.94.104.0 – 208.94.111.255
208.94.104.0/21

HETZNER has a bot that changes names: “Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0”, “Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0”, and “Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36”. It also changes its referrers, which are all bogus as well. It only looks for awstats tags, which is odd.
178.63.100.0/24

IQ-PL is scraping for Semrush tags, using bot “Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 ( .NET CLR 3.5.30729; .NET CLR 4.0.20506)”
46.248.160.0-46.248.191.0

Kintiskton LLC is scraping with the anon bot “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)”
65.208.151.112 – 65.208.151.119
65.208.151.112/29

Leaseweb is scraping with an anon bot called “Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)” and a bot “-”
91.109.16.0 – 91.109.23.255
95.211.142.0 – 95.211.144.255

Unit Permodenan Tadbiran dan Perancangan Pengurusan Malaysia (Government of Malaysia) is comment spamming me?
103.8.161.137 – 103.8.161.138
wfmdb.1mocc.gov.my
103.8.161.138

OVH is scraping for tags using bot “Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0”
37.187.56.0 – 37.187.57.255

PrivateCloud is scraping my feeds for comments, with a user agent of “-“, and bypasses my htaccess.
149.202.157.208 – 149.202.157.223
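A user agent of “-” in an Apache access log normally means the request carried no User-Agent header at all, so a rule that looks for a bot name has nothing to match. A sketch of a rule for that case, assuming mod_rewrite is available; note that it will also turn away any legitimate client that sends no user agent:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Matches a missing or empty User-Agent header, or a literal "-".
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^ - [F]
</IfModule>
```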

Network Technology Experiment Validation and Demonstration Center
FIT Center, Tsinghua University
203.91.120.0 – 203.91.127.255

RO-INFINITY is an anonymous scraper, bot is “Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60”
46.102.234.0/24

Romtel has an anon scraper bot “Java/1.6.0_04”, does not read robots.txt
92.85.168.0 – 92.85.175.255
92.85.172.59

RO-RESIDENTIAL is scraping with an anonymous bot “Java/1.6.0_04”
79.116.24.0 – 79.116.31.255
79.116.24.0/21

Royal St. George’s College is continually scraping a specific image, using bot “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4”
216.223.144.208 – 216.223.144.223
216.223.144.208/28

Savvis is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”, for images.
72.35.0.0 – 72.35.31.255
72.35.0.0/19

Shoppers Drug Mart has a scraper bot “Mozilla/4.0 (compatible;)”. What are they doing on my site?
205.189.140.0/24
205.189.140.138

SCHLUND is scraping for comments and images, anonymously with a bot “-”
212.227.118.0 – 212.227.118.255
212.227.118.0/24

TD Bank has an anon bot “Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.3)”
142.205.0.0 – 142.205.255.255
nat-soc-241-254.tdbank.ca
142.205.241.254

TELEVIDEO NOVELDA has an anon scraper bot “Java/1.8.0_40”, does not read Robots.txt
130.185.88.0 – 130.185.95.255

Test Spider 0.2, no purpose, just crawling around scraping content
Amazon 54.240.0.0 – 54.255.255.255
54.242.26.226

United States Army Information Systems Command (USAISC) is scraping my site with a bot called “Mozilla/4.0 (compatible;)”, essentially anonymous. Does not read robots.txt, no unique User Agent name.
138.162.0.0 – 138.162.255.255 gate3-norfolk.nmci.navy.mil
138.162.0.43
138.163.0.0 – 138.163.255.255 gate2-bremerton.nmci.navy.mil
138.163.106.72
143.85.0.0 – 143.85.255.255
143.85.0.0/16

Virginia School for the Deaf and the Blind: VSDB is scraping my site for images
vsdbs.virginia.gov
69.72.243.186

WORLDSTREAM is scraping for images with a bot “Mozilla/5.0 (iPad; CPU OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11A501” and “Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3” and “Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30”
217.23.9.0 – 217.23.9.255

YOMURA Corporation is scraping my site with an anonymous bot “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11”
199.233.244.0 – 199.233.247.255
199.233.244.0/22
199.233.246.203

ZAGREBACKA BANKA has a content scraper anonymous bot “Mozilla/4.0 (compatible;)” and “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)”
195.29.221.160 – 195.29.221.175

ZOMRO is doing something similar to Hetzner, looking for awstats tags. Their anonymous bot is “Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36”. They download the same info 3 times per night, very wasteful.
93.170.141.0/24 93.170.168.0/24

Search Engines but Anonymous
Microsoft using bot “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)”
65.52.0.0 – 65.55.255.255
131.253.21.0 – 131.253.47.255
131.253.36.203
157.54.0.0 – 157.60.255.255
157.55.39.166
207.46.0.0 – 207.46.255.255
207.46.13.191

Google is masquerading as host name “mail.chandco.ca”, bot called “Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0; InfoPath.2; MSOffice 12)”. What the hey?!?
mail.chandco.ca 74.125.207.121
