User Agents I Could not Ban with htaccess

These user agents, or bots, somehow fool and subvert my .htaccess user agent rules and continue to scrape my site. I’ve looked at my htaccess user agent rule many times and don’t know why. The next step is to ban their IP.

AhrefsBot is a large content scraper that hits my site hard, reads robots.txt but ignores it, fools my htaccess, bot is “Mozilla/5.0 (compatible; AhrefsBot/5.0; +”

BoardReader Blog Indexer(
BoardReader/1.0 (
This is a terrible bot. It does not read robots.txt, and ravages my site repeatedly. I could not ban it using htaccess for an unknown reason, so I had to use IP banning. I banned the whole company, Waveform Technology, Troy, MI.
htaccess rule: ^.*BoardReader [NC,OR]
BoardReader Blog Indexer(
BoardReader/1.0 (
deny from
deny from
deny from
deny from
deny from
deny from
deny from
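
A rule fragment like the one above is normally one line inside a mod_rewrite block. The following is a minimal sketch of such a block, not the exact file used on this site. It assumes mod_rewrite is enabled and that the host allows rewrite directives in .htaccess, and it reuses only patterns quoted in this post. Two things worth checking when a rule seems to be ignored: each fragment has to follow a RewriteCond on %{HTTP_USER_AGENT}, and the last condition in an [OR] chain should not carry the OR flag.

```apache
# Sketch only: assumes mod_rewrite is available and enabled for this directory.
RewriteEngine On

# One RewriteCond per user agent.
# [NC] makes the match case-insensitive, [OR] means "or the next condition".
# The patterns are unanchored, so a leading ^.* is optional.
RewriteCond %{HTTP_USER_AGENT} BoardReader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Semrush [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twingly [NC,OR]
# The final condition closes the chain, so it has no OR flag.
RewriteCond %{HTTP_USER_AGENT} magpie-crawler [NC]

# Answer matching requests with 403 Forbidden and leave the URL untouched.
RewriteRule ^ - [F]
```

If a bot still gets through with a block like this in place, the user agent it actually sends may differ from the one in the logs, which is where IP banning comes in.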

Buzzbot scrapes content, reads but gets an error for Robots.txt, bot is “Buzzbot/1.0 (Buzzbot;;”

Facebook has an image scraper bot “facebookexternalhit/1.1 (+”, does not read robots.txt, fools my htaccess. They scrape so much and use so many IPs that I banned their whole 173.252 range. – – – – – a company called LINE fakes the facebook user agent!
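
Banning a whole range like 173.252 can be written with a partial IP address. This is a sketch assuming Apache 2.2-style access control, which matches the “deny from” lines used in this post. The commented-out 2.4 form is an assumption about newer servers and is shown only for comparison.

```apache
# Apache 2.2 style (mod_authz_host): a partial address covers the whole range.
Order Allow,Deny
Allow from all
Deny from 173.252

# Apache 2.4 style (mod_authz_core), commented out. Use one style, not both.
# <RequireAll>
#     Require all granted
#     Require not ip 173.252
# </RequireAll>
```

A range this wide also blocks every legitimate visitor and link preview coming from it, so it is a blunt tool.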

Flamingo_SearchEngine, a feed scraper bot is “Flamingo_SearchEngine (+”
I was about to ban this search engine because when I looked them up there was no search and no info online. Looking up their IPs I find that they are hosted by Amazon. They do not read robots.txt – –

GroupHigh scrapes feeds and content, does not read the Robots.txt. They scrape multiple times a day, eating up bandwidth. Their bot “Mozilla/5.0 (compatible; GroupHigh/1.0; +”

ips-agent is a content scraper, reads robots.txt, bot is “Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:14.0; ips-agent) Gecko/20100101 Firefox/14.0.1”

Kik is an image scraper, does not read robots.txt, fools my htaccess, bot is “Kik/ (Android 5.1.1) Mozilla/5.0 (Linux; Android 5.1.1; LG-H901 Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/48.0.2564.106 Mobile Safari/537.36”. Kikbot’s history shows that they are using single IPs and not a range, making it harder for me to kill. On the plus side, they use a combo of US and Canadian ISPs; on the negative side, they are a pain in the ass.
Cogeco –
St. Clair County Regional Educational –
Sprint –
Cox Comm –
Charter – –
Cellco –
Charter –
Comcast –
AT&T –
Consumer_DSL –
Service Provider –
T-Mobile –
Telus –

magpie-crawler/1.1 (U; Linux amd64; en-GB; + Does not read robots.txt. There is a second bot associated with the same IP, 209.235.220.230, that has the name “robots”. Banning this IP removes both bots.
magpie-crawler/1.1 (U; Linux amd64; en-GB; +
htaccess rule: ^.*magpie\-crawler, ^.*robots

NetTrekker is an image scraper, does not read Robots.txt, fools my htaccess, bot is “Mozilla/5.0 (compatible;netTrekker-Link-Checker-AAMS/1.0)”. It always reads the same image.

OrangeBot is a scraper that reads my Robots.txt but ignores it.
Mozilla/5.0 (compatible; OrangeBot/2.0; – –

Qwant is a content and tag scraper, does not read robots.txt, fools my htaccess, bot is “Mozilla/5.0 (compatible; Qwantify/2.2w; +*” –

Seznam CZ is an image scraper. It reads but ignores my Robots.txt and fools my htaccess. Bot is “Mozilla/5.0 (compatible; SeznamBot/3.2; +” /24

SPUTNIK does not read Robots.txt and tricks my htaccess
Mozilla/5.0 (compatible; SputnikImageBot/2.3; + –

Sysomos scans for my feeds multiple times a day, every day, but my content does not change that often. It is simply too much. They are kind enough to use only a single IP, so I don’t need to ban the range. They don’t read the robots.txt, and bypass my htaccess rule. Bot is “Mozilla/5.0 (compatible; Sysomos/1.0; +; Sysomos)”
Savvis –

Uptimebot scrapes the HEAD daily. I don’t really know what it does. Bot is “Mozilla/5.0 (compatible; Uptimebot/1.0; +”, does not read Robots.txt, fools my htaccess.

Photon is from Automattic, of WordPress fame.
They do not read the robots.txt, and can somehow fool my htaccess. Multiple IP addresses, so I had to ban the whole shebang. They were constantly scraping images.
Photon/1.0

Mozilla/5.0 (compatible; SemrushBot/1~bl; +
Does not read robots.txt
Mozilla/5.0 (compatible; SemrushBot/1~bl; +
htaccess rule: ^.*Semrush [NC,OR]

Seznam reads my robots.txt but ignores it. It’s a combo image scraper and feed bot “Mozilla/5.0 (compatible; SeznamBot/3.2; +”. It fools my htaccess.
htaccess rule: ^.*SeznamBot [OR] –

Twingly Recon retrieves feeds only, does not read robots.txt. It fools my htaccess. It is tricky because it does not have the common bot, spider, or crawler name.
Mozilla/5.0 (compatible; Twingly Recon;
htaccess rule: ^.*Twingly [NC,OR] – 95
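
When a mod_rewrite rule does not catch a bot like this, one alternative is to tag the request with mod_setenvif and deny on the tag. This is a sketch assuming mod_setenvif is loaded and Apache 2.2-style access control is in use. The variable name bad_bot is made up for illustration.

```apache
# Tag requests whose User-Agent contains one of these strings (case-insensitive).
SetEnvIfNoCase User-Agent "Twingly" bad_bot
SetEnvIfNoCase User-Agent "SeznamBot" bad_bot

# Deny every request that carries the tag; everyone else is allowed.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

This works independently of RewriteEngine, so it can help narrow down whether the problem is the pattern or the rewrite setup.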

unrulymedia scrapes my site every couple of days, does not read Robots.txt, fools my htaccess. Bot is “”

Mozilla/5.0 (compatible; XoviBot/2.0; +
Could not stop it with htaccess, does not read Robots.txt, so I had to ban the IPs.
Mozilla/5.0 (compatible; XoviBot/2.0; +

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon; .NET CLR 1.1.4322)
I thought .NET CLR was a bot, but it turns out I got heavy comment spam from Colocrossing/Hudson Valley Host. Banned a huge number of their IP ranges. They hit me with 45 comment spam messages in one evening.

Anonymous Bots
Accenture, anonymous scraper, does not read robots.txt –,,

Alisoft is scraping for images using bot “Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/WK10171 Safari/8536.25” –

Amazon is coming in with a scraper bot “Ruby”. Why so anonymous? It turns out that Microsoft, Google and others have anonymous bots as well. Amazon also has a scraper bot “-”. Odd that “Mozilla/5.0 (compatible; DomainAppender /1.0; +” and “Mozilla/5.0 (compatible; Cliqzbot/1.0 +” are also using Amazon IPs. I’ve emailed them and was told it was resolved, but the next day I still had these bots scraping my site, so I banned their IPs. – POSTed a spam comment

Bank of America is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”. Why? I’m a Canadian and am not even in their jurisdiction. They do not read robots.txt, and even if they did they have no unique User Agent name –,,,,
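
A generic string like “Mozilla/4.0 (compatible;)” can still be matched if the pattern is anchored at both ends. This is a sketch assuming mod_rewrite is enabled. The quotes are needed because the pattern contains a space.

```apache
RewriteEngine On

# Anchored with ^ and $ so longer strings such as
# "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" are not caught.
# The dot and parentheses are escaped because they are regex metacharacters.
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible;\)$"
RewriteRule ^ - [F]
```

Because the string is not unique to one visitor, this rule may also turn away legitimate traffic that happens to send it.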

Bank of Montreal is running an anon scraper bot “Mozilla/4.0 (compatible;)” –

a bot from BITSRO, South Korea, is an anonymous content scraper “Mozilla/5.0 (compatible; MSIE or Firefox mutant;) Daum 4.1” –

Government of Canada Shared Services is scraping my site for images?!? Bot is “Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)”. What the… –

CHINANET Guangdong, anon scraper using bot “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36” –

China TieTong Telecommunications has a massive scraper, with a wide variety of bot names, too numerous to document. They also POST. – (Linux; U; Android 4.2.2; Lenovo A369i Build/JDQ39)

Dalvik-based anon scraper bot “Dalvik/1.6.0 (Linux; U; Android 4.2.2; Lenovo A369i Build/JDQ39)”, different builds, all coming in with a referrer of “-”.
AT&T –
Albanian Mobile –

Drake Holdings has a scraper that takes up a huge amount of bandwidth on my site. Their bot is an anonymous “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)”, and was hard to find., a power management company, is scraping my site with an anon bot “Mozilla/4.0 (compatible;)” –

Fibrenoir is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”. –

HETZNER has a bot that changes names: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0, Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0, Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36, Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0, and also change their referrers, which are all bogus as well. They are only looking for awstats tags, which is odd.

IQ-PL is scraping for Semrush tags, using bot “Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv: Gecko/20101026 Firefox/3.6.12 ( .NET CLR 3.5.30729; .NET CLR 4.0.20506)”

Kintiskton LLC is scraping with the anon bot “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)” –

Leaseweb is scraping with an anon bot called “Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)” and a bot “-” – –

Unit Permodenan Tadbiran dan Perancangan Pengurusan Malaysia Government of Malaysia comment spamming me? –

OVH is scraping for tags using bot “Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0” –

PrivateCloud is scraping my feeds for comments, with a user agent of “-”, and bypasses my htaccess. –
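
A user agent of “-” in the access log usually means the request sent no User-Agent header at all, so a rule looking for a name has nothing to match. This is a sketch of a rule for that case, assuming mod_rewrite is enabled. It matches an empty user agent as well as a literal dash.

```apache
RewriteEngine On

# ^-?$ matches an empty User-Agent or one that is exactly "-".
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^ - [F]
```

Some feed readers and monitoring tools also omit the header, so it is worth watching the logs after adding this.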

Network Technology Experiment Validation and Demonstration Center
FIT Center, Tsinghua University –

RO-INFINITY is an anonymous scraper, bot is “Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60”

Romtel has an anon scraper bot “Java/1.6.0_04”, does not read robots.txt –

RO-RESIDENTIAL is scraping with an anonymous bot “Java/1.6.0_04” –
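
User agents like “Java/1.6.0_04” or “Ruby” are the defaults sent by HTTP libraries, and they can be matched by their prefix rather than by a bot name. This is a sketch assuming mod_rewrite is enabled. The choice to block every Java version is an assumption, not something tested on this site.

```apache
RewriteEngine On

# Any user agent that starts with "Java/", whatever the version.
RewriteCond %{HTTP_USER_AGENT} ^Java/ [OR]
# A user agent that is exactly "Ruby".
RewriteCond %{HTTP_USER_AGENT} ^Ruby$
RewriteRule ^ - [F]
```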

Royal St. George’s College is continually scraping a specific image, using bot “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4” –

Savvis is scraping my site with a rogue bot called “Mozilla/4.0 (compatible;)”, for images. –

Shoppers Drug Mart has a scraper bot “Mozilla/4.0 (compatible;)”. What are they doing on my site?

SCHLUND is scraping for comments and images, anonymously with a bot “-” –

TD Bank has an anon bot “Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.3)” –

TELEVIDEO NOVELDA has an anon scraper bot “Java/1.8.0_40”, does not read Robots.txt –

Test Spider 0.2, no purpose, just crawling around scraping content
Amazon –

United States Army Information Systems Command (USAISC) is scraping my site with bot called “Mozilla/4.0 (compatible;)”, essentially anonymous. Does not read robots.txt, no unique User Agent IP. – – –

Virginia School for the Deaf and the Blind: VSDB is scraping my site for images

WORLDSTREAM is scraping for images with a bot “Mozilla/5.0 (iPad; CPU OS 7_0_2 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11A501” and “Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3” and “Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30” –

YOMURA Corporation is scraping my site with an anonymous bot “Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20070312 Firefox/” –,

ZAGREBACKA BANKA has a content scraper anonymous bot “Mozilla/4.0 (compatible;)” and “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)” –

ZOMRO is doing something similar to Hetzner, looking for awstats tags. Their anonymous bot is “Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36”. They download the same info 3 times per night, very wasteful.

Search Engines but Anonymous
Microsoft using bot “Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)” – – – –

Google is masquerading as host name “”, bot called “Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0; InfoPath.2; MSOffice 12)”. What the hey?!?
