Blocking web crawlers. ModSecurity or in vhost?

DennisMidjord

Well-Known Member
Sep 27, 2016
361
80
78
Denmark
cPanel Access Level
Root Administrator
Lately, a lot of our customers' websites have been crawled by bots. Yesterday, a single website was crawled by four different bots at the same time, all of them bad bots.
We want to block these bots, but I'm wondering which method is best performance-wise, or whether it really doesn't matter.

So, does anyone have any recommendations for blocking bad bots?
 

Handssler Lopez

Well-Known Member
Apr 30, 2019
90
34
18
Guatemala
cPanel Access Level
Root Administrator
The best recommendation would be robots.txt, so you block the bad ones and allow the good ones.

* Even if you configure it, the bot decides whether or not to follow the instructions. Blocking robots through Apache is not very convenient; there can be problems with advertising campaigns, site validators, etc.
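
For reference, a minimal per-site robots.txt along those lines might look like this (MJ12bot and AhrefsBot are just examples of crawlers you would disallow):

Code:
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
The empty Disallow in the catch-all group leaves everything open to the remaining (good) bots.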
 

DennisMidjord

Well-Known Member
Sep 27, 2016
361
80
78
Denmark
cPanel Access Level
Root Administrator
@Handssler Lopez I'm only talking about bad bots - not bots in general. Blocking access by configuring robots.txt is not a viable solution because a) we need to do it for every website on all of our servers, and b) it makes no difference if the bot doesn't respect robots.txt. A lot of them don't.
 

ffeingol

Well-Known Member
PartnerNOC
Nov 9, 2001
943
423
363
cPanel Access Level
DataCenter Provider
We just use mod_security.

Example of rules (that we picked up somewhere)

Code:
SecRule REQUEST_HEADERS:User-Agent "@rx (?:AhrefsBot)" "msg:'AhrefsBot Spiderbot blocked',phase:1,log,id:7777771,t:none,block,status:403"
SecRule REQUEST_HEADERS:User-Agent "@rx (?:MJ12bot)" "msg:'MJ12bot Spiderbot blocked',phase:1,log,id:7777772,t:none,block,status:403"
SecRule REQUEST_HEADERS:User-Agent "@rx (?:Yandex)" "msg:'Yandex Spiderbot blocked',phase:1,log,id:7777773,t:none,block,status:403"
SecRule REQUEST_HEADERS:User-Agent "@rx (?:SeznamBot)" "msg:'SeznamBot Spiderbot blocked',phase:1,log,id:7777774,t:none,block,status:403"
We grab the User-Agent from the Apache logs and then just plug it in. When you add another rule, you just need to increment the ID so you don't end up with duplicates.
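
For example, if you needed to block one more bot, the next rule would follow the same pattern with the next ID (the bot name below is just a placeholder):

Code:
SecRule REQUEST_HEADERS:User-Agent "@rx (?:SomeBadBot)" "msg:'SomeBadBot Spiderbot blocked',phase:1,log,id:7777775,t:none,block,status:403"
The id just has to be unique across all loaded ModSecurity rules; Apache will refuse to start if two rules share one.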
 

nootkan

Well-Known Member
Oct 25, 2006
170
12
168
There is a pretty good plugin that handles bad bots very well, too. Just google "stopbadbots". The developer makes a WordPress plugin and a standalone version; I use both effectively.
 

masterross

Well-Known Member
Apr 7, 2004
73
5
158
ffeingol said:
We just use mod_security. [rules quoted above]
Hi,

These rules work, but if a customer has disabled ModSecurity, they won't apply to their account, right?
Is there a way to do something similar with CSF?
 

masterross

Well-Known Member
Apr 7, 2004
73
5
158
Actually, these custom ModSec rules trigger CSF too:

Code:
Time: Wed Mar 9 22:29:40 2022 +0200
IP: 95.108.213.9 (RU/Russia/95-108-213-9.spider.yandex.com)
Failures: 4 (mod_security-custom)
Interval: 3600 seconds
Blocked: Permanent Block [LF_CUSTOMTRIGGER]
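
The "mod_security-custom" trigger name suggests a custom match in CSF's /usr/local/csf/bin/regex.custom.pm that counts the ModSecurity 403s from the Apache error log. A rough sketch of such a rule is below; the log file, regex, and return values are assumptions for illustration, not the exact rule behind the report above:

Code:
# Sketch only: goes in the custom_line() sub of /usr/local/csf/bin/regex.custom.pm.
# Assumes CUSTOM1_LOG in csf.conf points at the Apache error log,
# e.g. /usr/local/apache/logs/error_log on a cPanel box.
if (($globlogs{CUSTOM1_LOG}{$lgfile}) and
    ($line =~ /\[client ([\d\.]+)(?::\d+)?\].*ModSecurity:.*Spiderbot blocked/)) {
    # text, offending IP, trigger name, failures before block, ports, block time (0 = permanent)
    return ("Bad bot blocked by ModSecurity from", $1, "mod_security-custom", "4", "80,443", "0");
}
lfd would then count the matches from the same IP within LF_INTERVAL before issuing a block like the one shown above.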