mod_security - how to allow bots like googlebot? It was blocked.

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
Just got the following in an email :

Time: Sun Oct 21 10:31:52 2012 -0400
IP: 66.249.73.70 (US/United States/crawl-66-249-73-70.googlebot.com)
Failures: 5 (mod_security)
Interval: 300 seconds
Blocked: Permanent Block

[Sun Oct 21 10:28:28 2012] [error] [client 66.249.73.70] ModSecurity: Access denied with code 501 (phase 2). Match of "rx ^((?:(?:POS|GE)T|OPTIONS|HEAD))$" against "REQUEST_METHOD" required. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "38"] [id "960032"] [msg "Method is not allowed by policy"] [severity "CRITICAL"] [tag "POLICY/METHOD_NOT_ALLOWED"] [hostname "server.example.com"] [uri "/"] [unique_id "UIQGjGB-guIAAAwRPyQAAAAB"]

Granted this came to me because I have CSF installed on my server. I removed the block on the ip in CSF. Now, I have two questions :

1 - Since I removed the block in CSF, there is nothing I need to unblock in mod_security, right? It is my understanding that while it blocks requests based on its rules, it does not implement a 'permanent' block per se.

2 - I don't want this to happen again and would like to allow all 'bots' without them being blocked in any form. I found the following on another site :

# Allow GoogleBot by user-agent 10-21-2012
SecRule HTTP_USER_AGENT "Google" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot" nolog,allow
SecRule HTTP_USER_AGENT "GoogleBot" nolog,allow
SecRule HTTP_USER_AGENT "googlebot" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot-Image" nolog,allow
SecRule HTTP_USER_AGENT "AdsBot-Google" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot-Image/1.0? nolog,allow
SecRule HTTP_USER_AGENT "Googlebot/2.1? nolog,allow
SecRule HTTP_USER_AGENT "Googlebot/Test" nolog,allow
SecRule HTTP_USER_AGENT "Mediapartners-Google/2.1? nolog,allow
SecRule HTTP_USER_AGENT "Mediapartners-Google*" nolog,allow
SecRule HTTP_USER_AGENT "msnbot" nolog,allow

Should I add this at the top of my config through WHM? I also read that using the user-agent method is not good since it can be faked. So, with that said, what is the best way to do this? What are the lines for other popular bots so they are not blocked as well?

I also found information on something called gotroot, but apparently it was not meant for WHM/cPanel? I would like something I can set and forget that gets updates automatically, similar to the default rules.
 

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
I should add that this is the typical line reported in the error log :

[Sun Oct 21 22:55:58 2012] [error] [client 66.249.73.70] File does not exist: /usr/local/apache/htdocs/501.shtml

Why is this error not being handled properly by Apache? This appears to be why mod_security is causing a problem... any thoughts?
 

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
I did some testing and I believe this is related to https somehow... if I enter any page of any of my sites over https I get a connection error in the browser - no error page is shown. If I then view my Apache error log I get something along the lines of this :

ModSecurity: Access denied with code 501 (phase 2). Match of "rx ^((?:(?:POS|GE)T|OPTIONS|HEAD))$" against "REQUEST_METHOD" required. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "38"] [id "960032"]

If I do this a few times then my IP gets blocked by CSF. So, how do I fix the above problem? I believe this is why Googlebot is being blocked - it is trying to crawl https pages.

Secondly, shouldn't a non-existent https page be showing a regular error page like a 501 or something rather than a connection error?
 

Infopro

Well-Known Member
May 20, 2003
17,113
507
613
Pennsylvania
cPanel Access Level
Root Administrator
Twitter
Just got the following in an email :

Time: Sun Oct 21 10:31:52 2012 -0400
IP: 66.249.73.70 (US/United States/crawl-66-249-73-70.googlebot.com)
Failures: 5 (mod_security)
Interval: 300 seconds
Blocked: Permanent Block

[Sun Oct 21 10:28:28 2012] [error] [client 66.249.73.70] ModSecurity: Access denied with code 501 (phase 2). Match of "rx ^((?:(?:POS|GE)T|OPTIONS|HEAD))$" against "REQUEST_METHOD" required. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "38"] [id "960032"] [msg "Method is not allowed by policy"] [severity "CRITICAL"] [tag "POLICY/METHOD_NOT_ALLOWED"] [hostname "server.example.com"] [uri "/"] [unique_id "UIQGjGB-guIAAAwRPyQAAAAB"]

Granted this came to me because I have CSF installed on my server. I removed the block on the ip in CSF. Now, I have two questions :

1 - Since I removed the block in CSF, there is nothing I need to unblock in mod_security, right? It is my understanding that while it blocks requests based on its rules, it does not implement a 'permanent' block per se.

2 - I don't want this to happen again and would like to allow all 'bots' without them being blocked in any form. I found the following on another site :

# Allow GoogleBot by user-agent 10-21-2012
SecRule HTTP_USER_AGENT "Google" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot" nolog,allow
SecRule HTTP_USER_AGENT "GoogleBot" nolog,allow
SecRule HTTP_USER_AGENT "googlebot" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot-Image" nolog,allow
SecRule HTTP_USER_AGENT "AdsBot-Google" nolog,allow
SecRule HTTP_USER_AGENT "Googlebot-Image/1.0? nolog,allow
SecRule HTTP_USER_AGENT "Googlebot/2.1? nolog,allow
SecRule HTTP_USER_AGENT "Googlebot/Test" nolog,allow
SecRule HTTP_USER_AGENT "Mediapartners-Google/2.1? nolog,allow
SecRule HTTP_USER_AGENT "Mediapartners-Google*" nolog,allow
SecRule HTTP_USER_AGENT "msnbot" nolog,allow

Should I add this at the top of my config through WHM? I also read that using the user-agent method is not good since it can be faked. So, with that said, what is the best way to do this? What are the lines for other popular bots so they are not blocked as well?

I also found information on something called gotroot, but apparently it was not meant for WHM/cPanel? I would like something I can set and forget that gets updates automatically, similar to the default rules.
You should do some more homework on mod_security.

If I set my browser user agent to be seen as Googlebot and then come and try to hack your site, you've allowed me in by using the piece of code you posted above.

I should add that this is the typical line reported in the error log :

[Sun Oct 21 22:55:58 2012] [error] [client 66.249.73.70] File does not exist: /usr/local/apache/htdocs/501.shtml

Why is this error not being handled properly by Apache? This appears to be why mod_security is causing a problem... any thoughts?
Does that file exist? Assuming no. You should do some more homework on CSF. There are options for this. Example:

Code:
# This option will keep track of the number of "File does not exist" errors in
# HTACCESS_LOG. If the number of hits is more than LF_APACHE_404 in LF_INTERVAL
# seconds then the IP address will be blocked
#
# Care should be used with this option as it could generate many
# false-positives, especially Search Bots (use csf.rignore to ignore such bots)
# so only use this option if you know you are under this type of attack
#
# A sensible setting for this would be quite high, perhaps 200
#
# To disable set to "0"
LF_APACHE_404
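For example, something along these lines - the 200 is just the "quite high" figure from the comment above, and I believe csf.rignore takes reverse-DNS hostname suffixes, but check the comments at the top of that file on your own server:

Code:
# /etc/csf/csf.conf - keep the 404 tracking but with a high threshold
LF_APACHE_404 = "200"

# /etc/csf/csf.rignore - crawlers lfd should ignore (matched against rDNS, I believe)
.googlebot.com
# add the other search engine crawler domains you trust here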
I did some testing and I believe this is related to https somehow... if I enter any page of any of my sites over https I get a connection error in the browser - no error page is shown. If I then view my Apache error log I get something along the lines of this :

ModSecurity: Access denied with code 501 (phase 2). Match of "rx ^((?:(?:POS|GE)T|OPTIONS|HEAD))$" against "REQUEST_METHOD" required. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "38"] [id "960032"]

If I do this a few times then my IP gets blocked by CSF. So, how do I fix the above problem? I believe this is why Googlebot is being blocked - it is trying to crawl https pages.

Secondly, shouldn't a non-existent https page be showing a regular error page like a 501 or something rather than a connection error?
Searchbots are scanning your site; if they hit old links to pages that are no longer there, they need to be fed an error page. If you have no error pages, they will hit the link again, looking for the file again.
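If the error documents your httpd.conf points at don't exist (like that missing 501.shtml), Apache logs "File does not exist" and falls back to its built-in plain error response. A rough sketch of the kind of mapping involved - the exact file names and codes come from your own cPanel-generated config, so treat these as examples only:

Code:
# Apache ErrorDocument mapping; the target files must actually exist,
# otherwise you get the "File does not exist" log entries quoted above
ErrorDocument 404 /404.shtml
ErrorDocument 501 /501.shtml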

You are able to have more control over your mod_sec blocking with a tool like this:
ConfigServer ModSecurity Control
 

Igal Incapsula

Registered
Oct 22, 2012
1
0
1
cPanel Access Level
DataCenter Provider
If I set my browser user agent to be seen as Googlebot and then come and try to hack your site, you've allowed me in by using the piece of code you posted above.
A recently conducted "security research of Googlebot impersonation phenomena" (http://www.incapsula.com/the-incapsula-blog/item/369-was-that-really-a-google-bot-crawling-my-site) showed that 16% of all "Googlebot" visits were fake, and of those, 21% were malicious.

(Googlebot impersonation is also commonly used by SEO crawling tools that try to assess competition and want to "see" the site, just as Googlebot does)

To filter out fake Googlebot access attempts you should cross-verify IP ranges with user-agent data.

To do this, you can use the http://www.Botopedia.org IP validation tool to perform a reverse DNS lookup and "weed out" all irrelevant IPs.
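If you want to do the same check by hand, the usual approach is a reverse DNS lookup followed by a forward lookup to confirm the two agree. Using the crawler IP from the log quoted earlier in this thread (expected output shown as comments, for illustration):

Code:
# reverse lookup: a real Googlebot should resolve into googlebot.com (or google.com)
host 66.249.73.70
# -> 70.73.249.66.in-addr.arpa domain name pointer crawl-66-249-73-70.googlebot.com.

# forward lookup on that name: it should point back to the original IP
host crawl-66-249-73-70.googlebot.com
# -> crawl-66-249-73-70.googlebot.com has address 66.249.73.70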

Also, you should always check the IP before setting any restrictions.
One common mistake is to ban all Chinese IPs, by default.
This is a mistake because Googlebot will sometimes use Chinese IPs, and banning all access from China may lead to crawling errors.

GL
 
Last edited:

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
@InfoPro

Thanks for the detailed response. Yes, I saw the danger of user-agent in the config and was looking for another answer. I do have a few others though.

Regarding your comment about '/usr/local/apache/htdocs/501.shtml' existing or not... in this location I only have 400, 401, 403, 404, 500, and now 501. I was under the assumption that if an error occurred for which no page was available, Apache would use a 'default' of some type. Should I make files for all error codes in this location?

On top of that, why is no error page shown to me when trying to view a non-existent https page? It simply says there is a connection error instead. Is this normal... I would think an actual error page would be thrown.

Regarding CSF and LF_APACHE_404... that option has always been disabled and set to 0, so that is not the problem. The problem goes back to the above: if I try to visit a non-existent https page on any of my sites, it says there was a connection problem, no error page is shown, mod_sec records the error, and then CSF blocks after it happens x times.
 

Infopro

Well-Known Member
May 20, 2003
17,113
507
613
Pennsylvania
cPanel Access Level
Root Administrator
Twitter
I think the answer here is, there is no site at https:// somedomainwithoutdedicatedipandcert.com so that's not a "valid" URL that would generate an error. You might go over your CSF settings to make them work more like you want - allow the error a few more times before blocking, for example. If this is a problem that keeps happening, I would think there's a reason for it. Why is someone going to that https:// domain anyway? A random spider, for example, sure, but if users are visiting that URL often, what's sending them there?

If you want, you can modify mod_sec rules per domain using this tool easy enough:
ConfigServer ModSecurity Control

In your errors above we see the rule is: 960032

So you'd add that in the config using that tool, for the domain(s) affected by this issue.
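Under the hood that is essentially a per-domain whitelist of the rule ID - I believe the tool writes out something like the following for you, shown here just so it's clear what it does:

Code:
# disable the "Method is not allowed by policy" rule for the affected domain(s)
SecRuleRemoveById 960032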

HTH somehow. :)
 

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
I think the answer here is, there is no site at https:// somedomainwithoutdedicatedipandcert.com so that's not a "valid" URL that would generate an error. You might go over your CSF settings to make them work more like you want - allow the error a few more times before blocking, for example. If this is a problem that keeps happening, I would think there's a reason for it. Why is someone going to that https:// domain anyway? A random spider, for example, sure, but if users are visiting that URL often, what's sending them there?

If you want, you can modify mod_sec rules per domain using this tool easy enough:
ConfigServer ModSecurity Control

In your errors above we see the rule is: 960032

So you'd add that in the config using that tool, for the domain(s) affected by this issue.

HTH somehow. :)
Will look over everything and see what I can do. I am still curious as to why the server default 501 page was trying to be accessed... especially from Googlebot and one of its IPs... unless that was faked somehow. Either way, if it was faked, they succeeded in having that IP banned so my sites could not be crawled.

I think the easiest solution right now is for CSF not to block when mod_security denies a request for an invalid https URL.

To answer your question, I did host https pages at one point, but have now removed all files from the site. Either way, that does not prevent this from happening on any site of mine... I'm not concerned with 'people' trying https, as that would probably never happen on a site, but I am concerned with spiders being blocked.
 

morrow95

Well-Known Member
Oct 8, 2006
161
8
168
Okay, for the moment I have changed CSF so it no longer blocks IPs that have triggered mod_sec x times in y timeframe; however, I am still having an issue with https being used.
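(For reference, I believe the setting involved is the mod_security trigger level in /etc/csf/csf.conf, roughly like this, where "0" turns that blocking off:)

Code:
# /etc/csf/csf.conf - number of mod_security incidents before lfd blocks an IP
LF_MODSEC = "0"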

As an example, using Firefox as a browser, if I go to any of my websites using https I get :

Server Connection Failed. An error occurred during a connection to IANA — Example domains. SSL received a record that exceeded the maximum permissible length. (Error code: ssl_error_rx_record_too_long)

This then triggers mod_security and gives :

[Tue Oct 23 00:47:21 2012] [error] [client 99.30.160.94] ModSecurity: Access denied with code 501 (phase 2). Match of "rx ^((?:(?:POS|GE)T|OPTIONS|HEAD))$" against "REQUEST_METHOD" required. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "38"] [id "960032"] [msg "Method is not allowed by policy"] [severity "CRITICAL"] [tag "POLICY/METHOD_NOT_ALLOWED"] [hostname "exampleserver.com"] [uri "/"] [unique_id "UIYhWWB-guIAAHl3FKcAAAAA"]

While I do not have SSL on these sites, this certainly cannot be normal... I read up on ssl_error_rx_record_too_long here on the forum, but most of the posts were from people who actually had SSL on the site. I also read that transferring accounts to a new server using pkgacct(?) can cause this. I did have my accounts transferred to a new server and that tool was used.

All in all, this can't be the normal response in this situation, and I would think mod_sec wouldn't be triggered if things were 'correct'? Anyone have any ideas? I would like to turn my CSF filter back on for mod_sec, but since all it takes at the moment is visiting any of my sites over https to trigger it, that isn't going to work unless something changes.
 

GIANT_CRAB

Well-Known Member
Mar 23, 2012
89
0
56
cPanel Access Level
Root Administrator
>If I set my browser user agent to be seen as Googlebot and then come and try to hack your site, you've allowed me in by using the piece of code you posted above.

I've been attacked by bots that do that too.

I wouldn't trade security for SEO.