Here is a note we just sent to a hosted customer who complained that Googlebot is blocked from crawling their site. I'm inviting input from anyone here; any ideas on how to resolve this dilemma would be appreciated:
Dear Customer,
Lately, plenty of bad bots masquerade as Googlebot. Rather than searching for content to list in a search engine, these bots are looking for scripts that can be used to break into accounts, email addresses to harvest (for spam lists), private customer info, etc. Because of this we have built routines to kick out the bad bots.
The problem arises when the real Googlebot gets caught in our security, specifically when its behavior resembles that of a bad bot, as is often the case.
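One way to cut down on those false positives, assuming we can afford a DNS lookup for each suspicious address, is the reverse-then-forward DNS check Google recommends for verifying Googlebot: reverse-resolve the IP, make sure the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it points back to the same IP. A minimal Python sketch (the sample IP is only an illustration):

    import socket

    def is_real_googlebot(ip):
        """True if the IP reverse-resolves to a googlebot.com/google.com host
        whose forward lookup includes the same IP."""
        try:
            host = socket.gethostbyaddr(ip)[0]                 # reverse DNS
        except socket.herror:
            return False
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]      # forward confirmation
        except socket.gaierror:
            return False

    # Example: vet an address pulled from the access log before blocking it.
    print(is_real_googlebot('66.249.66.1'))

Anything that fails the check can be blocked with a clear conscience; anything that passes really is Google.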
Another huge Googlebot problem: I have seen Googlebot dramatically spike the server load (resulting in a near server crash) when one of our hosted customers puts up a Google Sitemap that is poorly thought out. When I say "poorly thought out" I am referring to sitemaps that direct Googlebot to list/scan something like a WordPress home page that in turn has a "ton" of dynamically generated page links. Googlebot then goes about scanning hundreds and hundreds of pages EXTREMELY fast, and each pull taxes the MySQL engine on the server. This can also result in recursive loops during the Googlebot scan cycle, all of which leaves the server virtually DoSed (i.e. a virtual denial-of-service attack).
When that happens, we have no choice but to block the "attacking" Googlebot, or else we risk having services go off-line for everyone on the server.
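For what it's worth, the kind of spike that forces our hand is easy to spot in the raw Apache logs. Below is a minimal Python sketch of that detection, not our actual anti-DoS code; the log path, the time window, and the threshold are placeholders:

    import re
    from collections import Counter

    LOG = '/usr/local/apache/logs/access_log'   # placeholder path
    THRESHOLD = 120                             # hits per minute that we treat as a spike

    # Apache combined log, e.g.:
    # 66.249.66.1 - - [10/Oct/2006:13:55:36 -0700] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1 ..."
    line_re = re.compile(r'^(\S+) \S+ \S+ \[([^:]+:\d+:\d+):\d+ [^\]]+\].*"([^"]*)"$')

    per_minute = Counter()
    with open(LOG) as log:
        for line in log:
            m = line_re.match(line.rstrip())
            if not m or 'Googlebot' not in m.group(3):
                continue                          # only count hits claiming to be Googlebot
            ip, minute = m.group(1), m.group(2)   # minute = date plus hour:minute
            per_minute[(ip, minute)] += 1

    for (ip, minute), hits in sorted(per_minute.items()):
        if hits > THRESHOLD:
            print(f'{ip} claimed to be Googlebot and made {hits} hits during {minute}')

Combined with the DNS check above, this at least tells us whether the traffic pounding the box is the real Googlebot following a bad sitemap or an impostor.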
This is not just a problem for our servers; here are some examples of others with the same issue:
http://www.webmasterworld.com/forum89/5142.htm
EXCERPT - I was pounded by the mediapartners-googlebot for about 24 hours, ending last night.
------------------
http://www.xoops.org/modules/news/article.php?storyid=1923
EXCERPT - ...are you sure that its not another search engine spider crawling the entire site and creating a cached system as google... this web spiders and crawlers can actually bring down entire site too...
Last time my mailing list was stuck by googlebot, unfortunately i deleted the robots.txt and google, msn, alexa went inside... took away wooping 5gb of bandwith making in 5000+ hits. in less than 1 month. bring down my mailing list.
------------------
http://groups.google.com/group/Goog...read/thread/fce150faafabfdd4/77e020607cf418b2
EXCERPT - I made the fatal mistake of allowing the crawl rate "expire" from slow to normal a few days ago. The 'normal' crawl started 15 hours ago and took the site down and continues to do so every time I turn it on
One thing that I have seen cause the GoogleBot load down a server is that sometimes the bot will get lost in circular URL loops where multiple URLs actually point to the same page. This most often occurs with dynamically generated pages
------------------
Slowly but surely Google seems to be addressing the concerns of hosting companies and ISPs in this regard, but it is still very much up to the webmaster to make (very) informed use of the sitemap technology. Currently, for example, you can go to Google's webmaster tools site and slow the crawl rate down, but again, that has to be done manually. So as a hosting provider we are left with these choices, none of which are really very good:
-- Block bots that spike the load and thereby endanger service uptime, i.e. all bots that threaten the server. (Some of our anti-DoS systems react very fast but will only block for a short period of time.)
-- Permit (never block) all of Googlebot's incoming IPs. This is of course impractical, as it would lead to service disruptions throughout the day due to the issue described above with poorly thought out sitemaps.
-- Educate, demand, and enforce policies requiring that anyone putting up a Google Sitemap do so only under strict guidelines, etc. This would also involve scanning for "bad" sitemaps and removing those that do not meet the guidelines (see the sketch after this list for the kind of check that implies). The problem with this is that it would take a very large staff working 24/7 on just this issue, and we would have to dramatically raise everyone's monthly and annual hosting fees to pay for it.
-------------... etc.
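To give an idea of what that third option would actually involve, here is a minimal sketch of an automated sitemap sanity check; the limits and the "query string means a dynamically generated page" rule are assumptions on my part, not an existing policy, and the sitemap URL is hypothetical:

    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    MAX_URLS = 500       # placeholder limit on total sitemap size
    MAX_DYNAMIC = 50     # placeholder limit on query-string (dynamic) URLs

    def check_sitemap(url):
        """Fetch a sitemap and report reasons it looks likely to hammer the server."""
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        # <loc> elements live in the standard sitemap namespace.
        ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        locs = [e.text for e in root.findall('.//sm:loc', ns)]
        dynamic = [u for u in locs if urlparse(u).query]
        problems = []
        if len(locs) > MAX_URLS:
            problems.append(f'{len(locs)} URLs (limit {MAX_URLS})')
        if len(dynamic) > MAX_DYNAMIC:
            problems.append(f'{len(dynamic)} dynamically generated URLs (limit {MAX_DYNAMIC})')
        return problems

    # Hypothetical customer sitemap:
    print(check_sitemap('http://example.com/sitemap.xml'))

Even with something like this running, somebody still has to chase down every flagged customer, which is where the staffing cost comes from.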
Solutions, anyone?
And damn!!! I wish cPanel would get on it and make Apache 2 compatible and available in the Release version!!! Apache 2 can handle load spikes about 100x better, or so I read.
I love cPanel, but right now cPanel = far more hours of work (monitoring loads, preventing spikes, manually blocking IPs, etc.) than should ever be necessary.