The Community Forums


Got the GoogleBot blues again. (Dammit!)

Discussion in 'General Discussion' started by jols, May 22, 2007.

  1. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Here is a note we just sent to a hosted customer who complained that the Googlebot is blocked from crawling their site. I'm inviting input from anyone here; any ideas on how to resolve this dilemma would be appreciated:

    Dear Customer,

    Lately, plenty of bad bots masquerade as the Googlebot. Rather than searching for content to list in a search engine, these bots are looking for scripts that can be used to break into accounts, email addresses to harvest (for spam lists), private customer info, etc. As such, we have built routines to kick out the bad bots.

    The problem comes in when the Googlebot itself gets caught by our security, specifically when its behavior is similar to a bad bot's, as is often the case.
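
    (For reference, the check Google itself recommends for telling a genuine Googlebot from an impostor is a reverse-and-forward DNS lookup: the IP should resolve to a googlebot.com or google.com hostname, and that hostname should resolve back to the same IP. A rough PHP sketch of the idea; the function name is just for illustration, not our production code:)

    Code:
    // Returns true if $ip appears to be a genuine Googlebot address.
    function is_real_googlebot($ip) {
        // Reverse DNS: real Googlebot IPs resolve to *.googlebot.com or *.google.com
        $host = gethostbyaddr($ip);
        if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
            return false;
        }
        // Forward-confirm: the hostname must resolve back to the same IP.
        return gethostbyname($host) === $ip;
    }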

    Another huge Googlebot problem: I have seen the Googlebot dramatically spike the server load (resulting in a near server crash) when one of our hosted customers puts up a Google Sitemap that is poorly thought out. When I say "poorly thought out" I am referring to sitemaps that direct the Googlebot to scan something like a WordPress home page that in turn has a "ton" of dynamically generated page links. The Googlebot then goes about scanning hundreds and hundreds of pages EXTREMELY fast, and each request hits the MySQL engine on the server. This can also result in recursive loops during the Googlebot scan cycle, all of which results in the server virtually being DoSed (i.e. a de facto Denial of Service attack).
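
    (Much of that dynamic-page thrashing can also be headed off in the site's own robots.txt, before a sitemap is even involved. A minimal sketch; the paths are only examples, and the * wildcard is understood by Googlebot but not necessarily by every crawler:)

    Code:
    User-agent: *
    # Keep crawlers away from query-string permutations of the same page
    Disallow: /*?
    # Example paths that only produce duplicate or low-value views
    Disallow: /wp-admin/
    Disallow: /trackback/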

    In this case, we have no other choice than to block the "attacking" Googlebot, or else we risk having services go offline for everyone on the server.

    This is not just a problem for our servers; here are some examples of others with this same issue:

    http://www.webmasterworld.com/forum89/5142.htm
    EXCERPT - I was pounded by the mediapartners-googlebot for about 24 hours, ending last night.
    ------------------

    http://www.xoops.org/modules/news/article.php?storyid=1923
    EXCERPT - ...are you sure that its not another search engine spider crawling the entire site and creating a cached system as google... this web spiders and crawlers can actually bring down entire site too...

    Last time my mailing list was stuck by googlebot, unfortunately i deleted the robots.txt and google, msn, alexa went inside... took away wooping 5gb of bandwith making in 5000+ hits. in less than 1 month. bring down my mailing list.
    ------------------

    http://groups.google.com/group/Goog...read/thread/fce150faafabfdd4/77e020607cf418b2

    EXCERPT - I made the fatal mistake of
    allowing the crawl rate "expire" from slow to normal a few days ago.
    The 'normal' crawl started 15 hours ago and took the site down and
    continues to do so every time I turn it on

    One thing that I have seen cause the GoogleBot load down a server is
    that sometimes the bot will get lost in circular URL loops where
    multiple URLs actually point to the same page. This most often occurs
    with dynamically generated page
    ------------------


    Slowly but surely Google seems to be addressing the concerns of hosting companies and ISPs in this regard, but it is still very much up to the webmaster to make (very) informed use of the sitemap technology. Currently, for example, you have the ability to go to the Google Webmaster Tools site and slow down the crawl rate, but again, this is something that has to be implemented manually. So as a hosting provider we are left with these choices, none of which is really very good:

    -- Block bots that spike the loads and thereby endanger service uptime, i.e. all bots that threaten the server. (Some of our anti-DoS systems react very fast but will only block for a short period of time.)

    -- Permit (never block) all of the Googlebot incoming IPs. This is of course impractical, as it would lead to service disruptions throughout the day due to the issue described above with poorly thought-out sitemaps.

    -- Educate, demand, and enforce policies requiring that anyone putting up a Google Sitemap do so only under strict guidelines, etc. This would also involve scanning for "bad" sitemaps and removing those that are not within the guidelines. The problem with this is that it would take a very large staff working 24/7 on just this issue, and we would thereby have to dramatically raise everyone's monthly and annual hosting fees.

    -------------... etc.

    Solutions anyone?

    And damn!!! I wish cPanel would get on it and make Apache 2 compatible and available in the Release version!!! Apache 2 can handle load spikes about 100x better, or so I read.

    I love cPanel, but right now cPanel = far more hours of work (monitoring loads, preventing spikes, manually blocking IPs, etc.) than should ever be necessary.
     
  2. Infopro

    Infopro cPanel Sr. Product Evangelist
    Staff Member

    Joined:
    May 20, 2003
    Messages:
    14,468
    Likes Received:
    196
    Trophy Points:
    63
    Location:
    Pennsylvania
    cPanel Access Level:
    Root Administrator
    I'm not sure Google is the problem. And you can block bad user agents with mod_security. If I were on your server and you mailed me this, I'd be on my way to a new server, ASAP.
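
    (For example, a ModSecurity 2.x rule along these lines; the user-agent patterns here are only placeholders, and newer ModSecurity releases also require an id action:)

    Code:
    # Deny requests whose User-Agent matches known bad bots (placeholder patterns)
    SecRule REQUEST_HEADERS:User-Agent "(?i:(libwww-perl|httrack|emailcollector))" \
        "phase:1,log,deny,status:403"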

    I would think a properly configured server should be able to handle this just fine.

    my 2
     
  3. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Thanks for your response:

    Yes, I have a mod_security list to block bad user agents, etc., but the Googlebot can absolutely DoS a server in specific situations.

    And yes, if cPanel ever gets their act together and makes their system (the stable, release version) compatible with Apache v2, then I am sure we would not have such a large issue with this.
     
  4. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Anyone know where I can find the current list of GoogleBot IPs?
     
  5. Infopro

    Infopro cPanel Sr. Product Evangelist
    Staff Member

    Joined:
    May 20, 2003
    Messages:
    14,468
    Likes Received:
    196
    Trophy Points:
    63
    Location:
    Pennsylvania
    cPanel Access Level:
    Root Administrator
  6. jsnape

    jsnape Well-Known Member

    Joined:
    Mar 11, 2002
    Messages:
    174
    Likes Received:
    0
    Trophy Points:
    16

    The solution is simple. The server is overloaded - get better hardware. If a visit by Google is crashing the server, it is a weak, junk server.
     
  7. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    The server is actually a dual-processor server with plenty of memory and speed. That's not the issue. The problem is more along the lines of the inherent weaknesses of Apache 1.3 and its apparent inability to handle more than a certain number of concurrent requests without really bogging down. But in any case, thanks for your response.
     
  8. nyjimbo

    nyjimbo Well-Known Member

    Joined:
    Jan 25, 2003
    Messages:
    1,125
    Likes Received:
    0
    Trophy Points:
    36
    Location:
    New York
    Not true. We have a client running an Interchange website that has 40,000 records in its database, and since IC creates dynamic pages, when the Googlebot crawls it sometimes pulls up all the pages, one by one, and slows the machine to a crawl. It's really the fault of Interchange, as it's a memory and CPU hog, but no matter how much muscle we throw at the problem, the same thing can happen if the bot runs through the entire database. So, using IPFW, we block the Googlebot's IPs from talking to that server.
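
    (Roughly along these lines on a FreeBSD box; the rule number and the 66.249.64.0/19 range are only examples of addresses Googlebot has been seen crawling from, not an authoritative list:)

    Code:
    # Drop web traffic from an example Googlebot range before it reaches Apache
    ipfw add 400 deny tcp from 66.249.64.0/19 to any 80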
     
  9. brianoz

    brianoz Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,146
    Likes Received:
    6
    Trophy Points:
    38
    Location:
    Melbourne, Australia
    cPanel Access Level:
    Root Administrator
    This might be naive, but can you create a sitemap telling Google which pages are new, and use that to avoid Googlebot trawling your database?

    I think I read somewhere above that at least one person had problems with Google recursing from the sitemap pages; I'm not sure if that's a generic problem or whether it was something wrong with the sitemap.
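
    (Something like this, for what it's worth: a minimal entry in the sitemaps.org format, with lastmod telling Google what is actually new; the URL and dates are obviously made up:)

    Code:
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/new-article.html</loc>
        <lastmod>2007-05-20</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>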
     
  10. psychodreams

    psychodreams Well-Known Member

    Joined:
    Apr 14, 2004
    Messages:
    84
    Likes Received:
    0
    Trophy Points:
    6
    :)

    You could also whitelist all of the search engine IPs in your firewall; it might take some time though.
    http://www.iplists.com/
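
    (If the server happens to run CSF, entries from a list like that can be added to the allow file; the IP below is just an example:)

    Code:
    # Whitelist an example crawler IP so the firewall never blocks it
    csf -a 66.249.66.1 "Googlebot"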
     
    #10 psychodreams, Jun 2, 2007
    Last edited: Jun 2, 2007
  11. wrender

    wrender Well-Known Member

    Joined:
    Sep 29, 2007
    Messages:
    69
    Likes Received:
    3
    Trophy Points:
    8
    I believe I am having a similar issue. It appears to be Google indexing the website. When I see Google start to index one of my clients' websites, the server load jumps from 0.2 to about 10-20. I watch the domlogs in /usr/local/apache/domlogs for that domain, and also run "top" to see that their PHP process is using 99% of resources at the time. Just bought this server: 2 x Westmere CPU X5650, 8GB DDR3-1333, High Performance XIV SAN. The server is amazingly fast until Google starts to index. Currently the only solution I have found is to block the IP when it causes a server overload.

    Has anyone found a solution for this? I don't want to just block Google's services, but since this is effectively a DoS attack I have no choice.
     
  12. Infopro

    Infopro cPanel Sr. Product Evangelist
    Staff Member

    Joined:
    May 20, 2003
    Messages:
    14,468
    Likes Received:
    196
    Trophy Points:
    63
    Location:
    Pennsylvania
    cPanel Access Level:
    Root Administrator
  13. wrender

    wrender Well-Known Member

    Joined:
    Sep 29, 2007
    Messages:
    69
    Likes Received:
    3
    Trophy Points:
    8
    Thanks, but that does not help. It is my clients' websites that are getting attacked, so I would like to protect them from the server side if possible.

    I found this as well, which seems interesting, but again it is a client-side thing that needs to be done for each website, I guess: All things web: Does Googlebot overload your server?
     
  14. Infopro

    Infopro cPanel Sr. Product Evangelist
    Staff Member

    Joined:
    May 20, 2003
    Messages:
    14,468
    Likes Received:
    196
    Trophy Points:
    63
    Location:
    Pennsylvania
    cPanel Access Level:
    Root Administrator
    I'd have to see the site to be able to comment further, I think. If you suspect Googlebots are attacking a site on your server, you might want to get in touch with Google.
     
  15. xanubi

    xanubi Well-Known Member

    Joined:
    Jun 28, 2006
    Messages:
    86
    Likes Received:
    1
    Trophy Points:
    8

    Use CloudLinux; that way you can limit the resources for every client and prevent that kind of "attack".
    Also, when you discover such a problem, contact your client and tell them to configure the Google crawler in Google Webmaster Tools to index more slowly.
     
  16. wrender

    wrender Well-Known Member

    Joined:
    Sep 29, 2007
    Messages:
    69
    Likes Received:
    3
    Trophy Points:
    8
    Thanks. I am not sure that changing Linux platforms is an option for us at this point. As well, configuring Google's crawl rate with Google Webmaster Tools only affects the crawl rate for 90 days, so this is not a solution.

    We did find that this script worked. I have also asked my client to look into what is wrong with their sitemap, but this is very difficult to do for a client who is not familiar with servers.

    We simply inserted this into the start of our index.php file. This script is based on the one from All things web: Does Googlebot overload your server?

    Code:
    // Throttle only requests that identify themselves as Googlebot.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'googlebot') !== false) {
        // Check the server's 1-minute load average.
        $loadavg = sys_getloadavg();
        $load = $loadavg[0];
        // If the box is already busy, ask the crawler to come back later.
        if ($load > 3) {
            header('HTTP/1.1 503 Service Temporarily Unavailable');
            header('Status: 503 Service Temporarily Unavailable');
            header('Retry-After: 3600'); // retry in one hour
            die();
        }
    }
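
    (One advantage of answering with a 503 plus Retry-After, rather than a hard firewall block, is that Google treats a 503 as a temporary condition: the crawler backs off and retries later instead of treating the pages as gone.)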
    
     
  17. xanubi

    xanubi Well-Known Member

    Joined:
    Jun 28, 2006
    Messages:
    86
    Likes Received:
    1
    Trophy Points:
    8
    Changing Linux platforms is easier than you think. You'll maintain everything, since CloudLinux converts CentOS without any problems, and their team is very professional. We've done this on all of our shared hosting servers with only one minute of downtime (a reboot). Yep, it's as simple as that.
     