The Community Forums

Interact with an entire community of cPanel & WHM users!

Good tricks for limiting bot crawls to the server?

Discussion in 'General Discussion' started by jols, Sep 18, 2009.

  1. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Anyone know any good tricks for limiting bot crawls to the entire server?

    Lately I am tempted to just start blocking MSN and Yahoo Slurp using mod_security rules. It's been a real spider-fest, and I am sure these crawlers are responsible for greatly increasing the server load from time to time, especially when they start hitting WordPress sites with lots of links.
     
  2. Spiral

    Spiral BANNED

    Joined:
    Jun 24, 2005
    Messages:
    2,023
    Likes Received:
    7
    Trophy Points:
    0
    MSN and Yahoo both follow and respect "robots.txt" rules, so if you want to limit their scans, you might simply set up a robots file.

    There are a number of spiders out there which ignore "robots.txt" but the major search engines do in fact respect the rules you place in there.
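
    As a rough illustration of the robots file described above, here is a minimal sketch that embeds a sample robots.txt and checks it with Python's standard urllib.robotparser. The paths and the 10-second delay are assumptions for illustration only, not values from this thread. Crawl-delay is a non-standard extension, so whether a given crawler honors it should be confirmed against that crawler's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths and delay values are illustrative only.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /wp-admin/

User-agent: Slurp
Crawl-delay: 10
Disallow: /wp-admin/

User-agent: *
Disallow: /wp-admin/
"""


def load_rules(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Delay (in seconds) requested from msnbot between fetches.
print(rules.crawl_delay("msnbot"))
# The admin area is off limits; ordinary posts remain crawlable.
print(rules.can_fetch("Slurp", "/wp-admin/options.php"))
print(rules.can_fetch("Slurp", "/2009/09/some-post/"))
```

    Checking the file this way before deploying it catches typos in the rules, which matters because a malformed robots.txt fails silently.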

    For those that don't respect the rules, you could block their IPs in your firewall, throttle and rate limit them with an iptables rule, or use .htaccess restrictions to deny access. You could use mod_security, but this actually adds more overhead, especially if you are logging all trapped IPs and triggered rules in your system logs or a MySQL database.

    For particularly nasty ones, you could use a poisoning script. If you don't already know what that is and how to implement it properly, it is probably best to forget that I mentioned it, lest you make things worse on your server load-wise.
     
    #2 Spiral, Sep 18, 2009
    Last edited: Sep 18, 2009
  3. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Thanks for the advice, Spiral, but all of what you say seems to be directed at individual accounts, not at the entire server.

    Or, perhaps I am missing something here. Is there a way of putting in a single robots.txt file for the entire server, i.e. for every account hosted thereon?
     
  4. Spiral

    Spiral BANNED

    Joined:
    Jun 24, 2005
    Messages:
    2,023
    Likes Received:
    7
    Trophy Points:
    0
    Actually, you can alias it in httpd.conf for the whole server.

    We have a default robots.txt that is accessible to all sites when users don't install their own robots.txt.
     
  5. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Thanks, this sounds like just what we need. However, I have no idea how to do this. Can you provide an example of what to include in the Apache conf file, and where?

    Thanks again.
     
  6. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Slapping forehead ---> Looks like this might be amazingly simple. Please correct me if I'm wrong:

    Step one:

    Add something like the second line below to:
    /usr/local/apache/conf/includes/pre_main_global.conf
    Alias /robots.txt /var/www/some/physical/path/robots.txt

    Step two:

    Restart apache.
     
  7. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Wait a minute, ohh okay. So this will not override a robots.txt file that is already there in the individual account. Correct? I will assume so.
     
  8. britsenigma

    britsenigma Well-Known Member

    Joined:
    Dec 14, 2008
    Messages:
    85
    Likes Received:
    0
    Trophy Points:
    6
    Not sure many hosting customers would like the host overriding their search engine crawlers' access... I'd leave a host if they did that without telling me first...
     
  9. jols

    jols Well-Known Member

    Joined:
    Mar 13, 2004
    Messages:
    1,111
    Likes Received:
    2
    Trophy Points:
    38
    Overriding, yes I would agree with you if this were the case.

    Certainly what we would like to do is to alias in a robots.txt file if there were no other file there. But this does not look like it is possible using the method that Spiral describes.
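
    One way to get the "only if there is no other file" behaviour is to skip the Alias and copy a default file into each account that lacks one. This matters because an Alias is resolved before the account's document root is consulted, so it answers for every site whether or not the account has its own robots.txt. The sketch below is a different technique from the one Spiral describes. It assumes the common /home/<user>/public_html layout, and the function name and default rules are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical default rules; adjust to taste.
DEFAULT_ROBOTS = "User-agent: *\nCrawl-delay: 10\n"


def install_default_robots(home_root, text=DEFAULT_ROBOTS, docroot_name="public_html"):
    """Write robots.txt into each <home_root>/<user>/<docroot_name> lacking one.

    Existing robots.txt files are never touched. Returns the paths created.
    """
    created = []
    for docroot in sorted(Path(home_root).glob(f"*/{docroot_name}")):
        target = docroot / "robots.txt"
        if docroot.is_dir() and not target.exists():
            target.write_text(text)
            created.append(target)
    return created


# Demo against a throwaway directory rather than a real /home.
with tempfile.TemporaryDirectory() as tmp:
    for user in ("alice", "bob"):
        (Path(tmp) / user / "public_html").mkdir(parents=True)
    # alice already maintains her own file; it must be left alone.
    (Path(tmp) / "alice" / "public_html" / "robots.txt").write_text("User-agent: *\nDisallow:\n")
    for path in install_default_robots(tmp):
        print("created", path.relative_to(tmp))
```

    On a real server the created files would also need their ownership set to the account user, and addon or subdomain document roots outside public_html are not covered by this sketch.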
     
    #9 jols, Sep 20, 2009
    Last edited: Sep 20, 2009