The Community Forums

Robots.txt question

Discussion in 'General Discussion' started by MarkHoward, May 25, 2010.

  1. MarkHoward

    MarkHoward Registered

    Joined:
    May 25, 2010
    Messages:
    1
    Likes Received:
    0
    Trophy Points:
    1
    I hope I'm allowed to ask this question here.
    I want to set up a robots.txt file to manage which bots and web crawlers can access which parts of my website.
    To do so, you have to specify the "User-agent" for each bot.
    Where can I find a list of valid User-agent names for the most common bots?
    For example, if I want to specify Google, do I use "google", "googlebot", "google image", etc.?
    I really want to exclude all bots other than Google, Yahoo, and maybe MSN, but I don't know the full list of "User-agents" I would have to specify to achieve this. (A sample robots.txt sketch appears after this thread.)
    TIA
     
  2. garrettp

    garrettp Well-Known Member
    PartnerNOC

    Joined:
    Jun 18, 2004
    Messages:
    312
    Likes Received:
    0
    Trophy Points:
    16
    cPanel Access Level:
    DataCenter Provider
  3. Spiral

    Spiral BANNED

    Joined:
    Jun 24, 2005
    Messages:
    2,023
    Likes Received:
    7
    Trophy Points:
    0
    Based on this comment, robots.txt is not appropriate for what you are trying to do and may in fact have unexpected consequences.

    Unlike .htaccess, the robots.txt file is not physically enforced; it is a voluntary system that web-crawling spiders can choose to read or ignore, and to obey or not obey, entirely at their own discretion.

    The major players will of course read your robots.txt and obey your requests, but minor crawlers, foreign bots, and especially spam-harvesting bots DO NOT obey any requests you make in your robots.txt.

    In fact, many of these crawlers will read your robots.txt to find content to crawl and, instead of skipping the areas you ask them not to touch, will make a beeline straight for those folders! :eek:

    Incidentally, this is also why you should NEVER put links to the "admin" areas of your web scripts in robots.txt, as it tells every web spider, hacker, and anyone else precisely where to look.

    If you want to limit certain web spiders, I would recommend instead using access directives (i.e. "Deny from") in .htaccess, which can easily be written to match against any Apache variable, including the user agent. (A sample .htaccess sketch follows after the thread.)

    Another good idea is to block the IP ranges of the known crawlers you want to keep out at the firewall, as that also frees up server resources by dropping those requests before Apache has to handle them at all. (A firewall example also follows.)
     
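A minimal robots.txt sketch along the lines of what the original poster was asking for. The user-agent tokens shown (Googlebot and Googlebot-Image for Google, Slurp for Yahoo, msnbot for MSN) are the commonly documented ones from that era; exact tokens change over time, so check each search engine's own documentation. As the reply above explains, compliance with any of this is voluntary.

    # Let the preferred search engines in (an empty Disallow permits everything)...
    User-agent: Googlebot
    Disallow:

    User-agent: Googlebot-Image
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    # ...and ask every other crawler to stay away entirely.
    # A crawler uses the most specific group that matches its token,
    # so the bots named above ignore this catch-all rule.
    User-agent: *
    Disallow: /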
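A sketch of the .htaccess approach described above, using the Apache 2.2-era syntax that matches the "Deny from" directive mentioned in the reply ("BadBot" and "EvilScraper" are placeholder patterns, not real crawler names). On Apache 2.4 the last three lines would instead be written with "Require all granted" / "Require not env block_bot".

    # Tag any request whose User-Agent header matches one of the listed
    # patterns (case-insensitive regex match via mod_setenvif).
    SetEnvIfNoCase User-Agent "BadBot"      block_bot
    SetEnvIfNoCase User-Agent "EvilScraper" block_bot

    # Refuse tagged requests with a 403; everyone else is allowed through.
    Order Allow,Deny
    Allow from all
    Deny from env=block_bot

Note that SetEnvIf matching only tests whatever the client claims in its User-Agent header, so this stops bots that identify themselves honestly; anything that forges its user agent slips past, which is why the firewall approach below is also worth considering.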
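And a sketch of the firewall suggestion, assuming a Linux server with iptables available; 192.0.2.0/24 is a documentation placeholder range, so substitute the published address ranges of the crawlers you actually want to keep out. On cPanel servers running ConfigServer Firewall, the csf command can manage the same kind of deny entry.

    # Drop traffic from an unwanted crawler's address range before Apache ever sees it.
    iptables -A INPUT -s 192.0.2.0/24 -j DROP

    # Roughly equivalent with ConfigServer Firewall (CSF), if it is installed:
    csf -d 192.0.2.0/24 "unwanted crawler"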