Community Forums
  1. #1
    Registered User
    Join Date
    May 2010
    Posts
    1

    Robots.txt question

    I hope I'm allowed to ask this question here.
    I want to set up a robots.txt file to manage which bots and web crawlers can access which parts of my website.
    To do so, you have to specify the "User-agent" for each bot.
    Where can I find a list of valid user agent names for the most common bots?
    For example, if I want to specify Google, do I specify "google", "googlebot", "google image", etc.?
    I really want to exclude all bots other than Google, Yahoo and maybe MSN, but I don't know the entire list of user agents I have to specify to achieve this.
    TIA

  2. #2
    cPanel Partner NOC
    Join Date
    Jun 2004
    Posts
    313
    cPanel/Enkompass Access Level

    DataCenter Provider

    Searching is your friend. First result of a web search:

    The Web Robots Pages

  3. #3
    BANNED
    Join Date
    Jun 2005
    Location
    Wild Wild West
    Posts
    2,025

    Quote:
      I really want to exclude all bots other than Google, Yahoo and maybe MSN but I don't know the entire list of "User Agents" I have to specify to achieve this.

    Based on this comment, robots.txt is not appropriate for what you are trying to do and may in fact have unexpected consequences.

    Unlike .htaccess rules, robots.txt is not enforced by the server. It is a voluntary convention: a crawler can choose whether to read it and whether to obey it.

    The major players will of course read your robots.txt and obey your requests, but the minor crawlers, foreign bots, and especially the spam-harvesting bots DO NOT obey anything you put in your robots.txt.
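
    To make that concrete, here is a rough Python sketch of how a well-behaved crawler would read a robots.txt that lets in only Google, Yahoo and MSN. The tokens Googlebot, Slurp and msnbot are the names those crawlers have commonly used, but treat them as assumptions and check each search engine's own documentation. SomeOtherBot and the allowed() helper are made up for illustration. The parsing is done by urllib.robotparser from the standard library.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: an empty "Disallow:" means "nothing is off limits"
# for that crawler, and the final "*" record turns everyone else away.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="/index.html"):
    """Hypothetical helper: would a polite crawler with this name fetch url?"""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Slurp", "msnbot", "SomeOtherBot"):
        print(agent, allowed(agent))
```

    Note that nothing here is enforced: a crawler that never runs this kind of check simply fetches the page anyway.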

    In fact, many of these bots will read your robots.txt precisely to find content to crawl: instead of skipping the areas you ask them not to crawl, they make a beeline straight for those folders!

    This, incidentally, is also why you should NEVER put the paths to your "admin" type areas for web scripts in your robots.txt: it tells all the web spiders, hackers, and everyone else precisely where to go.

    If you want to limit certain web spiders, I would recommend instead using access directives (e.g. "deny from") in .htaccess, which can easily be written to match against any Apache variable, including the user agent.
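
    As a rough sketch of what such rules could look like, the Python snippet below builds a block of .htaccess lines from a list of user agent substrings. It assumes the older Apache 2.2 style access directives (Order / Allow / Deny) together with SetEnvIfNoCase from mod_setenvif; Apache 2.4 uses Require directives instead. The function name, the bad_bot variable and the ExampleHarvester agent are all invented for illustration.

```python
import re


def build_htaccess_rules(blocked_agents):
    """Return .htaccess text that denies requests whose User-Agent header
    contains any of the given substrings (case-insensitive)."""
    lines = []
    for agent in blocked_agents:
        # Accept only simple tokens so nothing needs regex or quote escaping.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", agent):
            raise ValueError(f"unsupported characters in agent name: {agent!r}")
        # Flag matching requests with the (hypothetical) bad_bot variable.
        lines.append(f'SetEnvIfNoCase User-Agent "{agent}" bad_bot')
    # Allow everyone, then deny any request that was flagged above.
    lines += [
        "Order Allow,Deny",
        "Allow from all",
        "Deny from env=bad_bot",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_htaccess_rules(["ExampleHarvester"]))
```

    Keep in mind that the user agent is whatever the client chooses to send, so this only stops bots that identify themselves honestly.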

    Another good idea is to block the IP ranges of the known offenders in your firewall, which also frees up server resources since the server does not have to wait for each request to complete before blocking it.
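
    For the firewall side, a minimal sketch, assuming an iptables-based firewall (your server may use something else): the snippet merges a list of CIDR ranges with the standard ipaddress module and prints one DROP rule per merged range. The ranges shown are reserved documentation addresses, not real bot networks, and the function name is made up.

```python
import ipaddress


def iptables_drop_rules(cidr_ranges):
    """Return one 'iptables ... -j DROP' command per merged IPv4 range."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidr_ranges]
    # collapse_addresses merges adjacent or overlapping ranges,
    # which keeps the rule list short.
    merged = ipaddress.collapse_addresses(networks)
    return [f"iptables -A INPUT -s {net} -j DROP" for net in merged]


if __name__ == "__main__":
    # Placeholder ranges only; substitute the ranges you actually want blocked.
    for rule in iptables_drop_rules(
        ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]
    ):
        print(rule)
```

    Review the generated commands before running them, since a wrong range can lock out legitimate visitors.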
