Tuesday, February 7, 2012

Robots.txt file

What Is Robots.txt file ?

Robots.txt is a plain-text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do.
It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note “Please, do not enter” on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter.
That is why we say that if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
Structure of a website with and without a robots.txt file
The location of robots.txt is very important. It must be in the main directory because otherwise user agents (search engines) will not be able to find it – they do not search the whole site for a file named robots.txt. Instead, they look first in the main directory and if they don’t find it there, they simply assume that this site does not have a robots.txt file and therefore they index everything they find along the way. So, if you don’t put robots.txt in the right place, do not be surprised that search engines index your whole site.

Structure of a Robots.txt File

The structure of a robots.txt file is pretty simple (and barely flexible): it is a list of user agents and of disallowed files and directories. Basically, the syntax is as follows:
“User-agent:” names a search engine's crawler, and “Disallow:” lists the files and directories to be excluded from indexing. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines: just put the # sign at the beginning of the line:
Example: all user agents are disallowed from seeing the /temp directory.
User-agent: *
Disallow: /temp/
This simple file makes a sensible default robots.txt for new websites.

The Traps of a Robots.txt File

When you start making complicated files – i.e. you decide to allow different user agents access to different directories – problems can start, if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user-agents, directories, missing colons after User-agent and Disallow, etc. Typos can be tricky to find but in some cases validation tools help.
The more serious problem is with logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
The above example comes from a robots.txt that allows all agents to access everything on the site except the /temp directory, and then adds a second, more restrictive record for Googlebot. Under the robots exclusion conventions, a crawler is supposed to obey the most specific record that matches it, so a well-behaved Googlebot should follow its own record and stay out of /images/, /temp/ and /cgi-bin/. But not every parser is that careful: a sloppy one may stop at the first record that matches (here the wildcard) and happily index /images/ and /cgi-bin/, which you think you have told it not to touch. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily: put specific records before the wildcard record and keep each record complete.
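How a given parser resolves these records can be checked locally. Python's standard urllib.robotparser, for instance, applies the specific Googlebot record to Googlebot and the wildcard record to everyone else; a quick sketch against the file above:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, as a list of lines.
rules = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The Googlebot-specific record applies to Googlebot...
googlebot_images = parser.can_fetch("Googlebot", "/images/")        # blocked
# ...and the wildcard record applies to every other bot.
otherbot_images = parser.can_fetch("SomeOtherBot", "/images/")      # allowed
otherbot_temp = parser.can_fetch("SomeOtherBot", "/temp/page.html") # blocked
```

If a crawler you care about behaves differently, the safe fix is the same either way: repeat the shared rules inside each specific record, as the Googlebot record above does with /temp/.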
An example restricting crawlers from accessing specified directories and files

Tools to Generate and Validate a Robots.txt File

Having in mind the simple syntax of a robots.txt file, you can always read it to see if everything is OK but it is much easier to use a validator.
Why did this robot ignore my /robots.txt?
User agent: *
Disallow: /tmp/
This is wrong because there is no hyphen between “User” and “agent”, so the syntax is invalid and robots will ignore the rule.
In those cases when you have a complex robots.txt file (i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude), writing the file manually can be a real pain. But do not worry: there are tools that will generate the file for you.
What is more, there are visual tools that allow you to point and click to select which files and folders are to be excluded. But even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a drop-down list of user agents and a text box for you to list the files you don't want indexed.
Honestly, that is not much help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is better than nothing.
That is what the robots.txt file is for! All you need to do is put it in your root folder, follow a few standard rules, and you're good to go!
The robots.txt file is what you use to tell search robots, also known as Web Wanderers, Crawlers, or Spiders, which pages you would like them not to visit.

Robots.txt generators:

Common procedure:
  1. choose default / global commands (e.g. allow/disallow all robots);
  2. choose files or directories blocked for all robots;
  3. choose user-agent specific commands:
    1. choose action;
    2. choose a specific robot to be blocked.
As a general rule of thumb, I don't recommend using robots.txt generators, for a simple reason: don't create any advanced (i.e. non-default) robots.txt file until you are 100% sure you understand what you are blocking with it. But I am still listing the two most trustworthy generators to check:
  • Google Webmaster Tools: the Robots.txt generator allows you to create simple robots.txt files. What I like most about this tool is that it automatically adds all global commands to each specific user agent's commands (thus helping to avoid one of the most common mistakes):
Google Robots.txt generator
SEObook Robots.txt generator

Robots.txt checkers:

  • Google Webmaster Tools: the Robots.txt analyzer “translates” what your robots.txt dictates to Googlebot:
Google Robots.txt analyzer


Robots.txt Generator, Checker, Analyzer tool
How the Robots.txt works:

When a crawl process initiates, the crawler (or robot or spider, whatever you call it) searches for www.yourdomain.com/robots.txt before any other page, including the index.
If the following appears in your file,
User-agent: *
Disallow: /
the crawler ignores the site.
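This behaviour can be reproduced offline with Python's standard urllib.robotparser (a sketch; a real crawler would fetch www.yourdomain.com/robots.txt over HTTP first):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# A live crawler would use set_url(...) and read(); here we feed the
# blocking rules from the example above directly.
parser.parse("User-agent: *\nDisallow: /".splitlines())

index_allowed = parser.can_fetch("AnyBot", "/")            # the index page
other_allowed = parser.can_fetch("AnyBot", "/about.html")  # any other page
```

Both come back False: with Disallow: / in place, every URL on the site is off limits to every obedient crawler.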

Creating and using the Robots.txt file:

Use Notepad or WordPad (Save as Text Document), or even Microsoft Word (Save as Plain Text), to create the file.
Now, if you want to exclude some parts of your site or the entire site, continue reading below.
Place the following in your robots.txt
-To ignore the whole site,
User-agent: *
Disallow: /
-To ignore a specific directory,
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /mypage/
-To allow specific robots only (the empty Disallow: means nothing is off limits for that robot; use the appropriate user-agent token, e.g. Googlebot or bingbot),
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
-To prevent a specific robot (replace BadBot with the robot's actual user-agent name),
User-agent: BadBot
Disallow: /
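The "allow specific robots only" recipe can be sanity-checked with Python's standard urllib.robotparser; note the empty Disallow: line, which means nothing is off limits for that agent:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

googlebot_ok = parser.can_fetch("Googlebot", "/page.html")    # allowed
otherbot_ok = parser.can_fetch("SomeOtherBot", "/page.html")  # blocked
```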

Robots.txt file for WordPress

  • Install a plugin like Yoast’s Robots meta.
    This plugin adds meta tags to the head section of the pages and tells the search engines whether or not to index them.
    It also allows you to control search engine indexing for individual posts or pages.
  • Create a robots.txt file. This is very simple, you can use your notepad for this task and save the file as robots.txt.
    Alternately, you can generate this file using an online generator such as this.
    Once you have finished, you should upload the file to your blog’s root directory.
    I use the following rules in my robots.txt file. It covers the steps above plus a few more.
    Note that I block search engines from crawling my category pages because I decided to use tags in this blog.
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /wp-login.php
    Disallow: /*wp-login.php*
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /author
    Disallow: /contact/
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /z/j/
    Disallow: /z/c/
    Disallow: /stats/
    Disallow: /dh_
    Disallow: /category/*
    Disallow: /category/
    Disallow: /login/
    Disallow: /wget/
    Disallow: /httpd/
    Disallow: /i/
    Disallow: /f/
    Disallow: /t/
    Disallow: /c/
    Disallow: /j/
    Disallow: /*.php$
    Disallow: /*?*
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.gz$
    Disallow: /*.wmv$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads
    # alexa archiver
    User-agent: ia_archiver
    Disallow: /
    # disable duggmirror by Digg
    User-agent: duggmirror
    Disallow: /
    # allow google image bot to search all images
    User-agent: Googlebot-Image
    Disallow: /wp-includes/
    Allow: /*
    # allow adsense bot on entire site
    User-agent: Mediapartners-Google*
    Allow: /*
Once you have set a robots.txt file for your blog, you can test it to see if it does what it should (blocking certain pages).
To test it, you can use the Crawler Access tool in Google Webmaster Tools:
  • In GWT, go to Site Configuration -> Crawler Access.
  • In this page, make sure the robots.txt content shown in the text area was downloaded recently by Google and reflects the most recent changes you have made.
  • In the URLs box, type different URLs to test against (for example, yourblog.com/wp-admin/) and click on the Test button.
    The result displays something like “Blocked by line 3: Disallow: /wp-admin”.
    If it doesn't, you missed something when creating the file.
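Alongside the GWT test, plain prefix rules from a file like the one above can be pre-checked locally with Python's standard urllib.robotparser. One caveat: that module ignores Google-style wildcard rules such as Disallow: /*.php$, so wildcard lines still need the GWT tool. A sketch with a few assumed WordPress paths:

```python
from urllib.robotparser import RobotFileParser

# A subset of the WordPress rules above, prefix rules only.
rules = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Allow: /wp-content/uploads
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

admin_blocked = not parser.can_fetch("TestBot", "/wp-admin/options.php")
uploads_allowed = parser.can_fetch("TestBot", "/wp-content/uploads/pic.jpg")
```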

Robots.txt file in Joomla

As mentioned, the robots.txt file is in your site root folder. It contains info on which folders should and should not be indexed. It can also include information about your XML sitemap.
There are just two tips I would recommend regarding SEO and the robots.txt file:
1. Remove exclusion of images
For reasons I don't understand, the default robots.txt file in Joomla is set up to exclude your images folder. That means your images will not be indexed by Google or included in their Image Search. And that is something you do want, as it adds another level to your site's search engine visibility.
To change this, open your robots.txt file and remove the line that says:
Disallow: /images/
By removing this line, Google will start indexing your images on the next crawl of your site.
2. Add a reference to your sitemap.xml file
I've talked about the Sitemap XML file previously, in my post on How to get your Joomla site indexed in Google. If you have a sitemap.xml file (and you should have!), it is a good idea to include the following line in your robots.txt file:
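The line in question follows the standard Sitemap directive; the domain below is a placeholder to adjust:

```
Sitemap: http://www.yourdomain.com/sitemap.xml
```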
Naturally, this line needs to be adjusted to fit your domain and sitemap file. In my case, I use the Xmap component to create the Sitemap XML file automatically.
So, the line looks like this for Joomlablogger.net:
Other than that, the robots.txt file can live happily in peace in your site's root folder.


Robots.txt file for drupal

Drupal 6 provides a standard robots.txt file that does an adequate job. It looks like this:
This file carries instructions for robots and spiders that may crawl your site.
Robots.txt directives
Now that we've taken a glance at what the file looks like, let's take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but that's why I'm here. It truly is worth it to understand exactly what you're telling the search engines to do.
Pattern matching
Google (but not all search engines) understands some wildcard characters. The following table explains the usage of a few wildcard characters:
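Since the table itself is not reproduced here, the patterns in question look roughly like this (hypothetical paths; * and $ are understood by Google but not by every engine):

```
User-agent: Googlebot
# * matches any sequence of characters
Disallow: /private*/
# $ anchors the pattern to the end of the URL
Disallow: /*.pdf$
# together they can block, e.g., any URL containing a query string
Disallow: /*?*
```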
Editing your robots.txt file
There may be a few times throughout your website’s SEO campaign that you’ll need to make changes to your robots.txt file. This section provides the necessary steps to make each change.
1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt.
2. Using your FTP program or command line editor, navigate to the top level of your Drupal website and locate the robots.txt file.
3. Make a backup of the file.
4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.
5. Most directives in the robots.txt file are based on the User-agent: line. If you are going to give different instructions to different engines, be sure to place their specific sections above the User-agent: * section, as some search engines will otherwise only read the directives for *.
6. Add the lines you want.
7. Save your robots.txt file, uploading it if necessary, replacing the existing file. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Problems with the default Drupal robots.txt file
There are several problems with the default Drupal robots.txt file. If you use Google Webmaster Tool's robots.txt testing utility to test each line of the file, you'll find that a lot of paths which look like they're being blocked will actually be crawled.
The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.
For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster!
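Python's standard urllib.robotparser matches by simple prefix in the same way, so the problem is easy to reproduce locally:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("User-agent: *\nDisallow: /admin/".splitlines())

with_slash = parser.can_fetch("TestBot", "/admin/")    # blocked, as expected
without_slash = parser.can_fetch("TestBot", "/admin")  # allowed -- the problem!
```

Dropping the trailing slash from the Disallow line makes both variants match, which is exactly the fix described below.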
Fortunately, this is relatively easy to fix.
Fixing the Drupal robots.txt file
Carry out the following steps in order to fix the Drupal robots.txt file:
1. Make a backup of the robots.txt file.
2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
3. Find the Paths (clean URLs) section and the Paths (no clean URLs) section. Note that both sections appear whether you've turned on clean URLs or not. Drupal covers you either way. They look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
4. Duplicate the two sections (simply copy and paste them) so that you have four sections—two of the # Paths (clean URLs) sections and two of # Paths (no clean URLs) sections.
5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.
6. Delete the trailing / after each Disallow line in the fixed! sections. You should end up with four sections that look like this:  
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes. Now your robots.txt file is working as you would expect it to.
Additional changes to the robots.txt file
Using directives and pattern matching commands, the robots.txt file can exclude entire sections of the site from the crawlers like the admin pages, certain individual files like cron.php, and some directories like /scripts and /modules.
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations:
• You are developing a new site and you don't want it to show up in any search engine until you're ready to launch it. Add Disallow: / just after the User-agent: * line.
• Say you're running a very slow server and you don't want the crawlers to slow your site down for other users. Adjust the Crawl-delay by changing it from 10 to 20.
• If you're on a super-fast server (and you should be, right?) you can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1 second. Monitor your server closely for a few days to make sure it can handle the extra load.
• Say you're running a site which allows people to upload their own images but you don't necessarily want those images to show up in Google. Add these lines at the bottom of your robots.txt file:
User-agent: Googlebot-Image
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$
If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/
• Say you noticed in your server logs that there was a bad robot out there scraping all your content. You can try to prevent this by adding the following to the bottom of your robots.txt file (Bad-Robot stands in for the robot's actual user-agent name):
User-agent: Bad-Robot
Disallow: /
• If you have installed the XML Sitemap module, then you've got a great tool that you should send out to all of the search engines. However, it's tedious to go to each engine's site and upload your URL. Instead, you can add a couple of simple lines to the robots.txt file.
Adding your XML Sitemap to the robots.txt file
Another way that the robots.txt file helps you search engine optimize your Drupal site is by allowing you to specify where your sitemaps are located. While you probably want to submit your sitemap directly to Google and Bing, it's a good idea to put a reference to it in the robots.txt file for all of those other search engines.
You can do this by carrying out the following steps:
1. Open the robots.txt file for editing.
2. Add a Sitemap: line pointing at your sitemap file; the Sitemap directive is independent of the User-agent line, so it doesn't matter where you place it in your robots.txt file.
3. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?). Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.

A list of some of the many robots wandering about the web:
  1. ABCdatos BotLink
  2. Acme.Spider
  3. Ahoy! The Homepage Finder
  4. Alkaline
  5. Anthill
  6. Walhello appie
  7. Arachnophilia
  8. Arale
  9. Araneo
  10. AraybOt
  11. ArchitextSpider
  12. Aretha
  14. arks
  15. AskJeeves
  16. ASpider (Associative Spider)
  17. ATN Worldwide
  18. Atomz.com Search Robot
  20. BackRub
  21. Bay Spider
  22. Big Brother
  23. Bjaaland
  24. BlackWidow
  25. Die Blinde Kuh
  26. Bloodhound
  27. Borg-Bot
  28. BoxSeaBot
  29. bright.net caching robot
  30. BSpider
  31. CACTVS Chemistry Spider
  32. Calif
  33. Cassandra
  34. Digimarc Marcspider/CGI
  35. Checkbot
  36. ChristCrawler.com
  37. churl
  38. cIeNcIaFiCcIoN.nEt
  39. CMC/0.01
  40. Collective
  41. Combine System
  42. Conceptbot
  43. ConfuzzledBot
  44. CoolBot
  45. Web Core / Roots
  46. XYLEME Robot
  47. Internet Cruiser Robot
  48. Cusco
  49. CyberSpyder Link Test
  50. CydralSpider
  51. Desert Realm Spider
  52. DeWeb(c) Katalog/Index
  53. DienstSpider
  54. Digger
  55. Digital Integrity Robot
  56. Direct Hit Grabber
  57. DNAbot
  58. DownLoad Express
  59. DragonBot
  60. DWCP (Dridus’ Web Cataloging Project)
  61. e-collector
  62. EbiNess
  63. EIT Link Verifier Robot
  65. Emacs-w3 Search Engine
  66. ananzi
  67. esculapio
  68. Esther
  69. Evliya Celebi
  70. FastCrawler
  71. Fluid Dynamics Search Engine robot
  72. Felix IDE
  73. Wild Ferret Web Hopper #1, #2, #3
  74. FetchRover
  75. fido
  76. Hämähäkki
  77. KIT-Fireball
  78. Fish search
  79. Fouineur
  80. Robot Francoroute
  81. Freecrawl
  82. FunnelWeb
  83. gammaSpider, FocusedCrawler
  84. gazz
  85. GCreep
  86. GetBot
  87. GetURL
  88. Golem
  89. Googlebot
  90. Grapnel/0.01 Experiment
  91. Griffon
  92. Gromit
  93. Northern Light Gulliver
  94. Gulper Bot
  95. HamBot
  96. Harvest
  97. havIndex
  98. HI (HTML Index) Search
  99. Hometown Spider Pro
  100. ht://Dig
  101. HTMLgobble
  102. Hyper-Decontextualizer
  103. iajaBot
  104. IBM_Planetwide
  105. Popular Iconoclast
  106. Ingrid
  107. Imagelock
  108. IncyWincy
  109. Informant
  110. InfoSeek Robot 1.0
  111. Infoseek Sidewinder
  112. InfoSpiders
  113. Inspector Web
  114. IntelliAgent
  115. I, Robot
  116. Iron33
  117. Israeli-search
  118. JavaBee
  119. JBot Java Web Robot
  120. JCrawler
  121. Jeeves
  122. JoBo Java Web Robot
  123. Jobot
  124. JoeBot
  125. The Jubii Indexing Robot
  126. JumpStation
  127. image.kapsi.net
  128. Katipo
  129. KDD-Explorer
  130. Kilroy
  131. KO_Yappo_Robot
  132. LabelGrabber
  133. larbin
  134. legs
  135. Link Validator
  136. LinkScan
  137. LinkWalker
  138. Lockon
  139. logo.gif Crawler
  140. Lycos
  141. Mac WWWWorm
  142. Magpie
  143. marvin/infoseek
  144. Mattie
  145. MediaFox
  146. MerzScope
  147. NEC-MeshExplorer
  148. MindCrawler
  149. mnoGoSearch search engine software
  150. moget
  151. MOMspider
  152. Monster
  153. Motor
  154. MSNBot
  155. Muncher
  156. Muninn
  157. Muscat Ferret
  158. Mwd.Search
  159. Internet Shinchakubin
  160. NDSpider
  161. Nederland.zoek
  162. NetCarta WebMap Engine
  163. NetMechanic
  164. NetScoop
  165. newscan-online
  166. NHSE Web Forager
  167. Nomad
  168. The NorthStar Robot
  169. nzexplorer
  170. ObjectsSearch
  171. Occam
  172. HKU WWW Octopus
  173. OntoSpider
  174. Openfind data gatherer
  175. Orb Search
  176. Pack Rat
  177. PageBoy
  178. ParaSite
  179. Patric
  180. pegasus
  181. The Peregrinator
  182. PerlCrawler 1.0
  183. Phantom
  184. PhpDig
  185. PiltdownMan
  186. Pimptrain.com’s robot
  187. Pioneer
  188. html_analyzer
  189. Portal Juice Spider
  190. PGP Key Agent
  191. PlumtreeWebAccessor
  192. Poppi
  193. PortalB Spider
  194. psbot
  195. GetterroboPlus Puu
  196. The Python Robot
  197. Raven Search
  198. RBSE Spider
  199. Resume Robot
  200. RoadHouse Crawling System
  201. RixBot
  202. Road Runner: The ImageScape Robot
  203. Robbie the Robot
  204. ComputingSite Robi/1.0
  205. RoboCrawl Spider
  206. RoboFox
  207. Robozilla
  208. Roverbot
  209. RuLeS
  210. SafetyNet Robot
  211. Scooter
  212. Sleek
  213. Search.Aus-AU.COM
  214. SearchProcess
  215. Senrigan
  216. SG-Scout
  217. ShagSeeker
  218. Shai’Hulud
  219. Sift
  220. Simmany Robot Ver1.0
  221. Site Valet
  222. Open Text Index Robot
  223. SiteTech-Rover
  224. Skymob.com
  225. SLCrawler
  226. Inktomi Slurp
  227. Smart Spider
  228. Snooper
  229. Solbot
  230. Spanner
  231. Speedy Spider
  232. spider_monkey
  233. SpiderBot
  234. Spiderline Crawler
  235. SpiderMan
  236. SpiderView(tm)
  237. Spry Wizard Robot
  238. Site Searcher
  239. Suke
  240. suntek search engine
  241. Sven
  242. Sygol
  243. TACH Black Widow
  244. Tarantula
  245. tarspider
  246. Tcl W3 Robot
  247. TechBOT
  248. Templeton
  249. TeomaTechnologies
  250. TITAN
  251. TitIn
  252. The TkWWW Robot
  253. TLSpider
  254. UCSD Crawl
  255. UdmSearch
  256. UptimeBot
  257. URL Check
  258. URL Spider Pro
  259. Valkyrie
  260. Verticrawl
  261. Victoria
  262. vision-search
  263. void-bot
  264. Voyager
  265. VWbot
  266. The NWI Robot
  267. W3M2
  268. WallPaper (alias crawlpaper)
  269. the World Wide Web Wanderer
  270. w@pSpider by wap4.com
  271. WebBandit Web Spider
  272. WebCatcher
  273. WebCopy
  274. webfetcher
  275. The Webfoot Robot
  276. Webinator
  277. weblayers
  278. WebLinker
  279. WebMirror
  280. The Web Moose
  281. WebQuest
  282. Digimarc MarcSpider
  283. WebReaper
  284. webs
  285. Websnarf
  286. WebSpider
  287. WebVac
  288. webwalk
  289. WebWalker
  290. WebWatch
  291. Wget
  292. whatUseek Winona
  293. WhoWhere Robot
  294. Wired Digital
  295. Weblog Monitor
  296. w3mir
  297. WebStolperer
  298. The Web Wombat
  299. The World Wide Web Worm
  300. WWWC Ver 0.2.5
  301. WebZinger
  302. XGET

Thanks For Reading