What Is a Robots.txt File?
Robots.txt is a text (not HTML) file you put on your site to tell search robots – also known as Web Wanderers, Crawlers, or Spiders – which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do.
It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note saying “Please, do not enter” on an unlocked door – you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
The location of robots.txt is very important. It must be in the main directory
because otherwise user agents (search engines) will not be able to find
it – they do not search the whole site for a file named robots.txt.
Instead, they look first in the main directory and if they don’t find it
there, they simply assume that this site does not have a robots.txt
file and therefore they index everything they find along the way. So, if
you don’t put robots.txt in the right place, do not be surprised that
search engines index your whole site.
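For example, crawlers request the file only from the root of the host, so only the first of these two locations will ever be read (example.com is just a placeholder domain):
http://www.example.com/robots.txt – this is where crawlers look
http://www.example.com/blog/robots.txt – this copy will never be fetched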
Structure of a Robots.txt File
The structure of a robots.txt file is pretty simple (and not very flexible) – it is simply a list of user agents and disallowed files and directories. Basically, the syntax is as follows:
User-agent:
Disallow:
“User-agent” names the search engine crawler a group of rules applies to, and “Disallow” lists the files and directories to be excluded from indexing. In addition to the “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:
Example: all user agents are disallowed from crawling the /temp directory.
User-agent: *
Disallow: /temp/
This can serve as a simple default robots.txt file for a new website.
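As a minimal sketch, a default file that blocks nothing at all looks like this – an empty Disallow value means “exclude nothing”, and the first line is just a comment:
# Allow every robot to crawl the whole site
User-agent: *
Disallow: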
The Traps of a Robots.txt File
When you start making complicated files – i.e. you decide to allow different user agents access to different directories – problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user agents, misspelled directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.
The more serious problem is with logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
The above example comes from a robots.txt file that allows all agents to access everything on the site except the /temp directory, and then adds a more restrictive record for Googlebot. The trap is that a crawler obeys only the most specific record that matches it: Googlebot will follow the “User-agent: Googlebot” record and ignore the “User-agent: *” record completely. In this example that happens to be safe, because /temp/ is repeated in the Googlebot record. But if you list a directory only under “User-agent: *” and forget to repeat it in the Googlebot record, Googlebot will crawl it anyway, even though you think you have told it not to touch it. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
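To make the trap concrete, here is a variant that looks almost identical but leaves /temp/ open to Googlebot, because /temp/ appears only in the general record that Googlebot never reads:
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /cgi-bin/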
Tools to Generate and Validate a Robots.txt File
Given the simple syntax of a robots.txt file, you can always read it yourself to see if everything is OK, but it is much easier to use a validator.
Why did this robot ignore my /robots.txt?
User agent: *
Disallow: /tmp/
This is wrong because there is no hyphen between “User” and “agent”, so the syntax is incorrect and robots will not recognize the rule.
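The corrected version is simply:
User-agent: *
Disallow: /tmp/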
In those cases when you have a complex robots.txt file – i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude – writing the file manually can be a real pain. But do not worry – there are tools that will generate the file for you.
What is more, there are visual tools that allow you to point and select which files and folders are to be excluded. But even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a drop-down list of user agents and a text box for you to list the files you don't want indexed.
Honestly, it is not much of a help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories – but it is better than nothing.
That is what the robots.txt file is for! All you need to do is put it in your root folder, follow a few standard rules, and you're good to go!
Robots.txt generators:
Common procedure:
- choose default / global commands (e.g. allow/disallow all robots);
- choose files or directories blocked for all robots;
- choose user-agent-specific commands:
  - choose an action;
  - choose a specific robot to be blocked.
As a general rule of thumb, I don't recommend using robots.txt generators, for a simple reason: don't create any advanced (i.e. non-default) robots.txt file until you are 100% sure you understand what you are blocking with it. Still, here are the two most trustworthy generators to check:
- Google Webmaster Tools: the Robots.txt generator allows you to create simple robots.txt files. What I like most about this tool is that it automatically adds all global commands to each specific user agent's commands (thus helping to avoid one of the most common mistakes).
- The SEObook Robots.txt generator unfortunately misses the above feature, but it is really easy (and fun) to use.
Robots.txt checkers:
- Google Webmaster Tools: the Robots.txt analyzer “translates” what your robots.txt dictates to the Googlebot.
- Robots.txt Syntax Checker finds some common errors within your file by checking for whitespace separated lists, not widely supported standards, wildcard usage, etc.
- A Validator for Robots.txt Files also checks for syntax errors and confirms correct directory paths.
Robots.txt generator, checker and analyzer tools:
- webpositionadvisor
- submitshop
- motoricerca
- dhtmlextreme
- phpweby
- seobook
How the robots.txt file works:
When a crawl process starts, the crawler (or robot, or spider, whatever you call it) looks for www.yourdomain.com/robots.txt before any other page, including the index. If the following appears in your file,
User-agent: *
Disallow: /
the crawler ignores the whole site.
Creating and using the Robots.txt file:
Use Notepad or WordPad (Save as Text Document), or even Microsoft Word (Save as Plain Text), to create the file.
Now, if you want to exclude some parts of your site or the entire site, continue reading below.
Place the following lines in your robots.txt:
- To ignore the whole site:
User-agent: *
Disallow: /
- To ignore specific directories, for example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /mypage/
- To allow a specific robot only (e.g. Googlebot; the same pattern works for bingbot, Slurp, etc.):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
- To block a specific robot (replace BadBot with that robot's user-agent name):
User-agent: BadBot
Disallow: /
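Putting these pieces together, a typical small-site robots.txt might look like the following sketch (the directory names and the BadBot name are just placeholders):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /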
Robots.txt file for WordPress
- Install a plugin like Yoast's Robots Meta. This plugin adds meta tags to the head section of your pages and tells the search engines whether or not to index them. It also allows you to control search engine indexing for individual posts or pages.
- Create a robots.txt file. This is very simple: you can use Notepad for this task and save the file as robots.txt. Alternately, you can generate the file using an online generator. Once you have finished, upload the file to your blog's root directory.
I use the following rules in my robots.txt file. It covers the steps above plus a few more.
Note that I block search engines from crawling my category pages because I decided to use tags in this blog.
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-login.php
Disallow: /*wp-login.php*
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /author
Disallow: /contact/
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /category/*
Disallow: /category/
Disallow: /login/
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
Disallow: /*.php$
Disallow: /*?*
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# alexa archiver
User-agent: ia_archiver
Disallow: /
# disable duggmirror by Digg
User-agent: duggmirror
Disallow: /
# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow: /wp-includes/
Allow: /*
# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
Once you have set a robots.txt file for your blog, you can test it to see if it does what it should (blocking certain pages).
To test it, you can use the Crawler Access tool in Google Webmaster Tools:
- In GWT, go to Site Configuration -> Crawler Access.
- In this page, make sure the text area of the robots.txt file has been downloaded recently by Google and that it reflects the most recent changes you have made.
- In the URLs box, type different URLs to test against (for example, yourblog.com/wp-admin/) and click on the Test button.
The result displays something like “Blocked by line 3: Disallow: /wp-admin”.
If it doesn’t, you missed something when creating the file.
Robots.txt file in Joomla
As mentioned, the robots.txt file sits in your site's root folder. It contains info on which folders should be indexed and which should not. It can also include information about your XML sitemap.
There are just two tips I would recommend regarding SEO and the robots.txt file:
1. Remove exclusion of images
For reasons I don't understand, the default robots.txt file in Joomla
is set up to exclude your images folder. That means your images will not be indexed by Google or included in its Image Search – and that image indexing is something you would want, as it adds another level to your site's search engine visibility.
To change this, open your robots.txt file and remove the line that says:
Disallow: /images/
By removing this line, Google will start indexing your images on the next crawl of your site.
2. Add a reference to your sitemap.xml file
I've talked about the Sitemap XML file previously, in my post on How
to get your Joomla site indexed in Google. If you have a sitemap.xml
file (and you should have!), it will be good to include the following
line in your robots.txt file:
Sitemap: http://www.domain.com/sitemap.xml
Naturally, this line needs to be adjusted to fit your domain and
sitemap file. In my case, I use the Xmap component to create the Sitemap
XML file automatically.
So, the line looks like this for Joomlablogger.net:
Sitemap: http://www.joomlablogger.net/component/option,com_xmap/lang,en/sitemap,1/view,xml/
Other than that, the robots.txt file can live happily at peace in your site's root folder.
Robots.txt file for Drupal
Drupal 6 provides a standard robots.txt file that does an adequate job. The file carries instructions for the robots and spiders that may crawl your site.
Robots.txt directives
Now that we have seen what the default file is there for, let's take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but that's why I'm here. It truly is worth it to understand exactly what you're telling the search engines to do.
Pattern matching
Google (but not all search engines) understands some wildcard characters: * matches any sequence of characters, and $ anchors a pattern to the end of a URL.
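As an illustrative sketch (these rules are examples, not part of the default Drupal file), the wildcards let you block whole classes of URLs for Google:
User-agent: Googlebot
# Block any URL that contains a query string
Disallow: /*?
# Block every PDF file on the site
Disallow: /*.pdf$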
Editing your robots.txt file
There may be a few times throughout your website’s SEO campaign that you’ll need to make changes to your robots.txt file. This section provides the necessary steps to make each change.
1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt.
2. Using your FTP program or command line editor, navigate to the top level of your Drupal website and locate the robots.txt file.
3. Make a backup of the file.
4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.
5. Most directives in the robots.txt file are grouped under a User-agent: line. If you are going to give different instructions to different engines, be sure to place their records above the User-agent: * record, as some search engines will only read the directives for * if their specific instructions come after that section.
6. Add the lines you want.
7. Save your robots.txt file, uploading it if necessary, replacing the existing file. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Problems with the default Drupal robots.txt file
There are several problems with the default Drupal robots.txt file. If you use Google Webmaster Tool's robots.txt testing utility to test each line of the file, you'll find that a lot of paths which look like they're being blocked will actually be crawled.
The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.
For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster!
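In other words, a Disallow path is matched as a prefix of the URL, so the two forms behave differently:
# Blocks /admin/ and everything below it, but NOT the /admin page itself
Disallow: /admin/
# Blocks /admin itself, plus /admin/ and everything below it
Disallow: /admin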
Fortunately, this is relatively easy to fix.
Fixing the Drupal robots.txt file
Carry out the following steps in order to fix the Drupal robots.txt file:
1. Make a backup of the robots.txt file.
2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
3. Find the Paths (clean URLs) section and the Paths (no clean URLs)
section. Note that both sections appear whether you've turned on clean
URLs or not. Drupal covers you either way. They look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
4. Duplicate the two sections (simply copy and paste them) so that
you have four sections—two of the # Paths (clean URLs) sections and two
of # Paths (no clean URLs) sections.
5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.
6. Delete the trailing / after each Disallow line in the fixed!
sections. You should end up with four sections that look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes. Now your robots.txt file is working as you would expect it to.
Additional changes to the robots.txt file
Using directives and pattern matching commands, the robots.txt file can exclude entire sections of the site from the crawlers like the admin pages, certain individual files like cron.php, and some directories like /scripts and /modules.
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations (a combined sketch follows the list):
• You are developing a new site and you don't want it to show up in any search engine until you're ready to launch it. Add Disallow: / just after the User-agent: * line.
• Say you're running a very slow server and you don't want the
crawlers to slow your site down for other users. Adjust the Crawl-delay
by changing it from 10 to 20.
• If you're on a super-fast server (and you should be, right?) you
can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1
second. Monitor your server closely for a few days to make sure it can
handle the extra load.
• Say you're running a site which allows people to upload their own
images but you don't necessarily want those images to show up in Google.
Add these lines at the bottom of your robots.txt file:
User-agent: Googlebot-Image
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$
If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/
• Say you noticed in your server logs that there was a bad robot out there that was scraping all your content. You can try to prevent this by adding the following to the bottom of your robots.txt file (Bad-Robot stands for the offending robot's user-agent name):
User-agent: Bad-Robot
Disallow: /
• If you have installed the XML Sitemap module, then you've got a
great tool that you should send out to all of the search engines.
However, it's tedious to go to each engine's site and upload your URL.
Instead, you can add a couple of simple lines to the robots.txt file.
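Putting a few of these tweaks together, as promised above, the adjusted parts of a Drupal robots.txt might look like this sketch (the delay value, image path, and bot name are illustrative):
# Ask crawlers to wait 20 seconds between requests
User-agent: *
Crawl-delay: 20

# Keep user-uploaded images out of Google Image Search
User-agent: Googlebot-Image
Disallow: /files/users/images/

# Shut out a content scraper spotted in the server logs
User-agent: Bad-Robot
Disallow: /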
Adding your XML Sitemap to the robots.txt file
Another way the robots.txt file helps you search engine optimize your Drupal site is by allowing you to specify where your sitemaps are located. While you probably want to submit your sitemap directly to Google and Bing, it's a good idea to put a reference to it in the robots.txt file for all of those other search engines.
You can do this by carrying out the following steps:
1. Open the robots.txt file for editing.
2. The Sitemap directive is independent of the User-agent line, so it doesn't matter where you place it in your robots.txt file.
3. Add a line pointing at your sitemap's location, for example: Sitemap: http://www.yoursite.com/sitemap.xml
4. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?). Go to http://www.yoursite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Here is a list of some of the many robots wandering the web:
- ABCdatos BotLink
- Acme.Spider
- Ahoy! The Homepage Finder
- Alkaline
- Anthill
- Walhello appie
- Arachnophilia
- Arale
- Araneo
- AraybOt
- ArchitextSpider
- Aretha
- ARIADNE
- arks
- AskJeeves
- ASpider (Associative Spider)
- ATN Worldwide
- Atomz.com Search Robot
- AURESYS
- BackRub
- Bay Spider
- Big Brother
- Bjaaland
- BlackWidow
- Die Blinde Kuh
- Bloodhound
- Borg-Bot
- BoxSeaBot
- bright.net caching robot
- BSpider
- CACTVS Chemistry Spider
- Calif
- Cassandra
- Digimarc Marcspider/CGI
- Checkbot
- ChristCrawler.com
- churl
- cIeNcIaFiCcIoN.nEt
- CMC/0.01
- Collective
- Combine System
- Conceptbot
- ConfuzzledBot
- CoolBot
- Web Core / Roots
- XYLEME Robot
- Internet Cruiser Robot
- Cusco
- CyberSpyder Link Test
- CydralSpider
- Desert Realm Spider
- DeWeb(c) Katalog/Index
- DienstSpider
- Digger
- Digital Integrity Robot
- Direct Hit Grabber
- DNAbot
- DownLoad Express
- DragonBot
- DWCP (Dridus’ Web Cataloging Project)
- e-collector
- EbiNess
- EIT Link Verifier Robot
- ELFINBOT
- Emacs-w3 Search Engine
- ananzi
- esculapio
- Esther
- Evliya Celebi
- FastCrawler
- Fluid Dynamics Search Engine robot
- Felix IDE
- Wild Ferret Web Hopper #1, #2, #3
- FetchRover
- fido
- Hämähäkki
- KIT-Fireball
- Fish search
- Fouineur
- Robot Francoroute
- Freecrawl
- FunnelWeb
- gammaSpider, FocusedCrawler
- gazz
- GCreep
- GetBot
- GetURL
- Golem
- Googlebot
- Grapnel/0.01 Experiment
- Griffon
- Gromit
- Northern Light Gulliver
- Gulper Bot
- HamBot
- Harvest
- havIndex
- HI (HTML Index) Search
- Hometown Spider Pro
- ht://Dig
- HTMLgobble
- Hyper-Decontextualizer
- iajaBot
- IBM_Planetwide
- Popular Iconoclast
- Ingrid
- Imagelock
- IncyWincy
- Informant
- InfoSeek Robot 1.0
- Infoseek Sidewinder
- InfoSpiders
- Inspector Web
- IntelliAgent
- I, Robot
- Iron33
- Israeli-search
- JavaBee
- JBot Java Web Robot
- JCrawler
- Jeeves
- JoBo Java Web Robot
- Jobot
- JoeBot
- The Jubii Indexing Robot
- JumpStation
- image.kapsi.net
- Katipo
- KDD-Explorer
- Kilroy
- KO_Yappo_Robot
- LabelGrabber
- larbin
- legs
- Link Validator
- LinkScan
- LinkWalker
- Lockon
- logo.gif Crawler
- Lycos
- Mac WWWWorm
- Magpie
- marvin/infoseek
- Mattie
- MediaFox
- MerzScope
- NEC-MeshExplorer
- MindCrawler
- mnoGoSearch search engine software
- moget
- MOMspider
- Monster
- Motor
- MSNBot
- Muncher
- Muninn
- Muscat Ferret
- Mwd.Search
- Internet Shinchakubin
- NDSpider
- Nederland.zoek
- NetCarta WebMap Engine
- NetMechanic
- NetScoop
- newscan-online
- NHSE Web Forager
- Nomad
- The NorthStar Robot
- nzexplorer
- ObjectsSearch
- Occam
- HKU WWW Octopus
- OntoSpider
- Openfind data gatherer
- Orb Search
- Pack Rat
- PageBoy
- ParaSite
- Patric
- pegasus
- The Peregrinator
- PerlCrawler 1.0
- Phantom
- PhpDig
- PiltdownMan
- Pimptrain.com’s robot
- Pioneer
- html_analyzer
- Portal Juice Spider
- PGP Key Agent
- PlumtreeWebAccessor
- Poppi
- PortalB Spider
- psbot
- GetterroboPlus Puu
- The Python Robot
- Raven Search
- RBSE Spider
- Resume Robot
- RoadHouse Crawling System
- RixBot
- Road Runner: The ImageScape Robot
- Robbie the Robot
- ComputingSite Robi/1.0
- RoboCrawl Spider
- RoboFox
- Robozilla
- Roverbot
- RuLeS
- SafetyNet Robot
- Scooter
- Sleek
- Search.Aus-AU.COM
- SearchProcess
- Senrigan
- SG-Scout
- ShagSeeker
- Shai’Hulud
- Sift
- Simmany Robot Ver1.0
- Site Valet
- Open Text Index Robot
- SiteTech-Rover
- Skymob.com
- SLCrawler
- Inktomi Slurp
- Smart Spider
- Snooper
- Solbot
- Spanner
- Speedy Spider
- spider_monkey
- SpiderBot
- Spiderline Crawler
- SpiderMan
- SpiderView(tm)
- Spry Wizard Robot
- Site Searcher
- Suke
- suntek search engine
- Sven
- Sygol
- TACH Black Widow
- Tarantula
- tarspider
- Tcl W3 Robot
- TechBOT
- Templeton
- TeomaTechnologies
- TITAN
- TitIn
- The TkWWW Robot
- TLSpider
- UCSD Crawl
- UdmSearch
- UptimeBot
- URL Check
- URL Spider Pro
- Valkyrie
- Verticrawl
- Victoria
- vision-search
- void-bot
- Voyager
- VWbot
- The NWI Robot
- W3M2
- WallPaper (alias crawlpaper)
- the World Wide Web Wanderer
- w@pSpider by wap4.com
- WebBandit Web Spider
- WebCatcher
- WebCopy
- webfetcher
- The Webfoot Robot
- Webinator
- weblayers
- WebLinker
- WebMirror
- The Web Moose
- WebQuest
- Digimarc MarcSpider
- WebReaper
- webs
- Websnarf
- WebSpider
- WebVac
- webwalk
- WebWalker
- WebWatch
- Wget
- whatUseek Winona
- WhoWhere Robot
- Wired Digital
- Weblog Monitor
- w3mir
- WebStolperer
- The Web Wombat
- The World Wide Web Worm
- WWWC Ver 0.2.5
- WebZinger
- XGET
Thanks For Reading