What Is a Robots.txt File?
Robots.txt is a text (not HTML) file you put on your site to tell search robots – also known as Web Wanderers, Crawlers, or Spiders – which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but generally search engines obey what they are asked not to do.
It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note saying “Please, do not enter” on an unlocked door – you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
The location of robots.txt is very important. It must be in the main directory
because otherwise user agents (search engines) will not be able to find
it – they do not search the whole site for a file named robots.txt.
Instead, they look first in the main directory and if they don’t find it
there, they simply assume that this site does not have a robots.txt
file and therefore they index everything they find along the way. So, if
you don’t put robots.txt in the right place, do not be surprised that
search engines index your whole site.
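For example, crawlers request the file only from the root of the host, so only the first of these two locations will ever be read (example.com is just a placeholder domain):
http://www.example.com/robots.txt – this is where crawlers look
http://www.example.com/blog/robots.txt – this copy will never be fetched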
Structure of a Robots.txt File
The structure of a robots.txt file is pretty simple (and not very flexible) – it is simply a list of user agents and disallowed files and directories. Basically, the syntax is as follows:
User-agent:
Disallow:
“User-agent” names the search engine crawler a group of rules applies to, and “Disallow” lists the files and directories to be excluded from indexing. In addition to the “User-agent:” and “Disallow:” entries, you can include comment lines – just put the # sign at the beginning of the line:
Example: all user agents are disallowed from crawling the /temp directory.
User-agent: *
Disallow: /temp/
This can serve as a simple default robots.txt file for a new website.
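As a minimal sketch, a default file that blocks nothing at all looks like this – an empty Disallow value means “exclude nothing”, and the first line is just a comment:
# Allow every robot to crawl the whole site
User-agent: *
Disallow: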
The Traps of a Robots.txt File
When you start making complicated files – i.e. you decide to allow different user agents access to different directories – problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradicting directives. Typos are misspelled user agents, misspelled directories, missing colons after User-agent and Disallow, and so on. Typos can be tricky to find, but in some cases validation tools help.
The more serious problem is with logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
The above example comes from a robots.txt file that allows all agents to access everything on the site except the /temp directory, and then adds a more restrictive record for Googlebot. The trap is that a crawler obeys only the most specific record that matches it: Googlebot will follow the “User-agent: Googlebot” record and ignore the “User-agent: *” record completely. In this example that happens to be safe, because /temp/ is repeated in the Googlebot record. But if you list a directory only under “User-agent: *” and forget to repeat it in the Googlebot record, Googlebot will crawl it anyway, even though you think you have told it not to touch it. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
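To make the trap concrete, here is a variant that looks almost identical but leaves /temp/ open to Googlebot, because /temp/ appears only in the general record that Googlebot never reads:
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /cgi-bin/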
Tools to Generate and Validate a Robots.txt File
Given the simple syntax of a robots.txt file, you can always read it yourself to see if everything is OK, but it is much easier to use a validator.
Why did this robot ignore my /robots.txt?
User agent: *
Disallow: /tmp/
This is wrong because there is no hyphen between “User” and “agent”, so the syntax is incorrect and robots will not recognize the rule.
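The corrected version is simply:
User-agent: *
Disallow: /tmp/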
In those cases when you have a complex robots.txt file – i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude – writing the file manually can be a real pain. But do not worry – there are tools that will generate the file for you.
What is more, there are visual tools that allow you to point and select which files and folders are to be excluded. But even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a drop-down list of user agents and a text box for you to list the files you don't want indexed.
Honestly, it is not much of a help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories – but it is better than nothing.
That is what the robots.txt file is for! All you need to do is put it in your root folder, follow a few standard rules, and you're good to go!
Robots.txt generators:
Common procedure:
- choose default / global commands (e.g. allow/disallow all robots);
- choose files or directories blocked for all robots;
- choose user-agent-specific commands:
  - choose an action;
  - choose a specific robot to be blocked.
As a general rule of thumb, I don't recommend using robots.txt generators, for a simple reason: don't create any advanced (i.e. non-default) robots.txt file until you are 100% sure you understand what you are blocking with it. Still, here are the two most trustworthy generators to check:
- Google Webmaster Tools: the Robots.txt generator allows you to create simple robots.txt files. What I like most about this tool is that it automatically adds all global commands to each specific user agent's commands (thus helping to avoid one of the most common mistakes).
- The SEObook Robots.txt generator unfortunately misses the above feature, but it is really easy (and fun) to use.
Robots.txt checkers:
- Google Webmaster Tools: the Robots.txt analyzer “translates” what your robots.txt dictates to the Googlebot.
- Robots.txt Syntax Checker finds some common errors within your file by checking for whitespace separated lists, not widely supported standards, wildcard usage, etc.
- A Validator for Robots.txt Files also checks for syntax errors and confirms correct directory paths.
Robots.txt generator, checker and analyzer tools:
- webpositionadvisor
- submitshop
- motoricerca
- dhtmlextreme
- phpweby
- seobook
How the robots.txt file works:
When a crawl process starts, the crawler (or robot, or spider, whatever you call it) looks for www.yourdomain.com/robots.txt before any other page, including the index. If the following appears in your file,
User-agent: *
Disallow: /
the crawler ignores the whole site.
Creating and using the Robots.txt file:
Use Notepad or WordPad (Save as Text Document), or even Microsoft Word (Save as Plain Text), to create the file.
Now, if you want to exclude some parts of your site or the entire site, continue reading below.
Place the following lines in your robots.txt:
- To ignore the whole site:
User-agent: *
Disallow: /
- To ignore specific directories, for example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /mypage/
- To allow a specific robot only (e.g. Googlebot; the same pattern works for bingbot, Slurp, etc.):
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
- To block a specific robot (replace BadBot with that robot's user-agent name):
User-agent: BadBot
Disallow: /
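Putting these pieces together, a typical small-site robots.txt might look like the following sketch (the directory names and the BadBot name are just placeholders):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /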
Robots.txt file for WordPress
- Install a plugin like Yoast's Robots Meta. This plugin adds meta tags to the head section of your pages and tells the search engines whether or not to index them. It also allows you to control search engine indexing for individual posts or pages.
- Create a robots.txt file. This is very simple: you can use Notepad for this task and save the file as robots.txt. Alternately, you can generate the file using an online generator. Once you have finished, upload the file to your blog's root directory.
I use the following rules in my robots.txt file. It covers the steps above plus a few more.
Note that I block search engines from crawling my category pages because I decided to use tags in this blog.
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-login.php
Disallow: /*wp-login.php*
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /author
Disallow: /contact/
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /category/*
Disallow: /category/
Disallow: /login/
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
Disallow: /*.php$
Disallow: /*?*
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# alexa archiver
User-agent: ia_archiver
Disallow: /
# disable duggmirror by Digg
User-agent: duggmirror
Disallow: /
# allow google image bot to search all images
User-agent: Googlebot-Image
Disallow: /wp-includes/
Allow: /*
# allow adsense bot on entire site
User-agent: Mediapartners-Google*
Disallow:
Allow: /*
Once you have set a robots.txt file for your blog, you can test it to see if it does what it should (blocking certain pages).
To test it, you can use the Crawler Access tool in Google Webmaster Tools:
- In GWT, go to Site Configuration -> Crawler Access.
- In this page, make sure the text area of the robots.txt file has been downloaded recently by Google and that it reflects the most recent changes you have made.
- In the URLs box, type different URLs to test against (for example, yourblog.com/wp-admin/) and click on the Test button.
The result displays something like “Blocked by line 3: Disallow: /wp-admin”.
If it doesn’t, you missed something when creating the file.
Robots.txt file in Joomla
As mentioned, the robots.txt file sits in your site's root folder. It contains info on which folders should be indexed and which should not. It can also include information about your XML sitemap.
There are just two tips I would recommend regarding SEO and the robots.txt file:
1. Remove exclusion of images
For reasons I don't understand, the default robots.txt file in Joomla
is set up to exclude your images folder. That means your images will not be indexed by Google or included in its Image Search – and that image indexing is something you would want, as it adds another level to your site's search engine visibility.
To change this, open your robots.txt file and remove the line that says:
Disallow: /images/
By removing this line, Google will start indexing your images on the next crawl of your site.
2. Add a reference to your sitemap.xml file
I've talked about the Sitemap XML file previously, in my post on How
to get your Joomla site indexed in Google. If you have a sitemap.xml
file (and you should have!), it will be good to include the following
line in your robots.txt file:
Sitemap: http://www.domain.com/sitemap.xml
Naturally, this line needs to be adjusted to fit your domain and
sitemap file. In my case, I use the Xmap component to create the Sitemap
XML file automatically.
So, the line looks like this for Joomlablogger.net:
Sitemap: http://www.joomlablogger.net/component/option,com_xmap/lang,en/sitemap,1/view,xml/
Other than that, the robots.txt file can live happily at peace in your site's root folder.
Robots.txt file for Drupal
Drupal 6 provides a standard robots.txt file that does an adequate job. The file carries instructions for the robots and spiders that may crawl your site.
Robots.txt directives
Now that we have seen what the default file is there for, let's take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but that's why I'm here. It truly is worth it to understand exactly what you're telling the search engines to do.
Pattern matching
Google (but not all search engines) understands some wildcard characters: * matches any sequence of characters, and $ anchors a pattern to the end of a URL.
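As an illustrative sketch (these rules are examples, not part of the default Drupal file), the wildcards let you block whole classes of URLs for Google:
User-agent: Googlebot
# Block any URL that contains a query string
Disallow: /*?
# Block every PDF file on the site
Disallow: /*.pdf$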
Editing your robots.txt file
There may be a few times throughout your website’s SEO campaign that you’ll need to make changes to your robots.txt file. This section provides the necessary steps to make each change.
1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt.
2. Using your FTP program or command line editor, navigate to the top level of your Drupal website and locate the robots.txt file.
3. Make a backup of the file.
4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.
5. Most directives in the robots.txt file are grouped under a User-agent: line. If you are going to give different instructions to different engines, be sure to place their records above the User-agent: * record, as some search engines will only read the directives for * if their specific instructions come after that section.
6. Add the lines you want.
7. Save your robots.txt file, uploading it if necessary, replacing the existing file. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Problems with the default Drupal robots.txt file
There are several problems with the default Drupal robots.txt file. If you use Google Webmaster Tool's robots.txt testing utility to test each line of the file, you'll find that a lot of paths which look like they're being blocked will actually be crawled.
The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.
For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster!
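In other words, a Disallow path is matched as a prefix of the URL, so the two forms behave differently:
# Blocks /admin/ and everything below it, but NOT the /admin page itself
Disallow: /admin/
# Blocks /admin itself, plus /admin/ and everything below it
Disallow: /admin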
Fortunately, this is relatively easy to fix.
Fixing the Drupal robots.txt file
Carry out the following steps in order to fix the Drupal robots.txt file:
1. Make a backup of the robots.txt file.
2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
3. Find the Paths (clean URLs) section and the Paths (no clean URLs)
section. Note that both sections appear whether you've turned on clean
URLs or not. Drupal covers you either way. They look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
4. Duplicate the two sections (simply copy and paste them) so that
you have four sections—two of the # Paths (clean URLs) sections and two
of # Paths (no clean URLs) sections.
5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.
6. Delete the trailing / after each Disallow line in the fixed!
sections. You should end up with four sections that look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes. Now your robots.txt file is working as you would expect it to.
Additional changes to the robots.txt file
Using directives and pattern matching commands, the robots.txt file can exclude entire sections of the site from the crawlers like the admin pages, certain individual files like cron.php, and some directories like /scripts and /modules.
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations (a combined sketch follows the list):
• You are developing a new site and you don't want it to show up in any search engine until you're ready to launch it. Add Disallow: / just after the User-agent: * line.
• Say you're running a very slow server and you don't want the
crawlers to slow your site down for other users. Adjust the Crawl-delay
by changing it from 10 to 20.
• If you're on a super-fast server (and you should be, right?) you
can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1
second. Monitor your server closely for a few days to make sure it can
handle the extra load.
• Say you're running a site which allows people to upload their own
images but you don't necessarily want those images to show up in Google.
Add these lines at the bottom of your robots.txt file:
User-agent: Googlebot-Image
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$
If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/
• Say you noticed in your server logs that there was a bad robot out there that was scraping all your content. You can try to prevent this by adding the following to the bottom of your robots.txt file (Bad-Robot stands for the offending robot's user-agent name):
User-agent: Bad-Robot
Disallow: /
• If you have installed the XML Sitemap module, then you've got a
great tool that you should send out to all of the search engines.
However, it's tedious to go to each engine's site and upload your URL.
Instead, you can add a couple of simple lines to the robots.txt file.
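Putting a few of these tweaks together, as promised above, the adjusted parts of a Drupal robots.txt might look like this sketch (the delay value, image path, and bot name are illustrative):
# Ask crawlers to wait 20 seconds between requests
User-agent: *
Crawl-delay: 20

# Keep user-uploaded images out of Google Image Search
User-agent: Googlebot-Image
Disallow: /files/users/images/

# Shut out a content scraper spotted in the server logs
User-agent: Bad-Robot
Disallow: /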
Adding your XML Sitemap to the robots.txt file
Another way the robots.txt file helps you search engine optimize your Drupal site is by allowing you to specify where your sitemaps are located. While you probably want to submit your sitemap directly to Google and Bing, it's a good idea to put a reference to it in the robots.txt file for all of those other search engines.
You can do this by carrying out the following steps:
1. Open the robots.txt file for editing.
2. The Sitemap directive is independent of the User-agent line, so it doesn't matter where you place it in your robots.txt file.
3. Add a line pointing at your sitemap's location, for example: Sitemap: http://www.yoursite.com/sitemap.xml
4. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?). Go to http://www.yoursite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Here is a list of some of the many robots wandering the web:
- ABCdatos BotLink
- Acme.Spider
- Ahoy! The Homepage Finder
- Alkaline
- Anthill
- Walhello appie
- Arachnophilia
- Arale
- Araneo
- AraybOt
- ArchitextSpider
- Aretha
- ARIADNE
- arks
- AskJeeves
- ASpider (Associative Spider)
- ATN Worldwide
- Atomz.com Search Robot
- AURESYS
- BackRub
- Bay Spider
- Big Brother
- Bjaaland
- BlackWidow
- Die Blinde Kuh
- Bloodhound
- Borg-Bot
- BoxSeaBot
- bright.net caching robot
- BSpider
- CACTVS Chemistry Spider
- Calif
- Cassandra
- Digimarc Marcspider/CGI
- Checkbot
- ChristCrawler.com
- churl
- cIeNcIaFiCcIoN.nEt
- CMC/0.01
- Collective
- Combine System
- Conceptbot
- ConfuzzledBot
- CoolBot
- Web Core / Roots
- XYLEME Robot
- Internet Cruiser Robot
- Cusco
- CyberSpyder Link Test
- CydralSpider
- Desert Realm Spider
- DeWeb(c) Katalog/Index
- DienstSpider
- Digger
- Digital Integrity Robot
- Direct Hit Grabber
- DNAbot
- DownLoad Express
- DragonBot
- DWCP (Dridus’ Web Cataloging Project)
- e-collector
- EbiNess
- EIT Link Verifier Robot
- ELFINBOT
- Emacs-w3 Search Engine
- ananzi
- esculapio
- Esther
- Evliya Celebi
- FastCrawler
- Fluid Dynamics Search Engine robot
- Felix IDE
- Wild Ferret Web Hopper #1, #2, #3
- FetchRover
- fido
- Hämähäkki
- KIT-Fireball
- Fish search
- Fouineur
- Robot Francoroute
- Freecrawl
- FunnelWeb
- gammaSpider, FocusedCrawler
- gazz
- GCreep
- GetBot
- GetURL
- Golem
- Googlebot
- Grapnel/0.01 Experiment
- Griffon
- Gromit
- Northern Light Gulliver
- Gulper Bot
- HamBot
- Harvest
- havIndex
- HI (HTML Index) Search
- Hometown Spider Pro
- ht://Dig
- HTMLgobble
- Hyper-Decontextualizer
- iajaBot
- IBM_Planetwide
- Popular Iconoclast
- Ingrid
- Imagelock
- IncyWincy
- Informant
- InfoSeek Robot 1.0
- Infoseek Sidewinder
- InfoSpiders
- Inspector Web
- IntelliAgent
- I, Robot
- Iron33
- Israeli-search
- JavaBee
- JBot Java Web Robot
- JCrawler
- Jeeves
- JoBo Java Web Robot
- Jobot
- JoeBot
- The Jubii Indexing Robot
- JumpStation
- image.kapsi.net
- Katipo
- KDD-Explorer
- Kilroy
- KO_Yappo_Robot
- LabelGrabber
- larbin
- legs
- Link Validator
- LinkScan
- LinkWalker
- Lockon
- logo.gif Crawler
- Lycos
- Mac WWWWorm
- Magpie
- marvin/infoseek
- Mattie
- MediaFox
- MerzScope
- NEC-MeshExplorer
- MindCrawler
- mnoGoSearch search engine software
- moget
- MOMspider
- Monster
- Motor
- MSNBot
- Muncher
- Muninn
- Muscat Ferret
- Mwd.Search
- Internet Shinchakubin
- NDSpider
- Nederland.zoek
- NetCarta WebMap Engine
- NetMechanic
- NetScoop
- newscan-online
- NHSE Web Forager
- Nomad
- The NorthStar Robot
- nzexplorer
- ObjectsSearch
- Occam
- HKU WWW Octopus
- OntoSpider
- Openfind data gatherer
- Orb Search
- Pack Rat
- PageBoy
- ParaSite
- Patric
- pegasus
- The Peregrinator
- PerlCrawler 1.0
- Phantom
- PhpDig
- PiltdownMan
- Pimptrain.com’s robot
- Pioneer
- html_analyzer
- Portal Juice Spider
- PGP Key Agent
- PlumtreeWebAccessor
- Poppi
- PortalB Spider
- psbot
- GetterroboPlus Puu
- The Python Robot
- Raven Search
- RBSE Spider
- Resume Robot
- RoadHouse Crawling System
- RixBot
- Road Runner: The ImageScape Robot
- Robbie the Robot
- ComputingSite Robi/1.0
- RoboCrawl Spider
- RoboFox
- Robozilla
- Roverbot
- RuLeS
- SafetyNet Robot
- Scooter
- Sleek
- Search.Aus-AU.COM
- SearchProcess
- Senrigan
- SG-Scout
- ShagSeeker
- Shai’Hulud
- Sift
- Simmany Robot Ver1.0
- Site Valet
- Open Text Index Robot
- SiteTech-Rover
- Skymob.com
- SLCrawler
- Inktomi Slurp
- Smart Spider
- Snooper
- Solbot
- Spanner
- Speedy Spider
- spider_monkey
- SpiderBot
- Spiderline Crawler
- SpiderMan
- SpiderView(tm)
- Spry Wizard Robot
- Site Searcher
- Suke
- suntek search engine
- Sven
- Sygol
- TACH Black Widow
- Tarantula
- tarspider
- Tcl W3 Robot
- TechBOT
- Templeton
- TeomaTechnologies
- TITAN
- TitIn
- The TkWWW Robot
- TLSpider
- UCSD Crawl
- UdmSearch
- UptimeBot
- URL Check
- URL Spider Pro
- Valkyrie
- Verticrawl
- Victoria
- vision-search
- void-bot
- Voyager
- VWbot
- The NWI Robot
- W3M2
- WallPaper (alias crawlpaper)
- the World Wide Web Wanderer
- w@pSpider by wap4.com
- WebBandit Web Spider
- WebCatcher
- WebCopy
- webfetcher
- The Webfoot Robot
- Webinator
- weblayers
- WebLinker
- WebMirror
- The Web Moose
- WebQuest
- Digimarc MarcSpider
- WebReaper
- webs
- Websnarf
- WebSpider
- WebVac
- webwalk
- WebWalker
- WebWatch
- Wget
- whatUseek Winona
- WhoWhere Robot
- Wired Digital
- Weblog Monitor
- w3mir
- WebStolperer
- The Web Wombat
- The World Wide Web Worm
- WWWC Ver 0.2.5
- WebZinger
- XGET
Thanks For Reading