Robots.txt and Search Engine Spiders - SEO

Amitgupta007_99
Posted in SEO category for Intermediate level | Views: 19481

A spider is a program that crawls the web and fetches webpages for search engines. It can start virtually anywhere and go everywhere by following links.
When a search engine visits a web site, either through a submission or by following a link from one site to another, the search engine robot (also known as a crawler, agent, bot or spider) will look for a text file called robots.txt. The file normally resides in the root directory of the site, such as "www.abc.com/robots.txt". Robots.txt instructs spiders which webpages of the website they may or may not visit.
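
To illustrate where a spider looks, here is a minimal Python sketch that derives the robots.txt location from any page URL. It uses only the standard library. The helper name robots_url is hypothetical, and www.abc.com is the placeholder domain used in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, point the path at /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.abc.com/images/logo.png"))
```

Whatever page the spider starts from, the file it asks for is always the one in the root of that host.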

A robots.txt file is just a simple text file; it doesn’t require any special type of formatting like font face, size etc. Make sure that the file is saved as (all lowercase) robots.txt.

The three most common items you will find in a robots.txt file are:
  • allow 
  • disallow 
  • and the wildcard or asterisk: "*" 

Normally you would use the "disallow" command so that an engine does not crawl and index certain areas of your site, while the "allow" command is largely redundant, since spiders will usually follow any link that you have not prohibited. Finally, the wildcard stands for all engines. So if you had a folder called "images" under the main directory, such as "www.abc.com/images/", you might use the following coding to disallow all spiders from that folder:

User-agent: *
Disallow: /images/

Keep the trailing slash when you really mean to block a whole folder rather than individual files, as in:
Disallow: /images/
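
One way to check what such a rule blocks is to feed it to Python's standard urllib.robotparser module. This is only a sketch: the agent name ExampleBot and the sample URLs are made up for illustration, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown above, parsed from a string instead of a live site.
RULES = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if this parser would let the agent fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "http://www.abc.com/images/logo.png",   # inside the blocked folder
    "http://www.abc.com/cars.htm",          # ordinary page
    "http://www.abc.com/images-old/a.png",  # different folder, same prefix
):
    print(url, allowed(url))
```

Because the rule ends with a slash, only paths under /images/ are matched. A folder such as /images-old/ is left alone.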

We also have to keep an eye (through log files or web analytics) on the different spiders a single search engine may send, like “Googlebot” and “Googlebot-Image”. We have to be very clear about which files and images need to be indexed by which spider; for example, images need to be indexed by the image spider.
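
As a sketch of per-spider rules, the hypothetical robots.txt below lets Google's image crawler (user-agent token Googlebot-Image) into /images/ while keeping the main Googlebot out. The folder layout and sample URLs are assumptions for illustration. The file is checked with Python's urllib.robotparser, which uses the first group whose name matches the agent, so the more specific group is listed first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group per spider from the same search engine.
RULES = """\
User-agent: Googlebot-Image
Allow: /images/

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

image_url = "http://www.abc.com/images/logo.png"
page_url = "http://www.abc.com/cars.htm"

for agent in ("Googlebot-Image", "Googlebot"):
    for url in (image_url, page_url):
        print(agent, url, parser.can_fetch(agent, url))
```

The image spider may fetch the image, the main spider may not, and both may fetch the ordinary page.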

Meta Tags and Robots.txt

<META name="ROBOTS" content="NOINDEX, NOFOLLOW"> 
Indicates neither to index the webpage nor to follow its links.

<META name="ROBOTS" content="NOINDEX"> 
Indicates not to index the webpage. 

<META name="ROBOTS" content="NOFOLLOW"> 
Indicates not to follow any links on the webpage 

<META name="ROBOTS" content="NOINDEX, FOLLOW"> 
Indicates to follow links on webpage but not index the web page. 

<META name="ROBOTS" content="INDEX, NOFOLLOW"> 
Indicates not to follow links on webpage but index the web page. 
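
To show how a crawler might read these tags, here is a minimal Python sketch built on the standard html.parser module. The class and function names are hypothetical, and a real spider does considerably more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of lowercase robots directives in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><META name="ROBOTS" content="NOINDEX, FOLLOW"></head></html>'
directives = robots_directives(page)
print("index page:", "noindex" not in directives)
print("follow links:", "nofollow" not in directives)
```

A page with no robots meta tag yields an empty set, which this sketch treats as permission to index the page and follow its links.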

Outbound links & robots.txt

Outbound links are links from your webpage to other sites, and they have a bearing on the PR (PageRank) of your webpage. The problem mainly arises on websites that let outsiders or visitors post content, where people post useless content and links to their own websites to promote those websites or their products. You can discourage such attempts by adding rel="nofollow" to the links they post:

<a href="http://www.abc.com/cars.htm" rel="nofollow">the truth about cars.</a> 
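
If your site builds visitor-posted links in code, the attribute can be added at that point. The following is a minimal Python sketch. The helper name nofollow_link is hypothetical, and escaping the values this way is an assumption about how the link is rendered, not a complete defence against hostile input.

```python
from html import escape


def nofollow_link(url, text):
    """Render a visitor-supplied link with rel="nofollow" added."""
    # Escape both values so they cannot break out of the tag.
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )


print(nofollow_link("http://www.abc.com/cars.htm", "the truth about cars."))
```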

With the proper robots.txt in place, link building services can create a very positive effect on search engine rankings. 


Conclusion


Robots.txt is a vital part of any website. It can be compared with the traffic control system of a city, so it needs to be kept up to date with all possible directions. Robots.txt can also help avoid spam and the penalties associated with duplicate content.

We humans risk health to earn money and then we give away money to earn the health back. In the same way, when we try to get indexed by all available spiders, we also expose the site to BAD agents: software-generated bots that can download a mirror of your website for plagiarism, or to steal your clients by posting a similar website. We lose bandwidth, documents, images, AdSense money and prospective business.

So, we need to take control of Robots.txt to save on resources, minimize the risk of losing content, money and prospective business, and ENJOY the growth.

If you need any help, write to amit@r2ainformatics.com.

All the best!!!! 


Amit P Gupta 
Web Strategist 

About the Author

Amitgupta007_99
Full Name: Amit Gupta
Member Level: Starter
Member Status: Member
Member Since: 7/23/2007 11:05:49 PM
Country:

http://www.r2ainformatics.com
His experience covers a wide spectrum: SEO, analytics, consulting, technical editing and college instruction. Amit holds more than 3 technical certifications and has completed an MCA. Amit may be reached at amit@r2ainformatics.com


Comments or Responses

Posted by: Animesh on: 12/4/2007
Very informative article .

But I have some doubts.

Is this robots.txt file created automatically when we host our application, or will it not exist until we create it?

And if it is created automatically, then who is responsible for creating it?

Thanks

Animesh Misra
Posted by: ProgTalk on: 4/20/2008
It is not created automatically. You usually have to manually create it and put it on your root directory.
Posted by: Raja on: 5/23/2008
No, Animesh. The robots.txt file will not be created automatically. You will have to create it yourself.

There are certain protocols you need to follow. You can get some information from http://www.robotstxt.org/ or http://www.google.com/webmasters/
