A spider is a program that crawls the web and fetches webpages for search engines. It can start virtually anywhere and reach any page by following links.
When a search engine visits a web site, either through a submission or by following a link from one site to another, the search engine robot (also known as a crawler, agent, bot or spider) looks for a text file called robots.txt. The file normally resides in the root directory of the site, such as "www.abc.com/robots.txt". Robots.txt instructs spiders which webpages of the website to visit and which not to visit.
A robots.txt file is just a simple text file; it doesn’t require any special formatting such as font face or size. Make sure that the file is saved as robots.txt (all lowercase).
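As a small illustration of where the file lives, the following Python sketch derives the robots.txt address from any page URL on a site. The helper name robots_url and the sample page path are assumptions made for this example; only www.abc.com comes from the text above.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the root-level robots.txt address for the site serving page_url."""
    # A leading slash makes urljoin replace the whole path, so the result
    # always points at the root directory of the same host.
    return urljoin(page_url, "/robots.txt")


print(robots_url("http://www.abc.com/cars/prices.htm"))
```

Whatever page a spider starts from, it would look for the file at the root of that host rather than in the page's own folder.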
The three most common items you will find in a robots.txt file are:
- allow
- disallow
- and the wildcard or asterisk: "*"
Normally you would use the "disallow" command so that an engine does not index certain areas of your site, while the "allow" command is usually redundant, since engines will follow any link that you have not prohibited. Finally, the wildcard stands for all engines. Thus, if you had a folder called "images" under the main directory, such as "www.abc.com/images/", you might use the following coding to disallow all spiders from that folder:
User-agent: *
Disallow: /images/
Note the trailing slash, which is what you want when you really mean to block a folder rather than individual files, as in:
Disallow: /images/
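To see how such a rule behaves, here is a minimal sketch that feeds it to the robots.txt parser in Python's standard library (urllib.robotparser). The helper build_parser and the sample file names logo.png and images.html are hypothetical, and other crawlers may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_text):
    """Parse robots.txt content given as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


# The rule from the article: block the folder for every spider.
folder_rule = build_parser("User-agent: *\nDisallow: /images/")
# The same rule without the trailing slash, for comparison.
prefix_rule = build_parser("User-agent: *\nDisallow: /images")

logo = "http://www.abc.com/images/logo.png"   # hypothetical file in the folder
page = "http://www.abc.com/images.html"       # hypothetical page outside it

print(folder_rule.can_fetch("*", logo))   # file inside the blocked folder
print(folder_rule.can_fetch("*", page))   # page that only shares the prefix
print(prefix_rule.can_fetch("*", page))   # same page, rule without the slash
```

With this parser the folder rule blocks the file inside /images/ but leaves images.html alone, while the version without the trailing slash also blocks images.html, because rules are matched as path prefixes.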
We also have to keep an eye (through log files or web analytics) on the different spiders a single search engine may send, such as “Googlebot” and “Googlebot-Image”. We have to be very clear about which files and images need to be indexed by which spider; for example, images need to be indexed by the image bot.
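A sketch of per-spider rules, again checked with Python's urllib.robotparser: the user-agent tokens Googlebot and Googlebot-Image are the names Google publishes for its crawlers, while the folder and the file car.jpg are assumptions for illustration. The more specific group is listed first so that a simple first-match parser such as this one picks it up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the image spider may read /images/,
# the general web spider may not.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Allow: /images/

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

picture = "http://www.abc.com/images/car.jpg"  # hypothetical image file

print(parser.can_fetch("Googlebot-Image", picture))  # image spider
print(parser.can_fetch("Googlebot", picture))        # general spider
```

Run as written, the first check is allowed and the second is refused, which is the split between spiders described above.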
Meta Tags and Robots.txt
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
Indicates neither to index the webpage nor to follow its links.
<META name="ROBOTS" content="NOINDEX">
Indicates not to index the webpage.
<META name="ROBOTS" content="NOFOLLOW">
Indicates not to follow any links on the webpage.
<META name="ROBOTS" content="NOINDEX, FOLLOW">
Indicates to follow the links on the webpage but not to index the webpage.
<META name="ROBOTS" content="INDEX, NOFOLLOW">
Indicates to index the webpage but not to follow its links.
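The combinations above can be read mechanically. The following sketch uses Python's standard html.parser to pull the directives out of a page; the class name RobotsMetaParser and the function robots_directives are hypothetical names chosen for this example, and treating a missing directive as permission is an assumption about default spider behavior.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html_text):
    """Return the set of lowercase robots directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


found = robots_directives('<META name="ROBOTS" content="NOINDEX, FOLLOW">')
may_index = "noindex" not in found    # assumed default: index unless told not to
may_follow = "nofollow" not in found  # assumed default: follow unless told not to
print(may_index, may_follow)
```

For the NOINDEX, FOLLOW tag this reports that the page should not be indexed but its links may be followed.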
Outbound links & robots.txt
Outbound links are links that contribute to the PR (PageRank) of a webpage. They mainly concern websites that let outsiders or visitors post content, where people post useless content and links to their own websites to promote those websites or their products. You can block these types of attempts by taking the following action:
<a href="http://www.abc.com/cars.htm" rel="nofollow">the truth about cars.</a>
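On a site that accepts visitor posts, this attribute is usually added by the code that renders the posts. A minimal sketch in Python follows; the function name nofollow_link is hypothetical, and a real site would also need to validate the submitted URL.

```python
from html import escape


def nofollow_link(url, text):
    """Render a visitor-submitted link with rel="nofollow" added."""
    # Escape both values so the visitor cannot inject extra markup.
    safe_url = escape(url, quote=True)
    safe_text = escape(text)
    return '<a href="{}" rel="nofollow">{}</a>'.format(safe_url, safe_text)


print(nofollow_link("http://www.abc.com/cars.htm", "the truth about cars."))
```

Called with the URL and text from the example above, it produces the same anchor tag.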
With the proper robots.txt in place, link building services can create a very positive effect on search engine rankings.
Conclusion
Robots.txt is a vital part of any website. It can be compared with the traffic control system of a city, so it needs to be kept up to date with all possible directions. Robots.txt can also help prevent spam and the penalties associated with duplicate content.
We humans risk our health to earn money, and then we give away money to earn our health back. Likewise, when we try to get indexed by all available spiders, we also expose the site to BAD agents generated by software, which can download a mirror of your website for plagiarism or steal your clients by posting a similar website. We lose bandwidth, documents, images, AdSense money and prospective business.
So, we need to take control of robots.txt to save on resources, minimize the risk of losing content, money and prospective business, and ENJOY the growth.
If you need any help, you can always write to amit@r2ainformatics.com.
All the best!!!!
Amit P Gupta
Web Strategist