Do you know the importance of a robots.txt file? Read on to find out.

The success of big companies lies in keeping their confidential data secret, hidden from outsiders. This lets them execute their future course of action easily and change plans according to the situation. The job of a robots.txt file is similar: it can allow or disallow a search engine to visit some or all of your web pages. A human visitor, of course, remains free to visit those pages. That being the case, your website may look different to the search engines than it does to a visitor. If you think one or more of your pages aren’t suitable for the search engines to visit, you can keep the engines out.

Every search engine has a “robot” (a software program) that does the job of visiting websites. Its purpose is to gather a copy of each page and store it in the search engine’s database. If your site is not in that database, it will never show up in the search results.

Web robots are sometimes referred to as web crawlers or spiders, so the process of a robot visiting your website is called “spidering” or “crawling”. When somebody says “the search engines have spidered my website”, it means the search engine robots have visited it. Each robot has a name and its own IP address. The IP address is of no importance to us, but knowing the names will help, since a robot’s name is what we use when we create a robots.txt file. This is why the file is called “robots.txt”.

Given below is the list of the robots of some of the very popular search engines:

Search Engine            Robot
Alexa.com                ia_archiver
Altavista.com            Scooter
Alltheweb.com            FAST-WebCrawler
Excite.com               ArchitextSpider
Euroseek.net             Arachnoidea
Google.com               Googlebot (http://www.google.com/bot.html)
Hotbot.com               Slurp (uses Inktomi’s robot)
Inktomi.com              Slurp
Infoseek.com             UltraSeek
Looksmart.com            MantraAgent
Lycos.com                Lycos_Spider_(T-Rex)
Nationaldirectory.com    NationalDirectory-SuperSpider
UKSearcher.co.uk         UK Searcher Spider


Writing Robots.txt:

Let’s learn to write robots commands. Note that there are two ways to do so: one is to include all the commands in a text file called “robots.txt”, and the other is to write the commands in a meta tag.

We will learn both ways.

Writing robots commands in the meta tag:

There are 4 things you can tell a search engine robot when it visits your page:

1) Do not index this page – the search engines will not index the page.

2) Do not follow any links on this page – the search engines will not follow the links included in the page, i.e. they will not index any page that this page links to.

3) Do index this page – the search engines will index the page.

4) Do follow the links – the search engines will index the pages that this page links to.

Note that “indexing” is different from “spidering”. A search engine first spiders a page and then indexes it. Indexing means assigning the page a certain importance on the basis of its content, information, meta tags, and link popularity with respect to the searched keyword, all decided at query time. When you tell the search engines not to index a page, they know that the page exists but will not rank it; that is, a no-index page will never be shown in their search results. This does not mean a no-index page will get no visitors at all: it might still get them indirectly, from a page that links to it. Just no direct visitors from the search engines.

Suppose you want the search engines to index a page and also follow its links. Then include the following command in the meta tag:

<meta name="robots" content="index, follow">

Suppose you want the search engines to index a page but not follow its links then include the following command in the Meta Tag:

<meta name="robots" content="index, nofollow">

Suppose you do not want the search engines to index a page but follow its links then include the following command in the Meta Tag:

<meta name="robots" content="noindex, follow">

Suppose you do not want the search engines to either index or follow links of a particular page then include the following command in the Meta Tag:

<meta name="robots" content="noindex, nofollow">
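The four combinations above follow a simple pattern. As a small illustration (this helper function is hypothetical, not part of any standard library or of robots.txt itself), here is how the two yes/no choices map to the matching meta tag:

```python
def robots_meta(index=True, follow=True):
    """Build a robots meta tag from the two yes/no choices."""
    content = ("index" if index else "noindex") + ", " + \
              ("follow" if follow else "nofollow")
    return '<meta name="robots" content="%s">' % content

# Index the page, but do not follow its links:
print(robots_meta(index=True, follow=False))
# <meta name="robots" content="index, nofollow">
```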

Note:

Google keeps a “cached” copy, a small snapshot, of every page it spiders. Want to stop Google from doing so? Include the following meta tag:

<meta name="robots" content="noindex, nofollow, noarchive">

(If you only want to prevent caching but still have the page indexed, noarchive can also be combined with index, follow.)

Like any meta tag, the tags above should be placed in the HEAD section of an HTML page:

<html>
<head>
<title>your title</title>
<meta name="description" content="your description.">
<meta name="keywords" content="your keywords">
<meta name="robots" content="index, follow">
</head>
<body>

Creating robots.txt file:

A robots.txt file is an independent file and should be written in a plain text editor like Notepad. Do not use MS Word or any other word processor to create robots.txt. The bottom line is that this file must have the extension “.txt”, or it will be useless.

Let’s begin. Open Notepad (it comes free with Microsoft Windows) and save the file with the name “robots.txt”. Make sure that the extension is .txt.

By the way, did you notice we did not use the name of any robot in the meta tag? What does that indicate? Simple: with the meta tag you direct all the search engines to do, or not do, something on a page. You have no control over any one search engine. The solution is robots.txt.

It can always happen that you do not want a particular search engine to index a page for certain reasons. In that case a robots.txt file will help, even though I do not recommend such a thing. The search engines get you traffic, so why hate them? Stop them from doing their job and they will return the favor. I repeat: keep your pages smart for the search engines and welcome them. Fine, then why take the trouble to learn robots.txt? Why should you include a robots.txt file at all?

Let’s suppose yours is a dynamic database site containing information about your newsletter subscribers and customers: their addresses, phone numbers, etc. All this confidential information is kept in a separate directory called “admin”. (It is recommended to keep such information in a separate directory; handling the data will be easier for you, and so will keeping the search engines away, as we will see shortly.) I am sure you would never want any unauthorized person to visit this area, let alone the search engines. It does not help the search engines either, since they have nothing to do with the data or files there. Here comes the role of a robots.txt file.

Write the following in the robots.txt file:


User-agent: *
Disallow: /admin/


This tells the spiders not to index anything in the admin directory, including its sub-directories, if any.

The asterisk (*) indicates all the search engines. So how do you stop one particular search engine from spidering your files or directories?

Suppose you want to stop Excite from spidering this directory:


User-agent: ArchitextSpider
Disallow: /admin/


Suppose you want to stop Excite and Google from spidering this directory:


User-agent: ArchitextSpider
Disallow: /admin/

User-agent: Googlebot
Disallow: /admin/


Files are no different. Suppose you want a file datafile.html not to be spidered by Excite:


User-agent: ArchitextSpider
Disallow: /datafile.html


Similarly, suppose you do not want it to be spidered by Google either:


User-agent: ArchitextSpider
Disallow: /datafile.html

User-agent: Googlebot
Disallow: /datafile.html


Suppose you want two files datafile1.html and datafile2.html not to be spidered by Excite:


User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html


Can you guess what the following means?


User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html

User-agent: Googlebot
Disallow: /datafile1.html


Excite will not spider datafile1.html or datafile2.html. Google will not spider datafile1.html only; it will spider datafile2.html and the rest of the files in the directory.
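You can check this reasoning with Python’s standard urllib.robotparser module, which implements the same rules a well-behaved robot applies. A quick sketch, feeding it the file above:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the example above.
rules = """\
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html

User-agent: Googlebot
Disallow: /datafile1.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Excite's robot is blocked from both files:
print(rp.can_fetch("ArchitextSpider", "/datafile1.html"))  # False
print(rp.can_fetch("ArchitextSpider", "/datafile2.html"))  # False

# Googlebot is blocked only from datafile1.html:
print(rp.can_fetch("Googlebot", "/datafile1.html"))  # False
print(rp.can_fetch("Googlebot", "/datafile2.html"))  # True
```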

Imagine you have a file in a sub-directory that you would not like to be spidered. What do you do? Let’s suppose the sub-directory is “official” and the file is “confidential.html”.


User-agent: *
Disallow: /official/confidential.html


If the syntax of your robots.txt file is not written correctly, the search engines will ignore the offending command. Double-check the file for any possible errors before uploading it. You should upload the robots.txt file to the ROOT directory of your server; the search engines look for robots.txt only there.

Note:

You should be able to see your robots.txt file if you type the following into the address bar of your Internet browser:

http://www.your-domain.com/robots.txt

Here is Google’s Robots.txt file:

http://www.google.com/robots.txt

All the major search engine robots honor robots.txt commands, although compliance is voluntary; a badly behaved robot can simply ignore the file.

You can look in your web server’s log files to see which search engine robots have visited. They all leave signatures that can be detected; these signatures are simply the names of their robots. For instance, if Google has spidered your site, its requests will appear in the log with the user-agent name Googlebot. This is how you know which search engine has spidered your pages, and when!
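As a rough sketch of what that log inspection looks like, here is how you might pull the user-agent out of one line of an Apache-style “combined” access log and match it against the robot names listed earlier. The log format and the sample line are assumptions; your server may log in a different layout:

```python
import re

# One line of an Apache "combined" access log (assumed format).
# The last double-quoted field is the visitor's user-agent string.
line = ('66.249.66.1 - - [10/Oct/2004:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" '
        '"Googlebot/2.1 (+http://www.google.com/bot.html)"')

# Robot names from the table earlier in the article.
robots = ["Googlebot", "Slurp", "ArchitextSpider", "ia_archiver", "Scooter"]

# Extract the user-agent (the last quoted field) and see which
# known robot names appear in it.
user_agent = re.findall(r'"([^"]*)"', line)[-1]
visitors = [name for name in robots if name in user_agent]
print(visitors)  # ['Googlebot']
```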

We are highly experienced in SEO/SEM/Pay Per Click Management. Contact us regarding any query you may have.