Viewing Thread:
"Making use of the Robots Exclusion Protocol with a robots.txt file"

The "Freeola Customer Forum" forum, which includes Retro Game Reviews, has been archived and is now read-only. You cannot post here or create a new thread or review on this forum.

Thu 28/05/09 at 15:48
Regular
"It goes so quickly"
Posts: 4,083
Making use of the Robots Exclusion Protocol with a robots.txt file on your web site

Once online, your web site can be accessed by many millions of people from all over the world, whether they are using a desktop PC at home, a laptop at school or a mobile phone down the pub. That is, after all, how the Internet works – but if you thought humans were the only web users these days, you’d be mistaken, because robots surf the web too.

Well, maybe not quite in the manner that I suggest above, but robots in the form of computer programs are out and about on the Internet, consuming web site after web site for its content. Some of these robots are good, some of them bad, but they are out there, and they are everywhere.

Some of the good robots are often known as web crawlers, and are built and used by search engine companies like Google, Bing, Yahoo, and Ask Jeeves, and these very search engines are the doorway many millions of people walk through to get to a web site, any web site, maybe even your web site.

Having your web site listed on a search engine’s results page is generally a good thing, but there may be times when certain content of yours is perhaps best left unfound and unlisted, either for a certain amount of time or permanently, for a number of reasons.

Trying out a new design!

If you’ve already got yourself a web site up and running, and have had it there for some time, you may decide to re-design it, and upload the new version so you can test it out before replacing your old design. The problem is that if search engines find this new site, they may begin listing it, even if it isn’t quite ready, and people may come to an uncompleted web site, and leave again.

Keeping an old design uploaded for a while, just in case!

You may have already redesigned your site, consider it to work fine, and have updated all your links and such-like, but kept your old design uploaded, just in case you need to go back to it, because of a bug, or simply because your users don’t like the new one.

A similar problem here is that if search engines continue to list it, users may come to your old design from the search engine results page, and not like what they see. After all, you re-designed for a reason, didn’t you?

Printer friendly pages

On many web sites, a “printer friendly” option is made available to anyone who needs a hard-copy of what they have just read. This version typically has a larger font, or reduced or clearer images, but in most cases the navigation and other irrelevant information is removed from the page, so that ink isn’t wasted and the print-out is better utilised.

Unfortunately, search engines are unable to tell which version is the main web page and which is the printer version, and may list your printer version on their results pages, which you might not want to happen, as it doesn’t have your full design or navigational links, and may make the user move on elsewhere.

Robots Exclusion Protocol!

This is where the “Robots Exclusion Protocol” can help you out, as it enables you to include some basic instructions to search engines about certain web pages you do not want them to consider when scanning and indexing your web site. The above reasons are just some that you might consider, but the goal of the “Robots Exclusion Protocol” is to ask bots (such as search engine crawlers) to leave parts of your site out of their index.

It is important to note that this technique should NOT be used in place of proper password protected areas of your web site, as not being listed in a search engine doesn’t prevent a page from being seen. This protocol is only to tidy up the indexing of your web site.
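
If a page genuinely needs protecting, a password is the right tool. As a rough sketch only, assuming an Apache server that honours .htaccess files (and with a made-up path to the password file), a protected folder could contain an .htaccess file like this:

# Placeholder path to a password file created with the htpasswd tool
AuthType Basic
AuthName "Private area"
AuthUserFile /home/yoursite/.htpasswd
Require valid-user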

robots.txt

You can talk to a search engine bot using a small text file called robots.txt, and from this file tell it to ignore certain files, folders, or, if you desire, the whole web site.

You need to be careful with this tool, as if you make an error, you may find your web site isn’t indexed by anybody. Before making use of this feature, consider what it is you want to achieve, which pages you don’t want indexed, and why you feel that way about them.

The robots.txt file is the file that the search engines will look for, and they will only look for it in your top-level directory.

Search Engine bots will look for a plain text file called robots.txt under what is known as your “root directory”, or “top (level) directory”, and this basically means it’ll browse to http://example.com/[B]robots.txt[/B].

Creating the robots.txt file!

As you might have noticed, the file name ends in .txt, which makes it a bog-standard text file that can be created in Windows Notepad, or any other text editor software you may have. The important aspect is to call it robots.txt, as this is the file that the search engines look for. You can start with a blank file if you like, upload the robots.txt file to your web space, and see if you can access it via the web address http://yoursite.co.uk/[B]robots.txt[/B].
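
If you’d rather check from a script than a browser, here is a minimal sketch in Python (the address is a placeholder for your own domain):

import urllib.request

# Fetch the robots.txt file and confirm the server returned it successfully.
with urllib.request.urlopen("http://yoursite.co.uk/robots.txt") as response:
    print(response.status)                  # 200 means the file was found
    print(response.read().decode("utf-8"))  # show the file's contents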

Before writing any instructions to search engines, the first thing you’ll need to do is figure out what, if anything, you want to exclude. In our examples below, we’ll block out an old design, a new one, the printer versions, two single web pages, and the cgi-bin, because we don’t want our web scripts to be indexed either.

Adding your instructions!

The first thing that we need to do is tell the search engines that we are talking to them, and let them know to pay attention. This is done using the command User-agent, which can list each search engine bot by name, or use a wild-card star to indicate that they should all follow the instructions.

If you want to pass instructions to Yahoo (Slurp), Google (Googlebot) and Microsoft (MSNBot), you would need to know the bots’ names, and list each one on its own User-agent line at the top of the group, with the rules beneath applying to them all:

User-agent: [B]Googlebot[/B]
User-agent: [B]Slurp[/B]
User-agent: [B]MSNBot[/B]

The occasions when you might want to target a specific search engine will be rare, and in practically all cases you’ll want the instructions to be followed by all bots, so use the wild-card * value on the first line:

User-agent: [B]*[/B]

This tells all search engines that support the “Robots Exclusion Protocol” that they should adhere to the commands that follow.

With this done, all that you need to do from here is add any particular instructions that you might want a search engine to adhere to, such as to ignore a printer page. This is done with the Disallow command, which must be on a new line after the User-agent command has been used, and contain the name of the file or folder that you want to “disallow” from being indexed, for example:

User-agent: *
[B]Disallow: [I]/printer-friendly/[/I][/B]


And with those two very simple lines of code within a robots.txt file uploaded to your root directory, you’ve asked search engines not to index your printer friendly web pages.

One very important thing to note is that if you write this next command incorrectly, you may find that your whole web site becomes unlisted from every search engine. If you include just a single forward slash (/) after the Disallow command, it tells search engines not to index ANY of your web site:

User-agent: *
Disallow: [B]/[/B]


And with the above two lines, you’ve told search engines not to index anything!!!
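
The reverse is also part of the protocol: a Disallow line left empty blocks nothing, so the following two lines explicitly tell every bot that the whole site may be indexed:

User-agent: *
Disallow: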

With that scary thought out of the way, let’s exclude a few more parts of our web site from being indexed. We’ve done our printer friendly pages, so next let’s make sure our old design won’t be indexed. This is done in the same way, but ensure that the new Disallow command is on a new line:

User-agent: *
Disallow: /printer-friendly/
[B]Disallow: [I]/old-design/[/I][/B]


And after that, let’s get rid of any chance of our new design (which isn’t yet finished) and our cgi-bin scripts being listed too, like so:

User-agent: *
Disallow: /printer-friendly/
Disallow: /old-design/
[B]Disallow: [I]/new-site-design/[/I][/B]
[B]Disallow: [I]/cgi-bin/[/I][/B]


Finally, there are a couple of single pages that we also want blocked, as they are not important for other people to see in results listings:

User-agent: *
Disallow: /printer-friendly/
Disallow: /old-design/
Disallow: /new-site-design/
Disallow: /cgi-bin/
[B]Disallow: [I]/log-in.php[/I][/B]
[B]Disallow: [I]/private/me-only.php[/I][/B]


And there we have seven lines that ask search engines to ignore parts of our site when indexing, so that they can concentrate on the other parts of your web site.

Adding comments to your robots file!

You can also use the hash character (#) to add comments to the file, so that you can leave a tip for yourself or others later on as to why you excluded something, for example:

[B]# Last updated 25th May 2009[/B]
User-agent: * [B]# All robots[/B]
Disallow: /printer-friendly/ [B]# Don't index printer only pages[/B]
Disallow: /old-design/ [B]# Don't index my old web site[/B]
Disallow: /new-site-design/ [B]# Don't index my new design yet[/B]
Disallow: /cgi-bin/ [B]# Don't index scripts, they aren't content[/B]
Disallow: /log-in.php [B]# My log in page, not useful for others[/B]
Disallow: /private/me-only.php [B]# Just a page of links for me[/B]
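
If you want to double-check how a crawler will read your finished file, Python’s standard library includes a parser for this very protocol. A minimal sketch, assuming the file lives at example.com and using made-up page addresses:

import urllib.robotparser

# Download and parse the live robots.txt file.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Ask how a generic bot ("*") should treat particular pages.
print(rp.can_fetch("*", "http://example.com/index.html"))               # True: not disallowed
print(rp.can_fetch("*", "http://example.com/printer-friendly/a.html"))  # False: disallowed above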


Is there anything else I can do?

While the “Robots Exclusion Protocol” was written simply to exclude content, the robots.txt file can hold other instructions, some of which are common usages that many search engines support, and others that have been invented by only one search engine.

For example, Yahoo supports a Crawl-delay setting, which asks its bot to wait a given number of seconds between requests, and which can be included like so:

User-agent: Slurp
Crawl-delay: 3


If you’ve read over how to use a sitemap to tell Search Engines about your web pages, or any other articles related to sitemaps, then you may want to add your sitemap location to your robots file, using a single line that begins with Sitemap: and is followed by the full web address location, for example:

Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap_index.xml
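
If you haven’t seen a sitemap file before, it is just a small XML document listing your pages. A minimal sketch in the sitemaps.org format, with a placeholder address:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
</urlset>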


= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

As always, any comments, questions, and especially corrections are welcome.
Thu 28/05/09 at 16:28
Regular
"How Ironic"
Posts: 4,312
You must be superhuman..... :O
Thu 28/05/09 at 18:16
Staff Moderator
"Aargh! Broken..."
Posts: 1,408
Just to add:
Not to do directly with robots.txt, but it still tells robots what to do:
using one of the following meta tags in the <head> of a page tells robots to not index the page but follow links to other pages, to index the page but not follow links, or to neither index the page nor follow any links.

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Obviously only use one of these meta tags on the page.
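
To show where such a tag sits, here is a bare-bones sketch of a page (the title and content are placeholders):

<html>
<head>
<title>Members only</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
<body>
<p>Page content goes here.</p>
</body>
</html>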
