Posted by Lindsay
Some of the Internet’s most important pages, on some of its most linked-to domains, are blocked by a robots.txt file. Does your website misuse the robots.txt file, too? Find out how search engines really treat robots.txt-blocked files, entertain yourself with a few seriously flawed implementation examples and learn how to avoid the same mistakes yourself.
The robots.txt protocol was established in 1994 as a way for webmasters to indicate which pages and directories should not be accessed by bots. To this day, respectable bots adhere to the entries in the file… but only to a point.
Bots that follow the instructions of the robots.txt file, including Google and the other big guys, won’t index the content of the page but they may still put the page in their index. We’ve all seen these limited listings in the Google SERPs. Below are two examples of pages that have been excluded using the robots.txt file yet still show up in Google.
The Cisco login page highlighted below is blocked in the robots.txt file, but shows up with a limited listing on the second page of a Google search for ‘login’. Note that the title tag and URL are included in the listing. The only thing missing is the meta description or a snippet of text from the page.
One of WordPress.com’s 100 most popular pages (in terms of linking root domains) is www.wordpress.com/next. It is blocked by the robots.txt file, yet it still appears in position four in Google for the query ‘next blog’.
As you can see, adding an entry to the robots.txt file is not an effective way of keeping a page out of Google’s search results pages.
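To see how a well-behaved crawler actually interprets these rules, here is a quick sketch using Python’s standard-library robots.txt parser. The domain and the `/login` rule are placeholders, not taken from any of the sites above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Feed the rules directly instead of fetching them over the network
rp.parse([
    "User-agent: *",
    "Disallow: /login",
])

# A compliant bot will refuse to fetch the blocked path...
blocked = rp.can_fetch("*", "http://www.example.com/login")   # False
# ...but any other page remains fair game
allowed = rp.can_fetch("*", "http://www.example.com/about")   # True
```

Note what the parser does and doesn’t control: it only answers “may I fetch this URL?” Nothing stops the engine from listing the blocked URL anyway, which is exactly the limited-listing behavior shown above.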
The problem with using the robots.txt file to block search engine indexing is not only that it is quite ineffective; it also cuts off your inbound link flow. When you block a page using the robots.txt file, the search engines don’t index the contents (OR LINKS!) on the page. This means that if you have inbound links to the page, that link juice cannot flow on to other pages. You create a dead end.
(If this depiction of Googlebot looks familiar, that’s because you’ve seen it before! Thanks Rand.)
Even though the inbound links to the blocked page likely have some benefit to the domain overall, this inbound link value is not being utilized to its fullest potential. You are missing an opportunity to pass some internal link value from the blocked page to more important internal pages.
Ouch, Digg. That’s a lot of lost link love!
This leads us to our first seriously flawed example of robots.txt use.
Digg.com used its robots.txt file to maximum disadvantage, blocking a page with an astounding 425,000 unique linking root domains: the "Submit to Digg" page.
The good news for Digg is that from the time I started researching for this post to now, they’ve removed the most harmful entries from their robots.txt file. Since you can’t see this example live, I’ve included Google’s latest cache of Digg’s robots.txt file and a look at Google’s listing for the submit page(s).
As you can see, Google hasn’t yet begun indexing the content that Digg.com had previously blocked via robots.txt.
I would expect Digg to see a nice jump in search traffic following the removal of its most-linked-to pages from the robots.txt file. They should probably keep these pages out of the index with the robots meta tag, ‘noindex’, so as not to flood the engines with redundant content. This move would ensure that they benefit from the link juice without bloating the search engine indexes.
If you aren’t up to speed on the use of noindex, all you have to do is place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex, follow">
By adding ‘follow’ to the tag, you are telling the bots not to index that particular page while still allowing them to follow the links on it. This is usually the best scenario, as it means the link juice will flow to the followed links on the page. Take, for example, a paginated search results page. You probably don’t want that specific page to show up in the search results, since the contents of page 5 of a given search will change from day to day. But by using ‘noindex, follow’, the links to products (or jobs, in this example from Simply Hired) will still be followed and hopefully indexed.
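Putting that together, the head of a hypothetical paginated results page (the title and URL here are made up for illustration) would look something like this:

```html
<head>
  <title>Search results for "designer", page 5</title>
  <!-- Keep this page out of the index, but let bots follow the
       result links so link juice still flows to them -->
  <meta name="robots" content="noindex, follow">
</head>
```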
Alternatively, you can use "noindex, nofollow", but that’s a mostly pointless endeavor, as you’re blocking link juice just as you would with robots.txt.
Blogger.com is the brand behind Google’s blogging platform, with blogs hosted on subdomains like ‘yourblog.blogspot.com’. The link juice blockage and robots.txt issue that arises here is that www.blogspot.com is entirely blocked by the robots.txt file. As if that weren’t enough, when you try to pull up the home page of Blogspot, you are 302 redirected to Blogger.com.
Note: All subdomains, aside from ‘www’, are accessible to robots.
A better implementation here would be a straight 301 redirect from the home page of Blogspot.com to the main landing page on Blogger.com. The robots.txt entry should be removed altogether. This small change would unlock the hidden power of more than 4,600 unique linking domains. That is a good chunk of links.
When a popular page is expired or moved, the best solution is usually a 301 redirect to the most suitable final replacement.
In the big-site examples highlighted above, we’ve covered some misuses of the robots.txt file, but they don’t cover every scenario. Below is a list of effective ways to keep content out of the search engine index without leaking link juice.
In most cases, the best replacement for robots.txt exclusion is the robots meta tag. By adding ‘noindex’ and making sure that you DON’T add ‘nofollow’, your pages will stay out of the search engine results but will pass link value. This is a win/win!
The robots.txt file is no place to list old worn out pages. If the page has expired (deleted, moved, etc.) don’t just block it. Redirect that page using a 301 to the most relevant replacement. Get more information about redirection from the Knowledge Center.
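On an Apache server, a 301 redirect like this is a one-liner. The paths below are placeholders, not taken from any of the sites discussed above:

```apache
# Hypothetical .htaccess rule (Apache mod_alias): permanently redirect
# an expired page to its most relevant replacement
Redirect 301 /old-page http://www.example.com/new-page
```

Because it’s a 301 (permanent) rather than a 302 (temporary), search engines know to transfer the old page’s link value to the new URL.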
Don’t block your duplicate page versions in the robots.txt. Use the canonical tag to keep the extra versions out of the index and to consolidate the link value whenever possible. Get more information from the Knowledge Center about canonicalization and the use of the rel=canonical tag.
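The canonical tag goes in the &lt;head&gt; of each duplicate version and points at the one URL you want indexed. The domain and path here are placeholders:

```html
<!-- Placed in the <head> of a duplicate version, e.g. a URL reached
     with tracking parameters, pointing at the preferred version -->
<link rel="canonical" href="http://www.example.com/products/widget">
```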
The robots.txt file is not an effective way of keeping confidential information out of the hands of others. If you are making confidential information accessible on the web, password protect it. If you have a login screen, go ahead and add the ‘noindex’ meta tag to the page. If you expect a lot of inbound links to this page from users, be sure to link to some key internal pages from the login page. This way, you will pass the link juice through.
The best way to use a robots.txt file is to not use it at all. Well… almost. Use it to indicate that robots have full access to all files on your website and to direct robots to your sitemap.xml file. That’s it.
Your robots.txt file should look like this:
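A minimal sketch, assuming your sitemap lives at the root of your domain (swap in your own URL):

```
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
```

The empty `Disallow:` line grants all compliant bots full access, and the `Sitemap:` line points them straight to your sitemap.xml file.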
Earlier in the post I referred to "bots that follow the instructions of the robots.txt file," which implies that there are bots that don’t adhere to robots.txt at all. So while you’re doing a good job of keeping out the good bots, you’re doing a horrible job of keeping out the "bad" ones. Additionally, filtering to allow bot access only to Google/Bing isn’t recommended for three reasons:
If your competitors are SEO savvy in any way, shape or form, they’re looking at your robots.txt file to see what they can uncover. Let’s say you’re working on a new redesign, or a whole new product line, and you have a line in your robots.txt file that disallows bots from "indexing" it. If a competitor comes along, checks out the file and sees a directory called "/newproducttest", they’ve just hit the jackpot! Better to keep that on a staging server, or behind a login. Don’t give all your secrets away in this one tiny file.
Both Rand Fishkin and Andy Beard have covered robots.txt misuse in the past. Take note of the publish dates and be careful with both of these posts, though, because they were written before the practice of internal PageRank sculpting with the nofollow link attribute was discouraged. In other words, they are a little dated, but the concept descriptions are solid.