The robots exclusion protocol, better known as the robots.txt file, is a simple text file that webmasters create to give search engine robots instructions on how they should crawl and index a website and its pages. In practice, a robots.txt file works as a kind of anti-sitemap: where a sitemap tells search engines such as Google, Yahoo, and Bing what content you want them to discover and index, a robots.txt file tells them about the specific pages on your site that you don't want them to crawl.
The problem with robots.txt files is that webmasters almost always neglect them during search engine optimization, creating them at the last minute, which often makes them ineffective or even counter-productive. The absence of a robots.txt file won't prevent search engines from identifying and indexing your website or specific pages, but there is often a genuine need to keep search engine bots from crawling and indexing a website or a section of it.
Another point to note is that a misconfigured robots.txt file can block search engines from crawling your entire website, effectively erasing it from search results! The whole point of optimizing a robots.txt file is to keep your important content easily accessible while hiding what should stay hidden; misconfigure it and you will most likely lock bots out of core components of your website. Here's a comprehensive guide that will help you optimize your robots.txt file for better search engine optimization.
But wait, do I really need to use robots.txt?
Strictly speaking, no: a website works without one. But there is a routine need to keep search engine bots from crawling and indexing particular parts or sections of a site, and a robots.txt file is the standard way to do that. Just remember that it cuts both ways: configure it incorrectly and you can accidentally block bots from the most important parts of your blog.
In fact, the lack of a robots.txt file on your website would mean that it is:
- Not optimized for search engine crawlability
- Susceptible to SEO errors
- Vulnerable in terms of access to sensitive data
- Prone to constant hacks
- Not as good as the competition
- Prone to indexation issues
- Most likely confusing search engines
Step 1: Create or identify an existing robots.txt file
Before you do anything, it's best to determine whether your site has an existing robots.txt file or not. If you don't, you run the risk of overriding any configuration that might already be in place. The robots.txt file, if any, can be found in your website's root folder. For most websites, you can view this file using cPanel's file manager or an FTP client. If your website is based on WordPress, you'll find the file in the root of your WordPress installation.
In case you're not sure whether your site has a robots.txt file or not, you can easily find out by typing your website's address into the address bar, followed by "/robots.txt". It would look something like this:
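Using the hypothetical domain from later in this guide, the URL would be:

```text
https://www.genericwebsite.com/robots.txt
```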
If nothing comes up at this URL, you'll know that your website is running without a robots.txt file. In this case, all you need to do is create a simple text file using Notepad and save it as robots.txt. Once this is done, just upload the file to the root folder of your website, using either cPanel or an FTP client.
Step 2: Understanding the Rules of a Robots.txt File
Once you've created or located the file, you're going to use it to allow or disallow search engines from crawling and indexing parts of your site, and to do that you'll need to understand the set format its rules follow in order to achieve your specific search engine optimization goals.
Here's a simple example: you own a website called genericwebsite.com, which contains a sub-folder. Now say that this folder contains sensitive or private data, including but not limited to redundant data, testing information or other details that you want to keep hidden. Let's call the folder testing, assuming you created it for testing purposes. Now, your robots.txt should look something like this:
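A minimal sketch, assuming the folder sits at the root of the site:

```text
User-agent: *
Disallow: /testing/
```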
Seems simple? It kind of is, but remember that preventing a website or a web page from being crawled by search bots doesn't necessarily mean it won't show up in a search engine's index. This is especially true if the site was crawled earlier and its pages were allowed to be indexed. It's up to you to make sure this doesn't happen, and the best way to do so is to pair a disallow rule with "noindex" meta tags. And if the pages you want to keep hidden are still being displayed in a search engine's index, you'll have to remove them manually using that search engine's webmaster tools.
Also remember that:
- Asterisks (*) are typically used as a wildcard.
- Use the "allow" directive to make your site accessible to crawling.
- Use the "disallow" directive to block crawling of specific files or folders.
Step 3: Learn to use your robots.txt
Once you've created your robots.txt file or identified a pre-existing one, you're going to have to learn how to use the file to your advantage. As a rule of thumb, the first line of code in your file should name a user agent, which is the name given to a search engine's crawler. There are a number of user agents you can address, including Bingbot and Googlebot. You could restrict which user agents crawl your site, but it's recommended to allow all of them, especially if you're looking to increase traffic to your website.
To instruct all bots to crawl and index your site, simply write:
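The asterisk addresses every crawler, and an empty Disallow line means nothing is off-limits:

```text
User-agent: *
Disallow:
```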
Once you've entered that text in your robots.txt file, you'll have to end the sequence with allow or disallow commands, depending on which parts of your site you want the search engine bots to index. It would look something like this:
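For instance, to let all bots crawl everything except a couple of sensitive directories (hypothetical paths):

```text
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
```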
Blocking Pages
Alternatively, you can block search engines from a certain domain, or part of it, in a few different ways:
- Robots.txt – This instructs the user agent to avoid crawling the specified pages. Despite this, search engines might retain the site or page in their index and display it as a result, for instance when other sites link to it.
- Meta NoIndex – This lets the user agent access the page, but bans it from displaying the specified URL in search results. This is the most recommended technique.
- Nofollowing Links – This is the least reliable method of blocking user agents, as search engines can still discover the pages through other means, such as analytics data, browser tools, or links from other webpages.
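For reference, the Meta NoIndex approach from the list above is a tag placed in a page's HTML head; a minimal sketch:

```text
<meta name="robots" content="noindex">
```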
Step 4: Optimize your Robots.txt
Top to Bottom – One of the most common misconceptions about robots.txt files concerns the way a search engine interprets them. Generally, when search engines crawl and index websites, they read the robots.txt file from top to bottom. What this means is that the crawler may ignore instructions that come after a syntax error or other issue. The best way to maintain the integrity of your robots.txt, especially if you're not sure about your syntax or are trying something new, is to place the experimental syntax at the bottom of your file so that the other directives won't be ignored even if you do have errors.
Make proper use of Wildcards – The wildcard is a very useful tool, as it lets you create simple text commands that disallow patterns found in URLs. But wildcards should be used sparingly, not at every possible junction, because misconfiguring them can block large parts of your website by accident. Another important point to remember is that not all search engine crawlers support wildcards, which can put you in quite a fix if you don't know exactly what you're doing. Put any wildcard rules at the bottom of your file so that an error in them doesn't confuse a bot into ignoring your other directives.
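As a sketch of common wildcard patterns (Googlebot and Bingbot support * and the $ end-of-URL anchor, but not every crawler does):

```text
User-agent: *
Disallow: /*.pdf$
Disallow: /*?sessionid=
```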
Disallow – This is robots.txt optimization 101 for advanced-level users, but not so obvious to others. It is important to realize that a robots.txt file can only be used to prevent crawlers from accessing particular sites or pages, not to guide them towards URLs that need to be indexed; that's what a sitemap is for. Take special care here, because webmasters often make the mistake of writing directives that don't actually exist, causing massive errors.
Line Breaking – A typical search engine crawler might read your entire robots.txt file, but it does so in blocks. You first define a user agent, and then follow it with all the Disallow directives associated with that specific user agent. The best way to avoid mistakes is to keep each user agent's directives together in a single block, with no blank lines between them, and to place a blank line after the final disallow statement before defining a new user agent. If you don't make proper use of line breaks, you run the risk of creating errors that can cause your other directives to be ignored.
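The blocks described above would look something like this (user agents and paths are illustrative):

```text
User-agent: Googlebot
Disallow: /testing/
Disallow: /cgi-bin/

User-agent: Bingbot
Disallow: /testing/
```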
Other Useful Tips:
- Go through the entire library of directories on your website, as there will always be some that you will want to disallow a search engine from indexing. This can include your /scripts/, /cgi-bin/, /cart/, and /wp-admin/ directories, as well as any other directory that contains sensitive data.
- It is recommended that you prevent search engines from indexing particular directories within your website. Those that contain duplicate content would be a good start because allowing search engines to index more than one version of your content will negatively affect your SEO ranking.
- Make sure that you have nothing preventing the user agent from crawling and indexing the web content that you want others to have access to.
- Keep an eye out for specific site files for which you would want to implement the disallow directive. This includes files, data or scripts containing personal information such as phone numbers, email or physical addresses, and so on.
- Avoid using robots.txt files to hide inferior content or to prevent indexing of certain categories or dates, because it doesn't reliably keep those pages out of the index. Instead, make use of web-builder-specific plugins for adding "noindex" and/or "nofollow" meta tags; for example, Yoast's WordPress SEO plugin.
- The readme.html file in a WordPress install can be accessed by almost anyone and reveals information, such as your WordPress version, that can help someone target your website. Disallowing it keeps the file out of search results, though a determined attacker can still fetch it directly. To do this, just add this line to your robots.txt file:
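Assuming a standard WordPress layout with readme.html at the site root:

```text
Disallow: /readme.html
```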
Additionally, disallow any plugin directories, such as the WordPress plugins directory, as this adds a small layer of security to your newly optimized website. To do this, just type:
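Assuming the default WordPress directory structure:

```text
Disallow: /wp-content/plugins/
```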
- Although it isn't meant to affect the privacy of your content, adding your website's XML sitemap to the robots.txt file will result in more efficient indexing of your site.
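With the hypothetical domain used earlier, the sitemap line would look like this:

```text
Sitemap: https://www.genericwebsite.com/sitemap.xml
```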
Step 5: Validation
The final step in implementing your robots.txt optimization plan is to ensure that your file is free of errors and will perform as desired. Scrutinizing your site's robots.txt file will also save your site from search engine ranking issues. You can use various validation tools that will not just identify errors, but will also display all the pages you have specified as disallowed.
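If you'd rather script a quick check yourself, Python's standard library includes a robots.txt parser; here's a minimal sketch that tests hypothetical rules against hypothetical URLs:

```python
# Sanity-check robots.txt rules with Python's built-in parser.
# The rules and URLs below are hypothetical examples.
from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /testing/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# The /testing/ folder should be blocked for any crawler...
print(parser.can_fetch("Googlebot", "https://www.genericwebsite.com/testing/data.html"))  # False
# ...while the rest of the site stays crawlable.
print(parser.can_fetch("Googlebot", "https://www.genericwebsite.com/index.html"))  # True
```

Swap in your own rules and URLs to confirm that a disallow pattern actually matches the paths you intend it to, before uploading the file.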