Crawl management is a core component of technical SEO for a large website, and robots.txt is the primary tool for spending crawl budget effectively.
This text file looks simple, but a small error or change can do serious damage to your website’s organic visibility.
A misconfigured robots.txt can block quality content from crawlers or waste crawl budget on unimportant pages.
In this article, you will learn what robots.txt is, its syntax, the role of each directive, how to optimize it, and the mistakes to avoid.
Let’s move forward.
What Is Robots.txt?
Robots.txt is a text file that tells search engine bots which web pages on a website to crawl and which to exclude from crawling. It is placed in the root directory of the website.
The file can also exclude specific bots from crawling certain web pages or the whole website.
Robots.txt directives can even exclude the entire website from crawling by all search engine bots.
Both bad and good bots operate across the internet; the good ones are known as web crawler bots.
Not all bots obey these directives, but the important ones, such as Googlebot and Bingbot, follow the instructions in the robots.txt file completely.
Robots Txt Example:
Here is a sample robots.txt file, like the one Moz serves at https://moz.com/robots.txt.
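A minimal, illustrative file (not Moz’s actual rules; the URLs and paths are placeholders) looks like this:

```
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```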
Robots.txt syntax consists of components such as the user-agent, disallow, allow, and XML sitemap directives.
Sometimes the file also includes a host directive.
To check the robots.txt of any website, use the URL formula https://yourdomain.com/robots.txt.
Press Enter, and the site’s robots.txt file will be displayed in the browser.
Robots Txt Format and Directives:
A robots.txt file can hold the following directives:
The recommended order is user-agent, disallow, allow, and XML sitemap.
Google’s search engine crawlers support and follow only these four directives.
Every user-agent should have its own disallow and allow rules. Never combine two or more user-agents in the same rule group.
Robots Txt User Agent:
User-agent is one of the most important directives; a file can contain one or many of them, depending on the website’s requirements.
The user-agent line defines which individual crawler, or pool of crawlers, the rules that follow apply to.
When the user-agent is set to * (asterisk), the rules apply to every crawler.
Here is a list of user-agent tokens used by the internet’s most important crawlers.
User-agent: * -> All crawlers
User-agent: Googlebot -> Google crawler for desktop
User-agent: Googlebot-News -> Google News crawler
User-agent: Googlebot-Image -> Google Image crawler
User-agent: Googlebot-Mobile -> Google crawler for mobile (smartphone crawling now uses Googlebot)
User-agent: AdsBot-Google -> Google Ads bot for desktop
User-agent: AdsBot-Google-Mobile -> Google Ads bot for mobile
User-agent: Mediapartners-Google -> AdSense bot
User-agent: Googlebot-Video -> Google crawler for videos
User-agent: Bingbot -> Bing crawler for both desktop and mobile
User-agent: MSNBot-Media -> Bing bot for crawling images and videos
User-agent: Baiduspider -> Baidu crawler for desktop and mobile
User-agent: Slurp -> Yahoo crawler
User-agent: DuckDuckBot -> DuckDuckGo crawler
Good Practices of Robots txt User-agent:
- Never use any special character other than * in a user-agent line.
- Never leave the user-agent value empty.
- Target one specific crawler per user-agent line. For example, user-agent: Googlebot Bingbot is not recommended.
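Following these practices, separate rule groups per crawler can be sketched as below (the paths are illustrative placeholders):

```
User-agent: Googlebot
Disallow: /videos/

User-agent: Bingbot
Disallow: /gallery/
```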
Robots Txt Disallow:
Disallow is usually the second line or directive in a robots.txt file.
This directive tells the named user-agents to exclude a set of web pages from crawling.
Disallow: / under User-agent: * tells every crawler to exclude all web pages from crawling.
Disallow: / under User-agent: SemrushBot excludes the entire website from the Semrush bot.
Disallow: /videos/ under User-agent: Googlebot tells Googlebot to exclude URLs under https://yourdomain.com/videos/ from crawling.
A slash before the directory name, as in Disallow: /gallery/, tells Bingbot to exclude the complete “gallery” directory from crawling.
An empty Disallow: line tells the crawlers they may crawl the entire website; it disallows nothing.
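Put together, the disallow rules described above can be sketched like this; each block is a separate alternative, not one combined file, and all paths are placeholders:

```
# Block all crawlers from everything
User-agent: *
Disallow: /

# Block only the Semrush bot from the whole site
User-agent: SemrushBot
Disallow: /

# Block Googlebot from the /videos/ directory
User-agent: Googlebot
Disallow: /videos/

# Allow everything (empty disallow)
User-agent: *
Disallow:
```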
Good Practices of Robots txt Disallow:
- Always include a disallow directive in the robots.txt file. It can be left empty to signal that every page may be crawled.
- Use disallow directives to exclude the pages that could cause damage to SEO practices like:
- Duplicate Pages
- Pagination Pages
- Dynamic Product pages
- Account or Login pages
- Admin pages
- Cart and checkout pages
- Thank you pages
- Chat pages
- Feedback pages
- Registration page, etc.
- Never rely on the disallow directive to hide sensitive pages; robots.txt is publicly readable, so protect such pages with authentication or a noindex tag instead.
Robots Txt Allow:
The allow directive is the third command in robots.txt. It states which web pages a crawler should crawl.
A robots.txt file can have one or more allow directives, and each should carry a path. Its main use is to let crawlers reach a web page inside a sub-directory that a disallow directive has otherwise excluded.
Allow: / under User-agent: * lets every bot crawl the complete website.
Allow: /videos/seo-services/ combined with Disallow: /videos/ tells Googlebot to crawl that particular page of the otherwise blocked sub-directory.
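As a sketch, that combination of disallow and allow would be written like this (the /videos/seo-services/ path is an illustrative placeholder):

```
User-agent: Googlebot
Disallow: /videos/
Allow: /videos/seo-services/
```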
Add XML Sitemap to Robots.txt
The fourth and final directive to use in robots.txt is the XML sitemap, which is not mandatory in the syntax.
Still, you can list one or many sitemap.xml files as required. Ensure you provide the exact URL of each sitemap.
When the sitemap is listed in the robots.txt file, its pages get more attention during crawling.
Never link to a sitemap that has not been created, as that can hurt the website’s crawling performance.
A complete file combines all four directives: user-agent, disallow, allow, and sitemap.
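Such a complete file might look like this (the URLs and paths are illustrative placeholders):

```
User-agent: *
Disallow: /cart/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```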
Why Robots.txt Is Important in SEO?
Robots.txt has a huge role in SEO because it is a vital medium for crawl management. Even without a robots.txt file, search engine crawlers can still discover, crawl, render, and index the important web pages of a website.
Many websites run without a robots.txt file, and that alone does not hurt SEO performance. A robots.txt file with errors, however, can cause severe damage to a website.
Here are a few benefits a robots.txt file can provide to your website:
- Keep web pages with duplicate content from being crawled.
- Prevent staging websites from being crawled (these should stay private).
- Exclude internal search result pages from crawling.
- Reduce crawl load on servers that are being overloaded.
- Exclude images, videos, PDFs, Excel files, and other private resources from crawling.
- Avoid crawling of cart, checkout, login, registration, and thank-you pages. This saves a lot of crawl budget, which helps big e-commerce websites.
- Block specific search engine crawlers from crawling the entire website.
- Exclude pagination pages from crawling.
Robots txt Best Practices:
We have compiled the practices we follow daily in our agency while optimizing robots.txt files. We hope they help your SEO practice too.
An error in robots.txt can lead to index coverage issues such as “Indexed, though blocked by robots.txt.”
Generate Robots.txt File:
The first step is to check whether your website already has a robots.txt file. To do so, open https://yourdomain.com/robots.txt in the browser’s address bar.
If a file loads, validate that it is properly written.
If your website doesn’t have one, it’s time to create it.
CMSs like Blogger and Wix don’t allow direct editing of the robots.txt file; instead, they provide settings that control search engine crawlers.
If your website doesn’t run on such a CMS, you must create the file yourself.
Your website should have only one robots.txt file.
General rules while creating a Robots.txt file
1. Write Each Directive on a Separate Line:
Never write all the directives on a single line like
User-agent: * Disallow: /videos/ Allow: /videos/seo-services/
Instead, write each directive on its own line:
User-agent: *
Disallow: /videos/
Allow: /videos/seo-services/
2. Use a User-agent only once:
It is a good practice to use a particular User-agent only once in the robots.txt file.
For example, if you use User-agent: *, never open a second User-agent: * group later in the same file with additional rules.
That is not the correct format. Instead, merge all the rules for that crawler under a single User-agent: * group.
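As an illustrative sketch (the paths are placeholders), the wrong and right patterns look like this:

```
# Not recommended: the same user-agent declared twice
User-agent: *
Disallow: /cart/

User-agent: *
Disallow: /checkout/

# Correct: one group holding all rules for that user-agent
User-agent: *
Disallow: /cart/
Disallow: /checkout/
```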
3. Use a Separate Robots.txt File for Domain and Subdomain
Some websites have both a domain and a subdomain, for example:
https://safautocare.com/ and https://blog.safautocare.com/
In that case, you should maintain a separate robots.txt file, with its own rules, for the domain and for each subdomain.
4. Use Only Lowercase for the Robots.txt Filename
When you place the file in the root directory of the website, you must name it robots.txt.
Remember, the complete name should be lowercase; using upper case can cause the file not to be found.
For example, Robots.txt or ROBOTS.TXT is not recommended.
5. Never leave user-agent empty:
User-agent is the directive where the instructions for a particular crawler begin. If it is left empty, the rules that follow have no target and can cause issues.
6. Use Only the Directory Name to Disallow All Web Pages Under It
Many SEO specialists make the mistake of writing a disallow directive for every file name under a directory. Disallowing the directory itself covers everything beneath it.
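A sketch of the difference (the gallery paths are illustrative placeholders):

```
# Tedious and error-prone: one rule per file
User-agent: *
Disallow: /gallery/photo-1.html
Disallow: /gallery/photo-2.html

# Better: one rule for the whole directory
User-agent: *
Disallow: /gallery/
```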
7. Always use disallow directive:
A robots.txt file should always carry the user-agent, disallow, and allow directives; they are mandatory.
Even if you have no web pages to exclude from crawling, keep the disallow directive and leave it empty.
That lets crawlers discover all the web pages.
8. Never Block Your Entire Website
If you use Disallow: /, every web page is blocked from crawling, as that is exactly what the instruction says.
This is not a good practice; be specific about the pages you exclude. If you block the entire website, you will never appear organically in the SERPs.
9. Always use XML sitemap in Robots.txt:
Sitemap.xml lists all the web pages that should be indexed. When the XML sitemap is referenced in robots.txt, crawlers can discover more quickly which web pages to crawl.
It is not a mandatory directive like user-agent, disallow, and allow, yet adding the sitemap to the robots.txt file has numerous practical advantages.
10. Never link to disallowed pages
The disallow directive in the syntax excludes web pages from crawling.
If you link from other pages to an excluded page, there is a chance the excluded page gets crawled and indexed anyway.
When that happens, Google Search Console warns you with “Indexed, though blocked by robots.txt.”
If you see this warning, remove the links from the referring web pages.
11. Never Use Nofollow and Noindex Directives
Google stopped supporting the noindex and nofollow rules in robots.txt on September 1, 2019; the final section of this article covers the details.
Robots Txt Generator:
A robots.txt generator tool lets you enter the search robots (user-agents), the web pages to disallow and allow, and the sitemap.xml, and it builds the file for you.
Google has announced crawl-delay as one of the unsupported directives, so you can leave it out.
Once the robots.txt file is generated, validate it to find out whether it contains any errors.
Validate Errors using Robots Txt Tester:
Use the robots.txt tester in Google Search Console to validate the file.
Select your property in the tester tab, and the robots.txt syntax is displayed if the file is present in the root directory of the website.
At the same time, you can also check whether any particular web page is blocked by robots.txt.
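If you prefer to check rules locally, Python’s standard library ships a robots.txt parser. This is an illustrative sketch, not part of any Search Console workflow: the rules and URLs are placeholders, and note that Python applies rules in file order, whereas Google prefers the most specific match.

```python
# Check which URLs a given user-agent may fetch under a set of rules,
# using Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules directly instead of fetching them over the network.
rp.parse([
    "User-agent: *",
    "Allow: /videos/seo-services/",   # more specific rule listed first
    "Disallow: /videos/",
])

print(rp.can_fetch("Googlebot", "https://example.com/videos/"))               # False
print(rp.can_fetch("Googlebot", "https://example.com/videos/seo-services/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/about/"))                # True
```

Because this parser honors the first matching rule, the Allow line is placed before the broader Disallow line.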
Place Robots.txt in the Root Directory of the Website
After creating a robots.txt file and validating it without errors, it’s time to place it in the root directory.
You can reach the root directory through cPanel or FTP access.
Create a file named robots.txt there and upload it; crawlers will then follow the instructions it provides.
You can also edit the file later, as needed, by the same method.
Our hosting provider, Cloudways, helps us update the robots.txt file whenever we supply changes to the text file.
Remove the cache
Whenever you upload a new file or edit an existing one, you need to clear the cached copy.
Search engines store a cached version of the robots.txt file from their last crawl.
Until that cache is refreshed, crawlers won’t act on the new instructions. To make the changes active, request a refresh of the cached robots.txt (for example, through Google’s robots.txt tester).
Directives that Google Doesn't Recommend:
If you have reached this section, you already know which directives Google recommends.
A few directives are not recognized by Googlebot, yet they are still honored by Bingbot and Slurp (the Yahoo crawler).
Since Google holds around 90% of the search engine market share, everyone should know about them.
Here are the directives you can avoid in the syntax if you use User-agent: Googlebot.
Crawl-delay:
This directive instructs search engine crawlers to wait a certain number of seconds between two crawls.
When crawl-delay is set to 10 seconds, the crawlers named in the user-agent must wait 10 seconds after each crawl.
A line such as Crawl-delay: 10 under User-agent: * instructs all web crawlers to wait 10 seconds after each crawl.
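Written out, that rule looks like this:

```
User-agent: *
Crawl-delay: 10
```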
This directive is currently not supported by Google. Yet Yahoo and Bing still support crawl-delay.
Our recommendation is to avoid this directive: there are only 86,400 seconds in a day, so even a 2-second delay caps crawling at 43,200 URLs per day, and a 1-minute delay at just 1,440. That is a severe limit for websites with millions of web pages.
Google understands this better than anyone; that’s why it dropped support for the directive.
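The arithmetic behind that recommendation can be sketched quickly; this is illustrative only, since real crawl rates depend on many other factors:

```python
# Upper bound on URL fetches per day when a crawler honors Crawl-delay.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def max_crawls_per_day(crawl_delay_seconds: float) -> int:
    """Best-case number of fetches per day under the given delay."""
    return int(SECONDS_PER_DAY // crawl_delay_seconds)

print(max_crawls_per_day(2))   # 43200 URLs per day at a 2-second delay
print(max_crawls_per_day(60))  # 1440 URLs per day at a 1-minute delay
```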
Noindex:
Noindex is a common rule used in the robots meta tag of an HTML file, and people have used the same rule in robots.txt syntax too.
Google never officially supported it there, and effective September 1, 2019, Google announced that the noindex rule in robots.txt syntax is not supported at all.
A line such as Noindex: /gallery/ therefore does not exclude the gallery page from indexing.
If you still want to exclude the gallery page from indexing, set the meta robots tag to “noindex.”
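The supported way to keep a page out of the index is a meta robots tag in the page’s head section, for example:

```html
<!-- Place inside the <head> of the page to be excluded from the index -->
<meta name="robots" content="noindex">
```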
Nofollow:
Nofollow is a common rule used in the robots meta tag of an HTML file and as rel=”nofollow” in a link attribute.
Webmasters have long used the same rule in robots.txt syntax too.
Google never officially supported it there, and effective September 1, 2019, Google announced that the nofollow rule in robots.txt syntax is not supported at all.
A line such as Nofollow: /gallery/ therefore does not stop link equity from flowing from the gallery page to the pages it links to.
If you still want to stop passing link equity from the gallery page, set the meta robots tag to “nofollow” or use rel=”nofollow” on the individual links.
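Both supported alternatives look like this (the URL is an illustrative placeholder):

```html
<!-- Page-level: no link on this page passes equity -->
<meta name="robots" content="nofollow">

<!-- Link-level: only this one link is nofollowed -->
<a href="https://example.com/gallery/" rel="nofollow">Gallery</a>
```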