What Is Robots.txt in SEO and How to Practise it?

Wondering what robots.txt is in SEO? The term sounds complicated, but it isn't once you understand the concept.

Let me make it simple: robots.txt is just a file that helps with crawl management, guiding search engine crawlers on which pages of your website to crawl and which to avoid.

Crawl management is a core technical SEO component for large websites, and robots.txt helps you use the crawl budget effectively.

This text file seems simple, but a small error or change can cause huge damage to your website’s organic visibility.

A misconfigured robots.txt can block quality content from crawlers or waste the crawl budget, so it is important to understand how the file works before using it.

In this article, you will get complete details on what robots.txt is, its syntax, the role of each directive, how to optimize it, and what to avoid.

Let’s move forward.

What Is Robots.txt?

Robots.txt is a text file that tells search engine bots which web pages of a website to crawl and which to exclude from crawling.

It sits in the root directory of the website and can exclude specific bots from crawling individual pages or the entire site.

Robots.txt directives can also exclude the complete website from crawling by all search engine bots.

Both good and bad bots roam the internet; the good ones are generally known as web crawler bots.

Not all bots obey these commands, but major bots like Googlebot and Bingbot follow the instructions in the robots.txt file.

Robots.txt Example:

Here is a sample robots.txt file from Moz.

Robots.txt syntax consists of components such as the user-agent, disallow, allow, and the XML sitemap.

Sometimes the file also includes a host directive.

To check the robots.txt of any website, use the URL format https://yourdomain.com/robots.txt.

Press Enter and you will see a file similar to the one below.

Robots.txt Example of Moz

Robots.Txt Format and Directives:

As you see in the above image, each robots.txt file can hold the following directives:

  1. User-agent
  2. Disallow
  3. Allow
  4. Sitemap.xml

The usual order is user-agent, disallow, allow, and the XML sitemap.

Google's crawlers support and follow only these four directives.

Every user-agent should have its own disallow and allow rules. Never run two or more user-agents together in the same block of rules.

Good Format:

User-agent: *
Disallow: /project-one/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

 

User-agent: Googlebot
Disallow: /accounts/
Allow: /products/
Sitemap: https://yourdomain.com/product-sitemap.xml

Bad Format:

User-agent: Googlebot
Disallow: /project-off/
User-agent: Bingbot
Disallow: /
Sitemap: https://yourdomain.com/sitemap.xml

Robots.Txt User Agent:

User-agent is one of the most important directives; a robots.txt file can contain one or many of them, depending on the website's requirements.

The user-agent line defines which individual crawler, or group of crawlers, the rules that follow apply to.

When the user-agent is marked as * (asterisk), the rules apply to every crawler.

Here is a list of user-agent tokens for the important crawlers on the internet.

User-agent: * -> All crawlers

User-agent: Googlebot -> Google's crawler for desktop and smartphone

User-agent: Googlebot-News -> Google News crawler

User-agent: Googlebot-Image -> Google Images crawler

User-agent: AdsBot-Google -> Google Ads bot for desktop

User-agent: AdsBot-Google-Mobile -> Google Ads bot for mobile

User-agent: Mediapartners-Google -> AdSense bot

User-agent: Googlebot-Video -> Google crawler for videos

User-agent: Bingbot -> Bing crawler for both desktop and mobile

User-agent: MSNBot-Media -> Bing crawler for images and videos

User-agent: Baiduspider -> Baidu crawler for desktop and mobile

User-agent: Slurp -> Yahoo crawler

User-agent: DuckDuckBot -> DuckDuckGo crawler

Good Practices of Robots.txt User-agent:

  • Never use any special character other than * in the user-agent.
  • Never leave the user-agent empty.
  • Always target one specific crawler per user-agent line. For example, “User-agent: Googlebot Bingbot” is not recommended; give each crawler its own block, as sketched below.
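
If you need rules for more than one crawler, here is a minimal sketch of the recommended pattern, assuming you want the same rule for both bots (the /example-directory/ path is only illustrative):

User-agent: Googlebot
Disallow: /example-directory/

User-agent: Bingbot
Disallow: /example-directory/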

Robots.Txt Disallow:

Disallow is usually the second line or directive in the robots.txt file.

This directive tells the listed user-agents which pages or paths to exclude from crawling.

User-agent: *
Disallow: /

This tells all crawlers to exclude every page of the website from crawling.

User-agent: Semrushbot
Disallow: /

This excludes the entire website from crawling by SemrushBot.

User-agent: Googlebot
Disallow: /videos/

This tells Googlebot to exclude every URL under https://yourdomain.com/videos/ from crawling.

User-agent: Bingbot
Disallow: /_gallery/

This tells Bingbot to exclude the entire /_gallery/ directory from crawling (the underscore is simply part of the directory name).

User-agent: *
Disallow:

This tells all crawlers that they can crawl the entire website; it disallows nothing.

Good Practices of Robots.txt Disallow:

  • Always include a disallow line in the robots.txt file. You can leave it empty if you intend to allow every page to be crawled.
  • Use disallow directives to exclude pages that could harm your SEO, such as (a sample sketch follows this list):
    • Duplicate pages
    • Pagination pages
    • Dynamic product pages
    • Account or login pages
    • Admin pages
    • Cart and checkout pages
    • Thank-you pages
    • Chat pages
    • Internal search pages
    • Feedback pages
    • Registration pages, etc.
  • Never rely on the disallow directive to hide sensitive pages; robots.txt is publicly readable, and disallowed URLs can still end up indexed if other pages link to them.
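
Here is a minimal sketch of how such exclusions might look; the paths are only illustrative and should be replaced with the real URLs used on your site:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/
Disallow: /thank-you/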

Robots.Txt Allow:

The allow directive is the third command in robots.txt. It specifies which web pages a crawler may crawl.

A robots.txt file can have one or more allow directives. Its most useful role is to let crawlers reach a web page inside a sub-directory that is otherwise excluded by a disallow directive.

User-agent: *
Allow: /

This command helps every bot to crawl the complete website.

User-agent: Googlebot
Allow: /services/seo/

This tells Googlebot that it may crawl this particular page of the sub-directory.
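
To combine the two directives, a sketch like the following (with illustrative paths) blocks the /services/ directory as a whole while still allowing the SEO page inside it to be crawled:

User-agent: Googlebot
Disallow: /services/
Allow: /services/seo/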

Add XML Sitemap to Robots.txt

The fourth and final directive to use in robots.txt is the XML sitemap, which is not mandatory in the syntax.

Still, you can list one or many XML sitemaps as required. Make sure you provide the exact, absolute URL of each sitemap.

When the sitemap is referenced in the robots.txt file, crawlers can discover the listed pages more easily.

Never link to a sitemap that does not exist, as this can hurt the website's crawling performance.

A perfect syntax should appear as below:

User-agent: *
Disallow: /project-one/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

Why Is Robots.txt Important in SEO?

Robots.txt plays a big role in SEO because it is a vital tool for crawl management. Even without a robots.txt file, search engine crawlers can discover, crawl, render, and index the important web pages of a website.

Many websites run without a robots.txt file at all. Does this hurt SEO performance? No! But a robots.txt file with an error can cause severe damage to the website.

Here are a few benefits of having a robots.txt file on your website:

  • Keep duplicate-content pages from being crawled.
  • Prevent staging websites (which should stay private) from being crawled.
  • Exclude internal search pages from crawling.
  • Reduce crawling that overloads your servers.
  • Exclude images, videos, PDFs, spreadsheets, and other private files from crawling (see the sketch after this list).
  • Avoid crawling of cart, checkout, login, registration, and thank-you pages, which saves a lot of crawl budget on big e-commerce websites.
  • Block specific search engine crawlers from the entire website.
  • Exclude pagination pages from crawling.
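
As an example of the image-related point above, a sketch like the following (the directory name is only illustrative) keeps Google's image crawler out of a private image folder:

User-agent: Googlebot-Image
Disallow: /private-images/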

Robots.txt Best Practices:

Here is a list of the practices we follow at our agency when optimizing robots.txt files. I hope they help your SEO practice too.

An error in robots.txt can lead to index coverage issues such as “Indexed, though blocked by robots.txt.”

1. Generate Robots.txt File:

The first step is to check whether your website already has a robots.txt file. To do so, open the URL below in your browser.

https://yourdomain.com/robots.txt

If a file loads, validate whether it is written correctly.

If your website doesn't have one, it's time to create it.

Is your website built on WordPress? If so, you can use plugins like Rank Math or Yoast to generate and edit the file.

CMS platforms like Blogger and Wix don't let you edit the robots.txt file directly; instead, they provide settings that control search engine crawlers.

If your website isn't built on a CMS, you must create the file manually.

Your website should have only one robots.txt file.

Robots.txt Rules by Google

General rules while creating a Robots.txt file

1. Write Each Directive in a Separate Line:

Never write all the directives in a single line, like this:

User-agent: * Disallow: /videos/ Allow: /videos/seo-services/

Instead, write each directive on a separate line, as follows:

User-agent: *
Disallow: /project-one/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

2. Use a User-agent only once:

It is a good practice to use a particular User-agent only once in the robots.txt file.

For example, if you use User-agent: *, never repeat the same user-agent (*) later in the file, as follows:

User-agent: *
Disallow: /images/
User-agent: *
Disallow: /videos/

This is not the correct format. Instead, follow this syntax:

User-agent: *
Disallow: /images/
Disallow: /videos/

3. Use a separate robots.txt file for domain and subdomain

A website may have both a domain and a subdomain. For example,

https://safautocare.com/ and https://blog.safautocare.com/

In that case, you should use a separate robots.txt file, with its own rules, for the domain and for the subdomain, as sketched below.
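
Using the example domains above, each property serves its own file at its own root:

https://safautocare.com/robots.txt -> rules for the main website
https://blog.safautocare.com/robots.txt -> rules for the blog subdomain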

4. Use only lowercase for the robots.txt file name

When you place the robots.txt file in the root directory, it must be named robots.txt.

Remember, the entire name should be in lowercase; because file names are case-sensitive, uppercase letters can cause the file to be missed.

For example, Robots.txt or ROBOTS.TXT is not recommended.

5. Never leave user-agent empty:

The user-agent directive states which crawler the instructions apply to. If it is left empty, crawlers cannot tell who the rules are for, which can cause issues, as illustrated below.
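
For instance (with an illustrative path), a group whose user-agent value is missing has no crawler to apply to, so always name one:

# Wrong: no crawler is named
User-agent:
Disallow: /images/

# Correct
User-agent: *
Disallow: /images/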

6. Use only the directory name to disallow all the web pages under it

Many SEO specialists make the mistake of writing a disallow directive for every individual page under a directory.

Wrong Format:

User-agent: *
Disallow: /video/seo-basics/
Disallow: /video/seo-checklist/
Disallow: /video/seo-audit/

Correct Format:

User-agent: *
Disallow: /video/

7. Always use the disallow directive:

A robots.txt file should always include the user-agent and disallow directives.

Even if you don't have any web page to exclude from crawling, include the disallow directive and leave it empty, as shown below.

Let the crawlers discover all the web pages.
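
A minimal sketch of a file that allows everything but still keeps the disallow line (the sitemap URL is illustrative):

User-agent: *
Disallow:
Sitemap: https://yourdomain.com/sitemap.xml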

8. Never block your entire website

If you use Disallow: /, every page of the website is blocked from crawling, as per the instruction.

This is not good practice; be specific about the pages you exclude. If you block the entire website, you will never appear organically in the SERPs.

9. Always use the XML sitemap in robots.txt:

The XML sitemap lists all the web pages that should be indexed. When the sitemap is referenced in robots.txt, crawlers get a clearer picture of which pages to crawl alongside the pages you have excluded.

It is not a mandatory directive like user-agent, disallow, and allow. Yet adding the sitemap to the robots.txt file has numerous advantages.

10. Never link to disallowed pages

The disallow directive in the syntax excludes web pages from crawling.

If you link other pages to an excluded page (one covered by a disallow directive), there is a chance the excluded page still ends up indexed.

If that happens, Google Search Console will warn you with “Indexed, though blocked by robots.txt.”

In that case, you should remove the links from the referring web pages.

11. Never use Nofollow and Noindex directives

Google has not supported the nofollow and noindex directives in robots.txt since September 1, 2019.

So avoid using these directives here. You can still use them in the meta robots tag.

2. Robots.txt generator:

If you know the different directives well, all you need is a plain .txt file.

If you are a beginner, use a robots.txt generator to create one.

Robots.txt Generator

In the generator above, you can provide the search robots (user-agents), the web pages to disallow or allow, and the XML sitemap.

Google has announced that crawl-delay is an unsupported directive, so you can leave it out.

Once the robots.txt file is created, validate it to check for errors.

3. Validate Errors using Robots.txt Tester:

Robots.txt Tester

Use the robots.txt tester in Google Search Console to validate the file.

Select your property in the tester, and the robots.txt content is displayed if the file is present in the root of your website.

At the same time, you can also check whether any specific web page is blocked by robots.txt.

4. Place robots.txt in the root directory of the website

After creating the robots.txt file and validating it with no errors, it is time to place it in the root directory.

You can access the root directory through cPanel or FTP.

Create a file named robots.txt, upload it there, and let the crawlers follow the instructions you have provided.

You can also edit this text file later by the same method.

Our hosting provider, Cloudways, updates the robots.txt file for us when we send the changes in a text file.

5. Remove the cache

Whenever you upload a new robots.txt file or edit an existing one, you need to clear the cached copy.

Search engines keep a cached version of the robots.txt file they last fetched (Google typically refreshes it within about 24 hours).

Until that cache is refreshed, crawlers will not follow the new instructions. To make the changes take effect sooner, clear or refresh the cached robots.txt file.

Directives that Google Doesn't Recommend:

If you have reached this section, you already know which directives Google does support.

A few directives are not recognized by Googlebot, although they are still honored by Bingbot and Slurp (the Yahoo crawler).

Since Google holds roughly 90% of the search engine market share, it is worth knowing which directives it ignores.

Here are a few directives you can avoid in your syntax if your rules target User-agent: Googlebot.

Crawl-delay:

This directive instructs search engine crawlers to wait a certain number of seconds between two crawl requests.

When crawl-delay is set to 10 seconds, the crawlers named in the user-agent line should wait 10 seconds after each crawl before the next one.

User-agent: *
Crawl-delay: 10

This robots.txt instructs all web crawlers to wait 10 seconds after each crawl.

This directive is currently not supported by Google, but Yahoo and Bing still support crawl-delay.

Our recommendation is to avoid this directive: even a 1-minute delay limits a crawler to 1,440 URLs per day (60 × 24), or roughly 43,200 URLs per month, which is a severe restriction for websites with millions of web pages.

Google understands this better than we do, which is why it dropped support for such directives.

Noindex Directive:

Noindex is a common directive used in the meta robots tag of an HTML page, and people have tried to use it in robots.txt syntax too.

Google never officially supported it there, and effective September 1, 2019, Google announced that the noindex directive in robots.txt syntax is not supported.

User-agent: Googlebot
Noindex: /gallery/

This format doesn’t exclude the Gallery page from indexing.

If you still want to exclude the gallery pages from indexing, set the meta robots tag to “noindex.”

Nofollow Directive:

Nofollow is a common directive used in the meta robots tag of an HTML page and as rel=”nofollow” in a link attribute.

Webmasters have long used this directive in robots.txt syntax as well.

Google never supported it there either, and effective September 1, 2019, Google announced that the nofollow directive in robots.txt syntax is not supported.

User-agent: Googlebot
Nofollow: /gallery/

This format does not stop link equity from flowing from the gallery pages to the pages they link to.

If you still want to stop passing link equity from the gallery pages, set the meta robots tag to “nofollow” or use rel=”nofollow” in the link attribute.

Wrap Up:

You have now learned what robots.txt is in SEO, how to practise it, and, just as importantly, how not to.

With a clear understanding of the robots.txt file, make use of this simple but powerful SEO feature on your website and prioritize your crawl budget.

