What is Robots.txt and how to use it to tell robots what they should (or shouldn’t) request from your site


This file can be a great ally of your website’s SEO, helping robots to index your content

Have you ever watched the movie “I, Robot”?

In the movie starring Will Smith, set in the year 2035, robots exist to serve humans and must follow 3 rules:

  1. Robots cannot harm humans;
  2. Robots must obey humans (if it doesn’t go against the first rule);
  3. Robots must protect themselves (if it doesn’t go against the first and second rules).
It’s not 2035 yet, but the internet is already full of robots: filling out forms, skewing Google Analytics reports, hacking computers and indexing content for search engines.

Google’s robot, affectionately called “Googlebot” but also known as a “crawler”, “spider” or simply “bot”, was created to scour the entire web for new pages (or updates) to index in Google’s search results.

It starts from a list of previously indexed URLs and moves on to other pages through existing links, identifying updates and new content and keeping Google’s results page always up to date and relevant for users.

To better understand the entire process of indexing pages on Google (including what indexing is), visit the How Google Works post.

Other search engines, like Bing and Yahoo, also have their own robots, which work in a similar way.

And so the robots go from page to page, following links and sending an astronomical amount of content for indexing (to give you an idea, there are more than 1.5 billion sites on the internet today).

But what if I don’t want robots to crawl my folders, images or certain resources? That’s where the robots.txt file comes in.

What is robots.txt?

In summary, robots.txt is a text file published at the root of the site that contains guidelines for search engine robots, primarily to avoid overloading the site with requests.

Remember the 3 rules that the robots in the movie should follow, at the beginning of the post?

Basically, the robots.txt file does something similar: it says which robots a rule applies to (User-agent), lists what those robots should disregard (Disallow) and defines exceptions to the rules (Allow). It works much like the movie’s rules, where a robot might disobey a human to protect them, or fail to protect itself in order to obey a human.

A very common mistake happens when a website is launched and the robots.txt file is never revised. Days, weeks and even months go by and nothing from the site appears in the results, not even when searching for the company name.

That’s because the file is telling all robots (User-agent: *) not to crawl any page on the site (Disallow: /). If the file is not corrected after launch, the directives stay the same.
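For reference, the development-stage file that causes this usually contains nothing more than these two standard directives:

User-agent: *
Disallow: /

Removing the slash (leaving Disallow: empty) or deleting the rule altogether releases the whole site for crawling again.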

To learn more, check out Google’s documentation on robots.txt .

3 reasons to use a robots.txt file on your website

Maybe now you’re asking yourself, “but why the hell am I going to tell the Google robot to ignore parts of my site, if the more of my site appears there, the more organic traffic I’ll get?”

Well, you are not wrong. Although having one is considered good SEO practice, most sites do not actually need a robots.txt file; Google itself says as much in its guidelines. If the site does not have one, it will be crawled and indexed normally.

In addition to avoiding overloading the site with requests, here are 3 reasons to include the file on your site:

1. Avoid tracking internal areas, files and resources

Usually, a website has a login area, internal use pages or an area still under development, for example.

These types of pages can, and should, have rules so they are not crawled by robots. You can also prevent file types (such as PDF or DOC), images, and even resources such as Ajax from being crawled.

If you do Inbound Marketing, this is especially useful when you host your materials within your own domain. After all, you want your potential Leads to find your Landing Page and convert, not to find and access the final material directly, right?

With robots.txt rules, you can prevent an entire area of your site from being crawled and even add further rules saying which pages are exceptions and should still be crawled.
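As an illustration, such rules might look like the sketch below; the directory and file names are hypothetical, and the * and $ wildcards are extensions supported by Google:

User-agent: *
# Block an internal area, except for one public page inside it
Disallow: /intranet/
Allow: /intranet/public-page.html
# Block PDF and DOC files anywhere on the site
Disallow: /*.pdf$
Disallow: /*.doc$

The Allow line is the exception mentioned above: everything under /intranet/ stays blocked except the page explicitly allowed.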

2. Robots’ time on your site is limited

You may not know this, but Google has officially stated that it has a crawl limit, the famous “Crawl Budget”.

This means that if you don’t determine which pages Google shouldn’t crawl, it could waste more time on your site crawling worthless pages and fail to crawl the pages you’d really like to index or update.

If you’re having trouble getting robots to crawl and index your entire site, it could be that the problem is really Crawl Budget. Blocking irrelevant pages from crawling can do the trick.

Keep in mind that this usually only becomes a problem for large content sites and portals with many pages.

3. You can use it to indicate where your sitemaps are

It’s a simple feature, but it helps Google and other search engines to find your sitemaps and consequently understand the organization of your website.
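In practice this is just one line in the file pointing to the sitemap URL; the address below is a hypothetical example:

Sitemap: https://www.example.com/sitemap.xml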

Want to learn more about Sitemaps? Access the Sitemap XML post: everything you need to know.

Not sure if your site has a robots.txt file or not? Take the test: go to your website’s main address and include /robots.txt  at the end. For example https://resultadosdigitais.com.br/robots.txt

And when not to use the robots.txt file?

Have you tested whether your site already has a robots.txt? Then you’ve seen how easy it is to access it… which means other users, including hackers, can find it easily too.

Therefore, it is not a good idea to use the file to hide personal documents or confidential files: you may be blocking them from search engines, but you are also pointing anyone who reads the robots.txt straight to them.

In this case, the most recommended solution is to require a password for access or to use the Robots Meta Tag.

Important: listing pages in the file so they are not crawled does not guarantee that they won’t show up on Google. Robots.txt works to block access to files and resources, but to make sure pages won’t appear in the results, the best approach is to use the Robots Meta Tag with noindex.

Robots.txt vs. Meta Tag Robots

Unlike the robots.txt file, which applies to the entire site, the Robots Meta Tag lets you configure pages individually, telling search engines not to index the page and/or not to follow the links on it.

The tag is inserted inside the <head> section  in the page’s HTML  and has the following structure:

  • To indicate that robots should not index the page:

<meta name="robots" content="noindex" />

  • To indicate that robots should not follow any links on the page:

<meta name="robots" content="nofollow" />

  • To tell robots not to index the page nor follow its links:

<meta name="robots" content="noindex, nofollow" />
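To make the placement concrete, here is a minimal sketch of a page’s <head> carrying the tag; the title is just an illustrative placeholder:

<head>
  <title>Example page</title>
  <!-- Tells robots not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow" />
</head>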

This is the best alternative to ensure that certain pages are not indexed by search engines. If the page has already been indexed and the meta tag was added afterwards, the next time the robots crawl the page the tag will be read and the page will be removed from the index.

A common example of using the Robots Meta Tag is on Thank You Pages. Since these are the pages where we deliver the materials, ideally they should not be indexed in search engines, only the Landing Pages.

It’s also not good practice to include their URLs in the robots.txt file: that doesn’t guarantee they won’t be indexed, and it reveals the Thank You Page addresses to any user.

Using the Robots Meta Tag with noindex, nofollow on each of your Thank You Pages ensures that they will not be indexed by search engines and tells robots not to follow the links on them, which also prevents the materials themselves from being indexed and displayed in search results.

This way you guarantee that only the Landing Pages are indexed in search engines.

To learn more about this, check out Google’s documentation on Meta Tag Robots.

Syntax

For robots to interpret the contents of the file, robots.txt must follow some standards.

The first is that it needs to be a plain ASCII or UTF-8 text file. The rules in the file are read from top to bottom, each group starting with the user-agent (who the rule applies to), followed by which files and directories that robot can or cannot access.

Another important point is that the rules are case-sensitive. So, if your site has both a “/Example” and an “/example” directory and you only include “/Example” in the file, the rule will apply only to that exact spelling.
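A quick sketch with hypothetical directory names:

User-agent: *
# Blocks /Example/ but not /example/, since matching is case-sensitive
Disallow: /Example/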

Check out the directives used in the files below:

User-agent (required)

This is where we indicate which robot the following rules apply to, and it is the first line of any rule group. Most robots on the internet have their own user-agent name, and to apply a rule to all robots at once you use an asterisk (User-agent: *).

Disallow and Allow (each rule must have at least one)

Disallow is the directive that tells robots that a certain directory or page on the site should not be crawled, whereas Allow is the opposite, indicating which directories and pages may be crawled.

By default, robots already crawl all pages on the site, so there is no need to include Allow in the file. It is only necessary when a section or group blocked by Disallow contains pages that should still be crawled.

Sitemap (optional)

It indicates where the sitemap is, and including it is a good practice that helps Google crawl and index the site.
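Putting the three directives together, a complete robots.txt might look like the sketch below; all paths and URLs are hypothetical:

User-agent: Googlebot
Disallow: /test-area/

User-agent: *
Disallow: /admin/
Allow: /admin/login.html

Sitemap: https://www.example.com/sitemap.xml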

Simple so far, right?

But the syntax of robots.txt files has many variations and many use cases. To dig a little deeper, I recommend checking out the complete robots.txt syntax.

Enough of theory, let’s practice!

How to create and test the robots.txt file

As the name implies, to create a robots.txt file you just need a simple text editor, such as your computer’s notepad.

Important: the file name must be “robots.txt” and it must be placed at the root of the site (site.com/robots.txt).

To make the file easier to understand, you can include comments in it by starting a line with #.
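For example, a commented rule might look like this (the path is illustrative):

# Blocks the checkout area for all robots
User-agent: *
Disallow: /checkout/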

Have you created the text file with the rules you want to apply to your website? Before submitting it to Google, it’s important to test that everything is fine. Just go to Google’s robots.txt testing tool, available within Google Search Console. With the tool, you can check whether certain pages or resources are being blocked by the file, as well as look for possible errors in it. Updates can be made directly in the testing tool, which then lets you download the file and upload it to your website’s server.

Is your robots.txt file okay? Now it’s time to update Google on it. In the testing tool itself, there is a “Submit” button, which updates and notifies Google of the changes made.

10 examples of robots.txt

To get inspired and learn by doing, check out 10 real examples of robots.txt files:

  • 1. https://facebook.com/robots.txt  (long and configured for multiple user-agents)
  • 2. https://instagram.com/robots.txt  (even has a DuckDuckGo user-agent)
  • 3. https://www.apple.com/robots.txt  (note the rules for Baidu)
  • 4. https://www.google.com.br/robots.txt
  • 5. https://www.youtube.com/robots.txt
  • 6. https://www.estadao.com.br/robots.txt
  • 7. https://www.globo.com/robots.txt
  • 8. https://www.dell.com/robots.txt
  • 9. https://www.cocacola.com.br/robots.txt
  • 10. https://www.amazon.com.br/robots.txt

Besides checking the robots.txt file of each of the companies above, you can see what the file looks like on any website that uses it: just go to the site’s address and add /robots.txt at the end. How about taking a look at the competition?

Still have a question about robots.txt? Comment below or visit the Google FAQ page.
