Optimizing the Robots.txt file for Google

by Janeth Kent Date: 21-03-2024 seo

The Robots.txt file tells Googlebot and other robots that crawl the Internet which pages and files of our website they may crawl. Although it is not essential, the Robots.txt file is a great help to Google and other crawling robots when they process our site, so it is very important that it is configured correctly.

1 Robots.txt file location
2 Types of robots that can visit our website
3 Editing the Robots.txt file
4 Recommendations for the Robots.txt file
5 Alternative: Using the meta robots meta tag

1. Robots.txt file location

The Robots.txt file must be created in the root directory of our website and, as its name indicates, it is a simple text file with a .txt extension. On the server, its name must be written entirely in lowercase (robots.txt). We must make sure that it has public read permissions so that it can be accessed from the outside, for example, permissions 664.

If the file does not exist on our website, we must connect to our server via FTP and create it. The most widely used CMSs, such as Drupal or WordPress, have plugins that create and configure this file for us if it does not exist.
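Because the file always lives in the root directory, its address can be worked out from any page of the site. The following is a minimal sketch in Python; the function name robots_txt_url and the example.com address are assumptions made only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address of the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and the host: the file always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only to show the result.
print(robots_txt_url("https://www.example.com/articles/january?page=2"))
# https://www.example.com/robots.txt
```

Opening the resulting address in a browser is a quick way to confirm that the file exists and is publicly readable.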

2. Types of robots that can visit our website

Although Google's Googlebot is the most popular crawler, it is also worth considering Bingbot from the Bing search engine, the Russian YandexBot, Yahoo's Slurp, the Alexa bot (ia_archiver) and Baiduspider from the Chinese search engine Baidu.

There are also other bots with more specific functions, such as Googlebot-Image, which is in charge of crawling and indexing exclusively the images of websites.

There are many crawler bots and many of them do not crawl our website with good intentions: they range from bots looking for security holes to content extraction programs that duplicate our website. Keep in mind that the Robots.txt file is only a set of guidelines, so badly behaved bots may simply ignore it.

3. Editing the Robots.txt file

It is very important to keep in mind that, by default, bots may crawl all the pages of a website, so all of them will be indexable. Through the Robots.txt file we can give some guidelines to the different bots that visit us to tell them what content they can access and what they should not crawl. We can do all this through a few simple commands:

User-agent: Used to indicate the robot to which the rules to be defined below will be applied.
Syntax: User-agent: BotName
Example: User-agent: Googlebot

Disallow: Used to indicate to the robots that they should not crawl the URL or URLs that match the pattern defined below.
Syntax: Disallow: Pattern
Example: Disallow: /comments

Allow: Used to tell robots that they may crawl the URL or URLs that match the pattern defined below. When an Allow and a Disallow instruction both match the same URL, Google applies the most specific one, that is, the one with the longest pattern, and if both patterns have the same length, Allow wins. This lets us use Allow to open up a page or pages located inside a section that a shorter Disallow instruction blocks.
Syntax: Allow: Pattern
Example: Allow: /readme.html

Sitemap: Used to specify where the sitemap of our website is located.
Syntax: Sitemap: UrlofSitemap
Example: Sitemap: http://www.ma-no.org/sitemap.xml
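The four commands above can be tried out with the urllib.robotparser module of the Python standard library. This is only a rough sketch: the file content is assembled from the examples in this section, and this parser applies the first matching rule in file order and does not understand wildcards, so it approximates Google's behaviour only for simple files like this one. The site_maps() method needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Sample file assembled from the examples of this section.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /comments
Allow: /readme.html

Sitemap: http://www.ma-no.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is named in the file, so its rules apply.
print(parser.can_fetch("Googlebot", "http://www.ma-no.org/comments"))     # False
print(parser.can_fetch("Googlebot", "http://www.ma-no.org/readme.html"))  # True

# No group mentions Bingbot, so nothing is blocked for it.
print(parser.can_fetch("Bingbot", "http://www.ma-no.org/comments"))       # True

# Sitemap lines are independent of the User-agent groups.
print(parser.site_maps())  # ['http://www.ma-no.org/sitemap.xml']
```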

When specifying patterns, there are a number of special characters. We will first see what these characters are and then explain how they are used by means of some examples.

*: The asterisk is a wildcard that is equivalent to any sequence of characters, including none.

$: The dollar sign marks the end of the URL. By default, a pattern matches any URL that begins with it, so more characters can follow the last one we write unless we close the pattern with this sign.

Finally, it is important to note that the paths in the Robots.txt file are case sensitive, so "Disallow: /file.html" is not the same as "Disallow: /File.html". The names of the instructions themselves (User-agent, Disallow, Allow, Sitemap) are not case sensitive.
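One way to picture how these special characters behave is to translate a pattern into a regular expression. The sketch below is an illustration written for this article, not the code any search engine actually runs; the helper names pattern_to_regex and matches are assumptions.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every piece literally, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match starts at the beginning of the path, so a pattern without "$"
    # behaves as a prefix: more characters may follow it.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/articles/", "/articles/example"))             # True
print(matches("/articles$", "/articles/example"))             # False
print(matches("/page/*/articles/", "/page/2/articles/july"))  # True
print(matches("/file.html", "/File.html"))                    # False
```

The last line shows the case sensitivity of paths: the capital "F" is enough for the pattern not to match.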

All of this is much easier to understand by means of some simple examples.

3.1 Blocking a page and lower level pages

User-agent: *  
Disallow: /articles/

What we are doing with the User-agent asterisk is indicating that the following instruction or instructions will be applied for all bots. This will be maintained until the end of the document or until the User-agent command appears again referring to another bot or bots.

By means of the Disallow instruction, we are telling the bots not to crawl the page "/articles/", always starting from our root directory. It is a common mistake to think that only this URL will be blocked: as we have explained before, it is assumed that there can be more characters after the last character, which in this case is the "/" of "/articles/". For example, the URL "/articles/example" and other URLs starting with "/articles/" will also be blocked. Next we will see how to block only the page "/articles", leaving the pages that hang from it at a lower level, such as "/articles/july" or "/articles/august", open to crawling.

3.2 Blocking a page while maintaining access to lower level pages

User-agent: *  
Disallow: /articles$

This case is the same as the previous one, with the difference that by means of the dollar sign we delimit the URL so that only "/articles" is excluded, and lower level pages such as "/articles/january" or "/articles/february" can still be crawled.

Note that we have left out the slash at the end of the URL. With the dollar sign, the pattern matches "/articles" exactly, so a URL written with the trailing slash, "/articles/", is not blocked by this rule. If both forms exist on our website, the second one needs its own line, "Disallow: /articles/$".

3.3 Blocking a page and all the lower level pages except those we define

User-agent: *  
Disallow: /articles/  
Allow: /articles/january

By default, bots are allowed to access all pages. What we do first is to prevent access to the page "/articles/" and all the lower level pages, and then, by means of Allow, we permit the URL "/articles/january" to be crawled, since its longer pattern makes it the more specific rule. In this way, "/articles/january" and any URL that begins with that text will remain accessible, but not the pages "/articles/february", "/articles/march" and other subpages.

3.4 Blocking all lower-level pages but allowing access to the top-level one

User-agent: *  
Allow: /articles/$  
Disallow: /articles/

In this case, the Allow instruction matches the page "/articles/" and only that page, since the dollar sign closes the pattern. The Disallow instruction matches "/articles/" and all the lower level subpages.

For the URL "/articles/" both instructions apply, and the Allow one wins because its pattern is longer and therefore more specific, so the page can be crawled. The lower level pages such as "/articles/january" match only the Disallow instruction and are blocked. For Google, the order in which the two lines are written makes no difference.
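The way Google chooses between Allow and Disallow instructions that match the same URL (the longest pattern wins and, on a tie, Allow wins) can be sketched in a few lines of Python. The functions rule_matches and is_allowed are hypothetical helpers written for this illustration, and other crawlers may resolve such conflicts differently. The type hints need Python 3.9 or later.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Prefix match with "*" as a wildcard and a trailing "$" as the end anchor."""
    regex = ".*".join(re.escape(part) for part in pattern.rstrip("$").split("*"))
    if pattern.endswith("$"):
        regex += "$"
    return re.match(regex, path) is not None


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching pattern wins; on a tie, Allow wins; no match means allowed."""
    best_length, allowed = -1, True
    for directive, pattern in rules:
        if not rule_matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length, allowed = len(pattern), is_allow
    return allowed


# Rules of example 3.3
rules_3_3 = [("Disallow", "/articles/"), ("Allow", "/articles/january")]
print(is_allowed(rules_3_3, "/articles/january"))   # True
print(is_allowed(rules_3_3, "/articles/february"))  # False

# Rules of example 3.4
rules_3_4 = [("Allow", "/articles/$"), ("Disallow", "/articles/")]
print(is_allowed(rules_3_4, "/articles/"))          # True
print(is_allowed(rules_3_4, "/articles/january"))   # False
```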

3.5 Blocking URLs using wildcards

User-agent: *  
Disallow: /page/*/articles/

What we are indicating by means of the Disallow instruction of the example is that the URLs that have "/page/" as their first element and "/articles/" after it should not be crawled, regardless of what appears between them. As we can see, the asterisk can be used to replace any character string.

3.6 Assigning different instructions for different robots

User-agent: *
Disallow: /hide

User-agent: WebZIP
Disallow: /

In the example, we first tell all bots not to crawl the "/hide" page or any URL that begins with "/hide". Then we select the "WebZIP" bot and tell it not to crawl any URL of our website, indicating it with a slash "/", which represents the root directory. It is possible to reference many robots in the Robots.txt file. A robot that finds a group of instructions addressed to it by name follows only that group and ignores the general one marked with the asterisk, so any general instruction that should also apply to that robot must be repeated in its own group.
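The effect of having one group for all robots and another for a specific robot can be checked with the urllib.robotparser module of the Python standard library. This is a sketch under assumptions: the example.com URLs are hypothetical, and this parser is simpler than Google's, although it treats named groups and the asterisk group in the same way for a file like this one.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /hide

User-agent: WebZIP
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# WebZIP has its own group, which blocks the whole site.
print(parser.can_fetch("WebZIP", "https://www.example.com/index.html"))     # False

# Any other bot falls back to the general group.
print(parser.can_fetch("Googlebot", "https://www.example.com/hide"))        # False
print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))  # True
```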

3.7 Telling crawler robots where the sitemap of the site is located

Sitemap: http://www.ma-no.org/sitemap.xml

Using the Sitemap command, we can tell the bots where the sitemap of the site is located, useful to help them find all the URLs. It is not essential, but any help is always welcome.

4. Recommendations for the Robots.txt file

It is recommended that, when a page can be crawled, all of its images, CSS files and JavaScript files can be crawled as well. Google needs to see the page as closely as possible to the way a human visitor sees it. In other words, so that Google can render our pages correctly and our rankings do not suffer, CSS files, JavaScript files and images should not be blocked in the Robots.txt file.

5. Alternative: Using the meta robots meta tag

In addition to the Robots.txt file, we can also tell the robots whether or not to index certain pages using the robots meta tag, which can take the values index or noindex. It can also take a second value, follow or nofollow, which tells the robots whether they should follow the links on the page.

This meta tag can be used in combination with the Robots.txt file, but the two work at different moments. The file gives the robots information beforehand, so they do not even have to download the page to know whether they can crawl it. The meta tag, on the other hand, is only seen when the page is crawled, so a noindex placed on a page that is blocked in the Robots.txt file will never be read, and that URL can still end up in the index if other pages link to it.
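To check which values a page declares in its robots meta tag, the HTML can be inspected with the html.parser module of the Python standard library. The class name MetaRobotsParser and the sample page below are assumptions made for this sketch (Python 3.9 or later for the type hint).

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag found in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Values are separated by commas and are not case sensitive.
            self.directives.update(
                value.strip().lower() for value in content.split(",") if value.strip()
            )


# Hypothetical page that asks not to be indexed but lets its links be followed.
HTML = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'

parser = MetaRobotsParser()
parser.feed(HTML)
print(sorted(parser.directives))  # ['follow', 'noindex']
```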

Janeth Kent

Fine Arts graduate and programmer by passion. When I have some free time I retouch photos, edit videos and design things. The rest of the time I write for MA-NO WEB DESIGN AND DEVELOPMENT.
