The Robots.txt file gives Googlebot and the other robots that crawl the web information about which pages and files of our website they may crawl and index. Although it is not essential, the Robots.txt file is of great help to Google and other crawling robots when indexing our page, so it is very important that it is configured correctly.
- 1 Robots.txt file location
- 2 Types of robots that can visit our website
- 3 Editing the Robots.txt file
- 3.1 Blocking a page and lower level pages
- 3.2 Blocking a page while maintaining access to lower level pages
- 3.3 Blocking a page and all lower-level pages except those we define
- 3.4 Blocking all lower-level pages but allowing access to the top-level one
- 3.5 Blocking URLs using wildcards
- 3.6 Assigning different instructions for different robots
- 3.7 Telling crawler robots where the sitemap of the site is located
- 4 Recommendations for the Robots.txt file
- 5 Alternative: Using the meta robots meta tag
1. Robots.txt file location
The Robots.txt file must be created in the root directory of our website and, as its name indicates, it is a simple text file with a .txt extension (the file name itself must be all lowercase: robots.txt). We must make sure that it has public read permissions so that it can be accessed from outside, for example with permissions 664.
If the file does not exist on our website, we must access our server via FTP and create it. There are plugins for the most popular CMSs, such as Drupal or WordPress, that create and configure this file for us if it does not exist.
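If you have shell access instead of FTP, creating the file takes two commands. Run them from your web server's document root (the exact path, e.g. /var/www/html, depends on your server and is an assumption here):

```shell
# Run this from your web server's document root (path varies per server).
touch robots.txt        # create an empty robots.txt if it does not exist
chmod 664 robots.txt    # owner/group read-write, world-readable
ls -l robots.txt        # verify the file and its permissions
```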
2. Types of robots that can visit our website
Although Google's Googlebot is the most popular crawler bot, it is also worth considering Bingbot (from the Bing search engine), the Russian Yandexbot, Yahoo's Slurp, the Alexa bot (ia_archiver) and Baiduspider (from the Chinese search engine Baidu).
There are also other bots with more specific functionalities such as Googlebot-image, in charge of crawling and indexing exclusively the images of websites.
There are many crawler bots, and not all of them visit our website with good intentions: they range from bots looking for security holes to content-extraction programs that duplicate our website.
3. Editing the Robots.txt file
It is very important to keep in mind that, by default, all the pages of a website will be indexable. Through the Robots.txt file we can give some guidelines to the different bots that visit us to tell them what content they can access and what they should not crawl. We can do all this through a few simple basic commands:
User-agent: Used to indicate the robot to which the rules to be defined below will be applied.
Syntax: User-agent: BotName
Example: User-agent: Googlebot
Disallow: Used to indicate to the robots that they should not crawl the URL or URLs that match the pattern defined below.
Syntax: Disallow: Pattern
Example: Disallow: /comments
Allow: Used to tell robots that they may crawl the URL or URLs that match the pattern defined below. When an Allow rule and a Disallow rule conflict, Google applies the most specific (longest) matching rule, and if both are equally specific, Allow wins. In practice, if we explicitly allow a page with Allow, it will remain crawlable even if a broader Disallow instruction also covers it.
Syntax: Allow: Pattern
Example: Allow: /readme.html
Sitemap: Used to specify where the sitemap of our website is located.
Syntax: Sitemap: UrlofSitemap
Example: Sitemap: http://www.ma-no.org/sitemap.xml
When specifying patterns, there are a number of special characters. We will first see what these characters are and then explain how they are used by means of some examples.
*: The asterisk is a wildcard that matches any character or sequence of characters, including none.
$: The dollar sign marks the end of the URL. By default, a pattern matches any URL that merely begins with it, so more characters may follow the last one we write; anchoring the pattern with $ removes that behaviour.
Finally, it is important to note that the Robots.txt file is case sensitive, so "Disallow: /file.html" is not the same as "Disallow: /File.html".
If this still sounds abstract, some simple examples will make everything clear.
3.1 Blocking a page and lower level pages
User-agent: *
Disallow: /articles/
What we are doing with the User-agent asterisk is indicating that the following instruction or instructions will be applied for all bots. This will be maintained until the end of the document or until the User-agent command appears again referring to another bot or bots.
By means of the Disallow instruction, we are telling the bots not to crawl the page "/articles/", the path always starting from our root directory. It is a common mistake to think that only this exact URL will be blocked: as explained above, more characters are assumed to be able to follow the last character of the pattern, which in this case is the "/" of "/articles/". So, for example, the URL "/articles/example" and any other URL starting with "/articles/" will also be blocked. Next we will see how to block only the page "/articles", while keeping the pages below it, such as "/articles/january" or "/articles/august", crawlable.
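Rules like this one can be sanity-checked with Python's standard-library robots.txt parser. Note that urllib.robotparser implements the original prefix-matching standard and does not understand the * and $ wildcards, so it is only suitable for simple prefix rules; the paths below are the hypothetical ones from the example.

```python
from urllib import robotparser

# Parse the example rules directly instead of fetching a live robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /articles/",
])

# "/articles/" and everything below it is blocked for every bot...
print(rp.can_fetch("Googlebot", "/articles/example"))  # False
# ...while unrelated pages remain crawlable.
print(rp.can_fetch("Googlebot", "/about"))             # True
```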
3.2 Blocking a page while maintaining access to lower level pages
User-agent: *
Disallow: /articles$
This case is exactly the same as the previous one, with the difference that by means of the dollar sign we delimit the URL so that only "/articles" is excluded, being able to index lower level pages such as "/articles/january" or "/articles/february".
As we can see, we have omitted the slash at the end of the URL. Note that, because of the $ anchor, this pattern blocks only the exact URL "/articles": if the page is also reachable with a trailing slash as "/articles/", an additional line "Disallow: /articles/$" would be needed to cover that variant as well.
3.3 Blocking a page and all lower-level pages except those we define
User-agent: *
Disallow: /articles/
Allow: /articles/january
By default, bots are allowed to access all pages. First we block access to the page "/articles/" and all its lower-level pages, but with Allow we permit "/articles/january" to be crawled. In this way, only "/articles/january" (and anything below it) will be crawlable, while "/articles/february", "/articles/march" and the other subpages will not.
3.4 Blocking all lower-level pages but allowing access to the top-level one
User-agent: *
Allow: /articles/$
Disallow: /articles/
In this case, the Allow rule grants access to the page "/articles/" and only to it; it says nothing about pages at a lower level, which, by default, would also be accessible to bots at this point.
The following Disallow instruction then excludes "/articles/" and all its lower-level subpages, but since we have explicitly allowed "/articles/" with the instruction immediately above, that page remains crawlable.
3.5 Blocking URLs using wildcards
User-agent: *
Disallow: /page/*/articles/
What we are indicating with the Disallow instruction in this example is that pages whose URL begins with "/page/", followed by anything, followed by "/articles/" should not be crawled, regardless of what appears in between. As we can see, the asterisk can replace any string of characters, including strings that themselves contain "/".
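The * and $ wildcards are not part of the original robots.txt standard, so many simple parsers ignore them. As a rough sketch of how Google-style pattern matching works (an illustration only, not Google's actual implementation), a pattern can be translated into a regular expression:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # '*' matches any sequence of characters; a trailing '$' anchors
    # the pattern to the end of the URL. Everything else is literal.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if anchored:
        regex += "$"
    return re.compile(regex)

# The wildcard example above: any second path element is matched.
rule = robots_pattern_to_regex("/page/*/articles/")
print(bool(rule.match("/page/2/articles/list")))  # True
print(bool(rule.match("/page/articles/")))        # False

# The '$' example from section 3.2: only the exact URL matches.
exact = robots_pattern_to_regex("/articles$")
print(bool(exact.match("/articles")))             # True
print(bool(exact.match("/articles/january")))     # False
```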
3.6 Assigning different instructions for different robots
User-agent: *
Disallow: /hide

User-agent: WebZIP
Disallow: /
In the example, we first tell all bots not to crawl the "/hide" page. We then select the "WebZIP" bot and tell it not to crawl any URL of our website, indicating this with a single slash "/", which represents the root directory. It is possible to reference many robots in the Robots.txt file. Keep in mind that when a robot finds a group of rules addressed to it by name, it will generally obey only that group and ignore the generic "User-agent: *" rules, so any general rule that should also apply to that robot must be repeated inside its own block.
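This per-bot behaviour can be observed with Python's standard urllib.robotparser, which matches a bot against its own group when one exists and falls back to the "*" group otherwise (paths and bot names are the hypothetical ones from the example):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /hide",
    "",
    "User-agent: WebZIP",
    "Disallow: /",
])

# WebZIP matches its own group and is blocked everywhere.
print(rp.can_fetch("WebZIP", "/any-page"))      # False
# Other bots fall back to the generic '*' group.
print(rp.can_fetch("Googlebot", "/hide"))       # False
print(rp.can_fetch("Googlebot", "/contact"))    # True
```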
3.7 Telling crawler robots where the sitemap of the site is located
Using the Sitemap command, we can tell the bots where the sitemap of the site is located, useful to help them find all the URLs. It is not essential, but any help is always welcome.
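Putting it together, a minimal Robots.txt with a Sitemap line might look like this (the paths are the hypothetical ones from the earlier examples, and the sitemap URL is the one used in section 3; adjust both to your own site). The Sitemap line is independent of any User-agent group and can appear anywhere in the file:

```
User-agent: *
Disallow: /hide

Sitemap: http://www.ma-no.org/sitemap.xml
```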
4. Recommendations for the Robots.txt file
5. Alternative: Using the meta robots meta tag
In addition to the Robots.txt file, we can also tell the robots whether or not to index certain pages using the robots meta tag, which can take the value index or noindex. It can also carry a second value, follow or nofollow, to indicate to the robots whether they should follow the links on the page.
These meta tags can be used in combination with the Robots.txt file, but the file gives the robots information up front, so they do not even have to download a page's code to know whether or not they may crawl it.
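For example, a page that should be kept out of the index while its links are still followed would carry this standard robots meta tag in its head (a minimal sketch):

```html
<head>
  <!-- Do not index this page, but do follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```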
A Fine Arts graduate and a programmer by passion. When I have a spare moment I retouch photos, edit videos and design things. The rest of the time I write at MA-NO WEB DESIGN AND DEVELOPMENT.