What is a robots.txt file and what does it do?


May 9, 2017

Author: Sascha Lienesch


A robots.txt file is a simple text file which webmasters can create to tell web crawlers which parts of a website should be crawled and which should not. The file is stored in the main directory (root) of the server. When a crawler arrives at a website, it first reads the robots.txt file to determine which parts of the website it should crawl and which parts it should ignore, in accordance with the so-called Robots Exclusion Standard (also known as the Robots Exclusion Protocol). You don't have to create a robots.txt file, but it's often advisable to do so.

With a robots.txt file, it’s possible to exclude entire directories from crawls. If necessary, you can even block bots from an entire domain.

Allow and Disallow

On a basic level, there are two instructions which you can give the crawler – allow and disallow. All files in a domain are free to be crawled by default, so if you want to exclude any, you need to set them to "disallow."


Disallow: /admin

This command would exclude the admin directory from crawls, as simple as that. If you have other directories which you would like to exclude from crawls, simply add them on lines beneath.

Disallow: /wp-admin/
Disallow: /xmlrpc.php

And so on, and so on.

An “allow” command has the opposite effect. If you have a directory which should be “disallowed” but which contains a single URL which should still be crawled, you can “allow” this address to be crawled by starting a line in your robots.txt file with “allow.” With this command, you can exclude large parts of a domain from crawls while still including a small, relevant section.
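If you want to sanity-check such a combination, Python's built-in urllib.robotparser can evaluate the rules locally. The paths below are hypothetical, and the allow line is placed first because Python's parser applies rules in file order (Google instead uses the most specific matching rule):

```python
import urllib.robotparser

# Hypothetical rules: the /private/ directory is blocked,
# but one page inside it remains crawlable.
rules = """\
User-agent: *
Allow: /private/press-release.html
Disallow: /private/
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# The allowed page is crawlable, the rest of the directory is not.
print(parser.can_fetch("*", "https://www.example.com/private/press-release.html"))  # True
print(parser.can_fetch("*", "https://www.example.com/private/internal.html"))       # False
```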

Wildcard commands

Wildcards enable you to exclude certain types of URLs from crawls.


Exclusion of all .gif files (the dollar sign ($) anchors the rule to the end of the URL):

Disallow: /*.gif$

Block all URLs which contain a question mark (?):

Disallow: /*?

An asterisk (*) stands in for any sequence of characters. In this example, all sub-directories whose names begin with the word “private” will be excluded:

Disallow: /private*/
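The wildcard syntax is not part of the original standard, so support varies between crawlers. As an illustration of how Google-style patterns are commonly interpreted, here is a small, hypothetical translation into regular expressions:

```python
import re

def robots_pattern_to_regex(pattern: str):
    """Sketch of Google-style robots.txt pattern matching:
    '*' matches any sequence of characters, and a trailing
    '$' anchors the pattern to the end of the URL."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))

# /*.gif$ matches any URL ending in .gif ...
print(bool(robots_pattern_to_regex("/*.gif$").match("/images/logo.gif")))      # True
print(bool(robots_pattern_to_regex("/*.gif$").match("/images/logo.gif?v=2")))  # False
# ... and /*? matches any URL containing a question mark.
print(bool(robots_pattern_to_regex("/*?").match("/search?q=robots")))          # True
```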

Commands for specific bots

Crawlers and bots have specific names by which they can be recognized on a server. Entries in the robots.txt file can lay out which crawlers must follow which commands. An asterisk (*) denotes a rule for all bots.

Google uses various user agents to crawl the internet, the most important of which is the “Googlebot.” So if you want to explicitly block Google from crawling particular pages or directories, but not other search engines or crawlers, you can address the Googlebot directly in your robots.txt with the following line, followed by the relevant “disallow” commands:

User-agent: Googlebot

Decide which rules apply to which crawlers. Here’s an example:

User-agent: Bingbot

Disallow: /sources/dtd/

User-agent: *

Disallow: /photos/
Disallow: /temp/
Disallow: /photoalbum.html

In this example, Bingbot follows only the rules defined for it, while all other crawlers follow the rules in the catch-all (*) group.
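This behaviour, too, can be checked locally with Python's urllib.robotparser, which applies the first group whose user-agent token matches and falls back to the * group for everyone else (domain and paths are the example ones from above):

```python
import urllib.robotparser

# The example rules from above, as one robots.txt body.
rules = """\
User-agent: Bingbot
Disallow: /sources/dtd/

User-agent: *
Disallow: /photos/
Disallow: /temp/
Disallow: /photoalbum.html
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Bingbot follows only its own group ...
print(parser.can_fetch("Bingbot", "https://www.example.com/sources/dtd/index.dtd"))  # False
print(parser.can_fetch("Bingbot", "https://www.example.com/photos/summer.jpg"))      # True
# ... while any other crawler follows the catch-all group.
print(parser.can_fetch("ExampleBot", "https://www.example.com/photos/summer.jpg"))   # False
```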

Include sitemap in your robots.txt file

It’s a good idea to include the location of your website’s sitemap in the robots.txt file. Since the robots.txt file is the first port of call for crawlers, Google recommends including the sitemap as the final line, like this:

Sitemap: https://www.example.com/sitemap.xml
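Putting the pieces together, a short robots.txt built from the (hypothetical) examples in this article might look like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /xmlrpc.php

Sitemap: https://www.example.com/sitemap.xml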

This allows the search engine to find the sitemap straight away and enables the crawler to better understand the layout and structure of the domain.

Things you should know

  • Commands in a robots.txt file are not binding; crawlers can ignore them. According to Google, however, all genuine, trustworthy crawlers, including their own, follow the commands.
  • Exclusion of a URL via robots.txt does not prevent it from being indexed. The search engine can still discover the URL via internal or external links and index it – it just won't crawl its content.
  • If you don’t want a particular URL to be indexed, don’t block it via robots.txt. If you block the URL, the crawler won’t see the “noindex” tag and therefore won’t know that the URL shouldn’t be indexed. Instead, allow the crawler to crawl the URL but make sure the relevant meta commands are applied (noindex, nofollow, etc.).
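For example, to keep a page out of the index while leaving it crawlable, the page’s <head> would contain a tag like this:

<meta name="robots" content="noindex">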

Tip: Comments intended for other webmasters can be added to the robots.txt file, preceded by a hash sign (#), so that crawlers ignore them.
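For instance (with a hypothetical directory):

# Temporary: block the staging area
Disallow: /staging/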

Test out your robots.txt with the Google Search Console

We have emphasized the importance of the Google Search Console for webmasters in various articles. Add your domain to the console and let Google test whether it has any problems reading your robots.txt file.

The robots.txt tester can be found under the “Crawling” tab in the main menu. With the domain already added to the console, the robots.txt file is directly accessible, and any errors or warnings are displayed just beneath it.

How to test your robots.txt file in the Google Search Console:

  1. Open the tester tool for your site, and scroll through the robots.txt code to locate the highlighted syntax warnings and logic errors. The number of syntax warnings and logic errors is shown immediately below the editor.
  2. Type in the URL of a page on your site in the text box at the bottom of the page.
  3. Select the user-agent you want to simulate in the dropdown list to the right of the text box.
  4. Click the TEST button to test access.
  5. Check to see if the TEST button now reads ACCEPTED or BLOCKED to find out whether the URL you entered is blocked from Google web crawlers.
  6. Edit the file on the page and retest as necessary. Note that changes made in the page are not saved to your site! See the next step.
  7. Copy your changes to your robots.txt file on your site. This tool does not make changes to the actual file on your site; it only tests against the copy hosted in the tool.

Source: https://support.google.com/webmasters/answer/6062598?hl=en&ref_topic=6061961
