What you can expect from this article

This article explains what the robots.txt file is and how you can use it to:

  1. prevent search engines from accessing certain parts of your website
  2. prevent duplicate content
  3. have search engines crawl your website more efficiently.

What is a Robots.txt file?

A robots.txt file tells search engines your website’s rules of engagement.

Before visiting any page on a website, search engines will first try to fetch the robots.txt file to see if there are any instructions for crawling the website. We call these instructions ‘directives’.

If there’s no robots.txt file present or if there are no applicable directives, search engines will crawl the entire website.
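To illustrate, here's a minimal sketch of how a well-behaved crawler might perform that check using Python's standard library (the URL and user-agent are placeholders):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file (placeholder URL) and fetch it.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

# Ask whether a given user-agent may fetch a given URL.
print(parser.can_fetch("MyCrawler", "https://www.example.com/wp-admin/"))  # False if /wp-admin/ is disallowed
print(parser.can_fetch("MyCrawler", "https://www.example.com/blog/"))      # True if no directive blocks it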

Although all major search engines respect the robots.txt file, search engines may choose to ignore (parts of) your robots.txt file. It’s important to remember the robots.txt file is a set of optional directives to search engines rather than a mandate.

Synonyms for Robots.txt

Although these terms are not commonly used, the robots.txt file is sometimes referred to as the robots exclusion standard or the robots exclusion protocol.

Why should you care about Robots.txt?

The robots.txt file plays an essential role from a search engine optimization (SEO) point of view. It tells search engines how they can best crawl your website.

Using the robots.txt file you can prevent search engines from accessing certain parts of your website, prevent duplicate content and give search engines helpful tips on how they can crawl your website more efficiently.

Example

Let’s look at an example to illustrate this:

You’re running an e-commerce website and visitors can use a filter to quickly search through your products. This filter generates pages that show essentially the same content as other pages. This works great for users, but it confuses search engines because it creates duplicate content. You don’t want search engines to index these filtered pages and waste valuable crawl time on these URLs with filtered content.

Preventing duplicate content can also be done using the canonical URL or the meta robots tag; however, these don’t help search engines crawl only the pages that matter. Using a canonical URL or meta robots tag will not prevent search engines from crawling these pages. It will only prevent search engines from showing these pages in the search results. Since search engines have limited time to crawl a website, this time should be spent on pages that you want to appear in search engines.
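For reference, both of those alternatives live in the HTML of the page itself rather than in the robots.txt file; a minimal sketch with placeholder URLs:

<!-- Canonical URL pointing search engines at the preferred version of the page -->
<link rel="canonical" href="https://www.example.com/category/" />

<!-- Meta robots tag telling search engines not to index this page -->
<meta name="robots" content="noindex, follow">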

What does a Robots.txt file look like?

An example of what a simple robots.txt file for a WordPress website may look like:

User-agent: *
Disallow: /wp-admin/

Let’s explain the anatomy of a robots.txt file based on the example above:

  • User-agent: the user-agent indicates for which search engines the directives that follow are meant.
  • *: this indicates that the directives are meant for all search engines.
  • Disallow: this is a directive indicating what content is not accessible to the user-agent.
  • /wp-admin/: this is the path which is inaccessible for the user-agent.

In summary: this robots.txt file tells all search engines to stay out of the /wp-admin/ directory.

User-agent in robots.txt

Each search engine should identify itself with a user-agent. Google’s robots identify as Googlebot, for example, Yahoo’s robots as Slurp, Bing’s robot as BingBot, and so on.

The user-agent record defines the start of a group of directives. All directives in between the first user-agent and the next user-agent record are treated as directives for the first user-agent.

Directives can apply to specific user-agents, but they can also be applicable to all user-agents. In that case, a wildcard is used: User-agent: *.
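As a brief sketch with made-up paths, the file below contains one group of directives for Google’s crawler and another group for all other crawlers:

User-agent: googlebot
Disallow: /only-for-google/

User-agent: *
Disallow: /for-all-robots/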

Disallow in robots.txt

You can tell search engines not to access certain files, pages or sections of your website. This is done using the Disallow directive. The Disallow directive is followed by the path that should not be accessed. If no path is defined, the directive is ignored.

Example

User-agent: *
Disallow: /wp-admin/

In this example all search engines are told not to access the /wp-admin/ directory.

Allow in robots.txt

The Allow directive is used to counteract a Disallow directive. The Allow directive is supported by Google and Bing. Using the Allow and Disallow directives together you can tell search engines they can access a specific file or page within a directory that’s otherwise disallowed. The Allow directive is followed by the path that can be accessed. If no path is defined, the directive is ignored.

Example

User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/

In the example above all search engines are not allowed to access the /media/ directory, except for the file /media/terms-and-conditions.pdf.

Important: when using Allow and Disallow directives together, be sure not to use wildcards since this may lead to conflicting directives.

Example of conflicting directives

User-agent: *
Allow: /directory
Disallow: /*.html

Search engines will not know what to do with the URL http://www.domain.com/directory.html; it’s unclear to them whether they’re allowed to access it.

Separate line for each directive

Each directive should be on a separate line, otherwise search engines may get confused when parsing the robots.txt file.

Example of incorrect robots.txt file

Avoid a robots.txt file like this:

User-agent: *
Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/
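The same rules written correctly, with each directive on its own line:

User-agent: *
Disallow: /directory-1/
Disallow: /directory-2/
Disallow: /directory-3/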

Using wildcard *

Not only can the wildcard be used for defining the user-agent, it can also be used to match URLs that contain a certain string. The wildcard is supported by Google, Bing, Yahoo and Ask.

Example

User-agent: *
Disallow: /*?

In the example above all search engines aren’t allowed access to URLs which include a question mark (?).

Using end of URL $

To indicate the end of a URL, you can use the dollar sign ($) at the end of the path.

Example

User-agent: *
Disallow: /*.php$

In the example above, search engines aren’t allowed to access any URLs that end with .php.

Sitemap in robots.txt

Even though the robots.txt file was originally meant to tell search engines which pages not to crawl, it can also be used to point search engines to the XML sitemap. This is supported by Google, Bing, Yahoo and Ask.

The XML sitemap should be referenced as an absolute URL. The URL does not have to be on the same host as the robots.txt file. Referencing the XML sitemap in the robots.txt file is one of the best practices we advise you to always do, even though you may have already submitted your XML sitemap in Google Search Console or Bing Webmaster Tools. Remember, there are more search engines out there.

Please note that it’s possible to reference multiple XML sitemaps in a robots.txt file.

Examples

Multiple XML sitemaps:

User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap1.xml
Sitemap: https://www.example.com/sitemap2.xml

The example above tells all search engines not to access the directory /wp-admin/ and that there are two XML sitemaps which can be found at https://www.example.com/sitemap1.xml and https://www.example.com/sitemap2.xml.

A single XML sitemap:

User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml

The example above tells all search engines not to access the directory /wp-admin/ and that the XML sitemap can be found at https://www.example.com/sitemap_index.xml.

Comments

Comments are preceded by a ‘#’ and can either be placed at the start of a line or after a directive on the same line. Comments are meant for humans only.

Example 1

# Don’t allow access to the /wp-admin/ directory for all robots.
User-agent: *
Disallow: /wp-admin/

Example 2

User-agent: * #Applies to all robots
Disallow: /wp-admin/ #Don’t allow access to the /wp-admin/ directory.

The examples above communicate the same thing.

Crawl-delay in robots.txt

The Crawl-delay directive is an unofficial directive used to prevent servers from being overloaded with too many requests. If search engines are able to overload your server, adding a Crawl-delay to your robots.txt file is only a temporary fix. The fact of the matter is that your website is running on a poor hosting environment, and you should fix that as soon as possible.

The way search engines handle the Crawl-delay differs. Below we explain how major search engines handle it.

Google

Google does not support the Crawl-delay directive. However, Google does support defining a crawl rate in Google Search Console. Follow the steps below to set it:

  1. Log onto Google Search Console.
  2. Choose the website you want to define the crawl rate for.
  3. Click on the gear icon at the top right and select ‘Site Settings’.
  4. There’s an option called ‘Crawl rate’ with a slider where you can set the preferred crawl rate. By default the crawl rate is set to “Let Google optimize for my site (recommended)”.

 

Define Crawl Rate in Google Search Console

Bing, Yahoo and Yandex

Bing, Yahoo and Yandex all support the Crawl-delay directive to throttle crawling of a website (see the documentation for Bing, Yahoo and Yandex). The Crawl-delay directive should be placed right after the Disallow or Allow directives.

Example:

User-agent: BingBot
Disallow: /private/
Crawl-delay: 10

Baidu

Baidu does not support the Crawl-delay directive; however, it’s possible to register a Baidu Webmaster Tools account, in which you can control the crawl frequency, similar to Google Search Console.

When to use a Robots.txt?

We recommend always using a robots.txt file. There’s absolutely no harm in having one, and it’s a great place to hand search engines directives on how they can best crawl your website.

Best practices for Robots.txt

Location and filename

The robots.txt file should always be placed in the root of a website (in the top-level directory of the host) and carry the filename robots.txt, for example: https://www.example.com/robots.txt. Note that the URL for the robots.txt file is, like any other URL, case-sensitive.

If the robots.txt file cannot be found in the default location, search engines will assume there are no directives and will crawl your entire website.

Order of precedence

It’s important to note that search engines handle robots.txt files differently. By default, the first matching directive always wins.

However, with Google and Bing, specificity wins: an Allow directive beats a Disallow directive if the length of its path (in characters) is longer.

Example

User-agent: *
Allow: /about/company/
Disallow: /about/

In the example above all search engines, including Google and Bing, are not allowed to access the /about/ directory, except for the sub-directory /about/company/.

Example

User-agent: *
Disallow: /about/
Allow: /about/company/

In the example above all search engines except for Google and Bing aren’t allowed to access the /about/ directory, including /about/company/.

Google and Bing are allowed access because the Allow directive is longer than the Disallow directive.

Only one group of directives per robot

You can only define one group of directives per search engine. Having multiple groups of directives for one search engine confuses them.
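As an illustrative sketch (the paths are made up), avoid repeating the user-agent record like this:

User-agent: googlebot
Disallow: /admin/

User-agent: googlebot
Disallow: /private/

Instead, combine the directives into a single group:

User-agent: googlebot
Disallow: /admin/
Disallow: /private/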

Be as specific as possible

The Disallow directive triggers on partial matches as well. Be as specific as possible when defining the Disallow directive to prevent unintentionally disallowing access to files.

Example:

User-agent: *
Disallow: /directory

The example above doesn’t allow search engines access to:

  • /directory/
  • /directory-name-1
  • /directory-name.html
  • /directory-name.php
  • /directory-name.pdf
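If you only mean to block the /directory/ directory and everything in it, adding a trailing slash makes the rule more specific:

User-agent: *
Disallow: /directory/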

Directives for all robots while also including directives for a specific robot

For a robot only one group of directives is valid. If directives meant for all robots are followed by directives for a specific robot, only the specific directives will be taken into consideration. For the specific robot to also follow the directives for all robots, you need to repeat these directives for the specific robot.

Let’s look at an example which will make this clear:

Example

User-agent: *
Disallow: /secret/
Disallow: /not-launched-yet/

User-agent: googlebot
Disallow: /not-launched-yet/

In the example above all search engines except for Google will not be allowed to access /secret/ and /not-launched-yet/. Google will only not be allowed access to /not-launched-yet/, but will be allowed access to /secret/.

If you don’t want googlebot to access /secret/ and /not-launched-yet/ then you need to repeat these directives for googlebot specifically:

User-agent: *
Disallow: /secret/
Disallow: /not-launched-yet/

User-agent: googlebot
Disallow: /secret/
Disallow: /not-launched-yet/

Robots.txt file for each (sub)domain

Directives included in a robots.txt file only apply to the host where the file is hosted.

Examples

http://example.com/robots.txt is valid for http://example.com, but not for http://www.example.com or https://example.com.

Conflicting guidelines: robots.txt vs. Google Search Console

If your robots.txt file conflicts with settings defined in Google Search Console, Google often chooses to use the settings defined in Google Search Console over the directives defined in the robots.txt file.

Check robots.txt after launch

After launching new features or moving a new website from a test environment to the production environment, always check the robots.txt file for a Disallow: /.
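A minimal sketch of such a check in Python, assuming a placeholder production URL:

from urllib.request import urlopen

# Fetch the live robots.txt file (placeholder URL).
robots_txt = urlopen("https://www.example.com/robots.txt").read().decode("utf-8")

for line in robots_txt.splitlines():
    # Strip comments and surrounding whitespace before comparing.
    rule = line.split("#", 1)[0].strip()
    if rule.replace(" ", "").lower() == "disallow:/":
        print("Warning: robots.txt still disallows the entire site!")
        break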

Don’t use noindex in your robots.txt

Although some say it’s a good idea to use a noindex directive in your robots.txt file, it’s not an official standard and Google openly recommends against using it. It’s not clear why, but we believe we should take their recommendation seriously.

Examples of robots.txt files

In this section we’ll cover a wide range of robots.txt examples.

All robots can access everything

There are multiple ways to tell search engines they can access all files:

User-agent: *
Disallow:

Alternatively, you can have an empty robots.txt file or no robots.txt file at all.

All robots don’t have access

User-agent: *
Disallow: /

Please note: one extra character can make all the difference.

All Google bots don’t have any access

User-agent: googlebot
Disallow: /

Please note that when disallowing Googlebot, this goes for all Googlebots. That includes Google robots that crawl, for instance, news (googlebot-news) and images (googlebot-images).

All Google bots, except for Googlebot News, don’t have access

User-agent: googlebot
Disallow: /

User-agent: googlebot-news
Disallow:

Googlebot and Slurp don’t have any access

User-agent: Slurp
User-agent: googlebot
Disallow: /

All robots don’t have access to two directories

User-agent: *
Disallow: /admin/
Disallow: /private/

All robots don’t have access to one specific file

User-agent: *
Disallow: /directory/some-pdf.pdf

Googlebot doesn’t have access to /admin/ and Slurp doesn’t have access to /private/

User-agent: googlebot
Disallow: /admin/

User-agent: Slurp
Disallow: /private/

Robots.txt for WordPress

The robots.txt file below is specifically optimized for WordPress, assuming:

  • You don’t want your admin section to be crawled.
  • You don’t want your internal search result pages to be crawled.
  • You don’t want your tag and author pages to be crawled.
  • You don’t want your 404 page to be crawled.

User-agent: *
Disallow: /wp-admin/ #block access to admin section
Disallow: /wp-login.php #block access to admin section
Disallow: /search/ #block access to internal search result pages
Disallow: *?s=* #block access to internal search result pages
Disallow: *?p=* #block access to pages for which permalinks fail
Disallow: *&p=* #block access to pages for which permalinks fail
Disallow: *&preview=* #block access to preview pages
Disallow: /tag/ #block access to tag pages
Disallow: /author/ #block access to author pages
Disallow: /404-error/ #block access to 404 page

Sitemap: https://www.example.com/sitemap_index.xml

Please note that this robots.txt file will work in most cases, but you should always adjust it and test it to make sure it applies to your exact situation.

What are the limitations of Robots.txt?

Robots.txt file contains directives

Even though the robots.txt is well respected by search engines, it’s still a directive and not a mandate.

Pages still appearing in search results

Pages that are inaccessible to search engines because of robots.txt can still appear in search results if they’re linked from a page that is crawled. An example of what this looks like:

Example of a Google search result where no description is shown because the page is blocked by robots.txt.

Pro tip: it’s possible to remove these URLs from Google using Google Search Console’s URL removal tool. Please note that these URLs will only be temporarily removed. In order for them to stay out of Google’s result pages, you need to repeat the removal every 90 days.

Caching

Google has indicated that a robots.txt file is generally cached for up to 24 hours. It’s important to take this into consideration when you make changes in your robots.txt file.

It’s unclear how other search engines deal with caching of robots.txt.

File size

For robots.txt files, Google currently supports a file size limit of 500 KB. Any content beyond this maximum file size may be ignored.

It’s unclear whether other search engines have a maximum filesize for robots.txt files.

Frequently asked questions about Robots.txt

  1. Will using a robots.txt file prevent search engines from showing disallowed pages in the search engine result pages?
  2. Should I be careful about using a robots.txt file?
  3. Is it illegal to ignore robots.txt when scraping a website?
  4. I don’t have a robots.txt file. Will search engines still crawl my website?
  5. Can I use Noindex instead of Disallow in my robots.txt file?
  6. What search engines respect the robots.txt file?
  7. How can I prevent search engines from indexing search result pages on my WordPress website?

1. Will using a robots.txt file prevent search engines from showing disallowed pages in the search engine result pages?

No, take this example:

Example of a Google search result where no description is shown because the page is blocked by robots.txt.

Also: if a page is disallowed using robots.txt and the page itself contains a <meta name="robots" content="noindex,nofollow">, then search engine robots will still keep the page in the index, because they’ll never find out about the <meta name="robots" content="noindex,nofollow"> since they’re not allowed to access the page.

2. Should I be careful about using a robots.txt file?

Yes, you should be careful. But don’t be afraid to use it. It’s a great tool to help search engines better crawl your website.

3. Is it illegal to ignore robots.txt when scraping a website?

From a technical point of view, no. The robots.txt file is an optional directive. We can’t say anything about it from a legal point of view.

4. I don’t have a robots.txt file. Will search engines still crawl my website?

Yes. When search engines don’t find a robots.txt file in the root (the top-level directory of the host), they’ll assume there are no directives for them and will try to crawl your entire website.

5. Can I use Noindex instead of Disallow in my robots.txt file?

No, this is not advisable. Google specifically recommends against using the noindex directive in the robots.txt file.

6. What search engines respect the robots.txt file?

All major search engines covered in this article respect the robots.txt file, including Google, Bing, Yahoo, Yandex, Ask and Baidu.

7. How can I prevent search engines from indexing search result pages on my WordPress website?

Including the following directives in your robots.txt prevents all search engines from indexing search result pages on your WordPress website, assuming no changes were made to the functioning of the search result pages.

User-agent: *
Disallow: /?s=
Disallow: /search/
