Understanding the X-Robots-Tag HTTP Header

In general, we use robots.txt files to tell search engines which files and folders they are or are not allowed to crawl, and the X-Robots-Tag HTTP header serves a closely related purpose. Using these features benefits both search engines and web servers: disallowing crawler access to unimportant areas of a website reduces server load.

Before continuing, let’s take a look at what robots.txt files do. In simple terms, a robots.txt file tells search engines not to crawl specific pages, files, or directories on your website.

Blocking the entire site with robots.txt is not recommended unless it is a very private site.
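
For reference, a minimal robots.txt that blocks an entire site for all crawlers looks like this:

User-agent: *
Disallow: /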

X-Robots-Tag

Back in 2007, Google announced support for the X-Robots-Tag directive, which meant that search engine access could be restricted not only via the robots.txt file, but also programmatically, by setting headers on the HTTP response. It works alongside the directives in robots.txt.

X-Robots-Tag directives

There are two different types of directives: crawler directives and indexer directives, and this article will briefly explain the differences below.

Crawler directives

robots.txt files contain only “crawler directives”, which tell search engines where they are or are not allowed to go. The following directive specifies where search engines are allowed to crawl:

Allow

This directive does the exact opposite (disallows crawling):

Disallow

Additionally, the following directive can be used to help search engines crawl your site more efficiently (by submitting a sitemap):

Sitemap

Note that it is also possible to specify directives for different search engines by combining the directives above with the following one (a combined example is shown below):

User-agent
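
As a combined illustration (the paths and the sitemap URL below are only placeholders), a robots.txt using all of these directives might look like this:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Sitemap: https://www.example.com/sitemap.xml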

However, resources blocked with Disallow can sometimes still appear in search engine results, which shows that robots.txt alone is not enough.

Indexer directives

Indexer directives are directives that are set on a per-page and/or per-element basis. As of July 2007, there were two of them: rel="nofollow" (indicating that a link should not pass authority/PageRank) and the Meta Robots tag.
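
For example, a link marked with this attribute looks like the following (the URL is only a placeholder):

<a href="https://example.com/some-page/" rel="nofollow">Example link</a>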

With the Meta Robots tag, you can genuinely stop search engines from showing pages you want to keep out of search results. The same result can be achieved using the X-Robots-Tag HTTP header, and as mentioned earlier, X-Robots-Tag offers more flexibility by letting you control how specific files (or file types) are indexed.
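
For example, placing the following Meta Robots tag in a page's <head> keeps that page out of search results, and sending the equivalent X-Robots-Tag: noindex header in the HTTP response has the same effect:

<meta name="robots" content="noindex">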

Usage example of X-Robots-Tag

If you want to prevent search engines from indexing pages generated with PHP, you can add the following to the beginning of the header.php file (in WordPress):

header("X-Robots-Tag: noindex", true);

If you also want to tell search engines not to follow the links on these pages, you can use this example:

header("X-Robots-Tag: noindex, nofollow", true);

Although using this method in PHP is very convenient, if you want to block specific file types that are not served through PHP, a better way is to add the X-Robots-Tag to your Nginx/Apache server configuration or .htaccess file.

If a website serves .doc files but, for some reason, does not want search engines to index that file type, the X-Robots-Tag can be used. On an Apache server, the following lines should be added to the site's .htaccess file:

<FilesMatch ".doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

If you want to do this for both .doc and .pdf files:

<FilesMatch ".(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

If you are running Nginx instead of Apache, you can get the same effect by adding the following to your server configuration:

location ~* \.(doc|pdf)$ {
    add_header  X-Robots-Tag "noindex, noarchive, nosnippet";
}
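
Whichever configuration you use, you can check that the header is actually being sent by inspecting the response headers, for example with curl (the URL is only a placeholder):

curl -I https://example.com/files/report.pdf

The response should then include a line such as X-Robots-Tag: noindex, noarchive, nosnippet.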

Conclusion

As you can see from the examples above, the X-Robots-Tag HTTP header is a very powerful tool and can be used together with robots.txt for even better results.
