In general, we use robots.txt files to tell search engines which files and folders they are or are not allowed to crawl, and the X-Robots-Tag HTTP header serves a related purpose. Using these features benefits both search engines and website servers, since disabling crawl access to unimportant areas of a site reduces server load.
Before continuing, let’s take a look at what a robots.txt file does. In simple terms, it tells search engines not to crawl specific pages, files, or directories on your website.
Blocking the entire site with robots.txt is not recommended unless it is a very private site.
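For reference, blocking an entire site takes just two lines in robots.txt; the wildcard user agent applies the rule to every crawler:

User-agent: *
Disallow: /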
X-Robots-Tag
Back in 2007, Google announced support for the X-Robots-Tag directive, which means that search engine access can be restricted not only via the robots.txt file but also programmatically, by setting a header in the HTTP response. It complements the directives available in robots.txt.
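For example, a response for a page that should stay out of the index would simply include a header like this:

X-Robots-Tag: noindex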
X-Robots-Tag directives
There are two different types of directives: crawler directives and indexer directives, and this article will briefly explain the differences below.
Crawler directives
A robots.txt file only contains “crawler directives,” which tell search engines where they are or are not allowed to go. The following directive specifies where search engines are allowed to crawl:
Allow
This directive does the exact opposite (disallows crawling):
Disallow
Additionally, the following directive can be used to help search engines crawl your site more efficiently by pointing them to your sitemap:
Sitemap
Note that it is also possible to scope directives to specific search engines by combining them with the following directive:
User-agent
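To tie these together, here is a minimal robots.txt sketch; the /private/ paths and the sitemap URL are placeholders for your own site:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Sitemap: https://www.example.com/sitemap.xml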
However, resources blocked with Disallow sometimes still appear in search engine results, which shows that robots.txt alone is not enough.
Indexer directives
Indexer directives are directives that are set on a per-page and/or per-element basis. As of July 2007, there were two such directives: rel=“nofollow” (indicating that the link should not pass authority/PageRank) and the Meta Robots tag.
With the Meta Robots tag, you can genuinely stop search engines from showing pages you want to keep out of search results. The same result can be achieved with the X-Robots-Tag HTTP header, and as mentioned earlier, X-Robots-Tag offers more flexibility by letting you control how specific files (and file types) are indexed.
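For comparison, here is roughly what each looks like in practice: a link that should not pass PageRank, a Meta Robots tag placed in a page’s <head>, and the equivalent X-Robots-Tag response header (the URL is a placeholder):

<a href="https://www.example.com/some-page/" rel="nofollow">some link</a>
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow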
Usage example of X-Robots-Tag
If you want to prevent search engines from showing files generated with PHP, you can add the following at the beginning of your header.php file (in WordPress, for example):
header("X-Robots-Tag: noindex", true);
If you also want to prevent search engines from following links on these pages, you can follow this example:
header("X-Robots-Tag: noindex, nofollow", true);
Although this method is very convenient in PHP, if you want to block specific file types that are not served through PHP, a better approach is to add the X-Robots-Tag to your Nginx/Apache server configuration or .htaccess file.
If a website serves .doc files but for some reason does not want search engines to index that file type, X-Robots-Tag can be used. On an Apache server, the following lines should be added to the site’s .htaccess file:
<FilesMatch ".doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
If you want to do this for both .doc and .pdf files:
<FilesMatch ".(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
If you are running Nginx instead of Apache, you can get the same effect by adding the following to your server configuration:
location ~* \.(doc|pdf)$ {
add_header X-Robots-Tag "noindex, noarchive, nosnippet";
}
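After adding either configuration and reloading the server, you can check that the header is actually being sent by requesting a matching file and inspecting the response headers, for example with curl (the URL is a placeholder):

curl -I https://www.example.com/files/example.pdf

The output should include the X-Robots-Tag: noindex, noarchive, nosnippet line.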
Conclusion
As the examples above show, the X-Robots-Tag HTTP header is a very powerful tool, and it can be used together with robots.txt for even better results.