Correct usage of robots.txt

robots.txt (uniformly lowercase) is an ASCII-encoded text file stored in the root directory of a website. It tells search engine robots (also known as web spiders or crawlers) which content on the site should not be fetched by the engine's bot, and which content may be fetched.

robots.txt should be lowercase and placed in the root directory of the website

Because URLs on some systems are case-sensitive, the robots.txt filename should be uniformly lowercase, and the file must be placed in the root directory of the website.

If you want to define the behavior of search engine robots in a subdirectory, merge those rules into the robots.txt in the root directory, or use the robots meta tag on the pages themselves (see below).

Placing robots.txt in a subdirectory of a website is not valid.
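
For example, on a hypothetical site www.example.com, crawlers consult only the copy at the root:

https://www.example.com/robots.txt         (read by crawlers)
https://www.example.com/blog/robots.txt    (ignored)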

Note: The robots.txt protocol is not a formal specification but a convention (some crawlers do not even abide by it), so it does not guarantee the privacy of a website's content.
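
Well-behaved clients check robots.txt before fetching pages. As a minimal sketch of this, Python's standard urllib.robotparser module can download and evaluate the file; the www.example.com URL and the MyBot agent name here are placeholders:

from urllib import robotparser

# Fetch and parse the site's robots.txt (URL is a placeholder)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("MyBot", "https://www.example.com/tmp/page.html"))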

robots.txt directives

Allow all search engine crawlers:

User-agent: *
Disallow:

Or, equivalently:

User-agent: *
Allow: /

The User-agent line names the crawler the following rules apply to; the value * matches all crawlers.

For example, the following example allows Baidu search engine to crawl all pages:

User-agent: Baiduspider
Allow: /
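
Rules are grouped by User-agent, and a crawler obeys the most specific group that names it. A sketch combining a Googlebot-specific group with a catch-all group (the paths are hypothetical):

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/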

Common search engine crawlers and their corresponding names:

Crawler name         Corresponding search engine
Baiduspider          Baidu Search
Googlebot            Google Search
Bingbot              Bing Search
360Spider            360 Search
YoudaoBot            Youdao Search
ChinasoSpider        Chinaso Search
Sosospider           Soso Search
Yisouspider          Yisou Search
Sogou web spider     Sogou Search
Sogou inst spider    Sogou Search
Sogou spider2        Sogou Search
Sogou blog           Sogou Search
Sogou News Spider    Sogou Search
Sogou Orion spider   Sogou Search

The list above was last updated in February 2021.

Block all crawlers from accessing a specific directory:

User-agent: *
Disallow: /cgi-bin/
Disallow: /js/
Disallow: /tmp/

Block only Googlebot from accessing specific directories:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /js/
Disallow: /tmp/

Block all bots from accessing certain file types (the * wildcard and the $ end-of-URL anchor are extensions supported by major engines such as Google, Bing, and Baidu, not part of the original convention):

User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
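
An Allow rule can carve an exception out of a broader Disallow; most major engines give precedence to the more specific (longer) matching path, though behavior varies by engine. A sketch with hypothetical paths:

User-agent: *
Disallow: /js/
Allow: /js/public.js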

Automatic discovery of Sitemap files

The Sitemap directive is supported by several major search engines (including Baidu, Google, Bing, and Sogou) and specifies the location of a website's Sitemap file. The Sitemap directive is not restricted to any User-agent group, so it can be placed anywhere in the robots.txt file. Example:

Sitemap: https://www.example.com/sitemap.xml
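
Clients can read this directive programmatically. For instance, Python's urllib.robotparser exposes it through site_maps() (available since Python 3.8; the URL is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Returns the Sitemap URLs declared in robots.txt, or None if there are none
print(rp.site_maps())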

Alternative methods

While robots.txt is the most widely accepted method, it can be combined with the robots META tag. The robots META tag applies to a single page. Like other META tags (language, page description, keywords, etc.), it is placed in the HEAD tag of the page and tells search engine robots how to crawl and index that page's content. Common content values include index, noindex, follow, and nofollow.

<head>
<meta name="robots" content="noindex,nofollow" />
</head>

In addition to the robots.txt file in the root directory of the website and the per-page META tag, the same functionality can be achieved by sending an X-Robots-Tag HTTP header with the response.
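
This is especially useful for non-HTML resources such as PDFs or images, where a META tag cannot be embedded. A sketch of a response carrying the header (the Content-Type is illustrative):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow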
