robots.txt (always lowercase) is an ASCII-encoded text file stored in the root directory of a website. It usually tells web search engines’ robots (also known as web spiders or crawlers) which content on the site must not be fetched by crawlers and which content may be fetched.
robots.txt should be lowercase and placed in the root directory of the website
Because URLs are case-sensitive on some systems, the filename robots.txt should be written entirely in lowercase, and the file should be placed in the root directory of the website.
If you want to control the behavior of search engine robots in subdirectories, you can merge those rules into the robots.txt file in the root directory, or use robots metadata (the robots META tag, described below).
Placing robots.txt in a subdirectory of a website has no effect.
Note: the robots.txt protocol is not a formal specification, only a convention (some crawlers do not even abide by it), so it does not guarantee the privacy of a website.
robots.txt directives
Allow all search engine crawlers:
User-agent: *
Disallow:
Or, written another way:
User-agent: *
Allow: /
User-agent names the search engine crawler that the following rules apply to; the value * matches all search engine crawlers.
For example, the following example allows Baidu search engine to crawl all pages:
User-agent: Baiduspider
Allow: /
Common search engine crawlers and their corresponding names:
| Crawler name | Search engine |
| --- | --- |
| Baiduspider | Baidu Search |
| Googlebot | Google Search |
| Bingbot | Bing Search |
| 360Spider | 360 Search |
| YoudaoBot | Youdao Search |
| ChinasoSpider | Chinaso Search |
| Sosospider | Soso Search |
| Yisouspider | Yisou Search |
| Sogou web spider, Sogou inst spider, Sogou spider2, Sogou blog, Sogou News Spider, Sogou Orion spider | Sogou Search |
The table above was last updated in February 2021.
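Rules for several of these crawlers can be combined in a single robots.txt file as separate User-agent groups. As an illustrative sketch (the /archive/ path is hypothetical), the following allows Baiduspider to crawl everything while blocking Googlebot from one directory:

User-agent: Baiduspider
Allow: /

User-agent: Googlebot
Disallow: /archive/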
Block all crawlers from accessing a specific directory:
User-agent: *
Disallow: /cgi-bin/
Disallow: /js/
Disallow: /tmp/
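Disallow rules match by URL-path prefix, so Disallow: /tmp/ blocks every URL beginning with /tmp/. In crawlers that support the Allow extension (such as Googlebot and Baiduspider, which apply the most specific matching rule), a longer Allow rule can re-open part of a blocked directory; the subdirectory below is hypothetical:

User-agent: *
Disallow: /cgi-bin/
Allow: /cgi-bin/public/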
Block only Google from accessing specific directories:
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /js/
Disallow: /tmp/
Block all bots from accessing certain file types:
User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
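In these patterns, * matches any sequence of characters and $ anchors the match to the end of the URL. Both wildcards are extensions honored by major engines such as Google and Baidu rather than part of the original convention. For example, the hypothetical pattern below blocks any URL containing a query string:

User-agent: *
Disallow: /*?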
Automatic discovery of Sitemap files
The Sitemap directive, supported by several major search engines (including Baidu, Google, Bing, and Sogou), specifies the location of a website’s Sitemap file. The Sitemap directive is not restricted by the User-agent directive, so it can be placed anywhere in the robots.txt file. Example:
Sitemap: https://www.example.com/sitemap.xml
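Multiple Sitemap lines may appear in the same file, which is useful when a site splits its sitemap by content type (the URLs below are placeholders):

Sitemap: https://www.example.com/sitemap-pages.xml
Sitemap: https://www.example.com/sitemap-news.xml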
Alternative methods
While robots.txt is the most widely accepted method, it can be complemented by the robots META tag, which applies to an individual page. Like other META tags (language, page description, keywords, and so on), the robots META tag is placed in the HEAD section of the page; it is used specifically to tell search engine robots how to crawl the content of that page.
<head>
<meta name="robots" content="noindex,nofollow" />
</head>
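The content attribute takes comma-separated directives. Besides noindex and nofollow, widely supported values include index, follow, and noarchive, though support for individual values varies by engine. For instance, the following sketch permits indexing the page while asking robots not to follow its links:

<meta name="robots" content="index,nofollow" />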
In addition to using the robots.txt file in the root directory of the website, the same functionality can be achieved by sending an “X-Robots-Tag” HTTP response header.
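For example, a response carrying the following header tells crawlers not to index or follow links in the returned resource:

X-Robots-Tag: noindex, nofollow

As a minimal sketch assuming an Apache server with mod_headers enabled, the header can be attached to every PDF file on the site:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>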