How Google Interprets the robots.txt Specification: Documentation

The robots.txt file is part of the Robots Exclusion Protocol (REP), which defines how websites can control crawler access to their pages. Google follows the REP standard with a few practical interpretations.
This article outlines how Google interprets robots.txt files, where they should be placed, how to format them, and how specific rules are applied.
For the original REP standard, refer to RFC 9309.
What Is a robots.txt File?
A robots.txt file is a plain text file placed in the root directory of your site. It contains rules that tell crawlers which parts of your site they’re allowed or disallowed from accessing.
Example for https://example.com:
```
User-agent: *
Disallow: /private/
```
This tells all crawlers that they should not access the /private/ directory.
💡 If you’re new to this, start with Google’s intro to robots.txt and robots.txt generator tools.
File Location and Valid Scope
You must place robots.txt in the top-level root of your site. The path is case-sensitive, and the file’s rules apply only to:
- The exact host, protocol, and port it is served from
- All directories under that same origin
✅ Valid URLs
- https://example.com/robots.txt applies to all paths on https://example.com
- ftp://example.com/robots.txt applies to FTP content for that domain or IP
❌ Invalid or Limited-Scope Examples
- https://www.example.com/robots.txt does not apply to https://example.com
- https://example.com:8181/robots.txt is valid only for that port
- A robots.txt file in a subdirectory (e.g., /folder/robots.txt) is ignored
- IP-based rules apply only to that IP (e.g., https://212.96.82.21/robots.txt)
How Google Fetches robots.txt
Google fetches the robots.txt file using:
- A GET request for HTTP/HTTPS (a minimal fetch sketch follows this list)
- A RETR command via anonymous FTP login for FTP
It does not follow redirects via:
- JavaScript
- Frames
- Meta refresh tags
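For illustration only, here is a minimal sketch in Python of that plain-GET fetch step. The host, timeout, and user-agent string are placeholder assumptions, not Google’s actual values.

```python
# Minimal sketch of fetching robots.txt with a plain GET request.
# The user-agent string and timeout are illustrative placeholders.
import urllib.error
import urllib.request

def fetch_robots_txt(origin: str, user_agent: str = "ExampleBot/1.0") -> tuple[int, bytes]:
    """Return (status_code, body) for <origin>/robots.txt."""
    req = urllib.request.Request(f"{origin}/robots.txt",
                                 headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            # urllib follows plain HTTP redirects; JavaScript, frame, and
            # meta-refresh "redirects" are never followed for robots.txt.
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""

status, body = fetch_robots_txt("https://example.com")
print(status, body[:80])
```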
Handling HTTP Status Codes
Google reacts to robots.txt availability based on its HTTP response code:
HTTP Code | Behavior |
---|---|
2xx | The file is processed normally. |
3xx | Google follows up to 5 redirects; if no robots.txt is reached, the result is treated as a 404. |
4xx (except 429) | Treated as if no robots.txt exists; crawling is not restricted. This includes 401 and 403. |
429 and 5xx | Treated as server errors: Google temporarily halts crawling (up to 12 hours), retries frequently, and may fall back to a cached copy of the file. |
DNS/Timeout errors | Treated like server errors (same behavior as 5xx). |
⚠️ If server errors persist beyond 30 days, Google either falls back to treating the site as crawlable or stops crawling it, depending on the site’s availability. (A rough sketch of this decision logic follows.)
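Read as a decision procedure, the table above could be sketched roughly like this. The function name and outcome labels are our own shorthand, not Google terminology.

```python
# Rough sketch of the status-code handling described above.
# The outcome labels are informal shorthand, not Google's terms.
def robots_fetch_outcome(status: int | None) -> str:
    """Map a robots.txt fetch result to a coarse crawl decision."""
    if status is None:                  # DNS failure or timeout
        return "treat-as-server-error"
    if 200 <= status < 300:
        return "parse-and-apply-rules"
    if 300 <= status < 400:
        # Up to 5 redirect hops are followed; an unresolved chain
        # is handled like a 404.
        return "follow-redirect-or-treat-as-404"
    if status == 429 or status >= 500:
        # Temporarily treated as fully disallowed; retried later,
        # possibly falling back to a cached copy of the file.
        return "treat-as-server-error"
    if 400 <= status < 500:             # includes 401 and 403
        return "no-robots-txt-crawl-unrestricted"
    return "treat-as-server-error"

print(robots_fetch_outcome(403))   # no-robots-txt-crawl-unrestricted
print(robots_fetch_outcome(503))   # treat-as-server-error
```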
File Format and Encoding
- Encoding: UTF-8 only
- Line breaks: LF or CRLF (standard UNIX/Windows line endings)
- Comment lines: start with #
- Case sensitivity: directive names (such as User-agent and Disallow) are not case-sensitive, but path values are
Example:
```
# This is a comment
User-agent: Googlebot
Disallow: /Private/
```
In the example above, Disallow: /Private/ blocks /Private/, but not /private/.
Valid Directives
Here’s how Google interprets the main REP directives:
1. User-agent
Defines which crawler(s) the rules apply to. You can define multiple user-agents.
```
User-agent: Googlebot
Disallow: /no-google/
```
The * wildcard targets all crawlers:
```
User-agent: *
Disallow: /private/
```
🚫 Write the directive name immediately followed by a colon, with no space before it: User-agent:
2. Disallow
Tells the crawler not to visit specific URLs.
```
Disallow: /private/
Disallow: /tmp/page.html
```
- A blank Disallow: line allows crawling of all content:
```
Disallow:
```
3. Allow
Overrides Disallow rules when a more specific path is allowed.
```
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
```
In this case, Googlebot will still crawl /private/public-page.html.
4. Sitemap
Points to an XML sitemap URL. It doesn’t control crawling, just informs Google about sitemap location.
```
Sitemap: https://example.com/sitemap.xml
```
- Multiple Sitemap lines are allowed.
5. Crawl-Delay (Not Supported by Google)
Some crawlers use it to limit request rate, but Google ignores this directive.
Order of Precedence
Google applies rules based on specificity:
- The rule with the most specific (longest) matching path wins.
- If an Allow and a Disallow rule match with equal specificity, Google uses the least restrictive rule (Allow).
Example:
```
User-agent: Googlebot
Disallow: /docs/
Allow: /docs/public/
```
- Googlebot can crawl /docs/public/page.html because the Allow rule is more specific (a short matching sketch follows).
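As a rough illustration of the longest-match rule (ignoring wildcards), the helper below picks the most specific matching rule and lets Allow win ties. The function and rule format are illustrative assumptions, not part of any official parser.

```python
# Sketch of longest-match precedence for literal path rules.
# Rules are (directive, path) pairs; wildcards are ignored for simplicity.
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    matches = [(directive, rule_path)
               for directive, rule_path in rules
               if rule_path and path.startswith(rule_path)]
    if not matches:
        return True                       # no rule matches: crawling allowed
    # Longest (most specific) path wins; Allow beats Disallow on a tie.
    directive, _ = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
    return directive == "allow"

rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
print(is_allowed(rules, "/docs/public/page.html"))  # True
print(is_allowed(rules, "/docs/internal.html"))     # False
```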
Wildcards and Pattern Matching
Google supports two pattern matching characters:
Symbol | Meaning | Example |
---|---|---|
* | Matches any number of characters | /private/*/data.html |
$ | Anchors the match to the end of URL | /index.html$ |
Example:
```
User-agent: *
Disallow: /*.pdf$
```
This blocks all URLs ending with .pdf (see the sketch below for one way such patterns can be matched).
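One simple way to implement these two metacharacters is to translate a rule path into a regular expression. This is an illustrative approach, not necessarily how Google’s own parser works.

```python
# Sketch: translate a robots.txt path pattern with * and $ into a regex.
# Illustrative only; Google's actual matcher may work differently.
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape everything, then turn escaped '*' back into '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored_end else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))           # True: blocked
print(bool(pdf_rule.match("/files/report.pdf?download")))  # False: not blocked
```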
Max File Size
- Googlebot only reads the first 500 kibibytes (KiB) of a robots.txt file.
- Anything beyond that limit is ignored.
Grouping Rules for Specific Crawlers
You can write multiple groups in a robots.txt file for different user-agents.
Example:
```
User-agent: Googlebot
Disallow: /nogoogle/

User-agent: Bingbot
Disallow: /nobing/

User-agent: *
Disallow: /nopublic/
```
- Each group starts with a User-agent line and ends before the next User-agent line.
✅ A crawler reads the group whose user-agent matches it, or falls back to the wildcard * group (a simplified selection sketch follows).
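A simplified sketch of that group-selection step might look like the following. The helper is an illustrative assumption and ignores subtleties such as partial user-agent (product token) matching.

```python
# Sketch: choose which robots.txt group applies to a crawler.
# Real crawlers handle more nuanced user-agent matching; this only
# does an exact, case-insensitive match with a '*' fallback.
def select_group(groups: dict[str, list[str]], user_agent: str) -> list[str]:
    normalized = {name.lower(): rules for name, rules in groups.items()}
    return normalized.get(user_agent.lower(), normalized.get("*", []))

groups = {
    "Googlebot": ["Disallow: /nogoogle/"],
    "Bingbot":   ["Disallow: /nobing/"],
    "*":         ["Disallow: /nopublic/"],
}
print(select_group(groups, "googlebot"))    # ['Disallow: /nogoogle/']
print(select_group(groups, "DuckDuckBot"))  # ['Disallow: /nopublic/']
```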
Combining Rules for the Same User-agent
You can define multiple directives for a single user-agent within the same group.
```
User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-info.html
```
Googlebot will:
- Not crawl /private/
- Not crawl /tmp/
- Still crawl /private/public-info.html (because of the Allow rule)
Googlebot’s Matching Rules
Googlebot applies the longest match rule based on the specificity of the path.
Example:
```
Disallow: /a/
Allow: /a/b/c/page.html
```
/a/b/c/page.html is allowed because of the more specific Allow rule.
Default Behavior If No Match
If no rule matches the request URL, Googlebot assumes it’s allowed.
```
User-agent: Googlebot
Disallow: /private/
```
- A request to /public/page.html is allowed because it doesn’t match any disallowed path.
Rules Precedence Summary
Scenario | What Happens? |
---|---|
Multiple matching rules | Most specific (longest) rule applies |
No rules match | Crawling allowed |
Disallow + Allow | Most specific wins |
Invalid line | Ignored by Googlebot |
User-agent not matched | Googlebot checks fallback User-agent: * group |
Crawling vs Indexing
- robots.txt only controls crawling, not indexing.
- Even if a page is disallowed, Google might still index its URL if other pages link to it.
- To block indexing, use a noindex meta tag or HTTP header on the page itself (examples below).
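For reference, a noindex directive can be expressed either in the page’s HTML or as an HTTP response header (exactly where you set the header depends on your server setup):

```
<!-- In the page's <head> element -->
<meta name="robots" content="noindex">

# Or sent as an HTTP response header
X-Robots-Tag: noindex
```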
How Google Handles Errors
1. robots.txt Not Found (404)
- Google assumes all content is allowed to crawl.
2. Access Denied (401/403)
- Treated like other 4xx errors: Google assumes all content is allowed to crawl.
3. Redirects (301/302)
- Google follows redirects (up to 5 hops) to reach the actual robots.txt file.
4. Server Errors (5xx and 429)
- Treated as a temporary disallow of the whole site.
- Google retries later and may fall back to a cached copy.
robots.txt Caching by Google
Google doesn’t download your robots.txt file on every crawl request. Instead:
- It caches the file for up to 24 hours.
- During outages, Google may reuse the cached version.
- You can request an early recrawl using the robots.txt Tester in Search Console.
Test Your robots.txt File
Use the Google Search Console robots.txt Tester to:
- Validate syntax.
- Check if a specific URL is blocked or allowed.
- Identify formatting or logic issues.
🔗 You can access it at: https://www.google.com/webmasters/tools/robots-testing-tool
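If you’d rather script a quick check locally, Python’s standard urllib.robotparser module can parse a robots.txt file and answer allow/block questions; note that its matching behavior doesn’t replicate every Google-specific nuance (wildcard handling, for example, may differ).

```python
# Quick local check with Python's built-in robots.txt parser.
# Note: its matching rules may differ slightly from Google's own parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

for url in ("https://example.com/private/page.html",
            "https://example.com/public/page.html"):
    print(url, "->", "allowed" if rp.can_fetch("Googlebot", url) else "blocked")
```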
Common Mistakes to Avoid
Mistake | Description |
---|---|
Typos in directive names | Disalow instead of Disallow |
Forgetting the slash | Disallow: private/ (should be /private/ ) |
Misuse of wildcards | * or $ used incorrectly |
Mixing User-agent groups | Confuses crawler behavior |
Blocking important resources | Like /css/ , /js/ folders |
Best Practices
- Always test changes before publishing.
- Place robots.txt in the root directory (e.g., https://example.com/robots.txt).
- Keep the file size under 500 KiB.
- Limit to 5,000 directives per file.
Related Tools and Resources
- robots.txt Tester – Search Console
- Robots.txt Specifications – Google Developers
- Blocking URLs with robots.txt
✅ Summary
- The robots.txt file instructs crawlers on what to avoid crawling.
- Googlebot respects the most specific rule and ignores invalid lines.
- robots.txt does not block indexing unless paired with noindex rules.
- Regularly test and monitor your file to ensure your site’s visibility isn’t unintentionally restricted.