How Google Interprets the robots.txt Specification: Documentation

The robots.txt file is part of the Robots Exclusion Protocol (REP), which defines how websites can control crawler access to their pages. Google follows the REP standard with a few practical interpretations.

This article outlines how Google interprets robots.txt files, where they should be placed, how to format them, and how specific rules are applied.

For the original REP standard, refer to RFC 9309.


What Is a robots.txt File?

A robots.txt file is a plain text file placed at the root directory of your site. It contains rules that tell crawlers which parts of your site they’re allowed or disallowed from accessing.

Example for https://example.com:

User-agent: *
Disallow: /private/

This tells all crawlers they should not access the /private/ directory.
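
If you want to check a rule like this programmatically, Python's standard urllib.robotparser module can evaluate it. A minimal sketch, using only the example rules above:

from urllib import robotparser

# Parse the example rules directly (robotparser can also fetch
# https://example.com/robots.txt itself via set_url() + read()).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True: allowed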

💡 If you’re new to this, start with Google’s intro to robots.txt and robots.txt generator tools.


File Location and Valid Scope

You must place robots.txt at the root of your site. The file name is case-sensitive (it must be robots.txt, all lowercase), and its rules are only valid for:

  • The exact host, protocol, and port
  • All directories under the same origin

✅ Valid URLs

  • https://example.com/robots.txt
    Applies to all subdirectories on https://example.com
  • ftp://example.com/robots.txt
    Applies to FTP content on the same host

❌ Invalid Examples

  • https://www.example.com/robots.txt does not apply to https://example.com
  • https://example.com:8181/robots.txt applies only to port 8181, not to https://example.com/
  • Subdirectory robots.txt (e.g., /folder/robots.txt) is ignored
  • IP-based rules apply only to that IP (e.g., https://212.96.82.21/robots.txt)
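
To illustrate the per-origin scope, here is a small Python sketch (the helper name robots_txt_url is ours) that derives the governing robots.txt URL for a page by keeping the scheme, host, and port and replacing the path:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/folder/page.html"))       # https://example.com/robots.txt
print(robots_txt_url("https://example.com:8181/folder/page.html"))  # https://example.com:8181/robots.txt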

How Google Fetches robots.txt

Google fetches the robots.txt file using:

  • A GET request for HTTP/HTTPS
  • A RETR command via anonymous FTP login for FTP

It does not follow redirects via:

  • JavaScript
  • Frames
  • Meta refresh tags

Handling HTTP Status Codes

Google reacts to robots.txt availability based on its HTTP response code:

  • 2xx: The file is fetched and its rules are processed normally.
  • 3xx: Google follows up to 5 redirect hops; if the file still cannot be resolved, it is treated as a 404.
  • 4xx (including 401 and 403, but not 429): Treated as if no robots.txt exists, so crawling is fully allowed.
  • 429 and 5xx: Google temporarily treats the whole site as disallowed, pauses crawling for up to 12 hours while retrying, and may fall back to a cached copy of the file.
  • DNS and timeout errors: Treated like server errors (same behavior as 5xx).

⚠️ If the robots.txt file remains unreachable for more than 30 days, Google uses the last cached copy if one is available; otherwise it assumes there are no crawl restrictions.
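
The following Python sketch, using the third-party requests library, shows one way to map these status codes to the crawl behavior described above. The return labels (allow_all, parse_rules, temporary_disallow) are illustrative names of our own, not anything Google exposes:

import requests

def fetch_robots_txt(origin):
    """Fetch robots.txt and map the HTTP status to a crawl policy,
    roughly following the behavior described above."""
    try:
        # requests follows redirects by default; cap the hops like the 5-redirect limit above.
        session = requests.Session()
        session.max_redirects = 5
        resp = session.get(origin + "/robots.txt", timeout=10)
    except requests.TooManyRedirects:
        return ("allow_all", None)           # unresolved redirects: treated like a 404
    except requests.RequestException:
        return ("temporary_disallow", None)  # DNS/timeout errors: treated like 5xx

    if 200 <= resp.status_code < 300:
        return ("parse_rules", resp.text)    # 2xx: use the rules in the body
    if resp.status_code == 429 or resp.status_code >= 500:
        return ("temporary_disallow", None)  # 429/5xx: back off and retry later
    return ("allow_all", None)               # other 4xx: as if no robots.txt exists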

File Format and Encoding

  • Encoding: UTF-8 only
  • Line Breaks: Use LF or CRLF (standard UNIX/Windows line endings)
  • Comment Lines: Start with #
  • Case Sensitivity: Directive names and user-agent values are case-insensitive, but path values in Allow and Disallow rules are case-sensitive

Example:

# This is a comment
User-agent: Googlebot
Disallow: /Private/

In the example above, Disallow: /Private/ blocks /Private/, but not /private/.
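
A minimal tokenizer along these lines might look like the following Python sketch (the function name is ours). Note that only field names are lowercased, while path values keep their case:

def parse_robots_lines(raw_bytes):
    """Minimal sketch of tokenising robots.txt: decode as UTF-8, drop comments,
    split each rule on the first colon, and lowercase the field name only
    (path values stay case-sensitive)."""
    rules = []
    for line in raw_bytes.decode("utf-8", errors="replace").splitlines():
        line = line.split("#", 1)[0].strip()    # comments start with '#'
        if not line or ":" not in line:
            continue                            # blank or invalid lines are ignored
        field, _, value = line.partition(":")
        rules.append((field.strip().lower(), value.strip()))
    return rules

print(parse_robots_lines(b"# comment\nUser-agent: Googlebot\nDisallow: /Private/\n"))
# [('user-agent', 'Googlebot'), ('disallow', '/Private/')]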


Valid Directives

Here’s how Google interprets the main REP directives:

1. User-agent

Defines which crawler(s) the rules apply to. You can define multiple user-agents.

User-agent: Googlebot
Disallow: /no-google/

  • * is a wildcard to target all crawlers:

User-agent: *
Disallow: /private/

🚫 Don’t put a space between User-agent and the colon; the field must be written as User-agent:


2. Disallow

Tells the crawler not to visit specific URLs.

Disallow: /private/
Disallow: /tmp/page.html

  • A blank Disallow: line allows crawling all content:

Disallow:

3. Allow

Overrides Disallow rules if a more specific path is allowed.

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html

In this case, Googlebot will still crawl /private/public-page.html.


4. Sitemap

Points to an XML sitemap URL. It doesn’t control crawling, just informs Google about sitemap location.

Sitemap: https://example.com/sitemap.xml

  • Multiple Sitemap lines are allowed.

5. Crawl-Delay (Not Supported by Google)

Some crawlers use it to limit request rate, but Google ignores this directive.


Order of Precedence

Google applies rules based on:

  1. The rule with the longest (most specific) matching path wins.
  2. If an Allow and a Disallow rule match with the same specificity, the less restrictive rule (Allow) is used.

Example:

User-agent: Googlebot
Disallow: /docs/
Allow: /docs/public/

  • Googlebot can crawl /docs/public/page.html because Allow is more specific.

Wildcards and Pattern Matching

Google supports two pattern matching characters:

  • * matches any number of characters (including zero). Example: /private/*/data.html
  • $ anchors the match to the end of the URL. Example: /index.html$

Example:

User-agent: *
Disallow: /*.pdf$

Blocks all URLs ending with .pdf.
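
The Python sketch below approximates this pattern matching together with the longest-match precedence from the previous section: * becomes .*, $ anchors the end of the URL, and ties between Allow and Disallow go to Allow. Real crawler implementations differ in details; this is only an illustration, and all names are ours.

import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex:
    '*' matches any run of characters, '$' anchors the end of the URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

def is_allowed(path, rules):
    """Apply the longest-match rule: among all matching rules, the one with the
    longest pattern wins; ties between Allow and Disallow go to Allow."""
    best = ("allow", -1)  # default: allowed when nothing matches
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best[1] or (length == best[1] and kind == "allow"):
                best = (kind, length)
    return best[0] == "allow"

rules = [("disallow", "/docs/"), ("allow", "/docs/public/"), ("disallow", "/*.pdf$")]
print(is_allowed("/docs/public/page.html", rules))  # True: the longer Allow wins
print(is_allowed("/docs/internal.html", rules))     # False
print(is_allowed("/files/report.pdf", rules))       # False: matches /*.pdf$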


Max File Size

  • Googlebot only reads the first 500 kibibytes (KiB) of a robots.txt file.
  • Anything beyond that is ignored.
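
In code, that limit simply means truncating before parsing; a trivial sketch:

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB

def truncate_robots_txt(raw_bytes):
    """Keep only the first 500 KiB, mirroring the size limit described above;
    any rules that start beyond the limit are simply never seen."""
    return raw_bytes[:MAX_ROBOTS_BYTES]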

Grouping Rules for Specific Crawlers

You can write multiple groups in a robots.txt file for different user-agents.

Example:

User-agent: Googlebot
Disallow: /nogoogle/

User-agent: Bingbot
Disallow: /nobing/

User-agent: *
Disallow: /nopublic/

  • Each group starts with a User-agent line and ends before the next User-agent line.

✅ Crawlers follow the group whose user-agent most specifically matches their name; if no group matches, they fall back to the User-agent: * group.
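
A rough Python sketch of this group selection (function and variable names are ours): it prefers the most specific matching user-agent token and otherwise falls back to the * group.

def select_group(groups, crawler_name):
    """Pick the group whose user-agent token best matches the crawler,
    falling back to the '*' group."""
    name = crawler_name.lower()
    best_token, best_rules = None, []
    for token, rules in groups.items():
        token_lower = token.lower()
        if token_lower != "*" and token_lower in name:
            # Prefer the most specific (longest) matching user-agent token.
            if best_token is None or len(token_lower) > len(best_token):
                best_token, best_rules = token_lower, rules
    if best_token is not None:
        return best_rules
    return groups.get("*", [])

groups = {
    "Googlebot": ["Disallow: /nogoogle/"],
    "Bingbot": ["Disallow: /nobing/"],
    "*": ["Disallow: /nopublic/"],
}
print(select_group(groups, "Googlebot"))         # ['Disallow: /nogoogle/']
print(select_group(groups, "SomeOtherCrawler"))  # ['Disallow: /nopublic/']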


Combining Rules for the Same User-agent

You can define multiple directives for a single user-agent within the same group.

User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-info.html

Googlebot will:

  • Not crawl /private/
  • Not crawl /tmp/
  • Still crawl /private/public-info.html (because of Allow)

Googlebot’s Matching Rules

Googlebot applies the longest match rule based on the specificity of the path.

Example:

Disallow: /a/
Allow: /a/b/c/page.html

  • /a/b/c/page.html is allowed due to the more specific Allow rule.

Default Behavior If No Match

If no rule matches the request URL, Googlebot assumes it’s allowed.

User-agent: Googlebot
Disallow: /private/

  • A request to /public/page.html is allowed because it doesn’t match any disallowed path.

Rules Precedence Summary

  • Multiple matching rules: the most specific (longest) rule applies.
  • No rules match: crawling is allowed.
  • Both Disallow and Allow match: the most specific rule wins.
  • Invalid line: ignored by Googlebot.
  • User-agent not matched: Googlebot falls back to the User-agent: * group.

Crawling vs Indexing

  • robots.txt only controls crawling, not indexing.
  • Even if a page is disallowed, Google might index it if other pages link to it.
  • To block indexing, use a noindex meta tag or an X-Robots-Tag HTTP header on the page itself, and make sure that page is not blocked in robots.txt, or Google will never see the noindex (see the sketch below).
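
Any web server or framework can attach that header; the minimal sketch below uses Python's built-in http.server purely for illustration. The key point is the X-Robots-Tag: noindex response header.

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoindexHandler(BaseHTTPRequestHandler):
    """Serve pages with an X-Robots-Tag header so the page can be crawled
    but is kept out of the index (the URL must NOT be blocked in robots.txt,
    otherwise Google never sees this header)."""

    def do_GET(self):
        body = b"<html><body>Internal page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("X-Robots-Tag", "noindex")  # blocks indexing, not crawling
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoindexHandler).serve_forever()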

How Google Handles Errors

1. robots.txt Not Found (404)

  • Google assumes all content is allowed to crawl.

2. Access Denied (401/403)

  • Treated like the other 4xx errors: as if no robots.txt exists, so all content is allowed to crawl.

3. Redirects (301/302)

  • Google follows redirects (up to 5 hops) to get the actual robots.txt.

4. Server Errors (5xx)

  • Treated like temporary disallow.
  • Google tries again later.

robots.txt Caching by Google

Google doesn’t download your robots.txt file on every crawl request. Instead:

  • It caches the file for up to 24 hours.
  • During outages, Google may reuse the cached version.
  • You can request an early recrawl using the robots.txt Tester in Search Console.
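
A toy version of such a cache in Python (the 24-hour figure mirrors the description above; all names are ours, and fetch is any callable that downloads the file, e.g. the fetch sketch shown earlier):

import time

_CACHE = {}                   # origin -> (fetched_at, robots_txt_text)
MAX_AGE_SECONDS = 24 * 3600   # roughly the 24-hour cache lifetime described above

def get_cached_robots_txt(origin, fetch):
    """Return cached robots.txt for an origin, refetching once it is older than 24 hours."""
    now = time.time()
    cached = _CACHE.get(origin)
    if cached and now - cached[0] < MAX_AGE_SECONDS:
        return cached[1]
    text = fetch(origin)
    _CACHE[origin] = (now, text)
    return text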

Test Your robots.txt File

Use the Google Search Console robots.txt Tester to:

  • Validate syntax.
  • Check if a specific URL is blocked or allowed.
  • Identify formatting or logic issues.

🔗 You can access it at: https://www.google.com/webmasters/tools/robots-testing-tool


Common Mistakes to Avoid

  • Typos in directive names: Disalow instead of Disallow.
  • Forgetting the leading slash: Disallow: private/ should be Disallow: /private/.
  • Misusing wildcards: * or $ used incorrectly.
  • Mixing User-agent groups: makes crawler behavior confusing.
  • Blocking important resources: such as the /css/ or /js/ folders.

Best Practices

  • Always test changes before publishing.
  • Place robots.txt in the root directory (e.g., https://example.com/robots.txt)
  • Keep the file size under 500 KiB
  • Limit to 5000 directives per file

Related Tools and Resources

  • robots.txt Tester – Search Console
  • Robots.txt Specifications – Google Developers
  • Blocking URLs with robots.txt

✅ Summary

  • The robots.txt file instructs crawlers on what to avoid crawling.
  • Googlebot respects the most specific rule and ignores invalid ones.
  • robots.txt does not block indexing unless paired with noindex rules.
  • Regularly test and monitor your file to ensure your site’s visibility isn’t unintentionally restricted.
