How Google Interprets the robots.txt Specification: Documentation

The robots.txt file is part of the Robots Exclusion Protocol (REP), which defines how websites can control crawler access to their pages. Google follows the REP standard with a few practical interpretations.

This article outlines how Google interprets robots.txt files, where they should be placed, how to format them, and how specific rules are applied.

For the original REP standard, refer to RFC 9309.


What Is a robots.txt File?

A robots.txt file is a plain text file placed at the root directory of your site. It contains rules that tell crawlers which parts of your site they’re allowed or disallowed from accessing.

Example for https://example.com:

User-agent: *
Disallow: /private/

This tells all crawlers they should not access the /private/ directory.
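
If you want to check a rule like this programmatically, Python's standard urllib.robotparser module can evaluate it. A minimal sketch, using only the example rules above:

from urllib import robotparser

# Parse the example rules directly (robotparser can also fetch
# https://example.com/robots.txt itself via set_url() + read()).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False: blocked
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True: allowed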

💡 If you’re new to this, start with Google’s intro to robots.txt and robots.txt generator tools.


File Location and Valid Scope

You must place robots.txt at the root of your site. The file name is case-sensitive (it must be robots.txt, all lowercase), and its rules are only valid for:

  • The exact host, protocol, and port
  • All directories under the same origin

✅ Valid URLs

  • https://example.com/robots.txt
    Applies to all subdirectories on https://example.com
  • ftp://example.com/robots.txt
    Applies to FTP content on the same host

❌ Invalid Examples

  • https://www.example.com/robots.txt does not apply to https://example.com
  • https://example.com:8181/robots.txt applies only to port 8181, not to https://example.com/
  • Subdirectory robots.txt (e.g., /folder/robots.txt) is ignored
  • IP-based rules apply only to that IP (e.g., https://212.96.82.21/robots.txt)
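
To illustrate the per-origin scope, here is a small Python sketch (the helper name robots_txt_url is ours) that derives the governing robots.txt URL for a page by keeping the scheme, host, and port and replacing the path:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/folder/page.html"))       # https://example.com/robots.txt
print(robots_txt_url("https://example.com:8181/folder/page.html"))  # https://example.com:8181/robots.txt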

How Google Fetches robots.txt

Google fetches the robots.txt file using:

  • A GET request for HTTP/HTTPS
  • A RETR command via anonymous FTP login for FTP

It does not follow redirects via:

  • JavaScript
  • Frames
  • Meta refresh tags

Handling HTTP Status Codes

Google reacts to robots.txt availability based on its HTTP response code:

  • 2xx: The file is fetched and its rules are processed normally.
  • 3xx: Google follows up to 5 redirect hops; if the file still cannot be resolved, it is treated as a 404.
  • 4xx (including 401 and 403, but not 429): Treated as if no robots.txt exists, so crawling is fully allowed.
  • 429 and 5xx: Google temporarily treats the whole site as disallowed, pauses crawling for up to 12 hours while retrying, and may fall back to a cached copy of the file.
  • DNS and timeout errors: Treated like server errors (same behavior as 5xx).

⚠️ If the robots.txt file remains unreachable for more than 30 days, Google uses the last cached copy if one is available; otherwise it assumes there are no crawl restrictions.
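
The following Python sketch, using the third-party requests library, shows one way to map these status codes to the crawl behavior described above. The return labels (allow_all, parse_rules, temporary_disallow) are illustrative names of our own, not anything Google exposes:

import requests

def fetch_robots_txt(origin):
    """Fetch robots.txt and map the HTTP status to a crawl policy,
    roughly following the behavior described above."""
    try:
        # requests follows redirects by default; cap the hops like the 5-redirect limit above.
        session = requests.Session()
        session.max_redirects = 5
        resp = session.get(origin + "/robots.txt", timeout=10)
    except requests.TooManyRedirects:
        return ("allow_all", None)           # unresolved redirects: treated like a 404
    except requests.RequestException:
        return ("temporary_disallow", None)  # DNS/timeout errors: treated like 5xx

    if 200 <= resp.status_code < 300:
        return ("parse_rules", resp.text)    # 2xx: use the rules in the body
    if resp.status_code == 429 or resp.status_code >= 500:
        return ("temporary_disallow", None)  # 429/5xx: back off and retry later
    return ("allow_all", None)               # other 4xx: as if no robots.txt exists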

File Format and Encoding

  • Encoding: UTF-8 only
  • Line Breaks: Use LF or CRLF (standard UNIX/Windows line endings)
  • Comment Lines: Start with #
  • Case Sensitivity: Directive names and user-agent values are case-insensitive, but path values in Allow and Disallow rules are case-sensitive

Example:

# This is a comment
User-agent: Googlebot
Disallow: /Private/

In the example above, Disallow: /Private/ blocks /Private/, but not /private/.
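
A minimal tokenizer along these lines might look like the following Python sketch (the function name is ours). Note that only field names are lowercased, while path values keep their case:

def parse_robots_lines(raw_bytes):
    """Minimal sketch of tokenising robots.txt: decode as UTF-8, drop comments,
    split each rule on the first colon, and lowercase the field name only
    (path values stay case-sensitive)."""
    rules = []
    for line in raw_bytes.decode("utf-8", errors="replace").splitlines():
        line = line.split("#", 1)[0].strip()    # comments start with '#'
        if not line or ":" not in line:
            continue                            # blank or invalid lines are ignored
        field, _, value = line.partition(":")
        rules.append((field.strip().lower(), value.strip()))
    return rules

print(parse_robots_lines(b"# comment\nUser-agent: Googlebot\nDisallow: /Private/\n"))
# [('user-agent', 'Googlebot'), ('disallow', '/Private/')]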


Valid Directives

Here’s how Google interprets the main REP directives:

1. User-agent

Defines which crawler(s) the rules apply to. You can define multiple user-agents.

User-agent: Googlebot
Disallow: /no-google/

  • * is a wildcard to target all crawlers:

User-agent: *
Disallow: /private/

🚫 Don’t put a space between User-agent and the colon; the field must be written as User-agent:


2. Disallow

Tells the crawler not to visit specific URLs.

Disallow: /private/
Disallow: /tmp/page.html

  • A blank Disallow: line allows crawling all content:

Disallow:

3. Allow

Overrides Disallow rules if a more specific path is allowed.

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html

In this case, Googlebot will still crawl /private/public-page.html.


4. Sitemap

Points to an XML sitemap URL. It doesn’t control crawling, just informs Google about sitemap location.

Sitemap: https://example.com/sitemap.xml

  • Multiple Sitemap lines are allowed.

5. Crawl-Delay (Not Supported by Google)

Some crawlers use it to limit request rate, but Google ignores this directive.


Order of Precedence

Google applies rules based on:

  1. The rule with the longest (most specific) matching path wins.
  2. If an Allow and a Disallow rule match with the same specificity, the less restrictive rule (Allow) is used.

Example:

User-agent: Googlebot
Disallow: /docs/
Allow: /docs/public/

  • Googlebot can crawl /docs/public/page.html because Allow is more specific.

Wildcards and Pattern Matching

Google supports two pattern matching characters:

  • * matches any number of characters (including zero). Example: /private/*/data.html
  • $ anchors the match to the end of the URL. Example: /index.html$

Example:

User-agent: *
Disallow: /*.pdf$

Blocks all URLs ending with .pdf.
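
The Python sketch below approximates this pattern matching together with the longest-match precedence from the previous section: * becomes .*, $ anchors the end of the URL, and ties between Allow and Disallow go to Allow. Real crawler implementations differ in details; this is only an illustration, and all names are ours.

import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex:
    '*' matches any run of characters, '$' anchors the end of the URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

def is_allowed(path, rules):
    """Apply the longest-match rule: among all matching rules, the one with the
    longest pattern wins; ties between Allow and Disallow go to Allow."""
    best = ("allow", -1)  # default: allowed when nothing matches
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best[1] or (length == best[1] and kind == "allow"):
                best = (kind, length)
    return best[0] == "allow"

rules = [("disallow", "/docs/"), ("allow", "/docs/public/"), ("disallow", "/*.pdf$")]
print(is_allowed("/docs/public/page.html", rules))  # True: the longer Allow wins
print(is_allowed("/docs/internal.html", rules))     # False
print(is_allowed("/files/report.pdf", rules))       # False: matches /*.pdf$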


Max File Size

  • Googlebot only reads the first 500 kibibytes (KiB) of a robots.txt file.
  • Anything beyond that is ignored.
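
In code, that limit simply means truncating before parsing; a trivial sketch:

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB

def truncate_robots_txt(raw_bytes):
    """Keep only the first 500 KiB, mirroring the size limit described above;
    any rules that start beyond the limit are simply never seen."""
    return raw_bytes[:MAX_ROBOTS_BYTES]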

Grouping Rules for Specific Crawlers

You can write multiple groups in a robots.txt file for different user-agents.

Example:

User-agent: Googlebot
Disallow: /nogoogle/

User-agent: Bingbot
Disallow: /nobing/

User-agent: *
Disallow: /nopublic/

  • Each group starts with a User-agent line and ends before the next User-agent line.

✅ Crawlers follow the group whose user-agent most specifically matches their name; if no group matches, they fall back to the User-agent: * group.
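
A rough Python sketch of this group selection (function and variable names are ours): it prefers the most specific matching user-agent token and otherwise falls back to the * group.

def select_group(groups, crawler_name):
    """Pick the group whose user-agent token best matches the crawler,
    falling back to the '*' group."""
    name = crawler_name.lower()
    best_token, best_rules = None, []
    for token, rules in groups.items():
        token_lower = token.lower()
        if token_lower != "*" and token_lower in name:
            # Prefer the most specific (longest) matching user-agent token.
            if best_token is None or len(token_lower) > len(best_token):
                best_token, best_rules = token_lower, rules
    if best_token is not None:
        return best_rules
    return groups.get("*", [])

groups = {
    "Googlebot": ["Disallow: /nogoogle/"],
    "Bingbot": ["Disallow: /nobing/"],
    "*": ["Disallow: /nopublic/"],
}
print(select_group(groups, "Googlebot"))         # ['Disallow: /nogoogle/']
print(select_group(groups, "SomeOtherCrawler"))  # ['Disallow: /nopublic/']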


Combining Rules for the Same User-agent

You can define multiple directives for a single user-agent within the same group.

User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-info.html

Googlebot will:

  • Not crawl /private/
  • Not crawl /tmp/
  • Still crawl /private/public-info.html (because of Allow)

Googlebot’s Matching Rules

Googlebot applies the longest match rule based on the specificity of the path.

Example:

Disallow: /a/
Allow: /a/b/c/page.html

  • /a/b/c/page.html is allowed due to the more specific Allow rule.

Default Behavior If No Match

If no rule matches the request URL, Googlebot assumes it’s allowed.

User-agent: Googlebot
Disallow: /private/

  • A request to /public/page.html is allowed because it doesn’t match any disallowed path.

Rules Precedence Summary

  • Multiple matching rules: the most specific (longest) rule applies.
  • No rules match: crawling is allowed.
  • Both Disallow and Allow match: the most specific rule wins.
  • Invalid line: ignored by Googlebot.
  • User-agent not matched: Googlebot falls back to the User-agent: * group.

Crawling vs Indexing

  • robots.txt only controls crawling, not indexing.
  • Even if a page is disallowed, Google might index it if other pages link to it.
  • To block indexing, use a noindex meta tag or an X-Robots-Tag HTTP header on the page itself, and make sure that page is not blocked in robots.txt, or Google will never see the noindex (see the sketch below).
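
Any web server or framework can attach that header; the minimal sketch below uses Python's built-in http.server purely for illustration. The key point is the X-Robots-Tag: noindex response header.

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoindexHandler(BaseHTTPRequestHandler):
    """Serve pages with an X-Robots-Tag header so the page can be crawled
    but is kept out of the index (the URL must NOT be blocked in robots.txt,
    otherwise Google never sees this header)."""

    def do_GET(self):
        body = b"<html><body>Internal page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("X-Robots-Tag", "noindex")  # blocks indexing, not crawling
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoindexHandler).serve_forever()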

How Google Handles Errors

1. robots.txt Not Found (404)

  • Google assumes all content is allowed to crawl.

2. Access Denied (401/403)

  • Treated like the other 4xx errors: as if no robots.txt exists, so all content is allowed to crawl.

3. Redirects (301/302)

  • Google follows redirects (up to 5 hops) to get the actual robots.txt.

4. Server Errors (5xx)

  • Treated like temporary disallow.
  • Google tries again later.

robots.txt Caching by Google

Google doesn’t download your robots.txt file on every crawl request. Instead:

  • It caches the file for up to 24 hours.
  • During outages, Google may reuse the cached version.
  • You can request an early recrawl using the robots.txt Tester in Search Console.
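
A toy version of such a cache in Python (the 24-hour figure mirrors the description above; all names are ours, and fetch is any callable that downloads the file, e.g. the fetch sketch shown earlier):

import time

_CACHE = {}                   # origin -> (fetched_at, robots_txt_text)
MAX_AGE_SECONDS = 24 * 3600   # roughly the 24-hour cache lifetime described above

def get_cached_robots_txt(origin, fetch):
    """Return cached robots.txt for an origin, refetching once it is older than 24 hours."""
    now = time.time()
    cached = _CACHE.get(origin)
    if cached and now - cached[0] < MAX_AGE_SECONDS:
        return cached[1]
    text = fetch(origin)
    _CACHE[origin] = (now, text)
    return text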

Test Your robots.txt File

Use the Google Search Console robots.txt Tester to:

  • Validate syntax.
  • Check if a specific URL is blocked or allowed.
  • Identify formatting or logic issues.

🔗 You can access it at: https://www.google.com/webmasters/tools/robots-testing-tool


Common Mistakes to Avoid

  • Typos in directive names: Disalow instead of Disallow.
  • Forgetting the leading slash: Disallow: private/ should be Disallow: /private/.
  • Misusing wildcards: * or $ used incorrectly.
  • Mixing User-agent groups: makes crawler behavior confusing.
  • Blocking important resources: such as the /css/ or /js/ folders.

Best Practices

  • Always test changes before publishing.
  • Place robots.txt in the root directory (e.g., https://example.com/robots.txt)
  • Keep the file size under 500 KiB
  • Limit to 5000 directives per file

Related Tools and Resources

  • robots.txt Tester – Search Console
  • Robots.txt Specifications – Google Developers
  • Blocking URLs with robots.txt

✅ Summary

  • The robots.txt file instructs crawlers on what to avoid crawling.
  • Googlebot respects the most specific rule and ignores invalid ones.
  • robots.txt does not block indexing unless paired with noindex rules.
  • Regularly test and monitor your file to ensure your site’s visibility isn’t unintentionally restricted.
