A broken robots.txt file can silently block Google from crawling your most important pages. The worst part: you won't notice until traffic drops and you start investigating.
This guide covers how to test your robots.txt file, the tools that make it fast, the mistakes that cause the most damage, and the validation workflow you should run before every deployment.
What robots.txt actually controls
Robots.txt tells search engine crawlers which URL paths they can and cannot request. It does not:
- Remove pages from Google's index (use noindex for that)
- Prevent pages from appearing in search results if other pages link to them
- Apply authentication or access control
What it does:
- Manages crawl budget by steering crawlers away from low-value URLs
- Prevents indexing of staging, admin, and internal paths (when combined with other signals)
- Controls access per user-agent (Googlebot, Bingbot, GPTBot, etc.)
If your robots.txt blocks a URL, Google won't crawl it—but if another site links to that URL, Google may still show it in search results with no snippet. This is one of the most common misunderstandings in technical SEO.
How to test robots.txt: step-by-step
Step 1: Check if your robots.txt exists and is accessible
Visit https://yourdomain.com/robots.txt directly. You should see a plain text file. If you get a 404, HTML page, or redirect, your robots.txt isn't serving correctly.
What to verify:
- HTTP status code is 200
- Content-Type header is text/plain
- File is served from the root domain (not a subdirectory)
- No redirect chain before the final response
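These checks can be scripted. A minimal sketch (the function name and argument shape are my own; feed it the status code, Content-Type, and redirect count reported by whatever HTTP client you already use):

```python
def validate_robots_response(status, content_type, redirect_count):
    """Apply the Step 1 checks to a fetched /robots.txt response.

    Returns a list of problems; an empty list means the file
    is being served correctly.
    """
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")
    if redirect_count:
        problems.append(f"served via {redirect_count} redirect(s)")
    return problems
```

A healthy response yields an empty list; a 404 served as text/html reports both the status and the content-type problem.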
Step 2: Validate syntax
Robots.txt has a specific syntax. Common formatting issues:
# Correct
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
# Risky - missing trailing slash makes this a prefix match,
# so it also blocks /administrator and /admin.html
Disallow: /admin
# Incorrect - leading spaces before the directive
    Disallow: /private/
Key syntax rules:
- Each Disallow and Allow directive must be preceded by a User-agent line
- Directives are case-sensitive for the path, case-insensitive for the directive name
- Blank lines separate rule groups
- # starts a comment
- * is a wildcard matching any sequence of characters
- $ anchors a pattern to the end of the URL
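The wildcard semantics are worth sketching in code, partly because Python's stdlib urllib.robotparser treats * and $ literally. The helper below is a hypothetical illustration of the matching rules above, not a full parser:

```python
import re

def pattern_matches(rule_path, url_path):
    """Match a robots.txt path pattern against a URL path.

    Patterns are prefix matches; '*' matches any run of characters,
    and a trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything except '*', which becomes '.*'.
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the string, giving prefix semantics.
    return re.match(regex, url_path) is not None
```

Note the prefix behavior: pattern_matches("/admin", "/administrator") returns True, which is why the trailing slash in Disallow rules matters.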
Step 3: Test specific URLs against your rules
The most important test: do your rules actually block what you intend and allow what you need?
For every URL you test, answer two questions:
- Should this URL be crawlable? (Is it a page you want in search results?)
- Does robots.txt allow it? (Does the tester confirm access?)
If the answers don't match, you have a misconfiguration.
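One way to automate this check is Python's stdlib urllib.robotparser. The rules and expectations below are hypothetical stand-ins for your own file and URLs; note that the stdlib parser applies rules in file order rather than Google's longest-match rule, so list specific Allow lines before the broader Disallow:

```python
import urllib.robotparser

RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

# (URL path, should it be crawlable?) - the two questions from above.
EXPECTATIONS = [
    ("/", True),
    ("/admin/public/docs/", True),
    ("/admin/settings", False),
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Any entry here is a misconfiguration: intent and rules disagree.
mismatches = [(path, want) for path, want in EXPECTATIONS
              if rp.can_fetch("Googlebot", path) != want]
if mismatches:
    print("Misconfigured paths:", mismatches)
```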
Step 4: Check AI crawler rules
If you're managing AI crawler access (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot), verify those rules separately. AI crawlers have their own user-agent strings and your rules may need explicit handling:
User-agent: GPTBot
Disallow: /proprietary/
Allow: /blog/
Allow: /resources/
User-agent: ChatGPT-User
Disallow: /proprietary/
Allow: /blog/
Allow: /resources/
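The same parser-based check works per user-agent. A sketch with sample rules mirroring the block above (other AI crawlers would follow the same pattern):

```python
import urllib.robotparser

RULES = """\
User-agent: GPTBot
Disallow: /proprietary/
Allow: /blog/

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# The same URL gets different answers depending on the user-agent.
for agent in ("GPTBot", "Googlebot"):
    verdict = "allowed" if rp.can_fetch(agent, "/proprietary/report") else "blocked"
    print(agent, "->", verdict)
```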
Robots.txt testing tools
Google Search Console robots.txt tester
URL: Available inside Google Search Console under Settings > robots.txt
What it does:
- Tests any URL against your live robots.txt
- Shows which rule matched (or didn't)
- Highlights syntax errors and warnings
- Shows the exact file Google fetched
Best for: Confirming Google's interpretation of your rules. Google's parser has specific behaviors (like treating unrecognized directives as comments) that other parsers may handle differently.
Limitations: Only tests Googlebot behavior. Won't show how Bingbot or AI crawlers interpret your rules.
Bing Webmaster Tools
URL: Available inside Bing Webmaster Tools
What it does:
- Tests URLs against your robots.txt for Bing's crawler
- Shows blocking rules
- Identifies syntax issues
Best for: Verifying Bing-specific behavior, especially if you have Bing-specific rules.
Merkle robots.txt tester
URL: https://technicalseo.com/tools/robots-txt/
What it does:
- Tests any robots.txt against any URL and user-agent
- No login required
- Shows which directive matched
Best for: Quick testing without logging into Search Console. Useful for testing competitor robots.txt files or checking rules for user-agents other than Googlebot.
Screaming Frog
What it does:
- Crawls your site and flags URLs blocked by robots.txt
- Shows which rules block which URLs
- Identifies pages linked internally but blocked from crawling
Best for: Bulk testing. When you need to check hundreds or thousands of URLs against your robots.txt rules, Screaming Frog's crawl-based approach is faster than testing URLs one by one.
Command-line testing with curl
For developers who want to verify robots.txt programmatically:
# Fetch and display robots.txt
curl -s https://yourdomain.com/robots.txt
# Check headers
curl -sI https://yourdomain.com/robots.txt
# Verify Content-Type
curl -sI https://yourdomain.com/robots.txt | grep -i content-type
This is useful for CI/CD pipelines where you want to validate robots.txt before deployment.
Common robots.txt mistakes (and how to fix them)
1. Blocking CSS and JavaScript files
# Bad - blocks rendering resources
User-agent: *
Disallow: /wp-content/themes/
Disallow: /wp-content/plugins/
Google needs CSS and JS to render pages. Blocking these files means Google sees a broken page and can't evaluate your content properly.
Fix: Remove blanket blocks on theme and plugin directories. If you need to block specific files, be precise:
User-agent: *
Disallow: /wp-content/plugins/private-plugin/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/
2. Blocking entire subdirectories unintentionally
# Blocks /blog/ and everything under it
Disallow: /blog/
If you meant to block only /blog/drafts/ but wrote /blog/, you've blocked your entire blog from crawling.
Fix: Be specific with paths. Test the exact URLs you care about after writing rules.
3. Using robots.txt instead of noindex
If you want to keep a page out of search results, robots.txt is the wrong tool. When you block crawling, Google can't see a noindex meta tag on the page, and the URL may still appear in results (with no snippet) if other sites link to it.
Fix: Use <meta name="robots" content="noindex"> or X-Robots-Tag: noindex header for pages you want excluded from the index. Only use robots.txt for crawl budget management.
4. Forgetting the sitemap directive
# Add at the bottom of robots.txt
Sitemap: https://yourdomain.com/sitemap.xml
The Sitemap directive tells crawlers where your sitemap lives. While crawlers can discover sitemaps through Search Console, including the directive in robots.txt provides an additional discovery path.
5. Conflicting Allow and Disallow rules
User-agent: Googlebot
Disallow: /resources/
Allow: /resources/guides/
When Allow and Disallow conflict, the more specific rule wins. In this case, /resources/guides/ would be allowed because it's more specific than /resources/. But if both rules have the same specificity, the Allow rule takes precedence for Googlebot.
Fix: Write rules from most specific to most general, and test edge cases to confirm behavior.
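A quick sketch of why the ordering advice matters: with the specific Allow written first, order-based parsers (like Python's urllib.robotparser, which takes the first matching rule) and Google's longest-match parser agree on the outcome:

```python
import urllib.robotparser

# The conflicting rules above, written most-specific-first.
RULES = """\
User-agent: Googlebot
Allow: /resources/guides/
Disallow: /resources/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("Googlebot", "/resources/guides/robots-txt"))  # True (allowed)
print(rp.can_fetch("Googlebot", "/resources/pricing"))            # False (blocked)
```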
6. Missing user-agent line
# Broken - no user-agent specified
Disallow: /admin/
Directives without a preceding User-agent line are ignored by compliant crawlers.
Fix: Always include a User-agent line before your directives:
User-agent: *
Disallow: /admin/
7. Using absolute URLs in Disallow
# Wrong - absolute URL
Disallow: https://yourdomain.com/admin/
# Correct - relative path
Disallow: /admin/
Robots.txt directives use relative paths from the root. Absolute URLs won't match correctly.
8. Not handling www vs non-www
Robots.txt is domain-specific. www.yourdomain.com/robots.txt and yourdomain.com/robots.txt are separate files. If you redirect one to the other, the robots.txt for the canonical domain is what matters.
Fix: Ensure your canonical domain serves the correct robots.txt and that the non-canonical domain redirects properly.
A pre-deployment robots.txt validation checklist
Run this before every production deployment that touches robots.txt:
- Syntax check - Run the file through Google's tester or Merkle's tool
- Critical page test - Test your homepage, top landing pages, and sitemap URL
- CSS/JS access - Confirm rendering resources are not blocked
- Sitemap reference - Verify the Sitemap directive points to a valid, accessible sitemap
- AI crawler rules - If you manage AI crawler access, confirm GPTBot/ChatGPT-User rules are correct
- Staging vs production - Confirm you're not deploying staging robots.txt (which often blocks all crawlers) to production
- HTTP status - Verify the file returns a 200 status with text/plain content type
- Encoding - File should be UTF-8 encoded, no BOM
The staging-to-production mistake is particularly common. Many staging environments use Disallow: / to block all crawlers. Deploying that file to production kills your organic traffic instantly.
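That failure mode is easy to lint for. A sketch of a CI guard (find_blanket_blocks is a hypothetical helper name) that flags any blanket Disallow: / before the file ships:

```python
def find_blanket_blocks(robots_txt):
    """Return (user_agent, line_number) pairs for 'Disallow: /' rules,
    the classic staging-file leftover that blocks an entire site."""
    findings = []
    agents = []            # user-agents of the current rule group
    in_directives = False  # True once the group's directives have started
    for lineno, raw in enumerate(robots_txt.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_directives:  # a directive ended the previous group
                agents, in_directives = [], False
            agents.append(value)
        elif field in ("disallow", "allow", "crawl-delay"):
            in_directives = True
            if field == "disallow" and value == "/":
                for agent in agents:
                    findings.append((agent, lineno))
    return findings
```

In a deployment pipeline, a non-empty result for the file about to go live should fail the build.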
When to update robots.txt
Review your robots.txt when:
- Launching new URL paths - Ensure they're crawlable unless you specifically want them blocked
- Migrating domains or CMS platforms - New platforms may ship different default robots.txt files
- Adding AI crawler rules - As new AI crawlers emerge (GPTBot, PerplexityBot, etc.), you may want explicit rules
- After a traffic drop - Always check robots.txt early in any traffic investigation
- Before major deployments - Include robots.txt in your deployment checklist
Robots.txt for AI search visibility
As AI search engines become more important, your robots.txt decisions directly affect whether AI platforms can access your content for training and retrieval.
Key considerations:
- Blocking AI crawlers blocks AI visibility. If you disallow GPTBot, your content won't be available for ChatGPT responses. This is a deliberate trade-off.
- Allow selectively. You can allow AI crawlers to access public content (blog posts, guides) while blocking proprietary or gated content.
- Monitor AI crawler behavior. Check your server logs for AI crawler activity. High crawl rates from AI bots may warrant rate limiting through crawl-delay directives (though not all crawlers respect this).
The balance between protecting content and maximizing AI visibility is strategic. There's no universal right answer—it depends on your business model and content strategy.
Key takeaways
- Test before deploying. A five-minute validation catches mistakes that take weeks to diagnose in production.
- Use Google Search Console as your primary tester for Googlebot behavior, supplemented by third-party tools for other crawlers.
- Robots.txt controls crawling, not indexing. Use noindex when you want pages excluded from search results.
- Be specific with paths. Broad rules block more than you intend. Test individual URLs, not just directory patterns.
- Don't forget AI crawlers. If AI search visibility matters to your strategy, your robots.txt rules for GPTBot, ChatGPT-User, and similar crawlers need the same care as your Googlebot rules.