A broken robots.txt file can silently block Google from crawling your most important pages. The worst part: you won't notice until traffic drops and you start investigating.
This guide covers how to test your robots.txt file, the tools that make it fast, the mistakes that cause the most damage, and the validation workflow you should run before every deployment.
What robots.txt actually controls
Robots.txt tells search engine crawlers which URL paths they can and cannot request. It does not:
- Remove pages from Google's index (use noindex for that)
- Prevent pages from appearing in search results if other pages link to them
- Apply authentication or access control
What it does:
- Manages crawl budget by steering crawlers away from low-value URLs
- Prevents indexing of staging, admin, and internal paths (when combined with other signals)
- Controls access per user-agent (Googlebot, Bingbot, GPTBot, etc.)
If your robots.txt blocks a URL, Google won't crawl it—but if another site links to that URL, Google may still show it in search results with no snippet. This is one of the most common misunderstandings in technical SEO.
How to test robots.txt: step-by-step
Step 1: Check if your robots.txt exists and is accessible
Visit https://yourdomain.com/robots.txt directly. You should see a plain text file. If you get a 404, HTML page, or redirect, your robots.txt isn't serving correctly.
What to verify:
- HTTP status code is 200
- Content-Type header is text/plain
- File is served from the root domain (not a subdirectory)
- No redirect chain before the final response
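These checks can be scripted. A minimal sketch (the function name and argument shape are my own; feed it the status code, Content-Type, and redirect count reported by whatever HTTP client you already use):

```python
def validate_robots_response(status, content_type, redirect_count):
    """Apply the Step 1 checks to a fetched /robots.txt response.

    Returns a list of problems; an empty list means the file
    is being served correctly.
    """
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")
    if redirect_count:
        problems.append(f"served via {redirect_count} redirect(s)")
    return problems
```

A healthy response yields an empty list; a 404 served as text/html reports both the status and the content-type problem.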
Step 2: Validate syntax
Robots.txt has a specific syntax. Common formatting issues:
# Correct
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/
# Risky - missing trailing slash makes this a prefix match,
# so it also blocks /administrator and /admin.html
Disallow: /admin
# Incorrect - leading spaces before the directive
    Disallow: /private/
Key syntax rules:
- Each Disallow and Allow directive must be preceded by a User-agent line
- Directives are case-sensitive for the path, case-insensitive for the directive name
- Blank lines separate rule groups
- # starts a comment
- * is a wildcard matching any sequence of characters
- $ anchors a pattern to the end of the URL
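The wildcard semantics are worth sketching in code, partly because Python's stdlib urllib.robotparser treats * and $ literally. The helper below is a hypothetical illustration of the matching rules above, not a full parser:

```python
import re

def pattern_matches(rule_path, url_path):
    """Match a robots.txt path pattern against a URL path.

    Patterns are prefix matches; '*' matches any run of characters,
    and a trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything except '*', which becomes '.*'.
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the string, giving prefix semantics.
    return re.match(regex, url_path) is not None
```

Note the prefix behavior: pattern_matches("/admin", "/administrator") returns True, which is why the trailing slash in Disallow rules matters.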
Step 3: Test specific URLs against your rules
The most important test: do your rules actually block what you intend and allow what you need?
For every URL you test, answer two questions:
- Should this URL be crawlable? (Is it a page you want in search results?)
- Does robots.txt allow it? (Does the tester confirm access?)
If the answers don't match, you have a misconfiguration.
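One way to automate this check is Python's stdlib urllib.robotparser. The rules and expectations below are hypothetical stand-ins for your own file and URLs; note that the stdlib parser applies rules in file order rather than Google's longest-match rule, so list specific Allow lines before the broader Disallow:

```python
import urllib.robotparser

RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

# (URL path, should it be crawlable?) - the two questions from above.
EXPECTATIONS = [
    ("/", True),
    ("/admin/public/docs/", True),
    ("/admin/settings", False),
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Any entry here is a misconfiguration: intent and rules disagree.
mismatches = [(path, want) for path, want in EXPECTATIONS
              if rp.can_fetch("Googlebot", path) != want]
if mismatches:
    print("Misconfigured paths:", mismatches)
```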
Step 4: Check AI crawler rules
If you're managing AI crawler access (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot), verify those rules separately. AI crawlers have their own user-agent strings and your rules may need explicit handling:
User-agent: GPTBot
Disallow: /proprietary/
Allow: /blog/
Allow: /resources/
User-agent: ChatGPT-User
Disallow: /proprietary/
Allow: /blog/
Allow: /resources/
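The same parser-based check works per user-agent. A sketch with sample rules mirroring the block above (other AI crawlers would follow the same pattern):

```python
import urllib.robotparser

RULES = """\
User-agent: GPTBot
Disallow: /proprietary/
Allow: /blog/

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# The same URL gets different answers depending on the user-agent.
for agent in ("GPTBot", "Googlebot"):
    verdict = "allowed" if rp.can_fetch(agent, "/proprietary/report") else "blocked"
    print(agent, "->", verdict)
```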
Robots.txt testing tools
Google Search Console robots.txt tester
URL: Available inside Google Search Console under Settings > robots.txt
What it does:
- Tests any URL against your live robots.txt
- Shows which rule matched (or didn't)
- Highlights syntax errors and warnings
- Shows the exact file Google fetched
Best for: Confirming Google's interpretation of your rules. Google's parser has specific behaviors (like treating unrecognized directives as comments) that other parsers may handle differently.
Limitations: Only tests Googlebot behavior. Won't show how Bingbot or AI crawlers interpret your rules.
Bing Webmaster Tools
URL: Available inside Bing Webmaster Tools
What it does:
- Tests URLs against your robots.txt for Bing's crawler
- Shows blocking rules
- Identifies syntax issues
Best for: Verifying Bing-specific behavior, especially if you have Bing-specific rules.
Merkle robots.txt tester
URL: https://technicalseo.com/tools/robots-txt/
What it does:
- Tests any robots.txt against any URL and user-agent
- No login required
- Shows which directive matched
Best for: Quick testing without logging into Search Console. Useful for testing competitor robots.txt files or checking rules for user-agents other than Googlebot.
Screaming Frog
What it does:
- Crawls your site and flags URLs blocked by robots.txt
- Shows which rules block which URLs
- Identifies pages linked internally but blocked from crawling
Best for: Bulk testing. When you need to check hundreds or thousands of URLs against your robots.txt rules, Screaming Frog's crawl-based approach is faster than testing URLs one by one.
Command-line testing with curl
For developers who want to verify robots.txt programmatically:
# Fetch and display robots.txt
curl -s https://yourdomain.com/robots.txt
# Check headers
curl -sI https://yourdomain.com/robots.txt
# Verify Content-Type
curl -sI https://yourdomain.com/robots.txt | grep -i content-type
This is useful for CI/CD pipelines where you want to validate robots.txt before deployment.
Common robots.txt mistakes (and how to fix them)
1. Blocking CSS and JavaScript files
# Bad - blocks rendering resources
User-agent: *
Disallow: /wp-content/themes/
Disallow: /wp-content/plugins/
Google needs CSS and JS to render pages. Blocking these files means Google sees a broken page and can't evaluate your content properly.
Fix: Remove blanket blocks on theme and plugin directories. If you need to block specific files, be precise:
User-agent: *
Disallow: /wp-content/plugins/private-plugin/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/
2. Blocking entire subdirectories unintentionally
# Blocks /blog/ and everything under it
Disallow: /blog/
If you meant to block only /blog/drafts/ but wrote /blog/, you've blocked your entire blog from crawling.
Fix: Be specific with paths. Test the exact URLs you care about after writing rules.
3. Using robots.txt instead of noindex
If you want to keep a page out of search results, robots.txt is the wrong tool. When you block crawling, Google can't see a noindex meta tag on the page, and the URL may still appear in results (with no snippet) if other sites link to it.
Fix: Use <meta name="robots" content="noindex"> or X-Robots-Tag: noindex header for pages you want excluded from the index. Only use robots.txt for crawl budget management.
4. Forgetting the sitemap directive
# Add at the bottom of robots.txt
Sitemap: https://yourdomain.com/sitemap.xml
The Sitemap directive tells crawlers where your sitemap lives. While crawlers can discover sitemaps through Search Console, including the directive in robots.txt provides an additional discovery path.
5. Conflicting Allow and Disallow rules
User-agent: Googlebot
Disallow: /resources/
Allow: /resources/guides/
When Allow and Disallow conflict, the more specific rule wins. In this case, /resources/guides/ would be allowed because it's more specific than /resources/. But if both rules have the same specificity, the Allow rule takes precedence for Googlebot.
Fix: Write rules from most specific to most general, and test edge cases to confirm behavior.
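A quick sketch of why the ordering advice matters: with the specific Allow written first, order-based parsers (like Python's urllib.robotparser, which takes the first matching rule) and Google's longest-match parser agree on the outcome:

```python
import urllib.robotparser

# The conflicting rules above, written most-specific-first.
RULES = """\
User-agent: Googlebot
Allow: /resources/guides/
Disallow: /resources/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("Googlebot", "/resources/guides/robots-txt"))  # True (allowed)
print(rp.can_fetch("Googlebot", "/resources/pricing"))            # False (blocked)
```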
6. Missing user-agent line
# Broken - no user-agent specified
Disallow: /admin/
Directives without a preceding User-agent line are ignored by compliant crawlers.
Fix: Always include a User-agent line before your directives:
User-agent: *
Disallow: /admin/
7. Using absolute URLs in Disallow
# Wrong - absolute URL
Disallow: https://yourdomain.com/admin/
# Correct - relative path
Disallow: /admin/
Robots.txt directives use relative paths from the root. Absolute URLs won't match correctly.
8. Not handling www vs non-www
Robots.txt is domain-specific. www.yourdomain.com/robots.txt and yourdomain.com/robots.txt are separate files. If you redirect one to the other, the robots.txt for the canonical domain is what matters.
Fix: Ensure your canonical domain serves the correct robots.txt and that the non-canonical domain redirects properly.
A pre-deployment robots.txt validation checklist
Run this before every production deployment that touches robots.txt:
- Syntax check - Run the file through Google's tester or Merkle's tool
- Critical page test - Test your homepage, top landing pages, and sitemap URL
- CSS/JS access - Confirm rendering resources are not blocked
- Sitemap reference - Verify the Sitemap directive points to a valid, accessible sitemap
- AI crawler rules - If you manage AI crawler access, confirm GPTBot/ChatGPT-User rules are correct
- Staging vs production - Confirm you're not deploying staging robots.txt (which often blocks all crawlers) to production
- HTTP status - Verify the file returns a 200 status with text/plain content type
- Encoding - File should be UTF-8 encoded, no BOM
The staging-to-production mistake is particularly common. Many staging environments use Disallow: / to block all crawlers. Deploying that file to production kills your organic traffic instantly.
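That failure mode is easy to lint for. A sketch of a CI guard (find_blanket_blocks is a hypothetical helper name) that flags any blanket Disallow: / before the file ships:

```python
def find_blanket_blocks(robots_txt):
    """Return (user_agent, line_number) pairs for 'Disallow: /' rules,
    the classic staging-file leftover that blocks an entire site."""
    findings = []
    agents = []            # user-agents of the current rule group
    in_directives = False  # True once the group's directives have started
    for lineno, raw in enumerate(robots_txt.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_directives:  # a directive ended the previous group
                agents, in_directives = [], False
            agents.append(value)
        elif field in ("disallow", "allow", "crawl-delay"):
            in_directives = True
            if field == "disallow" and value == "/":
                for agent in agents:
                    findings.append((agent, lineno))
    return findings
```

In a deployment pipeline, a non-empty result for the file about to go live should fail the build.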
When to update robots.txt
Review your robots.txt when:
- Launching new URL paths - Ensure they're crawlable unless you specifically want them blocked
- Migrating domains or CMS platforms - New platforms may ship different default robots.txt files
- Adding AI crawler rules - As new AI crawlers emerge (GPTBot, PerplexityBot, etc.), you may want explicit rules
- After a traffic drop - Always check robots.txt early in any traffic investigation
- Before major deployments - Include robots.txt in your deployment checklist
Robots.txt for AI search visibility
As AI search engines become more important, your robots.txt decisions directly affect whether AI platforms can access your content for training and retrieval.
Key considerations:
- Blocking AI crawlers blocks AI visibility. If you disallow GPTBot, your content won't be available for ChatGPT responses. This is a deliberate trade-off.
- Allow selectively. You can allow AI crawlers to access public content (blog posts, guides) while blocking proprietary or gated content.
- Monitor AI crawler behavior. Check your server logs for AI crawler activity. High crawl rates from AI bots may warrant rate limiting through crawl-delay directives (though not all crawlers respect this).
The balance between protecting content and maximizing AI visibility is strategic. There's no universal right answer—it depends on your business model and content strategy.
Key takeaways
- Test before deploying. A five-minute validation catches mistakes that take weeks to diagnose in production.
- Use Google Search Console as your primary tester for Googlebot behavior, supplemented by third-party tools for other crawlers.
- Robots.txt controls crawling, not indexing. Use noindex when you want pages excluded from search results.
- Be specific with paths. Broad rules block more than you intend. Test individual URLs, not just directory patterns.
- Don't forget AI crawlers. If AI search visibility matters to your strategy, your robots.txt rules for GPTBot, ChatGPT-User, and similar crawlers need the same care as your Googlebot rules.