Robots.txt tells crawlers which parts of your website they may access. With the rise of AI assistants, a new category of crawlers has emerged. This guide explains how to configure robots.txt for AI crawlers and the tradeoffs involved.
## AI crawlers explained
AI companies use web crawlers to gather training data and retrieve information for real-time responses. The main AI crawlers include:
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data, ChatGPT browsing |
| ChatGPT-User | OpenAI | Real-time web browsing |
| ClaudeBot | Anthropic | Training data |
| Claude-Web | Anthropic | Real-time retrieval |
| Google-Extended | Google | Gemini training |
| Googlebot | Google | Search index (includes AI Overviews) |
| PerplexityBot | Perplexity | Real-time search |
| Bytespider | ByteDance | Training data |
| CCBot | Common Crawl | Open web dataset (used by many AI labs) |
**Important distinction:**
- **Training crawlers** gather data used to train AI models
- **Retrieval crawlers** fetch content for real-time AI responses
Blocking training crawlers doesn't prevent AI models from learning about you; they may draw on other sources. Blocking retrieval crawlers reduces your AI visibility.
## The visibility tradeoff
Your robots.txt choices affect AI visibility:
| Approach | Visibility | Control | Risk |
|---|---|---|---|
| Allow all AI crawlers | Maximum | Minimum | Content used for training |
| Allow retrieval, block training | High | Medium | May miss some citations |
| Block all AI crawlers | None | Maximum | Zero AI visibility |
| No AI-specific rules | Varies | None | Default crawler behavior |
**For GEO:** allow at minimum the retrieval crawlers (ChatGPT-User, PerplexityBot) to maintain AI search visibility.
## Common robots.txt configurations

### Configuration 1: Maximum AI visibility (recommended for GEO)
Allow all AI crawlers for maximum citation potential:
```
# robots.txt - Maximum AI visibility
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
```
### Configuration 2: Retrieval only (balanced approach)
Allow real-time retrieval but block training crawlers:
```
# robots.txt - Retrieval only

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```
### Configuration 3: Selective access
Allow AI access to specific sections only:
```
# robots.txt - Selective access
User-agent: GPTBot
Allow: /blog/
Allow: /glossary/
Allow: /resources/
Disallow: /

User-agent: ChatGPT-User
Allow: /blog/
Allow: /glossary/
Allow: /resources/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /glossary/
Allow: /resources/
Disallow: /
```
### Configuration 4: Block all AI (not recommended for GEO)
Block all AI crawlers (eliminates AI visibility):
```
# robots.txt - Block all AI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```
## Decision framework
Use this framework to choose your approach:
### Should you allow AI training crawlers?
| Consider allowing if | Consider blocking if |
|---|---|
| You want maximum visibility | You have proprietary content |
| Your content is already public | You're concerned about content licensing |
| Brand awareness is the goal | You monetize content directly |
| You benefit from AI recommendations | Legal/compliance requirements exist |
### Should you allow AI retrieval crawlers?
| Consider allowing if | Consider blocking if |
|---|---|
| GEO/AI visibility matters | You don't want AI traffic |
| Your audience uses AI assistants | Content is behind paywalls |
| Citations drive brand awareness | Legal restrictions apply |
| You want to appear in AI responses | Competitive concerns |
## Testing and verification

### Verify your robots.txt
1. **Check the file is accessible:** visit `yoursite.com/robots.txt` and confirm it loads
2. **Validate syntax:** use the robots.txt report in Google Search Console or another validator
3. **Test specific rules:** check whether specific URLs are allowed or blocked for each crawler
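You can also script the check with Python's standard-library parser. A minimal sketch; the domain, test URL, and crawler list are placeholders to adapt to your site:

```python
# Check how each AI crawler sees a given URL against your live robots.txt.
# The domain and test URL below are placeholders; substitute your own.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://yoursite.com/robots.txt")
parser.read()  # fetch and parse the live file

test_url = "https://yoursite.com/blog/example-post/"
for agent in ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Googlebot"]:
    verdict = "allowed" if parser.can_fetch(agent, test_url) else "blocked"
    print(f"{agent}: {verdict}")
```

Note that Python's parser implements the original robots.txt rules; crawlers that support extensions such as path wildcards may differ on edge cases.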
### Monitor crawler activity
Check server logs for AI crawler requests. Match on the user-agent token rather than hard-coded IP addresses, then verify suspicious hits against the IP ranges each vendor publishes (OpenAI, for example, publishes GPTBot's ranges):

```
# User-agent tokens to look for in access logs
GPTBot          # OpenAI training crawler
ChatGPT-User    # OpenAI real-time browsing
ClaudeBot       # Anthropic
PerplexityBot   # Perplexity
Bytespider      # ByteDance
CCBot           # Common Crawl
```
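To quantify that activity, a short script can tally hits per crawler. A sketch assuming a standard access log; the path is a placeholder for your server's log location:

```python
# Tally AI crawler requests in a web server access log by user-agent token.
# The log path is a placeholder; adjust it for your server.
# Google-Extended is omitted: it is a robots.txt control token, not a
# separate crawler that shows up in logs.
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
               "PerplexityBot", "Bytespider", "CCBot"]

hits = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        for crawler in AI_CRAWLERS:
            if crawler in line:  # user-agent token appears in the log line
                hits[crawler] += 1
                break

for crawler, count in hits.most_common():
    print(f"{crawler}: {count} requests")
```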
### Test AI visibility
After configuration changes:
- Wait 1-2 weeks for crawlers to re-process
- Query AI assistants about your content
- Check if citations appear
- Adjust configuration if needed
## Common mistakes
### Mistake 1: Blocking Googlebot thinking it only affects AI

```
# Wrong - this blocks all Google Search
User-agent: Googlebot
Disallow: /
```

Googlebot powers both traditional search and AI Overviews. Blocking it removes you from Google Search entirely, not just from AI features.
### Mistake 2: Forgetting the wildcard fallback

```
# Missing fallback for unknown crawlers
User-agent: GPTBot
Allow: /
# No rule for other crawlers!
```

Always include a `User-agent: *` rule as a fallback.
### Mistake 3: Typos in crawler names

```
# Wrong crawler name
User-agent: GPT-Bot   # Should be GPTBot
Disallow: /
```

Use exact crawler names. Typos mean rules won't apply.
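A pre-deployment check can catch these typos automatically. A sketch using only the standard library; the known-agents list is illustrative, so extend it with the crawlers you target:

```python
# Flag User-agent values in a local robots.txt that aren't on a known list,
# suggesting close matches for likely typos.
from difflib import get_close_matches

KNOWN_AGENTS = ["*", "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
                "Google-Extended", "Googlebot", "Bingbot",
                "PerplexityBot", "Bytespider", "CCBot"]

with open("robots.txt") as f:
    for lineno, line in enumerate(f, start=1):
        line = line.split("#")[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip()
            if agent not in KNOWN_AGENTS:
                close = get_close_matches(agent, KNOWN_AGENTS, n=1)
                hint = f" (did you mean {close[0]}?)" if close else ""
                print(f"Line {lineno}: unknown crawler '{agent}'{hint}")
```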
### Mistake 4: Conflicting rules

```
# Which rule applies to GPTBot?
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
```

File order doesn't resolve this: under the robots.txt standard (RFC 9309), a crawler follows the most specific matching group, so the named `GPTBot` group wins over `User-agent: *` and GPTBot is allowed. Not every crawler implements this faithfully, so test to confirm.
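You can confirm the precedence with the standard-library parser, feeding it the conflicting rules above. Python's implementation follows the standard here, though individual crawlers' parsers vary:

```python
# Demonstrate group precedence: the named GPTBot group overrides the wildcard.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/page"))      # True
print(parser.can_fetch("UnknownBot", "https://example.com/page"))  # False
```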
## Robots.txt vs other controls
| Method | What it controls | Enforcement |
|---|---|---|
| robots.txt | Crawler access | Voluntary (crawlers should respect, but don't have to) |
| Meta robots | Indexing, following links | Voluntary |
| HTTP headers | Various directives | Voluntary |
| Login/paywall | Actual access | Enforced by your server |
| AI model terms | Training data use | Legal agreement |
Robots.txt is a request, not a guarantee. For sensitive content, use actual access controls.
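For example, enforcement can live in the application layer rather than in robots.txt. A minimal WSGI middleware sketch; the blocked list is illustrative, and user agents can be spoofed, so pair this with IP verification for genuinely sensitive content:

```python
# Block listed AI crawlers at the server, regardless of robots.txt.
# BLOCKED_AGENTS is illustrative; user agents can be spoofed, so this is
# a first line of defense, not a guarantee.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

def block_ai_crawlers(app):
    """Wrap a WSGI app, returning 403 for requests from listed crawlers."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(agent in user_agent for agent in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```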
## Platform-specific notes

### OpenAI (ChatGPT)
- GPTBot: Training and product improvement
- ChatGPT-User: Real-time browsing when users ask ChatGPT to search
- OpenAI states that both respect robots.txt

### Google
- Googlebot: Traditional search and AI Overviews
- Google-Extended: A control token for Gemini training, honored by Google's existing crawlers rather than a separate bot fetching pages
- Blocking Google-Extended still allows search and AI Overviews

### Perplexity
- PerplexityBot: Powers real-time search
- Blocking it removes you from Perplexity results

### Anthropic (Claude)
- ClaudeBot: Training data
- Claude-Web: Real-time retrieval (when available)
## FAQs

### Does blocking AI crawlers remove me from AI training data?
Not necessarily. AI models may already include your content from before you blocked crawlers, or from other sources such as Common Crawl archives.

### Will robots.txt affect my Google rankings?
Only if you block Googlebot. Blocking Google-Extended (the Gemini training token) doesn't affect search rankings.

### How quickly do changes take effect?
Crawlers re-check robots.txt periodically (often daily to weekly). Changes may take 1-2 weeks to fully propagate.

### Should I block all AI crawlers for copyright reasons?
Blocking crawlers is one measure, but it's not a complete solution. Consult legal counsel for copyright concerns about AI training.
## Next steps
- Check your current robots.txt configuration
- Review the AI search optimization checklist
- Learn about schema markup for AI search