AI Bot Crawlers Explained: GPTBot, ClaudeBot, and How to Control Them
GPTBot, ClaudeBot, PerplexityBot, Amazonbot — the AI crawler ecosystem in 2026. Who's out there, what they want, how to identify them in your logs, and exactly how to allow or block them.
Open your access logs in 2026 and a noticeable share of the traffic is bots you may not recognise: GPTBot, ClaudeBot, PerplexityBot, AmazonBot, Bytespider, Applebot-Extended. They're crawling for AI — either to train models, to ground real-time AI search results, or to power the assistants now embedded in browsers and operating systems. Some respect robots.txt, some don't, and the rules of engagement keep changing. This article is an honest reference: who they are, what they want, and how to control them.
The crawler landscape in 2026
There are now three distinct categories of AI crawler, and the right policy for each is different:
1. Training crawlers
Scrape content to train future model versions. Examples: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google for Gemini training, separate from Googlebot), Applebot-Extended (Apple Intelligence), Bytespider (ByteDance, often aggressive). These respect robots.txt by published policy.
2. Real-time AI search and RAG crawlers
Fetch a page on demand when an AI assistant needs to ground an answer. Examples: OAI-SearchBot (ChatGPT search), Claude-Web (Claude with web access), PerplexityBot, Amazonbot (Alexa+ and Amazon AI). Blocking these removes your site from AI-generated answers — which may or may not be what you want. They also broadly respect robots.txt.
3. Unidentified or evasive scrapers
Crawl from residential proxies, rotate user agents, ignore robots.txt. Often associated with smaller startups training their own models, or data-broker operations selling "web datasets" to anyone. These cannot be controlled by robots.txt; you need server-level or CDN-level mitigation.
How to identify AI bots in your logs
User-agent strings are still the primary signal. Look for these exact substrings in your access logs:
- GPTBot, OAI-SearchBot, ChatGPT-User — OpenAI
- ClaudeBot, Claude-Web, anthropic-ai — Anthropic
- Google-Extended — Google (Gemini training; separate from Googlebot for search)
- PerplexityBot — Perplexity
- Applebot-Extended — Apple (training; separate from Applebot for Siri/Spotlight indexing)
- Amazonbot — Amazon
- Bytespider — ByteDance / TikTok
- CCBot — Common Crawl (powers many downstream training datasets)
- FacebookBot, Meta-ExternalAgent — Meta
- cohere-ai — Cohere
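In code, that mapping is a simple substring check. A minimal Python sketch, assuming the crawler categories described earlier (the token lists mirror the examples above, and `classify_ua` is an illustrative name, not any library's API):

```python
# Token lists drawn from the crawler categories above; extend them as
# new crawlers appear. Matching is case-insensitive substring search,
# the same approach the grep and nginx rules below use.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
                 "Applebot-Extended", "Bytespider", "CCBot", "cohere-ai"]
AI_SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-Web",
                  "PerplexityBot", "Amazonbot"]

def classify_ua(user_agent: str) -> str:
    """Classify a User-Agent string as 'training', 'ai-search', or 'other'."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_BOTS):
        return "training"
    if any(token.lower() in ua for token in AI_SEARCH_BOTS):
        return "ai-search"
    return "other"
```

Substring matching deliberately errs on the side of catching versioned strings like `GPTBot/1.2` embedded in a longer user agent.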
Quick log scan to see which AI crawlers have visited recently. This assumes nginx's default combined log format, where the user agent is the sixth double-quote-delimited field (splitting on whitespace would truncate multi-word user agents):
grep -iE 'GPTBot|ClaudeBot|Google-Extended|PerplexityBot|Bytespider|Applebot-Extended|CCBot' /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
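If you want more than a one-liner, the same scan is easy to script. A Python sketch under the same assumption of nginx's default combined log format, with the user agent as the sixth double-quote-delimited field (`count_ai_hits` is an illustrative name):

```python
# Count requests per AI crawler in an nginx "combined"-format access log.
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot",
           "Bytespider", "Applebot-Extended", "CCBot"]

def count_ai_hits(lines):
    """Return a Counter mapping bot name -> request count."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not a combined-format line; skip it
        ua = parts[5].lower()  # sixth quoted field is the user agent
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
                break
    return counts
```

Pass it `open('/var/log/nginx/access.log')` to run against a live log; the function only reads, never modifies.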
Should you block them?
There is no single right answer. The trade-off is real and worth thinking about per category.
The case for blocking
- Your content is the product (publishing, courses, journalism). Training crawlers extract value without compensation.
- You sell data, research, or analysis that loses value once it's free in chat answers.
- The crawler is hammering your server enough to affect performance.
- You believe AI training without explicit consent is wrong on principle.
The case for allowing
- AI search (ChatGPT, Perplexity, Claude) is becoming a real source of referral traffic. Blocking the crawler removes you from those results.
- Brand visibility in AI answers is valuable in itself, even without click-throughs — analogous to being mentioned in a Wikipedia article.
- Your content benefits from broader distribution (documentation, marketing copy, public-interest writing).
- You believe AI tools are net-positive and want to contribute to better outputs.
robots.txt: the polite layer
Robots.txt is a public file at the root of your site (e.g. https://yourdomain.com/robots.txt) that lists which user agents may or may not crawl which paths. Major AI companies have publicly committed to respecting it.
Block all training crawlers, allow real-time AI search:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
# Allow real-time AI search crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-Web
Allow: /
# Default for everything else
User-agent: *
Disallow:
Blocking training crawlers by name is a defensible posture in 2026, but note what the wildcard at the bottom does: User-agent: * with an empty Disallow allows everything not matched above, so a brand-new training crawler gets access until you add an explicit block for it. New crawlers appear monthly; revisit the list regularly. If you prefer a deny-by-default posture, flip the wildcard to Disallow: / and add explicit Allow groups for the bots you want — being aware that this default also blocks conventional search crawlers like Googlebot and Bingbot unless you allow them explicitly.
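Before deploying, you can sanity-check a policy like this with Python's standard-library robots.txt parser. A sketch using a trimmed version of the file above:

```python
# Verify that a robots.txt policy behaves as intended before publishing it.
from urllib.robotparser import RobotFileParser

# Trimmed policy: block one training crawler, allow one AI search
# crawler, allow everyone else by default (empty Disallow).
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

gptbot_blocked = not rp.can_fetch("GPTBot", "https://example.com/article")
search_allowed = rp.can_fetch("OAI-SearchBot", "https://example.com/article")
```

A named User-agent group takes precedence over the * default, so GPTBot is denied site-wide while OAI-SearchBot and unlisted agents remain allowed.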
Server-level enforcement
Robots.txt is a request, not enforcement. Bots that ignore it need server-level blocking. Add this inside the relevant server block of your nginx configuration to return a 403 to the worst offenders:
if ($http_user_agent ~* (GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|Applebot-Extended|CCBot)) {
    return 403;
}
The Apache equivalent in .htaccess (requires mod_rewrite):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|Applebot-Extended|CCBot) [NC]
RewriteRule .* - [F,L]
Server-level rules catch any bot that announces itself honestly, even if it ignores robots.txt. They don't catch crawlers that lie about their user agent.
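It's worth testing the pattern before you deploy it. A small Python sketch that applies the same case-insensitive regex as the nginx and Apache rules above, so you can confirm which user agents would receive a 403 (`would_block` is an illustrative name, not part of any server API):

```python
# The same case-insensitive alternation used in the server rules above.
import re

BLOCK_PATTERN = re.compile(
    r"GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|"
    r"Applebot-Extended|CCBot",
    re.IGNORECASE,
)

def would_block(user_agent: str) -> bool:
    """True if this user agent would match the server-level 403 rule."""
    return bool(BLOCK_PATTERN.search(user_agent))
```

Note that plain Googlebot and plain Applebot do not match: only the -Extended training variants are in the pattern, so ordinary search indexing is unaffected.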
CDN-level blocking (the strongest option)
If your site is fronted by Cloudflare, Fastly, or BunnyCDN, you have access to bot-management features that go beyond user-agent matching. Cloudflare's free plan now includes one-click "Block AI Bots" — a managed list that includes the major AI crawlers and is updated as new ones appear. Pro and Business plans add behavioural detection that catches scrapers using residential proxies and rotating user agents.
For most site owners worried about AI scraping, putting the site behind Cloudflare and toggling "Block AI Bots" on the dashboard is the highest-leverage single action available. It costs nothing and stops the great majority of identifiable training traffic.
The bots that ignore everything
A growing share of training data collection now uses residential proxy networks and rotating user agents specifically to evade controls. You will see traffic that looks like normal human visitors — Chrome on Windows from a residential IP — but at impossible request rates. Mitigations:
- Rate-limit per-IP at the CDN. Real users don't fetch 200 pages per minute.
- Challenge suspicious patterns with CAPTCHA or Cloudflare Turnstile.
- Watermark unique strings into your content so you can later prove it appeared in a model. Doesn't prevent scraping but creates leverage.
- Accept that perfect prevention isn't possible and focus on the bots you can identify.
What's coming next
The IETF is working on standardised AI-preference signals (aipref working group) that go beyond robots.txt — for example, expressing "may be used for AI search but not training" or "may be used with attribution". A separate community effort, llms.txt, proposes a structured file that helps AI assistants understand your site's content and licensing without scraping the whole thing. None of these are enforceable today, but the major AI labs are publicly engaging with the standards work, and we expect a more granular consent model within 18–24 months.
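For a sense of the shape, an llms.txt is just a small markdown file at the site root. A hypothetical example following the current community proposal — the title, summary, paths, and licensing wording here are all placeholders, not a standardised vocabulary:

```
# Example Site

> A blog about web hosting and domains. Content may be cited in AI search
> answers with attribution; not licensed for model training.

## Guides

- [Getting started](https://example.com/docs/start.md): setup guide
- [Pricing](https://example.com/pricing.md): current plans
```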
Until then, the practical advice is: pick a posture (block, allow, or middle-ground), implement it in robots.txt and at the server or CDN level, and revisit the bot list every six months as new crawlers appear. If you're hosting at Maxinames and want help wiring up nginx-level rules or Cloudflare bot management for your site, our support team can walk you through it.
Ready to put this into practice?
Search for your domain, pick a hosting plan, or talk to our team.