
AI Bot Crawlers Explained: GPTBot, ClaudeBot, and How to Control Them

GPTBot, ClaudeBot, PerplexityBot, Amazonbot — the AI crawler ecosystem in 2026. Who's out there, what they want, how to identify them in your logs, and exactly how to allow or block them.


Open your access logs in 2026 and a noticeable share of the traffic is bots you may not recognise: GPTBot, ClaudeBot, PerplexityBot, AmazonBot, Bytespider, Applebot-Extended. They're crawling for AI — either to train models, to ground real-time AI search results, or to power the assistants now embedded in browsers and operating systems. Some respect robots.txt, some don't, and the rules of engagement keep changing. This article is an honest reference: who they are, what they want, and how to control them.

The crawler landscape in 2026

There are now three distinct categories of AI crawler, and the right policy for each is different:

1. Training crawlers

Scrape content to train future model versions. Examples: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google's robots.txt token for Gemini training, separate from Googlebot), Applebot-Extended (Apple Intelligence), Bytespider (ByteDance, often aggressive). Most of these respect robots.txt by published policy; Bytespider is the notable exception and has frequently been observed ignoring it.

2. Real-time AI search and RAG crawlers

Fetch a page on demand when an AI assistant needs to ground an answer. Examples: OAI-SearchBot (ChatGPT search), Claude-Web (Claude with web access), PerplexityBot, Amazonbot (Alexa+ and Amazon AI). Blocking these removes your site from AI-generated answers — which may or may not be what you want. They also broadly respect robots.txt.

3. Unidentified or evasive scrapers

Crawl from residential proxies, rotate user agents, ignore robots.txt. Often associated with smaller startups training their own models, or data-broker operations selling "web datasets" to anyone. These cannot be controlled by robots.txt; you need server-level or CDN-level mitigation.

How to identify AI bots in your logs

User-agent strings are still the primary signal. Look for these exact substrings in your access logs:

  • GPTBot, OAI-SearchBot, ChatGPT-User — OpenAI
  • ClaudeBot, Claude-Web, anthropic-ai — Anthropic
  • Google-Extended — Google (Gemini training; a robots.txt control token rather than a crawler, so it won't appear as a user agent in logs: the fetching is done by Googlebot)
  • PerplexityBot — Perplexity
  • Applebot-Extended — Apple (training control token; the crawling is done by Applebot, which also indexes for Siri/Spotlight)
  • Amazonbot — Amazon
  • Bytespider — ByteDance / TikTok
  • CCBot — Common Crawl (powers many downstream training datasets)
  • FacebookBot, Meta-ExternalAgent — Meta
  • cohere-ai — Cohere

Quick log scan to count which AI crawlers have visited recently (grep -o prints only the matched name, so the count works regardless of your log format):

grep -oiE 'GPTBot|ClaudeBot|Google-Extended|PerplexityBot|Bytespider|Applebot-Extended|CCBot' /var/log/nginx/access.log | sort | uniq -c | sort -rn
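
User-agent strings can be spoofed, so a GPTBot line in your logs is not proof the request came from OpenAI. A quick spot check, assuming the standard combined log format with the client IP in the first field, is to reverse-resolve the claimed bot's IPs. Not every operator sets PTR records; OpenAI and several others also publish official IP ranges you can match against instead.

# Reverse-resolve every unique IP that claimed to be GPTBot
grep -i 'GPTBot' /var/log/nginx/access.log | awk '{print $1}' | sort -u | while read -r ip; do
  echo "$ip -> $(dig +short -x "$ip")"
done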

Should you block them?

There is no single right answer. The trade-off is real and worth thinking about per category.

The case for blocking

  • Your content is the product (publishing, courses, journalism). Training crawlers extract value without compensation.
  • You sell data, research, or analysis that loses value once it's free in chat answers.
  • The crawler is hammering your server enough to affect performance.
  • You believe AI training without explicit consent is wrong on principle.

The case for allowing

  • AI search (ChatGPT, Perplexity, Claude) is becoming a real source of referral traffic. Blocking the crawler removes you from those results.
  • Brand visibility in AI answers is valuable in itself, even without click-throughs — analogous to being mentioned in a Wikipedia article.
  • Your content benefits from broader distribution (documentation, marketing copy, public-interest writing).
  • You believe AI tools are net-positive and want to contribute to better outputs.

robots.txt: the polite layer

Robots.txt is a public file at the root of your site (e.g. https://yourdomain.com/robots.txt) that lists which user agents may or may not crawl which paths. Major AI companies have publicly committed to respecting it.

Block all training crawlers, allow real-time AI search:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow real-time AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-Web
Allow: /

# Default for everything else
User-agent: *
Disallow:

Note the wildcard at the bottom: Disallow: with an empty path allows everything, so a new training crawler you haven't listed is permitted until you add it, and new ones appear monthly. A stricter posture, defensible in 2026, flips the wildcard to Disallow: / and whitelists only the crawlers you want. A bot obeys the most specific User-agent group that matches it, so named groups override the wildcard.
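
A deny-by-default sketch along those lines (Googlebot and Bingbot are shown as examples of search crawlers you probably still want; adjust the allow list to taste):

# Allow named crawlers, block everything else
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything not named above is blocked
User-agent: *
Disallow: /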

Server-level enforcement

Robots.txt is a request, not an enforcement mechanism. Bots that ignore it need server-level blocking. Add this inside the relevant server block of your nginx config to return a 403 for the worst offenders:

if ($http_user_agent ~* "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|Applebot-Extended|CCBot)") {
  return 403;
}
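
If you can edit the http block, a map is the more idiomatic nginx pattern: it keeps the bot list in one place and avoids the well-known pitfalls of if in location contexts. A sketch (the variable name $block_ai_bot is arbitrary):

# In the http block: flag AI-crawler user agents
map $http_user_agent $block_ai_bot {
  default 0;
  "~*(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|Applebot-Extended|CCBot)" 1;
}

# In each server block you want protected:
if ($block_ai_bot) {
  return 403;
}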

Apache equivalent in .htaccess:

# .htaccess in the site root; [F] answers 403 Forbidden, [L] stops further rules
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|anthropic-ai|Google-Extended|Bytespider|Applebot-Extended|CCBot) [NC]
RewriteRule .* - [F,L]

Server-level rules catch any bot that announces itself honestly even if it ignores robots.txt. They don't catch crawlers that lie about their user agent.

CDN-level blocking (the strongest option)

If your site is fronted by Cloudflare, Fastly, or BunnyCDN, you have access to bot-management features that go beyond user-agent matching. Cloudflare's free plan now includes one-click "Block AI Bots" — a managed list that includes the major AI crawlers and is updated as new ones appear. Pro and Business plans add behavioural detection that catches scrapers using residential proxies and rotating user agents.

For most site owners worried about AI scraping, putting the site behind Cloudflare and toggling "Block AI Bots" on the dashboard is the highest-leverage single action available. It costs nothing and stops the great majority of identifiable training traffic.
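
If you want finer-grained control than the managed toggle, for example blocking training crawlers while leaving AI search bots alone, Cloudflare custom WAF rules accept expressions in its Rules language. A sketch (set the rule action to Block and extend the list to mirror your robots.txt policy):

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot")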

The bots that ignore everything

A growing share of training data collection now uses residential proxy networks and rotating user agents specifically to evade controls. You will see traffic that looks like normal human visitors — Chrome on Windows from a residential IP — but at impossible request rates. Mitigations:

  • Rate-limit per-IP at the CDN or web server. Real users don't fetch 200 pages per minute (see the nginx sketch after this list).
  • Challenge suspicious patterns with CAPTCHA or Cloudflare Turnstile.
  • Watermark unique strings into your content so you can later prove it appeared in a model. Doesn't prevent scraping but creates leverage.
  • Accept that perfect prevention isn't possible and focus on the bots you can identify.
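
A minimal per-IP rate-limit sketch for nginx; the zone name perip, the 10 requests per second, and the burst of 50 are assumptions (generous for human browsing, tight for a scraper), so tune them against your own traffic:

# In the http block: track clients by IP, allow 10 requests/second sustained
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

# In the server or location block: permit short bursts, then answer 429
limit_req zone=perip burst=50 nodelay;
limit_req_status 429;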

What's coming next

The IETF is working on standardised AI-preference signals (aipref working group) that go beyond robots.txt — for example, expressing "may be used for AI search but not training" or "may be used with attribution". A separate community effort, llms.txt, proposes a structured file that helps AI assistants understand your site's content and licensing without scraping the whole thing. None of these are enforceable today, but the major AI labs are publicly engaging with the standards work, and we expect a more granular consent model within 18–24 months.
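
The llms.txt format is still a draft, but the current proposal is just a short markdown file served at /llms.txt: an H1 with the site name, a blockquote summary, and sections of annotated links. A sketch with hypothetical names and URLs:

# Example Hosting Co

> Web hosting and domain registration. Documentation may be cited in AI
> answers with attribution; bulk use for model training requires a licence.

## Docs

- [Getting started](https://example.com/docs/start): first-steps guide
- [DNS reference](https://example.com/docs/dns): record types and TTLs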

Until then, the practical advice is: pick a posture (block, allow, or middle-ground), implement it in robots.txt and at the server or CDN level, and revisit the bot list every six months as new crawlers appear. If you're hosting at Maxinames and want help wiring up nginx-level rules or Cloudflare bot management for your site, our support team can walk you through it.

Ready to put this into practice?

Search for your domain, pick a hosting plan, or talk to our team.