
Is GPTBot blocked on your site? Here's what that means for AI agents

Feb 24, 2026

Right now, at least a dozen AI crawlers are probably hitting your site. Some are scraping your content to train models. Others are fetching pages in real time so ChatGPT or Perplexity can answer questions about you. Most site owners have no idea which ones they're allowing and which ones they've accidentally blocked.

Your robots.txt file controls all of this. And there's a good chance yours is either wide open or blocking the wrong things.

Here's how to check and fix it.

The bots you should know about

Not all AI crawlers do the same thing. Some collect training data. Others fetch your pages live when a user asks a question. That difference matters, because you might want to block one but not the other.

Training bots scrape your content to build or fine-tune AI models. Your pages get absorbed into the model's weights. You don't get a link, attribution, or traffic in return.

User-agent bots fetch your pages in real time when someone asks an AI tool a question. These work more like search engines - they can send you traffic, and they usually link back to your site.

Here's the list that actually matters:

Bot | Company | What it does
GPTBot | OpenAI | Training data collection + powers ChatGPT's web browsing
OAI-SearchBot | OpenAI | Fetches pages for SearchGPT results
ChatGPT-User | OpenAI | ChatGPT browsing plugin, real-time page fetches
ClaudeBot | Anthropic | Training data collection
anthropic-ai | Anthropic | Older training crawler, still active
Google-Extended | Google | Gemini training data. This is NOT Googlebot - blocking it won't affect your search rankings
Googlebot | Google | Regular search indexing. Don't block this unless you want to disappear from Google
CCBot | Common Crawl | Open dataset used by many AI labs for training
Bytespider | ByteDance | TikTok's parent company, training data collection
PerplexityBot | Perplexity AI | Real-time search, links back to sources
Applebot-Extended | Apple | Apple Intelligence training. Separate from the regular Applebot used for Siri/Spotlight

GPTBot gets the most attention, but it's only one of many. If you block GPTBot and leave CCBot wide open, your content is still ending up in training datasets.

How to check what you're currently blocking

Open your browser and go to:

https://yourdomain.com/robots.txt

That's it. The file is public. You'll see something like this:

User-agent: *
Disallow: /admin/
Disallow: /private/

If you don't see any AI bot names in there, you're allowing all of them. If you see User-agent: GPTBot with a Disallow: /, that one is blocked.

Most sites I check fall into one of two buckets: they either block nothing, or they block GPTBot specifically because someone read a headline about it in 2023 and forgot about the other ten crawlers doing the exact same thing.
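
If you'd rather script the check than eyeball it, here's a small Python sketch that scans a robots.txt body for the bot names from the table, using only the standard library (the commented-out URL is a placeholder for your own domain):

```python
import urllib.request

# The AI crawlers from the table above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "CCBot", "Bytespider", "PerplexityBot", "Applebot-Extended",
]

def mentioned_bots(robots_txt: str) -> list[str]:
    """Return the AI bots explicitly named in a User-agent line."""
    agents = {
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("user-agent:")
    }
    return [bot for bot in AI_BOTS if bot in agents]

# Live check against a real site (uncomment and swap in your domain):
# text = urllib.request.urlopen("https://yourdomain.com/robots.txt").read().decode()
# print(mentioned_bots(text))
```

If this returns an empty list, your file doesn't mention any AI crawler by name, which (barring a blanket User-agent: * block) means they're all allowed.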

Training vs. browsing - why it matters

Here's the decision most people skip: do you actually want AI tools to be able to reference your content?

If you run a recipe blog and someone asks ChatGPT "how do I make sourdough starter," you probably want ChatGPT-User and OAI-SearchBot to be able to fetch your page and link to it. That's traffic. What you probably don't want is GPTBot scraping your entire archive so the model can answer sourdough questions without ever sending anyone to your site.

The distinction is between bots that take your content for training and bots that bring users to your content in real time.

Block the training bots. Think twice before blocking the browsing ones.

Copy-paste robots.txt configs

Pick the scenario that fits your situation. Add the relevant lines to the end of your existing robots.txt file.

Option A: Block training, keep AI search

This blocks bots that scrape for training while still letting AI search tools reference your pages.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow AI search/browsing bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
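
Before deploying, you can sanity-check a config like this with Python's built-in robots.txt parser. A quick sketch against a trimmed version of the rules above (example.com stands in for your domain):

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the config above: block one training bot, allow one search bot.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/post"))         # False: blocked
print(rp.can_fetch("OAI-SearchBot", "https://example.com/post"))  # True: allowed
print(rp.can_fetch("SomeOtherBot", "https://example.com/post"))   # True: unmentioned bots default to allowed
```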

Option B: Block everything

If you don't want any AI system crawling your site, training or otherwise:

# Block all AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Option C: Allow everything

If you want maximum AI visibility and don't mind your content being used for training:

# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: Google-Extended
Allow: /

If your robots.txt doesn't name a bot at all, it falls back to your User-agent: * rules - and if those don't block it, it's allowed by default. So Option C is really just making your intent explicit.

The catch: robots.txt is voluntary

robots.txt is a protocol, not a firewall. It's a polite request. Well-behaved bots respect it. The major companies - OpenAI, Google, Anthropic, Apple - do follow robots.txt directives. But nothing technically prevents a crawler from ignoring the file entirely.

Some lesser-known crawlers don't check robots.txt at all. If you need hard enforcement, you'll need server-side blocking by user-agent string or IP range. Cloudflare and several other CDNs offer bot management tools that can handle this. But for most sites, robots.txt covers the crawlers that matter.
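
To make the idea concrete: hard enforcement means matching the User-Agent header (or IP) before serving the page. Real setups do this in the CDN or web server, but here's an illustrative Python sketch as WSGI middleware (the bot list is an example, not exhaustive):

```python
# Illustration only: production blocking usually lives in your CDN or
# web server config, not application code.
BLOCKED_SUBSTRINGS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

def block_ai_bots(app):
    """Wrap a WSGI app and return 403 when the User-Agent names a blocked bot."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request - but it only works against bots that identify themselves honestly in the User-Agent header.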

Blocking is step one. Being agent-friendly is the bigger move.

Controlling which bots can access your site is the defensive play. It's worth doing, but it's only part of the picture.

The sites that will get the most out of AI - not just search engines, but AI agents that browse, compare, and buy on behalf of users - are the ones that make themselves easy for machines to understand. That means structured data, clean APIs, and files like llms.txt that tell AI systems what your site is about and how to use it.

If you want to see where your site stands on all of this, Silicon Friendly rates websites across 30 criteria on a scale from L0 to L5 for agent-friendliness. It checks robots.txt configuration, but also structured data, API availability, and a lot more. Worth a look if you've just fixed your robots.txt and want to know what else you might be missing.

Quick checklist

  1. Visit yourdomain.com/robots.txt right now
  2. Check which AI bots are mentioned (probably none or just GPTBot)
  3. Decide: do you want to block training, browsing, or both?
  4. Copy the relevant config above and add it to your file
  5. Verify the change by revisiting the URL

Five minutes. That's all it takes to go from "I have no idea what's crawling my site" to having a clear, intentional policy about it.
