robots.txt vs llms.txt: what's the difference and which do you need?
Your website talks to two audiences now: search engine crawlers and large language models. They want different things, they read content differently, and the files that serve them have almost nothing in common.
Most developers know robots.txt. Fewer know llms.txt. Almost nobody understands why you need both.
robots.txt: the bouncer
robots.txt has been around since 1994. Martijn Koster, a Dutch engineer, proposed it after web crawlers kept hammering his server. The idea was simple: put a text file at your site root that tells bots which pages they can and can't visit.
For 28 years it existed as an informal standard: everyone followed it, but nobody had formally ratified it. The IETF finally published RFC 9309 in September 2022, making it official.
A typical robots.txt in 2025 looks like this:
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
It's a bouncer. It says "you can come in" or "you can't." That's it. No nuance, no context, no guidance. And compliance is voluntary. There is no enforcement mechanism. Bots follow it because they choose to.
The AI era has turned robots.txt into a battlefield. As of late 2025, 79% of top news sites block AI training bots. GPTBot went from zero robots.txt mentions to being named in 578,000 websites' files in about 15 months. ChatGPT-User requests surged 2,825% in a single year.
Blocking a crawler is not the same as helping one. And that's where robots.txt runs out of ideas.
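Checking who gets past the bouncer is easy to script. Here's a minimal sketch using Python's standard-library robotparser against rules like the example above (the URL and rules are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules mirroring the example robots.txt above
rules = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Search crawler is welcome; the AI training bot is not
print(parser.can_fetch("Googlebot", "https://example.com/docs"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/docs"))     # False
```

Note that this only tells you what the file says, which is exactly the point: the file can only say yes or no.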
llms.txt: the tour guide
In September 2024, Jeremy Howard (co-founder of Answer.AI and fast.ai) published a proposal for a new file: llms.txt. The idea came from a specific frustration. LLMs increasingly need to use website content at inference time, when they're actively generating a response. Not for training. For answering questions right now.
And they're terrible at it.
Web pages are built for humans with browsers. They have nav bars, footers, ad scripts, cookie banners, tracking pixels, JavaScript bundles. When an LLM tries to read a typical HTML page, most of the tokens go to junk. Converting HTML to clean text can reduce token consumption by 20-30%, but the conversion itself is unreliable.
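The cleanup step itself is mundane, which is part of why it's so often done badly. A rough sketch with Python's standard-library HTMLParser, dropping script and style payloads (the sample page is made up):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

page = """<html><head><script>trackUser();</script></head>
<body><nav>Home | Docs</nav><p>Acme deploys containers.</p></body></html>"""

parser = TextExtractor()
parser.feed(page)
text = " ".join(parser.chunks)
print(text)  # the script payload is gone; nav and paragraph text remain
```

Even this toy version throws away a chunk of the raw markup; real pages, with their ad scripts and JS bundles, lose far more, and deciding what's junk (is the nav bar signal or noise?) is where the conversion gets unreliable.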
llms.txt solves this differently. Instead of telling bots what to avoid, it tells them what to read. It's a Markdown file at your site root that curates the most important content and provides clean, structured links.
A well-structured one looks like this:
# Acme API

> Cloud infrastructure API for deploying and managing containers.

Acme API provides RESTful endpoints for container orchestration,
monitoring, and scaling. Supports Docker and OCI images.

## Documentation

- [Getting Started](https://acme.dev/docs/quickstart): Set up your first container in 5 minutes
- [API Reference](https://acme.dev/docs/api): Complete endpoint documentation
- [Authentication](https://acme.dev/docs/auth): API keys, OAuth, and JWT setup

## Guides

- [Scaling Guide](https://acme.dev/guides/scaling): Auto-scaling configuration
- [Migration from AWS](https://acme.dev/guides/aws-migration): Step-by-step migration

## Optional

- [Changelog](https://acme.dev/changelog): Release notes
- [Status Page](https://status.acme.dev): Current system status
The format is simple. An H1 with your project name (the only required element). A blockquote summary. Then H2 sections with curated links and descriptions. The "Optional" section marks stuff that can be skipped when context windows are tight.
The spec also encourages companion files: an llms-full.txt containing all your documentation in one Markdown file, and .md mirrors of individual pages. Anthropic's docs site, for example, has an llms-full.txt that runs to 481,349 tokens.
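The format is mechanical enough to lint. Here's a rough validator sketch; the rules are a simplified reading of the format described above, not an official checker:

```python
import re

def lint_llms_txt(text):
    """Check an llms.txt body against a simplified reading of the format."""
    issues = []
    lines = [line for line in text.splitlines() if line.strip()]

    # The H1 title is the only required element.
    if not lines or not re.match(r"#\s+\S", lines[0]):
        issues.append("missing required H1 title on the first line")
    # The blockquote summary is recommended, right after the H1.
    if not any(line.startswith("> ") for line in lines[:3]):
        issues.append("no blockquote summary near the top")
    # Every curated link should carry a short description.
    for trailing in re.findall(r"^- \[[^\]]+\]\([^)]+\)(.*)$", text, re.M):
        if not trailing.strip().startswith(":"):
            issues.append("link without a ': description' suffix")
    return issues

sample = """\
# Acme API
> Cloud infrastructure API for deploying and managing containers.

## Documentation
- [Getting Started](https://acme.dev/docs/quickstart): Set up in 5 minutes
"""
print(lint_llms_txt(sample))  # an empty list: no problems found
```

A check like this would catch most of the broken implementations discussed below: wrong structure, missing H1, bare URL dumps with no descriptions.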
Why LLMs need something different
This is worth spelling out, because I keep seeing people ask "isn't sitemap.xml enough?"
No. And the reason matters.
A search crawler indexes content. It visits pages, extracts text, stores it in a database for later. The crawler has unlimited time and near-unlimited storage. It doesn't care about token efficiency. It can visit every page on your site and sort out relevance later.
An LLM synthesizes content. It needs to ingest information, reason about it, and produce an answer, often in real time. Context windows are finite. Even Gemini's 1M token window can't hold a medium-sized documentation site. The model needs a curated view: what are the most important pages, what do they cover, and where are the clean Markdown versions?
robots.txt can't express any of that. It's binary. Allow or disallow. It knows nothing about content priority, page relationships, or which resources matter most for understanding your product.
robots.txt is the bouncer checking IDs at the door. llms.txt is the building directory that tells you accounting is on the third floor.
The adoption picture
BuiltWith reports 844,000+ sites now have an llms.txt file. That sounds impressive until you dig in. Most implementations are bare-bones or broken. Common mistakes include wrong filenames (llm.txt or LLMS.txt), missing the required H1 heading, or just dumping a flat list of URLs with no descriptions.
The sites doing it well tend to be developer-facing: Anthropic, Cloudflare, Stripe, Supabase, Cursor. They have extensive documentation that AI coding assistants actively consume. For them, llms.txt is already useful, because tools like Cursor and Windsurf actually read it to pull in documentation context.
The skeptics have fair points too. Google's John Mueller compared llms.txt to the discredited <meta name="keywords"> tag. SE Ranking studied 300,000 domains and found no correlation between having llms.txt and being cited in LLM answers. Then, in December 2025, Google quietly added an llms.txt to its own Search Central documentation. When an SEO professional pointed out the irony, Mueller responded with a cryptic "hmmn :-/".
I think the truth is boring but practical: llms.txt matters right now for developer tools and documentation-heavy products, where AI coding assistants are the primary consumers. For a restaurant website or a local business, probably not worth the effort yet. But AI agents are getting better at using web content every month, and the sites that give them clean structured context will have an advantage when that usage becomes mainstream.
Do you need both?
Yes. They solve entirely different problems.
robots.txt controls access. Without it, you have no way to tell AI training crawlers to stay away from your content. If you care about that (and most publishers do), you need a robots.txt that specifically addresses GPTBot, ClaudeBot, Google-Extended, CCBot, and the growing list of AI crawlers.
llms.txt controls understanding. If you want AI systems to accurately represent your product, answer questions about your API, or help developers use your tools, llms.txt gives those systems the curated context they need.
Blocking training crawlers while providing an llms.txt is not contradictory. You can say "don't scrape my site for training data" in robots.txt while also saying "here's the best way to understand my product" in llms.txt. One is about what bots take from you. The other is about what you give to them.
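In robots.txt terms, that split looks something like this (the bot names come from earlier in the article; the policy itself is illustrative, and each operator decides how it honors these rules):

```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow user-facing AI search and retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Pair that with an llms.txt at the same site root and you've expressed both halves: what bots may not take, and what they should read.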
Where agent-friendliness goes beyond both files
robots.txt and llms.txt are two pieces of a bigger puzzle. A truly agent-friendly website also considers structured data, API access, authentication flows for AI agents, and how well its pages render without JavaScript.
At Silicon Friendly, we rate websites across 30 criteria on an L0-L5 scale, where L0 means "completely invisible to agents" and L5 means "built for agents from the ground up." Having both robots.txt and llms.txt properly configured is part of that picture, but far from the whole thing.
Checklist: making your site work for both crawlers and LLMs
robots.txt
- File exists at yourdomain.com/robots.txt
- Explicitly addresses AI training bots (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, anthropic-ai)
- Separates training bots from search/user bots (OAI-SearchBot and ChatGPT-User are different from GPTBot)
- Includes a Sitemap directive
- Reviewed and updated quarterly as new bots appear
llms.txt
- File exists at yourdomain.com/llms.txt (exact filename, lowercase)
- Starts with an H1 heading containing your site or project name
- Includes a blockquote summary explaining what you do
- Links are curated, not exhaustive: 10-30 of your most important pages, not 500
- Each link has a short description
- Uses an "Optional" section for secondary content
- Companion llms-full.txt exists if you have extensive documentation
- Linked pages have .md mirror versions
- File is reviewed when you add or remove major content
Neither file is complicated. A developer can set up both in an afternoon. The hard part is curation: deciding which content matters most, writing useful descriptions, and keeping the file current as your site evolves. But that's what makes llms.txt valuable in the first place. Automated discovery is noisy. Human curation is signal.