2026-06-24

Robots.txt Explained: What It Can and Cannot Do

Robots.txt controls crawling, not indexing — the #1 SEO misconception. Learn the syntax, the noindex trap, real examples, and the mistakes that deindex entire sites.

robots.txt is a plain text file with outsized power: a single leftover Disallow: / can stop Google from crawling your entire site. Yet the most dangerous thing about it is a misconception. robots.txt controls crawling, not indexing, and understanding that distinction prevents the two most damaging mistakes covered below: blocking a whole site by accident, and trying to deindex pages with Disallow.

What robots.txt is

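A plain text file at the root of your host, for example example.com/robots.txt, that tells crawlers which URL paths they may fetch. No other location works, and each subdomain, protocol and port needs its own file. Googlebot fetches it before crawling your pages and works from a cached copy between fetches.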

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml

User-agent — which bot the rules address (* = all)
Disallow — path prefixes not to crawl
Allow — exceptions within disallowed paths
Sitemap — where your page list lives (always include this)
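
To see how Allow carves an exception out of a Disallow, here is a minimal Python sketch of the longest-match rule that Google documents and RFC 9309 specifies: among the rules that match a path, the one with the longest prefix wins, and Allow wins a tie. The is_allowed function and the RULES list are illustrative names, not part of any library, and the sketch deliberately ignores wildcards (* and $) and per-bot groups.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs such as
    ("disallow", "/admin/"). Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty value matches nothing
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_for_allow:
                best_len = len(prefix)
                allowed = directive == "allow"
    return allowed


# The rules from the example file above, as (directive, prefix) pairs.
RULES = [
    ("disallow", "/admin/"),
    ("disallow", "/cart/"),
    ("allow", "/admin/public/"),
]

if __name__ == "__main__":
    for p in ["/admin/users", "/admin/public/help.html", "/blog/post"]:
        print(p, "allowed" if is_allowed(RULES, p) else "blocked")
```

Not every parser follows this rule. Python's standard urllib.robotparser, for one, has historically evaluated rules in file order, so test Allow exceptions against the crawler you actually care about.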

Build a correct file in seconds with our Robots.txt Generator.

The crawling vs indexing trap

Here is the counterintuitive part: Disallow does not remove a page from Google. If other sites link to a blocked page, Google can index the bare URL anyway — shown with "No information is available for this page."

Worse, the two controls sabotage each other:

Goal	Right tool	Common mistake
Stop crawling (save crawl budget)	robots.txt Disallow	—
Remove from search results	noindex meta tag or X-Robots-Tag header, page left crawlable	Blocking it in robots.txt too

If you block a page in robots.txt and add noindex, Google never crawls the page, never sees the noindex, and the URL can stay indexed indefinitely. To deindex: allow crawling, add noindex, wait until the page has been recrawled and dropped from the index, and only then block it if you still want to.
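
The conflict is easy to test for. The sketch below uses Python's standard urllib.robotparser to ask whether a crawler may fetch a URL, then looks for a noindex meta tag in the page's HTML. The deindex_status function, its return strings and the regular expression are assumptions made for illustration; the regex is a rough check that expects the name attribute before content, not a full HTML parser.

```python
import re
from urllib.robotparser import RobotFileParser

# Rough pattern for <meta name="robots" content="... noindex ...">.
NOINDEX_META = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def deindex_status(robots_txt, url, html, agent="Googlebot"):
    """Report whether a noindex tag on `url` is visible to `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(agent, url)
    has_noindex = bool(NOINDEX_META.search(html))

    if has_noindex and not crawlable:
        return "conflict: noindex is hidden behind Disallow"
    if has_noindex:
        return "ok: noindex is visible to the crawler"
    return "no noindex present"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    blocked = "User-agent: *\nDisallow: /old-page\n"
    print(deindex_status(blocked, "https://example.com/old-page", page))
```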

What robots.txt cannot do

Security. It is a public file that merely requests good behaviour. Malicious bots ignore it — and read it as a map of interesting paths. Never list secret URLs; protect private content with authentication.
Guaranteed compliance. Reputable crawlers (Google, Bing) obey; scrapers don't.
Per-page nuance. For that, use meta robots tags or X-Robots-Tag headers.
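
For the per-page case, the X-Robots-Tag response header does the job for files that cannot carry a meta tag, such as PDFs. Here is a minimal, framework-neutral Python sketch of the decision. The robots_headers function and the /internal-search/ path are hypothetical; wire the returned headers into whatever server or framework you use.

```python
def robots_headers(path: str) -> dict[str, str]:
    """Return extra response headers controlling indexing for `path`."""
    # PDFs have no <head>, so the header is the only way to say noindex.
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    # Hypothetical internal search results: keep them out of the index
    # but let crawlers follow the links they contain.
    if path.startswith("/internal-search/"):
        return {"X-Robots-Tag": "noindex, follow"}
    return {}


if __name__ == "__main__":
    for p in ["/files/report.PDF", "/internal-search/?q=shoes", "/blog/post"]:
        print(p, robots_headers(p))
```

Remember that the page must stay crawlable for the header to be seen, exactly as with the meta tag.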

The mistakes that hurt sites

Disallow: / left over from staging. The classic launch-day disaster — the entire site becomes uncrawlable. Check your live file after every deploy.
Blocking CSS/JS directories. Google renders pages; if the assets are blocked it may see a broken layout or miss content, which can hurt how the page is evaluated and ranked.
Using Disallow to deindex (the trap above).
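Wrong location or case. The file must be named robots.txt in lowercase and sit at the root of the host; copies in subdirectories are ignored. Path values in rules are case-sensitive too, so Disallow: /Admin/ does not block /admin/.
No Sitemap line. Free crawl-discovery help that many sites skip.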
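
Two of these mistakes can be caught mechanically after every deploy. The following Python sketch scans the text of a robots.txt file for a Disallow: / that applies to all crawlers and for a missing Sitemap line. The audit_robots function and its messages are illustrative names; fetching the live file and checking for blocked asset directories are left out to keep the example short.

```python
def audit_robots(text: str) -> list[str]:
    """Return a list of problems found in the robots.txt `text`."""
    problems = []
    agents = []          # user-agents of the group currently being read
    in_rules = False     # True once the current group has a rule line
    has_sitemap = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                problems.append("Disallow: / applies to all crawlers")
        elif field == "sitemap":
            has_sitemap = True

    if not has_sitemap:
        problems.append("no Sitemap line")
    return problems


if __name__ == "__main__":
    print(audit_robots("User-agent: *\nDisallow: /\n"))
```

Run as a post-deploy step, a non-empty result is a reason to fail the pipeline before crawlers pick up the file.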

Sensible defaults

Most sites need very little:

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Block only genuine crawl-waste: infinite calendar pages, faceted search combinations, cart/checkout flows. When in doubt, leave it open. An over-permissive robots.txt is usually harmless; an over-restrictive one is a slow-motion catastrophe.
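
If you generate the file at build time, the default above plus a short block list is all the logic you need. A minimal Python sketch, assuming a hypothetical build_robots helper and example paths:

```python
def build_robots(blocked_prefixes, sitemap_url):
    """Build robots.txt text for all crawlers from a list of path prefixes."""
    lines = ["User-agent: *"]
    if blocked_prefixes:
        lines += [f"Disallow: {prefix}" for prefix in blocked_prefixes]
    else:
        lines.append("Disallow:")  # an empty value blocks nothing
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example crawl-waste paths; replace with your own.
    print(build_robots(["/cart/", "/checkout/"], "https://example.com/sitemap.xml"))
```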

Frequently asked questions

How do I check what my robots.txt is doing? Google Search Console → Settings → robots.txt report shows the fetched file, its fetch status, and any parse errors or warnings. To see whether a specific URL is blocked, run it through the URL Inspection tool.

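Can I block AI crawlers like GPTBot? Yes. Major AI crawlers say they respect robots.txt, so a group with User-agent: GPTBot followed by Disallow: / opts you out. Weigh the trade-off: some AI vendors use separate user agents for model training and for search or answer features, so check each vendor's documentation first. Blocking the wrong bot can also remove you from answer engines that could cite and link you.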
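
A quick way to confirm that a GPTBot group blocks only that crawler is to parse the file with Python's standard urllib.robotparser. The ROBOTS_TXT content and the example URL below are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example file: block GPTBot everywhere, leave every other crawler open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for bot in ["GPTBot", "Googlebot"]:
        print(bot, may_fetch(bot, "https://example.com/post"))
```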

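Does crawl-delay work? Google ignores it, and the Search Console crawl-rate limiter was retired in early 2024; to slow Googlebot in an emergency, Google's guidance is to return 500, 503 or 429 responses temporarily. Bing respects crawl-delay. Yandex announced in 2018 that it would stop following the directive in favour of a crawl-rate setting in Yandex Webmaster. Rarely needed on modern hosting.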
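
If you write a crawler of your own and want it to honour the directive, Python's standard urllib.robotparser exposes the value through crawl_delay(). A minimal sketch, with a hypothetical polite_delay helper and an example file:

```python
from urllib.robotparser import RobotFileParser

# Example file: a 10-second delay for bingbot, nothing for anyone else.
ROBOTS_WITH_DELAY = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def polite_delay(robots_txt: str, agent: str, default: float = 0.0) -> float:
    """Seconds to wait between requests for `agent`, or `default` if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    for bot in ["bingbot", "Googlebot"]:
        print(bot, polite_delay(ROBOTS_WITH_DELAY, bot))
```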

How fast do robots.txt changes take effect? Google generally caches the file for up to 24 hours, and may keep using the cached copy for longer if the file cannot be fetched. Do not expect instant behaviour changes.