Glossary

Robots.txt

Definition: A plain-text file at the root of a website that tells search engine crawlers which pages or directories they may or may not crawl.

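The robots.txt file lives at the root of your host (https://example.com/robots.txt) and communicates crawling instructions to search engine bots. Each host, including each subdomain, needs its own file. It uses the Robots Exclusion Protocol, long a de facto convention and standardized as RFC 9309 in 2022. Compliance is voluntary: well-behaved crawlers follow the file, but it is not an access control mechanism.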
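Because crawlers look for the file at the root of the host, the robots.txt URL for any page can be derived from the page URL alone. The following is a minimal sketch in Python; the function name robots_url and the sample page URLs are illustrative assumptions, not part of any standard API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    Keeps the scheme, host, and port; replaces the path with /robots.txt
    and drops the query string and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs, used only for illustration.
main_site = robots_url("https://example.com/blog/post?id=1")
shop_site = robots_url("https://shop.example.com:8443/cart")
print(main_site)
print(shop_site)
```

Note that the subdomain in the second call resolves to a different robots.txt than the main site, which is why rules written for one host do not carry over to another.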

Basic Syntax

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
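
One way to check how rules like these are read is Python's standard urllib.robotparser module. The sketch below makes two assumptions worth stating: the bot name SomeBot is hypothetical, and the redundant "Allow: /" line is left out, because Python's parser applies the rules of a group in file order (first match wins) rather than by longest match, so a leading "Allow: /" would mask the Disallow lines there. The site_maps() call requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sample rules, minus the redundant "Allow: /" line (see the note above).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


# "SomeBot" is a hypothetical crawler with no group of its own,
# so it falls back to the "*" group.
generic_admin = can_crawl("SomeBot", "/admin/users")
generic_blog = can_crawl("SomeBot", "/blog/")

# Googlebot has its own group and is evaluated against that group only.
google_staging = can_crawl("Googlebot", "/staging/new")
google_admin = can_crawl("Googlebot", "/admin/users")

sitemaps = parser.site_maps()  # Python 3.8+
print(generic_admin, generic_blog, google_staging, google_admin, sitemaps)
```

The last Googlebot check is the one to watch: a bot that matches a named group follows that group alone, so the /admin/ and /private/ rules in the * group do not apply to it unless they are repeated in its own group.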

Key Directives

  • User-agent — Specifies which bot the rules apply to. * matches any bot that has no more specific group of its own.
  • Disallow — Paths the bot should not crawl.
  • Allow — Permits a sub-path inside a disallowed path. When an Allow and a Disallow both match a URL, the longest (most specific) rule wins.
  • Sitemap — Tells bots where to find the sitemap. Takes an absolute URL and is independent of any User-agent group.
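
How Allow and Disallow interact can be sketched in a few lines of Python. This is a simplified illustration of the longest-match rule described in RFC 9309, not a full parser: it assumes plain path prefixes (no * or $ wildcards), and the function name is_allowed and the /private/press/ path are hypothetical.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under one group's rules.

    `rules` is a list of (directive, prefix) pairs such as
    ("disallow", "/admin/"). The longest matching prefix wins;
    on a tie, Allow wins. With no matching rule the path is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        # An empty value (e.g. "Disallow:") matches nothing.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical rule set: /private/ is blocked except for /private/press/.
rules = [
    ("allow", "/"),
    ("disallow", "/admin/"),
    ("disallow", "/private/"),
    ("allow", "/private/press/"),
]

for path in ["/blog/", "/admin/users", "/private/notes", "/private/press/kit.pdf"]:
    print(path, is_allowed(rules, path))
```

Because the decision depends on prefix length rather than position, the order of the lines within a group does not matter under this rule. Parsers that evaluate rules in file order can give different answers for the same file.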

Critical Distinction: Disallow vs Noindex

  • Disallow prevents crawling: a compliant bot will not fetch the page.
  • Noindex (in a robots meta tag or an X-Robots-Tag HTTP header) prevents indexing: the bot fetches the page but does not add it to the index.
  • If you Disallow a page, Google cannot see a Noindex tag on it, so the URL might still appear in search results if it is linked from elsewhere. To keep a page out of the index, leave it crawlable and use Noindex.
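
The conflict in the last point can be checked mechanically. The sketch below, again using Python's urllib.robotparser, lists pages that carry a Noindex but are also blocked from crawling, which means the Noindex will never be seen. The function name hidden_noindex, the sample rules, and the sample URLs are assumptions for illustration; the list of Noindex pages is taken as given rather than detected.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the Noindex URLs that `user_agent` is not allowed to crawl.

    A crawler cannot read a Noindex on a page it may not fetch,
    so each URL returned here is a page whose Noindex has no effect.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules and pages.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
noindex_urls = [
    "https://example.com/private/old.html",  # blocked: Noindex is invisible
    "https://example.com/thanks.html",       # crawlable: Noindex is honored
]

conflicts = hidden_noindex(robots_txt, noindex_urls)
print(conflicts)
```

Each URL the function returns needs one of two fixes: remove the Disallow so the Noindex can be read, or accept that the URL may still be listed if other pages link to it.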