Glossary

Robots.txt

Definition: A plain-text file at the root of a website that tells search engine crawlers which pages or directories they may and may not crawl.

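The robots.txt file lives at the root of your host (https://example.com/robots.txt) and communicates crawling instructions to search engine bots. It follows the Robots Exclusion Protocol, a long-standing convention that was formalized as RFC 9309 in 2022. Each host needs its own file, so rules at example.com do not cover a subdomain such as blog.example.com. Compliance is voluntary: well-behaved crawlers honor the file, but it is not an access control mechanism.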
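Because the file always sits at the root, its location can be derived from any page URL. The following is a minimal Python sketch; the function name robots_url is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=1"))
# -> https://example.com/robots.txt
```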

Basic Syntax

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
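
To see how the example above breaks down into groups, here is a minimal Python sketch that reads the file into a dictionary of rules per user agent plus a list of sitemaps. The names parse_robots and EXAMPLE are assumptions for illustration, and the parser handles only the four directives shown here, not every extension a real crawler may support.

```python
EXAMPLE = """\
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
"""

def parse_robots(text):
    """Return (groups, sitemaps).

    groups maps a lowercased user-agent token to a list of
    (directive, path) tuples in file order.
    """
    groups = {}
    sitemaps = []
    current_agents = []     # agents named by the group being read
    reading_agents = False  # True while consecutive User-agent lines continue
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
                reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)  # not tied to any group
    return groups, sitemaps

groups, sitemaps = parse_robots(EXAMPLE)
print(groups)
print(sitemaps)
```

The result holds two groups, one for * and one for googlebot, which makes it easy to see that the Googlebot group contains only the /staging/ rule.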

Key Directives

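User-agent: Names the bot that the following rules apply to. * matches any bot that has no group of its own. A bot follows only the most specific group that matches it, so in the example above Googlebot obeys its own group and ignores the * rules.
Disallow: A path prefix the bot should not crawl. An empty value disallows nothing.
Allow: Permits crawling of a path that a broader Disallow would otherwise block. When several rules match, the longest (most specific) path wins.
Sitemap: The absolute URL of a sitemap. It applies to all bots, regardless of which group it appears near.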
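
How these directives combine can be sketched in Python. The sketch assumes the precedence described in RFC 9309: a bot uses the group named for it and falls back to * otherwise, the longest matching path wins, and Allow wins a tie. The helper names and the /admin/help/ path are hypothetical, the user-agent match is simplified to an exact token, and the * and $ wildcards in paths are not handled.

```python
GROUPS = {
    "*": [("allow", "/"), ("disallow", "/admin/"), ("allow", "/admin/help/")],
    "googlebot": [("disallow", "/staging/")],
}

def select_rules(groups, user_agent):
    """Pick the group named for this bot, else fall back to the * group."""
    return groups.get(user_agent.lower(), groups.get("*", []))

def is_allowed(rules, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed

other = select_rules(GROUPS, "SomeBot")
google = select_rules(GROUPS, "Googlebot")

print(is_allowed(other, "/admin/users"))      # False: /admin/ is the longest match
print(is_allowed(other, "/admin/help/faq"))   # True: the Allow rule is longer
print(is_allowed(google, "/admin/users"))     # True: Googlebot ignores the * group
print(is_allowed(google, "/staging/v2"))      # False
```

Note that the third result follows from group selection alone: once a bot has a group of its own, rules under * no longer apply to it, so shared restrictions have to be repeated in each named group.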

Critical Distinction: Disallow vs Noindex

Disallow prevents crawling: the bot will not fetch the page.
Noindex (in a robots meta tag or an X-Robots-Tag HTTP header) prevents indexing: the bot fetches the page but does not add it to the index.
If you Disallow a page, Google cannot see a Noindex directive on it, so the URL can still appear in search results if other pages link to it. To keep a page out of search results, leave it crawlable and serve Noindex.
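
The two Noindex signals can be checked with a short Python sketch. It assumes you already have the response headers as a dictionary and the page body as a string; the names has_noindex and index_outcome are hypothetical, and bot-specific forms such as a user-agent prefix in the header or a meta tag named for one crawler are ignored.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

def has_noindex(headers, html):
    """True if noindex appears in the X-Robots-Tag header or a robots meta tag."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives

def index_outcome(crawl_allowed, noindex_present):
    """Summarize the distinction drawn above."""
    if not crawl_allowed:
        return "not crawled; noindex is never seen, URL may still be indexed via links"
    if noindex_present:
        return "crawled and kept out of the index"
    return "crawled and eligible for the index"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(index_outcome(True, has_noindex({}, page)))
print(index_outcome(False, has_noindex({}, page)))
```

The second call shows the trap: the page carries Noindex, but because crawling is disallowed the directive has no effect.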
