Robots.txt — Controlling Crawlers

Learn how to use robots.txt to control search engine crawlers. Covers syntax, common directives, testing, and mistakes that can block your pages from ranking.

Intermediate · 8 min read · Updated 04 Mar 2026 · Bukhosi Moyo

The robots.txt file is a simple text file at the root of your website that tells search engine crawlers which pages they are allowed to access and which they should skip. It is one of the oldest web standards (dating back to 1994) and remains a fundamental part of technical SEO.

Quick Answer
  • Robots.txt is a file at yourdomain.com/robots.txt that provides crawl instructions to search engine bots.
  • It controls crawling, not indexing. Disallowing a page does not prevent it from appearing in search results if other signals exist.
  • Use it to block crawling of admin pages, duplicate content, parameter URLs, and resources that waste crawl budget.
  • Never use robots.txt to block pages you want to rank — blocked pages cannot be crawled and may be indexed with incomplete information.
  • Always test changes before deploying — a misplaced rule can accidentally block your entire site.

If you want the full breakdown, continue below.

How Robots.txt Works

When a search engine crawler visits your site, it first checks yourdomain.com/robots.txt for instructions. The file contains rules that tell the crawler:

  • Which paths it may crawl (Allow)
  • Which paths it should skip (Disallow)
  • Which crawlers the rules apply to (User-agent)
  • Where to find the sitemap (Sitemap)

Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://symaxx.co.za/sitemap.xml
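You can exercise a file like the one above with Python's standard-library parser. A caveat worth hedging: `urllib.robotparser` follows the original 1994 convention (rules applied in file order, no `*` wildcards, no Google-style longest-match Allow precedence), so treat it as a sanity check rather than a Googlebot emulator.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, parsed with Python's standard library.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://symaxx.co.za/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "/admin/settings"))  # False: /admin/ is disallowed
print(rp.can_fetch("*", "/blog/some-post"))  # True: no rule matches
print(rp.site_maps())  # ['https://symaxx.co.za/sitemap.xml']
```

`can_fetch(user_agent, path)` answers "may this bot crawl this URL?" under the parsed rules, and `site_maps()` returns any Sitemap references found in the file.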

Key Directives

  • User-agent — specifies which crawler the rules apply to. Example: User-agent: Googlebot
  • Disallow — blocks crawling of a path. Example: Disallow: /admin/
  • Allow — explicitly permits crawling of a path, overriding a broader Disallow. Example: Allow: /admin/public-page
  • Sitemap — points to your XML sitemap. Example: Sitemap: https://example.com/sitemap.xml
  • Crawl-delay — requests a delay, in seconds, between crawl requests (not supported by Google). Example: Crawl-delay: 10
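Most of these directives are readable via Python's `urllib.robotparser`; a quick sketch (the bot name "MyBot" is arbitrary — any bot without its own User-agent group falls back to the `*` group):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
])

# The catch-all "*" group applies to any bot without its own group.
print(rp.crawl_delay("MyBot"))              # 10
print(rp.can_fetch("MyBot", "/private/x"))  # False
```

Bing and some other crawlers honor Crawl-delay; Google ignores it and expects crawl-rate concerns to be handled elsewhere.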

What to Block

Recommended Disallow Rules

# Admin and authentication pages
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/

# API endpoints
Disallow: /api/

# Search results pages
Disallow: /search

# Parameter-based duplicate pages
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Private or staging content
Disallow: /staging/
Disallow: /preview/

# WordPress defaults (if applicable)
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
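A rule set like this can be regression-tested before deployment. Below is a sketch using the standard library; note that `urllib.robotparser` ignores Google-style patterns such as /*?sort=, so the wildcard parameter rules are left out here, and the sample paths are illustrative, not from the original.

```python
from urllib.robotparser import RobotFileParser

# The non-wildcard recommended rules from above.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
Disallow: /api/
Disallow: /search
Disallow: /staging/
Disallow: /preview/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# Private paths should be blocked...
for path in ["/admin/users", "/login", "/api/orders", "/staging/home"]:
    assert not rp.can_fetch("*", path), path

# ...while public content must stay crawlable.
for path in ["/", "/blog/robots-txt-guide", "/products/widget"]:
    assert rp.can_fetch("*", path), path

print("disallow rules behave as expected")
```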

What NOT to Block

Your main content pages. Never disallow pages you want to rank.

CSS and JavaScript files. Google needs access to these to render your pages. Blocking them prevents Google from understanding your page layout and content.

Images you want indexed. If you block image directories, your images will not appear in Google Image Search.

Robots.txt vs Noindex

A critical distinction:

  • Controls: Disallow governs crawling; noindex governs indexing.
  • Effect: Disallow means "do not visit this page"; noindex means "visit the page but do not index it."
  • Removes from index? Disallow: no — the page may still be indexed via external links. Noindex: yes.
  • Best for: Disallow saves crawl budget; noindex removes pages from search results.

If you do not want a page to appear in search results, use noindex, not robots.txt. If you block a page via robots.txt, Google cannot see the noindex tag, and may still index the page based on external signals (like anchor text from links pointing to it).
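For auditing, you can scan a page's HTML for the noindex directive. The helper below is a hypothetical sketch (the class and function names are mine, not a standard API); for non-HTML files, the same directive can instead be sent as an X-Robots-Tag: noindex HTTP response header.

```python
from html.parser import HTMLParser

class NoindexScanner(HTMLParser):
    """Flags <meta name="robots" content="...noindex..."> tags."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        content = (a.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def has_noindex(html: str) -> bool:
    scanner = NoindexScanner()
    scanner.feed(html)
    return scanner.noindex

print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))  # True
print(has_noindex('<head><meta name="robots" content="index, follow"></head>'))      # False
```

Remember the dependency: this tag only works if the page is crawlable, so a page carrying noindex must not also be disallowed in robots.txt.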

Testing Your Robots.txt

Google Search Console — Robots.txt Tester

Google retired its standalone robots.txt Tester in 2023, but Search Console still lets you:

  1. Check the URL Inspection tool for crawl-blocking issues
  2. Use the "Page indexing" report to see whether robots.txt is blocking important pages
  3. Review the robots.txt report (under Settings) for fetch status and parse errors

Manual Testing

  1. Visit yourdomain.com/robots.txt directly in your browser
  2. Verify all rules are correct
  3. Check that no important pages are accidentally blocked
  4. Verify the sitemap reference is correct
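These manual steps can also be scripted. A sketch using only the standard library — `fetch_robots` hits the network, so point it at your own domain; the path list is an assumption you should replace with your real must-crawl URLs:

```python
import urllib.request
from urllib.robotparser import RobotFileParser

def fetch_robots(domain: str) -> str:
    # Step 1: retrieve the live file, as a browser would.
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def check_robots(text: str, important_paths: list[str]) -> dict:
    # Steps 2-4: parse the rules, flag any must-crawl paths that are
    # blocked, and confirm a Sitemap reference is present.
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return {
        "blocked_important": [p for p in important_paths
                              if not rp.can_fetch("*", p)],
        "has_sitemap": rp.site_maps() is not None,
    }
```

Usage: `check_robots(fetch_robots("yourdomain.com"), ["/", "/blog/"])` — an empty blocked_important list plus has_sitemap=True is the result you want.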

Third-Party Tools

  • Screaming Frog — crawls your site and reports robots.txt-blocked pages
  • Ahrefs Site Audit — identifies pages blocked by robots.txt
  • Google PageSpeed Insights — reports if resources are blocked

Common Robots.txt Mistakes

Blocking the entire site.

User-agent: *
Disallow: /

This single line prevents all crawlers from accessing any page. Usually the result of a staging/development configuration accidentally deployed to production.
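A cheap safeguard is a deploy-time check that fails if the production file contains a bare site-wide Disallow. A minimal sketch (the function name is mine; note that staging environments may block everything deliberately, so run this only against production):

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if any rule is a bare 'Disallow: /'.

    Conservative: flags the rule regardless of which
    user-agent group it appears in.
    """
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip() == "/":
            return True
    return False

print(blocks_entire_site("User-agent: *\nDisallow: /"))       # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/")) # False
```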

Blocking CSS and JavaScript.

Disallow: /css/
Disallow: /js/

Google cannot render your page without these resources, leading to indexing issues and poor rankings.

Using robots.txt instead of noindex. Disallowing a page does not remove it from the index. Use noindex for that purpose.

Inconsistent rules. Conflicting Allow and Disallow rules can confuse crawlers. More specific paths override less specific ones, but keep rules clear and non-contradictory.

Forgetting the sitemap reference. Always include Sitemap: https://yourdomain.com/sitemap.xml in your robots.txt.

Case sensitivity. Robots.txt paths are case-sensitive. /Admin/ is different from /admin/.
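The case-sensitivity pitfall is easy to demonstrate with the standard-library parser (a sketch; real crawlers match paths the same case-sensitive way):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])

print(rp.can_fetch("*", "/admin/page"))  # False: matches the rule
print(rp.can_fetch("*", "/Admin/page"))  # True: a different path to a crawler
```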

Key Takeaways

  • Robots.txt controls crawling, not indexing. Use noindex tags to prevent indexing.
  • Block admin pages, API endpoints, parameter URLs, and staging content.
  • Never block your main content pages, CSS, JavaScript, or images.
  • Always test changes before deploying — a wrong rule can block your entire site.
  • Include a sitemap reference in your robots.txt.

Quick Robots.txt Checklist

  • File exists at the root domain (/robots.txt)
  • No critical pages accidentally blocked
  • CSS and JavaScript files are accessible to crawlers
  • Admin, login, and staging pages are blocked
  • Parameter-based duplicate URLs are blocked
  • Sitemap reference included
  • Rules tested before deployment
  • Verified in Google Search Console URL Inspection
  • Reviewed quarterly for accuracy
