Robots.txt — Controlling Crawlers

Learn how to use robots.txt to control search engine crawlers. Covers syntax, common directives, testing, and mistakes that can block your pages from ranking.

Intermediate · 8 min read · Updated 04 Mar 2026 · Bukhosi Moyo

The robots.txt file is a simple text file at the root of your website that tells search engine crawlers which pages they are allowed to access and which they should skip. It is one of the oldest web standards (dating back to 1994) and remains a fundamental part of technical SEO.

Quick Answer
  • Robots.txt is a file at yourdomain.com/robots.txt that provides crawl instructions to search engine bots.
  • It controls crawling, not indexing. Disallowing a page does not prevent it from appearing in search results if other signals exist.
  • Use it to block crawling of admin pages, duplicate content, parameter URLs, and resources that waste crawl budget.
  • Never use robots.txt to block pages you want to rank — blocked pages cannot be crawled and may be indexed with incomplete information.
  • Always test changes before deploying — a misplaced rule can accidentally block your entire site.

If you want the full breakdown, continue below.

How Robots.txt Works

When a search engine crawler visits your site, it first checks yourdomain.com/robots.txt for instructions. The file contains rules that tell the crawler:

  • Which paths it may crawl (Allow)
  • Which paths it should skip (Disallow)
  • Which crawlers the rules apply to (User-agent)
  • Where to find the sitemap (Sitemap)

Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://symaxx.co.za/sitemap.xml
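You can exercise a file like the one above with Python's standard-library parser. A caveat worth hedging: `urllib.robotparser` follows the original 1994 convention (rules applied in file order, no `*` wildcards, no Google-style longest-match Allow precedence), so treat it as a sanity check rather than a Googlebot emulator.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, parsed with Python's standard library.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://symaxx.co.za/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "/admin/settings"))  # False: /admin/ is disallowed
print(rp.can_fetch("*", "/blog/some-post"))  # True: no rule matches
print(rp.site_maps())  # ['https://symaxx.co.za/sitemap.xml']
```

`can_fetch(user_agent, path)` answers "may this bot crawl this URL?" under the parsed rules, and `site_maps()` returns any Sitemap references found in the file.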

Key Directives

  • User-agent — specifies which crawler the rules apply to. Example: User-agent: Googlebot
  • Disallow — blocks crawling of a path. Example: Disallow: /admin/
  • Allow — explicitly permits crawling of a path, overriding a broader Disallow. Example: Allow: /admin/public-page
  • Sitemap — points to your XML sitemap. Example: Sitemap: https://example.com/sitemap.xml
  • Crawl-delay — requests a delay, in seconds, between crawl requests (not supported by Google). Example: Crawl-delay: 10
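Most of these directives are readable via Python's `urllib.robotparser`; a quick sketch (the bot name "MyBot" is arbitrary — any bot without its own User-agent group falls back to the `*` group):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
])

# The catch-all "*" group applies to any bot without its own group.
print(rp.crawl_delay("MyBot"))              # 10
print(rp.can_fetch("MyBot", "/private/x"))  # False
```

Bing and some other crawlers honor Crawl-delay; Google ignores it and expects crawl-rate concerns to be handled elsewhere.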

What to Block

Recommended Disallow Rules

# Admin and authentication pages
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/

# API endpoints
Disallow: /api/

# Search results pages
Disallow: /search

# Parameter-based duplicate pages
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Private or staging content
Disallow: /staging/
Disallow: /preview/

# WordPress defaults (if applicable)
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
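A rule set like this can be regression-tested before deployment. Below is a sketch using the standard library; note that `urllib.robotparser` ignores Google-style patterns such as /*?sort=, so the wildcard parameter rules are left out here, and the sample paths are illustrative, not from the original.

```python
from urllib.robotparser import RobotFileParser

# The non-wildcard recommended rules from above.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
Disallow: /api/
Disallow: /search
Disallow: /staging/
Disallow: /preview/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# Private paths should be blocked...
for path in ["/admin/users", "/login", "/api/orders", "/staging/home"]:
    assert not rp.can_fetch("*", path), path

# ...while public content must stay crawlable.
for path in ["/", "/blog/robots-txt-guide", "/products/widget"]:
    assert rp.can_fetch("*", path), path

print("disallow rules behave as expected")
```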

What NOT to Block

Your main content pages. Never disallow pages you want to rank.

CSS and JavaScript files. Google needs access to these to render your pages. Blocking them prevents Google from understanding your page layout and content.

Images you want indexed. If you block image directories, your images will not appear in Google Image Search.

Robots.txt vs Noindex

A critical distinction:

  • Controls: Disallow governs crawling; noindex governs indexing.
  • Effect: Disallow means "do not visit this page"; noindex means "visit the page but do not index it."
  • Removes from index? Disallow: no — the page may still be indexed via external links. Noindex: yes.
  • Best for: Disallow saves crawl budget; noindex removes pages from search results.

If you do not want a page to appear in search results, use noindex, not robots.txt. If you block a page via robots.txt, Google cannot see the noindex tag, and may still index the page based on external signals (like anchor text from links pointing to it).
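For auditing, you can scan a page's HTML for the noindex directive. The helper below is a hypothetical sketch (the class and function names are mine, not a standard API); for non-HTML files, the same directive can instead be sent as an X-Robots-Tag: noindex HTTP response header.

```python
from html.parser import HTMLParser

class NoindexScanner(HTMLParser):
    """Flags <meta name="robots" content="...noindex..."> tags."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        content = (a.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

def has_noindex(html: str) -> bool:
    scanner = NoindexScanner()
    scanner.feed(html)
    return scanner.noindex

print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))  # True
print(has_noindex('<head><meta name="robots" content="index, follow"></head>'))      # False
```

Remember the dependency: this tag only works if the page is crawlable, so a page carrying noindex must not also be disallowed in robots.txt.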

Testing Your Robots.txt

Google Search Console — Robots.txt Tester

Google retired its standalone robots.txt Tester in 2023, but Search Console still lets you:

  1. Check the URL Inspection tool for crawl-blocking issues
  2. Use the "Page indexing" report to see whether robots.txt is blocking important pages
  3. Review the robots.txt report (under Settings) for fetch status and parse errors

Manual Testing

  1. Visit yourdomain.com/robots.txt directly in your browser
  2. Verify all rules are correct
  3. Check that no important pages are accidentally blocked
  4. Verify the sitemap reference is correct
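These manual steps can also be scripted. A sketch using only the standard library — `fetch_robots` hits the network, so point it at your own domain; the path list is an assumption you should replace with your real must-crawl URLs:

```python
import urllib.request
from urllib.robotparser import RobotFileParser

def fetch_robots(domain: str) -> str:
    # Step 1: retrieve the live file, as a browser would.
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def check_robots(text: str, important_paths: list[str]) -> dict:
    # Steps 2-4: parse the rules, flag any must-crawl paths that are
    # blocked, and confirm a Sitemap reference is present.
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return {
        "blocked_important": [p for p in important_paths
                              if not rp.can_fetch("*", p)],
        "has_sitemap": rp.site_maps() is not None,
    }
```

Usage: `check_robots(fetch_robots("yourdomain.com"), ["/", "/blog/"])` — an empty blocked_important list plus has_sitemap=True is the result you want.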

Third-Party Tools

  • Screaming Frog — crawls your site and reports robots.txt-blocked pages
  • Ahrefs Site Audit — identifies pages blocked by robots.txt
  • Google PageSpeed Insights — reports if resources are blocked

Common Robots.txt Mistakes

Blocking the entire site.

User-agent: *
Disallow: /

This single line prevents all crawlers from accessing any page. Usually the result of a staging/development configuration accidentally deployed to production.
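A cheap safeguard is a deploy-time check that fails if the production file contains a bare site-wide Disallow. A minimal sketch (the function name is mine; note that staging environments may block everything deliberately, so run this only against production):

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if any rule is a bare 'Disallow: /'.

    Conservative: flags the rule regardless of which
    user-agent group it appears in.
    """
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip() == "/":
            return True
    return False

print(blocks_entire_site("User-agent: *\nDisallow: /"))       # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/")) # False
```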

Blocking CSS and JavaScript.

Disallow: /css/
Disallow: /js/

Google cannot render your page without these resources, leading to indexing issues and poor rankings.

Using robots.txt instead of noindex. Disallowing a page does not remove it from the index. Use noindex for that purpose.

Inconsistent rules. Conflicting Allow and Disallow rules can confuse crawlers. More specific paths override less specific ones, but keep rules clear and non-contradictory.

Forgetting the sitemap reference. Always include Sitemap: https://yourdomain.com/sitemap.xml in your robots.txt.

Case sensitivity. Robots.txt paths are case-sensitive. /Admin/ is different from /admin/.
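The case-sensitivity pitfall is easy to demonstrate with the standard-library parser (a sketch; real crawlers match paths the same case-sensitive way):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])

print(rp.can_fetch("*", "/admin/page"))  # False: matches the rule
print(rp.can_fetch("*", "/Admin/page"))  # True: a different path to a crawler
```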

Key Takeaways

  • Robots.txt controls crawling, not indexing. Use noindex tags to prevent indexing.
  • Block admin pages, API endpoints, parameter URLs, and staging content.
  • Never block your main content pages, CSS, JavaScript, or images.
  • Always test changes before deploying — a wrong rule can block your entire site.
  • Include a sitemap reference in your robots.txt.

Quick Robots.txt Checklist

  • File exists at the root domain (/robots.txt)
  • No critical pages accidentally blocked
  • CSS and JavaScript files are accessible to crawlers
  • Admin, login, and staging pages are blocked
  • Parameter-based duplicate URLs are blocked
  • Sitemap reference included
  • Rules tested before deployment
  • Verified in Google Search Console URL Inspection
  • Reviewed quarterly for accuracy
