Robots.txt — Controlling Crawlers
Learn how to use robots.txt to control search engine crawlers. Covers syntax, common directives, testing, and mistakes that can block your pages from ranking.
The robots.txt file is a simple text file at the root of your website that tells search engine crawlers which pages they are allowed to access and which they should skip. It is one of the oldest web standards (dating back to 1994) and remains a fundamental part of technical SEO.
- Robots.txt is a file at yourdomain.com/robots.txt that provides crawl instructions to search engine bots.
- It controls crawling, not indexing. Disallowing a page does not prevent it from appearing in search results if other signals exist.
- Use it to block crawling of admin pages, duplicate content, parameter URLs, and resources that waste crawl budget.
- Never use robots.txt to block pages you want to rank — blocked pages cannot be crawled and may be indexed with incomplete information.
- Always test changes before deploying — a misplaced rule can accidentally block your entire site.
If you want the full breakdown, continue below.
How Robots.txt Works
When a search engine crawler visits your site, it first checks yourdomain.com/robots.txt for instructions. The file contains rules that tell the crawler:
- Which paths it may crawl (Allow)
- Which paths it should skip (Disallow)
- Which crawlers the rules apply to (User-agent)
- Where to find the sitemap (Sitemap)
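You can check these crawl rules programmatically. A minimal sketch using Python's standard-library urllib.robotparser (the rules and example.com URLs here are illustrative placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt (normally fetched from the site root)
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this crawler visit this URL?
print(parser.can_fetch("*", "https://example.com/admin/users"))  # False
print(parser.can_fetch("*", "https://example.com/blog/post"))    # True
```

One caveat: Python's parser applies rules in file order (first match wins), while Google uses longest-match precedence, so results can differ on files that mix Allow and Disallow for overlapping paths.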
Basic Syntax
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://symaxx.co.za/sitemap.xml
Key Directives
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks crawling of a path | Disallow: /admin/ |
| Allow | Explicitly allows crawling (overrides Disallow) | Allow: /admin/public-page |
| Sitemap | Points to your XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Requests a delay between crawl requests (not supported by Google) | Crawl-delay: 10 |
What to Block
Recommended Disallow Rules
# Admin and authentication pages
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
# API endpoints
Disallow: /api/
# Search results pages
Disallow: /search
# Parameter-based duplicate pages
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Private or staging content
Disallow: /staging/
Disallow: /preview/
# WordPress defaults (if applicable)
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
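The parameter rules above rely on `*`, which Google's robots.txt spec treats as matching any sequence of characters (and `$` as an end-of-path anchor). A simplified Python sketch of that wildcard matching, ignoring Allow precedence and percent-encoding:

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern":
    # Per Google's spec: '*' matches any character sequence,
    # '$' anchors the end of the path; everything else is literal.
    pattern = ""
    for ch in rule:
        if ch == "*":
            pattern += ".*"
        elif ch == "$":
            pattern += "$"
        else:
            pattern += re.escape(ch)
    return re.compile(pattern)

def is_blocked(path: str, disallow_rules) -> bool:
    # A path is blocked if any Disallow pattern matches from the start
    return any(rule_to_regex(r).match(path) for r in disallow_rules)

rules = ["/admin/", "/*?sort=", "/*?filter="]
print(is_blocked("/products?sort=price", rules))  # True
print(is_blocked("/products", rules))             # False
print(is_blocked("/admin/users", rules))          # True
```

This is a sketch of the matching semantics, not a full parser; real crawlers also apply Allow rules and precedence on top of this.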
What NOT to Block
Your main content pages. Never disallow pages you want to rank.
CSS and JavaScript files. Google needs access to these to render your pages. Blocking them prevents Google from understanding your page layout and content.
Images you want indexed. If you block image directories, your images will not appear in Google Image Search.
Robots.txt vs Noindex
A critical distinction:
| | Robots.txt (Disallow) | Noindex Meta Tag |
|---|---|---|
| Controls | Crawling | Indexing |
| Effect | "Do not visit this page" | "Visit the page but do not index it" |
| Removes from index? | No — page may still be indexed via external links | Yes — removes from index |
| Best for | Saving crawl budget | Removing pages from search results |
If you want a page to not appear in search results, use noindex, not robots.txt. If you block a page via robots.txt, Google cannot see the noindex tag, and may still index the page based on external signals (like anchor text from links pointing to it).
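For reference, noindex can be applied as an HTML meta tag in the page's head:

```
<meta name="robots" content="noindex">
```

or, for non-HTML resources such as PDFs, as an HTTP response header: `X-Robots-Tag: noindex`. Either way, the page must remain crawlable so Google can see the directive.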
Testing Your Robots.txt
Google Search Console
Google retired its standalone robots.txt Tester, but Search Console now provides a robots.txt report (under Settings) that shows the fetched file and any parsing errors. You can also:
- Check the URL Inspection tool for any crawl blocking issues
- Use the "Page indexing" report to see if robots.txt is blocking important pages
Manual Testing
- Visit yourdomain.com/robots.txt directly in your browser
- Verify all rules are correct
- Check that no important pages are accidentally blocked
- Verify the sitemap reference is correct
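The manual checks above can be turned into a small pre-deploy script. A sketch, again using Python's urllib.robotparser, that fails if any must-rank path would be blocked (the rules and paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

# robots.txt content about to be deployed (read from your repo in practice)
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

# Pages that must stay crawlable
critical_paths = ["/", "/blog/", "/products/", "/about"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = [p for p in critical_paths
           if not parser.can_fetch("*", "https://example.com" + p)]
assert not blocked, f"Deploy aborted: critical paths blocked: {blocked}"
print("robots.txt OK: no critical paths blocked")
```

Run as part of CI so a stray `Disallow: /` never reaches production.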
Third-Party Tools
- Screaming Frog — crawls your site and reports robots.txt-blocked pages
- Ahrefs Site Audit — identifies pages blocked by robots.txt
- Google PageSpeed Insights — reports if resources are blocked
Common Robots.txt Mistakes
Blocking the entire site.
User-agent: *
Disallow: /
This single line prevents all crawlers from accessing any page. Usually the result of a staging/development configuration accidentally deployed to production.
Blocking CSS and JavaScript.
Disallow: /css/
Disallow: /js/
Google cannot render your page without these resources, leading to indexing issues and poor rankings.
Using robots.txt instead of noindex. Disallowing a page does not remove it from the index. Use noindex for that purpose.
Inconsistent rules. Conflicting Allow and Disallow rules can confuse crawlers. More specific paths override less specific ones, but keep rules clear and non-contradictory.
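Google's documented precedence is: the matching rule with the longest pattern wins, and ties go to the least restrictive rule (Allow). A simplified sketch of that logic for literal path prefixes, with no wildcard handling (the function name is hypothetical):

```python
def effective_rule(path: str, rules) -> bool:
    """Return True if `path` is crawlable under Google-style precedence.

    `rules` is a list of (allowance, pattern) pairs, where allowance is
    True for Allow and False for Disallow. Longest matching pattern
    wins; on equal length, Allow wins. No match means allowed.
    """
    best = None  # (pattern_length, allowance)
    for allowance, pattern in rules:
        if path.startswith(pattern):
            if (best is None or len(pattern) > best[0]
                    or (len(pattern) == best[0] and allowance)):
                best = (len(pattern), allowance)
    return best[1] if best else True

rules = [(False, "/admin/"), (True, "/admin/public-page")]
print(effective_rule("/admin/public-page", rules))  # True: longer Allow wins
print(effective_rule("/admin/settings", rules))     # False: Disallow applies
```

This illustrates why an Allow rule with a more specific path can carve an exception out of a broader Disallow, as in the /wp-admin/admin-ajax.php example above.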
Forgetting the sitemap reference. Always include Sitemap: https://yourdomain.com/sitemap.xml in your robots.txt.
Case sensitivity. Robots.txt paths are case-sensitive. /Admin/ is different from /admin/.
Key Takeaways
- Robots.txt controls crawling, not indexing. Use noindex tags to prevent indexing.
- Block admin pages, API endpoints, parameter URLs, and staging content.
- Never block your main content pages, CSS, JavaScript, or images.
- Always test changes before deploying — a wrong rule can block your entire site.
- Include a sitemap reference in your robots.txt.
Quick Robots.txt Checklist
- File exists at the root domain (/robots.txt)
- No critical pages accidentally blocked
- CSS and JavaScript files are accessible to crawlers
- Admin, login, and staging pages are blocked
- Parameter-based duplicate URLs are blocked
- Sitemap reference included
- Rules tested before deployment
- Verified in Google Search Console URL Inspection
- Reviewed quarterly for accuracy