Tyzo
Technical SEO Guide

Robots.txt Complete Guide: Control Search Engine Crawling

Master robots.txt to control which pages search engines can crawl, save crawl budget, and protect sensitive content.

10 min read
Intermediate Level
Updated for 2024

Chapter 1: What is Robots.txt?

Robots.txt is a text file placed in the root directory of your website that tells search engine crawlers which pages or sections of your site they should or shouldn't crawl.

File Location:
https://yourdomain.com/robots.txt

Must be in the root directory and named exactly "robots.txt" (case-sensitive).
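
Because the file always sits at the root of the host, its address can be derived from any page URL on the same site. The following is a minimal Python sketch using only the standard library; the helper name `robots_txt_url` is hypothetical and chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://yourdomain.com/blog/post?id=7"))
# https://yourdomain.com/robots.txt
```

Note that the host is part of the result: a subdomain such as blog.yourdomain.com is governed by its own robots.txt, not by the one on the main domain.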

50%+ of websites use robots.txt incorrectly
2-4 weeks: crawl budget savings with proper robots.txt

Important Distinction:

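Robots.txt blocks crawling, not indexing. If your page is blocked by robots.txt but linked from other sites, it might still appear in search results (without a description). To block indexing, use the noindex meta tag on a page that crawlers are allowed to fetch (if the page is blocked in robots.txt, the crawler never sees the tag), or use password protection.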

Chapter 2: How Robots.txt Works

Before crawling your site, search engines first check your robots.txt file to see which URLs they're allowed to crawl.

Step 1: Search engine wants to crawl https://example.com/page
Step 2: Checks https://example.com/robots.txt
Step 3: If allowed → crawl the page
Step 4: If disallowed → don't crawl the page

Simple Robots.txt Example:
# Allow all crawlers to access everything
User-agent: *
Disallow:

# Or block all crawlers from everything
User-agent: *
Disallow: /

# Block specific folder
User-agent: *
Disallow: /admin/

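
The check a well-behaved crawler performs before requesting a page can be sketched with Python's standard `urllib.robotparser` module. This is an illustration only: the bot name `ExampleBot` and the helper `may_crawl` are hypothetical, and the rules are parsed from a string so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/page"))         # True
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://example.com/admin/users"))  # False
```

A real crawler would download the file first, for example with the parser's `set_url()` and `read()` methods, instead of using a hard-coded string.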
Which Search Engines Respect Robots.txt?
  • ✓ Googlebot (Google) - Yes
  • ✓ Bingbot (Bing) - Yes
  • ✓ Slurp (Yahoo) - Yes
  • ✓ DuckDuckBot (DuckDuckGo) - Yes
  • ✓ Baiduspider (Baidu) - Yes
  • ✓ Yandex Bot - Yes
  • ✗ Malicious bots - Often ignore robots.txt

Chapter 3: Robots.txt Syntax & Directives

User-agent: Specifies which search engine bot the rule applies to.
User-agent: *           # Applies to all bots
User-agent: Googlebot   # Applies only to Google
User-agent: Bingbot     # Applies only to Bing

Disallow: Tells bots NOT to crawl specific URLs or directories.
Disallow: /                   # Block entire site
Disallow: /admin/             # Block /admin/ folder
Disallow: /secret-page.html   # Block specific page

Allow: Overrides a Disallow rule for a specific path.
Allow: /public/
Disallow: /   # Block everything except /public/

Sitemap: Tells bots where to find your XML sitemap.
Sitemap: https://example.com/sitemap.xml

Crawl-delay: Slows down bot requests (not supported by Google).
Crawl-delay: 10   # Wait 10 seconds between requests

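Note: Google ignores Crawl-delay. Googlebot adjusts its crawl rate automatically, and the crawl rate setting in Google Search Console was retired in early 2024. If Googlebot is overloading your server, temporarily returning 503 or 429 responses slows it down.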
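
To see how a crawler might read the Crawl-delay and Sitemap directives, here is a small Python sketch using the standard `urllib.robotparser` module. The bot name `ExampleBot` is hypothetical, and `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl-delay belongs to a user-agent group; Sitemap applies to the whole file.
print(parser.crawl_delay("ExampleBot"))  # 10
print(parser.site_maps())                # ['https://example.com/sitemap.xml']
```

Whether a bot actually waits the requested number of seconds is up to the bot; the parser only reports the value.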

Syntax Rules:

  • Each directive on its own line
  • No space before the colon (one space after it is conventional, as in the examples above)
  • Use # for comments
  • Case-sensitive paths
  • Trailing slash matters (/admin vs /admin/)

Chapter 4: Common User Agents

User Agent        Search Engine               When to Use
*                 All bots                    General rules for everyone
Googlebot         Google (main web crawler)   Google-specific rules
Googlebot-Image   Google Image Search         Block images specifically
Googlebot-Video   Google Video Search         Block videos specifically
Googlebot-News    Google News                 Google News specific rules
Bingbot           Bing                        Bing-specific rules
Slurp             Yahoo                       Yahoo-specific rules
DuckDuckBot       DuckDuckGo                  DuckDuckGo-specific rules
Baiduspider       Baidu (China)               Baidu-specific rules
YandexBot         Yandex (Russia)             Yandex-specific rules
AhrefsBot         Ahrefs SEO tool             Block SEO tools from crawling
SemrushBot        Semrush SEO tool            Block SEO tools from crawling

# Example: Different rules for different bots
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Disallow: /temp/
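
One consequence of per-bot groups is easy to miss: a bot that finds a group naming it follows only that group and skips the `*` rules. The Python sketch below illustrates this with the standard `urllib.robotparser` module. It uses a simplified version of the example above (the Allow line is left out, because this parser resolves Allow/Disallow conflicts by rule order rather than by the longest-match rule Google documents), and `ExampleBot` is a hypothetical bot name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Each bot is matched against the most specific group that names it;
# bots without a group of their own fall back to "*".
for agent in ("Googlebot", "Bingbot", "ExampleBot"):
    for path in ("/private/a", "/admin/a", "/temp/a"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:<11} {path:<11} {verdict}")
```

In this sketch Googlebot may crawl /private/a, because its own group does not mention /private/. If a rule should apply to every bot, it has to be repeated in each named group.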

Chapter 5: Allow vs Disallow Directives

The Allow directive (not supported by all bots) can override a broader Disallow rule.

# Block everything except the /public/ folder
User-agent: *
Disallow: /
Allow: /public/

# Block an entire directory except one file
User-agent: *
Disallow: /private/
Allow: /private/index.html

# Block a specific file type
User-agent: *
Disallow: /*.pdf$
Pattern Matching Examples:
Disallow: /admin       # Blocks /admin, /admin/, /admin123
Disallow: /admin/      # Blocks /admin/ and subfolders
Disallow: /*.jpg$      # Blocks all JPG images
Allow: /public/*.css   # Allows CSS files in /public/
Note:

Not all search engines support the Allow directive. Google and Bing do. For maximum compatibility, structure your disallows to not need Allow directives.
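
How an Allow rule overrides a broader Disallow, and how `*` and `$` match, can be illustrated with a short Python sketch. It assumes the precedence that Google documents and that RFC 9309 describes: the longest matching pattern wins, and Allow wins a tie. The function names `pattern_to_regex` and `is_allowed` are hypothetical, and this is a simplified model, not any search engine's actual implementation.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")          # trailing $ means "URL ends here"
    body = pattern[:-1] if anchored else pattern
    # "*" matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs from one user-agent group."""
    best_len, allowed = -1, True              # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:                       # an empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

rules = [("Disallow", "/"), ("Allow", "/public/")]
print(is_allowed(rules, "/public/page.html"))                   # True
print(is_allowed(rules, "/private/page.html"))                  # False
print(is_allowed([("Disallow", "/*.pdf$")], "/docs/file.pdf"))  # False
print(is_allowed([("Disallow", "/admin")], "/admin123"))        # False
print(is_allowed([("Disallow", "/admin/")], "/admin"))          # True
```

Under this model the order of the Allow and Disallow lines does not matter. Parsers that evaluate rules in file order instead can give a different answer, which is one more reason to keep rules simple.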

Chapter 6: Adding Sitemap to Robots.txt

Adding your sitemap location to robots.txt helps search engines discover all your important pages.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-index.xml

# Multiple sitemaps
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-products.xml
Benefits of Sitemap in Robots.txt:
  • ✓ Google finds your sitemap automatically
  • ✓ No need to submit manually in Google Search Console
  • ✓ Helps bots discover all your content
  • ✓ Especially useful for large websites
Complete Robots.txt with Sitemap:
# Allow all crawlers everywhere except the admin areas
# (one group per user-agent: some parsers read only the first "User-agent: *" group)
User-agent: *
Disallow: /wp-admin/
Disallow: /admin/

# Tell bots where sitemap is
Sitemap: https://example.com/sitemap.xml

Chapter 7: Common Robots.txt Use Cases

🔧 1. Block WordPress Admin Area
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php

🔧 2. Block Duplicate Content (Search Results)
User-agent: *
Disallow: /search
Disallow: /*?s=
Disallow: /*?filter=

🔧 3. Block Temporary Maintenance Page
User-agent: *
Disallow: /maintenance.html

🔧 4. Block Specific File Types
User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$
Disallow: /*.mp4$

🔧 5. Block During Development
User-agent: *
Disallow: /

Warning: Only use this for development sites. On live sites, this blocks ALL crawling!

🔧 6. SEO Tools Blocking (Save Bandwidth)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Chapter 8: 10 Common Robots.txt Mistakes

  • ❌ Accidentally blocking your entire site - Disallow: / on a live site blocks all search engines. Very common and disastrous mistake.
  • ❌ Using robots.txt for security - Robots.txt is publicly visible. Anyone can see what you're blocking. Use password protection for sensitive content.
  • ❌ Blocking CSS/JS files - Google needs CSS and JS to render your page properly. Don't block these files.
  • ❌ Multiple Disallow lines for same path - Inefficient and harder to maintain.
  • ❌ Missing trailing slashes - Disallow: /admin is different from Disallow: /admin/
  • ❌ Case sensitivity issues - /Admin is different from /admin. Be consistent.
  • ❌ Blocking Googlebot but allowing others - For most sites, Google drives the largest share of search traffic. Usually not what you want.
  • ❌ Not adding sitemap URL - You're making it harder for search engines to find your content.
  • ❌ Using robots.txt to block pages you want noindexed - Use noindex meta tag instead. Blocked pages can still appear in results.
  • ❌ Not testing after changes - Always validate your robots.txt before deploying.
Worst Mistake:

This robots.txt on a LIVE production site:

User-agent: *
Disallow: /

Result: Your site disappears from Google for weeks or months. It can take a long time to recover after fixing.
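
A simple pre-deployment guard can catch this mistake before it reaches production. The sketch below is one possible approach in Python using the standard `urllib.robotparser` module; the function name `blocks_entire_site` is hypothetical, and how you wire it into a deploy pipeline is left open.

```python
from urllib.robotparser import RobotFileParser

def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if user_agent is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")

print(blocks_entire_site("User-agent: *\nDisallow: /"))         # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/"))   # False
print(blocks_entire_site("User-agent: *\nDisallow:"))           # False
```

A deploy script could refuse to publish a production robots.txt whenever this function returns True.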

Chapter 9: Testing Your Robots.txt

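Method 1: Google Search Console

  1. Log into Google Search Console
  2. Go to "Settings" → "robots.txt" to open the robots.txt report (it replaced the legacy Robots.txt Tester, which Google retired in late 2023)
  3. Check that the file was fetched successfully and review any warnings or errors
  4. After updating the file, request a recrawl from the report

Method 2: URL Inspection

Use the "URL Inspection" tool (the successor to "Fetch as Google") to see how Googlebot sees a specific URL and whether robots.txt blocks it.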

Method 3: Direct Browser Access

https://yourdomain.com/robots.txt

View your robots.txt directly in any browser to verify it's accessible.

Check Your Robots.txt:

Use our Robots.txt Generator to create and validate your robots.txt file before deploying.

Cache Warning:

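Google generally caches robots.txt for up to 24 hours, so an updated file is usually picked up within a day, though it can take longer. The effect on search results can lag by days or weeks while the affected URLs are recrawled. Use Google Search Console to request a recrawl after major changes.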
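
Testing after every change can also be automated. The Python sketch below compares a robots.txt file against a list of expectations (paths that must stay crawlable and paths that must stay blocked) and reports any mismatch. The function name `find_violations` and the sample paths are hypothetical. It relies on the standard `urllib.robotparser` module, which matches plain path prefixes and does not interpret the `*` and `$` wildcards, so it suits files that use simple prefix rules only.

```python
from urllib.robotparser import RobotFileParser

def find_violations(robots_txt: str, expectations: dict, user_agent: str = "*"):
    """Return (path, expected, actual) for every path that behaves unexpectedly."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for path, expected in expectations.items():
        actual = parser.can_fetch(user_agent, path)
        if actual != expected:
            violations.append((path, expected, actual))
    return violations

EXPECTATIONS = {
    "/": True,                       # home page must be crawlable
    "/blog/post": True,              # content must be crawlable
    "/wp-admin/options.php": False,  # admin area must be blocked
    "/search?q=shoes": False,        # internal search must be blocked
}

good = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /search\n"
bad = "User-agent: *\nDisallow: /\n"

print(find_violations(good, EXPECTATIONS))  # []
print(find_violations(bad, EXPECTATIONS))   # [('/', True, False), ('/blog/post', True, False)]
```

Keeping the expectations next to the robots.txt file in version control turns "always validate before deploying" into a check that runs on every change.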

Chapter 10: Robots.txt Generator & Tools

Additional Testing Resources

๐Ÿ” Google Search Console

Official robots.txt tester and validator

๐Ÿ•ท๏ธ Screaming Frog

Crawl your site and verify robots.txt rules are working

๐ŸŒ robots-txt.com

Online robots.txt validator and checker

๐Ÿ“Š Bing Webmaster Tools

Bing's robots.txt testing tool

Robots.txt Cheat Sheet

✅ DO:
  • Test before deploying
  • Add sitemap location
  • Block duplicate content (search results)
  • Block admin areas
  • Use specific user-agents when needed
  • Keep it simple and maintainable
โŒ DON'T:
  • Block the entire live site
  • Use robots.txt for security
  • Block CSS/JS files
  • Use it to noindex pages
  • Forget to test after changes
  • Make overly complex rules
๐Ÿ“ Template:
# Basic template User-agent: * Disallow: /wp-admin/ Disallow: /search Sitemap: https://example.com/sitemap.xml
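
A file like the template above can also be produced programmatically, which helps keep it consistent across environments. The following Python sketch shows one way to do that; the function `build_robots_txt` and its parameters are hypothetical and handle only a single user-agent group.

```python
def build_robots_txt(disallow, sitemaps=(), user_agent="*"):
    """Build robots.txt text with one user-agent group and optional sitemaps."""
    lines = [f"User-agent: {user_agent}"]
    if disallow:
        lines += [f"Disallow: {path}" for path in disallow]
    else:
        lines.append("Disallow:")  # empty value: nothing is blocked
    if sitemaps:
        lines.append("")           # blank line separates the group from sitemaps
        lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"

print(build_robots_txt(["/wp-admin/", "/search"],
                       ["https://example.com/sitemap.xml"]))
```

Calling it with an empty list of paths yields a file that allows everything, which matches the first example in Chapter 2.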

Frequently Asked Questions

Does robots.txt block indexing?
No. Robots.txt blocks crawling, not indexing. If your page is blocked but linked from other sites, it may still appear in search results (without a description). Use the noindex meta tag to block indexing.
How long does it take for robots.txt changes to take effect?
Google generally refreshes its cached copy of robots.txt within about 24 hours, but the effect on search results can take days or weeks as the affected URLs are recrawled. Use Google Search Console to request a recrawl after major changes.
Where should robots.txt be located?
Robots.txt must be placed in the root directory of your website. Example: https://example.com/robots.txt. It won't work in subdirectories.
What happens if I don't have a robots.txt file?
Nothing bad. Search engines assume they can crawl all pages. However, you miss opportunities to save crawl budget and block duplicate content.
Will robots.txt stop Google from seeing my page?
No. If your page is linked from other websites, Google might still discover it and show it in search results (without a snippet). Use password protection or noindex for true blocking.
What is crawl budget and how does robots.txt affect it?
Crawl budget is the number of pages a search engine will crawl on your site in a given period. Robots.txt helps preserve crawl budget by blocking unimportant pages (admin, search results), allowing Google to focus on valuable content.

Ready to create your robots.txt?

Use our free Robots.txt Generator to create a valid file in seconds.
