Technical SEO Guide

Robots.txt Complete Guide: Control Search Engine Crawling

Master robots.txt to control which pages search engines can crawl, save crawl budget, and protect sensitive content.

10 min read

Intermediate Level

Updated for 2024

What You'll Learn in This Guide

📖 Part 1-5

What is Robots.txt?
How Robots.txt Works
Robots.txt Syntax & Directives
User Agents Explained
Allow vs Disallow Directives

Chapter 1: What is Robots.txt?

Robots.txt is a text file placed in the root directory of your website that tells search engine crawlers which pages or sections of your site they should or shouldn't crawl.

File Location:

https://yourdomain.com/robots.txt

Must be in the root directory and named exactly "robots.txt" (case-sensitive).
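
As a minimal sketch of the location rule, the Python snippet below derives the robots.txt URL that governs a given page. The function name `robots_txt_url` and the sample URLs are illustrative assumptions. It also assumes the usual convention that each host, subdomains included, serves its own robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://yourdomain.com/blog/post?id=7"))
print(robots_txt_url("https://shop.yourdomain.com/cart"))
```

The second call shows that a subdomain resolves to its own file rather than the one on the main domain.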

50%+ of websites use robots.txt incorrectly

2-4 weeks crawl budget savings with proper robots.txt

Important Distinction:

Robots.txt blocks crawling, not indexing. If a blocked page is linked from other sites, it can still appear in search results (without a description). To keep a page out of the index, use a noindex meta tag or password protection, and leave the page crawlable: a crawler that is blocked by robots.txt never sees the noindex tag.

Chapter 2: How Robots.txt Works

Before crawling your site, search engines first check your robots.txt file to see which URLs they're allowed to crawl.

Step 1: Search engine wants to crawl https://example.com/page
Step 2: Checks https://example.com/robots.txt
Step 3: If allowed → crawl the page
Step 4: If disallowed → don't crawl the page
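
The four steps above can be sketched with Python's standard-library `urllib.robotparser`. This is an illustration, not how any particular search engine implements the check. The bot name `ExampleBot` and the helper `is_allowed` are assumptions, and the standard-library parser uses simpler matching than Google does.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, supplied as text so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse the robots.txt text, then ask whether user_agent may crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/admin/users"))
```

The first URL is allowed and the second is refused, because only the second starts with the disallowed `/admin/` prefix.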
        

Simple Robots.txt Example:

# Allow all crawlers to access everything
User-agent: *
Disallow:

# Or block all crawlers from everything
User-agent: *
Disallow: /

# Block specific folder
User-agent: *
Disallow: /admin/
            

Which Search Engines Respect Robots.txt?

✓ Googlebot (Google) - Yes
✓ Bingbot (Bing) - Yes
✓ Slurp (Yahoo) - Yes
✓ DuckDuckBot (DuckDuckGo) - Yes
✓ Baiduspider (Baidu) - Yes
✓ YandexBot (Yandex) - Yes
✗ Malicious bots - Often ignore robots.txt

        

Chapter 3: Robots.txt Syntax & Directives

User-agent: Specifies which search engine bot the rule applies to.

User-agent: *        # Applies to all bots
User-agent: Googlebot # Applies only to Google
User-agent: Bingbot  # Applies only to Bing
            

Disallow: Tells bots NOT to crawl specific URLs or directories.

Disallow: /          # Block entire site
Disallow: /admin/    # Block /admin/ folder
Disallow: /secret-page.html # Block specific page
            

Allow: Overrides a Disallow rule for a specific path.

Allow: /public/
Disallow: /           # Block everything except /public/
            

Sitemap: Tells bots where to find your XML sitemap.

Sitemap: https://example.com/sitemap.xml

Crawl-delay: Slows down bot requests (not supported by Google).

Crawl-delay: 10 # Wait 10 seconds between requests

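Note: Google ignores Crawl-delay. The crawl rate setting in Google Search Console was retired in January 2024; to slow Googlebot temporarily, have your server return 503 or 429 responses.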
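
For crawlers that do honor Crawl-delay, the value can be read with `crawl_delay()` from Python's standard-library `urllib.robotparser`. This is a minimal sketch. The helper `delay_for` and its one-second fallback are assumptions, not part of any crawler's documented behavior.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def delay_for(user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = parser.crawl_delay(user_agent)   # None when no Crawl-delay applies
    return float(delay) if delay is not None else default

print(delay_for("Bingbot"))
print(delay_for("OtherBot"))
```

A polite crawler would sleep for `delay_for(agent)` seconds between requests to the same host.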

Syntax Rules:

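• Put each directive on its own line
• Write the directive name, a colon, then the value (a single space after the colon is conventional, as in the examples above)
• Use # for comments
• Paths are case-sensitive; directive names are not
• Trailing slashes matter (/admin vs /admin/)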
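
To make the syntax rules concrete, here is a small Python sketch of how a single line might be split into a directive and a value. The function name `parse_line` is hypothetical, and real crawlers handle more edge cases than this does.

```python
def parse_line(raw: str):
    """Split one robots.txt line into (directive, value), or None."""
    line = raw.split("#", 1)[0].strip()      # drop comments and padding
    if ":" not in line:
        return None                          # blank or malformed line
    directive, value = line.split(":", 1)    # split on the first colon only
    # Directive names are case-insensitive; the value keeps its case
    # because paths are case-sensitive.
    return directive.strip().lower(), value.strip()

print(parse_line("User-agent: *        # Applies to all bots"))
print(parse_line("Disallow: /Admin/"))
print(parse_line("Sitemap: https://example.com/sitemap.xml"))
print(parse_line("# just a comment"))
```

Splitting on the first colon only is what keeps the `https://` in a Sitemap URL intact.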

Chapter 4: Common User Agents

User Agent	Search Engine	When to Use
*	All bots	General rules for everyone
Googlebot	Google (Search crawlers)	Google-specific rules
Googlebot-Image	Google Image Search	Block images specifically
Googlebot-Video	Google Video Search	Block videos specifically
Googlebot-News	Google News	Google News specific rules
Bingbot	Bing	Bing-specific rules
Slurp	Yahoo	Yahoo-specific rules
DuckDuckBot	DuckDuckGo	DuckDuckGo-specific rules
Baiduspider	Baidu (China)	Baidu-specific rules
YandexBot	Yandex (Russia)	Yandex-specific rules
AhrefsBot	Ahrefs SEO tool	Block SEO tools from crawling
SemrushBot	Semrush SEO tool	Block SEO tools from crawling

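# Example: Different rules for different bots
# Note: a bot obeys only the most specific group that names it, so
# Googlebot and Bingbot ignore the * group below. Repeat
# Disallow: /private/ in their groups if it should apply to them too.
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Disallow: /temp/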
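
One consequence of per-bot groups is easy to miss: a crawler that finds a group naming it follows that group only. The Python sketch below illustrates this with the standard-library `urllib.robotparser`, using the three groups from the example above. The bot name `SomeOtherBot` and the test URL are assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/private/report.html"
# Map each bot to whether it may crawl the /private/ URL.
results = {agent: parser.can_fetch(agent, url)
           for agent in ("Googlebot", "Bingbot", "SomeOtherBot")}
print(results)
```

Only `SomeOtherBot` falls back to the `*` group and is blocked from `/private/`. Googlebot and Bingbot use their own groups, which say nothing about that path.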
        

Chapter 5: Allow vs Disallow Directives

The Allow directive (not supported by all bots) can override a broader Disallow rule.

# Block everything except the /public/ folder
User-agent: *
Disallow: /
Allow: /public/

# Block an entire directory except one file
User-agent: *
Disallow: /private/
Allow: /private/index.html

# Block a specific file type
User-agent: *
Disallow: /*.pdf$
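
When an Allow and a Disallow rule both match a URL, Google and the Robots Exclusion Protocol standard pick the most specific (longest) rule, and Allow wins a tie. The Python sketch below shows that precedence for plain prefixes only. It ignores the `*` and `$` wildcards, and the function name `is_allowed` is illustrative.

```python
def is_allowed(rules, path):
    """rules: list of (directive, prefix) pairs from one user-agent group."""
    best_len, allowed = -1, True          # no matching rule means allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue                      # empty or non-matching rule
        is_allow = directive.lower() == "allow"
        # Longest prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed

private_rules = [("Disallow", "/private/"), ("Allow", "/private/index.html")]
print(is_allowed(private_rules, "/private/index.html"))
print(is_allowed(private_rules, "/private/notes.html"))

public_rules = [("Disallow", "/"), ("Allow", "/public/")]
print(is_allowed(public_rules, "/public/logo.png"))
print(is_allowed(public_rules, "/about"))
```

Because precedence depends on rule length rather than position, the order of the Allow and Disallow lines within a group does not change the outcome under this model. Crawlers that apply the first matching rule instead may behave differently.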
        

Pattern Matching Examples:

Disallow: /admin        # Blocks /admin, /admin/, /admin123
Disallow: /admin/       # Blocks /admin/ and subfolders
Disallow: /*.jpg$       # Blocks all JPG images
Allow: /public/*.css    # Allows CSS files in /public/
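
The `*` and `$` wildcards can be understood by translating a pattern into a regular expression. The following Python sketch does that for the patterns above. The helper names `pattern_to_regex` and `matches` are assumptions, and this models the matching rules rather than any crawler's actual code.

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")      # a trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then turn every * into "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    return pattern_to_regex(pattern).search(path) is not None

for pattern, path in [
    ("/admin", "/admin123"),
    ("/admin/", "/admin"),
    ("/*.jpg$", "/img/photo.jpg"),
    ("/*.jpg$", "/img/photo.jpg?size=large"),
    ("/public/*.css", "/public/theme/site.css"),
]:
    print(pattern, path, matches(pattern, path))
```

Note that `/*.jpg$` stops matching once a query string follows the extension, because `$` requires the URL to end there.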
            

Note:

Not all search engines support the Allow directive. Google and Bing do. For maximum compatibility, structure your disallows to not need Allow directives.

Chapter 6: Adding Sitemap to Robots.txt

Adding your sitemap location to robots.txt helps search engines discover all your important pages.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-index.xml

# Multiple sitemaps
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-products.xml
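
As a rough illustration of how a crawler might discover these entries, Python's standard-library `urllib.robotparser` exposes them through `site_maps()` (available from Python 3.8). The sample file content is an assumption.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file lists no sitemaps.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```

Sitemap lines are independent of the User-agent groups, so they are found wherever they appear in the file.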
        

Benefits of Sitemap in Robots.txt:

✓ Google finds your sitemap automatically
✓ No need to submit manually in Google Search Console
✓ Helps bots discover all your content
✓ Especially useful for large websites

        

Complete Robots.txt with Sitemap:

# Rules for all crawlers: everything is crawlable except the admin areas
User-agent: *
Disallow: /wp-admin/
Disallow: /admin/

# Tell bots where sitemap is
Sitemap: https://example.com/sitemap.xml
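
A file like this can also be generated rather than written by hand. The Python sketch below is one possible approach, with a hypothetical function name `build_robots_txt`. Keying the rules by user agent means each bot gets exactly one group, which avoids conflicting duplicate groups.

```python
def build_robots_txt(groups, sitemaps=()):
    """groups: mapping of user-agent -> list of (directive, path) rules."""
    lines = []
    for agent, rules in groups.items():
        lines.append(f"User-agent: {agent}")
        lines.extend(f"{directive}: {path}" for directive, path in rules)
        lines.append("")                  # blank line between groups
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    return "\n".join(lines) + "\n"

text = build_robots_txt(
    {"*": [("Disallow", "/wp-admin/"), ("Disallow", "/admin/")]},
    sitemaps=["https://example.com/sitemap.xml"],
)
print(text)
```

The output is a single `User-agent: *` group with both Disallow rules, followed by the Sitemap line.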
            

Chapter 7: Common Robots.txt Use Cases

🔧 1. Block WordPress Admin Area

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
            

🔧 2. Block Duplicate Content (Search Results)

User-agent: *
Disallow: /search
Disallow: /*?s=       # Matches only when s is the first query parameter
Disallow: /*?filter=  # Add /*&filter= to also catch later positions
            

🔧 3. Block Temporary Maintenance Page

User-agent: *
Disallow: /maintenance.html
            

🔧 4. Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$
Disallow: /*.mp4$
            

🔧 5. Block During Development

User-agent: *
Disallow: /
            

Warning: Only use this for development sites. On live sites, this blocks ALL crawling!

🔧 6. SEO Tools Blocking (Save Bandwidth)

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /
            

Chapter 8: 10 Common Robots.txt Mistakes

❌ Accidentally blocking your entire site - Disallow: / on a live site blocks all search engines. Very common and disastrous mistake.
❌ Using robots.txt for security - Robots.txt is publicly visible. Anyone can see what you're blocking. Use password protection for sensitive content.
❌ Blocking CSS/JS files - Google needs CSS and JS to render your page properly. Don't block these files.
❌ Multiple Disallow lines for same path - Inefficient and harder to maintain.
❌ Missing trailing slashes - Disallow: /admin is different from Disallow: /admin/
❌ Case sensitivity issues - /Admin is different from /admin. Be consistent.
❌ Blocking Googlebot but allowing others - Google usually drives most search traffic, so this is rarely what you want.
❌ Not adding sitemap URL - You're making it harder for search engines to find your content.
❌ Using robots.txt to block pages you want noindexed - Use a noindex meta tag instead and leave the page crawlable so the tag can be read. Blocked pages can still appear in results.
❌ Not testing after changes - Always validate your robots.txt before deploying.

Worst Mistake:

This robots.txt on a LIVE production site:

User-agent: *
Disallow: /
            

Result: Your site disappears from Google for weeks or months. It can take a long time to recover after fixing.
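
A small pre-deploy check can catch this mistake before it reaches production. The Python sketch below flags a bare `Disallow: /` inside a `User-agent: *` group. The function name `blocks_entire_site` is hypothetical, and the check is deliberately simple: it does not account for Allow rules that might reopen parts of the site.

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """True if the `User-agent: *` group contains a bare `Disallow: /`."""
    agents, in_rules, blocked = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # ignore comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                blocked = True
    return blocked

print(blocks_entire_site("User-agent: *\nDisallow: /\n"))
print(blocks_entire_site("User-agent: *\nDisallow: /admin/\n"))
print(blocks_entire_site("User-agent: AhrefsBot\nDisallow: /\n"))
```

Run as part of a deployment script, a `True` result for the production file is a signal to stop and review before publishing.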

Chapter 9: Testing Your Robots.txt

Method 1: Google Search Console

Log into Google Search Console
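Go to "Settings" → "robots.txt" and open the report (it replaced the standalone Robots.txt Tester in late 2023)
Check the fetch status and the date Google last crawled the file
Review any warnings or errors reported for the file
Request a recrawl after you update robots.txt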

Method 2: URL Inspection (formerly Fetch as Google)

Use the "URL Inspection" tool in Google Search Console to see how Googlebot sees a specific URL and whether robots.txt blocks it.

Method 3: Direct Browser Access

https://yourdomain.com/robots.txt

View your robots.txt directly in any browser to verify it's accessible.

Check Your Robots.txt:

Use our Robots.txt Generator to create and validate your robots.txt file before deploying.

Cache Warning:

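Google generally caches robots.txt for up to 24 hours, so changes are not picked up immediately, and it can take days or weeks for the affected pages to be recrawled. Use the robots.txt report in Google Search Console to request a recrawl of the file after major changes.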

Chapter 10: Robots.txt Generator & Tools

Robots.txt Generator

Generate a valid robots.txt file with common directives in seconds

Instant Generation

Robots.txt Tester

Test your robots.txt file before deploying to production

Validate Syntax

Additional Testing Resources

🔍 Google Search Console

Official robots.txt report and URL Inspection tool

🕷️ Screaming Frog

Crawl your site and verify robots.txt rules are working

🌐 robots-txt.com

Online robots.txt validator and checker

📊 Bing Webmaster Tools

Bing's robots.txt testing tool

Robots.txt Cheat Sheet

✅ DO:

Test before deploying
Add sitemap location
Block duplicate content (search results)
Block admin areas
Use specific user-agents when needed
Keep it simple and maintainable

            

❌ DON'T:

Block the entire live site
Use robots.txt for security
Block CSS/JS files
Use it to noindex pages
Forget to test after changes
Make overly complex rules

            

 📝 Template:
# Basic template
User-agent: *
Disallow: /wp-admin/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
                

Frequently Asked Questions

Does robots.txt block indexing?

No. Robots.txt blocks crawling, not indexing. If your page is blocked but linked from other sites, it may still appear in search results (without a description). Use the noindex meta tag to block indexing.

How long does it take for robots.txt changes to take effect?

Google generally refreshes its cached copy of robots.txt within about 24 hours, but the effect on search results can take days or weeks as pages are recrawled. Use the robots.txt report in Google Search Console to request a recrawl of the file after major changes.

Where should robots.txt be located?

Robots.txt must be placed in the root directory of your website. Example: https://example.com/robots.txt. It won't work in subdirectories.

What happens if I don't have a robots.txt file?

Nothing bad. Search engines assume they can crawl all pages. However, you miss opportunities to save crawl budget and block duplicate content.

Will robots.txt stop Google from seeing my page?

No. If your page is linked from other websites, Google might still discover it and show it in search results (without a snippet). Use password protection or noindex for true blocking.

What is crawl budget and how does robots.txt affect it?

Crawl budget is the number of pages a search engine is willing and able to crawl on your site in a given period. Robots.txt helps preserve crawl budget by blocking unimportant pages (admin, search results), allowing Google to focus on valuable content.

Ready to create your robots.txt?

Use our free Robots.txt Generator to create a valid file in seconds.
