Robots.txt Complete Guide: Control Search Engine Crawling
Master robots.txt to control which pages search engines can crawl, save crawl budget, and keep crawlers out of areas you don't want crawled.
Chapter 1: What is Robots.txt?
Robots.txt is a text file placed in the root directory of your website that tells search engine crawlers which pages or sections of your site they should or shouldn't crawl.
The file must be in the root directory and named exactly "robots.txt" in lowercase (the filename is case-sensitive).
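Robots.txt blocks crawling, not indexing. If a page is blocked by robots.txt but linked from other sites, it can still appear in search results (without a description). To keep a page out of the index, use a noindex meta tag or password protection. Crawlers can only see a noindex tag on pages they are allowed to crawl, so do not block those pages in robots.txt as well.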
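For illustration, a minimal robots.txt might look like the sketch below. The domain example.com and the /private/ path are placeholders chosen for this example, not values from this guide.

```txt
# Served from the site root, e.g. https://example.com/robots.txt
# (example.com and /private/ are placeholders)
User-agent: *
Disallow: /private/
```

This asks every compliant crawler to stay out of /private/ and leaves the rest of the site crawlable.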
Chapter 2: How Robots.txt Works
Before crawling your site, search engines first check your robots.txt file to see which URLs they're allowed to crawl.
- Googlebot (Google) - Yes
- Bingbot (Bing) - Yes
- Slurp (Yahoo) - Yes
- DuckDuckBot (DuckDuckGo) - Yes
- Baiduspider (Baidu) - Yes
- Yandex Bot - Yes
- Malicious bots - Often ignore robots.txt
Chapter 3: Robots.txt Syntax & Directives
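Note: Google ignores Crawl-delay. The crawl rate setting that used to exist in Google Search Console was retired in early 2024; Googlebot now adjusts its crawl rate automatically based on how your server responds.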
- Each directive goes on its own line
- No space before the colon (a single space after it is fine)
- Use # for comments
- Paths are case-sensitive
- Trailing slashes matter (/admin vs /admin/)
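The rules above can be sketched in one short file. The paths are hypothetical and only there to show the syntax.

```txt
# Lines starting with # are comments
User-agent: *          # the crawler(s) this group of rules applies to
Disallow: /admin/      # blocks /admin/ and everything below it
Disallow: /tmp         # prefix match: also blocks /tmp/, /tmp.html, /tmpfile
Disallow: /Drafts/     # case-sensitive: does not block /drafts/
Crawl-delay: 10        # non-standard; ignored by Google
```

The two forms differ because rules are matched as URL prefixes: /tmp matches anything that starts with those characters, while /admin/ only matches URLs inside that directory.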
Chapter 4: Common User Agents
| User Agent | Search Engine | When to Use |
|---|---|---|
| * | All bots | General rules for everyone |
| Googlebot | Google Search (main web crawler) | Google-specific rules |
| Googlebot-Image | Google Image Search | Block images specifically |
| Googlebot-Video | Google Video Search | Block videos specifically |
| Googlebot-News | Google News | Google News specific rules |
| Bingbot | Bing | Bing-specific rules |
| Slurp | Yahoo | Yahoo-specific rules |
| DuckDuckBot | DuckDuckGo | DuckDuckGo-specific rules |
| Baiduspider | Baidu (China) | Baidu-specific rules |
| YandexBot | Yandex (Russia) | Yandex-specific rules |
| AhrefsBot | Ahrefs SEO tool | Block SEO tools from crawling |
| SemrushBot | Semrush SEO tool | Block SEO tools from crawling |
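As a sketch of how these user agents are used, the file below defines one default group and two specific groups. The paths are hypothetical.

```txt
# Default group: applies to any crawler without a more specific group
User-agent: *
Disallow: /admin/

# Googlebot-Image follows only this group, not the * group above
User-agent: Googlebot-Image
Disallow: /images/private/

# One group can name several crawlers: block two SEO tools entirely
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /
```

A crawler obeys only the most specific group that matches it, so any rule from the * group that should also apply to a named crawler has to be repeated inside that crawler's group.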
Chapter 5: Allow vs Disallow Directives
The Allow directive can override a broader Disallow rule, for example to open up a single file inside a blocked directory. When both rules match a URL, Google applies the most specific (longest) one.
Not every crawler supports Allow, although Google and Bing do. For maximum compatibility, structure your Disallow rules so that you don't need Allow at all.
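A minimal sketch of an Allow override, using a hypothetical /downloads/ directory and file name:

```txt
User-agent: *
Disallow: /downloads/
# The longer, more specific rule wins for this one URL
Allow: /downloads/catalog.pdf

# Alternative that needs no Allow support: move the blocked files into
# their own subfolder and disallow only that folder, e.g.
#   Disallow: /downloads/private/
```

The commented alternative trades a little site restructuring for rules that every crawler interprets the same way.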
Chapter 6: Adding Sitemap to Robots.txt
Adding your sitemap location to robots.txt helps search engines discover all your important pages.
- Google finds your sitemap automatically
- No need to submit manually in Google Search Console
- Helps bots discover all your content
- Especially useful for large websites
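A sketch of a robots.txt that declares sitemaps. The URLs are placeholders; the Sitemap line takes a full absolute URL and can appear more than once.

```txt
User-agent: *
Disallow:

# Sitemap lines stand outside the user-agent groups and apply to all crawlers
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
```

The empty Disallow line means nothing is blocked, so this file only advertises the sitemaps.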
Chapter 7: Common Robots.txt Use Cases
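One common case is keeping every crawler out of a staging or development copy of a site. A minimal sketch:

```txt
# Staging/development only: ask all compliant crawlers to stay out
User-agent: *
Disallow: /
```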
Warning: Only use a blanket Disallow: / rule on development sites. On a live site, it blocks ALL crawling!
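Two other cases this guide recommends are blocking admin areas and internal search result pages. A sketch, assuming the hypothetical paths /admin/ and /search/:

```txt
User-agent: *
# Keep crawlers out of the admin area
Disallow: /admin/
# Internal search result pages are duplicate, low-value content
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
```

Adjust the paths to match your own URL structure; the rule only helps if it matches the URLs your site actually generates.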
Chapter 8: 10 Common Robots.txt Mistakes
- Accidentally blocking your entire site - Disallow: / on a live site blocks all compliant search engines. A very common and disastrous mistake.
- Using robots.txt for security - Robots.txt is publicly visible. Anyone can see what you're blocking. Use password protection for sensitive content.
- Blocking CSS/JS files - Google needs CSS and JS to render your page properly. Don't block these files.
- Multiple Disallow lines for the same path - Inefficient and harder to maintain.
- Missing trailing slashes - Disallow: /admin is different from Disallow: /admin/
- Case sensitivity issues - /Admin is different from /admin. Be consistent.
- Blocking Googlebot but allowing others - For most sites, Google supplies the large majority of search traffic. Usually not what you want.
- Not adding a sitemap URL - You're making it harder for search engines to find your content.
- Using robots.txt to block pages you want noindexed - Use the noindex meta tag instead. Blocked pages can still appear in results.
- Not testing after changes - Always validate your robots.txt before deploying.
The worst case is a blanket block left on a LIVE production site:
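As a sketch, the file in question contains just two directives:

```txt
# Blocks the whole site for every compliant crawler
User-agent: *
Disallow: /
```

Changing the last line to an empty Disallow: (no path) allows crawling of everything again.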
Result: Your pages can drop out of Google's results, and recovery can take weeks or months after you fix the file.
Chapter 9: Testing Your Robots.txt
Method 1: Google Search Console
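- Log in to Google Search Console
- Go to "Settings" > "robots.txt" report (the older standalone "Robots.txt Tester" has been retired)
- Check which robots.txt files Google found and when each was last fetched
- Review any warnings or errors reported for the file
- Request a recrawl after you update robots.txt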
Method 2: URL Inspection (formerly Fetch as Google)
Use "URL Inspection" tool to see how Googlebot sees a specific URL and whether robots.txt blocks it.
Method 3: Direct Browser Access
View your robots.txt directly in any browser to verify it's accessible.
Use our Robots.txt Generator to create and validate your robots.txt file before deploying.
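Google generally caches robots.txt for up to 24 hours, so changes are usually picked up within about a day, although their effect on search results can take longer to show. Use Google Search Console to request a recrawl after major changes.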
Chapter 10: Robots.txt Generator & Tools
Robots.txt Generator
Generate a valid robots.txt file with common directives in seconds
Robots.txt Tester
Test your robots.txt file before deploying to production
Additional Testing Resources
Official robots.txt tester and validator
Crawl your site and verify robots.txt rules are working
Online robots.txt validator and checker
Bing's robots.txt testing tool
Robots.txt Cheat Sheet
Do:
- Test before deploying
- Add sitemap location
- Block duplicate content (search results)
- Block admin areas
- Use specific user-agents when needed
- Keep it simple and maintainable
Don't:
- Block the entire live site
- Use robots.txt for security
- Block CSS/JS files
- Use it to noindex pages
- Forget to test after changes
- Make overly complex rules
Ready to create your robots.txt?
Use our free Robots.txt Generator to create a valid file in seconds.