
Robots.txt Generator: Control How Search Engines Crawl Your Site

5 min read By OhMyApps

Every search engine crawler checks for a robots.txt file before it starts indexing your site. This plain text file, placed at the root of your domain, tells bots which pages they can access and which they should skip. Without one, crawlers will attempt to index everything — including pages you may not want appearing in search results.

What Is robots.txt?

The robots.txt file is a standard created in 1994 as part of the Robots Exclusion Protocol. It lives at https://yourdomain.com/robots.txt and contains rules that search engine crawlers read before accessing your pages.

A basic robots.txt file looks like this:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

Sitemap: https://yourdomain.com/sitemap.xml

This tells all crawlers (* means every bot) to index the entire site except the /admin/ and /private/ directories, and points them to your sitemap for efficient crawling.
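If you want to sanity-check rules like these programmatically, Python's standard-library urllib.robotparser can evaluate them. One caveat: Python's parser applies rules in first-match order rather than Google's longest-match precedence, so the broad Allow: / line is left out of this sketch to keep the results unambiguous.

```python
import urllib.robotparser

# Parse the example rules with Python's built-in robots.txt parser.
# The broad "Allow: /" line is omitted because this parser applies rules
# in first-match order, unlike Google's longest-match precedence.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /private/",
    "",
    "Sitemap: https://yourdomain.com/sitemap.xml",
])

print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))      # True
print(rp.can_fetch("*", "https://yourdomain.com/admin/settings")) # False
```

can_fetch() takes a user-agent string and a full URL, and answers whether that agent may crawl the URL under the parsed rules.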

Key Directives Explained

User-agent

Specifies which crawler the rules apply to. Use * for all bots or target specific crawlers by name:

User-agent      Crawler
*               All search engine bots
Googlebot       Google’s main crawler
Bingbot         Microsoft Bing’s crawler
Slurp           Yahoo’s crawler
DuckDuckBot     DuckDuckGo’s crawler
GPTBot          OpenAI’s web crawler
anthropic-ai    Anthropic’s web crawler

You can create separate rule blocks for different crawlers, giving each one different access levels.

Allow

Explicitly permits access to a path. This is most useful when combined with a broader Disallow rule:

User-agent: *
Disallow: /api/
Allow: /api/docs/

This blocks all of /api/ except the /api/docs/ directory.
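Google resolves conflicts like this by the most specific (longest) matching rule, with Allow winning ties. A minimal sketch of that precedence logic, using a hypothetical helper rather than a full parser:

```python
def is_allowed(rules, path):
    """Sketch of Google-style precedence: the longest matching prefix wins,
    and Allow beats Disallow on a tie. Not a complete robots.txt parser."""
    matches = [(len(prefix), directive == "allow")
               for directive, prefix in rules
               if path.startswith(prefix)]
    # No matching rule means the path is allowed by default.
    return max(matches)[1] if matches else True

rules = [("disallow", "/api/"), ("allow", "/api/docs/")]
print(is_allowed(rules, "/api/keys"))     # False: blocked by /api/
print(is_allowed(rules, "/api/docs/v1"))  # True: /api/docs/ is more specific
```

Note that rule order in the file does not matter under this precedence model; specificity does.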

Disallow

Blocks crawlers from accessing specified paths. An empty Disallow: line means nothing is blocked.

Common paths to disallow:

  • /admin/ — admin panels and dashboards
  • /login/ — authentication pages
  • /cart/ — shopping cart pages
  • /search/ — internal search results (thin content)
  • /tmp/ — temporary or staging files
  • /*.pdf$ — PDF files (if you prefer they not appear in search)

Crawl-delay

Requests that crawlers wait a specified number of seconds between requests. This helps prevent bots from overloading your server:

User-agent: *
Crawl-delay: 10

Note that Googlebot does not honor Crawl-delay. For Google, configure crawl rate in Google Search Console instead.
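For crawlers that do honor it, the delay is readable with the standard-library parser. A quick check, assuming a bot name of MyBot:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# crawl_delay() returns the delay for a matching user agent,
# or None if the file specifies no delay for it.
print(rp.crawl_delay("MyBot"))  # 10
```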

Sitemap

Points crawlers to your XML sitemap so they can discover all your pages efficiently:

Sitemap: https://yourdomain.com/sitemap.xml

You can list multiple sitemaps, each on its own line.
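On Python 3.8+, the standard-library parser can also extract these Sitemap entries. The second sitemap URL below is a hypothetical example of a multi-sitemap setup:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "Sitemap: https://yourdomain.com/sitemap.xml",
    "Sitemap: https://yourdomain.com/blog-sitemap.xml",  # hypothetical second sitemap
])

# site_maps() (Python 3.8+) returns every Sitemap URL listed in the file.
print(rp.site_maps())
```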

How to Use Our Robots.txt Generator

  1. Add user-agent rules — select which crawlers your rules apply to
  2. Set Allow and Disallow paths — specify which directories to open or block
  3. Configure crawl delay — optionally slow down bot requests
  4. Add your sitemap URL — help crawlers find all your pages
  5. Copy the generated file — save it as robots.txt at your domain root

The generator produces a clean, standards-compliant robots.txt that follows current best practices.
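Under the hood, the steps above amount to assembling a few text blocks. A simplified sketch of what such a generator does (a hypothetical function, not our actual implementation):

```python
def generate_robots_txt(rules, sitemaps=(), crawl_delay=None):
    """Build robots.txt text from per-agent rules.

    rules: dict mapping user-agent -> {"allow": [...], "disallow": [...]}
    """
    lines = []
    for agent, paths in rules.items():
        lines.append(f"User-agent: {agent}")
        for path in paths.get("allow", []):
            lines.append(f"Allow: {path}")
        for path in paths.get("disallow", []):
            lines.append(f"Disallow: {path}")
        if crawl_delay is not None:
            lines.append(f"Crawl-delay: {crawl_delay}")
        lines.append("")  # blank line between rule blocks
    for sitemap in sitemaps:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).strip() + "\n"

print(generate_robots_txt(
    {"*": {"allow": ["/"], "disallow": ["/admin/", "/private/"]}},
    sitemaps=["https://yourdomain.com/sitemap.xml"],
))
```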

Common Use Cases

Block Admin and Staging Areas

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /wp-admin/

Keep internal tools and development environments out of search indexes.

Block AI Crawlers While Allowing Search Engines

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Allow: /

This allows traditional search engines to index your site while blocking AI training crawlers.
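Fed to the standard-library parser, these blocks give different answers per crawler:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: anthropic-ai",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
])

# GPTBot matches its dedicated block; Googlebot falls through to the
# default "*" block.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://yourdomain.com/post"))  # True
```

Keep in mind this only describes well-behaved crawlers; a bot that ignores robots.txt is unaffected.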

E-commerce Site Configuration

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Allow: /

Sitemap: https://shop.example.com/sitemap.xml

Block transactional pages that add no value in search results while allowing product and category pages.

Common Mistakes to Avoid

  • Blocking your entire site accidentally. A single Disallow: / with User-agent: * hides your entire site from all search engines.
  • Forgetting the trailing slash. Disallow: /admin blocks /admin itself, but also /administration and any other path starting with that prefix. Use Disallow: /admin/ to block only the directory.
  • Blocking CSS and JavaScript files. Search engines need these to render your pages properly. Blocking them can hurt your rankings.
  • Relying on robots.txt for security. This file is a suggestion, not a restriction. Malicious bots will ignore it. Use proper authentication and access controls for sensitive content.
  • Not including a sitemap reference. Always add your sitemap URL to help crawlers discover your pages efficiently.

Practical Tips

  • Place your robots.txt file at the exact root of your domain: https://yourdomain.com/robots.txt.
  • Test your robots.txt using Google Search Console’s robots.txt report (which replaced the retired robots.txt Tester) to verify rules work as intended.
  • Review and update your robots.txt when you add new sections to your site or restructure URLs.
  • Keep the file simple. Complex rules with many wildcard patterns are harder to maintain and debug.
  • Use the Allow directive to create exceptions within broader Disallow blocks rather than listing every allowed path individually.

Frequently Asked Questions

Does robots.txt prevent pages from appearing in Google? Disallow prevents crawling, but Google may still index a URL if other pages link to it. For guaranteed removal from search results, use a noindex meta tag or X-Robots-Tag HTTP header instead.

Is robots.txt required for every website? It is not technically required, but it is strongly recommended. Without one, crawlers will index every accessible page, including ones you may not want public.

Can I block specific file types? Yes. Use wildcard patterns like Disallow: /*.pdf$ to block PDF files or Disallow: /*.xml$ to block XML files from specific crawlers.
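Wildcard support varies by crawler: Google honors * and $, while some parsers (including Python’s urllib.robotparser) treat them literally. A small sketch of Google-style pattern matching, using a hypothetical helper:

```python
import re

def robots_pattern_to_regex(pattern):
    """Compile a robots.txt wildcard pattern into a regex.

    Sketch of Google-style matching semantics (an assumption, not a
    universal standard): * matches any run of characters, a trailing $
    anchors the end, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: $ anchors the end
```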

How quickly do search engines pick up robots.txt changes? Google typically checks robots.txt every 24 hours. You can request an immediate re-read through Google Search Console.


Try our free Robots.txt Generator to create a properly configured robots.txt file for your website.
