Proper configuration of your robots.txt file and XML sitemap is fundamental for SEO, crawlability, and indexation. Whether you’re using WordPress, Joomla, Shopify, Webflow, or a custom CMS, understanding how these two tools interact with search engines can make or break your organic visibility.
This comprehensive guide covers:
- What XML Sitemaps and robots.txt are
 - Best practices across all major CMS platforms
 - Blocking sensitive or redundant paths (e.g. wp-admin, feeds, internal search, etc.)
 - Cloudflare considerations
 - Deep insights into indexing control, crawl budget optimisation, and performance
 
What is an XML Sitemap?
An XML Sitemap is a structured file (usually at /sitemap.xml) that tells search engines which URLs are available for crawling and indexing. It helps ensure all important pages are discovered, especially for:
- Large websites
 - Newly launched websites
 - Sites with poor internal linking
 
Benefits:
- Improves crawl efficiency
 - Flags canonical pages
 - Provides metadata (e.g. last modified date, change frequency, priority)
 
Format:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
Tools to Generate:
- WordPress: Yoast SEO, Rank Math, All-in-One SEO (auto-generate sitemaps)
 - Joomla: RSSEO, 4SEO
- Shopify: Auto-generated at /sitemap.xml
 - Webflow: Built-in under SEO settings
 - Custom CMS: Use libraries or scripts (Python, PHP)
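For a custom CMS, a minimal Python sketch like the one below can generate a basic sitemap. It assumes a hypothetical list of (URL, last-modified) pairs pulled from your own database; adapt the data source to your setup.

from datetime import date
from xml.sax.saxutils import escape

def build_sitemap(pages):
    # pages: iterable of (absolute_url, last_modified_date) tuples
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for loc, lastmod in pages:
        lines.append('  <url>')
        lines.append(f'    <loc>{escape(loc)}</loc>')
        lines.append(f'    <lastmod>{lastmod.isoformat()}</lastmod>')
        lines.append('  </url>')
    lines.append('</urlset>')
    return '\n'.join(lines)

# Example data; in practice, query your CMS for public, canonical URLs
pages = [('https://example.com/', date(2025, 5, 1)),
         ('https://example.com/blog/', date(2025, 4, 20))]

with open('sitemap.xml', 'w', encoding='utf-8') as f:
    f.write(build_sitemap(pages))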
 
What is Robots.txt?
robots.txt is a text file at the root of your domain (e.g. https://example.com/robots.txt) that instructs search engine crawlers which parts of the site to crawl or avoid.
Syntax:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap_index.xml
Key Directives:
- User-agent: Which bots the rule applies to (* = all bots)
- Disallow: Prevent crawling of a path
- Allow: Override a disallow rule for specific files or folders
- Sitemap: Declare the sitemap location
Note: robots.txt blocks crawling, not indexing. To block indexing, use noindex meta tags or HTTP headers.
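For example, to keep a page out of the index while still letting crawlers reach it, you can use a meta tag in the page's <head>, or the equivalent HTTP response header for non-HTML files:

<meta name="robots" content="noindex, follow">

X-Robots-Tag: noindex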
WordPress-Specific Recommendations
Common Paths to Block:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /?s=
Disallow: /search
Disallow: /comments/feed/
Allow: /wp-admin/admin-ajax.php
Should You Block Feeds, Tags, or Archives?
- Feeds: Yes, if duplicate or low-value (e.g. /feed/, /comments/feed/)
 - Tags and Categories: Block if thin or duplicative. Otherwise, optimise.
 - Archives (Date, Author): Block if you’re not using them for users or SEO.
 
XML Sitemap in WordPress
- Yoast: /sitemap_index.xml
 - Rank Math: /sitemap_index.xml
 - Ensure your sitemap is referenced in robots.txt
Other CMS Considerations
Shopify
- Auto-generated sitemaps
 - Limited robots.txt editing (requires Shopify Plus or a workaround via an app/API)
 - Generally no need to block paths manually, but you can block checkout/login pages if they get indexed
 
Webflow
- XML sitemap generated automatically
 - Customisable robots.txt in project settings
 - Common disallows:
 
User-agent: *
Disallow: /?s=
Disallow: /search
Joomla
- Manually configure robots.txt
 - Include the Sitemap: directive manually
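As a rough sketch (the exact default rules vary by Joomla version, and the sitemap URL depends on the extension you use), the end of your Joomla robots.txt could look like this:

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml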
Wix
- Built-in control over both sitemap and robots via SEO settings
 - Custom robots.txt editing available
Should You Block These?
| Path/Type | Block? | Reason | 
|---|---|---|
| /wp-admin/ | Yes | Backend, not useful for indexing | 
| /wp-includes/ | Yes | Internal codebase, not content | 
| /search | Yes | Duplicate/thin content | 
| /xmlrpc.php | Yes | Security risks and no SEO value | 
| /trackback/ | Yes | Outdated WordPress feature | 
| /feed/ | Yes | If feeds aren’t adding value or being used | 
| /tag/ | Conditional | Only if thin/duplicate | 
| /category/ | Conditional | Only block if duplicate or unoptimised | 
| Pagination (/page/2/) | No | Let it be crawled unless it causes crawl budget issues | 
Important Exceptions to Allow in robots.txt for WordPress!
The wp-includes folder contains various CSS and JS files that search engine bots may need in order to render your pages properly. To avoid blocking them, add the following allows after disallowing the wp-includes folder:
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-includes/fonts/
If you’re getting confused, no worries. I’ll make it easy for you at the end. Keep reading!
Cloudflare Considerations
Using Cloudflare doesn’t directly affect your sitemap or robots.txt, but:
Best Practices:
- Enable Bot Fight Mode carefully (can block good bots)
 - Ensure robots.txt is not cached aggressively (use page rules to bypass cache)
 - Use Firewall Rules to block scrapers without affecting Googlebot
 - Enable Automatic Platform Optimization (APO) for WordPress to improve delivery
 - Use Workers only if you know how to avoid interfering with XML or header rules
 
Crawl Budget & Robots.txt
Large sites (10K+ URLs) need to use robots.txt to control:
- Infinite URL parameters (e.g. ?sort=, ?filter=)
 - Internal search results
 - Session IDs, tracking parameters
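For example, Google and Bing support wildcard patterns in robots.txt, so parameterised URLs can be blocked with rules like the following (the parameter names are placeholders; match them to the ones your site actually generates):

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=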
 
Tools:
 - Google Search Console → Crawl Stats
 - Log file analysis (via Screaming Frog or tools like Loggly, JetOctopus)
 
Advanced Tips
- Use X-Robots-Tag: noindex HTTP headers for non-HTML files (PDFs, etc.); see the server config sketch after this list
 - Keep your sitemap updated automatically (cron jobs, plugins)
 - For international sites, use hreflang sitemap index files
 - Validate with Google Search Console & Bing Webmaster Tools
 - Avoid mixing disallowed URLs in sitemap (Google may ignore them)
 - Don’t block resources (JS, CSS) needed to render pages correctly
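As a sketch of that first tip, assuming an Apache server with mod_headers enabled (nginx has an equivalent add_header directive), you could send noindex for PDFs like this:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>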
 
Frequently Asked Questions about the robots.txt File
Now you might be wondering what every line of the final robots.txt file does. Here’s a breakdown:
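For reference, here is roughly what the combined WordPress file from the sections above looks like (the sitemap URL is a placeholder for your own):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /?s=
Disallow: /search
Disallow: /comments/feed/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-includes/fonts/
Sitemap: https://example.com/sitemap_index.xml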
Should You Disallow: /wp-admin/ and Why?
Yes! This is the WordPress dashboard backend. It should not be crawled or indexed, as it contains no public-facing content and could pose a security risk if unnecessarily exposed.
Should You Disallow: /wp-includes/ and Why?
In short, Yes! It contains WordPress core files, libraries, and internal logic not meant to be accessed or crawled. Indexing these could confuse crawlers and expose sensitive scripts or version info.
Does /wp-includes/ contain front-end CSS or JS?
Yes, it ships front-end CSS, JS, and font files that themes and plugins load. Blocking /wp-includes/ in robots.txt does not block browsers or users from accessing those assets — it only stops search engine crawlers from crawling or indexing files under that path.
So to keep things safe, we explicitly allow the css, js and fonts folders inside wp-includes.
Why? Google recommends not blocking the JS/CSS it needs to understand a page, which is why blocking /wp-includes/ wholesale is sometimes discouraged when it causes rendering issues.
Should You Disallow: /cgi-bin/ and Why?
Yes! It is a legacy folder often found in older hosting setups (not specific to WordPress). Blocking this is a precautionary step to prevent access to outdated scripts or unused server-side executables.
Should You Disallow: /trackback/ and Why?
In short, Yes! WordPress trackbacks are outdated and often used for spam. Crawling this endpoint is unnecessary and could lead to spammy links or index bloat.
Should You Disallow: /xmlrpc.php and Why?
Yes! XML-RPC is used for remote publishing and app access, but it’s also a frequent target for brute force and DDoS attacks. Blocking it for crawlers helps mitigate this risk while still allowing access if needed for API calls (unless completely disabled via server or plugin).
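If you decide to disable it entirely at the server level (beyond just discouraging crawlers), a minimal sketch for Apache 2.4, assuming you can edit .htaccess or the vhost config, would be:

<Files xmlrpc.php>
  Require all denied
</Files>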
Should You Disallow: /?s= and Why?
In most cases yes! This blocks internal search result pages (e.g. /?s=shoes) which often generate thin, low-quality, or duplicate content that is not useful in search results and can waste crawl budget.
Should You Disallow: /search and Why?
In most cases, yes. Similar to /?s=, but blocks pretty permalink versions of search URLs (e.g. /search/shoes/). Again, these are dynamically generated and provide minimal SEO value.
Should You Disallow: /comments/feed/ and Why?
Yes! This RSS feed of recent comments is generally low-value for SEO and mostly intended for RSS readers. Blocking it reduces crawl noise and duplication.
Why Should You Allow: /wp-admin/admin-ajax.php?
This exception is crucial. admin-ajax.php handles AJAX requests (e.g. for dynamic loading, frontend forms, filterable content). Blocking it can break site functionality for users or bots. So we explicitly allow it even while the rest of /wp-admin/ is disallowed.
Additional Optional Directives (Consider Adding)
Disallow: /readme.html
- Reason: Exposes WordPress version, which can be a security risk. Google doesn’t need to crawl or index it.
 
Disallow: /license.txt
- Reason: Not useful for indexing and often reveals CMS info or server software details.
 
Disallow: /tag/ and /category/
- Conditional: Block these only if:
- Tags and categories are poorly maintained
 - They cause duplication (e.g. a product is listed under multiple tags/categories with no unique content)
 - You use breadcrumbs or internal linking as a better structural tool
 - You’ve implemented proper canonical URLs elsewhere
 
 
XML Sitemap Best Practices
An effective XML sitemap doesn’t just list URLs—it reflects a strategy. These best practices ensure your sitemap performs well across all search engines and supports your SEO efforts.
Include Only Canonical, Indexable URLs
- Avoid URLs with noindex, canonical tags pointing to other pages, or those blocked in robots.txt.
 - Helps search engines avoid wasting crawl budget and prevents confusion.
 
Keep Sitemaps Under Limits
- Max 50,000 URLs per sitemap file or 50MB uncompressed.
 - Use a sitemap index file (sitemap_index.xml) if you have more.
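A sitemap index simply lists the child sitemaps, for example (the filenames follow the page-type split suggested further down):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
    <lastmod>2025-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/product-sitemap.xml</loc>
  </sitemap>
</sitemapindex>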
Use Absolute URLs
- Always use full URLs (https://example.com/page) instead of relative paths.
 - Ensures clarity and avoids misinterpretation by crawlers.
 
Set Accurate <lastmod> Dates
- Helps Google prioritise recently updated content.
 - Only update lastmod if the actual content has changed (not just layout or minor style tweaks).
Split by Page Type
Segment large sites by content type (e.g., post-sitemap.xml, product-sitemap.xml, category-sitemap.xml) to:
- Track indexation more precisely
 - Submit high-priority sections independently
 
Update Automatically
- In WordPress, ensure sitemap plugins regenerate sitemaps on content updates.
 - For custom setups, use scheduled tasks (e.g., cron jobs).
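For instance, a nightly cron entry like this keeps a custom sitemap fresh (the script path and name are hypothetical):

# Regenerate sitemap.xml every night at 02:00
0 2 * * * /usr/bin/python3 /var/www/scripts/generate_sitemap.py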
 
Use Hreflang Where Applicable
- For multilingual/multiregional sites, use hreflang attributes inside the sitemap or reference hreflang sitemaps from your index file.
 - Helps search engines serve the right regional version in SERPs.
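A single URL entry with hreflang alternates looks like this (the language versions are examples):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/page/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/page/"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page/"/>
  </url>
</urlset>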
 
Submit to Google & Bing
- Use Google Search Console and Bing Webmaster Tools to submit sitemaps and track issues (e.g., crawl errors, excluded URLs).
 - Check regularly for warnings like “Submitted URL blocked by robots.txt.”
 
Don’t Include Disallowed URLs
- Google may ignore sitemaps that list pages disallowed in robots.txt or tagged noindex.
 - Keep your sitemap clean and focused on valuable URLs.
 
Avoid Orphan Pages
- A page only in a sitemap but not internally linked is an orphan.
 - Ensure all important pages are also discoverable via internal links.
 
Use Sitemap Priority and Changefreq (Optional)
- These tags are largely ignored by Google but may still assist in some search engines or structured internal tooling.
 - Don’t rely on them for crawling behaviour.
 
Validate Before Submitting
Use tools like:
- Google’s XML Sitemap Validator
 - Screaming Frog SEO Spider
 - XML Sitemap Validator by SEObility
 
Conclusion
Understanding and configuring your XML sitemap and robots.txt properly is one of the highest ROI tasks in SEO. Done right, it ensures your site is easily crawlable, indexable, and primed for visibility — no matter which CMS you’re using. Whether you’re running a small blog or managing a multi-language enterprise site, this technical foundation will scale with you.
Need help diagnosing a robots.txt issue or generating a custom sitemap for a complex setup? Drop a comment or contact me — let’s make sure every important page on your site gets the attention it deserves.
