Proper configuration of your robots.txt file and XML sitemap is fundamental for SEO, crawlability, and indexation. Whether you’re using WordPress, Joomla, Shopify, Webflow, or a custom CMS, understanding how these two tools interact with search engines can make or break your organic visibility.

This comprehensive guide covers:

  • What XML Sitemaps and robots.txt are
  • Best practices across all major CMS platforms
  • Blocking sensitive or redundant paths (e.g. wp-admin, feeds, internal search, etc.)
  • Cloudflare considerations
  • Deep insights into indexing control, crawl budget optimisation, and performance

What is an XML Sitemap?

An XML Sitemap is a structured file (usually at /sitemap.xml) that tells search engines which URLs are available for crawling and indexing. It helps ensure all important pages are discovered, especially for:

  • Large websites
  • Newly launched websites
  • Sites with poor internal linking

Benefits:

  • Improves crawl efficiency
  • Flags canonical pages
  • Provides metadata (e.g. last modified date, change frequency, priority)

Format:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>

Tools to Generate:

  • WordPress: Yoast SEO, Rank Math, All-in-One SEO (auto-generate sitemaps)
  • Joomla: RSSEO, 4SEO
  • Shopify: Auto-generated at /sitemap.xml
  • Webflow: Built-in under SEO settings
  • Custom CMS: Use libraries or scripts (Python, PHP)
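
For a custom CMS, a short script is often enough. Below is a minimal Python sketch; get_published_urls() is a hypothetical helper standing in for whatever query returns your public URLs and their last-modified dates.

# Minimal sketch: generate sitemap.xml for a custom CMS.
from xml.sax.saxutils import escape
from datetime import date

def get_published_urls():
    # Hypothetical data source; swap in your own database query.
    return [
        ("https://example.com/", date(2025, 5, 1)),
        ("https://example.com/about/", date(2025, 4, 20)),
    ]

def build_sitemap(entries):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, lastmod in entries:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <lastmod>{lastmod.isoformat()}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(build_sitemap(get_published_urls()))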

What is Robots.txt?

robots.txt is a text file at the root of your domain (e.g. https://example.com/robots.txt) that instructs search engine crawlers which parts of the site to crawl or avoid.

Syntax:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap_index.xml

Key Directives:

  • User-agent: Which bots the rule applies to (* = all bots)
  • Disallow: Prevent crawling of a path
  • Allow: Override a disallow rule for specific files or folders
  • Sitemap: Declare sitemap location

Note: robots.txt blocks crawling, not indexing. To block indexing, use noindex meta tags or HTTP headers.
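
For example, either of the following keeps a page out of the index (the meta tag goes in the page's <head>; the header is set by your server):

<meta name="robots" content="noindex">
X-Robots-Tag: noindex

Remember that crawlers must be able to fetch a page to see either signal, so don't combine noindex with a robots.txt disallow for the same URL.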

WordPress-Specific Recommendations

Common Paths to Block:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /?s=
Disallow: /search
Disallow: /comments/feed/
Allow: /wp-admin/admin-ajax.php

Should You Block Feeds, Tags, or Archives?

  • Feeds: Yes, if duplicate or low-value (e.g. /feed/, /comments/feed/)
  • Tags and Categories: Block if thin or duplicative. Otherwise, optimise.
  • Archives (Date, Author): Block if you’re not using them for users or SEO.

XML Sitemap in WordPress

  • Yoast: /sitemap_index.xml
  • Rank Math: /sitemap_index.xml
  • Ensure your sitemap is included in robots.txt

Other CMS Considerations

Shopify

  • Auto-generated sitemaps
  • Limited robots.txt editing (requires Shopify Plus or workaround via app/API)
  • Generally no need to block paths manually, but you can block checkout/login pages if indexed

Webflow

  • XML sitemap generated automatically
  • Customisable robots.txt in project settings
  • Common disallows:
User-agent: *
Disallow: /?s=
Disallow: /search

Joomla

  • Manually configure robots.txt
  • Include Sitemap: directive manually
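
A minimal sketch of that addition, appended to Joomla's shipped robots.txt (which already disallows /administrator/ and other system folders); the exact sitemap path depends on the extension you use:

Sitemap: https://example.com/sitemap.xml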

Wix

  • Built-in control over both sitemap and robots via SEO settings
  • Custom robots.txt editing available

Should You Block These?

Path/Type             | Block?      | Reason
/wp-admin/            | Yes         | Backend, not useful for indexing
/wp-includes/         | Yes         | Internal codebase, not content
/search               | Yes         | Duplicate/thin content
/xmlrpc.php           | Yes         | Security risks and no SEO value
/trackback/           | Yes         | Outdated WordPress feature
/feed/                | Yes         | If feeds aren't adding value or used
/tag/                 | Conditional | Only if thin/duplicate
/category/            | Conditional | Only block if duplicate or unoptimised
Pagination (/page/2/) | No          | Let it be crawled unless causing crawl budget issues

Important Exceptions to Allow in robots.txt for WordPress!

There are various CSS and JS files inside the wp-includes folder that search engine bots may need in order to render your pages properly. To avoid blocking them, add the following rules right after disallowing the wp-includes folder:

Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-includes/fonts/

If you’re getting confused, no worries. I’ll make it easy for you at the end. Keep reading!

Cloudflare Considerations

Using Cloudflare doesn’t directly affect your sitemap or robots.txt, but:

Best Practices:

  • Enable Bot Fight Mode carefully (can block good bots)
  • Ensure robots.txt is not cached aggressively (use page rules to bypass cache)
  • Use Firewall Rules to block scrapers without affecting Googlebot
  • Enable Automatic Platform Optimization (APO) for WordPress to improve delivery
  • Use Workers only if you know how to avoid interfering with XML or header rules

Crawl Budget & Robots.txt

Large sites (10K+ URLs) need to use robots.txt to control:

  • Infinite URL parameters (e.g. ?sort=, ?filter=)
  • Internal search results
  • Session IDs, tracking parameters
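
A hedged sketch of what such rules might look like (the parameter names here are illustrative, and the * wildcard is supported by Google and Bing):

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=
Disallow: /search/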

Tools:

  • Google Search Console → Crawl Stats
  • Log file analysis (via Screaming Frog or tools like Loggly, JetOctopus)

Advanced Tips

  • Use X-Robots-Tag: noindex HTTP headers for non-HTML files such as PDFs (see the server config example after this list)
  • Keep sitemap updated automatically (cron jobs, plugins)
  • For international sites, use hreflang sitemap index files
  • Validate with Google Search Console & Bing Webmaster Tools
  • Avoid mixing disallowed URLs in sitemap (Google may ignore them)
  • Don’t block resources (JS, CSS) needed to render pages correctly
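
As an example of the first tip, here is a hedged sketch of serving the X-Robots-Tag header for PDFs; adjust the file pattern to suit, and note that the Apache version needs mod_headers enabled.

Apache (.htaccess or vhost config):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Nginx (inside the relevant server block):

location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}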

Frequently Asked Questions about the robots.txt File

Now you might be wondering what every line of the final robots.txt file does. Here’s a breakdown:

Should You Disallow: /wp-admin/ and Why?

Yes! This is the WordPress dashboard backend. It should not be crawled or indexed, as it contains no public-facing content and could pose a security risk if unnecessarily exposed.

Should You Disallow: /wp-includes/ and Why?

In short, Yes! It contains WordPress core files, libraries, and internal logic not meant to be accessed or crawled. Indexing these could confuse crawlers and expose sensitive scripts or version info.

Does /wp-includes/ contain front-end CSS or JS?

Yes, it does. WordPress serves some front-end assets (core scripts, styles, and fonts) from /wp-includes/. Blocking the folder does not stop browsers or users from loading those assets; it only stops search engine crawlers from fetching files under that path, which can prevent Googlebot from rendering your pages correctly.

So to keep things safe, we allow the js, css, and fonts folders inside wp-includes.

Google recommends not blocking the JS and CSS it needs to understand a page, which is why blocking /wp-includes/ outright is sometimes discouraged when it causes rendering issues.

Should You Disallow: /cgi-bin/ and Why?

Yes! It is a legacy folder often found in older hosting setups (not specific to WordPress). Blocking this is a precautionary step to prevent access to outdated scripts or unused server-side executables.

Should You Disallow: /trackback/ and Why?

In short, Yes! WordPress trackbacks are outdated and often used for spam. Crawling this endpoint is unnecessary and could lead to spammy links or index bloat.

Should You Disallow: /xmlrpc.php and Why?

Yes! XML-RPC is used for remote publishing and app access, but it is also a frequent target for brute-force and DDoS attacks. Disallowing it removes a path with no SEO value and keeps it out of search results, though robots.txt alone won't stop attackers; if you don't need XML-RPC, disable it completely via the server or a plugin.

Should You Disallow: /?s= and Why?

In most cases yes! This blocks internal search result pages (e.g. /?s=shoes) which often generate thin, low-quality, or duplicate content that is not useful in search results and can waste crawl budget.

Should You Disallow: /search and Why?

In most cases, yes. Similar to /?s=, but blocks pretty permalink versions of search URLs (e.g. /search/shoes/). Again, these are dynamically generated and provide minimal SEO value.

Should You Disallow: /comments/feed/ and Why?

Yes! This RSS feed of recent comments is generally low-value for SEO and mostly intended for RSS readers. Blocking it reduces crawl noise and duplication.

Why Should You Allow: /wp-admin/admin-ajax.php?

This exception is crucial. admin-ajax.php handles AJAX requests (e.g. for dynamic loading, frontend forms, filterable content). Blocking it can break site functionality for users or bots. So we explicitly allow it even while the rest of /wp-admin/ is disallowed.

Additional Optional Directives (Consider Adding)

Disallow: /readme.html

  • Reason: Exposes WordPress version, which can be a security risk. Google doesn’t need to crawl or index it.

Disallow: /license.txt

  • Reason: Not useful for indexing and often reveals CMS info or server software details.

Disallow: /tag/ and /category/

  • Conditional: Block these only if:
    • Tags and categories are poorly maintained
    • They cause duplication (e.g. a product is listed under multiple tags/categories with no unique content)
    • You use breadcrumbs or internal linking as a better structural tool
    • You’ve implemented proper canonical URLs elsewhere
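
Putting It All Together: The Final WordPress robots.txt

As promised earlier, here is the consolidated file that the sections above build up to. Treat it as a starting point rather than a drop-in: the commented /tag/ and /category/ lines are conditional, and the Sitemap URL depends on your SEO plugin.

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cgi-bin/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /?s=
Disallow: /search
Disallow: /comments/feed/
Disallow: /readme.html
Disallow: /license.txt
# Disallow: /tag/        (uncomment only if tag pages are thin or duplicate)
# Disallow: /category/   (uncomment only if category pages are duplicate or unoptimised)
Allow: /wp-admin/admin-ajax.php
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Allow: /wp-includes/fonts/

Sitemap: https://example.com/sitemap_index.xml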

XML Sitemap Best Practices

An effective XML sitemap doesn’t just list URLs—it reflects a strategy. These best practices ensure your sitemap performs well across all search engines and supports your SEO efforts.

Include Only Canonical, Indexable URLs

  • Avoid URLs that are noindexed, canonicalised to other pages, or blocked in robots.txt.
  • Helps search engines avoid wasting crawl budget and prevents confusion.

Keep Sitemaps Under Limits

  • Max 50,000 URLs per sitemap file or 50MB uncompressed.
  • Use a sitemap index file (sitemap_index.xml) if you have more.
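
A sitemap index file is simply a list of child sitemaps, for example:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
    <lastmod>2025-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/product-sitemap.xml</loc>
  </sitemap>
</sitemapindex>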

Use Absolute URLs

  • Always use full URLs (https://example.com/page) instead of relative paths.
  • Ensures clarity and avoids misinterpretation by crawlers.

Set Accurate <lastmod> Dates

  • Helps Google prioritise recently updated content.
  • Only update lastmod if the actual content has changed (not just layout or minor style tweaks).

Split by Page Type

Segment large sites by content type (e.g., post-sitemap.xml, product-sitemap.xml, category-sitemap.xml) to:

  • Track indexation more precisely
  • Submit high-priority sections independently

Update Automatically

  • In WordPress, ensure sitemap plugins regenerate sitemaps on content updates.
  • For custom setups, use scheduled tasks (e.g., cron jobs).
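
For example, a crontab entry that rebuilds the sitemap nightly (the script path is illustrative; it could run the Python sketch shown earlier):

# Rebuild sitemap.xml every night at 02:00
0 2 * * * /usr/bin/python3 /var/www/scripts/generate_sitemap.py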

Use Hreflang Where Applicable

  • For multilingual/multiregional sites, use hreflang attributes inside sitemap or reference hreflang sitemaps from your index file.
  • Helps search engines serve the right regional version in SERPs.
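
Hreflang annotations in a sitemap use the xhtml:link extension. A minimal sketch for an English/German page pair (each URL lists all of its alternates, including itself):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/page/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/page/"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page/"/>
  </url>
  <url>
    <loc>https://example.com/de/page/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/page/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/page/"/>
  </url>
</urlset>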

Submit to Google & Bing

  • Use Google Search Console and Bing Webmaster Tools to submit sitemaps and track issues (e.g., crawl errors, excluded URLs).
  • Check regularly for warnings like “Submitted URL blocked by robots.txt.”

Don’t Include Disallowed URLs

  • Google may ignore sitemaps that list pages disallowed in robots.txt or tagged noindex.
  • Keep your sitemap clean and focused on valuable URLs.

Avoid Orphan Pages

  • A page only in a sitemap but not internally linked is an orphan.
  • Ensure all important pages are also discoverable via internal links.

Use Sitemap Priority and Changefreq (Optional)

  • These tags are largely ignored by Google but may still assist in some search engines or structured internal tooling.
  • Don’t rely on them for crawling behaviour.

Validate Before Submitting

Use tools like:

  • Google’s XML Sitemap Validator
  • Screaming Frog SEO Spider
  • XML Sitemap Validator by SEObility

Conclusion

Understanding and configuring your XML sitemap and robots.txt properly is one of the highest ROI tasks in SEO. Done right, it ensures your site is easily crawlable, indexable, and primed for visibility — no matter which CMS you’re using. Whether you’re running a small blog or managing a multi-language enterprise site, this technical foundation will scale with you.

Need help diagnosing a robots.txt issue or generating a custom sitemap for a complex setup? Drop a comment or contact me — let’s make sure every important page on your site gets the attention it deserves.
