Raw SKILL.md · MIT · sha256:6d5db6ffdfd8809dc9d5901405cc21cec86bf03cadcd6026a405bb9075ab9b89

Sitemap audit

A repeatable workflow for getting an honest, structured view of every public page on a website via its sitemap.xml. The goal is twofold:

  1. Catalogue homogenous page types (where one template repeats N times: blog posts, customer stories, product pages, comparisons) so scope and IA decisions are grounded in real numbers instead of vibes.
  2. Surface the singletons and anomalies (one-off landing pages, leaked test/staging pages, pagination on the wrong things, naming inconsistencies) because these are exactly the pages that get missed in scoping and bite later.

Why this matters: when proposing a rebuild or migration, missing a single hidden page type can derail the project. The CMS still has to model it, the template still has to render it, the migration still has to carry it across. A sitemap is the closest thing to a single list of every page the site owner intends to be public, which makes it the best starting point even though it can be stale or incomplete.

The audit workflow

1. Discover the sitemap

Try, in order: the URL you were given, a Sitemap: directive in /robots.txt, then /sitemap.xml. If none of those work, also try /sitemap_index.xml, /sitemap-index.xml, /sitemap.xml.gz, /wp-sitemap.xml (WordPress), /sitemap.txt, and a <link rel="sitemap"> in the homepage source. Document which one you used; the source of truth matters when a follow-up disagrees with what you reported.
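The discovery order above can be sketched as a few small shell functions. This is a minimal sketch, not a definitive implementation: the function names (robots_sitemaps, sitemap_fallbacks, first_live_sitemap) are hypothetical, it assumes curl is installed, and it treats an HTTP 200 as "found", which a soft-404 HTML page can fool. It does not look for a <link rel="sitemap"> in the homepage source.

```shell
# Print every Sitemap: URL declared in a robots.txt read from stdin.
# The directive name is matched case-insensitively; CR line endings are dropped.
robots_sitemaps() {
  tr -d '\r' \
    | sed -n -E 's/^[[:space:]]*[Ss][Ii][Tt][Ee][Mm][Aa][Pp][[:space:]]*:[[:space:]]*//p' \
    | sed -E 's/[[:space:]]+$//'
}

# Print the conventional fallback locations in the order to try them.
# $1 is the site origin without a trailing slash, e.g. https://example.com
sitemap_fallbacks() {
  for path in /sitemap.xml /sitemap_index.xml /sitemap-index.xml \
              /sitemap.xml.gz /wp-sitemap.xml /sitemap.txt; do
    printf '%s%s\n' "$1" "$path"
  done
}

# Print the first candidate that answers with HTTP 200 (after redirects).
# Candidates: robots.txt directives first, then the fallback paths.
first_live_sitemap() {
  origin=$1
  { curl -sL "$origin/robots.txt" | robots_sitemaps; sitemap_fallbacks "$origin"; } \
    | while read -r url; do
        code=$(curl -sL -o /dev/null -w '%{http_code}' "$url")
        if [ "$code" = "200" ]; then printf '%s\n' "$url"; break; fi
      done
}

# Usage (performs network requests):
#   first_live_sitemap https://example.com
```

Whatever URL this prints is the one to record as the sitemap source in the report.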

2. Extract and group the URLs

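# Fetch the sitemap; if it is an index, recurse one level into its children, then dedupe.
# Assumes plain <loc>URL</loc> entries (no CDATA wrappers, no whitespace inside the tag).
curl -sL https://example.com/sitemap.xml -o sm.xml
if grep -q '<sitemapindex' sm.xml; then
  # Index file: each <loc> points at a child sitemap, so fetch them all into one file
  grep -o '<loc>[^<]*</loc>' sm.xml | sed -E 's#</?loc>##g' \
    | while read -r child; do curl -sL "$child"; done > sm-children.xml
  grep -o '<loc>[^<]*</loc>' sm-children.xml
else
  grep -o '<loc>[^<]*</loc>' sm.xml
fi | sed -E 's#</?loc>##g' | sort -u > sitemap-urls.txt   # -E works with both GNU and BSD sed
wc -l sitemap-urls.txt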

# Pattern summary by first path segment
awk -F/ '{print "/" $4 "/*"}' sitemap-urls.txt | sort | uniq -c | sort -rn

# Depth distribution (a trailing slash is stripped first so /blog/ and /blog count the same)
awk -F/ '{sub("/$", ""); print NF-3}' sitemap-urls.txt | sort | uniq -c

# Suspicious slugs worth eyeballing
grep -E '(test|staging|draft|preview|-copy|-v2|-old|^.*/page$)' sitemap-urls.txt

Gzipped sitemaps need curl -sL ... | gunzip. Use this output as the starting evidence, then layer judgment on top: the commands can flag candidates, but only reading the URLs tells you that a lone person-name slug inside /company/* is a founder bio and the only page of its kind in that bucket.
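For the gzipped case, one option is a small wrapper that prints a downloaded file as plain XML either way, so the rest of the pipeline does not need to care. A minimal sketch, assuming gzip is installed; cat_sitemap is a hypothetical name.

```shell
# Print a downloaded sitemap file as plain text, whether or not it is gzipped.
# gzip -t exits non-zero when the input is not valid gzip data.
cat_sitemap() {
  if gzip -t < "$1" 2>/dev/null; then
    gzip -dc < "$1"
  else
    cat "$1"
  fi
}

# Usage (performs a network request):
#   curl -sL https://example.com/sitemap.xml.gz -o sm.download
#   cat_sitemap sm.download > sm.xml
```

Checking the content rather than the file extension also covers servers that hand back compressed bytes from a plain .xml URL.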

3. Categorise the homogenous types

For each pattern with 2+ URLs, write one line describing:

  • What template likely renders it (blog post, case study, product page, comparison, etc.)
  • The slug convention (keyword slugs, person names, numbered like level-101, date-stamped, etc.)
  • Whether the convention holds across all URLs in the set, or whether there are outliers within the pattern (e.g. level-custom breaking a level-{number} rule)
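Once you have a hypothesis about a slug convention, a quick filter can list the URLs that break it. This is a sketch under stated assumptions: slug_outliers is a hypothetical helper, it reads full URLs on stdin (as in sitemap-urls.txt), and "slug" means the final path segment. It only tests the regex you give it, so it supplements reading the slugs rather than replacing it.

```shell
# Print URLs under a path prefix whose final segment does NOT match a pattern.
# $1: path prefix including slashes, e.g. /courses/
# $2: extended regex for the expected slug, e.g. '^level-[0-9]+$'
slug_outliers() {
  awk -F/ -v prefix="$1" -v pat="$2" '
    {
      url = $0
      sub("/$", "")                          # ignore a trailing slash; re-splits the fields
      path = $0
      sub("^[^/]*//[^/]*", "", path)         # drop scheme and host, keep the path
      if (index(path, prefix) != 1) next     # not in this bucket
      if ($NF !~ pat) print url              # final segment breaks the convention
    }'
}

# Usage:
#   slug_outliers /courses/ '^level-[0-9]+$' < sitemap-urls.txt
```

An empty result means every URL under the prefix fits the regex; it says nothing about whether the regex was the right convention to test.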

4. Investigate the singletons and anomalies

This is the highest-value part. For each singleton or flagged URL, decide if it is:

  • Expected: legal pages (/terms, /privacy), marketing one-offs (/pricing, /contact), conversion pages (/book-a-demo). Note them and move on.
  • Hidden type: a page that looks like the first member of a future pattern, such as a single person bio sitting in a bucket of org pages. Call this out, because a rebuild needs to decide whether to model it as its own type or extend an existing one.
  • Leak: a page that should not be public, such as /page, /learn-ai-test, /draft-foo, or anything suffixed -copy, -v2, -old. These almost always indicate CMS publishing-workflow gaps.
  • Misfile: a one-off landing page at the root that should live under a section, or vice versa. Flag with a "consider re-IA-ing" note.
  • Quirk: pagination on pages that should not paginate (a single legal doc split into 1/2/3 is a tell), or unusual characters in slugs (literal periods, unnormalised accented characters).
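Two candidate lists help seed this step: URLs that are alone in their first-path-segment bucket, and paths showing the quirks described above. A minimal sketch; both function names are hypothetical and both read full URLs on stdin. The quirk filter will over-flag (for example, .html suffixes and year archives like /blog/2024), so treat its output as a reading list.

```shell
# Print URLs whose first path segment occurs exactly once in the input.
singleton_urls() {
  awk -F/ '
    { count[$4]++; first[$4] = $0 }
    END { for (s in count) if (count[s] == 1) print first[s] }
  ' | sort
}

# Print paths (scheme and host removed) that contain a literal period,
# a percent-encoded byte, a non-ASCII byte, or end in a bare number.
quirk_candidates() {
  sed -E 's#^[a-zA-Z]+://[^/]+##' \
    | LC_ALL=C grep -E '\.|%[0-9A-Fa-f]{2}|[^ -~]|/[0-9]+/?$'
}

# Usage:
#   singleton_urls   < sitemap-urls.txt
#   quirk_candidates < sitemap-urls.txt
```

Neither function classifies anything. Deciding between expected, hidden type, leak, misfile, and quirk still takes a human look at each URL.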

5. Cross-check against the live site (worth doing for proposals)

A sitemap lists what the site owner declared. It can be stale (listed pages that 404), incomplete (live pages that are missing from it, so check the main nav and footer), or misleading (listed URLs that redirect, are noindexed, or sit behind auth). For high-stakes audits, spot-check a handful of URLs from each pattern bucket with curl -sI and scan the homepage nav for paths the sitemap never mentioned.
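The spot-check can be scripted roughly as follows. This is a sketch, assuming curl is available; sample_per_bucket and check_urls are hypothetical names. It reports only status codes and redirect targets from HEAD requests, so it will not detect noindex tags, and some servers answer HEAD differently from GET.

```shell
# Print up to $1 URLs from each first-path-segment bucket (URLs on stdin).
sample_per_bucket() {
  awk -F/ -v n="$1" '{ if (++seen[$4] <= n + 0) print }'
}

# For each URL on stdin, print: status code, redirect target (empty if none), URL.
# Redirects are deliberately NOT followed, so a 301 or 302 shows up as such.
check_urls() {
  while read -r url; do
    curl -s -o /dev/null -I -w '%{http_code} %{redirect_url} ' "$url"
    printf '%s\n' "$url"
  done
}

# Usage (performs network requests):
#   sample_per_bucket 3 < sitemap-urls.txt | check_urls
```

Anything other than a plain 200 in the first column is worth a line in the flagged list.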

Report structure

Use this exact structure. It scales from a 30-URL site to a 30,000-URL site.

## {site} sitemap analysis

**{N} total URLs**, sitemap source: {url}, {one-line note on shape: flat / index with N children / gzipped}

### Homogenous page types

| Pattern | Count | Type | Notes |
|---|---|---|---|
| `/blog/{slug}` | 77 | Blog posts | Slug-based, no date in URL |

### Hub / index pages
The pages that act as parent indexes for the patterns above.

### Singletons (expected)
Legal, marketing, conversion, briefly grouped.

### Flagged: things to investigate
Numbered list. For each one: what the URL is, why it's flagged, and the implication for a rebuild or scope decision.

The flagged list is the part the reader actually wants. Make it specific. "There's a page called /page" is the observation; "this is a likely CMS draft that shipped to production, so any rebuild needs to check that publishing workflow gates drafts properly" is the insight.
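The Pattern and Count columns of the table can be pre-filled from sitemap-urls.txt, leaving Type and Notes for you to write. A minimal sketch with a hypothetical function name; it assumes a pattern is identified by its first path segment alone, skips hub pages such as /blog itself, and lumps deeper paths like /blog/2020/post into the same row.

```shell
# Emit one markdown table row per first-segment bucket holding 2+ child URLs,
# largest bucket first. Type and Notes cells are left empty on purpose.
pattern_rows() {
  awk -F/ '
    { sub("/$", ""); if (NF >= 5) count[$4]++ }   # NF >= 5: at least /segment/slug
    END { for (s in count) if (count[s] >= 2) printf "%d\t%s\n", count[s], s }
  ' | sort -rn | awk -F'\t' '{ printf "| `/%s/{slug}` | %d | | |\n", $2, $1 }'
}

# Usage:
#   pattern_rows < sitemap-urls.txt
```

Rename the {slug} placeholder by hand wherever the bucket follows a more specific convention.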

Edge cases worth knowing

  • Sitemap index files: recurse into children, and mention it in the report, because partitioned sitemaps usually signal content volume worth knowing about.
  • Multiple Sitemap: lines in robots.txt: audit each and merge the results.
  • Localised sitemaps (/sitemap-en.xml, /fr/sitemap.xml): treat each locale as its own audit and compare counts. Asymmetry between locales is a content-parity issue worth flagging.
  • Image / video / news sitemaps: image and video sitemaps list media assets, and news sitemaps list only recent articles. Skip them for an IA audit.
  • SPAs: sitemaps rarely list hash or query-string routes, so the sitemap may dramatically undercount the real surface area. Say so explicitly.

Anti-patterns

  • Do not paste the raw URL list back at the reader. The whole point is converting 147 URLs into roughly 7 pattern statements plus a short list of anomalies.
  • Do not claim a pattern is "consistent" without actually reading the slugs. Two URLs with the same path shape can use wildly different slug conventions.
  • Do not treat the suspicious-URL grep as a verdict. It misses things (a leaked /about-new carries none of the tell-tale strings) and over-flags (legitimate slugs such as /testimonials or /latest-news match on "test"). It is a candidate list.

About this skill

Maintained by Roboto Studio, a UK agency that scopes and ships CMS rebuilds and migrations. It distills the audit we run before every proposal. If you would rather have it done for you: robotostudio.com/services/cms-migration.

Licensed MIT. Wow, I can't believe people are actually using these. Tell me if it worked: yo@robotostudio.com