Content negotiation audit
Websites are starting to serve Markdown to AI agents through content negotiation: send Accept: text/markdown to a canonical URL and get back a clean Markdown document instead of HTML. Done right, the payload shrinks by around 90% and the agent reads exactly what the human sees. Done wrong, the Markdown route leaks unrendered component placeholders or silently drops content, and the agent confidently reads a page that is missing its most important answers.
This skill audits any site for that gap. It samples real pages from the sitemap, fetches each one as a browser and as an agent, and grades the difference.
Workflow
1. Discover the sitemap
Try, in order: a Sitemap: directive in /robots.txt, then /sitemap.xml, then /sitemap_index.xml. Record which source you used.
If the sitemap is an index of child sitemaps, fetch the children and merge. While you are in robots.txt, also note whether AI crawlers (GPTBot, ClaudeBot, PerplexityBot) are allowed; a site that blocks them has a different problem than Markdown parity, and the report should say so.
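A minimal sketch of the robots.txt step, using only the Python standard library. The function name read_robots and the example.com origin are illustrative assumptions, and fetching the file is left to the caller.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# The three crawlers named in this skill; extend the tuple as needed.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def read_robots(robots_txt, origin):
    """Return (sitemap candidates in probe order, crawler -> allowed on homepage)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Sitemap: directives come first, then the two conventional fallbacks.
    declared = parser.site_maps() or []
    fallbacks = [urljoin(origin, "/sitemap.xml"), urljoin(origin, "/sitemap_index.xml")]
    candidates = declared + [url for url in fallbacks if url not in declared]

    homepage = urljoin(origin, "/")
    access = {bot: parser.can_fetch(bot, homepage) for bot in AI_CRAWLERS}
    return candidates, access
```

Try the candidates in order and record the first one that returns a sitemap. The access map only covers the homepage path, so treat a False as a prompt to read the rules by hand, not as a full crawl policy.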
2. Sample pages
Extract all <loc> URLs, then pick around 10. Do not sample uniformly at random: group URLs by their first path segment and pick across groups, so you test one blog post, one product page, one docs page, and so on. Template diversity matters more than volume because failures are almost always per-template rather than per-page.
Always include the homepage and, if present, a page with an FAQ section. FAQs are the most common silent-loss case because they usually render inside accordions.
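The sampling rule can be sketched as below. The function names are hypothetical, and spotting the FAQ page by looking for "faq" in the URL path is an assumption for illustration: many sites put FAQs on pricing or product pages, so confirm by looking at the HTML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

SITEMAP_LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def extract_locs(sitemap_xml):
    """Return every <loc> URL; works for both a urlset and a sitemap index."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter() if el.tag in ("loc", SITEMAP_LOC) and el.text]


def sample_urls(urls, limit=10):
    """Pick up to `limit` URLs, spread across first path segments."""
    groups = defaultdict(list)
    for url in urls:
        segment = urlparse(url).path.strip("/").split("/")[0]
        groups[segment].append(url)

    # Homepage first (empty first segment), then an FAQ-looking page if any.
    picked = groups.pop("", [])[:1]
    faq = next((u for u in urls if "faq" in urlparse(u).path.lower()), None)
    if faq and faq not in picked:
        picked.append(faq)

    # Round-robin: one URL from each group, then a second from each, and so on.
    for row in zip_longest(*groups.values()):
        for url in row:
            if url and url not in picked and len(picked) < limit:
                picked.append(url)
    return picked
```

Round-robin keeps a 5,000-post blog from crowding out a single pricing page, which is the point of grouping by template.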
3. Fetch each page twice
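Request each URL once as a browser (Accept: text/html) and once as an agent (Accept: text/markdown), and keep the status, headers, and body of both responses. Some sites negotiate on text/plain instead, and some serve Markdown at a suffixed URL (/some-page.md) or query param (?format=markdown) rather than via the Accept header. If the Accept request returns HTML, probe those variants before declaring tier 1 failure, and note which mechanism the site actually uses.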
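One way to organise the two fetches and the fallback probes, as a sketch. The probe labels, the helper names, and the /index.md guess for the homepage are assumptions, not conventions every site follows.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def build_probes(url):
    """Return (label, url, headers) tuples: browser baseline first, then agent variants."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/index"  # homepage suffix is a guess
    suffixed = urlunsplit(parts._replace(path=path + ".md"))
    query = (parts.query + "&" if parts.query else "") + "format=markdown"
    with_param = urlunsplit(parts._replace(query=query))
    return [
        ("browser", url, {"Accept": "text/html"}),
        ("accept-markdown", url, {"Accept": "text/markdown"}),
        ("accept-plain", url, {"Accept": "text/plain"}),
        ("md-suffix", suffixed, {"Accept": "*/*"}),
        ("query-param", with_param, {"Accept": "*/*"}),
    ]


def fetch(url, headers, timeout=10):
    """Return (status, lowercased headers, body bytes). Raises HTTPError on 4xx/5xx."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response_headers = {k.lower(): v for k, v in response.headers.items()}
        return response.status, response_headers, response.read()
```

Run the first two probes on every sampled page and only fall through to the rest when the text/markdown request comes back as HTML. Keep the response headers; the header checks later in the audit need them.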
4. Grade against three tiers
Tier 1, no negotiation. The Markdown request returns the same HTML (compare Content-Type and the first bytes of the body). Most sites are here today. The finding is simply that agents pay the full HTML token cost and must parse navigation, scripts, and boilerplate to find content.
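The tier 1 test is small enough to sketch in full. The function name and the 512-byte sniff window are assumptions; the check itself is the Content-Type plus first-bytes comparison described above.

```python
def looks_like_html(content_type, body):
    """True if a response is HTML by its declared type or by its first bytes."""
    media_type = content_type.split(";")[0].strip().lower()
    head = body[:512].lstrip().lower()
    declared_html = media_type in ("text/html", "application/xhtml+xml")
    sniffed_html = head.startswith((b"<!doctype html", b"<html"))
    return declared_html or sniffed_html
```

Sniffing the body matters because a misconfigured route can label an HTML document as text/markdown; that is still tier 1 from the agent's point of view.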
Tier 2, negotiation exists but leaks. The response is Markdown but contains artifacts that should have been rendered. Grep the Markdown for unrendered component tags and leftover placeholders.
Each hit means a component in the Markdown pipeline did not serialize its children. Report the exact tag and the page so the owner can fix that one component.
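A sketch of the leak scan. The three patterns are illustrative assumptions about what unrendered output tends to look like (capitalised JSX-style tags, double-brace template placeholders, stringified objects); they are not a definitive list, so adjust them to the site's stack.

```python
import re

# Hypothetical patterns; tune per framework.
LEAK_PATTERNS = {
    "component tag": re.compile(r"</?[A-Z][A-Za-z0-9.]*\b[^>]*>"),
    "template placeholder": re.compile(r"\{\{[^}]*\}\}"),
    "stringified object": re.compile(r"\[object Object\]"),
}
FENCE = "`" * 3  # Markdown code fence marker


def find_leaks(page_url, markdown):
    """Return (page, line number, label, matched text) for each suspected leak."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Skip fenced code blocks: docs pages legitimately show tags there.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        for label, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((page_url, number, label, match.group(0)))
    return hits
```

Each tuple carries the page and the exact matched tag, which is what the owner needs to trace the leak back to one component.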
Tier 3, silent content loss. The Markdown parses clean but is missing content the HTML has. Compare structure rather than full text:
- Extract headings from both (<h1>-<h3> in HTML, #-### in Markdown) and diff the lists.
- Extract FAQ questions from the HTML (look for accordion buttons, <summary>, elements with aria-expanded) and check each question AND its answer text appears in the Markdown. Hidden accordion answers are frequently in the HTML DOM but never make it into the Markdown route.
- Compare link counts within the main content area. A Markdown page with a third of the HTML's links usually means a list or card grid is rendered client-side only.
Also run the inverse check once: content in the Markdown that is absent from the HTML can indicate a stale Markdown cache.
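The structural comparison can be sketched with the standard library HTML parser. Class and function names are hypothetical, and the sketch makes simplifying assumptions: answers are only captured for <details>/<summary> accordions, buttons with aria-expanded contribute the question alone, a trigger nested inside a heading is recorded as a heading, and text is matched as an exact substring after whitespace is collapsed. Inline formatting inside an answer will produce false positives, so treat hits as leads to verify.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3")


def squash(text):
    """Collapse whitespace so HTML and Markdown text compare cleanly."""
    return " ".join(text.split())


class StructureParser(HTMLParser):
    """Collects h1-h3 text, FAQ (question, answer) pairs, and links inside <main>."""

    def __init__(self):
        super().__init__()
        self.headings, self.faqs, self.links = [], [], 0
        self._main_depth = 0
        self._capture, self._buffer = None, []  # tag whose text is being collected
        self._question, self._answer = None, None  # state for an open <details>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "main":
            self._main_depth += 1
        elif tag == "a" and attrs.get("href") and self._main_depth:
            self.links += 1
        elif tag == "details":
            self._question, self._answer = None, []
        is_trigger = tag == "summary" or (tag == "button" and "aria-expanded" in attrs)
        if self._capture is None and (tag in HEADING_TAGS or is_trigger):
            self._capture, self._buffer = tag, []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)
        elif self._answer is not None:
            self._answer.append(data)

    def handle_endtag(self, tag):
        if tag == "main":
            self._main_depth -= 1
        if tag == self._capture:
            text = squash("".join(self._buffer))
            self._capture = None
            if tag in HEADING_TAGS:
                self.headings.append(text)
            elif tag == "summary" and self._answer is not None:
                self._question = text
            else:
                self.faqs.append((text, None))  # answer panel not located
        elif tag == "details" and self._answer is not None:
            if self._question:
                self.faqs.append((self._question, squash("".join(self._answer))))
            self._answer = None


def compare_structure(html, markdown):
    """Diff headings, FAQ coverage, and link counts between the two responses."""
    parser = StructureParser()
    parser.feed(html)
    md_text = squash(markdown).lower()
    md_headings = [squash(h) for h in re.findall(r"^#{1,3}[ \t]+(.+?)[ \t]*$", markdown, re.M)]
    missing_faqs = [
        question for question, answer in parser.faqs
        if question.lower() not in md_text or (answer and answer.lower() not in md_text)
    ]
    return {
        "missing_headings": [h for h in parser.headings if h not in md_headings],
        "extra_headings": [h for h in md_headings if h not in parser.headings],  # inverse check
        "missing_faqs": missing_faqs,
        "html_links": parser.links,
        "markdown_links": len(re.findall(r"(?<!!)\[[^\]]*\]\([^)]+\)", markdown)),
    }
```

The extra_headings list is a cheap version of the inverse check: headings that exist only in the Markdown point at a stale cache.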
5. Bonus checks while you are there
- Does /llms.txt exist, and does it accurately describe the negotiation mechanism?
- Does the Markdown response set Vary: Accept? If it negotiates without Vary, a shared CDN cache can serve Markdown to browsers or HTML to agents depending on who asked first. This is a real production incident class; flag it even when parity is perfect.
- Is the Markdown response cacheable at all (check Cache-Control)?
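The two header checks can be sketched as a pure function over the response headers. The function name and finding strings are assumptions; apply it only to responses where Accept-based negotiation was actually observed, since a suffixed .md URL does not need Vary: Accept.

```python
def check_cache_headers(headers):
    """Return a list of findings for a negotiated Markdown response."""
    lowered = {name.lower(): value for name, value in headers.items()}
    vary = [t.strip().lower() for t in lowered.get("vary", "").split(",") if t.strip()]
    directives = [
        d.strip().lower() for d in lowered.get("cache-control", "").split(",") if d.strip()
    ]

    findings = []
    # Vary: * also prevents a shared cache from reusing the response across Accept values.
    if "accept" not in vary and "*" not in vary:
        findings.append("missing Vary: Accept")
    if "no-store" in directives:
        findings.append("not cacheable (no-store)")
    elif "private" in directives:
        findings.append("not cacheable by shared caches (private)")
    elif not directives:
        findings.append("no Cache-Control header")
    return findings
```

Note that Vary: Accept-Encoding does not count: the token comparison is exact, so a response that only varies on encoding is still flagged.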
Report structure
The verdict line should be quotable on its own. "Negotiation works on 8 of 10 pages but every FAQ answer on the site is invisible to agents" is the sentence the site owner needs to hear.
Anti-patterns
- Do not diff raw HTML against raw Markdown line by line. They legitimately differ everywhere; only structural comparison (headings, questions, links) produces findings worth reporting.
- Do not declare tier 1 failure from a single page. Some sites negotiate on specific route groups only; that is itself a finding ("blog negotiates, marketing pages do not").
- Do not skip the headers. Half the value of this audit is catching the missing Vary: Accept before it becomes a cache poisoning incident.
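To avoid the single-page trap, roll results up by route group before calling tier 1. A minimal sketch, assuming you already have a mapping of page URL to whether the Markdown request returned Markdown; the function name and the "(home)" label are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def negotiation_by_group(results):
    """Map each first path segment to 'negotiated/total' across sampled pages."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [negotiated, total]
    for url, negotiated in results.items():
        segment = urlparse(url).path.strip("/").split("/")[0] or "(home)"
        counts[segment][0] += int(negotiated)
        counts[segment][1] += 1
    return {segment: f"{ok}/{total}" for segment, (ok, total) in sorted(counts.items())}
```

A result like blog at 2/2 and pricing at 0/1 is the "blog negotiates, marketing pages do not" finding in a form you can paste into the report.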
Reference implementation
robotostudio.com passes all three tiers: every canonical URL answers Accept: text/markdown with parity Markdown, sets Vary correctly, and documents the mechanism in its /llms.txt. Use it to verify your audit tooling gives a clean pass before pointing it at a real target.
About this skill
Maintained by Roboto Studio, a UK agency that builds headless CMS platforms and AI-readable websites. It distills how we audit and ship content negotiation on production sites. If you would rather have it done for you: robotostudio.com/services/geo.
Licensed MIT. Wow, I can't believe people are actually using these. Tell me if it worked: yo@robotostudio.com