How to document website accessibility evidence that holds up

OverlayRiskWitness Team
Evidence engineering

Learn how to document website accessibility evidence that survives a legal challenge or procurement audit — reproducible test runs, timestamps, snapshot hashes, and claims quoted back against observations.

When accessibility litigation or a procurement audit moves past the initial request, the first question from counsel is almost always the same: can you show me when the test ran, what the page looked like at that moment, and exactly which rule failed? Documentation that cannot answer those three questions does not survive a challenge. This guide covers the principles behind reproducible accessibility evidence, the structural choices that make it defensible, and the practical workflow for capturing it on sites that use accessibility overlays — the context where the gap between a public claim and a live page tends to be widest.

Start with a reproducible environment

Every reproducible test starts with a controlled environment. For web accessibility that means: the same browser engine, the same viewport, the same WCAG rule set, and the same test execution order every time you run it. Variation on any of those factors can shift a finding from one state to another with no change to the underlying page. That ambiguity is exactly what a challenge will exploit.

Load the page in a real hosted browser rather than a headless simulator. Accessibility overlays — widgets like accessiBe and UserWay that inject runtime fixes into a page — frequently detect headless contexts and behave differently than they do for an ordinary visitor. A test that runs outside a real browser session may not see the overlay at all, which would make the comparison meaningless. A hosted browser service, pinned to a specific version, runs each pass in an environment that is isolated and consistent. When the same page is retested sixty days later, the engine, the viewport, and the rule set are identical to what ran the first time.
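One way to make the "same engine, same viewport, same rule set" requirement checkable is to record the environment alongside every run and compare it at retest time. The following is a minimal sketch, not OverlayRiskWitness's actual schema: the `RunEnvironment` class, its field names, and the helper functions are all hypothetical, and the version strings used with it would be placeholders.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class RunEnvironment:
    """Hypothetical record of everything that must stay constant between runs."""
    browser_engine: str        # e.g. "chromium"
    browser_version: str       # pinned exactly, never "latest"
    viewport_width: int
    viewport_height: int
    rule_engine_version: str   # version of the WCAG rule engine
    rule_tags: tuple           # rule set, kept in a fixed order


def environment_fingerprint(env: RunEnvironment) -> str:
    """Hash a canonical JSON form of the environment for quick comparison."""
    canonical = json.dumps(asdict(env), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def environment_drift(first: RunEnvironment, second: RunEnvironment) -> list:
    """Return the names of the fields that differ between two runs."""
    a, b = asdict(first), asdict(second)
    return [name for name in a if a[name] != b[name]]
```

If `environment_drift` returns anything other than an empty list at retest time, a changed finding cannot be attributed to the page alone, so the retest should be rerun in the original environment before the two results are compared.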

The two-pass structure: overlay off and overlay on

The most important structural choice in accessibility evidence for overlay-bearing sites is the two-pass capture. The first pass loads the page with the overlay blocked at the network layer — the page as it ships from the server, without the vendor script running. The second pass loads the exact same URL with the overlay active and given time to inject its changes. Both passes run the same WCAG rule engine, and the two result sets are diffed rule by rule.

The two-pass structure matters because the overlay is the thing making the compliance claim. If a vendor script asserts that it brings a site into WCAG 2.1 AA conformance, the relevant question is not what the underlying page looks like without any overlay — it is whether the rule passes when the overlay is present and running. The baseline pass (overlay off) establishes what the site ships. The witness pass (overlay on) is where the claim either holds or does not. That distinction is the whole comparison.

[Figure: the public URL is loaded twice. Pass 1: overlay off (overlay blocked), then axe-core. Pass 2: overlay on (overlay active), then axe-core. The two result sets feed a per-rule diff: Fixed · Broken · No effect.]
Each witness run loads the same URL twice in the same real browser — once with the overlay blocked, once with it active — then diffs the two axe-core result sets per rule.
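The per-rule diff can be sketched in a few lines. This example assumes each pass produced a standard axe-core results object, in which `violations` is a list of rule results that each carry an `id`. The function names and the transition strings `fixed` and `broken` are illustrative choices based on the diagram labels, and rules reported as `incomplete` are deliberately ignored here to keep the sketch short.

```python
def failing_rule_ids(axe_results: dict) -> set:
    """Collect the ids of rules that axe-core reported as violations."""
    return {rule["id"] for rule in axe_results.get("violations", [])}


def diff_passes(overlay_off: dict, overlay_on: dict) -> dict:
    """Label every rule that failed in either pass with a transition."""
    off_failed = failing_rule_ids(overlay_off)
    on_failed = failing_rule_ids(overlay_on)
    transitions = {}
    for rule_id in sorted(off_failed | on_failed):
        if rule_id in off_failed and rule_id not in on_failed:
            transitions[rule_id] = "fixed"      # violation gone with overlay active
        elif rule_id not in off_failed and rule_id in on_failed:
            transitions[rule_id] = "broken"     # violation appears only with overlay
        else:
            transitions[rule_id] = "no_effect"  # violation present in both passes
    return transitions


# Hand-written sample results, not output from a real scan.
baseline_pass = {"violations": [{"id": "color-contrast"}, {"id": "image-alt"}]}
witness_pass = {"violations": [{"id": "color-contrast"}, {"id": "aria-hidden-focus"}]}
sample_diff = diff_passes(baseline_pass, witness_pass)
```

With the sample data, `color-contrast` is labeled `no_effect`, `image-alt` is labeled `fixed`, and `aria-hidden-focus` is labeled `broken`. Sorting the rule ids keeps the diff output stable between runs, which makes two packets easier to compare.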

Finding states and what they commit you to

Reducing an axe-core diff to a usable finding requires a clear, fixed vocabulary. OverlayRiskWitness uses three states, and the vocabulary matters precisely because each state commits the documentation to a specific, bounded claim — nothing more and nothing less.

  • Held up: the rule passed with the overlay active. The claim holds on this rule, on this page, at this moment. It is not a site-wide clearance; it is one observation on one pass.
  • Did not hold up: the rule failed with the overlay active, meaning the overlay either did not correct an existing violation or introduced a new one. This is the state that surfaces a gap between a public statement and a live observation.
  • Not testable: the rule engine could not evaluate the rule in one or both passes — typically because the relevant element was absent, obscured, or behind an authentication wall. Not testable is an explicit gap in evidence, not a passing result. It must be logged rather than omitted.
[Figure: the three finding states. Held up: overlay supported the claim. Did not hold up: claim not supported on live page. Not testable: rule could not be evaluated.]
Three states on every finding. "Not testable" is recorded explicitly because a missing evaluation is not equivalent to a passing one.

One state per run, not per site

Finding states are scoped to a single run on a single page. A site with ten pages can carry a mix of held up and did not hold up states across different rules and pages. The evidence packet preserves that granularity — collapsing across pages into a site-level score would hide the specific findings that carry the most weight.
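The three states and their per-page scope can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: it reads the standard axe-core result arrays (`violations`, `passes`, `incomplete`), treats a rule that appears in none of them as unevaluated, and uses hypothetical function names and state strings.

```python
from collections import defaultdict

HELD_UP = "held_up"
DID_NOT_HOLD_UP = "did_not_hold_up"
NOT_TESTABLE = "not_testable"


def rule_outcome(axe_results: dict, rule_id: str) -> str:
    """Return 'fail', 'pass', or 'unknown' for one rule in one pass."""
    def ids(key):
        return {rule["id"] for rule in axe_results.get(key, [])}

    if rule_id in ids("violations"):
        return "fail"
    if rule_id in ids("incomplete"):
        return "unknown"   # the engine could not decide
    if rule_id in ids("passes"):
        return "pass"
    return "unknown"       # inapplicable or absent: nothing was evaluated


def finding_state(overlay_off: dict, overlay_on: dict, rule_id: str) -> str:
    """Map the two passes for one rule onto one of the three states."""
    outcomes = (rule_outcome(overlay_off, rule_id), rule_outcome(overlay_on, rule_id))
    if "unknown" in outcomes:
        return NOT_TESTABLE  # logged explicitly, never treated as a pass
    return HELD_UP if outcomes[1] == "pass" else DID_NOT_HOLD_UP


def states_by_page(runs: list, rule_ids: list) -> dict:
    """Keep one state per (page, rule) pair instead of a site-level score."""
    table = defaultdict(dict)
    for run in runs:
        for rule_id in rule_ids:
            table[run["url"]][rule_id] = finding_state(
                run["overlay_off"], run["overlay_on"], rule_id
            )
    return dict(table)
```

Returning a nested mapping keyed by page and then by rule keeps every finding individually addressable, which is the granularity the packet is meant to preserve.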

Teams that need a plain-language way to explain that difference internally can also use "Website Accessibility Scores: What a 0–100 Number Can Show — and What It Still Can't Prove" before they turn a scanner number into a stand-in for the packet itself.

Anchoring each observation to the public claim it tests

The second structural requirement is quoting the site's own accessibility statement alongside each finding. A public claim — "this site conforms to WCAG 2.1 Level AA" or "all form fields are accessible to screen readers" — is what a potential plaintiff, auditor, or procurement reviewer reads before they start testing. An observation that does not reference the specific claim it is evaluating is floating evidence. It documents that a rule failed but not what commitment that failure contradicts.

The pairing works like this: the witness locates the site's published accessibility statement, extracts the relevant claim language verbatim, and stores it in the same exhibit as the axe-core finding it relates to. If the page fails color-contrast with the overlay active, and the accessibility statement says all text meets contrast requirements, those two data points live in the same row of the exhibit. You do not need to cross-reference separate documents to see the discrepancy. The gap is visible in a single exhibit.
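A single exhibit row that holds both the quoted claim and the observation might look like the sketch below. The `ExhibitRow` class, its field names, and `build_exhibit_row` are hypothetical and are not the packet's real format. The one behavior worth noting is the verbatim check: the row is rejected if the claim text is not an exact substring of the published statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExhibitRow:
    """Hypothetical exhibit row: the claim and the observation side by side."""
    page_url: str
    claim_text: str              # quoted verbatim from the statement
    claim_source: str            # where the statement was published
    rule_id: str                 # axe-core rule the observation refers to
    violations_overlay_off: int
    violations_overlay_on: int


def build_exhibit_row(statement_text: str, claim_text: str, claim_source: str,
                      page_url: str, rule_id: str,
                      off_count: int, on_count: int) -> ExhibitRow:
    """Build a row, refusing any claim that is paraphrased rather than quoted."""
    if claim_text not in statement_text:
        raise ValueError("claim must be quoted verbatim from the published statement")
    return ExhibitRow(page_url, claim_text, claim_source, rule_id, off_count, on_count)


# Illustrative values only; the URL is a placeholder.
statement = ("Our commitment. This site meets WCAG 2.1 AA and is fully "
             "keyboard accessible. Contact us with feedback.")
row = build_exhibit_row(
    statement_text=statement,
    claim_text="This site meets WCAG 2.1 AA and is fully keyboard accessible.",
    claim_source="/accessibility",
    page_url="https://example.com/",
    rule_id="color-contrast",
    off_count=4,
    on_count=4,
)
```

Rejecting paraphrased claims at construction time means a reviewer never has to wonder whether the quoted language was softened or sharpened on its way into the exhibit.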

This is not a legal judgment about what that gap means. It is careful documentation of what the site says and what an objective rule engine observed when it checked. Interpretation is for counsel. But counsel cannot interpret a gap they cannot clearly see.

Once that gap is visible, the next step is usually tightening the public claim itself. "Writing an accessibility statement when you use an overlay" shows how to keep the statement aligned with named-page evidence instead of vendor-template language.

Timestamps, snapshot hashes, and chain of custody

[Figure: Evidence Exhibit, overlayrisk.com/packet/run_01J...]
PUBLIC CLAIM: "This site meets WCAG 2.1 AA and is fully keyboard accessible." Source: /accessibility · detected overlay: accessiBe
OBSERVATION: OVERLAY OFF, color-contrast: 4 violations vs OVERLAY ON, color-contrast: 4 violations
STATE: DID NOT HOLD UP (no_effect transition)
TIMESTAMP + SNAPSHOT HASH: UTC 2026-06-27T14:32:07Z · sha256: 3a7f2c1e9b04d8a6…
A Risk Packet: paired axe captures for each tested page, the site's own claim language quoted into each exhibit, per-finding state, UTC timestamp, and DOM snapshot hash.

Two metadata fields make evidence defensible beyond its first reading: a UTC timestamp and a DOM snapshot hash. Both belong on every exhibit, not just the ones that did not hold up.

The timestamp records the exact moment the test ran. Web pages change — overlay vendors push script updates, teams redeploy pages, marketing swaps hero content. A timestamp lets a reviewer say "on this date and at this time, the rule was failing" rather than asserting something about the page in the abstract. If the site later updates its overlay or restructures the page, the timestamped record remains anchored to that specific moment. That is the point.

The DOM snapshot hash is a content fingerprint of the page state at the time the rule engine ran. It allows anyone to verify that the captured DOM corresponds to what the evaluation actually saw. Without a hash, an axe-core output is text produced by a tool at some unspecified moment. With a hash, it is a verifiable record tied to a specific page state. Together, the timestamp and hash form the chain of custody for a single exhibit — the two properties that let evidence be checked, not just asserted.
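Both fields are cheap to produce with a standard library. The sketch below assumes the DOM snapshot is available as a serialized HTML string and that it is hashed as UTF-8 bytes with SHA-256. The function and key names are hypothetical, and the exact serialization a real witness hashes may differ.

```python
import hashlib
from datetime import datetime, timezone


def snapshot_hash(dom_html: str) -> str:
    """SHA-256 fingerprint of the serialized DOM, as a hex string."""
    return hashlib.sha256(dom_html.encode("utf-8")).hexdigest()


def stamp(dom_html: str, now: datetime = None) -> dict:
    """Attach a UTC timestamp and a snapshot hash to one exhibit."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return {
        "timestamp_utc": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sha256": snapshot_hash(dom_html),
    }


def verify(dom_html: str, record: dict) -> bool:
    """Check that a stored snapshot still matches the hash in its record."""
    return snapshot_hash(dom_html) == record["sha256"]
```

Anyone holding the stored snapshot can call `verify` and confirm it is byte-for-byte what was hashed at capture time. A single changed character produces a different digest, so an edited snapshot fails the check.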

When this evidence is and is not enough

Reproducible, timestamped, hashed documentation of an axe-core run against a public page is solid accessibility audit documentation. It is not a compliance certificate, and the word "evidence" here carries its practical meaning — material a human decision-maker can evaluate — not its legal meaning, which requires counsel to apply.

There are things automated rule engines cannot evaluate: whether an image description is meaningful to a screen reader user, whether error messaging is genuinely understandable in context, whether complex interactive components are usable when navigated by keyboard and assistive technology in real conditions. An axe-core scan finds structural violations in the DOM. It does not substitute for a full manual audit by an experienced evaluator.

Documentation produced by a two-pass witness run is appropriate for these uses: counsel reviewing exposure before deciding how to respond to a demand letter; procurement teams requiring objective evidence of accessibility claims before vendor approval; compliance leads who need a periodic snapshot showing that the site's public claims are being actively monitored; and engineering teams tracking regressions between deployments. For all of those purposes, reproducible accessibility test evidence — structured, timestamped, and tied to the specific claim being tested — is the right starting point.

If your engineering team wants that same witness inside a deploy or triage workflow, "Using AI agents to test website accessibility over MCP" walks through the transports, tool response shape, and the read-only contract behind the agent-facing version.

The boundary between evidence and legal conclusion is important to maintain because overstating what the documentation covers is itself a risk. A packet that says exactly what it observed, when, and how — and nothing more — is far more useful to counsel than one that overclaims. Precision is what makes it hold up.