// instruments · 04

Why the UX audit tool had to pass its own audit

Three architectural moves that made the difference between a tool that needed me and a tool that did not.

I built three versions of a UX audit agent before I trusted it enough to hand off. The V1 was a script that worked because I was the one running it. The V2 was structured enough to run unattended, but I kept re-checking the output before showing anyone, which meant it wasn't actually saving me time - it was moving the work around. The V3 is the one I trust. The thing that got me there wasn't a better model or a cleaner codebase. It was the moment I realized the audit tool had to survive its own audit.

The public version of this work lives at github.com/burgason/ux-audit-agent. The patterns are the same as the V3 I run at work, just stripped of the client context. Three versions, three multi-day pushes, and three things I now believe about building tools for your own discipline.

What the three patterns share is when they fire. The accessibility gate runs at build time. The cross-skill import lint runs at contributor time, on every change. The script-loop split lands as a code shape the contributor has to honor before the work can pass. None of them works by asking the agent or the contributor to remember anything. They constrain what can happen at the moment it tries to happen. The series intro calls this a harness. The three below are harnesses that fire at build time and contributor time, which are two of the places a harness fits.

The audit has to survive its own audit

The V2 renderer produced clean-looking HTML reports. The findings were accurate. The recommendations were useful. And the first time I ran an accessibility check on the report itself, it failed - two color-contrast bugs in the renderer, sitting in the same document that was about to tell a client they had color-contrast bugs on their site.

That's not an engineering bug. That's a credibility problem. A UX audit you can't run against itself is a tool that asks the client to extend trust the tool hasn't earned. If I'd shipped that report, the first finding the client read would have been an accessibility violation in the artifact making the claim.

The fix in the rendered output was small - adjust two contrast pairs, re-render, done. The lesson was bigger. In V3, the same accessibility checks the tool runs against client sites also run against the tool's own report, every time it's built. Zero critical findings, zero serious, a tight ceiling on moderate ones. If the report can't pass, the report doesn't ship.
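A minimal sketch of what a gate like this could look like, written in Python purely for illustration. It assumes each finding arrives as a dict with an "impact" label (critical, serious, moderate, and so on). The function name, the result type, and the moderate ceiling of 2 are hypothetical; the text only says the ceiling is tight.

```python
from collections import Counter
from dataclasses import dataclass

# The zero ceilings for critical and serious come from the text.
# The moderate ceiling of 2 is an assumed value for illustration only.
MAX_ALLOWED = {"critical": 0, "serious": 0, "moderate": 2}


@dataclass(frozen=True)
class GateResult:
    passed: bool
    counts: dict       # impact label -> number of findings
    violations: list   # human-readable reasons the gate failed


def check_report_gate(findings, max_allowed=MAX_ALLOWED):
    """Count findings by impact and fail if any ceiling is exceeded.

    `findings` is a list of dicts, each with an "impact" key. Impact
    labels without a configured ceiling (e.g. "minor") never fail the gate.
    """
    counts = Counter(f.get("impact") for f in findings)
    violations = [
        f"{impact}: {counts[impact]} found, ceiling is {ceiling}"
        for impact, ceiling in max_allowed.items()
        if counts[impact] > ceiling
    ]
    return GateResult(
        passed=not violations,
        counts=dict(counts),
        violations=violations,
    )
```

In a build script, the result would decide the exit code: a non-empty `violations` list means the report does not ship. The point of the shape is that the gate takes the same kind of findings list whether the page under test is a client site or the tool's own report.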

This is the move I'd want any UX-adjacent tool to make. The thing that audits has to be auditable in the same terms. Otherwise you're asking the audience to take your standards seriously while you exempt yourself from them.

Rules in a doc are not rules

The V2 had a CONTRIBUTING file with fifteen rules in it. One of them said skills shouldn't import directly from each other's internals. It was a good rule. I'd thought about it carefully. And about three weeks in, I caught myself writing exactly the import the rule forbade, in a skill I'd written from scratch, in a session where I'd never opened the contributing file.

That's the whole problem with rules that live as prose. They depend on whoever's working - me, a collaborator, an agent in a fresh session with no memory of what I decided last Tuesday - remembering the rule at the moment they're about to break it. The rule survives exactly as long as the memory does, which in an agent-driven practice is somewhere between zero and one session.

The shift in V3 was treating every architectural decision as something the build had to enforce, not something the contributor had to remember. The no-cross-skill-imports rule became a lint that runs on every change. It has an explicit allowlist of files that are allowed to be shared - types, the CLI entry, error definitions, a couple of others - and anything outside that list gets flagged. The lint shipped in the same change that first wrote the rule down.
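To make the idea concrete, here is a hedged sketch of an import lint with an allowlist, using Python's standard `ast` module. Everything about the layout is an assumption: that skills live under a `skills` package, that imports are absolute, and that the shared modules are named as in `SHARED_ALLOWLIST`. The repo's actual language, paths, and allowlist entries are not given here.

```python
import ast

# Hypothetical shared modules; stand-ins for "types, the CLI entry,
# error definitions" from the text, not the repo's real paths.
SHARED_ALLOWLIST = frozenset({
    "skills.shared.types",
    "skills.shared.errors",
    "skills.cli",
})


def imported_names(source):
    """Yield (line number, dotted name) for every absolute import in source."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield node.lineno, alias.name
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Relative imports (level > 0) are skipped: assumed to stay
            # inside the importing skill.
            for alias in node.names:
                yield node.lineno, f"{node.module}.{alias.name}"


def is_allowed(name, allowlist):
    """True if name is an allowlisted module or something inside one."""
    return any(name == a or name.startswith(a + ".") for a in allowlist)


def lint_cross_skill_imports(source, own_skill, allowlist=SHARED_ALLOWLIST):
    """Return one message per import that reaches into another skill."""
    hits = []
    for lineno, name in imported_names(source):
        parts = name.split(".")
        if parts[0] != "skills" or len(parts) < 2:
            continue  # stdlib, third-party, or the bare package
        if parts[1] == own_skill or is_allowed(name, allowlist):
            continue
        hits.append((lineno, name))
    return [
        f"line {lineno}: import of {name} crosses a skill boundary"
        for lineno, name in sorted(hits)
    ]
```

Run on every change, a non-empty return value fails the build. The allowlist is data, so widening it is a visible diff someone has to approve rather than a quiet exception.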

For a designer running an agent-orchestrated practice, this matters more than it would in a traditional codebase. My agents start every session with no memory of the constraints I care about. The repo has to tell them. A sentence in a doc won't; a failing build will.

Deterministic work and judgment work want different homes

The piece of V2 I trusted least was the Notion sync. It read engagement state, decided which fields to write, called the Notion API, handled whatever came back. One skill, four jobs, no clean way to test any of them in isolation. I couldn't verify the field mapping without running the network calls. I couldn't simulate a rate-limit response without re-running the whole thing. When something went wrong, I had to read the logs and guess which layer had failed.

The instinct that fixed it is one I'd recognize from any design problem: the deterministic decisions and the judgment calls want different homes. In information architecture this is the difference between the taxonomy and the navigation states. In interaction design it's the difference between what a component is and what it does when something changes. Same instinct applies here.

V3 splits the sync into two layers. A script produces a structured intent object from the engagement state - which fields go where, what the allowlist permits, what the payload looks like. No network calls, fully testable, deterministic. Then a documented agent loop handles the part that needs judgment: search for the engagement in Notion, decide whether to create or update, perform the write, verify it by reading back, back off on a rate limit. Five steps, written down, auditable as a runbook.
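The deterministic half might look something like the sketch below. The field names, the `SyncIntent` type, and the mapping from engagement state to Notion property names are all hypothetical, since the real mapping and allowlist are not spelled out here. What the sketch tries to show is the shape: a pure function from state to intent, with no Notion client in sight.

```python
from dataclasses import dataclass

# Hypothetical mapping: engagement-state key -> Notion property name.
# It doubles as the allowlist: keys not listed here are never written.
FIELD_MAP = {
    "client_name": "Client",
    "status": "Status",
    "report_url": "Report URL",
}


@dataclass(frozen=True)
class SyncIntent:
    engagement_id: str
    properties: dict   # Notion property name -> value to write
    skipped: tuple     # state keys the allowlist did not permit


def build_sync_intent(state, field_map=FIELD_MAP):
    """Turn engagement state into a write intent.

    Pure function: no network calls, no clock, no randomness, so the
    same state always produces the same intent.
    """
    properties = {
        field_map[key]: value
        for key, value in state.items()
        if key in field_map and value is not None
    }
    skipped = tuple(
        sorted(k for k in state if k not in field_map and k != "id")
    )
    return SyncIntent(
        engagement_id=state["id"],
        properties=properties,
        skipped=skipped,
    )
```

Because the function never touches the network, the field mapping can be verified with ordinary unit tests, and the agent loop receives an intent it can act on without re-deriving what should be written.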

This is now the default shape for anything in the repo that touches an external service. The deterministic half is a script you can write tests against. The judgment half is an agent loop you can read like a procedure. When something goes wrong, you know which half failed before you start debugging.

What V3 ships with

The version I trust runs a full audit pipeline - crawl, discover flows, capture, analyze, report - on any URL with one orchestrator command. Six audit skills, twelve engagement-platform skills around them, the three patterns above baked into the architecture. The accessibility gates on its own output are passing. The cross-skill import lint is enforcing. The MCP-dependent skills are all split into the script-plus-loop shape.

The capability that matters more than the numbers: I can hand it to a colleague and the report they generate will look like the one I generated. The V1 needed me. The V2 needed me checking. The V3 doesn't.

What's next

The current audit runs a mechanized sweep - accessibility against WCAG 2.2 AA, a usable subset of the Nielsen heuristics, an optional advisory pass from an LLM working against a documented rubric. It's a good baseline. It's not yet what I'd call a UX audit in the full sense, because it doesn't have a point of view about what the site is for.

The next version is about hypothesis. Four sources of it: what the client says they want the site to do, what the site itself implies through its structure and content, what's standard for the industry, and the triangulation between those three. From those hypotheses, a synthetic user with an adaptive mental model walks the site and produces a diagnostic record of where its assumptions broke. Findings get sorted on two axes - type and severity - and the report speaks in two voices: a clinical analyst for what's true about the site, and an opinionated advisor for what to do about it.

The three patterns above don't change. The rules will still be lints. The report will still pass its own audit. Any new external touchpoint will still split into a script and a loop. That's the part I've stopped having to think about, which is the only reason there's room to think about the next thing.