// instruments · 03

The four moves that turn an LLM into a UX auditor

Three versions of a UX audit agent later, the four moves that separate an audit from a list of confident-sounding findings.

The first time I ran my own audit tool on an enterprise CRM sandbox page, it returned a clean report. No critical findings. A few minor cosmetic notes. Ship-ready.

The page had no data on it. Empty-state dashboard, no records, nothing for a user to evaluate. The audit didn't know that. The screenshot looked clean, so the report was clean. A client opening that report would have trusted it. That's the moment I realized the instrument I'd built wasn't broken - it was working exactly as designed, and the design was wrong.

Three versions later, the audit knows the difference between a clean page and a page it can't evaluate. Getting there took rebuilding the tool twice and discovering, late, that the work that mattered wasn't engineering. It was figuring out what an LLM-produced audit has to do differently to count as a UX audit in the first place.

Why most LLM audits read the same

If you've seen a few of these, you know the shape. A list of findings, each with a Critical / Major / Minor sticker. Four neat finding types and four neat severities, all cells filled. Nielsen heuristics run end to end. Confident prose. Nothing wrong with any individual paragraph, and nothing stands out from the rest.

That's not a writing problem. That's a methodology problem, and it's harder to see because the output looks complete. Severity labels average out - when everything has a sticker, nothing weighs more. Symmetrical taxonomies look thorough - when every cell is filled, the empty cells that should have been signal disappear. Single-source audits miss the most useful gap an audit can find - the one between what the client says they want the site to do and what the site itself implies through its structure. The report reads like work was done because the surface area is covered.

My V1 produced reports that looked exactly like that. It also produced false-clean audits on empty-state pages, which is what finally let me see the rest of it.

V1 - the instrument couldn't see what wasn't there

The first version was a Python script with a desktop GUI on top. Two thousand lines of buttons and trackpad behavior wrapped around an agent that ran the audit in stages. It was brittle - a stage failing meant the whole run failed - and the GUI was the wrong surface for diagnostic work, which produces a report, not a wizard. But those were fixable problems.

The unfixable one was the sandbox audit. The agent looked at a page, took screenshots, ran heuristics against what it saw, and reported what was there. When nothing was there, nothing showed up as wrong. The audit had no idea it was looking at an empty page. There was no concept in the tool of what this page is supposed to be doing, only of what's visible in the screenshot. That's the same gap I see in every LLM audit I read now. The instrument has one source - the rendered page - and grades it against generic heuristics. Anything that requires understanding intent disappears.
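The missing concept can be sketched as a third verdict next to "clean" and "has findings". This is a minimal illustration, not the tool's actual code: the PageSnapshot fields and the verdict_for function are hypothetical names, and the single rule (no rendered records means not evaluable) is an assumption chosen to mirror the sandbox case.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CLEAN = "clean"
    HAS_FINDINGS = "has_findings"
    NOT_EVALUABLE = "not_evaluable"


@dataclass
class PageSnapshot:
    # Hypothetical fields; a real tool would derive these from the DOM or an API.
    url: str
    expected_purpose: str   # what the page is supposed to do, e.g. "list open opportunities"
    data_row_count: int     # records actually rendered on the page
    finding_count: int      # findings the heuristics produced


def verdict_for(page: PageSnapshot) -> Verdict:
    # With no data rendered, the heuristics had nothing to grade,
    # so "no findings" must not be reported as "clean".
    if page.data_row_count == 0:
        return Verdict.NOT_EVALUABLE
    return Verdict.HAS_FINDINGS if page.finding_count else Verdict.CLEAN
```

A fuller version would probably still audit the empty state itself; the point here is only that "nothing found" and "nothing to look at" are separate outcomes.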

This is the failure mode that made me throw the first version away.

V2 - a refactor that looked rebuilt and wasn't

V2 lasted weeks. I thought I had rebuilt the tool around agents. Underneath it was still an orchestrator calling scripts, or scripts calling agents - I genuinely can't remember which, and that ambiguity is itself the tell. A real agent-native tool is shaped in a way you don't forget. V2 wasn't shaped like anything in particular.

The lesson took longer than I want to admit to land. You can move the boxes around and call it a rebuild. The question is whether any part of the tool can run on its own and produce the same output. V2's parts couldn't. They needed the orchestration scaffolding that was the actual architecture, and the scaffolding was what was wrong.

I started V3 in a new repo.

V3 - what the instrument has to actually do

V3 is the version that runs in client delivery now, against two enterprise CRM products' UI surfaces. The thing I'm proudest of in it isn't a feature. It's that I can finally name the four moves that separate a UX audit from a list of confident-sounding findings.

One: confidence has to come from placement, not labels. A Critical / Major / Minor sticker averages out across a report. Move confidence to where the finding sits instead. A finding earns the top section by clearing four criteria - it shows up across multiple sessions, it triangulates across two or more hypothesis sources, it matches a known heuristic violation, it maps to a measurable outcome. Four-of-four is the executive summary. Two, the details section. One, footnote or cut. The structure does the weighting. The reader doesn't have to.

This is the move that broke me out of V1. V1's stickers were a way of avoiding having to commit to which finding mattered most. Once placement carried the confidence, I had to actually decide.
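The placement rule in move one can be sketched as a small scoring function. This is an illustration under stated assumptions, not the shipped implementation: the Finding fields and section names are hypothetical, and because the text only names the four, two, and one cases, the handling of three criteria (treated like two) and zero criteria (cut) is my assumption.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    seen_in_multiple_sessions: bool
    triangulated_sources: int          # how many hypothesis sources agree
    matches_known_heuristic: bool
    maps_to_measurable_outcome: bool

    def criteria_met(self) -> int:
        # Each criterion is a boolean; summing them counts how many hold.
        return sum([
            self.seen_in_multiple_sessions,
            self.triangulated_sources >= 2,
            self.matches_known_heuristic,
            self.maps_to_measurable_outcome,
        ])


def placement(finding: Finding) -> str:
    met = finding.criteria_met()
    if met == 4:
        return "executive_summary"
    if met >= 2:                # assumption: three criteria land with two
        return "details"
    if met == 1:
        return "footnote_or_cut"
    return "cut"                # assumption: zero criteria never reaches the report
```

The function returns a section, not a label, so the report's structure carries the weight and no per-finding sticker is needed.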

Two: the taxonomy has to be capped and the empty cells have to be signal. Two axes, four categories each. Heuristic violation, flow failure, objective misalignment, systemic gap. Critical, Major, Minor, Cosmetic. Sixteen cells in theory; real reports across two CRM products cluster in five or six. The empty cells are the most useful part of the matrix - they say where the audit looked and found nothing, which is information a client can actually use. V1 padded reports to look complete. V3 leaves them honest.
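One way to keep the empty cells visible is to build the full sixteen-cell matrix up front and report the zeros explicitly. The sketch below uses the two axes named above; the function names are hypothetical and the input format (pairs of finding type and severity) is an assumption.

```python
from collections import Counter
from itertools import product

FINDING_TYPES = ("heuristic violation", "flow failure",
                 "objective misalignment", "systemic gap")
SEVERITIES = ("Critical", "Major", "Minor", "Cosmetic")


def build_matrix(findings):
    """findings: iterable of (finding_type, severity) pairs."""
    counts = Counter(findings)
    all_cells = list(product(FINDING_TYPES, SEVERITIES))
    # The taxonomy is capped: anything outside the 4 x 4 grid is an error,
    # not a reason to add a new category.
    unknown = set(counts) - set(all_cells)
    if unknown:
        raise ValueError(f"outside the capped taxonomy: {sorted(unknown)}")
    # Every cell appears in the result, including the ones with zero findings.
    return {cell: counts.get(cell, 0) for cell in all_cells}


def empty_cells(matrix):
    # Cells the audit looked at and found nothing in.
    return [cell for cell, n in matrix.items() if n == 0]
```

Raising on an unknown category is a deliberate choice in this sketch: it stops the matrix from quietly growing until every report looks symmetrical again.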

Three: hypotheses come from four sources, and the fourth is the gap between the other three. What the client says they want the site to do. What the site itself implies through its content and structure. What's standard for the industry. And the gap between those three, which is usually where the highest-value finding lives, because misalignment between stated intent and implied intent is where strategic problems hide. V1's false-clean audits were single-source failures - the instrument had only the rendered page. V3 reads the design system, reads the developer CLI, reads the client brief. The substrate is what makes the recommendations specific instead of generic. Without it, the LLM produces "consider improving error states." With it, the LLM produces "the empty-state pattern from the design system isn't being used on this dashboard; here's the component and the import path."
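The gap between sources can be sketched with plain set differences. This is a deliberate simplification and not the tool's hypothesis layer: it assumes each source has already been reduced to a set of short goal strings, and the Hypotheses class, the gaps function, and the key names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hypotheses:
    stated: set = field(default_factory=set)    # what the client says the page should do
    implied: set = field(default_factory=set)   # what the page's content and structure suggest
    standard: set = field(default_factory=set)  # what is conventional for the industry


def gaps(h: Hypotheses) -> dict:
    return {
        # Goals the client wants that the page does not visibly support.
        "stated_not_implied": h.stated - h.implied,
        # Things the page pushes that the client never asked for.
        "implied_not_stated": h.implied - h.stated,
        # Industry conventions absent from both the brief and the page.
        "standard_missing": h.standard - (h.stated | h.implied),
    }
```

For example, a brief that says "create a lead" against a page that implies "browse reports" yields one entry in each of the first two buckets, and those two entries are the misalignment worth writing up.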

Four: the auditor has to report its own confusion. This is the move I'd never seen in another audit tool. The agent works as a novice-but-competent user. As it interacts with the site, it reports its evolving mental model. When it gets stuck, it doesn't just log "task failed" - it logs the wrong model that produced the failure. I thought clicking this would open the record. It opened the filter panel instead. Now I don't know how to get back. That artifact is more useful than any heuristic finding, because a real user gets stuck in the same place and never tells you.

V1 hid its confusion behind clean output. V3 is designed to surface it, and the surfaced confusion is the diagnostic.
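A confusion report could be stored as a record of the wrong model, not only the failed step. The sketch below is one possible shape for that record, assuming hypothetical field names; it is not the agent's actual log format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ConfusionEvent:
    step: int
    action: str           # what the agent did, e.g. "clicking the row"
    expected: str         # what its mental model predicted
    observed: str         # what actually happened
    recovery_known: bool  # whether the agent knows how to get back

    def narrative(self) -> str:
        # Render the event in the first-person form a reader can skim.
        text = f"I thought {self.action} would {self.expected}. Instead, {self.observed}."
        if not self.recovery_known:
            text += " Now I don't know how to get back."
        return text

    def to_json(self) -> str:
        # Structured form, so events can be aggregated across sessions.
        return json.dumps(asdict(self))


event = ConfusionEvent(
    step=7,
    action="clicking the row",
    expected="open the record",
    observed="it opened the filter panel",
    recovery_known=False,
)
print(event.narrative())
```

Keeping expected and observed as separate fields is the design choice that matters: a plain "task failed" entry loses the expectation, and the expectation is what points at the design problem.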

What ships, what doesn't

V3 ships against client engagements now. The IP-clean extract is on GitHub as ux-audit-agent. The four moves above are the design target for the next version - not all of them are fully encoded yet. Placement and the capped taxonomy are live. Triangulation is partially there, anchored to the design system but not yet running an explicit hypothesis layer. The confusion-reporting agent is the next build.

That's honest, and it's also the only version of this story that's worth telling. The framework is a thing I had to discover by shipping a tool that didn't have it, then shipping a tool that did, then watching what the audits could and couldn't do. None of it would have been visible from outside the build.

The instrument I'm building points outward - at sites, at flows, at client work. The other instrument I've been building points inward, at my own reasoning. Different objects, same practice. Build the thing that lets you see what you couldn't see on your own. Watch it run. Notice what it still gets wrong. Build the next version against that.

Building an instrument to see a site is not a deliverable you ship. It is a practice, and three versions in, I am starting to know what it's actually for.