Consent-audit: a Claude Skill

A Claude Code skill I built to audit draft consent flows against the internal guidance I helped author. It flags critical, important, and minor issues with suggested remediations. I run controlled evals (with and without the skill) to measure whether it actually helps.

Consent design is high-stakes work with low tolerance for error, and a lot of the review process is checking new drafts against the same internal guidance again and again. As Microsoft's consent and privacy guidance has matured (and I've authored a fair amount of it), I wanted to see if I could turn that body of judgement into something I could point at a draft and get back a structured review.

The deeper goal was meta: what does it look like for content design to use AI to scale our own work, not just design for AI? Claude Code felt like the right place to find out.

Claude + GitHub

The skill reads from a private GitHub repo containing the body of guidance I've helped author over the years: anti-patterns and good patterns, internal terminology, our own best practices, some regulatory guidance, and "aim for / avoid" examples.

Given a draft consent flow (a Figma comp, screenshot, document, or text file), the skill checks it against that guidance and returns a structured review:

  • Critical issues

  • Important issues

  • Minor issues

  • A suggested remediation for each
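
To make the shape of that review concrete, here is a minimal Python sketch of how it could be represented. The class names, field names, and the two example findings are hypothetical illustrations, not the skill's actual output format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """The three tiers from the review; a lower value sorts first."""
    CRITICAL = 0
    IMPORTANT = 1
    MINOR = 2


@dataclass
class Finding:
    severity: Severity
    issue: str
    remediation: str  # every finding carries a suggested fix


@dataclass
class Review:
    findings: list[Finding] = field(default_factory=list)

    def by_severity(self) -> dict[str, list[Finding]]:
        """Group findings under 'critical', 'important', 'minor', in that order."""
        grouped: dict[str, list[Finding]] = {s.name.lower(): [] for s in Severity}
        for finding in sorted(self.findings, key=lambda f: f.severity):
            grouped[finding.severity.name.lower()].append(finding)
        return grouped


# Hypothetical findings, invented for illustration only.
review = Review([
    Finding(Severity.MINOR, "Inconsistent button capitalisation", "Use sentence case"),
    Finding(Severity.CRITICAL, "Decline option is hidden behind a link",
            "Give accept and decline equal prominence"),
])
grouped = review.by_severity()
```

Keeping the remediation on the finding itself, rather than in a separate list, means an issue can never be reported without a suggested fix.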


I built it with one UX engineer consulting on the technical side.

Evals to measure usefulness

I evaluate the skill using Anthropic's eval skill, drawing on three inputs:

  1. A rubric I score outputs against

  2. Real-world drafts where I compare its review to my own manual one

  3. Synthetic good and bad flows I generated to test edge cases
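
The comparison in item 2 can be sketched as a simple overlap measure. This is one possible approach, not necessarily how my evals do it: it assumes each issue has been reduced to a short label so that the skill's findings and the manual findings can be matched as sets. The function name and the labels are hypothetical.

```python
from __future__ import annotations


def compare_reviews(skill_issues: set[str], manual_issues: set[str]) -> dict[str, float]:
    """Compare the skill's findings with a manual review of the same draft.

    precision: share of the skill's issues that the manual review also raised.
    recall:    share of the manual review's issues that the skill caught.
    """
    agreed = skill_issues & manual_issues
    precision = len(agreed) / len(skill_issues) if skill_issues else 0.0
    recall = len(agreed) / len(manual_issues) if manual_issues else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical issue labels, invented for illustration only.
skill = {"hidden-decline", "vague-purpose", "jargon"}
manual = {"hidden-decline", "vague-purpose", "missing-withdrawal"}
result = compare_reviews(skill, manual)
```

Low recall points at guidance the skill is not applying; low precision points at noise that a reviewer would have to filter out.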


The most important loop: I run the evals with and without the skill to measure whether the skill actually improves the review process, or just adds ceremony. The results feed back into adjustments to the skill, and the cycle repeats.
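
As a rough sketch of that with/without comparison, assume each eval case ends up with one rubric score per condition. The function name, case names, and scores below are made up for illustration; they are not real results.

```python
from __future__ import annotations

from statistics import mean


def skill_uplift(with_skill: dict[str, float], without_skill: dict[str, float]) -> dict:
    """Summarise rubric scores for cases that were run in both conditions."""
    shared = sorted(with_skill.keys() & without_skill.keys())
    if not shared:
        raise ValueError("no eval cases were run in both conditions")
    deltas = {case: with_skill[case] - without_skill[case] for case in shared}
    return {
        "mean_with": mean(with_skill[c] for c in shared),
        "mean_without": mean(without_skill[c] for c in shared),
        "mean_delta": mean(deltas.values()),
        # Cases where the skill scored lower than the baseline.
        "regressions": [c for c in shared if deltas[c] < 0],
    }


# Made-up scores on a hypothetical 1-5 rubric, for illustration only.
with_scores = {"draft-a": 4.0, "draft-b": 3.0, "synthetic-bad-1": 5.0}
without_scores = {"draft-a": 2.0, "draft-b": 3.5, "synthetic-bad-1": 3.0}
summary = skill_uplift(with_scores, without_scores)
```

Listing regressions alongside the mean matters here: an average uplift can hide individual cases where the skill made the review worse, and those are the ones that drive the next adjustment.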

Where it is today

This is early-stage work. I'm using the skill on real work in flight, and I'm about to share it with a handful of partner teams to gather feedback.

What's already clicking — and it surprised me — is something quieter than "we saved X hours." Having an evals-backed skill changes the conversation with partner teams. As a Principal, my judgement can read to partners as "Amy's opinion." A skill-backed review depersonalises the feedback and shows the work behind a decision: the nuance, the regulatory complexity, the genuine difficulty of getting these things right.

It's a slower outcome to measure than time saved. I suspect it's the more important one.

What I've learned so far

Two things stand out.

Building a useful AI tool no longer requires being a machine learning engineer. It requires being clear about what "good" looks like and rigorous about measuring it. The hard part of this work wasn't the code; it was articulating my own judgement well enough for the skill to apply it.

Evals are the thing that makes the output trustworthy. Without them, this would be a clever experiment. With them — and especially with the with/without comparison — it becomes something I can actually point partner teams at, and stand behind when they ask "how do you know it's right?"