Create an explainer-review skill (#39)
One tester ran this over a Google doc, where one of the doc's comments advised the author to add some IDL to the proposed solution. This caused the LLM to incorrectly attribute the "add IDL" advice to the TAG, even though https://www.w3.org/TR/explainer-explainer/#template says "Do not include WebIDL in this section." This skill suggests loading https://raw.githubusercontent.com/w3ctag/explainer-explainer/refs/heads/main/index.bs instead, which contains only a Bikeshed instruction to include the template content, so the LLM probably never loaded our contrary advice. We could fix this by having the skill say to also load the template, but since the template structure is optional, I think we'll be better off ensuring that all guidance in the template also appears somewhere in the main document. That is probably also better for human readers.
I've been experimenting with similar self-review prompts. Here are some ideas for where this could scale, responding to your invitation for tips.

On the misattribution bug: it points to a structural principle. LLMs are bad at distinguishing "this document says X" from "someone near this document said X." A guardrail that helps: require every recommendation to cite a specific section of a fetched reference document. If the tool can't point to a source, it can't make the claim. Pair this with an explicit marker distinguishing "documented guidance [source, §section]" from "tool's analysis (not a TAG position)."

What if this covered the full review surface? The review templates ask about the Explainer Explainer, Design Principles, and S&P Questionnaire explicitly, but the TAG also evaluates accessibility, i18n, and societal impact during reviews. A more comprehensive self-assessment could help authors address all of those before submitting.

Possible architecture:

Key guardrails:

Happy to help iterate on any of this.
Following up on the architecture I sketched above, here's roughly what I'm thinking as a more complete skill definition. I haven't tested this yet, so treat it as a strawperson for discussion rather than something ready to ship. The idea is that this could work as a copy-paste prompt in any LLM (the reference URLs are listed explicitly), or as a structured skill in tools that support them.

**Explainer Pre-Flight skill (draft)**

# Explainer Pre-Flight
Comprehensive self-review for web platform explainers before TAG design review. Covers structural completeness, design principles, security/privacy, accessibility, internationalization, societal impact, and implementability.
## References
DO NOT rely on memorized copies of these documents. ALWAYS fetch them to get the latest version. When citing a section, verify the anchor exists in the copy you fetched.
| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |
For the Design Principles (R2), which is large: if you can run subagents, use one to extract relevant sections without consuming your main context.
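As a rough illustration of the fetch-and-verify rule (not part of the skill prompt itself), here is a minimal TypeScript sketch of fetching a reference document and checking that a cited anchor exists in the downloaded copy. The regex-based anchor scan is an assumption; a real implementation would parse the HTML.

```ts
// Sketch only: fetch a reference document and confirm a cited anchor exists
// in the copy that was actually downloaded, never in a memorized version.

async function fetchReference(url: string): Promise<string> {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Fetch failed for ${url}: HTTP ${res.status}`);
  }
  return res.text();
}

// Naive anchor check: looks for id="..." or name="..." in the raw HTML.
// Assumes the anchor is a plain slug; a real version would parse the DOM.
function anchorExists(html: string, anchor: string): boolean {
  return new RegExp(`(id|name)=["']${anchor}["']`).test(html);
}

// Example: verify a citation like [R2, §priority-of-constituencies].
async function verifyCitation(url: string, anchor: string): Promise<boolean> {
  return anchorExists(await fetchReference(url), anchor);
}
```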
## Guardrails
These rules apply to ALL output from this skill:
1. **Citation-only.** Every recommendation must cite a specific section of a fetched reference document. If you cannot point to a source, you cannot make the claim.
2. **Epistemic markers.** Label every finding as either:
- "Documented guidance [R#, §section]" — a requirement from a TAG/W3C document
- "Tool's analysis" — your reasoning, explicitly not a TAG position
3. **No claims about TAG behavior.** Never say "the TAG typically...", "reviewers often...", or "the TAG will likely..." These predictions are unreliable and undermine credibility. Say what the *documents* say.
4. **Fetch-and-quote.** When citing a reference, quote the relevant passage. If a fetch fails, suppress that check with a note rather than proceeding from memory.
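To make guardrails 1 and 2 above concrete, here is a hypothetical shape for a single finding; the type and field names are illustrative, not part of the skill. Structurally, a "documented" finding cannot be constructed without a source and section, which is the citation-only rule.

```ts
// Hypothetical structure for one finding. A "documented" finding cannot be
// constructed without a source and section (guardrail 1); everything else
// is the tool's own analysis and is labeled as such (guardrail 2).

type Finding =
  | {
      kind: "documented";     // backed by a fetched reference document
      source: string;         // e.g. "R2"
      section: string;        // e.g. "§priority-of-constituencies"
      quote: string;          // verbatim passage supporting the claim
      recommendation: string;
    }
  | {
      kind: "analysis";       // the tool's reasoning; not a TAG position
      recommendation: string;
    };

function render(finding: Finding): string {
  return finding.kind === "documented"
    ? `Documented guidance [${finding.source}, ${finding.section}]: ` +
      `${finding.recommendation}\n> ${finding.quote}`
    : `Tool's analysis (not a TAG position): ${finding.recommendation}`;
}
```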
## Workflow
### Step 1: Fetch & Classify
1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify the explainer's maturity:
- **Skeletal:** Missing most expected sections.
- **Developing:** Sections present but thin or incomplete.
- **Polished:** Substantive content throughout.
Maturity determines voice: coach skeletal explainers ("when you write this section, address..."), analyze polished ones ("your accessibility section mentions X but the screener also asks about Y because...").
**STOP** if the explainer cannot be fetched or is empty. Report the failure only.
### Step 2: Structural Completeness (R1)
Check the explainer against the Explainer Explainer's key components:
- User need / problem statement
- Why existing technology is insufficient
- Proposed solution with code examples
- Alternatives considered
- Security and Privacy considerations
- Open questions / known limitations
For each: present (quote it), missing (explain what belongs there, cite R1), or weak (quote what's there vs. what R1 asks for).
**STOP** if the explainer is skeletal to the point of being circular ("the feature is needed because it is useful"). Report structural gaps only. Deeper analysis of a phantom helps no one.
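Purely as a hedged sketch of how this structural check could be approximated mechanically, the snippet below scans the explainer's headings for the key components listed above. The keyword lists are assumptions; the authoritative list comes from the fetched copy of R1.

```ts
// Rough heuristic: report key components (per R1) whose expected keywords
// never appear in the explainer's headings. Keyword lists are illustrative.

const KEY_COMPONENTS: Record<string, string[]> = {
  "User need / problem statement": ["problem", "user need", "motivation"],
  "Why existing technology is insufficient": ["existing", "why not", "non-goals"],
  "Proposed solution with code examples": ["proposed", "solution", "api", "example"],
  "Alternatives considered": ["alternative"],
  "Security and Privacy considerations": ["security", "privacy"],
  "Open questions / known limitations": ["open question", "limitation"],
};

function missingComponents(explainerMarkdown: string): string[] {
  const headings = explainerMarkdown
    .split("\n")
    .filter((line) => line.startsWith("#"))
    .map((line) => line.toLowerCase());
  return Object.entries(KEY_COMPONENTS)
    .filter(([, keywords]) =>
      !headings.some((h) => keywords.some((k) => h.includes(k))))
    .map(([component]) => component);
}
```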
### Step 3: Domain Detection
Based on what the explainer describes, determine which Step 4 checks apply:
| Signal in explainer | Triggers |
|---------------------|----------|
| UI elements, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, environmental | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always runs |
Report which checks will run and which are skipped (with one-line reason).
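The signal table could be approximated by a simple keyword classifier; here is a sketch, with illustrative keyword lists that are assumptions rather than anything the TAG documents define.

```ts
// Illustrative signal-to-check mapping for Step 3. The keyword lists are
// assumptions; an LLM reading the explainer would replace this entirely.

const DOMAIN_SIGNALS: Record<string, string[]> = {
  "Accessibility (R6)": ["ui", "form", "input", "keyboard", "render"],
  "i18n (R7)": ["text", "locale", "encoding", "bidi", "script"],
  "S&P (R3)": ["user data", "permission", "sensor", "device", "state"],
  "Societal (R4, R5)": ["surveillance", "centralization", "environment"],
};

function triggeredChecks(explainerText: string): string[] {
  const text = explainerText.toLowerCase();
  const triggered = Object.entries(DOMAIN_SIGNALS)
    .filter(([, signals]) => signals.some((s) => text.includes(s)))
    .map(([check]) => check);
  // Design Principles (R2) always runs for any web-facing feature.
  return ["Design Principles (R2)", ...triggered];
}
```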
### Step 4: Question-Driven Checks
For each triggered domain:
**4a. Design Principles (R2)**
Identify the 3-5 principles most relevant to this proposal. Quote each verbatim. Assess compliance. At minimum always check: "Put user needs first", "It should be safe to visit a web page", "Prefer simple solutions."
**4b. Security & Privacy (R3)**
Identify relevant questions from the questionnaire. Has the explainer answered them (inline or linked)? Flag unanswered questions that clearly apply.
**4c. Accessibility (R6)**
Run the screener questions. If any trigger positive, note that Accessible Platform Architectures (APA) review will likely be needed. Cite relevant WCAG success criteria if identifiable.
**4d. Internationalization (R7)**
Check relevant items: text handling, locale sensitivity, bidi, encoding, character processing. Flag gaps.
**4e. Societal Impact (R4 + R5)**
Check relevant ethical principles (R4) and societal impact questions (R5). Environmental sustainability applies if the feature has energy/resource/network implications.
**4f. Implementability** *(Tool's analysis)*
- Does the proposed solution demonstrate the author understands what building this requires?
- Are there performance, engine architecture, or cross-platform feasibility concerns unacknowledged?
- Do the "alternatives considered" reflect real engineering trade-offs, or just aesthetic preferences?
### Step 5: Gap & Stakeholder Analysis *(Tool's analysis)*
Clearly mark this entire section as analytical, not TAG guidance.
- **Gap search:** Are there existing specs (W3C, WHATWG, IETF, TC39) that already solve part or all of this problem? If found, quote the relevant section. Frame as "these existing specs may be relevant" not "this is already solved."
- **Audience simulation:** What happens when this feature is used by:
- A developer who doesn't read the docs (naive hands)
- Someone deliberately trying to abuse it (malicious hands)
- A third-party script without user-interest alignment
- **Multi-stakeholder signals:** Is there engagement beyond a single vendor? Are browser positions filed? This maps to the TAG review form's "Feedback so far" field.
### Step 6: Verdict Table
Present findings in a summary table:
| Dimension | Status | Source | Action needed |
|-----------|--------|--------|---------------|
| Problem statement | Clear / Vague / Missing | R1 | |
| Existing solutions | Addressed / Gap / Not checked | Analysis | |
| Proposed solution | Clear / Hand-wavy / Missing | R1 | |
| Design Principles | Compliant / Issues / Conflicts | R2 §X | |
| S&P Questionnaire | Answered / Gaps / Missing | R3 | |
| Accessibility | N/A / Addressed / Gaps | R6 | |
| Internationalization | N/A / Addressed / Gaps | R7 | |
| Societal/environmental | N/A / Addressed / Gaps | R4, R5 | |
| Implementability | Demonstrated / Unclear | Analysis | |
| Multi-stakeholder | Broad / Emerging / Single-vendor | Analysis | |
| Alternatives considered | Substantive / Shallow / Missing | R1 | |
| Ready for TAG review | Yes / After [list] / Not yet | — | |
Tier each finding: **Blocking** (must fix before submitting), **Gap** (should address), **Suggestion** (would strengthen).
## Constraints
- DO NOT claim the TAG will or won't approve a proposal. This tool helps authors prepare; it does not predict outcomes.
- Help authors keep the explainer concise. If something is already well-handled, don't belabor it.
- Express analytical findings (Steps 4f, 5) tentatively. Frame as questions to consider, not verdicts.
- If a reference doc fetch fails, state which check was skipped and why. Do not proceed from memory.

The guardrails section is the part I feel strongest about. The misattribution bug you found is exactly the kind of thing "citation-only" and "no claims about TAG behavior" are designed to prevent. The domain detection (Step 3) is the part I'm least sure about: whether skipping checks is worth the complexity vs. just running everything every time. Curious what you think, and whether this is roughly the right scope or if it should be narrower for a first pass.
Based on your feedback, I iterated the skill through three rounds and validated it against actual TAG reviews. Here's what I found.

**Changes since v2**
**Validation against real TAG reviews**

Tested against 9 proposals from w3ctag/design-reviews plus 1 charter (which the tool correctly rejects):
Calibration holds: TAG-unsatisfied → 3+ Blocking; TAG-satisfied → 0 Blocking and a short report. Caveat: I'm running the tool against the current explainers but comparing against TAG feedback from the time of review. Some explainers have been updated since, so some "tool misses" might be things the explainer already fixed post-review.

**What the tool catches vs. misses**

Catches: structural gaps (R1), S&P questionnaire coverage (R3), priority-of-constituencies violations (R2 §1.1), "why not extend the existing primitive?" (R2 §new-features), fingerprinting, accessibility screener items (R6), interop coercion, and ethical/societal concerns (R4).

Misses (inherent): "you're solving the wrong problem," specific architectural alternatives that require platform expertise, cross-feature interactions, OS-level implementability, and cultural/contextual concerns.

The boundary: the tool reads the explainer's arguments and evaluates their completeness against the reference documents. It catches "your argument is incomplete" but can't reach "you should use a different API entirely." That's the TAG's highest-value contribution and isn't replaceable.

**Open questions**
**Updated skill (v3)**

---
name: explainer-preflight
description: Use when reviewing a web platform explainer or proposal before TAG design review. Also use when asked to check an explainer against TAG documents, prepare for a design review submission, or self-assess an explainer's readiness. Not for reviewing finished specs, implementations, or TAG review responses.
---
# Explainer Pre-Flight
Self-review tool for web platform explainers before TAG design review. Walks through TAG reference documents and surfaces gaps: where the explainer is thin or silent on things the TAG documents ask for.
## When NOT to Use
- Reviewing a finished W3C specification (use spec review tooling)
- Reviewing an implementation or code (not a design review tool)
- Responding to TAG review feedback (that's a different workflow)
- Evaluating TAG reviews themselves
## References
ALWAYS fetch these from the internet to get the latest version. Never rely on memorized copies. When linking to anchors, verify that the anchor exists in the copy you just downloaded, and that the content of the section supports the point it's being cited to support. Check citations before presenting them to the user, using a subagent if possible.
| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |
For Design Principles (R2), which is large: use a subagent to extract relevant sections if possible.
## Guardrails
1. **Citation-only.** Every finding must cite a specific section of a fetched reference document, verified by subagent (Step 5). No source, no claim.
2. **Epistemic markers.** Label findings: "Documented guidance [R#, §section]" vs. "Tool's analysis."
3. **No claims about TAG behavior.** Never "the TAG typically..." or "reviewers often..." Say what the documents say.
4. **Gaps only.** Do not mention things the explainer handles adequately. Do not compliment. Output is proportional to problems.
## Severity Criteria
Consistent severity is critical. Use these definitions:
- **Blocking:** The TAG review form explicitly requires this (e.g., R1 says "key components" or R3 requires dedicated considerations sections), AND the explainer is completely silent on it. A reviewer cannot proceed without this information.
- **Gap:** A reference document asks a relevant question or states a relevant principle, AND the explainer is thin or silent, but the review could proceed. The TAG would likely raise this but it wouldn't stop the review.
- **Suggestion:** The tool's own analysis identifies something that would strengthen the explainer, or a reference document principle is tangentially relevant. Not something the documents require.
**Calibration test:** If the finding is "section X is entirely missing and R1 lists it as a key component" → Blocking. If "R3 asks question Y and the explainer doesn't answer it but also doesn't claim to" → Gap. If "the explainer could be stronger here" → Suggestion.
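Read as code, the calibration test is a small decision function. The sketch below is illustrative only; the field names are assumptions, and real severity depends on what the fetched documents actually require.

```ts
// Sketch of the calibration test as a decision function. Field names are
// illustrative; real severity depends on what the fetched documents require.

type Severity = "Blocking" | "Gap" | "Suggestion";

interface CandidateFinding {
  requiredByReviewForm: boolean; // e.g. an R1 key component or R3 considerations section
  explainerSilent: boolean;      // the explainer says nothing about it at all
  fromToolAnalysis: boolean;     // the tool's own reasoning, not a document requirement
}

function severity(c: CandidateFinding): Severity {
  if (c.fromToolAnalysis) return "Suggestion";
  if (c.requiredByReviewForm && c.explainerSilent) return "Blocking";
  return "Gap";
}
```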
## Workflow
### Step 0: Document Validation
Before starting analysis, confirm the target URL is actually an explainer:
- Does it have a problem statement, proposed solution, or similar structure?
- If it appears to be a polyfill README, spec text, changelog, or redirect: **STOP.** Report that the URL is not an explainer. If you can identify the actual explainer (e.g., linked from the repo, from a TAG design review issue, or from an Open UI/WICG page), note its location and ask whether to analyze that instead.
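A very rough heuristic sketch of this validation step, with telltale markers that are assumptions rather than a vetted list:

```ts
// Heuristic sketch: guess whether a fetched document is an explainer at all.
// Both marker lists are illustrative, not an authoritative classifier.

function looksLikeExplainer(text: string): boolean {
  const lower = text.toLowerCase();
  const explainerSignals = ["problem", "user need", "proposed", "goals"];
  const otherDocSignals = ["polyfill", "changelog", "this version", "editor's draft"];
  const hasExplainerStructure = explainerSignals.some((s) => lower.includes(s));
  const looksLikeSomethingElse = otherDocSignals.some((s) => lower.includes(s));
  return hasExplainerStructure && !looksLikeSomethingElse;
}
```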
### Step 1: Fetch & Classify
1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify maturity: **Skeletal** (missing most sections), **Developing** (thin), **Polished** (substantive throughout).
Maturity determines voice: coach skeletal ("when you write this section, address..."), analyze polished directly.
**STOP** if the explainer cannot be fetched or is empty.
### Step 2: Structural Completeness (R1)
Check against the Explainer Explainer's key components. Report only gaps:
- Missing: explain what belongs there, cite R1.
- Weak: quote what R1 asks for, show where the explainer falls short.
**STOP** if skeletal to the point of being circular. Report structural gaps only.
### Step 3: Domain Detection
| Signal in explainer | Triggers |
|---------------------|----------|
| UI, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, AI/ML, advertising, content generation | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always |
Report which checks run and which are skipped (one-line reason).
### Step 4: Question-Driven Checks
For each triggered domain, report only where the explainer is thin or silent:
- **4a. Design Principles (R2):** Surface only principles the explainer doesn't address. Quote each and explain relevance.
- **4b. Security & Privacy (R3):** Flag only unanswered questions that clearly apply.
- **4c. Accessibility (R6):** If screener triggers positive, note APA review needed. Cite WCAG.
- **4d. Internationalization (R7):** Flag only gaps.
- **4e. Societal Impact (R4 + R5):** Flag only gaps.
- **4f. Platform Fit** *(Tool's analysis, informed by R1 and R2):*
- **Closest existing primitive:** What is the closest existing web platform feature to this proposal? (e.g., `window.open()` for new windows, Popover API for overlay UI, MediaCapabilities for media adaptation). Does the explainer explicitly argue why extending that primitive was rejected? R2 (§prefer-simple-solutions) states "prefer simple solutions" and (§new-features) states "add new capabilities with consideration of existing functionality." If the explainer proposes a new API surface without addressing the obvious existing primitive, flag it. This is the single most common TAG architectural objection.
- **Interop coercion risk:** If the feature only delivers value when all browsers implement it, and there's no graceful degradation path, sites may coerce users into specific browsers. R2 (§devices-platforms) requires features to work "across a range of platforms." Flag if the explainer doesn't discuss what happens when the feature is absent and whether sites could weaponize feature detection.
- **Cross-feature interactions:** Does the explainer address how this interacts with other platform features that operate in the same user-facing domain? (e.g., hover behavior interacting with Speculation Rules prefetch-on-hover; new window APIs overlapping with Popover; device APIs overlapping with existing sensor APIs). If the feature shares a user-facing surface with another feature and the interaction is unaddressed, flag it.
- **Evidence of utility:** R1 (§user-research) encourages user research. Beyond that, does the explainer provide evidence that the proposed solution actually works? Not theoretical scenarios, but concrete developer feedback, usage data, or prototype results demonstrating the design is correct.
### Step 5: Citation Verification
Before presenting findings, verify every citation. Use a subagent if possible.
For each citation:
1. **Anchor exists** in the fetched document.
2. **Content supports the claim** being made.
Remove any citation that fails. Drop any finding with no surviving citation. This is architectural, not optional.
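Here is a sketch of the verification filter, assuming a `verifyCitation(url, anchor)` helper like the one sketched earlier in the thread; the types are illustrative.

```ts
// Sketch: drop citations that fail verification, then drop findings whose
// last citation was removed (guardrail 1: no source, no claim).

interface Citation { url: string; anchor: string; }
interface CandidateReport { text: string; citations: Citation[]; }

async function filterVerified(
  findings: CandidateReport[],
  verifyCitation: (url: string, anchor: string) => Promise<boolean>,
): Promise<CandidateReport[]> {
  const kept: CandidateReport[] = [];
  for (const finding of findings) {
    const checks = await Promise.all(
      finding.citations.map((c) => verifyCitation(c.url, c.anchor)),
    );
    const citations = finding.citations.filter((_, i) => checks[i]);
    if (citations.length > 0) kept.push({ ...finding, citations });
  }
  return kept;
}
```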
### Step 6: Gaps Report
For each document analyzed, present only gaps:
> **Design Principles (R2):** Your explainer doesn't address [principle] (§anchor), which is relevant because [quoted passage].
>
> **S&P Questionnaire (R3):** Questions N and M are relevant but aren't addressed. [Quote what they ask.]
- Tier findings: **Blocking** / **Gap** / **Suggestion** per the Severity Criteria above.
- End with one line: "Ready for TAG review" or "Address N blocking gaps first."
**Tone for proponents:** This report is meant to help authors prepare, not to gatekeep. Frame findings as "the documents ask about X, and your explainer doesn't address it yet" rather than "your explainer fails to..." The goal is that an author reads this and knows exactly what to add, not that they feel judged.
**Structure for readability:** Group findings thematically (security cluster, structural cluster, platform fit cluster) rather than as a flat numbered list. A report with "8 Blocking" findings feels like a failing grade. A report organized as "here's a security cluster you need to address, and separately here's a structural cluster" feels like actionable guidance.
## Common Mistakes
| Mistake | What happens |
|---------|--------------|
| Analyzing a non-explainer document (polyfill README, spec text, redirect) | Produces 10+ structural "blocking" findings that are just "section missing" — useless noise |
| Citing an anchor without verifying content supports the claim | Misattribution: tool says TAG requires X when the document says the opposite |
| Running from memorized/cached documents | Stale guidance: checks against outdated versions that may have changed |
| Praising what's done well | Verbosity explosion, especially for large documents like Design Principles |
| Claiming "the TAG will..." | Undermines credibility; LLMs guess wrong about reviewer behavior |
| Everything is "Blocking" | Severity inflation makes the report feel hostile; use the calibration test |
## Evaluation
When this skill is modified, compare outputs before and after:
1. Run old version over test explainers (good, mediocre, bad).
2. Run new version over same explainers.
3. Present what changed: gaps added/removed, citation changes, severity shifts.
Use whatever tooling is available (parallel agents, worktrees, diff tools). If none, run sequentially and diff outputs.

Happy to run it on more proposals if you want to kick the tires.
updated the "v2 skill" above using /superpowers' /writing-skills. |
This is missing a bunch of infrastructure that we ought to use to test and optimize it, but it's working reasonably well to catch common mistakes in the explainers I've run it over. I'd love to get other folks' AI-optimization tips.