Create an explainer-review skill (#39)
One tester ran this over a Google doc, where one of the doc's comments advised the author to add some IDL to the proposed solution. This caused the LLM to incorrectly attribute the "add IDL" advice to the TAG, even though https://www.w3.org/TR/explainer-explainer/#template says "Do not include WebIDL in this section." This skill suggests loading https://raw.githubusercontent.com/w3ctag/explainer-explainer/refs/heads/main/index.bs instead, which contains only a Bikeshed instruction to include the template content, so the LLM probably never loaded our contrary advice. We could fix this by having the skill say to also load the template, but since the template structure is optional, I think we'll be better off ensuring that all guidance in the template also appears somewhere in the main document. That is probably also better for human readers.
I've been experimenting with similar self-review prompts. Here are some ideas for where this could scale, responding to your invitation for tips.

On the misattribution bug: it points to a structural principle. LLMs are bad at distinguishing "this document says X" from "someone near this document said X." A guardrail that helps: require every recommendation to cite a specific section of a fetched reference document. If the tool can't point to a source, it can't make the claim. Pair this with an explicit marker distinguishing "documented guidance [source, §section]" from "tool's analysis (not a TAG position)."

What if this covered the full review surface? The review templates ask about the Explainer Explainer, Design Principles, and S&P Questionnaire explicitly, but the TAG also evaluates accessibility, i18n, and societal impact during reviews. A more comprehensive self-assessment could help authors address all of those before submitting.

Possible architecture:

Key guardrails:

Happy to help iterate on any of this.
Following up on the architecture I sketched above, here's roughly what I'm thinking as a more complete skill definition. I haven't tested this yet, so treat it as a strawperson for discussion rather than something ready to ship. The idea is that this could work as a copy-paste prompt in any LLM (the reference URLs are listed explicitly), or as a structured skill in tools that support them.

**Explainer Pre-Flight skill (draft)**

# Explainer Pre-Flight
Comprehensive self-review for web platform explainers before TAG design review. Covers structural completeness, design principles, security/privacy, accessibility, internationalization, societal impact, and implementability.
## References
DO NOT rely on memorized copies of these documents. ALWAYS fetch them to get the latest version. When citing a section, verify the anchor exists in the copy you fetched.
| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |
For the Design Principles (R2), which is large: if you can run subagents, use one to extract relevant sections without consuming your main context.
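As a rough illustration of the fetch-and-verify rule (not part of the skill prompt itself), here is a minimal TypeScript sketch of fetching a reference document and checking that a cited anchor exists in the downloaded copy. The regex-based anchor scan is an assumption; a real implementation would parse the HTML.

```ts
// Sketch only: fetch a reference document and confirm a cited anchor exists
// in the copy that was actually downloaded, never in a memorized version.

async function fetchReference(url: string): Promise<string> {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Fetch failed for ${url}: HTTP ${res.status}`);
  }
  return res.text();
}

// Naive anchor check: looks for id="..." or name="..." in the raw HTML.
// Assumes the anchor is a plain slug; a real version would parse the DOM.
function anchorExists(html: string, anchor: string): boolean {
  return new RegExp(`(id|name)=["']${anchor}["']`).test(html);
}

// Example: verify a citation like [R2, §priority-of-constituencies].
async function verifyCitation(url: string, anchor: string): Promise<boolean> {
  return anchorExists(await fetchReference(url), anchor);
}
```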
## Guardrails
These rules apply to ALL output from this skill:
1. **Citation-only.** Every recommendation must cite a specific section of a fetched reference document. If you cannot point to a source, you cannot make the claim.
2. **Epistemic markers.** Label every finding as either:
- "Documented guidance [R#, §section]" — a requirement from a TAG/W3C document
- "Tool's analysis" — your reasoning, explicitly not a TAG position
3. **No claims about TAG behavior.** Never say "the TAG typically...", "reviewers often...", or "the TAG will likely..." These predictions are unreliable and undermine credibility. Say what the *documents* say.
4. **Fetch-and-quote.** When citing a reference, quote the relevant passage. If a fetch fails, suppress that check with a note rather than proceeding from memory.
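To make guardrails 1 and 2 above concrete, here is a hypothetical shape for a single finding; the type and field names are illustrative, not part of the skill. Structurally, a "documented" finding cannot be constructed without a source and section, which is the citation-only rule.

```ts
// Hypothetical structure for one finding. A "documented" finding cannot be
// constructed without a source and section (guardrail 1); everything else
// is the tool's own analysis and is labeled as such (guardrail 2).

type Finding =
  | {
      kind: "documented";     // backed by a fetched reference document
      source: string;         // e.g. "R2"
      section: string;        // e.g. "§priority-of-constituencies"
      quote: string;          // verbatim passage supporting the claim
      recommendation: string;
    }
  | {
      kind: "analysis";       // the tool's reasoning; not a TAG position
      recommendation: string;
    };

function render(finding: Finding): string {
  return finding.kind === "documented"
    ? `Documented guidance [${finding.source}, ${finding.section}]: ` +
      `${finding.recommendation}\n> ${finding.quote}`
    : `Tool's analysis (not a TAG position): ${finding.recommendation}`;
}
```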
## Workflow
### Step 1: Fetch & Classify
1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify the explainer's maturity:
- **Skeletal:** Missing most expected sections.
- **Developing:** Sections present but thin or incomplete.
- **Polished:** Substantive content throughout.
Maturity determines voice: coach skeletal explainers ("when you write this section, address..."), analyze polished ones ("your accessibility section mentions X but the screener also asks about Y because...").
**STOP** if the explainer cannot be fetched or is empty. Report the failure only.
### Step 2: Structural Completeness (R1)
Check the explainer against the Explainer Explainer's key components:
- User need / problem statement
- Why existing technology is insufficient
- Proposed solution with code examples
- Alternatives considered
- Security and Privacy considerations
- Open questions / known limitations
For each: present (quote it), missing (explain what belongs there, cite R1), or weak (quote what's there vs. what R1 asks for).
**STOP** if the explainer is skeletal to the point of being circular ("the feature is needed because it is useful"). Report structural gaps only. Deeper analysis of a phantom helps no one.
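Purely as a hedged sketch of how this structural check could be approximated mechanically, the snippet below scans the explainer's headings for the key components listed above. The keyword lists are assumptions; the authoritative list comes from the fetched copy of R1.

```ts
// Rough heuristic: report key components (per R1) whose expected keywords
// never appear in the explainer's headings. Keyword lists are illustrative.

const KEY_COMPONENTS: Record<string, string[]> = {
  "User need / problem statement": ["problem", "user need", "motivation"],
  "Why existing technology is insufficient": ["existing", "why not", "non-goals"],
  "Proposed solution with code examples": ["proposed", "solution", "api", "example"],
  "Alternatives considered": ["alternative"],
  "Security and Privacy considerations": ["security", "privacy"],
  "Open questions / known limitations": ["open question", "limitation"],
};

function missingComponents(explainerMarkdown: string): string[] {
  const headings = explainerMarkdown
    .split("\n")
    .filter((line) => line.startsWith("#"))
    .map((line) => line.toLowerCase());
  return Object.entries(KEY_COMPONENTS)
    .filter(([, keywords]) =>
      !headings.some((h) => keywords.some((k) => h.includes(k))))
    .map(([component]) => component);
}
```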
### Step 3: Domain Detection
Based on what the explainer describes, determine which Step 4 checks apply:
| Signal in explainer | Triggers |
|---------------------|----------|
| UI elements, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, environmental | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always runs |
Report which checks will run and which are skipped (with one-line reason).
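The signal table could be approximated by a simple keyword classifier; here is a sketch, with illustrative keyword lists that are assumptions rather than anything the TAG documents define.

```ts
// Illustrative signal-to-check mapping for Step 3. The keyword lists are
// assumptions; an LLM reading the explainer would replace this entirely.

const DOMAIN_SIGNALS: Record<string, string[]> = {
  "Accessibility (R6)": ["ui", "form", "input", "keyboard", "render"],
  "i18n (R7)": ["text", "locale", "encoding", "bidi", "script"],
  "S&P (R3)": ["user data", "permission", "sensor", "device", "state"],
  "Societal (R4, R5)": ["surveillance", "centralization", "environment"],
};

function triggeredChecks(explainerText: string): string[] {
  const text = explainerText.toLowerCase();
  const triggered = Object.entries(DOMAIN_SIGNALS)
    .filter(([, signals]) => signals.some((s) => text.includes(s)))
    .map(([check]) => check);
  // Design Principles (R2) always runs for any web-facing feature.
  return ["Design Principles (R2)", ...triggered];
}
```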
### Step 4: Question-Driven Checks
For each triggered domain:
**4a. Design Principles (R2)**
Identify the 3-5 principles most relevant to this proposal. Quote each verbatim. Assess compliance. At minimum always check: "Put user needs first", "It should be safe to visit a web page", "Prefer simple solutions."
**4b. Security & Privacy (R3)**
Identify relevant questions from the questionnaire. Has the explainer answered them (inline or linked)? Flag unanswered questions that clearly apply.
**4c. Accessibility (R6)**
Run the screener questions. If any trigger positive, note that Accessible Platform Architectures (APA) review will likely be needed. Cite relevant WCAG success criteria if identifiable.
**4d. Internationalization (R7)**
Check relevant items: text handling, locale sensitivity, bidi, encoding, character processing. Flag gaps.
**4e. Societal Impact (R4 + R5)**
Check relevant ethical principles (R4) and societal impact questions (R5). Environmental sustainability applies if the feature has energy/resource/network implications.
**4f. Implementability** *(Tool's analysis)*
- Does the proposed solution demonstrate the author understands what building this requires?
- Are there performance, engine architecture, or cross-platform feasibility concerns unacknowledged?
- Do the "alternatives considered" reflect real engineering trade-offs, or just aesthetic preferences?
### Step 5: Gap & Stakeholder Analysis *(Tool's analysis)*
Clearly mark this entire section as analytical, not TAG guidance.
- **Gap search:** Are there existing specs (W3C, WHATWG, IETF, TC39) that already solve part or all of this problem? If found, quote the relevant section. Frame as "these existing specs may be relevant" not "this is already solved."
- **Audience simulation:** What happens when this feature is used by:
- A developer who doesn't read the docs (naive hands)
- Someone deliberately trying to abuse it (malicious hands)
- A third-party script without user-interest alignment
- **Multi-stakeholder signals:** Is there engagement beyond a single vendor? Are browser positions filed? This maps to the TAG review form's "Feedback so far" field.
### Step 6: Verdict Table
Present findings in a summary table:
| Dimension | Status | Source | Action needed |
|-----------|--------|--------|---------------|
| Problem statement | Clear / Vague / Missing | R1 | |
| Existing solutions | Addressed / Gap / Not checked | Analysis | |
| Proposed solution | Clear / Hand-wavy / Missing | R1 | |
| Design Principles | Compliant / Issues / Conflicts | R2 §X | |
| S&P Questionnaire | Answered / Gaps / Missing | R3 | |
| Accessibility | N/A / Addressed / Gaps | R6 | |
| Internationalization | N/A / Addressed / Gaps | R7 | |
| Societal/environmental | N/A / Addressed / Gaps | R4, R5 | |
| Implementability | Demonstrated / Unclear | Analysis | |
| Multi-stakeholder | Broad / Emerging / Single-vendor | Analysis | |
| Alternatives considered | Substantive / Shallow / Missing | R1 | |
| Ready for TAG review | Yes / After [list] / Not yet | — | |
Tier each finding: **Blocking** (must fix before submitting), **Gap** (should address), **Suggestion** (would strengthen).
## Constraints
- DO NOT claim the TAG will or won't approve a proposal. This tool helps authors prepare; it does not predict outcomes.
- Help authors keep the explainer concise. If something is already well-handled, don't belabor it.
- Express analytical findings (Steps 4f, 5) tentatively. Frame as questions to consider, not verdicts.
- If a reference doc fetch fails, state which check was skipped and why. Do not proceed from memory.

The guardrails section is the part I feel strongest about. The misattribution bug you found is exactly the kind of thing "citation-only" and "no claims about TAG behavior" are designed to prevent. The domain detection (Step 3) is the part I'm least sure about: whether skipping checks is worth the complexity vs. just running everything every time. Curious what you think, and whether this is roughly the right scope or if it should be narrower for a first pass.
Based on your feedback, I iterated the skill through three rounds and validated it against actual TAG reviews. Here's what I found.

**Changes since v2**
**Validation against real TAG reviews**

Tested against 9 proposals from w3ctag/design-reviews plus 1 charter (which the tool correctly rejects):
Calibration holds: TAG-unsatisfied → 3+ Blocking; TAG-satisfied → 0 Blocking and a short report. Caveat: I'm running the tool against the current explainers but comparing against TAG feedback from the time of review. Some explainers have been updated since, so some "tool misses" might be things the explainer already fixed post-review.

**What the tool catches vs. misses**

Catches: structural gaps (R1), S&P questionnaire coverage (R3), priority-of-constituencies violations (R2 §1.1), "why not extend the existing primitive?" (R2 §new-features), fingerprinting, accessibility screener items (R6), interop coercion, and ethical/societal concerns (R4).

Misses (inherent): "you're solving the wrong problem," specific architectural alternatives that require platform expertise, cross-feature interactions, OS-level implementability, and cultural/contextual concerns.

The boundary: the tool reads the explainer's arguments and evaluates their completeness against the reference documents. It catches "your argument is incomplete" but can't reach "you should use a different API entirely." That's the TAG's highest-value contribution and isn't replaceable.

**Open questions**
**Updated skill (v3)**

---
name: explainer-preflight
description: Use when reviewing a web platform explainer or proposal before TAG design review. Also use when asked to check an explainer against TAG documents, prepare for a design review submission, or self-assess an explainer's readiness. Not for reviewing finished specs, implementations, or TAG review responses.
---
# Explainer Pre-Flight
Self-review tool for web platform explainers before TAG design review. Walks through TAG reference documents and surfaces gaps: where the explainer is thin or silent on things the TAG documents ask for.
## When NOT to Use
- Reviewing a finished W3C specification (use spec review tooling)
- Reviewing an implementation or code (not a design review tool)
- Responding to TAG review feedback (that's a different workflow)
- Evaluating TAG reviews themselves
## References
ALWAYS fetch these from the internet to get the latest version. Never rely on memorized copies. When linking to anchors, verify that the anchor exists in the copy you just downloaded, and that the content of the section supports the point it's being cited to support. Check citations before presenting them to the user, using a subagent if possible.
| # | Document | URL |
|---|----------|-----|
| R1 | Explainer Explainer | https://w3ctag.github.io/explainer-explainer/ |
| R2 | Web Platform Design Principles | https://www.w3.org/TR/design-principles/ |
| R3 | Self-Review Questionnaire: Security and Privacy | https://www.w3.org/TR/security-privacy-questionnaire/ |
| R4 | Ethical Web Principles | https://www.w3.org/TR/ethical-web-principles/ |
| R5 | Societal Impact Questionnaire | https://w3ctag.github.io/societal-impact-questionnaire/ |
| R6 | Accessibility Screener | https://w3ctag.github.io/accessibility-screener/ |
| R7 | Internationalization Best Practices for Spec Developers | https://www.w3.org/TR/international-specs/ |
For Design Principles (R2), which is large: use a subagent to extract relevant sections if possible.
## Guardrails
1. **Citation-only.** Every finding must cite a specific section of a fetched reference document, verified by subagent (Step 5). No source, no claim.
2. **Epistemic markers.** Label findings: "Documented guidance [R#, §section]" vs. "Tool's analysis."
3. **No claims about TAG behavior.** Never "the TAG typically..." or "reviewers often..." Say what the documents say.
4. **Gaps only.** Do not mention things the explainer handles adequately. Do not compliment. Output is proportional to problems.
## Severity Criteria
Consistent severity is critical. Use these definitions:
- **Blocking:** The TAG review form explicitly requires this (e.g., R1 says "key components" or R3 requires dedicated considerations sections), AND the explainer is completely silent on it. A reviewer cannot proceed without this information.
- **Gap:** A reference document asks a relevant question or states a relevant principle, AND the explainer is thin or silent, but the review could proceed. The TAG would likely raise this but it wouldn't stop the review.
- **Suggestion:** The tool's own analysis identifies something that would strengthen the explainer, or a reference document principle is tangentially relevant. Not something the documents require.
**Calibration test:** If the finding is "section X is entirely missing and R1 lists it as a key component" → Blocking. If "R3 asks question Y and the explainer doesn't answer it but also doesn't claim to" → Gap. If "the explainer could be stronger here" → Suggestion.
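Read as code, the calibration test is a small decision function. The sketch below is illustrative only; the field names are assumptions, and real severity depends on what the fetched documents actually require.

```ts
// Sketch of the calibration test as a decision function. Field names are
// illustrative; real severity depends on what the fetched documents require.

type Severity = "Blocking" | "Gap" | "Suggestion";

interface CandidateFinding {
  requiredByReviewForm: boolean; // e.g. an R1 key component or R3 considerations section
  explainerSilent: boolean;      // the explainer says nothing about it at all
  fromToolAnalysis: boolean;     // the tool's own reasoning, not a document requirement
}

function severity(c: CandidateFinding): Severity {
  if (c.fromToolAnalysis) return "Suggestion";
  if (c.requiredByReviewForm && c.explainerSilent) return "Blocking";
  return "Gap";
}
```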
## Workflow
### Step 0: Document Validation
Before starting analysis, confirm the target URL is actually an explainer:
- Does it have a problem statement, proposed solution, or similar structure?
- If it appears to be a polyfill README, spec text, changelog, or redirect: **STOP.** Report that the URL is not an explainer. If you can identify the actual explainer (e.g., linked from the repo, from a TAG design review issue, or from an Open UI/WICG page), note its location and ask whether to analyze that instead.
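A very rough heuristic sketch of this validation step, with telltale markers that are assumptions rather than a vetted list:

```ts
// Heuristic sketch: guess whether a fetched document is an explainer at all.
// Both marker lists are illustrative, not an authoritative classifier.

function looksLikeExplainer(text: string): boolean {
  const lower = text.toLowerCase();
  const explainerSignals = ["problem", "user need", "proposed", "goals"];
  const otherDocSignals = ["polyfill", "changelog", "this version", "editor's draft"];
  const hasExplainerStructure = explainerSignals.some((s) => lower.includes(s));
  const looksLikeSomethingElse = otherDocSignals.some((s) => lower.includes(s));
  return hasExplainerStructure && !looksLikeSomethingElse;
}
```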
### Step 1: Fetch & Classify
1. Fetch the target explainer.
2. Fetch reference documents R1-R7. Note any fetch failures.
3. Classify maturity: **Skeletal** (missing most sections), **Developing** (thin), **Polished** (substantive throughout).
Maturity determines voice: coach skeletal ("when you write this section, address..."), analyze polished directly.
**STOP** if the explainer cannot be fetched or is empty.
### Step 2: Structural Completeness (R1)
Check against the Explainer Explainer's key components. Report only gaps:
- Missing: explain what belongs there, cite R1.
- Weak: quote what R1 asks for, show where the explainer falls short.
**STOP** if skeletal to the point of being circular. Report structural gaps only.
### Step 3: Domain Detection
| Signal in explainer | Triggers |
|---------------------|----------|
| UI, forms, input, keyboard, visual rendering | Accessibility (R6) |
| Text, locale, encoding, bidi, non-Latin scripts | i18n (R7) |
| User data, permissions, device/sensor access, state | S&P (R3) |
| Societal-scale impact, surveillance, centralization, AI/ML, advertising, content generation | Societal (R4, R5) |
| Any web-facing feature | Design Principles (R2) — always |
Report which checks run and which are skipped (one-line reason).
### Step 4: Question-Driven Checks
For each triggered domain, report only where the explainer is thin or silent:
- **4a. Design Principles (R2):** Surface only principles the explainer doesn't address. Quote each and explain relevance.
- **4b. Security & Privacy (R3):** Flag only unanswered questions that clearly apply.
- **4c. Accessibility (R6):** If screener triggers positive, note APA review needed. Cite WCAG.
- **4d. Internationalization (R7):** Flag only gaps.
- **4e. Societal Impact (R4 + R5):** Flag only gaps.
- **4f. Platform Fit** *(Tool's analysis, informed by R1 and R2):*
- **Closest existing primitive:** What is the closest existing web platform feature to this proposal? (e.g., `window.open()` for new windows, Popover API for overlay UI, MediaCapabilities for media adaptation). Does the explainer explicitly argue why extending that primitive was rejected? R2 (§prefer-simple-solutions) states "prefer simple solutions" and (§new-features) states "add new capabilities with consideration of existing functionality." If the explainer proposes a new API surface without addressing the obvious existing primitive, flag it. This is the single most common TAG architectural objection.
- **Interop coercion risk:** If the feature only delivers value when all browsers implement it, and there's no graceful degradation path, sites may coerce users into specific browsers. R2 (§devices-platforms) requires features to work "across a range of platforms." Flag if the explainer doesn't discuss what happens when the feature is absent and whether sites could weaponize feature detection.
- **Cross-feature interactions:** Does the explainer address how this interacts with other platform features that operate in the same user-facing domain? (e.g., hover behavior interacting with Speculation Rules prefetch-on-hover; new window APIs overlapping with Popover; device APIs overlapping with existing sensor APIs). If the feature shares a user-facing surface with another feature and the interaction is unaddressed, flag it.
- **Evidence of utility:** R1 (§user-research) encourages user research. Beyond that, does the explainer provide evidence that the proposed solution actually works? Not theoretical scenarios, but concrete developer feedback, usage data, or prototype results demonstrating the design is correct.
### Step 5: Citation Verification
Before presenting findings, verify every citation. Use a subagent if possible.
For each citation:
1. **Anchor exists** in the fetched document.
2. **Content supports the claim** being made.
Remove any citation that fails. Drop any finding with no surviving citation. This is architectural, not optional.
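Here is a sketch of the verification filter, assuming a `verifyCitation(url, anchor)` helper like the one sketched earlier in the thread; the types are illustrative.

```ts
// Sketch: drop citations that fail verification, then drop findings whose
// last citation was removed (guardrail 1: no source, no claim).

interface Citation { url: string; anchor: string; }
interface CandidateReport { text: string; citations: Citation[]; }

async function filterVerified(
  findings: CandidateReport[],
  verifyCitation: (url: string, anchor: string) => Promise<boolean>,
): Promise<CandidateReport[]> {
  const kept: CandidateReport[] = [];
  for (const finding of findings) {
    const checks = await Promise.all(
      finding.citations.map((c) => verifyCitation(c.url, c.anchor)),
    );
    const citations = finding.citations.filter((_, i) => checks[i]);
    if (citations.length > 0) kept.push({ ...finding, citations });
  }
  return kept;
}
```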
### Step 6: Gaps Report
For each document analyzed, present only gaps:
> **Design Principles (R2):** Your explainer doesn't address [principle] (§anchor), which is relevant because [quoted passage].
>
> **S&P Questionnaire (R3):** Questions N and M are relevant but aren't addressed. [Quote what they ask.]
- Tier findings: **Blocking** / **Gap** / **Suggestion** per the Severity Criteria above.
- End with one line: "Ready for TAG review" or "Address N blocking gaps first."
**Tone for proponents:** This report is meant to help authors prepare, not to gatekeep. Frame findings as "the documents ask about X, and your explainer doesn't address it yet" rather than "your explainer fails to..." The goal is that an author reads this and knows exactly what to add, not that they feel judged.
**Structure for readability:** Group findings thematically (security cluster, structural cluster, platform fit cluster) rather than as a flat numbered list. A report with "8 Blocking" findings feels like a failing grade. A report organized as "here's a security cluster you need to address, and separately here's a structural cluster" feels like actionable guidance.
## Common Mistakes
| Mistake | What happens |
|---------|--------------|
| Analyzing a non-explainer document (polyfill README, spec text, redirect) | Produces 10+ structural "blocking" findings that are just "section missing" — useless noise |
| Citing an anchor without verifying content supports the claim | Misattribution: tool says TAG requires X when the document says the opposite |
| Running from memorized/cached documents | Stale guidance: checks against outdated versions that may have changed |
| Praising what's done well | Verbosity explosion, especially for large documents like Design Principles |
| Claiming "the TAG will..." | Undermines credibility; LLMs guess wrong about reviewer behavior |
| Everything is "Blocking" | Severity inflation makes the report feel hostile; use the calibration test |
## Evaluation
When this skill is modified, compare outputs before and after:
1. Run old version over test explainers (good, mediocre, bad).
2. Run new version over same explainers.
3. Present what changed: gaps added/removed, citation changes, severity shifts.
Use whatever tooling is available (parallel agents, worktrees, diff tools). If none, run sequentially and diff outputs.

Happy to run it on more proposals if you want to kick the tires.
updated the "v2 skill" above using /superpowers' /writing-skills. |
This is missing a bunch of infrastructure that we ought to use to test and optimize it, but it's working reasonably well to catch common mistakes in the explainers I've run it over. I'd love to get other folks' AI-optimization tips.