---
name: prompt-testing
description: >
  Methodology for unit testing PromptKit-generated prompts by comparing
  them against known-good reference prompts. Defines a structured gap
  analysis process for validating prompt quality.
---
This document describes how to unit test prompts generated by PromptKit by comparing them against known-good reference prompts.

## Why Test Prompts

Prompts are code. Like code, they can have bugs: missing instructions, wrong scoping, vague requirements, absent guardrails. Testing prompts against known-good references catches these defects before they produce poor LLM output.
## The Method

1. Hand-craft a high-quality prompt for a specific task (the "reference").
2. Use the PromptKit bootstrap to generate a prompt for the same task.
3. Perform structured gap analysis between the two.
4. Feed gaps back into the library as improvements.
## Step 1: Create a Reference Prompt

Write a prompt by hand (or collect one that produced excellent results) for a specific, real task. This is your ground truth.
Good references are:
- Task-specific: written for a concrete problem, not a generic template.
- Battle-tested: used in practice and known to produce good output.
- Complete: includes all context, constraints, and deliverables needed.
Store reference prompts in a `tests/references/` directory:

```
tests/
└── references/
    ├── investigate-stack-corruption.txt
    ├── author-auth-requirements.txt
    └── review-c-networking-code.txt
```
## Step 2: Generate the Prompt Under Test

Use the bootstrap prompt to generate a prompt for the same task as the reference. Provide the same problem description and context. Save the assembled output:
```
tests/
└── generated/
    ├── investigate-stack-corruption.md
    ├── author-auth-requirements.md
    └── review-c-networking-code.md
```
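With references and generated outputs stored side by side as shown above, pairing them up for comparison is mechanical. A minimal sketch, assuming the directory layout from the trees (note the extensions differ: `.txt` references, `.md` generated outputs), with a hypothetical `paired_prompts` helper:

```python
from pathlib import Path


def paired_prompts(ref_dir: str = "tests/references",
                   gen_dir: str = "tests/generated"):
    """Yield (reference, generated) path pairs matched by filename stem.

    A reference with no matching generated prompt is skipped; the suite
    can flag those separately as untested references.
    """
    generated = {p.stem: p for p in Path(gen_dir).iterdir() if p.is_file()}
    for ref in sorted(Path(ref_dir).iterdir()):
        if ref.is_file() and ref.stem in generated:
            yield ref, generated[ref.stem]
```

Matching on the stem rather than the full name keeps the pairing robust to the extension mismatch between hand-written and assembled files.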
## Step 3: Gap Analysis

Compare the reference and generated prompts across the dimensions below. For each dimension, classify coverage as ✅ Covered, ⚠️ Partial, or ❌ Missing.
### Task Framing

| Check | Question |
|---|---|
| Goal statement | Is the objective clearly stated? |
| Success criteria | Are concrete deliverables defined? |
| Non-goals | Is scope explicitly bounded (what NOT to do)? |
| Context definition | Are domain-specific terms and boundaries defined? |
### Reasoning Methodology

| Check | Question |
|---|---|
| Reasoning protocol | Is a systematic analysis method prescribed? |
| Hypothesis generation | Does it require multiple hypotheses before investigating? |
| Evidence requirements | Must claims be backed by citations or code excerpts? |
| Anti-hallucination | Are fabrication guardrails present? |
### Output Specification

| Check | Question |
|---|---|
| Output format | Is the expected output structure defined? |
| Deliverable artifacts | Are specific files/documents listed? |
| Classification scheme | Is a domain-specific taxonomy provided for findings? |
| Severity/ranking | Are prioritization criteria defined? |
### Operational Guidance

| Check | Question |
|---|---|
| Scoping strategy | Does it tell the LLM how to scope its work? |
| Tool usage | Does it guide how to use available tools effectively? |
| Step-by-step plan | Is a concrete procedural plan provided? |
| Parallelization | Does it suggest how to split work (if applicable)? |
### Quality Assurance

| Check | Question |
|---|---|
| Self-verification | Must the LLM verify its own output? |
| Sampling checks | Must specific items be spot-checked? |
| Coverage statement | Must the LLM document what it did/didn't examine? |
| Consistency check | Must findings be internally consistent? |
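The dimension tables above can be pre-screened mechanically before the full manual (or LLM-driven) comparison. A minimal sketch of a keyword screen; the patterns are illustrative assumptions, not part of the methodology, and a hit count of zero only *suggests* ❌ Missing:

```python
import re

# Crude per-dimension signals. These are illustrative guesses at phrases
# a well-formed prompt tends to contain -- tune them to your templates.
DIMENSION_SIGNALS = {
    "task_framing": [r"\bgoal\b", r"success criteria", r"non-goals?",
                     r"out of scope"],
    "reasoning": [r"hypothes[ie]s", r"\bevidence\b", r"citation|cite",
                  r"fabricat|hallucinat"],
    "output_spec": [r"output format", r"deliverable", r"severity",
                    r"taxonomy|classif"],
    "operational": [r"\bscope\b", r"\btools?\b", r"step[- ]by[- ]step",
                    r"parallel"],
    "quality": [r"verify", r"spot[- ]check|sampl", r"coverage",
                r"consisten"],
}


def screen(prompt_text: str) -> dict:
    """Return per-dimension signal hit counts for one prompt."""
    text = prompt_text.lower()
    return {
        dim: sum(bool(re.search(pat, text)) for pat in patterns)
        for dim, patterns in DIMENSION_SIGNALS.items()
    }
```

Running the screen on both prompts and diffing the counts gives a cheap first pass; the structured gap analysis remains the authoritative check.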
Produce a gap report:

```markdown
# Prompt Test Report: <task name>

## Reference: <path to reference prompt>
## Generated: <path to generated prompt>

## Gap Summary

| Dimension | Score | Critical Gaps |
|-----------|-------|---------------|
| Task Framing | ⚠️ Partial | Missing non-goals, no file deliverables |
| Reasoning | ✅ Covered | — |
| Output Spec | ❌ Missing | No task-specific taxonomy |
| Operational | ❌ Missing | No scoping strategy, no step-by-step plan |
| Quality | ⚠️ Partial | Has anti-hallucination but no self-check |

## Detailed Gaps

### Gap 1: <description>
- **Reference has**: <what the reference includes>
- **Generated has**: <what the PromptKit output includes (or "nothing")>
- **Impact**: <what goes wrong if this is missing>
- **Fix**: <what library change would address this>

### Gap 2: ...
```

## Step 4: Feed Gaps Back into the Library

For each gap identified:
- Determine if it is a structural gap (library architecture needs a new layer, protocol, or mechanism) or a content gap (existing template/protocol needs more content).
- File it as an improvement to the library.
- After fixing, re-run the comparison to verify the gap is closed.
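Tracking each gap as a structured record makes the structural-vs-content classification and the close-the-loop re-check explicit. A sketch of one possible shape; the field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Gap:
    """One finding from a gap report, ready to file as a library issue."""
    dimension: str                           # e.g. "Output Spec"
    description: str
    kind: Literal["structural", "content"]   # new mechanism vs. more content
    reference_has: str
    generated_has: str                       # or "nothing"
    fix: str                                 # proposed library change
    closed: bool = False                     # flip after re-running Step 3
```

A gap is only `closed` once the comparison has been re-run and the dimension scores Covered, mirroring the last bullet above.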
## Automating the Comparison

You can automate Step 3 by using an LLM to perform the comparison:
```
You are a prompt quality analyst. Compare the following two prompts
for the same task and identify gaps.

## Reference Prompt (known good):
<paste reference>

## Generated Prompt (under test):
<paste generated>

## Instructions:
For each of the following dimensions, classify coverage as
✅ Covered, ⚠️ Partial, or ❌ Missing. List specific gaps.

1. Task Framing (goal, success criteria, non-goals, context)
2. Reasoning Methodology (protocols, hypothesis, evidence, anti-hallucination)
3. Output Specification (format, artifacts, taxonomy, ranking)
4. Operational Guidance (scoping, tools, plan, parallelization)
5. Quality Assurance (self-verification, sampling, coverage, consistency)
```

## Regression Testing

When modifying the library (new protocols, format changes, template updates), re-run all reference comparisons to ensure:
- Previously-covered dimensions remain covered.
- No new gaps are introduced by structural changes.
- The overall quality score improves or stays constant.
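The three regression conditions above reduce to comparing per-dimension statuses against a recorded baseline. A minimal sketch, assuming each comparison run produces a nested mapping of task name to dimension statuses (`"covered"`, `"partial"`, `"missing"`); the data shape is an assumption for illustration:

```python
# Order statuses so "got worse" is a simple rank comparison.
RANK = {"missing": 0, "partial": 1, "covered": 2}


def regressions(current: dict, baseline: dict) -> list:
    """Return (task, dimension) pairs whose status dropped vs. baseline.

    A dimension absent from the current run counts as "missing", so
    structural changes that silently drop a dimension are caught too.
    """
    worse = []
    for task, dims in baseline.items():
        for dim, old in dims.items():
            new = current.get(task, {}).get(dim, "missing")
            if RANK[new] < RANK[old]:
                worse.append((task, dim))
    return worse
```

Wiring this into CI (fail the build when `regressions(...)` is non-empty) enforces the "previously-covered stays covered" rule automatically.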
## Building a Reference Suite

Over time, accumulate reference prompts across different task types:
| Category | Reference | Tests |
|---|---|---|
| Investigation | Stack corruption in C driver code | Task framing, taxonomy, operational |
| Document authoring | Auth system requirements | Completeness, anti-hallucination |
| Code review | Security review of web API | Taxonomy, severity, coverage |
| Planning | Database migration plan | Deliverables, risk, phasing |
A healthy library should have at least one reference prompt per template category.
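That one-reference-per-category rule can be checked mechanically. A sketch under the assumption that each category can be mapped to a filename-substring hint (e.g. investigation references contain "investigate"); both the mapping and the hints are illustrative:

```python
from pathlib import Path


def missing_categories(categories: dict,
                       ref_dir: str = "tests/references") -> list:
    """Return category names with no matching reference prompt.

    `categories` maps a category name to a substring expected in at
    least one reference filename, e.g.
    {"investigation": "investigate", "code review": "review"}.
    """
    stems = [p.stem for p in Path(ref_dir).glob("*")]
    return [cat for cat, hint in categories.items()
            if not any(hint in stem for stem in stems)]
```

An empty return value means the suite meets the coverage bar; anything else names the template categories that still need a hand-crafted reference.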