---
name: prompt-testing
description: >
  Methodology for unit testing PromptKit-generated prompts by comparing
  them against known-good reference prompts. Defines a structured gap
  analysis process for validating prompt quality.
---
This document describes how to unit test prompts generated by PromptKit by comparing them against known-good reference prompts.

## Why Test Prompts

Prompts are code. Like code, they can have bugs: missing instructions, wrong scoping, vague requirements, absent guardrails. Testing prompts against known-good references catches these defects before they produce poor LLM output.
## The Method

1. Hand-craft a high-quality prompt for a specific task (the "reference").
2. Use the PromptKit bootstrap to generate a prompt for the same task.
3. Perform structured gap analysis between the two.
4. Feed gaps back into the library as improvements.
## Step 1: Create a Reference Prompt

Write a prompt by hand (or collect one that produced excellent results) for a specific, real task. This is your ground truth.
Good references are:
- Task-specific: written for a concrete problem, not a generic template.
- Battle-tested: used in practice and known to produce good output.
- Complete: includes all context, constraints, and deliverables needed.
Store reference prompts in a `tests/references/` directory:

```
tests/
└── references/
    ├── investigate-stack-corruption.txt
    ├── author-auth-requirements.txt
    └── review-c-networking-code.txt
```
## Step 2: Generate the Prompt Under Test

Use the bootstrap prompt to generate a prompt for the same task as the reference. Provide the same problem description and context. Save the assembled output:
```
tests/
└── generated/
    ├── investigate-stack-corruption.md
    ├── author-auth-requirements.md
    └── review-c-networking-code.md
```
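With references and generated outputs stored side by side as shown above, pairing them up for comparison is mechanical. A minimal sketch, assuming the directory layout from the trees (note the extensions differ: `.txt` references, `.md` generated outputs), with a hypothetical `paired_prompts` helper:

```python
from pathlib import Path


def paired_prompts(ref_dir: str = "tests/references",
                   gen_dir: str = "tests/generated"):
    """Yield (reference, generated) path pairs matched by filename stem.

    A reference with no matching generated prompt is skipped; the suite
    can flag those separately as untested references.
    """
    generated = {p.stem: p for p in Path(gen_dir).iterdir() if p.is_file()}
    for ref in sorted(Path(ref_dir).iterdir()):
        if ref.is_file() and ref.stem in generated:
            yield ref, generated[ref.stem]
```

Matching on the stem rather than the full name keeps the pairing robust to the extension mismatch between hand-written and assembled files.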
## Step 3: Gap Analysis

Compare the reference and generated prompts across the dimensions below. For each dimension, classify coverage as ✅ Covered, ⚠️ Partial, or ❌ Missing.
### Task Framing

| Check | Question |
|---|---|
| Goal statement | Is the objective clearly stated? |
| Success criteria | Are concrete deliverables defined? |
| Non-goals | Is scope explicitly bounded (what NOT to do)? |
| Context definition | Are domain-specific terms and boundaries defined? |
### Reasoning Methodology

| Check | Question |
|---|---|
| Reasoning protocol | Is a systematic analysis method prescribed? |
| Hypothesis generation | Does it require multiple hypotheses before investigating? |
| Evidence requirements | Must claims be backed by citations or code excerpts? |
| Anti-hallucination | Are fabrication guardrails present? |
### Output Specification

| Check | Question |
|---|---|
| Output format | Is the expected output structure defined? |
| Deliverable artifacts | Are specific files/documents listed? |
| Classification scheme | Is a domain-specific taxonomy provided for findings? |
| Severity/ranking | Are prioritization criteria defined? |
### Operational Guidance

| Check | Question |
|---|---|
| Scoping strategy | Does it tell the LLM how to scope its work? |
| Tool usage | Does it guide how to use available tools effectively? |
| Step-by-step plan | Is a concrete procedural plan provided? |
| Parallelization | Does it suggest how to split work (if applicable)? |
### Quality Assurance

| Check | Question |
|---|---|
| Self-verification | Must the LLM verify its own output? |
| Sampling checks | Must specific items be spot-checked? |
| Coverage statement | Must the LLM document what it did/didn't examine? |
| Consistency check | Must findings be internally consistent? |
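The dimension tables above can be pre-screened mechanically before the full manual (or LLM-driven) comparison. A minimal sketch of a keyword screen; the patterns are illustrative assumptions, not part of the methodology, and a hit count of zero only *suggests* ❌ Missing:

```python
import re

# Crude per-dimension signals. These are illustrative guesses at phrases
# a well-formed prompt tends to contain -- tune them to your templates.
DIMENSION_SIGNALS = {
    "task_framing": [r"\bgoal\b", r"success criteria", r"non-goals?",
                     r"out of scope"],
    "reasoning": [r"hypothes[ie]s", r"\bevidence\b", r"citation|cite",
                  r"fabricat|hallucinat"],
    "output_spec": [r"output format", r"deliverable", r"severity",
                    r"taxonomy|classif"],
    "operational": [r"\bscope\b", r"\btools?\b", r"step[- ]by[- ]step",
                    r"parallel"],
    "quality": [r"verify", r"spot[- ]check|sampl", r"coverage",
                r"consisten"],
}


def screen(prompt_text: str) -> dict:
    """Return per-dimension signal hit counts for one prompt."""
    text = prompt_text.lower()
    return {
        dim: sum(bool(re.search(pat, text)) for pat in patterns)
        for dim, patterns in DIMENSION_SIGNALS.items()
    }
```

Running the screen on both prompts and diffing the counts gives a cheap first pass; the structured gap analysis remains the authoritative check.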
Produce a gap report:

```markdown
# Prompt Test Report: <task name>

## Reference: <path to reference prompt>
## Generated: <path to generated prompt>

## Gap Summary

| Dimension | Score | Critical Gaps |
|-----------|-------|---------------|
| Task Framing | ⚠️ Partial | Missing non-goals, no file deliverables |
| Reasoning | ✅ Covered | — |
| Output Spec | ❌ Missing | No task-specific taxonomy |
| Operational | ❌ Missing | No scoping strategy, no step-by-step plan |
| Quality | ⚠️ Partial | Has anti-hallucination but no self-check |

## Detailed Gaps

### Gap 1: <description>
- **Reference has**: <what the reference includes>
- **Generated has**: <what the PromptKit output includes (or "nothing")>
- **Impact**: <what goes wrong if this is missing>
- **Fix**: <what library change would address this>

### Gap 2: ...
```

## Step 4: Feed Gaps Back into the Library

For each gap identified:
- Determine if it is a structural gap (library architecture needs a new layer, protocol, or mechanism) or a content gap (existing template/protocol needs more content).
- File it as an improvement to the library.
- After fixing, re-run the comparison to verify the gap is closed.
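Tracking each gap as a structured record makes the structural-vs-content classification and the close-the-loop re-check explicit. A sketch of one possible shape; the field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Gap:
    """One finding from a gap report, ready to file as a library issue."""
    dimension: str                           # e.g. "Output Spec"
    description: str
    kind: Literal["structural", "content"]   # new mechanism vs. more content
    reference_has: str
    generated_has: str                       # or "nothing"
    fix: str                                 # proposed library change
    closed: bool = False                     # flip after re-running Step 3
```

A gap is only `closed` once the comparison has been re-run and the dimension scores Covered, mirroring the last bullet above.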
## Automating the Comparison

You can automate Step 3 by using an LLM to perform the comparison:
```
You are a prompt quality analyst. Compare the following two prompts
for the same task and identify gaps.

## Reference Prompt (known good):
<paste reference>

## Generated Prompt (under test):
<paste generated>

## Instructions:
For each of the following dimensions, classify coverage as
✅ Covered, ⚠️ Partial, or ❌ Missing. List specific gaps.

1. Task Framing (goal, success criteria, non-goals, context)
2. Reasoning Methodology (protocols, hypothesis, evidence, anti-hallucination)
3. Output Specification (format, artifacts, taxonomy, ranking)
4. Operational Guidance (scoping, tools, plan, parallelization)
5. Quality Assurance (self-verification, sampling, coverage, consistency)
```

## Regression Testing

When modifying the library (new protocols, format changes, template updates), re-run all reference comparisons to ensure:
- Previously-covered dimensions remain covered.
- No new gaps are introduced by structural changes.
- The overall quality score improves or stays constant.
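The three regression conditions above reduce to comparing per-dimension statuses against a recorded baseline. A minimal sketch, assuming each comparison run produces a nested mapping of task name to dimension statuses (`"covered"`, `"partial"`, `"missing"`); the data shape is an assumption for illustration:

```python
# Order statuses so "got worse" is a simple rank comparison.
RANK = {"missing": 0, "partial": 1, "covered": 2}


def regressions(current: dict, baseline: dict) -> list:
    """Return (task, dimension) pairs whose status dropped vs. baseline.

    A dimension absent from the current run counts as "missing", so
    structural changes that silently drop a dimension are caught too.
    """
    worse = []
    for task, dims in baseline.items():
        for dim, old in dims.items():
            new = current.get(task, {}).get(dim, "missing")
            if RANK[new] < RANK[old]:
                worse.append((task, dim))
    return worse
```

Wiring this into CI (fail the build when `regressions(...)` is non-empty) enforces the "previously-covered stays covered" rule automatically.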
## Building a Reference Suite

Over time, accumulate reference prompts across different task types:
| Category | Reference | Tests |
|---|---|---|
| Investigation | Stack corruption in C driver code | Task framing, taxonomy, operational |
| Document authoring | Auth system requirements | Completeness, anti-hallucination |
| Code review | Security review of web API | Taxonomy, severity, coverage |
| Planning | Database migration plan | Deliverables, risk, phasing |
A healthy library should have at least one reference prompt per template category.
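That one-reference-per-category rule can be checked mechanically. A sketch under the assumption that each category can be mapped to a filename-substring hint (e.g. investigation references contain "investigate"); both the mapping and the hints are illustrative:

```python
from pathlib import Path


def missing_categories(categories: dict,
                       ref_dir: str = "tests/references") -> list:
    """Return category names with no matching reference prompt.

    `categories` maps a category name to a substring expected in at
    least one reference filename, e.g.
    {"investigation": "investigate", "code review": "review"}.
    """
    stems = [p.stem for p in Path(ref_dir).glob("*")]
    return [cat for cat, hint in categories.items()
            if not any(hint in stem for stem in stems)]
```

An empty return value means the suite meets the coverage bar; anything else names the template categories that still need a hand-crafted reference.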