---
name: prompt-testing
description: >
  Methodology for unit testing PromptKit-generated prompts by comparing
  them against known-good reference prompts. Defines a structured gap
  analysis process for validating prompt quality.
---

# Prompt Testing Guide

This document describes how to unit test prompts generated by the PromptKit library by comparing them against known-good reference prompts.

## Why Test Prompts

Prompts are code. Like code, they can have bugs — missing instructions, wrong scoping, vague requirements, absent guardrails. Testing prompts against known-good references catches these defects before they produce poor LLM outputs.

## Method: Reference Comparison

### Overview

1. Hand-craft a high-quality prompt for a specific task (the "reference").
2. Use the PromptKit bootstrap to generate a prompt for the same task.
3. Perform structured gap analysis between the two.
4. Feed gaps back into the library as improvements.

### Step 1: Create a Reference Prompt

Write a prompt by hand (or collect one that produced excellent results) for a specific, real task. This is your ground truth.

Good references are:

- **Task-specific**: written for a concrete problem, not a generic template.
- **Battle-tested**: used in practice and known to produce good output.
- **Complete**: includes all context, constraints, and deliverables needed.

Store reference prompts in a tests/references/ directory:

```
tests/
└── references/
    ├── investigate-stack-corruption.txt
    ├── author-auth-requirements.txt
    └── review-c-networking-code.txt
```

### Step 2: Generate the PromptKit Prompt

Use the bootstrap prompt to generate a prompt for the same task as the reference. Provide the same problem description and context. Save the assembled output.

```
tests/
└── generated/
    ├── investigate-stack-corruption.md
    ├── author-auth-requirements.md
    └── review-c-networking-code.md
```
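With matching filenames in both directories, pairing each reference with its generated counterpart can be automated. A minimal sketch, assuming the layout above; `pair_prompts` is an illustrative helper, not part of PromptKit:

```python
from pathlib import Path

def pair_prompts(root: Path) -> dict[str, tuple[Path, Path]]:
    """Match each reference prompt to its generated counterpart by file stem."""
    refs = {p.stem: p for p in (root / "references").glob("*.txt")}
    gens = {p.stem: p for p in (root / "generated").glob("*.md")}
    # Only stems present in both directories form a testable pair.
    return {stem: (refs[stem], gens[stem]) for stem in sorted(refs.keys() & gens.keys())}
```

A reference without a generated counterpart (or vice versa) is silently skipped here; a stricter harness might flag unmatched stems as failures.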

### Step 3: Structured Gap Analysis

Compare the reference and generated prompts across these dimensions. For each dimension, classify as: ✅ Covered, ⚠️ Partial, or ❌ Missing.

#### Dimension 1: Task Framing

| Check | Question |
|-------|----------|
| Goal statement | Is the objective clearly stated? |
| Success criteria | Are concrete deliverables defined? |
| Non-goals | Is scope explicitly bounded (what NOT to do)? |
| Context definition | Are domain-specific terms and boundaries defined? |

#### Dimension 2: Reasoning Methodology

| Check | Question |
|-------|----------|
| Reasoning protocol | Is a systematic analysis method prescribed? |
| Hypothesis generation | Does it require multiple hypotheses before investigating? |
| Evidence requirements | Must claims be backed by citations or code excerpts? |
| Anti-hallucination | Are fabrication guardrails present? |

#### Dimension 3: Output Specification

| Check | Question |
|-------|----------|
| Output format | Is the expected output structure defined? |
| Deliverable artifacts | Are specific files/documents listed? |
| Classification scheme | Is a domain-specific taxonomy provided for findings? |
| Severity/ranking | Are prioritization criteria defined? |

#### Dimension 4: Operational Guidance

| Check | Question |
|-------|----------|
| Scoping strategy | Does it tell the LLM how to scope its work? |
| Tool usage | Does it guide how to use available tools effectively? |
| Step-by-step plan | Is a concrete procedural plan provided? |
| Parallelization | Does it suggest how to split work (if applicable)? |

#### Dimension 5: Quality Assurance

| Check | Question |
|-------|----------|
| Self-verification | Must the LLM verify its own output? |
| Sampling checks | Must specific items be spot-checked? |
| Coverage statement | Must the LLM document what it did/didn't examine? |
| Consistency check | Must findings be internally consistent? |
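The five dimensions and their checks are regular enough to keep as plain data, so that a report generator or automated analyst iterates over one canonical rubric instead of hand-copied lists. A sketch; the names simply mirror the tables above:

```python
# The gap-analysis rubric: dimension name -> list of checks.
RUBRIC = {
    "Task Framing": [
        "Goal statement", "Success criteria", "Non-goals", "Context definition"],
    "Reasoning Methodology": [
        "Reasoning protocol", "Hypothesis generation",
        "Evidence requirements", "Anti-hallucination"],
    "Output Specification": [
        "Output format", "Deliverable artifacts",
        "Classification scheme", "Severity/ranking"],
    "Operational Guidance": [
        "Scoping strategy", "Tool usage", "Step-by-step plan", "Parallelization"],
    "Quality Assurance": [
        "Self-verification", "Sampling checks",
        "Coverage statement", "Consistency check"],
}

# The three coverage classifications used throughout this guide.
SCORES = ("✅ Covered", "⚠️ Partial", "❌ Missing")
```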

### Step 4: Score and Report

Produce a gap report:

```markdown
# Prompt Test Report: <task name>

## Reference: <path to reference prompt>
## Generated: <path to generated prompt>

## Gap Summary

| Dimension | Score | Critical Gaps |
|-----------|-------|---------------|
| Task Framing | ⚠️ Partial | Missing non-goals, no file deliverables |
| Reasoning | ✅ Covered | |
| Output Spec | ❌ Missing | No task-specific taxonomy |
| Operational | ❌ Missing | No scoping strategy, no step-by-step plan |
| Quality | ⚠️ Partial | Has anti-hallucination but no self-check |

## Detailed Gaps

### Gap 1: <description>
- **Reference has**: <what the reference includes>
- **Generated has**: <what the PromptKit output includes (or "nothing")>
- **Impact**: <what goes wrong if this is missing>
- **Fix**: <what library change would address this>

### Gap 2: ...
```
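The Gap Summary table is mechanical enough to render from scored results. A minimal sketch, assuming scores are collected as a mapping from dimension name to a (score, critical-gaps) pair; `gap_summary_table` is an illustrative helper:

```python
def gap_summary_table(scores: dict[str, tuple[str, str]]) -> str:
    """Render the Gap Summary section of the report as a markdown table.

    `scores` maps a dimension name to (score, critical gaps text).
    """
    lines = [
        "| Dimension | Score | Critical Gaps |",
        "|-----------|-------|---------------|",
    ]
    for dimension, (score, gaps) in scores.items():
        lines.append(f"| {dimension} | {score} | {gaps} |")
    return "\n".join(lines)
```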

### Step 5: Feed Back into the Library

For each gap identified:

  1. Determine if it is a structural gap (library architecture needs a new layer, protocol, or mechanism) or a content gap (existing template/protocol needs more content).
  2. File it as an improvement to the library.
  3. After fixing, re-run the comparison to verify the gap is closed.

## Automated Gap Analysis (Using an LLM)

You can automate Step 3 by using an LLM to perform the comparison:

```
You are a prompt quality analyst. Compare the following two prompts
for the same task and identify gaps.

## Reference Prompt (known good):
<paste reference>

## Generated Prompt (under test):
<paste generated>

## Instructions:
For each of the following dimensions, classify coverage as
✅ Covered, ⚠️ Partial, or ❌ Missing. List specific gaps.

1. Task Framing (goal, success criteria, non-goals, context)
2. Reasoning Methodology (protocols, hypothesis, evidence, anti-hallucination)
3. Output Specification (format, artifacts, taxonomy, ranking)
4. Operational Guidance (scoping, tools, plan, parallelization)
5. Quality Assurance (self-verification, sampling, coverage, consistency)
```
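Wiring this prompt into a script is straightforward. The sketch below keeps the LLM client abstract (`call_llm` is a caller-supplied function, since this guide does not prescribe a provider) and only fills the template:

```python
# Template for the analyst prompt above; {reference} and {generated}
# are replaced with the two prompt texts under comparison.
ANALYST_PROMPT = """You are a prompt quality analyst. Compare the following two prompts
for the same task and identify gaps.

## Reference Prompt (known good):
{reference}

## Generated Prompt (under test):
{generated}

## Instructions:
For each of the following dimensions, classify coverage as
✅ Covered, ⚠️ Partial, or ❌ Missing. List specific gaps.
"""

def analyze_gaps(reference: str, generated: str, call_llm) -> str:
    """Fill the analyst template and delegate the comparison to an LLM client."""
    return call_llm(ANALYST_PROMPT.format(reference=reference, generated=generated))
```

Because the client is injected, the function can be unit-tested with a stub before any model is involved.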

## Regression Testing

When modifying the library (new protocols, format changes, template updates), re-run all reference comparisons to ensure:

  1. Previously-covered dimensions remain covered.
  2. No new gaps are introduced by structural changes.
  3. The overall quality score improves or stays constant.
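These checks can be enforced mechanically by ranking the three coverage levels and comparing each new report against a stored baseline. A sketch; the score strings follow the classification from Step 3, and `find_regressions` is an illustrative helper:

```python
# Order the coverage levels so that a drop in rank is a regression.
RANK = {"❌ Missing": 0, "⚠️ Partial": 1, "✅ Covered": 2}

def find_regressions(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return dimensions whose coverage dropped relative to the stored baseline.

    Both arguments map dimension name -> score string. A dimension absent
    from the baseline is treated as previously Missing, so it can only improve.
    """
    return [dim for dim, score in current.items()
            if RANK[score] < RANK[baseline.get(dim, "❌ Missing")]]
```

An empty result means the library change is safe to land; any listed dimension points at a protocol or template that the change weakened.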

## Building a Test Suite

Over time, accumulate reference prompts across different task types:

| Category | Reference | Tests |
|----------|-----------|-------|
| Investigation | Stack corruption in C driver code | Task framing, taxonomy, operational |
| Document authoring | Auth system requirements | Completeness, anti-hallucination |
| Code review | Security review of web API | Taxonomy, severity, coverage |
| Planning | Database migration plan | Deliverables, risk, phasing |

A healthy library should have at least one reference prompt per template category.