feat: defense posture community patterns (CP-1001 — CP-1006)#1669

Open
ppcvote wants to merge 6 commits into NVIDIA:main from ppcvote:feat/community-defense-posture-patterns

@ppcvote ppcvote commented Apr 5, 2026

Summary

Six YAML-based community patterns for assessing LLM system prompt defense posture, as discussed in #1666.

What this adds:

  • community_modules/contrib/defense-posture/ — 6 patterns + index + README
  • Each pattern includes static indicators (regex, <1ms) + behavioral criteria + calibration metadata
  • Based on defense pattern analysis of 1,646 unique production system prompts from 4 public datasets

Patterns

| ID | Name | OWASP | Gap Rate (n=1,646) | Hardening |
|---|---|---|---|---|
| CP-1001 | Role Boundary Defense | LLM01 | 92.4% | +2 |
| CP-1002 | System Prompt Data Leakage | LLM01 | 9.4% | +3 |
| CP-1003 | Multi-Language Bypass Resistance | LLM01 | 64.3% | +3 |
| CP-1004 | Social Engineering Resistance | LLM01 | 71.4% | +2 |
| CP-1005 | Output Weaponization Defense | LLM02 | 88.3% | +2 |
| CP-1006 | Indirect Injection via External Data | LLM01 | 97.8% | +3 |

Average defense score: 36/100. Only 1.1% scored A. 78.3% scored F.

Design

Each pattern supports two scoring modes in one pass:

  1. Static (`static_indicators`): Regex patterns for <1ms hardening score. Zero cost.
  2. Behavioral (`behavioral`): Pass/fail criteria for model inference. Returns 0.0 (defended) → 1.0 (compromised).
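For readers who want a concrete picture of the two modes, here is a minimal Python sketch. The field names `static_indicators` and `hardening_score_contribution` come from this PR, but the record layout, the example regexes, and both helper functions are illustrative assumptions rather than the shipped YAML or scanner. In particular, treating "any indicator matches" as full credit and the behavioral score as the fraction of failed criteria are guesses at one reasonable scheme.

```python
import re

# Hypothetical pattern record. Regexes are made up for illustration only.
PATTERN = {
    "id": "CP-1001",
    "static_indicators": [
        r"(?i)\bstay in (your )?role\b",
        r"(?i)\bdo not (change|abandon) (your )?(role|persona)\b",
    ],
    "hardening_score_contribution": 2,
}


def static_score(system_prompt: str, pattern: dict) -> int:
    """Static mode: credit the pattern's contribution if any regex matches."""
    for regex in pattern["static_indicators"]:
        if re.search(regex, system_prompt):
            return pattern["hardening_score_contribution"]
    return 0


def behavioral_score(passed: int, total: int) -> float:
    """Behavioral mode (assumed): fraction of failed criteria.

    0.0 means every criterion passed (defended);
    1.0 means every criterion failed (compromised).
    """
    if total <= 0:
        raise ValueError("at least one behavioral criterion is required")
    return 1.0 - passed / total
```

Under these assumptions, `static_score("You must stay in role at all times.", PATTERN)` credits the +2 contribution, while a prompt with no matching keyword scores 0.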

Data source

1,646 unique production system prompts from 4 public datasets:

Scanned with prompt-defense-audit (deterministic regex, <5ms). Deduplicated by content hash.

Fully reproducible: clone the 4 dataset repos and run the scanner.
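The deduplication step can be sketched as follows. This is an assumption-laden illustration: the PR does not say which hash function or normalization was used, so SHA-256 over whitespace-stripped UTF-8 text and the helper name `dedupe_by_content_hash` are placeholders.

```python
import hashlib


def dedupe_by_content_hash(prompts):
    """Keep the first occurrence of each prompt, keyed by a content hash.

    Assumption: SHA-256 of the stripped UTF-8 text; the actual scanner
    may hash or normalize differently.
    """
    seen = set()
    unique = []
    for text in prompts:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing rather than comparing raw strings keeps memory use flat when the same long prompt appears in more than one of the source datasets.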

Limitations: Regex measures keyword presence, not behavioral resilience. Leaked prompts may be outdated. Selection bias possible. GPT Store prompts (84% of sample) are typically less hardened than platform-level prompts.

Calibration readiness

Each pattern includes `calibration.expected_false_refusal_delta`. The `hardening_score_contribution` fields sum to 15, enabling the "hardening score ≥ 10" threshold analysis discussed in #1666.
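As a quick check of the arithmetic, the per-pattern contributions from the table add up as 2 + 3 + 3 + 2 + 2 + 3 = 15. The sketch below shows how a "hardening score ≥ 10" check could be computed from the set of patterns whose static indicators matched. The contribution values are taken from this PR; the function names and the idea of passing matched pattern IDs are hypothetical.

```python
# Contribution values as listed in the Patterns table of this PR.
CONTRIBUTIONS = {
    "CP-1001": 2,
    "CP-1002": 3,
    "CP-1003": 3,
    "CP-1004": 2,
    "CP-1005": 2,
    "CP-1006": 3,
}
MAX_SCORE = sum(CONTRIBUTIONS.values())
THRESHOLD = 10  # threshold discussed in #1666


def hardening_score(matched_ids):
    """Sum contributions for each distinct pattern that matched."""
    return sum(CONTRIBUTIONS[pid] for pid in set(matched_ids))


def meets_threshold(matched_ids, threshold=THRESHOLD):
    """True when the total hardening score reaches the threshold."""
    return hardening_score(matched_ids) >= threshold
```

For example, a prompt matching only the three +3 patterns scores 9 and falls just short; it needs at least one more pattern to reach the threshold.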

Ref: #1666

Six YAML-based community patterns for assessing LLM system prompt
defense posture, as discussed in NVIDIA#1666.

Each pattern includes:
- Probe prompts with attack metadata
- Static indicators (regex, <1ms) for hardening score
- Behavioral pass/fail criteria for model inference scoring
- Calibration metadata for false-refusal correlation
- Empirical gap rates from 721 production AI applications

Patterns:
- CP-1001: Role Boundary Defense (41% gap rate)
- CP-1002: System Prompt Data Leakage (59% gap rate)
- CP-1003: Multi-Language Bypass Resistance (72% gap rate)
- CP-1004: Social Engineering Resistance (82% gap rate)
- CP-1005: Output Weaponization Defense (66% gap rate)
- CP-1006: Indirect Injection via External Data (96% gap rate)

Total hardening score: 0-15 (threshold >= 10 for "adequately hardened")
Dataset: doi:10.5281/zenodo.19410475

Ref: NVIDIA#1666

github-actions bot commented Apr 5, 2026

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

ppcvote commented Apr 5, 2026

I have read the DCO Document and I hereby sign the DCO

ppcvote commented Apr 5, 2026

recheck

ppcvote added 5 commits April 5, 2026 14:50
…oduction prompts

Previous data incorrectly used HTML analysis of 721 websites as proxy
for system prompt defense rates. This update uses actual system prompt
analysis from jujumilk3/leaked-system-prompts (n=121).

Key changes:
- Source: jujumilk3/leaked-system-prompts (not website HTML scans)
- Sample: 121 real production system prompts (not 721 website URLs)
- All gap rates updated to match actual measurements
- Methodology description corrected
- Limitations section added to README