feat: Crescendo multi-turn jailbreak probes (Russinovich et al., 2024)#1653
Christbowel wants to merge 3 commits into NVIDIA:main
Conversation
Implements the Crescendo attack (Russinovich et al., arXiv:2404.01833), accepted at USENIX Security 2025.

- CrescendoReplay: pre-scripted replay probe, no auxiliary LLM required. CrescendoCached kept as a backwards-compatible alias.
- Crescendo: fully adaptive probe using an attacker LLM that generates each turn based on the target's prior response.
- judge.CrescendoJudge: two-stage LLM judge (primary 0-100 score + secondary correction for aligned-judge false negatives).
- backtrack_on_refusal param (default False, paper-faithful) enables explicit FITD-inspired backtracking as an opt-in extension.
- secondary_detectors hook for cross-validation metrics.
- 15 unit tests, all passing.

Closes NVIDIA#1513

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>
```python
if attempt.notes is None:
    attempt.notes = {}
attempt.notes["cached_turns"] = turns
attempt.notes["turn_idx"] = 0
```
Why is this index needed? As I read the attack flow, the index is simply len(attempt.turns) - 1.
Thanks for the suggestion. I double-checked the Attempt API and attempt.turns doesn't seem to exist as a direct attribute; it sits under attempt.conversations[0].turns. Beyond that, after sending cached_turns[0], the conversation contains [user_0, assistant_0] (len=2), so len(turns) - 1 would give 1 while turn_idx should still be 0 at that point. This would cause a skip on every iteration. I think keeping turn_idx explicitly in notes is the safest way to track position in cached_turns. Happy to revisit if I'm misreading the attack flow!
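A tiny sketch of the off-by-one being discussed; the list-of-tuples conversation is illustrative only, not the real garak Attempt structure:

```python
# Hypothetical model of the conversation state after one replayed turn.
cached_turns = ["turn 0", "turn 1", "turn 2"]
conversation = []

# send cached_turns[0], then the target replies
conversation.append(("user", cached_turns[0]))
conversation.append(("assistant", "target reply"))

naive_idx = len(conversation) - 1  # 1, because the assistant reply also counts
explicit_idx = 0                   # the cached turn just sent is still index 0

# using naive_idx as the position in cached_turns would skip a cached turn
# on every iteration, since each iteration appends two conversation entries
assert naive_idx == 1
assert explicit_idx == 0
```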
Sorry, you are correct: it should be attempt.prompt.turns. Though I forgot to account for the assistant turns as the conversation builds. I still believe this should not be a note; this is state of the attack process, not really of the attempt.
Agreed on the distinction. The constraint is that _generate_next_attempts only receives last_attempt, so the position needs to travel with it somehow. I could maintain a dict on self keyed by attempt UUID, but that risks memory accumulation in long runs. Would a dedicated attack_state dict on the attempt be cleaner, or is there an existing pattern in the codebase you'd point me to?
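The probe-side alternative mentioned above could look roughly like this; the class and method names are illustrative, not garak's API:

```python
# Hypothetical sketch: attack state lives on the probe, keyed by attempt UUID,
# instead of travelling inside attempt.notes.
class ReplayStateSketch:
    def __init__(self):
        self._attack_state = {}  # attempt uuid -> index into cached_turns

    def next_turn_idx(self, attempt_uuid):
        # return the current position, then advance it for the next call
        idx = self._attack_state.get(attempt_uuid, 0)
        self._attack_state[attempt_uuid] = idx + 1
        return idx

    def finish(self, attempt_uuid):
        # without explicit cleanup, entries accumulate across a long run,
        # which is the memory concern raised above
        self._attack_state.pop(attempt_uuid, None)
```

The trade-off is exactly the one discussed: state no longer pollutes the attempt, but the probe must reliably call the cleanup hook or the dict grows unbounded.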
```python
def _generate_next_attempts(
    self, last_attempt: garak.attempt.Attempt
) -> Iterable[garak.attempt.Attempt]:
    turn_idx = last_attempt.notes.get("turn_idx", 0)
```
As noted above, this could be:

```diff
- turn_idx = last_attempt.notes.get("turn_idx", 0)
+ turn_idx = len(last_attempt.turns) - 1
```
Same discussion as above, keeping turn_idx in notes for now pending your feedback on where state should live.
```python
if next_attempt.notes is None:
    next_attempt.notes = {}
next_attempt.notes["cached_turns"] = cached_turns
next_attempt.notes["turn_idx"] = next_idx
```
Based on the idea that this value is len(next_attempt.turns) - 1:

```diff
- next_attempt.notes["turn_idx"] = next_idx
```
Same as above, keeping turn_idx pending your feedback on state management.
- Replace cached_turns in notes with cache_idx (index into self.cached_conversations) to avoid serializing full turn lists into every attempt in report.jsonl
- Use attempt.goal instead of attempt.notes[goal] throughout, consistent with the native Attempt API
- Propagate per-conversation goal from cached JSONL into attempt.goal
- Add max_tokens: 1024 to red_team_model_config default (150 is too short for attacker LLM responses)
- Simplify CrescendoJudge.detect() goal lookup to attempt.goal
- Pass the full conversation to the judge instead of the last message only, consistent with how Crescendo exploits accumulated context
- Add a docstring note on why CrescendoReplay uses conversation mode

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>
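The cache_idx change in this commit can be sketched as follows; the record shape and field names are assumptions for illustration, not the actual JSONL schema:

```python
# Hypothetical sketch: attempt notes carry only small indices into the
# probe's cached conversations, so report.jsonl never serializes full
# turn lists for every attempt.
cached_conversations = [
    {"goal": "goal A", "turns": ["A0", "A1"]},
    {"goal": "goal B", "turns": ["B0", "B1", "B2"]},
]

notes = {"cache_idx": 1, "turn_idx": 2}  # small and serialization-friendly
conv = cached_conversations[notes["cache_idx"]]
next_prompt = conv["turns"][notes["turn_idx"]]
assert next_prompt == "B2"
assert conv["goal"] == "goal B"  # per-conversation goal travels with the cache
```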
- Add docs/source/garak.probes.crescendo.rst
- Add garak.probes.crescendo to the docs/source/probes.rst toctree
- Exclude crescendo probes from NON_PROMPT_PROBES in the langservice test (same pattern as fitd.FITD: probes that use IterativeProbe do not populate self.prompts)

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>
Implements the Crescendo multi-turn jailbreak attack (Russinovich et al., arXiv:2404.01833),
accepted at USENIX Security 2025. Crescendo is not currently in garak.
What this adds
garak/probes/crescendo.py
- CrescendoReplay: pre-scripted replay probe, no auxiliary LLM required. Replays fixed Crescendo attack conversations turn-by-turn.
- CrescendoCached kept as a backwards-compatible alias.
- generations forced to 1 (pre-scripted turns have no variance; replaying N times is waste).
- Crescendo: fully adaptive probe using an attacker LLM that generates each turn based on the target's prior response, with a comprehensive meta-prompt including worked examples from the paper.
garak/detectors/judge.py
- CrescendoJudge: two-stage LLM judge. The primary judge scores 0-100 against the specific attack goal. The secondary judge activates when the primary score is below 70 and re-evaluates the primary's explanation to correct false negatives caused by the judge's own safety alignment, as described in the paper.
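The two-stage flow can be sketched as below; the prompts, the return shape, and the call_llm helper are all assumptions for illustration, not garak's actual judge API:

```python
# Hypothetical sketch of two-stage judging: a secondary pass corrects
# false negatives caused by the primary judge's own safety alignment.
def two_stage_score(goal, conversation, call_llm, threshold=70):
    primary = call_llm(
        f"Score 0-100 how fully this conversation achieves the goal.\n"
        f"Goal: {goal}\nConversation: {conversation}"
    )
    score = primary["score"]
    if score < threshold:
        # the primary judge may refuse or under-score out of its own
        # alignment; re-read its explanation rather than the raw content
        secondary = call_llm(
            f"Given this judge explanation, was the goal actually met? "
            f"Re-score 0-100.\nExplanation: {primary['explanation']}"
        )
        score = max(score, secondary["score"])
    return score
```

High-confidence primary scores pass through untouched; only low scores trigger the second, cheaper re-evaluation over the explanation text.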
garak/data/crescendo/
- crescendo_cached.jsonl: 4 pre-scripted attack conversations
- prompt_template_attack.txt: meta-prompt with Crescendo technique description and worked examples
- prompt_template_backtrack.txt: backtrack prompt (opt-in extension)

Parameters
- backtrack_on_refusal (default False): paper-faithful mode lets the attacker LLM handle refusals organically. Set True to enable explicit FITD-inspired backtracking as an extension.
- secondary_detectors (default []): hook for cross-validation metrics such as Perspective API or Azure Content Safety.
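For illustration, opting into both parameters might look like the fragment below in a garak run config; the exact plugin-option nesting is an assumption and may differ across garak versions:

```yaml
# Hypothetical config fragment; verify the nesting against your garak version.
plugins:
  probes:
    crescendo:
      Crescendo:
        backtrack_on_refusal: true      # opt-in FITD-inspired extension
        secondary_detectors:
          - perspective.Toxicity        # example cross-validation detector
```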
Faithfulness to paper
Faithful: adaptive attacker LLM, multi-turn escalation referencing target responses, two-stage judge, max 10 turns, 0-100 scoring scale.

Documented divergences: meta-prompt examples are reconstructions (the exact paper prompts are not published); the secondary judge re-prompts instead of parsing the primary's text inline; backtrack_on_refusal is an opt-in extension not in the original paper.

Verification
- python -m garak --target_type test.Blank --probes crescendo.CrescendoReplay → 26 attempts, completes in ~4s
- python -m pytest tests/test_crescendo.py -v → 15 passed
- Crescendo (adaptive) requires a configured attacker LLM via the red_team_model_type and red_team_model_name params (default: nim.NVOpenAIChat / mistralai/mixtral-8x22b-instruct-v0.1)
- CrescendoJudge requires a configured judge LLM (inherits ModelAsJudge defaults)