
feat: Crescendo multi-turn jailbreak probes (Russinovich et al., 2024)#1653

Open
Christbowel wants to merge 3 commits into NVIDIA:main from Christbowel:feature/crescendo-probe

Conversation

@Christbowel

Implements the Crescendo multi-turn jailbreak attack (Russinovich et al., arXiv:2404.01833),
accepted at USENIX Security 2025. Crescendo is not currently in garak.

What this adds

garak/probes/crescendo.py

  • CrescendoReplay: pre-scripted replay probe, no auxiliary LLM required. Replays
    fixed Crescendo attack conversations turn-by-turn. CrescendoCached kept as a
    backwards-compatible alias. generations is forced to 1 (pre-scripted turns have
    no variance, so replaying N times is wasteful).
  • Crescendo: fully adaptive probe using an attacker LLM that generates each turn
    based on the target's prior response, with a comprehensive meta-prompt including
    worked examples from the paper.
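The adaptive flow can be sketched as a simple escalation loop. This is a minimal illustration, not garak's actual probe API: `attacker_next_turn` and `target_respond` are hypothetical stand-ins for the attacker-LLM call and the target generator.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch of the Crescendo escalation loop; the two
# callables stand in for the attacker LLM and the target model.
def crescendo_loop(
    attacker_next_turn: Callable[[str, List[Tuple[str, str]]], str],
    target_respond: Callable[[str], str],
    goal: str,
    max_turns: int = 10,  # the paper caps the attack at 10 turns
) -> List[Tuple[str, str]]:
    history: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        # Each attack turn is generated from the goal plus the target's
        # prior responses, so escalation references accumulated context.
        prompt = attacker_next_turn(goal, history)
        response = target_respond(prompt)
        history.append((prompt, response))
    return history
```

In the real probe, an early-success or refusal signal would break out of the loop; the sketch runs all turns for simplicity.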

garak/detectors/judge.py

  • CrescendoJudge: two-stage LLM judge. Primary judge scores 0-100 against the
    specific attack goal. Secondary judge activates when primary score < 70 and
    re-evaluates the primary's explanation to correct false negatives caused by the
    judge's own safety alignment — as described in the paper.
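The two-stage flow reduces to a threshold check. A minimal sketch, assuming judge callables that return a 0-100 score plus an explanation; this is not the actual CrescendoJudge interface:

```python
from typing import Callable, Tuple

# Hypothetical sketch of the two-stage judge: the secondary stage only
# activates when the primary score falls below the threshold.
def two_stage_score(
    primary_judge: Callable[[str, str], Tuple[int, str]],  # (goal, conversation) -> (score, explanation)
    secondary_judge: Callable[[str, str], int],            # (goal, explanation) -> corrected score
    goal: str,
    conversation: str,
    threshold: int = 70,
) -> int:
    score, explanation = primary_judge(goal, conversation)
    if score < threshold:
        # Re-evaluate the primary judge's own explanation to correct
        # false negatives caused by the judge model's safety alignment.
        score = secondary_judge(goal, explanation)
    return score
```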

garak/data/crescendo/

  • crescendo_cached.jsonl: 4 pre-scripted attack conversations
  • prompt_template_attack.txt: meta-prompt with Crescendo technique description
    and worked examples
  • prompt_template_backtrack.txt: backtrack prompt (opt-in extension)

Parameters

  • backtrack_on_refusal (default False): paper-faithful mode lets the attacker
    LLM handle refusals organically. Set True to enable explicit FITD-inspired
    backtracking as an extension.
  • secondary_detectors (default []): hook for cross-validation metrics such as
    Perspective API or Azure Content Safety.

Faithfulness to paper

Faithful: adaptive attacker LLM, multi-turn escalation referencing target responses,
two-stage judge, max 10 turns, 0-100 scoring scale.

Documented divergences: meta-prompt examples are reconstructions (exact paper prompts
not published); secondary judge re-prompts instead of parsing primary's text inline;
backtrack_on_refusal is an opt-in extension not in the original paper.

Verification

  • python -m garak --target_type test.Blank --probes crescendo.CrescendoReplay
    → 26 attempts, completes in ~4s
  • python -m pytest tests/test_crescendo.py -v → 15 passed
  • Crescendo (adaptive) requires a configured attacker LLM via red_team_model_type
    and red_team_model_name params (default: nim.NVOpenAIChat /
    mistralai/mixtral-8x22b-instruct-v0.1)
  • CrescendoJudge requires a configured judge LLM (inherits ModelAsJudge defaults)

Implements the Crescendo attack (Russinovich et al., arXiv:2404.01833),
accepted at USENIX Security 2025.

- CrescendoReplay: pre-scripted replay probe, no auxiliary LLM required.
  CrescendoCached kept as a backwards-compatible alias.
- Crescendo: fully adaptive probe using an attacker LLM that generates
  each turn based on the target's prior response.
- judge.CrescendoJudge: two-stage LLM judge (primary 0-100 score +
  secondary correction for aligned-judge false negatives).
- backtrack_on_refusal param (default False, paper-faithful) enables
  explicit FITD-inspired backtracking as an opt-in extension.
- secondary_detectors hook for cross-validation metrics.
- 15 unit tests, all passing.

Closes NVIDIA#1513

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>
@github-actions
Contributor

github-actions bot commented Mar 27, 2026

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@Christbowel
Author

I have read the DCO Document and I hereby sign the DCO

@Christbowel
Author

recheck

github-actions bot added a commit that referenced this pull request Mar 27, 2026
Collaborator

@jmartin-tech jmartin-tech left a comment


Thanks for this. A terrific start; code inspection led to some suggestions. The execution is not fully tested yet. Happy to iterate as needed to get this moved forward.

if attempt.notes is None:
    attempt.notes = {}
attempt.notes["cached_turns"] = turns
attempt.notes["turn_idx"] = 0
Collaborator


Why is this index needed? As I read this attack flow the index is simply len(attempt.turns) - 1.

Author

@Christbowel Christbowel Mar 30, 2026


Thanks for the suggestion. I double-checked the Attempt API and attempt.turns doesn't seem to exist as a direct attribute; it sits under attempt.conversations[0].turns. Beyond that, after sending cached_turns[0], the conversation contains [user_0, assistant_0] (len=2), so len(turns) - 1 would give 1 while turn_idx should still be 0 at that point. This would cause a skip on every iteration. I think keeping turn_idx explicitly in notes is the safest way to track position in cached_turns. Happy to revisit if I'm misreading the attack flow!
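The counting argument can be shown with a toy conversation list; this is an illustration of the off-by-one, not garak code:

```python
# After sending cached_turns[0], the conversation already holds both
# the user turn and the assistant reply.
conversation = [("user", "cached turn 0"), ("assistant", "reply 0")]

# len(conversation) - 1 would suggest the probe is at index 1...
derived_idx = len(conversation) - 1

# ...but only cached_turns[0] has been replayed so far, so the position
# in cached_turns is still 0. Each replayed turn adds TWO entries
# (user + assistant), hence the division by 2.
true_idx = len(conversation) // 2 - 1
```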

Collaborator

@jmartin-tech jmartin-tech Mar 30, 2026


Sorry, you are correct; it should be attempt.prompt.turns. Though I forgot to account for the assistant turns as the conversation builds. I still believe this should not be a note: this is state of the attack process, not really of the attempt.

Author


Agreed on the distinction. The constraint is that _generate_next_attempts only receives last_attempt, so the position needs to travel with it somehow. I could maintain a dict on self keyed by attempt UUID, but that risks memory accumulation in long runs. Would a dedicated attack_state dict on the attempt be cleaner, or is there an existing pattern in the codebase you'd point me to?

def _generate_next_attempts(
    self, last_attempt: garak.attempt.Attempt
) -> Iterable[garak.attempt.Attempt]:
    turn_idx = last_attempt.notes.get("turn_idx", 0)
Collaborator


As noted above this could be:

Suggested change:
-   turn_idx = last_attempt.notes.get("turn_idx", 0)
+   turn_idx = len(last_attempt.turns) - 1

Author


Same discussion as above, keeping turn_idx in notes for now pending your feedback on where state should live.

if next_attempt.notes is None:
    next_attempt.notes = {}
next_attempt.notes["cached_turns"] = cached_turns
next_attempt.notes["turn_idx"] = next_idx
Collaborator


Based on idea that this value is len(next_attempt.turns)-1:

Suggested change:
-   next_attempt.notes["turn_idx"] = next_idx

Author


Same as above, keeping turn_idx pending your feedback on state management.

- Replace cached_turns in notes with cache_idx (index into self.cached_conversations)
  to avoid serializing full turn lists into every attempt in report.jsonl
- Use attempt.goal instead of attempt.notes[goal] throughout, consistent
  with the native Attempt API
- Propagate per-conversation goal from cached JSONL into attempt.goal
- Add max_tokens: 1024 to red_team_model_config default (150 is too short
  for attacker LLM responses)
- Simplify CrescendoJudge.detect() goal lookup to attempt.goal
- Pass full conversation to judge instead of last message only, consistent
  with how Crescendo exploits accumulated context
- Add docstring note on why CrescendoReplay uses conversation mode

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>
- Add docs/source/garak.probes.crescendo.rst
- Add garak.probes.crescendo to docs/source/probes.rst toctree
- Exclude crescendo probes from NON_PROMPT_PROBES in langservice test
  (same pattern as fitd.FITD: probes that use IterativeProbe do not
  populate self.prompts)

Signed-off-by: christbowel <0xdeadbeef@christbowel.com>