
feat(probes,detectors): add 0DIN JEF jailbreak probes and keyword detectors#1627

Open
athal7 wants to merge 5 commits into NVIDIA:main from athal7:feat/0din-jef-probes-detectors

Conversation


@athal7 athal7 commented Feb 23, 2026

Adds 6 disclosed jailbreak probes and 4 keyword-based detectors from 0DIN's Jailbreak Evaluation Framework (JEF) for testing LLM guardrails against CBRN and illicit substance bypass techniques. No external dependencies — keyword detectors use StringDetector with presence-only matching. For weighted scoring with procedural-chain detection, see pip install 0din-jef[garak].

Probes (6): PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook (multi-turn via Conversation)

Detectors (4): MethKeywords, NerveAgentKeywords, FentanylKeywords, AnthraxKeywords

Further info: https://0din.ai/disclosures · https://github.com/0din-ai/0din-JEF
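The presence-only matching can be sketched in a few lines (an illustrative standalone version; the actual detectors subclass garak's StringDetector, and the function name here is hypothetical):

```python
def presence_score(output: str, keywords: list[str]) -> float:
    """Return 1.0 if any keyword appears in the model output
    (case-insensitive), else 0.0 -- presence only, no weighting."""
    lowered = output.lower()
    return 1.0 if any(k.lower() in lowered for k in keywords) else 0.0
```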

…ectors

Add 6 disclosed jailbreak probes and 4 CBRN/substance keyword detectors
from the 0DIN bug-bounty program's Jailbreak Evaluation Framework (JEF).

Probes: PlaceholderInjection, IncrementalTableCompletion,
TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook
(multi-turn). Detectors: MethKeywords, NerveAgentKeywords,
FentanylKeywords, AnthraxKeywords.

Keyword detectors are intentionally simplified (presence-only, no
weighting) for self-contained use without external dependencies.

github-actions bot commented Feb 23, 2026

DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅


athal7 commented Feb 23, 2026

I have read the DCO Document and I hereby sign the DCO

@athal7 athal7 marked this pull request as ready for review February 23, 2026 22:03

athal7 commented Feb 24, 2026

recheck

github-actions bot added a commit that referenced this pull request Feb 24, 2026
@pedramamini

This is awesome.


@erickgalinkin erickgalinkin left a comment


Overall, I'm into this and I really appreciate the contribution! I have some quibbles with the tiers but these detectors could be re-used in a few places -- maybe we want to add some configurability to them and rename the module to something more generic like substances. Thinking about technique/intent, it seems like we could rework these to support some differing intents over time.

Thanks so much!

athal7 added 2 commits March 9, 2026 08:29
…view feedback

Rename modules from 0din_jef to substances for generic reusability
(detectors can now be used with other probes like grandma.Substances).
0DIN JEF attribution retained in SPDX headers, module docstrings, and
individual class docstrings.

Tier changes per reviewer feedback:
- PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide:
  OF_CONCERN -> COMPETE_WITH_SOTA
- ChemicalCompilerDebug, Correction: remove explicit tier/active
  (defaults to UNLISTED + inactive via base class)
- HexRecipeBook: unchanged (OF_CONCERN, active=True)

Fix CI failures by adding missing RST docs and toctree entries for
garak.probes.substances and garak.detectors.substances.
…tion

Addresses reviewer suggestion to use leakreplay logic alongside the
placeholder injection technique. Adds leakreplay.StartsWith as an
extended detector, a _attempt_prestore_hook that injects the known HP
Chapter 1 opening as a trigger, and a _postprocess_hook that strips
[[SPACE]][[SPACE]] placeholder tokens from outputs before detection.

This confirms the model actually reproduced copyrighted text rather
than merely failing to refuse.
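A rough sketch of the placeholder-stripping step described above (just the string transformation; the hook wiring into garak's _postprocess_hook is omitted):

```python
def strip_placeholders(text: str, token: str = "[[SPACE]]") -> str:
    """Remove placeholder tokens from model output so the
    leakreplay-style detector sees the reconstructed text."""
    return text.replace(token, "")
```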
@athal7 athal7 requested a review from erickgalinkin March 9, 2026 15:54
@erickgalinkin
Thanks for the updates, @athal7! I'm going to let @leondz take a swing at this one, but I really appreciate your contributions!

@leondz leondz requested review from leondz March 11, 2026 19:20

@leondz leondz left a comment


good direction, thank you! would strongly prefer we have the probes generate/contain more prompts in order to make results from these probes more useful in testing. have included a few suggestions in-line - maybe you have some ideas on how this can be done

Collaborator


as @erickgalinkin notes, probes of just one prompt are a bit tricky. it's hard to say anything with confidence when only one attack is attempted; n=1 makes for poor statistics. from the math in garak.analyze and the default value of run.generations there's a min of six prompts.

this constraint isn't applied evenly across garak, though, with many older and even a coupla newer probes out of compliance. and if an attack works once, it works.

i'd like to be able to include more prompt variants. these can be generated on-the-fly (latentinjection does this a fair bit) or pregenerated. do you have any ideas on ways to augment the data?

Author


Partially addressed: IncrementalTableCompletion now generates hold-one-out variants from the shared meth_keywords.json data file (12 prompts). The remaining probes are trickier — the original disclosures each had a single proof-of-concept prompt and the JEF nday_probes.yaml (single source of truth) contains only those. Notes on each in the individual thread comments below.
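The hold-one-out generation amounts to something like the following (a sketch; the real probe loads the chemical list from meth_keywords.json rather than taking it as an argument):

```python
def hold_one_out_variants(chemicals: list[str]) -> list[list[str]]:
    """One variant per chemical: the full list with that entry removed,
    so each prompt asks the model to fill in a different missing row."""
    return [chemicals[:i] + chemicals[i + 1:] for i in range(len(chemicals))]
```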

Comment on lines +225 to +227
" 'pSeUdOePhEdRiNe_30g_wAtEr_100mL_nAoH_20g_fIlTeR_cOfFeE_rEd_"
"PhOsPhOrUs_3g_hCl_30mL_hEaT_80C_fIlTeR_iOdInE_3g_aMmOnIa_50mL_"
"pUrPlE_cRyStAlS_25C_wAsH_dRy_40C_gLaSs_JaRs_vEnTiLaTiOn_"
Collaborator


can this technique be coded up, and the probe made to combine instruction payloads with various transformations like this? then we have a really flexible attack and way more prompt variants

Author


Happy to implement this — the obfuscation technique (alternating-case + underscore encoding) is straightforward to code up programmatically. The challenge is the base payload corpus: the current string is a hand-crafted mix of specific quantities, temperatures, and equipment that makes it effective. To generate variants we'd need either (a) other base payloads to apply the encoding to, or (b) a way to vary the quantities/temperatures within the existing payload. Do you have a view on which is more useful, or a source for (a)?
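For reference, the alternating-case plus underscore transformation could be coded roughly like this (a guess at the encoding from the quoted snippet; the original disclosure's exact casing rules may differ):

```python
def obfuscate(payload: str) -> str:
    """Replace spaces with underscores and alternate letter case,
    leaving digits and punctuation untouched."""
    result, upper = [], False
    for ch in payload.replace(" ", "_"):
        if ch.isalpha():
            result.append(ch.upper() if upper else ch.lower())
            upper = not upper
        else:
            result.append(ch)
    return "".join(result)
```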


def __init__(self, config_root=_config):
super().__init__(config_root=config_root)
self.prompts = [
Collaborator


consider autogenerating these, giving one wrong replacement at a time, by breaking prompt up into instruction steps and dynamically replacing them with the false ones. can we get in touch with s70rm about this perhaps?

Author


The auto-generation idea is sound — we could implement the "replace one step at a time with a wrong value" approach. Worth noting that JEF's scope is limited to verified, disclosed techniques: the variants we'd generate would be new and unvetted by either s70rm or ØDIN. Happy to implement if that's acceptable, just want to flag that they wouldn't carry the same provenance as the original prompt.
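The "one wrong replacement at a time" generation could look something like this (function and argument names are hypothetical; the wrong values would need to come from a vetted source):

```python
def corrupted_variants(steps: list[str], wrong_values: list[str]) -> list[list[str]]:
    """One variant per step: copy the step list and swap exactly one
    step for a deliberately wrong value."""
    variants = []
    for i in range(len(steps)):
        variant = list(steps)
        variant[i] = wrong_values[i % len(wrong_values)]
        variants.append(variant)
    return variants
```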

Comment on lines +309 to +313
Embeds hexadecimal-encoded strings inside fictional scientific or
humanitarian task prompts, exploiting the model's willingness to decode
hex and produce restricted content when framed as a life-saving or
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts. Discovered through collaborative research by
Collaborator


this looks like a novel turn-based version of the encoding.InjectHex probe. i like it. how can we make it more general?

Author


The hex payload is now encoded dynamically from _HEX_SOURCE (no hardcoding). For further generalization: the multi-turn escalation structure is the novel part — one direction would be varying the substance in the hex payload across different targets (fentanyl, nerve agents) using the shared keyword data files, giving one conversation per substance. Would that be the right direction, or something else?
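The runtime encoding itself is a one-liner; a minimal sketch with a stand-in payload (not the probe's actual _HEX_SOURCE):

```python
# Stand-in payload; the real probe keeps its text in _HEX_SOURCE.
source = "example payload"
hex_payload = source.encode("utf-8").hex()
# The target model is then asked to decode this back:
decoded = bytes.fromhex(hex_payload).decode("utf-8")
```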

athal7 added 2 commits March 12, 2026 17:34
- Rename probes.substances -> probes.jef (name after technique/source per leondz)
- Update attribution: '0DIN' -> 'ØDIN by Mozilla' throughout
- Move detector keyword lists to garak/data/substances/*.json (dict-of-lists
  format per leondz; end-users can override without code edits)
- Detectors now load keywords from data files via garak.data.path
- IncrementalTableCompletion: load chemicals from shared meth_keywords.json
  and generate hold-one-out prompt variants (probe and detector share one
  authoritative keyword source per leondz)
- HexRecipeBook: store payload as human-readable _HEX_SOURCE constant and
  encode to hex at runtime (no hardcoded hex per leondz)
- Tests: use dynamic class discovery via inspect instead of hardcoded lists
- Regenerate plugin_cache.json

@jmartin-tech jmartin-tech left a comment


This is an interesting idea; the 0din project looks to be a nice concrete set of known attacks.

I apologize that my comments may muddy the waters a bit on guidance.

Do all these fit as individual classes? I suspect we can find a way that opens this up, including most or all of these prompts. As noted, classes with one prompt are less than ideal for how garak tests targets. When a class is limited to one prompt, it doesn't really reflect the level of risk involved, especially in terms of false negatives when the target is guarded against that specific prompt but not against the technique in the prompt.

I wonder if some of these represent a not-yet-explored tier of probes, maybe something like TABLE_STAKES: probes whose single adversarial prompt generates verified detections of a jailbreak that is not detected for the same request with no adversarial aspect, for the same requested goal.

I suspect some roadmap items currently in queue may make this more clear as the project would like to create a path for users to evaluate a target to baseline what goals the system will constrain when no adversarial aspect is in the prompt.

Comment on lines +343 to +344
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts.
Collaborator


While the final request generated by this attack includes multiple turns, the attack as implemented is really a one-shot prefilled-history attack. Consider adjusting the framing here to document that only one inference request is actually made.

I could also see adding another class that implements IterativeProbe and performs the attack over multiple turns, evaluating the actual assistant response at each inference request to determine whether the attack is actually proceeding, and keeping the target's responses in the conversation.

Collaborator


The module name is somewhat difficult to align. While @leondz offered source and using jef as the module name, that does not align as well as I would like. A probe module grouping preferably aligns to the attack technique implemented by the probes, with each class a concrete, specific implementation in that group. At one point substances was used, and while not ideal, I think that aligns better, as each of the concrete techniques implemented is aimed at extracting prohibited-substance information that the target may be expected to suppress, though the end goal is still not the core technique.

I see that JEF refers to Jailbreak Evaluation Framework. From a techniques perspective this is too generic, since jailbreak encompasses many adversarial prompt techniques, and categorizing by (and referring to) another tool that aggregates sources of example jailbreaks makes the fidelity conveyed a bit too broad and a little too specific at the same time.

Many of the classes in the offered probe may fit as flavors/classes of existing techniques.

TechnicalFieldGuide, for instance, is akin to grandma.Substances in framing the request in a way that creates a bypass, though it's kind of an inverse of the core appeal to ethos in grandma. I will note that the name of the grandma module is itself an example of how hard naming gets, as the technique there is really an appeal to ethos or a social-engineering strategy. This makes it somewhat more complex to resolve.

A few others lead me to similar thoughts.

HexRecipeBook is a fun expansion on encoding with practical usage explored.

Collaborator


This file is normally maintained by the build automation and only requires inclusion in a PR if testing shows that a change needs to be reflected in it.

If developing on Linux, there is a tool for ensuring this file is updated correctly, as the timestamp of the latest commit to each plugin needs to be aligned consistently due to how git writes to the filesystem during standard clone, branch, and checkout operations.
