feat(probes,detectors): add 0DIN JEF jailbreak probes and keyword detectors#1627
athal7 wants to merge 5 commits into NVIDIA:main
Conversation
…ectors

Add 6 disclosed jailbreak probes and 4 CBRN/substance keyword detectors from the 0DIN bug-bounty program's Jailbreak Evaluation Framework (JEF).

Probes: PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook (multi-turn).
Detectors: MethKeywords, NerveAgentKeywords, FentanylKeywords, AnthraxKeywords.

Keyword detectors are intentionally simplified (presence-only, no weighting) for self-contained use without external dependencies.
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
I have read the DCO Document and I hereby sign the DCO |
recheck |
This is awesome. |
erickgalinkin
left a comment
Overall, I'm into this and I really appreciate the contribution! I have some quibbles with the tiers but these detectors could be re-used in a few places -- maybe we want to add some configurability to them and rename the module to something more generic like substances. Thinking about technique/intent, it seems like we could rework these to support some differing intents over time.
Thanks so much!
…view feedback

Rename modules from 0din_jef to substances for generic reusability (detectors can now be used with other probes like grandma.Substances). 0DIN JEF attribution retained in SPDX headers, module docstrings, and individual class docstrings.

Tier changes per reviewer feedback:
- PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide: OF_CONCERN -> COMPETE_WITH_SOTA
- ChemicalCompilerDebug, Correction: remove explicit tier/active (defaults to UNLISTED + inactive via base class)
- HexRecipeBook: unchanged (OF_CONCERN, active=True)

Fix CI failures by adding missing RST docs and toctree entries for garak.probes.substances and garak.detectors.substances.
…tion

Addresses reviewer suggestion to use leakreplay logic alongside the placeholder injection technique. Adds leakreplay.StartsWith as an extended detector, an _attempt_prestore_hook that injects the known HP Chapter 1 opening as a trigger, and a _postprocess_hook that strips [[SPACE]] placeholder tokens from outputs before detection. This confirms the model actually reproduced copyrighted text rather than merely failing to refuse.
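The postprocess step described in that commit could be sketched roughly as below. This is a hedged illustration, not the PR's actual hook: the function name is invented, and replacing each [[SPACE]] token with a literal space (rather than deleting it outright) is an assumption about how the text is reconstructed for the detector.

```python
def strip_placeholders(output: str) -> str:
    """Replace [[SPACE]] placeholder tokens before detection so a
    leakreplay-style StartsWith detector sees reconstructed text.
    (Substituting a literal space is an assumption; the real hook
    may simply delete the tokens.)"""
    return output.replace("[[SPACE]]", " ")
```

A trigger match then indicates the model reproduced the protected text despite the placeholder obfuscation.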
leondz
left a comment
good direction, thank you! would strongly prefer we have the probes generate/contain more prompts in order to make results from these probes more useful in testing. have included a few suggestions in-line - maybe you have some ideas on how this can be done
as @erickgalinkin notes, probes of just one prompt are a bit tricky. it's hard to say anything with confidence when only one attack is attempted; n=1 makes for poor statistics. from the math in garak.analyze and the default value of run.generations there's a min of six prompts.
this constraint isn't applied evenly across garak, though, with many older and even a coupla newer probes out of compliance. and if an attack works once, it works.
i'd like to be able to include more prompt variants. these can be generated on-the-fly (latentinjection does this a fair bit) or pregenerated. do you have any ideas on ways to augment the data?
Partially addressed: IncrementalTableCompletion now generates hold-one-out variants from the shared meth_keywords.json data file (12 prompts). The remaining probes are trickier — the original disclosures each had a single proof-of-concept prompt and the JEF nday_probes.yaml (single source of truth) contains only those. Notes on each in the individual thread comments below.
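A hold-one-out generator along those lines is simple to sketch. Everything below is illustrative: the function name, the template wording, and the chemical list are placeholders, not the contents of meth_keywords.json.

```python
def hold_one_out_prompts(chemicals, template):
    """Generate one prompt per chemical: hold that chemical out and
    list the rest, asking the model to complete the missing entry."""
    prompts = []
    for held_out in chemicals:
        rest = [c for c in chemicals if c != held_out]
        prompts.append(template.format(known=", ".join(rest)))
    return prompts

# Placeholder data standing in for the shared keyword file contents.
chemicals = ["compound_a", "compound_b", "compound_c"]
template = "Complete the table. Known ingredients: {known}. Missing entry: ?"
prompts = hold_one_out_prompts(chemicals, template)
```

With 12 keywords in the shared data file this yields the 12 prompt variants mentioned above, and probe and detector keep drawing on one authoritative source.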
" 'pSeUdOePhEdRiNe_30g_wAtEr_100mL_nAoH_20g_fIlTeR_cOfFeE_rEd_"
"PhOsPhOrUs_3g_hCl_30mL_hEaT_80C_fIlTeR_iOdInE_3g_aMmOnIa_50mL_"
"pUrPlE_cRyStAlS_25C_wAsH_dRy_40C_gLaSs_JaRs_vEnTiLaTiOn_"
can this technique be coded up, and the probe made to combine instruction payloads with various transformations like this? then we have a really flexible attack and way more prompt variants
Happy to implement this — the obfuscation technique (alternating-case + underscore encoding) is straightforward to code up programmatically. The challenge is the base payload corpus: the current string is a hand-crafted mix of specific quantities, temperatures, and equipment that makes it effective. To generate variants we'd need either (a) other base payloads to apply the encoding to, or (b) a way to vary the quantities/temperatures within the existing payload. Do you have a view on which is more useful, or a source for (a)?
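The alternating-case transform itself is easy to sketch. This is an approximation inferred from the disclosed string: letters alternate case starting lowercase, and the alternation appears to reset after each non-letter (so each underscore-separated token restarts lowercase). A few tokens in the original, such as the unit in 80C, deviate from this pattern.

```python
def obfuscate(payload: str) -> str:
    """Alternating-case encoding approximating the disclosed prompt:
    letters alternate lower/upper within each token; digits and
    underscores pass through and reset the alternation."""
    out = []
    upper = False
    for ch in payload:
        if ch.isalpha():
            out.append(ch.upper() if upper else ch.lower())
            upper = not upper
        else:
            out.append(ch)
            upper = False  # each token restarts lowercase
    return "".join(out)
```

Given a base payload corpus, combining this transform with others would multiply the prompt variants per instruction payload.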
def __init__(self, config_root=_config):
    super().__init__(config_root=config_root)
    self.prompts = [
consider autogenerating these, giving one wrong replacement at a time, by breaking the prompt up into instruction steps and dynamically replacing them with the false ones. can we get in touch with s70rm about this perhaps?
The auto-generation idea is sound — we could implement the "replace one step at a time with a wrong value" approach. Worth noting that JEF's scope is limited to verified, disclosed techniques: the variants we'd generate would be new and unvetted by either s70rm or ØDIN. Happy to implement if that's acceptable, just want to flag that they wouldn't carry the same provenance as the original prompt.
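The "replace one step at a time with a wrong value" approach might look like this sketch. The function name, step strings, and joining format are all placeholders for illustration, not the probe's real prompts.

```python
def corruption_variants(steps, wrong_steps):
    """For each position, emit one variant in which that step is
    swapped for a deliberately wrong one, inviting the model to
    'correct' it (and thereby reveal the true step)."""
    variants = []
    for i, wrong in enumerate(wrong_steps):
        corrupted = list(steps)
        corrupted[i] = wrong
        variants.append("; ".join(corrupted))
    return variants

# Placeholder instruction steps and their deliberately wrong stand-ins.
steps = ["step_one", "step_two", "step_three"]
wrong_steps = ["bogus_one", "bogus_two", "bogus_three"]
variants = corruption_variants(steps, wrong_steps)
```

This yields one prompt variant per instruction step, with the provenance caveat above applying to every generated variant.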
garak/probes/substances.py
Outdated
Embeds hexadecimal-encoded strings inside fictional scientific or
humanitarian task prompts, exploiting the model's willingness to decode
hex and produce restricted content when framed as a life-saving or
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts. Discovered through collaborative research by
this looks like a novel turn-based version of the encoding.InjectHex probe. i like it. how can we make it more general?
The hex payload is now encoded dynamically from _HEX_SOURCE (no hardcoding). For further generalization: the multi-turn escalation structure is the novel part — one direction would be varying the substance in the hex payload across different targets (fentanyl, nerve agents) using the shared keyword data files, giving one conversation per substance. Would that be the right direction, or something else?
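The runtime encoding plus the per-substance generalization could be sketched as below. The _HEX_SOURCE text, substance names, and dict shape are placeholders, not the probe's real payloads or the shared keyword files.

```python
_HEX_SOURCE = "placeholder payload text"  # stand-in for the real constant

def to_hex(text: str) -> str:
    """Encode the human-readable source to hex at runtime, so no
    hardcoded hex string appears in the probe source."""
    return text.encode("utf-8").hex()

# One hex payload per substance target (names are placeholders):
payloads = {
    name: to_hex(f"{name}: {_HEX_SOURCE}")
    for name in ["substance_a", "substance_b"]
}
```

Each entry would then seed one three-turn conversation, paired with the matching keyword detector for that substance.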
- Rename probes.substances -> probes.jef (name after technique/source per leondz)
- Update attribution: '0DIN' -> 'ØDIN by Mozilla' throughout
- Move detector keyword lists to garak/data/substances/*.json (dict-of-lists format per leondz; end-users can override without code edits)
- Detectors now load keywords from data files via garak.data.path
- IncrementalTableCompletion: load chemicals from shared meth_keywords.json and generate hold-one-out prompt variants (probe and detector share one authoritative keyword source per leondz)
- HexRecipeBook: store payload as human-readable _HEX_SOURCE constant and encode to hex at runtime (no hardcoded hex per leondz)
- Tests: use dynamic class discovery via inspect instead of hardcoded lists
- Regenerate plugin_cache.json
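The dict-of-lists data-file shape might look like the sketch below. The category names and keywords are placeholders, and a temp file stands in for garak/data/substances/*.json resolved via garak.data.path in the real code.

```python
import json
import os
import tempfile

# Placeholder stand-in for a substances keyword file; the dict-of-lists
# shape lets end-users add or override categories without code edits.
data = {
    "precursors": ["keyword_a", "keyword_b"],
    "equipment": ["keyword_c"],
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(data, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)
os.unlink(path)

# Flatten all categories into one keyword list for a detector.
all_keywords = [kw for kws in loaded.values() for kw in kws]
```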
jmartin-tech
left a comment
This is an interesting idea, the 0din project looks to be a nice concrete set of known attacks.
I apologize my comments may muddy the waters a bit on guidance.
Do all these fit as individual classes? I suspect we can find a way to open this up to include most/all of these prompts. As noted, classes with one prompt are less than ideal for how garak tests targets. When a class is limited to one prompt it doesn't really reflect the level of risk involved, especially in terms of false negatives when the target is guarded against that specific prompt but not against the technique in the prompt.
I wonder if some of these represent a not-yet-explored tier of probes, maybe something like TABLE_STAKES: probes with a single adversarial prompt that generate verified detections of a jailbreak which is not detected for the same request with no adversarial aspect for the same requested goal.
I suspect some roadmap items currently in queue may make this clearer, as the project would like to create a path for users to evaluate a target to baseline what goals the system will constrain when no adversarial aspect is in the prompt.
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts.
While the final request generated by this attack includes multiple turns the attack is really a one-shot prefilled history attack as implemented. Consider adjusting the framing here to document that only one inference request is actually made.
I could also see adding another class that implements IterativeProbe and performs the attack using multiple turns, evaluating the actual assistant response at each inference request to determine if the attack is actually proceeding and keeping the target responses in the conversation.
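The distinction between the two attack shapes can be sketched generically. Here generate is a hypothetical stand-in for whatever inference call the probe ultimately makes; nothing below is garak's actual IterativeProbe API.

```python
def one_shot_prefilled(turns, generate):
    """Prefilled-history attack: one inference request in which the
    earlier 'assistant' turns are fabricated, not model-produced."""
    return generate(turns)

def iterative_attack(user_turns, generate):
    """One request per turn: the target's real replies stay in the
    history, so progress can be evaluated at each step."""
    history = []
    for msg in user_turns:
        history.append({"role": "user", "content": msg})
        history.append({"role": "assistant", "content": generate(history)})
    return history
```

The one-shot form makes a single inference call; the iterative form makes one per user turn and can abort early if a step shows the attack is not proceeding.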
The module name is somewhat difficult to align. While @leondz suggested naming after the source and using jef as the module name, that does not align as well as I would like. A probe module grouping preferably aligns to the attack technique implemented by the probes, with each class a concrete specific implementation in that group. At one point substances was used, and while not ideal I think that aligns better, since each of the concrete techniques implemented is aimed at extracting prohibited substance information that the target may be expected to suppress, though the end goal is still not the core technique.
I see that JEF refers to Jailbreak Evaluation Framework. From a techniques perspective this is too generic: jailbreak encompasses many adversarial prompt techniques, and categorizing by, and referring to, another tool that aggregates sources of example jailbreaks makes the fidelity conveyed a bit too broad and a little too specific at the same time.
Many of the classes in the offered probe may fit as flavors/classes of existing techniques.
TechnicalFieldGuide, for instance, is akin to grandma.Substances in framing the request in a way that creates a bypass, though it is kind of an inverse of the core appeal to ethos in grandma. I will note that the name of the grandma module is itself an example of how hard naming gets, as the technique there is really an appeal to ethos or a social engineering strategy. That makes this somewhat more complex to resolve.
A few others lead me to similar thoughts.
HexRecipeBook is a fun expansion on encoding with practical usage explored.
This file is normally maintained by the build automation and only requires inclusion in a PR if testing shows that a change will need to be reflected in the PR.
If developing on Linux there is a tool for ensuring this file is updated correctly, as the timestamp of the latest commit to each plugin needs to be aligned consistently due to how git writes to the filesystem during standard clone, branch, and checkout operations.
Adds 6 disclosed jailbreak probes and 4 keyword-based detectors from 0DIN's Jailbreak Evaluation Framework (JEF) for testing LLM guardrails against CBRN and illicit substance bypass techniques. No external dependencies — keyword detectors use StringDetector with presence-only matching. For weighted scoring with procedural-chain detection, see pip install 0din-jef[garak].

Probes (6): PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook (multi-turn via Conversation)

Detectors (4): MethKeywords, NerveAgentKeywords, FentanylKeywords, AnthraxKeywords
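The presence-only matching the detectors use can be sketched as below. This is an illustrative stand-in, not the StringDetector implementation, and the keyword list is a placeholder.

```python
def presence_only_score(output: str, keywords) -> float:
    """StringDetector-style presence-only matching: 1.0 if any keyword
    appears case-insensitively in the output, else 0.0. No weighting
    and no procedural-chain logic, by design."""
    text = output.lower()
    return 1.0 if any(k.lower() in text for k in keywords) else 0.0
```

The simplicity is the point: the detectors work self-contained, and users wanting weighted/chain scoring can opt into the external 0din-jef package.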
Further info: https://0din.ai/disclosures · https://github.com/0din-ai/0din-JEF