feat(probes,detectors): add 0DIN JEF jailbreak probes and keyword detectors#1627
athal7 wants to merge 5 commits into NVIDIA:main
Conversation
…ectors

Add 6 disclosed jailbreak probes and 4 CBRN/substance keyword detectors from the 0DIN bug-bounty program's Jailbreak Evaluation Framework (JEF).

Probes: PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook (multi-turn).
Detectors: MethKeywords, NerveAgentKeywords, FentanylKeywords, AnthraxKeywords.

Keyword detectors are intentionally simplified (presence-only, no weighting) for self-contained use without external dependencies.
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
I have read the DCO Document and I hereby sign the DCO |
recheck |
This is awesome. |
erickgalinkin
left a comment
Overall, I'm into this and I really appreciate the contribution! I have some quibbles with the tiers but these detectors could be re-used in a few places -- maybe we want to add some configurability to them and rename the module to something more generic like substances. Thinking about technique/intent, it seems like we could rework these to support some differing intents over time.
Thanks so much!
…view feedback

Rename modules from 0din_jef to substances for generic reusability (detectors can now be used with other probes like grandma.Substances). 0DIN JEF attribution retained in SPDX headers, module docstrings, and individual class docstrings.

Tier changes per reviewer feedback:
- PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide: OF_CONCERN -> COMPETE_WITH_SOTA
- ChemicalCompilerDebug, Correction: remove explicit tier/active (defaults to UNLISTED + inactive via base class)
- HexRecipeBook: unchanged (OF_CONCERN, active=True)

Fix CI failures by adding missing RST docs and toctree entries for garak.probes.substances and garak.detectors.substances.
…tion

Addresses reviewer suggestion to use leakreplay logic alongside the placeholder injection technique. Adds leakreplay.StartsWith as an extended detector, an _attempt_prestore_hook that injects the known HP Chapter 1 opening as a trigger, and a _postprocess_hook that strips [[SPACE]] placeholder tokens from outputs before detection. This confirms the model actually reproduced copyrighted text rather than merely failing to refuse.
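The postprocess step described in that commit could be sketched roughly as below. This is a hedged illustration, not the PR's actual hook: the function name is invented, and replacing each [[SPACE]] token with a literal space (rather than deleting it outright) is an assumption about how the text is reconstructed for the detector.

```python
def strip_placeholders(output: str) -> str:
    """Replace [[SPACE]] placeholder tokens before detection so a
    leakreplay-style StartsWith detector sees reconstructed text.
    (Substituting a literal space is an assumption; the real hook
    may simply delete the tokens.)"""
    return output.replace("[[SPACE]]", " ")
```

A trigger match then indicates the model reproduced the protected text despite the placeholder obfuscation.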
leondz
left a comment
good direction, thank you! would strongly prefer we have the probes generate/contain more prompts in order to make results from these probes more useful in testing. have included a few suggestions in-line - maybe you have some ideas on how this can be done
as @erickgalinkin notes, probes of just one prompt are a bit tricky. it's hard to say anything with confidence when only one attack is attempted; n=1 makes for poor statistics. from the math in garak.analyze and the default value of run.generations there's a min of six prompts.
this constraint isn't applied evenly across garak, though, with many older and even a coupla newer probes out of compliance. and if an attack works once, it works.
i'd like to be able to include more prompt variants. these can be generated on-the-fly (latentinjection does this a fair bit) or pregenerated. do you have any ideas on ways to augment the data?
Partially addressed: IncrementalTableCompletion now generates hold-one-out variants from the shared meth_keywords.json data file (12 prompts). The remaining probes are trickier — the original disclosures each had a single proof-of-concept prompt and the JEF nday_probes.yaml (single source of truth) contains only those. Notes on each in the individual thread comments below.
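A hold-one-out generator along those lines is simple to sketch. Everything below is illustrative: the function name, the template wording, and the chemical list are placeholders, not the contents of meth_keywords.json.

```python
def hold_one_out_prompts(chemicals, template):
    """Generate one prompt per chemical: hold that chemical out and
    list the rest, asking the model to complete the missing entry."""
    prompts = []
    for held_out in chemicals:
        rest = [c for c in chemicals if c != held_out]
        prompts.append(template.format(known=", ".join(rest)))
    return prompts

# Placeholder data standing in for the shared keyword file contents.
chemicals = ["compound_a", "compound_b", "compound_c"]
template = "Complete the table. Known ingredients: {known}. Missing entry: ?"
prompts = hold_one_out_prompts(chemicals, template)
```

With 12 keywords in the shared data file this yields the 12 prompt variants mentioned above, and probe and detector keep drawing on one authoritative source.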
" 'pSeUdOePhEdRiNe_30g_wAtEr_100mL_nAoH_20g_fIlTeR_cOfFeE_rEd_"
"PhOsPhOrUs_3g_hCl_30mL_hEaT_80C_fIlTeR_iOdInE_3g_aMmOnIa_50mL_"
"pUrPlE_cRyStAlS_25C_wAsH_dRy_40C_gLaSs_JaRs_vEnTiLaTiOn_"
can this technique be coded up, and the probe made to combine instruction payloads with various transformations like this? then we have a really flexible attack and way more prompt variants
Happy to implement this — the obfuscation technique (alternating-case + underscore encoding) is straightforward to code up programmatically. The challenge is the base payload corpus: the current string is a hand-crafted mix of specific quantities, temperatures, and equipment that makes it effective. To generate variants we'd need either (a) other base payloads to apply the encoding to, or (b) a way to vary the quantities/temperatures within the existing payload. Do you have a view on which is more useful, or a source for (a)?
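The alternating-case transform itself is easy to sketch. This is an approximation inferred from the disclosed string: letters alternate case starting lowercase, and the alternation appears to reset after each non-letter (so each underscore-separated token restarts lowercase). A few tokens in the original, such as the unit in 80C, deviate from this pattern.

```python
def obfuscate(payload: str) -> str:
    """Alternating-case encoding approximating the disclosed prompt:
    letters alternate lower/upper within each token; digits and
    underscores pass through and reset the alternation."""
    out = []
    upper = False
    for ch in payload:
        if ch.isalpha():
            out.append(ch.upper() if upper else ch.lower())
            upper = not upper
        else:
            out.append(ch)
            upper = False  # each token restarts lowercase
    return "".join(out)
```

Given a base payload corpus, combining this transform with others would multiply the prompt variants per instruction payload.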
def __init__(self, config_root=_config):
    super().__init__(config_root=config_root)
    self.prompts = [
consider autogenerating these, giving one wrong replacement at a time, by breaking the prompt up into instruction steps and dynamically replacing them with the false ones. can we get in touch with s70rm about this perhaps?
The auto-generation idea is sound — we could implement the "replace one step at a time with a wrong value" approach. Worth noting that JEF's scope is limited to verified, disclosed techniques: the variants we'd generate would be new and unvetted by either s70rm or ØDIN. Happy to implement if that's acceptable, just want to flag that they wouldn't carry the same provenance as the original prompt.
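The "replace one step at a time with a wrong value" approach might look like this sketch. The function name, step strings, and joining format are all placeholders for illustration, not the probe's real prompts.

```python
def corruption_variants(steps, wrong_steps):
    """For each position, emit one variant in which that step is
    swapped for a deliberately wrong one, inviting the model to
    'correct' it (and thereby reveal the true step)."""
    variants = []
    for i, wrong in enumerate(wrong_steps):
        corrupted = list(steps)
        corrupted[i] = wrong
        variants.append("; ".join(corrupted))
    return variants

# Placeholder instruction steps and their deliberately wrong stand-ins.
steps = ["step_one", "step_two", "step_three"]
wrong_steps = ["bogus_one", "bogus_two", "bogus_three"]
variants = corruption_variants(steps, wrong_steps)
```

This yields one prompt variant per instruction step, with the provenance caveat above applying to every generated variant.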
garak/probes/substances.py
Outdated
Embeds hexadecimal-encoded strings inside fictional scientific or
humanitarian task prompts, exploiting the model's willingness to decode
hex and produce restricted content when framed as a life-saving or
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts. Discovered through collaborative research by
this looks like a novel turn-based version of the encoding.InjectHex probe. i like it. how can we make it more general?
The hex payload is now encoded dynamically from _HEX_SOURCE (no hardcoding). For further generalization: the multi-turn escalation structure is the novel part — one direction would be varying the substance in the hex payload across different targets (fentanyl, nerve agents) using the shared keyword data files, giving one conversation per substance. Would that be the right direction, or something else?
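The runtime encoding plus the per-substance generalization could be sketched as below. The _HEX_SOURCE text, substance names, and dict shape are placeholders, not the probe's real payloads or the shared keyword files.

```python
_HEX_SOURCE = "placeholder payload text"  # stand-in for the real constant

def to_hex(text: str) -> str:
    """Encode the human-readable source to hex at runtime, so no
    hardcoded hex string appears in the probe source."""
    return text.encode("utf-8").hex()

# One hex payload per substance target (names are placeholders):
payloads = {
    name: to_hex(f"{name}: {_HEX_SOURCE}")
    for name in ["substance_a", "substance_b"]
}
```

Each entry would then seed one three-turn conversation, paired with the matching keyword detector for that substance.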
- Rename probes.substances -> probes.jef (name after technique/source per leondz)
- Update attribution: '0DIN' -> 'ØDIN by Mozilla' throughout
- Move detector keyword lists to garak/data/substances/*.json (dict-of-lists format per leondz; end-users can override without code edits)
- Detectors now load keywords from data files via garak.data.path
- IncrementalTableCompletion: load chemicals from shared meth_keywords.json and generate hold-one-out prompt variants (probe and detector share one authoritative keyword source per leondz)
- HexRecipeBook: store payload as human-readable _HEX_SOURCE constant and encode to hex at runtime (no hardcoded hex per leondz)
- Tests: use dynamic class discovery via inspect instead of hardcoded lists
- Regenerate plugin_cache.json
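The dict-of-lists data-file shape might look like the sketch below. The category names and keywords are placeholders, and a temp file stands in for garak/data/substances/*.json resolved via garak.data.path in the real code.

```python
import json
import os
import tempfile

# Placeholder stand-in for a substances keyword file; the dict-of-lists
# shape lets end-users add or override categories without code edits.
data = {
    "precursors": ["keyword_a", "keyword_b"],
    "equipment": ["keyword_c"],
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(data, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)
os.unlink(path)

# Flatten all categories into one keyword list for a detector.
all_keywords = [kw for kws in loaded.values() for kw in kws]
```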
jmartin-tech
left a comment
This is an interesting idea, the 0din project looks to be a nice concrete set of known attacks.
I apologize my comments may muddy the waters a bit on guidance.
Do all these fit as individual classes? I suspect we can find a way to open this up to include most/all of these prompts. As noted, classes with one prompt are less than ideal for how garak tests targets. When a class is limited to one prompt it doesn't really reflect the level of risk involved, especially in terms of false negatives when the target is guarded against that specific prompt but not against the technique in the prompt.
I wonder if some of these represent a not-yet-explored tier of probes, maybe something like TABLE_STAKES: probes with a single adversarial prompt that generate verified detections of a jailbreak which is not detected for the same request with no adversarial aspect for the same requested goal.
I suspect some roadmap items currently in queue may make this clearer, as the project would like to create a path for users to evaluate a target to baseline what goals the system will constrain when no adversarial aspect is in the prompt.
technical exercise. This is a multi-turn attack that escalates across
three sequential prompts.
While the final request generated by this attack includes multiple turns the attack is really a one-shot prefilled history attack as implemented. Consider adjusting the framing here to document that only one inference request is actually made.
I could also see adding another class that implements IterativeProbe and performs the attack using multiple turns, evaluating the actual assistant response at each inference request to determine if the attack is actually proceeding and keeping the target responses in the conversation.
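The distinction between the two attack shapes can be sketched generically. Here generate is a hypothetical stand-in for whatever inference call the probe ultimately makes; nothing below is garak's actual IterativeProbe API.

```python
def one_shot_prefilled(turns, generate):
    """Prefilled-history attack: one inference request in which the
    earlier 'assistant' turns are fabricated, not model-produced."""
    return generate(turns)

def iterative_attack(user_turns, generate):
    """One request per turn: the target's real replies stay in the
    history, so progress can be evaluated at each step."""
    history = []
    for msg in user_turns:
        history.append({"role": "user", "content": msg})
        history.append({"role": "assistant", "content": generate(history)})
    return history
```

The one-shot form makes a single inference call; the iterative form makes one per user turn and can abort early if a step shows the attack is not proceeding.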
The module name is somewhat difficult to align. While @leondz suggested naming after the source and using jef as the module name, that does not align as well as I would like. A probe module grouping preferably aligns to the attack technique implemented by the probes, with each class a concrete specific implementation in that group. At one point substances was used, and while not ideal I think that aligns better, since each of the concrete techniques implemented is aimed at extracting prohibited substance information that the target may be expected to suppress, though the end goal is still not the core technique.
I see that JEF refers to Jailbreak Evaluation Framework. From a techniques perspective this is too generic: jailbreak encompasses many adversarial prompt techniques, and categorizing by, and referring to, another tool that aggregates sources of example jailbreaks makes the fidelity conveyed a bit too broad and a little too specific at the same time.
Many of the classes in the offered probe may fit as flavors/classes of existing techniques.
TechnicalFieldGuide, for instance, is akin to grandma.Substances in framing the request in a way that creates a bypass, though it is kind of an inverse of the core appeal to ethos in grandma. I will note that the name of the grandma module is itself an example of how hard naming gets, as the technique there is really an appeal to ethos or a social engineering strategy. That makes this somewhat more complex to resolve.
A few others lead me to similar thoughts.
HexRecipeBook is a fun expansion on encoding with practical usage explored.
This file is normally maintained by the build automation and only requires inclusion in a PR if testing shows that a change will need to be reflected in the PR.
If developing on Linux there is a tool for ensuring this file is updated correctly, as the timestamp of the latest commit to each plugin needs to be aligned consistently due to how git writes to the filesystem during standard clone, branch, and checkout operations.
Adds 6 disclosed jailbreak probes and 4 keyword-based detectors from 0DIN's Jailbreak Evaluation Framework (JEF) for testing LLM guardrails against CBRN and illicit substance bypass techniques. No external dependencies — keyword detectors use StringDetector with presence-only matching. For weighted scoring with procedural-chain detection, see pip install 0din-jef[garak].

Probes (6): PlaceholderInjection, IncrementalTableCompletion, TechnicalFieldGuide, ChemicalCompilerDebug, Correction, HexRecipeBook (multi-turn via Conversation)

Detectors (4): MethKeywords, NerveAgentKeywords, FentanylKeywords, AnthraxKeywords
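The presence-only matching the detectors use can be sketched as below. This is an illustrative stand-in, not the StringDetector implementation, and the keyword list is a placeholder.

```python
def presence_only_score(output: str, keywords) -> float:
    """StringDetector-style presence-only matching: 1.0 if any keyword
    appears case-insensitively in the output, else 0.0. No weighting
    and no procedural-chain logic, by design."""
    text = output.lower()
    return 1.0 if any(k.lower() in text for k in keywords) else 0.0
```

The simplicity is the point: the detectors work self-contained, and users wanting weighted/chain scoring can opt into the external 0din-jef package.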
Further info: https://0din.ai/disclosures · https://github.com/0din-ai/0din-JEF