feat: Add probe-level resume (runservice) and report digest fixes #1589
shrikantpachpor wants to merge 1 commit into NVIDIA:main from shrikantpachpor:feature/resumable-scanning
Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

I have read the DCO Document and I hereby sign the DCO

recheck
jmartin-tech left a comment
This is a preliminary review; more will be added. The current implementation is a good start and looks to build on #1531. Similar concerns to those raised on that other implementation apply here as well.
Further review and thought will be forthcoming.
My comment from PR 1531 applies to this file's changes as well.
I still view modification of the input report as something that should be avoided, as it introduces significant complexity to manage state and could cause corruption of the original input file if it is reused for output.
Understood. The report modification approach introduces complexity and corruption risks. This needs to be addressed as part of the broader architectural discussion about how resume should work. I'll address this in a follow-up where we can properly design the file strategy.
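For illustration, one way to implement the non-destructive file strategy discussed here is to derive a fresh timestamped output prefix for each resumed run, so the original report JSONL is only ever read. This is a sketch only; `resume_report_prefix` is a hypothetical helper, not garak's actual API.

```python
import datetime
import os

def resume_report_prefix(original_prefix: str, report_dir: str) -> str:
    """Derive a fresh output prefix for a resumed run so the original
    report JSONL is only ever read, never appended to."""
    stamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    return os.path.join(report_dir, f"{original_prefix}.resume.{stamp}")

# The resumed run writes its reports under the new prefix; the
# checkpoint it was loaded from stays untouched.
new_prefix = resume_report_prefix("garak.abc123", "/tmp/reports")
```

Because the input file is never opened for writing, a crash during the resumed run can at worst lose the new output, never corrupt the checkpoint.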
garak/cli.py (Outdated)
```python
# RESUME SUPPORT: Restore report paths and override probe spec with probes from resumed run
from garak import resumeservice

if resumeservice.enabled():
    resumed_state = resumeservice.get_state()
    if resumed_state:
        # Restore report directory and prefix from resumed state
        if "report_dir" in resumed_state:
            _config.reporting.report_dir = resumed_state["report_dir"]
            logging.info(
                f"Restored report_dir from state: {resumed_state['report_dir']}"
            )
        if "report_prefix" in resumed_state:
            _config.reporting.report_prefix = resumed_state["report_prefix"]
            logging.info(
                f"Restored report_prefix from state: {resumed_state['report_prefix']}"
            )

        # Use the original run_id to maintain report filename consistency
        if "run_id" in resumed_state:
            # Extract UUID from full run_id format "garak-run-<uuid>-<timestamp>"
            full_run_id = resumed_state["run_id"]
            original_run_id = resumeservice.extract_uuid_from_run_id(full_run_id)
            _config.transient.run_id = original_run_id
            logging.info(f"Restored run_id from state: {original_run_id}")

        # Override probe spec with probes from resumed run
        if "probenames" in resumed_state:
            resumed_probes = resumed_state["probenames"]
            # Strip "probes." prefix if present for parse_plugin_spec compatibility
            resumed_probes_clean = [p.replace("probes.", "") for p in resumed_probes]
            # Convert probe list to comma-separated spec
            _config.plugins.probe_spec = ",".join(resumed_probes_clean)
            logging.info(f"Resuming run with probes from state: {resumed_probes}")
            print(f"🔄 Using probes from resumed run: {', '.join(resumed_probes_clean)}")
```
The logic here does not belong in cli. This package is intended to parse arguments and set up runtime _config state to allow launch of a run. The functionality here is too tightly coupled and should likely be deferred to a separate stage in the run.
Agreed: CLI should only parse arguments and set up config. The state restoration and probe-spec override logic is too tightly coupled here and should happen at a different stage. This will be addressed as part of moving resume logic to the proper place in the run lifecycle.
garak/cli.py (Outdated)
```python
# Handle resume-related commands
if hasattr(args, "list_runs") and args.list_runs:
    from garak import resumeservice
    from datetime import datetime

    runs = resumeservice.list_runs()
    if not runs:
        print("\n📋 No unfinished runs found.")
        print("\nStart a new resumable scan with: garak --resumable [options]\n")
    else:
        print("\n📋 Resumable Runs\n")

        # Print header
        print(f"{'#':<4} {'Run ID':<38} {'Started':<20} {'Progress':<12} {'%':<6}")
        print("-" * 82)

        for idx, run in enumerate(runs, 1):
            # Calculate percentage
            percentage = (
                (run["progress"] / run["total"] * 100) if run["total"] > 0 else 0
            )

            # Format the timestamp more readably
            try:
                dt = datetime.fromisoformat(run["start_time"])
                formatted_time = dt.strftime("%Y-%m-%d %H:%M")
            except:
                formatted_time = run["start_time"][:16]

            # Progress format
            progress_str = f"{run['progress']}/{run['total']}"

            print(
                f"{idx:<4} {run['run_id']:<38} {formatted_time:<20} {progress_str:<12} {percentage:>5.1f}%"
            )

        print("-" * 82)
        print(f"\nTotal: {len(runs)} unfinished run(s)")
        print("\nTo resume: garak --resume <run_id>")
        print("To delete: garak --delete_run <run_id>\n")
    return

if hasattr(args, "delete_run") and args.delete_run:
    from garak import resumeservice

    try:
        resumeservice.delete_run(args.delete_run)
        print(f"✅ Deleted run: {args.delete_run}")
    except Exception as e:
        print(f"❌ Failed to delete run: {e}")
        import sys

        sys.exit(1)
    return
```
Parsing of resume-related options should set up the global configuration for later actions.
Services are launched at harness initialization and should not be accessed directly by cli. The idea of managing existing runs on disk is novel and interesting; however, it seems to be a distinct feature that should be independent of a resume-specific service. I had noted in PR #1531 that it might make sense to have a runservice that manages run state and also provides support for a resume feature.
You're right on multiple points. Services should be launched at harness initialization, not accessed directly by CLI. Keeping run management independent of a resume-specific service also makes sense - your suggestion of a "runservice" that manages run state and also provides resume support seems like a better separation of concerns. This needs architectural rethinking about where run management lives and how CLI-only commands should work.
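A rough sketch of what such a runservice might look like, assuming XDG-style storage as discussed above. The class and method names are hypothetical, not this PR's actual implementation; the point is that persistence lives in one service that both the resume feature and run-management commands can use.

```python
import json
import os
from pathlib import Path

class RunService:
    """Illustrative run-state service: owns on-disk persistence so
    neither the CLI nor the harness touches run files directly."""

    def __init__(self, data_home=None):
        base = data_home or os.environ.get(
            "XDG_DATA_HOME", os.path.expanduser("~/.local/share")
        )
        self.runs_dir = Path(base) / "garak" / "runs"

    def save_state(self, run_id, state):
        # One directory per run, state serialized as JSON
        run_dir = self.runs_dir / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "state.json").write_text(json.dumps(state))

    def list_runs(self):
        if not self.runs_dir.exists():
            return []
        return sorted(p.name for p in self.runs_dir.iterdir() if p.is_dir())

    def delete_run(self, run_id):
        (self.runs_dir / run_id / "state.json").unlink(missing_ok=True)
        (self.runs_dir / run_id).rmdir()
```

Under this shape, `--list_runs` and `--delete_run` become thin calls into the service rather than CLI-resident logic.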
garak/harnesses/probewise.py (Outdated)
```python
class MinimalAttempt:
    def __init__(self, data):
        self.uuid = data.get("uuid", "")
        self.seq = data.get("seq", 0)
        self.status = 1  # Was executed
        self.probe_classname = data.get("probe_classname", "")
        self.probe_params = data.get("probe_params", {})
        self.targets = data.get("targets", [])
        self.notes = data.get("notes", {})
        self.goal = data.get("goal", "")
        self.detector_results = data.get("detector_results", {})
        self.conversations = data.get("conversations", [])
        self.reverse_translation_outputs = data.get("reverse_translation_outputs", [])
        # Store prompt and outputs as-is (already serialized)
        self._prompt_data = data.get("prompt", {})
        self._outputs_data = data.get("outputs", [])
        # Cache outputs as Message objects for evaluator
        self._outputs_cache = None
        self._prompt_cache = None

    @property
    def prompt(self):
        """Return prompt as Conversation object (for evaluator)"""
        if self._prompt_cache is None:
            from garak.attempt import Conversation

            # Reconstruct Conversation from the stored dict
            if isinstance(self._prompt_data, dict):
                self._prompt_cache = Conversation.from_dict(self._prompt_data)
            else:
                # Fallback to empty conversation
                self._prompt_cache = Conversation([])
        return self._prompt_cache

    @property
    def outputs(self):
        """Return output messages as a list (for evaluator)"""
        if self._outputs_cache is None:
            from garak.attempt import Message

            messages = []
            if isinstance(self._outputs_data, list):
                for output in self._outputs_data:
                    if isinstance(output, dict):
                        msg = Message(
                            text=output.get("text", ""),
                            lang=output.get("lang", "en"),
                            data_path=output.get("data_path"),
                            data_type=output.get("data_type"),
                            data_checksum=output.get("data_checksum"),
                            notes=output.get("notes", {}),
                        )
                        messages.append(msg)
                    elif isinstance(output, str):
                        messages.append(Message(text=output, lang="en"))
            self._outputs_cache = messages
        return self._outputs_cache

    def as_dict(self):
        """Return dictionary representation for report writing"""
        return {
            "entry_type": "attempt",
            "uuid": self.uuid,
            "seq": self.seq,
            "status": self.status,
            "probe_classname": self.probe_classname,
            "probe_params": self.probe_params,
            "targets": self.targets,
            "prompt": self._prompt_data,
            "outputs": self._outputs_data,
            "detector_results": self.detector_results,
            "notes": self.notes,
            "goal": self.goal,
            "conversations": self.conversations,
            "reverse_translation_outputs": self.reverse_translation_outputs,
        }

    def outputs_for(self, lang_spec):
        """Return output messages for detector evaluation"""
        # Reconstruct Message objects from outputs_data
        from garak.attempt import Message

        messages = []
        if isinstance(self._outputs_data, list):
            for output in self._outputs_data:
                if isinstance(output, dict):
                    msg = Message(
                        text=output.get("text", ""),
                        lang=output.get("lang", lang_spec),
                        data_path=output.get("data_path"),
                        data_type=output.get("data_type"),
                        data_checksum=output.get("data_checksum"),
                        notes=output.get("notes", {}),
                    )
                    messages.append(msg)
                elif isinstance(output, str):
                    messages.append(Message(text=output, lang=lang_spec))
        return messages


attempt = MinimalAttempt(entry)
incomplete_attempts.append(attempt)
```
The need to introduce this class suggests a level of complexity here that likely needs to be reworked.
Agreed, the need for this class indicates a design problem. I created it to bridge the gap between loading previous attempts from the report and what the evaluator expects, but this complexity suggests the approach needs reworking. This ties into the broader attempt-level resumption design issues.
Most of the changes in this class are tightly coupled to resume specifically and should not be the responsibility of the harness. The state machine for executing a run can be organized more cleanly so that harnesses request additional data from a service that is aware of the resume state, rather than having to build and manage that state themselves.
Right. The harness is managing resume state when it should be requesting data from a service that's aware of the resume context. The state machine for run execution needs cleaner organization - harness should ask the service what work needs to be done rather than building and managing that state itself. This is a fundamental separation of concerns issue.
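The "harness asks, service answers" split described here could look roughly like this. A hypothetical sketch, not the harness's actual code: the harness hands the full probe list to resume-aware state and receives back only what remains.

```python
def probes_to_run(requested, service_state):
    """Harness-side sketch: ask resume-aware state which probes remain,
    instead of rebuilding attempt state inside the harness.
    `service_state` is whatever the run service loaded from disk."""
    completed = set(service_state.get("completed_probes", []))
    return [p for p in requested if p not in completed]

remaining = probes_to_run(
    ["probes.dan.Dan_11_0", "probes.xss.ColabAIDataLeakage"],
    {"completed_probes": ["probes.dan.Dan_11_0"]},
)
# remaining now holds only the probe(s) not yet completed
```

The harness keeps its existing execution loop and simply iterates over the filtered list; everything resume-specific stays behind the service boundary.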
garak/harnesses/probewise.py (Outdated)
```python
class AttemptMock:
    def __init__(self, outputs, probename, prompt=None, seq=0):
        self.all_outputs = [
            OutputMock(output)
            for output in (outputs if isinstance(outputs, list) else [outputs])
        ]
        self.probename = probename
        self.probe_classname = ".".join(
            probename.split(".")[1:]
        )  # Changed to "category.Class" format, e.g., "xss.ColabAIDataLeakage"
        self.prompt = (
            prompt
            if prompt
            else {
                "turns": [
                    {
                        "role": "user",
                        "content": {"text": prompt or "", "lang": "en", "notes": {}},
                    }
                ]
            }
        )
        self.status = "success"
        self.detector_results = {}
        self.notes = {"terms": ["summary", "conversation"]}
        self.outputs = [output.text for output in self.all_outputs]
        self.uuid = str(uuid.uuid4())
        self.seq = seq
        self.probe_params = {}
        self.targets = []
        self.conversations = [
            {
                "turns": [
                    {
                        "role": "user",
                        "content": {"text": prompt or "", "lang": "en", "notes": {}},
                    },
                    {
                        "role": "assistant",
                        "content": {"text": output.text, "lang": "en", "notes": {}},
                    },
                ]
            }
            for output in self.all_outputs
        ]
        self.reverse_translation_outputs = []
```

A separate detector-loading fragment also appears in the same diff:

```python
if detector:
    return detector
else:
    print(f" detector load failed: {detector_name}, skipping >>")
    logging.error(f" detector load failed: {detector_name}, skipping >>")
    return False
```
I suspect this is some sort of testing method; if so, it should not be in the package class.
You're correct, this is testing/debug code that shouldn't be in the package. I'll remove it.
Force-pushed 47c10e0 to 2949288.
@jmartin-tech I've addressed the straightforward fixes in commits 2949288 and 07556ad. However, your review has identified fundamental architectural issues that I think warrant discussion before implementation.

Core issues identified: the feedback points to a consistent pattern, namely that resume logic is scattered across CLI, harness, and service instead of being properly encapsulated:

- CLI doing business logic: report setup, state restoration, and direct service access all violate the principle that CLI should only parse arguments and set up config.
- Harness managing state: the probewise.py changes build and manage resume state, when the harness should be requesting data from a service that's aware of the resume context.
- Attempt-level design: seq-based matching doesn't work due to randomization, external state storage adds unnecessary complexity when Attempts already have state indicators, and the MinimalAttempt class suggests the approach needs rethinking. Your reference to PR #1531 indicates the correct approach requires prompt comparison and seed validation.
- Report modification strategy: the complexity and corruption risks around appending to input reports.
- State management scope: questions about where storage belongs and what the service should be responsible for.
- Service naming: your suggestion about "runservice" managing run state and providing resume support makes sense as a cleaner separation.

Path forward: these issues are interconnected and need holistic design rather than piecemeal fixes. I see two approaches:

1. Redesign in this PR: address all architectural concerns here, but this is 2-4 weeks of work with significant design discussion needed.
2. Two-PR strategy: the quick fixes are done; create a new PR focused entirely on the architectural redesign with proper service/harness/CLI separation, potentially with simplified scope initially (probe-level only).

I'm leaning toward option 2 to keep the PRs focused and allow proper design discussion, but I'm open to your preference.

Questions for design:

1. Service architecture: should the service provide get_probes_to_run() and get_attempts_to_run() APIs that harnesses query, rather than harnesses managing state?
2. CLI commands: for --list_runs and --delete_run, which don't launch a full run, how should they access run management if services are only launched at harness initialization?
3. Attempt resumption: should we simplify to probe-level only initially, or implement the full prompt-based comparison approach from PR #1531?
4. Report strategy: new output file per resume (read-only input) vs. appending to existing?
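Purely as a discussion aid for the service-architecture question above, one possible shape for those query APIs (all names hypothetical): the service holds the loaded run state, and harnesses ask it what to run rather than inspecting state themselves.

```python
class RunServiceAPI:
    """Discussion sketch: harnesses query, the service decides."""

    def __init__(self, state):
        self._state = state  # loaded from persisted run state

    def get_probes_to_run(self, requested):
        """Filter out probes the saved run already completed."""
        done = set(self._state.get("completed_probes", []))
        return [p for p in requested if p not in done]

    def get_attempts_to_run(self, probe_name, attempts):
        """Filter out attempts already recorded for this probe."""
        done = set(self._state.get("completed_attempt_uuids", []))
        return [a for a in attempts if a.get("uuid") not in done]
```

A probe-level-only first iteration would simply never populate `completed_attempt_uuids`, making `get_attempts_to_run` a pass-through until attempt-level matching is designed.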
Force-pushed 09eae89 to 0dda9f1.
Implement garak.runservice for XDG-based run state, addressing maintainer review on NVIDIA#1589: keep CLI thin (config/dispatch only), load run state via harness runtime services, avoid mutating the original JSONL on resume (new timestamped report prefix per resume), and resume at probe granularity only with non-resumable probes marked in probe base classes. CLI: --resume, --list_runs, --delete_run. Resume validates probe_spec and generator target against saved state. Report digest HTML: neutralize </script> in embedded JSON; load digest via application/json + JSON.parse for consistent Chromium/Safari/file:// behavior; harden JSONL parsing for digest generation. Align .gitignore with NVIDIA/garak main. Black-format garak/, tests/, and tools/ per pyproject.toml. Signed-off-by: Shrikant Pachpor <shri.pachpor24@gmail.com>
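The two digest-hardening steps in this commit message can be sketched as follows. This is an illustration of the technique, not the exact `report_digest.py` implementation:

```python
import json

digest = {"note": "</script><script>alert(1)</script>"}

# 1. Neutralize "</" so embedded JSON containing </script> cannot
#    terminate the surrounding <script> tag early. "<\/" is a valid
#    JSON escape that parses back to "</".
payload = json.dumps(digest).replace("</", "<\\/")

# 2. Embed as application/json and parse at load time via JSON.parse,
#    instead of a huge inline object literal, for consistent
#    Chromium/Safari/file:// behavior.
page = (
    f'<script id="reportData" type="application/json">{payload}</script>\n'
    "<script>window.reportsData = "
    'JSON.parse(document.getElementById("reportData").textContent);</script>'
)
```

Because `\/` is legal JSON, the round trip is lossless: `JSON.parse` in the browser (and `json.loads` in Python) recovers the original `</script>` text intact.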
Force-pushed 0603a0c to 274bb3c.
Could a maintainer please approve workflow runs for this fork PR? The CLA check has completed, but the pytest / install workflows are still at "Action required" and have not run yet. Full CI would help before further review. Thanks.
All tests can be run locally in your own fork via Actions; the PR workflow here will be triggered after a maintainer has had time to look at the code and decide whether tests should be run using a project-supplied token.
Note that the latest push here applied formatting changes to files not involved in the PR goal; this should be avoided so review can focus on changes related to the purpose of the PR. This PR also adds management features that expand the scope significantly. I would prefer to see the focus of the PR limited to a usable design for resume, with any management features extracted into a follow-on PR for the project to evaluate on its own merits. @shrikantpachpor, I will process this PR more in the coming week to offer design feedback and important caveats.
Implements probe-level resumable scans so interrupted `garak` runs can continue without redoing probes that already fully completed, using persisted run state under the XDG data directory. This addresses long-running scans interrupted by network issues, rate limits, or crashes, and reduces redundant generations/API work.

This revision supersedes the earlier `resumeservice` / attempt-level direction discussed on this PR. It follows maintainer review feedback on #1589: `garak.runservice` owns run/resume state (loaded with harness runtime services), the CLI stays thin (arguments → config / dispatch), resume does not append into the original JSONL (each resume uses a new timestamped report prefix; the prior report remains intact), and the harness skips completed probes rather than embedding a heavy attempt-replay layer.

Also includes HTML report digest hardening (embedded JSON / `file://` browser consistency) and Black formatting applied broadly under `pyproject.toml` so CI style checks stay green.

Closes #141
Overview
Long-running vulnerability scans can be interrupted by network issues, rate limits, or system crashes. This feature lets users resume from saved state instead of starting from scratch, saving time and resources—at probe granularity (see below).
Key features
Run service architecture (`garak.runservice`)
- Run state is stored at `<xdg_data_home>/garak/runs/<run_id>/state.json` (see README).
- `probe_spec` and the generator target (`--target_type`/`--target_name`) must match the saved run; a mismatch raises `ResumeValidationError` with a clear message.
- Follows `garak` data-dir conventions (not a hard-coded `~/.garak` path in docs).

Granularity (updated from earlier PR text)
- Resume operates at probe level; there is no `--resume_granularity` switch in this PR.

CLI (updated from earlier PR text)
- `--resume <run_id>` — continue from saved state (new report file for this continuation).
- `--list_runs` — list saved runs (newest first).
- `--delete_run <run_id>` — remove saved state for a run.
- There is no `--resumable [true/false]` toggle in this design: progress is tracked through the run service where applicable.

State management
- Run state is persisted in `state.json`.
- Probes carry resumability flags (`resumable`/`supports_resume`); certain probes (e.g. interactive / tree-search style) are marked non-resumable so they are not skipped incorrectly.

Maintainer review alignment (#1589)
- No logic in `cli` beyond parsing and dispatch; `command`/`runservice` handle setup and state.
- The `runservice` suggested in review is what implements run/resume concerns; the harness integrates with existing runtime service loading.
- The `MinimalAttempt`-style attempt reconstruction path is dropped from this redesign's scope (attempt-level approach removed).

HTML report digest (related)
- Neutralize `</script>` in the serialized digest before embedding in `index.html`.
- Use `JSON.parse` instead of huge inline object literals (injected `application/json` / `window.reportsData` pattern) for more consistent behavior in Chromium/Safari and `file://` opens.

Code formatting (CI)
- Black-formatted `garak/`, `tests/`, and related paths per `pyproject.toml` so style checks pass. Some touched files are format-only; see "Modified files" grouping below.

Testing
- Run service (`tests/test_runservice.py`).
- Probewise harness resume (`tests/harnesses/test_probewise_resume.py`).
- Probe resumability flags (`tests/test_probe_resumability.py`).
- Command run start (`tests/test_command_start_run.py`).
- CLI (`tests/cli/test_cli.py`, list filtering, etc.).
- Report digest HTML embedding (`tests/analyze/test_report_digest_html_embed.py`).
- `conftest` / other tests updated as needed for harness config and isolation.

Modified files
Core (resume / harness / probes / CLI)
- `garak/cli.py` — `--resume`, `--list_runs`, `--delete_run`; sets `transient.resume_run_id` / dispatches list & delete.
- `garak/command.py` — run start/end integration with `runservice`, resume prefix setup, `list_runs`/`delete_run`.
- `garak/harnesses/base.py`, `garak/harnesses/probewise.py` — load run service; skip completed probes when resuming.
- `garak/probes/base.py` — resumability flags; non-resumable probe classes updated.
- `garak/exception.py` — `ResumeValidationError`.
- `garak/_config.py` — transient/config hooks as needed for resume.
- `garak/interactive.py` — small related adjustments if present in diff.
- `README.md` — Resumable Scans user documentation (probe-level, XDG paths, new report on resume, validation rules).

Report digest
- `garak/analyze/report_digest.py` — embed + parse hardening as above.

Also changed (primarily Black / style consistency)
These files are included in the same commit largely for formatter alignment with `pyproject.toml` / CI; behavior changes are not the focus:
- `garak/analyze/bootstrap_ci.py`, `garak/analyze/ci_calculator.py`, `garak/analyze/rebuild_cis.py`
- `garak/evaluators/base.py`, `garak/generators/websocket.py`, `garak/resources/autodan/autodan.py`
- `tests/analyze/test_ci_calculator.py`, `tests/analyze/test_detector_metrics.py`
- `tests/buffs/test_buff_config.py`, `tests/conftest.py`
- `tests/detectors/test_detectors_packagehallucination.py`
- `tests/generators/test_function.py`, `tests/generators/test_generators.py`
- `tests/test_attempt.py`

New files
- `garak/runservice.py` — run state, resume validation, list/delete runs, resume report prefix handling.
- `tests/test_runservice.py`
- `tests/harnesses/test_probewise_resume.py`
- `tests/test_command_start_run.py`
- `tests/test_probe_resumability.py` (new or materially extended; see diff)
- `tests/analyze/test_report_digest_html_embed.py`

Superseded / removed vs earlier iterations of this PR
The following are not part of this revision (replaced by the design above):
- `garak/resumeservice.py` and related attempt-level resume machinery.
- `garak/serializers.py` and large `garak-config.example.yaml` resume blocks from the earlier approach (if they were present on older branch revisions).
- `--resumable`, `--resume_granularity`, and attempt-level semantics described in the previous PR description.

Compatibility
- No change to existing behavior without `--resume` / run state.

Testing (how to run)
- `python -m pytest tests/` (or CI) — contributions are expected to pass project tests per contributing docs.

Benefits
- Cleaner separation: `runservice`, thinner CLI, harness-driven service loading.
- More consistent report digest rendering, including over `file://`.

Tell us what this change does. If you're fixing a bug, please mention the GitHub issue number.
Please ensure you are submitting from a unique branch in your repository to `main` upstream.
Use this as a maintainer/reviewer checklist (GitHub task list). Check items off as you complete them locally or in review.
- [ ] Branch is `shrikantpachpor:feature/resumable-scanning` (single squashed commit is OK).
- [ ] `python -m pip install -e .` (or project-recommended env) succeeds.
- [ ] `python -m pytest tests/` passes (or explain any environment-specific skips).
- [ ] `black --check .` (or repo's CI equivalent) passes.
- [ ] Run with `--probes` + target → interrupt mid-run → `--list_runs` shows run → `--resume <id>` with same probes + target → completed probes skipped → new report artifact; original JSONL unchanged.
- [ ] Change `--probes` or target vs checkpoint → resume fails with `ResumeValidationError` / clear CLI error.
- [ ] Digest HTML renders over `file://` or local open, if relevant.

Supporting configuration (example)
Optional generator config still applies as today, e.g.:

```json
{ "huggingface": { "torch_type": "float32" } }
```

If you are opening a PR for a new plugin that targets a specific piece of hardware or requires a complex or hard-to-find testing environment, we recommend that you send us as much detail as possible.
Specific Hardware Examples:
- `cuda` / `mps` (please note `cuda` via ROCm if related)

Complex Software Examples: