[core] agentic: on_failure: start working on agentic troubleshooting by kpouget · Pull Request #70 · openshift-psap/forge

kpouget · 2026-06-03T16:31:18Z

Summary by CodeRabbit

New Features
- AI-powered failure analysis with structured root-cause guidance, optional HTML reports, and a CLI for JSON or human summaries.
- Automatic FAILURE artifact creation on task/toolbox errors to enable post-mortem analysis.
- Generation of job-run notification files (HTML/Markdown) linking MLflow runs and task logs.
Behavior Changes
- CI commands now trigger automated failure analysis on errors.
- One test path skips finalizers and model-cache preparation.
- CI runtime now installs additional ML/HTTP client packages and accepts an extra optional model vault.

openshift-ci · 2026-06-03T16:31:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosokin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-03T16:31:31Z

📝 Walkthrough

Walkthrough

New module projects/core/agentic/on_failure.py implements a Vault-configured LLM agent that scans artifact directories for FAILURE files, prompts a ChatOpenAI client (with custom httpx transport), optionally requests additional files for a deep investigation, extracts a 5-field structured analysis, and exposes CLI/API/HTML outputs; integrations add FAILURE generation, notification artifact creation, and CI decorator wiring.

Changes

On Failure Analysis Agent

Layer / File(s)	Summary
Module setup & CI wiring `projects/core/agentic/on_failure.py`	Module imports, logging adjustments, and CI decorator wiring for the on-failure agent.
LLM config & client creation `projects/core/agentic/on_failure.py`, `vaults/psap-models-corp-rh.yaml`	Vault-backed `agent-models.yaml` loading and ChatOpenAI client creation using a custom `httpx.Client` (TLS verification disabled, base_url normalized to `/v1`).
FAILURE artifact discovery and ingestion `projects/core/agentic/on_failure.py`	Find `FAILURE` files, read paired `_ansible.log` or fallback `task.log`, trim ansible `
HTML/report helpers & content cleaning `projects/core/agentic/on_failure.py`	Clean/elide file contents, minimal markdown-to-HTML conversion, and `failure_analysis_report.html` generation from recorded queries/responses.
First-stage analysis & validation `projects/core/agentic/on_failure.py`	Build initial prompt from `FAILURE`+logs, call LLM, validate response, and detect `NEED_MORE_FILES:` to start follow-up.
File-request parsing and deep investigation `projects/core/agentic/on_failure.py`	Parse requested filenames (strict and fallback formats), resolve/read/truncate files (including `/tmp/topsail_*` mapping and ansible-log preprocessing), run the deep LLM call, and extract the 5-field structured result.
Orchestration, structured retrieval & CLI `projects/core/agentic/on_failure.py`	Process all `FAILURE` artifacts in a directory, provide structured-only retrieval, expose `run_on_failure_agent`/`get_failure_explanations`, and add Click CLI/json/text/verbose outputs.

Failure generation & notifications wiring

Layer / File(s)	Summary
CI runtime dependency additions `projects/core/ci_entrypoint/run_common.py`	Add `httpx`, `langchain-openai`, and `langchain-core` to `EXTRA_PACKAGES` so they are installed in CI/CLI environments.
Runtime FAILURE generation `projects/core/dsl/runtime.py`	On task-execution exceptions, write `ARTIFACT_DIR/FAILURE` via new helpers that format a timestamped exception, stack trace, TaskExecutionError fields, and environment/artifact context.
Toolbox FAILURE wrapper `projects/core/dsl/toolbox.py`	Call `_generate_failure_file(e)` in toolbox command exception handler; wrapper lazily imports generator and logs generation failures.
Generate NOTIFICATION artifacts `projects/fournos_launcher/orchestration/submit.py`	Scan artifact status YAML and task logs, and write `NOTIFICATION.html` and `NOTIFICATION.md` linking MLflow runs and task logs when found.
Notification consumers prefer artifacts `projects/core/notifications/send.py`	`get_common_message` and `get_slack_thread_message` now read and prefer `NOTIFICATION.html`/`NOTIFICATION.md` when present.
CI decorator & config wiring `projects/llm_d/orchestration/ci.py`, `projects/llm_d/orchestration/config.yaml`	Import and apply `@agent_review_on_failure` to CI subcommands and add `psap-models-corp-rh` to optional vault list.
Test phase behavioral changes `projects/llm_d/orchestration/test_phase.py`	Set `do_finalizers=False` in `do_test()` and short-circuit `_prepare_model_cache()` to skip finalizer and cache-prep paths.

Sequence Diagram(s)

sequenceDiagram
  participant CI_Command
  participant Runtime
  participant ArtifactDir
  participant Vault
  participant ChatOpenAI
  CI_Command->>Runtime: run tasks
  Runtime->>ArtifactDir: write `FAILURE` on exception
  CI_Command->>Vault: load `agent-models.yaml`
  CI_Command->>ChatOpenAI: create client (httpx, TLS disabled)
  CI_Command->>ArtifactDir: scan for `FAILURE` files
  CI_Command->>ChatOpenAI: submit analysis prompt (FAILURE + trimmed logs)
  ChatOpenAI->>CI_Command: analysis or NEED_MORE_FILES:
  CI_Command->>ArtifactDir: read requested files
  CI_Command->>ChatOpenAI: submit deep-investigation prompt
  ChatOpenAI->>CI_Command: structured 5-field result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

openshift-psap/forge#26: Related changes to execute_tasks and artifact_dir population that interact with FAILURE artifact handling.

Poem

🐰 A vault of failures opened wide,
The LLM peers deep inside,
Logs and markers, files to comb,
A rabbit finds the root and home,
Reports hop back to help you guide.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: introducing an agentic troubleshooting module for analyzing CI failures via the new on_failure.py agent.
Docstring Coverage	✅ Passed	Docstring coverage is 82.61% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1005-1077: The code currently trusts LLM-provided paths and
patterns (resolved_files, resolved_path, safe_filename, rglob) which allows
directory traversal and glob injection; fix by resolving each candidate path to
an absolute Path via resolved_path.resolve(strict=False) and verifying it is
inside base_artifact_dir.resolve() (e.g., compare parents or use
is_relative_to), and reject/skip any path that is outside the artifact tree
(return an error or mark as not found). Stop passing model-controlled patterns
into rglob: instead build a safe index once from base_artifact_dir (enumerate
base_artifact_dir.rglob("*") to collect actual artifact file paths and names)
and only allow fallback matches by exact filename lookup against that enumerated
list; do not perform wildcard rglob with user input. Ensure any
rejected/unauthorized requests are logged and not read/sent back to the LLM.
- Around line 802-815: Update the prompt construction in on_failure.py so the
initial user-facing prompt always requests the five-field schema (Root Cause,
Failed Step, Trigger, Fix, Prevent) on the very first turn; modify the logic in
the functions that build or call get_failure_explanations() (and the related
prompt text around lines referenced near 802-815 and 878-884) to prepend or
include an explicit instruction asking for the 5-field schema and the
NEED_MORE_FILES line format whenever FAILURE + _ansible.log are present,
ensuring extract_structured_analysis() receives those structured fields
populated.
- Around line 741-746: process_failure_analysis() currently writes every HTML
report to the constant name html_filename = "failure_analysis_report.html",
causing subsequent reports to overwrite previous ones; change the filename
generation to be unique per failure (e.g., include a unique failure identifier
or a timestamp/uuid) when building html_filename/html_path so each call writes
to a distinct file (reference html_filename, html_path, and the
process_failure_analysis() call site to ensure the returned/stored html_report
path matches the new unique name).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9bedd50-d492-43e7-b8a3-7c8641712fc9

📥 Commits

Reviewing files that changed from the base of the PR and between 191d353 and 0743170.

📒 Files selected for processing (1)

projects/core/agentic/on_failure.py

coderabbitai · 2026-06-03T16:38:12Z

+    # Save HTML file with consistent name (overwrite existing)
+    html_filename = "failure_analysis_report.html"
+    html_path = base_artifact_dir / html_filename
+
+    with open(html_path, "w", encoding="utf-8") as f:
+        f.write(html_content)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique report filename per failure.

process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.

📝 One simple fix

- html_filename = "failure_analysis_report.html" + safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/")) + html_filename = f"failure_analysis_report_{safe_failure_dir}.html"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 741 - 746, process_failure_analysis() currently writes every HTML report to the constant name html_filename = "failure_analysis_report.html", causing subsequent reports to overwrite previous ones; change the filename generation to be unique per failure (e.g., include a unique failure identifier or a timestamp/uuid) when building html_filename/html_path so each call writes to a distinct file (reference html_filename, html_path, and the process_failure_analysis() call site to ensure the returned/stored html_report path matches the new unique name).

coderabbitai · 2026-06-03T16:38:12Z

+## TASK:
+1. Find "----- FAILED ----" patterns. Which tasks failed.
+2. What step broke. Why.
+3. What action caused failure.
+4. Need more files? Use EXACTLY this format:
+
+**NEED_MORE_FILES: filename1.yaml, filename2.log**
+
+DO NOT use bullet points or descriptions. Just the comma-separated list after "NEED_MORE_FILES:".
+
+Files available: {", ".join(available_files[:10])}...
+
+Be factual. Be brief.
+"""


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ask for the 5-field schema on the first turn too.

When the model can answer from FAILURE + _ansible.log, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis(). In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.

Also applies to: 878-884

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 802 - 815, Update the prompt construction in on_failure.py so the initial user-facing prompt always requests the five-field schema (Root Cause, Failed Step, Trigger, Fix, Prevent) on the very first turn; modify the logic in the functions that build or call get_failure_explanations() (and the related prompt text around lines referenced near 802-815 and 878-884) to prepend or include an explicit instruction asking for the 5-field schema and the NEED_MORE_FILES line format whenever FAILURE + _ansible.log are present, ensuring extract_structured_analysis() receives those structured fields populated.

coderabbitai · 2026-06-03T16:38:12Z

+        # Clean up file paths - replace /tmp/topsail_* with base artifact directory
+        resolved_files = []
+        for file_path in raw_requested_files:
+            file_path = file_path.strip()
+
+            # Additional cleaning for special characters that might cause glob issues
+            file_path = file_path.replace("**", "*").replace("***", "*")
+            # Remove any remaining trailing special characters
+            file_path = file_path.rstrip("*'\"` }$()[]")
+
+            # If path starts with /tmp/topsail_, replace it with base artifact directory
+            if file_path.startswith("/tmp/topsail_"):
+                # Extract the path after the topsail directory
+                parts = file_path.split("/")
+                # Find the topsail part and take everything after it
+                for i, part in enumerate(parts):
+                    if part.startswith("topsail_"):
+                        if i + 1 < len(parts):
+                            relative_path = "/".join(parts[i + 1 :])
+                            resolved_path = base_artifact_dir / relative_path
+                        else:
+                            resolved_path = base_artifact_dir
+                        break
+                else:
+                    # Fallback to just the filename
+                    resolved_path = base_artifact_dir / Path(file_path).name
+            elif file_path.startswith("/"):
+                # Other absolute path, just use filename
+                resolved_path = base_artifact_dir / Path(file_path).name
+            else:
+                # Relative path, resolve against base artifact directory
+                resolved_path = base_artifact_dir / file_path
+
+            resolved_files.append(resolved_path)
+
+        logger.info(f"LLM requested files (resolved): {[str(f) for f in resolved_files]}")
+        logger.debug(f"Original requests: {raw_requested_files}")
+
+    except Exception as e:
+        logger.error(f"Failed to parse file request: {e}")
+        logger.error(f"LLM response that caused parsing error: {llm_response[:500]}...")
+        return {
+            "status": "error",
+            "failure_dir": failure_data["failure_dir"],
+            "error": f"Failed to parse file request: {e}",
+        }
+
+    # Read requested files
+    file_contents = {}
+
+    for resolved_path in resolved_files:
+        filename = resolved_path.name
+        found_path = None
+
+        # First try the resolved path
+        if resolved_path.exists():
+            found_path = resolved_path
+        else:
+            # Search in the base artifact directory recursively (silent search)
+            try:
+                # Use a safe search pattern - escape any problematic characters
+                safe_filename = filename.replace("**", "*")  # Replace double asterisks
+                for search_path in base_artifact_dir.rglob(safe_filename):
+                    if search_path.is_file():
+                        found_path = search_path
+                        break
+
+                # If no match with pattern, try exact filename search
+                if not found_path:
+                    for search_path in base_artifact_dir.rglob("*"):
+                        if search_path.is_file() and search_path.name == filename:
+                            found_path = search_path
+                            break


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Constrain model-requested paths to the artifact tree before reading them.

The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1005 - 1077, The code currently trusts LLM-provided paths and patterns (resolved_files, resolved_path, safe_filename, rglob) which allows directory traversal and glob injection; fix by resolving each candidate path to an absolute Path via resolved_path.resolve(strict=False) and verifying it is inside base_artifact_dir.resolve() (e.g., compare parents or use is_relative_to), and reject/skip any path that is outside the artifact tree (return an error or mark as not found). Stop passing model-controlled patterns into rglob: instead build a safe index once from base_artifact_dir (enumerate base_artifact_dir.rglob("*") to collect actual artifact file paths and names) and only allow fallback matches by exact filename lookup against that enumerated list; do not perform wildcard rglob with user input. Ensure any rejected/unauthorized requests are logged and not read/sent back to the LLM.

kpouget · 2026-06-05T12:30:55Z

/test fournos llm_d
/cluster psap-mgmt
/pipeline forge-test-only

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (4)

projects/core/agentic/on_failure.py (4)

982-1007: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ask for the 5-field schema on the first turn too.

When the model can answer from FAILURE + log content without requesting additional files, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis() at line 1093. In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.

💡 Suggested fix

Add the 5-field schema request to the initial prompt:

 ## TASK:
 1. Find failure patterns:
    - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks
    - In execution logs: Look for "----- FAILED ----" patterns or error indicators
    - Identify which specific tasks failed and why
 2. What step broke. Why.
 3. What action caused failure.
-4. Need more files? Use EXACTLY this format:
+4. Provide structured analysis in this format:
+
+1. **Root Cause**: One sentence. What broke.
+2. **Failed Step**: Which step. Why it failed.
+3. **Trigger**: Specific action/config that caused failure.
+4. **Fix**: What to change.
+5. **Prevent**: How to avoid next time.
+
+5. Need more files? Use EXACTLY this format:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial
prompt assigned to prompt_content should explicitly request the 5-field schema
(Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so
downstream extract_structured_analysis() and get_failure_explanations() receive
structured fields; modify the string built in prompt_content inside
on_failure.py to append a short line asking for the five named fields in a fixed
format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ...,
Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact
field names match what extract_structured_analysis() and
get_failure_explanations() expect.

1219-1299: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Constrain model-requested paths to the artifact tree before reading them.

The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.

🔒 Recommended fix

+        # Resolve and validate path is within artifact tree
+        try:
+            resolved_abs = resolved_path.resolve()
+            artifact_abs = base_artifact_dir.resolve()
+            if not resolved_abs.is_relative_to(artifact_abs):
+                logger.warning(f"❌ Rejected path outside artifact tree: {filename}")
+                file_contents[filename] = "❌ Unauthorized: File path outside artifact directory"
+                continue
+        except Exception as e:
+            logger.warning(f"Path resolution failed for {filename}: {e}")
+            file_contents[filename] = f"❌ Path resolution error: {e}"
+            continue
+
         # First try the resolved path
         if resolved_path.exists():
             found_path = resolved_path
         else:
-            # Search in the base artifact directory recursively (silent search)
+            # Search only in pre-enumerated artifact files (no glob patterns)
             try:
-                # Use a safe search pattern - escape any problematic characters
-                safe_filename = filename.replace("**", "*")  # Replace double asterisks
-                for search_path in base_artifact_dir.rglob(safe_filename):
-                    if search_path.is_file():
-                        found_path = search_path
-                        break
-
-                # If no match with pattern, try exact filename search
-                if not found_path:
-                    for search_path in base_artifact_dir.rglob("*"):
-                        if search_path.is_file() and search_path.name == filename:
-                            found_path = search_path
-                            break
+                # Only allow exact filename matches from available_files list
+                for available_file in available_files:
+                    if Path(available_file).name == filename:
+                        candidate = base_artifact_dir / available_file
+                        if candidate.exists() and candidate.is_file():
+                            found_path = candidate
+                            break
             except Exception as search_error:
                 logger.warning(f"Search error for '{filename}': {search_error}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The
LLM-controlled filenames (handled in resolved_files, resolved_path, and later
rglob search) can escape the artifact tree; fix by canonicalizing and
constraining every candidate: when building resolved_path (and before any
rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is
within base_artifact_dir.resolve() (e.g., via startswith or comparison),
rejecting or skipping any path that does not lie under base_artifact_dir; remove
use of user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.

914-921: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique report filename per failure.

process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.

📝 One simple fix

-    html_filename = "failure_analysis_report.html"
+    safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/"))
+    html_filename = f"failure_analysis_report_{safe_failure_dir}.html"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 914 - 921,
process_failure_analysis currently writes every failure report to the constant
"failure_analysis_report.html" (variables html_filename/html_path), causing
later reports to overwrite earlier ones; change it to generate a unique filename
per invocation (for example include the failure identifier, timestamp, or a uuid
in the filename) before constructing html_path in the same block so each call
creates a distinct file and return the new unique path string from
process_failure_analysis.

205-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disabling TLS verification exposes the LLM client to man-in-the-middle attacks.

Setting verify=False and suppressing InsecureRequestWarning disables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configure httpx.Client(verify="/path/to/ca-bundle.crt") instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.

🔒 Recommended fix to support custom CA bundle

+    # Allow custom CA bundle or disable verification only if explicitly opted in
+    ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE")
+    verify_tls = ca_bundle if ca_bundle else True  # Default to secure
+
+    if verify_tls is False:
+        logger.warning("⚠️  TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure")
+
-    with warnings.catch_warnings():
-        warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning)
-
-        # Create HTTP client with proper headers for litellm endpoint
-        http_client = httpx.Client(
-            verify=False,
+        http_client = httpx.Client(
+            verify=verify_tls,
             headers={

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client
currently disables TLS verification by passing verify=False in the httpx.Client
call in on_failure.py; replace this with a secure approach: read a configurable
CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string
to httpx.Client(verify=ca_path) when present, and only fall back to verify=False
when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set;
also restrict suppressing InsecureRequestWarning to the insecure branch and emit
a clear log warning when verify is intentionally disabled. Ensure changes target
the http_client construction site and associated warnings.catch_warnings() usage
so the default behavior enforces certificate validation.

🧹 Nitpick comments (2)

projects/core/agentic/on_failure.py (1)

1559-1564: ⚡ Quick win

Consider parameterizing the hardcoded model key.

The model key "qwen-3-6-35b" is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.

♻️ Suggested change

-        MODEL_KEY = "qwen-3-6-35b"
+        MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the
hardcoded MODEL_KEY assignment by reading a configurable environment variable:
instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment
lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default
string; then use that variable when calling models_config.get(MODEL_KEY) and
raise the same ValueError if missing. Ensure you import os if not present and
keep the existing models_config lookup and error handling intact.

projects/fournos_launcher/orchestration/submit.py (1)

59-161: ⚡ Quick win

Consider adding artifact directory validation.

The function uses pathlib.Path(env.ARTIFACT_DIR) without verifying that ARTIFACT_DIR is set or that the path exists. If ARTIFACT_DIR is None or an invalid path, the subsequent glob operations could fail silently or raise exceptions.

🛡️ Suggested defensive check

 def generate_notification_files():
     """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information."""
     artifact_dir = pathlib.Path(env.ARTIFACT_DIR)
+    
+    if not artifact_dir or not artifact_dir.exists():
+        logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}")
+        return
+    
     mlflow_urls = []

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The
generate_notification_files function uses artifact_dir =
pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.

In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.

In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.

In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.

---

Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.

---

Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877

📥 Commits

Reviewing files that changed from the base of the PR and between 0743170 and 24fa1ab.

📒 Files selected for processing (8)

projects/core/agentic/on_failure.py
projects/core/dsl/runtime.py
projects/core/dsl/toolbox.py
projects/core/notifications/send.py
projects/fournos_launcher/orchestration/submit.py
projects/llm_d/orchestration/ci.py
projects/llm_d/orchestration/test_phase.py
vaults/psap-models-corp-rh.yaml

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 6

♻️ Duplicate comments (4)

projects/core/agentic/on_failure.py (4)

982-1007: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ask for the 5-field schema on the first turn too.

When the model can answer from FAILURE + log content without requesting additional files, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis() at line 1093. In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.

💡 Suggested fix

Add the 5-field schema request to the initial prompt:

 ## TASK:
 1. Find failure patterns:
    - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks
    - In execution logs: Look for "----- FAILED ----" patterns or error indicators
    - Identify which specific tasks failed and why
 2. What step broke. Why.
 3. What action caused failure.
-4. Need more files? Use EXACTLY this format:
+4. Provide structured analysis in this format:
+
+1. **Root Cause**: One sentence. What broke.
+2. **Failed Step**: Which step. Why it failed.
+3. **Trigger**: Specific action/config that caused failure.
+4. **Fix**: What to change.
+5. **Prevent**: How to avoid next time.
+
+5. Need more files? Use EXACTLY this format:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial
prompt assigned to prompt_content should explicitly request the 5-field schema
(Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so
downstream extract_structured_analysis() and get_failure_explanations() receive
structured fields; modify the string built in prompt_content inside
on_failure.py to append a short line asking for the five named fields in a fixed
format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ...,
Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact
field names match what extract_structured_analysis() and
get_failure_explanations() expect.

1219-1299: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Constrain model-requested paths to the artifact tree before reading them.

The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.

🔒 Recommended fix

+        # Resolve and validate path is within artifact tree
+        try:
+            resolved_abs = resolved_path.resolve()
+            artifact_abs = base_artifact_dir.resolve()
+            if not resolved_abs.is_relative_to(artifact_abs):
+                logger.warning(f"❌ Rejected path outside artifact tree: {filename}")
+                file_contents[filename] = "❌ Unauthorized: File path outside artifact directory"
+                continue
+        except Exception as e:
+            logger.warning(f"Path resolution failed for {filename}: {e}")
+            file_contents[filename] = f"❌ Path resolution error: {e}"
+            continue
+
         # First try the resolved path
         if resolved_path.exists():
             found_path = resolved_path
         else:
-            # Search in the base artifact directory recursively (silent search)
+            # Search only in pre-enumerated artifact files (no glob patterns)
             try:
-                # Use a safe search pattern - escape any problematic characters
-                safe_filename = filename.replace("**", "*")  # Replace double asterisks
-                for search_path in base_artifact_dir.rglob(safe_filename):
-                    if search_path.is_file():
-                        found_path = search_path
-                        break
-
-                # If no match with pattern, try exact filename search
-                if not found_path:
-                    for search_path in base_artifact_dir.rglob("*"):
-                        if search_path.is_file() and search_path.name == filename:
-                            found_path = search_path
-                            break
+                # Only allow exact filename matches from available_files list
+                for available_file in available_files:
+                    if Path(available_file).name == filename:
+                        candidate = base_artifact_dir / available_file
+                        if candidate.exists() and candidate.is_file():
+                            found_path = candidate
+                            break
             except Exception as search_error:
                 logger.warning(f"Search error for '{filename}': {search_error}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The
LLM-controlled filenames (handled in resolved_files, resolved_path, and later
rglob search) can escape the artifact tree; fix by canonicalizing and
constraining every candidate: when building resolved_path (and before any
rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is
within base_artifact_dir.resolve() (e.g., via startswith or comparison),
rejecting or skipping any path that does not lie under base_artifact_dir; remove
use of user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.

914-921: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique report filename per failure.

process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.

📝 One simple fix

-    html_filename = "failure_analysis_report.html"
+    safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/"))
+    html_filename = f"failure_analysis_report_{safe_failure_dir}.html"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 914 - 921,
process_failure_analysis currently writes every failure report to the constant
"failure_analysis_report.html" (variables html_filename/html_path), causing
later reports to overwrite earlier ones; change it to generate a unique filename
per invocation (for example include the failure identifier, timestamp, or a uuid
in the filename) before constructing html_path in the same block so each call
creates a distinct file and return the new unique path string from
process_failure_analysis.

205-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disabling TLS verification exposes the LLM client to man-in-the-middle attacks.

Setting verify=False and suppressing InsecureRequestWarning disables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configure httpx.Client(verify="/path/to/ca-bundle.crt") instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.

🔒 Recommended fix to support custom CA bundle

+    # Allow custom CA bundle or disable verification only if explicitly opted in
+    ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE")
+    verify_tls = ca_bundle if ca_bundle else True  # Default to secure
+
+    if verify_tls is False:
+        logger.warning("⚠️  TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure")
+
-    with warnings.catch_warnings():
-        warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning)
-
-        # Create HTTP client with proper headers for litellm endpoint
-        http_client = httpx.Client(
-            verify=False,
+        http_client = httpx.Client(
+            verify=verify_tls,
             headers={

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client
currently disables TLS verification by passing verify=False in the httpx.Client
call in on_failure.py; replace this with a secure approach: read a configurable
CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string
to httpx.Client(verify=ca_path) when present, and only fall back to verify=False
when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set;
also restrict suppressing InsecureRequestWarning to the insecure branch and emit
a clear log warning when verify is intentionally disabled. Ensure changes target
the http_client construction site and associated warnings.catch_warnings() usage
so the default behavior enforces certificate validation.

🧹 Nitpick comments (2)

projects/core/agentic/on_failure.py (1)

1559-1564: ⚡ Quick win

Consider parameterizing the hardcoded model key.

The model key "qwen-3-6-35b" is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.

♻️ Suggested change

-        MODEL_KEY = "qwen-3-6-35b"
+        MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the
hardcoded MODEL_KEY assignment by reading a configurable environment variable:
instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment
lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default
string; then use that variable when calling models_config.get(MODEL_KEY) and
raise the same ValueError if missing. Ensure you import os if not present and
keep the existing models_config lookup and error handling intact.

projects/fournos_launcher/orchestration/submit.py (1)

59-161: ⚡ Quick win

Consider adding artifact directory validation.

The function uses pathlib.Path(env.ARTIFACT_DIR) without verifying that ARTIFACT_DIR is set or that the path exists. If ARTIFACT_DIR is None or an invalid path, the subsequent glob operations could fail silently or raise exceptions.

🛡️ Suggested defensive check

 def generate_notification_files():
     """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information."""
     artifact_dir = pathlib.Path(env.ARTIFACT_DIR)
+    
+    if not artifact_dir or not artifact_dir.exists():
+        logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}")
+        return
+    
     mlflow_urls = []

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The
generate_notification_files function uses artifact_dir =
pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.

In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.

In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.

In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.

---

Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.

---

Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877

📥 Commits

Reviewing files that changed from the base of the PR and between 0743170 and 24fa1ab.

📒 Files selected for processing (8)

projects/core/agentic/on_failure.py
projects/core/dsl/runtime.py
projects/core/dsl/toolbox.py
projects/core/notifications/send.py
projects/fournos_launcher/orchestration/submit.py
projects/llm_d/orchestration/ci.py
projects/llm_d/orchestration/test_phase.py
vaults/psap-models-corp-rh.yaml

🛑 Comments failed to post (5)

projects/core/agentic/on_failure.py (1)

34-49: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the import sorting to resolve the pipeline failure.

The Ruff linter reports that the import block is unsorted or unformatted (I001). This is causing the CI pipeline to fail.

Run ruff check projects/core/agentic/on_failure.py --fix to automatically organize the imports, or manually sort them according to PEP 8 conventions (standard library, third-party, local imports).

🧰 Tools

🪛 GitHub Actions: Ruff / 0_lint (3.12).txt

[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with ruff check --fix).

🪛 GitHub Actions: Ruff / lint (3.12)

[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with ruff check projects/ bin/ --fix).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The import block
in projects/core/agentic/on_failure.py is unsorted causing Ruff I001; fix it by
running "ruff check projects/core/agentic/on_failure.py --fix" or manually
reorder the imports in the module-level import section so that standard library
imports (datetime, functools, json, logging, os, re, sys, warnings,
pathlib.Path) come first, followed by third-party imports (click, httpx,
urllib3, yaml, langchain_core.messages.HumanMessage), and then any local
imports, ensure blank lines between groups and remove any unused imports; after
reordering, run the linter to confirm the I001 is resolved.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if httpx, urllib3, langchain-core, langchain-openai are in pyproject.toml

echo "=== Checking pyproject.toml for required dependencies ==="
rg -n 'httpx|urllib3|langchain' pyproject.toml || echo "❌ Missing dependencies detected"

echo ""
echo "=== Dependencies imported in on_failure.py ==="
rg -n '^import (httpx|urllib3)|^from langchain' projects/core/agentic/on_failure.py

Repository: openshift-psap/forge

Length of output: 288

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Imports in on_failure.py (full scan for relevant packages) ==="
rg -n "^(import (httpx|urllib3)|from langchain(_core|_openai)|from langchain|import langchain|import .*langchain|from langchain_openai|import .*openai|from .*openai)" projects/core/agentic/on_failure.py || true
echo

echo "=== pyproject.toml: look for dependency declarations (more precise) ==="
rg -n "(^|\\W)(httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain)(\\W|$)" pyproject.toml || true
echo

echo "=== pyproject.toml: show dependency sections around project.dependencies and optional-dependencies ==="
# Print the first ~200 lines to locate [project] sections quickly (bounded output)
sed -n '1,220p' pyproject.toml | nl -ba
echo

echo "=== pyproject.toml: if optional-dependencies exists, show it (bounded) ==="
if rg -n "^\[project\.optional-dependencies\]" -S pyproject.toml >/dev/null; then
  start=$(rg -n "^\[project\.optional-dependencies\]" -n pyproject.toml | head -n1 | cut -d: -f1)
  end=$((start+220))
  sed -n "${start},${end}p" pyproject.toml | nl -ba
else
  echo "No [project.optional-dependencies] section found."
fi

Repository: openshift-psap/forge

Length of output: 456

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Locate pyproject.toml files ==="
fd -t f -e toml "pyproject.toml" . || true
echo

# Choose the root pyproject.toml if it exists; otherwise fall back to first found
PYPROJECT="$(fd -t f -e toml "pyproject.toml" . | head -n1 || true)"
if [[ -z "${PYPROJECT}" ]]; then
  echo "No pyproject.toml found in repo."
  exit 0
fi
echo "Using: ${PYPROJECT}"
echo

echo "=== Imports in projects/core/agentic/on_failure.py (full scan for relevant packages) ==="
rg -n "(^import (httpx|urllib3)|^from langchain(_core|_openai)?(\.| )|^from langchain import|^from langchain_openai|^from .*openai|^import .*langchain|^import .*openai|^import .*yaml|^import .*click)" projects/core/agentic/on_failure.py || true
echo

echo "=== pyproject.toml: search for declared dependencies (strings) ==="
rg -n "httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain" "${PYPROJECT}" || echo "No matching dependency strings found."
echo

echo "=== pyproject.toml: show [project] dependencies sections (bounded) ==="
# Show near likely dependency keys
rg -n "^\[project\]" "${PYPROJECT}" || true
rg -n "^(dependencies|optional-dependencies)\s*=" "${PYPROJECT}" || true

# Print first 260 lines to include [project] metadata and dependency lists (bounded output)
sed -n '1,260p' "${PYPROJECT}" | cat -n
echo

if rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" >/dev/null; then
  start=$(rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" | head -n1 | cut -d: -f1)
  end=$((start+260))
  echo "=== [project.optional-dependencies] (bounded) ==="
  sed -n "${start},${end}p" "${PYPROJECT}" | cat -n
else
  echo "No [project.optional-dependencies] section found."
fi

Repository: openshift-psap/forge

Length of output: 7970

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "langchain_openai|langchain(_core|_openai)?|import .*langchain|from .*langchain" projects/core/agentic/on_failure.py || true

Repository: openshift-psap/forge

Length of output: 235

Declare missing dependencies in pyproject.toml.

projects/core/agentic/on_failure.py imports httpx, urllib3, and langchain_core, and later dynamically imports langchain_openai.ChatOpenAI, but ./pyproject.toml has neither of these packages in [project].dependencies nor in [project.optional-dependencies]. This will cause ImportErrors on a fresh install.

🧰 Tools

🪛 GitHub Actions: Ruff / 0_lint (3.12).txt

[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with ruff check --fix).

🪛 GitHub Actions: Ruff / lint (3.12)

[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with ruff check projects/ bin/ --fix).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The file
on_failure.py imports httpx, urllib3, and langchain_core and dynamically loads
langchain_openai.ChatOpenAI, but these packages are not declared in
pyproject.toml; update the project's pyproject.toml to add httpx, urllib3, and
langchain_core to [project].dependencies (or put them under
[project.optional-dependencies] with appropriate extras if optional) and ensure
langchain_openai (or the package that provides ChatOpenAI) is declared as well
so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol succeed
on a fresh install.

projects/core/notifications/send.py (1)

299-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add exception handling for file read operations.

The NOTIFICATION.md file read (lines 304-305) is not wrapped in a try-except block, unlike the NOTIFICATION.html read in get_common_message (lines 215-223). If the file read fails (e.g., permissions issue, encoding error), it will raise an exception and potentially break the Slack notification flow.

🛡️ Suggested error handling

     # Check for fournos_launcher generated Slack notification content
     artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", ""))
     fournos_notification_md = artifact_dir / "NOTIFICATION.md"

     if fournos_notification_md.exists():
-        with open(fournos_notification_md, encoding="utf-8") as f:
-            fournos_content = f.read().strip()
-        if fournos_content:
-            # Return custom notification content with status icon
-            return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+        try:
+            with open(fournos_notification_md, encoding="utf-8") as f:
+                fournos_content = f.read().strip()
+            if fournos_content:
+                # Return custom notification content with status icon
+                return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+        except Exception as e:
+            logger.warning("Failed to read NOTIFICATION.md: %s", e)

     # Fallback to standard message if no custom notification

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Check for fournos_launcher generated Slack notification content
    artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", ""))
    fournos_notification_md = artifact_dir / "NOTIFICATION.md"

    if fournos_notification_md.exists():
        try:
            with open(fournos_notification_md, encoding="utf-8") as f:
                fournos_content = f.read().strip()
            if fournos_content:
                # Return custom notification content with status icon
                return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
        except Exception as e:
            logger.warning("Failed to read NOTIFICATION.md: %s", e)

    # Fallback to standard message if no custom notification
    return get_common_message(
        finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold
    )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/send.py` around lines 299 - 313, The
NOTIFICATION.md file read (artifact_dir / "NOTIFICATION.md" using
fournos_notification_md) must be wrapped in a try/except to avoid raising on
I/O/encoding/permission errors; update the block in send.py around the with
open(fournos_notification_md, ...) and fournos_content = f.read().strip() to
catch Exception as e, handle/log the error (consistent with the existing logging
approach used by get_common_message's NOTIFICATION.html handling), and fall back
to returning get_common_message(finish_reason, f"{status_icon} {status}",
get_link, get_italics, get_bold) when an error occurs or the file is unreadable.

projects/llm_d/orchestration/test_phase.py (2)

93-96: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Finalizers are unconditionally disabled.

do_finalizers = False on line 93 prevents all finalizer callbacks from executing, including:

Capturing inference-service state

Writing endpoint URL to artifacts

Capturing namespace events

Cleaning up test resources (pods, jobs, PVCs)

The SignalInterrupt check on lines 94-96 is redundant since do_finalizers is already False. If this is intentional debugging code, consider removing it before merge or adding a comment explaining why finalizers are disabled. Otherwise, resource leaks and missing artifacts will occur.
Suggested fix to re-enable finalizers
     finally:
-        do_finalizers = False
+        do_finalizers = True
         if primary_exc and isinstance(primary_exc[1], SignalInterrupt):
             logging.warning("Caught a SignalInterrupt, skipping the finalizers")
             do_finalizers = False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/test_phase.py` around lines 93 - 96, The code
unconditionally disables finalizers by setting do_finalizers = False before
checking primary_exc; remove that unconditional assignment and instead
initialize do_finalizers = True (or omit the assignment) and only set
do_finalizers = False when primary_exc indicates a SignalInterrupt (i.e., keep
the isinstance(primary_exc[1], SignalInterrupt) branch and set do_finalizers =
False there); ensure the variable controlling execution of finalizer callbacks
(do_finalizers) is True in the normal flow so finalizer callbacks (capture
state, write endpoint, capture events, cleanup) run unless a SignalInterrupt is
detected.
170-181: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unreachable code: return before docstring makes function body dead code.

The return statement on line 171 causes all subsequent code (lines 172-180) to be unreachable, including the docstring and the actual prepare_model_cache() call. This bypasses model cache preparation entirely, which per the codebase handles cache validation, PVC setup, and HF token retrieval.

If this is temporary debugging code, please revert before merge. The docstring placement after return also indicates this may have been an accidental edit.
Suggested fix to restore model cache preparation
 def _prepare_model_cache() -> None:
-    return
     """Ensure model cache PVC is ready for deployment."""
     from projects.llm_d.orchestration import runtime_config

     model_key = runtime_config.get_model_key()
     logger.info("Preparing model cache for model: %s", model_key)

     # Use the same prepare_model_cache function as the prepare phase
     # This includes vault token handling and PVC existence checks
     prepare_model_cache()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _prepare_model_cache() -> None:
    """Ensure model cache PVC is ready for deployment."""
    from projects.llm_d.orchestration import runtime_config

    model_key = runtime_config.get_model_key()
    logger.info("Preparing model cache for model: %s", model_key)

    # Use the same prepare_model_cache function as the prepare phase
    # This includes vault token handling and PVC existence checks
    prepare_model_cache()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/test_phase.py` around lines 170 - 181, The
_prepare_model_cache function currently returns immediately making its docstring
and body (including the call to prepare_model_cache()) unreachable; remove the
stray return, place the docstring right after the def if desired, and restore
the body so it imports runtime_config, retrieves model_key via
runtime_config.get_model_key(), logs via logger.info, and calls
prepare_model_cache() to perform PVC/token handling; verify references to
_prepare_model_cache(), runtime_config, logger, and prepare_model_cache are
unchanged.

vaults/psap-models-corp-rh.yaml (1)

8-8: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in description.

"use" should be "used" for grammatical correctness.
📝 Proposed fix
-      Access to the RH models use by FORGE agent
+      Access to the RH models used by FORGE agent
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
      Access to the RH models used by FORGE agent
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vaults/psap-models-corp-rh.yaml` at line 8, Fix the typo in the YAML
description string "Access to the RH models use by FORGE agent" by changing
"use" to "used" so the line reads "Access to the RH models used by FORGE agent";
locate and update that description entry in the psap-models-corp-rh YAML block.

psap-forge-bot · 2026-06-05T12:33:34Z

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 59 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster psap-mgmt
/pipeline forge-test-only

• Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 12:33:29
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-123136 failed: Resolution failed: Job has reached the specified backoff limit


[...]

• Execution logs

kpouget · 2026-06-05T12:59:45Z

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

psap-forge-bot · 2026-06-05T13:11:30Z

🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 10 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-130024
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

• Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:11:28
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

• Execution logs

psap-forge-bot · 2026-06-05T13:12:14Z

🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 51 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

• Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:12:06
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-130024 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

• Execution logs

…ns with the mlflow_urls and log_files links

…h NOTIFICATIONS.{html,md}

kpouget · 2026-06-05T13:25:07Z

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/ci_entrypoint/run_common.py`:
- Line 32: EXTRA_PACKAGES is currently an unpinned list (EXTRA_PACKAGES =
["httpx", "langchain-openai", "langchain-core"]) which makes CI installs
non-deterministic; update the EXTRA_PACKAGES variable to include explicit
version pins (or modify install_extra_packages() to install using a shared
constraints/lockfile) so CI always installs fixed versions — for example,
replace unpinned names in EXTRA_PACKAGES with versioned strings or change
install_extra_packages() to call pip with --constraint pointing to your
repository's constraints/lockfile (ensure you update any related docs/tests to
reflect the pinned versions).

In `@projects/core/notifications/send.py`:
- Around line 303-309: Wrap the optional NOTIFICATION.md read for
fournos_notification_md in a try/except so I/O/encoding/race errors don't abort
Slack message generation: when evaluating fournos_notification_md.exists()
surround the open/read/strip sequence with a broad exception handler that logs
or ignores the error and falls back to returning the default Slack message
(i.e., only return the custom "{status_icon}
{get_bold(status)}\\n\\n{fournos_content}" when read succeeds and
fournos_content is non-empty); reference the fournos_notification_md variable
and the return site that builds the status_icon/get_bold(status) message to
locate the change.

In `@vaults/psap-models-corp-rh.yaml`:
- Around line 6-17: The vault entry contains an unused format field and a typo:
VaultManager._load_vault_definition only reads the content key's file and
description, so remove or relocate the example block under the format key from
the content entry (move it into a YAML comment or external docs) and correct the
description text from "use by" to "used by"; ensure the remaining content only
includes the keys consumed at load time (file/description) so maintainers aren't
misled by an ignored format field.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91f0387f-50d1-408e-8966-0282847d630b

📥 Commits

Reviewing files that changed from the base of the PR and between 24fa1ab and c2e59de.

📒 Files selected for processing (10)

projects/core/agentic/on_failure.py
projects/core/ci_entrypoint/run_common.py
projects/core/dsl/runtime.py
projects/core/dsl/toolbox.py
projects/core/notifications/send.py
projects/fournos_launcher/orchestration/submit.py
projects/llm_d/orchestration/ci.py
projects/llm_d/orchestration/config.yaml
projects/llm_d/orchestration/test_phase.py
vaults/psap-models-corp-rh.yaml

🚧 Files skipped from review as they are similar to previous changes (5)

projects/fournos_launcher/orchestration/submit.py
projects/llm_d/orchestration/test_phase.py
projects/core/dsl/runtime.py
projects/llm_d/orchestration/ci.py
projects/core/agentic/on_failure.py

coderabbitai · 2026-06-05T13:33:40Z

 FORGE_HOME = Path(__file__).resolve().parent.parent.parent.parent

-EXTRA_PACKAGES = []
+EXTRA_PACKAGES = ["httpx", "langchain-openai", "langchain-core"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check for existing package version specifications # Search for requirements files or version pinning echo "=== Checking for requirements files ===" fd -t f 'requirements.*\.txt|pyproject\.toml|setup\.py|Pipfile' --exec echo {} echo "" echo "=== Checking for version references to these packages ===" rg -n 'httpx|langchain-openai|langchain-core' --type toml --type txt -g 'requirements*' -g 'pyproject.toml' -g 'setup.py' echo "" echo "=== Checking current PyPI versions ===" for pkg in httpx langchain-openai langchain-core; do echo "Package: $pkg" curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version // "Not found"' done

Repository: openshift-psap/forge

Length of output: 498

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Inspect run_common.py around EXTRA_PACKAGES ===" python - <<'PY' import itertools,sys,os, re path="projects/core/ci_entrypoint/run_common.py" with open(path,'r',encoding='utf-8') as f: lines=f.readlines() for i in range(1, len(lines)+1): if "EXTRA_PACKAGES" in lines[i-1]: start=max(1,i-15); end=min(len(lines), i+25) for j in range(start,end+1): print(f"{j:4d}: {lines[j-1].rstrip()}") break else: print("EXTRA_PACKAGES not found") PY echo echo "=== Find install_extra_packages definition and how it's invoked ===" rg -n "def install_extra_packages|install_extra_packages\\(|prepare\\(" projects/core/ci_entrypoint/run_common.py echo echo "=== Search repo for version constraints for these packages ===" rg -n "(httpx|langchain-openai|langchain-core)\\s*([<>=!~]{1,2}|==)" -S . echo echo "=== Search for constraints/requirements files that might be referenced by ci entrypoint ===" rg -n "(--constraint|constraints\\.txt|requirements\\.txt|pip install)" projects/core/ci_entrypoint/run_common.py

Repository: openshift-psap/forge

Length of output: 1863

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Inspect install_extra_packages() implementation ===" python - <<'PY' import re path="projects/core/ci_entrypoint/run_common.py" with open(path,'r',encoding='utf-8') as f: lines=f.readlines() def find_line(substr): for i,l in enumerate(lines, start=1): if substr in l: return i return None start=find_line("def install_extra_packages") if not start: print("install_extra_packages not found") raise SystemExit(0) # print from start to a bit after first callsite or end of function heuristically for j in range(start, min(len(lines)+1, start+120)): print(f"{j:4d}: {lines[j-1].rstrip()}") PY echo echo "=== Search for any version pins/constraints for httpx/langchain-openai/langchain-core (non-failing) ===" rg -n "(httpx|langchain-openai|langchain-core)\s*([<>=!~]{1,2}|==)" -S . || true echo echo "=== Search for constraints files / pip --constraint usage ===" rg -n "(--constraint|constraints\\.txt|constraint(s)?\\.txt|PIP_CONSTRAINT|pip\\.conf)" -S . || true echo echo "=== Search for these packages specifically in requirements/lock files (non-failing) ===" rg -n "(httpx|langchain-openai|langchain-core)" -S --glob '*/requirements*' --glob 'pyproject.toml' --glob '*.lock' . || true

Repository: openshift-psap/forge

Length of output: 4779

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Files mentioning any of the three packages (requirements/pyproject/lock) ===" rg -n "httpx|langchain-openai|langchain-core" -S --glob '*/requirements*.txt' --glob '*/requirements*.in' --glob 'pyproject.toml' --glob '*.lock' . echo echo "=== Search for version operators in lines mentioning those packages (covers httpx[...]==... too) ===" rg -n "(httpx|langchain-openai|langchain-core)[^\n#]*([<>=!~]{1,2}|==)[^\n#]*" -S --glob '*/requirements*.txt' --glob 'pyproject.toml' --glob '*.lock' . echo echo "=== Show relevant lines (up to 30) from matched files for each package ===" for f in $(rg -l "httpx|langchain-openai|langchain-core" -S --glob '*/requirements*.txt' --glob 'pyproject.toml' --glob '*.lock' .); do echo "--- $f ---" rg -n "httpx|langchain-openai|langchain-core" -S "$f" | head -n 30 || true done

Repository: openshift-psap/forge

Length of output: 145

Pin versions for CI “EXTRA_PACKAGES” to keep builds deterministic.

projects/core/ci_entrypoint/run_common.py line 32 defines EXTRA_PACKAGES = ["httpx", "langchain-openai", "langchain-core"] with no version constraints, and install_extra_packages() installs them using uv pip install --no-cache (fallback: pip install --user --no-cache-dir). The repo has no corresponding pins/constraints for these packages in requirements/pyproject/lock files, so CI will resolve to the latest releases at install time, making builds non-deterministic and increasing the chance of unexpected failures. Pin versions or use a shared constraints/lockfile.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/ci_entrypoint/run_common.py` at line 32, EXTRA_PACKAGES is currently an unpinned list (EXTRA_PACKAGES = ["httpx", "langchain-openai", "langchain-core"]) which makes CI installs non-deterministic; update the EXTRA_PACKAGES variable to include explicit version pins (or modify install_extra_packages() to install using a shared constraints/lockfile) so CI always installs fixed versions — for example, replace unpinned names in EXTRA_PACKAGES with versioned strings or change install_extra_packages() to call pip with --constraint pointing to your repository's constraints/lockfile (ensure you update any related docs/tests to reflect the pinned versions).

coderabbitai · 2026-06-05T13:33:40Z

+    if fournos_notification_md.exists():
+        with open(fournos_notification_md, encoding="utf-8") as f:
+            fournos_content = f.read().strip()
+        if fournos_content:
+            # Return custom notification content with status icon
+            return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle optional NOTIFICATION.md read errors before overriding the Slack message

On Line 304, reading NOTIFICATION.md is unguarded. If this optional file is unreadable (I/O/encoding/race), the exception can break Slack notification generation instead of falling back to the standard message path.

Suggested fix

if fournos_notification_md.exists(): - with open(fournos_notification_md, encoding="utf-8") as f: - fournos_content = f.read().strip() - if fournos_content: - # Return custom notification content with status icon - return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + try: + with open(fournos_notification_md, encoding="utf-8") as f: + fournos_content = f.read().strip() + if fournos_content: + # Return custom notification content with status icon + return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + except Exception as e: + logger.warning("Failed to read NOTIFICATION.md: %s", e)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/notifications/send.py` around lines 303 - 309, Wrap the optional NOTIFICATION.md read for fournos_notification_md in a try/except so I/O/encoding/race errors don't abort Slack message generation: when evaluating fournos_notification_md.exists() surround the open/read/strip sequence with a broad exception handler that logs or ignores the error and falls back to returning the default Slack message (i.e., only return the custom "{status_icon} {get_bold(status)}\\n\\n{fournos_content}" when read succeeds and fournos_content is non-empty); reference the fournos_notification_md variable and the return site that builds the status_icon/get_bold(status) message to locate the change.

coderabbitai · 2026-06-05T13:33:40Z

+  agent-models.yaml:
+    description: |
+      Access to the RH models use by FORGE agent
+    format: |
+      model_key_1:
+          model_api: "..."
+          model_id: "..."
+          user_key: "..."
+      model_key_2:
+          model_api: "..."
+          model_id: "..."
+          user_key: "..."


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Check if the 'format' field in vault content definitions is used anywhere # Search for references to 'format' in vault-related code rg -n --type=py -C3 "\.get\(['\"]format['\"]" --glob '*vault*' # Also check for any processing of vault content beyond file/description rg -n --type=py -C3 "content_def\.get\(" projects/core/library/vault.py

Repository: openshift-psap/forge

Length of output: 557

🏁 Script executed:

#!/bin/bash set -euo pipefail # 1) Search for any reads of a `format` key from vault content dicts rg -n --hidden --glob '!.git/*' --type=py \ "(content_def|defn|content_defn|vault|data)\.get\(['\"]format['\"]\]" \ projects/core/library rg -n --hidden --glob '!.git/*' --type=py \ "\[['\"]format['\"]\]" \ projects/core/library # 2) Search for `format` key handling in YAML/vault parsing code rg -n --hidden --glob '!.git/*' --type=py \ "format\s*[:=]" \ projects/core/library/vault.py projects/core/library/*vault* || true # 3) Confirm the loader logic around content dict extraction python3 - <<'PY' from pathlib import Path p = Path("projects/core/library/vault.py") if p.exists(): txt = p.read_text() i = txt.find("for content_name, content_def in") print("--- context around content extraction ---") print("\n".join(txt[i-400:i+400].splitlines()[:40])) else: print("vault.py not found at projects/core/library/vault.py") PY

Repository: openshift-psap/forge

Length of output: 46

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate vault implementation entrypoints (helps constrain further searches) rg -n --hidden --glob '!.git/*' "VaultManager|_load_vault_definition|load_vault_definition|vaults" projects || true rg -n --hidden --glob '!.git/*' "class Vault|def _load_vault_definition" . || true # Search repo-wide for any code reading a 'format' key from content dicts rg -n --hidden --glob '!.git/*' --type=py \ "get$['\"]format['\"]$" . || true # Search for any usage of the literal key 'format' near vault parsing/loading rg -n --hidden --glob '!.git/*' --type=py -C2 \ "\bformat\b" projects/core/library/vault.py || true # Search for the exact substring "content_def.get(" followed by keys other than file/description (to see how strict the schema is) rg -n --hidden --glob '!.git/*' --type=py -C2 \ "content_def\.get\(" projects/core/library/vault.py || true

Repository: openshift-psap/forge

Length of output: 16677

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the relevant section of vault loader and content parsing nl -ba projects/core/library/vault.py | sed -n '70,170p'

Repository: openshift-psap/forge

Length of output: 107

🏁 Script executed:

#!/bin/bash set -euo pipefail # Print loader logic around _load_vault_definition and content parsing cat -n projects/core/library/vault.py | sed -n '60,190p'

Repository: openshift-psap/forge

Length of output: 5861

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,30p' vaults/psap-models-corp-rh.yaml | cat -n

Repository: openshift-psap/forge

Length of output: 578

Remove/relocate the unused format field in vaults/psap-models-corp-rh.yaml (and fix the typo)

VaultManager._load_vault_definition only extracts file (defaults to the content key) and description from each content entry; the format key is ignored at load time, so it won’t have runtime effect and may mislead maintainers—move this example to YAML comments or separate docs. Also fix the typo: “use by” → “used by”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vaults/psap-models-corp-rh.yaml` around lines 6 - 17, The vault entry contains an unused format field and a typo: VaultManager._load_vault_definition only reads the content key's file and description, so remove or relocate the example block under the format key from the content entry (move it into a YAML comment or external docs) and correct the description text from "use by" to "used by"; ensure the remaining content only includes the keys consumed at load time (file/description) so maintainers aren't misled by an ignored format field.

psap-forge-bot · 2026-06-05T13:37:02Z

🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 11 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-132554
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

• Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:37:00
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

• Execution logs

psap-forge-bot · 2026-06-05T13:37:42Z

🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 49 seconds 🔴

• Link to the test results.

• No reports generated...

• Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

• Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:37:34
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-132554 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

• Execution logs

kpouget · 2026-06-05T16:10:07Z

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

psap-forge-bot · 2026-06-05T16:22:55Z

🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 05 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-161056
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

• Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 16:21:59
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

• Execution logs

psap-forge-bot · 2026-06-05T16:23:38Z

🔴 Test of 'fournos_launcher submit' failed after 00 hours 12 minutes 43 seconds 🔴

• Link to the test results.

• No reports generated...

• Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

• Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 16:23:31
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-161056 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

• Execution logs

openshift-ci · 2026-06-05T16:23:43Z

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/fournos	`c2e59de`	link	true	`/test fournos`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

kpouget force-pushed the agentic branch 3 times, most recently from b117cb0 to 476e171 Compare June 5, 2026 12:27

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

kpouget force-pushed the agentic branch from e77aae2 to 0db96b1 Compare June 5, 2026 13:21

kpouget added 5 commits June 5, 2026 15:24

[fournos_launcher] orchestration: submit: generate custom notificatio…

501304e

…ns with the mlflow_urls and log_files links

[core] notifications: send: allow generating custom notifications wit…

7792f9d

…h NOTIFICATIONS.{html,md}

[core] agentic: on_failure: start working on agentic troubleshooting

ea66292

[core] ci_entrypoint: run_common: install extra packages

3562237

[llm_d] orchestration: config: request the model vault

c2e59de

kpouget force-pushed the agentic branch from 0db96b1 to c2e59de Compare June 5, 2026 13:24

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Conversation

kpouget commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Jun 3, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

kpouget commented Jun 5, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

kpouget commented Jun 5, 2026

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

kpouget commented Jun 5, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

kpouget commented Jun 5, 2026

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

psap-forge-bot Bot commented Jun 5, 2026

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kpouget commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading