[core] agentic: on_failure: start working on agentic troubleshooting#70
[core] agentic: on_failure: start working on agentic troubleshooting#70kpouget wants to merge 5 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughNew module ChangesOn Failure Analysis Agent
Failure generation & notifications wiring
Sequence Diagram(s)sequenceDiagram
participant CI_Command
participant Runtime
participant ArtifactDir
participant Vault
participant ChatOpenAI
CI_Command->>Runtime: run tasks
Runtime->>ArtifactDir: write `FAILURE` on exception
CI_Command->>Vault: load `agent-models.yaml`
CI_Command->>ChatOpenAI: create client (httpx, TLS disabled)
CI_Command->>ArtifactDir: scan for `FAILURE` files
CI_Command->>ChatOpenAI: submit analysis prompt (FAILURE + trimmed logs)
ChatOpenAI->>CI_Command: analysis or NEED_MORE_FILES:
CI_Command->>ArtifactDir: read requested files
CI_Command->>ChatOpenAI: submit deep-investigation prompt
ChatOpenAI->>CI_Command: structured 5-field result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1005-1077: The code currently trusts LLM-provided paths and
patterns (resolved_files, resolved_path, safe_filename, rglob) which allows
directory traversal and glob injection; fix by resolving each candidate path to
an absolute Path via resolved_path.resolve(strict=False) and verifying it is
inside base_artifact_dir.resolve() (e.g., compare parents or use
is_relative_to), and reject/skip any path that is outside the artifact tree
(return an error or mark as not found). Stop passing model-controlled patterns
into rglob: instead build a safe index once from base_artifact_dir (enumerate
base_artifact_dir.rglob("*") to collect actual artifact file paths and names)
and only allow fallback matches by exact filename lookup against that enumerated
list; do not perform wildcard rglob with user input. Ensure any
rejected/unauthorized requests are logged and not read/sent back to the LLM.
- Around line 802-815: Update the prompt construction in on_failure.py so the
initial user-facing prompt always requests the five-field schema (Root Cause,
Failed Step, Trigger, Fix, Prevent) on the very first turn; modify the logic in
the functions that build or call get_failure_explanations() (and the related
prompt text around lines referenced near 802-815 and 878-884) to prepend or
include an explicit instruction asking for the 5-field schema and the
NEED_MORE_FILES line format whenever FAILURE + _ansible.log are present,
ensuring extract_structured_analysis() receives those structured fields
populated.
- Around line 741-746: process_failure_analysis() currently writes every HTML
report to the constant name html_filename = "failure_analysis_report.html",
causing subsequent reports to overwrite previous ones; change the filename
generation to be unique per failure (e.g., include a unique failure identifier
or a timestamp/uuid) when building html_filename/html_path so each call writes
to a distinct file (reference html_filename, html_path, and the
process_failure_analysis() call site to ensure the returned/stored html_report
path matches the new unique name).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a9bedd50-d492-43e7-b8a3-7c8641712fc9
📒 Files selected for processing (1)
projects/core/agentic/on_failure.py
| # Save HTML file with consistent name (overwrite existing) | ||
| html_filename = "failure_analysis_report.html" | ||
| html_path = base_artifact_dir / html_filename | ||
|
|
||
| with open(html_path, "w", encoding="utf-8") as f: | ||
| f.write(html_content) |
There was a problem hiding this comment.
Use a unique report filename per failure.
process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.
📝 One simple fix
- html_filename = "failure_analysis_report.html"
+ safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/"))
+ html_filename = f"failure_analysis_report_{safe_failure_dir}.html"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/core/agentic/on_failure.py` around lines 741 - 746,
process_failure_analysis() currently writes every HTML report to the constant
name html_filename = "failure_analysis_report.html", causing subsequent reports
to overwrite previous ones; change the filename generation to be unique per
failure (e.g., include a unique failure identifier or a timestamp/uuid) when
building html_filename/html_path so each call writes to a distinct file
(reference html_filename, html_path, and the process_failure_analysis() call
site to ensure the returned/stored html_report path matches the new unique
name).
| ## TASK: | ||
| 1. Find "----- FAILED ----" patterns. Which tasks failed. | ||
| 2. What step broke. Why. | ||
| 3. What action caused failure. | ||
| 4. Need more files? Use EXACTLY this format: | ||
|
|
||
| **NEED_MORE_FILES: filename1.yaml, filename2.log** | ||
|
|
||
| DO NOT use bullet points or descriptions. Just the comma-separated list after "NEED_MORE_FILES:". | ||
|
|
||
| Files available: {", ".join(available_files[:10])}... | ||
|
|
||
| Be factual. Be brief. | ||
| """ |
There was a problem hiding this comment.
Ask for the 5-field schema on the first turn too.
When the model can answer from FAILURE + _ansible.log, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis(). In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.
Also applies to: 878-884
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/core/agentic/on_failure.py` around lines 802 - 815, Update the
prompt construction in on_failure.py so the initial user-facing prompt always
requests the five-field schema (Root Cause, Failed Step, Trigger, Fix, Prevent)
on the very first turn; modify the logic in the functions that build or call
get_failure_explanations() (and the related prompt text around lines referenced
near 802-815 and 878-884) to prepend or include an explicit instruction asking
for the 5-field schema and the NEED_MORE_FILES line format whenever FAILURE +
_ansible.log are present, ensuring extract_structured_analysis() receives those
structured fields populated.
| # Clean up file paths - replace /tmp/topsail_* with base artifact directory | ||
| resolved_files = [] | ||
| for file_path in raw_requested_files: | ||
| file_path = file_path.strip() | ||
|
|
||
| # Additional cleaning for special characters that might cause glob issues | ||
| file_path = file_path.replace("**", "*").replace("***", "*") | ||
| # Remove any remaining trailing special characters | ||
| file_path = file_path.rstrip("*'\"` }$()[]") | ||
|
|
||
| # If path starts with /tmp/topsail_, replace it with base artifact directory | ||
| if file_path.startswith("/tmp/topsail_"): | ||
| # Extract the path after the topsail directory | ||
| parts = file_path.split("/") | ||
| # Find the topsail part and take everything after it | ||
| for i, part in enumerate(parts): | ||
| if part.startswith("topsail_"): | ||
| if i + 1 < len(parts): | ||
| relative_path = "/".join(parts[i + 1 :]) | ||
| resolved_path = base_artifact_dir / relative_path | ||
| else: | ||
| resolved_path = base_artifact_dir | ||
| break | ||
| else: | ||
| # Fallback to just the filename | ||
| resolved_path = base_artifact_dir / Path(file_path).name | ||
| elif file_path.startswith("/"): | ||
| # Other absolute path, just use filename | ||
| resolved_path = base_artifact_dir / Path(file_path).name | ||
| else: | ||
| # Relative path, resolve against base artifact directory | ||
| resolved_path = base_artifact_dir / file_path | ||
|
|
||
| resolved_files.append(resolved_path) | ||
|
|
||
| logger.info(f"LLM requested files (resolved): {[str(f) for f in resolved_files]}") | ||
| logger.debug(f"Original requests: {raw_requested_files}") | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Failed to parse file request: {e}") | ||
| logger.error(f"LLM response that caused parsing error: {llm_response[:500]}...") | ||
| return { | ||
| "status": "error", | ||
| "failure_dir": failure_data["failure_dir"], | ||
| "error": f"Failed to parse file request: {e}", | ||
| } | ||
|
|
||
| # Read requested files | ||
| file_contents = {} | ||
|
|
||
| for resolved_path in resolved_files: | ||
| filename = resolved_path.name | ||
| found_path = None | ||
|
|
||
| # First try the resolved path | ||
| if resolved_path.exists(): | ||
| found_path = resolved_path | ||
| else: | ||
| # Search in the base artifact directory recursively (silent search) | ||
| try: | ||
| # Use a safe search pattern - escape any problematic characters | ||
| safe_filename = filename.replace("**", "*") # Replace double asterisks | ||
| for search_path in base_artifact_dir.rglob(safe_filename): | ||
| if search_path.is_file(): | ||
| found_path = search_path | ||
| break | ||
|
|
||
| # If no match with pattern, try exact filename search | ||
| if not found_path: | ||
| for search_path in base_artifact_dir.rglob("*"): | ||
| if search_path.is_file() and search_path.name == filename: | ||
| found_path = search_path | ||
| break |
There was a problem hiding this comment.
Constrain model-requested paths to the artifact tree before reading them.
The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/core/agentic/on_failure.py` around lines 1005 - 1077, The code
currently trusts LLM-provided paths and patterns (resolved_files, resolved_path,
safe_filename, rglob) which allows directory traversal and glob injection; fix
by resolving each candidate path to an absolute Path via
resolved_path.resolve(strict=False) and verifying it is inside
base_artifact_dir.resolve() (e.g., compare parents or use is_relative_to), and
reject/skip any path that is outside the artifact tree (return an error or mark
as not found). Stop passing model-controlled patterns into rglob: instead build
a safe index once from base_artifact_dir (enumerate base_artifact_dir.rglob("*")
to collect actual artifact file paths and names) and only allow fallback matches
by exact filename lookup against that enumerated list; do not perform wildcard
rglob with user input. Ensure any rejected/unauthorized requests are logged and
not read/sent back to the LLM.
b117cb0 to
476e171
Compare
|
/test fournos llm_d |
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)
982-1007:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAsk for the 5-field schema on the first turn too.
When the model can answer from
FAILURE+ log content without requesting additional files, this prompt never asks forRoot Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed intoextract_structured_analysis()at line 1093. In that path,get_failure_explanations()will usually return empty structured fields even though the public API promises structured output.💡 Suggested fix
Add the 5-field schema request to the initial prompt:
## TASK: 1. Find failure patterns: - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks - In execution logs: Look for "----- FAILED ----" patterns or error indicators - Identify which specific tasks failed and why 2. What step broke. Why. 3. What action caused failure. -4. Need more files? Use EXACTLY this format: +4. Provide structured analysis in this format: + +1. **Root Cause**: One sentence. What broke. +2. **Failed Step**: Which step. Why it failed. +3. **Trigger**: Specific action/config that caused failure. +4. **Fix**: What to change. +5. **Prevent**: How to avoid next time. + +5. Need more files? Use EXACTLY this format:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial prompt assigned to prompt_content should explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so downstream extract_structured_analysis() and get_failure_explanations() receive structured fields; modify the string built in prompt_content inside on_failure.py to append a short line asking for the five named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact field names match what extract_structured_analysis() and get_failure_explanations() expect.
1219-1299:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winConstrain model-requested paths to the artifact tree before reading them.
The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into
rglob(). A prompt-injected request such as../../../../etc/passwdor*.pemcan therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outsidebase_artifact_dir, and fall back only to exact matches from the enumerated artifact files.🔒 Recommended fix
+ # Resolve and validate path is within artifact tree + try: + resolved_abs = resolved_path.resolve() + artifact_abs = base_artifact_dir.resolve() + if not resolved_abs.is_relative_to(artifact_abs): + logger.warning(f"❌ Rejected path outside artifact tree: {filename}") + file_contents[filename] = "❌ Unauthorized: File path outside artifact directory" + continue + except Exception as e: + logger.warning(f"Path resolution failed for {filename}: {e}") + file_contents[filename] = f"❌ Path resolution error: {e}" + continue + # First try the resolved path if resolved_path.exists(): found_path = resolved_path else: - # Search in the base artifact directory recursively (silent search) + # Search only in pre-enumerated artifact files (no glob patterns) try: - # Use a safe search pattern - escape any problematic characters - safe_filename = filename.replace("**", "*") # Replace double asterisks - for search_path in base_artifact_dir.rglob(safe_filename): - if search_path.is_file(): - found_path = search_path - break - - # If no match with pattern, try exact filename search - if not found_path: - for search_path in base_artifact_dir.rglob("*"): - if search_path.is_file() and search_path.name == filename: - found_path = search_path - break + # Only allow exact filename matches from available_files list + for available_file in available_files: + if Path(available_file).name == filename: + candidate = base_artifact_dir / available_file + if candidate.exists() and candidate.is_file(): + found_path = candidate + break except Exception as search_error: logger.warning(f"Search error for '{filename}': {search_error}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The LLM-controlled filenames (handled in resolved_files, resolved_path, and later rglob search) can escape the artifact tree; fix by canonicalizing and constraining every candidate: when building resolved_path (and before any rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is within base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or skipping any path that does not lie under base_artifact_dir; remove use of user-controlled patterns in rglob (do not pass raw filename patterns to base_artifact_dir.rglob), instead pre-enumerate all files under base_artifact_dir into a safe map (path.resolve() values and names) and fall back only to exact matches from that enumerated set; log and return an error or skip entries that are outside the artifact tree and never read files outside base_artifact_dir into file_contents.
914-921:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a unique report filename per failure.
process_failure_analysis()calls this once perFAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlierhtml_reportpaths end up pointing at the last one written.📝 One simple fix
- html_filename = "failure_analysis_report.html" + safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/")) + html_filename = f"failure_analysis_report_{safe_failure_dir}.html"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 914 - 921, process_failure_analysis currently writes every failure report to the constant "failure_analysis_report.html" (variables html_filename/html_path), causing later reports to overwrite earlier ones; change it to generate a unique filename per invocation (for example include the failure identifier, timestamp, or a uuid in the filename) before constructing html_path in the same block so each call creates a distinct file and return the new unique path string from process_failure_analysis.
205-218:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisabling TLS verification exposes the LLM client to man-in-the-middle attacks.
Setting
verify=Falseand suppressingInsecureRequestWarningdisables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configurehttpx.Client(verify="/path/to/ca-bundle.crt")instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.🔒 Recommended fix to support custom CA bundle
+ # Allow custom CA bundle or disable verification only if explicitly opted in + ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE") + verify_tls = ca_bundle if ca_bundle else True # Default to secure + + if verify_tls is False: + logger.warning("⚠️ TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure") + - with warnings.catch_warnings(): - warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning) - - # Create HTTP client with proper headers for litellm endpoint - http_client = httpx.Client( - verify=False, + http_client = httpx.Client( + verify=verify_tls, headers={🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client currently disables TLS verification by passing verify=False in the httpx.Client call in on_failure.py; replace this with a secure approach: read a configurable CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path) when present, and only fall back to verify=False when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing InsecureRequestWarning to the insecure branch and emit a clear log warning when verify is intentionally disabled. Ensure changes target the http_client construction site and associated warnings.catch_warnings() usage so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)
1559-1564: ⚡ Quick winConsider parameterizing the hardcoded model key.
The model key
"qwen-3-6-35b"is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.♻️ Suggested change
- MODEL_KEY = "qwen-3-6-35b" + MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the hardcoded MODEL_KEY assignment by reading a configurable environment variable: instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default string; then use that variable when calling models_config.get(MODEL_KEY) and raise the same ValueError if missing. Ensure you import os if not present and keep the existing models_config lookup and error handling intact.projects/fournos_launcher/orchestration/submit.py (1)
59-161: ⚡ Quick winConsider adding artifact directory validation.
The function uses
pathlib.Path(env.ARTIFACT_DIR)without verifying thatARTIFACT_DIRis set or that the path exists. IfARTIFACT_DIRisNoneor an invalid path, the subsequent glob operations could fail silently or raise exceptions.🛡️ Suggested defensive check
def generate_notification_files(): """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information.""" artifact_dir = pathlib.Path(env.ARTIFACT_DIR) + + if not artifact_dir or not artifact_dir.exists(): + logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}") + return + mlflow_urls = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The generate_notification_files function uses artifact_dir = pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the path exists; add a defensive check at the start of generate_notification_files to validate env.ARTIFACT_DIR is set (not None/empty) and that artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear warning via logger (e.g., logger.warning) and return early to skip the rest of the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.
In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.
In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.
In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.
---
Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.
---
Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877
📒 Files selected for processing (8)
projects/core/agentic/on_failure.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 6
♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)
982-1007:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAsk for the 5-field schema on the first turn too.
When the model can answer from
FAILURE+ log content without requesting additional files, this prompt never asks forRoot Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed intoextract_structured_analysis()at line 1093. In that path,get_failure_explanations()will usually return empty structured fields even though the public API promises structured output.💡 Suggested fix
Add the 5-field schema request to the initial prompt:
## TASK: 1. Find failure patterns: - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks - In execution logs: Look for "----- FAILED ----" patterns or error indicators - Identify which specific tasks failed and why 2. What step broke. Why. 3. What action caused failure. -4. Need more files? Use EXACTLY this format: +4. Provide structured analysis in this format: + +1. **Root Cause**: One sentence. What broke. +2. **Failed Step**: Which step. Why it failed. +3. **Trigger**: Specific action/config that caused failure. +4. **Fix**: What to change. +5. **Prevent**: How to avoid next time. + +5. Need more files? Use EXACTLY this format:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial prompt assigned to prompt_content should explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so downstream extract_structured_analysis() and get_failure_explanations() receive structured fields; modify the string built in prompt_content inside on_failure.py to append a short line asking for the five named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact field names match what extract_structured_analysis() and get_failure_explanations() expect.
1219-1299:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winConstrain model-requested paths to the artifact tree before reading them.
The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into
rglob(). A prompt-injected request such as../../../../etc/passwdor*.pemcan therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outsidebase_artifact_dir, and fall back only to exact matches from the enumerated artifact files.🔒 Recommended fix
+ # Resolve and validate path is within artifact tree + try: + resolved_abs = resolved_path.resolve() + artifact_abs = base_artifact_dir.resolve() + if not resolved_abs.is_relative_to(artifact_abs): + logger.warning(f"❌ Rejected path outside artifact tree: {filename}") + file_contents[filename] = "❌ Unauthorized: File path outside artifact directory" + continue + except Exception as e: + logger.warning(f"Path resolution failed for {filename}: {e}") + file_contents[filename] = f"❌ Path resolution error: {e}" + continue + # First try the resolved path if resolved_path.exists(): found_path = resolved_path else: - # Search in the base artifact directory recursively (silent search) + # Search only in pre-enumerated artifact files (no glob patterns) try: - # Use a safe search pattern - escape any problematic characters - safe_filename = filename.replace("**", "*") # Replace double asterisks - for search_path in base_artifact_dir.rglob(safe_filename): - if search_path.is_file(): - found_path = search_path - break - - # If no match with pattern, try exact filename search - if not found_path: - for search_path in base_artifact_dir.rglob("*"): - if search_path.is_file() and search_path.name == filename: - found_path = search_path - break + # Only allow exact filename matches from available_files list + for available_file in available_files: + if Path(available_file).name == filename: + candidate = base_artifact_dir / available_file + if candidate.exists() and candidate.is_file(): + found_path = candidate + break except Exception as search_error: logger.warning(f"Search error for '{filename}': {search_error}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The LLM-controlled filenames (handled in resolved_files, resolved_path, and later rglob search) can escape the artifact tree; fix by canonicalizing and constraining every candidate: when building resolved_path (and before any rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is within base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or skipping any path that does not lie under base_artifact_dir; remove use of user-controlled patterns in rglob (do not pass raw filename patterns to base_artifact_dir.rglob), instead pre-enumerate all files under base_artifact_dir into a safe map (path.resolve() values and names) and fall back only to exact matches from that enumerated set; log and return an error or skip entries that are outside the artifact tree and never read files outside base_artifact_dir into file_contents.
914-921:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a unique report filename per failure.
process_failure_analysis()calls this once perFAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlierhtml_reportpaths end up pointing at the last one written.📝 One simple fix
- html_filename = "failure_analysis_report.html" + safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/")) + html_filename = f"failure_analysis_report_{safe_failure_dir}.html"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 914 - 921, process_failure_analysis currently writes every failure report to the constant "failure_analysis_report.html" (variables html_filename/html_path), causing later reports to overwrite earlier ones; change it to generate a unique filename per invocation (for example include the failure identifier, timestamp, or a uuid in the filename) before constructing html_path in the same block so each call creates a distinct file and return the new unique path string from process_failure_analysis.
205-218:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisabling TLS verification exposes the LLM client to man-in-the-middle attacks.
Setting
verify=Falseand suppressingInsecureRequestWarningdisables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configurehttpx.Client(verify="/path/to/ca-bundle.crt")instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.🔒 Recommended fix to support custom CA bundle
+ # Allow custom CA bundle or disable verification only if explicitly opted in + ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE") + verify_tls = ca_bundle if ca_bundle else True # Default to secure + + if verify_tls is False: + logger.warning("⚠️ TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure") + - with warnings.catch_warnings(): - warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning) - - # Create HTTP client with proper headers for litellm endpoint - http_client = httpx.Client( - verify=False, + http_client = httpx.Client( + verify=verify_tls, headers={🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client currently disables TLS verification by passing verify=False in the httpx.Client call in on_failure.py; replace this with a secure approach: read a configurable CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path) when present, and only fall back to verify=False when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing InsecureRequestWarning to the insecure branch and emit a clear log warning when verify is intentionally disabled. Ensure changes target the http_client construction site and associated warnings.catch_warnings() usage so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)
1559-1564: ⚡ Quick winConsider parameterizing the hardcoded model key.
The model key
"qwen-3-6-35b"is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.♻️ Suggested change
- MODEL_KEY = "qwen-3-6-35b" + MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the hardcoded MODEL_KEY assignment by reading a configurable environment variable: instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default string; then use that variable when calling models_config.get(MODEL_KEY) and raise the same ValueError if missing. Ensure you import os if not present and keep the existing models_config lookup and error handling intact.projects/fournos_launcher/orchestration/submit.py (1)
59-161: ⚡ Quick winConsider adding artifact directory validation.
The function uses
pathlib.Path(env.ARTIFACT_DIR)without verifying thatARTIFACT_DIRis set or that the path exists. IfARTIFACT_DIRisNoneor an invalid path, the subsequent glob operations could fail silently or raise exceptions.🛡️ Suggested defensive check
def generate_notification_files(): """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information.""" artifact_dir = pathlib.Path(env.ARTIFACT_DIR) + + if not artifact_dir or not artifact_dir.exists(): + logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}") + return + mlflow_urls = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The generate_notification_files function uses artifact_dir = pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the path exists; add a defensive check at the start of generate_notification_files to validate env.ARTIFACT_DIR is set (not None/empty) and that artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear warning via logger (e.g., logger.warning) and return early to skip the rest of the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.
In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.
In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.
In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.
---
Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.
---
Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877
📒 Files selected for processing (8)
projects/core/agentic/on_failure.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
🛑 Comments failed to post (5)
projects/core/agentic/on_failure.py (1)
34-49:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix the import sorting to resolve the pipeline failure.
The Ruff linter reports that the import block is unsorted or unformatted (I001). This is causing the CI pipeline to fail.
Run
ruff check projects/core/agentic/on_failure.py --fixto automatically organize the imports, or manually sort them according to PEP 8 conventions (standard library, third-party, local imports).🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt
[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with
ruff check --fix).🪛 GitHub Actions: Ruff / lint (3.12)
[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with
ruff check projects/ bin/ --fix).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The import block in projects/core/agentic/on_failure.py is unsorted causing Ruff I001; fix it by running "ruff check projects/core/agentic/on_failure.py --fix" or manually reorder the imports in the module-level import section so that standard library imports (datetime, functools, json, logging, os, re, sys, warnings, pathlib.Path) come first, followed by third-party imports (click, httpx, urllib3, yaml, langchain_core.messages.HumanMessage), and then any local imports, ensure blank lines between groups and remove any unused imports; after reordering, run the linter to confirm the I001 is resolved.
⚠️ Potential issue | 🟠 Major | ⚡ Quick win🧩 Analysis chain
🏁 Script executed:
#!/bin/bash # Check if httpx, urllib3, langchain-core, langchain-openai are in pyproject.toml echo "=== Checking pyproject.toml for required dependencies ===" rg -n 'httpx|urllib3|langchain' pyproject.toml || echo "❌ Missing dependencies detected" echo "" echo "=== Dependencies imported in on_failure.py ===" rg -n '^import (httpx|urllib3)|^from langchain' projects/core/agentic/on_failure.pyRepository: openshift-psap/forge
Length of output: 288
🏁 Script executed:
#!/bin/bash set -euo pipefail echo "=== Imports in on_failure.py (full scan for relevant packages) ===" rg -n "^(import (httpx|urllib3)|from langchain(_core|_openai)|from langchain|import langchain|import .*langchain|from langchain_openai|import .*openai|from .*openai)" projects/core/agentic/on_failure.py || true echo echo "=== pyproject.toml: look for dependency declarations (more precise) ===" rg -n "(^|\\W)(httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain)(\\W|$)" pyproject.toml || true echo echo "=== pyproject.toml: show dependency sections around project.dependencies and optional-dependencies ===" # Print the first ~200 lines to locate [project] sections quickly (bounded output) sed -n '1,220p' pyproject.toml | nl -ba echo echo "=== pyproject.toml: if optional-dependencies exists, show it (bounded) ===" if rg -n "^\[project\.optional-dependencies\]" -S pyproject.toml >/dev/null; then start=$(rg -n "^\[project\.optional-dependencies\]" -n pyproject.toml | head -n1 | cut -d: -f1) end=$((start+220)) sed -n "${start},${end}p" pyproject.toml | nl -ba else echo "No [project.optional-dependencies] section found." fiRepository: openshift-psap/forge
Length of output: 456
🏁 Script executed:
#!/bin/bash set -euo pipefail echo "=== Locate pyproject.toml files ===" fd -t f -e toml "pyproject.toml" . || true echo # Choose the root pyproject.toml if it exists; otherwise fall back to first found PYPROJECT="$(fd -t f -e toml "pyproject.toml" . | head -n1 || true)" if [[ -z "${PYPROJECT}" ]]; then echo "No pyproject.toml found in repo." exit 0 fi echo "Using: ${PYPROJECT}" echo echo "=== Imports in projects/core/agentic/on_failure.py (full scan for relevant packages) ===" rg -n "(^import (httpx|urllib3)|^from langchain(_core|_openai)?(\.| )|^from langchain import|^from langchain_openai|^from .*openai|^import .*langchain|^import .*openai|^import .*yaml|^import .*click)" projects/core/agentic/on_failure.py || true echo echo "=== pyproject.toml: search for declared dependencies (strings) ===" rg -n "httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain" "${PYPROJECT}" || echo "No matching dependency strings found." echo echo "=== pyproject.toml: show [project] dependencies sections (bounded) ===" # Show near likely dependency keys rg -n "^\[project\]" "${PYPROJECT}" || true rg -n "^(dependencies|optional-dependencies)\s*=" "${PYPROJECT}" || true # Print first 260 lines to include [project] metadata and dependency lists (bounded output) sed -n '1,260p' "${PYPROJECT}" | cat -n echo if rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" >/dev/null; then start=$(rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" | head -n1 | cut -d: -f1) end=$((start+260)) echo "=== [project.optional-dependencies] (bounded) ===" sed -n "${start},${end}p" "${PYPROJECT}" | cat -n else echo "No [project.optional-dependencies] section found." fiRepository: openshift-psap/forge
Length of output: 7970
🏁 Script executed:
#!/bin/bash set -euo pipefail rg -n "langchain_openai|langchain(_core|_openai)?|import .*langchain|from .*langchain" projects/core/agentic/on_failure.py || trueRepository: openshift-psap/forge
Length of output: 235
Declare missing dependencies in pyproject.toml.
projects/core/agentic/on_failure.pyimportshttpx,urllib3, andlangchain_core, and later dynamically importslangchain_openai.ChatOpenAI, but./pyproject.tomlhas neither of these packages in[project].dependenciesnor in[project.optional-dependencies]. This will cause ImportErrors on a fresh install.🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt
[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with
ruff check --fix).🪛 GitHub Actions: Ruff / lint (3.12)
[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with
ruff check projects/ bin/ --fix).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The file on_failure.py imports httpx, urllib3, and langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these packages are not declared in pyproject.toml; update the project's pyproject.toml to add httpx, urllib3, and langchain_core to [project].dependencies (or put them under [project.optional-dependencies] with appropriate extras if optional) and ensure langchain_openai (or the package that provides ChatOpenAI) is declared as well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol succeed on a fresh install.projects/core/notifications/send.py (1)
299-313:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd exception handling for file read operations.
The NOTIFICATION.md file read (lines 304-305) is not wrapped in a try-except block, unlike the NOTIFICATION.html read in
get_common_message(lines 215-223). If the file read fails (e.g., permissions issue, encoding error), it will raise an exception and potentially break the Slack notification flow.🛡️ Suggested error handling
# Check for fournos_launcher generated Slack notification content artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", "")) fournos_notification_md = artifact_dir / "NOTIFICATION.md" if fournos_notification_md.exists(): - with open(fournos_notification_md, encoding="utf-8") as f: - fournos_content = f.read().strip() - if fournos_content: - # Return custom notification content with status icon - return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + try: + with open(fournos_notification_md, encoding="utf-8") as f: + fournos_content = f.read().strip() + if fournos_content: + # Return custom notification content with status icon + return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + except Exception as e: + logger.warning("Failed to read NOTIFICATION.md: %s", e) # Fallback to standard message if no custom notification📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.# Check for fournos_launcher generated Slack notification content artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", "")) fournos_notification_md = artifact_dir / "NOTIFICATION.md" if fournos_notification_md.exists(): try: with open(fournos_notification_md, encoding="utf-8") as f: fournos_content = f.read().strip() if fournos_content: # Return custom notification content with status icon return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" except Exception as e: logger.warning("Failed to read NOTIFICATION.md: %s", e) # Fallback to standard message if no custom notification return get_common_message( finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/notifications/send.py` around lines 299 - 313, The NOTIFICATION.md file read (artifact_dir / "NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except to avoid raising on I/O/encoding/permission errors; update the block in send.py around the with open(fournos_notification_md, ...) and fournos_content = f.read().strip() to catch Exception as e, handle/log the error (consistent with the existing logging approach used by get_common_message's NOTIFICATION.html handling), and fall back to returning get_common_message(finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs or the file is unreadable.projects/llm_d/orchestration/test_phase.py (2)
93-96:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFinalizers are unconditionally disabled.
do_finalizers = Falseon line 93 prevents all finalizer callbacks from executing, including:
- Capturing inference-service state
- Writing endpoint URL to artifacts
- Capturing namespace events
- Cleaning up test resources (pods, jobs, PVCs)
The
SignalInterruptcheck on lines 94-96 is redundant sincedo_finalizersis alreadyFalse. If this is intentional debugging code, consider removing it before merge or adding a comment explaining why finalizers are disabled. Otherwise, resource leaks and missing artifacts will occur.Suggested fix to re-enable finalizers
finally: - do_finalizers = False + do_finalizers = True if primary_exc and isinstance(primary_exc[1], SignalInterrupt): logging.warning("Caught a SignalInterrupt, skipping the finalizers") do_finalizers = False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/orchestration/test_phase.py` around lines 93 - 96, The code unconditionally disables finalizers by setting do_finalizers = False before checking primary_exc; remove that unconditional assignment and instead initialize do_finalizers = True (or omit the assignment) and only set do_finalizers = False when primary_exc indicates a SignalInterrupt (i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set do_finalizers = False there); ensure the variable controlling execution of finalizer callbacks (do_finalizers) is True in the normal flow so finalizer callbacks (capture state, write endpoint, capture events, cleanup) run unless a SignalInterrupt is detected.
170-181:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winUnreachable code:
returnbefore docstring makes function body dead code.The
returnstatement on line 171 causes all subsequent code (lines 172-180) to be unreachable, including the docstring and the actualprepare_model_cache()call. This bypasses model cache preparation entirely, which per the codebase handles cache validation, PVC setup, and HF token retrieval.If this is temporary debugging code, please revert before merge. The docstring placement after
returnalso indicates this may have been an accidental edit.Suggested fix to restore model cache preparation
def _prepare_model_cache() -> None: - return """Ensure model cache PVC is ready for deployment.""" from projects.llm_d.orchestration import runtime_config model_key = runtime_config.get_model_key() logger.info("Preparing model cache for model: %s", model_key) # Use the same prepare_model_cache function as the prepare phase # This includes vault token handling and PVC existence checks prepare_model_cache()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def _prepare_model_cache() -> None: """Ensure model cache PVC is ready for deployment.""" from projects.llm_d.orchestration import runtime_config model_key = runtime_config.get_model_key() logger.info("Preparing model cache for model: %s", model_key) # Use the same prepare_model_cache function as the prepare phase # This includes vault token handling and PVC existence checks prepare_model_cache()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/orchestration/test_phase.py` around lines 170 - 181, The _prepare_model_cache function currently returns immediately making its docstring and body (including the call to prepare_model_cache()) unreachable; remove the stray return, place the docstring right after the def if desired, and restore the body so it imports runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via logger.info, and calls prepare_model_cache() to perform PVC/token handling; verify references to _prepare_model_cache(), runtime_config, logger, and prepare_model_cache are unchanged.vaults/psap-models-corp-rh.yaml (1)
8-8:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix typo in description.
"use" should be "used" for grammatical correctness.
📝 Proposed fix
- Access to the RH models use by FORGE agent + Access to the RH models used by FORGE agent📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.Access to the RH models used by FORGE agent🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vaults/psap-models-corp-rh.yaml` at line 8, Fix the typo in the YAML description string "Access to the RH models use by FORGE agent" by changing "use" to "used" so the line reads "Access to the RH models used by FORGE agent"; locate and update that description entry in the psap-models-corp-rh YAML block.
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 59 seconds 🔴 • Link to the test results. • No reports generated... Test configuration: |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 10 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 51 seconds 🔴 • Link to the test results. • No reports generated... Test configuration: |
…ns with the mlflow_urls and log_files links
…h NOTIFICATIONS.{html,md}
|
/test fournos llm_d |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/ci_entrypoint/run_common.py`:
- Line 32: EXTRA_PACKAGES is currently an unpinned list (EXTRA_PACKAGES =
["httpx", "langchain-openai", "langchain-core"]) which makes CI installs
non-deterministic; update the EXTRA_PACKAGES variable to include explicit
version pins (or modify install_extra_packages() to install using a shared
constraints/lockfile) so CI always installs fixed versions — for example,
replace unpinned names in EXTRA_PACKAGES with versioned strings or change
install_extra_packages() to call pip with --constraint pointing to your
repository's constraints/lockfile (ensure you update any related docs/tests to
reflect the pinned versions).
In `@projects/core/notifications/send.py`:
- Around line 303-309: Wrap the optional NOTIFICATION.md read for
fournos_notification_md in a try/except so I/O/encoding/race errors don't abort
Slack message generation: when evaluating fournos_notification_md.exists()
surround the open/read/strip sequence with a broad exception handler that logs
or ignores the error and falls back to returning the default Slack message
(i.e., only return the custom "{status_icon}
{get_bold(status)}\\n\\n{fournos_content}" when read succeeds and
fournos_content is non-empty); reference the fournos_notification_md variable
and the return site that builds the status_icon/get_bold(status) message to
locate the change.
In `@vaults/psap-models-corp-rh.yaml`:
- Around line 6-17: The vault entry contains an unused format field and a typo:
VaultManager._load_vault_definition only reads the content key's file and
description, so remove or relocate the example block under the format key from
the content entry (move it into a YAML comment or external docs) and correct the
description text from "use by" to "used by"; ensure the remaining content only
includes the keys consumed at load time (file/description) so maintainers aren't
misled by an ignored format field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 91f0387f-50d1-408e-8966-0282847d630b
📒 Files selected for processing (10)
projects/core/agentic/on_failure.pyprojects/core/ci_entrypoint/run_common.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/config.yamlprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- projects/fournos_launcher/orchestration/submit.py
- projects/llm_d/orchestration/test_phase.py
- projects/core/dsl/runtime.py
- projects/llm_d/orchestration/ci.py
- projects/core/agentic/on_failure.py
| FORGE_HOME = Path(__file__).resolve().parent.parent.parent.parent | ||
|
|
||
| EXTRA_PACKAGES = [] | ||
| EXTRA_PACKAGES = ["httpx", "langchain-openai", "langchain-core"] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for existing package version specifications
# Search for requirements files or version pinning
echo "=== Checking for requirements files ==="
fd -t f 'requirements.*\.txt|pyproject\.toml|setup\.py|Pipfile' --exec echo {}
echo ""
echo "=== Checking for version references to these packages ==="
rg -n 'httpx|langchain-openai|langchain-core' --type toml --type txt -g 'requirements*' -g 'pyproject.toml' -g 'setup.py'
echo ""
echo "=== Checking current PyPI versions ==="
for pkg in httpx langchain-openai langchain-core; do
echo "Package: $pkg"
curl -s "https://pypi.org/pypi/$pkg/json" | jq -r '.info.version // "Not found"'
doneRepository: openshift-psap/forge
Length of output: 498
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect run_common.py around EXTRA_PACKAGES ==="
python - <<'PY'
import itertools,sys,os, re
path="projects/core/ci_entrypoint/run_common.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
for i in range(1, len(lines)+1):
if "EXTRA_PACKAGES" in lines[i-1]:
start=max(1,i-15); end=min(len(lines), i+25)
for j in range(start,end+1):
print(f"{j:4d}: {lines[j-1].rstrip()}")
break
else:
print("EXTRA_PACKAGES not found")
PY
echo
echo "=== Find install_extra_packages definition and how it's invoked ==="
rg -n "def install_extra_packages|install_extra_packages\\(|prepare\\(" projects/core/ci_entrypoint/run_common.py
echo
echo "=== Search repo for version constraints for these packages ==="
rg -n "(httpx|langchain-openai|langchain-core)\\s*([<>=!~]{1,2}|==)" -S .
echo
echo "=== Search for constraints/requirements files that might be referenced by ci entrypoint ==="
rg -n "(--constraint|constraints\\.txt|requirements\\.txt|pip install)" projects/core/ci_entrypoint/run_common.pyRepository: openshift-psap/forge
Length of output: 1863
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect install_extra_packages() implementation ==="
python - <<'PY'
import re
path="projects/core/ci_entrypoint/run_common.py"
with open(path,'r',encoding='utf-8') as f:
lines=f.readlines()
def find_line(substr):
for i,l in enumerate(lines, start=1):
if substr in l:
return i
return None
start=find_line("def install_extra_packages")
if not start:
print("install_extra_packages not found")
raise SystemExit(0)
# print from start to a bit after first callsite or end of function heuristically
for j in range(start, min(len(lines)+1, start+120)):
print(f"{j:4d}: {lines[j-1].rstrip()}")
PY
echo
echo "=== Search for any version pins/constraints for httpx/langchain-openai/langchain-core (non-failing) ==="
rg -n "(httpx|langchain-openai|langchain-core)\s*([<>=!~]{1,2}|==)" -S . || true
echo
echo "=== Search for constraints files / pip --constraint usage ==="
rg -n "(--constraint|constraints\\.txt|constraint(s)?\\.txt|PIP_CONSTRAINT|pip\\.conf)" -S . || true
echo
echo "=== Search for these packages specifically in requirements/lock files (non-failing) ==="
rg -n "(httpx|langchain-openai|langchain-core)" -S --glob '*/requirements*' --glob 'pyproject.toml' --glob '*.lock' . || trueRepository: openshift-psap/forge
Length of output: 4779
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Files mentioning any of the three packages (requirements/pyproject/lock) ==="
rg -n "httpx|langchain-openai|langchain-core" -S --glob '*/requirements*.txt' --glob '*/requirements*.in' --glob 'pyproject.toml' --glob '*.lock' .
echo
echo "=== Search for version operators in lines mentioning those packages (covers httpx[...]==... too) ==="
rg -n "(httpx|langchain-openai|langchain-core)[^\n#]*([<>=!~]{1,2}|==)[^\n#]*" -S --glob '*/requirements*.txt' --glob 'pyproject.toml' --glob '*.lock' .
echo
echo "=== Show relevant lines (up to 30) from matched files for each package ==="
for f in $(rg -l "httpx|langchain-openai|langchain-core" -S --glob '*/requirements*.txt' --glob 'pyproject.toml' --glob '*.lock' .); do
echo "--- $f ---"
rg -n "httpx|langchain-openai|langchain-core" -S "$f" | head -n 30 || true
doneRepository: openshift-psap/forge
Length of output: 145
Pin versions for CI “EXTRA_PACKAGES” to keep builds deterministic.
projects/core/ci_entrypoint/run_common.py line 32 defines EXTRA_PACKAGES = ["httpx", "langchain-openai", "langchain-core"] with no version constraints, and install_extra_packages() installs them using uv pip install --no-cache (fallback: pip install --user --no-cache-dir). The repo has no corresponding pins/constraints for these packages in requirements/pyproject/lock files, so CI will resolve to the latest releases at install time, making builds non-deterministic and increasing the chance of unexpected failures. Pin versions or use a shared constraints/lockfile.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/core/ci_entrypoint/run_common.py` at line 32, EXTRA_PACKAGES is
currently an unpinned list (EXTRA_PACKAGES = ["httpx", "langchain-openai",
"langchain-core"]) which makes CI installs non-deterministic; update the
EXTRA_PACKAGES variable to include explicit version pins (or modify
install_extra_packages() to install using a shared constraints/lockfile) so CI
always installs fixed versions — for example, replace unpinned names in
EXTRA_PACKAGES with versioned strings or change install_extra_packages() to call
pip with --constraint pointing to your repository's constraints/lockfile (ensure
you update any related docs/tests to reflect the pinned versions).
| if fournos_notification_md.exists(): | ||
| with open(fournos_notification_md, encoding="utf-8") as f: | ||
| fournos_content = f.read().strip() | ||
| if fournos_content: | ||
| # Return custom notification content with status icon | ||
| return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" | ||
|
|
There was a problem hiding this comment.
Handle optional NOTIFICATION.md read errors before overriding the Slack message
On Line 304, reading NOTIFICATION.md is unguarded. If this optional file is unreadable (I/O/encoding/race), the exception can break Slack notification generation instead of falling back to the standard message path.
Suggested fix
if fournos_notification_md.exists():
- with open(fournos_notification_md, encoding="utf-8") as f:
- fournos_content = f.read().strip()
- if fournos_content:
- # Return custom notification content with status icon
- return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+ try:
+ with open(fournos_notification_md, encoding="utf-8") as f:
+ fournos_content = f.read().strip()
+ if fournos_content:
+ # Return custom notification content with status icon
+ return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+ except Exception as e:
+ logger.warning("Failed to read NOTIFICATION.md: %s", e)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@projects/core/notifications/send.py` around lines 303 - 309, Wrap the
optional NOTIFICATION.md read for fournos_notification_md in a try/except so
I/O/encoding/race errors don't abort Slack message generation: when evaluating
fournos_notification_md.exists() surround the open/read/strip sequence with a
broad exception handler that logs or ignores the error and falls back to
returning the default Slack message (i.e., only return the custom "{status_icon}
{get_bold(status)}\\n\\n{fournos_content}" when read succeeds and
fournos_content is non-empty); reference the fournos_notification_md variable
and the return site that builds the status_icon/get_bold(status) message to
locate the change.
| agent-models.yaml: | ||
| description: | | ||
| Access to the RH models use by FORGE agent | ||
| format: | | ||
| model_key_1: | ||
| model_api: "..." | ||
| model_id: "..." | ||
| user_key: "..." | ||
| model_key_2: | ||
| model_api: "..." | ||
| model_id: "..." | ||
| user_key: "..." |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if the 'format' field in vault content definitions is used anywhere
# Search for references to 'format' in vault-related code
rg -n --type=py -C3 "\.get\(['\"]format['\"]" --glob '*vault*'
# Also check for any processing of vault content beyond file/description
rg -n --type=py -C3 "content_def\.get\(" projects/core/library/vault.pyRepository: openshift-psap/forge
Length of output: 557
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Search for any reads of a `format` key from vault content dicts
rg -n --hidden --glob '!.git/*' --type=py \
"(content_def|defn|content_defn|vault|data)\.get\(['\"]format['\"]\]" \
projects/core/library
rg -n --hidden --glob '!.git/*' --type=py \
"\[['\"]format['\"]\]" \
projects/core/library
# 2) Search for `format` key handling in YAML/vault parsing code
rg -n --hidden --glob '!.git/*' --type=py \
"format\s*[:=]" \
projects/core/library/vault.py projects/core/library/*vault* || true
# 3) Confirm the loader logic around content dict extraction
python3 - <<'PY'
from pathlib import Path
p = Path("projects/core/library/vault.py")
if p.exists():
txt = p.read_text()
i = txt.find("for content_name, content_def in")
print("--- context around content extraction ---")
print("\n".join(txt[i-400:i+400].splitlines()[:40]))
else:
print("vault.py not found at projects/core/library/vault.py")
PYRepository: openshift-psap/forge
Length of output: 46
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate vault implementation entrypoints (helps constrain further searches)
rg -n --hidden --glob '!.git/*' "VaultManager|_load_vault_definition|load_vault_definition|vaults" projects || true
rg -n --hidden --glob '!.git/*' "class Vault|def _load_vault_definition" . || true
# Search repo-wide for any code reading a 'format' key from content dicts
rg -n --hidden --glob '!.git/*' --type=py \
"get\(['\"]format['\"]\)" . || true
# Search for any usage of the literal key 'format' near vault parsing/loading
rg -n --hidden --glob '!.git/*' --type=py -C2 \
"\bformat\b" projects/core/library/vault.py || true
# Search for the exact substring "content_def.get(" followed by keys other than file/description (to see how strict the schema is)
rg -n --hidden --glob '!.git/*' --type=py -C2 \
"content_def\.get\(" projects/core/library/vault.py || trueRepository: openshift-psap/forge
Length of output: 16677
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the relevant section of vault loader and content parsing
nl -ba projects/core/library/vault.py | sed -n '70,170p'Repository: openshift-psap/forge
Length of output: 107
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Print loader logic around _load_vault_definition and content parsing
cat -n projects/core/library/vault.py | sed -n '60,190p'Repository: openshift-psap/forge
Length of output: 5861
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,30p' vaults/psap-models-corp-rh.yaml | cat -nRepository: openshift-psap/forge
Length of output: 578
Remove/relocate the unused format field in vaults/psap-models-corp-rh.yaml (and fix the typo)
VaultManager._load_vault_definition only extracts file (defaults to the content key) and description from each content entry; the format key is ignored at load time, so it won’t have runtime effect and may mislead maintainers—move this example to YAML comments or separate docs. Also fix the typo: “use by” → “used by”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@vaults/psap-models-corp-rh.yaml` around lines 6 - 17, The vault entry
contains an unused format field and a typo: VaultManager._load_vault_definition
only reads the content key's file and description, so remove or relocate the
example block under the format key from the content entry (move it into a YAML
comment or external docs) and correct the description text from "use by" to
"used by"; ensure the remaining content only includes the keys consumed at load
time (file/description) so maintainers aren't misled by an ignored format field.
|
🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 11 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 49 seconds 🔴 • Link to the test results. • No reports generated... • Task Logs:
Test configuration: |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 05 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 12 minutes 43 seconds 🔴 • Link to the test results. • No reports generated... • Task Logs:
Test configuration: |
|
@kpouget: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit
New Features
Behavior Changes