Skip to content

S3 🧠🤓🤯#142

Merged
alckasoc merged 3 commits intomainfrom
s3
Oct 3, 2025
Merged

S3 🧠🤓🤯#142
alckasoc merged 3 commits intomainfrom
s3

Conversation

@alckasoc
Copy link
Copy Markdown
Contributor

@alckasoc alckasoc commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Adds Agent S3 (UI automation, grounding, worker/planning/reflection, code-agent) plus interactive CLI with screenshot permission flow.
    • Multi-provider LLM engine framework and LMMAgent wrapper.
    • Behavior narrator, comparative judge, and code-execution agent for iterative tasks.
    • End-to-end OSWorld orchestration, local/AWS runners, fact-caption generation, and judge pipeline.
  • Documentation

    • README updated to S3, new links, install/usage, SDK examples, and OSWorld guide.
  • Chores

    • Utilities for formatting/validation, image compression, logging, and experiment orchestration; package bumped to 0.3.0.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 3, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Replaces Agent S2.5 with Agent S3 across docs and code. Adds a new S3 stack: AgentS3/UIAgent, Worker, OSWorldACI grounding, CodeAgent, LMMAgent and multi-provider LMM engines, procedural memory, formatters and utils, CLI app, behavior narration/comparative judging, and OSWorld orchestration (runners, fact generation, judge scripts, and docs).

Changes

Cohort / File(s) Summary
Docs & packaging
README.md, setup.py, osworld_setup/s3/OSWorld.md, osworld_setup/s3/run.sh
Replace S2.5 references with S3 in README; add S3 links/media and prerequisites; update CLI/SDK import examples; bump package version and CLI entrypoint to gui_agents.s3.cli_app; add OSWorld S3 deployment guide and orchestration script.
S3 agents
gui_agents/s3/agents/agent_s.py, gui_agents/s3/agents/worker.py, gui_agents/s3/agents/code_agent.py, gui_agents/s3/agents/grounding.py
Add UIAgent and AgentS3; introduce Worker with planning/reflection and context management; add CodeAgent for iterative code-execute-verify loop; add ACI/OSWorldACI grounding with OCR, GUI actions, code-agent invocation, and many helper actions.
Behavior narration & judging
gui_agents/s3/bbon/behavior_narrator.py, gui_agents/s3/bbon/comparative_judge.py
Add BehaviorNarrator for image-based narration and captioning; add ComparativeJudge and image helpers to evaluate multiple trajectories.
Core LLM framework
gui_agents/s3/core/engine.py, gui_agents/s3/core/mllm.py, gui_agents/s3/core/module.py
Add extensible LMM engine implementations (OpenAI, Anthropic, Gemini, OpenRouter, Azure, vLLM, HuggingFace, Parasail) with backoff and provider fallbacks; add LMMAgent adapter and BaseModule helper.
Memory & prompts
gui_agents/s3/memory/procedural_memory.py
Add PROCEDURAL_MEMORY prompts and construct_simple_worker_procedural_memory to synthesize procedural memory including agent action signatures.
Formatters & utils
gui_agents/s3/utils/formatters.py, gui_agents/s3/utils/common_utils.py
Add response/code format validators (tag and code checks) and helpers: LLM call wrappers, retry/format enforcement, code parsing, image compression, and pyautogui code creation.
CLI app
gui_agents/s3/cli_app.py
Add interactive CLI to run AgentS3 + OSWorldACI: arg parsing, logging, screenshot capture, permission dialogs, pause/resume (Ctrl+C/Esc), and run loop executing agent outputs.
OSWorld experiment tooling
osworld_setup/s3/*
Add asynchronous fact-caption generator (generate_facts.py), comparative judge runner (run_judge.py), utilities (bbon/utils.py), local and cloud runners (run.py, run_local.py), single-example runner (lib_run_single.py), and related helpers for evaluation and result aggregation.
Lint workflow
.github/workflows/lint.yml
Narrow Black check scope to gui_agents only.
Legacy CLI tweaks
gui_agents/s1/cli_app.py, gui_agents/s2/cli_app.py, gui_agents/s2_5/cli_app.py
Minor input handling and pause/resume refinements (ESC resume, raw-mode char reading) and small formatting tweaks across older CLI modules.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as cli_app.py
  participant A3 as AgentS3
  participant W as Worker
  participant GA as OSWorldACI
  participant LLM as LMMAgent/LMMEngine
  participant CA as CodeAgent

  User->>CLI: submit instruction
  CLI->>GA: assign_screenshot(obs)
  CLI->>A3: predict(instruction, obs)
  A3->>W: generate_next_action(instruction, obs)
  W->>LLM: reflection prompt (optional)
  LLM-->>W: reflection
  W->>LLM: planning prompt (+history, reflection)
  LLM-->>W: plan (may include code)
  W->>CA: call_code_agent(plan or subtask) [optional]
  CA->>LLM: step prompts
  LLM-->>CA: step responses
  CA-->>W: execution results / summary
  W-->>A3: executor_info + actions/code
  A3-->>CLI: info, actions/code
  alt code requires user approval
    CLI->>User: show_permission_dialog(code)
    User-->>CLI: approve/deny
  end
  alt approved
    CLI->>GA: execute(action/code)
    GA-->>CLI: result (next/done/fail)
  end
  CLI-->>User: display status
Loading
sequenceDiagram
  autonumber
  participant Runner as run.py / run_local.py
  participant Env as DesktopEnv
  participant GA as OSWorldACI
  participant A3 as AgentS3
  participant Lib as lib_run_single.py
  participant FS as Filesystem

  Runner->>Env: create environment
  Runner->>GA: init grounding (engine params)
  Runner->>A3: init AgentS3 (Worker + GA)
  Runner->>Lib: run_single_example(...)
  Lib->>Env: reset()
  loop step N
    Lib->>Env: observe()
    Lib->>A3: predict(instruction, obs)
    A3-->>Lib: actions/code
    Lib->>Env: step(action)
    Env-->>Lib: new_obs, reward, done
    Lib->>FS: save step_N.png and traj.jsonl
  end
  Lib->>Env: evaluate()
  Lib->>FS: write result.txt
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

I hop and nibble through new code vines,
S3 sprouts leaves and shiny lines—
Engines hum, the worker thinks,
Code executes with tiny blinks.
Facts and captions, judge in queue,
A rabbit cheers: “Deployed—woohoo!” 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title “S3 🧠🤓🤯” is overly brief, uses emojis, and fails to clearly describe the primary change of introducing Agent S3 support across the README, CLI, SDK, and core modules. Please rename the pull request to a concise, descriptive title such as “Introduce Agent S3 with updated CLI, SDK imports, grounding, and documentation” so that reviewers can quickly understand the main change.
Docstring Coverage ⚠️ Warning Docstring coverage is 52.31% which is insufficient. The required threshold is 80.00%. You can run `@coderabbitai generate docstrings` to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2c1e1b and 2e79129.

📒 Files selected for processing (27)
  • .github/workflows/lint.yml (1 hunks)
  • gui_agents/s1/cli_app.py (5 hunks)
  • gui_agents/s2/cli_app.py (6 hunks)
  • gui_agents/s2_5/cli_app.py (7 hunks)
  • gui_agents/s3/agents/agent_s.py (1 hunks)
  • gui_agents/s3/agents/code_agent.py (1 hunks)
  • gui_agents/s3/agents/grounding.py (1 hunks)
  • gui_agents/s3/agents/worker.py (1 hunks)
  • gui_agents/s3/bbon/behavior_narrator.py (1 hunks)
  • gui_agents/s3/bbon/comparative_judge.py (1 hunks)
  • gui_agents/s3/cli_app.py (1 hunks)
  • gui_agents/s3/core/engine.py (1 hunks)
  • gui_agents/s3/core/mllm.py (1 hunks)
  • gui_agents/s3/memory/procedural_memory.py (1 hunks)
  • gui_agents/s3/utils/common_utils.py (1 hunks)
  • gui_agents/s3/utils/formatters.py (1 hunks)
  • osworld_setup/s2_5/lib_run_single.py (3 hunks)
  • osworld_setup/s2_5/lib_run_single_local.py (3 hunks)
  • osworld_setup/s2_5/run.py (17 hunks)
  • osworld_setup/s2_5/run_local.py (8 hunks)
  • osworld_setup/s3/bbon/generate_facts.py (1 hunks)
  • osworld_setup/s3/bbon/run_judge.py (1 hunks)
  • osworld_setup/s3/bbon/utils.py (1 hunks)
  • osworld_setup/s3/lib_run_single.py (1 hunks)
  • osworld_setup/s3/run.py (1 hunks)
  • osworld_setup/s3/run_local.py (1 hunks)
  • setup.py (2 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 23

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d78a6eb and c2c1e1b.

⛔ Files ignored due to path filters (1)
  • images/s3_results.png is excluded by !**/*.png
📒 Files selected for processing (22)
  • README.md (9 hunks)
  • gui_agents/s3/agents/agent_s.py (1 hunks)
  • gui_agents/s3/agents/code_agent.py (1 hunks)
  • gui_agents/s3/agents/grounding.py (1 hunks)
  • gui_agents/s3/agents/worker.py (1 hunks)
  • gui_agents/s3/bbon/behavior_narrator.py (1 hunks)
  • gui_agents/s3/bbon/comparative_judge.py (1 hunks)
  • gui_agents/s3/cli_app.py (1 hunks)
  • gui_agents/s3/core/engine.py (1 hunks)
  • gui_agents/s3/core/mllm.py (1 hunks)
  • gui_agents/s3/core/module.py (1 hunks)
  • gui_agents/s3/memory/procedural_memory.py (1 hunks)
  • gui_agents/s3/utils/common_utils.py (1 hunks)
  • gui_agents/s3/utils/formatters.py (1 hunks)
  • osworld_setup/s3/OSWorld.md (1 hunks)
  • osworld_setup/s3/bbon/generate_facts.py (1 hunks)
  • osworld_setup/s3/bbon/run_judge.py (1 hunks)
  • osworld_setup/s3/bbon/utils.py (1 hunks)
  • osworld_setup/s3/lib_run_single.py (1 hunks)
  • osworld_setup/s3/run.py (1 hunks)
  • osworld_setup/s3/run.sh (1 hunks)
  • osworld_setup/s3/run_local.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (17)
osworld_setup/s3/run_local.py (3)
gui_agents/s3/agents/agent_s.py (1)
  • AgentS3 (48-102)
gui_agents/s3/agents/grounding.py (2)
  • OSWorldACI (180-646)
  • open (388-396)
osworld_setup/s3/lib_run_single.py (1)
  • run_single_example (12-80)
gui_agents/s3/core/module.py (2)
gui_agents/s3/core/mllm.py (2)
  • LMMAgent (17-305)
  • add_system_prompt (71-84)
gui_agents/s1/core/BaseModule.py (1)
  • BaseModule (6-18)
gui_agents/s3/utils/formatters.py (1)
gui_agents/s3/utils/common_utils.py (4)
  • extract_agent_functions (144-154)
  • parse_code_from_string (121-142)
  • create_pyautogui_code (13-30)
  • split_thinking_response (109-119)
gui_agents/s3/bbon/comparative_judge.py (4)
gui_agents/s3/core/mllm.py (1)
  • LMMAgent (17-305)
gui_agents/s3/memory/procedural_memory.py (1)
  • PROCEDURAL_MEMORY (5-367)
gui_agents/s3/utils/common_utils.py (2)
  • call_llm_formatted (55-106)
  • split_thinking_response (109-119)
osworld_setup/s3/bbon/utils.py (2)
  • get_final_screenshot_file (101-133)
  • image_to_openai_message_format (9-42)
osworld_setup/s3/lib_run_single.py (1)
gui_agents/s3/agents/agent_s.py (4)
  • reset (31-33)
  • reset (77-85)
  • predict (35-45)
  • predict (87-102)
gui_agents/s3/agents/agent_s.py (2)
gui_agents/s3/agents/grounding.py (1)
  • ACI (20-22)
gui_agents/s3/agents/worker.py (3)
  • Worker (24-335)
  • reset (62-79)
  • generate_next_action (170-335)
osworld_setup/s3/bbon/run_judge.py (2)
osworld_setup/s3/bbon/utils.py (4)
  • get_new_tasks_classification (154-215)
  • evaluate_comparative_results (254-283)
  • load_task_instruction (63-98)
  • load_facts (45-61)
gui_agents/s3/bbon/comparative_judge.py (2)
  • ComparativeJudge (58-100)
  • judge (62-100)
gui_agents/s3/agents/code_agent.py (3)
gui_agents/s3/memory/procedural_memory.py (1)
  • PROCEDURAL_MEMORY (5-367)
gui_agents/s3/utils/common_utils.py (2)
  • call_llm_safe (32-53)
  • split_thinking_response (109-119)
gui_agents/s3/core/mllm.py (3)
  • LMMAgent (17-305)
  • reset (60-69)
  • add_message (112-272)
gui_agents/s3/agents/worker.py (5)
gui_agents/s3/agents/grounding.py (5)
  • ACI (20-22)
  • assign_screenshot (327-328)
  • set_task_instruction (330-332)
  • open (388-396)
  • wait (629-634)
gui_agents/s3/core/module.py (2)
  • BaseModule (5-17)
  • _create_agent (10-17)
gui_agents/s3/memory/procedural_memory.py (2)
  • PROCEDURAL_MEMORY (5-367)
  • construct_simple_worker_procedural_memory (14-111)
gui_agents/s3/utils/common_utils.py (5)
  • call_llm_safe (32-53)
  • call_llm_formatted (55-106)
  • parse_code_from_string (121-142)
  • split_thinking_response (109-119)
  • create_pyautogui_code (13-30)
gui_agents/s3/core/mllm.py (3)
  • reset (60-69)
  • add_system_prompt (71-84)
  • add_message (112-272)
osworld_setup/s3/bbon/generate_facts.py (2)
gui_agents/s3/bbon/behavior_narrator.py (2)
  • BehaviorNarrator (14-181)
  • judge (131-181)
osworld_setup/s3/bbon/utils.py (1)
  • get_new_tasks_classification (154-215)
osworld_setup/s3/run.py (4)
gui_agents/s3/agents/agent_s.py (1)
  • AgentS3 (48-102)
gui_agents/s3/agents/grounding.py (2)
  • OSWorldACI (180-646)
  • open (388-396)
osworld_setup/s3/lib_run_single.py (1)
  • run_single_example (12-80)
osworld_setup/s3/run_local.py (4)
  • config (66-155)
  • test (158-282)
  • get_unfinished (285-319)
  • get_result (322-354)
gui_agents/s3/core/mllm.py (1)
gui_agents/s3/core/engine.py (17)
  • LMMEngineAnthropic (60-135)
  • LMMEngineAzureOpenAI (224-282)
  • LMMEngineHuggingFace (334-367)
  • LMMEngineOpenAI (19-57)
  • LMMEngineOpenRouter (181-221)
  • LMMEngineParasail (370-405)
  • LMMEnginevLLM (285-331)
  • LMMEngineGemini (138-178)
  • generate_with_thinking (111-135)
  • generate (35-57)
  • generate (74-105)
  • generate (153-178)
  • generate (196-221)
  • generate (249-282)
  • generate (300-331)
  • generate (344-367)
  • generate (382-405)
gui_agents/s3/utils/common_utils.py (3)
gui_agents/s3/memory/procedural_memory.py (1)
  • PROCEDURAL_MEMORY (5-367)
gui_agents/s3/agents/grounding.py (1)
  • assign_screenshot (327-328)
gui_agents/s3/core/mllm.py (1)
  • get_response (274-305)
gui_agents/s3/bbon/behavior_narrator.py (3)
gui_agents/s3/core/mllm.py (1)
  • LMMAgent (17-305)
gui_agents/s3/memory/procedural_memory.py (1)
  • PROCEDURAL_MEMORY (5-367)
gui_agents/s3/utils/common_utils.py (3)
  • call_llm_formatted (55-106)
  • split_thinking_response (109-119)
  • compress_image (156-172)
gui_agents/s3/cli_app.py (2)
gui_agents/s3/agents/grounding.py (1)
  • OSWorldACI (180-646)
gui_agents/s3/agents/agent_s.py (5)
  • AgentS3 (48-102)
  • predict (35-45)
  • predict (87-102)
  • reset (31-33)
  • reset (77-85)
osworld_setup/s3/bbon/utils.py (1)
gui_agents/s3/bbon/comparative_judge.py (2)
  • image_to_openai_message_format (31-55)
  • get_final_screenshot_file (10-28)
gui_agents/s3/agents/grounding.py (5)
gui_agents/s3/memory/procedural_memory.py (1)
  • PROCEDURAL_MEMORY (5-367)
gui_agents/s3/core/mllm.py (3)
  • LMMAgent (17-305)
  • reset (60-69)
  • add_message (112-272)
gui_agents/s3/utils/common_utils.py (1)
  • call_llm_safe (32-53)
gui_agents/s3/agents/code_agent.py (3)
  • CodeAgent (87-276)
  • reset (102-108)
  • execute (110-221)
gui_agents/s3/agents/worker.py (1)
  • reset (62-79)
🪛 Ruff (0.13.2)
osworld_setup/s3/run_local.py

266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


322-322: Unused function argument: total_file_json

(ARG001)


346-346: Do not use bare except

(E722)

gui_agents/s3/core/module.py

11-11: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

gui_agents/s3/utils/formatters.py

9-9: Do not assign a lambda expression, use a def

Rewrite single_action_check as a def

(E731)


11-13: Do not assign a lambda expression, use a def

Rewrite SINGLE_ACTION_FORMATTER as a def

(E731)


19-19: Do not catch blind exception: Exception

(BLE001)


19-19: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


21-21: Do not assign a lambda expression, use a def

Rewrite code_valid_check as a def

(E731)


23-25: Do not assign a lambda expression, use a def

Rewrite CODE_VALID_FORMATTER as a def

(E731)


27-27: Do not assign a lambda expression, use a def

Rewrite thoughts_answer_tag_check as a def

(E731)


29-31: Do not assign a lambda expression, use a def

Rewrite THOUGHTS_ANSWER_TAG_FORMATTER as a def

(E731)


33-33: Do not assign a lambda expression, use a def

Rewrite integer_answer_check as a def

(E731)


35-37: Do not assign a lambda expression, use a def

Rewrite INTEGER_ANSWER_FORMATTER as a def

(E731)

gui_agents/s3/bbon/comparative_judge.py

3-3: Redefinition of unused List from line 3

Remove definition: List

(F811)


24-24: Do not use bare except

(E722)


52-52: Consider moving this statement to an else block

(TRY300)


53-53: Do not catch blind exception: Exception

(BLE001)

osworld_setup/s3/lib_run_single.py

6-6: from typing import * used; unable to detect undefined names

(F403)


7-7: from wrapt_timeout_decorator import * used; unable to detect undefined names

(F403)


18-18: Do not catch blind exception: Exception

(BLE001)


18-18: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


25-25: f-string without any placeholders

Remove extraneous f prefix

(F541)

osworld_setup/s3/bbon/run_judge.py

106-106: Loop control variable ans not used within loop body

Rename unused ans to _ans

(B007)


106-106: Loop control variable thoughts not used within loop body

Rename unused thoughts to _thoughts

(B007)


114-114: Unpacked variable maximum_gain is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


128-128: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


145-145: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


145-145: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


145-145: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


145-145: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


145-145: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

gui_agents/s3/agents/code_agent.py

42-42: Consider moving this statement to an else block

(TRY300)


44-44: Do not catch blind exception: Exception

(BLE001)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


93-93: Avoid specifying long messages outside the exception class

(TRY003)


272-272: Do not catch blind exception: Exception

(BLE001)


273-273: Use explicit conversion flag

Replace with conversion flag

(RUF010)


274-274: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

gui_agents/s3/agents/worker.py

95-95: Multiple statements on one line (colon)

(E701)


196-196: f-string without any placeholders

Remove extraneous f prefix

(F541)


203-203: f-string without any placeholders

Remove extraneous f prefix

(F541)


252-252: f-string without any placeholders

Remove extraneous f prefix

(F541)


260-260: Do not catch blind exception: Exception

(BLE001)


261-261: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


264-264: f-string without any placeholders

Remove extraneous f prefix

(F541)


271-271: f-string without any placeholders

Remove extraneous f prefix

(F541)


320-320: Do not catch blind exception: Exception

(BLE001)


321-321: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

osworld_setup/s3/bbon/generate_facts.py

25-25: Do not use bare except

(E722)


25-26: try-except-pass detected, consider logging the exception

(S110)


29-29: Avoid specifying long messages outside the exception class

(TRY003)


37-37: Do not catch blind exception: Exception

(BLE001)


38-38: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


38-38: Create your own exception

(TRY002)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


61-61: Do not use bare except

(E722)


77-77: Do not use bare except

(E722)


77-78: try-except-pass detected, consider logging the exception

(S110)


95-95: Do not catch blind exception: Exception

(BLE001)

osworld_setup/s3/run.py

71-71: Do not catch blind exception: Exception

(BLE001)


72-72: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


88-88: Do not catch blind exception: Exception

(BLE001)


89-89: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


125-125: Do not catch blind exception: Exception

(BLE001)


158-158: Do not catch blind exception: Exception

(BLE001)


160-160: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


161-161: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


166-166: Do not catch blind exception: Exception

(BLE001)


167-167: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


175-175: Do not catch blind exception: Exception

(BLE001)


176-176: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


178-178: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


179-179: Do not catch blind exception: Exception

(BLE001)


180-180: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


182-182: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


189-189: Do not catch blind exception: Exception

(BLE001)


190-190: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


192-192: Unused function argument: frame

(ARG001)


200-200: f-string without any placeholders

Remove extraneous f prefix

(F541)


202-202: f-string without any placeholders

Remove extraneous f prefix

(F541)


203-203: Do not catch blind exception: Exception

(BLE001)


204-204: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


210-210: Do not catch blind exception: Exception

(BLE001)


211-211: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


219-219: Do not catch blind exception: Exception

(BLE001)


220-220: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


395-395: Do not catch blind exception: Exception

(BLE001)


396-396: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


439-439: Unused function argument: total_file_json

(ARG001)


463-463: Do not use bare except

(E722)

gui_agents/s3/core/mllm.py

39-39: Avoid specifying long messages outside the exception class

(TRY003)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


272-272: Prefer TypeError exception for invalid type

(TRY004)


272-272: Avoid specifying long messages outside the exception class

(TRY003)

gui_agents/s3/utils/common_utils.py

29-29: Use of possibly insecure function; consider using ast.literal_eval

(S307)


47-47: Do not catch blind exception: Exception

(BLE001)


117-117: Consider moving this statement to an else block

(TRY300)


118-118: Do not catch blind exception: Exception

(BLE001)


118-118: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


156-156: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

gui_agents/s3/bbon/behavior_narrator.py

46-46: Function definition does not bind loop variable width

(B023)


48-48: Function definition does not bind loop variable height

(B023)


50-50: Function definition does not bind loop variable width

(B023)


52-52: Function definition does not bind loop variable height

(B023)


54-54: Function definition does not bind loop variable width

(B023)


54-54: Function definition does not bind loop variable height

(B023)


81-81: Avoid specifying long messages outside the exception class

(TRY003)

gui_agents/s3/cli_app.py

41-41: Do not use bare except

(E722)


44-44: Unused function argument: signum

(ARG001)


44-44: Unused function argument: frame

(ARG001)


130-130: Starting a process with a shell, possible injection detected

(S605)


135-135: Starting a process with a shell, possible injection detected

(S605)


153-153: Local variable subtask_traj is assigned to but never used

Remove assignment to unused variable subtask_traj

(F841)


182-182: Starting a process with a shell, possible injection detected

(S605)


183-183: f-string without any placeholders

Remove extraneous f prefix

(F541)


186-186: Starting a process with a shell, possible injection detected

(S605)


187-187: f-string without any placeholders

Remove extraneous f prefix

(F541)


209-209: Use of exec detected

(S102)

gui_agents/s3/memory/procedural_memory.py

16-63: f-string without any placeholders

Remove extraneous f prefix

(F541)

osworld_setup/s3/bbon/utils.py

9-9: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


38-38: Consider moving this statement to an else block

(TRY300)


40-40: Do not catch blind exception: Exception

(BLE001)


94-94: Consider moving this statement to an else block

(TRY300)


96-96: Do not catch blind exception: Exception

(BLE001)


149-149: Do not catch blind exception: Exception

(BLE001)


254-254: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


274-274: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)

gui_agents/s3/agents/grounding.py

31-48: f-string without any placeholders

Remove extraneous f prefix

(F541)


190-190: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


350-350: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


365-365: Use explicit conversion flag

Replace with conversion flag

(RUF010)


366-366: Use explicit conversion flag

Replace with conversion flag

(RUF010)


368-368: Use explicit conversion flag

Replace with conversion flag

(RUF010)


379-379: Use explicit conversion flag

Replace with conversion flag

(RUF010)


383-383: Use explicit conversion flag

Replace with conversion flag

(RUF010)


385-385: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


394-394: Use explicit conversion flag

Replace with conversion flag

(RUF010)


396-396: Use explicit conversion flag

Replace with conversion flag

(RUF010)


431-431: Use explicit conversion flag

Replace with conversion flag

(RUF010)


440-440: Use explicit conversion flag

Replace with conversion flag

(RUF010)


441-441: Use explicit conversion flag

Replace with conversion flag

(RUF010)


444-444: Use explicit conversion flag

Replace with conversion flag

(RUF010)


461-461: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


479-479: Use explicit conversion flag

Replace with conversion flag

(RUF010)


482-482: Use explicit conversion flag

Replace with conversion flag

(RUF010)


524-524: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


621-621: Use explicit conversion flag

Replace with conversion flag

(RUF010)


624-624: Use explicit conversion flag

Replace with conversion flag

(RUF010)

gui_agents/s3/core/engine.py

9-9: Redefinition of unused AzureOpenAI from line 6

Remove definition: AzureOpenAI

(F811)


21-21: Unused method argument: kwargs

(ARG002)


35-35: Unused method argument: max_new_tokens

(ARG002)


38-40: Avoid specifying long messages outside the exception class

(TRY003)


62-62: Unused method argument: base_url

(ARG002)


62-62: Unused method argument: kwargs

(ARG002)


77-79: Avoid specifying long messages outside the exception class

(TRY003)


92-92: Local variable thoughts is assigned to but never used

Remove assignment to unused variable thoughts

(F841)


112-112: Unused method argument: temperature

(ARG002)


112-112: Unused method argument: max_new_tokens

(ARG002)


117-119: Avoid specifying long messages outside the exception class

(TRY003)


140-140: Unused method argument: kwargs

(ARG002)


156-158: Avoid specifying long messages outside the exception class

(TRY003)


161-163: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Unused method argument: kwargs

(ARG002)


199-201: Avoid specifying long messages outside the exception class

(TRY003)


204-206: Avoid specifying long messages outside the exception class

(TRY003)


227-227: Unused method argument: base_url

(ARG002)


234-234: Unused method argument: kwargs

(ARG002)


252-254: Avoid specifying long messages outside the exception class

(TRY003)


257-259: Avoid specifying long messages outside the exception class

(TRY003)


262-264: Avoid specifying long messages outside the exception class

(TRY003)


287-287: Unused method argument: kwargs

(ARG002)


307-307: Unused method argument: kwargs

(ARG002)


311-313: Avoid specifying long messages outside the exception class

(TRY003)


316-318: Avoid specifying long messages outside the exception class

(TRY003)


335-335: Unused method argument: kwargs

(ARG002)


347-349: Avoid specifying long messages outside the exception class

(TRY003)


352-354: Avoid specifying long messages outside the exception class

(TRY003)


371-371: Unused method argument: kwargs

(ARG002)


385-387: Avoid specifying long messages outside the exception class

(TRY003)


390-392: Avoid specifying long messages outside the exception class

(TRY003)

🪛 Shellcheck (0.11.0)
osworld_setup/s3/run.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

Comment on lines +149 to +218
action_upper = action.upper().strip()
if action_upper == "DONE":
logger.info(f"Step {step_count + 1}: Task completed successfully")
completion_reason = "DONE"
break
elif action_upper == "FAIL":
logger.info(f"Step {step_count + 1}: Task failed by agent request")
completion_reason = "FAIL"
break

# Extract and execute code
code_type, code = extract_code_block(action)

if code:
result = execute_code(code_type, code, env_controller)
# Prepare formatted output and error for logging
output = result.get("output", "")
error = result.get("error", "")
message = result.get("message", "")
status = result.get("status", "")

log_lines = [
f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:",
f"Status: {status}" if status else None,
]

if output:
log_lines.append("Output:\n" + ("-" * 40) + f"\n{output}\n" + ("-" * 40))
if error:
log_lines.append("Error:\n" + ("!" * 40) + f"\n{error}\n" + ("!" * 40))
if message and not output and not error:
log_lines.append("Message:\n" + ("-" * 40) + f"\n{message}\n" + ("-" * 40))

# Remove None entries and join
formatted_log = "\n".join([line for line in log_lines if line])
logger.info(formatted_log)
else:
logger.warning(f"Step {step_count + 1}: No code block found in action")
result = {"status": "skipped", "message": "No code block found"}
logger.info(
f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:\n"
f"Status: skipped\n"
f"Message:\n{'-' * 40}\n{result['message']}\n{'-' * 40}"
)
# Add assistant's thoughts and code to message history
self.agent.add_message(response, role="assistant")

# Process result and add formatted environment results as user message
result_context = format_result(result, step_count)
self.agent.add_message(result_context, role="user")

step_count += 1

# Handle budget exhaustion
if 'completion_reason' not in locals():
logger.info(f"Budget exhausted after {step_count} steps")
completion_reason = f"BUDGET_EXHAUSTED_AFTER_{step_count}_STEPS"

# Generate final summary
logger.info("Generating execution summary")
summary = self._generate_summary(execution_history, task_instruction)

result = {
"task_instruction": task_instruction,
"completion_reason": completion_reason,
"summary": summary,
"execution_history": execution_history,
"steps_executed": step_count,
"budget": self.budget
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Count DONE/FAIL turns before breaking.

When the model returns DONE/FAIL we break out before step_count is incremented, so the result on Line 216 reports steps_executed = 0 even though we consumed a turn and appended history. That skews cost tracking and downstream logic that relies on the right step count. Increment the counter before each early break (or derive the value from execution_history). For example:

             if action_upper == "DONE":
                 logger.info(f"Step {step_count + 1}: Task completed successfully")
                 completion_reason = "DONE"
+                step_count += 1
                 break
             elif action_upper == "FAIL":
                 logger.info(f"Step {step_count + 1}: Task failed by agent request")
                 completion_reason = "FAIL"
+                step_count += 1
                 break

With this change the steps_executed field and final log lines reflect the actual number of turns processed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
action_upper = action.upper().strip()
if action_upper == "DONE":
logger.info(f"Step {step_count + 1}: Task completed successfully")
completion_reason = "DONE"
break
elif action_upper == "FAIL":
logger.info(f"Step {step_count + 1}: Task failed by agent request")
completion_reason = "FAIL"
break
# Extract and execute code
code_type, code = extract_code_block(action)
if code:
result = execute_code(code_type, code, env_controller)
# Prepare formatted output and error for logging
output = result.get("output", "")
error = result.get("error", "")
message = result.get("message", "")
status = result.get("status", "")
log_lines = [
f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:",
f"Status: {status}" if status else None,
]
if output:
log_lines.append("Output:\n" + ("-" * 40) + f"\n{output}\n" + ("-" * 40))
if error:
log_lines.append("Error:\n" + ("!" * 40) + f"\n{error}\n" + ("!" * 40))
if message and not output and not error:
log_lines.append("Message:\n" + ("-" * 40) + f"\n{message}\n" + ("-" * 40))
# Remove None entries and join
formatted_log = "\n".join([line for line in log_lines if line])
logger.info(formatted_log)
else:
logger.warning(f"Step {step_count + 1}: No code block found in action")
result = {"status": "skipped", "message": "No code block found"}
logger.info(
f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:\n"
f"Status: skipped\n"
f"Message:\n{'-' * 40}\n{result['message']}\n{'-' * 40}"
)
# Add assistant's thoughts and code to message history
self.agent.add_message(response, role="assistant")
# Process result and add formatted environment results as user message
result_context = format_result(result, step_count)
self.agent.add_message(result_context, role="user")
step_count += 1
# Handle budget exhaustion
if 'completion_reason' not in locals():
logger.info(f"Budget exhausted after {step_count} steps")
completion_reason = f"BUDGET_EXHAUSTED_AFTER_{step_count}_STEPS"
# Generate final summary
logger.info("Generating execution summary")
summary = self._generate_summary(execution_history, task_instruction)
result = {
"task_instruction": task_instruction,
"completion_reason": completion_reason,
"summary": summary,
"execution_history": execution_history,
"steps_executed": step_count,
"budget": self.budget
}
action_upper = action.upper().strip()
if action_upper == "DONE":
logger.info(f"Step {step_count + 1}: Task completed successfully")
completion_reason = "DONE"
step_count += 1
break
elif action_upper == "FAIL":
logger.info(f"Step {step_count + 1}: Task failed by agent request")
completion_reason = "FAIL"
step_count += 1
break
🤖 Prompt for AI Agents
In gui_agents/s3/agents/code_agent.py around lines 149 to 218, the loop breaks
on agent actions "DONE" or "FAIL" before step_count is incremented, causing
steps_executed to undercount; increment step_count immediately before each early
break (for both DONE and FAIL) so the counter reflects the consumed turn, keep
completion_reason assignment and break as-is, and ensure any logs that reference
step_count use the incremented value; alternatively derive steps_executed from
len(execution_history) when building the final result if you prefer not to
mutate step_count mid-loop.

Comment on lines +2 to +9
import re
from collections import defaultdict
from io import BytesIO
from typing import Any, Dict, List, Optional, Tuple

import pytesseract
from PIL import Image
from pytesseract import Output
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Decode the base64 screenshot before handing it to Pillow/Tesseract.

obs["screenshot"] is base64-encoded in this pipeline. Passing that string straight into BytesIO raises TypeError: a bytes-like object is required. Decode (and handle already-bytes input) before calling Image.open, and import base64 for it.

-import re
+import base64
+import re
@@
-        image = Image.open(BytesIO(b64_image_data))
+        if isinstance(b64_image_data, str):
+            image_bytes = base64.b64decode(b64_image_data)
+        else:
+            image_bytes = b64_image_data
+        image = Image.open(BytesIO(image_bytes))

Also applies to: 247-250

🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 2 to 9 (and likewise at lines
247-250), the code passes obs["screenshot"] directly into BytesIO which fails
when the screenshot is a base64-encoded string; update the import list to
include base64, detect if obs["screenshot"] is a str and decode it with
base64.b64decode (otherwise accept bytes as-is), then pass the resulting bytes
into BytesIO before calling Image.open and pytesseract; ensure the code handles
decoding errors gracefully (e.g., try/except) and keeps existing behavior when
input is already bytes.

Comment on lines +31 to +48
UBUNTU_APP_SETUP = f"""import subprocess;
import difflib;
import pyautogui;
pyautogui.press('escape');
time.sleep(0.5);
output = subprocess.check_output(['wmctrl', '-lx']);
output = output.decode('utf-8').splitlines();
window_titles = [line.split(None, 4)[2] for line in output];
closest_matches = difflib.get_close_matches('APP_NAME', window_titles, n=1, cutoff=0.1);
if closest_matches:
closest_match = closest_matches[0];
for line in output:
if closest_match in line:
window_id = line.split()[0]
break;
subprocess.run(['wmctrl', '-ia', window_id])
subprocess.run(['wmctrl', '-ir', window_id, '-b', 'add,maximized_vert,maximized_horz'])
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add missing time import in Ubuntu app switcher snippet.

UBUNTU_APP_SETUP executes time.sleep(...) but never imports time, so the generated snippet throws NameError as soon as it runs. Add import time; alongside the other imports in the template.

-UBUNTU_APP_SETUP = """import subprocess;
-import difflib;
-import pyautogui;
+UBUNTU_APP_SETUP = """import subprocess;
+import difflib;
+import pyautogui;
+import time;
 pyautogui.press('escape');

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.13.2)

31-48: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 31 to 48, the UBUNTU_APP_SETUP
template calls time.sleep(...) but never imports the time module; update the
template to add "import time;" alongside the existing imports at the top of the
f-string so the generated snippet defines time before use and avoids a NameError
at runtime.

Comment on lines +106 to +112
subprocess.run(
'echo \"osworld-public-evaluation\" | sudo -S ss --kill --tcp state TIME-WAIT sport = :2002',
shell=True,
check=True,
text=True,
capture_output=True
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove hard-coded sudo credentials from generated commands.

Both command templates bake in the root password ("osworld-public-evaluation") and feed it to sudo -S. That is a hard-coded credential in source control and a direct compliance/security violation. Please refactor these paths to elevate without embedding secrets (e.g. rely on pre-configured permissions or prompt the environment securely).

Also applies to: 418-420

🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 106-112 (and similarly at
418-420), remove the hard-coded sudo password embedded in the command; do not
echo a password into sudo -S. Instead, run the command without embedding
credentials (preferably as a list argument to subprocess.run and without
shell=True), or check for elevated privileges first (e.g., os.geteuid() == 0)
and raise a clear error asking the caller to run the process with appropriate
permissions; alternatively, if interactive elevation is required, obtain the
password securely (e.g., getpass.getpass or a protected env/config source) but
avoid storing secrets in code. Also replace shell=True usage with a safe list
form and preserve check=True/capture_output/text behavior; apply the same
changes to the occurrence at lines 418-420.

Comment on lines +309 to +314
if len(numericals) > 0:
text_id = int(numericals[-1])
else:
text_id = 0
elem = ocr_elements[text_id]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against out-of-range word ids from the text span agent.

If the LLM returns an id outside ocr_elements, this currently raises IndexError and crashes the action. Validate the id (and the OCR result) before indexing so we can surface a controlled failure.

         if len(numericals) > 0:
             text_id = int(numericals[-1])
         else:
             text_id = 0
-        elem = ocr_elements[text_id]
+        if not ocr_elements:
+            raise RuntimeError("OCR returned no tokens to ground against.")
+        if text_id < 0 or text_id >= len(ocr_elements):
+            raise ValueError(f"Grounding LLM produced invalid word id {text_id}; valid range is 0-{len(ocr_elements) - 1}.")
+        elem = ocr_elements[text_id]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(numericals) > 0:
text_id = int(numericals[-1])
else:
text_id = 0
elem = ocr_elements[text_id]
if len(numericals) > 0:
text_id = int(numericals[-1])
else:
text_id = 0
if not ocr_elements:
raise RuntimeError("OCR returned no tokens to ground against.")
if text_id < 0 or text_id >= len(ocr_elements):
raise ValueError(f"Grounding LLM produced invalid word id {text_id}; valid range is 0-{len(ocr_elements) - 1}.")
elem = ocr_elements[text_id]
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 309 to 314, the code assumes
the numeric id parsed from the LLM is valid and indexes ocr_elements directly,
which can raise IndexError; validate that numericals is non-empty, parse the
last numeric safely, ensure it is within 0 <= text_id < len(ocr_elements) before
using it, and if it is missing or out-of-range return or raise a controlled
error (with a clear message and optional logging) instead of indexing into
ocr_elements.

Comment on lines +41 to +42
result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action)
result["screenshot_num"] = i + 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Pass the required screenshot_num to BehaviorNarrator.judge.
BehaviorNarrator.judge expects (screenshot_num, before_img_bytes, after_img_bytes, pyautogui_action) (see behavior_narrator.py). The current asyncio.to_thread call omits the first argument, so it raises TypeError and stops caption generation. Forward the step index before the image bytes.

-    result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action)
+    screenshot_num = i + 1
+    result = await asyncio.to_thread(
+        judge.judge,
+        screenshot_num,
+        before_bytes,
+        after_bytes,
+        pyautogui_action,
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action)
result["screenshot_num"] = i + 1
screenshot_num = i + 1
result = await asyncio.to_thread(
judge.judge,
screenshot_num,
before_bytes,
after_bytes,
pyautogui_action,
)
result["screenshot_num"] = i + 1
🤖 Prompt for AI Agents
In osworld_setup/s3/bbon/generate_facts.py around lines 41-42, the call to
asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action) is
missing the required first parameter screenshot_num; update the to_thread
invocation to pass the step index (i + 1) as the first argument — e.g.
asyncio.to_thread(judge.judge, i + 1, before_bytes, after_bytes,
pyautogui_action) — and you can keep setting result["screenshot_num"] = i + 1
afterwards.

Comment on lines +16 to +19
try:
agent.reset(runtime_logger)
except Exception as e:
agent.reset()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Call AgentS3.reset without extra arguments

AgentS3.reset (via UIAgent.reset) only accepts self, so passing runtime_logger raises TypeError before the episode even starts. The fallback in the except block hits the same method but also without the logger, so nothing relies on that argument anyway. Just call reset() bare.

-    try:
-        agent.reset(runtime_logger)
+    try:
+        agent.reset()
     except Exception as e:
         agent.reset()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
agent.reset(runtime_logger)
except Exception as e:
agent.reset()
try:
agent.reset()
except Exception as e:
agent.reset()
🧰 Tools
🪛 Ruff (0.13.2)

18-18: Do not catch blind exception: Exception

(BLE001)


18-18: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🤖 Prompt for AI Agents
In osworld_setup/s3/lib_run_single.py around lines 16-19, AgentS3.reset only
accepts self so passing runtime_logger causes a TypeError; remove the extra
argument and call agent.reset() without parameters (and eliminate the redundant
except block), i.e. replace the try/except that calls
agent.reset(runtime_logger) with a single bare agent.reset() call.

@@ -0,0 +1,54 @@
# Deplying Agent S3 in OSWorld
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typo in heading.

Opening header should read “Deploying Agent S3 in OSWorld.”

🤖 Prompt for AI Agents
In osworld_setup/s3/OSWorld.md around lines 1 to 1, the header contains a typo
"Deplying Agent S3 in OSWorld"; change the heading to "Deploying Agent S3 in
OSWorld" by correcting the misspelled word so the opening header reads exactly
that.

Comment on lines +30 to +39
file_handler = logging.FileHandler(
os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8"
)
debug_handler = logging.FileHandler(
os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8"
)
stdout_handler = logging.StreamHandler(sys.stdout)
sdebug_handler = logging.FileHandler(
os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Ensure the logs/ directory exists before adding file handlers

Right now the first logging.FileHandler instantiation will raise FileNotFoundError whenever logs/ hasn't been created yet, so the runner dies before doing anything useful. Please create the directory before wiring up the handlers.

 datetime_str: str = datetime.datetime.now().strftime("%Y%m%d@%H%M%S")
 
+os.makedirs("logs", exist_ok=True)
+
 file_handler = logging.FileHandler(
     os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8"
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
file_handler = logging.FileHandler(
os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8"
)
debug_handler = logging.FileHandler(
os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8"
)
stdout_handler = logging.StreamHandler(sys.stdout)
sdebug_handler = logging.FileHandler(
os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8"
)
datetime_str: str = datetime.datetime.now().strftime("%Y%m%d@%H%M%S")
os.makedirs("logs", exist_ok=True)
file_handler = logging.FileHandler(
os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8"
)
debug_handler = logging.FileHandler(
os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8"
)
stdout_handler = logging.StreamHandler(sys.stdout)
sdebug_handler = logging.FileHandler(
os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8"
)
🤖 Prompt for AI Agents
In osworld_setup/s3/run_local.py around lines 30 to 39, the code creates
logging.FileHandler instances for the logs/ directory without ensuring that
directory exists, which will raise FileNotFoundError; before instantiating any
FileHandler, create the logs directory (e.g., os.makedirs("logs",
exist_ok=True)) and handle potential permissions errors, then proceed to
construct the FileHandler objects so handlers can open files successfully.

Comment on lines +281 to +283
env.close()
logger.info(f"Average score: {sum(scores) / len(scores)}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against empty scores before computing the average

When every example is skipped or fails before recording a score, len(scores) stays at zero and this line raises ZeroDivisionError, terminating the script right at shutdown. Handle the empty case explicitly.

-    logger.info(f"Average score: {sum(scores) / len(scores)}")
+    if scores:
+        logger.info("Average score: %.2f", sum(scores) / len(scores))
+    else:
+        logger.warning("Average score unavailable: no successful runs recorded.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
env.close()
logger.info(f"Average score: {sum(scores) / len(scores)}")
env.close()
if scores:
logger.info("Average score: %.2f", sum(scores) / len(scores))
else:
logger.warning("Average score unavailable: no successful runs recorded.")
🤖 Prompt for AI Agents
In osworld_setup/s3/run_local.py around lines 281-283, computing
sum(scores)/len(scores) can raise ZeroDivisionError when scores is empty; guard
by checking if scores before dividing and log a clear message or default value.
Modify the shutdown code to: after env.close(), if scores: compute and log the
average; else log "No scores recorded" (or log average as 0/None) so the script
does not crash on shutdown.

@alckasoc alckasoc merged commit 6e01592 into main Oct 3, 2025
2 checks passed
jessepwj pushed a commit to jessepwj/claw-ui-s that referenced this pull request Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant