Conversation
|
Caution Review failedThe pull request is closed. WalkthroughReplaces Agent S2.5 with Agent S3 across docs and code. Adds a new S3 stack: AgentS3/UIAgent, Worker, OSWorldACI grounding, CodeAgent, LMMAgent and multi-provider LMM engines, procedural memory, formatters and utils, CLI app, behavior narration/comparative judging, and OSWorld orchestration (runners, fact generation, judge scripts, and docs). Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CLI as cli_app.py
participant A3 as AgentS3
participant W as Worker
participant GA as OSWorldACI
participant LLM as LMMAgent/LMMEngine
participant CA as CodeAgent
User->>CLI: submit instruction
CLI->>GA: assign_screenshot(obs)
CLI->>A3: predict(instruction, obs)
A3->>W: generate_next_action(instruction, obs)
W->>LLM: reflection prompt (optional)
LLM-->>W: reflection
W->>LLM: planning prompt (+history, reflection)
LLM-->>W: plan (may include code)
W->>CA: call_code_agent(plan or subtask) [optional]
CA->>LLM: step prompts
LLM-->>CA: step responses
CA-->>W: execution results / summary
W-->>A3: executor_info + actions/code
A3-->>CLI: info, actions/code
alt code requires user approval
CLI->>User: show_permission_dialog(code)
User-->>CLI: approve/deny
end
alt approved
CLI->>GA: execute(action/code)
GA-->>CLI: result (next/done/fail)
end
CLI-->>User: display status
sequenceDiagram
autonumber
participant Runner as run.py / run_local.py
participant Env as DesktopEnv
participant GA as OSWorldACI
participant A3 as AgentS3
participant Lib as lib_run_single.py
participant FS as Filesystem
Runner->>Env: create environment
Runner->>GA: init grounding (engine params)
Runner->>A3: init AgentS3 (Worker + GA)
Runner->>Lib: run_single_example(...)
Lib->>Env: reset()
loop step N
Lib->>Env: observe()
Lib->>A3: predict(instruction, obs)
A3-->>Lib: actions/code
Lib->>Env: step(action)
Env-->>Lib: new_obs, reward, done
Lib->>FS: save step_N.png and traj.jsonl
end
Lib->>Env: evaluate()
Lib->>FS: write result.txt
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (27)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 23
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
images/s3_results.pngis excluded by!**/*.png
📒 Files selected for processing (22)
README.md(9 hunks)gui_agents/s3/agents/agent_s.py(1 hunks)gui_agents/s3/agents/code_agent.py(1 hunks)gui_agents/s3/agents/grounding.py(1 hunks)gui_agents/s3/agents/worker.py(1 hunks)gui_agents/s3/bbon/behavior_narrator.py(1 hunks)gui_agents/s3/bbon/comparative_judge.py(1 hunks)gui_agents/s3/cli_app.py(1 hunks)gui_agents/s3/core/engine.py(1 hunks)gui_agents/s3/core/mllm.py(1 hunks)gui_agents/s3/core/module.py(1 hunks)gui_agents/s3/memory/procedural_memory.py(1 hunks)gui_agents/s3/utils/common_utils.py(1 hunks)gui_agents/s3/utils/formatters.py(1 hunks)osworld_setup/s3/OSWorld.md(1 hunks)osworld_setup/s3/bbon/generate_facts.py(1 hunks)osworld_setup/s3/bbon/run_judge.py(1 hunks)osworld_setup/s3/bbon/utils.py(1 hunks)osworld_setup/s3/lib_run_single.py(1 hunks)osworld_setup/s3/run.py(1 hunks)osworld_setup/s3/run.sh(1 hunks)osworld_setup/s3/run_local.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (17)
osworld_setup/s3/run_local.py (3)
gui_agents/s3/agents/agent_s.py (1)
AgentS3(48-102)gui_agents/s3/agents/grounding.py (2)
OSWorldACI(180-646)open(388-396)osworld_setup/s3/lib_run_single.py (1)
run_single_example(12-80)
gui_agents/s3/core/module.py (2)
gui_agents/s3/core/mllm.py (2)
LMMAgent(17-305)add_system_prompt(71-84)gui_agents/s1/core/BaseModule.py (1)
BaseModule(6-18)
gui_agents/s3/utils/formatters.py (1)
gui_agents/s3/utils/common_utils.py (4)
extract_agent_functions(144-154)parse_code_from_string(121-142)create_pyautogui_code(13-30)split_thinking_response(109-119)
gui_agents/s3/bbon/comparative_judge.py (4)
gui_agents/s3/core/mllm.py (1)
LMMAgent(17-305)gui_agents/s3/memory/procedural_memory.py (1)
PROCEDURAL_MEMORY(5-367)gui_agents/s3/utils/common_utils.py (2)
call_llm_formatted(55-106)split_thinking_response(109-119)osworld_setup/s3/bbon/utils.py (2)
get_final_screenshot_file(101-133)image_to_openai_message_format(9-42)
osworld_setup/s3/lib_run_single.py (1)
gui_agents/s3/agents/agent_s.py (4)
reset(31-33)reset(77-85)predict(35-45)predict(87-102)
gui_agents/s3/agents/agent_s.py (2)
gui_agents/s3/agents/grounding.py (1)
ACI(20-22)gui_agents/s3/agents/worker.py (3)
Worker(24-335)reset(62-79)generate_next_action(170-335)
osworld_setup/s3/bbon/run_judge.py (2)
osworld_setup/s3/bbon/utils.py (4)
get_new_tasks_classification(154-215)evaluate_comparative_results(254-283)load_task_instruction(63-98)load_facts(45-61)gui_agents/s3/bbon/comparative_judge.py (2)
ComparativeJudge(58-100)judge(62-100)
gui_agents/s3/agents/code_agent.py (3)
gui_agents/s3/memory/procedural_memory.py (1)
PROCEDURAL_MEMORY(5-367)gui_agents/s3/utils/common_utils.py (2)
call_llm_safe(32-53)split_thinking_response(109-119)gui_agents/s3/core/mllm.py (3)
LMMAgent(17-305)reset(60-69)add_message(112-272)
gui_agents/s3/agents/worker.py (5)
gui_agents/s3/agents/grounding.py (5)
ACI(20-22)assign_screenshot(327-328)set_task_instruction(330-332)open(388-396)wait(629-634)gui_agents/s3/core/module.py (2)
BaseModule(5-17)_create_agent(10-17)gui_agents/s3/memory/procedural_memory.py (2)
PROCEDURAL_MEMORY(5-367)construct_simple_worker_procedural_memory(14-111)gui_agents/s3/utils/common_utils.py (5)
call_llm_safe(32-53)call_llm_formatted(55-106)parse_code_from_string(121-142)split_thinking_response(109-119)create_pyautogui_code(13-30)gui_agents/s3/core/mllm.py (3)
reset(60-69)add_system_prompt(71-84)add_message(112-272)
osworld_setup/s3/bbon/generate_facts.py (2)
gui_agents/s3/bbon/behavior_narrator.py (2)
BehaviorNarrator(14-181)judge(131-181)osworld_setup/s3/bbon/utils.py (1)
get_new_tasks_classification(154-215)
osworld_setup/s3/run.py (4)
gui_agents/s3/agents/agent_s.py (1)
AgentS3(48-102)gui_agents/s3/agents/grounding.py (2)
OSWorldACI(180-646)open(388-396)osworld_setup/s3/lib_run_single.py (1)
run_single_example(12-80)osworld_setup/s3/run_local.py (4)
config(66-155)test(158-282)get_unfinished(285-319)get_result(322-354)
gui_agents/s3/core/mllm.py (1)
gui_agents/s3/core/engine.py (17)
LMMEngineAnthropic(60-135)LMMEngineAzureOpenAI(224-282)LMMEngineHuggingFace(334-367)LMMEngineOpenAI(19-57)LMMEngineOpenRouter(181-221)LMMEngineParasail(370-405)LMMEnginevLLM(285-331)LMMEngineGemini(138-178)generate_with_thinking(111-135)generate(35-57)generate(74-105)generate(153-178)generate(196-221)generate(249-282)generate(300-331)generate(344-367)generate(382-405)
gui_agents/s3/utils/common_utils.py (3)
gui_agents/s3/memory/procedural_memory.py (1)
PROCEDURAL_MEMORY(5-367)gui_agents/s3/agents/grounding.py (1)
assign_screenshot(327-328)gui_agents/s3/core/mllm.py (1)
get_response(274-305)
gui_agents/s3/bbon/behavior_narrator.py (3)
gui_agents/s3/core/mllm.py (1)
LMMAgent(17-305)gui_agents/s3/memory/procedural_memory.py (1)
PROCEDURAL_MEMORY(5-367)gui_agents/s3/utils/common_utils.py (3)
call_llm_formatted(55-106)split_thinking_response(109-119)compress_image(156-172)
gui_agents/s3/cli_app.py (2)
gui_agents/s3/agents/grounding.py (1)
OSWorldACI(180-646)gui_agents/s3/agents/agent_s.py (5)
AgentS3(48-102)predict(35-45)predict(87-102)reset(31-33)reset(77-85)
osworld_setup/s3/bbon/utils.py (1)
gui_agents/s3/bbon/comparative_judge.py (2)
image_to_openai_message_format(31-55)get_final_screenshot_file(10-28)
gui_agents/s3/agents/grounding.py (5)
gui_agents/s3/memory/procedural_memory.py (1)
PROCEDURAL_MEMORY(5-367)gui_agents/s3/core/mllm.py (3)
LMMAgent(17-305)reset(60-69)add_message(112-272)gui_agents/s3/utils/common_utils.py (1)
call_llm_safe(32-53)gui_agents/s3/agents/code_agent.py (3)
CodeAgent(87-276)reset(102-108)execute(110-221)gui_agents/s3/agents/worker.py (1)
reset(62-79)
🪛 Ruff (0.13.2)
osworld_setup/s3/run_local.py
266-266: Do not catch blind exception: Exception
(BLE001)
267-267: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
322-322: Unused function argument: total_file_json
(ARG001)
346-346: Do not use bare except
(E722)
gui_agents/s3/core/module.py
11-11: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
gui_agents/s3/utils/formatters.py
9-9: Do not assign a lambda expression, use a def
Rewrite single_action_check as a def
(E731)
11-13: Do not assign a lambda expression, use a def
Rewrite SINGLE_ACTION_FORMATTER as a def
(E731)
19-19: Do not catch blind exception: Exception
(BLE001)
19-19: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
21-21: Do not assign a lambda expression, use a def
Rewrite code_valid_check as a def
(E731)
23-25: Do not assign a lambda expression, use a def
Rewrite CODE_VALID_FORMATTER as a def
(E731)
27-27: Do not assign a lambda expression, use a def
Rewrite thoughts_answer_tag_check as a def
(E731)
29-31: Do not assign a lambda expression, use a def
Rewrite THOUGHTS_ANSWER_TAG_FORMATTER as a def
(E731)
33-33: Do not assign a lambda expression, use a def
Rewrite integer_answer_check as a def
(E731)
35-37: Do not assign a lambda expression, use a def
Rewrite INTEGER_ANSWER_FORMATTER as a def
(E731)
gui_agents/s3/bbon/comparative_judge.py
3-3: Redefinition of unused List from line 3
Remove definition: List
(F811)
24-24: Do not use bare except
(E722)
52-52: Consider moving this statement to an else block
(TRY300)
53-53: Do not catch blind exception: Exception
(BLE001)
osworld_setup/s3/lib_run_single.py
6-6: from typing import * used; unable to detect undefined names
(F403)
7-7: from wrapt_timeout_decorator import * used; unable to detect undefined names
(F403)
18-18: Do not catch blind exception: Exception
(BLE001)
18-18: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
25-25: f-string without any placeholders
Remove extraneous f prefix
(F541)
osworld_setup/s3/bbon/run_judge.py
106-106: Loop control variable ans not used within loop body
Rename unused ans to _ans
(B007)
106-106: Loop control variable thoughts not used within loop body
Rename unused thoughts to _thoughts
(B007)
114-114: Unpacked variable maximum_gain is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
128-128: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
145-145: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
145-145: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
145-145: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
145-145: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
145-145: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
gui_agents/s3/agents/code_agent.py
42-42: Consider moving this statement to an else block
(TRY300)
44-44: Do not catch blind exception: Exception
(BLE001)
45-45: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Do not catch blind exception: Exception
(BLE001)
273-273: Use explicit conversion flag
Replace with conversion flag
(RUF010)
274-274: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
gui_agents/s3/agents/worker.py
95-95: Multiple statements on one line (colon)
(E701)
196-196: f-string without any placeholders
Remove extraneous f prefix
(F541)
203-203: f-string without any placeholders
Remove extraneous f prefix
(F541)
252-252: f-string without any placeholders
Remove extraneous f prefix
(F541)
260-260: Do not catch blind exception: Exception
(BLE001)
261-261: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
264-264: f-string without any placeholders
Remove extraneous f prefix
(F541)
271-271: f-string without any placeholders
Remove extraneous f prefix
(F541)
320-320: Do not catch blind exception: Exception
(BLE001)
321-321: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
osworld_setup/s3/bbon/generate_facts.py
25-25: Do not use bare except
(E722)
25-26: try-except-pass detected, consider logging the exception
(S110)
29-29: Avoid specifying long messages outside the exception class
(TRY003)
37-37: Do not catch blind exception: Exception
(BLE001)
38-38: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
38-38: Create your own exception
(TRY002)
38-38: Avoid specifying long messages outside the exception class
(TRY003)
61-61: Do not use bare except
(E722)
77-77: Do not use bare except
(E722)
77-78: try-except-pass detected, consider logging the exception
(S110)
95-95: Do not catch blind exception: Exception
(BLE001)
osworld_setup/s3/run.py
71-71: Do not catch blind exception: Exception
(BLE001)
72-72: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
88-88: Do not catch blind exception: Exception
(BLE001)
89-89: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
125-125: Do not catch blind exception: Exception
(BLE001)
158-158: Do not catch blind exception: Exception
(BLE001)
160-160: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
161-161: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
166-166: Do not catch blind exception: Exception
(BLE001)
167-167: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
175-175: Do not catch blind exception: Exception
(BLE001)
176-176: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
178-178: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
179-179: Do not catch blind exception: Exception
(BLE001)
180-180: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
182-182: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
189-189: Do not catch blind exception: Exception
(BLE001)
190-190: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
192-192: Unused function argument: frame
(ARG001)
200-200: f-string without any placeholders
Remove extraneous f prefix
(F541)
202-202: f-string without any placeholders
Remove extraneous f prefix
(F541)
203-203: Do not catch blind exception: Exception
(BLE001)
204-204: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
210-210: Do not catch blind exception: Exception
(BLE001)
211-211: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
219-219: Do not catch blind exception: Exception
(BLE001)
220-220: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
395-395: Do not catch blind exception: Exception
(BLE001)
396-396: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
439-439: Unused function argument: total_file_json
(ARG001)
463-463: Do not use bare except
(E722)
gui_agents/s3/core/mllm.py
39-39: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Prefer TypeError exception for invalid type
(TRY004)
272-272: Avoid specifying long messages outside the exception class
(TRY003)
gui_agents/s3/utils/common_utils.py
29-29: Use of possibly insecure function; consider using ast.literal_eval
(S307)
47-47: Do not catch blind exception: Exception
(BLE001)
117-117: Consider moving this statement to an else block
(TRY300)
118-118: Do not catch blind exception: Exception
(BLE001)
118-118: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
156-156: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
gui_agents/s3/bbon/behavior_narrator.py
46-46: Function definition does not bind loop variable width
(B023)
48-48: Function definition does not bind loop variable height
(B023)
50-50: Function definition does not bind loop variable width
(B023)
52-52: Function definition does not bind loop variable height
(B023)
54-54: Function definition does not bind loop variable width
(B023)
54-54: Function definition does not bind loop variable height
(B023)
81-81: Avoid specifying long messages outside the exception class
(TRY003)
gui_agents/s3/cli_app.py
41-41: Do not use bare except
(E722)
44-44: Unused function argument: signum
(ARG001)
44-44: Unused function argument: frame
(ARG001)
130-130: Starting a process with a shell, possible injection detected
(S605)
135-135: Starting a process with a shell, possible injection detected
(S605)
153-153: Local variable subtask_traj is assigned to but never used
Remove assignment to unused variable subtask_traj
(F841)
182-182: Starting a process with a shell, possible injection detected
(S605)
183-183: f-string without any placeholders
Remove extraneous f prefix
(F541)
186-186: Starting a process with a shell, possible injection detected
(S605)
187-187: f-string without any placeholders
Remove extraneous f prefix
(F541)
209-209: Use of exec detected
(S102)
gui_agents/s3/memory/procedural_memory.py
16-63: f-string without any placeholders
Remove extraneous f prefix
(F541)
osworld_setup/s3/bbon/utils.py
9-9: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
38-38: Consider moving this statement to an else block
(TRY300)
40-40: Do not catch blind exception: Exception
(BLE001)
94-94: Consider moving this statement to an else block
(TRY300)
96-96: Do not catch blind exception: Exception
(BLE001)
149-149: Do not catch blind exception: Exception
(BLE001)
254-254: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
274-274: Local variable results is assigned to but never used
Remove assignment to unused variable results
(F841)
gui_agents/s3/agents/grounding.py
31-48: f-string without any placeholders
Remove extraneous f prefix
(F541)
190-190: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
350-350: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
365-365: Use explicit conversion flag
Replace with conversion flag
(RUF010)
366-366: Use explicit conversion flag
Replace with conversion flag
(RUF010)
368-368: Use explicit conversion flag
Replace with conversion flag
(RUF010)
379-379: Use explicit conversion flag
Replace with conversion flag
(RUF010)
383-383: Use explicit conversion flag
Replace with conversion flag
(RUF010)
385-385: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
394-394: Use explicit conversion flag
Replace with conversion flag
(RUF010)
396-396: Use explicit conversion flag
Replace with conversion flag
(RUF010)
431-431: Use explicit conversion flag
Replace with conversion flag
(RUF010)
440-440: Use explicit conversion flag
Replace with conversion flag
(RUF010)
441-441: Use explicit conversion flag
Replace with conversion flag
(RUF010)
444-444: Use explicit conversion flag
Replace with conversion flag
(RUF010)
461-461: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
479-479: Use explicit conversion flag
Replace with conversion flag
(RUF010)
482-482: Use explicit conversion flag
Replace with conversion flag
(RUF010)
524-524: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
621-621: Use explicit conversion flag
Replace with conversion flag
(RUF010)
624-624: Use explicit conversion flag
Replace with conversion flag
(RUF010)
gui_agents/s3/core/engine.py
9-9: Redefinition of unused AzureOpenAI from line 6
Remove definition: AzureOpenAI
(F811)
21-21: Unused method argument: kwargs
(ARG002)
35-35: Unused method argument: max_new_tokens
(ARG002)
38-40: Avoid specifying long messages outside the exception class
(TRY003)
62-62: Unused method argument: base_url
(ARG002)
62-62: Unused method argument: kwargs
(ARG002)
77-79: Avoid specifying long messages outside the exception class
(TRY003)
92-92: Local variable thoughts is assigned to but never used
Remove assignment to unused variable thoughts
(F841)
112-112: Unused method argument: temperature
(ARG002)
112-112: Unused method argument: max_new_tokens
(ARG002)
117-119: Avoid specifying long messages outside the exception class
(TRY003)
140-140: Unused method argument: kwargs
(ARG002)
156-158: Avoid specifying long messages outside the exception class
(TRY003)
161-163: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Unused method argument: kwargs
(ARG002)
199-201: Avoid specifying long messages outside the exception class
(TRY003)
204-206: Avoid specifying long messages outside the exception class
(TRY003)
227-227: Unused method argument: base_url
(ARG002)
234-234: Unused method argument: kwargs
(ARG002)
252-254: Avoid specifying long messages outside the exception class
(TRY003)
257-259: Avoid specifying long messages outside the exception class
(TRY003)
262-264: Avoid specifying long messages outside the exception class
(TRY003)
287-287: Unused method argument: kwargs
(ARG002)
307-307: Unused method argument: kwargs
(ARG002)
311-313: Avoid specifying long messages outside the exception class
(TRY003)
316-318: Avoid specifying long messages outside the exception class
(TRY003)
335-335: Unused method argument: kwargs
(ARG002)
347-349: Avoid specifying long messages outside the exception class
(TRY003)
352-354: Avoid specifying long messages outside the exception class
(TRY003)
371-371: Unused method argument: kwargs
(ARG002)
385-387: Avoid specifying long messages outside the exception class
(TRY003)
390-392: Avoid specifying long messages outside the exception class
(TRY003)
🪛 Shellcheck (0.11.0)
osworld_setup/s3/run.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
| action_upper = action.upper().strip() | ||
| if action_upper == "DONE": | ||
| logger.info(f"Step {step_count + 1}: Task completed successfully") | ||
| completion_reason = "DONE" | ||
| break | ||
| elif action_upper == "FAIL": | ||
| logger.info(f"Step {step_count + 1}: Task failed by agent request") | ||
| completion_reason = "FAIL" | ||
| break | ||
|
|
||
| # Extract and execute code | ||
| code_type, code = extract_code_block(action) | ||
|
|
||
| if code: | ||
| result = execute_code(code_type, code, env_controller) | ||
| # Prepare formatted output and error for logging | ||
| output = result.get("output", "") | ||
| error = result.get("error", "") | ||
| message = result.get("message", "") | ||
| status = result.get("status", "") | ||
|
|
||
| log_lines = [ | ||
| f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:", | ||
| f"Status: {status}" if status else None, | ||
| ] | ||
|
|
||
| if output: | ||
| log_lines.append("Output:\n" + ("-" * 40) + f"\n{output}\n" + ("-" * 40)) | ||
| if error: | ||
| log_lines.append("Error:\n" + ("!" * 40) + f"\n{error}\n" + ("!" * 40)) | ||
| if message and not output and not error: | ||
| log_lines.append("Message:\n" + ("-" * 40) + f"\n{message}\n" + ("-" * 40)) | ||
|
|
||
| # Remove None entries and join | ||
| formatted_log = "\n".join([line for line in log_lines if line]) | ||
| logger.info(formatted_log) | ||
| else: | ||
| logger.warning(f"Step {step_count + 1}: No code block found in action") | ||
| result = {"status": "skipped", "message": "No code block found"} | ||
| logger.info( | ||
| f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:\n" | ||
| f"Status: skipped\n" | ||
| f"Message:\n{'-' * 40}\n{result['message']}\n{'-' * 40}" | ||
| ) | ||
| # Add assistant's thoughts and code to message history | ||
| self.agent.add_message(response, role="assistant") | ||
|
|
||
| # Process result and add formatted environment results as user message | ||
| result_context = format_result(result, step_count) | ||
| self.agent.add_message(result_context, role="user") | ||
|
|
||
| step_count += 1 | ||
|
|
||
| # Handle budget exhaustion | ||
| if 'completion_reason' not in locals(): | ||
| logger.info(f"Budget exhausted after {step_count} steps") | ||
| completion_reason = f"BUDGET_EXHAUSTED_AFTER_{step_count}_STEPS" | ||
|
|
||
| # Generate final summary | ||
| logger.info("Generating execution summary") | ||
| summary = self._generate_summary(execution_history, task_instruction) | ||
|
|
||
| result = { | ||
| "task_instruction": task_instruction, | ||
| "completion_reason": completion_reason, | ||
| "summary": summary, | ||
| "execution_history": execution_history, | ||
| "steps_executed": step_count, | ||
| "budget": self.budget | ||
| } |
There was a problem hiding this comment.
Count DONE/FAIL turns before breaking.
When the model returns DONE/FAIL we break out before step_count is incremented, so the result on Line 216 reports steps_executed = 0 even though we consumed a turn and appended history. That skews cost tracking and downstream logic that relies on the right step count. Increment the counter before each early break (or derive the value from execution_history). For example:
if action_upper == "DONE":
logger.info(f"Step {step_count + 1}: Task completed successfully")
completion_reason = "DONE"
+ step_count += 1
break
elif action_upper == "FAIL":
logger.info(f"Step {step_count + 1}: Task failed by agent request")
completion_reason = "FAIL"
+ step_count += 1
breakWith this change the steps_executed field and final log lines reflect the actual number of turns processed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| action_upper = action.upper().strip() | |
| if action_upper == "DONE": | |
| logger.info(f"Step {step_count + 1}: Task completed successfully") | |
| completion_reason = "DONE" | |
| break | |
| elif action_upper == "FAIL": | |
| logger.info(f"Step {step_count + 1}: Task failed by agent request") | |
| completion_reason = "FAIL" | |
| break | |
| # Extract and execute code | |
| code_type, code = extract_code_block(action) | |
| if code: | |
| result = execute_code(code_type, code, env_controller) | |
| # Prepare formatted output and error for logging | |
| output = result.get("output", "") | |
| error = result.get("error", "") | |
| message = result.get("message", "") | |
| status = result.get("status", "") | |
| log_lines = [ | |
| f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:", | |
| f"Status: {status}" if status else None, | |
| ] | |
| if output: | |
| log_lines.append("Output:\n" + ("-" * 40) + f"\n{output}\n" + ("-" * 40)) | |
| if error: | |
| log_lines.append("Error:\n" + ("!" * 40) + f"\n{error}\n" + ("!" * 40)) | |
| if message and not output and not error: | |
| log_lines.append("Message:\n" + ("-" * 40) + f"\n{message}\n" + ("-" * 40)) | |
| # Remove None entries and join | |
| formatted_log = "\n".join([line for line in log_lines if line]) | |
| logger.info(formatted_log) | |
| else: | |
| logger.warning(f"Step {step_count + 1}: No code block found in action") | |
| result = {"status": "skipped", "message": "No code block found"} | |
| logger.info( | |
| f"CODING_AGENT_EXECUTION_RESULT - Step {step_count + 1}:\n" | |
| f"Status: skipped\n" | |
| f"Message:\n{'-' * 40}\n{result['message']}\n{'-' * 40}" | |
| ) | |
| # Add assistant's thoughts and code to message history | |
| self.agent.add_message(response, role="assistant") | |
| # Process result and add formatted environment results as user message | |
| result_context = format_result(result, step_count) | |
| self.agent.add_message(result_context, role="user") | |
| step_count += 1 | |
| # Handle budget exhaustion | |
| if 'completion_reason' not in locals(): | |
| logger.info(f"Budget exhausted after {step_count} steps") | |
| completion_reason = f"BUDGET_EXHAUSTED_AFTER_{step_count}_STEPS" | |
| # Generate final summary | |
| logger.info("Generating execution summary") | |
| summary = self._generate_summary(execution_history, task_instruction) | |
| result = { | |
| "task_instruction": task_instruction, | |
| "completion_reason": completion_reason, | |
| "summary": summary, | |
| "execution_history": execution_history, | |
| "steps_executed": step_count, | |
| "budget": self.budget | |
| } | |
| action_upper = action.upper().strip() | |
| if action_upper == "DONE": | |
| logger.info(f"Step {step_count + 1}: Task completed successfully") | |
| completion_reason = "DONE" | |
| step_count += 1 | |
| break | |
| elif action_upper == "FAIL": | |
| logger.info(f"Step {step_count + 1}: Task failed by agent request") | |
| completion_reason = "FAIL" | |
| step_count += 1 | |
| break |
🤖 Prompt for AI Agents
In gui_agents/s3/agents/code_agent.py around lines 149 to 218, the loop breaks
on agent actions "DONE" or "FAIL" before step_count is incremented, causing
steps_executed to undercount; increment step_count immediately before each early
break (for both DONE and FAIL) so the counter reflects the consumed turn, keep
completion_reason assignment and break as-is, and ensure any logs that reference
step_count use the incremented value; alternatively derive steps_executed from
len(execution_history) when building the final result if you prefer not to
mutate step_count mid-loop.
| import re | ||
| from collections import defaultdict | ||
| from io import BytesIO | ||
| from typing import Any, Dict, List, Optional, Tuple | ||
|
|
||
| import pytesseract | ||
| from PIL import Image | ||
| from pytesseract import Output |
There was a problem hiding this comment.
Decode the base64 screenshot before handing it to Pillow/Tesseract.
obs["screenshot"] is base64-encoded in this pipeline. Passing that string straight into BytesIO raises TypeError: a bytes-like object is required. Decode (and handle already-bytes input) before calling Image.open, and import base64 for it.
-import re
+import base64
+import re
@@
- image = Image.open(BytesIO(b64_image_data))
+ if isinstance(b64_image_data, str):
+ image_bytes = base64.b64decode(b64_image_data)
+ else:
+ image_bytes = b64_image_data
+ image = Image.open(BytesIO(image_bytes))Also applies to: 247-250
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 2 to 9 (and likewise at lines
247-250), the code passes obs["screenshot"] directly into BytesIO which fails
when the screenshot is a base64-encoded string; update the import list to
include base64, detect if obs["screenshot"] is a str and decode it with
base64.b64decode (otherwise accept bytes as-is), then pass the resulting bytes
into BytesIO before calling Image.open and pytesseract; ensure the code handles
decoding errors gracefully (e.g., try/except) and keeps existing behavior when
input is already bytes.
| UBUNTU_APP_SETUP = f"""import subprocess; | ||
| import difflib; | ||
| import pyautogui; | ||
| pyautogui.press('escape'); | ||
| time.sleep(0.5); | ||
| output = subprocess.check_output(['wmctrl', '-lx']); | ||
| output = output.decode('utf-8').splitlines(); | ||
| window_titles = [line.split(None, 4)[2] for line in output]; | ||
| closest_matches = difflib.get_close_matches('APP_NAME', window_titles, n=1, cutoff=0.1); | ||
| if closest_matches: | ||
| closest_match = closest_matches[0]; | ||
| for line in output: | ||
| if closest_match in line: | ||
| window_id = line.split()[0] | ||
| break; | ||
| subprocess.run(['wmctrl', '-ia', window_id]) | ||
| subprocess.run(['wmctrl', '-ir', window_id, '-b', 'add,maximized_vert,maximized_horz']) | ||
| """ |
There was a problem hiding this comment.
Add missing time import in Ubuntu app switcher snippet.
UBUNTU_APP_SETUP executes time.sleep(...) but never imports time, so the generated snippet throws NameError as soon as it runs. Add import time; alongside the other imports in the template.
-UBUNTU_APP_SETUP = """import subprocess;
-import difflib;
-import pyautogui;
+UBUNTU_APP_SETUP = """import subprocess;
+import difflib;
+import pyautogui;
+import time;
pyautogui.press('escape');Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.13.2)
31-48: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 31 to 48, the UBUNTU_APP_SETUP
template calls time.sleep(...) but never imports the time module; update the
template to add "import time;" alongside the existing imports at the top of the
f-string so the generated snippet defines time before use and avoids a NameError
at runtime.
| subprocess.run( | ||
| 'echo \"osworld-public-evaluation\" | sudo -S ss --kill --tcp state TIME-WAIT sport = :2002', | ||
| shell=True, | ||
| check=True, | ||
| text=True, | ||
| capture_output=True | ||
| ) |
There was a problem hiding this comment.
Remove hard-coded sudo credentials from generated commands.
Both command templates bake in the root password ("osworld-public-evaluation") and feed it to sudo -S. That is a hard-coded credential in source control and a direct compliance/security violation. Please refactor these paths to elevate without embedding secrets (e.g. rely on pre-configured permissions or prompt the environment securely).
Also applies to: 418-420
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 106-112 (and similarly at
418-420), remove the hard-coded sudo password embedded in the command; do not
echo a password into sudo -S. Instead, run the command without embedding
credentials (preferably as a list argument to subprocess.run and without
shell=True), or check for elevated privileges first (e.g., os.geteuid() == 0)
and raise a clear error asking the caller to run the process with appropriate
permissions; alternatively, if interactive elevation is required, obtain the
password securely (e.g., getpass.getpass or a protected env/config source) but
avoid storing secrets in code. Also replace shell=True usage with a safe list
form and preserve check=True/capture_output/text behavior; apply the same
changes to the occurrence at lines 418-420.
| if len(numericals) > 0: | ||
| text_id = int(numericals[-1]) | ||
| else: | ||
| text_id = 0 | ||
| elem = ocr_elements[text_id] | ||
|
|
There was a problem hiding this comment.
Guard against out-of-range word ids from the text span agent.
If the LLM returns an id outside ocr_elements, this currently raises IndexError and crashes the action. Validate the id (and the OCR result) before indexing so we can surface a controlled failure.
if len(numericals) > 0:
text_id = int(numericals[-1])
else:
text_id = 0
- elem = ocr_elements[text_id]
+ if not ocr_elements:
+ raise RuntimeError("OCR returned no tokens to ground against.")
+ if text_id < 0 or text_id >= len(ocr_elements):
+ raise ValueError(f"Grounding LLM produced invalid word id {text_id}; valid range is 0-{len(ocr_elements) - 1}.")
+ elem = ocr_elements[text_id]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if len(numericals) > 0: | |
| text_id = int(numericals[-1]) | |
| else: | |
| text_id = 0 | |
| elem = ocr_elements[text_id] | |
| if len(numericals) > 0: | |
| text_id = int(numericals[-1]) | |
| else: | |
| text_id = 0 | |
| if not ocr_elements: | |
| raise RuntimeError("OCR returned no tokens to ground against.") | |
| if text_id < 0 or text_id >= len(ocr_elements): | |
| raise ValueError(f"Grounding LLM produced invalid word id {text_id}; valid range is 0-{len(ocr_elements) - 1}.") | |
| elem = ocr_elements[text_id] |
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 309 to 314, the code assumes
the numeric id parsed from the LLM is valid and indexes ocr_elements directly,
which can raise IndexError; validate that numericals is non-empty, parse the
last numeric safely, ensure it is within 0 <= text_id < len(ocr_elements) before
using it, and if it is missing or out-of-range return or raise a controlled
error (with a clear message and optional logging) instead of indexing into
ocr_elements.
| result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action) | ||
| result["screenshot_num"] = i + 1 |
There was a problem hiding this comment.
Pass the required screenshot_num to BehaviorNarrator.judge.
BehaviorNarrator.judge expects (screenshot_num, before_img_bytes, after_img_bytes, pyautogui_action) (see behavior_narrator.py). The current asyncio.to_thread call omits the first argument, so it raises TypeError and stops caption generation. Forward the step index before the image bytes.
- result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action)
+ screenshot_num = i + 1
+ result = await asyncio.to_thread(
+ judge.judge,
+ screenshot_num,
+ before_bytes,
+ after_bytes,
+ pyautogui_action,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result = await asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action) | |
| result["screenshot_num"] = i + 1 | |
| screenshot_num = i + 1 | |
| result = await asyncio.to_thread( | |
| judge.judge, | |
| screenshot_num, | |
| before_bytes, | |
| after_bytes, | |
| pyautogui_action, | |
| ) | |
| result["screenshot_num"] = i + 1 |
🤖 Prompt for AI Agents
In osworld_setup/s3/bbon/generate_facts.py around lines 41-42, the call to
asyncio.to_thread(judge.judge, before_bytes, after_bytes, pyautogui_action) is
missing the required first parameter screenshot_num; update the to_thread
invocation to pass the step index (i + 1) as the first argument — e.g.
asyncio.to_thread(judge.judge, i + 1, before_bytes, after_bytes,
pyautogui_action) — and you can keep setting result["screenshot_num"] = i + 1
afterwards.
| try: | ||
| agent.reset(runtime_logger) | ||
| except Exception as e: | ||
| agent.reset() |
There was a problem hiding this comment.
Call AgentS3.reset without extra arguments
AgentS3.reset (via UIAgent.reset) only accepts self, so passing runtime_logger raises TypeError before the episode even starts. The fallback in the except block hits the same method but also without the logger, so nothing relies on that argument anyway. Just call reset() bare.
- try:
- agent.reset(runtime_logger)
+ try:
+ agent.reset()
except Exception as e:
agent.reset()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| agent.reset(runtime_logger) | |
| except Exception as e: | |
| agent.reset() | |
| try: | |
| agent.reset() | |
| except Exception as e: | |
| agent.reset() |
🧰 Tools
🪛 Ruff (0.13.2)
18-18: Do not catch blind exception: Exception
(BLE001)
18-18: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
🤖 Prompt for AI Agents
In osworld_setup/s3/lib_run_single.py around lines 16-19, AgentS3.reset only
accepts self so passing runtime_logger causes a TypeError; remove the extra
argument and call agent.reset() without parameters (and eliminate the redundant
except block), i.e. replace the try/except that calls
agent.reset(runtime_logger) with a single bare agent.reset() call.
| @@ -0,0 +1,54 @@ | |||
| # Deplying Agent S3 in OSWorld | |||
There was a problem hiding this comment.
Fix typo in heading.
Opening header should read “Deploying Agent S3 in OSWorld.”
🤖 Prompt for AI Agents
In osworld_setup/s3/OSWorld.md around lines 1 to 1, the header contains a typo
"Deplying Agent S3 in OSWorld"; change the heading to "Deploying Agent S3 in
OSWorld" by correcting the misspelled word so the opening header reads exactly
that.
| file_handler = logging.FileHandler( | ||
| os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8" | ||
| ) | ||
| debug_handler = logging.FileHandler( | ||
| os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8" | ||
| ) | ||
| stdout_handler = logging.StreamHandler(sys.stdout) | ||
| sdebug_handler = logging.FileHandler( | ||
| os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8" | ||
| ) |
There was a problem hiding this comment.
Ensure the logs/ directory exists before adding file handlers
Right now the first logging.FileHandler instantiation will raise FileNotFoundError whenever logs/ hasn't been created yet, so the runner dies before doing anything useful. Please create the directory before wiring up the handlers.
datetime_str: str = datetime.datetime.now().strftime("%Y%m%d@%H%M%S")
+os.makedirs("logs", exist_ok=True)
+
file_handler = logging.FileHandler(
os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8"
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| file_handler = logging.FileHandler( | |
| os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) | |
| debug_handler = logging.FileHandler( | |
| os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) | |
| stdout_handler = logging.StreamHandler(sys.stdout) | |
| sdebug_handler = logging.FileHandler( | |
| os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) | |
| datetime_str: str = datetime.datetime.now().strftime("%Y%m%d@%H%M%S") | |
| os.makedirs("logs", exist_ok=True) | |
| file_handler = logging.FileHandler( | |
| os.path.join("logs", "normal-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) | |
| debug_handler = logging.FileHandler( | |
| os.path.join("logs", "debug-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) | |
| stdout_handler = logging.StreamHandler(sys.stdout) | |
| sdebug_handler = logging.FileHandler( | |
| os.path.join("logs", "sdebug-{:}.log".format(datetime_str)), encoding="utf-8" | |
| ) |
🤖 Prompt for AI Agents
In osworld_setup/s3/run_local.py around lines 30 to 39, the code creates
logging.FileHandler instances for the logs/ directory without ensuring that
directory exists, which will raise FileNotFoundError; before instantiating any
FileHandler, create the logs directory (e.g., os.makedirs("logs",
exist_ok=True)) and handle potential permissions errors, then proceed to
construct the FileHandler objects so handlers can open files successfully.
| env.close() | ||
| logger.info(f"Average score: {sum(scores) / len(scores)}") | ||
|
|
There was a problem hiding this comment.
Guard against empty scores before computing the average
When every example is skipped or fails before recording a score, len(scores) stays at zero and this line raises ZeroDivisionError, terminating the script right at shutdown. Handle the empty case explicitly.
- logger.info(f"Average score: {sum(scores) / len(scores)}")
+ if scores:
+ logger.info("Average score: %.2f", sum(scores) / len(scores))
+ else:
+ logger.warning("Average score unavailable: no successful runs recorded.")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| env.close() | |
| logger.info(f"Average score: {sum(scores) / len(scores)}") | |
| env.close() | |
| if scores: | |
| logger.info("Average score: %.2f", sum(scores) / len(scores)) | |
| else: | |
| logger.warning("Average score unavailable: no successful runs recorded.") |
🤖 Prompt for AI Agents
In osworld_setup/s3/run_local.py around lines 281-283, computing
sum(scores)/len(scores) can raise ZeroDivisionError when scores is empty; guard
by checking if scores before dividing and log a clear message or default value.
Modify the shutdown code to: after env.close(), if scores: compute and log the
average; else log "No scores recorded" (or log average as 0/None) so the script
does not crash on shutdown.
Summary by CodeRabbit
New Features
Documentation
Chores